UBC Theses and Dissertations

The buffered fat tree switch for ATM networks. Al-Junaidi, Husam (1998)

The Buffered Fat Tree Switch for ATM Networks

by

Husam Al-Junaidi
B.Sc., The University of Petroleum and Minerals, 1992

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE STUDIES (Department of Electrical and Computer Engineering)

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
October 7, 1997
© H. M. Al-Junaidi, 1997

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Electrical and Computer Engineering
The University of British Columbia
Vancouver, Canada

ABSTRACT

The development of ATM (Asynchronous Transfer Mode) switches is one of the main tasks required to implement B-ISDN (Broadband Integrated Services Digital Network). This thesis proposes a general class of scalable ATM switches based on buffered tree structures. A distinguishing feature of the proposed switch is that it has been designed to handle nonuniform as well as uniform traffic robustly while fully utilizing switch resources (buffers and bandwidth). The buffer size and bandwidth of each stage of the switch are specified by parameters which can be computed to optimize the switch with respect to cost, utilization, cell loss and total delay. The thesis also develops a discrete-time approximation model for analyzing the performance of the proposed switch. In particular, the analysis determines the influence of various design parameters on optimizing the switch performance.
Similar to pure output buffered switches with fully interconnected fabrics, the BFT switch can easily achieve a throughput of 100 percent, at much less complexity. The analysis shows that the delay and cell loss probability of the switch can be greatly improved by applying cut-through routing. It also shows that more buffers are required by the lower stages of the switch to achieve a desired cell loss probability, and that this reduces the coupling of input traffic. Simulation was used to further study the behavior of the buffers in the switch under uniform and bursty traffic.

Keywords: ATM, switch architecture, performance analysis, tree-based switches.

Contents

ABSTRACT
List of Tables
List of Figures
Chapter 1 Introduction
  1.1 ATM Evolution
  1.2 Abstract Model of an ATM Switch
  1.3 Requirements of the Ideal ATM Switch
  1.4 Thesis Outline
Chapter 2 Topologies and Queueing Design of ATM Switches
  2.1 Switching Fabrics Classification
  2.2 ATM Queueing Schemes
  2.3 Examples
Chapter 3 The Buffered Fat Tree
  3.1 Motivation and Evolution of the New Architecture
  3.2 The Conceptual Model of BFT Switch
  3.3 BFT Structure
  3.4 Switch Fault Tolerance
  3.5 Routing, Broadcasting and Multicasting within the Switch
  3.6 Switch Expandability and Modularity
  3.7 Comparison with Other Switches
Chapter 4 Performance Modeling
  4.1 Analytical Model
  4.2 Performance Discussion
Chapter 5 Simulation Results
  5.1 Simulation vs Analytical Model
  5.2 Backpressure vs Dropping Techniques
  5.3 Behavior of Buffers in the Switch
Chapter 6 Conclusion
Bibliography

List of Tables

Table 1 Terms Definition
Table 2 Required Buffers for Each Stage in a 64x64 BFT (Example)
Table 3 Delay of Different Stages in a 64x64 BFT (Example)
Table 4 Arrangement of Buffers in a 64x64 Switch (Example)
Table 5 Different Approaches for Placing Buffers in BFT

List of Figures

Figure 1. Generic ATM Switch
Figure 2. Time Division Switch Fabrics
Figure 3. Shared Buffer Memory Switch
Figure 4. Examples of Crossbar Switch Fabrics
Figure 5. ATM Queuing Techniques
Figure 6. Utilization of Resources in the Banyan Network
Figure 7. Path Stream for Port 1 in Banyan
Figure 8. Re-arranging the Buffers and Links of the Path Stream
Figure 9. Customizing the Path Stream
Figure 10. Combining Path Streams
Figure 11. Switching and Buffering Nodes in the BFT
Figure 12. General Structure of BFT Node
Figure 13. Internal Organization of a Shared-Buffer that Preserves the Cell Arrival Sequence
Figure 14. Packing Ranked Cells on a Butterfly Network
Figure 15. Packing and Shifting Ranked Cells on a Butterfly Network
Figure 16. Cell Header
Figure 17. Routing Control Logic
Figure 18. Difference between Multinet and BFT
Figure 19. Mathematical Model of Buffering Stage in the BFT
Figure 20. A Simplified Queuing Model for the BFT
Figure 21. Markov Chain when i < R
Figure 22. Markov Chain when i > R
Figure 23. Effect of BFT Degree on Cell Loss Probability (without Cut-through, load=0.9)
Figure 24. Effect of BFT Degree on Delay under Different Input Loads (without Cut-through)
Figure 25. Cell Loss in Different BFT Stages (64x64, degree=2, stages=6, without Cut-through, load=0.9)
Figure 26. Delay in Different BFT Stages (64x64, degree=2, stages=6, without Cut-through, load=0.9)
Figure 27. Effect of Cut-through on Cell Loss Probability under Different Input Loads (64x64, degree=2, stages=6)
Figure 28. Effect of Cut-through on Delay under Different Input Loads (64x64, degree=2, stages=6)
Figure 29. Effect of Buffer Distribution on Cell Loss under Different Input Loads (64x64, degree=2, stages=6, without Cut-through)
Figure 30. Effect of Buffer Distribution on Delay under Different Input Loads (64x64, degree=2, stages=6, without Cut-through)
Figure 31. Markov Chain for a Bursty Source
Figure 32. Simulation vs Analytical Model for Cell Loss Probability (64x64, degree=2, stages=6, without Cut-through)
Figure 33. Simulation vs Analytical Model for Delay (64x64, degree=2, stages=6, without Cut-through)
Figure 34. Difference between the Average and the Maximum Reached Buffer Length per Node (64x64, degree=2, stages=6, uniform traffic of load 0.72)
Figure 35. Difference between the Average and the Maximum Reached Buffer Length per Stage (64x64, degree=2, stages=6, uniform traffic with load 0.72)
Figure 36. Delay in Different Stages of the BFT (64x64, degree=2, stages=6, uniform traffic of load 0.72)
Figure 37. The Difference between the Average and the Maximum Buffer Length Reached per Node (64x64, degree=2, stages=6, bursty traffic of 0.72 average arrival rate)
Figure 38. The Difference between the Average and the Maximum Buffer Length Reached per Stage (64x64, degree=2, stages=6, bursty traffic of 0.72 average arrival rate)
Figure 39. Delay in Different Stages of BFT (64x64, degree=2, stages=6, bursty traffic of 0.72 average arrival rate)

Chapter 1. Introduction

This thesis presents the design and performance analysis of a novel universal ATM switching architecture based on buffered tree structures. The importance of the proposed switch stems from two facts. First, it represents a general class of scalable ATM switches whose fabrics employ a tree topology. Its definition incorporates other types of switches, ranging from those with pure output buffering to those with shared memory. Therefore, it can effectively be used as a model for studying the impact of different design parameters on the performance of ATM switches. Second, the proposed architecture has many interesting features that make it highly suited to ATM switching. It is characterized by full buffer utilization as well as high bandwidth utilization, both achieved with optimal delay-throughput performance.
It is a self-routing switch with no central controller to manage its routing functions. Therefore, it can operate at high speeds to support the demands of future B-ISDN networks. It is able to perform multicasting and broadcasting operations without duplicating cells or employing extra complex copying hardware beside the switch fabric. Unlike many other buffered switches, the proposed switch does not require any speedup of its internal operations. Priority and fault tolerance mechanisms can easily be implemented in it.

Two design principles characterize the design of the Buffered Fat Tree (BFT) switch. First, it employs a tree topology to maximize and effectively utilize the hierarchical sharing of its internal resources. Second, it employs a smart configuration between its stages to make all its resources (links and buffers) equally accessible to any cell entering the switch, irrespective of the input from which it comes. Consequently, any discrepancy among cell arrival rates at the different input ports does not affect the overall performance of the switch. The above design principles enable the switch to handle nonuniform (as well as uniform) traffic robustly.

Section 1.1 of this chapter gives a general background on the evolution of Asynchronous Transfer Mode (ATM) as a universal transfer mode. The generic model of an ATM switch is presented next, in section 1.2, to highlight its main components. Then section 1.3 explains the characteristics which must be met in an ATM switch to make it ideal for general traffic. Finally, section 1.4 presents the thesis outline for the following chapters.

1.1 ATM Evolution

Current advances in optical transmission and VLSI technology have allowed the evolution of broadband integrated services digital networks (B-ISDN) as a universal network which effectively supports a wide range of services.
The bit rates of these services vary from a few bits/sec, as in low speed data transmission, to several Mbits/sec, as in high quality video distribution. Therefore, the provision of high speed data links with flexible bit rate allocation is among the most important features of B-ISDN networks. This requires B-ISDN to use a transfer mode capable of fulfilling all the requirements of the above services, as well as of any unexpected services that may evolve in the future [16]. B-ISDN is built upon a service-independent transfer technique called Asynchronous Transfer Mode, or ATM. ATM can carry different traffic types in a homogeneous and integrated manner. It is a packet oriented transfer mode which employs asynchronous time division multiplexing [25]. Unlike other packet switched networks, ATM has three unique features that minimize the functionality in the network to achieve fast and flexible switching at lower cost. First, ATM switching protocols are implemented in hardware to ensure fast switching of cells and to match the speed of link transmission. Second, ATM does not provide flow control or error protection on a link-by-link basis. This was made possible by the very high quality, high bandwidth links of B-ISDN and by the connection oriented nature of ATM. In ATM, proper network resources are allocated in advance, before information is transferred, to guarantee controllable queue overflows with a very low probability. Finally, information transferred by ATM is sliced into time slots of fixed length called cells. Fixing the length of cells simplifies the queue memory management and buffer dimensioning needed to fulfill ATM services. Each cell has a header and an information field. The functionality of the header is reduced to guarantee its fast processing. The information field is kept relatively small to guarantee small queueing delay and small delay jitter.
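The fixed-length cell described above is, in the standard, a 53-byte unit: a 5-byte header followed by a 48-byte information field. As a concrete illustration (this sketch is mine, not from the thesis), the following unpacks the fields of a UNI-format header from their standard bit positions:

```python
# Standard ATM cell dimensions: 5-byte header + 48-byte payload = 53 bytes.
CELL_SIZE = 53
HEADER_SIZE = 5
PAYLOAD_SIZE = CELL_SIZE - HEADER_SIZE  # 48 bytes

def parse_uni_header(header: bytes):
    """Extract GFC, VPI, VCI, PT, CLP and HEC from a 5-byte UNI cell header."""
    assert len(header) == HEADER_SIZE
    gfc = header[0] >> 4                                     # 4-bit flow control
    vpi = ((header[0] & 0x0F) << 4) | (header[1] >> 4)       # 8-bit path id
    vci = ((header[1] & 0x0F) << 12) | (header[2] << 4) | (header[3] >> 4)
    pt = (header[3] >> 1) & 0x07                             # 3-bit payload type
    clp = header[3] & 0x01                                   # cell loss priority
    hec = header[4]                                          # header checksum
    return {"gfc": gfc, "vpi": vpi, "vci": vci, "pt": pt, "clp": clp, "hec": hec}
```

The reduced header functionality mentioned in the text is visible here: routing needs only the VPI/VCI fields, and the single CLP bit carries the loss-priority information discussed later in this chapter.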
The information field is routed transparently through the ATM network. Before sending information from an ATM terminal to the network, a virtual connection setup is performed to check whether statistically enough resources are available in the network for the requested call. If not enough resources are available, the call is refused. Otherwise, the call is accepted and a virtual channel is established for the connection. Quality of Service (QOS) is negotiated at this stage to determine the throughput of the connection, cell loss probability, peak bandwidth, average sustained bandwidth, burst size and other information related to the connection. Virtual path identifiers (VPI) and virtual channel identifiers (VCI) are assigned for the connection. Transferred data is then divided into small cells whose headers contain the VCI and/or VPI to which the cell belongs. VPI/VCI information is used by ATM switching nodes to route cells to their destinations. Other information related to the connection is associated with the VPI/VCIs. In ATM networks, signalling and data information of a connection travel through the network on different logical channels.

1.2 Abstract Model of an ATM Switch

The generic model of an ATM switch depends upon the nature of the basic functions performed by the switch. The basic functions performed by any ATM switch include routing, contention resolution, table translation and other control and management operations [25][27][29]. Routing transfers cells from the input ports to the output ports of the switch. The switch must use a routing table to determine the input and output ports of each connection. The routing function is implemented in hardware to speed up the process. Switches must apply a contention resolution scheme when more than one cell competes for the same resource within the switch. Queuing is the most popular scheme for contention resolution [11][13]. Different queuing techniques and methods can be implemented in the ATM switch.
These methods, with their advantages and disadvantages, will be explained in detail in the next chapter. An ATM switch must also translate the header information of each cell before it leaves the switch to determine the outgoing link it must take to reach its destination. This is done by using a translation table that is created at setup time. In addition to the above operations concerning the data traversing the network, ATM switches must perform other control and management functions to establish, negotiate and, possibly, enforce the QOS associated with each VCI/VPI connection. They also need to monitor the overall switch functions and performance. Therefore, the abstract model of an ATM switch must have components to support all the functions mentioned above. The discussion of this section will focus on the first three functions mentioned above, namely routing, contention resolution and table translation.

The abstract model of an ATM switch consists of input and output controllers, a switching fabric and a control processor, as shown in fig (1). A translation table which keeps track of accepted VCIs/VPIs is used in each switch to route incoming cells to their destinations. This table defines the input and output ports linked to each established connection. Other information related to VPIs/VCIs, like QOS, priorities and multicasting options, is kept in this table as well.

[Figure 1: Generic ATM Switch. IC: Input Controller, IB: Input Buffers, OC: Output Controller, OB: Output Buffers.]

Input controllers must be synchronized to align the headers of incoming cells before they enter the switching fabric. They provide buffering for incoming cells and use the translation table to translate the VCIs included in cell headers. They also append to each cell an internal header which contains the necessary information for routing and processing the cell within the switch.
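The translation step performed by the input controllers can be sketched as a table lookup keyed by the arriving cell's input port and VPI/VCI, yielding the output port and the outgoing VPI/VCI. All names and table entries below are my own illustration, not from the thesis:

```python
# Populated at connection setup time by the control processor.
# (input_port, vpi, vci) -> (output_port, new_vpi, new_vci)
translation_table = {
    (0, 5, 100): (3, 7, 42),
}

def translate(input_port, vpi, vci, payload):
    """Look up a cell's connection and prepend an internal routing tag."""
    entry = translation_table.get((input_port, vpi, vci))
    if entry is None:
        return None                    # unknown connection: discard the cell
    output_port, new_vpi, new_vci = entry
    internal_tag = output_port         # self-routing tag used inside the fabric
    return (internal_tag, new_vpi, new_vci, payload)
```

The internal tag stands in for the "internal header" of the text; the output controllers would strip it before the cell leaves the switch.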
Output controllers remove the internal headers appended to cells by the input controllers. They provide buffering and arbitration schemes when more than one cell is destined to the same output port. If the switch fabric does not preserve the sequence of cell arrivals, output controllers must re-sequence the cells before sending them out to other ATM nodes, to comply with the connection oriented nature of ATM networks.

The control processor manages the overall operation of the switch. It communicates with the other parts of the switch to ensure the correct transportation of information from switch inputs to outputs according to the quality of service associated with each connection. It is responsible for handling the signalling information and higher level functions such as bandwidth allocation, call establishment and release, maintenance, administration and management. Having a single central processor can be a performance bottleneck for large switches, so the tasks of the central processor can be distributed among the different input controllers, although more coordination is then required among them.

ATM switching fabrics route cells from the input ports to the output ports of the switch based on the internal tags appended to each cell by the input controllers. They are the most important component of the ATM switch. The cost, performance, complexity, capacity and total functionality of the switch largely depend upon the switching fabric employed in it. ATM fabrics are usually composed of smaller building blocks, called switching elements, interconnected with each other to form the fabric. In other words, the overall topology of the fabric is determined by its switching elements and the way these elements are interconnected with each other. A switching element can by itself be considered a small ATM switch from which bigger switches are constructed.
Switching fabrics must provide contention resolution schemes when more than one cell competes for the same output, or for any other resource within the switch. Different schemes will be discussed in chapter two of the thesis.

1.3 Requirements of the Ideal ATM Switch

The concepts of packet routing and packet switching were known to researchers long before the emergence of ATM. Interconnection networks have been used in parallel computers to route packets among processors and memory modules, and enormous efforts have been invested in developing and improving them, since they are considered a determining factor for the performance of any computing system. Such interconnection networks, however, have proved inadequate for ATM switching due to the nature of the services expected to be offered by future B-ISDN networks. Consequently, a clear understanding of such services is essential for developing or proposing any ATM switch. An ideal ATM switch should support high data rates and achieve high throughput at a moderate cost, with short delay and small cell loss probability, for any type of traffic. The cost of the switch depends upon the amount of buffering it employs and the complexity of its implementation [41][44]. The delay is the time it takes a cell to travel from the input port through the switch to the output port. Typical values of delay range between 10 and 1000 ms, with a delay jitter of a few hundred microseconds or less. To fulfill such requirements, an ideal ATM switch must have the following features:

Self-Routing: In self-routing switches, packets are individually routed based solely on their appended binary headers (tags). A binary header is appended to each cell by the input controllers before the cell enters the switch fabric. This header is processed locally by the different components of the switch fabric when routing the cell to its output destination.
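To make the tag-based routing concrete: in a banyan-type fabric of 2x2 elements (the standard setting, though the thesis's BFT generalizes it), each of the log2(N) stages inspects only "its" bit of the binary destination tag, sending the cell to the upper output on a 0 and the lower output on a 1. The function below, my own sketch, just traces those per-stage decisions for one cell:

```python
def route_decisions(dest_port: int, num_ports: int):
    """Per-stage routing choices for a cell tagged with dest_port,
    in a fabric of 2x2 elements with log2(num_ports) stages."""
    stages = num_ports.bit_length() - 1          # log2(num_ports), num_ports a power of 2
    decisions = []
    for stage in range(stages):
        # Stage k examines the k-th most significant bit of the tag.
        bit = (dest_port >> (stages - 1 - stage)) & 1
        decisions.append("lower" if bit else "upper")
    return decisions
```

No element needs global state: the tag alone carries the whole route, which is what lets such fabrics run at high speed with simple hardware.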
The header is removed once the cell arrives at the output port. Routing in this case is distributed, since no centralized controller or global information is needed. This feature enables the switch to operate at high speeds while maintaining relatively simple switch hardware.

Preserving the Sequence of Packets: Since ATM is a connection oriented mode, the switch must preserve the order in which packets arrive [25]. If this is not guaranteed by the switch fabric, the switch must have a mechanism to re-sequence packets at the outputs after they have been routed. Re-sequencing cells at output ports causes longer delays and requires larger buffers.

Multicasting and Broadcasting Operations: An ATM switch must be capable of easily performing multicasting and broadcasting operations, which are essential for many B-ISDN services [19][39]. Services like TV distribution, electronic mail and video library access require multicasting and broadcasting operations. A packet is broadcast when it is routed to every output in the switch. It is multicast if it is sent to only a subset of the switch outputs. The challenge in designing an ATM switching fabric lies in performing multicasting, broadcasting and routing in a uniform manner without increasing the complexity of the switch or affecting its performance. The complexity of these operations varies with the switching architecture used. Some switches, for example, duplicate cells before they enter the switch fabric. They use a separate copy network to generate duplicate cells based on the number of destinations. This approach increases the traffic volume entering the switch and thus may badly impact the performance. It also requires extra hardware to generate the duplicate copies.

Resources Contention Resolution: Contention arises when more than one cell competes for the same resource within the switch during the same time slot.
Cells may compete for the same input ports, for the same internal links and buffers within the fabric, or for the same output ports. Different schemes can be used to handle contention. Losing packets can be disregarded, recirculated back to the inputs or buffered for later routing [34]. In all cases, the performance of the switch in terms of delay and cell loss probability must be kept acceptable for ATM traffic.

Modularity and Scalability: This is an essential feature for building switches of large dimensions, since modular switches are easily expandable [22][29]. Modularity allows the switch to grow in size without any technology limitations. An ATM module can be constructed by interconnecting fundamental switching elements together. Different ATM modules are then interconnected with each other to form the ATM switch. The perfect expandable switching architecture is one where expanding its size does not affect its original performance [9] or change any of its operational characteristics. Routing in the original or expanded switches must still be conducted in a uniform manner using the same routing algorithm. Many proposed switch architectures have failed to satisfy this requirement.

Supporting Performance Requirements of Various Traffic Types: The ATM switch must be capable of handling different types of traffic efficiently to meet the required quality of service (QOS) contracted with users. QOS includes parameters such as the desired peak bandwidth, average sustained bandwidth and burst size. Services like telex, voice, low speed data, high speed data, high quality video distribution and video library access have different requirements in terms of their bit rates, burstiness, cell loss probability, error rate, delay and delay jitter. For example, the bit rates of the above services range widely from a few kilobits/sec to 650 Mbits/sec or even higher.
The delay can range between 10 and 1000 ms, and the cell loss probability must be below 10^-7.

Allow for Priority Schemes: Priority schemes must be implemented to handle different classes of ATM traffic differently based on their contracted requirements. The two types of priorities addressed in ATM switches are delay priority and cell loss priority. Delay priority ensures that cell delay stays within the desired quality of service assigned to the connection. It determines which cells will be transmitted out of the buffer to meet their delay requirement. Cell loss priority keeps the cell loss rate of a connection within the desired limits. Cells of lower cell loss priority are discarded first in case of congestion. One technique to implement priorities is to place cells in different buffers based on their class. The literature includes various buffer management techniques for handling priorities.

High Utilization of Resources: A key factor for the success of a switch architecture is its ability to fully utilize the available resources under any type of traffic. The main resources of any switch are its bandwidth and its buffers. High utilization of resources lets the switch achieve the best possible performance for its invested cost.

Fault Tolerance: The switch must be able to detect faults and respond gracefully to unexpected failures; in other words, it must be fault tolerant [2][23][28]. The level of fault tolerance differs from design to design. Fault tolerance is generally achieved by building the switch from redundant components that are placed in parallel. Some switches provide different physical paths between each input and output so that cells can still be routed to their destination if one of the paths becomes faulty. Some may duplicate the whole switching fabric to provide a high level of fault tolerance.
Others may selectively duplicate only the important components of the switch to safeguard the switch operations when a failure occurs. These design options depend upon the cost and the application for which the switch is intended. Testing mechanisms need to be implemented in the switch to detect faults. Test cells can be periodically injected at the inputs of the switch in a predefined pattern and observed at the outputs. If a fault is detected, traffic must be redirected away from it until it is fixed.

It is obviously not easy to design a switch which fulfills all the above requirements, since there is always a trade-off between performance and cost. Satisfying certain requirements may preclude others. Therefore, an efficient switch design should fulfill the basic requirements and be flexible enough to adjust its parameters to fully utilize whatever hardware is invested in it for supporting its expected traffic.

1.4 Thesis Outline

Chapter 2 classifies ATM switches based on their topologies and the queueing techniques employed in them. It presents the advantages and disadvantages of each class and surveys the ATM switches which have been reported in the literature. Chapter 3 introduces the motivation for the BFT design and presents its operations and characteristics. It also compares the switch design to some of the currently available designs to demonstrate the universality of the switch definition. Chapter 4 develops a discrete-time approximation model for analyzing the performance of the proposed switch under uniform traffic. It discusses the impact of the different design parameters on the performance based on that model. Finally, chapter 5 presents and discusses the simulation results for the switch under uniform and bursty traffic.

Chapter 2. Topologies and Queueing Design of ATM Switches

ATM switches have recently been the focus of many researchers attempting to develop a switching architecture that effectively fulfills the demands of future B-ISDN networks. As a result, many different switching architectures have been proposed and investigated in the literature. These architectures differ in the queuing techniques employed in them, the topology of their fabrics and the structure of the switching elements inside their fabrics [4][7][8][9][21][37]. As mentioned in the previous chapter, a switching fabric may be composed of one or more switching elements. The discussion of this chapter applies equally to a single switching element inside the fabric and to the overall topology of the switch. The queuing technique employed by the switching elements may not necessarily be the same as the one used in the switch. The switching elements of an ATM switch, for example, may employ input queuing while the switch itself employs output queuing. Moreover, the transmission medium used within a switching element may differ from the one used by the switch fabric to connect its different switching elements to each other. A switching element, for example, may use a bus to connect its input and output ports while the switch itself uses a multistage network to connect these switching elements together. The remainder of this chapter is based on the currently available switching architectures which have been reported in the literature. It is organized as follows. Section 2.1 classifies ATM switching fabrics based on their topology. Then, section 2.2 presents and compares the performance of the different queuing techniques that can be employed by the switch design (or switching element).
Finally, section 2.3 brings into focus different examples from the literature of switches that use one or a combination of the discussed topologies and queuing techniques.

2.1 Switching Fabrics Classification

ATM switches can be classified based on their underlying switching fabrics into time division (shared medium), shared memory and space division fabrics. Each of these fabrics can be internally blocking or internally nonblocking. Assuming that no more than one cell is destined to the same output, the switch is blocking if any of its input cells cannot be routed to its available destination due to a lack of internal resources in the switch. The internal resources include the links and buffers of the switch. In this case, the switch does not successfully route all input-output permutations. On the other hand, the switch is internally nonblocking if all its input cells can be routed to their output destinations, given that no more than one cell is destined to the same output. Blocking fabrics must employ techniques to resolve contention for their resources [4]. Next, the three types of switching fabrics (time division, space division and shared memory fabrics) are discussed in detail.

A. Time Division (Shared Medium)

In time division switches, the input and output ports of the switch are linked to a common transmission medium, such as a bus or a ring, as shown in fig (2). To make the switch nonblocking, the bandwidth of the shared medium must be greater than or at least equal to the total transmission rate of the incoming links. Otherwise, the switch becomes blocking and requires large buffers at the inputs to achieve the desired quality of service. Buffers may be needed at output ports since the cell arrival rate at an output port may momentarily exceed the processing rate of that port.
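The nonblocking condition just stated is easy to check numerically: the shared medium must carry at least the sum of the input link rates. The figures below are illustrative only (155.52 Mbit/s is the STM-1/OC-3 line rate; the port count is arbitrary, not from the thesis):

```python
def is_nonblocking(bus_rate_mbps, link_rates_mbps):
    """A shared-medium fabric is nonblocking iff the medium's bandwidth
    is at least the total transmission rate of the incoming links."""
    return bus_rate_mbps >= sum(link_rates_mbps)

links = [155.52] * 16            # 16 input ports at 155.52 Mbit/s each
required = sum(links)            # about 2488 Mbit/s demanded of the bus
```

This is also why the topology scales poorly: doubling the port count (or the line rate) doubles the bus bandwidth the medium must sustain.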
13 Chapter 2 Topologies and Queueing Design of ATM Switches An arbiter is needed to manage and control the access of different input ports to the shared medium. Different arbitrating policies and algorithms can be used based on the desired performance of the switch. Each output port is linked to the shared medium through a decoder to filter the cells destined to it. Therefore, multicast and broadcast operations can easily be implemented in these fabrics since all cells traverse the same medium to which all output ports are attached and filtered by the right destinations. This is one of the advantages achieved by using a shared medium switch. On the other hand, time division switches are not easily expandable. When the number of inputs (or their rate) increases, it becomes technically impractical, if not impossible, to provide a shared medium with high bandwidth that can cope with such increase. Therefore, the throughput of the shared medium determines the maximum capacity beyond which the switch can not be expanded. Another disadvantage that may complicate the design of such switches lies in having a centralized management and control functions for these switches. Such limitations make such topology a non attractive choice for big switches, although it can be easily used to implement the switching elements ( small number of ports) from which big ATM switches are constructed. Many time division ATM switches have been implemented. IBM has built a bus based switch called Packetized Automated Routing Integrated System (PARIS) and plaNET. Other bus based switches include the ATM Output Buffer Modular Switch (ATOM) and ASX-100 switch of ForeSystem. An example of a ring based switch is the Synchronous Composite Packet Switching (SCPS) which uses multiple rings. 14 Chapter 2 Topologies and Queueing Design of ATM Switches Outputs Inputs Outputs Figure 2: Time Division Switch Fabrics B. 
B. Shared-Memory Fabrics

In this scheme, a memory is shared among all output ports of the switch. Incoming cells are multiplexed into a single stream and written into the shared memory [8][33][41][44]. Cells are then read out from the memory and demultiplexed to the correct outgoing links. In other words, the memory itself is used as the switching fabric of the switch, as shown in fig (3).

The shared memory can be designed to be fully shared by all output ports. Such a design requires dynamic memory allocation techniques implemented in hardware to speed up the operation of the switch. An alternative design is to partition the memory into fixed-size logical queues, one assigned to each output port. In such a scheme, cells are lost once the logical queue of an output port is full, even if there is free space in the logical queues of other ports. The first approach achieves better throughput and cell loss probability since it utilizes the available space intelligently. However, it is not equally fair to all output ports under bursty traffic, as most of the memory may be filled by cells destined to one or a few output ports at the expense of the others. The second approach, on the other hand, avoids the complexity of the first.

The main advantage of the shared-memory scheme is its full utilization of the available buffers. Shared-memory switches can achieve the same cell loss probability as pure output buffering but with much less buffer capacity. The gain in buffer space of shared-memory fabrics over purely output-buffered fabrics is proportional to the number of input ports in the switch.
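The difference between full sharing and fixed partitioning can be illustrated with a toy admission test; a hedged sketch (the buffer sizes and the single-destination burst are invented for illustration):

```python
def admit(cell_dest, queues, mode, total, per_port):
    """Decide whether an arriving cell can be buffered.
    mode='shared': only the total memory occupancy matters.
    mode='partitioned': each output's logical queue has a fixed cap."""
    used = sum(len(q) for q in queues)
    if mode == 'shared':
        ok = used < total
    else:
        ok = len(queues[cell_dest]) < per_port
    if ok:
        queues[cell_dest].append(cell_dest)
    return ok

# 4 outputs, 8 cells of a burst all aimed at output 0.
for mode in ('shared', 'partitioned'):
    queues = [[] for _ in range(4)]
    accepted = sum(admit(0, queues, mode, total=8, per_port=2)
                   for _ in range(8))
    print(mode, accepted)   # shared accepts 8, partitioned only 2
```

The shared case also illustrates the fairness concern in the text: the entire memory ends up holding cells for a single output.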
Figure 3: Shared-Buffer Memory Switch

The shared-memory scheme, however, requires complex memory management to emulate a FIFO discipline for each output port and to implement the broadcasting and multicasting operations. Another disadvantage of this scheme is the need for a very fast memory that can cope with the speed of the expected input and output operations; the required memory speed is proportional to the number of inputs and outputs used. This restriction may not be realizable for large-scale switches. Finally, shared-memory switches are not easily expandable due to the use of separate units for the input multiplexing and output demultiplexing operations.

Many ATM switches have been implemented using a shared-memory switching fabric. These include the Prelude switch [18], the Hitachi switch [12] and Roxana.

C. Space Division

In space division switches, multiple paths exist between the input and output ports to allow more than one cell to pass through the switch simultaneously. Therefore, the throughput obtained from space division fabrics is much higher than that of time division fabrics, and their capacity is theoretically unlimited. Besides the blocking or nonblocking characteristic of a fabric, space division topologies can be categorized by the number of paths they provide between input and output ports (single or multiple path) and the number of stages that constitute the fabric (single or multiple stage).

Space division switches can be single-path or multiple-path. In single-path switches, only a single path exists between any pair of input-output ports. Switches of this category have the disadvantage of not being fault-tolerant: if a path becomes faulty, the traffic on that path will be blocked. On the other hand, multipath switches support multiple paths between any input-output pair.
If one of the paths becomes faulty, traffic can be rerouted to an alternative one. However, routing in multipath switches is more complex, since a routing algorithm must be used to determine the path through which the cells are routed.

Space division switches can have a single stage or multiple stages of switching elements. In single-stage switches, a cell needs to pass through only one stage of switching elements before it reaches the output, which reduces the delay of the cell inside the switch. In multi-stage switches, a cell must be routed through multiple stages of switching elements before it reaches its destination. Multistage switches are scalable and easier to implement.

The most common form of space division topology is the crossbar fabric shown in fig (4). Crossbars are nonblocking, single-path, single-stage switches. They are built of N^2 crosspoints connecting the inputs of the switch to its outputs. Each crosspoint can be activated or deactivated based on the destination of the cell. The complexity of these switches is proportional to N^2, which makes them unsuitable for large-scale switches. Since a single path is employed between each input-output pair, traffic will be blocked if a path or a crosspoint becomes faulty. Different contention resolution schemes can be implemented with crossbar topologies; any of the queueing techniques discussed in the next section can be used to resolve conflicts among cells. If buffers are not used, the internal link speed must be N times higher than the arrival rate on the incoming links. As with shared medium switches, the crossbar topology is practical for small switches or for building the small switching elements from which bigger switches are constructed.
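A crossbar's per-slot configuration is just an N x N activation matrix; a minimal sketch (the function and names are ours, for illustration only):

```python
def crossbar_config(dests, n):
    """Build the N x N crosspoint matrix for one slot: crosspoint
    (i, j) is closed when input i carries a cell for output j.
    dests[i] is input i's requested output, or None if idle."""
    matrix = [[False] * n for _ in range(n)]
    for i, j in enumerate(dests):
        if j is not None:
            matrix[i][j] = True
    return matrix

m = crossbar_config([2, 0, None, 1], 4)
print(sum(row.count(True) for row in m))  # 3 crosspoints closed, of 16
```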
Figure 4: Examples of Crossbar Switch Fabrics

To overcome the complexity of crossbar topologies, multistage networks have been proposed in the literature as a promising topology for ATM switching fabrics. In these networks, identical switching elements (mostly 2x2) are arranged in stages and connected to each other to form the network. Various multistage networks with different characteristics exist, depending on the number of stages employed in the network and the nature of the connections between its stages.

The banyan network is an important multistage topology. It is a self-routing, single-path, multi-stage network. The main disadvantage of the banyan is its blocking characteristic, caused by cell contention for its internal links. As with crossbar topologies, different schemes can be used to overcome the blocking characteristics of banyan networks:

a. Internal buffers can be used to store conflicting cells, as described in the next section [3][14].

b. The speed of the internal links can be made √N times faster than the incoming links. This, however, is a costly approach and its implementation is impractical.

c. The incoming cells can be rearranged into a combination realizable by the network before they are fed into it. For example, a sorter network can be used to sort the cells before they are routed through the banyan. More of these techniques will be discussed in the next section.

d. More stages of switching elements can be added to the network. This turns the banyan into a multi-path switch at the expense of compromising cell sequence integrity, and it can also complicate the VLSI design of the switch.

The following section presents the different queueing techniques that can be employed by a switch (or by any of its switching elements).
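The banyan's self-routing property means no routing tables are needed: each stage examines one bit of the destination address. A minimal sketch under the usual MSB-first convention (an assumption on our part, as the bit order is not fixed above):

```python
def banyan_route(dest, n_stages):
    """Per-stage routing decisions for a self-routing banyan of 2x2
    elements: stage k (0-based) inspects bit k of the destination,
    most significant bit first; 0 selects the upper output of the
    element, 1 selects the lower one."""
    return [(dest >> (n_stages - 1 - k)) & 1 for k in range(n_stages)]

# In an 8x8 banyan (3 stages), a cell addressed to output 5 = 101b
# takes the lower, upper, then lower element output.
print(banyan_route(5, 3))   # [1, 0, 1]
```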
2.2 ATM Queueing Schemes

Different queueing techniques can be employed by a switch (or by its switching elements). These techniques differ from each other in the location of the buffers within the switch. A switch can employ input buffering, output buffering, central buffering or a combination of the above.

A. Input Queueing

In input queueing, each input port has a FIFO queue in which arriving cells are stored until they can be routed to their destinations, as shown in fig (5). Arbitration algorithms are needed to choose the queue to be served when more than one head-of-line cell is destined to the same output. These algorithms range from a simple round robin discipline to very complex schemes. As cells may arrive at the switch asynchronously, input queueing helps synchronize their headers before they enter the switching fabric.

Figure 5: ATM Queueing Techniques (pure input queueing, pure output queueing, central queueing)

Input queueing suffers from low throughput due to head-of-line (HOL) blocking [24]. If the head-of-line cells of several queues compete for the same output, only one cell is transmitted out of its buffer while the others must wait. Cells behind a blocked head-of-line cell are blocked as well, even if their destinations are available. It was shown in [24] that the maximum throughput achievable with this scheme is 58.6 %.

The HOL problem can be reduced by using random access techniques: if the first cell is blocked, the one behind it can still be routed to its destination if that destination is available. This, however, requires complex management and complex hardware. Therefore, pure input queueing is not an attractive scheme for ATM switches. Other ways to improve input buffering have been reported in the literature [10][46].
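The 58.6 % limit is easy to reproduce by simulation. A sketch of the standard saturated-switch experiment (our own code, not from the thesis; destinations are assumed uniform and independent, and throughput approaches 2 − √2 ≈ 0.586 as N grows):

```python
import random

def hol_throughput(n, slots, seed=1):
    """Saturated N x N input-queued switch with FIFO queues: every
    input always has a head-of-line cell with a uniformly random
    destination; each output serves one randomly chosen contender
    per slot, and the winner's queue exposes a fresh head cell."""
    rng = random.Random(seed)
    heads = [rng.randrange(n) for _ in range(n)]
    served = 0
    for _ in range(slots):
        contenders = {}
        for i, d in enumerate(heads):
            contenders.setdefault(d, []).append(i)
        for d, inputs in contenders.items():
            winner = rng.choice(inputs)
            heads[winner] = rng.randrange(n)  # next cell in that queue
            served += 1                       # one cell per busy output
    return served / (n * slots)

print(round(hol_throughput(16, 20000), 3))   # close to the 0.586 limit
```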
B. Output Queueing

In this scheme, contention for output ports is resolved by placing separate FIFO buffers at each output, as shown in fig (5). Ideally, each output port should be able to accept up to N cells simultaneously to obtain a nonblocking architecture. This, however, is not technically practical, so buffers at the outputs must be used: if more than one cell is destined to the same output port, one is served and the rest are buffered. Concentrators can be used to reduce the number of cells an output buffer needs to handle simultaneously; choosing their size is part of dimensioning the switch to meet the contracted QOS.

FIFO output buffering does not require complex memory management schemes or arbitration algorithms, unlike central and input queueing. This simplifies the implementation of the switch and reduces its cost. Output buffering has been shown to give the best performance under random uniform traffic: it can achieve a throughput of 100 % with a small cell loss probability, and its delay is low since no HOL blocking occurs.

The main disadvantage of output buffering is under-utilization of the available buffers, especially under bursty correlated traffic. Cells destined to some outputs may be discarded when the buffers of those outputs are full, even if the buffers of other outputs have vacancies, because output ports do not share their buffers. Therefore, large buffers are usually needed to achieve a given cell loss probability. Some researchers have tried to alleviate this problem by dividing the output buffers into groups and allowing the outputs of each group to share a common memory; this is a scheme between pure output buffering and pure central queueing. Examples of output-buffered switches can be found in [11][15][20][24][42][45].
C. Central and Internal Queueing

Shared memory is either used as a switching medium between the input and output ports or used as internal buffers in the switching elements within the fabric. The first case was discussed in detail in the context of the different switching fabric schemes in the previous section. The discussion here is therefore devoted to the case where buffers are used internally in the switching elements of a space division fabric, as shown in fig (5). Internal buffers are used to resolve cell contention for the internal links of the fabric. The performance of buffered switches depends on the topology of the fabric and the location of the buffers within the switching elements. Crossbar fabrics have a buffer at each crosspoint; this requires buffers of order N^2, which makes them too expensive to implement.

Buffered banyans have been studied extensively in the literature. Internal buffers can destroy the order of cells if the fabric has multiple paths, in which case cells must be reordered at the output ports; this, however, requires about 20 % more buffers. When buffers are full, cells are either discarded inside the switch or a backpressure mechanism is used. In backpressure, blocked switching elements ask the elements of previous stages to delay sending their cells until the blocking is resolved. Cells in this case may be lost at the input ports but not within the fabric.

In summary, each of the above queueing methods has its own drawbacks. Input queueing in space division switching suffers from low throughput (about 58.6 % with FIFO buffers) due to head-of-line (HOL) blocking effects. Internal queueing suffers from a rapid nonlinear growth in memory complexity with switch size. Output queueing achieves optimal throughput but requires switch operation at a much faster rate than the effective or peak rate of the incoming traffic.
Moreover, pure output switching exhibits poor utilization of buffers, because buffers are exclusively devoted to each output. Although sufficient space may be available at some outputs, cells destined to other outputs with full buffers are lost, resulting in an increased cell loss probability. This is due to the lack of buffer sharing among the switch outputs.

The next section presents some of the ATM switches that have been proposed in the literature. Most of these examples use one or a combination of the above topologies and queueing techniques; thus, the advantages and disadvantages of these switches follow from those of the techniques they use.

2.3 Examples

Research efforts have concentrated on the output buffering and shared memory approaches. Such efforts can be divided into three major classes:

a. Knockout-based switches (output queueing).
b. Banyan-based switches.
c. Shared memory switches.

The remainder of this section discusses the research efforts aimed at improving the above classes.

A. Knockout-based Switches

This category of switches employs output buffering. The basic knockout switch was introduced in [45]. It employs full connectivity by having N^2 disjoint paths to connect each input to each output of the switch. Each output has an interface module consisting of a concentrator (to route a maximum of only l simultaneous packets to each output), a shared output buffer and a common bus to forward packets from the buffers to the output port. This basic architecture has several drawbacks:

a. The size of the switch grows quadratically as a function of the number of inputs, which makes it very costly to realize switches of large dimensions.

b. Only a very small portion of its bandwidth is utilized at any time [43].

c. Its output buffers are totally independent, which yields a poor utilization of the buffer space.
The switch also requires large buffers (compared to shared memory) to achieve an acceptable loss probability.

d. It has poor performance under bursty and non-uniform traffic.

e. It requires N output interfaces of fixed size.

Subsequently, efforts were directed towards alleviating some of the drawbacks mentioned above. In the Socron switch [17], a different concentration scheme is used to reduce the delay, but it employs the original structure of the knockout switch and thus suffers the same drawbacks. In the Christmas Tree switch proposed in [42], interleaved filtering and concentration are implemented. In this architecture, the problem of buffer utilization among outputs is not addressed; the main goal was to reduce the complexity of the switch by allowing bandwidth to be shared among outputs. Phased address filtering was also employed in the KSMIN switch [43] to reduce the severe pin requirements of large knockout switches. In this switch, a multistage (banyan) network whose internal routing nodes are small knockout switches is used. The performance of this switch under uniform traffic was found to be comparable to that of the knockout switch when sufficient buffering is used (buffer size > 30). This switch has the drawbacks inherent in the banyan architecture, since it does not permit full utilization of the links within its fabric. Although more than one path can exist from an input port to an output port, no routing scheme was given to determine a criterion for selecting a particular path. The same approach, in which double-phase filtering is used, was proposed in [7]; however, an extra stage of distribution networks was added to improve the performance under non-uniform traffic. Besides the inherent problems of the banyan topology, the main drawback of this switch is that the sequence of packets is not maintained.
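The knockout concentrator's loss can be quantified with a binomial model: with N inputs at load p and uniform destinations, the expected number of cells exceeding the concentrator width l gives the cell loss probability. A sketch of that standard analysis (our formulation, not code from the thesis):

```python
from math import comb

def knockout_loss(n, l, p):
    """Probability that a cell is knocked out when a concentrator
    passes at most l of the cells that n inputs (each at load p,
    uniform destinations) send towards one output in a slot."""
    q = p / n                       # per-input prob of a cell for this output
    lost = 0.0
    for k in range(l + 1, n + 1):
        pk = comb(n, k) * q**k * (1 - q)**(n - k)
        lost += (k - l) * pk        # expected number of knocked-out cells
    return lost / p                 # normalized by the offered load

# Loss falls steeply with concentrator width l (n = 32, 90 % load):
for l in (1, 2, 4, 8):
    print(l, knockout_loss(32, l, 0.9))
```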
B. Banyan-based Switches

This class includes any switch that employs the banyan network as its routing network. Many blocking and nonblocking switch architectures based on the banyan network have been proposed in the literature [14][15][31][36][38][40]. Different types of networks are usually used in the switch beside the routing network to improve its performance, to implement the multicasting operation and/or to resolve output contention. One important example is Turner's switch, which is used in phase 0 of the Zeus project [21]. This switch employs two buffered banyan networks: a copy network and a routing network.

One main drawback of banyan-based switches is that the functions of the switch are not performed in a uniform manner, since extra networks are needed to support the routing network. Banyan-based switches are not modular (when another network is added to the original banyan network) and suffer poor performance under bursty non-uniform traffic. Moreover, packets encounter long delays, as in Turner's switch. Broadcasting may severely overload the network, and consequently degrade its performance, because copies are generated at an early stage of the switch. The resources of these switches are not utilized intelligently. In our proposed architecture, copies of broadcast packets are generated at the closest ancestors of their destinations to avoid overloading the network. Moreover, the functions of the switch (multicasting, routing, broadcasting and contention resolution) are all performed in a uniform manner, as will be explained later.

C. Fully Shared Memory Switches

It has been shown that this type of switch can achieve the same cell loss probability as pure output buffering with much less buffer space; this approach fully utilizes the available buffer space. As mentioned earlier, however, it has several disadvantages:

1) Broadcasting and multicasting require fairly complex control.
2) It requires complex control to preserve the order in which packets arrive at the switch.

3) It requires a very fast memory that can accept up to 2N read and write operations at the same time, which imposes a restriction on the size of the switch.

The next chapter introduces the motivation, definition and operation of the proposed BFT switch.

Chapter 3. The Buffered Fat Tree

The BFT represents a general class of scalable ATM switches whose switching fabrics employ a tree topology. It is a nonblocking, multi-stage, multi-path space division switch with internal shared buffers. Two design principles characterize the design of the BFT. First, it employs a tree topology to maximize and simplify the hierarchical sharing of its internal resources. Second, it employs dynamically reconfigurable nodes to ensure full accessibility to its resources by all ATM cells traversing its fabric.

This chapter is organized as follows. Section 3.1 introduces the motivation behind the evolution of the BFT architecture based on the above two design principles. An abstract model of the BFT is discussed in section 3.2. The structure and operation of the BFT are explained in section 3.3. Section 3.4 discusses the fault tolerance of the switch. The routing, broadcasting and multicasting operations are discussed in section 3.5. Section 3.6 presents the expandability and modularity of the switch. Finally, the switch design is compared to other switches from the literature in section 3.7.

3.1 Motivation and Evolution of the New Architecture

An immediate goal of our design approach is to fully utilize the switch resources by maximizing the resource sharing among the output ports and to enhance the fair accessibility of all shared resources by any input cell. The resources of the switch consist mainly of its bandwidth (the channels connecting input and output ports) and the buffers located within its fabric.
Buffered banyan networks have been proposed as a possible model for ATM switches, with buffers located at the inputs of each switching element in the network. The buffered banyan network does not provide satisfactory performance under long bursty traffic, in which internal conflicts persist for a long time at the same location within the switch fabric. This is mainly due to the static nature of the connections within its fabric: its internal buffers and bandwidth are not fully utilized, since the input ports do not have access to every link and every buffer in the switch. To illustrate this point, consider the 8 x 8 banyan network shown in fig (6). Each of the buffers labeled 1 to 8 is accessible by only one input port, each of the buffers labeled 9 to 16 is accessible by only two input ports, and each of the buffers labeled 17 to 24 is accessible by only 4 input ports. Similarly, each of the links connecting the first and second stages is utilized by two input ports, each of the links connecting the second and third stages is utilized by only 4 input ports, and each of the output links is accessible by all the input ports.

For the sake of the coming discussion, let us define the path stream of an output port to be all the buffers and links which can be used to route packets from any input to that output port. The path stream of output 1 in the banyan network is shown in fig (7). This path stream consists of 14 buffers (numbered 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 17 and 18) and 6 internal links (numbered 1, 3, 5, 7, 9 and 11). In banyan networks, an input cell destined to an output port does not have access to all the resources included in the path stream of that output. This means that sharing of resources within the banyan is not optimal.
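The accessibility counts in this example generalize: with 2x2 elements and buffers at the element inputs, a buffer at stage s can be reached from 2^(s−1) input ports. A small sketch of that count (our illustration, not from the thesis):

```python
import math

def buffer_accessibility(n_inputs):
    """Number of input ports able to reach a buffer at each stage of
    an N x N banyan built from 2x2 elements with buffers placed at
    the element inputs (stages numbered from 1 at the inputs)."""
    stages = int(math.log2(n_inputs))
    return {s: 2 ** (s - 1) for s in range(1, stages + 1)}

# Matches the 8x8 example: stage-1 buffers serve 1 input each,
# stage-2 buffers 2 inputs, stage-3 buffers 4 inputs.
print(buffer_accessibility(8))   # {1: 1, 2: 2, 3: 4}
```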
Figure 6: Utilization of Resources in the Banyan Network (buffers 1 to 8 are accessible by only one input each; buffers 9 to 16 by only two inputs each; buffers 17 to 24 by only 4 inputs each; links 1 to 8 are utilized by two inputs each; links 9 to 16 by 4 inputs each; each output link is utilized by all 8 inputs)

Let us assume that input ports 1 and 3 are simultaneously sending long bursty traffic to output port 1 while all other inputs are idle. A conflict will occur on link 9, and cells will therefore be stored in buffers 9 and 10. Buffers 1 and 3 may soon become full if the burst continues. The other buffers of the path stream (numbered 5, 6, 7, 8, 11 and 12) and its other links (numbered 5, 7 and 11) are not being utilized even though they are idle. This is due to the fixed connections between the stages of the banyan network: only some of the buffers and links of the path stream are accessible to cells coming from input ports 1 and 3. An improvement can be achieved if all the buffers and links of the output path stream are fully accessible by all inputs, so that these resources are fully utilized whenever conflicts between long bursty traffic sources occur.
Figure 7: Path Stream for Port 1 in the Banyan

Figure 8: Re-arranging the Buffers and Links of the Path Stream

To improve the accessibility of the internal resources of the switch by ATM cells, a dynamically reconfigurable interconnection must be employed between the stages of the network, allowing any input to access any buffer or any link within the switch fabric when needed, as shown in fig (8). The effect of such dynamic interconnections can be achieved if the switch is able to balance the external and internal traffic equally among its internal buffers and links. Let us return to the previous example under the new arrangement of fig (8). If only inputs 1 and 3 are simultaneously sending long bursty traffic to output 1, all buffers in the path stream of output 1 (numbered 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 17 and 18) and all its links (numbered 1, 3, 5, 7, 9 and 11) can be used by these connections. Thus, there will be no conflict within the internal fabric of the switch. Cells will contend for the output port; however, the output port may not need to run at high speed, because the contending cells are stored in the internal buffers of the path stream. For the bursty traffic of inputs 1 and 3, the second arrangement increases the potential buffer capacity from 5 to 14 buffers (almost 300 %) and the potential internal bandwidth from 3 to 6 links (200 %).

This gain does not come free of cost: extra hardware between the stages is needed to implement the effect of fully configurable nodes. However, this hardware is justified by the gains achieved in operation and performance.
First, it enables the switch to handle routing, multicasting, broadcasting and output contention in a uniform manner, whereas the banyan network requires extra complicated networks to handle these issues, as explained in chapter 2. Second, the new arrangement considerably enhances the flexibility of the design, in the sense that the numbers of buffers and links used are not dictated by prescribed mathematical formulas. Instead, the number of buffers and the capacity of the channels (the number of links or wires) become design parameters which can be determined subject to traffic intensity and hardware cost. Therefore, the number of buffers and links which can be used in routing packets from the inputs to an output port can be optimized, as shown in fig (9). This means that the size of the path stream becomes a design parameter, adjusted as needed to meet the expected operation of the switch.

Figure 9: Customizing the Path Stream

Third, the rate of internal buffering as well as output buffering does not increase significantly. In fact, the internal rate can remain the same as the input rate if the resulting cell loss is acceptable. Again, the internal speed-up can be incorporated as a design parameter, which can be traded off against bandwidth or buffer size.

Based on the previous discussion, the ideal situation occurs when the switch has a separate path stream for each output port. However, it is not practical to exclusively devote such a stream to every output, since the implementation cost would be high. A more efficient design is to allow the path streams of two different outputs to share resources from the inputs up to a certain stage at which they split (this stage forms their nearest common ancestor in the switch), as shown in fig (10). The above is the basis for the evolution of the BFT switch.
Figure 10: Combining Path Streams

3.2 The Conceptual Model of the BFT Switch

The concept of combining and sharing the path streams of output ports until they split at their nearest common ancestor generates a switch with a tree topology. This topology is characterized by hierarchical sharing of resources, which fully utilizes the available buffer space and bandwidth. In the BFT, the resources of each stage are divided into mutually exclusive sets, with each set utilized by a distinct group of output ports (associated with that set). The size of each group decreases as one traverses the tree from top (root) to bottom (leaves), since combined path streams split into smaller ones as they approach the output ports. The rationale behind this strategy is that the buffers and bandwidth of the early stages (closer to the inputs) have more potential users, and thus require a larger amount of resources, than those of the later stages (closer to the outputs). This guarantees full utilization of the hardware invested in the switch.

In the BFT, buffers are placed in the internal nodes of the tree and the outputs are placed at the leaves. In a tree of degree d, in which the root is located at level 1 and the leaves are located at level log_d N (where N is the number of outputs), the buffers at level i are partitioned into d^i distinct sets, where each set of buffers is shared among d^(log_d N - i) outputs, as shown in fig (11). Each such set of buffers (at any level i) will be called a shared multibuffer. The size of each shared multibuffer at level i is M_i, where M_i is a design parameter that can be selected to optimize performance. Thus, a cell is buffered in a node at level i if its destination is one of the d^(log_d N - i) outputs of the subtree rooted at that node. Consequently, the utilization of buffers in the tree is maximized; see fig (11).
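The hierarchical partition can be tabulated directly from these formulas; a minimal sketch (d = 2 and N = 8 are an illustrative configuration, not one fixed by the thesis):

```python
import math

def bft_level_sharing(d, n_outputs):
    """For a BFT of degree d with N outputs (root at level 1, leaves
    at level log_d N): level i holds d**i shared multibuffers, each
    shared among d**(log_d(N) - i) output ports."""
    levels = round(math.log(n_outputs, d))
    table = []
    for i in range(1, levels + 1):
        sets_at_i = d ** i
        outputs_per_set = d ** (levels - i)
        table.append((i, sets_at_i, outputs_per_set))
    return table

# Degree-2 tree with 8 outputs: (level, multibuffers, outputs per set)
for row in bft_level_sharing(2, 8):
    print(row)   # (1, 2, 4), (2, 4, 2), (3, 8, 1)
```

Note that at every level the product of the two counts equals N, i.e. the sets at a level jointly cover all outputs exactly once.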
Similarly, a channel connecting a node at level i to one of its children at level i + 1 is utilized by d^(log_d N - i - 1) output ports. Thus, the number of potential users of a channel grows as one moves up towards the root of the tree. Consequently, the capacities of the channels increase towards the root, resulting in an improvement of the bandwidth utilization in the tree. This is similar to the Fat-Tree concept [6], in which the capacities of channels increase as one goes up towards the root of the tree. The following section presents a detailed description of the Buffered Fat Tree (BFT) switch.

Figure 11: Switching and Buffering Nodes in the BFT (B&S stands for a buffering and switching node; C is the capacity of the channel connecting level i to level i + 1; d is the degree of the tree, i.e., each node has d children)

3.3 BFT Structure

The BFT fabric is composed of nodes connected together in a tree topology. Inputs are placed at the root and outputs are placed at the leaves of the tree. This topology makes the switch easily expandable. Its nodes provide highly configurable connections between the internal links of the fabric. To achieve optimal bandwidth utilization, the available wires must be fully utilized; this can be achieved by uniformly balancing the traffic load over the available internal resources. The main function of a node is to split the incoming traffic into different outgoing streams and to buffer the cells which temporarily cannot be forwarded to the next stage. Therefore, each node is constructed of two stages, buffering and switching, as shown in fig (12).
Figure 12: General Structure of a BFT Node (outputs grouped per child, 1 to d)

Five design principles are considered in the implementation of the BFT node: the required bandwidth of the buffers, the utilization of the buffers, the ability of the node to preserve the arrival sequence of cells, the throughput and fairness of the routing algorithms, support for fault tolerance, and ease of priority implementation.

A. Buffering Stage

In a BFT node, the incoming traffic is split into different outgoing streams of smaller capacity, each directed towards one of the node's children. Buffers are therefore needed to store the cells which temporarily cannot be routed to the next stage (the children). The required buffer bandwidth must be kept relatively small; the buffer bandwidth is defined as the number of access operations to the memory required in each time slot. In typical shared memory topologies, the memory access requirement scales as O(N). The BFT avoids this complexity by placing separate buffers in parallel for each outgoing stream, which minimizes the required bandwidth of the buffer memory: for each buffer, a maximum of two access operations (one read and one write) is needed in each time slot.

Having separate, isolated buffers requires a mechanism for maximizing their utilization. The BFT employs a balanced-distribution network to maximize the utilization of the buffers inside the node, as shown in fig (13). The balanced-distribution network is connected to the inputs of the buffers, and it distributes packets evenly among these buffers. Packets are distributed in a cyclic (round robin) fashion such that the difference between the contents of any two FIFOs does not exceed one packet. Packets must also be fetched out of the buffers in a cyclic (round robin) fashion to maintain the optimal utilization of the buffers.
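A minimal sketch of this cyclic write/read discipline (our own code; `put` and `get` each model one cell moving into or out of the shared multibuffer):

```python
from collections import deque

class BalancedDistributor:
    """Cyclically spreads arriving cells over d parallel FIFOs and
    fetches them back in the same cyclic order, so occupancies differ
    by at most one cell and the arrival sequence is preserved."""
    def __init__(self, d):
        self.fifos = [deque() for _ in range(d)]
        self.w = 0   # next FIFO to write
        self.r = 0   # next FIFO to read
    def put(self, cell):
        self.fifos[self.w].append(cell)
        self.w = (self.w + 1) % len(self.fifos)
    def get(self):
        cell = self.fifos[self.r].popleft()
        self.r = (self.r + 1) % len(self.fifos)
        return cell

dist = BalancedDistributor(4)
for c in range(10):
    dist.put(c)
occ = [len(f) for f in dist.fifos]
print(max(occ) - min(occ) <= 1)            # True: balanced occupancy
print([dist.get() for _ in range(10)])     # cells leave in arrival order
```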
It can be shown that such a mechanism preserves the sequence of packet arrivals at the switch. Several schemes can be used to implement the distributing network. In a pack-and-shift approach, cells are packed using either a concentrator, a sorter, or a routing network to remove the empty slots which may exist between packets. A shifter is then used to cyclically distribute the concentrated packets among the FIFOs such that the contents of any two buffers differ by at most one packet. This ensures the even distribution of packets among the buffers and balances the load on the output wires of the node. It is worth mentioning that both the pack and the shift functions can be conveniently implemented by the well-known butterfly (banyan-type) network, which performs a parallel prefix addition to rank packets by order of arrival and then employs a simple greedy routing algorithm to route the packets to their destinations. The routing procedure exploits the monotonic routing property of the butterfly to pack and shift the input cells with no path contention. Figures (14) and (15) show how cells can be packed and shifted (once they have been ranked) on a butterfly network; the details can be found in a wide body of literature. In smaller switches, the same functions (pack and shift) can be realized with simpler structures such as a running adder and a shared bus: the running adder computes the ranks of the incoming cells based on their order of arrival, and the shared bus routes the cells according to their ranks. One can envisage several other hardware solutions between these two extremes.
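As a software illustration of the ranking step, the following sketch (hypothetical names; a sequential stand-in for the butterfly's parallel hardware) computes each cell's arrival rank with a prefix sum over occupancy bits and then assigns it to FIFO (offset + rank) mod num_fifos, which is the pack-and-shift behaviour described above.

```python
def pack_and_shift(slots, num_fifos, start_offset):
    """Sketch of the pack-and-shift function: rank the non-empty input
    slots via a prefix sum over occupancy bits (the role played by the
    butterfly's parallel prefix addition), then place each cell in FIFO
    (start_offset + rank) mod num_fifos so that consecutive arrivals
    land in consecutive buffers. Assumes at most num_fifos cells per
    slot; illustrative model, not the hardware implementation."""
    occupied = [0 if cell is None else 1 for cell in slots]
    # Exclusive prefix sum gives each cell its arrival rank.
    rank, ranks = 0, []
    for bit in occupied:
        ranks.append(rank)
        rank += bit
    # Cyclic shift of the packed cells across the FIFOs.
    assignment = {}
    for slot, cell in enumerate(slots):
        if cell is not None:
            assignment[(start_offset + ranks[slot]) % num_fifos] = cell
    return assignment
```

The next time slot would continue the rotation from (start_offset + number of packed cells) mod num_fifos, so the cyclic balance carries over across slots.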
Figure 13: Internal Organization of a Shared Buffer that Preserves the Cell Arrival Sequence. (Incoming cells are packed and shifted into the parallel FIFO buffers; the cell sequence is preserved.)

Figure 14: Packing Ranked Cells on a Butterfly Network. (Input cells carry ranks 000 through 100 by order of arrival.)

Figure 15: Packing and Shifting Ranked Cells on a Butterfly Network.

B. Switching Stage

This stage is responsible for forwarding the buffered cells to the correct child of a node. It also supports broadcasting and multicasting, since it can route a cell to more than one child, or to all children, depending on the desired operation. Switching is done in two phases. First, the head-of-line cells in the buffers are demultiplexed towards their destinations. All cells destined to a child are then multiplexed onto the outgoing bandwidth which connects the current node to that child. The capacity of the outgoing link may not allow all cells with the same destination to be forwarded simultaneously; therefore, some cells stay in the buffers until the next time slot. Together, the buffering stage and the switching stage behave like a buffered concentrator. Any switching fabric that can easily perform the above functionality can be used to implement the switching stage of a BFT node. A combination of multiplexers and demultiplexers can be used when the degree of the switch is small, while a knockout-like crossbar fabric can be used for trees of large degree. The main disadvantage of the crossbar topology lies in its number of crosspoints, which is proportional to N². This, however, is not the case for the switching stage of the BFT. In the BFT arrangement, the outgoing links are grouped into d groups, where d is the degree of the tree.
Each group can share the same vertical lines, which are crossed with the incoming links. Therefore, the number of crosspoints within each node grows linearly with N·d. Small values of d should be used to simplify the implementation of the switch and to preserve its sharing capabilities.

C. Routing Cells Inside the Node

Different algorithms can be used to route head-of-line cells from their buffers to the next stage of the BFT. These routing algorithms differ in their performance, their fairness to different ATM connections, their buffer utilization, and their ability to preserve the sequence of the cells inside the switch. They are simple algorithms that can be implemented in hardware. The following discussion focuses on the conceptual model of each algorithm without detailing the extra hardware required for its implementation.

Algorithm (a): In this algorithm, a head-of-line cell is forwarded to the next stage only if all head-of-line cells on its left have been forwarded to their destinations (or dropped) and both its outgoing link and its destination buffers have enough capacity for it. Otherwise, it is either dropped or blocked until the next time slot. A limit is defined on the maximum number of cells that may be dropped in each time slot within a shared buffer in a node. This drop limit is a design parameter which can range from 0 to N, where N is the number of buffers in that node. If the limit is 0, the routing algorithm becomes pure back-pressure and no cells are dropped within the fabric. If the limit is equal to N, no buffers are needed inside the nodes and the switching fabric becomes completely unbuffered, which is similar to the Christmas tree [42] when the degree of the switch is 2.
If the degree of the switch is 2 and the dropping limit is N/2, the switch is close to the Multinet defined in [13]. Since N incoming lines are concentrated into C lines for each outgoing group, where C is the capacity of that group and C < N, a cyclic round-robin mechanism is used as an arbiter to resolve contention for the outgoing links. Routing in the next time slot must always start from the first cell that was blocked in the previous time slot. The above algorithm preserves the sequence of the cells within the switch and keeps the optimal utilization of the parallel buffers inside the node. It can easily be implemented in hardware, since it does not involve complex coordination among the controllers of the separate buffers; therefore, no complicated central controller is required. On the other hand, this algorithm does not provide optimal throughput, since head-of-line blocking can occur: some cells may be blocked even when their destinations (buffer and bandwidth) are available. Moreover, this algorithm does not treat different ATM connections fairly, since traffic destined to some outputs can be blocked by traffic to other outputs that passes through the same shared memory.

Algorithm (b): In this algorithm, a head-of-line cell is forwarded to the next stage if both its outgoing link and the buffers at its destination have enough capacity for it. No consideration is given to the routing status of the head-of-line cells in other buffers. It is a back-pressure algorithm in which no cells can be dropped inside the fabric. It can easily be implemented, since no coordination between the buffers is required, and it achieves better throughput than algorithm (a) because it decreases the level of head-of-line blocking. On the other hand, it does not preserve the sequence of cells within the buffer; thus, cells must be re-sequenced at the output ports to comply with ATM requirements.
Another disadvantage of this algorithm is underutilization of the buffer space. Cells are not removed from the buffers in a cyclic fashion, although they are still written into them cyclically by the balanced-distribution network. This can increase the difference among buffer occupancies and can badly affect the utilization of the parallel buffers inside the node. As is the case with algorithm (a), algorithm (b) is not fair to different ATM connections, since some of them can be blocked because of others.

Algorithm (c): In this algorithm, cells are routed out of the buffers based on their outgoing destinations in the next stage; the status of the head-of-line cells has no impact on the routing decision. The separate random-access buffers are organized as d linked lists. In a node at stage i, cells destined to a child at the next level i + 1 are grouped into different linked lists, one for each shared buffer in that child. Cells are then read out from the parallel buffers based on these linked lists. The algorithm serves the head-of-line cells of the linked lists in a round-robin fashion. If the required outgoing link is not available, or the destination of a cell does not have enough buffer space, the whole linked list is skipped until the next time slot; otherwise, the cell is routed and the corresponding linked list is adjusted to reflect the new head-of-line cell. The linked lists are traversed until no more links are available or no more cells can be routed to the next stage. This algorithm may not maintain the optimal utilization of the parallel buffers, as some of them may become full while others are empty. This can be improved by combining different buffers to increase sharing among them; however, doing so also increases their required access speed. Another disadvantage of this algorithm is the complex memory control required to handle the logical linked lists.
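The per-destination service discipline of algorithm (c) can be sketched as follows (a behavioural Python model with illustrative names; deques stand in for the hardware linked lists): lists are visited in rotation, and a whole list is skipped for the current time slot whenever its outgoing link or destination buffer lacks capacity.

```python
from collections import deque

def route_algorithm_c(dest_lists, link_capacity, dest_free_space):
    """Sketch of routing algorithm (c): cells in a node are grouped into
    per-destination logical lists; the head cells of the lists are served
    round-robin, and a whole list is skipped for this time slot when its
    outgoing link has no spare capacity or its destination buffer is full.
    The traversal stops when no further cell can be routed."""
    routed = []
    progress = True
    while progress:
        progress = False
        for dest, cells in dest_lists.items():
            if not cells:
                continue
            if link_capacity[dest] == 0 or dest_free_space[dest] == 0:
                continue  # skip the whole list until the next time slot
            routed.append(cells.popleft())  # route the head-of-line cell
            link_capacity[dest] -= 1
            dest_free_space[dest] -= 1
            progress = True
    return routed
```

Note the fairness property the text describes: a blocked destination (here, one with a full buffer) leaves the other destination lists entirely unaffected.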
The control logic may need to run N times faster than the links that connect the different stages together. This, however, is still far less complex than a pure shared memory, where N logical linked lists need to be implemented and controlled. The logic in the BFT is simpler than in a pure shared memory because routing is distributed over different stages when the degree d is small. On the other hand, algorithm (c) ensures fair treatment of different ATM connections, since the cells of each outgoing stream within a node get a fair and equal chance to be routed to the next stage without being blocked by cells of other outgoing streams. Parallel data buffers are still used to increase the outgoing bandwidth of the node without increasing their access requirements. Another main advantage is the ability to implement a priority mechanism by grouping cells into different linked lists based on their priority rather than their destinations. More about the performance of the above algorithms will be presented with the simulation results in the next chapter.

3.4 Switch Fault Tolerance

The fault tolerance of a switch is its ability to detect and effectively respond to unexpected failures in any of its internal components. Fault tolerance can be achieved by providing redundant copies of important parts, by providing multiple paths between any input-output pair, or by having parallel planes of the same switch. The BFT achieves fault tolerance by providing multiple paths between its input and output ports. Physically separate parallel links and buffers are available at every stage for any cell traversing the switch; therefore, faulty links and buffers can easily be isolated without affecting the overall operation of the switch. In a switch of degree d, there are P separate paths between any input-output pair, where P is calculated as follows:

P = C_1 · C_2 · · · C_{lg_d(N) − 1}
where N is the number of inputs and C_i is the bandwidth of the links connecting a node at stage i to its child at stage i + 1 (assuming the root is placed at stage 1); that is, P = ∏_{i=1}^{lg_d(N) − 1} C_i. If a link or a buffer within the switch fails unexpectedly, the switch can still route cells to their destinations, but at reduced performance. The severity and scope of a failure depend upon the stage at which it occurs. In the context of the above expression, a link connecting a node at stage i to its child at stage i + 1 is part of P_i = ∏_{j ≠ i} C_j = P / C_i paths. Since C_i decreases as one traverses the tree towards the outputs, the severity of faulty links or buffers increases as one goes down the tree: the number of paths blocked by a faulty link or buffer increases as the stage gets closer to the output ports. In this case, however, fewer output ports are affected, since the number of output ports sharing a path decreases as one traverses the tree towards the leaves. This means that the impact of a faulty component near the outputs is more severe than a failure in the early stages, but its scope is limited to the few output ports that share that component. Stages close to the outputs should therefore have priority for any redundant components provided in the design for fault tolerance.

3.5 Routing, Broadcasting and Multicasting within the Switch

One of the interesting features of the proposed switch is that routing, broadcasting, and multicasting operations can be performed in a unified manner. A packet is broadcast when it is routed to every output of the switch; if it is routed to only a subset of the switch outputs, it is multicast. An ideal ATM switch should be self-routing and should support broadcasting and multicasting operations. The BFT is self-routing, since each internal node routes packets independently based on their local destination, and thus no central control is required.
Multicasting and broadcasting are easily done in the BFT and do not require duplication of packets. The routing inside the node was explained in the previous section; this section explains the routing through the tree topology of the switch. The BFT switch operates in the following manner. Starting at the root, packets move down the tree to the leaves, where the outputs are placed. Packets are routed based on their destination addresses (encoded in radix-d for a switch of degree d). A decision must be made at each internal node to determine the child to which a packet is to be routed. Packets are divided into d groups, each destined to one of the children of the node. Packets destined to the same child take their routes through the wires that connect the node to that child; a packet can be routed through any of these wires based on availability. Note that when d = N, the switch becomes purely output-buffered with no buffer sharing, and when d = 2, the switch operates as a binary tree with maximum buffer sharing. When a packet arrives at an input port of the switch, the input processor of that port appends a header to the packet in order to route it through the switch. The format of the header is illustrated in fig (16). In a tree of degree d, the first d bits of the header determine the branches to which the packet is to be routed (one bit for each branch). At any stage, the packet is sent to a branch if the bit corresponding to that branch is set to one. If the packet is multicast to r branches, its processed header will have r subheaders, each preceded by a counter that gives the number of bits included in that subheader. The same applies recursively to the subheaders at the next stage until the packet is routed or multicast to its destinations. This process is shown in figs (16) and (17) for a tree of degree 2.
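The recursive left/right splitting that the header encodes can be sketched in software. In the following Python model (function and variable names are illustrative, and an explicit destination set stands in for the bit-level header), a cell is forwarded down only those branches of a degree-2 tree that lead to at least one destination, mirroring the per-node routing-bit check described above.

```python
def route_multicast(level, node_index, destinations, depth, delivered):
    """Sketch of recursive multicast routing in a binary (d = 2) BFT.
    At each node the destination set is split between the left and
    right subtrees, and the cell is forwarded only down branches whose
    routing bit would be set. No packet duplication is needed until the
    paths actually diverge."""
    if level == depth:              # reached a leaf (output port)
        delivered.append(node_index)
        return
    # Output range covered by each child of this node.
    span = 2 ** (depth - level - 1)
    left_first = node_index * 2 * span
    left = [t for t in destinations if left_first <= t < left_first + span]
    right = [t for t in destinations
             if left_first + span <= t < left_first + 2 * span]
    if left:    # routing bit for the left branch is 1
        route_multicast(level + 1, node_index * 2, left, depth, delivered)
    if right:   # routing bit for the right branch is 1
        route_multicast(level + 1, node_index * 2 + 1, right, depth, delivered)
```

Routing a cell to outputs {1, 5, 6} of an 8-output tree forwards it down both branches at the root, then only down the branches that cover the remaining destinations; a broadcast simply sets every destination.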
The length of the header is not the same for different multicast and broadcast packets; the worst case occurs when a packet is broadcast to all outputs.

Figure 16: Cell Header. (The routing bits select the left child, the right child, or both children; a packet routed to both children carries two headers ahead of the original packet, one for each branch.)

The disadvantage of the above approach lies in having different header lengths depending on the number of destinations. Another approach is to fix the number of bits in each header by having one bit for each child at every stage. In a tree of degree d with N inputs, a header of d · lg_d N bits is needed to route the packet to its destination. At any BFT node, the d bits corresponding to the d children of that node are checked to determine the direction of the cell. A cell is then sent to a child through its corresponding outgoing link if the corresponding bit in the cell header is set to 1; if the cell is broadcast, all bits in its appended header are set to 1. This approach keeps the length of the cells constant inside the fabric.

Figure 17: Routing Control Logic. (The header carries count bits and routing bits; routing bits 10 send the cell to the left child, 01 to the right child, and 11 to both children; 00 is an invalid combination.)

3.6 Switch Expandability and Modularity

The general tree topology of the BFT allows switch modules of different sizes and output capacities to be added to or removed from the tree in a simple modular fashion. The performance and operational characteristics of the original BFT switch do not change as its size is expanded. The architecture of the switch is modular and easily expandable. Assuming d = 2, a switch of larger size, say 2N × 2N, can be realized by connecting two N × N modules (representing the subtrees) to a root module.
The new 2N × 2N switch can then be considered as one module and used to expand the size further in the same recursive manner. In this scheme, a new module is designed as the root every time the switch is expanded. Another scheme is to construct large switches from modules of the same size; in other words, small modules of the original switch are connected together to realize switches of larger sizes. In such a large switch, resources are not fully shared as in the first approach. Inputs as well as outputs are divided into different groups, where members of the same group share some of the switch resources. In other words, the inputs do not form a single group, as is the case in the full tree architecture explained before. Expanding switches in this way, however, may reduce the probability that cells of one connection are blocked by other connections under long bursty traffic, as will be discussed later.

3.7 Comparison with Other Switches

The BFT is a universal and flexible switch. Its definition incorporates other types of switches, ranging from purely output-buffered switches to shared-memory switches; therefore, the BFT provides a unifying framework for optimizing the design and performance of ATM switches. In this section, the switch is compared with three switches from the literature: the Knockout switch, the Multinet switch, and the Christmas tree switch.

A. BFT vs. Knockout Switch

The Knockout switch [45] can be realized by making the degree of the BFT equal to N, in which case the BFT turns into a purely output-buffered switch. The BFT is better than the Knockout switch in two respects. First, the complexity of the crosspoints can be decreased by decreasing the degree of the switch. This is achieved because the crosspoints in any node within the BFT can be shared by a group of outgoing links. The
sharing increases as the degree of the switch decreases, giving the BFT a better complexity than the Knockout switch. The total crosspoint complexity of the switching stages in the BFT is d · N · lg_d N, where d is the degree of the switch and N is the number of its inputs. This is a great improvement over the crosspoint complexity of the Knockout switch, which is N². Second, the BFT employs hierarchical sharing of buffers to accommodate bursty traffic. Switches with shared buffers perform better than purely output-buffered switches under bursty or correlated traffic: they achieve the same cell loss probability with much less buffer space. Knockout switches (and purely output-buffered switches in general) have less delay than the BFT; however, implementing the cut-through concept within the BFT dramatically decreases the delay.

B. BFT vs. Multinet

An architecture similar to the Multinet [13] can be realized by making the degree of the BFT equal to 2 and by implementing routing algorithm (a) with a drop limit of N_i/2, where N_i is the number of buffers in a node at stage i. The BFT differs from the Multinet in the following respects. First, the BFT inter-stage switching is more effective than that of the Multinet. In the Multinet switch, the maximum number of outgoing cells from a node at stage i with N_i buffers is N_i/2, while this number can reach up to N_i in a BFT of degree 2. Therefore, the BFT provides better internal throughput than the Multinet. The reason can easily be understood from fig (18). In the Multinet, the traffic coming from the outputs of the parallel buffers is first multiplexed down to half its width and then demultiplexed to the right destination (child). This forms a bottleneck: some cells will not be routed to their destinations even if those destinations have enough resources to accommodate them.
In the BFT, traffic coming from the outputs of the parallel buffers is first expanded (demultiplexed) towards the destinations and then multiplexed into the narrower outgoing bandwidth. This increases the number of cells that can be forwarded to the next stage, provided their destinations have enough buffer space and the number of cells destined to the same destination does not exceed the capacity of the outgoing link. In summary, the BFT can route to the next stage in one time slot certain combinations of head-of-line cells that take the Multinet two time slots. Second, the implementation of the cut-through technique is much easier in the BFT for the same reason: when the buffers are empty, a physical path can easily be set up to the children of the next stage. Finally, the definition of the BFT model is more universal and flexible than that of the Multinet. Its various design parameters can be dimensioned to meet the required quality of service; in other words, its implementation can be realized for different design parameters. The degree can be increased or decreased, the bandwidth of its internal links can be modified without obeying a predefined mathematical formula, and resource sharing can be adjusted as required by the QOS. Therefore, the BFT provides a better and more applicable model for studying the effect of all these design parameters on the performance of an ATM switch, while no similar implementations are defined in the context of the Multinet.

Figure 18: Difference between the Multinet and the BFT. (In the BFT, the outputs of the 8 buffers pass through a cyclic arbitrator, are demultiplexed towards their destinations, and are then multiplexed; in the Multinet, the buffer outputs are first multiplexed, forming a bottleneck, and then demultiplexed.)

C. BFT vs. Christmas Tree Switch

The Christmas tree switch [42] can be realized from the BFT by making the degree equal to 2 and by using routing algorithm (a) with a drop limit equal to the number of parallel buffers in each node.
Therefore, no buffers are needed and the BFT becomes similar to a purely output-buffered tree. The BFT model differs from the Christmas tree model in several respects. The BFT employs shared buffering, while the Christmas tree is a purely output-buffered switch; thus, the BFT performs better under bursty traffic and requires less buffer space to reach the cell loss probability of purely output-buffered switches. Moreover, the BFT model provides a realizable implementation for different values of the design parameters, while the Christmas tree does not; this parallels the discussion above of the difference between the BFT and the Multinet switch. Finally, the switching and concentrating techniques of the Christmas tree are different.

Chapter 4. Performance Modeling

This chapter develops a queuing model for evaluating the performance of the BFT under certain traffic assumptions. The BFT is first modelled assuming that no cut-through routing is used in the switch, so cells cannot traverse more than one stage during each time slot. The BFT is then modelled with the cut-through routing technique, under which incoming cells can bypass a buffer in a stage if the buffer contains no cells and the outgoing link leading to the next stage has sufficient capacity for them. The second section discusses the performance results obtained from solving the analytical model for different values of the switch design parameters.

4.1 Analytical Model

In a node at stage k, a maximum of C_{k-1} cells can be successfully routed to the next stage, where C_{k-1} is the capacity of the incoming link to the node. However, this maximum is not always attainable due to cell blocking. Three causes can block a cell from being routed to the next stage. First, bandwidth blocking occurs when the number of head-of-line cells destined to a specific child exceeds the capacity of the outgoing link to that child.
In this case, the extra cells are blocked even if that child has enough buffer space for them. Second, buffer blocking happens when the destination of a cell does not have enough buffer space for it; in this case, the cell is blocked even if the outgoing link to that destination has enough capacity to carry it. Finally, sequence blocking occurs when forwarding a cell to the next stage would not preserve the arrival sequence of the cells inside the switch fabric. As explained earlier, cells must be read out from the shared buffers in a cyclic fashion to ensure the equal utilization of these buffers and to preserve the arrival sequence of the cells within the fabric. If a cell of older sequence is blocked for any reason, the head-of-line cells to its right are also blocked, even if there is enough bandwidth to carry them to their destinations and those destinations have enough buffer space for them. One way to alleviate this problem without affecting the cell sequence is to drop the blocked cell so that the others may be forwarded to their destinations; a cell is dropped only if the number of cells previously dropped in that node during the same time slot is below the maximum limit allowed by the designer. The performance model discussed in this chapter operates under the following assumptions:

a. The input traffic is random and uniform. This means that the head-of-line cells in the parallel buffers are mutually independent and equally likely to be destined to any of the children in the next stage. This assumption implies that the behavior of a series of shared memories is independent of the particular output under consideration; therefore, a series of such queues can be used to evaluate the performance of the whole switch.

b. No back-pressure mechanism is implemented. A cell is dropped if there is not enough space for it in the next stage.

c. Cells are not dropped if they experience bandwidth blocking.
The performance of the switch can be studied by analyzing each shared buffer in isolation. The shared buffer at stage k can be modelled as a single queueing station with I_k inputs and R_k servers, as shown in fig (19).

Figure 19: Mathematical Model of the Buffering Stage in the BFT (a queue holding m cells, with I inputs and R outputs).

The whole switch is analyzed as a series of such queues connected together according to the tree topology, as shown in fig (20). The arrival process of any queue depends on whether the shared buffer is in the first stage (which is directly connected to the switch inputs) or in an internal stage of the switch.

Figure 20: A Simplified Queuing Model for the BFT.

A discrete-time Markov chain method is used to analyze the modelled shared buffer. The following symbols (Table 1: Term Definitions) are defined before discussing the model:

k: stage number.
d: degree of the switch.
C_k: capacity of the outgoing link that connects the buffer at stage k to any of the children in the next stage.
M_k: maximum capacity of the shared memory at stage k.
p_i^k: probability that the shared buffer at stage k holds i cells.
P_ij^k: transition probability that the queue at stage k moves from state i to state j.
P_loss^k: cell loss probability of a shared memory at stage k.
P_loss^total: total cell loss probability of the whole switch.
I_k: maximum number of cells arriving at a shared memory at stage k during a time slot.
R_k: maximum number of cells that can leave a shared buffer at stage k during a time slot.
D_mean^k: average delay of a shared buffer at stage k.
D_total: total average delay in the switch.
N_mean^k: mean number of cells in a shared buffer at stage k.
E_i^k: probability that i cells leave the shared memory at stage k in a time slot.
E_mean^k: mean number of cells leaving a shared buffer at stage k each time slot.
E_total: mean number of cells that leave the switch (last stage) each time slot.
A_i^k: probability that i cells arrive at the shared memory at stage k during a time slot.
A_mean^k: mean number of cells that arrive at a shared buffer at stage k in a time slot.
A_total: mean number of cells arriving at the switch (first stage) each time slot; it is equal to A_mean^1.
A_loss^k: average number of cells lost at the inputs of stage k.
E_BB(avg)^k: average number of cells that leave the shared buffer during a time slot, given that there is always enough buffer space for them in the next stage and that no cell is affected by sequence blocking.

A. Analysis of the Switch When Cut-Through is not Implemented

The transition from one state to another in the Markov chain of the modelled buffer depends upon the difference between the number of incoming and outgoing cells during a time slot. Accordingly, the transition probability can be expressed as follows:

P_ij^k = Σ_{n − m = j − i, 0 ≤ m ≤ R_k, 0 ≤ n ≤ I_k} A_n^k · E_m^k   if i − min(i, R_k) ≤ j ≤ min(i + I_k, M_k − 1)
P_{i,M_k}^k = Σ_{n − m ≥ M_k − i, 0 ≤ m ≤ R_k, 0 ≤ n ≤ I_k} A_n^k · E_m^k   if i > M_k − I_k
P_ij^k = 0   in all other cases

The value of E_m^k depends on the buffer blocking probability of the next stage if a back-pressure mechanism is used. It also depends on the blocking probability due to insufficient bandwidth in the outgoing links, and on the maximum number of cells that the routing algorithm is allowed to drop in order to preserve the arrival sequence of the cells inside the switch. The effect of bandwidth blocking can be captured in the analysis by setting R_k equal to E_BB(avg)^k, the average number of cells that leave the shared buffer during a time slot given that there is always enough buffer space for them in the next stage and that no cell is affected by sequence blocking.
Under random uniform traffic, where each parallel buffer holds at least one cell and each head-of-line cell is equally likely to be destined to any of the d children, this average can be calculated as follows, with N_k denoting the number of parallel buffers in the node and C(n, m) the binomial coefficient:

E_BB(avg)^k = d · Σ_{m=0}^{N_k} min(m, C_k) · C(N_k, m) · (1/d)^m · (1 − 1/d)^{N_k − m}

To further simplify the analysis while capturing buffer and sequence blocking, the analysis in this section assumes that a maximum of R_k = C_k cells can be routed to the next stage in each time slot, and that cells are dropped if no buffer space is available for them in the next stage. This represents the worst-case scenario for the BFT. The analysis also assumes that R_k cells leave a shared memory in each time slot, independently of the occupancy of the buffers in the next stages. If the Markov chain is in state i, this assumption can be expressed as:

E_m^k = 1   if i ≥ R_k and m = R_k
E_m^k = 1   if i < R_k and m = i
E_m^k = 0   for all other values of m

Thus, under the above assumption, the transition probabilities of the Markov chain become:

P_ij^k = A_{j − i + min(i, R_k)}^k   if max(0, i − R_k) ≤ j ≤ min(I_k + max(0, i − R_k), M_k − 1)
P_{i,M_k}^k = Σ_{n = M_k − i + min(i, R_k)}^{I_k} A_n^k   if M_k − i + min(i, R_k) ≤ I_k
P_ij^k = 0   in all other cases

The Markov chain for the shared buffer under the above equations is illustrated in figs (21) and (22).

Figure 21: Markov Chain when i ≤ R. (When 0 ≤ i ≤ R: P_ij = A_j if 0 ≤ j ≤ I, and P_ij = 0 if I + 1 ≤ j ≤ M.)

Figure 22: Markov Chain when i > R. (When R < i ≤ M: P_ij = A_{j − i + R} if i − R ≤ j ≤ I + i − R, and P_ij = 0 otherwise.)

The above equations are based on the observation that when the queue holds at most R_k packets at the beginning of a time slot, the next state of the queue depends only on the number of arrivals in that time slot, since all of the previous packets will have been forwarded to the next stage.
Otherwise, the next state of the queue is determined by the number of arriving packets as well as the number of packets which remain in the queue after forwarding R_k of its packets to the next stage.

Assuming uniform traffic with a Bernoulli arrival process, the value of A^k_i depends on whether the buffer is in the first stage or in an intermediate stage. If the buffer is in the first stage, then

A^1_i = C(N, i) (p/d)^i (1 − p/d)^{N−i}

If the buffer is in an intermediate stage k, A^k_i is computed as follows:

A^k_i = Σ_{j=i}^{R_{k−1}} P^{k−1}_j C(j, i) (1/d)^i (1 − 1/d)^{j−i} + ( Σ_{j=R_{k−1}+1}^{M_{k−1}} P^{k−1}_j ) C(R_{k−1}, i) (1/d)^i (1 − 1/d)^{R_{k−1}−i}

and A^k_i = 0 if i > R_{k−1}. Consequently, the probability distribution for the Markov chain, showing only the nonzero terms, can be calculated using the following equations:

P^k_i = Σ_{j=0}^{i+R_k} P^k_ji P^k_j   if 0 ≤ i ≤ I_k
P^k_i = Σ_{j=R_k+i−I_k}^{i+R_k} P^k_ji P^k_j   if I_k + 1 ≤ i ≤ M_k − R_k
P^k_i = Σ_{j=R_k+i−I_k}^{M_k} P^k_ji P^k_j   if M_k − R_k + 1 ≤ i ≤ M_k

The corresponding transition probability matrix is used to calculate the steady-state probability distribution. It should be noted that the above Markov chain always reaches equilibrium (for any I_k, R_k and M_k) since it is aperiodic and irreducible. Once the steady-state probabilities are calculated, the mean number of cells in the shared buffer at stage k, N^k_mean, can be computed as follows:

N^k_mean = Σ_{i=0}^{M_k} i P^k_i

Similarly, the average number of cell arrivals, λ^k_mean, at the shared buffer in stage k can be calculated as follows:

λ^k_mean = Σ_{i=0}^{I_k} i A^k_i

The delay at stage k can be calculated using Little's law as follows:

D^k_mean = N^k_mean / λ^k_mean

and the overall delay in a switch with i stages is given by:

D_total = Σ_{k=0}^{i−1} D^k_mean

B. Analysis of the Switch when Cut-Through is Implemented

Cut-through routing can be used to decrease the delay in the switch. A packet is allowed to bypass a buffer if the buffer is empty and its corresponding outgoing link is not busy.
In this case, the buffer can be modelled in the same way as discussed above, except that the transition probability P^k_ij is calculated differently for 0 ≤ i ≤ R_k. The transition probabilities under cut-through are as follows:

P^k_i0 = Σ_{n=0}^{R_k−i} A^k_n   if 0 ≤ i ≤ R_k
P^k_ij = A^k_{R_k − i + j}   if 0 ≤ i ≤ R_k and 1 ≤ j ≤ i + I_k − R_k, or R_k + 1 ≤ i ≤ M_k and i − R_k ≤ j ≤ min(I_k + i − R_k, M_k − 1)
P^k_{i,M_k} = Σ_{n=R_k+M_k−i}^{I_k} A^k_n   if M_k + R_k − I_k ≤ i ≤ M_k
P^k_ij = 0   all other cases

The arrival process in this case is derived from the departure process of the previous-stage queue, which is approximated by a random variable. The input characteristics of a generic queue are described by the following equations:

A^1_i = C(N, i) (p/d)^i (1 − p/d)^{N−i}
A^k_i = Σ_{j=i}^{I_k} O^{k−1}_j C(j, i) (1/d)^i (1 − 1/d)^{j−i}   if k ≥ 2

The output characteristics of a generic queue are described by the following equations:

O^k_i = Σ_{j=0}^{min(i, M_k)} P^k_j A^k_{i−j}   for i < R_k
O^k_{R_k} = Σ_{j=0}^{min(R_k, M_k)} P^k_j Σ_{n=R_k−j}^{I_k} A^k_n + Σ_{j=R_k}^{M_k} P^k_j

C. Cell Loss and Delay

Let P^k_loss be the cell loss probability at stage k; then P^k_loss = λ^k_loss / λ^k_mean, where λ^k_loss is the mean number of packets lost in a time slot because no empty space is available at stage k. The cell loss probability in a switch of i stages, P^total_loss, can be calculated as follows:

P^total_loss = Σ_{k=0}^{i−1} P^k_loss Π_{j=0}^{k−1} (1 − P^j_loss)

The delay of the switch is computed similarly by summing up the delays of all stages.

To avoid congestion, the switch must be designed such that the number of outgoing links from a shared buffer of any stage is at least equal to the expected number of packets arriving at that buffer in any time slot. Let θ^(i) be the ratio of packets arriving at a shared buffer in stage i to the capacity of its outgoing links. Then θ^(i) = λ^i_mean / R_i, and consequently, the necessary condition to avoid congestion is 0 ≤ θ^(i) ≤ 1.

4.2 Performance Discussion

The performance of the BFT has been extensively studied under the analytical model developed in the previous section (with and without cut-through).
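The no-cut-through model of Section A lends itself to a direct numerical check. The sketch below is an illustrative implementation rather than the thesis's own code: it builds the occupancy chain of a single shared buffer under the simplifying assumption that min(i, R) cells depart each slot, finds the steady state by power iteration, applies Little's law, and combines per-stage loss probabilities. The toy Bernoulli arrival distribution is an assumption chosen only to keep the example small.

```python
def transition_matrix(A, R, M):
    """Occupancy chain of one shared buffer: A[n] = P(n arrivals per slot),
    min(i, R) cells depart each slot, overflow beyond capacity M is lost."""
    P = [[0.0] * (M + 1) for _ in range(M + 1)]
    for i in range(M + 1):
        for n, a in enumerate(A):
            j = min(i - min(i, R) + n, M)  # next state, clipped at full buffer
            P[i][j] += a
    return P

def stationary(P, iters=2000):
    """Steady-state distribution by power iteration; the chain is finite,
    aperiodic and irreducible, so the iteration converges."""
    pi = [1.0 / len(P)] * len(P)
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]
    return pi

def stage_delay(pi, A):
    """Little's law for one stage: D = N_mean / lambda_mean."""
    n_mean = sum(i * p for i, p in enumerate(pi))
    lam = sum(n * a for n, a in enumerate(A))
    return n_mean / lam

def total_loss(stage_loss):
    """Combine per-stage loss probabilities: a cell is lost at stage k
    only if it survived stages 0..k-1."""
    total, survive = 0.0, 1.0
    for p in stage_loss:
        total += survive * p
        survive *= 1.0 - p
    return total

# Toy stage: Bernoulli arrivals with p = 0.4, one departure per slot, M = 4.
A = [0.6, 0.4]
pi = stationary(transition_matrix(A, R=1, M=4))
```

With these toy parameters the buffer holds at most one cell at a slot boundary, so the steady state is approximately (0.6, 0.4, 0, 0, 0) and the stage delay is one time slot; summing stage_delay over all stages reproduces D_total, and total_loss is the product-form combination of the per-stage loss probabilities.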
The main objective of this section is to understand the effect of the different design parameters on the switch performance. This is important for dimensioning the switch based on the quality of service expected from it.

A. Effect of Switch Degree

The degree of a BFT can range from 2 to N, where N is the number of switch inputs. The hardware complexity of the switch increases as its degree increases. Results indicate that the degree of the BFT has only a slight effect on cell loss probability for a given fixed number of buffers. This means that under uniform traffic, the optimal cell loss performance of the complex, fully connected, purely output-buffered switch (degree = N) can easily be achieved by the less complex BFTs of lower degrees after adding a few buffers to the switch. The exact effect of the BFT degree on cell loss probability and delay is explained in detail next for the BFT without cut-through.

With no cut-through, the number of buffers required to achieve a certain cell loss probability decreases slightly as the degree of the switch increases, reaching its optimum under uniform random traffic when the degree is equal to N. Figure 23 compares the performance of two BFTs of degree 2 and 4 to output buffering. A purely output-buffered switch can be considered as a BFT of degree N, where N is the number of switch inputs. The figure is based on a 64 x 64 switch operating under an input load of 0.9. The performance of the BFTs is very close to the optimal case under uniform random traffic: the BFT of degree 4 is better than that of degree 2 but worse than the output-buffered switch. The number of extra buffers which need to be added to BFTs of lower degrees to achieve the cell loss probability of a purely output-buffered switch increases as the cell loss probability decreases.
For example, pure output buffering needs almost 300 buffers less than the BFT of degree two and about 200 buffers less than the BFT of degree four to achieve a cell loss probability of 10^-3. This difference increases to 800 buffers for the BFT of degree two and to almost 400 buffers for the BFT of degree four at a cell loss probability of 10^-9. Notice that the above figures are based only on random uniform traffic and on BFTs where no cut-through or back-pressure is implemented, as our model assumes. Back-pressure greatly enhances the results of the BFT under uniform random traffic, since it utilizes the available buffers in the switch intelligently: no cells are lost unless all the buffers located along the path from the input port of the cell to its output are full, which is unlikely to happen under uniform random traffic. The cut-through technique also greatly enhances the performance of the switch, as will be explained later in this chapter.

For the same switch considered above, Figure 24 plots the delay of three BFTs (degree 2, 4 and N) under different input loads. The delay of the switch decreases when the degree increases, since this decreases the number of stages in the switch that a cell has to pass. Notice that the minimum delay of a switch, when no cut-through is used, is log_d N time slots, which is equal to the number of stages of the switch. However, this delay can be greatly decreased by applying the cut-through technique, as explained later.

In conclusion, a complex fully connected output-buffered switch, which has the optimal performance under random and uniform traffic, can be replaced with a less complex BFT of low degree after adding extra buffers to the switch. The delay caused by having multiple stages in the BFT of low degree can be overcome by applying the cut-through technique.

B.
Delay and Cell Loss of BFT Stages

The delay and cell loss probability of a shared buffer in a BFT node depend upon the capacity of its buffers, the traffic load arriving from other nodes, and the capacity of the outgoing links which connect the node to its children. The behavior of the shared buffer is affected by any change in the above parameters. Although the cell arrival rate at the shared buffers of early stages (close to the root) is much higher than at those of later stages, buffers of early stages achieve better cell loss and delay figures for the following two reasons. First, the capacity of the outgoing links in early stages is very high, so more servers are available for serving the waiting cells. Second, the available space in these buffers is fully utilized most of the time, since more cells pass through them. This is different for buffers close to the outputs, which are dedicated only to cells destined to their outputs. Therefore, stages close to the switch outputs require more buffers to achieve the same cell loss and delay figures as early stages.

Figure 25 shows the buffers required to achieve a given cell loss probability in the different stages of the same switch. A 64 x 64 switch of degree 2 and 6 stages operating under an input load of 0.9 is considered here. The total capacity of the last stage in the switch needs to be 18 times higher than that of the first stage to achieve the same cell loss probability of 10^-7. Details of the other stages are summarized in the following table:

Stage #   Capacity of a single shared buffer   Total buffers needed at that stage
          (cell loss of 10^-7)                 (cell loss of 10^-7)
1         62                                   124
2         55                                   220
3         46                                   368
4         42                                   672
5         38                                   1216
6         35                                   2240

Table 2: Required Buffers for Each Stage in a 64x64 BFT (Example)

As for the delay, analytical results show that the delay of the stages increases as one goes down the tree from the roots to the outputs, as shown in Figure 26.
Fixing the input load at 0.9 for the above switch, the delay of a stage continues to increase as more buffers are added to it until it reaches a certain capacity, after which the delay does not change. This specific capacity is a function of the stage input load, and any extra buffers added beyond it will not affect the performance of the switch under that given input load. Early stages have shorter delay than later stages. If the buffers in the above switch are arranged according to the previous table to achieve a 10^-7 cell loss probability under an input load of 0.9, the average delay of the last stage will be 3.3 times higher than the delay of the first stage. The following table gives the delay of the different stages, in time slots, for a 10^-7 cell loss probability:

Stage #   1      2      3      4      5      6
Delay     1.00   1.12   1.26   1.56   2.15   3.30

Table 3: Delay of Different Stages in a 64x64 BFT (Example)

In conclusion, most of the buffers in the BFT must be placed in stages close to the outputs to achieve the required cell loss probability, because buffer sharing decreases in these stages. On the other hand, this arrangement increases the delay in the switch, especially in the lower stages. The delay can be effectively reduced by implementing the cut-through approach, which is discussed next.

C. Effect of Cut-Through Routing on Performance

The cut-through mechanism improves the overall performance of the switch: it decreases both the switch cell loss probability and its total delay. A 64 x 64 switch of degree 2 is used to demonstrate the performance of the BFT with and without the cut-through mechanism. Figure 27 shows how the cell loss probability changes with input load for both cases. Cut-through slightly improves the cell loss probability when the switch operates under high input load (0.7 < P < 1). More improvement is realized when the input load decreases to P < 0.7.
When the size of the buffers inside the switch increases, the effect of cut-through on the switch cell loss probability decreases, because cut-through switches achieve smaller gains when more buffers are added. On the other hand, Figure 28 shows how the delay is greatly improved for different input loads by implementing the cut-through mechanism, especially under high input loads. When the input load decreases, both mechanisms approach their minimum delay. In the above switch, for example, cut-through reduced the delay by 3 time slots at P = 1. This improvement continues to increase, reaching up to 6 time slots at P = 0.7, after which the BFT with no cut-through reaches its minimum delay of 6 time slots (one per stage) while the cut-through BFT reaches its minimum delay of 0 time slots.

In summary, the cut-through mechanism decreases the cell loss probability in the switch as well as its delay. This is because cells under the cut-through mechanism are always forwarded to the next stage whenever enough bandwidth and buffers are available for them. As a result, the effective internal bandwidth between stages increases and the mean waiting time for each cell decreases, leaving only a few cells stored in buffers. Consequently, both delay and cell loss probability decrease.

D. Effect of Buffer Distribution among Switch Stages

The performance of the switch is affected by the way buffers are distributed among its stages. To illustrate this point, a 32 x 32 switch of degree 2, which has 5 stages, is considered. Assume this switch has 320 buffers distributed among its stages as shown in the following table:

Stage     Number of Shared Buffers in Stage   Capacity of Each Shared Buffer   Total Buffers in Stage
stage 1   2                                   32                               64
stage 2   4                                   16                               64
stage 3   8                                   8                                64
stage 4   16                                  4                                64
stage 5   32                                  2                                64

Table 4: Arrangement of Buffers in a 32x32 Switch (Example)

Assume also that an extra 360 buffers will be added to this switch.
Six different ways of distributing these buffers among the stages are discussed here to study their impact on performance, mainly the cell loss probability and delay. The notation (i_1, i_2, i_3, i_4, i_5) is used to denote the capacity of each individual shared buffer in each stage. The following table describes the approaches discussed in this section. Notice that the total number of buffers is fixed for all of them and equal to 680 buffers.

Tree #   Characteristics of Approach
BFT1     The original tree before adding the extra 360 buffers to it (32, 16, 8, 4, 2).
BFT2     Add all 360 buffers to the first stage (212, 16, 8, 4, 2).
BFT3     Equally distribute the new 360 buffers among the shared memories of the last stage (32, 16, 8, 4, 14).
BFT4     Add 40% to the first stage, 40% to the second stage and 20% to the third stage (104, 52, 17, 5, 5).
BFT5     Add 40% to the last stage, 40% to the fourth stage and 20% to the third stage (32, 16, 17, 13, 7).
BFT6     Add 60% to the third stage, 20% to the second stage and 20% to the fourth stage (32, 34, 35, 9, 2).
BFT7     Equally distribute the extra buffers among all switch stages (68, 34, 17, 5, 5).

Table 5: Different Approaches for Placing Buffers in the BFT

Comparisons of the above approaches (with no cut-through) are plotted in Figures 29 and 30. Figure 29 plots load versus cell loss probability for each approach. BFT5 comes first with the best cell loss figure; BFT7 comes next, followed by BFT3. No major improvement is achieved by adding the extra buffers according to BFT2, BFT4 and BFT6. So, sharing the extra buffers among stages achieves the best results if focus is given to the lower stages of the switch (close to the outputs). This scheme is superior to placing all buffers at the output ports, where no sharing exists for the extra buffers. However, both are superior to the worst-case arrangement where all buffers are placed at, or close to, the inputs.
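The capacity notation above can be checked arithmetically: in a degree-2, 5-stage tree, stage s contains 2^s shared buffers (per Table 4), so the total buffer count for a distribution (i_1, ..., i_5) follows directly. A small sketch (illustrative only; the helper name is ours):

```python
def total_buffers(caps, degree=2):
    """Total buffers in a BFT given per-stage shared-buffer capacities:
    stage s (1-based) holds degree**s shared buffers of capacity caps[s-1]."""
    return sum(degree ** s * c for s, c in enumerate(caps, start=1))

bft1 = total_buffers([32, 16, 8, 4, 2])    # original tree before the 360 extra buffers
bft2 = total_buffers([212, 16, 8, 4, 2])   # all extra buffers added to the first stage
```

Here bft1 evaluates to 320 and bft2 to 680, matching the text; for distributions whose extra buffers do not divide evenly among a stage's shared memories, the rounded capacities listed in Table 5 make the totals deviate slightly from 680.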
Figure 30 plots load versus the total delay within the switch. BFT2 achieves the best delay while BFT5 is the worst, and the difference between them decreases as the load on the switch decreases. This shows that adding the extra buffers at the input will not badly affect the delay of the original switch BFT1, whereas delay increases with shared and output buffering.

The above results can be explained as follows. The outgoing bandwidth of the shared memories in the upper stages is higher than in the lower stages, so cells in the upper stages are routed to the next stage as soon as they arrive at the buffer. Thus, no long line-up forms in the upper stages and only a few cells are lost. Consequently, adding extra buffers at these stages hardly affects the performance of the switch (cell loss and delay). The opposite holds for the lower stages, where more cells have to wait for service; this causes larger delay and higher cell loss probability. Thus, adding extra buffers there affects the performance by improving the cell loss probability and increasing the delay. The delay problem can be minimized by applying the cut-through technique.

Figure 23: Effect of BFT Degree on Cell Loss Probability (without Cut-through, load=0.9)
Figure 24: Effect of BFT Degree on Delay under Different Input Loads (without Cut-through)

Figure 25: Cell Loss in Different BFT Stages (64x64, degree=2, stages=6, without Cut-through, load=0.9)

Figure 26: Delay in Different BFT Stages (64x64, degree=2, stages=6, without Cut-through, load=0.9)
Figure 27: Effect of Cut-Through on Cell Loss Probability under Different Input Loads (64x64, degree=2, stages=6)

Figure 28: Effect of Cut-Through on Delay under Different Input Loads (64x64, degree=2, stages=6)

Figure 29: Effect of Buffer Distribution on Cell Loss under Different Input Loads (64x64, degree=2, stages=6, without Cut-through)

Figure 30: Effect of Buffer Distribution on Delay under Different Input Loads (64x64, degree=2, stages=6, without Cut-through)

Chapter 5. Simulation Results

The performance of the BFT has been extensively studied by simulation under uniform and bursty traffic. The bursty source is modelled as an interrupted Bernoulli process [5][36] and can be described by a two-state Markov chain. When the source is in the active state, it transmits a cell with probability P_act, whereas no cells are transmitted when the source is in the inactive state. Assuming that P_10 is the probability that the source moves from the active state to the inactive state and that P_01 is the probability that the source moves from the inactive state to the active state, the two-state Markov chain for the source can be constructed as in Figure 31. Cells which are transmitted from the source during an active period belong to the same burst and are given the same address. The addresses of the different bursts are chosen randomly, so a burst is equally likely to be destined to any of the switch outputs.
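The interrupted Bernoulli source described above is straightforward to simulate. The sketch below is an illustration only, with parameter values chosen arbitrarily: it emits cells from a two-state chain and compares the long-run cell rate with the closed-form rate P_act · P_01 / (P_10 + P_01).

```python
import random

def ibp_rate(p_act, p01, p10, slots, seed=1):
    """Simulate an interrupted Bernoulli process: in the active state a
    cell is emitted with probability p_act; the state then changes with
    probability p10 (active->inactive) or p01 (inactive->active)."""
    rng = random.Random(seed)
    active, cells = False, 0
    for _ in range(slots):
        if active and rng.random() < p_act:
            cells += 1
        # state transition at the end of the slot
        if active:
            active = rng.random() >= p10
        else:
            active = rng.random() < p01
    return cells / slots

rate = ibp_rate(p_act=0.9, p01=0.2, p10=0.1, slots=500_000)
analytic = 0.9 * 0.2 / (0.1 + 0.2)   # closed-form arrival rate = 0.6
```

Over a long run the simulated rate settles near the analytic value, since the source is active a fraction P_01/(P_01 + P_10) of the time and emits at rate P_act while active.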
Figure 31: Markov Chain for a Bursty Source

Solving the above Markov chain gives the steady-state probability of the active state, P_1 = P_01 / (P_10 + P_01). Accordingly, the arrival rate of the bursty traffic is expressed as

A_rate = (P_act × P_01) / (P_10 + P_01)

where A_rate is the departure rate from the source, which equals the arrival rate at any input of the switch.

5.1 Simulation vs Analytical Model

The simulation results show that the values predicted by the analysis slightly underestimate the true loss rate and delay, although both results are almost of the same order of magnitude. The analytical model, as shown in Figures 32 and 33, gives lower cell loss values and shorter delay. Nevertheless, it provides valuable insight into the effects of the main switch parameters on performance. The difference between the delay values obtained from the analytical model and the simulations decreases as the input load decreases, whereas the difference between the cell loss probability values increases as the input load decreases.

5.2 Backpressure vs Dropping Techniques

The effect of backpressure routing and dropping techniques can be studied by applying algorithm (a), which was discussed before. If cells can be dropped between stages, the maximum number of cells allowed to be dropped in each stage during a time slot, called the dropping limit of the switch, affects the performance of the switch under random and bursty traffic. If that limit is set to 0, no cells are dropped inside the switch and the switch operates in a purely backpressure mode. If the limit is set to I_k, where I_k is the number of parallel buffers in a shared memory at stage k, the switch operates as if no buffer existed in its fabric. Values between these two extremes can be adjusted to obtain the desired quality of service from the switch.

Simulation shows that the backpressure technique has better performance under uniform traffic.
For example, the cell loss probability in a 32 x 32 switch of degree 2 and 5 stages operating under an input load of p = 0.8 is 2.7 x 10^-7 when backpressure is used. This value rises to 5.3 x 10^-3 when dropping of blocked cells is allowed between stages inside the switch. This is because the backpressure mechanism is more efficient in utilizing the available buffers of the switch: cells travelling from an input port to an output port are not dropped unless the buffers of all the nodes located along the path between that input-output pair become full. In other words, the blocking probability depends upon the status of all the buffers included in the path stream of that output. This is not the case if backpressure is not used: if cells can be dropped inside the switch fabric, the loss probability depends only on the status of the buffers located in the next stage. Therefore, cells will be dropped even if other buffers belonging to the path stream of that output are not full.

On the other hand, the backpressure mechanism can degrade performance if the switch operates under long bursty traffic. The cell loss probability of the above switch under long bursty traffic rises to 2.3 x 10^-1 and its delay becomes 22 time slots if backpressure is used. If dropping is allowed between stages, the cell loss probability under the same bursty traffic is 1.38 x 10^-1 and the delay is 10.32 time slots. This shows that allowing cells to be dropped between stages gives better results than backpressure under bursty traffic.

The above results can be explained as follows. Under bursty traffic, buffers may become congested by cells that belong to a few connections. Therefore, it is very likely that most of the head-of-line cells in a shared buffer are destined to the same congested children during a time slot. A zero-drop policy may then block the cells of other connections destined to uncongested children.
Moreover, this may propagate the congestion up to the root of the switch, after which cells of all connections are dropped at the inputs even if space is still available in the internal buffers. This decreases the internal throughput of the switch and increases its delay and cell loss probability. On the other hand, dropping cells between stages minimizes the blocking effect, since head-of-line cells destined to congested children are dropped to prevent the congestion from propagating up towards the root and to give cells of other connections a chance to proceed to their destinations. Consequently, the delay and cell loss probability of the switch decrease in this case.

Generally speaking, the effect of bursty traffic on the performance of the switch depends upon the average length of the burst relative to the number of stages in the switch. If the average burst length relative to the number of stages is large, the burst has a wider impact and the performance of the switch degrades. The burst in this case is able to congest nodes of the upper stages, close to the inputs. Once these are congested, more incoming connections are blocked, because the number of connections passing through a shared buffer increases the closer the buffer is to the inputs. If the burst length relative to the number of stages is small, the impact of the burst is limited to the few connections that need to pass through any of the congested nodes.

The above problem can be minimized by decreasing the level of sharing near the inputs of the switch. The resources of upper nodes can be divided among different sets of input ports. This minimizes the number of blocked connections when congestion takes place. However, it requires more resources and decreases the effective utilization of the switch resources.
Adopting either alternative depends upon the characteristics required of the switch, as there is always a trade-off between buffer utilization and the traffic blocking caused by other connections.

5.3 Behavior of Buffers in the Switch

The natural behavior of the buffers in the various stages of the switch was studied by simulating the switch under the assumption that these buffers have unlimited capacity. The switch was simulated under uniform and bursty traffic. A parameter is defined to characterize the utilization of the different buffers in the switch: it compares the mean number of cells in a buffer to the occupancy level that the buffer exceeded for more than 0.05 of its total operating time. The results of this section are based on a 64 x 64 switch operating under traffic with an arrival rate of 0.72.

Figures 34, 35 and 36 show the behavior of the internal buffers of the switch when simulated under uniform traffic; Figures 37, 38 and 39 give the results of simulating the same switch under bursty traffic. Figures 34 and 37 report the maximum number of cells reached in the buffer versus the average number of cells for an individual node; Figures 35 and 38 report the same results for a whole stage. The figures show that the above-defined utilization measure reaches 90% for shared buffers of early stages while it drops to 35% for buffers of late stages. This indicates that the type of traffic has little impact on buffers close to the inputs, since they are fully utilized most of the time. The occupancy of buffers close to the outputs, however, varies greatly with the burstiness of the traffic. As a result, stages close to the outputs need more buffers than those close to the inputs, which are hardly affected by the nature and burstiness of the traffic. Figures 36 and 39 plot the delay in each stage under uniform and bursty traffic, respectively.
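The utilization measure used in this section can be computed from a per-slot occupancy trace. The exact definition in the text is loose, so the sketch below fixes one plausible reading as an assumption: the ratio of the mean occupancy to the level the buffer exceeded for more than 5% of its operating time.

```python
def utilization_ratio(occupancy, frac=0.05):
    """Mean occupancy divided by the occupancy level exceeded during
    more than `frac` of the trace (here, roughly the 95th percentile)."""
    s = sorted(occupancy)
    mean = sum(s) / len(s)
    high = s[int((1.0 - frac) * len(s)) - 1]  # level exceeded ~5% of slots
    return mean / high

# A steadily loaded buffer scores near 1; a bursty one scores low.
steady = [9, 10, 11, 10] * 250          # hovers around 10 cells
bursty = [0] * 900 + [40] * 100         # empty most slots, then a burst
```

For the steady trace the ratio is about 0.91, and for the bursty one 0.10, mirroring the ~90% figure reported for early stages versus the ~35% figure near the outputs.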
As with the analytical model, the delay increases for stages close to the outputs, and this increase is much higher under bursty traffic.

In summary, the traffic type has little impact on stages close to the inputs while it has a significant impact on those near the outputs. As was the case with the analytical model, stages close to the outputs need large buffers to achieve a desired performance.

Figure 32: Simulation vs Analytical Model for Cell Loss Probability (64x64, degree=2, stages=6, without Cut-through)

Figure 33: Simulation vs Analytical Model for Delay (64x64, degree=2, stages=6, without Cut-through)

Figure 34: Difference between the Average and the Maximum Reached Buffer Length per Node (64x64, degree=2, stages=6, uniform traffic of load 0.72)

Figure 35: Difference between the Average and the Maximum Reached Buffer Length per Stage (64x64, degree=2, stages=6, uniform traffic of load 0.72)

Figure 36: Delay in Different Stages of the BFT (64x64, degree=2, stages=6, uniform traffic of load 0.72)

Figure 37: Difference between the Average and the Maximum Buffer Length Reached per Node (64x64, degree=2, stages=6, bursty traffic of 0.72 average arrival rate)
Figure 38: Difference between the Average and the Maximum Buffer Length Reached per Stage (64x64, degree=2, stages=6, bursty traffic of 0.72 average arrival rate)

Figure 39: Delay in Different Stages of the BFT (64x64, degree=2, stages=6, bursty traffic of 0.72 average arrival rate)

Chapter 6. Conclusion

This thesis has laid the foundation for developing an efficient ATM switch architecture using the Buffered Fat Tree topology. The design and the operation of the BFT were described in detail to demonstrate its merits. The performance and behavior of the switch were extensively studied by means of the analytical model developed for the BFT and by computer simulations. The impact of several design parameters on the performance of the switch was considered under uniform and bursty traffic. These parameters include the degree of the switch, its size, the capacity of its buffers, the distribution of buffers among its stages, and the routing techniques used within its fabric. These parameters can be used to dimension the switch to achieve the desired performance. A trade-off between different elements of the performance, such as delay and cell loss, can arise when any of the design parameters is considered.

Future work in this area should focus on moving the BFT from concept to physical realization. The first step towards achieving such a goal is to investigate effective partitioning techniques that will enhance the modularity of the BFT architecture and enable its implementation across multiple chips and multiple boards. The partitioning technique should minimize the interconnection overhead, especially among multiple boards, since interconnect complexity can dominate the performance bottlenecks in large-scale switches.
Partitioning may require modifying the BFT model proposed here. For example, buffers with a large number of inputs and outputs (in the early stages of the BFT) may have to be divided among several chips, and a more modular realization of such nodes may be necessary at that point.

Another step towards physical realization of the BFT is to devise a practical scheme for multicasting within the ATM switch. The core of the BFT has substantial capacity and buffering resources to support multicasting. However, the scheduling of multicast traffic is normally done in the switch port-processors (located in the line cards). Such port processors should be aware of, and capable of utilizing, the BFT fabric for efficient multicasting.

Bibliography

1. A. Huang and S. Knauer, "STARLITE: a Wideband Digital Switch", Proceedings of GlobeCom, pp. 121-125, November 1984.
2. A. Itoh, "A Fault Tolerant Switching Network for B-ISDN," IEEE Journal on Selected Areas in Communications, Vol. 9, No. 8, pp. 1218-1226, October 1991.
3. A. Monterosso and A. Pattavina, "Performance Analysis of Multistage Interconnection Networks with Shared-Buffered Switching Elements for ATM Switching", INFOCOM'92, pp. 124-131, 1992.
4. A. Pattavina, "Nonblocking Architectures for ATM Switches", IEEE Communications Magazine, Vol. 31, No. 2, pp. 38-48, February 1993.
5. J. Bae and T. Suda, "Schemes and Protocols in ATM Networks", Proceedings of the IEEE, Vol. 79, No. 2, pp. 170-189, February 1991.
6. C. E. Leiserson, "Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing", IEEE Transactions on Computers, Vol. C-34, No. 10, pp. 892-901, October 1985.
7. C. Weng and C. Hwang, "Distributed Double-Phase Switch", IEEE Proceedings-I, Vol. 138, No. 5, October 1991.
8. D. Seidel, A. Raju and M. A. Bayoumi, "A New ATM Switch Architecture: Scalable Shared Buffer", ICECS'96, pp. 772-775, 1996.
9. F. A.
