COST-PERFORMANCE TRADE-OFFS IN LARGE SCALE BANYAN-BASED ATM SWITCHES

by

ALIMUDDIN MOHAMMAD

B.E., Visvesvaraya College of Engineering, Bangalore, India, 1987
M.S., University of Petroleum and Minerals, Dhahran, Saudi Arabia, 1991

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES, DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
March, 1998
© Alimuddin Mohammad, 1998

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Abstract

Much research effort has been directed into the design and performance analysis of ATM switches to date. However, less work has been done on efficiently utilizing the switch resources (e.g. buffers and links) to achieve the required performance. Current techniques for designing and evaluating ATM switches focus mainly on achieving a specified global cell loss probability within a given maximum cell delay. While useful and necessary to ensure that the switch can provide an acceptable QoS, this approach cannot effectively measure other important qualities of an ATM switch, such as resource utilization and implementation complexity. Further, as use of the Internet increases and new multimedia applications emerge, there is a need for substantial improvement in the capacity of Internet backbone links. The need for ever-increasing packet delivery bandwidth makes it inevitable that switches will have to scale to handle larger aggregate bandwidths (quite possibly into the terabit-per-second range) and, more importantly, to maintain QoS delivery.

This thesis proposes a unifying framework for the design and analysis of large scale ATM switches based on enhanced banyan topologies. The banyan architecture has been chosen as the target architecture for developing the framework, since it is a log N-depth network and is optimal in its use of switching elements. However, the standard banyan network suffers from internal blocking. To overcome blocking in the banyan network we have proposed the fat-banyan (FAB) architecture, which employs a gradual increase in dilation from the input stage to the output stage of the network. Gradually incrementing the dilation has a dramatic impact on the implementation and architectural scalability of the FAB switch. Further, the FAB switch is highly optimized in its utilization of internal bandwidth through appropriate setting of the dilation parameter per stage. The FAB switch can scale to the terabit-per-second range with a reasonable increase in chip count. Also, there is minimal interaction between the different streams (VCs/VPs) traversing the switch, thus minimizing any effect on the QoS of a stream from other streams.
The performance of the FAB is analyzed by analytical methods and by extensive computer simulations. It should be noted that the technique of optimizing internal bandwidth can be applied in general to other ATM switching fabrics, such as the Benes network and the fat-tree.

An important resource of an ATM switch is its buffers. The proliferation of best-effort services implies that ATM switches must contain large buffers to effectively manage congestion. Shared-memory switching uses a smaller amount of buffering to achieve the same loss probability as dedicated buffering. However, shared-memory architectures do not scale, owing to limitations of memory cycle time. A novel approach to scaling shared-memory switches is proposed using small-depth (truncated) self-routing fabrics. This approach provides an efficient realization of the growable switch concept proposed by other researchers.

Three buffering schemes have been considered for the FAB switch: output buffering, shared output buffering, and combined input and output buffering (dedicated as well as shared). The performance under these buffering schemes has been analyzed by extensive simulation. Pure output buffering is considered as it has been shown to provide optimal delay and throughput performance. Shared output buffering has been combined with the memoryless truncated FAB network to design highly scalable shared-memory switch architectures. Input-output buffering with backpressure is shown to provide a cost-effective way of designing a lossless switching fabric.

This thesis also examines multicasting on the FAB switch. We have proposed a scheme to efficiently handle the replication of multicast packets in the FAB switch. It is shown that this technique can achieve very low cell loss even at very high offered loads, and can also handle the case where the number of packet copies exceeds the switch size. The various aspects of design and performance optimization developed in this thesis can serve as a general approach for designing and evaluating terabit-per-second ATM switches.

Table of Contents

Abstract
List of Figures
List of Tables
Acknowledgment
Chapter 1  Introduction
  1.1  ATM Switch Functionality
  1.2  Motivation and Scope of Thesis
  1.3  Thesis Contributions
  1.4  Organization of the Thesis
Chapter 2  ATM Switch Architectures
  2.1  Introduction
  2.2  Types of ATM Switches
    2.2.1  Shared-Memory Switches
    2.2.2  Shared-Medium Switches
    2.2.3  Space-Division Switches
  2.3  Banyan-Based Switch Architectures
  2.4  Benes-Based Networks
  2.5  Growable Switch
  2.6  Drawbacks and Improvements Required
  2.7  Conclusion
Chapter 3  Bandwidth Optimization in Banyan Switches
  3.1  Introduction
  3.2  The Fat-Banyan Switching Fabric
  3.3  Fat-Banyan Architectural Features
  3.4  Chip Count Estimate for a Large Scale FAB Switch Implementation
  3.5  Comparison of FAB Chip Count with Other Switch Fabrics
  3.6  Conclusion
Chapter 4  Performance of the Fat-Banyan Switch
  4.1  Introduction
  4.2  Performance Analysis under Independent Uniform Traffic
  4.3  Cell Loss and Stage Dilation
  4.4  Results
    4.4.1  Uniform Traffic
    4.4.2  Non-Uniform Traffic
  4.5  Performance of the Output Buffered Fat-Banyan
    4.5.1  Uniform Traffic
    4.5.2  Bursty Traffic
  4.6  The Input-Output Queued Fat-Banyan
    4.6.1  Backpressure Mode
    4.6.2  Uniform Traffic
    4.6.3  Bursty Traffic
  4.7  Conclusions
Chapter 5  Design of Scalable Shared Memory Switches
  5.1  Introduction
  5.2  Considerations in Designing ATM Switches
  5.3  Truncated Switch Architectures
    5.3.1  Generalized Truncated Switch Model
    5.3.2  Banyan Realization
  5.4  Performance Analysis of the Truncated Fat-Banyan
    5.4.1  Uniform Traffic
    5.4.2  Bursty Traffic
    5.4.3  Unbalanced Bursty Traffic
  5.5  Truncated Fat-Banyan with Input Queues
  5.6  Conclusions
Chapter 6  Multicasting
  6.1  Introduction
  6.2  Multicasting in the FAB Switch
  6.3  Multicast Performance
  6.4  Conclusion
Chapter 7  Conclusions and Future Research Areas
  7.1  Conclusions
  7.2  Future Research Areas
Appendix A  List of Standard Acronyms
Bibliography

List of Figures

Figure 1.1  ATM cell and header formats at the UNI and NNI
Figure 1.2  General structure of an ATM switch
Figure 2.1  Shared-memory switch
Figure 2.2  Shared-medium switch
Figure 2.3  Banyan-based ATM switch architectures
Figure 2.4  Benes network of size N=16
Figure 2.5  The growable switch model
Figure 3.1  Two 16x16 banyans: (a) with 2x2 elements and (b) with 4x4 elements
Figure 3.2  (a) 2x2 buffered SE; (b) 2x2 fat SE with output concentrator; (c) 2x2 fully dilated fat SE (no output concentrator required)
Figure 3.3  8x8 output-buffered FAB switch with dilation configuration DC=[2,3,4]
Figure 3.4  2-dilated 8x8 banyan network
Figure 3.5  Generic package of BFSEs
Figure 3.6  Constructing 2x2 FSEs with Lin=2 and Lout=4 from BFSEs (Lin: input link dilation, Lout: output link dilation)
Figure 3.7  Constructing 2x2 FSEs with Lin=4 and Lout=8 from BFSEs (Lin: input link dilation, Lout: output link dilation)
Figure 3.8  Isomorphic representation of a banyan network
Figure 3.9  Board partitioning for an 8x8 FAB switch
Figure 3.10  Partitioning a BFSE among planes
Figure 3.11  BFSEs of varying sizes
Figure 3.12  The Knockout switch
Figure 3.13  Building an N:L concentrator from smaller p:L concentrators
Figure 4.1  Fat switching element (FSE)
Figure 4.2  Cell loss probability as a function of dilation for an FSE S(8,2,L) for different input loads under uniform traffic (by analysis)
Figure 4.3  Maximum throughput for a fat-banyan switch with DC=[L1=2; Li=Li-1+1, 2<=i<=log N] and a simple banyan network under the independent uniform traffic pattern with input load p=1.0 (by analysis)
Figure 4.4  Maximum throughput for a fat-banyan switch with DC=[L1=2; Li=Li-1+1, 2<=i<=log N] under the independent uniform traffic pattern (expanded scale) with input load p=1 (by analysis)
Figure 4.5  Maximum throughput of a fat-banyan switch with different dilation configurations under independent uniform traffic with input load p=1 (by analysis)
Figure 4.6  Concentrator action of a section of the FAB switch
Figure 4.7  Maximum throughput (by simulation) as a function of input load for an 8x8 FAB switch with two different dilation configurations under independent uniform traffic
Figure 4.8  Cell loss probability (by simulation) as a function of input load for an 8x8 FAB switch with two different dilation configurations under independent uniform traffic
Figure 4.9  Cell loss probability as a function of output dilation for a 16x16 FAB switch for two different stage dilation configurations under independent uniform traffic with input load p=0.9 (by simulation)
Figure 4.10  An example of a maximum conflict pattern in a 16x16 banyan network
Figure 4.11  Maximum throughput (by simulation) for an 8x8 FAB switch under mixed SSSD (point-to-point) traffic and a background uniform traffic load of 1.0 for two different dilation configurations
Figure 4.12  Cell loss probability results (by simulation) for an 8x8 FAB switch under mixed SSSD (point-to-point) traffic and a background uniform traffic load of 1.0 for two different dilation configurations
Figure 4.13  Routing of a maximum conflict pattern in an 8x8 FAB switch with dilation configuration DC=[2,3,4]
Figure 4.14  Cell loss probability for a 64x64 FAB switch with DC=[2,4,8,8,8,8] as a function of output buffer size under independent uniform traffic with input load p=0.9
Figure 4.15  Cell loss probability (concentrator loss + buffer loss) for three different dilation configurations of a 16x16 FAB switch as a function of output buffer size under independent uniform traffic with input load p=0.9 (by simulation)
Figure 4.16  Cell loss probability (concentrator loss + buffer loss) for two different dilation configurations of a 64x64 FAB switch as a function of output buffer size under bursty traffic with p=0.7 and average burst length=15 (by simulation)
Figure 4.17  Cell loss probability (concentrator loss + buffer loss) for two different dilation configurations of a 64x64 FAB switch as a function of output buffer size under bursty traffic with p=0.9 and average burst length=15 (by simulation)
Figure 4.18  Concept of LC backpressure in an input-output queued fat-banyan switch with N=8 and last-stage FSE dilation of 4
Figure 4.19  Cell loss probability for FAB switches of size 32 and 64 as a function of input buffer size per output port under random uniform traffic with offered load p=0.7 and output buffer size of 32 cells (by simulation); for N=32, DC=[2,4,4,4,4]; for N=64, DC=[2,4,4,4,4,4]
Figure 4.20  Cell loss probability for FAB switches of size 32 and 64 as a function of input buffer size per output port under random uniform traffic with offered load p=0.9 and output buffer size of 32 cells (by simulation); for N=32, DC=[2,4,4,4,4];
for N=64, DC=[2,4,4,4,4,4]
Figure 4.21  Cell loss probability for FAB switches of size 32 and 64 as a function of input buffer size per input port under bursty uniform traffic with average burst length=15, offered load p=0.7 and Bout=256 (by simulation); for N=32, DC=[2,4,4,4,4]; for N=64, DC=[2,4,4,4,4,4]
Figure 4.22  Average delay versus offered load (p) for a FAB switch of size 32 under bursty uniform traffic with average burst length=15; DC=[2,4,4,4,4,4], Bout=256 (by simulation)
Figure 5.1  The general truncated switch model
Figure 5.2  The growable switch model
Figure 5.3  Truncated fat-banyan (TFAB(8,2)) switch with output shared-memory modules and DC=[2,3]
Figure 5.4  Cell loss probability as a function of output buffer size for a 16x16 TFAB switch for different dilation configurations under independent uniform traffic with input load p=0.9 (by simulation)
Figure 5.5  Cell loss probability (buffer loss) for two 16x16 TFAB switches with no concentrator loss as a function of buffer size per output port under bursty traffic with average burst length=15 and offered load p=0.5 (by simulation)
Figure 5.6  Cell loss probability (buffer loss) for two 16x16 TFAB switches with no concentrator loss as a function of buffer size per output port under bursty traffic with average burst length=15 and offered load p=0.7 (by simulation)
Figure 5.7  Cell loss probability (buffer loss) for two 16x16 TFAB switches with no concentrator loss as a function of buffer size per output port under bursty traffic with average burst length=15 and offered load p=0.9 (by simulation)
Figure 5.8  Cell loss probability (buffer loss) for a 16x16 TFAB switch for two output group sizes as a function of buffer size per output port under unbalanced bursty traffic with offered load p=0.71 and average burst length=15 (by simulation)
Figure 5.9  Concept of LC backpressure in an input-output queued TFAB switch
Figure 5.10  Cell loss probability for a 64x64 TFAB switch with input queueing and LC backpressure under bursty uniform traffic with average burst length=15 and offered load p=0.7 (by simulation); Bout=256
Figure 6.1  A multicast packet switch consisting of a copy network and a point-to-point switch
Figure 6.2  Balanced tree approach to copy number division per stage
Figure 6.3  Output interval assignment approach with examination of two bits at each stage
Figure 6.4  Simulation results for a 128x128 copy network: cell loss probability versus output offered load for different dilation configurations with mean fanout=2
Figure 6.5  Simulation results for a 128x128 copy network: cell loss probability versus output offered load under overload conditions for different dilation configurations with mean fanout=2

List of Tables

Table 1.1  Types of ATM services
Table 1.2  Terabit experimental switch implementations
Table 3.1  Comparison of interconnections required for FAB and dilated banyan networks
Table 3.2  Comparison of FAB and dilated banyan networks built with 2-dilated 2x2 FSEs
Table 3.3  Chip count for various FAB switch sizes with an external link rate of 622 Mb/s
Table 3.4  Chip count for various FAB switch sizes with an external link rate of 2.5 Gb/s
Table 3.5  Chip count for various FAB switch sizes with an external link rate of 10 Gb/s
Table 3.6  Chip count for the Knockout switch (as reported in ...)
Table 3.7  Chip count for a banyan switch
Table 3.8  Chip count for the tandem banyan switch
Table 3.9  Chip count for the dilated banyan
Table 3.10  Chip count for the FAB switch with C=128 and R=1.0 Gb/s
Table 3.11  Chip count for the FAB switch with C=240 and R=2.5 Gb/s
Table 4.1  Maximum throughput and cell loss probability results (by simulation) for an 8x8 FAB switch with dilation configuration DC=[2,3,4] under hot-spot traffic
Table 4.2  Maximum throughput and cell loss probability results (by simulation) for an 8x8 FAB switch with dilation configuration DC=[2,4,6] under hot-spot traffic

Acknowledgment

I would like to express sincere gratitude to Dr. H. M. Alnuweiri for his constant guidance and advice in supervising my Ph.D. research, and to Dr. R. W. Donaldson for his interest in and support of this project. This research was partially supported by a grant from the Canadian Institute for Telecommunications Research under the NCE program of the Canadian Government, and by a Canadian Natural Sciences and Engineering Research Council grant. Finally, I would like to thank my beloved wife, Gazala, for her support during my Ph.D. study.

Chapter 1

Introduction

Asynchronous transfer mode (ATM) has been adopted by the CCITT as the standard for the Broadband Integrated Services Digital Network (B-ISDN). ATM is a fast packet switching, connection-oriented transfer-mode technology based on statistical time-division multiplexing. Emerging ATM/B-ISDN networks will support four main classes of communication services, shown in Table 1.1, using different protocols for the ATM adaptation layer (AAL).

Table 1.1.  Types of ATM services
  Class A, constant bit rate (CBR): connection-oriented; constant bit rate; example: H.261 video; AAL 1.
  Class B, variable bit rate (VBR): connection-oriented; variable bit rate; example: VBR voice and video; AAL 2.
  Class C, connection-oriented data traffic: connection-oriented; variable bit rate; example: fax; AAL 3 or 5.
  Class D, connectionless data traffic: connectionless; variable bit rate; example: electronic mail; AAL 4 or 5.

ATM uses fixed-length packets called cells, consisting of a user information field (48 bytes) and a header (5 bytes). The 48 bytes are used by the AAL and the user payload. The ATM cell is configured slightly differently for the UNI (User-Network Interface) and the NNI (Network-Network Interface), as shown in Figure 1.1. At the UNI a GFC (Generic Flow Control) field is defined, since various OAM (Operation and Maintenance) functions operate at the UNI only.

[Figure 1.1.  ATM cell and header formats at the UNI and NNI.]

The cell header mainly consists of the virtual circuit labels, VPI and VCI. At the UNI, 8 bits are assigned to the VPI and 16 bits to the VCI; at the NNI, 12 bits are assigned to the VPI and 16 bits to the VCI. The PT (payload type) field identifies the type of traffic residing in the cell: user traffic or management/control traffic. The C (cell loss priority) field is a 1-bit value; if C is set to 1, the cell may be discarded by the network during congestion, while a C bit of 0 indicates a higher-priority cell that is treated as such during congestion. The HEC (header error control) field is an error check field, which can also correct single-bit errors. It is calculated over the first four octets of the ATM header, not over the 48-octet user payload.
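To make the header layout concrete, the following sketch unpacks the five UNI header octets into the fields described above and verifies the HEC. It is an illustration written for this text rather than anything from the thesis itself; the CRC-8 generator x^8 + x^2 + x + 1 and the 0x55 coset are the standard ITU-T I.432 choices.

```python
def crc8_atm(data: bytes) -> int:
    """CRC-8 with generator x^8 + x^2 + x + 1 (0x07), as used for the ATM HEC."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ 0x07) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

def parse_uni_header(header: bytes) -> dict:
    """Split a 5-octet UNI cell header into GFC/VPI/VCI/PT/CLP and check the HEC."""
    gfc = header[0] >> 4                                   # 4 bits (UNI only)
    vpi = ((header[0] & 0x0F) << 4) | (header[1] >> 4)     # 8 bits at the UNI
    vci = ((header[1] & 0x0F) << 12) | (header[2] << 4) | (header[3] >> 4)
    pt  = (header[3] >> 1) & 0x07                          # payload type, 3 bits
    clp = header[3] & 0x01                                 # cell loss priority
    hec_ok = header[4] == (crc8_atm(header[:4]) ^ 0x55)    # HEC covers octets 1-4
    return {"GFC": gfc, "VPI": vpi, "VCI": vci, "PT": pt, "CLP": clp, "HEC_ok": hec_ok}

# Example: a cell on VPI=5, VCI=32 with a freshly computed HEC.
first4 = bytes([0x00, 0x50, 0x02, 0x00])
print(parse_uni_header(first4 + bytes([crc8_atm(first4) ^ 0x55])))
```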
The key component of an ATM network is the ATM switching node, which performs the fast packet switching function. ATM is designed to support serial link rates from a few kilobits per second to tens of gigabits per second (T1/E1, T3, OC-3, OC-12, OC-48, OC-192) [1, 2, 3]. The major challenge in designing ATM switches is to be able to handle aggregate bit rates greater than 10 Gb/s and into the terabit-per-second range. Building efficient switch architectures, and digital electronics fast enough to handle such bit rates, is a major challenge. Another problem that any ATM switching system must solve is that of buffering, i.e., storing incoming ATM cells that cannot be delivered upon arrival until the resources blocking their delivery (e.g. internal bandwidth of the switch, queue space at the input/output) are freed. This thesis addresses the architectural issues of designing and evaluating large scale switching systems.

1.1. ATM Switch Functionality

The general structure of an ATM switch is shown in Figure 1.2. The ATM switch receives a statistically multiplexed cell stream on its input links. Each cell in the stream is identified as belonging to a virtual connection by its VPI/VCI value. The input controller processes the header and may typically perform some or all of the following functions: identify the type of cell (signalling, OAM, data, etc.), add a routing tag for routing through the switch fabric to the appropriate output, add information from the connection table such as connection priority, perform a VPI/VCI translation, and control input buffers if they are provided. The control processor deals with functions like connection establishment, connection release, and bandwidth allocation and management. The output controller performs cell processing such as removing the information (routing tag, connection priority, etc.) added at the input controllers, and output buffer management.

[Figure 1.2.  General structure of an ATM switch.]
In this thesis we are mainly concerned with the ATM switching fabric and the buffering strategy employed in the switch. An ideal ATM switch fabric should have the following three characteristics:

1. Capability to route all packets from its input lines to the requested output lines without loss.
2. Capability to deliver the packets to the requested output lines with minimum transit delay.
3. Capability to preserve at the output the same order in which the packets arrived at the switch inputs.

Thus, the functionality required of an ATM switch fabric is quite simple, but the challenge, as mentioned earlier, is to achieve these characteristics at high speeds and to satisfy the Quality of Service (QoS) requirements of both the real-time and the non-real-time traffic that traverse these switches in a B-ISDN environment. The details of the various switch architectures that have been proposed in the literature are discussed in chapter 2.
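As a concrete illustration of the input controller functions listed above, the sketch below performs a connection-table lookup, translates the VPI/VCI, and prepends a routing tag. The table layout and field names are assumptions made for this example only; they are not the design developed in this thesis.

```python
# Hypothetical connection table, keyed by (input port, VPI, VCI).
CONN_TABLE = {
    (0, 5, 32): {"out_port": 3, "new_vpi": 7, "new_vci": 99, "priority": 1},
}

def input_controller(port: int, cell: dict) -> dict:
    """Per-cell input processing: identify the connection, translate the
    VPI/VCI label, and attach a routing tag for the self-routing fabric."""
    entry = CONN_TABLE[(port, cell["VPI"], cell["VCI"])]
    cell["VPI"], cell["VCI"] = entry["new_vpi"], entry["new_vci"]
    # The routing tag (here simply the output port number) is consumed by the
    # fabric and stripped again by the output controller.
    return {"tag": entry["out_port"], "priority": entry["priority"], "cell": cell}
```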
1.2. Motivation and Scope of Thesis

Much research effort has been directed into the design and performance analysis of ATM switches to date [4, 5, 6, 7]. However, less work has been done on efficiently utilizing the switch resources (e.g. buffers and links) to achieve the required performance. The current techniques for designing and evaluating ATM switches focus mainly on achieving a specified global cell loss probability within a given maximum cell delay. While useful and necessary to ensure that the switch can provide an acceptable QoS, this approach cannot effectively measure other important qualities of an ATM switch, such as resource utilization and implementation complexity. In our work an important design goal is the optimization of the internal links and buffers of the switch. We show that the optimization of links has a profound impact on the interconnect complexity of the switch. Since ATM switches generally employ large buffers, especially since the introduction of available bit rate (ABR) services, the optimization of buffers is important in order to reduce the memory cost component of the switch.

As use of the Internet increases and new multimedia applications emerge, there is a need for substantial improvement in the capacity of Internet backbone links. Internet growth in the last several years has been phenomenal. Personal computers are approaching the performance of workstations, and it is only a matter of time before 155 Mb/s to the desktop is commonplace in the professional environment. All of these factors will drive the deployment of ATM, but it is not known when the explosion will occur. Additionally, the ATM switch size sufficient to meet end users' demands can only be speculated upon. Will 32, 64 or 128 ports at 155 Mb/s be the ideal size for a LAN-based switch? What about WAN switches? Present state-of-the-art commercial switching equipment is capable of operating at 2.5 Gb/s to 10 Gb/s, with 50 Gb/s to 100 Gb/s equipment beginning to become available. Table 1.2 lists a few experimental switch implementations which reportedly can handle multi-Gb/s links. However, the need for ever-increasing packet delivery bandwidth makes it inevitable that switches will have to scale to handle larger aggregate bandwidth (say in the terabit-per-second range) and, more importantly, to maintain QoS delivery.

Table 1.2.  Terabit experimental switch implementations (name and topology; link rate; aggregate rate; stage of development; technology)
  Thunder and Lightning (crossbar): 10-80 Gb/s; 160-640 Gb/s; 4-port prototype; ECL.
  WUGS, Washington University Gigabit Switch (buffered Benes network): 2.5 Gb/s; 80 Gb/s; 64-port prototype; CMOS with 1.2 Gb/s G-links (HP).
  Tiny Tera (crossbar): 10 Gb/s; 320 Gb/s; 32-port prototype; 0.25-micron CMOS with 2 Gb/s serial links (TI).
  Rerouting banyan (overlapped banyan): 10 Gb/s; 1000 Gb/s; 128 ports; GaAs MESFET with 10 Gb/s serial links.
  iPoint/3D, Illinois Pulsar-based Optical INTerconnect switch (crossbar): 0.155-2.5 Gb/s; 10-40 Gb/s; 4 ports; FPGA/ASIC.
  Growable switch (Clos network): 0.155-1.0 Gb/s; 25 Gb/s; 32 ports; CMOS.

In this thesis an important design consideration is the modular growth of the switch, so that it can handle such high bandwidth with modest cost increase and without adversely impacting the QoS provided to the flows through the switch. In the following we consider a few important factors that influence the design and performance of ATM switches.

Routing: The routing algorithm used in the switch fabric greatly influences the speed of switching. Since a full crossbar implementation is not feasible for modular growth, a multistage interconnection network seems to be the most feasible route to modular growth of the switch. Routing schemes within the switch fabric can be classified as centralized, partially distributed, and self-routing. Centralized routing schemes do not scale to large port counts or high link speeds. Partially distributed routing schemes are faster than centralized schemes but may still not scale well to large port counts or high link speeds. Self-routing schemes use truly distributed routing algorithms and are the ideal choice for routing within the switch fabric. In this thesis, only self-routing switching fabrics are considered, although they are sometimes combined with shared-memory switching.

Flow Segregation and QoS: Another important issue in the design of an ATM switch fabric is the interaction of different traffic streams within the fabric. This phenomenon is called QoS coupling: interference with a connection's cell loss, cell delay and cell delay variation by other connections traversing the switch. In the worst case, a few highly bursty connections may severely affect the QoS of well-behaved connections. This problem is mainly due to the introduction of ABR/UBR services, which are not subject to connection admission control (CAC) and have the potential to cause QoS coupling within the fabric. An important design criterion is hence the minimization of QoS coupling within the fabric. Ideally, a fabric with no QoS coupling would require a crossbar implementation with no buffers in the fabric. In this thesis we effectively eliminate QoS coupling in the switching fabric: in our switch models, different flows do not impact each other's cell delay or cell delay variation. It should be noted, however, that QoS coupling management still has to be done at the periphery of the switch (at the input or output buffers). This can be handled by efficient buffer management [8, 9] and by cell scheduling techniques that employ the Fair Queueing algorithm or one of its variations [10].
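The cell scheduling idea referenced above takes a particularly simple form for ATM, because all cells have the same length: in a continuously backlogged weighted fair queueing system, the k-th cell of flow i carries finish tag k/w_i, and cells are served in increasing tag order. The sketch below illustrates only this special case, not the general algorithm of [10].

```python
import heapq

def fair_order(backlog: dict, weight: dict) -> list:
    """Service order for continuously backlogged flows of fixed-size cells
    under weighted fair queueing: tag the k-th cell of flow i with k/w_i
    and serve cells in ascending tag order."""
    heap = []
    for flow, ncells in backlog.items():
        for k in range(1, ncells + 1):
            heapq.heappush(heap, (k / weight[flow], flow))
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

# Weights 2:1 -> flow "a" receives two slots out of every three.
print(fair_order({"a": 4, "b": 2}, {"a": 2, "b": 1}))   # a a b a a b
```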
Multicasting: Applications like teleconferencing and videoconferencing require the setting up of multicast virtual circuits. Implementing efficient multicast channels through an ATM switch is a major challenge. There are generally two approaches to multicasting in an ATM switch. In the first, the ATM switch is preceded by a separate copy network which creates the required number of copies of a multicast cell; these copies are then input to the "unicast" ATM switch, which routes them to the appropriate destinations. The second approach is to incorporate the multicasting function in the switch core itself. In this thesis we propose an efficient way of implementing multicasting on our proposed switch architectures.

1.3. Thesis Contributions

Broadly speaking, this thesis proposes several techniques for designing large-scale ATM switches (operating with Tb/s aggregate capacity) that also employ high-speed links (e.g. OC-48 = 2.5 Gb/s or OC-192 = 10 Gb/s). In the following we elaborate on several specific contributions of the thesis:

1. This thesis proposes several switch architectures which provide high utilization of the two main resources of the switch, namely the interconnect and the buffers. Interconnect and wiring complexity becomes a major bottleneck in large-scale implementations, where the switch fabric spans several boards and shelves; minimizing interconnect complexity is therefore an important design issue. Memory is a significant component affecting both cost and internal delays. Switches with high-speed links (OC-48) may require very large cell buffers in the fabric to guarantee the QoS of cell streams passing through it. Hence, optimization of memory (cell buffers) in ATM switches is another important design goal.

2. We analyze the performance of the proposed fabrics with theoretical methods and computer simulations under independent and bursty traffic conditions, and under general traffic patterns such as non-uniform and hot-spot. The effects of the various parameters that influence cell loss, such as internal link dilation, depth of the fabric, and shared memory size, are considered in the performance analysis.

3. QoS (Quality of Service) coupling is an important measure of the fairness of a switching fabric. QoS coupling occurs when cells from different streams (VCs/VPs) affect each other's cell loss and cell delay. The architectures proposed in this thesis do not use internal buffers, thus eliminating QoS coupling in the form of delay effects between different streams. Instead, they strategically employ internal link dilation to reduce cell loss.

4. We generalize our switch design strategies to provide an efficient method for developing scalable shared-memory switches, based on the use of a truly self-routing memoryless interconnect. By themselves, shared-memory switches do not scale to large numbers of ports or high port speeds, since memory access time becomes a bottleneck. We also study the impact of different buffering strategies on the proposed architectures: dedicated output buffering, shared output buffering, and a combination of input and output buffering (both dedicated and shared). We further introduce the notion of a link-capacity backpressure mode of operation for the case of input-output buffering. This method eliminates cell loss in the memoryless interconnection part of the switch.
5. Since multicast traffic will be a significant portion of the overall traffic, we consider the performance of one of the proposed switch architectures as a copy network and analyze its performance under multicast traffic, both uniform and bursty, by simulation methods. The proposed copy network achieves very low cell loss even at very high output offered loads.

6. We propose a very effective modular implementation of the proposed switch architectures and analyze the implementation complexity, both in terms of the interconnection complexity and the chip count required. We show that the proposed architectures can be scaled to large dimensions with a reasonable increase in interconnection complexity and chip cost per port.

1.4. Organization of the Thesis

Chapter 2 presents a comprehensive survey of existing switch architectures. It takes an in-depth look at existing ATM switch fabrics and highlights their drawbacks and the improvements required for large scale switching. Chapter 3 examines the architecture of the FAB in terms of interconnection complexity and partitioning for large scale implementations; it also addresses the modular growth and scalability of the FAB switch. Chapter 4 introduces the output-buffered fat-banyan. Its performance under uniform traffic conditions is analyzed, and simulation results are presented for general traffic patterns such as non-uniform and hot-spot. The performance of the output-buffered fat-banyan is then further enhanced by using small input buffers and a backpressure mechanism; simulation results under independent uniform and bursty traffic conditions are presented.
Finally, we discuss the important concept of Growable switches, initially proposed by Karol et al [11].  2.2.  Types of ATM Switches A T M switch architectures can be broadly classified into the following three types:  Shared-memory switching, Shared-medium switching and Space-division switching. In the following subsections, a brief review of these switch types is given. 2.2.1 Shared-Memory Switches A shared-memory switch consists of a single dual-ported memory shared by all input and output lines [12, 13, 5, 6]. The incoming cells are multiplexed into a single stream and 11  Chapter 2.  ATM Switch Architectures  are written into the shared memory as shown in Figure 2.1. The memory can be logically organized as either fully shared, partially shared or completely partitioned with respect to the destination output ports. Full sharing leads to minimum memory requirements to achieve a certain cell loss probability; but may be unfair to other ports when the traffic is bursty, causing degradation in their performance. In complete partitioning, each output port has a dedicated queue and requires larger memory to achieve the same cell loss probability as the full sharing scheme. Partial sharing can be used to achieve optimal size and performance. At the outputs, cells are read out in a single stream, demultiplexed and transmitted on the output lines. Therefore, the control logic in the shared memory architecture must be able to handle N incoming cells, queue them in the proper addresses in memory and select N outgoing cells in each time slot. The main bottleneck, which  Figure 2.1. Shared-memory switch.  restricts the number of input-output lines of a single switch, is the memory cycle time. In order to accommodate all input and output traffic in a single time slot, the memory bandwidth, which is defined as the average number of words accessed per second, must be at least the sum of the bandwidths of the incoming and outgoing lines. For a given memory cycle time, the number of input-output lines is constrained by the following equation: Memory cycle time = cell length / 2x/Vxlink speed For a memory cycle time of 10 ns, link speed of 155 Mbps, and cell length of 53 bytes, the maximum possible switch size is iV=136. If the link speed is 2.4 Gbps then N=8. Three basic types of shared memory switches have been proposed and analyzed in the literature. The Prelude switch [12] was the first shared-memory type switch developed 12  Chapter 2.  for ATM  ATM Switch Architectures  and based on the complete partitioning approach. One of its drawbacks is that it  ties the number of input/output ports to be equal to the number of bytes per packet. The second type of shared-memory switches discussed in the literature are the linked-list based switches [13], where the buffer memory is logically organized as N linked lists, one for each output. The linked-list based approach achieves complete sharing of the memory, and hence lesser memory is required to achieve a given cell loss probability. The third type is the hybrid (shared and dedicated output buffer) switch [5, 14]. The advantage of this switch is that overloading of the shared buffer by some output ports during bursts can be easily controlled, by controlling the queue sizes at each output. An architecture which does away with the multiplexing and demultiplexing stages by replacing them by crosspoint space-division switches, thus increasing the memory bandwidth, is described in [5]. 
However, the hardware complexity of the switch is increased due to the introduction of the crosspoint switches, and hence this design is not suitable for large switches. Better switch scalability can be achieved using single-stage interconnection [15] and multistage interconnection [5] of shared-memory switches. The major challenge in designing these switches is minimizing the hardware complexity to achieve the desired performance. In summary, a single shared-memory switch has the best hardware utilization, because the two required functions of a packet switch, queueing and switching, are carried out via buffering. Hence, shared-memory switches are of particular interest to small to medium sized switches with OC-3  or OC-12  links. However in order to make them viable for  large switches, improvement is required in the following aspects: 1.  Schemes to improve the memory cycle time. For this purpose cost-effective parallel access schemes need to be investigated.  2.  Methods to efficiently interconnect the shared-memory switches so that maximal (optimal) sharing of the buffer-memory is achieved.  13  Chapter 2.  ATM Switch Architectures  2.2.2 Shared-Medium Switches 1 co  Z) CQ O UJ  CO  IZ> CL  2  H Z) Q_  — I  Z>  O  < x  N.  co  CO  N  Address filter  Buffer  Figure 2.2. Shared-medium switch.  A shared-medium switch consists of a common high-speed medium (typically a high-speed bus) onto which the input lines are synchronously multiplexed as shown in Figure 2.2.  The bandwidth of the bus is equal to N times the rate of a single  input line. At the output, there is an address filter and a FIFO buffer for each output line. The function of the address filter is to extract the packets which are destined to a particular output and to store them in the FIFO buffer for that output. As the number of links attached to the medium and their speeds increase, the medium speed becomes a bottleneck. Not many shared-medium architectures have appeared in the literature. The first shared medium switch for ATM  was the ATOM (ATM  Output Buffer Modular)  switch [16]. It uses a bit-slice organization to alleviate the bottleneck of the medium speed. The PARIS (Packetized Automated Routing Integrated System) shared-medium switch [16] was designed for variable length packets.  More recently a bus structure  based on sequentially scheduling the inputs and parallel transfer [17] has been proposed. However a parallel bus of 425 wires is used, which seems to be excessive.  Many  current and earlier commercial switches are based on shared-bus architectures, including switches from FORE systems and Newbridge. The shared-medium architecture results 14  Chapter 2.  ATM Switch Architectures  in separate queues for each output, and hence requires more memory in order to achieve a required cell loss probability. However, the shared medium supports broadcasting and multicasting in a natural way.  2.2.3 Space-Division Switches Space-division switches can be broadly classified into two major categories based on their routing capabilities: 1. Non-blocking switches: Non-blocking switches are switches in which internal blocking within a switch will not occur due to the existence of a large number of non overlapping internal paths (When two or more cells contend for an internal link, only one can pass through the link, the rest have to be dropped. This is referred to as internal blocking). However, blocking may occur at the output of the switch when more than one packet are destined to the same output line. 2. 
Blocking switches: In these switches internal (as well as output) blocking will occur when two or more packets contend for the same link within the switch. Internal blocking and output contention cause degradation in throughput. In order to reduce the amount of throughput degradation, buffers may be provided at the input ports, output ports or internally at the switching elements, respectively referred to as input buffering, output buffering, and internal buffering [18]. Input buffering suffers from low throughput (about 58.6% with FIFO buffers) due to head-of-line blocking effects. With output buffering, all cells contending for the same output port are stored at the output ports until they can be read out. Output buffering increases the throughput over input buffering, since more than one cell can be delivered to the output when output contention occurs, which is not possible with input buffering. Internal buffering is used to alleviate internal blocking. But it has several limitations; the complexity of the switching elements is increased and also buffers introduce random delay within the switching fabric, causing undesired cell-delay variation in the network.  15  Chapter 2.  ATM Switch Architectures  2.3. Banyan-Based Switch Architectures The self-routing property of banyan networks provides great simplicity in the control of the switching elements and hence makes them attractive for implementing high speed switching nodes. An NxN banyan network comprising 2x2 switching elements consists of 7V72xlog/V switching elements, interconnected such that there is an unique path from any input to any output. However, because of the internal blocking property of banyan networks, they cannot be directly used for switching purposes. Therefore various topologies based on the banyan network have been studied in the literature [7, 6] few of which are illustrated in Figure 2.3. The work on these architectures has mainly focused on the performance enhancement of the banyan fabric with no significant work on the resource utilization or scalability of the architecture. Several of the enhanced networks are described below.  16  Chapter 2.  ATM Switch Architectures  1  2-I 2  k-1  1, 2, 3  Stage 0  Stags 1  k-1, k—> Banyan Networks  THE OVERLAPPED BANYAN SWITCH  Stage 2  DILATED BANYAN  1st Banyan Network  ~log N 2  Batcher Sorter  logN  Banyan Network  1 2  M  BATCHER-BANYAN NETWORK MULTIPLE BANYAN NETWORK  Packet filter for marked packets 1_  2-\  2nd Banyan Network  1st Banyan Network  Q kth Banyan Network o  THE TANDEM BANYAN SWITCH Bus Interface  Bus Interface  Figure 2.3. Banyan-based A T M switch architectures.  Dilated Banyan Switch: Dilated banyans [19, 20, 21] employ fixed dilation of the input and outputs of a switching element. Dilation results in improved performance of the banyan network by increasing the internal bandwidth. However, the internal bandwidth of the switch fabric is not optimally utilized, being under-utilized in the initial stages and the degree of dilation may not be appropriate for the later stages. Buffered Banyan Switch:  Turner [7] was one of the pioneers in proposing a  design for a fast packet switching network. The main component of Turner's switching network was a buffered banyan network, tailored to the needs of a fast packet-switching 17  Chapter 2.  network.  ATM Switch Architectures  When an internal conflict occurs, the buffered-banyan network buffers the  blocked packets, which are resubmitted for contention in the next cycle. 
However, the buffered-banyan network has a limited throughput [22, 23, 24, 25] which decreases rapidly for increasing N, for an NxN network.  At Af=1024, throughput equals 0.45  for a single-buffered banyan. The throughput increases with increasing buffer size, and reaches to around 0.6 for a buffer size of four. Beyond buffer size of four, improvement is negligible. Moreover, from the cost and delay point of view, buffered-banyan networks are not optimum. Hence the buffered-banyan model is not an appropriate model for fast large-scale packet switching. Tandem Banyan Switch: The Tandem Banyan network [26] is constructed by placing several banyan networks in series. The idea is to misroute the packets when a contention occurs, and start a new routing at the subsequent banyan network. The Tandem Banyan switch achieves output queueing (and a high throughput as well) and reduces the size of the concentrators, by making it proportional to k, where k is the number of banyans in series (k is much smaller than AO- Under independent uniform traffic with N=32, input load p=l, nine banyans (i.e.,&=9) are required to achieve a cell loss probability of 1 0 , and for N=1024, fourteen banyans (£=14) are required. However, the -6  tandem banyan is not optimal in the use of the hardware, requiring kx(NlogN)  1  switching  elements. For large N, placing k banyans in tandem would be excessive from a hardware point of view. Further, the series connection leads to synchronization problems between banyan networks on multiple chips and boards. Overlapped Banyan Switch: The overlapped banyan [27] reduces the number of stages compared to the tandem banyan by embedding several banyan networks within the switch fabric. The basic idea is that, whenever a contention occurs, the cell is misrouted and a new routing is started immediately from the next stage of switching elements  1  All logarithms are assumed to be base-2 unless indicated otherwise.  18  Chapter 2.  ATM Switch Architectures  (SE's), unlike the tandem banyan where the misrouted packet starts a new routing only after arriving at the output of one of the banyan networks in cascade. Once the packet reaches the required position in the overlapped banyan, it is passed to the output via bypass links. For iV=64, p=l, 28 stages of SE's are required to achieve a cell loss probability of 10 . For JV=1024, 54 stages are required, whereas for the tandem-banyan 140 stages -6  (in 14 banyans) would be required. Though the overlapped banyan requires a smaller number of stages compared to the tandem banyan, the number of stages required is still quite large considering the fact that the switching elements are a little more complex than those of the tandem-banyan. Further, both the tandem banyan and the overlapped banyan have to deal with the packet resequencing problem, which introduces a delay overhead.  Batcher-Banyan Switch: This is another class of architectures based on the Batcherbanyan network [28, 29, 30], which make use of the property that a banyan network becomes internally non-blocking if the inputs are sorted with respect to the destination requests. In order to achieve the sorting of the incoming packets, a Batcher sorter, based on the principle of bitonic sorting is employed. An AMnput Batcher sorting network consists of (log/Vx(logiv+l))/2 stages of comparators which is rather large, considering the fact that the banyan network itself employs log/V stages. 
Moreover, the Batcher sorting network does not contribute to any improvement in throughput when multiple requests are destined to the same output. Additional networks like a trap and a concentrator [30] are required to filter out the excessive packets and resubmit them for routing in the next time slot. Thus, the hardware overhead is large to achieve the required performance.  2.4. Benes-based Networks An NxN Benes network consists of 21og2N-l stages of switching elements and provides N/2 different paths for any input-output pair. A Benes network is illustrated in  . 19  Chapter 2.  ATM Switch Architectures  Figure 2.4. Two basic types of Benes switches have been proposed. These are the Input buffered Benes and the Buffered Benes. Input Buffered Benes: In this switch each input port is equipped with an input port controller (IPC) which contains an input buffer, a path table for that port, and a link-flags vector [31]. The IPC's of the different input ports are linked in a circular manner by links that allow each IPC to communicate with its two neighbor IPCs. Starting with a particular input port (e.g. port 0) a path to an output is searched from a window of the first w cells from the input FIFO (first-infirst-out)queue. When a path is determined, the bits of the link-flags vector corresponding to the selected links is marked accordingly. The link-flags vector is passed on to the IPC of the following input port and the search and marking procedure is repeated until all the ports have been visited. Once a path is determined, it is encoded using a routing tag which is used to self-route the cell from the input queue to the output port. The switch hence, achieves a non-blocking operation. With w=l the throughput is limited to 0.58 due to H O L (head of line) blocking. Higher throughput is achieved by using w>B/2 where B is the input buffer size per port. This architecture does not scale to large switch sizes or high link rates. One of the bottlenecks is the centralized scheduling which would be very slow for large switch sizes and also would not work at high link rates. The other bottleneck is the window searching scheme at the input FIFOs which again slows down the routing operation through the switch. Buffered Benes : The buffered Benes network has also been used as an A T M switching fabric [1]. The WUGS (Washington University Gigabit Switch) switch prototype which was developed at Washington University by Jonathan Turner and others is a prime example of the buffered Benes switch. More recently switch fabrics marketed by IgT Inc. fall under this model. The switching element (SE) design in the WUGS switch is based on a highly parallel shared-buffer architecture. The SE's are equipped with a backward (back-pressure) signalling capability that eliminates the possibility of cell loss  20  Chapter 2.  ATM Switch Architectures  in any SE. The cells are routed to their destinations using a two-phase technique. In the first phase, cells are routed randomly to the outputs of the SE's at stage log2^ (i.e., the outputs of the banyan subnetwork) such that these outputs are evenly loaded. In the second phase, cells are routed to their respective destinations using the reverse banyan subnetwork. This technique leads to cells arriving in an out-of-sequence order at the outputs. This is dealt with by providing resequencing buffers at the switch output. Also, this switch supports multicasting. Multicasting is supported by successive application of copy-by-two routing function in the Benes network [32]. 
A connection with m destinations requires log2m passes through the switch fabric. The main disadvantage of this fabric is that it is internally buffered and could lead to QoS coupling in the cell delay and cell delay variation of the connections passing through it.  Demultiplexer Concentrator  OO 01  Stags 0  Stage 1  Stage 2  Stage 3  Staga 4  Stage 5  Stage 6  Figure 2.4. Benes network of size JV=16.  2.5.  Growable Switch Some networks based on the Clos network topology [33, 11] have been proposed for  ATM  switching. Of these the most important one is the growable switch model which  is discussed in this section. 21  Chapter 2.  ATM Switch Architectures  The growable packet switch is based on a 3-stage NxN Clos network topology, with the first two stages providing a memoryless interconnection capable of routing upto m packets to each group of n output ports, where m<n. There are N/n output groups, and an m-Xo-n packet switch module provides the buffering and final routing for each group of n output ports. The growable packet switch is illustrated in Figure 2.5. The main drawback of this architecture is that the distributed routing scheme used is not self-routing and therefore will not scale with high port speeds or larger port sizes. The details of the routing algorithm can be found in [11]. The routing is performed essentially by passing "link-reservation" vectors among the cross-bar switches of the first stage of the memoryless fabric. The vectors are updated as they are passed from one cross-bar to another. This requires an additional bus to connect the cross-bar switches of the first stage. The advantage of this interconnection network is that it is highly modular as it employs crossbar switches only in a two-stage memoryless fabric. Also, shared memory modules are employed to serve groups of output ports.  Figure 2.5. The growable switch model  22  Chapter 2.  ATM Switch Architectures  2.6. Drawbacks and Improvements Required In the previous section we have mainly considered three types of multistage interconnection networks and their variations. The Banyan based packet switches are attractive because of their self-routing property. However, unbuffered or buffered Banyan switches are known to suffer from severe blocking which leads to limited throughput [22, 34, 6, 24]. A wide variety of techniques studied in the literature were presented in the previous section. The Benes network when used as an input buffered switch does not scale with switch size or port speeds. The buffered Benes has been shown to perform well under balanced traffic conditions [1]. However, it suffers from QoS coupling within the fabric. Finally, the three-stage growable switch has its advantage of modularity but does not scale due to the limitation of its routing algorithm. An ideal multistage interconnection network for an A T M switch which can support high link rates and large number of ports, must have the following properties: 1. Uses a fast self-routing scheme. 2.  Low interconnection density.  3.  Very low QoS coupling (in terms of average delay, cell delay variation and cell loss).  4. Is capable of modular growth. 5.  Is scalable in a cost-effective (low chipcount per port) manner.  6. Makes effective use of available memory technology. With the above criteria in mind we have proposed a high-capacity switching fabric based on incremental internal dilation called the fat-banyan (FAB) [35, 36, 37]. The FAB is discussed in detail in chapter 3. 
A modified version of the FAB, called the Truncated FAB (TFAB) [38, 39], is also presented in chapter 5. The TFAB makes effective use of available memory technology and provides a scalable solution to the design of shared-memory switches, as will be shown in chapter 5.

An unbuffered Benes network with its second half (the reverse banyan subnetwork) dilated like the FAB may serve as a very good switching fabric, as it eliminates QoS coupling (in terms of delay and cell delay variation) and also achieves load balancing among the internal links within the fabric. The internal load balancing may reduce the dilation required to achieve a given cell loss. This study, however, is not part of this thesis and has been recommended for future work.

2.7. Conclusion

In this chapter we discussed space-division architectures and commented on their suitability for large scale ATM switching in the range of 10-100 Tb/sec. The characteristics required of such fabrics were outlined in section 2.6. In the subsequent chapters we will evaluate and analyze the FAB network in detail and show that it is capable of large scale ATM switching in a cost-effective (high in resource utilization) and scalable manner.

Chapter 3

Bandwidth Optimization in Banyan Switches

3.1. Introduction

The internal blocking of the banyan network (see Figure 3.1) can be reduced by placing internal buffers or by increasing the internal link capacity, as was seen in the previous chapter. The choice between internal buffers and increased internal link capacity depends on a number of factors, including:
1. The extent of performance improvement in terms of cell loss, delay, and throughput.
2. Feasibility and cost of implementation, which includes factors such as interconnection and I/O density, switch modularity, and growability.
3. Impact on QoS coupling and cell sequence integrity.

In this chapter we first proceed to examine these crucial factors in some detail. Buffered-banyan networks [7, 22, 24, 40, 23] are based on using internal buffers. In these networks buffers are placed at the inputs (or outputs) of each switching element. Performance studies for both uniformly distributed output requests [24] and non-uniformly distributed output requests [23] have been reported. By using multiple buffers a saturation throughput of 0.7 [7] is obtained under uniform traffic conditions. Under non-uniform traffic conditions, the performance of these networks is severely degraded, and for very large networks [23] the throughput may drop to nearly zero. One suggested solution is the use of a randomizing network (also called a distribution network), as in Benes switches [41], to break up traffic patterns which may cause severe congestion to persist. But such randomization may lead to QoS coupling within the switch fabric. For example, with multipath networks such as the buffered Benes, randomization is carried out in the first half of the network, which causes different flows to affect each other's cell delay and cell loss within the fabric. Thus, buffered-banyan networks suffer from performance limitations.

Figure 3.1. Two 16x16 banyans: a) with 2x2 elements and b) with 4x4 elements (SE: switching element).

The other resource of the switch is its internal link capacity.
Dilated banyan networks [20, 19, 42, 35, 38] are based on increasing the internal link capacity. This, in effect, increases the internal speed-up of the switch by using parallel paths. Alternatively, researchers have considered increasing the number of ports of a switching element [7, 19] to increase bandwidth, as shown in Figure 3.1. However, this approach increases the bandwidth within a switching element only; it does not increase the bandwidth between different stages of the switch. Dilated networks do not efficiently utilize the internal bandwidth, while networks with a large number of ports per switching element do not provide sufficient bandwidth. The main bottleneck in a banyan network is the limited bandwidth of the inner links. From a performance point of view, dilated networks are attractive as they provide sufficient bandwidth. Also, the QoS coupling of different flows within the switch fabric is limited to cell loss, which can be made arbitrarily small by employing sufficient dilation. While the performance of the dilated banyan has been studied for random uniform traffic [20, 19], a comprehensive investigation of such factors as the impact of different traffic patterns, the use of shared output queueing (refer to chapter 5) and various levels of dilation is lacking. In this chapter we propose a unifying model for banyan networks with variable degrees of dilation, introduce a new model called the fat-banyan (FAB) switch, and provide a detailed description of its architectural features, modular construction, and the feasibility of constructing large capacity (terabit-per-second) switches from basic modules. We also provide useful estimates of the number of chips and the interchip wiring required to construct switches of a given size and external link capacity. The loss performance of the unbuffered fabric as a function of dilation and switch element size will be discussed in the next chapter. Although originally conceived as an output-buffered switch, the FAB switching fabric can be optimized for input buffering, input-output buffering, and shared buffering. We defer any discussion of buffer placement and sizing to the next chapters.

3.2. The Fat-Banyan Switching Fabric

The fat-banyan (FAB) switch has switching elements (SEs) with multiple input and output links per port. The number of input and output links per port may not be equal, and we call such an SE a fat SE (FSE). Figure 3.2 illustrates the difference between a FSE and a standard buffered SE. In a 2x2 buffered SE, if an incoming cell cannot be routed to one of the two output ports because of contention, it will be buffered. In the FSE, cells can be routed to a dilated output port based on the availability of links. In the former case cells are lost if the buffers are full, and in the latter case cells are lost if the links are not available. For the FAB switch, the number of input and output links grows in the first few stages of the switch. However, in the remaining stages, the number of links can remain fixed or even decrease, depending on the type of traffic pattern.
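The behaviour of a bufferless FSE can be made concrete with a small sketch of one time slot of operation. The function below is hypothetical (the name and the tie-breaking order are our own assumptions); it only illustrates that cells are lost when more than L cells contend for the L links of an output port.

```python
def route_fse(cells, D, L):
    """One slot of a bufferless fat switching element: `cells` is a list
    of (dest_port, payload) pairs with dest_port in range(D), and each
    of the D output ports has L parallel links.  At most L cells per
    output port survive; the excess is dropped, since there is no
    buffering.  Returns (routed, dropped)."""
    routed = [[] for _ in range(D)]
    dropped = []
    for dest, payload in cells:            # list order acts as priority
        if len(routed[dest]) < L:
            routed[dest].append(payload)
        else:
            dropped.append(payload)
    return routed, dropped
```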
The incremental increase in dilation aims at eliminating the need for internal buffering by increasing the internal bandwidth. This is based on the observation that in multibuffered-banyan networks, performance improvement is achieved for buffer sizes of up to four cells; with buffer sizes greater than that, the improvement in performance is negligible [24, 25]. Moreover, theoretical analysis of congestion in banyan networks has shown that under uniform traffic conditions, the congestion of all but $1/N^{\alpha}$ of the routing problems with a single packet per input in a banyan network of size NxN is at most $O(\alpha) + o(\alpha \log N)$, for any $\alpha$ [43]. Since the congestion is a function of $\log N$, the number of links in the FSE's of the FAB switch would not grow to be very large.

There are several advantages of the FAB switch over a multibuffered-banyan. Mainly, there are no queueing delays inside the FAB switch, and the throughput of the FAB switch reaches nearly unity under uniform traffic conditions (as will be shown in a later chapter), while the throughput of the multibuffered-banyan saturates at around 0.7 [24]. A detailed analysis of the fundamental complexities of the crossbar-based (Knockout) and banyan-based switches has been reported: it has been shown [21] that for a given cell loss probability requirement, the dilated-banyan network has the lowest order of complexity among the space-division switches proposed to date. By utilizing variable dilation, the FAB switch achieves a lower order of complexity than banyan switches with constant dilation, and also provides a more general banyan switch model [36] in which buffer-bandwidth trade-offs can be fully investigated and optimized.

In any particular stage of the FAB switch the FSE's are of the same type. The notation $L_i$, where i is the stage number, will be used to denote the output link dilation of the FSE's in stage i. Also, any given FAB network is characterized by a dilation configuration (DC) represented by the set DC=[$L_1$,$L_2$,...]. Figure 3.3 illustrates an 8x8 FAB switch with dilation configuration DC=[$L_1$,$L_2$,$L_3$]=[2,3,4].

Figure 3.2. (a) 2x2 buffered SE. (b) 2x2 fat SE with output concentrator. (c) 2x2 fully dilated fat SE (no output concentrator required). I0, I1: input ports; O0, O1: output ports.

Figure 3.3. 8x8 output-buffered FAB switch with dilation configuration DC=[2,3,4].

3.3. Fat-Banyan Architectural Features

The FAB architecture can be viewed as a generalized version of the dilated banyan architecture (a 2-dilated (DC=[2,2,2]) 8x8 banyan network is shown in Figure 3.4). The dilated banyan can be operated in two ways: one is to route cells over any one of the multiple links at each input, and the other is to route over only one link at each input and then use more links at subsequent stages [20]. In the latter case the performance of the dilated architecture is similar to that of the FAB architecture. However, the gradual increase in dilation of the FAB topology leads to a much lower construction cost, not only because of savings in cross-connect hardware but, more importantly, because of the resulting reduction in interchip and interboard wiring and the resulting modularization of the architecture, as will be explained next.

Figure 3.4. 2-dilated 8x8 banyan network.
Modular Expansion of Switching Elements: Figure 3.5 shows our proposal for constructing FSEs of arbitrary input and output dilation using basic FSEs of fixed size and dilation pattern. In its simplest form, a 2x2 FSE with an input link dilation of $L_{in}$=1 and an output link dilation of $L_{out}$=2 is called a basic FSE (BFSE). A parallel arrangement of BFSEs can be used to construct 2x2 FSEs of varying input and output dilations. Configuring an FSE of a given size and dilation can be done simply by pin assignment to a generic chip containing several BFSEs, as shown in Figure 3.5. Two pin assignments for two types of FSEs are shown in Figures 3.6 and 3.7. Each line in these figures represents one link, which may consist of several wires routed through several I/O pins. With this construction a fully dilated FSE is obtained. An FSE is said to be fully dilated if the number of links per output port is equal to the total number of its input links. Other variations of this construction can be devised to implement FSE's which are not fully dilated; output concentrators will be required within each FSE to reduce output dilation in that case.

Figure 3.5. Generic package of four BFSEs (each BFSE is a 2x2 element with one input link per port and two output links per port).

Figure 3.6. Constructing 2x2 FSE's with $L_{in}$=2 and $L_{out}$=4 from BFSE's ($L_{in}$: input link dilation, $L_{out}$: output link dilation).

Figure 3.7. Constructing 2x2 FSE's with $L_{in}$=4 and $L_{out}$=8 from BFSE's ($L_{in}$: input link dilation, $L_{out}$: output link dilation).

Banyan Partitioning Strategy: It should be emphasized that the reduction in interconnection complexity in the initial stages of the FAB has a significant impact on reducing the complexity of interboard wiring, which can lead to improved packaging for large scale implementations. An isomorphic representation of the banyan network is shown in Figure 3.8, where $D_n^d$ denotes a delta network [19, 44] with n inputs and outputs constructed of dxd crossbars. This representation is particularly useful for partitioning large banyans among multiple chips or multiple boards. For the FAB network, the best strategy is to slice horizontally through the network, as shown in Figure 3.9. The FAB architecture allows modular growth to a large scale ATM switch. For a large scale ATM switch, with say a throughput of the order of terabits/sec, an important issue is to break the design into modular units and interconnect these units. This is done by putting the modular units onto boards, which are placed in shelves which in turn interconnect the boards through backplanes. A rack of shelves may be needed to house these boards, with additional cabling required to interconnect the shelves.

Figure 3.8. Isomorphic representation of a banyan network.

Figure 3.9. Board partitioning for an 8x8 FAB switch.

This partitioning strategy is justified in view of the following facts:
1. The wiring becomes more and more localized to within smaller subsets of adjacent FSEs as we go from the input stage to the output stage, because of the banyan topology.
2. The number of wires required for interconnecting the FSE's on different boards is limited to the first few stages, which have the smallest levels of link dilation.

In general, the FAB switch can be partitioned among multiple boards by horizontal slicing of the network and placing each slice on a distinct board. Partitioning an NxN FAB among K<N boards is done by placing N/K input ports, N/K output ports and N/2K FSEs from each stage of the network on a single board. Assuming that $N=2^n$ and $K=2^k$, the maximum number of external links between two boards is

$$\sum_{i=1}^{k} \left(\frac{N}{K}\right) L_i \qquad (3.1)$$

where $L_i$ is the dilation of stage i. As an example, consider a FAB network of size N=256 with dilation configuration DC=[2,4,8,8,8,8,8,8]. If the FAB is partitioned among 2 boards, the maximum number of external wires between two boards is 256. The maximum number of wires interconnecting two boards is 384 if the FAB is partitioned among 4 boards, and 448 if it is partitioned among 8 boards. Table 3.1 compares the maximum number of external wires crossing two boards. We see that the interboard wiring for the FAB switch is greatly reduced compared to the dilated banyan. The FAB network thus provides the required internal bandwidth while significantly reducing the interchip/interboard interconnection compared to other switch architectures such as the Knockout switch [45], the buffered Benes [1, 31] and the 3-stage Clos [11].

Additionally, the FAB offers a modest saving in the number of FSEs. Table 3.2 compares the dilated banyan and FAB architectures of different sizes, assuming that the network is built from 2-dilated FSEs ($L_{in}$=2, $L_{out}$=2). It should be noted that 2-dilated FSE's require 4x2 concentrators at each output to provide arbitration if the number of cells contending for an output is greater than 2. For a 16x16 FAB switch with DC=[2,4,8,8], the total number of 2-dilated FSEs is 8+16+32+32=88. Note that in the first three stages of the 16x16 FAB only one of the input links of the 2-dilated FSE's needs to be used, or a much simpler implementation based on the FSE's built from BFSE's shown in Figure 3.5 can be used. For a 16x16 dilated banyan switch with a dilation of 8, the number of 2-dilated 2x2 SEs required is 32x4=128.

Table 3.1. Comparison of interconnections required for FAB and dilated banyan networks of size 256x256 (fat-banyan: DC=[2,4,8,8,8,8,8,8]; dilated banyan: dilation 8).

  Number of boards   Maximum number of external wires interconnecting two boards
                     Fat-banyan     Dilated banyan
  2                  256            1024
  4                  384            1024
  8                  448            768

Table 3.2. Comparison of FAB and dilated banyan networks built with 2-dilated 2x2 FSE's ($L_{in}$=2, $L_{out}$=2); the dilated banyan column forms an 8-dilated network. Note that the FSEs in the first two stages of the FAB do not require output concentrators and hence have simplified implementations.

  Switch size   FAB dilation configuration   2-dilated 2x2 FSEs (FAB)   2-dilated 2x2 FSEs (dilated banyan)
  16x16         2,4,8,8                      88                         128
  64x64         2,4,8,8,8,8                  608                        768
  256x256       2,4,8,8,8,8,8,8              3840                       4096
  4096x4096     2,4,8,8,8,8,8,8,8,8,8,8      88064                      98304
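The wiring and element counts above are easy to tabulate mechanically. The following sketch (the function names are ours) evaluates formula (3.1) and the 2-dilated FSE count used in Table 3.2, under the stated assumption that N and K are powers of two.

```python
def interboard_wires(N, K, dc):
    """Maximum number of external links between two boards, Eq. (3.1):
    only the first k = log2(K) stages cross board boundaries, and
    stage i contributes (N / K) * L_i wires."""
    k = K.bit_length() - 1                 # K = 2**k boards
    return sum((N // K) * dc[i] for i in range(k))

def fse_count(N, dc):
    """Number of 2-dilated 2x2 FSEs, as counted in Table 3.2: each of
    the N/2 FSE positions per stage uses L_i / 2 such elements."""
    return sum((N // 2) * (L // 2) for L in dc)

dc = [2, 4, 8, 8, 8, 8, 8, 8]              # N = 256 example of Table 3.1
assert interboard_wires(256, 2, dc) == 256
assert interboard_wires(256, 4, dc) == 384
assert interboard_wires(256, 8, dc) == 448
assert fse_count(16, [2, 4, 8, 8]) == 88   # 8 + 16 + 32 + 32, Table 3.2
```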
3.4. Chip Count Estimate for a Large Scale FAB Switch Implementation

In this section we develop a simple but useful cost model that evaluates the cost of a switch in terms of pin-limited VLSI routing chips. In this model the cost depends on the FAB size as well as the dilation configuration used. The cost model will be used to evaluate the scalability of the core fabric in terms of the growth in the number of chips required per port of the switch.

Cost Model: The main element of our cost model is a basic routing chip that contains multiple BFSEs, as was shown in Figure 3.5. In this model the ports of each BFSE can be accessed through the chip I/O. The main limiting factor in our model is the number of I/O pins per chip. This is justified in light of the fact that such a routing chip contains mainly multiplexers, demultiplexers and a few bytes of buffers, and such units use little silicon area. The use of multiple planes to allow the switch to keep up with the external link speed has implications for additional hardware at the inputs (to separate the cell stream into multiple planes) and at the outputs (to combine the separate planes).

Large scale switches are expected to support link rates of 2.5 Gb/s to 10 Gb/s, or even up to 40 Gb/s [46]. However, with current microelectronic technology, I/O speeds are far lower than such rates. Therefore, to support a given high link rate of magnitude R bits/sec, several I/O pins of rate r bits/sec (where r<R) may have to be dedicated to the link. In this case $\lceil R/r \rceil$ pins are needed per link of an FSE. In our model, if the number of $\lceil R/r \rceil$ I/O pins is large (e.g. > 40), the pins can be allocated to multiple planes of chips. We demultiplex the bits from a single external link over s planes of chips. Hence, each chip in a given plane would have $\lceil R/(rs) \rceil$ pins for a single external link. We denote this quantity as

$$p = \lceil R/(rs) \rceil. \qquad (3.2)$$

For example, if the link rate R=10 Gb/s, the I/O rate (per pin) is r=200 Mb/s, and the number of planes of chips used is s=5, then the number of pins per chip (per plane) is p=10. Figure 3.10 illustrates the method of partitioning a single BFSE among s planes.

Figure 3.10. Partitioning a BFSE among planes.

This section provides formulas for partitioning the FAB switch among VLSI chips with C I/O pins. The partitioning assumes the use of dxd BFSEs with input dilation $L_{in}$=1 and output dilation $L_{out}$=d, as shown in Figure 3.11 (for d=2 and d=4). It is assumed that a BFSE can be partitioned among different chips, and that multiple BFSEs can be laid out on the same chip. We use the following parameters for our calculations:

External link rate: R bits/sec
Number of input links per BFSE: d
Number of I/O pins/link per chip: p
I/O pin rate: r bits/sec
Number of planes of chips into which a BFSE is partitioned: s
Total I/O pins per chip: C
Number of BFSEs per chip: i
Total number of BFSEs for the FAB switch: T

Figure 3.11. BFSEs of varying sizes (2x2 and 4x4).

The number of I/O pins per link per chip is given by

$$p = \lceil R/(rs) \rceil. \qquad (3.3)$$

Consider a BFSE of size dxd as the basic module. The number of BFSEs per chip (per plane) is given by i.
It is straightforward to determine that i is the largest integer which satisfies

$$i\,d\,p + i\,d^2 p \le C. \qquad (3.4)$$

For an NxN switch the total number of BFSEs is

$$T = \frac{N}{2}\left[1 + 2 + 4 + 8(\log_2 N - 3)\right]. \qquad (3.5)$$

The total chip count is given by

$$\lceil Ts/i \rceil. \qquad (3.6)$$

Tables 3.3, 3.4, and 3.5 summarize the chip count for a few practical FAB switch sizes and dilation configurations, for external link rates of 622 Mb/s, 2.5 Gb/s and 10 Gb/s. (The results reported in these tables are based on the formula (3.2), which assumes that each BFSE has full access to the chip I/O; alternatively, multiple stages of BFSEs can be packaged together to further reduce I/O requirements.) We see that as the switch size increases, the chip count per port increases in a sub-linear manner. For instance, at 2.5 Gb/s, going from a switch of size 64x64 to 256x256, a scaling by a factor of 4, the number of chips per port increases by a factor of only 1.5.

Table 3.3. Chip count for various FAB switch sizes with external link rate of 622 Mb/s (R=622 Mb/s, r=200 Mb/s, s=5, C=240, d=2).

  Switch size   Dilation configuration     Calculated parameters   Total chip count   Chips/port
  64x64         2,4,8,8,8,8                p=1, T=992              124                1.94
  256x256       2,4,8,8,8,8,8,8            p=1, T=6016             752                2.94
  4096x4096     2,4,8,8,8,8,8,8,8,8,8,8    p=1, T=161792           20224              4.94

Table 3.4. Chip count for various FAB switch sizes with external link rate of 2.5 Gb/s (R=2.5 Gb/s, r=200 Mb/s, s=5, C=240, d=2).

  Switch size   Dilation configuration     Calculated parameters    Total chip count   Chips/port
  64x64         2,4,8,8,8,8                p=3, T=992, i=13         382                5.96
  256x256       2,4,8,8,8,8,8,8            p=3, T=6016, i=13        2314               9.04
  4096x4096     2,4,8,8,8,8,8,8,8,8,8,8    p=3, T=161792, i=13      62228              15.19

Table 3.5. Chip count for various FAB switch sizes with external link rate of 10 Gb/s (R=10 Gb/s, r=200 Mb/s, s=5, C=240, d=2).

  Switch size   Dilation configuration     Calculated parameters    Total chip count   Chips/port
  64x64         2,4,8,8,8,8                p=10, T=992, i=4         1240               19.375
  256x256       2,4,8,8,8,8,8,8            p=10, T=6016, i=4        7520               29.375
  4096x4096     2,4,8,8,8,8,8,8,8,8,8,8    p=10, T=161792, i=4      202240             49.375
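Formulas (3.3)-(3.6) can be checked with a few lines of code. The sketch below (the names and default arguments are our own assumptions) reproduces the table entries for the DC=[2,4,8,8,...] configurations counted by (3.5).

```python
from math import ceil, log2

def fab_chip_count(N, R, r=200e6, s=5, C=240, d=2):
    """Chip count per Eqs. (3.3)-(3.6) for the DC=[2,4,8,8,...]
    configurations of Tables 3.3-3.5, assuming each BFSE has full
    access to the chip I/O."""
    p = ceil(R / (r * s))                      # pins/link per plane, Eq. (3.3)
    i = C // (d * p + d * d * p)               # BFSEs per chip, Eq. (3.4)
    T = (N // 2) * (1 + 2 + 4 + 8 * (int(log2(N)) - 3))   # Eq. (3.5)
    chips = ceil(T * s / i)                    # Eq. (3.6)
    return T, chips, chips / N

print(fab_chip_count(64, 2.5e9))   # (992, 382, ~5.97): Table 3.4, first row
```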
3.5. Comparison of FAB Chipcount with other Switch Fabrics

In the previous section the chip counts were derived based on the assumption that each BFSE has full access to the chip I/O. In this section we use an alternative partitioning approach, which minimizes the chip count by packaging several banyan stages per chip. This approach has been used previously in [44], where chipcount equations for various switch fabrics (Knockout, Tandem banyan, etc.) are derived. Here we compare the chipcount of the FAB switch with the Knockout, dilated banyan and Tandem banyan using this approach. The parameters of the switching networks are chosen so that the cell loss performance under independent uniform traffic conditions is of the order of $10^{-6}$. For all the networks considered, $\lceil R/r \rceil$ parallel planes, each a bit-serial network, are required to match the external data rate.

Knockout Switch: As we have seen previously, the Knockout switch is a crossbar switch. A simplified block diagram of this switch is shown in Figure 3.12. The details of the chipcount derivation for the Knockout switch can be found in [44]. Each bit-plane of the Knockout network has N Knockout concentrators with N inputs and L outputs. An N:L concentrator can be built from a tree of p':L concentrators, as shown in Figure 3.13. The number of levels of p':L chips required is given by the smallest integer l satisfying (3.7), and the number of chips per level is given by (3.8). Assuming p'=C, the total I/O pins per chip, the chipcount for a one-bit Knockout concentrator, $C_k$, is given by (3.9).

Figure 3.12. The Knockout switch (cell address filters, Knockout concentrator, and buffers at each output).

There are N Knockout concentrators per plane, and hence the total number of chips for the switch is

$$N \lceil R/r \rceil C_k. \qquad (3.10)$$

Table 3.6 lists the chipcount for two switch sizes.

Figure 3.13. Building an N:L concentrator from smaller p':L concentrators.

Table 3.6. Chip count for the Knockout switch, as reported in [44] (R=1 Gb/s, r=200 Mb/s, C=128, L=8).

  Switch size   Calculated parameters   Total chip count   Chip count/port
  256x256       C_k=2                   2560               10
  4096x4096     C_k=34                  696320             170

Banyan Switch: Consider an NxN banyan switch constructed from dxd crossbars. In order to package this using chips with total I/O of C, we require N/(C/2) chips to accommodate $\lfloor \log_d(C/2) \rfloor$ stages [44]. There are a total of $\log_d N$ stages. Hence, the total number of chips required is

$$\left\lceil \frac{R}{r} \right\rceil \frac{N}{C/2} \left\lceil \frac{\log_d N}{\lfloor \log_d(C/2) \rfloor} \right\rceil. \qquad (3.11)$$

The chipcount estimates are shown for two switch sizes in Table 3.7.

Table 3.7. Chip count for a banyan switch (R=1 Gb/s, r=200 Mb/s, C=128, d=2).

  Switch size   Total chip count   Chip count/port
  256x256       40                 0.15625
  4096x4096     640                0.15625

Tandem Banyan Switch: Consider an NxN Tandem banyan switch constructed from dxd crossbars, with k banyan networks (each of size NxN) connected in cascade. The total number of chips required is the number of chips for the cascade of k banyan networks plus the number of chips required for packaging the output concentrators [44]. Hence, the total chip count is

$$\left\lceil \frac{R}{r} \right\rceil k \frac{N}{C/2} \left\lceil \frac{\log_d N}{\lfloor \log_d(C/2) \rfloor} \right\rceil + C_{out}. \qquad (3.12)$$

$C_{out}$ is the number of chips required to package the output concentrators; its value is calculated from the pinout requirement and is shown in Table 3.8.

Table 3.8. Chip count for the Tandem banyan switch (R=1 Gb/s, r=200 Mb/s, C=128, d=2).

  Switch size                  Total chip count   Chip count/port
  256x256 (k=9, C_out=3)       363                1.41
  1024x1024 (k=14, C_out=5)    2245               2.19

Dilated Banyan Switch: Consider an NxN dilated banyan switch with dxd switching elements, each with a dilation of L. In order to package this using chips with total I/O of C, we require NL/(C/2) chips to accommodate $\lfloor \log_{Ld}(C/2) \rfloor$ stages. There are a total of $\log_d N$ stages. Hence, the total number of chips required is

$$\left\lceil \frac{R}{r} \right\rceil \frac{NL}{C/2} \left\lceil \frac{\log_d N}{\lfloor \log_{Ld}(C/2) \rfloor} \right\rceil. \qquad (3.13)$$

The chip count estimates are shown for two switch sizes in Table 3.9.

Table 3.9. Chip count for the dilated banyan (R=1 Gb/s, r=200 Mb/s, C=128, L=8, d=2).

  Switch size   Dilation configuration     Total chip count   Chip count/port
  256x256       8,8,8,8,8,8,8,8            1280               5
  4096x4096     8,8,8,8,8,8,8,8,8,8,8,8    30720              7.5
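The packaging formulas (3.11)-(3.13) can likewise be evaluated mechanically. The sketch below uses an exact integer logarithm to avoid floating-point error; the helper names are our own, and the Knockout count is omitted since (3.7)-(3.9) are not reproduced here.

```python
def ilog(x, b):
    """Largest integer e with b**e <= x (exact integer logarithm)."""
    e = 0
    while b ** (e + 1) <= x:
        e += 1
    return e

def ceil_div(a, b):
    return -(-a // b)

def banyan_chips(N, R, r, C, d=2):
    """Eq. (3.11): N/(C/2) chips hold floor(log_d(C/2)) stages each."""
    return ceil_div(R, r) * (N // (C // 2)) * ceil_div(ilog(N, d), ilog(C // 2, d))

def tandem_chips(N, R, r, C, k, c_out, d=2):
    """Eq. (3.12): k cascaded banyans plus output concentrator chips."""
    return k * banyan_chips(N, R, r, C, d) + c_out

def dilated_chips(N, R, r, C, L, d=2):
    """Eq. (3.13): links are L wires wide; a chip holds
    floor(log_{Ld}(C/2)) stages."""
    return (ceil_div(R, r) * (N * L // (C // 2))
            * ceil_div(ilog(N, d), ilog(C // 2, L * d)))

R, r = 10**9, 2 * 10**8                     # 1 Gb/s links, 200 Mb/s pins
assert banyan_chips(256, R, r, 128) == 40                   # Table 3.7
assert tandem_chips(256, R, r, 128, k=9, c_out=3) == 363    # Table 3.8
assert dilated_chips(256, R, r, 128, L=8) == 1280           # Table 3.9
```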
Fat-Banyan Switch: Consider an NxN FAB switch with the dilation of the first two stages being 2 and 4 respectively. There is a modest saving in chipcount compared to the dilated banyan, since more FSE's can be packed per chip in the initial stages. The first two stages have an average dilation of 3 and the next stage has an average dilation of 6; the remaining stages are similar to the dilated banyan. The chipcount estimates for different switch sizes and parameters are shown in Tables 3.10 and 3.11.

Table 3.10. Chip count for the FAB switch with C=128 and R=1.0 Gb/s (r=200 Mb/s).

  Switch size   Dilation configuration     Total chip count   Chip count/port
  256x256       2,4,8,8,8,8,8,8            1000               3.9
  1024x1024     2,4,8,8,8,8,8,8,8,8        5200               5.1
  4096x4096     2,4,8,8,8,8,8,8,8,8,8,8    25920              6.32
  256x256       2,4,4,4,4,4,4,4            300                1.2
  1024x1024     2,4,4,4,4,4,4,4,4,4        1520               1.48

Table 3.11. Chip count for the FAB switch with C=240 and R=2.5 Gb/s (r=200 Mb/s, d=2).

  Switch size   Dilation configuration     Total chip count   Chip count/port
  256x256       2,4,8,8,8,8,8,8            1361               5.31
  4096x4096     2,4,8,8,8,8,8,8,8,8,8,8    35945              8.78

From the tables above we see that the FAB switch (Table 3.10) has a significantly reduced chipcount compared to the Knockout, and modest savings compared to the dilated banyan network. The Tandem banyan network has a very low chipcount; however, the FAB chipcount with DC=[2,4,4,...] is better than that of the Tandem banyan. As will be shown in the next chapter, this dilation configuration, when combined with input-output buffering, yields an excellent loss performance. Additionally, the FAB has the advantage of lower delay overhead compared to the Tandem banyan.

3.6. Conclusion

In this chapter the FAB switch was introduced and shown to require fewer internal links than an equivalent uniformly dilated banyan switch. It was shown that incremental dilation of the internal links has a dramatic impact on the implementation and scalability of the switch, and that the FAB chipcount depends on the packaging strategy used. Two strategies were proposed. The first is based on the assumption that the ports of each BFSE have full access to the chip I/O. The second approach packs as many stages of FSE's as possible per chip and yields a lower chip count than the first (refer to Tables 3.4 and 3.11); however, it is less flexible in constructing a FAB switch of a given size. The FAB chipcount is significantly better than that of the crossbar-based Knockout, and better than that of the dilated banyan, according to the estimates presented in [44].

Chapter 4

Performance of The Fat-Banyan Switch

4.1. Introduction

This chapter discusses the performance of the Fat-Banyan (FAB) switch, evaluated by simulation and analysis. The performance of a specific ATM switch depends on the traffic pattern according to which cells arrive at its inputs, as well as on the distribution of output requests. The traffic pattern is determined by: 1) the arrival process of the cells at the inputs of the switch, and 2) the distribution of the output requests of the arriving cells. In this chapter we consider two arrival processes, a) the Bernoulli process and b) the Interrupted Bernoulli process, and two types of distributions for the output requests, a) uniform and b) non-uniform. We first analyze the performance of the bufferless FAB switch under independent uniform traffic. The performance of the bufferless FAB switch is shown to be very efficient in terms of cell loss probability.
The cell loss is easily controlled by choosing an appropriate dilation configuration. Analysis and simulation results under uniform traffic are presented, followed by simulation results for non-uniform and bursty traffic patterns. Next, the output-buffered FAB switch is considered; analysis and simulation results are discussed for this case. Finally, we present the performance results for the input-output queued FAB switch with backpressure. This has the advantages that the fabric is rendered lossless, the dilation is reduced, and only small input buffers are required to achieve a cell loss performance similar to that of the purely output-buffered case.

4.2. Performance Analysis under Independent Uniform Traffic

In this section we consider the simplest traffic pattern, called the independent uniform traffic pattern. The arrival process of this traffic pattern is a Bernoulli process with parameter p at an input port. Also, for this traffic pattern the arrivals at an input are independent of the arrivals at any other input. Further, the distribution of the destination requests of the arriving cells is uniform over all the output ports. The throughput of the switch is defined as the rate R(p) of cells which reach their requested destinations. The parameter p represents the load offered to the switch. The normalized throughput is defined to be R(p)/p, and the mean cell loss rate is given by 1 - R(p)/p.

The analysis is based on the approach presented in [47] for determining the performance of an unbuffered banyan network. The following assumptions are made for the analysis of the FAB switch:
1. The traffic at the input ports conforms to the independent uniform traffic pattern.
2. The performance of a stage can be approximated by that of a single FSE in that stage.
3. The number of input and output links per port of an FSE may vary from stage to stage.
4. Each link of an input port of an FSE can connect independently to either one of its output ports.
5. Cells are removed immediately at the output ports of an FSE.
6. Cells which are blocked at an FSE (cannot reach their destination port) are discarded.
7. No errors are introduced in the operation of the FSEs or the switch as a whole.

For the analysis we consider a FAB switch with N input ports and N output ports. An FSE can be characterized as a function of three parameters, S(M, D, L), where M denotes the total number of input links to the FSE, D is the number of output ports, and L is the number of links per output port. We will use the general notation $M_i$, $D_i$, and $L_i$ to denote an FSE in the ith stage (1 <= i <= log N) of the FAB switch. However, whenever we refer to an arbitrary FSE of the FAB switch, we will disregard the subscripts for the sake of simplicity. Figure 4.1 illustrates such an FSE. For the purpose of the analysis, a knowledge of the total number of input links per FSE is sufficient, without the need to distinguish between the number of input ports and the number of links per input port.

Figure 4.1. Fat Switching Element (FSE): M input links, D output ports, dilation at each output port.

Let p(i,j) be the probability that j cells are destined to a particular output port of an FSE, given that there were i cell arrivals. Then

$$p(i,j) = \binom{i}{j} \left(\frac{1}{D}\right)^j \left(1 - \frac{1}{D}\right)^{i-j}. \qquad (4.1)$$

Note that an incoming cell has an equal probability of 1/D of being destined to any output port of an FSE.
An output port can accept only min(j, L) cells destined to it; if j > L, then j - L cells are lost. We define E(i) as the expected number of cells that the FSE passes successfully to its outputs per clock cycle, given that i cells arrived at the beginning of that cycle. Thus,

$$E(i) = D \sum_{j=0}^{i} p(i,j)\, \min(j, L). \qquad (4.2)$$

The mean number of cells at the output of the FSE per clock cycle is then given by

$$E = \sum_{i=0}^{M} E(i)\, q(i) \qquad (4.3)$$

where $q(i) = \binom{M}{i} p^i (1-p)^{M-i}$ is the probability of i arrivals in one clock cycle.

We now define two output rates: 1) the output rate per output link, $O_l$, and 2) the output rate per output port, $O_p$. Our interest in $O_l$ is due to the fact that the input load to the next stage is equal to the link rate $O_l$ of the previous stage's output. Thus, we have

$$O_l = E/(D \times L), \qquad O_p = E/D. \qquad (4.4)$$

For the output of the last stage, however, we are interested in the output rate $O_p$ (per port). This is consistent with our initial goal of replacing buffers with ports of multiple links. The output rate of the last stage can be calculated recursively using the above formulas. For a FAB switch of size NxN the normalized throughput is

$$R(p) = O_{pn}/p \qquad (4.5)$$

where $O_{pn}$ is the output rate per output port of the final stage.

4.3. Cell Loss and Stage Dilation

The blocking probability of dilated banyan-based interconnection networks under permutation and random request assumptions has been analyzed in [19, 48]. Also, the blocking probability of a dilated switching element under random uniform cell-based traffic has been discussed in [20]. In this section we apply this analysis to the case of an FSE. The problem of interest is to determine the cell loss rate as a function of stage dilation. Under the independent uniform traffic pattern, all the FSE's of a stage are equally loaded; hence, the average loss rate is the same for all FSE's in the same stage. The following analysis is therefore for a single FSE S(M, D, L) only and is applicable to all FSEs in the same stage.

The arrival rate at an input link is p. In order to simplify the analysis, we consider the joint probability of a cell arriving at an input and being bound to a given output. The joint probability is p/D, since the cell is equally likely to be destined to any output. The probability that j out of M packets will be destined to a particular output is given by

$$P_j = \binom{M}{j} \left(\frac{p}{D}\right)^j \left(1 - \frac{p}{D}\right)^{M-j}. \qquad (4.6)$$

The probability of cell loss for an FSE is the ratio of the average number of cells lost at an output of the FSE to the average number of cells at an output of the FSE if there were no cell loss:

$$\Pr(\text{cell loss}) = \frac{D}{pM} \sum_{k=1}^{M-L} k \binom{M}{L+k} \left(\frac{p}{D}\right)^{L+k} \left(1 - \frac{p}{D}\right)^{M-L-k}. \qquad (4.7)$$

The cell loss probability is plotted as a function of the dilation L in Figure 4.2, for different input traffic rates, for an FSE S(8,2,L). It can be seen that the dilation required to achieve a certain cell loss rate is a function of the input traffic rate. For input rates greater than 0.4, a dilation of 8 is necessary to achieve a cell loss probability under $10^{-6}$; at traffic rates of 0.2 and 0.3 a dilation of 7 is sufficient. The dilation required at any stage can thus be determined as a function of the traffic rate at the inputs of the FSE's in that stage. In the fat-banyan, under the uniform traffic assumption, the load is maximum at the inputs and reduces as we move towards the output.
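The recursion of Section 4.2 and the loss formula (4.7) translate directly into code. The following is a minimal sketch under the stated independence assumptions (the function names are ours).

```python
from math import comb

def fse_output_link_rate(p, M, D, L):
    """Per-link output rate O_l of an FSE S(M, D, L) under independent
    uniform traffic, per Eqs. (4.1)-(4.4); p is the load on each of the
    M input links."""
    E = 0.0
    for i in range(M + 1):
        q_i = comb(M, i) * p**i * (1 - p)**(M - i)    # Pr[i arrivals], Eq. (4.3)
        E_i = D * sum(comb(i, j) * (1 / D)**j * (1 - 1 / D)**(i - j) * min(j, L)
                      for j in range(i + 1))          # Eq. (4.2) with Eq. (4.1)
        E += E_i * q_i
    return E / (D * L)                                # O_l, Eq. (4.4)

def fab_throughput(p, dc, D=2):
    """Normalized throughput R(p), Eq. (4.5): the link rate out of each
    stage is the input load of the next, with M_i = D * L_{i-1}."""
    rate, links = p, 1                  # external inputs: one link per port
    for L in dc:
        rate = fse_output_link_rate(rate, D * links, D, L)
        links = L
    return rate * links / p             # O_pn = O_l * L of the last stage

def fse_cell_loss(p, M, D, L):
    """Cell loss probability of a single FSE, Eq. (4.7)."""
    q = p / D
    lost = sum(k * comb(M, L + k) * q**(L + k) * (1 - q)**(M - L - k)
               for k in range(1, M - L + 1))
    return lost * D / (M * p)
```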
By keeping the initial dilation high, it is possible to reduce the traffic rate quickly and thereafter maintain the dilation constant.

Figure 4.2. Cell loss probability as a function of dilation for an FSE S(8,2,L) for different input loads under uniform traffic (by analysis).

4.4. Results

The simulation results presented here are based on the following arbitration mechanism at an FSE. Arbitration is required when the number of cells contending for an output of an FSE exceeds the output dilation of the FSE. Arbitration is done by giving priority to the cells from top to bottom at the output concentrators of the FSE. This arbitration mechanism is similar to that of the Knockout switch concentrators [45], where priority is given from left to right of the concentrator inputs. Alternatively, cells could be chosen randomly at the concentrator, but this would require a random number generator for each concentrator, which would add to the complexity of the FSE; hence we have chosen this simple arbitration mechanism. The simulation results are obtained with 95% confidence intervals using the method of independent replications (refer to [49] for a discussion of this method).

4.4.1 Uniform Traffic

The performance results of the FAB switch under uniform traffic are discussed in this section. In Figure 4.3, the maximum achievable throughput is shown as a function of the number of stages, where each FSE has 2 input and 2 output ports (i.e., D=2), under the independent uniform traffic pattern, for the simple banyan network and the FAB switch. For the FAB switch we have used $M_1$=$D_1$=$L_1$=2 for the first stage, and $M_i$=$D_{i-1}$x$L_{i-1}$, $D_i$=2 and $L_i$=$L_{i-1}$+1 (2 <= i <= $\log_2 N$) for the subsequent stages (i.e., DC=[$L_1$, $L_1$+1, $L_1$+2, ..., $L_1$+$\log_2 N$-1]). In Figure 4.4, the throughput of the FAB switch is plotted on a different scale to show more clearly the variation of the maximum throughput as a function of the number of stages. It should be noted that this throughput is only for illustrating the improvement obtained by gradual dilation; further improvement in throughput and cell loss probability is achieved by using an appropriate dilation configuration, as discussed below.

Figure 4.3. Maximum throughput for a fat-banyan switch with DC=[$L_1$=2; $L_i$=$L_{i-1}$+1, 2 <= i <= log N] and a simple banyan network under the independent uniform traffic pattern with input load p=1.0 (by analysis).

Figure 4.4. Maximum throughput for a fat-banyan switch with DC=[$L_1$=2; $L_i$=$L_{i-1}$+1, 2 <= i <= log N] under the independent uniform traffic pattern (expanded scale) with input load p=1 (by analysis).

The high throughput of the FAB is due to the gradual increase in the internal bandwidth. This causes the load on the FSE's to gradually reduce as we move downstream from the input stage towards the output stage. For a load of p=1 at the inputs (i.e., the first stage), the load on the inputs of the FSE's in the second stage is only 0.5. This gradual reduction in load ensures that the FSE's in the intermediate stages deliver a high local throughput at each stage. Since the load on the FSE's decreases as we move towards the output stage, it is unnecessary to keep increasing the degree of dilation of the FSE's to the final stage.
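As a usage example, the fab_throughput sketch above can reproduce the trend of Figures 4.3 and 4.4 for the growing-dilation configuration (a hypothetical driver, not the simulator used for the figures).

```python
# Maximum throughput at full load (p = 1) for the growing-dilation
# configuration of Figures 4.3 and 4.4: L_1 = 2, L_i = L_{i-1} + 1.
for n in range(3, 8):                  # switch sizes 8 .. 128
    dc = [2 + i for i in range(n)]
    print(f"N = {2**n:4d}  DC = {dc}  R(1) = {fab_throughput(1.0, dc):.6f}")
```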
Figure 4.5. Maximum throughput of a fat-banyan switch with different dilation configurations (DC=[2,4,4,...], DC=[2,4,8,8,...] and DC=[2,4,8,12,12,...]) under independent uniform traffic with input load p=1 (by analysis).

We have investigated the effect of increasing the dilation up to a certain stage, and from there on maintaining the number of input and output links constant. Figure 4.5 plots the cell loss probability versus switch size for three dilation configurations. We see that for the configuration with a dilation of 8 from the third stage onwards, the cell loss probability is of the order of $10^{-6}$ even for a switch size of 4096. For the configuration with a dilation of 12 from the fourth stage onwards, the cell loss is of the order of $10^{-11}$. This result is comparable to that obtained by a non-blocking switch like the Knockout. Figure 4.6 illustrates the concentrator action of a section of the FAB switch.

Figure 4.6. Concentrator action of a section of the FAB switch (stage 1: $M_1$=2, $D_1$=2, $L_1$=2; stage 2: $M_2$=4, $D_2$=2, $L_2$=3; stage 3: $M_3$=6, $D_3$=2, $L_3$=4).

For the Knockout switch it has been shown that under independent uniform traffic, it is statistically unlikely that more than a specified number of packets, say L, will be destined to the same output simultaneously; for a loss probability of $10^{-6}$ the required concentrator output size is L=8, independent of the switch size N. Thus, the FAB switch achieves performance comparable to the Knockout switch, but using the banyan structure with significantly lower hardware complexity.

Figures 4.7 and 4.8 show simulation results for the 8x8 FAB switch with two different dilation configurations. With dilation configuration DC=[2,3,4] the cell loss probability is of the order of $10^{-4}$ at full load (p=1); with dilation configuration DC=[2,4,6] it is of the order of $10^{-6}$ at full load. Figure 4.9 is a plot of cell loss probability versus the switch output dilation for two different dilation configurations of a 16x16 FAB switch. For configuration 1 (DC=[2,3,4,variable]) the cell loss probability is of the order of $10^{-2}$, and for configuration 2 (DC=[2,4,6,variable]) a cell loss probability of the order of $10^{-6}$ can be achieved with a dilation of 8. For both switch configurations, a dilation greater than 8 does not result in any significant improvement in cell loss probability. Hence, for uniform traffic conditions a dilation of 8 serves to keep the cell loss probability within practical limits.

Figure 4.7. Maximum throughput (by simulation) as a function of input load for an 8x8 FAB switch with two different dilation configurations (DC=[2,3,4] and DC=[2,4,6]) under independent uniform traffic.

Figure 4.8. Cell loss probability (by simulation) as a function of input load for an 8x8 FAB switch with two different dilation configurations (DC=[2,3,4] and DC=[2,4,6]) under independent uniform traffic.
4.4.2 Non-Uniform Traffic

It is useful to study the performance of the FAB switch under certain non-uniform traffic patterns which occur in practice. By non-uniform traffic we mean that a cell requests an output address according to a non-uniform distribution. We first consider the single-source to single-destination (SSSD) non-uniform traffic pattern [23], in which each input source sends all its cells to a single destination. This type of traffic is also called point-to-point traffic; an example is broadband traffic involving a large file transfer or an image transfer.

Figure 4.9. Cell loss probability as a function of output dilation for a 16x16 FAB switch for two different stage dilation configurations under independent uniform traffic with input load p=0.9 (by simulation).

Figure 4.10. An example of a maximum conflict pattern in a 16x16 banyan network.

The performance of an 8x8 single-buffered banyan network under point-to-point traffic [7, 23] has shown that for the maximum conflict pattern (the traffic pattern with the maximum number of conflicts among all possible traffic patterns consisting of N point-to-point traffic paths involving distinct inputs and outputs, for a switch of size NxN), shown in Figure 4.10, the maximum throughput that can be achieved is 0.49. For an 8x8 multibuffered banyan network the maximum throughput for the maximum conflict pattern is only slightly greater than 0.5. This low throughput in a buffered banyan network can be attributed to the conflicts arising on the innermost shared links. However, for an 8x8 FAB switch with dilation configuration DC=[2,3,4] (or an equivalent dilated banyan switch) there is absolutely no conflict within the switch fabric, as shown in Figure 4.13, and a throughput of 1 is achieved. This is because the FAB switch has been designed to reduce internal conflicts by increasing the internal bandwidth of the banyan network.

A more general non-uniform traffic case would be a few SSSD paths embedded in a uniform traffic pattern. Here we consider the case of a single SSSD path embedded in a uniform traffic pattern [23], which could be used to model a single dedicated video channel together with several voice or data channels. Such a traffic pattern is represented by the load matrix below, in which $\lambda$ represents the fraction of uniform traffic directed to the non-SSSD outputs and $\lambda_{sssd}$ represents the point-to-point dedicated traffic between an input and the corresponding output:

$$\Lambda = \begin{bmatrix} \lambda_{sssd} & 0 & \cdots & 0 \\ 0 & \lambda & \cdots & \lambda \\ \vdots & \vdots & & \vdots \\ 0 & \lambda & \cdots & \lambda \end{bmatrix} \qquad (4.8)$$

Simulation results obtained for an 8x8 FAB switch with dilation configurations DC=[2,3,4] and DC=[2,4,6] for the above traffic pattern are shown in Figures 4.11 and 4.12. The results indicate that as the throughput of the embedded SSSD path increases, the throughput of the background uniform traffic decreases and hence the cell loss probability increases. The cell loss probability is of the order of $10^{-4}$ for dilation configuration 1. It can be improved by increasing the dilation of the FSE's: it improves to the order of $10^{-7}$, as shown in Figure 4.12, for an 8x8 FAB switch with dilation configuration 2. Therefore, with a small increase in dilation the cell loss probability is greatly improved for this non-uniform pattern.
Figure 4.11. Maximum throughput (by simulation) for an 8x8 FAB switch under mixed SSSD (point-to-point) traffic and background uniform traffic load of 1.0, for two different dilation configurations (DC=[2,3,4] and DC=[2,4,6]).

Figure 4.12. Cell loss probability results (by simulation) for an 8x8 FAB switch under mixed SSSD (point-to-point) traffic and background uniform traffic load of 1.0, for two different dilation configurations (DC=[2,3,4] and DC=[2,4,6]).

A more severe case of non-uniform traffic is the hot-spot pattern, in which many input sources try to send their cells to one destination. An example of such a situation is when several callers try to call a popular location on a telephone network. Another example is a LAN (local area network) consisting of many diskless computers and a single file server; when several computers simultaneously try to access the file server, a hot-spot situation results. The hot-spot traffic is modelled as follows: a fraction h of the cells arriving at an input is directed towards the hot-spot. If p is the arrival rate at an input of the FAB switch, then hp cells are directed to the hot-spot and (1-h)p cells are directed uniformly to the N outputs. Table 4.1 shows simulation results for the 8x8 FAB switch with dilation configuration DC=[2,3,4] under hot-spot conditions. It can be seen that the effect on cell loss is severe even at low loads. This effect can be alleviated considerably by increasing the dilation of the FSE's in the FAB switch, as illustrated in Table 4.2. However, the hot-spot pattern is a problem for any switch fabric, including the crossbar-based Knockout switch; the solution to handling hot-spot traffic lies in limiting its intensity.

Figure 4.13. Routing of a maximum conflict pattern in an 8x8 FAB switch with dilation configuration DC=[2,3,4] (stage 1: $M_1$=2, $D_1$=2, $L_1$=2; stage 2: $M_2$=4, $D_2$=2, $L_2$=3; stage 3: $M_3$=6, $D_3$=2, $L_3$=4).

Table 4.1. Maximum throughput and cell loss probability results (by simulation) for an 8x8 FAB switch with dilation configuration DC=[2,3,4] under hot-spot traffic (input load = 1.0).

  Hot-spot percentage (h)   Throughput    Cell loss probability
  0.0125                    0.99944991    5.500912e-04
  0.025                     0.99944088    5.5912e-04
  0.05                      0.99938337    6.166312e-04
  0.1                       0.99905955    9.404475e-04

Table 4.2. Maximum throughput and cell loss probability results (by simulation) for an 8x8 FAB switch with dilation configuration DC=[2,4,6] under hot-spot traffic (input load = 1.0).

  Hot-spot percentage (h)   Throughput    Cell loss probability
  0.0125                    0.99999919    8.1375e-07
  0.025                     0.99999917    8.3375e-07
  0.05                      0.99999897    1.0275e-06
  0.1                       0.99999743    2.56875e-06
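The two non-uniform patterns used above can be generated on a per-slot basis as follows. This is an illustrative sketch (function names are ours; `lam` is taken as the aggregate background load an input spreads over the non-SSSD outputs).

```python
import random

def hotspot_dest(N, h, hot=0):
    """Hot-spot model of Tables 4.1/4.2: a fraction h of arriving cells
    goes to the hot output; the remaining fraction (1 - h) is spread
    uniformly over all N outputs (the hot output included)."""
    return hot if random.random() < h else random.randrange(N)

def sssd_dest(inp, N, lam, lam_sssd, sssd_in=0):
    """Single SSSD path embedded in uniform background traffic: input
    `sssd_in` offers load lam_sssd to its dedicated output; every other
    input offers an aggregate load lam spread uniformly over the N - 1
    non-SSSD outputs.  Returns an output index, or None for an idle slot."""
    if inp == sssd_in:
        return sssd_in if random.random() < lam_sssd else None
    if random.random() < lam:
        d = random.randrange(N - 1)
        return d + 1 if d >= sssd_in else d    # skip the dedicated output
    return None
```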
4.5. Performance of the Output Buffered Fat-Banyan

In this section the performance of the output-buffered switch is assessed by analysis and simulation. The simulation results are obtained with 95% confidence intervals using the method of independent replications (refer to [49] for a discussion of this method).

4.5.1 Uniform Traffic

The FAB switch can be implemented as a pure output-buffered switch. The analysis of a non-blocking output-queued switch under independent uniform traffic has been dealt with in [50]. Here we apply the analysis to the FAB switch and validate the simulation results against the analytical results. Two parameters are required to determine the relation between cell loss and buffer size, namely the arrival rate and the buffer size. The arrival rate at an input link of an FSE at a particular stage of the FAB switch is the output link rate $O_l$ of the previous stage. For a finite number of output links per port, L, and finite buffer size, b, the output queue can be modelled as a finite-state, discrete-time Markov chain. Let $Q_m$ denote the number of cells in an output queue of the FAB switch at the end of the mth time slot, and $A_m$ denote the number of packet arrivals during the mth time slot. Then we have

$$Q_m = \min\{\max(0,\; Q_{m-1} + A_m - 1),\; b\}. \qquad (4.9)$$

The max term inside the braces ensures that the queue size is non-negative, and the overall min term ensures that the queue size can never exceed b; cells that would raise the queue size beyond b are lost. Defining the random variable A as the number of cell arrivals at an output port of the FAB switch in a given time slot, we have the arrival distribution

$$a_k = \Pr[A = k]. \qquad (4.10)$$

The state transition probabilities $P_{ij} = \Pr[Q_m = j \mid Q_{m-1} = i]$ are given by:

$$P_{ij} = \begin{cases} a_0 + a_1 & i = 0,\; j = 0 \\ a_0 & 1 \le i \le b,\; j = i - 1 \\ a_{j-i+1} & 1 \le j \le b-1,\; 0 \le i \le j \\ \sum_{m=j-i+1}^{N} a_m & j = b,\; 0 \le i \le b \\ 0 & \text{otherwise.} \end{cases} \qquad (4.11)$$

The steady-state queue size distribution can be obtained from the Markov chain global balance equations. We get

$$q_1 = \Pr[Q = 1] = \frac{(1 - a_0 - a_1)}{a_0}\, q_0$$
$$q_n = \Pr[Q = n] = \frac{1}{a_0}\left[(1 - a_1)\, q_{n-1} - \sum_{m=2}^{n} a_m\, q_{n-m}\right], \quad 2 \le n \le b \qquad (4.12)$$
$$q_0 = \Pr[Q = 0] = \left(1 + \sum_{n=1}^{b} \frac{q_n}{q_0}\right)^{-1}$$

In order to determine the cell loss rate introduced by the buffers, we need to determine the switch throughput at the buffer outputs. A cell is not transmitted from an output buffer in the mth time slot if and only if $Q_{m-1}$=0 and $A_m$=0. Denoting by $p_0$ the normalized switch throughput obtained at the buffer outputs, we have

$$p_0 = 1 - q_0 a_0. \qquad (4.13)$$

A cell emerging from the FAB switch will be lost if the output queue already contains b packets. The buffer cell success probability is obtained by dividing the normalized switch throughput at the buffer outputs by the output port rate of the FAB switch. Hence, the cell loss probability is given by

$$\Pr[\text{cell loss}] = 1 - p_0 / O_{pn}. \qquad (4.14)$$

Figure 4.14. Cell loss probability for a 64x64 FAB switch with DC=[2,4,8,8,8,8] as a function of output buffer size under independent uniform traffic with input load p=0.9.

Figure 4.14 is a plot of the cell loss, which includes the buffer loss, for DC=[2,4,8,8,8,8] and input load p=0.9. The analysis results are optimistic compared to the simulation results, but provide a good approximation for small buffer sizes.
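The balance-equation recursion (4.11)-(4.14) is straightforward to evaluate numerically. A minimal sketch follows, assuming the arrival pmf $a_k$ of (4.10) is supplied (the function name is ours).

```python
def output_queue_cell_loss(a, b, o_pn):
    """Cell loss of a finite FIFO output queue with one departure per
    slot and i.i.d. batch arrivals (pmf a[k] = Pr[A = k]), buffer size
    b, per Eqs. (4.11)-(4.14); o_pn is the port rate feeding the queue."""
    # Unnormalized steady-state probabilities from the balance equations:
    # q_1 = (1 - a_0 - a_1) q_0 / a_0, and for 2 <= n <= b
    # q_n = [(1 - a_1) q_{n-1} - sum_{m=2}^{n} a_m q_{n-m}] / a_0.
    q = [1.0, (1 - a[0] - a[1]) / a[0]]
    for n in range(2, b + 1):
        tail = sum(a[m] * q[n - m] for m in range(2, min(n, len(a) - 1) + 1))
        q.append(((1 - a[1]) * q[n - 1] - tail) / a[0])
    q0 = q[0] / sum(q)                 # normalization, cf. Eq. (4.12)
    p0 = 1 - q0 * a[0]                 # throughput at buffer output, Eq. (4.13)
    return 1 - p0 / o_pn               # Eq. (4.14)
```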
Figure 4.15 is a plot of cell loss probability (concentrator loss + buffer loss) as a function of buffer size for three different dilation configurations of a 16x16 FAB switch. The concentrator loss depends on the dilation configuration; the concentrator losses for each configuration are shown alongside the curves. The concentrator loss is highest for dilation configuration 1 (DC=[2,3,4,8]). For dilation configuration 3 (DC=[2,4,8,10]) the concentrator loss is less than $10^{-6}$ for a load of 0.9 under independent uniform traffic, and the loss curve tracks that of a lossless switch fabric such as the Knockout. The cell loss probability curves flatten out after a buffer size of 70 for dilation configurations 1 and 2 (DC=[2,4,6,8]), since beyond that point the buffer loss is negligible and the throughput is limited by the concentrator loss, which is determined by the dilation configuration.

Figure 4.15. Cell loss probability (concentrator loss + buffer loss) for three different dilation configurations of a 16x16 FAB switch as a function of output buffer size under independent uniform traffic with input load p=0.9 (by simulation). Concentrator losses of 3.52504e-03 and 1.831123e-06 are marked alongside the corresponding curves.

4.5.2 Bursty Traffic

A bursty traffic model in which each source alternates between active and idle periods is used for the performance evaluation. The durations of the active and idle periods are both geometrically distributed, with parameters $\alpha$ and $\beta$ respectively. In the active period it is assumed that cells arrive in consecutive time slots. The burst lengths are assumed to be statistically independent. The mean burst length is $L_{active} = 1/\alpha$, the mean idle duration is $L_{idle} = 1/\beta$, and the offered load is $p = L_{active}/(L_{active} + L_{idle})$. For our simulations we have assumed $L_{active}$ = 15 cells.

Figures 4.16 and 4.17 plot the cell loss versus output buffer size at two different loads, for two dilation configurations of a 64x64 FAB switch. For the switch with DC=[2,4,4,4,4,4] the cell loss probability is high due to the concentrator loss. With DC=[2,4,8,8,8,8] the concentrator loss is negligible, the overall loss is dominated by the buffer loss, and the cell loss probability improves as the output buffer size increases. As in the random traffic case, the dilation configuration DC=[2,4,8,...], which has a maximum dilation of 8, gives the best concentrator loss. Further improvement in cell loss probability is achieved by using larger buffer sizes. Truncated architectures with shared memory can be used to reduce the memory size required to achieve a given cell loss probability; this is the topic of the next chapter.

Figure 4.16. Cell loss probability (concentrator loss + buffer loss) for two different dilation configurations of a 64x64 FAB switch as a function of output buffer size under bursty traffic conditions with p=0.7 and average burst length of 15 (by simulation).

Figure 4.17. Cell loss probability (concentrator loss + buffer loss) for two different dilation configurations of a 64x64 FAB switch as a function of output buffer size under bursty traffic conditions with p=0.9 and average burst length of 15 (by simulation).
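The on-off source of Section 4.5.2 can be sketched in a few lines (the function name and the stationary choice of the initial state are our own assumptions).

```python
import random

def onoff_arrivals(alpha, beta, slots):
    """0/1 arrival sequence from the on-off model of Section 4.5.2:
    geometric active and idle periods with means 1/alpha and 1/beta,
    one cell per slot while active.  Offered load = beta / (alpha + beta)
    = L_active / (L_active + L_idle)."""
    active = random.random() < beta / (alpha + beta)   # stationary start
    seq = []
    for _ in range(slots):
        seq.append(1 if active else 0)
        if active:
            active = random.random() >= alpha          # burst ends w.p. alpha
        else:
            active = random.random() < beta            # idle ends w.p. beta
    return seq
```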
4.6. The Input-Output Queued Fat-Banyan

Input queueing in ATM switches is easy to implement; however, it suffers from serious throughput degradation due to the head-of-line (HOL) blocking effect in FIFO buffers. Throughput can be enhanced if buffering schemes other than FIFO are employed. Recently, a number of cell scheduling algorithms for input queueing were proposed to overcome the HOL blocking effect [51, 52, 53]. These algorithms are iterative and require the availability of fast scheduling hardware and the use of non-FIFO buffers. Totally eliminating HOL blocking requires each input port to allocate a distinct buffer for each output port, a technique dubbed VOQ (virtual output queueing). Obviously, the implementation of such a scheme can be very costly [2, 3].

In this section, we examine the performance of input-output queueing in the FAB switch as a means of overcoming the HOL blocking effect. In the FAB switch with input-output queueing we use simple FIFO buffers for the input queues and multiport FIFO buffers for the output queues. We use the notation Bin for the input buffer size per port and Bout for the output buffer size per port. Backpressure can be employed to enforce no cell loss in the memoryless fabric and/or output buffers. The backpressure mode of operation for these switches is discussed next, followed by its performance under uniform and bursty traffic conditions. It is shown that the backpressure technique we employ (along with small input buffers) renders the fabric lossless while achieving better cell loss performance than the output-buffered case.

4.6.1 Backpressure mode

Backpressure is a technique used to control cell loss within a switching fabric. It can be used to prevent cell loss in a buffer (or a group of buffers), or to prevent cell loss in the memoryless fabric. The former case is called buffer-capacity (or BC) backpressure and the latter is called link-capacity (or LC) backpressure. BC backpressure aims at eliminating cell loss in the output buffers; LC backpressure aims at eliminating cell loss in the memoryless interconnection fabric. In the following we focus on LC backpressure as a method for making the memoryless interconnection network of the FAB lossless. Figure 4.18 shows an example of how LC backpressure is implemented on an input-output buffered switch. In the example, six HOL cells from input queues 0, 1, 2, 3, 4, and 7 are destined to output port 0, but the output dilation of the last stage is 4. The HOL cell from input queue 3 is blocked by the top FSE at stage 2, and the HOL cell from input queue 7 is blocked by the top FSE at stage 3. Note that an FSE propagates backpressure signals on a reverse path back to the concerned input queues if it cannot route cells on its outgoing links; a sketch of this per-FSE behaviour follows Figure 4.18. This type of operation requires simple distributed control that can be easily incorporated in FSE hardware. With LC backpressure, the number of cells, k, that are switched to an output port is equal to the dilation of the FSEs in the last stage of the switch.

Figure 4.18. Concept of LC backpressure in input-output queued fat-banyan switch with N=8 and dilation of last stage FSE's = 4.
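The per-FSE control just described amounts to a bounded assignment of cells to dilated output links. The following sketch is ours (real FSEs implement this combinationally in hardware) and shows the decision for one 2x2 FSE:

```python
def fse_route(cells, dilation):
    """Route the cells arriving at one bufferless 2x2 FSE in a slot.

    Each cell requests an output side via its routing bit for this stage
    (0 = upper, 1 = lower).  At most `dilation` cells fit on each side's
    dilated links; the rest are blocked, and with LC backpressure a
    signal propagates back so blocked cells remain in (or are
    redelivered from) their input queues instead of being dropped.
    """
    routed = {0: [], 1: []}
    blocked = []
    for cell in cells:                  # cell = (routing_bit, payload)
        side = cell[0]
        if len(routed[side]) < dilation:
            routed[side].append(cell)   # forward on a free dilated link
        else:
            blocked.append(cell)        # assert backpressure for this cell
    return routed, blocked
```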
4.6.2 Uniform traffic

From Figures 4.19 and 4.20 we see that small input buffers (2-3 cells) improve the performance of the switch while eliminating the loss in the fabric. We also note that as the switch size N increases, the cell loss probability increases slightly. Hence, the LC backpressure technique further enhances the capability of the FAB switch. The modest additional complexity of adding small input buffers and using backpressure yields a reduced dilation requirement within the fabric, a lossless fabric, and better cell loss performance than the purely output-buffered case.

4.6.3 Bursty traffic

From Figure 4.21 we see that input buffers have an effect similar to the independent uniform traffic case. However, larger input buffers, in the range of 4-8 cells, are required for bursty traffic. Finally, from Figure 4.22 we see that the delay performance is not affected much by adding small input buffers, when compared to the pure output queueing case (Bin=0).

Figure 4.19. Cell loss probability for FAB switches of size 32 and 64 as a function of input buffer size per input port under random uniform traffic with offered load p=0.7 and output buffer size=32 cells (by simulation). For N=32, DC=[2,4,4,4,4]. For N=64, DC=[2,4,4,4,4,4].

Figure 4.20. Cell loss probability for FAB switches of size 32 and 64 as a function of input buffer size per input port under random uniform traffic with offered load p=0.9 and output buffer size=32 cells (by simulation). For N=32, DC=[2,4,4,4,4]. For N=64, DC=[2,4,4,4,4,4].

Figure 4.21. Cell loss probability for FAB switches of size 32 and 64 as a function of input buffer size per input port under bursty uniform traffic with average burst length=15 and offered load p=0.7 (by simulation) and Bout=256. For N=32, DC=[2,4,4,4,4]. For N=64, DC=[2,4,4,4,4,4].

Figure 4.22. Average delay versus offered load (p) for FAB switch of size 32 under bursty uniform traffic with average burst length=15. DC=[2,4,4,4,4,4], Bout=256 (by simulation).

4.7. Conclusions

The performance of the FAB switch has been shown to achieve very low cell loss probability (of the order of $10^{-6}$ or less). The cell loss is easily controlled by choosing an appropriate dilation configuration. For the purely output-buffered switch (without any input buffers), DC=[2,4,8,...] gives a cell loss probability of the order of $10^{-6}$ under independent uniform traffic conditions for switch sizes as large as 4096x4096. For the bursty traffic case with uniform output destinations the switch with DC=[2,4,8,...] again gives the best performance; however, larger output buffers are required, and the cell loss is much higher at high input loads.

Another version of the FAB switch with both input and output buffers was also considered. This version employs a backpressure technique, termed LC backpressure, to make the fabric lossless. This has the advantages of reduced dilation, a lossless fabric, and better cell loss performance than the purely output-buffered case. Also, we have shown by simulation that the delay performance is not affected by using input buffers and employing backpressure, because the input buffers are very small in size.

We also considered some examples of the non-uniform traffic case. For the case of SSSD traffic with independent uniform background traffic we have shown that a good cell loss performance can be obtained with increased dilation.
The extent of the improvement depends on the level of dilation that we are willing to accommodate before the switch becomes very expensive. With hot-spot traffic the overall cell loss probability for a given hot-spot intensity is improved by using a slightly higher dilation.

Chapter 5. Design of Scalable Shared Memory Switches

5.1. Introduction

The design of ATM switches has evolved in a number of ways over the past few years, but several key issues remain to be resolved. One major issue is scalability and modularity, because many solutions which are efficient for small switches do not scale to large ones. Output-buffered switches, such as the Knockout switch [54], provide optimal delay-throughput performance; however, they may require excessive buffering resources to handle general types of input traffic, and their expandability to larger sizes is questionable due to their high interconnection complexity. Both routing and buffering resources are severely under-utilized when the cell traffic distribution at the output ports is unbalanced. In this case, cell loss in heavily loaded output buffers will be high even if sufficient buffer space is available at other output ports [55, 56].

The "growable" ATM switch architecture proposed in [11] is a generalization of the Knockout switch that uses shared memory buffers for groups of outputs, instead of buffers dedicated to each output. This concept has two major advantages. The first is that output buffer sharing results in a smaller overall buffer memory size than the dedicated output buffer case for a given cell loss rate. The second, and more important, advantage is that the size of each shared memory module remains fixed for a given cell loss rate (under the uniform input traffic assumption) for any switch size, thus providing a truly scalable switch architecture.

In this chapter we propose and evaluate a scalable ATM switch architecture based on the FAB switch which employs shared-memory buffers as basic building blocks. The proposed architecture is completely parameterized in terms of the shared-memory size and the number of ports sharing a memory buffer. Thus, at one extreme we can have one (output) port per memory module, which is the case when no buffer sharing is employed. At the other extreme, all ports share the same memory module, which is the case of a pure shared memory switch. Our techniques therefore provide a unified methodology for designing and evaluating a wide range of ATM switches, from space-division switches with no buffer sharing to shared-memory switches. We evaluate the performance trade-offs resulting from placing shared memory buffers at the outputs of a switch. The motivation for a shared memory design strategy is twofold:

1. By taking advantage of well-developed and commercially available shared memory switching (SMS) technology, ATM switch design can be simplified to determining a suitable SMS module size and identifying a proper interconnection among the modules. In this way, switch architectures can be reusable and able to evolve as memory technology advances.

2. Employing shared memory greatly enhances buffer space utilization, allows the implementation of weighted fairness for multiple services, and supports multipriority traffic efficiently.
The simplest form of complete memory sharing suffers from unfairness caused by unbalanced logical queue lengths in the shared memory. In such a scenario, certain output ports of the SM (shared memory) switch will have queues that consume most of the memory space, causing heavy loss of cells destined to other ports. This problem can be solved by a number of buffer management schemes [8, 9]. A buffer management scheme is needed only when the SM is full (or over a given threshold). A particularly efficient scheme is the push-out buffer management policy [8]. When the SM is full, the push-out scheme selects for replacement the cell at the tail of the longest queue; the push-out policy thus guarantees fairness to all output ports. The scheme can be implemented by associating a length counter with each logical queue, such that whenever a cell enters (or leaves) the SM, the counter associated with its logical queue is incremented (decremented). The push-out policy can then identify the longest queue by a suitably fast method that inspects the counters only (for example, through fast bus arbitration). There are several other methods for buffer management [8, 9], but they all attempt to approximate the push-out scheme.
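To make the policy concrete, the following sketch of ours keeps one logical FIFO per output port and one shared cell budget; the per-queue length counters of the text are simply the queue lengths here:

```python
import collections

class SharedMemory:
    """Minimal sketch of a shared-memory module with push-out [8].

    One physical memory of `size` cells is shared by per-output logical
    FIFO queues; len(queue) plays the role of the length counter.
    """
    def __init__(self, size, n_ports):
        self.size = size
        self.queues = [collections.deque() for _ in range(n_ports)]
        self.used = 0

    def enqueue(self, cell, port):
        if self.used == self.size:
            # Memory full: push out the cell at the tail of the longest
            # logical queue to make room (if the arriving cell's own
            # queue is longest, this replaces its tail).
            longest = max(self.queues, key=len)
            longest.pop()
            self.used -= 1
        self.queues[port].append(cell)
        self.used += 1

    def dequeue(self, port):
        """Serve output `port`; returns None on an empty logical queue."""
        if not self.queues[port]:
            return None
        self.used -= 1
        return self.queues[port].popleft()
```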
In the next section we discuss the key design issues to be considered in designing ATM switches. In section 5.3 we introduce the truncated switch model and its realization using the truncated fat-banyan switch. This is followed by a discussion of the performance analysis of the truncated fat-banyan switch. Finally, we consider the input-output queued truncated fat-banyan and study its performance.

5.2. Considerations in Designing ATM Switches

Buffering is the resource that dominates both the complexity and cost of an ATM switch. This is particularly true for switches that will collaborate with network-wide congestion control schemes, with or without feedback. The recently standardized rate-based congestion control scheme requires the exchange of congestion information between the source and destination of each ABR connection. In this scheme, based on congestion information propagating from source to destination, the destination sends special resource management (RM) cells back to the source informing it whether it must increase or decrease its cell rate. The relatively long delays incurred in this type of congestion control imply that ATM switches must have sufficiently large buffers to prevent excessive cell loss for ABR connections, especially during initial bursts. In the steady state, explicit-rate algorithms are known to require small buffer sizes. Traffic management schemes that do not use feedback information demand large buffer resources. Recent experiments on implementing TCP/IP over ATM with partial or early packet discard show that a buffer space of a few thousand cells [57, 58] may be required to provide acceptable effective throughput for TCP connections. In comparison to buffers, wires and memoryless interconnection stages pose far fewer problems. Therefore, our approach is based on buffer optimization in ATM switch design. This involves several key design considerations:

1. The number of buffer stages should be minimized to reduce cell delay. If this cannot be achieved then cut-through routing should be employed. This type of routing allows cells to "bypass" empty buffers.

2. Buffer sharing should be utilized as much as possible, to reduce the amount of total buffering, to allow efficient control policies to be applied to multiple connections sharing a buffer, and to take advantage of available memory technology.

3. Effective control and scheduling techniques should be used for buffer management. This is justified by the fact that, since large buffers will dominate the cost of a switch, it is worthwhile to employ complex controllers that enhance the performance and utilization of such buffers for delay-sensitive and loss-sensitive services.

4. Interconnection fabrics must be self-routing and almost non-blocking even in the presence of unbalanced traffic. This feature is needed to restrict cell loss to buffer overflow only, and not to interconnection concentration loss.

5.3. Truncated Switch Architectures

In output-buffered switches, incoming cells are routed to the output buffers through some interconnection network. If this network is non-blocking then it can be shown that such switches provide optimal delay-throughput performance, at the expense of using larger buffers than other schemes. However, since most practical (cost-effective) fabrics are generally blocking, part of the cell loss is expected to occur in the bufferless interconnection network (this will be referred to as concentration loss). The remaining part of the cell loss occurs in the output buffers. The main drawback of this buffering scheme is that cell loss can occur in certain overloaded buffers even if plenty of buffer space is available at other outputs. As mentioned earlier, this problem can be alleviated by bundling outputs into distinct groups and assigning a shared memory switch to buffer the cells for each group of outputs, as well as perform the final routing to individual output ports. In [38], we proposed the class of truncated switch fabrics as a very effective implementation for ATM switches.

5.3.1 Generalized Truncated Switch Model

An NxN truncated switch model can be represented by the general model shown in Figure 5.1. The term "truncated" refers to the fact that the bufferless interconnection stage is derived from standard multistage networks by removing the last few stages of switching elements (SE's). The truncated output-buffered switch model consists of two major stages: the first stage is a truncated self-routing interconnect, and the second stage consists of shared-memory switching modules. The first stage concentrates traffic from the N input links into N/N' streams, each of bandwidth N'' (links). Each concentrated stream (of N'' links) from the first stage is fed directly to an N''xN' shared-memory switch. Note that switch models other than shared-memory can be used to realize the N''xN' switching task. However, shared memory switches reduce the buffer-size requirements of the last stage, and consequently of the entire switch.

Figure 5.1. The general truncated switch model.

The destination address field normally used in banyan-type networks can be divided into two subfields: the higher-order subfield contains the address of the output SM switch (SMA), and the lower-order subfield contains the output port address (SPA) within an SM. The field SMA is used to route cells up to one of the SM switches; an SM switch then uses the address field SPA to route cells to their final output destinations. A sketch of this address split is given below.
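In other words, for N' = 2^s output ports per SM the self-routing header is simply partitioned by bit position. A minimal sketch (ours):

```python
def split_address(dest, sm_port_bits):
    """Split a destination address into (SMA, SPA).

    With N' = 2**sm_port_bits output ports per shared-memory module,
    the high-order bits (SMA) steer the cell through the truncated
    banyan stages and the low-order bits (SPA) are consumed by the SM
    switch for the final routing.
    """
    sma = dest >> sm_port_bits
    spa = dest & ((1 << sm_port_bits) - 1)
    return sma, spa

# For the TFAB(8,2) of Figure 5.3 (8 ports, 2 ports per SM):
# split_address(0b101, 1) -> (0b10, 0b1), i.e. output 5 goes to
# SM number 2 (zero-based), port 1 within that module.
```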
The truncated switch model is based on the growable switch model proposed in [11]. The growable switch model generalizes the Knockout principle so that it applies to a group of outputs instead of a single output. If the arrivals on different input lines are statistically independent, then the probability of more than a handful of simultaneous packets, say N'', arriving for a group of N' outputs, where N' < N'', is exceedingly small even for arbitrarily large N. For example, a cell loss of $10^{-6}$ can be achieved with N''=33 and N'=16 for an independent uniform load of 0.9 and arbitrarily large N. The growable switch model has been realized using a Clos network topology, as shown in Figure 5.2. The first two stages of the Clos network are used as the memoryless interconnect fabric and the last stage is realized by shared-memory switching modules. Two types of losses occur in the interconnect fabric. The first is the loss due to the concentrator action of the fabric when more than m cells are destined to the same output module. The second type of loss, called the scheduling loss, is due to the use of a distributed but not truly self-routing algorithm within the interconnect fabric. Optimal centralized routing algorithms are known for the Clos network [59], but they are slow and unsuitable for fast packet switching.

Figure 5.2. The growable switch model.

5.3.2 Banyan Realization

The scheduling loss suffered in the interconnect fabric of the growable switch can be entirely avoided by using a truly self-routing interconnect such as a banyan network. However, the standard banyan is known to have a severely blocking topology. The Truncated Fat-Banyan (TFAB) switch, introduced in this section, employs self-routing through the fabric and hence avoids this loss.

The TFAB topology is derived from the FAB topology by removing one or more switching stages, starting from the last (or output) stage, and replacing the output buffers by shared-memory switches. Thus the TFAB provides an efficient realization of the truncated switch model, since it achieves the same cell loss performance but with less hardware and much less wiring. The number of stages in the TFAB switch can vary from 1 to $\log_2 N - 1$ for an NxN switch fabric. The notation TFAB(N, S-1) will be used to denote an NxN TFAB switch with S stages, the Sth stage being the shared-memory stage. The smaller the number of stages, the greater the truncation: if the number of stages is $\log_2 N - 1$ then the last stage is truncated; if the number of stages is $\log_2 N - 2$ then the last two stages are truncated, and so on. Figure 5.3 illustrates a TFAB(8,2) with dilation configuration DC=[2,3]. Routing in the truncated fabric is similar to the normal self-routing algorithm, except that only those destination bits required for routing through the truncated fabric are used; the remaining bits are used by the shared memory modules, which provide the buffering and final routing. The TFAB switch has the following features:
1. Low cost: One advantage of the TFAB interconnection network over other fabrics (such as the Knockout or crossbar-based switches) is its cost-effectiveness (according to the cost function used in chapter 3). In the TFAB network, cell concentration functions are distributed among the switching elements and are thus shared among the switch outputs. Additionally, the TFAB network has the advantage of using the same simple self-routing algorithm used with banyan network topologies.

2. QoS separation: The FSE's in the FAB are essentially bufferless, and implement a type of wormhole routing (as opposed to store-and-forward routing) that maintains cell sequence integrity. More importantly, flows from different input ports do not affect each other's delays. Although flows from different input ports can affect each other's cell loss, the amount of loss can be reduced to arbitrarily small levels through a proper setting of link dilations. Such is not the case, for example, with other multipath networks, such as the buffered Benes [1], in which different flows can affect each other's cell delay as well as cell loss. QoS separation is very important not only for achieving fairness, but also for providing QoS guarantees to the various flows crossing the switch. The TFAB employs shared memory switches in the last stage. In a shared memory switch, logical queues are maintained for the output ports it serves. The logical output queues avoid QoS coupling in the form of delay interactions between the cell streams for different output ports. Further, unfairness in the buffer occupancy of any queue (and hence buffer loss) can be avoided by using a buffer management strategy [8, 9].

3. Multicasting capabilities: The TFAB switch has excellent multicasting capability. The internal dilation allows the bufferless TFAB network to maintain 100% throughput even when the multicast traffic load offered to an output port exceeds unity. Simulation results reported in [60] for the FAB switch show that, under random uniform traffic with fanout (number of packet copies) given by a truncated geometric distribution, high throughput can be maintained even when the instantaneous output offered load is increased to 200%. Of course, large output buffers are needed under such conditions. For multicast connections, the FAB network can be viewed as composed of multiple multicast trees which share their roots and some of the early links; the tree paths become distinct as one moves towards the output ports. The TFAB switch can implement multicasting using a two-phase procedure. In the first phase, multicasting is achieved by splitting trees rooted at the input ports and performing cell replications at the appropriate internal FSE's. Because of internal dilation, a single FSE can be a member of multiple multicast trees without any conflicts. In the second phase, one of several shared-memory multicasting schemes, such as those reported in [61], can be employed.

Figure 5.3. Truncated fat-banyan (TFAB(8,2)) switch with output shared-memory modules and DC=[2,3].
5.4. Performance Analysis of the Truncated Fat-Banyan

In the following, simulation results are presented to demonstrate the efficiency of TFAB switches under various input traffic loads and conditions. The simulation results are obtained by the method of independent replications (refer to [49] for a discussion of this method) using 95% confidence intervals. The confidence intervals are very tight and are not shown in the figures.

5.4.1 Uniform Traffic

We consider a TFAB(16,3), i.e., a FAB switch with its last stage truncated. Figure 5.4 illustrates the performance curves of cell loss probability as a function of the buffer size per output port for the three different dilation configurations considered previously. We can see that the buffer size per output port is very much reduced in comparison with pure output buffering. With pure output buffering, a buffer size of approximately 57 per output (refer to Figure 4.15) is required to achieve a cell loss probability of $10^{-6}$; for the TFAB(16,3) switch the buffer size is about 30 per output, which is about half the requirement for the pure output-buffered case. However, the memory speed [5] would be much slower in this case.

Figure 5.4. Cell loss probability as a function of output buffer size for a 16x16 TFAB switch for different dilation configurations under independent uniform traffic with input load p=0.9 (obtained by simulation).

5.4.2 Bursty Traffic

This subsection examines the performance of the TFAB switch under bursty traffic conditions. The same traffic model as in chapter 4 is used for the simulations, with $L_{active}=15$ cells. The simulation results are shown in Figures 5.5 to 5.7 for three different bursty traffic loads. For each load, two different TFAB switches are considered. TFAB(16,3) is obtained by truncating the last stage of a 16x16 FAB switch and using shared-memory modules, each of which is shared among two output ports; hence, a TFAB(16,3) switch requires 8 shared-memory modules. TFAB(16,2) is obtained by truncating the last two stages and using 4 shared-memory modules, each of which is shared among four output ports. We see that in all cases the cell loss probability improves as the number of outputs sharing an output buffer increases. For example, with offered load p=0.5 and a buffer size of 60, TFAB(16,2) achieves a cell loss probability of $10^{-5}$, whereas with TFAB(16,3) the cell loss probability is only about $10^{-3}$. Hence, by adjusting the degree of truncation we can provide the required sharing of the output buffers and achieve a specified cell loss probability.

Figure 5.5. Cell loss probability (buffer loss) for two 16x16 TFAB switches with concentrator loss as a function of buffer size per output port under bursty traffic with average burst length=15 and offered load p=0.5 (by simulation).

Figure 5.6. Cell loss probability (buffer loss) for two 16x16 TFAB switches with concentrator loss as a function of buffer size per output port under bursty traffic with average burst length=15 and offered load p=0.7 (by simulation).
Figure 5.7. Cell loss probability (buffer loss) for two 16x16 TFAB switches with no concentrator loss as a function of buffer size per output port under bursty traffic with average burst length=15 and offered load p=0.9 (by simulation).

5.4.3 Unbalanced Bursty Traffic

In this traffic model the switch is assumed to be loaded with bursty traffic that is balanced at the inputs, each input carrying a load of p. The traffic pattern is unbalanced with respect to the outputs in such a way that the outputs can be divided into two groups, a "hot outlet group" and a "cold outlet group". Figure 5.8 shows cell loss probability versus buffer size per port for the case of two equal-sized destination groups (8 outputs each in this case), where the probability of a cell being destined to group 1 (the hot group) is $p_H$ and the probability of a cell being destined to group 2 (the cold group) is $p_C$. The hot outlet group is more heavily loaded than the cold outlet group and hence suffers more loss. Also, the cold outlet group benefits more from buffer sharing, due to its lighter load.

Figure 5.8. Cell loss probability (buffer loss) for a 16x16 TFAB switch for two output group sizes as a function of buffer size per output port under unbalanced bursty traffic with offered load p=0.7 and average burst length=15 (by simulation).

5.5. Truncated Fat-Banyan with Input Queues

We saw in chapter 4 that the input-output queued FAB switch with LC (link-capacity) backpressure overcomes the loss within the switch fabric. The same technique can be used to enhance the performance of the TFAB switch. Figure 5.9 illustrates the same example as in chapter 4, now for the TFAB switch. We note that in this case the HOL cell from input queue 7 is not blocked, due to the truncation of the last stage, and provided there is enough space in the shared memory module SM1 it will not be dropped. Figure 5.10 illustrates the performance of a TFAB(64,4) switch (an output grouping of 4 ports per SM). We see that in comparison to the input-output queued FAB switch the cell loss performance is greatly improved. The improvement is twofold. First, since the fabric is truncated, the cell loss in the input queues due to LC backpressure is reduced. Second, the buffer loss in the output buffers is reduced by buffer sharing.

Figure 5.9. Concept of LC backpressure in input-output queued TFAB switch.

Figure 5.10. Cell loss probability for 64x64 TFAB switch with input queueing and LC backpressure under bursty uniform traffic with average burst length=15 and offered load p=0.7 (by simulation). Bout=256.

5.6. Conclusions

This chapter introduced the truncated fat-banyan (or TFAB) switch model as a basic distribution network in a growable switch fabric. The TFAB satisfies all the properties of a growable fabric as specified in [11], and has the additional advantage of employing self-routing schemes. As far as the authors know, the TFAB was the first truly growable fabric with self-routing capability.
The chapter also introduced the link-capacity backpressure technique as a means of realizing an economical lossless version of the FAB which employs both input and output buffers. With link-capacity backpressure, cell loss can occur only in the input or output buffers.

Chapter 6. Multicasting

6.1. Introduction

Multicasting is an important service to be provided by an ATM broadband network. Many applications, such as videoconferencing and entertainment video, require the setting up of point-to-multipoint connections. An efficient multicasting architecture is hence essential in ATM switches to set up such point-to-multipoint connections.

A multicast ATM switch can be built by cascading a copy network with a point-to-point switching network, as shown in Figure 6.1. The copy network creates multiple copies of the input packets and delivers them to the inputs of the point-to-point switching network which follows it. The copy network provides a fast mechanism for replicating packets without worrying about the routing to the output destinations; the point-to-point switching network does the final routing to the appropriate destinations. At the output of the copy network, packets of a particular multicast connection have to be distinguished so that the subsequent point-to-point switch can route them to the appropriate outputs. This is achieved by using an index reference (IR) [7]. The packets of different multicast connections are distinguished by a broadcast channel number (BCN). During call setup, the point-to-point switch output for a particular BCN and IR is stored at each output of the copy network. Efficient ways of reducing the memory requirements for storing this information at each output of the copy network are proposed in [62].

Figure 6.1. A multicast packet switch consisting of a copy network and a point-to-point switch.

This chapter presents an efficient copy network architecture. Essentially, the previously proposed fat-banyan (FAB) [35] is used as a copy network. When compared to other banyan-based copy networks [62, 34, 63], the FAB switch can implement multicasting with minimal delay and at reasonable cost. In particular, the FAB network has superior delay performance compared to banyan networks with deflection routing. Liew [63] showed how to enhance cell loss performance by adding more stages to the banyan. For example, for a 128x128 banyan to achieve a cell loss of $10^{-4}$ with an output offered load of 0.6 and a mean fanout of 2 at the inputs, about 16 stages are required (versus 7 in the standard banyan), and a speedup of 2 is required in Liew's copy network to achieve a cell loss of the order of $10^{-6}$. Lee [34] proposed a nonblocking, self-routing copy network with constant latency. In Lee's method, building a copy network with $2^M$ inputs and $2^N$ outputs (N > M) requires $2^{N-M}$ banyan networks, each with M stages; in addition, $2^N$ (N-M)-stage binary trees are required. Thus, for Lee's network to handle packet overflows (i.e., a total number of copies in excess of the number of output ports), considerable hardware is required; alternatively, Lee's architecture would require speedup to handle packet overflows.

In section 6.2 we review the multicasting techniques proposed in the literature. In section 6.3 we present the performance results of the FAB copy network under multicast traffic, followed by conclusions in section 6.4.
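To make the BCN/IR mechanism concrete, the translation state at one copy-network output can be pictured as a small table populated at call setup. The sketch below is ours and elides everything except the lookup:

```python
class CopyOutputTranslator:
    """Translation at one copy-network output (IR/BCN concept of [7]).

    At call setup, the destination port of the point-to-point switch is
    stored for each (BCN, IR) pair; at cell time, a replicated cell's
    broadcast channel number and index reference select that port.
    """
    def __init__(self):
        self.table = {}

    def add(self, bcn, ir, out_port):
        # performed once per branch of the multicast tree at call setup
        self.table[(bcn, ir)] = out_port

    def lookup(self, bcn, ir):
        # performed per replicated cell before it enters the
        # point-to-point switch
        return self.table[(bcn, ir)]
```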
6.2. Multicasting in the FAB Switch

Basically, two approaches have been proposed in the literature for the replication of packets in multistage interconnection networks: the balanced tree approach and the interval splitting approach. In both approaches a multicast packet has a copy number field which stores the number of copies (CN) desired for that packet. In the balanced tree approach proposed in [7], the first duplication of the packet happens at stage $i = 1 + \log_2 N - \lceil \log_2 CN \rceil$ (CN > 1). The CN of the upper output packet is set to $\lfloor CN/2 \rfloor$ and the copy number of the lower output packet is set to $\lceil CN/2 \rceil$, or vice versa. Here, contention for the output links of a switching element (SE) can be dealt with by buffering one of the packets [7] and allowing the other packet to be duplicated, or by dropping one of the packets. Dropping of packets may be acceptable only if the cell loss of the multicast packet copies can be kept very low.

The balanced tree approach is illustrated in Figure 6.2. Input 3 receives a multicast packet with 6 copies to be generated. In the first stage (stage 0) the packet splits into two, with the CN of each copy set to 3. In the next stage each of the packet copies generated in the previous stage is split again, with the CN of the upper packet set to 1 and the CN of the lower packet set to 2. In the last stage, packets with CN=2 are split, while packets with CN=1 are arbitrarily sent to either the upper or lower output.

Figure 6.2. Balanced tree approach to copy number division per stage.

The second approach is the output interval assignment approach, illustrated in Figure 6.3. In this approach a multicast packet with copy number CN is assigned an output interval (Min, Max), where Min and Max are the minimum and maximum of the output interval. The packet copies are to be sent to the outputs of the copy network specified by this output interval. The scheme works as follows. Let the binary representation of the address interval be $(m_1 m_2 \ldots m_n,\ M_1 M_2 \ldots M_n)$. At any stage $k$, bits $m_k$ and $M_k$ are examined and routing decisions are made as follows:

1. If $m_k = M_k = 0$, then send the packet on the upper link; if $m_k = M_k = 1$, then send the packet on the lower link.

2. If $m_k = 0$ and $M_k = 1$, then duplicate the packet and send it out on both links, with the headers of the duplicated packets modified as follows:

a. For the packet sent out on the upper link, Min remains unchanged, and Max is set to $M_1 \ldots M_{k-1}\,0\,1 \ldots 1$.

b. For the packet sent out on the lower link, Max remains unchanged, and Min is set to $M_1 \ldots M_{k-1}\,1\,0 \ldots 0$.

It should be noted that when splitting occurs the new intervals are contiguous, and together they cover the original interval. In our work we have used the output interval assignment approach to packet replication; a sketch of the per-stage decision rule is given below. The output address intervals are assigned randomly in the range $(i,\ i + CN - 1)$, where $i \le N - CN$. Also, the first splitting of a packet is delayed as long as possible and occurs at stage $i = 1 + \log_2 N - \lceil \log_2 CN \rceil$.

Figure 6.3. Output interval assignment approach with examination of two bits at each stage.
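The two routing rules translate directly into bit manipulations on (Min, Max). The following sketch is ours; it ignores contention for the dilated links and assumes stage k examines bit k counted from the MSB:

```python
def route_interval(min_addr, max_addr, stage, n_bits):
    """One interval-splitting decision (rules 1 and 2 of section 6.2).

    Addresses are n_bits wide; returns a list of (link, min, max)
    tuples with one entry, or two entries when the packet is
    duplicated (link 0 = upper, link 1 = lower).
    """
    shift = n_bits - 1 - stage
    m = (min_addr >> shift) & 1
    M = (max_addr >> shift) & 1
    if m == M:                                   # rule 1: no split
        return [(m, min_addr, max_addr)]
    # rule 2: m = 0, M = 1 -> duplicate; the new intervals are
    # contiguous and together cover the original one
    upper_max = min_addr | ((1 << shift) - 1)    # Min kept, Max = ..0 1..1
    lower_min = (max_addr >> shift) << shift     # Max kept, Min = ..1 0..0
    return [(0, min_addr, upper_max), (1, lower_min, max_addr)]
```

Applied to the interval (000, 101) of Figure 6.3 with n_bits=3, stage 0 yields (000, 011) on the upper link and (100, 101) on the lower link, matching the splits shown in the figure.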
6.3. Multicast Performance

In this section we highlight the performance of the FAB network as a copy network. Multicast traffic is modelled using a geometric inter-arrival distribution, with the number of packet copies described by a (truncated) geometric distribution. The following relations then hold. Let

p = input offered load,
$q(Y_i = y)$ = probability that the number of copies requested by an incoming packet is y,
$p(X_i = x)$ = probability that the number of copies generated is x.

Then

$$
p(X_i = k) =
\begin{cases}
1 - p & k = 0 \\
p\, q(Y_i = k) & k = 1, 2, \ldots
\end{cases}
\tag{6.1}
$$

Assuming that $Y_i$ is distributed according to the truncated geometric distribution with parameter q, we have

$$q(Y_i = k) = \frac{(1-q)\,q^{k-1}}{1 - q^N}, \qquad 1 \le k \le N. \tag{6.2}$$

Note that N is the maximum number of allowable copies, since an incoming packet can multicast to at most all of the switch outputs. The average number of copies per incoming packet is

$$E(Y_i) = \frac{1}{1-q} - \frac{N q^N}{1 - q^N}. \tag{6.3}$$

The output offered load is given by

$$\rho_{out} = p\, E(Y_i). \tag{6.4}$$

With these traffic parameters we obtain the simulation results, with 95% confidence intervals, using the method of independent replications. In Figure 6.4 the cell loss probability is plotted for a 128x128 FAB network as a function of output offered load with mean fanout $E(Y_i)=2$, for different dilation configurations. For a given load, the cell loss probability improves as the dilation increases. With an output offered load of 0.6, a cell loss probability below $10^{-4}$ can be achieved with dilation configuration DC=[2,3,4,5,5,5,5]. In [63], where the extended shuffle exchange network is used, about 16 stages (an extra 9 stages) are required to achieve a similar cell loss at the same output offered load. The extra stages, however, do not solve the problem of output request overflow, which occurs when the sum of the packet copies exceeds N. The FAB copy network is designed to handle this kind of overflow by gradual internal bandwidth expansion. Figure 6.5 shows the case when the output offered load is greater than or equal to 1. It can be seen that with configuration DC=[2,4,8,8,8,8,8] the cell loss for an output offered load of 1 is of the order of $10^{-6}$, and when the output offered load is 2 it is of the order of $10^{-4}$.

Figure 6.4. Simulation results for a 128x128 copy network: cell loss probability versus output offered load for different dilation configurations with mean fanout=2.

Figure 6.5. Simulation results for a 128x128 copy network: cell loss probability versus output offered load under overload conditions for different dilation configurations with mean fanout=2.
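As an aside, the fanout law (6.2) is straightforward to sample, and the parameter q for a desired mean fanout can be found by inverting (6.3) numerically. The helper names below are ours:

```python
import math
import random

def sample_fanout(q, N):
    """Draw a copy count Y from the truncated geometric law (6.2)
    by rejection: geometric on {1, 2, ...}, truncated to [1, N]."""
    while True:
        u = 1.0 - random.random()                 # u in (0, 1]
        y = 1 + int(math.log(u) / math.log(q))    # geometric inverse transform
        if y <= N:
            return y

def fanout_parameter(mean, N, tol=1e-9):
    """Find q such that E(Y) of equation (6.3) equals `mean`.

    Bisection on q in (0, 1); E(Y) is increasing in q, so `mean` must
    lie between 1 and (N + 1)/2.
    """
    lo, hi = tol, 1.0 - tol
    while hi - lo > tol:
        q = 0.5 * (lo + hi)
        ey = 1.0 / (1.0 - q) - N * q**N / (1.0 - q**N)
        lo, hi = (q, hi) if ey < mean else (lo, q)
    return 0.5 * (lo + hi)
```

For the experiments above, `fanout_parameter(2.0, 128)` gives the value of q corresponding to a mean fanout of 2 in a 128x128 copy network.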
6.4. Conclusion

We have presented an efficient copy network for multicasting based on the fat-banyan architecture. A unique feature of this architecture is that it can support total packet replications greater than N. With a maximum dilation of 5 the copy network achieves a cell loss probability of $10^{-4}$ for an output offered load of 0.6 for a 128x128 switch. With a maximum dilation of 8 the cell loss probability is below $10^{-6}$ even when the output offered load is 1. In terms of latency the FAB copy network performs very well, since the FSE's in any stage examine only one bit each of the output multicast address range bits, [Min, Max], which are carried in the header of the multicast cell.

Chapter 7. Conclusions and future research areas

7.1. Conclusions

In this thesis we have provided an in-depth investigation of banyan-based packet switches. We proposed a new generalized banyan model, called the fat-banyan (FAB) [35, 64] switch model, characterized by a flexible interstage dilation scheme that can be used to optimize switch cost with respect to internal bandwidth utilization. The optimization of internal bandwidth leads to significantly reduced interchip/interboard interconnection, as was shown in Table 3.1. The main features of the FAB switch are 1) amenability to efficient hardware implementation (Table 3.10), 2) modularity, and 3) scalability to very large capacities, in the range of terabits/sec, as discussed in chapter 3. The performance of the switch is comparable to that of the Knockout switch under uniform, non-uniform [36, 37] and bursty traffic patterns. Another important feature of the FAB switch is that cells from different connections do not affect each other's delays (i.e., no QoS coupling).

The FAB also provides a very cost-effective and flexible realization of growable switches [11]. By adjusting the depth of the fabric (truncating the fabric), the desired number of outputs (in multiples of d for a FAB switch made with dxd switching elements) can be grouped. For each group of outputs, shared memory switches (or any other existing switches) can be used to provide a scalable switching fabric based on the growable switch concept. The truncated fabric concept [38, 39] introduced in this thesis is based on designing a fabric which is highly optimized with respect to both bandwidth and buffer utilization. It should be noted that the FAB model is a generalized model for designing switches optimized for bandwidth and buffer utilization; the concept can be applied to any multistage interconnection network (e.g. fat-tree, Benes network, Clos network).

The output-buffered FAB switch provides a cost-effective way of realizing a space-division switch with performance comparable to the Knockout switch, and the TFAB switch provides a cost-effective architecture for realizing the growable switch. For large-scale ATM switch implementations it is desirable to reduce the interconnect complexity as much as possible while maintaining minimal cell loss through the switch. Hence, the FAB and TFAB switches were further enhanced by using input-output buffering and backpressure to obtain a lossless fabric [39]. The backpressure technique reduces the dilation required within the fabric to obtain a specified cell loss probability, further reducing the chip count and interconnection complexity of the switching fabric (refer to Table 3.10). The LC backpressure technique can be applied to any switching fabric to obtain a lossless fabric.

In order to support applications like telephone conferencing, video conferencing and downloading of video programs, ATM provides a service called point-to-multipoint connections. This requires the ATM switch to create multiple copies of incoming multicast cells.
In chapter 6 we proposed an efficient cell replication method for point-to-multipoint connections, using a copy network based on the FAB architecture [60]. The copy network achieves very low cell loss even at very high output offered loads, as shown in chapter 6. The FAB copy network performs multicasting with minimal delay and reasonable cost in comparison to the extended shuffle exchange network with deflection routing. Moreover, it is capable of producing copies in excess of the number of input/output ports, due to its gradual incrementing of internal bandwidth. In contrast, the shuffle exchange network cannot handle the case where the number of copies requested is greater than the number of input/output ports.

7.2. Future research areas

The work in this thesis can be extended to yield some very interesting research topics that would further enhance the design of ATM switches. As a continuation of the work in this thesis, the following areas can be investigated in future research.

Efficient support of multicasting within the switch fabric

We have investigated the use of a copy network architecture for replicating cells of point-to-multipoint connections. For large-scale switching systems it may be more efficient to incorporate this mechanism within the switching fabric. Efficient mechanisms are needed to control the replication of cells within the fabric in order to maintain good switch performance. The recursive copy generation technique [32, 65] has been used in other switch fabrics to combine copying and routing of multicast cells. This technique is based on generating a limited number of copies inside the fabric, and then recycling some output copies of a cell at the input of the switching fabric such that the input lines are equally loaded. Further study should establish the feasibility of including this feature in the FAB switch. For the TFAB switch, an approach based on combining recursive copy generation with one of the techniques proposed in [61] for multicasting in shared-memory switches may prove efficient.

Efficient support of multicasting of ABR/UBR traffic in an ATM switch

In the ATM environment, CBR and VBR traffic are subject to connection admission control (CAC). The multicasting of this kind of traffic is relatively simple compared to traffic classes like ABR and UBR. ABR is based on rate-based flow control with a Minimum Cell Rate (MCR) guarantee, while with UBR there are no QoS guarantees. Research work is required to determine how to efficiently multicast ABR and UBR traffic; for instance, in a point-to-multipoint ABR connection, what is the best
The randomization in the first half will probably lead to a decrease in dilation requirement (hence cost) of the second half (FAB network) to achieve a specified cell loss probability.  Enhancing fabric performance to deal with multi-QoS traffic The switch fabrics which have been considered in the literature are generally insensitive to the QoS requirement of the traffic traversing the fabric. In order to maintain the QoS of different streams through the switch there is a need to enhance the switch fabric design to be sensitive to the QoS of the traffic. For a memoryless fabric like the FAB switch this would require enhancing the arbitration mechanism in the switching elements. This feature will allow higher priority cells to be given preference in case of conflict.  100  Appendix:  A :  List of standard  acronyms  AAL  A T M adaption layer  ABR  Available bit rate  ATM  Asynchronus transfer mode  BISDN  Broadband integrated services digital network  CAC  Connection admission control  CBR  Constant bit rate  FIFO  First infirstout  GFC  Generic flow control  HEC  Header error control  IP  Internet protocol  LAN  Local area network  MCR  Minimum cell rate  NNI  Network network interface  OAM  Operation and maintainance  OC-n  Optical carrier signal  PT  Payload type  QoS  Quality of service  RM  Resource management  TCP  Transport control protocol  UBR  Unspecified bit rate  UNI  User Network Interface  VBR  Variable bit rate  VCI  Virtual circuit identifier  VPI  Virtual path identifier  WAN  Wide Area Networking  101  Bibliography  Bibliography [I] T. Chaney, J. A. Fingerhut, M. Flucke, and J. S. Turner, "Design of a gigabit ATM switch," INFOCOM, 1997.  [2] N. McKeown, M. Izzard, A. Mekkittikul, W. Ellersick, and M. Horowitz, "Tiny Tera: A packet switch core," IEEE Micro, vol. 17, pp. 26-33, Jan. 1997. [3] H. Duan, J. W. Lockwood, S. M. Kang, and J. D. Will, "Ainy High performance oc-12/oc-48 queue design prototype for input-buffered ATM switches.," IEEE INFOCOM, 1997. [4] H. Ahmadi and W. E. Denzel, "A survey of modern high-performance switching techniques," IEEE Selected Areas Commun., pp. 1091-1103, 1989.  [5] J. Garcia-Haro and A. Jajszczyk, "ATM shared-memory switching architectures," IEEE Network, pp. 18-26, July 1994. [6] F. A. Tobagi, "Fast packet switch architectures for broadband integrated services digital networks," Proceedings of the IEEE, vol. 38, pp. 133-167, Jan. 1990. [7] J. S. Turner, "Design of a broadcast packet switching network," IEEE Trans. Commun., vol. 36, pp. 734-743, June 1988. [8] A. K. Choudhury and E. L. Hahne, "Space priority management in a shared memory ATM switch," IEEE Globecom, vol. 3, pp. 1375-1383, Dec. 1993. [9] A. K. Choudhury and E. L. Hahne, "Dynamic queue length thresholds in a shared memory ATM switch," IEEE INFOCOM, pp. 679-687, 1996. [10] H. Zhang, "Service disciplines for guaranteed performance service in packet-switching networks." Proceedings of the IEEE, vol. 83, pp. 1374-1396, Oct. 1995. [II] K. Y. Eng, M. J. Karol, and Y. Yeh, "A growable packet (ATM) switch architecture: Design principles and applications," IEEE Transactions on Communications, vol. 1, pp. 423-430, Feb. 1992. [12] M. Devault, J. Cochennec, and M. Servel, "The "prelude" ATD experiment: Assessments and future prospects," IEEE Journal on Selected Areas in Communications, pp. 1528-1537,  1988. 102  Bibliography  [13] N. Endo, "Shared buffer memory switch for an ATM exchange," IEEE Transactions on Communications, vol. 41, pp. 237-245, Jan. 1993. [14] Y. He, H. M. 
[14] Y. He, H. M. Alnuweiri, and M. Ito, "Enhancing ATM switch design with shared-queue tail memory," HiNet'96 Workshop, International Parallel Processing Symposium, Apr. 1996.
[15] S. X. Wei and V. P. Kumar, "On the multiple shared memory module approach to ATM switching," IEEE INFOCOM, pp. 116-122, 1992.
[16] R. O. Onvural, Asynchronous Transfer Mode Networks: Performance Issues. Artech House, 1994.
[17] C. Fayet, A. Jazques, and G. Pujolle, "High speed switching for ATM: the BSS," Computer Networks and ISDN Systems, pp. 1225-1233, 1994.
[18] M. J. Karol, M. G. Hluchyj, and S. P. Morgan, "Input versus output queueing on a space-division packet switch," IEEE Trans. Commun., pp. 1347-1356, 1987.
[19] C. P. Kruskal and M. Snir, "The performance of multistage interconnection networks for multiprocessors," IEEE Transactions on Computers, vol. C-32, pp. 1091-1098, Dec. 1983.
[20] E. T. Bushnell and J. S. Meditch, "Dilated multistage interconnection networks for fast packet switching," IEEE INFOCOM, pp. 1264-1273, 1991.
[21] T. T. Lee and S. C. Liew, "Broadband packet switches based on dilated interconnection networks," IEEE Transactions on Communications, vol. 42, pp. 732-744, Feb. 1994.
[22] Y. C. Jenq, "Performance analysis of a packet switch based on single-buffered banyan networks," IEEE Journal on Selected Areas in Communications, vol. SAC-1, pp. 1014-1021, Dec. 1983.
[23] H. S. Kim and A. Leon-Garcia, "Performance of buffered banyan networks under nonuniform traffic patterns," IEEE Trans. Commun., vol. 38, pp. 648-658, May 1990.
[24] Y. Mun and H. Y. Youn, "Performance analysis of finite buffered multistage interconnection networks," IEEE Trans. Computers, vol. 43, pp. 153-162, Feb. 1994.
[25] H. Yoon, K. Y. Lee, and M. T. Liu, "Performance analysis of multibuffered packet-switching networks in multiprocessor systems," IEEE Trans. Computers, vol. 39, pp. 319-327, Mar. 1990.
[26] F. A. Tobagi, T. Kwok, and F. M. Chiussi, "Architecture, performance, and implementation of the tandem banyan fast packet switch," IEEE Journal on Selected Areas in Communications, vol. 9, pp. 1173-1193, Oct. 1991.
[27] S. Urushidani, "Rerouting network: A high-performance self-routing switch for B-ISDN," IEEE Journal on Selected Areas in Communications, vol. 9, pp. 1194-1204, Oct. 1991.
[28] A. Huang and S. Knauer, "Starlite: a wideband digital switch," Proc. GLOBECOM '84, pp. 121-125, Dec. 1984.
[29] M. J. Narasimha, "The Batcher-banyan self-routing network: Universality and simplification," IEEE Transactions on Computers, vol. 36, pp. 1175-1198, Oct. 1988.
[30] J. Giacopelli, M. Littlewood, and W. D. Sincoslie, "Sunshine: A high performance self-routing broadband packet switch architecture," IEEE Journal on Selected Areas in Communications, vol. 9, pp. 1289-1298, Oct. 1991.
[31] J. F. Lin and S. D. Wang, "High performance low-cost nonblocking switch for ATM," IEEE INFOCOM, pp. 7a.3.1-7a.3.4, 1996.
[32] J. S. Turner, "An optimal nonblocking multicast virtual circuit switch," IEEE INFOCOM, June 1994.
[33] P. Harubin, S. Chowdhry, and B. Sengupta, "The design of a distributor in a large ATM switch," IEEE GLOBECOM, pp. 2080-2086, 1995.
[34] T. T. Lee, "Nonblocking copy networks for multicast packet switching," IEEE Journal on Selected Areas in Communications, vol. 6, pp. 1455-1467, Dec. 1988.
[35] M. Alimuddin, H. M. Alnuweiri, and R. W. Donaldson, "The fat-banyan ATM switch," IEEE INFOCOM, pp. 659-666, Apr. 1995.
Donaldson, "A unified approach to the design of dilated banyan switches," International Conference on Communications, pp. 1132-1136, June 1995. [37] M. Alimuddin, H. M. Alnuweiri, and R. W. Donaldson, "Performance of the fat-banyan  104  Bibliography  switch under non-uniform traffic," IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, pp. 82-85, May 1995. [38] M. Alimuddin, H. M. Alnuweiri, and R. W. Donaldson, "Efficient design of scalable sharedmemory packet switches," International Conference on Communications, pp. 1132-1136, June 1996. [39] M. Alimuddin and H. M. Alnuweiri, "Design and evaluation of scalable shared-memory ATM switches," IEICE Journal of Communications, accepted for publication. [40] T. H. Theimer, E. P. Rathgeb, and M. N. Huber, "Performance analysis of buffered banyan networks," IEEE Trans. Commun., vol. 39, pp. 269-277, Feb. 1991. [41] N. Mirfakhraei and Y. Tang, "Performance analysis of Benes networks under nonuniform traffic," International Conference on Communications, pp. 1669-1673, 1996. [42] T. Szymanski and S. Shaikh, "Markov chain analysis of packet-switched banyans with arbitrary switch sizes, queue sizes, link multiplicities and speedups," IEEE INFOCOM, pp. 960-971, 1989. [43] F. T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays-TreesHypercubes. Morgan Kaufmann Publishers, Inc., 1992. [44] E. W. Zegura, "Architectures for ATM switching systems.," IEEE Communications Magazine, Feb. 1993. [45] H. Yeh, M. G. Hluchyj, and A. S. Acampora, "The knockout switch: A simple, modular architecture for high-performance packet switching," IEEE Journal on Selected Areas in Communications, vol. 5, pp. 1274-1283, Oct. 1987. [46] M. D. Santos, P. M. M. Smith, and L. E. Moser, "A protocol simulator for the Thunder and Lighting ATM network," Proceedings of COM'96., pp. 28-31, May 1996. [47] J. H. Patel, "Performance of processor-memory interconnections for multiprocessors," IEEE Transactions on Computers, vol. C-30, pp. 771-780, Oct. 1981. [48] T. H. Szymanski and V. C. Hamacher, "On the permutation capability of multistage interconnection networks," IEEE Transactions on Computers, pp. 810-822, 1987.  105  Bibliography  [49] A. M. Law and W. D. Kelton, Simulation Modeling and Analysis. McGraw-Hill, Inc., 1991. [50] M. G. Hluchyj and M. J. Karol, "Queueing in high-performance packet switching," IEEE Journal on Selected Areas in Communications, vol. 6, pp. 1587-1597, Dec. 1988. [51] H. Obara, "Optimum architecture for input queueing ATM switches," IEEE INFOCOM, pp. 110-115, 1992. [52] M. J. Karol, K. Y. Eng, and H. Obara, "Improving the performance of input-queued ATM packet switches," IEEE INFOCOM, pp. 110-115, 1992. [53] N. Mckeown, V. Anantharam, and J. Warland, "Achieving 100% throughput in an inputqueued switch," IEEE INFOCOM, 1996. [54] H. Yoon, M. T. Liu, and K. Y. Lee, "The Knockout switch under nonuniform traffic," GLOBECOM, pp. 1628-1634, 1988. [55] H. S. Kim, "Design and performance of multinet switch: A multistage ATM switch architecture with partially shared buffers," IEEE/ACM Transactions on Networking, vol. e, pp. 571-580, Dec. 1994. [56] H. M. Alnuweiri, H. Aljunaidi, and R. Beraldi, "The buffered fat tree ATM switch," Globecom, 1995. [57] A. Romanow, "TCP over ATM: Some performance results," ATM Forum Traffic Management Sub-Working Group report, July 1993. [58] A. Romanow and S. Floyd, "Dynamics of TCP over ATM networks," IEEE Journal of Selected Areas in Commn., May 1995. [59] M. J. Karol and I. C. 
Lin, "Performance analysis of a growable architecture for broad-band packet (ATM) switching," IEEE Transactions on Communications, vol. 40, pp. 431-439, Feb. 1992. [60] M. Alimuddin, H. M. Alnuweiri, and R. W. Donaldson, "Efficient multicast copy network,"  submitted to IEEE International Workshop on Broadband Switching Symposium, 199 [61] S. Kumar and D. P. Agarwal, "On multicast support for shared-memory-based ATM switch architecture," IEEE Network, pp. 34-39, Jan. 1996.  106  Bibliography [62] J. S. Turner, "A practical version of lee's multicast switch architecture," IEEE Transactions on Communications, vol. 41, pp. 1166-1169, Aug. 1993. [63] S. C. Liew, "A general packet replication scheme for multicasting with application to shuffleexchange networks," IEEE Transactions on Communications, vol. 44, pp. 1021-1033, Aug. 1996. [64] M. Alimuddin, H. M. Alnuweiri, and R. W. Donaldson, "A unified approach to the design of large-scale banyan-based packet switches," IEEE Trans, on Comm., submitted. [65] F. Sestini, "Recursive copy generation for multicast ATM switching," IEEE/ACM Transactions on Networking, vol. 5, June 1997.  107  
