UBC Theses and Dissertations
Design and evaluation of a high performance multi-priority multicast ATM switch Lam, Patrick 1999


Design and Evaluation of a High Performance Multi-Priority Multicast ATM Switch

By Patrick Lam
B.Sc., The University of California, Irvine, 1993

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Applied Science in the Faculty of Graduate Studies, Electrical Engineering

We accept this thesis as conforming to the required standard

The University of British Columbia
October, 1999
© Patrick Lam, 1999

Abstract

Asynchronous Transfer Mode (ATM) is widely believed to be the standard protocol for the extremely demanding high-speed networking field. The switching technologies employed in ATM cell switches have been extensively researched in recent years. However, most switches developed to date lack either performance or cost-efficiency. Furthermore, most published switching research is based on a uniform incoming cell traffic pattern, which is very different from real-time traffic. Real-time traffic is not only bursty; it also involves multiple classes of prioritized traffic, as well as multicast traffic.

In this thesis, a high-performance and cost-effective ATM switch architecture is introduced. The switching architecture is based on two existing technologies, namely the Skew Round Robin scheduling and Virtual Output Queuing schemes. These two schemes have proved to be simple and high-performing under a uniform traffic pattern [16]. Simulation results show that with small modifications to these schemes, a switch can perform extremely well under many kinds of real-time traffic patterns, including multi-priority and multicast traffic. In addition, it is shown that with the proposed switching architecture, the cell loss ratio can be made arbitrarily low, with finite buffer size and bounded delay, even under a bursty traffic pattern.

Table of Contents

Abstract ii
Table of Contents iii
List of Figures viii
List of Tables xiii
Acknowledgement xiv
Chapter 1  Introduction 1
1.1  Asynchronous Transfer Mode (ATM) 3
1.2  Traffic Classes in ATM 6
1.2.1  CBR Traffic 6
1.2.2  VBR Traffic 7
1.2.3  UBR Traffic 7
1.2.4  ABR Traffic 8
1.3  Multi-class Priority Traffic 8
1.4  Multicasting Traffic 10
1.5  Uniform and Bursty Traffic 10
1.6  ATM Cell Switching 11
1.7  Motivation of this thesis 15
1.7.1  Increasing Information Rates 16
1.7.2  Multicasting Capability 16
1.7.3  Switch Performance 17
1.7.3.1  Throughput 18
1.7.3.2  Connection Blocking 18
1.7.3.3  Cell Loss Rate (CLR) 19
1.7.3.4  Switching Delay 19
1.8  Contribution of the thesis 19
1.9  Organization of the thesis 22
Chapter 2  ATM Switch Queuing Strategies 23
2.1  Routing Schemes 24
2.1.1  Centralized Routing 24
2.1.2  Self Routing 26
2.2  Buffering Strategies 28
2.2.1  Input Buffering 29
2.2.2  Output Buffering 37
2.2.3  Internal (Inter-stage) Buffering 41
2.3  Comparison between Buffering Strategies 41
2.3.1  Preventing Internal Cell Collisions 43
2.3.1.1  Switch Speedup 44
2.3.1.2  Flow Control (Backpressure) 44
2.3.1.3  Virtual Output Queuing 45
2.3.2  Handling of Multicasting Traffic 46
2.3.3  Concern of Buffer Size Requirement 46
Chapter 3  Survey of Switch Architecture 49
3.1  The Batcher-Banyan Switch 50
3.2  The Tiny Tera Switch 55
3.3  Illinois Pulsar-based Optical Interconnect (iPOINT) Switch 61
3.4  The Hitibas Switch 65
3.5  Conclusion of this chapter 71
Chapter 4  Architecture of the Proposed Input Buffered Switch 73
4.1  A Brief Introduction to our Proposed Input Buffered Switch 73
4.2  The Multi-class Priority Arbitrator 77
4.3  The Switching Network System 78
4.4  The Input Buffering System 79
4.5  The Output Buffering System 87
4.6  The Algorithm of the Switching Process 88
Chapter 5  The Simulation and the Results 92
5.1  The Simulation 92
5.2  Simulation Results Obtained From the Referenced Models 95
5.2.1  The Basic FIFO Input Buffering Scheme 95
5.2.2  The Hitibas Scheme with Grouping Factor 96
5.3  Effect of Degree of Dilations 98
5.4  Effect of Group Size 100
5.5  Simulation Results from Our Proposed Switch 103
5.5.1  Single-class Traffic 103
5.5.1.1  Effect of Offered Loads under Uniform Traffic 104
5.5.1.2  Effect of Offered Loads under Bursty Traffic 107
5.5.1.3  Effect of Buffer Size under Uniform Traffic 110
5.5.1.4  Effect of Buffer Size under Bursty Traffic 113
5.5.2  2-priority-class Traffic 115
5.5.2.1  Effect of Offered Load under Uniform 2-priority-class Traffic 116
5.5.2.2  Effect of Offered Load under Bursty 2-priority-class Traffic 119
5.5.3  2-priority-class Multicast Traffic 124
5.5.3.1  Effect of Offered Load under Bursty 2-priority-class Multicast Traffic 125
Chapter 6  Conclusion 131
6.1  Summary 131
6.2  Major Results 133
6.3  Future Works 135
References 136
Appendix A  A Situation When Number of Retrials is allowed to exceed G in MSRR 142
Appendix B  List of Acronyms

List of Figures

Figure 1.1  ATM cell header formats at the UNI and NNI 4
Figure 1.2  ATM Services (both ITU-T and ATMF standards are shown) and AAL Types 6
Figure 1.3  Simplified VPC switching, with VCIs kept unchanged 14
Figure 2.1  Example of Routing Table and Centralized Routing Scheme 25
Figure 2.2  Example of Self-routing Scheme 27
Figure 2.3  The HOL problem in an Input FIFO Buffered Switch 30
Figure 2.4  8x8 Bifurcated Switch 33
Figure 2.5  Throughput of Bifurcated Switch against k 36
Figure 2.6  The knockout process in a 4x2 concentrator 39
Figure 3.1  Simple 4x4 Banyan Switch 50
Figure 3.2  Collision occurs in a Banyan switch even though the cells have different destinations 51
Figure 3.3  Batcher Banyan Switch 53
Figure 3.4  iSLIP Algorithm - First iteration 58
Figure 3.5  iSLIP Algorithm - Second iteration 58
Figure 3.6  A Port Card in a 4x4 Tiny Tera Switch 60
Figure 3.7  The structure of a 3DQ module 63
Figure 3.8  The structure of a destination group 64
Figure 3.9  Layout of an 8x8 Hitibas Switch 67
Figure 3.10  Bandwidth is wasted when certain logical queues are empty in virtual output queuing 70
Figure 4.1  Overview of the Proposed Switch Architecture 76
Figure 4.2  A Dilated Banyan Switch with Degree of Dilation of 2 79
Figure 4.3  An Input Module of a 32x32 switch with group size of 4 80
Figure 4.4  The MSRR Scheduling Scheme in the proposed switch 83
Figure 4.5  Collision on a simple 8x8 banyan switch 84
Figure 4.6  Collision can be avoided with FAB 85
Figure 4.7  An Output Module 87
Figure 5.1  Markov Modulated Bernoulli Process (MMBP) 92
Figure 5.2  Throughput for basic input buffered switch 96
Figure 5.3  Average Cell Delay using Hitibas Architecture (Against Applied Load) 97
Figure 5.4  Overall CLR of the Proposed Switch under Bursty Traffic (Against Degree of Dilation) 98
Figure 5.5  The Throughput of the Proposed Switch under Bursty Traffic (Against Degree of Dilation) 99
Figure 5.6  The Average Delay of the Proposed Switch under Bursty Traffic (Against Degree of Dilation) 99
Figure 5.7  Overall CLR of the Proposed Switch under Bursty Traffic (Against Group Size) 101
Figure 5.8  Throughput of the Proposed Switch under Bursty Traffic (Against Group Size) 101
Figure 5.9  Average Delay of the Proposed Switch under Bursty Traffic (Against Group Size) 102
Figure 5.10  Simplified Single-Class Input Module 104
Figure 5.11  Simplified Single-Class Output Module 104
Figure 5.12  Comparison of the Throughput between the Proposed Switch and the Hitibas with Similar Total Buffer Size under Uniform Traffic (Against Offered Load) 105
Figure 5.13  Comparison of the Average CLR between the Proposed Switch and the Hitibas with Similar Total Buffer Size under Uniform Traffic (Against Offered Load) 106
Figure 5.14  Comparison of the Average Cell Delay between the Proposed Switch and the Hitibas with Similar Total Buffer Size under Uniform Traffic (Against Offered Load) 106
Figure 5.15  Comparison of the Throughput between the Proposed Switch and the Hitibas with Similar Total Buffer Size under Bursty Traffic (Against Offered Load) 108
Figure 5.16  Comparison of the Average CLR between the Proposed Switch and the Hitibas with Similar Total Buffer Size under Bursty Traffic (Against Offered Load) 108
Figure 5.17  Comparison of the Average Cell Delay between the Proposed Switch and the Hitibas with Similar Total Buffer Size under Bursty Traffic (Against Offered Load) 109
Figure 5.18  Comparison of the Throughput between the Proposed Switch and the Hitibas with Similar Total Buffer Size under Uniform Traffic (Against Buffer Size per port) 110
Figure 5.19  Comparison of the Average CLR between the Proposed Switch and the Hitibas with Similar Total Buffer Size under Uniform Traffic (Against Buffer Size per port) 111
Figure 5.20  Comparison of the Average Cell Delay between the Proposed Switch and the Hitibas with Similar Total Buffer Size under Uniform Traffic (Against Buffer Size per port) 111
Figure 5.21  Comparison of the Normalized Throughput of the Proposed Switch and the Hitibas with Similar Total Buffer Size under Bursty Traffic (Against Buffer Size per port) 113
Figure 5.22  Comparison of the Average CLR of the Proposed Switch and the Hitibas with Similar Total Buffer Size under Bursty Traffic (Against Buffer Size per port) 113
Figure 5.23  Comparison of the Average Cell Delay of the Proposed Switch and the Hitibas with Similar Total Buffer Size under Bursty Traffic (Against Buffer Size per port) 114
Figure 5.24  Effect of Offered Load on Normalized Throughput under Uniform 2-Class Traffic 118
Figure 5.25  Effect of Offered Load on Average Cell Delay under Uniform 2-Class Traffic 119
Figure 5.26  Effect of Offered Load on Normalized Throughput under Bursty 2-Class Traffic 121
Figure 5.27  Effect of Offered Load on ABR and Non-ABR CLR under Bursty 2-Class Traffic 122
Figure 5.28  Effect of Offered Load on ABR and Non-ABR Average Cell Delay under Bursty 2-Class Traffic 122
Figure 5.29  Effect of Offered Load on Unicast ABR, Unicast Non-ABR, Multicast ABR and Multicast Non-ABR CLR under Bursty 2-Class Multicast Traffic 127
Figure 5.30  Effect of Offered Load on Overall Throughput under Bursty 2-Class Multicast Traffic 128
Figure 5.31  Effect of Offered Load on Unicast ABR, Unicast Non-ABR, Multicast ABR and Multicast Non-ABR Average Cell Delay under Bursty 2-Class Multicast Traffic 129
Figure A.1  Collision occurs when number of retrials in MSRR is allowed to exceed G (G = 2) 144

List of Tables

Table 2.1  Comparison between general Input and Output Buffering Schemes 47

Acknowledgement

I would like to thank my supervisor, Dr. H. M. Alnuweiri, for his constant guidance and advice throughout my master's research. His patience and support throughout this project are very much appreciated. I would also like to thank my family for their concern and understanding in all these years. Finally, I could never have finished this project without the encouragement of my girlfriend, Clara.

Chapter 1: Introduction

As today's highly technology-oriented environment places demands on the communications network infrastructure as never before, bandwidth requirements are skyrocketing.
In addition, applications such as Internet browsers that allow access across long distances, or even across the globe, are becoming mainstream. Millions of people and organizations rely on the Internet to develop and expand their businesses. The need for high-speed data networking is hence greater than ever. Fortunately, Asynchronous Transfer Mode (ATM) promises to provide a long-term solution to the ever-increasing demand for networking speed.

The area of ATM switch design and implementation has been extensively researched, and in recent years ATM switching products have been offered by many vendors worldwide (e.g. FORE Systems Inc.). Although the ATM switches available on the market do feature some very sophisticated designs, they still face some fundamental problems that are yet to be solved. These problems include high cost, brought on by complex architecture designs and huge buffer size requirements, and lower-than-optimal performance, caused by design difficulties such as cell collisions and unbounded delays.

It is usual in switch design that high performance reflects high switch complexity, while low switch cost comes with low efficiency. By low efficiency we mean low throughput, a high cell loss rate, and unbounded delay. For example, the FIFO input-buffered switch is well known for its simplicity, but it suffers severely from head-of-line (HOL) blocking, which limits its throughput to 58% [1, 2, 3, 4]. On the other hand, many output-buffered switches can provide 100% throughput [5, 6], but they usually require a speedup factor of N to achieve this performance for an N x N switch. When N is large and the line rate is sufficiently high, such memory bandwidth is simply not available with current technologies [43].

This thesis proposes a novel switch architecture with several advanced features for a next-generation ATM switching platform.
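As an aside, the HOL throughput limit cited above can be reproduced with a small slotted-time simulation. This is a sketch of the classical saturation model, not the analysis from [1]: every input always holds a head-of-line cell with a uniformly random destination, and each output grants one contending input per slot.

```python
import random

def hol_saturation_throughput(n_ports=32, n_slots=20000, seed=1):
    """Estimate the saturation throughput of an N x N FIFO input-buffered
    switch limited by head-of-line (HOL) blocking.

    Every input is saturated: it always has a HOL cell with a uniformly
    random output destination.  Each slot, every output grants one of its
    contending inputs at random; the winner's next cell draws a fresh
    random destination."""
    rng = random.Random(seed)
    hol = [rng.randrange(n_ports) for _ in range(n_ports)]  # HOL destinations
    served = 0
    for _ in range(n_slots):
        contenders = {}
        for inp, dest in enumerate(hol):
            contenders.setdefault(dest, []).append(inp)
        for dest, inputs in contenders.items():
            winner = rng.choice(inputs)            # output grants one input
            hol[winner] = rng.randrange(n_ports)   # winner advances its queue
            served += 1
    return served / (n_ports * n_slots)
```

For moderate port counts the estimate settles near the large-N limit of 2 - sqrt(2), roughly 0.586, which is the 58% figure quoted above.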
The proposed switch is basically an input-buffered switch with a truly non-blocking architecture, although output buffers are still used to enhance performance. The switch can operate at the line rate (no speedup is needed; the speedup factor will be explained in more detail in Chapter 2). The most outstanding feature of the proposed switch is that it employs an architecture with a simplicity comparable to FIFO input queuing with banyan switching. It also applies a simple but powerful scheduling technique that is capable of handling real-time and best-effort traffic flows in a fair manner. The simulations included in this thesis show that the proposed switch achieves almost 100% throughput, and provides arbitrarily low CLR with practical buffer size and delay, even under real-time bursty traffic.

1.1. Asynchronous Transfer Mode (ATM)

ATM is defined by the Consultative Committee for International Telephone and Telegraph (CCITT), since renamed the International Telecommunication Union (ITU), as a Broadband Integrated Services Digital Network (B-ISDN) standard transfer protocol [7]. ATM is designed to provide the switching infrastructure for next-generation telecommunications networks, as well as to meet the quality-of-service requirements of many different types of services. The size of the data packets, or cells in ATM terminology, plays an important role in the ATM protocol, as it affects attributes such as cell delay and the switching speed of the network. In ATM, the smallest switching unit is called a cell. The cell size is fixed at 53 bytes: 5 bytes are allocated to the cell header, and 48 bytes to the payload. The 5-byte header therefore represents about 9% of the bandwidth, and it contains the following control information fields:

GFC: Generic Flow Control;
VPI: Virtual Path Identifier;
VCI: Virtual Channel Identifier;
PTI: Payload Type Indicator;
CLP: Cell Loss Priority;
HEC: Header Error Control.
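These fields can be unpacked directly from the 5 header bytes. The sketch below assumes the standard UNI bit widths (GFC 4 bits, VPI 8, VCI 16, PTI 3, CLP 1, HEC 8), which Fig. 1.1 illustrates:

```python
def parse_uni_header(header: bytes) -> dict:
    """Split a 5-byte ATM UNI cell header into its fields.

    Bit widths follow the standard UNI layout:
    GFC 4 | VPI 8 | VCI 16 | PTI 3 | CLP 1 | HEC 8  (40 bits total)."""
    if len(header) != 5:
        raise ValueError("ATM cell header is exactly 5 bytes")
    b = int.from_bytes(header, "big")  # 40 bits, GFC in the top nibble
    return {
        "GFC": (b >> 36) & 0xF,
        "VPI": (b >> 28) & 0xFF,
        "VCI": (b >> 12) & 0xFFFF,
        "PTI": (b >> 9) & 0x7,
        "CLP": (b >> 8) & 0x1,
        "HEC": b & 0xFF,
    }

# Example: GFC=0, VPI=1, VCI=5, PTI=0, CLP=0, and an arbitrary HEC byte
fields = parse_uni_header(bytes([0x00, 0x10, 0x00, 0x50, 0xAA]))
```

The same routine adapted to the NNI format would simply fold the GFC nibble into a 12-bit VPI.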
Considering that the usage of the cell header at the user-network interface (UNI) and the network-network interface (NNI) differs slightly during cell routing, their cell header formats are also slightly different. They are shown in Fig. 1.1.

ATM UNI cell header format:
byte 1: GFC (4 bits) | VPI (upper 4 bits)
byte 2: VPI (lower 4 bits) | VCI (upper 4 bits)
byte 3: VCI (middle 8 bits)
byte 4: VCI (lower 4 bits) | PTI (3 bits) | CLP (1 bit)
byte 5: HEC (8 bits)

ATM NNI cell header format: identical, except that there is no GFC field; its 4 bits are given to the VPI, which then occupies the 12 uppermost bits of the header.

Fig. 1.1. ATM cell header formats at the UNI and NNI

As seen from Fig. 1.1, the only difference between a UNI cell header and an NNI cell header is the 4-bit GFC field, which acts as a control on the amount of traffic allowed to enter the network at the UNI.

Because the vast majority of current network applications and protocols generate data units that are much longer than 48 bytes, a process called segmentation and reassembly (SAR) is necessary in ATM networks. The SAR process takes place in the highest layer of the ATM protocol stack: the ATM Adaptation Layer (AAL). Furthermore, a wide variety of formats and information types (e.g., voice and video) are used in the applications for which ATM networks will provide transport service. In order to adapt to and optimize different kinds of services, the AAL specification also defines four protocol types: AAL 1, AAL 2, AAL 3/4, and AAL 5 (note that AAL 3/4 is the combination of the abandoned AAL 3 and AAL 4). Each protocol type describes the format of the cell's payload for different service classes. According to ITU-T [7], there are basically four service classes that provide services for specific applications:

Class A: Circuit emulation, constant-bit-rate voice and video;
Class B: Variable-bit-rate compressed audio and video;
Class C: Connection-oriented data transfer;
Class D: Connectionless data transfer.
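The ITU-T classes above map onto the ATM Forum's service categories, as Fig. 1.2 summarizes. As a quick reference, the correspondence can be captured in a small lookup table; the exact row spans here are our reading of the figure, not normative text:

```python
# ITU-T service class -> attributes, following the Fig. 1.2 correspondence.
# The attribute spans (which classes are delay sensitive, etc.) are our
# reading of the figure, not normative text.
SERVICE_CLASSES = {
    "A": {"atmf": "CBR", "mode": "connection-oriented", "bit_rate": "constant", "delay_sensitive": True},
    "B": {"atmf": "VBR", "mode": "connection-oriented", "bit_rate": "variable", "delay_sensitive": True},
    "C": {"atmf": "UBR", "mode": "connection-oriented", "bit_rate": "variable", "delay_sensitive": False},
    "D": {"atmf": "ABR", "mode": "connectionless",      "bit_rate": "variable", "delay_sensitive": False},
}

def atmf_category(itu_class: str) -> str:
    """Translate an ITU-T class letter to its ATMF service category."""
    return SERVICE_CLASSES[itu_class]["atmf"]
```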
Based on the ATM Forum's (ATMF's) UNI version 4.0, the definition of the service classes, namely Constant Bit Rate (CBR), Variable Bit Rate (VBR), Unspecified Bit Rate (UBR) and Available Bit Rate (ABR), is very similar to ITU-T's. Fig. 1.2 shows the translation between the ATMF's and ITU-T's definitions, according to their functional similarity. In addition, Fig. 1.2 also includes the relationship between the service classes and the AAL types.

Service Class (ITU-T)   A                    B                    C                    D
Service Class (ATMF)    CBR                  VBR                  UBR                  ABR
Connection Mode         connection-oriented  connection-oriented  connection-oriented  connectionless
Bit Rate                constant             variable             variable             variable
Delay Sensitivity       sensitive            sensitive            insensitive          insensitive
AAL Type                1                    2                    3/4                  5

Fig. 1.2. ATM Services (both ITU-T and ATMF standards are shown) and AAL Types

1.2. Traffic Classes in ATM

As mentioned in the previous paragraph, CBR, VBR, UBR and ABR are service classes defined by the ATM Forum traffic management group. Each class is described in the following paragraphs.

1.2.1. CBR Traffic

The CBR service is intended for real-time applications that require tightly constrained delay and delay variation. Two simple examples are voice and video applications. In other words, the CBR service is delay sensitive. The consistent availability of a constant amount of bandwidth is critical for CBR service. Quality of service (QoS) is guaranteed once a call is set up. However, cell loss is relatively tolerable in CBR applications.

1.2.2. VBR Traffic

Similar to the CBR service, the real-time VBR service class is also intended for real-time applications that require tightly constrained delay and delay variation. Sources are expected to transmit at a rate that varies with time. Real-time VBR service may support statistical multiplexing of real-time sources, or may provide a consistently guaranteed QoS.

1.2.3. UBR Traffic

The UBR service class is intended for delay-tolerant or non-real-time applications.
In other words, the UBR service is delay insensitive. That is, UBR applications do not require tightly constrained delay and delay variation; examples are traditional computer communications applications and file downloads. Sources are expected to transmit non-continuous bursts of cells. UBR service supports a high degree of statistical multiplexing among sources, but it has no notion of a per-VC allocated bandwidth resource. Transport of cells in UBR service is not necessarily guaranteed by mechanisms operating at the cell level. However, it is expected that resources will be monitored for the UBR service in such a way that it will be appropriate for some specific applications.

1.2.4. ABR Traffic

Many applications choose to reduce their information transfer rate if the network requires them to do so, rather than dropping any cells. Likewise, they may wish to increase their information transfer rate if extra bandwidth is available within the network. There may be no deterministic parameters, because no bandwidth is reserved for the users. As a result, ABR traffic is not characterized by peak cell rate and burst tolerance, and bandwidth reservations are not made. Instead, traffic is admitted into the network throttled by a flow-control mechanism based on the bandwidth left available by other services. Overall, ABR is designed to provide better utilization of network bandwidth resources, but application delay may be very large. Consequently, applications that use ABR service must be delay insensitive [8, 9, 10].

1.3. Multi-priority-class Traffic

In CBR and VBR services, cell loss is somewhat tolerable, because an occasional cell drop may go largely unnoticed (imagine dropping part of a word in a conversation). Since the quality-of-service (QoS) contract agreed at call setup requires certain bandwidth to be guaranteed during the transmission, CBR and VBR cells should have higher priority to be switched.
On the contrary, some data transmission (e.g. email) is not delay sensitive, but requires zero or very low cell loss. ABR falls into this category of service. In addition, ABR is a connectionless service, and therefore only occupies whatever bandwidth is left unutilized by CBR and VBR traffic. When the network is very busy, ABR data will "wait" at some point along the transmission, rather than dropping any bits, until the required bandwidth becomes available for the data cells to pass through. The UBR service is somewhat similar to ABR with respect to delay tolerance. However, the main difference between the ABR and UBR services is that UBR does not implement any feedback mechanism to control congestion. As a result, UBR service will drop cells whenever congestion occurs, instead of waiting until the bandwidth becomes available.

In this thesis, traffic is separated into two major priority classes during cell switching. CBR and VBR cells belong to the higher-priority category, which has the privilege of being switched first to assure low cell delay. ABR and UBR cells belong to the lower-priority category, which can be switched at an input port in a particular time slot only when no CBR or VBR cells are being switched. However, more buffer space is reserved for the ABR and UBR services to enforce a low cell loss rate. For the sake of simplicity, CBR and VBR services will be categorized as the "Non-ABR" traffic class, while both ABR and UBR will be categorized as the "ABR" traffic class throughout this thesis. We have to emphasize that although UBR is classified as "ABR" traffic, it has lower priority than ABR cells for storage. Later in this thesis, we will see that both ABR and UBR cells are placed in the same "ABR queues", but UBR cells will be discarded when a given ABR queue overflows.

1.4. Multicast Traffic

Many applications (e.g.
videoconferencing, distributed data processing, etc.) nowadays require switches to set up multicast virtual circuits. Numerous papers have discussed switch designs that perform multicast cell switching [11, 12, 13, 14, 15]. Some of these proposals involve very complex scheduling schemes for multicast traffic handling. One instance is the iPOINT switch (discussed in Chapter 3) [15], which requires additional linked lists and time slots to route multicast cells. While those schemes do support multicasting, a more efficient and hardware-conservative scheme is desired.

1.5. Uniform and Bursty Traffic

Since non-correlated traffic input can usually demonstrate the "best case" switch performance, much research uses it as the traffic model to showcase new switch designs [16, 20]. (We will refer to "non-correlated" traffic as "uniform" traffic from now on.) Although uniform traffic can provide some insight into a switch's performance during simulations, it can hardly represent today's high-volume real-time traffic accurately. In real-time traffic, especially ATM traffic, incoming cells are highly correlated. This is because data is fragmented into packets before being transferred, and long packets are in turn fragmented into ATM cells during the SAR process. Therefore, each ATM cell is usually only a part of the piece of data being transferred. It is very likely that many cells belonging to the same packet are heading to the same output port over a long period of time. As a result, cells arriving at a switch are very bursty, in the sense that a particular port may be very busy at one moment while other ports are idle. This kind of traffic is known as bursty traffic. Since switch implementations must cope with real-time traffic, performing well under bursty traffic is an important issue in switch design.
However, since it is very difficult to design an algorithm that tolerates an arbitrary amount of burstiness, compromises are usually accepted in some switch designs [44]. In our proposed switch, we focus on handling bursty traffic with an algorithm that simply "eliminates" the burstiness. We achieve this goal by serving every queue fairly, as long as it is not empty. As a result, the performance of our proposed switch under bursty traffic is very comparable to that under uniform traffic.

1.6. ATM Cell Switching

ATM is a network architecture that uses unchannelized network transports [21]. Nonetheless, the traversing traffic must be identified as voice, video, or data, because the network itself has to determine the required QoS parameters for each different service. Since there are no physical channels to distinguish the traffic in an ATM network, the traffic types are recognized by their logical connections. Therefore, instead of voice or video channels, ATM networks have voice or video connections. In ATM networks, these logical connections are established and maintained by means of a two-part identifier structure: the virtual channel and the virtual path.

Virtual channels (VCs) and virtual paths (VPs) are part of the overall architecture of broadband ISDN (B-ISDN). VCs and VPs together provide the ATM layer transport functions on a logical level. According to CCITT's definitions (CCITT I.113), a virtual channel is a concept used to describe unidirectional transport of ATM cells associated by a common unique identifier value called the virtual channel identifier (VCI). Note that the VCI is valid in only one direction. On the other hand, a virtual path is a concept used to describe unidirectional transport of cells belonging to virtual channels that are associated by a common identifier value called the virtual path identifier (VPI). The VPI is also valid in one direction.
Basically, the VCI identifies dynamically allocated connections, while the VPI identifies statically allocated connections.

VCIs and VPIs are contained in cell headers and are hierarchical. Cells flow along the transmission path in an ATM network. The transmission path itself may comprise one or more VPs, which may in turn comprise several VCs. Besides helping the switch direct cells to various destinations, VPs and VCs also lead to two other concepts in ATM: the virtual link and the virtual connection.

A virtual channel link is a unidirectional link between two points: from where a VCI is assigned, to where it is switched and removed. Similarly, a virtual path link is bounded by the points in the network where the VPI value is assigned and where it is switched and removed. Thus, links on VCs or VPs are just the paths on the ATM network where the VCI or VPI values stay the same. The places where they change establish the endpoints of the link.

The end systems will be connected by more than one VC link and VP link in most cases. That is, a VC link or VP link is likely concatenated with another VC link or VP link. The concatenation of these virtual channel links is called a virtual channel connection (VCC), and the concatenation of virtual path links is called a virtual path connection (VPC). A VCC provides end-to-end transfer of ATM cells between two endpoints (usually at the ATM adaptation layer). The endpoints may be end users or network entities. Each endpoint associates a unique VCI with each VCC. In addition, within the network there may be a number of points at which virtual channels are switched, and at those points the VCI may be changed. Thus, a VCC consists of a concatenation of one or more VC links, with the VCI remaining constant for the extent of each VC link and changing at the VC switch points.
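The relabeling just described (identifiers constant along a link, rewritten at switch points) amounts to a table lookup at each switch. The following is a hypothetical sketch with made-up entries, not an implementation from any particular switch:

```python
# Hypothetical VC-switch translation table:
#   (in_port, in_VPI, in_VCI) -> (out_port, out_VPI, out_VCI)
# Entries would be installed at connection setup; the values here are
# illustrative only.
vc_table = {
    (0, 1, 5): (3, 7, 42),
    (1, 1, 5): (2, 7, 9),
}

def switch_cell(in_port: int, vpi: int, vci: int):
    """Relabel a cell at a VC switch point: both VPI and VCI may change."""
    out_port, out_vpi, out_vci = vc_table[(in_port, vpi, vci)]
    return out_port, out_vpi, out_vci

# A VP switch, by contrast, would key only on (in_port, VPI) and
# leave the VCI untouched.
```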
Between an endpoint and a VC switch point, or between two VC switch points, a VPC provides a route for all VC links that share the two VPC endpoints. Again, at this level there may be internal switching, such that a VPC passes through one or more VP switch points, with the VPI changing at each such point.

In other words, VP switches terminate VP links. A VP switch translates incoming VPIs to the corresponding outgoing VPIs according to the destination of the VPC; VCI values remain unchanged. VC switches terminate VC links and necessarily VP links. A VC switch must therefore switch both virtual paths and virtual channels, and so both VPI and VCI translation is performed.

VP multiplexing provides end-user equipment with the capability of simultaneously switching a number of VCs while establishing only a single logical connection. A VCC is uniquely identified by the VPI and VCI, while a VPC is identified by the VPI only. ATM switches are capable of switching both individual VCs within VPs and entire VPs. The VPI identifies whether the VP is switched as a whole, or whether its individual VCs are switched. Multiplexing and switching in ATM are always done on VPs first and then on VCs, based on their hierarchical relationship. Fig. 1.3 illustrates a simplified example of VPC switching, with VCIs kept unchanged.

1.7. Motivation of this thesis

As mentioned before, ATM is believed to be the transfer mode for next-generation networks; hence it definitely needs to support all kinds of traffic that will traverse the global network. ATM switch designs therefore must be capable of handling a wide range of cell types, from voice to high-quality video. As we have seen, these services have different requirements in terms of bit rate, cell loss rate and delay. The major challenge of switch design is to cope with all these requirements.
1.7.1. Increasing Information Rates

Since the information rates of the different services are very diversified, a wide spectrum of information rates must be supported in broadband switches. These rates range from a few kbps up to values close to 155 Mbps. The maximum bit rate which ATM switches must be able to switch is 155 Mbps. However, the ever-increasing demand for networking speed makes this seem insufficient. One solution is to multiplex several 155 Mbps streams onto a single link, so that speeds higher than 155 Mbps (potentially Gbps) are possible internally.

Nonetheless, the complexity of a switch's architecture often forces the switch to perform much worse than the maximum bit rate. To achieve the maximum switching rate, hardware simplicity is yet another challenge for switch designers.

1.7.2. Multicast Capability

In synchronous transfer mode (STM) and packet switches, only point-to-point connections are available, because information has to be switched from a logical input to a logical output. In the future broadband network, however, an additional requirement arises: multicast switching. Indeed, some services (e.g. video conferencing) already have a "copying" nature, and thus future ATM switches will require the capability to provide multicast functionality. Multicasting is defined as the process which provides information from one source to many (selected) destinations [7]. Unicasting can then be seen as one extreme of multicasting (directing the information to a single destination), whereas broadcasting lies at the other extreme (providing information from one source to all destinations).

The multicast function is typically required for services such as email, video library access and TV distribution. Such a multicast facility may be required from a server to multiple subscriber lines, and also from subscriber lines to multiple subscriber lines.
One simple example of the latter case is a subscriber sending one email to 20 receivers through the ATM network simultaneously. Another example is a subscriber offering a video library service which is located at its own premises and connected to the network via its ATM subscriber line (local loop).

1.7.3. Switch Performance

In ATM switches, performance is often determined by four parameters, namely throughput, connection blocking, cell loss rate and switching delay. A well designed switch has to produce results that are within the tolerable limits of all these parameters. Otherwise, improvements to the switch architecture should take place. The parameters are described in the following paragraphs.

1.7.3.1. Throughput

Throughput is the total amount of traffic that the network can carry in a unit of time (typically measured in bits per second). Similarly, switch throughput determines the amount of traffic that can pass through a switch in any given unit of time. The higher the throughput, the higher the switch performance. Note that one switch with especially low throughput will drag down the whole network's performance (such a switch is usually considered a bottleneck of the network). Hence, the network's aggregate throughput is affected by each individual switch's throughput.

1.7.3.2. Connection Blocking

Most ATM services, such as CBR and VBR, are connection-oriented. This means that at connection set-up, a logical connection must be found between a logical input and output. It is very important to bear in mind that a connection oriented ATM service does not imply that an ATM switch is internally connection oriented.

Connection blocking occurs when there are not enough resources between the inputs and outputs in the switch to serve all existing connections plus the new connection. If this happens, the new connection will not be set up and the newly arriving cell is said to be blocked. Most switch implementations have an internal connection blocking problem.
The blocking probability of those switches is determined by the parameters of the switch, such as the number of internal connections and the load on those connections. Fortunately, like the switch proposed in this thesis, some switch implementations can avoid internal blocking completely. This means that if enough resources (i.e. bandwidth and header values) are available on the input and the output of the switch, no connection blocking will occur internally. Therefore, if enough resources are available on the external links, a new connection will always be accepted without an explicit check of the internal switch resources.

1.7.3.3. Cell Loss Ratio (CLR)

In ATM switches, although buffers are used to store extra cells that are not being switched in the current time slot, it is possible that temporarily too many cells are destined for the same internal or external link. The consequence is that more cells simultaneously compete for the buffer space than the buffer in the switch can store. This leads to cell loss because the cells that cannot be buffered will be discarded. The probability of losing a cell must be kept within limits to ensure high performance. Typical values for CLR mentioned in ATM switch works range between 10^-6 and 10^-9.

1.7.3.4. Switching delay

The length of the time period needed to switch an ATM cell through the switch is also an important factor.
Switching delay occurs when users input more traffic than the switch can handle simultaneously, causing the switch to buffer the traffic. An even worse case is when the buffer overflows: cells will be discarded, and eventually required to be retransmitted from the source. Note that the delay that the users encounter is actually the sum of all the switch delays in the network. In some cases, the switching delay may not even converge [22].

By using simpler hardware and utilizing the resources more effectively, switching delay can be greatly reduced, or at least bounded.

1.8. Contributions of the thesis

This thesis addresses the performance (throughput, connection blocking, CLR and cell delay) difficulties mentioned in section 1.7 and proposes an input buffered switch that dramatically enhances switching performance under both uniform and bursty traffic, while keeping the architecture simple to build. In the following we briefly describe the enhancements brought by our proposed switch:

1. A very powerful and simple input queuing technique - Virtual Output Queuing [16, 17, 18, 19, 33] - is employed at the inputs, while output buffers are placed at the outputs to enhance throughput. That is, we use a somewhat "hybrid" design with both input and output buffering in the same switch. We have modified the virtual output queuing scheme so that a 2-priority-class virtual output queuing scheme and a "retry with another when one is empty" unloading policy are applied at the input modules. Furthermore, an internal bandwidth expansion technique (we call it output port dilation in the rest of this thesis) is utilized at the output ports. With this output port dilation (to be detailed later in chapter four), multiple cells can be safely switched to the same output port simultaneously without the need for internal speedup in the switch.

2.
A modified skewed round robin (MSRR) scheduling algorithm is used to monitor cell scheduling at the input buffers. There are two classes (i.e. two priorities) of logical input queues per input module sharing the same schedule; at every time slot, the lower priority logical input queue takes over the time slot for unloading when the higher priority logical input queue is empty. Therefore, the chance of wasting an empty time slot is greatly reduced. The output port dilation at the output ports also helps further improve the utilization of resources, since it is possible to send multiple cells to an output port simultaneously without worrying about collision. Thus, the scheduling mechanism has more chances (depending on the Degree of Dilation, which we will detail in chapter 4) to avoid wasting a time slot when a logical input queue is found empty during the switching process. On the other hand, since multiple cells may be switched to the same output port simultaneously while only one cell can be sent out, output buffers are required in order to store the extra cell(s). This is the sole reason that output FIFO buffers are required. Note that the output buffers are also constructed as two-class priority logical queues.

3. In order to support multicast traffic effectively, cell copying capability is built into the input buffering system. The replication-at-receiving (RAR) [12] algorithm is employed, in which a multicast cell is copied into as many copies as necessary and the copies are placed in the logical input queues according to the output port number. It is an efficient way to route multicast cells because every arriving cell shares the same "copying" and switching mechanism. Note that a unicast cell simply needs to be copied zero times. In addition, knowing that a low cell loss rate is critical to multicast cells, higher storage priority is given to all multicast cells.
It is shown in the simulations that our switch handles multicast traffic seamlessly.

Since we want to investigate the performance of our switch under real time traffic, simulations under real-time bursty as well as uniform traffic are performed. The results show that our switch achieves performance similar to output buffering designs, while keeping the switch architecture simple and the buffer size small. In addition, our switch can handle multiple-class and multicast traffic with performance that is comparable to some much more complex designs.

1.9. Organization of the Thesis

This thesis is organized as follows. We discuss some scheduling and buffering strategies in chapter 2. Some switch architectures using different kinds of input buffering schemes are explored in chapter 3. In chapter 4 the architecture of our high performance input-output buffered switch is laid out. The simulation results of the proposed switch are shown, with explanations, in chapter 5. Finally, we conclude this thesis in chapter 6.

Chapter 2: ATM Switch Queuing Strategies

Since ATM is being widely accepted as the B-ISDN standard, the performance of ATM switches has been one of the major issues in the networking development field. However, the heaviness and burstiness embedded in cyber-space traffic make it almost impossible for switches to handle every single cell in such a short period of time that congestion would not occur. In fact, cells are almost inevitably congested
in switches (or routers) somewhere on their way to the destination. When cells are congested, the easiest way for the switch to handle the situation is to discard the extra cells. But then cell loss will result. When the congestion gets serious, the cell loss rate can become so large that the transferred data will be meaningless (remember that one single cell loss will make the whole packet it belongs to useless).

To minimize the cell loss, an efficient routing scheme must be built into the switch so that cells can be switched to the appropriate output ports at higher speed. A good routing scheme can greatly reduce the congestion (and therefore the cell loss) at the input ports. Furthermore, buffers (or queues) are needed to store the cells that cannot be switched at the moment, but can wait for a later clock cycle in which they can be switched. Introducing buffers to a switch will bring in some delays, but a low cell loss rate can be somewhat guaranteed. There are basically three types of buffering strategies: Input Buffering, Output Buffering and Internal Buffering. Each of these has its own advantages and disadvantages. We will use part of this chapter to discuss the pros and cons of each strategy.

No matter whether an ATM switch employs buffering or not, a situation called blocking (or collision) may occur. Blocking occurs when two or more cells contend for the same resource at the same time. Since some resources (e.g.
output ports) can at most handle one cell in a time slot, the extra cell(s) will have to be discarded. One of the most important areas in switch design is to avoid blocking by manipulation of buffers (to make the switch non-blocking).

In this chapter, some basic routing schemes will be briefly introduced, and the major buffering strategies will be described in detail.

2.1. Routing Schemes

There are two basic routing schemes: centralized routing and self-routing. Each routing scheme has advantages and disadvantages. We will explore both routing methods in this section.

2.1.1. Centralized Routing

In centralized routing, all cells carry a fixed size header in which the virtual channel (VC) is obtained. The switch then routes the cells according to their VC. That is, the switching network determines where the cells are heading, hence the name "centralized".

When a cell enters the switch, the switch looks up the VC in a routing table to locate three pieces of information: the VC of the inbound cell, the output port, and the new VC to put on the outbound cell. After locating the VC of the inbound cell in the table, the switch's routing logic puts the cell on the matched output port. Before it sends out the cell, it has to change the existing VC to a new one for routing in the downstream switch. Fig. 2.1 shows a simplified example of the centralized routing scheme.

Routing Table
Input VC | Output Port | Output VC
   28    |      5      |    36
   36    |      3      |    17
   17    |      6      |    42
   42    |      1      |    12

Fig. 2.1: Example of Routing Table and Centralized Routing Scheme (Node 1, Node 2, Node 3)

The upper part of Fig. 2.1 shows a simple version of the routing table found in switches employing the centralized routing scheme. A cell with VC 28 is entering node 1. The switch will look up the built-in routing table to locate the output port and the output VC. It is found that the cell should exit at output port 5 and carry a new VC of 36 downstream.
At node 2, the input VC 36 is translated into output port 3 and a new VC of 17. The same procedure is repeated in all downstream nodes until the cell reaches the destination. Note that an identical routing table is used here for illustration purposes. In real networks, each switch has its own routing table.

Changing the VC on a per-hop basis means that VCs need to be unique only on a per-switch basis, and therefore can be rather small. Note that if the VC were not updated at each switch, cells would simply be routed to the same output port in all the switches along the VC.

One problem of this scheme is that each switch needs to do quite a lot of comparisons while it looks up the table. Therefore, the delay of a cell will be noticeable when the number of nodes it has to pass through is large.

2.1.2. Self Routing

Self-routing, in contrast to the centralized routing scheme, allows the switch to do nothing more than route the cell to the right output port. Where a cell is going is entirely determined by the source. In other words, a cell knows exactly where it is going when it leaves the source.

A sequence of integers, or more formally, a sequence of route elements (e.g. 0, 1, 2, ..., N), is attached to a cell when it is generated. A route element refers to the output port number at the corresponding node. Whenever a switch receives a cell, it reads in the first route element in the sequence, routes the cell to the corresponding output port, then removes that particular route element from the sequence.

The advantage of this scheme is its simplicity in switch design (in terms of cell routing). The most obvious disadvantage is the overhead brought in by the additional header attached to the cells. Also, bounding the header size (the number of route elements in the sequence) in a cell also bounds the number of nodes it can pass through. This might limit the distance that cells can travel.
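To make the per-hop mechanics concrete, here is a minimal sketch (illustrative Python with invented names, not part of any switch design discussed here). It follows the convention of the Fig. 2.2 example below, where each node reads the route element at the tail of the sequence:

```python
# Hypothetical sketch of self-routing: each node reads one route element
# (taken from the tail of the sequence, matching the Fig. 2.2 walkthrough),
# forwards the cell on that output port, and removes the element.

def route_one_hop(route_elements):
    """Return (output_port, remaining_elements) for one switching node."""
    if not route_elements:
        raise ValueError("route exhausted: cell has reached its destination")
    *remaining, output_port = route_elements  # tail element is read first
    return output_port, remaining

# Walk the cell of Fig. 2.2 through three nodes.
route = [2, 4, 8, 3, 5]
ports = []
for _ in range(3):
    port, route = route_one_hop(route)
    ports.append(port)

print(ports)  # output ports taken at nodes 1, 2, 3
print(route)  # route elements left for downstream nodes
```

For the cell of Fig. 2.2, the three hops exit on ports 5, 3 and 8, leaving elements (2, 4) for the remaining nodes.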
Internal blocking is also a big problem with self-routing, though it can be avoided with proper scheduling schemes.

For illustration purposes, self-routing is used as the switching network in many buffering design projects. The simulations performed in this thesis are also based on the self-routing scheme. Fig. 2.2 depicts an extremely simplified example of the self-routing scheme.

Fig. 2.2: Example of Self-routing Scheme (Node 1, Node 2, Node 3)

In Fig. 2.2, a cell with route elements (2, 4, 8, 3, 5) is to be switched. At node 1, the first route element read in is 5, so the cell is switched to output port 5. The route element "5" is removed once it has been read by node 1. At node 2, the first route element is now 3, hence the cell is switched to output port 3. The route element "3" is then removed. A similar step happens at node 3. The cell then heads downstream to other nodes with route elements (2, 4) left. Note that the input port is not set inside the cell header; we simply assign the input port number randomly in this case.

2.2. Buffering Strategies

Choosing which buffering strategy to employ makes a great difference in switching capability and performance. In the following sections we will describe a few major buffering schemes, and we will discuss some advanced switches that implement those buffering schemes in the next chapter. It is noteworthy that the traffic pattern (uniform or bursty) also makes a lot of difference in terms of switching performance. For instance, it is found that the output buffering scheme performs much better than the input buffering scheme under uniform traffic, but the input buffering scheme is far more robust when the traffic is bursty. During our discussions, we will elaborate on this issue.

2.2.1. Input Buffering

In this scheme, buffers (individual or shared) are placed at every input port.
In some designs, an indication is necessary to notify the inputs that the previous cell has been successfully switched to the output. Cells that failed to go through are required to stay in the queue and wait for the next clock cycle.

Obviously, when the cell at the first slot of a queue is not switched in a particular time slot, all the cells in the same queue are blocked. In other words, a single congested cell at the head of the queue blocks all those behind it from getting into the switch. Thus, input buffered switches suffer severely from the well-known head of line (HOL) blocking [3, 7, 22, 23, 25]. The congestion caused by HOL blocking can be quite severe. Many incoming cells have to be discarded as a result of buffer overflow induced by the congestion. Studies found that the maximum throughput of input buffered FIFO switches under uniform traffic is, surprisingly, a constant value of 0.586 [1, 2, 3, 4, 22, 25]. This number is obtained under the assumption of traffic uniformly distributed over both the input and output ports. Under bursty traffic, it can be further degraded. More surprisingly, this throughput is actually worse than that of a non-buffering switch (1 - e^-1 = 0.632) [4, 22], though many more cells might have to be discarded in the non-buffering case. The HOL problem is depicted in Fig. 2.3.

Fig. 2.3: The HOL problem in an input FIFO buffered switch

Fig. 2.3 shows an 8*8 simple FIFO input buffered switch. There is one individual finite FIFO queue at each input port storing cells that are to be switched. The queue length is indicated by the number of "squares" in the queue. As seen from the figure, the HOL cell from input queue #3 collides with the one from input queue #4 at output #2. Assuming that the HOL cell from input queue #4 takes over the output port first, the HOL cell from input queue #3 has to stay in the buffer until output port #2 is cleared.
Imagine that another queue takes over output port #2 in the next clock cycle; the HOL cell in input queue #3 then has to stay in the buffer for one more cycle. Consequently, all the cells in input queue #3, no matter which output ports they are heading to, are blocked by the HOL cell, and they cannot move forward as long as the HOL cell stays. Since new cells keep arriving at input queue #3, it overflows very easily. Once it overflows, any new incoming cell has to be discarded. A similar situation happens at input queue #6.

The situation depicted in Fig. 2.3 happens quite often, especially under real-time bursty traffic, when a chunk of cells is usually heading to the same output port. Thus, though an input buffered switch is comparatively simple to build, the HOL blocking problem keeps the performance far from economical.

Fortunately, HOL blocking can be reduced or eliminated by many means. Two significant approaches are bifurcated queuing [25] and virtual output queuing [16, 17, 18, 19, 33]. It is noteworthy that the basic idea of virtual output queuing is based on and developed from bifurcated queuing. In the following we will describe and analyze how bifurcated queuing works. From the analysis, we will also see how it expands to virtual output queuing.

In a bifurcated switch, there are two queues at each input port, resulting in a total of 2*N queues. The output ports are separated into 2 groups - upper half and lower half. At each input port, one of the two input queues stores the cells heading to the upper half output ports, while the other stores the cells heading to the lower half. Thus, a blocked cell belonging to the upper half output ports will not affect the cells heading to the lower half output ports. HOL blocking will still occur, but on average half as often as in FIFO queuing. This architecture is depicted in Fig. 2.4.
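Both effects can be sanity-checked with a toy slot-by-slot simulation (illustrative Python, not the simulator used later in this thesis). It assumes saturated Bernoulli arrivals with uniformly chosen destinations, and treats the bifurcated sub-switches as non-interfering, as in the analysis that follows:

```python
import random

def simulate(n_ports, n_queues_per_port, load, slots=20000, seed=1):
    """Toy slot-synchronous simulation of an input-queued crossbar.

    Each input has n_queues_per_port FIFO queues; queue q holds cells
    destined for its own 1/k slice of the outputs (k = 1 is plain FIFO,
    k = 2 is the bifurcated switch).  Returns measured throughput per port.
    """
    rng = random.Random(seed)
    k = n_queues_per_port
    group = n_ports // k                       # outputs served per queue
    queues = [[[] for _ in range(k)] for _ in range(n_ports)]
    delivered = 0
    for _ in range(slots):
        # Arrivals: Bernoulli(load) per input, uniform destination.
        for i in range(n_ports):
            if rng.random() < load:
                dest = rng.randrange(n_ports)
                queues[i][dest // group].append(dest)
        # Service: each output accepts one HOL cell, chosen at random
        # among the queues whose HOL cell wants that output.
        contenders = {}
        for i in range(n_ports):
            for q in queues[i]:
                if q:
                    contenders.setdefault(q[0], []).append(q)
        for dest, qs in contenders.items():
            rng.choice(qs).pop(0)
            delivered += 1
    return delivered / (slots * n_ports)

# At saturation (load 1.0): plain FIFO sits near the 0.586 bound
# (slightly above it for finite N), while bifurcation does clearly better.
print(simulate(16, 1, 1.0))
print(simulate(16, 2, 1.0))
```

The absolute numbers drift slightly with N and the random seed, but the ordering (bifurcated above FIFO) is robust.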
If the HOL blocking probability is to be further reduced, the number of queues placed at each input port can be increased by multiples of 2. When the number of queues at each input port is doubled, the HOL blocking probability is halved, because each queue is responsible for half as many output ports. In general, if the number of queues at each input port is k (k = 1, 2, 4, 8, 16, 32, ...), then each queue is responsible for N/k output ports, provided that N/k is an integer. In the case k > 2, assuming N/k = M (an integer), a cell arriving at a particular input port is placed in queue #1 (the first queue) of the input port if it is heading to an output port m where 0 <= m <= M-1; it is placed in queue #2 if M <= m <= 2M-1; ... and it is placed in queue #k (the last queue) if (k-1)M <= m <= N-1. In short, a cell heading to output port m is placed in queue number ceil(m/M), where ceil(x) here denotes the smallest integer strictly larger than x.

Fig. 2.4: 8*8 Bifurcated Switch (each of the 8 input ports has two queues, #1 and #2)

From Fig. 2.4, it is easy to observe that there are virtually two 8*4 sub-switches. The first sub-switch consists of the upper queue of all 8 inputs, serving the upper 4 output ports, while the second sub-switch consists of the lower queue of the inputs, serving the lower 4 output ports. To extend this observation to the more general case, where k logical queues are placed at each input port, we see that there will be k sub-switches, each of size N*M. The k sub-switches are totally identical and do not interfere with each other. Thus, we can narrow the analysis of an N*N switch down to k identical N*M sub-switches. Let
Let 33  be the number of cells heading to a output port i, which will move up to  A  t  H O L positions in time slot t; B  t  be the number of H O L cells heading to output port i in time slot t, but were not able to be switched to the output (i.e. backlog);  F  t  be the throughput of the sub-switch, which is basically the number of cells exiting the sub-switch, at time slot t.  With the definitions given above and the fact that each output port can only accept one cell in a time slot, we can easily see that, for a particular output port i,  B = max(0, l  + A -1) t  (2.1) As the switch time taken by every cell is constant, the switch is actually a M / D / l system [25]. Therefore the expected backlog is:  E[B] = XW =—£  2(1 ~p) (2.2)  where X is the arrival rate, Wis a cell's expected waiting time and p is the throughput of each output port of the system.  34  Also, the sum of the total number of backlogged H O L cells and the number of switched H O L cells is simply the number of input ports N:  F,+J^B,=N (2.3)  Since we need the total number of backlogged H O L cells, which is actually the sum of backlogged cells regarding to all the output ports ranging from 1 to M , therefore the term £B in (2.3). Taking expectations in the limit as t —> °°, divide (2.3) by M , we then t  obtain:  M  M (2.4)  Thus, from our earlier discussion, we know that N / M - k. Now, The term E[F]/M is just the throughput of an output port in the N * M switching system (the sub-switch), which is p in (2.2). Therefore, after combining (2.2) and (2.4), we get:  2  E[B]=—£  =  k-p  2(1-p)  (2.5)  Thus, p, as a function of k, is found to be:  35  p(k) = l + k-M  + k,  k = 1,2,3,....  2  (2.6) For the case that k-l,  therefore simple single FIFO, p(l) = 0.586. This matches exactly  the well-known bound of 0.586 of H O L blocking switch. From (2.6), we can also find that the bifurcated switch yields a throughput of p(2) = 0.76. 
In fact, simulations show that the throughput increases with k [25] and finally p(k) -> 1 as k -> infinity. Fig. 2.5 shows a plot of equation (2.6), where we can see how increasing k improves throughput.

Fig. 2.5: Throughput of Bifurcated Switch against k

In addition, when k reaches N (i.e. M = 1), every input port has N logical queues, and each of these logical queues corresponds to a specific output port. Thus, we obtain a virtual output queuing system that has one logical queue specifically reserved for each output port. In a virtual output queuing system, HOL blocking is completely eliminated. Furthermore, virtual output queuing is extensively applied in the Hitibas switch, which will be detailed in chapter 3.

2.2.2. Output Buffering

In this buffering scheme, buffers are placed at each output port. Since output buffering does not have the HOL blocking problem, queuing delay is largely a function of the arrival process and is not affected by internal blocking of the switch. For instance, the mean waiting time of an N*N switch can be estimated by [23]:

W = ((N - 1)/N) * p / (2(1 - p))    (2.7)

where N is the switch size, and p is the arrival probability of the Bernoulli process. Output buffering switches may reach 100% throughput easily. Also, given an offered load smaller than 1, we can always find a buffer size that lets the switch meet any loss probability requirement. This is the major advantage of output buffering switches over input buffering switches. However, large buffers could be expensive, so it is always desirable to find ways to reduce the necessary buffer size. Output buffering switches have been the subject of numerous studies [5, 6, 23, 27, 28, 29]. The knockout switch [27, 28] is one of the major advancements in this area.
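Before turning to the knockout switch, the output-queueing delay estimate of (2.7) can be tabulated to show how steeply it grows with load. The sketch below is illustrative Python assuming the classic form W = ((N-1)/N) * p / (2(1-p)) from [23]:

```python
def mean_wait(n, p):
    """Mean cell waiting time (in slots) for an output-buffered N x N
    switch under i.i.d. Bernoulli(p) arrivals, per equation (2.7):
    W = ((N-1)/N) * p / (2*(1-p))."""
    return (n - 1) / n * p / (2 * (1 - p))

# Delay stays below one slot at 50% load but blows up as p -> 1.
for load in (0.5, 0.8, 0.9, 0.95):
    print(load, round(mean_wait(64, load), 2))
```

Note the N-dependence is weak ((N-1)/N is close to 1 already for moderate N); the load p dominates, which is why the text says the network can control delay by controlling load.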
The knockout process involves two subsystems inside the output buffered switch: a concentrator and a shared queue subsystem. In every time slot, each input sends one cell to all the outputs through the crossbar switching system. A concentrator at each output determines whether the cell is destined for the corresponding output port. If so, it allows the cell to go on to the knockout process (described in the following paragraphs). In the knockout process, the cell competes with other cells that are heading to the same output port in the same time slot. If the cell wins in the knockout process, it is sent to the shared queue; if it loses, it is discarded. Note that cells are also discarded if the concentrator receives more cells than it can handle, or if the shared queue overflows. In case a particular input has no cell to send, it sends a specially marked empty cell that will surely lose in the knockout process. The following is a brief description of the concentrator and the shared queue system.

There is a concentrator at every output port. The major job of the concentrator is to determine which incoming cells are to be sent to the shared queue. It takes h (0 <= h <= N) cells in a cycle, and outputs up to L (L <= h) cells, where L is the maximum number that the output hardware can handle.

How the knockout switch picks out the L cells from the h cells is what the term "knockout" means. In fact, the knockout switch is named after the L-elimination knockout tournament. The basic component of a concentrator is a 2*2 switch module that takes two cells in and outputs one winner cell plus one loser cell. The decision is made randomly, unless one of the competing cells is an empty cell (in which case the empty cell will be the sure loser). The winner cell then competes with the winner cell from another switch module at the same level, the loser cell competes with the loser cell from another switch module at the same level, and so on.
The cell that wins all its matches takes the first output route, the cell that loses 1 match takes the second output route, the one that loses 2 matches takes the third output route, and so on. Finally, the cell that loses L-1 matches takes the L-th output route, and any cell that loses L or more matches is discarded.

Fig. 2.6 illustrates the knockout process in a concentrator, assuming h = 4 and L = 2. The concentrator therefore takes in 4 cells and outputs 2 cells. A, B, C, D are four cells competing for the two outputs. In stage 1, cell A wins its match and continues to compete with C, which is also a winner in stage 1. However, B loses its match in stage 1 and then has to compete with D, which is also a loser. In stage 2, A wins again, and is successfully sent to the first output port. Also, D wins the match in the "losers' line", and is sent to the second output port. Cells B and C are discarded.

Fig. 2.6: The knockout process in a 4*2 concentrator (A wins two matches; D loses one match)

Similar to the concentrator, there is one shared queue at every output port. In each clock cycle the shared queue sends one cell out of the output port while accepting up to L cells from the concentrator. It has to move the L cells from the concentrator to the queue quickly, within one time slot; otherwise cell loss may be large. Under bursty conditions, the buffer even needs to be capable of tolerating several heavy cycles without dropping any cells.

Since the time to remove a cell from the queue is constant, the queue can be modeled as an M/D/1 queue. By assuming the M/D/1 model, and therefore uniformly distributed arrivals, the average queue length can be easily calculated. However, we have to keep in mind that bursty traffic often requires much longer queues in some output ports at one moment while others are empty.

Output buffered switches generally need a complexity of O(N^2).
Also, their low cell loss rate depends heavily on the assumption that the input traffic is not correlated. That is, the destination port of each cell is assumed to be 100% random. As mentioned before, under correlated traffic, output buffering could perform even worse than input buffering.

2.2.3. Internal (Inter-stage) Buffering

In internal buffering schemes, buffers are placed between the inputs and outputs, or more exactly, in the middle of the internal switch network. However, internal buffering switches generally require a more complex infrastructure and a larger area to build than the other two buffering schemes [22]. Therefore, internal buffering is usually not favorable for ATM switch designers.

2.3. Comparisons between Buffering Strategies

Since internal buffering generally requires many more buffers and takes up much more space to build, it will not be considered in our switch design. Therefore, only the input and output buffering schemes are compared in this section.

Output buffering schemes are free of HOL blocking by nature. In fact, unless the switching network is blocking itself, output buffering is free from any kind of blocking effects. Consequently, 100% throughput can be easily obtained. Also, as long as there are enough buffers, the cell loss can be maintained at an arbitrarily low level, although cell delay may be significant for large buffers. In fact, cell delay is only a function of the input load (2.7). That means the network itself can control the cell delay by manipulating the input load.

However, (2.7) is calculated under the assumption of independent and identical Bernoulli traffic, which is only an optimistic model. In real-time traffic, cells tend to be correlated, and bursts of cells tend to head to the same output ports of a switch for a period of time. Hence, under bursty traffic, a few output ports may be full of waiting cells, while others are empty for a long time.
As a result, these bursts greatly raise the buffer size requirement in switch designs. In addition, output-buffering switches have no way to schedule or rearrange cells before they enter the switching network. If two or more cells are contending for the same output port, cells will have to be discarded, because each output port can only accept one cell at a time.

Many output buffered switches also require a speedup factor of N, where N is the number of inputs/outputs of the switch, in order to provide 100% throughput. As mentioned before, this speedup factor is simply not practical for large switches and high line rates [43].

On the other hand, many basic FIFO input buffering switches suffer from HOL blocking, which limits the throughput to about 58% under uniform traffic, and to much less under bursty traffic. This is the main concern for input buffering switch designers. However, many techniques have been invented that successfully prevent HOL blocking [2, 3, 16, 18, 19, 24, 33]. For instance, through scheduling, cells can be rearranged in ways such that no more than one cell heads to the same output port within the same time slot. Finally, since incoming cells can be rearranged inside the input buffering system (as long as they are kept in order), it is possible to make the input cells "less correlated" under bursty traffic. This is why simulation studies [30] found that input buffering systems are generally more robust under bursty traffic. For the same reason, input buffering systems are also more suitable for multicast traffic. A switch that can operate at the same speed as the lines (i.e. no speedup is required) is more feasible and scalable in terms of current technology and cost [44], and input buffering provides exactly that environment.

2.3.1.
Preventing Internal Cell Collisions

As mentioned above, in output buffering, when two or more cells try to access the same output port, a collision occurs and cell loss results. There are many ways to get around this problem. We now evaluate a few methods that appear in many switch designs.

2.3.1.1. Switch Speedup

Switch speedup increases the switching speed to M times the link speed (where M is an integer, usually called the speedup factor), so that M cells can be switched in one time slot with respect to the link speed. Of course, the speed of the buffering system also has to be increased by a factor of M in order to buffer the increased number of cells per time slot. However, the link speed has improved dramatically in recent years (e.g. OC-48 approaches a link speed of 2.5 Gbps); making the switch many times faster than the link is extremely difficult and expensive nowadays.

2.3.1.2. Flow Control (Backpressure)

After all, switch speedup can only decrease the probability of collision, not eliminate it (unless the speedup factor equals the switch size N). One way to completely eliminate cell collisions is to control the traffic flow inside the switch, so that no more than one cell heads to the same output port within a time slot.

An output port being contended for can issue a backpressure signal to some of the contending input ports, blocking them from sending cells to it. Extra cells are thus held at the input, instead of going into the switch and eventually colliding with other cells. To keep these blocked cells from being deleted, and to send them again in the next clock cycle, some FIFO buffers are required at the input ports.

Although flow control can efficiently prevent collisions, backward paths are required for each output port to send backpressure signals to all the input ports. In addition, a contention check has to be issued to every input port whenever an output port receives a cell.
Cells at the input ports cannot be switched until the contention checks are done. These overheads greatly complicate the hardware wiring and increase the delay inside the switch.

The above schemes are designed to avoid cell collisions in output buffering designs. In input buffering, a very effective scheme exists, called virtual output queuing.

2.3.1.3. Virtual Output Queuing

As mentioned in the previous subsection of this chapter, virtual output queuing simply means reserving a "private" logical queue, at every input port, for each output port. Therefore, a HOL cell blocked for a particular output port will not affect cells at the same input port that are heading to other output ports (because cells heading to different output ports are buffered in different logical input queues). Virtual output queuing does not only eliminate HOL blocking; when applied with a proper scheduling scheme, it also eliminates internal cell collisions without complex wiring or any contention check. For example, if it is applied with skewed-round-robin (SRR) scheduling [16], there will never be more than one cell heading to the same output port in the same time slot.

It is noteworthy that internal cell collision is a switch-level cell loss source. Even if infinite buffers were available, cell loss would still occur if internal cell collisions exist. Therefore, it is very important to completely eliminate internal cell collisions before we consider any improvement in the buffering system. We have found that virtual output queuing together with the SRR scheme is a simple yet powerful way to do so.

2.3.2. Handling of Multicast Traffic

As multicast traffic is making a major leap forward in the communication industry, it is worthwhile to consider multicast performance before designing the switch.

Multicast traffic is exceptionally bursty at the switch level.
In output buffering schemes, since there is no input scheduling, internal cell collisions can be frequent under multicast traffic. In addition, multicast traffic indirectly increases the actual load on the switch. Since output buffering schemes do not have much control over fairness, the queues at some output ports can grow very long.

Nonetheless, with the help of scheduling schemes, most major input buffering switches can process cells with a great degree of fairness. With the ability to manipulate cells at the input, scheduling also helps prevent internal collisions from occurring. As a result, input buffering schemes are more suitable for handling multicast traffic than output buffering schemes.

2.3.3. Concern of Buffer Size Requirement

In switch design projects, the buffer size required to achieve the expected result is always a concern. In output buffering schemes, the efficiency of buffer utilization is limited by the fact that one output port can only send out one cell per time slot. In case a particular queue is extremely long while others are empty, there is not much a designer can do with output buffering. Therefore, to prevent cells from being discarded, the buffer size usually has to be very large.

On the other hand, as we have mentioned a few times before, arriving cells can be rearranged in many ways under input buffering schemes. It turns out that, as the simulations in Chapter 5 show, the average queue lengths can be bounded even when zero CLR is to be obtained.

Before we conclude this chapter, Table 2.1 gives a brief and general comparison between common input and output buffering schemes.
                            Input Buffering Schemes       Output Buffering Schemes
Throughput                  Low                           High
Cell Loss Ratio             High                          Low
Cell Delay                  Low                           High
Implementation Cost         Low                           High
Architecture Complexity     Low                           High
Cell Scheduling at Input    Possible                      Impossible
Major Problems              HOL Blocking,                 Output Port Collision,
                            Internal Cell Collision       Internal Cell Collision
Current Solutions           Virtual Output Queuing,       Switch Speedup,
                            Cell Rearrangement at Input   Backpressure Flow Control

Table 2.1: Comparison between general Input and Output Buffering Schemes

To conclude this chapter, we find that, except for the simple FIFO scheme, input buffering strategies are more robust than output buffering strategies in many ways. However, there are still many problems that input buffering schemes face. In the next chapter, we will look at some switches that are designed around input buffering schemes. From those designs, we can study some of the problems faced by input buffering schemes and discuss the solutions their designers have taken.

Chapter 3: Survey of Input-Queueing Switch Architectures

We discussed the input and output buffering schemes in depth in Chapter 2 and found that each has its own advantages and disadvantages. However, when it comes to handling bursty traffic, input buffering schemes have been shown to be more robust than output buffering schemes [30]. In this chapter we provide an overview of some popular input queuing switch designs based on different input scheduling techniques. By input queuing we mean an arrangement where cells are first queued and then switched. Since cells are queued as soon as they arrive and at most one cell can be served per time slot, the switch fabric needs only to operate at the same speed as the input and output links (internal speedup is not required). However, when two or more cells at different input ports are contending for the same output port, only one cell will be switched.
Buffers are necessary to store the extra cell(s). The buffered cells have to wait until the switching fabric corresponding to their input-output pair is free to do the switching. As mentioned in Chapter 2, the HOL problem severely limits the throughput of basic FIFO input buffering switches to about 58.6%. Thus, the primary task of an input buffering switch is to eliminate the HOL problem.

The switch designs described in this chapter are highly related to the proposed switch. Understanding these switching architectures is the key to the design of our proposed high performance switching system (which will be described in detail in the next chapter).

3.1. The Batcher-Banyan Switch

The Batcher-Banyan switch [6, 23, 31, 32] is well known for its low cost. It is basically an input buffering switch and is made up of a Banyan switching network and a Batcher sorting network. It employs a self-routing algorithm in the Banyan switching network, and a sorting algorithm in the Batcher sorting network. Input ports and output ports are connected through log2 N stages of the routing network. Each routing stage consists of N/2 binary switches. A binary switch routes a cell through its upper or lower link according to the cell's output port number. There is exactly one route between each input-output pair within the switch. Fig. 3.1 illustrates a simple 4*4 Banyan switch.

Fig. 3.1: Simple 4*4 Banyan Switch

In Fig. 3.1, a cell heading to output port 10 (in binary) enters from input port #1. At stage 1, the first route element (1) leads the cell through the lower port of the binary switch (the darkened line). At stage 2, the second route element (0) leads the cell through the upper port of the binary switch. The cell thus uniquely reaches output port 10. In general, a route element (0) takes the upper link, while a route element (1) takes the lower link.
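The self-routing rule just described is simple enough to sketch directly. The function below is our own illustration (the name banyan_route and the "upper"/"lower" strings are ours, not from the design): it traces the link chosen at each of the log2 N stages from the bits of the destination address, most significant bit first.

```python
def banyan_route(dest, n_stages):
    """Trace the self-routing path of a cell through a Banyan network.

    dest: destination output port number; n_stages: log2(N) routing stages.
    At stage k, the k-th address bit (most significant first) selects the
    binary switch's upper link (0) or lower link (1), as in Fig. 3.1.
    Returns the list of links taken, stage by stage.
    """
    bits = [(dest >> (n_stages - 1 - k)) & 1 for k in range(n_stages)]
    return ["lower" if b else "upper" for b in bits]
```

For the walkthrough of Fig. 3.1, destination 10 (binary) in a 4*4 switch gives banyan_route(0b10, 2), i.e. the lower link at stage 1 and the upper link at stage 2.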
The Banyan switch is simple, as it involves only O(N log N) complexity, yet it has a drawback: internal blocking occurs whenever an internal link is contended for by two or more cells. This blocking can occur even when the contending cells are heading to different output ports! This is illustrated in Fig. 3.2. Nonetheless, the comparatively small amount of hardware a Banyan switch requires makes the design attractive to many switch designers, and much work has therefore been put into improving the performance of Banyan switches [6, 31, 32].

Fig. 3.2: Collision occurs in a Banyan switch even though the cells have different destinations

If the incoming cells are sorted (by their destinations) and there are no gaps (i.e. slots with no cells at all) among the cells before they enter the Banyan network, internal blocking can be avoided [11, 23]. Therefore, we need some input buffers to first store the incoming cells and sort them within the same clock cycle before the cells are put into the input ports of the Banyan switch. It is important that the sorting is done within one clock cycle; otherwise cells from the next clock cycle will not be able to be sorted. Therefore, the sorting algorithm must be very efficient. The Batcher sorting scheme (which is a bitonic sorter) is chosen to do the sorting in this switch because it involves a fairly low complexity of O(N log^2 N) [6, 23].

Fig. 3.3 is a simple layout of a Batcher-Banyan switch. Assuming the switch size is N*N, cells are first synchronized and input to the Batcher sorter. Note that there may be some gaps in between the incoming cells. In the Batcher sorter, cells are sorted in two steps: Bitonic Merge Sort (BMS) and Bitonic Sort (BS). The BMS sorts any incoming sequence into bitonic sequences. Bitonic sequences are sequences that are juxtapositions of an increasing sequence and a decreasing sequence. For example,
1, 2, 3, 5, 9, 8, 6, 4 is a bitonic sequence. After the incoming sequence becomes bitonic, the BS can sort it into a monotonic (increasing order) sequence efficiently.

Recall that there cannot be any gaps between cells. A trap module first takes in the sorted sequence from the Batcher sorter. It then inspects the sequence and looks for any empty cell slots or duplicated destinations (duplicated destinations would cause collisions in the Banyan switch). When two or more cells are found to be destined for the same output port, one cell is randomly selected to be switched; the others are discarded. After that, the sequence is compressed, so that no gaps are left between cells, and sent to the Banyan switching network.

Fig. 3.3: Batcher-Banyan Switch

As mentioned above, the major advantages of Batcher-Banyan switches are the non-blocking property and the simplicity of O(N log^2 N). The major drawback of the Batcher-Banyan switch is that the throughput is still too low to be practical, even when the traffic is lightly loaded. The reason is that many cells are discarded when they contend for the same output port. This is especially true in bursty and multicast traffic, where many cells are often destined for the same output port during many consecutive time slots. Many versions of Batcher-Banyan switches (e.g. the Starlite and Sunshine switches [23]) have been developed to avoid discarding cells. Most of these newer versions employ a technique that "re-circulates" contending cells instead of discarding them. Under the re-circulation scheme, cells that cannot be switched are returned to the front of the Batcher sorter through a re-circulation path. The cells are then sorted and switched again. Although these Banyan switches do succeed in improving the throughput, cells may get reordered during re-circulation. Preserving the input cell sequence becomes yet another difficulty to be solved.
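For concreteness, the compare-exchange recursion that a Batcher network implements with parallel hardware comparator stages can be sketched in software. This is a textbook bitonic sorter written for illustration (our own sketch, assuming a power-of-two sequence length), not code taken from any of these switches.

```python
def bitonic_merge(seq, ascending=True):
    """Merge a bitonic sequence into a monotonic one (the BS step).

    Each level compare-exchanges elements half the sequence apart;
    in hardware, every level is one parallel bank of comparators.
    Assumes len(seq) is a power of two.
    """
    n = len(seq)
    if n == 1:
        return list(seq)
    half = n // 2
    s = list(seq)
    for i in range(half):
        if (s[i] > s[i + half]) == ascending:
            s[i], s[i + half] = s[i + half], s[i]
    return (bitonic_merge(s[:half], ascending) +
            bitonic_merge(s[half:], ascending))


def bitonic_sort(seq, ascending=True):
    """Sort an arbitrary sequence by first building bitonic subsequences
    (the BMS step: sort one half ascending, the other descending),
    then merging them. Assumes len(seq) is a power of two."""
    n = len(seq)
    if n == 1:
        return list(seq)
    half = n // 2
    first = bitonic_sort(seq[:half], True)
    second = bitonic_sort(seq[half:], False)
    return bitonic_merge(first + second, ascending)
```

Applied to the bitonic sequence above, bitonic_merge([1, 2, 3, 5, 9, 8, 6, 4]) yields the monotonic sequence 1, 2, 3, 4, 5, 6, 8, 9. The recursion makes the cost visible: O(log^2 N) comparator levels of N/2 comparators each, which is exactly the hardware expense discussed next.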
In addition, all of these Batcher-Banyan designs require a Batcher sorter. Although Batcher sorting involves only O(N log^2 N) complexity, it requires a lot of comparisons before a sequence is sorted, especially for long sequences. Building this sorter to fulfill the high speed requirement could be very expensive.

Nonetheless, there are many other input buffering technologies that are also non-blocking, yet do not discard contending cells. Tiny Tera, iPOINT and Hitibas are three outstanding ones.

3.2. The Tiny Tera Switch

Another input buffering example in our study is the Tiny Tera switch [33]. The Tiny Tera switch consists of three parts: the buffering system, the scheduling system and the switching system. The buffering system is contained in the port cards, each of which corresponds to a pair of input-output ports with the same port number. The scheduling system is embedded in the centralized scheduler; an algorithm called iSLIP is responsible for the scheduling task. The switching system is a central switching hub that consists of N parallel sliced crossbar switches. The following paragraphs review each part in detail.

Tiny Tera implements a virtual output queuing scheme for unicast cells, and adds a single FIFO queue for multicast cells. Virtual output queuing has been described in the previous chapter: a cell is stored in a specific FIFO queue according to the output port number it is heading to. For multicast traffic, multicast cells are first stored in a FIFO queue (specific to multicast cells), and then forwarded to the switching network based on two scheduling methods: fanout-splitting at the input and the Weight Based Algorithm (WBA) at the output. Fanout-splitting allows cell-copies, which are duplicated from an incoming multicast cell, to be delivered to the output ports over any number of time slots. In other words, cell-copies from a multicast cell do not all have to be delivered in the same time slot.
This is an efficient and throughput-enhancing method, and it will be used in our switch as well. More details on WBA can be found in [33].

iSLIP is an improved version of the simple round robin scheduling scheme for virtual output queuing. We will provide details of simple round robin scheduling when we discuss the Hitibas switch in a later section. At this point we simply state that simple round robin scheduling can cause a severe output contention problem in virtual output queuing. iSLIP is priority-based and solves the output contention problem by having the arbiters of the input modules slip with respect to each other. There can be an arbitrary number of iterations in a single time slot. An input-output match in the first iteration will lead to a faster match in the second iteration, and so on. All inputs and outputs are initially unmatched. Connections (or matches) made in one iteration are never removed by a later iteration within the same time slot, and only those inputs and outputs not matched at the end of one iteration are eligible for matching in the next iteration. The following procedure illustrates how iSLIP works in one iteration, operated in parallel on each input and output.
For all unmatched input ports {
    Send requests to all output ports for which the input port has a HOL cell;
}
For all unmatched output ports {
    Grant the requesting input port that is at or after the highest priority pointer in the round-robin schedule;
}
For all unmatched input ports {
    Accept the output port that is at or after the highest priority pointer in the round-robin schedule;
    If the accepted output port has also granted the request of this input port {
        A match has occurred for this input-output pair;
    }
}
For all input ports {
    Increment the pointer to one location (modulo N) beyond the accepted output port, only if this input port was matched in the first iteration;
}
For all output ports {
    Increment the pointer to one location (modulo N) beyond the granted input port, only if this output port was matched in the first iteration;
}

This algorithm is depicted in Figs. 3.4 and 3.5. Fig. 3.4 shows the first iteration, while Fig. 3.5 illustrates the second iteration.

Fig. 3.4: iSLIP Algorithm - First iteration

Fig. 3.5: iSLIP Algorithm - Second iteration

In Fig. 3.4, input port 1 (I1) has sent requests to output ports 1, 2 and 4 (O1, O2, O4). In return, O1 grants to I4, O2 grants to I1 and O4 grants to I3, according to their round-robin arbiters. Since I1 accepts O2 according to its round-robin arbiter, a match occurs for the I1-O2 pair. On the other hand, I3 has sent requests to O2 and O3. It receives a grant from O3 (note that I3 is the only input port requesting a grant from O3), but it only issues an accept signal to O2 (the next highest priority after O4) according to its round-robin arbiter. Therefore, no match occurs for this pair. If the second iteration is allowed to proceed, as shown in Fig. 3.5, O3 will grant to I3 (because I3 is the next highest priority input port after I2), while I3 will accept O3. Therefore, I3 and O3 will be matched in the second iteration.

All the buffering and scheduling work is carried out in the port cards; a port card is shown in Fig. 3.6.
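Before turning to the port card hardware, the request-grant-accept procedure above can be put in code. The sketch below is our own illustration with zero-indexed ports, not code from Tiny Tera; following the pseudocode above, each input issues its accept by round-robin pointer over its requested outputs, and a match is recorded only when grant and accept coincide.

```python
def islip_iteration(requests, grant_ptr, accept_ptr,
                    matched_in, matched_out, first_iteration):
    """One iteration of the iSLIP procedure above (illustrative sketch).

    requests[i]: set of output ports input i has HOL cells for.
    grant_ptr[o] / accept_ptr[i]: round-robin pointers of output o / input i.
    matched_in / matched_out: ports matched in earlier iterations (mutated).
    Returns the set of (input, output) matches made in this iteration.
    """
    n = len(requests)

    def pick(ptr, candidates):
        # Round-robin choice: first candidate at or after the pointer.
        for k in range(n):
            if (ptr + k) % n in candidates:
                return (ptr + k) % n
        return None

    # Grant phase: each unmatched output grants one unmatched requester.
    grants = {}
    for o in range(n):
        if o not in matched_out:
            requesting = {i for i in range(n)
                          if o in requests[i] and i not in matched_in}
            g = pick(grant_ptr[o], requesting)
            if g is not None:
                grants[o] = g

    # Accept phase (conceptually parallel, so snapshot the matched set).
    prior_matched_out = set(matched_out)
    matches = set()
    for i in range(n):
        if i in matched_in:
            continue
        o = pick(accept_ptr[i], requests[i] - prior_matched_out)
        if o is not None and grants.get(o) == i:
            matches.add((i, o))
            matched_in.add(i)
            matched_out.add(o)
            if first_iteration:  # pointers advance on first-iteration matches only
                accept_ptr[i] = (o + 1) % n
                grant_ptr[o] = (i + 1) % n
    return matches
```

In a small 2*2 scenario where both inputs request both outputs and all pointers start at 0, the first iteration matches only the I0-O0 pair (both outputs grant input 0), and a second iteration completes the matching with I1-O1, illustrating why multiple iterations per time slot pay off.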
A single port card represents both an input and an output port simultaneously, and contains an interface to the external links, an interface to the sliced crossbar switching network, and an interface to the centralized scheduler. All the interfaces are connected via high speed (Gb/s) serial links. In addition to the interfaces, the port card also contains N application-independent data slices, and an application-dependent port processor. The incoming cells are received by the data slices in units of 64 bits; a 64-bit unit is called a chunk. When a chunk of data goes into a data slice, it gets a memory address and a chunk header from the port processor. It is then stored at the assigned memory address, where it waits to be switched.

Fig. 3.6: A Port Card in a 4*4 Tiny Tera Switch

Tiny Tera can also be output buffered if the optional algorithms are implemented in the port cards. For more effective use of buffer space, both incoming and outgoing cells can dynamically share the same SRAM modules for both input and output buffering.

The Tiny Tera switch does show good results under the uniform traffic model. Simulations show that Tiny Tera approaches 100% throughput in just one iteration, but the cell delay is very high [33]. With four iterations, however, the cell delay can be made very small. It is noteworthy that four iterations done in one time slot demand an extreme amount of speed. Tiny Tera also excels in multicast traffic: with an average fanout of 4, it obtains very good throughput.

3.3. Illinois Pulsar-based Optical Interconnect (iPOINT) Switch

iPOINT [15] is built around yet another approach to input buffering.
Simulations of iPOINT demonstrate a great improvement in performance over the simple FIFO input buffering switch, although it employs some very complex structures in the buffering and scheduling systems: Three Dimensional Queuing (3DQ) and the Matrix Unit Cell Scheduler (MUCS) [15]. We elaborate on its architecture in this section.

First of all, the 3DQ stores cells in non-FIFO random access memory (RAM). Cells are sorted according to a three-dimensional criterion. The three dimensions are Virtual Connection (VC), Priority, and Destination. Sorting by VC assures per-VC bandwidth allocation; sorting by priority enables quality of service (QoS) tracking; sorting by destination organizes cells into virtual output queues, so that HOL blocking is avoided.

Inside a 3DQ module, the RAM is organized as cell rooms. Each cell room can be taken up by one ATM cell. After a cell enters the 3DQ module, it is represented and manipulated by the address pointer that points to the cell room in which it is located. Cells are first sorted according to the VCs they belong to (i.e. according to their virtual connection identifier, or VCI). A VC is then represented by an address pointer that points to its HOL cell. Within a VC, cells are clustered into destination groups. Each destination group corresponds to a particular output port (similar to the idea of virtual output queuing). Within a destination group, cells are further sorted into service queues. A higher priority service queue is served ahead of lower priority queues. The number of service queues a 3DQ module can have is specified by the designer.

After cells are sorted in the 3DQ, they need to be scheduled for transmission. The scheduling is done by MUCS, a combined analog-digital circuit. In each time slot, MUCS examines the 3DQ modules and finds the "right" ones to transmit, one cell per module per time slot. The cells with the least flexibility are selected for transmission first.
The flexibility of a cell is defined by the traffic matrix [15], which is in turn defined by the traffic vectors. Each traffic vector is an N-element vector (a1, a2, ..., aj, aj+1, ..., aN). Each element aj is a non-negative integer and corresponds to destination group j. When aj is zero, it indicates that the destination group is empty and has no cell to send. When aj is non-zero, it indicates the highest priority level among the cells buffered in the destination group, and that a cell is ready to be switched to output port j. Figs. 3.7 and 3.8 show the structure of a 3DQ module and of a destination group, respectively.

Fig. 3.7: The structure of a 3DQ module

Fig. 3.8: The structure of a destination group

A VC is active whenever it has a cell to send. A cell tag containing the cell's address pointer (recall that cells are manipulated by the address pointer that points to their cell room location) is appended to the service queue. After a cell is sent from a VC, the VC's entry (i.e. the cell tag) is removed from the service queue and the cell room is recycled to a freelist. The freelist, a linked list that tracks freed cell rooms and provides empty cell rooms for newcomers, is updated every time slot. Every active VC can have only one entry in a service queue.
If the VC still has more cells to send, its cell tag is appended to the service queue again at the end of the time slot. If a VC has no cell to send, it becomes inactive. Thus, no VC will be starved, and fair bandwidth sharing can be enforced.

In addition, a VC table is maintained for each active VC. VC tables store QoS parameters (priority, maximum queue length, current queue length, and queue length threshold), traffic statistics, destination(s), outgoing VPI/VCI values, and pointers to the head and tail cell rooms of the VC. VC tables are updated whenever a cell is transmitted from, or a cell arrives at, the VC. Maintaining VC tables in the RAM module allows the QoS information to be updated efficiently for each VC, which is essential for per-VC QoS tracking.

Although iPOINT can also handle multicast traffic, the complexity of the internal structure may cost several time slots to process a single multicast cell. This can cause large delays when the multicast traffic is heavy.

Simulations show that iPOINT has a throughput of about 100% and 90% under uniform and bursty traffic respectively [15]. While these figures are comparable to many output buffering strategies, the complexity involved in the buffering and scheduling systems limits the size of iPOINT to only 16 or 32 ports, which is definitely not practical for commercial application.

3.4. The Hitibas Switch

The Hitibas switch [16, 17] is a very cost-effective input buffered switch and is capable of achieving 100% throughput by using only input buffers. A Hitibas switch consists of N input ports and N output ports running at the same speed as the switching fabric. The switching network is a simple Banyan switch, and it is therefore self-routing. Although the Banyan switch is generally blocking, the Hitibas switch is not, because it employs two powerful techniques: virtual output queuing and skewed round robin (SRR) scheduling.
Virtual output queuing has been described in detail in Chapter 2. SRR is a modified version of the simple round robin scheduling scheme. Simple round robin serves the virtual output queues cyclically to enforce fairness. Consider an N*N Banyan switch employing virtual output queuing at the inputs. Each input port is connected to an input module (IM), which contains N independent FIFO logical input queues (LIQs) and an input queue controller. Each LIQ corresponds to an output port. When a cell arrives at a particular IM, it enters a LIQ according to the output port number contained in the cell header. With simple round robin scheduling, all N LIQs are served cyclically, and at most one cell is forwarded to the Banyan switch in any time slot. Although simple round robin seems to be fair, it suffers from serious output contention: the HOL cell in LIQ #3 of IM #1, for example, contends with the cell in LIQ #3 of any other IM. As a result, the throughput is limited to just about 63% [33]. Skewed round robin scheduling avoids output contention by imposing a shift of one LIQ index between any two consecutive IMs. For example, if LIQ #3 in IM #1 is being served in a particular time slot, then LIQ #4 in IM #2, LIQ #5 in IM #3 and so on will also be served in that same time slot. Consequently, all the cells being forwarded to the Banyan switch are essentially sorted by their output destinations. Under such input ordering, the Banyan network becomes non-blocking. Fig. 3.9 depicts a simple Hitibas switch.

Fig. 3.9: Layout of an 8*8 Hitibas Switch

The Hitibas switch provides an internally non-blocking environment with very low hardware complexity. However, a large portion of the switch bandwidth is wasted when many LIQs are empty in certain time slots.
This is very wasteful under bursty traffic conditions. By employing the technique of input grouping, the buffers can be utilized in a much more efficient way. With input grouping, G input ports (where G is the grouping factor) share one IM. The IM, and its shared memory, has to be capable of accepting G cells from the G input ports and switching G cells in a single time slot. The number of IMs and the total buffer size needed are now reduced by a factor of G, which leads to a large saving in resources. Of course, the SRR scheduling needs some modification in order to work with input grouping: it now requires a shift of G, instead of one, LIQ indices between any two consecutive IMs being served. The following algorithm illustrates the work done by a Hitibas switch with grouping in one time slot. In the algorithm, G is the grouping factor and q(j) is the logical input queue with queue number j.

j = 0;
For (each input module k = 0; k < N/G; k++) {
    For (i = j; i < j + G; i++) {
        If (LIQ q(i mod N) is non-empty)
            Switch the HOL cell of q(i mod N);
        Else
            Nothing is done for this logical input queue;
    }
    j = (j + G) mod N;
}

Simulations show that for a group size of four, the utilization of the Hitibas switch reaches the applied load while the throughput reaches 100% for uniform traffic. Obviously, these results are much better than those of the simple FIFO input buffered switch. It is noteworthy that Hitibas is very simple and cost-effective to build, because the Banyan switching architecture is known to be very inexpensive and virtual output queuing is very easy to implement.
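The index arithmetic of grouped SRR can be sketched as follows. This is our own illustration, not code from Hitibas; in particular, it assumes (as the N/G-slot service cycle implies) that the starting queue index advances by G every time slot. The queues served across all IMs in one slot then cover all N output ports exactly once, which is what yields the sorted, collision-free input pattern the Banyan network needs.

```python
def srr_schedule(n_ports, group, t):
    """LIQ indices served in time slot t under grouped skewed round robin.

    Returns a list with one entry per input module: the G queue numbers
    that module serves, with a shift of G between consecutive IMs.
    Assumption (ours): the queue where IM #0 starts advances by G each
    slot, so every LIQ is served once per N/G slots.
    """
    n_ims = n_ports // group
    base = (t * group) % n_ports  # starting queue of IM #0 in slot t
    return [[(base + k * group + g) % n_ports for g in range(group)]
            for k in range(n_ims)]
```

For an 8-port switch with G = 2, slot 0 gives [[0, 1], [2, 3], [4, 5], [6, 7]]; in every slot the served queue indices, taken over all four IMs, are a permutation of the 8 output ports, so at most one cell heads to each output per slot.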
If memory size is a concern, or if the switch size is required to be small, the virtual output queuing technique will probably be wasteful. Secondly, every LIQ is only served once in a cycle of N time slots (or N/G time slots if input grouping is employed), so congestion builds up very easily under heavy traffic. Finally, when a LIQ that is supposed to take the turn to send a cell happens to be empty, the chance of switching a cell will be wasted. This chance is wasted even when other logical queues have cells to forward. As a result, the time slot will be only partially (and inefficiently) utilized. This circumstance is very common in real time traffic, where some logical input queues are empty for a long time while others are full of cells waiting to be switched. That is, the utilization of buffers may be very low in virtual output queuing under real-time or multicast traffic. The situation is illustrated in Fig. 3.10.

Fig. 3.10: Bandwidth is wasted when certain logical queues are empty in virtual output queuing

In Fig. 3.10, LIQ #2 in IM #1 takes the turn to switch a cell (determined by the SRR schedule). However, it happens to be empty in this time slot. No cell will be switched at IM #1 in this time slot even though other logical queues are quite full and new cells are arriving. A similar situation also happens to LIQ #3 in IM #2. This particular time slot is therefore wasted for IM #1 and IM #2. Note that many other LIQs in both IM #1 and IM #2 are already congested at the same time. It would have relieved the congestion a bit if other queues could utilize this wasted time slot, but this is forbidden by the SRR scheduling scheme.
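The waste can be stated in two lines of C: under plain SRR, whether an IM forwards anything in a slot depends only on the scheduled LIQ, not on the backlog elsewhere in the module. This is a sketch with our own names, not the thesis' code:

```c
#include <stdbool.h>

/* Plain SRR forwarding test: the IM forwards a cell only if the
 * scheduled LIQ is non-empty.  Even a heavily backlogged module
 * forwards nothing when its scheduled queue happens to be empty. */
static bool srr_forwards(const int liq_len[], int scheduled) {
    return liq_len[scheduled] > 0;
}
```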
In the next chapter, we will describe how our proposed switch avoids this wasteful situation with only minor modifications to the Hitibas switch.

3.5. Conclusion of this chapter

In this chapter we have investigated a few switch architectures designed around different input buffering technologies. Batcher Banyan is one of the first input buffered switches that break through the 58% HOL blocking limitation. Many designs based on Batcher Banyan improve the throughput even further. However, these Batcher Banyan based switches all involve very demanding and expensive hardware. Virtual output queuing technology is considered to be robust because it successfully eliminates HOL blocking with very practical hardware requirements. If a proper scheduling scheme is employed with virtual output queuing, we can also eliminate internal blocking inside a simple Banyan switch. This is achieved by the Hitibas switch. One scheduling scheme that the Hitibas switch employs is the skewed round robin (SRR) scheme. The SRR scheme provides a naturally sorted as well as uniquely destined input sequence under the virtual output queuing strategy. As a result, under uniform input traffic, the Hitibas switch is able to achieve a 100% throughput [16]. There are still some shortcomings, though. The SRR scheme serves each logical queue only once in N time slots. This may result in severely congested queues and very long delay, especially in real-time traffic. Another virtual output queuing based switch is the Tiny Tera switch [33]. The Tiny Tera employs a different scheduling scheme called iSLIP. Though iSLIP requires no synchronization among logical input queues, it might take several iterations to match all input-output pairs in order to avoid internal blocking. The matching process is very time-consuming, and extremely fast hardware is necessary to avoid congestion and long delay. Finally, the iPOINT switch [15] is built on the 3DQ technique and the MUCS scheduling scheme.
It provides per-VC QoS service and achieves very impressive throughput, at the expense of high hardware complexity.

In the next chapter, our proposed switch, also based on virtual output queuing technology, will be described. We will show that, with some minor additions to the available technologies, we can build a switch that performs very well under all kinds of traffic conditions (uniform, bursty, and multicast).

Chapter 4: Architecture of the Proposed Input Buffered Switch

4.1. A Brief Introduction to our Proposed Input Buffered Switch

In the previous chapter, we discussed the advantages and disadvantages of different switch architectures that are based on several input queueing technologies. Among those studied, we found that the application of virtual output queuing and the skewed round robin (SRR) scheduling scheme can provide impressive results such as 100% throughput under uniform traffic, as well as 0% HOL blocking. The combination of virtual output queuing and SRR is not only robust, but also reasonably simple and cost-effective to implement. Nonetheless, as we have discussed in chapter 3, the SRR scheme can be quite wasteful of resources when used to emulate output queuing (virtual output queueing). This is because each LIQ gets a turn to switch a cell only once in a cycle of N time slots. Under the SRR scheduling scheme, some LIQs may give up the turn to switch cells while other LIQs are eager to do so. This happens when some scheduled LIQs (i.e. LIQs that are taking the turn to switch) are empty in a particular time slot. Thus, the SRR scheme will leave such a time slot only partially utilized, rather than letting other non-empty LIQs take over the wasted bandwidth to do their cell switching. This can lead to severe buffer congestion under bursty traffic.

In order to increase the efficiency of virtual output queuing, the SRR scheme has to be modified so that it can unload a cell from a non-empty queue whenever a given scheduled queue is found empty.
Thus, the bandwidth that would otherwise be wasted will be better utilized, and the efficiency of virtual output queuing can be increased. Unfortunately, such a modification creates another problem: cell collisions. Consider that if one of the input modules does not follow the schedule and tries to switch a cell from a queue other than the scheduled one, the cell may collide with a cell sent from another IM in the switching network (this will be described in detail later in this chapter). In our proposed switch architecture, a modified SRR scheduling scheme is used, and cell collision is eliminated by employing the dilated interconnection network approach. This proposed switch will be described in this chapter.

Fig. 4.1 shows the simplified architecture of our proposed switch. Basically, it is an input buffering switch enhanced with additional output buffering. The switch consists of four subsystems:

1. The Multi-class Priority Arbitrator;
2. The Switching Network;
3. The Input Buffering System;
4. The Output Buffering System.

Each subsystem takes care of a particular task. Put together, these subsystems form a high performance, scalable and multicast-capable switch with a simple and low cost architecture. The switch is designed to provide service to 2 traffic classes, namely a delay sensitive class and a loss sensitive class. As described in chapter one, by delay sensitive class we refer to the CBR and VBR traffics (the non-ABR class); by loss sensitive class we refer to the ABR and UBR traffics (the ABR class). At each input port, a distinct LIQ is assigned to each traffic class for each output port. Priority is always given to the non-ABR (delay sensitive) class to unload cells from the input logical queues (LIQs) or output queues (OQs), while buffer space (LIQs and OQs) will be assigned to the ABR (loss sensitive) class to minimize loss.
It is noteworthy that, as we mentioned in chapter one, for the loss sensitive class of traffic we will focus mainly on the presence of ABR traffic. UBR traffic will share the same buffer space with ABR traffic, but with a higher priority to be dropped. We therefore collectively name this the "ABR" class of traffic.

For the sake of simulation, cells are assumed to arrive in a time-slotted manner. The length of a time slot is the period for a single cell to be completely switched, which is also considered as a unit time in the simulation.

Fig. 4.1: Overview of the Proposed Switch Architecture (32x32 lossless switch with degree of dilation = 2 and group size = 4)

4.2. The Multi-class Priority Arbitrator

The Multi-class Priority Arbitrator (MPA) handles all the arbitration processes for both the input and the output buffer systems. The arbitration is based on the following arbitration rules:

1. Non-ABR (i.e. CBR and VBR) cells have higher priority over ABR (i.e. ABR and UBR) cells to be switched (from the input ports) or sent out (from the output ports);
2. All unicast cells have higher priority to be deleted than multicast cells (or cell copies);
3. Within a single time slot, each input port can forward at most one cell to the switch, and each output port can send out at most one cell.

Rule (1) implies that at any time slot, a CBR or VBR cell always wins if it competes with another ABR or UBR cell from the same input port that is being switched to the same output port, or being sent out from the same output port.
Rule (2) says that when a multicast cell arrives at a fully occupied queue, unless the whole queue is occupied by multicast cells, the last unicast cell in the queue will be removed in order to give up space for the newly arrived multicast cell. Rule (3) simply states that no more than one cell can be switched out of an input port or out of an output port in one time slot.

The MPA unit monitors the occupancies of all the logical input queues and output queues, and controls the unloading of queues in all the input and output modules. When any of the queues in the modules becomes empty, the MPA will decide, according to the arbitration rules and the modified skewed round robin (MSRR) scheduling scheme, which queue is to be unloaded instead (the MSRR mechanism will be described later in this chapter). When the switch is operating in a multicast traffic environment, and if any of the queues in the modules becomes full, the MPA will decide whether the incoming cell or the last unicast cell in the queue will be deleted (more explanation on the multicast environment follows in section 4.4).

4.3. The Switching Network

The Switching Network (SN) is an N*N non-blocking Dilated Banyan Switch (DBS) [34, 35, 36] and is responsible for all cell-switching tasks. The DBS is basically a simple self-routing banyan switch with a speedup factor achieved through link dilation. The principle of the DBS relies on internal bandwidth expansion introduced through variable dilation of the switching elements (SEs) of the banyan network. Because of the dilation brought on by the SEs, the DBS can have multiple output links per output port. Thus, an N*N DBS can actually be viewed as an N*(M*N) switch, where M is the number of output links per output port. The rationale behind the output port dilation is to allow each output port to accept multiple (up to M) cells simultaneously.
This in turn allows multiple (up to M) input ports to send cells simultaneously to the same output port with no collision. The number of output links per output port, i.e. M, is called the Degree of Dilation (DOD) of the banyan switch. However, when more than M cells are heading to a particular output port, the excess cells will be blocked in the input buffers. The expanded internal bandwidth, together with proper scheduling, helps resolve internal path conflicts, and makes the DBS a non-blocking lossless switching architecture. Note that each output port is only allowed to send out one cell in a time slot, even though it may receive many cells in the same time slot. As a result, a certain amount of output buffering is required to temporarily store the excess cells arriving at the output ports in the same time slot. Fig. 4.2 shows a simple DBS with a DOD of 2.

Fig. 4.2: A Dilated Banyan Switch with Degree of Dilation of 2

4.4. The Input Buffering System

The Input Buffering System (IBS) consists of N/G synchronized input modules, where N is the total number of input ports and G is the grouping factor introduced in chapter 3 (i.e. each group of G input ports is grouped in one input module). The IBS is depicted in Figure 4.3.

Fig. 4.3: An Input Module of a 32*32 switch with group size of 4

Each input module (IM) has N sets of synchronized logical input queues (LIQs). Each set of LIQs is dedicated to one of the N output ports of the switch. A set of LIQs consists of two prioritized queues: one queue to exclusively store ABR or UBR cells (ABR queue) and one queue to exclusively store CBR and VBR cells (non-ABR queue). All the cells being forwarded to the switching network are synchronized.
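The queue organization of an IM can be captured by a small C data structure. The names and fixed capacities below are our own illustration (LIQ_CAP matches the 400-cell input buffer used later in the simulations), not the thesis' actual code:

```c
#include <stddef.h>

#define N_PORTS 32   /* switch size, illustrative */
#define LIQ_CAP 400  /* cells per LIQ, as in the simulation parameters */

/* One FIFO logical input queue (a ring buffer of cell identifiers). */
typedef struct {
    int cells[LIQ_CAP];
    size_t head, len;
} liq_t;

/* One set per output port: the non-ABR (CBR/VBR) queue is always
 * unloaded first; the ABR (ABR/UBR) queue is served only when the
 * non-ABR queue is empty. */
typedef struct {
    liq_t non_abr;
    liq_t abr;
} liq_set_t;

/* An input module holds one set of queues per output port. */
typedef struct {
    liq_set_t set[N_PORTS];
} input_module_t;

static int liq_empty(const liq_t *q) { return q->len == 0; }
```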
Since CBR and VBR cells have higher priority to be switched (as explained in chapter 1), the non-ABR queue has higher priority to be unloaded. The IMs employ the MSRR scheduling scheme (described in detail later in this chapter) to ensure a collision free environment. As mentioned in the previous chapter, the MSRR scheduling with grouping factor G is achieved by shifting G queues between two consecutive IMs during cell unloading at the input buffering system. All IMs are required to get permission from the arbitrator (i.e. the MPA) before a cell can be switched from any LIQ. The arbitrator gives permission to the IMs according to the arbitration rules and the MSRR scheduling scheme.

The MSRR scheduling scheme works as follows. At the beginning of each time slot, a total of G non-ABR queues (based on the skewed round robin scheme) in an input module are eligible to forward one cell to the switching network. However, if one of the G scheduled non-ABR queues is empty, the MPA will grant permission to the ABR queue in the same set (recall that a set contains one ABR queue and one non-ABR queue, both heading to the same output port) to forward a cell. In case both the ABR and non-ABR queues of a set are empty, the arbitration process will be retried with as many as G sets next to the group border. Figure 4.4 illustrates the MSRR scheduling scheme with a simplified version of the input module and switching network. If set #1 is the scheduled set and its non-ABR queue is empty, the ABR queue will be granted to send a cell to the switching network. If both queues in set #1 are empty, the arbitration process will be retried with set #5, because set #5 is the set next to the group border of set #1 (Group #1). If set #5 is also empty, the arbitration process will be retried with set #6, and so on, up to G times (G = 4 in this case), until a non-empty queue is found.
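One possible rendering of this retry rule in C, using 0-based set indices and occupancy arrays of our own invention (the thesis describes the rule but not its code):

```c
/* MSRR retry sketch: if the scheduled set is empty, probe up to G sets
 * starting at the border of the next group.  non_abr_len and abr_len
 * give the occupancy of each of the n sets in the module; returns the
 * index of the set to unload (non-ABR queue first within a set), or -1
 * if the scheduled set and all G probed sets are empty. */
static int msrr_pick_set(int scheduled, int g, int n,
                         const int non_abr_len[], const int abr_len[]) {
    if (non_abr_len[scheduled] > 0 || abr_len[scheduled] > 0)
        return scheduled;
    int next_group = (scheduled / g + 1) * g % n;  /* border of next group */
    for (int k = 0; k < g; k++) {
        int s = (next_group + k) % n;
        if (non_abr_len[s] > 0 || abr_len[s] > 0)
            return s;
    }
    return -1;  /* the slot goes unused for this schedule entry */
}
```

With the example above (set #1 empty, G = 4), the probes visit sets #5, #6, #7 and #8 in order.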
In Figure 4.4, the ABR queue of set #7 is found non-empty, and a cell from it will be sent to the switching network. Limiting the maximum number of retrials to the group size G restricts the algorithm to retry only the sets of queues within the group next to the one currently being unloaded. This keeps the maximum possible number of cells heading to the same output port equal to two, and therefore requires a DOD of 2 only (Fig. 4.6). This also helps the switching network structure remain simple and inexpensive. If the number of retrials were allowed to exceed G, a larger DOD would be required because more than two cells could possibly be heading to the same output port, while performance would not gain much (see Appendix A).

Determining which LIQ is to be unloaded when the one currently scheduled is empty can be time consuming. Our approach makes sure that all retrials can be done within one time slot. Using the previous example, the occupancies of sets #5, #6, #7, and #8 are checked simultaneously with set #1. In case set #1 is empty, a non-empty set among #5, #6, #7 or #8 can be switched immediately because the MPA already knows which set is to be granted the switching permission.

Fig. 4.4: The MSRR scheduling scheme in the proposed switch

If a standard Banyan switching architecture is used, the retrial process is not possible, because cell collisions will result. This situation is illustrated in Fig. 4.5.

Fig. 4.5: Collision on a simple 8x8 banyan switch

For the sake of simplicity, Fig. 4.5 only depicts a small 8x8 Banyan switch with single class traffic, which is enough to demonstrate the collision situation.
There are two IMs, each of which has 8 LIQs. The switching network is a simple banyan. With G = 4, each input module has two groups. If LIQ #1 in IM #1 happens to be empty in a given time slot, the MPA will issue a retrial request on LIQ #5 because LIQ #5 is the queue right next to the border of Group 1. Since there are cells waiting in LIQ #5, a cell will be forwarded to the switching network. In the same time slot, LIQ #5 in IM #2 is also scheduled to forward a cell to the SN, as this is its turn according to the MSRR schedule. As a result, a collision will occur at output #5.

Recall that a dilated banyan switch (DBS) can accept multiple cells to a single output port without conflict, so the retrial process is made possible (Fig. 4.6).

Fig. 4.6: Collision can be avoided with DBS

Fig. 4.6 illustrates the same situation as Fig. 4.5, except that the banyan switching network is replaced with a dilated banyan switching network. Since as many as two (when DOD = 2) cells can be accepted by any output module in a single time slot, the collision is avoided. Of course, buffers have to be added to the output modules to store the extra cell. It is noteworthy that at any time slot, an input module in our switch can still forward at most G cells to the switching architecture.

Under a multicast environment, a Replication-at-Receiving (RAR) [12] algorithm will be performed at the IBS. That is, once an incoming cell is detected as a multicast cell, it will be immediately duplicated to generate as many copies as needed. It is worth mentioning that the original multicast cell has to be discarded after all the necessary copies are made, otherwise there will be one extra copy of this cell!
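Replication-at-Receiving can be sketched as a simple expansion of the cell's N-bit multicast bit vector into one copy per destination. The bit-per-port encoding and the names are our own assumptions (and assume N <= 32 so the vector fits one word):

```c
#include <stdint.h>

/* RAR sketch: expand the multicast bit vector into one copy per set
 * bit, recording the destination port of each copy.  The caller then
 * discards the original multicast cell, as required above. */
static int rar_replicate(uint32_t mcast_vector, int n, int dest_ports[]) {
    int copies = 0;
    for (int p = 0; p < n; p++)
        if (mcast_vector & (1u << p))
            dest_ports[copies++] = p;   /* one copy queued for port p */
    return copies;
}
```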
The copies will then be put into the input buffering system (according to an N-bit multicast bit vector) as if they were unicast cells, except that the multicast cells have higher priority to be stored in the LIQs and OQs. That is, if a LIQ is full when a unicast cell is arriving, the arriving cell will be deleted. On the other hand, if the arriving cell is detected as a copy of a multicast cell, then the last unicast cell in the queue is deleted instead. Every cell behind the removed one will be shifted one slot forward in the LIQ (note that all the cells, if there are any, behind the one being removed should be multicast cells by now), and the incoming multicast cell copy will take the cell slot at the end of the queue. However, there is one exception: in case the whole LIQ is already occupied by multicast cells, the incoming multicast cell will be discarded. A similar process also applies at the output buffering system when a copy of a multicast cell is switched to a fully occupied output queue.

4.5. The Output Buffering System

The Output Buffering System (OBS) consists of N output modules (Fig. 4.7).

Fig. 4.7: An Output Module

Each output module (OM) has M (as a reminder, M equals the value of the DOD) inputs and one output. Inside the module there are two logical queues, one specifically for ABR or UBR cell storage (ABR queue), and the other specifically for CBR and VBR cell storage (non-ABR queue). Within each time slot, an OM may unload at most one cell. The non-ABR queue will always be unloaded with higher priority. The ABR queue gets unloaded only when the non-ABR queue is empty.

Similar to the input modules, in case a queue is full when a unicast cell is arriving, the arriving cell will be deleted.
If the arriving cell is detected as a copy of a multicast cell, then the last unicast cell in the queue will be deleted instead, and all the cells behind it will be shifted one slot forward. The multicast cell copy will then take the cell slot at the end of the queue.

4.6. The Algorithm of the Switching Process

In this section we provide an overview of the algorithm employed by our proposed switch. Since the switching network is simply a dilated banyan switch, which is well described in [35], we will only provide the algorithms applied in our IBS and OBS. The following describes how an input module functions in one time slot:

    While a cell is arriving {
        If the designated LIQ is full {
            If the arriving cell is a multicast cell {
                If there is any unicast cell in the LIQ {
                    Delete the last unicast cell;
                    Store the arriving cell;
                }
                Else (there is no unicast cell in the queue) {
                    Delete the arriving cell;
                }
            }
            Else (the arriving cell is not a multicast cell) {
                Delete the arriving cell;
            }
        }
        Else (the designated LIQ is not full) {
            Store the arriving cell;
        }
    }

    While switching {
        If the designated non-ABR LIQ is empty {
            If the corresponding ABR LIQ is empty {
                For (k = 0; (k < G) and ((the set of LIQs is still empty) or (the set has been active in this time slot)); k++) {
                    Retry with the LIQ set within the next group;
                    If its non-ABR LIQ or ABR LIQ is not empty {
                        Forward the HOL cell to the switching network;
                    }
                }
            }
            Else (the corresponding ABR LIQ is not empty) {
                The ABR LIQ will forward a HOL cell to the switching network;
            }
        }
        Else (the designated non-ABR LIQ is not empty) {
            The non-ABR queue will forward a HOL cell to the switching network;
        }
    }

And the following algorithm shows the work done in an OM in one time slot:

    While a cell is being switched from the switching network {
        If the designated OQ is full {
            If the switched cell is a multicast cell {
                If there is any unicast cell in the OQ {
                    Delete the last unicast cell;
                    Store the switched cell;
                }
                Else (there is no unicast cell in the OQ) {
                    Delete the incoming (switched) cell;
                }
            }
            Else (the switched cell is not a multicast cell) {
                Delete the switched cell;
            }
        }
        Else (the designated OQ is not full) {
            Store the switched cell;
        }
    }

    For (OQ # = 1; OQ # <= Switch Size; OQ #++) {
        If the non-ABR OQ is empty {
            If the corresponding ABR OQ is not empty {
                The ABR OQ will send out a cell;
            }
        }
        Else {
            The non-ABR OQ will send out a cell;
        }
    }

Note that all these algorithms are carried out by the MPA. We have detailed the structure and the operation algorithms of the proposed switch. The most noteworthy point of this switch is its simplicity and efficiency in implementing virtual output queuing. It is simple because it involves only virtual output queuing and dilated Banyan technologies, which are known to be simple and inexpensive to build [5, 6, 17, 18, 19, 23]. It is efficient because the MSRR scheduling scheme can fully utilize all the buffer resources fairly in every time slot. In the next chapter, we will present the simulation results under uniform and bursty traffic, reflecting the high performance of the switch architecture described in this chapter.

Chapter 5: The Simulation and the Results

We have explicated our proposed switch in chapter 4. In this chapter we will first explain how we created the simulation environment and ran the simulations, and then we will provide simulation data obtained from the proposed switch. In order to prove the robustness of our proposed switch, we will also compare those data with well-known switching technologies under exactly the same environment.

5.1. The Simulation

The simulation program is written in C. All the tasks that have to be done in one time slot are synchronized and processed simultaneously. Uniform and bursty traffics are run separately.
For the uniform traffic, the probability that the next state is active (a cell is arriving in the next time slot) is p, while the probability that the next state is idle (no cell is arriving in the next time slot) is 1-p. p is known as the offered load. For the bursty traffic, the Markov Modulated Bernoulli Process (MMBP) [37, 38, 39, 40] is employed to produce the data stream.

Fig. 5.1: Markov Modulated Bernoulli Process (MMBP)

Fig. 5.1 illustrates the MMBP. In Fig. 5.1, assuming the state in the current time slot is active: the probability that the next time slot will stay active is p, and the probability that the next time slot will change to idle is 1-p. On the other hand, if the current state is idle: the probability that the next time slot will stay idle is q, and the probability that the next time slot will change to active is 1-q. The offered load in this case is calculated as follows:

    ρ = (1 - q) / (2 - p - q)    (5-1)

[40]. Each simulation is completed with five million time slots. Each time slot may contain 0 to N (the size of the switch) cells.

Cells are generated during the simulations. There can be zero or one incoming cell per link within a time slot, and, therefore, there can be at most N incoming cells per time slot that the switch has to handle. Each cell generated carries the following fields:

1. Multicast indicator: 1 indicates a multicast cell, 0 otherwise;
2. ABR indicator: 1 indicates an ABR or UBR cell, 0 indicates a CBR or VBR cell;
3. Designated output port number;
4. Cell delay (within the switch).

All cells require exactly the same amount of time to be serviced. The field "cell delay" is used to keep track of the number of time slots a cell has stayed in the switch (from entering the input buffering system to leaving the output buffering system). These fields can actually be viewed as the ATM cell header fields.
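The cell record and the MMBP source described above can be sketched in C, in keeping with the thesis' choice of simulation language. The struct layout, the names, and the use of rand() are our own assumptions rather than the thesis' actual code; the offered-load function matches Eq. (5-1):

```c
#include <stdlib.h>

/* Simulated cell, carrying the four fields listed above. */
typedef struct {
    int multicast;   /* 1 = multicast cell */
    int abr;         /* 1 = ABR/UBR cell, 0 = CBR/VBR cell */
    int out_port;    /* designated output port number */
    int delay;       /* time slots spent inside the switch so far */
} cell_t;

/* Two-state MMBP source: p = P(active stays active),
 * q = P(idle stays idle).  Returns 1 if a cell arrives next slot. */
typedef struct { int active; double p, q; } mmbp_t;

static int mmbp_next(mmbp_t *s) {
    double u = (double)rand() / RAND_MAX;
    s->active = s->active ? (u < s->p) : (u >= s->q);
    return s->active;
}

/* Long-run offered load of the source, Eq. (5-1). */
static double mmbp_offered_load(double p, double q) {
    return (1.0 - q) / (2.0 - p - q);
}
```

Setting p = q gives an offered load of 0.5 regardless of burst length; larger p with fixed load makes the bursts longer.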
However, for the sake of simulation, these fields are much simpler than the ones defined in the ATM standard (described in Chapter 1).

Throughout this chapter, the performance of our proposed switch as well as of other existing switching systems will be presented by means of simulation results. The following are the major performance measures we will take from the simulations:

    Cell Loss Ratio (CLR) = Number of Cells Discarded / Number of Cells Input

    Throughput = Number of Cells Switched Successfully / Number of Cells Input to the Switch

    Normalized Throughput = Throughput / Offered Load

    Average Cell Delay = the average number of time slots a cell has spent in the switch (from the moment it enters the input buffering system to the moment it leaves the output buffering system)

Since the CLR and delay requirements are different for ABR traffic (ABR and UBR services) and non-ABR traffic (VBR and CBR services), the CLR and average delay for ABR traffic as well as for non-ABR traffic will be calculated separately in order to give more insight into the performance difference between the two classes.

5.2. Simulation Results Obtained From Referenced Models

Before we demonstrate the simulation results obtained by our proposed switch, we would like to perform some basic simulations as comparisons to the published results. This will assure that the simulation results obtained for our switch are reliable and trustworthy.

5.2.1. The Basic FIFO Input Buffering Scheme

This is the simplest form of input buffering scheme. The switch with N input ports consists of N input FIFO buffers. An incoming cell enters a queue at the designated input port accordingly. It is queued until it gets to the head-of-line position, and is then switched in the very next time slot if the output port it is heading to is not blocked.
However, as this queuing scheme suffers from the well-known Head-Of-Line (HOL) blocking effect, there is a great possibility that the head-of-line cells will be blocked. As a result, the throughput is always limited to about 58% no matter how large the buffers are. Fig. 5.2 shows the HOL blocking effect under an offered load of 1.0.

(Footnote: As mentioned in chapter one, UBR cells share the same buffers and switching priority with ABR cells, but have higher priority to be dropped. Both services are collectively classified as the "ABR traffic class", while CBR and VBR services are classified as the "non-ABR traffic class". The non-ABR traffic class has higher priority to be switched.)

Fig. 5.2: Throughput for basic input buffered switch (throughput stays near 0.58 for input buffer sizes from 50 to 450 cells)

5.2.2. The Hitibas Scheme with Grouping Factor

We have described the architecture of the Hitibas switch in depth in chapter 3. Fig. 5.3 shows a plot of Mean Switching Delay against Applied Load simulated by our program (with N = 64, uniform traffic). The data appear to be identical to the same plot shown in [16].

Fig. 5.3: Average Cell Delay using Hitibas Architecture (G = Group size; curves for G = 1, 2, 4 and 8 over applied loads 0.1 to 1)

Both Fig. 5.2 and Fig. 5.3 prove that the results produced by our simulation program are very close to the expected ones. Other data were also obtained to show that the results created by the simulation program match the referenced results from many references. Thus, we are confident that our simulation program produces reliable and accurate outcomes.

In the following sections we will provide simulation data obtained from our proposed switch.
Different parameters will be used to demonstrate the high performance achieved by our switch. First of all, sections 5.3 and 5.4 will explain the reasons we considered when we chose the values of certain parameters, namely the degree of dilation and the group size, in our proposed switch.

5.3. Effect of Degree of Dilation

We have mentioned in Chapter 4 that a DOD of 2 provides the most efficient throughput result in our proposed switch. In order to confirm this finding, Fig. 5.4 through Fig. 5.6 illustrate the effect of dilation on the switch in terms of CLR, throughput and delay. Note that a DOD of 1 means that there is no dilation applied.

The following simulations are based on these parameters:

Input Buffer Size (Cells/LIQ): 400
Output Buffer Size (Cells/OQ): 800
Offered Load: 0.99
Grouping Factor: 4
Traffic Type: Bursty

Figure 5.4: Overall CLR of the Proposed Switch under Bursty Traffic

Figure 5.5: The Throughput of the Proposed Switch

Under bursty traffic, a DOD of 2 shows a very obvious gain in performance over the non-dilated switch in all aspects. It is easy to observe that a greater DOD can provide further gain in performance. However, a DOD larger than 2 will require much more complex and faster operations within the dilated banyan switching network.

As mentioned in chapter 4 (also in appendix A), with a DOD of 2 the IBS only requires tracing and examining a number of queues equal to G (the group size) in an input module when a given scheduled queue happens to be empty. Suppose that we are employing a group size of 4. If we also employ a DOD of 4, up to 4 queues from 4 different groups are allowed to access the same output port.
As a result, the IBS would require an algorithm capable of tracing up to 4 groups of queues (16 queues in total) in one time slot, even though this is unnecessary most of the time. Moreover, the capability of tracing 4 groups within one time slot requires extremely fast memory operations, and a higher DOD (e.g. 8) requires even higher memory speed. Therefore, we consider a DOD of 2 the most efficient choice once the cost/performance trade-off is taken into account.

5.4. Effect of Group Size

The last parameter that we take into account when designing the switch is the group size. Since grouping has already been described in chapter 3, we will not describe it further here. The following figures show the simulation results obtained when different group sizes are applied to our proposed switch. Note that a group size of 1 is the case where no grouping is applied, and a group size of 32 is the case where only one group of shared buffers is provided.

The following simulations are based on these parameters: Input Buffer Size: 400 cells/LIQ; Output Buffer Size: 800 cells/OQ; Degree of Dilation: 2; Offered Load: 0.99; Traffic Type: Bursty.

Fig. 5.7: Overall CLR of the Proposed Switch under Bursty Traffic

Fig. 5.8: Throughput of the Proposed Switch

Fig. 5.9: Average Delay of the Proposed Switch

The figures indicate that a greater group size always provides better switch performance in terms of CLR, throughput and cell delay. However, a greater grouping factor also requires a faster memory architecture in order to be capable of switching G cells within one time slot, where G is the grouping factor.
As a result, the system cost increases with G. From Fig. 5.7 through Fig. 5.9 we find that a group size of 4 already provides almost optimal performance; a greater group size does not seem to justify the higher cost. Therefore, we decided to employ a group size of 4 in our proposed switch.

5.5. Simulation Results from Our Proposed Switch

Since a good switch should perform well under any kind of traffic pattern, the capability of a switch to handle all traffic patterns flawlessly is critical to our design. Therefore, we have generated different input traffic patterns, through the traffic generator routine, to confirm that our switch is suitable for most, if not all, real-time traffic conditions. In fact, the simulation program is very flexible with regard to the input traffic. For instance, the input traffic can be single- or multi-class, uniform or bursty, unicast or multicast. Among all these input traffic patterns, we will first look into the results with unicast traffic; the results for multicast traffic will be discussed in the later part of this section. Similarly, single-class (with no ABR or UBR services) and multi-class (with ABR and UBR services) traffics are simulated separately so that we can investigate the impact of prioritized traffic. Also, in order to obtain more realistic results, we will focus the discussion on bursty traffic, and provide results obtained under uniform traffic for comparison purposes only.

5.5.1. Single-class Traffic

In single-class traffic, all cells are generated equally and have the same priority. Therefore, the input traffic is not differentiated between ABR and non-ABR classes. The proposed switch model described in chapter 4 can then be simplified by reducing the number of logical queues per set in the input modules from 2 to 1 (Fig. 4.3), and the number of logical queues in the output modules likewise from 2 to 1 (Fig. 4.6).
The simplified single-class input module and output module are shown below (Fig. 5.10 and Fig. 5.11).

Fig. 5.10: Simplified Single-Class Input Module (a group of 4 input links (G = 4) from the source feeds 32 single-class logical queues, which are drained, under control signals from the arbitrator, by a group of 4 links (G = 4) to the switch)

Fig. 5.11: Simplified Single-Class Output Module (cells from the switching network enter a single-class logical queue and are sent to the output routes under control signals from the arbitrator)

5.5.1.1. Effect of Offered Load under Uniform Traffic

With uniform single-class traffic, our proposed switch achieves a normalized throughput of more than 99% at any offered load (Fig. 5.12). In the proposed switch, HOL blocking is eliminated by having a logical input queue (LIQ) at each input module reserved for each output port. In addition, the average queue length of the LIQs is greatly reduced by employing output port dilation (degree of dilation = 2). This bandwidth dilation tremendously improves the efficiency of buffer usage in the input buffering system, because multiple cells (two in our case) can now be switched to an output port simultaneously. This allows the switch to eliminate many wasted cycles: when a scheduled LIQ is empty, the switch will "look for" another non-empty LIQ in the same input module and perform the switching without the risk of colliding with another cell. Consequently, we obtain a much lower CLR with the proposed switch than with the Hitibas (Fig. 5.13). Note that in one time slot each input port still sends at most one cell to the switch; therefore the total number of cells switched from the input ports to the output ports remains equal to or less than N. Also, due to this improvement in efficiency, the average cell delay (Fig. 5.14) is largely reduced compared with the Hitibas switch.
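The wasted-cycle elimination described above can be sketched in a few lines. This is our own simplified illustration, not the simulator's code, and the function and variable names are ours: when the scheduled LIQ is empty, the input module falls back to another non-empty LIQ, while the dilated network accepts at most DOD = 2 cells per output port per time slot.

```python
from collections import deque

DOD = 2  # degree of dilation: at most 2 cells per output port per slot


def select_cell(liqs, scheduled_port, port_load):
    """Pick one cell to switch from one input module in this time slot.

    liqs: list of deques, one LIQ per output port.
    scheduled_port: the output port assigned to this module by the scheduler.
    port_load: dict counting cells already accepted by each output port this
    slot (shared across modules). Returns the chosen output port, or None.
    (The real scheme limits the fallback search to one group of G queues;
    here we simply scan the remaining LIQs for brevity.)"""
    candidates = [scheduled_port] + [
        p for p in range(len(liqs)) if p != scheduled_port
    ]
    for port in candidates:
        if liqs[port] and port_load.get(port, 0) < DOD:
            liqs[port].popleft()  # switch one cell toward this output port
            port_load[port] = port_load.get(port, 0) + 1
            return port
    return None  # wasted slot: every usable LIQ is empty or port is saturated
```

Each input module still contributes at most one cell per slot; the dilation only relaxes how many cells one output port may accept.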
Please note that the proposed switch and the Hitibas employ the same total amount of buffering in the simulations, for comparison purposes. The following simulations are based on these parameters: Input Buffer Size: 50 cells/LIQ; Output Buffer Size: 100 cells/OQ; Incoming Traffic Type: Uniform.

Fig. 5.12: Comparison of the Throughput between the Proposed Switch and the Hitibas with Similar Total Buffer Size

Fig. 5.13: Comparison of the Average CLR between the Proposed Switch and the Hitibas with Similar Total Buffer Size

Fig. 5.14: Comparison of the Average Cell Delay between the Proposed Switch and the Hitibas with Similar Total Buffer Size

5.5.1.2. Effect of Offered Load under Bursty Traffic

The bursty traffic is created using the MMBP (Fig. 5.1) with p = 0.9 and q calculated by:

q = (1 - ρ(2 - p)) / (1 - ρ)    (5.2)

In equation (5.2), ρ refers to the offered load. The average burst length is kept constant and equal to 10 (burst length = 1/(1 - p)). Note that q cannot be negative; therefore the offered load ρ must stay below about 0.9 when p = 0.9. Under these circumstances, our proposed switch achieves a throughput of more than 99% for any offered load (Fig. 5.15). Fig. 5.16 depicts the difference in CLR between the proposed switch and the Hitibas under bursty traffic. Our proposed switch performs well above the level that the Hitibas achieves, especially for offered loads under 0.7, where the CLR of our proposed switch is practically zero. Finally, the proposed switch also performs better in terms of average cell delay (Fig. 5.17).
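This two-state bursty source can be sketched as follows (our own illustration, not the thesis's simulator code; the names are ours): in the active state the source emits one cell per slot and stays active with probability p, while in the idle state it stays idle with probability q, computed from the offered load as in equation (5.2). With ρ = p = 0.9 this gives q = 0.1, the value quoted in section 5.5.1.4.

```python
import random


def mmbp_source(p, load, num_slots, seed=0):
    """Generate num_slots slot occupancies (1 = cell, 0 = empty) from a
    two-state on/off source. p is the probability of staying in the active
    state (mean burst length 1/(1 - p)); q, the probability of staying
    idle, is chosen per equation (5.2) so the long-run fraction of active
    slots equals the offered load."""
    q = (1.0 - load * (2.0 - p)) / (1.0 - load)
    assert 0.0 <= q <= 1.0, "offered load too high for this burst parameter"
    rng = random.Random(seed)
    active = False
    slots = []
    for _ in range(num_slots):
        if active:
            active = rng.random() < p    # remain inside the burst
        else:
            active = rng.random() >= q   # leave the idle state
        slots.append(1 if active else 0)
    return slots
```

Averaging the generated slots over a long run recovers the offered load, which is a convenient sanity check on the q formula.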
The following simulations are based on these parameters: Input Buffer Size: 400 cells/LIQ; Output Buffer Size: 400 cells/OQ; Incoming Traffic Type: Bursty.

Fig. 5.15: Comparison of the Throughput between the Proposed Switch and the Hitibas with Similar Total Buffer Size

Fig. 5.16: Comparison of the Average CLR between the Proposed Switch and the Hitibas with Similar Total Buffer Size

Fig. 5.17: Comparison of the Average Cell Delay between the Proposed Switch and the Hitibas with Similar Total Buffer Size

As we mentioned earlier, a larger input buffer reduces the CLR of a system and hence improves its performance. In fact, in our proposed switch the CLR can even be reduced to zero, under both uniform and bursty traffic, if the input buffer size is large enough; that is, the CLR is bounded. The following section demonstrates the effect of input buffer size on the performance of the proposed switch. For comparison purposes, we will plot the data against the total buffer size of the switch. Since the output buffer size in our proposed switch is kept constant, any increase in total buffer size reflects an increase in input buffer size.

5.5.1.3. Effect of Buffer Size under Uniform Traffic

Since we would like to investigate the worst-case performance of our proposed switch, the simulations in this section are based on an offered load of 0.99. Note that our proposed switch requires output buffers at the output ports, while the Hitibas does not.
For the sake of comparison between the proposed switch and the Hitibas, the buffer size in our figures reads as the total buffer size (that is, the sum of input buffers and output buffers) employed in the switch. The output buffer size is fixed at a capacity of 400 cells per output queue. For the Hitibas, the numbers indicate only the total input buffer size.

Fig. 5.18 demonstrates the throughput of the proposed switch under uniform traffic. The throughput reaches its peak at a total buffer size of about 720 cells per port, which is approximately a capacity of 40 cells per LIQ. At that point, the throughput is about 99%.

The following simulations are based on these parameters: Offered Load: 0.99 (under uniform traffic, the offered load is not limited by equation (5.2)); Output Buffer Size: 400 cells/OQ; Incoming Traffic Type: Uniform.

Fig. 5.18: Comparison of the Normalized Throughput between the Proposed Switch and the Hitibas with Similar Total Buffer Size

Fig. 5.19: Comparison of the Average CLR between the Proposed Switch and the Hitibas with Similar Total Buffer Size

Fig. 5.20: Comparison of the Average Cell Delay between the Proposed Switch and the Hitibas with Similar Total Buffer Size

Fig. 5.19 is a plot of the CLR against the increase in input buffer size. With our proposed switch, the CLR is bounded and is virtually zero once the buffer size reaches 2000 cells per port (approximately 200 cells per logical input queue).
Also seen from the figure is the fact that the Hitibas switch requires a much larger buffer size to converge.

Fig. 5.20 shows an interesting plot of average cell delay against buffer size. At small buffer sizes, the Hitibas switch performs better than our proposed switch in terms of cell delay, but this situation reverses at a total buffer size of 1680 cells (about 160 cells per logical input queue). This can be explained as follows. When the total buffer size is small, the influence of the output buffers far outweighs that of the input buffers. As a result, with a buffer size smaller than 1680 cells, our proposed switch effectively "acts like" an output buffered switch. Recall that the size of the output buffers is kept constant and we only increase the input buffers. As the total buffer size grows beyond 1680 cells, the influence of the input buffers starts to outweigh that of the output buffers, and our proposed switch becomes effectively an input buffered switch. It is also noteworthy that the average delay of our switch remains practically flat against buffer size beyond this same turning point, while the delay of the Hitibas switch is still increasing.

From Fig. 5.18 through 5.20, we can conclude that with a total buffer size of about 2000 cells (200 cells per logical input queue), our proposed switch can achieve a throughput of 99%, zero CLR and bounded average delay.

5.5.1.4. Effect of Buffer Size under Bursty Traffic

In this simulation, the bursty traffic is produced in the same way as in section 5.5.1.2. Again, p is equal to 0.9. We fixed the offered load at the very heavy level of 0.9 (for q to remain positive in equation (5.2), the maximum value the offered load can take is 0.9). Therefore, q is a constant in this case, equal to 0.1 according to equation (5.2). The following simulations are based on these parameters: Offered Load: 0.9; Output Buffer Size: 400 cells/OQ.
Incoming Traffic Type: Bursty.

Fig. 5.21: Comparison of the Normalized Throughput of the Proposed Switch and the Hitibas with Similar Total Buffer Size

Fig. 5.22: Comparison of the Average CLR of the Proposed Switch and the Hitibas with Similar Total Buffer Size

Fig. 5.23: Comparison of the Average Cell Delay between the Proposed Switch and the Hitibas with Similar Total Buffer Size

Fig. 5.21 through 5.23 are plots similar to Fig. 5.18 through Fig. 5.20; the only difference is that these plots are obtained from bursty traffic simulations. Although performance under bursty traffic is generally worse than under uniform traffic, Fig. 5.21 through Fig. 5.23 unveil a similar performance gain of our proposed switch over the Hitibas.

Fig. 5.21 shows that our proposed switch reaches more than 99% normalized throughput with a total buffer size of about 1360 cells (120 cells per logical input queue). For the Hitibas, the maximum normalized throughput flattens out at about 95% with 5200 cells (600 cells per logical input queue) of total buffer size. In Fig. 5.22, we can see that at small input buffer sizes both switches have a substantially high CLR, even though output buffers are added to our proposed switch. However, as the input buffer size grows, the performance gap between our proposed switch and the Hitibas widens.
In addition, at a total buffer size of 6800 cells, the CLR of our proposed switch is reduced to zero, while the CLR of the Hitibas is 0.01. Finally, Fig. 5.23 demonstrates that the delay of our proposed switch is bounded at about 270 time slots with a total buffer size of approximately 4240 cells (480 cells per logical input queue).

To conclude this section: even under bursty traffic, our proposed switch can reach a performance of 99% normalized throughput, zero CLR and bounded delay with a total buffer size of about 6800 cells per port (800 cells per logical input queue).

5.5.2. 2-priority-class Traffic

Single-class traffic has been the standard model used in the simulations of much networking research. However, as Available Bit Rate (ABR) and Unspecified Bit Rate (UBR) services and multicast data are emerging as a major trend in real-time traffic, especially in ATM, it is necessary to consider those classes of traffic in real-time simulations. In fact, with some minor changes, our switch is capable of delivering multiple-class traffic with a performance equal to or even better than that of the single-class traffic model. These minor changes are:

i. A low switching priority buffer is added to every logical input queue at the input modules to store incoming ABR or UBR cells before they can be switched. We denote this buffer the ABR LIQ. As mentioned in chapter 4, a non-ABR LIQ (for VBR and CBR cells) and an ABR LIQ (for ABR and UBR cells) that are heading to the same output port together form a "LIQ set" in a given input module.

ii. A low switching priority buffer is added to every output queue at the output modules to store outgoing ABR and UBR cells before they are sent out. We denote this buffer the ABR OQ.

iii.
The MPA has to perform an extra algorithm to arbitrate schedules for the LIQs and OQs, so that the cells can be switched or sent out according to the scheduling rules mentioned in chapter 4.

In the following subsection, we will investigate the 2-class (ABR and non-ABR) traffic first. In the subsection after that, we will look at the 2-class multicast traffic.

5.5.2.1. Effect of Offered Load under Uniform 2-priority-class Traffic

Since the ABR traffic class is still considered a minor part of today's real-time traffic, we assume in our simulations that the ABR traffic class takes a 0.3 proportion of the overall traffic load, while the non-ABR traffic class consumes a 0.7 proportion. The ratio of ABR cells to non-ABR cells is therefore 3 : 7. (As a reminder, the ABR traffic class consists of ABR and UBR cell traffic, and the non-ABR traffic class consists of VBR and CBR cell traffic.) Before we go on with the 2-class traffic analysis, it is necessary to define the following terms in order to avoid any possible confusion:

Throughput_ABR = (Number of ABR Cells Switched Successfully) / (Number of ABR Cells Input to the Switch)

Throughput_Non-ABR = (Number of Non-ABR Cells Switched Successfully) / (Number of Non-ABR Cells Input to the Switch)

Overall Throughput = Throughput_ABR + Throughput_Non-ABR

Overall Normalized Throughput = Overall Throughput / Offered Load

CLR_ABR = (Number of ABR Cells Discarded) / (Number of Cells Input)

CLR_Non-ABR = (Number of Non-ABR Cells Discarded) / (Number of Cells Input)

(Average Cell Delay)_ABR = the average number of time slots an ABR cell spends in the switch

(Average Cell Delay)_Non-ABR = the average number of time slots a non-ABR cell spends in the switch

Recall that non-ABR cells have higher priority than ABR cells to be switched (and sent out).
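This service rule for a LIQ set can be illustrated with a few lines (our sketch, not the MPA's actual arbitration algorithm; the names are ours):

```python
from collections import deque


def serve_liq_set(non_abr_q, abr_q):
    """Serve one LIQ set: a non-ABR (CBR/VBR) queue and an ABR (ABR/UBR)
    queue heading to the same output port. The ABR queue is served only
    when the non-ABR queue is empty. Returns the cell served, or None."""
    if non_abr_q:
        return non_abr_q.popleft()  # non-ABR traffic has switching priority
    if abr_q:
        return abr_q.popleft()      # ABR/UBR cell uses the otherwise idle slot
    return None
```

The same rule governs an OQ pair at the output modules: the ABR OQ transmits only when the corresponding non-ABR OQ has nothing to send.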
In other words, an HOL ABR cell only gets switched at the input buffering system (or sent out from the output buffering system) when the corresponding non-ABR LIQ (or OQ) is empty. Since ABR buffers are added to both the input buffering system and the output buffering system, the total buffer size is enlarged. As a result, we found that both CLR_Non-ABR and CLR_ABR are 0 at any load. At an offered load of 0.99, the overall throughput achieved by our switch is more than 99.9% (Fig. 5.24). As ABR cells have a lower priority for being switched (and sent out) than non-ABR cells, the delay for ABR traffic (Fig. 5.25) is considerably larger than for non-ABR traffic. Especially at high loads, the non-ABR LIQs and non-ABR OQs are filled with cells most of the time, so there is little chance for an ABR cell to be switched in the input buffering system or sent out from the output buffering system. This is acceptable for ABR traffic, because it is supposed to be delay insensitive by nature. However, short delays are still important for non-ABR traffic, and prioritization allows our switch to achieve this goal at any load (Fig. 5.25).

The following simulations are based on these parameters: Input Buffer Size: 50 cells per non-ABR LIQ, 100 cells per ABR LIQ; Output Buffer Size: 100 cells per non-ABR OQ, 200 cells per ABR OQ; Incoming Traffic Type: Uniform 2-Class.

Fig. 5.24: Effect of Offered Load on Normalized Throughput

Fig. 5.25: Effect of Offered Load on Average Cell Delay

Now we will look at how the switch performs in a bursty environment for 2-class traffic.
5.5.2.2. Effect of Offered Load under Bursty 2-priority-class Traffic

Under bursty traffic, equation (5.2) needs to be considered. In order to apply (5.2) to the 2-class traffic condition, the following terms need to be defined:

ρ = the total offered load

ρ_ABR = the offered load for ABR traffic = 0.3 ρ

ρ_Non-ABR = the offered load for non-ABR traffic = 0.7 ρ

Since the ABR and non-ABR traffics are supposed to be independent of each other, equation (5.2) is applied to the ABR and non-ABR traffics separately:

q_ABR = (1 - ρ_ABR(2 - p_ABR)) / (1 - ρ_ABR)    (5.3)

q_Non-ABR = (1 - ρ_Non-ABR(2 - p_Non-ABR)) / (1 - ρ_Non-ABR)    (5.4)

where p_ABR = p_Non-ABR = p = 0.9.

Due to the requirement of very low CLR for ABR traffic, we have substantially enlarged the buffer sizes of the ABR LIQs and ABR OQs in our switch when bursty traffic is simulated. With the help of output port dilation, and the presence of the low priority ABR LIQs and OQs in our switch, wasted time slots are virtually eliminated. Our switch achieves an overall normalized throughput of 98.5% at an overall load of 0.99 (note that in the 2-class bursty case the overall offered load can exceed 0.9, because the q values given by equations (5.3) and (5.4) remain positive at an offered load of 0.99) (Fig. 5.26). The high throughput and the enlarged buffers also contribute to a virtually zero CLR_Non-ABR and zero CLR_ABR at loads below 0.5 and 0.8 respectively (Fig. 5.27). Although ABR cells have a lower priority to be transmitted than non-ABR cells, CLR_ABR peaks at only about 0.0004 at a very high load (0.99), comparable to the peak CLR_Non-ABR of 0.006 at the same load. The average cell delays for both ABR and non-ABR traffics are shown in Fig. 5.28.
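As a quick numerical check of equations (5.3) and (5.4) (a sketch; the function name is ours), both q values are indeed still positive at an overall offered load of 0.99:

```python
def idle_stay_prob(load, p=0.9):
    """q of equation (5.2), applied per traffic class as in (5.3)/(5.4)."""
    return (1.0 - load * (2.0 - p)) / (1.0 - load)


overall_load = 0.99
q_abr = idle_stay_prob(0.3 * overall_load)      # ABR class carries 0.3 of the load
q_non_abr = idle_stay_prob(0.7 * overall_load)  # non-ABR class carries 0.7
# both remain positive, so an overall load of 0.99 is admissible here
```

The same function reproduces the single-class case of section 5.5.1.4, where a load of 0.9 with p = 0.9 gives q = 0.1.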
The following simulations are based on these parameters: Input Buffer Size: 100 cells per non-ABR LIQ, 200 cells per ABR LIQ; Output Buffer Size: 200 cells per non-ABR OQ, 400 cells per ABR OQ; Incoming Traffic Type: Bursty 2-Class.

Fig. 5.26: Effect of Offered Load on Normalized Throughput

Fig. 5.27: Effect of Offered Load on ABR and Non-ABR CLR

Fig. 5.28: Effect of Offered Load on ABR and Non-ABR Average Cell Delay

To conclude this subsection on simulated 2-class traffic, we calculate how much buffering (in bytes) is required to achieve the above results under bursty traffic. Fig. 5.26 through Fig. 5.28 are simulated with the following configuration: (1) input non-ABR buffer size = 100 cells/LIQ; (2) input ABR buffer size = 200 cells/LIQ; (3) output non-ABR buffer size = 200 cells/OQ; (4) output ABR buffer size = 400 cells/OQ; (5) G = 4.

With a switch size of 32 and a group size of 4, there are in total 8 input modules. Each input module has 32 non-ABR LIQs and 32 ABR LIQs. Each non-ABR LIQ can store up to 100 cells, and each ABR LIQ can store up to 200 cells simultaneously. Therefore each input module has an amount of buffer that holds up to 32*100 + 32*200 = 9,600 cells. The total buffer size in the input buffering system is then 9,600*8 = 76,800 cells, which translates to about 4.1 Megabytes.
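The input-side arithmetic above can be checked in a few lines (a sketch; we assume the standard 53-byte ATM cell as the storage unit, which reproduces the roughly 4.1 MB figure):

```python
CELL_BYTES = 53  # standard ATM cell size (assumed storage unit per cell)

switch_size, group_size = 32, 4
modules = switch_size // group_size           # 8 input modules
liqs_per_module = 32                          # one LIQ set per output port
non_abr_cells, abr_cells = 100, 200           # capacity per LIQ, in cells

cells_per_module = liqs_per_module * (non_abr_cells + abr_cells)  # 9,600 cells
ibs_cells = modules * cells_per_module        # 76,800 cells in the whole IBS
ibs_megabytes = ibs_cells * CELL_BYTES / 1e6  # about 4.07 MB, i.e. roughly 4.1 MB
```

The output-side budget discussed next follows the same pattern, with 32 OQ pairs of 200 and 400 cells each.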
In the OBS, there are in total 32 non-ABR OQs and 32 ABR OQs. Each non-ABR OQ can take up to 200 cells, and each ABR OQ can take up to 400 cells simultaneously. A similar calculation finds that a buffer size of about 1.07 Megabytes is enough for the OBS. Therefore, the total buffer size required for our switch is about 5.17 Megabytes. The cost for this amount of buffering is very low with today's technology, even for very high-speed buffer types.

5.5.3. 2-priority-class Multicast Traffic

Multicast traffic has obviously been getting more attention recently, and the capability to handle it is essential in today's switch designs. With this in mind, and with just one change to our original design -- integrating the copying capability into each input module -- our switch becomes a multicast capable device. The replication-at-receiving (RAS) method is used to copy incoming multicast cells at the IBS. That is, a cell is duplicated into as many copies as needed when it arrives at the IBS. The average number of copies made for each incoming multicast cell is the Degree of Multicast (DOM). In today's applications, the DOM ranges from 2 up to the full size of the switch (broadcasting). According to numerous simulation results, a DOM of 6.5 is found to be most reflective of the MMBP model applied in our simulations; a DOM of 6.5 is therefore chosen in order to create significant multicast traffic.

The multicast traffic is also divided into an ABR portion and a non-ABR portion. The ABR portion refers to the multicast traffic among the ABR cells, and the non-ABR portion refers to the multicast traffic among the non-ABR cells. The multicast portion is 0.1 of the offered load in our simulations, while the ABR to non-ABR ratio is still 3 to 7.

It is noteworthy that the maximum offered load differs from the actual load when the DOM is greater than unity.
The reason is that a single incoming multicast cell can bring many cells into the input buffering system. For example, with a DOM of 6.5, on average every multicast cell brings 6.5 cells into the IBS. Therefore, the overall input load is calculated as follows:

Actual Load = Offered Load for Unicast Traffic + Offered Load for Multicast Traffic
            = (1 - Multicast Portion) * Offered Load + DOM * Multicast Portion * Offered Load
            = (1 - 0.1) * Offered Load + 6.5 * 0.1 * Offered Load
            = 1.55 * Offered Load    (5.5)

Therefore, the actual load is 1.55 times the offered load when the DOM is 6.5. The highest possible offered load in our simulations is then 0.6 (when the actual load is 0.93) instead of 1.0 (with a 0.1 increment in offered loads, the next step would be 0.7, for which the actual load would already be 1.085).

Let us now look at the simulation results reflecting the multicast capability of our switch.

5.5.3.1. Effect of Offered Load under Bursty 2-priority-class Multicast Traffic

Recall that when an incoming cell is recognized in an input module as a multicast cell, it is duplicated into a number of copies equal to the number of output ports it is addressed to. The cell copies are then placed in the appropriate LIQs corresponding to the output port numbers in the cell header. As mentioned in chapter four, the original cell is discarded after all the copies are made (otherwise there would be one extra cell). After this point, all the multicast cells are treated like unicast cells, except that they have higher priority to be stored in the LIQs (or OQs). Simply put, all multicast cells are queued, switched, and sent just like unicast cells, except when some LIQs (or OQs) are full. Therefore, we should expect results similar to those of subsection 5.5.2.2 (2-class unicast bursty traffic) at low loads, where the LIQs or OQs are less likely to be full.
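Equation (5.5), together with the 0.7-load example in the footnote above, can be reproduced directly (a sketch; the function name is ours):

```python
def actual_load(offered, multicast_portion=0.1, dom=6.5):
    """Actual input load after replication-at-receiving (equation 5.5):
    the unicast portion enters the IBS unchanged, while each multicast
    arrival expands into DOM copies on average."""
    return (1 - multicast_portion) * offered + dom * multicast_portion * offered
```

With the 0.1 multicast portion and DOM of 6.5 used here, the multiplier is 1.55: an offered load of 0.6 gives an actual load of 0.93, while 0.7 would already give 1.085.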
However, with the additional multicast traffic, the CLRs for unicast cells are substantially higher at heavy loads (Fig. 5.29). This is because unicast cells are removed from fully occupied LIQs or OQs when multicast cells arrive. The CLRs are especially high at offered loads above 0.5 (actual load = 0.775). At high actual loads, the overall input load contains a considerable amount of multicast cells (note that at an actual load of 0.775, the multicast load is 0.5*0.1*6.5 = 0.325 after duplication). This large amount of multicast cells needs a lot of buffer space, and thus causes many unicast cell losses. As mentioned in the previous subsection, the actual load (when multicast traffic is present) is 1.55 times the offered load; the actual load is therefore 0.93 at an offered load of 0.6. In our simulations, a normalized throughput of more than 95% is obtained at a 0.6 offered load (Fig. 5.30). Fig. 5.31 illustrates the average cell delays obtained for the unicast and multicast traffics.

The following simulations are based on these parameters: Input Buffer Size: 100 cells per non-ABR LIQ, 400 cells per ABR LIQ; Output Buffer Size: 200 cells per non-ABR OQ, 800 cells per ABR OQ; Incoming Traffic Type: Bursty 2-Class Multicast.

Fig. 5.29: Effect of Offered Load on Unicast ABR, Unicast Non-ABR, Multicast ABR and Multicast Non-ABR CLR

Fig. 5.30: Effect of Offered Load on Normalized Throughput

Fig. 5.31: Effect of Offered Load on Unicast ABR, Unicast Non-ABR, Multicast ABR and Multicast Non-ABR Average Cell Delay

Although the performance of the switch degrades substantially under bursty multicast traffic at high actual loads, it handles the bursty multicast traffic well at offered loads under 0.4. Since a low CLR is critical for ABR cells, more buffering can improve (decrease) the CLR for the ABR class of traffic. However, this also increases the average cell delay incurred by multicast ABR cells (because the unicast cells and multicast cells share the same buffers), which might not be acceptable in all multicast applications.

In this section, we have shown that our dilation-enhanced input-output buffered banyan switch meets the requirements of high throughput and low CLR under any kind of traffic. The switch reliably provides a normalized throughput that is constantly higher than 95% even under bursty traffic, and a CLR that is bounded and can even reach zero if a large buffer size is available. When the ABR traffic class is present, the switch is able to handle 2-priority-class traffic, with the ABR traffic class being the lower priority one. Since low CLR is essential to the ABR traffic class, the switch is designed to allow higher cell delay in exchange for low CLR for ABR and UBR cells. On the other hand, since low cell delay is essential for the non-ABR traffic class, the switch is designed to allow a larger CLR in exchange for low cell delay for CBR and VBR cells.
In addition, multicast traffic is also handled very well by our switch, again showing more than 95% normalized throughput consistently, and a very low CLR.

Chapter 6: Conclusion

6.1. Summary

In this thesis, we have presented a high performance ATM switch that achieves very high throughput, a low cell loss rate and bounded delay with finite buffer size. Simulations show that these performance results are obtained under all the major kinds of cell traffic patterns on which most research is based: uniform, bursty, single-class and multi-class, as well as unicast and multicast traffic patterns. We therefore believe that this switch is practical in today's networking industry.

Input buffering is employed as the buffering system to manage excessive incoming cells, while output buffers are added to enhance the throughput. Two existing technologies - Virtual Output Queuing with Grouping, and Skew Round Robin (SRR) Scheduling - are extensively applied in the switch design. These two technologies have been proven robust under the uniform traffic pattern in many switch studies [16, 41, 42]. However, both technologies fall short when the applied traffic pattern becomes more realistic; that is, when the traffic pattern becomes bursty and involves many different types of cells, such as multi-class and multicast cells.

To overcome the shortfalls that surface when a realistic traffic pattern is applied, we have made some modifications to the above technologies:

1. Incoming cells are differentiated by their bandwidth usage (i.e. CBR, VBR, ABR and UBR).
2. The incoming cells are then prioritized by their delay and loss sensitivity. For example, ABR and UBR cells are more loss sensitive than CBR and VBR cells, while CBR and VBR cells are less tolerant of delay.
3. A second logical queue is added to each virtual output queue. This second logical queue serves as a lower delay priority queue in which ABR and UBR cells are stored.
4.
The SRR scheme is modified so that when the scheduled queue is empty, another queue from the same input module can take over the time slot and switch a cell (please refer to section 4.4 for more detail on the modified SRR scheme). Time slots can then be better utilized.
5. Since the modified SRR (MSRR) scheme allows at most two cells to be switched to the same output port at the same time, output buffers (also prioritized) are added to store the extra cell when necessary.

Group size and degree of dilation are the two major parameters that determine the performance of the proposed switch. From our analysis we found that a group size of 4 and a degree of dilation of 2 are the most cost effective, and these parameter values are therefore used in our proposed switch. A larger group size and a higher degree of dilation may further enhance the performance, but the high cost associated with them (faster memory and a more complex switching infrastructure) may not justify the small performance gain.

6.2. Major Results

The uniform traffic is modeled on the principle that the probability of having an active time slot is simply the offered load. The bursty traffic model is more complex and is based on the Markov Modulated Bernoulli Process (MMBP) (refer to section 5.1 for more details).

Under single-class uniform traffic with an offered load of 0.99, our proposed switch achieves a normalized throughput of more than 99% (Fig. 5.12), a CLR of less than 10^-4 (Fig. 5.13) and an average delay of less than 50 time slots (Fig. 5.14).

Under bursty traffic with an offered load of 0.9 (recall that for q to remain positive in equation 5.2, the maximum value of the offered load is 0.9), a normalized throughput of almost unity (Fig. 5.15), a CLR of less than 10^-2 (Fig. 5.16) and an average delay of less than 250 time slots (Fig. 5.17) are obtained.

In the 2-class bursty traffic model, ABR cells (i.e. ABR and UBR) and non-ABR (i.e.
CBR and VBR) cells are assumed to be independent of each other, although they share the bandwidth in a ratio of 3:7. In addition, ABR cells are considered lower priority for transmission (less sensitive to delay), and they are only switched when no non-ABR cells are competing for the time slot. However, ABR cells are more sensitive to cell loss, so substantially larger buffers are reserved for them. As a result, our proposed switch achieves an overall (combined) throughput of more than 99.9% (Fig. 5.24), a very low delay for non-ABR cells (Fig. 5.25), and zero CLR for both the ABR and non-ABR classes.

The most complex traffic model considered in this thesis is the 2-class multicast simulation. Under this model, the traffic contains a noticeable amount of multicast cells. Each multicast cell is headed to more than one output port and needs to be multiplied when it arrives at the switch. We assume that a 0.1 portion of the input traffic consists of multicast cells. From our analysis of the simulation results, the degree of multicast (DOM) is found to be 6.5; that is, on average each multicast cell is copied 6.5 times when it enters the switch. As a result, the actual load on the switch is 1.55 times the offered load (equation 5.3). Also, multicast cells are considered more sensitive to cell loss than unicast cells. Therefore, our proposed switch handles multicast cells by giving them higher priority for storage: in the case that a queue is full when a multicast cell arrives, the last unicast cell is removed, all the cells behind it (either none, or all multicast) are pushed one slot forward, and the incoming multicast cell is then placed at the end of the queue. With our proposed switch, we achieve a CLR of 1*10^-4 for ABR and 1*10^-3 for non-ABR multicast cells, at an offered load of 0.6 (actual load of 0.93). The ABR and non-ABR CLRs for unicast cells are approximately 1.2*10^-2 and 1.7*10^-2 respectively (Fig. 5.29).
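The multicast storage policy just described (drop the last unicast cell to make room for an arriving multicast cell) can be sketched as follows. This is a hypothetical illustration, not the switch's actual buffer code: cells are represented as strings tagged "mcast"/"ucast", and the queue as a plain Python list.

```python
def enqueue_multicast(queue, cell, capacity):
    """Sketch of multicast-priority storage: if the queue is full, the
    last unicast cell is removed, any multicast cells behind it move one
    slot forward, and the arriving multicast cell joins the tail.
    Returns the dropped cell (the arriving multicast cell itself is lost
    only when the full queue holds no unicast cell at all)."""
    if len(queue) < capacity:
        queue.append(cell)
        return None
    # Find the last unicast cell, scanning from the tail.
    for i in range(len(queue) - 1, -1, -1):
        if not queue[i].startswith("mcast"):
            dropped = queue[i]
            del queue[i]            # cells behind it shift one slot forward
            queue.append(cell)      # multicast cell takes the freed tail slot
            return dropped
    return cell                     # queue is all-multicast: arriving cell lost

queue = ["ucast-1", "mcast-1", "ucast-2", "mcast-2"]
dropped = enqueue_multicast(queue, "mcast-3", capacity=4)
print(dropped)  # -> ucast-2
print(queue)    # -> ['ucast-1', 'mcast-1', 'mcast-2', 'mcast-3']
```

This is why the unicast CLRs quoted above rise so sharply once multicast traffic is heavy: every full-queue multicast arrival evicts a stored unicast cell.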
At the same load, the average cell delay is approximately 100 time slots for multicast cells and 1000 time slots for unicast cells (Fig. 5.31). We also found that the overall throughput is about 0.955, again at an actual load of 0.93 (Fig. 5.30).

Finally, we also showed that the delay of the proposed switch is bounded with finite buffer size, under both uniform and bursty traffic (Figs. 5.20 and 5.23).

6.3. Future Work

1. The simulations of two-class traffic in this thesis are based on a constant ratio between ABR and non-ABR traffic. Simulations with a varying ratio should be undertaken so that more realistic results may be obtained.
2. Similarly, the multicast traffic simulations in this thesis assume a constant proportion of multicast traffic. Simulations with varying proportions should be undertaken in order to obtain more realistic results.
3. Since we wanted to focus on the performance of 2-class traffic, ABR and UBR traffic, as well as CBR and VBR traffic, are grouped into the same classes in this thesis. In future studies, we should separate all four classes of traffic and investigate their performance individually.
4. Although simulations already show that our MSRR scheduling scheme is much more robust than the SRR scheme, more detailed analysis should be done to determine the difference in performance between the two schemes.
5. All simulations are based on the assumption that the incoming traffic is synchronized. More studies on asynchronous incoming traffic should be undertaken.

References

[1] M. Karol, M. Hluchyj and S. Morgan, "Input Versus Output Queueing on a Space Division Switch," IEEE Trans. Comm., 35(12) (1987), pp. 1347-1356.

[2] M. Karol, K. Eng and H. Obara, "Improving the Performance of Input-Queued ATM Packet Switches," INFOCOM '92, pp. 110-115.

[3] M. G. Hluchyj and M. J. Karol, "Queueing in High-Performance Packet Switching," IEEE Journal on Selected Areas in Communications, Vol. 6, No. 9, December 1988, pp. 1587-1597.
[4] Y. Oie, T. Suda, M. Murata and H. Miyahara, "Survey of the Performance of Nonblocking Switches with FIFO Input Buffers," ICC '90, Vol. 2, pp. 737-741.

[5] H. S. Kim, I. Widjaja and A. Leon-Garcia, "Performance of Output-Buffered Banyan Networks with Arbitrary Buffer Sizes," INFOCOM '91, pp. 701-710.

[6] B. R. Collier and H. S. Kim, "An Output Queueing Batcher-Banyan ATM Switch Architecture," MILCOM '93, pp. 303-307.

[7] M. Prycker, Asynchronous Transfer Mode: Solution for Broadband ISDN, Prentice Hall, 1995.

[8] T. M. Chen, S. S. Liu and V. K. Samalam, "The Available Bit Rate Service for Data in ATM Networks," IEEE Communications Magazine, Vol. 34, No. 5, May 1996, pp. 56-71.

[9] F. Bonomi and K. W. Fendick, "The Rate-Based Flow Control Framework for the Available Bit Rate ATM Service," IEEE Network Magazine, Vol. 9, No. 2, March 1995, pp. 25-39.

[10] R. Jain, S. Kalyanaraman, S. Fahmy and R. Goyal, "Source Behavior for ATM ABR Traffic Management: An Explanation," IEEE Communications Magazine, Vol. 34, No. 11, November 1996, pp. 50-55.

[11] T. T. Lee, "Nonblocking Copy Network for Multicast Packet Switching," IEEE Journal on Selected Areas in Communications, Vol. 6, No. 9, December 1988, pp. 1455-1467.

[12] F. M. Chiussi, Y. Xia and V. P. Kumar, "Performance of Shared-Memory Switches under Multicast Bursty Traffic," IEEE Journal on Selected Areas in Communications, Vol. 15, No. 3, April 1997, pp. 473-487.

[13] B. Prabhakar, N. McKeown and R. Ahuja, "Multicast Scheduling for Input-Queued Switches," IEEE Journal on Selected Areas in Communications, Vol. 15, No. 5, June 1997, pp. 855-866.

[14] L. Ngoh, H. Li and H. Pung, "A Direct ATM Multicast Service with Quality-of-Service Guarantees," Proceedings of Multimedia '96, September 1996, pp. 54-61.

[15] H. Duan, J. W. Lockwood, S. M. Kang and J. D.
Will, "A High-Performance OC-12/OC-48 Queue Design Prototype for Input-Buffered ATM Switches," Proceedings of IEEE INFOCOM '97, Vol. 1, 1997, pp. 20-28.

[16] J. Garcia-Haro, F. Fan and A. Jajszczyk, "Hitibas - A High-Throughput Input-Buffered ATM Switch," IEEE Singapore International Conference on Networks, September 1993, pp. 359-363.

[17] J. Garcia-Haro, J. Malgosa-Sanahuja and J. Melus-Moreno, "Multicasting Facilities in ATM Switching Architectures: A Study of Several Approaches," Proceedings of the 1995 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, February 1995, pp. 90-95.

[18] A. Mekkittikul and N. McKeown, "A Practical Scheduling Algorithm to Achieve 100% Throughput in Input-Queued Switches," Proceedings of IEEE INFOCOM '98, Vol. 2, 1998, pp. 792-799.

[19] C. Kolias and L. Kleinrock, "Throughput Analysis of Multiple Input-Queueing in ATM Switches," Proceedings of the International IFIP-IEEE Conference on Broadband Communications, 1996, pp. 382-393.

[20] I. Fituri and H. T. Mouftah, "The IBSN Switch: A New High Performance ATM Switch," 1994 Canadian Conference on Electrical and Computer Engineering, pp. 316-319.

[21] W. J. Goralski, Introduction to ATM Networking, McGraw-Hill International Editions, 1996.

[22] S. Gu, "Buffering Schemes for ATM Switches," IEEE TENCON '93, Beijing, pp. 194-197.

[23] C. Partridge, Gigabit Networking, Addison-Wesley, 1993.

[24] G. Thomas, "Multi-Channel Input-Queueing for High Throughput Switches," Electronics Letters, Vol. 33, No. 3, January 1997, pp. 184-185.

[25] G. Thomas, "Bifurcated Queueing for Throughput Enhancement in Input-Queued Switches," IEEE Communications Letters, Vol. 1, No. 2, March 1997, pp. 56-57.

[26] D. Bertsekas and R. Gallager, Data Networks, Second Edition, Prentice Hall, 1991.

[27] M. Karol and M.
Hluchyj, "Knockout Packet Switch: Principles and Performance," Proceedings of the 12th Conference on Local Computer Networks, October 1987, pp. 16-22.

[28] Y. S. Yeh, M. Hluchyj and A. Acampora, "The Knockout Switch: A Simple Modular Architecture for High Performance Packet Switching," IEEE Journal on Selected Areas in Communications, Vol. 5, No. 8, October 1987, pp. 1274-1283.

[29] A. Huang and S. Knauer, "Starlite: A Wideband Digital Switch," Proceedings of GLOBECOM '84, Atlanta, GA, December 1984, pp. 121-125.

[30] S. C. Liew, "Performance of Input-Buffered and Output-Buffered ATM Switches Under Bursty Traffic: Simulation Study," Proceedings of GLOBECOM '90, San Diego, CA, December 1990, pp. 1919-1925.

[31] R. Y. Awdeh and H. T. Mouftah, "A Fair and Simple Mechanism for Contention Resolution in Batcher-Banyan Networks," CCECE/CCGEI '93, March 1993, pp. 775-778.

[32] P. S. Lau and A. Leon-Garcia, "Design and Performance Analysis of a Buffer Subsystem for the Batcher-Banyan Switch," GLOBECOM '90: IEEE Global Telecommunications Conference and Exhibition, Vol. 3, pp. 1926-1930.

[33] N. McKeown, M. Izzard, A. Mekkittikul, W. Ellersick and M. Horowitz, "Tiny Tera: A Packet Switch Core," IEEE Micro, Vol. 17, No. 1, Jan./Feb. 1997, pp. 26-33.

[34] M. Alimuddin, H. M. Alnuweiri and R. W. Donaldson, "A Unified Design Approach for Dilated Banyan Switches," Proceedings of the 1995 IEEE International Conference on Communications, Vol. 2, June 1995, pp. 1132-1136.

[35] M. Alimuddin, H. M. Alnuweiri and R. W. Donaldson, "The Fat Banyan ATM Switch," IEEE INFOCOM, April 1995, pp. 659-666.

[36] M. Alimuddin, H. M. Alnuweiri and R. W. Donaldson, "Performance of the Fat-Banyan Switch under Non-Uniform Traffic," IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, May 1995, pp. 82-85.

[37] C. Ng, L. Bai and Z.
Liren, "Queue Length Solutions for an ATM Buffer with MMBP Arrivals," Computer Communications, Vol. 20, No. 10, September 1997, pp. 878-883.

[38] C. Ng, L. Bai and B. H. Soong, "Modeling Multimedia Traffic over ATM using MMBP," IEE Proceedings: Communications, Vol. 144, No. 5, October 1997, pp. 307-310.

[39] Z. Niu and F. Kubota, "Bursty Transmission Scheme for Wireless ATM and its Analysis," International Conference on Communication Technology Proceedings (ICCT), Vol. 1, 1996, pp. 276-280.

[40] L. Jianxin, L. Lemin and S. Hairong, "The Influence of Burstiness and Correlation of Traffic on an ATM Multiplexer," International Conference on Communication Technology Proceedings (ICCT), Vol. 1, 1996, pp. 20-23.

[41] Y. Tamir and G. Frazier, "High Performance Multi-Queue Buffers for VLSI Communication Switches," Proc. of the 15th Annual International Symposium on Computer Architecture, June 1988, pp. 343-354.

[42] N. McKeown, V. Anantharam and J. Walrand, "Achieving 100% Throughput in an Input-Queued Switch," Proceedings of IEEE INFOCOM '96, Vol. 3, March 1996, pp. 296-302.

[43] Shang-Tse Chuang, Ashish Goel, Nick McKeown and Balaji Prabhakar, "Matching Output Queueing with a Combined Input/Output-Queued Switch," IEEE Journal on Selected Areas in Communications, Vol. 17, No. 6, June 1999, pp. 1030-1039.

[44] Anthony C. Kam and Kai-Yeung Siu, "Linear-Complexity Algorithm for QoS Support in Input-Queued Switches with No Speedup," IEEE Journal on Selected Areas in Communications, Vol. 17, No. 6, June 1999, pp. 1040-1056.

Appendix A: A Situation Where the Number of Retrials is Allowed to Exceed G in MSRR

One of the major algorithms developed for our proposed switch is the modified skew round robin (MSRR) scheduling scheme. MSRR greatly improves the switching efficiency by eliminating most of the empty time slots at the input ports.
By allowing the input algorithm to retry with another queue when the scheduled queue is empty, many of the "would have been wasted" time slots can be utilized. One limitation of the MSRR scheme is that the maximum number of retrials is limited to (M-1)*G, where M is the degree of dilation and G is the group size. As a result, in our proposed switch (which has a DOD of 2), the algorithm only retries with the queues within the group next to the one currently being unloaded. We now explain this constraint and demonstrate what happens if it is not in place.

For the sake of illustration, we use a simplified version of our proposed switch in this example (Fig. A.1). It is exactly the same switch as our proposed one, except that: 1) the scale of this simplified version is reduced to 6*6; and 2) the group size of the simplified version is reduced to 2.

In Fig. A.1, the SRR scheme schedules group #1 of module #1, group #2 of module #2 and group #3 of module #3 to forward cells to the switching network (SN). However, queue #1 in group #1 of module #1 is empty. The algorithm will then retry to forward a cell from group #2. Unfortunately, all the queues in group #2 are empty as well. If the retrial does not stop here, as our MSRR algorithm constrains it to, it will keep retrying with queues in other groups. Finally, in group #3 of module #1, the algorithm finds a HOL cell and forwards it to the SN. Note that with our MSRR scheme, module #1 would send no cell to the SN, because the retrial stops when it finds all the queues in group #2 empty.

Similarly, queue #1 in group #2 of module #2 is also empty. The algorithm retries the switching with group #3 of module #2, finds a HOL cell in queue #1, and forwards it to the SN.

As seen from the figure, both cells that the algorithm is switching are heading to output port #5.
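The bounded retrial rule described above can be sketched as follows; this is a hypothetical illustration (the function name and data layout are our own) of restricting MSRR retrials to the scheduled group plus the next one, rather than scanning every group in the module.

```python
def select_cell_msrr(groups, scheduled_group, group_size):
    """Sketch of bounded MSRR retrials: if the scheduled group offers no
    HOL cell, retry only within the *next* group (at most (M-1)*G = G
    extra queues for a DOD of M = 2), never the whole module.

    `groups` is a list of groups, each a list of per-queue HOL cells
    (None when the queue is empty)."""
    n = len(groups)
    # Try the scheduled group first, then exactly one neighbouring group.
    for offset in range(2):                     # 2 groups: scheduled + next
        group = groups[(scheduled_group + offset) % n]
        for hol_cell in group[:group_size]:
            if hol_cell is not None:
                return hol_cell
    return None   # both groups empty: the module forwards no cell this slot

# Module #1 from Fig. A.1: groups #1 and #2 empty, a HOL cell only in group #3.
groups = [[None, None], [None, None], ["cell-A", None]]
print(select_cell_msrr(groups, scheduled_group=0, group_size=2))  # -> None
```

With the groups of module #1 in the example, the bounded search returns no cell, which is exactly the behaviour the MSRR constraint dictates.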
Two cells heading to the same output port is fine, thanks to the output port dilation (with a DOD of 2). The problem appears when queue #1 in group #3 of module #3 also sends a cell to output port #5 (recall that this queue is supposed to forward a cell according to the scheduling scheme). A collision occurs because now 3 cells are heading to the same output port, which can accept at most two cells in the same time slot.

This collision can be prevented by increasing the DOD at the output ports (so that each output port can accept more cells), or by limiting the maximum number of retrials to the group size G (so that module #1 would have no cell to forward in this example).

After all, it is relatively rare for a whole group of buffers to be empty in a particular time slot. Increasing the DOD at the output ports requires a more complex infrastructure in the SN, but introduces only minimal or no gain in performance. Limiting the maximum number of retrials seems to be the more cost-effective way.

[Figure omitted: three input modules (IM #1 to #3), each containing queue groups #1 to #3 with added buffers, feeding the switching network]

Fig.
A.1: Collision occurs when the number of retrials in MSRR is allowed to exceed G (G=2)

Appendix B: List of Acronyms

3DQ  Three Dimensional Queuing
AAL  ATM Adaptation Layer
ABR  Available Bit Rate
ATM  Asynchronous Transfer Mode
B-ISDN  Broadband Integrated Services Digital Network
BMS  Bitonic Merge Sort
BS  Bitonic Sort
CBR  Constant Bit Rate
CCITT  Consultative Committee for International Telephone and Telegraph
CLP  Cell Loss Priority
CLR  Cell Loss Rate
DBS  Dilated Banyan Switch
DOD  Degree of Dilation
DOM  Degree of Multicast
FAB  Fat Banyan
FIFO  First-in-first-out
GFC  General Flow Control
HEC  Header Error Control
HOL  Head Of Line
IBS  Input Buffering System
IM  Input Module
iPOINT  Illinois Pulsar-based Optical Interconnect
ITU  International Telecommunication Union
LIQ  Logical Input Queue
MMBP  Markov Modulated Bernoulli Process
MSRR  Modified Skew Round Robin
MUCS  Matrix Unit Cell Scheduler
NNI  Network-Network Interface
OBS  Output Buffering System
OM  Output Module
OQ  Output Queue
PTI  Payload Type Indicator
QoS  Quality of Service
RAM  Random Access Memory
RAR  Replication-at-Receiving
SAR  Segmentation and Reassembly
SE  Switching Element
SN  Switching Network
SNS  Switching Network System
SRAM  Static Random Access Memory
SRR  Skew Round Robin
STM  Synchronous Transfer Mode
UBR  Unspecified Bit Rate
UNI  User-Network Interface
VBR  Variable Bit Rate
VC  Virtual Channel
VCC  Virtual Channel Connection
VCI  Virtual Channel Identifier
VP  Virtual Path
VPC  Virtual Path Connection
VPI  Virtual Path Identifier
WBA  Weight Based Algorithm

Cite

Citation Scheme:

        

Citations by CSL (citeproc-js)

Usage Statistics

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0065174/manifest

Comment

Related Items