UBC Theses and Dissertations
A VLSI receiver architecture for high speed ATM computer networks (Chow, Jeffrey, 1994)

A VLSI Receiver Architecture for High Speed ATM Computer Networks

By

Jeffrey Chow

B.A.Sc. Electrical Eng., University of British Columbia, Vancouver, Canada, 1991.

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE STUDIES, Department of ELECTRICAL ENGINEERING

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
October, 1994
© Jeffrey Chow, 1994

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Electrical Engineering
The University of British Columbia
Vancouver, Canada

Abstract

Network communication bandwidths are surpassing the computational power of host systems. This improvement in communications bandwidth creates a need to develop dedicated communications processors which service the network without serious degradation of throughput. Asynchronous Transfer Mode (ATM) is a new Broadband Integrated Services Digital Network (BISDN) protocol specified to operate over fibre lines at rates of 155 Mbps or 622 Mbps. As the high-speed ATM networking protocol gains widespread acceptance in the networking community, dedicated high-speed protocol processing hardware is required to process ATM's small 53 byte cells. In ATM networks, the problem of reassembly of cells at network speeds on the receiver side is of particular significance. The Adaptation Layer is responsible for cell reassembly.
In this thesis, a VLSI architecture for ATM receiver adaptation layer processing is presented. A fine grained parallel architecture is employed in our receiver design. Improvements over other ATM receiver designs have been made in our design. Potential bottlenecks in existing designs have been identified and an effort is made to overcome these problems. Problems such as parallel CRC computation, fast Virtual Circuit Identifier (VCI) mapping, and effective memory architectures are resolved. The proposed design is shown to operate at high throughputs ranging from 264.46 Mbps to well over 622 Mbps for sustained network data. The design is modelled at two levels so that low level timing details can be determined and high level traffic behavior can be monitored.

Table of Contents

Abstract
List of Figures
List of Tables
Acknowledgment
1 Introduction
  1.1 Background
  1.2 Objectives
  1.3 Related Work
  1.4 Outline
2 Description of Asynchronous Transfer Mode
  2.1 Introduction
  2.2 Adaptation Layer Functions
3 Architectural Models
  3.1 Introduction
  3.2 Single Uniprocessor Approach
  3.3 Coarse Grained Parallel Multiprocessor Design
  3.4 Fine Grained Parallel Multiprocessor Design
4 Proposed Receiver Architecture
  4.1 Introduction
  4.2 Architectural Overview
  4.3 Memory Data Structures
    4.3.1 Cell Buffer and Free Cell Buffer List
    4.3.2 Frame Buffer and Free Frame List
    4.3.3 Global State RAM
    4.3.4 Inter Processor Communication (IPC) Queues
  4.4 Communication Structure
    4.4.1 Receiver Initialization
    4.4.2 Rejecting Unmatched VCI Cells
    4.4.3 Open a Connection
    4.4.4 Receive Frames
    4.4.5 Closing VCI Connection
  4.5 Hardware Overview
    4.5.1 The Incoming Cell Processor (ICP)
    4.5.2 Interleaved Cell Buffer RAM
    4.5.3 The Header Processing Unit (HPU)
    4.5.4 FIFO Control Unit (FCU) and Memory Transfer Controller (MTC)
    4.5.5 Video RAM Frame Buffer
    4.5.6 CRC Engine
5 Simulation Results
  5.1 Simulation Methodology
    5.1.1 The Hardware Model
    5.1.2 The Systems Model
  5.2 Performance Results
6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work
References
Appendix A List of Acronyms

List of Figures

Figure 1.3.1 Traw's Receiver Design [Tra91]
Figure 1.3.2 Traw's Link List Manager [Tra91]
Figure 1.3.3 Hill's AAL3/4 Receiver Design [Hil92]
Figure 2.1.1 BISDN Protocol Stack using ATM [CCI93]
Figure 2.1.2 BISDN Protocol Reference Model using ATM [CCI93]
Figure 2.1.3 Comparison of Protocol Complexity in terms of Layer Implementation [Pry91]: (a) X.25 Packet Switched, (b) Frame Relay, (c) ATM Cell Relay
Figure 2.2.4 ATM AAL5 Cell Format
Figure 2.2.5 Frame encapsulation at Convergence Sublayer
Figure 2.2.6 Flow Diagram of AAL-5 Receiver Adaptation Layer Functions
Figure 3.3.1 A Coarse Grain Parallel Systems Architecture for an Adaptation Layer Network Interface Receiver
Figure 4.2.1 Proposed ATM Receiver Interface
Figure 4.3.2 Global State RAM Configuration
Figure 4.3.3 VCI Hash Table Lookup Mechanism
Figure 4.3.4 Example of Per VCI Structure and Associated Buffers
Figure 4.3.5 HEADQ Message Format
Figure 4.3.6 SFQ Message Format
Figure 4.3.7 FFQ Command Format
Figure 4.3.8 HOSTQ Command Formats
Figure 4.3.9 MSGQ Message Formats
Figure 4.4.10 Receiver System Initialization
Figure 4.4.11 Global State Manipulations Required to Open a Connection
Figure 4.4.12 Example of Frame Reception Showing Internal State
Figure 4.4.13 Internal State After Each SEQ Command
Figure 4.5.14 Incoming Cell Processor and Interleaved Cell Buffer RAM
Figure 4.5.15 ICP Responsibilities during Word Arrival Slots
Figure 4.5.16 HPU Hardware Functional Diagram
Figure 4.5.17 Hardware VCI Hash Function Example
Figure 4.5.18 FCU/MTC Hardware Functional Diagram
Figure 4.5.19 Video RAM Frame Buffer Interface
Figure 4.5.20 Algorithm Showing Frame Transfers to Video RAM via SAM
Figure 4.5.21 CRC Computation for ATM CRC-32
Figure 4.5.22 Parallel CRC XOR Array Computations
Figure 4.5.23 Logic Equations to Perform CRC-32 16 bits at a Time
Figure 5.1.1 Structure of Simulation Methodology
Figure 5.2.2 Receiver Architecture Simulation Results
Figure 5.2.3 ATM Throughput vs Frame Size
Figure 5.2.4 ATM FCU Shared Bus Utilization vs Frame Size
Figure 5.2.5 ATM HPU Shared Bus Utilization vs Frame Size
Figure 5.2.6 Host System Bus Utilization vs Frame Size

List of Tables

Table 2.1.1 ATM compared to other Protocols [Bat93]
Table 3.1.1 Various Inter-arrival Times for ATM Data at Different Bit Rates
Table 4.3.1 Description of State Data Structure Elements

Acknowledgment

I would like to thank my supervisor, Dr. Mabo Ito, for originally suggesting this thesis topic and for his guidance and support during this work. His patience throughout made my experiences as a graduate student filled with wonderful memories. Many thanks to Dr. Carl Seger for his kindness and wisdom thoughtfully inserted during my struggles to get the work done. Thanks also to Dr. Mark Greenstreet for the time he spent working with me during the initial portions of the design work. Special thanks to Mark McCutcheon for the many long and insightful hours spent discussing all aspects of ATM design architectures. The many sanity breaks spent discussing life, the universe, and everything, provided by Chong Ong helped me to keep things in their proper perspective. The rapid departure from graduate school of my good friends Peter Bonek, Robert Lam, William Low, and Barry Tsuji greatly enhanced my progress. I would also like to express my gratitude to Raymond Cheng, Kendra Cooper, John Tanlimco and the many other friends for their support.
I would especially like to thank all those true friends who cared enough to never ask, "Aren't you finished your thesis yet?" Sincere thanks to my family for their constant love and encouragement which kept me going. Warm thanks to my friends from church who supported me in prayer. I thank God for blessing me with the needed perseverance and providing me with all those who helped me. It is only through Him that all things are made possible.

1 Introduction

1.1 Background

Recent advances in fiber optics technology have resulted in large increases in available communications bandwidth. Communication bandwidths in excess of 1 Gbps are possible today. The large increase in communications bandwidths has not been met with a comparable increase in host systems' computational performance. The net effect of this performance mismatch is that host system processors can no longer directly process network packets with software. Lower layer protocol processing must be performed with dedicated hardware to maintain high communications throughput. The design of dedicated hardware for high-speed protocol processing is not trivial. The variation in network packet size combined with the bursty nature of network data introduces many complications in hardware design. Realizing the shortfalls of current protocols, the networking community moved to adopt a new Broadband Integrated Services Digital Network (BISDN) protocol which would use small fixed size cells in the network backbone. The Asynchronous Transfer Mode (ATM) protocol was adopted as a result. Although ATM standardization is still in a state of flux, network data will be exclusively carried by fixed sized 53 byte cells. The cells contain a 5 byte header which leaves 48 bytes to carry payload data.

The ATM protocol is expected to maintain high throughput for both bursty and continuous bit rate data streams. Fixed sized cells make hardware designs ideal, especially in lower layers of the protocol.
Of particular interest are hardware implementations of the ATM Adaptation Layer (AAL) on the receiver end. This layer is very close to the physical layer in the BISDN protocol stack. Its major responsibility is the reassembly of higher level frames. Receiver adaptation layer designs present a more difficult design problem than transmitter adaptation layer designs. Receivers are more difficult to design because cell processing must occur at network speeds. Lost cells place great congestion strains on ATM networks because error recovery is an end-to-end function. The high bandwidth-delay product in ATM wide area networks (WANs) produces a large number of cells in the network backbone before error recovery is initiated. This places large demands on buffering resources.

Fixed sized cells simplify receiver adaptation layer hardware designs but the higher network speed introduces new design problems. The short inter-cell arrival time of small ATM cells imposes severe constraints on possible hardware architectures. Mapping cells to available interface buffers between inter-cell arrival times at the host-network interface is non-trivial. The mapping issue, combined with the fact that the ATM cell header has a virtual circuit connection space of 65535 identifiers of which any number can be dynamically opened or closed, illustrates the scope of the design problem.

1.2 Objectives

The AAL maps higher level protocols to 53 byte ATM cells. There are currently five different AALs specified. The most suitable adaptation layers for computer networks are AAL3/4 and AAL5. This thesis will focus on the newest adaptation layer, AAL5 [T1S1].

The ATM protocol is specified to perform at 155 Mbps or 622 Mbps. Many current ATM adapter interface designs only operate at 155 Mbps and have not considered problems which occur when running at 622 Mbps. The problems arise from the increased network data rate.
Cell inter-arrival times are much shorter and require faster hardware for adaptation layer processing. The architectural decisions made for 155 Mbps designs are infeasible for higher speed 622 Mbps designs. Major modifications would be necessary when scaling up a design to operate at the higher speeds. This is especially true of receiver adaptation layer architectures. In this thesis, different receiver adaptation layer architectures are reviewed. Inherent traits of each architecture are discussed to expose potential bottlenecks which appear when network speeds are scaled up.

This thesis will:

• Identify architectural considerations for ATM AAL5 protocol processing at high speeds,
• Identify unscalable areas in current 155 Mbps receiver designs,
• Propose design alternatives to problems in current designs,
• Propose a new ATM AAL5 receiver using the proposed design alternatives,
• Determine the low level feasibility of the proposed design through modelling and simulation using the VHDL language, and
• Determine the high level feasibility of the proposed design through modelling and simulation using the SIMSCRIPT II.5 discrete event simulator.

In the future, it is envisaged that these VHDL models can be incorporated into VHDL libraries for automatic synthesis and fabrication.

1.3 Related Work

The concept of implementing layers of a communications protocol stack in hardware to enhance data throughput is not a new one. Extensive work has been performed, most notably with the seven layer Open System Interconnect (OSI) protocol stack [Tan88]. Hardware implementations for host-network interfaces continue to draw much interest in the research community. Host-network interfaces are particularly important with newer high speed (>100 Mbps) networks.
These interfaces act as communication subsystems and significantly reduce the amount of processing required by the host CPU.

Takeuchi [Tak91] looked at parallel architectures which implement up to the Presentation Layer of the OSI 7 layer model. Chesson [Che89] studied the VLSI Protocol Engine designs for the specialized Xpress Transfer Protocol (XTP). Zitterbart [Zit91] looked into high speed transport components. Kanakia and Cheriton [Kan88] developed the VMP Network Adapter Board (NAB) which implements a new transport protocol on a specially designed adapter board. Much of the previous work used specialized protocols and interfaces to implement OSI layers.

Another approach for developing high speed network adapters was to develop silicon compilers. Interfaces for different protocols can be achieved by compiling their specifications through a specially designed silicon compiler. Abu-Amara et al [Abu89] developed the PSi silicon compiler. More recently, Pei [Pei92] developed the Network Interfacing Compiler Experiment (NICE) compiler which uses specialized function blocks for Cyclic Redundancy Checking (CRC), routing tables, and semi-shared buffers. Silicon compilers are used because networking protocols are in a continual state of flux. Although silicon compilers offer elegant solutions for high speed networking interface designs, this approach trades off optimal designs for shorter VLSI development time. Pei's NICE compiler attempts to reduce this trade-off by designing in the specialized function blocks.

For ATM networks, many host-network interfaces are designed to perform the adaptation layer functions in hardware. Traw [Tra92] designed an ATM Adaptation Layer host-network interface at the University of Pennsylvania. His implementation was for AAL3/4 running at 155 Mbps on a RISC System/6000.
Cooper et al [Coo91] developed a minimal hardware ATM interface which still requires the host system to do most of the adaptation layer functions. Hill et al [Hil92] designed an ATM receiver running at 155 Mbps. Their design integrated a processor pipeline with a video RAM frame buffer. McCutcheon et al [McC92] designed a 155 Mbps interface to a multiprocessor host. McCutcheon uses an interleaved static RAM design to interface to the host bus. Davies [Dav91] designed a 622 Mbps adaptation layer interface which uses two RISC processors integrated to the host through either a Direct Memory Access (DMA) controller or dual-ported RAM. Adaptive Corporation has come out with the Fragmentation and Reassembly Device (FRED) chipset [Ada92], a commercial ATM Adaptation Layer design.

Although there are several different designs in existence for ATM host-network interfaces, none of them specifically looked at a VLSI 622 Mbps implementation. The Adaptive design is the most complete in the sense that it incorporates congestion management and rate metering. The Adaptive chipset can operate at speeds up to 181 Mbps and it also requires a control processor. Davies's [Dav91] design was the only one which would run at 622 Mbps, but it required two RISC processors and multiplexed four 155 Mbps channels. Hill et al [Hil92] came up with a nice solution for running at 155 Mbps but did not discuss design decisions and requirements for running at higher ATM speeds. Cooper et al [Coo91] have a design which minimized hardware requirements but still required the host system to do much of the processing. Traw's [Tra92] solution only runs at 155 Mbps and transfers received frames to the host memory with DMA bursts across the system bus.
Although Traw suggested a solution for running at the higher 622 Mbps ATM rate, it is not a true 622 Mbps interface because it multiplexes four 155 Mbps designs over a high speed line.

Of all the ATM host interface receiver designs which were reviewed, Traw's design and Hill's design have the greatest potential for improvements to allow them to operate at the higher 622 Mbps network rate. Traw's design uses concurrently running state machines for the Synchronous Optical Network (SONET) interface, VCI lookup, linked list management, and dual port reassembly (see Figure 1.3.1). SONET is an existing network protocol commonly used to transport ATM data.

Figure 1.3.1 Traw's Receiver Design [Tra91]

Traw's design is not expected to operate at the full non-multiplexed 622 Mbps because bottlenecks will occur in the VCI lookup controller, the link list manager, and the dual port reassembly buffer. The Content Addressable Memory (CAM) used for VCI lookup is a very slow device, having an access time of approximately 165 ns to match a 16 bit word [Amd91]. CAMs used for 622 Mbps receiver applications would use approximately 25% of the 681 ns inter-cell arrival time. The dual port reassembly buffer in Traw's design is actually a single port static RAM with associated logic to make it look like a dual port RAM. Traw implements the dual ported RAM by arbitrating accesses to the single ported RAM, where the access time of the single ported RAM is two times faster than the required access time. For example, at 155 Mbps, bytes arrive approximately every 51 ns, so the single ported RAM must have an access time of approximately 25 ns to emulate the dual ported nature.
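The timing figures quoted above follow from simple arithmetic. A minimal sketch, assuming the 155.52 Mbps and 622.08 Mbps SONET line rates (all names here are ours, for illustration only):

```python
# Back-of-the-envelope timing budget for ATM cell processing.
# Assumes SONET STS-3c / STS-12c line rates; these are illustrative
# calculations of the figures quoted in the text, not measurements.

CELL_BITS = 53 * 8  # one ATM cell: 5 byte header + 48 byte payload

def inter_cell_arrival_ns(line_rate_bps: float) -> float:
    """Time between the first bits of consecutive cells, in ns."""
    return CELL_BITS / line_rate_bps * 1e9

t155 = inter_cell_arrival_ns(155.52e6)   # roughly 2726 ns per cell
t622 = inter_cell_arrival_ns(622.08e6)   # roughly 681 ns per cell

# Fraction of the 622 Mbps cell slot consumed by a ~165 ns CAM lookup.
cam_fraction = 165 / t622                # roughly a quarter of the slot

# Byte inter-arrival at 155 Mbps, and the access time a single-ported
# RAM needs (twice as fast) to emulate a dual-ported one.
byte_155_ns = 8 / 155.52e6 * 1e9         # roughly 51 ns per byte
emulation_access_ns = byte_155_ns / 2    # roughly 25 ns
```

The same arithmetic shows why every per-cell operation in a 622 Mbps receiver must fit inside a budget of well under a microsecond.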
Ideally, constraints should be relaxed on RAM access times so that the RAM is only slightly faster than the network data inter-arrival time, not twice as fast.

Traw's choice of using DMA transfers to move frame data from the dual port reassembly buffer to the host memory also creates bottlenecks at higher network rates. Two problems result from DMA transfers. One problem is that DMA rates can only be as fast as the host system bus. Speed mismatches where the host bus is slower than network peak bandwidth are sure to create bottlenecks in the network interface. A second problem is the host bus utilization by the network interface. As more data arrives from the network at higher rates, the network interface requires more system bus accesses. This results in a reduction in host system performance, which has the compounding effect of further reducing overall system performance. A 622 Mbps network interface using DMA transfers would require a system bus to operate at approximately 78 MBytes/s. The SUN workstation bus has a transfer rate of 80 MBytes/s at 32 bits.

The linked list manager in Traw's design is also a bottleneck because its structure is too complex to maintain at higher speeds. As illustrated in Figure 1.3.2, a head pointer, tail pointer, next pointer, cell pointer and free list pointer must be maintained during list access. In the worst case, Traw's linked list manager requires 650 ns to perform an operation on a list [Tra92].

Figure 1.3.2 Traw's Link List Manager [Tra91] (pointer memory holding, per list, pointers to the first and last cells, the list length, and per-cell next pointers; cell body addresses refer to locations in the dual port reassembly buffer)

Hill takes a slightly different receiver design approach from Traw. As illustrated in Figure 1.3.3, Hill uses a 3 stage pipelined approach with no large intermediate internal buffering to get reassembly rates in excess of 100 Mbps. Hill uses a direct mapped RAM look-up table for mapping incoming VCIs. Although this mapping method is fast, it leaves a lot of costly static RAM unused, since 64K of RAM is needed for the 16 bit VCI data. Hill uses fast multiported video RAM to transfer frames from the network to the host. This is ideal because it reduces utilization of the system bus and can handle data input at network rates.

Figure 1.3.3 Hill's AAL3/4 Receiver Design [Hil92] (a three stage pipeline: an elastic buffer, memory management and header processing, and video RAM control and data transfer to the upper layer processor)

The bottleneck in Hill's design arises from the fixed time cost overhead of writing data
Hence, the net transfer speed would be less than a network interfacewhich provided fully contiguous frames to the host at speeds in excess of 100 MbpsTwo other problem areas which have not been emphasized in much of the previous workon high speed network interfaces is CRC computation and provisions to handle a framearriving immediately after one finishes for the same connection identifier. CRC computationsare difficult at high speeds because the computation has traditionally been done in softwareand many hardware implementations use slow bit serial designs. The CRC checksummust be determined as quickly as possible because it is the most computational intensiverequirement along the critical path of adaptation layer processing.Frames arriving immediately after one finishes can cause state consistency problems inthe network interface. This is a problem in many network interfaces because there is onlyone state connection record for each virtual connection. This connection record must bemaintained for arriving frames since it contains the current connection state. The problem iscreated if the state associated with the first arrived frame, such as partial CRC checksum orbuffer offsets, is erroneously used by the next new frame before the state can be updated toreflect the reception of the first frame. The connection record is usually invalidated once thecomplete frame has been transferred to the host before it can be reused for the next frame.Many high speed network interfaces use intermediate cell buffering. The error condition iscaused when the tail end of one complete frame is still in the intermediate buffer and a newframe for the same connection arrives. Traw’s design [Tra92] does not appear to handle thissituation properly. 
Hill's design [Hil92], however, does handle the problem because Hill does not use large intermediate cell buffers.

Design bottlenecks in two current ATM receiver architectures have been identified and discussed, as well as other potential problems in ATM receiver designs. A new design proposing solutions to existing bottlenecks and improvements to recent designs will be provided.

1.4 Outline

This thesis will provide an understanding of the architectural decisions necessary when designing high speed network interfaces as well as the issues involved specifically in ATM networks. The thesis accomplishes this through a discussion of solutions for potential areas of processing bottlenecks and then with an implementation of a high speed 622 Mbps ATM AAL5 receiver.

The thesis is organized as follows. Chapter 2 provides an introduction to ATM and the specifications for AAL5. It presents the concepts which gained ATM its acceptance in the networking community as the protocol for next generation high speed networks. A detailed look at the adaptation layer functions, explaining how cells are segmented and reassembled, is provided. Chapter 3 discusses the architectural considerations involved in designing high speed network interfaces. It covers different architectural models and the computational and time costs associated with network packet processing. Chapter 4 proposes a new ATM receiver design which resolves bottlenecks in current designs. A detailed discussion of processing speeds at each stage of frame reassembly is provided. Chapter 5 covers simulation results for the proposed ATM receiver. Chapter 6 presents conclusions about the work and includes some suggestions for future work.

2 Description of Asynchronous Transfer Mode

2.1 Introduction

ATM is a fast cell switching protocol which operates at 155 Mbps or 622 Mbps. ATM is a protocol designated to carry BISDN information.
BISDN provides a wide array of services for telephony, cable TV, and computer applications. Such applications require network bandwidths which range from low speed email services to high speed, high bandwidth HDTV services. Current networking protocols such as X.25 packet switching, FDDI, and Frame Relay networks cannot support the service requirements for BISDN.

BISDN can be modelled with a layered protocol stack much like the more common 7-layer OSI protocol stack. The BISDN model for the ATM protocol, illustrated in Figure 2.1.1, is defined by CCITT, the ATM standardization body. Figure 2.1.2 illustrates the functions that the three lower layers are responsible for. The physical layer is responsible for the transport of bit information, the ATM layer is responsible for switching, routing, and multiplexing, and the AAL layer is responsible for adapting higher level frame information to an ATM stream.

Figure 2.1.1 BISDN Protocol Stack using ATM [CCI93] (the AAL sits between the higher layer protocol functions of the user and control planes and the ATM layer, above the physical layer)

Figure 2.1.2 BISDN Protocol Reference Model using ATM [CCI93] (the AAL comprises the Convergence Sublayer and the Segmentation and Reassembly sublayer; the ATM layer performs generic flow control, cell header translation and extraction, VPI/VCI translation, cell multiplexing and de-multiplexing, and cell rate decoupling; the physical layer covers HEC generation and verification, cell delineation, transmission frame adaption and generation/recovery, bit timing, and the physical medium)

ATM is an ideal protocol because of its high throughput, its low delay, and its ability to support bursty multimedia traffic. ATM employs a fast packet switching Asynchronous Time Division (ATD) technique for transmitting data through a network. Its high transmission rate is
Its high transmission rate isachieved with reduced protocol complexity using a simplified packet structure called a cell.The simplifications to ATM set it apart from other networking protocols. A summary ofthe differences is given in Table 2.1.1. Several key simplifications contribute to ATM’s highperformance characteristics. These simplifications are fixed packet size, end-to-end errorcontrol, highly simplified protocol, and the use of cell relay. All these simplifications reducethe cost and complexity of intermediate switching nodes in the network backbone.2. Description of Asynchronous Transfer Mode 14Funtionality X.25 Packet Switching Frame Relay ATMTransmission Speed 300bps-64kbps 64kbps-2Mbps 1 55Mbps-622MbpsThroughput(packets/sec) Thousands Thousands MillionsPacket Size Variable Variable FixedProtocols Complex Simplified Highly SimplifiedError Control Node to Node End to end End to End & HeaderFlow Control Yes No NoRetransmission Window Yes No NoFrame Delimiter Yes Yes NoTable 2.1.1 ATM compared to other Protocols [Bat93]ATM’s cell switching transport mechanism is of particular importance in reducing theoverall complexity of the network. A comparison between different transport protocols interms of their protocol layering complexity is illustrated in Figure 2.1.3. The X.25 networkis one of the earliest networking protocols and uses a low quality transmission media. X.25networks require full error control on a link-by-link basis to ensure high quality end-to-endservice. Improvements in the transmission media reduced the number of bit errors in thenetwork and a frame relay protocol was devised. Frame relay networks only implemented aportion of layer 2 functions in its switching nodes and provided limited link-to-link error control.Full Error control is provided at an end-to-end level. Cell relay is the simplest of the threenetworking protocols. 
It takes advantage of high quality, low error fibre for its transmission media and performs full error control at the end-to-end level only.

Figure 2.1.3 Comparison of Protocol Complexity in terms of Layer Implementation [Pry91]: (a) X.25 Packet Switched, (b) Frame Relay, (c) ATM Cell Relay

2.2 Adaptation Layer Functions

The simplicity of the ATM protocol makes it feasible to incorporate different protocol layers into a hardware design. For ATM host interface designs, the adaptation layer can be implemented in hardware to reduce the amount of protocol processing which will be required by the host processor. Host processor resources would be heavily burdened if they were required to process the small ATM cells arriving at 622 Mbps.

The adaptation layer resides immediately above the physical layer and the ATM layer in the BISDN protocol stack. The main purpose of the adaptation layer is to map and convert higher layer frames to simple ATM cells. These higher level frames are constrained to a maximum 64K byte size. CCITT has defined five different AALs, aptly named AAL1 through AAL5. Different AALs are used for different types of network traffic. AAL5 is a "lightweight" AAL used primarily to transport computer data. It has a very low amount of control information overhead embedded into the user data frames. AAL5 will be the primary focus for our host interface design.

The adaptation layer is made up of two sublayers. These sublayers are the Segmentation and Reassembly (SAR) sublayer and the Convergence Sublayer (CS). The SAR sublayer segments frames to ATM cells on the transmitter side and reassembles ATM cells to frames on the receiver side.
In order to perform frame reassembly, an identifying header is extracted from ATM cells and is used to map each cell to the proper upper layer frame. As illustrated in Figure 2.2.4, each 53 byte ATM cell contains a 5 byte identifying header and a 48 byte payload. The header contains a Generic Flow Control (GFC), Virtual Path Identifier (VPI), Virtual Channel Identifier (VCI), Payload Type (PT), Cell Loss Priority (CLP), and Header Error Check (HEC). Only the VCI, HEC, and PT are used at the host end; the other fields have not been defined or are only used in intermediate nodes. The VCI is a unique 16 bit field which identifies cells belonging to the same upper level frame, the HEC is an 8 bit CRC header error check field, and the PT indicates the type of received cell.

Figure 2.2.4 ATM AAL5 Cell Format. The 53 byte cell consists of a 5 byte header holding the GFC, VPI, VCI, PT, CLP, and HEC fields, followed by the 48 byte payload.

The AAL5 CS sublayer provides error detection capabilities for received frames. Error detection uses a 32 bit CRC computed checksum embedded at the end of the received frame. As illustrated in Figure 2.2.5, the CS sublayer adds a 0-40 byte pad field and an 8 byte trailer to the user frame. A zero valued pad field aligns the CS trailer to the end of the last ATM cell transmitted for a given user frame. The CS trailer contains a 2 byte control field, a 2 byte length field, and a 4 byte CRC computed checksum. The checksum is computed over the entire CS frame. The control field is currently undefined and is coded as all zeros.
The length field contains the user frame length in bytes.

Figure 2.2.5 Frame encapsulation at Convergence Sublayer. The upper layer frame (<= 65535 bytes) is followed by a 0-40 byte pad and an 8 byte trailer; the trailer holds the 2 byte control field, the 2 byte length field, and the 4 byte CRC.

The adaptation layer at the receiver side must perform many functions during the inter-cell arrival time in order to maintain a high bandwidth. Incoming cell headers must be stripped and error-checked, each cell body must be mapped for reassembly based on its VCI, and a CRC update must be made using all cells which belong to the same frame. Concurrently, assembled frames must be passed up to the host system. These adaptation layer functions on the receiver side are illustrated in the data flow diagram of Figure 2.2.6, in which closed rectangular boxes represent I/O with external interfaces, open rectangular boxes represent data stores, and circles represent processes performed on the data.

Figure 2.2.6 Flow Diagram of AAL-5 Receiver Adaptation Layer Functions

As indicated in the flow diagram, three distinct areas of processing are identified: the front end processing, the cell reassembly, and the back end processing. The front end is responsible for taking cells off the network at network rates. It is critical that this section does not lose cells, or frame loss will occur. The cell header is split from the cell body and each portion is processed independently.

The cell reassembly portion is responsible for utilizing the header information and combining cell bodies from the same original frame.
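The header layout of Figure 2.2.4 and the CS encapsulation of Figure 2.2.5 can be sketched in software. This is a minimal illustration, not part of the thesis design: the function names are invented, the pad length is computed from the alignment rule (the trailer must end on a 48 byte cell boundary), and the CRC-32 is assumed to use the standard AAL5 generator polynomial 0x04C11DB7 with all-ones preset and complemented result, a detail the text above does not spell out.

```python
def parse_header(hdr: bytes):
    """Split a 5-byte ATM cell header into its fields (Figure 2.2.4)."""
    assert len(hdr) == 5
    gfc = hdr[0] >> 4                                      # 4 bits
    vpi = ((hdr[0] & 0x0F) << 4) | (hdr[1] >> 4)           # 8 bits
    vci = ((hdr[1] & 0x0F) << 12) | (hdr[2] << 4) | (hdr[3] >> 4)  # 16 bits
    pt = (hdr[3] >> 1) & 0x7                               # 3 bits
    clp = hdr[3] & 0x1                                     # 1 bit
    hec = hdr[4]                                           # 8 bits
    return gfc, vpi, vci, pt, clp, hec

def crc32_aal5(data: bytes) -> int:
    """Bitwise MSB-first CRC-32, poly 0x04C11DB7, preset and complement all ones."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte << 24
        for _ in range(8):
            crc = ((crc << 1) ^ 0x04C11DB7 if crc & 0x80000000
                   else crc << 1) & 0xFFFFFFFF
    return crc ^ 0xFFFFFFFF

def cs_encapsulate(frame: bytes) -> bytes:
    """Pad the frame so frame + pad + 8-byte trailer fills whole 48-byte cells."""
    pad_len = (-(len(frame) + 8)) % 48
    pdu = frame + b"\x00" * pad_len
    pdu += b"\x00\x00"                                     # control field: all zeros
    pdu += len(frame).to_bytes(2, "big")                   # length field
    return pdu + crc32_aal5(pdu).to_bytes(4, "big")        # CRC over the entire CS frame
```

The receiver's check is the mirror image: recompute the CRC over the whole CS-PDU minus the last 4 bytes and compare with the embedded checksum.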
The back end processing takes the combined frames and computes the CRC checksum before passing them up to the host memory. The host is notified on receipt of an error-free frame.

3 Architectural Models

3.1 Introduction

In this section, several different architectural models are examined to determine suitable choices for host-network ATM receiver designs. It is desirable to design host-network interfaces which are scalable in terms of preserving throughput as network rates increase. In other words, no bandwidth reducing bottlenecks should be introduced when higher speed network transmission is required from the network architecture. This constrains the architectural considerations in the final design.

Scalability is achieved by identifying tasks along the cell processing critical path and developing solutions to reduce the processing times. Tasks which must be performed at network rates to preserve throughput are:

• buffering incoming cells,
• mapping cells via VCI to their proper reassembly areas, and
• computing the CRC error check across received frames.

Each of the individual tasks is simple, but there is complexity at the high ATM network speed. Network data inter-arrival times for cells, bytes, words, and long words are summarized in Table 3.1.1 to indicate the time available for processing cells. Results from the table give indications of the speed requirements of memory. The table shows that 622 Mbps ATM networks allow 681 ns for cell processing. If network data is processed in long word (32 bit) format, only 51 ns is available between long words. Long word processing requires cell buffer memory with cycle times of 51 ns for dual ported memory or 25 ns for arbitrated single ported memory.
                          Inter-Arrival Time
Bit Rate        Cell Time   Byte Time   Word Time   Long Word Time
155.52 Mbps     2726 ns     51 ns       102 ns      205 ns
622.08 Mbps     681 ns      12 ns       25 ns       51 ns

Table 3.1.1 Various Inter-arrival Times for ATM Data at Different Bit Rates

Since CRC computation must be performed on every received data bit, it is the most processing intensive AAL5 receiver operation. Its performance is dependent on the word size chosen for data manipulation. Mapping cells to frames is much less intensive because it need only be performed once per cell time. Not immediately apparent are the costs associated with data accesses. These costs arise because I/O is much more expensive than computation. All these performance costs must be considered when examining architectural models for receiver adaptation layer processing. The performance costs suggest that processing should be separated into a data transfer path and a control path. The control path performs cell mapping and manipulations. The data path is for data transfers and CRC computations. The architectural models which will be examined include:

• single high speed uniprocessor designs,
• coarse grained parallel multiprocessor designs, and
• fine grained parallel multiprocessor designs.

3.2 Single Uniprocessor Approach

The single uniprocessor method is the simplest of all the possible approaches. In this approach, a single processor is used to implement all processes illustrated in the previous dataflow diagram (Figure 2.2.6). In most cases, a RISC processor must be used because of its higher speed. The uniprocessor approach is not an ideal solution. The adaptation layer has at least four independent threads of control that can run concurrently.
Simplistically, these threads of control are:

• removing cells at 622 Mbps from the network,
• combining cells into frames,
• computing CRC checksums, and
• transferring frames from the adaptation layer to the host system.

Confining the implementation to a single processor requires large overheads in multiple task switching, and the single processor would have to run at fairly high speed. Assuming cell processing in 32 bit wide long words, long words arrive at a rate of 19.5 MHz. Even the fastest CRC error detection algorithm, requiring 8 instructions per data word [Sar88], results in a 156 MIPS processor. Such a high speed requirement for a communications control processor is not practical, especially if designs are to be scalable. Clearly, the single processor approach is inadequate. Methods for implementing hardware CRC units to speed up CRC computations must be examined.

3.3 Coarse Grained Parallel Multiprocessor Design

In coarse grained processing, a task is divided into several smaller portions and handled by parallel processing elements (PEs). Coarse grained parallel multiprocessor designs use a pool of powerful PEs, each performing complete adaptation layer processing, to gain an aggregate processing power increase. Coarse grained processing is analogous to replicating multiple copies of the single uniprocessor approach to reduce the effective cell inter-arrival times seen by each copy. In this way, more time can be used for cell processing without loss of network cells.

Suppose widely available 40 MHz RISC processors are used as PEs for a typical coarse grained implementation of an AAL5 network interface receiver. Processing just the CRC for 16 bit words across 48 bytes of payload cell data at 8 instructions/word requires a total of 192 instructions per cell. Adding one instruction for copying each word to an output queue results in 216 instructions per cell. For a 40 MHz RISC, 216 instructions take 5.4 µs to complete.
Since each cell at 622 Mbps arrives every 681 ns, at least 8 PEs are needed to maintain network throughput. Figure 3.3.1 illustrates a simple implementation of the coarse grained approach, which requires shared global state, input queues, output queues, host system interface queues, and a mechanism to transfer cells to the host.

Figure 3.3.1 A Coarse Grain Parallel Systems Architecture for an Adaptation Layer Network Interface Receiver. Cells arrive on an input cell bus feeding the parallel processing engine; the PEs share a global state bus and interface queues leading to the host system.

A common problem in all parallel processing schemes is contention of processing elements for global resources. The increased contention overhead must be examined to determine the overall speedup compared with a uniprocessor approach. Notice that each PE can only access the global state bus for a maximum of one cell inter-arrival time, regardless of how many PEs are added to the design. If processing an AAL5 cell requires more than one cell time of total global state bus access, there will not be enough time for the other PEs processing cells concurrently. For our 40 MHz PE example with a cell time of 681 ns, the total number of global accesses for each PE can be no more than 681 ns cell time / 25 ns access time ≈ 27, without considering arbitration costs.

Contention for global state can be reduced with localized memory containing portions of the global state. With localized memory, multiple reads or multiple writes to the same memory location require only one access to global RAM. Note, however, that the total accesses to global state remain at 27.

It is difficult to achieve coarse grained parallelism for receiver adaptation layer processing since the adaptation layer's main purpose is frame reassembly. The implication here is that much of the processing involves manipulation of global data structures for each processed frame.
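The timing arithmetic in this analysis can be verified with a short side calculation (not part of the thesis design; the 40 MHz PE, 8 CRC instructions plus 1 copy instruction per 16 bit word, and the 25 ns bus access time are taken from the text above):

```python
import math

CELL_BYTES, PAYLOAD_BYTES = 53, 48
rate_bps = 622.08e6

cell_time_ns = CELL_BYTES * 8 / rate_bps * 1e9   # inter-arrival time per 53-byte cell
long_word_ns = 32 / rate_bps * 1e9               # inter-arrival time per 32-bit long word

# Coarse grained PE example: 24 16-bit words, 8 CRC instructions + 1 copy each.
words_per_cell = PAYLOAD_BYTES // 2
instr_per_cell = words_per_cell * 8 + words_per_cell
pe_time_us = instr_per_cell / 40e6 * 1e6         # one cell's processing on a 40 MHz PE
pes_needed = math.ceil(pe_time_us * 1000 / cell_time_ns)

# Global state bus budget per PE: one cell time at 25 ns per access.
bus_accesses = int(cell_time_ns / 25)
```

The script reproduces the figures quoted in the text: a 681 ns cell time, 216 instructions (5.4 µs) per cell, 8 PEs, and a budget of 27 global bus accesses per cell.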
The characteristic of frame reassembly is a short computational time and a long global access time. This characteristic does not favor parallelism in a coarse grained implementation.

Another problem with this network interface implementation arises when sequential cells for the same virtual connection arrive from the network. If the network interface is concurrently processing cells belonging to the same virtual connection, parallelism is reduced because processors must stall to maintain the sequential state dependence. Any time a PE is stalled, the possibility of receiving more cells for the same virtual connection increases. Stalled PEs create other problems, such as bus contention on busy waits or complex protocols for interprocessor communications.

In our network interface example, processors will stall if cells for the same virtual connection arrive any faster than once every eight cells, since 8 cell times are needed to process one cell. This translates to a gross rate of 622 Mbps / 8 = 77.75 Mbps and a net user data rate of 77.75 Mbps * 48 user bytes / 53 cell bytes = 70.4 Mbps. Any frame requiring more than 70.4 Mbps of network bandwidth will reduce parallelism in our network interface example.

It can be concluded that a coarse grained multiprocessor approach to AAL5 receiver processing is possible to design, but great care must be taken to optimize global state RAM accesses and to preserve sequential state independence for concurrent processing of cells from the same virtual circuit. This approach is not desirable because much of its complexity centers around resolving multiprocessor contention and synchronization problems.

3.4 Fine Grained Parallel Multiprocessor Design

In fine grained processing, a task is divided into many smaller tasks which are processed in parallel. The number of smaller tasks is much larger in fine grained processing than in coarse grained processing.
Fine grained parallel multiprocessor designs use a number of simple PEs to simultaneously process a cell. Each PE is assigned a small part of adaptation layer processing, and all PEs running concurrently combine to increase processing power. Specialized processors are designed for each part of adaptation layer processing. One way to divide adaptation layer functions among processors is simply to assign a PE to each process in the previous dataflow diagram (Figure 2.2.6). The fine grained approach is used in many of the recent adaptation layer receiver designs [Hil92] [Tra92] [Jai90].

Fine grained receivers do not have as many problems as the coarse grained design. Contention for global state information will be less than in the coarse grained approach because the number of PEs contending is expected to be smaller, since the PEs do not all use the same global state information. Referring to the previous data flow diagram (Figure 2.2.6), contention for global resources occurs between processes, so contention will depend on how the data stores are partitioned. Sequential cells from the same virtual connection will not cause fine grained PEs to stall, because the exact same adaptation layer function is never performed concurrently. Concurrency in fine grained processing is gained not through replication but through pipelining.

An advantage of the fine grained approach is the simpler function that each PE has to perform compared to the coarse grained approach. The simplicity allows PEs to be implemented as fast state machine designs. Faster state machines have shorter memory access times, which reduce contention on any shared busses. In the next section, a fine grained AAL5 adaptation layer receiver is proposed and discussed.

4 Proposed Receiver Architecture

4.1 Introduction

This section proposes an architectural design for an adaptation layer implementation on the receiver side of an ATM network.
The fine grained parallel multiprocessor design philosophy is adopted. The proposed design will provide solutions to problems in existing designs and utilize ATM cell traffic characteristics to optimize the design for 622 Mbps operation.

In previous sections, the following requirements have been identified:

• the need to take control processing out of the main data path, and
• the need to develop hardware CRC computation units.

Recall that two current designs have been examined in detail. The following limitations were found:

• Traw's design:
a. required the cell memory access time to be half the network data inter-arrival time in order to implement dual ported RAM,
b. used slow CAM for VCI mapping,
c. used DMA to transfer network cells to host memory, which reduces host system performance, and
d. used a linked list structure for intermediate cell storage, which is hard to maintain.

• Hill's design:
a. used 32K of sparsely filled direct mapped RAM for VCI mapping,
b. required long video RAM setup times for each arriving network cell, and
c. gave the host system a linked list of frame fragments instead of a frame in contiguous memory.

The proposed design will address all of these problems and provide an architecture which processes network ATM cells at 622 Mbps. The state consistency problem resulting from two frames arriving one after another is also resolved in the proposed design.

4.2 Architectural Overview

In the proposed design, the focus is to remove resource contention as well as to provide a means for efficient frame reassembly. Figure 4.2.1 illustrates the proposed design. The dashed line in the figure shows which parts of the receiver are integrated into a chip and which parts are external to the receiver. The receiver has four concurrent processing elements utilizing interleaved cell buffers, a CRC hardware computational engine, a single global state RAM, a video RAM host interface, and a number of FIFO queues.
The receiver architecture's key features, with indications of where they are discussed, are:

• single shared bus,
• interleaved cell memory which keeps the required access time only slightly below the data inter-arrival time (see §4.5.2),
• fixed sized circular buffers for maintaining cell pointers (see §4.3.3),
• hardware hashing with key shuffling to map VCIs to frames (see §4.3.3),
• adaptive sizing of frame segments to optimize memory transfers (see §4.4.4),
• video RAM frame buffers to reduce host system bus utilization (see §4.5.5),
• 16 bit parallel CRC design (see §4.5.6),
• simple IPC communications structure to handle all processor functions (see §4.3.4 and §4.4),
• IPC queues and FIFOs to virtually eliminate shared busses, and
• IPC queues used as intermediate state buffers to handle consecutive frame arrivals (see §4.4.4).

Figure 4.2.1 Proposed ATM Receiver Interface. The receiver connects to a 32Kx16 static global state RAM and to the host system bus.

The basic principle of the proposed ATM receiver's operation is quite simple. A commercial SONET chip is used to extract ATM cells from a SONET network. The SONET chip performs Transmission Convergence layer processing, which includes HEC verification. The raw ATM cells are passed to the ATM receiver, which begins adaptation layer processing. Each incoming cell is split into its header section and payload data section by the Incoming Cell Processor (ICP). Payload portions are stored directly in a shared interleaved cell buffer space, and the VCI from the header, along with a payload location pointer and payload type information, is inserted into a header queue.

The Header Processing Unit (HPU) extracts header information from the header queue and examines the VCI. Each VCI is checked for a matching entry in a VCI lookup table. The VCI lookup is achieved with a hardware hashing function which maps the 16 bit VCI into an 11 bit hash address. The hash table is used to map a VCI to its connection state information.
The HPU uses the state information to map the payload pointer to a FIFO data structure. The FIFO data structure is the key component in frame reassembly because it contains ordered cell pointers. The HPU initiates a frame transfer to the host system by writing commands into a command queue for the FIFO Control Unit (FCU). The HPU is also responsible for handling command requests from the host system.

The FCU handles all FIFO data structure manipulations in the global state RAM. It is closely coupled with the Memory Transfer Controller (MTC) through a register bank interface. The FCU is responsible for fetching a set of cell pointers from the FIFO data structure and transferring them to the register bank interface. It can then notify the MTC of available cells to transfer to the host system. Once the MTC is initiated, the FCU continuously reloads the register bank interface with cell pointers. In this way, frame transfers to the host system always occur at maximum speed, because a set of buffered cell pointers allows quick transfers into video RAM independent of the incoming network data rate. The video RAM write bandwidth must match peak network bandwidth to prevent bottlenecks. The FCU is also responsible for feeding recycled cell pointers and frame pointers to their respective queues.

The MTC handles all interfacing requirements between the interleaved cell body RAM and the video frame buffer RAM. It utilizes a parallel hardware CRC engine to perform intermediate CRC computations. The MTC provides high speed transfers from cell buffers to frame buffers.

The structure of the proposed receiver architecture is well suited to ATM adaptation layer processing. Multiple processing units running concurrently are decoupled by interconnecting FIFO queues or data buffers. No two concurrent PEs are directly dependent on each other in a lock step manner. The architecture has two flow paths: a short, highly utilized cell body path and a longer, less utilized control path.
The cell body path is made up of the ICP, cell body buffer, MTC, and the frame buffer. The control path handles all cell pointer manipulations and is composed of the ICP, HPU, and the FCU.

The architecture is kept simple by offloading the non-time-critical functions onto the host processor. This is a practical and efficient method for minimizing the number of decisions or manipulations that the receiver must directly perform. All ATM reassembly critical path functions are provided by hardware. Other functions, such as connection setup, connection take down, and system initialization, are performed by host accesses into the receiver hardware.

Not only do the FIFO queues and buffers reduce coupling between processing elements, but they also reduce resource contention. The buffers and FIFOs are all inherently dual port designs which allow concurrent access by two processors.

The FIFOs serve two major purposes in this receiver design: as free buffer lists or as Inter Processor Communications (IPC) queues. The two free buffer lists used in this design are the cell buffer list and the frame buffer list. In both cases, 16 bit addresses are queued in the FIFOs indicating which areas in cell RAM or video RAM are free to be used. There are 4 IPC queues in the proposed design. Two are used for communication from the HPU to the FCU. Messages to the host processor are carried by the MSGQ between the FCU and the host system bus. Commands from the host are carried on the HOSTQ, located between the HPU and the host system bus.

The two FIFO queues between the HPU and FCU are used to increase processing efficiency. The FFQ handles all release-buffer resource requests originating from the HPU and host processor. Host processor requests arrive via the HPU through the HOSTQ. The SFQ handles all memory transfer initiation requests from the HPU.

The only source of resource contention is the global state RAM.
This RAM can be accessed by three processors: the HPU, the FCU, and the host system processor. The HPU is given the highest priority for RAM accesses, followed by the FCU and then the host system processor. This priority scheme is used because the HPU must continually process incoming data at network rates, whereas the FCU processes cell transfers in intermittent bursts and the host system accesses occur mostly during connection state changes. Careful design considerations are needed to reduce contention on the global bus. Simulations are required to determine the overall global bus utilization.

A low bandwidth host system bus does not cause a problem in the proposed receiver architecture because a video RAM buffers the reassembled network cells. Video RAMs have a high speed serial port and a low speed random access port. The receiver writes frame data into the serial access side, and the host system reads frame data from the random access side.

The serial port configuration overhead required before any access to the video RAM serial port is reduced by amortizing this cost over multiple received cells for a given frame. The internal cell buffers store cells waiting for transfer to the host. The HPU monitors the number of cells per connection to determine when to request a frame transfer from the FCU.

The cell buffer is configured as two banks of low order interleaved static RAM concurrently accessible by both the ICP and the MTC. Cell RAM configured in this way needs a RAM access time only slightly below the network data inter-arrival time to maintain high throughput. The cell buffer configuration takes advantage of the cell I/O access pattern. Cell I/O from memory has both a random and a sequential nature. The first access to cell RAM is random, but the following accesses for the same cell are sequential. This means that there is one possible wait state per cell, and only at the beginning of the cell.
The remainder of the cell is synchronized with memory accesses.

4.3 Memory Data Structures

The receiver state is stored in various memory elements: the cell buffers, the free cell list, the frame buffers, the free frame list, the global state RAM, and five IPC command queues.

4.3.1 Cell Buffer and Free Cell Buffer List

The cell buffer RAM is used by the receiver as intermediate buffer space. Cell buffer space is organized as a pool of 24 word buffers. Each 24 word buffer provides sufficient space to store an ATM cell payload. The receiver uses a pool of 2730 independent buffers from 64K words of external static RAM for cell buffering.

The set of available cell buffers is kept in an auxiliary Free Cell Buffer List. This list is initialized with pointers to the available buffer locations. Pointers are removed when buffers are used and replaced when buffers are released.

4.3.2 Frame Buffer and Free Frame List

The frame buffer is used to store reassembled user frames. A pool of fixed 64K byte pages makes up the frame buffer. The frame buffer facilitates frame transfers between the receiver and the host system. Total frame buffer capacity is determined by the total amount of frame buffer RAM designed into the system.

The set of available frame pages is kept in an auxiliary FIFO structure. Its structure and operation are the same as for the free cell buffer list discussed in the previous section. The host system processor returns frame pages to the free frame buffer list by sending commands through the HOSTQ.

4.3.3 Global State RAM

The global state RAM containing all the connection status information is accessed by three independent processors: the HPU, the FCU, and the host system processor. Global state RAM is implemented with 32Kx16 words of external high speed static RAM. It is partitioned into three sections: the VCI hash table, the FIFO circular queue space, and the 'per VCI' data. The complete structure is illustrated in Figure 4.3.2.
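The free-list bookkeeping of §4.3.1 and §4.3.2 can be sketched in a few lines. This is an illustrative software model only: the 24 word buffer size and 64K word pool come from the text, while the class and method names are invented for the sketch.

```python
from collections import deque

class BufferPool:
    """Fixed-size buffer pool with a FIFO free list, as in the receiver's
    free cell buffer list: pointers are removed when buffers are used
    and appended again when buffers are released."""

    def __init__(self, total_words: int, buffer_words: int):
        self.buffer_words = buffer_words
        # The free list holds the starting word address of each available buffer.
        self.free = deque(range(0,
                                (total_words // buffer_words) * buffer_words,
                                buffer_words))

    def acquire(self) -> int:
        return self.free.popleft()      # raises IndexError when the pool is exhausted

    def release(self, ptr: int) -> None:
        self.free.append(ptr)

# 64K words of cell RAM carved into 24-word payload buffers -> 2730 buffers.
cells = BufferPool(total_words=64 * 1024, buffer_words=24)
```

The frame page free list works identically, with 64K byte pages in place of 24 word buffers.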
The following sections discuss each memory partition in greater detail.

Figure 4.3.2 Global State RAM Configuration. The 2048 entry hash table maps each VCI key to a 'per VCI' pointer (unused entries hold a null key); 256 circular FIFO buffers of 64 cell pointers each occupy the largest partition; and the 'per VCI' data blocks hold the FIFO_loc, Head, Tail, CRC, VR, Offset, Valid, and Sem fields for each connection.

The VCI Hash Table. The VCI hash table is a sparsely filled table for mapping VCIs to their 'per VCI' data. The table has 2048 possible hash locations, where each location contains a VCI key and an associated 'per VCI' data pointer. Unused locations in the hash table have their VCI key fields marked with a '0', since '0' is the only invalid VCI.

The receiver design is limited to no more than 256 open connections, so a 2048 entry hash table is sufficient to reduce hash collisions. The probability of hash collisions is binomially distributed and is found by the following formula:

    Pr(X = x) = C(n, x) * p^x * (1 - p)^(n - x)        (4.1)

where:

• x is the number of collisions in a given hash address,
• n is the number of hash keys, and
• p = 1/(hash table size) is the probability of hashing to a given address.

Given the 2048 element hash table with a maximum of 256 keys, the probability of a collision (x = 2) in a given hash address is found to be Pr(X = 2) = 6.87e-3.

The hash function is intended to distribute VCI keys uniformly in the hash table. Uniformly distributed keys prevent clustering and also reduce the hash collision probability. In the low probability event that a hash collision occurs when a new connection is opened, a simple collision resolution method called open addressing [Knu73] is used.
Collisions are resolved by searching the next 7 hash table locations for an unused location. Only 7 additional locations are searched in order to bound the incoming VCI lookup; an incoming VCI lookup always searches the 8 possible hash table locations for a key match. Knuth has shown that the average number of searches needed in open addressing is given by [Knu73]:

    C_N  ≈ (1/2) * (1 + 1/(1 - a))          (successful search)
                                                                  (4.2)
    C'_N ≈ (1/2) * (1 + (1/(1 - a))^2)      (unsuccessful search)

where a = (keys in table)/(table size). Substituting the proposed hash table parameters, the average number of searches on success is 1.07 and on failure is 1.15. These results indicate a very low probability of collisions, because only one search is needed on average to find a matching key.

Figure 4.3.3 gives an example of how an incoming VCI is checked for a matching hash table key. In the example, the 16 bit VCI WXYZ maps to hash location PQ. The 8 locations from PQ to PQ+7 are sequentially searched for a key match, and a match is found when location PQ+4 is reached.

Figure 4.3.3 VCI Hash Table Lookup Mechanism. The incoming 16 bit VCI hashes to location PQ; the search region spans the 8 entries PQ to PQ+7, and the matching key WXYZ is found at PQ+4.

When an attempt to open a new connection fails because all 8 possible hash table locations are occupied, a hash table shuffling algorithm is applied. The shuffle algorithm tries to insert the new VCI by moving one of the 8 occupied entries to another free location. This is accomplished by rehashing each of the 8 keys in the initial search area to find whether any key can be moved elsewhere. If a key can be moved, its 'per VCI' pointer is copied into the new location first, followed by the key value. Then the old key location is nulled before the new 'per VCI' pointer is written; the new VCI key is written last. The manipulations take place in this order to ensure that the transient hash table state does not cause errors in the concurrent adaptation layer processing.
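Equations (4.1) and (4.2) can be checked numerically, together with the bounded 8-entry probe the lookup uses. This is a side calculation, not part of the thesis design; the `lookup` function and its table representation are invented for illustration.

```python
from math import comb

TABLE, KEYS = 2048, 256
p = 1 / TABLE

# Eq. (4.1): probability that exactly 2 of the 256 keys hash to one address.
pr_2 = comb(KEYS, 2) * p**2 * (1 - p)**(KEYS - 2)

# Eq. (4.2): Knuth's average open-addressing search lengths at load factor a.
a = KEYS / TABLE
searches_hit = 0.5 * (1 + 1 / (1 - a))
searches_miss = 0.5 * (1 + (1 / (1 - a))**2)

def lookup(table, vci, hash_addr):
    """Bounded linear probe: scan the 8 slots starting at the hash address.
    Each occupied slot holds a (VCI key, 'per VCI' pointer) pair."""
    for i in range(8):
        slot = table[(hash_addr + i) % len(table)]
        if slot is not None and slot[0] == vci:
            return slot[1]          # the 'per VCI' data pointer
    return None                     # not an open connection
```

Evaluating the expressions reproduces the figures quoted in the text: Pr(X = 2) ≈ 6.87e-3, 1.07 searches on success, and 1.15 on failure.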
Proper hash table operation is guaranteed as long as the VCI key is written after the associated 'per VCI' pointer.

The receiver architecture is simplified by moving hash table management functions off the receiver design. Table management functions can be performed by the host system processor or by a co-processor in a board-level design. This is practical because management functions consume many clock cycles, which are better spent processing critical path functions, and they involve a lot of decision making, which a host processor handles easily. For the hash table, only the incoming VCI lookup is on the adaptation layer critical path.

Methods have been investigated for the host system to perform hash table management. The host processor directly modifies the global state RAM to manipulate the hash table. Host accesses to global RAM must be kept low so that receiver adaptation layer processing is not impeded by global RAM contention. This criterion is achieved by keeping a shadow image of the VCI hash table in the host system's own memory. The host system must also know the hardware hash function and how the global RAM is organized. Host accesses to global RAM are then limited to writes or lookups of dynamically changing state values.

Circular FIFO Queue Space. The largest area in the global state RAM is used by 256 circular FIFO queues for frame reassembly. Cell pointers for the same VCI are temporarily stored in a circular queue. Since ATM cells always arrive off the network in their original sequential order, the consecutive cell pointers in the circular queue map out a contiguous portion of the original frame.

Each circular queue is a fixed set of 64 contiguous entries in the RAM. It is maintained by the following 3 pointers in the 'per VCI' data: the location pointer, the head pointer, and the tail pointer.
The location pointer indicates the circular queue location associated with a VCI; the head and tail pointers are used to add and remove cell buffer pointers respectively. Maintenance costs for fixed circular queues are less than for a linked list data structure. Only three pointers are needed for fixed circular queues and there is no need to maintain an auxiliary free space list as in the case of a linked list data structure.

The Per VCI Data Space  The 'per VCI' portion of the global state RAM contains state information for each opened Virtual Circuit(VC). Each 'per VCI' block has 9 information fields. Each field is word aligned to simplify indexing and each 'per VCI' block is indexed as a 16 word block. Table 4.3.1 summarizes the 'per VCI' data fields and indicates the actual field size used in the word aligned data structure. Total memory required for 'per VCI' data is 256 connections * 16 words/connection = 4K words.

State Item  Size     Description
VCI         16 bits  The VCI for this connection.
FIFO_loc    16 bits  Number indicating which FIFO is used for this VCI.
Head        6 bits   Indicates head position of FIFO queue.
Tail        6 bits   Indicates tail position of FIFO queue.
CRC         32 bits  Current CRC computation.
VR          16 bits  Video RAM buffer page.
Offset      16 bits  Offset position into video RAM page.
Valid       1 bit    Flag to denote whether the VR and Offset fields are currently valid (true if valid).
Sem         8 bits   Semaphore indicating outstanding memory transfer requests from HPU to FCU.

Table 4.3.1 Description of State Data Structure Elements

The 3 status fields FIFO_loc, Head, and Tail are used to maintain the circular queue associated with a VCI. Head and Tail are pointers for the first and last entries in the FIFO respectively, while FIFO_loc is a pointer to the queue location. The two status fields, VR and Offset, are frame buffer maintenance pointers. VR denotes a 64K frame page and Offset is a relative location inside the frame page where incoming cells are transferred.
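Since each 'per VCI' block occupies a fixed 16-word slot, field access reduces to base-plus-offset arithmetic; a sketch (the per-field word offsets are illustrative assumptions, not taken from the thesis):

```python
WORDS_PER_BLOCK = 16   # each 'per VCI' block is indexed as a 16 word block

# Hypothetical word offsets of each field inside a block (CRC takes two words)
FIELD_OFFSET = {"VCI": 0, "FIFO_loc": 1, "Head": 2, "Tail": 3,
                "CRC_high": 4, "CRC_low": 5, "VR": 6, "Offset": 7,
                "Valid": 8, "Sem": 9}

def field_address(base, block_index, field):
    """Word address of one field in the per-VCI data space."""
    return base + block_index * WORDS_PER_BLOCK + FIELD_OFFSET[field]

# 256 connections * 16 words/connection = 4K words of per-VCI state
total_words = 256 * WORDS_PER_BLOCK
```

Fixed 16-word alignment lets the hardware form a field address with a shift and an add rather than a multiply.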
The two fields, Valid and Sem, serve as synchronization variables for the HPU and FCU. Valid denotes whether the frame buffer maintenance fields, VR and Offset, are valid. Sem is a semaphore field used to synchronize cell transfer command requests from the HPU to the FCU. These synchronization variables are discussed further in the next section. The two remaining fields, VCI and CRC, denote the VCI associated with the current 'per VCI' data and the current CRC-32 partial computation respectively.

An example of how the various data structures are related is given in Figure 4.3.4. Here, a 6 cell frame with VCI=5 is sent over the network to the receiver. Figure 4.3.4 illustrates the state of the receiver when 3 of the cells have already been transferred to the frame buffer and the last 3 cells are still resident in cell buffer space. The VCI field indicates the example value of 5, the CRC field indicates some partial CRC-32 computation, the Valid field indicates valid frame buffer pointers, and the Sem field indicates that there are no outstanding cell transfers.

Figure 4.3.4 Example of Per VCI Structure and Associated Buffers

4.3.4 Inter Processor Communication (IPC) Queues

Five IPC queues are used in the receiver design: HEADQ, FFQ, SFQ, HOSTQ, and MSGQ. All these queues are 16 bits wide and are connected between two concurrent processors. They provide communications facilities between concurrent processes and reduce overall resource contention through their two-port structure. This section discusses the simple message structure used in each queue.

HEADQ  The HEADQ is located between the ICP and the HPU. It accepts the header VCI and associated data from the ICP. ICP messages to the HPU are 2 or 3 words long for each incoming cell. The message format in the HEADQ is illustrated in Figure 4.3.5.
The first word is always the incoming cell VCI, followed by the cell type, followed by the cell buffer pointer if necessary. The cell type field denotes whether the incoming cell is an End-of-Frame(EOF) type, a general user cell, or an errored user cell. Partial cell reception results in an errored cell and no pointer is required for this type of message.

Generalized HEADQ Format
1st Word: <VCI>   2nd Word: <Type>   3rd Word: <Cell Buff Pointer>

Specific HEADQ Messages:
VCI  EOFType    Pointer  - End of Frame cell received
VCI  UserType   Pointer  - Other user cells
VCI  ErrorType           - Cell received in error

Figure 4.3.5 HEADQ Message Format

SFQ  The SFQ is located between the HPU and the FCU. It contains all VCI related commands from the HPU used to manipulate the 'per VCI' circular queues. As illustrated in Figure 4.3.6, there are 3 types of SFQ commands and each command is 3 words long. The first word of a command is always the command type, followed by a pointer to a 'per VCI' data block; a current circular queue head pointer is always the last word. The two commands, TX_SEG and TX_EOF, are used by the HPU to initiate reassembled cell transfers to the frame buffer. The segment of cells to transfer is always from the current tail pointer to the given head pointer in the associated circular queue. TX_EOF differs from TX_SEG because the last cell in the transfer segment of the TX_EOF command is the EOF cell. When the FCU finishes a TX_EOF command, it will notify the host system that a frame has been received.
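The 2- and 3-word message formats can be modelled with simple word-list builders (a sketch; the numeric type codes are hypothetical, since the thesis names the message types but does not give their encodings):

```python
# Hypothetical 16-bit codes; the thesis gives names, not encodings
EOF_TYPE, USER_TYPE, ERROR_TYPE = 1, 2, 3
TX_SEG, TX_EOF, CLOSE_VCI = 1, 2, 3

def headq_message(vci, cell_type, cell_ptr=None):
    """ICP -> HPU message: 2 words for errored cells, 3 words otherwise."""
    words = [vci, cell_type]
    if cell_type != ERROR_TYPE:        # errored cells carry no buffer pointer
        words.append(cell_ptr)
    return words

def sfq_command(command, per_vci_ptr, head_ptr=0):
    """HPU -> FCU command: always 3 words; Close_VCI pads with a dummy pointer."""
    return [command, per_vci_ptr, head_ptr]
```

Fixing every SFQ command at 3 words is what lets the FCU state machine stay simple: it never has to decode a variable-length message.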
The FCU will also invalidate the 'per VCI' frame buffer pointers, VR and Offset, by setting the 'per VCI' Valid bit to false.

Generalized SFQ Format
1st Word: <Command>   2nd Word: <per VCI pointer>   3rd Word: <Head Pointer>

Specific SFQ Messages:
TX_SEG      Pointer  Hd_Pointer  - Transfer reassembled portion of cells to frame buffer
TX_EOF_SEG  Pointer  Hd_Pointer  - Transfer reassembled portion of cells containing EOF cell to frame buffer
Close_VCI   Pointer  Dummy_Ptr   - Flush circular buffers to close VC

Figure 4.3.6 SFQ Message Format

The 3rd SFQ command type is the Close_VCI command. The Close_VCI command is always followed by the 'per VCI' pointer and a dummy pointer. The dummy pointer is used to simplify the FCU state machine design: since SFQ commands are fixed at 3 words in length, the FCU state machine never has to handle a variable command length. The dummy pointer is unused by the FCU. The Close_VCI command empties internal circular queues by freeing all remaining cell buffer resources. Those cell buffer contents are never used because the connection has been closed.

FFQ  The FFQ is also situated between the HPU and the FCU. It accepts recycle pointer commands, used to release receiver resources, from the HPU. As illustrated in Figure 4.3.7, FFQ commands are always 2 words long. The first word is the resource type and the second is the pointer to the resource. The FCU takes the pointer and inserts it in the appropriate free buffer list.

Generalized FFQ Format
1st Word: <Recycle_type>   2nd Word: <Buffer Pointer>

Specific FFQ Messages:
Recycle_Cell   Pointer  - Add cell pointer to free cell buffer list
Recycle_Frame  Pointer  - Add frame page pointer to free frame list

Figure 4.3.7 FFQ Command Format

The FFQ is used in addition to the SFQ between the HPU and FCU because this configuration improves receiver efficiency. FFQ commands have short execution times in comparison to SFQ commands. Unlike SFQ commands, FFQ commands do not have sequential order dependencies.
Sequential independence allows FFQ commands to be separated from SFQ commands without jeopardizing frame reassembly. System efficiency increases because FFQ's release resource commands effectively skip over preceding SFQ commands, resulting in earlier execution. Early execution of FFQ commands gives a larger resource pool to work with.

HOSTQ  The HOSTQ is located between the host processor and the HPU. The host processor uses this queue to pass commands into the receiver. As illustrated in Figure 4.3.8, there are two possible command types, both of which are two words long. The command types are Close_VCI and Recycle_Frame. Both commands have been discussed in the FFQ and SFQ sections above. Although the HOSTQ feeds the HPU, both queued commands are handled directly by the FCU. The HPU merely passes both commands to the FCU verbatim, except that it appends a dummy pointer to Close_VCI, making it a 3 word command.

Generalized HOSTQ Format
1st Word: <Command_type>   2nd Word: <Pointer>

Specific HOSTQ Messages:
Close_VCI      per_VCI_Pointer  - Dump any remaining circular queue contents
Recycle_Frame  Frame_Pointer    - Add frame page pointer to free frame list

Figure 4.3.8 HOSTQ Command Formats

There are two reasons why the HOSTQ feeds the HPU instead of the FCU. First, it is desired that every FIFO queue be located between only two concurrent processors to reduce FIFO contention. If HOSTQ commands were fed directly into either the SFQ or FFQ, both the HPU and the host processor would contend for the command queues. The command queue access would have to be arbitrated, which increases complexity and incurs arbitration time overheads. The other reason why the HOSTQ feeds the HPU instead of the FCU is that the Close_VCI command has sequential order dependencies with SFQ commands. The Close_VCI command should occur only after all TX_SEG and TX_EOF commands for the VCI are performed.
If the Close_VCI command fed directly through the HOSTQ to the FCU, it would be very difficult to synchronize the command with SFQ queued TX_SEG and TX_EOF commands. When the Close_VCI is passed to the SFQ from the HPU, Close_VCI is guaranteed to execute only after any queued TX_EOF and TX_SEG commands. Since the HOSTQ is placed to feed directly to the HPU, Recycle_Frame commands reach the FCU via the HPU by default.

MSGQ  The MSGQ is located between the FCU and the host system processor. It contains status information for the host system. Two status messages of the form shown in Figure 4.3.9 can be sent by the FCU. The Frame_Received message is followed by the VCI and a pointer to the frame page. Every error-free received frame results in this message to the host system. The second message, Closed_VCI, acknowledges original Close_VCI requests from the host. The Closed_VCI message is followed by the same 'per VCI' pointer sent as part of the original Close_VCI command.

Generalized MSGQ Format
1st Word: <Message_type>   2nd/3rd Word: <args>

Specific MSGQ Messages:
Frame_Received  VCI  Frame_Ptr  - Informs host of received error-free frame
Closed_VCI      Per_VCI_Ptr     - Acknowledges Close_VCI commands

Figure 4.3.9 MSGQ Message Formats

4.4 Communication Structure

The proposed receiver performs many tasks via IPCs and global state manipulations. In this section, the communications interactions needed for efficient receiver adaptation layer processing are discussed.

IPCs provide a mechanism to maintain a high level of concurrency. IPC command queues are also effectively utilized as auxiliary memory elements for transient state information. Transient state storage is used to handle sequential cells from different frames on the same connection.

Global state manipulations from the host system are used for modifying receiver data structures in order to reduce receiver architecture complexity.
Both IPC and global state manipulations are integrated to provide the necessary receiver adaptation layer functionality. IPC and state RAM interactions are described for each of the following receiver functions: initialization, rejecting unmatched VCI cells, opening a VCI connection, reception of network cells to the frame buffer, and closing a VCI connection.

4.4.1 Receiver Initialization

The data structures in the receiver are initialized mostly by the host system processor, as illustrated in Figure 4.4.10. The host system writes available frame page pointers to the HOSTQ with Recycle_Frame commands. Global state RAM data structures are initialized by the host. The entire hash table key field is filled with null entries. Internal cell buffer pointers are initialized by the receiver itself. All cell buffer pointers are loaded in the free cell buffer list at system start-up.

Figure 4.4.10 Receiver System Initialization

4.4.2 Rejecting Unmatched VCI Cells

The ICP buffers the payload part and queues the header part of incoming cells. It does not directly check for incoming cell validity. The ICP passes the VCI to the HPU as soon as it receives it. The validity check is done in the HPU by looking for a hash table key match of the given VCI. This check is concurrent with the ICP buffering the cell payload. If the HPU finds no VCI match on an error-free received cell, it takes the cell pointer from the header queue and sends a Recycle_Cell command to the FCU via the FFQ. The cell is rejected by the system when the cell buffer is recycled.

4.4.3 Open a Connection

Before the host can receive frames through the frame buffer, a connection must be opened. The connection is identified by its VCI. Figure 4.4.11 illustrates how the global state RAM data structures are initialized to accept cells for a given VCI.
A host memory resident image of the hash table determines what global state RAM manipulations are required to enter the new VCI in the hash table (discussed in §4.3.3). If entry is possible, the 'per VCI' fields VCI, FIFO_loc, Head, Tail, Sem, and Valid are written with the appropriate initialization values. Then the hash table is updated by first writing the 'per VCI' pointer followed by the VCI key to open the connection. The 'per VCI' fields CRC, VR, and Offset are written internally by the FCU.

Figure 4.4.11 Global State Manipulations Required to Open a Connection
(Note: writes to the hash table may include writes for moving a key.)

4.4.4 Receive Frames

Many IPC and global state RAM manipulations are used to properly process incoming network cells into reassembled frames. This section will show how the different communications interactions combine to recover frames. An example will clearly illustrate how IPC queues are used to store transient state information and how this information is utilized to properly handle sequential frames on the same connection.

The explanation of receiver operation is simplified if only the case of a single VCI connection is considered. The simple example will illustrate various stages of receiver operation and provide a detailed description. For this example, a 12 cell frame followed by a 3 cell frame is assumed to be transmitted, and the HPU initiates a frame transfer when there are at least 5 cells in the circular queue. Since it is assumed that no frame fragments have made it to the frame buffer, no outstanding transfer requests exist.
Figure 4.4.12 illustrates the relevant portions of the receiver showing the internal state condition for this example.

Figure 4.4.12 Example of Frame Reception Showing Internal State
(In this example, a TX_SEG is sent when there are more than 5 outstanding cells in the circular buffer and Sem = 0. The SFQ holds, oldest first: a TX_SEG with head pointer 5, a TX_EOF with head pointer 12, and a TX_EOF with head pointer 15, all for VCI WXYZ.)

In Figure 4.4.12, the 'per VCI' block is shown with entries reflecting the current state of the connection. The '0' valued Valid field indicates that fields CRC, VR, and Offset are undefined and that partial frame transfers have not occurred yet. The Sem field value of '3' indicates that there are 3 outstanding transfers (TX_SEG, TX_EOF, TX_EOF) which have not been serviced by the FCU. The circular queue contains the 15 total cells from the two transmitted frames. The contents of the free frame buffer list show the next available free pages. The contents of the SFQ show the 3 outstanding command requests from the HPU. There is only one TX_SEG because the command is only used when Sem = 0. Sem = 0 when all outstanding transfer requests have been serviced for the connection. The head pointer portion of the transfer commands has values of 5, 12, and 15. These values indicate how far the 'per VCI' tail pointer must traverse in the circular queue to satisfy the given transfer request. All EOF cells force a TX_EOF request onto the SFQ from the HPU.

The above example clearly shows how sequential frames for the same connection are handled. Although there is only one 'per VCI' state block, frame boundaries are actually stored in the SFQ. The SFQ is used as auxiliary storage for state information. This information combined with the Valid status field ensures proper handling of frame transfers. The FCU accepts the transfer commands and feeds the MTC with pointers from the circular queue.
The MTC reads cells from the cell buffers and transfers them to the frame buffers. The internal state of the system after each command execution is shown in Figure 4.4.13. After each TX_EOF command, the Valid status field is set to '0' so that a new frame buffer will be used by the next transfer request.

Figure 4.4.13 Internal State After Each SFQ Command
(After the TX_SEG command, Valid = 1 and Sem = 2; after the first TX_EOF, the frame buffer fields are invalidated, Valid = 0 and Sem = 1, and a Frame_Received message is placed on the MSGQ; after the second TX_EOF, Valid = 0 and Sem = 0.)

Figure 4.4.13 shows that Sem is decremented after each command execution. Sem synchronizes transfer requests from the HPU. On each transfer request, the HPU atomically increments Sem. After performing each transfer request, the FCU atomically decrements Sem. The semaphore variable increases the efficiency of adaptation layer processing because no extra TX_SEG commands are sent if there are outstanding transfer requests. The resulting effect is that larger frame fragments get transferred to the frame buffer per transfer request. This is the case in the example after the first TX_SEG is issued by the HPU: the following transfer block is 7 cells long, which is 2 cells greater than the default 5 cell transfer length. Longer block transfers increase system performance because the fixed cost of initializing the frame buffer is incurred only at the two ends of the transfer block.
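The Sem mechanism can be replayed as a small behavioural model (a sketch only): 15 cells arrive as a 12 cell frame followed by a 3 cell frame, the HPU requests a transfer at 5 queued cells when Sem = 0, and the FCU is assumed never to run, so commands simply accumulate on the SFQ:

```python
def hpu_model(cell_stream, threshold=5):
    """Model HPU transfer-request generation for one VCI.

    cell_stream: list of (cell_number, is_eof) in arrival order.
    Returns the SFQ command list and the final Sem value, assuming
    the FCU services nothing (Sem is never decremented).
    """
    sfq, sem = [], 0
    last_requested_head = 0
    for head, is_eof in cell_stream:
        if is_eof:
            sfq.append(("TX_EOF", head))      # every EOF forces a TX_EOF
            sem += 1
            last_requested_head = head
        elif sem == 0 and head - last_requested_head >= threshold:
            sfq.append(("TX_SEG", head))      # only when no transfer is pending
            sem += 1
            last_requested_head = head
    return sfq, sem

cells = [(n, n in (12, 15)) for n in range(1, 16)]   # 12-cell frame, then 3 cells
print(hpu_model(cells))
# ([('TX_SEG', 5), ('TX_EOF', 12), ('TX_EOF', 15)], 3)
```

The model reproduces the state in Figure 4.4.12: one TX_SEG at head 5, TX_EOFs at heads 12 and 15, and Sem = 3; the 7-cell gap between heads 5 and 12 is the longer transfer block the text describes.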
Fewer transfers of longer blocks are much more efficient than more transfers of shorter blocks.

The mechanism of using longer transfer blocks when outstanding transfer requests exist is ideally suited to the networking environment. Whenever congestion occurs in the receiver due to factors such as resource contention, the architecture automatically adapts to the situation by making transfer blocks longer. This adaptive feedback mechanism works to offset throughput-reducing congestion by increasing the effective transfer rate.

MSGQ and HOSTQ IPC commands also occur in addition to the SFQ IPC commands. The Frame_Received message is placed on the MSGQ by the FCU after each TX_EOF, barring CRC error situations. Not shown in the above example is the host system placing the Recycle_Frame command on the HOSTQ after using each frame.

4.4.5 Closing a VCI Connection

The host does many of the necessary manipulations to properly close connections. It first invalidates the hash table key field so that no further cells with the specified VCI are accepted by the receiver. Then the host places the Close_VCI command on the HOSTQ and waits for an acknowledgment Closed_VCI message from the MSGQ.

The Close_VCI command is executed by the FCU. It cleans up pointers associated with partially transferred frames. All cell pointers still in the circular queue are recycled and the partially filled frame page pointer is recycled.

4.5 Hardware Overview

The receiver is designed with five processing elements: the Incoming Cell Processor(ICP), the Header Processing Unit(HPU), the FIFO Control Unit(FCU), the Memory Transfer Controller(MTC) and the CRC-32 Computational Engine. These processing elements are supported by interconnecting queues and buffers. All the processing elements are expected to be integrated into a single Application Specific Integrated Circuit(ASIC).
The only external blocks are the frame buffer video RAM, the interleaved static RAM cell buffer, the SONET framer, and the global state RAM. The hardware details of each processing element and its buffer memory are discussed in the following sections.

4.5.1 The Incoming Cell Processor(ICP)

The ICP is implemented as a simple state machine. It interfaces to an external SONET framer and to the cell buffer RAM. Its primary function is to split cells between their header and payload. The payload is sent to available buffers in the cell body RAM. The VCI in the cell header and associated cell information are delivered to the header queue for processing by the HPU.

Figure 4.5.14 illustrates the hardware functional diagram of the incoming cell processor. The SONET framer chip performs cell sync and HEC functions before passing ATM cells to the receiver. A 622 Mbps SONET framer is expected to perform much in the same way as existing 155 Mbps framers such as the PM5345 [PMC92]. The framer can be configured to pass ATM cells in 16 bit words or 8 bit bytes. In the 16 bit mode, an extra 8 bit byte is inserted after the 5th byte for word alignment. The PM5345 can also be configured to reject NULL VCI cells and bad HEC cells. Internally it buffers 4 ATM cells. Handshaking lines are provided to control access to queued cells.

Figure 4.5.14 Incoming Cell Processor and Interleaved Cell Buffer RAM (two 32Kx16 banks, even and odd)

The ICP expects cell data in a form similar to the form provided by the PM5345. The ICP stores the first two words in a register so that the VCI and payload type can be extracted and stored in the header queue. The header payload type is examined for EOF indication or user cell indication. All other cells are classified as error cells. The payload type information is passed onto the header queue after the reception of the last word of the cell. This ensures that partial cell reception can be tagged as an error. The 3rd header word is discarded because it is unused.
All payload data is stored directly into a cell buffer location addressed by the top element in the free cell buffer list. A counter is used to address sequential words in cell memory.

The transfer of the VCI into the HEADQ during the early part of cell reception allows the HPU to concurrently search for the VCI connection state. The ICP state machine is responsible for controlling all the ICP functions during the proper time interval. An indication of the ICP's responsibilities for one cell during cell arrival time is illustrated in Figure 4.5.15. Each slot time in the diagram indicates one 16 bit word inter-arrival time.

Figure 4.5.15 ICP Responsibilities during Word Arrival Slots
(Over the 27 word slots of a cell: the VCI is sent to the HEADQ, a cell pointer from the free list FIFO is sent to the HEADQ, the payload words are written to the cell buffer, and the cell type is sent to the HEADQ last.)

4.5.2 Interleaved Cell Buffer RAM

The cell buffer is implemented with two banks of 32Kx16 static RAM. Low order address bit 1 is used to select either the even word bank or the odd word bank. As illustrated in Figure 4.5.14, the RAM is configured as a dual port RAM. One side of the port is accessed by the ICP while the other side is accessed by the MTC.

A memory sync generator is used to alternate the banks on each side of the dual port RAM. The memory sync allows each side to concurrently access separate banks for one memory cycle time. This cell buffer configuration results in no overhead accesses once synchronization is achieved. The synchronization cost is at most 1 access time per cell. This does not create a bottleneck on the ICP side because a cell time is composed of 27 words, 24 of which are written to the cell buffer. In effect, synchronization time can be as high as 3 word times with no adverse effects.
For full 622 Mbps operation, the receiver design can be tuned to incur only one synchronization delay at the beginning of the 622 Mbps stream.

The interleaved buffer RAM configuration allows memory access time to remain close to the interword arrival time and still maintain high throughput. This is not true for non-interleaved dual ported designs using single ported RAM, because there is only one RAM device which must be accessed twice as fast in order to maintain throughput.

4.5.3 The Header Processing Unit(HPU)

As illustrated in Figure 4.5.16, the HPU is implemented as a state machine interfaced to queues, global state RAM, and registers. The HPU's primary responsibility is to process header information from the HEADQ and to combine cell buffer pointers into the circular queues. Pointers in circular queues are used for frame reconstruction. The HPU is also responsible for processing commands on the HOSTQ.

Registers in the HPU store local copies of various global values to minimize global RAM accesses. This reduces contention for the global bus and speeds up adaptation layer processing. Some registers connect directly to logic blocks to compute when the HPU can initiate segment transfer requests to the FCU. The HPU writes the Head and cell buffer pointer into state RAM on every correctly received cell. Sem is atomically incremented whenever a transfer request is sent to the FCU.

Figure 4.5.16 HPU Hardware Functional Diagram

The HPU maps an incoming VCI to its VCI state information with a hardware hash function. The hash function is applied to all incoming VCIs to determine the 'per VCI' data address in the global state RAM. If no initial match is found, up to the next 7 consecutive state information locations are searched for a VCI match. These lookups are done in an interleaved fashion where the next consecutive hash key is fetched while the current key is checked for a match.
Any suitable hash function that randomizes the VCI mapping distribution across the hash table can be selected. A simple hardware hash function example using exclusive-or gates is illustrated in Figure 4.5.17.

Figure 4.5.17 Hardware VCI Hash Function Example
(The 16 bit VCI X(0...15) is folded with XOR gates into an 11 bit hash value H(0...10); for example, H5 = X0 xor X10, H6 = X1 xor X11, H7 = X2 xor X12, H8 = X3 xor X13, H9 = X4 xor X14.)

4.5.4 FIFO Control Unit(FCU) and Memory Transfer Controller(MTC)

The FCU and MTC are closely coupled concurrent processors where the FCU initiates all MTC processing. The FCU is made up of two state machines, one servicing commands from the FFQ and the other servicing commands from the SFQ. The SFQ state machine(SFQSM) is also the FCU master state machine. All state machines run concurrently once initiated by the SFQSM. A bank of local registers interfaces the SFQSM to the MTC state machine, which reduces global state RAM access. Figure 4.5.18 is the functional hardware diagram for the FCU and the MTC.

Figure 4.5.18 FCU/MTC Hardware Functional Diagram
(The local register bank holds VCI, FIFO_loc, Head, Tail, CRC_High, CRC_Low, VR, Offset, Valid, and a set of cell pointer registers Cell_ptr1 to Cell_ptr10.)

The SFQSM performs all global state RAM interfacing. It loads the state values VCI, FIFO_loc, Head, Tail, CRC, VR, Offset, and Valid into local registers, as well as a portion of the cell buffer pointers from the circular queue. After servicing a transfer command, the SFQSM writes back the updated state values Tail, CRC, Offset, Sem, and Valid.

Locally stored cell buffer pointers improve system efficiency because pointers are available on demand. No arbitration costs or contention delays are incurred as in the case of fetching cell pointers directly from global state RAM. Arbitration costs and contention delays are incurred once per refill of the local pointer registers.
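A software model of such an XOR-folding hash (the gate wiring in the figure is only partially legible, so the bit pairing below is a plausible assumption rather than the thesis circuit):

```python
def vci_hash(vci):
    """Fold a 16-bit VCI into an 11-bit hash value with XOR gates.

    Assumed wiring: H(i+5) = X(i) xor X(i+10) for i = 0..4, matching the
    pairs visible in Figure 4.5.17; the remaining output bits are assumed
    to be pass-throughs of the unused input bits.
    """
    x = [(vci >> i) & 1 for i in range(16)]
    h = [0] * 11
    for i in range(5):
        h[i] = x[i + 5]                  # assumed pass-through bits
        h[i + 5] = x[i] ^ x[i + 10]      # XOR pairs from the figure
    h[10] = x[15]                        # assumed pass-through
    return sum(bit << i for i, bit in enumerate(h))
```

Any VCI then maps into an 11-bit table index (0 to 2047) in a single gate delay of combinational logic, which is what keeps the lookup on the critical path cheap.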
Large transfer blocks are serviced by successively refilling the local pointer registers.

The MTC extracts cell buffer pointers from the local pointer registers to transfer cell bodies from the interleaved cell RAM. Each cell pointer is concurrently recycled by the SFQSM when the MTC finishes transferring the body to the video RAM frame buffer. Cell body transfers to video RAM are done at network rates once the video RAM serial ports are set up. The large time cost of serial port set up is incurred only once per TX_SEG or TX_EOF command. The MTC is directly interfaced to the CRC Engine for CRC-32 computations. All cell data error check computations are performed while the data is passed to the video RAM frame buffers.

The FFQ state machine(FFQSM) is a very simple state machine which reads the recycle resource commands from the FFQ and writes the recycle pointers to the proper free list. It is controlled by the SFQSM to handle arbitration on the input queue bus. Output queue bus arbitration is performed in hardware. The output queue arbitrator is a non-preemptive arbiter which gives a higher priority to SFQSM requests.

4.5.5 Video RAM Frame Buffer

The frame buffer is implemented using a video RAM interface to the host system. Video RAM is ideal because of its low cost, high density, dual port, and high speed. This frame buffer configuration reduces the host system bus utilization required to transfer frame data. The host system does not get loaded by bursts of cells arriving off the network. The video RAM interface between the MTC and the host system bus is illustrated in Figure 4.5.19.

Figure 4.5.19 Video RAM Frame Buffer Interface

Operational characteristics of video RAMs are extensively covered in data books such as [Nec93].
A short summary of relevant video RAM functions is provided here. Video RAMs have an independent random access port and a serial access port. The serial access port must be initialized before it can be read or written. Initialization is needed because the serial access port actually reads or writes internal serial access memory(SAM), which is independent from the RAM core. The SAM serves as an internal read/write buffer of a complete row in the internal RAM array. For write operations, the SAM parameters of I/O direction and R/W index must be initialized. A transfer from SAM->RAM must be made in order to actually write serial data into the memory core. Transfers from RAM->SAM must be made before frame transfers that start at mid-row begin. This ensures that the SAM does not improperly overwrite data at the beginning of a RAM row when SAM->RAM transfers are used to complete a serial port write. All RAM->SAM transfers change the serial port I/O direction to read. The serial port can be
Pseudo code algorithm of howframe transfers are written to Video RAM is given in Figure 4.5.20./*** Pseudo Code of video RAM write/‘ Note: write_address == Row_address concat column_address ***//*** SAM_ptr = next serial write position kk/SAM Initialization ***/if column_address = 0 then /* start of row /pseudo SAM->RAN, (SAI4_ptr 0) 7* SAN I/O = write /elseRAN(Row_address)->SAM / read row /pseudo SAM->RAN, (SAN_ptr = column_address) /* SAM I/O = write) *7end if/*** transfer frame ***/While (not finished writing transfer segment)SAN(SAN_ptr) <= data 7* write SAN /SAN_ptr <= (SAN_ptr + 1) mod row_sizeif SAM_ptr = 0SAN->RAN(row_address) 7* write data to PJ4 *7row_address++end ifend WhileEnsure SAM writtenSAN->RAN(row_address) 7* write data to 4*/Figure 4.5.20 Algorithm Showing Frame Transfers to Video RAM via SAMAll video RAM internal transfers and serial port initialization require access to the randomaccess port. The MTC explicitly requests access to the random port using bus request andbus grant handshaking signals. Serial port accesses for video RAMs are much faster thanrandom port accesses. For example, the NEC uPD4882234 256Kx8 video RAM has a serialport cycle time of 22/25 ns and a random port cycle time of 140/l5Ons [Nec93]. These cycletimes translate to one interword arrival time for serial access and 6 interword arrival timesfor random access in a 622 Mbps ATM network.4. Proposed Receiver Architecture 61Receiver host system bus accesses are required only when initializing the SAM and wheninternal SAM->RAM transfers are needed. In this way, the host system bus is not undulyloaded by high bandwidth frame transfers.4.5.6 CRC EngineCyclic redundancy code computations are commonly used in network transmission environments to ensure correct data reception. The AAL5 adaptation layer uses a 32 bit CRC forerror detection. An ATM receiver must compute the CRC result across the entire incomingdata stream for each virtual connection. 
This is an inherently difficult task in ATM networks due to the high speeds.

The most common method for computing a CRC in hardware is the bitwise serial method [Lin83]. Here, a linear feedback shift register (LFSR) is combined with appropriately placed XOR gates to perform the required CRC division. A parallel hardware CRC generator is needed because the serial method is too slow for high speed networks.

CRC Algorithm  The CRC computation specified for ATM networks is found in [ANS87] and summarized here. Given the following:

F(X): A degree k-1 polynomial that is used to represent the k bits of the frame to send.

L(X): A degree 31 polynomial with all coefficients equal to 1.

G(X): The standard CRC-32 generator polynomial:

    G(X) = X^32 + X^26 + X^23 + X^22 + X^16 + X^12 + X^11 + X^10 + X^8
           + X^7 + X^5 + X^4 + X^2 + X + 1.                          (4.3)

R(X): Remainder polynomial that is of degree less than 32.

P(X): The remainder polynomial on the receive checking side that is of degree less than 32.

CRC: The CRC polynomial of degree less than 32.

Q(X): The greatest multiple of G(X) in [X^32 F(X) + X^k L(X)].

Q*(X): X^32 Q(X).

M(X): The sequence that is transmitted.

M*(X): The sequence that is received.

C(X): A unique polynomial remainder produced by the receiver. This polynomial has the value:

    C(X) = X^32 L(X)/G(X)
         = X^31 + X^30 + X^26 + X^25 + X^24 + X^18 + X^15 + X^14 + X^12
           + X^11 + X^10 + X^8 + X^6 + X^5 + X^4 + X^3 + X + 1.      (4.4)

All CRC computation equations are modulo 2. The equations that are used to generate the CRC from F(X) are:

    CRC = L(X) + R(X)
    [X^32 F(X) + X^k L(X)]/G(X) = Q(X) + R(X)/G(X)                   (4.5)
    M(X) = X^32 F(X) + CRC.

The remainder, R(X), is generated by appending 32 bits of zeros to F(X), taking the ones complement of the first 32 bits of the frame, and dividing by the generator polynomial, G(X). The CRC is computed as the ones complement of R(X).
The message sent is the original frame, F(X), shifted left by 32 bits and added to the CRC.

The received sequence M*(X) may differ from the transmitted sequence M(X) if there are transmission errors. The equation used to check the received message is:

    X^32 [M*(X) + X^k L(X)]/G(X) = Q*(X) + P(X)/G(X).                (4.6)

For error free reception of M*(X), the unique remainder is given by:

    P(X)/G(X) = X^32 L(X)/G(X) = C(X).                               (4.7)

The receiver computes the error check code C(X) by taking the ones complement of the first 32 bits of M*(X), appending 32 bits of zeros, and dividing the result by the generator G(X).

Intuitively, 32 bit CRC error check sequences work by computing the remainder of the dividend, F(X) with 32 bits of trailing zeros, divided by G(X). When the remainder is added to the original dividend to form the transmitted message, the message becomes a multiple of G(X). The receiver divides the received message by G(X), and a zero remainder would imply error free transmission.

The ATM CRC-32 computation is slightly different. ATM CRC-32 computations require complementing the first 32 bits of the message before computation, and appending 32 zeros to the received message. The ones complement of the first 32 bits of the message sequence is used to produce a unique remainder in the presence of leading zeros in a message sequence; leading zeros normally would not change the computed remainder. The transmitter complements the computed remainder R(X) to permit the receiver to detect trailing zeros which appear due to errors. Complementing R(X) is equivalent to adding L(X) to R(X), which changes the expected remainder from zero to L(X). Appending 32 zeros to the received message further changes the expected remainder from L(X) to L(X) with 32 bits of zeros appended, divided by G(X). The remainder at the receiver is equivalent to C(X) in the absence of transmission errors.

CRC Bit Serial Computation  CRCs are serially computed with an LFSR and feedback XOR gates [Lin83].
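The generation and checking rules above, together with the serial LFSR form, can be exercised numerically. The sketch below is illustrative Python (the function names are ours, not from the thesis): the all-ones starting register implements the ones complement of the first 32 message bits, the implicit X^32 premultiplication in the feedback takes the place of the 32 appended zeros, and the final complement adds L(X).

```python
G_POLY = 0x04C11DB7  # G(X) of Equation 4.3 with the X^32 term implicit

def crc32_register(data: bytes, reg: int = 0xFFFFFFFF) -> int:
    """Bit-serial CRC-32 division, one LFSR step per message bit (MSB first)."""
    for byte in data:
        for i in range(7, -1, -1):
            feedback = ((reg >> 31) & 1) ^ ((byte >> i) & 1)
            reg = (reg << 1) & 0xFFFFFFFF
            if feedback:
                reg ^= G_POLY        # modulo-2 subtraction of G(X)
    return reg

def aal5_crc(frame: bytes) -> int:
    """Transmitted CRC = ones complement of the remainder R(X)."""
    return crc32_register(frame) ^ 0xFFFFFFFF

# Error-free check: running the same register over the frame followed by
# the transmitted CRC leaves the residue C(X) of Equation 4.4.
C_RESIDUE = sum(1 << e for e in
                (31, 30, 26, 25, 24, 18, 15, 14, 12, 11, 10, 8, 6, 5, 4, 3, 1, 0))
frame = b"AAL5 frame payload"
codeword = frame + aal5_crc(frame).to_bytes(4, "big")
assert crc32_register(codeword) == C_RESIDUE  # C(X) = 0xC704DD7B
```

The residue value 0xC704DD7B is independent of the frame contents, which is exactly why the receiver can check any frame against the single constant C(X).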
The LFSR with feedback essentially computes the CRC by successive subtraction. Whenever the highest order bit in the LFSR is a 1, the generator polynomial is subtracted using the feedback XOR branches. Figure 4.5.21 illustrates the simple hardware needed to compute the CRC-32.

Figure 4.5.21 CRC Computation for ATM CRC-32

CRC Parallel Method  Parallel CRC implementations are commonly designed by examining the effects on the LFSR contents after N clock cycles [Per83]. The effects are replicated using logic structures to yield a circuit that computes the CRC N bits at a time.

In this thesis, a different approach is taken which gives the same results as the above method, but provides an analytic description of the parallel computation. Instead of starting from the LFSR to design parallel CRC generators, the general CRC equations are manipulated first. This analysis provides a better understanding of what parallel CRC circuits are actually computing. Once the theory of parallel CRC computation is understood, the circuit designer is free to choose his own design methodology.

Examination of successive N bit CRC computations will indicate how a parallel circuit can be designed. N is defined to be less than the degree of G(X) so that a degree N polynomial is its own remainder. Before the next N bits are introduced, the previous bits of the message already result in an intermediate remainder. Let M1(X) be the earlier part of the message already computed. Let M2(X) be the next N bits of the message. Let M(X) be the message given by the combination of M1(X) and M2(X). M(X) can be expressed as:

    M(X) = X^N M1(X) + M2(X).                                        (4.8)

The CRC remainder, R(X)/G(X), of M(X) is given by:

    M(X)/G(X) = Q(X) + R(X)/G(X).                                    (4.9)

The intermediate remainder of M1(X) is given by:

    M1(X)/G(X) = Q1(X) + R1(X)/G(X).
(4.10)

The CRC remainder of M2(X) is given by:

    M2(X)/G(X) = Q2(X) + R2(X)/G(X)                                  (4.11)

but the order of M2(X) is less than the order of G(X), so:

    Q2(X) = 0,
    R2(X) = M2(X).                                                   (4.12)

Combining equations 4.8 and 4.10 and simplifying gives:

    M(X)/G(X) = X^N [Q1(X) + R1(X)/G(X)] + M2(X)/G(X)
              = X^N Q1(X) + [X^N R1(X) + M2(X)]/G(X).                (4.13)

Therefore, the introduction of the N bit M2(X) to M1(X) changes the remainder R1(X)/G(X) to the remainder of [X^N R1(X) + M2(X)]/G(X). The procedure can be generalized as:

    Rn+1(X) = Rem[(X^N Rn(X) + Mn+1(X))/G(X)]
            = Rem[X^N Rn(X)/G(X)] + Mn+1(X).                         (4.14)

The next successive CRC remainder, Rn+1(X), is the remainder of the previous remainder, Rn(X), shifted over by N bits and divided by G(X), and finally added to the new N bit message sequence, Mn+1(X). Successive N message bits are computed in this way to find the CRC remainder of the combined message.

A parallel CRC circuit performing the function of Equation 4.14 must compute the remainder of X^N Rn(X)/G(X). Intuitively, this computation means that any of the first N high order bits of Rn(X) equal to 1 must be subtracted out with multiples of G(X). In effect, multiple subtractions of G(X) will perform the required division and leave the remainder polynomial. Since subtraction and addition are the same in modulo 2 arithmetic, the remainder circuit is implemented with an array of XOR gates.

For the proposed ATM design, CRC computations are done 16 bits at a time. The XOR array required to compute successive results is determined using a table similar to Figure 4.5.22 [Gla93]. Each shifted G(X) row in Figure 4.5.22 is used to subtract any 1s in the first 16 bits of Rn(X). The row equations shown at the right side of the table indicate which rows are used to cancel a given combination of the 16 high order bits R31 to R16. For example, if only R31 is 1, then rows 1, 7, 10, 11, and 13 are needed to give the resulting row sum all 0s in the first 16 bits. Row 1 cancels out R31 but introduces 1s in the R25, R22, and R21 positions.
Other rows are used to cancel these 1s out, along with any new 1s they introduce. The XOR function required for each bit position Cn of the next CRC remainder is then given by summing all the row equations with 1s in the same column, plus the variable at the top of the column. For example, C31 is given by the sum of R15 and the row equations for rows 1, 7, 8, and 11. The resulting simplified equations for each bit position Cn of the next remainder are summarized in equations 4.5.23.

[Figure: array of 16 shifted G(X) rows against remainder bits R31 to R16, with the row equations used to cancel each bit combination shown at the right]

Figure 4.5.22 Parallel CRC XOR Array
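Equation 4.14 can be verified in software before committing to an XOR array. The sketch below is illustrative Python with names of our own choosing, using N = 16 as in the proposed design: shift_mod_g performs the 16 shift-and-subtract steps that the XOR array of Figure 4.5.22 flattens into combinational logic, and a plain modulo-2 long division serves as the reference. The ATM-specific complement of the first 32 bits and the final complement are applied around this core exactly as in the algorithm section.

```python
G_POLY = 0x04C11DB7  # CRC-32 generator G(X), X^32 term implicit

def shift_mod_g(rem: int, n: int = 16) -> int:
    """Rem[X^n * rem(X) / G(X)]: n shift-and-subtract steps.  A parallel
    circuit computes this same function in one combinational pass."""
    for _ in range(n):
        msb = (rem >> 31) & 1
        rem = (rem << 1) & 0xFFFFFFFF
        if msb:
            rem ^= G_POLY            # subtract a multiple of G(X)
    return rem

def crc_parallel16(words, rem=0):
    """Equation 4.14: Rn+1 = Rem[X^16 Rn / G] + Mn+1, 16 bits per step."""
    for m in words:
        rem = shift_mod_g(rem) ^ m
    return rem

def crc_long_division(words, rem=0):
    """Reference: bit-serial modulo-2 long division of the message by G(X)."""
    for m in words:
        for i in range(15, -1, -1):
            rem = (rem << 1) | ((m >> i) & 1)
            if rem >> 32:
                rem ^= (1 << 32) | G_POLY
    return rem
```

Because remaindering is linear in modulo-2 arithmetic, consuming the message 16 bits at a time through crc_parallel16 agrees with the bit-serial division for any starting remainder.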
[Figure 4.5.23: simplified XOR equations for each parallel CRC output bit Cn]

The XOR equations require on the order of 10 gate delays from input to output. This may well be the rate determining parameter, because the frame transfer rate is directly proportional to how fast the CRC can be computed. Fitting 10 gate delays within an interword arrival time of 25 ns requires gate delays of 2.5 ns to maintain network transfer rates.

5 Simulation Results

5.1 Simulation Methodology

The proposed receiver architecture is simulated at the hardware level and at the systems level. Two different simulators, each specialized at a different level, are used to model the design.
A VHDL simulator is used to perform hardware level simulations, and the discrete event simulator SIMSCRIPT II.5 is used to perform system level simulations. The different information available from each simulation level is used to optimize the architecture. The structure of the simulation methodology used is illustrated in Figure 5.1.1.

Two different simulators are required because of the different characteristics of the data to be gathered. VHDL is well suited for hardware modelling because it handles design verification checks like setup/hold times and cycle times. The VHDL model is essentially a functional verification model which gives a good indication of the complexity and feasibility of the receiver design.

SIMSCRIPT II.5, on the other hand, is well suited for systems level simulations. It is a simulator that provides provisions to measure system throughput, resource utilizations, and other relevant statistics for a multiprocessing environment.

Both models serve as powerful tools used to optimize the receiver hardware design. Timing data is extracted from the VHDL model and used as input for the SIMSCRIPT II.5 model. SIMSCRIPT models the receiver with different size frames and different numbers of connections. Results from SIMSCRIPT are used to determine how the receiver structure can be optimized. The VHDL model is then changed to reflect the optimizations, and the whole process is repeated. Through the combined results from both models, potential bottlenecks and resource sizes can be determined.

5.1.1 The Hardware Model

The complete receiver architecture is modelled with VHDL to provide details of system timing and to facilitate an ASIC design path. The hardware level models the interconnection between all internal processing elements, the host system bus, SONET framer, memory buffer devices, arbiters, and internal queues. All the required interface handshaking is modelled in order to obtain accurate system timing parameters.
The VHDL model includes design verification checks for setup/hold time violations, cycle time violations, and device handshaking errors.

VHDL models are written for each external device. The external devices are modelled with available consumer device parameters. Devices modelled include the SONET framer, using PMC's PM5345 [PMC92]; the cell buffer RAM and state RAM, using the NEC uPD431016 15 ns 64Kx16 static RAM [Necc93]; and the frame buffer, using the NEC uPD482234 256Kx8 video RAM [Nec93]. Since the PM5345 operates at the lower 155 Mbps speed, only its handshaking interface and data format are modelled; its speed parameters are increased to emulate a 622 Mbps device. Device parameters and handshaking information are obtained for the NEC RAM devices to create realistic VHDL models. All RAM devices have cycle times less than the 622 Mbps interword arrival time of 25 ns. The interface specifications for FIFO queues are all modelled after the TI SN74ACT7802 1024x18 FIFO [Tex92]. The FIFO device's parameters are too slow, with a worst case 25 ns access time and therefore a cycle time well above the required 25 ns interword arrival time. The modelled FIFO devices are assumed to have a cycle time of 25 ns. This is reasonable because the FIFO is internal and is expected to operate faster than discrete external FIFOs.

Figure 5.1.1 Structure of Simulation Methodology

5.1.2 The Systems Model

The system model uses timing details obtained from the VHDL model. SIMSCRIPT II.5 has the language construct process for modelling a thread of control, and the construct resource for modelling shared devices. All the processing elements are modelled with processes and all the external memory devices are modelled as resources. Internal FIFO buffers are modelled with SIMSCRIPT II.5 FIFO set entities.

The ATM receiver system model is much simpler than the VHDL hardware model because all handshaking interactions between devices are not modelled.
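The process/resource decomposition can be pictured with a toy discrete-event loop. The sketch below is plain illustrative Python, not SIMSCRIPT II.5; the job names and the two-unit access delay are invented. A single shared resource (standing in for something like the global state RAM bus) serves timed requests first-come-first-served, and utilization falls out of the bookkeeping.

```python
import heapq

def simulate(jobs, access_delay=2):
    """Toy discrete-event core: `jobs` is a list of (request_time, name).
    One shared resource serves requests in time order, one at a time,
    holding the resource for `access_delay` units per request."""
    events = [(t, i, name) for i, (t, name) in enumerate(jobs)]
    heapq.heapify(events)
    free_at = 0      # time at which the resource next becomes free
    busy = 0         # total time the resource is held
    finish = {}
    while events:
        t, _, name = heapq.heappop(events)
        start = max(t, free_at)      # queue behind the current holder
        free_at = start + access_delay
        busy += access_delay
        finish[name] = free_at
    return finish, busy

# Three "processes" contending for one shared resource.
finish, busy = simulate([(0, "HPU-read"), (1, "FCU-write"), (1, "host-read")])
utilization = busy / max(finish.values())
```

A full system model layers many such resources and processes, with the cumulative interface timing lumped into the per-access delays.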
All the cumulative effects of interface timing are lumped together to form access delays.

The system level model can be used to simulate receiver architecture performance for a wide array of network traffic characteristics and host system characteristics. The system model allows receiver designs to be tuned for special purpose applications. It provides a way to target the receiver to specific host platforms. The system model also provides an indication of how resource limits affect the overall performance of the receiver architecture.

The system level model is needed to determine if the global shared bus is a system bottleneck. The model is also required to determine the SFQ and HEADQ sizes needed for implementations. Maximum throughput and host bus utilization are found with the model.

5.2 Performance Results

The architecture's performance is characterized with the system model. Various sets of traffic conditions are processed by the receiver to find what network rates can be sustained without loss of data. The data are all continuous bit rate data with different sized frames and different numbers of virtual circuits opened. Maximum throughput is found for 1, 8, 48, 128, and 256 connections with frame sizes of 1, 6, 11, 22, 171, 683, and 1366 cells. These frame sizes correspond to frames of 1, 256, 512, 1K, 8K, 32K, and 64K bytes respectively.

The maximum throughput at steady state without loss of data is obtained for each traffic characteristic. Maximum throughput is determined by monitoring the growth of the internal cell buffer over the simulation time. When the standard deviation of the cell buffer depth begins to decrease while the mean depth remains constant, a steady state condition is reached. At the maximum throughput, several performance parameters are recorded.
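The steady-state test described above can be sketched as a simple windowed statistic. The following is illustrative Python only; the window length and tolerance are invented parameters, not values from the thesis.

```python
from statistics import mean, stdev

def reached_steady_state(depths, window=50, tol=0.05):
    """Steady state per the text: the standard deviation of the cell
    buffer depth is decreasing while the mean depth stays constant.
    Compares the last two windows of depth samples."""
    if len(depths) < 2 * window:
        return False
    old = depths[-2 * window:-window]
    new = depths[-window:]
    mean_constant = abs(mean(new) - mean(old)) <= tol * max(mean(old), 1.0)
    std_decreasing = stdev(new) < stdev(old)
    return mean_constant and std_decreasing
```

A rising buffer trace (still filling) fails the mean test, while a trace oscillating with shrinking amplitude around a fixed depth passes both conditions.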
The parameters are: maximum SFQ size, maximum header queue size, host system bus utilization, HPU shared bus utilization, FCU shared bus utilization, and an indication of whether the cell buffer resource or a FIFO queue resource limited further increases in throughput. To ensure that the steady state condition is always reached, all these results are obtained assuming that frame buffers never run out.

The cell buffer size is fixed at 64K words and the internal circular queues are fixed at 64 words for all simulations, due to the fixed static RAM in the system. Throughput is reduced only if one of these two resources runs out, because cells are then lost. The resource which causes cell loss is also recorded at maximum throughput.

Internal frame transfer initiations from the HPU to the FCU are set to occur after 10 cells have been buffered in a circular queue. This setpoint has a large effect on the dynamics of receiver performance. If the setpoint is too low, initialization overheads for internal frame transfers are not offset by the length of the frame transfer. The negative feedback mechanism reacts to increase the effective transfer length, but the receiver will take a longer time to reach steady state. If the setpoint is too high, the average buffer occupancy will increase and more buffer space is required.

The receiver architecture is simulated at a speed of 83.3 MHz, or 12 ns/cycle. This speed was chosen to allow two clock cycles per interword arrival time of 25 ns. Two cycles per
The maximum SEQ size and maximumHEADQ size values are tabulated to determine the overall maximum resource size forimplementing the receiver.dm,:ktirin Q,:It 7z1Frame TP Max Max ECU Bus HPU Bus System Res.(#cells) (MBps) SEQ headQ Util.(%) UtiI.(%) Bus Lim.Size Size UtiI.(%) (BIQ)1 264.46 3 1 28.58 19.42 17.59 Q6 542.37 3 3 12.23 33.44 5.99 Q11 592.59 3 3 9.14 35.80 3.58 Q22 627.45 3 3 6.81 37.54 2.80 Q171 640.00 3 3 5.50 38.26 2.32 Q683 640.00 3 3 5.29 38.34 2.16 Q1366 640.00 3 3 5.26 38.34 2.19 QOpen Connections = 1Frame TP Max SEQ Max FCU Bus HPU Bus System Res.(#cells) (MBps) Size headQ UtiI.(%) UtiI.(%) Bus Lim.Size UtiI.(%) (B/Q)1 264.46 3 1 28.58 19.42 17.59 Q6 547.01 84 3 12.32 33.68 6.03 Q11 598.13 93 3 9.20 35.87 3.63 Q22 627.45 45 4 7.01 37.12 3.07 Q171 627.45 45 4 6.97 37.31 3.87 Q683 627.45 45 3 6.68 37.80 3.61 Q1366 627.45 45 4 6.65 37.81 3.59 QOpen Connections = 8Erame TP Max SEQ Max ECU Bus HPU Bus System Res.(#cells) (MBps) Size headQ UtiI.(%) Util.(%) Bus Lim.Size UtiI.(%) (B/Q)1 264.46 3 1 28.64 19.46 17.63 Q6 547.01 162 3 12.17 33.28 5.96 Q11 592.59 270 3 9.26 35.08 3.84 Q22 627.45 279 4 6.90 37.57 2.89 Q171 621.36 279 4 7.05 37.31 3.95 Q683 621.36 279 4 7.09 37.38 4.16 Q1366 621.36 276 4 7.04 37.47 4.13 QOpen Connections = 48Figure 5.2.2 Receiver Architecture Simulations Results (Continued) . .5. 
Frame     TP       Max SFQ  Max headQ  FCU Bus   HPU Bus   System Bus  Res. Lim.
(#cells)  (Mbps)   Size     Size       Util.(%)  Util.(%)  Util.(%)    (B/Q)
1         264.46   3        1          28.65     19.46     17.63       Q
6         547.01   363      3          12.21     33.40     5.98        B
11        587.16   726      3          9.17      35.28     3.71        B
22        581.82   726      3          9.13      35.14     5.53        B
171       592.59   726      4          8.43      35.91     5.45        B
683       598.13   732      4          8.19      36.23     5.27        B
1366      598.13   735      4          8.30      36.29     5.37        B
Open Connections = 128

Frame     TP       Max SFQ  Max headQ  FCU Bus   HPU Bus   System Bus  Res. Lim.
(#cells)  (Mbps)   Size     Size       Util.(%)  Util.(%)  Util.(%)    (B/Q)
1         264.46   3        1          28.65     19.46     17.63       Q
6         542.37   639      3          12.22     33.41     5.98        B
11        524.59   1377     3          11.92     30.61     7.45        B
22        556.52   1308     3          10.42     32.80     7.19        B
171       571.43   1377     3          9.26      34.75     6.64        B
683       576.58   1227     3          9.17      35.16     6.55        B
1366      576.58   996      3          9.14      35.20     6.53        B
Open Connections = 256

Figure 5.2.2 Receiver Architecture Simulation Results

The table contains columns for frame size, throughput, maximum SFQ size, maximum header queue size, FCU bus utilization, HPU bus utilization, host system bus utilization, and resource limit. The resource limit indicates whether the cell buffer (B) or the circular queue (Q) caused the throughput to be reduced.

[Figure: ATM throughput vs frame size, one curve per connection count (c = 1, 8, 48, 128, 256)]

Figure 5.2.3 ATM Throughput Vs Frame Size

[Figure: FCU shared bus utilization vs frame size, c = 1 to 256]

Figure 5.2.4 ATM FCU Shared Bus Utilization Vs Frame Size

[Figure: HPU shared bus utilization vs frame size, c = 1 to 256]

Figure 5.2.5 ATM HPU Shared Bus Utilization Vs Frame Size
[Figure: host system bus utilization vs frame size, c = 1 to 256]

Figure 5.2.6 Host System Bus Utilization Vs Frame Size

The above graphs characterize different aspects of the receiver architecture. The characteristics of each graph are summarized below.

ATM Throughput  Figure 5.2.3 shows the ATM throughput for different frame sizes and different numbers of open connections. The low 264.46 Mbps throughput yielded by the 1-celled frame is omitted from the graph so that more detail from the remaining data points is visible. The throughput ranges from a low of 264.46 Mbps for all single celled frames to a high of 640 Mbps for large framed single connections.

The low 264.46 Mbps case is due to the inability of the FCU to adaptively recover from high overhead short frame transfers. All received cells are EOF types, so each cell causes a transfer request to the FCU. The high 640 Mbps for large frame single connections is due to the receiver architecture's ability to exploit low overhead long frame transfers for large frames.

It is important to note that the simulated results are for steady state conditions. The receiver can sustain data processing at the graphed rate indefinitely, as long as the host recycles frame buffers quickly enough. The receiver is able to handle adaptation layer processing in all test cases at the full 622 Mbps for short bursts.

The maximum throughput achievable for a given frame size is limited by two different resource limits. With a low number of connections, maximum throughput is determined by the fixed sized circular queues. With a high number of connections, throughput is determined by the fixed sized shared cell buffer.

The fixed sized circular queues determine the maximum steady state throughput for the low number of connections cases.
When there is a low number of connections, the shared cell buffer sufficiently buffers incoming cells without risk of overflow. Throughput is sustained as long as the circular queues do not overflow.

The fixed sized shared cell buffer determines the maximum steady state throughput for the larger number of connections cases. Throughput decreases for an increasing number of connections because the larger number of connections places larger demands on the shared cell buffer memory. The maximum sustained throughput must decrease to offset the larger demand on the shared resource.

Throughput between successive frame sizes varies considerably for the smaller frames. Larger frames exhibit nearly constant throughput between successive sizes. This effect is due to the fact that the smaller frames do not give the receiver enough time to optimize cell transfers to the frame memory before a TX_EOF forces a frame transfer. Larger frames allow larger frame segment transfers and therefore allow the receiver to transfer optimal frame segment lengths.

Internal Shared Bus Utilization  The internal shared bus for the global state RAM, shared between the FCU, HPU, and host system processor, is a possible resource bottleneck. Figures 5.2.4 and 5.2.5 show the shared bus utilization for all test cases. The combined bus utilization is no greater than 50% for all cases. This leaves 50% of the shared bus available for the host system processor. The host system processor does not access the shared bus except when opening and closing a connection. Changing connection states occurs very infrequently compared to the length of time connections are generally opened. It can be concluded that the shared global RAM bus is not expected to be a bottleneck under these conditions.

Host System Bus Utilization  The host system bus utilization does not show the utilization of a particular bus from a target system.
The utilization indicates the fraction of time that a host bus is accessed during the time to transfer complete frames into the frame buffer. Figure 5.2.6 shows the host system bus utilization for all test cases. The maximum utilization of 17.63% is for the pathological case of 1 cell sized frames. All other cases have less than 10% host system bus utilization. These are very good results because the host system is virtually unaffected by high throughput network data.

Other Results  The tabulated results also show the maximum resource size that must be implemented in the receiver to maintain throughput. The maximum header queue size is found to be 4. This indicates that the HPU is able to keep up with network data, because 3 HPU messages are generally produced for each arriving network cell. For practical purposes, the maximum queue size that needs to be implemented would be 3, because incoming cells are already elastically buffered in the SONET device.

The maximum SFQ FIFO size is found for the 256 open connections case with an 11 cell frame size. The tabulated result under these conditions is 1377 elements, or 459 outstanding FCU commands. This works out to just under 2 commands per connection. This is expected because the small frame size, with internal transfer requests after 10 cells, causes at worst a TX_SEG followed by a TX_EOF command. Two outstanding commands for each opened connection give a theoretical upper bound of 512 outstanding commands.

6 Conclusions and Future Work

6.1 Conclusions

In this thesis, ATM receiver architectures were reviewed. Two current receiver architectures were examined in detail to determine potential bottlenecks.
A discussion of general parallel architectures and their effectiveness as ATM adaptation layer receivers was provided. The fine grained parallel approach was applied to a proposed receiver architecture.

The performance of the proposed receiver was simulated using a low level VHDL model and a high level SIMSCRIPT II.5 model. The design was shown to perform effectively over a wide variety of network traffic conditions. It achieved sustained network speeds ranging from 264.46 Mbps to well over 622 Mbps. The target 622 Mbps could be achieved for all traffic conditions for short burst lengths. The major features of the proposed parallel receiver architecture, as well as contributions from this thesis, are summarized below.

• Connection open/close and other non-time-critical functions are done by the host system to simplify receiver state machine design.
• A hardware hash method for mapping VCIs is used instead of slow CAMs and sparsely filled static RAMs.
• A Hash Function Shuffling Algorithm is proposed to reduce connection failures.
• Fixed sized, low maintenance, circular queues handle cell pointer manipulations.
• Cell buffer memory with cycle times comparable to network arrival rates is achieved through an interleaved static RAM design.
• Overhead from frame transfer initialization is reduced through longer transfer segments.
• Adaptive negative feedback is utilized to reduce the effects of resource contention over a wide array of traffic conditions.
• An internal queue is used as an intermediate state buffer to handle consecutive frame arrivals.
• A shared global state RAM bus utilization of at most 50%, which is not expected to be a bottleneck.
• A receiver which requires less than 10% of the host system bus for a wide range of frame sizes.
• A 16 bit parallel CRC-32 design, provided with a new analytical description of the implementation.

Based on the above results and simulations, this thesis shows that much faster ATM receiver designs can be realized with a fine grained approach.

6.2 Future Work

The work presented in this thesis provides details to implement an ATM AAL5 receiver. More work can still be done to better characterize the design.

Although two levels of the ATM receiver are modelled to give a better understanding of the receiver design, more complete results can be obtained if the VHDL model is synthesized into a silicon design. In this way, better timing values can be obtained with VLSI extraction tools. These tools extract synthesized chip timing based on the fabrication technology. The timing values can be used to optimize the design. Synthesis will also indicate how fast the CRC Engine performs.

Synthesis will also give a good indication of the chip area of the proposed receiver design. Chip area has direct consequences for the expected yield and cost of the receiver. Other VLSI considerations such as power requirements can likewise be obtained through synthesis. Unfortunately, the synthesis tools are not currently available at UBC.

The receiver design is characterized at steady state conditions. It would be very informative to run simulations in transient conditions for all throughputs that are below the target 622 Mbps, to determine what kinds of 622 Mbps data burst lengths can be effectively handled.

Steady state sustained traffic gives a good indication of how the receiver performs under heavy loads. Performance results for observed network traffic characteristics will also be useful. This will require obtaining network traces or using statistical network data generators.
It was noted that the internal transfer length setpoint affects how the system performs. Simulations to determine optimal values for the setpoint for different traffic characteristics can be performed. This information can be utilized to tune the receiver's performance for a target traffic characteristic.

References

[Abu89] H. Abu-Amara, T. Barzilai and Y. Yemini, "PSi: A Silicon Compiler for Very Fast Protocol Processing," in Protocols for High-Speed Networks (H. Rudin and R. Williamson, eds.), Elsevier Science Publishers B.V., 1989.

[Ada92] Adaptive Corporation, Redwood City, California, FRED Chipset Technical Manual, Version 2.0, October 1992.

[Amd91] CMOS Memory Products Data Book, ch. Am99C10A: 256x48 CAM, pp. 5.3-5.28. AMD Inc., 1991.

[ANS87] "FDDI," in ANSI X3.139-1987, 1987.

[Bar93] C. Barnaby and N. Richards, Networks, Routers and Transputers: Function, Performance, Applications, ch. 10. IOS Press, 2nd ed., 1993.

[Bat93] M. Batubara and A. J. McGregor, "An Introduction to B-ISDN and ATM," Tech. Rep. 93/14, Monash University, September 1993.

[CCI93] "CCITT Draft Recommendations: BISDN Protocol Reference Model and its Applications," in Draft Recommendation I.321, 1993.

[Che89] G. Chesson, "XTP/PE Design Considerations," in Protocols for High-Speed Networks (H. Rudin and R. Williamson, eds.), Elsevier Science Publishers B.V., 1989.

[Cla90] D. D. Clark and D. L. Tennenhouse, "Architectural Considerations for a New Generation of Protocols," in Proceedings, SIGCOMM 1990, (Philadelphia, PA), pp. 200-208, September 1990.

[Coo91] E. Cooper, O. Menzilcioglu, R. Sansom, and F. Bitz, "Host Interface Design for ATM LANs," in Proceedings, 16th Conference on Local Computer Networks, pp. 247-258, October 1991.

[Dav91] B. S. Davie, "A Host-Network Interface Architecture for ATM," in Proceedings, SIGCOMM 1991, (Zurich, Switzerland), pp. 307-315, September 1991.

[Gla93] R. J. Glaise and X.
Appendix A. List of Acronyms
AAL    ATM Adaptation Layer
ASIC   Applications Specific Integrated Circuit
ATD    Asynchronous Time Division
ATM    Asynchronous Transfer Mode
BISDN  Broadband Integrated Services Digital Network
CAM    Content Addressable Memory
CCITT  International Consultative Committee for Telegraphy and Telephony
CLP    Cell Loss Priority
CRC    Cyclic Redundancy Check
CS     Convergence Sublayer
DMA    Direct Memory Access
EOF    End of Frame
FCU    FIFO Control Unit
FFQ    Fast FIFO Queue
FFQSM  Fast FIFO Queue State Machine
FIFO   First In First Out
FRED   Fragmentation and Reassembly Device
Gbps   Gigabits per second
GFC    Generic Flow Control
HDTV   High Definition Television
HEADQ  Header FIFO Queue
HEC    Header Error Check
HOSTQ  Host FIFO Queue
HPU    Header Processing Unit
ICP    Incoming Cell Processor
I/O    Input/Output
IPC    Inter Processor Communications
LFSR   Linear Feedback Shift Register
Mbps   Megabits per second
MSGQ   Message FIFO Queue
MTC    Memory Transfer Controller
NAB    Network Adapter Board
NICE   Network Interfacing Compiler Experiment
OSI    Open Systems Interconnect
PE     Processing Element
PT     Payload Type
RAM    Random Access Memory
RISC   Reduced Instruction Set Computer
SAM    Serial Access Memory
SAR    Segmentation and Reassembly
SFQ    Slow FIFO Queue
SFQSM  Slow FIFO Queue State Machine
SONET  Synchronous Optical Network
VC     Virtual Circuit
VCI    Virtual Circuit Identifier
VHDL   VHSIC Hardware Description Language
VHSIC  Very High Speed Integrated Circuit
VLSI   Very Large Scale Integration
VPI    Virtual Path Identifier
XTP    Xpress Transfer Protocol
