On the parallel implementation of OSI protocol processing systems. Robert Lam, 1993.

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


831-ubc_1993_fall_lam_robert.pdf [ 4.42MB ]
JSON: 831-1.0064905.json
JSON-LD: 831-1.0064905-ld.json
RDF/XML (Pretty): 831-1.0064905-rdf.xml
RDF/JSON: 831-1.0064905-rdf.json
Turtle: 831-1.0064905-turtle.txt
N-Triples: 831-1.0064905-rdf-ntriples.txt
Original Record: 831-1.0064905-source.json
Full Text

Full Text

ON THE PARALLEL IMPLEMENTATION OF OSI PROTOCOL PROCESSING SYSTEMS

By Robert Lam
B. A. Sc., The University of British Columbia, 1989

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE STUDIES, DEPARTMENT OF ELECTRICAL ENGINEERING

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
September 1993
© Robert Lam, 1993

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Electrical Engineering
The University of British Columbia
Vancouver, Canada

Abstract

In a heterogeneous computing environment, computers have to use a suitable transfer syntax to communicate with each other because of the differences in data representations in each computer. The need to translate the data from one format to another, which can be understood by all computers, takes over 90% of the processing power used for protocol processing. Application specific architectures in a heterogeneous system may not be efficient in performing protocol processing functions and there is a need for a stand alone protocol processing system. In a gigabit networking environment, the processing power required for protocol processing is beyond that of a single processor and a multiprocessor approach is needed. Three different multiprocessor architectures with increasing complexities for protocol processing are described in this thesis.
These designs make use of the properties of locality of processing and prefetching techniques. Both simulations and analytical methods are used to evaluate the performance. The results indicated that by putting instructions and packet data close to the processors, a significant increase in processing throughput can be obtained as compared to previously published designs. In our best scenario, we obtained a throughput of up to 560MBits/s. With further optimizations, it is expected that the throughput can be extended to a gigabit rate. The study also revealed that a bus based architecture is sufficient in handling the OSI protocol processing requirement for a network rate of over 560MBits/s. It was observed that the use of fast processors alone without matching fast memory is inadequate in providing a feasible solution. The sequential nature of protocol definition could be changed to include more parallelism and to further improve the scalability of the protocol processing system.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
1.1 Guide to the rest of the thesis
1.2 Background on protocol processing
1.3 Motivation for the research into a multiprocessor design
1.4 Background on parallel processing
1.5 Related Research Activities
1.6 Goals and approach of study
2 Design issues
2.1 System requirement
2.2 Method of parallelization
2.3 Single queue versus multiple queues
2.4 Slow versus fast processors
2.5 Abstract Model of the system
2.5.1 Interconnection requirement
2.5.2 Number and kind of buses needed
2.5.3 Locking mechanism
2.5.4 Analysis
2.5.5 Other parameters for measuring the performance
2.5.6 Other major components
2.6 The simple shared memory design
3 The local instruction memory design
3.1 Background on locality
3.2 Effects of locality
3.3 Performance analysis and Simulations
3.3.1 Merits of the LIM design
3.3.2 Weakness of the LIM design
4 The packet relocation design
4.1 Usefulness of data cache
4.2 Spatial locality achieved through copying
4.3 Performance analysis and Simulation
4.3.1 Merits of the SPR design
4.3.2 Weakness of the SPR design
4.4 Overcoming the weakness
4.4.1 Reduce the number of packet relocation before processing starts
4.4.2 Reduce the number of packet relocation during processing
4.4.3 Discussions
5 Results and Discussions
5.1 Cost
5.2 Memory speed
5.3 Bus speed
5.4 Multiple connections
5.5 Packet lengths
5.6 Other presentation layer operations
5.7 Future Work
6 Conclusions
Bibliography
A Simulation Details
A.1 Simulation model
A.2 Protocol processing data
A.3 Simulation parameters
A.3.1 Obtaining Peak Performance
A.3.2 Selecting the content of the workload
A.4 Analysis of accuracy
A.4.1 Random numbers
A.4.2 Sensitivity to random numbers
A.4.3 Picking the simulated time
B Queueing model
B.1 Multiple queue multiple server
B.2 Single queue multiple server
B.3 Comparing single queue and multiple queue systems
C Simulation Results
D Glossary of terms

List of Tables

1.1 OSI 7 layers of protocols
1.2 Processing power required for protocol processing in a 1000 megabit network
2.1 Memory access type
2.2 Summary of memory access based on static code count
2.3 Hit ratios for simulating the SSM design
2.4 Parameters used for simulating the SSM design
3.1 Hit ratios for simulating the LIM design
3.2 Variation of maximum throughput with processor speed
4.1 Hit ratios for simulating the SPR design
4.2 Variations of maximum throughput when the burst access speed is changed
4.3 Average processing time using 10mips processors in a 0.5s interval
4.4 Average processing time using 20mips processors in a 0.5s interval
4.5 Average processing time using 40mips processors in a 0.5s interval
4.6 Average processing time for the IPR design in a 0.5s interval
4.7 Parameters used for finding the max throughput
4.8 Hit ratios for finding the max throughput
4.9 System performance when the input rate is higher than the processing throughput
5.1 Summary of system characteristics
A.1 Number of access to PDUs for the different layers
A.2 Number of state access for the different layers
A.3 Number of instruction access for the different layers
A.4 Effects on Throughput with different workload
A.5 Standard Deviation of Peak Throughput (MBytes/s)
A.6 Standard Deviation of Random Access Bus Usage
A.7 Standard Deviation of Burst Access Bus Usage
A.8 Variation due to different simulated time
C.1 SSM design with 10mips processors
C.2 SSM design with 20mips processors
C.3 SSM design with 40mips processors
C.4 LIM design with 10mips processors
C.5 LIM design with 20mips processors
C.6 LIM design with 40mips processors
C.7 SPR design with 10mips processors
C.8 SPR design with 20mips processors
C.9 SPR design with 40mips processors
C.10 IPR design using 40mips processors
C.11 IPR design using 40mips processors and 2 times transfer
syntax operations
C.12 IPR using 40mips processors and fast memory

List of Figures

1.1 OSI layer concept
2.1 A single queue multiple server system
2.2 A multiple queue multiple server system
2.3 Variation of effective processor cycle time for different instruction hit ratio
2.4 Variation of effective processor cycle time for different data hit ratio
2.5 Variation of effective processor cycle time for different ratio of register only to all instructions
2.6 Abstract model of the protocol processing system
2.7 Determining the average number of retries needed for getting a lock
2.8 Shared resources in the system
2.9 Block diagram of the SSM design
2.10 Number of times a packet is moved during processing
2.11 Analytical and Simulation results of the throughput for the simple shared memory design
2.12 Bus utilizations of the random access and burst access bus for the SSM design
2.13 Throughput and bus utilization of the SSM design with 20mips and 40mips processors
3.1 The Local Instruction Memory design
3.2 Analytical and Simulation results of the throughput for the LIM design
3.3 Bus utilizations of the random access and burst access bus for the LIM design
3.4 Throughput and bus utilization of the LIM design with 20mips and 40mips processors
4.1 The Simple Packet Relocation design
4.2 Number of packet relocations for the SPR design
4.3 Analytical and Simulation results of the throughput for the SPR design
4.4 Bus utilizations of the random access and burst access bus for the SPR design
4.5 Throughput and bus utilization of the SPR design with 20mips and 40mips processors
4.6 Comparison of the LIM design and the SPR design with different burst access speeds
4.7 Number of packet relocations for the IPR design
4.8 Throughput of the IPR design using 40mips processors
4.9 Resource utilization of the IPR design using 40mips processors
4.10 Throughput of the IPR design using 40mips processors and fast memory
4.11 Resource utilization of the IPR design using 40mips processors and fast memory
5.1 Maximum throughput variation with different packet lengths
A.1 Overview of the simulation program
A.2 Throughput variation with contents of workload
B.1 A multiple queue multiple servers system
B.2 A single queue multiple servers system

Acknowledgements

I would like to express my sincere thanks to Dr. Mabo Ito for his guidance throughout the course of my research. I gratefully acknowledge the valuable comments, suggestions and simulation tools provided by Leonard Takeuchi. I would also like to thank my committee, in particular Dr. Hussein Alnuweiri, for their insightful suggestions that enhanced the quality of my thesis.

I would like to extend my appreciation to Murray Goldberg and Mark McCutcheon from the high-speed networking group at UBC for answering my questions on protocol processing.

A very special thanks to IBM Canada Laboratory for granting me the education leave of absence so that I could continue my education. Special thanks also go to the Natural Sciences and Engineering Research Council of Canada for awarding me the post graduate scholarship.

Thanks also go to Barry Tsuji, William Low and Jeffrey Chow for the enlightening discussions throughout the course of my work.
Lastly, I would like to thank my mom, my dad and my sisters for their support, encouragement and understanding, without which the work presented in this thesis would not be possible.

Chapter 1

Introduction

High performance computer architectures have been designed to deal with different computational needs. However, a single application will usually consist of a number of tasks having different computational requirements. One single architecture is thus unable to satisfy the needs of every application, and peak performance of the architecture cannot be achieved as the system must spend part of its time working on computations it was not designed for. A heterogeneous computing environment solves this problem by having a number of application specific architectures in the system so that each portion of the task can be processed at the fastest possible rate [13].

Computers within a heterogeneous computing environment have to communicate with each other to exchange information. Since each computer may have its own data format, a suitable transfer syntax is needed to enable the exchange of data. The OSI protocol stack provides a model which computers can use to communicate. In a high performance computer system, a high speed network would be required to handle the communication traffic. The need to perform transfer syntax conversion in a high performance computer system turns out to be a major burden computationally because a large amount of data has to be transformed in a short period of time. The problem becomes worse with the use of application specific architectures in the heterogeneous environment because these architectures may not be efficient in performing protocol processing operations. Thus, a stand alone protocol processing system is desirable.

An example of such a system may consist of a database server machine optimized for data access, a vector processor for computations and a workstation for user interface. Protocol
processing systems capable of high speed operations could be installed at the database server and the vector processor, as these specialized machines would not be optimized for protocol processing. The workstation may or may not be equipped with an external protocol processing system depending on the traffic and the cost of such a system.

1.1 Guide to the rest of the thesis

The remaining sections in this chapter describe the OSI protocol reference model and some basics of parallel processing. Related research activities, motivations for the research, our approach and objectives are also discussed. Chapter 2 explains the major design issues that have to be resolved for the multiprocessor protocol processing system. An analytical model is developed for a simple shared memory design and the behavior of this design is examined using both the analytical approach and the simulation approach. Chapter 3 deals with the effects of locality and studies the performance improvements that are obtained with the improved architecture. Chapter 4 discusses other means of achieving locality and thus increasing the system throughput by means of a further improved architecture. Chapter 5 gives a summary of the results and discusses their implications. Chapter 6 draws some conclusions based on the results. Details about the simulation program, data and simulation results are given in the appendices.

1.2 Background on protocol processing

Open System Interconnection (OSI) is a reference model defined by the International Organization for Standardization (ISO) for the communication between computer systems [32]. The OSI model consists of 7 layers, each of which has a unique function as defined by a service definition and a protocol specification. The service definition specifies the activity between adjacent layers and the protocol specification defines the interaction between peer entities on the communicating machines.
The seven layers include the application layer, the presentation layer, the session layer, the transport layer, the network layer, the datalink layer and the physical layer. A brief description of each layer is given in table 1.1.

Table 1.1: OSI 7 layers of protocols

Application: deals with the distributed information services and access of the OSI environment by users.
Presentation: deals with the data representation and includes conversions, compression and encryption of data.
Session: deals with the establishment, maintenance and termination of connections between co-operating applications.
Transport: deals with the reliable and transparent transfer of data between end points. Segmentation and reassembly of packets are also performed if the network packet size is fixed and smaller than the session packet size.
Network: deals with the establishment, maintenance and termination of connections. In a connection oriented system, a routing table has to be maintained to determine the path for routing packets.
Data link: deals with the reliable transfer of information across the physical link.
Physical: deals with the mechanical, electrical, functional and procedural characteristics to access the physical medium.

Communication between two systems is achieved by passing information in the form of protocol data units (PDUs) between peer layer entities. Since all peer layers are only connected logically except the physical layer, PDUs have to move down the layers until they get to the physical layer and then move up to the peer layer on the destination machine. Information is passed between adjacent layers using service data units (SDUs) through service access points (SAPs). Figure 1.1 shows the interactions between the PDUs and the layer entities. An (N+1) layer entity trying to communicate with its peer entity with a (N+1) PDU will send the PDU as a (N) SDU to the (N) layer entity in its own machine by invoking a (N) service primitive through a (N) SAP.
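The layered exchange described above can be sketched in a few lines. The layer names, header format and helper functions below are illustrative only, not part of the OSI definition: each layer wraps the SDU handed down from above with its own protocol control information to form that layer's PDU, and the receiver strips the headers in reverse order.

```python
# Hypothetical sketch of OSI-style encapsulation: each layer prepends its
# protocol control information (PCI, i.e. the header) to the SDU it receives
# from the layer above, forming that layer's PDU.

LAYERS = ["presentation", "session", "transport", "network", "datalink"]

def encapsulate(app_data: bytes) -> bytes:
    """Downward processing: add one header per layer, top to bottom."""
    pdu = app_data
    for layer in LAYERS:
        pci = f"[{layer}-hdr]".encode()   # stand-in for a real encoded header
        pdu = pci + pdu                   # (N) PDU = (N) PCI + (N) SDU
    return pdu                            # datalink header ends up outermost

def decapsulate(frame: bytes) -> bytes:
    """Upward processing: strip headers in reverse order."""
    sdu = frame
    for layer in reversed(LAYERS):        # outermost (datalink) header first
        pci = f"[{layer}-hdr]".encode()
        assert sdu.startswith(pci), f"bad {layer} header"
        sdu = sdu[len(pci):]              # pass the (N) SDU up to layer (N+1)
    return sdu

frame = encapsulate(b"user data")
assert decapsulate(frame) == b"user data"
```

A real stack would of course encode binary headers and invoke service primitives rather than concatenate strings; the sketch only shows the nesting order of the PCIs.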
The (N) layer entity will then communicate with its peer with a (N) PDU created by adding protocol control information (PCI) to the (N) SDU¹. This will continue until the physical layer is reached, at which point the information is sent.

¹The protocol control information is commonly known as the header.

Figure 1.1: OSI layer concept

On receiving a (N) PDU, the (N) layer entity will carry out the actions specified in the PCI. The (N) SDU will also be passed to the (N+1) layer entity if required.

Protocol processing for connection-oriented protocols can be divided into two parts, packet processing and connection control block manipulation. Packet processing includes the transfer syntax conversion, checksum calculation, encryption and decryption, and encoding and decoding of headers. Connection control block manipulation [15] includes state machine transitions, enqueuing of received packets and other synchronization operations. Connection control block manipulation requires only a few lines of code for each packet, while packet processing is more complicated and could take 20 or more instructions per byte of packet data.

In our design, we are assuming that the host machine is responsible for the application layer processing and the network adapter is responsible for the physical layer processing. Therefore, our protocol processing system is only responsible for the presentation layer down to the datalink layer. In our discussion, processing associated with sending a PDU is referred to as downward processing as it involves moving the PDU down the protocol stack. Processing associated with receiving a PDU is referred to as upward processing for a similar reason.

Upward processing is more complicated and time consuming than downward processing. This is because for upward processing, we have to reassemble packets into bigger ones in the transport layer, and since packets could come in out of sequence, special care must be taken to make sure that all the required packets have arrived before carrying on with the processing. Decoding of the packets in upward processing also requires more time than encoding in downward processing. For decoding, we do not know what is coming in and have to decode the data bit by bit, whereas for encoding, we have all the information required. The need to compute the checksum for error detection also makes upward processing very time consuming. For downward processing, checksum calculation can be done in parallel with the data copy operations and thus no extra time is required. The definition of OSI uses a finite state machine and thus the actual sequence of the PDUs has to be preserved. In a uniprocessor system, this could be achieved easily, but in a multiprocessor system, PDUs can get out of sequence and explicit means are needed [15] to ensure that they are kept in order. This requirement of processing packets in sequence also makes upward processing more difficult than downward processing.

1.3 Motivation for the research into a multiprocessor design

One of the major difficulties in OSI protocol processing at a gigabit network rate is the large amount of processing power required for the transfer syntax conversion operations, which account for more than 90% [9] of all the protocol processing power. The computational power needed cannot be easily met with a single processor if we would like to make use of all the communication bandwidth available to us. Table 1.2 gives the processing power required for different transfer syntax conversion complexities in terms of the number of instructions per byte². For a gigabit network, data is coming in at a rate of 125MBytes/s and the processing power needed is tabulated as shown in table 1.2.
Even with an Alpha processor, which could provide up to 400mips of processing power, we see that in a gigabit network a number of them would still be needed. Since we are building the protocol processing system as a peripheral device to the host, we would like to keep the cost down by using slower processors. Therefore, in terms of cost and technology, a multiprocessor approach provides a possible solution. The use of a multiprocessor design could also allow an incremental increase in processing throughput if not all of the available network bandwidth is needed now, and also provides a higher fault tolerance than a single processor system.

Table 1.2: Processing power required for protocol processing in a 1000 megabit network

number of instructions/byte    processing power (mips)
10                             1250
20                             2500
40                             5000

²Refer to appendix A for details.

1.4 Background on parallel processing

The speed at which a processor runs depends a lot on the technology used in fabricating the processor. For some applications, the processing power required exceeds that which can be offered by a single processor. A practical solution in those situations is to incorporate a number of processors into the system instead of having only one. A multiprocessor system could either speed up the processing of a task by dividing the task into subtasks and working on these subtasks in parallel, or increase the system throughput by pipelining tasks to raise the rate of task completion. The former approach would benefit applications which can be subdivided into smaller tasks easily, while the latter approach is more suitable for applications where new tasks can be started in a shorter time than the actual processing time of each task. The performance improvement will be affected by the granularity of the subtasks or pipeline stages.

Many types of multiprocessor architectures have been proposed and Flynn [12] has classified them based on the number of instruction and data streams.
In a single instruction multiple data streams (SIMD) machine, the same operation is executed on each datum in a uniformly structured data set. In a multiple instructions multiple data streams (MIMD) system, there are several independent computers, each capable of executing its own program, and consequently the resulting system is more suited to less structured operations.

The choice of a particular architecture would be very dependent on the characteristics of the application itself. More discussion of the system architecture is given in the next chapter.

1.5 Related Research Activities

Research work on protocol processing techniques for high speed networks has increased with the advent of optical network technology. Different approaches have been taken by researchers to tackle the high processing power requirement for OSI protocol processing. One approach is to speed up the transfer syntax conversion operations, as these are the most CPU cycle consuming operations of protocol processing. Huitema et al. [18] proposed a replacement of the ASN.1 basic encoding rules with a light weight syntax to improve the performance. They found that an improvement of 6 times was possible and a higher increase could be obtained for some particular data structures. The problem with using a new transfer syntax is that it will take a long time before it can be accepted as a standard. Joseph [23] parallelized the transfer syntax conversion operations using specialized RISC processors. He was also able to speed up the processing by about 5 times. The limitation of Joseph's approach is that the speedup depends on the complexity of the data packet, and so for simple data types the speedup is low.

Another approach is to deal with the lower protocol layers only in the protocol processing system and leave the higher layers to the host. Kanakia et al. [25] designed a highly specialized network adapter board (NAB) for the light weight VMTP transport protocol using a multiprocessor system.
Giarizzo et al. [14] exploited the inherent parallelism in protocols by using a multiprocessor-based communication subsystem. Braun et al. [3] described a transputer based approach for implementing a gateway. The results of these studies proved that a multiprocessor approach could handle high speed networks with a throughput of more than 100MBits/s at the transport layer. Jain et al. [21] increased the throughput further to GBits/s rates with their parallel processing design. The problem with leaving the higher layers to the host is that with a high network rate, the host will have to spend most of its time doing protocol processing and not on the application using those data. Out of the large number of research efforts, only a few have attempted to deal with the higher protocol layers. Goldberg et al. [15] and Zitterbart [40] described ways of parallel processing the protocol stack up to and including the presentation layer using a small number of processors and showed that a significant improvement in throughput can be obtained.

Takeuchi [37] presented a couple of designs which could provide higher layer processing using up to 50 processors with a processing throughput of 100MBits/s. The first design included both local and shared memory. In that design, the shared memory was used to store the state information and other synchronization variables while the local memory was used for everything else, including packets. Packets coming into the system were assigned to the processors using a round robin scheme. The second design used only shared memory. All instructions, packet data and state information were stored in the global shared memory. The result of Takeuchi's research indicated that the interconnection between the processors and also the job allocation algorithm significantly affect the processor utilization, which in turn results in the nonlinear increase in system throughput with the number of processors.
The shared memory design was found to scale poorly and was not able to provide a 100MBits/s throughput because of bus contentions. The mixed memory design was able to give a throughput of 100MBits/s but suffered from a low processor utilization.

1.6 Goals and approach of study

Our research attempts to solve the problems observed in Takeuchi's design and to look at higher speed networks. Designs presented in this thesis are based on the principle of locality and software cache prefetching techniques. In software cache prefetching, hit ratios are improved by predicting what data will be used next and loading them into the cache before they are used. In our case, we would like to use a similar idea to reduce the amount of shared memory contention. It is believed that with specific knowledge of the application, we could put the data in the best possible location at any given time to ensure a high degree of temporal and spatial locality. An incremental approach is taken here to look for a suitable architecture. Designs with increasing complexities are studied. Both analytical and simulation methods are used to evaluate our results. Objectives of the research are summarized as follows:

• Design a system that could handle the OSI processing requirement of a gigabit network.

• Design such a system in a simple and economical way. This is important because we do not want the protocol processing unit, which is designed as a peripheral, to be more complicated or expensive than the host machine it is attached to.

• Design a system with good linear increase in throughput and high utilization of all system resources.

Chapter 2

Design issues

In the design of a multiprocessor system, many design issues have to be resolved. The following sections describe some of the system requirements and the major choices that have been made to optimize the system performance.
An analytical method for determining the system performance is also discussed.

2.1 System requirement

In this research, we would like to design a multiprocessor based protocol processing system capable of all the processing required from layer 2 through layer 6 of the OSI protocol stack. The design should maximize the resource utilizations and processing throughput at a reasonable complexity and cost.

2.2 Method of parallelization

Protocol processing can be parallelized in a number of ways based on the granularity of the unit of parallelization [15][40]. The atomic unit can be a connection, a protocol layer, a function or a packet. The resulting multiprocessor system would either increase the peak throughput or speed up the processing, as with the other parallel processing systems discussed in section 1.4.

For a processor per connection system, improvement of performance is obtained from the concurrent processing of different connections. The processing time remains the same and the increase in system throughput depends on the number of concurrent connections and the amount of activity in each. If the load is not balanced for each of the connections, the effectiveness of a parallel design is low, as only those heavily used connections can contribute to the improvement in performance.

Using protocol layers as a unit of parallelization would again increase the system throughput. The system works as a pipeline and its performance is governed by the slowest stage. The high processing power needed for the presentation layer and the synchronization needed between layers make this design undesirable. We could improve this design by subdividing the presentation layer into more pipeline stages, but we would also introduce more synchronization overhead which may offset any improvement.

In the processor per function design, the processing time is reduced from the concurrent processing of different protocol functions.
The performance is highly dependent on the amount of embedded parallelizable tasks in the protocol. A high synchronization overhead would be incurred because of this sort of parallelism.

The final approach uses a processor per packet design. This design improves the system throughput without speeding up the processing of individual packets. With this arrangement, each packet is assigned to a processor which carries out all the processing required for that packet. Since the processor will carry the packet through all the layers, synchronization between layers as in the processor per layer design is not required. Without regard to which connection a packet belongs, the system can still derive the parallelism out of the processing as long as there are packets in the system. The inherent parallelism needed for the processor per function design is not required. The only drawback of the processor per packet approach is that it is possible for some packets to pass each other in the stack, and extra synchronization operations are needed to maintain the logical order of the packets. The proper order can be ensured by the use of an internal sequence number [15], which adds to the overhead. Despite the extra overhead, this design is taken because of the higher potential gain in processing throughput.

With the packet per processor design, we are parallelizing the transfer syntax conversion operations, which normally do not require access to the shared state information and do not need any synchronization for the encoding and decoding functions. Multiple processors could be performing transfer syntax conversions for a number of packets in parallel once their headers, which must be processed sequentially, have been worked on.
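The per-packet scheme with internal sequence numbers [15] can be sketched as follows. The packet format and helper names here are hypothetical: packets finish processing in any order, and a small reordering stage releases results in their original sequence.

```python
# Sketch (hypothetical, not the thesis's code) of the processor-per-packet
# idea: packets are processed in any order, but an internal sequence number
# is used to release results in their original logical order.
import heapq

def process(pkt):
    """Stand-in for the full per-packet protocol work."""
    return pkt["data"].upper()

def reorder(results):
    """Yield processed packets in sequence-number order as they complete."""
    pending, expected = [], 0
    for seq, out in results:          # completion order may differ from seq order
        heapq.heappush(pending, (seq, out))
        while pending and pending[0][0] == expected:
            yield heapq.heappop(pending)[1]
            expected += 1

pkts = [{"seq": i, "data": d} for i, d in enumerate(["a", "b", "c"])]
done = [(p["seq"], process(p)) for p in reversed(pkts)]   # finish out of order
assert list(reorder(done)) == ["A", "B", "C"]
```

The buffering in `pending` is exactly the extra synchronization overhead the text refers to: a packet that overtakes its predecessors must wait until the earlier sequence numbers have been released.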
Each processor can be executing different instructions at any time, because different packets could be in different layers, and a MIMD type of architecture would be suitable for this application.

2.3 Single queue versus multiple queues

In a per packet based model, each packet forms a basic unit for parallel processing. A decision arises as to whether these units of work should be put into a single queue and distributed to the processors as the latter become ready, or whether there should be a queue for each processor, with the basic work units put into these queues according to some scheduling algorithm. These two configurations are shown in figures 2.1 and 2.2.

Figure 2.1: A single queue multiple server system

Figure 2.2: A multiple queue multiple server system

These two approaches can be compared using classical queueing theory. The delay time per packet of the single queue design can be proved to be less than that of the multiple queue design; the detailed analysis is given in appendix B. Intuitively, it can be seen that a single queue design will give the maximum processor utilization, as none of the processors in such a system can become idle if there is something in the queue. On the contrary, in a multiple queue system, processors may be idle even while some queues are not empty, depending on how intelligent the scheduling algorithm is in estimating the processing time for each packet. The sequential nature of protocol processing, in which packets have to be processed in a predefined sequence, could complicate the scheduling algorithm for the multiple queue system and therefore worsen the situation. Some of the processors may also be forced to become idle even if their job queues are non empty, if they do not have the packet with the proper sequence number.
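The delay advantage of the single queue can be illustrated with textbook queueing formulas. This sketch compares one shared M/M/c queue against c independent M/M/1 queues under the same total load; the arrival and service rates are illustrative, not values from this thesis.

```python
from math import factorial

def mmc_time_in_system(lam, mu, c):
    """Mean time in system for an M/M/c queue, via the Erlang C formula."""
    a = lam / mu                                    # offered load in Erlangs
    tail = (a ** c / factorial(c)) / (1 - a / c)    # queueing term
    erlang_c = tail / (sum(a ** k / factorial(k) for k in range(c)) + tail)
    return erlang_c / (c * mu - lam) + 1 / mu       # waiting + service time

lam, mu, c = 3.0, 1.0, 4                    # illustrative: 4 processors at 75% load
w_single = mmc_time_in_system(lam, mu, c)   # one shared queue, 4 servers
w_multi = 1 / (mu - lam / c)                # 4 separate M/M/1 queues, lam/4 each
print(round(w_single, 2), round(w_multi, 2))  # → 1.51 4.0
```

At 75% load the shared queue more than halves the mean delay, in line with the appendix B result; the gap widens as the load grows or the load split becomes uneven.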
Takeuchi [37] found that the processor utilization of the multiple queue system could be below 50%.

The single queue structure has been chosen in our designs because of its better delay time characteristics, its higher resource utilization and its simplicity in terms of implementation. The multiple queue system is not considered further in our discussion.

2.4 Slow versus fast processors

In protocol processing, packet data have to be accessed a number of times for checksum calculation, transfer syntax conversion and reassembly operations [2]. There is very little immediate reuse of these packet data after each access, and thus normal caching techniques are not effective in increasing the system speed. The use of caches for this kind of operation may even be harmful, as useful data may be displaced from the cache [35] when the packet data are accessed. Without the benefits of a cache, packet data accesses would have to go through the main memory most of the time, and the access rate would be dependent on the memory speed. Since there is a gap between processor speed and memory cycle time [16], no real benefit can be obtained by using a fast processor if the main memory speed cannot be increased with it.

The effective speed of a processor, given the memory cycle time, can be estimated with knowledge of the application statistics. To develop an analytic model, the following notation is used:

x = portion of register only instructions out of all instructions
h_i = instruction hit ratio
h_d = data hit ratio
t_c = cache cycle time
t_p = processor execution time (for register only instructions)
t_m = memory cycle time
t_ave = average system execution time
t_i = average instruction execution time

We assume that if an instruction is in the cache, there is no extra overhead in accessing the instruction, and that an instruction fetch starts while an instruction is being executed.
Thus

\[ t_i = [h_i \cdot 0 + (1 - h_i)(t_m - t_p)] + t_p \]

Let \(t_m = \beta t_p\) and \(t_c = t_p\), where \(\beta\) is the ratio of memory to processor speed. Therefore

\begin{align*}
t_{ave} &= x\,t_i + (1 - x)[t_i + h_d t_c + (1 - h_d) t_m] \\
        &= t_i + (1 - x)[h_d t_c + (1 - h_d) t_m] \\
        &= (1 - h_i)(t_m - t_p) + t_p + (1 - x)[h_d t_c + (1 - h_d) t_m] && \text{substituting } t_i \\
        &= (1 - h_i)(\beta - 1)t_p + t_p + (1 - x)[h_d t_p + (1 - h_d)\beta t_p] && \text{substituting } t_m \text{ and } t_c \\
        &= t_p\{\beta - \beta h_i + h_i + (1 - x)[h_d + (1 - h_d)\beta]\} \\
        &= t_p\{\beta[1 - h_i + (1 - x)(1 - h_d)] + h_i + (1 - x)h_d\} \\
        &= \beta k_1 + k_2
\end{align*}

where \(k_1 = t_p[1 - h_i + (1 - x)(1 - h_d)]\) and \(k_2 = t_p[h_i + (1 - x)h_d]\) are application specific constants.

To achieve peak performance of the processor, t_ave/t_p should be as close to 1 as possible. To minimize the effect of β, it is also desirable to have k_1 as close to 0 as possible. In other words, we would like the hit ratios and the percentage of register only instructions to approach 1.

Graphs of t_ave/t_p versus β have been plotted for various values of h_i, x and h_d, as depicted in figure 2.3, figure 2.4 and figure 2.5. The effect of each of them on the effective processor cycle time is studied by fixing two of the three variables in each of the graphs.

Figure 2.3: Variation of effective processor cycle time for different instruction hit ratio

Figure 2.4: Variation of effective processor cycle time for different data hit ratio

Figure 2.5: Variation of effective processor cycle time for different ratio of register only to all instructions

We can observe that the effective processor cycle time is most sensitive to the instruction hit ratio and least sensitive to the percentage of register only instructions. This is because in our application, a high number of instructions is executed in the syntax conversion phase. The slope of the lines varies greatly in the former case but not in the latter.
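Plugging illustrative numbers into the closed form t_ave = βk₁ + k₂ makes the sensitivity concrete. The hit ratios and register only fraction below are assumptions in the spirit of the later simulation settings, not measured values.

```python
def t_ave_over_t_p(beta, hi, hd, x):
    """Effective cycle time relative to t_p: beta*k1 + k2 (k1, k2 as above, per t_p)."""
    k1 = 1 - hi + (1 - x) * (1 - hd)   # coefficient of beta; want this near 0
    k2 = hi + (1 - x) * hd
    return beta * k1 + k2

# With high hit ratios and 75% register only instructions, memory ten times
# slower than the processor already roughly doubles the effective cycle time.
ratio = t_ave_over_t_p(beta=10, hi=0.975, hd=0.75, x=0.75)
print(f"{ratio:.2f}")
```

The result comes out just above 2, the 50% utilization threshold discussed next; lowering h_i to 0.9 in the same call pushes the ratio past 2.7, reflecting the sensitivity to the instruction hit ratio noted above.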
To ensure that we can utilize at least 50% of the processing power, we would like to keep t_ave/t_p less than or equal to 2. This value of t_ave/t_p can be used to estimate the value of β and the corresponding hit ratios and percentage of register only instructions required.

From figures 2.3, 2.4 and 2.5, we can see that even with high instruction and data hit ratios and a moderately high percentage of register only instructions, we can only use memories that are around 10 times slower than the processor if we are to provide an effective cycle time less than or equal to two times that of the processor. Memory which is slower than the processor by more than 10 times is definitely undesirable, as shown in figures 2.4 and 2.5. When the hit ratios are lowered, the speed of the memory must be increased to compensate for the loss.

Keeping a high instruction hit rate is possible by using various prefetching techniques [22][24][26][28][6], but keeping a high data hit rate is a major challenge. Since the data we are referring to here are the packets, which keep coming through the system and do not stay in the system for a long time, prefetching them into the cache may not be worthwhile. The cost of issuing prefetches could be high for data packets, which vary in length and processing time.

Based on the above discussion, it can be concluded that by using fast processors without matching memory devices, we will not be able to fully utilize the fast processors. The resulting system performance may even be poorer than that of a slow processor system with matching memory devices. On the other hand, using a high number of slow processors introduces the problem of bus and memory contention. Thus a tradeoff has to be made in picking the processor speed. For this reason, the designs presented in this thesis are restricted to processors with speeds 2 to 8 times faster than memory.
For a typical memory cycle time of 200ns, the corresponding processor speed is around 10 to 40 mips.

2.5 Abstract model of the system

In the previous sections we have described some of the major issues in the architecture. From our discussions, we have decided that the protocol processing system would be a MIMD system having packets as the unit of parallelization. A single active queue would be used for all jobs to ensure a high resource utilization and low latency. Processor speeds between 10 mips and 40 mips, with the corresponding memory speed at around 200ns for random access, will be used.

A simplified model of the system is given in figure 2.6. All packets coming into the system, from the host or the network, are enqueued into a single active job queue in a FIFO manner. When a processor from the pool is ready, the first job in the queue is sent to it for processing.

Figure 2.6: Abstract model of the protocol processing system

Owing to the sequential processing requirement, some jobs cannot be processed further while the jobs ahead of them have not yet been finished, and a strict processor per packet design cannot be realized. These jobs have to be suspended and put into a suspended queue. Since each connection operates independently from the others, there are separate suspended queues for each connection. The queues are further divided into upward and downward suspended queues for each connection oriented layer within a single connection. The upward queues are for packets going up the protocol stack and the downward queues are for packets going down the protocol stack.

Jobs in suspended queues will be moved to the active job queue once the preceding job has been processed.
When a processor finishes processing a packet in any connection oriented layer, it looks at the suspended queue of the particular connection and layer concerned and moves the first suspended job to the active queue if that job can now be processed.

This simplified model does not take into account the other resources, like the shared memory or the interconnection structure between them. A detailed description of these resources is given in the following sections.

2.5.1 Interconnection requirement

State information indicating the locations of jobs, connection statistics and other synchronization data are used by processors to determine how a job is to be processed. In a multiprocessor system, information can be shared using a message passing or a shared memory paradigm. For the message passing paradigm, the processor sending the message must know which processor will process the next packet of the same connection that the sender is working on. This information is not readily available, because we do not know which processor is going to process the next packet in that particular connection, and thus the sender would have to send the shared data to a well known message server instead. Processors requesting state information have to communicate with this server using a client server model. In such a system, the message server could become a bottleneck.

For a shared memory design, information is written directly into some shared memory locations without going through a server. The particular memory location could become a bottleneck, and some explicit locking mechanisms are required to handle concurrent requests. Since the implementation of the server is more complicated than a shared memory design, and since the two paradigms would both have a similar contention problem, the simpler shared memory design is employed in our system.

A bus is used to implement the shared memory paradigm because it is the easiest design to implement.
By giving each processor the same priority for bus arbitration and using a first come first serve arbitration scheme, we can maintain fair access to the bus by all processors. Other interconnection networks would be more complicated and should only be considered if the bus design cannot handle the memory traffic.

2.5.2 Number and kind of buses needed

The performance of the shared memory design is affected by the contention problem of the shared resources. Here, we have to deal with both the bus and memory contention problems. Bus contention occurs when more than one processing unit tries to acquire the bus, and can be avoided by using a bus sufficient system. In a bus sufficient system with m memory modules and p processors, the number of buses b ≥ min(p, m) and there will not be any bus contention. Memory contention occurs when more than one processing unit tries to access the same memory port at the same time, and can be avoided by using multiple memory modules and placing data into different modules to avoid conflicts. These two problems are interrelated and are considered together when we are determining the proper system configuration.

Two major types of access modes can be identified for the protocol processing application. They are the burst access, where a whole block of data is moved, and the random access, where only a small chunk of data is accessed each time. The burst access mode is used when packets are copied from the interface to the main memory for processing, and from the main memory to the interface when processing is done. It is also used for checksum computation when the packet is sent through some specialized hardware¹. Random access is used for the reading and writing of state tables and queues, and also for accessing instructions and packet data.

Naturally, it makes sense to have two types of buses to match the access modes. One bus should be optimized for random access and another optimized for burst access.
Another advantage of this configuration is that dual ported memory devices optimized for these types of operations exist in the form of video random access memory [30].

As mentioned above, the problem of bus contention can be solved by using a bus sufficient system for both buses. However, we could not easily make use of the buses available to us because of memory contentions. Owing to the sequential nature of protocol processing, and because the actual processing time for each packet is different, we cannot place the packets into the memory modules in such a way that no memory contention will occur without some extensive data movements. The high cost of a bus sufficient system and the time and resource consuming data movements make a bus sufficient system unattractive for our application. Therefore, we are not going to consider a bus sufficient system unless the results reveal that there is a need to use more than two buses.

¹ Refer to section 2.5.6.

2.5.3 Locking mechanism

In our design, requesters for a particular lock are responsible for getting and releasing the lock.² The requester uses a test-and-set atomic instruction to obtain a lock. If the lock is not available, a busy wait mechanism is used: the requester will wait a certain interval and then retry the test-and-set until the lock is obtained. The choice of the retry interval has a major effect on the system performance. If it is too long, the latency increases. If it is too short, many retries may be needed to get a lock. These retries will generate extra traffic on the random access bus and affect the speedup behavior.

This simple locking mechanism is used because of its ease of implementation. Since each processor attempts to get a lock independently, without any particular co-ordination, it is possible that some processors may never get the lock they want.
The severity of this problem will be looked at through simulations.

To determine the number of attempts needed to obtain a lock, the expected number of retries has to be found. This expected value depends on the probability that a lock is needed and on the probability that the lock is already locked. The latter is in turn a function of the system load. Since the system load changes dynamically with the type of packets and also with the creation and removal of connections, an exact analysis of the number of attempts is complex and intractable. Therefore, instead of obtaining an exact value, a rough estimate of the number of retries is used. To do so, we have assumed a case in which all n processors attempt to get the same lock at the same instant, and that once the lock is obtained, the processor concerned will move on to some other jobs and not request the same lock again before all remaining requesters have finished with that lock. Without loss of generality, assume that the processors are numbered in the order in which the lock is acquired, as in figure 2.7. Then the last processor has to wait (n−1) time intervals while the first processor does not have to wait. Assume that the lock is held for y units of time, and that if a lock is not available, the processor will try again a time units later.

² Note that the problem of fault recovery is not discussed here, for it is out of the scope of the current work.

Figure 2.7: Determining the average number of retries needed for getting a lock (y = lock holding time, a = retry interval)

In the worst case, where the lock becomes available immediately after the processor makes a request, the processor would need to wait for another a time units. Thus the time interval between successful lock requests would be y + a time units³.
Using this assumption, we can estimate the average number of retries as

\[ \bar{r} = \frac{(y + a)(n - 1)}{2a} \tag{2.1} \]

³ Locking is a major problem in parallel systems and more work needs to be done in this area.

2.5.4 Analysis

The performance of a dual bus shared memory system can be estimated by looking at the processing throughput and the usage of the three shared resources, namely the processors, the random access bus and the burst access bus, shown in figure 2.8. The memory is not considered here because we only have a single access path to the memory devices in our design; looking at the buses is therefore sufficient, because once the bus is obtained, the memory device will also be available. An analytic model and a simulation program are developed to estimate the system processing throughput and resource utilizations. The SSM design proposed by Takeuchi [37] is analyzed using these methods, and its results are used as a basis for comparison with our designs described in later chapters.

Figure 2.8: Shared resources in the system

Maximum throughput based on the number of processors

Since over 90% of the protocol processing time is spent on the transfer syntax conversion phase [9], we can estimate the maximum throughput based on the processing power needed for transfer syntax conversion.

Let γ = instructions required per byte for transfer syntax conversion
    n = number of processors
To get a more accurate estimation, we have to look at the utilization of theshared buses.Maximum throughput based on the random access bus usageAssume that we have a Von Neumann architecture. Memory accesses on the random accessbus can then be classified as instruction and data accesses. Instruction access refers to thefetching of protocol processing code and data access refers to the reading and writing of packetheader, data and other state information. Since we are looking at a complete peripheral,the operating system code is included in the protocol processing code. To facilitate theanalysis, instruction access is further divided into transfer syntax conversion related andheader processing related accesses. Data access is also divided into state information, packetheader and packet data accesses. The following notation is used in our analysis:Category Access Type1 Instruction: Transfer syntax conversion2 Instruction: Header processing3 Data: State, Locks4 Data: Packet header5 Data: Packet dataTable 2.1: Memory access typeLet 7 = average number of instruction access per byte for processing a packetUm = memory cycle time for random access1lim +5^5Prandom (1 Ep,h,)Et,1 i=1^i=1Chapter 2. 
    μ_e = processor execution time
    t_i = number of memory accesses per packet in category i
    p_i = fraction of memory accesses in category i out of all memory accesses
    h_i = hit ratio for category i
    l = average length of a packet
    ρ = processing throughput
    ρ_random = maximum processing throughput derived from the random access bus usage
    p_n = network packet size
    r_t, r_s, r_p = number of retries for the transport, session and presentation locks respectively

Therefore,

    average hit ratio = Σ_{i=1}^{5} p_i h_i

    average miss ratio = 1 − Σ_{i=1}^{5} p_i h_i

    average number of memory accesses per packet = (1 − Σ_{i=1}^{5} p_i h_i) Σ_{i=1}^{5} t_i

    average number of memory accesses per second = (ρ/l)(1 − Σ_{i=1}^{5} p_i h_i) Σ_{i=1}^{5} t_i

Assume that when a processor needs to access memory, it has to acquire the bus, access the memory location, finish its operation and then relinquish the bus. Thus each memory access takes μ_m + μ_e units of time, or

    maximum number of memory accesses per second = 1/(μ_m + μ_e)

Therefore,

\[ \rho_{random} = \frac{l}{(\mu_m + \mu_e)\left(1 - \sum_{i=1}^{5} p_i h_i\right)\sum_{i=1}^{5} t_i} \]
In other words, we can expect the actual throughput to be higher thanPr andom •Maximum throughput based on the burst access bus usageThe analysis of the maximum throughput based on the burst access bus usage is somewhatsimpler than that based on the random access bus. Burst access is used in cases where blocksof continuous memory locations are accessed together. The maximum throughput can becomputed by estimating the number of data movements required. The following notationsare used in deriving the formula.= bandwidth of burst accessGam + itt e ) (EL i 4 —^ 4hi)1Chapter 2. Design issues^ 29mup^number of packet data movement in upward processingrndown = number of packet data movement in downward processingp burst = average memory cycle time for burst accessbus width (in bits)percentage of traffic that is from the net (upward)P burst = maximum processing throughput derived from burst access bus usagePburst^+ (1 x)^ MBytes/smup^mdown1wwhere ,3 = ^Aburst(2.4)The maximum throughput derived here gives an upper bound on the data output rate sincewe are ignoring any other usages that may be required. For instance, more than one accessis needed to access a block of data which crosses the memory device boundary.Estimation of the maximum throughputThe actual maximum throughput that can be obtained depends on the maximum throughputobtainable from both access modes and also the processing power available. The relationshipbetween the maximum throughput and the processing power is simple because a certainamount of instructions are needed per byte of packet data processed. Therefore, we have:PnlaX 5_ pprocessorSince there is a complex relationship between the two access modes 4, we can only use somekind of heuristic to estimate a value for the throughput.> Pburst, the maximum is approximately equal toWhen Prandom Nur st • WhenNur st^Prandom the maximum is approximately equal to prandom. 
When both values are'when the burst access bus usage increases, we get a higher throughput and this implies more sharedinformation access. The converse is true.Chapter 2. Design issues^ 30about the same, an upper bound of the maximum would be equal to the min[pburst,Prandom]•In other words, we havePbus =Pburst^ when Prandom > PburstPrandom when Pburst > Prandomwhen Pburst = PrandomMin[Pburstl Prandom} Since Pprocessor gives the maximum processing throughput that could be provided by theprocessors assuming that all other resources are available and Pbus gives the maximum pro-cessing throughput that could be supported by the buses assuming that all other resourcesare available, the maximum processing throughput can be estimated byPniax '=-{Pprocessor when Pprocessor < ()busPbus^when Pprocessor > Pbus(2.5)In other words, if the processors can process more data than the bus can support, the maxi-mum will be limited by pbus since there is no way for the system to keep the processors busyall the time. The converse is true because if we do not have enough processing power, thethroughput will be limited by 0, processor as the bus will not be fully utilized.2.5.5 Other parameters for measuring the performanceIn addition to optimizing the maximum processing throughput, we would like to maintaina high resource utilization. The resource utilization is a function of many parameters. Inestimating the resource utilization, we have to work out the details of the locking mechanismdescribed in section 2.5.3. We also have to deal with the changes in the system load whenpackets are suspended and reactivated. We also need to consider the different complexitiesin transfer syntax conversions which vary with the data in the packets. Finally, the speedand number of processors in the system and their effects on resource contention have to beincluded as well. Since a large number of parameters have to be taken into account, theanalysis becomes very complicated and hard to track. 
In our analysis, we have instead relied on simulations to give us an estimate of the resource utilizations, because it is easier to include the parameters mentioned above in a simulation program. We are mostly interested in the processor and bus utilizations, which are defined below. Processor utilization is defined here as the ratio of the maximum throughput per processor in the multiprocessor system over the maximum throughput in a single processor system. This value gives us an estimate of the amount of processing power used in synchronization activities and lost in bus contention in a multiprocessor system. Bus utilization is defined as the ratio of the time the bus is used over a period of time. This value simply measures how much of the bus is being used and is related to the memory access cycle time: if memory is slow, each transaction will take a longer time, and for the same number of accesses the bus utilization will be higher. Other parameters, like the number of packets processed or the average processing latency, are also collected and reported.

2.5.6 Other major components

A protocol processing system consists of a number of other essential components in addition to the processors, memory devices and the interconnection structure. These components have to deal with the interfacing tasks between the host and the system and between the network and the system. They are also used to speed up certain critical operations like calculating the checksum and implementing timers.

Host interface

The host interface is responsible for communicating with the host machine. Since our protocol processing system is designed as a peripheral to the host computer, this interface must be capable of working with the I/O port of the host. One of its major functions is to ensure that data are buffered in both directions to minimize data loss. In our analysis, only a high level design is used, and the actual design, which is machine specific, is not discussed.
Network interface

The network interface is used by the system to communicate with the network. It is responsible for taking care of the sending and receiving of data through the network. It could be combined with the network adapter to form a single unit. In that case, specialized circuitry would be needed based on the protocol used in the physical layer. For an ATM network, the interface would have to carry out the reassembly of cells and perform the checksum calculation. As in the case of the host interface, we only use a high level model for the network interface. Studies concerning the design of the network adapter can be found in [29] and [10].

Checksum unit

Checksum calculation is needed in the transport layer for reliable communication. Since checksum computation involves all the data in the packet, it is a very computation intensive operation. Using a specialized circuit for the calculation could improve the overall performance of the system [37] at a fairly low cost with a VLSI implementation. Processors in the system can then be freed to perform other tasks which are not as straightforward.

The checksum unit comprises a specialized adder circuit that computes the checksum as the data is streamed through it on the burst access bus. The actual number of checksum units will vary according to the design. Since there is only one burst access bus, a single checksum unit may seem to be sufficient. However, if the result is returned through the random access bus, then the tradeoff of using multiple units versus random access bus contention must be considered. In some of our designs, it is found that multiple units are more efficient than a single unit.

Timers

Many timers are used in protocol processing, especially in the transport layer, for ensuring a reliable transfer of data. The processing cost of implementing timers can be substantial [21]. Depending on the method of implementation, most of the processing can either be lumped
together in the phase for setting the timer or spread out in the phase for maintaining the timer. In the former case, an ordered list of the times remaining before expiry of the timers is maintained, and processing is needed when inserting a new timer. In the latter case, no special processing is needed for setting the timer, but processing is needed to count down each of them. In our design, we have taken the former approach and used a specialized processor to take care of all the timer processing. In our study, we are only interested in maximizing the processing throughput, and error and flow controls are thus not considered. Therefore, in terms of simulating timers, we have restricted ourselves to looking at the cost of setting them up, but not at the detailed operations when a timer expires.

DMA device

To facilitate the movement of large blocks of data on the burst access bus, DMA devices are needed. Since all processors and the host and network interfaces are capable of initiating a block transfer for moving packets, each one of them is equipped with a DMA unit to reduce the need for using the random access bus and shared memory, which would normally be utilized if only one single DMA unit were installed in the system.

Bus arbiter

Since the two shared buses are the major bottleneck of the system, we need an efficient scheme for allocating these resources. The first come first serve arbitration scheme is chosen to give the best possible performance [1]. Specialized hardware is needed to make sure that bus arbitration can be performed in parallel with bus transactions, to maximize the use of the buses. Other, simpler arbitration schemes are not used because of the importance of performance here.

2.6 The simple shared memory design

A simple shared memory (SSM) design based on the discussion in [37] is used to see if the simplifying assumptions made in the analysis are acceptable.
The results are also used as a basis for comparison with the other designs to be discussed later. The design of the system is shown in figure 2.9. It consists of all the major components described above, connected together using the two shared buses.

Figure 2.9: Block diagram of the SSM design (host interface, processor pool, checksum unit, timer processor, shared memory and network interface on the two shared buses)

In this system, there is no local storage, and both instructions and data are stored in the shared memory. Since everything is stored in the shared memory, the complete packet is moved only a few times during the processing. The number of movements is different when the packet is moved up and down the protocol stack, as shown in figure 2.10. For upward processing, once the packet enters the system, it is moved by the network interface into the shared memory, where most of the processing is performed. The packet is moved through the checksum unit during transport layer processing for error detection, and is moved a third time to the host interface when all processing is done. For all other processing steps, the packet remains in the shared memory and is accessed word by word through the random access bus. In total, three packet movements are required for upward processing.

Figure 2.10: Number of times a packet is moved during processing

For downward processing, the situation is similar, except that the packet does not need to be moved through the checksum unit in a separate step. Instead, the checksum can be calculated and inserted into the packet while the packet is being moved from the shared memory to the network interface, before it is sent out into the network. The number of data movements is thus reduced to two. This modification reduces the number of data movements but may be regarded as a violation of the protocol stack model [40].

The expected throughput is determined by making the following assumptions. Hit
ratios as shown in table 2.3 are used. Here we are assuming that buses are 32 bits wide.

    access category   Transfer Syntax   Header        Locks    Header data   Packet data
                      conversions       processings   access   access        access
    hit ratio         0.975             0.975         0        0.75          0.75

Table 2.3: Hit ratios for simulating the SSM design

For instruction accesses, we have assumed that about 75% are register only instructions⁵ and that the hit ratio of these is 1. For the remaining 25% of instruction accesses, we have assumed a hit ratio of 0.9. Thus on average, the hit rate is 0.975 for all instruction accesses. To simplify the analysis of synchronization operations, we have assumed the hit ratio for state and queue accesses to be 0. If we had assumed a non-zero hit rate, we would need to consider data caches and cache coherence schemes, which would complicate the analysis because of the extra coherence traffic on the shared buses. For data accesses, we have assumed that all the data fetched will be used; since 4 bytes are fetched each time with a 32 bit bus, a data hit rate of 0.75 is obtained. This value is a little optimistic, but since transfer syntax conversion may touch a large portion of the packet data, the error introduced is small.

The lock retry interval a is assumed to be a fixed value of 100ns. From the simulations, it was found that the lock holding time is about 10a for most situations. Therefore, from equation 2.1 the number of retries can be written as

    r ≈ 11(n − 1)/2

Other parameters used are shown in table 2.4:

    parameter   λ_random   a       l_h     l_n    γ    x     λ_burst   M_up   M_down
    values      200ns      100ns   10000   4096   17   0.5   50ns      3      2

Table 2.4: Parameters used for simulating the SSM design

Substituting these values into equation 2.3:

    ρ_random = (10000 × 8) / (300 × 10⁻⁹ (6028 + 24.42n)) MBits/s

Thus for n = 1, ρ_random = 44.08 MBits/s and for n = 20, ρ_random = 40.80 MBits/s.
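The substituted expression can be evaluated directly; the sketch below uses our own function and variable names, with the constants taken from the expression above (the printed values differ from the quoted 44.08 and 40.80 MBits/s only by rounding):

```python
# Evaluate the analytic random-access-bus throughput bound (equation 2.3,
# substituted form) for the SSM design. Names are ours; constants are from
# the expression in the text.
def rho_random_ssm(n):
    """Bound in bits/s imposed by the random access bus with n processors."""
    packet_bits = 10000 * 8              # average packet from the host, in bits
    access_time = 300e-9                 # effective shared memory access time (s)
    return packet_bits / (access_time * (6028 + 24.42 * n))

print(rho_random_ssm(1) / 1e6)   # ~44 MBits/s
print(rho_random_ssm(20) / 1e6)  # ~41 MBits/s
```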
If there is more than one connection, the amount of lock contention would be lowered, and thus there will be fewer retries and a higher throughput is possible. From equation 2.4, the maximum throughput that could be supported by the burst access bus is:

    ρ_burst = 264 MBits/s

Since ρ_burst > ρ_random, the bottleneck is at the random access bus and the maximum throughput would be closer to 40.80 MBits/s than to 264 MBits/s. Since γ = 17, the processing power required is around 85 mips, which can be obtained with eight to nine 10mips processors.

⁵Estimation of the ratio of register only instructions is discussed in appendix A.

Figure 2.11: Analytical and Simulation results of the throughput for the simple shared memory design

The analytical and simulation results of the maximum throughput are shown in figure 2.11. The simulation result follows ρ_processor up to about 8 processors. Thus a linear increase in throughput is only achieved with up to 8 processors. When more processors are added, the throughput saturates at about 44 MBits/s, as predicted by ρ_random.

From the graph of bus utilization in figure 2.12, we can observe that the random access bus is a major point of contention. Since all instruction and data accesses have to go through this bus, it saturates quickly with only about twelve 10mips processors. On the other hand, the burst access bus is hardly used. This also explains why ρ_burst, which is based on the burst access bus usage, is way off in estimating the maximum throughput. The simulation results show that our analysis is quite accurate in describing the trend and in estimating the throughput to within 10%. We also did not observe any deadlock occurring with the simple locking mechanism.

The simulation is repeated using 20 and 40mips processors. The results are shown in figure 2.13.
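The processing power estimate above can be reproduced from γ; a small sketch, assuming (consistently with the 85 mips figure) that γ = 17 counts instructions per byte of packet data, with a function name of our own:

```python
# Aggregate processing power needed to sustain a given throughput, assuming
# gamma = 17 instructions per byte of packet data (table 2.4).
def required_mips(throughput_bits_per_s, gamma=17):
    bytes_per_s = throughput_bits_per_s / 8
    return bytes_per_s * gamma / 1e6     # millions of instructions per second

# At the ~40 MBits/s saturation throughput of the SSM design:
print(required_mips(40e6))  # 85.0, i.e. eight to nine 10mips processors
```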
With higher speed processors, we could provide the same processing power with a fewer number of processors than by using 10mips processors. This would then reduce the amount of bus contention due to lock contention and allow us to increase the maximum throughput. However, the simulation result indicates that high speed processors do not help very much in improving the maximum throughput. The maximum throughput obtained by using 40mips processors is only about 32% more than that of the 10mips processors. Thus the return for using faster processors is very low. The main reason for this result is that the bottleneck here is not caused by the lock accesses, which would increase with the number of processors. Instead, the bottleneck is caused by the high amount of transfer syntax operations that are carried out using the random access bus. Therefore, the effects of having to busy wait for a lock, which increase with the number of processors, are not observed, because the processors spend most of their time waiting for the bus to do transfer syntax conversions rather than accessing the shared states.

From our results, we can conclude that using the SSM design to implement the single active queue processing model is inefficient and more sophisticated designs are needed.

Figure 2.12: Bus utilizations of the random access and burst access bus for the SSM design

Figure 2.13: Throughput and bus utilization of the SSM design with 20mips and 40mips processors

Chapter 3

The local instruction memory design

In this chapter, we are going to look at some design features for improving the system performance.
From the result of the SSM design, we can observe that even though the instruction hit ratio is high (at 0.975), instruction accesses still create a lot of shared memory references because of the high proportion of transfer syntax operations, which take on the order of twenty or more instructions per byte of packet data. One effective means of improving the system performance is to make use of locality to reduce the shared memory accesses. If the information needed is easily accessible without accessing the shared resources, the speed of the operation can be increased. In the following discussions, we are going to see how we can apply the effects of locality to the design of protocol processing systems.

3.1 Background on locality

There are two aspects to the property of locality, namely the spatial and the temporal aspects [11]. Temporal locality refers to the property that information which will be used in the near future is likely to be in use already. This type of behavior is observed in program loops, where instructions and variables are reused. Spatial locality is the property that the location of the information to be accessed will be near the locations that are currently in use. This is seen mostly in program code and large data structures where data are accessed sequentially.

In an application, we can divide the memory references into instruction accesses and data accesses. Instruction access has a higher degree of locality than data access in general, because of the sequential execution of program code and the various loop constructs. For data access, the number of times that a piece of data is used and the location of the data used depend a lot on the application. In general, the degree of locality is lower than that of instruction access.
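The contrast between the two kinds of access can be made concrete with a toy cache model; the construction below is entirely our own and is only meant to illustrate the difference:

```python
# Toy direct-mapped cache: a looped instruction trace (temporal locality)
# versus a stream of packet data touched once (spatial locality only).
def hit_rate(trace, num_lines=64, line_size=4):
    cache = {}          # cache index -> tag of the line currently held
    hits = 0
    for addr in trace:
        line = addr // line_size
        idx, tag = line % num_lines, line // num_lines
        if cache.get(idx) == tag:
            hits += 1
        else:
            cache[idx] = tag
    return hits / len(trace)

loop_trace = list(range(128)) * 100     # a 128-byte loop executed 100 times
stream_trace = list(range(128 * 100))   # 12800 bytes scanned exactly once
print(hit_rate(loop_trace))    # near 1: reuse keeps the loop resident
print(hit_rate(stream_trace))  # 0.75: only 3 of every 4 sequential accesses hit
```

Note that the streaming trace reproduces the 0.75 data hit rate assumed in the analysis: with a 4 byte line, only the first access to each line misses.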
In the case of protocol processing, the main data structure is the packet data. If we consider only a single packet, temporal and spatial locality can be found, since both the presentation and transport layers would potentially touch all the bytes in the packet during processing. However, if we consider the process in general, then little spatial and temporal locality can be obtained, as packets keep coming into and moving out of the system at a high rate.

A lot of work has been done on improving cache system performance by prefetching when the inherent locality is not enough. Prefetching techniques range from simple ones which only prefetch the next memory location to ones which are controlled by program code. In our research, we are going to apply the prefetching technique to reduce shared memory accesses by moving the program code and packet data closer to the processor, into local storage, before they are needed. This can be done because the packet is used heavily only in the presentation and transport layers and we know when these are performed. We do not simply apply prefetching to our cache, because that alone is not enough to remove shared memory accesses, which are a main problem in our design. In this chapter we will study the effect of keeping instructions locally, and we will look at the effects for packet data in the following chapter.

3.2 Effects of locality

For the shared memory design discussed in the last chapter, both instructions and data are stored in the shared memory. In the transfer syntax conversion phase, even for a simple data structure like the personnel record used in the ASN.1 specification, 20 to 40 instructions per byte of packet data may be required for doing transfer syntax conversion using the basic encoding rules [37]. Fetching these instructions from the shared memory creates a lot of traffic on the shared bus and limits the total system throughput because of bus contention.
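To see why basic-encoding-rules conversion is so instruction-hungry, consider decoding a single BER INTEGER; the toy decoder below is our own and handles only the definite short-form length, yet it already performs several operations per content byte:

```python
# Toy illustration of why transfer syntax conversion touches every byte: a
# minimal decoder for a BER-encoded INTEGER (definite short-form length only;
# real basic-encoding-rules processing handles many more cases).
def decode_ber_integer(buf):
    assert buf[0] == 0x02, "expected INTEGER tag"
    length = buf[1]
    value = 0
    for b in buf[2:2 + length]:        # every content byte is read and combined
        value = (value << 8) | b
    if buf[2] & 0x80:                  # two's complement sign extension
        value -= 1 << (8 * length)
    return value

print(decode_ber_integer(bytes([0x02, 0x02, 0x01, 0x00])))  # 256
print(decode_ber_integer(bytes([0x02, 0x01, 0xFF])))        # -1
```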
Using the shared memory for temporary variables further adds to the problem, as they create more shared bus traffic.

The results of our simulations for the SSM design show that the random access bus usage rises quickly with the increase in the number of processors while the burst access bus usage is low. Our first step in studying the effects of locality is to see the change in performance when instructions and temporary variables are stored locally instead of in the shared memory, so as to reduce the amount of random access bus traffic. This change is implemented by adding local memory to each of the processors. The local storage is designed to hold all code and temporary variables required for processing. With this design, the shared memory is only used for storing connection state, system state and packet data. If we use the analogy that our shared memory is like the main memory in a hierarchical memory system and that the local memory is the cache, we would have a perfect hit ratio here in terms of instruction accesses. We call this architecture the local instruction memory (LIM) design.

The architecture of the LIM design is essentially the same as that of the SSM design. The only difference is the addition of local memory to each of the processors. The local memory is connected to its processor through the random access bus of the processor and is only accessible by that processor, as shown in figure 3.1. Since the local memory is only available to one processor, no special arbitration hardware is needed. The design could be implemented by using the supervisor and user address spaces found in most common general purpose microprocessors or by using a memory mapped address space. The major cost of the design comes from the memory chips.

3.3 Performance analysis and Simulations

The way in which packets are processed is the same as in the SSM design.
Since we are only changing the location of the code and temporary variables, and this is all done in hardware, the software would be essentially the same as for the SSM design. In the analysis of the SSM design, the hit ratios refer to memory accesses to the shared memory. With local memory installed in the system, some of the hit ratios have to be modified. The numbers are changed as shown in table 3.1. They are only used in deriving ρ_random.

Figure 3.1: The Local Instruction Memory design (each processor gains a local memory on its private random access bus; the shared memory and the shared buses are unchanged)

    access category     Transfer Syntax   Header        Locks    Header data   Packet data
                        conversions       processings   access   access        access
    hit ratio for LIM   1.0               1.0           0        0.75          0.75
    hit ratio for SSM   0.975             0.975         0        0.75          0.75

Table 3.1: Hit ratios for simulating the LIM design

The same hit ratio as in the SSM design is used for the local memory in the simulation, so that instructions stored in local memory still have a hit rate of 0.975. However, fetching these instructions from the local memory does not add any contention to the shared bus. We can obtain a good estimate of the system behavior without being too optimistic by using these hit ratios, although a hit rate close to 1 is achievable if we have a perfect instruction pipeline implemented with cache prefetch mechanisms like branch prediction [31], stream buffers, victim buffers [24] and software prefetch schemes [6] [26].

Using the same assumptions and parameters used for the simple shared memory design, we have

    ρ_random = (10000 × 8) / (300 × 10⁻⁹ (1724 + 24.42n)) MBits/s

which is a function of the number of processors. As n increases, there will be more contention for the bus and this will result in a lower throughput.
Since packets are processed in the same way as before, the number of packet movements is the same, and so ρ_burst remains unchanged, or

    ρ_burst = 264 MBits/s

Even for n = 1, where there is no contention, ρ_random = 152.48 MBits/s < ρ_burst. From section 2.5.4, we estimate that the maximum throughput is limited by ρ_random.

The simulated throughput is plotted against the number of processors as depicted in figure 3.2. The analytic results derived for ρ_random, ρ_burst and ρ_processor are also plotted on the graph. The simulated throughput follows ρ_processor up to about 30 processors and starts to saturate at about 112 MBits/s, which is just a little larger than ρ_random. Thus we obtain a 3 times increase in maximum processing throughput as compared with the SSM design. A linear increase in throughput with the number of processors is maintained until saturation occurs, and the bus design is found to be capable of supporting 3 times more processors in this configuration than the SSM design.

The utilization of the burst access bus is also improved, as shown in figure 3.3. Comparing with the bus utilization of the SSM design, the difference in the utilization of the two buses has been reduced from approximately 0.8 to 0.45. The increase in burst access bus usage is directly related to the increase in throughput. From equation 2.4, we can see that burst access bus usage increases linearly with the throughput. By storing instructions and temporary variables in the local memory, we are able to reduce the traffic on the random access bus and allow the system to utilize more processors and thus handle more packets.

Figure 3.2: Analytical and Simulation results of the throughput for the LIM design
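The gain shows up directly in the per-packet constant of the bound; a numerical comparison of the two ρ_random expressions (function and argument names are ours, constants from the text):

```python
# Random-access-bus bounds for the SSM and LIM designs. Keeping instructions
# and temporary variables local shrinks the per-packet term of the bound from
# 6028 to 1724 shared accesses; the per-processor term (24.42n) is unchanged.
def rho_random(n, per_packet_term):
    return 10000 * 8 / (300e-9 * (per_packet_term + 24.42 * n))

lim = rho_random(1, 1724)   # ~152 MBits/s
ssm = rho_random(1, 6028)   # ~44 MBits/s
print(lim / 1e6, ssm / 1e6, lim / ssm)  # roughly a 3.5x larger bound
```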
Figure 3.3: Bus utilizations of the random access and burst access bus for the LIM design

The processor utilization is found to be better than 0.9 for most of the time before saturation of the bus occurs. Together with the linear increase in throughput, we can see that the overhead introduced by parallel processing is small, as most of the processing power goes toward producing a higher output, as desired.

The simulations are repeated with 20mips and 40mips processors, with all other parameters remaining the same. The results are shown in figure 3.4.

Figure 3.4: Throughput and bus utilization of the LIM design with 20mips and 40mips processors

By using faster processors, we can obtain the same processing power with a fewer number of processors. When the number of processors is low, the amount of bus contention will also be low and therefore more processors can be added. This would imply that a higher throughput can be achieved. The simulation results support this argument. For the same throughput, say at 100 MBits/s, the random access bus usage of the 20mips system is higher than that of the 40mips system, as shown in figure 3.4. As a result of the higher bus utilization, a higher bus contention will be seen in the 20mips system, and the maximum throughput achievable on the 20mips system will be lower than that of the 40mips system.

Table 3.2 shows the maximum throughputs obtained from the SSM and LIM designs with different processor speeds.
The results indicate that with the LIM design, we can increase the maximum throughput more effectively by using higher speed processors.

    processor       SSM design                          LIM design
    speed (mips)    max throughput   % increase with    max throughput   % increase with
                    (MBits/s)        40mips cpu         (MBits/s)        40mips cpu
    10              43.52            32                 109.60           50
    20              52.48            10                 143.12           15
    40              57.60            0                  164.72           0

Table 3.2: Variation of maximum throughput with processor speed

As discussed in section 2.4, using high speed processors without high speed memory would not be efficient, and the results here illustrate that point. When the processor speed is increased by 4 times, the maximum throughput is only improved by 50% in the best case when the memory speed is kept the same.

3.3.1 Merits of the LIM design

The LIM architecture is designed to tackle some of the problems in the SSM design. There is only a minor modification to the hardware, with the inclusion of local memory for each processor, and minimal software changes. The use of local memory for storing program code and temporary variables removes a big portion of the memory accesses to the shared memory. Although there are not a lot of changes in the design as compared to the original, the maximum processing throughput is almost tripled for the different processor speeds simulated. The number of processors that can be accommodated while still maintaining a linear speedup is also increased by more than 3 times. The utilization of the burst access bus has been increased while that of the random access bus has been reduced, giving a more balanced use of the shared resources.

3.3.2 Weakness of the LIM design

A major disadvantage of this design is that instructions have to be duplicated in the local memory of each processor. Thus in terms of instruction memory usage, the efficiency is inversely proportional to the number of processors. This is not a major problem because the cost of memory is expected to go down with the advent of technology.
Another problem with this design is that false sharing of packet data is not addressed. Since there is only one access path to the packet data in the shared memory, whenever a processor is accessing a packet, no other processor can use the shared memory even if it needs to access a different packet. Even assuming that only 1 out of 4 bytes of the packet needs to be accessed during transfer syntax conversion, there is still a lot of memory traffic. This problem seriously affects the system throughput and is dealt with in the next chapter.

Another undesirable characteristic of the design is that there is a low utilization of the faster burst access bus and a high utilization of the slower random access bus. The random access bus is thus the bottleneck again, as in the SSM design. When the random access bus saturates, there is still a lot of bandwidth remaining on the burst access bus. This reveals that some improvement is possible without major modification to the hardware, if we can change the usage of the buses.

Chapter 4

The packet relocation design

In the discussion of the LIM design, we noticed that a fair amount of random access traffic is still required because packet data are stored in the shared memory. The resulting utilization of the slower random access bus is higher than that of the burst access bus, and the processing throughput is limited to a lower value than allowed by the faster bus. In this chapter, we look at an improvement that attempts to reduce the shared memory accesses by moving packets to the local memory. We can also balance the load on the two buses by using the burst access bus for the relocation of packets.

4.1 Usefulness of data cache

Since we know when the packet data will be used during protocol processing, we could prefetch the packet data into a local cache of the processor using a cache prefetch technique.
However, we still have to use the shared random access bus to perform this prefetch, which is carried out like a cache miss operation, and therefore the overall bus contention problem is not improved. Since a cache prefetch instruction can only bring in a certain number of bytes determined by the bus width, and the packet size varies a lot, extra calculations would be needed to determine how many prefetches are required and where and when to issue them. This could complicate the processing and affect the processing latency. Moreover, packet data are not reused extensively, and thus leaving them in the cache would not be useful.

Instead of using a cache prefetch mechanism, we choose to simply move the data from the shared to the local memory just before the access begins. The advantage is that local memory can be made to support burst transfers and data can reside in it for a longer time without the possibility of becoming invalid as in the cache. By using the burst access mode to move the data, only one single transfer is needed to relocate the data and the random access bus is not used.

4.2 Spatial locality achieved through copying

To see why moving packets to the local memory will improve the performance, we have to look at how packet data is used during protocol processing. In the transfer syntax conversion phase, the representation of the packet data has to be changed from one form to the other. For instance, all of the bytes have to be read and rewritten for the integer data type. If the packet is stored in the shared memory, the processor has to get hold of the bus, access a few bytes, release the bus and repeat the cycle again until the whole packet is processed. If there is more than one processor in the system, bus contention will occur and slow down the processing.

Since the transfer syntax conversion of a packet is considered as a single step and is carried out by one processor, the packet need not be shared.
Therefore, if the packet is moved to the local memory before the transfer syntax conversion phase begins, the false sharing of packets will be reduced and the shared random access bus traffic will also be lowered. The key idea here is to use the burst access bus instead of the random access bus to do the transfer. By doing so, we can also effectively balance the usage of both buses. We call this architecture the simple packet relocation (SPR) design.

The system architecture of the SPR design is similar to the LIM design. However, in this design, the local memory is dual ported, one port for random and one for burst access, as in the global shared memory. The random access port is only connected to the local processor through the local random access bus, while the burst access port is connected to the shared burst access bus, as shown in figure 4.1. Since all the local memories are connected to the shared burst access bus, the memory system can be viewed as a distributed shared memory system. Dual port memory devices are used to facilitate both the movement of data from shared to local memory and the access to code and data by the local processor.

Figure 4.1: The Simple Packet Relocation design

The processing steps are essentially the same as before, with the addition of some extra data movements as shown in figure 4.2.

Figure 4.2: Number of packet relocations for the SPR design

For upward processing, the global shared memory is used for storing the packets up to and including the session layer. In the presentation layer, where there is potentially a high number of packet data accesses, the packet is copied from the shared memory to the local memory using a block transfer on the burst access bus. By prefetching the data into the local memory, we have made a tradeoff: a decrease in random access bus usage for an increase in burst access bus usage.
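The bus-time side of this tradeoff can be sketched with the cycle times of table 2.4 (200ns random, 50ns burst, 32 bit buses); the accounting below is our own simplification and ignores arbitration delays:

```python
# Bus time to touch a 10000 byte packet word by word over the shared random
# access bus, versus one DMA block transfer over the burst access bus.
PACKET_BYTES = 10000
WORDS = PACKET_BYTES // 4      # 32 bit bus: 4 bytes per transfer
RANDOM_CYCLE = 200e-9          # random access cycle time (s), from table 2.4
BURST_CYCLE = 50e-9            # burst access cycle time (s), from table 2.4

word_by_word = WORDS * RANDOM_CYCLE   # shared bus held for every single word
block_copy = WORDS * BURST_CYCLE      # one burst transfer into local memory
print(word_by_word * 1e6, block_copy * 1e6)  # ~500 us versus ~125 us
```

The 4:1 ratio reflects only the relative cycle times; the real gain is larger, since the burst transfer also frees the random access bus for other processors.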
Packets are left in the shared memory for the lower layers because of the need to perform reassembly operations, which can be carried out more efficiently when all packets are in the same address space, so that no copying is required.

For downward processing, packets are copied from the shared memory to the local memory of a processor when processing begins. The packet can stay in the local memory of the processor, which carries out all the processing if possible. In an ideal situation, when the sequence number is correct for all connection oriented layers, the packet is kept in the local memory of the processor until the packet is sent out. In a less than ideal case, when the sequence number is incorrect, the packet is copied back to the shared memory. This is done so that the current processor is freed to work on something else; keeping the packet in the global shared memory allows any processor to work on it at a later time.

4.3 Performance analysis and Simulation

Since packets have to be moved to the local memory, the software has to include proper DMA instructions. Each processor is equipped with a DMA device so that each processor can initiate a packet transfer. Other than that, the software and hardware are still very similar to the LIM design. Since the packet is now in the local memory for the transfer syntax conversion, the hit ratio for accessing the packet data has to be changed. The new hit ratios are shown in table 4.1.

    access category     Transfer Syntax   Header        Locks    Header data   Packet data
                        conversions       processings   access   access        access
    hit ratio for SPR   1.0               1.0           0        0.75          1.0
    hit ratio for LIM   1.0               1.0           0        0.75          0.75
    hit ratio for SSM   0.975             0.975         0        0.75          0.75

Table 4.1: Hit ratios for simulating the SPR design

As in the case of the LIM design, these hit ratios refer to the shared
For items which are now stored in the local memory, the hit ratio is always 1 asfar as the shared memory is concerned as the shared memory would not be accessed. For thelocal memory, the original hit ratios used for the simple shared memory design will be used.Using the same assumptions as before, we can derive the maximum throughput usingequation 2.3:Prandom10000 * 8MBits/s300* 10-9(474 + 24.42n)For n = 1, Prandom = 534.84MBits/s and for n = 40, Prandom = 183.84MBits/s. Depending onChapter 4. The packet relocation design^ 56whether we need to copy the packet back to the shared memory during downward processing,we would have a best and a worst case for pburst. For the best case, only 3 data movements are\.rndo wn = 4)needed for downward processing (rndown = 3). For the worst case, 4 movements (are required.Best case: Pbur st = ( 0 .5 *80 *106^0.5 *80 *106 )* 8\^ +4^3= 186.64MBits/sWorst case: pburst =^0.5 * 80 * 106^0.5 * 80 * 106 ) * 8^+( 4^4== 160MBits/sIn this design, pburst is smaller than Prandorn in both the best and worst case for a large rangeof values for the number of processors. Therefore, in this design, the burst access bus is thebottleneck of the system. The maximum processing throughput would be around 160MBits/sand the number of 10mips processors required is around 42.Both analytical and simulation results are plotted as depicted in figure 4.3. Thesimulated throughput follows the analytical value closely and saturates at around 144MBits/s.The number of processors that can be accommodated is almost five times higher than thatof the SSM design. The utilization of the burst access bus is higher than that of the randomaccess bus as predicted by the analysis. 
From the plot in figure 4.4, we can observe that the utilizations of the two buses have been interchanged as compared to the previous designs. A strange behavior is observed when the number of processors rises to about 45: the random access bus utilization increases sharply and the burst access bus utilization drops. As the burst access bus usage decreases, the throughput also drops. This observation can be explained in terms of the contention on the random access bus. When there is a high number of processors in the system, the system may become unstable if all the processors try to compete for some shared resources. A good guideline would then be to keep the number of processors to around 35 to 40 under most situations.

Figure 4.3: Analytical and Simulation results of the throughput for the SPR design

Figure 4.4: Bus utilizations of the random access and burst access bus for the SPR design

The erratic behavior is not observed when the processor speed is increased to 20mips and 40mips, as shown in figure 4.5.

Figure 4.5: Throughput and bus utilization of the SPR design with 20mips and 40mips processors

With faster processors, a fewer number is needed to provide a certain processing power, and this can reduce the probability of bus congestion. The maximum throughputs obtained with different processor speeds are very similar, as observed from the simulation results. In other words, the maximum throughput is not very sensitive to the speed of the processors used.
This behavior is similar to that of the simple shared memory design, but the reasoning is different. In the former case, there is so much random access bus traffic due to transfer syntax conversion that the effects of increasing the processor speed are not significant. In the SPR design, the bottleneck is the burst access bus, the traffic on which depends on neither the number of processors nor the processor speed. A better processor utilization trend is thus possible with the SPR design. Even when the burst access bus saturates, the processor utilization is still better than 0.85. Comparing with the LIM design in figure 3.4, the processor utilization near saturation drops much faster in the LIM design than here.

4.3.1 Merits of the SPR design

The SPR design reduces the false sharing problem and shifts the bottleneck to the faster bus by adding dual ported VRAMs as local memory and making some minor changes to the software. The number of processors that can be supported in this design is almost 5 times that of the SSM design, with the maximum throughput being around 4 times higher. The burst access bus utilization is improved significantly and becomes the throughput controlling factor of the system. In practice, burst access is more than 4 times faster than random access in dual ported VRAMs [17] such as those used in our simulations. With a burst access time of 30ns, ρ_burst would be increased to 267 MBits/s. The advanced designs in cache DRAM [4][5] suggest that we have the technology to construct VRAMs with even faster burst and random access, which will further improve the performance of this design. Another improvement that we have achieved is that the maximum throughput of the system is not dependent on the speed of the processors.
Therefore, the design allows us to use low speed processors, which could reduce the cost of the system, without lowering the processing throughput.

4.3.2 Weakness of the SPR design

When the number of processors increases, the system can become unstable when the random access bus usage suddenly rises. As more and more packets are processed per second, the demand for the state information and other synchronization data also increases and would cause the random access bus to become the bottleneck again. Thus a further reduction in random access bus usage is needed to ensure that more packets can be processed when faster memory devices become available, especially ones with a faster burst access cycle time.

The success of this design depends on a fast burst access mode. If burst access is slow because of slow memory or frequent crossing of memory device boundaries, the improvement will decrease. We can compare the SPR design, which is limited by the burst access speed, with the LIM design, which is limited by the random access speed. With a 32 bit bus, equal network and host traffic and 4 copies for both upward and downward processing, ρ_burst for the SPR design reduces to 1/λ_burst bytes/s, or 8/λ_burst bits/s. Table 4.2 shows the change in ρ_burst for this design when the speed of burst access is varied.

    λ_burst (ns)       25    50    100   150     200
    ρ_burst (MBits/s)  320   160   80    53.33   40

Table 4.2: Variations of maximum throughput when the burst access speed is changed

Plotting the data in table 4.2 together with ρ_random for the LIM design, based on a 200ns random access cycle time, we can see from figure 4.6 that when the burst access cycle time λ_burst is more than 1/4 of the random access cycle time, no benefit can be derived from this SPR design.
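Table 4.2 follows directly from the simplified bound; a quick numerical check (names ours):

```python
# With 4 copies in each direction over a 32 bit bus, the burst-bus bound of
# the SPR design simplifies to 8 / lambda_burst bits per second (table 4.2).
def rho_burst(lambda_burst_s):
    return 8 / lambda_burst_s

for ns in (25, 50, 100, 150, 200):
    print(ns, round(rho_burst(ns * 1e-9) / 1e6, 2))  # 320, 160, 80, 53.33, 40
```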
Therefore, when implementing the system, the memory management component must be designed carefully to reduce the possibility of storing a packet over several memory devices, so as to make sure that burst access is fast enough.

The extra data movements involved in this design increase the latency time, especially in situations where very few data movements are needed for transfer syntax conversion, as with character strings. To see if this is a real concern, data have been collected to estimate the latency time. Tables 4.3, 4.4 and 4.5 show the average processing latency time as obtained from the simulations. The average bus request delay is also shown. Assuming that the average length of a packet from the host is 10000 bytes, the average copying time with a 50ns burst access cycle and a 32 bit bus is about 125µs. From the simulation results, we can see that the copying time and the bus delay are about 2 orders of magnitude smaller than the processing time. Thus, the increase in latency due to the extra data relocation would not be significant.

Figure 4.6: Comparison of the LIM design and the SPR design with different burst access speeds

number of   processing time   # of packets   # of packets   % of packets   bus request
processors  per packet (ms)   processed      copied back    copied back    time(ns)
1           20.3              23             0              0              9689
2           23.6              39             3              8              11803
4           21.9              85             18             21             13718
8           22.6              170            54             32             21416
16          23.2              327            130            40             35388
20          22.4              415            181            44             53157
32          22.0              654            299            46             96507
40          22.4              783            386            49             131078
48          27.1              626            343            55             127694

Table 4.3: Average processing time using 10mips processors in a 0.5s interval
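The 125µs copying estimate and the two-orders-of-magnitude comparison above can be checked with a few lines of arithmetic (a sketch using the 10000 byte packet, 50ns burst cycle and 32 bit bus assumed in the text):

```python
packet_bytes = 10000
bus_bytes = 4                      # 32 bit bus
mu_burst = 50e-9                   # burst access cycle time in seconds

# 2500 bus transfers of 4 bytes each at 50ns per transfer
copy_time = packet_bytes / bus_bytes * mu_burst
print(copy_time * 1e6)             # 125.0 microseconds

processing_time = 22e-3            # ~22ms per packet, from table 4.3
print(round(processing_time / copy_time))  # ~176: about 2 orders of magnitude
```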
number of   processing time   # of packets   # of packets   % of packets   bus request
processors  per packet (ms)   processed      copied back    copied back    time(ns)
1           12.5              38             0              0              11597
2           11.9              81             3              4              16453
4           12.6              151            25             17             19961
8           12.7              301            86             29             26712
16          12.3              604            239            40             69417
20          12.6              728            306            42             104359
26          12.9              890            423            48             158769
27          12.9              921            451            49             160492
28          13.2              942            498            53             176760

Table 4.4: Average processing time using 20mips processors in a 0.5s interval

number of   processing time   # of packets   # of packets   % of packets   bus request
processors  per packet (ms)   processed      copied back    copied back    time(ns)
1           6.9               70             0              0              12784
2           6.9               138            4              3              17498
4           7.2               264            35             13             25651
8           7.2               520            141            27             49501
12          7.2               758            263            35             80958
14          7.3               856            299            35             101947
15          7.5               890            342            38             124350
16          7.6               932            362            39             143348
17          7.4               1002           576            57             192487
18          7.2               1088           797            73             264338

Table 4.5: Average processing time using 40mips processors in a 0.5s interval

The need to move packets from local to shared memory when they are out of sequence is also a concern. Tables 4.3, 4.4 and 4.5 also give the percentage of packets that have to be moved because of the serialization requirement. When the number of processors increases, the need to copy packets back to shared memory also increases. Over 50% of the packets have to be moved when the system is close to saturation in all cases simulated. These extra copies result in a severe overhead which cannot be neglected.

4.4 Overcoming the weaknesses

Architectural changes are made to deal with the problems described in the last section and to further improve the system performance. We would like to reduce the bus utilization, to extend the maximum processing throughput to a gigabit rate, and to maintain a graceful degradation of performance when there is a sudden increase of workload owing to bursty traffic or failure of some of the processors.
The new architecture is called the improved packet relocation (IPR) design.

4.4.1 Reduce the number of packet relocations before processing starts

For the SPR design, packets buffered in the host interface for downward processing are copied into the shared memory when the packets arrive. The packet is copied again into the local memory of some processor when the presentation layer processing starts (refer to the discussions in section 4.2). The reason for doing that is to give every processor equal access to the data so that when a processor is ready, it can process any packet in the shared memory. This arrangement results in an indirect copying step which uses up extra bandwidth on the burst access bus and the shared memory. We can eliminate the copying from the interface to the shared memory by moving the data directly from the interface to the local memory of the processor concerned. The detailed operation is discussed below.

When a packet enters the system from the host, the active queue is updated as before. However, the packet remains in the host interface buffer to minimize the use of the shared buses. When a processor becomes free, it gets the first job from the active queue and, if that is a new job from the host, the processor will initiate a transfer of the packet from the interface buffer directly into its local memory.

This configuration will also give the system more tolerance to bursty traffic or the failure of one or more processors. Since all packets are buffered in the interface, the shared resource usage will be proportional to the actual number of packets being processed, and thus the system performance would not be affected by the number of jobs waiting in the active job queue. In other words, graceful degradation of performance is achieved.

For upward processing, the number of data movements cannot be reduced easily.
The requirement for reassembling packets makes it more efficient to leave packets in the shared memory. However, storing the packets in the interface buffer still helps in making the system more fault tolerant, as in downward processing.

4.4.2 Reduce the number of packet relocations during processing

For downward processing, the data portion of the packet is not accessed after the presentation layer processing, where syntax conversion is performed. Headers are simply added to the original PDU in the lower layers. Therefore, it is possible to avoid the need to copy packets back into the shared memory when they are out of sequence, as done in the previous design. The saving can be substantial. If packets are long, copying them takes a lot of valuable resources. Even if they are short, it would still be a waste of resources, as the header, which is the part really needed for lower layer processing, is usually much shorter than the data portion of the PDU. For instance, in an ATM cell, which contains only 53 bytes, the data portion still makes up over 90% of the cell. The processor performing the processing could allocate the header in its local memory and update the state table with this information. When the packet is going to be sent out, instead of a single transfer to move the data from the memory to the network interface buffer, multiple transfers are used to gather the packet and all the headers, which may reside in the local memories of different processors. In the worst case, each layer will be processed by a different processor and a total of five transfers would be needed. Since the length of the packet is not changed, usage of the bus would not be affected.
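The gather-style transmission described above might be sketched as follows; this is an illustrative model only, and the names (Fragment, send_packet, dma) are hypothetical, not part of the actual hardware design:

```python
# Sketch of the multi-transfer send in the IPR design: the presentation
# layer PDU stays in one processor's local memory while each lower
# layer's header may sit in a different processor's local memory.
from typing import Callable, List, NamedTuple

class Fragment(NamedTuple):
    memory_id: int    # which local memory holds this piece
    addr: int
    length: int

def send_packet(headers: List[Fragment], data: Fragment,
                dma: Callable[[Fragment], None]) -> int:
    # Headers are emitted outermost first, then the original PDU.
    # One DMA transfer is set up per fragment; in the worst case the
    # four lower layers were each processed on a different processor,
    # giving five transfers in total.
    for frag in headers + [data]:
        dma(frag)
    return len(headers) + 1

headers = [Fragment(memory_id=p, addr=0x0, length=16) for p in range(4)]
data = Fragment(memory_id=0, addr=0x100, length=10000)
transfers = send_packet(headers, data, dma=lambda frag: None)
print(transfers)  # 5 transfers in the worst case
```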
However, there would be more overhead in terms of getting the bus and setting up the DMA transfers, and the overhead introduced would worsen the processing latency time. On the other hand, the packet is moved only once in the IPR design rather than a couple of times in the SPR design, and therefore the processing time will be shortened because of the reduced number of data relocations. The actual effect on the processing time would have to be observed from the simulations.

Results of the IPR design

Figure 4.7 shows the number of data movements after making the above improvements. For downward processing, only 2 relocations are needed, while for upward processing, 4 copies are still required.

Figure 4.7: Number of packet relocations for the IPR design

We can estimate the maximum throughput of this system using the analysis discussed in chapter 2. The hit ratios are the same and, by using 40mips processors,

ρ_random = 10000 / (225×10⁻⁹ (474 + 24.42n)) × 8 MBits/s    (4.6)

When n=1, ρ_random = 713.37MBits/s and when n=20, ρ_random = 369.44MBits/s. The new value for ρ_burst is:

ρ_burst = (0.5×80×10⁶/4 + 0.5×80×10⁶/2) × 8 MBits/s = 240 MBits/s    (4.7)

Only 40mips processors are used for this design because the increased maximum throughput would require too many 10mips or 20mips processors, which would exceed the physical limit that can be driven by the bus.

By reducing the number of data movements in downward processing from the average value of 3.5 to 2, we have increased the maximum throughput by over 25%, from 176.96MBits/s to 227.20MBits/s, as shown in figure 4.8. Therefore, the increase in DMA overhead for gathering the headers is really smaller than the decrease in data copying time. Figures 4.8 and 4.9 show the simulation and analytical results. The characteristics of the IPR design are very similar to those of the SPR design. High processor utilization is maintained until saturation of the burst access bus occurs.
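The figures quoted for equations 4.6 and 4.7 can be reproduced numerically; the following check script uses only the parameter values stated above:

```python
# Equation 4.6: random-access-limited throughput with 40mips processors.
def rho_random_mbits(n_processors):
    return 10000 / (225e-9 * (474 + 24.42 * n_processors)) * 8 / 1e6

print(round(rho_random_mbits(1), 2))    # ~713.4 MBits/s
print(round(rho_random_mbits(20), 2))   # ~369.4 MBits/s

# Equation 4.7: burst-access-limited throughput, 2 relocations downward
# and 4 upward, with half the traffic in each direction.
burst_rate = 80e6   # bytes/s: a 4 byte bus width per 50ns burst cycle
rho_burst = (0.5 * burst_rate / 4 + 0.5 * burst_rate / 2) * 8 / 1e6
print(rho_burst)    # 240.0 MBits/s
```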
The introduction of the changes does not seem to result in any undesirable effects.

Figure 4.8: Throughput of the IPR design using 40mips processors

Figure 4.9: Resource utilization of the IPR design using 40mips processors

Table 4.6 gives the average processing time per packet. We can see that there is an improvement in latency time when compared to the SPR design as depicted in table 4.5. In other words, the extra overhead in setting up the transfers gives a smaller increase in time than the reduction in time resulting from the lower number of data movements. In fact, the amount of the reduction is not important as long as there is no significant increase in the processing time, as we are mostly interested in reducing the usage of the shared resources without increasing the latency time.

number of   avg processing     # of packets
processors  time(ms)/packet    processed
1           6.19               77
2           6.35               148
4           6.79               275
8           6.90               537
16          7.03               1016
20          7.35               1169

Table 4.6: Average processing time for the IPR design in a 0.5s interval

Since we are interested in a gigabit network rate, we would like to see what the system throughput would be with the fastest memory device and widest bus that is practical.
The new system parameters are shown in table 4.7 and the hit ratios in table 4.8.

parameter    μ_m    μ_p   l      p_n   γ    X    μ_burst  m_up  m_down  W
new values   190ns  25ns  10000  4096  17   0.5  30ns     3     2       64
old values   200ns  25ns  10000  4096  17   0.5  50ns     4     2       32

Table 4.7: Parameters used for finding the max throughput

access category     Transfer syntax  Header       Locks   Header data  Packet data
                    conversions      processings  access  access       access
hit ratio for IPR   1.0              1.0          0       0.875        1.0
(wide bus)
hit ratio for SPR   1.0              1.0          0       0.75         1.0
hit ratio for LIM   1.0              1.0          0       0.75         0.75
hit ratio for SSM   0.975            0.975        0       0.75         0.75

Table 4.8: Hit ratios for finding the max throughput

With a 64 bit bus, the hit ratio for state information is increased to 0.875, assuming that consecutive bytes fetched will all be consumed before they are removed from the cache. Therefore, we have

ρ_random = 10000 / (215×10⁻⁹ (437 + 24.42n)) × 8 MBits/s    (4.8)

and

ρ_burst = (0.5×266×10⁶/4 + 0.5×266×10⁶/2) × 8 MBits/s = 800 MBits/s    (4.9)

For n=1, ρ_random = 806.41MBits/s and for n=40, ρ_random = 263.18MBits/s. The analytical results indicate that if the burst access speed is increased without a corresponding increase in random access speed, the bottleneck could shift back to the random access bus. Since the calculation of ρ_random incurs a certain amount of uncertainty in estimating the number of attempts for obtaining a lock, simulation results are needed to see if the bottleneck will really shift back.

Maximum processing throughputs from analytical results and simulations are shown in figure 4.10.
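As with equations 4.6 and 4.7, the figures quoted for equations 4.8 and 4.9 can be checked numerically with the parameter values of table 4.7:

```python
# Equation 4.8: random access limit with the faster memory
# (215ns combined cycle factor, 437 + 24.42n instruction term).
def rho_random_mbits(n_processors):
    return 10000 / (215e-9 * (437 + 24.42 * n_processors)) * 8 / 1e6

print(round(rho_random_mbits(1), 2))    # ~806.4 MBits/s
print(round(rho_random_mbits(40), 2))   # ~263.2 MBits/s

# Equation 4.9: burst access limit with a 64 bit bus and a 30ns burst
# cycle: 8 bytes / 30ns ~ 266 * 10^6 bytes/s.
burst_rate = 8 / 30e-9
rho_burst = (0.5 * burst_rate / 4 + 0.5 * burst_rate / 2) * 8 / 1e6
print(round(rho_burst))                 # 800 MBits/s
```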
The maximum throughput is obtained with about 40 processors, at 560MBits/s.

Figure 4.10: Throughput of the IPR design using 40mips processors and fast memory

Linear increase in throughput is obtained for up to about 30 processors, and processor utilization only drops to 85% at maximum throughput. The simulation results show that the maximum processing throughput lies somewhere between ρ_burst and ρ_random, but is not equal to ρ_random as expected. Therefore, there is definitely some inaccuracy in the estimation of the number of lock attempts. The results also reveal that by speeding up the burst access cycle, a significant increase in throughput, which is not achievable with the SSM design, is obtained.

Figure 4.11: Resource utilization of the IPR design using 40mips processors and fast memory

The graph of resource utilization in figure 4.11 shows that the difference in usage between the shared buses has been reduced significantly. The burst access bus usage is only slightly higher than that of the random access bus until the system starts to saturate. This is a desirable characteristic because both buses are fully utilized, and it is a major improvement as compared to previous designs.

To see if the system is fault tolerant as predicted, another test is run to observe the effect of overloading the system. Packets are sent into a system with 4 processors at rates of 2 and 4 times the maximum throughput rate as determined earlier. A Poisson distribution and exponential service time are used. The results, as shown in table 4.9, indicate that the maximum throughput remains relatively unchanged.
system                 Throughput   Random access  Burst access  Avg Latency
configuration          (MBits/s)    bus usage      bus usage     (ms)
peak value             64.41        0.096          0.096         5.88
input rate doubled     64.64        0.103          0.099         5.81
input rate quadrupled  63.84        0.110          0.096         5.87

Table 4.9: System performance when the input rate is higher than the processing throughput

Packets are stuck in the interface buffers, and the buffer size required to store all the packets for 0.5s is between 8-24MBytes. The actual buffer size required would depend on the protocol's ability to regulate the traffic in reaction to a reduction in processing power. Work on congestion control and avoidance is discussed in [19][20]. Our simulation results indicate that there is a slight increase in random access bus usage. This is due to the need to access shared states and active queues when a new packet arrives.

4.4.3 Discussions

The IPR design provides a processing throughput of more than 500MBits/s, or half of that of a gigabit network. A balanced bus utilization is also obtained. However, the results reveal a number of weaknesses in the design and the analytical model.

By increasing the burst access speed alone, the bottleneck will go back to the random access bus. Thus we can predict that if the memory speeds are not increased in the same proportion for the two access modes, the bottleneck may bounce back and forth between the two buses.

With a 32 bit bus system, let the ratio of the random access cycle time to the burst access cycle time be x, so that μ_random = x·μ_burst. To ensure that burst access would be our bottleneck for up to 20 processors, we could substitute the new processor and random access speed into equations 4.6 and 4.7 to derive a proper value of x.
Assuming that the processor speed is 4 times that of the random access cycle time, we have

10000 × 8 / ((5/4)·x·μ_burst·(474 + 24.42×20)) > (3/8)·(4×8/μ_burst)

x < 5.5

In other words, if we cannot keep random access fast enough so that its cycle time is less than 5.5 times that of the burst access, the bottleneck will shift to the random access bus. When the bus width is increased, this value of x will change. This is because the burst access bandwidth will increase by a greater extent than the random access bandwidth. Using equations 4.8 and 4.9, we have

10000 × 8 / ((5/4)·x·μ_burst·(437 + 24.42×20)) > (3/8)·(8×8/μ_burst)

x < 2.9

The small value of x implies that it is a lot harder to keep the bottleneck on the faster bus. Therefore, using a wider bus would not be a solution by itself.

The simulation results reveal a weakness in the analytical model for computing ρ_random. In the analysis, we have to estimate the average number of attempts for acquiring a lock, using the assumption that when the number of processors increases, the number of attempts will also increase. In this design, we have made an improvement by assuming that the value of a, the time between retries, is a fraction of the average lock holding time instead of a fixed value, to get a better estimate of the number of retries. However, ρ_random is still off from the simulated value by quite a large margin because the number of attempts also depends on how much time each processor spends on accessing shared information. With a more complex transfer syntax conversion step, the processor would spend more time in local processing and a smaller amount of time accessing shared information when compared to a processor doing simpler transfer syntax conversions. Thus, even if there is a large number of processors, the average number of attempts needed to acquire a lock can still be low when the transfer syntax conversion is complicated.
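The two bounds on x above can be recomputed from the condition ρ_random > ρ_burst; the sketch below assumes, as stated in the text, that the processor cycle is one quarter of the random access cycle (hence the 5/4 factor on μ_random = x·μ_burst, which makes μ_burst cancel out of the inequality):

```python
# Largest x = mu_random / mu_burst for which the burst access bus is
# still the bottleneck with 20 processors, from
#   10000*8 / (5/4 * x * mu_burst * instr_term) > 3/8 * bus_bits / mu_burst
def x_limit(instr_term, bus_bytes, packet_bytes=10000):
    burst_side = 3 / 8 * bus_bytes * 8   # bits moved per burst cycle
    return packet_bytes * 8 / (1.25 * instr_term * burst_side)

print(round(x_limit(474 + 24.42 * 20, bus_bytes=4), 1))  # ~5.5 (32 bit bus)
print(round(x_limit(437 + 24.42 * 20, bus_bytes=8), 1))  # ~2.9 (64 bit bus)
```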
Since our estimate of ρ_random has not included the effects of the transfer syntax complexity, the estimated value would only be a pessimistic approximation. This is not important if we can ensure that the burst access bus is the bottleneck. Otherwise, care must be taken in using the estimate.

Another difficult problem is that we must maintain the buffering capability of the host interface so that data will not be lost. In previous designs, packets are moved out of the interface immediately when they arrive so that the interface is always ready to accept more data. With this new approach, if the input rate is higher than the output rate due to bursty traffic, data may stay in the interface for a long time, and thus special care is needed to make sure that we have enough buffer space in the interface.

Chapter 5

Results and Discussions

In the last few chapters, we have looked at how processing throughput and resource utilization are affected by the number of processors, their speed, and the location of the packets and program code. Here we are going to look at some other measurements of the system design and the effects of other parameters on the system performance.

5.1 Cost

Table 5.1 summarizes the throughput and major characteristics of the various designs proposed.
The trend is that the higher the maximum throughput, the more hardware and software changes are required.

Design  Throughput  Processor    Cost
        (MBits/s)   Utilization
SSM     60.08       0.6          use VRAM for the shared memory
LIM     164.16      0.81         same as SSM + use DRAM for the local memory
SPR     176.96      0.88         use VRAM for both the shared and local memory;
                                 DMA devices are needed for each processor and
                                 special instructions to relocate the packet
IPR     227.20      0.82         same as SPR + big interface buffers in the host
                                 and network adapter
IPR     565.90      0.85         same as SPR except that fast VRAM and a 64 bit
                                 bus (2 times wider than other designs) are used

Table 5.1: Summary of system characteristics

The biggest jump in throughput is obtained when local memory is added to the SSM design. The improvement from using VRAM in the SPR design is small, but it allows us to benefit from the faster burst access bus. With the optimizations implemented in the IPR design, the throughput increases by close to 40% when compared to that of the LIM design. The ability to use faster memory with the IPR design allows us to push the throughput up to more than 9 times that of the SSM system. The main design feature which results in the improvement of the processing throughput is the addition of local memory. The cost of the system increases with the kind and number of memory devices added. Since the price of memory in terms of cost per bit is expected to drop with advances in VLSI technology, the systems proposed should be realizable with an affordable price tag. The major cost of the systems proposed comes from the hardware, as the software component is essentially the same for all designs. The choice of a particular design for implementation depends on the throughput requirement and the budget.

5.2 Memory speed

The limiting factor in all our designs is the memory speed. In our work, we have used a single memory speed except for the last design.
With a faster memory cycle time, more transactions can be carried out with the use of faster processors and a higher throughput can be obtained. As we have discussed in the last chapter, the cycle times of random and burst accesses have to be reduced in the same proportion to ensure that the system characteristics are preserved. Techniques like interleaving and split transactions can be used to increase the effective speed of random access, to compensate for the greater increase in burst access speed than in random access speed that is being seen in newer devices [17]. If the memory speed is increased to 10ns for burst access, ρ_burst will rise to over 2400MBits/s, which will enable our designs to work in a gigabit network.

5.3 Bus speed

Increasing the memory speed implies that the bus speed has to be increased and that more processing power is required to make use of the extra bandwidth. These pose two problems in the bus design, in terms of the speed of the bus and the number of loads on the bus.
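The gigabit figure quoted in section 5.2 is easy to verify: with a 64 bit bus and a 10ns burst cycle, the burst-limited throughput of equation 4.9 reaches 2400MBits/s (a back-of-the-envelope check, reusing the 4-copy/2-copy traffic split assumed earlier):

```python
# Burst-limited throughput with a 64 bit bus, where half the traffic
# needs 4 relocations (upward) and half needs 2 (downward).
def rho_burst_mbits(mu_burst_ns, bus_bytes=8):
    rate = bus_bytes / (mu_burst_ns * 1e-9)          # bytes per second
    return (0.5 * rate / 4 + 0.5 * rate / 2) * 8 / 1e6

print(rho_burst_mbits(10))   # 2400.0 MBits/s, enough for a gigabit link
```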
Changes have to be made to the protocol to simplify the syntax conversionrequired and to put in more parallelism in the definition so that there will not be so muchshared memory traffic. Other possible alternatives include going for bus sufficient systems orresorting to more expensive interconnection architectures.5.4 Multiple connectionsOur analysis has been limited to situations where there is only one active connection. Allour packets originate from the same connection and this worst case scenario results in aheavy contention for the state information in the shared memory. In a real system, there willlikely be more than one active connection and we can expect to have less bus and memorycontentions and thus a higher system throughput than the results shown here. Takeuchi [37]had performed some simulations in this area and showed that this is the case.5.5 Packet lengthsFrom equation 2.3, the maximum processing throughput is directly proportional to the packetlength. The packet length that we have used in this context refers to the length of thepresentation layer PDU. Figure 5.1 shows that variations in Prandom when the packet lengthChapter 5. Results and Discussions^ 79is changed for the IPR design. We can observe that if the packet size is reduced, the bottleneckFigure 5.1: Maximum throughput variation with different packet lengthswill go back to the random access bus. On the other hand, if the packet size is increased,pbt,„t will become the ultimate limit and memory devices with higher burst access speed canbe used without creating the problem of having the random access bus as the bottleneck.The problem of having the random access bus as the bottleneck is that the random accessbus traffic is less linearly related to the system throughput than the burst access traffic, andthus the accuracy of the estimate is lower.Chapter 5. 
Results and Discussions^ 805.6 Other presentation layer operationsIn our discussions, we have only concentrated on the transfer syntax conversion operationsin the presentation layer. In fact, there are other computation intensive functions in thepresentation layer like the encryption operations for data security and compression operationsfor reduced bandwidth requirement. By using a similar argument as with syntax conversion,our design should be able to handle these functions as well with minimal changes.5.7 Future WorkWe have presented a high level analysis of a protocol processing system using multiple pro-cessors. The next step is to perform some detailed designs and to construct some criticalsystem enabling circuits. These circuits include an efficient FCFS bus arbitration circuit andmemory management circuit for the shared memory. The possibility of using interleaving orsplit transaction schemes should also be investigated. Once the basic hardware architecturehas been defined, software can be ported to the system from some parallel implementationslike the one described in {15] and further testing can be carried out.Chapter 6ConclusionsIn this thesis report, three multiprocessor based OSI protocol processing systems have beendescribed. The systems are designed to minimize the utilization of shared resources usingtechniques similar to those used for cache prefetching. An incremental approach had beenused in which each system is designed with some improvements over the previous one. Theresults indicated that our designs are capable of maintaining a higher resource utilization andgiving a higher processing throughput than previous published designs. We have also shownthat the OSI protocol stack is promising in a high performance heterogeneous computingenvironment using a gigabit network despite of the computation intensive operations.In the LIM design, local memories are used to store program codes. 
With this architecture, the processing throughput was increased by close to 3 times as compared to the SSM design proposed in [37]. It was found that the false sharing of data packets in the shared memory had resulted in a severe bus congestion problem on the random access bus, and this congestion had restricted the system processing throughput to a low value. The SPR design dealt with this problem by moving packet data to the local memory of the processor doing the processing before the operations begin. Our results indicated that by using the faster burst access bus to relocate the packets, we successfully reduced the random access bus usage and shifted the bottleneck to the faster burst access bus. The high number of data movements and the lack of fault tolerance of this design prompted further changes. The IPR design improved on the SPR design by reducing the number of data relocation operations with the use of a big interface buffer. This design resulted in a throughput close to 4 times that of the SSM design and allowed the system to provide a graceful degradation in performance when some of the processors fail. With the use of faster memory devices and wider buses, a processing throughput of 560MBits/s has been estimated. The processing throughput of this design is close to half of that required in a gigabit network and almost an order of magnitude higher than that of the SSM design.

An analytical method for estimating the processing throughput of such systems has also been developed. Although many simplifying assumptions have been made in the analysis, good estimates were still obtained.
From both the analytical and simulation results, we have identified that bus speed, memory speed, synchronization, the number of active connections, packet lengths and transfer syntax complexity are among the more important parameters that control the processing throughput.

A bus based architecture is found to be sufficient for handling a network rate of over 500MBits/s. With further optimizations, it is possible to extend it to a gigabit rate. If higher network rates or more processing power are needed, the physical limits of the bus design must be reconsidered to ensure that proper operation can be maintained.

Using fast processors alone, without matching main memory, is not enough to provide the processing power needed for protocol processing. The constant flow of packets through the system makes data caches ineffective in bridging the growing gap between processor and memory speed. Memory speed is a critical factor in the system performance.

An equal proportion of increase in random access and burst access speed in VRAMs is important to enable our designs to scale properly. As with the problem of the processor and memory speed difference, if the access speeds of both access modes are not changed together, the performance of the system will be limited by the slowest device. This would make it uneconomical to use higher speed components in other parts of the system.

The need to process packets in sequence reduces the amount of parallelism that can be incorporated into the system. Operations like reassembly are found to slow down the processing of received packets, and this type of operation should be revised to enable a better use of the resources. As stated by Amdahl's Law, the maximum speedup of a parallel machine is inversely proportional to the amount of serial processing required.
Therefore, todesign protocol processing systems to work with gigabit rates and beyond, a parallel protocolspecification not using finite state machines should be used to minimize the synchronizationsneeded and to allow a higher degree of parallelism.Bibliography[1] W.Bain and S.Ahuja, "Performance Analysis of High Speed Digital Busses", Proceedingsof the 1981 Conference on Computer Architecture, Minneapolis, Minnesota, May, 1981,pp. 107-133.[2] D.Banks and M.Prudence, "A High-Performance Network Architecture for a PA-RISCWorkstation", IEEE Journal on Selected Areas in Communications, Vol. 11, No. 2, Febru-ary, 1993, pp. 191-202.[3] T.Braun and M.Zitterbart, "High Performance Internetworking Protocol", 15th Confer-ence on Local Computer Networks, October, 1990, pp. 172-179.[4] D.Bursky, "Combination DRAM-SRAM Removes Secondary Caches", Electronic Design,Vol. 40, No. 2, January, 1992, pp. 39-44.[5] D.Bursky, "Integrated Cached DRAM lets Data Flow at 100MHz", Electronic Design,Vol. 40, No. 4, February, 1992, pp. 142-146.[6] D.Callahan, K.Kennedy and A.Porterfield, "Software Prefetching", SIGARCH ComputerArchitecture News, Vol. 19, No. 2, April, 1991, pp. 40-52.[7] C.Chen and A.Somani, "Effects of Cache Traffic on Shared Bus Multiprocessor Systems",Proceedings of the 1992 International Conference on Parallel Processing, August, 1992,pp. I-285-1-288.[8] D.Clark, V.Jacobson, J.Romkey, H.Salwen, "An Analysis of TCP Processing Overhead",IEEE Communications Magazine, June, 1989, pp. 23-29.[9] D.Clark and D.Tennehouse, "Architectural Considerations for a New Generation of Pro-tocols", SIGCOMM 90 Communications Architecture and Protocols, Computer Com-munications Review, Vol. 20, No. 4, September, 1990, pp. 200-208.[10] B.Davie, "A Host-Network Interface Architecture for ATM", SIGCOMM 91 Communi-cations Architectures and Protocols, Computer Communication Review, Vol. 21, No. 4,September, 1991, pp. 
307-315.

[11] P. Denning, "On modeling program behavior", Proceedings of the Spring Joint Computer Conference, Vol. 40, AFIPS Press, Arlington, Va., 1972, pp. 937-944.

[12] M. Flynn, "Some Computer Organizations and Their Effectiveness", IEEE Transactions on Computers, 1972, pp. 948-960.

[13] R. Freund, "Heterogeneous Processing", Computer, Vol. 26, No. 6, June 1993, pp. 13-17.

[14] D. Giarrizzo, M. Kaiserswerth, T. Wicki and R. Williamson, "High-Speed Parallel Protocol Implementation", Proceedings of the IFIP WG 6.1/6.4 International Workshop on Protocols for High-Speed Networks, Zurich, Switzerland, May 1989, North-Holland, pp. 165-180.

[15] M. Goldberg, G. Neufeld and M. Ito, "A Parallel Approach to OSI Connection-Oriented Protocols", Proceedings of the 3rd International IFIP WG6.1/6.4 Workshop on Protocols for High-Speed Networks, Stockholm, May 1992, pp. 225-240.

[16] J. Hennessy and N. Jouppi, "Computer Technology and Architecture: An Evolving Interaction", Computer, September 1991, pp. 18-29.

[17] IC Memory Data Book, Hitachi America Ltd., January 1990.

[18] C. Huitema and A. Doghri, "Defining faster transfer syntaxes for the OSI Presentation Protocol", Computer Communication Review, Vol. 19, No. 5, October 1989, pp. 44-55.

[19] V. Jacobson, "Congestion Avoidance and Control", Computer Communication Review, Vol. 18, No. 4, August 1988, pp. 314-329.

[20] R. Jain, "Congestion Control in Computer Networks: Issues and Trends", IEEE Network Magazine, May 1990, pp. 24-30.

[21] N. Jain, M. Schwartz and T. Bashkow, "Transport Protocol Processing at GBPS Rates", Communications Architectures and Protocols: Proceedings of the ACM SIGCOMM '90, Philadelphia, PA, September 1990, pp. 188-199.

[22] E. Johnson, "Working Set Prefetching for Cache Memories", Computer Architecture News, Vol. 17, No. 6, December 1989, pp.
137-141.

[23] C. Joseph, "A Parallel Algorithm for ASN.1 Encoding/Decoding", Master's Thesis, Department of Electrical Engineering, University of British Columbia, September 1992.

[24] N. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers", Proceedings of the 17th International Symposium on Computer Architecture, 1990, pp. 364-373.

[25] H. Kanakia and D. Cheriton, "The VMP Network Adapter Board (NAB): High-Performance Network Communication for Multiprocessors", Communication Architectures and Protocols: Proceedings of the ACM SIGCOMM '88, Stanford, CA, August 1988, pp. 165-174.

[26] A. Klaiber and H. Levy, "An Architecture for Software-Controlled Data Prefetching", Proceedings of the 18th International Symposium on Computer Architecture, 1991, pp. 43-53.

[27] L. Kleinrock, Queueing Systems, A Wiley-Interscience publication, 1975.

[28] D. McCrackin and B. Szabados, "Horizontal demand prefetching: a novel approach to eliminating the jump problem", Canadian Journal on Electrical and Computer Engineering, Vol. 16, No. 3, July 1991, pp. 118-124.

[29] M. McCutcheon, M. Ito and G. Neufeld, "Interfacing a Multiprocessor Protocol Engine to an ATM Network", Proceedings of the 3rd International IFIP WG6.1/6.4 Workshop on Protocols for High-Speed Networks, Stockholm, May 1992, pp. 151-166.

[30] J. Nicoud, "Video RAMs: Structure and Applications", IEEE MICRO, Vol. 8, No. 1, February 1988, pp. 8-27.

[31] A. Pleszkun and E. Davidson, "Structured Memory Access Architecture", Proceedings of the 1983 International Conference on Parallel Processing, 1983, pp. 461-471.

[32] M. Rose, The Open Book: A Practical Perspective on OSI, Prentice-Hall Inc., 1990.

[33] E. Russell, Building Simulation Models with SIMSCRIPT II.5, CACI, 1983.

[34] A. Smith, "Sequential Program Prefetching in Memory Hierarchies", Computer, December 1978, pp. 7-21.

[35] A. Smith, "Cache Memories", Computing Surveys, Vol. 14, No.
3, September 1982, pp. 473-530.

[36] H. Stone and J. Cocke, "Computer Architecture in the 1990s", Computer, Vol. 24, No. 9, September 1991, pp. 30-38.

[37] L. Takeuchi, "Study of OSI Protocol Processing Engine", Master's Thesis, Department of Electrical Engineering, University of British Columbia, July 1991.

[38] M. Terada, T. Yokoyama, T. Hirata and S. Matsui, "A High Speed Protocol Processor to Execute OSI", IEEE Infocom, Vol. 2, Piscataway, NJ, USA, 1991, pp. 944-949.

[39] M. Zitterbart, "High-Speed Protocol Implementations Based on a Multiprocessor Architecture", Proceedings of the IFIP WG 6.1/WG 6.4 International Workshop on Protocols for High-Speed Networks, Zurich, Switzerland, May 1989, North-Holland, pp. 151-163.

[40] M. Zitterbart, "Parallelism in Communication Subsystems", IBM Research Report, 1992.

Appendix A

Simulation Details

Simulation is used to evaluate the different designs because it allows us to try out different configurations and combinations of parameters more easily than implementing the system. It is also needed to study system behavior that cannot be derived from the analysis presented, because of the many assumptions that were made to keep the analysis tractable. Our simulation involves simulating both the hardware and the software of the system. The simulation program is written in SIMSCRIPT II.5 and run on Sun SPARC workstations.

A.1 Simulation model

SIMSCRIPT constructs for processes and resources are used to model the system components. A process is defined as an object together with the sequence of actions it experiences throughout its life in the model. A process routine is used to describe the activity of a process. A resource is defined as a passive object which is required by the process objects. If a resource is not available when required, the process object is made to wait until the resource becomes available.

Processors in the system are simulated as processes, while buses and memory systems are simulated as resources.
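The process/resource pattern just described can be sketched with a minimal discrete-event loop. The following Python sketch is our own illustration, not SIMSCRIPT and not the thesis simulator; the `Sim`, `Resource` and `processor` names are invented for the example. Processes are generators that yield requests to the scheduler, and a busy resource parks its requesters in FIFO order, as SIMSCRIPT does by default.

```python
import heapq
from collections import deque

class Resource:
    """Passive object; processes that find it busy wait in FIFO order."""
    def __init__(self):
        self.busy = False
        self.waiting = deque()          # parked processes

class Sim:
    """Event-list scheduler: (time, sequence number, process)."""
    def __init__(self):
        self.now = 0.0
        self.events = []
        self.seq = 0

    def schedule(self, delay, proc):
        heapq.heappush(self.events, (self.now + delay, self.seq, proc))
        self.seq += 1

    def run(self):
        while self.events:
            self.now, _, proc = heapq.heappop(self.events)
            self.step(proc)

    def step(self, proc):
        try:
            verb, arg = next(proc)
        except StopIteration:
            return                       # process finished its lifetime
        if verb == 'hold':               # work for `arg` time units
            self.schedule(arg, proc)
        elif verb == 'request':          # acquire resource or park
            if arg.busy:
                arg.waiting.append(proc)
            else:
                arg.busy = True
                self.schedule(0.0, proc)
        elif verb == 'release':          # hand resource to next waiter
            if arg.waiting:
                self.schedule(0.0, arg.waiting.popleft())
            else:
                arg.busy = False
            self.schedule(0.0, proc)

log = []

def processor(sim, name, bus, service_time):
    # A processor handling one packet: grab the shared bus, hold it
    # for the access time, then release it and record completion.
    yield ('request', bus)
    yield ('hold', service_time)
    yield ('release', bus)
    log.append((sim.now, name))

sim = Sim()
bus = Resource()
for i, t in enumerate([3.0, 1.0, 2.0]):
    sim.schedule(0.0, processor(sim, f'P{i}', bus, t))
sim.run()
print(log)   # completion times reflect FIFO queueing for the bus
```

With the three service times above, the processors finish at t = 3, 4 and 6, since each must wait for the bus in arrival order.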
Instruction executions are simulated by obtaining the memory resource and/or the bus, depending on whether the instruction is in the cache, local memory or shared memory, and waiting for a fixed amount of time corresponding to the processor speed. The number and frequency of memory accesses for processing a packet are based on static code counts, as used in other studies [37][23]. Only the processing of data packets, acknowledgement packets, connection packets and disconnection packets is simulated. Segmentation and reassembly of packets are done only in the transport layer, for simplicity. Since we are only interested in the maximum processing throughput, error and flow controls are not simulated.

The structured chart in figure A.1 shows the structure of the simulation program for a processing element.

[Figure A.1: Overview of the simulation program — the processor loop gets a job from the active queue, processes the packet (process header, transfer syntax conversion, checksum computation, synchronization operations, obtaining the bus and memory resources), and inserts or moves jobs between the active and suspended queues.]

Not shown in the diagram are the network and host interface adapters, which insert new jobs into the active queue. Monitoring routines that collect system statistics, and routines that perform the initialization and termination operations, are also not included. Three input files are needed, specifying the system configuration, the host data and the network data. One output file is produced, giving the system statistics at regular intervals and at the end of the simulation.

A.2 Protocol processing data

The data used for simulating the software, or the timing aspect, are based on static code counts. The set of data used in [37] is employed because it represents an unoptimized implementation on a single-processor system.
The data link layer statistics are for the IEEE 802.2 logical link control, type I. The network and transport layer statistics are for the Internet Protocol (IP) and the Transmission Control Protocol (TCP) respectively, from an implementation in C done for Unix 4.3BSD by the University of California, Berkeley. The session and presentation layer statistics are for the OSI Session and Presentation protocols respectively, from the ISODE implementation done in C (version 6.0) by Marshall Rose. Table A.1 shows the number of times a PDU is accessed by the different layers.

Protocol Layer | receive | send | value used in analysis
Datalink       |    6    |   6  |   6
Network        |   14    |  16  |  16
Transport      |   62    |  20  |  62
Session        |   60    |  40  |  60
Presentation   |   30    |  30  |  30

Table A.1: Number of accesses to PDUs for the different layers

In our analytical model, we have simply taken the worst case, disregarding the difference between send and receive operations, for simplicity. Therefore, we have assumed that the number of PDU accesses is 84(l/p) + 90, where l is the length of the packet and p is the network packet size. For large packets, multiple packets would be needed in the lower three layers, and thus more processing is required.

Table A.2 shows the number of state and queue accesses in the different layers.

Protocol Layer | receive | send | value used in analysis
Datalink       |    0    |   0  |   0
Network        |   22    |  20  |  22
Transport      |   65    |  90  |  90
Session        |   40    |  30  |  40
Presentation   |   40    |  30  |  40

Table A.2: Number of state accesses for the different layers

Assuming that we need around 16 accesses for acquiring a lock, which includes accessing the queue and actually getting the lock, and using the maximum value among the send and receive operations, the value used in our analysis becomes (112 + 16 + r_t)(l/p) + 80 + 2×16 + r_s + r_p, where the r_i are the numbers of lock retries for the transport, session and presentation layers respectively.

Table A.3 shows the number of instructions executed in the different layers.
Protocol Layer | receive | send | connect | disconnect | ack | average
Datalink       |   45    |  40  |    -    |     -      |  -  |  42.5
Network        |  160    | 160  |    -    |     -      |  -  |  160
Transport      |  375    | 400  |   250   |    250     | 200 |  355
Session        |  450    | 400  |   400   |    400     |  -  |  422.5
Presentation   |  400    | 350  |   350   |    350     |  -  |  372.5

Table A.3: Number of instruction accesses for the different layers

The average values used in the analytical model are obtained by assuming that 10% of the traffic comes from acknowledgments, 5% comes from connection requests, 5% comes from disconnection requests, and the rest comes from data packets¹. The number of instructions used in the analysis is obtained from these average values and is equal to 557.5(l/p) + 795.

Transfer syntax conversion is assumed to take 5-20 instructions for encoding and 5-40 instructions for decoding. These instruction counts are based on the ASN.1 encoder/decoder implemented by Mike Sample of the Computer Science department at the University of British Columbia [37]. We have further assumed that only part of the packet data is copied during the conversion, which reduces the number of bytes accessed (read and write) correspondingly.

The proportion of register-only instructions is assumed to be 75%, based on an educated guess. The actual number would depend on the processor used and on the software implementation, which could vary a lot. The accuracy of this estimate was verified by comparing the simulated processing latency under this assumption with the published measurements for a RISC multiprocessor protocol processing system described in [15]. The simulated results were found to be of the same order of magnitude as the reported measurements.

¹ Half of which are packets sent out and half packets received.

A.3 Simulation parameters

In our simulations, we have to set a number of parameters to enable us to observe the maximum processing throughput and to optimize the simulation time. We have used both a normal and a Poisson distribution for the network and host workload, with a mean of 10000 and a standard deviation of 5000.
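The per-packet cost formulas of section A.2 can be collected into a small calculator. This Python sketch is our own illustration of the formulas above (the function names are ours, and the lock-retry counts r_t, r_s, r_p are left as parameters):

```python
def pdu_accesses(l, p):
    # Worst-case PDU accesses: 84 per network packet in the lower three
    # layers (6 + 16 + 62), plus 90 (60 + 30) in the upper two layers.
    return 84 * (l / p) + 90

def state_accesses(l, p, r_t=0, r_s=0, r_p=0):
    # State/queue accesses: 112 per network packet (0 + 22 + 90), plus a
    # 16-access lock acquisition and r_t retries; 80 accesses plus two
    # locks and the r_s, r_p retries in the upper layers.
    return (112 + 16 + r_t) * (l / p) + 80 + 2 * 16 + r_s + r_p

def instructions(l, p):
    # Average instruction counts: 557.5 per network packet
    # (42.5 + 160 + 355), plus 795 (422.5 + 372.5) in the upper layers.
    return 557.5 * (l / p) + 795

def traffic_average(recv, send, conn, disc, ack):
    # Weighted average per Table A.3: 80% data packets (half sent, half
    # received), 5% connects, 5% disconnects, 10% acknowledgments.
    return 0.4 * recv + 0.4 * send + 0.05 * conn + 0.05 * disc + 0.1 * ack

# Transport layer row of Table A.3: 375/400/250/250/200 -> average 355.
print(traffic_average(375, 400, 250, 250, 200))
# One network-sized packet (l == p): 557.5 + 795 = 1352.5 instructions.
print(instructions(4096, 4096))
```

Evaluating `traffic_average` on the other rows of Table A.3 (treating the missing entries as the send/receive mean) reproduces the remaining averages in the same way.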
The normal distribution is used in most simulations unless otherwise specified. The other parameters used are described in the following sections.

A.3.1 Obtaining peak performance

One way of achieving peak performance is to keep sending packets into the system. However, data sent into the system consume a certain amount of processing power even before actual processing starts: the active job queue needs to be updated, the burst access bus may be used for moving the packet to the shared memory, and locks may be acquired. All of these are important shared resources, and we would like to reduce this kind of overuse, which would affect the results. In the simulation, a feedback mechanism is used to ensure that the input data rate is always higher than the output data rate, so that all processors are kept busy while not too many extra packets are sent. This is achieved by setting the input rate to about 4 Mbits/s above the output rate, and by keeping the number of outstanding jobs in the active queue just greater than 3 times the number of processors. Since we want to find the peak throughput, flow control and error control are not simulated, as they would reduce the peak rate.

A.3.2 Selecting the content of the workload

In a real system, the traffic going through the protocol processing system will consist of varying proportions of network and host packets. In our simulations, a 50-50 load is used. This gives a good estimate when the peers are communicating on an equal basis. Different proportions of network and host packets have been simulated for the SPR design with 10 processors, using a Poisson interarrival rate and an exponential service time to model a realistic system. The results are shown in table A.4 and plotted in figure A.2.
Host traffic % | Net traffic % | Throughput (MBits/s)
  0 | 100 | 78.16
 20 |  80 | 82.72
 40 |  60 | 81.04
 60 |  40 | 81.28
 80 |  20 | 81.28
100 |   0 | 77.52

Table A.4: Effects on throughput with different workload

It is observed from figure A.2 that a higher throughput is obtained in the middle of the graph, where the ratio of host to network traffic is closer to 1. This is because when all the traffic comes from the same source, the chance of lock contention is a lot higher, which reduces the overall performance.

[Figure A.2: Throughput variation with contents of workload — throughput (MBits/s) plotted against the percentage of host traffic.]

A.4 Analysis of accuracy

In our simulations, random numbers are used. For an accurate simulation, the results should not depend on the particular random numbers used. The following sections describe what we have done to ensure that the choice of random numbers does not affect our simulation results.

A.4.1 Random numbers

SIMSCRIPT generates pseudo-random numbers based on the Lehmer technique: a starting seed is multiplied by a constant to produce a new seed and a sample. In SIMSCRIPT, 10 different random number streams are provided by default. These streams are used in our simulations, each for a different purpose, in an attempt to keep the results reproducible, so that changing the usage of one random number stream does not affect other parts in unknown ways. Further information about the random number generators in SIMSCRIPT can be found in [33].

A.4.2 Sensitivity to random numbers

The result obtained from a single simulation gives us an estimate of the true value. This estimate is affected by the random number stream and also by the simulated time. Ideally, the result should not be sensitive to the choice of the random seeds.
Using a long simulated time will reduce the effects of the random numbers, but a long simulated time implies a long simulation time.

To be statistically accurate, each simulation should be repeated a number of times with different random seeds. However, a typical simulation takes up to 5-10 hours on a Sun SPARCstation 2 workstation, and repeating every single run 10 or more times was not feasible for this research. Instead, we have looked at the SSM design, where the random seed is exercised the least, and tried to determine its sensitivity to the random seeds. The set of tests using different numbers of processors for the simple shared memory design was repeated 10 times. Different random number streams provided by SIMSCRIPT were used in each run. The resulting throughputs and bus utilizations are shown in Tables A.5, A.6 and A.7.

Trial | 1 procs | 2 procs | 4 procs | 8 procs | 16 procs
  1  | 0.482 | 0.858 | 1.820 | 3.457 | 5.348
  2  | 0.537 | 0.972 | 1.876 | 3.658 | 5.400
  3  | 0.558 | 1.078 | 2.141 | 3.838 | 5.668
  4  | 0.558 | 1.074 | 2.008 | 3.730 | 5.397
  5  | 0.479 | 1.012 | 1.946 | 3.601 | 5.397
  6  | 0.537 | 0.972 | 1.840 | 3.727 | 5.432
  7  | 0.559 | 1.074 | 1.961 | 3.733 | 5.238
  8  | 0.480 | 0.971 | 1.914 | 3.623 | 5.456
  9  | 0.480 | 0.971 | 1.913 | 3.776 | 5.415
 10  | 0.482 | 0.982 | 1.990 | 3.877 | 5.564
mean | 0.515 | 0.996 | 1.941 | 3.702 | 5.432
s.d. | 0.037 | 0.067 | 0.093 | 0.123 | 0.116
  %  |   7   |   7   |   5   |   3   |   2

Table A.5: Standard deviation of peak throughput (MBytes/s)

Trial | 1 procs | 2 procs | 4 procs | 8 procs | 16 procs
  1  | 0.115 | 0.197 | 0.368 | 0.658 | 0.971
  2  | 0.114 | 0.200 | 0.366 | 0.662 | 0.975
  3  | 0.114 | 0.207 | 0.383 | 0.673 | 0.982
  4  | 0.116 | 0.204 | 0.378 | 0.667 | 0.972
  5  | 0.113 | 0.203 | 0.372 | 0.662 | 0.974
  6  | 0.114 | 0.200 | 0.364 | 0.661 | 0.978
  7  | 0.115 | 0.204 | 0.377 | 0.667 | 0.974
  8  | 0.113 | 0.199 | 0.368 | 0.660 | 0.975
  9  | 0.113 | 0.199 | 0.366 | 0.664 | 0.975
 10  | 0.114 | 0.200 | 0.375 | 0.678 | 0.977
mean | 0.114 | 0.201 | 0.372 | 0.665 | 0.975
s.d. | 0.0010 | 0.0031 | 0.0063 | 0.0062 | 0.0031
  %  |   1   |   2   |   2   |   1   |  0.3

Table A.6: Standard deviation of random access bus usage

The standard deviation for the throughput values is less than 10% for all configurations.
As the number of processors increases, the standard deviation decreases because, with more processing power, more packets get through; thus more random numbers are consumed and the initial choice of the seed is not as important.

Trial | 1 procs | 2 procs | 4 procs | 8 procs | 16 procs
  1  | 0.023 | 0.035 | 0.069 | 0.122 | 0.196
  2  | 0.025 | 0.040 | 0.071 | 0.132 | 0.200
  3  | 0.026 | 0.044 | 0.080 | 0.137 | 0.210
  4  | 0.025 | 0.042 | 0.076 | 0.135 | 0.199
  5  | 0.023 | 0.041 | 0.072 | 0.128 | 0.202
  6  | 0.025 | 0.040 | 0.069 | 0.134 | 0.202
  7  | 0.026 | 0.043 | 0.074 | 0.135 | 0.195
  8  | 0.023 | 0.040 | 0.072 | 0.131 | 0.202
  9  | 0.023 | 0.040 | 0.071 | 0.135 | 0.202
 10  | 0.023 | 0.040 | 0.075 | 0.140 | 0.205
mean | 0.024 | 0.041 | 0.073 | 0.133 | 0.201
s.d. | 0.0013 | 0.0024 | 0.0034 | 0.0050 | 0.0043
  %  |   5   |   6   |   5   |   4   |   2

Table A.7: Standard deviation of burst access bus usage

The standard deviation for the bus utilization shows a similar trend, although the variation is even smaller. This is mainly because during the startup phase of the simulation, resources are consumed but no data get out of the system until after the required processing latency time. Since the throughput is computed from the total data output over the entire simulated time, the startup phase actually reduces its value; the throughput obtained thus underestimates the real value. This throughput value could be adjusted by excluding the first part of the simulation when computing the throughput. That approach is not taken, as we prefer not to be too optimistic in establishing the peak values.

A.4.3 Picking the simulated time

Another set of simulations was carried out to observe the effects of the simulated time. As discussed above, a longer simulated time will reduce the bias introduced in the startup phase and thus give a higher value for the throughput. A longer simulated time will also reduce the error introduced by not accounting for the jobs still running in the processors when the simulated time limit is reached.
Table A.8 shows the simulation results. The results indicate that as the simulated time increases, the throughput also increases, while the bus utilization remains the same. By using a 0.5 s simulated time, we are underestimating the throughput by only 10% from the observed mean. The random and burst access bus utilizations are within 2% of the observed mean. This pessimistic estimate is useful for establishing a lower bound on the maximum throughput and bus utilization. In all of our simulations, a 0.5 s simulated time is used.

Simulated time (s) | Throughput (MBytes/s) | Random Bus Usage | Burst Bus Usage
 0.5 | 0.482 | 0.115  | 0.023
  1  | 0.500 | 0.111  | 0.023
  2  | 0.519 | 0.114  | 0.024
  3  | 0.534 | 0.113  | 0.024
  4  | 0.533 | 0.113  | 0.024
  5  | 0.520 | 0.112  | 0.024
mean | 0.515 | 0.113  | 0.0237
s.d. | 0.020 | 0.0014 | 0.0005
  %  |   4   |   1    |   2

Table A.8: Variation due to different simulated time

Appendix B

Queueing model

In section 2.3, we discussed what queueing model would be suitable for the packets entering the protocol processing system. Here, we use classical queueing theory to determine which queueing model is better. For the following analysis, assume that the system has a Poisson input rate and an exponential service time.

B.1 Multiple queue, multiple server

A multiple queue, multiple server system can be represented by an M/M/1 queue for each of the servers. Assume that the total input rate is λ; then for each of the m servers the input rate is λ/m. From queueing theory [27], the average delay time for each packet is given by:

    T_m = 1 / (μ - λ/m)    (B.10)

B.2 Single queue, multiple server

From the standard queueing analysis of an M/M/m queue [27], the average delay time is given by:

    T_s = 1/μ + P_Q / (mμ - λ)    (B.11)

where P_Q is the Erlang C function,

    P_Q = 1 / [ 1 + m! (μ/λ)^m (1 - ρ) Σ_{n=0}^{m-1} (1/n!) (λ/μ)^n ]

and ρ = λ/(mμ).
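The two delay formulas can be compared numerically. This Python sketch is our own illustration (the function names are ours); it evaluates equations B.10 and B.11 for a sample two-server system and checks the condition P_Q < λ/μ derived in section B.3:

```python
from math import factorial

def erlang_c(lam, mu, m):
    # P_Q from equation B.11:
    # 1 / (1 + m! (mu/lam)^m (1 - rho) * sum_{n=0}^{m-1} (lam/mu)^n / n!)
    rho = lam / (m * mu)
    s = sum((lam / mu) ** n / factorial(n) for n in range(m))
    return 1.0 / (1.0 + factorial(m) * (mu / lam) ** m * (1 - rho) * s)

def t_multi(lam, mu, m):
    # B.10: each server is an independent M/M/1 queue fed at rate lam/m.
    return 1.0 / (mu - lam / m)

def t_single(lam, mu, m):
    # B.11: average time in the M/M/m system.
    return 1.0 / mu + erlang_c(lam, mu, m) / (m * mu - lam)

lam, mu, m = 0.8, 1.0, 2          # arrival rate, service rate, servers
pq = erlang_c(lam, mu, m)
# Closed form for m = 2 (the base case of the proof in B.3):
# P_Q = lam^2 / (mu (2 mu + lam)).
print(abs(pq - lam**2 / (mu * (2 * mu + lam))) < 1e-12)   # True
print(pq < lam / mu)                                      # True
print(t_single(lam, mu, m) < t_multi(lam, mu, m))         # True
```

The last line confirms numerically what the induction below proves in general: the single-queue system always has the smaller average delay.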
[Figure B.1: A multiple queue, multiple server system — a scheduler distributes arriving jobs among per-server job queues.]

B.3 Comparing single queue and multiple queue systems

Comparing equations B.10 and B.11, it can be deduced that the single queue system will have a smaller average delay time if

    λ/μ > P_Q

It can be proved by mathematical induction that T_m is always greater than T_s.

[Figure B.2: A single queue, multiple server system — all servers draw jobs from one common job queue.]

Proof: We prove that P_Q < λ/μ is true for all integers m ≥ 2, where

    P_Q = 1 / [ 1 + m! (μ/λ)^m (1 - ρ) Σ_{n=0}^{m-1} (1/n!) (λ/μ)^n ],  ρ = λ/(mμ).

For m = 2:

    P_Q = 1 / [ 1 + 2 (μ/λ)^2 (1 - λ/(2μ)) (1 + λ/μ) ]
        = λ^2 / [ λ^2 + (2μ - λ)(μ + λ) ]
        = λ^2 / [ μ (2μ + λ) ]

and since λ/(2μ + λ) < 1,

    λ^2 / [ μ (2μ + λ) ] < λ/μ.

Thus the statement is true when m = 2.

Assume that the statement is true for m = k; equivalently,

    k! (μ/λ)^k (1 - ρ) Σ_{n=0}^{k-1} (1/n!) (λ/μ)^n > μ/λ - 1.

When m = k + 1, the denominator of P_Q can be written as

    1 + (k+1) (μ/λ) (1 - ρ') [ k! (μ/λ)^k Σ_{n=0}^{k-1} (1/n!) (λ/μ)^n + 1 ],  ρ' = λ/((k+1)μ),

because the last term of the sum, k! (μ/λ)^k (λ/μ)^k / k!, equals 1. Since 1 - ρ' > 1 - ρ, the assumption gives

    denominator > 1 + (k+1)(μ/λ) [ (μ/λ - 1) + (1 - λ/((k+1)μ)) ]
                = 1 + (k+1)(μ/λ)^2 - 1
                = (k+1)(μ/λ)^2.

Therefore

    P_Q < 1 / [ (k+1)(μ/λ)^2 ] = λ^2 / [ (k+1) μ^2 ] < λ/μ,

since λ/((k+1)μ) = ρ' < 1. Thus P_Q < λ/μ when m = k + 1.

Since the statement is true when m = 2, and it is true for m = k + 1 whenever it is true for m = k, by the principle of mathematical induction the statement is true for all integers m ≥ 2. (Q.E.D.)

Appendix C

Simulation Results

The following tables contain the simulation results of the various designs discussed in the thesis.
These data have been plotted in a number of ways throughout the thesis.

Procs | Throughput (MBits/s) | Random bus util. | Burst bus util. | Processor util. | Speedup
  1 |   3.84 | 0.115 | 0.023 | 1.00 |  1.00
  2 |   6.88 | 0.197 | 0.035 | 0.90 |  1.80
  4 |  14.56 | 0.368 | 0.069 | 0.95 |  3.80
  8 |  27.68 | 0.658 | 0.122 | 0.90 |  7.20
 12 |  37.68 | 0.868 | 0.171 | 0.82 |  9.84
 14 |  41.04 | 0.937 | 0.191 | 0.76 | 10.64
 16 |  42.80 | 0.971 | 0.196 | 0.70 | 11.20
 20 |  43.52 | 0.995 | 0.204 | 0.57 | 11.40

Table C.1: SSM design with 10 MIPS processors

Procs | Throughput (MBits/s) | Random bus util. | Burst bus util. | Processor util. | Speedup
  1 |   7.12 | 0.163 | 0.037 | 1.00 |  1.00
  2 |  14.0  | 0.296 | 0.065 | 0.98 |  1.96
  4 |  26.0  | 0.537 | 0.117 | 0.91 |  3.64
  8 |  46.56 | 0.875 | 0.204 | 0.82 |  6.56
 16 |  52.48 | 0.998 | 0.237 | 0.46 |  7.36
 20 |  52.00 | 0.998 | 0.239 | 0.37 |  7.40

Table C.2: SSM design with 20 MIPS processors

Procs | Throughput (MBits/s) | Random bus util. | Burst bus util. | Processor util. | Speedup
  1 |  12.48 | 0.250 | 0.056 | 1.00 |  1.00
  2 |  23.44 | 0.448 | 0.105 | 0.94 |  1.88
  4 |  44.88 | 0.765 | 0.195 | 0.90 |  3.60
  8 |  60.08 | 0.985 | 0.259 | 0.60 |  4.80
 16 |  57.60 | 0.998 | 0.260 | 0.29 |  4.64

Table C.3: SSM design with 40 MIPS processors

Procs | Throughput (MBits/s) | Random bus util. | Burst bus util. | Processor util. | Speedup
  1 |   3.84 | 0.052 | 0.023 | 1.00 |  1.00
  2 |   7.04 | 0.076 | 0.037 | 0.92 |  1.84
  4 |  14.64 | 0.137 | 0.069 | 0.95 |  3.80
  8 |  30.24 | 0.256 | 0.134 | 0.98 |  7.84
 16 |  57.76 | 0.478 | 0.261 | 0.94 | 15.04
 20 |  72.32 | 0.579 | 0.320 | 0.94 | 18.80
 24 |  85.44 | 0.689 | 0.383 | 0.93 | 22.32
 28 |  99.12 | 0.791 | 0.433 | 0.92 | 25.76
 32 | 107.84 | 0.880 | 0.470 | 0.88 | 28.16
 36 | 109.60 | 0.959 | 0.479 | 0.79 | 32.94

Table C.4: LIM design with 10 MIPS processors

Procs | Throughput (MBits/s) | Random bus util. | Burst bus util. | Processor util. | Speedup
  1 |   7.12 | 0.067 | 0.037 | 1.00 |  1.00
  2 |  14.56 | 0.115 | 0.067 | 1.00 |  2.00
  4 |  28.88 | 0.209 | 0.128 | 1.00 |  4.00
  8 |  56.64 | 0.388 | 0.245 | 0.99 |  7.92
 16 | 109.28 | 0.710 | 0.460 | 0.96 | 15.36
 20 | 130.48 | 0.845 | 0.554 | 0.92 | 18.40
 24 | 143.12 | 0.939 | 0.602 | 0.84 | 20.16
 28 | 135.20 | 0.988 | 0.585 | 0.68 | 19.04

Table C.5: LIM design with 20 MIPS processors

Procs | Throughput (MBits/s) | Random bus util. | Burst bus util. | Processor util. | Speedup
  1 |  12.72 | 0.097 | 0.060 | 1.00 |  1.00
  2 |  25.12 | 0.173 | 0.112 | 0.99 |  1.98
  4 |  50.40 | 0.318 | 0.216 | 0.99 |  3.96
  8 |  99.12 | 0.587 | 0.415 | 0.97 |  7.76
 12 | 140.64 | 0.805 | 0.581 | 0.92 | 11.04
 14 | 153.60 | 0.876 | 0.631 | 0.86 | 12.04
 16 | 164.16 | 0.939 | 0.667 | 0.81 | 12.96
 20 | 164.72 | 0.992 | 0.673 | 0.65 | 13.00

Table C.6: LIM design with 40 MIPS processors

Procs | Throughput (MBits/s) | Random bus util. | Burst bus util. | Processor util. | Speedup
  1 |   3.84 | 0.031 | 0.029 | 1.00 |  1.00
  2 |   6.88 | 0.039 | 0.046 | 0.90 |  1.80
  4 |  14.64 | 0.062 | 0.093 | 0.95 |  3.80
  8 |  30.88 | 0.111 | 0.190 | 1.00 |  8.00
 16 |  60.56 | 0.201 | 0.379 | 0.99 | 15.84
 20 |  74.96 | 0.248 | 0.473 | 0.98 | 19.60
 32 | 116.48 | 0.395 | 0.730 | 0.95 | 30.40
 40 | 138.00 | 0.508 | 0.882 | 0.90 | 36.00
 44 | 154.40 | 0.558 | 0.946 | 0.91 | 40.04
 45 | 150.56 | 0.618 | 0.906 | 0.87 | 39.15
 48 | 108.72 | 0.737 | 0.744 | 0.59 | 28.32

Table C.7: SPR design with 10 MIPS processors

Procs | Throughput (MBits/s) | Random bus util. | Burst bus util. | Processor util. | Speedup
  1 |   7.28 | 0.037 | 0.047 | 1.00 |  1.00
  2 |  14.32 | 0.054 | 0.088 | 0.98 |  1.96
  4 |  28.16 | 0.089 | 0.167 | 0.97 |  3.88
  8 |  56.40 | 0.161 | 0.336 | 0.97 |  7.76
 16 | 113.92 | 0.313 | 0.675 | 0.98 | 15.68
 20 | 136.64 | 0.376 | 0.806 | 0.94 | 18.80
 24 | 157.44 | 0.439 | 0.914 | 0.90 | 21.60
 26 | 165.20 | 0.478 | 0.967 | 0.87 | 22.62
 27 | 169.92 | 0.489 | 0.988 | 0.86 | 23.22
 28 | 176.80 | 0.515 | 0.999 | 0.87 | 24.36

Table C.8: SPR design with 20 MIPS processors
Procs | Throughput (MBits/s) | Random bus util. | Burst bus util. | Processor util. | Speedup
  1 |  12.64 | 0.047 | 0.077 | 1.00 |  1.00
  2 |  25.12 | 0.075 | 0.148 | 0.99 |  1.98
  4 |  50.24 | 0.132 | 0.289 | 0.99 |  3.96
  8 | 100.32 | 0.248 | 0.575 | 0.99 |  7.92
 12 | 146.88 | 0.358 | 0.830 | 0.97 | 11.64
 14 | 162.56 | 0.397 | 0.908 | 0.92 | 12.88
 15 | 169.52 | 0.418 | 0.945 | 0.89 | 13.35
 16 | 176.96 | 0.438 | 0.981 | 0.88 | 14.08

Table C.9: SPR design with 40 MIPS processors

Procs | Throughput (MBits/s) | Random bus util. | Burst bus util. | Processor util. | Speedup
  1 |  13.92 | 0.048 | 0.066 | 1.00 |  1.00
  2 |  28.40 | 0.080 | 0.135 | 1.00 |  2.00
  4 |  52.96 | 0.133 | 0.256 | 0.95 |  3.80
  8 | 104.56 | 0.246 | 0.494 | 0.94 |  7.52
 16 | 195.28 | 0.448 | 0.873 | 0.88 | 14.08
 20 | 227.20 | 0.520 | 0.986 | 0.82 | 16.40

Table C.10: IPR design using 40 MIPS processors

Procs | Throughput (MBits/s) | Random bus util. | Burst bus util. | Processor util. | Speedup
  1 |   7.12 | 0.034 | 0.034 | 1.00 |  1.00
  2 |  14.80 | 0.050 | 0.071 | 1.00 |  2.00
  4 |  30.48 | 0.085 | 0.148 | 1.00 |  4.00
  8 |  56.73 | 0.143 | 0.276 | 1.00 |  8.00
 16 | 114.21 | 0.273 | 0.542 | 1.00 | 16.00
 20 | 137.31 | 0.325 | 0.648 | 0.96 | 19.20
 32 | 210.82 | 0.493 | 0.937 | 0.93 | 29.76
 36 | 228.86 | 0.543 | 0.987 | 0.89 | 32.04

Table C.11: IPR design using 40 MIPS processors and 2 times transfer syntax operations

Procs | Throughput (MBits/s) | Random bus util. | Burst bus util. | Processor util. | Speedup
  1 |  16.63 | 0.038 | 0.025 | 1.00 |  1.00
  2 |  33.43 | 0.058 | 0.050 | 1.00 |  2.00
  4 |  64.41 | 0.096 | 0.096 | 0.97 |  3.88
  8 | 133.94 | 0.184 | 0.198 | 1.00 |  8.00
 16 | 262.02 | 0.341 | 0.382 | 0.98 | 15.68
 20 | 326.34 | 0.417 | 0.469 | 0.98 | 19.60
 26 | 409.67 | 0.530 | 0.582 | 0.95 | 24.70
 32 | 488.83 | 0.640 | 0.685 | 0.92 | 29.44
 40 | 565.90 | 0.772 | 0.777 | 0.85 | 34.00
 50 | 547.95 | 0.881 | 0.804 | 0.66 | 33.00

Table C.12: IPR design using 40 MIPS processors and fast memory

Appendix D

Glossary of terms

ASN    Abstract syntax notation
BAB    Burst access bus
DMA    Direct memory access
DRAM   Dynamic random access memory
FCFS   First come first serve
IPR    Improved packet relocation
LIM    Local instruction memory
MIMD   Multiple instruction multiple data
MIPS   Million instructions per second
OSI    Open system interconnection
PCI    Protocol control information
PDU    Protocol data unit
RAB    Random access bus
RISC   Reduced instruction set computer
SAP    Service access point
SDU    Service data unit
SIMD   Single instruction multiple data
SPR    Simple packet relocation
SSM    Simple shared memory
VRAM   Video random access memory