A PARALLEL ALGORITHM FOR ASN.1 ENCODING/DECODING

by

Carlton Charles Agnel Joseph
B.Eng. (EE), McGill, 1990

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE STUDIES, DEPARTMENT OF ELECTRICAL ENGINEERING

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
September 1992
© Carlton C.A. Joseph, 1991

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Electrical Engineering
The University of British Columbia
Vancouver, Canada

Abstract

Computers differ in many ways: characters may be coded in ASCII, EBCDIC or UNICODE; integers can vary in size; processors can operate in Big-Endian or Little-Endian format; and programming languages can use different memory representations for data. The International Standards Organization (ISO) provides an Open System Interconnection (OSI) seven-layer reference model[20] to facilitate heterogeneous inter-computer communication.

Multimedia technology places new demands on inter-computer communication speeds because of the higher bandwidth requirements of voice and image data. Various implementations of high speed networks meet this demand at the lower layers. The presentation layer of the ISO reference model is the dominant component of the total protocol processing time[5]. Hence, the presentation layer represents a bottleneck in communication systems that use high speed networks.
The presentation layer is traditionally implemented in software, but here a combined parallel hardware/software approach is taken.

Prior to this thesis, VASN.1[3][24][25] was the only parallel hardware/software combination that achieved high throughput in the presentation layer. The parallel algorithm of VASN.1 limits the scalability of the hardware and hence the achievable parallelism. VASN.1 users are expected to use the parallel data structure 1 that it uses. This limitation reduces the flexibility for VASN.1 users and incurs an additional overhead for handling the data structure.

The parallel system presented in this thesis avoids the serious problems found in VASN.1. A new parallel algorithm has been developed which provides users with flexible serial data structures 2. The encoding algorithm converts flexible serial data structures into parallel information which allows simultaneous processing. The conversion from serial to parallel information has not been previously explored in the literature. The decoding algorithm also provides maximum parallelism while providing a flexible serial data structure for applications to use. The algorithm has been designed in such a way that any changes in the ASN.1 specification can be easily incorporated into the system. The final advantage is that the hardware scales gracefully and allows for more parallelism than VASN.1.

1 A parallel data structure is a data structure that contains extra information required for a parallel algorithm.
2 A serial data structure is a data structure that holds no extra information required for a parallel algorithm.
The initial simulations are promising and show that the throughput achievable by the parallel algorithm is five times greater than that achievable by a serial implementation.

Contents

Abstract
List of Tables
List of Figures
Acknowledgments
1 Introduction
  1.1 Communication in Heterogeneous Networks
    1.1.1 ISO Presentation Layer Conversion Standard
  1.2 Presentation Layer Cost
  1.3 Initial Parallel Analysis
  1.4 Previous Work with ASN.1
    1.4.1 Single Processor ASN.1 Implementations
    1.4.2 Multi Processor ASN.1 Implementations
  1.5 Thesis Objectives and Outline
2 Parallel Architecture
  2.1 Hardware Architecture
    2.1.1 Contiguous Memory Management
  2.2 Parallel Algorithm Outline
  2.3 Memory Mapping Scheme
3 Encoding
  3.1 Encoding Interface Data Structures
  3.2 Encoder Generation from ASN.1 Specification
  3.3 Encoder Functions and Data Structures
    3.3.1 EncodeStructBegin()
    3.3.2 EncodeStructEnd()
    3.3.3 QEncode()
  3.4 Encoding Flow Trace
    3.4.1 Host Responsibilities
    3.4.2 Queue-Processor
    3.4.3 Processors
4 Decoding
  4.1 Decoding Interface Data Structures
  4.2 Decoder Generation from ASN.1 Specification
  4.3 Decoder Functions and Data Structures
    4.3.1 Node for Primitive Types
    4.3.2 Node for SEQUENCE and SET Types
    4.3.3 Node for SEQUENCE OF and SET OF Types
  4.4 Decoding Flow Trace
    4.4.1 Host Responsibilities
    4.4.2 Queue-Processor
    4.4.3 Processors
      4.4.3.1 tsptr = tsend
      4.4.3.2 UPDATE bit set
      4.4.3.3 find correct node
      4.4.3.4 CONSTRUCTOR bit set
      4.4.3.5 OF bit set
      4.4.3.6 primitive
5 Hardware Models
  5.1 Bus Arbitration Method
  5.2 Processor
    5.2.1 Integer Unit
    5.2.2 Data Unit
    5.2.3 Instruction Unit
    5.2.4 Register File
      5.2.4.1 Reg(0) = Zero Register
      5.2.4.2 Reg(26) = Frame Pointer
      5.2.4.3 Reg(27) = Stack Pointer
      5.2.4.4 Reg(28) = Return Address
      5.2.4.5 Reg(29) = Queue Address
      5.2.4.6 Reg(30) = Processor Address
      5.2.4.7 Reg(31) = Wait Result
  5.3 Queue-Processor
6 Simulations
  6.1 Simulation with System V IPC
  6.2 Memory Mapper
  6.3 Assembler
  6.4 VHDL Model
  6.5 Utilities
  6.6 Description of a Complete Simulation Cycle
7 Results
  7.1 Need for Standard Test Vectors
  7.2 Simulation Statistics
8 Future Work
  8.1 Simulations
  8.2 Processing Element
  8.3 Parallel use of PEs
  8.4 Fabrication Constraints
9 Conclusion
Bibliography
Appendix A Acronym Definitions
Appendix B Cost per Instruction Mapping
Appendix C Pros and Cons of Static Analysis
Appendix D Processors Memory Read and Write Cycle
Appendix E ASN.1 PersonnelRecord Type
Appendix F ASN.1 Types Instruction Counts

List of Tables

Table 1 CASN.1 Primitive Encoding/Decoding Routines
Table 2 Primitive Routines Instruction Count
Table 3 Parallelism Speed Gain
Table 4 Minimal Instructions for Encoding/Decoding
Table 5 Queue-Processor Commands
Table 6 Processing Element Signals
Table 7 Instructions Executable on a Processing Element
Table 8 Simulation Results in Mbit/s
Table 9 Worst Case Simulation Results for PersonnelRecord
Table 10 Simulation Instruction-per-Byte Values

List of Figures

Figure 1 Abbreviated OSI Seven Layer Model
Figure 2 Example of Local Representation to Transfer Syntax
Figure 3 ASN.1 Description of Personnel Information
Figure 4 ASN.1 Processing Cost
Figure 5 Serial vs Parallel Encode Time
Figure 6 VASN.1 Overview
Figure 7 Parser/Assembler and Encoder/Decoder
Figure 8 Parallel Hardware Setup
Figure 9 Contiguous Memory: Initial
Figure 10 Contiguous Memory: Allocated
Figure 11 Memory Mapping Scheme
Figure 12 Partial List of Mapping Rules
Figure 13 ASN.1 Example
Figure 14 C Type Equivalent of Figure 13
Figure 15 Transfer Syntax of Figure 13 in IDX Structures
Figure 16 Generated Encoding Routines for Figure 13
Figure 17 Encode Queue Element
Figure 18 Encoding Cumulative Length Data Structures
Figure 19 Data Dependency Graph of Figure 13
Figure 20 Postponed ASN.1 Type
Figure 21 Queue with the Encode Example Calls
Figure 22 Partial Flow Diagram for Small Encoding Example
Figure 23 Decoding Queue-Element
Figure 24 Node Data Structure of Type Tree
Figure 25 Two Decode Type Trees Comparing SET and SEQUENCE
Figure 26 Transfer Syntax of Figure 13
Figure 27 Decode Type Tree of Figure 13
Figure 28 Algorithm of Queue-Processor
Figure 29 Partial Flow Diagram for Decoding Example
Figure 30 Decoding Algorithm for Processors
Figure 31 Daisy Chain and Rotating Daisy Chain
Figure 32 Partial Diagram of RDC Arbitration
Figure 33 RDC Timing Diagram
Figure 34 Processor Block Diagram
Figure 35 Queue-Processor
Figure 36 Queue Information
Figure 37 Processor State Information
Figure 38 Software used in Simulation
Figure 39 Development of Algorithm using Unix and System V IPCs
Figure 40 Example Assembler Code
Figure 41 Local Cache Setup
Figure 42 Multi-Bus Global Cache
Figure 43 Memory Read/Write Timing Diagrams
Figure 44 Ten Integers Type

Acknowledgments

I would like to express my gratitude to Dr. Mabo Ito for his suggestions and for supporting me throughout my graduate studies. I also would like to thank Dr. Gerald Neufeld for his insight on ASN.1 and his very valuable comments. A special thanks to Mike Sample for providing me with statistics on ASN.1 processing and Donna Fuller for her support during the preparation period of this thesis. Lastly, I would like to thank my parents for their continuous support, encouragement and understanding.

1 Introduction

Section 1.1 of this chapter covers the need for communication in heterogeneous networks. Section 1.2 presents information showing that the computation cost of the presentation layer accounts for the majority of the total OSI protocol cost[5].
Section 1.3 suggests a parallel approach to reduce this cost. Section 1.4 discusses the approaches of other researchers and their relative advantages and disadvantages. Section 1.5, the final section of the introduction, states the objectives and the outline of this thesis.

1.1 Communication in Heterogeneous Networks

In industry, at home and in educational institutions, there is a large variety of computer hardware configurations available from various manufacturers. There exists a need for computers based on different hardware structures to communicate with one another. Several vendors address this issue by implementing protocols so that networks with heterogeneous computers can communicate. Two of these vendor-specific protocols are Sun Microsystems' XDR[6] and Apollo Computer's NDR[8].

To aid in the goal of truly inter-operable heterogeneous systems, ISO has constructed the OSI seven-layer protocol reference model[17][20]. Figure 1 shows the top three layers of the OSI seven-layer model. Any layer n of Host A communicates with its peer on layer n of Host B using a preset protocol. For instance, the presentation layer's peers in Host A and B communicate using a presentation protocol. The protocol exchanges information by exchanging data units known as protocol data units (PDU) 3. The presentation layers in Host A and Host B do not directly exchange PDUs. The PDU is propagated down through to the lower layers, where layer one is responsible for exchanging information between Host A and Host B. Within the OSI model the presentation layer is responsible for the transformation of information from a machine-dependent format to a machine-independent format and vice-versa.
Section 1.1.1 presents an outline of how this is achieved in a standard way.

3 A list of all the acronyms used in this thesis is presented in Appendix A.

[Figure 1: Abbreviated OSI Seven Layer Model — the application, presentation and session layers of Host A and Host B, each layer pair linked by its peer protocol (application, presentation and session protocols), with the lower layers beneath.]

1.1.1 ISO Presentation Layer Conversion Standard

This section presents a summary of the roles that Abstract Syntax Notation One (ASN.1[26]) and the Basic Encoding Rules (BER[27]) play in the presentation layer. References [11][4] present additional detail on BER and ASN.1.

ASN.1 can specify the data structures of the presentation protocol data units (PPDU) when two machines communicate (figure 1). The ASN.1 specification can then be compiled into routines that encode/decode to/from machine-specific/machine-independent formats. BER are used when transforming data from a machine-specific format to a machine-independent format and vice-versa. Figure 2 shows a possible communication scenario, where Host A transmits information to Host B.

[Figure 2: Example of Local Representation to Transfer Syntax — Host A's internal representation of a Boolean, an Integer and a String (application layer, machine dependent) is encoded using BER in the presentation layer into a transmitted transfer syntax of Tag-Length-Value triples (machine independent), and decoded back using BER into Host B's internal representation.]

In figure 2 one can see that the internal representation of the data on the two machines is different and hence BER are used to convert the data into a generic transfer syntax (TS) and then back to an internal representation. This diagram shows that the TS consists of a tag, a length and the encoded value (TLV 4).
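As a concrete illustration of one TLV triple, the sketch below encodes a one-byte INTEGER in BER short form. This is a hedged example, not the thesis's implementation: the helper name `encode_int_tlv` is invented here, and the full BER INTEGER rules (long-form lengths, multi-byte values) are richer than this.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: BER short-form TLV encoding of a one-byte INTEGER.
 * Tag 0x02 is the UNIVERSAL INTEGER tag; only values that fit in one
 * content byte are handled in this sketch. */
static size_t encode_int_tlv(unsigned char *out, signed char value)
{
    out[0] = 0x02;                  /* T: tag                 */
    out[1] = 0x01;                  /* L: length of the value */
    out[2] = (unsigned char)value;  /* V: encoded value       */
    return 3;                       /* bytes produced         */
}
```

Encoding the integer 50 this way yields the three bytes 02 01 32, matching the three-byte INTEGER transfer-syntax size quoted later in table 4.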
In the triples, the tag represents the information that is stored in the encoded part, the length specifies the encoded part's length and the encoded part holds the actual encoded information. BER specify how information is encoded/decoded. Figure 3 shows an ASN.1 representation of a data structure that holds personnel information. Notice that the data representation is similar to those used in most high level programming languages.

4 TLV = Tag Length Value.

PersonnelInformation ::= SET {
    Name IA5String,
    Age  INTEGER,
    Sex  BOOLEAN }

Figure 3 ASN.1 Description of Personnel Information

The ASN.1 description given in figure 3 is machine-independent and the internal representations shown in figure 2 are the machine-specific information for one person. Host A and Host B in figure 2 are exchanging personnel information for one person. When both machines receive an ASN.1 description of PersonnelInformation they are responsible for maintaining value instances of this data in their machine-specific format and converting it to/from a machine-generic format when transmitting/receiving.

1.2 Presentation Layer Cost

When using the OSI protocol stack, transferring from a machine-dependent to a machine-independent format can consume 97%[5] of the total processing time. Unpublished simulations carried out at the University of British Columbia (UBC) show that the cost is closer to 70% when the more efficient CASN.1[12] 5 implementation is used in the presentation layer. The data gathered by Takeuchi[19] can also provide the processing cost associated with the presentation layer. Takeuchi assumes the processing costs per packet of the OSI lower layers are constant 6. In the presentation layer the cost per byte is calculated for the ASN.1 PersonnelRecord shown in Appendix E. For the ASN.1 PersonnelRecord specification, the encoding/decoding instruction count for the presentation layer is Encode = 350 + 27 * bytes and Decode = 400 + 60 * bytes 7.
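These per-byte cost formulas can be turned into a small sketch that reproduces the layer 6 percentages tabulated in figure 4. The function names are mine; the constants 1000 and 1030 are the Data Set One lower-layer (layers 2-5) send and receive totals.

```c
#include <assert.h>

/* Figure 4's cost model: fixed lower-layer per-packet costs plus the
 * PersonnelRecord per-byte presentation-layer (layer 6) instruction counts. */
static int layer6_encode(int bytes) { return 350 + 27 * bytes; }
static int layer6_decode(int bytes) { return 400 + 60 * bytes; }

/* Percentage of the total protocol cost spent in layer 6. */
static double encode_pct(int bytes)
{
    return 100.0 * layer6_encode(bytes) / (layer6_encode(bytes) + 1000);
}
static double decode_pct(int bytes)
{
    return 100.0 * layer6_decode(bytes) / (layer6_decode(bytes) + 1030);
}
```

For a 100-byte packet this gives 3050 and 6400 layer 6 instructions, i.e. 75.31% and 86.14% of the totals, matching the first row of figure 4's Data Set Two.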
The spreadsheet extract in figure 4 shows the cost of the presentation layer with reference to the total protocol processing cost. Data Set One shows the cost for sending/receiving PDUs at the lower layers. Data Set Two shows that the presentation layer can consume a significant portion of the processing power as the packet size increases.

5 Details of CASN.1 will be given later.
6 The assumption is correct if the CRC checking in the transport layer is omitted or computed by a hardware assist at no cost.
7 Page 55 & Appendix C pages 106-108 of Ref. [19].

Data Set One

Layer   Send   Receive
2         40        45
3        160       160
4        400       375
5        400       450
Total   1000      1030

Data Set Two

Bytes/Pkt   Layers 2-5         Layer 6              % Layer 6
            Encode   Decode    Encode    Decode     Encode   Decode
100         1000     1030      3050      6400       75.31    86.14
200         1000     1030      5750      12400      85.19    92.33
400         1000     1030      11150     24400      91.77    95.95
800         1000     1030      21950     48400      95.64    97.92
1600        1000     1030      43550     96400      97.76    98.94
3200        1000     1030      86750     192400     98.86    99.47
6400        1000     1030      173150    384400     99.43    99.73
12800       1000     1030      345950    768400     99.71    99.87

Figure 4 ASN.1 Processing Cost.

The ASN.1 PersonnelRecord type in Appendix E consists of IA5String types and one integer type. This leads to two important conclusions:

• Since the cost for encoding/decoding strings is low, the costs attributed to the presentation layer in figure 4 are mostly due to the encoding/decoding of the tag and the length.
• Figure 4 already shows that the presentation layer occupies a significant portion of the protocol stack even with a simple encoding/decoding to perform. The percentage of the processing time that the presentation layer uses is significantly higher for complex types such as integers and reals.

1.3 Initial Parallel Analysis

The initial parallel analysis is carried out by gathering statistics for the CASN.1[12] tool. CASN.1 takes an ASN.1 specification as input and generates C code that is capable of encoding/decoding PPDUs. In CASN.1, procedure calls are made to primitive encoding/decoding routines that are shown in table 1.
These calls take primitive types and transfer them back and forth from machine-specific format to machine-independent transfer syntax TLV triples.

ASN.1 Type          Encode Routine       Decode Routine
Boolean             Encode_bool()        Decode_bool()
Integer             Encode_int()         Decode_int()
Bit String          Encode_bits()        Decode_bits()
Octet String        Encode_octs()        Decode_octs()
Null                Encode_null()        Decode_null()
Object Identifier   Encode_oid()         Decode_oid()
Universal Time      Encode_UTCTime()     Decode_UTCTime()
Generalized Time    Encode_GNLTime()     Decode_GNLTime()
External            Encode_ext()         Decode_ext()

Table 1 CASN.1 Primitive Encoding/Decoding Routines

By using primitive encoding/decoding routines with constructor types such as sets and sequences, complex data types can be described. A set of primitive data types is shown in figure 3. The only difference between sets and sequences is that in a sequence the order of the primitive data types is maintained and in sets it is not. The use of set and sequence data types can be nested.

The primitive encoding/decoding calls constitute most of the time in CASN.1. An estimate of the number of instructions used by these primitive calls was constructed by compiling CASN.1's implementation of the primitives to assembler 8. Table 2 presents the total number of assembler instructions counted 9. The static analysis in table 2 shows that significant gains are achieved if the primitives are executed in parallel. Significant speedup is gained even if the cost incurred by the parallel algorithm is high. Table 3 shows how the cost of a parallel algorithm is amortized over the number of processors. The table ignores the requests placed on scarce resources such as a shared bus. It also assumes that the parallel algorithm does not incur any communication overhead.
These effects are apparent in the final evaluation of the system and thus the linear speedup shown in table 3 is not achieved.

Primitive      Encoding Instruction Count   Decoding Instruction Count
Integer        583                          479
Boolean        529                          457
Bits           668                          523
OCTET String   672                          477
Null           515                          432

Table 2 Primitive Routines Instruction Count.

Processors →     1 (Serial)   1 (Parallel)   2        3        m
1 primitive      100          200            200/2    200/3    200/m
2 primitives     200          400            400/2    400/3    400/m
n primitives     100n         200n           200n/2   200n/3   200n/m

Table 3 Parallelism Speed Gain

8 The cc -S option is used on a SPARC 2. Appendix C provides the pros and cons of using a static analysis.
9 The count includes the cost of encoding/decoding the tag and the length.

The final analysis of the primitive encoding/decoding routines shows that even when encoding/decoding is optimized for the simplest cases, parallelism is applicable due to the number of instructions that need to be performed. Table 4 presents the number of instructions 10 executed for encoding/decoding some of the primitive types assuming the following simplifications:

• All the tags and lengths are in short form.
• The instruction counts for NULL, BOOLEAN, INTEGER and OCTET String include the instructions required to encode/decode the tag and length.
• The INTEGER instruction counts were generated for a one-byte positive integer.
• The OCTET String instruction counts were generated for octets that start and end on a word boundary. The total length of the octets must be less than 127 bytes.

Routine        TS size in bytes   Encode Instruction Count   Decode Instruction Count
NULL           2                  12                         13
BOOLEAN        3                  18                         19
INTEGER        3                  20                         16
OCTET String   2+n                27+1.5n                    21+1.5n
TAG & LEN.     2                  17                         14

Table 4 Minimal Instructions for Encoding/Decoding

The instruction count presented in table 4 does not consider:

• The cost incurred by procedure calls.
• The cost incurred when allocating memory.
• The variable cost of handling non word-aligned OCTET Strings.
• The cost of handling complex primitive encodings/decodings.
• The cost of maintaining data dependencies within complex structure types.

10 The Motorola 88100[10] RISC processor instruction set was used.

The above analysis shows that if the primitive encoding/decoding calls are executed fast then a significant speedup can be achieved. In the best case, the calls return as soon as they are made. This thesis takes the approach that when a primitive call occurs it is pushed onto a FIFO queue for servicing; therefore the calls take little time. The problem of servicing these calls quickly is solved using the architecture proposed in chapter 2.

Multiple processors, reading from the FIFO, are implemented to service the calls quickly. This essentially increases the number of instructions executed in a cycle. If there are n processors, then in one cycle n instructions are executed, while in a single processor system only one instruction is executed. Figure 5 shows how time is saved by encoding in parallel with n processors. The encoded values could be an individual's personnel information as discussed in section 1.1.1. Note that there exists a time skew in the parallel encoding shown in figure 5. A similar skew also exists in the decoding algorithm. The following is the description of the skew.

When encoding, the skew times are associated with the following actions.

• T1 is the time delay between assigning each primitive call to the FIFO.
The instructions executed between the assignments account for the delay and are explained in chapter 3.

• T2 is the time delay from T1 to the time the primitive call receives service.

When decoding, the skew times are associated with the following actions.

• T1 is the time required to read the length of the present encoded value so that the next value's starting address is known.
• T2 is the delay between the end of T1 and when computation starts on the next value. The delay is due to the communication required by the parallel algorithm. The communication required by the decoding parallel algorithm is explained in chapter 4.

[Figure 5: Serial vs Parallel Encode Time — one processor encodes the Boolean, Integer and String TLV triples back to back, while n processors encode them in parallel, offset only by the skews T1 and T2.]

1.4 Previous Work with ASN.1

Some of the previous work carried out in relation to the presentation layer in the ISO model includes the ISODE[23] project, the CASN.1[28] project, the a2c project[15] and the VASN.1[3][24][25] project. Section 1.4.1 briefly discusses the ISODE, CASN.1 and a2c single-processor software used for ASN.1 encoding/decoding. Section 1.4.2 describes VASN.1, a multiprocessor ASN.1 encoder/decoder.

1.4.1 Single Processor ASN.1 Implementations

The ISODE, CASN.1 and a2c projects are single processor programs 11 that take an ASN.1 specification and create encoding/decoding routines to transfer to/from machine-independent transfer syntax/machine-dependent format. The problem with using single-processor software to perform this function is that the achievable transfer rates are unacceptably low for high speed networks.
An 8 Mbit/s transfer rate was achieved by running a2c on a MIPS R3260 (approximately 18 SPECmarks) with 48 megabytes of RAM.

11 The programs are Unix-based C programs.

1.4.2 Multi Processor ASN.1 Implementations

Figure 6 shows the parallel hardware that VASN.1 uses to achieve a higher throughput compared to a single processor implementation.

[Figure 6: VASN.1 Overview — a Parser/Assembler and an Encoder/Decoder joined through an Intermediate Memory, with the host main memory reached over a system bus interface.]

The following factors limit the hardware used in the design and simulations:

• The design is based on a 16-bit microprocessor; hence, a complicated memory mapping scheme is required to map to today's typical 32-bit architecture. This mapping could be expensive.
• The hardware setup is non-expandable because a ROM holds the encoding/decoding routines.

The above problems could be solved by adapting the algorithm to a 32-bit machine and converting the ROM that holds the encoding/decoding routines to a downloadable RAM. However, VASN.1 fails to eliminate some of the other problems as easily. Figure 7 is the block diagram of the hardware that is used for the Parser/Assembler and the Encoder/Decoder of figure 6. The following factors limit the hardware shown in figures 6 and 7:

• The Intermediate Memory block of figure 6 is an associative memory scheme that depends on the algorithm VASN.1 uses. The scaling of the hardware is difficult due to the associative memory, and the Intermediate Memory block will not operate when ASN.1 types have long tags.
• Figure 7 shows that two queues are used for servicing the interrupts generated by slave processors. This method for handling interrupts produces unfair arbitration if the sampling rate of the two queues is low[2].
• In figure 7, information exchanged between the slave processors is transmitted on Bus A via the Processor Master. Information that the slave processor needs from the system or local bus is retrieved on Bus B via the Interface Processor.
The setup is restrictive because it partitions the utilization of the two busses and leads to a high utilization of one bus and a low utilization of the other. The total throughput of the multiple bus system is not exploited.
• In figure 7 the arbiter services the requests for the system bus and for the local bus made by the four slave processors; hence the arbiter can become a bottleneck in the system.

This thesis proposes an implementation that avoids the problems of VASN.1 by choosing a more flexible hardware/software configuration.

[Figure 7: Parser/Assembler and Encoder/Decoder — a Processor Master and four slave processors with two queues on Bus A, and an arbiter and Interface Processor bridging Bus B to the local and system busses, with interrupt lines from the slaves.]

The encoding/decoding algorithms represent the last hidden cost for VASN.1. The local representation that VASN.1 uses for its encoding input and decoding output is a parallel data structure that holds all the hierarchical information of the equivalent ASN.1 description. The data structure has the following implications:

• An application must use the parallel data structure if it wants to encode/decode data via VASN.1.
• Alternatively, an application can convert the parallel data structure into one that is more convenient.

Either case incurs an expensive programming cost because of the lack of flexibility and the extraneous information in the data structure. This processing cost is the cost of parallelizing a serial data structure. VASN.1 attributes a zero cost to this processing. The algorithm presented in this thesis provides the application with a flexible data structure and does the serial to parallel conversion at a low cost.

1.5 Thesis Objectives and Outline

This section presents a list of thesis objectives that were achieved and the organization of the thesis. Prior to this thesis, VASN.1 was the only parallel system for encoding/decoding ASN.1 types.
Therefore, many of the objectives listed below are solutions to problems that VASN.1 exhibits.

1. Devise a parallel ASN.1 encoding/decoding algorithm that is capable of higher throughput than present serial implementations.
2. Devise an algorithm that can take serial data and transform it into parallel data for optimal encoding/decoding.
3. Devise an ASN.1 encoding/decoding algorithm that can be incorporated with applications at a minimal cost.
4. Design a scalable parallel architecture that is capable of running the algorithm at high speeds.

The following is the outline of this thesis. Chapter 2 presents the parallel architecture. Chapters 3 and 4 describe the encoding and decoding algorithms in detail. Chapter 5 provides the functional specification of the hardware. Chapter 6 presents the details of how the simulations of the parallel architecture were carried out and chapter 7 presents the simulation results. Chapters 8 and 9 provide the conclusions and suggest directions that future work may follow.

2 Parallel Architecture

Figure 5 shows that in a single processor implementation the time required to encode the data is the sum of the time required to encode each primitive type. In the parallel version the primitive types encode nearly simultaneously, thus saving time. The same principle applies to the decoding routines. Section 2.1 provides a high level description of the parallel hardware needed for the efficient execution of the parallel algorithm. Section 2.2 presents an outline of the parallel encoding/decoding algorithm. Section 2.3 explains how the host system and the parallel ASN.1 system share the same address space.

2.1 Hardware Architecture

Figure 8 depicts the parallel setup that allows primitive encoding/decoding routines to run simultaneously. This section presents a high level description of the hardware and chapter 5 presents details of the bus arbitration scheme, the processors and the queue-processor. The host or a processor can write primitive calls into the FIFO queue.
The information in the FIFO queue is stored in the Queue Memory.

[Figure 8: Parallel Hardware Setup — a Queue Processor, consisting of the Queue Memory, the Queue Memory Manager and the buffptr register, distributing work over a shared bus to processors 1 through n, with a connection to the external bus.]

The Queue Memory Manager is responsible for accesses made to the Queue Memory and for distributing primitive encoding/decoding calls to the processors for servicing. The encoding/decoding algorithms in chapters 3 and 4 need multiple words of information to be passed to the processors. If a processor requires n words of information, then n words can be written into the Queue Memory and the Queue Memory Manager can distribute these words to the processor. Another option is to write the address of the n words into the Queue Memory and then the Queue Memory Manager can redistribute the address. A Queue-Processor can be designed with either method, but in our implementation the address method is chosen. The Queue Memory Manager is also responsible for recognizing certain values as commands. Constants that represent these commands cannot be used as memory location pointers. Table 5 lists these commands with a brief explanation of each. The complete explanation of these commands occurs in chapters 3 and 4.

Command         Explanation
wait for free   Wait for all the processors to be free before assigning out the queue element.
add control     Increments the control counter.
end signal      Decrements the control counter. If the control counter is zero, execute a wait for free, then interrupt the host.

Table 5 Queue-Processor Commands.

The processing elements in our system each have a local instruction memory that stores the primitive encoding/decoding routines that are to be executed. When a primitive type is encoded, the resulting TS must be stored in memory; hence the processor needs some mechanism to allocate memory. Memory is also needed when decoding a TS to a machine representation.
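The control-counter behaviour of the `add control` and `end signal` commands in table 5 can be modelled in a few lines. This is only a software sketch of hardware behaviour: the struct, the field names and the serial stand-in for the wait-for-free step are all my own inventions.

```c
#include <assert.h>

/* Hypothetical software model of the table 5 command semantics. */
typedef struct {
    int control;         /* control counter                     */
    int busy;            /* processors currently servicing work */
    int host_interrupts; /* interrupts raised toward the host   */
} QueueProc;

static void add_control(QueueProc *q) { q->control++; }

static void end_signal(QueueProc *q)
{
    if (--q->control == 0) {
        /* "wait for free": the hardware waits until every processor is
         * idle; modelled here as a simple check since the sketch is serial */
        assert(q->busy == 0);
        q->host_interrupts++;   /* then interrupt the host */
    }
}
```

With two `add control` commands outstanding, only the second `end signal` drives the counter to zero and raises the host interrupt, which is how the host learns that a batch of queued calls has completed.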
The buffptr in figure 8 is a register that holds a pointer to the starting memory location of a free memory pool. The host is responsible for setting up the buffptr pointer to point to the available memory. When the free memory pool is set up, it could consist of contiguous or noncontiguous memory blocks. For the initial simulations, the contiguous memory approach explained in section 2.1.1 is taken.

2.1.1 Contiguous Memory Management

Figure 9 shows what the memory space looks like at initialization. The buffptr holds the starting address of the free memory space.

[Figure 9: Contiguous Memory: Initial — buffptr points at the start of an entirely free memory region.]

When a process needs x bytes of memory it performs the following operations without releasing the shared bus:

1. Read buffptr.
2. Increment the read value by x.
3. Write the new value to buffptr.

The memory, after a memory allocation, looks like figure 10.

[Figure 10: Contiguous Memory: Allocated — buffptr now points x bytes further, past the newly allocated block, with the remainder of the region still free.]

2.2 Parallel Algorithm Outline

The following section provides a brief outline of the encoding/decoding algorithms. Chapters 3 and 4 present the full details of the algorithms.

When encoding, the host processor parses its data structure and writes the primitives into the FIFO queue. The queue-processor then distributes the primitives to the free processors. Since the host processor knows the exact data that it wants to encode, it is responsible for assigning all the primitive calls to the queue-processor. Simulations verify that the primitive calls take most of the processing time; therefore the time spent by the host on parsing the data structure is minimal.

When decoding, the host processor writes the starting address of the TS to the FIFO queue. The queue-processor then assigns the address to a free processor P1. P1 then reads the tag and length of the first element in the TS and calculates the starting location of the second element in the TS.
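The starting-location calculation that P1 performs can be sketched as follows. This is an illustrative helper, not the thesis's code: it assumes single-byte tags and BER short-form lengths, while the real processors must also handle the long forms.

```c
#include <assert.h>
#include <stddef.h>

/* Given the offset of one TLV element in the transfer syntax, read its
 * tag and length and return the offset of the next element.
 * Sketch only: single-byte tag, short-form length (< 128 bytes). */
static size_t next_element(const unsigned char *ts, size_t start)
{
    unsigned char len = ts[start + 1];  /* length byte follows the tag   */
    assert((len & 0x80) == 0);          /* short form only in this sketch */
    return start + 2 + len;             /* tag + length byte + value     */
}
```

For a TS beginning with a three-byte INTEGER triple, `next_element(ts, 0)` returns 3, which is exactly the address P1 would hand back to the FIFO queue for the next processor.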
P1 then writes the starting address of the second element in the TS to the FIFO queue. The queue-processor then assigns the address to the free processor P2. P2 performs the same process that P1 has performed. The process is repeated by other processors until the complete TS is decoded.

2.3 Memory Mapping Scheme

The memory scheme that the host system and the ASN.1 parallel hardware use is depicted in figure 11. The host system segments the memory space into I/O address space and memory address space, which allows I/O registers to be mapped to I/O address spaces.

Figure 11 Memory Mapping Scheme (the queue-processor and processors 1 through n occupy part of the I/O address space within the total address space)

Figure 11 shows that the queue-processor and the processors of the parallel hardware are mapped to some of the I/O addresses. Thus, when the host system wants to assign work to the queue-processor it writes to the I/O mapped location of the queue-processor.

3 Encoding

In the encoding process, information is taken from the application layer, encoded and passed to the session layer. The data structures received from the application layer and provided to the session layer are presented in section 3.1. Once these data structures are defined, the explanation of how an ASN.1 protocol specification is used to generate encoding routines for an application is discussed in section 3.2. Section 3.3 then explains the functionality of the encoding procedures that the applications use. Finally, section 3.4 provides the control flow of the parallel architecture and also concludes the running example of the chapter.

3.1 Encoding Interface Data Structures

The application layer provides a C data structure of the value instance that it needs to encode. The data structure is generated according to the ASN.1 specification and the mapping rules provided in figure 12. A complete list of these mapping rules is found in Ref.
[28]. When a value instance is encoded it is presented to the session layer in IDX structures. An IDX structure is defined in figure 12. Figure 13 shows an ASN.1 specification and figure 14 shows an application's equivalent C type. When the value instance shown in figure 13 is encoded, the resulting output is presented in figure 15. A hardware component can convert the IDX structure shown in figure 15 into a contiguous TS. The hardware component can be pipelined with the parallel encoder and will not affect the total throughput of the system.

ASN.1 Type         C Type
INTEGER            int
Character String   OCTS
SEQUENCE { }       struct { }
SET { }            struct { }
SEQUENCE OF { }    LIST
SET OF { }         LIST

typedef struct OCTS {
    struct OCTS *next;   /* pointer to next OCTS node */
    int len;             /* # of bytes in OCTS string */
    char *data;          /* pointer to the OCTS string */
} OCTS;

typedef struct LIST_ITEM {
    struct LIST_ITEM *next;   /* next list item */
    char *item;               /* list item */
} LIST_ITEM;

typedef struct LIST {
    LIST_ITEM *top;    /* first item in the list */
    LIST_ITEM *last;   /* last item in the list */
} LIST;

/* encoded output goes into IDX structures */
typedef struct IDX {
    struct IDX *next;   /* next idx */
    int len;            /* # of bytes in IDX data */
    char *data;         /* information of the IDX */
} IDX;

Figure 12 Partial List of Mapping Rules

recordType ::= [APPLICATION 1] IMPLICIT SEQUENCE {
    scores SEQUENCE OF REAL,
    name   [1] IA5String,
    age    [2] INTEGER
}

value instance = {{70.12, 80.34, 90.56}, "John", 50}

Figure 13 ASN.1 Example

#define APPTAG  0x61
#define NAMETAG 0xa1
#define AGETAG  0xa2

typedef struct recordType {
    LIST scores;
    OCTS name;
    int  age;
} recordType;

Figure 14 C Type Equivalent of Figure 13

Figure 15 Transfer Syntax of Figure 13 in IDX Structures (a chain of IDX nodes, each with len, data and next fields; the data fields carry the tag and length bytes of the application sequence, the seq of, the reals 70.12, 80.34 and 90.56, the IA5String "John" and the integer, with the chain terminated by a NULL next pointer)

3.2 Encoder Generation from ASN.1 Specification

To generate encoded values, like those shown in figure
15, all the values that need to be encoded are written to the parallel ASN.1 encoder/decoder hardware. A program that reads in the ASN.1 specification and generates the encoding routines that the application uses is necessary. CASN.1 can be modified to generate these encoding routines. The encoding routines for the parallel algorithm can be identical to the code generated by the present version of CASN.1. An example of the encoding routines required for the specification of figure 13 is shown in figure 16. The actual function of QEncode () is to write the encode routine to the queue-processor. The EncodeStructBegin () and EncodeStructEnd () routines are optimized for parallelism and a detailed explanation of them is provided in section 3.3.

EncodeData (pcode)
pcodeType *pcode;
{
    EncodeStructBegin (APPLICATION1TAG);
    EncodeSeqOf1 (&pcode->scores, SEQOFTAG);
    EncodeStructBegin (NAMETAG);
    QEncode (OCT, &pcode->name, tag);
    EncodeStructEnd ();
    EncodeStructBegin (AGETAG);
    QEncode (INT, &pcode->age, tag);
    EncodeStructEnd ();
    EncodeStructEnd ();
}

EncodeSeqOf1 (list, tag)
LIST *list;
tagType tag;
{
    LIST_ITEM *item;

    EncodeStructBegin (tag);
    for (item = list->top; item; item = item->next)
        QEncode (REAL, item->item, REALTAG);
    EncodeStructEnd ();
}

Figure 16 Generated Encoding Routines for Figure 13

3.3 Encoders Functions and Data Structures

The EncodeStructBegin (), EncodeStructEnd () and QEncode () procedures are explained in sections 3.3.1-3.3.3. The cost for each of the calls is estimated and compared to the equivalent call in CASN.1. The determination of the cost of the calls is shown in Appendix B.

The information shown in figure 17 is assigned to the processing element when a primitive call is made.
Within the data structure in figure 17 the following are defined:

• rtag holds the tag of the value to be encoded and a value that specifies the encoding routine to be used.
• data holds a pointer to the data to be encoded.
• idxptr holds a pointer to the idx structure that is updated with the encoded value.
• updateptr holds a pointer to an integer that is incremented by the total number of TS bytes generated by the primitive encoding routine. The integer is used for calculating the cumulative length in structures.

typedef struct egelemType {
    int rtag;          /* tag & encode routine */
    char *data;        /* data to be encoded */
    idxType *idxptr;   /* encoded data location */
    int *updateptr;    /* cumulative length */
} egelemType;

Figure 17 Encode Queue Element.

3.3.1 EncodeStructBegin ()

When an EncodeStructBegin call is made, the length of the structure cannot be encoded because the lengths of the elements in the structure are unknown. There exists a data dependency between the elements of the structure and the length of the structure; hence the length of the structure can only be calculated after all the elements in the structure are encoded. This data dependency must be maintained somehow and must be generally expandable for SEQUENCE, SET, SET OF and SEQUENCE OF. A tree that maintains the data dependency is constructed by using the data structures presented in figure 18.

typedef struct nodeType {   /* called structure node */
    struct nodeType *parent;
    struct nodeType *next;
    int len;
    idxType *idx;   /* idxType same as IDX */
    tagType *tag;
} nodeType;

typedef struct levelType {   /* called level node */
    struct levelType *plus;
    struct levelType *minus;
    nodeType *head;
    nodeType *tail;
} levelType;

Figure 18 Encoding Cumulative Length Data Structures

In the remainder of this thesis a node of type nodeType is called a structure node and a node of type levelType is called a level node. Figure 19 shows the data dependency structure constructed for the value instance given in figure 13.
This data dependency structure is generated according to the following rules.

Initially the tree consists of:
a. one level node known as the root level
b. one structure node at the root level
c. a present level pointer pointing to the root level
d. a deepest level pointer pointing to the root level

When an EncodeStructBegin call is made and the next level exists:
a. a new structure node is added to the tail of the next level
b. the parent field of the new structure node points to the present level's last node
c. the present level pointer now points to the next level
d. the tag of the EncodeStructBegin call is saved in the new structure node

When an EncodeStructBegin call is made and the next level is null:
a. a level node along with an accompanying structure node is added after the present level
b. the parent field of the new structure node points to the present level's last node
c. the present level pointer now points to the next level
d. the deepest level pointer now points to the next level
e. the tag of the EncodeStructBegin call is saved in the new structure node

The maximum cost for calling EncodeStructBegin is 35 while the equivalent routine in CASN.1 costs a minimum of 22. The CASN.1 cost is lenient because it assigns the EncodeTag routine within the EncodeStructBegin routine a zero cost.

Figure 19 Data Dependency Graph of Figure 13 (level nodes linked by level+ and level- pointers, each holding head and tail pointers to a chain of structure nodes; every structure node stores its tag, idx, len, parent and next fields)

3.3.2 EncodeStructEnd ()

When an EncodeStructEnd call occurs, the elements within a structure have been assigned to the queue-processor. The length of the structure becomes known when all the elements within the structure are encoded. The tag and length of a structure can be encoded at the end of the EncodeStructEnd call or postponed until a later time. The postponed option is used here.
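As an illustration of the EncodeStructBegin rules of section 3.3.1, the following sketch builds the level/structure-node tree. The field layout follows figure 18 in simplified form (the tag is held as an int rather than through a tagType pointer); the helper names, the use of calloc and the omission of error handling are assumptions, not the thesis implementation.

```c
#include <stdlib.h>

typedef struct nodeType {             /* structure node (simplified) */
    struct nodeType *parent, *next;
    int len;                          /* cumulative length */
    int tag;                          /* tag saved by EncodeStructBegin */
} nodeType;

typedef struct levelType {            /* level node */
    struct levelType *plus, *minus;   /* next / previous level */
    nodeType *head, *tail;
} levelType;

static levelType *present_level, *deepest_level;

/* Initial tree: one root level holding one structure node, with the
   present and deepest level pointers at the root level. */
void init_tree(void)
{
    nodeType *root = calloc(1, sizeof(nodeType));
    present_level = calloc(1, sizeof(levelType));
    present_level->head = present_level->tail = root;
    deepest_level = present_level;
}

void EncodeStructBegin(int tag)
{
    nodeType *n = calloc(1, sizeof(nodeType));
    n->tag = tag;                        /* save the call's tag */
    n->parent = present_level->tail;     /* parent = present level's last node */
    if (present_level->plus == NULL) {   /* next level is null: add a level node */
        levelType *lv = calloc(1, sizeof(levelType));
        lv->minus = present_level;
        present_level->plus = lv;
        deepest_level = lv;              /* deepest level moves down */
    }
    present_level = present_level->plus; /* present level moves to the next level */
    if (present_level->tail != NULL)     /* append the new node at the tail */
        present_level->tail->next = n;
    else
        present_level->head = n;
    present_level->tail = n;
}
```

In this sketch an EncodeStructEnd would simply follow the minus pointer back to the previous level, matching the traversal described in section 3.3.2.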
The reason for this choice is explained by using the ASN.1 specification shown in figure 20.

postponedType ::= SEQUENCE {
    SEQUENCE {
        real1 REAL,
        real2 REAL
    },
    SEQUENCE {
        real3 REAL,
        real4 REAL
    }
}

Figure 20 Postponed ASN.1 Type

If there are four processors available to encode a value instance of the ASN.1 specification in figure 20 and the structure lengths are calculated immediately after the EncodeStructEnd call, then only two of the four processors are utilized. Once the values of real1 and real2 are assigned to the processors the EncodeStructEnd routine is executed. For the correct length of the sequence to be known one must wait until real1 and real2 are encoded and the length of the structure is updated by them. The same happens for real3 and real4. Thus at any time only two of the four processors are utilized. If the length calculations of the structures are postponed until the end of all the primitive encodings, then real1 through real4 can be encoded simultaneously on four processors. Once all the reals are encoded, the two sequences that make up the postponedType can encode their tag and length. Finally, the tag and length of the postponedType is encoded.

EncodeStructEnd moves the level pointer to the previous level. If the previous level is the root level then all the structures, represented by the structure nodes, in the data dependency graph can be assigned to the queue-processor. The structures below the deepest level pointer are assigned to the queue-processor first, next the structures of the previous level are assigned to the queue-processor, and this continues until the root level is reached. After all the structures of a level are assigned to the queue-processor, the wait for free command is queued. The command tells the queue-processor to wait for all the processors to become free before assigning the next element in the queue.
This is necessary because the data in the previous level is dependent on the lengths calculated by the present level.

The cost for calling EncodeStructEnd is 6 for the non-root level and is variable for the root level according to cost = 25 * nodes + 7 * levels. In the worst case every EncodeStructBegin call creates a structure node and a level node. The cost for assigning this data dependency graph to the queue-processor is then cost = 32 * EncodeStructEnd. Hence the maximum cost for EncodeStructEnd is proportional to the number of times EncodeStructEnd is called and is cost = 38 * EncodeStructEnd. In the simplest case, CASN.1's equivalent call costs cost = 40 * EncodeStructEnd.

3.3.3 QEncode ()

The QEncode call is equivalent in function to the primitive encoding routines in CASN.1. This is where the parallel algorithm does significantly better than CASN.1. The total cost of writing this information into the queue is 16, whereas in CASN.1 the cost for a primitive encoding depends on the primitive.

There are three parameters passed to QEncode (): a value specifying the encoding routine that should be used, the tag for the transfer syntax and the pointer to the data structure to be encoded. This information is written to the queue-processor along with an allocated idx pointer and a cumulative length pointer. The cumulative length is extracted from the data dependency structure and is the length field of the node that is at the tail of the present level node.

3.4 Encoding Flow Trace

The queue information for the value instance of the ASN.1 specification given in figure 13 is shown in figure 21. A partial parallel trace of the program is shown in figure 22. In figure 21 the first element of the queue is at the bottom of the figure. The first five elements in the queue are for encoding the three REALs, the IA5String and the INTEGER. All the non-bold boxed values in the diagram point to queue element structures, one of which is shown to the right in figure 21.
The sixth element, wait for free, is a signal to the queue-processor not to assign another element until all the processors are free (i.e. until all the previously assigned elements have been encoded). The next elements in the queue are the seq of element, the nametag element, the agetag element and then the wait for free signal. The last encodable element in the queue is the APPLICATION 1 element, which is followed by the end signal that instructs the queue-processor to interrupt the host when all the processors are free. Sections 3.4.1-3.4.3 explain the actions that the short statements in figure 22 perform.

Figure 21 Queue with the Encode Example Calls (queue contents from first to last element: pointer to real 1, pointer to real 2, pointer to real 3, pointer to IA5, pointer to INT, wait for free, pointer to seq of, pointer to nametag, pointer to agetag, wait for free, pointer to APP 1, end signal; each queue element holds the cumulative length, the tag and routine, the data pointer and the idx pointer)

3.4.1 Host Responsibilities

When the host wants to encode a TS it has the option of either creating the complete queue information and then assigning the work to the queue-processor, or assigning the work to the queue-processor as it parses its data structure. In either case, the host puts all the primitive encoding routines into the FIFO. The host then assigns, to the queue-processor, all the structures from the deepest level of the data dependency graph followed by the wait for free signal. It then assigns all the structures from the previous level followed by the wait for free signal. This process continues until the root level is reached. At the root level the host writes an end signal into the FIFO to signify that the encoding list is complete. When
the queue-processor encounters this end signal and all the processors are free, it sends an interrupt to the host to signal the completion of the encoding.

The following describes the actions performed in the flow diagram of figure 22.

• Write (buffptr): write the starting address of the memory pool that is available for the encoding routines into buffptr. This is the first action that the host performs and is not shown on the diagram.
• Write (encode call i): write into the queue-processor any of the possible primitive encode calls.
• Write (wait for free): write to the queue-processor a synchronization symbol that signals that all the processors must have a free status before the other elements in the queue are assigned to the processors.
• Write (End_Signal): write to the queue-processor a symbol that signals the end of an encoding session.
• Accept Finished: accept that the encoding is finished. One possible action is the release of the unused buffer space that is pointed to by buffptr.

3.4.2 Queue-Processor

During encoding, the queue-processor reads elements inserted into the queue by the host and assigns the work to one of the n processors. Whenever the queue-processor encounters the wait for free signal it waits for all the processors to become free. When the queue-processor encounters an end signal it stops assigning work. After the processors have finished their encoding the queue-processor signals to the host that the encoding is complete. The following describes the algorithm that the queue-processor performs.

1. Read an element from the queue. If the queue is empty, then block.
2.
If the element is a wait for free, then:
   Waitlist (p1, p2, ..., pn): block until all the encoding processors have finished before assigning the next work.
Else if the element is an End_Signal, then:
   Waitlist (p1, p2, ..., pn): block until all the encoding processors have finished before interrupting the host.
   Interrupt Host: signal the host of the encoding completion and wait for the acknowledge.
Else the element is an encode call i:
   Assign (px, encode call i): assign a processor x to encode a specific call. If all the processors are busy, block and wait for one to become free.
3. Start over at step 1.

3.4.3 Processors

A processor blocks until it is assigned a primitive encode call by the queue-subsystem. The processor then encodes the primitive call and upon completion informs the queue-subsystem that it is free. The following explains the actions that the processor executes:

1. Start Encoding: receives work from the queue-processor and starts encoding the primitive.
2. End Encoding: finishes the encoding and signals the queue-processor of its free availability status.
3. WaitAssign (encode call i): waits for the queue-processor to assign a primitive call. WaitAssign is not drawn on the diagram for clarity but would occur in the space before the Start Encoding and after the End Encoding.

4 Decoding

The decoding chapter follows the same format as the encoding chapter. Section 4.1 discusses the data structure requirements of the application layer and session layer. Section 4.2 explains how the decoding is achieved for ASN.1 specifications. Section 4.3 presents the requirements of the software that is needed to generate the decoder. Finally, section 4.4 presents a flow trace of the parallel algorithm.

4.1 Decoding Interface Data Structures

Prior to decoding, the session layer provides a contiguous memory block which holds the TS to be decoded.
When the decoding is complete the result is made available to the application layer in a data structure created by using the mapping rules given in figure 12.

4.2 Decoder Generation from ASN.1 Specification

For an optimal parallel decoding strategy a type tree needs to be generated for the ASN.1 specification. Generation of such a type tree is outlined in [13][16][23]. The type tree is generated off-line and loaded when an instance of a transfer syntax needs to be decoded. To decode a received TS the application needs to write to the parallel ASN.1 decoder the starting address of the TS, the finishing address of the TS, the address of the type tree to be used for decoding this TS instance, and the memory location to store the decoded result or a pointer to the decoded result. The data structure used for communicating this information from the host to the parallel decoder is shown in figure 23. The contents and usage of the type tree are discussed in section 4.3.

typedef struct qdType {
    nodeType *tt;    /* pointer to a node of the type tree */
    char *tsptr;     /* pointer to the start of the TS */
    char *update;    /* mem. location to be updated or used */
    char *tsend;     /* pointer to the end of the TS */
} qdType;

Figure 23 Decoding Queue-Element.

4.3 Decoders Functions and Data Structures

For a functional decoder to operate it is necessary to construct a type tree and the traversing routine for this type tree. The rules for generating the type tree are presented in this section and the traversing routines are explained in section 4.4. The type tree is constructed of nodes that have the data structure shown in figure 24.

typedef struct nodeType {
    int tag;
    int meminst;
    struct nodeType *option;
    struct nodeType *listptr;
    struct nodeType *next;
} nodeType;

Figure 24 Node Data Structure of Type Tree

The tag field holds the tag that is to be found in the transfer syntax and a value which indicates the decoding routine to be used. The meminst field is discussed in the next paragraph.
The option field is used for elements in a SET. The listptr field is used for SEQUENCE/SET and OF types. The next field is used for pointing to the next element expected.

The meminst field is made up of three sub-fields: the allocation field, the offset field and a bit marker field. The allocation field specifies the amount of memory to allocate at this node. The offset field specifies at which offset within a memory location the result of the decode is stored. The bit marker field contains an UPDATE bit, an OF bit and a CONSTRUCTOR bit. The UPDATE bit specifies whether the present node needs to allocate memory for the decoded output or update a memory pointer passed to it. The OF bit is set whenever the present node corresponds to a SEQUENCE OF or SET OF. The CONSTRUCTOR bit should be set whenever the present node corresponds to a SEQUENCE or SET tag.

The rules stated in sections 4.3.1 to 4.3.3 are used when generating nodes of the type tree. Two example type trees are shown in figure 25. Notice that the set type tree in this figure occupies more memory than the sequence type tree. The memory requirements of the set type tree can be reduced by adding a level of indirection; but by doing so, the total number of memory references increases, which slows down the overall speed of the system. This essentially becomes a speed/memory trade-off.

4.3.1 Node for Primitive Types

The tag field holds the tag of the primitive and a value which indicates the decoding routine to be used. The option pointer is set only if the primitive is an element of a SET. The listptr is always NULL and the next node pointer points to the next node expected.

Within the meminst field the allocate sub-field is zero. The offset sub-field holds the offset location used to store the result. The UPDATE, CONSTRUCTOR and OF bit fields are not set.

4.3.2 Node for SEQUENCE and SET Types

The tag field holds the tag of the SEQUENCE/SET and a value which indicates the decoding routine to be used.
The option pointer is non-NULL if the SEQUENCE/SET is an element within a SET. The listptr pointer points to the first element expected within the SEQUENCE/SET. The next pointer points to the first element expected after the SEQUENCE/SET.

Within the meminst field the allocate sub-field is zero. The offset sub-field is zero. The CONSTRUCTOR bit is set. The OF bit is not set. The UPDATE bit is set only on the root node of the tree.

4.3.3 Node for SEQUENCE OF and SET OF Types

The tag field holds the tag of the SEQUENCE OF/SET OF and a value which indicates the decoding routine to be used. The option pointer is non-NULL when the SEQUENCE OF/SET OF is an element of a SET. The list pointer points to the first element expected within the SEQUENCE OF/SET OF. The next pointer points to the first element expected after the SEQUENCE OF/SET OF.

Within the meminst field the allocate sub-field is set to the size of the SEQUENCE OF/SET OF structure. The offset sub-field holds the offset location of a LIST structure. The CONSTRUCTOR bit is not set. The OF bit is set. The UPDATE bit is set only on the root node of the tree.

code1 ::= SEQUENCE {
    score REAL,
    name  IA5String,
    age   INTEGER
}

code2 ::= SET {
    score REAL,
    name  IA5String,
    age   INTEGER
}

Figure 25 Two Decode Type Trees Comparing SET and SEQUENCE (the code1 and code2 types above each generate a type tree of Seq, Real, IA5 and INT tag nodes linked by listptr and next pointers; the SET tree additionally carries option pointers between its element nodes and therefore occupies more memory than the SEQUENCE tree)

4.4 Decoding Flow Trace

To describe flow control of the decoding algorithm we use the transfer syntax in figure 26.
This transfer syntax is the value instance of the ASN.1 specification given in figure 13 of chapter 3. The decoding type tree generated for this ASN.1 specification is shown in figure 27.

Figure 26 Transfer Syntax of Figure 13 (a contiguous string of tag-length-value elements: the APPLICATION 1 sequence, the seq of, the reals 70.12, 80.34 and 90.56, the [1] IA5String "John" and the [2] INTEGER 50)

Figure 27 Decode Type Tree of Figure 13 (nodes for the APPLICATION 1 sequence, the seq of, the real, the IA5 and the INT tags, each with meminst, option, listptr and next fields)

For the decoding to start, a decoding queue-element must be written into the queue by the host. The queue-processor then assigns this element to a free processor and continues to do so whenever an element is available in the queue. The queue-processor stops assigning values to decode when it sees a special end marker, at which time it signals the host that decoding has finished. Figure 29 shows the graphical representation of the control flow that occurs when the TS of figure 26 is decoded. Sections 4.4.1 to 4.4.3 explain the actions of the short statements in figure 29. For clarity purposes the control flow figure does not include all the signals that are transmitted in the system.

4.4.1 Host Responsibilities

When a host wants to decode a TS it sets up the free buffer pointer and then writes a decoding queue-element into the queue-processor. The host then waits for the queue-processor to generate an interrupt upon completion of the decoding.

The following describes the actions that are performed on the flow diagram of figure 29.

• Setup (buffptr): set up a pointer to the available free memory that the processor can use.
• Write (decode call 1): write the first decode queue element into the queue-processor.
• Acknowledge Interrupt: acknowledge the interrupt of the queue-processor.

4.4.2 Queue-Processor

The algorithm that the queue-processor performs is shown in figure 28.
The queue-processor reads elements inserted into the queue and assigns the non-command elements to one of the free processors. The functions of the command elements, wait for free, add control and end signal, are discussed next. When the queue-processor sees a wait for free signal it waits for all the processors to become free before it assigns any other work. When the queue-processor sees an add control signal it increments the control counter that is initially zero. When the queue-processor sees an end signal it decrements the control counter. If the control counter is zero when the queue-processor sees an end signal, it interrupts the host and waits for an acknowledgment. Note that the functions that the queue-processor performs for encoding are a subset of the functions described here; hence, the same queue-processor is used for both encoding and decoding.

Figure 28 Algorithm of Queue-Processor

4.4.3 Processors

The decoding algorithm used by the processors is depicted in figure 30. A processor signals the queue-processor that it is free and then it waits for an address of a decode queue element to be written to its I/O mapped location. Henceforth, whenever a processor assigns work this means it constructs a queue element and then writes a pointer to this information into the queue.
Also, the queue element pointer that the processor receives is referred to as rec and the pointer that the processor assigns to the queue-processor is referred to as q. The rest of this section explains all the actions shown in figure 30.

Figure 30 Decoding Algorithm for Processors (short statements in the flow diagram: calc next addr; queue add control; queue next; calc sub-struct addr; queue sub-struct; calc sub-structs addrs; queue all sub-structs with add controls; find correct node; queue next addr)

tsptr = tsend: If tsptr = tsend then the processor is at the end of the decoding segment and has to signal the queue-processor that one control trace of the algorithm has ended.

if (rec->tsptr == rec->tsend)
    write end signal to queue-processor;

UPDATE bit set: The UPDATE bit is set only for the first node in the type tree. This signals that the amount of memory specified in the allocation sub-field of meminst (figure 24) should be allocated and that the queue element update pointer should be updated with the address of the allocated memory. The following C-like pseudo code segment summarizes these actions:

q->tt = rec->tt->next;
q->tsptr = rec->tsptr +
    length in length field of (rec->tsptr);
q->update = allocate memory;
q->tsend = rec->tsend;
*q->update = allocate memory;

find correct node: When the processor is not at the end of a control trace it checks to see if it is at the correct type tree node. If the node's option pointer is not NULL, it then decodes the TS tag and determines if the tag at the present node matches this TS tag. If the TS tag and the tag at the node are not the same, it follows the option pointer to the optional node and then compares the TS tag to the tag at that node. The processor continues following the option pointer until it finds a matching tag.
If the processor does not find a matching tag, a signal is generated signifying that there is an error in the incoming transfer syntax.

Note that the elements within a SEQUENCE always have a NULL option pointer, but for elements within a SET the option pointer can be non-NULL. This means that the tag of an element in a SEQUENCE need not be decoded before the next element is assigned to the queue-processor; but for elements within a SET, the tag has to be decoded and matched before the next element is assigned to the queue-processor. This makes SEQUENCE decoding faster than SET decoding. The following C-like pseudo code segment summarizes these actions:

tt = rec->tt;
while (tt != NULL) {
    if (tt->tag == tag of (rec->tsptr)) break;
    if (tt->option == NULL) error; else tt = tt->option;
}

CONSTRUCTOR bit set: If the CONSTRUCTOR bit at the present node of the type tree is set, then the find correct node and UPDATE bit set operations are performed. The CONSTRUCTOR bit set means that we are at the start of a SEQUENCE/SET. At this point the processor calculates the end of the SEQUENCE/SET, which is the starting location of the next element expected. It then informs the queue that one more control thread is being added and assigns, to the queue-processor, the element expected after this sequence. The processor then calculates the starting location of the first element of this sequence and assigns this element to the queue-processor. Finally, if the option pointer is NULL the processor checks to see if the TS tag matches the present node's tag. The following C-like pseudo code segment summarizes these actions.
Note that the tt value used below is defined in find correct node.

q1->tt = tt->next;
q1->tsptr = rec->tsptr +
    size of tag and len of (rec->tsptr) +
    length in length field of (rec->tsptr);
q1->update = rec->update;
q1->tsend = rec->tsend;
write q1 to queue-processor;
write add control to queue-processor;

q2->tt = tt->listptr;
q2->tsptr = rec->tsptr +
    size of tag and len of (rec->tsptr);
q2->update = rec->update;
q2->tsend = rec->tsptr +
    size of tag and len of (rec->tsptr) +
    length in length field of (rec->tsptr);
write q2 to queue-processor;
match tt->tag and tag of rec->tsptr;

OF bit set: If the OF bit of the present node of the type tree is set, then the find correct node and UPDATE bit set operations are performed. The OF bit set means that we are at the start of a SEQUENCE OF/SET OF. At this point the processor calculates the end of the SEQUENCE OF/SET OF, which is the starting location of the next element. It then assigns this element to the queue-processor. Next the processor scans through the elements of the SEQUENCE OF/SET OF, performing only length calculations, to arrive at the next SEQUENCE OF/SET OF element. At each SEQUENCE OF/SET OF element the processor assigns the element to the queue-processor and informs the queue-processor to add one thread. Finally, if the option pointer for this OF structure is NULL the processor checks to see if the TS tag matches the tag of the present node. The following C-like pseudo code segment summarizes these actions.
Note that the tt value used below is defined in find correct node.

    q->tt = tt->next;
    q->tsptr = rec->tsptr +
               size of tag and len of (rec->tsptr) +
               length in length field of (rec->tsptr);
    q->update = rec->update;
    q->tsend = rec->tsptr +
               size of tag and len of (rec->tsptr);
    write q to queue-processor;
    endptr = rec->tsptr +
             size of tag and len of (rec->tsptr) +
             length in length field of (rec->tsptr);
    q->tt = tt->listptr;
    q->tsptr = rec->tsptr +
               size of tag and len of (rec->tsptr);
    while (endptr != q->tsptr) {
        q->tsend = q->tsptr + length in length field of (q->tsptr);
        q->update = allocate memory given in sub-field of tt->meminst;
        write q to queue-processor;
        q->tsptr = q->tsend;
    }
    match tt->tag and tag of rec->tsptr;

primitive    If neither the OF bit nor the CONSTRUCTOR bit is set then this is a primitive decode element and the first action is to assign out the next element to be decoded. The primitive element is then decoded and stored at the update pointer location, adjusted with the offset quantity in meminst. The following C-like pseudo code segment summarizes these actions. Note that the tt value used below is defined in find correct node.

    q->tt = tt->next;
    q->tsptr = rec->tsptr +
               size of tag and len of (rec->tsptr) +
               length in length field of (rec->tsptr);
    q->update = rec->update;
    q->tsend = rec->tsend;
    write q to queue-processor;
    decode the primitive;

5 Hardware Models

The algorithms suggested in the previous sections were simulated on the parallel hardware presented in figure 8. The elements in this hardware are specified at the behavior level using VHDL[1][9]. Sections 5.1-5.3 describe the operations of each of the components.

5.1 Bus Arbitration Method

A rotating daisy chain (RDC) is chosen for the bus arbitration method. A daisy chain and a rotating daisy chain bus arbitration setup are shown in figure 31. A brief outline of both arbitration schemes is provided in the next paragraph.
A RDC is chosen for the bus arbitration method because it is a distributed method that scales easily. The RDC method also offers the best bus arbitration scheme for the amount of hardware needed for implementation[2]. Ref. [21] presents various implementation methods for the RDC along with their relative advantages.

Figure 31 Daisy Chain and Rotating Daisy Chain

In the daisy chain bus arbitration scheme shown in figure 31, any of the n processing elements (PE) can request the bus by raising the bus request line. The arbiter will then raise the g1in grant signal. The g1in signal allows PE1 access to the bus; but if PE1 does not require the bus, it will propagate the g1in signal to g2in. The grant signal can propagate through to any of the n PEs that request the bus. This bus arbitration scheme is unfair, as PE1 has the highest priority: whenever it requests the bus it will always get the g1in grant signal. In this system, PEn has the lowest priority because when PEn requests the bus it will get the grant only if none of the previous (n-1) PEs require the bus. At a processor PEi, the processors to the right of PEi have lower priorities and the processors to the left of PEi have higher priorities.

The RDC initially starts with one PE responsible for generating a bus grant signal. This PE has the lowest priority while the first PE to its right has the highest priority. The PE that generates the grant signal is the arbiter. When a PE requests the bus and receives the bus grant signal, this PE then assumes lowest priority and becomes responsible for generating the bus grant signal for the next arbitration. This moving of the bus granting capabilities from one PE to another maintains an even priority system[2]. Another advantage of using the RDC method is the relative ease with which other systems can connect into the RDC. When another system wants to share the bus with the elements on the bus, it simply includes itself in the RDC.
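The rotation rule described above — the grant winner becomes the new lowest-priority holder — can be sketched behaviorally in a few lines of C. The function name and the assumption that one arbitration round is atomic are simplifications for this sketch, not part of the VHDL model.

```c
/* One rotating-daisy-chain arbitration round.  `holder` is the PE whose
 * bus grant register is set (lowest priority); the PE immediately to its
 * right has the highest priority, as in the description above. */
#define NPE 4

static int rdc_arbitrate(int holder, const int request[NPE])
{
    /* Scan from holder+1 (highest priority) around to holder (lowest). */
    for (int i = 1; i <= NPE; i++) {
        int pe = (holder + i) % NPE;
        if (request[pe])
            return pe;   /* this PE gets the grant and becomes the holder */
    }
    return holder;       /* no requests: the grant stays where it is      */
}
```

Because the winner of each round becomes the new holder (and hence the new lowest-priority PE), repeated calls rotate priority evenly among the PEs, which is exactly the fairness property claimed for the RDC.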
The overhead associated with including an arbiter in every PE is low. Figure 32 shows that the hardware requirements are minimal.

Figure 32 shows a three PE RDC implementation. Each processing element has a one bit bus grant register and a one bit bus request register. The operation of this three PE RDC is explained by using the timing diagram of figure 33. The signals /B_RQ<1-3> are bus request signals generated by the PEs. The /req<1-3> are the clocked /B_RQ<1-3> signals and /t<1-3> are the grant signals that the PEs use. When the system initializes, only one bus grant register is set, and this is signal /t1 in the timing diagram. Now PE 1 is responsible for generating the bus grant upon initialization. When one PE receives a bus grant from another, this sets its bus grant register and it then becomes responsible for generating the next bus grant signal. The next action in the timing diagram occurs when PE 3 requests the bus by raising the /B_RQ3 line, which in turn gets clocked and raises /req3. Finally /t3 rises, signaling that PE 3 now owns the bus. PE 3 is now responsible for generating the next bus grant and so PE 1 is released of this duty. The timing diagram of figure 33 contains some interesting scenarios, such as two PEs requesting the bus at the same time. The reader is free to explore the timing diagram and no further explanation is provided.

Figure 32 Partial Diagram of RDC Arbitration

Figure 33 RDC Timing Diagram

5.2 Processor

The proposed RISC processor shown in figure 34 is used for the processing elements (PE) in the parallel architecture. The architecture is similar to the Motorola MC88100[10] and the Sun SPARC[18] architectures, but the instruction set is optimized for encoding/decoding. The processing element is a dual-bus 32-bit processor with a 32-bit fixed-length instruction set.
A 32-bit processor is chosen because memory management and mapping is compatible with contemporary 32-bit address space systems. A 32-bit fixed-length instruction format is chosen to make the instruction decoder simpler and to allow an instruction to be fetched per memory read. This processor can exploit super-scalar parallelism because the integer unit, the data unit and the instruction unit are designed to operate simultaneously.

The core processor in figure 34 has a smaller instruction set than the Motorola MC88100 or the Sun SPARC, and it has some special instructions that are not available on either. These special instructions are used to implement atomic memory read/write cycles and to access local memory. They make this processor more suited for running the parallel algorithm described above than either the Motorola MC88100 or the Sun SPARC.

Figure 34 Processor Block Diagram

Sections 5.2.1 to 5.2.4 describe the main units in the block diagram and discuss the instruction set of the PE, which is presented in table 7. Table 6 shows all input/output signals of the PE and provides a brief explanation of each.

5.2.1 Integer Unit

The integer unit executes the first 7 instructions in table 7. The unit executes all the arithmetic and logical operations in one clock cycle. The integer unit works only in a register-to-register mode. There are no complex instructions for modifying memory locations.

5.2.2 Data Unit

Loading from and storing to memory is performed by this unit. Depending on the load/store operation that is being executed, the data unit sets an internal line that activates the local memory or the RDC element of section 5.1.
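A small illustrative model of the data unit's two special behaviors follows: routing a load/store to local memory versus the shared bus, and the atomic increment, which holds the shared bus for the whole read-modify-write. The opcode names are taken from table 7; the routing predicate and the exact register semantics of the atomic increment are assumptions made for this sketch.

```c
#include <stdbool.h>

/* Load/store opcodes, named after table 7. */
enum mem_op { LOCAL_LOAD, LOCAL_STORE, GLOBAL_LOAD, GLOBAL_STORE,
              ATOMIC_INCREMENT };

/* True when the LOCAL line is raised and bus arbitration is bypassed;
 * global and atomic accesses go through the RDC unit instead. */
static bool selects_local_memory(enum mem_op op)
{
    return op == LOCAL_LOAD || op == LOCAL_STORE;
}

/* Atomic increment: the ATOMIC line keeps the shared bus granted so no
 * other PE can touch *word between the read and the write-back.
 * Returning the incremented value is an assumption from the table 7
 * wording ("loads a register, increments it and stores the result"). */
static unsigned int atomic_increment(unsigned int *word)
{
    /* acquire shared bus (ATOMIC asserted) */
    unsigned int result = *word + 1;   /* read and increment */
    *word = result;                    /* write back         */
    /* release shared bus */
    return result;
}
```

An atomic increment like this is what lets several PEs safely share counters (for example, queue pointers) over the common bus.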
A typical memory read and write timing diagram is given in Appendix D.

5.2.3 Instruction Unit

The instruction unit does memory reads which follow the read timing diagram depicted in Appendix D. The instruction unit implements a two-stage pipeline with delayed branching. This allows an instruction to be executed in the slot immediately after a branch operation, so that the cost associated with branching can be reduced by compiler writers and assembly language programmers.

5.2.4 Register File

The register file consists of 32 32-bit registers. Not all of these registers are available for general programming use. The following text explains the functions of the special registers.

Reg(0) = Zero Register    Use of register 0 as a zero register is a hardware convention. Register zero is wired to zero by the hardware. Programs are allowed to write into register zero, but this has no effect on its contents.

Reg(26) = Frame Pointer    Use of register 26 for the frame pointer is a software convention. The frame pointer points to the base address of the stack frame of each procedure.

Reg(27) = Stack Pointer    Use of register 27 for the stack pointer is a software convention. The stack pointer points to the present location in the stack.

Reg(28) = Return Address    Use of register 28 for the return address is a hardware convention. Whenever a branch to a subroutine is made, the return address is stored into register 28.

Reg(29) = Queue Address    Use of register 29 for the queue address is a software convention. The queue address is the I/O mapped address of the queue-processor.

Reg(30) = Processor Address    Use of register 30 for the processor address is a software convention. The I/O mapped address of the processor is stored in this location. When a processor writes to its own I/O mapped location, this means that it is free for work assignment.

Reg(31) = Wait Result    Use of register 31 for the wait instruction's result is a hardware convention.
When the processor executes the wait instruction, the information that is written to the processor's I/O mapped location is stored in register 31.

    Signal      Functional Description
    A<31..0>    Address lines.
    D<31..0>    Data lines.
    READ        Signals memory read.
    WRITE       Signals memory write.
    READY       Signals the completion of a memory access.
    LOCAL       Signals a write to the local memory.
    ATOMIC      Signals that an atomic read, increment, write is taking place.
    BUSRQ       Bus request line.
    BUSGTIN     Bus grant in line.
    BUSGTOUT    Bus grant out line.
    RESET       The asynchronous reset line.
    CLK1        Phase one of the clocking signal.
    CLK2        Phase two of the clocking signal.

Table 6 Processing Element Signals

    Instruction       Description of Arithmetic and Logic Instructions
    shift left        Shifts a register left.
    shift right       Shifts a register right.
    add               Adds two registers or a register and an immediate.
    subtract          Subtracts two registers or a register and an immediate.
    XOR               Bitwise XOR of two registers.
    AND               Bitwise AND of two registers.
    OR                Bitwise OR of two registers.

    Instruction       Description of Control Flow Instructions
    brig              Branch to an address specified by a register and an immediate.
    brsr              Same as brig but the return address is saved.
    wait              Blocks until a memory write is executed on the memory
                      mapped location of the processor.

    Instruction       Description of Data Load/Store Instructions
    local load        Loads to a register from the local memory.
    local store       Stores to the local memory the contents of a register.
    global load       Loads to a register from the global memory.
    global store      Stores to the global memory the contents of a register.
    atomic increment  Loads a register from memory, increments it and stores
                      the result back into memory without releasing the shared bus.
    load high         Loads to a register's high bits the immediate constant.

Table 7 Instructions Executable on a Processing Element

5.3 Queue-Processor

The queue-processor maintains the availability status of the processors in the parallel system and distributes the work load accordingly.
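The queue-processor's behavior, as described in this section, can be sketched as a FIFO addressed by first/last element counters plus one busy bit per PE. The queue size, PE count, wraparound handling, and all names here are illustrative assumptions; in hardware the pointers are counters and the busy bits are toggled registers.

```c
#include <stdbool.h>

#define QSIZE 16   /* assumed queue depth  */
#define NPE   4    /* assumed PE count     */

struct queue_processor {
    unsigned int queue[QSIZE];
    unsigned int first;      /* next free slot: incremented on each write */
    unsigned int last;       /* next job to hand out                      */
    bool         busy[NPE];  /* toggled by writes to a PE's I/O address   */
};

/* A write to the queue-processor's I/O mapped address lands at `first`. */
static void qp_write(struct queue_processor *qp, unsigned int word)
{
    qp->queue[qp->first % QSIZE] = word;
    qp->first++;
}

/* A write to a PE's own I/O mapped address toggles its busy bit. */
static void qp_pe_write(struct queue_processor *qp, int pe)
{
    qp->busy[pe] = !qp->busy[pe];
}

/* Dispatch: if work is queued and some PE is free, assign the job. */
static bool qp_dispatch(struct queue_processor *qp, int *pe_out,
                        unsigned int *word_out)
{
    if (qp->last == qp->first)
        return false;                   /* queue empty                    */
    for (int pe = 0; pe < NPE; pe++) {
        if (!qp->busy[pe]) {
            *word_out = qp->queue[qp->last % QSIZE];
            qp->last++;                 /* job handed out                 */
            qp->busy[pe] = true;        /* write to the PE toggles it     */
            *pe_out = pe;
            return true;
        }
    }
    return false;                       /* every PE is busy               */
}
```

The two counters never need comparing modulo QSIZE for emptiness: equality of the raw counters means no outstanding work, which mirrors the first/last element pointers of figure 36.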
Figure 35 shows the block diagram of the queue-processor. The Queue Memory Manager is the controller of the queue-processor and implements the algorithm presented in figure 28. The block structure of the Queue Information block is shown in figure 36 and the block structure of the Processor State Information block is shown in figure 37.

Figure 35 Queue-Processor

Whenever a word of information is written to the I/O mapped address of the queue-processor, this information is actually written into the queue memory location pointed to by the first queue element pointer (figure 36). The first queue element pointer is then incremented so that it is ready for the next write to the I/O mapped location of the queue-processor. When a word of information is assigned out of the queue, the information is read from the location that the last queue element pointer is pointing to. Again, the last queue element pointer gets incremented after this operation is performed. The first and last queue element pointers are each implemented using a counter that gets incremented whenever information is written to, or assigned from, the I/O mapped location of the queue.

Figure 36 Queue Information

When the queue-processor sees that there is an element in the queue and that any one of the processing elements is free, it initiates a write to the I/O mapped location of the free processor. This increments the last element pointer of the queue (figure 36), which signifies that the job is assigned to a processor. After the processor finishes the job, it signals the queue-processor of its availability status. The processor signals that it is free by writing any value into its I/O mapped address. Figure 37 shows a simplified diagram of how the queue-processor and four PEs can maintain correct processor availability states.
In figure 37, comp is a comparator and the other components are explained below. Correct processor availability states are maintained by the queue-processor. It maintains a toggle bit (the p<1-4> states in figure 37) for every processor that it assigns work to. This bit toggles on each write to the I/O mapped location of a processor (p addr in figure 37). Initially all bits are set to 1, signifying that all processors are busy (p<1-4> = 1), and upon start-up each processor writes to its I/O mapped location, which signals that it is free. When there is work for a processor to do, the queue-processor writes the information to any of the free processors. Suppose p4 gets the work written to it (i.e. w = 1 and AddrBus = p4 addr). This changes p4's state in the queue-processor to 1, which signifies that processor 4 is busy. Once the processor finishes its work, it writes any information to its I/O mapped location (i.e. w = 1