NETWORKS ON CHIP: EMERGING INTERCONNECT INFRASTRUCTURES FOR MP-SOC PLATFORMS

by

PARTHA PRATIM PANDE
B.Tech., Calcutta University, India, 1997
M.Sc., The National University of Singapore, 2002

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (ELECTRICAL AND COMPUTER ENGINEERING)

THE UNIVERSITY OF BRITISH COLUMBIA
JULY 2005

© Partha Pratim Pande

Abstract

Multiprocessor system-on-chip (MP-SoC) platforms are emerging as an important trend in System on Chip (SoC) design. The state of the art has reached a point where commercial designs readily integrate on the order of 10-100 embedded functional/storage blocks in a single SoC. As a result of this enormous degree of integration, several industrial and academic research groups are striving to develop efficient communication architectures, in some cases optimized for specific applications. Global synchronization is becoming increasingly complex due to process variability and power dissipation, and cross-chip signaling can no longer be achieved in a single clock cycle. Thus, system design must incorporate networking and distributed computation paradigms, with communication structures designed first and functional blocks then integrated into the communication backbone. The emerging Network on Chip (NoC) design methodology is a step in this direction. The practical implementation and adoption of the NoC design paradigm faces various unsolved issues related to design methodologies, test strategies, and dedicated CAD tools. The focus of this research is on the design aspects and architectural issues of this new paradigm. The contributions of this research are twofold. The first is the performance evaluation of various NoC architectures with regard to data rates, latency, silicon area overhead, and energy dissipation; the second is the quantification of the timing characteristics of NoC architectures. Through detailed circuit design and timing analysis, this research has established that the different NoC architectures proposed to date are guaranteed to achieve the high-performance clock cycle requirements in a given CMOS technology, usually specified in normalized units of FO4 (fan-out-of-4) delays.

Table of Contents

Abstract
Table of Contents
List of Figures
List of Tables
Acknowledgements
Chapter 1: Introduction and Overview
  1.1 Introduction
  1.2 Contributions
  1.3 Thesis Organization
Chapter 2: Background and Related Work
  2.1 Background
  2.2 Related Work
  2.3 SPIN
  2.4 CLICHE
  2.5 TORUS
  2.6 Octagon
  2.7 BFT
  2.8 Irregular Architectures
Chapter 3: Infrastructure IP Design Considerations
  3.1 Switching Methodologies
  3.2 Switch Architecture
    3.2.1 Arbiter
    3.2.2 Routing Block
    3.2.3 Multi-destination Routing
    3.2.4 Switch Traversal
    3.2.5 FIFO Buffers
  3.3 Network Interfacing
  3.4 Summary
Chapter 4: Timing Analysis of NoC Interconnect Architectures
  4.1 Achievable Clock Cycle in a Bus Segment
  4.2 Communication Pipelining in NoCs
    4.2.1 Wire Delay Between Switches
    4.2.2 Circuit Delay Through the Switches
      4.2.2.1 Delay Through the Arbiter
      4.2.2.2 Delay Through the Routing Block
      4.2.2.3 Delay Incurred in Switch Traversal
  4.3 Experimental Results
  4.4 Summary
Chapter 5: Performance Evaluation and Design Trade-offs
  5.1 Message Throughput
  5.2 Transport Latency
  5.3 Communication Energy
  5.4 Area Requirements
  5.5 Performance Evaluation
  5.6 Experimental Results and Analysis
    5.6.1 Throughput and Latency
    5.6.2 Energy Dissipation
      5.6.2.1 Energy Dissipation & Throughput
    5.6.3 Area Overhead
    5.6.4 Wiring Complexity
  5.7 Case Study
  5.8 Summary
Chapter 6: Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work
    6.2.1 Testing of NoC-based Systems
      6.2.1.1 Testing of the Functional/Storage Blocks
      6.2.1.2 Testing of the Interconnect Infrastructure
        6.2.1.2.1 Testing of the Switch Blocks
        6.2.1.2.2 Testing of the Inter-switch Wire Segments
      6.2.1.3 Testing of the Integrated System
  6.3 Reliability and Fault Tolerance
    6.3.1 Error Control Coding
    6.3.2 Fault Tolerant Architectures
  6.4 NoC Benchmark Circuits
    6.4.1 Benefits of Benchmarks
    6.4.2 Proposal for NoC Benchmarks
  6.5 CAD for NoC
References
Appendix 1: Theory of Logical Effort

List of Figures

Fig. 2.1: Gate delay versus interconnect delay
Fig. 2.2: Bus architectures for (a) AMBA and (b) CoreConnect
Fig. 2.3: Sonics' Silicon Backplane
Fig. 2.4: MIPS SoC-it
Fig. 2.5: SPIN architecture
Fig. 2.6: CLICHE architecture
Fig. 2.7: Torus and Folded Torus
Fig. 2.8: Basic Octagon configuration
Fig. 2.9: BFT architecture
Fig. 2.10: Irregular architecture
Fig. 3.1: (a) Header flit; (b) Data and tail flit
Fig. 3.2: Virtual-channel switch
Fig. 3.3: Switch I/O configurations for different NoC architectures
Fig. 3.4: Block diagram of a switch
Fig. 3.5: Priority matrix
Fig. 3.6: Priority matrix transition when requester 2 is granted access
Fig. 3.7: Logic circuit to generate the gnt1 signal
Fig. 3.8: (a) Block diagram of an arbiter; (b) one element of the priority matrix
Fig. 3.9: (a) LCA determination; (b) block diagram of the LCA routing circuit
Fig. 3.10: (a) e-cube routing; (b) block diagram of the e-cube routing circuit
Fig. 3.11: Circuit of the LCA routing block
Fig. 3.12: Tree of NOR gates replacing the 6-input NOR gate
Fig. 3.13: Multi-destination routing using bit-string encoding
Fig. 3.14: Switch traversal circuit
Fig. 3.15: The FIFO buffer
Fig. 3.16: Interfacing of IP cores with the network fabric
Fig. 4.1: Illustration of the FO4 metric
Fig. 4.2: Clock period trend [8]
Fig. 4.3: A buffered bus-wire segment
Fig. 4.4: Variation of delay with parasitic capacitance CIP for a fixed bus length
Fig. 4.5: Variation of delay with bus length for a fixed CIP
Fig. 4.6: Variation of delay with CIP and bus wire length
Fig. 4.7: Pipelined data transfer
Fig. 4.8: (a) BFT block diagram; (b) SPIN block diagram
Fig. 4.9: Buffered global wire delay in different technology nodes
Fig. 4.10: (a) Block diagram of an arbiter; (b) one element of the priority matrix
Fig. 4.11: Critical path of the input arbiter
Fig. 4.12: Grant signals as control inputs of the mux
Fig. 4.13: Critical path of the LCA routing block
Fig. 4.14: Critical path of the e-cube routing block
Fig. 4.15: Delays associated with the pipelined stages on the data path
Fig. 5.1: Variation of throughput under spatially uniform traffic distribution
Fig. 5.2: Variation of latency with virtual channels
Fig. 5.3: Variation of accepted traffic with injection load
Fig. 5.4: Variation of throughput under localized traffic (number of VCs = 4)
Fig. 5.5: Latency variation with injection load for spatially uniform traffic distribution
Fig. 5.6: Latency variation with injection load (localization factor = 0.3)
Fig. 5.7: Latency variation with injection load (localization factor = 0.5)
Fig. 5.8: Latency variation with injection load (localization factor = 0.8)
Fig. 5.9: Average energy dissipation per packet
Fig. 5.10: Energy dissipation profile for uniform traffic
Fig. 5.11: Energy dissipation profile for localized traffic (localization factor = 0.3)
Fig. 5.12: Energy dissipation profile for localized traffic (localization factor = 0.5)
Fig. 5.13: Energy dissipation profile for localized traffic (localization factor = 0.8)
Fig. 5.14: Bit energy versus throughput characteristics (system size = 16)
Fig. 5.15: Bit energy versus throughput characteristics (system size = 64)
Fig. 5.16: Bit energy versus throughput characteristics (system size = 256)
Fig. 5.17: Energy savings for different localization factors
Fig. 5.18: Area overhead
Fig. 5.19: Simplified layout examples of SPIN and BFT
Fig. 5.20: Inter-switch wire length distribution
Fig. 5.21: Functional block diagram of a typical network processor
Fig. 6.1: Distributed BIST structure for FIFO testing

List of Tables

Table 4.1: Inter-switch wire lengths in mm (tree-based architectures)
Table 4.2: Inter-switch wire lengths for CLICHE, Torus and Folded Torus
Table 4.3: Values of Rw, Cw, tinv and FO4 in different technology nodes
Table 4.4: Logical effort - summary of parameters
Table 4.5: Delay through the switches, determined from Synopsys™ PrimeTime
Table 5.1: Simulation parameters
Table 5.2: Maximum number of 100K IP blocks in different technology nodes
Table 5.3: Distribution of functional and infrastructure IP blocks
Table 5.4: Summary of comparative architecture evaluation
Table 5.5: Projected performance of a network processor SoC platform in the NoC design paradigm

Acknowledgements

I would like to thank my research supervisor, Prof. Andre Ivanov, who allowed me to investigate this challenging problem. Special thanks go to Prof. Res Saleh for his guidance. I will cherish in my memory the lively discussions I had with these two professors during the course of my research. It was a great experience working with them. My PhD is a tribute to my childhood hero, my father, Prof. Dulal Pande. He has been a big source of inspiration for me all along. I tried to follow him, wanted to be an academician like him, and that is why I pursued higher studies. I also express my gratitude to my mother, Mrs. Gita Pande, for her untiring effort in bringing me up. I also thank my wife Somava for her encouragement. She accepted all the burdens of a grad student's wife with a smiling face.
Without her I could not have proceeded this far. Special thanks go to my sister Monomita Chakravarty and brother-in-law Dr. Sumantra Chakravarty for suggesting UBC as a possible grad school for me in 2001. I cannot finish without saying thanks to my friend, philosopher and guide, Cristian Grecu. He worked with me for the last four years. I deeply appreciate his cooperation, suggestions and help. I wish to thank Micronet, PMC-Sierra, Gennum, and NSERC for their financial support, and CMC for providing access to CAD tools.

Chapter 1: Introduction and Overview

1.1 Introduction

System on Chip (SoC) design methodologies will undergo revolutionary changes in the years to come and will involve the extensive use and seamless integration of numerous semiconductor intellectual property (IP) blocks in the form of processors, embedded memories and smart interconnects. Today, there exist many SoC designs that contain a number of processors in applications such as set-top boxes, wireless base stations, HDTV, mobile handsets, and image processing [2]. Such systems will behave as multiprocessors, and will require a corresponding design methodology for both their hardware and software implementations. Power and cross-chip signaling constraints are forcing the development of new design methodologies that incorporate explicit parallelism and provide a more structured communication fabric [1] [2]. Many have rightfully argued that future designs will be dominated by arrays of processors that form the basis of new multiprocessor SoC platforms (the so-called MP-SoC platforms). As shown in Fig. 1.1, MP-SoC platforms will include tens to hundreds of embedded processors [2]. These will come in a wide variety, including general-purpose RISC, specialized application-specific instruction-set processors (ASIPs), embedded FPGA, DSP, etc. [2].

A key component of these multi-processor SoC (MP-SoC) platforms [2] [3] is the interconnect fabric. Such SoCs imply the seamless integration of numerous IPs performing different functions and operating at different clock frequencies. The integration of several components into a single system gives rise to new challenges. It is critical that infrastructure IP (I²P) [4] be developed for a systematic integration of numerous functional IP blocks to enable the widespread use of the SoC design methodology. One of the major problems associated with future SoC designs arises from non-scalable global wire delays. Global wires carry signals across a chip, but these wires typically do not scale in length with technology scaling [5]. Though gate delays scale down with technology, global wire delays typically increase exponentially or, at best, linearly with the insertion of repeaters.

Fig. 1.1: MP-SoC platform (embedded memories, eFPGA, standard hardware IP blocks, ASIPs, general-purpose RISC/VLIW processors, DSPs and microprogrammed peripherals integrated around a scalable interconnect)

Even after repeater insertion [5], the delay may exceed the limit of one clock cycle (often, multiple clock cycles). According to the ITRS (2003 update) [8], "Global synchronization becomes prohibitively costly due to process variability and power dissipation, and cross-chip signaling can no longer be achieved in a single clock cycle". In ultra-deep submicron processes, 80% or more of the delay of critical paths will be due to interconnects [6] [7]. In fact, many large designs today use FIFO (first-in, first-out) buffers to synchronously propagate data over large distances to overcome this problem.
This solution effectively pipelines the communication but is ad hoc in nature. In fact, the interconnect is designed after the floorplanning stage and is constrained to fit in the available routing regions of the design. A better approach would be to define formal interconnect network protocols and routing templates early in the chip planning process. Thus, system design would incorporate networking and distributed computation paradigms, with communication structures designed first and functional blocks then integrated into the communication backbone.

Currently the most frequently used on-chip interconnect architecture is the shared-medium arbitrated bus, where all communicating devices share the same transmission medium. The advantage of the shared-bus architecture is that it is well understood. Other features are: a simple topology, low area cost, and extensibility. However, for a relatively long bus line, the intrinsic parasitic resistance and capacitance can be quite high. Moreover, every additional IP block connected to the bus adds to this parasitic capacitance, in turn causing increased propagation delay. As the bus length and/or the number of added IP blocks increases, the associated delay in bit transfer over the bus may grow to become arbitrarily large and will eventually exceed the targeted clock period. This limits, in practice, the number of IP blocks that can be connected to a bus, and thereby limits the system scalability [9] [10]. One solution for such cases is to split the bus into multiple segments and introduce a hierarchical architecture [11].

As the number of reusable intellectual property (IP) blocks on SoC platforms continues to increase, many have argued that monolithic bus-based interconnect architectures will not be able to support the clock cycle requirements of these leading-edge SoCs. While hierarchical system integration using multiple smaller buses connected through repeaters or bridges offers possible solutions, such approaches tend to be ad hoc in nature and therefore lack generality and scalability. To overcome the above-mentioned problems, several research groups, including ourselves, have advocated the use of a communication-centric approach to integrate IPs in complex SoCs. This new model allows the decoupling of the processing elements (i.e., the IPs) from the communication fabric (i.e., the network). The need for global synchronization can thereby disappear. This new approach employs explicit parallelism, exhibits modularity to minimize the use of global wires, and utilizes locality for power minimization [3].

A number of different interconnect architectures for MP-SoC platforms have been proposed. Their origins can be traced back to the field of parallel computing. However, a different set of constraints exists when adapting these architectures to the SoC design paradigm. High throughput and low latency are the desirable characteristics of a multi-processing system. Instead of aiming strictly for speed, designers increasingly need to consider energy consumption constraints [3], especially in the SoC domain. None of the existing works on Networks on Chip (NoCs) has compared the proposed interconnect architectures relative to throughput, latency, energy dissipation profile and silicon area overhead. Some architectures can sustain very high data rates at the expense of high energy dissipation and considerable silicon area overhead, while others provide a lower data rate and lower energy dissipation levels [12].
The inherent characteristics of NoC architectures allow a higher degree of spatial locality in the inter-block communications. Consequently, the functional mapping should be performed so as to exploit the advantages of spatial locality, i.e., the blocks that communicate more frequently should be placed closer to each other [13]. This will reduce the use of long global paths and the energy dissipation.

In this research we are principally focused on the design aspects and architectural issues of the NoC paradigm. One significant contribution of this work is the performance evaluation of a set of recently proposed NoC architectures with realistic traffic models [12]. This will help system integrators to differentiate between a set of possible interconnect fabrics while building a large SoC containing a very high number of functional/storage blocks. Another significant contribution of this work is to show how the inherently structured architecture of NoC interconnects helps to establish a highly pipelined communication medium for the constituent IP blocks [10]. The network can be decomposed into switches and interconnect that are designed to fit within a target clock cycle requirement. We believe that the NoC approach will ultimately be the preferred communication fabric for next-generation designs. To support this conjecture, we demonstrate, through detailed circuit design and timing analysis, that the different NoC architectures proposed to date are guaranteed to achieve the minimum possible clock cycle times in a given CMOS technology, usually specified in normalized units as 10-15 FO4 delays. This is contrasted with the bus-based approach, which may require several design iterations to deliver the same performance when the number of IP blocks connected to the bus exceeds certain limits. This work furthers the body of knowledge associated with the design and analysis of such complex architectures, and our analysis allows us to identify useful design trade-offs that are critical for the optimal development of integrated network-based designs.

1.2 Contributions

The principal contributions of this thesis are as follows:

• Switch Design for NoCs: Switches are an integral part of NoC interconnect architectures. The switch blocks are designed according to the interconnect infrastructure and the routing methodology adopted. Multicast features are provided through a simple bit-string encoding scheme.

• Timing Analysis of NoC Interconnects: The global wire delay problem is solved by adopting a structured interconnect infrastructure, which establishes a pipelined communication medium. The timing characteristics of NoC architectures are presented in Sections 4.2 and 4.3. Specific solutions to achieve the clock cycle constraint are elaborated.

• Performance Evaluation: A quantitative comparison of various candidate NoC architectures in regard to data rates, latency, silicon area overhead, energy dissipation and wiring complexity is presented. This comparison is based on a set of metrics that establish a useful basis for the optimal evaluation and selection of interconnect infrastructures for large and complex SoCs.

• Traffic Localization: The effect of traffic localization on system-level performance, including throughput, latency and energy dissipation, is presented in Sections 5.6.1 and 5.6.2. Extensive simulation results show that a considerable amount of communication energy can be saved through this approach.
• Case Study: A case study based on a commercial SoC design illustrates the applicability of the evaluation methodology undertaken in this thesis.

1.3 Thesis Organization

This thesis is organized as follows. Chapter 1 introduces the problem. Chapter 2 deals with background and related work. Chapter 3 elaborates the design of the infrastructure IP blocks; it mainly consists of the design of the different components of the switch blocks. Timing analysis of NoC architectures is covered in Chapter 4, where we principally demonstrate how the switches and the structured inter-switch wire segments in different NoC architectures establish a pipelined communication medium. Chapter 5 presents a comparative evaluation of different NoC architectures. Finally, Chapter 6 concludes this work by pointing towards future research directions.

Chapter 2: Background and Related Work

2.1 Background

A common problem shared by all interconnect topologies implemented in ultra-deep submicron (UDSM) technology is the communication latency that arises from global wire delays. Current projections show that the propagation delays for highly optimized global wires, taking wire sizing and buffering into account, will exceed 6-10 clock cycles in 50 nm technology [1]. Thus the delay for a signal traversing the chip diagonally will be more than one clock cycle. In order to quantify the large delays associated with long wires, we plot the FO4 gate delay¹ (i.e., the delay of one inverter driving four identical ones) and the wire delay for a number of different wire lengths as technology scales, as shown in Fig. 2.1. As reference points, the FO4 delay for 350 nm is approximately 150 ps; for 130 nm, it is reduced to approximately 50 ps; for 65 nm the value reduces to about 25 ps. On the other hand, for L = 1 mm the interconnect delay increases from 3 ps, to 27 ps, to 116 ps, for the same three technology nodes respectively. For L = 2 mm and L = 3 mm, the interconnect delay values exhibit the same trend. The wire delay doubles with each technology node and increases quadratically as a function of wire length [50].

Fig. 2.1: Gate delay versus interconnect delay

¹ The concept of FO4 is described in more detail in Chapter 4 of this thesis.

For very long wires, the quadratic delay characteristics described above cannot be tolerated in any design. The existing solution in this scenario is to insert repeaters or buffers periodically along the wire. This effectively divides the wire into smaller segments. The repeaters act to convert the quadratic delay to a more linear delay. Even with this repeater insertion, the intra-chip propagation delay of long wires can exceed the limit of one clock cycle [5] [6] if the die size is large enough. To overcome this problem, many large designs today use first-in first-out (FIFO) buffers to synchronously propagate data over long distances across the chip. The FIFO insertion process is ad hoc in nature, and therefore not easily generalizable and scalable. Wires need to be divided into segments that have a propagation time compatible with the clock cycle budget, and signals need to propagate along these segments in a pipelined fashion.
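To make the scaling trends above concrete, the short Python sketch below estimates the delay of an unbuffered global wire with a simple distributed-RC (Elmore) model and compares it with the same wire split into repeater-driven segments. The per-millimetre resistance and capacitance, the repeater delay, and the segment length are illustrative placeholders rather than data from this thesis; the point is only that the unbuffered delay grows quadratically with length while the repeated wire grows roughly linearly.

```python
# Hedged sketch: quadratic vs. repeated (linearized) global wire delay.
# The numeric constants below are illustrative assumptions, not thesis data.

R_PER_MM = 200.0      # wire resistance per mm (ohms), assumed
C_PER_MM = 0.2e-12    # wire capacitance per mm (farads), assumed
T_REPEATER = 25e-12   # intrinsic delay of one repeater (seconds), assumed

def unbuffered_delay(length_mm: float) -> float:
    """Elmore delay of a distributed RC wire: 0.5 * R_total * C_total."""
    r = R_PER_MM * length_mm
    c = C_PER_MM * length_mm
    return 0.5 * r * c            # grows with the square of the length

def repeated_delay(length_mm: float, segment_mm: float = 1.0) -> float:
    """Same wire split into repeater-driven segments: roughly linear in length."""
    n_segments = max(1, round(length_mm / segment_mm))
    per_segment = unbuffered_delay(segment_mm) + T_REPEATER
    return n_segments * per_segment

if __name__ == "__main__":
    for L in (1, 2, 4, 8, 16):
        print(f"L = {L:2d} mm: unbuffered {unbuffered_delay(L)*1e12:7.1f} ps, "
              f"repeated {repeated_delay(L)*1e12:7.1f} ps")
```

Even this crude model shows why, past a few millimetres, a signal cannot cross the die in one clock cycle unless the path is segmented and the transfer pipelined, which is precisely the role the switches play in the NoC architectures discussed next.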
To date, the most frequently used on-chip interconnect architecture is the shared-medium arbitrated bus. Advantages of shared-bus architectures include topological simplicity, low silicon area requirements, and extensibility. Some of the associated disadvantages as the number of IP blocks increases include a relatively large load per data-bus line, long delays for data transfer, high energy consumption, increasing complexity of the decoding/arbitration logic and low bandwidth [11]. Every additional IP block connected to the bus adds to the parasitic capacitance, causing, in turn, increased propagation delay. For a relatively long bus line, the intrinsic parasitic resistance and capacitance can reach excessive values [5] [10]. As the bus length and/or the number of IP blocks increases, the bus bit-transfer delay may exceed the specified clock cycle.

One solution for such cases is to introduce a hierarchical architecture and split the bus into multiple shorter segments. This approach also reduces the bottleneck caused by relatively slower IP blocks, such as typical I/O devices. Each shorter bus section can be viewed as effectively constituting a pipeline stage with a propagation delay that must fit within the clock cycle budget. However, the delay of each stage in the case of a hierarchical bus-based system depends on the parasitic capacitance due to the IP blocks connected to the individual bus segments. Therefore, the above methodology only yields a solution where the timing specifications and the interconnect design are tightly coupled, and the solution will depend on the specific IP blocks and their functional and parametric specifications. The ad hoc nature of the latter solution presents several disadvantages, particularly with regard to scalability and design automation.

The NoC-based approach to the above-mentioned interconnect problem provides a highly structured, multi-core SoC design methodology that can achieve aggressive clock/data rates without incurring undue design effort. The common characteristic of these kinds of architectures is that the functional IP blocks communicate with each other through intelligent switches. These switches should be designed to be reusable, and their primary role should be to ease the integration process of the functional IPs. Global signals, spanning a significant portion of a die in more traditional design styles, now only have to span the distance separating infrastructure switches. This switch-based architecture offers the advantage that the switches, along with the inter-switch wire segments, can form the basis for a highly pipelined communication medium.

2.2 Related Work

SoC designs today predominantly use shared-medium, bus-based functional interconnects to integrate IP blocks. The reason for this is that buses are well understood, are suitable for small SoC designs and have standardized interfaces. Several industrial buses in use today have similar characteristics. They consist of a high-performance system backbone bus, able to sustain the bandwidth of the CPU and other direct memory access devices, plus a bridge to a lower-speed bus on which lower-bandwidth peripherals are located. Three popular commercial bus configurations are the ARM AMBA bus [14], Wishbone [15] and IBM CoreConnect [16]. In the case of the Advanced Microcontroller Bus Architecture (AMBA) [14], the high-speed buses are the Advanced High-performance Bus (AHB) and the Advanced System Bus (ASB), and the lower-speed bus is the Advanced Peripheral Bus (APB). A typical AMBA architecture is shown in Fig. 2.2(a).
Fig. 2.2: Bus architectures for (a) AMBA and (b) CoreConnect

Similarly, the IBM CoreConnect architecture provides three different types of buses, namely the Processor Local Bus (PLB) for high-speed devices, the On-chip Peripheral Bus (OPB) for peripherals, and the Device Control Register (DCR) bus for status and configuration registers. Fig. 2.2(b) shows the typical CoreConnect architecture. According to [3], these bus-based systems may suffer from drawbacks of limited scalability, as per the preceding discussion and arguments.

A few on-chip micro-network proposals for SoC integration can be found in the literature. Sonics' Silicon Backplane [17] is one example. In this architecture, IP blocks are connected to the communication fabric through specialized interfaces called agents. Each core communicates with an agent using the Open Core Protocol (OCP) [18]. Agents communicate with each other using time-division multiple access (TDMA) bus access schemes. These agents effectively decouple the IP cores from the communication network.

Fig. 2.3: Sonics' Silicon Backplane

MIPS Technologies has introduced an on-chip switch for integrating IP blocks in a SoC [19]. The switch, called SoC-it, is intended to provide a high-performance link between a MIPS processor and multiple third-party IP cores. It is a central switch connecting different peripherals, but only in a point-to-point mode. None of these involves any specific interconnect architecture.

Fig. 2.4: MIPS SoC-it

In the following we briefly describe a few NoC architectures proposed by different research groups. For the purpose of illustration, the functional IP blocks are denoted by white squares while the infrastructure IPs (switches) are denoted by dark squares.

2.3 SPIN

Guerrier and Greiner [20] have proposed a generic interconnect template called SPIN (Scalable, Programmable, Integrated Network) for on-chip packet-switched interconnections, where a fat-tree architecture is used to integrate IP blocks. In this fat tree, every node has four children and the parent is replicated four times at any level of the tree. Fig. 2.5 shows the basic SPIN architecture with N = 16 nodes, where N represents the number of functional IP blocks in the system. The size of the network grows as (N log2 N)/8. The functional IP blocks reside at the leaves and the switches reside at the vertices. In this architecture, the number of switches converges to S = 3N/4 for large N, where N is the system size.

Fig. 2.5: SPIN architecture

2.4 CLICHE

Kumar et al. [21] have proposed a mesh-based interconnect architecture called CLICHE (Chip-Level Integration of Communicating Heterogeneous Elements). This architecture consists of an m × n mesh of switches interconnecting computational resources (IPs) placed along with the switches.
[22] have proposed a 2-D torus as an NoC architecture. The torus architecture is basically the same as a regular mesh [23]; the only difference is that the switches at the edges are connected to the switches at the opposite edge through a wrap-around channel. Every switch has five ports, one connected to the local resource and the others connected to the closest neighboring switches. Again, the number of switches is S = N . The wrap-around wires in the torus architecture give rise to very high delay. This can be avoided by folding the torus as shown in Fig. 2.7.  13  \jArea  1  p  — . , — I I -  r"  1  1  1  r=JII I — — I I  — r l  Fig. 2. 7 : Torus and Folded Torus 2.6 Octagon  Karim et al. [24] have proposed the O C T A G O N MP-SoC architecture. Fig. 2.8 shows a basic octagon unit consisting of 8 nodes and 12 bi-directional links. Each node is associated with a processing element and a switch. Communication between any pair of nodes takes at most two hops within the basic octagonal unit.  Fig. 2. 8: Basic Octagon Configuration For a system consisting of more than 8 nodes, the octagon is extended to multidimensional space. The scaling strategy is as follows: Each octagon node is indexed by the 2-tuple (i, j), i, j e [0,7].  14  For each i = I, Ie [0,7], an octagon is constructed using nodes /(7, j), j e [0,7]}, which results in eight individual octagon structures. These octagons are then connected by linking the corresponding i nodes according to the octagon configuration. Each node (I, J) belongs to two octagons: one consisting of nodes [(I, j) j e [0, 7]}, and the other consisting of nodes {(i, J) ie [0, 7]}. Of course, this type of interconnection mechanism may significantly increase the wiring complexity.  2.7 B F T  We proposed an interconnect template following a Butterfly Fat-Tree (BFT) [25] architecture as shown in Fig. 2.9. In our network, the IPs are placed at the leaves and switches placed at the vertices. A pair of coordinates is used to label each node, (I, p), where / denotes a node's level and p denotes its position within that level. In general, at the lowest level, there are A IPs with addresses ranging from 0 to 7  (N-l). The pair (0, N) denotes the locations of IPs at that lowest level. Each switch, denoted by S (I, p) has four child ports and two parent ports. The IPs are connected to N/4 switches at the first level. The number of levels depends on the total number of IPs, i.e., for N IPs, the number of levels will be (log4N). In the j level of the tree, there are N/2/ switches. In the case of 64 IPs the number of switches is 28 as th  +1  shown in Fig. 2.9. The number of switches in the butterfly fat tree architecture converges to a constant independent of the number of levels. If we consider a 4-ary tree as shown in Fig. 2.9 with four down links corresponding to child ports and two up links, corresponding to parent ports, then the total number of switches in level j=l is N/4. At each subsequent level the number of required switches reduces by a factor of 2. In this way the total number of switches is given by _ N IN IN S=—+ + + 4 2 4 4 4  15  N >  —  2  Fig. 2. 9: BFT Architecture 2.8 Irregular Architectures  Benini et. al. [26] have introduced application specific irregular architectures to build networked SoCs. The individual components of SoCs are inherently heterogeneous with widely varying functionality and communication requirements. 
As a result of this the communication infrastructure and also the architectures of the switches may vary across the whole SoC. As shown in Fig. 2.10 the architecture of the switches is not uniform and the communication characteristics among these switch, blocks also vary.  Fig. 2.10: Irregular Architecture  The performance of the NoC-based interconnects is strongly correlated with the particular topology selected for implementation. These topologies can be broadly classified into two different  16  categories: (i) regular architectures (Figs. 2.5-2.9) and (ii) irregular application-specific (custom) NoC structures (Fig. 2.10). In regular architectures, the level of performance is homogeneous across the whole system. In irregular architectures, the service requirements vary widely for the different processor/storage blocks. Hence, the design of the switches and also the topology is not homogeneous across the whole system. In a custom-built NoC architecture, switch blocks may not be identical, and are designed and placed according to the specific communication requirements. The regular network architectures fit well the realization of communication schemes for multi-processors. Irregular network architectures may be required for realizing application-specific SoCs, such as mobile phone systems, where various heterogenous blocks with various communication requirements need to be linked. One of the salient features of the NoC architectures is the decoupling of the communication fabric from the processing/storage elements [1]. This allows the optimization of the communication medium independently of the functionality using different levels of abstraction. A complex SoC can be viewed as a micronetwork of multiple blocks and hence models and techniques from networking and parallel processing can be borrowed and applied to SoC design methodology. The micronetwork must ensure quality of service requirements (such as reliability, guaranteed bandwidth/latency), energy efficiency, under the limitation of intrinsically unreliable signal transmission medium. Such limitations are due to the increased likelihood of timing and data errors, due to variability of process parameters, crosstalk, and environmental factors such as electro-magnetic interference (EMI) and soft errors. Quality of service (QoS) encompasses a collection of different design requirements that need to be fulfilled to achieve a certain performance level. As the interconnect infrastructure provides data communication services to the constituent IP blocks, it has to be designed such that it maintains predictability of performances under various operating conditions. In NoCs, two types of services are  17  considered essential: (i) data integrity, meaning that the data is delivered without corruption (ii) throughput and latency services, characterized by time related bounds. These can be achieved through •  Contention free routing schemes  •  Deadlock avoidance mechanisms  •  Appropriate flow control strategies  •  Error control coding  Guaranteed services require resource reservation for worst case scenarios. As a consequence, resources often remain underutilized. Best-effort services do not reserve any resources and hence provide no service guarantee. Guaranteed services should be used for critical traffic and the best-effort services for non-critical ones. By integrating both guaranteed and best-effort services in the same interconnect it is possible to design predictable and at the same time low cost interconnect infrastructures [27]. 
In the Ethereal NoC [27], the Network Interface (NI) offers a guaranteed service by providing a lower bound on throughput and an upper bound on latency, as these two are the most critical components for supporting real-time communication. In Ethereal, throughput and latency guarantees are implemented by configuring connections as pipelined time-division-multiplexed circuits over the network.

In order to become a viable alternative, the Network on Chip paradigm has to be supported by CAD tools through the creation of specialized libraries, application mapping tools and synthesis flows. This is the focus of different research groups [26] [28]. According to [29], current simulation methods and tools can be ported to networked SoCs to validate functionality and performance at various abstraction levels, ranging from electrical to transaction levels. NoC libraries, including switches/routers, links and interfaces, will provide designers with flexible components to complement processor/storage cores. Nevertheless, the usefulness of such libraries to designers will depend much on the level of maturity of the corresponding synthesis/optimization tools and flows. In other words, micro-network synthesis will enable NoC/SoC design in a way similar to how logic synthesis made efficient semi-custom design possible in the eighties. Examples of libraries of on-chip networks are provided by xPipes [26] and xPipesLite [28], the latter being a simpler, faster and synthesizable version of the former. The xPipes compiler and Sunmap [30] are examples of a network synthesis tool for xPipes and an automatic topology selection tool, respectively. The library and tools have been used to realize experimental gate-level models of complex system applications.

Any new design methodology can be widely adopted only if it is complemented by efficient test mechanisms. The development of test infrastructures and techniques supporting the Network on Chip design paradigm is a challenging problem. Specifically, the design of specialized Test Access Mechanisms (TAMs) for distributing test vectors and of novel Design for Testability (DFT) schemes is of major importance. The reuse of the on-chip network as a TAM for the functional/storage cores is proposed in [31]. The principal advantage of using the NoC as a test access mechanism is the availability of several parallel paths to transmit test data to each core, with no extra TAM hardware needed. Therefore, a reduction in system test time can be achieved through extensive use of test parallelization, i.e., more functional blocks can be tested in parallel as more test paths become available. One side effect of test parallelization is excessive power dissipation. Hence, both the test time and power dissipation aspects need to be considered while exploiting parallelization for testing of the functional blocks [31].

The range of functional IP blocks integrated in a single SoC extends from application-specific processors and general-purpose RISC to I/O and memory blocks. All of them need to comply with a common programming platform. A programming model for parallel systems like NoCs is a description of the basic components, their properties and available operations, and synchronization [2] [32]. There are two primary parallel computing models: the parallel random access machine (PRAM) and the message passing (MP) model.
It consists of a distributed system object component (DSOC) message passing model and a symmetrical multiprocessing (SMP) model using shared memory. The Multiflex environment is developed specifically for multiprocessor SoCs and maps these models onto the StepNP platform [32]. In [33], a Task Transaction Level (TTL) interface is described, which can be used for developing parallel application models by integrating hardware and software tasks on a platform. As the T T L interface is an abstract interface, the application developers can use it easily without knowing low level implementation details. This abstract interface allows the implementation of a broad range of multiprocessor platforms. The practical implementation and adoption of the NoC design paradigm is faced by multiple unresolved issues related to design methodology/technology and analysis of architectures, test strategies and dedicated C A D tools. In this work, we are principally focussed on the design aspects and architectural issues of this new paradigm. The principal contribution of this work lies in the establishment and illustration of a consistent comparison and evaluation methodology based on a set of readily quantifiable parameters for NoCs. Our methodology sets an important basis for the optimal evaluation and selection of interconnect infrastructures for large and complex SoCs.  20  Chapter 3 : Infrastructure IP Design Considerations The design process of NoC-based systems differs significantly from the conventional SoC design methodology. The main objective of these NoC-based architectures is the decoupling of the communication fabric from the processing elements. This allows independent optimization of the interconnect irrespective of the functional IP blocks. One common characteristic of the communicationcentric architectures considered in this work is that the functional IP blocks communicate with each other with the help of intelligent switches. The switches provide a robust data transport medium for the functional IP modules. The design of the switch blocks depends on the adopted switching methodologies and interconnect architecture. The goal of the design process is to ensure that the timing of the switches and interconnect meet the clock cycle constraints.  3.1 Switching Methodologies  The switching methodology is responsible for successful forwarding of bit streams through the network. The switching techniques determine when and how internal switches connect their inputs to outputs and the time at which message components may be transferred along these paths. These techniques are coupled with aflowcontrol mechanism for forwarding of messages through the network. Flow control is tightly coupled with buffer management algorithms that determine how message buffers are requested and released, and therefore determine how messages should be handled when blocked in the network. Implementations of switching methodology differ in decisions made in each of these processes and in their relative timing, i.e., when one operation can be initiated relative to the occurrence of others. The implementation and relative timing of flow control operations can distinguish switching methodologies from one another. Flow control is the mechanism for transmitting and receiving a unit of  21  information. The unit of flow control refers to the smallest unit of information whose transfer is agreed between the sender and the receiver through exchange of request/acknowledgement signals. 
The switching technique is mainly concerned with connecting inputs of the switch to the outputs and forward information along this path. There are different types of switching techniques namely: Circuit Switching,  Packet Switching, Virtual Cut-Through (VCT) Switching and Wormhole switching. These technique mainly distinguished by their flow control methodologies. In circuit switching, a physical path from source to destination is reserved prior to the transmission of the data. The path is held until all the data has been transmitted. The advantage of this approach is that the network bandwidth is reserved for the entire duration of the data. However, valuable resources are also tied up for the duration of the transmitted data and the set up of an end-to-end path causes unnecessary delays. In packet switching data is divided into fixed-length blocks called packets and instead of establishing a path before sending any data whenever the source has a packet to be sent it transmits and depending on the flow control mechanism each packet is routed individually.  Packet switching is  advantageous when messages are short and frequent. Unlike circuit switching, where a segment of the reserved path may be idle for a significant period of time, in packet switching a communication link is fully utilized when there are data to be transmitted. Packet switching is based on the assumption that a packet must be received in its entirety before any routing decision can be made and the packet forwarded to the destination. Now, the first few bytes, i.e., the header of the packet contain routing information that is typically available within first few cycles. The switch can start forwarding the header and subsequent data bytes as soon as the routing decision is made and the output buffer is free. The message does not even have to be buffered at the output and can cut through to the input of the next switch before the complete packet has been received at the current switch. This switching technique is called virtual cut-  22  through switching (VCT). The main drawback with this method is that if a header is blocked in a switch due to heavy traffic then the whole packet needs to be buffered at that node. Thus eventually, in heavy traffic load condition the V C T switching is similar to normal packet switching. The need of storing the whole packet in a switch in case of conventional packet switching or V C T makes the buffer requirement high in these cases. In wormhole switching, the packets are divided into fixed length flow control units (flits) and the input and output buffers are expected to store only a few flits. As a result, the buffer space requirement in the switches can be small compared to that generally required for packet switching. Thus, using a wormhole switching technique, the switches will be small and compact. Consequently wormhole switching is the solution of choice for NoCs. The first flit, i.e., header flit, of a packet contains routing information. Header flit decoding enables the switches to establish the path and subsequent flits simply follow this path in a pipelined fashion. As a result, each incoming data flit of a message packet is simply forwarded along the same output channel, as the preceding data flit and no packet reordering is required at destinations. If a certain flit faces a busy channel, subsequent flits also have to wait at their current locations. 
One drawback of this simple wormhole switching method is that the transmission of distinct messages cannot be interleaved or multiplexed over a physical channel. Messages must cross the channel in their entirety before the channel can be used by another message. This will decrease channel utilization if a flit from a given packet is blocked in a buffer. By introducing virtual channels [23] at the input and output ports we can increase channel utilization considerably. If a flit belonging to a particular packet is blocked in one of the virtual channels, then flits of alternate packets can use the other virtual channel buffers and, ultimately, the physical channel.  23  The header, data, and tail flit structures are as shown in Fig. 3.1. The first field denotes the flit type, namely header, data or tail. The second field contains the virtual channel identifier (VCID). The third field denotes the address length, which is dependent on the number of SoC IP blocks. The fourth field contains packet length information, i.e., the number of flits in the corresponding packet. The next two fields give source and destination addresses. The flit length is constant but the total number of flits in a packet will vary according to the contents of the packet length field.  ( a ) Type VCID Address Length Packet Length {Source AddressDest. Address ( b ) Type VCID  Data  Fig. 3.1: a) Header flit; b) Data and Tail flit 3.2 Switch Architecture  The cannonical architecture of a virtual-channel switch is shown in Fig. 3.2. The number of ports in a switch varies with the interconnect architecture as shown in Fig. 3.3, e.g., the switch block in BFT has 6 ports whereas in C L I C H E and Folded Torus it has 5 ports, and in SPIN it has 8 ports. However, the basic architecture of a switch port remains the same. Input buffers  Output buffers  MZZD-n  HTTTH Fig. 3. 2: Virtual-channel switch  24  BFT  Spin  C L I C H E / Folded Torus  Fig. 3. 3: Switch I/O configurations for different NoC architectures The different components of a switch port are shown in Fig. 3.4, which will be detailed in the sections to follow. It mainly consists of input/output FIFO buffers, input/output arbiters, M U X and D E M U X units, and a routing block. In order to have a considerably high throughput, we use a virtual channel switch, where each port of the switch has multiple parallel buffers [23]. req,  m=4  Input arbiter  req. req  4  gnt, gnt. gnt  t  Input virtual channels  II  II  Output virtual channels  1 1  H i l l irrmr-rmmn-  —rTTTT -[  rrm-  K=6  Fig. 3. 4: Block diagram of a switch  25  Each physical input port has more than one virtual channel, uniquely identified by its VCID. Flits may simultaneously arrive at more than one virtual channel. As a result, an arbitration mechanism is necessary to allow only one virtual channel to access a single physical port. Let there be m virtual channels corresponding to each input port; we need an m:l arbiter at the input. As an example as shown in Fig. 3.4, m is equal to 4. Similarly, flits from more than one input port may simultaneously try to access a particular output port. If k is the number of ports in a switch, then we need a (k-l):l arbiter at each output port. Once again, as shown in Fig. 3.4, k is equal to 6, which corresponds to a B F T architecture. The routing logic block determines the output port to be taken by an incoming flit. The operation of the switch consists of one or more processes depending on the nature of the flit.  
3.2 Switch Architecture

The canonical architecture of a virtual-channel switch is shown in Fig. 3.2. The number of ports in a switch varies with the interconnect architecture, as shown in Fig. 3.3; e.g., the switch block in BFT has 6 ports, whereas in CLICHE and the Folded Torus it has 5 ports, and in SPIN it has 8 ports. However, the basic architecture of a switch port remains the same.

Fig. 3.2: Virtual-channel switch

Fig. 3.3: Switch I/O configurations for different NoC architectures

The different components of a switch port are shown in Fig. 3.4 and will be detailed in the sections to follow. A port mainly consists of input/output FIFO buffers, input/output arbiters, MUX and DEMUX units, and a routing block. In order to achieve a considerably high throughput, we use a virtual-channel switch, where each port of the switch has multiple parallel buffers [23].

Fig. 3.4: Block diagram of a switch (m = 4 virtual channels per port, k = 6 ports)

Each physical input port has more than one virtual channel, uniquely identified by its VCID. Flits may simultaneously arrive at more than one virtual channel. As a result, an arbitration mechanism is necessary to allow only one virtual channel to access a single physical port. If there are m virtual channels corresponding to each input port, we need an m:1 arbiter at the input. In the example shown in Fig. 3.4, m is equal to 4. Similarly, flits from more than one input port may simultaneously try to access a particular output port. If k is the number of ports in a switch, then we need a (k-1):1 arbiter at each output port. Again, as shown in Fig. 3.4, k is equal to 6, which corresponds to a BFT architecture. The routing logic block determines the output port to be taken by an incoming flit.

The operation of the switch consists of one or more processes depending on the nature of the flit. In the case of a header flit, the processing sequence is: (1) input arbitration; (2) routing; and (3) output arbitration. In the case of body flits, switch traversal replaces the routing process, since the routing decision based on the header information is maintained for the subsequent body flits. The basic functionality of the input/output arbitration blocks does not vary from one architecture to another. The design of the routing hardware depends on the specific topology and routing algorithm adopted. In order to make the routing logic simple, fast and compact, we follow different forms of deterministic routing [22]. In our routing schemes we use distributed source routing, i.e., the source node determines only its neighboring nodes that are involved in message delivery. For the tree-based architectures (SPIN and BFT), the routing algorithm applied is least common ancestor (LCA) routing and, for CLICHE and the Folded Torus, we apply e-cube (dimension-order) routing [23]. The corresponding routing blocks have been implemented for all the above-mentioned cases. The principal objective in designing the switch blocks for NoC architectures is that the delay of each stage constituting the switch must fit within one clock cycle in a particular technology node. In the following subsections the different components of the switch block are described.

3.2.1 Arbiter

There are different possible arbitration mechanisms, such as round-robin, queuing, time-division multiple access (TDMA), and matrix arbitration. In the design of the switch blocks the matrix arbitration methodology is adopted, as it is the most generic one. The arbiter circuit essentially consists of a priority matrix [34], which stores the priorities pij of the requesters, and grant generation circuits used to grant (gnti) resources to requesters. The priority matrix stores the priorities between n requesters in a binary n-by-n matrix. The structure of the matrix in the case of four requesters is shown in Fig. 3.5. The priority of a requester with respect to itself has no physical significance, and hence the elements along the main diagonal of the priority matrix are void and denoted by X:

    | X   p12  p13  p14 |
    | p21 X    p23  p24 |
    | p31 p32  X    p34 |
    | p41 p42  p43  X   |

Fig. 3.5: Priority matrix

Each matrix element [i, j] records the binary priority between each pair of inputs. For example, suppose requester i has a higher priority than requester j; then the matrix element [i, j] will be set to 1, while the corresponding matrix element [j, i] will be 0. A requester will be granted the resource if no other higher-priority requester is bidding for the same resource. Once a requester succeeds in being granted a resource, its priority is updated and set to be the lowest among all requesters. As an example, consider that the status of the priority matrix is as shown in the left matrix of Fig. 3.6 and that requester 2 is granted access to the switch. After arbitration, column 2 is set to 1 and row 2 is set to 0, such that requester 2 has the lowest priority with respect to all other requesters.

Fig. 3.6: Priority matrix transition when requester 2 is granted access (column 2 is set to 1 and row 2 is set to 0)
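The priority-update rule just described is easy to capture in a few lines. The Python sketch below keeps the full n-by-n priority matrix, grants the requester that no other active, higher-priority requester beats, and then demotes the winner by clearing its row and setting its column, exactly as in Fig. 3.6. It is a behavioural model for illustration only, not the gate-level circuit developed in this chapter, and its initial priority ordering is an arbitrary assumption.

```python
class MatrixArbiter:
    """Behavioural model of a matrix arbiter (illustrative sketch)."""

    def __init__(self, n: int):
        self.n = n
        # p[i][j] == 1 means requester i currently has priority over requester j.
        # Initial ordering: lower index wins (arbitrary assumption).
        self.p = [[1 if i < j else 0 for j in range(n)] for i in range(n)]

    def grant(self, req):
        """req[i] is 1 if requester i is bidding; returns the granted index or None."""
        for i in range(self.n):
            if not req[i]:
                continue
            # Grant i if every other active requester j concedes priority to i.
            if all(not req[j] or self.p[i][j] for j in range(self.n) if j != i):
                self._demote(i)
                return i
        return None

    def _demote(self, i: int):
        """Make requester i the lowest priority: clear row i, set column i."""
        for j in range(self.n):
            if j != i:
                self.p[i][j] = 0
                self.p[j][i] = 1

if __name__ == "__main__":
    arb = MatrixArbiter(4)
    print(arb.grant([1, 1, 0, 1]))   # 0 wins first under the assumed initial order
    print(arb.grant([1, 1, 0, 1]))   # 0 has been demoted, so 1 wins next
```

In hardware the same winner-selection condition is evaluated for all requesters in parallel; the grant equations given next are precisely that parallel form for n = 4.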
The logic equations to express the value of grant signals are given as follows: gnt, = req (req + p \req + p \req + p ) x  2  3  l2  X3  A  lA  gnt = req [req + p \req + p \req + p ) 2  2  x  X2  3  23  A  2A  gnt = req (req + p \req + p \req + p ) 3  3  x  n  gnt = req {req + p \req  2  A  A  x  l4  2  23  A  34  + p \req + p ) 2A  3  3A  Applying De Morgan's law to the equations above, the gate level circuit for a grant signal is shown below:  Fig. 3. 7: Logic circuit to generate granti signal The rest of the grant signals are generated similarly to Fig. 3.7. The block diagram of the matrix arbiter and the circuit diagram to implement one element of the priority matrix are shown in Fig 3.8.  28  (a)  (b)  Fig. 3. 8: (a) Block diagram of an arbiter; (b) one element of the priority matrix 3.2.2 Routing Block  The hardware implementation of the routing block depends on the specific routing algorithm adopted. In the case of BFT and SPIN, the L C A (least common ancestor) based [23] [35] routing algorithm is followed. In this case, the first step in the implementation of the routing logic involves the bit-wise comparison (XOR) of the source and destination addresses taking the most significant, i.e, M= (log2N - 21) bits, where N is the number of functional IP blocks in the system and / denotes the level number of the switch. Subsequently, the result of the comparison is checked, i.e., whether any " 1 " results from the bitwise X O R operation. The basic structure of the hardware block implementing the L C A algorithm is shown in Fig. 3.9 (b).  Fig. 3. 9 (a) L C A determination  b) Block diagram of the L C A routing circuit  29  The routing algorithm adopted for the CLICHE, and the Folded Torus is the e-cube dimension order routing [23]. Here, each address is divided into two groups representing the X and Y coordinates. At each node, the X and Y parts of the destination addresses are compared with the corresponding parts of the current nodes to determine the next node in the path. This routing algorithm routes packets by crossing dimensions in strictly increasing or decreasing order, reducing to zero the offset in one dimension before routing in the next one. The hardware implementation of the e-cube algorithm is as shown in Fig. 3.10 (b).  Fig. 3.10: (a) e-cube routing  (b) Block diagram of the e-cube routing circuit  The final stage of the routing block in the case of L C A or e-cube routing will have a multiinput NOR gate. The circuit diagram of the final stage of the routing block, implementing L C A algorithm is shown in Fig. 3.11. The multi-input big NOR gate is not feasible for CMOS implementation due to its large delay, and the solution is to replace it with a tree of NOR and inverter pairs [36]. The depth of the tree grows logarithmically with the number of inputs of NOR gate, which helps in reducing the delay of the equivalent NOR gate. A 6-inputs NOR gate can be replaced by a two-levels tree of NOR/inverter pairs, as shown in Fig. 3.12.  30  Source address  Destination address  L  Fig. 3.11: Circuit of L C A routing block  Fig. 3.12: Tree of NOR gates replacing the 6-input NOR gate 3.2.3. Multi-destination Routing  The above routing schemes are specifically meant to handle the unicasting scenario, where there is one particular destination for the message to be routed. The routing block needs to be modified in case of multicasting, where the message needs to be routed to more than one destination at the same time. 
The multicast mechanism needs to be incorporated mainly for testing the switch blocks for manufacturing faults. The NoC infrastructure has to be progressively used for testing its own components in a recursive manner, i.e., the good, already tested NoC switches are used to transport test patterns to the untested switches. In this scenario, multicasting of test data is very useful in reducing the overall time needed to test the infrastructure [57]. The header of a multi-destination message must carry all the destination node addresses. The header information is an overhead to the system, increasing message latency and reducing effective network bandwidth. A good multi-address encoding scheme should minimize the message header length and also the header processing time. To route a multi-destination message, a switch must be equipped with a method for determining the output ports to which a multicast message must be simultaneously forwarded. The multi-destination worm header encodes information that allows the switch to determine the destination ports. One possible way to multicast is simply to unicast multiple times, but this implies a very high latency. The all-destination encoding is another simple scheme, in which all destination addresses are carried by the header. This encoding scheme has two advantages. First, the same routing hardware used for unicast messages can be used for multi-destination messages. Second, the message header can be processed on the fly as address flits arrive. The main problem with this scheme is that if the number of IP blocks in the system increases, the header length will increase accordingly and cause significant overhead. One form of header encoding that accomplishes multicast to arbitrary destination sets in a single communication phase and also limits the size of the header is known as bit-string encoding [37]. A multi-destination worm with a bit-string encoded header carries an encoding of the destinations. The encoding consists of N bits, where N is the number of IP blocks, with a '1' bit in the i-th position indicating that IP i is a multicast destination, as shown in Fig. 3.13. To decode a bit-string encoded header, a switch must possess knowledge of the IP blocks reachable through each of its output ports.

Fig. 3.13: Multi-destination routing using bit-string encoding

This reachability information can be encoded using a similar N-bit string for each output port, with '1' bits denoting the IPs reachable via this output port. When a multi-destination worm with a bit-string encoded header arrives at a switch, the switch compares the bit-string in the header with the reachability information associated with each of its output ports. This allows the determination of the output ports required by the worm. One compelling feature of this bit-string encoding mechanism is the associated ease of implementation. There are, however, some drawbacks. One is that a switch will usually need to buffer the entire bit-string in order to make a routing decision and will therefore consume more silicon area compared to that required for the unicasting case. Moreover, address decoding cannot be done with the same routing hardware as for unicast messages. Finally, the length of the string will depend on the number of IP blocks to be integrated, and will hence limit the system scalability.
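As a concrete illustration of the scheme, the sketch below is a small software model (not the switch hardware) of bit-string decoding: the N-bit destination string is compared against an N-bit reachability string per output port to decide where the multicast worm must be forwarded. The system size, port count and reachability values are assumptions chosen for the example.

```python
# Illustrative model of bit-string multicast decoding at a switch.
# destinations[i] == 1 means IP block i is a multicast destination.
# reachable[k][i] == 1 means IP block i can be reached through output port k.

def multicast_ports(destinations, reachable):
    """Return the list of output ports on which the worm must be forwarded."""
    ports = []
    for k, reach in enumerate(reachable):
        # Port k is needed if at least one destination is reachable through it.
        if any(d and r for d, r in zip(destinations, reach)):
            ports.append(k)
    return ports

# Example for a hypothetical 8-IP system and a 4-port switch.
destinations = [0, 1, 0, 0, 1, 0, 0, 1]      # IPs 1, 4 and 7 are destinations
reachable = [
    [1, 1, 0, 0, 0, 0, 0, 0],                # port 0 reaches IPs 0-1
    [0, 0, 1, 1, 0, 0, 0, 0],                # port 1 reaches IPs 2-3
    [0, 0, 0, 0, 1, 1, 0, 0],                # port 2 reaches IPs 4-5
    [0, 0, 0, 0, 0, 0, 1, 1],                # port 3 reaches IPs 6-7
]
print(multicast_ports(destinations, reachable))   # -> [0, 2, 3]
```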
Despite the latter drawbacks, here we still use this bit-string encoding mechanism to achieve multicasting in a SoC environment for simple hardware implementation. 3.2.3 Switch Traversal  Routing decisions are made once the header flit reaches the switch and subsequent body flits just follow the same path. Consequently, from Fig 3.4, the switch traversal process involves a chain of multiplexers and demultiplexers as shown in Fig. 3.14.  33  Incoming Flits  Outgoing Flits  Fig. 3.14: Switch traversal circuit 3.2.4 FIFO Buffers  The FIFO buffers are also critical components of the switch. Their operating speed should be high enough not to become a bottleneck in a high-speed network. More specifically, the switches need to be interfaced with the S o C s constituent IP blocks. Hence, the switches should be able to receive and transmit data at the rated speed of the corresponding IPs. Furthermore, the FIFOs should be able to operate with different read and write clocks as the S o C s constituents IPs are expected to generally operate at different frequencies. Buffers can be implemented as either SRAMs or shift registers. Some on-chip networks, like the Raw microprocessor [38], use shift registers to implement the FIFO buffers due to the less demanding buffer space requirement of static networks [38]. We adopted a shift registerbased low-latency, high-throughput FIFO design which robustly accommodates mixed-clock systems [39] as shown in Fig. 3.15. Instead of using separate counters to implement read and write pointers, two tokens are circulated among the FIFO cells to implement read and write operations [39]. A FIFO cell can be read from or written into only if it holds the corresponding token. After a token is used in a given cell, it is subsequently passed on to the adjacent cell.  34  Data_Put  Cell  [1 Data_Get  Fig. 3. 15: The FIFO Buffer 3.3 Network Interfacing  The success of the NoC design paradigm relies greatly on the standardization of the interfaces between IP cores and the interconnection fabric. The Open Core Protocol (OCP) [18] is a plug and play interface standard receiving a wide industrial and academic acceptance. Using a standard interface should not impact the methodologies for IP core development. In fact,  IP cores wrapped with a standard  interface like the OCP will exhibit a higher reusability and greatly simplify the task of system integration [2]. As shown in Fig. 3.16, for a core having both master and slave interfaces, the OCP compliant signals of the functional IP blocks are packetized by a second interface, which sits between the OCP instances and the communication fabric. The network interface has two functions: 1) injecting/absorbing the flits leaving/arriving at the functional/storage blocks; 2) packetizing/depacketizing the signals coming from/reaching to OCP compatible cores in form of messages/flits.  NETWORK  CORE  FABRIC  0- >  85 OCP Interface : Network Interface  Fig. 3.16: Interfacing of IP cores with the network fabric  35  A l l OCP signals are unidirectional and synchronous, simplifying core implementation, integration and timing analysis. The OCP defines a point-to-point interface between two communicating entities, such as the IP core and the communication medium. One entity acts as the master of the O C P instance, and the other as the slave. OCP unifies all inter-core communications, including dataflow, sideband control and test-specific signals. 
The dataflow signals are divided into basic signals, simple extensions, burst extensions and thread extensions. Optional signals can be configured to support additional core communication requirements. All sideband control and test signals are optional. The protocol consists of two basic commands, Read and Write, and their extensions. The detailed OCP specification can be found at www.ocpip.org 3.4 Summary  The switch blocks are integral parts of an NoC infrastructure. Their design depends on the adopted routing schemes, and interconnect architectures. In this thesis, a consistent design methodology for the switch blocks is followed  irrespective  of the  interconnect  infrastructures.  Consequently,  the  characterization of the NoC communication fabric will not depend on the specifics of the switch design. The design of the switch blocks affects the characteristics of the interconnect infrastructure in different ways. It contributes to the overall power consumption and silicon area overhead of the network fabric. It also helps the system integrator to study the timing characteristics of the communication medium. These issues are discussed in detail in the subsequent chapters.  36  Chapter 4: T i m i n g Analysis of N o C Interconnect Architectures  The key element of the NoC platforms is the nature of the communication fabric. One of the design issues arising out of the NoC paradigm is to constrain the delay of each of the pipeline units constituting the whole fabric to a limit of one clock cycle. Both switch and interconnect delay must adhere to this constraint. The definition of a clock cycle is, of course, based on the required performance of a digital circuit, but this value changes from design to design, and from technology to technology. One way to normalize the clock cycle is to represent it in terms of a technology-independent value that still provides information about the speed of the design. The fanout-of-four (F04) delay metric serves exactly this purpose. It is defined as the delay of one inverter driving four identical ones as shown in Fig. 4.1(a). For specific values of the supply and threshold voltages of the transistors, the F 0 4 delay is fixed for a given technology. For example, for a 90nm technology with V d = 1 V and V = 0 . 3 V , the F 0 4 delay D  T  is approximately 40ps, as is shown in Fig.4.1(a). If the cycle time through the combinational logic between two flip-flops (including flip-flop overhead) is given as lOOOps in a 90nm technology, it would be represented by 2 5 F 0 4 delay units, as illustrated in Fig. 4.1(b). A design with a higher clock frequency would, of course, have fewer than 2 5 F 0 4 delays between clock edges.  (a)  (b)  Fig. 4 . 1 : Illustration of the F 0 4 metric  37  Clock frequencies of high-performance microprocessors have improved by nearly 40% annually over the last decade. This increase in clock frequency has come from technology scaling and deeper pipelining of designs. The clock period scaling trend is summarized in Fig. 4.2 following data published by ITRS [8]. Around 1985, designs had roughly 100 F 0 4 delays per cycle. Since that time, the number has been steadily declining in value, primarily due to pipelining the computation, and has recently exhibited a tendency towards saturation. In accordance with Fig. 4.2 and ITRS [8], a generally accepted rule of thumb is that the clock cycle of high performance SoCs will saturate at a value between 1 0 - 1 5 F04 delay units. 
This is a reasonable target saturation level since flip-flops incur overhead in the range of 4-5 F04 units, leaving only about 10 F04 units to carry out all necessary logic functions. It is difficult to imagine clock cycles significantly below this level. In fact, there appears to be some tendency to return to slightly longer clock periods to address problems of power consumption.

(Figure: clock period, in F04 delay units, versus year from 1985 to 2005.)

Fig. 4.2: Clock period trend [8]

We will show that even this clock cycle limit is achievable by proper design of the switch elements and inter-switch wire segments of the NoC architecture. This makes communication pipelining with a pre-specified clock rate achievable regardless of the system size. On the other hand, achieving this in a bus-based approach is difficult. Hence, with a network on chip (NoC) architecture, the SoC inter-block communication fabric can be designed and optimized independently from the specific constituent blocks (i.e., processing elements).

4.1 Achievable Clock Cycle in a Bus Segment

We now return to the bus-based SoC and develop a simplified model to analyze how system size affects the achievable clock cycle. In this situation, multiple IP blocks share the same transmission media. As the number of connected IP blocks increases, the capacitance attached to the bus wires increases correspondingly. For ease of analysis (but without loss of generality), we assume this extra capacitance to be evenly distributed along the wire and model it as a parasitic capacitance. This negatively impacts propagation delay and, ultimately, the achievable clock cycle. Viewed in another way, this limits the number of IP blocks that can be connected to the bus and, thereby, the system scalability. As many existing on-chip buses are multiplexer-based [14][15][16], they are basically unidirectional and can therefore easily be buffered. Attaching IP blocks to a bus adds an equivalent parasitic capacitance of CIP per unit length of wire to an existing capacitance of Cw per unit length due to the wire itself. As a result, the driving capability of the bus is negatively affected, and buffer insertion is required to accommodate a number of IPs beyond a certain threshold while still satisfying a propagation delay of one clock cycle. If a bus wire is divided into Nbus segments, then each wire segment will have a capacitance of (Cw + CIP) per unit length. The configuration of the wire with inserted buffers (repeaters) is shown in Fig. 4.3.

(Figure: a bus wire of length Lbus divided into Nbus buffered segments, each segment carrying a distributed capacitance of (Cw + CIP) per unit length.)

Fig. 4.3: A buffered bus-wire segment

We now derive the delay equation for this situation. First, without any parasitic wire and IP capacitance and resistance, the total delay Dinv would simply be:

Dinv = Nbus · tinv = Nbus · (CG + CJ) · Reqn    (4.1)

where CG + CJ is the total gate and junction capacitance of the inverter, respectively, and Reqn is the equivalent driving resistance of the inverter. Second, if we consider the wire and parasitic capacitance by themselves, we would represent it as a lumped π-model and derive the delay Dparasitics as:

Dparasitics = 0.4 · Rw · (Cw + CIP) · Lbus² / Nbus    (4.2)

where Rw is the wire resistance per unit length and Lbus is the total length of the bus.
Finally, when the inverter and parasitics are combined in each stage, we obtain a delay using the Elmore model [50] that is the sum of the above equations, plus an additional term representing the interactions between buffer and wire. Therefore, the delay of the buffered bus wire, Dbuffered, can be computed according to the following equation:

Dbuffered = Nbus · tinv + (CG · Rw · m + (Cw + CIP) · Reqn / m) · Lbus + 0.4 · Rw · (Cw + CIP) · Lbus² / Nbus    (4.3)

where tinv is the delay of an inverter sized for equal rise and fall propagation delays, m is the size of the inverters, CG is the gate capacitance of the minimum size inverter, Reqn is the resistance of the n-type diffusion region in Ω/□, Rw and Cw are the resistance and capacitance per unit length of the wire, respectively, and Lbus is the bus length. The values of m and Nbus that minimize Dbuffered have been used in subsequent calculations. Consequently, the minimum achievable clock period is Dbuffered (equivalently, the maximum clock frequency is 1/Dbuffered). This approach of delay modelling optimizes the delay along a bus segment assuming a uniformly distributed load. However, if the IP blocks load the bus wires non-uniformly, then the buffers can have different sizes and spacings, depending on the specific parasitics distribution. Using this equation, it is possible to study the variation of the clock cycle as a function of the value of CIP and establish limits on the performance of the bus-based approach. The particular case of the 130 nm technology node for a fixed bus length is shown in Fig. 4.4. From Fig. 4.4, it is evident that as CIP increases beyond a certain value, the clock cycle exceeds the limit of 15 F04 delay units. In this respect, the threshold value of CIP can be considered as a metric for the scalability of a bus-based system, as it relates to how many IP blocks can be attached to a bus before the delay exceeds one clock cycle. However, due to the heterogeneous nature of the constituent IP cores in a SoC (DSPs, MPEG decoders, memories, etc.), it is not possible to quantify, a priori, the number of IPs that can be connected to a bus segment. Rather, by knowing CIP and the types of IPs that need to be integrated for a particular application, we will be able to determine whether the clock cycle is achievable.

Fig. 4.4: Variation of delay with parasitic capacitance CIP for a fixed bus length

Fig. 4.5: Variation of delay with bus length for a fixed CIP

One way to deal with this scalability problem is to split the bus into smaller segments. A bus of shorter length can support more parasitic capacitance, CIP, arising from attached IP blocks (assuming that the bus can physically accommodate them). Fig. 4.5 shows the delay along a bus segment as a function of Lbus for a 130 nm technology node, but here we assume a fixed value of CIP. Again, beyond a certain limit of CIP, a 15 F04 cycle time cannot be achieved. In reality, the two graphs of Figs. 4.4 and 4.5 should be combined to provide a three-dimensional view of the delay characteristics. In Fig. 4.6, we show, for illustrative purposes, the delay characteristics as a function of both Lbus and CIP for the same technology node (130 nm). From this figure, the trend is that more parasitic capacitance CIP can be supported by reducing the length of the bus. On the other hand, for any bus length, CIP will be constrained by the target 15 F04 limit.
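To make the preceding discussion concrete, the following sketch evaluates Eq. (4.3) with near-optimal buffer size and segment count and reports the resulting delay in F04 units for a few values of CIP. The electrical constants are illustrative assumptions (not the thesis' exact 130 nm parameters), so the absolute numbers are indicative only.

```python
import math

# Assumed, illustrative electrical parameters (roughly 130 nm-like).
t_inv = 11.05e-12   # s, delay of a minimum inverter driving an identical one
FO4   = 5 * t_inv   # s, fan-out-of-4 delay
R_eqn = 2.0e3       # ohm, equivalent drive resistance of a minimum inverter
C_G   = 1.0e-15     # F, gate capacitance of a minimum inverter
R_w   = 0.06e6      # ohm/m (0.06 ohm/um)
C_w   = 0.30e-9     # F/m   (0.30 fF/um)

def buffered_bus_delay(L_bus, C_ip, m, N_bus):
    """Eq. (4.3): delay of a bus of length L_bus split into N_bus buffered
    segments, with C_ip of IP parasitic capacitance per unit length."""
    C_tot = C_w + C_ip
    return (N_bus * t_inv
            + (C_G * R_w * m + C_tot * R_eqn / m) * L_bus
            + 0.4 * R_w * C_tot * L_bus ** 2 / N_bus)

def best_delay(L_bus, C_ip):
    """Delay with (approximately) optimal buffer size m and segment count N."""
    C_tot = C_w + C_ip
    m = math.sqrt(C_tot * R_eqn / (C_G * R_w))
    n = max(1, round(L_bus * math.sqrt(0.4 * R_w * C_tot / t_inv)))
    return buffered_bus_delay(L_bus, C_ip, m, n)

for C_ip in (5e-9, 10e-9, 20e-9):          # 5, 10, 20 fF/um of attached IPs
    d = best_delay(10e-3, C_ip)            # 10 mm bus
    print(f"C_IP = {C_ip * 1e9:4.1f} fF/um -> {d / FO4:5.1f} FO4")
```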
As an example, when the bus length is 6 mm, then the allowable parasitic capacitance is 20 fF/u,m; when the bus length is increased to 10 mm then the parasitic capacitance reduces to 8 fF/Lim to keep the delay along the wire segment within the specified limit of 15 F04. Based on this analysis, a single conventional shared bus-based system will not easily be able to achieve the clock cycle predicted in rTRS in high performance MP-SoCs. The relatively long bus needs to be split into smaller components. The lengths of each section of the shorter buses can be designed such  42  that they are able to individually support a clock cycle of 15F04. The multiple, relatively shorter, buses can be integrated using repeaters or bridges.  Fig. 4. 6: Variation of delay with CIP and bus wire length For larger SoCs, the use of multiple hierarchical bus-based systems is recommended and, ultimately, it can be viewed as a form of network on chip. However, unless they are constrained to have certain topological characteristics a priori, such bus-based networks will vary widely (depending on the specific SoC) and will therefore possess unpredictable characteristics. For more robust and scalable solutions to this problem, we propose that a design methodology for a structured network be imposed.  4.2 Communication Pipelining in NoCs  A common characteristic of all the regular NoC architectures is that the whole communication medium can be divided into accurately predictable stages. The structured inter-switch wire segments together with the switches establish a highly pipelined communication infrastructure as shown in Fig. 4.7. In terms of interconnect wire delay, the NoC architectures offer the advantage that the wires between  43  IP blocks and between switches are logically structured. Therefore, their lengths and delays are largely predictable and consistent across the entire network. This is a compelling feature of any structured approach to design. The switches (I Ps) required in each of the network architecture solutions mentioned previously 2  consist of multiple stages. We can differentiate between two types of delays, namely inter- and intraswitch. Through detailed circuit level design and analysis, we can constrain the delay of each intra-switch pipelined stage to be within the ITRS suggested limit of 15 F 0 4 delay units. By doing this, and ensuring that wire delays also comply with this target, we can guarantee that the communication network will operate at any reasonable clock frequency. I  I _..  H e a d e r Flits  Q-Functional IP (embedded processor) I-Infrastructure IP (switch)  Data/Tail Flits  I (  input arbitration  input  I r o u t (  I  , arbitration |  '  .. 9  n  switch traversal  I  I  I  output  j arbitration  (  output  '  , arbitration  Fig. 4. 7: Pipelined data transfer 4.2.1 Wire Delay Between Switches  Using the earlier example layouts shown for the various NoC architectures, we will now determine the longest inter-switch wire that arises for each of them under the specific placement constraints. After determining the longest inter-switch wire segments that arises in each of the topologies, we determine the delay in sending data across each such wire segment. If the delay along the longest inter-switch wire segment can be constrained to be within the assumed clock cycle limit of 15F04, then the delay along all the other shorter inter-switch wire segments will follow suit. 
The inter-switch wire lengths depend on the specific topology adopted for the SoC infrastructure and the system size. The number of IPs interconnected in a single SoC will vary from one technology node to another, and also depends on the specific architecture. We assume the distributions of IP blocks that will be discussed in section 5.6.3. The wire lengths between switches in the BFT and the SPIN architectures depend on the levels of the switches. On the other hand, the number of switch levels can be expressed as a function of system size (N) as levels = log4 N for both, where N is the number of functional IP blocks in a SoC. Following the block diagrams shown in Figs. 4.8(a) and 4.8(b), the inter-switch wire lengths in the case of BFT and SPIN are given by equation (4.4) [10]:

w(a+1,a) = √Area / 2^(levels − a)    (4.4)

Fig. 4.8: (a) BFT block diagram; (b) SPIN block diagram

where w(a+1,a) is the length of the wire spanning the distance between level a and level a+1 switches, and a can take integer values between 0 and (levels − 1). Table 4.1 shows the inter-switch wire lengths for the BFT and SPIN architectures in all the technology nodes, assuming a die size of √Area = 20 mm. In the table, X indicates that the particular inter-switch wire is not present in the given technology node.

Table 4.1: Inter-Switch wire lengths in mm (tree-based architectures)

Technology node | No. of levels | w11,10 | w10,9  | w9,8   | w8,7  | w7,6   | w6,5
130 nm          | 6             | X      | X      | X      | X     | X      | 10.000
90 nm           | 7             | X      | X      | X      | X     | 10.000 | 5.000
65 nm           | 9             | X      | X      | 10.000 | 5.000 | 2.500  | 1.250
45 nm           | 10            | X      | 10.000 | 5.000  | 2.500 | 1.250  | 0.625
32 nm           | 11            | 10.000 | 5.000  | 2.500  | 1.250 | 0.625  | 0.312

Technology node | No. of levels | w5,4  | w4,3  | w3,2  | w2,1
130 nm          | 6             | 5.000 | 2.500 | 1.250 | 0.625
90 nm           | 7             | 2.500 | 1.250 | 0.625 | 0.312
65 nm           | 9             | 0.625 | 0.312 | 0.156 | 0.078
45 nm           | 10            | 0.312 | 0.156 | 0.078 | 0.039
32 nm           | 11            | 0.156 | 0.078 | 0.039 | 0.019

From Table 4.1, the longest inter-switch wire length in the BFT and SPIN is 10 mm. In CLICHE, the inter-switch wire lengths can be determined from the following expression:

w = √Area / (√N − 1)    (4.5)

As the system size N, which is the number of functional IP blocks, varies among the technologies in CLICHE, the inter-switch wire lengths differ for the different technology nodes. However, in a specific technology node all of them are of the same length. In the Torus architecture, all the inter-switch wire lengths are the same as those for the CLICHE except for the wraparound wires, which will have a length of √Area = 20 mm. In the Folded Torus, all the inter-switch wire lengths are double those for the CLICHE architecture [28]. Table 4.2 shows the inter-switch wire lengths in the case of CLICHE, Torus and Folded Torus architectures for different ITRS technology nodes.

Table 4.2: Inter-Switch wire lengths for CLICHE, Torus and Folded Torus

Technology node | CLICHE  | Torus*  | Folded Torus
130 nm          | 1.10 mm | 1.10 mm | 2.20 mm
90 nm           | 0.73 mm | 0.73 mm | 1.46 mm
65 nm           | 0.46 mm | 0.46 mm | 0.92 mm
45 nm           | 0.27 mm | 0.27 mm | 0.54 mm
32 nm           | 0.23 mm | 0.23 mm | 0.46 mm
* With the exception of wrap-around wires

As before, we can compute the intrinsic RC delay of a wire according to the equation below [50]:

Dunbuffered = 0.4 · Rw · Cw · L²    (4.6)

where L is the wire length. In different technology nodes, the corresponding F04 delay can be estimated as 400·Lmin, where Lmin is the minimum gate length in each technology node [8]. For long wires, the intrinsic delay will easily exceed the 15F04 limit.
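This check is easy to mechanize. The sketch below evaluates the intrinsic delay of Eq. (4.6) for the longest inter-switch segments of Tables 4.1 and 4.2, using the per-node Rw, Cw and F04 values tabulated in the next subsection (Table 4.3), and flags the segments whose unbuffered delay exceeds the 15 F04 budget.

```python
# Unbuffered (intrinsic) wire delay check, Eq. (4.6): D = 0.4 * Rw * Cw * L^2.
# Rw [ohm/um], Cw [fF/um] and FO4 [ps] per technology node, from Table 4.3.
nodes = {
    "130 nm": (0.06, 0.30, 55.25),
    "90 nm":  (0.12, 0.22, 38.25),
    "65 nm":  (0.20, 0.20, 27.50),
    "45 nm":  (0.44, 0.20, 19.10),
    "32 nm":  (0.73, 0.20, 13.50),
}

# Longest inter-switch wires: 10 mm for BFT/SPIN (Table 4.1),
# and the per-node CLICHE segment lengths (Table 4.2).
cliche_len_um = {"130 nm": 1100, "90 nm": 730, "65 nm": 460,
                 "45 nm": 270, "32 nm": 230}

for node, (rw, cw, fo4) in nodes.items():
    for label, length_um in (("BFT/SPIN 10 mm", 10000.0),
                             ("CLICHE", cliche_len_um[node])):
        # 1 ohm * 1 fF = 1e-3 ps, hence the 1e-3 conversion factor.
        delay_ps = 0.4 * rw * cw * length_um ** 2 * 1e-3
        verdict = "OK" if delay_ps <= 15 * fo4 else "needs buffering"
        print(f"{node:7s} {label:15s} {delay_ps / fo4:8.1f} FO4  {verdict}")
```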
In those cases, the delay can, at best, be made to increase linearly with wire length by inserting buffers. If the wire is divided into n segments and n inverters are inserted, then the total delay of the buffered wire will be given by the following expression [50] (which is similar to the one given earlier for bus-based systems):

Dbuffered = n · tinv + (CG · Rw · m + Cw · Reqn / m) · L + 0.4 · Rw · Cw · L² / n    (4.7)

where m is the relative size of the inverters with respect to the minimum sized inverter, tinv is the delay of an inverter driving another identical inverter, CG is the gate capacitance of the minimum size inverter, Reqn is the resistance of the n-type diffusion region in Ω/□, Rw and Cw are the resistance and capacitance per unit length of the wire, respectively, and L is the wire length. The optimum number of segments can be calculated as

nopt = L · √(0.4 · Rw · Cw / tinv)    (4.8)

Rw can be calculated according to the following formula:

Rw = ρ / (T · W)    (4.9)

where ρ is the resistivity of the copper wire (here assumed to be 2.2 µΩ·cm) [6], and T and W are the wire thickness and width, respectively. Cw can be calculated according to the following equation [6]:

Cw = 2 · εd · ε0 · (1 + T/W) + Cfringe    (4.10)

where εd is the dielectric constant, ε0 is the permittivity of free space, and Cfringe is the fringing capacitance, assumed to be constant and equal to 0.04 fF/µm in all technology nodes [6]. Specific values for Rw, Cw and tinv are shown in Table 4.3 for successive technology nodes.

Table 4.3: Values of Rw, Cw, tinv and F04 in different technology nodes

Technology node | Rw [Ω/µm] | Cw [fF/µm] | tinv [ps] | F04 [ps]
130 nm          | 0.06      | 0.30       | 11.05     | 55.25
90 nm           | 0.12      | 0.22       | 7.65      | 38.25
65 nm           | 0.20      | 0.20       | 5.50      | 27.50
45 nm           | 0.44      | 0.20       | 3.82      | 19.10
32 nm           | 0.73      | 0.20       | 2.70      | 13.50

We used the values of Rw, Cw, and tinv from Table 4.3 to calculate unbuffered and buffered global wire delays in different technology nodes. Fig. 4.9 reports the buffered global wire delay, Dbuffered, versus wire length in successive technology nodes. From the plots, in the case of BFT and SPIN, the longest inter-switch wires can be driven by a clock of period 15F04 after proper buffer insertion. Our analysis also shows that most of the inter-switch wires in the case of BFT and SPIN need not be buffered [10]; the shaded entries in Table 4.1 denote the wire segments that need to be buffered. The only apparent exception occurs at the 32 nm technology node for the wire segment between levels 11 and 10 of the communication network. These wire segments are projected to be 10 mm long, whereas the maximum buffered wire length that can be driven by a clock of 15 F04 in this particular technology node is 8 mm. There are techniques available to resolve this issue, such as in [51], but undoubtedly new techniques will be developed by the time the industry reaches the 32 nm technology node. Considering the same die size, inter-switch wires for the CLICHE architecture need not be buffered. The wrap-around wires in the Torus need buffer insertion and, in some cases, their delays will exceed the limit of one clock cycle even with buffer insertion. Once again, this can be overcome through the same procedure of [51]. In the case of the Folded Torus architecture the inter-switch wires need not be buffered; their delays will always be less than 15F04.

(Figure: buffered global wire delay in ps versus buffered global wire length in mm, with one curve per technology node from 130 nm down to 32 nm.)

Fig. 4.
9: Buffered global wire delay in different technology nodes The above analysis illustrates the important feature of the NoC architectures whereby all the interswitch wire lengths and corresponding delays can be determined a priori. Moreover the parasitic capacitance due to the functional IP blocks does not load these wire segments directly. Consequently, the  49  delay along these segments can be determined independently of the processing nodes. The regular structure of NoC topologies simplifies the layout and even with first order placement solutions as depicted in Fig. 4.8, the timing constraints of high performance SoC designs can be relatively easily met. 4.2.2 Circuit Delay Through the Switches  The switches principally have three stages: input arbiter, routing (switch traversal) and output arbiter. The particular design of these blocks varies from one architecture to another. Through circuitlevel design and analysis we show that the delay in each of these stages for all the topologies under consideration can be constrained within the limit of 10-15 F04. Before R T L design and synthesis to have an estimate of the delay through the different intra-switch stages we used the method of Logical Effort (explained in Appendix 1) for hand calculation. 4.2.2.1 Delay Through the Arbiter  The arbiter circuit essentially consists of a priority matrix [12], which stores the priorities of the requesters and grants generation circuits used to grant resources to requesters. The design of the arbiter circuit was elaborated in Chapter 3. The delay of the arbiter circuit depends on the number of requesters. For the input arbiter, the number of virtual channels governs the number of requesters. The primary role of the virtual channels is to increase channel utilization so that overall throughput of the system increases [52] [53]. It is shown in Chapter 5 that for a multi-processing system under different data patterns, throughput exhibits a saturating trend if the number of virtual channels is increased beyond four. Each extra virtual channel beyond this limit does not significantly improve system throughput. Consequently in our design we set the number of virtual channels to four. Hence for all the NoC architectures we need a 4:1 arbiter at the input.  50  The output arbiter depends on the total number of ports in a switch for a particular architecture. If a switch has k ports then a (k-1): 1 arbiter is needed at each output port. Consequently for BFT and SPIN we need 5:1 and 7:1 arbiter at the output respectively and for CLICHE, and Folded Torus a 4:1 arbiter is needed at each output port. Before implementing the arbiter block we tried to have an estimate of the delay through it and for this purpose we used the method of Logical Effort. Fig. 4.10 shows the block diagram of the arbiter, consisting of the grant generation circuit and the priority matrix. The critical path of the input arbiter circuit is shown in Fig. 4.11.  =1 retf2 • reqi • reqr-i •  Grant  circuit  Fnoiiy matrix  Fig. 4.10: (a) Block diagram of an arbiter; (b) one element of the priority matrix  Csifeinart  — i -  sttetoad - j -  L  BF=3  Fig. 4.11: Critical path of the input arbiter The grant signals control a multiplexer to select a specific virtual channel. Considering an 8-bit data bus, these grant signals are the control inputs of eight multiplexers as shown in Fig. 4.12.  51  Fig. 4.12: Grant signals as control inputs of the mux. 
This will give rise to a side load capacitance equivalent to Csideload = (8 + 8/3) times the minimum size inverter input capacitance at point C, according to Fig. 4.12. From Fig. 4.10(b), it is evident that the signal uij is driving a NAND gate and an inverter, considering that the flip-flop consists of a pair of cross-coupled NAND gates. Consequently, the load capacitance at point D will be equivalent to three minimum-sized inverter gate capacitances. All the capacitances are expressed relative to the input capacitance of a minimum sized inverter. We use the notations in Table 4.4 in determining the delay.

Table 4.4: Logical Effort - Summary of parameters

Term                         | Expression
Logical Effort of a gate     | LEi
Logical Effort of a path     | LEp = Π LEi
Fan Out                      | FO = Cout / Cin
Branching Factor             | BFi
Branching Effort             | BE = Π BFi
Path Effort                  | PE = (LEp)(BE)(FO)
Stage Effort                 | SE = (PE)^(1/N), N = no. of stages in the path
Parasitic Delay of a gate    | Pi (intrinsic delay due to its own internal capacitance)
Parasitic Delay of the path  | P = Σ Pi
Delay of a path              | D = (SE)(N) + P

From Table 4.4, determination of the delay of a path is straightforward. It mainly involves determining the optimal stage effort. In the case of the input arbiter circuit as shown in Fig. 4.10, in addition to the output load Cout there is a side load at point C. Consequently, this amounts to two stage efforts, one characterizing the circuit behavior from point C to the output load, and the other from the input to point C. To determine the first one, we eliminate the side load and determine SE according to Table 4.4 as 2.8. Considering SE = 2.8 and Cload = 3, we calculate the input capacitance at point D as

CD = (LE × Cload) / SE

Considering CD as the load capacitance, the input capacitance at point C follows from the same relation, giving CC = 1.49. Consequently, in the calculation of the SE of the first 5 stages, we consider the total load capacitance as

Cload,C = Csideload + 1.49 = 12.16

Again, following Table 4.4 with a fan-out of 12.16 yields the stage effort of the first five stages to be 4.38. The parasitic delay of the path is P = Σ Pi = 13. Combining both stage efforts and the parasitic delay, we get the delay of the input arbiter as

Dinput_arbiter = 5 × 4.38 + 2 × 2.8 + 13 ≈ 40 tinv = 8.1 F04

The delay of the input multiplexer will be given as

DInput_MUX = 2 + 2 = 4 tinv = 0.8 F04

Combining the latter two, the delay of the input arbitration process is

Dinput_arbitration = Dinput_arbiter + DInput_MUX ≈ 8.9 F04

The delay of the output arbitration block for all the architectures was estimated following the same methodology described above.

4.2.2.2 Delay Through the Routing Block

The hardware implementation of the routing block depends on the specific routing algorithm adopted. In the case of BFT and SPIN, the LCA (least common ancestor) based [23] routing algorithm is followed. In this case, the first step in the implementation of the routing logic involves the bit-wise comparison (XOR) of the source and destination addresses, taking the most significant M = (log2 N − 2l) bits, where N is the number of functional IP blocks in the system and l denotes the level number of the switch. Subsequently, the result of the comparison is checked, i.e., whether any "1" results from the bitwise XOR operation. The basic structure of the hardware block implementing the LCA algorithm is shown in Chapter 3.
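Both the arbiter calculation above and the routing-block estimates that follow apply the same Table 4.4 recipe, which is easy to capture in a few lines. The sketch below is an illustrative helper, not the hand analysis itself; the example logical efforts, branching factors and parasitics are assumptions and do not reproduce the actual gate chain of Fig. 4.11.

```python
# Logical-effort path delay, following Table 4.4:
#   PE = LE_path * BE * FO,  SE = PE**(1/N),  D = N*SE + P  (in t_inv units).
from math import prod

def path_delay(logical_efforts, branching, fanout, parasitics):
    """Return (delay in t_inv units, stage effort) for a logic path."""
    n = len(logical_efforts)
    pe = prod(logical_efforts) * prod(branching) * fanout
    se = pe ** (1.0 / n)
    return n * se + sum(parasitics), se

# Assumed five-stage example driving an effective load of 12.16 C_inv.
le = [5/3, 4/3, 1.0, 4/3, 1.0]   # assumed per-stage logical efforts
bf = [1.0] * 5                   # assumed branching factors
p  = [2, 2, 1, 2, 1]             # assumed per-stage parasitic delays
d_tinv, se = path_delay(le, bf, 12.16, p)
# F04 is roughly 5 * t_inv (see Table 4.3), hence the division by 5.
print(f"stage effort ~ {se:.2f}, delay ~ {d_tinv:.1f} t_inv = {d_tinv/5:.1f} FO4")
```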
The routing algorithm adopted for CLICHE and the Folded Torus is the e-cube dimension order routing [23]. Here, the addresses of all the nodes are divided into two groups representing the X and Y coordinates. At each node, the X and Y parts of the destination addresses are compared with the corresponding parts of the current nodes to determine the next node in the path. The hardware implementation of the e-cube algorithm is as shown in Chapter 3. As an example the methodology for estimating the delay through the routing block in case of L C A is described. The first step in the implementation of the routing logic involves the comparison (XOR) of the source and destination addresses taking the most significant (M= (log2N - 21)) bits, where N is the number of functional IP blocks in the system and / denotes the level number of the switch. Subsequently, the result of the comparison is checked, i.e., whether a "1" results from the X O R operation. As a result of these two logical operations the critical path of the routing block is as shown in Fig 4.13.  Fig. 4.13: Critical path of the L C A routing block The final M-input OR gate of Fig. 4.13 is modeled as a tree of 2-input NOR gates [19]. If k is the number of levels in the NOR tree then 2 = M. The logical effort of this M- input OR tree is k  LE {M) 0R  = (LE) ^  M  N0R  =M ^  = Af  a 7  The output of the routing logic block fans out to an input demux control inputs and to the input of a 5:1 arbiter. The output load of the routing block will be equal to Ci d = (8+8/3+1)= 11.67 times the oa  minimum size inverter input capacitance and the fan out will be 11.67. Hence, the stage effort is given as  55  ^  ^=(2xM°- xll.67)3 7  r o u  and the delay of the routing block will be D  _  routing  = 3 x ( 2 x M - x l l . 6 7 ^ + (log M )(P„ 0  block  7  2  0R2  + Pj  +P +P T  1  inv ~ XOR2 1  By adding the delay corresponding to the demux to the delay of the routing block, we get the total delay associated with the routing process expressed by  ^routing  D outing_biock r  ^  routing  _ block  ^Input  _  DEMUX  will depend on M, which in turn depends on the system size. From Table 4.1, the value of M  varies from 7 (in 130 nm node) to 11 (in 32 nm node). Consequently, nm) to 6 F04 (32 nm). As a result, D  rouling  D ting_biock rou  varies from 4 F04 (130  varies from 5 F04 (130nm) to 7 F04 (32 nm).  The critical path of the routing logic block implementing e-cube routing is shown in Fig. 4.14.  (Log N}/2 2  /•  "1  Fig. 4.14: Critical path of the e-cube routing block Similar to the L C A routing block, we used the L E analysis for the circuit of Fig. 4.14 to determine the delay through it. 4.2.2.3 Delay Incurred in Switch Traversal  Routing decisions are made once the header flit reaches the switch and subsequent body flits just follow the same path. Consequently, as shown in Chapter 3 the switch traversal process involves a chain of multiplexers and demultiplexers.  56  For the switch traversal process the delay is computed considering the chain of four input and output muxes and demuxes. The output of the final demux drives the latches of the virtual channels. Considering that the latches consist of a pair of cross-coupled N A N D gates, the load capacitance is equivalent to two minimum-sized inverter gate capacitances, and hence, the fan out will be 2. Following the same method as in the case of the input arbiter we get the stage effort (SE) of this mux-demux chain to be 2.38. 
Finally, the delay of the switch traversal process can be expressed as:

Dswitch_traversal = 4 × SE + PInput_MUX + POutput_MUX + PInput_DEMUX + POutput_DEMUX = 17.5 tinv ≈ 3.5 F04

4.3 Experimental Results

We developed VHDL models for the complete switches in all the topologies mentioned above and synthesized them using Synopsys' synthesis tool in a CMOS 0.13 µm standard-cell based technology. We used Synopsys™ Prime Time to determine the delay along the critical path in all the building blocks of the switches. The results are shown in Table 4.5. To have a technology-independent measure of the delays, we also converted the absolute values obtained from the Prime Time timing analysis tool to F04 delay units.

Table 4.5: Delay through the switches, determined from Synopsys™ Prime Time

Delay of the intra-switch pipelined stages, given in [ps] / F04 units:

NoC Architecture | Input Arbitration (ti.a.) | Routing (tr) | Switch Traversal (ts.t.) | Output Arbitration (to.a.)
SPIN             | 500 / 9                   | 360 / 6.5    | 310 / 5.6                | 608 / 11
CLICHE           | 500 / 9                   | 331 / 6      | 221 / 4                  | 500 / 9
Torus            | 500 / 9                   | 331 / 6      | 221 / 4                  | 500 / 9
Folded Torus     | 500 / 9                   | 331 / 6      | 221 / 4                  | 500 / 9
BFT              | 500 / 9                   | 276 / 5      | 275 / 5                  | 555 / 10

Due to the fact that there are four virtual channels at the inputs, irrespective of the NoC architecture, the delay of the input arbitration block has the same value for all of them. Furthermore, in the case of CLICHE, Torus and Folded Torus, the number of ports in a switch is identical, and thus, the delays of the output arbitration block are equal. These results indicate that the delay associated with each stage of operation of the switch is well within the ITRS-suggested limit of 15F04 and can therefore be driven by a clock with a period of 15F04. The structured inter-switch wires and the processes underlying the switch operations yield four types of pipelined stages. Together, these four types characterize the functionality of the communication fabric. This is illustrated in Fig. 4.15. Through careful design and analysis, we have shown how to constrain the delays of each of these stages such that they are bounded by the clock period limits suggested by ITRS [8] for high performance multi-core SoC platforms.

(Figure: the pipelined data path with per-stage delay constraints; the input arbitration, routing, switch traversal, output arbitration and inter-switch wire delays ti.a., tr, ts.t., to.a. and tw are each below 15 F04.)

Fig. 4.15: Delays associated with the pipelined stages on the data path

4.4 Summary

Multi-core SoC platforms are emerging as the trend for future SoC designs. A single monolithic bus-based architecture cannot generally meet the clock cycle requirements of these systems. One solution is to simply split up the single bus into a network of interconnected smaller buses. Ad-hoc solutions tend to constitute locally optimized solutions that are not easily automated, portable, or scalable. The NoC design paradigm overcomes this clock cycle limitation problem based on a highly structured architecture.

In this chapter, we have presented compelling arguments for the adoption of structured NoC-based topologies as interconnect fabric, as they greatly simplify the design process and provide predictable timing characteristics. A complete SoC interconnection network can be constructed by dividing the communication medium into multiple pipeline stages.
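As a quick cross-check of the per-stage numbers, the short sketch below recomputes the F04 figures of Table 4.5 from the synthesized picosecond delays (using the 55.25 ps F04 value of the 0.13 µm node from Table 4.3) and verifies that every stage fits the 15 F04 budget.

```python
# Per-stage switch delays in ps at the 0.13 um node (Table 4.5),
# checked against the 15 FO4 clock-cycle budget.
FO4_PS = 55.25   # ps per FO4 at 130 nm (Table 4.3)
BUDGET = 15      # FO4 units per clock cycle

stages = {  # architecture: (input arb., routing, switch traversal, output arb.)
    "SPIN":         (500, 360, 310, 608),
    "CLICHE":       (500, 331, 221, 500),
    "Torus":        (500, 331, 221, 500),
    "Folded Torus": (500, 331, 221, 500),
    "BFT":          (500, 276, 275, 555),
}

for arch, delays in stages.items():
    fo4 = [d / FO4_PS for d in delays]
    ok = all(f <= BUDGET for f in fo4)
    print(f"{arch:13s} " + " ".join(f"{f:4.1f}" for f in fo4)
          + ("  within 15 FO4" if ok else "  EXCEEDS budget"))
```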
From detailed circuit level design and analysis, we demonstrated that for all the NoC topologies considered here, these stages can be clocked with a minimum feasible time period of 15 F 0 4 delay units irrespective of the system size. It can be argued that this clock cycle constraint can also be achieved in a hierarchical bus-based system. However multiple design iterations may be required due to the fact that the IP blocks directly affect the achievable clock cycle by capacitively loading the communication fabric. On the other hand the clock cycle requirement can be met independently of the IP blocks in NoC architectures. This would  59  allow a decoupling of the design and optimization of the communication fabrics and the functional IP blocks. In effect, we have demonstrated how network on chip (NoC) types of interconnect architectures overcome the inherent non-scalability associated with traditional bus-based systems. Any perceived constraints of the structured architectures are well outweighed by the performance benefits and other portability, scalability, and design automation benefits.  60  Chapter 5 : Performance Evaluation and Design Trade-offs As mentioned in Chapter 2 many different possible NoC interconnect architectures have been proposed by various research groups. However, prior to adopting any particular architecture as an NoC fabric it needs to be evaluated in terms of different relevant parameters. Till now the adoption of an interconnect fabric into SoC domain was ad-hoc, without following any benchmarking methodology. This work tries to bridge the gap between theoretical possibility and practical realizations of different NoC interconnect architectures. To compare and contrast different NoC architectures, a standard set of performance metrics can be used [23] [40]. For example, it is desirable that MP-SoC interconnect architectures exhibit high throughput, low latency, energy efficiency and low area overhead. In today's power constrained environments, it is increasingly critical to be able to identify the most energy efficient architectures, and to be able to quantify the energy-performance tradeoffs [3]. Generally, the additional area overhead due to the infrastructure IPs should be reasonably small. We now describe these metrics in more detail. 5.1 Message Throughput Typically, the performance of a digital communication network is characterized by its bandwidth in bits/sec. However, we are more concerned here with the rate that message traffic can be sent across the network and so throughput is a more appropriate metric. Throughput can be defined in a variety of different ways depending on the specifics of the implementation. For message passing systems, we can define message throughput, TP, as follows: (Total messages completed)x(Message length) (Number of IP blocks) x (Total time)  61  (5 1)  where Total messages completed refers to the number of whole messages that successfully arrive at their destination IPs, Message length is measured in flits, Number of IP blocks is the number of functional IP blocks involved in the communication, and Total time is the time (in clock cycles) that elapses between the occurrence of the first message generation and the last message reception. Thus, message throughput is measured as the fraction of the maximum load that the network is capable of physically handling. A n overall throughput of TP = 1 corresponds to all end nodes receiving one flit every cycle. Accordingly, throughput is measured in flits/cycle/IP. 
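For concreteness, Eq. (5.1) maps directly onto simulation counters, as in the short sketch below; the function and the example numbers are illustrative and are not part of the simulator described later.

```python
def throughput(total_messages_completed, message_length_flits,
               num_ip_blocks, total_time_cycles):
    """Eq. (5.1): accepted traffic in flits/cycle/IP."""
    return (total_messages_completed * message_length_flits) / \
           (num_ip_blocks * total_time_cycles)

# Example: 48,000 messages of 64 flits delivered among 256 IPs in 20,000 cycles.
tp = throughput(48_000, 64, 256, 20_000)
print(f"TP = {tp:.3f} flits/cycle/IP")   # -> TP = 0.600 flits/cycle/IP
```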
Throughput signifies the maximum value of the accepted traffic and it is related to the peak data rate sustainable by the system.

5.2 Transport Latency

Transport latency is defined as the time (in clock cycles) that elapses from the occurrence of a message header injection into the network at the source node to the occurrence of a tail flit reception at the destination node [23]. We refer to this simply as latency in the remainder of this thesis. In order to reach the destination node from some starting source node, flits must travel through a path consisting of a set of switches and interconnect, called stages. Depending on the source/destination pair and the routing algorithm, each message may have a different latency. There is also some overhead in the source and destination that also contributes to the overall latency. Therefore, for a given message i, the latency Li is:

Li = sender overhead + transport latency + receiver overhead

We use the average latency as a performance metric in our evaluation methodology. Let P be the total number of messages reaching their destination IPs, and let Li be the latency of each message i, where i ranges from 1 to P. The average latency, Lavg, is then calculated according to the following:

Lavg = ( Σ(i=1..P) Li ) / P    (5.2)

5.3 Communication Energy

When flits travel on the interconnection network, both the inter-switch wires and the logic gates in the switches toggle and this will result in energy dissipation. Here we are concerned with the dynamic energy dissipation caused by the communication process in the network. The flits from the source nodes need to traverse multiple hops consisting of switches and wires to reach their destinations. Consequently, we determine the energy dissipated by the flits in each interconnect and switch hop. The energy per flit per hop is given by

Ehop = Eswitch + Einterconnect    (5.3)

where Eswitch and Einterconnect depend on the total capacitances and signal activity of the switch and of each section of interconnect wire, respectively. They are determined as follows:

Eswitch = αswitch · Cswitch · VDD²    (5.4)

Einterconnect = αinterconnect · Cinterconnect · VDD²    (5.5)

where αswitch, αinterconnect and Cswitch, Cinterconnect are the signal activities and the total capacitances of the switches and wire segments, respectively. The energy dissipated in transporting a packet consisting of n flits over h hops can be calculated as

Epacket = n · Σ(j=1..h) Ehop,j    (5.6)

Let P be the total number of packets transported, and let Epacket,i be the energy dissipated by the i-th packet, where i ranges from 1 to P. The average energy per packet, Epacket, is then calculated according to the following equation:

Epacket = ( Σ(i=1..P) Epacket,i ) / P = (1/P) · Σ(i=1..P) ( n · Σ(j=1..h) Ehop,j )    (5.7)

The parameters αswitch and αinterconnect are those that capture the fact that the signal activities in the switches and the interconnect segments will be data-dependent, e.g., there may be long sequences of 1's or 0's that will not cause any transitions. Any of the different low-power coding techniques [41] aimed at minimizing the number of transitions can be applied to any of the topologies described here. For the sake of simplicity and without loss of generality, we do not consider any specialized coding techniques in our analysis.

5.4 Area Requirements

To evaluate the feasibility of these interconnect schemes, we consider their respective silicon area requirements.
As the switches form an integral part of the infrastructure, it is important to determine the amount of relative silicon area they consume. The switches have two main components: the storage buffer, and logic to implement routing and flow control. The storage buffers are the FIFOs at the inputs and outputs of the switch. Another source of silicon area overhead arises from the inter-switch wires, which, depending on their lengths may have to be buffered through repeater insertion to keep the interswitch delay within one clock cycle [10]. Consequently, this additional buffer area should also be taken into account. Another important factor that needs to be considered when analyzing the area overhead is the wiring layout. One of the main advantages of the NoC design methodology is the division of long global wires into smaller segments, characterized by propagation times that are compatible with the clock cycle budget [26]. The different NoC architectures considered here achieve this as a result of their inherent interconnect structure. But the segmented wire lengths will vary from one topology to another. Consequently, for each architecture, the layout of inter-switch wire segments presents different degrees  64  of complexity. Long wires can block wiring channels, forcing the use of additional metal layers, and causing other wires to become longer. Architectures that possess longer inter-switch wires will generally create more routing challenges, compared to those possessing only shorter wire segments. The distribution of inter-switch wire lengths can give a first-order indication of the overall wiring complexity. 5.5 Performance Evaluation  In order to carry out a consistent comparison, we developed a simulator [42] [12] employing flitlevel event-driven wormhole routing to study the characteristics of the communication-centric parameters of the interconnect infrastructures. In our experiments the traffic injected by the functional IP blocks followed Poisson [43] and self-similar distributions [44]. In the past, a Poisson distributed injection rate was frequently used when characterizing performance of multiprocessor platforms [44]. However, the self-similar distribution was found to be a better match to real world SoC scenarios [45]. Using a flit counter at the destinations, we obtain the throughput as the number of flits reaching each destination per unit time. To calculate average latency and energy, we associate an ordered pair, switch, and an ordered pair, Linterconnect  and  E h, switc  E  (Li,  i n t e r c o n n e c l  lterconnecU  E  i n t e r c o n n e c l  ),  (L , , swi cn  E h), switC  with each  with each interconnect segment, where  L  s w i t c h >  denote the delays and energy dissipated in the switch and interconnect,  respectively. The average latency and energy dissipation are calculated according to equations (5.2) and (5.7). To estimate the silicon area consumed by the switches, we developed their V H D L models and synthesized them using a fully static, standard cell-based approach for a 0.13 pm C M O S technology library. Starting from this initial estimation, by using an ITRS (International Technology Roadmap for Semiconductors) suggested scaling factor of 0.7, we can project the area overhead in future technology nodes.  65  5.6 Experimental Results and Analysis  We applied our evaluation methodology to different proposed regular NoC architectures described in Chapter 2. 
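A minimal sketch of the simulator-side bookkeeping just described is given below: per-hop (latency, energy) pairs are accumulated along each packet's path and then averaged over all delivered packets according to Eqs. (5.2) and (5.7). The data structures and numbers are assumptions for illustration, not the actual simulator code [42][12].

```python
# Accumulate per-packet latency and energy from per-hop (L, E) pairs,
# then average over all delivered packets (Eqs. 5.2 and 5.7).

def packet_totals(hops, n_flits, overhead_cycles=0):
    """hops: list of (latency_cycles, energy_per_flit) for switches and wires."""
    latency = overhead_cycles + sum(l for l, _ in hops)
    energy = n_flits * sum(e for _, e in hops)      # Eq. (5.6)
    return latency, energy

def averages(packets):
    """packets: list of (latency, energy) tuples for delivered packets."""
    p = len(packets)
    l_avg = sum(l for l, _ in packets) / p           # Eq. (5.2)
    e_avg = sum(e for _, e in packets) / p           # Eq. (5.7)
    return l_avg, e_avg

# Illustrative use: two 64-flit packets crossing 3 and 5 hops,
# with assumed per-hop latencies (cycles) and energies (arbitrary units).
pkt1 = packet_totals([(4, 0.8), (1, 0.2), (4, 0.8)], 64)
pkt2 = packet_totals([(4, 0.8), (1, 0.2)] * 2 + [(4, 0.8)], 64)
print(averages([pkt1, pkt2]))
```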
The wormhole routing simulator was used to compare and contrast the NoC topologies in terms of throughput and latency. In this simulator, the user may choose between uniform and localized traffic patterns for the packets. There are options of using both Poisson and self-similar message injection distributions. Self-similar traffic has been observed in the bursty traffic between on-chip modules in typical MPEG-2 video applications [45] and networking applications [43]. It has been shown that modeling of self-similar traffic can be obtained by aggregating a large number of ON-OFF message sources [44]. The length of time each source spends in either the ON or the OFF state should be selected according to a distribution which exhibits long-range dependence. The Pareto distribution (F(x) = 1 − x^(−α), with 1 < α < 2) has been found to fit well to this kind of traffic. A packet train remains in the ON state for tON = (1 − r)^(−1/αON) and in the OFF state for tOFF = (1 − r)^(−1/αOFF), where r is a random number uniformly distributed between 0 and 1, αON = 1.9, and αOFF = 1.25 [44].

For the Poisson simulations the request generation is controlled by a Poisson process; the generation interval is determined using tnext = −(1/λ)·ln(1 − R), where R is a uniformly distributed random number between 0 and 1 and λ is the request generation rate. The destination IP selection depends on the traffic pattern adopted.

The simulator is capable of handling variable message lengths. Message lengths may vary depending on the application. On the other hand, message length and buffer depth are strongly correlated. In an SoC environment buffer depth is of extreme importance, as it adds to the silicon area overhead due to the switches. In addition, switch parameters can also be specified. These include input/output port buffer depths (in flits), number of ports, and the number of virtual channels per switch port. Messages arriving at destinations are immediately consumed at the rate of one flit per time step, i.e., no blocking is encountered at the destinations. All resource contention is handled without bias in the sense that granting of resources to packets is done on a first-come, first-served basis.

The energy dissipation of NoC fabrics arises from two different sources: (1) the switch blocks, which include the buffers, and (2) the inter-switch wire segments. To study the energy efficiency of the interconnect architectures, we determine the energy dissipated in each switch, Eswitch, by running Synopsys™ Prime Power on the gate-level netlist of the switch blocks, including the FIFO buffers. Our energy estimation methodology involved feeding a large set of data patterns to the switch blocks. Through functional simulation using Synopsys™ Prime Power, the average value of the activity factors was determined. The experimental data set included long sequences of 1's and 0's to account for the possible cases where low transition activity data were to be transported. To determine the interconnect energy, Einterconnect, the capacitance of each interconnect stage, Cinterconnect, is calculated taking into account the specific layout of each topology.
Cinterconnect can be estimated according to the following expression:

Cinterconnect = Cwire · w(a+1,a) + n · m · (CG + CJ)    (5.8)

where Cwire is the wire capacitance per unit length and w(a+1,a) is the wire length between two consecutive switches; CG and CJ are the gate and junction capacitance of a minimum size inverter, respectively; n denotes the number of inverters (when buffer insertion is needed) in a particular inter-switch wire segment and m is their corresponding size with respect to a minimum size inverter. While calculating Cwire we have considered the worst-case switching scenario, where the two wires adjacent to the signal line simultaneously switch in the opposite direction.

In all the subsequent experiments we consider each system to consist of 256 functional IP blocks, i.e., N = 256. Table 5.1 summarizes the simulation parameters.

Table 5.1: Simulation parameters

Topology      | Message Length (Flits) | Buffer Depth (Flits) | Port Count
SPIN          | 64                     | 2                    | 8
CLICHE        | 64                     | 2                    | 5
FOLDED TORUS  | 64                     | 2                    | 5
BFT           | 64                     | 2                    | 6

5.6.1 Throughput and Latency

We now compare the throughput and latency characteristics of the various NoC architectures. The throughput of the communication infrastructure generally depends on the traffic pattern. Fig. 5.1 shows the variation of throughput with the number of virtual channels for all the topologies, determined through simulation using Eq. (5.1). Measuring throughput under uniform spatial distribution assumptions is an accepted metric [23] for evaluating parallel systems. Throughput is the maximum traffic accepted by the network and it relates to the peak data rate sustainable by the system. The accepted traffic depends on the rate at which the functional IP blocks are injecting data into the network. Ideally, accepted traffic should increase linearly with this injection load. However, due to the limitation of routing resources (switches and interconnect wires), accepted traffic will saturate at a certain value of the injection load. Similar to the throughput, the unit of measure for injection load is also flits/cycle/IP. For both Poisson and self-similar injection rates the variation of throughput with virtual channels has similar characteristics. It is evident from Fig. 5.1 that when the number of virtual channels is increased beyond four, there is a trend towards throughput saturation. However, each additional virtual channel implies an increased silicon area.

(Footnote: The average message latency decreases when buffer size increases [23]. According to [23], the effect of buffer size on performance is small. Consequently, to avoid excessive silicon area consumption in our switch design, here we considered the buffer depths to be equal to two flits.)

Fig. 5.2 shows the variation of latency with the number of virtual channels. The average message latency depends on the number of virtual channels and injection load. In this case, the average latency generally increases with the number of virtual channels. To keep the latency low while simultaneously maintaining a considerable throughput, the number of virtual channels is constrained to four in the design of the switches. Consequently, a system with four virtual channels strikes an appropriate balance between high throughput, low latency and conservation of silicon area.
In all the subsequent experiments we consider each system to consist of 256 functional IP blocks, i.e., N = 256. Table 5.1 summarizes the simulation parameters.

Table 5.1: Simulation parameters

  Topology        Message length (flits)   Buffer depth (flits)   Port count
  SPIN            64                       2                      8
  CLICHE          64                       2                      5
  FOLDED TORUS    64                       2                      5
  BFT             64                       2                      6

5.6.1 Throughput and Latency

We now compare the throughput and latency characteristics of the various NoC architectures. The throughput of the communication infrastructure generally depends on the traffic pattern. Fig. 5.1 shows the variation of throughput with the number of virtual channels for all the topologies, determined through simulation using Eq. (5.1). Measuring throughput under a uniform spatial distribution assumption is an accepted metric for evaluating parallel systems [23]. Throughput is the maximum traffic accepted by the network and it relates to the peak data rate sustainable by the system. The accepted traffic depends on the rate at which the functional IP blocks inject data into the network. Ideally, accepted traffic should increase linearly with this injection load; however, due to the limitation of routing resources (switches and interconnect wires), accepted traffic saturates at a certain value of the injection load. As with throughput, the unit of measure for injection load is flits/cycle/IP. For both Poisson and self-similar injection processes the variation of throughput with the number of virtual channels has similar characteristics. It is evident from Fig. 5.1 that when the number of virtual channels is increased beyond four, there is a trend towards throughput saturation. However, each additional virtual channel implies an increased silicon area. The average message latency decreases when buffer size increases [23], but according to [23] the effect of buffer size on performance is small; consequently, to avoid excessive silicon area consumption in our switch design, we considered buffer depths of two flits.

Fig. 5.2 shows the variation of latency with the number of virtual channels. The average message latency depends on the number of virtual channels and the injection load. In this case, the average latency generally increases with the number of virtual channels. To keep the latency low while maintaining considerable throughput, the number of virtual channels is constrained to four in the design of the switches. Consequently, a system with four virtual channels strikes an appropriate balance between high throughput, low latency and conservation of silicon area. This result is consistent with previous research on the optimal number of virtual channels [46] and, in part, validates the modeling and simulation approach used to generate the results in this work. The plots in Fig. 5.1 also indicate that under the uniform traffic assumption, BFT, CLICHE and Folded Torus provide a lower throughput than does SPIN. This happens because SPIN has more links between a source and destination pair than the others do.

Fig. 5.1: Variation of throughput with the number of virtual channels under spatially uniform traffic distribution
Fig. 5.2: Variation of latency with the number of virtual channels

The role of injection load on the accepted traffic was also studied, as shown in Fig. 5.3. We observe that the accepted traffic increases linearly with the injection load up to the throughput saturation point. Fig. 5.3(b) shows that self-similar traffic saturates the networks at slightly lower average data rates.

Fig. 5.3: Variation of accepted traffic with injection load: (a) Poisson, (b) self-similar

While these results are as one would expect, the assumption of spatial uniformity of traffic is not very realistic in an SoC environment, since different functions will be mapped to different parts of the SoC and will exhibit highly localized traffic patterns. Hence, we studied the effect of traffic localization on throughput for both types of injection processes and considered the illustrative case of spatial localization where local messages travel from a source to the set of nearest destinations. In the case of BFT and SPIN, localized traffic is constrained within a cluster consisting of a single sub-tree, while in the case of CLICHE and Folded Torus it is constrained within the four destinations placed at the shortest Manhattan distance [12][23]. We define the localization factor as the ratio of local traffic to total traffic. For example, if the localization factor is 0.3, then 30% of the traffic generated by an IP occurs within its cluster, while the rest of the traffic is uniformly distributed over the remainder of the entire SoC.

Fig. 5.4 shows the effect of traffic localization on throughput for all the topologies, assuming four virtual channels (vc), based on the previously described experiments. Localization of traffic does not have much impact on SPIN, but it enhances the throughput of BFT, CLICHE and Folded Torus considerably. Though SPIN has very high throughput for uniformly distributed traffic, it lacks the ability to exploit the traffic localization inherent in SoC architectures.

Fig. 5.4: Variation of throughput under localized traffic (number of vc = 4): (a) Poisson, (b) self-similar

In Fig. 5.5, we show the variation of latency with injection load in the case of both Poisson and self-similar distributions for uniform traffic. The injection load directly affects the average message latency. As the injection load approaches the accepted traffic (throughput) limit, there is more message contention and latency increases. In the limit, latency grows infinitely large when the injection load reaches the saturation point. Consequently, the desirable point of operation for the system should be well below network saturation.
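To make the two metrics concrete, the short sketch below computes accepted traffic (in flits/cycle/IP) and average message latency from per-message records of a simulation run; the record format and field names are assumptions made for this illustration, not the interface of the simulator used here.

```python
def accepted_traffic(records, sim_cycles, num_ips):
    """Accepted traffic in flits/cycle/IP: total flits delivered, normalized by
    simulation length and by the number of functional IP blocks."""
    delivered = sum(r["length_flits"] for r in records if r["arrival"] is not None)
    return delivered / (sim_cycles * num_ips)

def average_latency(records):
    """Average message latency in cycles, from injection to delivery."""
    done = [r for r in records if r["arrival"] is not None]
    return sum(r["arrival"] - r["injection"] for r in done) / len(done)

# Example record: {"injection": 120, "arrival": 187, "length_flits": 64}
```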
The self-similar traffic distribution yields higher average message latency, principally due to its bursty nature. We also considered the effect of traffic localization on the latency. The variation of latency with localization factors of 0.3, 0.5 and 0.8 is shown in Figs. 5.6, 5.7 and 5.8. One important effect of localization on the latency characteristic is that it allows a higher injection load; consequently, more traffic can be processed without the network becoming saturated, which ultimately enhances the overall data rate of the system.

Fig. 5.5: Latency variation with injection load for spatially uniform traffic distribution: (a) Poisson, (b) self-similar
Fig. 5.6: Latency variation with injection load (localization factor = 0.3): (a) Poisson, (b) self-similar
Fig. 5.7: Latency variation with injection load (localization factor = 0.5): (a) Poisson, (b) self-similar
Fig. 5.8: Latency variation with injection load (localization factor = 0.8): (a) Poisson, (b) self-similar

It is seen from Figs. 5.6-5.8 that, similar to the case of the throughput characteristics, traffic localization does not have a significant impact on the latency variation for SPIN.

5.6.2 Energy Dissipation

While evaluating the feasibility of an interconnect infrastructure, its energy dissipation profile must be considered, as it can be a significant portion of the overall SoC energy budget. The metric used to compare the NoC architectures with respect to energy dissipation is the average dynamic energy dissipated when a packet moves between a pair of source and destination IP blocks. This average energy dissipation, in turn, depends on the number of virtual channels and the injection load.
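One simple way to picture this metric is to accumulate the switch and wire energies along a packet's route. The additive per-hop model below is only a sketch consistent with the $E_{switch}$ and $E_{interconnect}$ quantities defined earlier; the exact accounting used in the actual evaluation may differ.

```python
def packet_energy(num_hops, e_switch, e_stage):
    """Approximate dynamic energy for one packet crossing `num_hops` inter-switch
    links: it traverses (num_hops + 1) switches and num_hops wire stages."""
    return (num_hops + 1) * e_switch + num_hops * e_stage

def average_packet_energy(hop_counts, e_switch, e_stage):
    """Average over all delivered packets; `hop_counts` is a list of hops per packet."""
    return sum(packet_energy(h, e_switch, e_stage) for h in hop_counts) / len(hop_counts)
```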
Fig. 5.9 shows the variation of average energy dissipation per packet as a function of the number of virtual channels (the same for both Poisson and self-similar traffic), assuming the networks operate at the peak sustainable data rate. We observe that the energy dissipation increases linearly with the number of virtual channels for all the architectures. Consequently, a system with four virtual channels per physical link gives reasonably low energy dissipation without compromising throughput.

Fig. 5.9: Average energy dissipation per packet versus the number of virtual channels

The effect of injection load on the energy dissipation under a uniform traffic distribution, for both Poisson and self-similar injection processes, is shown in Fig. 5.10. Similar to the accepted traffic variations, the energy dissipation profiles exhibit a saturating characteristic once the injection load reaches the throughput limit. Beyond saturation, no additional messages can be injected successfully into the system and, consequently, no additional energy is dissipated. We also considered the effect of traffic localization on the energy dissipation profile for all the NoC architectures. Figs. 5.11, 5.12 and 5.13 show the energy dissipation profiles for localization factors of 0.3, 0.5 and 0.8, respectively. The benefit of traffic localization is evident from these figures: increasing the amount of traffic localization allows more messages to be injected without increasing the average energy dissipation. This happens because, on average, messages traverse fewer stages when the degree of localization is greater. Consequently, the functional mapping should be performed so as to exploit the advantages of spatial locality, i.e., blocks that communicate more frequently should be placed close to each other. This reduces the use of long global paths and the energy dissipation.

Fig. 5.10: Energy dissipation profile for uniform traffic: (a) Poisson, (b) self-similar
Fig. 5.11: Energy dissipation profile for localized traffic (localization factor = 0.3): (a) Poisson, (b) self-similar
Fig. 5.12: Energy dissipation profile for localized traffic (localization factor = 0.5): (a) Poisson, (b) self-similar
Fig. 5.13: Energy dissipation profile for localized traffic (localization factor = 0.8): (a) Poisson, (b) self-similar

From Figs. 5.10-5.13, we can infer that an architecture with a higher degree of connectivity, like SPIN, has a greater average energy dissipation at saturation than the others, even though it provides higher throughput and lower latency on average.

5.6.2.1 Energy Dissipation and Throughput

The variation of average energy dissipation with injection load alone is not a complete measure of the energy efficiency of a given NoC architecture. The average bit energy dissipation (the energy dissipated in moving a bit between a pair of source and destination) versus throughput is another meaningful indicator, as it effectively expresses the dissipation with respect to data processing capability. In addition, the effect of system size on energy dissipation is worth investigating. To specifically study the effect of system size on the energy dissipation under varied traffic localization, we considered three different system sizes, i.e., 16, 64 and 256 IP blocks, numbers believed to be illustrative of what could be considered small-, medium- and large-scale MP-SoC platforms. Figs. 5.14(a)-5.16(a) show the variation of bit energy as a function of system throughput for the three system sizes under uniformly distributed traffic. It can be inferred from these figures that the nature of the bit energy variation with throughput remains approximately the same across system sizes; only the absolute values differ. To evaluate the advantage of traffic localization on energy dissipation, we studied the bit energy versus throughput characteristics for different degrees of localization. Figs. 5.14(b)-5.16(b) show the variation of bit energy with throughput for a localization factor of 0.5 for the three system sizes. From these it is seen that, as a result of traffic localization, system throughput increases while the bit energy dissipation decreases compared to the case of uniformly distributed traffic. From Figs. 5.14(a) and 5.14(b), we can infer that, for a given throughput, an increased degree of localization results in a reduced bit energy dissipation, as one would expect. The relative amount of energy savings for different localization factors is shown in Fig. 5.17.
To have a consistent comparison, we kept the system throughput at the same level for all the architectures while varying the amount of localized traffic. The relative amount of energy savings is not significantly affected by system size. When the localization factor is varied from 0.3 to 0.8, the bit energy savings relative to the uniformly distributed traffic scenario vary from 20% to 50%, independently of the system size in terms of number of IPs. This trend is observed for all the NoC topologies under consideration. It brings out one of the major advantages of these modular NoC architectures: reducing the degree of global communication improves the energy efficiency of the system without compromising system throughput.

Fig. 5.14: Bit energy versus throughput characteristics (system size = 16): (a) uniform traffic, (b) localized traffic
Fig. 5.15: Bit energy versus throughput characteristics (system size = 64): (a) uniform traffic, (b) localized traffic
Fig. 5.16: Bit energy versus throughput characteristics (system size = 256): (a) uniform traffic, (b) localized traffic
Fig. 5.17: Energy savings for different localization factors

5.6.3 Area Overhead

In the NoC design paradigm, the silicon area overhead arises from the presence of the switches, the inter-switch repeaters and the interfaces between the functional IP blocks and the network. Regardless of the specific topology used by the interconnect infrastructure, each functional IP block needs to have an OCP-IP interface; consequently, in our analysis we consider the whole interfacing circuitry as part of the functional IP blocks. From our detailed circuit-level design and synthesis, we deduce that within a switch the buffer area significantly dominates the logic [25]. The buffer area, in turn, largely depends on the number of virtual channels and the flit length. In a networked SoC, IPs can be divided into two groups: functional IPs and infrastructure IPs (switches). The distribution of functional and infrastructure IPs depends on their respective sizes and on the interconnect topology. Letting $A_{FIP}$ denote the area of the functional IP blocks and $A_{IIP}$ denote the area of the switches, then

$$Area_{chip} = N_1 \cdot A_{FIP} + N_2 \cdot A_{IIP} \qquad (5.9)$$

where $N_1$ and $N_2$ are the numbers of functional and infrastructure IPs, respectively, and $Area_{chip}$ is the total logic area of the SoC under consideration. For a BFT, $N_1 = 2N_2$; for the SPIN architecture, $N_1 = \frac{4}{3}N_2$; and for all the others, $N_1 = N_2$. These ratios help determine the distribution of functional and infrastructure IP blocks in an SoC. Through RTL-level design and synthesis, we found that the switches consist of approximately 30K gates (taking a 2-input NAND structure as the reference gate), while the OCP-IP interface accounted for around 1400 gates. Using Eq. (5.9) and the constraints on $N_1$ and $N_2$, we can determine the distribution of functional and infrastructure IP blocks for all the topologies. We consider the case where functional IP blocks are constrained to be of the order of 100K gates, as suggested in [5]. Table 5.2 shows the maximum number of 100K-gate blocks that can be integrated within an SoC at different technology nodes [9].

Table 5.2: Maximum number of 100K-gate IP blocks in different technology nodes

  Technology node                       130 nm   90 nm   65 nm   45 nm   32 nm
  Max. no. of 100K-gate IP blocks       500      1000    2500    7500    10000

The distribution of functional and infrastructure IP blocks is indicated in Table 5.3.

Table 5.3: Distribution of functional and infrastructure IP blocks

                      No. of functional IPs                No. of infrastructure IPs
  Technology node     BFT     SPIN    CLICHE/F. Torus      BFT     SPIN    CLICHE/F. Torus
  130 nm              428     400     375                  214     300     375
  90 nm               856     800     750                  428     600     750
  65 nm               2142    2000    1875                 1071    1500    1875
  45 nm               6426    6000    5625                 3213    4500    5625
  32 nm               8568    8000    7500                 4284    6000    7500
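A back-of-the-envelope Python sketch of this bookkeeping is shown below. The gate counts restate the numbers given above, while the function itself only illustrates Eq. (5.9) together with the topology-dependent N1/N2 ratios; its simplistic rounding and its treatment of the interface gates are assumptions, so the resulting counts come out close to, but not exactly equal to, the entries of Table 5.3.

```python
from fractions import Fraction

# N1/N2 ratio (functional IPs per switch) for each topology, from Eq. (5.9) discussion.
RATIO = {"BFT": Fraction(2, 1), "SPIN": Fraction(4, 3),
         "CLICHE": Fraction(1, 1), "FOLDED_TORUS": Fraction(1, 1)}

def ip_distribution(topology, max_100k_blocks, fip_gates=100_000,
                    switch_gates=30_000, ocp_if_gates=1_400):
    """Split a fixed logic budget between functional IPs (with their OCP-IP
    interface) and switches, respecting the topology's N1 = r * N2 constraint."""
    budget = max_100k_blocks * fip_gates              # total logic gates available
    r = RATIO[topology]
    n2 = int(budget // (r * (fip_gates + ocp_if_gates) + switch_gates))
    n1 = int(r * n2)
    return n1, n2

print(ip_distribution("BFT", 500))   # rough functional/infrastructure split at 130 nm
```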
Under these assumptions, we determined the silicon area consumed by the switch blocks for all the architectures. The other factor contributing to the area overhead is the inter-switch repeaters. The wire length between switches in the BFT and SPIN architectures depends on the level of the switches; consequently, to keep the inter-switch wire delay within one clock cycle, some of these wires need to be buffered [10]. In CLICHE and Folded Torus, all the inter-switch wires are of equal length and their delay is always within one clock cycle [10], so no repeater insertion is required. The silicon area overhead in different technology nodes can then be estimated for all the interconnect architectures as the sum of the area due to the switches ($Area_{switch}$) and that due to the repeaters ($Area_{repeaters}$):

$$Area_{overhead} = Area_{switch} + Area_{repeaters} \qquad (5.10)$$

Fig. 5.18 reports the silicon area overhead for all the platforms under consideration across different technology nodes, assuming a die size of 20 mm × 20 mm and that each functional IP block consists of 100K equivalent 2-input NAND gates.

Fig. 5.18: Area overhead of BFT, CLICHE, Folded Torus and SPIN across the 130 nm to 32 nm technology nodes

From Fig. 5.18, we see that SPIN has a considerably higher silicon area overhead. This happens because this architecture provides a higher degree of connectivity. The percentage of silicon area overhead for the different platforms increases slightly with technology scaling; however, the relative area overhead between any two platforms remains almost unchanged.

5.6.4 Wiring Complexity

The problem of wire area estimation involves determining the longest wire segments that may arise in each architecture and their distribution. Long wire segments block wiring channels and force other wires to become longer. From our experience, the wires will consume more area than predicted by a first-order analysis. Assuming this kind of overhead, our aim is to estimate the distribution of wire lengths in all the interconnect architectures under consideration. In an NoC environment, the inter-switch wire segments are the longest on-chip wires except for the clock, power and ground wires [47]. Due to the structured nature of NoC-based interconnects, the inter-switch wire lengths can be determined a priori. The wire lengths between switches in the BFT and SPIN architectures depend on the levels of the switches. The number of switch levels can be expressed as a function of the system size ($N$) as $levels = \log_4 N$ for both, where $N$ is the number of functional IP blocks in the SoC. The inter-switch wire length is given by the following expression [10]:

$$w_{a+1,a} = \frac{\sqrt{Area}}{2^{levels - a}} \qquad (5.11)$$

where $w_{a+1,a}$ is the length of the wire spanning the distance between level $a$ and level $a+1$ switches, and $a$ can take integer values between 0 and $(levels - 1)$. For CLICHE and Folded Torus, all inter-switch wire segments are of the same length; the inter-switch wire length of the CLICHE architecture can be determined from the following expression:

$$w = \frac{\sqrt{Area}}{\sqrt{N} - 1} \qquad (5.12)$$

while for the Folded Torus all the inter-switch wire lengths are double those of CLICHE [28].
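A small sketch of Eqs. (5.11) and (5.12), assuming a square die so that the square root of the area is simply the die edge length (values in millimetres); it is only an illustration of the two expressions above.

```python
import math

def bft_spin_wire_lengths(die_edge_mm, num_ips):
    """Eq. (5.11): wire length between level-a and level-(a+1) switches,
    with levels = log4(N), for the tree-based BFT and SPIN topologies."""
    levels = int(math.log(num_ips, 4))
    return {a: die_edge_mm / 2 ** (levels - a) for a in range(levels)}

def cliche_wire_length(die_edge_mm, num_ips):
    """Eq. (5.12): uniform inter-switch wire length of CLICHE; Folded Torus
    links are twice as long [28]."""
    return die_edge_mm / (math.sqrt(num_ips) - 1)

print(bft_spin_wire_lengths(20, 256))   # {0: 1.25, 1: 2.5, 2: 5.0, 3: 10.0}
print(cliche_wire_length(20, 256))      # about 1.33 mm
```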
Considering a die size of 20 mm × 20 mm and a system size of 256 IP blocks, we determined the number of inter-switch links and their lengths for all the NoC architectures under consideration. The inter-switch wire lengths of CLICHE, Folded Torus, BFT and SPIN were calculated using Eqs. (5.11) and (5.12) and the simplified layouts of Fig. 5.19.

Fig. 5.19: Simplified layout examples of SPIN and BFT
Fig. 5.20: Inter-switch wire length distribution (number of links of length 5-10 mm, 2.5-5 mm, 1.25-2.5 mm and 0.625-1.25 mm for BFT, SPIN, CLICHE and Folded Torus)

Fig. 5.20 shows the distribution of inter-switch wire lengths for all the NoC architectures. From this figure, we can qualitatively infer that SPIN will have the highest degree of wiring complexity, while CLICHE and Folded Torus will have the lowest. Consequently, for the SPIN topology the layout of the inter-switch wire segments will have the greatest impact on area overhead. CLICHE and Folded Torus are the simplest from a layout perspective, while BFT stands between these two groups.

Table 5.4 summarizes the comparative analysis of the different NoC architectures undertaken in this chapter; in this table, single, double and triple check marks denote low, medium and high levels of performance with respect to the particular metric.

Table 5.4: Summary of comparative architecture evaluation (performance metrics: throughput, latency, communication energy, area overhead, wiring complexity; architectures rated: CLICHE, Folded Torus, BFT, SPIN)

In the above analysis we have assumed that the switch blocks in all the architectures under consideration contain four virtual channels. The principal role of virtual channels is throughput enhancement; by increasing the number of virtual channels in the switch blocks of a lower-performing architecture, its throughput can be boosted. However, this has a considerable impact on area overhead and energy dissipation. If, in a particular application, the system integrator can afford the extra area and energy overhead, then it is possible to adopt a limited-connectivity architecture with a higher number of virtual channels as the overall SoC interconnect fabric.

5.7 Case Study

To illustrate how system designers can use the analytical and experimental procedures outlined in this chapter to estimate the performance of an SoC application, we simulated a multiprocessor SoC (a network processing platform) mapped onto the different NoC communication fabrics described in the earlier sections. Among all the architectures under consideration, SPIN has the highest throughput, but its energy dissipation is much greater than that of the others, and the silicon area overhead due to the infrastructure IP blocks is also higher. Taking these facts into account, we considered the architectures with a lower energy dissipation profile, i.e., BFT, CLICHE and Folded Torus, for further evaluation. For illustrative purposes, we mapped the network processing platform onto these three interconnect architectures. The functional block diagram of the network processor, based on a commercial design [48], is shown in Fig. 5.21.
All the functional blocks are divided into five clusters. Initially, we assumed the traffic to be uniformly distributed among these five clusters. The Micro Engines (MEs) in clusters 2 and 3 are programmable engines specialized for network processing. The MEs do the main data-plane processing [48] for each packet and communicate in a pipelined fashion within each ME cluster. Consequently, the traffic is highly localized within these two clusters (clusters 2 and 3).

Fig. 5.21: Functional block diagram of a typical network processor (cluster 0: media switch fabric (MSF), hash unit and scratchpad memory; cluster 1: SRAM, DRAM and PCI controllers and CAP; clusters 2 and 3: ME clusters 0 and 1; cluster 4: processor core, core peripherals and performance monitor)

As discussed earlier, we assumed localization factors of 0.3, 0.5 and 0.8 for the traffic within these two clusters, while the rest of the traffic is assumed to be uniformly distributed. We also assumed a self-similar injection process. Under the stated traffic distributions, we simulated the performance of the network processor SoC shown in Fig. 5.21. From the throughput characteristics we can project the aggregate bandwidth [49] sustainable by the SoC platform using the following expression:

Aggregate bandwidth = (number of IP blocks) × (flit length) × (accepted traffic) × (clock rate)

Table 5.5 shows the projected bandwidth, assuming a clock rate of 500 MHz (typical for an SoC implemented in a 130 nm process), together with the average message latency and the average energy dissipation.

Table 5.5: Projected performance of a network processor SoC platform in the NoC design paradigm (CLI = CLICHE, FT = Folded Torus)

                               Uniform              0.3 localization     0.5 localization     0.8 localization
                               CLI    FT     BFT    CLI    FT     BFT    CLI    FT     BFT    CLI    FT     BFT
  Throughput                   0.58   0.60   0.40   0.63   0.65   0.57   0.75   0.70   0.63   0.80   0.78   0.82
  Aggregate bandwidth (Gbps)   92.8   96     64     100.8  104    91.2   120    112    100.8  128    124.8  131.2
  Average latency (cycles)     33     30     28     33     30     28     33     30     28     33     30     28
  Energy per packet (nJ)       2.175  2.2    2.9    1.70   1.80   2.32   1.50   1.62   1.74   1.25   1.30   1.5

As expected and as discussed in Section 5.6.1, throughput increases significantly with traffic localization, which in turn gives rise to a higher aggregate bandwidth. The average latency is measured at an injection load below saturation. The effect of traffic localization on the average latency is that it allows a higher injection load without saturating the network; the message latency at a lower injection load (below saturation) remains largely unaffected by traffic localization. While measuring the average energy dissipation, to have a consistent comparison, we kept the system throughput at the same level for all the architectures while varying the amount of localized traffic. When the localization factor is varied from 0.3 to 0.8, the bit energy savings relative to the uniformly distributed traffic scenario vary from 20% to 50%. As shown in this case study, it is possible to project the achievable performance of a typical multi-core SoC implemented using the NoC design paradigm.
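The projection itself is straightforward arithmetic; the sketch below simply reproduces the aggregate-bandwidth expression, with the block count and flit length as assumed inputs (any product of blocks and flit bits equal to 320, together with the 500 MHz clock, corresponds to the 160 Gbps per unit of accepted traffic implicit in Table 5.5).

```python
def aggregate_bandwidth_gbps(num_ip_blocks, flit_length_bits, accepted_traffic, clock_hz):
    """Aggregate bandwidth = (IP blocks) x (flit length) x (accepted traffic) x (clock)."""
    return num_ip_blocks * flit_length_bits * accepted_traffic * clock_hz / 1e9

# Illustrative values only: 20 blocks with 16-bit flits at 500 MHz.
print(aggregate_bandwidth_gbps(20, 16, 0.58, 500e6))   # ~92.8 Gbps (CLICHE, uniform)
```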
5.8 Summary

Networks on chip (NoC) are emerging as a viable interconnect architecture for multiprocessor (MP-) SoC platforms. In this new paradigm, infrastructure IPs are used to establish the on-chip communication medium. NoC-based architectures are characterized by various trade-offs with regard to functional, structural and performance specifications. Here, we carried out detailed comparisons of different NoC architectures in terms of throughput, latency, energy dissipation and silicon area overhead. We illustrated that some architectures can sustain very high data rates at the expense of high energy dissipation and considerable silicon area overhead, while others provide lower data rates together with lower energy dissipation levels. Our principal contribution lies in the establishment and illustration of a consistent comparison and evaluation methodology based on a set of readily quantifiable parameters for NoCs. Our methodology sets an important basis for the evaluation and optimal selection of interconnect infrastructures for large and complex SoCs. Though the parameters considered in our benchmarking are regarded by experts in the field as being among the most critical, they do not constitute a unique set, nor are they exhaustive. Different applications or circumstances may require this set to be altered or augmented, e.g., by including parameters such as testability, dependability and reliability. Nevertheless, they form an important set for characterizing the emerging NoC architectures.

Chapter 6: Conclusions and Future Work

6.1 Conclusions

Multiprocessor system-on-chip (MP-SoC) platforms are emerging as an important trend for SoC design. Power and wire design constraints are forcing the adoption of new design methodologies for system-on-chip (SoC), namely those that incorporate modularity and explicit parallelism. To enable these MP-SoC platforms, researchers have recently pursued scalable communication-centric interconnect fabrics, such as networks on chip (NoC), which possess many features that are particularly attractive for these applications. These communication-centric interconnect fabrics are characterized by different trade-offs with regard to latency, throughput, energy dissipation and silicon area requirements. In this work, a consistent and meaningful evaluation methodology to compare the performance and characteristics of a variety of NoC architectures was developed. We also explored the design trade-offs that characterize the NoC approach and obtained comparative results for a number of common NoC topologies, assuming realistic traffic patterns. To the best of our knowledge, this is the first effort to characterize different NoC architectures with respect to their performance and design trade-offs.

Global communication is one of the major problems in a large SoC. It is well documented that in the ultra-deep-submicron (UDSM) technology era the intra-chip propagation delay will exceed the limit of one clock cycle. In the NoC paradigm, by adopting a structured communication template, it is possible to divide the communication medium into multiple pipelined stages. It has been demonstrated through detailed circuit-level design and analysis that the propagation delays of the pipelined stages can be made comparable to the clock cycle budget. Consequently, the different NoC architectures proposed to date are guaranteed to achieve the minimum possible clock cycle times in a given CMOS technology, usually specified in normalized units as 10-15 FO4 delays. The role of the communication infrastructure in the energy dissipation is also discussed, and it is shown that by reducing the degree of global communication a considerable amount of energy can be saved without compromising the system performance.
6.2 Future Work

Any new design methodology can be widely adopted only if it is complemented by efficient test mechanisms. The development of test infrastructures and techniques supporting the network on chip design paradigm is a challenging problem. Specifically, the design of specialized test access mechanisms (TAMs) for distributing test vectors and of novel design-for-testability (DFT) schemes are of major importance [8]. Moreover, in a communication-centric design environment like that provided by NoCs, fault tolerance and reliability of the data transmission medium are two significant requirements in safety-critical VLSI applications. In order to become a viable alternative, the network on chip paradigm also has to be supported by CAD tools through the creation of specialized libraries, application mapping tools and synthesis flows.

6.2.1 Testing of NoC-based Systems

The test strategy for NoC-based systems must address three problems: (i) testing of the functional/storage blocks and their corresponding network interfaces; (ii) testing of the interconnect infrastructure itself; and (iii) testing of the integrated system.

6.2.1.1 Testing of the Functional/storage Blocks

For testing the functional/storage blocks and their corresponding network interfaces, a test access mechanism (TAM) is needed to transport the test data. This TAM provides on-chip transport of test stimuli from a test pattern source to the core under test, and it transmits test responses from the core under test to a test pattern sink. The reuse of the on-chip network as the TAM for the functional/storage cores is proposed in [31]. The principal advantages of using NoCs as test access mechanisms are the availability of several parallel paths to transmit test data to each core and the fact that no extra TAM hardware is needed. Therefore, a reduction in system test time can be achieved through extensive use of test parallelization, i.e., more functional blocks can be tested in parallel as more test paths become available. One side effect of test parallelization is excessive power dissipation; hence, both the test time and the power dissipation need to be considered when exploiting parallelization for testing of the functional blocks [31].

6.2.1.2 Testing of the Interconnect Infrastructure

Testing of the interconnect infrastructure involves two different aspects: (i) testing of the switch blocks and (ii) testing of the inter-switch wire segments.

6.2.1.2.1 Testing of the Switch Blocks

The switch blocks consist of the FIFO buffers and the routing logic. From the silicon area perspective, the FIFO buffers dominate over the routing logic. Consequently, the testing of the switch blocks can be subdivided into two problems: testing of the FIFO buffers and testing of the routing circuitry. Generally, the routing logic consists of a few hundred logic gates, and traditional testing methods like scan or BIST can be employed. However, testing of the FIFO buffers poses a unique challenge, as numerous relatively small buffers are distributed all over the chip. Traditionally, BIST is an accepted methodology for testing FIFOs, but in the NoC scenario the classical BIST approach is not suitable, as one dedicated BIST per FIFO block would give rise to an unacceptably large silicon area overhead. Consequently, a distributed BIST methodology such as the one depicted in Fig. 6.1 is more appropriate.

Fig. 6.1: Distributed BIST structure for FIFO testing (a shared test generator and control block drives multiple FIFOs, each with a local response analyzer (LRA) feeding a multiple input shift register (MISR))

In a distributed BIST scheme, the read/write mechanisms, the control circuitry and the test data source are shared among the multiple FIFO blocks, whereas the local response analyzers (LRAs) are distributed, one for each FIFO [54]. In addition, realistic fault models for these FIFO buffers need to be investigated.
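To make the shared-generator/shared-compactor idea concrete, here is a small behavioural sketch of an LFSR pattern source and a MISR signature register in Python; the polynomial taps and widths are arbitrary illustrative choices, not those of any design in this work.

```python
def lfsr_sequence(seed, taps, width, count):
    """Fibonacci-style LFSR: a shared pseudo-random test pattern source.
    `taps` are the bit positions XORed to form the feedback bit."""
    state, out = seed, []
    for _ in range(count):
        out.append(state)
        fb = 0
        for t in taps:
            fb ^= (state >> t) & 1
        state = ((state << 1) | fb) & ((1 << width) - 1)
    return out

def misr_signature(responses, taps, width, seed=0):
    """Multiple input shift register: compacts the per-cycle FIFO read data
    into a single signature word for comparison against a golden value."""
    state = seed
    for r in responses:
        fb = 0
        for t in taps:
            fb ^= (state >> t) & 1
        state = (((state << 1) | fb) ^ r) & ((1 << width) - 1)
    return state

# Example: 16 8-bit patterns written through a fault-free FIFO should reproduce
# the reference signature when read back and compacted.
patterns = lfsr_sequence(seed=0x5A, taps=(7, 5, 4, 3), width=8, count=16)
golden = misr_signature(patterns, taps=(7, 5, 4, 3), width=8)
```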
Once the FIFO buffers have been tested and are known to be operational, they can be reused to transport test data to the combinational blocks of the switches. The test data patterns used to test the routing circuitry can be injected from an external ATE through one injection port connected directly to one of the NoC switches. We propose that the NoC infrastructure be progressively used for testing its own components in a recursive manner, i.e., the good, already-tested NoC components are used to transport test patterns to the untested elements. This test strategy minimizes the use of additional mechanisms for transporting data to the NoC elements under test, while allowing a reduction of test time through the use of parallel test paths and test data multicast.

6.2.1.2.2 Testing of the Inter-switch Wire Segments

Testing of the inter-switch wire segments requires the adoption of adequate fault models that take deep-submicron (DSM) effects into account. In the digital domain, device defects used to be modeled with extremely simplified models such as the stuck-at fault model. In deep submicron technologies, crosstalk and inductive effects introduce more complex behaviors that require the use of more advanced fault models for interconnect testing. The maximal aggressor fault (MAF) model proposed in [55] can be applied to test the inter-switch wire segments in NoC architectures. For a link consisting of N wires, the MAF model assumes the worst-case situation with one victim line and (N-1) aggressors. The MAF tests need to be applied at operational speed, which may require expensive external testers. To achieve high-quality at-speed testing of interconnects, different self-test methods have been proposed that use embedded BIST structures to generate the MAF tests; however, these introduce area and delay overhead. Testing of the inter-switch wire segments in an NoC remains an open problem that needs to be investigated further.
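As an illustration of the kind of stimuli involved, the sketch below enumerates MAF-style two-vector tests for one victim wire, with all other wires acting as aggressors. It is a deliberate simplification of the model in [55]: only glitch and delay excitations are shown, and full coverage would iterate the victim position over all N wires and all fault types defined there.

```python
def maf_vector_pairs(victim, n_wires):
    """Illustrative MAF-style two-vector tests for one victim wire.
    Returns {fault_name: (V1, V2)}, each vector a list of per-wire logic values."""
    def pair(v_before, v_after, a_before, a_after):
        v1, v2 = [a_before] * n_wires, [a_after] * n_wires
        v1[victim], v2[victim] = v_before, v_after
        return v1, v2

    return {
        "glitch_pos": pair(0, 0, 0, 1),   # aggressors rise, victim held low
        "glitch_neg": pair(1, 1, 1, 0),   # aggressors fall, victim held high
        "delay_rise": pair(0, 1, 1, 0),   # victim rises against falling aggressors
        "delay_fall": pair(1, 0, 0, 1),   # victim falls against rising aggressors
    }

# Full-link coverage: for v in range(n_wires): apply maf_vector_pairs(v, n_wires)
```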
6.3.1 Error Control Coding

From a reliability point of view, one of the advantages of packetized communication is the possibility of incorporating error control information into the transmitted data stream [1][29]. Effective error detection and correction methods borrowed from the communications engineering domain can be applied to cope with uncertainty in on-chip data transmission. Such methods need to be evaluated and optimized in terms of area, delay and power trade-offs. For bus-based on-chip communication fabrics, different error detecting/correcting codes have been proposed and studied. Coding involves mapping k data bits to n code bits, resulting in an (n, k) code with a code rate of k/n. Among existing error-correcting codes (ECCs), Hamming codes are a widely employed class of single-error-correcting (SEC) codes. In addition to ECCs, coding schemes that reduce the crosstalk among adjacent wire segments deserve further investigation. It has been shown that, for a bus-based system, error detecting codes with retransmission are more energy efficient than error correction [47][56]. In NoC architectures, the error recovery mechanism can either be distributed over multiple hops or concentrated at the end nodes. In distributed detection/correction schemes, each switch is equipped with error detection/correction circuitry such that the transmission of corrupted data can be stopped or corrected at the intermediate switches. In the centralized mechanisms, the retransmission of corrupted data may cause a severe latency penalty, especially when the source and destination nodes are far apart [47]. Thus, the trade-off related to the localization of error detection and correction involves several figures of merit, such as latency, area and power consumption.
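As a concrete illustration of the SEC codes mentioned above, the following is a minimal (7,4) Hamming encoder/decoder sketch showing single-error correction on a 4-bit data nibble; it illustrates the coding principle only and is not the code or flit width of any NoC discussed here.

```python
def hamming74_encode(d):
    """d: list of 4 data bits -> 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4        # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4        # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4        # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the erroneous bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

codeword = hamming74_encode([1, 0, 1, 1])
codeword[5] ^= 1                       # inject a single-bit error in transit
assert hamming74_decode(codeword) == [1, 0, 1, 1]
```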
6.3.2 Fault Tolerant Architectures

Permanent failures may be due to material aging (e.g., of the oxide), electromigration and mechanical/thermal stress. Failures can incapacitate a processing/storage core and/or a communication link. Different fault-tolerant multiprocessor architectures and routing algorithms have been proposed in the parallel processing domain. Some of these can be adapted to the NoC domain, but their effectiveness needs to be evaluated in terms of throughput, delay, energy dissipation and silicon area overhead. For example, redundant stand-by components can be used as spare parts. On-chip networks ease the seamless integration of such components, as well as the on-line transition from a malfunctioning unit to a spare part. Specific choices of on-chip network topology can provide the SoC with multiple paths from source to destination, and this redundancy may suffice to tolerate a malfunctioning link, possibly at the expense of performance. In the specific realm of NoCs, all types of redundancy have to be weighed against additional layout complexity (larger chips) and increased energy consumption. Thus, it is important to view reliable system design in conjunction with power management. Indeed, management policies affect the failure rates, through the frequency and temperature levels they induce as well as through thermal cycling. Similarly, power management techniques can be used to switch spare units on and off, and thus limit their impact on energy consumption. This area is the subject of ongoing research.

6.4 NoC Benchmark Circuits

To advance and accelerate NoC research and development, the community is in need of widely available reference benchmarks. The current SoC benchmark circuits (ITC 2002) contain only a very limited number of blocks and do not reflect the high level of integration specific to the NoC scenario. We propose an international collaborative initiative to develop a set of NoC benchmarks that will foster improved and accelerated developments in this field.

6.4.1 Benefits of Benchmarks

The envisaged benefits that a set of relevant NoC benchmarks would provide are similar to those ensuing from other benchmark suites established in our field or in related fields. Examples include: (1) improved sharing and comparison of R&D results; (2) increased healthy competitiveness between R&D groups; (3) increased commonality for comparative purposes; (4) increased reproducibility of results; and (5) accelerated development and analysis.

6.4.2 Proposal for NoC Benchmarks

In view of the proprietary issues involved, a set of synthetic benchmarks characterized, at least initially, by three sets of orthogonal parameters needs to be evolved. The parameters are as follows:

i) core composition (# of PEs, # and size of memories, # of I/Os);
ii) interconnect architecture;
iii) data communication requirements.

Presumably, these sets of parameters, when combined, would suffice to yield a meaningful and useful representative set of characteristics. For example, core composition would characterize the NoC with respect to the number of processing elements, memory elements and I/Os. For testing purposes, additional information would be required. This test information would include, for each functional core, the test strategy (BIST, scan, ...) and test-related parameters (number of scan chains, scan chain lengths, number of test patterns, number of I/Os, etc.). The interconnect architecture is intended to characterize NoCs with respect to the data transport capabilities of the communication fabric. The interconnect architectures proposed to date can be classified into cube-based topologies, tree-based topologies, irregular ones, and their different combinations. The data communication requirements would define the communication needs of the synthetic NoC; this set of parameters consists of inter-core bandwidth/latency, data integrity requirements (error rate) and spatial/temporal traffic distributions.

The open questions are whether the international industrial and academic communities would be able to agree to provide relevant data with respect to the specific parameters necessary to constitute these benchmarks, and what the appropriate level of abstraction should be for their description: should the NoC benchmarks be described using data flow, a tabular form, SystemC, or a format similar to that of the ITC'02 SoC benchmarks?

6.5 CAD for NoC

The development of computer-aided design (CAD) tools to support the NoC paradigm is a challenging problem. Current simulation methods and tools can be ported to networked SoCs to validate functionality and performance at various abstraction levels, ranging from the electrical level to the transaction level. NoC libraries, including switches/routers, links and interfaces, will provide designers with flexible components to complement processor/storage cores. Nevertheless, the usefulness of such libraries to designers will depend much on the level of maturity of the corresponding synthesis/optimization tools and flows. Synthesis of on-chip networks is a means of realizing gate- and circuit-level models starting from an architectural template and design constraints.
Due to the novelty of NoCs, there are no specialized languages or formalisms for their high-level modeling. Nevertheless, network topologies can be modeled with structural formalisms, and hardware and software (e.g., protocol) behavior can be captured by procedural languages (e.g., SystemC, C++). There are many parameters to optimize in an on-chip network implementation. CAD tools can be used for optimizing the implemented circuitry, for example by sizing switches and links to provide adequate QoS with minimal area and/or energy dissipation. In many cases, for a given system application, the "best" network architecture, protocols and parameters are not known. A designer needs to experiment with different models in the search for the solution that best trades off performance against energy consumption and layout complexity. Eventually, the network topology, protocols and parameters can be chosen with CAD tool support.

The principal challenge in designing a multi-core SoC is the development of an efficient communication infrastructure. The network on chip (NoC) is emerging as an interconnect infrastructure to support the design of these large SoCs and is viewed as an enabling solution for them. The focus of this thesis was on the design aspects and architectural issues of this new paradigm. This work, together with the future directions outlined here, will help the adoption of the network on chip paradigm as a mainstream SoC design methodology in the coming years.

References

[1] L. Benini, G. De Micheli, "Networks on Chips: A New SoC Paradigm", Computer, vol. 35, no. 1, Jan. 2002, pp. 70-78.
[2] P. Magarshack, P. G. Paulin, "System-on-Chip Beyond the Nanometer Wall", Proceedings of DAC, June 2-6, 2003, Anaheim, USA, pp. 419-424.
[3] M. Horowitz, B. Dally, "How Scaling will Change Processor Architecture", Proceedings of ISSCC, February 15-19, 2004, San Francisco, USA, pp. 132-133.
[4] Y. Zorian, "Guest Editor's Introduction: What is Infrastructure IP?", IEEE Design & Test of Computers, vol. 19, no. 3, May-June 2002, pp. 3-5.
[5] R. Ho, K. W. Mai, M. A. Horowitz, "The Future of Wires", Proceedings of the IEEE, vol. 89, no. 4, April 2001, pp. 490-504.
[6] P. Kapur, G. Chandra, J. P. McVittie, K. C. Saraswat, "Technology and Reliability Constrained Future Copper Interconnects - Part II: Performance Implications", IEEE Transactions on Electron Devices, vol. 49, no. 4, April 2002, pp. 598-604.
[7] D. Sylvester, K. Keutzer, "Impact of Small Process Geometries on Microarchitectures in Systems on a Chip", Proceedings of the IEEE, vol. 89, no. 4, April 2001, pp. 467-489.
[8] ITRS 2003 Documents, http://public.itrs.net/Files/2003ITRS/Home2003.htm
[9] C. Grecu, P. P. Pande, A. Ivanov, R. Saleh, "Structured Interconnect Architecture: A Solution for the Non-Scalability of Bus-Based SoCs", Great Lakes Symposium on VLSI, Boston, USA, 26-28 April 2004, pp. 192-195.
[10] C. Grecu, P. P. Pande, A. Ivanov, R. Saleh, "Timing Analysis of Network on Chip Architectures for MP-SoC Platforms", Microelectronics Journal, Elsevier, vol. 36, issue 9, pp. 833-845.
[11] C. Hsieh, M. Pedram, "Architectural Energy Optimization by Bus Splitting", IEEE Transactions on CAD, vol. 21, no. 4, April 2002, pp. 408-414.
[12] P. P. Pande, C. Grecu, M. Jones, A. Ivanov, R. Saleh, "Performance Evaluation and Design Trade-offs for Network on Chip Interconnect Architectures", IEEE Transactions on Computers, vol. 54, no. 8, pp. 1025-1040, August 2005.
[13] P. P. Pande, C. Grecu, M. Jones, A. Ivanov, R. Saleh, "Effect of Traffic Localization on Energy Dissipation in NoC-based Interconnect", Proceedings of the International Symposium on Circuits and Systems (ISCAS), Kobe, Japan, May 23-26, 2005.
[14] AMBA Bus Specification, http://www.arm.com.
[15] Wishbone Service Center, http://www.silicore.net/wishbone.htm
[16] CoreConnect Specification, http://www-3.ibm.com/chips/products/coreconnect/
[17] D. Wingard, "MicroNetwork-Based Integration for SoCs", Proceedings of the Design Automation Conference (DAC), Las Vegas, Nevada, USA, June 18-22, 2001, pp. 673-677.
[18] Open Core Protocol, www.ocpip.org.
[19] MIPS SoC-it, www.mips.com
[20] P. Guerrier, A. Greiner, "A Generic Architecture for On-Chip Packet-switched Interconnections", Proceedings of Design, Automation and Test in Europe (DATE), Paris, France, March 27-30, 2000, pp. 250-256.
[21] S. Kumar, A. Jantsch, J. P. Soininen, M. Forsell, M. Millberg, J. Oberg, K. Tiensyrja, A. Hemani, "A Network on Chip Architecture and Design Methodology", Proceedings of the IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Pittsburgh, USA, 2002, pp. 117-124.
[22] W. J. Dally, B. Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks", Proceedings of the Design Automation Conference (DAC), Las Vegas, Nevada, USA, June 18-22, 2001, pp. 683-689.
[23] J. Duato, S. Yalamanchili, L. Ni, Interconnection Networks - An Engineering Approach, Morgan Kaufmann, 2002.
[24] F. Karim, A. Nguyen, S. Dey, "An Interconnect Architecture for Networking Systems on Chips", IEEE Micro, vol. 22, issue 5, Sept.-Oct. 2002, pp. 36-45.
[25] P. P. Pande, C. Grecu, A. Ivanov, R. Saleh, "Design of a Switch for Network on Chip Applications", Proceedings of the International Symposium on Circuits and Systems (ISCAS), Bangkok, May 2003, vol. 5, pp. 217-220.
[26] L. Benini, D. Bertozzi, "Xpipes: A Network-on-chip Architecture for Gigascale Systems-on-chip", IEEE Circuits and Systems Magazine, vol. 4, issue 2, 2004, pp. 18-31.
[27] A. Radulescu, J. Dielissen, S. G. Pestana, O. P. Gangwal, E. Rijpkema, P. Wielage, K. Goossens, "An Efficient On-Chip NI Offering Guaranteed Services, Shared-Memory Abstraction, and Flexible Network Configuration", IEEE Transactions on CAD of Integrated Circuits and Systems, vol. 24, no. 1, January 2005, pp. 4-17.
[28] S. Stergiou, G. De Micheli, F. Angiolini, D. Bertozzi, S. Carta, L. Raffo, "xPipesLite: A Synthesis Oriented Design Library for Networks on Chips", Proceedings of Design, Automation and Test in Europe (DATE), Munich, Germany, March 7-11, 2005, pp. 1188-1193.
[29] P. P. Pande, C. Grecu, A. Ivanov, R. Saleh, G. De Micheli, "Design, Synthesis and Test of Networks on Chip: Challenges and Solutions", IEEE Design & Test of Computers, September/October 2005.
[30] D. Bertozzi, A. Jalabert, S. Murali, R. Tamhankar, S. Stergiou, L. Benini, G. De Micheli, "NoC Synthesis Flow for Customized Domain-specific Multiprocessor System on Chip", IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 2, February 2005, pp. 113-129.
[31] E. Cota, L. Carro, F. Wagner, M. Lubaszewski, "Power-aware NoC Reuse on the Testing of Core-based Systems", Proceedings of the International Test Conference (ITC), vol. 1, Sept. 30-Oct. 2, 2003, pp. 612-621.
[32] P. G. Paulin, C. Pilkington, E. Bensoudane, "StepNP: a System-level Exploration Platform for Network Processors", IEEE Design & Test of Computers, vol. 19, issue 6, Nov.-Dec. 2002, pp. 17-26.
[33] P. van der Wolf, E. de Kock, T. Henriksson, W. Kruijtzer, G. Essink, "Design and Programming of Embedded Multiprocessors: An Interface-Centric Approach", Proceedings of CODES+ISSS, Stockholm, Sweden, September 8-10, 2004, pp. 206-217.
[34] H.-S. Wang, L.-S. Peh, S. Malik, "A Power Model for Routers: Modeling Alpha 21364 and InfiniBand Routers", Proceedings of the 10th Symposium on High Performance Interconnects, Stanford, California, USA, 2002, pp. 21-27.
[35] R. I. Greenberg, L. Guan, "An Improved Analytical Model for Wormhole Routed Networks with Application to Butterfly Fat-Trees", Proceedings of the International Conference on Parallel Processing, 1997, pp. 44-48.
[36] I. Sutherland, B. Sproull, D. Harris, Logical Effort: Designing Fast CMOS Circuits, Morgan Kaufmann, 1999.
[37] C. B. Stunkel, R. Sivaram, D. K. Panda, "Implementing Multidestination Worms in Switch-Based Parallel Systems: Architectural Alternatives and their Impact", Proceedings of the 24th ACM Annual International Symposium on Computer Architecture (ISCA-24), June 1997.
[38] H.-S. Wang, L.-S. Peh, S. Malik, "A Technology-aware and Energy-oriented Topology Exploration for On-chip Networks", Proceedings of Design, Automation and Test in Europe (DATE), Munich, Germany, March 7-11, 2005.
[39] T. Chelcea, S. M. Nowick, "A Low-latency FIFO for Mixed-clock Systems", Proceedings of the IEEE Computer Society Workshop on VLSI, April 27-28, 2000, Orlando, FL, USA, pp. 119-126.
[40] J. Hennessy, D. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, 2003.
[41] V. Raghunathan, M. B. Srivastava, R. K. Gupta, "A Survey of Techniques for Energy Efficient On-chip Communications", Proceedings of Design, Automation and Test in Europe (DATE), Munich, Germany, June 2003, pp. 900-905.
[42] M. Jones, "NoC Sim - A Versatile Network on Chip Simulator", MASc Thesis, March 2005, Department of Electrical & Computer Engineering, University of British Columbia.
[43] K. Park, W. Willinger, Self-similar Network Traffic and Performance Evaluation, John Wiley & Sons, 2000.
[44] D. R. Avresky, V. Shubranov, R. Horst, P. Mehra, "Performance Evaluation of the ServerNet SAN under Self-Similar Traffic", Proceedings of the 13th International and 10th Symposium on Parallel and Distributed Processing, April 12-16, 1999, pp. 143-147.
[45] G. Varatkar, R. Marculescu, "Traffic Analysis for On-chip Networks Design of Multimedia Applications", Proceedings of the Design Automation Conference (DAC), June 10-14, 2002, pp. 510-517.
[46] W. J. Dally, C. L. Seitz, "The Torus Routing Chip", Technical Report 5208:TR:86, Computer Science Department, California Institute of Technology, 1986, pp. 1-19.
[47] A. Jantsch, H. Tenhunen (Eds.), Networks on Chip, Kluwer Academic Publishers, 2003.
[48] Intel IXP2400 datasheet, http://www.intel.com/design/network/products/npfamily/ixp2400.htm
[49] B. Vermeulen, J. Dielissen, K. Goossens, C. Ciordas, "Bringing Communication Networks on a Chip: Test and Verification Implications", IEEE Communications Magazine, September 2003, pp. 74-81.
[50] D. A. Hodges, H. G. Jackson, R. Saleh, Analysis and Design of Digital Integrated Circuits, Third Edition, McGraw-Hill, 2003.
[51] K. S. Stevens, "Energy and Performance Models for Clocked and Asynchronous Communication", Proceedings of the 9th International Symposium on Asynchronous Circuits and Systems, May 2003, Vancouver, Canada, pp. 56-66.
[52] W. J. Dally, "Virtual-channel Flow Control", IEEE Transactions on Parallel and Distributed Systems, vol. 3, no. 2, pp. 194-205, March 1992.
Dally, "Virtual-channel Flow Control," IEEE Transaction on Parallel and Distributed Systems, vol.3, no.2, pp. 194-205, March 1992 [53] P. P.  Pande, C. Grecu, A . Ivanov, R. Saleh, "High-Throughput Switch-Based  Interconnect for Future SoCs", Proceedings of 3  rd  IEEE International Workshop on  System-on-Chip for Real-Time Applications, June 30-July 2, 2003, Calgary, Canada, pp. 304-310. [54] B . Wang, Y . Wu and A . Ivanov, "Designs for Reducing Test Time of Distributed Small Embedded SRAMs", Proceedings of International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT'04), Oct. 10-13, 2004, pp. 120-128. [55] M . Cuviello, S. Dey, B . Xiaoliang, Y . Zhao, "Fault Modeling and Simulation for Crosstalk in System-on-chip Interconnects", Proceedings of International Conference on CAD, (ICCAD, 1999), Nov. 7-11, 1999, San Jose, U S A , pp. 297-303. [56] D. Bertozzi, L . Benini, G. De Micheli, "Low Power Error Resilient Encoding for onchip Data Buses", Proceedings of Design, Automation and Test in Europe (DATE 02), March 4-8, 2002, Paris, France, pp. 102-109 [57]C. Grecu, P. P. Pande, B. Wang, A . Ivanov, R. Saleh, "Methodologies and Algorithms for Testing Switch-based NoC Interconnects", Proceedings of  IEEE  International Workshop on Infrastructure IP (I-IP), May 4-5 2005, Palm Springs, California, U S A  107  Appendix 1. Theory of Logical Effort  Logical effort provides a method for fast "back-of-the-envelope" estimates of delay in a CMOS circuit. The basic theory of logical effort and how it can be used to determine the delay through a logic gate is described briefly here.  in  Fig. 1: Delay for inverter driving a load The delay of a single-stage inverter as shown in Fig. 1 is given as:  * delay  ^eff  ^out  ^self  (1)  _  where R ff is the output resistance of the inverter, C e  o u t  is the output capacitance driven by  it and C if its own drain-source capacitance. Equation (1) can be re-arranged as follows se  r * delay  ^eff^in  ^out  C i  self  c. c,„  108  = T.  c.  •+Vin  V  (2)  where Yi  nv  C.  'self  is the ratio of self-capacitance to input capacitance for the inverter, i.e. , and  Yin  Ti v n  is the intrinsic delay of the inverter. Now if there is a chain of  inverters as shown in Fig. 2, then the delay through the stages can be calculated as  (3)  total _ delay = ^ T  inv  7=1  j+1  j-1  in  'load  Fig. 2: Inverter chain Chains of N A N D and NOR gates can be handled similarly. For the N A N D chain the total delay can be derived as  total _ delay =  where  T  n a n  d  c nand  nand  is the intrinsic time constant for the N A N D gate, and  (4)  A™,nd  is the ratio of the  self-capacitance to the input gate capacitance. Similar expression can be derived for a NOR  chain. Typical logic paths in a digital circuit will have variety of gates.  Consequently the more generic expression for the total delay for a chain of gates will be given as follows  109  total _ delay nand =t  C  Si  nand + r,  c 7+3  +r  (5)  in  The total path delay normalized with respect to the intrinsic inverter delay can be expressed as  c total _ delay r, nand  D = (LE V  ,FO,+P 1  nand  (  nand  +•  Cj 2  +r ,  +  ir  ,) + (LE. FO.+P. ) + (LE nand )  \  inv  2  i«v /  (c  +-  ;+3  + Vn,  FO. + P )  \  nor  3  (6)  nor)  LE = logical effort = -J2!L FO. = fanout  r  t  '  P = parasitic term=LEgale xy Each gate produces a term (LExFO + P) in the delay equation. 
Consequently, equation (6) can be rewritten as

$$D = \sum \left(LE \times FO + P\right) \qquad (7)$$

For a path comprised of multiple logic gates, the logical effort along the path, called the path logical effort ($LE_p$), is calculated as the product of the logical efforts of all gates along the path:

$$LE_p = \prod LE_{gate} \qquad (8)$$

The path effective effort, or path fan-out $FO_p$, can be defined as the ratio of the load capacitance of the last gate of the path to the input capacitance of the first gate:

$$FO_p = \frac{C_{load}}{C_{in}} \qquad (9)$$

When fan-out occurs at the output of a node, some of the available drive current is directed along the analyzed path and some branches out of the path. To account for logical fan-out within the path, we use the branching factor ($BF$) of a logic gate:

$$BF = \frac{C_{on\text{-}path} + C_{off\text{-}path}}{C_{on\text{-}path}} \qquad (10)$$

The path branching effort $BE_p$ is defined as the product of the branching factors along the path:

$$BE_p = \prod BF_{gate} \qquad (11)$$

Finally, the total path effort $PE$ can be defined as:

$$PE = FO_p \prod BF_{gate} LE_{gate} = FO_p \cdot BE_p \cdot LE_p \qquad (12)$$

The gate effort that minimizes the path delay, called the stage effort ($SE$), is calculated as

$$SE = \sqrt[N]{PE} \qquad (13)$$

where $N$ is the number of stages (gates) on the path. The minimum delay through the path can therefore be calculated as:

$$D = N \cdot SE + \sum P \qquad (14)$$
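A small numerical sketch of Eqs. (8)-(14) follows; the per-gate logical efforts and parasitics used in the example (4/3 and 2 for a 2-input NAND, 5/3 and 2 for a 2-input NOR, 1 and 1 for an inverter) are the textbook values from [36] and serve only as an illustration.

```python
def min_path_delay(gates, fo_path):
    """gates: list of (logical_effort, parasitic, branching_factor) tuples along the
    path; fo_path: load capacitance of the last gate over input capacitance of the
    first gate. Returns (stage_effort, minimum_delay) per Eqs. (12)-(14), in tau_inv units."""
    n = len(gates)
    le_p, be_p, parasitics = 1.0, 1.0, 0.0
    for le, p, bf in gates:
        le_p *= le                      # Eq. (8): path logical effort
        be_p *= bf                      # Eq. (11): path branching effort
        parasitics += p
    pe = fo_path * be_p * le_p          # Eq. (12): total path effort
    se = pe ** (1.0 / n)                # Eq. (13): optimal (equalized) stage effort
    return se, n * se + parasitics      # Eq. (14): minimum path delay

# Example: NAND2 -> INV -> NOR2 driving a load 10x the first input capacitance,
# with no off-path branching (BF = 1 for every stage).
se, d = min_path_delay([(4/3, 2.0, 1.0), (1.0, 1.0, 1.0), (5/3, 2.0, 1.0)], fo_path=10)
print(round(se, 2), round(d, 2))        # stage effort ~2.81, delay ~13.4 tau_inv
```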
