Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Test and fault-tolerance for network-on-chip infrastructures Grecu, Cristian 2008-11-28

You don't seem to have a PDF reader installed, try download the pdf

Item Metadata


24-ubc_2009_spring_grecu_cristian.pdf [ 2.01MB ]
JSON: 24-1.0065579.json
JSON-LD: 24-1.0065579-ld.json
RDF/XML (Pretty): 24-1.0065579-rdf.xml
RDF/JSON: 24-1.0065579-rdf.json
Turtle: 24-1.0065579-turtle.txt
N-Triples: 24-1.0065579-rdf-ntriples.txt
Original Record: 24-1.0065579-source.json
Full Text

Full Text

Test and Fault-Tolerance for Network-on-Chip InfrastructuresbyCristian GrecuB.Sc., Technical University of Iasi, 1996M.A.Sc., The University of British Columbia, 2003A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENT FOR THE DEGREE OF DOCTOR OF PHILOSOPHYinThe Faculty of Graduate Studies(Electrical and Computer Engineering)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)© Cristian Grecu, 2008November 2008iiAbstractThe demands of future computing, as well as the challenges of nanometer-era VLSIdesign, will require new design techniques and design styles that are simultaneouslyhigh-performance, energy-efficient, and robust to noise and process variation. One of theemerging problems concerns the communication mechanisms between the increasingnumber of blocks, or cores, that can be integrated onto a single chip. The bus-basedsystems and point-to-point interconnection strategies in use today cannot be easily scaledto accommodate the large numbers of cores projected in the near future. Network-on-chip(NoC) interconnect infrastructures are one of the key technologies that will enable theemergence of many-core processors and systems-on-chip with increased computingpower and energy efficiency. This dissertation is focused on testing, yield improvementand fault-tolerance of such NoC infrastructures.The motivation for the work is that, with technology scaling into the nanometerrange, defect densities will become a serious challenge for fabrication of integratedcircuits counting billions of transistors. Manufacturing these systems in high volumes canonly be possible if their cost is low. The test cost is one of the main components of thetotal chip cost. However, relying on post-manufacturing test alone for guaranteeing thatICs will operate correctly will not suffice, for two reasons: first, the increased fabricationproblems that are expected to characterize upcoming technology nodes will adverselyaffect the manufacturing yield, and second, post-fabrication faults may develop due toelectromigration, thermal effects, and other mechanisms. Therefore, solutions must bedeveloped to tolerate faults of the NoC infrastructure, as well as of the functional cores.In this dissertation, a fast, efficient test method is developed for NoCs, that exploitstheir inherent parallelism to reduce the test time by transporting test data on multiplepaths and testing multiple NoC components concurrently. The improvement of test timevaries, depending on the NoC architecture and test transport protocol, from 2X to 34X,compared to current NoC test methods. This test mechanism is used subsequently toperform detection of NoC link permanent faults, which are then repaired by an on-chipmechanism that replaces the faulty signal lines with fault-free ones, thereby increasingthe yield, while maintaining the same wire delay characteristics. The solution describedin this dissertation improves significantly the achievable yield of NoC inter-switchchannels – from 4% improvement for an 8-bit wide channel, to a 71% improvement for a128-bit wide channel. The direct benefit is an improved fault-tolerance and increasedyield and long-term reliability of NoC-based multicore systems.iiiTable of ContentsAbstract .........................................................................................................iiTable of Contents.........................................................................................iiiList of Tables................................................................................................. vList of Figures .............................................................................................. viAcknowledgments........................................................................................ ix1 Introduction ........................................................................................... 11.1 Dissertation contribution.................................................................................. 92 Background on Network-on-chip Testing ......................................... 112.1 Introduction..................................................................................................... 112.2 Multi-processor systems-on-chip................................................................... 112.3 Networks-on-chip............................................................................................ 132.4 Network-on-chip test – previous work.......................................................... 142.5 Fault models for NoC infrastructure test ..................................................... 182.5.1 Wire/crosstalk fault models for NoC inter-switch links ......................... 192.5.2 Logic/memory fault models for FIFO buffers in NoC switches ............ 202.6 Summary.......................................................................................................... 233 Test Time Minimization for Networks-on-Chip .............................. 243.1 Test data organization .................................................................................... 243.2 Testing NoC switches...................................................................................... 253.3 Testing NoC links............................................................................................ 263.4 Test data transport ......................................................................................... 283.4.1 Multicast test transport mechanism ........................................................ 303.5 Test scheduling................................................................................................ 353.5.1 Test time cost function............................................................................. 373.5.2 Test transport time minimization............................................................. 393.5.3 Unicast test scheduling ............................................................................ 413.5.4 Multicast test scheduling ......................................................................... 443.6 Experimental results....................................................................................... 483.6.1 Test output evaluation.............................................................................. 483.6.2 Test modes for NoC components............................................................. 493.6.3 Test scheduling results............................................................................. 503.7 Summary.......................................................................................................... 574 Fault-tolerance Techniques for Networks-on-chip .......................... 594.1 Introduction..................................................................................................... 604.2 Traditional fault-tolerance metrics ............................................................... 654.3 Fault-tolerance metrics for network-on-chip subsystems........................... 674.4 Metrics evaluation........................................................................................... 734.5 Summary.......................................................................................................... 825 Fault-tolerant Global Links for Inter-core Communication inNetworks-on-chip ...................................................................................... 835.1 Introduction..................................................................................................... 835.2 Related work.................................................................................................... 84List of Acronyms and Abbreviations .....................................................viiiiv5.3 Problem statement .......................................................................................... 865.4 Interconnect yield modeling and spare calculation ..................................... 885.5 Fault-tolerant NoC links ................................................................................ 905.6 Sparse crossbar concentrators....................................................................... 915.7 Fault tolerant global interconnects based on balanced crossbars.............. 955.8 Link test and reconfiguration mechanisms ................................................ 1035.8.1 Testing the sparse crossbar matrix and interconnect wires ................. 1045.8.2 Link reconfiguration.............................................................................. 1075.8.3 Yield, performance and cost analysis.................................................... 1105.9 Summary........................................................................................................ 1146 Conclusions ........................................................................................ 1156.1 Summary of contributions ........................................................................... 1156.2 Limitations..................................................................................................... 1166.3 Future work................................................................................................... 1177 Appendices ......................................................................................... 1227.1 Appendix 1: Proof of Correctness for Algorithms 1 and 2 ....................... 1227.2 Appendix 2: Algorithm for balancing fault-tolerant sparse crossbars.... 125References ................................................................................................. 129vList of TablesTable 3-1: Unicast test data scheduling for the example in Fig. 3-7........................... 44Table 3-2: Multicast test data scheduling for the example in Fig. 3-7 ....................... 47Table 3-4: Gate count and comparison for the proposed test mechanism ............... 56Table 3-5: Test scheduling run-times........................................................................... 57Table 4-1: Detection latency (10-10flit error rate) ....................................................... 81Table 4-2: Recovery latency (10-10flit error rate) ....................................................... 81Table 5-1: Effective yield improvement vs interconnect complexity ....................... 111Table 5-2: Test and reconfiguration time overhead .................................................. 113viList of FiguresFigure 1-1: a) Regular NoC. b) Irregular NoC. ............................................................. 2Figure 1-2: (a) Global links in NoC-based systems-on-chip; (b) global inter-core linkwith m signal lines; (c) interconnect line spanning multiple metal/via levels...... 5Figure 2-1: MP-SoC platform........................................................................................ 12Figure 2-2: Network-on-chip building blocks in a mesh configuration..................... 14Figure 2-3: Test configurations in [29]: (a) straight paths; (b) turning paths; (c) localresource connections............................................................................................... 15Figure 2-4: Core-based test of NoC routers using an IEEE 1500 test wrapper andscan insertion [30] ................................................................................................... 16Figure 2-5: Test data transport for NoC router testing using (a) multicast and (b)unicast [26]............................................................................................................... 17Figure 2-6: Examples of faults that can affect NoC infrastructures: (a) crosstalkfaults; (b) memory faults in the input/output buffers of the switches; (c)short/open interconnect faults; (d) stuck-at faults affecting the logic gates ofNoC switches............................................................................................................ 18Figure 2-7: MAF crosstalk errors (Y2– victim wire; Y1, Y3– aggressor wires). ...... 20Figure 2-8: (a) 4-port NoC switch – generic architecture; (b) dual port NoC FIFO.22Figure 3-1: Test packet structure .................................................................................. 25Figure 3-2: a) Wire i and adjacent wires; b) Test sequence for wire i; c) Conceptualstate machine for MAF patterns generation. ....................................................... 27Figure 3-3: (a) Unicast data transport in a NoC; (b) multicast data transport in aNoC (S – source; D – destination; U – switches in unicast mode; M – switches inmulticast mode). ...................................................................................................... 29Figure 3-4: 4-port NoC switch with multicast wrapper unit (MWU) for test datatransport. ................................................................................................................. 31Figure 3-5: Multicast route for test packets. ................................................................ 34Figure 3-6: (a), (b): Unicast test transport. (c) Multicast test transport.................... 37Figure 3-7: 4-switch network with unidirectional links. ............................................. 43Figure 3-8: Test packets processing and output comparison...................................... 49Figure 4-1: Processes communicating across a NoC fabric ........................................ 68Figure 4-2: Hierarchical partitioning for fault tolerant NoC designs........................ 68Figure 4-3: Average detection latency for end-to-end (e2e), switch-to-switch (s2s),and code-disjoint (cdd) detection schemes............................................................ 74Figure 4-4: Average recovery latency for equal priority recovery (epr) and prioritybased recovery (pbr)............................................................................................... 76Figure 4-5: Average recovery latency for pbr scheme with variable flit error rate . 78Figure 4-6: Average message latency vs. bound on MAP. .......................................... 79Figure 4-7: Processes P1, P2mapped on a mesh NoC with QoS communicationconstraints................................................................................................................ 80Figure 4-8: Performance and cost of detection techniques. ........................................ 80Figure 5-1: (a) Non-fault-tolerant sparse crossbar and crosspoint implementation;(b) n-m fault-tolerant sparse crossbar. ................................................................. 92Figure 5-2: Fastest and slowest propagation delay paths for non-balanced fault-tolerant links............................................................................................................ 93viiFigure 5-3: Delay variation for imbalanced fault-tolerant links with 25%redundancy. ............................................................................................................. 84Figure 5-4: Balancing an (n-m) fault-tolerant sparse crossbar through successivecolumn exchange operations; m=5, n=11.............................................................. 98Figure 5-5: Delay variation for balanced (b) links with 25% redundancy.............. 100Figure 5-6: Delay variation versus degree of redundancy for a 64-bit link ............ 102Figure 5-7: Delay variation versus crossbar size and degree of redundancy.......... 102Figure 5-8: Self-test and repair link architecture with shared test andreconfiguration blocks.......................................................................................... 104Figure 5-9: Link-level test and reconfiguration. ........................................................ 105Figure 5-10: Test and reconfiguration hardware. ..................................................... 107Figure 5-11: Link reconfiguration algorithm............................................................. 108Figure 5-12: Physical link width (n) for different values of logic link width (m = 32,64, 128 bits) and target effective yield (Yeff). ...................................................... 110Figure 5-13: Area overhead for different crossbar sizes and degrees of redundancy.................................................................................................................................. 112viiiATE: Automated Test EquipmentATPG: Automated Test Pattern GenerationCAD: Computer Aided DesignCMOS: Complementary Metal Oxide SemiconductorCMP: Chip Multi-ProcessorsCPU: Central Processing UnitDFT: Design for TestDFM: Design for ManufacturabilityDRC: Design Rule CheckingDSP: Digital Signal ProcessorFIFO: First In First OutFO4: Fan-out of fourFPGA: Field Programmable Gate ArrayIC: Integrated CircuitIP: Intellectual PropertyISO: International Organization for StandardizationITRS: International Technology Roadmap for SemiconductorsI/O: Input/OutputMAF: Maximum Aggressor FaultMAP: Message Arrival ProbabilityMP-SoC: Multi-processor System-on-chipMWU: Multicast Wrapper UnitNMR: N-Modular RedundancyNoC: Network-on-chipOSI: Open Systems InterconnectionQoS: Quality of ServicesRISC: Reduced Instruction Set ComputerRLB: Routing Logic BlockSoC: System-on-chipTAM: Test Access MechanismTTPE: Test Time per ElementTT: Test TimeVLSI: Very Large Scale IntegrationList of Acronyms and AbbreviationsixAcknowledgmentsI would like to thank my research supervisors, Professors André Ivanov and ResveSaleh, for their continued support along my graduate studies. Without their guidance andhelp, this work would have not been possible.I would also like to thank my colleague and friend Partha Pande, for his advice,participation and help in my research.Special thanks are due to my wife, Gabriela, who supported me unconditionally allthese years.This work was partially supported by NSERC of Canada and Micronet.1Chapter 11 IntroductionThe microprocessor industry is moving from a single-core processor to multi-coreand, in the foreseeable future, to many-core processor architectures built with tens tohundreds of identical cores arranged as chip multi-processors (CMPs) [1]. Anothersimilar trend can be seen in the evolution of systems-on-chip (SoCs) from single-processor systems with a set of peripherals to heterogeneous multi-core systems,composed of several different types of processing elements and peripherals (memoryblocks, hardware accelerators, custom designed cores, input/output blocks, etc.) [2].Microprocessor vendors are also venturing forwards to mixed approaches, combiningmultiple identical processors with different types of cores, such as the AMD Fusionfamily [3], which combines together multiple identical CPU cores and a graphicsprocessor in one design.Such multi-core, multi-processor systems, whether homogeneous, heterogeneous, orhybrid, must be interconnected in a manner that is high-performance, scalable, andreliable. The emerging design paradigm that targets such interconnections is called an on-chip interconnection network, or network-on-chip (NoC). The basic idea of the NoCparadigm can be summarized as “route packets, not wires” [4]. Interconnecting coresthrough an on-chip network has several advantages over dedicated point-to-point orbus-based wiring, offering potential advantages in terms of high aggregate bandwidth,2low communication latency, low power dissipation in inter-core communication, andincreased scalability, modularity and flexibility.The potential of NoCs is, however, far from being fully realized, and their practicalimplementation presents numerous challenges [5]. The first set of challenges is associatedwith meeting power/performance targets. The second set of issues relates to CAD toolsand design methodologies to cope with the typical number of devices (in the range of onebillion transistors or more), and the high degree of hardware and software parallelism thatis characteristic of NoC-based systems.Network-on-chip architectures were proposed as a holistic solution for a set ofchallenges faced by designers of large, multi-core systems-on-chip. In general, two typesof NoC architectures can be identified, as shown in Fig. 1-1: (a) regular interconnectionstructures derived from parallel computing, and (b) irregular, custom-built NoC fabrics.The infrastructure for NoC includes switches, inter-switch wires, interface blocks toconnect to the cores, and a protocol for data transmission [6]. There is a significantnumber of active research efforts to bring NoCs into mainstream use [7].(a) (b)- Functional core- SwitchFigure 1-1: a) Regular NoC. b) Irregular NoC.3The power dissipation of current NoC implementations is estimated to besignificantly higher (in the order of 10 times or more) compared to the expected powerbudget of future CMP interconnects [5]. Both circuit-level and architecture-leveltechniques must be combined in order to reduce the power dissipation of NoC-basedsystems and their data-communication sub-systems to acceptable values.The data transport latency of NoC infrastructures is too large for the currentprogramming models, leading to significant performance degradation, especially in thecase of on-chip memory access operations. Here, various research efforts are focused onreduced-latency router microarchitecture design [8], circuit techniques that reduce signalpropagation time through NoC channels [9], and network architectures with reducednumber of hops for latency-critical data transfers [10].From a CAD perspective, most of the NoC architectures and circuit techniques areincompatible with the current design flows and tools, making them difficult to integratein the typical SoC design flow. The fundamental reason is the high degree of parallelismat both computation and data transport levels, combined with the different ways in whichthe software and hardware components of a NoC interact with each other.A major challenge that SoC designers are expected to face [11] is related to theintrinsic unreliability of the communication infrastructure due to technology limitations.Two different sources can be identified: random and systematic. The random defectsarise from the contaminants, particles and scratches [12] that occur during the fabricationprocess. The systematic defects have roots in the photolithographic process, chemical-mechanical polishing methods and the continuous feature size shrinking in semiconductorfabrication technology. Chips are increasingly prone to feature corruption producing4shorts and opens in the wires, and missing vias, or vias with voids, as a result ofphotolithographic limitations. In addition, the effect of steps such as chemical-mechanicalpolishing may cause surface dishing through over-polishing, ultimately becomingimportant yield-loss factors [13].Metal and via layers in advanced CMOS processes are characterized by defectdensities correlated with their specific physical dimensions. Upper layers of metal tend tobe wider and taller due to the expected current levels. The fabrication of wider wires doesnot present as much of a problem to photolithographic techniques; however, the vias tendto be tall and thin and are prone to voids. Therefore, via duplication or via arrays areemployed to circumvent this issue.Producing fine-lined wires presents the greatest challenge to the resolution of themost advanced photolithographic capabilities available today. Mitigating all these generalobservations is the fact that, in a typical foundry, older generations of equipment are usedto fabricate the middle and upper layers of metal, each with their own yieldcharacteristics. It is expected that the overall yield of interconnects will continue todecrease as the number of metal layers increases (e.g., 12 metal layers for 45 nmtechnology, compared with 9 layers for a typical 90nm process, and 5 layers for a 180nmprocess), as projected by the ITRS documents [14].The impact on the overall yield of multi-core chips [15] can be quite significant,given that the data communication infrastructure alone can consume around 10-15% ofthe total chip area [16]. The inter-core communication links are likely to be routed on themiddle and upper metal layers. Hence, the layout scenario is very similar to the exampleshown in Fig. 1-2, which illustrates a multi-processor SoC built on a network-on-chip5platform, consisting of functional cores, switching blocks and global links. The activedevices (processing cores, memory blocks) are on the lower levels on the silicon surface,and the inter-core wires are on the higher levels of the 3D structure [17]. Global signallines span more metal layers and require more levels of vias in order to go to/from theactive devices on the silicon surface and therefore are more prone to manufacturingdefects.Figure 1-2: (a) Global links in NoC-based systems-on-chip; (b) global inter-core link with msignal lines; (c) interconnect line spanning multiple metal/via levels.Many authors have tried to make the case for the network-on-chip paradigm bystating that “wires are free” and consequently an interconnect architecture consisting ofmany multi-wire links is, by default, feasible and efficient. In fact, wires are expensive interms of the cost of the masks that are required in the successive steps of the fabricationof interconnect metal layers through photolithography. Moreover, they are not free interms of the possible defects that can appear due to technology constraints specific to an6increase in the number of metal layers for every process generation, and continuousincrease in the aspect ratio of wires (the ratio between their vertical and horizontaldimensions). While the wires on the lower layers can be made with an aspect ratio of lessthan one, the interconnects on middle and upper metal layers require an aspect ratio oftwo or higher. This, in turn, places an increased difficulty in achieving reliable contactbetween wires on different layers, since the vias between such wires will need to berelatively tall and narrow, and as a consequence more likely to exhibit open or resistivedefects.Currently, via duplication is the solution that attempts to address this issue byinserting redundant vias in the routing stage of the chip layout design phase. This methodfor improving interconnect yield is ad-hoc in nature and can only provide results ifenough space is left on the upper level metal layers to insert the redundant vias whilepreserving the design rules. Most of the authors of technical articles on the topic of viaduplication state clearly that, ideally, the solution for interconnect yield improvementshould be pushed earlier in the design flow of ICs, more preferably in the front-end ratherthan in the back-end. A universal front-end solution for improving interconnect yield hasnot yet been found. However, by examining the categories of interconnect that are moreprone to defects, it is possible to develop custom solutions targeting particularinterconnect types.That is, in order to ensure correct fabrication, faults must be detected through post-fabrication testing, and, possibly, compensated for through fault-tolerant techniques.When looking in particular at design, test, manufacturing and the associated CADtools in the NoC design flow in the context of shrinking transistor size and wire7dimensions, it is clear that fabrication defects and variability are significant challengesthat are often overlooked. On-chip networks need mechanisms to ensure correctfabrication and life-time operation in presence of new defect/fault mechanisms, processvariability, and high availability requirements.Traditionally, correct fabrication of integrated circuits is verified by post-manufacturing testing using different techniques ranging from scan-based techniques todelay test and current-based test. Due to their particular nature, NoCs are exposed to awide range of faults that can escape the classic test procedures. Among such faults wecan enumerate crosstalk, faults in the buffers of the NoC routers, and higher-level faultssuch as packet misrouting and data scrambling [6]. These fault types add to the classicfaults that must be tested after fabrication for all integrated circuits (stuck-at, opens,shorts, memory faults, etc.). Consequently, the test time of NoC-based systems increasesconsiderably due to these new faults.Test time is an important component of the test cost and, implicitly, of the totalfabrication cost of a chip. For large volume production, the total chip testing time must bereduced as much as possible in order to keep the total cost low. The total test time of anIC is governed by the amount of test data that must be applied and the amount ofcontrollability/observability that the design-for-test (DFT) strategy chosen by designerscan provide. The test data increases with chip complexity and size, so the option the DFTengineers are left with is to improve the controllability/observability. Traditionally, this isachieved by increasing the number of test inputs/outputs, but this has the same effect ofincreasing the total cost of the IC.8DFT techniques such as scan-based test improve the controllability and observabilityof IC internal components by serializing the test input/output data and feeding/extractingit to/from the IC through a reduced number of test pins. The trade-off is the increase intest time and test frequency, which makes at-speed test difficult using scan-basedtechniques. While scan-based solutions are useful, their limitations in the particular caseof NoC systems demand the development of new test data generation and transportmechanisms that simultaneously minimize the total test time and the number of test I/Opins.In this work, two types of solutions are proposed to reduce the test time of NoC datatransport infrastructures: first, we use a built-in self-test solution to generate test datainternally, eliminating the need to inject test data using I/O pins. Second, we replace theuse of a traditional dedicated test access mechanism (TAM) with a test transportmechanism that reuses the NoC infrastructure progressively to transport test data to NoCcomponents under test. This approach has the advantage that it exploits the inherent highdegree of parallelism of the NoC, thus allowing the delivery of multiple test vectors inparallel to multiple NoC components under test. A multicast test delivery mechanism isdescribed, with one test data source that sends test data in a parallel fashion along thesubset of already tested NoC components. The test data routing algorithm is optimizedoff-line, deterministically, such that the shortest paths are always selected for forwardingtest packets. This techniques guarantees that the test transport time is minimized which,together with the advantage of testing multiple NoC components in parallel, yields areduced test time for the overall NoC.9An effective and efficient test procedure is necessary, but not sufficient to guaranteethe correct operation of NoC data transport infrastructures during the life-time of theintegrated circuits. Defects may appear later in the field, due to causes such aselectromigration, thermal effects, material ageing, etc. These effects will become morepronounced with continuous downscaling of device dimensions beyond 65 nm andmoving towards the nanoscale domain. New methods are needed to enhance the yield ofthese links to make them more reliable. The fault-tolerant solution using reconfigurablecrossbars and redundant links developed in this work is aimed specifically at the NoClinks, and allows both post-fabrication yield tuning and self-repair of links that may breakdown later in the life-cycle of the chip.1.1 Dissertation contributionThis dissertation offers three major contributions:1. A complete NoC test methodology, including the hardware circuitry andscheduling algorithms, which together minimize the test time bydistributing test data concurrently to multiple components of the NoCfabric.2. An evaluation of various fault-tolerance mechanisms for NoCinfrastructures, and two new metrics relevant to quality-of-services onNoCs.3. A fault-tolerant design method for NoC links that allows fine yield tuningand life-cycle self-repair of NoC interconnect infrastructures. This methoduses the above-mentioned test mechanism to diagnose and identify thefaulty interconnects.10The rest of this dissertation is organized as follows: Chapter 2 presents backgroundon network-on-chip test aspects and fault models. Chapter 3 presents contribution (1) - atest methodology and scheduling algorithms that minimize the test time of NoC ofarbitrary topologies. Chapter 4 presents contribution (2) - an overview and evaluation offault-tolerance mechanisms and two proposed metrics relevant to NoCs. Chapter 5describes contribution (3) - a method to provide design fault-tolerant NoC links based onsparse crossbars and tune the yield of NoC links based on the expected defect rate.Chapter 6 concludes the dissertation, outlines a few limitations of the proposedapproaches, and provides a set of possible future research directions that can be pursuedas a direct follow-up to the contributions presented in this dissertation.11Chapter 22 Background on Network-on-chip Testing2.1 IntroductionSystem-on-Chip (SoC) design methodologies are currently undergoing revolutionarychanges, driven by the emergence of multi-core platforms supporting large sets ofembedded processing cores. These platforms may contain a set of heterogeneouscomponents with irregular block sizes and/or homogeneous components with a regularblock sizes. The resulting platforms are collectively referred to as multi-processor SoC(MP-SoC) designs [7]. Such MP-SoCs imply the seamless integration of numerousIntellectual Property (IP) blocks performing different functions and exchanging datathrough a dedicated on-chip communication infrastructure. A key requirement of theseplatforms, whether irregular or regular, is a structured interconnect architecture. Thenetwork-on-chip (NoC) architecture is a leading candidate for this purpose [18].2.2 Multi-processor systems-on-chipSince today’s VLSI chips can accommodate in excess of 1 billion transistors, enough totheoretically place thousands of 32-bit RISC [18] processors on a die, leveraging suchcapabilities is a major challenge. Traditionally, SoC design was based on the use of aslowly evolving set of hardware and software components: general-purpose processors,digital signal processors (DSPs), memory blocks, and other custom-designed hardware IPblocks (digital and analog). With a significant increase in the number of components incomplex SoCs, significantly different design methods are required to cope with the deepsub-micron physical design issues, verification and design-for-test of the resulting SoCs.12One of the solutions that the design community adopted to reduce the design cycle is theplatform-based design paradigm, characterized by the use of higher-level off-the-shelf IPblocks, connected via a modular, scalable SoC interconnect fabric and standardcommunication interfaces and protocols. Such MP-SoC platforms are highly flexible,programmable and/or reconfigurable for application areas such as wireless, multimedia,networking, automotive. A high-level view of the general MP-SoC platform is given inFig. 2-1. When designing with such platforms, no IP design is performed, butspecification, configuration and assembly of existing IP blocks is greatly facilitated.Figure 2-1: MP-SoC platform [7]MP-SoC platforms will include, in the near future, tens to hundreds of embeddedprocessors, in a wide variety of forms, from general-purpose reduced instruction-setcomputers (RISC) to application-specific instruction-set processors (ASIP) with differenttrade-offs in time-to-market, performance, power and cost. New designs are beingreported from industry [19], with more than 100 embedded heterogeneous processors, forapplications ranging from communications, network processing, to security processors,storage array networks, and consumer image processing.132.3 Networks-on-chipA key component of the MP-SoC platform of Fig. 2-1 is the interconnect fabricconnecting the major blocks. Such a fabric must provide orthogonality, scalability,predictable physical parameters and a plug-and-play design style for integrating varioushard-wired, reconfigurable or programmable IPs. Such architectures must support high-level, layered communication abstraction, and simplify the automatic mapping ofprocessing resources onto the interconnect fabric. Networks-on-chip are particularlysuitable for accomplishing these features.A NoC interconnect fabric is a combination of hardware (switches and inter-switchlinks) and software components (communication and interfacing protocols). NoCs can beorganized in various topologies [20] – mesh, tree, ring, irregular, etc…– and canimplement various subsets of the ISO/OSI communication stack [21]. A well-known 2Dmesh NoC topology is illustrated in Fig. 2-2. The term Network-on-chip is used todaymostly in a very broad sense, encompassing the hardware communication infrastructure,the communication protocols and interfaces, operating system communication services,and the design methodology and tools for NoC syndissertation and application mapping.All these components together can be called a NoC platform [7]. Some authors use theNetwork-on-chip to denote the entire MP-SoC built on a structured, networked fabric –including the IP cores, the on-chip communication medium, application software andcommunication protocols [20]. In this dissertation, the Network-on-chip term refers to theon-chip communication architecture, including the hardware components (switches,links) and communication protocols.14Figure 2-2: Network-on-chip building blocks in a mesh configuration2.4 Network-on-chip test – previous workWhile much work has centered on design issues, much less effort has been directed totesting such NoCs. Any new design methodology will only be widely adopted if it iscomplemented by efficient test mechanisms. In the case of NoC-based chips, two mainaspects have to be addressed with respect to their test procedures: how to test the NoCcommunication fabric, and how to test the functional cores (processing, memory andother modules). Since the inception of SoC designs, the research community has targetedprincipally the testing of the IP cores [22], giving little emphasis to the testing of theircommunication infrastructures. The main concern for SoC test was the design of efficienttest access mechanisms (TAM) for delivering the test data to the individual cores underconstraints such as test time, test power, and temperature. Among the different test accessmechanisms, TestRail [23] was one of the first to address core-based test of SoCs.Recently, a number of different research groups suggested the reuse of thecommunication infrastructure as a test access mechanism [24] [25] [26]. In [27] theauthors assumed the NoC fabric as fault-free and subsequently used it to transport testdata to the functional blocks; however, for large systems, this assumption can beunrealistic, considering the complexity of the design and communication protocols. In15[28], the authors proposed a dedicated TAM based on an on-chip network, wherenetwork-oriented mechanisms were used to deliver test data to the functional cores of theSoC.A test procedure for the links of NoC fabrics is presented in [29], targeted specificallyto mesh topologies. The NoC switches are assumed to be tested using conventionalmethods first, and then three test configurations are applied in series to diagnose potentialfaults of the inter-switch links, as indicated in Fig. 2-3.(a) (b) (c)Figure 2-3: Test configurations in [29]: (a) straight paths; (b) turning paths; (c) localresource connectionsA configuration is set up by adjusting the corresponding destination address fields ofthe transmitted packets to the last row (column) of the network matrix. The threeconfigurations cover all possible link directions in a mesh NoC: vertical, horizontal,turning paths, and local connections to processing resources. The major limitations of thisapproach are: 1) applicability to mesh-based NoC only; 2) a test procedure for NoCswitches is not defined.A different approach is presented in [30], based on an extension of the classicwrapper-based SoC test [23] and scan-chain method [31]. Each NoC switch (or router) issubjected to scan-chain insertion, and the set of all NoC switches is wrapped with an16IEEE 1500 test wrapper and tested using the core-based test approach [23]. The overalltest architecture is presented in Figure 2-4.test inputs output comparisonFigure 2-4: Core-based test of NoC routers using an IEEE 1500 test wrapper and scaninsertion [30]The solution presented in [30] addresses the test of the NoC switches only,overlooking completely the aspect of testing the NoC links. For large NoCs, the methodinherits the limitations of scan-based test when applied to large cores: slow test speed forlong scan chains, the trade-off between the number of test I/Os and scan chain length.The idea of reusing the NoC fabric for delivering test data to the processing elementsappears also in [26], combined with progressive test of NoC routers and overlapping thetest of routers and processing elements in time to reduce the total test time. Both unicastand multicast test transport methods are considered, as shown in Figure 2-5.17(a) (b)Figure 2-5: Test data transport for NoC router testing using (a) multicast and (b) unicast[26]The work presented in [26] does not consider the test of NoC links, and does notshow how to minimize the test transport time in either the unicast or multicast transportmodes. It is also not clear if the proposed solution delivers an optimal test time whencombining test of NoC routers and processing cores.This dissertation is focused on the test of the NoC infrastructure that includes bothNoC switches and inter-switch links. The work complements previous approaches bydeveloping the test strategy for the interconnect infrastructure itself. The test strategies ofNoC-based interconnect infrastructures must address two problems [16]: (i) testing of theswitch blocks; (ii) testing of the inter-switch wire segments. The test procedures of bothswitches and inter-switch links are integrated in a streamlined fashion. Two noveltechniques characterize the proposed solution. The first is the reuse of the already testedNoC components to transport the test data towards the components under test in arecursive manner. The second is employing the inherent parallelism of the NoC structuresto propagate the test data simultaneously to multiple NoC elements under test. Two testscheduling algorithms are provided that guarantee a minimal test time for arbitrary NoCtopologies. In the next section we elaborate the set of fault models used for designing theproposed test method, including scheduling algorithms and on-chip test-related hardware.182.5 Fault models for NoC infrastructure testWhen developing a test methodology for NoC fabrics, one needs to start from a set ofmodels that can realistically represent the faults specific to the nature of NoC as a datatransport mechanism. As stated previously, a NoC infrastructure is built from two basictypes of components: switches and inter-switch links. For each type of component, testpatterns must be constructed that exercise its characteristic faults.- Functional core- Switch- Link‘1’‘0’(a)(b)(d)(c)Figure 2-6: Examples of faults that can affect NoC infrastructures: (a) crosstalk faults; (b)memory faults in the input/output buffers of the switches; (c) short/open interconnect faults;(d) stuck-at faults affecting the logic gates of NoC switches.The range of faults that can affect the NoC infrastructure is significant and it extendsfrom interconnects faults to logic and memory faults. Consequently, the data set requiredto test all these faults is extremely large, and carries a major overhead to the overall testtime of NoC-based integrated circuits. A subset of these faults is represented in Fig. 2-6.In the following subsections the set of faults considered in this work for the NoCswitches and links is presented.192.5.1 Wire/crosstalk fault models for NoC inter-switch linksCuviello et al. [32] proposed a novel fault model for the global interconnects of DSMSoCs that accounts for cross-talk effects between a set of aggressor lines and a victim line.This fault model is referred to as Maximum Aggressor Fault (MAF) and it occurs whenthe signal transition on a single interconnect line (called the victim line) is affectedthrough cross-talk by transitions on all the other interconnect lines (called the aggressors)due to the presence of the crosstalk effect. In this model, all the aggressor lines switch inthe same direction simultaneously.The MAF model is an abstract representation of the set of all defects that can lead toone of the six crosstalk errors: rising/falling delay, positive/negative glitch, andrising/falling speed-up. The possible errors corresponding to the MAF fault model arepresented in Fig. 2-7 for a link consisting of 3 wires. The signals on lines Y1and Y3actas aggressors, while Y2is the victim line. The aggressors act collectively to produce adelay, glitch or speed-up on the victim.This abstraction covers a wide range of defects including design errors, design rulesviolations, process variations and physical defects. For a link consisting of N wires, theMAF model assumes the worst-case situation with one victim line and (N-1) aggressors.For links consisting of a large number of wires, considering all such variations isprohibitive from a test coverage point of view [31].The transitions needed to sensitize the MAF faults can be easily derived from Fig. 2-7based on the waveform transitions indicated. For an inter-switch link consisting of Nwires, a total of 6N faults need to be tested, and requiring 6N 2-vector tests. These 6NMAF faults cover all the possible process variations and physical defects that can cause20any crosstalk effect on any of the N interconnects. They also cover more traditional faultssuch as stuck-at, stuck-open and bridging faults.(c)fault-free signalsignal affected by MAFY , Y : aggressor lines1 3Y : victim line2Y1Y2Y3“0”(a)gp(positiveglitch)Y1Y2Y3“1”(b)gn(negativeglitch)Y1Y2Y3(d)df(delayedfall)Y1Y2Y3dr(delayedrise)Y1Y2Y3(e)sr(speedyrise)Y1Y2Y3(f)sf(speedyfall)Figure 2-7: MAF crosstalk errors (Y2– victim wire; Y1, Y3– aggressor wires).2.5.2 Logic/memory fault models for FIFO buffers in NoC switchesNoC switches generally consist of a combinational block in charge of functions suchas arbitration, routing, error control, and first-in/first-out (FIFO) memory blocks thatserve as communication buffers [33][34]. Fig. 2-8(a) shows the generic architecture of a21NoC switch. As information arrives at each of the ports, it is stored in FIFO buffers andthen routed to the target destination by the routing logic block (RLB).The FIFO communication buffers for NoC fabrics can be implemented as registerbanks [35] or dedicated SRAM arrays [36]. In both cases, functional test is preferable dueto its reduced time duration, good coverage, and simplicity.The block diagram of a NoC FIFO is shown in Fig. 2-8(b). From a test point of view,the NoC-specific FIFOs fall under the category of restricted two-port memories. Due tothe unidirectional nature of the NoC communication links, they have one write-only portand one read-only port, and are referred to as (wo-ro)2P memories. Under theserestrictions, the FIFO function can be divided in three ways: the memory-cells array, theaddressing mechanism, and the FIFO-specific functionality [37].22RLB()routing logicblockFIFO FIFO(a)WRITE PORTREAD PORTFFWOEFROWCK RCKWD0WD1WDn-1RD0RD1RDn-1B(b)Figure 2-8: (a) 4-port NoC switch – generic architecture; (b) dual port NoC FIFO.Memory array faults can be stuck-at, transition, data retention or bridging faults [31].Addressing faults on the RD/WD lines are also important as they may prevent cells frombeing read/written. In addition, functionality faults on the empty and full flags (EF and FF,respectively) are included in our fault models set [37].232.6 SummaryIn this chapter, the problems of NoC testing and prior work in this area weredescribed. Then, the set of fault models that used in this work for developing the testscheduling algorithms and the associated on-chip test hardware were detailed. Differentfault models are outlined for testing NoC channels and routers. The choice forconstructing test vectors for NoC links is the maximum aggressor fault (MAF) whichtakes the worst-case crosstalk scenario into consideration, with the benefit that it alsocovers other, more traditional faults (opens, shorts, stuck-at). For the input/output buffersof the NoC routers we use memory-specific fault models which take into account thedual-port characteristic of the FIFO buffers and allow functional test for thesecomponents. The routing logic blocks of the NoC switches are simply tested using classicstuck-at, open, and short fault models.24Chapter 33 Test Time Minimization for Networks-on-Chip1A significant portion of the total production cost of an IC is represented by itsassociated test procedures. A direct measure of an IC’s test cost is the time spent fortesting it. With increasing transistor-count and complexity, multi-core SoCs pose aparticular challenge in terms of keeping the test time under reasonable limits. Muchresearch effort is invested in minimizing the test time of large SoCs, and, consequently,the total production cost. In the case of NoC-based MP-SoCs, the test time of the NoCinfrastructure adds to the total IC production cost. Reducing the NoC test time contributesto lowering the total SoC test time, and, implicitly, production cost. This chapter presentsa test method and corresponding hardware circuitry that minimize the test time of NoCinterconnect fabrics.3.1 Test data organizationA key feature that differentiates a NoC from other on-chip communication media is thetransmission of data in form of packets [4]. In the approach proposed here, the raw testdata, obtained based on the fault models and assumptions outlined in Chapter 2, areorganized into test packets that are subsequently directed towards the differentcomponents of the NoC under test. Test data is organized into packets by adding routing1This chapter is based on work published in:1. C. Grecu, P.P. Pande, B. Wang, A. Ivanov, R. Saleh, " Methodologies and algorithms for testingswitch-based NoC interconnects", IEEE Symposium on Defect and Fault Tolerance in VLSISystems, 2005, DFT '05, Oct. 2005.2. C. Grecu, A. Ivanov, R. Saleh, P.P. Pande, "Testing Networks-on-chip CommunicationInfrastructures", IEEE Transactions on Computer Aided Design, Volume 26, Issue 12, Dec. 2007.25and control information to the test patterns generated for each NoC component. Therouting information is similar to that of functional data packets and identifies theparticular NoC component towards which the test packet is directed. The controlinformation consists of fields that identify the packet type (e.g., test packet) and type ofthe NoC component under test (inter-switch link, switch combinational block, FIFO). Atthe beginning and the end of each test packet, dedicated fields signal the start and stopmoments of the test sequence, respectively.T_startT_stopTest_controlFigure 3-1: Test packet structureThe test set for the faults presented in Chapter 2 was developed based on [32] and[37], and includes test vectors for inter-switch links, FIFO buffers, and routing logicblocks of the NoC switches. The generic structure of a test packet is shown in Fig. 3-1.The test header field contains routing information and packet identification informationwhich denotes the type of data being carried (test data). The second field (test control)contains the type of test data, i.e., interconnect, FIFO, or RLB (routing logic block) testdata. The test data field is bordered by corresponding flags (T_start and T_stop) markingits boundaries.3.2 Testing NoC switchesWhen a test packet arrives at an untested switch, the payload is unpacked to extract thetest data. NoC switches can be tested with standard logic and memory test methods.26Scan-based testing [31] is adopted for the routing logic blocks of NoC switches, whilefunctional test [37] is used for the communication buffers (FIFO memory blocks). Testpatterns are generated for the RLB and FIFO blocks separately so that the test vectors canbe optimized for each type of block. The test of the logic part of the switch (the RLBblock) is performed while it is isolated from the rest of the switch.Assuming the FIFOs are B bits wide and have n locations, the test uses B-bit patterns.As an example, consider the detection of a bridging fault, i.e., a short between the bitlinesbiand bj(i ≠ j), that can eventually yield an AND or OR behaviour. In order to detectsuch faults, four specific test patterns are used: 0101…, 1010…, 0000…, and 1111…,denoted by1G , 1G ,2G , and 2G , respectively [37] . Specifically, to test the dual-portcoupling faults, the following sequence is used:11nw rw rfor each of the four test patterns above. The first write operation (denoted by w in theexpression above) sets the read/write pointers to FIFO cells 0 and 1, respectively; thenext (n-1) simultaneous read(r)/write(w) operations (denoted by11nwr ) sensitize thecoupling faults between adjacent cells, and the last read operation (denoted by r) emptiesthe FIFO and prepares it for the next test pattern. All other standard tests proceed in asimilar manner.3.3 Testing NoC linksTesting for wire faults and crosstalk effects can be carried out together as follows.According to the MAF fault model, each possible MAF on a victim line of a link requiresa two-vector test sequence to be sensitized. The test sequence exhibits some usefulproperties which allow for a compact and efficient design of the MAF test packets:27 Property (a): For each test vector, the logic values on the aggressor lines are theopposite of that on the victim line; Property (b): After having applied the exhaustive set of test sequences for aparticular victim line, the test sequence of the adjacent victim line can be obtainedby shifting (rotating) the test data by exactly one bit.If wire i in Fig. 3-2(a) is to be tested for 6 MAF faults, then twelve vectors areimplied, due to the two-vector test needed for each case. However, the transitions fromone test vector to another can be concatenated such that the number of test vectors neededto sensitize the MAF faults can be reduced from twelve vectors per wire to eight, asshown in Fig. 3-2(b) and (c).(a) (b) (c)Figure 3-2: a) Wire i and adjacent wires; b) Test sequence for wire i; c) Conceptual statemachine for MAF patterns generation.The test data packets are designed based on Properties (a) and (b) above, bygenerating the logical values corresponding to the MAF tests in eight distinct states s1 tos8. In an s1-to-s8 cycle, the state machine produces eight vectors. During each cycle, oneline is tested and is assigned the victim logic values, while the rest of the lines get the28aggressor values. The selection of the victim wire is achieved through the victim linecounter field that controls the test hardware such that for the first eight test cycles, thefirst wire of the link is the victim. During the second set of eight test cycles, the secondwire is the victim, and so on. After each eight-vector sequence, the test patterns shift byone bit position, and an identical eight-vector sequence is applied with a newcorresponding wire acting as the victim. This procedure repeats until all the lines of thelink are completed.3.4 Test data transportThis section describes the NoC modes of operation and a minimal set of features thatthe NoC building blocks must possess for packet-based test data transport. Asystem-wide test transport mechanism must satisfy the specific requirements of the NoCfabric and exploit its highly-parallel and distributed nature for an efficient realization. Infact, it is advantageous to combine the testing of the NoC inter-switch links with that ofthe other NoC components (i.e., the switch blocks) in order to reduce the total silicon areaoverhead. The high degree of parallelism of typical NoCs allows simultaneous test ofmultiple components. However, special hardware may be required to implement paralleltesting features.Each NoC switch is assigned a binary address such that the test packets can be directedto particular switches. In the case of direct-connected networks, this address is identicalto the address of the IP core connected to the respective switch. In the case of indirectnetworks (such as BFT [38] and other hierarchical architectures [39]), not all switches areconnected to IP cores, so switches must be assigned specific addresses in order to be29targeted by their corresponding test packets. Considering the degree of concurrency ofthe packets being transported through the NoC, two cases can be distinguished:Unicast mode: the packets have a single destination [40]. This is the more commonsituation and it is representative for the normal operation of an on-chip communicationfabric, such as processor cores executing read/write operations from/into memory cores,or micro-engines transferring data in a pipeline [41]. As shown in Fig. 3-3(a), packetsarriving at a switch input port are decoded and directed to a unique output port, accordingto the routing information stored in the header of the packet (for simplicity, functionalcores are not shown in Fig. 3-3). Test packets are injected at the source switch (denotedby S in Fig. 3-3) and transported towards the destination switch (denoted by D) along thepath indicated by the set of switches in the unicast (U) mode.(a) (b)Figure 3-3: (a) Unicast data transport in a NoC; (b) multicast data transport in a NoC (S –source; D – destination; U – switches in unicast mode; M – switches in multicast mode).Multicast mode: the packets have multiple destinations [42]. This mode is useful for themanagement and reconfiguration of functional cores of the NoC, when identical packetscarrying setup and/or configuration information must be transported to the processingelements [43]. Packets with multicast routing information are decoded at the switch inputports and then replicated identically at the switch outputs indicated by the multicast30decoder. The multicast packets can reach their destinations in a more efficient and fastermanner than in the case when repeated unicast is employed to send identical data tomultiple destinations [44]. Fig. 3-3(b) shows a multicast transport instance, where thedata is injected at the switch source (S), replicated and retransmitted by the intermediateswitches in both multicast (M) and unicast (U) modes, and received by multipledestination switches (D). The multicast mode is especially useful for test data transportpurposes, when identical blocks need to be tested as fast as possible.3.4.1 Multicast test transport mechanismOne possible way to multicast is simply to unicast multiple times, but this implies avery high latency. The all-destination encoding is another simple scheme in which alldestination addresses are carried by the header. This encoding scheme has twoadvantages. First, the same routing hardware used for unicast messages can be used formulti-destination messages. Second, the message header can be processed on the fly asaddress flits arrive. The main problem with this scheme is that, as the number of switchblocks in the system increases, the header length increases accordingly and therebyresults in significant overhead in terms of both hardware and time necessary for addressdecoding.A form of header encoding that accomplishes multicast to arbitrary destination sets in asingle communication phase and also limits the size of the header is known as bit-stringencoding [45]. The encoding consists of N bits where N is the number of switch blocks,with a ‘1’ bit in the ith position indicating that switch i is a multicast destination. Todecode a bit-string encoded header, a switch must possess knowledge of the switchesreachable through each of its output ports [34].31Several NoC platforms developed by research groups in industry and academiafeature the multicast capability for functional operation [46] [47]. In these cases, nomodification of the NoC switches hardware or addressing protocols is required toperform multicast test data transport.If the NoC does not possess multicast capability, this can be implemented in asimplified version that only services the test packets and is transparent for the normaloperation mode. As shown in Fig. 3-4, the generic NoC switch structure presented in Fig.2-8(a) was modified by adding a multicast wrapper unit (MWU) that contains additionaldemultiplexers and multiplexers relative to the generic (non-multicast) switch. TheMWU monitors the type of incoming packets and recognizes the packets that carry testdata. An additional field in the header of the test packets identifies that they are intendedfor multicast distribution.FIFOFIFOFIFOFIFOMWURLB(1)(2)(3)(4)Figure 3-4: 4-port NoC switch with multicast wrapper unit (MWU) for test data transport.For NoCs supporting multicast for functional data transport, the routing/arbitrationlogic block (RLB) is responsible for identifying the multicast packets, processing the32multicast control information, and directing them to the corresponding output ports of theswitch [33]. The multicast routing blocks can be relatively complex and hardware-intensive.In the design proposed here for multicast test data transport, the RLB of the switch iscompletely bypassed by the MWU and does not interfere with the multicast test data flow,as illustrated in Fig. 3-4. The hardware implementation of the MWU is greatly simplifiedby the fact that the test scheduling is done off-line, i.e., the path and injection time ofeach test packet are determined prior to performing the test operation. Therefore, for eachNoC switch, the subset of input and output ports that will be involved in multicast testdata transport is known a priori, allowing the implementation of this feature to thesespecific subsets only. For instance, in the multicast step shown in Fig. 3-3(b), only threeswitches must possess the multicast feature. By exploring all the necessary multicaststeps to reach all destinations, the switches and ports that are involved in the multicasttransport are identified, and subsequently the MWU is implemented only for the requiredswitches/ports.The header of a multi-destination message must carry the destination node addresses[44]. To route a multi-destination message, a switch must be equipped with a method fordetermining the output ports to which a multicast message must be simultaneouslyforwarded. The multi-destination packet header encodes information that allows theswitch to determine the output ports towards which the packet must be directed.When designing multicast hardware and protocols with limited purpose, such as testdata transport, a set of simplifying assumptions can be made in order to reduce the33complexity of the multicast mechanism. This set of assumptions can be summarized asfollows:Assumption A1: The test data traffic is fully deterministic. For a given set of faultmodels and hardware circuitry, the set of test vectors is unique and known at design time.On the contrary, application data can widely vary during the normal operation of the NoC.Assumption A2: Test traffic is scheduled off-line, prior to test application. Since thetest data is deterministic, it can be scheduled in terms of injection time and componentsunder test prior to test execution.Assumption A3: For each test packet, the multicast route can be determined exactly atall times (i.e., routing of test packets is static). This is a direct consequence ofassumptions A1 and A2 above: knowing the test data, the test packets source anddestinations, multicast test paths can be pre-determined before the test sequence is run.Assumption A4: For each switch, the set of input/output ports involved in multicasttest data transport is known and may be a subset of all input/output ports of the switch(i.e., for each switch, only a subset of I/O ports may be required to support multicast).These assumptions help in reducing the hardware complexity of multicast mechanismby implementing the required hardware only for those switch ports that must supportmulticast. For instance, in the example of Fig. 3-4, if the multicast feature must beimplemented exclusively from input port (1) to output ports (2), (3), and (4), then onlyone demultiplexer and three multiplexers are required. A detailed methodology for testscheduling is presented in Section 3.5. The set of I/O ports of interest can be extractedaccurately knowing the final scheduling of the test data packets, and then those ports canbe connected to the MWU block, as indicated in Figs. 3-3(b) and 3-4. Various options for34implementing multicast in NoC switches were presented in [34] [43] [44]; therefore, thedetails regarding physical implementation are omitted here. Instead, we describe how themulticast routes can be encoded in the header of the test packets, and how the multicastroute can be decoded at each switch by the MWU.To assemble the multicast routes, binary addresses are assigned first to each switch ofthe NoC. Then, for each switch, an index is assigned to each of its ports, e.g., if a switchhas four ports, they will be indexed (in binary representation) from 00 to 11. Themulticast route is then constructed as an enumeration of switch addresses, each of themfollowed by the corresponding set of output port indices. These steps must be followedfor each possible multicast route that will be used by a multicast test packet. A simpleexample to illustrate how the multicast test address is built is presented in Fig. 3-5.1234560001000000000001011010DA list Switch 1 Switch 2 Switch 34, 5, 6 1 {00, 01} 2 {01, 10} 3 {10}Figure 3-5: Multicast route for test packets.Consequently, with the assumptions A1 to A4 stated previously, the multicast wrapperunit must simply decode the multicast routing data and place copies of the incomingpacket at the output ports found in the port list of the current switch. Since the test data isfully deterministic and scheduled off-line, the test packets can be ordered to avoid the35situation where two (or more) incoming packets compete for the same output port of aswitch. This is guaranteed according to the packet scheduling algorithms presented laterin Section 3.5. Therefore, no arbitration mechanism is required for multicast test packets.Also, by using this simple addressing mode, no routing tables or complex routinghardware is required.The lack of input/output arbitration for the multicast test data has a positive impact onthe transport latency of the packets. Our multicast implementation has lower transportlatency than the functional multicast since the only task performed by the MWU block isrouting. The direct benefit is a reduced test time compared to the use of fully functionalmulticast, proportional to the number of functional pipeline stages [33] that are bypassedby the MWU. The advantages of using this simplified multicasting scheme are reducedcomplexity (compared to the fully-functional multicast mechanisms), lower silicon arearequired by MWU, and shorter transport latency for the test data packets.3.5 Test schedulingThe next step is to perform test scheduling to minimize test time. The approachdescribed in this work does not use a dedicated test access mechanism (TAM) totransport test data to NoC components under test. Instead, test data is propagated towardsthe components in a recursive, wave-like manner, via the NoC components already tested.This method eliminates entirely the need for a dedicated TAM and saves thecorresponding resources. Another distinct advantage of this method over the dedicatedTAM is that the test data can be delivered at a rate independent of the size of the NoCunder test. The classic, shared-bus TAMs are not able to deliver test data at a speedindependent of the size of the SoC under test. This occurs due to the large intrinsic36capacitive load of the TAM combined with the load of multiple cores serviced by theTAM [32]. In Section 3.6.3, we compare the test time achieved through our approach,with previously proposed NoC test methods, and show a significant improvementcompared to the results obtained by applying the prior methods.An important pre-processing task that determines the total test time is referred to astest scheduling. Many of the previous research efforts have been devoted to reducing thetest time of large systems-on-chip designs by increasing test concurrency using advancedtest architectures and test scheduling algorithms. In the case of NoC-based MP-SoCs, thedata communication infrastructure itself contributes to the total test time of the chip andthis contribution must be minimized as well. The test scheduling problem can beformulated as optimizing the spatial and temporal distribution of test data such that a setof constraints are satisfied. In this work, the specific constraint is minimizing the test timerequired to perform post-manufacturing test of the NoC infrastructure. At this point, weassume that the test data is already available and organized in test packets, as outlined inSections 3.1-3.4. We also assume that, when required, the fault-free part of the NoC cantransport the test data to the NoC components under test using unicast or multicast, asdetailed in Section 3.4.With these considerations, two different components of the test time can be identifiedfor each NoC building element. The first component is represented by the amount of timerequired to deliver the test patterns to the NoC element that is targeted, called transporttime (TT). The second component represents the amount of time that is actually needed toapply the test patterns to the targeted element and perform the actual testing procedure.37This latter component is called test time per element (TTPE), where element refers to alink segment, a FIFO buffer, or a routing/arbitration block.3.5.1 Test time cost functionTo search for an optimal scheduling, we must first use the two components of the testtime to determine a suitable cost function for the complete test process. We then computethe test cost for each possible switch that can be used as a source for test packet injection.After sequencing through all the switches as possible test sources and evaluating thecosts, the one with the lowest cost is chosen as the test source.We start by introducing a simple example that illustrates how the test time iscomputed in the two transport modes, unicast and multicast, respectively. Let,up rT (,mp rT )be the time required to test switches Spand Sr, including the transport time and therespective TTPEs, using the unicast (multicast) test transport mode, respectively.Consider the example in Fig. 3-6, where switch S1, and links l1and l2are already testedand fault-free, and switches S2and S3are the next switches to be tested. When test data istransmitted in the unicast mode, one and only one NoC element goes into the test mode ata time, at any given time, as shown in Figs. 3-6(a) and 3-6(b).S1T(a) (b) (c)l1l2l1l2l1l2S2S3S1S2S3S1S2S3TTTFigure 3-6: (a), (b): Unicast test transport. (c) Multicast test transport.Then, for each switch, the test time equals the sum of the transport latency, TT = Tl,L+ Tl,S, and the test time of the switch, TTPE = Tt,S. The latter term accounts for testing theFIFO buffers and RLB in the switches. Therefore, the total unicast test time2,3uT for38testing both switches S2and S3is:2,3 , , ,2 2ul L l S t ST T T T   (3.1)where Tl,Lis the latency of the inter-switch link, Tl,Sis the switch latency (the number ofcycles required for a flit to traverse a NoC switch from input to output), and Tt,Sis thetime required to perform the testing of the switch (i.e.,,t S FIFO RLBT T T  ).Following the same reasoning for the multicast transport case in Fig. 3-6(c), the totalmulticast test time2,3mT for testing switches S2and S3can be written as:2,3 , , ,ml L l S t ST T T T   (3.2)Consequently, it can be inferred that the test time cost function can be expressed asthe sum of the test transport time, TT, and the effective test time required for applying thetest vectors and testing the switches, TTPE, over all NoC elements.Test TimeNoC=All NoC Elements(TT + TTPE) (3.3)which can be rewritten asTest TimeNoC= All NoC Elements(TT) + All NoC Elements(TTPE) (3.4)Expression (3.4) represents the test time cost function that has to be minimized forreducing the test cost corresponding to the NoC fabric.Consequently, there are two mechanisms that can be employed for reducing the testtime: reducing the transport time of test data, and reducing the effective test time of NoCcomponents. The transport time of test patterns can be reduced in two ways:(a) by delivering the test patterns on the shortest path from the test source to the39element under test;(b) by transporting multiple test patterns concurrently on non-overlapping paths to theirrespective destinations.The TTPE is governed by the considerations described in Sections 3.3 and 3.4.Therefore, in order to reduce it, one would need to re-evaluate the fault models or theoverall test strategy (i.e., to generate test data locally for each element, with therespective incurred overhead [31]). Within the assumptions in this work (all test data isgenerated off-line and transported to the elements under test), the only feasible way toreduce the term corresponding to TTPE is to overlap the test of more NoC components.The direct effect is the corresponding reduction of the overall test time. This can beultimately accomplished by employing the multicast transport and applying test datasimultaneously to more components.3.5.2 Test transport time minimizationWith the goal of minimizing the time required to deliver test patterns to the NoCelements under test, we formulate the problem using a graph representation of the NoCinfrastructure. We then find, for each NoC component, the shortest path from an arbitrarysource node on the NoC graph, traversing only previously tested, fault-free components.The total transport time TT equals the sum of all transport latencies for the set of shortestpaths corresponding to the chosen source node, as expressed in Eq. (3.4); consequently,since these paths are minimal, the total test time corresponding to the respective node isalso minimal. By repeating the procedure similarly for all possible nodes in the network,and choosing the solution that delivers the shortest test time, the minimum test transporttime is guaranteed to be obtained.40The graph representation of the NoC infrastructure used to find the minimum testtransport latency is obtained by representing each NoC element as a directed graph G=(S, L), where each vertex si  S is a NoC switch, and each edge li L is an inter-switchlink. Each switch is tagged with a numerical pair (Tl,s, Tt,s) corresponding to switchlatency and switch test time. Each link is similarly labeled with a pair (Tl,L, Tt,L)corresponding to link latency and link test time, respectively. For each edge and vertex,we define a symbolic toggle t which can take two values: N and T. When t = N, the cost(weight) associated with the edge/vertex is the latency term, which corresponds to thenormal operation. When t = T, the cost (weight) associated with the edge/vertex is thetest time (of the link or switch) and corresponds to the test operation.A modified version of Dijkstra’s shortest path algorithm [48] is used for graphtraversal in order to find a test scheduling with minimum test cost. Dijkstra's algorithmsolves the single-source shortest path problem for a directed graph with non-negativeedge weights. It is known for its efficient and simple implementation.Initially, the t toggle is equal to T for all edges/vertices. We start by choosing a nodeand traversing the NoC graph using Dijkstra's algorithm. Every time an element isencountered whose t toggle is T, the test cost function is updated with the correspondingterm, and t is switched to N. When an element whose toggle is N is encountered, the testfunction is updated with the corresponding term (the element’s current weight) and tremains unchanged. There are slight differences in our approach compared to the classicalgorithm: are modified to a lower value exactly once during graph traversal2. Also, aftera directed edge is traversed, the edge in the opposite direction is traversed as soon as2Details regarding the weight updating and a proof that the modified algorithm returns a shortest path areprovided in Appendix 1.41possible. An additional constraint placed on the modified algorithm is that all toggles tmust be switched to N by the end of the algorithm. This ensures that no edges remain thathave not been traversed (i.e., no inter-switch links remain untested). However, thesedifferences are minor and only affect the way in which the test cost function is calculated.The test cost function is computed differently, depending on whether unicast ormulticast employed. In the following, the algorithms used to determine a minimum costtest scheduling for these two test delivery methods are presented.3.5.3 Unicast test schedulingProblem formulation: Given the graph G(S, L), the pairs (Tl,s, Tt,s) and (Tl,L, Tt,L), andassuming that only one vertex/edge whose toggle t equals T can be visited at a time,determine a graph traversal sequence that covers all vertices and edges, and has aminimum associated test cost function FTC,u.The fact that only one toggle can be switched at a time accounts for the unicasttransport mechanism when the NoC components are tested sequentially. The unicast testcost function FTC,u, is defined recursively as:, ,1 11,1,( 1) ( ),,p ru u i jTC TC l L l Si jkt Lkt SF k F k T TT if element k+1 is a linkT if element k+1 is a switch (3.5)where k+1 is the index of the current element under test and k is the index of the previousNoC element (switch or link) under test. The path onto which test packets are delivered toelement k+1 consists of p links and r switches, with corresponding latencies Tl,Land Tl,S.It can be observed that, in each step of the sequence, the latency terms corresponding to42the path along which the test data is transported, and the terms corresponding to theeffective test procedure per element are added to the cost function. Hence, the unicast-based test procedure can be minimized by minimizing the latency component of the testcost function. This fact is used to obtain a minimum-cost test scheduling by ensuring thateach element of the graph G(S, L) is visited on the shortest path from the test source.Algorithm 1: Unicast Schedulingfor each sS--- initialization ---Uni_min (G,S,L,w,t)for each vertex s in Gv_toggle(s) := T; --- initialize all switches as untestedv_weight(s) := Tt,S;d[s] :=;previous(s) := undefined;for each edge between vertices u,v in Ge_toggle(u,v) := T; --- initialize all links as untestede_weight(u,v) := Tt,L;Q := S union L;R := empty set;FTC,u:= 0;--- graph traversal on all shortest paths from switch s ---while Q is not an empty set ----graph traversalu := Extract Min{S}; ---- pick switch with min. Tt,SR := R union {u};for each edge (u,v) outgoing from uif {d[u] + e_weight(u,v) < d[v] and v_toggle(v)=T}d[v] := d[u] + weight(u,v);update FTC,u; --- update cost using Eq. (3.5) ---v_toggle(v) := N; --- set unicast mode ---e_toggle(u,v):=N;v_weight(u):= Tl,S; --- change weights ---e_weight(u,v):=Tl,L;previous[v] := u;endif ;return FTC,u;choose {smin, FTC,u, min}; --- pick the test source with minimum cost ---end.43In Algorithm 1 above, d[u] denotes the distance from the current test source to switchu under test when the NoC graph G(S, L) is traversed using the modified Dijkstraalgorithm, and it represents the test injection time corresponding to switch u. Uponcompletion of Algorithm 1, the switch sminthat yielded a test scheduling with a minimumcost, FTC,u, min, is selected as the final test source. The test cost function for unicasttransport FTC,uis updated according to Eq. (3.1). This algorithm returns the optimum testsource switch in the NoC, and a test scheduling that is minimized with respect with testtime, since all test paths that are returned have the shortest length. Table 3-1 shows thetest scheduling obtained using the unicast-based test time minimization algorithm,applied to the 4-switch network in Fig. 3-7, when switch S1is selected as the initial testsource. As outlined in Table 3-1, the scheduling algorithm returns the test time and thepath for each test packet.S1l1l2S2S3S4l3l4l5l5’l1’l2’l3’ l ’4Test packetsFigure 3-7: 4-switch network with unidirectional links.44Table 3-1: Unicast test data scheduling for the example in Fig. 3-7ElementundertestTestpacketspathUnicast Test CostFTC,uS1- Tt,Sl1S1Tt,S + Tl,S + Tt,Ll2S1Tt,S + 2Tl,S+2Tt,LS2S1l12Tt,S+3Tl,S+2Tt,L+ Tl,Ll1’ S1l1S22Tt,S+5Tl,S+3Tt,L+2Tl,LS3S1l23Tt,S+6Tl,S+3Tt,L+3Tl,Ll2’ S1l2S33Tt,S+8Tl,S+4Tt,L+4Tl,Ll5S1l2S33Tt,S+10Tl,S+5Tt,L+5Tl,Ll5’ S1l1S23Tt,S+12Tl,S+6Tt,L+6Tl,Ll3S1l1S23Tt,S+14Tl,S+7Tt,L+7Tl,LS4S1l1S2l34Tt,S+16Tl,S+6Tt,L+9Tl,Ll3’S1l1S2l3 S44Tt,S+19Tl,S+8Tt,L+11Tl,Ll4S1l2S34Tt,S+21Tl,S+9Tt,L+12Tl,Ll4’S1l2S3l4 S44Tt,S+24Tl,S+10Tt,L+14Tl,LWhen all possible test sources are exhaustively exercised, the optimum solution is theone that selects S2(or S3) as test sources, since this is the case for which the transporttime TT reaches its minimum value. Inter-switch links are indicated in full, as pairs ofunidirectional connections.3.5.4 Multicast test schedulingProblem formulation: Given the graph G(S, L), the pairs (Tl,s, Tt,s) and (Tl,L, Tt,L), andassuming that all vertices/edges whose toggle t equals T and are adjacent toedges/vertices whose toggle equals N, can be visited at a time, determine a graphtraversal sequence that covers all vertices and edges, and has a minimum associated test45cost function FTC,m.The fact that more than one toggle can be switched at a time accounts for themulticast transport mechanism when the NoC components are tested concurrently.We define the multicast test cost function, FTC,m, recursively as:1 , ,1 1,1,1( ) ( )m mTC TCr npi l L l Sj jit Lqiit Sk kif  current element is a linkif  current element is a switchF FMax T TT ,MaxT ,          (3.6)where k+1 and k refer to the value of the multicast test cost function in thecorresponding multicast test step. In each multicast step, the test cost function is updatedaccording to Eq. (3.6), adding the latency of the longest multicast paths, and the largesttest time required by the elements under test in the current test step.Hence, the multicast-based test procedure can be minimized by minimizing both thelatency component of the test cost function and the total test by transporting test data tomore components concurrently, i.e., minimizing both terms of Eq. (3.4). This observationis used to realize a minimum-cost test scheduling by ensuring that each element of thegraph G(S, L) is visited on the shortest path from the test source and all elements whosetoggle equals T and are adjacent to elements whose toggle equals N are visited at thesame time.The pseudo-code of the algorithm that realizes the multicast test scheduling ispresented below:46Algorithm 2: Multicast schedulingThe test cost function is updated according to Eq. (3.4), once per each multicast step.Since more than one node (edge) is involved in each multicast step, and only themaximum weight of all the elements involved in each multicast step is used to update thecost function, the total value of FTC,mwill be lower than that of the unicast test costfor each sS--- initialization ---Multi_min (G,S,L,w,t)for each vertex s in Gv_toggle(s) := T; --- initialize all switches as untestedv_weight(s) := Tt,S;d[s] :=;previous(s) := undefined;for each edge between vertices u, v in Ge_toggle(l) := T; --- initialize all links as untestede_weight(l) := Tt,L;Q := S union L;R := empty set;FTC,m:= 0;--- graph traversal on all shortest paths from switch s ---while Q is not an empty set ---graph traversalu := Extract Min{S}; --- pick switch with min. Tt,SR := R union {u};for all edges (u,v) outgoing from ufor all nodes vif {d[u] + weight(u,v) < d[v] and toggle(v) = T}d[v] := d[u] + weight(u,v);v_toggle(v) := N; --- set multicast mode ---e_toggle(u,v):=N;v_weight(u):= Tl,S; --- change weights ---e_weight(u,v):=Tl,L;previous[v] := u;update FTC,m; --- update cost function using Eq. (3.6) ---return FTC,m;choose {smin, , FTC,u, min}; --- pick the test source with minimum cost ---end.47function FTC,u , for identical networks.Algorithms 1 (unicast) and 2 (multicast) are similar in regards to the search sequence.The differences between the two algorithms lie in a different way of updating the test costfunction and the sets of paths (single switch/link in the unicast case, multipleswitches/links in the multicast case). Also, the original Dijkstra algorithm is slightlymodified in the sense that the weights of the links/switches are modified during thesearch sequence. The proof that the modified algorithm returns a minimum cost solutionis provided in Appendix 1.Table 3-2 shows the test scheduling obtained by using the multicast-based algorithmfor the example in Fig. 3-7, when switch S1is selected as test source. Once Algorithm 2terminates and thereby all possible test sources have been considered, in this example, theoptimum test scheduling would select S2(or S3) as the test source yielding the optimalsolution.Table 3-2: Multicast test data scheduling for the example in Fig. 3-7ElementsundertestTest packetspathMulticastTest CostFTC,mS1Tt,Sl1, l2S1Tt,S+Tl,S+Tt,LS2, S3S1{l1,l2}2Tt,S+2Tl,S+Tt,L+ Tl,Ll1’, l2’, l3,l4, l5, l5’S1{l1,l2}{S2,S3}2Tt,S+4Tl,S+2Tt,L+2Tl,LS4S1l1S2l33Tt,S+6Tl,S+2Tt,L+4Tl,LBoth test scheduling algorithms are direct mappings of Dijkstra’s shortest pathalgorithm for the NoC graph, the difference being in the way the cost functions arecalculated and updated. Therefore, the complexity of the algorithms can be considered to48be as O(e log v), where e is the number of directed edges, and v is the number of verticesin the NoC graph [48].3.6 Experimental resultsIn order to evaluate the efficiency of our testing methods, we first need to presentsome implementation details necessary to realize the application of methods andalgorithms described in Section Test output evaluationThe methods described in this dissertation are primarily targeted forpost-manufacturing testing, where the objective is to deliver a set of test patterns in theshortest possible time and the final outcome is the fail/pass decision.In classical SoC core-based testing, test data is injected from a test source, transportedand applied to the core under test, and then the test output is extracted and transported tothe test sink for comparison with the expected response [49]. In this work, a moreeffective solution is adopted, first proposed in [50], where the expected output data is senttogether with the input test data, and the comparison is performed locally at eachcomponent under test. A clear advantage of this approach is a shorter test time, sincethere is no need to extract the test output and to transport it to a test sink. Moreover, thetest protocol is also simplified, since this approach eliminates the need for a flow controlof test output data (in terms of routing and addressing). The tradeoff is a small increase inhardware overhead due to additional control and comparison circuitry, and increased sizeof the test packets. These must now contain the expected output of each test vector,interleaved with test input data.49As shown in Fig. 3-8, the test packets are processed by test controller (TC) blocksthat direct their content towards the inputs/outputs of the component under test (CUT)and perform the synchronization of test output and expected output data. This data iscompared individually for each output pin, and, in case of a mismatch, the component ismarked as faulty by raising the pass/fail flag. The value of this flag is subsequently storedin a pass-fail flip-flop which is a part of a shift register that connects pass-fail flops of allswitches. The content of this register is serially dumped off-chip at the end of the testprocedure.pass/failtest inputsinputs outputsexpectedoutputstest packetsNoC channelSD QFigure 3-8: Test packets processing and output comparison.In our implementation, the TC block is shared by a switch and its adjacent links, inorder to reduce the area overhead of the scheme.3.6.2 Test modes for NoC componentsThe components of the NoC fabrics are quite heterogeneous with respect to their testrequirements and their test modes differ significantly. For each type of component (FIFObuffers, routing/arbitration blocks, inter-switch links) we provide test modes for50executing the test procedure in a manner suitable for the nature of the component undertest.The test data can be injected from the external ATE by multiplexing the functionalI/Os so that, in test mode, test packets are directed towards the first switch under testdetermined according to Algorithms 1 and 2 in Section 3.5. Subsequently, the test path isconstructed progressively by adding the individual NoC components, after their testprocedure is completed successfully (when a faulty component is found the testprocedure terminates). In other words, chip I/Os are multiplexed to a NoC channel,through which test data is injected into the NoC to access the flip-flops within the RLBand FIFO.As stated in Section 3.3, scan insertion is adopted as the DFT strategy for therouting/arbitration blocks. The number of scan-chains is constrained to be equal to theinter-switch link width.Since we perform functional test in the case of the FIFO blocks, they do not requirespecial control signals or additional hardware to manage the test input data. In this case,the test control (TC) block only separates the input data from the expected output dataand performs the output comparison for the FIFO test. A similar situation arises for theinter-switch links, except that the link test packets do not contain any expected outputdata. Instead, since the expected output is identical to the MAF test inputs, the TC blockduplicates the latter and thus creates the expected output, which is then used to performthe test output comparison.3.6.3 Test scheduling resultsThe proposed scheduling algorithms were evaluated with respect to their run-times51(time required to run the program that implements the algorithms on a computer) andfinal test cost (the value of the test cost function associated with the optimal schedulingsolution returned by each algorithm). For estimation purposes, two different types of NoCtopologies are considered: the mesh and butterfly-fat-tree (BFT) [33]. For each of thesetypes, NoCs were built of three different sizes considered representative for the level ofintegration at which the NoC paradigm becomes a viable solution: 16-IP (“small NoC”),64-IP (“medium” NoC), and 256-IP (“large” NoC). Correspondingly, the number ofswitches that service each NoC instance differs for the two types of topologies, due to thefact that they are direct (mesh) and indirect (BFT) networks; their respective number ofswitches is given in Table 3-3. More details on the particular characteristics of thesetopologies can be found in [33] [38] [51].The switches were designed using commercial grade digital design tools bySynopsys, and synthesized in a 90 nm CMOS standard cell technology from STMicroelectronics. The test patterns for the routing/arbitration blocks were obtained usingthe Synopsys’ TetraMAX ATPG tool [52], and arranged in test packets by in-housewritten scripts. The FIFOs were also designed using standard cells; however, they wereisolated from the logic blocks for test generation purposes, and their test data wasobtained and organized in test packets according to Sections 3.3 and 3.4. The size of theFIFOs was set to four flits per virtual channel, with four virtual channels per port andsymmetrical input-output buffering [20] [33].For all NoC instances, the bit-width of the inter-switch links was set to 32. Allinter-switch links consist of a pair of unidirectional interconnections. The MAF testpackets were generated according to the MAF model presented in Section 2.2. The code52implementing the two scheduling algorithms was run on a Linux PC with a 2 GHzX86-family processor and 1 GB of DRAM.The most important quality metric for the test methods developed in this dissertationis the corresponding test time required to perform both the unicast- and multicast-basedschemes. We compare our method with two previously proposed test methods for NoCswitches. None of these prior methods addresses the test of interconnects (inter-switchlinks). We include them in our comparison since they are the most relevant NoC testmethods currently available. In the following, a brief overview of the two NOC testmethods used for comparison is provided.The work in [30] proposed a test methodology for NoC switches based on acombination of flat core full scan and hierarchical core full scan methods. For testpurpose, the NoC is considered as a flat core and wrapped with an IEEE 1500 testwrapper [53] for test data access to internal scan chains of individual NoC routers. Testdata is delivered in parallel to all identical scan chains of the routers and the testresponses are compared internally. The number of comparators used depends on thenumber of scan chains: a comparator block is required for each scan chain in the routers.Ideally, all routers are tested in parallel, and a single comparator is needed. For NoCs oflarger size where practical fan-out limitations cannot be ignored, the number of scanchains per router (and correspondingly the number of comparator blocks) must beincreased.In [25] the authors use progressive transport of test data to the switches under test,while the NoC links are assumed fault-free. Test data is organized in packets and injectedfrom the ATE using input ports, routed through several channels and routers, then53captured by the test wrapper of the router under test. This wrapper is similar to the IEEE1500 test wrapper. The main concern for test packets scheduling is to avoid usinguntested routers in their paths. Moreover, only unicast data transport is considered, withno particular focus on minimizing the test transport time. Test responses are processedlocally on-chip through the use of either comparator blocks or MISR (Multiple InputShift Registers). In [25], test time results are presented for both switches-only test andintegrated switches/cores test. For the purpose of this comparison, we only considered thecase of switches-only test.Depending on the availability of test resources (I/O pins), designers may choose toinclude dedicated TAM that can be used in combination with NoC-based test transport.Here, we considered that NoC reuse is the only mechanism used to transport test data.The results showing the test time required by the unicast, multicast, and the previoustest methods presented in [26] and [30] are summarized in Table 3-3.Before discussing the specifics of the experimental results, note that no directcomparison is intended between the two types of NoC architectures considered for testTable 3-3: Test time results and comparison.NoC type and size Test method and test time [cycles] Relative test time improvementTest time [30]*Test time [26]*Unicast MulticastTest time improvement[30]/ multicastTest time improvement[26]/ multicastTest time improvement unicast/ multicast4 x 4 9,451 18,840 19,958 4,603 2X 4X 4.3X8 x 8 40,309 79,921 85,122 7,559 5.3X 10.5X 11.2XMesh16 x 16 105,775 209,720 223,368 15,223 7X 13.7 14.6X6 7,226 15,352 16,107 3,258 2.2X 4.7X 4.9X28 27,690 58,830 61,724 7,036 4X 8.3X 8.7XBFT120 125,576 266,896 280,073 8,107 15.5X 33X 34.5X* Test time corresponding to interconnect test is not included (NoC links are assumed fault-free in [30] and [26]).54time evaluation. The NoC structures studied here have very different topologies and sizes(in terms of number of switches).The reason for choosing these particular topologies and sizes is that they representtwo extreme cases with respect to the nature of their respective topologies: the mesh is amore uniform architecture, while the BFT is hierarchical. In the absence of standard NoCbenchmark circuits, we believe these topologies offer a reasonable representation of thepractical cases.The routing policies employed for the unicast and functional multicast cases were thedimensional routing (e-cube) [20] for mesh topologies, and least-common ancestor(LCA) for the BFT topologies [33]. For the test-only multicast, the routing wascustomized according to Algorithm 2 in Section 3.5.4.Based on the test times reported in Table 3-3, we note that the unicast test time andthe test time reported in [26] are very similar. This is because, in each case, test data isdelivered sequentially to each component under test. The test volume of the proposedunicast method is larger than the one in [26], due to the fact that we include test packetsfor testing the NoC channels. The larger test volume is compensated, however, byminimizing the test transport time. Compared to [30], the unicast approach appears toperform worse, but only because [30] uses a reduced test data volume (no interconnecttest is performed) and second-order effects such as the effect of scan input fan-out are notconsidered for larger NoCs. Our unicast method has, however, an advantage that is notapparent from Table 3-3: it can deliver test data at the nominal operating speed of theNoC infrastructure regardless of the NoC size. The method presented in [30] may not beable to carry the test data at the NoC nominal frequency, especially in the case of55large-size architectures, when the fan-out of the scan input will increase significantly andthe maximum switching rate of the scan-chains accordingly.Comparing the test times with multicast, the superiority of the multicast test datatransport is evident. The improvement in test speed range from 2X for the case ofsmall-size NoCs, to 34.5X for large size NoCs. As with the unicast transport method, themulticast mechanism is able to transport test data at the nominal operating speed of theNoC, thus making possible at-speed transport of test packets test with no additional cost.Another important quality metric of our test methodology is the silicon area overheadof the unicast and multicast test data transport mechanisms. There are two majorcomponents that contribute to the overhead of the schemes. The first is the test control(TC) unit, in charge of test packet processing, test input data injection to CUT inputs, andtest output comparison. The second is the multicast wrapper unit (MWU) whichimplements the single-source, multiple-destination test data transport feature. Wecompare the gate-count of the test-only multicast solution with the gate-count of thefunctional multicast implementation.Table 3-4 shows the gate count for the unicast, test-only multicast, and functionalmulticast implementations. The gate-count for the unicast case corresponds to the testcontrol (TC) blocks, which are identical for a given topology. The TC blocks differ fordifferent topologies, since they have to handle a different number of components (FIFOs,routing blocks, inter-switch links) adjacent to a switch. The gate-count reported for thetest-only multicast is the sum of TC gate-count and the MWU gate-count, averaged forall the switches in the respective NoC instance. Averaging is needed because the test-onlymulticast feature is implemented selectively, only for those ports per switch which are56involved in multicast data transport. The list of ports that need to be connected to theMWU for each switch was obtained by examining the optimal test scheduling ascomputed using Algorithms 1 and 2 presented in Section 3.5.4. The last column in Table3-4, labeled %diff, shows the percent difference between the gate-count per switch of thefunctional multicast implementation (column 4) and the test-only one (column 3).The absolute area required to implement the proposed test mechanisms is acceptable,especially when compared to the typical gate-count of a NoC switch (around 30,000gates, as reported in [25], [46] and [47]).Table 3-4: Gate count and comparison for the proposed test mechanism (per switch)NoC type & sizeAlgorithm 1unicast test[gates]Algorithm 2multicast test[gates]Functionalmulticast[gates]%diff4 x 4 524 825 1025 19.58 x 8 548 792 1025 22.7Mesh16 x 16 576 721 1025 29.66 693 816 1210 32.528 718 771 1210 36.3BFT120 736 722 1210 40.3Moreover, when multicast data transport is adopted, by implementing this featureselectively for only the switch ports on multicast paths, significant area reduction can beobtained (up to 40% when compared to the functional multicast realization, for the NoCinstances in our experiments).The MWU hardware introduces a certain amount of delay in the data path of thepackets, equivalent to the delay of a demultiplexer/multiplexer pair as shown in Fig. 3-4.This delay adds to the delay of the routing block RLB. Through gate level design andanalysis, the equivalent delay of the MWU is found to be equal to 3 FO4 (fan-out of four)delay units. The typical delay of the RLB block is around 6 FO4 units [54]. Therefore,57the total delay of the RLB and MWU blocks is around 9 FO4 units, which fits well withinthe limit of 10-15 FO4 units according to ITRS projected trends for clock speed of high-performance multi-processors.Table 3-5 shows the amount of time required to obtain an optimal scheduling runningthe two algorithms presented in Section 3.5.Table 3-5: Test scheduling run-timesNoC type &sizeAlgorithm 1 - unicastrun-time [s]Algorithm 2 – multicastrun-time [s]4 x 4 4 38 x 8 12 9Mesh16 x 16 33 276 2 228 7 5BFT120 15 11The run-times in Table 3-5 do not include the time required to run the ATPG tool togenerate the test patterns for the logic blocks, since this has to be done irrespective of theparticular test transport implementation. Clearly, the test scheduling methods do notrequire significant run-times, even for relatively large NoCs (256-nodes mesh,120-switch BFT).3.7 SummaryA novel method for testing the communication fabric of network-on-chip (NoC)based multi-processor systems-on-chip (MP-SoCs) was presented in this chapter. Thenovelty of this method lies in the reuse of the NoC fabric for test data transport in arecursive manner, and in exploiting the inherent parallelism of the NoC architectures forspeeding the test data transport and reducing the test time. Test scheduling algorithmswere developed for two types of test data transport: sequential (unicast) and concurrent58(multicast). The proposed methods integrate the test of all types of NoC components(buffers, links, and routing blocks) in a unified fashion. The efficiency of the testmethods was evaluated for synthetic NoCs of different topologies and sizes. The resultsof this assessment show significant speed-up of test data delivery compared to previouslyproposed NoC test methods, from 2X for small-size NoCs, up to 34X for larger networks.Chapter 44 Fault-tolerance Techniques for Networks-on-chip3Fault tolerance in a design implies that the design can withstand certain static anddynamic faults, and still operate correctly. Although all possible fault mechanisms are notusually covered, the most important ones are addressed to greatly increase the reliabilityof a system.Fault tolerant design of network-on-chip communication architectures requires theaddressing of issues pertaining to different elements described at different levels ofdesign abstraction – these may be specific to architecture, interconnection,communication and application issues. Assessing the effectiveness of a particular faulttolerant implementation can be a challenging task for designers, constrained with tightsystem performance specifications and other constraints. In this chapter, a top-down viewof fault tolerance methods for NoC infrastructures is provided, and two novel metricsused for estimating their quality are proposed: detection latency and avoidance latency.The use of these metrics is illustrated by simulating a few simple but realistic faulttolerant scenarios.3This chapter is based on work published in:1. C. Grecu, L. Anghel, P. P. Pande, A. Ivanov, R. Saleh, "Essential fault-tolerance metrics for NoCinfrastructures", 13thIEEE International Online Testing Symposium (IOLTS), IOLTS’07 9th-11thJuly, 2007.2. C. Grecu, A. Ivanov, R. Saleh, E. S. Sogomonyan, P.P. Pande, "On-line fault detection andlocation for NoC interconnects", 12thIEEE International Online Testing Symposium (IOLTS),IOLTS06, July 2006.604.1 IntroductionA major cause affecting the reliability of the VLSI global interconnects is theshrinking of the feature size, which exposes them to different faults of a permanent(static), transient (dynamic) or intermittent (dynamic) nature. Static faults include opensand shorts in the wires and missing vias in the interconnect, and a variety of well-knownfaults in logic and memory. Among the dynamic failure mechanisms we can enumeratefactors such as crosstalk, electromigration, electromagnetic interference, alpha particlehits, and cosmic radiations [55]. These phenomena can alter the timing and functionalityof the NoC fabrics and thus degrade their quality-of-services (QoS) characteristics or,eventually, lead to failures of the whole NoC-based system. Providing resilience fromsuch faults is mandatory for the operation of NoC-based chips.Traditionally, error detection and correction mechanisms are used to protectcommunication subsystems against the effects of transient malfunctions. Designers mustcarefully weigh the hardware cost of implementing such mechanisms for the on-chip datacommunication infrastructures against the potential benefits they can bring [56] [57].Complex error detection and correction (EDC) schemes may require additional energydissipation and area overhead, and can affect the performance of SoC communicationarchitectures in terms of throughput and latency. Other approaches for fault-tolerant onchip communication include stochastic communication, adaptive routing, and differenthybrid schemes that combine spatial and temporal redundancy for achieving faulttolerance.Metrics for fault-tolerant systems are well-established and have been used extensivelyin design of distributed computing systems. Recent research in fault-tolerant NoC fabrics61proposed metrics that are specific to message-passing on-chip communication systemsfor evaluating the effectiveness of new and existing fault tolerance methods. In theabsence of widely accepted metrics, it is difficult for NoC designers to assess the fault-tolerant capabilities in a quantitative manner. This is especially true when comparingdifferent fault-tolerant techniques with the objective of determining which one candeliver the best performance within acceptable costs.One option for evaluating the fault-tolerance (FT) capabilities of a system is tomeasure the degree of hardware and software redundancy (spatial and temporal), which isan inherent property of most NoC architectures. While redundancy by itself is a usefulmeasure, it is incomplete in that different systems can exploit redundancy in more or lessefficient ways. It is therefore preferable to have metrics that can measure the effectivefault tolerance as it influences the required performance in accomplishing the task oftransporting information across the NoC fabric. Based on the above considerations, thescope of this chapter is to present traditional FT metrics in the novel context of NoCsystems, and use them synergistically with newly proposed metrics, in order to measurethe effective fault tolerance in the context of overall system performance for NoCsubsystems.The quest for on-chip fault-tolerant communication started well before the NoCparadigm emerged as a solution for integrating large MP-SoCs. Design techniques forfault-tolerant on-chip busses are found in [58], where a particular form of duplication isused to detect and correct crosstalk and transient faults, and [59], where different errorrecovery mechanisms from simple retransmission to correction/retransmission areanalyzed in terms of power/area figures.62The NoC infrastructures are characterized by more complex topologies (relative tobuses), with higher degrees of connectivity. Moreover, in the NoC paradigm, data istransferred across the chip by employing some form of packet switching, where sent datais tagged with additional flow control information [20]. The additional hardware and/orinformation redundancy, offers significant potential for implementing differentfault-tolerant strategies. Hardware redundancy can be exploited by using multiple pathsto transport data between source and destination cores, e.g., in the case of adaptiverouting methods [60], or stochastic communication [61] [62]. In [61], the total time tocomplete the application (a Fast Fourier Transform algorithm) in the presence of faults isanother performance measure of the FT method. General-purpose metrics used in [61]and [62] are energy consumption and area cost, which are fully usable and relevant.Information redundancy can be implemented by means of error detection/correctionschemes, where additional control bits can be used to detect the presence of errors andeventually to reconstruct the original data. When the presence of errors in the data streamis detected but not corrected, error recovery is usually performed through retransmission,which is one of the possible forms of temporal recovery [63].In practice, temporal and spatial data redundancy is often used to achieve the faulttolerant goals. Ultimately, designers must assess the effectiveness of the FTimplementation in the context of the NoC performance specification. This informationmust be readily available and quantifiable such that, if a specific FT implementationcannot be accommodated while still meeting the performance specification of the NoCmedium, it can be eliminated from the set of potential FT realizations at early stages ofthe design process.63Most of the previously published works on the issue of FT in NoCs present methodsat algorithm or circuit level, and then proceed to assess them relative to an ad-hoc set ofparameters that may be more or less relevant to the specifics of an on-chip data transportmechanism.In [61] and [62], the authors propose probabilistic communication schemes based onprobabilistic broadcast and random walk, respectively. In these schemes, multiple copiesof the same message are transmitted following different paths, selected randomly fromthe set of possible routes between communicating pairs of cores. When transient orpermanent faults manifest themselves in the NoC fabric, the affected message replicas arediscarded. Eventually, from the set of redundant messages, one that is error-free mayreach the destination, completing a successful transmission. Message latency is correctlyidentified in both works as an important measure for evaluating the impact of the FTsolution. However, there are many factors that can affect message latency, other than thespecific FT implementation, such as traffic distribution (spatial and temporal), buffer sizeand buffer management, flow control, etc. The contributions of all these factors towardsmessage latency are generally inter-dependent, and it is quite difficult to separate theirindividual effects. As such, it is difficult to estimate with acceptable accuracy the effectof the FT method on latency, isolated from the other factors.In [64], a new metric to characterize fault-tolerant NoCs is proposed, called messagearrival probability (MAP). For a pair of tasks (processes) that exchange messages acrossthe NoC, MAP is defined as the fraction of successfully transmitted messages. Thismetric is useful for assessing NoC performance when tight bounds are imposed on itscapability of transmitting error-free messages.64A different category of FT metrics relates to topology characteristics that canpotentially offer FT properties. In particular, connectivity is used to express the ability ofan interconnection network to offer multiple paths among its communicating nodes [65].This metric is useful, but only when combined with the flow control mechanism that canexploit the degree of connectivity in a more or less efficient manner.When dedicated spare routers or links are employed for achieving fault tolerance, theamount of spare elements can give an estimate of the NoC’s FT capabilities. As for theconnectivity, the amount of spares is only offering information about the NoC’s potentialfault tolerance, rather than its actual ability.Various coding schemes were proposed for providing error resilience to on-chipcommunication infrastructures [57] [66]. A comprehensive comparison and evaluation iscarried out in [67] for detection-only and single-error correcting codes, combined withretransmission for data recovery. The effect of locating the recovery points in the NoCtopology (e.g., end-to-end or switch-to-switch recovery) is also studied and evaluated.The metrics used in [66] and [67] quantify the effect of error recovery schemes on NoCparameters such as data latency, area overhead, and energy consumption, under specificconditions of NoC operation.The types of approaches for determining the efficiency of FT schemes based on theireffect on NoC performance parameters, under specific operating conditions, have thedrawback that they only deliver information for a set of inter-dependent operatingconditions (i.e., a given NoC hardware instance, specific traffic distributions and flowcontrol protocols).65We propose to complement these measuring practices by including the measurementof the intrinsic performance of the particular FT scheme in terms of its capability ofavoiding failed states, detecting a failure, or recovering from a failed state. Combinedwith performance parameter measurement in presence of FT mechanisms, the metricsproposed and evaluated in this chapter give the designers a better understanding of thetrade-offs involved in designing high-performance, fault-tolerant networks-on-chip.4.2 Traditional fault-tolerance metricsTraditional engineering methods that address design of fault tolerant systems usereliability and availability analysis to characterize these systems and their components[68]. Reliability is defined as the probability with which a system will perform its taskwithout failure under specified environmental conditions over a specified time duration.Thus, the MTBF (Mean Time Between Failures) parameter is used as a representation ofreliability, and is calculated as:Total operating timeMTBFNo.of failures encountered (4.1)Another metric used for evaluation of fault tolerant systems with repair/recoverycapabilities is represented as MTTR (Mean Time to Repair):Total time spent for repairsMTTRNo. of repairs (4.2)Based on MTBF and MTTR, the availability of a system can be used to measure theimpact of failures on an application, and is defined as:100%MTBFAvailabilityMTBF MTTR (4.3)While useful at the system level, these metrics may overlook important properties offault tolerant NoC subsystems. One such property that is misrepresented by use of MTBF66is the capability of a NoC protocol to rapidly recover from failures. Even in the casewhen the number of failures is high (which indicates a low, undesired MTBF), if therecovery can be performed quickly (e.g., through flit-level recovery [67] [69]), the impactof failures may be minimal and, therefore, it may not affect the application at all. For thesame reason, the availability of the NoC subsystem in such a condition can bemisinterpreted if viewed only by Eq. (4.3).Another drawback of these generic, system-level metrics is that they represent averagevalues. In the case of NoC fabrics that must meet tight quality of services (QoS)requirements in the presence of failures, the average values are not useful since theperformance constraints (in terms of guaranteed latency per message or availablethroughput) have to be met for all possible instances, not only on an average basis.While there is little doubt that fault tolerance is a desirable and useful property of NoCs,designers need simple, readily available metrics to be able to characterize FT methods inthe context of the specific NoC implementation. We introduce these metrics relative tothe NoC’s ability to detect the occurrence of faults and recover from failures. The metricsproposed in this work aim to address the issue of characterizing the effectiveness of faulttolerance schemes for NoC communication subsystems in the context of the specific QoSrequirements that designers face in their implementation. They are not intended tosubstitute the existing metrics, but to complement them by offering a more detailed viewof the properties of different fault-tolerance methods. We illustrate how the metricsdefined here can help designers gain more insight on the actual performance of faulttolerant implementations related to NoC architectures. The metrics proposed in thischapter are described in Section 4.3 and evaluated in Section 4.4.674.3 Fault-tolerance metrics for network-on-chip subsystemsBefore defining the set of metrics forming the object of this work, we need todifferentiate between different hierarchical abstraction levels for achieving resilience tofailures. There are five key elements in a comprehensive approach to fault tolerantdesign: avoidance, detection, containment, isolation, and recovery. Ideally, these areimplemented in a modular, hierarchical design, encompassing an integrated combinationof hardware and software techniques. Moreover, the fault tolerant techniques can beapplied at different layers from the set of ISO/OSI layers [21] that the NoC mayimplement, resulting in numerous possibilities for fine tuning the performance of the FTimplementation by combining the (sub)set of FT elements with the (sub)set of NoClayers. As an example, error detection may be implemented in the data layer, andrecovery may be realized either in the data layer (e.g., if an error correcting code is used)or at the application layer. In a more generic approach, the partitioning and derivation ofrequirements, and the partitioning and implementation of fault/failure managementtechniques must be realized in a hierarchical fashion. For each hierarchical level, theexistence of appropriate metrics allows the designers to have full control andunderstanding of the implications that a particular fault-tolerant implementation will haveon the operation of a NoC subsystem. In this chapter, two novel metrics for evaluatingthe performance of fault-tolerant mechanisms are proposed: detection latency andavoidance latency.For illustrating the value of the metrics proposed here and their usability, a simpleexample of an application A running on a NoC-based multi-processing system isconsidered. The application uses two processes P1and P2on two different processing68cores, as shown in Fig. 4-1. Ideally, processes P1and P2use the NoC subsystem tocommunicate with each other along the path shaded in grey. However, a different pathmay be taken if there is a fault on the ideal path.Figure 4-1: Processes communicating across a NoC fabric layout & device level(physical layer) flit/packet level(data layer) end-to-end (network/transport layer) app lication level(application/session layer)Faults:- type - source - rate - impactD/RD/RD/RD/RFigure 4-2: Hierarchical partitioning for fault tolerant NoC designs.69Faults can be detected and recovered in many ways. Using a layered representation ofthe data communication in a NoC-based system, and considering a subset of the standardOSI layers [70], Fig. 4-2 shows the propagation of faults from the physical level (faultsaffecting low-level device functionality) to the application level (faults affecting thesoftware application running on the NoC-based system). At each level in the hierarchy,faults can be characterized by type, source, frequency of occurrence, and impact. At thelowest level (physical), it is assumed that a fault results in a degraded operation of therespective component. For example, a wire short or open, or a missing via could occur atthis level. Then, either repair is performed or the problem is passed on to the next level.That is, at each level, detection and recovery have associated costs and performancepenalties. If it is not always cost effective to detect-and-recover (D/R) at the lowest level,the fault manifests as a local error/failure which propagates to the next higher level,where the corresponding FT technique is evaluated again relative to performance and costeffectiveness. The objective of this hierarchical partitioning is to provide cost-effectivehandling of error/failures while satisfying system requirements. The hierarchicalapproach can provide back-up at higher levels for faults which, for any reason, are nothandled at lower levels. In the case of Fig. 4-1, if a low level-fault occurs, an alternateroute can be taken between P1 and P2 (along the dashed line) to avoid the problem byaddressing it at the network/transport layer. This comes at the cost of higher datatransport latency due to the non-optimal path-length. Generally, the higher the level in thehierarchy, the longer it takes to contain and/or recover from the effect of a failure, butthere are certain advantages with respect to area cost and power dissipation. For real-timeNoC systems, time is the critical factor for specifying the performance of the FT70implementation. Designers must decide on the effectiveness of a particular method byknowing how quickly faults must be detected, how quickly they have to recover from theoccurrence of a fault, how long an error can exist in the NoC infrastructure withoutimpairing/compromising system performance. This is the motivation for new metrics forfault-tolerance in NoCs, especially in cases when QoS constraints must be satisfied.First, based on the example application in Fig. 4-1 and considering a hierarchicalimplementation as in Fig. 4-2, we define metrics for the five elements of comprehensivefault-tolerant methods.(a) AvoidanceFault-avoidance techniques can be realized through information redundancy (bymeans of error correcting codes for NoCs) or hardware redundancy (n-modularredundancy being a typical example). Depending on the targeted number of errors tocorrect, coding/decoding hardware blocks may require a certain number of cycles toperform their operation. The associated time overhead adds to the total latency of the databeing transported across the NoC fabric. Additionally, area and power overhead must beconsidered.We define the time overhead of an avoidance scheme, Tav,ov, as the differencebetween data latency with (Latav) and without (Lat) fault avoidance:Tav ov= Latav– Lat (4.4)The difference between various implementations of this concept can be significantrelative to this metric. In the example in Fig. 4-1, if coding/decoding functions areimplemented at each switch on the path between P1and P2(switch-to-switch avoidance),the resulting time overhead will be significantly higher than in the case where only71end-to-end avoidance is implemented (i.e., data is encoded at the source and decoded atdestination). Ideally, latency insensitive protocols and link pipelining techniques mitigatethe effect of the extra latency cycles.(b) DetectionThe next hierarchical level in a fault-tolerant design is detection of faults that were nothandled by the avoidance mechanism. Detection is built in most error-correcting codes,which generally can provide information regarding the number of un-corrected faults,when the correction mechanism fails (or is not even present in the particularimplementation). Fault detection is then used to assess the need for recovery frompotentially uncorrected fatal failures. The quicker the detection mechanism signals thepresence of uncorrected faults, the quicker the recovery can be initiated. We define thedetection latency Tlatas the amount of time between the moment a fault occurs and themoment it is detected. Going back to our example in Fig. 1, fault detection may beperformed by the processes P1and P2whenever data is received (end-to-end detection), atthe input ports of the intermediate switches (switch-to-switch detection), or at eachswitch input/output port (code-disjoint detection). In each case, the detection latency Tlatis different and the impact on the performance of the FT system varies accordingly.(c) ContainmentFault containment is concerned with limiting the impact of a fault to a well-definedregion within the NoC. Error containment refers to avoiding the propagation of theconsequences of a fault, the error, out of this defined region. Fault containment regions(FCR) may be defined with variable resolutions, directly correlated with the quality andresolution of the fault detection mechanism. For the case of Fig. 4-1, and assuming an72end-to-end detection mechanism, the fault containment region can be defined as theentire shaded route between the cores where processes P1and P2are being executed. Thesize of the fault containment regions can be decreased by employing more accurate faultdetection schemes, such as switch-to-switch or code-disjoint detection. It is essential thatFCR be independent in the sense that a fault occurring in a FCR does not affect adifferent FCR. In this respect, if two routes between cores/processes P1and P2can befound that are on independent FCRs, and a fault is detected on one of the routes, the otherroute can be used to provide an alternative path between processes P1and P2(representedas a dashed line in Fig. 4-1).(d) IsolationThe independency of fault containment regions can only be achieved if an effectiveisolation method can be provided, which can guarantee that the effect of a fault occurringin a FCR does not propagate to another FCR. At the physical layer, in the case ofpermanent faults, isolation can be accomplished by marking or disconnecting the faultyNoC components (links, switches, routes) and avoiding their use until, eventually,hardware recovery/repair can be performed through reconfiguration. At higher layers,erroneous data packets can be dropped on the fly or at the destination process, such thatthey are not allowed to interfere with the rest of the data and propagate at applicationlevel.(e) RecoveryThe ultimate goal of fault tolerant schemes is to provide means to recover fromoccurrence of failures. For fault-tolerant, QoS constrained NoCs, it is important torecover from failures within the time budget allowed by the QoS specifications. Late73recoveries, even when successful, are not acceptable, since they lead toout-of-specification behaviour of the NoC subsystem. Consequently, we define recoverytime (Trec) as the amount of time that passes between the detection of a fault and recoveryfrom the corresponding failure. A simple form of attempting recovery of erroneous dataflowing between processes P1and P2is to provide a hardware correction at lower layers,or a retransmission mechanism at higher layers where, upon detection of an error, anautomated retransmission request (ARQ) is generated and an error-free copy of originaldata is resent from the source process.4.4 Metrics evaluationThe use of the metrics defined in this work is illustrated by running simulations of afew detection and recovery schemes on a 4x4 NoC fabric, as depicted in Fig. 4-1. Acycle-accurate, flit-level simulator is used [101], with messages injected with equalprobabilities by all NoC cores (i.e., uniformly distributed random traffic). The value ofthese metrics will be illustrated with an example.In all simulations, messages are organized in 16 flits each, and are transferred acrossthe NoC medium following the dimension-order (e-cube) routing algorithm, wheremessages are first sent along the x direction of the topology, then along the y directiontowards the destination core. The injection rate is varied uniformly. Faults are injectedrandomly, on a flit-per-cycle basis, i.e., in each cycle, flits may be marked as erroneouswith equal probabilities, regardless of their current location in the NoC fabric.In this set of experiments, we are mainly interested in detection and recoveryestimation. We consider the following cases for fault detection scenarios:74- end-to-end (e2e) detection: the errors are detected at the destination cores. Each flit ischecked for errors upon reception and, if found erroneous, the detection latency iscalculated as the time difference between the fault injection moment and flit arrival time.- switch-to-switch (s2s) detection: error checking is performed at each input port of theswitches, for each flit traveling across the NoC.- code-disjoint (cdd) detection: flits are checked for errors at both input and output portsof the NoC switches.Figure 4-3: Average detection latency for end-to-end (e2e), switch-to-switch (s2s), andcode-disjoint (cdd) detection schemes.Fig. 4-3 shows the average detection latency for the three detection mechanismsconsidered here. Flit error rate is set at e=10-10, i.e., in each cycle, for each flit, theprobability of being flagged as erroneous is 10-10[71]. The e2e detection is the poorestperformer relative to detection latency, since the messages must travel all the way fromsource to destination before an eventual error being detected. The cdd mechanism75performs best, due to the fact that error detection is performed at each input/output of theswitches.Next, two recovery methods are simulated, both based on retransmission of erroneousmessages. The first scheme is a message-level, equal-priority retransmission. Upondetection of an erroneous flit, an ARQ message is generated at the destination core. Forease of implementation, the ARQ message is injected into the network immediately, onceall the flits of the erroneous message have been completely received. If more flits in amessage are erroneous, only one ARQ message is generated. The ARQ is sent towardsthe source of the erroneous message and handled as any other regular message by theNoC environment. The case where ARQ messages themselves are affected by errors isnot considered here. There are three types of messages in this scheme: regular messages,ARQ messages, and retransmitted messages. They all have equal priorities, and aretreated similarly when traversing the NoC. We call this method equal priority recovery(epr).For speeding up the recovery process, we consider a second retransmission-basedrecovery scheme, where ARQ and retransmitted messages have higher priority than theregular messages. When contention arises, regular messages will wait for ARQ andretransmitted messages to be processed. This allows faster transport of the lattermessages types, and therefore speed-up the recovery process. We call this methodpriority-based recovery (pbr).The next experiment measures the recovery latency for a few combinations ofdetection/recovery schemes. Ideally, we would like to isolate the detection and recoverymechanisms, and report individual performance figures for the recovery phase only. In76practice, this is difficult due to the two processes being inter-dependent. In ourexperiments, the inter-dependency is caused by the need to manage the erroneousmessages in a realistic fashion, i.e., upon detection of an erroneous flit, the correspondingmessage continues to be transported toward its destination and, eventually, interact withARQ and retransmitted messages. This causes erroneous messages to continue to traversethe NoC after error detection and keep NoC resources occupied, effectively reducing theNoC capability to transport ARQ and retransmitted messages with maximum efficiency.Figure 4-4: Average recovery latency for equal priority recovery (epr) and priority basedrecovery (pbr).The average recovery latencies for the two recovery methods considered in ourexperiments are shown in Fig. 4-4. The flit error rate is set to e=10-10for all simulationruns. Note that the pbr method offers better recovery latency due to preferentialprocessing of ARQ and retransmitted messages. The advantage of priority-basedrecovery is more apparent at higher injection rates, where the ARQ and retransmittedmessages are able to travel faster than the regular messages waiting in queue buffers foravailable resources. Also, the pbr scheme can support tighter QoS requirements in terms77of minimum latency in presence of errors.In the last experiment, we study the effect of varying the flit error rate e on theaverage recovery latency, for the pbr scheme. In this experiment, the message injectionrate was set constant to a non-saturating value of 0.1 flits/cycle/core. A high flit error ratewill cause the generation of numerous ARQ and retransmitted messages, which willeffectively cause an increase of the NoC traffic. Similarly, a high injection rate willcause, even in the error-free case, an increase in latency due to message congestion.Therefore, simulations are performed at a low, non-saturating injection rate, at which thevariation of message recovery latency is principally affected by the error rate e.The results shown in Fig. 4-5 indicate that the system can reliably recover fromfailures at flit error rates up to approximately e=10-9. At this point, the recovery latencystarts to increase significantly, and reaches values that may not be compatible with theQoS specifications of the system. The recovery latency is influenced by both the flit errorrate and message injection rate. For higher message injection rates, the nature of thecurves in Fig. 4-5 remains the same, but with a higher flat plateau at low flit error rates(due to increased average message latency), and a faster increase of recovery latency withincreasing flit error rate. For a complete characterization of a practical case, simulationswould have to cover a wider range of traffic conditions.78Figure 4-5: Average recovery latency for pbr scheme with variable flit error rateThe advantage of using the detection and recovery latencies as defined in this workbecomes apparent by observing that using experiments similar to the ones presented here,designers are able to isolate the performance figures of fault tolerant methods from theenvironmental conditions (represented by traffic conditions in our experiments).As an example, the design specification of a NoC fabric such as the one in Fig. 4-1may require an average recovery latency of 30 cycles for a message injection rate of up to0.1 flits/cycle/core, at a flit error rate of up to e=10-10. Based on the simulation results inFig. 4-4 and 4-5, the end-to-end (e2e) schemes can be eliminated from the set of potentialcandidates for FT implementations, since their average recovery latency is greater thanthe required value of 30 cycles.Finally, we estimate the message arrival probability (MAP) associated with the threedetection scenarios, for the pair of communicating processes P1and P2. This is anindication of how well the three schemes perform when application A in Fig. 4-1 has tightbound imposed on the quality of communication between processes P1and P2in terms of79the ratio of successfully transmitted messages (targeted QoS [72]). In Fig. 4-6, we plotthe average message latency versus bound on MAP under traffic assumptions statedearlier in this section. In this set of experiments, flit error rate was set at e=10-6, at amessage injection rate close to network saturation.Figure 4-6: Average message latency vs. bound on MAP.Note that when the required MAP is increasing towards 1, the average messagelatency increases significantly. This indicates that, when the application requires a MAPclose to 1, the simple recovery techniques presented here may not suffice, and moreadvanced FT techniques need to be employed. Also, a complete characterization ofparticular FT techniques would involve additional application metrics underapplication-specific traffic conditions.To illustrate how the metrics presented in this chapter can be used by NoC designersto select an appropriate FT technique when a NoC system is designed with QoSconstraints, consider the case of two application tasks T1, T2mapped onto processors P1,P2, respectively, in a 4X4 mesh NoC, as shown in Fig. 4-7.80Figure 4-7: Processes P1, P2mapped on a mesh NoC with QoS communication constraints.The QoS specifications for the two tasks are expressed as follows:“The maximum allowed flit latency between processes P1and P2is 60 clock cycles, fora message injection rate of up to 0.3 message/cycle/process and a maximum flit errorrate of 10-10.”The task of the designer is to find a combination of error detection and recoverymethods that satisfies the above constraint, while factoring in the cost in terms of areaoverhead. For simplicity, these costs are assumed to be known and ordered as in Fig. 4-8for the set of detection and recovery methods considered earlier in this chapter.code-disjointswitch-to-switchlowmediumhighend-to-endlowmediumhighFigure 4-8: Performance and cost of detection techniques.For determining what combination of error detection and latency is the most81appropriate, the designer will extract the latency numbers performing analysis orsimulations, and then choose the pair whose added latencies satisfy the QoS requirement.For illustration, the latency metrics for the detection/recovery methods considered in thischapter and applied to the two communicating processes P1, P2in Fig. 4-7 are shown inTables 4-1 and 4-2, respectively.Table 4-1: Detection latency (10-10flit error rate)Traffic load (message injection rate, [message/cycle/process])0 (no-load) 0.1 0.2 0.3cdd 3 3 3 4s2s 4 4 5 6e2e 21 21 22 24Table 4-2: Recovery latency (10-10flit error rate)Traffic load (message injection rate, [message/cycle/process])0 (no-load) 0.1 0.2 0.3epr 42 42 45 58pbr 42 42 45 51Adding the corresponding cells in Tables 4-1 and 4-2, one can determine whichdetection/recovery pair can be successfully used. Two options are available to thedesigner in this example: cdd + pbr and s2s + pbr. All other combinations exceed thelatency limit imposed by the QoS constraint. Factoring in the higher area cost of thedetection techniques as shown in Fig. 4-8, the solution of choice is s2s + pbr, which82requires a lower implementation cost. Clearly, the separation of fault-tolerancemechanisms and differentiation of the latency metrics provides greater visibility inselecting the optimal method.4.5 SummaryThe need for fault-tolerant features in NoC communication fabrics is recognized byacademic and industrial communities, and different implementations are possible toaddress the challenges of providing fault-tolerant communication in NoC subsystems. Itis, however, difficult for designers to choose a particular FT scheme due to lack ofmetrics that can express the intrinsic effectiveness of a FT method, and not only its effecton generic performance parameters such as latency, throughput, and power consumption.We have presented a comprehensive, unified view of fault tolerant methods for NoCsubsystems, with hierarchical partitioning that makes possible mapping the elements of aFT implementation on the generic layered structure of network-based communicationsystems. We presented two novel metrics (detection latency and recovery latency) thatassess the capability of a NoC subsystem to quickly detect the presence of a fault andrecover from failures in a speedy manner. The use of these metrics was illustrated with aset of simulations. These metrics are intended to complement the generic, applicationoriented (throughput, latency) and resource oriented (area overhead, power consumption)metrics.83Chapter 55 Fault-tolerant Global Links for Inter-core Communicationin Networks-on-chip45.1 IntroductionIn this chapter, we propose a method to improve the yield of global links of NoC-based chips by using redundant upper-level wires interconnected with the low-level metallines in a sparse matrix configuration. The global NoC links are defined as the lines usedto transfer the intercore signals (inter-router or router-to-PE). A yield improvementtechnique targeted specifically at these links is developed. The structured nature andfunctionality of the SoC interconnect infrastructures render them particularly suitable forthe implementation of these features, similar to the case of memory blocks, where yieldenhancement and defect tolerant techniques have become common practice [15] [55] forsome time already. According to the discussion and classification in Chapter 4, thefault-tolerant solution here is a low-level fault-tolerant technique, placed at the physicallevel, whose containment region is the corresponding NoC link. Its performance in termsof fault detection time and recovery speed through link reconfiguration is outlined inSection 5.8 of this chapter.4This chapter is based on work published in or submitted to:1. C. Grecu, A. Ivanov, R. Saleh, P.P. Pande, "NoC interconnect yield improvement using crosspointredundancy", IEEE Symposium on Defect and Fault Tolerance in VLSI Systems, 2006, DFT '06,Oct. 2006.2. C. Grecu, D. Sengupta, P.P. Pande, A. Ivanov, R. Saleh, “Self-repairable SoC communicationlinks using crosspoint redundancy”, IEEE Transactions on VLSI, under review, submitted Feb.2008.84The method described in the following sections is based on the use of sparse regularcrossbars, which are a particular class of generalized connectors. The novel features ofour method are two-fold: first, we provide a mechanism to test the inter-core linkspost-fabrication at start-up or during run-time and detect the faulty wires. Second, wedesign a reconfiguration scheme that selects a set of fault-free wires to set-up theconnections between cores. The reconfiguration scheme is designed such that everypossible combination of fault-free wires yields closely matched delays, with the benefitof ensuring consistent, uniform link speed across a wide range of defect sites.A widely-used technique for achieving fault-tolerance in computing systems is theN-way modular redundancy (NMR) [63]. This is based on replicating the component anumber of times (three or more) and employing voting mechanisms to decide which ofthe replicas delivered a fault-free output. Generally, this technique brings a high hardwareoverhead due to the number of replicas involved. In the particular case of fault-tolerantNoC communication links, the use of NMR is prohibitive due to large interconnectoverhead, which can lead to wire congestion and routing difficulties. Our solution is morefine-grained than the classic NMR, and delivers a methodology for yield tuning, test, andreconfiguration.5.2 Related workSparse crossbars were originally proposed in [73] for designing high-performance,hardware efficient concentrators and distributors for interconnection networks. Theirapplications range from computer networks [74] to multi-processor systems [75] andinstrumentation networks [76]. Fault tolerance of global SoC interconnects was addressedby different groups proposing error recovery schemes based on error detection and85correction [67], mainly aimed at recovering from failures caused by transient faults.Another approach, proposed for multi-core SoCs built on NoC platforms, is to achievefault-tolerance at the data transport level by employing stochastic [77] [78] or adaptive[79] communication mechanisms. For more structured interconnect systems, such asFPGAs, tolerance to permanent faults can be achieved by rerouting and remapping thefaulty interconnect resources [80]. In [81], FPGA interconnect are tested, diagnosed andreconfigured on-line, performing an incremental routing that uses a routing window toreduce reconfiguration time, incremental configuration file size, and incrementalconfiguration file download time. The high level of programmability and modularity ofthe FPGA systems makes them readily adaptable for implementing fault-tolerantmechanisms.Another method of improving interconnect yield when via failure is a significantsource of yield loss is via duplication [82] [83]. Usually, via duplication is performed as apost-layout design for manufacturability (DFM) step, where additional vias are insertedwherever the layout and design rules checking (DRC) allow it [84] [85]. Post-layout viaduplication tends to increase the complexity of the layout by adding vertices and edges tothe layout. Restricted layout topology design styles aim for reduced layout complexity,with fewer vertices and edges. This places the goals of restricted topologies and DFM inconflict with one another. Ideally, DFM techniques are applied early in the design stage.Yield-aware routers attempt to insert redundant vias during the routing stage [86].However, the optimum parameters for DFM may not be known until after the layout issubstantially complete, and layouts migrated from earlier technologies will typically notbe designed in a way that follows the newer technology’s DFM requirements.86The NoC global interconnects (communication links) do not have the degree ofprogrammability of FPGA interconnects; however, they are prone to the same types ofmanufacturing and reliability defects. Via duplication, being performed late in the designstage, is a “best effort” solution that cannot provide guarantees with respect to a targetedinterconnect yield. As a layout-level solution, via duplication is not entirely reusablewhen designs are migrated from one technology to another. Although most defects canbe detected during manufacturing test, permanent defects (undetected and/or developedpost-fabrication) may still affect the operation of SoC interconnects due to latentmanufacturing flaws and wear-out mechanisms. More importantly, the globalinterconnect defects are a potential source of yield loss [13] due to the large number ofvias they need.In this chapter, a method is proposed for designing NoC infrastructure links such thata specified interconnect yield target can be achieved. We present the trade-off betweenthe interconnect resources and the achievable yield. Our method integrates sparecalculation for targeted yield, test and diagnosis mechanism, and interconnectsreconfiguration.5.3 Problem statementIn this section, the problem of constructing global, structured NoC interconnects suchthat, under a specific set of defect assumptions, a certain interconnect yield be achieved,is formally described. Let L be a NoC link consisting of a set of wires whose function isto transfer a certain number of signals, denoted by m, between two cores (NoC routers orprocessing cores). Let Plinebe the probability that a single global line (including the viasfrom the silicon surface to the upper metal layers and back – as in Fig. 5-1) is fabricated87correctly. Given a certain interconnect yield target Y to be achieved, we want todetermine the total number of metal lines n (n  m) that need to be provided to the globalNoC link L.Additional objectives are:1. The spare and active metal lines must have closely-matched electricalcharacteristics.2. The switching circuitry that routes the m signal lines to the n active wires of theNoC interconnect L must be of minimal area and complexity.The first objective above reflects the fact that the circuits that implement and managethe interconnects and perform the actual mapping of signal lines to physical wires mayplace an additional capacitive load on the metal wires. If this load is not identical for allwires, then their delays may be different and, consequently, the timing characteristics ofthe same link may change when the link is reconfigured upon the occurrence of apermanent fault.The second objective is due to the fact that the implementation of the fault-tolerantfeatures must be hardware efficient, since the silicon area is an important component ofthe overall cost.Our approach is intended for use in the early stages of the physical design of the NoC-based SoCs and it consists of the following steps:Step 1: Identify the NoC links that connect different routers/cores and span multiplemetal layers (as indicated in Fig. 5-1b).Step 2: Calculate the probability of failure per line, using the probability of failure of anindividual via; calculate the number of spare metal lines required in order to achieve the88desired interconnect yield.Step 3: Provide a mechanism that can select a set of good lines at both ends of theNoC link, with the number of defect-free lines equal to the number of logic coreinputs/outputs, from a total number of lines equal to the sum of logic inputs/outputs andspare lines calculated in Step 2.Step 1 can be accomplished by inspecting the schematics of the NoC circuitry andselecting the global links – the ones that interconnect NoC switches. In Section 5.4, wepresent a method to calculate the number of spare metal lines that must be provided inorder to achieve a desired interconnect yield as required in Step 2. The mechanism thatrealizes Step 3 above is presented in Section Interconnect yield modeling and spare calculationIn VLSI manufacturing, the critical area model [87] is commonly employed to estimatethe yield sensitivity of chips to random failure mechanisms. For specific manufacturingprocesses, the results are then calibrated using electrical test structures [88]. Thepower-law model is applied, following the procedure in [89], and the critical area modelis applied to determine the probability of failure of an individual via, PoFvia:10,/k kvia ref refPoF D Dens Dens  (5.1)where D0, refrepresents the reference via chain defect density, while Densrefrepresents thevia density of the regular via chain test structure. Here, D0,refand Densrefare constants fora specific manufacturing process and are determined through curve fitting during processcharacterization using regular via chain test structures. Experiments on technologies withdifferent minimum feature sizes reveal that the exponent of the power-law model, k,tends to 1 as the feature size decreases [89]. This allows the assumption that PoFviais89almost independent of via density (Dens) and consider via failures as independent eventsfor statistical yield calculations. This weak dependence is even more pronounced forsmaller feature sizes, i.e., PoFviain a 90 nm technology has a weaker dependence on viadensity than does PoFviain a 130 nm technology.The probability of failure of an interconnect line spanning multiple metal layers, Pline,is then determined, taking into account the corresponding number of via levels that ittraverses. Assuming via failures on different via levels as independent events, theprobability that a global interconnect line spanning l metal layers is defect-free can beexpressed as:21,11lline i iiP PoF via  (5.2)where PoF(viai-1,i) is the probability of failure of vias between metal layers i-1 and i.Under these considerations, tolerating global wire defects is a matter of providing anadequate number of spares, separating the good wires from the faulty ones, andconfiguring the inter-core links accordingly.An m-choose-n calculation can be performed to obtain the number n of wires thatmust be available to achieve m functional wires in the link. The probability that the yieldbe such that exactly i wires (i  m) are defect-free is:( , ) 1i n iyield line linenP n i P Pi        (5.3)That is, there are exactlyni   ways of selecting i wires from a total of n wires, and the90probability that each of these cases is defect-free is 1i n iline lineP P . We can have a setof m functional wires whenever m or more of them are defect-free, so the yield of the linkis expressed as the cumulative distribution function:1ni n im of n line linei mnP P Pi        (5.4)Given the desired probability (or yield) for having at least m defect-free wires, Pm of n,(5.4) provides a way to find the number of physical wires, n, that are required in thelayout in order to achieve the target yield.5.5 Fault-tolerant NoC linksIn order to physically connect a set of m functional wires at the core interface, a schememust be provided to arbitrarily connect the m interface signals to the n physical wires (m n) such that exactly m defect-free wires are used to transmit/receive the data betweentwo cores. Moreover, we must ensure that, in any possible configuration, the electricalproperties of the core-to-core interconnect are preserved, i.e., the delays of differentpossible configurations of the same link must match closely. Another objective that wetarget is that the area of the configuration circuitry must be minimal, since this adds up tothe total cost of interconnects.With these objectives, we propose a method to construct configurable connectors thathave closely-matched delays for any possible connection, and are of minimal size. Thismethod is based on the theory of generalized connectors [90], and uses specific results fora class of connectors known as sparse crossbar concentrators [91].915.6 Sparse crossbar concentratorsAn (m,n)-sparse crossbar concentrator [96] is defined as a bipartite graph G={I,O,E},with a set of m inputs {I}, a set of n outputs {O}, and a set of edges {E} such that thereexists a one-to-one mapping between any n or fewer outputs and the m inputs. In ourcase, the m inputs correspond to the functional wires that must be realized per link, whilethe n outputs correspond to the total number of wires (including the redundant ones) thatmust be physically implemented for achieving the target yield. The edges in E are calledthe crosspoints of graph G. The number of crosspoints of G represents its crosspointcomplexity. For a hardware implementation of a sparse crossbar, the crosspointcomplexity is a direct measure of the silicon area required by its layout.The number of outputs (inputs) to which an input (output) is connected to in graph Gis referred to as its fanout (fanin). The maximum number of outputs (inputs) to which aninput (output) in G is connected to is referred to as the fanout (fanin) of graph G. Asparse crossbar is said to be regular (or balanced) if the difference between both thefanouts of its inputs and the fanins of its outputs is no greater than two. The direct sum G1+ G2of sparse crossbars G1={I,O1,E1} and G2={I,O2,E2} is defined as being anothersparse crossbar G1+G2={I,O1UO2,E1UE2}.An (m,n)-sparse crossbar is called a fat-and-slim crossbar if any n-m of its outputs areconnected to all the inputs in I, and each of the remaining n outputs is connected to adistinct input. An (m,n) fat-and-slim crossbar G can be represented by an m x n adjacencymatrix, AG=[aij]m x nof binary elements, where a ‘1’ element in row i and column jrepresents a crosspoint between input i and output j. An (m,n)-fat-and-slim crossbar hasthe adjacency matrix [Fm, n-m|Sm], where Fm, n-mis a m x n-m matrix of ‘1’s and Smdenotes92the m x m diagonal matrix, as indicated in Fig. 5-1. Each crosspoint may have one of twostates: active (‘1’), in which the corresponding pass-transistor is in a conducting state,and inactive (‘0’), in which the pass-transistor is shut-off. At all times, during normaloperation, at most one crosspoint per row can be active, such that each of the m wiresrepresenting the rows is connected to exactly one of the n wires of the fault-tolerant link.Note that in the formal definition of the sparse crossbar concentrator above, the “inputs”of the crossbar denote in fact the set of logic signals that must be transferred between twoIP cores along the fault-tolerant link, while the “outputs” denote the set of physical wires(including the spare ones) of the fault-tolerant link. In terms of directionality, the fault-tolerant link is bidirectional in the sense that the rows of the crossbar connector can beconnected to either inputs or outputs of the IP cores.A different option for physical implementation of such crossbars is to usemultiplexers [102]. The requirement of bidirectionality and reconfigurability precludes,however, the use of multiplexers due to the complexity and area overhead involved.(a) (b)Figure 5-1: (a) Non-fault-tolerant sparse crossbar and crosspoint implementation; (b) n-mfault-tolerant sparse crossbar.Fault tolerant links based on fat-and-slim crossbar connectors have a major limitation:the asymmetrical resistive/capacitive loads on input/output combinations determine alarge variation of the corresponding delays. The amount of delay variation is in direct93relation with the mismatch between the total numbers of crosspoints of the differentrow/column combinations that may be selected to achieve a fault-free configuration. Notethat there is a best-case delay, for the case in which both ends of the connection belong tothe “slim” part of the crossbar, and a worst-case delay, for the case in which both belongto the “fat” part of the crossbar, as indicated in Fig. 5-2. For each row/column pair, thesignal propagation delay is proportional to the amount of capacitance attached to the twowires on the respective row and column of the crossbar matrix.Figure 5-2: Fastest and slowest propagation delay paths for non-balanced fault-tolerantlinks.Each crosspoint – either in conducting state or not – contributes to the attachedcapacitance with the equivalent capacitances of its source and drain terminals, CSourceandCDrain. Denoting the delays of the best and worst cases by DBestand DWorstrespectively,their dependency on the crossbar size (which in turn is proportional to the degree ofredundancy) can be expressed as:( 1) ( )Best Drain SourceD n m C C     (5.5)and94( 1) ( )Worst Drain SourceD n C C    (5.6)This dependency quantifies the delay mismatch for the cases when a crosspoint isactivated in the “slim” part of the crossbar, or in the “fat” part of the crossbar,respectively. The delay variation (the difference between the worst-case delay DWorstandbest-case delay DBest) becomes more pronounced for large crossbars (large m) withincreased level of redundancy (n-m). This is illustrated in Fig. 5-3 for a set of inter-corecommunication links of 1 mm length with different row sizes and a 25% degree ofredundancy, in a 65 nm CMOS process. The curve in Fig. 5-3 was obtained throughHSPICE [92] simulation. The 25% value was chosen to reflect a typical case of yieldenhancement using redundant links as discussed later in Section 5.8. The delay variationacross the fault-tolerant links may take values along the curve depicted in Fig 5-3. Thedelay difference is significant (10% to 35%) for link widths in the range of 8 to 128 bits,typical for on-chip networks [93] [94].Figure 5-3: Delay variation for imbalanced fault-tolerant links with 25% redundancy.When building integrated systems with fault-tolerant links, designers must account95conservatively for the delay of the worst case, which involves a significant penalty -between 10% and 35% in the example in Fig. 5-3. Ideally, when the configuration of thefault-tolerant link is modified due to occurrence of a permanent fault, the delay of the linkshould not change. Therefore, all possible link configurations should have very similardelays. Intuitively, we can observe that this can happen if, for all I/O combinations, thesums of crosspoints on the corresponding crossbar rows and columns are identical. In thefollowing subsection, a structured method is provided to distribute the crosspoints acrossthe crossbar matrix such that these sums are identical (or as closely matched as possible).5.7 Fault tolerant global interconnects based on balanced crossbarsBased on the yield considerations formulated in Section 5.4 and sparse crossbarsbriefly described in Section 5.6, we now present a method to build connectors that cantolerate defects of the global wires, while still complying with the balanced delayrequirement.As shown in Fig. 5-1(a), we start with a simple crossbar whose correspondingadjacency matrix is the m x m identity matrix. This corresponds to the case when theglobal interconnect is not fault-tolerant and represents the “slim” part of the fat-and-slimcrossbar. The “fat” part is then added, as shown in Fig. 5-1(b), corresponding to n-mspare wires required to achieve the desired interconnect yield according to (5.4), as afully-connected crossbar whose adjacency matrix is of size (m) x (n-m) and has allelements ’1’.We have now a fat-and-slim crossbar that can tolerate n-m defects. However, thiscrossbar is not balanced, in the sense that the capacitive loads from a row to a column arenot matched for different row-column combinations. To balance the load, the “fat-and-96slim” crossbar can be transformed such that for each row/column combination, the sumsof the corresponding fanout and fanin are “closely matched”. First, we require that such acrossbar exists. Its existence is provided by the following theorem [91]:Theorem 1: For any positive integers m and nm, the (m,n)-fat-and-slim sparsecrossbar can be rearranged to obtain a sparse crossbar with fanout of either 1 or ,fanin of n-m+1 and minimum number of crosspoints, where  is given by: - 1 3, if2( 1) ( 1) ( 1) 3, if 0.5, and2( 1) ( 1) ( 1) 3, if 0.5, and2n m mm  nnn m m n m m n m m mnn n nn m m n m m n m m mnn n n                                              (5.7)where x and x are the floor and ceiling functions, defined as the largest integer lessthan or equal to x, and the smallest integer not less than x, respectively. A rigorousmathematical proof of this theorem can be found [91] and [95]. This theorem guaranteesthe existence of crossbars with closely-matched fanins and fanouts, but does not provide amethod to construct such crossbars. In the following, we develop a method to constructsuch crossbars.Intuitively, it can be observed that the sum fanin+fanout of column/row pairs candiffer by one due to the fanin and fanout being even or odd integers, which explains whythe capacitive loads of input/output connections cannot always be made identical:physically, this implies that they may differ by an amount equal to the equivalent nominalcapacitance of one crosspoint (equal to the sum of drain and source capacitances of thepass-transistor connecting a row to a column).97The following column exchange operation is used for balancing the rows andcolumns of the adjacency matrix of a “fat-and-slim” crossbar. The column exchangeoperation distributes evenly the crosspoints along the columns of the sparse crossbar.Since the crosspoints are only moved along the same row, the total number of crosspointson each row does not change, and remains equal to n-m+1. Formally, let AGbe theadjacency matrix of the (m,n)-sparse crossbar G.Definition: given the adjacency matrix A = [aij]mxnof the sparse crossbar, column x =[a1ia2i… ami] is said to cover column y = [a1ja2j… amj], 1i, jn, ij, if for each akj=1in y, aki=1 in x.Let x and y be any two columns in AG, where x covers y. Let ai1,x,ai2x,…,air,xbe astring of r ‘1’s in x. For any 1lr, the column exchange operation is performed byexchanging ail,xwith ail,y. If matrix B is the result of the column exchange operation,matrix B is said to be balanced when the difference between the sums of ‘1’ elements ofany two columns is equal or less than one.98Figure 5-4: Balancing an (n-m) fault-tolerant sparse crossbar through successive columnexchange operations; m=5, n=11.Fig. 5-4 shows how the column exchange operation is applied on the “fat-and-slim”,n-m fault tolerant crossbar (n=11, m=5). By plugging into Eq. (5.7), the value of  is 3,and the number of elements per column after balancing is expected to be either 3 () or 499(+1). The first step of the balancing operation (step 0 in Fig. 5-4) is re-arranging theadjacency matrix of the crossbar as four concatenated matrices: a full matrix, a lower-triangular matrix, a banded matrix, and an upper-triangular matrix. After the pre-processing step, the column exchange operation proceeds with the columns outside thebanded matrix (step 1 in Fig. 5-4). Note that column 1 is the first that covers column 11,so we start by exchanging elements (1,1) and (1,11). For the matrix newly formed, weobserve that the column 11 is still unbalanced, and the set of crosspoints in column 1unaffected by the previous exchange operation still covers the corresponding set ofcrosspoints in column 11. Therefore, a new exchange takes place (step 2 in Fig. 5-4),whose results is that both columns 1 and 11 are balanced (each column has 3crosspoints). The exchange operation continues in the same manner with columns 2 and10 (step 3 in Fig. 5-4). At the end of the sequence, a balanced crossbar is obtained, whosesum fanin + fanout has two possible values: either 10 or 11.The details regarding the arrangement of the slim-and-fat matrix and balancingalgorithm are provided in Appendix 2. Note that, after balancing, the fanout + fanin sumsfor each possible input/output combination do not differ by more than 1.The order of complexity of the column balancing algorithm is governed by thenumber of columns (n) and the average number of crosspoint elements per column ().For each pairs of columns, the average number of column exchange operations is equal tothe average number of crosspoints per column, therefore the order of complexity of thealgorithm can be estimated as1002( 1)( 1) ( 1)n m mO n n O n n O n nm m mnWe compare the link delays between the unbalanced and balanced cases, for differentlink widths and degrees of redundancy, using HSPICE simulation and 65 nm CMOSparameters. The effect of balancing on the link delays for 25% redundancy is shown inFig. 5-5. For all simulations, link length was maintained constant at 1 mm.Figure 5-5: Delay variation for balanced (b) links with 25% redundancy.Note that the effect of balancing brings the delay of the redundant links closer to theunbalanced best case delay (ubc curve in Fig. 5-5). Moreover, the range of delays for theindividual wires in each redundant link is reduced from the area between the unbalancedworst and best cases (uwc and ubc), to a very narrow range around the balanced delaycurve (bc). This reduction of delay variation for the balanced case is due to the balancingoperation yielding closely matched numbers of crosspoints on each column of thecrossbar matrix. As an example, in the non-balanced case of 32-bit links (Fig. 5-3), the101delay variation may be up to 20% (between 1.6 ns and 2 ns according to Fig. 5-5),whereas after the balancing operation, the delay is set to a value of approximately 1.8 ns,with no variation between different link configurations. At most, the number ofcrosspoints per column can differ by one and, since the numbers of crosspoints oncrossbar rows are identical, the capacitive load on link wires may differ with theequivalent capacitance of exactly one pass-transistor.An important parameter that has to be considered at design time is the delay variationas a function of the degree of redundancy. This dependence expresses the trade-offbetween the target yield (directly related to the degree of redundancy) and the achievablelink data rate (inversely proportional to link delay).Fig. 5-6 shows the effect of balancing on a 64-bit link when the degree of redundancyis varied from 0% to 100%. Note that the balancing operation has two beneficial effects:first, it reduces the worst-case link delay (from the unbalanced curve uwc to the balancedcurve bc). Second, it reduces the range of link delay variation from the entire rangebetween the two unbalanced cases (ubc and uwc curves) to values along the curvecorresponding to the balanced case (bc curve).102Figure 5-6: Delay variation versus degree of redundancy for a 64-bit linkFig. 5-7 shows the effect of both link width and degree of redundancy on the delay of thefault-tolerant links.Figure 5-7: Delay variation versus crossbar size and degree of redundancy.The degree of redundancy is set by the yield requirements; however, if the yieldtarget imposes a large degree of redundancy, the performance target in terms of linkdelay may not be met. The trade-off between performance, link width, and degree of103redundancy brings relevant limitations in the design space when yield appears as anadditional constraint. We note that the speed of wide links tends to be severely limited inthe presence of manufacturing faults, when fault-tolerant mechanisms are implementedfor yield improvement.5.8 Link test and reconfiguration mechanismsThe required number of spare wires that must be included for constructing the sparse,balanced crossbar that can be used to interface the NoC routers/cores with the faulttolerant links, can be determined as presented in Sections 5.4 and 5.5. This requiresknowledge of the defect distribution for a particular technology and the target yield,together with the delay budget of each inter-core link.Once the desired link width is selected such that yield and performance requirementsare met, a mechanism must be provided to perform two functions: first, to test the linksafter the chip is fabricated; and second, to configure the links such that only fault-freewires are used.This work is concerned with fault types that typically affect the NoC interconnects:the faults that model fabrication defects, the effect of parameter variation, design rulesviolations, and data dependency. Consequently, we include fault models like opens,shorts, and bridging faults, and crosstalk faults, as outlined earlier in Chapter 2.The links can be reconfigured both at start-up and at run-time, as soon as one or morefaulty wires are detected and marked as such. Before reconfiguration, a test procedure isrun in two steps to identify (1) faults in the sparse matrix elements, and (2) faults in theset of interconnect wires. The first step insures that the redundant interconnect can bereconfigured successfully. The second step identifies potential faulty wires and marks104them as defective. If the number of detected faults is greater than the number ofredundant wires, the test procedure returns a “fail” signal and no further reconfigurationis possible. If the number of faulty wires is less the total number of redundant wires, thereconfiguration procedure is started and a fault-free configuration is selected among theset of redundant wires.The overall test/reconfiguration architecture is presented in Fig. 5-8. Note that, forarea efficiency reasons, the test and reconfiguration hardware blocks are shared amongmultiple links for the same IP core. This is of particular interest for the case ofcommunication-intensive cores with multiple links, such as network-on-chip routers,split-bus controllers and multi-core processors.Figure 5-8: Self-test and repair link architecture with shared test and reconfigurationblocksThe following subsection presents the test and reconfiguration procedures and thecorresponding hardware blocks, respectively.5.8.1 Testing the sparse crossbar matrix and interconnect wiresFor simplicity and efficiency, we choose a scan-based approach for testing the105programmable switches of the matrix array. For test purposes, the columns of the sparsematrix are concatenated in a one-dimensional array, organized as a serial shift register.Testing is performed by shifting-in a sequence of alternative ‘0’s and ‘1’s, and thenshifting-out the content of the array. The length of this sequence is equal to the size of theone-dimensional array. At the output of the array, a simple sequence detector indicatesthe occurrence of two consecutive identical values, which means that a fault is present inthe array. In this implementation, the test procedure is halted and the “fail” flag is raisedif faults are detected in the crossbar array.Figure 5-9: Link-level test and reconfiguration.The time taken for the test procedure to complete is equal to 2*m*(n-m+1) clockcycles, since the test sequence must be shifted through the uni-dimensional array whoselength is equal to the total number of matrix crosspoints (see Fig. 5-2). After the106crosspoint matrix is tested, if it is found fault-free, the test procedure advances to theinter-core wires. These are tested using a built-in self-test (BIST) method, based on afault model that accounts for both defects/faults characteristic to global interconnects.In Chapter 3, a method was introduced to test structured links using the MAF modeland a BIST approach. Here, we adapt it to the specifics of employing the sparse crossbarmatrices as a means to reconfigure the redundant links. The MAF test must be applied notonly to the wires actually used to transport data between cores, but to all the wires –including the spare ones, which are not yet determined at the time of MAF test – in thefault-tolerant communication link.The MAF test vectors are launched at one end of the link, and the response iscompared to the expected output at the other end of the link. The results of the MAF testprocedure are stored on a per wire basis, by marking each wire as good/faulty at the endof its corresponding 6-vector sequence. This information is stored in a register array RL(one element per column) and subsequently used for reconfiguring the sparse crossbarmatrices. The number of faulty wires in a link is stored and updated by the MAF testprocedure; when this number exceeds the total number of redundant wires (n-m in Fig. 5-2(b)), the test procedure is halted and a “fail” flag is raised. In this case, the link ismarked as permanently defective, and cannot be used further.107Figure 5-10: Test and reconfiguration hardware.5.8.2 Link reconfigurationOnce the links are tested, if the number of faulty wires is found to be less than theamount of redundant ones, the reconfiguration phase begins. The reconfigurationmechanism uses the output of the link test mechanism (stored in register RL) to assign aset of m fault-free wires to the m core I/Os. When activating the crosspoints of thecrossbar matrix, at most one element per column can be set to logical ‘1’ (to ensure that acore I/O is connected to exactly one wire), and exactly m elements can be active (toensure that all core I/Os are connected to wires in the redundant link). A reconfigurationvector initially set to [1 0 0 … 0] is used to configure each column of the crossbar matrixby shifting it serially into each column’s set of crosspoints. To ensure that at most one isconnected to any row, the reconfiguration vector of the next connected column is shiftedright by one position relative to the previous connected column.The pseudo-code describing the operation of the reconfiguration mechanism ispresented in Fig. 5-11. It assumes that at least m wires are fault-free (out of n wires of the108redundant link).Figure 5-11: Link reconfiguration algorithm.For each fault-free wire, the reconfiguration vector is constructed by rotating right thereconfiguration vector of the previous fault-free wire, and then shifted into each columnof the crosspoint matrix. Faulty or unused wires are invalidated by shifting in an all-0vector on the corresponding column of the crossbar. This ensures that at most one coreI/O is connected to each redundant global wire.So far, via failure was considered as the predominant yield loss mechanism for globalinterconnects. Using active devices for implementing the link test and reconfigurationhardware adds a potential source of yield loss due to the non-zero probability of failure ofthese devices. This effect must be accounted for when estimating the yield of theinterconnect. Conversely, when calculating the number of spares required for achieving acertain target yield, this additional source of yield loss must be considered as well. Wedefine the effective interconnect yield Yeffas the interconnect yield achievable in presenceof non-zero probability of failure of test and reconfiguration circuitry. Therefore, theprobability of failure of a single line of interconnect Plinein Eq. (5.2) must be adjustedaccordingly.-- crosspoint matrix reconfiguration-- initializek = 0; -- column indexl = 0; -- row indexrec_vector = [1000…0]while l < m -- not all rows have been connectedwhile RL(k) = 0 -- wire k is defectiveinvalidate(k); -- do not use column kk=k+1; -- advance to next wireend while;column(k) = rec_vector; -- reconfigure column krec_vector = ror (rec_vector); -- rotate right and set the next   --reconfiguration vector l++;  -- advance to the next rowend while;109As depicted in Fig. 5-2, both ends of each line of the fault-tolerant inter-core links areconnected to the core’s inputs/outputs by a pass-transistor. The probability of failurePoFTof this transistor can be estimated knowing the killer defect density k [14] for themanufacturing process under consideration, and the area of the device AT:T TPoF k A  (5.8)Therefore, Eq. (5.2) must be rewritten to take into account the probability of failure of thetwo pass transistors as:221,11 1lline T i iiP PoF PoF via    (5.9)Similarly, the effective yield Yeffof an inter-core link must account for the potential yieldloss of the crosspoint matrix elements and test/reconfiguration mechanism. Using againthe killer defect density k and the area AFTof the circuitry used to implement andreconfigure the fault-tolerant links, the effective yield of a link can be express as:eff FT m of nY k A P   (5.10)where Pm of nis the yield of the link without considering the yield loss due to fault-tolerantcircuitry, as calculated in Eq. (5.4). In the expression of effective yield Yeffin Eq. (5.10)we use the adjusted value of Plinecorresponding to Eq. (5.9).The test and reconfiguration sequences can be run at start-up only, both at start-upand run-time, or during idle periods in which there is no traffic on the particular link. Theamount of time required to run the test/reconfiguration procedures is an important figureof merit since it expresses how fast the system can be operational after start-up, or howlong the individual links undergoing test/reconfiguration will be unavailable duringrun-time. In Section 5.8.3, a more detailed analysis and estimation of these figures of110merit is provided, in addition to other cost parameters such as wire complexity and areaoverhead associated with the hardware implementation of fault-tolerant interconnects.5.8.3 Yield, performance and cost analysisWe applied the method developed in this work to sets of global interconnects ofdifferent widths, representative for the cases of massive multi-core communicationfabrics. We considered the case of a 65 nm technology with 12 metal layers and defectdistributions as projected by the ITRS 2006 roadmaps.Figure 5-12: Physical link width (n) for different values of logic link width (m = 32, 64,128 bits) and target effective yield (Yeff).Fig. 5-12 plots the total number of wires (including the redundant ones), n, that needto be routed in order to provide 90%, 99%, and 99.9% effective interconnect yieldrespectively, for global link widths (m) of 32, 64, and 128 bits and probability ofdefect-free global interconnect line Plineranging between 0.9 and 1.The total number of wires n, of the global link, in the above example, was determinedassuming the global interconnects are laid out starting on source/drain contacts, through111intermediate metal layers to the upper metal layers and back. As mentioned above, weuse a 65 nm CMOS technology, with a typical defect density of 0.025defects/cm2obtained from [14].Table 5-1: Effective yield improvement vs interconnect complexityInterconnectwidth (m)Targetinterconnecteffectiveyield Yeff[%]Number ofcrosspointsInterconnecteffectiveyieldimprovement[%]90 25 3.999 58 7.6899.9 74 8.390 82 11.199 248 15.71699.9 279 16.190 112 16.999 302 25.83299.9 725 26.790 285 36.499 573 45.36499.9 2224 46.590 1192 61.399 5429 70.312899.9 10984 71.2Table 5-1 shows the relative interconnect effective yield improvement (the fourthcolumn in the table) for a probability of defect-free line Pline= 0.99 and the cost toachieve it in terms of number of crossbar crosspoints (third column) for different targetinterconnect yields and link widths.As shown in Table 5-1, the yield of NoC interconnects can be significantly improvedby using our technique. It is, however, extremely important to estimate carefully thetarget effective yield and characterize the manufacturing process accurately, sincedemanding a high yield or underestimating the individual Plinefor global interconnects112can increase dramatically the size of the fault tolerant crossbars and, implicitly, thesilicon area and metal lines required. For example, if a 99% effective yield is targeted fora 64-bit wide global link, we need to provide a 573-crosspoint sparse crossbar at each endof the link; if, instead, a 99.9% yield is aimed for, the number of required crosspoints is2224. That translates to an approximately 4X increase in silicon area for an interconnectyield improvement of only 0.9%.The area overhead of the combined test and reconfiguration mechanisms is presentedin Fig. 5-13, for different crossbar sizes and degrees of redundancy. Note that the amountof circuitry required for implementing the test and reconfiguration mechanisms isrelatively small. The overhead corresponding to the programmable crosspoints, eachconsisting of a pass transistor and a memory cell is reported separately in Table 5-1. Eachcrossbar matrix constructed as described in Section 5.5 has a number of ( 1)n n m  such crosspoints.Area overhead (test and reconfiguration)0501001502002503008 16 32 64 128crossbar size25% 50% 75% 100%Figure 5-13: Area overhead for different crossbar sizes and degrees of redundancy.Table 5-2 presents the time overhead of test and reconfiguration mechanisms, in113clock cycles. These values are important for systems that are required to have a shortstart-up time, or, in the case in which the test/reconfiguration mechanism is executedduring the regular operation, for systems that must achieve high availability.In Table 5-2, TC represents the crossbar test time, and is proportional to the logicalcrossbar size (m) and the degree of redundancy (expressed in percentage in Table 5-2).TL denotes the physical link test time, and is governed by the number of physical wiresthat must be realized for each logical link, for achieving a target yield. TR is the linkreconfiguration time, and denotes the amount of time required to shift-in thereconfiguration vectors for each crossbar. The sum of TC, TL and TR is denoted by and indicates the total amount of time required to set-up a fault-tolerant link between twocores of a NoC, either at start-up on during its operation.Table 5-2: Test and reconfiguration time overhead25% redundancy 50% redundancy 75% redundancy 100% redundancyCrossbar sizeTC TL TR  TC TL TR  TC TL TR  TC TL TR 859 60 40 159 119 72 96 287 179 84 168 431 239 96 240 57516199 120 140 459 399 144 312 855 599 168 532 1299 799 192 800 179132719 240 480 1439 1439 288 1104 2831 2159 336 1904 4399 2879 384 2880 6143642719 480 1760 4959 3439 576 2592 6607 8159 672 7168 15999 10879 768 10880 2252712810559 960 6720 18239 21119 1152 15936 38207 31679 1344 27776 60799 42239 1536 42240 86015Depending on the link width and the amount of redundancy required by the interconnectyield target, the fault-tolerant communication links will not be available for transportingdata during their respective test and reconfiguration procedures, for an amount of timecorresponding to the columns labeled “” in Table 5-2.1145.9 SummaryIn this chapter, a structured method is presented for building defect tolerant NoCinterconnects that takes into account the manufacturing process characteristics in terms ofvia failure probability and, in turn, generates interconnects with the desired target yield.The mechanism that selects and reconfigures the communication links is based onbalanced sparse crossbars, and uses a minimum number of crosspoints to connect the setof defect-free wires at the core interface. The method also ensures that all validinterconnect configuration have identical capacitive load, which guarantees theinvariance of timing characteristics of different possible configurations. With the methoddescribed in this chapter, a relative interconnect yield improvement can be obtained in therange of 3.9% for an 8-bit logic link width to 71.2% for a 128-bit logic link width. Animportant advantage of this technique is that it can be easily automated and implementedin NoC/SoC design tools.We expect the solution proposed here to become more relevant with the emergence ofnanometer-scale VLSI processes, where high defect rates and device failures will beserious manufacturing limitations.115Chapter 66 Conclusions6.1 Summary of contributionsIn this dissertation, an integrated solution for testing and yield-tuning of network-on-chip (NoC) interconnect infrastructures is presented. A method for testing the NoC fabric(switches and links) is proposed, that employs the transport of test data progressivelythrough the NoC, reusing the fault-free NoC elements to deliver test vectors to the NoCelements under test. Two types of test data transport are proposed. The first is based on aunicast transport, where test data is directed to each element under test on a single path,and exactly one NoC component is tested at a time. The second type uses a multicasttransport method, where multiple NoC components are tested concurrently, by deliveringtest data simultaneously on multiple paths constructed from fault-free components. Theyield-tuning method that complements the test solution addresses the potential yield lossin NoC systems due to interconnect defects, and is based on the use of reconfigurablecrossbars and redundant links. A technique is proposed for distributing the crossbarcrosspoints such that all the possible link configurations are characterized by closelymatched signal propagation delays. The benefit of this technique lies in its ability toprovide consistent performance in presence of interconnect defects, with effectively nodelay penalty compared to the fault-free case.The test procedure of this solution was shown to be faster than previously proposedmethods for NoC test, effectively reducing the test cost and consequently the total cost ofNoC-based chips. Experiments on NoC infrastructure of various sizes and topologies116show that the proposed test method can improve the test time with up to 34X whencompared to current solutions. The fault tolerance technique presented here is able tosignificantly improve the yield of NoC links, in a range from 3% to 71% for the case oflogical link widths of 8-bit to 128-bit in our experiments.6.2 LimitationsIn this subsection, we enumerate certain limitations and provide views as to howthese limitations can be overcome. One of the aspects that were not addressed in thisdissertation is the power overhead. The test method developed in Chapter 3 is very time-efficient because it can deliver a large quantity of test data in a short time to the NoCcomponents across the chip. This comes at a cost of increased power dissipation, which,potentially, can be a problem depending on the power budget for the design. The solutionis to include the power dissipation as an additional component of the test cost function,and limit the injection of new test data packets when the maximum power dissipation isabout to be exceeded. Similarly for the fault-tolerant mechanism presented in Chapter 5,there is a certain amount of power that the extra circuitry (crosspoint elements, test andreconfiguration hardware) will dissipate through mechanisms such as switching (duringtest and reconfiguration phases), and leakage during normal operation. This leakagepower should be included in the overall chip power.The test solution presented in Chapter 3 is targeted to the NoC infrastructure only.The main reason behind this approach is that a common trend in industry is to designNoCs as IP cores, which are generally delivered together with the associated test method.However, if the testing of the NoC infrastructure is integrated with testing of thefunctional cores of the NoC-based chip, there is a possibility to use the concepts in117Chapter 3 to deliver test data concurrently to NoC components and functional cores, andreduce the overall test time of the chip in a more significant manner. We believe that, inpractice, this is not a very likely situation, especially when the NoC is developed as an IPcore and provided to the final user together with its test method.Another limitation of the fault-tolerant mechanism in Chapter 5 is the significantincrease of link delay when the degree of redundancy increases. This, in turn, implies asignificant defect rate, which will have serious repercussions on the operation of the NoCfunctional cores. Therefore, the target yield may not be achievable when a designconstraint is applied to the link delay.Finally, some form of quantitative estimation of the “amount” of fault-tolerance mostappropriate for each layer of the NoC communication protocol, and the effect ofinteraction between FT mechanisms at different layers, would be extremely useful fordesigners looking for the best FT solution for NoC-based chips.6.3 Future workThe high level of concurrency and parallelism in on-chip networks/multi-processorsystems-on-chip is at the same time an opportunity and a challenge. The opportunity isprovided by the great flexibility of these systems in terms of applications (software) thatcan be used to manage different aspects of their operation: power management, yieldtuning, and fault-tolerance. These aspects become increasingly challenging with the shifttoward nanoelectronics, which demands a paradigm change best described as “designingreliable systems with unreliable components” [97]. A few research directions can bepursued as direct developments of the contributions presented in this dissertation, asfollows:118 Application-level fault-tolerance for multi-processor systems-on-chip. In the pastfew years, fault-tolerance of multi-core chips was addressed mainly through hardwareredundancy or error correction mechanisms. The increased level of faults associated withnano-scale device sizes demands new approaches for dealing with transient andpermanent faults. There may be a great opportunity to exploit the distributed computingcapabilities of these systems for achieving fault tolerance. The challenge is to integratefault-tolerant solutions at application level through software-based micro-agents thatmonitor the operation of the system periodically, and restore it to error-free state upondetection of malfunctions. This would allow a higher degree of flexibility for designers offault-tolerant systems-on-chip. Traditionally, hardware-intensive voting mechanismsbased on component replication are used to achieve fault-tolerance at system level.Complementing the hardware redundancy with state restoration makes possible a morerefined tuning of fault-tolerant capabilities of a MP-SOC depending on the operatingconditions and expected failure rate. Fault-tolerant crossbar structures for nanoelectronics. Reconfigurable crossbararrays are increasingly being proposed as the basic building structure for nanoelectronicsystems [98]. The main challenge in manufacturing commercially viable nanoelectronicchips is the high defect rate – estimated in the order of 10% or more [99]. This can becompensated for by adding a certain amount of redundancy to the crossbar structures, andusing reconfiguration techniques to avoid the faulty components. Implicitly, theperformance of such structures is heavily influenced by the degree of redundancy and theparticular reconfiguration mechanism employed. Initial studies have shown that the speedof reconfigurable crossbar-based structures can vary widely for different instances of the119same fault-tolerant array [100]. To account for this variation, designers must take theconservative approach and report the worst-case performance with the goal of developingreconfiguration methods for nanoelectronic crossbar arrays that will yield near-optimal,closely-matched performance for all functional instances, with improved overall speedand tight parameters variation.Nanoelectronics is expected to play an important role in providing solutions tosurmount obstacles imposed physical limitations of CMOS electronics. Nanoelectronicsimposes significant quantitative constraints resulting in qualitative modifications to thedesign process; among them are an increase in fabric size, density and fault rates whichresult in (1) increased emphasis on parallelism, (2) short-range communication, (3)regular nano-scale interconnects, and (4) offline and online repair of faulty structures.A fundamental challenge in constructing nanoelectronics-based systems is the highunreliability of nanoelectronic devices which manifests mainly in two forms. First,manufacturing defects increase significantly, due to the defect prone fabrication processthrough bottom-up self-assembly. The defect rates of nanoelectronic systems areprojected to be orders of magnitude higher than those of current CMOS systems [99].Second, a high variance in the fault rate at run-time and a clustered behavior of faults areexpected. These are essentially caused by the nanoscale size of the devices as well as thelow operating voltage, which result in extremely high sensitivity to environmentalinfluences, such as temperature, cosmic radiation and background noise.The emerging challenges of nanoelectronics require a re-evaluation of thecomponents and methods typically employed in creating designs. Questions of paramountimportance are related to (1) mapping design descriptions to regular nanofabrics with120massive defect rates; (2) realizing defect tolerant designs in the presence of significantdefect occurrence rates and limited test access; (3) development of efficient, systematicfault tolerance techniques for faults with dynamic spatial and temporal distributions; (4)ensuring reliability without sacrificing performance.Due to the difference in complexity at the various system hierarchical levels,appropriate fault tolerance schemes vary. Building large systems using nanoelectronicfabrics must include fault-tolerance as one of the primary constraints, similar toperformance, power, area and testability. Moreover, fault-tolerance must be seamlesslyintegrated in the design flow starting from physical design and going up to applicationdesign and specification. For instance, an interconnect routing tool should be able to takeinto account the possible fault set that can manifest for the particular technology used (e.g.carbon nanotubes, quantum dots, nano-scale CMOS, or hybrid) and achieve a target yieldin a transparent fashion, without the explicit intervention of the designer. At higher levels- such as that of application - compilers should be capable to insert automatically micro-agents for fault management, enabling self-healing by repairing failed components orrestoring the system to fault-free state.Integrating fault-tolerance into the nanoelectronic system design flow requires anintimate understanding of fault sources, nature, spatial-temporal distribution, and effectsat different levels from device behaviour to application. Having this knowledge, it ispossible to exploit the particular characteristic of nanoelectronic fabrics, such asreconfigurability, multiple-valued logic, and high density, for integrating fault-toleranceinto digital nanoelectronics design flow.121In summary, the semiconductor industry is rapidly advancing towards nanoscaledomain, driven by continuous miniaturization and need for computing power. Theimmediate benefit of integrating fault-tolerance in the design flow for nanoelectronics isthe possibility of a rapid transition from research phase to volume production, enablingthe use of nanoelectronic systems in key areas such as biomedical sciences,communications, multimedia; ultimately, facilitating developments that are leading us tothe ubiquitous information society.1227 Appendices7.1 Appendix 1: Proof of Correctness for Algorithms 1 and 2The search sequence that Algorithms 1 and 2 presented in Chapter 3 implement isbased on Dijkstra’s shortest path search algorithm, which is guaranteed to find theshortest distance between any two vertices in a graph.Given a graph G(V,E), where V is the set of vertices and E is the set of edges.Each edge e in E is labeled with its weight w(e) which corresponds to the “length” of thatedge. In the original Dijkstra algorithm, the weights are static and remain constant as longas the search sequence is executed. In the modified search sequence, the weights areinitialized with a set of values corresponding to the TTPE (test time per element). Thesearch starts by selecting the edge with minimum weight (minimum TTPE) and adding itto the set of elements on the shortest path. Before the next minimum-weight element isselected, the weight of the current element is modified from the value corresponding toits test time to the value corresponding to the time needed by a test packet to traverse theNoC element placed in the normal mode of operation. For a link, this corresponds to thelinks latency, while for a switch, this is the number of cycles required to transfer a datapacket from an input port to an output port.123(a) (b)Figure A1: Shortest path search with dynamical weight updating.For instance, in Fig. A1(a), assume that the current search path is (v7, v8, v5) and thecurrent value of the test cost function is F. The next element to be added to the searchpath is the edge v5-v6 with weight 4 corresponding to its test time. The edge is added tothe search path which becomes (v7, v8, v5, v6) and the test cost function is updated to itsnew value, F+4. In the next step of the search sequence, the test cost function is updatedto its new value F+4+2 = F+6, then edge v6-v3 is added to the search path. The test costfunction becomes F+6+4 = F+10, and then the weight of edge v6-v3 is updated toF+10+2. At this moment, a minimum length path was found from v7 to v3, and thealgorithm advances to a new pair of vertices.In the general case, let Pk= (v1, vi, …, vj, vk) be the shortest path from v1 to vk.Then Pj= (v1, vi, …,vj) must be a shortest path from v1 to vj, otherwise Pkwould not beas short as possible. Also, Pjmust be shorter than Pk(assuming all edges have positiveweights).Let w(j,k) be the weight of the edge between vertices vj and vk as initialized at thebeginning of the algorithm (corresponding to the test time per element TTPE), andv1 v2 v3v4 v5 v6v7 v8 v9v10v11 v122 55 42 21 2233614452v1 v2 v3v4 v5 v6v7 v8 v9v10v11 v122 55 22 21 2233614452124w’(j,k) the new value of the weight of the (j,k) edge (corresponding to the latency of theNoC element) that gets updated after edge (j,k) is added to the search path. Since w(j,k)reflects the test time, and w’(j,k) reflects the latency, the following inequality is true:w’(j,k) < w(j,k),  (vj, vk)  E.If Fk, Fj are the values of the test cost function corresponding to paths Pk, Fj,respectively, then Fj < Fk. Moreover, Fk is calculated as:Fk = Fj + w(j,k)and then updated as:F’(k) = F(k) + w’(j,k)Since w’(j,k) < w(j,k)  (vj, vk)  E, then F’(k) is the minimum possible value of the testcost function, which proves that the modified shortest path algorithm returns the test pathwith the minimum associated test cost.The main difference between Algorithms 1 and 2, and Dijkstra’s algorithm, is themanner in which the cost function is updated., Therefore, the order of complexity ofAlgorithms 1 and 2 is similar to the one of the original Dijkstra algorithm, i.e., O(N),where N is the total number of NoC components to be tested (links and switches).1257.2 Appendix 2: Algorithm for balancing fault-tolerant sparsecrossbarsGiven a (n,m) fat-and-slim sparse crossbar with a AGadjacency matrix1 0 0 . 0 1 1 1 1 . 10 1 0 . 0 1 1 1 1 . 10 0 1 . 0 1 1 1 1 . 1. . . . . . . . 1 . 10 0 0 0 1 1 1 1 1 . 1GA           = Sm|Fm,n-mthe balancing algorithm that matches the fanin+fanout sums corresponding to each of theelements of AGmatrix is described below:1. Re-arrange the adjacency matrix asAG= Fmx|Umx(-1)|Bmx(n--2-2)|Lmx(-1), where:Fmxis a matrix with all elements equal to ‘1’;Umx(-1)is an upper triangular matrix;Bmx(n--2-2)is a banded matrix;Lmx(-1)is a lower triangular matrix;Let  and  be [91]:126 - 1 3, if2( 1) ( 1) ( 1) 3, if 0.5, and2( 1) ( 1) ( 1) 3, if 0.5, and2n m mm  nnn m m n m m n m m mnn n nn m m n m m n m m mnn n n                                              ;1n m    where  represents an estimation of the average number of ‘1’ per column of matrixAG, and  represents the number columns of matrix F.F                    U B L1 1 . 1 1 1 1 0 . 0 01 1 . 1 0 1 1 1 . 0 01 1 . 1 0 0 1 1 . . 0. . . . . . . . . 1 01 1 . 1 0 0 0 0 0 1 1GA           2. Let  ‘1’s be assigned to each column in matrix B. Then the number of ‘1’s left inthe F, U and L matrices is (+2-2)( -1), and the number of ‘1’s left unassignedis = (n-m+1)m-*n-(+2-2)The total number of ‘1’s in the AGmatrix is greater than or equal to *n:(n-m+1)m-*n  0,therefore   -(+2-2)127Case i):   0Then the average number of ‘1’s in AGis in the [,+1] interval. Therefore, thematrix can be balanced by transferring ‘1’s from the columns of the F matrix to thecolumns of U and L matrices so that each column has either  or +1 ‘1’s, using thecolumn exchange operation described in Section 5.7.Case ii):   0Then the average number of ‘1’s in each column of AGis more than +1. In this case,a number of1m     columns in F can be balanced with the U and L matricessuch that each balanced column has +1 ‘1’s.Since0  (n-m-1)m-*n  n,and = (n-m-1)m-*n-(+2-2)then  n-(+2-2).128Therefore, the remaining  ‘1’s in the unbalanced columns of matrix F can bedistributed to the columns of B, each having at most one additional ‘1’, using the columnexchange operation described in Section 5.7.Proof of convergenceWe prove the convergence of the balancing algorithm by contradiction. Assume thatthe balancing algorithm does not converge, that is, the balancing operation ends byleaving a column that has  ‘1’s and another that has +2 ’1’s. Then, a ‘1’ can be movedfrom the column with +2 ‘1’s to the column with  ’1’s, hence balancing the twocolumns. Since the existence of the balanced crossbar is guaranteed, it follows that theinitial assumption is false, and hence the balancing operation converges to a solution.129References[1] G. Sohi, “Single-Chip Multiprocessors: The Next Wave of Computer ArchitectureInnovation”, The 37thAnnual IEEE/ACM International Symposium onMicroarchitecture, Dec. 2004.[2] I. O’Connor, B. Courtois, K. Chakrabarty, M. Hampton, “Heterogeneous Systemson Chip and Systems in Package”, Design, Automation & Test in EuropeConference & Exhibition, 2007.[3] AMD Fusion Microprocessor Family,[4] W. J. Dally, B. Towles, “Route Packets, Not Wires: On-Chip InterconnectionNetworks”, Proceedings of DAC, Las Vegas, Nevada, USA, June 18-22, 2001, pp:683-689.[5] J. Owens, W. J. Dally, R. Ho, D. N. Jayashima, S. W. Keckler, L-S. Peh, “ResearchChallenges for On-Chip Interconnection Networks”, IEEE Micro, vol. 27, no. 5, pp.96-108, Sept/Oct, 2007.[6] T. Ye, L. Benini and G. De Micheli, “Packetization and Routing Analysis of On-chip Multiprocessor Networks”, Journal of System Architecture, Vol. 50, Issues2-3,February 2004, pp: 81-104.[7] P. Magarshack, P. Paulin, “System-on-chip beyond the nanometer wall”,Proceedings of DAC, Anaheim, 2003, pp: 419-424.[8] A. Kumar, P. Kundu, A. Singh, L.-S. Peh, N. Jha. “A 4.6Tbits/s 3.6GHz Single-cycle NoC Router with a Novel Switch Allocator in 65nm CMOS”, InternationalConference on Computer Design (ICCD), October, 2007.[9] E. Nigussie, T. Lehtonen, S. Tuuna, J. Plosila, J. Isoaho, “ High-Performance LongNoC Link Using Delay-Insensitive Current-Mode Signaling“, VLSI Design, Vol.2007 (2007), Article ID 46514.[10] J. Chan, S. Parameswaran, “NoCOUT: NoC topology generation with mixedpacket-switched and point-to-point networks”, ASP-DAC, 2008, pp: 265-270.[11] R. Saleh, S. Wilton, S. Mirabbasi, A. Hu, M. Greenstreet, P. Pande, C. Grecu, A.Ivanov, “System on Chip: Reuse and Integration ”, Proceedings of the IEEE, Vol.94, Issue 6, June 2006, pp. 1050-1069.[12] Y. Li (editor), Microelectronic Applications of Chemical Mechanical Planarization,Wiley, 2007.[13] Y. Zorian, D. Gizopoulos, C. Vandenberg, P. Magarshack, “Guest editors'introduction: design for yield and reliability”, IEEE Design & Test of Computers,May-June 2004, Vol. 21, Issue: 3, pp. 177–182.[14] International Technology Roadmap for Semiconductors, 2006 update,[15] S. Shoukourian, V. Vardanian, Y. Zorian, “SoC yield optimization via anembedded-memory test and repair infrastructure”, IEEE Design & Test ofComputers, May-June 2004, Vol. 21 , Issue: 3, pp. 200 – 207.130[16] P. Pande, C. Grecu, A. Ivanov, R. Saleh, G. De Micheli, “Design, Syndissertationand Test of Networks on Chip”, IEEE Design and Test of Computers, Vol. 22, No.5, 2005, pp: 404-413.[17] D.A. Hodges, H.G. Jackson, R.A. Saleh, Analysis and Design of Digital IntegratedCircuits: In Deep Submicron Technology, McGraw-Hill, 3rdedition, 2003.[18] L. Benini, G. De Micheli, “Networks on Chip: A New Paradigm for Systems onChip Design”, International Conference on Design and Test in Europe DATE, Paris2002, pp. 418-419.[19] Intel Teraflops Research Chip,[20] A. Jantsch and Hannu Tenhunen, editors, Networks on Chip, Kluwer AcademicPublishers, 2003.[21] International Standards Organization, Open Systems Interconnection (OSI)Standard 35.100,[22] Y. Zorian, E. J. Marinissen, S. Dey, “Testing Embedded-Core-Based SystemChips”, IEEE Computer Vol. 32, Issue 6, pp: 52-60, 1999.[23] E. J. Marinissen, R. Arendsen, G. Bos, H. Dingemanse, M. Lousberg, C. Wouters,“A Structured and Scalable Mechanism for Test Access to Embedded ReusableCores”, Proceedings of ITC 1998, pp: 284-293.[24] E. Cota, Luigi Caro, Flavio Wagner, Marcelo Lubaszewski, “Power aware NoCReuse on the Testing of Core-Based Systems”, Proceedings of ITC 2003, pp: 612-621.[25] C. Liu, V. Iyengar, J. Shi and E. Cota, “Power-Aware Test Scheduling in Network-on-Chip Using Variable-Rate On-Chip Clocking”, IEEE VLSI Test Symposium,pp: 349-354, 2005.[26] C. Liu, Z. Link, D. K. Pradhan, “Reuse-based Test Access and Integrated TestScheduling for Network-on-Chip”, Proceedings of Design, Automation and Test inEurope, 2006. DATE '06, pp: 303 – 308.[27] B. Vermeulen, J. Dielissen, K. Goossens, C. Ciordas, “Bringing CommunicationsNetworks on a Chip: Test and Verification Implications”, IEEE CommunicationsMagazine, Volume 41, Issue 9, Sept. 2003, pp: 74-81.[28] M. Nahvi, A. Ivanov, “Indirect Test Architecture for SoC Testing”, IEEETransactions on Computer-Aided Design of Integrated Circuits and Systems,Volume 23, Issue 7, July 2004, pp: 1128-1142.[29] J. Raik, R. Ubar, V. Govind, “Test Configurations for Diagnosing Faulty Links inNoC Switches”, Proceedings of the 12thIEEE European Test Symposium, 2007,pp: 29-34.[30] A. M. Amory, E. Briao, E. Cota, M. Lubaszewski, F. G. Moraes, “A Scalable TestStrategy for Network-on-chip Routers”,Proceedings of The IEEE International TestConference, 2005, ITC 2005, pp: 591-599.[31] M. L. Bushnell and V. D. Agrawal, Essentials of Electronic Testing for Digital,Memory and Mixed-Signal VLSI Circuits, Boston: Springer, 2005.[32] M. Cuviello, S. Dey, X. Bai, Y. Zhao, “Fault Modeling and Simulation forCrosstalk in System-on-Chip Interconnects”, Proceedings of the IEEE/ACMInternational Conference on Computer-Aided Design, San Jose, CA, Nov. 1999,pp: 297-303.131[33] P. P. Pande, C. Grecu, M. Jones, A. Ivanov, R. Saleh, "Performance Evaluation andDesign Trade-offs for MP-SoC Interconnect Architectures", IEEE Transactions onComputers, Volume 54, Issue 8, August 2005, pp: 1025-1040.[34] J. Duato, S. Yalamanchili, L. Ni, Interconnection Networks – An EngineeringApproach, Morgan Kaufmann, 2002.[35] I. Saastamoinen, M. Alho, J. Nurmi, “Buffer Implementation for Proteo Network-on-chip”, Proceedings of the 2003 International Symposium on Circuits andSystems, 2003. ISCAS '03, Vol: 2, pp: II-113- II-116.[36] H. Wang L.-S. Peh, S. Malik, “Power-driven Design of Router Microarchitecturesin On-chip Networks”, Proceedings of the 36thAnnual IEEE/ACM InternationalSymposium on Microarchitecture, 2003. MICRO-36, pp: 105- 116.[37] A.J. Van de Goor, I. Schanstra, Y. Zorian, “Functional test for shifting-type FIFOs”,Proceedings of European Design and Test Conference 1995, March 1995, pp. 133-138.[38] C. E. Leiserson, “Fat-trees: Universal Networks for Hardware-EfficientSupercomputing”, IEEE Transactions on Computers, October 1985, Volume 34,Issue 10, pp: 892 – 901.[39] A.O. Balkan, Q. Gang, V. Uzi, “A Mesh-of-Trees Interconnection Network forSingle-Chip Parallel Processing”, International Conference on Application-specificSystems, Architectures and Processors, ASAP, Sept. 2006, pp: 73 – 80.[40] M. Millberg, E. Nilsson, R. Thid, S. Kumar, and A. Jantsch “The Nostrumbackbone - a communication protocol stack for networks on chip”, Proceedings ofthe IEEE VLSI Design Conference, Mumbai, India, January 2004, pp: 693- 696.[41] Intel IXP2400 datasheet,[42] J. Liu, L-R. Zheng, H. Tenhunen, “Interconnect intellectual property for network-on-chip (NoC)”, Journal of Systems Architecture: the EUROMICRO Journal,Volume 50 , Issue 2-3 (February 2004), pp: 65 – 79.[43] A. Radulescu, J. Dielissen, K. Goossens, E. Rijpkema, P. Wielage, “An Efficienton-chip Network Interface Offering Guaranteed Services, Shared-memoryAbstraction, and Flexible Network configuration”, Proceedings of IEEE DATE2004, vol. 2, pp: 878-883.[44] P. P. Pande, C. Grecu, A. Ivanov, R. Saleh, "Switch-Based InterconnectArchitecture for Future Systems on Chip", Proceedings of SPIE, VLSI Circuits andSystems, Vol. 5117, pp: 228-237, 2003.[45] C-M. Chiang, L. Ni, “Multi-address Encoding for Multicast”, Proceedings of theFirst International Workshop on Parallel Computer Routing and Communication,pp: 146-160, 1994.[46] Z. Lu, B. Yin, A. Jantsch, “Connection-oriented multicasting in wormhole-switchednetworks on chip”, IEEE Computer Society Annual Symposium on Emerging VLSITechnologies and Architectures, 2006.[47] E. Bolotin, Z. Guz, I. Cidon, R. Ginosar, A. Kolodny, “The Power of Priority: NoCBased Distributed Cache Coherency”, The First International Symposium onNetworks-on-Chip, 2007. NOCS 2007, pp: 117 – 126.[48] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction toAlgorithms, Second Edition, MIT Press and McGraw-Hill, 2001.132[49] Y. Zorian, E. J. Marinissen, S. Dey, “Testing Embedded-core-based System Chips”,IEEE Computer, Volume 32, Issue 6, June 1999, pp. 52-60.[50] M. Nahvi, A. Ivanov, “An embedded autonomous scan-based results analyzer(EARA) for SoC cores”, Proceedings of the 21stIEEE VLSI Test Symposium, 2003,pp: 293-298.[51] A. DeHon, “Compact, Multilayer Layout for Butterfly Fat-tree”, Proceedings of the12thAnnual ACM Symposium on Parallel Algorithms and Architectures, 2000, pp:206 – 215.[52] Synopsys TetraMAX ATPG Methodology Backgrounder,[53] IEEE Std 1500 - Standard for Embedded Core Test, online,[54] C. Grecu, P. Pande, A. Ivanov, R. Saleh, “Timing Analysis of Network on ChipArchitectures for MP-SoC Platforms”, Microelectronics Journal, Elsevier, Vol.36(9), pp: 833-845.[55] C. Constantinescu, “Trends and challenges in VLSI circuit reliability”, IEEE Micro,July-Aug. 2003, Vol. 23, Issue: 4, pp. 14-19.[56] D. Bertozzi, L. Benini, G. De Micheli, ‘’Error Control Schemes for On-ChipCommunication Links: The Energy-Reliability Tradeoff ‘’, IEEE Transactions onComputer-Aided Design of Integrated Circuits and Systems, Vol. 24, No. 6, June2005, pp. 818-831.[57] S. R. Sridhara, and N. R. Shanbhag, “Coding for System-on-Chip Networks: AUnified Framework”, IEEE Transactions on Very Large Scale Integration (TVLSI)Systems, Vol. 13, No. 6, June 2005, pp. 655-667.[58] M. Lajolo, “Bus guardians: an effective solution for online detection and correctionof faults affecting system-on-chip buses”, IEEE Transactions on VLSI Systems,Vol. 9, Issue: 6, Dec. 2001, pp: 974-982.[59] D. Bertozzi, L. Benini, G. De Micheli, “Low power error resilient encoding for on-chip data buses”, Proceedings of the Design, Automation and Test in EuropeConference and Exhibition, (DATE), 4-8 March. 2002, pp: 102-109.[60] J. Kim, D. Park, T. Teocharides, N. Vijaykrishnan, C. R. Das, “A low latencyrouter supporting adaptivity for on-chip interconnects”, Proceedings of the 42ndDAC, Anaheim, 2005, pp: 559 – 564.[61] T. Dumitras, S. Kerner, and R. Marculescu, “Towards on-chip fault-tolerantcommunication”, in Proceedings of ASP-DAC, Jan. 2003. pp: 225-232.[62] M. Pirretti, G. Link, R. R. Brooks, N. Vijaykrishnan, M. Kandemir, M. J. Irwin,“Fault Tolerant Algorithms for Network-On-Chip Interconnect”, Proceedings ofIEEE ISVLSI 2004, pp: 46-51.[63] B. Joshi, D. Pradhan, J. Stiffler. "Fault-tolerant computing", Wiley Encyclopedia ofComputer Science and Engineering, January 15, 2008.[64] S Manolache, P Eles, Z Peng, “Fault and Energy-Aware Communication Mappingwith Guaranteed Latency for Applications Implemented on NoC”, Proceedings ofDAC 2005, pp: 266 – 269.[65] D. Ferrero, C. Padró, “Connectivity and fault-tolerance of hyperdigraphs”, DiscreteApplied Mathematics 117 (2002), pp: 15-26.133[66] P. P. Pande, A. Ganguly, B. Feero, B. Belzer, C. Grecu, “Design of low power andreliable networks on chip through joint crosstalk avoidance and forward errorcorrection coding”, Proceedings of IEEE DFT Symposium, DFT’06, 2006, pp:466-476.[67] S. Murali, T. Theocharides,. N. Vijaykrishnan, M. J. Irwin, L. Benini, G. DeMicheli, “Analysis of error recovery schemes for networks on chips”, IEEE Design& Test of Computers, Sept.-Oct. 2005, Vol: 22, Issue: 5, pp: 434- 442.[68] P. D. T. O’Connor, Practical Reliability Engineering, 4thedition, Wiley, June 2002.[69] C. Grecu, A. Ivanov, R. Saleh, E. S. Sogomonyan, P. P. Pande, "On-line faultdetection and location for NoC interconnects," Proceedings of the 12thIEEEInternational On-Line Testing Symposium (IOLTS'06), 2006, pp. 145-150.[70] L. Benini, G. De Micheli, “Networks on Chips: A New SoC Paradigm”, IEEEComputer, Vol. 35, Issue 1, pp: 70-78, 2002.[71] M. Zhang, N.R. Shanbhag, “Soft-Error-Rate-Analysis (SERA) Methodology”,IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,Volume 25, Issue 10, Oct. 2006 pp: 2140 – 2155.[72] E. Bolotin, I. Cidon, R. Ginosar and A. Kolodny, ”QNoC: QoS architecture anddesign process for Network on Chip", Special Issue on Networks on Chip, TheJournal of Systems Architecture, December 2003.[73] G. Masson, “Binomial switching networks for concentration and distribution”,IEEE Transactions on Communications, Sep 1977, Vol. 25, Issue: 9, pp. 873–883.[74] S. G. Shiva, Advanced Computer Architectures, CRC Press, 2005.[75] B. Wilkinson, “On crossbar switch and multiple bus interconnection networks withoverlapping connectivity”, IEEE Transactions on Computers, Vol. 41, Issue 6, June1992, pp. 738-746.[76] L. Noordergraaf, R. Zak, “SMP System Interconnect Instrumentation forPerformance Analysis”, Proceedings of the ACM/IEEE Conference onSupercomputing 2002, pp. 53-62.[77] T. Dumitras, R. Marculescu, “On-chip stochastic communication”, Proceedings ofthe IEEE Design, Automation and Test in Europe Conference, 2003, pp. 0790-10795.[78] M. Pirretti, G. M. Link, R. R. Brooks, N. Vijaykrishnan, M. T. Kandemir, M. J.Irwin, “Fault tolerant algorithms for network-on-chip interconnect”, Proceedings ofISVLSI 2004, pp. 46-51[79] H. Zhu, P. P. Pande, C. Grecu, "Performance evaluation of adaptive routingalgorithms for achieving fault tolerance in NoC fabrics," Proceedings of 18th IEEEInternational Conference on Application-specific Systems, Architectures andProcessors, ASAP 2007.[80] J. Huang, M. B. Tahoori, F. Lombardi, “Fault tolerance of switch blocks and switchblock arrays in FPGA”, IEEE Transactions on Very Large Scale Integration (VLSI)Systems, Vol. 13 , Issue 7, July 2005, pp. 794-807.[81] J. M. Emmert, S. Baumgart, P. Kataria, A. M. Taylor, C. E. Stroud, M. Abramovici,“On-line fault tolerance for FPGA interconnect with roving STARs”, Proceedingsof the IEEE International Symposium on Defect and Fault-Tolerance in VLSISystems, 2001, pp.445-454.134[82] N. Harrison, “A simple via duplication tool for yield enhancement”, IEEEInternational Symposium on Defect and Fault Tolerance in VLSI Systems, 2001, pp.39-47.[83] K. McCullen, “Redundant via insertion in restricted topology layouts”, Proceedingsof the 8thInternational Symposium on Quality Electronic Design, (ISQED’07),2007, pp. 821-828.[84] G. A. Allan, “Targeted layout modification for semiconductor yield/reliabilitymodifications”, IEEE Transactions on Semiconductor Manufacturing, Vol. 17,Issue 4, Nov. 2004 pp. 573 – 581.[85] G. A. Allan, J. Walton, “Automated redundant via placement for increased yieldand reliability”, Proceedings of SPIE, Vol. 3216, Microelectronics ManufacturingYield, Reliability, and Failure Analysis III, September 1997, pp. 114-125.[86] G. Xu, L.-D. Huang, D. Z. Pan, M. D. F. Wong, “Redundant-via enhanced mazerouting for yield improvement”, Proceedings of the 2005 ASP-DAC, 2005, pp.1148-1151.[87] S. Gandemer, B.C. Tremintin, J.-J. Charlot, “Critical area and critical levelscalculation in IC yield modeling”, IEEE Transaction on Electron Devices, Vol. 35,Issue: 2, Feb. 1988, pp. 158-166.[88] A. Cabrini, D. Cantarelli, P. Cappelletti, R. Casiraghi, A. Maurelli, Marco Pasotti,P.L. Rolandi, G. Torelli, “A test structure for contact and via failure analysis indeep-submicrometer CMOS technologies”, IEEE Transactions on SemiconductorManufacturing, Feb. 2006, Vol. 19, Issue: 1, pp. 57-66.[89] D.K. de Vries, P.L.C. Simon, “Calibration of open interconnect yield models”,Proceedings of the 18th IEEE International Symposium on Defect and FaultTolerance in VLSI Systems, 2003, pp. 26-33.[90] M. D. Grammatikakis, D. F. Hsu, M. Kraetzl, Parallel System Interconnections andCommunications, CRC Press, 2000.[91] W. Guo, A. Y. Oruc, “Regular Sparse Crossbar Concentrators”, IEEE Transactionson Computers, Vol. 47 , Issue 3, March 1998, pp. 363-368.[92] HSPICE,,Synopsys, Inc.[93] K. Goossens, J. Dielissen, A. Radulescu, “The Aethereal network on chip:concepts, architectures, and implementations”, IEEE Design and Test ofComputers, Sept-Oct 2005, Vol. 22, Issue: 5, pp. 414-421.[94] C. Grecu, A. Ivanov, R. Saleh, P.P. Pande, “Testing network-on-chipcommunication fabrics”, IEEE Transactions on Computer-Aided Design ofIntegrated Circuits and Systems, Vol. 26, Issue 12, Dec. 2007, pp. 2201 – 2214.[95] E. Gunduzhan, A. Y. Oruç, “Structure and density of sparse crossbarconcentrators,” DIMACS Series in Discrete Mathematics and Computer Science.Advances in Switching Networks. 1998, Vol. 42. pp. 169-180.[96] G. Lemieux, D. Lewis, `”Design of Interconnection Networks for ProgrammableLogic”, Kluwer Academic Publishers, 2004.[97] P. Bose, “Designing reliable systems with unreliable components”, IEEE Micro,Vol. 26, Issue 5, Sept-Oct. 2006, pp. 5-6.135[98] G. Snider, P. Kuekes, T. Hogg1 and R. Stanley Williams, “Nanoelectronicarchitectures”, Journal of Applied Physics A: Materials Science & Processing,Volume 80, Number 6, March, 2005, pp: 1183-1195.[99] R. I. Bahar, D. Hammerstrom, J. Harlow, W. H. Joyner Jr., C. Lau; D. Marculescu,A. Orailoglu; M. Pedram “Architectures for Silicon Nanoelectronics and Beyond”,Computer, Vol. 40, Issue 1, Jan. 2007, pp: 25-33.[100] A. Schmid, Y. Leblebici, “Regular array of nanometer-scale devices performinglogic operations with fault-tolerance capability”, IEEE Conference onNanotechnology, Munich, 2004, pp: 399-401.[101] C. Grecu, C. Rusu, A. Ivanov, R. Saleh, L. Anghel, P. P. Pande, “A FlexibleNetwork-on-Chip Simulator for Early Design Space Exploration”, The 1stMicrosystems and Nanoelectronics Research Conference (MNRC 2008), Ottawa,2008.[102] Advanced Microcontroller Bus Architecture,


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items