A Reconfigurable Post-Silicon Debug Infrastructure for Systems-on-Chip by Bradley Quinton B.Sc., University of Alberta, 1999  A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in The Faculty of Graduate Studies (Electrical and Computer Engineering)  The University of British Columbia (Vancouver) October 2008  ! Bradley Quinton, 2008  Abstract As the level of integrated circuit (IC) complexity continues to increase, the post-silicon validation stage is becoming a large component of the overall development cost. To address this, we propose a reconfigurable post-silicon debug infrastructure that enhances the post-silicon validation process by enabling the observation and control of signals that are internal to the manufactured device. The infrastructure is composed of dedicated programmable logic and programmable access networks.  Our reconfigurable infrastructure enables not only the  diagnoses of bugs; it also allows the detection and potential correction of errors in normal operation. In this thesis we describe the architecture, implementation and operation of our new infrastructure. Furthermore, we identify and address three key challenges arising from the implementation of this infrastructure. Our results demonstrate that it is possible to implement an effective reconfigurable post-silicon infrastructure that is able to observe and control circuits operating at full speed, with an area overhead of between 5% and 10% for many of our target ICs.  ii  Table of Contents Abstract ................................................................................................................................................................ ii Table of Contents ................................................................................................................................................ iii List of Tables ........................................................................................................................................................ v List of Figures ..................................................................................................................................................... vi List of Abbreviations ......................................................................................................................................... vii Acknowledgements ........................................................................................................................................... viii Co-Authorship Statement .................................................................................................................................... ix Chapter 1 Introduction ...................................................................................................................................... 1 1.1 Motivation .................................................................................................................................................... 1 1.2 Focus and Contributions of this Thesis ........................................................................................................ 4 1.2.1 A Reconfigurable Post-Silicon Debug Infrastructure ............................................................................. 5 1.2.2 Concentrator Network Topologies of SoCs ............................................................................................ 
6 1.2.3 Asynchronous Interconnect Network Design and Implementation ........................................................ 6 1.2.4 Enhanced PLC Architectures for High-Speed On-Chip Interfaces ......................................................... 7 1.3 Thesis Organization ..................................................................................................................................... 8 Chapter 1 References ....................................................................................................................................... 10 Chapter 2 Post-Silicon Debug Infrastructure ................................................................................................ 13 2.1 Introduction ................................................................................................................................................ 13 2.2 Motivation .................................................................................................................................................. 13 2.3 Background ................................................................................................................................................ 15 2.3.1 Post-Silicon Debug ................................................................................................................................ 15 2.3.2 Post-Silicon Error Detection and Correction ........................................................................................ 18 2.4 Post-Silicon Debug Architecture ................................................................................................................ 20 2.4.1 Baseline SoC Architecture .................................................................................................................... 20 2.4.2 Post-Silicon Debug Infrastructure ......................................................................................................... 21 2.4.3 Selection of the Debug Node ................................................................................................................ 22 2.5 Post-Silicon Debug Operation .................................................................................................................... 23 2.5.1 Determining the Root-Cause of Unexpected Behaviours (Bugs) ......................................................... 23 2.5.2 Detecting Design Errors in Normal Operation ...................................................................................... 24 2.5.3 Correcting Design Errors ...................................................................................................................... 25 2.6 Implementation .......................................................................................................................................... 26 2.7 Estimated Area overhead ........................................................................................................................... 27 2.8 Chapter Summary ....................................................................................................................................... 30 Chapter 2 References ....................................................................................................................................... 32 Chapter 3 Network Topology .......................................................................................................................... 
34 3.1 Introduction ................................................................................................................................................ 34 3.2 Motivation .................................................................................................................................................. 34 3.3 Network Requirements ............................................................................................................................... 36 3.4 Background ................................................................................................................................................ 37 3.4.1 Permutation Networks ........................................................................................................................... 37 3.4.2 Concentration Networks ........................................................................................................................ 39 3.5 Comparison of Concentrator and Permutation Networks .......................................................................... 41 3.6 New Concentrator Construction ................................................................................................................. 43 3.6.1Network Description ............................................................................................................................. 43 3.6.2 Cost and Depth Comparison ................................................................................................................. 45 3.7 Concentrator Constructions for SoC Applications ..................................................................................... 46 3.7.1 SoC Implementation Considerations ..................................................................................................... 46 3.7.2 Hierarchical Network Construction ...................................................................................................... 47 3.7.3 Concentrator Network Mapping to Post-Silicon Debug Applications .................................................. 49 3.7.4 SoC Area Overhead ............................................................................................................................... 50  iii  3.8 Chapter Summary ....................................................................................................................................... 52 Chapter 3 References ....................................................................................................................................... 53 Chapter 4 Network Implementation .............................................................................................................. 54 4.1 Introduction ................................................................................................................................................ 54 4.2 Motivation .................................................................................................................................................. 55 4.3 Background ................................................................................................................................................ 56 4.4 Experimental Framework ........................................................................................................................... 57 4.4.1 Interconnect Network Structure ............................................................................................................ 
58 4.4.2 Target ICs .............................................................................................................................................. 59 4.4.3 CAD Tool Flow ..................................................................................................................................... 60 4.5 Asynchronous Implementation .................................................................................................................. 61 4.5.1 CAD Tool / Library Limitations ........................................................................................................... 61 4.5.2Design ................................................................................................................................................... 62 4.5.3Delay Optimization ............................................................................................................................... 66 4.5.4 Timing Verification ............................................................................................................................... 67 4.6 Synchronous Implementation ..................................................................................................................... 69 4.6.1 CAD Tool / Library Limitations ........................................................................................................... 69 4.6.2Design ................................................................................................................................................... 69 4.7 Throughput Comparison Without An Inter-Block Clock Tree .................................................................. 69 4.8 CAD Tool Enhancement ............................................................................................................................ 71 4.9 Performance and Cost Comparisons .......................................................................................................... 72 4.10 Chapter Summary ..................................................................................................................................... 77 Chapter 4 References ....................................................................................................................................... 79 Chapter 5 Programmable Logic Interface ..................................................................................................... 81 5.1 Introduction ................................................................................................................................................ 81 5.2 Motivation .................................................................................................................................................. 81 5.3 Background ................................................................................................................................................ 82 5.4 SoC Framework ......................................................................................................................................... 84 5.5 PLC Architecture Framework .................................................................................................................... 85 5.6 System Bus Interfaces ................................................................................................................................ 
86 5.6.1 System Bus Overview ........................................................................................................................... 87 5.6.2 System Bus Interface Timing Requirements ......................................................................................... 87 5.6.3 Bus Interface Functional Requirements ................................................................................................ 88 5.6.4 Modified CLB Implementation ............................................................................................................. 90 5.6.5 Experimental Evaluation ....................................................................................................................... 92 5.7 Direct Synchronous Interfaces ................................................................................................................... 96 5.7.1Overview ............................................................................................................................................... 97 5.7.2 Timing Requirements ............................................................................................................................ 97 5.7.3Functional Requirements ...................................................................................................................... 97 5.7.4 Modified CLB Implementation ............................................................................................................. 98 5.7.5Experimental Results ............................................................................................................................ 98 5.8 Chapter Summary ..................................................................................................................................... 107 Chapter 5 References ..................................................................................................................................... 109 Chapter 6 Conclusion .................................................................................................................................... 111 6.1 Summary .................................................................................................................................................. 111 6.2 Contributions and Publications ................................................................................................................ 114 6.3 Limitations and Future Work ................................................................................................................... 116 6.3.1 Methodology for the Selection of Debug Nodes ................................................................................. 116 6.3.2 Amount of Programmable Logic and Debug Buffering ..................................................................... 117 6.3.3 Inference of Circuit Behaviour From a Limited Set of Debug Nodes ................................................ 118 Chapter 6 References ..................................................................................................................................... 120 Appendix A Publications ............................................................................................................................... 121  iv  List of Tables Table 2.1: Experimental Frameworks ...................................................................................................... 
27 Table 4.1: Area/Power/Latency/Pipeline Comparison Results ................................................................ 76 Table 5.1: Interface Register Bit Types .................................................................................................... 89 Table 5.2: System Bus Area Overhead .................................................................................................... 95 Table 5.3: System Bus Interface Place and Route Results ....................................................................... 96 Table 5.4: Direct Synchronous Area Overhead ...................................................................................... 105 Table 5.5: Direct Synchronous Interface Place and Route Results ........................................................ 106 Table 6.1: A Reconfigurable Post-Silicon Debug Infrastructure ........................................................... 115 Table 6.2: Network Topology ................................................................................................................ 115 Table 6.3: Network Implementation ...................................................................................................... 115 Table 6.4: Programmable Logic Interface .............................................................................................. 116  v  List of Figures Fig. 2.1: Baseline SoC Architecture .......................................................................................................... 20 Fig. 2.2: SoC with Post-Silicon Debug Infrastructure ............................................................................... 21 Fig. 2.3: Connection of the Access Network to the Debug Nodes ............................................................ 22 Fig: 2.4: Post-Silicon Debug Infrastructure Implementation .................................................................... 25 Fig. 2.5: Area Overhead of the Programmable Logic Core ....................................................................... 29 Fig. 2.6: Post-Silicon Debug Area Overhead ............................................................................................ 30 Fig. 3.1: Asymmetric Benes ...................................................................................................................... 37 Fig. 3.2: Sparse Crossbar Concentrator ..................................................................................................... 38 Fig. 3.3: Narasimha Network Definition ................................................................................................... 40 Fig. 3.4: Narasimha Network Construction ............................................................................................... 40 Fig. 3.5: Permutation vs. Concentrator Network Area Cost ...................................................................... 41 Fig. 3.6: Permutation vs. Concentrator Depth ........................................................................................... 43 Fig. 3.7: New Concentrator Construction .................................................................................................. 43 Fig. 3.8: (16,8)-concentrator ...................................................................................................................... 44 Fig. 3.9: New Concentrator Area Cost ...................................................................................................... 45 Fig. 
3.10: New Concentrator Depth ........................................................................................................... 46 Fig. 3.11: Hierarchical Concentrator Network Architecture ..................................................................... 47 Fig. 3.12: Example Network Implementation on a SoC for Post Silicon Debug ...................................... 49 Fig. 3.13: Hierarchical Network Overhead ............................................................................................... 51 Fig. 4.1: Multiplexed Bus .......................................................................................................................... 57 Fig. 4.2: 64-Block IC Placement ............................................................................................................... 59 Fig. 4.3: CAD Tool Flow ........................................................................................................................... 60 Fig. 4.4: Dual-Rail Encoding ..................................................................................................................... 62 Fig. 4.5: Dual-Rail 2-Phase Implementation ............................................................................................. 63 Fig. 4.6 Clock Generation Circuit .............................................................................................................. 65 Fig. 4.7: Clock Generation Waveform ...................................................................................................... 65 Fig. 4.8: Circuit Modification .................................................................................................................... 66 Fig. 4.9: Example Static Timing Constraints ............................................................................................ 67 Fig. 4.10: Example Circuit with Annotated Timing And Drive Strengths ................................................ 68 Fig. 4.11: Throughput Comparison Without Inter-clock Clock Tree ........................................................ 70 Fig. 4.12: Throughput with tool enhancement .......................................................................................... 71 Fig. 4.13: Cell Area for 700 MHz Throughput Target .............................................................................. 73 Fig. 4.14: Dynamic Power for 700 MHz Throughput Target .................................................................... 74 Fig. 4.15: Latency for 700 MHz Throughput Target ................................................................................. 75 Fig. 5.1: Example SoC ............................................................................................................................... 83 Fig. 5.2: Shadow Cluster Modified CLBs ................................................................................................. 85 Fig. 5.3: Modified PLC Architecture ......................................................................................................... 86 Fig. 5.4: Slave Bus Interface ..................................................................................................................... 90 Fig. 5.5: Modified CLB Implementation ................................................................................................... 90 Fig. 5.6: Register-Type CLB ..................................................................................................................... 91 Fig. 
5.7: Configurable Interface Register Bit ............................................................................................ 92 Fig. 5.8: Modified Connection Blocks ...................................................................................................... 93 Fig. 5.9: Direct Synchronous Modified CLBs .......................................................................................... 99 Fig. 5.10: Incoming-Type CLB FIFO ..................................................................................................... 100 Fig. 5.11: Outgoing-Type CLB FIFO ...................................................................................................... 102 Fig. 5.12: Sample Count / Phase Lock Circuit ........................................................................................ 103 Fig. 5.13: Sample Count Generation Example Timing ........................................................................... 104  vi  List of Abbreviations AHB APB ASIC BIC CAD CLB CRC DFD DFT DRC ECO FIFO FPGA FPSOC FRCL GALS GMII HTTB I/O IC IP IP ITRS JTAG LEDR LUT NoC OPB PCB PCI PLA PLC SoC TCL TLA  Advanced High-speed Bus Advanced Peripheral Bus Application Specific Integrated Circuit Bus Interface Controller Computer Aided Design Configurable Logic Block Cyclic Redundancy Check Design-for-Debug Design-for-Test Direct Register Control Engineering Change Order First-in-first-out Field-Programmable Gate Array Field-Programmable System-on-a-Chip Field-Repairable Control Logic Globally Asynchronous, Locally Synchronous Gigabit Media Independent Interface HyperTransport Trace Buffer Input/Output Integrated Circuit Internet Protocol Intellectual Property International Technology Roadmap For Semiconductors Joint Test Action Group Level-Encoded Dual-Rail Look-up-table Network-on-Chip On-chip Peripheral Bus Printed Circuit Board Peripheral Component Interconnect Programmable Logic Array Programmable Logic Core System-on-Chip Tool Control Language Trace Logic Analyzer  vii  Acknowledgements First, I would like to thank my research supervisor, Dr. Steve Wilton. His guidance, technical insight, patience, and positive outlook were invaluable throughout my degree. His constant professionalism has given me a great example that I will strive to emulate in my career. In addition, I would like to thank Dr. Mark Greenstreet, with whom I have co-authored a number of papers that form part of this thesis. His deep knowledge and creative thinking were a real inspiration. I would also like to thank the numerous professors at UBC who served on my supervisory committees and provided much useful feedback on my research, these include, Dr. Res Saleh, Dr. Tor Aamodt, Dr. Guy Lemieux, and Dr. Alan Hu. I would also like to thank my external examiner, Dr. Bruce Cockburn of the University of Alberta, who was kind enough to travel to attend my thesis defence in person. I would like to acknowledge the SoC support staff, Dr. Roberto Rosales, Roozbeh Mehrabadi, and Sandy Scott, who were always willing to lend a hand when it was needed. I would like to thank all the great graduate students in the SoC research lab who made the long hours much more pleasant, including, Julien Lamoureux, Nathalie Choy, Scott Chin, Andrew Lam, Cindy Mark, Andy Yan, Peter Hallschmid, James Wu, Ernie Lin, Marcel Gort, Victor Aken'Ova, Eddie Lee, Marvin Tom and Anthony Yu. To my parents, Reg and Shirley Quinton, I am deeply indebted for a lifetime of constant support and encouragement. 
Their emphasis on the value of education has really helped me to get where I am today. And finally, and most importantly, I would like to thank my wife, Shawna, my new son, Jimmy and, of course, Sparky. They make everything I do worthwhile. Shawna has been there for me over and over again, from picking up and moving across the country to proofreading every paper and thesis chapter I’ve ever written. I certainly couldn’t have completed this work without her. There is no way to thank her enough. Jimmy’s arrival provided the final motivation I needed to finish this thesis, and Sparky was the only one who never asked: “when will you be done?”

Co-Authorship Statement

Each of the four major contributions of this thesis has been published in conference and journal papers that were co-authored by a number of researchers. The details of this authorship are outlined below. (Note: reference numbers in this statement correspond to the publications listed in Appendix A.)

The first contribution, the research on post-silicon debug infrastructures, was published as [1] and [2]. For both papers the concepts and experimental framework were developed collaboratively between Dr. Steve Wilton and Brad Quinton. Brad Quinton then performed the research, data generation, data analysis and the preparation of the manuscripts under the guidance of Dr. Steve Wilton.

The second contribution, the research on network topologies, was published as [3] and further elaborated in [2]. Again, for both papers, the concepts and experimental framework were developed collaboratively between Dr. Steve Wilton and Brad Quinton, and Brad Quinton performed the research, data generation, data analysis and the preparation of the manuscripts under the guidance of Dr. Steve Wilton.

The third contribution, the research on asynchronous network implementations, was published first as [4] and then extended in [5]. For both papers, Dr. Mark Greenstreet and Brad Quinton jointly developed the concepts, initial designs and experimental framework. Brad Quinton then created the final designs, and performed background research, data generation, data analysis, and the preparation of the manuscripts under the guidance of Dr. Mark Greenstreet and Dr. Steve Wilton.

The fourth contribution, the research on programmable logic interfaces, was published in [6] and extended in [7]. For each of these papers the initial concepts and experimental frameworks were developed collaboratively between Dr. Steve Wilton and Brad Quinton. Brad Quinton then performed the research, data generation, data analysis and the preparation of the manuscript under the guidance of Dr. Steve Wilton.

Chapter 1 Introduction

1.1 Motivation

Integrated circuits are becoming increasingly complex. Today, state-of-the-art chips contain as many as 1.72 billion transistors [1] and that number is expected to increase to 4.4 billion by 2015 [2]. This capacity has enabled integrated circuits (ICs) to provide the equivalent functionality of an entire system on a single device; hence, these large ICs are often called Systems-on-a-Chip (SoC) [3]. However, ensuring that such complex devices operate as expected is challenging. Failures can occur because of design errors, fabrication errors, or impurities in the base silicon. Of these, design errors are often the most difficult to uncover. The design of an integrated circuit typically follows a very structured methodology.
Designs are specified using a variety of techniques, and Computer-Aided Design (CAD) tools are used to translate these designs to transistors and then to physical layout information that can be directly fabricated. At all levels of the CAD flow, the design is simulated in an attempt to identify as many design errors (bugs) as possible. In large designs, as much as 50-70% of the pre-silicon design time is spent simulating the design [4,5]. In addition, formal verification techniques are often employed to prove certain properties of the design [6,7]. Together, simulation and formal verification is termed pre-silicon verification. Once the designer has achieved a sufficiently high level of confidence that the design is correct, the chip is manufactured. Creating the lithography masks required for manufacturing can cost up to $1.2 million and the initial device fabrication can take up to 16 weeks [8]. Regardless of how careful a designer is, some bugs will escape the pre-silicon verification process. In the design of the Intel Pentium 4, researchers report that their simulation effort required 6000 processors, each running for two years, and that their effective simulation coverage was less than one minute of real time operation [9]. Clearly, this is not sufficient to uncover all the design errors (bugs) that may exist in the device. Nonetheless, it is critically important to find all these bugs before products are shipped; to not do so can be very costly. For example, in a well-publicized incident, a simple bug in the floating-point division unit of the original Pentium chip cost Intel Corp. approximately $475 Million [10]. 1  The only way to find these bugs is to test manufactured chips.  Unlike pre-silicon  verification, testing manufactured chips can provide a much larger functional coverage because the chip can run at-speed and be connected to other integrated circuits in a larger system. Testing of the chip after fabrication is termed post-silicon validation. According to the 2007 International Technology Roadmap for Semiconductor (ITRS), “Post-silicon validation … is an area of verification which has grown rapidly in the past decade, mostly due to the large complexity of silicon-based systems…” [2]. Recent studies have shown that 35% of IC development time now occurs after the initial device fabrication, and this proportion continues to grow [11]. When failures are observed during post-silicon validation it is critical to uncover the source of the failure in the original design. This process is termed post-silicon debug. Time-to-market pressure means that these design errors must be discovered quickly and it is essential that as many errors as possible are uncovered before the chip is subject to another lengthy and costly manufacturing spin. Yet, finding these bugs is very challenging, mainly because of a lack of internal visibility. According to the same ITRS report quoted above, the key issue with postsilicon validation “is the limited visibility inside the manufactured silicon parts.” [2]. As designers continue to integrate more functionality on an integrated circuit, the complexity of these devices will increase significantly. At the same time, the inter-component communications that could once be observed by probing traces on a printed circuit board (PCB) are moving inside the chip. This leaves the validation process in the position of having to manage more complexity with less visibility. 
In addition, as the likelihood of post-silicon bugs increases, the possibility that a bug will prevent extensive validation of the other portions of the chip increases. This puts chip designers in the unfortunate position of having to potentially re-spin a device, simply to enable further validation (which may, in turn, discover further bugs). Integrated circuit designers are attempting to address this issue by adding extra logic to their designs specifically to assist the post-silicon debug process [12, 13, 14]. New post-silicon debug concepts are emerging that are analogous to design-for-test (DFT) methodologies where designers add dedicated logic to their design to enable efficient manufacturing test procedures. While it is possible to achieve some level of post-silicon debug by reusing these existing DFT structures and/or software debug facilities, it has been shown that these techniques are not sufficient for modern, high-performance integrated circuits and SoCs [12]. Because of this, even 2  designers of the latest high-volume, cost-sensitive, multi-core processors are willing to add extra logic to assist with the debug process. The AMD Opteron and IBM Cell BE both contain specialized chip-level trace buffers that are not tied to a single processor core or software debugger, but are intended to debug system-level issues [13, 14]. In both cases these debug resources have proved to be quite useful during the post-silicon validation phase. The Cell BE validation team described the successful debug of an intermittent cyclic redundancy check (CRC) error in the bus interface controller (BIC) that was caused by a hardware initialization failure [15]. Likewise, the AMD validation team described the successful debug of a dead-lock scenario in an eight-node (four processor) configuration that was triggered after four days of continuous operation [13]. Although these emerging techniques increase internal visibility to a degree, the current postsilicon debug solutions are limited in a number of ways. They are both quite design-specific and are not flexible enough to handle many unexpected debug situations: both the IBM and AMD proposals are limited to observing correctly formatted transactions on a specific internal interconnect, which significantly reduces the number of potential circuits that can be debugged on a given device as well as the number of scenarios for which the debug resources are useful. For instance, if a design block ‘locks-up’, and ceases to generate new transactions, there will be no new transactions to observe. In this case the root-cause is likely within the design block. However these internal nodes cannot be observed with the current solutions. In addition, transactions are logged in dedicated, finite-sized trace buffers and a limited set of configurable triggers and compression techniques are employed to make more efficient use of these buffers. However, since the nature of the bug is not known in advance, these triggers and compression techniques are not always appropriate. Finally, since the debug information in these proposals is processed after the test, and not in real-time, there is no possibility of taking action when an error is detected during normal operation. In this thesis, we propose a new reconfigurable post-silicon infrastructure that can be used to assist the debug of any digital circuit. Similar to existing solutions, our proposal enhances the visibility of the internal operation of the integrated circuit. 
However, the use of general-purpose programmable logic at the core of our infrastructure enables a number of key methodologies that are not available in other proposals. First, our infrastructure has the flexibility to target arbitrary digital logic in a chip and is not restricted to specific circuits such as processor cores or system 3  busses. Rather, it can be used to monitor key aspects of the internal design, such as the current state of a state-machine. The programmable logic allows us to build debug circuits to interpret any digital pattern. Second, our proposal allows for complex, scenario-specific, event triggers as well as event filtering, and trace compression, thereby making more efficient use of debug buffering; enabling longer running debug scenarios; and providing real-time monitoring. Third, our proposal enables the detection and potential correction of design errors during normal device operation by enabling the creation of new digital circuits in the programmable logic which can operate in parallel with the normal operation of the device. We believe that the availability of this type of dedicated debug logic will provide a number of key benefits in the development of an integrated circuit, including: 1) reduced time-to-market because of an increased ability to quickly isolate and understand unexpected behaviours, 2) decreased resources required for post-silicon validation because of an increase in the functional coverage of a given test, 3) elimination of design revisions which result when one design error hides another error, 4) increased customer satisfaction because of the enhanced ability to provide quick ‘work-arounds’ to known bugs. We expect that ‘design-for-debug’ (DFD) structures will become commonplace in all large integrated circuits, and structures such as ours will become a key part of this infrastructure. 1.2 Focus and Contributions of this Thesis In this thesis we will show that it is possible to build a reconfigurable post-silicon debug infrastructure, with reasonable area overhead, using programmable networks and embedded programmable logic. This infrastructure enhances the post-silicon validation of integrated circuits by enabling the observation and control of signals internal to the device during normal, full-speed operation. This new infrastructure addresses key limitations of existing post-silicon solutions and significantly improves on the state-of-the-art in this area. We address the architecture, operation and implementation of this infrastructure. Through the implementation we identify three key areas for targeted research: 1) the development of low-cost and low-depth network topologies for connecting fixed-function circuits to circuits implemented in programmable logic, 2) the effective implementation of high-speed on-chip interconnects, and 3) the implementation of high-speed, rate-adaptive circuits in embedded programmable logic. We address each of these key areas in detail in this thesis. The results of these targeted research 4  efforts both enable the implementation of our infrastructure and provide results that can be extended to the implementation of integrated circuits in general. Although we also considered the design of specialized programmable logic cores for the task of circuit debug [16, 17], we decided to focus on the use of general-purpose programmable logic cores in this thesis. The research contributions of this thesis can therefore be grouped into four main areas, each of which is summarized in the following subsections. 
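Before turning to those four areas, the following minimal Python behavioural sketch is included to make the debug-circuit concept described above more concrete: it models the kind of scenario-specific trigger, event filter and circular trace buffer that a validation engineer might configure into the programmable logic core. The node names, widths, trigger condition and buffer depth are illustrative assumptions only; they are not taken from the infrastructure described in the following chapters.

```python
# Minimal behavioural sketch of a scenario-specific debug circuit: a trigger
# watches a few selected debug nodes, a filter discards uninteresting cycles,
# and a small circular trace buffer records the rest.  All names, widths and
# conditions are illustrative assumptions, not taken from the thesis.

from collections import deque

class DebugCircuit:
    def __init__(self, depth=8):
        self.trace = deque(maxlen=depth)   # finite circular trace buffer, as on-chip
        self.triggered = False
        self.error_flag = False

    def cycle(self, fsm_state, bus_valid, bus_addr):
        """Called once per observed clock cycle with the selected debug nodes."""
        # Trigger: arm the trace when the watched state machine enters state 0x7.
        if fsm_state == 0x7:
            self.triggered = True
        # Filter: once triggered, store only cycles that carry a valid bus transfer.
        if self.triggered and bus_valid:
            self.trace.append((fsm_state, bus_addr))
        # Real-time check: flag an address that should never appear in this scenario.
        if bus_valid and bus_addr == 0xDEADBEEF:
            self.error_flag = True

# Example run over a short, made-up activity log.
dbg = DebugCircuit(depth=4)
activity = [(0x1, False, 0x0), (0x7, True, 0x1000), (0x2, True, 0x1004),
            (0x3, False, 0x0), (0x4, True, 0xDEADBEEF)]
for state, valid, addr in activity:
    dbg.cycle(state, valid, addr)

print("trace:", list(dbg.trace))      # only valid transfers after the trigger
print("error detected:", dbg.error_flag)
```

Because the trigger, the filter and the buffer all reside in the programmable logic, each of them can be rewritten after manufacturing to match the specific scenario being debugged, which is the flexibility the surrounding discussion relies on.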
1.2.1 A Reconfigurable Post-Silicon Debug Infrastructure

The first major contribution of this thesis is the development of a new post-silicon debug infrastructure for complex integrated circuits and SoCs. As the level of SoC integration continues to increase, the difficulty of producing a functionally correct chip continues to grow [2]. A significant effort is made during the SoC development process to verify the correct behaviour of the design prior to the manufacturing of the device. Both simulation-based and formal methods are used for this pre-silicon verification, but devices are still manufactured which do not operate as expected [18]. These functional defects, or bugs, are discovered in the post-silicon validation stage. The process of determining the root-cause of these bugs is becoming a large component of the overall development cost [11]. To address this, we propose a reconfigurable post-silicon debug infrastructure that enables the observation and control of internal signals. We use programmable networks and embedded programmable logic to create our infrastructure. Its adaptive nature is well suited to the problem of device debug since the bugs are, by their very nature, unexpected; unlike existing solutions [13, 14, 19], it can be reconfigured for each specific debug scenario. In addition, our reconfigurable infrastructure not only enables the diagnosis of bugs, it also allows the detection (and potentially the correction) of errors in normal operation.

In Chapter 2 we describe the architecture, operation and implementation of our new infrastructure, and then analyze the area overhead of the infrastructure. The results show that it is possible to implement our reconfigurable post-silicon debug infrastructure with an area overhead of less than 10% for a large proportion of our target integrated circuits. This work is significant because it demonstrates that it is possible to create a debug infrastructure with significant flexibility and that it is feasible to implement this infrastructure in an integrated circuit with a reasonably low area overhead.

1.2.2 Concentrator Network Topologies of SoCs

The second major contribution of this thesis is the construction of a new network topology that takes advantage of the flexibility of the programmable logic in our post-silicon debug infrastructure. We use this flexibility to decrease the area overhead and improve the network timing of our proposal. The area cost and performance of the programmable access network is an important factor in the efficiency of our overall infrastructure. The number, depth and interconnection of the switches (i.e., the topology) have a direct impact on these factors. We demonstrate that a class of unordered, non-blocking networks, called concentrators, is well suited to the task of connecting fixed function circuits to a programmable logic core. Concentrator networks have been studied previously [20, 21, 22, 23]. In most cases the work has been theoretical, with limited focus on the application of these networks. In our case, these networks can take advantage of the flexibility of the input and output assignments on the programmable logic cores, which removes the network ordering constraint. In Chapter 3 we describe two new concentrator constructions. The first is shown to have a lower area cost and lower switch depth than previously described concentrators for network sizes that are of interest in our application.
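To illustrate the "unordered, non-blocking" property that a concentrator provides, the short Python sketch below builds a small (n, m) sparse-crossbar concentrator and checks, by exhaustive bipartite matching, that every choice of m inputs can be routed to the m outputs in some order. The crosspoint pattern (the first n - m inputs reach every output; each remaining input reaches one distinct output) and the sizes n = 8, m = 4 are assumptions made for this sketch only; they are not the new constructions described in Chapter 3.

```python
# Illustrative check of the concentration property for a small (n, m) sparse
# crossbar.  The crosspoint pattern used here (the first n-m inputs reach every
# output; each of the remaining m inputs reaches one distinct output) is a simple
# pattern chosen only to make the idea concrete; it is not the construction
# proposed in Chapter 3.  The sizes are arbitrary assumptions.

from itertools import combinations

N, M = 8, 4   # N debug inputs feeding the network, M outputs into the PLC

reach = {i: set(range(M)) for i in range(N - M)}      # "fat" inputs: every output
reach.update({N - M + j: {j} for j in range(M)})      # "slim" inputs: one output each

def can_route(inputs):
    """Match every requested input to a distinct output via augmenting paths."""
    match = {}                                        # output -> input
    def augment(i, seen):
        for o in reach[i]:
            if o in seen:
                continue
            seen.add(o)
            if o not in match or augment(match[o], seen):
                match[o] = i
                return True
        return False
    return all(augment(i, set()) for i in inputs)

# A concentrator must be able to route *any* M of the N inputs; which output each
# signal lands on is irrelevant, because the PLC pin assignment is itself flexible.
subsets = list(combinations(range(N), M))
assert all(can_route(s) for s in subsets)
print("all", len(subsets), "input subsets of size", M, "are routable")
```

Because the output ordering is left unconstrained, a network only has to satisfy this weaker concentration requirement, which is why it can be built with fewer switches and less depth than a permutation network of the same size, the source of the area and timing savings claimed above.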
The second network construction is optimized specifically for our post-silicon debug infrastructure. It extends the first network construction to enable a network that can be partitioned into two levels of hierarchy. We show that this network can be implemented with low area overhead, resulting in as little as 2% overhead for most of our target implementations. This result is significant for our post-silicon debug infrastructure since it enables a more efficient implementation. It is also salient to the more general problem of integrating embedded programmable logic into fixed function integrated circuits because it highlights an important element of flexibility on the interface between the two types of logic.

1.2.3 Asynchronous Interconnect Network Design and Implementation

The third major contribution of this thesis is the design and implementation of a new asynchronous interconnect network that eliminates the need for top-level clock tree insertion while enabling high-speed operation for on-chip networks. A major challenge with the centralized programmable logic in our post-silicon debug infrastructure is the requirement that the access network span the entire device and run at full speed. Synchronous pipelining techniques have been proposed for this type of implementation [24], but they require the design of a low-skew clock tree that is distributed throughout the chip. The new asynchronous interconnect design and implementation that we present does not require a global clock; it therefore has a potential advantage in terms of design effort. There have been a number of other asynchronous interconnect proposals [25, 26, 27], but they are not well suited to implementations with standard CAD tools. Instead they normally rely on manual optimization or custom-built CAD tools. In contrast, our asynchronous interconnect design can be implemented using a standard ASIC design flow.

To that end, in Chapter 4 we build on the two-stage concentrator network developed in Chapter 3. We target the second stage of the network, which is responsible for aggregating the signals from all parts of the die. We compare our proposal to a standard pipelined synchronous implementation. The results demonstrate that there is a region of the design space where the asynchronous implementation provides an advantage over the synchronous interconnect by removing the need for clocked inter-block pipeline stages, while maintaining high throughput. We then demonstrate a CAD tool enhancement that would significantly increase the space where the asynchronous design has an advantage over a synchronous implementation. In addition, we provide a detailed comparison of the power, area and latency of the two strategies over a wide range of IC implementation scenarios. This work is significant to our implementation of a post-silicon debug infrastructure since it simplifies a challenging step in the insertion of our debug logic. It is also significant to the general problem of integrated circuit design since it demonstrates that it is possible to create effective asynchronous circuits using standard CAD tools, even though these tools have been developed primarily for synchronous design styles.

1.2.4 Enhanced Programmable Logic Core Architectures for High-Speed On-Chip Interfaces

The fourth major contribution of this thesis is the development of enhancements to the architecture of embedded programmable logic cores (PLCs) in order to enable the implementation of high-speed, rate-adaptive interface circuits. The architecture of our debug infrastructure requires interfacing high-speed fixed function logic (the circuits under debug) to circuits implemented in a programmable logic core (the debug circuits). The primary challenge is managing the difference in timing performance between the fixed function logic and the programmable logic. The performance of the programmable logic will inevitably be lower
The architecture of our debug infrastructure requires interfacing high-speed fixed function logic (the circuits under debug) to circuits implemented in a programmable logic core (the debug circuits). The primary challenge is managing the difference in timing performance between the fixed function logic and programmable logic. The performance of the programmable logic will inevitability be lower 7  than that of the fixed function logic [28]. Without careful consideration, the programmable logic may affect the performance of the overall debug infrastructure. We address this problem by proposing changes to the structure of the PLC itself; these architectural enhancements enable circuit implementations with high performance interfaces. In Chapter 5, we demonstrate PLC architectural changes that target both system bus interfaces and direct synchronous interfaces. These changes in the PLC architecture maintain all the key attributes of a general-purpose PLC [29, 30, 31]; the standard FPGA CAD tools for placement, routing and static timing still work with only slight modification [32]. Our results demonstrate a significant improvement in PLC interface timing, which ensures that interaction with full-speed fixed-function logic is possible. We are able to show that we can do this without compromising the basic structure or routeability of the programmable fabric. In addition we show that these new structures are very area-efficient. The area impact on circuit designs that do not need to make use of our new enhancements is less than 1%. These results are significant to our post-silicon debug infrastructure because using these PLC enhancements enables us to create debug circuits that are able to interact directly with the full-speed logic in the circuit under debug. These results are also significant to the architecture of programmable logic cores and integrated circuits in general, since they demonstrate that circuits implemented in programmable logic can interface directly to fixed function circuits. 1.3 Thesis Organization This thesis follows the manuscript-based format for PhD theses at the University of British Columbia. In this format, each chapter is based on a complete, published or submitted paper, and includes an introduction, detailed literature review, conclusion, and references. The four chapters that make up the body of this thesis are submitted or published papers, as referenced below. The first contribution, our reconfigurable post-silicon debug infrastructure, is described in Chapter 2. This material was originally published in [33] and [34], but has been updated with more recent references and results. In addition to presenting our overall architecture, this chapter highlights the three key challenges associated with the implementation of our post-silicon debug infrastructure, each of which will be addressed in detail in subsequent chapters. The second contribution, our new concentrator network topology, is outlined in Chapter 3. This work was 8  published in [34] and extended in [35]. Our third contribution, the implementation of this network within a large integrated circuit, is described in Chapter 4. This work has been published in [36]. The fourth contribution, enhanced programmable logic core architectures for high-speed on-chip interfaces, is described in Chapter 5. Preliminary parts of this work were published in [37], and a journal version has been accepted for publication in [38]. 
Finally, Chapter 6 summarizes the thesis’s contributions and conclusions and provides suggestions for future work.  9  Chapter 1 References [1] C. McNairy, R. Bhatia, “Montecito: A Dual-Core, Dual-Thread Itanium Processor”, IEEE Micro, vol. 25, no. 2, pp. 10-20, March/April 2005. [2] International Technology Roadmap for Semiconductors (ITRS), 2007 Report, http://www.itrs.net, 2008. [3] R. Saleh, et al., "System-on-Chip: Reuse and Integration," Proceedings of the IEEE, vol. 94, no. 6, pp.1050-1069, June 2006. [4] J. Ogawa, “Living in the Product Development ‘Valley of Death’”, FPGA and Structured ASIC Journal, November 23, 2004. [5] P. Rashinkar, et al., System-on-a-Chip Verification: Methodology and Techniques, Springer, 2000. [6] C. Kern, M. R. Greenstreet, “Formal verification in hardware design: a survey”, ACM Transactions on Design Automation of Electronic Systems, vol. 4, no. 2, pp. 123-193, April 1999. [7] A. J. Hu, D. L. Dill, “New techniques for efficient verification with implicitly conjoined BDDs”, Proceedings of the Design Automation Conference, pp. 276-282, June 1994. [8] S. Miraglia, et al., “Cost effective strategies for ASIC masks”, Proceedings of SPIE, vol. 5043, pp. 142-152, 2003. [9] B. Bentley, “Validating a modern microprocessor”, Proceedings of the International Conference Computer Aided Verification, pp.2-4, July 2005. [10] V. Pratt, “Anatomy of the Pentium Bug”, Proceedings of the Theory and Practice of Software Development, pp. 97-107, 1995. [11] Collett ASIC/IC Verification Study, 2004. [12] A.B.T. Hopkins, K.D. McDonald-Maier, “Debug support for complex systems on-chip: a review”, IEE Proceedings - Computers and Digital Technology, vol. 153, no. 4, pp.197207, July 2006. [13] M. Riley, M. Genden, “Cell Broadband Engine Debugging for Unknown Events”, IEEE Design and Test of Computers, vol. 24, no. 5, pp. 486-493, September/October 2007. [14] T. J. Foster, et al., “First Silicon Functional Validation and Debug of Multicore Microprocessors”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 15, no. 5, pp. 495-504, May 2007. [15] M. Riley, et al., “Debug of the CELL Processor: Moving the Lab into Silicon”, Proceedings of the IEEE International Test Conference, pp. 1-9, October 2006. [16] S.J.E. Wilton, C.H. Ho, B.R. Quinton, P.H.W. Leong, W. Luk, “A Synthesizable DatapathOriented Embedded FPGA Fabric for Silicon Debug Applications”, ACM Transactions on Reconfigurable Technology and Systems, vol. 1, no. 1, pp 7.1-7.25, March 2008. [17] S.J.E. Wilton, C.H. Ho, P.H.W. Leong, W. Luk, B.R. Quinton, "A Synthesizable DatapathOriented Embedded FPGA Fabric", Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, February 2007. [18] S. Sandler, “Need for debug doesn’t stop at first silicon”, E.E. Times, Feb. 21, 2005. [19] A.B.T. Hopkins, K.D. McDonald-Maier, “Debug Support Strategy for Systems-on-Chips with Multiple Processor Cores”, IEEE Transactions on Computers, vol. 55, no. 2, pp.174184, February 2006. [20] M. S. Pinsker, “On the complexity of a concentrator,” Proceedings of the Seventh International Teletraffic Congress, Stockholm, Sweden, pp. 318/1-318/4, 1973. 10  [21] F. R. K. Chung, “On Concentrators, Superconcentrators, Generalizers, and Nonblocking Networks,” Bell System Technology Journal, vol. 58, no. 8, pp. 1765-1777, 1978. [22] O. Gabber and Z. Galil, “Explicit constructions of linear sized super-concentrators”, Journal of Computer and System Science, pp. 407-420, 1981. [23] M. J. 
Narasimha, “A Recursive Concentrator Structure with Applications to Self-Routing Switching Networks”, IEEE Transactions on Communications, vol. 42, no. 2/3/4, pp. 896897, April 1994. [24] P. Cocchini, “Concurrent Flip-Flop and Repeater Insertion for High Performance Integrated Circuits”, Proceedings of the IEEE International Conference on Computer-Aided Design, pp. 268-273, 2002. [25] W.J. Bainbridge and S.B. Furber, “Delay insensitive system-on-chip interconnect using 1of-4 data encoding”, Proceedings of the International Symposium on Asynchronous Circuits and Systems, pp. 118-126, March 2001. [26] A. Lines, “Asynchronous Interconnect For Synchronous SoC Design”, IEEE Micro, vol. 24, no. 1, pp. 32-41, January/February 2004. [27] W.J. Bainbridge, W.B. Toms, et al., “Delay-Insensitive, Point-to-Point Interconnect using m-of-n Codes”, Proceedings of the International Symposium on Asynchronous Circuits and Systems, pp. 132-140, March 2003. [28] I. Kuon, J. Rose, "Measuring the Gap Between FPGAs and ASICs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 2, pp.203-215, February 2007. [29] S.J.E. Wilton, et al., “Design Considerations for Soft Embedded Programmable Logic Cores”, IEEE Journal of Solid-State Circuits, vol. 40, no. 2, pp. 485-497, February 2005. [30] S. Phillips, S. Hauck, “Automatic layout of domain-specific reconfigurable subsystems for systems-on-a-chip,” Proceedings of the ACM International Symposium on FieldProgrammable Gate Arrays, p. 165, February 2002. [31] V.C. Aken'Ova, G. Lemieux, R. Saleh, "An improved "soft" eFPGA design and implementation strategy," Proceedings of the IEEE Custom Integrated Circuits Conference, pp. 179-182, September 2005. [32] V. Betz, J. Rose, A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs, Kluwer Academic Publishers, 1999. [33] B.R. Quinton, S.J.E. Wilton, “Post-Silicon Debug Using Programmable Logic Cores”, Proceedings of IEEE International Conference on Field-Programmable Technology, Singapore, pp. 241-247, December 2005. [34] B.R. Quinton, S.J.E. Wilton, “Concentrator Access Networks for Programmable Logic Cores on SoCs”, Proceedings of the IEEE International Symposium on Circuits and Systems, Kobe, Japan, pp. 45-48, May 2005. [35] B.R. Quinton, S.J.E. Wilton, "Programmable Logic Core Based Post-Silicon Debug For SoCs", Proceedings of the IEEE Silicon Debug and Diagnosis Workshop, Germany, May 2007. [36] B.R. Quinton, M.R. Greenstreet, S.J.E. Wilton, “Practical Asynchronous Interconnect Network Design”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 5, pp. 579-588, May 2008. [37] B.R. Quinton, S.J.E. Wilton, "Embedded Programmable Logic Core Enhancements for System Bus Interfaces", Proceedings of the International Conference on FieldProgrammable Logic and Applications, Amsterdam, pp. 202-209, August 2007. 11  [38] B.R. Quinton, S.J.E. Wilton, “Programmable Logic Core Enhancements for High Speed On-Chip Interfaces” accepted for publication in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, May 2008.  12  Chapter 2 Post-Silicon Debug Infrastructure1 2.1 Introduction In this chapter we present our new post-silicon debug infrastructure.  By adding  programmable networks and programmable logic to the SoC, we address key limitations of existing debug methodologies and significantly enhance the post-silicon validation process. We begin by describing our motivation for developing a reconfigurable post-silicon debug infrastructure. 
Second, we provide background information on existing post-silicon debug techniques. Third, we describe the architecture of our infrastructure in detail. Fourth, we explain the operation of our infrastructure during the process of post-silicon validation. Fifth, we examine the implementation of our infrastructure, which highlights three key research challenges that are addressed in detail in subsequent chapters. Finally, summarizing the results from the research in later chapters, we present an evaluation of the area overhead for our proposal for a range of SoCs. 2.2 Motivation Advances in integrated circuit (IC) technology have made it possible to integrate a large number of functional blocks on a single chip, but there are many challenges associated with this high level of integration. One of these challenges is ensuring that the IC design is functionally correct. Although pre-fabrication simulation and formal verification are used to help ensure that the IC performs as desired, the complexity of modern ICs prevents exhaustive verification [1]. Design errors (bugs) are often found after fabrication, during the post-silicon validation of the device. For complex ICs and SoCs, the process of verifying and debugging new devices requires significant time and money [2]. This problem will be exacerbated as the level of IC integration  1  A version of this chapter has been published as: - B.R. Quinton, S.J.E. Wilton, “Post-Silicon Debug Using Programmable Logic Cores”, Proceedings of IEEE International Conference on Field-Programmable Technology, Singapore, pp. 241-247, December 2005. (Reprinted with permission) - B.R. Quinton, S.J.E. Wilton, "Programmable Logic Core Based Post-Silicon Debug For SoCs", Proceedings of the IEEE Silicon Debug and Diagnosis Workshop, Germany, May 2007.  13  continues to rise and more functionality is combined into a single ‘black-box’ that must be tested using only the external chip interfaces. Determining the root cause of a bug during the post-silicon validation phase presents a significant challenge as it is difficult to observe or control signals within a manufactured device. The chip cannot be easily modified after fabrication to provide these signals to output pins where they can be observed. Even if the signals were identified before fabrication, providing external access to them at design-time is problematic since input and output (I/O) resources are often limited and do not operate at the high speeds that would be required. Integrated circuit designers are attempting to address this issue by adding extra logic to their designs specifically to assist the post-silicon debug process [3]. New post-silicon debug concepts are emerging that are analogous to design-for-test (DFT) methodologies wherein designers add dedicated logic to their design to enable efficient manufacturing test procedures. Although these new techniques increase internal visibility to a degree, the current post-silicon debug solutions, are limited in a number of ways, as we will show in Section 2.3. We propose a post-silicon debug infrastructure based on a programmable logic core (PLC) and a programmable interconnect network. The post manufacturing re-configurability of the network and programmable logic core is a key aspect of the technique. The region of the circuit being debugged may change during post-fabrication validation and is not likely to be predictable during design time. 
The flexible network allows the validation engineer to select the internal signals that are of interest for any given test, and the programmable logic core provides a means to create debug circuits that process these signals in a manner that depends on the debug task being performed. In addition, since it is possible to create arbitrary circuits in the programmable logic core, our infrastructure may be re-purposed to enable the detection and potential correction of design errors during normal device operation. The increased flexibility and functionality of our proposal comes with a potential cost in terms of silicon area and introduces a number of implementation challenges. However — as we will show in this chapter — because the programmable structures are centralized and re-used in multiple scenarios, the overall cost of the proposed post-silicon infrastructure is still quite low. In subsequent chapters we provide detailed examinations of the implementation challenges associated with the debug infrastructure and show that these issues can also be overcome.  14  2.3 Background Although research in the area of dedicated post-silicon debug infrastructures for integrated circuits is quite new, interest in this area is growing rapidly. Most of the research outlined in this section has been done concurrently with the research for this thesis. Since our proposal includes both post-silicon debug and design error detection and correction, we provide background on each in the following sections. 2.3.1 Post-Silicon Debug Existing post-silicon debug strategies for complex integrated circuits (ICs) can be categorized into four groups: Software-based, Test Feature-based, In-Circuit Emulation, and Onchip Emulation [3]. The first three strategies have been used in integrated circuit design for some time, but tend to be insufficient for complex, high-performance ICs and SoCs. Most new post-silicon debug proposals (including the one in this thesis) fall into the category of On-chip Emulation which involves adding new purpose-built hardware for debug. We will summarize each the four categories below and provide detailed descriptions of existing On-chip Emulation research. 1) Software-based Post-Silicon Debug Software-based methodologies rely on the insertion of monitor routines into the base software that act alongside platform-specific processor hardware to enable the observation of a limited set of internal variables specific to the processor [4, 5, 6]. From a SoC perspective, these techniques are limited in a number of ways. First, they require the modification of the base software, which affects the software performance and often changes the underlying behaviour that stimulated the bug [3]. Second, at best they allow limited visibility into the hardware state outside the processor core because the software is usually unaware of the ‘system’ hardware onchip. Third, they require that the software be halted, or pre-empted, to perform the observation of the internal processor state which in turn makes it difficult to debug real-time interactions between software and hardware [3].  Because of these limitations, the software-based  methodology is not by itself sufficient to provide debug support to the entire SoC, but is useful in conjunction with other techniques.  15  2) Test Feature-based Post-Silicon Debug Test Feature-based techniques use the existing manufacturing test structures, commonly referred to as Design-for-Test (DFT) structures, to gain visibility into the internal state of the device [7]. 
Most SoCs use a scan chain-based DFT methodology. The loading and unloading of these scan chains provides a low overhead technique to observe and control the internal state of the device. However, in most cases the normal operation of the devices must be halted while the scan chains are unloaded or loaded because the operation of the scan chains corrupts the state of all the flip-flops on the scan chain. For many SoCs it is not feasible to halt the device during normal operation because their high-speed interfaces use synchronous free-running clocks, and it is often not possible to ‘pause’ the incoming data to unload the scan chains. In addition, relationships between asynchronous clocks (which are often the source of bugs) are altered in this case [3]. This technique is therefore often only viable as a ‘postmortem’ debug tool to investigate the final state of a device after it has failed, and is not usually sufficient as a generalized SoC debug technique. 3) In-Circuit Emulation Based Post-Silicon Debug In-Circuit Emulation (ICE) relies on a custom-designed part that is used during the debug and development process instead of the normal production device [8]. It has increased input and output (I/O) capability. The additional I/O are used to allow the observation of internal signals, which are routed to the I/O pins and recorded with external test equipment. This technique requires considerable time and adds considerable expense, since two versions of the devices need to be designed and maintained. In addition, the internal operating frequencies of SoCs are often much faster than conventional digital I/O pads can support. This limitation forces the ICE version of the device to operate at a slower clock rate than the production device, thereby reducing the correlation between the behaviour of the two devices [3]. 4) On-chip Emulation Based Post-Silicon Debug On-chip Emulation solves many of the problems of In-Circuit Emulation by adding dedicated debug logic to the SoC [9, 10, 11, 12, 13, 15, 16]. Because this new circuitry can run in parallel to the normal device logic, it has limited (if any) impact on the normal operation of the device. In addition, it is not bound by the performance of the I/O pads and scales well with the operating 16  frequency of the device. The obvious limitation is that it increases the area and potentially the power dissipation of the final SoC. Most current On-chip Emulation techniques rely on trace buffers combined with programmable triggers and various degrees of pre-buffer compression. The AMD Opteron, a dual core processor with an integrated memory controller, contains a HyperTransport trace buffer (HTTB) that captures inter-core and inter-device memory transactions on the HyperTransport interconnect [9]. Programmable triggers capture HyperTransport packets based on criteria provided by the user. The HTTB is used in addition to the software-based debug features built into each processor core. Although the HTTB has an optional feature that allows the buffer to ‘spill’ its contents to the main memory in order to enable larger trace captures, the memory transactions for this operation affect the normal operation of the device. As the authors of [9] note, on some occasions this would change the device’s behaviour and mask the bug in question. The IBM Cell BE — a nine-processor core SoC design for game consoles and highperformance computing — contains a Trace Logic Analyzer (TLA) for storing and viewing internal signals [10, 11]. 
There is one TLA per Cell BE device that must be shared for all debug tasks. It contains a trace buffer designed specifically to capture debug traces and can capture up to 128 system interconnect signals from any one of the nine logic units on the device; four of these signals can be designated as event triggers signals. The TLA has built-in compression to enhance the efficiency of the buffer and also contains a number of programmable counters that can act as event triggers. Anis and Nicolic have addressed the problem of enhancing the efficiency of trace buffer captures by using signal compression. They have described both lossy and lossless trace buffer compression techniques [12, 13]. In order to create a complete set of results, the lossy techniques require repeated test runs. This is a significant drawback if the time required to stimulate the bug is long, or if the results have an element of non-determinism, such as clock domain crossings, or asynchronous designs. The lossless techniques improve on this limitation. Hopkins and McDonald-Maier have described a trace message framework that enables the efficient storage of trace data from multiple processor cores [14]. Their goal is to allow a single trace buffer to be used for multiple cores that may be operating asynchronously. Like the IBM  17  and AMD implementations this technique is targeted at capturing specific, well-formatted transactions rather than arbitrary digital signals. Since publication of our initial proposal [15], Abramovici, et al. at DAFCA have also suggested a reconfigurable solution for SoC debug [16]. Their proposal is a proprietary commercial implementation and, therefore, it is not possible to determine their exact implementation. However, we can determine some of the details from existing public documents [16, 17]. Like our proposal, the DAFCA infrastructure is targeted at general-purpose digital logic in a SoC design. The DAFCA infrastructure is based on a distributed heterogeneous reconfigurable fabric. Unlike DAFCA, we propose using a centralized, generic programmable logic core. There are therefore important differences between the two architectures. First, we make use of existing FPGA architectures and CAD tools to implement our debug circuits. This is an important difference since it leverages existing research on the architecture, synthesis, placement and routing of island-style FPGAs [18]. Second, by centralizing the programmable logic we can make efficient use of an expensive resource because each new debug circuit can reuse the same programmable logic. Third, by using generic programmable logic we ensure that debug circuits are fully flexible and not restricted to pre-conceived debug scenarios. For example, our infrastructure does not require any explicit trigger signals to be defined in advance; instead, it is possible to define the required triggering scheme at debug time. Fourth, by using a general-purpose core and connecting it to the system bus, we enable the possibility of constructing error detection and correction circuits or adding new features to the device after it has been manufactured. 2.3.2 Post-Silicon Error Detection and Correction To our knowledge there is no existing proposal to address design error detection and correction within the context of SoCs. However, recently there have been a number of such proposals targeted at processor cores. We summarize them as follows. Sarangi et al. 
have described a proposal for using programmable hardware to help patch design errors in processors [19].  As part of their patching process they make use of  programmable logic to detect specific conditions in the processor. In some cases, once they have detected these specific problem conditions, they can make use of existing processor features, such as pipeline flushes, cache re-fills, or instruction editing to correct the error; in other cases 18  they can cause an exception to be serviced by the operating system or hypervisor. Their proposed architecture is distributed and targeted at a specific type of design, namely modern processor cores. The programmable logic architecture that they use is much more targeted than the one that we propose. They make use of a programmable logic array (PLA) fabric that increases performance and lowers overhead, but is not able to implement arbitrary debug circuits. Although the primary motivation of these authors’ proposal is the in-field correction of processor design errors, not post-silicon debug, it is evident that their proposal could also be useful during the debug stage. Like their proposal, our infrastructure is also designed to detect and correct errors. The PLC in our infrastructure can be configured to drive signals to the fixedfunction design. So, for example, the debug circuit in the PLC can trigger an interrupt on the embedded processor if a specific error condition occurs. The embedded processor can then take action to fix the problem. An important difference is that with our proposal it is possible to construct complex digital circuits to monitor and react to errors, in contrast to the simple combinational circuits that are possible in a PLA. Wagner et al. have proposed Field Repairable Control Logic (FRCL) [20]. This proposal is quite similar to the proposal described above for patching design errors in processors. Programmable logic is used to detect that a pre-defined error condition is about to occur based on an expected pattern in the control logic of the processor core. Action is then taken to prevent the error. However, instead of the relying on existing processor features or software intervention, in the FRCL proposal the processor is dynamically switched into a single-issue instruction mode and a circuit implemented in additional programmable logic is used to temporarily replace the control logic of the processor core. Once the error condition has passed, the normal processor operation resumes. In a successful repair scenario, the code execution is logically correct but the throughput of the processor is temporarily reduced. If the event is rare, then this technique is not likely to be noticed by the user.  19  Fig. 2.1: Baseline SoC Architecture 2.4 Post-Silicon Debug Architecture In this section we outline the architecture of our post-silicon debug infrastructure. We begin by describing our baseline SoC architecture and then describe the components of our new infrastructure and their integration into the SoC. Finally, we explain the selection of signals as targets for the debug process. 2.4.1 Baseline SoC Architecture We assume a SoC design containing multiple IP (Intellectual Property) cores, as shown in Figure 2.1. These cores will be implemented in either digital logic or using analogue techniques. We target the digital portion of the SoC. Typically, at least one of these IP cores will be a processor. 
The cores are typically designed as distinct blocks and are connected to each other at the top level of the SoC by either fixed wires, a shared bus, or a Network-on-Chip (NoC) architecture [21]. It has been suggested that, in future integrated circuits, all inter-block communication will be done using dynamic packet-based networks [22]. However, today’s 20  integrated circuits often contain a mix of networks, fixed wires, and shared busses to connect IP blocks. Our technique will work regardless of the interconnect technique that is used.  Fig. 2.2: SoC with Post-Silicon Debug Infrastructure 2.4.2 Post-Silicon Debug Infrastructure Figure 2.2 shows a conceptual diagram of the baseline SoC architecture that has been enhanced with our new post-silicon debug infrastructure. The infrastructure contains two key components: the programmable logic core and the programmable access network. The programmable logic core (PLC) is at the core of the debug architecture. The PLC is integrated into the top-level of the SoC hierarchy. In our proposal, the PLC contains both general-purpose programmable logic and embedded memory buffers; the general-purpose logic can be used to implement debug circuits specific to the debug task at hand and the embedded memory buffers are used to store results of a given test run. The PLC is connected to the rest of the chip using a programmable access network. This network spans both the design blocks and the top-level of 21  the SoC hierarchy. The network allows the circuits under debug to be connected to the debug circuits implemented in the PLC. In addition, the PLC is connected to the shared bus or NoC that already exists on the SoC to allow simple and direct communication with the processor core. Finally, the PLC and the access network are connected to the JTAG (Joint Test Action Group) port on the SoC — virtually all SoCs have pre-existing JTAG ports to enable boundary scan and other manufacturing-related testing [23]. This JTAG port connection enables the programming of the debug circuits into the programmable logic fabric, the loading and unloading of the embedded memory buffers, and the programming of the access network.  Fig. 2.3: Connection of the Access Network to the Debug Nodes 2.4.3 Selection of the Debug Nodes At the time a SoC is designed, the designer does not yet know whether the chip will fail, and if so, how it will fail. Thus, it is impossible to select the exact set of signals within the integrated circuit that should be connected to the PLC. Instead, the SoC designer chooses a much larger set of signals (which we refer to as the debug nodes) from the integrated circuit, and connects them to the access network. The access network will connect a subset of the debug nodes to the debug circuit during a given debug scenario. Each debug node is either an observable node or a controllable node. The observable nodes are signals in the SoC that will potentially be monitored by a debug circuit in the PLC while controllable nodes are signals that will potentially  22  be overridden by values driven from debug circuits implemented in the PLC. An example of the connection of each type of node to the access network is shown in Figure 2.3. It is not feasible to classify all the signals in the device as debug nodes because the overhead of the debug infrastructure would be too large. Instead, a subset of the device signals must be chosen. 
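To make the behaviour of the two node types of Figure 2.3 concrete, the following fragment is a purely behavioural sketch in Python rather than RTL, and it is not taken from the thesis implementation; the function names and the form of the configuration data are illustrative assumptions only. An observable node is simply tapped and forwarded into the observation network, while a controllable node places a 2:1 selection in the functional path so that a debug circuit in the PLC can override it.

    # Behavioural sketch only: models the observe/control semantics of Figure 2.3.
    # The configuration structures stand in for the access-network settings that
    # would be loaded through the JTAG port; none of these names come from the thesis.

    def observe(observe_config, node_values):
        """Forward the selected observable nodes to the PLC observation inputs.
        observe_config is an ordered list of debug-node names, one per PLC input."""
        return [node_values[name] for name in observe_config]

    def apply_control(control_config, node_values, plc_outputs):
        """Return the signal values seen by the fixed-function logic after any
        enabled overrides. control_config maps a controllable-node name to the
        index of the PLC output that drives it; unmapped nodes keep their
        functional value (the 2:1 multiplexer selects the original signal)."""
        driven = dict(node_values)
        for name, plc_index in control_config.items():
            driven[name] = plc_outputs[plc_index]
        return driven

In hardware, of course, this selection is a static configuration loaded once per debug scenario rather than a per-cycle computation.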
The correct selection of the debug nodes is design instance-dependent and, since the location of potential bugs is not known at the time of insertion, there is no fool-proof selection criterion. However, research has shown that it is possible to infer much of the operation of a given digital design from observations of a small subset of its signals [24], and tools are being developed to automatically determine the most effective observable nodes [25]. The criteria for node selection are not considered in detail in this thesis, but potential future research in this area is outlined in Chapter 6. 2.5 Post-Silicon Debug Operation In this section we describe how our post-silicon debug infrastructure can be used as part of the post-silicon validation process. We also describe the processes of determining the root cause of an unexpected behaviour, detecting design errors in normal operation, and correcting existing design errors. 2.5.1 Determining the Root-Cause of Unexpected Behaviours (Bugs) When an unexpected behaviour is encountered during the post-silicon validation phase, the debug process proceeds as follows. To begin, the validation engineer determines what he/she thinks is the source of the problem inside the SoC based on either the observed incorrect behaviour or just simple intuition. The starting point is not critical since the process is iterative. Second, the validation engineer accesses a list of the debug nodes that are available in the SoC and selects a subset for observation. Third, he or she forms a hypothesis about the source of the unexpected behaviour that can be tested by a debug circuit. For instance, the validation engineer may choose to observe the states of all the state machines in a given subsystem, assuming that one of them must be entering an error state. Fourth, he or she would design a debug circuit that simultaneously monitors each state machine in the subsystem and logs a notification message in the embedded memory of the debug PLC if one of the state machines ever enters the error state. Fifth, the debug circuit and access network settings would then be programmed into the debug infrastructure using the JTAG port on the SoC. Finally, the stimulus that created the unexpected device behaviour would then be regenerated with the debug circuit in place. Once the unexpected behaviour has been recreated, the results of the test can then be retrieved through the JTAG port for examination. Based on this new information the validation engineer can construct new debug circuits that continue to isolate the problem. For instance, in the above case, it may be determined that one particular state machine enters the error state first. The target debug nodes could then be adjusted to observe the inputs to that state machine with a new debug circuit. This process would then be repeated until the root cause of the unexpected behaviour is determined. 2.5.2 Detecting Design Errors in Normal Operation Because of the complexity of SoCs and the expense of the manufacturing process, some design errors (bugs) often continue to exist in a device that is in volume production. These bugs may exist for a number of reasons. They may have been discovered in the post-silicon validation phase, but were not deemed significant enough to justify a device re-spin. In other cases, they may have been discovered once the device has begun to be shipped in volume, which would make a re-spin extremely expensive.
In either scenario, there is a strong desire to minimize the impact of the bug on the normal operation of the device. This can often be done in firmware on the SoC itself or by notifying higher-level supervisory software, but only if the error condition can be reliably detected. Since the debug circuits in our proposal are able to run in parallel with the normal device, they are well suited to this type of error detection. The construction of an error-detection mechanism using our post-silicon debug infrastructure would proceed as follows. Once it has been determined that there is a design flaw in the device, the SoC development engineer would examine a list of available debug nodes and determine a set that could be used to detect the condition. In the same manner as described in the previous section, the engineer would then develop a circuit implemented in the programmable logic core to detect the condition. However, instead of logging the condition to a memory buffer, an additional circuit would be developed to interrupt the on-board processor and provide the specifics of the problem that has occurred. The firmware on the SoC would then take whatever action is appropriate to minimize the impact of the error.  24  To further understand this case, consider the example of an Internet Protocol (IP) packet buffer implemented in hardware. Imagine further that the buffer has a design error such that it incorrectly calculates the fill level of the buffer when it encounters packets with a data payload of, for example, less than four bytes. Packets of this type may be extremely rare, but if they were to occur the packet buffer may corrupt the packet data and cause a significant performance impact to the device. If a debug circuit can be created such that it is possible to detect this rare condition, the SoC firmware may be able to simply discard the packet before it impacts the system.  Fig: 2.4: Post-Silicon Debug Infrastructure Implementation 2.5.3 Correcting Design Errors We can also extend the error detection scenario outlined in the previous section to potentially correct design errors. Again, we make use of the fact that the debug circuits in our post-silicon debug infrastructure can operate in parallel with the normal device operation. In addition, the access network can be configured to override the controllable nodes in the SoC. Using these features, there is the possibility of creating debug circuits that autonomously override signals in the SoC such as block-level resets or next-state logic in state machines. In the IP packet buffer 25  case described above, the debug circuit could drive, for instance, the hardware reset of the buffer, essentially discarding the problem packet, and correcting the error condition. 2.6 Implementation Figure 2.4 shows a high-level implementation of our proposal.  Analyzing the  implementation, we can identify a number of fundamental requirements: 1. The number of debug nodes in the SoC should be as large as possible to ensure the maximum possible utility of the infrastructure. 2. The area overhead of the infrastructure must be as low as possible to minimize the impact on the overall cost of the device. 3. The access network must operate at an equivalent clock rate to the circuits in the SoC. 4. The debug circuits, implemented in the programmable logic core, must interact with the circuits under debug operating at their normal frequency. 5. 
The implementation of the infrastructure should require a minimal design effort, since it will be added to the device based on speculation about its later usefulness. Based on these criteria, we have identified three key implementation challenges: 1) optimizing the interconnect topology to maximize the number of debug nodes while minimizing the area overhead, 2) creating a high-throughput multi-frequency on-chip network that can be implemented without excessive design effort, and 3) building efficient, high-performance interface circuits between the debug circuits in the PLC and the fixed-function circuits under debug.

Table 2.1: Experimental Frameworks
- Network Topology (Chapter 3): a) Area cost and network depth in terms of 2:1 multiplexers for network sizes of up to 5000 inputs. b) Percentage area overhead for 90nm, standard cell SoC implementations of between 5 and 80 million equivalent gates and networks with between 200 and 7200 inputs.
- Network Implementation (Chapter 4): Area cost, data throughput, and dynamic power dissipation for network implementations in 90nm, standard cell SoCs with between 2.7 and 27.7 million equivalent gates and between 16 and 256 block partitions.
- Programmable Logic Interface (Chapter 5): Percentage area overhead, configurable logic block (CLB) utilization, routing channel width and critical path delay for 20 benchmark circuits implemented in a 0.18µm programmable logic core (PLC).

We will address each of these research challenges in subsequent chapters of this thesis. In each chapter we develop an independent experimental framework that is designed specifically to highlight the challenges associated with the respective aspect of the implementation. These frameworks are explained in detail in each chapter and a summary of each is given in Table 2.1. Although these frameworks are not exactly the same in each chapter, they are sufficiently similar to enable us to evaluate the overall post-silicon debug infrastructure. In the following section we combine the results from all three chapters into a single experimental framework to evaluate the area overhead of the infrastructure. 2.7 Estimated Area Overhead In this section, we will estimate the area overhead of our infrastructure. Rather than presenting numbers for only a single integrated circuit, we present overhead estimates for a range of integrated circuit sizes and a range of debug nodes. These results are intended to show the feasibility of our proposal for a wide range of scenarios. To calculate the area overhead, we created a parameterized model of both the baseline SoC architecture and the enhanced architecture. The model is parameterized based on the total number of equivalent logic gates in the SoC and the number of debug nodes in the post-silicon debug infrastructure. The area of the enhanced architecture was calculated by summing the size of the IP cores themselves, an area estimate of a specific programmable logic core, and the area of standard cell implementations of the debug access networks. A 90nm technology and the STMicroelectronics Core90GPSVT standard cell library were assumed throughout [26]. For the programmable logic portion of our infrastructure we use area estimates from a 90nm commercially-available PLC designed by IBM and Xilinx [27]. We updated the published area to reflect the additional area overhead of our PLC interface enhancements, which we will describe in detail in Chapter 5.
The PLC has an equivalent capacity of approximately 10,000 ASIC gates. We believe that this logic capacity is more than adequate for our intended debug circuits and that it could potentially be reduced to lower the overhead of our proposal. To highlight this trade-off we will present the area overhead of the programmable logic separately in the following results. The PLC has 384 available ports, each of which is configurable as an input or an output. For our models we choose to assign 64 ports to be inputs (connected to the observation network) and 64 ports as to be outputs (connected to the control network). The remaining ports were assigned to connect to the system bus and JTAG interfaces. The interface from the embedded processor, through a NoC or shared system bus, is implemented in the PLC logic, as we will describe in detail in Chapter 5. This interface therefore requires no additional area overhead. The area of the access networks was calculated by generating a hierarchical concentrator network, which we will describe in detail in Chapter 3 and 4. The number of simultaneously observable nodes was set to 64 in order to match the capabilities of the PLC. Likewise, the number of simultaneously controllable nodes was also set to 64. Then, the number of debug nodes (assigned to be 50% observe nodes and 50% control nodes) was varied to generate a number of different implementation scenarios. The first stage of the concentrator network was implemented using 2:1 multiplexers for switching and a flip-flop (with enable) per 2:1 multiplexer for routing control. The second stage was constructed using the asynchronous implementation techniques described in Chapter 4.  28  Fig. 2.5: Area Overhead of the Programmable Logic Core In order to better delineate the contributions to the area overhead of the proposed architecture, we first present the PLC area overhead. We then present the area overhead of the overall infrastructure. Integrated circuit sizes between 5 million and 80 million gates were assumed (this is the sum of the gate count of all blocks before the debugging enhancements were added). For each integrated circuit size, the total number of debug nodes was varied from 200 to 14600. Results for the area overhead of the PLC are shown in Figure 2.5. The area overhead due to the PLC does not depend on the number of debug nodes. It is simply the area of the PLC core divided by the size of the original integrated circuit. Using an enhanced, 10,000 ASIC gate core from IBM/Xilinx, the overhead ranges from less than 1% to about 9% for our assumed integrated circuits. It is important to note that it is possible to reduce this area overhead by reducing the capacity of the programmable logic core. For instance, if the area overhead with the 10,000-gate PLC is not acceptable, it is possible to insert a PLC with a capacity of, for instance, 5000 gates. The issue of sizing the PLC correctly is not addressed in this thesis, but is discussed as potential future research in Chapter 6.  29  Fig. 2.6: Post-Silicon Debug Area Overhead Finally, Figure 2.6 shows the combined area overhead results. The graph shows that for large IC designs, the cost of implementing the extra logic to facilitate post-silicon debug is modest. For example, on a 20 million gate SoC it would be possible to observe and control a total of 7200 internal signals for an overhead of approximately 5%. 
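The structure of this estimate can be sketched in a few lines. The constants below are placeholders chosen only to make the script self-contained and of the right order of magnitude; they are not the STMicroelectronics Core90GPSVT cell areas or the published IBM/Xilinx PLC area used to produce Figures 2.5 and 2.6, and the access network is approximated with the hierarchical concentrator multiplexer count from Chapter 3 rather than the exact two-stage implementation. The sketch therefore reproduces the shape of the model, not its exact curves.

    # Sketch of the parameterized area-overhead model described above.
    # PLC_GATE_EQUIV and GATES_PER_MUX_FF are assumed placeholder values,
    # not figures taken from the thesis or from the IBM/Xilinx core.
    import math

    PLC_GATE_EQUIV = 500_000   # assumed silicon footprint of the enhanced PLC, in equivalent gates
    GATES_PER_MUX_FF = 8       # assumed equivalent gates per 2:1 multiplexer plus its routing flip-flop

    def access_network_gates(debug_nodes: int, plc_ports: int = 64) -> int:
        """Approximate one access network with the hierarchical concentrator
        cost from Chapter 3: roughly n*lg(m) + (n - m) 2:1 multiplexers."""
        n, m = debug_nodes, plc_ports
        muxes = n * math.ceil(math.log2(m)) + (n - m)
        return muxes * GATES_PER_MUX_FF

    def overhead_percent(soc_gates: int, total_debug_nodes: int) -> float:
        """Overhead = (PLC + observation network + control network) / baseline SoC size."""
        per_network = total_debug_nodes // 2              # 50% observe nodes, 50% control nodes
        extra = (PLC_GATE_EQUIV
                 + access_network_gates(per_network)      # observation network
                 + access_network_gates(per_network))     # control network
        return 100.0 * extra / soc_gates

    # Example: a 20-million-gate SoC with 7200 debug nodes.
    print(f"{overhead_percent(20_000_000, 7200):.1f}% estimated overhead")

With these placeholder constants the example lands near the roughly 5% figure quoted above for a 20-million-gate SoC, but only the published cell and core areas reproduce Figures 2.5 and 2.6 exactly.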
Note that as the IC design becomes larger, the area of the PLC is amortized and its overhead becomes less significant, making the issue of the PLC size less important. We also observe that the area increase for additional debug nodes is small for the larger SoCs. Extrapolating these results, we can see that for the large SoC it would be feasible to enable many thousands of debug nodes with an overhead of less than 10%. For these large ICs the cost of the access networks will be the important issue. 2.8 Chapter Summary We have the shown that it is feasible to use embedded programmable logic and programmable access networks to assist the post-silicon debug of complex ICs and we have described the architecture and operation of our post-silicon debug infrastructure. From this we have identified three implementation challenges, each of which are addressed later in this thesis. 30  We have also shown that the area overhead of our proposal is quite low for many of our target ICs. Our area overhead results compare well with recently released area overhead values from commercial debug implementations. For instance, the area overhead of DAFCA’s debug infrastructure is reported to be 4.8% and 4.2% respectively for two example implementations [17]. We believe that our proposal will significantly reduce the time-to-market for SoC developments by enabling effective post-silicon debug and by enhancing the overall validation effectiveness.  31  Chapter 2 References [1] S. Sandler, “Need for debug doesn’t stop at first silicon”, E.E. Times, February 21, 2005. [2] R. Goering, “Post-silicon debugging worth a second look”, E.E. Times, May 02, 2007. [3] A.B.T. Hopkins, K.D. McDonald-Maier, “Debug support for complex systems on-chip: a review”, IEE Proceedings - Computer and Digital Technology, vol. 153, no. 4, July 2006. [4] Y. Zorian, et al., “Testing embedded-core-based system chips”, IEEE Computer, 1999, vol. 32, no. 6, pp. 52-60, June 1999. [5] Motorola Inc., “MPC565/MPC566 user’s manual”, www.motorola.com/semiconductors. [6] Infineon Technologies AG, “Tricore 1 architecture manual”, version 1.3.3, 2002, www.infineon.com. [7] F. Goldshan, “Test and on-line debug capabilities of IEEE Std. 1149.1 in UltraSPARC-III microprocessor”, Proceedings of the International Test Conference, pp. 141-150, October 2000. [8] B. Vermeulen, S.K. Goel, “Design for Debug: catching design errors in digital chips”, IEEE Design and Test of Computers, vol. 19, no. 3, pp 37-45, May/June 2002. [9] T.J. Foster, et al., “First Silicon Functional Validation and Debug of Multicore Microprocessors”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 15, no. 5, pp. 495-504, May 2007. [10] M. Riley, et al., “Debug of the CELL Processor: Moving the Lab into Silicon”, Proceedings of the International Test Conference, pp. 1-9, October 2006. [11] M. Riley, M. Genden, “Cell Broadband Engine Debugging for Unknown Events”, IEEE Design and Test of Computers, vol. 24, no. 5, September/October 2007. [12] E. Anis, N. Nicolici, “On Using Lossless Compression of Debug Data in Embedded Logic Analysis”, Proceedings of the IEEE International Test Conference, pp. 1-10, October 2007. [13] E. Anis, N. Nicolici, “Low Cost Debug Architecture Using Lossy Compression for Silicon Debug”, Proceedings of Design Automation and Test in Europe, pp. 1-6, April 2007. [14] A.B.T. Hopkins, K.D. McDonald-Maier, “Debug Support Strategy for Systems-on-Chips with Multiple Processor Cores”, IEEE Transactions on Computers, vol. 55, no. 2, pp. 
174184, February 2006. [15] B.R. Quinton, S.J.E. Wilton, “Post-Silicon Debug Using Programmable Logic Cores”, Proceedings of the IEEE International Conference on Field-Programmable Technology, pp. 241-247, December 2005. [16] M. Abramovici, “A Reconfigurable Design-for-Debug Infrastructure for SoCs”, Proceedings of the Design Automation Conference, pp. 7-12, July 2006. [17] M. Abramovici, “A Silicon Validation and Debug Solution with Great Benefits and Low Costs”, Proceedings of the IEEE International Test Conference, p. 1, October 2007. [18] V. Betz, J. Rose, A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs, Kluwer Academic Publishers, 1999. [19] S. Sarangi, et al., “Patching Processor Design Errors With Programmable Hardware”, IEEE Micro, pp. 12-25, January/February 2007. [20] I. Wagner, et al., “Using Field-Repairable Control Logic to Correct Design Errors in Microprocessors”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 2, pp. 380-393, February 2008. [21] A. Hemani, et al., “Network on a Chip: An architecture for billion transistor era”, Proceedings of the IEEE NorChip Conference, November 2000. 32  [22] W.J Dally, B. Towles, “Route Packets, Not Wires: On-Chip Interconnection Networks”, Proceedings of the Design Automation Conference, Las Vegas, NV, pp. 684-689, June 2001. [23] IEEE Std. 1149.1-1990, IEEE 1149.1 Standard Test Access Port and Boundary-Scan Architecture, IEEE Computer Society, 1990. [24] Y. Hsu, “Visibility Enhancement for Silicon Debug”, Proceedings of the Design Automation Conference, July 2006. [25] Nova Software, “Siloti Visibility Enhancement Product datasheet”, www.novas.com/Solutions/Siloti, 2007. [26] STMicroelectronics, “Core90GPSVT_Databook”, Data Sheet, October 2004. [27] P. Zuchowski, et al., “A Hybrid ASIC and FPGA Architecture”, Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pp. 187-194, November 2002.  33  Chapter 3 Network Topology1 3.1 Introduction In order to enable our post-silicon debug infrastructure, it is necessary to provide a configurable network that connects fixed function logic (i.e., the circuits under debug) to the debug circuits implemented in a programmable logic core [1,2]. Different debug scenarios will require different input and output signals; the network provides a means to programmably select a number of signals from across the chip and connect these signals to the programmable logic core. The area cost and switch depth of this network are a key concern to the overall debug infrastructure since they will directly affect the area overhead and the operating frequency of the proposal. This chapter presents two new network topologies that meet the needs of our postsilicon debug infrastructure. In this chapter, we will first provide the motivation for examining new network topologies. Second, we demonstrate the difference in the cost and depth of existing permutation and concentrator network topologies. Third, we describe a new concentrator network construction that improves on existing concentrator implementations to produce a network that is lower cost and lower depth for all the network sizes that we have analyzed. Lastly, we investigate the problem of implementing a concentrator network within the context of a SoC. We identify key constraints specific to SoC implementations, and then extend our original concentrator topology to create a hierarchical network that is specifically optimized for our SoC post-silicon debug application. 
We present results for this second construction in terms of area overhead in a SoC. 3.2 Motivation The idea of using a network for SoC communication is not new [3,4]. SoC networks have mainly targeted communication between an embedded processor and a number of peripherals 1  A version of this chapter has been published as: - B.R. Quinton, S.J.E. Wilton, “Concentrator Access Networks for Programmable Logic Cores on SoCs”, Proceedings of the IEEE International Symposium on Circuits and Systems, Kobe, Japan, pp. 45-48, May 2005. (Reprinted with permission) - B.R. Quinton, S.J.E. Wilton, “Post-Silicon Debug Using Programmable Logic Cores”, Proceedings of IEEE International Conference on Field-Programmable Technology, Singapore, pp. 241-247, December 2005. (Reprinted with permission)  34  using packet- or transaction-based protocols. Unfortunately, these proposed networks are not well suited to the task of post-silicon debug since they require formatted data and an adherence to network protocols. A network of this sort would limit the potential debug scenarios of our infrastructure to ones involving correctly formatted and controlled transactions or packets rather than allowing us to debug arbitrary digital logic. Instead of packet- or transaction- based networks, indirect, non-blocking networks such as those defined for traditional telephone networks are better suited to applications of embedded programmable logic for post-silicon debug. These networks provide a connection with a fixed-bandwidth and a fixed-latency that can be configured on a per debug node basis for each pin on the programmable core. A great deal of research has been done on non-blocking multistage networks that allow arbitrary connections between any of the network’s inputs and outputs [5,6]. These networks can be viewed as permutation networks, since they provide all possible permutations of the inputs on the output side of the network. As such, permutation networks can be used to programmably connect signals from across the chip to the programmable logic core for debug purposes. These permutation networks may provide more flexibility than is required. Before configuration, all the pins of a programmable logic core are equivalent. The circuit implemented in the programmable core can be adapted to any arbitrary signal-to-pin assignment. Thus, it is not necessary to be able to connect every network input to every network output. Instead, it is only necessary to be able to connect every subset of network inputs of size less than or equal to the number of outputs to some set of network outputs. Exactly which pin is assigned to each signal does not matter. We would like to take advantage of this flexibility in order to lower the cost and improve the performance of the debug access network. Research has been done on the problem of unordered networks [7,8]. A specific class of unordered network called a concentrator has been defined. This network allows a number of inputs to connect to a smaller number of outputs. As defined, concentrator networks are well suited to the problem of providing access to a programmable core. However, thus far, most work on these networks has been theoretical in nature and to our knowledge no work has been done to apply these networks to SoCs or post-silicon debug applications. 
This chapter will show that although concentrators are lower in depth than permutation networks in all cases, they are not lower cost, in terms of switch area, than permutation networks  35  for many of the configurations of interest to our debug application. This limitation, and the constraints of SoC design, motivates us to develop new concentrator network topologies. 3.3 Network Requirements The fundamental requirement of an access network for our post-silicon debug infrastructure is that it must be able to connect a large number of potential debug nodes to a smaller number of inputs and outputs on a programmable logic core. It must also have low area overhead to enable the maximum debug coverage for a given area overhead, and low network depth to enable operation at the high clock rates of internal SoC logic. The input and output capacity of the programmable logic core will necessarily limit the number of signals that can be observed or controlled for a given debug scenario. However, to maximize the potential usefulness of the debug infrastructure, the network should not impose further limits on this particular set of debug nodes. To meet this requirement, the network must be non-blocking. For example, if 5000 nodes have to be connected as potential debug points, and the programmable logic core has, for instance, 64 inputs, the network should be able to simultaneously connect any set of up to 64 of the 5000 nodes to the PLC. In a hierarchical SoC composed of distinct IP blocks, or subsystems, it may make sense to relax this constraint slightly. In these cases the non-blocking criterion is required for nodes inside the IP block, or subsystem, but is not necessarily simultaneously required for the entire SoC. This will be discussed further in Section 3.7. The network required for our debug infrastructure does not have to be configured in real time. It would instead be programmed once for each debug scenario. For this reason the network does not need to handle incremental connection requests and is only required to be ‘rearrangeably’ non-blocking [6]. The access network required for debug applications in a SoC can be viewed as two separate networks. One will connect a number of potential sources to a smaller number of programmable logic inputs in order to observe the circuit under debug. The other will connect a number of programmable logic outputs to a larger number of potential sinks in order to control signals in the circuit under debug. Essentially, these networks are mirrors of each other. Therefore, a network solution that is appropriate for the input-side network will likely satisfy the output-side requirements when configured in reverse. For simplicity and readability, in this chapter we will  36  only consider the network connecting a number of sources to a smaller number of the inputs of the programmable logic in this chapter. 3.4 Background In this section we provide background on existing permutation and concentrator network constructions. We begin by describing existing work on permutation networks and showing how these constructions can be used for our application. Following this, we introduce a formal definition for a concentrator network, review existing work in the area, and finally, provide detailed constructions of the most applicable networks for our application.  Fig. 3.1: Asymmetric Benes 3.4.1 Permutation Networks A permutation network is a network with n inputs and m outputs for which any number of inputs q ! m can each connect to any one specified output. 
Since permutation networks allow the arbitrary connection of any input to any output, it is possible to use a permutation network to access a programmable core. The most obvious implementation of a permutation network is a fully-connected crossbar switch, which is equivalent to multiplexing each potential input to each output. A single-stage crossbar switch requires n·m switches and, therefore, the area cost of a crossbar switch is large for all but very small values of n and m. The Clos network has been shown to be an effective means to construct a multi-stage permutation network that is low cost and rearrangeably non-blocking [6]. The Benes network can be formed by recursively replacing the middle stage of a Clos network with another Clos network to reduce the network cost [5]. Most simple descriptions of the Benes networks assume that the number of inputs is equal to the number of outputs, and that the number of inputs is a power of 2. However, when viewed as a recursive construction of Clos networks, it is possible to build a rearrangeably non-blocking network with an arbitrary number of inputs and outputs as shown in Figure 3.1. This construction requires that the number of inputs and outputs be divisible by two. When this is not the case, the size of the inputs and/or outputs can be increased by one and the extra input and/or output can be left unconnected. During the construction, once the number of outputs required for the Clos network is less than or equal to two, a binary tree structure can be used in place of a Clos network. Fig. 3.2: Sparse Crossbar Concentrator The cost of the Benes network when m = n is 2nlgn and the depth of the network is 2lgn. In this case the network can be viewed as two nlgn networks connected back-to-back. For the case when m ≠ n, the network can be viewed as two networks connected together by a binary tree. The first network has a width of n and the second network has a width of m. The depth of these two networks is dictated by the value of m since the construction will proceed in the networks on either side of the binary tree until the intermediate Clos network requires only two outputs, at which point a binary tree will be used. This construction therefore results in a network with a depth of lg(m/2) + lg(2n/m) + lg(m/2), or lg(m) + lg(n) – 1. The cost of the network is nlg(m/2) + 2m(2n/m – 1) + mlg(m/2), or (n + m)(lgm – 1) + 2(2n – m). 3.4.2 Concentration Networks An (n, m)-concentrator is a network with n inputs and m outputs, with m ≤ n, for which every set of k ≤ m inputs can be mapped to some k outputs, but without the ability to distinguish between those outputs. An implication of this definition is that a completely new set of paths may be determined for every new set k; a concentrator is therefore rearrangeably non-blocking. Theoretical research has shown that it is possible to implement an n-input concentrator with O(n) crosspoints for sufficiently large n [7]. In contrast, it has been shown that a rearrangeably non-blocking n-input permutation network must have at least O(nlgn) crosspoints [9]. This result highlights a fundamental difference between permutation networks and concentrator networks. Finding explicit constructions of concentrators with a linear cost has proved difficult. Margulis [10] was able to provide an explicit construction but could not determine the resulting cost. Gabber and Galil [11] were able to determine the cost of Margulis's construction and build a concentrator with a cost of approximately 273n. This value was later improved to
This value was later improved to  approximately 123n [12]. These linear cost constructions prove to be impractical for networks of the size required for SoC design because the constant multiple in the cost equation is large and the construction of the 123n concentrator is not valid for values of n less than 6048. Work has been done on more practical concentrator constructions for smaller values of n. Nakamura and Masson were able to show that the lower bound on the crosspoint complexity of a sparse crossbar (or single stage network) implementation of a (n, m)-concentrator was (n-m+1)m [13]. Later, an explicit construction of sparse crossbars was determined for all n and m that meets this lower bound [14]. An example of this sparse crossbar concentrator is shown in Figure 3.2a. This is an important result because it shows that even with removal of the ordering restriction, the cost of a single stage concentrator is large for most values of n and m; multistage constructions are therefore more practical for the larger networks of interest to our debug infrastructure. Jan and Oruc demonstrated a multistage concentrator network called a MuxMerger network with O(nlgn) cost [15]. However, the constant multiple in the cost equation is high and the network depth is not balanced, making the worst-case depth large.  39  Fig. 3.3: Narasimha Network Definition Narasimha demonstrated the construction of a multistage n-hyperconcentrator that performs a superset of the functionality of a (n, m)-concentrator. A hyperconcentrator is closely related to a concentrator. This network has n inputs and n outputs for which any active set of inputs k ! n can be routed to any continuous set of outputs [16]. In his work, Narasimha suggests that this network can be used as a (n, m)-concentrator by ignoring the unneeded (n-m) outputs as shown in Figure 3.3. Used in this way, the network provides a lower cost and depth concentrator than any of the other previously described concentrators for network sizes of interest to SoC design.  Fig. 3.4: Narasimha Network Construction Narasimha’s network is constructed recursively. An n sized network is constructed from two n/2-input networks as shown in Figure 3.4. A stage of 2x2 crossbar switches is used to route the inputs of each of the two n/2-input networks. The outputs of these networks are then interleaved. The recursive construction proceeds until the required network is 2x2, at which point a simple crossbar is used. This results in a network with a cost of nlgn and a depth of lgn. 40  The routing on the input side proceeds in two possible ways for the 2x2 crossbar stage: a) if both inputs are active or inactive, the routing is set arbitrarily, and b) if only one input is active then it is routed to the lower concentrator if the previous active input was directed to the upper concentrator and vice versa. Narasimha shows that for all possible input sets this network will produce the desired output. Although Narasimha only considered the case when the number of inputs is a power of two, the construction of each stage of this network only requires that the number of inputs be divisible by two. When this is not the case, the size of the inputs can be increased by one and the extra input can be left unconnected. 3.5 Comparison of Concentrator and Permutation Networks In this section, we compare the implementation cost and depth of concentrator and permutation networks. 
In order to accurately compare the constructions of permutation networks and concentrators, it was first necessary to determine how each would be constructed using SoC techniques. For our application this provided more accurate cost and depth values than comparing generic crosspoints or graph edges. Since most SoCs are designed with standard cells, the implementation of each network has been accomplished using 2:1 standard cell multiplexers for all switching. Fig. 3.5: Permutation vs. Concentrator Network Area Cost Since both the number of inputs and the number of outputs can vary for an (n, m)-concentrator, it is difficult to present the construction cost for all configurations. However, a representative graph will illustrate the differences in cost and depth between the two networks. Permutation and concentrator networks were constructed such that the number of outputs was varied, while the number of inputs was held constant at 5000 (thereby modelling the case where our post-silicon debug infrastructure is targeted at 5000 potential observable nodes). Note that, as explained above, the Narasimha network has a fixed cost and depth regardless of the value of m. The cost and depth in terms of 2:1 multiplexers were extracted from these constructions as shown in Figures 3.5 and 3.6. Any gates required for signal buffering were ignored. Fig. 3.6: Permutation vs. Concentrator Depth As shown in Figure 3.5, a comparison of the two networks demonstrates that when the number of outputs is small with respect to the number of inputs, the cost of the permutation network is lower than that of Narasimha's concentrator. In contrast, the depth of the concentrator network is always lower than that of the permutation network, as shown in Figure 3.6. The case with a small number of outputs is important for the proposed debug access network since there will likely be a large number of sources that potentially require access to a relatively small number of inputs on the programmable logic (the inputs of the programmable logic are outputs of the network). 3.6 New Concentrator Construction In this section we outline the first of two new concentrator constructions. We begin by providing the topology of the new construction and a proof of its non-blocking characteristics. Then, we examine both the area cost and depth with respect to other concentrator and permutation constructions. 3.6.1 Network Description Observing that the Benes network has a lower area cost than the Narasimha concentrator construction for smaller values of m, and that the Narasimha construction provides more flexibility than required, we developed a new (n, m)-concentrator construction that is lower cost and lower depth than a permutation network for all n and m. Our concentrator is built recursively from two (n/2, m/2)-concentrators, as shown in Figure 3.7. The first n-2 inputs are connected to (n/2)-1 2x2 crossbars. Each of these is then connected to each of the concentrators. If the number of inputs/outputs is not divisible by two, the next larger value is used and the unneeded inputs/outputs are ignored. The final two inputs are each directly connected to one of the concentrators. Each of these two smaller concentrators can be built using the same construction. Once the concentrator reaches a size where the number of outputs is ≤ 2, a multiplexer-based implementation of a sparse crossbar is used for the concentrator. This multiplexer-based implementation is shown in Figure 3.2b. Fig.
3.7: New Concentrator Construction 43  The use of these sparse crossbar concentrators is important to the efficiency of our construction because for smaller values of m, the number of 2x2 switch stages is significantly reduced when compared to Narasimha’s construction. These stages are instead replaced by more efficient binary tree structures. To further illustrate our construction, a detailed diagram of a (16,8)-concentrator is shown Figure 3.8. Neglecting the area savings due to the ‘empty switches’, the cost of this new construction is nlg(m/2) + 2(n-m), and the depth is lgn.  Fig. 3.8: (16,8)-concentrator To see that the (n,m)-concentrator is non-blocking we will present a proof by induction. Observe that as long as each of the (n/2, m/2)-concentrators has no more than m/2 inputs requiring a connection to its outputs, the (n, m)-concentrator is non-blocking. This is true because by definition a (n/2, m/2)-concentrator is able to route m/2 inputs to its outputs. Since there can be no more than m active inputs, all that is required is to show that the initial stage has 44  enough flexibility to balance the inputs. To do this, first consider the 2x2 switches that have two occupied or two unoccupied inputs, in both of these cases the balance is maintained. Next, consider the directly connected inputs – if both of these inputs are used then the balance is maintained. However, if only one is used, then the first 2x2 switch with only one active input is routed to the opposite concentrator. Then, for each 2x2 switch with only one active input, the inputs are divided between concentrator in an alternating fashion. This procedure will ensure that no more than m/2 inputs are routed to a single (n/2, m/2)-concentrator.  Fig. 3.9: New Concentrator Area Cost 3.6.2 Cost and Depth Comparison The cost of the new concentrator construction is less than or equal to the cost of both the permutation network and Narasimha’s concentrator construction for all network configurations, as shown in Figure 3.9. The new concentrator construction maintains the same worst-case depth as Narasimha’s construction for all network configurations, as shown in Figure 3.10.  45  Fig. 3.10: New Concentrator Depth 3.7 Concentrator Constructions for SoC Applications In this section we examine the construction of a concentrator within the context of a SoC implementation. First, we describe the implementation considerations that need to be made for the network in a SoC. Second, we describe a new hierarchical concentrator that is well suited to SoC implementations. Third, we describe the mapping of this new concentrator to our postsilicon debug infrastructure. Finally, we show the area overhead for our new network within a SoC context. 3.7.1 SoC Implementation Considerations Although the concentrator described in Section 3.6 is lower cost and lower depth in terms of 2:1 multiplexers than other concentrator and permutation implementations, its construction as a single monolithic network has a number of drawbacks for practical SoC implementations. First, most SoCs are implemented hierarchically. The SoC is partitioned into blocks and each block is designed and implemented independently, or alternatively, purchased from an IP vendor [17]. The blocks are then integrated together in the top-level of the SoC hierarchy. Because of this, it is desirable to create a network that is hierarchical in order to fit the normal SoC design flow. 
In addition, in the construction specified in Section 3.6, each stage of 2x2 crossbars fans out to two 46  different (and non-adjacent) sets of 2x2 crossbars in the next stage. If the inputs to the network are not close together physically, long wires will be required to bridge the 2x2 crossbar stages. This may impose difficult timing and routing constraints on the network in a SoC debug scenario. Because of these limitations the network described in Section 3.6 is not necessarily the best topology for the overall debug network at the top-level of the SoC; however, as we will show in the next section, it may be well-suited to concentrator networks that are wholly contained within a given design block.  Fig 3.11: Hierarchical Concentrator Network Architecture  3.7.2 Hierarchical Network Construction To address the constraints of SoC implementation outlined in the previous section, we have created a new hierarchical concentrator construction shown in Figure 3.11. To understand the construction, first consider that the maximum number of outputs simultaneously active is a fixed 47  value, m . These m network outputs would normally be connected to the inputs of the programmable logic core. The n network inputs are partitioned into k groups. In our post-silicon debug infrastructure there would be k subsystems or IP blocks in the SoC. We assume that the number of inputs for each group, i, is a variable, p i, where n = ! pi. A local concentrator is connected to the inputs of each of the k groups in the first stage of the network. Two concentrator topologies may be used for the local concentrator. As we will show below, if a Narashima p i-hyperconcentrator is used for each of the k groups, the overall hierarchical network will remain a concentrator network, and be non-blocking across all n inputs. However, if a (pi, m)-concentrator is used for the local concentrator, the non-blocking criteria will be maintained only within the groups of inputs, pi. Simultaneous observation of nodes within two different groups of inputs will still be possible, but the non-blocking property will not be guaranteed at the SoC level. In many cases this relaxed non-blocking condition is appropriate to post-silicon debug scenarios because the circuit under debug will likely be concentrated in one IP block or subsystem. The second stage of the network combines the outputs of the k local concentrators with a simple multiplexed bus structure. We use an enabled OR-tree implementation, as shown in Figure 3.11. This implementation is advantageous as it avoids the problem of having the configuration bits of the multiplexed network distributed throughout the IC interconnect. The cost and depth of this hierarchical network can be compared to the networks discussed in Sections 3.5 and 3.6. Consider the case where the number of inputs to each local concentrator k is set equal to m. Then assume that m-hyperconcentrators are used to ensure that the overall network is still a true concentrator. Finally, we can use 2:1 multiplexers instead of the OR-tree shown above. In this case the cost of the network will be nlg(m) + (n-m), and the depth will be lgn. This cost is only slightly larger than the nlg(m/2) + 2(n-m) cost of the new concentrator proposed in Section 3.6. It is clear then that an effective strategy for building these hierarchical concentrators would be to recursively use the hierarchical construction until k = m. 
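A quick numeric check of the closed-form costs quoted above helps make this concrete. The sketch below simply evaluates the three expressions for an illustrative case (n = 5000 inputs, m = 64 outputs); like the expressions themselves, it ignores buffering and any savings from 'empty switches':

```python
from math import ceil, log2

def narasimha(n):
    # flat n-hyperconcentrator: n*lg(n) 2:1 muxes
    return n * log2(n)

def new_flat(n, m):
    # construction of Section 3.6: n*lg(m/2) + 2(n - m)
    return n * log2(m / 2) + 2 * (n - m)

def hierarchical(n, m):
    # hierarchical construction with m-input local hyperconcentrators
    # and a 2:1-mux merge stage: n*lg(m) + (n - m)
    return n * log2(m) + (n - m)

n, m = 5000, 64                          # illustrative: 5000 observable nodes, 64 outputs
for name, cost in [("Narasimha", narasimha(n)),
                   ("Section 3.6", new_flat(n, m)),
                   ("hierarchical", hierarchical(n, m))]:
    print(f"{name:>12}: {ceil(cost):>6} 2:1 muxes")
```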
At this point the cost of the hierarchical network will be very close to the cost of the first concentrator construction. Hui has described a somewhat similar two-stage approach for two related network topologies: superconcentrators and distribution networks [18]. However, these networks were addressed from a theoretical perspective and no explicit construction was given. Because a 48  superconcentrator provides a superset of the functionality of a concentrator, Hui’s non-blocking proof can also be applied to our hierarchical network. Rather than repeating Hui’s proof, we provide a more intuitive (although less rigorous) explanation that our new structure provides the functionality of a concentrator. Each of the k sets of inputs can be considered one at a time. For each set, if there are inputs in the set that need to be routed, the first stage of the network maps these inputs into a contiguous set. Since the hyperconcentrator can map inputs to any contiguous output set, the contiguous set can be mapped such that it begins at the first unoccupied position on the multiplexed bus. If each of the k sets is considered in order and the number of selected inputs n is ! m (which is required by definition), then there will never be a conflict on the bus. As long as there is no conflict on the bus, any m of the n input signals will be able to reach the outputs. Hence the construction meets the requirements.  Fig. 3.12: Example Network Implementation on a SoC for Post Silicon Debug 3.7.3 Concentrator Network Mapping to Post-Silicon Debug Applications We believe that our hierarchical concentrator implementation is well suited to the problem of providing an access network for our post-silicon debug infrastructure. To demonstrate this, an 49  example mapping of this network to a post-silicon debug situation is shown in Figure 3.12. In this figure we can see that the network is partitioned such that the local concentrator is implemented inside the IP blocks of the SoC. Although not shown in the diagram, we assume the debug signals would be registered before leaving these blocks. This isolates the timing of the first stage of the network and minimizes routing congestion by ensuring that the inputs to the local concentrator are physically colocated. In addition, this partitioning also enables block designers to insert much of the debug network locally and manage the timing and area overhead at the block level. A new level of hierarchy, the debug wrapper, is added to each block to accommodate the new debug port on each block. The width of this port would normally match the capabilities of the programmable logic core. The second stage of the network, the debug bus, is then built as an OR-tree implementation of a multiplexed bus. The timing and routing of this stage of the network must be managed at the top-level of the SoC, however this stage of the network is efficient in terms of routing congestion and is well suited to pipelining or asynchronous implementations, as we will show in Chapter 4. 3.7.4 SoC Area Overhead In this section, we estimate the area overhead of the hierarchical concentrator construction. To calculate the area overhead, we created a parameterized model of a baseline SoC and a SoC with the inclusion of the new hierarchical network. The model is parameterized based on the total number of gates in all IP cores and the number of observable signals in the SoC. The area was calculated by summing the size of the IP cores and the area of standard cell implementations of the new network. 
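To illustrate how such a parameterized model can be set up, the sketch below estimates the overhead of the observe access network from cell counts. The per-cell and per-gate areas are placeholders chosen purely for illustration (the actual standard-cell library data assumed in the next paragraph is not reproduced here), and the bus term is only a rough count, so the absolute percentage printed should not be read as reproducing the thesis results:

```python
from math import ceil, log2

# Placeholder areas in um^2 -- illustrative values only, not library data.
A_MUX2, A_DFF, A_OR2_AND2 = 2.5, 5.5, 2.0
A_PER_GATE = 4.0             # placeholder um^2 per gate of the baseline SoC

def observe_network_overhead(soc_gates, n_observe, m_outputs=64, k_blocks=64):
    """Rough area overhead of the observe access network relative to the SoC.

    Local stage: ~n*lg(m) 2:1 muxes, with one control flip-flop assumed per
    mux for routing state. Bus stage: roughly one OR/AND pair plus a flop per
    output bit per IP block (k_blocks is an illustrative block count).
    """
    local_cells = n_observe * ceil(log2(m_outputs))
    local_area = local_cells * (A_MUX2 + A_DFF)
    bus_area = m_outputs * k_blocks * (A_OR2_AND2 + A_DFF)
    soc_area = soc_gates * A_PER_GATE
    return (local_area + bus_area) / soc_area

# Example: a 20M-gate SoC with 7200 observable signals.
print(f"{observe_network_overhead(20e6, 7200):.2%}")
```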
A 90nm technology and the STMicroelectronics Core90GPSVT standard cell library were assumed throughout [19]. The target PLC was assumed to have 128 available I/O ports [2]. We assigned half of these I/O ports (64) to the observe access network; therefore the concentrator network had 64 outputs. The access network area was calculated by implementing the architecture described in Figures 3.11 and 3.12 for various numbers of network inputs (observe nodes). To ensure that the area estimates were pessimistic, for the local concentrator stage, we used Narasimha’s hyperconcentrator construction to ensure we met the overall requirements of a true non-blocking concentrator. This hyperconcentrator portion of the network was implemented using 2:1 multiplexers for switching and a flip-flop (with enable) per 2:1 multiplexer for routing control. 50  The multiplexed bus was constructed using 2-input ‘OR’ gates, ‘AND’ gates and flip-flops to maintain the state of the bus. In all cases the area required for wire routing was ignored. Figure 3.13 shows the overhead of the access network. SoCs of sizes between 5 million and 80 million gates were assumed (this is the sum of the gate count of all IP blocks before the network was added). For each integrated circuit size, the number of observable signals was varied from 200 to 7200 and the area overhead of the network relative to the original integrated circuit was plotted. As shown in Figure 3.11, as the number of observable signals increases, the area overhead of the network increases in an almost linear fashion. This behaviour is not surprising over this range of inputs since the underlying network has a complexity of O(nlgm), and m is held constant at 64. The figure shows that the overhead of the observe access network is, for example, less than 2% for a 20 million gate SoC in which 7200 signals are to be observed. We would anticipate that the control access network would have similar cost, and therefore the overhead to observe and control 7200 signals would be approximately 4% in this case.  Fig. 3.13: Hierarchical Network Overhead  51  3.8 Chapter Summary We have demonstrated two new concentrator network constructions. The first was shown to be lower cost and lower depth than existing constructions for all the network sizes that we considered. The second construction, while slightly less optimal, was shown to be well suited to SoC implementations because of its hierarchical construction. We explained the mapping of this second concentrator construction to our post-silicon debug infrastructure, and, finally, we demonstrated the area overhead of this network for a range of SoCs sizes and observable debug nodes. Our results show that, on many target SoCs, it is possible to build a network to observe many thousands of nodes for less than 5% area overhead.  52  Chapter 3 References S.J.E. Wilton, R. Saleh, “Progammable Logic IP Cores in SoC Design: Opportunities and Challenges”, Proceedings of the IEEE Custom Integrated Circuits Conference, San Diego, CA, pp. 63-66, May 2001. [2] P. Zuchowski, et al., “A Hybrid ASIC and FPGA Architecture”, Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pp. 187-194, November 2002. [3] V. Raghunathan, et al., “A Survey of Techniques for Energy Efficient On-Chip Communication”, Proceedings of the Design Automation Conference, pp. 900-905, June 2003. [4] W.J. Dally, B. Towles, “Route packets, not wires: on-chip interconnection networks”, Proceedings of the Design Automation Conference, pp. 
684-689, June 2001. [5] V.E. Benes, Mathematical Theory of Connecting Networks and Telephone Traffic, Academic, New York, 1965. [6] F.K. Hwang, The Mathematical Theory of Nonblocking Switching Networks, World Scientific, New Jersey, pp. 45-81, 1998. [7] M.S. Pinsker, “On the complexity of a concentrator,” Proceedings of the Seventh International Teletraffic Congress, Stockholm, Sweden, pp. 318/1-318/4, 1973. [8] F.R.K. Chung, “On Concentrators, Superconcentrators, Generalizers, and Nonblocking Networks,” Bell Systems Technology Journal, vol. 58, no. 8, pp. 1765-1777, 1978. [9] C.E. Shannon, “Memory requirements in a telephone exchange,” Bell Systems Technology Journal, pp. 343-349, 1950. [10] G.A. Margulis, “Explicit constructions of concentrators”, Problems of Information Transmission, vol. 9, no. 4, pp. 325-332, 1973. [11] O. Gabber, Z. Galil, “Explicit constructions of linear sized super-concentrators”, Journal of Computer and Systems Science, pp. 407-420, 1981. [12] N. Alon, Z. Galil, and V. D. Milman, “Better Expanders and Superconcentrators”, Journal of Algorithms, vol. 8, pp. 337-347, 1987. [13] S. Nakamura and G. M. Masson, “Lower bounds on crosspoints in concentrators”, IEEE Transactions on Computers, vol. C-31, pp. 1173-1178, 1982. [14] A.Y. Oruc, H.M. Huang, “Crosspoint Complexity of Sparse Crossbar Concentrators”, IEEE Transactions on Information Theory, vol. 42, no. 5, pp. 1466-1471, September 1996. [15] M.V. Chien, A.Y. Oruc, “High performance concentrators and superconcentrators using multiplexing schemes,” IEEE Transactions on Communications, vol. 42, no. 11, pp. 30453051, November 1994. [16] M.J. Narasimha, “A Recursive Concentrator Structure with Applications to Self-Routing Switching Networks”, IEEE Transactions on Communication, vol. 42, no. 2/3/4, pp. 896897, April 1994. [17] R. Saleh, et al., "System-on-Chip: Reuse and Integration", Proceedings of the IEEE, vol. 94, no. 6, pp. 1050-1069, June 2006. [18] J.Y. Hui, Switching and Traffic Theory for Integrated Broadband Networks, Kluwer Academic Publishers, Boston, 1990. [19] STMicroelectronics, “Core90GPSVT_Databook”, Data Sheet, October 2004. [1]  53  Chapter 4 Network Implementation1 4.1 Introduction The centralized nature of the programmable logic in our post-silicon debug infrastructure imposes the requirement of an interconnect network that spans a significant portion of the physical SoC. This occurs because the circuit under debug may be in any location on the die, while the programmable logic core will be restricted to one physical location. High-speed networks that span a significant distance are a challenge to implement in modern process technologies. To address this, we present a new asynchronous interconnect design and implementation that addresses many of the performance and implementation challenges posed by the network in our post-silicon debug infrastructure. In this chapter we demonstrate an asynchronous interconnect design developed specifically to be implemented using only standard cells and optimized using commercially-available CAD tools. We compare our asynchronous implementation to standard synchronous techniques and show that, although the requirements of an ASIC design flow impose limitations on the performance of the asynchronous interconnect, there is still a significant portion of the overall design space wherein an asynchronous interconnect provides an advantage over a synchronous interconnect. 
We then address one of the major CAD tool limitations for asynchronous interconnect with a modification to one of the current ASIC design tools. This limitation prevents the automatic insertion of pipeline stages, which are important in achieving high throughput in asynchronous interconnects. We implement a modification to the ASIC place-androute tool that allows asynchronous pipeline insertion for our interconnect designs. This modification is shown to further increase the design space that the asynchronous interconnect can address. Finally, we compare the area, power and latency of synchronous and asynchronous implementations of an interconnect network for a range of IC sizes and constructions. These  1  A version of this chapter has been published as: - B.R. Quinton, M.R. Greenstreet, S.J.E. Wilton, “Practical Asynchronous Interconnect Network Design”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 5, pp 579-588, May 2008. (Reprinted with permission)  54  results are based on a standard 90nm ASIC flow and are extracted from silicon-ready placed and routed designs. 4.2 Motivation The implementation of interconnect is becoming a significant issue in modern IC design. In older processes, wire delays were not significant in comparison to gate delays. This greatly simplified inter-block communication since any part of the IC could communicate with any other part without special timing considerations. However, in modern processes, wire delays are significant, and inter-block communication is no longer a simple matter [1]. Without careful design, the interconnect network may become the critical path in our debug infrastructure, impacting our ability to monitor or interact with the normal, at-speed, operation of the device. Repeater insertion has been employed to decrease wire latency and improve performance [2]. However, even optimally-placed buffers may not be sufficient, and register pipelining is required in some situations [3]. Although register pipelining increases throughput, there is a significant design effort required to implement these synchronous pipeline stages since each individual register requires a low-skew clock. While feasible, the design effort for this low-skew clock distribution in a debug infrastructure could limit the adoption of our design-for-debug methodology since the extra effort must be made “up-front”, before the usefulness of the new logic is clear. Asynchronous techniques have been suggested as a potential solution to the problem of IC interconnect [4,5]. These asynchronous techniques do not require a global clock. Instead, a local request/acknowledgement protocol is used to coordinate each data transfer. In addition to eliminating the need to build a low-skew clock tree for pipeline stages, an asynchronous interconnect also has the advantage of removing clock skew requirements and naturally supporting blocks with different clock rates, which is important for our debug infrastructure. Although a number of asynchronous protocols and implementations have been explored with the intention of being used for IC interconnect, these implementations require custom designed circuits and user-implemented CAD tools or manual design optimization [4,5]. These techniques are not feasible for many of the ASICs and SoCs we are targeting with our debug infrastructure. Typical ASIC and SoC designs are implemented with standard cells and commercial CAD tools  55  (often referred to as a standard ASIC flow). 
We address this limitation with a new asynchronous design that we will describe in this chapter. 4.3 Background The problem of managing high-speed interconnect has been addressed in a number of ways. For synchronous interconnects, a number of algorithms have been developed that determine the optimal repeater and register insertion locations [3,7,8]. While these algorithms determine the optimal location of registers to pipeline the interconnect and achieve a target clock frequency, they do not address the issue of producing a low-skew clock for each of these registers. Highspeed clock tree design thus remains a significant challenge. Other research has focused on asynchronous interconnect techniques. These techniques can be classified into two broad categories: 1) bundled-data and 2) delay-insensitive [9]. Bundleddata designs manage a number of bits together and make use of request/acknowledge signalling that is separate from the data. Because of this separation, these designs require delay-matching between the data bits and the control signalling. In contrast, delay-insensitive designs rely on data encoding that allows the control signalling to be extracted from the encoded data. This encoding removes the requirement of delay-matching from these designs. While bundled-data techniques tend to be more efficient and higher performance than delay-insensitive techniques, the requirement of delay-matching is not compatible with current commercial CAD tools. Therefore, we did not consider bundled-data techniques further for our interconnect implementations. There are a number of different delay-insensitive techniques. These techniques differ primarily in their data encoding methods and handshaking protocols. There are many possible data encodings [10]. However, most research has focused on two of them: 1) dual-rail and 2) 1of-4. Dual-rail uses a two bit encoding for each data bit, whereas 1-of-4 encoding uses a four-bit code for each two data bits. There are two widely used handshaking protocols: 2-phase and 4phase. When using 2-phase handshaking, the transitions of the control and acknowledgement signals indicate the completion of a transfer. When using 4-phase handshaking, the level of the control signals indicates the completion of a transfer. The commercially-available Nexus crossbar from Fulcrum Microsystems uses a 1-of-4, 4phase implementation [5]. Unlike our work, the Nexus crossbar is based on custom-designed, 56  pre-charged domino logic. Work by Bainbridge and Furber also employs a 1-of-4, 4-phase approach [4]. Again, unlike our work, theirs requires a sequential cell that does not exist in normal standard cell libraries (a C-element) and does not make use of circuit optimization from commercial CAD tools. While a C-element can be created easily with standard cells and a combinational feedback path, the standard tools will not be able to infer the behaviour of the new circuit. A dual-rail, 2-phase protocol, LEDR, has been proposed in [11]. In [11], this implementation was described in terms of custom dynamic logic, and was not investigated within the context of interconnect with long wires. We present a standard cell application of LEDR in this chapter.  Concurrent with our work, Silistix has announced a commercial product,  CHAINworks that is intended to generate an asynchronous interconnect which is compatible with standard design flows [12]. Finally, Stevens has compared a number of synchronous and asynchronous interconnect implementations using first-order models [13]. 
These first-order models do not consider the performance or cost within the context of a standard ASIC flow.  Fig. 4.1: Multiplexed Bus 4.4 Experimental Framework In this section we outline the experimental framework that we will use in the following sections to evaluate the effectiveness of our new asynchronous implementation. First, we describe the structure of the interconnect network we will use in our experiments. Then, we  57  explain the model ICs that will be used to investigate the new implementation over a range of IC design scenarios. Finally, we provide a detailed description of the CAD tool flow we use to implement the circuits in our experiments. 4.4.1 Interconnect Network Structure The network structure required for our post-silicon debug infrastructure presents a number of challenges. First, a large number of block outputs must be connected to a single PLC. Since the blocks are distributed on the IC, the connections to the PLC must span a significant portion of the IC. In addition, the communication between the fixed blocks and the PLC must occur at high data rates in order to allow the IC to operate normally. Furthermore, the various IP blocks on the SoCs to be tested will often operate at different clock frequencies. Therefore, the interconnect must allow simultaneous communication at different clock rates. In this chapter we will assume that both the circuits under debug and the debug PLC are implemented with synchronous techniques. In the future it may be possible to consider an asynchronous PLC implementation [15]. For the purpose of this chapter, the important aspects of the network are the multiplexing and transmission of the signals from each of the IC blocks to the PLC. The structure used for this portion of the network is an ‘OR-tree’ implementation of a multiplexed bus as shown in Figure 4.1. In this chapter, only the observation network connecting the circuits under debug to the PLC is considered, because this is the most common debug case. Optionally, a network connecting the PLC outputs to the fixed blocks may be used to control and override signals. We expect that this de-multiplexing network would have a similar structure and similar interconnect issues.  58  Fig. 4.2: 64-Block IC Placement 4.4.2 Target ICs In order to investigate the cost and performance of different debug interconnect network implementations, nine different target ICs were modelled based on a STMicroeletronics 90nm process. Three different die sizes were used: 3830x3830 µm, 8560x8560 µm, and 12090x12090 µm. These die sizes represent total standard cell gate counts of approximately 2.7 million, 13.9 million and 27.7 million, respectively. For each die size, three different network scenarios were created based on the number of different block partitions on the chip. The three scenarios were: 16 blocks, 64 blocks, and 256 blocks. For simplicity, each of the blocks was set to be of equal size.  The IP blocks are traditional synchronous designs for which the clocks can run  continuously. Signals from each block began at the centre of the block. The location of the debug PLC (i.e., the network destination) was set to be at the centre of the die as shown in Figure 4.2. With the exception of the initial signal from each block, the network nodes, wire routing and repeaters were restricted to occur outside the design blocks. Placement guides were used to ensure that the placement tool would place network nodes so as to minimize distances between communicating nodes. 
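To give a sense of the distances this network must span, the following sketch estimates the farthest block-centre-to-PLC Manhattan distance for the nine scenarios, assuming the equal-sized blocks tile the die in a regular square grid as suggested by Figure 4.2 (an assumption of this sketch, not a statement of the exact floorplans used):

```python
# Farthest block-centre-to-die-centre Manhattan distance for the nine target
# ICs, assuming a square grid of equal blocks with signals starting at block
# centres and the PLC at the die centre.
die_widths_um = [3830, 8560, 12090]
block_counts = [16, 64, 256]

for w in die_widths_um:
    for b in block_counts:
        k = int(b ** 0.5)                    # blocks per side
        # a corner block's centre sits w/(2k) from each die edge
        dist = 2 * (w / 2 - w / (2 * k))     # Manhattan distance to the die centre
        print(f"{b:>3} blocks, {w:>5} um die: ~{dist:6.0f} um to the PLC")
```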
To make comparison easier, and to shorten the run times, the width of the multiplexed bus was set to 1 bit wide for all scenarios. This is not a fundamental restriction of either the synchronous or asynchronous interconnect architecture. In general the width of the bus may be any value. This had the side effect of reducing routing congestion. Because the 59  routing was constrained to occur between the design blocks, we determined that this had only a minor optimistic effect on the final timing numbers with respect to a wider bus. We confirmed this based on a number of trial runs with the bus width set to larger values (i.e., 16 and 32 bits wide).  Fig. 4.3: CAD Tool Flow 4.4.3 CAD Tool Flow We use a standard digital ASIC design flow. Each interconnect network was described using Verilog-HDL. The Verilog description was synthesized with Synopsys Design Compiler 60  (v2004.06) [16]. Synthesis targeted the STMicroeletronics 90nm, 7-layer metal process using the STMicroelectronics CORE90GPSVT standard cell library [17]. Functional simulations were done with Cadence Verilog-XL (v05.10) [18]. The cell placement, cell sizing and repeater insertion were performed by Cadence SoC Encounter (v04.10) [18]. Detailed wire routing was performed using Cadence NanoRoute (v04.10) [18]. Parasitic extraction was done using Cadence Fire & Ice (v04.10). Static timing analysis was performed using Synopsys Primetime (v2004.06) [16]. Verification of pre/post-layout functional equivalency was performed using Synopsys Formality (v2004.06) [16]. Finally, power measurements were made with Synopsys PrimePower (v2003.12) [16]. This CAD tool flow was completely automated using csh and tcl scripts (Figure 4.3). The initial inputs were: a description of the size and number of block partitions on the target IC; and the target clock operating frequency. From this initial description, both synchronous and asynchronous implementations were created. The outputs of the flow were: a netlist; placement information; a routing description; and area, power and timing details. The run time for the entire flow on a given interconnect ranged from 30 minutes to 3 hours on a 500MHz UltraSPARC-II processor. 4.5 Asynchronous Implementation In this section we describe the design, implementation, optimization and verification of our asynchronous interconnect. We begin by examining the limitations of the current CAD tools and standard cell libraries with respect to asynchronous implementations. Then, we describe details of the design and operation of the interconnect. After this, we explain how we used standard place-and-route tools to perform delay optimization on our implementation. Finally, we explain how we used commercial, synchronous timing tools to perform timing verification of our asynchronous circuits. 4.5.1 CAD Tool / Library Limitations For the purpose of implementing asynchronous interconnects, there are a number of limitations imposed by commercially-available CAD tools. The limitations listed in this section are derived from the specific tool flow outlined in Section 4.4. However, other tools intended for synchronous design impose similar restrictions. The first limitation is that there is no mechanism 61  to ensure that delays are matched on different paths in a circuit. This makes the use of bundleddata interconnects infeasible if the network is large, since each path would require manual consideration. 
In addition, because of this inability to specify relative delay constraints, there is no automated hazard (glitch) avoidance available. Secondly, these CAD tools will not tolerate combinational feedback paths and are unable to infer sequential circuits built from combinational gates. Because of this restriction, circuits that create sequential elements that are not explicitly defined in the standard cell library cannot be optimized using automatic gate sizing or repeater insertion. Thirdly, for automatic circuit optimization, each path must be referenced to a common global clock. If this reference cannot be made, the tool will simply ignore the path and the delay on that path will not be optimized. Finally, the circuit optimization tools are only designed to insert repeaters to manage wire delays.  These tools do not have the ability to insert  asynchronous pipeline stages. For our work this restriction means that asynchronous pipeline stages can occur only at network nodes. The effect of removing this restriction will be investigated in Section 4.7. Standard cell libraries also restrict the implementation of asynchronous interconnects. Since the library is designed for synchronous circuits, it lacks certain sequential cells that would be ideal for asynchronous circuits. The most obvious example is the C-element that is used in many asynchronous designs [9].  Fig. 4.4: Dual-Rail Encoding 4.5.2 Design To maximize throughput our design used 2-phase handshaking techniques and dual-rail encoding. In our initial designs we attempted to use a 1-of-4, 4-phase protocol since this implementation has been used in previous work on asynchronous interconnects [4,5]. However, 62  the throughput restrictions of the 4-phase handshaking prevented any of our 4-phase designs from performing at a speed equal to that which could be easily achieved using a synchronous interconnect without any pipelining. With 4-phase signalling, each data transfer requires four traversals of the long wires between the interconnect nodes. Two-phase handshaking requires only two traversals of the long wires between network nodes per data transfer. Furthermore, using a dual-rail encoding has the added benefit of minimizing the decode logic delay and further improves throughput. However, our design and methodology are not limited to dual-rail. It is possible to extend our encoding scheme to encode multiple bits, at the cost of increasing the depth of the decode circuitry. For instance, it would be possible to support 1-of-4 encoding at the cost of an extra XOR stage. This dual-rail, 2-phase approach is the same encoding and handshaking scheme used by LEDR [11], although their circuit implementation is significantly different than ours. In the dual-rail scheme there are two different encodings for each data bit value, as shown in Figure 4.4. This ensures that only one of the wires transitions for each new bit. Based on this encoding, it is possible to detect each bit transition with an XOR gate. The value of the XOR gate describes which code exists on the dual-rail. The encoding rules ensure that the code alternates for each new bit. At an asynchronous pipeline stage, the value of the code can be compared to the code that appears in the following stage as well as the code that represents a new incoming value. The stage accepts a new value only if: 1) the code at the following stage equals the current code, and 2) the incoming code is different from the current code.  Fig. 
4.5: Dual-Rail 2-Phase Implementation 63  Our design combines synchronous IP blocks with asynchronous interconnect, and our focus is on the interconnect. However, we recognize that such a hybrid design requires care to be taken with the interfaces between the synchronous and asynchronous components. If the sender and receiver blocks are operating at the same frequency, then only the delay of the interconnect is uncertain, and source-synchronous [21] methods can be used. In such an application, our handshaking interconnect provides the FIFO behaviour required to compensate for phase differences between the two blocks. Our approach can also be used when the sender and receiver operate at arbitrary different clock frequencies.  Here, a designer could use a  synchronizing FIFO such as the design proposed by Chelcea and Nowick [22] to connect synchronous blocks to the asynchronous interconnect. While Chelcea and Nowick's design uses full-custom logic, we believe that the methods we have presented in this chapter could be extended to produce a standard-cell implementation of their design, and plan to do so in future work. Our dual-rail approach is also applicable to interconnect for multi-bit busses without changing our encoding scheme. Each bit of the bus would be encoded according to the dual-rail scheme. Then these bits would be realigned using a FIFO at the destination. To meet the requirements of the CAD tool flow, we created a new asynchronous interconnect structure using flip-flops as the sequential elements in the design, as shown in Figure 4.5a. This use of flip-flops is important since it allowed us to specify timing paths that are compatible with standard circuit optimization tools. The resulting design is quite robust against delay variations. Other than the clock generation circuit, which will be discussed later in this section, there is only one timing relationship that must be ensured: the dual-rail data must reach the ‘d’ inputs of the flops before XOR output reaches the clock generation circuitry. This requirement is easily guaranteed by ensuring that the XOR gate and flip-flops are placed reasonably close together. We used the “createRegion” command in SoC Encounter to ensure that this relative placement requirement was met. Further, to ensure that the use of large numbers of createRegion commands did not affect the run time performance of the placement tool, we ran a number of trials with and without this createRegion command. The results showed that the run time of the placement and routing was, in general, shorter when the createRegion command was used. Therefore we do not believe there is any problem using this method to achieve our desired placement restrictions. 64  Fig. 4.6 Clock Generation Circuit The clock generation circuit, shown in Figure 4.6, was designed using specific discrete gates to avoid hazard issues. These simple gates are available in a standard cell library. In the design, a rising clock edge is produced only when the new code is different from the current code and the current code is the same as the next stage code. Therefore, there are two possible transitions that generate the rising clock edge:  the arrival of a new code or the arrival of an  acknowledgement of the next stage’s code (Figure 4.7). The falling clock-edge is always generated by the capture of the new code in the flip-flops (i.e., the generation of a new value for ‘current code’). This circuit is robust against delay variations. 
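The capture rule described above can be summarized with a small behavioural model: a stage fires only when the incoming code differs from its own code and its own code matches the next stage's code. The sketch below models this control rule only; it is not the gate-level, hazard-free circuit of Figures 4.5 and 4.6, and the stage count and bit pattern are arbitrary:

```python
def simulate(bits, n_stages=4):
    """Behavioural sketch of the dual-rail, 2-phase pipeline control.

    Each stage holds one data bit and a one-bit 'code' (the XOR of its
    dual-rail flops). A stage captures a new value only when 1) the incoming
    code differs from its current code and 2) its current code equals the
    next stage's code (the acknowledgement).
    """
    data = [0] * n_stages
    code = [0] * n_stages
    src_code, src_data = 0, 0
    sink_code, received = 0, []
    pending = list(bits)

    for _ in range(10 * (len(bits) + n_stages)):   # enough iterations to drain
        # Sender: launch the next bit once stage 0 has captured the previous one.
        if pending and src_code == code[0]:
            src_data = pending.pop(0)
            src_code ^= 1                          # a code flip signals "new bit"
        # Receiver: consume when the last stage presents a new code.
        if code[-1] != sink_code:
            received.append(data[-1])
            sink_code = code[-1]                   # acknowledgement to the last stage
        # Stages: conditions 1) and 2) generate the local clock pulse.
        for i in range(n_stages - 1, -1, -1):
            in_code = code[i - 1] if i else src_code
            in_data = data[i - 1] if i else src_data
            nxt_code = code[i + 1] if i < n_stages - 1 else sink_code
            if in_code != code[i] and code[i] == nxt_code:
                data[i], code[i] = in_data, in_code
    return received

assert simulate([1, 0, 0, 1, 1, 0]) == [1, 0, 0, 1, 1, 0]
```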
There is one timing relationship that must be guaranteed: the minimum clock width of the flip-flops must be respected. This requirement is easily met since the falling edge of the clock is not generated until the flip-flops capture a new value and the new value propagates through the clock generation logic. Section 4.5.4 presents a methodology for verifying the required relationships and measuring the circuit throughput.  Fig. 4.7: Clock Generation Waveform 65  In our experiments, we found that the delays of the XOR gates dominated the latency and cycle time of our designs. With this in mind, it is possible to further enhance the throughput of the circuit described in Figure 4.5a. This can be done by pre-computing the acknowledge bit as shown in Figure 4.5b. This value is then held in an additional flip-flop. By doing this, the delay of one XOR gate is removed from the critical path of the circuit. This modification has the potential to affect the minimum clock width requirement; however, if the requirement of placing the flip-flops and clock generation logic together is extended to this new flip-flop, the circuit remains quite safe because the ‘clk-to-q’ delay of the sampling flops will still dictate the minimum pulse width. 4.5.3 Delay Optimization It is possible to perform delay optimization with automatic cell sizing and repeater insertion directly on the circuit described in Section 4.5.2. However, the descriptions of the clocking relationships can become complicated because the clocks on each flip-flop appear to be unique from the perspective of the CAD tools. Each of these clocks must be related to a common global clock. To simplify this, it is possible to make a minor circuit modification immediately before the delay optimization phase. This circuit modification can be corrected once the optimization stage is done. In our flow, the functional correctness of this modification/correction can be verified using a formal verification tool, Synopsys Formality [16]. This equivalence checking step is done between the pre-layout and post-layout Verilog netlists, as described on the righthand side of Figure 4.3. The circuit modification is done by breaking the wires connecting to the ‘d’ and ‘clk’ inputs of the flip-flops and then connecting the former clock wire to the ‘d’ input of the flip-flops, as shown in Figure 4.8. This change leaves the ‘clk’ inputs to the flip-flops open. A ‘virtual clock’ can then be created and associated with the flip-flop ‘clk’ inputs. This allows one global clock constraint to be applied to the entire design. It also ensures that the correct timing paths are optimized. Once the optimization is done, the ‘virtual clock’ is removed and the wiring is corrected before the detailed route is performed. Parasitic extraction and static timing are then performed as normal. This wiring modification can be automated in SoC Encounter with the engineering change order (ECO) set of commands. Similar commands exist for other placement tools.  66  Fig. 4.8: Circuit Modification Enabling this automatic delay optimization is a critical part of this tool flow. For instance, in the case of the 12090x12090 µm IC with 256-block partitions targeting a 600 MHz throughput, 3647 cells were automatically resized and 84 repeater cells were added to the design.  These  modifications improved the loop time by 810 ps, thereby improving throughput by approximately 174 MHz. 
This significant improvement highlights the importance of allowing placement aware delay optimization when using standard cell based interconnects. // throughput target set_max_delay –from */CLK –to */CLK <loop_delay/2> // min clock pulse target set_min_delay –from */CLK -to */CLK <ff_min_clk>  Fig. 4.9: Example Static Timing Constraints 4.5.4 Timing Verification We verify the timing constraints of our asynchronous design using standard, commercial timing verification tools. The timing verification of circuits in a standard ASIC flow is done using static timing verification. This technique identifies the worst-case timing paths by traversing a graph representation of the circuit instead of simulating all possible scenarios. This allows the verification of large circuits in a relatively short time. As commercially available static timing tools such as Synopsys Primetime [16] are designed for synchronous circuits, it is not possible to use them to verify typical asynchronous designs. The challenge in timing verification for asynchronous circuits is to identify all potentially critical cycles in advance and alert the timing tool that these paths need to be verified. In general this is difficult for existing 67  asynchronous designs, but because the timing cycles in our asynchronous design start with the clock inputs of flip-flops, these paths are well suited to existing timing tools. The Primetime commands used to specify the timing requirements for our interconnect are shown in Figure 4.9. These commands would be added to an SoCs overall timing verification process. Note that in contrast to the delay optimization phase described in the previous section, no ECO scripts are required for timing verification, because the static timing tool has a more advanced understanding of the clocking relationship and is able to manage each clock pin as an independent synchronous clock. To further understand the timing relationships in the circuit, we performed a detailed analysis of the inter-stage timing of one stage of the interconnect network as shown in Figure 4.10. For this analysis we examined the IC design scenario with 64 blocks and a die width of 8560µm. We examined the timing between stage 3 and stage 4. The physical distance between these two stages was approximately 2140µm. We extracted the delay of the cells and wires in the design from the static timing reports generated by Synopsys Primetime (in Figure 4.10 wires that are not annotated have a negligible delay). We also extracted the drive strengths of each cell. Each of these drive strengths reflect the automatic cell sizing and buffer insertion performed by the placement tool (Cadence SoC Encounter) to optimize the delay of the circuit.  Fig. 4.10: Example Circuit with Annotated Timing And Drive Strengths  68  4.6 Synchronous Implementation In this section we describe the implementation of the synchronous interconnect we used as a reference from which to measure the effectiveness of our asynchronous implementation. We first discuss the CAD tool limitations with respect to this implementation and then explain the implementation of the design itself. 4.6.1 CAD Tool / Library Limitations Because the CAD tool flow and standard cell library were intended for synchronous design, there were fewer limitations with respect to implementing the synchronous interconnect as compared to the asynchronous interconnect described in Section 4.5. 
The one significant limitation for this work was that current CAD tools do not support the automatic insertion of sequential elements to optimize wire delays.  This ability does exist in some design  methodologies described in research papers [3, 19, 20], but has not yet been implemented in the CAD tools that we used for these experiments. Therefore pipeline stages were restricted to network nodes. 4.6.2 Design The synchronous design was implemented using standard pipelining techniques. Circuits were designed to tolerate a clock skew of 100 ps. In addition, a clock uncertainty margin of 100 ps was assumed for all synchronous timing measurements to compensate for jitter and other nonideal clock-tree effects. 4.7 Throughput Comparison Without An Inter-Block Clock Tree In this section we compare the throughput for SoC design scenarios where no clock is available for the inter-block interconnect (In Section 4.9 results will be presented for the case where a synchronous pipelining clock is available.) We believe that this restricted case is important for the interconnect network implementation in our post-silicon debug infrastructure. In SoC design scenarios, blocks are often designed and implemented independently and then later integrated together into a single SoC. These blocks already have a clock tree that has been created to satisfy the needs of their internal logic. It is a significant design effort to build a new clock tree for pipeline stages that occur between blocks, especially if it will only be used for 69  debug circuits that may not be used right away. The debug interconnect requirements are further complicated by the fact that the SoC may contain a number of blocks that are designed to operate at different clock frequencies. In order to manage this, a number of independent clock trees would be required for the inter-block pipelining. Therefore in these experiments, due to the absence of a clock, the synchronous implementation has no pipeline stages. In contrast, since the asynchronous implementation does not require a clock, the asynchronous network is pipelined.  Fig. 4.11: Throughput Comparison Without Inter-clock Clock Tree The results are shown in Figure 4.11. For all the IC scenarios, with the exception of the case with the smallest die size and the smallest number of blocks, the asynchronous interconnect runs faster than the non-pipelined synchronous interconnect. Therefore, it is potentially advantageous to use the asynchronous design for most of the scenarios we investigated in order to save the design effort of generating clock trees for the synchronous pipeline registers. It is interesting to note that when compared to previous work done in a 0.18µm process, these new 90nm results show that more of the design scenarios benefit from the asynchronous interconnect [6].  We  believe that this reflects an increase in the ratio of wire-delay/gate-delay in smaller process technologies. Avoiding the construction of a global clock tree is a significant advantage for the particular application that we are considering (connecting a number of IP blocks to a programmable logic core (PLC) used for debugging purposes in an SoC) since each IP block may operate at a 70  different clock frequency. To allow for this in a synchronous interconnect, an independent, userselectable clock would be required at the pipeline flops for each different clock frequency used on the SoC. 
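To make the clock-margin penalty concrete, the following sketch computes the data-path budget left in each synchronous pipeline segment once the 100 ps skew and 100 ps uncertainty margins assumed above are subtracted; flip-flop setup and clk-to-q times are ignored, so these budgets are optimistic upper bounds:

```python
# Usable wire+logic budget per synchronous pipeline segment under the margins
# assumed above (100 ps skew + 100 ps uncertainty). Setup and clk-to-q delays
# are ignored, so the budgets are optimistic upper bounds.
SKEW_PS, UNCERTAINTY_PS = 100, 100

for f_mhz in (500, 600, 700, 800):
    period_ps = 1e6 / f_mhz
    budget_ps = period_ps - SKEW_PS - UNCERTAINTY_PS
    print(f"{f_mhz} MHz: period {period_ps:.0f} ps, wire+logic budget {budget_ps:.0f} ps")
```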
By using handshaking instead of clocks, the asynchronous design avoids placing any constraints on clock skew and jitter, thus simplifying global timing closure. Further, this enables a globally asynchronous, locally synchronous (GALS) implementation strategy [5]. For example, the synchronous results in Figure 4.11 require that there is only 100 ps clock skew and 100 ps clock uncertainty between each of the different blocks on the die. The asynchronous interconnect does not have this requirement. 4.8 CAD Tool Enhancement In this section we address one of the major CAD tool limitations for the asynchronous interconnect case in order to further understand the relative performance between the two interconnect strategies. As we outlined in Section 4.5, current placement tools do not allow for the automatic insertion and placement of asynchronous pipeline stages; we were therefore forced to restrict pipeline stages to the network nodes in our initial results. This is a major limitation for the asynchronous case. Unlike synchronous pipeline stages, the asynchronous pipeline stages do not require a clock and therefore the design effort to include additional asynchronous pipeline stages at arbitrary locations on the die would be minimal if the cell placement tool supported this feature.  Fig. 4.12: Throughput with tool enhancement 71  In order to model the effect of automatic asynchronous pipeline insertion, we created an iterative design flow using existing tools. The design was initially placed and routed with pipelines at the network nodes. Then the timing performance of the resulting design was analyzed. From this analysis, the critical timing path was identified and the physical location of the path was recorded. A new design was then created in Verilog with an additional pipeline stage on the critical timing path, and placement constraints were generated based on the location of the critical path. Next, the design was placed and routed again. This procedure was repeated until either the target throughput was achieved or the timing of the circuit no longer improved. This process was automated using TCL scripts to parse CAD tool outputs and create the new Verilog designs and tool directives. The experiments outlined in Section 4.7 (where no clock is available for inter-block pipelining) were repeated with this new tool enhancement and the results are shown in Figure 4.12. These results are significant for the following two reasons. First, with the CAD tool enhancement, the asynchronous interconnect throughput is insensitive to the die size. Second, the asynchronous throughput for all die sizes and block numbers is larger than the synchronous throughput for the smallest die and the least number of blocks. 4.9 Performance and Cost Comparisons We now remove our assumption that the interconnect for synchronous designs cannot be pipelined. While this pipelining may be impractical for many SoCs designs for the reasons described earlier, removing this assumption allows us to put our asynchronous design up against the best possible alternative and thereby better identify its strengths and weaknesses. In this section, we compare the performance of our new asynchronous interconnect and a pipelined synchronous interconnect in terms of area overhead, timing performance and power dissipation. This is done over a range of IC design scenarios, and target interconnect throughputs.  72  Fig. 
4.13: Cell Area for 700 MHz Throughput Target For each of the nine target ICs described in Section 4.4, we generated synchronous and asychronous interconnect implementations targeting 500 MHz, 600 MHz, 700 MHz and 800 MHz operation. For each case the minimum number of synchronous or asynchronous pipeline stages was used to meet the target clock frequency. The area, power and latency were determined for each implementation. These values were generated using detailed placed and routed designs. The CAD tools enhancement presented in Section 4.8 was used for all asynchronous scenarios. The worst-case library was used for all measurements.  This library represents the slowest  process corners with a 0.9 V supply (10% below the 1.0 V nominal) and a die temperature of 125 °C. As a result, no manufactured devices would be expected to have timing slower than these values. For simplicity, the worst-case library values were also used for power measurements. This allows for a consistent comparison between the two approaches, but does not represent the upper bound of the power consumption.  73  Fig. 4.14: Dynamic Power for 700 MHz Throughput Target The area of both implementations was compared for each of the nine design scenarios and for each of the target throughputs; the results are shown in Table 4.1. The area value reflects the total standard cell area and does not include the wire area. For the synchronous case, the area required for a global clock tree was not included. To further illustrate the general trends, Figure 4.13 presents the area results for the 700MHz operation target. In general, the area for the asynchronous case is about eight times larger than that of the synchronous case. Although the relative area difference is quite large, the difference is not that significant in the context of the overall IC. For example, the extra area of the asynchronous interconnect for an 8560x8560 µm IC with 64 block partitions operating at 600 MHz is approximately 19509 µm 2 per bit. Assuming a 32-bit bus, the area increase would represent only 0.85% of the total area.  74  Fig. 4.15: Latency for 700 MHz Throughput Target The power consumption of both implementations was compared for each of the nine design scenarios, which is also shown in Table 4.1. The results were generated by measuring the power consumption while transmitting a fixed set of 1000 random bits from a single input to the output. For the synchronous case, the power does not include the clock tree distribution power, but does include the power consumed by clocking each individual flip-flop during each clock cycle. To further illustrate the general trends, Figure 4.14 presents power results for the 700MHz operation target. It is interesting to note that the power of the asynchronous case is similar to the synchronous case for some of the scenarios with 256 blocks, but much higher for the cases with 16 and 64 blocks. This reflects important differences between the two implementations in terms of power consumption. For a given data transfer between pipeline stages the asynchronous interconnect will consume more power than the synchronous interconnect for two reasons. First, each new data value will cause one bit of the two-bit encoding to change, whereas when a new data value is transmitted in the synchronous case that is the same as the previous value it will not cause a transition (i.e., a ‘0’ followed by another ‘0’ will not cause a new transition). 
Second, each transition of the asynchronous pipeline stage will cause three flops to be clocked rather than one in the synchronous case. While these two effects will increase the power consumption for a given data transfer, the asynchronous interconnect may save power for the overall network since only the pipeline stages that are currently transferring data are clocked; in contrast, in the synchronous case every flop in the network is clocked on every clock cycle. 75  Table 4.1 Area/Power/Latency/Pipeline Comparison Results Scenario (blocks – die width)  500 MHz async  16 - 3830 16 - 8560 16 - 12090 64 - 3830 64 - 8560 64 - 12090 256 - 3830 256 - 8560 256 - 12090  4297.1 5002.9 5851.3 16134.7 18924.8 21404.3 63928.6 70160.8 76341.4  590.5 749.7 1099.8 1692.5 2250.1 3321.3 7667.8 10248.3 12821.1  5174.1 5526.4 8182.6 18344.2 22234.1 27341.2 70551.6 80632.0 90268.8  16 - 3830 16 - 8560 16 - 12090 64 - 3830 64 - 8560 64 - 12090 256 - 3830 256 - 8560 256 - 12090  0.070 0.146 0.179 0.088 0.167 0.205 0.106 0.192 0.226  0.016 0.027 0.037 0.039 0.054 0.063 0.129 0.148 0.166  0.090 0.151 0.207 0.118 0.182 0.250 0.140 0.205 0.260  0.016 0.081 0.029 0.173 0.038 0.203 0.040 0.116 0.055 0.209 0.069 0.250 0.130 0.138 0.157 0.238 0.167 0.287 Latency (ns)  16 - 3830 16 - 8560 16 - 12090 64 - 3830 64 - 8560 64 - 12090 256 - 3830 256 - 8560 256 - 12090  2.84 2.99 3.34 4.07 4.44 4.78 5.38 5.66 6.14  2.00 2.00 4.00 2.00 4.00 4.00 2.00 4.00 6.00  2.47 2.85 4.10 3.56 4.07 5.34 4.69 5.23 6.56  1.67 2.40 3.33 3.72 3.33 4.09 1.67 3.47 3.33 4.85 5.00 5.21 1.67 4.55 3.33 5.98 3.33 6.37 Pipeline Stages  16 - 3830 16 - 8560 16 - 12090 64 - 3830 64 - 8560 64 - 12090 256 - 3830 256 - 8560 256 - 12090  2 2 2 3 3 3 4 4 4  0 0 1 0 1 1 0 1 2  2 2 3 3 3 4 4 4 5  sync  600 Mhz async  700 MHz  sync  async  800 MHz  sync  async  sync  5609.8 12713.5 13675.0 18361.8 28118.3 33914.7 74550.1 88299.7 98354.8  908.8 1153.6 1411.5 1878.0 2941.6 7010.4 8188.1 15168.8 18008.3  0.017 0.029 0.038 0.039 0.056 0.071 0.137 0.157 0.168  0.109 0.216 0.269 0.118 0.269 0.325 0.145 0.304 0.356  0.018 0.035 0.043 0.044 0.082 0.091 0.137 0.259 0.269  1.43 2.86 2.86 1.43 2.86 4.29 2.86 4.29 4.29  3.23 6.28 6.54 3.43 7.42 7.61 4.47 8.48 8.75  1.25 3.75 3.75 2.50 5.00 5.00 2.50 6.25 6.25  0 1 1 0 1 2 1 2 2  3 4 4 3 5 5 4 6 6  0 2 2 1 3 3 1 4 4  Area (µm2) 621.2 4978.7 818.8 864.9 7014.8 868.2 1137.1 8442.7 1154.7 1806.7 18677.9 2075.6 2725.3 23183.5 2794.5 3095.2 26784.7 3804.3 7997.1 72376.8 7780.9 10470.0 81945.7 10304.3 12665.2 89738.7 14634.3 Dynamic Power (mW)  0 1 1 0 1 2 0 2 2  2 3 3 3 4 4 4 5 5  The latency and number of pipeline stages of both implementations were compared for each of the nine design scenarios and the results are shown in Table 4.1. Figure 4.15 presents latency results for the 700MHz operation target to further illustrate the general trends. The latency of the asynchronous interconnect is, in general, larger than the synchronous case. This latency is a 76  direct result of the delay of the clock generation circuit and the ‘clk-q’ delay of the flip-flops. We note that the relative latency overhead of the asynchronous design decreases for large die and large numbers of blocks. Thus, this overhead becomes fairly small for the designs where the throughput and simple timing of the asynchronous approach are most attractive. We anticipate that adding a small number of cells to the standard cell library could significantly reduce the latency. This is discussed as possible future work in Section 4.10. 
4.10 Chapter Summary

In this chapter we have demonstrated that an asynchronous interconnect is viable in a 90 nm process using a standard cell library and commercially-available CAD tools. We have shown that a 2-phase, dual-rail circuit can be implemented using normal flip-flops as sequential elements to enable delay optimization with standard commercial CAD tools. Using our implementation, we demonstrated that there is a portion of the IC interconnect design space wherein an asynchronous interconnect provides an advantage over a synchronous interconnect for our debug infrastructure by eliminating the need to build complicated clock trees. This interconnect has the added advantages that it tolerates inter-block clock skew and easily operates with debug circuits at different frequencies. We have also shown that with a relatively simple enhancement to current CAD tools, we could significantly expand the design space where asynchronous interconnect provides an advantage and ensure that the throughput of the interconnect is not tied to the die size. Finally, we have quantified the difference in area, power, and latency between our synchronous and asynchronous interconnect implementations. Although the asynchronous design had a much higher area, and in many cases a higher power, this difference is not significant in relation to the total IC design. We believe that this new asynchronous interconnect addresses a key challenge in our post-silicon debug infrastructure because it would allow our infrastructure to be inserted into the SoC without the need to build a global clock tree and it naturally handles different clock frequencies.

In completing this work, we have been able to identify a number of areas for future work that would involve minor changes to the ASIC flow which aim to improve the performance and reduce the cost of our asynchronous implementation. These changes could be made without altering the basic design flow. The simplest change would be to create a dedicated standard cell for the three-input function of the clock generation circuit. As there is already a significant number of three-input gates in the cell library the new function would not present a significant alteration of the library. This change could be taken a step further by creating a library cell that combines the clock generation circuit with the flip-flop. The circuit would take the two wires of the dual-rail data as input along with the acknowledgement from the subsequent stage. As a result, it would output the latched dual-rail data and an acknowledgement signal for its predecessor.

Chapter 4 References

[1] R. Ho, K. Mai, M. Horowitz, "The future of wires", Proceedings of the IEEE, vol. 89, no. 4, pp. 490-504, April 2001.
[2] L.P.P.P. van Ginneken, "Buffer Placement in Distributed RC-Tree Networks for Minimal Elmore Delay", Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 865-868, May 1990.
[3] P. Cocchini, "Concurrent Flip-Flop and Repeater Insertion for High Performance Integrated Circuits", Proceedings of the IEEE International Conference on Computer-Aided Design, pp. 268-273, November 2002.
[4] W.J. Bainbridge, S.B. Furber, "Delay insensitive system-on-chip interconnect using 1-of-4 data encoding", Proceedings of the International Symposium on Asynchronous Circuits and Systems, pp. 118-126, March 2001.
[5] A. Lines, "Asynchronous Interconnect For Synchronous SoC Design", IEEE Micro, vol. 24, no. 1, pp. 32-41, January/February 2004.
[6] B.R. Quinton, M.R. Greenstreet, S.J.E. Wilton, "Asynchronous IC Interconnect Network Design and Implementation Using a Standard ASIC Flow", Proceedings of the IEEE International Conference on Computer Design, pp. 267-274, October 2005.
[7] C. Lin, H. Zhou, "Retiming for Wire Pipelining in System-On-Chip", Proceedings of the IEEE International Conference on Computer-Aided Design, pp. 215-220, November 2003.
[8] D.K.Y. Tong, E.F.Y. Young, "Performance-driven register insertion in placement", Proceedings of the International Symposium on Physical Design, pp. 53-60, 2004.
[9] J. Sparso, S. Furber, Principles of Asynchronous Circuit Design – A System Perspective, Kluwer Academic Publishers, chapter 2.2, p. 14f, 2001.
[10] W.J. Bainbridge, W.B. Toms, et al., "Delay-Insensitive, Point-to-Point Interconnect using m-of-n Codes", Proceedings of the International Symposium on Asynchronous Circuits and Systems, pp. 132-140, May 2003.
[11] M.E. Dean, T.E. Williams, et al., "Efficient Self-timing with Level-Encoded Two-Phase Dual-Rail (LEDR)," Proceedings of the 1991 University of California/Santa Cruz Advanced Research in VLSI Conference, MIT Press, pp. 55-70, March 1991.
[12] Silistix, Inc., http://www.silistix.com.
[13] K.S. Stevens, "Energy and Performance Models for Clocked and Asynchronous Communication", Proceedings of the International Symposium on Asynchronous Circuits and Systems, pp. 56-66, May 2003.
[14] B.R. Quinton, S.J.E. Wilton, "Post-Silicon Debug Using Programmable Logic Cores", Proceedings of the IEEE International Conference on Field-Programmable Technology, pp. 241-247, December 2005.
[15] S. Peng, R. Manohar, et al., "Automated Synthesis for Asynchronous FPGAs", Proceedings of the ACM International Conference on Field Programmable Gate Arrays, pp. 163-173, February 2005.
[16] Synopsys Inc., http://www.synopsys.com.
[17] STMicroelectronics, "CORE90 GP SVT 1.00V", Databook, October 2004.
[18] Cadence Design Systems, http://www.cadence.com.
[19] R. Lu, G. Zhong, C.K. Koh, J.Y. Chao, "Flip-flop and Repeater Insertion for Early Interconnect Planning", Proceedings of Design, Automation, and Test in Europe, pp. 690-695, March 2002.
[20] L.P. Carloni, K.L. McMillan, A.L. Sangiovanni-Vincentelli, "Theory of Latency-Insensitive Design", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 20, no. 9, pp. 1059-1076, September 2001.
[21] M. Greenstreet, "Implementing a STARI Chip", Proceedings of the IEEE International Conference on Computer Design, pp. 38-43, October 1995.
[22] T. Chelcea, S.M. Nowick, "Robust Interfaces for Mixed-Timing Systems", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 8, pp. 857-873, August 2004.
[23] L.P. Carloni, A.L. Sangiovanni-Vincentelli, "Performance Analysis and Optimization of Latency-Insensitive Systems", Proceedings of the Design Automation Conference, pp. 361-367, June 2000.

Chapter 5 Programmable Logic Interface1

5.1 Introduction

Circuits implemented in programmable logic typically operate three to four times slower than equivalent circuits implemented in fixed function logic [1]. This is a key challenge for our post-silicon debug infrastructure since, to be effective, the debug circuits, which are implemented in programmable logic, must interact with fixed function circuits operating at their normal frequency.
To address this issue, in this chapter we propose changes to the PLC architecture that increase the performance of interface circuits implemented in the PLC such that interaction with full-speed fixed-function circuits is possible.

The remainder of this chapter is structured as follows. Section 5.2 will provide the motivation for our approach, Section 5.3 will describe related work in this area, and Section 5.4 will outline the SoC framework we will target for our debug infrastructure. Section 5.5 will establish the PLC architecture framework for our debug logic. Based on our SoC framework, we will divide our target interfaces into system bus interfaces and direct synchronous interfaces. Sections 5.6 and 5.7 will focus on the design requirements, detailed implementation and experimental results related to these two interface types. Finally, Section 5.8 will summarize the chapter and discuss future work.

5.2 Motivation

The inherent difference in achievable clock frequency between programmable and fixed-function logic forces the SoC integrator to either reduce the clock rate of the fixed-function logic or design rate-adaptive circuits in order to manage the interface between these two types of logic. In the case of post-silicon debug, reducing the clock rate of the fixed-function logic is not desirable because we want to be able to debug the normal, at-speed operation of the SoC. However, the design of rate-adaptive interface circuits for this scenario is a challenge.

1 A version of this chapter has been published as:
- B.R. Quinton, S.J.E. Wilton, "Embedded Programmable Logic Core Enhancements for System Bus Interfaces", Proceedings of the International Conference on Field-Programmable Logic and Applications, Amsterdam, pp. 202-209, August 2007. (Reprinted with permission)
- B.R. Quinton, S.J.E. Wilton, "Programmable Logic Core Enhancements for High Speed On-Chip Interfaces", accepted for publication in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, May 2008.
Other work has discussed the general problem of PLC integration without focusing on interface timing issues [8,9]. There have been some descriptions of specific implementations of bus interfaces to PLCs. Winegarden described a proprietary bus to facilitate communications between a processor and user-configurable logic [10]. The bus protocol was specified and the implementation was carefully pipelined in order to ensure high throughput and predictable latency.  The  improvements to area efficiency and timing achieved by this integration were not quantitatively described.  While these results provide an indication of the benefits of designing the  programmable logic fabric specifically for a given system bus, the results are not directly applicable to the problem addressed in this chapter because, in general, the system bus will be 82  defined without regard to the PLC. Lertora and Borgatti [11] described an SoC containing three PLCs and a microprocessor. Inter-block communication was handled by a Network-On-Chip (NoC) although control and configuration of the PLCs and other peripherals was performed using the AMBA AHB bus. The bus interface for the PLC was built using fixed logic. However, the costs and trade-offs of this decision were not detailed. Work by Madrenas [12] described a Field-Programmable System on Chip (FIPSOC) that made use of a PLC on a proprietary shared bus. Again, the details of the PLC/bus interface were not given.  Fig. 5.1: Example SoC The use of FIFOs to move data between different clock domains is a common technique. There are number of proposed designs for how to do this [13]. However, we know of no other academic work that is similar to our proposal and describes programmable FIFOs that support different incoming and outgoing data width ratios depending on the relative clock frequencies. Industrially, a patent held by Integrated Device Technology describes a programmable FIFO that enables serial-to-parallel and parallel-to-serial conversion with a programmable parallel word length [14]. While somewhat similar to our proposal, this FIFO does not allow for the finegrained conversion that we support in our FIFO, nor is it architected to fit into a programmable 83  logic fabric. Furthermore, a patent held by Cypress Semiconductor describes the architecture of an input/output (I/O) cell for a stand-alone FPGA that supports parallel-to-serial and serial-toparallel conversion with the intention of saving logic in the FPGA for protocols that require serialization [15]. Again, this serialization structure does not have the granularity of our proposal. 5.4 SoC Framework In this section we describe the SoC framework that we are targeting for our proposed PLC modifications. As shown in Figure 5.1, SoCs are characterized by the integration of a number of distinct IP blocks on a single die [16]. We use this framework to define the interface requirements of our PLC in a debug scenario. SoC interfaces can be classified in two groups: 1) system bus interfaces, and 2) direct synchronous interfaces. Most IP blocks interface to one or more system busses. Examples of industry-standard busses include AMBA, Wishbone and CoreConnect [17,18,19,20]. A bus master (often firmware running on an embedded processor core) can use the system bus to control, configure and determine the status of slave IP blocks. Data transactions may also occur on these busses; on multi-master busses, multiple IP blocks (not just the supervisory processor core) may use the bus for inter-block communication. 
In general, however, SoC system busses do not handle all inter-block communications. Instead, tightly coupled blocks, particularly datapath-oriented blocks, use application-specific synchronous interfaces to communicate. We refer to these interfaces as direct synchronous interfaces. They are often ad-hoc or proprietary, but common examples of industry standard interfaces of this type include the PIPE standard, which defines the interface between the PCS and PHY layers in PCI Express [21], or the GMII interface at the PHY layer in Gigabit Ethernet [22].

Our goal is to ensure that the debug PLC is able to connect to any SoC block. This will maximize the debug potential of the PLC, by allowing the role of the programmable logic to be defined post-fabrication. To accomplish this, the interfaces implemented in the PLC must have the timing performance to ensure that the SoC can operate at its normal clock frequency.

In the following sections we will address system bus interfaces and direct synchronous interfaces separately to highlight the difference in their requirements, implementations and results. However, the PLC architecture enhancements proposed for each type are compatible and intended to co-exist in the same physical PLC.

Fig. 5.2: Shadow Cluster Modified CLBs

5.5 PLC Architecture Framework

In this section we outline the PLC architectural framework we will use as the baseline for our enhancements. We first describe the structure of our target PLC architectures and then detail our approach to modifying them. All of our new programmable structures were integrated at the configurable logic block (CLB) level of the PLC. Each circuit was partitioned into subcircuits and each subcircuit was embedded in the modified CLBs. As shown in Figure 5.2, each modified CLB contained the regular CLB logic in addition to the new circuitry. Using the shadow cluster approach from [23], when the new circuits are not required, they can be disabled and the CLB can be used normally. The detailed implementations of these new circuits will be provided in Sections 5.6 and 5.7.

Our proposal can be applied to any island-style, LUT-based PLC architecture. To maintain the regular tile structure of the PLC, we assume all modified CLBs are aligned in continuous columns in the fabric, as shown in Figure 5.3. This allows the columns with modified CLBs to increase in width to accommodate new logic, while ensuring that the standard CLB tiles maintain the normal size and dimensions. Because of the column-based configuration of our proposal, the amount of interface logic added to the PLC depends on the number of rows in the PLC. However, if a single column does not provide sufficient resources another column of modified CLBs can be added.

Fig. 5.3: Modified PLC Architecture

The inputs and outputs of the new logic in the modified CLB are connected directly to the regular programmable routing fabric of the PLC. This adds important flexibility to our proposal since the SoC interfaces can be connected directly to the regular I/O of the PLC and then routed to the inputs and outputs of the modified CLBs as needed. This not only allows the flexibility of mapping a different set of signals to specific modified CLBs in accord with the intended application of the PLC, but it also provides another degree of flexibility by allowing signals to be manipulated with the regular programmable logic, before connecting to our new CLBs.
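As a very small illustration of the shadow-cluster idea used above, each output of a modified CLB simply selects between the ordinary LUT cluster and the embedded interface circuit. The sketch below is our own behavioural view (the function and signal names are not from the thesis); in the actual architecture the selection is a configuration-controlled multiplexer on each input and output of the modified CLB.

```python
# Shadow-cluster selection: when the embedded interface circuit is not used,
# the modified CLB behaves exactly like a regular CLB.
def modified_clb_output(lut_cluster_out, interface_circuit_out, use_interface):
    """use_interface is a static configuration bit set when the PLC is programmed."""
    return interface_circuit_out if use_interface else lut_cluster_out
```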
5.6 System Bus Interfaces

In this section we present modifications to the PLC architecture that address system bus interfaces. These changes improve the interface timing while remaining flexible enough to support a variety of system bus protocols. We begin with an overview of system busses. We then highlight the timing challenges when implementing these interfaces using PLCs. Following this, we describe our implementation in detail. Finally, we present an experimental evaluation showing the timing improvement achieved with this new architecture as well as the effects on area overhead, routing congestion and CLB usage.

5.6.1 System Bus Overview

SoC system busses integrate one or more embedded processors, along with a number of fixed-function IP blocks on a single die, as shown in Figure 5.1. The communication between the software (or firmware) running on the embedded processor and the fixed-function IP is performed using the system bus. Each participant on the system bus is designated as a master or slave; masters, such as the embedded processor, initiate transactions on the bus, while slaves respond to requests initiated by the masters. Slaves typically provide a memory-mapped interface through a set of control registers, data registers, and status registers. In this chapter, we will refer to these registers as interface registers. These interface registers are mapped to a region of the software memory space, and masters can access the registers through 'load' and 'store' instructions. The slave bus interface works by translating the master requests, which appear on the bus, into interface register accesses.

In our target SoC integration framework, the debug PLC participates on the system bus as a slave in order to pass debug information to the processor core, monitor bus transactions, or perform other debug tasks. To accomplish this, a portion of the system bus address space is reserved for potential use by debug circuits implemented in the PLC logic. Each different debug circuit implemented in the PLC can provide a different set of interface registers to complement its specific functionality.

5.6.2 System Bus Interface Timing Requirements

To ensure that the performance of the overall system will not deteriorate when the debug PLC is used, it is important to enable high-speed bus implementations. The paths through the PLC will likely be the critical paths in a system bus implemented with both fixed and programmable logic. Since most SoC system busses are synchronous, this critical path will dictate the maximum operating frequency. Slave interfaces must perform address and byte-enable decoding, as well as output multiplexing of the interface registers. These operations are much slower in programmable logic. For example, in our experiments using 0.18 µm technology, an AMBA APB slave interface implemented in generic programmable logic had critical paths as high as 12 ns, whereas the same interface implemented with fixed-function standard cells had critical paths of approximately 1 ns. This can be a major limitation to the usefulness of the PLC, since, in most cases, the maximum operating frequency will determine the maximum throughput of the bus.
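To make the slave-side behaviour concrete, the following is a simplified, illustrative Python model of the operations a slave bus interface performs on each access: address decoding, byte-enable handling, and output multiplexing of the interface registers. The register map and names here are hypothetical; in the actual infrastructure this logic is implemented in hardware, and it is exactly the address decode and output multiplexing that we propose to harden in the modified CLBs.

```python
# Illustrative software model of a memory-mapped slave bus interface.
# The base address and register count are hypothetical examples.

class SlaveBusInterface:
    def __init__(self, base_addr, num_regs):
        self.base_addr = base_addr
        self.regs = [0] * num_regs            # 32-bit interface registers

    def _decode(self, addr):
        """Address decode: return the register index, or None if not selected."""
        offset = addr - self.base_addr
        if 0 <= offset < 4 * len(self.regs) and offset % 4 == 0:
            return offset // 4
        return None

    def write(self, addr, data, byte_enables=0b1111):
        """Master write with per-byte enables (dword bus, byte granularity)."""
        idx = self._decode(addr)
        if idx is None:
            return
        for b in range(4):
            if byte_enables & (1 << b):
                mask = 0xFF << (8 * b)
                self.regs[idx] = (self.regs[idx] & ~mask) | (data & mask)

    def read(self, addr):
        """Master read: output multiplexing of the selected register."""
        idx = self._decode(addr)
        return self.regs[idx] if idx is not None else 0


# Example usage: write the low byte of the second register, then read it back.
slave = SlaveBusInterface(base_addr=0x40000000, num_regs=8)
slave.write(0x40000004, 0x000000FF, byte_enables=0b0001)
assert slave.read(0x40000004) == 0xFF
```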
5.6.3 Bus Interface Functional Requirements

Since there are many common system bus protocols, our interface circuitry must be flexible enough to improve the timing and implementation efficiency of all of them. We use the following set of bus protocols for our analysis of functional requirements: AMBA APB, AMBA AHB, CoreConnect OPB, CoreConnect DRC, Wishbone Classic, and Wishbone Registered Feedback [17,18,19,20]. We believe that ensuring compatibility with this large set of protocols will in turn ensure that our PLC modifications are useful for other similar bus protocols. All of the system bus protocols that we have analyzed have similarities, but they differ in terms of timing and data transfer policies. We take advantage of the similarities to create new circuits, integrated into the PLC fabric, that improve interface timing while retaining enough programmability to implement the differences in the protocols. In the remainder of this section we highlight some of the specific requirements for this programmability.

Different bus interfaces have different protocol timing requirements. For instance, AMBA APB requires control information and data to be valid on the same clock cycle, whereas AMBA AHB requires the control information to be valid one clock cycle before the data. In addition, some protocols provide means to allow for variable read and write delays using an acknowledgment signal. To support this, our circuit must accommodate configurable clock cycle delays for all control, data and acknowledgment signals.

Many system busses support transfers at a data granularity that is smaller than the maximum single-transfer data size. For instance, a system bus may support dword (32-bit) transfers with byte (8-bit) granularity. There are a number of ways to indicate which of the data bytes should be involved in a data transfer. For instance, CoreConnect OPB provides dword-aligned addresses with a corresponding set of byte enables, whereas AMBA AHB provides byte-aligned addressing with byte, halfword, and dword size indications. To support this, our circuit must support both byte enables and size indications, and also support a configurable address interpretation.

Table 5.1: Interface Register Bit Types
Type | Description          | Switch Config. (a, b, c)
RW   | Read/Write by Master | 001
RO   | Read Only by Master  | 110
W1C  | Write 1 to Clear     | 101
RWS  | Read/Write Sticky    | 011
IND  | Indirect             | 010

The behaviour of individual interface register bits is usually not defined in the system bus specification. Instead they are usually design-specific. However, a survey of design standards and IP blocks demonstrates that there is a common set of behaviours that will allow for the implementation of most required designs. These basic bit behaviours, all of which our design must support, are summarized in Table 5.1.

The RW bit type is a register bit that can be written and read by a bus master. The value written to the register is always reflected when it is read. These bit types are normally used to configure the design or provide data to a design. The RO bit type is a register that only supports read operations by the bus master. Usually, this type is used to reflect the status of a design or an output data value. The W1C bit type is a register that is used to implement interrupt status registers. It supports normal master read operations; however, it only supports master writes of value '1'. These master writes clear the register to '0'. The register value is set to '1' based on an event internal to the design. The RWS bit type supports a handshaking operation between the master and slave. The register bit supports master writes; however, the read value of the register is controlled by the internals of the design.
Therefore, a master can write a value and then poll the read value to ensure that it has transitioned before it proceeds with other configuration. This behaviour allows the slave design to control the timing of events. The IND bit type supports the triggering of immediate actions in the design. Only master writes of '1' are supported for this type. Each master write operation triggers a new event in the design. This bit type does not support read operations.

Some system busses, such as AMBA AHB and CoreConnect OPB, support data bursting, which is used to increase the bus throughput for data transfers involving contiguous address locations in a single slave. Our interface must also support these types of transactions.

Fig. 5.4: Slave Bus Interface

5.6.4 Modified CLB Implementation

To create our modified CLB structures, we target the interface control and interface register portions of the system bus interface, as shown in Figure 5.4. By 'hardening' these aspects of the interface we can move the critical address decode and output multiplexing logic to fast, design-specific logic. These new circuits are partitioned such that they match the basic CLB structure in terms of the numbers of I/O and the area. To do this, we create two types of modified CLBs, as shown in Figure 5.5.

Fig. 5.5: Modified CLB Implementation

In Figure 5.5, the lower five CLBs implement the interface control portion of the programmable bus interface. These CLBs contain the control and data interfaces as well as the address decode and control logic. The I/O pins of these CLBs are connected to the regular programmable routing fabric of the PLC; connections to the system bus can be made through this fabric. The remaining CLB locations in Figure 5.5 implement the interface register portion of the bus interface. Each CLB contains eight configurable register bits that represent a single byte in the address map of the bus slave. The system bus access to these bits is controlled by signals generated in the interface control circuits. These control signals (write_en, read_en, data_in, and data_out) are hardwired from the interface control circuitry and each interface register bit has an input and output that connects to the programmable routing fabric to allow connections to the rest of the design implemented in the PLC.

As described earlier, the inputs and outputs of the circuitry embedded into each CLB are multiplexed with the inputs and outputs of the 4x4-LUT structure (to create a shadow cluster) and then connected to the regular routing resources within the PLC. For both types of modified CLBs, the number of inputs to the embedded circuitry is the same as the number of inputs to the original 4x4-LUT structure. However, the number of outputs from the modified CLBs is eight, whereas a regular CLB has only four outputs. To accommodate this, we make minor changes to the routing fabric. This is discussed further in Section 5.6.5.

Fig. 5.6: Register-Type CLB

Fig. 5.7: Configurable Interface Register Bit

Each of the CLBs that contain interface registers has eight configurable bits, as shown in Figure 5.6. In order to ensure that the circuit can implement any required slave interface register, each register bit can be configured to any of the types in Table 5.1. The details of this implementation are shown in Figure 5.7. The required configuration of each of the switches (labelled a, b, and c in the diagram) is shown in column 3 of Table 5.1.
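As an illustration of the bit-type semantics in Table 5.1, the following is a minimal behavioural model in Python. It is our own sketch, not a description of the switch-based circuit in Figure 5.7; the method names and structure are assumptions made only for illustration.

```python
# Behavioural sketch of the configurable interface register bit types.
class InterfaceRegisterBit:
    def __init__(self, bit_type):
        assert bit_type in ("RW", "RO", "W1C", "RWS", "IND")
        self.bit_type = bit_type
        self.value = 0          # value returned on a master read
        self.written = 0        # last master-written value (RWS handshaking)
        self.event = False      # one-shot trigger seen by the design (IND)

    def master_write(self, bit):
        self.event = False
        if self.bit_type == "RW":
            self.value = bit
        elif self.bit_type == "W1C" and bit == 1:
            self.value = 0                    # write-1-to-clear
        elif self.bit_type == "RWS":
            self.written = bit                # design controls the read value
        elif self.bit_type == "IND" and bit == 1:
            self.event = True                 # trigger an immediate action
        # RO: master writes are ignored

    def master_read(self):
        return None if self.bit_type == "IND" else self.value

    # Design-side hook: drive RO status bits, set W1C interrupt events,
    # and acknowledge RWS handshakes.
    def design_set(self, bit):
        if self.bit_type in ("RO", "RWS"):
            self.value = bit
        elif self.bit_type == "W1C" and bit == 1:
            self.value = 1
```

For example, a master using an RWS bit would call master_write(1) and then poll master_read() until the design acknowledges by calling design_set(1), mirroring the handshaking behaviour described above.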
The correct operation of the circuit configured in each target bit type was verified by simulating a Verilog RTL description of the circuit.

5.6.5 Experimental Evaluation

In this section, we measure the area overhead and the performance of our proposed design, and also determine how efficiently our new architecture can implement circuits containing bus interfaces. In the following results, the timing performance of the individual modified CLBs was extracted using static timing based on a 0.18 µm standard cell library. All other timing and area results were reported by VPR using 0.18 µm technology parameters [24].

1) Base PLC architecture

We assume an architecture from [24]. In this architecture each configurable logic block (CLB) contains four 4-input LUTs, and has ten inputs and four outputs. The configurable logic blocks are arranged in a grid and surrounded by routing channels. The routing channels are composed of length-1 segments. The CLB inputs are connected to the routing channel with a connection factor, Fc, of 0.5 (meaning each CLB input pin can be selected from half the tracks in the neighbouring channel) and the outputs are connected to the routing channel with a connection factor, Fc, of 0.25 (meaning each CLB output pin can drive up to 25% of the tracks in the neighbouring channel). The switch blocks use the 'subset' pattern described in [24]. All switches are buffered.

Fig. 5.8: Modified Connection Blocks

As described in Section 5.6.4, each modified CLB has 8 outputs, whereas the original CLB has only 4 outputs. To accommodate this, additional switches have been added to the connection blocks in the channels on all four sides of the modified CLBs, as shown in Figure 5.8. In the following sections we will show that this increase in the size of the connection block has only a small impact on the overall area of the PLC.

2) Benchmark Circuits

Because there was no large set of existing benchmarks that use system bus interfaces, we created a new set of benchmarks by combining the 20 largest MCNC benchmarks [24] with bus interface logic. This was done for each of the 20 MCNC benchmarks and for each of the six bus interface standards described earlier. In the new benchmarks, a register bit in the bus interface drives each of the inputs of the MCNC designs; the value of each output would be determined by a read operation on the bus interface. Some bus interface register bits were configured to be interrupt-style bits. Each new benchmark circuit was named to reflect its base MCNC circuit and its bus interface standard (i.e., 'alu4_apb' for the 'alu4' circuit with an AMBA APB bus interface). Due to runtime constraints, we only present the results for the 20 benchmarks that use the AMBA APB bus interface. We believe that this is sufficient because the proposed architecture will function correctly for the other interfaces and we expect the place and route results to be similar for other bus interface standards.

3) Area Overhead

The proposed CLB changes increase the number of transistors required to implement the PLC and therefore increase the physical area. Tiles containing the modified CLBs required more area than a regular tile because of the extra transistors required to implement the additional interface circuitry as well as the extra switches required in the adjacent connection blocks to connect to the extra CLB outputs. We used the technique of transistor enumeration described in [24] to estimate the area required by the additional circuitry in each CLB.
In [24], it was determined that 1678 minimum-width transistor equivalents were required for the baseline 4x4-LUT CLB architecture. Using the same technique, we created transistor count estimates and found that the modified CLB will require between 504 and 570 more minimum-width transistor equivalents of area than a basic CLB. This overhead, about 33%, is quite modest since the tile area is dominated by interconnect-related transistors and the CLB logic occupies only a minor portion of the overall tile area.

The other aspect of area overhead is the transistor increase for the routing resources caused by the increase in the number of CLB outputs. This value depends on the number of normal and modified CLBs in the PLC, and on the channel width, W. Since each benchmark requires a different number of CLBs and a different channel width, we extracted the size and channel width required for each benchmark using HHVPR, a modified version of VPR that handles heterogeneous architectures [24]. We then determined the number of transistors required for the routing resources of each PLC configuration using the count provided by HHVPR. We combined these results with the CLB transistor count values from above to produce the total transistor requirements shown in Table 5.2. As demonstrated, the area overhead for the proposed architecture is quite small, with an average transistor count increase of less than 0.5%. Since each modified CLB can also implement a standard 4x4-LUT, this overhead number represents the area overhead for designs that do not make use of the new circuits. For designs that make use of the new circuits the number of required CLBs will be reduced (see Subsection 4) and there will be an overall area savings.

Table 5.2: System Bus Area Overhead
Benchmark | Size | W | Base Arch. (10^6 transistors) | New Arch. (10^6 transistors) | Incr. (%)
alu4_apb | 21x21 | 28 | 4.34 | 4.36 | 0.43
apex2_apb | 23x23 | 34 | 6.14 | 6.16 | 0.36
apex4_apb | 19x19 | 39 | 4.72 | 4.74 | 0.42
bigkey_apb | 25x25 | 26 | 5.81 | 5.85 | 0.78
clma_apb | 48x48 | 43 | 32.2 | 32.4 | 0.33
des_apb | 25x25 | 30 | 6.51 | 6.55 | 0.73
diffeq_apb | 20x20 | 29 | 4.09 | 4.10 | 0.24
dsip_apb | 23x23 | 28 | 5.20 | 5.25 | 0.79
elliptic_apb | 31x31 | 42 | 13.3 | 13.3 | 0.25
ex1010_apb | 36x36 | 35 | 15.3 | 15.3 | 0.24
ex5p_apb | 18x18 | 34 | 3.78 | 3.80 | 0.47
frisc_apb | 31x31 | 41 | 13.0 | 13.0 | 0.26
misex3_apb | 20x20 | 31 | 4.29 | 4.31 | 0.44
pdc_apb | 35x35 | 53 | 20.6 | 20.6 | 0.21
s298_apb | 23x23 | 24 | 4.61 | 4.63 | 0.43
s38417_apb | 41x41 | 29 | 17.0 | 17.0 | 0.23
s38584.1_apb | 42x42 | 34 | 20.3 | 20.4 | 0.41
seq_apb | 22x22 | 37 | 6.04 | 6.06 | 0.38
spla_apb | 32x32 | 47 | 15.5 | 15.6 | 0.23
tseng_apb | 18x18 | 25 | 2.94 | 2.97 | 1.09
avg: | | | | | 0.44

4) CLB Usage and Congestion

To determine the effect on logic density of the proposed changes, we measured the number of CLBs (regular and modified) used to implement the benchmarks for the base architecture and our new architecture. The results were generated by performing a detailed place and route with HHVPR. The results are shown in columns 3, 4 and 5 of Table 5.3. The results show a CLB decrease of 7.9% on average. This area savings reflects the fact that a single modified CLB implements more logic for bus interfaces than is possible using a standard 4x4-LUT. To determine the effect of the proposed changes on routing congestion, we measured the minimum required channel width to route each of the new benchmark circuits. The results demonstrate significant improvements in congestion, as shown in columns 6, 7 and 8 of Table 5.3. The average reduction in the channel width is 28.8%.
We believe this decrease in routing width is primarily due to the fact that the new architecture 'hides' much of the data bus routing and multiplexing. It is salient to note that this decrease in channel width occurs even though the CLBs and CLB pins of the new interface circuits are considered unique and fixed by the placer and router, which removed some of the routing flexibility.

Table 5.3: System Bus Interface Place and Route Results
Benchmark | Reg Bits | CLBs (Base / New / Decr. %) | Channel Width W (Base / New / Decr. %) | Critical Path ns (Base / New / Decr. %)
alu4_apb | 20 | 427 / 398 / 6.8 | 43 / 28 / 34.9 | 5.30 / 3.65 / 31.1
apex2_apb | 41 | 536 / 497 / 7.3 | 55 / 34 / 38.2 | 3.98 / 3.87 / 2.8
apex4_apb | 26 | 336 / 337 / -0.3 | 33 / 39 / -18.2 | 3.98 / 4.05 / -1.8
bigkey_apb | 398 | 648 / 598 / 7.7 | 51 / 26 / 49.0 | 9.75 / 4.18 / 57.1
clma_apb | 130 | 2445 / 2131 / 12.8 | 94 / 43 / 54.3 | 9.82 / 3.83 / 61.0
des_apb | 459 | 627 / 573 / 8.6 | 50 / 30 / 40.0 | 9.43 / 4.14 / 56.1
diffeq_apb | 96 | 452 / 393 / 13.1 | 49 / 29 / 40.8 | 6.47 / 4.31 / 33.4
dsip_apb | 398 | 507 / 388 / 23.5 | 45 / 28 / 37.8 | 9.45 / 4.19 / 55.7
elliptic_apb | 227 | 992 / 933 / 5.9 | 40 / 42 / -5.0 | 8.29 / 4.17 / 49.7
ex1010_apb | 18 | 1210 / 1210 / 0.0 | 35 / 35 / 0.0 | 3.98 / 3.66 / 8.0
ex5p_apb | 69 | 316 / 295 / 6.6 | 44 / 34 / 22.7 | 6.68 / 4.09 / 38.8
frisc_apb | 144 | 1011 / 914 / 9.6 | 73 / 41 / 43.8 | 7.18 / 4.05 / 43.6
misex3_apb | 26 | 389 / 369 / 5.1 | 46 / 31 / 32.6 | 4.09 / 3.85 / 5.9
pdc_apb | 54 | 1307 / 1198 / 8.3 | 80 / 53 / 33.8 | 6.70 / 3.85 / 42.5
s298_apb | 10 | 508 / 498 / 2.0 | 39 / 24 / 38.5 | 4.09 / 3.87 / 5.4
s38417_apb | 131 | 1781 / 1627 / 8.6 | 88 / 29 / 67.0 | 9.14 / 4.16 / 54.5
s38584.1_apb | 334 | 1765 / 1659 / 6.0 | 33 / 34 / -3.0 | 12.47 / 4.17 / 66.6
seq_apb | 71 | 473 / 458 / 3.2 | 36 / 37 / -2.8 | 6.65 / 3.86 / 42.0
spla_apb | 58 | 1049 / 967 / 7.8 | 69 / 47 / 31.9 | 4.08 / 2.98 / 27.0
tseng_apb | 165 | 340 / 289 / 15.0 | 41 / 25 / 39.0 | 7.22 / 3.67 / 49.2
avg: | 144 | 856 / 787 / 7.9 | 52 / 34 / 28.8 | 6.94 / 3.93 / 36.4

5) Interface Timing

To determine the timing improvement of the proposed PLC changes, we measured the system bus critical path for each of the benchmarks. To ensure that we were focused on the system bus portion of the benchmark, we reported the critical path of the clock domain containing only the system bus interface. The results show a very significant improvement for the benchmarks with a large number of register bits and an average critical path improvement of 36.4%, as shown in columns 9, 10 and 11 of Table 5.3. These critical path delays would result in a system bus running at, on average, 254 MHz in the modified PLC, whereas the regular PLC would only support a system bus interface at, on average, 144 MHz. These results also indicate that the standard deviation of the critical paths is much smaller in the proposed architecture (0.30) than in the base architecture (2.51). This is important in practical terms because, with the modifications to the PLC, the system bus timing will be predictable regardless of the circuit implemented.
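The frequency figures quoted above follow directly from the average critical-path delays in Table 5.3 via f = 1/t. A one-line Python check of that arithmetic (the two delay values are taken from the table; nothing else is assumed):

```python
# Convert the average critical-path delays from Table 5.3 to clock frequencies.
base_ns, new_ns = 6.94, 3.93                 # average critical paths (ns)
f_base = 1.0 / (base_ns * 1e-9) / 1e6        # ~144 MHz (base architecture)
f_new = 1.0 / (new_ns * 1e-9) / 1e6          # ~254 MHz (modified PLC)
print(round(f_base), round(f_new))           # 144 254
```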
5.7 Direct Synchronous Interfaces

In this section we present modifications to the PLC architecture that support direct synchronous interfaces. These changes improve the interface timing while remaining flexible enough to support a variety of protocols, data widths, and clock rates. We begin with an overview of the target interfaces and then highlight the timing challenges when implementing these interfaces. Next, we describe our implementation in detail. Finally, we will present experimental results showing the timing improvements achieved with this new architecture, as well as the effects on area overhead, routing congestion and CLB usage.

5.7.1 Overview

Tightly coupled design blocks are often connected using design-specific interfaces, rather than standard system busses. We refer to these interfaces as direct synchronous interfaces. This type of interface avoids the overhead associated with implementing a full system bus and often enables higher throughput by allowing higher interface clock rates. In many SoCs the design blocks on the 'datapath' will communicate this way, while control and configuration operations will occur over a system bus. Our debug infrastructure must be able to observe and control these types of interfaces.

Direct interfaces often use simpler protocols and operate at higher speeds than system bus interfaces. Rather than using explicit addresses for each data transaction, the identification of the transfer data is often done implicitly, based on the order that the data is received, or is done using in-band signalling, such as packet headers. In addition, these design-specific protocols often eliminate the handshaking that characterizes system bus protocols, and use simple 'data valid' indications with no mechanism for back-pressuring incoming data. Because of this, the main challenges in the design of these interfaces are to ensure that the ordering of the data is maintained and that the interface is able to accept new data on each clock cycle.

5.7.2 Timing Requirements

The timing requirements for direct synchronous interfaces in PLCs are that they are able to sample or generate data at the full clock rate of the fixed function logic while managing the rate adaptation to the lower clock rate of the programmable logic core. It is possible to design these types of rate-adaptive interface circuits with general purpose programmable logic, but, as we will show in Subsection 5.7.5, the performance of the logic often limits the maximum clock rate of the interface.

The challenge when connecting an interface of this type from fixed function logic to a circuit implemented in programmable logic is that the maximum clock rate of the programmable logic may not be sufficient. Since many of these interfaces do not allow for back-pressure, the data generated by the fixed function block will be lost if the receiving block does not sample it at the same clock rate at which it was generated. In many cases the circuits implemented in programmable logic could be designed to maintain the required data throughput by parallelizing the data processing and using a lower clock rate. However, the incoming data must still be sampled using the high-speed clock and a rate adaptation performed to move the data to the lower frequency clock domain. A similar problem exists in the opposite direction when the programmable logic drives a high-speed fixed function interface. The specification of the fixed function interface may not be met if the PLC cannot generate data at the expected clock rate.

5.7.3 Functional Requirements

The primary functional requirements of these direct, rate-adaptive, synchronous interfaces are that they are able to provide the flexibility to enable many different data formats and clock ratios. For instance, the interface may use 32-bit wide data, or eight-bit wide data, and the proposed modifications must handle such variable data widths efficiently. The ratios of the clock rates of the fixed function logic to the programmable logic must also be flexible. For instance, the fixed function logic may run with a clock rate of 600 MHz, but the programmable logic may only be capable of a maximum clock rate of 200 MHz, or 100 MHz, so the new interface logic must support different ratios.
Based on studies comparing the achievable clock frequencies of ASICs and FPGAs [1], we chose the following target clock ratios for this work: 2:1, 3:1, 4:1, 6:1, 8:1.

5.7.4 Modified CLB Implementation

To create our modified CLB structures, we target the data sampling, data generation and rate adaptation portions of the interface logic. By 'hardening' these aspects of the interface we can move all the logic that must operate on the fast clock to application-specific, configurable logic inside the CLB. These new circuits are partitioned such that they match the basic CLB structure in terms of the number of I/O and area. We create two types of modified CLBs, as shown in Figure 5.9.

Fig. 5.9: Direct Synchronous Modified CLBs

One set of modified CLBs implements the interface and rate adaptation logic for incoming data and the other set targets outgoing data. In our implementation, these two CLB types are interleaved on alternating rows of a given column and their I/O are connected directly to the regular programmable routing fabric of the PLC. Then, at the chip level, the arbitrary synchronous interfaces can be connected to the regular I/O of the PLC. These signals are then routed to the inputs and outputs of the modified CLBs as needed. On the other side of the modified CLBs, the signals can be routed to the regular CLBs in order to implement the normal programmable logic circuits.

Fig. 5.10: Incoming-Type CLB FIFO

As described earlier, the inputs and outputs of the circuitry embedded into each modified CLB are multiplexed with the inputs and outputs of the 4x4-LUT structure (to create a shadow cluster) and then connected to the regular routing resources within the PLC. However, for the incoming-type modified CLBs, the number of outputs from the modified CLBs is eight, whereas a regular CLB has only four outputs. The details of this change were discussed in Section 5.6.

Each of the incoming-type CLBs contains a configurable FIFO, as shown in Figure 5.10. The FIFO structure requires two clocks: a fast clock and a slow clock. Each of these clocks is routed to the modified CLBs using standard FPGA clock routing. The fast clock is sourced from the same clock that is generating the incoming data. This fast clock is usually generated by a phase locked loop (PLL) on the SoC. The slow clock is a divided version of the fast clock; a common clock divider is used for the entire PLC. The slow clock is also used as the main clock for the programmable logic in the PLC. The slow clock must be frequency-locked to the fast clock; however, this will be guaranteed if the fast clock is used as the reference in the clock divider circuit. There are a number of well-known clock divider circuits that are well suited to this task [25]. To avoid requiring difficult clock routing constraints, the FIFO was designed to avoid the need for specific phase relationships between the fast clock and slow clock.

The FIFO supports rate adaptation ratios of 2:1, 3:1, 4:1, 6:1, and 8:1. As designed, the FIFO is able to use its inputs and outputs efficiently, such that for a given clock ratio the maximum possible inputs and outputs can be used simultaneously. This rate adaptation allows for high-speed rate-adaptive interfaces. For example, assume that the incoming data was generated by a fixed-function block operating at 600 MHz. If the FIFO was configured with a 4:1 clock ratio, then on each 600 MHz clock edge, the FIFO would sample 2 bits of incoming data.
On the other side of the FIFO, the slow clock, operating at 150 MHz, would generate 8 bits of data on each clock edge. This data would be synchronous to the rest of the logic in the PLC. By using multiple modified CLBs, any interface data width can be supported.

The FIFO manages the transfer of data from the fast clock domain to the slow clock domain. An initial stage of programmable 2:1 multiplexers controls the routing of the inputs based on the target clock ratio and a programmable roll-over value for the 3-bit sample counter is also configured based on the desired clock ratio. Based on these settings, the initial stage of flip-flops, which reside in the fast clock domain, selectively sample the incoming data. A second stage of flip-flops samples and holds the first stage data on the falling edge of the fast clock, whenever the counter value is '000'. The final stage of flip-flops resides in the slow clock domain. These flip-flops transfer the data into the slow clock domain where it can be processed by the regular logic in the PLC. The design of the programmable counter block, which will be discussed in detail later in this section, ensures that there is no chance of metastability when the slow clock transfer occurs.

Each of the outgoing-type CLBs contains a configurable FIFO, as shown in Figure 5.11. The outgoing FIFO requires the same clock structure as the incoming FIFO and supports the equivalent rate adaptation ratios of 1:2, 1:3, 1:4, 1:6, and 1:8. Like the incoming FIFO, the outgoing FIFO uses its inputs and outputs efficiently, such that for a given clock ratio the maximum possible inputs and outputs can be used simultaneously.

Fig. 5.11: Outgoing-Type CLB FIFO

The outgoing FIFO manages data transfers from the slow clock domain to the fast clock domain. The initial stage of the FIFO is a stage of flip-flops operating in the slow clock domain. These flip-flops sample data synchronously to the regular programmable logic. The second stage of the FIFO is a set of flip-flops that sample and hold the data, on the falling edge of the fast clock, each time the count value equals '000'. The third stage is a hierarchical set of multiplexers. Each of these multiplexers is controlled by a bit of the programmable counter. The counter controls which of the eight sampled values will be captured by the final flip-flop stage. The programmable roll-over value of the counter controls the rate adaptation ratio and dictates which of the output values will be valid. The design of the counter block, which will be discussed in detail later in this section, ensures that there is no chance of a metastability event when the fast clock transfer occurs.

A 3-bit counter with a programmable roll-over value is required to manage data transfers for both the incoming FIFO and the outgoing FIFO. A schematic depiction of this circuit is shown in Figure 5.12. The value of the counter co-ordinates the data transfers to ensure that the data ends up in the correct flip-flops depending on the clock ratio selected. In addition to coordinating the data transfers, the counter also controls the start-up condition of the FIFO to avoid metastability when data is transferred between clock domains. The counter circuitry detects the rising edge of the slow clock and uses it to determine when it is safe to transfer the data between clock domains. Initially, when the circuit recovers from the reset, phase lock has not been achieved. Once the rising edge of the slow clock has been detected, then phase lock is declared, and the count value is cleared. When the count value is '000' the cross clock domain data transfer occurs.
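To illustrate the rate-adaptation mechanism just described, here is a small behavioural model in Python. It is our own sketch rather than the thesis circuit: it abstracts away the flip-flop stages, falling-edge sampling and metastability margin, and simply shows how a roll-over counter turns 2 bits per fast-clock edge into 2xR bits per slow-clock edge for a clock ratio of R:1.

```python
# Behavioural sketch of the incoming-type rate-adaptation FIFO.
# ratio = fast_clock / slow_clock (2, 3, 4, 6 or 8 in the proposed CLB).

class IncomingFifoModel:
    def __init__(self, ratio, bits_per_fast_edge=2):
        self.ratio = ratio
        self.bits_per_fast_edge = bits_per_fast_edge
        self.count = 0          # 3-bit counter with programmable roll-over
        self.shift = []         # data sampled in the fast clock domain
        self.slow_word = None   # word presented to the slow clock domain

    def fast_edge(self, bits):
        """Called once per fast-clock edge with bits_per_fast_edge new bits."""
        self.shift.extend(bits)
        self.count = (self.count + 1) % self.ratio   # roll-over value = ratio
        if self.count == 0:                          # transfer point ('000')
            self.slow_word = self.shift[-self.ratio * self.bits_per_fast_edge:]

    def slow_edge(self):
        """Called once per slow-clock edge; returns the rate-adapted word."""
        word, self.slow_word = self.slow_word, None
        return word


# Example: a 4:1 ratio gathers 2 bits per 600 MHz edge into an
# 8-bit word for the 150 MHz domain.
fifo = IncomingFifoModel(ratio=4)
for i in range(4):
    fifo.fast_edge([i & 1, (i >> 1) & 1])
print(fifo.slow_edge())   # [0, 0, 1, 0, 0, 1, 1, 1]
```

The outgoing FIFO is the mirror image of this model: the slow domain loads a wide word, and the counter selects which slice is driven onto the fast-clock outputs on each fast edge.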
Fig. 5.12: Sample Count / Phase Lock Circuit

The design of this counter and the FIFOs ensures that there is a worst-case margin of 50% of the period of the fast clock during which the cross-domain sampling occurs. An example of the counter operation is shown in Figure 5.13. The worst case occurs when the clock ratio is 2:1. The margin is dictated by the accuracy of the rising edge detect. Assuming worst-case alignment of the slow and fast clocks, the edge detect will have an error of ±1 fast clock period (this occurs when the fast clock edge 'just misses' the slow clock's rising edge). In a 2:1 case, the period of the slow clock is twice that of the fast clock, and therefore this amount of phase accuracy is acceptable. By moving the fast clock domain sampling to the falling edge there is always a separation of 50% of the fast clock period. This margin will dictate the relative jitter tolerance between the two clocks.

Fig. 5.13: Sample Count Generation Example Timing

5.7.5 Experimental Results

In this section, we measure the area overhead and performance of our proposal, and also determine how efficiently our new architecture can implement circuits containing high-speed synchronous interfaces. We use the same base PLC architecture and experimental framework described in Section 5.5. First, we describe the set of benchmark circuits used to evaluate our architecture.

1) Benchmarks

Because there was no large set of existing benchmarks that were designed to interface to high-speed fixed function logic, we created six new sets of benchmarks based on the 20 largest MCNC benchmarks [24]. Each set was created to represent design scenarios requiring different interface rate adaptations. For each benchmark we add a set of inputs and outputs that require a high-speed clock. We then add rate adaptation circuitry to connect to the existing I/O of the benchmark circuit. Each new benchmark circuit has been named to reflect its base MCNC circuit and its target clock ratio, for example 'alu4_4:1' for the 'alu4' circuit requiring a 4:1 clock rate ratio. Due to space constraints, we only present the results for the 20 benchmarks that target a 4:1 clock ratio. We believe that this is sufficient since the proposed architecture will function correctly for the other ratios and we expect the place and route results to be similar for those ratios.
Table 5.4: Direct Synchronous Area Overhead
Benchmark | Size | W | Base Arch. (10^6 transistors) | New Arch. (10^6 transistors) | Incr. (%)
alu4_4:1 | 22x22 | 42 | 6.71 | 6.76 | 0.80
apex2_4:1 | 24x24 | 52 | 9.58 | 9.63 | 0.56
apex4_4:1 | 20x20 | 44 | 5.78 | 5.82 | 0.64
bigkey_4:1 | 27x27 | 43 | 10.27 | 10.33 | 0.62
clma_4:1 | 49x49 | 87 | 62.93 | 63.12 | 0.30
des_4:1 | 23x23 | 47 | 8.07 | 8.15 | 0.95
diffeq_4:1 | 22x22 | 47 | 7.39 | 7.44 | 0.72
dsip_4:1 | 21x21 | 39 | 5.75 | 5.80 | 0.93
elliptic_4:1 | 33x33 | 67 | 22.62 | 22.75 | 0.59
ex1010_4:1 | 37x37 | 76 | 31.77 | 31.91 | 0.46
ex5p_4:1 | 19x19 | 43 | 5.12 | 5.16 | 0.85
frisc_4:1 | 33x33 | 71 | 23.85 | 23.95 | 0.44
misex3_4:1 | 21x21 | 45 | 6.50 | 6.55 | 0.82
pdc_4:1 | 37x37 | 75 | 31.45 | 31.51 | 0.21
s298_4:1 | 24x24 | 39 | 7.49 | 7.54 | 0.71
s38417_4:1 | 42x42 | 89 | 47.34 | 47.47 | 0.27
s38584.1_4:1 | 42x42 | 99 | 52.18 | 52.37 | 0.36
seq_4:1 | 23x23 | 53 | 8.97 | 9.04 | 0.74
spla_4:1 | 34x34 | 68 | 24.32 | 24.36 | 0.18
tseng_4:1 | 18x18 | 38 | 4.14 | 4.18 | 1.06
avg: | | | | | 0.61

2) Area Overhead

Analogous to the increased area overhead for the system bus interface modified CLBs described in Section 5.6, the proposed CLB changes for direct synchronous interfaces also increase the number of transistors required to implement the PLC. Using the same technique of transistor enumeration, we estimated that the modified CLB would require an average of 558 more minimum-width transistors than a basic CLB. Again, this increase is modest since the CLB area is only a small part of the overall PLC area. Using the same procedure described previously in Section 5.6, we produced a total transistor count as shown in Table 5.4. The area overhead for the proposed architecture change is 0.61%, on average. Again, since each modified CLB can also implement a standard 4x4-LUT, this overhead number represents the area overhead for designs that do not make use of the new circuits. For designs that make use of the new circuits, the number of required CLBs will be reduced (see Subsection 3) and there will be an overall area savings.

3) CLB Usage and Congestion

To determine the effect of the proposed changes on logic density, we measured the number of CLBs (regular and modified) used to implement the benchmarks for the base architecture and our new architecture. The results, which are shown in columns 3, 4 and 5 of Table 5.5, were generated by performing a detailed place and route with HHVPR. They show that with the modified PLC architecture, there is a CLB decrease of 13.6% on average. To determine the effect of the proposed changes on routing congestion, we measured the minimum required channel width to route each of the benchmark circuits. The results demonstrate significant improvements in congestion as shown in columns 6, 7 and 8 of Table 5.5. The average reduction in the channel width is 9.4%.
Table 5.5: Direct Synchronous Interface Place and Route Results
Benchmark | I/O | CLBs (Base / New / Decr. %) | Channel Width W (Base / New / Decr. %) | Critical Path ns (Base / New / Decr. %)
alu4_4:1 | 7 | 434 / 417 / 3.9 | 45 / 42 / 6.7 | 3.25 / 1.25 / 61.5
apex2_4:1 | 12 | 550 / 517 / 6.0 | 55 / 52 / 3.6 | 3.78 / 1.25 / 66.9
apex4_4:1 | 9 | 368 / 352 / 4.3 | 43 / 44 / -2.3 | 3.34 / 1.25 / 62.6
bigkey_4:1 | 108 | 823 / 506 / 38.5 | 60 / 43 / 28.3 | 7.07 / 1.25 / 82.3
clma_4:1 | 38 | 2675 / 2254 / 15.7 | 98 / 87 / 11.2 | 4.83 / 1.25 / 74.1
des_4:1 | 127 | 863 / 505 / 41.5 | 67 / 47 / 30.0 | 7.33 / 3.72 / 49.2
diffeq_4:1 | 27 | 505 / 424 / 16.0 | 53 / 47 / 11.3 | 4.40 / 1.45 / 67.0
dsip_4:1 | 108 | 724 / 404 / 44.2 | 58 / 39 / 32.8 | 7.31 / 2.33 / 68.1
elliptic_4:1 | 63 | 1181 / 1022 / 13.5 | 77 / 67 / 13.0 | 5.90 / 1.25 / 78.8
ex1010_4:1 | 7 | 1282 / 1268 / 1.1 | 78 / 76 / 2.6 | 3.24 / 1.25 / 61.4
ex5p_4:1 | 19 | 338 / 299 / 11.5 | 45 / 43 / 4.4 | 3.77 / 1.24 / 67.1
frisc_4:1 | 35 | 1065 / 988 / 7.2 | 75 / 71 / 5.3 | 4.30 / 1.25 / 70.9
misex3_4:1 | 22 | 403 / 383 / 5.0 | 45 / 45 / 0.0 | 3.24 / 1.25 / 61.4
pdc_4:1 | 37 | 1339 / 1299 / 3.0 | 79 / 75 / 5.1 | 3.67 / 1.25 / 65.9
s298_4:1 | 4 | 512 / 504 / 1.6 | 41 / 39 / 4.9 | 3.02 / 1.25 / 58.6
s38417_4:1 | 35 | 1769 / 1684 / 4.8 | 89 / 89 / 0.0 | 4.50 / 1.24 / 72.4
s38584.1_4:1 | 87 | 1963 / 1719 / 12.4 | 101 / 99 / 2.0 | 6.18 / 1.25 / 79.8
seq_4:1 | 53 | 536 / 482 / 10.1 | 56 / 53 / 5.4 | 3.99 / 1.25 / 68.7
spla_4:1 | 43 | 1073 / 1035 / 3.5 | 71 / 68 / 4.2 | 3.98 / 1.25 / 68.6
tseng_4:1 | 45 | 409 / 298 / 27.1 | 47 / 38 / 19.1 | 5.04 / 1.25 / 75.2
avg: | 44 | 940 / 818 / 13.6 | 64 / 58 / 9.4 | 4.60 / 1.44 / 68.0

4) Interface Timing

To determine the timing improvement of the proposed PLC changes, we measured the critical path for each of the benchmarks. As before, we measured the critical path of the clock domain containing only the high-speed interface. The average critical path improvement was 68%, as shown in columns 9, 10 and 11 of Table 5.5. Importantly, the results show an average critical path of 1.44 ns, which translates to an operating frequency of 694 MHz. These benchmarks could therefore interface to fixed function logic operating at 694 MHz, while the core PLC logic operated at 174 MHz; without the modifications the interfaces would be limited to, on average, 217 MHz.

It is interesting to note that while the timing performance of most of the benchmarks is quite consistent for the new PLC architecture (column 10), the des_4:1 and dsip_4:1 benchmarks had significantly longer critical paths. Based on a manual investigation of these two benchmarks, we believe the longer critical paths were a result of the large ratio between the number of I/Os on these benchmarks and the number of CLBs. This ratio forces the I/O ports of the PLC to reside in non-optimal locations, and increases the routing delay. We believe that this issue could be addressed by increasing the number of modified CLBs in the PLC.

5.8 Chapter Summary

In this chapter we have demonstrated modifications to programmable logic cores (PLCs) that enable the implementation of circuits with high-speed interfaces, which in turn enables post-silicon debug at full SoC clock rates. We have addressed both system bus interfaces and direct synchronous interfaces. The modifications have minimal area overhead in the PLC, and for circuits that make use of them there is a significant decrease in overall CLB usage and required routing resources. These changes integrate directly into the regular programmable fabric, thus enabling the reuse of existing FPGA CAD tools. By enabling high-speed interfaces internal to the PLC, we have simplified the process of integrating a PLC into a fixed-function SoC for post-silicon debug. With the certainty that the PLC will be able to implement the required interface circuits, the SoC integrator can speculatively connect target circuits to the debug PLC at design time.
In a more general sense, we have also demonstrated the effectiveness of using application-specific configurable circuits in the PLC fabric. Unlike general-purpose FPGAs, the design context of the programmable logic is known in advance for a PLC because the target application of the SoC is usually known. By targeting configurable circuits at the CLB level and using the shadow cluster approach, it is possible to add new features to a PLC with a relatively low area overhead. In future work we hope to leverage this for applications beyond interface circuits.

Chapter 5 References

[1] I. Kuon, J. Rose, "Measuring the Gap Between FPGAs and ASICs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 2, pp. 203-215, February 2007.
[2] S.J.E. Wilton, et al., "Design Considerations for Soft Embedded Programmable Logic Cores," IEEE Journal of Solid-State Circuits, vol. 40, no. 2, February 2005.
[3] S. Phillips, S. Hauck, "Automatic layout of domain-specific reconfigurable subsystems for systems-on-a-chip," Proceedings of the ACM International Symposium on Field-Programmable Gate Arrays, February 2002.
[4] V.C. Aken'Ova, G. Lemieux, R. Saleh, "An improved 'soft' eFPGA design and implementation strategy," Proceedings of the IEEE Custom Integrated Circuits Conference, pp. 179-182, September 2005.
[5] M2000 FLEXEOS Configurable IP Core [Online]. Available: http://www.m2000.fr
[6] eASIC 0.13um Core [Online]. Available: http://www.easic.com/products/easicore013.html
[7] P. Zuchowski, et al., "A Hybrid ASIC and FPGA Architecture," Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pp. 187-194, November 2002.
[8] P. Magarshack, P. Paulin, "System-on-Chip Beyond the Nanometer Wall," Proceedings of the Design Automation Conference, pp. 419-424, June 2003.
[9] S.J.E. Wilton, R. Saleh, "Programmable Logic IP Cores in SoC Design: Opportunities and Challenges," Proceedings of the IEEE Custom Integrated Circuit Conference, San Diego, CA, pp. 63-66, May 2001.
[10] S. Winegarden, "Bus Architecture of a System on a Chip with User-Configurable System Logic," IEEE Journal of Solid-State Circuits, vol. 35, no. 3, pp. 425-433, March 2000.
[11] F. Lertora, M. Borgatti, "Handling Different Computational Granularity by Reconfigurable IC featuring Embedded FPGAs and Network-On-Chip," Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, pp. 45-54, April 2005.
[12] J. Madrenas, "Rapid Prototyping of Electronic Systems Using FIPSOC," Proceedings of the IEEE International Conference on Emerging Technologies and Factory Automation, pp. 287-296, October 1999.
[13] T. Chelcea, S.M. Nowick, "Robust Interfaces for Mixed-Timing Systems," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 8, pp. 857-873, August 2004.
[14] W.A. Graf, "Programmable I/O cell with data conversion capability," U.S. Patent 5760719, June 2, 1998.
[15] M.J. Miller, "Programmable FIFO buffer," U.S. Patent 4750149, June 7, 1988.
[16] R. Saleh, et al., "System-on-Chip: Reuse and Integration," Proceedings of the IEEE, vol. 94, no. 6, pp. 1050-1069, June 2006.
[17] Opencores.org, Wishbone System-on-Chip (SoC) Interconnection Architecture for Portable IP Cores, Revision B.3, September 2002.
[18] IBM Corporation, On-Chip Peripheral Bus Architecture Specification, Version 2.1, April 2001.
[19] IBM Corporation, Device Control Register Bus Architecture Specification, Version 3.5, January 2006.
[20] ARM Ltd., AMBA Specification, Revision 2.0, May 1999.
[21] Intel Corp., PHY Interface for PCI Express Architecture, Version 1.87, 2005.
[22] IEEE, "Part 3: Carrier sense multiple access with collision detection (CSMA/CD) access method and physical layer specifications," IEEE Std 802.3, 2000 Edition, 2000.
[23] P. Jamieson, J. Rose, "Enhancing the area-efficiency of FPGAs with hard circuits using shadow clusters," Proceedings of the IEEE International Conference on Field-Programmable Technology, pp. 1-8, December 2006.
[24] V. Betz, J. Rose, A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs, Kluwer Academic Publishers, 1999.
[25] T.B. Preußer, R.G. Spallek, "Analysis of a Fully-Scalable Digital Fractional Clock Divider," Proceedings of the International Conference on Application-specific Systems, Architectures and Processors, pp. 173-177, September 2006.
[26] B.R. Quinton, S.J.E. Wilton, "Post-Silicon Debug Using Programmable Logic Cores," Proceedings of the IEEE International Conference on Field-Programmable Technology, pp. 241-247, December 2005.
[27] M. Abramovici, "A Reconfigurable Design-for-Debug Infrastructure for SoCs," Proceedings of the Design Automation Conference, pp. 7-12, July 2006.
[28] S. Sarangi, et al., "Patching Processor Design Errors With Programmable Hardware," IEEE Micro, vol. 27, no. 1, pp. 12-25, January/February 2007.

Chapter 6 Conclusion

6.1 Summary

In this thesis we have demonstrated a new reconfigurable post-silicon debug infrastructure that was designed to enhance the post-silicon validation of complex integrated circuits (ICs) and systems-on-a-chip (SoCs). This infrastructure addresses an important and growing problem in the IC development process [1]: the cost and time required to verify the correct functional behaviour of new devices continues to grow dramatically [2], but at the same time, the cost of retooling the manufacturing process to "re-spin" a device and correct a design error has surpassed the $1 million mark [3]. If these trends continue, the economic risks associated with IC development will be so great that only the largest corporations will be able to compete effectively.

Developing a comprehensive process for quickly identifying the root cause of design errors (bugs) in a manufactured device is therefore a critical component of an overall strategy to address the economics of the IC development process [4]. This capability is key not only because it speeds up the validation process and reduces the time-to-market for new devices, but also because accurately identifying the source of bugs is critical to avoiding extremely costly device re-spins and recalls. Our new infrastructure advances this capability, and as such, constitutes an important step towards enabling this comprehensive debug process.

Identifying the root cause of design errors in a manufactured (or post-silicon) device is difficult because of the lack of visibility into the internal operation of the device [1]. Initial attempts have been made to increase the internal visibility of complex ICs by adding extra debug logic [5, 6]. However, design errors are by their very nature unexpected, and therefore determining the correct design of debug logic is difficult.
To overcome this issue we have proposed a reconfigurable infrastructure that makes use of programmable networks and embedded programmable logic. By using programmable logic, the problem of determining the correct debug structure can be deferred until after the device has been manufactured. When an unexpected behaviour is encountered during the validation process, it is possible to create debug circuits in the embedded programmable logic and connect these circuits to the circuits under debug using our programmable access network. As a result, the same physical debug infrastructure can be reused repeatedly for each new bug discovered throughout the validation process.

We have also shown that in addition to assisting with the debug process, our new post-silicon debug infrastructure can be used to enable the detection and correction of design errors during normal device operation. This capability is valuable in two ways. First, it may help to avoid an expensive device re-spin for a minor design error. Second, the ability to implement 'workarounds' to known problems can be important to enable the effective validation of the device. It is extremely important to discover as many bugs as possible before committing to the re-spin because the cost of re-tooling the IC manufacturing process is constant no matter how many bugs are fixed. By working around these initial bugs it is often possible to discover new 'hidden' bugs. We show that error detection circuits can be designed using the programmable logic in our infrastructure and used to notify local firmware or other higher-layer supervisory software about the occurrence and nature of the design error. Furthermore, since the debug circuits can be designed to override certain signals internal to the device, they can potentially take autonomous action to correct a design error once it is detected.

Although the flexibility of our new infrastructure is valuable, it comes with a significant cost because programmable logic is significantly less area-efficient than fixed-function logic and it normally operates at a lower frequency. As we discussed in Chapter 2, for the infrastructure to be as effective as possible, it must have a low area overhead and be able to operate at the normal operating frequency of the target device. This thesis has addressed those requirements. We identified three key implementation challenges that must be addressed to enable our infrastructure: the topology of the access network; the implementation of the high-speed access network; and the high-speed interface to the programmable logic.

We first addressed the issue of determining an efficient network topology for the access network in our debug infrastructure. We demonstrated that we could use the flexibility of input and output assignments on the programmable logic core to reduce the cost and depth of the access network. We showed that a class of network called a concentrator would provide the required connectivity for our debug access network, while making use of the output ordering flexibility. This result is important to the success of our infrastructure because it demonstrated that it is possible to create a low-cost network with full debug connectivity. The new network is much more area-efficient than the more obvious implementations using direct signal multiplexing. We created two new concentrator constructions.
Our results demonstrated that a concentrator network could be built with half the network depth of an equivalently sized permutation network and significantly less area overhead. We were also able to demonstrate that a hierarchical concentrator network can be integrated into a SoC as part of a post-silicon debug infrastructure. In applying this result to our debug infrastructure, we showed that this hierarchical network could provide visibility into many thousands of potential debug signals with an area overhead of less than 5% for most of our target SoC implementations.

Second, we examined the issue of implementing our hierarchical concentrator network at the high speeds required to enable the debug of circuits in normal, full-speed operation. We focused on the second stage of our hierarchical network, which spans the entire device. We showed that synchronous pipelining could be used to create high-speed implementations of this stage of the network. However, synchronous pipelining would require that a significant design effort be made to implement a low-skew global clock tree. In order to avoid this extra design effort, and to increase the acceptance of our proposal, we investigated asynchronous interconnect implementations. Our survey of existing asynchronous interconnects revealed that they were ill-suited for implementation using standard CAD tools. Since our debug infrastructure is targeted at mainstream ICs and SoCs, we created a new asynchronous interconnect that can be implemented, optimized and verified using standard CAD tools. We also demonstrated a modification to current CAD tools that further enhances our asynchronous interconnect. Using our new asynchronous interconnect, our results showed that in a 90nm process technology we can consistently achieve throughputs of over 800 MHz in the second and most challenging stage of the debug network. These results are important to our debug infrastructure because these throughputs match the operating frequencies of our target SoC, thereby enabling the direct monitoring of full-speed circuits under debug.

Third, we focused on the implementation of the interface between the circuits under debug and the debug circuits. We showed that, because of limitations on timing performance, the debug circuits implemented in standard programmable logic architectures would be unable to interface directly to the circuits under debug, which were implemented in fixed-function logic. We addressed this issue by first classifying SoC interfaces into two types: system bus interfaces and direct synchronous interfaces. Then, for each interface type, we developed enhancements to the architecture of the programmable logic core that significantly improved the timing of rate-adaptive interface circuits. Using our architectural enhancements, we were able to show that in a 0.18um process technology we could improve the timing of system bus interface circuits by 36% to enable operation at, on average, 254 MHz. Likewise, we were able to improve the timing of direct synchronous interfaces by 68% to enable operation at, on average, 694 MHz. These results are significant to our debug infrastructure because they enable interface implementations at frequencies that match our target SoC operating frequencies.

Finally, we addressed the total area overhead of our post-silicon debug infrastructure using the implementations and architectures developed throughout the thesis.
We analyzed our infrastructure over a range of SoC implementation sizes and with varying numbers of debug nodes in order to understand the trade-offs and costs associated with the infrastructure. Our results were very promising, showing that the area overhead for many debug infrastructure configurations and target SoCs was less than 10%. These results compare well to other debug proposals, such as the debug infrastructure from DAFCA, which cites area overheads of 4.2% and 4.8% for implementations on two commercially available 90nm SoCs [7]. They also compare well to the debug area overhead targets of between 5% and 10% provided by SoC device architects working in industry [8].

6.2 Contributions and Publications

In this section we summarize the main contributions of this thesis and list the publications associated with each contribution, as shown in Tables 6.1 - 6.4. These contributions can be grouped into four areas: a reconfigurable post-silicon debug infrastructure; network topology; network implementation; and programmable logic interface.

Table 6.1: A Reconfigurable Post-Silicon Debug Infrastructure

Key contributions:
- The architecture for a new reconfigurable post-silicon debug infrastructure
- A description of the debug process using a post-silicon debug infrastructure
- The identification of key challenges in the implementation of a debug infrastructure

Associated Publications:
- B.R. Quinton, S.J.E. Wilton, "Post-Silicon Debug Using Programmable Logic Cores," Proceedings of the IEEE International Conference on Field-Programmable Technology, Singapore, pp. 241-247, December 2005.
- B.R. Quinton, S.J.E. Wilton, "Programmable Logic Core Based Post-Silicon Debug For SoCs," Proceedings of the IEEE Silicon Debug and Diagnosis Workshop, Germany, May 2007.

Table 6.2: Network Topology

Key contributions:
- The use of concentrator networks for embedded programmable logic applications
- The construction of two new concentrator networks: single-stage and hierarchical
- The mapping of the new hierarchical concentrator to the problem of post-silicon debug

Associated Publications:
- B.R. Quinton, S.J.E. Wilton, "Concentrator Access Networks for Programmable Logic Cores on SoCs," Proceedings of the IEEE International Symposium on Circuits and Systems, Kobe, Japan, pp. 45-48, May 2005.

Table 6.3: Network Implementation

Key contributions:
- Identification of the limitations of existing interconnects using standard CAD tools
- Development of a new asynchronous interconnect
- The implementation of an enhancement to the cell placement CAD tool for asynchronous interconnects

Associated Publications:
- B.R. Quinton, M.R. Greenstreet, S.J.E. Wilton, "Practical Asynchronous Interconnect Network Design," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 5, pp. 579-588, May 2008.
- B.R. Quinton, M. Greenstreet, S.J.E. Wilton, "Asynchronous IC Interconnect Network Design and Implementation Using a Standard ASIC Flow," Proceedings of the IEEE International Conference on Computer Design, San Jose, California, pp. 267-274, October 2005.

Table 6.4: Programmable Logic Interface

Key contributions:
- Enhancements to the architecture of programmable logic cores to enable high-speed on-chip interfaces
- Demonstration that these enhancements can be made with no negative impact on the routability of the fabric
- Demonstration that these enhancements can be made with a very low area overhead to the PLC

Associated Publications:
- B.R. Quinton, S.J.E. Wilton, "Programmable Logic Core Enhancements for High Speed On-Chip Interfaces," accepted for publication in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2008.
- B.R. Quinton, S.J.E. Wilton, "Embedded Programmable Logic Core Enhancements for System Bus Interfaces," Proceedings of the International Conference on Field-Programmable Logic and Applications, Amsterdam, pp. 202-209, August 2007.

6.3 Limitations and Future Work

In this section we identify a number of limitations with respect to our proposal for a post-silicon debug infrastructure. For each of these limitations we describe the issue and suggest areas for future work that would address that issue. We begin by examining the issue of the methodology for the selection of debug nodes. Second, we address the issue of determining the amount of programmable logic and debug buffering required in our infrastructure. Third, we consider the issue of inferring circuit behaviour from a limited set of debug nodes. Finally, we examine the integration of the post-silicon debug infrastructure with existing design-for-test (DFT) structures.

6.3.1 Methodology for the Selection of Debug Nodes

The selection of a useful set of debug nodes is key to the effectiveness of our post-silicon debug infrastructure. This is a difficult problem because the selection must be made at design time, before the sources of the post-silicon bugs are known. We have not provided a detailed methodology for this selection process in this thesis. Instead, we have left this issue as future work.

There are a number of ways that future work could address this issue. We can begin to address the issue based on the observation that some signals in the design are more valuable than others from a debug perspective. It is easy to imagine that this is the case since some signals provide information about the current state of the design, while others are used only for data transport. An effective methodology for the selection of debug nodes would be to give preference to the selection of these valuable nodes. One possible direction to achieve this goal would be to provide a set of criteria for designers to use in the selection of debug nodes. An example of this type of criteria would be: "The current state of all state machines should be classified as observable nodes for debug purposes". Such criteria would help the designers to make intelligent decisions. These criteria could also be augmented with the addition of signals that were discovered to be valuable during debug in the pre-silicon verification stage.

Another direction that could be examined to determine the valuable debug nodes would be to develop software to automatically analyze designs to help select nodes. The software could be designed to use information about the topology of the design in order to infer the value of the node. For example, a signal that has a high fan-in would likely be valuable since it would provide a summary of information about many other signals. Likewise, a signal that is used as feedback would often indicate the state of a circuit and therefore also be valuable.
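To make this direction concrete, the sketch below illustrates how such an analysis tool might rank candidate debug nodes from a netlist connectivity graph. It is purely illustrative and is not part of the thesis flow: the netlist representation, the helper names (build_graph, in_feedback_loop, rank_debug_nodes) and the scoring weights are assumptions chosen for the example.

```python
# Illustrative sketch only: rank candidate debug nodes by structural hints
# (high fan-in, membership in a feedback loop). The netlist format and the
# scoring weights are assumptions made for this example.
from collections import defaultdict

def build_graph(netlist):
    """netlist: iterable of (driver_signal, driven_signal) edges."""
    succ, pred = defaultdict(set), defaultdict(set)
    for src, dst in netlist:
        succ[src].add(dst)
        pred[dst].add(src)
    return succ, pred

def in_feedback_loop(signal, succ):
    """True if the signal can reach itself, i.e. it lies on a feedback path."""
    seen, stack = set(), list(succ[signal])
    while stack:
        node = stack.pop()
        if node == signal:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(succ[node])
    return False

def rank_debug_nodes(netlist, top_n=10):
    succ, pred = build_graph(netlist)
    scores = {}
    for sig in set(succ) | set(pred):
        score = len(pred[sig])          # high fan-in summarizes many other signals
        if in_feedback_loop(sig, succ):
            score += 5                  # feedback often indicates circuit state
        scores[sig] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Example: a small netlist with a feedback path through 'state'
edges = [("a", "state"), ("b", "state"), ("state", "out"),
         ("state", "state_next"), ("state_next", "state"), ("c", "out")]
print(rank_debug_nodes(edges, top_n=3))   # -> ['state', 'state_next', 'out']
```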
6.3.2 Amount of Programmable Logic and Debug Buffering

The debug infrastructure will only be effective if it contains sufficient programmable logic and memory buffers to handle the debug scenarios that may arise during post-silicon validation. This is a difficult problem because it is not possible to know the nature of the bugs beforehand. In this thesis we assumed that the equivalent of 10,000 ASIC gates would be sufficient to build debug circuits, but we did not provide a detailed analysis of this assumption; this analysis has been left for future work.

One of the ways to approach this issue would be to record the size of debug circuits used for post-silicon debug on previous devices and use these values to help determine the correct size for new developments. With this approach the first few devices may not be optimal, but through experience the infrastructure would become more efficient. Another approach would be to build a set of example debug circuits before the device is manufactured and use this information when sizing the programmable logic. To help build the example debug circuits, a number of pre-silicon bugs could be designated as examples of potential post-silicon bugs. Then, the post-silicon debug infrastructure could be used to 'debug' these known problems. Pre-silicon simulation and synthesis techniques could be used to measure the size and effectiveness of these trial circuits.

6.3.3 Inference of Circuit Behaviour From a Limited Set of Debug Nodes

It is not possible to choose every signal in the device as a debug node because the area overhead of the infrastructure would be prohibitive. Even with the development of a methodology for intelligent debug node selection, as suggested in Section 6.3.1, there will be cases where the value of a signal required for debug is not directly observable. This scenario is not addressed in detail in this thesis, but is another possible focus for future research.

We could begin to address this issue by observing that the values of signals in a given design are not random. Instead, the values of signals in the design depend on the values of other signals in the design itself. The gate-level description of the design provides a map to relate the values of different signals. Therefore, using the information known about the signals that are observable, and the design netlist, it will be possible to infer the values of some signals that are not directly observable. We expect that in the future we would use this property not only to infer the values of debug signals, but also to help with the debug node selection methodology.
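As a rough illustration of this idea (again, not part of the thesis flow; the two-input gate library and the netlist format are assumptions made for the example), known values at observed debug nodes can be propagated forward through the gate-level netlist until no further values can be derived:

```python
# Illustrative sketch only: forward-propagate known (observed) values through a
# gate-level netlist to infer signals that are not directly observable.
GATES = {
    "AND":  lambda a, b: a & b,
    "OR":   lambda a, b: a | b,
    "XOR":  lambda a, b: a ^ b,
    "NAND": lambda a, b: 1 - (a & b),
}

def infer_values(netlist, observed):
    """netlist: list of (output, gate_type, input_a, input_b), in any order.
    observed: dict of signal -> 0/1 for the directly observable debug nodes."""
    known = dict(observed)
    changed = True
    while changed:                       # iterate until no new value can be derived
        changed = False
        for out, gate, a, b in netlist:
            if out not in known and a in known and b in known:
                known[out] = GATES[gate](known[a], known[b])
                changed = True
    return known

# Example: 'err' is not a debug node, but it is implied by the observed nodes
netlist = [("n1", "AND", "req", "grant"), ("err", "XOR", "n1", "ack")]
print(infer_values(netlist, {"req": 1, "grant": 1, "ack": 0}))
# -> {'req': 1, 'grant': 1, 'ack': 0, 'n1': 1, 'err': 1}
```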
6.3.4 Integration with Existing Design-for-Test (DFT) Structures

Virtually all large ICs and SoCs make use of DFT structures to enhance the detection of manufacturing defects. Although our post-silicon debug infrastructure has been developed to address functional defects (bugs), the observability and control provided by our infrastructure provide some of the same functionality that is required by manufacturing test. In this thesis we have not investigated the overlap between these two sets of structures; however, this is a potential area for future work. This overlap in functionality could be used to reduce the area cost of the DFT structures or the post-silicon debug infrastructure when both are present in a given design.

The most common DFT methodology for SoCs is the insertion of 'scan-chains' into the design. Scan-chains are inserted by replacing each flip-flop in the design with an equivalent 'scan-enabled' flip-flop. The scan-enabled flip-flop has two new inputs, 'scan-in' and 'scan-en'. By connecting the output of each flip-flop to the 'scan-in' input of an adjacent flip-flop, all the flip-flops in the design can be chained together in such a way that patterns can be applied to the design, and the results observed, in order to detect manufacturing defects. The observe and control debug nodes in our proposal can similarly be used to apply and observe known patterns and detect manufacturing issues. Since the number of debug nodes in our proposal is limited, the defect coverage achieved will not be complete and some scan-chains will still be required; however, the overall scan-chain requirement may be reduced. Further work is required to investigate the potential of this approach.

Chapter 6 References

[1] International Technology Roadmap for Semiconductors (ITRS), 2007 Report, http://www.itrs.net, 2008.
[2] J. Ogawa, "Living in the Product Development 'Valley of Death'," FPGA and Structured ASIC Journal, November 23, 2004.
[3] S. Miraglia, et al., "Cost effective strategies for ASIC masks," Proceedings of SPIE, vol. 5043, pp. 142-152, 2003.
[4] S. Sandler, "Need for debug doesn't stop at first silicon," EE Times, February 21, 2005.
[5] M. Riley, M. Genden, "Cell Broadband Engine Debugging for Unknown Events," IEEE Design and Test of Computers, vol. 24, no. 5, pp. 486-493, September/October 2007.
[6] T.J. Foster, et al., "First Silicon Functional Validation and Debug of Multicore Microprocessors," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 15, no. 5, pp. 495-504, May 2007.
[7] M. Abramovici, "A Silicon Validation and Debug Solution with Great Benefits and Low Costs," Proceedings of the IEEE International Test Conference, October 2007.
[8] Personal communication: Gerry Leavey, Principal Engineer, PMC-Sierra, Inc., Burnaby, British Columbia, August 2006.

Appendix A Publications

[1] B.R. Quinton, S.J.E. Wilton, "Post-Silicon Debug Using Programmable Logic Cores," Proceedings of the IEEE International Conference on Field-Programmable Technology, Singapore, pp. 241-247, December 2005.
[2] B.R. Quinton, S.J.E. Wilton, "Programmable Logic Core Based Post-Silicon Debug For SoCs," Proceedings of the IEEE Silicon Debug and Diagnosis Workshop, Germany, May 2007.
[3] B.R. Quinton, S.J.E. Wilton, "Concentrator Access Networks for Programmable Logic Cores on SoCs," Proceedings of the IEEE International Symposium on Circuits and Systems, Kobe, Japan, pp. 45-48, May 2005.
[4] B.R. Quinton, M. Greenstreet, S.J.E. Wilton, "Asynchronous IC Interconnect Network Design and Implementation Using a Standard ASIC Flow," Proceedings of the IEEE International Conference on Computer Design, San Jose, California, pp. 267-274, October 2005.
[5] B.R. Quinton, M.R. Greenstreet, S.J.E. Wilton, "Practical Asynchronous Interconnect Network Design," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 5, pp. 579-588, May 2008.
[6] B.R. Quinton, S.J.E. Wilton, "Embedded Programmable Logic Core Enhancements for System Bus Interfaces," Proceedings of the International Conference on Field-Programmable Logic and Applications, Amsterdam, pp. 202-209, August 2007.
[7] B.R. Quinton, S.J.E. Wilton, "Programmable Logic Core Enhancements for High Speed On-Chip Interfaces," accepted for publication in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, May 2008.
