UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Rapid instrumentation for debug and verification of circuits on FPGAs Eslami, Fatemeh 2018

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


24-ubc_2018_september_eslami_fatemeh.pdf [ 1.49MB ]
JSON: 24-1.0369287.json
JSON-LD: 24-1.0369287-ld.json
RDF/XML (Pretty): 24-1.0369287-rdf.xml
RDF/JSON: 24-1.0369287-rdf.json
Turtle: 24-1.0369287-turtle.txt
N-Triples: 24-1.0369287-rdf-ntriples.txt
Original Record: 24-1.0369287-source.json
Full Text

Full Text

Rapid Instrumentation for Debug and Verification ofCircuits on FPGAsbyFatemeh EslamiB.Sc., Shahid Beheshti University, 2009M.Sc., University of Victoria, 2012A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFDoctor of PhilosophyinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Electrical and Computer Engineering)The University of British Columbia(Vancouver)August 2018c© Fatemeh Eslami, 2018The following individuals certify that they have read, and recommend to the Fac-ulty of Graduate and Postdoctoral Studies for acceptance, the dissertation entitled:Rapid Instrumentation for Debug and Verification of Circuits on FPGAssubmitted by Fatemeh Eslami in partial fulfillment of the requirements forthe degree of Doctor of Philosophyin Electrical and Computer Engineering.Examining Committee:Dr. Steven Wilton, Department of Electrical and Computer EngineeringSupervisorDr. Shahriar Mirabbasi, Department of Electrical and Computer EngineeringSupervisory Committee MemberDr. Martin Ordonez, Department of Electrical and Computer EngineeringUniversity ExaminerDr. Ryozo Nagamune, Department of Mechanical EngineeringUniversity ExaminerAdditional Supervisory Committee Members:Dr. Karthik Pattabiraman, Department of Electrical and Computer EngineeringSupervisory Committee MemberiiAbstractField-Programmable Gate Array (FPGA) technology is rapidly gaining tractionin a wide range of applications. Nonetheless, FPGA debug productivity is a keychallenge. For FPGAs to become mainstream, a debug ecosystem which providesthe ability to rapidly debug and understand designs implemented on FPGAs isessential. Although simulation is valuable, many of the most elusive and trouble-some bugs can only be found by running the design on an actual FPGA. However,debugging at the hardware level is challenging due to limited visibility. To gainobservability, on-chip instrumentation is required.In this thesis, we propose methods which can be used to support rapid and ef-ficient implementation of on-chip instruments such as triggers and coverage moni-toring. We seek techniques that avoid large area overhead, and slow recompilationof the user circuit between debug iterations.First, we explore the feasibility of implementation of debug instrumentationinto FPGA circuits by applying incremental compilation techniques to reduce thetime required to insert trigger circuitry. We show that incremental implementationof triggers is constrained by the mapping of the user circuits.Second, we propose a rapid triggering solution through the use of a virtual over-lay fabric and mapping algorithms that enables fast debug iterations. The overlayis built from leftover resources not used by the user circuit, reducing the area over-head. At debug time, the overlay fabric can quickly be configured to implementdesired trigger functionalities. Experimental results show that the proposed ap-proach can speed up debug iteration runtimes by an order of magnitude comparedto circuit recompilation.Third, to support rapid and efficient implementation of complex triggering ca-iiipabilities, we design and evaluate an overlay fabric and mapping tools specializedfor trigger-type circuits. Experimental evaluation shows that the specialized over-lay can be reconfigured to implement complex triggering scenarios in less than 40seconds, enabling rapid FPGA debug.The final contribution is a scalable coverage instrumentation framework basedon overlays that enables runtime coverage monitoring during post-silicon valida-tion. Our experiments show that using this framework to gather branch coveragedata is up to 23X faster compared to compile-time instrumentation with a negligi-ble impact on the user circuit.ivLay SummaryField-Programmable Gate Arrays (FPGAs) are programmable integrated circuitsthat can be used to prototype complex designs, such as processors, or to acceler-ate computationally intensive applications such as artificial intelligence algorithms.When incorrect behaviour is observed, finding the root cause (also known as de-bugging) is complicated by a lack of observability, due to the limited resources thatcan be exploited to observe internal signals. The above challenge is exacerbatedby FPGAs very long compilation time, which is required to map a design onto FP-GAs. The lengthy compilation time can take several hours and significantly reducethe productivity of the debug process.In this thesis, we propose techniques for improving existing debugging tech-niques to provide observability in order to help designers to identify design errorsand ensure the correct functionality of designs implemented on FPGAs withoutrequiring long recompilations, accelerating debug process.vPrefaceThe research contributions presented in this thesis are based on work publishedin various workshops and conferences, and a journal article. In all of these con-tributions, I was primarily responsible for designing solutions, development, andperforming experiments. This was done under the guidance and direction of mysupervisor Dr. Steve Wilton. Dr. Wilton also was responsible for editing and writ-ing segments of the manuscripts. I had the collaboration of Dr. Eddie Hung, fromInvionics, for Chapter 6.Below are publication details for each chapter:• Chapter 3– Incremental distributed trigger insertion for efficient FPGA debug [42]:Fatemeh Eslami and Steven Wilton, IEEE International Conference onField Programmable Logic and Applications (FPL), 2014.• Chapter 4– An adaptive virtual overlay for fast trigger insertion for FPGA de-bug [44]: Fatemeh Eslami and Steven Wilton, IEEE International Con-ference on Field Programmable Technology (FPT), 2015.– Enabling Effective FPGA Debug using Overlays: Opportunities andChallenges [42]: Fatemeh Eslami and Steven Wilton, Workshop onOverlay Architectures for FPGAs (OLAF), 2016.• Chapter 5vi– An Improved Overlay and Mapping Algorithm Supporting Rapid Trig-gering for FPGA Debug [43]: Fatemeh Eslami and Steven Wilton,ACM SIGARCH Computer Architecture News (HEART), 2017.– Rapid Triggering Capability using an Adaptive Overlay during FPGADebug: Fatemeh Eslami and Steven Wilton, ACM Transactions on De-sign Automation of Electronic Systems (TODAES), To be published in2018.• Chapter 6– Extending Post-Silicon Coverage Measurement Using Time-MultiplexedFPGA Overlays [46]: Fatemeh Eslami, Eddie Hung, and Steven Wilton,IEEE European Test Symposium (ETS), 2018.viiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiiGlossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1.1 FPGA Debug Instrumentation . . . . . . . . . . . . . . . 31.1.2 FPGA Coverage Instrumentation . . . . . . . . . . . . . . 51.1.3 FPGA Overlay Fabrics . . . . . . . . . . . . . . . . . . . 61.2 Contributions and Research Overview . . . . . . . . . . . . . . . 71.2.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . 71.2.2 Research Overview . . . . . . . . . . . . . . . . . . . . . 81.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . 12viii2 Background and Related Work . . . . . . . . . . . . . . . . . . . . . 142.1 Post-Silicon Debug Instrumentation . . . . . . . . . . . . . . . . 152.1.1 FPGA Debug Flow . . . . . . . . . . . . . . . . . . . . . 192.2 Coverage Metrics for Functional Verification . . . . . . . . . . . 252.2.1 Simulation-based Coverage Metrics . . . . . . . . . . . . 262.2.2 Post-Silicon Coverage Monitors . . . . . . . . . . . . . . 272.3 Overlay Architectures . . . . . . . . . . . . . . . . . . . . . . . . 302.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323 Incremental Compilation Techniques for FPGA Debug . . . . . . . 343.1 Incremental Distributed Trigger Insertion . . . . . . . . . . . . . 363.1.1 Target FPGA Architecture . . . . . . . . . . . . . . . . . 373.1.2 Incremental-Distributed Timing-Aware Trigger Pack-place 383.1.3 Two-level Congestion Awareness in Trigger Pack-place . . 393.1.4 Incremental Routing . . . . . . . . . . . . . . . . . . . . 413.2 Evaluation Methodology . . . . . . . . . . . . . . . . . . . . . . 423.2.1 CAD Flow . . . . . . . . . . . . . . . . . . . . . . . . . 423.2.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . 443.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453.3.1 Trigger Insertion . . . . . . . . . . . . . . . . . . . . . . 453.3.2 Effect on Flow Runtime . . . . . . . . . . . . . . . . . . 463.3.3 Effect on Circuit Delay . . . . . . . . . . . . . . . . . . . 483.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494 Rapid Triggering Framework using FPGA Overlays . . . . . . . . . 504.1 Debug Overlay Flow . . . . . . . . . . . . . . . . . . . . . . . . 514.2 Proposed Overlay Architecture . . . . . . . . . . . . . . . . . . . 534.3 Overlay Implementation Strategy . . . . . . . . . . . . . . . . . . 554.3.1 Algorithm to map Overlay on top of User Circuit . . . . . 564.4 CAD for Trigger Mapping . . . . . . . . . . . . . . . . . . . . . 594.4.1 Routing-aware Placement Heuristic . . . . . . . . . . . . 594.4.2 Routing Augmentation . . . . . . . . . . . . . . . . . . . 634.4.3 Merging user Circuit with Trigger Circuit . . . . . . . . . 63ix4.5 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 634.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 654.6.1 Establishing the Overlay Architecture . . . . . . . . . . . 654.6.2 Trigger Mapping . . . . . . . . . . . . . . . . . . . . . . 694.6.3 Comparison to Incremental Triggering without using Over-lays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 735 Customized FPGA Overlay and Mapping Tools for Rapid Trigger-ing Capabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 755.1 Overlay Architecture . . . . . . . . . . . . . . . . . . . . . . . . 765.1.1 Parameterized Overlay Architecture Family . . . . . . . . 765.1.2 Overlay Construction . . . . . . . . . . . . . . . . . . . . 805.2 CAD Algorithm for Overlay Personalization . . . . . . . . . . . . 815.2.1 Trigger Circuit Modeling . . . . . . . . . . . . . . . . . . 825.2.2 Mapping Combinational Triggers . . . . . . . . . . . . . 825.2.3 Mapping Sequential Triggers . . . . . . . . . . . . . . . . 895.2.4 Connecting Trigger Circuit to the User Circuit . . . . . . 925.3 Experimental Methodology . . . . . . . . . . . . . . . . . . . . . 935.3.1 Underlying User Circuits . . . . . . . . . . . . . . . . . 935.3.2 Overlay Architecture Assumptions . . . . . . . . . . . . . 955.3.3 Synthetic Trigger Circuit Assumptions . . . . . . . . . . . 955.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 965.4.1 Overlay Implementation Runtime . . . . . . . . . . . . . 965.4.2 Trigger Mapping Runtime . . . . . . . . . . . . . . . . . 975.4.3 Circuit Critical Path Delay . . . . . . . . . . . . . . . . . 1005.4.4 Comparison with Intel Quartus Prime . . . . . . . . . . . 1015.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1036 Post-Silicon Coverage using Overlays . . . . . . . . . . . . . . . . . 1056.1 Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1066.2 Coverage Overlay Architecture . . . . . . . . . . . . . . . . . . . 1086.2.1 Overlay Architecture . . . . . . . . . . . . . . . . . . . . 109x6.3 Branch Coverage Instrumentation . . . . . . . . . . . . . . . . . 1116.4 Coverage to Overlay Mapping . . . . . . . . . . . . . . . . . . . 1126.5 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1136.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 1136.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1177 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . 1197.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1197.2 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . 1207.3 Limitations and Future Research Directions . . . . . . . . . . . . 124Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127xiList of TablesTable 3.1 Benchmark Summary. . . . . . . . . . . . . . . . . . . . . . . 44Table 3.2 Effect of Trigger Insertion on Circuit Critical Path Delay (ns). . 48Table 4.1 FPGA Architecture Used Based on Intel Stratix IV. . . . . . . 64Table 4.2 Benchmark Summary. . . . . . . . . . . . . . . . . . . . . . . 65Table 4.3 Average Number of Input Pins per Overlay Cell. . . . . . . . . 69Table 4.4 Effect of Trigger Insertion on Critical Path Delay (ns). . . . . . 72Table 5.1 Architectural Parameters Describing Logical Overlay. . . . . . 79Table 5.2 Benchmark Summary. . . . . . . . . . . . . . . . . . . . . . . 94Table 5.3 Values of the Overlay Architectural Parameters, Used for theExperiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . 94Table 5.4 Overlay Compile-Time Overhead. . . . . . . . . . . . . . . . . 97Table 5.5 Effect of Trigger Insertion on Circuit Critical Path Delay (ns). . 100Table 5.6 Comparison between This work and Quartus Prime on IntelStratix IV (EP4SGX180) Architecture for mcml. . . . . . . . . 102Table 6.1 Benchmark circuits, total number of branch control signals, max-imum number of on-chip monitors that fit on FPGA device ateach instrumentation iteration, number of total instrumentationiterations for branch coverage analysis. . . . . . . . . . . . . 114Table 6.2 Circuit critical path delay (ns) of uninstrumented circuit, instru-mented circuit at compile-time and runtime instrumentation. . . 116xiiList of FiguresFigure 2.1 On-chip debug instrumentation including trace buffers to cap-ture data, and a trigger circuitry to control capturing . . . . . 17Figure 2.2 Typical FPGA debug flow in which instrumentation is insertedbefore compilation . . . . . . . . . . . . . . . . . . . . . . . 20Figure 3.1 Incremental trigger insertion flow. . . . . . . . . . . . . . . . 35Figure 3.2 Overview of incremental trigger insertion CAD flow. . . . . . 36Figure 3.3 Overview of a Logic Cluster (LC) and Logic Element (LE). . 37Figure 3.4 Finding spare LCs around the seed.Packer will not considerfully used logic clusters. . . . . . . . . . . . . . . . . . . . . 39Figure 3.5 Skipping a highly congested LC. . . . . . . . . . . . . . . . . 41Figure 3.6 Incremental trigger insertion CAD flow. . . . . . . . . . . . . 43Figure 3.7 Trigger insertion runtime breakdown. . . . . . . . . . . . . . 47Figure 3.8 Incremental trigger insertion runtime saving in comparison tocircuit recompilation. . . . . . . . . . . . . . . . . . . . . . . 47Figure 4.1 Overview of using a virtual overlay for FPGA debug. . . . . . 51Figure 4.2 Proposed debug flow using an overlay for FPGA debug: com-pile time and debug time phases. . . . . . . . . . . . . . . . . 52Figure 4.3 Simplified overlay architecture. . . . . . . . . . . . . . . . . 54Figure 4.4 Overlay placement and routing pseudocode. . . . . . . . . . . 56Figure 4.5 Simple trigger netlist. . . . . . . . . . . . . . . . . . . . . . . 60Figure 4.6 The trigger netlist of Figure 4.5 mapped to logic elements ofoverlay cells. . . . . . . . . . . . . . . . . . . . . . . . . . . 61xiiiFigure 4.7 Compile-time: VPR runtime for establishing a maximally-sizedoverlay architecture as a function of channel width. Note thatthe y-axis uses a log scale . . . . . . . . . . . . . . . . . . . 66Figure 4.8 Compile-time: VPR runtime for establishing overlay architec-ture as a function of overlay size for 1.2Wmin . . . . . . . . . 67Figure 4.9 Compile-time: VPR runtime overhead for establishing overlayarchitecture for 1.3Wmin. . . . . . . . . . . . . . . . . . . . 68Figure 4.10 Debug-time: trigger mapping runtime assuming Overlaymaxfor different size comparators. . . . . . . . . . . . . . . . . . 70Figure 4.11 Average trigger mapping runtime as a function of overlay size. 71Figure 5.1 (a) An overview of the parameterized overlay architecture thatis comprised of cells connected via primary (shown in black)and auxiliary (shown in red) connections, and a set of flip-flops connected to cells in level lo− 1 (b) Overlay cell designwith ko-input LUTs, local routing, and outputs to other cells.(c) Overlay flip-flop design with a flip-flop, local routing, andfanout to other cells. . . . . . . . . . . . . . . . . . . . . . . 77Figure 5.2 Overlay reconfiguration flow that takes a graph representationof a trigger netlist and outputs a placed and routed netlist ontothe overlay fabric. . . . . . . . . . . . . . . . . . . . . . . . . 82Figure 5.3 Fanout bounding takes a multi-sink trigger net (a) and decom-poses it to multiple two-sink nets (b). . . . . . . . . . . . . . 83Figure 5.4 A simple single-sink combinational trigger netlist. . . . . . . 84Figure 5.5 Detailed view of connections between overlay cells assumingno = 8, ko = 4 and lo = 3. Each cell (except OC1) has ten inputpins. Seven (I1-I7) are driven by primary (shown in black)output pins. Three (I8-I10) are driven by auxiliary (shown inred) output pins. Each output pin is driven by a 4-input LUTs.LUTs and routing resources inside each cell are not shown forsimplicity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85xivFigure 5.6 Transformation takes a sequential trigger (a) and restructuresit to sequential trigger (b) where trigger netlist can be cut intotwo partitions: combinational and sequential. . . . . . . . . . 90Figure 5.7 Debug-time: combinational trigger mapping runtime. . . . . . 97Figure 5.8 Debug-time: state-based trigger mapping runtime. . . . . . . 98Figure 5.9 Debug-time: mapping runtime of complex sequential triggercircuits with 128-bit comparator. . . . . . . . . . . . . . . . . 99Figure 5.10 An example of a state-based trigger to detect when the pattern0110101 has been received on a signal in consecutive cycles. . 102Figure 6.1 Overview of our framework. . . . . . . . . . . . . . . . . . . 107Figure 6.2 (a) Coverage overlay architecture that is comprised of cellsconnected via primary (shown in black) and auxiliary (shownin red) connections, inputs to the overlay are control signalsfrom the user circuit and its output controls trace buffers. (b)Overlay cell design with Logic Elements (LE), local routing,and outputs to other cells, each LE contains a 4-input LUT anda flip-flop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110Figure 6.3 Branch coverage instrumentation that is mapped onto the over-lay fabric. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111Figure 6.4 Speedup of runtime branch coverage instrumentation in com-parison to compile-time instrumentation over multiple instru-mentation iterations. . . . . . . . . . . . . . . . . . . . . . . 115Figure 6.5 Impact of preserving all control signals on area in comparisonto the uninstrumented circuit where these signals can be opti-mized away if possible. . . . . . . . . . . . . . . . . . . . . . 117xvGlossaryASIC Application-Specific Integrated CircuitCAD Computer-Aided DesignFPGA Field-Programmable Gate ArrayHDL Hardware Description Language, e.g. Verilog, VHDLHLS High-Level SynthesisIC Integrated CircuitLC Logic ClusterLE Logic ElementLUT Look-Up TableRTL Register-Transfer LevelxviAcknowledgmentsFirst and foremost, I would like to express my gratitude to my academic advisor,Steve Wilton for his endless patience, encouragement, and unwavering commit-ment that gave me the opportunity to grow both technically and personally. I wouldnot have gone half of this way without his invaluable guidance and support.I owe a special thanks to Eddie Hung from Invionics Inc. for his support,thoughts, and insights (and as a never ending source of great ideas)- I truly learneda lot from you. I would also like to thank my committee members, Shahriar Mi-arbbasi, Karthik Pattabiaraman, Mieszko Lis, Martin Ordonez, Ryozo Nagamune,and Kiarash Bazargan for their valuable feedback.I am thankful to my colleagues in the SoC lab for creating a friendly researchenvironment. This includes Jeffrey Goeders, Parinaz Tehrani, Shadi Asadi, Al-Shahna Jamal, Jose Pinilla, Pavan Bussa, Hossein Omidian, Sarah Mashayekhi,Assem Bsoul, and several others.To my family, you have been there for me through fear, uncertainty, and doubtin spite of the geographical distance between us. Thanks for making this distancebearable. Azam, thank you for being such supporting sister. Mostafa, thank youfor supporting me, and for believing in me. This thesis is dedicated to the lovingmemory of my father and to my mother.xviiChapter 1Introduction1.1 MotivationIn the nanometer era, producing correctly-working complex Application-SpecificIntegrated Circuits (ASICs) designs is becoming more expensive and time-consuming,primarily due to costly fabrication and the amount of engineering verification re-quired to avoid unaffordable device re-spins [2]. Traditionally, circuit designers usesoftware simulators, such as Mentor’s ModelSim [3], to verify and debug hardwaredesigns before fabrication; this is called pre-silicon debug. Although simulation-based verification is valuable, it is not enough. Many errors can only be observedwhen a design is running on the actual hardware for several reasons: (1) many bugsonly emerge after long runtimes, and simulation is orders of magnitude slower thanreal silicon, (2) many “corner case” behaviours (i.e. those that are impractical to de-scribe with a test-bench) may need real workloads before they can be observed, and(3) the most difficult bugs are often in the interfaces between a design and neigh-bouring chips; only by verification and debugging the system in-situ running on1the actual hardware such bugs can be found. This task is often called post-siliconverification (also known as post-silicon debug) [34, 48, 103, 104]. Post-siliconverification involves ensuring the design correctness. If an unexpected behaviouris observed, it is necessary to identify the root cause and fix design errors.To avoid costly re-spins, post-silicon validation is often performed by imple-menting the design using a prototype built from one or more FPGAs. FPGAsprovide the ability to reconfigure the design during validation, allowing for theevaluation of potential design variants, design updates, or fixing found bugs with-out the need of a costly re-spin. Increasingly, these FPGA implementations arebeing used as an initial version of the product, and only migrated to an ASIC ifvolumes warrant. Additionally, running a design at speed on an FPGA prototypewith real-world stimulus allows for far more exhaustive and system-level tests (e.g.booting an operating system) that are infeasible in simulation.Recent years have seen tremendous interest in using FPGAs for computation-ally intensive applications. Hardware acceleration is now entering the mainstream,as evidenced by Intel’s acquisition of Altera, Amazon’s incorporation of FPGAsin their EC2 F1 web service, and other efforts to bring FPGA technology into thecloud [10, 37, 73, 92, 109, 117].Nonetheless, the widespread use of FPGAs in such services is limited due totheir very long compilation time which can take several hours or even days for alarge and complex design. Traditional hardware designers may be willing to ac-cept long design and debug cycles, however, application designers using FPGAtechnology to accelerate software applications may not. These designers may ex-pect software-like compile times, and similar support for debug and optimization.Although there has been much work on developing design compilers to support2designers that want to accelerate parts of their computation algorithms, design-ers need an effective mechanism to rapidly debug, optimize, and understand theirdesigns.Debugging consumes an increasingly high percentage of an entire project de-velopment lifecycle and has become a critical bottleneck, motivating the need formore effective and efficient verification technologies to improve productivity anddecrease development time to meet time-to-market demands [80, 104]. As a re-sult, both industry and the research community are now moving to consideringentire debug “ecosystems” which provide the ability to rapidly debug and verifythe functionality of designs implemented on FPGAs.The primary focus of this thesis is on improving existing verification and de-bugging techniques in order to help designers to identify functional bugs and en-sure the correct functionality of designs implemented on FPGAs for prototypingor accelerators used in production systems. Examples of functional bugs may beincorrect logic specifications in the RTL code, incorrect state transition, or IP mis-configuration.1.1.1 FPGA Debug InstrumentationOnce the designer observes unexpected behaviour at the outputs of the system, heor she needs to start debugging to identify the root cause of the bug and fix designerrors. Unlike manufacturing tests and pre-silicon verification that have benefitedfrom automated methods and standard coverage metrics, post-silicon debuggingis still designer-dependent and ad-hoc, meaning that designers with deep knowl-edge of the design are required to find the root cause of unexpected behaviour.Rather than trying to pin-point the root cause automatically, our approach focuses3on techniques to provide meaningful signal data behind unexpected behaviour tothe designer, who can then use this information to find the bug and its root cause.Once the root cause has been identified, steps can be taken in order to fix the bug.Fundamentally, identifying the root cause of a bug requires investigating andunderstanding the design’s behaviour in detail. In simulation, the designer canobserve all signal waveforms over the course of design execution. However, in ahardware implementation, the designer does not have access to all signal data dueto a limited number of I/O ports. A widely-used approach to gain observabilityduring debugging is to employ on-chip debug instrumentation. This instrumenta-tion records the behaviour of a set of selected signals in on-chip memory blocks(also called trace buffers), coordinated by trigger circuitry that determines whendata should be recorded. After execution, this recorded data is read out, and usedin a custom tool to replay the behaviour, hopefully leading to insights into the op-eration of the circuit. Due to limited on-chip memory, the recorded data containsinformation only for the selected signals during a fraction of the circuit execution.To gain additional information about the operation of the circuit, the designer mayneed to change the selected signals or refine the trigger event to capture a differentslice of circuit execution, which is known as a debug iteration. Depending on thecomplexity of the bug or the designer’s expertise, several debug iterations may berequired to find the root cause of unexpected behaviour.In commercial tools [74, 119, 127] as well as academic work [29, 53, 83], theinstrumentation is added to the user design before compilation. The instrumenta-tion is then compiled along the user circuit using the FPGA’s general-purpose logicand routing resources. Once the instrumentation is inserted, if the designer wishesto modify the instrumentation, another recompilation is necessary. The long com-4pilation times that are inherent to FPGA devices increase the time to perform adebug iteration. For very large designs, this can be prohibitive (often called a “gohome event”) which can severely limit debug productivity. Further, recompiling adesign may often lead to slightly different implementation and timing behaviourwhich may cause a bug to disappear or change [81, 116]. Although FPGA ven-dors have made gains at reducing compile time in recent years, compile times arestill long; therefore, a method of enhancing debugging of designs implemented onFPGAs that does not require frequent recompilations is required.1.1.2 FPGA Coverage InstrumentationDuring verification, it is important to evaluate the quality of test-cases that exer-cise the design and identify unexplored areas of the design using coverage metrics.Several standard coverage metrics and methods exist for measuring coverage ofsimulation-based verification. For example, code coverage measures what fractionof the source code (in terms of statement, branch, etc.,) has been executed by thetests, assertion-based coverage indicates how well a design’s behaviour indicatedby the designer has been tested, and mutation-based coverage measures what frac-tion of injected bugs are detected during simulation-based verification. Althoughthese metrics can be easily measured with only little overhead to the simulation,it is not sufficient to completely guarantee the correct operation of a chip, eitherdue to extremely long runtimes or its inability to mimic real-world interfaces in apre-silicon testbench.Since running the designs on FPGAs enables the designers to run far more com-plex and realistic tests compared to simulation, it should be leveraged for coverageanalysis as well. Currently, post-silicon verification methods and tools are not5structured for coverage analysis and are only used to check functional correctnessof the design. Quantifying the coverage of tests during post-silicon verification isdifficult; this difficulty is rooted in the lack of visibility into the internal operationof a running chip [104].On-chip monitors are a straightforward solution for increasing observabilityand measuring post-silicon coverage. However, implementing a large number ofmonitors (hundreds or even thousands) on-chip along with the user circuit can con-sume significant area and may make it infeasible to provide a complete coverageanalysis. Although FPGAs enable implementing a limited number of monitors inmultiple debug sessions, changing monitors requires a recompilation, recompilingcan be very slow, often on the order of hours (or longer); this significantly limitsverification productivity. Further, adding these monitors may perturb placementand routing of the user circuit, possibly changing the behaviour of the design. Inthis thesis, we aim to address these limitations.1.1.3 FPGA Overlay FabricsIn recent years, the concept of an overlay has emerged as a promising technologyto tackle FPGAs long compilation time, and may become key to ensuring thatFPGA technology is successful as it moves to the mainstream. Overlay fabrics,also called intermediate fabrics, are virtual configurable architectures implementedon top of an FPGA. Overlay fabrics provide a trade-off between flexibility andcompile time; by constructing fabrics optimized for specific domains or circuittypes, much of the flexibility can be removed leading to faster compilation times. Agrowing amount of research has investigated overlays for specific applications [26,31, 39, 77, 78, 82, 86, 90, 97]. However, overlays are limited by their area and6performance overhead that prevented the realistic use of them in practice. The areaand performance overheads are mainly because of FPGA resources used to enablethe programmability of the overlay fabrics, and that overlay fabrics are typicallydesigned without careful consideration of the underlying FPGA architecture. Inthis thesis, we aim to address this limitation and explore overlay architectures foran in-system debug ecosystem.1.2 Contributions and Research Overview1.2.1 ContributionsThe main contributions of this thesis are novel techniques that support rapid andefficient implementation of on-chip analysis instruments (such as trigger circuitsand coverage monitoring circuits) suitable for an in-system debug ecosystem, in away that does not require large area overhead, and frequent design recompilations,enhancing FPGA debug productivity. More specifically, we make the followingcontributions:• Addressing the feasibility and challenges of using incremental compilationtechniques for inserting a trigger circuitry without changing the mapping(placement and routing) of the user circuit and without performing a fullrecompilation between debug iterations.• A non-intrusive instrumentation framework that exploits virtual overlays toprovide rapid triggering capabilities enabling rapid debug iterations duringFPGA debug.– An adaptive strategy and CAD techniques to non-intrusively construct7the virtual overlay fabric on top of the user circuit only using leftoverresources after user circuit compilation while preserving its originalmapping.– An overlay architecture and CAD techniques suitable for rapid imple-mentation of arbitrary combinational trigger circuits.– An overlay architecture and CAD techniques specialized for efficientimplementation of state-based triggering functionalities.• A scalable and area-efficient coverage instrumentation framework suitablefor FPGA-based validation, which enables runtime implementation of on-chip coverage monitors.1.2.2 Research OverviewChapter 3 explores applying incremental compilation techniques to insert debugcircuitry without performing a full compilation during debugging to enable fastdebug iterations. We used incremental compilation techniques to implement trig-ger functions on top of an already placed-and-routed circuit without modifyingthe underlying circuit using only resources not used by the user circuit; we pre-serve packing, placement, and routing of the original user circuit when insertingthe debug circuitry; this is important since it ensures that the original user cir-cuit is compiled as normal and its mapping is fixed and does not change as theinstrumentation is inserted, changed, or removed. This also allows designers topreserve low-level optimizations and timing closure of the original circuit as wellas avoiding the CAD noise that is inherent to FPGA heuristic CAD algorithms. Ourapproach has two phases: pack-place phase and routing phase. In the pack-place8phase phase, the goal is to distribute the trigger logic to spare logic resources onthe FPGA. In the routing phase, spare routing resources are used to make all therequired connections in the trigger circuitry using incremental routing techniques.We will show that using incremental compilation techniques for trigger implemen-tation significantly reduces debug iteration times. However, the ability to inserttrigger circuits in this way depends on (1) the spare logic resources, (2) the sparerouting resources, and (3) the mapping of the user circuit. This directly motivatesexploiting overlays for implementation of trigger circuits. The contribution fromthis chapter was published in [42].Using the techniques from Chapter 3, we found it was often difficult to makethe required connections when implementing trigger circuits. To address this,Chapter 4 presents an alternative approach. This approach uses a virtual over-lay architecture that is constructed only once after user circuit compilation but canbe reconfigured multiple times to implement different trigger circuits. Importantly,this does not require a recompilation of the entire user circuit between debug itera-tions. In Chapter 4, we present an adaptive strategy to construct the overlay fabricon top of the user circuit while preserving its original mapping. The overlay archi-tecture is designed as a 2D torus and is made efficient by carefully matching theoverlay to the underlying FPGA architecture, and adapting the overlay to use onlythe resources left unused by the user circuit. In this way, the downside of manyoverlay approaches is avoided and the overhead can potentially be reduced to zero.During debugging, an algorithm is required to rapidly configure the overlayfabric. Hence, we propose a routing-aware simulated-annealing based placementalgorithm to place a trigger circuitry into the overlay while optimizing for routabil-ity of the placement solution. This increases the likelihood of a successful trigger9insertion. Since the overlay is pre-synthesized, the algorithm that maps the triggerfunction to the overlay architecture can be fast.We will show that the overlay architecture provides enough flexibility to im-plement different combinational triggering scenarios an order of magnitude fastercompared to the traditional flow. Furthermore, it should be noted that any increasein the circuit critical path delay is temporary; the operating frequency of the cir-cuit can be set as normal when the debug instrumentation is not required. Thecontribution from this chapter was published in [44].While the overlay architecture and mapping algorithms presented in Chapter 4provide support for rapid implementation of arbitrary combinational functions, itdoes not support more complex triggering capabilities, including sequential (state-based) trigger functions. Such complex trigger circuits are essential to ensure thatthe designer is able to make the best use of the limited trace buffer capacity. Hence,Chapter 5 presents an overlay architecture that is specialized and optimized to im-plement state-based trigger functions during debug. The architecture is designedto be adequately flexible to implement a wide variety of such functions. The ar-chitecture is also optimized to be area-efficient when implemented on top of a usercircuit. To implement the logic part of trigger circuits, the overlay contains a num-ber of cells arranged in multiple levels in a triangular reduction-network pattern.To support sequential part of trigger circuits, the overlay also contains a bank offlip-flops connected to the overlay cells. This chapter also presents several newCAD techniques that adapt to the topological characteristics of this specializedoverlay for mapping complex trigger functions onto the overlay. Trigger mappingis divided into two steps: first, the combinational portion is mapped to the overlaycells using a gradual and simultaneous placement and routing algorithm to ensure a10legal mapping solution. Secondly, the flip-flops in the trigger circuit are placed intothe flip-flop banks of the overlay fabric and connect to the combinational portionof the circuit.We will show that our customized overlay fabric can be reconfigured to im-plement different combinational and sequential triggering scenarios in less than 40seconds for our benchmark circuits, enabling rapid debug iterations. Furthermore,we compare our approach to a commercial tool, Intel’s SignalTap II tool [74]. Wefind that our approach for trigger insertion during debugging can outperform thegeneral-purpose incremental compilation feature of the Intel Quartus Prime toolin terms of runtime and area overhead. The contribution from this chapter waspublished in [43] and an extended version was submitted to a journal with minorrevisions.Chapter 6 describes a scalable and adaptable coverage instrumentation frame-work for FPGA-based coverage analysis which greatly reduces the area overheadof on-chip monitors. We made the observation that we can re-purpose existingFPGA-based on-chip debug cores to facilitate on-chip coverage monitoring. Inthis approach, an overlay fabric is implemented on the FPGA along with the usercircuit; this fabric is flexible enough that it can be used to rapidly cycle throughcoverage monitor functions through multiple runs without requiring a slow recom-pile. Importantly, the fabric is added after the user circuit is placed and routed,using only leftover resources that are not used by the user circuit. The architectureis adaptable in that it can adjust to the number of logic blocks and routing tracksnot used by the user circuit; if there is ample space available, a large fabric thatsupports many monitors in parallel can be used, while if space is tight, a smallerfabric that implements fewer simultaneous monitors can be used.11To achieve this, we employ an overlay fabric on top of the user circuit usingFPGA spare resources after user circuit compilation. Because the overlay is opti-mized for coverage monitoring, the time to compile monitors to the overlay fabricis fast and the overlay can be reconfigured at runtime to implement monitors in atime-multiplexed fashion to gather coverage data during the entire design execu-tion. We will show that using our approach to implement all monitors is up to 23Xfaster compared to compile-time instrumentation with a negligible circuit delayincrease. The contribution from this chapter was published in [46].1.3 Thesis OrganizationIn Chapter 2, we will discuss existing post-silicon debug techniques in more detailand provide an overview of how debug instruments are implemented and used inpost-silicon debug for both ASICs and FPGAs. This chapter then reviews the mostcommon coverage metrics used in simulation-based verification, and reviews priorwork in implementing on-chip monitors for post-silicon validation. This chapteralso reviews FPGA overlays and prior work that described overlay architecturesand tools suitable for specific application domains to address FPGA long compi-lation times. Chapter 3 explores applying incremental compilation techniques toinsert debug circuitry without performing a full compilation and addresses chal-lenges of distributing trigger functions over leftover resources without modifyingthe mapping of the user circuit. Chapter 4 proposes a non-intrusive instrumenta-tion framework that enables rapid implementation of trigger circuits through theuse of debug overlays. This chapter then presents an adaptive strategy to mapdebug overlays onto underlying FPGA resources. Finally, Chapter 4 presents anoverlay architecture and CAD techniques suitable for implementing arbitrary com-12binational trigger circuits into the overlay fabric. Chapter 5 presents a customizedoverlay architecture and mapping algorithms specialized for implementing state-based trigger functions.Chapter 6 proposes a scalable instrumentation framework for implementingcoverage monitoring in a time-multiplexed fashion suitable for FPGA-based cov-erage analysis. Finally, Chapter 7 summarizes the contributions of this thesis andprovides directions for future work.13Chapter 2Background and Related WorkPost-silicon validation can be divided into two important tasks: (1) collecting sig-nal data to verify design behaviour and find the root cause of observed bugs; (2)assessing how exhaustively the design has been exercised during validation usingcoverage metrics which can provide useful feedback to identify which areas of thedesign may be inadequately tested. Performing these tasks during post-silicon val-idation is complicated by a lack of observability due to the limited number of pinsthat can be used to observe design behaviour and collect data.This chapter first describes how on-chip debug instrumentation is used to pro-vide visibility into the design and discusses some of the key challenges of im-plementing debug instruments during post-silicon debug in Section 2.1. It alsodiscusses several efforts from academia and industry on enhancing visibility andefficient implementation of debug instrumentation for ASICs and FPGAs. Thissection provides core information for the contributions of Chapters 3, 4, and 5 ofthis thesis. Next, Section 2.2 discusses pre-silicon simulation-based coverage met-rics, and provides a detailed survey of the approaches that use on-chip coverage14monitors to provide visibility for measuring coverage at post-silicon. The latterprovides information related to the contribution of Chapter 6 of this thesis. Finally,Section 2.3 describes overlay fabrics for FPGAs and reviews the existing work inthis area as FPGA overlays play an important role in the contributions from thisthesis.2.1 Post-Silicon Debug InstrumentationFor many types of bugs, such as straightforward functional errors, software simula-tion is suitable. For other bugs, however, simulation-based verification may be tooslow to expose the bug, especially for large and complex designs that include bothsoftware and hardware components. For many timing and interface errors, it maybe impossible to observe the bug until the design is implemented on a hardwareexecuting at speed in-situ using realistic test cases, such as booting an operatingsystem that is too slow to be simulated. For these reasons, many bugs can only befound by observing the circuit implemented and running in hardware interactingwith other system components. A post-silicon validation process is, therefore, anessential part of any integrated circuit design flow [34, 48, 103, 104].This thesis focuses primarily on post-silicon debugging techniques that aim toensure design correctness against errors caused by human designers as opposedto manufacturing defects. Manufacturing test is applied to every single deviceto detect fabrication defects (e.g. stuck-at faults). Unlike manufacturing testing,there are no standard bug models for bug scenarios during post-silicon verification.Bugs that appear at post-silicon are often corner-case bugs caused by subtle designerrors and accurate modeling of such bugs may be very difficult [34, 55, 104]. Dur-ing post-silicon verification and debugging, real-world long-running tests (such as15booting an operating system or application software) are applied to the design run-ning at-speed and the design behaviour is observed. When incorrect behaviour in arunning chip is observed, it is very difficult for designers to deduce the root causeof the behaviour due to poor observability into internal data. Two main approachesexist to enhance on-chip visibility: scan-based instrumentation and trace-based in-strumentation.Scan-based instrumentation is a technique used by manufactures to test formanufacturing defects in ASICs. A scan chain is a technique that involves con-necting all the flip-flops of a circuit sequentially, providing the ability to observe orcontrol the stored values. Since these scan chains are fabricated into ASIC designs,they can be reused for debugging functional errors [54, 122]. However, some FP-GAs (such as Xilinx Virtex-5) have built in support for scan chains. If there is nobuilt in support, Wheeler et al. showed that adding them requires on average 84%additional chip area [123]. One major drawback of scan-based techniques is that itonly provides a single snapshot of the states. Hence, the circuit must be stopped atthe right time to be able to extract values stored in flip-flops. Extracting the datais slow, and halting the circuit every time the designer wants to observe the storedvalues in flip-flops makes real-time observation impossible; the process of viewingone flip-flop using device readback techniques can take 2-8 seconds [61, 75, 76].In [128], the author proposed scan-based techniques in which shadow flip-flops orlatches are configured to periodically provide a snapshot of the circuit state dur-ing system operation without halting the circuit. However, their techniques cannotavoid the substantial area increase.The most common technique to provide visibility without interrupting the cir-cuit execution is trace-based instrumentation. As shown in Figure 2.1, this instru-16Figure 2.1: On-chip debug instrumentation including trace buffers to capturedata, and a trigger circuitry to control capturingmentation typically consists of trace buffers, which are on-chip RAM blocks usedto store a history of internal signals (trace signals) for later interrogation. This datacan then be provided to the user using a graphical user interface.Since on-chip memory is limited, trace buffers can capture a subset of circuitsignals for a limited number of cycles. Since this recorded data is critical in rootcause analysis, it is important to ensure that the collected data is the most relevantto the unexpected behaviour that has occurred. To ensure that the collected datais useful, a routing network is typically provided to connect a subset of internalsignals to the trace buffer, and trigger circuitry is included to identify events dur-ing the circuit execution at which data should be collected and stored in the tracebuffers [29, 53, 74, 83, 85, 101, 119, 127]. This latter flexibility is of particularimportance to the main contributions in this thesis. As shown in Figure 2.1, triggercircuitry consist of three parts: (1) the trigger function, which determines when17trace buffers should stop recording, (2) trigger signals, which are signals from theuser circuit and are selected by the designer to construct the trigger condition, (3) asingle trigger output, which is the output of the trigger function, and drives controllogic within the trace buffers. Trigger circuits can be used in several ways. Oneoption is that trigger circuit could be used to identify the point at which the tracebuffers should start recording data, in which case, after a trigger event occurs, thetrace buffer fills to capacity and then is frozen. Alternatively, the trigger circuitcould be used to identify the point at which recording should stop; in this case,the memories can be configured as a circular buffer, continuously over-writing theoldest data until the trigger event occurs, at which point the buffers are frozen. Ineither case, after the trigger occurs, the trace buffers maintain a short history ofselected signals which can help the designer uncover the cause of observed bugs.Trigger circuits can be combinational or sequential. As a specific example, a sim-ple trigger function might determine when an internal bus carries a specific bitpattern; a more complex trigger might be a state machine that determines when apre-determined packet is received on an input stream.Trace-based instrumentation is commonly used in ASICs [4, 122]. Since theseon-chip memories have a limited storage capacity, the designer must pre-select anumber of internal signals to be connected to trace-buffers before chip fabrica-tion. Several methods have been proposed to make a trade-off between the amountof visibility, the flexibility of changing selected signals after fabrication withouta costly re-spin, and the chip area required to implement the debug instruments.Mentor Certus [101] provides proprietary observation networks to allow designersto instrument a large subset of signals during compilation; any subset of these sig-nals can be selected for debug at runtime [110]. In [110, 111], a programmable18logic core has been embedded into the design along with an access network to pro-vide the ability of connecting signals of ASICs designs to the trace buffer pins afterfabrication. The authors in [112] proposed a debug infrastructure to support chipswith multiple clock and voltage islands. In [5], the authors proposed a distributedreconfigurable infrastructure along with a signal routing network to implement de-bug structures. In [94, 95], the authors explored other types of trace interconnectionnetworks and proposed a tree-like routing network to provide more flexibility fordebug.Data compression techniques have been also used to compress the real-timedebug data to further enhance the storage efficiency of trace buffers [12, 13, 16,41, 52, 108]. Several trace signal selection and state restoring methods have beenproposed to maximize restoring un-traced signals using traced signals [7, 17, 19,56, 63, 91, 98].To further improve observability, several approaches were developed based ona combination of trace-based debugging and periodic scan chain captures. The ideais that a set of trace signals are stored in every cycle, whereas a set of scan signalsare dumped across multiple cycles [18, 84]. In [113], the authors used severalsmaller scan chains with different dumping periods and proposed signal selectionalgorithms to assign the signals to different scan chains in order to maximize thenumber of states that can be restored.2.1.1 FPGA Debug FlowFPGAs can be configured after fabrication which provides a unique opportunity inthat the instrumentation can be recompiled as the debug instrumentation changesduring the debug process. Figure 2.2 shows a typical FPGA debug flow. In this19Figure 2.2: Typical FPGA debug flow in which instrumentation is insertedbefore compilationflow, the instrumentation is inserted into the design before compilation. The designis compiled using standard tools, and runs as normal. As the circuit is running, theinstrumentation records the behaviour of the selected signals in the trace buffer.After the run is complete, the designer can off-load this information for viewingand analysis, often using a waveform viewer.During the hunt for the root cause of a bug, the designer may need to frequentlyrefine his or her view of how the circuit operates, and may wish to re-run the circuit,recording a different set of internal signals or using a different trigger function con-dition so as to record behaviour in a different portion of the execution. This flowhas three major limitations. First, a recompilation of the design is required when-ever the trigger function or set of traced signals is changed during debug iterations.Recompiling today’s large and complex designs can be extremely slow (severalhours or even days) which can severely limit debug productivity. Second, recom-piling can result in a different mapping (placement and routing), which may causea bug to temporarily disappear or lead to different unexpected behaviour (“crash-20ing in a different way”). Third, the tool will treat the instrumentation and usercircuit as equally important and optimize both together. This means, for example,if the critical path is in the debug instrumentation, the tool will not work as hardto optimize the timing performance of the user circuit. Moreover, it also meansthe debug instrumentation uses those FPGA resources that could be used by usercircuit, rather than using only logic resources that the user circuit does not require.Xilinx’s Vivado [127] and Intel’s Quartus Prime [74] use general-purpose in-cremental compilation to avoid full recompilation if the debug core is updated dur-ing debug iterations. Intel’s SignalTapII [74] requires the designer to compile thedebug core with the design as a separate partition and recompiles this partitionseparately if the property of the debug core is updated using general purpose in-cremental compile. Similarly, Synopsys’s Identify [119] requires recompilation ifreinstrumenting introduces new logic to these debug cores. However, because theseare general-purpose techniques — in that they are designed to accelerate changesbeing made to the original circuit (e.g ECO changes), as opposed to simply addingnew read-only instrumentation — these techniques can still be slow, especially ifsignificant changes to the debug instrumentation are made, and often fall short ofthe software-like turnaround times many designers desire. Additionally, the entiredesign must be loaded into the memory of the workstation running the CAD tool.Therefore, incremental compilation techniques specific to debug instrumentationis required to enhance on-chip debug productivity.Academic works have considered incremental routing techniques that connectsignals to trace buffers or output pins without a full recompilation. In [58], the au-thors proposed inserting trace-buffers into a Xilinx FPGA at compile-time. Then,they applied bitstream modifications to route the trace signals to trace-buffers.21Hung and Wilton [64, 66] used incremental compile techniques to accelerate traceinsertion and signal visibility. They speculatively inserted the debug instrumentsinto the unused resources of FPGAs while preserving the original circuit mapping.They also described a network for rapidly routing trace signals to trace bufferswithout a recompilation; they created a virtual overlay network using unused mul-tiplexers within the routing fabric and a routing algorithm has been used to routetrace signals to the trace buffer. In [68], the same authors described a network flowbased routing algorithm that incrementally connects the maximum number of re-quested signals to trace-buffers. This was made possible by the unique opportunityin signal tracing in which all trace buffer inputs are logically equivalent. In [87]the authors performed a signal parameterization on the user circuit and insertedmultiplexers with parameterized selection bits to guide signals to trace buffers.However, those techniques do not address the incremental re-implementation ofthe trigger circuitry. The necessity to consider both logic and routing makes this asignificantly harder problem.In [69, 70], the authors used an incremental approach to insert additional logic(trigger circuitry in [70] and self-monitoring circuitry in [69]) to existing circuitryby finding a contiguous region of unused logic blocks that fits the entire additionallogic. The difficultly with this approach is that on a highly-utilized FPGA, it maybe difficult to find a region large enough, and with the right shape, to implement theadditional logic. It may be possible to “reserve” space for an additional circuitrywhen compiling the existing circuit, but this interferes with the original circuitmapping or the design may no longer fit into the device. In addition, this wouldrequire knowing the size of the additional logic; if the size of the additional logicchanges, and requires more logic elements, a complete recompile may be neces-22sary.In [65], Hung and Wilton focused on the flexible implementation of the inter-connect between user circuit and trace buffers, and provided a mechanism to imple-ment only simple trigger circuits by using a portion of trace buffers, where triggersignals are connected to the address lines and different values on trigger signalswill cause the RAM block to return a stored value. However, this approach onlysupports basic bitwise pattern-matching trigger functions. Commercial tools sup-port state-based triggering. In Synopsys’s Identify [119] and Xilinx’s Vivado [127],state machines with between 2 and 16 states are possible, while in Intel’s Signal-TapII [74], state machines with between 2 and 20 states are supported. The rangeof number of trigger conditions that can be created is 1 to 16 in Synopsys’s Iden-tify [119], and 1 to 10 in Intel’s SignalTapII [74]. Xilinx’s Vivado [127] allowsup to three conditional branches in each state. However, these capabilities must bespecified before compilation and adjustments to these properties requires anotherrecompilation.HLS DebugHigh Level Synthesis (HLS) tools have become available from FPGA vendors suchas Intel [72] and Xilinx [126], as well as open-source academic LegUp project [30]to allow designers to translate a high level language, such as C/OpenCL, to HDL.This allows application designers to benefit FPGA technology to accelerate soft-ware applications without focusing on the hardware implementation details. Sev-eral research works presented techniques for debugging HLS-generated circuitsand used trace-buffers to record data and later mapped the values back to the orig-inal source code for circuit execution replay [29, 51, 53, 105, 106].23In [105, 106], the authors inserted the debug instrumentation at the C level.In their approach, variables in C code are assigned to new top-level pointers. TheHLS tool then converts these pointers to top-level ports in the generated circuit.These ports can be connected to the trace-buffers to record signal data. The authorsused the HLS scheduling information to determine the states in which a signal iscomputed and record signals only in those states. In [29], instead of instrumentingthe C code, the debug circuitry is added after the RTL is generated by the HLS tool.In this approach, a database was used to store the information about the originalsource code variables to the HLS generated circuit signals. Intel SignalTapII [74]logic analyzer was used to record a subset of signals into trace buffers. When thetrigger condition fires, the recorded data is related back to the original source codevariables to enable source-level debugging. Instead of using existing embeddedlogic analyzers that are built for any RTL hardware designs, Goeders and Wiltonpresented a customized instrumentation based on the information provided withinthe HLS tool and record data only if the signals are updated [51, 53]. They alsoused signal restoration techniques to improve trace-buffer utilization resulting inimproved visibility compared to existing logic analyzers.These techniques mainly focused on optimizing debug circuitry for recordinglonger execution histories to enhance observability. However, the lengthy compi-lation times of the back-end flow have been ignored. Changing the variables tobe observed requires a full recompilation of the design and debug instrumentation.To avoid a full recompilation, Bussa et al. used the incremental design flow ina commercial FPGA CAD tool to re-instrument the design between debug itera-tions [27]. This leads to a 40% reduction in debug turn-around times comparedto a full recompilation. Recently, Jamal and Wilton used HLS scheduling infor-24mation to provide highly customized architectural support for selective variabletracing, selective function tracing, and conditional buffer freeze limited to variableassignment in order to maximize trace buffer utilization [79]. To enable these capa-bilities, an architecture with limited ability to program is compiled with the user’sdesign at compile time. During the debug, bits internal to the instrumentation canbe changed to enable those capabilities without performing a full recompilation.However, as discussed in Section 2.1.1 compiling the design with the instrumen-tation prevents the tool from fully optimizing the user circuit and may change thecritical path of the user’s design. This may change the behaviour of the user’scircuit.In above mentioned works, a unique opportunity in instrumenting HLS-generatedcircuits is that they are created by the HLS tool and the HLS scheduling informa-tion within the HLS tool can be used to create and optimize circuit-specific debugcircuitry for efficient use of hardware resources to enhance observability [105]. Incontrast, this opportunity does not exist in optimizing existing debug cores used forRTL hardware designs which is the focus of this dissertation; the limitation comesfrom the fact that these debug cores are designed to support generic RTL circuitsdesigned by a hardware designer and there is no knowledge of the behaviour of thecircuit.2.2 Coverage Metrics for Functional VerificationCoverage metrics enable designers to measure the adequacy of the design func-tional verification effort and to identify which parts of the design have been ade-quately verified and which parts need more verification. Test coverage has emergedas an essential metric for evaluating the effectiveness of both conventional simulation-25based verification and post-silicon validation. Since some of the coverage metricsused in simulation-based verification can be adopted for post-silicon validation,some of the common coverage metrics used in simulation-based verification arediscussed in Section 2.2.1. Then, Section 2.2.2 provides a discussion on the priorworks and methods for measuring coverage during post-silicon validation.2.2.1 Simulation-based Coverage MetricsThere are a variety of coverage metrics to evaluate the effectiveness of tests duringsimulation-based verification. We will briefly discuss the prominent ones in thissection.Code CoverageCode coverage is the most common coverage metric used in simulation-based ver-ification and is inspired from software testing. Code coverage quantifies how wellthe design HDL code is exercised by the test-cases during simulation. The maincode coverage metrics include statement/line coverage, branch coverage, and tog-gle coverage. Statement/line coverage reports whether a line of HDL code is exe-cuted at least once during simulation. Branch coverage reports whether all possiblebranches of if–else, case , and ternary operator (?:) statements are executed. Togglecoverage reports whether a signal has a transition during simulation. These met-rics can be gathered automatically in simulation and does not require extra effortby designers.Assertion CoverageAn assertion is a design property that should not be violated during simulation runs.An example assertion is ”acknowledge signal ack should become 1, one cycle after26the request signal req becomes 1”. Assertions perform automatic checks and fireimmediately when an internal error (malfunction) is detected, hence the designerknows where to start debugging [47]. Assertions are carefully crafted by a designerwith an extensive knowledge of the behaviour of the design. Property SpecificationLanguage (PSL) and System Verilog Assertions (SVA) are the two commonly usedlanguages for describing assertions [1, 28, 59]. Assertion coverage reports whichassertions were successful and which ones failed in simulation. However, writinga comprehensive set of assertions that cover all functional aspects of the design be-haviour imposes a significant challenge and there is no standard way to determinewhether the assertion set is complete.Mutation CoverageMutation-based verification involves injecting artificial bugs (known as mutants)into the RTL description of the design and checking whether it is caught duringverification [9, 125]. For example, a mutant comprises of modifying operators (e.greplacement of ”+” with ”-”), or incorrect assignment statements. Mutation cover-age reports what fraction of the mutants are caught during verification. Mutationcoverage relies on the quality of the mutants and is computationally expensive sinceit requires injecting a large number of mutants, and each mutant must be insertedone at a time.2.2.2 Post-Silicon Coverage MonitorsAs mentioned before, although simulation-based pre-silicon verification is ubiqui-tous, it is not sufficient to completely guarantee the correct operation of a design,due to its long run-times, and its reliance on realistic testbenches to stimulate the27design. This justifies the need for a post-silicon validation process that is an essen-tial part of any integrated circuit design flow.Unlike pre-silicon verification and manufacturing tests that benefit from well-established coverage metrics and automated methods and tool flows, there is nostandard metric and method to quantify the effectiveness of post-silicon validationtests. Evaluating post-silicon coverage is challenging due to the lack of visibilityinto the internal operation of integrated circuits.Several approaches have been proposed for inserting on-chip coverage moni-tors and coverage measurement in ASICs. In [15], the authors manually insertedcode coverage monitor flags alongside a multiprocessor SoC design code, and con-cluded the area overhead of on-chip monitors can be prohibitive, in one blockreaching 22%. In [14], the authors used an emulated version of a design to measurepath coverage of a validation plan for an ASIC. In [49], the authors instrumented adesign in emulation to gather statistics and then used this data to choose a subset ofcoverage monitors for implementing on silicon. In [22], the authors instrumentedIntel’s Core 2 processor family for post-silicon coverage monitoring. However,only minimal coverage information was extracted due to area overhead of on-chipmonitors. In [6], the authors used their emulation platform for the IBM Power7to run processor-specific coverage monitors, providing an estimate of coverage bythe same test in post-silicon. However, the area overhead of the coverage monitorswas not disclosed.Although assertions have been widely used in pre-silicon verification, asser-tion languages (e.g PSL, SVA) are not necessarily synthesizable to be re-used inpost-silicon verification. Assertion synthesis tools and optimization techniques forsilicon have been developed [23–25, 107]; however, on-chip assertion monitors28have not been widely used in practice since the silicon area overhead required toimplement hardware assertion monitors limits the number of assertions that canbe inserted on-chip. In [50], the authors proposed a technique (TMAC) that em-ploys an embedded FPGA (eFPGA) centralized in a SoC design along with a signalrouting network to enable assertion checking for post-silicon bug detection. How-ever, the routing network area overhead grows with the number of signals to bemonitored for assertion checking. In [120], the authors have shown that the areaassociated with adding all discovered assertions to a circuit can easily exceed 20times that of the area of the circuit itself. Therefore, they proposed a methodologyto rank assertions with the objective of increasing the likelihood of bit-flip detec-tion and based on user-specific area constrains and only synthesized a subset ofthem for on-chip monitoring.As an alternative to on-chip coverage monitors, in [45, 88] the authors per-formed trace analysis on signals recorded on trace-buffers to check whether anassertion failed. They proposed trace signal selection algorithms from an asser-tion coverage prospective to record those signals that are important for assertionchecking. However, due to the limited capacity of trace-buffers, these methodscan perform assertion checking only for the recorded cycles, instead of completedesign execution.In the above mentioned works, adding a large number of hardware coveragemonitors to a design consumes a significant amount of chip area and these solu-tions are limited to a small number of important coverage monitors, and making athorough evaluation of coverage infeasible. While ASIC designs require additionalarea for inserting debug instruments and coverage monitors, circuits implementedon FPGAs typically do not use all pre-fabricated logic and routing resources; this29spare capacity presents an opportunity to be used for debug and coverage analysis.Less work has been done on exploiting FPGA spare resources for inserting analysisinstruments such as triggers and coverage monitors.2.3 Overlay ArchitecturesOverlay fabrics are virtual reconfigurable architectures implemented within theuser logic of an FPGA device. Using overlay fabrics optimized for a specific-class of circuits, much of the flexibility can be removed through abstraction of thephysical FPGA details leading to faster compilation times and portability. Thatis, the designer only needs to map the application onto this pre-synthesized pro-grammable architecture which is a simpler problem and is orders of magnitudefaster (100x-1000x speedup) than invoking the FPGA vendor CAD tools to mapthe application to the fine-grained resources in FPGA. Because overlay fabrics arevirtual, portability is also achieved across any FPGA device that can implement theoverlay architecture. However, overlays are limited by their area and performanceoverhead. A growing amount of research has investigated overlays for specificapplications and proposed fine grained and coarse grained overlay architectures.A fine grained overlay (that is ”FPGA on an FPGA”) is defined as a virtualFPGA that is implemented by resources within a physical FPGA [26, 62, 86, 97].In [97], the authors proposed a virtual fine-grained architecture to provide portabil-ity. However, the proposed virtual FPGA approach results in 100X more hardwareusage compared to implementing the circuit directly onto an FPGA. The area over-head of fine-grained virtual architecture, ZUMA, was reduced to 40X by usingphysical LUTs to implement virtual LUTs and routing multiplexers [26]. In [86],the authors proposed a fine-grained overlay architecture providing portability for30custom instruction set extensions of softcore CPUs; the overlay interconnect wasdirectly mapped to the physical switches resulting 2X-5X area overhead comparedto direct implementation of custom instruction extensions into an FPGA with anorder of magnitude area overhead reduction compared to related approaches.As the fine-grained overlays are very costly to implement primarily due to theirfine-grained interconnection network, coarse-grained overlays have been proposedto offer fast compilation, hardware design at a higher level of abstraction, improvedprogrammability and runtime management. Coarse-grained overlays range froman embedded processor-style fabric which can be programmed using software [33,115, 129], to an infrastructure that is specifically optimized for accelerator-type cir-cuits [57, 77, 78, 93] or collections of small processing units [31, 32]. Nonetheless,these overlays suffer from a high area overhead compared to a customized FPGAcircuit implementation due to the resources required to provide the programma-bility of the overlay architecture, particularly the routing network. For example,in [90, 118], the authors presented coarse-grained virtual FPGA fabrics, which arearray of processing units specialized for computation of common image-processingkernels. Although this resulted a compilation time speedup of 554x compared to anFPGA CAD tool, the area overhead of this approach was 2.5X of the area of a cir-cuit directly implemented on an FPGA, and the virtual interconnect was identifiedas the main source of the overhead. In [90], the authors explored the design spaceof the virtual interconnect and optimized the virtual interconnect to reduce the areaby approximately 50% at the cost of 16% reduction in routability compared to theprevious work. Additionally, optimizating and reducing the flexibility of the vir-tual interconnect resulted in 2.4X speed up in compilation time compared to theprevious work. Similarly, in [38], the authors used an overlay fabric specialized31for control flow of common image-processing kernels. This approach required 17xmore logic elements compared to Verilog implementation of the circuit.We will show in this thesis that fine-grained debug overlays with reduced flex-ibility in the routing network can be made efficient by carefully matching the over-lay to the underlying fabric, and adapting the overlay to use only the resourcesleft unused by the user circuit. In this way, the area overhead can potentially bereduced to zero.2.4 SummaryIn this chapter, an overview of the existing post-silicon validation techniques forASICs and FPGAs was provided. To counter the limited observability of hardwareand provide support for root-cause analysis, the most common technique is to usetrace-based instrumentation containing trace buffers to record data for analysis. Tomake a more efficient use of these limited on-chip memories, trigger circuitry isused to identify the time at which recording in a trace buffer should start and/orstop. It was explained that in traditional FPGA debug systems, debug instrumentsare inserted into the design before the design is mapped to the FPGA. The set ofsignals that are to be recorded and the trigger condition must be determined whenthe core is inserted. As the engineer refines his or her view of a suspect bug, he orshe may wish to change the signals being recorded, or the time window in whichdata is stored in the trace buffer, meaning the entire design needs to be recompiled,imposing significant compile time overhead and consequently increasing debugcost; this chapter also reviewed work related to these limitations.Most previous debug techniques have focused on increasing visibility into acircuit by collecting more signal data to help designer to identify the root cause of32an unexpected behaviour; there has been very little attention to the efficient inser-tion of trigger logic in FPGAs. The necessity to consider both logic and routingmakes trigger implementation a significantly harder problem. Hence, this thesismainly focuses on methods that provide efficient and rapid triggering capabilitieswithout modifying the user circuit, eliminating the need of a full recompilationresulting in rapid debug iterations.This chapter also reviewed some of the common coverage metrics used insimulation-based verification. It provided a discussion on the prior works andmethods for collecting coverage data during post-silicon validation. Adding on-chip coverage monitors imposes significant area overhead. Through the contribu-tion of this thesis, we demonstrate an efficient, and flexible framework that canbe used to implement on-chip coverage monitors for code coverage measurementwhile reducing the area overhead of on-chip coverage monitors during FPGA-based validation.33Chapter 3Incremental CompilationTechniques for FPGA DebugAs mentioned in Chapter 2, in traditional debug systems, any change in the debugcircuitry requires a full recompilation, imposing significant compile time overheadand consequently increasing debug cost. This can be addressed using incrementalrouting techniques, such as those in [64, 66] to reclaim unused routing resourcesto incrementally build trace networks without modifying the user circuit. How-ever, those techniques do not address the incremental re-implementation of thetrigger circuitry. This chapter explores the feasibility and challenges of applyingincremental-compilation techniques to insert the trigger circuitry without requiringa full recompilation.The proposed debug flow is shown in Figure 3.1. If a designer wishes to mod-ify the trigger circuitry, incremental techniques are used to only insert the triggercircuitry rather than recompiling an entire design. Besides avoiding the undesir-34Figure 3.1: Incremental trigger insertion flow.able compile-time penalty, this technique has the advantage of not modifying themapping of the user design during debugging.In our approach, which we call incremental distributed trigger insertion, wedistribute the logic cells that make up the trigger function across logic elementsin the FPGA fabric that are not used by the user circuit. These logic elementsmay not be contiguous. Intuitively, this will allow the trigger logic to take betteradvantage of any left-over space after mapping the user circuit, adapting to changesin the size or make-up of the trigger function. After placing the logic cells inunused locations, they must be connected to each other and to the user circuit usingincremental routing techniques. The fact that we are not allowing any changes inthe user circuit constrains the routing; using logic elements in parts of the designin which congestion is high may result in routing failure. Section 3.1 presentsour incremental techniques for inserting the trigger circuitry. Section 3.2 describesour evaluation methodology and Section 3.3 evaluates our techniques using theacademic tool VPR [96]. Lastly, Section 3.4 will provide a conclusion and closing35Figure 3.2: Overview of incremental trigger insertion CAD flow.remarks. Parts of this chapter were published in [42].3.1 Incremental Distributed Trigger InsertionFigure 3.2 shows the procedure that we use to implement incremental trigger in-sertion. In this figure, the left side shows the user design compile flow. First theuser circuit is packed and placed into FPGA fabric and then the circuit is routed.On the right side of Figure 3.2, our incremental trigger insertion flow is performed.In this chapter, we assume that the user specifies a trigger function and selects asubset of signals from all internal signals of the user circuit as inputs of the triggerfunction (in our experiments, we vary the size of the trigger function and hencethe number of trigger input signals). We assume that the trigger circuit is parsed36Figure 3.3: Overview of a Logic Cluster (LC) and Logic Element (LE).and converted to a technology-mapped netlist using a technology-mapping algo-rithm such as [102]. The goal of our technique is to distribute the logic cells of thetrigger netlist to spare logic resources on the FPGA fabric (we refer to this step asthe pack-place phase), such that all of the required connections can be made in therouting phase. The insertion is constrained to not modify the packing, placement,and routing of the original user circuit – we do not allow the rip-up and re-route ofany user circuit nets. Therefore, it is crucial to consider routability in trigger logicpack-place phase in order to lead to a successful routing. An overview of the phys-ical FPGA architecture and the algorithms for pack-place phase and incrementalrouting phase are provided below.3.1.1 Target FPGA ArchitectureWe assume the following FPGA architecture in this chapter, however, the over-all technique will apply to any FPGA. We assume an FPGA consisting of a largenumber of Logic Clusters (LC). Using the terminology of [20], each LC itself con-tains N Logic Elements (LE) connected together through a local routing networkas shown in Figure 3.3. Each LE consists of a K-input Look-Up Table (LUT) anda flip-flop. The number of inputs to each cluster is I. The LC output pins are con-37nected to global routing wires, and hence, provide connectivity between the LEs’output and the global interconnect. A family of FPGA architectures can be de-scribed using these three parameters: N, K, and I. In this chapter, we assume anFPGA architecture similar to the Intel Stratix IV device described in Section Incremental-Distributed Timing-Aware Trigger Pack-placeIn this subsection, we describe how we select which unused LEs in our FPGAfabric will be used to implement the trigger logic. Our approach is to (1) select aseed location based on the positions of all the user selected trigger signals that willbe tapped to implement the trigger function and (2) position all trigger logic cellsin unused slots near the seed.We first describe how we select the seed location. Intuitively, we wish to placethe trigger logic close to the logic elements that drive the trigger signals to mini-mize the wirelength of the trigger circuits as well as the impact on the user circuittiming. Conveniently, since the trigger logic is inserted after the original circuitis mapped and fully routed, both the position and the timing slack of each triggerinput is fully known. Therefore, to find a location for the seed, we find the loca-tion of the source of each trigger input signal in the user circuit, and then weightthese positions by the signal’s timing criticality (a signal with a higher criticalityhas a lower slack, as described in [20]). An example is shown in Figure 3.4; in thisexample, Signal2 has a higher criticality than Signal1, meaning the seed location(x=4, y=2) is closer to the source of Signal2 than to the source of Signal1.Next, we choose sufficient unused LEs as close to the seed location as possible.Figure 3.4 shows our approach; after locating the seed point, we pack the remainingtrigger logic in the empty logic elements surrounding this point in a spiral fashion,38Figure 3.4: Finding spare LCs around the seed.Packer will not consider fullyused logic clusters.using an algorithm that will be described in the next subsection. The use of thisspiral placement strategy is intended to keep the logic elements implementing thetrigger logic as close together as possible. Although an iterative placement strategycould be employed, such a strategy would have a longer runtime.3.1.3 Two-level Congestion Awareness in Trigger Pack-placeAs described previously, it is important to consider routability in the pack-placephase in order to prevent routing failure in routing phase. If the routing resourcesaround the seed point are heavily used by the user circuit, it may result in fail-ure in trigger logic routing phase. To address this issue, we added two levels ofcongestion awareness to the pack-place phase: LC-level congestion awareness andLE-level congestion awareness. Algorithm 1 summarizes the two-level congestionaware pack-place phase.After finding the seed point as described above, trigger logic packer-placerfinds a LC with unused LEs. Then, the CongestionLevel1 estimates the congestionof the routing nodes connected to the candidate LC’s output pins. The congestion is39Algorithm 1: Two-level congestion-aware pack-place algorithmInput: trigger cells to place /* list of trigger logic cells to bepacked-placed*/Output: Determine whether trigger logic has been packed-placed intoFPGA1 /*select a logic cluster as the seed location to start pack-place*/2 LC = calculate seed point3 while trigger cells to place is not empty do4 if CongestionLevel1(LC)<threshold then5 T = select a trigger logic cell from trigger cells to place6 unused LE list = unused LEs in LC7 foreach LE in unused LE list do8 if LE is not congested then9 pack T in LE of LC10 remove T from trigger cells to place11 if trigger cells to place is empty then12 return True13 T = select a trigger logic cell from trigger cells to place14 LC = go to next logic cluster15 if LC is empty then16 return Falsedetermined by counting the number of LC’s routing nodes used by the user circuit.If this congestion is larger than a predetermined threshold, then this LC is not usedto implement the trigger logic; Figure 3.5 shows an example of where a LC wasnot selected to implement trigger logic.Even if an LC has a congestion value lower than the predetermined threshold,it is still possible that the individual LEs inside the LC may suffer from high con-gestion on their outputs; using a highly congested LE can make it impossible toroute its output through the global routing. Therefore, each LE in a selected LC ischecked to see if its output is connected to fully used routing nodes, and if so, this40Figure 3.5: Skipping a highly congested LC.LE is not used. Once a non-congested LE has been found inside a non-congestedLC, the packer-placer inserts the trigger logic inside the LE (lines 6-8 in Algo-rithm 1. The algorithm repeats until sufficient logic elements have been selected toimplement the trigger function.3.1.4 Incremental RoutingAfter the logic elements have been selected, they are connected to each other, thetrigger inputs are connected to the appropriate signals in the user circuit, and thetrigger output is connected to on-chip memory blocks. We use the incrementalrouting technique described in [66] to implement the required connections. Thistechnique uses routing tracks and switches that are not used by the user circuitto implement the required connections. Since routing can only use routing tracksthat have not been used by the user circuit, congested circuits may lead to routingdifficulties. To improve routability, we enhanced the routing algorithm to use un-used LUTs as route-throughs. During routing, in the forward wavefront expansionphase, if an unused LUT is encountered, the router is able to expand the searchwavefront through the LUT to its output pin.Note that it is also necessary to connect the trace signals to the trace buffers aswell; although we do not perform this step in this chapter, techniques such as those41in [64] can be used.3.2 Evaluation MethodologyIn this chapter, we investigate the feasibility of our approach using open-sourceVPR FPGA mapping tool, which is part of the academic Verilog-To-Routing (VTR)project [96]. Using an open-source tool was necessary because demonstrating ourapproach requires low-level resource manipulation to incrementally insert the trig-ger circuit into the bitstream after the user circuit has been compiled, whereascommercial tools do not provide this ability as device information is proprietary.Additionally, using VPR we can investigate the impact that changes in the routingsupply may have on the technique’s overall effectiveness.3.2.1 CAD FlowWe extended VPR to support incremental-distributed trigger insertion flow as shownin Figure 3.6. In this figure, the left side shows the user design compile flow. Thedesign is parsed and converted to a technology-mapped netlist and .blif file is pro-duced. Logic elements are then packed into clusters, producing a .net file. VPRthen places the LCs onto a minimum size FPGA fabric and then routes the circuit,producing .place and .route files.On the right side of Figure 3.6, we show how incremental trigger insertion isperformed on top of a fully place-and-routed user circuit. This new flow, makes nomodification to the initial packing, placement and routing of the user circuit. Sim-ilar to the original flow, the trigger circuit is parsed and converted to a technology-mapped netlist. Then, the pack and place phases are performed by timing awareand congestion aware algorithms presented in Section 3.1. The algorithm from42Figure 3.6: Incremental trigger insertion CAD flow.that section was incorporated into T-VPACK. After inserting the trigger logic, anyspare RAM memories are reclaimed as trace buffers.Finally, the trigger circuitry is routed using the directed search algorithm usedin VPR. We modified the directed search algorithm to ignore adding routing nodesthat are fully occupied by the user circuit to the routing priority-queue; this reducesthe search space during trigger routing. Since the trigger circuit is very small com-pared to the user design, the compilation time of the trigger circuit is significantlylower than a full design recompilation.43Table 3.1: Benchmark Summary.6-Input I/O Logic Cluster RAM Global Free LCs Free LEsCircuit LUTs FFs Wmin Used All Used All Used All Signals (%) (%)bgm 32384 5362 80 289 2400 4111 4200 0 120 23476 2 25LU8PEEng 22634 6630 92 216 1962 2667 2745 45 80 16643 3 18LU32PEEng 76211 20898 128 216 3552 9105 9213 150 252 55873 1 17mcml 101858 53736 86 70 3200 7350 7400 38 208 52226 0 9mkDelayWorker32B 5590 2491 76 1064 1344 916 1302 41 42 4686 30 47mkPktMerge 232 36 44 467 832 18 494 15 16 515 96 22or1200 3054 691 74 779 800 288 475 2 12 2602 39 11raygentop 2148 1423 60 544 640 277 280 1 9 2126 1 35stereovision0 11460 13405 52 354 1312 1240 1271 0 30 8358 2 31stereovision2 29943 18416 92 331 2848 5889 5963 0 154 35386 1 573.2.2 Experimental SetupUsing the tool flow described above, we packed, placed, and routed a set of het-erogeneous benchmark circuits that are supplied with the VTR project onto thesmallest FPGA array that can accommodate the circuit using the default VPR ar-chitecture based on the Intel Stratix IV characterized by logic cluster size N = 10,look-up table size K = 6, cluster input and output flexibilities of Fc−in = 0.15 andFc−out = 0.10, respectively, channel segment length L = 4, and inputs per clusterI = 33. For each benchmark, an FPGA size was chosen to be the smallest squarethat fit the benchmark circuit.The benchmark circuits used in this chapter are shown in Table 3.1. In this ta-ble, Wmin is the minimum channel width for which the circuit (without any debuginstrumentation) can be fully routed; this quantity is often used as a proxy for theroutability of a given architecture using a given CAD tool. The table also showsthe total available resources in the minimum-size FPGA which the circuit fits intoas well as the resources occupied by the user circuit. The number of global signals(both combinational and sequential) is also listed; these are the signals from whichwe choose trigger inputs. The column labeled Free LCs shows the percentage ofthe available LCs which are completely empty. When running VPR, we have dis-abled the default –allow-unrelated-clustering feature; this better mimics packing44in a commercial tool [67]. For each circuit, unused RAM blocks in the FPGA arereclaimed as trace buffer memories as in [66]. Circuits mkDelayWorker32B andmkPktMerge use most of the available memories of the FPGA. For bgm, stereovi-sion0, and stereovision2, all available RAM blocks are reclaimed as trace bufferssince they are not used by the user circuit. The column labeled Free LEs shows thepercentage of unused LEs inside LCs which have been partially used by the usercircuit.In the results presented in the next section, we assume a trigger function thatis a bitwise AND of between 32 to 1024 trigger signals. The trigger inputs arerandomly selected from the circuit’s global signals; we use multiple runs for eachtest and choose a different set of trigger signals for each run. As described inFigure 3.6, the trigger circuit is converted to a technology-mapped netlist.3.3 Evaluation3.3.1 Trigger InsertionTo investigate the effect of a relatively congested routing scenario on trigger inser-tion, we assumed a channel width of Wmin + 20%. In this case, we were able tosuccessfully insert all 32 to 1024 input trigger functions for all benchmarks listedin Table 3.1, with some exceptions. Inserting a trigger function with more than 256inputs was not successful due to routing congestion for mkPktMerge and raygentopbecause these benchmarks are too small to support such large triggering functions.For example, mkPktMerge has only 515 global signals, and inserting a trigger func-tion with size of 256 inputs requires adding 335 nets (which is more than 50% ofsignals in the entire design) that imposed a very high routing congestion that could45not be resolved during incremental routing phase. Inserting a trigger function withsize of 1024 failed for or1200 due to the same reason (up to 512 was success-ful). Inserting the 1024-input trigger for stereovision0 and stereovision2 also failedbecause of high congestion during the incremental routing phase.To determine how our technique works in situations with less routing conges-tion, we repeated the experiments with the channel width set to Wmin + 30%. Byincreasing the channel with to Wmin + 30%, we are able to successfully insert alltrigger functions, with the exception of inserting trigger functions with 512, and1024 inputs for mkPktMerge and raygentop, respectively. As discussed before, thisis because these circuit were too small and inserting such a large trigger circuitsresulted in very high routing congestion that could not be resolved during incre-mental routing phase.As described in Section 3.1.3 we predetermined a threshold value for evaluat-ing the congestion of each LC. We experimentally found that a congestion thresh-old of 60% for trigger functions with size 32 to 512 was enough to successfullyinsert the trigger function. For a trigger function with size 1024, we assumed acongestion threshold of 50% in the pack-place phase.3.3.2 Effect on Flow RuntimeFigure 3.7 shows the total execution time (in seconds) and the breakdown for Pack-place phase v.s. Routing phase for benchmarks assuming a trigger with 256 inputs.We choose 256 inputs since it was the largest trigger size that was successfullyinserted for all benchmark for Wmin + 20%. In Figure 3.8, we show how the trig-ger insertion runtime using our incremental approach is reduced compared to thetraditional flow where a recompilation is required to implement the trigger circuit.46050100150200250300bgmLU8PEEngmcmlraygentopor1200mkPktMergemkDelayWorker32Bstereovision0stereovision1LU32PEEngPack- Pl ace v. s.  Rout ing in SecondsPack-PlaceRoutingFigure 3.7: Trigger insertion runtime breakdown.20406080100120140160180bgmLU8PEEngmcmlraygentopor1200mkPktMergemkDelayWorker32Bstereovision0stereovision1LU32PEEng AVGSpeedupFigure 3.8: Incremental trigger insertion runtime saving in comparison to cir-cuit recompilation.As shown, on average, incremental-distributed trigger insertion is 80X faster thana full recompilation. This is mainly due to avoiding re-executing computationallyexpensive parts of the mapping flow for the user circuit.47Table 3.2: Effect of Trigger Insertion on Circuit Critical Path Delay (ns).Trigger Circuit SizeCircuit Original Delay 32 64 128 256 512 1024bgm 19.71 23.33 23.84 23.68 24.13 24.65 26.03LU8PEEng 89.63 89.57 91.32 94.55 96.35 97.09 97.28LU32PEEng 91.93 92.05 92.58 92.96 93.53 93.61 97.55mcml 66.58 66.58 66.58 70.12 70.6 72.26 72.46mkDW 6.46 7.83 8.4 8.8 9.36 9.83 10.3mkPktMerge 4.93 4.98 5.35 5.82 5.76 6.32 -or1200 11.6 13.27 14.4 14.6 15.15 15.9 16.8raygentop 4.22 5.38 6.42 6.76 7.32 8.07 -stereovision0 3.47 5.52 5.8 6.77 6.9 7.81 8.78stereovision2 11.93 15.81 16.66 16.62 17.37 17.7 Effect on Circuit DelayThe critical path delay of a circuit increases if the delay of any routes in trigger cir-cuitry (including the routes between logic elements of trigger function, the routesform trigger signals to the trigger logic clusters, the routes from trigger output to allthe trace buffers) is greater than the original circuit critical path delay. Therefore,circuits with a longer critical path delay are less sensitive to the trigger insertionsince it is less likely that the inserted routes due to trigger circuitry become longerthan the circuit critical path.Table 3.2 presents the critical path delay of each benchmark circuit before andafter inserting different size trigger circuits. As shown in Table 3.2, circuits witha short critical path delay, such as or1200, mkDelayWorker32B, raygentop, andmkPktMerge, experience higher increase in delay after inserting the trigger func-tion. On the other hand, LU8PEEng, LU32PEEng, and mcml are less sensitive tothe trigger insertion due to a longer critical path delay. Investigating the criticalpath delay of bgm, stereovision0, and stereovision2, we found that the critical path48of these circuits becomes longer mainly because all available memory blocks inthe FPGA array are reclaimed as trace buffers. Hence, the single trigger outputconnects to all these memories across the chip creating a longer critical path.3.4 SummaryThis chapter explored the feasibility of using incremental compilation techniquesfor inserting debug logic, in a way that does not require a full design compila-tion. We distribute the logic cells that make up the trigger function across logicelements that are not used by the user circuit. After placing trigger logic in unusedlocations, they are connected to each other and to the user circuit using incremen-tal routing techniques. From the study described in this chapter, we can concludethat incremental compilation techniques can be used to accelerate debug iterations.However, the success of our incremental-distributed approach in inserting triggercircuits over spare resources of a fully placed-and-routed design such that its map-ping is completely preserved depends on routing congestion in the user circuit. Toimprove our techniques in using incremental compilation techniques for rapid trig-gering, the contributions in Chapter 4 are concerned with increasing the likelihoodof a successful trigger implementation.49Chapter 4Rapid Triggering Frameworkusing FPGA OverlaysChapter 3 explored using incremental compilation techniques for trigger imple-mentation using spare resources after user circuit compilation. The main limitationof this approach was that it was often difficult to make the required connectionswhen inserting trigger logic into a circuit because of the routing congestion in theuser circuit.In this chapter, we aim to increase the likelihood of making successful triggerimplementation by improving the routability of trigger insertion by introducinga new instrumentation framework. To achieve this, a virtual overlay architectureis compiled to an FPGA only once after user circuit compilation, and then rapidlyconfigured to implement instrumentation at debug time as shown in Figure 4.1. Al-though overlays can be notoriously inefficient, we will show that ours can be madeefficient by carefully matching the overlay to the underlying fabric, and adapting50Figure 4.1: Overview of using a virtual overlay for FPGA debug.the overlay to use only the resources left unused by the user circuit. In this way,the overhead can potentially be reduced to zero.This chapter is organized as follows. Section 4.1 describes our new FPGAdebug flow. Section 4.2 describes the architecture of our overlay. Section 4.3presents a non-intrusive and adaptable approach for building the overlay architec-ture by reclaiming spare resources on FPGAs while preserving the underlying usercircuit implementation. Section 4.4 presents CAD techniques to rapidly config-ure the overlay architecture, implementing the trigger circuitry. Section 4.5 detailsthe experimental methodology and steps that were performed to evaluate our over-lay architecture and CAD techniques. Section 4.6 presents the results from theseexperiments in terms of flexibility, runtime, and area overhead. This chapter issummarized in Section 4.7. Parts of this chapter were published in [44].4.1 Debug Overlay FlowFigure 4.2 shows our debug framework flow. There are two major phases: (1)compile-time; (2) debug-time.51Figure 4.2: Proposed debug flow using an overlay for FPGA debug: compiletime and debug time phases.Compile time. First the user circuit is fully compiled onto FPGA fabric withoutany debug instrumentation. Then, the user circuit is frozen, and the overlay fabricis incrementally constructed, only once, using only those resources (logic blocks,memories, and routing tracks) that are not used by the user circuit. Because of thisseparation, we ensure that the user circuit characteristics are not affected by the in-strumentation. This also allows us to adaptively size the amount of instrumentation(and hence its flexibility) depending on how much of the FPGA is unused by theuser circuit.Debug time. During each debug iteration in which a new trigger function is re-quired, the function implemented in the overlay fabric is encoded in a set of FPGAconfiguration bits; these bits can be updated possibly using partial reconfiguration.A mapping algorithm is required to map the debug circuitry to this set of configu-ration bits; importantly, this algorithm is much faster than a full recompilation ofthe user circuit and instrumentation. This improves debug productivity, both byreducing the debug turn-around time, and also by ensuring that the user circuit isfixed and does not change as the instrumentation is changed. Since the overlay52architecture has a pre-synthesized routing network, the trigger mapping algorithmcan be optimized to find a routable solution at debug time, whereas in the incre-mental approach presented in Chapter 3 there was not such an overlay to guide therouting.4.2 Proposed Overlay ArchitectureIn this section, we describe our overlay architecture that is optimized to implementsmall logic functions that we expect to be commonly used as trigger functions. Thearchitecture is designed to be flexible enough to implement a wide variety of suchfunctions, area-efficient, and adaptable enough to be implemented on top of a usercircuit.The architecture consists of a square grid of cells. Each cell contains betweenfour and no logic elements (LEs) connected using a fully-connected crossbar net-work inside of the cell. Each LE consists of a lookup table (LUT) and a flip-flop.Each look-up table has ko inputs. Although other options for implementing overlaycells are possible [36, 121], building an overlay around LUTs has a number of ad-vantages. First, a look-up table is universal and can implement any logic functionof its inputs, simplifying the interconnect between the cells. Second, we can lever-age a large body of lookup-table technology mapping work [35, 102]. Third, aswe will show in Section 5.1.2, this strategy allows for an efficient mapping of theoverlay to the underlying FPGA. We define no = N where N is the size of each logiccluster in the target FPGA architecture (in our experiments, we assume N = 10).Hence, cells in the same overlay fabric may be of different sizes between 4 andN; as will be described in the next section, this provides a mechanism to adapt theoverlay to the availability of empty LEs in the underlying user circuit implemen-53Figure 4.3: Simplified overlay architecture.tation. For the same reason, we define ko = K where K is lookup-table size in thetarget FPGA architecture, in our experiments, we assume K = 6.As shown in Figure 4.3, cells are connected using a 2D torus. Each cell hastwelve input pins and four output pins that send and receive signals to and fromneighbouring cells. Each output signal is connected to three sinks: one primarysink and two secondary sinks. The primary sink connects the output pin to the oneof the cell’s direct neighbour (e.g. to the north side). The primary sink for each ofthe four outputs connects to a different nearest neighbour (since this is a torus, eachcell has four nearest neighbours). Although the primary sinks would be sufficientto provide complete connectivity between all cells in the overlay, additional routingflexibility is provided by the two secondary sinks. The first secondary sink of eachoutput connects to a different nearest neighbour than the primary sink (e.g. theeast side). The second secondary sink connects to a cell that is 1-hop away (nota nearest neighbour cell). This pattern is shown in Figure 4.3. We experimented54with higher and lower fanout counts for each output signal, however, we foundthis strategy gave a good balance between flexibility of the routing network andthe routing congestion when implementing the routing network on the underlyingFPGA.As described above, some cells will contain more than four logic elements.Those logic elements do not drive any other cell; instead, they only drive otherlogic elements within the same cell through the local feedback crossbar.4.3 Overlay Implementation StrategyIn this section, we describe our CAD techniques for mapping the overlay archi-tecture on top of the user circuit without modifying the packing, placement, androuting of the underlying user circuit – we do not allow the rip-up and re-route ofany user circuit nets, nor any relocation of any user circuit blocks.Rather than describing the overlay fabric in a hardware description language(HDL) and compiling it using the normal FPGA CAD tools, we instead perform adirect mapping of the overlay primitives to the underlying FPGA primitives. Forexample, the overlay fabric consists of a number of look-up tables; we directly mapeach of these look-up tables to a look-up table in the underlying FPGA. Similarly,each routing path in the fabric is mapped to a specific routing path in the underlyingFPGA. This leads to a much more efficient overlay implementation than compilinga HDL version using normal CAD tools. Furthermore, this technique avoids adownside of many overlay approaches: the high area overhead of implementing an“FPGA on an FPGA” is avoided. However, implementing such an overlay needs toemploy custom incremental compilation tools. Details of the algorithm for overlayimplementation are provided below.551: procedure OverlayPlacementAndRouting(FPGA Resources)2: N = SelectLCsWithSpareLEs(FPGAResources)3: nxn = CreateLargest2DTorus(N)4: InititalPlacement(nxn)5: OverlayPlacement = SA-Placer()6: FirstLevelRouting=PathFinder()7: OverlayRouting=PathFinder(FirstLevelRouting)Figure 4.4: Overlay placement and routing pseudocode.4.3.1 Algorithm to map Overlay on top of User CircuitSelecting exactly which look-up tables and routing resources in the underlyingFPGA are used to implement the overlay fabric is a CAD problem; our algorithmto perform this is adaptive and best-effort in that (a) the number of cells in theoverlay depends on the number of unused logic resources, (b) each cell in theoverlay is sized according to the number of logic elements (LEs) unused withineach logic cluster (LC), and (c) if certain connections in the overlay are difficult orimpossible to implement, they are removed from the overlay and not implemented.Figure 4.4 shows the pseudocode for our mapping algorithm; details are providedbelow.SelectionGiven the mapping of the user circuit, the algorithm first identifies all logic clustersinside the FPGA which contain at least four unused logic elements (unused by theuser circuit) as well as at least 12 unused input pins. This is due to the fact thateach cell in the 2D Torus has four output pins that send and receive signals to andfrom neighbouring cells. These logic clusters represent potential sites in whichthe overlay can be implemented. We define the size of each site as the number of56empty logic elements in the corresponding cluster (this will be between four andthe total number of logic elements in the cluster). We then prune out those sites inwhich the output pins are connected to routing tracks that have already been usedby the user circuit.We then choose some subset of these sites and create the largest possible over-lay fabric out of the selected sites (in the experiments in this chapter, we first selectall potential sites to maximize the size of the implemented overlay fabric, and latershow the impact of relaxing this number somewhat). This overlay consists of agrid of cells; each cell corresponds to one site. The size of the site determines thesize of the corresponding cell (recall from the previous section that the number oflogic elements in each cell varies across the overlay). The result is a logical griddescribing the overlay.PlacementSince, at this stage, there is a one-to-one mapping between sites with sufficientspace and cells in the overlay, we could simply map each cell to one site. However,it is likely that the available locations unused by the user circuit do not lie in agrid-like pattern; to adapt to the underlying pattern of available cells, we performa simulated annealing placement algorithm to position the cells in the overlay ontoavailable sites in the fabric. The cost function used by the annealer is the totalwirelength of the overlay wires which is estimated by the bounding box heuristic.We only consider swaps which result in legal placements; a cell that consists of blogic elements can only be moved to a site that contains at least b empty LEs. Thisensures that the final placement result is legal.57Creating Overlay Routing NetworkAfter overlay placement, we attempt to create all connections between the cells.We employ the routability-driven PathFinder [100] algorithm to iteratively resolverouting congestion. Since we are restricted to using routing resources left unusedby the user circuit, it may not be possible to find a legal routing solution in whichall overlay wires are implemented. This is not a problem; in this case, we constructa fabric with “missing” connections. The mapping of the trigger function to theoverlay fabric can attempt to construct a mapping that does not use any of themissing connections. This best-effort approach for mapping allows us to implementlarger overlay fabrics than would otherwise be possible, since 100% routing is notrequired. To implement this best-effort routing, we perform a legalization heuristicafter Pathfinder has completed (i.e. after 15 routing iterations). This heuristiciteratively discards illegal connections (from the overlay) that are overusing thelargest number of routing resources until a legal routing solution is achieved. Inthis way, our approach can be used for user circuits that already stress the FPGArouting.To minimize the impact of removing connections from the overlay routing net-work, we prioritize the connections in the overlay routing network, and route thenetwork in two steps. We first use Pathfinder to route the primary sinks of eachconnection (primary sinks are defined in the previous section). We then use thesame algorithm to route the secondary sinks of each connection. This approachensures that the primary sink connections have higher priority for using routingresources than the secondary connections.584.4 CAD for Trigger MappingAt debug time, the overlay architecture can be configured to implement the desiredtrigger circuitry. This section describes the algorithm that maps the trigger circuitryto the overlay architecture.The CAD tasks that must be performed to map a trigger circuit to the overlayare the same as in the traditional FPGA mapping flow: packing, placement, androuting [20]. However, as described in Section 4.3, the overlay architecture haslimited flexibility in the routing network. Placement without considering the rout-ing network can result in an un-routable solution. Therefore, it is important thatthe placement algorithm is routing-aware to make the placement solution moresuitable for the given routing architecture.4.4.1 Routing-aware Placement HeuristicWe choose simulated annealing to perform the placement of the trigger onto theoverlay, due to its success in the FPGA placement field [20, 21, 40, 99] as wellas its ability to target irregular architectures. The cooling schedule of the routing-aware placer is an adaptation of VPlace [99]. Initially, each logic element of thetrigger netlist (i.e. LUT) is assigned to a random unoccupied logic element insidethe overlay. Our algorithm is LE-based, in that individual LEs are swapped (as op-posed to cluster-based placement solutions as in [20]). This allows LEs to migratebetween clusters, providing better routability in the final solutions. Although thisis more computationally-intensive, the trigger circuits being placed are relativelysmall (typically hundreds of logic elements).The cost function used in the placement algorithm is the product of two terms:the routing hops cost and the net routing cost. Each will be described below.59Figure 4.5: Simple trigger netlist.Routing Hops CostCells that are not neighbouring can be connected by routing through one or moreintermediate cells, and configuring these intermediate cells as pass-through buffers.A placement solution that requires fewer routing hops is more likely to be routablethan one that requires more routing hops since it requires fewer overall wires. Toestimate the Routing Hops Cost, we use the bounding box size of a net, and sumthis over all nets.Net Routing CostFigure 4.6 shows how a simple trigger netlist of Figure 4.5 can be mapped to over-lay cells. LE1 and LE2 are placed in the same cell, sharing an input pin as well asconnecting to LE4 directly. Placing LE3 in the same overlay cell as LE1 and LE2will result in routing failure; in case (1) the output pin of the LE of the overlay cellis only connected to the local routing crossbar and can not be connected to LE4. Incase (2), although the output pin of the LE of the overlay cell can indirectly reachLE4 using one routing-hop through C3, all input pins of C1 are occupied by LE1,and LE2, resulting in an unavailable input pin for LE3. Placing LE3 in C4, can60Figure 4.6: The trigger netlist of Figure 4.5 mapped to logic elements of over-lay cells.61indirectly connect to LE4, resulting a routable trigger mapping.The net routing cost guides the annealer to a routable solution. The routingcost of each net described by Equation 4.1.NetRoutabilityCost = α×NumIndirectSinks+β ×NumBlockedOPINs+β ×NumUnavailableIPINs (4.1)The overall routability cost is the sum of this quantity over all nets.Indirect sinks are slightly penalized to encourage LEs to move to the samecluster or neighbouring clusters to make use of the intra-cluster connections orpre-routed connections of the overlay, respectively. Blocked output pins are heavilypenalized since they will result in an immediate routing failure. Unavailable inputpins are heavily penalized for the same reason. This term also encourages LEs thathave common input signals to move to the same cluster to make better use of theinput pins of each cluster. We have experimentally found that setting α to 5 and βto 20 gives good results.We employ an additional optimization during the swap evaluation; if the swapwill move an LE to a location that its output pin can not connect to the overlayrouting network, which results in a blocked output pin, we alleviate unroutabilityby relocating the LE to another unoccupied location inside the same cluster. In thisway, the output is connected to the overlay routing network resolving the blockedoutput pin issue. Consequently the swap still has a chance to be accepted towardan optimized implementation (case (1) and (2) in Figure 4.6).624.4.2 Routing AugmentationDuring routing, most connections are between neighbouring cells, and routing istrivial. For nets between non-neighbouring cells, we use a routing algorithm basedon the routability-driven PathFinder [100] algorithm. The goal of this algorithmis to minimize the number of routing hops of each net, thereby maximizing theroutability of the circuit.4.4.3 Merging user Circuit with Trigger CircuitThe final step is to connect trigger signals in the user circuit to the inputs of thetrigger circuitry and connect the output of the trigger circuitry to unused RAMblocks reclaimed as trace buffers. As described in Section 3.1.4, we use an in-cremental routing algorithm using unused LUTs as route-throughs to implementthe required connections. Since these RAM blocks could be physically located indifferent sites across the chip, connecting the output of trigger circuitry to all ofthese RAM blocks can introduce a new long critical path. To minimize the effectof inserting trigger circuits on circuit critical path delay, the output of the triggercircuit is pipelined.4.5 MethodologyIn this chapter, we evaluate our techniques using VPR, which is part of the open-source academic Verilog-To-Routing project [96]. Using an open-source tool wasnecessary because demonstrating our approach requires low-level resource manip-ulation (e.g. to incrementally insert the overlay fabric into the bitstream after theuser circuit has been compiled using left-over resources), whereas commercialtools do not provide this ability as device information is proprietary. Moreover,63Table 4.1: FPGA Architecture Used Based on Intel Stratix IV.FPGA Architecture Parameter Meaning ValueN Logic cluster size 10K Inputs per look-up table (fracturable) 6I Inputs per cluster 33L Channel segment length 4Fc−in Cluster input flexibility 0.15Fc−out Cluster output flexibility 0.10we have chosen to demonstrate our techniques on VPR, so that we can investigatethe impact that changes in the channel width and spare logic resources may haveon our techniques.Using VPR 6.0, we packed, placed, and routed a set of heterogeneous bench-mark circuits onto an architecture based on Intel Stratix IV characterized in Ta-ble 4.1.For each benchmark, an FPGA size was chosen to be the smallest square thatfit the benchmark circuit. We use options to group only related logic elements intoa cluster, and perform the placement and routing on the smallest FPGA array thatcan fit the benchmark circuit. Only related logic elements are packed into a cluster,mirroring behaviour of industrial tools [67].Information about each circuit and the results of this mapping are shown inTable 4.2, including the number of LUTs and flip-flops of each circuit, the num-ber of logic clusters and memory resources in the smallest FPGA array that fitseach benchmark, and the resources occupied by the user circuit. Circuit mkDelay-Worker32B uses most of the available memories of the FPGA. Other benchmarksuse most of the available logic clusters of FPGA. The ability to construct an over-lay depends heavily on the available resources after the user circuit is mapped onto64Table 4.2: Benchmark Summary.6-Input FPGA Logic Cluster RAM Free(%)Circuit LUTs FFs Size Wmin Used All Used All LCs LEsbgm 32384 5362 75x75 80 4111 4200 0 120 2 25LU8PE 22634 6630 61x61 92 2667 2745 45 80 3 18LU32PE 76211 20898 111x111 128 9105 9213 150 252 1 17LU64PE 147556 39552 153x153 156 17591 17595 293 475 0 17mcml 101858 53736 100x100 86 7350 7400 38 208 0 9mkDW 5590 2491 42x42 76 916 1302 41 42 30 47stereo2 29943 18416 89x89 92 5889 5963 0 154 1 57FPGA. The column labeled Free LCs shows the percentage of the logic clusterswhich are left completely unused by the user circuit. The column labeled Free LEsshows the percentage of the unused logic elements inside logic clusters which arepartially used by the user circuit.For each circuit, unused RAM blocks in the FPGA are reclaimed as trace buffermemories. For bgm and stereovision2, all available memory blocks are reclaimedas trace buffers.4.6 Experimental Results4.6.1 Establishing the Overlay ArchitectureFirst, we evaluate our algorithm for creating the overlay architecture. The abilityto construct an overlay depends heavily on (a) the size of the overlay that is beingcreated (as a fraction of the available resources after the user circuit is mapped)and (b) the availability of routing tracks to implement connections and avoid con-gestion. In this section, we vary both to understand both the limits of the techniqueand the influence these two factors have on the ability to create an overlay.Figure 4.7 shows the compile time required to map (placement, primary and65 10 100 1000 10000 100000bgmLU8PEEngLU32PEEngLU64PEEngmcmlmkDelayWorker32Bstereovision2V PR Ru nt i me  ( s )Circuit 1.2 Wmin 1.3 Wmin 1.4 WminFigure 4.7: Compile-time: VPR runtime for establishing a maximally-sizedoverlay architecture as a function of channel width. Note that the y-axisuses a log scalesecondary sinks routing) a maximally-sized overlay architecture (Overlaymax) intoFPGA spare resources for each benchmark circuit (note that the vertical axis usesa logarithmic scale). Results are shown for four values of the channel width: W =1.2Wmin,1.3Wmin, and 1.4Wmin, where Wmin is the minimum channel width requiredto route each user circuit. Intuitively, the ratio W/Wmin is a measure of the amountof routing slack available in the architecture. An architecture with W/Wmin = 1.2is one where there is very little slack, while an architecture with W/Wmin = 1.4has significant routing slack. Academic architecture studies tend to use a value ofW/Wmin of approximately 1.3 to reflect realistic routing demand as in [8, 96].Figure 4.7 also shows the time to compile the original user circuit for compar-ison. The graph shows that increasing the channel width results in a substantialdecrease in the overlay compile time due to decreased routing congestion. Theoverlay compile time overhead is less than 10% of the user circuit compile time for66 1 10 100 1000 10000 100000bgmLU8PEEngLU32PEEngLU64PEEngmcmlmkDelayWorker32Bstereovision2V PR Ru nt i me  ( s )CircuitOverlaymaxOverlay28X28Overlay20X20Overlay16X16Figure 4.8: Compile-time: VPR runtime for establishing overlay architectureas a function of overlay size for 1.2Wminsmall benchmarks such as mkDelayWorker32B. However, the compile time of map-ping the overlay architecture is close to the circuit compile time for large bench-marks at a channel width of 1.2Wmin. This is because these large benchmarkshighly utilize LCs of FPGA resulting in routing congestion around LCs selected asoverlay cells.Figure 4.8 shows the same thing, but as a function of the overlay size, withW fixed at 1.2Wmin. Overlay28X28 means an overlay fabric containing 784 over-lay cells. Since the Overlaymax of mcml and LU8PEEng has only 729 and 529cells, respectively, it is not possible to create an overlay fabric with 784 cells(Overlay28X28) for these benchmarks. As the results show, smaller overlays com-pile significantly faster, both because there are fewer cells to place and route, andalso because the congestion is reduced. Although small overlays may be morecommon in practice (for small trigger circuits), the results show that our techniquecan map even very large fabrics, including those that fill the entire unused portion67 0 5000 10000 15000 20000 25000 30000 35000 40000 45000bgmLU8PEEngLU32PEEngLU64PEEngmcmlmkDelayWorker32Bstereovision2AVGV PR Ru nt i me  ( s )Circuit Compile Time Overlay Compile Time8.5% 59%4.3%4.7%28%9.6%40%22%Figure 4.9: Compile-time: VPR runtime overhead for establishing overlayarchitecture for 1.3Wmin.of the FPGA.Figure 4.9 summarizes the compile time (the vertical axis is a linear scale)assuming a maximally-sized overlay (Overlaymax) on an architecture with a chan-nel width of 1.3Wmin. On average the additional overhead of our overlay is 22%increase in circuit compile time; this reaches 59% for LU8PEEng. Although thedesigner must wait for place-and-route to implement the overlay architecture, thecompilation time is only needed once to build the overlay and is amortized by theruntime savings made over multiple debug turns with the same circuit.One of the unique aspects of the algorithm from Section 4.3 is that it is adap-tive; that is, if congestion prevents some nets from being routed, they will be leftout of the overlay, creating a slightly less flexible structure. To understand theextent to which nets are being dropped, Table 4.3 shows the average number ofinput pins per overlay cell that were successfully routed as a function of W andthe overlay size for each benchmark circuit. Ideally, this number would be twelve68Table 4.3: Average Number of Input Pins per Overlay Cell.Overlaymax 1.2WminCircuit 1.2Wmin 1.3Wmin 1.4Wmin Overlay28X28 Overlay20X20 Overlay16X16bgm 10.9 12 12 12 12 12LU8PE 6.8 11.8 12 - 12 12LU32PE 12 12 12 12 12 12LU64PE 11.9 12 12 12 12 12mcml 10.4 11.5 12 - 11.5 12mkDW 12 12 12 12 12 12stereo2 11.5 11.9 12 12 12 12(four primary sinks and eight secondary sinks as in Section 4.2). As the resultsshow, for most cases, the overlay is completely routed (the number of input pinsis 12), however, for small values of W or very large overlay sizes, this number issmaller for some circuits. Note that Overlaymax of mcml and LU8PEEng has lessthan 784 cells, Overlay28X28. The average number of input pins per overlay cellfor LU8PEEng is significantly lower than other benchmarks at channel width of1.2Wmin; however, the number approaches 12 as the overlay size decreases, sincea smaller overlay means there are more tracks available for routing, hence lesscongestion. For mcml, the number of input pins does not reach 12, even for anoverlay size consisting of 400 overlay cells (Overlay20X20); this is because even forthe smaller fabric, there is still significant routing congestion at channel width of1.2W .4.6.2 Trigger MappingIn this subsection, we evaluate our algorithm for mapping a trigger circuit to theoverlay architecture. In these experiments, we use a maximally-sized architectureon an FPGA with W = 1.3Wmin. From the previous section, it is clear that for some69 1 10 100 1000 10000 100000bgmLU8PEEngLU32PEEngLU64PEEngmcmlmkDelayWorker32Bstereovision2V PR Ru nt i me  ( s )Circuit 16-bit 32-bit 64-bit 128-bit 256-bitFigure 4.10: Debug-time: trigger mapping runtime assuming Overlaymax fordifferent size comparators.of the benchmark circuits, this configuration leads to fewer than twelve inputs percell (on average), however, the algorithm discussed in Section 4.4 constructs amapping solution that does not use these missing connections. We use an n-bitcomparator comparing two sets of signals or one set of signals and a constant aswell as an n-bit range comparator as our trigger circuit; such a trigger circuit mightbe used when a user wishes to stop/start recording data when the value of an n-bitbus is equal to another n-bit bus or a user defined constant, where n is 16, 32, 64,128, 256. The trigger signals are randomly selected from the registered signals ofthe benchmark circuit (as in [70]); we use multiple runs for each trigger circuitand choose a different set of trigger signals for each run, averaging the results overmultiple runs.Figure 4.10 shows the runtime of our trigger mapping algorithm (this includesconnecting the trigger logic to the trace buffers and user circuit). As the resultsshow, for all benchmarks except for mkDelayWorker32B, inserting a new trigger70 1 10 100 1000 10000 100000OverlayMAXOverlay28X28Overlay20X20Overlay16X16V PR Ru nt i me  ( s )Circuit 16-bit 32-bit 64-bit 128-bit 256-bitFigure 4.11: Average trigger mapping runtime as a function of overlay size.circuitry is an order of magnitude faster than a recompilation. Benchmark mkDe-layWorker32B has mapping times similar to the compile time of the original usercircuit for large triggers (256-bit comparator), primarily because this benchmark isrelatively small, and the original compile time is short.Figure 4.11 shows the average runtime of our trigger mapping algorithm as afunction of overlay size. As the results show, it is faster to map a trigger circuitryinto a smaller overlay.Table 4.4 shows the critical path delay of each circuit before and after insert-ing different size comparators into the design for different overlay sizes. As thetable shows, inserting a large trigger circuit can introduce a new long critical pathto the circuits with a short critical path. For example, the critical path delay ofmkDelayWorker32B which is only 6.5ns increases by 2.13ns when inserting a 256-bit comparator. In the worst case, the delay of stereovision2 increases by 6.13ns.However, it should be noted that this increase in the circuit critical path delay istemporary during debugging; the operating frequency of the circuit can be set as71Table 4.4: Effect of Trigger Insertion on Critical Path Delay (ns).OverlaySizeOverlaymax Overlay28X28 Overlay20X20 Overlay16X16Circuit Original 16 to 128 256 128 256 128 256 128 256bgm 19.7 19.7 19.7 19.7 19.7 19.7 19.7 19.7 19.7LU8PE 89.6 89.6 89.6 - - 89.6 89.6 89.6 89.6LU32PE 91.9 91.9 91.9 91.9 91.9 91.9 91.9 91.9 91.9LU64PE 89.2 89.2 89.2 89.2 89.2 89.2 89.2 89.2 89.2mcml 66.5 66.5 66.5 - - 66.5 66.5 66.5 66.5mkDW 6.46 6.46 8.5 6.46 8.5 6.46 8.5 6.46 7.1stereo2 11.9 11.9 18.0 11.9 12.7 11.9 12.4 11.9 13.1normal when the debug instrumentation is not required.4.6.3 Comparison to Incremental Triggering without using OverlaysIn Chapter 3, an incremental routing technique (without an overlay to guide therouting) was used to route the required nets of the trigger netlist. That work was un-able to route several benchmarks due to congestion during the incremental routingphase. In contrast, using the approach presented in this chapter, the pre-synthesizedoverlay fabric enables us to use a simulated-annealing based placement algorithmin order to increase the chance of finding a routable placement solution withoutrequiring a long runtime.Using overlay for trigger insertion, we were able to successfully insert the sametrigger circuitry (i.e. bitwise AND) from Chapter 3 for stereovision0 and stereovi-sion2 assuming a channel width 1.2Wmin. For circuit mkPktMerge, we created anOverlay21X21 and were able to successfully insert a trigger function with 256, and512 inputs, again assuming a channel width of 1.2Wmin. It should be noted thatthis benchmark has only 515 signals, while the trigger circuitry with 512 inputshas 655 nets; this shows the flexibility of the overlay architecture. The overlay size72of raygentop (Overlay10X10), and or1200 (Overlay14X14) is too small to support atrigger circuitry with more than 512 and 1024 inputs, respectively. Another dif-ference between this work and the previous work is that the entire circuit needs tobe loaded into the memory of a CAD tool to construct an incremental placementand routing solution for the trigger circuitry at debug time; while in this work, theoverlay architecture is only required to map a desired trigger circuitry at debugtime.4.7 SummaryThis chapter demonstrated a non-intrusive framework based on overlays that can beused for instrumentation without requiring circuit recompilation during FPGA de-bug. At compile time, the overlay architecture is incrementally established on topof the user circuit using leftover resources after circuit compilation. This mappingis done using a best-effort adaptive CAD technique that preserves the underlyingcircuit mapping. At debug time, the desired trigger circuitry can be rapidly mappedto the overlay architecture. This is done using a routing-aware placement algorithmto increase the routability of the placement solution.The key novelties of this work are in: (1) a non-intrusive triggering frame-work that separates the user circuit from the debug instrumentation eliminating theneed for recompilation during debug iterations, (2) an adaptive, and flexible over-lay architecture that provides enough flexibility for arbitrary combinational triggercircuits, (3) efficient methodology for overlay construction using incremental CADalgorithms that exploit the spare logic and routing resources unused by the user cir-cuit resulting in zero area overhead, and (4) CAD algorithms to rapidly reconfigurethe overlay for a new trigger circuitry at debug iterations.73We have shown that the overlay architecture provides enough flexibility to im-plement combinational triggering scenarios an order of magnitude faster than a fullrecompilation accelerating debug productivity. Our experiments have shown thatour triggering framework has a negligible impact on the critical path delay.To extend our solution of non-intrusive instrumentation framework used forrapid implementation of combinational trigger circuits, the contributions in Chap-ter 5 are concerned with addressing the more challenging problem of implementingsequential (state-based) trigger circuits.74Chapter 5Customized FPGA Overlay andMapping Tools for RapidTriggering CapabilitiesChapter 4 detailed our non-intrusive instrumentation framework for rapid triggerinsertion during debugging. As part of our framework, we introduced an over-lay architecture (a 2D torus) and mapping tool (simulated-annealing based place-ment algorithm) suitable for implementation of combinational functions; we didnot consider implementing state-based trigger circuits. Yet, supporting such com-plex trigger circuits are essential to ensure that the collected data is useful for thedesigner. An example of a sequential trigger might be a state machine that de-termines nth occurrence of a bus value. Hence, in this chapter we introduce aspecialized overlay architecture and tailored mapping algorithms that provide theability of efficient implementation of complex trigger circuits, suitable for an in-75system debug ecosystem. Section 5.1 describes our optimized overlay architectureand the method by which the overlay architecture is implemented. Since we arelimited to only resources not used by the user circuit, it is necessary to make areasonable trade-off between area and flexibility. To achieve this, we customizethe overlay architecture to provide adequate flexibility for implementing complextrigger circuits, including arbitrary combinational and state-based trigger circuits.Section 5.2 describes our specialized algorithms that take advantage of the char-acteristics of the customized overlay architecture to rapidly map trigger circuits.Section 5.3 details the experimental methodology that was used to evaluate ourtechniques. Following this, in Section 5.4, we discuss results from these experi-ments and compare our technique to Intel’s SignalTap II tool in terms of runtime,area overhead, and circuit delay. Finally Section 5.5 summarizes this chapter. Anearly version of this contribution was published in [43], and an extended versionwas submitted to ACM Transactions on Design Automation of Electronic Systems(TODAES) with minor revisions.5.1 Overlay ArchitectureIn this section, we describe a parameterized overlay architecture family that is op-timized to implement combinational and sequential trigger functions. We first de-scribe a logical view of the architecture, and then show how it is implemented onan FPGA in an efficient manner.5.1.1 Parameterized Overlay Architecture FamilyOur overlay is specialized to implement trigger circuits. To implement the logicpart of the trigger circuit, our overlay contains a number of cells, each contain-76(a) The overall structure of the multi-level overlay architecture.(b) Overlay Cell (OC) details(c) Overlay flip-flop detailsFigure 5.1: (a) An overview of the parameterized overlay architecture that iscomprised of cells connected via primary (shown in black) and auxiliary(shown in red) connections, and a set of flip-flops connected to cells inlevel lo− 1 (b) Overlay cell design with ko-input LUTs, local routing,and outputs to other cells. (c) Overlay flip-flop design with a flip-flop,local routing, and fanout to other cells.ing no look-up tables (LUTs). Each look-up table has ko inputs. As discussed inChapter 4, building an overlay around LUTs allows for an efficient mapping of theoverlay to the underlying FPGA.Figure 5.1a shows an overview of the overlay architecture that is comprised of77cells and flip-flops. Within an Overlay Cell (OC), no look-up tables are arranged asshown in Figure 5.1b. As will be discussed below, four of the LUTs are designatedas auxiliary LUTs, and the remaining no−4 are designated as primary LUTs. EachLUT drives one output of the cell. The inputs to each LUT are connected to Ioinputs and the no feedback paths through a fully-connected crossbar. As we willshow below, our interconnect pattern limits Io to be 2no− 6 (this is less than iscommon for standard FPGA architectures, but we will show it is sufficient for thetriggers we implement).As shown in Figure 5.1a, cells are connected in a triangular reduction-networkpattern, motivated by work in [71, 89, 124]. The use of the triangular pattern wasmotivated by work in [71] in which the authors profiled a large number of circuitswith this property, measuring their ’shape’, and properties such as average fanin,fanout, and convergence. In that paper, the authors concluded that such circuitsoften have triangular shapes, which motivated our architecture. In Figure 5.1a, thelevels are labeled 0..lo− 1 starting from the output; level i contains 2i cells. Theoverlay has a single output from level 0 which is used to control the trace buffers.To support sequential triggers, the overlay also contains a bank of m flip-flopsshown at the top of Figure 5.1a.The interconnect between cells is also shown in Figure 5.1a. We distinguishbetween two categories of interconnects: Each cell output is labeled as primary orauxiliary. There are no− 4 primary outputs; all but one are connected to a cell inthe previous level. The other primary output is connected to a neighbouring cellin the same level. Two of the four auxiliary outputs are connected to cells in thenext level; the other two are connected to neighbouring cells in the same level.Non-neighbouring cells are connected by making a series of route through, i.e. by78Table 5.1: Architectural Parameters Describing Logical Overlay.Overlay Architecture Parameter Meaningno Look-up tables per cellko Inputs per look-up tablelo Number of levels of cellsm Number of flip-flopsfo Flip-Flop feedback fanoutpassing through intermediate cells using unused inputs, outputs, and LUTs as routethrough.The cells in level lo− 1 are connected to the m flip-flops. Each overlay flip-flop is accompanied by a multiplexer as shown in Figure 5.1c. Each input to themultiplexer is driven by one of the auxiliary outputs of a cell in level lo−1. Sincethere are 2lo−1 cells in level lo− 1 and m flip-flops, each flip-flop multiplexer has2lo−1m inputs. The output of each flip-flop drives one of the 2no−10 primary inputsto a cell in level lo− 1. Rather than connect each flip-flop output to each of the2lo−1 cells, we have found that we can obtain sufficient flexibility by connectingeach flip-flop output to a smaller number of cells which we denote fo. In ourexperiments in Section 5.4, fo = 32. Table 5.1 presents a summary of architecturalparameters describing overlay fabric. A family of overlay fabrics can be describedusing these parameters.The inputs to the overlay fabric are the trigger signals from the user circuit. Asdescribed in Section 5.2.4, any of the primary inputs to each cell can be used as aninput pin of the overlay fabric.795.1.2 Overlay ConstructionThe overlay architecture is constructed on a per-circuit basis. The overlay fabricis added to the user circuit at compile time, using only those logic and routingresource that were not used by the user circuit. Rather than compiling an RTLdescription of the overlay fabric using normal CAD tools, we map the overlay di-rectly to the underlying FPGA primitives (e.g. LUTs, and routing segments). Thisleads to a far more efficient overlay implementation. In this chapter, we assumean FPGA architecture similar to Intel Stratix IV device described in Section 5.3.We modified the overlay mapping approach described in Section 4.3 to implementour multi-level overlay architecture into FPGA available resources. Our algorithmconsists of cell selection, overlay placement, and best-effort routing. An overviewof the algorithm is provided below.Selection and Adaptive Allocation. Given the mapping of the user circuit, thealgorithm first collects all logic clusters (LCs) inside the FPGA which (1) containat least no logic elements (LEs) unused by the user circuit, and (2) have at least2no−6 unused inputs. Each such LC has sufficient unused resources to implementa single cell of the combinational part of the overlay architecture. We also collectLCs which contain (1) at least one unused LE, and (2) at least 2lo−1m unused inputs;each such LC can be used to implement one of the flip-flops in the overlay fabric.Based on the number of LCs selected, we then choose the number of levels ofthe overlay, lo. In this way, the size of the overlay adapts to the size of the usercircuit; if a circuit uses most of the available FPGA, few such LCs will be found,and the overlay will be small, limiting the complexity of trigger functions that canbe supported. If there are many unused resources, lc will be large, leading to an80overlay that can implement more complex trigger functionality.Overlay Placement. We then construct a logical view of the overlay architec-ture, and assign potential sites on the FPGA for each overlay cell. We employ thesimulated annealing based placement algorithm presented in Chapter 4 to positionthe cells in these sites.Overlay Routing. After overlay placement, we attempt to create all connec-tions between the cells. We use the best-effort routing heuristic presented in Chap-ter 4. This heuristic iteratively discards illegal connections (beginning with auxil-iary connections) until a legal routing solution is achieved. This approach ensuresthat the primary connections have higher priority for using routing resources thanthe auxiliary connections. In this way, our approach can be used for user circuitsthat already stress the FPGA routing.5.2 CAD Algorithm for Overlay PersonalizationAt debug time, a mapping algorithm is required to implement a trigger netlist ontothe overlay fabric. At this stage, there are three challenges: (1) the mapping al-gorithm should be fast to provide rapid debug iterations; (2) the pre-synthesizedoverlay has limited flexibility compared to a standard FPGA due to its limited rout-ing topology and missing connections as described in Section 5.1.2. If we performplacement and routing as separate phases, a poor placement can easily lead to anun-routable solution, and (3) the overlay architecture is highly constrained since itis customized for trigger-type circuits.Our overall trigger mapping flow is presented in Figure 5.2. It has two majorphases: Pre-processing and Placement/Routing. In this section, we describe thedetails of each phase; for clarity, we first describe the mapping algorithm for com-81Figure 5.2: Overlay reconfiguration flow that takes a graph representation ofa trigger netlist and outputs a placed and routed netlist onto the overlayfabric.binational triggers and then generalize the algorithm for sequential trigger circuits.5.2.1 Trigger Circuit ModelingWe make the same assumptions as in Chapter 4 that the trigger circuit has beenpreviously mapped to lookup tables and flip-flops using a technology-mapping al-gorithm such as [102]. A trigger circuit netlist can be represented as a directedgraph G(V, E), where each node in V represents either a LUT or a flip-flop. Eachedge in E is a connection between LUTs and/or flip-flops in the trigger circuit;we refer to these nets as trigger nets. Each trigger net has one source node, butmay have multiple sink nodes. The graph may have multiple input edges, eachrepresenting an input signal to the trigger circuit, and has a single output edge,representing the trigger output.5.2.2 Mapping Combinational TriggersOur overall mapping flow is shown in Figure 5.2. In this section, we describe thisflow assuming the trigger circuit is combinational; in Section 5.2.3 we generalizethis for a sequential trigger circuit.82(a) A multi-sink net before fanout bounding.(b) Two-sink nets after fanout bounding.Figure 5.3: Fanout bounding takes a multi-sink trigger net (a) and decom-poses it to multiple two-sink nets (b).Pre-processingPre-processing modifies G to suit the overlay architecture by taking the followingsteps:1. Fanout bounding. Since each output pin of each overlay cell is only con-nected to one cell, we bound the fanout of the trigger nets to be as smallas possible to better match these single output connections of the overlay.Hence, this step decomposes any multi-sink nets into multiple two-sink nets.This is performed by replacing a multi-sink net with a binary tree with thesource node as the root, the sink nodes as the leaves, and replicator LUTs asintermediate nodes as illustrated in Figure 5.3. Clearly, the result is a func-tionally equivalent circuit. Fanout bounding techniques such as [60, 114]can be used to minimize the number of intermediate nodes and depth; we donot consider this optimization in this chapter.2. Level Order. In this step, a Level-Order Traversal algorithm is used to tra-verse the trigger netlist graph. The output is a list of LUTs sorted based on83Figure 5.4: A simple single-sink combinational trigger netlist.level-order.Placement and RoutingPlacement and routing finds a mapping of each LUT in the trigger circuit (whichwe refer to as a trigger LUT to a LUT in the overlay fabric (which we refer to asa LUT slot). We process the graph in Level Order, meaning each trigger LUT isplaced after its sink LUTs to ensure there is a connection between the source andthe sink LUTs. In this section, we first describe the mapping algorithm assumingthe trigger circuit contains only single-sink nets, using the example in Figure 5.4.We then generalize the algorithm for circuits which contain at least one net withtwo sinks. Due to the pre-processing step in Section 5.2.2 , we are assured that nonet has more than two sinks.Mapping trigger LUTs using overlay primary resources. Figure 5.5 illustratesthe details of the connections between overlay cells described in Section 5.1.1. As-sociated with each output pin for each cell is a 4-input LUT driving the output pin;as described in Section 5.1, these LUTs are categorized as primary and auxiliary84Figure 5.5: Detailed view of connections between overlay cells assuming no= 8, ko = 4 and lo = 3. Each cell (except OC1) has ten input pins. Seven(I1-I7) are driven by primary (shown in black) output pins. Three (I8-I10) are driven by auxiliary (shown in red) output pins. Each output pinis driven by a 4-input LUTs. LUTs and routing resources inside eachcell are not shown for simplicity.LUTs. Each primary 4-input LUT can be used to implement a 4-input function orit can be used as a route-through to pass signal to neighbouring cells. AuxiliaryLUTs are only used as route-through.The algorithm starts by assigning the first element in the Level Order list to theLUT-slot that drives the output pin of the overlay fabric. In our example, LUT1 inFigure 5.4 is assigned to LUT-slot inside OC1 that drives O1 in Figure 5.5. OnceLUT1 is placed into OC1, the overlay output pin is assigned to LUT1’s output net,and the primary input pins of the overlay cell are assigned to the input signals ofLUT1. Since the primary inputs of OC1 are driven by two cells (i.e. OC2 and OC3),input pins are distributed between OC2 and OC3 evenly. Hence, n1 is assigned toI1, and n2 to I4.85The rest of the trigger LUTs in the Level Order list are considered sequentially;each trigger LUT is mapped to a LUT-slot using Algorithm 2. Note that since weare assuming only single-sink nets in our example, the trigger LUT has only onepredecessor. This algorithm is given the input pin of the trigger LUT’s predecessor;because the trigger LUTs are processed in level order, we are assured the predeces-sor has already been assigned a LUT-slot. The algorithm determines the LUT-slotthat drives this input pin and determines whether this candidate LUT-slot is a legallocation for the trigger LUT being placed (lines 3-8 in Algorithm 2). In our ex-ample, when placing LUT2, the LUT-slot that drives O1 of OC2 is considered as acandidate LUT-slot since n1 was assigned to I1 of OC1.To determine whether a candidate LUT-slot is a legal location, the Validateroutine checks for two conditions: (1) the overlay cell that contains the candidateoutput pin must have enough unassigned input pins to accommodate the input sig-nals of the trigger LUT, and (2) the overlay cell has at least as many unassignedinput pins as its unassigned output pins. This latter condition is to ensure that thecell can later be used as a route-through cell; as we will discuss, this is crucial toprevent blocked routing situations when placing future trigger LUTs. If the valida-tion passes, the algorithm returns the specific LUT-slot and a path from the outputpin of this LUT-slot to the assigned input pin of the predecessor (lines 8-13 in Al-gorithm 2). In our example, cell OC2 is free and the validation step passes; thealgorithm places LUT2 onto the LUT-slot that drives output pin O1 inside OC2.As before, primary inputs pins are distributed among the driving cells evenly; inour example, the input pins are distributed among OC3, OC4, and OC5. Therefore,a valid input assignment is n3 to I1, n4 to I2, n5 to I3 and n6 to I5.If the validation for a candidate LUT-slot fails, the algorithm considers LUT-86Algorithm 2: Mapping a trigger LUT onto a LUT-slot inside overlay cellsInput: overlay /* Overlay Fabric */trigger LUT /* To be mapped trigger LUT */assigned input pin o f predecessor1 /* Input pin that wasassigned to trigger LUT’s first predecessor */assigned input pin o f predecessor2 /* Input pin that wasassigned to trigger LUT’s second predecessor */Output: f ound LUT slotrouting path1 /* Routing path from found LUT slot toassigned input pin of predecessor1 */routing path2 /* Routing path from found LUT slot toassigned input pin of predecessor2 */1 /* a set that keeps the candidate LUT-slots in the current level */2 current level candidate LUT slots← /03 driver LUT slot =GetDriverLUT (overlay,assigned input pin o f predecessor1)4 current level candidate LUT slots.insert(driver LUT slot)5 while (current level candidate LUT slots 6= /0) do6 next level candidate LUT slots← /07 foreach (LUT slot ∈ current level candidate LUT slots) do8 if Validate(overlay, trigger LUT,LUT slot) == true then9 f ound LUT slot = LUT slot10 LUT slot out put pin = GetOut putPin(overlay,LUT slot)11 routing path1 =GetPath(LUT slot out put pin,assigned input pin o f predecessor1)12 if assigned input pin o f predecessor2 == /0 then13 return f ound LUT slot, routing path114 routing path2 =SignalRoute(overlay,LUT slot out put pin,assigned input pin o f predecessor2)15 if routing path2 6= /0 then16 return f ound LUT slot, routing path1, routing path217 /* use the LUT-slot as a route-through to find other candidates */18 foreach ( f ree primary input pin ∈ cell containing LUT slot) do19 driver LUT slot = GetDriverLUT (overlay, f ree input pin)20 next level candidate LUT slots.insert(driver LUT )21 current level candidate LUT slots = next level candidate LUT slots22 return /0 /* return empty indicating that no LUT-slot was found */87Algorithm 3: SignalRouteInput: overlay /* Overlay Fabric */source out put pin /* Source output pin */sink input pin /* Sink input pin */Output: routing path /* Routing path from source to sink */1 /* a set that keeps the candidate output pins in the current level */2 current level out put pins← /03 /* returns the output pin that drives sink input pin */4 driver pin = GetDriverPin(overlay,sink input pin)5 current level out put pins.insert(driver pin)6 while (current level out put pins 6= /0) do7 next level out put pins← /08 foreach (out put pin ∈ current level out put pins) do9 if source out put pin == out put pin then10 routing path = GetPath(source out put pin,sink input pin)11 return routing path12 /* go to the neighbours*/13 foreach ( f ree input pin ∈ the cell containing out put pin) do14 driver pin = GetDriverPin(overlay, f ree input pin)15 next level out put pins.insert(driver pin)16 current level out put pins = next level out put pins17 return /0 /* return empty indicating that no path exists */slots in cells that fan-in to the cell containing the candidate LUT-slot. The can-didate LUT-slot is then used as a route-through (lines 18-20 in Algorithm 2). Asmentioned earlier, condition (2) of the validation ensures the algorithm is able touse the candidate LUT-slot in this way. The algorithm performs a validation foreach new candidate LUT-slot until a legal location is found. If the algorithm can-not find a legal location (i.e. the algorithm reaches the last level), the mapping fails(line 22 in Algorithm 2).Mapping two-sink trigger nets. We do not have to deal with trigger nets withmore than two sinks because of the fanout bounding performed in pre-processing88phase as described in Section 5.2.2. For trigger circuits that contain at least onenet with two sinks, additional processing is required. When placing a trigger LUTwith two sinks, both sink LUTs have already been placed due to the fact that triggerLUTs are placed in a level order. For such a trigger LUT, we first find a candidateLUT-slot based on its first sink using primary resources as explained above. Wethen determine whether there is a path between the trigger LUT and its secondsink (line 14 of Algorithm 2). To determine whether there is such path, we useAlgorithm 3 which is based on a breadth-first search. Both Primary and Auxiliaryresources are used to enhance the routing flexibility when finding such a path. If apath is found (Lines 15-16 in Algorithm 2), the candidate LUT-slot is accepted, andthe path is returned to the calling routine so that it can map the routing resourcesthat mark up the path as used. If there is no path between the candidate LUT-slotand the second sink, the candidate LUT-slot is rejected and Algorithm 2 considersother LUT-slots as described earlier.5.2.3 Mapping Sequential TriggersIn this section, we describe how sequential trigger circuits are mapped.Pre-processingFor a sequential circuit, the pre-processing phase is modified in three ways:1. Fanout Bounding. As described in Section 5.2.2, fanout bounding on G isperformed by replacing each multi-sink net with a k-way tree. As in Sec-tion 5.2.2, for each combinational node, we use k = 2. For each sequentialnode (a flip-flop), we use k = fo. Recall from Section 5.1.1 that fo was de-fined as the fanout of each overlay flip-flop. This ensures that the solution89(a) Original sequential trigger. (b) Transformed sequential trigger.Figure 5.6: Transformation takes a sequential trigger (a) and restructures it tosequential trigger (b) where trigger netlist can be cut into two partitions:combinational and sequential.matches the topological constraints of overlay flip-flops.2. Transformation. As shown in Figure 5.2, if G is sequential, a transformationis performed to restructure a sequential trigger state machine. As shown inFigure 5.6, the state machine is restructured so that the trigger output logic isfed directly from the state logic. This allows us to separate the combinationaland sequential parts of the trigger to better match the overlay architecture.Although this transformation means that trigger output is produced one cy-cle earlier. This can be tolerated by pipelining the connection between thetrigger overlay and the trace buffer, as we described in Section Level Order. The Level Order algorithm performs a level-order traversal ofthe combinational part of G.90Algorithm 4: Mapping a trigger flip-flop onto a flipflop-slot inside over-lay fabricInput: overlay /* Overlay Fabric */all assigned input pins o f sinks /* Set of input pins that wereassigned to trigger flip-flop’s sinks */out put pin o f driver /* Output pin that was assigned to triggerflip-flop’s driver LUT */Output:f ound f lip f lop slot /*Flipflop slot of the overlay that triggerflip-flop can be assigned to it */all routing paths /* Routing paths between the found flipflop slotand the combinational part */1 /* set of unassigned flip-flop slots inside overlay fabric */2 unassigned f lip f lop slots = GetFlipFlops(overlay)3 foreach (candidate f lip f lop slot ∈ unassigned f lip f lop slots) do4 f lip f lop slot out put pin =GetOut putPin(overlay,candidate f lip f lop slot)5 all routing paths← /06 success = true7 foreach (assigned input pin ∈all assigned input pins of sinks) do8 routing path =SignalRoute(overlay, f lip f lop slot out put pin,assigned input pin)9 if routing path == /0 then10 success = f alse11 break12 all routing path.insert(routing path)13 if success == true then14 f lip f lop slot input pins =GetInputPins(overlay,candidate f lip f lop slot)15 foreach ( f lip f lop slot input pin ∈flipflop slot input pins) do16 routing path =SignalRoute(overlay,output pin of driver, f lip f lop slot input pin)17 if routing path 6= /0 then18 all routing paths.insert(routing path)19 return f ound f lip f lop slot, all routing path20 return /0 /* return empty indicating that no flipflop-slot was found */91Placement and RoutingPlacement and routing of a sequential trigger circuit is performed in two steps.First, the combinational portion is mapped as in Section 5.2.2. Then, the flip-flopsin the trigger circuit (which we refer to as trigger flip-flops) are placed into the over-lay and connect to the combinational portion of the circuit. Each trigger flip-flop isconsidered sequentially, and mapped to a flip-flop in the overlay fabric (which werefer to as a flip-flop slot) using Algorithm 4. This algorithm considers candidateslots sequentially (line 3 of Algorithm 4). Each candidate slot is validated usingtwo invocations of breadth-first routing in Algorithm 3. The first invocation deter-mines whether it is possible to find a path between the flip-flop slot output and eachsink LUT (lines 7-12 in Algorithm 4). Since the output pin of each overlay flip-flopis connected to fo input pins of cells in level lo− 1 as described in Section 5.1.1,finding a path from a sink LUT to any of those cells is sufficient. The second in-vocation of the breadth-first routing algorithm determines whether it is possible tofind a path between the output pin of the driving LUT and the input pins of thecandidate overlay flip-flop slot (lines 15-19 in Algorithm 4). Since the inputs ofeach overlay flip-flop are driven by a fraction of cells in level lo−1, finding a pathto any of those cells is sufficient. If both invocations are successful, the candidateis selected; if not, another candidate flip-flop slot is considered (the algorithm re-iterates from line 3 in Algorithm 4). If no legal flip-flop slot is found, the algorithmfails.5.2.4 Connecting Trigger Circuit to the User CircuitAfter mapping the trigger logic to the overlay fabric, we connect the trigger inputsfrom the user circuit to the trigger logic. Rather than having dedicated trigger fabric92input pins, every connection between cells in the overlay fabric can be replacedwith a connection to the user circuit. For example, in Figure 5.5, if pin I2 of OC2is mapped to a trigger input signal, the connection between OC4 and pin I2 ofOC2 can be replaced by a connection between the user circuit and pin I2 of OC2.This re-route can be done using an incremental routing algorithm which reroutesspecific signals using free resources while leaving the rest of the design unchanged.The last step to enable triggering is to connect the single trigger output to alltrace buffers across the chip, which can introduce a long critical path. Hence, weassume the trigger output is registered to minimize impact on the user circuit’scritical-path delay.5.3 Experimental MethodologyIn Section 5.4, we experimentally map synthetic trigger circuits to an overlay ar-chitecture, implemented on top of a set of user circuits. In this section, we describehow we obtain and map the user circuits, the overlay architecture assumptions, andthe synthetic trigger circuits.5.3.1 Underlying User CircuitsWe evaluate our techniques using VPR, which is part of the academic Verilog-To-Routing project [96]. An open-source tool such as VPR provides us to demonstrat-ing our approach since it requires low-level resource manipulation, whereas deviceinformation is proprietary in commercial tools.Using VPR, we packed, placed, and routed a set of benchmark circuits ontothe smallest FPGA array that can fit the circuit using an architecture based on In-tel Stratix IV, characterized by logic cluster size N = 10, look-up table size K = 6,93Table 5.2: Benchmark Summary.6-Input FPGA Logic Cluster RAM Free (%)Circuit LUTs FFs Size Wmin Used All Used All LCs LEsbgm 32384 5362 75x75 80 4111 4200 0 120 2 25LU8PE 22634 6630 61x61 92 2667 2745 45 80 3 18LU32PE 76211 20898 111x111 128 9105 9213 150 252 1 17LU64PE 147556 39552 153x153 156 17591 17595 293 475 0 17mcml 101858 53736 100x100 86 7350 7400 38 208 0 9mkDW 5590 2491 42x42 76 916 1302 41 42 30 47stereo2 29943 18416 89x89 92 5889 5963 0 154 1 57Table 5.3: Values of the Overlay Architectural Parameters, Used for the Ex-periments.Overlay Architecture Parameter Valueno 8ko 4lo 8m 8fo 32cluster input and output flexibilities of Fc−in = 0.15 and Fc−out = 0.10, respectively,channel segment length L = 4, and inputs per cluster I = 33. Table 5.2 summarizesthe benchmark circuits, resources available in the smallest FPGA array, and re-sources occupied by the circuit. Wmin presents the minimum number of routingtracks required to route each circuit on the minimum-sized FPGA array. We in-flate this value by 30% since this is thought to best reflect the situation in a realFPGA [8, 96]. We make the same assumptions as in Chapters 3 and 4 that anyunused RAM block in the FPGA can be reclaimed as a trace buffer. Furthermore,only related logic elements are packed into a cluster to better reflect industrialtools [67]. The column labelled Free shows the percentage of free LCs and LEsafter user circuit compilation.945.3.2 Overlay Architecture AssumptionsTable 5.3 presents the parameters that describes the overlay architecture and theirvalues. Importantly, no is less than N in the target FPGA and ko is less than K inthe target FPGA; this ensures that partially filled logic clusters in the FPGA canbe used to implement overlay cells. We assume that the number of overlay celllevels lo = 8, which is sufficient to implement many of the type of trigger circuitswe describe later in this section.5.3.3 Synthetic Trigger Circuit AssumptionsIn order to adequately evaluate our technique on a wide range of trigger circuits,we generate 1940 synthetic triggers, as described below. We consider three cate-gories of trigger circuits: combinational triggers, state-based triggers, and complextriggers.Combinational Triggers. We consider three types of combinational trigger cir-cuits. First, we consider trigger circuits which consist of a single n-bit comparatorcomparing two sets of n-bit signals in the user circuit. We vary n from 16 to 256(in powers of two), and for each value of n, we generate 100 circuits with differentsets of input signals. Second, we consider trigger circuits that compare an n-bitsignal to a constant; again, we vary n from 16 to 256, and for each value of n, wegenerate 100 circuits with different values of the constant and different input sig-nals. Finally, we consider circuits containing a range comparator to determine if ann-bit signal in the user circuit is within a given range. As before, we vary n from16 to 256, and for each value of n, we generate 100 circuits with different rangesand different signals in the user circuit. All together, this represents 1500 syntheticcombinational circuits.95State-based Triggers. We created a program which generates synthetic statemachines according to two parameters: the number of states and the maximumnumber of transitions from each state. Using this program, we generated syntheticstate machines by sweeping the number of states through {8,16} and the maximumnumber of transitions from each state through {2,3,4,5,6}. For each combination,we generated 20 state machines, giving a total of 200 state machines.Complex Sequential Triggers. Finally, we consider trigger circuits which con-tain a counter, providing the ability to trigger after a specified number of occur-rences or absences of a specified condition. We created a program which generatesa synthetic trigger circuit that compares n-bit quantities in the user circuit with aconstant, and contains an m-bit counter, along with circuitry to assert the outputsignal when the count reaches a specified value (up to 2m−1). Using this program,we generated synthetic triggers by sweeping n from 16 to 128 (in powers of two)and m in the range {4,5,6}. For each combination, we generated 20 triggers, givinga total of 240 complex sequential triggers.5.4 Experimental Results5.4.1 Overlay Implementation RuntimeTable 5.4 presents the compile time overhead required to place and route the over-lay fabric (characterized in Table 5.3) on top of each benchmark circuit. Addingour proposed overlay to each benchmark circuit adds less than 2.5% to the circuitcompile time. This low overhead is due to two main reasons: (1) instead of creat-ing the largest possible overlay out of unused logic resources, we assumed that thenumber of overlay cell levels lo = 8; this results in fast placement. (2) the simple96Table 5.4: Overlay Compile-Time Overhead.CircuitUninstrumented CircuitCompile-time(s)OverlayCompile-time(s)bgm 2989 21LU8PEEng 2071 48LU32PEEng 8773 54LU64PEng 35615 18mcml 13074 19mkDW 756 14stereo2 3574 280510152025303540bgmLU8PEEngLU32PEEngLU64PEEngmcmlmkDWstereo2Runtime (s)Comparator Size16 32 64 128 256Figure 5.7: Debug-time: combinational trigger mapping runtime.interconnect topology of our overlay architecture results in fast routing. The over-lay compile time is a one-time overhead since the overlay is created only once afterthe user circuit and is used multiple times during debug iterations with the samecircuit.5.4.2 Trigger Mapping RuntimeFigure 5.7 shows the runtime of our trigger mapping algorithm for combinationaltriggers (including connecting the trigger signals to the fabric). The vertical axis9702468101214bgmLU8PEEngLU32PEEngLU64PEEngmcmlmkDWstereo2Runtime (s)Max. Transitions per State2 3 4 5 6(a) 8 states.0246810121416bgmLU8PEEngLU32PEEngLU64PEEngmcmlmkDWstereo2Runtime (s)Max. Transitions per State2 3 4 5 6(b) 16 states.Figure 5.8: Debug-time: state-based trigger mapping runtime.represents the mapping time, averaged for all trigger circuits with the given com-parator size. The mapping was successful for all 1500 combinational triggers. Asshown in Figure 5.7, the mapping time for each trigger circuit was less than 40seconds. This runtime includes both the time to map the trigger functionality tothe fabric, as well as the time to connect the trigger signals in the user circuit to thefabric. In all cases, time to connect the trigger signals to the fabric was dominantand the mapping runtime increases as the size of comparators increases.98024681012141618bgmLU8PEEngLU32PEEngLU64PEEngmcmlmkDWstereo2Runtime (s)Counter Size4 5 6Figure 5.9: Debug-time: mapping runtime of complex sequential trigger cir-cuits with 128-bit comparator.Figure 5.8a shows the runtime of our trigger mapping algorithm when mappingstate-based trigger circuits with 8 states as a function of the maximum number oftransitions per state. The runtime depended heavily on the underlying user circuit.Circuits such as stereovision2 had more routing congestion around memory blocks,meaning the runtime to find a legal path between the trigger output and all availablememory blocks is larger.For each of our circuits, the average compile time was less than 14 seconds.Figure 5.8b shows the same results for state machines with 16 states. In this case,not all mappings were successful. For the case with a maximum of 6 transitions perstate, we found that 16 of the 20 trigger circuits could be mapped to the fabric. Ofthose that were successful, each trigger could be mapped in less than 16 seconds.Figure 5.9 shows the runtime of our trigger mapping algorithm for mappingcomplex trigger circuits with 128-bit comparators as a function of counter size.We were able to map all the trigger circuits; on average, the runtime for the largestcircuit was less than 18 seconds. As mentioned before, the runtime to connect the99Table 5.5: Effect of Trigger Insertion on Circuit Critical Path Delay (ns).CircuitCritical Path Delay (ns)Original16 States 128-bit ComparatorMax Transitionsper StateCounter Size2 3 4 5 6 4 5 6bgm 19.72 19.72 20.7 20.09 21.35 21.3 19.72 19.72 19.72LU8PE 89.64 89.64 89.64 89.64 89.64 89.64 89.64 89.64 89.64LU32PE 91.93 91.93 91.93 91.93 91.93 91.93 91.93 91.93 91.93LU64PE 89.23 89.23 89.23 89.23 89.23 89.23 89.23 89.23 89.23mcml 66.59 66.59 66.59 66.59 66.59 66.59 66.59 66.59 66.59mkDW 6.47 10.86 11.19 10.98 11.45 12.19 8.83 8.76 10.11stereo2 11.93 12.2 12.59 12.76 12.9 13.52 11.93 11.93 11.97trigger signals to the fabric was dominant and increasing the counter size has littleeffect on the runtime.As the results show, implementing a new trigger circuitry using our techniquesis significantly faster than performing a lengthy circuit recompilation at each debugiteration.5.4.3 Circuit Critical Path DelayUsing our techniques for trigger insertion, the critical path delay of all bench-marks remained the same except bgm, mkDelayWorker32B, and stereovision2. Ta-ble 5.5 shows the critical path delay of the circuits before (column labeled Unin-strumented) and after inserting different trigger circuits averaged over all the trig-ger circuits. The critical path delay increases no more than an additional 1.6nsfor bgm, 5ns for mkDelayWorker32B, and 1.6ns for stereovision2; in all cases, theworst case is when inserting trigger circuits with 16 states and a maximum of 6transitions per state. The increase in delay in these circuits is because these circuits100have a shorter critical path than the other circuits; circuits with a short critical pathare more sensitive to trigger insertion as a large trigger circuit can introduce a newlong critical path to the circuit. However, this delay increase is temporary dur-ing debugging and the circuit can operate at its normal frequency when the debuginstrumentation is not required.Comparison with Circuit Recompilation. As shown in Table 5.5, our tech-niques had no impact on the critical path delay of the benchmark circuits in ourexperiments, except for bgm, mkDelayWorker32B, and stereovision2, which arecircuits with short critical path. We also evaluate the critical path delay of theseaffected benchmarks when the trigger circuit is compiled with the user circuit inorder to compare with our techniques when trigger circuits with 16 states and amaximum of 6 transitions per state are inserted (worst case). Our experimentalresults show that after compiling the user circuit with these trigger circuits, thecritical path delay increases to 19.84ns, 6.53ns, and 12.25ns for bgm, mkDelay-Worker32B, and stereovision2, respectively. As expected, recompiling the usercircuit with the trigger circuit has less effect on the circuit delay compared to ourincremental techniques using an overlay. However, using our techniques enable thedesigners to rapidly implement the desirable trigger functions without requiring arecompilation.5.4.4 Comparison with Intel Quartus PrimeIn this section, we compare our techniques with incremental compilation feature ofIntel Quartus Prime, which can be used to modify the reconfiguration of SignalTapII Logic Analyzer.Incremental compilation is enabled by setting the user circuit partition to post-101Figure 5.10: An example of a state-based trigger to detect when the pattern0110101 has been received on a signal in consecutive cycles.Table 5.6: Comparison between This work and Quartus Prime on Intel StratixIV (EP4SGX180) Architecture for mcml.mcml This Work Quartus PrimeCircuit full compilation (s) 13074 2880Circuit Fmax (MHz) 15.01 28.82Runtime at each debug iteration (s) 3 460fit; the tool will attempt to preserve the placement and routing of the user circuit.To compare our techniques to Intel’s approach, we created a benchmark state-based trigger with a single user circuit input and a single trigger output; the circuitasserts its trigger output when the pattern 0110101 has been received on the inputin consecutive cycles. A state diagram showing this behaviour is in Figure 5.10.To instrument the circuit, we employ SignalTap II. We selected one registeredsignal at random from the circuit output as the trigger input.The degree to which the trigger circuit can be changed without recompilingthe design is limited in SignalTap II. Changing the number of states, trigger inputsignals, or transition conditions in the trigger circuit would require a recompilationof the trigger circuit (using Intel Quartus Prime’s incremental compilation flow,the recompilation can be limited to the trigger circuit). The tool provides an optionthat allows the target state of each transition to be changed without recompilingthe design, however, this flexibility comes at a cost in terms of resource usage102depending on the size of the trigger circuit. In our example, the trigger circuitrequires 461 LEs when this option was disabled. Enabling this option increases theresource usage to 721 LEs, an increase of 56%.Table 5.6 shows the full compilation time and maximum operating frequency(Fmax) for a large, 100,000 LUT, benchmark circuit mcml without any instrumen-tation, when targeting a Stratix IV architecture in version 15.1 of Quartus Primeand an architecture similar to Stratix IV using our flow.To understand the impact of making a trigger circuit change on the runtimerequired to perform a debug iteration, we modified our example trigger to detecta different pattern, 0111101. Table 5.6 shows the runtime to recompile the triggercircuit using our techniques and Quartus Prime.Unsurprisingly, the academic CAD tool takes significantly longer time to com-pile the uninstrumented circuit. However, the runtime required to perform a debugiteration is two orders of magnitude faster using the trigger insertion flow presentedin this chapter. This is due to the ability to rapidly map the trigger circuit onto theoverlay fabric rather than performing a recompilation.5.5 SummaryThis chapter presented a multi-level overlay architecture and new CAD techniquesthat are specialized for for small combinational and sequential circuits with a singleoutput; such circuits are typical of common trigger functions. Such specializationprovides enough flexibility to enable complex triggering capabilities suitable fordebugging. We have shown that the overlay fabric can be reconfigured to mapvarious combinational and sequential triggering scenarios in less than 40 seconds,enabling rapid debug iterations. Our experiments have shown that our customized103overlay architecture and trigger mapping algorithms have only a small impact onthe critical path delay.104Chapter 6Post-Silicon Coverage usingOverlaysAs discussed in Chapter 2, coverage estimation during post-silicon is a challeng-ing task and there is no standardized coverage metric and methodology for cov-erage analysis. The key challenge is the poor observability into the behaviour ofa running chip. Although this visibility problem can be addressed by designingcircuit-specific monitors and implementing them on-chip along with the user cir-cuit, this can introduce significant area overhead. On the other hand, implementinga small subset of important coverage monitors negatively impacts a complete cov-erage analysis.In this chapter, we make the observation that we can re-purpose existing FPGA-based on-chip debug cores from earlier in this thesis to facilitate on-chip cover-age monitoring. More specifically, we show that the instrumentation frameworkspresented in Chapter 4 and Chapter 5, which were designed to implement trigger105circuits, can be used to support the time-multiplexed implementation of coveragemonitors. Instead of employing this infrastructure – a virtual overlay fabric thatcan be rapidly reconfigured (without compilation) during the verification process– to record a limited window of trace data, we instead use this overlay to measurecoverage over the entire circuit execution. Since this overlay is implemented usingspare FPGA resources, the area overhead of instrumentation is essentially zero andits performance overhead is negligible.In this chapter, we describe our proposed instrumentation framework to effi-ciently implement on-chip coverage monitors in Section 6.1. In Section 6.2.1 wewill revisit the overlay architecture presented in Chapter 5, and show how it canbe modified to provide enough flexibility for implementing coverage monitoringcircuits. Section6.3 describes coverage monitor circuits that are mapped to thisoverlay fabric to gather branch coverage data. Section 6.4 describes our mappingalgorithm to rapidly map these branch coverage monitors to a set of configurationbits. Section 6.5 details the experimental methodology and steps that were per-formed to evaluate our techniques. In Section 6.6, we evaluate our techniques interms of area overhead, overlay flexibility, and runtime. Finally in Section 6.7 wepresent our conclusions. A condensed version of this work was published in [46].6.1 FrameworkFigure 6.1 illustrates our overall framework. First the user circuit is compiled ontothe FPGA as normal, without any coverage monitor instrumentation. Then, the usercircuit is frozen, and instrumentation is added to the circuit using only resources(logic blocks, memories, and routing tracks) that are not used by the user circuit,without disturbing the packing, placement, and routing of the user circuit.106Figure 6.1: Overview of our framework.Rather than instrumenting the user circuit with fixed coverage monitors, ourinstrumentation consists of a flexible overlay fabric that is flexible enough so thatit can be configured, at runtime, to implement one or more coverage monitor func-tions. The flexibility of the overlay to implement coverage monitor functionscomes from a set of configuration bits. Since the overlay is specialized for cov-erage monitor circuits, the overlay can be configured by setting a small number ofconfiguration bits. Importantly, mapping the coverage monitor functions to this setof configuration bits is much faster than a full recompilation of the user circuit withcoverage monitors.After the user circuit has been mapped to the FPGA, and the overlay fabricmapped to the available resources, testing of the circuit proceeds. Before runningthe circuit under all test cases, coverage monitor functions applicable to this usercircuit are identified, and a subset of the set of these coverage monitor functionsare mapped to the overlay, as described above. The design is then run and themonitor circuits capture coverage data. The coverage data is stored in on-chip107memory while the design is running at-speed under all test cases. After the designexecution, this overlay fabric is then reconfigured to implement another subset ofcoverage monitors and the design is re-executed under all test cases and the cov-erage data is stored in on-chip memory. This process is repeated and the overlayfabric is frequently reconfigured until all coverage monitors are implemented andall coverage data is gathered. At the end, the coverage information can be extractedfrom the instrumentation (using device read-back techniques) for off-line coverageanalysis and an overall coverage measurement. Importantly, overlay reconfigura-tion and coverage instrumentation does not change the user circuit characteristicsduring the validation process since the overlay is separated from the user circuit.The size of the overlay fabric can be adjusted based on the proportion of theFPGA that is unused by the user circuit; if the user circuit uses most of the FPGA,the overlay fabric will be small, and can only implement a small number of cover-age monitor functions at at time, while if there are many unused FPGA resourcesafter mapping the user circuit, a larger fabric, capable of implementing many si-multaneous monitor functions can be implemented.6.2 Coverage Overlay ArchitectureThe design of a suitable overlay fabric is critical for our method. The overlay mustbe flexible enough to implement a variety of coverage monitor functions, withas little overhead as possible. A key observation in this work is that the overlayneeded to gather coverage information is very similar to that previously proposedto enhance visibility during post-silicon debug. This section presents the overlayarchitecture optimized to implement coverage monitor circuits. This section showshow the debug-oriented overlay presented in Chapter 5 can be enhanced to better108implement coverage monitor functions.6.2.1 Overlay ArchitectureFigure 6.2 illustrates our overlay architecture. As shown, the overlay is comprisedof a number of cells connected in a triangular reduction-network pattern. Levels arelabeled 0..lo−1; level i contains 2i cells similar to the debug overlay architecturepresented in Chapter 5 (Figure 5.1a). To better implement coverage monitor func-tions, we revise this debug overlay fabric; instead of including a bank of flip-flopsseparated from overlay cells, we enhance the overlay cell design where each over-lay cell consists of logic elements (LEs). The design of an overlay cell is shownin Figure 6.2b. Each overlay cell contains eight LEs. Each LE contains a 4-inputLUT and a flip-flop. The LEs within a cell are connected using a fully-connectedcrossbar network. Each overlay cell has ten input and eight output pins (except thecell in level 0). The interconnect between cells is also shown in figure 6.2a. Thetrigger output is used to control trace buffers. As described in Chapter 6, we dis-tinguish between two categories of interconnects: (1) primary connections that areused to flow signal in forward direction, (2) auxiliary connections that are added toenable each overlay cell to flow data in backward direction for supporting flip-flopfeedbacks of coverage monitor circuits as will be described in Section 6.4.Overlay cells are designed to be similar to FPGA logic clusters for efficientoverlay compilation. We used the same incremental approach used in Chapter 5(Section 5.1.2) to incrementally compile this overlay architecture onto FPGA sparelogic clusters and routing tracks after user circuit compilation while preserving themapping of the underlying user circuit.109(a) The overall structure of the coverage overlay architecture.(b) Overlay Cell (OC) detailsFigure 6.2: (a) Coverage overlay architecture that is comprised of cells con-nected via primary (shown in black) and auxiliary (shown in red) con-nections, inputs to the overlay are control signals from the user circuitand its output controls trace buffers. (b) Overlay cell design with LogicElements (LE), local routing, and outputs to other cells, each LE con-tains a 4-input LUT and a flip-flop.110Figure 6.3: Branch coverage instrumentation that is mapped onto the overlayfabric.6.3 Branch Coverage InstrumentationKey to the flow described in Section 6.1 is the ability to map coverage monitorfunctions onto the pre-synthesized overlay fabric. In this work, we consider branchcoverage, where instruments are attached to the control signals generated as partof if–else-if–else, case, and ternary operator (?:) statements.Coverage Monitor Circuitry: Figure 6.3 shows our coverage instrumenta-tion. Control signals extracted from the user circuit are first captured into a flip-flop, implemented inside the trigger overlay, in order to limit the effect of thisinstrumentation on the timing performance of the circuit. The output of this initialflip-flop is then fed into a second flip-flop that captures the value of the controlsignal in the previous clock cycle. The upper OR gate causes this second flip-flopto latch when a high value is encountered, and to remain at that value, while thelower AND gate (with inverted input) detects the cycle at which a rising transitionoccurred. Once this transition is detected, it passes through a reduction OR gate(and another pipelining flip-flop) that causes the trace-buffer to write the state of111all coverage monitors upon any rising transition. Since each coverage monitor canonly transition once, the size requirements of this trace buffer scale linearly withthe number of monitors. Due to the nature of FPGA technology, all flip-flops canbe initialized with a value of ‘0’.Control Signal Visibility: The ability to attach coverage instrumentation ontoif/case control signals is dependent on whether this exact signal exists; due to syn-thesis optimizations, signals that exist in the original HDL may not be present inthe uninstrumented result. Attaching monitors to such signals will cause a differ-ent optimization trajectory, and it would be expected that doing so would incur anarea or delay overhead. This is an issue that exists regardless of whether cover-age instruments are inserted at compile-time, as with the baseline case, or whetherthey are time-multiplexed at runtime as we propose. The overhead for the runtimecase, however, is compounded by the necessity to preserve all control signals to al-low coverage monitors to be attached at runtime, whereas in the compile-time caseonly those control signals attached to pre-selected monitors would be affected. Wequantify this effect in our experiments; however, for future work we believe thisoverhead can be mitigated by employing spare FPGA logic to also ‘reconstruct’the value of optimized signals.6.4 Coverage to Overlay MappingDuring validation, a mapping algorithm is required to map coverage monitor func-tions onto the pre-synthesized overlay fabric. We use the simultaneous placement-and-routing algorithm 2 presented in Chapter 5 to find a mapping of LUTs ofcoverage monitor functions to a LUT-slot inside the overlay cells. For mappingsequential part of coverage monitor functions, we revised this algorithm to also112place each flip-flop of the coverage monitors onto a flip-flop slot in an overlay cell.Both forward and backward connections in the overlay routing network are used toenhance the routing flexibility when connecting flip-flop feedback paths.6.5 MethodologyIn order to evaluate our approach, we implemented our techniques using the FPGACAD tool VPR, which is part of the open-source academic Verilog-To-Routing(VTR) project [96]. As discussed in Chapter 3, using an open-source tool wasnecessary because implementing our approach requires low-level resource manip-ulation, whereas device information is proprietary in commercial tools.Similar to our methodology in Chapter 3, we used circuits that are suppliedwith the VTR project as underlying user circuits. Using VPR, we packed, placed,and routed each circuit onto the smallest FPGA array that can accommodate thecircuit using the default VPR architecture based on the Intel Stratix IV. As is com-monplace in FPGA research, we first map benchmark circuits to the minimum sizedFPGA that it fits onto, and using the minimum number of routing tracks. We theninflate this minimum value by 30% to reflect realistic routing demand. For eachcircuit, unused RAM blocks in the FPGA are reclaimed for recording coverage in-formation. In these experiments, we constructed an overlay fabric with the numberof overlay cell levels lo = 8 using incremental techniques presented in Chapter 4.6.6 Experimental ResultsIn this section, we evaluate our runtime instrumentation techniques for insertingon-chip monitors for branch coverage at runtime in terms of (1) runtime and (2) theimpact on circuit critical path delay versus compile-time instrumentation in which113Table 6.1: Benchmark circuits, total number of branch control signals, maxi-mum number of on-chip monitors that fit on FPGA device at each instru-mentation iteration, number of total instrumentation iterations for branchcoverage analysis.Benchmark# ofControlSignalsCompile-time Instr. Run-time Instr.# ofmonitorsper iteration# of iterations# ofmonitorsper iteration# of iterationsLU8PEEng 2335 291 9 60 39LU32PEEng 7639 238 33 60 128mcml 1213 303 5 60 21or1200 647 647 1 60 11stereovision2 211 105 3 60 4a number of pre-selected monitors are inserted into the design before compilation.Additionally, we evaluate the area overhead of preserving all control signals re-quired for runtime instrumentation as was discussed in Section 6.3.Coverage Instrumentation: Table 6.1 presents the total number of branchcontrol signals for each benchmark circuit. The compile-time column in the tablepresents the maximum number of on-chip coverage monitors that fit into the FPGAarray that accommodates the circuit. As expected, it was not possible to insert allthe monitors at once (except for or1200) as they did not fit into the device. Hence,it requires multiple compilation iterations to instrument the design for coverageanalysis. The number of recompilation iterations is also presented in Table 6.1 foreach benchmark.For our runtime instrumentation approach, we are able to map a subset of 60monitors (attached to 60 control signals) onto the overlay fabric. Selecting sub-sets of monitors to map to the overlay, we group monitors related to if/else or casestatements into subsets comprising of 60 monitors. For each benchmark, we map1140510152025LU8PEEngLU32PEEngmcmlor1200stereo2Speedup (X)Figure 6.4: Speedup of runtime branch coverage instrumentation in compari-son to compile-time instrumentation over multiple instrumentation iter-ations.each subset onto the overlay and average the mapping runtime over all subsets. Forour benchmarks, the time to map a subset of 60 monitors to the overlay fabric isless than 10 seconds on average. Even thought our approach requires more iter-ations to implement all monitors to gather all branch coverage data, it only takes10 seconds to reconfigure the overlay to map a new subset of monitors. In con-trast, the compile-time approach suffers a full recompile time of the user circuitwith the instruments at each iteration. Figure 6.4 shows the total speedup of ourtechniques (this includes the circuit compilation time, overlay construction time,and multiple iterations of overlay reconfiguration time) when compared with thecompile-time instrumentation which includes multiple iterations of full recompi-lation. As the results show, large circuits, and in particular, circuits with manybranch control signals, significantly benefit from our runtime instrumentation. Forinstance, our approach provides the speedup of 23X for LU32PEEng benchmarkthat has a large number (thousands) of control signals. This speedup is achieved by115Table 6.2: Circuit critical path delay (ns) of uninstrumented circuit, instru-mented circuit at compile-time and runtime instrumentation.Benchmark Uninstrumented Compile-time Instr. Runtime Instr.LU8PEEng 89.6 90 90LU32PEEng 91.9 90.2 91.9mcml 66.6 63.5 67.31or1200 11.6 11.6 11.87stereovision2 11.9 11.8 12eliminating a need for recompilation at each instrumentation iteration. For smallcircuits with a relatively small number of control signals, the benefit of our ap-proach is smaller. For or1200 benchmark, although it was possible to insert allmonitors into the FPGA at compile time, as expected, this lead to a more complexand difficult mapping problem for the FPGA CAD tools which increases compila-tion time compared to our approach where the instrumentation is separated fromthe user circuit.Table 6.2 presents the critical path delay of the user circuit without any in-strumentation, after compile-time instrumentation and using our approach. Forcompile-time instrumentation, the critical path delay of the benchmark circuitschanges as the instrumented circuit is recompiled from scratch. For some bench-marks, the critical path delay slightly decreases which can attributed to the algo-rithmic noise of CAD algorithms. Using our approach for runtime instrumentation,we had no impact or little impact on the circuit critical path delay; the critical pathdelay increases no worse than an additional 2.3% for or1200. As explained earlier,this delay increase is partly due to preserving all branch control signals. Addition-ally, small circuits with short critical path (such as or1200 and stereovision2) aremore sensitive as our instrumentation can introduce a new long critical path to the116024681012LU8PEEngLU32PEEngmcmlor1200stereo2Increase in Area(%)Figure 6.5: Impact of preserving all control signals on area in comparison tothe uninstrumented circuit where these signals can be optimized awayif possible.circuit.Impact of Control Signal Visibility on Area: Figure 6.5 shows the area over-head of preserving all branch control signals in comparison to the same circuitwhere the synthesis tool is free to optimize these signals. As discussed in Sec-tion 6.3, this area increase (average 6.4%) is the price we have to pay to enableruntime instrumentation.6.7 SummaryThis chapter demonstrates an instrumentation framework that enables on-chip cov-erage observability in a way that does not require large area overhead of monitorsduring FPGA-based validation. To achieve this, we utilize existing FPGA debugcores and techniques presented in Chapter 6 that employ an overlay fabric for de-bug instrumentation in order to implement monitors and gather branch coveragedata. Because the overlay is customized for coverage monitoring, the time to com-117pile monitors to the overlay fabric is fast and the overlay can be reconfigured atruntime to implement monitors in a time-multiplexed fashion to gather coveragedata when the design is running at speed under all test cases. Since the overlayis separated form the user circuit, the user circuit characteristics remain the sameduring post-silicon validation process.The key novelties of this work are in: (1) a scalable and area-efficient coverageinstrumentation framework that enables coverage monitoring at post-silicon, (2)specialized overlay architecture and mapping tools supporting runtime implemen-tation of monitors, (3) branch coverage instrumentation to enable runtime evalua-tion of branch coverage during post-silicon validation.The significance of this work is that it enables designers to gather coverage datawith low overhead during FPGA-based validation while the circuit is operatingmany orders of magnitude faster under more realistic test cases than simulation.118Chapter 7Conclusion and Future Work7.1 ConclusionThe past several decades have seen tremendous growth in the capacity and capabil-ity of FPGAs. FPGA platforms are commonly used for fast prototyping to evaluateand validate the functionality of complex designs. Debugging a design while itis running at execution speed on a real hardware enables the designers to verifythe design functionality in system-level with long real-world stimulus and higherfunctional coverage compared to software simulators.However, post-silicon debug and verification is challenging due to the poorvisibility into the design behaviour. As described in Chapter 2, observability canbe added by including commercial or academic trace-based instrumentation. Thisinstrumentation often records the runtime behaviour of selected signals in the chip,allowing it to be played back later using debug tools. Most of these debug flows,however, require the design to be recompiled every time the instrumentation ischanged. For very large designs, this can be prohibitive which can severely limit119debug productivity.Moreover, as discussed in Chapter 2, coverage analysis during post-silicon val-idation is very difficult task due to lack of observability. On-chip circuit-specificcoverage monitors have not been widely used in practice since implementing alarge number of coverage monitors imposes a high area overhead and is not afford-able in silicon. Hence, in this thesis, we have proposed area-efficient methods forinstrumentation of FPGA designs for debug and coverage monitoring in a way thatdoes not require frequent recompilation during FPGA debug and verification pro-cess. The rest of this chapter is as follows. Section 7.2 provides a summary of thecontribution of this thesis and Section 7.3 discusses their limitations and possibleresearch directions based on the findings from this thesis.7.2 Summary of ContributionsIn Chapter 3, we used incremental compilation techniques to insert the trigger cir-cuitry into a design without requiring a full recompilation; trigger logic elementsare distributed throughout the unused logic resources after user circuit compilationand spare routing resources are used to make the required connections betweenthese logic elements and to the user circuit using incremental routing techniques.Our results show that it is feasible to distribute trigger logic over spare logic re-sources after user circuit compilation. However, making connections between thesedistributed trigger logics depends on the routing congestion in the user circuit.In Chapter 4, we enhanced the ability of rapid trigger implementation by in-troducing an instrumentation framework through the use of overlay architectures,which are compiled once, and configurable between debug iterations. We then pre-sented an overlay architecture based on a 2D torus that provides adequate flexibility120for implementing combinational trigger circuits.We then present non-intrusive, adaptive, and best-effort CAD techniques formapping the overlay architecture on top of the user circuit. Our approach is area-efficient because we only use those resources left unused by the user circuit andhence the area overhead is potentially zero. The construction algorithm includesthree major parts: (1) overlay cell selection, in which partially used or empty logicclusters are selected as the overlay cells to create a logical overlay fabric; (2) adap-tive overlay placement, in which an adaptive simulated annealing based placementalgorithm is used to place the logical overlay into those FPGA resource left unusedby the user circuit while minimizing the total wire length; and (3) best-effort over-lay routing in which the routability-driven PathFinder algorithm is used to itera-tively resolve routing congestion. If congestion cannot be resolved, illegal connec-tions that are overusing routing resources are iteratively removed from the overlayuntil congestion is resolved and a legal routing solution is achieved.At debug time, a routing-aware simulated-annealing based placement algo-rithm was proposed to place a trigger circuitry onto the overlay; swaps that couldresult in routing failure were heavily penalized to enhance the routability of theplacement solution.Our experiments have shown that for the benchmarks that were investigated,we were able to build overlay architectures with different sizes depending on theavailable spare resources while increasing the circuit compile time by an averageof 22% assuming a maximally-sized overlay fabric. However, this compile time isonly required once to build the overlay architecture. During debugging, this overlayfabric can be reconfigure multiple times to implement different trigger circuits. Ourexperiments have shown that the overlay architecture provides enough flexibility121to implement different combinational trigger circuits at least an order of magnitudefaster than a full recompilation, enabling rapid debug iterations.In Chapter 5, we extend our instrumentation framework in Chapter 4 by in-troducing a new overlay architecture and mapping tools optimized for implement-ing sequential (state-based) trigger circuits. We proposed a parameterized overlayarchitecture family that is specialized to implement sequential trigger functions.To implement the logic part of the trigger circuit, the overlay contains a numberof cells connected in a triangular reduction-network pattern. To support sequen-tial triggers, the overlay also contains a bank of flip-flops. We then modified theoverlay construction strategy presented in Chapter 4 to construct this multi-leveloverlay fabric on top of user circuit. The overlay fabric increases compile time nomore than 2.5% for the benchmark circuits that were investigated. As mentionedbefore, the overlay compile time is a one-time overhead.Also in Chapter 5, we developed a trigger mapping flow to implement triggercircuits onto the overlay fabric between debug iterations. To take advantage of thespecialized overlay architecture, we developed a mapping flow to perform a grad-ual and simultaneous placement and routing to ensure a legal mapping solution,which includes two major phases: (1) Pre-processing, in which the trigger circuitgraph is modified to suit the optimized overlay architecture; (2) Placement/Rout-ing, in which the logic elements of the trigger circuit are placed and routed onto theoverlay cells. We first described a mapping algorithm for combinational triggersand then generalized the algorithm for sequential trigger circuits. Our experimentshave shown that the overlay fabric can be rapidly reconfigured to implement differ-ent trigger circuits. For our benchmarks, we were able to reconfigure the overlayfabric in less than 40 seconds to implement different combinational and sequential122triggering scenarios. This results suggest that our triggering framework is a viableapproach for enabling rapid triggering capabilities during FPGA debug.In Chapter 6, we utilized trace-based debug infrastructure to provide visibil-ity for coverage evaluation during post-silicon verification; instead of using tracebuffers to record trace data for a limited number of clock cycles, we used tracebuffers to record coverage data during the entire design execution.In this chapter, we first proposed a framework for implementing on-chip cov-erage monitors with low area overhead and in a way that does not require recom-pilation of the user circuit. Our proposed solution is through the use of an overlayarchitecture which is compiled once, and is configurable at runtime to implementcoverage monitors in a time-multiplexed fashion to gather coverage data. We re-visited the multi-level overlay architecture and mapping algorithms presented inChapter 5, and made them suitable to support the time-multiplexed integration ofcoverage monitors.We evaluated our techniques for runtime implementation of on-chip branchcoverage monitors versus compile-time instrumentation, where the design is in-strumented with a fixed number of coverage monitors before compilation. Ourexperiments have shown that using our approach to gather branch coverage datais up to 23X faster compared to compile-time instrumentation. To allow runtimebranch coverage monitoring, all control signals for if/case in the original HDL werepreserved that resulted in an area overhead of 6.4% on average compared to unin-strumented design where these signals are optimized if possible. The significanceof this contribution is that it enables designers to find functional coverage duringFPGA-based validation with a low area overhead and without performing lengthyrecompilations.1237.3 Limitations and Future Research DirectionsThis section discusses limitations of this work and outlines a number of possibleextensions which can be explored in future work.Using our techniques for incremental instrumentation may increase the delayof the instrumented design. This may limit the use of our techniques in debugand verification of applications that require strict timing constraints. One examplewould be video encoding/decoding engine in media streaming applications, whereframes are expected to be processed in a certain frame rate. To minimized the effectof instrumentation on circuit timing, future work can apply pipelining techniquesto the overlay connections in order to reduce the effect of debug or coverage in-struments on circuit performance. One challenge would be to map the trigger orcoverage monitor circuits to this pipelined overlay without changing the function-ality. Typically, the drawback of doing this is an increase in signal latency as thesignal will require more clock cycles to reach its destination, resulting in a delay tostop/start capturing data. However, this will not be an overhead as stopping/startingrecording data a few cycles after an interesting event is tolerable.In Chapter 6, in order to enable runtime branch coverage monitoring, we iden-tified branch control signals from the original HDL and prevented the synthesistool from optimizing these signals, so that these signals exist for runtime coveragemonitoring at post-silicon. This preservation prevented the synthesis tool to fullyoptimize the design. This could result in an area or delay overhead in the userdesign. One possible solution to avoid these overheads is to reclaim FPGA sparelogic resources to duplicate logics required to reconstruct optimized signals usingincremental techniques. This way, the tool is free to fully optimize the user circuit.124Another possible direction is to use our instrumentation framework for imple-menting compression schemes. To make the most efficient use of trace buffers, it isoften useful to compress data before it is stored. In general, trace data is extremelycompressible. This compression may be general purpose as described in [11] or op-timized for a specific application as in [52]. A general purpose compression enginemay be too large to implement in an overlay, however, simple application-specificcompression schemes may be very suitable.In this thesis we assumed the user circuit is not implemented on an overlay(but rather compiled to the native FPGA resources). As discussed in Chapter 2,several researchers have described how overlays can accelerate the design and com-pile time of FPGA designs in recent years. If the user circuit is implemented onan overlay, another interesting architectural possibility exists: co-optimizing theoverlay architecture for both the user circuit and the debug instrumentation. Thiswould eliminate the need for a separate debug overlay and would provide the abil-ity to rapidly change the allocation of space between the user circuit and the debuginstrumentation. Implementing such an overlay also needs to employ custom com-pilation tools. Therefore, an interesting direction of research could be to find a wayto create a single overlay architecture that works efficiently for the user circuit andimplements desirable instruments for post-silicon debug.We showed that our techniques perform better in terms of overlay constructiontime and routability when less circuit routing congestion exists. [67] showed thatcommercial tools do not necessarily pack and place as densely as VPR as long asthe design fits onto the device in order to provide better circuit routability. Con-sequently, we expect that our techniques to be more effective in commercial toolsthan in VPR. Additionally, in this thesis debug instrumentation is limited to us-125ing only the resources that were not used by the user circuit. Although this hassome advantages such as allowing designers to debug the instrumented design thatis close to the uninstrumented circuit, and preserving low-level optimizations, wewould like to investigate the effect of relaxing this limitation in future. For exam-ple, it may be beneficial to enable the tool to pre-reserve some logic blocks androuting wires specifically for constructing the overlay on top of a less compactpacked and placed design.Finally, in this thesis we used overlays to implement triggering circuitry thatis used to monitor for a specific event to control recording data. Alternatively,instead of simply controlling when to record data, it may also be interesting toexplore the possibility of using virtual debug overlays to also change signal valuesin the circuit, that can be used to implement bug fixes or evaluate error resilience(e.g bug injection) during post-silicon debug.126Bibliography[1] IEEE Standard for property specification language (PSL). IEEE Std1850-2010 (Revision of IEEE Std 1850-2005), pages 1–182, April 2010.doi:10.1109/IEEESTD.2010.5446004. → page 27[2] Design and verification in the SoC era. Mentor Graphics, 2014. URLhttps://verificationacademy.com/seminars/design-verification-soc-era. →page 1[3] Modelsim: ASIC and FPGA design. Mentor Graphics, 2014. URLhttp://www.mentor.com/products/fv/modelsim/. → page 1[4] M. Abramovici. In-system silicon validation and debug. IEEE Design Testof Computers, 25(3):216–223, May 2008. → page 18[5] M. Abramovici, P. Bradley, K. Dwarakanath, P. Levin, G. Memmi, andD. Miller. A reconfigurable design-for-debug infrastructure for SoCs. InProceedings of the Design Automation Conference, pages 7–12. ACM,2006. → page 19[6] A. Adir, A. Nahir, A. Ziv, C. Meissner, and J. Schumann. Reachingcoverage closure in post-silicon validation. In Proceedings of theInternational Conference on Hardware and Software: Verification andTesting, HVC’10, pages 60–75, Berlin, Heidelberg, 2011. Springer-Verlag.ISBN 978-3-642-19582-2. → page 28[7] R. Agalya and S. Saravanan. Simultaneous signal selection for silicondebug through mixed-integer linear programming. In InternationalConference on Emerging Trends in Engineering, Technology and Science(ICETETS), pages 1–3, Feb 2016. → page 19[8] E. Ahmed and J. Rose. The effect of LUT and cluster size ondeep-submicron FPGA performance and density. IEEE Transactions on127Very Large Scale Integration Systems, 12(3):288–298, 2004. → pages66, 94[9] G. Al-Hayek and C. Robach. From design validation to hardware testing:A unified approach. Journal of Electronic Testing, 14(1):133–140, Feb1999. ISSN 1573-0727. doi:10.1023/A:1008317826940. URLhttps://doi.org/10.1023/A:1008317826940. → page 27[10] Amazon. Amazon ec2 f1 instances: Run customizable FPGAs in the AWScloud, Dec. 2016. URL https://aws.amazon.com/ec2/instance-types/f1/.→ page 2[11] E. Anis and N. Nicolici. On using lossless compression of debug data inembedded logic analysis. In International Test Conference, pages 1–10.IEEE, 2007. → page 125[12] E. Anis and N. Nicolici. Interactive presentation: Low cost debugarchitecture using lossy compression for silicon debug. In Proceedings ofthe conference on Design, automation and test in Europe, pages 225–230.EDA Consortium, 2007. → page 19[13] E. Anis Daoud and N. Nicolici. On using lossy compression for repeatableexperiments during silicon debug. IEEE Transactions on Computers, 60(7):937–950, 2011. → page 19[14] K. Balston, A. J. Hu, S. J. E. Wilton, and A. Nahir. Emulation inpost-silicon validation: It’s not just for functionality anymore. In IEEEInternational High Level Design Validation and Test Workshop (HLDVT),pages 110–117, Nov 2012. → page 28[15] K. Balston, M. Karimibiuki, A. J. Hu, A. Ivanov, and S. J. E. Wilton.Post-silicon code coverage for multiprocessor system-on-chip designs.Transactions on Computers, 62(2):242–246, Feb 2013. → page 28[16] K. Basu and P. Mishra. Efficient trace data compression using staticallyselected dictionary. In VLSI Test Symposium (VTS), pages 14–19. IEEE,2011. → page 19[17] K. Basu and P. Mishra. Rats: restoration-aware trace signal selection forpost-silicon validation. IEEE Transactions on Very Large Scale Integration(VLSI) Systems, 21(4):605–613, 2013. → page 19128[18] K. Basu, P. Mishra, and P. Patra. Efficient combination of trace and scansignals for post silicon validation and debug. In International TestConference, pages 1–8. IEEE, Sept 2011. → page 19[19] S. BeigMohammadi and B. Alizadeh. Combinational trace signal selectionwith improved state restoration for post-silicon debug. In Design,Automation Test in Europe Conference Exhibition (DATE), pages1369–1374, March 2016. → page 19[20] V. Betz, J. Rose, and A. Marquardt, editors. Architecture and CAD forDeep-Submicron FPGAs. Kluwer Academic Publishers, MA, USA, 1999.ISBN 0792384601. → pages 37, 38, 59[21] S. Birk, J. G. Steffan, and J. H. Anderson. Parallelizing FPGA placementusing transactional memory. In International Conference onField-Programmable Technology (FPT), pages 61–69. IEEE, 2010. →page 59[22] T. Bojan, M. A. Arreola, E. Shlomo, and T. Shachar. Functional coveragemeasurements and results in post-silicon validation of core 2 duo family. InInternational High Level Design Validation and Test Workshop, pages145–150. IEEE, Nov 2007. → page 28[23] M. Boul and Z. Zilic. Generating Hardware Assertion Checkers: ForHardware Verification, Emulation, Post-Fabrication Debugging andOn-Line Monitoring. Springer Publishing Company, 2008. → page 28[24] M. Boule and Z. Zilic. Incorporating efficient assertion checkers intohardware emulation. In Proceedings of International Conference onComputer Design: VLSI in Computers and Processors (ICCD), pages221–228. IEEE, 2005.[25] M. Boule, J.-S. Chenard, and Z. Zilic. Assertion checkers in verification,silicon debug and in-field diagnosis. In International Symposium onQuality Electronic Design (ISQED), pages 613–620. IEEE, 2007. → page28[26] A. Brant and G. G. Lemieux. ZUMA: An open FPGA overlay architecture.In Annual International Symposium on Field-Programmable CustomComputing Machines (FCCM), pages 93–96. IEEE, 2012. → pages 6, 30[27] P. K. Bussa, J. Goeders, and S. J. E. Wilton. Accelerating in-system FPGAdebug of high-level synthesis circuits using incremental compilation129techniques. In International Conference on Field Programmable Logic andApplications (FPL), pages 1–4. IEEE, Sept 2017. → page 24[28] D. Bustan, D. Korchemny, E. Seligman, and J. Yang. Systemverilogassertions: Past, present, and future SVA standardization experience.Design & Test of Computers, IEEE, 29(2):23–31, 2012. → page 27[29] N. Calagar, S. D. Brown, and J. H. Anderson. Source-level debugging forFPGA high-level synthesis. In International Conference on FieldProgrammable Logic and Applications (FPL), pages 1–8. IEEE, 2014. →pages 4, 17, 23, 24[30] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, T. Czajkowski,S. D. Brown, and J. H. Anderson. Legup: An open-source high-levelsynthesis tool for FPGA-based processor/accelerator systems. ACM Trans.Embed. Comput. Syst., 13(2):24:1–24:27, Sept. 2013. URLhttp://doi.acm.org/10.1145/2514740. → page 23[31] D. Capalija and T. Abdelrahman. Tile-based bottom-up compilation ofcustom mesh-of-FUs FPGA overlays. In International Conference on FieldProgrammable Logic and Applications. IEEE, 2014. → pages 6, 31[32] D. Capalija and T. S. Abdelrahman. A high-performance overlayarchitecture for pipelined execution of data flow graphs. In InternationalConference on Field programmable Logic and Applications, pages 1–8.IEEE, Sept 2013. → page 31[33] H. Y. Cheah, S. A. Fahmy, and D. L. Maskell. iDEA: A DSP block basedFPGA soft processor. In International Conference on Field-ProgrammableTechnology, pages 151–158, Dec 2012. → page 31[34] W. Chen, S. Ray, J. Bhadra, M. Abadir, and L. C. Wang. Challenges andtrends in modern SoC design verification. IEEE Design Test, 34(5):7–22,Oct 2017. → pages 2, 15[35] J. Cong, J. Peck, and Y. Ding. RASP: A general logic synthesis system forSRAM-based FPGAs. In Proceedings of the international symposium onField-programmable gate arrays, pages 137–143. ACM, 1996. → page 53[36] J. Cong, H. Huang, and X. Yuan. Technology mapping and architectureevalution for k/m-macrocell-based FPGAs. ACM Transactions on DesignAutomation of Electronic Systems, 10(1):3–23, 2005. → page 53130[37] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang.High-level synthesis for FPGAs: From prototyping to deployment. IEEETransactions on Computer-Aided Design of Integrated Circuits andSystems, 30(4):473–491, 2011. → page 2[38] P. Cooke, L. Hao, and G. Stitt. Finite-state-machine overlay architecturesfor fast FPGA compilation and application portability. ACM Transactionson Embedded Computing Systems (TECS), 14(3):54, 2015. → page 31[39] J. Coole and G. Stitt. Intermediate fabrics: Virtual architectures for circuitportability and fast placement and routing. In Proceedings of theinternational conference on Hardware/software codesign and systemsynthesis, pages 13–22. ACM, 2010. → page 6[40] H. M. Cristinel Ababei and K. Bazargan. Three-dimensional place androute for FPGAs. IEEE Transactions on Computer-Aided Design ofIntegrated Circuits and Systems, 25(6):1132–1140, 2006. → page 59[41] E. A. Daoud and N. Nicolici. Real-time lossless compression for silicondebug. IEEE Transactions on Computer-Aided Design of IntegratedCircuits and Systems, 28(9):1387–1400, Sept 2009. ISSN 0278-0070. →page 19[42] F. Eslami and S. J. Wilton. Incremental distributed trigger insertion forefficient fpga debug. In International Conference on Field ProgrammableLogic and Applications (FPL), pages 1–4. IEEE, 2014. → pages vi, 9, 36[43] F. Eslami and S. J. Wilton. An improved overlay and mapping algorithmsupporting rapid triggering for FPGA debug. ACM SIGARCH ComputerArchitecture News, 44(4):20–25, 2017. → pages vii, 11, 76[44] F. Eslami and S. J. E. Wilton. An adaptive virtual overlay for fast triggerinsertion for FPGA debug. In International Conference on FieldProgrammable Technology (FPT), pages 32–39. IEEE, Dec 2015. → pagesvi, 10, 51[45] F. Farahmandi, R. Morad, A. Ziv, Z. Nevo, and P. Mishra. Cost-effectiveanalysis of post-silicon functional coverage events. In Design, AutomationTest in Europe Conference Exhibition (DATE), pages 392–397, March2017. → page 29[46] E. H. Fatemeh Eslami and S. J. E. Wilton. Extending post-silicon coveragemeasurement using time-multiplexed FPGA overlays. In European TestSymposium (ETS). IEEE, May 2018. → pages vii, 12, 106131[47] H. Foster, A. Krolnik, and D. Lacey. Assertion-based design. KluwerAcademic Publishers, Boston, USA, 2004. ISBN1402080271;9781402080272. → page 27[48] H. D. Foster. Trends in functional verification: a 2014 industry study. InProceedings of the Annual Design Automation Conference, page 48. ACM,2015. → pages 2, 15[49] R. O. Gallardo, A. J. Huy, A. Ivanov, and M. S. Mirian. Reducingpost-silicon coverage monitoring overhead with emulation and bayesianfeature selection. In International Conference on Computer-Aided Design(ICCAD), pages 816–823. IEEE/ACM, Nov 2015. → page 28[50] M. Gao and K. T. Cheng. A case study of time-multiplexed assertionchecking for post-silicon debugging. In International High Level DesignValidation and Test Workshop (HLDVT), pages 90–96. IEEE, June 2010.→ page 29[51] J. Goeders and S. Wilton. Effective FPGA debug for high-level synthesisgenerated circuits. In International Conference on Field ProgrammableLogic and Applications, pages 1–8. IEEE, 2014. → pages 23, 24[52] J. Goeders and S. J. Wilton. Using dynamic signal-tracing to debugcompiler-optimized HLS circuits on FPGAs. In International Symposiumon Field-Programmable Custom Computing Machines, pages 127–134.IEEE, 2015. → pages 19, 125[53] J. Goeders and S. J. Wilton. Signal-tracing techniques for in-system FPGAdebugging of high-level synthesis circuits. IEEE Transactions onComputer-Aided Design of Integrated Circuits and Systems, 36(1):83–96,2017. → pages 4, 17, 23, 24[54] F. Golshan. Test and on-line debug capabilities of IEEE Std 1149.1 inUltraSPARC TM-III microprocessor. In Proceedings of the InternationalTest Conference, pages 141–150. IEEE, 2000. → page 16[55] J. Goodenough and R. Aitken. Post-silicon is too late avoiding the $50million paperweight starts with validated designs. In Proceedings of theDesign Automation Conference, pages 8–11, New York, NY, USA, 2010.ACM. → page 15[56] M. Gort, F. M. D. Paula, J. J. W. Kuan, T. M. Aamodt, A. J. Hu, S. J. E.Wilton, and J. Yang. Formal-analysis-based trace computation for132post-silicon debug. IEEE Transactions on Very Large Scale Integration(VLSI) Systems, 20(11):1997–2010, Nov 2012. → page 19[57] V. Govindaraju, C. H. Ho, T. Nowatzki, J. Chhugani, N. Satish,K. Sankaralingam, and C. Kim. Dyser: Unifying functionality andparallelism specialization for energy-efficient computing. IEEE Micro, 32(5):38–51, Sept 2012. → page 31[58] P. Graham, B. Nelson, and B. Hutchings. Instrumenting bitstreams fordebugging FPGA circuits. In Symposium on Field-Programmable CustomComputing Machines (FCCM), pages 41–50. IEEE, 2001. → page 21[59] J. Havlicek and Y. Wolfsthal. PSL and SVA: Two standard assertionlanguages addressing complementary engineering needs. Design andVerification Conference and Exhibition (DVCon), pages 1–7, 2005. → page27[60] H. J. Hoover, M. M. Klawe, and N. J. Pippenger. Bounding fan-out inlogical networks. Journal of the ACM (JACM), 31(1):13–18, 1984. → page83[61] A. Hopkins and K. McDonald-Maier. Debug support for complex systemson-chip: A review. IET Proceedings-Computers and Digital Techniques,153(4):197–207, 2006. → page 16[62] M. Hubner, P. Figuli, R. Girardey, D. Soudris, K. Siozios, and J. Becker. Aheterogeneous multicore system on chip with run-time reconfigurablevirtual FPGA architecture. In International Symposium on Parallel andDistributed Processing Workshops and Phd Forum, pages 143–149. IEEE,May 2011. → page 30[63] E. Hung and S. J. Wilton. Scalable signal selection for post-silicon debug.IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 21(6):1103–1115, 2013. → page 19[64] E. Hung and S. J. Wilton. Towards simulator-like observability for FPGAs:a virtual overlay network for trace-buffers. In Proceedings of theinternational symposium on Field programmable gate arrays, pages 19–28.ACM, 2013. → pages 22, 34, 42[65] E. Hung and S. J. Wilton. Accelerating FPGA debug: Increasing visibilityusing a runtime reconfigurable observation and triggering network. ACMTransactions on Design Automation of Electronic Systems (TODAES), 19(2):14, 2014. → page 23133[66] E. Hung and S. J. Wilton. Incremental trace-buffer insertion for FPGAdebug. IEEE Transactions on Very Large Scale Integration Systems, 22(4):850–863, 2014. → pages 22, 34, 41, 45[67] E. Hung, F. Eslami, and S. J. Wilton. Escaping the academic sandbox:Realizing vpr circuits on xilinx devices. In International Symposium onField-Programmable Custom Computing Machines (FCCM), pages 45–52.IEEE, 2013. → pages 45, 64, 94, 125[68] E. Hung, A. S. Jamal, and S. J. E. Wilton. Maximum flow algorithms formaximum observability during FPGA debug. In International Conferenceon Field-Programmable Technology (FPT), pages 20–27. IEEE, Dec 2013.→ page 22[69] E. Hung, T. Todman, and W. Luk. Transparent insertion oflatency-oblivious logic onto FPGAs. In International Conference on FieldProgrammable Logic and Applications, pages 1–8. IEEE, 2014. → page 22[70] B. L. Hutchings and J. Keeley. Rapid post-map insertion of embeddedlogic analyzers for xilinx FPGAs. In International Symposium onField-Programmable Custom Computing Machines (FCCM), pages 72–79.IEEE, 2014. → pages 22, 70[71] M. D. Hutton, J. Rose, J. P. Grossman, and D. G. Corneil. Characterizationand parameterized generation of synthetic combinational benchmarkcircuits. IEEE Transactions on Computer-Aided Design of IntegratedCircuits and Systems, 17(10):985–996, 1998. → page 78[72] Intel. Intel FPGA SDK for OpenCL, Programming Guide UG-OCL002(17.1). December 2017. → page 23[73] Intel Corporation. Intel completes acquisition of altera, Dec. 2015. URLhttps://newsroom.intel.com/news-releases/intel-completes-acquisition-of-altera/. → page 2[74] Intel Corporation. Quartus Prime Standard Edition Handbook Volume 3:Verification; 14. Design Debugging with the SignalTap II Logic Analyzer.San Jose, CA, USA, Nov. 2017. → pages 4, 11, 17, 21, 23, 24[75] Y. Iskander, C. Patterson, and S. Craven. High-level abstractions andmodular debugging for FPGA design validation. ACM Transactions onReconfigurable Technol. Syst., 7(1):2:1–2:22, Feb. 2014. ISSN 1936-7406.→ page 16134[76] Y. S. Iskander, C. D. Patterson, and S. D. Craven. Improved abstractionsand turnaround time for FPGA design validation and debug. InInternational Conference on Field Programmable Logic and Applications(FPL), pages 518–523. IEEE, 2011. → page 16[77] A. K. Jain, X. Li, S. A. Fahmy, and D. L. Maskell. Adapting the dyserarchitecture with DSP blocks as an overlay for the xilinx zynq. SIGARCHComput. Archit. News, 43(4):28–33, Apr. 2016. ISSN 0163-5964. → pages6, 31[78] A. K. Jain, X. Li, P. Singhai, D. L. Maskell, and S. A. Fahmy. DeCO: ADSP block based FPGA accelerator overlay with low overheadinterconnect. In International Symposium on Field-Programmable CustomComputing Machines (FCCM), pages 1–8. IEEE, May 2016. → pages 6, 31[79] A. Jamal and S. S. Wilton. Architecture exploration for HLS-orientedFPGA debug overlays. In International Symposium onField-Programmable Gate Arrays, pages 1–10, 2018. → page 25[80] D. Josephson. The good, the bad, and the ugly of silicon debug. InProceedings of the annual Design Automation Conference, pages 3–6.ACM, 2006. → page 3[81] A. B. Kahng and S. Mantik. Measurement of inherent noise in eda tools. InProceedings International Symposium on Quality Electronic Design, pages206–211, 2002. → page 5[82] N. Kapre, N. Mehta, R. Rubin, H. Barnor, M. J. Wilson, M. Wrighton,A. DeHon, et al. Packet switched vs. time multiplexed FPGA overlaynetworks. In International Symposium on Field-Programmable CustomComputing Machines (FCCM), pages 205–216. IEEE, 2006. → page 6[83] H. F. Ko and N. Nicolici. Algorithms for state restoration and trace-signalselection for data acquisition in silicon debug. IEEE Transactions onComputer-Aided Design of Integrated Circuits and Systems, 28(2):285–297, 2009. → pages 4, 17[84] H. F. Ko and N. Nicolici. Combining scan and trace buffers for enhancingreal-time observability in post-silicon debugging. In European TestSymposium, pages 62–67. IEEE, May 2010. → page 19[85] H. F. Ko and N. Nicolici. Mapping trigger conditions onto trigger unitsduring post-silicon validation and debugging. IEEE Transactions onComputers, 61(11):1563–1575, Nov 2012. → page 17135[86] D. Koch, C. Beckhoff, and G. G. Lemieux. An efficient FPGA overlay forportable custom instruction set extensions. In International Conference onField Programmable Logic and Applications (FPL), pages 1–8. IEEE,2013. → pages 6, 30[87] A. Kourfali and D. Stroobandt. Efficient hardware debugging usingparameterized FPGA reconfiguration. In International SymposiumWorkshops on Parallel and Distributed Processing, pages 277–282. IEEE,2016. → page 22[88] B. Kumar, K. Basu, M. Fujita, and V. Singh. Rtl level trace signal selectionand coverage estimation during post-silicon validation. In InternationalHigh Level Design Validation and Test Workshop (HLDVT), pages 59–66.IEEE, Oct 2017. → page 29[89] P. D. Kundarewich and J. Rose. Synthetic circuit generation usingclustering and iteration. IEEE Transactions on Computer-Aided Design ofIntegrated Circuits and Systems, 23(6):869–887, 2004. → page 78[90] A. Landy and G. Stitt. A low-overhead interconnect architecture for virtualreconfigurable fabrics. In Proceedings of the international conference onCompilers, architectures and synthesis for embedded systems, pages111–120. ACM, 2012. → pages 6, 31[91] M. Li and A. Davoodi. Multi-mode trace signal selection for post-silicondebug. In Asia and South Pacific Design Automation Conference(ASP-DAC), pages 640–645, 2014. → page 19[92] O. Lindtjorn, R. Clapp, O. Pell, H. Fu, M. Flynn, and O. Mencer. Beyondtraditional microprocessors for geoscience high-performance computingapplications. IEEE Micro, 31(2):41–49, 2011. → page 2[93] C. Liu, H.-C. Ng, and H.-H. So. Quickdough: A rapid FPGA loopaccelerator design framework using soft CGRA overlay. In InternationalConference on Field-Programmable Technology. IEEE, 2015. → page 31[94] X. Liu and Q. Xu. Interconnection fabric design for tracing signals inpost-silicon validation. In Proceedings of the Annual Design AutomationConference, pages 352–357, New York, NY, USA, 2009. ACM. → page 19[95] X. Liu and Q. Xu. On efficient silicon debug with flexible traceinterconnection fabric. In International Test Conference, pages 1–9. IEEE,Nov 2012. → page 19136[96] J. Luu, J. Goeders, M. Wainberg, A. Somerville, T. Yu, K. Nasartschuk,M. Nasr, S. Wang, T. Liu, N. Ahmed, et al. VTR 7.0: Next generationarchitecture and CAD system for FPGAs. ACM Transactions onReconfigurable Technology and Systems (TRETS), 7(2):6, 2014. → pages35, 42, 63, 66, 93, 94, 113[97] R. Lysecky, K. Miller, F. Vahid, and K. Vissers. Firm-core virtual FPGAfor just-in-time FPGA compilation. In Proceedings of internationalsymposium on Field-programmable gate arrays, pages 271–271.ACM/SIGDA, 2005. → pages 6, 30[98] S. Ma, D. Pal, R. Jiang, S. Ray, and S. Vasudevan. Can’t see the forest forthe trees: State restoration’s limitations in post-silicon trace signalselection. In Proceedings of the International Conference onComputer-Aided Design, pages 1–8, Piscataway, NJ, USA, 2015. IEEEPress. → page 19[99] A. Marquardt, V. Betz, and J. Rose. Timing-driven placement for FPGAs.In Proceedings of the international symposium on Field ProgrammableGate Arrays (FPGA), pages 203–213. ACM, 2000. → page 59[100] L. McMurchie and C. Ebeling. Pathfinder: a negotiation-basedperformance-driven router for FPGAs. In Proceedings of the internationalsymposium on Field-programmable gate arrays (FPGA), pages 111–117.ACM, 1995. → pages 58, 63[101] Mentor Graphics. Certus silicon debug, May 2017. URLhttps://www.mentor.com/products/fv/certus-silicon-debug. → pages 17, 18[102] Mishchenko et al. ABC: A system for sequential synthesis and verification,Mar. 2012. URL http://www.eecs.berkeley.edu/∼alanmi/abc/. → pages37, 53, 82[103] P. Mishra, R. Morad, A. Ziv, and S. Ray. Post-silicon validation in the SoCera: A tutorial introduction. IEEE Design Test, 34(3):68–92, June 2017. →pages 2, 15[104] S. Mitra, S. A. Seshia, and N. Nicolici. Post-silicon validationopportunities, challenges and recent advances. In Proceedings of theDesign Automation Conference, pages 12–17. ACM, 2010. → pages2, 3, 6, 15137[105] J. S. Monson and B. Hutchings. New approaches for in-system debug ofbehaviorally-synthesized FPGA circuits. In International Conference onField Programmable Logic and Applications (FPL), pages 1–6. IEEE, Sept2014. → pages 23, 24, 25[106] J. S. Monson and B. L. Hutchings. Using source-level transformations toimprove high-level synthesis debug and validation on FPGAs. InInternational Symposium on Field-Programmable Gate Arrays, pages 5–8.ACM, 2015. → pages 23, 24[107] M. H. Neishaburi and Z. Zilic. An infrastructure for debug using clusters ofassertion-checkers. Microelectronics Reliability, 52(11):2781–2798, 2012.→ page 28[108] S. Prabhakar, R. Sethuram, and M. S. Hsiao. Trace buffer-based silicondebug with lossless compression. In International Conference on VLSIDesign (VLSI Design), pages 358–363. IEEE, 2011. → page 19[109] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides,J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray, et al. Areconfigurable fabric for accelerating large-scale datacenter services. InInternational Symposium on Computer Architecture (ISCA), pages 13–24.IEEE/ACM, 2014. → page 2[110] B. Quinton and S. Wilton. Concentrator access networks for programmablelogic cores on SoCs. In International symposium on Circuits and Systems,pages 45–48, 2005. → page 18[111] B. Quinton and S. Wilton. Post-silicon debug using programmable logiccores. In International Conference on Field-Programmable Technology,pages 241–247. IEEE, Dec 2005. → page 18[112] B. R. Quinton, A. M. Hughes, and S. J. Wilton. Post-silicon debug ofcomplex multi clock and power domain ICs. In International Workshop onSilicon Debug and Diagnosis. IEEE, 2010. → page 19[113] K. Rahmani, S. Proch, and P. Mishra. Efficient selection of trace and scansignals for post-silicon debug. IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems, 24(1):313–323, Jan 2016. → page 19[114] J. E. Savage. The Complexity of Computing. Krieger Publishing Co., Inc.,Melbourne, FL, USA, 1987. ISBN 089874833X. → page 83138[115] A. Severance and G. Lemieux. Embedded supercomputing in FPGAs withthe VectorBlox MXP matrix processor. In International Conference onHardware/Software Codesign and System Synthesis, pages 1–10, 2013. →page 31[116] W. Shum and J. H. Anderson. Analyzing and predicting the impact of CADalgorithm noise on FPGA speed performance and power. In Proceedings ofthe International Symposium on Field Programmable Gate Arrays, pages107–110, New York, NY, USA, 2012. ACM. → page 5[117] S. Sirowy and A. Forin. Wheres the beef? why FPGAs are so fast.Technical Report MSR-TR-2008-130, Microsoft Research, MicrosoftCorp., Redmond, WA, 2008. → page 2[118] G. Stitt and J. Coole. Intermediate fabrics: Virtual architectures fornear-instant FPGA compilation. IEEE Embedded Systems Letters, 3(3):81–84, Sept 2011. → page 31[119] Synopsys. Identify: Simulator-like visibility into hardware debug, May2017. URL https://www.synopsys.com/content/dam/synopsys/implementation&signoff/datasheets/identify-rtl-debugger-ds.pdf. → pages4, 17, 21, 23[120] P. Taatizadeh and N. Nicolici. Automated selection of assertions for bit-flipdetection during post-silicon validation. IEEE Transactions onComputer-Aided Design of Integrated Circuits and Systems, 35(12):2118–2130, 2016. → page 29[121] S. Thakur and D. Wong. On designing ULM-based FPGA logic modules.In Proceedings of the international symposium on Field-programmablegate arrays, pages 3–9. ACM, 1995. → page 53[122] B. Vermeulen and S. K. Goel. Design for debug: catching design errors indigital chips. IEEE Design & Test of Computers, 19(3):37–45, 2002. →pages 16, 18[123] T. Wheeler, P. Graham, B. E. Nelson, and B. Hutchings. Using design-levelscan to improve FPGA design observability and controllability forfunctional verification. In Proceedings of the International Conference onField-Programmable Logic and Applications, pages 483–492.Springer-Verlag, 2001. → page 16139[124] S. J. Wilton, N. Kafafi, J. C. Wu, K. A. Bozman, V. O. Aken’Ova, andR. Saleh. Design considerations for soft embedded programmable logiccores. IEEE Journal of Solid-State Circuits, 40(2):485–497, 2005. → page78[125] M. R. Woodward and K. Halewood. From weak to strong, dead or alive?an analysis of some mutation testing issues. In Proceedings on SoftwareTesting, Verification, and Analysis, pages 152–158, Jul 1988. → page 27[126] Xilinx. High-Level Synthesis, Vivado Design Suite User Guide UG902(v2016.1). San Jose, CA, USA, Jun 2016. → page 23[127] Xilinx. Programming and Debugging, Vivado Design Suite User GuideUG908 (v2017.1). San Jose, CA, USA, April 2017. → pages 4, 17, 21, 23[128] J. S. Yang and N. A. Touba. Enhancing silicon debug via periodicmonitoring. In International Symposium on Defect and Fault Tolerance ofVLSI Systems, pages 125–133. IEEE, Oct 2008. → page 16[129] P. Yiannacouras, J. G. Steffan, and J. Rose. VESPA: portable, scalable, andflexible FPGA-based vector processors. In International Conference onCompilers, architectures and synthesis for embedded systems, pages 61–70,2008. → page 31140


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items