Harnessing FPGA Technology for Rapid Circuit Debug

by

Eddie Hung

M.Eng., University of Bristol, 2008

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

THE FACULTY OF GRADUATE STUDIES
(Electrical and Computer Engineering)

The University of British Columbia
(Vancouver)

August 2013

© Eddie Hung, 2013

Abstract

Electronic devices have come to permeate every aspect of our daily lives, and at the heart of each device is one or more integrated circuits. State-of-the-art circuits now contain several billion transistors. However, designing and verifying that these circuits function correctly under all expected (and unexpected) operating conditions is extremely challenging, with many studies finding that verification can consume over half of the total design effort. Due to the slow speed of logic simulation software, designers increasingly turn to circuit prototypes implemented using field-programmable gate array (FPGA) technology. Whilst these prototypes can be operated many orders of magnitude faster than simulation, on-chip instruments are required to expose internal signal data so that designers can root-cause any erroneous behaviour.

This thesis presents four contributions to enable rapid and effective circuit debug when using FPGAs, in particular, by harnessing the reconfigurable and prefabricated nature of this technology. The first contribution presents a post-silicon debug metric to quantify the effectiveness of trace-buffer based debugging instruments, and three algorithms to determine new signal selections for these instruments. Our most scalable algorithm can determine the most influential signals in a large 50,000 flip-flop circuit in less than 90 seconds.

The second contribution of this thesis proposes that debug instruments be speculatively inserted into the spare capacity of FPGAs, without any user intervention, and shows this to be feasible. This proposal allows designers to extract more trace data from their circuit on every debug turn, ultimately leading to fewer debug iterations.

The third contribution presents techniques to enable faster debug turnaround, by using incremental-compilation methods to accelerate the process of inserting debug instruments. Specifically, our incremental optimizations can speed up this procedure by almost 100X over recompiling the FPGA from scratch.

Finally, the fourth contribution describes how a virtual overlay network can be embedded into the unused resources of the FPGA device, allowing debug instruments to be modified without any form of recompilation. Experimental results show that a new configuration for a debug instrument with 17,000 trace connections can be made in 50 seconds, thus enabling rapid circuit debug.

Preface

The research contributions documented in this thesis were published in papers [62-67]. Specifically, Chapter 3 combines content from papers [67] and [64]. In particular, I would like to recognize my research supervisor, Dr. Steve Wilton, for suggesting the use of graph-centrality techniques for trace-signal selection that led to the journal publication [67]. Dr. Wilton continued to develop a similar algorithm in parallel for a commercial product that was subsequently published in paper [150].
The key differences between the two approaches are the level of abstraction at which they operate, as well as the exact centrality cost function employed; reference [67] operates on the gate-level netlist with the centrality measure described in Chapter 3, whilst reference [150] operates on the high-level parse-tree representation of the circuit description and employs a proprietary cost function.

Content from Chapter 4 was first published in paper [63], and also provided as an example of an application that can be enabled by automated signal selection techniques in reference [67]. The main contribution of Chapter 5 was first published in paper [65], and later extended to form the journal paper [62]. Chapter 6 has been published in paper [66], and was included as an example of state-of-the-art instrumentation in reference [68]. I would like to recognize my committee member Dr. Alan Hu's help in recognizing the bipartite graph maximum matching problem employed in this chapter for configuring the virtual overlay network.

In all cases, I designed the experiments, conducted the research, and formed the conclusions under the guidance of Dr. Wilton, who also provided editorial support for all of my manuscripts.

[62] E. Hung and S. J. E. Wilton. Incremental Signal-Tracing for FPGA Debug. IEEE Transactions on Very Large Scale Integration (VLSI) Systems (accepted for publication: March 2013).

[63] E. Hung and S. J. E. Wilton. Speculative Debug Insertion for FPGAs. In FPL 2011, International Conference on Field-Programmable Logic and Applications, pages 524-531, September 2011.

[64] E. Hung and S. J. E. Wilton. On Evaluating Signal Selection Algorithms for Post-Silicon Debug. In ISQED 2011, International Symposium on Quality Electronic Design, pages 290-296, March 2011.

[65] E. Hung and S. J. E. Wilton. Limitations of Incremental Signal-Tracing for FPGA Debug. In FPL 2012, International Conference on Field-Programmable Logic and Applications, pages 49-56, August 2012.

[66] E. Hung and S. J. E. Wilton. Towards Simulator-like Observability for FPGAs: A Virtual Overlay Network for Trace-Buffers. In Proceedings of the 21st ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 19-28, February 2013.

[67] E. Hung and S. J. E. Wilton. Scalable Signal Selection for Post-Silicon Debug. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 21:1103-1115, June 2013.

[68] E. Hung, B. Quinton, and S. J. E. Wilton. Linking the Verification and Validation of Complex Integrated Circuits Through Shared Coverage Metrics. IEEE Design & Test of Computers, Special Issue: Silicon Debug and Diagnosis (accepted for publication: June 2013).

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
Dedication
1 Introduction
1.1 Motivation
1.2 Debugging ASICs Using FPGA Prototypes
1.2.1 Typical Debugging Workflow
1.3 Research Challenge and Contributions
1.3.1 Post-Silicon Debug Metric and Automated Signal Selection for Trace-Instruments
1.3.2 Speculative Debug Insertion for FPGAs
1.3.3 Accelerating FPGA Trace-Insertion Using Incremental Techniques
1.3.4 A Virtual Overlay Network for FPGA Trace-Buffers
1.4 Significance
1.5 Thesis Organization
2 Background and Related Work
2.1 Review of Field-Programmable Gate Arrays
2.1.1 Typical FPGA Applications
2.1.2 FPGA Architecture
2.1.3 FPGA CAD
2.2 Primary Challenge During Post-Silicon Debug: Observability
2.2.1 Scan-based Instrumentation
2.2.2 Trace-based Instrumentation
2.2.3 Visibility Enhancement
2.3 Previous Work Specific to Chapters 3-6
2.3.1 Automated Signal Selection and Evaluation Metrics
2.3.2 Reclaiming Spare FPGA Resources
2.3.3 Incremental Compilation for FPGAs
2.3.4 Multiplexer Networks in Post-Silicon Debug
2.4 Summary
3 Post-Silicon Debug Metric and Automated Signal Selection for Trace-Instruments
3.1 Framework
3.2 Post-Silicon Debug Difficulty
3.2.1 Definition
3.2.2 Computing an Over-Approximation
3.3 Signal Selection
3.3.1 Observability-based: Expected Difficulty
3.3.2 Connectivity-based: Graph Centrality
3.3.3 Hybrid: Expected Difficulty with Hierarchy
3.4 Results
3.4.1 Scalability and Runtime
3.4.2 Observability
3.4.3 Observability of LEON3 Physical Prototype
3.4.4 Fidelity of Debug Difficulty
3.4.5 Hybrid Algorithm: Merge Threshold Parameter
3.5 Evaluation
3.5.1 Example Signal Selection for LEON3
3.5.2 Comparison with Prior Work
3.5.3 Gold Standard
3.6 Summary
4 Speculative Debug Insertion for FPGAs
4.1 Limits to Speculative Insertion
4.1.1 Impact on Area: Logic
4.1.2 Impact on Area: Memory
4.1.3 Impact on Delay and Routability
4.1.4 Impact on Runtime and Power
4.1.5 Debug Aggressiveness
4.2 In-Spec: Speculative Insertion Tool
4.2.1 Estimating Instrumentation Area
4.2.2 Example Application
4.3 Summary
5 Accelerating FPGA Trace-Insertion Using Incremental Techniques
5.1 Trace-Insertion
5.1.1 Assumptions
5.1.2 Pre-Map Trace Insertion
5.1.3 Mid-Map Trace Insertion
5.1.4 Post-Map Trace Insertion
5.1.5 Framework
5.2 Incremental CAD for Post-Map Tracing
5.2.1 Many-to-Many Trace Flexibility
5.2.2 Logic Element Symmetry
5.2.3 Timing-Driven Directed Search
5.2.4 Neighbour Expansion
5.3 Methodology
5.3.1 Routing Slack of Minimum Channel Width Wmin + 20%
5.4 Results
5.4.1 Signal Traceability With Post-Map Insertion
5.4.2 CAD Runtime
5.4.3 Wirelength
5.4.4 Critical-Path Delay
5.4.5 CAD Optimizations for Post-Map Tracing
5.4.6 Comparison With Altera Quartus II
5.5 Summary
6 A Virtual Overlay Network for FPGA Trace-Buffers
6.1 Virtual Overlay Network
6.2 Network Matching
6.3 Network Reconfiguration
6.3.1 Static Reconfiguration
6.3.2 Dynamic Reconfiguration
6.4 Methodology
6.4.1 Compile-Time Construction
6.4.2 Debug-Time Matching
6.5 Results
6.5.1 Maximum Network Connectivity
6.5.2 Average Match Size
6.5.3 Runtime
6.5.4 Circuit Delay
6.6 Summary
7 Conclusion
7.1 Thesis Summary
7.2 Current Limitations and Future Work
7.2.1 Post-Silicon Debug Metric and Automated Signal Selection for Trace Instruments
7.2.2 Speculative Debug Insertion for FPGAs
7.2.3 Accelerating FPGA Trace-Insertion Using Incremental Techniques
7.2.4 A Virtual Overlay Network for FPGA Trace-Buffers
7.2.5 Long Term Directions
Bibliography
A Signal Selection for leon3s nofpu
B Incremental-Tracing with Local Nets
C Pseudo-code
C.1 Post-Silicon Debug Metric and Automated Signal Selection
C.1.1 Debug Difficulty Metric
C.1.2 Automated Signal Selection
C.2 Speculative Debug Insertion
C.3 Incremental Tracing
C.4 Virtual Overlay Network

List of Tables

Table 3.1 Signal selection runtime: 128 signals.
Table 3.2 Selection runtime: oc mem ctrl for varying sizes.
Table 3.3 Comparison of leon3s nofpu, 128 signal selection, summarized.
Table 3.4 Signal selection algorithm runtime (seconds).
Table 3.5 Example signal selection for oc i2c.
Table 4.1 Example application: OpenSPARC T1 (3951x1024).
Table 5.1 Incremental-tracing benchmark summary.
Table 5.2 Routing utilization when mapped onto Altera Stratix IV.
Table 5.3 Post-map CAD optimization breakdown.
Table 5.4 Comparison between Quartus II and this work for mcml.
Table 6.1 Overlay network benchmark summary.
Table 6.2 FPGA architecture used, based on Altera Stratix IV device family.
Table 6.3 Maximum network connectivity results.
Table 6.4 Effect of overlay network on critical-path delay.
Table A.1 Comparison of leon3s nofpu, 128 signal selection, full listing.

List of Figures

Figure 1.1 A simplified ASIC development flow.
Figure 1.2 Trace-based instrumentation.
Figure 1.3 FPGA debug flow.
Figure 1.4 Thesis organization.
Figure 2.1 FPGA architecture overview.
Figure 2.2 FPGA look-up table, logic element and logic cluster.
Figure 2.3 FPGA routing network overview.
Figure 2.4 FPGA CAD flow overview.
Figure 2.5 Logic and I/O in Xilinx FPGAs.
Figure 2.6 Scan-based debug instrumentation.
Figure 2.7 Trace-based debug instrumentation.
Figure 2.8 Multiplexer network for trace-based instruments.
Figure 3.1 State-space partitioning induced by signal observations.
Figure 3.2 Knowledge of circuit state over time.
Figure 3.3 State-space partition notation.
Figure 3.4 Flip-flop connectivity graph for flattened oc mem ctrl (maximum of 5 input/output edges shown, node size indicates its eigenvector centrality).
Figure 3.5 Histogram distribution of log-centrality.
Figure 3.6 Module hierarchy for oc mem ctrl.
Figure 3.7 Hybrid algorithm: module centrality for selectively flattened oc mem ctrl.
Figure 3.8 Debug difficulty comparison.
Figure 3.9 Debug difficulty comparison (physical prototype).
Figure 3.10 Fidelity of debug difficulty (physical prototype).
Figure 3.11 Debug difficulty/runtime for hybrid selection algorithm, geometrically averaged over oc mem ctrl, oc wb dma and oc pci, when varying merge threshold.
Figure 3.12 Debug difficulty comparison with prior restorability work.
Figure 4.1 Proposed speculative debug flow.
Figure 4.2 Area: trace depth vs logic utilization.
Figure 4.3 Area: trace depth vs memory utilization.
Figure 4.4 Delay: trace depth vs Fmax.
Figure 4.5 Routability: trace depth vs peak interconnect utilization.
Figure 4.6 Runtime: trace depth vs CAD runtime.
Figure 4.7 Power: trace depth vs power dissipation.
Figure 5.1 Pre-, mid-, post-map stages of the FPGA compilation flow.
Figure 5.2 Placement results when instrumenting the or1200 benchmark.
Figure 5.3 Many-to-many routing flexibility during post-map trace-insertion.
Figure 5.4 Example of logic element symmetry optimization.
Figure 5.5 Breadth-first and directed search routing strategies.
Figure 5.6 Suggested targets during directed search.
Figure 5.7 Incremental-tracing neighbour expansion optimization.
Figure 5.8 Average fraction of signals traceable using post-map insertion.
Figure 5.9 Runtime breakdown (trace-demand = 0.75).
Figure 5.10 Runtime breakdown (LU8PEEng).
Figure 5.11 Circuit wirelength breakdown (LU8PEEng).
Figure 5.12 Post-map routing runtime (trace-demand = 0.75).
Figure 5.13 Circuit wirelength (trace-demand = 0.75).
Figure 5.14 Post-map circuit wirelength (channel width).
Figure 5.15 Critical-path delay (trace-demand = 0.75).
Figure 5.16 Critical-path delay (stereovision1).
Figure 5.17 Post-map delay (trace-demand = 0.75).
Figure 6.1 Virtual overlay network.
Figure 6.2 Proposed debug flow: compile- and debug-time phases.
Figure 6.3 Trace-buffer connectivity.
Figure 6.4 Example routing resource graphs G(V,E).
Figure 6.5 Virtual overlay network abstraction.
Figure 6.6 Bipartite graph Gb(Vsignals, Vtrace, Eb).
Figure 6.7 Circuit signals establishing new, or sharing existing, trace connections.
Figure 6.8 Signal fan-in for overlay network.
Figure 6.9 Average match size.
Figure 6.10 Network connectivity and match quality for bgm.
Figure 6.11 Compile-time and debug-time CAD overhead.
Figure 7.1 Refining the debug metric using multiple trace data samples.
Figure 7.2 Reclaiming soft-logic for on-chip triggering circuitry.
Figure 7.3 The difficulty with determining an upper-bound for non-blocking behaviour.
Figure 7.4 Network flow representation of virtual overlay network.
Figure B.1 Distribution of local and global OPINs, accessible (A) and inaccessible (I).
Figure B.2 Histogram of maximum manhattan distance during routing.

Glossary

ASIC: Application-Specific Integrated Circuit
BDD: Binary Decision Diagram
CAD: Computer-Aided Design
DSP: Digital Signal Processing
FPGA: Field-Programmable Gate Array
HDL: Hardware Description Language, e.g. Verilog, VHDL
IC: Integrated Circuit
I/O: Input/Output
IP: Intellectual Property
LC: Logic Cluster
LE: Logic Element
LUT: Look-Up Table
RAM: Random-Access Memory
RTL: Register-Transfer Level
SAT: Satisfiability
SOC: System-on-Chip
VPR: Versatile Place and Route, an open-source FPGA CAD tool used in academic research
VTR: Verilog-To-Routing, a framework that integrates a Verilog elaborator (Odin II) and a technology-mapping tool (ABC) with VPR

Acknowledgments

First and foremost, I would like to express my deepest gratitude to my research supervisor Dr. Steve Wilton. Only through his tireless dedication to research, timely words of inspiration, and vast amounts of patience, have I been able to accomplish this thesis and develop into the engineer that I am today. These are qualities I can only hope to emulate as I embark on my future career.

I would like to thank Dr. Brad Hutchings for serving as my external examiner, members of my supervisory committee: Dr. Tor Aamodt, Dr. Alan Hu, and Dr. Shahriar Mirabbasi, as well as those on my examining committee: Dr. Wolfgang Heidrich, Dr. Andre Ivanov, Dr. Philippe Kruchten, Dr. Lutz Lampe, Dr. Michiel van de Panne and Dr. Konrad Walus, for all of their valuable wisdom and insight. Thank you also to my past and present colleagues in the SoC lab for all of the productive (and unproductive!) discussions that were had. I extend my gratitude to the support staff in the Department of Electrical and Computer Engineering for making sure everything ticked over smoothly, and for answering my cries for "help" (@ece.ubc.ca).

I am thankful for the generous financial and logistical support provided by the University of British Columbia, the National Sciences and Engineering Research Council of Canada, Altera Corporation, CMC Microsystems, and the Semiconductor Research Corporation. I would also like to acknowledge Tektronix EIG, Dr. Wayne Luk of Imperial College London, Dr. Philip Leong previously of The Chinese University of Hong Kong, Dr. Jose Nunez-Yanez of The University of Bristol, ARM, Panasonic PSDCE, and Motorola, for granting me the opportunities to gain both industrial and academic exposure away from my formal education.

Lastly, my warmest thanks go to my partner, Karen, for her patience and understanding during the final stages of my Ph.D. studies.
I promise I won't ever work so hard again.

Dedication

Dedicated to my family, of past and present.

Chapter 1

Introduction

1.1 Motivation

The modern world is now heavily dependent on electronic devices, from mobile phones, to automobiles, to the various technologies that make up the World Wide Web. At the core of these devices are integrated circuits (ICs). As predicted (or perhaps, dictated) by Moore's Law, integrated circuits have evolved to become extremely complex devices that are capable of realizing significant amounts of functionality. State-of-the-art ICs can now contain several billion transistors; however, designing such complex circuits that function correctly under all expected (and unexpected) operating conditions is an extremely challenging task.

Typically, ICs are custom-built to perform one particular task as effectively as possible, and are implemented directly onto silicon in what is termed an application-specific integrated circuit (ASIC). Fabricating an ASIC requires an investment of several million dollars to produce a set of photolithography masks, and requires a lead time of several months [84]. Circuit designers would therefore like to minimize the number of fabrication spins required to develop their product, as each spin increases its non-recurring engineering cost and delays time-to-market.

Traditionally, circuit designers use logic simulation software (e.g. Mentor Graphics ModelSim [97]) to verify, debug and fix their circuits prior to fabrication, or pre-silicon. The popularity of simulation can be attributed to its ease-of-use: designers are able to view the behaviour of any internal signal in their circuit, from which they can quickly root-cause design errors, apply a fix, and re-simulate in a matter of minutes. Whilst this may have been sufficient in the past when devices were smaller, modern integrated circuits have come to be extremely large devices which are proving to be difficult to simulate. To highlight this, an Intel paper reported that during the development of their Core i7 microarchitecture, software simulations of their chip ran a billion times slower than actual silicon [75], whilst a Mentor Graphics study found that as silicon density doubled every 18 months, designer productivity did so every 39 months, and that half of all designer effort was spent performing functional verification [46]. Thus, it is impractical to use simulation for all test cases. The position of this simulation process in a simplified ASIC development flow is illustrated by Figure 1.1.

Figure 1.1: A simplified ASIC development flow.

To overcome this limitation, circuit designers have increasingly turned towards post-silicon verification. Enabling the effective debug of integrated circuits after fabrication (often called post-silicon debug) has become vital. Although designers thoroughly simulate an integrated circuit design before the chip is fabricated, regardless of how careful a designer is, some design errors (bugs) will escape the pre-silicon simulation and be incorporated into manufactured chips. Finding the source of these bugs is often costly, time-consuming, and can significantly extend time-to-market.
The 2011 International Technology Roadmap for Semiconductors (ITRS) report specifically identifies post-silicon validation as an important challenge [70].

Rather than simulating the behaviour of a digital circuit purely in software, post-silicon techniques allow designers to apply tests onto physical implementations of the circuit, such as by using a pre-production ASIC sample. These physical implementations can operate many orders-of-magnitude faster than pre-silicon techniques, as fast as several gigahertz, and can also interact with real-world, real-time input stimuli (e.g. by connecting to a live network interface) that may not be possible to describe using a simulation testbench model or unit test. Thus, post-silicon techniques can allow designers to more thoroughly validate their circuits and achieve higher testing coverage, for example, by booting an operating system. However, the lead time for fabricating an ASIC can be in the order of several months, and until a sample returns, designer productivity will continue to be limited by the speed of logic simulation software.

1.2 Debugging ASICs Using FPGA Prototypes

Field-programmable gate arrays (FPGAs) represent an alternative circuit implementation technology containing prefabricated logic and routing resources that can be reconfigured with any new design in less than a day, for as many times as necessary. This agility makes them extremely suited to prototyping ASIC designs, as they allow engineers early access to hardware tests and software development ahead of custom silicon becoming available [83]. Although FPGA technology creates circuit implementations that operate slower than ASICs, this is still many orders-of-magnitude faster than logic simulation. A growing number of designers are now opting to prototype their design using one or more FPGAs: the Mentor study from the previous section also found that 55% of industry employed FPGA prototyping techniques in 2010, an increase from 41% in 2007 [46]. Their place in the ASIC design flow is also illustrated by Fig. 1.1. With a physical prototype, designers are able to run more complex tests than were possible with simulation, and almost certainly discover new erroneous behaviour. FPGA-based prototypes can also be used to debug these behaviours more effectively than in an ASIC device. This application, along with the unique opportunities exposed by doing so, is central to this thesis and will be discussed in this section.

When unexpected behaviour is observed in a post-silicon system, it is necessary to determine its root-cause through debugging. The debugging process can be divided into two sequential steps: first, a designer needs to collect relevant signal data from the circuit for use in its analysis, and second, to utilize this data to perform failure triage [143]. This thesis focuses primarily on the first step, by developing methods to rapidly acquire the signal data behind incorrect functional behaviour caused by erroneous circuit descriptions, as opposed to that caused by defects in the manufacturing process. Such behaviours may have many causes that include circuit design errors, faulty software or firmware settings, or invalid assumptions about the environment in which the design operates.

The debugging task is difficult on physical devices such as FPGAs or ASICs, primarily due to the lack of on-chip observability.
Unlike simulation, in which any internal signal in the circuit can be observed, in a post-silicon system, typically only signals which appear at the external I/O pins of the device can be interrogated. However, the rate at which transistor density has scaled has far outpaced the progress made in packaging technologies, meaning that I/O resources are now more scarce than in the past. This lack of observability makes it difficult for a debugging engineer to deduce the cause of any unexpected behaviour.

Observability into physical devices can be enhanced in many ways. Typically, trace-buffer intellectual property (IP) is embedded into the device prior to physical implementation. These instruments allow a small (predetermined) subset of circuit signals to be unintrusively recorded into on-chip buffer memory, whilst the device continues to run at-speed. Once the desired signal behaviour has been captured, the signal trace can be extracted using a PC and viewed as waveforms, as in simulation. This trace-based debug environment is shown in Figure 1.2.

Figure 1.2: Enhancing observability into a physical device by using trace-based instruments.
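To make this behaviour concrete, the following is a minimal behavioural sketch of a trace-buffer instrument, written in Python purely for illustration (it is not part of the thesis tool flow); the signal names, buffer depth, and read-out mechanism are assumptions chosen only for the example.

    from collections import deque

    class TraceBuffer:
        """Behavioural model of a trace-buffer: every cycle it samples a
        fixed, predetermined subset of signals into a circular buffer of
        limited depth, without disturbing the circuit under debug."""

        def __init__(self, traced_signals, depth=1024):
            self.traced_signals = list(traced_signals)   # chosen before implementation
            self.samples = deque(maxlen=depth)           # fixed-depth on-chip memory

        def clock(self, signal_values):
            # Record only the traced subset; once full, older samples are
            # overwritten, giving a sliding capture window.
            self.samples.append(tuple(signal_values[s] for s in self.traced_signals))

        def dump(self):
            # After capture, the stored window is read out (e.g. over JTAG)
            # and displayed as waveforms on a PC.
            return list(self.samples)

    # Hypothetical usage: trace two of three signals for a few cycles.
    tb = TraceBuffer(["state", "valid"], depth=4)
    for cycle in range(6):
        tb.clock({"state": cycle % 3, "valid": cycle % 2, "data": cycle})
    print(tb.dump())   # only the last 4 samples of the traced subset remain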
One drawback of using trace-based instruments, however, is that the subset of circuit signals that can be observed is small, and must be determined prior to physical implementation. To change this signal selection would typically require the circuit to be reimplemented. This presents a severe challenge for designers wishing to debug an ASIC: trace signals must be selected prior to investing in a fabrication spin, when the nature of any potential bugs is not yet known. Viewing a different set of signals requires an ASIC re-spin. However, whilst FPGAs are not immune to this challenge, because of their reconfigurable nature these devices can be used to implement different configurations of debug instruments indefinitely. This affords designers multiple tries at choosing the necessary signals during failure analysis. Even though reconfiguring the FPGA does not require new photolithography masks nor a lead time of months, each instrument-compile-debug iteration does still require the circuit to be recompiled, which can take multiple hours, or even a full day [30]. This can severely limit debug productivity.

Debugging is arguably even more important for FPGAs than for fixed-function ASICs; new designs are often implemented first on FPGA prototypes, and therefore this is where bugs are most likely to be found. In addition, the reconfigurable nature of FPGAs means that designers tend to be less thorough with their simulation, since the consequences of an incorrect circuit are less dire than if the circuit is implemented on an ASIC.

Figure 1.3: FPGA debug flow, showing how multiple compile-debug iterations can hamper designer productivity.

1.2.1 Typical Debugging Workflow

A typical debugging workflow is illustrated in Figure 1.3. First, the circuit would be designed (commonly using a hardware description language, or HDL) and verified using logic simulation software. At this stage, due to the limited speed of software simulators, only simple, unit or constrained-random testing [81] would be feasible. A designer would debug their circuit by selecting a set of relevant signals to display in a waveform viewer, and carefully study this output to find the root cause of any errors. Within this simulator environment, designers have unrestricted access to any user signal from their original HDL design, for the entire time that the circuit was simulated. However, depending on whether the designer had opted to dump out the value of all signals at every simulation time step (which can be significant in size) or whether the simulator is set up to record only those signals a designer had selected, changing the set of displayed signals may require the simulation to be restarted. For short unit tests, the turnaround time between simulation runs is quick.

Once the designers are satisfied with the level of pre-silicon verification coverage attained, they would move on to building a physical prototype of their circuit, such as by using one or more FPGAs. Tests that were not previously possible, such as booting an operating system, or attaching to a live network interface with realistic traffic patterns, can now be executed. As discussed in the previous section, trace-buffers are used to gain a limited window of visibility into the internal operation of the prototype. The signal data stored by the trace-buffers would then be extracted and displayed using a waveform viewer, much like the simulation workflow.

Unlike simulation, though, not all HDL signals may exist or be visible on the prototype, and only signal data over short timeframes can be recorded by on-chip trace-buffers. However, this window of data can be captured from much deeper into the prototype's execution. For trace-buffers, observing a new set of signals will require the device to be fully recompiled, which is a lengthy process that impacts designer productivity.

1.3 Research Challenge and Contributions

The long-term objective targeted by this thesis is to make the experience of debugging circuits implemented on FPGAs as convenient and as effective as using a logic simulation tool. The four contributions that this thesis makes towards this goal are:

1. A new post-silicon debug metric and automated signal selection algorithms that allow designers to measure the quality of existing trace-based debug solutions, as well as formulate new trace configurations. This work was published in papers [64, 67], and leads to more effective use of FPGA and ASIC debug instrumentation.

2. A proposal to insert trace-instrumentation speculatively into the spare capacity of every FPGA circuit, without any user-intervention. In paper [63], we showed that this is viable, and can be achieved with negligible delay impact and only a modest increase to compilation time. By inserting trace capability into every FPGA circuit, this can enable fewer compile-debug iterations.

3. A CAD technique to allow new trace-based instruments to be inserted, or the signals connected to existing instruments to be modified, incrementally. This process can be completed in the order of minutes as it does not require the designer to recompile the entire circuit, which can take hours. This contribution was published in papers [62, 65], and allows for faster compile-debug turnaround.

4. A virtual overlay network that multiplexes all on-chip signals to the trace-buffers of a design, which can be rapidly reconfigured (in seconds) after the FPGA circuit has compiled. This allows designers to defer their trace-signal selection until debug-time, when the nature of any errors is better known.
This work was published in papers [66, 68], and can be used to eliminate compile-debug iterations altogether.

The four subsections that follow offer a brief overview of each contribution and the challenges involved; a more detailed explanation is presented in the individual chapters of this thesis.

1.3.1 Post-Silicon Debug Metric and Automated Signal Selection for Trace-Instruments

The first contribution of this thesis is a metric to quantify the effectiveness of a trace-based post-silicon debug solution, and three automated algorithms to select new trace signals.

A crucial step in instrumenting any FPGA or ASIC chip with a trace-based debug solution is the choice of which signals to observe. Because only a limited subset of all on-chip signals can be connected to the debug logic, this selection plays a key role in determining the effectiveness of the entire solution. Signal selection is a tedious process which requires expert knowledge of the design, and several methods have been proposed to automate this [49, 82, 160]. Besides relieving designers of this time-consuming procedure, automated methods may also allow complex relationships between signals to be more rigorously recognized and exposed; for example, certain signals may exhibit a high amount of correlation which would allow them to be reconstructed, negating the need to observe them concurrently [82].

Currently, no universal debug metric exists to allow designers to formally understand the factors behind an effective signal selection. Post-silicon debug differs from the task of post-silicon validation in that it is not sufficient to merely detect the presence of design errors; it is necessary to find their root-cause. Algorithms that maximize coverage-based metrics are not appropriate for this task [18, 133], since they will favour solutions which maximize the "coverage" of the design, possibly by maximizing the number of behaviours that are exercised during validation. Rather than just showing that a bug exists, we desire a metric that can aid a designer in understanding the exact design error that led to the incorrect behaviour observed.

Chapter 3 describes how our post-silicon debug metric is formulated, and uses it to examine the trade-off between the scalability and observability of three different signal selection algorithms: a technique that optimizes for observability directly, a method based on the graph-centrality of the circuit's connectivity, and a hybrid technique that uses the circuit's inherent hierarchy to combine both algorithms. Whilst the graph-based method was found to offer the least amount of observability, it was the only method that could be applied to our largest benchmark circuit of over 50,000 flip-flops, computing a signal selection in less than 90 seconds.
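As a rough illustration of the connectivity-based approach, the Python sketch below ranks flip-flops by the eigenvector centrality of a flip-flop connectivity graph using simple power iteration and returns the most central ones as candidate trace signals. It is a simplification of the algorithm developed in Chapter 3: the example graph, the iteration count, and the omission of the actual centrality cost function are assumptions made here for brevity.

    def select_trace_signals(fanout, k, iterations=50):
        """Rank flip-flops by the eigenvector centrality of their connectivity
        graph (computed by power iteration) and return the k most central
        ones as candidate trace signals."""
        nodes = set(fanout)
        for dsts in fanout.values():
            nodes.update(dsts)
        score = {n: 1.0 for n in nodes}
        for _ in range(iterations):
            nxt = {n: 0.0 for n in nodes}
            for src, dsts in fanout.items():
                for dst in dsts:
                    nxt[dst] += score[src]      # influence flows along fan-out edges
            norm = max(sum(v * v for v in nxt.values()) ** 0.5, 1e-12)
            score = {n: v / norm for n, v in nxt.items()}
        ranked = sorted(sorted(nodes), key=lambda n: score[n], reverse=True)
        return ranked[:k]

    # Hypothetical flip-flop connectivity graph: flip-flop -> flip-flops in its fan-out.
    fanout = {"ff_a": ["ff_b", "ff_c"], "ff_b": ["ff_c"],
              "ff_c": ["ff_a"], "ff_d": ["ff_c"]}
    print(select_trace_signals(fanout, k=2))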
1.3.2 Speculative Debug Insertion for FPGAs

The second contribution of this thesis is to show that debug instrumentation can be speculatively inserted into the spare capacity of an FPGA, without any user-intervention and without excessive impact on the circuit.

On-chip observability can be enhanced in several ways. Tools [8, 155] exist to allow designers to instrument their design and bring important signals to the output pins or to on-chip trace buffers for observation during debug. Although some of these tools allow a limited amount of reconfiguration after the circuit has been compiled, in most debugging scenarios, the designer must use these tools to insert debug logic before the circuit is compiled. In many cases, designers will compile the "first cut" of their design without debug instrumentation, and then when unexpected behaviour is encountered, determine which signals would be desirable to observe, instrument their design, and recompile. With compile times of large FPGAs steadily increasing [21], this can severely limit debugging productivity, especially for those prototypes partitioned across multiple devices.

In Chapter 4, an alternative is proposed. Rather than waiting until unexpected behaviour is observed and then instrumenting and recompiling, it is proposed that the FPGA CAD tool speculatively inserts debugging instrumentation every time a circuit is compiled. Before compilation, the CAD tool would determine which signals are likely to be important (or would best complement an existing designer selection) and insert debug instrumentation to make those signals observable. Ideally, this would be entirely automated and transparent to the user; the user only needs to become aware of the speculative instrumentation (and benefit from the extra visibility available) during debugging.

Although the added observability will be valuable during debug, inserting logic speculatively can have a number of drawbacks. First, it is important to limit the amount of insertion so as to not increase the number of FPGAs required. Since ASIC prototypes often contain only partially-filled FPGAs (due to limitations in partitioning of large designs), there will often be extra room for significant debug logic. It is also important that the extra debug logic does not place additional stress on the routing fabric, so as to not increase the critical path, nor increase the compile time due to the added routing pressure. Finally, it is important that the CAD tool intelligently choose which signals will be instrumented; for this task, the automated signal selection algorithm presented previously in Chapter 3 is used.

Chapter 4 shows that speculative debug insertion is feasible and investigates to what extent each of the concerns in the previous paragraph limits how aggressive an automated tool should be. We find that approximately 10% of all flip-flops in a large System-on-Chip benchmark can be instrumented when 10% of the FPGA's logic resources can be spared, with no impact on the circuit delay and a 10% impact on CAD runtime and power consumption.
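A rough sketch of the kind of capacity check such a tool might perform is shown below; the resource figures, the per-signal logic cost, and the 10% budget are illustrative assumptions only, and do not reflect the actual cost model used by the In-Spec tool described in Chapter 4.

    def size_speculative_instrument(total_luts, used_luts,
                                    total_ram_bits, used_ram_bits,
                                    luts_per_traced_signal=2,
                                    trace_depth=1024,
                                    budget=0.10):
        """Choose how many signals a speculatively inserted trace instrument
        can observe while keeping the extra logic and memory utilization
        within a fraction 'budget' of the device capacity."""
        spare_luts = min(total_luts - used_luts, int(budget * total_luts))
        spare_bits = min(total_ram_bits - used_ram_bits, int(budget * total_ram_bits))
        by_logic = spare_luts // luts_per_traced_signal
        by_memory = spare_bits // trace_depth      # one bit per signal per sample
        return max(0, min(by_logic, by_memory))

    # Hypothetical device: 200k LUTs (60% used), 16 Mbit of block RAM (50% used).
    n = size_speculative_instrument(200_000, 120_000, 16_000_000, 8_000_000)
    print(f"{n} signals could be speculatively instrumented")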
1.3.3 Accelerating FPGA Trace-Insertion Using Incremental Techniques

The third contribution of this thesis is a method for accelerating the turnaround-time of installing, and changing, trace-based instruments in FPGAs.

Chapter 5 explores the effect of using incremental-compilation techniques to reduce both the time needed to perform the initial instrumentation of a circuit, and the turnaround time between debugging iterations. Rather than discarding the original circuit mapping, incremental techniques aim to preserve as much of it as possible and make the minimum changes required to accommodate a new debug configuration. The advantages to this approach are many: besides achieving significant CAD runtime savings, the modified circuit will retain most of its original mapping and allow designers to debug something as close to the uninstrumented circuit as possible, as well as better preserving any low-level optimizations and timing closure. This technique is complementary to the speculative insertion proposal from Chapter 4.

Specifically, this work proposes that trace instrumentation be inserted directly into the post place-and-routed FPGA circuit mapping, without moving any existing logic blocks nor ripping up any of the existing routing; instead, only the spare resources that were left behind are used. To make this feasible, novel techniques to increase the incremental routing flexibility of trace-buffer insertion are presented. The value of these techniques is shown by comparing the effect of instrumenting the circuit before (pre-map insertion) and during the FPGA mapping procedure (mid-map), with doing so solely at the end (post-map).

Chapter 5 provides a complete and detailed comparison between the runtime, wirelength, and delay of inserting trace-instruments before, during, and after the FPGA mapping procedure when targeting a realistic FPGA architecture based on the Altera Stratix IV family. We show that, assuming a 20% routing slack above the minimum routing solution, post-map trace-insertion can improve CAD runtime on average by 98X when compared to a full recompilation of the circuit. Additionally, this post-map instrumented circuit was found to possess a shorter wirelength than with the other two strategies, and had only a small effect on the circuit's critical-path delay. By adding trace instrumentation only after the original user circuit is compiled, the original timing can be preserved, avoiding the chaotic nature of heuristic CAD algorithms which may return completely different solutions even for small changes to the input netlist. Furthermore, because these instruments use FPGA resources that are mutually exclusive to the original circuit, by disabling this debug support, the user circuit can always be reverted back to operating at its uninstrumented maximum clock frequency.

1.3.4 A Virtual Overlay Network for FPGA Trace-Buffers

The final contribution of this thesis is the development of a virtual overlay network that multiplexes all on-chip signals to all available trace-buffers inside an FPGA design, which enables rapid debugging without recompilation. Providing simulator-like visibility to an FPGA platform is seen as one of the key technologies required as FPGAs scale to larger and larger capacities [135].

In Chapter 6, a method to accelerate the debug process by significantly reducing the amount of time required to perform a compile-debug iteration is proposed. This is achieved by allowing the designer to change which signals are connected to the trace-buffers without recompiling the design, and without requiring a re-route of signals between debug iterations. The key to this technique is that, at compile-time, a flexible overlay network is embedded into the design which multiplexes almost all combinational and sequential signals of the gate-level circuit into these trace-buffers. Unlike [137], the network is not built using the normal soft FPGA logic. Instead, by building on the techniques from Chapter 5, the unused routing multiplexers within the FPGA fabric are reclaimed to implement this network. As a result, the area overhead due to this network is essentially zero. At debug time, this network is configured using bipartite graph matching techniques to determine the routing bits necessary for connecting the selected signals to a trace-buffer. Using this technique, any signal selection of a designer's choosing can be forwarded to most of the on-chip trace capacity.

Although this approach falls short of a software simulator in that only a limited number of signals can be observed, and only for a limited number of clock cycles as constrained by trace-buffer capacity, crucially, it is shown that this technique allows the designer to defer the selection of which signals to observe until debug-time. This negates the need to recompile the circuit whenever the signal selection is changed, greatly enhancing debug productivity.
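To illustrate the debug-time configuration step, the sketch below uses a standard augmenting-path maximum bipartite matching to assign each requested signal to a distinct trace-buffer input that the overlay network can reach. It is a simplified stand-in for the matching formulation of Chapter 6; the reachability data and pin names are invented for the example.

    def match_signals_to_trace_pins(reachable, requested):
        """Maximum bipartite matching via augmenting paths: map each requested
        signal to a distinct trace-buffer pin it can reach through the overlay
        network; returns {signal: pin} for the signals that were matched."""
        pin_owner = {}                      # trace pin -> signal currently using it

        def try_assign(sig, seen):
            for pin in reachable.get(sig, ()):
                if pin in seen:
                    continue
                seen.add(pin)
                # Take a free pin, or evict the current owner if it can move elsewhere.
                if pin not in pin_owner or try_assign(pin_owner[pin], seen):
                    pin_owner[pin] = sig
                    return True
            return False

        for sig in requested:
            try_assign(sig, set())
        return {sig: pin for pin, sig in pin_owner.items()}

    # Hypothetical reachability: which trace-buffer pins each signal can reach.
    reachable = {"addr0": ["tb0", "tb1"], "addr1": ["tb0"], "ack": ["tb1", "tb2"]}
    print(match_signals_to_trace_pins(reachable, ["addr0", "addr1", "ack"]))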
Chapter 6 finds that all on-chip signals (numbering over 100,000 for the largest benchmark) can be connected to this overlay network. A new network configuration for forwarding any arbitrary subset of these signals, corresponding to approximately 80-90% of the available trace-buffer capacity (16,000 signals), can then be computed in less than 50 seconds. The overlay network described can also be inserted speculatively, as proposed in Chapter 4; signals for the network can then be selected using the automated techniques from Chapter 3.

1.4 Significance

The significance of this thesis is that integrated circuit designers can now debug FPGA-based prototypes of their circuits much more effectively than was previously possible. During ASIC design, fabricating a test chip is an expensive procedure with a lengthy lead time; prototypes based on FPGA technology allow designers to undertake more complex testing ahead of real silicon being available. Given that ICs can take at least as long to verify and debug as they do to design [2, 60, 70], it is key that when tests fail, these FPGA-based prototypes can be debugged effectively. The more quickly a design can be verified, the faster its time-to-market.

The four contributions of this thesis make significant strides towards the long-term goal of achieving a "simulator-like" debugging experience on FPGAs. By using automated signal selection techniques, designers can realize more effective debug instruments that may recognize complex or redundant signal relationships that a manual selection may miss. The concept of speculative debug insertion into every FPGA circuit was shown to be feasible in this thesis, and can lead to fewer compile-debug iterations. The third contribution of this thesis was to propose incremental signal-tracing techniques that can be used to rapidly insert and modify trace-based debug instruments in FPGAs, leading to faster compile-debug turnaround. Lastly, we show how to eliminate compile-debug iterations by installing a virtual overlay network to multiplex circuit signals through to trace-buffer instrumentation. Subsequently, designers are able to change the signals observed during debug, without any recompilation, in a matter of seconds.

1.5 Thesis Organization

This thesis is organized as follows. Chapter 2 provides a detailed background on the applications of FPGA technology, their underlying circuit architecture and the tool-flow used for mapping circuit descriptions to this platform. The first contribution of this thesis is described in Chapter 3, where we formulate our post-silicon debug metric, as well as algorithms for automatically selecting signals for trace-instrumentation. Chapter 4 documents the second contribution of this thesis, which is to show that speculatively inserting debug instrumentation into every FPGA circuit is viable. The third contribution of this thesis can be found in Chapter 5, which describes techniques to more efficiently insert or modify existing debug instruments inside FPGAs through using incremental-compilation methods. The final contribution of this thesis is to install a virtual overlay network which can be rapidly reconfigured with the trace signals that a designer wishes to observe. This can be found in Chapter 6.
Lastly, Chapter 7 concludes this thesis, and presents the outstanding challenges that remain for the long-term goal of achieving "simulator-like observability".

Parts of this thesis have been published in papers [62-67]. During the course of this work, two related papers were also published: reference [150] shows how techniques similar to the proposal in Chapter 3 can be integrated into a commercial tool, whilst reference [69] presents a practical contribution that enables the debug flow in Chapter 6 to be realized on physical parts. Due to page limitations, neither of these contributions is described in this thesis.

Figure 1.4: Thesis organization. (Chapter 3: Post-Silicon Debug Metric and Automated Signal Selection, for smarter debug iterations; Chapter 4: Speculative Debug Insertion, for fewer debug iterations; Chapter 5: Incremental Signal-Tracing, for faster debug iterations; Chapter 6: A Virtual Overlay Network for Trace-Buffers, eliminating recompilation during debug.)

Chapter 2

Background and Related Work

This chapter first reviews field-programmable gate array technology in Section 2.1: their typical applications, underlying circuit structure, and the process of mapping designs to this platform. Next, the primary challenge of post-silicon debug, on-chip observability, along with a detailed survey of the approaches used to enhance observability, is described in Section 2.2. Lastly, previous work specific to this thesis (automated signal selection and evaluation metrics, reclaiming spare FPGA resources, incremental compilation for FPGAs, and multiplexer networks for post-silicon debug) is covered in Section 2.3.

2.1 Review of Field-Programmable Gate Arrays

Field-programmable gate arrays (FPGAs) are prefabricated logic platforms that can be used to build a wide range of digital designs. These devices are different from application-specific integrated circuits (ASICs) in that they can be completely reconfigured after manufacturing. Whilst an ASIC would typically be hard-wired to perform the same one or two tasks throughout its lifetime (packet switching or signal processing, for example), an FPGA is analogous to an "Etch-a-Sketch" which can be erased and reprogrammed with an unlimited number of circuits, each with a different function. However, this flexibility does not come without drawbacks: an FPGA consumes roughly 20-30X more silicon area than an equivalent standard cell ASIC, operates 3-4X slower, and consumes 10X more dynamic power [84].

2.1.1 Typical FPGA Applications

FPGA devices have many advantages, and these lead to two main real-world use cases: for production, and, the main focus of this thesis, for prototyping.

Production FPGAs

The reconfigurable nature of FPGAs provides many advantages that have enabled them to be shipped in place of ASICs inside a final product. One key advantage for FPGA technology is their significantly lower non-recurring engineering (NRE) costs. These costs refer to the one-off investment required to license the complex tools used to design a custom ASIC, the upkeep of an expert design team who can use these tools to develop the circuit, and finally to create the photolithography masks used for fabrication; this last requirement can cost millions of dollars for state-of-the-art designs [84]. Although these NRE costs can be recouped for mass-produced products, this would not be possible at low volume.
Examples of low-volume products are those offered by Maxeler Technologies [95], who provide custom high-performance computing solutions implemented using FPGA devices.

FPGAs have also found applications in medium-volume devices, where this per-unit cost advantage may not be as high. Being reconfigurable, not only can this technology be reprogrammed post-manufacturing, but also post-shipping. Rather than being forced to employ complex software/firmware workarounds [9] for hardware design flaws (see note 1), designers can now fix the circuit in-place with an updated FPGA image in a matter of minutes. This can significantly accelerate a product's time-to-market. For ASICs, fixing silicon flaws would require a re-spin, which can take several months. One such FPGA application is in network routers; Cisco [34] provides instructions on how users can perform field-programmable device upgrades. FPGAs have also been successfully instantiated inside higher-volume custom ASICs for functional and debug purposes [115, 146].

Note 1: Early revisions of the AMD Phenom processor in 2007 exhibited erratum 298, a silicon flaw that could potentially cause data corruption; until revised silicon was available months later, the official workaround was a BIOS patch which reduced system performance by over 10% [148].

Prototyping FPGAs

A second common usage scenario for FPGA devices, and the main application targeted in this thesis, is ASIC prototyping. As stated in Chapter 1, simulating a large IC design can be up to a billion times slower than the final silicon speed, making it infeasible to test using simulation alone. On the other hand, spinning an ASIC can cost millions of dollars, and have a lead time of several months. Whilst some properties of final silicon (e.g. electrical) cannot be validated using alternate technologies, FPGAs do present a viable platform for functional verification (that is, checking that the circuit has been designed correctly); FPGA prototypes of IBM and Intel ASICs [12, 146] have been shown to run at 4 MHz and 50 MHz respectively.

Besides being able to explore behaviour from much deeper into the circuit's operation and to perform more comprehensive tests such as booting an operating system, FPGA prototypes are also able to interface with physical, real-time, real-world stimulus that may not be reproducible in a simulation test-bench. For example, FPGAs are able to support circuits with multiple clock domains, which are becoming increasingly common due to the rising importance of power [124]; these circuits contain asynchronous interactions that are nontrivial to model. Balston et al. [14] have also shown that FPGA prototypes have utility beyond functional verification, using them to quantify exactly how thoroughly tests will exercise the final ASIC's critical and near-critical paths.

During ASIC prototyping, the high-level circuit description would typically be modified to take advantage of any specialized FPGA resources (e.g. RAM, DSP, clocking) available. Reference [12] reported that this was possible whilst still preserving cycle-accurate behaviour in their prototype, which was used to find over 30 new bugs. However, a primary challenge during ASIC prototyping is that large designs will often not fit onto a single FPGA device and will typically have to be partitioned (often manually) over multiple boards, each containing multiple FPGAs; IBM reported employing 45 FPGAs in their prototype [12]. The key issue during partitioning is the number of inter-FPGA connections that can be made.
Given that FPGA devices have a limited number of I/O pins (Altera's largest Stratix IV device supports 1104 user I/Os), this is often the constraining factor that prevents devices from being fully utilized and the maximum clock frequency from being reached [90]. This slack, both in terms of timing, as well as the availability of prefabricated memory and routing resources, represents a unique opportunity to add extra debug support.

The Dini Group, who manufactures multi-FPGA boards for ASIC prototypes, currently assumes (and suggests to their customers) that each FPGA is filled only to 60% utilization [41], whilst Intel reported that they were able to build a highly-tuned prototype of their Nehalem (Core i7/i5) processor over 5 FPGAs, with a maximum logic utilization of 89% (average 84%) and memory utilization of 83% (average 55%). Importantly, they reported that due to I/O limitations, time-division multiplexing of connections was required, which limited their user clock frequency to 520 kHz [123]. Other studies [13, 80] have also reported typical FPGA utilization values of less than 50% and 20% respectively, with the prototype achieving an operating clock frequency one or two orders of magnitude lower than its potential maximum.

Figure 2.1: Island-style FPGA architecture overview.

2.1.2 FPGA Architecture

The general-purpose nature of an FPGA requires a flexible underlying fabric. Academic and commercial FPGAs most commonly employ a synchronous 2D island-style architecture, though asynchronous [3] and time-multiplexed variants [132] are also available. The island-style architecture fabric is made up of four components: I/O interfaces, soft-logic and hard-logic resources, all interconnected through a flexible routing network. The I/O and logic resources are typically arranged in rows and columns across the device (giving rise to its island-style description). This layout is illustrated in Figure 2.1.

The I/O interfaces on an FPGA allow the device to communicate with the outside world. Given that the exact mode of communication cannot be determined prior to manufacturing, these blocks must be able to support bidirectional communication across a wide variety of standards and voltages. I/Os can commonly be found on the peripheries of the FPGA layout, but this is not always the case [69].

Figure 2.2: FPGA look-up table, logic element and logic cluster. (a) A 3-input look-up table (LUT) implementing A'BC + AB'C'; (b) FPGA logic elements (LEs) grouped into a logic cluster (LC), accessed using a local routing network.

Soft-logic resources are the characteristic element of FPGAs. These resources can be programmed to implement any digital logic function. The key component that makes up these soft-logic resources is the look-up table (LUT), which is an array of memory cells connected to a multiplexer; the select-lines of this multiplexer serve as the inputs to the LUT and allow the contents of one memory cell to be forwarded to the LUT output. By applying the appropriate contents to this memory array (known as the LUT mask), a truth table of any logic function, up to the number of LUT inputs, can be realized. A 3-input LUT is illustrated in Figure 2.2a. Logic functions of more than 3 inputs can be implemented by cascading multiple LUTs.
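To make the LUT-mask mechanism concrete, the following is a minimal behavioural sketch (in Python, not vendor RTL or a real architecture model): the input bits form an index into the mask, and the mask bit at that index becomes the LUT output. The bit ordering and the example function are illustrative assumptions only.

```python
def lut_eval(mask, inputs):
    """Evaluate a LUT: 'inputs' selects one bit of the 2^k-entry mask
    (inputs[0] is treated as the least-significant select line)."""
    index = 0
    for position, value in enumerate(inputs):
        index |= (value & 1) << position
    return (mask >> index) & 1

# Build the mask for the example function of Figure 2.2a, f = A'BC + AB'C',
# by enumerating all eight input patterns (A, B, C).
def f(a, b, c):
    return ((not a) and b and c) or (a and (not b) and (not c))

mask = 0
for i in range(8):
    a, b, c = (i >> 0) & 1, (i >> 1) & 1, (i >> 2) & 1
    mask |= int(f(a, b, c)) << i

assert lut_eval(mask, [1, 0, 0]) == 1    # A=1, B=0, C=0 -> AB'C' is true
assert lut_eval(mask, [1, 1, 0]) == 0    # neither product term is true
```

Programming a LUT therefore amounts to nothing more than writing the appropriate mask bits into its memory cells; the logic function itself is not wired, only looked up.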
Look-up tables are commonly paired with a flip-flop (FF), and packaged as a logic element (LE); multiple LEs are then grouped hierarchically into a logic cluster (LC), often accessed through a local routing network, to improve area and delay [20].

Hard-logic refers to a set of specialized resources that have been customized to perform specific tasks more efficiently than general-purpose soft-logic can, or to perform tasks that are impossible with it. Examples of hard-logic include memory (RAM) blocks, Digital Signal Processing (DSP) cores, embedded CPUs, clock synthesis/distribution, high-speed I/O, etc. Whilst RAM, DSP, and even embedded CPU functionality can be realized using general-purpose LUTs and FFs, it is often more area-, delay-, and power-efficient for them to be mapped onto dedicated hard resources; likewise, clock signals benefit from utilizing a specialized low-skew network as opposed to the general-purpose routing network. Commonly, such hard blocks are also arranged in columns interspersed across the device, and are typically larger than soft-logic blocks, consuming multiple rows as shown in Figure 2.1.

Figure 2.3: FPGA routing network connecting block A to blocks B and C. Switch-Blocks (SB) multiplex the wire source, Connection-Blocks (CB) multiplex the wire sink(s).

Lastly, a flexible routing network binds together all of these I/O, soft- and hard-logic resources to enable on-chip communication. The flexibility of this network is realized using multiplexers, arranged into switch-blocks (SB) and connection-blocks (CB). SBs are responsible for multiplexing the source of each wire (i.e. which resource output is driving this wire), whilst CBs are responsible for controlling the sink(s) to which a wire connects (determining which FPGA resource input(s) receive the driven value). The multiplexers in the routing network differ from those in the LUT in that here, their select lines are held constant by a memory cell, and the data lines are driven by user-logic. By programming these memory cells, a custom wiring pattern appropriate to the implemented circuit can be built, as illustrated in Figure 2.3, which shows logic block A fanning-out to both logic blocks B and C. Switch-blocks can also be cascaded to generate complex wiring patterns that can fan-out to multiple sinks far apart.

FPGAs contain an abundance of logic and routing resources. Crucially, not all resources will be necessary to implement every single circuit. This point enables many of the techniques described in this thesis.

2.1.3 FPGA CAD

Computer-aided design (CAD) or electronic design automation (EDA) tools are used to map a high-level circuit description into a physical implementation. Whilst ASIC CAD tools can fabricate any logical structure, FPGA tools must map this circuit onto a prefabricated architecture, subject to predetermined constraints.

Figure 2.4: FPGA CAD flow overview.
The FPGA mapping procedure can be divided into several stages [29]: synthesis, technology-mapping, packing, placement, routing, timing-analysis and bitstream generation, as illustrated in Figure 2.4.

The synthesis (also known as elaboration) stage is responsible for translating a high-level circuit description (for example, VHDL or Verilog) into a netlist of technology-independent logic functions, such as generic truth tables, along with how they are connected, as well as inferring any instances of specialized functions such as memory or DSP-like operations. The following stage is technology-mapping, which converts this technology-independent netlist into one with technology-dependent primitives suitable for the target architecture. For example, if the FPGA architecture supported only 3-input LUTs, then any truth-tables from the previous stage with more than 3 inputs would have to be decomposed into tables with 3 inputs or fewer. In academic research, the most common synthesis tool is Odin II [72] (supporting Verilog only) and the most common technology-mapping tool is ABC [19], which operate on circuit netlists in the BLIF format. For commercial FPGA tools, the "quartus_map" tool from the Altera Quartus II CAD suite performs both synthesis and technology-mapping, whilst the "xst" and "ngdbuild" tools perform these tasks for Xilinx ISE.

The packing (or clustering) stage is then used to group similar and/or dependent primitives into the same physical resource, such as grouping LUTs and FFs into logic elements, and then those LEs into logic clusters. Packing interrelated logic resources together can return significant improvements to circuit routability and delay [29], as well as CAD runtime [30]. Next, placement is performed to determine the best on-chip location for each resource, and is an important step for reducing the length of wiring needed to connect all resources together. Simulated-annealing algorithms are typically used for this stage [29], though analytical placement techniques have been explored [52] and implemented in commercial tools [56]. The output from the placement stage is a detailed floorplan with a legal location for every resource in the packed netlist.

Next, the routing stage determines the exact physical resources that are used to implement each unidirectional wire, ensuring that each resource is used at most once so that no collisions occur. The problem is often represented by a directed graph, in which each wire source node must be connected to all of its sink nodes without overlapping any other wires; typically, a directed-search (A*) approach [96] is used to traverse this routing resource graph efficiently. The Verilog-to-Routing (VTR) CAD suite [121] is considered the de facto academic tool for FPGA research, combining Odin II, ABC, and VPR (which is responsible for the later stages: packing, placement, routing and timing analysis). Altera Quartus II employs the "quartus_fit" tool for packing, placement and routing, whilst Xilinx ISE provides the "map" tool for combined packing and placement, and the "par" tool for routing.

Modern CAD flows, such as VTR, Quartus II and ISE, all support timing-driven flows by default, where the primary objective of the mapping procedure is to produce circuits that balance area utilization with critical-path delay. Options exist to change these priorities to minimize area, delay, or power. The last two stages are static timing analysis and bitstream generation. Timing analysis provides a detailed report of the resources on the critical path, and the critical-path delay.
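As an illustration of the annealing-based placement stage described above, the sketch below performs random pairwise swaps of blocks on a grid and accepts them according to the usual Metropolis criterion. The grid, cost function (half-perimeter wirelength only) and cooling schedule are simplifying assumptions; production placers such as VPR also model timing and restrict move ranges.

```python
import math
import random

def hpwl(placement, nets):
    """Total half-perimeter wirelength (HPWL) over all nets."""
    total = 0
    for net in nets:                                  # each net is a list of block names
        xs = [placement[b][0] for b in net]
        ys = [placement[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def anneal(placement, nets, temp=10.0, cooling=0.95, moves_per_temp=50):
    """Swap-based simulated annealing: accept bad moves with probability exp(-delta/T)."""
    blocks = list(placement)
    cost = hpwl(placement, nets)
    while temp > 0.01:
        for _ in range(moves_per_temp):
            a, b = random.sample(blocks, 2)
            placement[a], placement[b] = placement[b], placement[a]
            new_cost = hpwl(placement, nets)
            delta = new_cost - cost
            if delta <= 0 or random.random() < math.exp(-delta / temp):
                cost = new_cost                       # accept the swap
            else:
                placement[a], placement[b] = placement[b], placement[a]   # undo it
        temp *= cooling
    return placement, cost

# Toy example: four blocks on a small grid, two nets.
initial = {"A": (0, 0), "B": (2, 2), "C": (0, 2), "D": (2, 0)}
nets = [["A", "B"], ["A", "C", "D"]]
final_placement, final_cost = anneal(initial, nets)
```

The same accept/reject structure underlies timing-driven placement as well; only the cost function changes.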
Bitstream generation formats the implemented netlist into a string of programming values that can be loaded into the memory cells of a physical FPGA.

2.2 Primary Challenge During Post-Silicon Debug: Observability

As Moore's Law continues to drive the number of transistors inside integrated circuits higher, thus allowing more complex digital designs to be realized, verifying and debugging these designs has become an increasingly difficult task. Studies by the ITRS, Hsu et al. and Abramovici et al. [2, 60, 70] all attest to hardware taking at least as long to verify and debug as it does to design, in time-frames measuring months or years with high uncertainty. Bolstering this argument further, a Mentor Graphics study found that whilst silicon density doubles every 18 months, designer productivity only doubles every 39 months, and that half of all designer effort was spent performing functional verification [46]. Furthermore, flaws may only be exposed after system integration, in which asynchronous interactions between multiple clock and power domains come into play [79].

This thesis focuses primarily on techniques for debugging functional errors, that is, systematic logic errors caused by a human designer or by a CAD tool, as opposed to random manufacturing or fabrication defects, though the techniques of this thesis are not exclusive to only the former type. Examples of functional errors may be an incorrect state machine transition, an unexpected FIFO overflow, or a logic structure incorrectly inferred by the CAD tool; an example of a manufacturing error would be if a logic or routing resource exhibited a stuck-at fault that would cause a mismatch between simulation and silicon, or vary when programmed across different FPGA devices. Debug differs from manufacturing test in that it is used to ensure the circuit was designed correctly, whilst manufacturing test is applied to every device coming off the production line to ensure that it has been implemented correctly.

For debugging functional errors, almost all engineers prefer a logic simulation environment, favoured for its unlimited observability [133] and the fast turnaround time between root-cause and bug-fix. Once an error has been observed, verification engineers can either manually examine their waveforms to discover the design error, or use automated techniques to do so [78]. However, software simulation operates extremely slowly, particularly for large designs (IBM reported a simulation speed of 10 Hz for their BlueGene/Q compute node [12], whilst Intel reported 2-3 Hz during development of their Core i7 microarchitecture [74]), and so designers have been turning to hardware solutions, which can be many orders of magnitude faster [70].

A naïve approach to faster verification is to use emulation techniques, where simulation software is directly accelerated using specialized hardware. Cadence Palladium [25] is a simulation acceleration/in-circuit emulation platform based around a custom ASIC implementing a massively-parallel Boolean compute engine, which can maintain identical circuit visibility to simulation at frequencies up to 2 MHz. Similar products are Mentor Graphics Veloce [98] and Synopsys ZeBu [131], which are based on FPGA technology. The big drawback to these products, however, is their cost, which can be upwards of $1 million [105, 137]. FPGA technology can also be used to directly implement a circuit prototype.
The simplest method of viewing the internal nodes of such a design, which can be especially convenient due to the reconfigurable nature of FPGAs, is to temporarily route any signals of interest to its I/O pads and out from the chip [8]. These I/Os can then be attached to external logic analyzers [136]. Although cumbersome, this ad-hoc approach will likely work at FPGA speeds due to the availability of high-speed (and high-cost) analyzers; however, it is only possible if such I/O pads are available, and a survey by Morris [103] found that over 40% of FPGA designs are I/O limited. To make things worse, the ratio of logic to I/O in FPGAs is on the increase [28]; this data is illustrated in Figure 2.5, which plots the number of logic cells and user-accessible I/O over multiple generations of the Xilinx Virtex architecture.

Figure 2.5: Number of logic cells and user I/O in Xilinx Virtex FPGAs.

The primary challenge with debugging a prototype system is the lack of on-chip observability. Due to the limited number of I/O resources that can be used to access internal signals, and the operating frequencies that such prototypes are capable of, it is very difficult for designers to deduce what is happening inside each chip. When a circuit does not function correctly, it is important to understand the behaviour of the chip in order to identify the cause of the error. This section will review post-silicon debugging techniques in both ASICs and FPGAs, where many similarities exist. On both implementation platforms, two main (yet complementary) approaches exist to enhance on-chip visibility: scan-based instrumentation and trace-based instrumentation.

2.2.1 Scan-based Instrumentation

Scan-chains are an essential component of Design-for-Testability techniques for ensuring ASIC devices have been manufactured correctly. This technique involves sequentially connecting the flip-flops of an IC into one or more chains. This allows their stored values to be both observed and controlled [144], as illustrated in Figure 2.6. If the fabrication process were to introduce a stuck-at fault, where an internal wire was permanently logic-0 or logic-1, this would be detectable using a set of test patterns. Each pattern would set every flip-flop in the circuit to a particular logic value during the "launch" phase, which is followed by a "capture" phase where the new values returned by the combinational logic are recorded. Any mismatch from the expected value would indicate that the specified circuit was not implemented into silicon correctly.

Figure 2.6: Scan-based debug instrumentation.

Whilst scan-chains are most commonly used for detecting manufacturing errors, they can also be re-purposed for debugging functional errors. Since 82% of all ASIC designs contain scan-chains already [50], this is effectively a zero-cost solution. Carbine and Feltham [26] describe the use of a proprietary, observation-only scan technique to observe internal nodes within Intel processors, supporting two modes of operation: snapshot mode, where a key subset of internal state (over 2000 signals) is periodically sampled and serially shifted out as the processor continues to run at full speed (as with Chuang et al. [33]). The second mode is signature compression, which allows for real-time, but incomplete, observability (similar to Yang and Touba [159]).
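The launch/capture sequence described above can be summarised with a small behavioural model. This is a purely illustrative sketch: the chain ordering and the stand-in combinational function are assumptions, not any particular vendor's scan implementation.

```python
def scan_shift(chain, serial_in):
    """Shift bits in at position 0; the last flip-flop in the chain shifts out."""
    shifted_out = []
    for bit in serial_in:
        shifted_out.append(chain[-1])
        chain = [bit] + chain[:-1]
    return chain, shifted_out

def launch_and_capture(pattern, comb_logic):
    chain = [0] * len(pattern)
    chain, _ = scan_shift(chain, list(reversed(pattern)))   # launch: load the pattern
    chain = comb_logic(chain)                               # capture: one functional cycle
    _, response = scan_shift(chain, [0] * len(chain))       # unload the response
    return list(reversed(response))

def comb_logic(state):
    # Toy stand-in: each flip-flop captures the AND of itself and its neighbour.
    return [state[i] & state[(i + 1) % len(state)] for i in range(len(state))]

observed = launch_and_capture([1, 1, 0, 1], comb_logic)
print(observed)   # [1, 0, 0, 1]; a mismatch against simulation would flag a fault
```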
Scan-chains have also been used to isolate timing errors between flip-flops by Holdbrook et al. [59], who describe a "cycle-stretch" technique where the clock periods prior to the externally observed failure are successively increased in order to allow more time for data to arrive. This is used in conjunction with comparing scan-dumps to their simulated values, to isolate the error. Scan techniques are also used by Datta et al. [37] to perform both delay-fault testing and debug. Rootselaar and Vermeulen [141] present a multi-stage breakpoint mechanism, and supporting debug toolset, to complement the use of scan-chains for debugging a multi-clock domain chip. Other examples of scan-chain applications for ASIC debug include [51, 54, 89, 145].

Besides requiring zero overhead in ASICs that already contain scan-chains, another advantage of scan-based approaches is that typically all flip-flop values in the circuit are captured. However, in order to extract their values, the circuit must be halted (for example, by gating its clock signal) precisely at the cycle of interest to allow their stored values to be shifted out. This precision can be difficult for circuits operating at MHz or even GHz speeds [119, 120]. Furthermore, the extraction process can be slow when compared to the operating frequency, and is destructive, meaning that the circuit must remain halted throughout the entire unloading and reloading procedure. Once a snapshot has been acquired, the circuit is typically single-stepped through subsequent clock cycles in order to understand the behaviour of the circuit over time [145]. Thus, scan-based techniques are often too slow and tedious to be suitable for real-time observation.

Scan-based Techniques for FPGAs

FPGAs are unlike ASICs in that FPGA vendors would be expected to deliver a fully tested, pre-screened device that operates correctly if used within specifications. An FPGA user should not need to perform any manufacturing tests, or access any of the infrastructure that does so. However, due to their success and widely understood nature, there have been a number of approaches which have attempted to reproduce scan-chain functionality for functional debug inside FPGAs.

Design-level scan proposes duplicating all user-state (present inside flip-flops and embedded memory blocks) in the design to emulate ASIC scan-insertion [149]. The authors show that doing this on the general-purpose FPGA fabric increases logic element usage by 60%-84%, whilst decreasing circuit frequency by 20%, compared to a 5%-30% area overhead in ASIC designs. An improved technique is proposed by Chuang et al. [32], where snapshots of the FPGA state are only periodically copied into shadow flip-flops, which are connected together as scan-chains to allow values to be unloaded into off-chip memory without affecting the circuit under debug. By combining this with a continuous trace of the primary inputs, complete observability is available through reconstruction in a software simulator. The same authors further refine their method in [33] to only snapshot the minimum selection of flip-flops required for reconstruction.

Although explicit scan functionality may not exist in FPGA devices, several vendors (though not all) expose device readback capabilities [99, 158].
This feature, borne out of a need to verify that an FPGA's configuration memory has not been corrupted [27], allows designers to extract the static configuration of the device, as well as all dynamic user-state, including internal logic block and I/O registers as well as embedded memories, for analysis.

One of the first pieces of work to utilize this feature was BoardScope [88], which presents a transparent, scan-like debugging tool for Xilinx FPGAs to allow users to single-step the circuit clock and examine and modify all state and combinational logic element outputs in a circuit. However, although this method incurs zero cost from a designer's point of view, it is slow, as the entire bitstream needs to be re-read every cycle any signal is viewed. Debugging efficiency is improved upon by Tiwari and Tomko [138], who insert a scan-based watch-point circuit to monitor a predefined set of signals for a runtime-configurable trigger condition. This allows the circuit clock to be stopped on interesting events. A similar piece of work is presented in [139]. More recently, Iskander et al. [71] reported that viewing just one flip-flop using device readback techniques can take between 2 and 8 seconds, due to the coarse granularity at which this technique operates.

2.2.2 Trace-based Instrumentation

Figure 2.7: Trace-based debug instrumentation.

The second category of enhancing on-chip visibility is through the use of trace-based instruments. This involves inserting dedicated circuitry into the implemented design to allow a limited amount of internal signal data to be captured unintrusively during regular, at-speed device operation, removing the need to halt the device every time an observation is to be made. An illustration of trace-based instrumentation can be found in Figure 2.7. To make this feasible, signal data is recorded into on-chip memory (known as trace-buffers) as opposed to being exported immediately; thus, only a limited number of signals can realistically be observed, and only for a limited number of clock cycles. Trace-buffer memories can be configured as circular buffers, whereby on every clock cycle the oldest entries are overwritten with the current signal values. After recording is complete, this trace data can then be offloaded using a low-bandwidth interface (e.g. JTAG) for analysis. Unlike scan-based approaches, trace-based instrumentation does require additional silicon area in the form of memory blocks and control circuitry. Adding dedicated debug circuitry in this manner is commonly referred to as Design-for-Debug.

Trace-based instrumentation requires a designer to preselect a subset of (hopefully) important signals for observation prior to circuit implementation. During testing, designers would typically only be able to observe that preselected set until the circuit is reimplemented with a new set of instruments. The advantage of trace-based approaches, however, is that they are able to provide a history of consecutive signal data that can show the behaviour of critical signals (such as state machine bits, packet counters, or bus addresses) in the lead up to, and in the aftermath of, an erroneous event.

An important requirement for both scan-based and trace-based techniques is for this debug instrumentation to only capture signal data at, or surrounding, those erroneous events. The data recorded by on-chip trace-buffers can be considered a sliding window.
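The circular trace-buffer and trigger behaviour described in this subsection can be modelled in a few lines of software. The sketch below is a behavioural model only (the depth, sample format and trigger condition are arbitrary placeholders), not the RTL that a real instrument would generate.

```python
class TraceBuffer:
    """Software model of a circular trace-buffer that freezes on a trigger."""
    def __init__(self, depth):
        self.depth = depth
        self.samples = [None] * depth
        self.write_ptr = 0
        self.stopped = False

    def clock(self, sample, trigger):
        """Called once per emulated clock cycle with the probed signal values."""
        if self.stopped:
            return
        self.samples[self.write_ptr] = sample        # overwrite the oldest entry
        self.write_ptr = (self.write_ptr + 1) % self.depth
        if trigger:                                   # e.g. an illegal-state precondition
            self.stopped = True                       # freeze contents for offload (e.g. JTAG)

    def window(self):
        """Return the captured sliding window, oldest sample first."""
        return self.samples[self.write_ptr:] + self.samples[:self.write_ptr]

tb = TraceBuffer(depth=4)
for value in range(10):
    tb.clock(sample=value, trigger=(value == 6))
print(tb.window())   # last four samples up to the trigger: [3, 4, 5, 6]
```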
Triggering circuitry can be used to either start or stop this window, perhaps through halting the circuit, only when a known error precondition, such as an illegal state, is encountered. Trace-based instrumentation differs from its scan-based equivalent in that it does not need to halt the circuit with the same amount of precision, as it is able to take advantage of the depth of its trace-buffers, which can be hundreds or thousands of samples deep. For this reason, trace-based techniques can offer a designer greater flexibility in choosing a convenient trigger condition, or to reduce the impact of triggering logic on delay through pipelining, without affecting signal observability. In designs with multiple clock-domains, different trace-buffer instruments are typically instantiated for recording signal data from each clock domain independently. Subsequently, off-chip software is used to synchronize the extracted traces before presenting them to the designer [117].

Trace-based instrumentation can commonly be found in ASIC System-on-Chips (SoCs). The designers of the Cell processor [119, 120] specifically state that whilst their chip contains scan-based functionality, starting and stopping the clock precisely can be challenging. The solution they chose to implement was to add low-overhead, centralized trace and trigger functionality to allow full-speed observation into a window 132 bits wide by 1024 samples deep, of which 4 signals are used for triggering. A trace-buffer is also used to capture traffic across the AMD HyperTransport bus [47], which serves as the interconnect in a multi-processor system. The authors describe a method of calculating the optimal size of their trace buffer required to fully capture a typical failure, and also show how its contents can be emptied to off-chip system memory once full, though acknowledging that doing so may affect the circuit under debug. Similar approaches include tracing and compressing transactions over the AMBA 2.0 bus standard [76], and a commercial offering from ARM which provides a comprehensive trace framework for SoCs [101].

Trace-based Techniques for FPGAs

If sufficient spare resources exist in the target FPGA, trace-buffers can be inserted for no additional area cost. This differs from ASICs, where any amount of additional logic will incur extra silicon area, decrease yield, and increase die cost. In some FPGA architectures, hardened control logic exists inside RAM blocks to allow circular buffers to be implemented (as FIFO structures) without using any soft-logic whatsoever [153]. Upon a trigger event, these buffers can then be halted by disabling their write-enable inputs, or by gating their clock signal.

Commercial signal capture tools are currently offered by the two major FPGA vendors: Xilinx, with ChipScope Pro [155], and Altera, with SignalTap II [8]. These tools are specifically tailored to each vendor's FPGA devices, and work by embedding logic analyzer IP (composed of signal probes, trigger monitors, trace buffers and data offload logic) into the user-circuit during regular compilation. A similar, but device-neutral, product is offered by Synopsys as Identify [130], offering much of the same functionality.
However, although it is possible to modify the trigger conditions (but not the trigger signals) at runtime, changing the signals under observation does require FPGA recompilation.

More recently, Tektronix Certus [137], also vendor-neutral, allows designers to pre-instrument a large set of interesting signals in their FPGA prototype prior to compilation so that during debugging, a small subset of signals can be selected for both observation and triggering. This provides significantly more runtime flexibility to designers than in other offerings, and is achieved by using a proprietary multiplexer network, which is covered in Section 2.3.4. However, this approach still requires a set of signals to be preselected (albeit less precisely) for observation before the nature of any bugs is known. Tektronix's equivalent ASIC offering is known by the name of Clarus [114].

An important point to note is that scan-based and trace-based debug solutions are not mutually exclusive: Gao et al. [49] describe the concept of a suspect window (the range of clock cycles in which an error may have initiated before propagating out to an observed signal) and propose a hybrid method of first using non-intrusive tracing to find the beginning of this suspect window, after which the circuit is single-stepped whilst performing scan dumps in order to acquire fine-grained observability.

2.2.3 Visibility Enhancement

As described above, scan-based instrumentation can provide very wide visibility one cycle at a time, and trace-based solutions can provide more narrow visibility over multiple clock cycles. Neither technique provides the same visibility as software simulation. Researchers have explored two approaches to further enhance on-chip observability: data compression, and post-processing.

Data Compression

The goal of data compression is to record on-chip signal data more compactly. The simplest form of data compression is to capture less signal data; this can be achieved in scan techniques by creating chains which only connect a subset of all flip-flops. This decreases the time required between scan-dumps, so either the circuit can be single-stepped more quickly, or signal values can be duplicated into shadow flops more frequently [33]. Alternatively, signal values can be passed through a lossy compression function (e.g. composed of XOR gates) in order to generate a signature hash. Post-processing techniques (such as those covered later) can then be applied to reconstruct the hypothetical signal values that would lead to the observed signature; alternatively, this signature can be used for triggering [39].

In trace-based instrumentation, where on-chip memories are expensive and limited in size, both lossy and lossless forms of compression have been explored to maximize the amount of information that can be recorded. A key challenge when compressing trace data is that signal probabilities are not known a priori, meaning real-time adaptive techniques are required. This can cause any on-chip compression core to have a prohibitively high area cost, possibly negating any compression gain. One low-overhead compression scheme is the run-length encoding technique, commonly used for compressing quantized image data, which can give good compression gains for long repeated sequences of data. Rather than storing each unit (e.g. bit, or bus vector) of signal data individually, long sequential repetitions of the same data can be stored as "data repeated X times".
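A software sketch of the run-length idea just described is shown below; an on-chip implementation would of course be built from counters and comparators rather than Python, and the sample values are placeholders.

```python
def rle_compress(samples):
    """Store each run of identical samples as a (value, repeat_count) pair."""
    compressed = []
    for value in samples:
        if compressed and compressed[-1][0] == value:
            compressed[-1] = (value, compressed[-1][1] + 1)
        else:
            compressed.append((value, 1))
    return compressed

def rle_decompress(compressed):
    return [value for value, count in compressed for _ in range(count)]

# Each integer stands in for the bit/bus vector sampled in one trace cycle.
trace = [0x00, 0x00, 0x00, 0x3F, 0x3F, 0x40, 0x40, 0x40, 0x40]
packed = rle_compress(trace)             # [(0, 3), (63, 2), (64, 4)]
assert rle_decompress(packed) == trace

# Event-based capture, by contrast, would store a (time-stamp, new_value)
# pair only when the sampled data changes.
```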
Run-length compression was successfully applied in the trace infrastructure of the Cell processor [120]. Similarly, sparse signals would also benefit from event-based capture [2], which involves recording signal data along with a time-stamp only when a transition occurs: "at time-stamp Y, data changed to Z".

A lossy compression technique presented in [11] operates by iteratively collecting multiple, lossy signatures of signal data surrounding an error, in such a way that allows the designer to collect progressively finer traces to "zoom in" to the root-cause of their bug. A key assumption is that all faults are deterministic and reproducible, which may not be the case in some designs, such as those with asynchronous interfaces. A similar, multi-pass approach is proposed by Yang and Touba [160], where selective capture is used to sample suspect clock cycles only. The need for reproducibility is lifted for lossless compression [10], where the authors propose a new adaptive, dictionary-based compression technique and show up to 52% compression gains on realistic data, even after compensating for the area overhead of the encoder. The authors also note that a negative compression gain is possible for uncorrelated data, and suggest that compression be disabled in those cases.

Lossless trace-based data compression is supported by Tektronix Certus, where potential compression ratios of more than 1000X are claimed [137].

Post-Processing

A complementary approach is to apply offline post-processing techniques to the traced data.

In [60], the authors describe a data expansion technique which can be used to compute missing signal values, given sufficient traced data and knowledge of the circuit. This ability requires only that a subset of all signals be traced in order to achieve full visibility. A separate technique to select the minimal set of signals required for full observability is also described, where it is claimed that only 4%-15% of all signals inside three unnamed circuits were required for complete visibility; it is unclear whether this includes trivial combinational nodes. Specific details on either of these techniques are extremely sparse, as they are currently deployed in a commercial product offered by SpringSoft as Siloti [127].

A method similar to data expansion, proposed by Ko and Nicolici [82], is state restoration, which can be used to increase the visibility of flip-flops in a design. The key idea is that by exploiting don't-care inputs to logic gates (for example, the output of any AND gate is unambiguously logic-0 if any of its inputs is logic-0) and by propagating these inferred values forwards and backwards over the gate-level netlist, values on flip-flops not originally instrumented can be computed. However, if the combinational logic depth between flip-flops is high, it may not be possible to restore any additional signal data.

BackSpace [39] takes a different approach by employing formal analysis techniques with limited on-chip visibility to generate a back-trace of states that lead up to the erroneous state. The proposed approach captures the circuit state using a lossy signature, which is processed using formal techniques to generate a set of possible predecessor states. States from this candidate set are then validated by setting them as "breakpoints" in a device augmented with detection circuitry. Non-determinism within the circuit (such as with multiple clock domains) can be tolerated as long as the error is reproducible at least some of the time.
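To make the forward/backward implication idea behind state restoration more concrete, the sketch below restores values over a toy netlist consisting only of 2-input AND gates; real restoration engines handle arbitrary gate types, flip-flop boundaries and conflicting implications, none of which are modelled here.

```python
# Each gate: output_name -> (input_a, input_b); all gates are 2-input ANDs.
netlist = {"y": ("a", "b"), "z": ("y", "c")}

def restore(known):
    """Propagate known 0/1 values until no new value can be inferred."""
    values = dict(known)              # signal -> 0/1 for traced or inferred signals
    changed = True
    while changed:
        changed = False
        for out, (a, b) in netlist.items():
            va, vb, vo = values.get(a), values.get(b), values.get(out)
            # Forward implications: any 0 input forces a 0 output; two 1s force a 1.
            if 0 in (va, vb) and vo != 0:
                values[out], changed = 0, True
            elif va == 1 and vb == 1 and vo != 1:
                values[out], changed = 1, True
            # Backward implications: a 1 output forces both inputs to 1, etc.
            if vo == 1:
                for sig in (a, b):
                    if values.get(sig) != 1:
                        values[sig], changed = 1, True
            elif vo == 0 and va == 1 and values.get(b) != 0:
                values[b], changed = 0, True
            elif vo == 0 and vb == 1 and values.get(a) != 0:
                values[a], changed = 0, True
    return values

# Tracing only z=1 is enough to restore y, a, b and c in this toy netlist.
print(restore({"z": 1}))   # {'z': 1, 'y': 1, 'c': 1, 'a': 1, 'b': 1}
```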
Evolving from BackSpace was TAB-BackSpace [38], in which an automated framework for iteratively extending the length of each back-trace beyond the depth of a trace-buffer is described. nuTAB-BackSpace [109] improves this further by handling non-determinism.

2.3 Previous Work Specific to Chapters 3-6

This section describes previous work relating specifically to each of the contributions of this thesis.

2.3.1 Automated Signal Selection and Evaluation Metrics

Due to the high cost of on-chip memory, it is necessary for the designer to select only the most interesting and relevant signals to be stored for observation, which can be difficult and tedious. This challenge is further compounded by the fact that this signal selection must be made before the chip is implemented into ASIC or FPGA, at which time the nature of any potential bugs is not known.

Although it may be possible to select some "important" signals by hand, a fully manual method is not desirable for a number of reasons. First, with the prevalence of modular system-on-chip architectures which bring together IP blocks designed by different engineering teams, there is likely no single expert to determine how a general chip-wide debug solution should be constructed. An automated solution would be able to quickly provide a list of suggested signals for a designer to refine. Second, an automated method for signal selection may allow for more effective debug solutions even in the presence of an expert designer. For example, it may not be necessary to observe all state bits of a particular state machine because some states may never be reachable. However, even an expert designer would often find it difficult to recognize these relationships. A third reason an automated method would be desirable is that it would rapidly decrease the turnaround between debugging iterations, which is important since designers rarely want to spend time planning debug.

Automated signal selection techniques are especially relevant to debugging circuits implemented on a reconfigurable FPGA platform, given that the turnaround time for a less precise signal selection is measured in hours for recompilation, as opposed to months for a fabrication re-spin. However, no formal metric exists to independently evaluate the effectiveness of a post-silicon debug solution.

Evaluation Metrics

Ko and Nicolici [82] measure the quality of signal selections in trace-based instruments through a specialized metric which describes the ratio of total flip-flop values visible after applying their state restoration techniques, to the original amount of trace data without restoration. The authors argue that the higher this restoration ratio, the better the signal selection. Their technique, however, does not consider the "usefulness" of this reconstructed signal data. If an observed signal is highly correlated to an unobserved signal, restoring the latter may not provide any additional value. As an example, duplicated state elements or those in a shift register may be easily restorable, but this does not automatically mean that such signals can help the designer better understand the behaviour of a failing chip.

Although impractical to evaluate on a post-silicon debug solution, the concept of functional coverage has been previously explored for design validation, where it is used as a metric to describe how thoroughly a design has been exercised for faults [73, 133].
Code coverage describes how much of the HDL code has been exercised during testing [147], and metrics include line coverage (the number of times each line of code has been evaluated), branch coverage (which measures the frequency that each branch in a conditional statement is exercised), and path coverage (which counts the number of times a path in the control flow graph is traversed [40]). Finite state machine (FSM) based metrics capture the number of states, state transitions, or paths through the FSM that have been visited or taken during validation [104]. However, modelling the entire design as one large FSM is not feasible, and so research has focused on finding smaller, abstract FSMs containing only the most interesting state variables [18, 58, 126]. Many of these coverage techniques can only be measured during software simulation; Balston et al. [15] show a solution for extracting code coverage from real hardware.

Automated Signal Selection Techniques

Based on their state restoration metric, Ko and Nicolici [82] developed an algorithm to select trace signals that maximizes the restoration ratio, using a set of equations to estimate the amount of signal data that can be inferred through each gate in the circuit netlist. The accuracy of this signal selection metric is later improved upon by Liu and Xu [93], where a more precise definition of restorability is presented and shown to give higher restoration ratios. A different method of trace signal selection presented by Yang and Touba [160] considers the sensitivity of each flip-flop to injected errors, as measured using simulation, and uses this metric to select maximally independent signals. Integer linear programming methods are then used to calculate the optimal set of independent signals to observe in order to maximize error coverage. The authors show that the average latency between an error being injected and subsequently detected can be reduced by over 300X in the best case. Similarly, Gao et al. [49] present an algorithm to select signals with the maximum error coverage in their fan-in cones in order to minimize the suspect window in which the error must have occurred; however, it is not clear that functional design errors can be modelled as single-bit upsets. Feedback between sequential elements is considered by Chuang et al. [33], who use S-graphs to model the HDL design; doing so transforms the problem into one of finding the minimum-cost feedback vertex set [92], which has successfully been used to calculate flip-flop selections for partial-scan designs [86].

More recently, similar techniques [111, 161] have made use of formal SAT-based methods to select signals which are most effective at restoring unobserved signals using formal logic implication. Yang et al. [161] formulate the circuit as a SAT problem, and benchmark their work against the number of potential root-causes that a signal selection can eliminate. The set of potential root-causes that would explain the erroneous behaviour is measured as the number of UNSAT cores remaining, where each UNSAT core corresponds to an unsatisfiable clause in the SAT formula. This work reports that the "suspect list" can be reduced, on average, by 30%.

In all of the related work above, the scalability of the presented techniques appears to be a secondary concern, with many authors choosing only to apply their methods to trivially small circuits or to omit their runtimes.
Ko and Nicolici [82] reported signal selection runtimes of up to 40 hours for selecting 32 signals from the s35932 circuit with 1728 flip-flops, whilst reference [93] improved on this to 3 hours.

Rather than attempt to perform signal selection on a circuit's gate-level netlist, reference [150] takes a different tack and operates directly at the HDL level. This provides several advantages: by operating at a higher level of abstraction, significant complexity savings can be made, and any signals selected will be consistent with the level of abstraction that the designer operated at; through CAD optimizations and technology-mapping, a gate-level netlist may contain signals that do not correspond to any signal in the original circuit description. More specifically, the parse-tree representation of the circuit is analyzed to gain an understanding of each signal's dependencies, from which a ranking of the most influential signals is produced. This algorithm was reported to produce signal rankings from circuits of similar size to those in previous work in less than five seconds, and has been implemented as part of Tektronix Certus.

2.3.2 Reclaiming Spare FPGA Resources

Due to the prefabricated and general-purpose nature of FPGAs, circuits implemented on these platforms will typically not use all available logic and routing resources. A number of researchers have investigated methods for reclaiming some of these spare structures to enhance circuit efficiency.

Moctar et al. [102] reuse the local routing multiplexers present inside each FPGA logic cluster to implement the programmable shift operation in floating-point computation, freeing up soft-logic resources to be used elsewhere. Oldridge and Wilton [106] propose that the configuration memory which holds the connectivity pattern of routing switch-blocks be made user-accessible in order to realize wide, shallow memories in 20% less area and with 40% less delay. Wilton [151] also proposed that unused RAM blocks be reclaimed as soft-logic resources, by employing those blocks as multiple-input, multiple-output LUTs.

An even more creative use of the programmable FPGA fabric has been to explore how its global interconnect can be used to implement unconventional antennas [134]. In particular, the authors specifically note that the abundance of underutilized routing resources enables them to create "hidden transmitters" that are electrically isolated from the user circuit.

2.3.3 Incremental Compilation for FPGAs

The key idea behind incremental synthesis is to allow the functionality of a fully place-and-routed circuit implementation to be modified whilst preserving as much of the original solution as possible. Within the context of FPGAs, the aim is to move the fewest number of placed blocks, and rip-up and re-route the fewest number of existing nets, to achieve this result. Owing to the general-purpose nature of FPGAs, this is a much more feasible task than in custom ASICs, as FPGA CAD tools have the flexibility to use any of the prefabricated logic or routing resources that were not employed in the original circuit. The motivations behind this technique are many: to minimize re-compilation effort during the design phase, to preserve timing closure when undertaking Engineering Change Orders (ECOs), or for improved fault and defect resilience [42].

Figure 2.8: Multiplexer network for trace-based instruments.
However, the penalty for incremental synthesis is often a small loss in circuit performance.

Incremental-compilation techniques are not new [29, 35, 44], and are available in many FPGA and ASIC CAD tools. They have also been employed in design prototyping and debug [140]. Altera Quartus II and Xilinx ISE tools both support incremental (or hierarchical) design reuse. Once the circuit, or its subset, has been designated as a preserved partition, the CAD procedure will only re-implement this partition if its source description is changed, making it ideal for localized modifications. However, changing any of the Xilinx ChipScope trace connections into a preserved partition does require it to be re-implemented [157].

Graham et al. [53] propose that unconnected trace-buffers be pre-inserted into a Xilinx FPGA at compile-time. Prior to testing, low-level bitstream modifications [77] are then applied to make incremental changes to the FPGA routing so that the desired signals can be attached to those trace-buffers. However, a drawback of this technique is that it requires FPGA resources to be reserved ahead of time, which prevents the original circuit from using them. Poulos et al. [110] describe a similar method of using bitstream modification techniques to modify LUT masks in order to realize more area-efficient multiplexers for scan- and trace-based debugging.

2.3.4 Multiplexer Networks in Post-Silicon Debug

To provide designers with the flexibility to look at a wider selection of signals in trace-based instruments, researchers have investigated the idea of embedding multiplexer networks inside circuits. Typically, the availability of on-chip memory is the most constraining factor when determining the number of signals that can be traced. By building a multiplexer network, this limited memory capacity can be shared between multiple signals, the selection of which can be changed at runtime. This allows designers to preselect a larger set of interesting signals before implementation so that during the debugging process, only the most relevant subset of signals will be selected for observation. This concept is illustrated in Figure 2.8.

Networks for ASICs

Inserting multiplexer networks to aid debug is especially relevant in ASICs, where routing connections can be efficiently made using custom metal connections, as opposed to being restricted to prefabricated resources. Quinton and Wilton [113, 115] propose that a single programmable logic core (PLC) along with a supporting routing network be embedded into ASICs. This FPGA-like device can then be used to provide post-manufacturing flexibility for debugging and repair purposes, with the authors showing that this can be achieved for an area overhead of less than 10%, on par with the typical cost of scan-insertion. A similar strategy is described by Abramovici et al. [2], where a distributed reconfigurable fabric is proposed instead to minimize the routing congestion caused by using a single, centralized PLC. These proposals differ from other work which integrates programmable logic into an ASIC design, as the proposed PLC will be used exclusively for debugging, and not to add functionality; hence, the PLC will not impact the production yield, as no guarantee of correctness is required.

By using a programmable core, the need for input/output ordering is lifted: signals can be connected to any of its I/O pins as all are equivalent; the PLC can then be configured to adapt to any arbitrary ordering. Thus, a simple crossbar,
which allows any network input to connect to any network output, provides more flexibility than is required. Reference [116] describes an unordered (concentrator) network which allows any combination of the network inputs to be forwarded to the network outputs, but in an unspecified order. A more efficient network was proposed by Liu and Xu [94], whilst Prabhakar et al. [112] describe a network for re-ordering trace data to make it more amenable for lossless compression.

Multiplexer Networks in FPGAs

In the FPGA domain, prefabricated logic resources are often plentiful, but connecting these resources together can be challenging [29]. Altera SignalProbe [8] allows designers to reserve spare I/O pins on their FPGA for exposing internal signals outside of their device. Each I/O pin can be set up (using soft-logic) to multiplex up to 256 predetermined circuit signals for external analysis, the selection for which can be determined during testing and changed through the JTAG interface. This predetermined set can be modified using incremental ECO techniques. Poulos et al. [110] propose that multiplexers be used in a similar fashion during debug (and also propose a parallel-in serial-out structure for monitoring multiple signals), but employ bitstream modification to change the selected set more quickly. However, with many FPGA designs already I/O limited [103], it may be difficult to reserve any pads for debug.

A unique feature exclusive to Microsemi antifuse FPGA devices is the ability to observe, in real time, any four nets of a circuit, which are dynamically chosen at run-time. This is enabled by patented, dedicated architectural support known as Action Probe circuitry, which can be controlled using the Silicon Explorer II software [100].

2.4 Summary

This chapter first reviewed the typical applications that FPGAs are used for, and then elaborated on the hardware that makes up the FPGA architecture and the CAD techniques that are used to map circuits to these structures. One common FPGA application is ASIC prototyping, which allows designers to test and debug their circuits using a platform that can run many orders of magnitude faster than using logic simulation software. The primary challenge when debugging both ASIC and FPGA devices is the lack of on-chip observability. When a circuit does not operate as expected, it is important to understand the behaviour of internal signals in order to find the root-cause of this error; this chapter also reviewed two broad classes of solutions to enhance visibility through scan-based and trace-based instruments.

This thesis focuses on trace-based instrumentation for FPGAs, which can provide a view of signal behaviour over time, but only on a limited subset of signals. In addition, the selection of which signals to observe must be made before the circuit is mapped to the FPGA, and changing this selection will require the FPGA to be recompiled, which may take several hours.
The last part of this chapter reviewed work related to overcoming these limitations, which specifically relate to the contributions of this thesis. Techniques for improving the quality of the signals selected for observation, allowing those signals to be changed more quickly, reclaiming spare FPGA resources, and inserting multiplexer networks to provide even greater post-implementation flexibility were covered.

Chapter 3
Post-Silicon Debug Metric and Automated Signal Selection for Trace-Instruments

This chapter describes a metric for measuring the effectiveness of trace-based, post-silicon debug solutions, and presents three techniques for formulating new signal selections for these instruments. This metric is relevant for trace solutions on both FPGA and ASIC devices. The devised metric operates on the principle that during post-silicon debug, the ultimate intention of the designer is to understand the circuit's state space; thus, the effectiveness of a debug instrument can be measured by how well data traced at each clock cycle can be used to understand the present state of the circuit. Formal image techniques are then applied to understand how this same piece of trace data can be used to gain even more understanding of the circuit's possible past and future state behaviour.

Three automated signal selection techniques are presented in the second half of this chapter: a method which optimizes directly for the expected value of this metric, an algorithm that computes a signal selection based on the centrality of the graph-representation of the circuit netlist, and a hybrid technique which combines both of these methods by considering the circuit hierarchically. Through applying the debug metric described previously, results show that whilst the first method provides the highest amount of observability, it is the least scalable. On the other hand, the graph-based method was found to be the most scalable (tackling all the benchmark circuits that we possessed) but provided the least amount of observability. The hybrid method offered a compromise between these two limits.

The goal of these techniques is to provide effective information to a human designer and help them to better understand any faulty behaviour inside their circuit. This information can then be processed by the designer in an intelligent manner to find the true root-cause of the bug. In contrast to other work [78, 161], the intention is not to bypass this human designer and pinpoint root-causes autonomously by attempting to explain faulty behaviour directly using a list of suspected errors that a designer may have committed. Similarly, testability metrics, such as those based on a stuck-at model which measure how well a particular set of test vectors will guarantee to expose all possible faults, are also not suitable for this goal.

This chapter is arranged with the framework of this contribution in Section 3.1, and the definition of the post-silicon debug metric in Section 3.2. The three signal selection techniques are described in Section 3.3. Experimental results that evaluate the scalability and observability of these three techniques, using simulation and a physical System-on-Chip prototype, can be found in Section 3.4. A summary of this chapter follows in Section 3.6.
The post-silicon debug metric, and the flat selection algorithm that optimizes for it, were first published in paper [64]; this was later extended with two other selection algorithms, along with their detailed comparison, in reference [67].

3.1 Framework

In this work, we are concerned only with debugging functional errors using trace-based techniques. We assume an integrated circuit that has been instrumented with one or more trace-buffers, as illustrated previously in Figure 1.2. The trace-buffers are used to continuously record a history of selected signals over time, possibly using on-chip compression schemes [10, 11]. We assume that the chip is also instrumented with programmable trigger circuitry (that is managed by the user) to halt the chip when a specific condition occurs, so that a trace of states surrounding the error can be extracted for analysis. In this thesis, we focus on circuits with a single clock; however, the methods described can be extended to multiple-clock systems due to the fact that trace-buffers can only be used to instrument signals within their own clock-domain [117].

We differentiate between the debug engineer's tasks at design time and debug time. At design time, the engineer must instantiate the trace-buffers and associated circuitry, and then select a set of signals to be recorded. In a simple instantiation, the number of signals is limited by the width of the trace-buffer; more advanced implementations will involve a concentrator that maps a larger number of selected signals to the trace-buffer [115]. In either case, however, the set of signals that are observable must be determined at design time.

Figure 3.1: State-space partitioning induced by signal observations.

At debug time, the engineer runs the fabricated chip at speed, often in-situ. When the trigger condition is met, the chip halts, and the engineer can read the values stored in the trace-buffer. These values can be analyzed by the designer to help understand the behaviour of the chip.

It is important to emphasize that the signal selection is done at design time, when the nature of the bug or chip operation is not known. During debug, the set of signals observed cannot be modified. This is what makes signal selection a difficult problem; it is necessary to predict, during the design stage, which signals will provide valuable information during debug.

3.2 Post-Silicon Debug Difficulty

During post-silicon debug, the task of a designer is to determine what state a circuit is in. The more observability a designer has into their system, the easier it will be to deduce the exact cause of the error. In this section, we will define a new method to measure the difficulty of debugging a fabricated device which has already been instrumented with debug logic. Specifically, we measure how precisely the state-space of a circuit can be inferred from an actual signal trace extracted from the device. We define this to be the post-silicon debug difficulty.

By considering the entire circuit as a finite state machine in which all of its internal flip-flops are its state bits, it is possible to compute the set of reachable states R that the circuit can take under all possible input combinations, over all time. Selecting signals for observation has the effect of dividing this reachable state space into a number of partitions, as illustrated by Figure 3.1. The more signals that are selected, the more, and potentially smaller, partitions that will form.
Each observed signal partitions the set of potential states into two disjoint subsets; depending on the observed value (e.g. a = 0), one of these sets will represent the possible states of the system, and the other represents the set of states that it cannot be in, at the time of observation. Intuitively, not all signals will partition this reachable space in the same way or into the same sizes, nor will observing multiple signals guarantee to produce 2^n partitions.

Figure 3.2: Knowledge of circuit state over time.

As an illustration, consider a circuit containing the signal vector <s1,s2,s3> to have a reachable state set of {000, 001, 010, 011, 111}. Electing to trace signal s1 will divide the reachable state space into two partitions of sizes {4,1} states, whilst tracing s2 will provide a more balanced distribution of {2,3}, with higher entropy. Using the values observed on these signals, the possible state of the circuit can be collapsed into just one of these partitions.

We are not only interested in the current state of a system, but also its state in past and future clock cycles. As illustrated in Fig. 3.2, knowing the values of certain signals in cycle t will not only prune the state space during the current cycle, but also in some cycles before and after t. Implicitly, these formal image techniques will also capture any signal correlation that exists, which has been the objective of previous state restoration work [82]. Thus, a good signal selection will reduce the size of the potential state space across many cycles.

3.2.1 Definition

Once the set of states reachable during normal circuit operation is known (termed R, where methods to compute an approximation of this are described in the following subsection), observing a set of flip-flop output signals V will partition the reachable set R into at most 2^|V| disjoint regions, one for each binary value that V can take. We will represent the set of partitions induced by V in the current clock cycle as R^V, where each element R^V_i of R^V is the set of states in the disjoint region i. This notation is illustrated in Figure 3.3, where two signals V = {a,b} are observed, which partition the circuit into four regions i = 0...3.

Figure 3.3: State-space partition notation.

We can consider the set of reachable states in future cycles by computing the image of R^V_i (or its pre-image, for past cycles). This operation is represented using the function I(R^V_i, t), relating to its image in t clock cycles, where I(R^V_i, 0) = R^V_i denotes the partition formed by a particular observation of signal values. Using this function, we can compute the 'volume' of states leading from an observed region i, by finding the product of the sizes of its images. We define this quantity to be the post-silicon debug difficulty at one cycle, D^V_i:

D^V_i = \prod_{t=-\infty}^{\infty} w(t) \left| I(R^V_i, t) \right|    (3.1)

where w(t) represents a windowing function which allows the designer both to weight the importance of states from each image (for example, images closer to t = 0 may be more important than those images further away) and to limit the product to a finite value. For this latter property, we require that \lim_{t \to \pm\infty} w(t) = 0, and that the product computation stops when w(t) |I(R^V_i, t)| < 1.

In this chapter, we use a rectangular window function:

w(t) = \begin{cases} 1 & \text{when } |t| < c \\ 0 & \text{otherwise} \end{cases}

but other functions with the same property are also possible, for example w(t) = e^{-c|t|}.
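To make these definitions concrete, the small example above can be reproduced directly. The following sketch is purely illustrative: it enumerates the reachable states and the transition relation explicitly (whereas the implementation used in this thesis computes images symbolically, as described in the next subsection), and the dictionaries it operates on are assumed inputs rather than part of any real tool.

```python
# A minimal sketch, assuming reachable states and an explicit transition
# relation are available as plain Python sets/dicts.
reachable = {"000", "001", "010", "011", "111"}   # bits are (s1, s2, s3)

def partition(states, traced_bits):
    """Group reachable states by the values seen on the traced bit positions."""
    regions = {}
    for s in states:
        key = "".join(s[b] for b in traced_bits)
        regions.setdefault(key, set()).add(s)
    return regions

# Tracing s1 gives partitions of sizes {4,1}; tracing s2 gives {2,3}.
print(sorted(len(r) for r in partition(reachable, [0]).values()))  # [1, 4]
print(sorted(len(r) for r in partition(reachable, [1]).values()))  # [2, 3]

def image(states, successors):
    """One-cycle image of a set of states under {state: set(next states)}."""
    return set().union(*(successors[s] for s in states)) if states else set()

def debug_difficulty(region, successors, predecessors, c):
    """D_i^V with a rectangular window of half-width c (cf. Equation 3.1)."""
    d = len(region)
    fwd = bwd = set(region)
    for _ in range(1, c):               # w(t) = 1 for |t| < c, 0 otherwise
        fwd = image(fwd, successors)     # possible states t cycles ahead
        bwd = image(bwd, predecessors)   # possible states t cycles behind
        if fwd:
            d *= len(fwd)
        if bwd:
            d *= len(bwd)
    return d
```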
For reasons of computational efficiency, which are explained in the following subsection, we have chosen to take the product of images as opposed to their sum. By using the product of its images, D^V_i corresponds physically to an over-approximation of the number of 'paths' that can be taken through the circuit which pass through the observed states in R^V_i. This represents an over-approximation because not every individual state in R^V_i will be able to reach all states in their collective image, I(R^V_i, t).

3.2.2 Computing an Over-Approximation

Computing the debug difficulty first requires knowledge of the reachable state space of the circuit, R. In practice, finding this reachable state set, or even computing its image, using exact techniques is infeasible for anything but small circuits, as its complexity can grow exponentially with the number of state elements present.

To overcome this limitation, we use an approximation method known as state space decomposition [31], which heuristically partitions the state machine of the entire circuit into a set of smaller disjoint (but interacting) 'sub-machines', on which exact formal techniques can be feasibly applied. By computing the reachable state set of these sub-machines individually and finding their product, an over-approximation of the reachable states for the original circuit can be achieved. An over-approximation is sufficient for our work as we only require an estimate of how evenly the state space is partitioned. However, because we cannot compute the exact reachable state set for any of the non-trivial benchmark circuits used in this work, we have no way of measuring this approximation error.

Computing the over-approximate debug difficulty of a decomposed circuit requires an algorithm to consider the product of images across all sub-machines. Let us consider that the signal selection V is also partitioned into disjoint subsets corresponding to the signals within each decomposed sub-machine, such that V = A \cup B \cup \dots and each region can be formed by R^V_i = R^A_i \times R^B_i \times \dots. Hence:

D^V_i = \prod_{t=-\infty}^{\infty} w(t) \left| I(R^A_i \times R^B_i \times \dots, t) \right| = \prod_{t=-\infty}^{\infty} w(t) \left\{ \left| I(R^A_i, t) \right| \cdot \left| I(R^B_i, t) \right| \cdot \dots \right\}    (3.2)

where R^A_i represents region i of signal set A. Calculating D^V_i in this manner would not be scalable, as it would require an algorithm to consider the image of the circuit across all sub-machines at every time step t.

However, because we have chosen to compute the volume formed by the product of its images, and by applying the windowing function, this equation can be transformed into:

D^V_i = \prod_{t=-c}^{c} \left| I(R^A_i, t) \right| \times \prod_{t=-c}^{c} \left| I(R^B_i, t) \right| \times \dots = D^A_i \times D^B_i \times \dots    (3.3)

which allows the debug difficulty of each sub-machine to be computed independently, and then multiplied together to find the final value for D^V_i.

In this work, we have implemented our methods within the VIS toolbox [118], which provides support for the state-space decomposition procedure required to analyze large circuits. This software package utilizes Binary Decision Diagram (BDD) based techniques for reachability computation.

3.3 Signal Selection

In this section, we present three new signal selection approaches - a flat technique, a graph-based technique, and a hybrid technique combining both methods - and describe how they can be feasibly computed.

3.3.1 Observability-based: Expected Difficulty

The post-silicon metric defined in the previous subsection can only be computed with a real trace extracted from a physical device, which is used to collapse the known state of the circuit from all regions in R^V into one specific region R^V_i.
During pre-silicon signal selection, before implementation, we would like our CAD tools to select a set of signals V that will subsequently make this procedure easier. However, the difficulty of a debug task depends not only on which signals are selected, but also on the actual values of the observed signals, and these values are not known before the chip has been tested. Thus, a direct application of the post-silicon debug difficulty is not suitable for use during signal selection.

Nevertheless, it is possible to compute the expected value of this metric. During post-silicon debug, after values are traced from the signals in V, we can determine which region of the state space we are in. If we are in region j, then the fewer elements in D^V_j the better. Since at pre-silicon design time we do not know the values on the V selected signals, the best we can do is minimize the expected size of elements in D^V across all 2^|V| regions.

We use a greedy algorithm to iteratively construct a signal selection V based on this expected debug difficulty, which we denote as E^V. In the first iteration, we assign each signal u of the circuit to V in turn, compute E^V, and select the signal which minimizes this difficulty. In subsequent iterations, we step through all remaining signals in the circuit, compute E^V for V \cup u, and again select the set which returns the minimum value, repeating until the desired number of signals have been selected.

Let us now consider how to compute E^V. If the probability of being in each region of R^V were the same, then E^V could be computed by averaging the size of all elements in D^V. However, in general, it is possible that the probability of being in each region is not the same. Thus, we can write:

E^V = \sum_{i=0}^{2^{|V|}-1} D^V_i P^V_i    (3.4)

where P^V_i is the probability of observing region i.

The values of P^V_i can be computed using simulation, although acquiring a representative set of values may be impractical (as is also the case for vector-less power estimation [8]). Without this information, if we were to make the approximation that all reachable states in the system are equally likely to occur, then the probability of the circuit being present in a particular region will be proportional to the number of reachable states that it contains. In other words:

P^V_i = \frac{|R^V_i|}{N}    (3.5)

where N is the sum of the sizes of all regions in R^V (i.e. the size of its reachable state space), which then leads to:

E^V = \frac{1}{N} \sum_{i=0}^{2^{|V|}-1} D^V_i |R^V_i|    (3.6)

Minimizing E^V will minimize the expected number of states which the system could potentially be in, both in the current cycle, and in past and future cycles. Reducing the set of possible circuit trajectories in this manner will help a debug engineer better isolate any faulty circuit behaviour.

Figure 3.4: Flip-flop connectivity graph for flattened oc_mem_ctrl (maximum of 5 input/output edges shown, node size indicates its eigenvector centrality).

The expected difficulty can be feasibly computed by applying the same expansion as in Eqn. 3.3:

E^V = \frac{1}{N} \sum_{i=0}^{2^{|V|}-1} D^V_i |R^V_i| = \frac{1}{N_A} \sum_{i=0}^{2^{|A|}-1} D^A_i |R^A_i| \times \frac{1}{N_B} \sum_{i=0}^{2^{|B|}-1} D^B_i |R^B_i| \times \dots = E^A \times E^B \times \dots    (3.7)

where N_A represents the sum of all reachable states in the sub-machine corresponding to the signals in A (so that N = N_A \times N_B \times \dots). This allows the expected difficulty to be calculated for each sub-machine separately, and multiplied together for the approximate result.
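A minimal sketch of this greedy procedure is given below. It is illustrative only: the helper expected_difficulty(machine, signals), which would return the expected difficulty of a single sub-machine, is an assumed function standing in for the symbolic (BDD-based) computation used in this work, and the sketch exploits the factorization of Eqn. 3.7 by re-evaluating only the sub-machine that each candidate signal belongs to.

```python
def greedy_selection(submachines, signal_to_machine, candidates, budget,
                     expected_difficulty):
    """Greedily build a selection V that minimizes the expected difficulty E^V.

    expected_difficulty(machine, signals) is an assumed helper returning the
    expected difficulty of one sub-machine when `signals` within it are traced.
    """
    selected = {m: frozenset() for m in submachines}
    # Per-sub-machine contributions; E^V is their product (Eqn. 3.7).
    contrib = {m: expected_difficulty(m, frozenset()) for m in submachines}

    candidates = set(candidates)
    for _ in range(budget):
        best = None                         # (ratio, signal, machine, new E)
        for u in candidates:
            m = signal_to_machine[u]
            trial = expected_difficulty(m, selected[m] | {u})
            ratio = trial / contrib[m]      # multiplicative change to E^V
            if best is None or ratio < best[0]:
                best = (ratio, u, m, trial)
        _, u, m, trial = best
        selected[m] = selected[m] | {u}
        contrib[m] = trial
        candidates.remove(u)

    return frozenset().union(*selected.values())
```

In practice the expected-difficulty values are enormous, so an implementation would compare them in the logarithmic domain rather than dividing raw values as shown here.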
This factorization reduces the complexity of computing E^V from having to exhaustively consider each possible D^V_i, at 2^|V| computations, to considering each sub-machine separately, for 2^m x |V|/m operations, where m represents the maximum sub-machine size used in state-space decomposition.

3.3.2 Connectivity-based: Graph Centrality

An orthogonal approach to signal selection is to consider the logical connectivity of all signals in the circuit - flip-flops that fan out to a large part of the circuit may be considerably more important than those that do not. A graph can be generated to encapsulate this connectivity information from the flattened netlist of a circuit. More specifically, consider a graph G(V,E) where the set of vertices V represents each flip-flop in the netlist, and the set of edges E corresponds to the existence of a logical connection (through combinational logic) between each pair of vertices. The connectivity graph for a large memory controller circuit, oc_mem_ctrl, containing almost 2000 flip-flops is shown in Figure 3.4, in which a maximum of five arbitrary incoming and outgoing edges from each node is drawn. The size of each node represents its eigenvector centrality.

The key intuition behind this approach is that designers may be able to infer the most information from a faulty circuit if they were to observe the signals which exert the highest influence on the entire network. Fig. 3.4 shows that such influential nodes (many of which were found to be state machine bits), clustered towards the centre of the graph and surrounded by edges, do exist. The relative importance of a node in a graph can be quantified by measuring its centrality [23]. A signal selection can then be formed by selecting the desired number of signals with the highest centrality scores. The main advantage of this approach is that the centrality of all nodes within the graph can be computed efficiently, even for large circuits.

In this work, we use the eigenvector centrality method, a popular measure of influence, to quantify the relative importance of each node in the graph. Loosely speaking, in directed graphs, the right eigenvector scores a node highly if that node connects to, or influences, other nodes which themselves have high scores. This differs from simply considering the immediate fan-out, or influence, of a parent signal (which would be a case of degree centrality), as it also considers the downstream centrality of its children, which is itself affected by its children too.

Briefly, an eigenvector of a matrix is a vector whose direction remains invariant under the linear transformation defined by that matrix, with its significance being the ability to characterize any key features present. This property has many applications, such as in data-mining (specifically, in principal component analysis, which is commonly used for dimensionality reduction for inference or compression purposes [1]), and in Google's PageRank algorithm for ranking the importance of web pages to return for search queries [108], which also showcases its extreme scalability.

For a circuit with n latches, its connectivity graph can be captured as an adjacency matrix A with dimensions n x n, in which an element (i, j) is 1 if there exists a directed logical connection from latch i to latch j. The eigenvectors of this matrix can be calculated by solving Av = \lambda v. The principal eigenvector of a matrix can be computed efficiently, even for very large, sparse matrices, using the power iteration method, which is the technique applied in this work [57].
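The following sketch shows what this step might look like over a sparse successor list; it is a minimal illustration (the thesis relies on an existing solver for this computation [57]), and the successors dictionary is an assumed input extracted from the flattened netlist.

```python
def eigenvector_centrality(successors, num_nodes, iterations=100, tol=1e-12):
    """Principal right eigenvector of the flip-flop adjacency matrix by power
    iteration: a node scores highly if the flip-flops it drives (through
    combinational logic) themselves score highly.

    successors: dict mapping a flip-flop index to the indices it fans out to.
    """
    score = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        nxt = [0.0] * num_nodes
        for i, fanout in successors.items():
            nxt[i] = sum(score[j] for j in fanout)   # (A v)_i
        norm = sum(nxt) or 1.0                       # re-normalize each step
        nxt = [x / norm for x in nxt]
        if max(abs(a - b) for a, b in zip(nxt, score)) < tol:
            return nxt
        score = nxt
    return score

# A selection of `width` signals is then just the highest-scoring flip-flops:
#   ranked = sorted(range(num_nodes), key=lambda i: score[i], reverse=True)
#   selection = ranked[:width]
```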
This efficiency makes it more scalable than other centrality measures, such as closeness and betweenness, which consider the shortest (geodesic) path between/through nodes in determining their scores [23]. Furthermore, the betweenness measure treats traffic as being indivisible - that is, an error can only affect one flip-flop at a time - which would not be an accurate model of erroneous circuit behaviour.

Figure 3.5 compares the distribution of the logarithm of the degree and eigenvector centrality values gained in applying our method to the two largest circuits at our disposal, over a histogram with 100 bins. The X-axis indicates the centrality values (on a logarithmic scale) of signals that fall into each bin, whilst the Y-axis shows the number of signals present. As these results show, the eigenvector method provides a broad spread of centrality measures, which allows designers to differentiate between the debug value of each signal more effectively than by considering just its fanout, as for degree centrality. This is important, as one known limitation of the eigenvector centrality method is that it may not give meaningful results on directed acyclic graphs in which no feedback between flip-flops exists [22]. However, in the circuits that were explored we have not found this to be the case; the presence of cycles is equally important, as it allows the centrality measure to score circuit feedback highly, which may be indicative of pipelined datapath or control structures (i.e. state machines) where errors are likely to accumulate and influence circuit behaviour elsewhere.

Our initial experiments have found that the eigenvector centrality is superior to the closeness and betweenness measures in reducing the debug difficulty, and was also able to return a result 31X and 14X faster, respectively, on the small leon3s_nofpu benchmark (a gap which widens for larger benchmarks), due to its calculation not requiring knowledge of the shortest path between nodes. However, a more thorough investigation of these other centrality measures would form interesting future work.

Figure 3.5: Histogram distribution of log-centrality; highlighted bars indicate the highest ranking 128 signals. (a) opensparct1 centrality: degree (above), eigenvector (below). (b) mcml centrality: degree (above), eigenvector (below).
Figure 3.6: Module hierarchy for oc_mem_ctrl. (a) Original hierarchy. (b) Selectively flattened hierarchy (threshold = 256 latches).

3.3.3 Hybrid: Expected Difficulty with Hierarchy

Commonly, large designs are built in a hierarchical fashion from high-level HDL code, as opposed to monolithically, to allow a division of labour and complexity by breaking down the circuit into smaller components, each of which can be built, maintained, and verified much more efficiently. As an example, many of the largest designs that currently exist are system-on-chips, which integrate various independent modules into a single device, with each contributing different functionality. Each module is itself often built from various smaller sub-modules. For the flat, observability-based signal selection technique described previously, all of this hierarchical information is discarded during the compilation procedure, which limits the size of circuits that can be tackled. By contrast, the hybrid selection algorithm described in this subsection exploits this module-boundary information to pre-divide the circuit into smaller sub-circuits on which the flat technique can be applied.

The hierarchy for oc_mem_ctrl is shown in Figure 3.6a. For all circuits, this hierarchy takes the form of a tree in which the top-level module is the root node, and where each module that it instantiates is represented by a child vertex. Making up the label of each node is text which indicates the 'module_name:instance_name', whilst the numbers that follow display the number of flip-flops contained within.

Because the state-space decomposition procedure that we use to compute the expected difficulty scales unpredictably with circuit complexity (as explored in Section 3.4.1), it would be desirable to divide-and-conquer large circuits into representatively smaller sub-circuits on which a flat technique can be more practically applied. Conveniently, the circuit hierarchy is a natural source of inspiration: the logic contained inside each module is determined by a human designer (who will ultimately be debugging the circuit) as opposed to by a heuristic algorithm, and so these module boundaries form ideal candidates for sub-circuits. However, it may not be desirable to completely break down the circuit in this manner. Consider the case where a circuit is built from a large number of very small modules; computing the importance of signals in each of these modules independently may be an over-simplification, and will capture the overall value of each module to the entire circuit poorly. A better solution would be to selectively flatten the circuit to avoid this scenario.

We use Altera Quartus II software to first compile our benchmark circuits into a flattened netlist, so that we can leverage its commercial-quality synthesis engine, including support for correctly inferring embedded blocks (i.e. RAMs, DSPs, etc.) that are present in many large designs. This compiled netlist is taken from the output of the front-end synthesis stages, after all logic optimizations and technology mapping have occurred, which makes it a true representation of the logic that would be implemented on the target device. We export this netlist into the academic BLIF file format using the techniques described in [6], and transform any embedded blocks (which are described in the netlist as black-boxes) into primary inputs and outputs so that their fan-in and fan-out logic can be preserved.
Conveniently, the exported BLIF file is verbosely annotated with information relating its internal circuit nodes to their HDL signal-equivalents, and from this description we were able to automatically infer the hierarchy of each design; Fig. 3.6a was generated in this way.

Our selective-flattening algorithm follows a depth-first greedy merge policy, where modules without any children are recursively merged, smallest-first, with their parents if the combined number of flip-flops between them does not exceed a specified threshold. Other merge policies are possible. This threshold controls the granularity of the selectively flattened circuit, allowing it to range from a fully flattened netlist, as we had used in the previous subsection (threshold = ∞), to fully hierarchical (threshold = 0). Figure 3.6b shows the same oc_mem_ctrl circuit selectively flattened with this threshold set to the default value used in this chapter: 256 flip-flops.

An additional capability that this technique also exposes is the possibility of exploiting any symmetry that exists in the circuit hierarchy, which could lead to a more efficient algorithm. Consider, for example, if a circuit were to create duplicate instances of the same module (i.e. for each bit in a datapath); then the signal selection for each of these identical modules would also be the same. This scenario can be seen in Fig. 3.6a, where the module mc_obct_top:u2 instantiates a number of identical mc_obct sub-modules. This relationship can be identified by searching the hierarchy tree for multiple occurrences of isomorphic subgraphs, though care must be taken to only recognize modules that are truly identical - in the same example, the mc_rf:u0 module appears to instantiate a number of identical mc_cs_rf sub-modules, but on closer inspection each of these sub-modules was found to be instantiated with subtly different parameter constants. We have left this optimization as a topic for future investigation.

Once the circuit has been compiled into a selectively flattened netlist, the next step is to determine the relative importance of each module to the overall circuit, in order to calculate how many signals to select from each module. This problem is reminiscent of computing the importance of nodes in the flattened circuit, as discussed in the previous subsection, but with one fundamental difference: the hierarchy of the circuit is a rooted tree, a more restrictive graph than was studied previously, for which a centrality measure is less well defined.

Figure 3.7: Hybrid algorithm: module centrality for selectively flattened oc_mem_ctrl. The eigenvector centrality value, and number of flip-flops (in parentheses), are shown in the node label; edge weights indicate connectivity between modules.

To overcome this, we examine the connectivity between modules in the fully flattened netlist. Examining the flattened netlist is important in order to precisely derive any relationships between modules which are lost when analyzing only the circuit hierarchy. Consider, for example, from Fig. 3.6a, that if the module mc_rf:u0 were to connect directly to its sibling mc_timing:u5, it would only be able to do so via its parent: this type of connection would only be visible in the flattened netlist.
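Before giving the precise edge weighting used in this work (described next), the sketch below illustrates the overall idea: lift flip-flop-level connectivity from the flattened netlist up to a module-level graph, score the modules with the same eigenvector-centrality computation sketched earlier, and split the trace budget across modules in proportion to those scores. It is only a sketch - the data structures are assumed inputs, and the simple rounding ignores any leftover budget.

```python
def module_graph(ff_successors, ff_module, module_ff_count):
    """Module-level connectivity derived from flattened flip-flop connectivity.
    Edge (m1, m2) is weighted by the fraction of m2's flip-flops that are
    driven (through combinational logic) by flip-flops in m1."""
    driven = {}
    for ff, fanout in ff_successors.items():
        m1 = ff_module[ff]
        for succ in fanout:
            m2 = ff_module[succ]
            if m1 != m2:
                driven.setdefault((m1, m2), set()).add(succ)
    return {edge: len(ffs) / module_ff_count[edge[1]]
            for edge, ffs in driven.items()}

def allocate_budget(module_scores, total_signals):
    """Split the selection budget in proportion to each module's centrality."""
    total = sum(module_scores.values()) or 1.0
    return {m: int(round(total_signals * s / total))
            for m, s in module_scores.items()}
```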
In this module graph, the vertices now represent modules, whereas each edge is weighted by the amount of connectivity between each pair of vertices. In our work, we weight each directed edge between two modules (m1, m2) according to the fraction of unique flip-flops in m2 that m1 connects to. Figure 3.7 shows the module-centrality graph for selectively flattened oc_mem_ctrl. Here, the eigenvector centrality of each node is shown in its label. Thicker edges indicate a higher edge weighting between modules, due to increased connectivity between them.

Once the centrality of each module is known, the number of signals that each module contributes to the final selection can then be calculated as a proportion of this value, and those signals are then selected using the expected difficulty technique. We generate the netlists for each sub-circuit by isolating the subgraph of logic between its flip-flops from within the original flattened netlist, and again transform any connections that enter or leave this subgraph into primary inputs and outputs so that all of its logic can be preserved. Because these sub-circuits are significantly smaller and less complex than the flattened netlist, the expected difficulty can be computed much more efficiently.

3.4 Results

This section examines the trade-off between the scalability of the three proposed selection algorithms and how they affect the debug difficulty metric defined previously. Although we find that the graph-based selection method gives the smallest improvement in debug difficulty, it is the only algorithm capable of scaling to the large industrial benchmarks that we have applied it to. In the following section, we provide examples of the signal selections computed, compare them to previous work, and discuss the 'gold standard' tests that can be used to evaluate signal selection quality.

3.4.1 Scalability and Runtime

To compare the three signal selection algorithms, we applied them to a set of benchmark circuits taken from the QUIP suite, the GRLIB IP library, the OpenCores repository, and the OpenSPARC and VTR projects [4, 6, 107, 121, 129]. Table 3.1 shows a comparison of the runtime required by each of the three algorithms for selecting 128 signals, whilst Table 3.2 fixes the benchmark circuit to oc_mem_ctrl and instead varies the number of signals selected. All experiments were performed on an Intel Core 2 Quad 2.8 GHz workstation with 8GB RAM.

These results show an inverse relationship between algorithm runtime and quality of results: what was found to give the highest quality of results turned out to be the slowest method and, in the case of specific circuits, was found to be infeasible to compute. Upon further investigation, these failing circuits were found to contain logic structures whose reachability was difficult to compute; in particular, multipliers and dividers [61], which were present in many of the processor-type circuits (oc_aquarius, or1200, as well as in a single core of the LEON3 processor, leon3s). This exposes a limitation with our current implementation of the flat algorithm. However, it may be possible to overcome this by further relaxing the exactness of reachability computation (the original application for state-space decomposition was safety analysis in model-checking, in which a guaranteed over-approximation was strictly necessary)
or by using more recent formal techniques, such as those based on SAT methods. We have left this as a topic for future work.

Table 3.1: Signal selection runtime: 128 signals.
(a) Runtime (seconds): flat algorithm feasible.
  Circuit (Latches)             Flat Diff.   Hybrid Diff.   Graph
  oc_vga_lcd (1128)                    352             69       1
  oc_ethernet (1277)                   617             96       1
  oc_pci (1437)                        720            128       1
  oc_wb_dma (1775)                    1085            120       1
  oc_mem_ctrl (1825)                   963            283       1
  leon3s_nofpu (2604)                 7178           3972       4
  oc_vid_compress_dct (3833)          4466             41       2
  radar12 (3844)                      6180            428       6
  oc_vid_compress_jpeg (4386)         7604            886       3
  radar20 (5749)                     17514            648       2
  stereovision0 (7294)               24091            396       2
(b) Runtime (seconds): flat algorithm infeasible.
  Circuit (Latches)             Flat Diff.   Hybrid Diff.   Graph
  oc_aquarius (1469)                     -              -      11
  or1200 (2055)                          -              -      10
  leon3s (5093)                          -              -      11
  bgm (5362)                             -              -      36
  LU8PEEng (5490)                        -            676       5
  stereovision1 (11253)                  -           2611       2
  uoft_raytracer (13384)                 -              -      43
  stereovision2 (14090)                  -              -       6
  LU32PEEng (18641)                      -          10384      13
  LU64PEEng (36096)                      -          29091      27
  leon3mp (46578)                        -              -      65
  mcml (48093)                           -              -      75
  opensparct1 (52506)                    -              -      82

In all cases, the hybrid method (with the merge threshold fixed at 256 latches) completed approximately 3-100x faster than the flat algorithm. Unsurprisingly, the hybrid algorithm was able to tackle several circuits larger than the flat algorithm, the largest of which was 36,000 latches. However, it was unable to operate on many smaller circuits of 1500 and 14,000 latches, likely due to the same difficult structures that the flat algorithm failed with.

In contrast, the graph-based connectivity method was able to succeed on all circuits that it was applied to, including an 8-core LEON3 System-on-Chip design (leon3mp) and a multi-threaded 64-bit OpenSPARC-T1 processor core, both around 50,000 latches in size. Runtime in both cases did not exceed 90 seconds - representing a substantial speed-up over the other two algorithms. Additionally, we discovered that the majority of this time was spent extracting and parsing the connectivity graph from the netlist, as opposed to computing its centrality.

Table 3.2 compares how the runtime of all three algorithms scales with the selection size. In the flat algorithm, the runtime for all selection sizes is dominated by the initial cost of state space decomposition and reachability computation on the flattened netlist, after which signal selection occurs. Although not exposed by its application on this circuit, this scenario also exists in the connectivity algorithm, where a fixed cost of extracting the circuit connectivity is required before its centrality can be computed. Once this is complete, selecting any number of signals only involves a linear search to find the highest-scoring nodes.

Table 3.2: Selection runtime: oc_mem_ctrl for varying sizes.
  Runtime (seconds)
  Selection Size   Flat Diff.   Hybrid Diff.   Graph
  16                      889            274       1
  32                      913            274       1
  64                      926            276       1
  128                     963            283       1
  256                    1039            299       1
  512                    1219            355       1
  1024                   1489            440       1

3.4.2 Observability

To evaluate the observability of each selection algorithm, we measured their post-silicon debug difficulty metric across a variety of benchmark circuits. For this first set of experiments, each circuit was simulated from an initial state of all-zeros using randomly-generated input vectors, and the values on the 128 selected signals recorded to mimic a post-silicon trace-buffer operating inside a fabricated chip. Using these recorded traces, we computed the average debug difficulty for each selection algorithm; their results are shown in Figure 3.8, normalized to the average of 100 random signal selections.
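As a rough illustration of how this normalization might be computed (the helper log10_difficulty is an assumed stand-in for the debug difficulty computation of Section 3.2, and the geometric averaging is done in the logarithmic domain because the raw 'volumes' far exceed ordinary floating-point range):

```python
def normalized_difficulty(selection_regions, random_selections, log10_difficulty):
    """Geometric-mean debug difficulty of one selection's trace, normalized to
    the geometric mean over a set of random selections' traces.

    selection_regions : region observed at each trace sample for the selection
    random_selections : list of such region sequences for random selections
    log10_difficulty  : assumed helper returning log10(D) for one region
    """
    def geomean_log(regions):
        return sum(log10_difficulty(r) for r in regions) / len(regions)

    baseline = sum(geomean_log(r) for r in random_selections) / len(random_selections)
    return 10.0 ** (geomean_log(selection_regions) - baseline)
```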
Experimentally, we have found that using a window size of ±10 cycles (i.e. c = 10) to compute the debug difficulty, along with a sub-machine size of m = 8, gave a good compromise between accuracy and complexity. This window size represents the number of clock cycles, ahead and behind, that the formal techniques consider when estimating the observability contributed by each snapshot in the trace-buffer, and is independent of any post-processing (such as signal restoration [82]) that can be applied. The number of latches in each circuit is shown in brackets, and does not include any storage elements that have been inferred into embedded RAM resources.

Figure 3.8: Debug difficulty comparison for each algorithm (128 signals by 8192 samples).

Recall that the debug difficulty is defined as the 'volume' leading from a single trace sample; the result shows the geometric-average volume over all 8192 samples in the trace-buffer. A smaller volume, or debug difficulty, is desirable because it means that the state behaviour of the circuit can be more accurately resolved. Because this debug difficulty computation requires a successful state-space decomposition of the flattened circuit, we have been unable to evaluate this metric against the largest benchmark circuits. However, this differs from our ability to compute signal selections for these large circuits, as covered in the previous subsection.

A number of conclusions can be drawn from Fig. 3.8. Firstly, all three signal selections perform better than an average random selection, which shows the fidelity of our debug metric. Secondly, the results match expectations: the best algorithm is indeed the one which targets the expected debug difficulty of the flattened netlist. For smaller circuits, the hybrid method, which exploits hierarchical information to split the circuit into more computationally efficient sub-units, outperforms the graph-based method, but as the circuit size increases the graph method starts becoming more competitive, narrowing the margin between their debug difficulties and, in some cases, even surpassing the hybrid result.

One possible explanation that only becomes evident for larger circuits is that the graph-based technique, which operates on the entire circuit, can better capture its global connectivity and hence select more representative signals. In the case of the hybrid method, however, the merge threshold was held fixed even as the circuit grew in size, meaning that each individual sub-circuit represented an increasingly smaller proportion of the overall circuit. One method to reduce this effect is to increase this merge threshold, which will be shown to improve the quality of results, but this comes at the expense of scalability - these trade-offs are explored in a following subsection. However, what the graph-based method may gain in capturing this global information, it may also lose in not utilizing any formal techniques during signal selection - for example, by not considering how the signal will be able to prune the state-space of the circuit in both past and future clock cycles.
As such, it is not always guaranteed to outperform the hybrid method.

Figure 3.9: Debug difficulty comparison for leon3s_nofpu (128 signals by 2048 samples, traced from physical prototype).

3.4.3 Observability of LEON3 Physical Prototype

In order to validate our techniques on a real, physical prototype, we traced actual signal values from a simplified LEON3 system-on-chip, consisting of a single processor core and supporting peripherals, that was implemented on an Altera DE2 FPGA board. We configured the LEON3 processor to boot the 2.6.36 Linux kernel and immediately run the dhrystone benchmark, which was modified to assert a breakpoint instruction during its execution to serve as the debug trigger. This mimics an application-level fault that only exhibits itself after an operating system has been loaded.

The design was instrumented using the Altera SignalTap II product [8], but due to on-chip memory limitations, we were only able to trace 128 signals for 2048 samples. To illustrate the effectiveness of FPGA prototyping even on this small circuit, we observed that the trigger condition was reached after approximately 10 seconds of real-time operation at 50 MHz, which we estimate would have otherwise required one week of continuous simulation.

The results of the three signal selections, compared with the average of 10 random signal selections, are shown in Figure 3.9. In this particular case, our graph-based signal selection does not reduce the debug difficulty by a large amount when compared with the baseline; however, we have observed that the graph-based technique has selected a number of easily-inferrable flip-flops that are duplicated for each processor pipeline stage (such as the instruction register). This is studied in more detail in Section 3.5. An interesting future direction would be to look at how to account for this signal correlation in a scalable manner.

Figure 3.10: Fidelity of debug difficulty for leon3s_nofpu, when varying sub-machine size.

3.4.4 Fidelity of Debug Difficulty

Because the debug difficulty requires state space approximation techniques [31] to compute, it is important to consider its fidelity to changes in the underlying approximation. Figure 3.10 shows the debug difficulty of the same traces as in the previous subsection when varying the maximum sub-machine size (m) between 8 and 32 flip-flops (m = 64 was infeasible to compute). A larger sub-machine size means that the circuit is decomposed into fewer disjoint components, which can capture its behaviour more accurately and lead to a tighter over-approximation of its exact reachable state set. This is evidenced in Fig. 3.10, which shows a reduction in the absolute debug difficulty as the sub-machine size is increased. However, this reduction is consistent across all signal selections; those with a high debug difficulty at the smallest sub-machine size continue to be high even as m is increased - the random signal selections can still be concluded as providing the least amount of observability, whilst the flat method still provides the most.

3.4.5 Hybrid Algorithm: Merge Threshold Parameter

One parameter available only to the hybrid selection technique is the threshold at which modules in the circuit hierarchy are selectively flattened. The resulting debug difficulty, and algorithm runtime, for a range of thresholds is shown in Figure 3.11.
Figure 3.11: Debug difficulty/runtime for hybrid selection algorithm, geometrically averaged over oc_mem_ctrl, oc_wb_dma and oc_pci, when varying merge threshold.

Here, the debug difficulty (bars) is measured on the left axis, and the runtime (line) on the right axis, geometrically averaged over the oc_mem_ctrl, oc_wb_dma and oc_pci circuits. Again, we have normalized the debug difficulty to the random case, and have also shown the flat algorithm (i.e. threshold = ∞).

Unsurprisingly, as the merge threshold increases, the debug difficulty improves due to the sub-circuits getting larger, in turn capturing more of the internal interactions between the modules within. However, the cost of this is an increase in selection runtime. Interestingly though, whilst increasing the threshold reduces the number of sub-circuits for which E^V must be computed (as each sub-circuit now consumes a bigger proportion of the original circuit), this reduction is outweighed by an increase in their complexity.

3.5 Evaluation

In this section, we evaluate our selection algorithms by presenting example signal selections, providing a comparison with prior work, and discussing the best possible tests for measuring their effectiveness.

3.5.1 Example Signal Selection for LEON3

The signals selected by our techniques for the LEON3 prototype of Section 3.4.3 are summarized in Table 3.3. The full selection listing can be found in Appendix A.

Table 3.3: Comparison of leon3s_nofpu, 128 signal selection, summarized.
  Main integer processing unit:
    Flat E^V:   91 signals, mainly program counter register bits at various stages of the internal pipeline
    Hybrid E^V: 13 signals, consisting of instruction register bits, counter, etc.
    Graph:      106 signals, mainly instruction register bits at various stages of the internal pipeline
  Memory management unit:
    Flat E^V:   16 signals, address of I- and D-caches
    Hybrid E^V: 102 signals, address of I- and D-caches, TLB tag register bits
    Graph:      18 signals, state bits
  Register file:
    Flat E^V:   10 signals, data input register bits, memory contents
    Hybrid E^V: 8 signals, memory contents
    Graph:      -
  Hardware divider:
    Flat E^V:   8 signals, counter and state machine bits
    Hybrid E^V: 5 signals, counter and state machine bits
    Graph:      3 signals, state machine bits
  Hardware multiplier:
    Flat E^V:   3 signals, datapath bits
    Hybrid E^V: -
    Graph:      -
  Global reset:
    Flat E^V:   -
    Hybrid E^V: -
    Graph:      1 signal

These results show that the flat algorithm provides the broadest signal selection coverage, with visibility into 5 of the 6 module categories and, in particular, focusing on the main integer processing unit, which forms the largest of all categories with 1207 flip-flops. According to Fig. 3.9, the flat method returns the lowest debug difficulty.

In contrast, the hybrid method chooses most of its signals from the memory management unit. The reason for this behaviour is that the memory unit (particularly its translation lookaside buffer, or TLB) is composed of multiple mid-sized modules, each of which exchanges many connections with the others. This causes these individual modules to be ranked highly, and to receive a higher proportion of the selection budget. Nevertheless, these signals are still able to improve knowledge of the circuit state space over time, and this is reflected in its debug difficulty being higher than for the flat method.

Lastly, the graph-based method returns a result similar to the flat algorithm by selecting the majority of its signals from the main processing unit, but not any from the register file or multiplier.
In particular, rather than selecting bits from the processing unit's program counter, it finds that bits from the instruction register have a higher centrality value and selects those instead. Unfortunately, the debug metric does not share the view that these are useful (perhaps because the value of the instruction register can be inferred from the memory location pointed to by the program counter), and ranks this with the highest debug difficulty of all three algorithms. Interestingly, only the graph-based technique has chosen to select the global reset signal, which would be expected to have a high fanout (and hence high centrality) but may be of less debug value when the circuit is out of reset.

3.5.2 Comparison with Prior Work

This subsection uses the post-silicon debug difficulty metric to compare our flat signal selection algorithm against one which maximizes the circuit restorability, as presented by Ko and Nicolici [82]. These results were reported in [64].

To compare the two signal selection algorithms, we first instrumented a set of benchmark circuits using the results of each algorithm. Then, we simulated each circuit using randomly-generated input vectors, and recorded the values on the selected signals to mimic a trace buffer operating inside a real chip. At every cycle in the trace buffer, we applied state restoration [82] to reconstruct as many other signal values as possible, and then used this information to refine its images, which are used to calculate the debug difficulty D^V_i. For example, if state restoration was able to reconstruct a = 0 at time t, then this information can be used to refine the image: I(R^V_i, t)|_{a=0}.

Figure 3.12: Debug difficulty comparison between our selection algorithm and prior work (32 signals by 1024 samples). (a) MCNC circuits. (b) QUIP circuits.

Observability

As in previous work, we use a signal selection size of 32 flip-flops. Figure 3.12 shows the average debug difficulty D^V_i for three different signal selections, over a trace buffer depth of 1024 samples, for each of the three algorithms. This difficulty is shown normalized to the baseline case of the average over 1000 random signal selections. The results show that while a signal selection algorithm targeting restorability (the previous work) does give a significant improvement in reducing difficulty - by approximately 9-43 orders of magnitude for the MCNC circuits and 3-22 orders for the QUIP circuits [6] over the random case - we can achieve significantly better results: between 13-57 orders and 54-156 orders of additional improvement respectively, when we target D^V_i directly.

Scalability

Table 3.4 shows a comparison between the CPU time required to compute each signal selection, using a 64-bit 2.83 GHz processor. In four of the six benchmarks shown, our direct algorithm is faster than the restoration algorithm. For the direct case, these figures include the fixed time required to perform state space decomposition and reachability computation, which scales unpredictably with the circuit size. Incidentally, it is this fixed cost which is the reason why our direct algorithm lags behind the restoration algorithm in circuit s35932.

Table 3.4: Signal selection algorithm runtime (seconds).
  Circuit            Latches   Size |V|   Previous (Rest. [82])   Our work (Flat)
  s38584.1              1260          8                      69                66
                                     16                     189                67
                                     32                     474                67
  s38417                 1463          8                     174               244
                                     16                     407               245
                                     32                     912               245
  s35932                 1728          8                      34               468
                                     16                     126               469
                                     32                     383               473
  oc_wb_dma              1767         32                    1817              1890
  oc_hdlc                2425         32                    2223               900
  oc_aes_core_inv        2712         32                    7950              1175
However, once this initial cost has been paid, our direct algorithm scales extremely well with the selection size by exploiting the disjoint nature of the sub-machines.

Case Study

Table 3.5 shows a comparison between selections of 10 signals made using the two different algorithms, when applied to the oc_i2c circuit available as part of the QUIP benchmark set [6]. This easily understood circuit implements an I2C master communication core designed to be used as part of a system-on-chip design, and is built using 129 latches. Unlike previous work, we have applied both signal selection algorithms to a benchmark circuit for which the source code is available for examination, which has allowed us to analyze the signals chosen.

Table 3.5: Example signal selection for oc_i2c.
(a) Previous Work: Restoration Algorithm [82]
  Signal                   Description
  bit_ctrl:al              arbitration lost flag
  bit_ctrl:c_state[9]      one-hot state: stop_d
  bit_ctrl:cnt[1]          clock divider counter
  bit_ctrl:sSDA            SDA data line input
  bit_ctrl:scl_oen         SCL clock output enable
  bit_ctrl:sto_condition   stop condition flag
  byte_ctrl:shift          shift register enable
  byte_ctrl:ld             shift register load
  ctr[7]                   control reg: core enable
  wb_inta_o[7]             bus interrupt request
(b) Our Work: Flat Algorithm
  Signal                   Description
  bit_ctrl:c_state[0]      one-hot state: idle
  bit_ctrl:c_state[1]      one-hot state: start_a
  bit_ctrl:c_state[2]      one-hot state: start_b
  bit_ctrl:c_state[3]      one-hot state: start_c
  bit_ctrl:c_state[4]      one-hot state: start_d
  bit_ctrl:c_state[5]      one-hot state: start_e
  bit_ctrl:c_state[6]      one-hot state: stop_a
  bit_ctrl:c_state[7]      one-hot state: stop_b
  bit_ctrl:cmd_ack         bit_ctrl cmd complete
  byte_ctrl:ack_out        byte_ctrl acknowledge

In this example, the direct algorithm has opted to select only signals from the inner-most bit_ctrl module of the design. 8 of the 10 signals selected are used to track the current state of its internal state machine, stored in the vector c_state, which uses one-hot encoding to indicate which of the 17 possible I2C states is currently active. The remaining signals, cmd_ack and ack_out, are part of the interfaces of their respective sub-blocks and are used to communicate their status to their parents. In contrast, the restorability algorithm has selected a seemingly arbitrary selection of signals from the top-level module, as well as from the byte and bit controller sub-modules. Only one bit of vector c_state has been selected, along with several other, less useful, internal flags and signals. As a result, and in line with the intuition that state machine signals are often more useful to observe, our direct algorithm is able to reduce the debug difficulty by 39 orders of magnitude over previous work.

3.5.3 Gold Standard

Given that the goal of this chapter is to provide human designers with more effective information to aid the debugging process, and not to directly pinpoint the root-cause of any bugs, evaluating this goal is difficult. Ultimately, the best possible way - the gold standard -
of evaluating the effectiveness of our signal selection methods would be to perform user studies. Subjects in such a study would be assigned a realistic debugging task, some of whom would be aided by our techniques, and statistics recorded on how effectively designers would be able to find the root-cause.

However, in the context of this thesis, conducting a user study was impractical for a number of reasons. Firstly, with the target audience of this work being expert circuit designers working on large industrial projects, finding a statistically significant sample to participate in a user study was infeasible. Secondly, gaining access to large, realistic circuit designs with an extensive record of the typical bugs encountered by industry also proved to be difficult, with this being considered proprietary information.

If our techniques were to be integrated into a product, then one possible method to acquire these statistics would be to make use of the (opt-in) feedback features already available in many pieces of CAD software (for example, the TalkBack feature in Altera Quartus II).

3.6 Summary

This chapter first presented a metric that can be used to evaluate the effectiveness of trace-based instruments during post-silicon debug. Specifically, the proposed metric quantifies how accurately any data traced from an FPGA or ASIC device can be used to determine the possible current, past, and future state(s) of the circuit by using formal techniques. The key challenge of utilizing exact formal techniques, however, lies with scalability - typically, they can only operate on trivially small circuits with hundreds of latches, which is insufficient for post-silicon debug. This was overcome by employing over-approximate formal techniques to extend its range to circuits with thousands of latches. The significance of this work is that designers are now able to independently evaluate the quality of many more trace-based debug instruments.

Next, this chapter described three automated signal selection techniques for trace-based instruments. The three techniques encompassed an algorithm that optimizes for the debug metric directly, a technique that utilizes the eigenvector centrality of the circuit netlist, and a hybrid combining both methods. An important objective of this work (and one which differentiates it from prior research) was to formulate a scalable signal selection that can be applied to large System-on-Chip designs. Experimental data was used to evaluate the trade-off between the observability (as measured using the debug metric described earlier) and the scalability of each technique. The results show that designers can use the graph-based technique to compute effective signal selections for large 50,000-latch circuits in less than 90 seconds. This level of scalability is beyond the reach of prior work [82, 93, 160].

The limitations of this chapter, and proposals for future work, are described in Chapter 7.

Chapter 4

Speculative Debug Insertion for FPGAs

In this chapter, we describe our speculative debug insertion flow, and the framework within which it operates. In the typical prototyping flow shown previously in Fig. 1.2, a designer first maps his or her design to the FPGA prototype using standard compilation tools (this often requires manipulating the original design to make it more amenable to FPGA implementation). The implementation is then tested, often in-situ.
If incorrect behaviour is observed, the designer can then use tools such as SignalTap II or ChipScope Pro to add instrumentation to the circuit. Typically, at this point, the designer will use his or her understanding of the nature of the failure to carefully determine the number of signals that will be observed and the size of the trace-buffers. The circuit is then recompiled (sometimes using incremental techniques) and the error reproduced. The designer can then use the data in the trace-buffer to help narrow down the cause of the failure, possibly using formal techniques such as those presented by de Paula et al. [39]. We refer to this typical flow as reactive or post-mortem, since instrumentation is only added after a failure is observed. FPGA compile times are significant, often taking as long as a day for a large FPGA. For a board with many large FPGAs, each compilation 'spin' takes significant time, leading to a time-consuming debug flow.

Figure 4.1: Proposed speculative debug flow.

Our proposed speculative flow differs in that, before compilation, the tool will automatically determine a set of signals that it predicts may be useful during debug and instrument the design accordingly, without intervention from the user. As described in Chapter 2, in FPGA prototypes, pin constraints often dictate that each FPGA is not filled to capacity (often far below capacity). Our speculative flow uses these unused soft-logic resources to implement debug circuitry. Figure 4.1 shows that the design is instrumented before the first compilation, meaning that when the implementation is tested, it is possible to immediately obtain debug data without recompilation, providing a 'head start' in the debugging process.

Besides accelerating the first debug iteration, speculative insertion can also aid the designer in every subsequent iteration. Firstly, after the first iteration, the designer may continue to add instrumentation as with a reactive flow, to provide visibility into signals that he or she deems important. During the recompilation, a speculative tool can supplement these selected signals with signals that it predicts may also be useful, leading to more debugging information. The number of extra signals that the tool can select depends on the amount of spare resources in the FPGA. Secondly, when a designer works with an incorrect hypothesis for the root-cause of a fault and selects a set of ineffective signals, a speculative flow may be able to select signals that the designer would otherwise have missed and prevent the debug iteration from being wasted. Ultimately, these factors can lead to fewer debug iterations overall.

The success of this flow depends critically on two factors. First, the tool must be able to effectively and automatically select signals for observation. For this, we shall employ the graph-based selection algorithm presented previously in Chapter 3. Second, the tool must be careful not to add so much instrumentation that the design targeted for each FPGA no longer fits in that device. In addition, it must ensure that the extra logic does not lead to congestion that may slow down the circuit, increase power significantly, or increase place-and-route run-time. These issues are the focus of Section 4.1. We present 'In-Spec', a tool we have developed which realizes these techniques on top of the Altera Quartus II CAD tool, in Section 4.2. Lastly, this chapter is summarized in Section 4.3.
This work was firstpublished in paper [63], and included as an example application of automated signal selection techniquesin reference [67].654.1 Limits to Speculative InsertionInserting debug instrumentation into any FPGA design will consume logic and memory resources. Al-though utilizing these otherwise-spare resources may be thought of as being ?free? from an area per-spective, it would not be desirable for circuit performance to be affected. This section investigates thelimitations of speculative debug insertion by quantifying the trade-off between how aggressively thecircuit can be instrumented and the overhead that will be incurred.In the following experiments, we have used Altera Quartus II v10.1 software (64-bit, single-processormode) to implement a LEON3 System-on-Chip [4] design onto an Altera Stratix III FPGA. This designwas configured as an eight-core multi-processor with instruction and data caches, full floating-pointand memory management support, as well as DDR memory controller, Ethernet and UART IP blocksall attached to an internal AMBA 2 bus. Without any debug instrumentation, this circuit consumesalmost 100,000 ALM (Adaptive Logic Module, which is Altera terminology for logic elements insideits Stratix FPGAs) resources, using up close to 70% of the largest Stratix III device (EP3SL340). Thedesign contains 47,000 user registers, which were ranked using the graph-based algorithm described inChapter 3. For each experiment, the highest ranked signals were selected for instrumentation, with thetrigger driven by an off-chip source. To limit experimental noise, each configuration was compiled usingten different random seeds, and the geometric average reported.4.1.1 Impact on Area: LogicTo understand how much debug logic can be speculatively added to our benchmark circuit, we firstconstructed a variety of instrumentation scenarios and measured their utilization on the FPGA. Figure 4.2shows the total logic utilization of the instrumented circuit, after place and route, over a variety of differenttrace configurations. The number of signal samples stored by the on-chip trace-buffer (its depth) is shownon the X-axis, and the number of signals observed simultaneously is displayed as different data series.The utilization value displayed on the Y-axis represents the number of ALM resources partially or fullyused in the implemented design. We found that this was the most appropriate metric for logic utilizationas the other measures reported by Quartus II did not account for blocked resources, making it impossibleto achieve 100% utilization.To a first order, the logic area required for debug instrumentation is proportional to the number ofsignals observed, but roughly independent of the depth of the trace-buffer. This suggests that a speculative66 65 70 75 80 85 90 95 1000641282565121K2K4K8K16K32K64KLogic Utilization (%)depthleon3mpwidth64 128 256 512 1K 2K 4K 8K 16KFigure 4.2: Area: trace depth vs logic utilization.debug tool can use the number of available logic elements to determine how many signals can be observedwithout increasing the size beyond the capacity of the FPGA. Where this breaks down, though, is for theset of debug configurations with a trace width of 16K signals which consumes all the logic resourcespresent on the FPGA. Recognizing that it costs approximately 20% of the logic resources to trace 8Ksignals, it would be assumed that, following from the existing trend, that to observe 8K more signalswould cost an additional 20%; this would exceed the capacity of the chip. 
Yet, Quartus II is still able toimplement this configuration by packing the logic more densely, but as explored in a following subsection,this comes at a detrimental cost to delay.We have observed that almost all of the instrumentation logic is formed of register resources. Al-though we cannot examine the source code behind the proprietary SignalTap IP to determine exactly howthese registers are used, we believe that they are pipelining registers for routing between observed signalsand their trace-buffers, in order to minimize their impact on timing. This matches with the previousobservation that total logic utilization is unaffected by trace-buffer depth, because once the signals havebeen routed to their memory resource, which are located in columns on the FPGA, it becomes trivial toincrease the number of samples captured by cascading memory blocks together.67 40 45 50 55 60 65 70 75 80 85 90 95 0641282565121K2K4K8K16K32K64KMemory Utilization (%)depthleon3mpwidth64 128 256 512 1K 2K 4K 8K 16KFigure 4.3: Area: trace depth vs memory utilization.4.1.2 Impact on Area: MemoryFigure 4.3 shows the corresponding plot for the memory utilization. This utilization value captures onlyhow many of the M9K memory resources available on the Stratix III device are being either partially, orfully, occupied. In our experiments, we observed that Quartus II would resort to using the larger M144Kmemory resources (which accounts for approximately 40% of the total on-chip memory) only after ithad exhausted all M9K resources. As expected, the number of memory resources required increases withboth the width and depth of the trace configuration. Because our metric treats partially utilized memoryresources as fully utilized, the memory usage results are flat for configurations with 256 or fewer samples(the widest configuration of a M9K block corresponds to a minimum depth of 256 words).These results suggest that any speculative debug insertion tool must also consider the amount ofavailable memory when determining how many signals can be observed, as well as how deep eachtrace-buffer can be.4.1.3 Impact on Delay and RoutabilityAlthough making use of the unoccupied resources on an FPGA is ?free? from an area perspective,the speculative insertion of debug logic can be expected to impact the circuit?s performance. In ourexperiments, we left the main clock of the LEON3 circuit constrained to its default value of 150 MHz,which is not met either before or after speculative debug insertion. Although over-constraining is not68 98 100 102 104 106 108 110 112 1140641282565121K2K4K8K16K32K64KFmax (MHz)depthleon3mpwidth64 128 256 512 1K 2K 4K 8K 16KFigure 4.4: Delay: trace depth vs Fmaxrealistic of industrial designs, this was necessary for us to accurately quantify the impact of speculativeinsertion on circuit delay.Figure 4.4 plots the estimated maximum clock frequency of the implemented circuit, as determinedby the TimeQuest Timing Analyzer tool in Quartus II. Even though ten different compilation seeds wereemployed for each debug configuration, the results still exhibit experimental noise due to the nature of theheuristic CAD algorithms. Nonetheless, the results suggest that there is little impact on circuit speed forall widths less than or equal to 8K (difference of less than 3%). For a trace width of 16K signals, however,Fmax is reduced by over 10%. 
At this configuration, as discussed earlier, it is necessary for the CAD tool to optimize for area rather than speed in order to fit the entire instrumented circuit onto the device.

A related concern that designers may have is the effect that speculative insertion may have on circuit routability. Figure 4.5 shows the peak interconnect utilization for all trace configurations. Circuits instrumented with a width of less than 16K signals only incur a slight increase in peak utilization over the original value of 51% (for which the corresponding average interconnect utilization is 25%). This appears to indicate that sufficient routing resources have been provisioned for this FPGA device so that routability is only minimally affected by debug insertion.

From these results, it is clear that the speculative debug tool can use an estimate of the available resources on the FPGA to determine the amount of instrumentation to add. As long as the tool ensures that the capacity of the instrumented FPGA does not approach 100%, no impact on delay should be expected.

Figure 4.5: Routability: trace depth vs peak interconnect utilization.

Figure 4.6: Runtime: trace depth vs CAD runtime.

4.1.4 Impact on Runtime and Power

Figure 4.6 shows the total CAD runtime for the mapping (synthesis) and fitting (place and route) stages of Quartus II. When adding sufficient instrumentation to the LEON3 benchmark to observe 4K debug signals simultaneously, a 10% penalty to runtime is observed. The runtime goes up dramatically when instrumenting 16K signals; this is because the CAD tool has to work much harder to pack the circuit more tightly into the device. In contrast, runtime appears to be only mildly affected by the depth of the trace-buffer, which follows on from the previous observation that very little logic is added as the trace depth increases.

We expect that in most implementations, far fewer than 4K signals will be observed; this implies that only a small impact on runtime is to be expected when using speculative debug insertion.

Inserting additional logic into an FPGA will also affect its power consumption, and this is shown in Figure 4.7. These values are for the design running at a 150 MHz clock frequency, and were obtained using Quartus II's PowerPlay Power Analyzer tool operating in its vector-less estimation mode. In common with the previous performance metrics, increasing trace depth has a smaller effect on power compared to increasing its width. For this LEON3 design, we can see that up to 4K signals can be traced simultaneously for less than 10% power overhead.

Figure 4.7: Power: trace depth vs power dissipation.

4.1.5 Debug Aggressiveness

While it is possible to fill all of the free FPGA resources with speculative debug logic, this is not desirable as doing so will have a considerable effect on its performance. With the overheads investigated in this section, the limiting factors can be classified into either hard or soft limits.
Hard limits are those posed by resource availability, which cannot be exceeded (the exception being a small amount of flexibility in how tightly the logic is packed); whilst soft limits are those that can have an unbounded effect on circuit performance, such as delay, runtime or power. With these soft limits, it would be up to the designer to determine how much of a penalty he or she can accept.

For speculative insertion, if it was assumed that a designer would be unwilling to tolerate any decrease in the maximum frequency of a circuit, but could accept a 10% increase to its compilation runtime and power consumption, this would correspond to a trace configuration of 4096 signals by 1024 samples, which would cover almost 10% of the user registers in the design. For the LEON3 design, this equates to a trace solution which utilizes 10% of the FPGA's logic resources, and 40% of its memory resources. Any further increase in the width of the trace solution (and hence logic observed) will increase runtime and power beyond these acceptable overheads. Although increasing memory utilization has a minimal effect on these metrics, the SignalTap debug IP requires that the depth of the trace-buffer is a power of two between 64 and 128K samples. This restriction prevents a deeper trace solution from being realized in this design: doubling the depth to 2K samples would exceed the available memory capacity.

From this analysis, we propose that a reasonable trade-off between debug overhead and aggressiveness lies with inserting a trace solution which consumes 10% of the FPGA's total logic resources, and as many memory resources as possible.

4.2 In-Spec: Speculative Insertion Tool

We have developed a tool, called In-Spec, which can automatically and speculatively instrument any design described by a Quartus II project. In-Spec depends only on the Quartus II software to function, the intention being to simplify the number of tools required and to leverage its robust HDL support. The current implementation of this speculative flow consists of two stages. In the first stage, In-Spec fully compiles the original design to obtain the connectivity graph required for signal selection. This is necessary because extracting this data utilizes Quartus Tcl commands available only after place and route. Using this Tcl interface, all register-register paths in the circuit are dumped into a text file, which is then parsed by a script used to build the connectivity graph for signal selection. Once the design is compiled, accurate values for logic and memory resource utilization can also be obtained.

The latest version of In-Spec is available for download at: http://ece.ubc.ca/~stevew.

4.2.1 Estimating Instrumentation Area

In the second stage, the tool calculates the maximum trace configuration that can be inserted without exceeding the utilization values determined in Section 4.1. To realize this, an area model was developed to estimate, in advance, how many resources would be required to implement any trace configuration. Using the data gathered in the previous section, an area model of the following form can be constructed:

L_{ALM}(w, d) = Aw + B    (4.1)
M_{M9K}(w, d) = Cwd + Dw + Ed    (4.2)

where w and d represent the trace width and depth.

Equation 4.1 estimates that the logic usage (in ALMs) is a function of the trace width only.
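To illustrate how these two models might be applied in practice, the sketch below inverts them to find the largest trace configuration that fits within given logic and memory budgets. This is an assumed, simplified re-implementation rather than the actual In-Spec source; the coefficient values used in the example are the fitted parameters reported in the following paragraph, and the budget figures at the end are hypothetical, chosen only for illustration.

```python
# Illustrative use of the area model in Equations 4.1 and 4.2 (assumed,
# simplified re-implementation; not the actual In-Spec source code).

def max_trace_config(alm_budget, m9k_budget, A, B, C, D, E, max_depth=131072):
    """Return the widest trace configuration (width, depth) whose estimated
    ALM and M9K usage stays within the given budgets."""
    # Invert Eq. 4.1: L_ALM(w, d) = A*w + B  =>  w <= (alm_budget - B) / A
    width = int((alm_budget - B) // A)
    if width <= 0:
        return 0, 0

    # Invert Eq. 4.2: M_M9K(w, d) = C*w*d + D*w + E*d
    #   =>  d <= (m9k_budget - D*w) / (C*w + E)
    depth = (m9k_budget - D * width) / (C * width + E)
    if depth < 64:
        return width, 0

    # SignalTap requires a power-of-two depth between 64 and 128K samples,
    # so round the estimated depth down to the nearest power of two.
    d = 64
    while d * 2 <= min(depth, max_depth):
        d *= 2
    return width, d


# Example with hypothetical budgets (roughly 10% of the device's logic and
# the M9K blocks assumed to be left unused), using the fitted coefficients
# reported in the text: prints approximately (3946, 1024).
print(max_trace_config(13500, 475, A=3.39, B=123, C=1/9216, D=0.00118, E=0.0107))
```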
Using surface-fitting techniques, we can determine that A = 3.39 and B = 123, meaning that the variable cost for each signal observed simultaneously is approximately 3.4 ALM logic resources, and that the fixed cost is B = 123 resources. Similarly, for the memory resources modelled by Equation 4.2: C = 1/9216, D = 0.00118 and E = 0.0107. Particularly interesting is the coefficient C: because the term Cwd represents the total number of memory bits required, C can simply be set to the inverse of the capacity of each M9K resource.

Using Equation 4.1, the maximum trace width that can be supported without exceeding the logic utilization limit is calculated. This value can then be substituted into Equation 4.2 to compute the maximum trace depth, rounded down to the nearest power of two as required by SignalTap. Finally, this trace configuration is instantiated, the highest ranked signals corresponding to the trace width are selected for observation, and the instrumented design is recompiled.

4.2.2 Example Application

We have validated In-Spec against a different, large open-source benchmark: the OpenSPARC T1 Processor Core [129], which implements a 64-bit SPARC core supporting 4 concurrent threads, I- and D-caches, along with an MMU. Two processor cores were combined into a single circuit of 100,000 registers, and mapped onto the same EP3SL340 Stratix III device as used in the previous section. The circuit was constrained for a clock frequency of 100 MHz.

A trace configuration of 3951 signals wide by 1024 samples deep was inserted automatically, corresponding to a 4% coverage of the design. A comparison between the two implementations is shown in Table 4.1. These results show that although the actual logic inserted was only 8% (rather than the 10% requested by our tool), further investigation indicated that all 4000 signals were correctly traced, and that the savings had come from Quartus II recouping some unnecessary logic from the user design. Importantly, this did not significantly affect the circuit's Fmax.

Table 4.1: Example application: OpenSPARC T1 (3951x1024).

                         No Dbg    Spec Dbg    Δ
Logic (% ALM)            74.7      83.1        +8pp
Memory (% M9K)           10.0      52.2        +42pp
Fmax (MHz)               74.0      71.9        -3%
Peak Interconnect (%)    59        57          -2pp
Map Runtime (s)          1441      1620        +12%
Fit Runtime (s)          3384      3980        +18%
Power @100MHz (mW)       5716      6318        +11%

4.3 Summary

This chapter described the concept of speculative debug insertion, and showed it to be feasible. Every time an FPGA prototype is compiled, we propose that the CAD tool automatically and speculatively instruments the circuit with trace-buffers, completely transparently to the designer. The advantage of doing so is that when erroneous behaviour is discovered in the prototype, designers can start debugging immediately using the trace data acquired from their device, without having to first go back to manually instrument their circuit and recompile. After the first debug iteration, our speculative tool can also intelligently supplement any signals selected by the designer with additional signals that can lead to more useful debugging information.

To show that this method is feasible, we quantified the limits of speculative insertion on an industrial-quality 8-core System-on-Chip, finding that unless speculative insertion is overly aggressive in exhausting all on-chip resources, trace-buffers can be inserted into the spare capacity of the FPGA without affecting its maximum operating frequency.
Lastly, we encapsulated our ideas into an automated tool capable ofmodelling the area impact of instrumentation, and showed a successful example application. This tool isavailable for download using the link in the previous section.The significance of this contribution is that designers can achieve higher debug productivity bygaining a head-start on the first debug iteration, and gain additional visibility on all subsequent iterations.Ultimately, this can lead to fewer compile-debug iterations.74Chapter 5Accelerating FPGA Trace-Insertion UsingIncremental TechniquesThis chapter presents a method to accelerate the insertion of trace-buffers into FPGA circuits by applyingincremental-compilation techniques. As discussed in Chapter 2, incremental methods for acceleratingcircuit compilation are well known [29, 35, 44]. The novelty of our contribution lies with the applicationof several new CAD optimizations that are relevant only to trace-insertion, and that can be used to providesignificant speedups only for this purpose.Current FPGA trace solutions such as Xilinx ChipScope Pro, Altera SignalTap II, Synopsys Identifyand Tektronix Certus [8, 130, 137, 155] all operate primarily on the pre-mapping circuit. That is, thesetools will instrument the original user circuit with trace-buffers and their connections before place-and-routing the combined design, although several of these tools can also support a limited amount ofreconfiguration.In this chapter, we propose that circuits be instrumented post-mapping ? that is, after the originalcircuit has been fully place-and-routed. This is achieved by building trace-connections solely out of thespare FPGA routing resources left unused by the uninstrumented circuit. The benefit of this approach isthat any instrumentation added post-mapping uses resources that are not used by to the original circuit,and hence, the original circuit?s timing is not affected. To evaluate the effectiveness of post-mappinginsertion, a detailed comparison of key metrics ? CAD runtime, wirelength, and critical-path delay ?are made when instrumenting the circuit prior to pre- circuit mapping, mid-way through circuit mapping,75InstrnumedDtvdvivmcdsn Riadtl-R RiomOdfd-ed-umydu nDu ydu nDum?nulm? RRsf ?- ydu nDum?nulm? RRsf ?-m?m?Rrun-i debugiSucesor tatoPd?ePtr?os??t?? trabugi  onsbugiVPRS??so?esor tatoPd?ePtr?os??t??Synthesis& Tech. Packing &Placement Routing BitstreamFigure 5.1: Pre-, mid-, post-map stages of the FPGA compilation flow.and post- circuit mapping. A simplified illustration of what we have defined to be the pre-map, mid-map,and post-map stages of the FPGA compilation flow is shown in Figure 5.1.This chapter is arranged as follows: Section 5.1 presents the assumptions made, and describes theprocedure for each of the three flows: pre-, mid- and post-map trace insertion. Section 5.2 presentsthe incremental CAD optimizations that are used to accelerate post-map trace-insertion. The evaluationmethodology, and the results of a comparison between the three methods are shown in Section 5.3 and 5.4respectively. This chapter is summarized in Section 5.5. 
An early version of this contribution containingpost-map insertion was first published in [63], which was later refined and compared against pre- andmid-map insertion in [62].5.1 Trace-InsertionIn this section, we will first describe the assumptions that were made, before elaborating on the differ-ences between performing FPGA trace-insertion at each of these stages in order, before arriving at themain focus of this chapter: post-map trace insertion.5.1.1 AssumptionsIn this work, we have made a number of simplifying assumptions for trace-based debugging. Firstly,we do not consider any overheads incurred by triggering logic. Because only a limited subset of signalscan be connected to trace-buffers, and a limited window for which their signal values can be recorded,triggering logic allows designers to control when to start and stop tracing (for example, only in the clockcycles immediately surrounding the occurrence of the error) in order to make the most effective use ofthis finite memory capacity. One scenario where this assumption would be realistic is if this trigger eventwas driven by an external or user-managed source (perhaps off-chip) that is used to halt the clock signal.76Alternatively, this trigger logic may be inserted manually into the circuit by the user, perhaps using moregeneral-purpose incremental-synthesis techniques. We believe this triggering-logic would incur only asmall increase in net fanout, as compared to the numerous trace-buffer connections, and hence could beeasily implemented using the soft-logic resources that are distributed across the FPGA. Another optionwould be for such trigger logic to be implemented using fixed-circuitry as opposed to soft-logic; doingso would make it transparent to the user-circuit. The area overhead of this hard-logic can be reduced byamortizing it over several trace-buffers (for example, one trigger block per memory column).A second simplifying assumption that we have made is the ability for free memory resources toconvert into trace-buffers without the need for any additional control circuitry. We believe this is alsorealistic, as commercial devices allow memory blocks to be operated as wide shift-registers which canthen be used to record a sliding window of signal data. The second requirement to enable this featureis the ability to unload the signal data once tracing was complete: we believe that this can be achievedby using existing IP solutions for low-bandwidth access over the JTAG interface [7], or through usingdevice readback techniques [53].The third assumption is that, due to our CAD tools, we are only able to synthesize (and henceinstrument) circuits operating in a single clock-domain. For trace-based debug to support multiple clock-domains, each observed signal must be sampled by a trace-buffer operating in its clock-domain. Webelieve that our methods can be extended to support this by adding these requirements as additionalconstraints.We note that adding trace-instrumentation to the circuit after logic synthesis, where the circuit hasalready been transformed from a high-level description (such as Verilog) into low-level FPGA primitives(lookup-tables) means that the designer is restricted to only observing gate-level signals. These gate-levelsignals, due to logic optimizations and technology-mapping, may not have a direct correspondence tothe original HDL signals. 
Whilst it may be possible to post-process gate-level signals into somethingthat a designer understands, we believe several approaches also exist to alleviate this mismatch: firstly,unless register re-timing is performed, both commercial and academic CAD tools are able to preservethe HDL-gate mapping for flip-flops in the circuit. Designers can therefore use these elements as fixedpoints of reference into their Verilog code, or to use the data collected for off-line simulation to computeall intermediate, combinational signals. Secondly, designers are able to manually specify additionalpoints of reference by using synthesis attributes to force the CAD tool to maintain this HDL-gate77correspondence: (* syn keep *) is supported by Synplify and Quartus II tools, whilst ISE userscan apply the S (SAVE NET) attribute to do so. Implicitly, existing trace IP such as ChipScope Pro,SignalTap II and Certus already do this when instrumenting a circuit pre-synthesis.Lastly, during ASIC prototyping, where typically I/O is the limiting factor and spare logic resourcesmay be abundant, it may be feasible to optimize the circuit less aggressively so that more combinationalsignals can be preserved for instrumentation without requiring a larger FPGA or impacting circuit delay.This approach is not dissimilar to debug for software applications, where designers would typicallyrecompile their circuit at a lower optimization level to enhance internal visibility; in fact, the latest versionof GCC 4.8 supports a new ?-Og? optimization level which addresses the need for fast compilation anda superior debugging experience [48].5.1.2 Pre-Map Trace InsertionPerforming trace-insertion at the pre-map stage involves instrumenting the user circuit with trace IP earlyin the implementation flow, whilst the design is still described at a high-level of abstraction, and is theprimary mode of operation in many of the existing trace solutions. For example, Xilinx ChipScope Proallows instruments to be instantiated manually into the HDL source, or inserted directly into the synthe-sized circuit (but still prior to the place-and-route mapping procedure) using the Xilinx PlanAhead andChipScope Pro Core Inserter tools [155]; whilst the Tektronix Certus product automatically modifies theHDL source to add all necessary trace infrastructure [137]. There are many advantages to this approach:by operating on the circuit early in the implementation flow, any trace IP can be described at an equallyhigh-level which allows for increased portability across device families (and even FPGA vendors). Fur-thermore, because the instrumented circuit is treated as a single entity by the subsequent CAD stages,theoretically, the circuit will be optimized as a whole to create a more area and delay efficient result.However, there are also several downfalls to instrumenting the circuit prior to physical mapping.Firstly, because this method inserts additional logic into the original user circuit, the CAD tools willneed to work harder in order to place-and-route the circuit ? not only because there exists more objectsto solve for, but also because of the additional constraints that they impose. For example, each user netthat the designer wishes to observe introduces at least one additional fan-out (the trace-buffer input) forthe placement and routing stages to consider. This increased complexity can manifest in an increasedcompilation runtime. 
Although Chapter 4 found that tracing 10% of the signals in a large design using78SignalTap II incurred only a 10% increase to runtime over the uninstrumented baseline, this problem iscompounded by the need to perform a full recompilation every time the designer wishes to modify theobserved signal set.Secondly, due to the chaotic and unpredictable nature of the heuristic algorithms used in CAD tools,the very act of modifying the circuit (even by a little) may alter, or even hide, the bug under investigation.Rubin and DeHon found that small perturbations just in the routing stage of the VPR CAD tool causedthe critical-path delay to vary between 17?110% [122]; hence, it is not implausible to imagine a scenariowhereby instrumenting the erroneous circuit would cause the faulty path to be implemented entirelydifferently, one in which the bug was much more difficult (if not impossible) to reproduce. Whilst thispoint may not apply to strictly-functional bugs (i.e. those that are caused solely by designer errors inthe HDL source) this may be of critical importance when attempting to locate non-deterministic timingfaults such as those introduced by underspecified timing constraints or multiple clock-domains. As anexample, an FPGA prototype may function correctly with one particular placement floorplan, but failcompletely with a different floorplan, even though the designer has not changed any HDL source. Post-map techniques would enable a designer to probe the circuit without changing this floorplan nor existingrouting.Figure 5.2 illustrates the difference in the placement results between the original, uninstrumentedcircuit in Fig. 5.2a and the circuit instrumented pre-mapping in Fig. 5.2b, when using the VPR tool [121].Square blocks on the peripheries indicate I/O blocks, whilst the square blocks in the centre of thediagram represent logic clusters: a dark shading means that the block is occupied. The columns ofrectangular blocks interspersed in the logic fabric represent heterogeneous resources: each X indicatesa used multiplier, whilst M represent a used memory block. In the instrumented circuit, T indicates afree memory block that has been converted into a trace-buffer. A clear difference exists between theplacement results before and after pre-map instrumentation: the act of instrumentation has significantlyaffected circuit placement, with the original user memory (annotated with M) have now been pushedaway from the centre of the circuit, with trace-buffers (T) taking its place. This effect is due to theCAD algorithms being unable to differentiate between the existing memory blocks and any of the newlyinserted trace-buffers, and so optimizes for them equally.79X MM(a) Baseline uninstrumented circuit.TTTTTTTTM MX(b) Instrumentation without incremental tech-niques (pre-map; utilizing 8/10 free memo-ries).X MMT T TT T TT TT T(c) Instrumentation with incremental-techniques (mid-/post-map; utilizing10/10 free memories).Figure 5.2: Placement results when instrumenting the or1200 benchmark with identical signals:X = multiplier, M = memory, T = trace-buffer (shaded).5.1.3 Mid-Map Trace InsertionAn opportunity for trace-insertion also exists mid-mapping, in which trace instrumentation is insertedpart-way through the FPGA mapping procedure; specifically, as illustrated in Figure 5.1, between theplacement and routing stages of the compilation flow. This approach offers a compromise between pre-map and post-map trace-insertion ? 
the original placement of the circuit is left untouched (which, asrevealed in Section 5.4, consumes the most amount of time in our compilation flow).80Between the placement and routing stages, we propose that the trace-buffers are incrementally-placedinto the unoccupied memory resources of the FPGA ? initially similar to the post-map approach ? butwith the difference that both the original and instrumented net connections are subsequently routed ina combined fashion. Hence, mid-map trace-insertion can be expected to offer a trade-off between thequality-of-results gained by pre-map insertion and the fast runtime achievable using post-map insertion.Because the existing routing is ripped-up and re-routed, there remains the possibility that any timing-bugsmay also be obscured with this approach.5.1.4 Post-Map Trace InsertionThe final opportunity for applying trace instrumentation to the circuit is at the end of a normal FPGAcompilation flow, an approach that we have termed post-map insertion. Here, the original design remainsunmodified until the very end of the flow, at which point incremental techniques will be used to makethe minimal set of modifications required to accommodate the trace instruments. This allows designersto preserve as much of their circuit as possible, in order to improve CAD runtime for a small loss incircuit quality.Besides pre-map trace-insertion, Altera SignalTap II also supports a post-map flow in which theentire instrumentation procedure can be completed using the general-purpose incremental-compilationfeatures available in the Quartus II tool [8]. Through experimentation, we have discovered that Quartus IIappears to take a best-effort approach to inserting trace IP ? in many cases, it can preserve over 99% ofthe circuit?s original placement and routing for moderate runtime savings. However, no option exist toguarantee that the user-circuit remains unmodified, which is a requirement that we impose in this work.The Synopsys Identify product, however, provides an incremental flow specifically for allowing designersto quickly modify and re-route the observed signal set but only once the design has been successfullyinstrumented using Synopsys tools [130].In this work, we make the guarantee that during post-map trace-insertion, all placement and routingof the original circuit will be preserved: all new trace instruments must be incrementally inserted into thelogic and routing resources that were previously unoccupied. We believe this is an important requirementfor FPGA debug, and one that is necessary in order to minimize the possibility that timing-bugs will bealtered, or even obscured, by the act of instrumentation. Due to trace-instrumentation being overlaid ontop of and without affecting the existing user-circuit, an interesting side-effect is that as soon as the debug81infrastructure is no longer required (for example, in a production bitstream) it can simply be ignored andthe circuit run back at its original clock frequency prior to any instrumentation. This can help preservetiming-closure in a sign-off circuit.Of course, imposing such a strict constraint can cause the newly instrumented circuit to no longer beroutable, or require a larger FPGA to begin with; this overhead is quantified in Section 5.4. Figure 5.2shows the difference between no instrumentation (Fig. 5.2a) and post-map instrumentation (Fig. 
5.2c).Unlike pre-map previously, instrumenting the circuit after placement means that the original circuitplacement is unaffected; additional blocks (and routing, which isn?t shown) are mutually exclusive to theresources used in the original result. A more subtle difference that exists between the pre-map and post-map circuit placements are that more memory blocks have been converted into trace-buffers, Fig. 5.2bshows two free memory blocks, whilst Fig. 5.2c show none, even though the instrumented signals arethe same, this will be explained later.5.1.5 FrameworkThis work applies our techniques to the open-source VPR FPGA mapping tool, which is part of theacademic Verilog-To-Routing project [121]. Given a tech-mapped BLIF netlist as input, VPR performstiming-driven packing, placement and routing to map this circuit onto a custom FPGA architecture. Forrouting, VPR employs PathFinder: a popular FPGA routing algorithm which allows nets to temporarilyover-use routing resources, which is then iteratively resolved to allow access to only the most timing-critical nets in a process termed negotiated congestion [96].Using this tool, circuits are mapped to an FPGA architecture based on the Altera Stratix IV device,with a cluster-size N=10, look-up table size K=6 (fracturable into two K=5 LUTs) and channel segmentlength L=4. The cluster input flexibility Fc in=0.15, and the cluster output flexibility Fc out=0.1. Thetargeted architecture is heterogeneous and based on the Stratix IV 40nm family with support for RAMand DSP blocks, as well as realistic wire delays; advanced architectural features such as inferring carry-chains and shift-registers for the user circuit are not currently supported. However, given that theseoptimizations are used primarily to improve area and delay-efficiency, we believe that integrating thesein the future would not affect the conclusions presented. For example, PLL/DCM clock synthesis anddistribution often use their own dedicated set of FPGA resources and hence will not affect the user circuit.82PPP(a) Original FPGA layoutPPPrrrP(b) Layout with potential trace-connectionsFigure 5.3: Illustration of the many-to-many routing flexibility available in post-map insertion:solid (red) lines indicate user-routing, dashed (green) lines indicate potential trace-connectionsof which only one is sufficient.The routing architecture employed is a modern, unidirectional fully-buffered network, which meansthat adding extra fan-out loads to existing nets will not affect the original circuit timing, as long as theextra fan-out is within its timing slack. The dedicated memory resources of this architecture can beconfigured either as a 72 bits wide by 2048 entries deep shift-register ? that is, each memory blockcan be configured as a trace-buffer capable of recording a sliding window of 72 signals for the last2048 cycles ? or as 36x4096, 18x9182 or 9x18194. 
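To make this trace-buffer behaviour concrete, the following behavioural model (a simplified Python sketch, not RTL, and not tied to any particular vendor primitive) shows how a memory block operated as a 72-wide, 2048-deep shift register retains only a sliding window of the most recent samples; this is why a trigger, or halting the clock, is needed to freeze a window of interest around an error, as assumed in Section 5.1.1.

```python
from collections import deque

class TraceBufferModel:
    """Behavioural sketch of a trace-buffer operated as a wide shift-register:
    every cycle it shifts in one <width>-bit sample and silently discards the
    oldest, so it always holds the last <depth> samples observed."""

    def __init__(self, width=72, depth=2048):
        self.width = width
        self.samples = deque(maxlen=depth)   # oldest samples drop off automatically

    def clock(self, sample_bits):
        assert len(sample_bits) == self.width
        self.samples.append(tuple(sample_bits))

    def readback(self):
        """Unload the captured window (e.g. over JTAG or device readback)."""
        return list(self.samples)


# Example: capture 3000 cycles of a 72-signal slice; only the last 2048 remain.
buf = TraceBufferModel()
for cycle in range(3000):
    buf.clock([(cycle >> b) & 1 for b in range(72)])
assert len(buf.readback()) == 2048
```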
In this work, we have chosen the widest memoryconfiguration of 72x2048 to investigate the limits of applying the maximum amount of pressure onto therouting interconnect.5.2 Incremental CAD for Post-Map TracingIn order to effectively implement post-map incremental trace insertion, we have developed a number ofCAD optimizations to take advantage of the unique nature of this problem; these will be explained inthis section.5.2.1 Many-to-Many Trace FlexibilityDuring post-map trace insertion, one rather unique opportunity exists that was not previously availablewhen mapping the original user-logic: a selected signal need only be incrementally-routed to any freetrace-buffer input pin for its signal values to be observed. This differs from user-logic in that signals do83ASIC/FPGUser FiASIC/FPGUser FcAS/uGtSUeC?IASIC/?r?r?eASIC/?r?r?eASIC/?Gr?r?e AS/uGtSUeC?IASIC/?r?r?eASIC/?Gr?r?eASIC/?r?r?e??P????? ???? ??FF?S?GFFtSUeC?? ? ?? ? ?? ? ???? ??Figure 5.4: Example signal path; logic elements A & B and D & F can be incrementally swappedfor greater trace flexibility.not need to be connected to all of its sinks in order to create a valid routing solution; for incremental-tracing, by treating all inputs-pin of all trace-buffers of the FPGA as potential sinks, a connection to anyone of those pins is sufficient to allow observability.Figure 5.3 illustrates this concept ? any trace-pin connection along the dashed routes (or along anyother combination of routes not shown) will be sufficient. Recall that, due to our assumptions, we canconvert every unused memory block into a trace-buffer. In addition, for nets which utilize the globalinterconnect such as that shown, any point of the net can be tapped to make this connection. This many-to-many capability provides two advantages: significantly improved routing flexibility and CAD runtime,both of which are especially important given the self-imposed constraint that during post-map insertion,we prevent any existing user-routing from being ripped up. Routing algorithms, such as breadth-first ordirected search, can then be modified to search for any trace-pin and finish as soon as one is found. Thesealgorithms are commonly coupled with PathFinder with which our techniques are also compatible.5.2.2 Logic Element SymmetryFor local nets, which are entirely absorbed within a logic cluster and do not venture out onto the globalinterconnect, an additional optimization can be applied. Figure 5.4 illustrates an example FPGA signalpath: the nets connecting Logic Element A to B, B to C, and D to E, are all local as they do not exit thecluster. On the other hand, the connection from C to D is global. Tracing local nets is possible by tappingits cluster output pin (OPIN) which would otherwise be unused, and forming a new global connection toa trace-buffer. Because the local routing inside logic clusters is formed of a fully-populated crossbar in84which any cluster input can be switched to any logic element input, logic elements within FPGA logicclusters can be considered symmetric. This observation allows an incremental CAD tool to reorder thelogic elements inside a cluster without affecting the functional or timing behaviour of the original usercircuit, but only those driving local nets. Logic elements driving global nets cannot be swapped, as doingso would cause the global net to be driven by a different OPIN for which a new global-routing solutionwould also be necessary. 
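This legality condition can be stated compactly in code. The sketch below is illustrative only (the dictionaries standing in for the packed netlist are hypothetical), but it captures the rule just described: two logic elements may exchange positions only if they belong to the same cluster and neither drives a net that leaves it.

```python
# Illustrative sketch of the swap-legality rule for logic elements (LEs)
# inside a cluster; 'drives_global' marks LEs whose output net leaves the
# cluster through an OPIN and would normally come from the packed netlist.

def can_swap(le_a, le_b, cluster_of, drives_global):
    """Two LEs may be reordered only if they sit in the same cluster and
    neither drives a global (inter-cluster) net."""
    return (cluster_of[le_a] == cluster_of[le_b]
            and not drives_global[le_a]
            and not drives_global[le_b])


# Toy example following the signal path described above: A and B drive local
# nets within their cluster, whilst C drives the global net to the next
# cluster, so A<->B is a legal swap but A<->C is not.
cluster_of = {"A": 0, "B": 0, "C": 0}
drives_global = {"A": False, "B": False, "C": True}
print(can_swap("A", "B", cluster_of, drives_global))   # True
print(can_swap("A", "C", cluster_of, drives_global))   # False
```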
In Figure 5.4, logic elements A & B, and D & F, can be swapped as they bothdrive local nets, whilst logic elements C or E cannot be moved.The ability to swap logic elements allows greater routing flexibility in tracing local nets, whichcan compensate somewhat for the lack of any existing presence on the global interconnect from whicha trace connection can tap off. Experimental data on this phenomenon can be found in Appendix B.Any routing algorithm can therefore be modified to treat all of these local logic elements as sources,and perform routing expansion from any of them. PathFinder can then be used to iteratively arbitratebetween multiple trace-connections until a valid solution (where each OPIN is used only once) is found.However, it is also worth noting that proprietary FPGA architectures are not likely to be fully-symmetricas assumed by the academic community; instead, the internal crossbar may be depopulated for increasedarea efficiency [87]. In those cases, there may be a reduced amount of symmetry between logic elements.Care must also be taken when attempting to swap fractured architectures (such as that used in this work)for which more than one OPIN may exist for each LE.5.2.3 Timing-Driven Directed SearchAn early iteration of this work [65] pursued a breadth-first search routing strategy (with the optimizationsdescribed above) during post-map incremental-tracing in order to maximize circuit routability. Eventhough breadth-first search is an exhaustive algorithm, we still found that incremental-tracing was anorder of magnitude faster than the original circuit placement. In a later revision [67], we improved onthis result with a timing-aware directed search technique, which is able to route slightly fewer signals,but provide better circuit timing for much lower computational effort. This timing-aware algorithm ispresented here.With the optimizations described in the previous subsection, the nature of the problem is now a one-or many-to-many routing search. For this reason, it is now unclear what the target of any directed-searchalgorithm should be; previously, with breadth-first search, the algorithm can expand outwards from any85Pr(a) Breadth-first trace search.Pr(b) Directed trace search.Figure 5.5: Illustration of breadth-first and directed search routing strategies.P debroty  ypeynp oFyGApDAptsyiDAytAgaobGpFye ycoagSeopteyIpAoCoFntmypeyCDFnoAuo oFypiyrotydypeyaCDeoAlFigure 5.6: Heuristic for pre-assigning suggested targets during directed search.part of the existing net and end as soon as any trace sink is reached, as illustrated in Fig. 5.5a. However,this can require high computational effort, as the CAD tool will need to exhaustively search all nearbyrouting resources ? even if they are unlikely to lead to a trace-buffer.A more efficient routing algorithm would first explore the routing resources that are most likely tolead to a valid solution, yet not forget about the less likely resources in case the preferred resources arecongested. By default, the VTR CAD suite adopts a directed search approach, such as that shown inFig. 5.5b. However, a key difficulty in adopting the directed strategy used in routing user-logic for useduring tracing is that there is not one or more unique targets that any route must connect to. Instead,during tracing, we only require that any free trace-buffer input is reached in order for the signal tobe observed. 
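The consequence of this any-sink formulation can be seen in the short routing sketch below. It is a deliberately simplified breadth-first variant (whereas the router used in this work is timing-driven and built on PathFinder), and the routing-resource graph and helper callables are assumptions made for illustration: expansion starts from every node already owned by the traced net (or from any swappable source, per Section 5.2.2), only resources with free capacity are explored, and the search terminates at the first unused trace-buffer input that is reached.

```python
from collections import deque

def route_to_any_trace_pin(net_nodes, neighbours, is_free, is_free_trace_pin):
    """Breadth-first sketch of the many-to-many trace search: start from every
    routing node already used by the traced net, expand only through routing
    resources that still have free capacity, and stop at the first unused
    trace-buffer input pin reached."""
    queue = deque(net_nodes)              # every point of the existing net is a source
    parent = {n: None for n in net_nodes}

    while queue:
        node = queue.popleft()
        if is_free_trace_pin(node):       # any free trace pin is an acceptable sink
            path = []
            while node is not None:       # walk back to the tap-off point on the net
                path.append(node)
                node = parent[node]
            return list(reversed(path))
        for nxt in neighbours(node):
            # Neighbour-expansion optimization: never consider resources that
            # the user circuit (or earlier trace routes) has already filled.
            if nxt not in parent and is_free(nxt):
                parent[nxt] = node
                queue.append(nxt)
    return None                           # congestion: this signal cannot be traced
```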
Additionally, because incremental-tracing works on top of a complete and legally routedcircuit, the timing slack of each signal to be traced is fully known in advance ? this value can be used toadjust its priority over any congested trace-buffer and routing resources to minimize its timing impact.To achieve this, we implemented a heuristic to preassign a suggested trace-buffer input for eachselected net. The heuristic works by first sorting all nets by their decreasing manhattan distance to thenearest available trace-buffer, weighting this by its timing slack, after which the nets furthest away are86ASIASC PCa (hours)?rbcdeExtdrcneExtlSimEuerEtrWASI/ASCvtnflto?btm?xrdenf??e??tErnt?Eue?ldt Elexf?rcdtCmutenteutm?dlm??r?ciel?t??(hours?)?r t?fe?t?EErnt?ltdeiil?tci?Figure 5.7: Incremental-tracing neighbour expansion: consider free routing resources only (1 and 3)as existing user-net cannot be ripped up.allocated their first-choice trace-buffer, as in Figure 5.6. When any trace-buffer is full, all nets are re-sorted according to the remaining buffers and this procedure repeats. The objective of this algorithm isto minimize the suggested, post-placement wirelength of all trace-signals; importantly though, signalsdo not have to connect to their suggested input ? due to the many-to-many trace flexibility optimization,signals can connect to any input that can be more easily reached.5.2.4 Neighbour ExpansionA final, but small, optimization that we make is to reduce the search space during incremental-routing.During post-map insertion, because we do not allow any existing user-routes to be ripped up, we canimprove incremental-routing efficiency by preventing resources that are already fully utilized fromever being considered for searching. This is achieved by modifying the neighbour-expansion routine ofPathFinder, which is responsible for adding new routing resources to the priority queue of candidates tosearch next, to only add those resources that have free capacity as illustrated in Fig. 5.7. In addition, wekeep track of the utilization from user-routing and incremental-tracing separately, thereby allowing therouting algorithm to assign any remaining capacity only to the traces that need it the most. Essentially,we have subtracted all existing routes of the user circuit from the routing resource graph, and treat thegraph that remains as an entirely new routing problem.5.3 MethodologyIn this work, we have looked at the effect of tracing 100 random signal selections, which consumebetween 5% and 100% of the leftover memory capacity in eight different heterogeneous circuits, eachplaced using 5 different seeds across six different channel widths, for a total of 192,000 data pointsper insertion strategy. The details of these circuits are shown in Table 5.1, where the Wmin column876-input FPGA Logic DSP RAM Max Trace-Circuit LUTs FFs Size Wmin I/O Clusters Blocks Blocks Buffer Inputsor1200 3054 691 25x25 90 779/800 258/475 1/18 2/12 720mkDelayWorker32B 5590 2491 42x42 94 1064/1344 468/1302 0/50 41/42 72stereovision1 10290 11789 36x36 118 278/1152 866/936 38/45 0/30 2160LU8PEEng 22634 6630 54x54 136 216/1728 2175/2255 8/91 45/63 1296stereovision2 29943 18416 84x84 184 331/2688 2338/5208 213/231 0/154 11088bgm 32884 5362 64x64 150 289/2048 2987/3072 11/128 0/80 5760LU32PEEng 76211 20898 101x101 200 216/3232 7470/7575 32/325 150/208 4176mcml 101858 53736 95x95 164 69/3040 6680/6745 30/276 38/180 10224Traceable Average Avg. 
Routing AverageCircuit Nets Pins/Net Utilization Wirelengthor1200 3807 3.7 39% 21.1mkDelayWorker32B 7918 3.2 30% 22.9stereovision1 16653 2.4 45% 15.6LU8PEEng 29001 4.6 47% 28.9stereovision2 47882 2.5 33% 29.0bgm 37639 4.7 45% 29.5LU32PEEng 97563 4.8 45% 40.1mcml 113994 3.4 41% 26.9Table 5.1: Benchmark summary, uninstrumented (values in bold indicate the constraining resource).represents the minimum channel width of the baseline, uninstrumented circuit (explained in the followingsubsection) and the traceable nets column represents the number of gate-level nets, both combinationaland sequential (including all RAM and DSP outputs) that can be connected to a trace-buffer. For DSPsthat employ inputs/output registers, we can observe their values by tracing the input net before it entersthe DSP, and its output net after exiting.These benchmark circuits are supplied with the VTR flow [121] and represent realistic, sizable,designs which include an open-source processor core, or1200, a matrix decomposition core, LU8PEEng,and the largest circuit at over 100,000 LUTs, a Monte Carlo hardware simulator, mcml. The number ofheterogeneous DSP and RAM resources used by each benchmark can also be found in Table 5.1. In allcases, these benchmarks were technology-mapped optimizing for delay and fit into the smallest FPGA(with an aspect ratio of 1:1) possible, representing a highly constrained use-case. Although we couldapply automated signal selection techniques to these circuits, such as those described in Chapter 3 totrace only the most influential signals in the circuit, we decided instead to take multiple random samplesof signals to gain an understanding of our techniques when applied to any signal that a designer maywish to observe.88Pre-Map InsertionTo implement pre-map trace insertion, we directly modified the input BLIF file to add one additional sink? a trace-buffer pin ? for each signal selected for observation. A unique single-port RAM slice wasinstantiated for each selected signal using the .subckt construct, with the responsibility for packingeach of these 1-bit slices left to VPR. Currently, VPR packs to minimize resource utilization, and hencereturns the result seen in Fig. 5.2b where only the minimum number of memory blocks are used.Mid-Map InsertionIn mid-map insertion, the placement of the circuit has been computed, but not any of its routing. At thisstage, the number, and location, of all free memory blocks (which can be transformed into trace-buffersat no cost) are also known. The challenge here is to ensure that each of the selected signals is allocatedto one available trace-pin, before routing can commence. This is achieved using the same heuristic asdescribed in Section 5.2.3, where nets are iteratively assigned their nearest trace-buffer input based ontheir manhattan distance. The key difference here, besides mid-map insertion requiring the entire circuitto be re-routed from scratch, is that these signals must be connected exactly to their assigned pins for thecircuit to be deemed fully-routed.Post-Map InsertionFor post-map insertion, we allow VPR to complete its entire packing-placement-routing process, un-modified, before we start performing any incremental-tracing. Even though the constraint to preventuser-routing from being moved appears to create a more restrictive problem, this is countered by theadditional flexibility that is enabled by the techniques described in Section 5.2. 
For this work, we usethe timing-driven directed search algorithm described previously to perform incremental-tracing, andremove the bounding-box search window (i.e. the router will consider resources that would not lie ona shortest-path) in order to maximize routability. In addition, we have increased the overuse penaltyfactors used by the PathFinder routing algorithm (experimentally, a reasonable set of values were foundto be --first iter pres fac 10 and --initial pres fac 15) so that routing congestionis penalized more heavily, and also restrict the number of incremental-routing iterations to 5 rounds afterwhich all conflicting trace-nets are sequentially discarded until a legal solution remains.89stereo2 LU32PE mcmlDevice EP4SGX110LAB (logic cluster) utilization 33% 89% 99%Average routing utilization 8% 35% 35%Peak routing utilization 21% 67% 47%Table 5.2: Routing utilization when mapped onto Altera Stratix IV.5.3.1 Routing Slack of Minimum Channel Width Wmin+20%A result that has commonly been used as a metric for routability is the minimum number of FPGAtracks ? or channel width ? required to implement a circuit. A smaller channel width is desirable as itmeans a more optimized implementation which requires a smaller FPGA area (and hence cost) to realize.Whilst the minimum channel width is an important metric for measuring circuit routability during FPGAarchitecture and CAD research, it is however, not realistic nor relevant when targeting real FPGAs, whichcontain a prefabricated channel width that the CAD tool is required to stay under. Furthermore, a faircomparison of other quality metrics cannot be made between circuits mapped to different channel widths,as an architecture with a smaller W may have required more computational effort to route, and may alsohave sacrificed timing to achieve. In the subsequent results, we follow the same procedure as industrialCAD tools targeting real FPGAs, and have run all of our experiments at a fixed channel width.In order to eliminate channel width as an independent variable from the experiments presented in thefollowing section, we have opted to map each circuit to an FPGA architecture with 20% more routingtracks than the minimum channel width Wmin before instrumentation. The average routing utilization atof these circuits are shown in Table 5.1 and average 41%. Whilst routing at Wmin represents the absolutebest-case of routing-efficiency possible, it would be expected that FPGA vendors would provision someadditional slack for the in their devices so that they can cope with even the most stubborn circuits withina reasonable runtime. Due to the proprietary and closed nature of commercial CAD tools, we are unableto find the exact channel width that are employed on these FPGAs, nor to make an exact comparisonwith the VTR flow; however, we are able to infer a relative amount of routing slack by mapping oursame benchmark circuits to a set of similar architectures. The results of these experiments are shownin Table 5.2. Even in the worst case for a large circuit, the peak routing utilization by Quartus II v12.1is only 67%. 
The average routing utilization for all three circuits is lower on these Altera devices thanin our theoretical architecture; hence, we believe that assuming a FPGA implementation where 20%routing slack, above the best-case Wmin, is reasonable.90(a) Signals traceable as a function of trace-demand, at Wmin+20%.(b) Signals traceable as a function of Wmin slack, at trace demand = 0.75.Figure 5.8: Average fraction of signals traceable using post-map insertion.5.4 Results5.4.1 Signal Traceability With Post-Map InsertionThe objective of post-map insertion is to add trace-buffer connections incrementally on top of an existingdesign ? without re-placing or re-routing any of the user circuit. Under this constraint, it may notbe possible to connect all selected signals to a trace-buffer due to routing congestion, either betweenthe existing user circuit and the new trace connection, or between two new connections, for which nosolution exists. To prevent the router from searching for such impossible solutions, we force post-maptrace insertion to run for five PathFinder iterations to resolve as much congestion as possible, after whichwe iteratively discard illegal trace-nets (i.e. those which have routing resource conflicts) until a legalsolution is found.91The fraction of signals that can be traced using post-map techniques, as a function of the trace-demand ? the number of signals selected for tracing ? is shown in Figure 5.8a. The results show anexpected trend: requested more signals for tracing reduces the number that are successfully traced, dueto increased routing congestion. However, even in the worst case when all left-over memory blocks arereclaimed as trace-buffers, over 95% of all requested signals can be successfully traced. Figure 5.8bshows how the number of signals traced varies with the amount of channel width slack. Intuitively, themore slack that exists, the less routing congestion there is and the more signals that can be connected.Importantly, our techniques are able to degrade gracefully even when there is little routing slack ? thisallows post-map insertion to operate on a circuit routed at minimum channel width, when pre-map andmid-map techniques would fail to route.5.4.2 CAD RuntimeFigure 5.9 shows the runtime (on a log scale) for each of the different stages of trace-insertion, across allbenchmark circuits. The results are broken down into the runtime for the individual packing, placementand routing stages of the flow at a fixed channel width of Wmin+20%. The number of signals selected foreach circuit is fixed at 0.75 of the leftover memory capacity. As expected, pre-insertion incurs a runtimepenalty over the baseline case in which the circuit has not been instrumented, followed by mid-insertion,and post-insertion, both of which utilize past results and are faster than a full recompilation. Incidentally,the stereovision2, bgm and mcml circuits instrumented pre-map were unroutable at our assumed 20%routing slack. Excluding those, on average, post-map insertion was 98X faster than pre-map insertion,and 22X faster than mid-map insertion.For pre-insertion, the circuit is instrumented prior to mapping and hence all three VPR stages ?packing, placement, routing ? must be rerun. Inherently, this instrumented circuit will be more complexthan prior to instrumentation, rendering a more difficult (and a more constrained) problem for the CADalgorithms to solve. 
During mid-insertion, the original circuit's packing and placement results are re-utilized, on top of which the trace-buffers are incrementally placed and connected to the traced nets. Here, the two most computationally expensive parts of the mapping flow, packing and placement, do not need to be re-executed, leaving only the routing stage.

For the proposed technique, post-insertion, this is taken one step further: results from all three VPR stages are re-utilized (the bar shown for t_pack represents the time required to read the previously packed netlist into the CAD tool). This time, trace routes are incrementally connected to a trace-buffer without affecting the previous routing, by using only the routing resources that were not used in the original circuit mapping. This has the effect of reducing the solution space as compared to a complete re-route, and, coupled with the CAD techniques described in Section 5.2 where any connection to any trace-pin will be sufficient, results in a more efficient algorithm.

Figure 5.9: Runtime breakdown (at trace-demand = 0.75, shown on a log scale) across all circuits; geomean5 shows the geometric mean of the 5 circuits that were routable at all CAD stages.

Figure 5.10 compares the runtime (shown on a linear scale) of all three trace-insertion strategies when applied to the LU8PEEng circuit, across trace-demand values from 0.05 (65 signals) to 1.0 (1296 signals). For all three strategies, we can see that increasing the trace-demand (and hence the complexity of the mapped circuit) increases runtime for each of the mapping stages. In all cases, post-map trace insertion is faster than a full re-route of the circuit, which is required for the pre-map and mid-map insertion strategies.

Lastly, the sensitivity of post-map insertion runtime, as a function of the channel-width slack, is investigated in Figure 5.12 (log scale). Here, routing runtime is highly dependent on the circuit channel width. Two conflicting factors exist as the channel width of a circuit increases: a larger solution space balanced against a reduction in routing congestion, and the latter factor dominates. Due to the directed-search routing strategy, the algorithm will only explore the routing resources that lead towards the suggested sink, and hence will not explore all of the additional resources provided by an increased channel width. Increasing the channel width, though, will reduce the amount of routing congestion that the algorithm needs to resolve, leading to a net gain in runtime which flattens off at Wmin+30% and beyond.

Figure 5.10: Runtime breakdown, as a function of trace-demand, for benchmark LU8PEEng only.

Figure 5.11: Circuit wirelength breakdown, as a function of trace-demand, for benchmark LU8PEEng only.

Figure 5.12: Post-map routing runtime (log scale, at trace-demand = 0.75) as a function of channel width, across all circuits.

5.4.3 Wirelength

The total wirelength, averaged across 100 random signal selections and 5 placement seeds, normalized to the uninstrumented case, is shown in Figure 5.13. Again, we fix the channel width to Wmin+20%.
This metric represents the number of FPGA routing resources that are utilized to implement the circuit; lower values indicate a more efficient circuit mapping, which can also lead to smaller routing delays. At first glance, the results shown in these charts are surprising: it would be expected that by instrumenting the circuit as early in the CAD flow as possible, the implementation could be optimized with more degrees of freedom and return the best result. For example, by completely re-placing the circuit, the algorithm may be able to situate some trace-buffers more centrally, allowing shorter connections to be made, as illustrated in Figure 5.2.

This does not appear to be the case here, however, as these results show that in all of the benchmark circuits, those instrumented pre-mapping have the highest wirelength, followed by mid- and post-mapping. One explanation may be that, due to the compound and heuristic nature of the CAD algorithms involved, more degrees of freedom do not necessarily translate into better results. An example of this is that during the earliest packing stage, VPR has very little information to decide which 1-bit trace-nets should be packed into the same 72-bit trace-buffer. In attempting to compromise between the user circuit and the new debug instrumentation (which, for the pre- and mid-insertion strategies, is indistinguishable from user logic by the CAD tool), it ends up making a poor global choice. Returning to the example in Figure 5.2, during placement, trace-buffers that are situated more centrally by pre-map insertion (Fig. 5.2b) have the inadvertent effect of moving the original memory blocks of the user circuit further away from the logic blocks that they must connect to. Whilst this can decrease the wirelength of any trace connections, it will also cause the wirelength of the original user connections to increase, possibly in a way that causes the net total to increase.

Figure 5.13: Circuit wirelength, across all circuits at trace-demand = 0.75.

Figure 5.14: Post-map circuit wirelength as a function of channel width, across all circuits at trace-demand = 0.75.

The wirelength for various trace-demands of the LU8PEEng circuit is shown in Figure 5.11. As expected, increasing the trace-demand also increases the total wirelength for all three insertion strategies. This chart shows that pre-map insertion returns the highest wirelength for both the user and incremental connections, regardless of the number of signals traced. These results show that, in attempting to optimize for both equally, the CAD tool is actually worse at both. Similarly, mid-map insertion also returns a higher wirelength than post-map, but this time the user wirelength is affected by a much smaller amount. At trace-demand 0.5 and below, post-map insertion returns shorter trace wires than mid-map insertion, and even at the maximum demand of 1.0, the post-map trace wirelength is still within 2% of its mid-map value. In all cases, post-map insertion returns a smaller total wirelength. Lastly, the sensitivity of wirelength to the amount of routing slack is presented in Figure 5.14, which shows that as the channel width increases, the instrumented wirelength decreases,
due to the router being able to find more direct, uncongested paths to each trace-buffer.

5.4.4 Critical-Path Delay

A metric that is important to many designers is the effect that instrumentation has on the critical-path delay of their circuit. Figure 5.15 measures the average critical-path delay of all three trace-insertion strategies, normalized to the uninstrumented case, at a fixed channel width of Wmin+20%. Despite the increase in wirelength observed in the previous subsection, all three strategies had, on average, a much smaller effect on the critical-path delay of each circuit. In the worst case, pre-map insertion increased the delay (and hence decreased the maximum clock frequency) by 3.8%; even more interesting is that inserting trace-buffers actually improved the critical-path delay of two circuits, mkDelayWorker32B and LU8PEEng, by 1.4 and 1.8% respectively. We believe this can be attributed to the chaotic nature of CAD algorithms, as reported by [122].

Whilst pre-map and mid-map insertion can both potentially return a smaller delay than the uninstrumented circuit, this is not possible for post-map insertion, where existing routing connections are never ripped up and re-routed for an even better solution. However, our experiments did not find that mid-map trace-insertion returned results better than the uninstrumented case; instead, it returned solutions within 0.1% of the original. For post-map trace-insertion, this overhead was on average higher by 0.6%.

Figure 5.16 fixes the benchmark to the circuit with the smallest critical-path delay (and hence the most sensitive to extra logic), stereovision1, and plots the effect that varying the number of signals traced has on its delay. Again, these results show that the effect is small for the mid-map and post-map techniques, up until the maximum trace-demand. This indicates that there is enough flexibility in the routing fabric to comfortably support incremental-tracing. Noticeably though, there are small perturbations in the delay when using pre-map trace-insertion, where the delay varies by a small and unpredictable amount with the number of signals traced. This effect is also supported by the error bars, which show a much higher variance for pre-map than for the other two strategies.

Figure 5.15: Critical-path delay, across all circuits at trace-demand = 0.75 (error bars indicate stdev).

Figure 5.16: Critical-path delay for stereovision1 only (error bars indicate stdev).

Figure 5.17: Post-map delay, as a function of channel width at trace-demand = 0.75.

                                        Signals Traced  Runtime (s)  Wirelength  T_crit (ns)
This work: stereo0                      1318            27           118569      3.61
  No timing-aware directed-search [65]  1322            38           117528      3.67
  No many-to-many optimization          1287            44           127175      4.48
  No LE symmetry optimization           1300            31           122172      3.73
  No neighbour expansion optimization   1318            28           118569      3.67
  No optimizations                      1311            97           129692      3.94
This work: mcml                         7665            151          1603771     66.15
  No timing-aware directed-search [65]  7666            2802         1605644     66.15
  No many-to-many optimization          7666            256          1641781     66.15
  No LE symmetry optimization           7665            185          1621184     66.15
  No neighbour expansion optimization   7665            216          1611983     66.15
  No optimizations                      7664            3261         1650272     66.15

Table 5.3: Post-map CAD optimization breakdown.

The impact of trace-insertion, as the circuit channel width is varied, is shown in Figure 5.17.
Although the number of signals that can be successfully traced decreases with channel width (as explored in Section 5.4.1 previously), the critical-path of the instrumented circuit remains stable.

5.4.5 CAD Optimizations for Post-Map Tracing

Given that the results thus far show post-map trace-insertion to be superior to pre-map insertion in terms of CAD runtime, circuit wirelength and a more stable critical-path delay, and superior to mid-map insertion in all aspects except for a slightly higher delay, an interesting question to explore is how each of the CAD optimizations described in Section 5.2 contributes to these metrics. Table 5.3 shows the contribution of each of the CAD optimizations for both a difficult-to-route circuit, stereovision0, and the largest circuit, mcml. These figures show that whilst using an undirected, breadth-first-search routing approach can lead to more signals being traceable and a smaller wirelength, due to the solution space being more thoroughly explored, this has a huge impact on runtime. Disabling the many-to-many flexibility and the logic element symmetry optimizations both reduce the number of signals that can be connected to trace-buffers, whilst also increasing the circuit's critical-path delay and routing runtime. Disabling the neighbour expansion optimization gives a small increase in runtime without substantially affecting other metrics.

5.4.6 Comparison With Altera Quartus II

In this last subsection, we compare our techniques with Altera Quartus II's incremental-compilation feature, which can be invoked by designating the circuit as a "post-fit" design partition. In this mode, the CAD tool will attempt to preserve all placement and routing. To instrument the circuit, we employed the Altera SignalTap II product, which inserts the necessary trace-buffers, supporting logic and connections to observe a set of user-specified signals. Key differences between our work and SignalTap II are that the latter inserts approximately 4 pipelining registers for each trace-connection, whilst in this work we do not perform any pipelining, and that, when operating in post-fit mode, SignalTap II can only instrument flip-flops in the design, whilst our work allows any gate-level signal (combinational or sequential) to be traced.

Table 5.4 shows the runtime, logic utilization (ALMs in Altera terminology), memory utilization (M9K blocks) and maximum operating frequency (F_max) for the largest benchmark, mcml, when targeting a Stratix IV architecture in Quartus II and a Stratix IV-like architecture using our flow. For the Altera post-map flow, inserting trace instrumentation incrementally is about 40% faster than performing pre-map instrumentation; we believe the primary reason why this is not accelerated any further is that trace logic is inserted using the general-purpose incremental-compilation flow.

mcml                Trace Demand   Time (s)  Logic Elements  Memory Blocks  F_max (MHz)
Quartus II (Stratix IV EP4SGX180 device, peak routing utilization: 31%)
Uninstrumented      -              3514      52688 (75%)     536 (56%)      29.54
Pre-map             0.27 (4094)    4131      65268 (93%)     667 (70%)      30.94
                    0.52 (7676)    4715      65859 (94%)     767 (81%)      29.91
Post-map            0.27 (4094)    2451      64708 (92%)     650 (68%)      29.15
                    0.52 (7676)    2959      70251 (100%)    750 (79%)      29.50
This work, at Wmin+20% channel width
Uninstrumented      -              35526     66685 (99%)     38 (21%)       15.12
Pre-map             0.4 (4090)     41908     66737 (99%)     95 (53%)       15.03
                    Above 0.4: circuit unroutable at Wmin+20%
Post-map            0.5 (5111)     158       66685 (99%)     180 (100%)     15.12
                    0.75 (7663)    210       66685 (99%)     180 (100%)     15.12
                    1.0 (10213)    489       66585 (99%)     180 (100%)     15.12

Table 5.4: Comparison between Quartus II and this work for mcml.
This general-purpose flow is designed to cope with functional modifications to the circuit (e.g. ECO changes, bug fixes), as opposed to a trace-specific, observe-only flow which seeks to instrument without modifying, and so it still has to perform placement and routing (and their preparation) steps. Furthermore, no guarantee exists that the circuit will be fully preserved: for pre-map insertion at 0.27 trace-demand, routing conflicts required the whole circuit to be re-routed from scratch (much as mid-map insertion would), which also led to a small reduction in F_max. We were unable to instrument any more than 7600 signals due to exhausting all logic resources.

By comparison, the trace-specific flow presented in this chapter is capable of preserving the circuit entirely (and with that, its critical-path). Although it takes the academic CAD flow significantly longer to compile the uninstrumented circuit, the runtime required to perform trace insertion is almost two orders of magnitude faster, driven mainly by the ability to skip the packing and placement stages of the mapping flow. Additionally, because our flow does not perform any pipelining, we are able to reclaim 100% of the leftover memory resources for tracing; in particular, taking advantage of the flexibility offered by all memory blocks even when the trace-demand is less than 1.0.

5.5 Summary

This chapter proposed a method to accelerate the process of instrumenting FPGA circuits with trace-buffers by using incremental techniques. Rather than inserting debug logic into the high-level representation of a circuit (prior to place-and-route) and recompiling, this chapter shows that directly manipulating the spare, low-level FPGA resources in an incremental fashion can achieve even better results, for much less computational effort. The significance of this contribution is that it can lead to faster compile-debug iterations and increased designer productivity.

The novelty of this contribution lies in a number of CAD optimizations that allow general-purpose incremental techniques to be applied more effectively to trace-buffer insertion, such as by exploiting the fact that to observe a signal, it needs only to be routed to any trace-buffer input, as opposed to a specific input. Experimental results show that post-map insertion in this manner can be completed 98X faster than a full recompilation, with only a small effect on the critical-path delay when reclaiming 75% of the leftover memory capacity for tracing.

Researchers interested in applying or extending our techniques are invited to download our Inc-Trace patch for VTR 1.0 from http://ece.ubc.ca/~stevew.

Chapter 6
A Virtual Overlay Network for FPGA Trace-Buffers

This chapter proposes a method to allow the set of signals connected to on-chip trace-buffers to be modified without requiring the circuit to be recompiled. These techniques can be used to greatly reduce the time between debug turns, and hence rapidly accelerate the debugging flow.
The key enabling component of this work is that, instead of building a custom FPGA mapping to connect each signal to one dedicated trace-buffer input as in existing IP [8, 155], we insert a virtual overlay network which allows multiple signals to be multiplexed to each trace-input; this is illustrated in Figure 6.1. Subsequently, changing the signals that are forwarded over this network requires only the virtual network to be reconfigured, rather than a new place-and-route solution.

Figure 6.1: Virtual overlay network for multiplexing a large set of circuit signals to a small number of trace-buffer inputs.

Figure 6.2: Proposed debug flow: compile- and debug-time phases.

Existing work pursues a debug flow similar to that shown previously in Figure 1.1, in which the instrumentation procedure requires a new FPGA mapping to be constructed for each new set of observed signals at each debug turn. Recompiling and/or reconfiguring the design to observe different signals is referred to as a turn in [152]; often many turns are required during debug to narrow down the cause of unexpected behaviour. Whilst incremental compilation techniques can be used to accelerate this procedure (as proposed in Chapter 5), it still requires the entire circuit to be loaded into the memory of a CAD tool and some amount of additional routing (and perhaps placement) operations to be performed.

Figure 6.2 describes our proposed debug flow, which consists of two phases: compile-time and debug-time. During compile-time, the uninstrumented circuit is fully compiled as normal, and the resulting mapping is then completely fixed. Next, the virtual overlay network is added incrementally (employing many of the same techniques from Chapter 5), using only the FPGA resources that were left over from the initial mapping. It is also possible to insert the overlay network during the original full compile, though this option is not explored here.

At debug-time, this overlay network is repeatedly configured with the signals that a designer wishes to observe, and the device tested at each turn in order to record the desired signal values, until the root-cause of the bug is located. By eliminating all forms of recompilation from the inner loop, each debug turn can now be completed in a matter of seconds.

A number of key technical challenges have to be overcome in order to realize such a flow, and these are described as follows: Section 6.1 presents the details of our virtual overlay network, which seeks to connect all combinational and sequential signals of the user circuit to the available trace-buffers. Section 6.2 describes the graph-based method we employ for computing a valid network configuration, whilst Section 6.3 describes how this configuration can be programmed into the device. The experimental methodology, and results showing the connectivity and impact of the virtual overlay network, are presented in Sections 6.4 and 6.5. This chapter is summarized in Section 6.6.
The contribution described within was published in [66, 68].

Figure 6.3: Trace-buffer connectivity. (a) Prior work: single trace connections. (b) Proposed: overlay network with multiple connections.

6.1 Virtual Overlay Network

In this section, we describe the details of our virtual overlay network. The key purpose of this network is to multiplex all on-chip signals to all trace-buffer inputs. With previous work, it is necessary to compile a custom point-to-point network for each new signal selection, generating a mapping such as that illustrated by Figure 6.3a, where each observed signal is connected exclusively to a single trace-buffer input via a set of dedicated routing multiplexers. Instead, we propose that an overlay network is created out of these routing multiplexers, as shown in Fig. 6.3b, where a total of 4 signals, A, B, C and D, are now connected to the same trace-input. The select-lines of each of these routing multiplexers are driven by the FPGA configuration memory; methods to reprogram their values are covered in a subsequent section. The reconfigurable nature of FPGAs arises from the abundance of multiplexers inside, and by utilizing routing multiplexers to build this overlay network instead of general-purpose user logic, this network can be built much more efficiently. The feasibility of this proposal is supported by the observation that, in mapping our set of uninstrumented benchmark circuits to a minimum array-size FPGA with a small amount of routing slack, only 32-51% of the total interconnect capacity (geomean 41%) was utilized.

Building the virtual overlay network is essentially a routing problem, which can be represented as a routing resource graph G(V,E). We define V = Vsignals ∪ Vrouting ∪ Vtrace, where Vsignals is
{C,D,E}: {AC,AD,AE,BC ...}.107The key feature of this routing solution is that it is made up of a disjoint union of trees, each rooted ata trace-buffer input, with the leaves of each tree being the circuit signals that it connects. We use a disjointunion of such trees, to allow signal selections to be made for each trace-buffer input independently ofother trace inputs; it is this constraint which differentiates and abstracts our virtual overlay network fromthe more general routing problem faced when building point-to-point networks. Whilst each trace-bufferinput in the general routing resource graph G can be considered the root of a much larger tree whichtouches all the signals in its fan-in cone, the union of such trees will not be disjoint and hence signals foreach trace-input cannot be selected independently.Our virtual overlay network can be described as a graph G?(V ?,E ?) where V ? now consists ofVsignals?Vtrace, and E ? the set of edges that describe connectivity between a circuit signal and a trace-pin.Furthermore, rather than connecting each signal in the circuit to a trace-buffer input just once as inFig. 6.4b, it is possible for a signal to be a leaf of multiple trees. A valid routing solution for a networkwhere this is the case is shown in Figure 6.4c; here, by occupying a few more routing resources, eachof the five signals can now be connected to two of the four trace-buffer inputs. The increased flexibil-ity of this overlay network can now guarantee that any combination of two signals can be selected forobservation, but in practice, many more signals are possible.6.2 Network MatchingSo far, we have assumed that at debug-time, the designer chooses which signal they wish to connectto every input pin of every trace-buffer in their circuit. Once this decision is made, a simple algorithmcan be used to determine the select bits for each of the Vrouting multiplexers that make up the overlaynetwork. This algorithm follows a greedy strategy: starting at the leaf node of the desired Vsignal , movethrough all Vrouting multiplexers in the routing resource graph G belonging to the signal tree towards itsroot, Vtrace. At each Vrouting multiplexer encountered, set it to forward the output from the previous node.However, making the choice of which signal to forward to which trace-input is not trivial. Consider thenetwork in Figure 6.4c: although the designer can select any combination of two signals, they can onlyselect a limited combination of three signals ? defined by the Cartesian product of all sets. Given thateach signal can be connected to one of two trace-buffer pins, there exists a problem of deciding whichsignals to connect to which pin. As an example, suppose that a designer wishes to observe the signalsACD; A can only be forwarded to trace-input 1 or 2, C to 2 or 3, whilst D can be forwarded to 3 or 4.108InstrumedDvsmuiecvt Rsa debugitrauobnsrP-oSdllRrtn-oelvORrsfveysu?? eboSleeeeeeeeeeeeeeeeeeeeOtutv??? eesvOrmt??eeeeeeeeeeeeeOtutv?? eeeeeesvOrmt???Otutv???eeeeeeeeeeee????vOno-vse?no-um?vmvftnR-c? 
bugitrau?v?ro?tn?Matching?R??nmv?tn?Construction ?Ovse?nsfrntFigure 6.5: Virtual overlay network abstraction.Protypenr o FGApent P rGDpent PGsigabcSisICcmuFlF?peno FG(a) Original bipartite graph.Protypenr o FGApent P rGDpent PGFspeno FG(b) Maximum-matching of graph.Figure 6.6: Bipartite graph Gb(Vsignals,Vtrace,Eb) capturing signal/trace-input connectivity of vir-tual overlay network.From this list of constraints, a feasible assignment must be found: A?1, C?2, D?3 would be one validsolution, as would A?2, C?3, and D?4. However, assigning A?2 and D?3 would prevent signal Cfrom reaching any trace-buffer input. Although this example was easy to compute by hand, it may notbe so simple for circuits containing 10,000s of signals, connected to 1,000s of trace-inputs, from which100s of signals are selected.To solve this assignment problem, we utilize matching techniques for bipartite graphs. A bipartitegraph can be described as Gb(Ub,Vb,Eb), where Ub and Vb represent two disjoint sets of vertices, and Ebthe set of edges that connect between them. Edges must not exist between elements in the same set: fromUb to Ub, nor from Vb to Vb. The definition for our virtual overlay network fits this pattern, when substi-tuting Ub = Vsignals, the set of all circuit signals, Vb = Vtrace, the set of all trace-buffer inputs, and Eb = E ?? the set of edges which describe the network connectivity between the two. The relationship betweenthe virtual overlay network Gb and the general routing resource graph G is shown in Figure 6.5. Thebipartite graph capturing the connectivity of the overlay network from Fig. 6.4c is shown in Figure 6.6a.109A maximum matching in a bipartite graph can be computed in polynomial time, when using theHopcroft-Karp algorithm. A matching of graph Gb represents a subgraph of Gb in which none of itsedges share a common vertex; a maximum matching is the largest such subgraph that can be formed.This is a very convenient property for computing which signal to forward to each trace-pin: given thateach pin can only support one such connection, therefore, each node in Vtrace must have at most oneedge. The maximum matching solution for selecting signals ACDE from its bipartite graph is shown inFigure 6.6b, which returns the solution: A? 1, C? 2, D? 4 and E ? 3. The maximum number ofedges that can exist in a maximum matching is the minimum value of |Vsignals| and |Vtrace|. In typicalcircuits, we would expect that more circuit signals exist than trace-inputs, and hence |Vsignals|>> |Vtrace|.An additional useful characteristic of the maximum matching algorithm is that it not only returnsa pass-fail result, for cases where a complete match is not possible, it will return a best-effort partialassignment. Because the virtual overlay network that we build is blocking, a maximum match can beused to return partial, but optimal, result where the maximum number of signals possible are forwardedover the network. In cases where not all requested signals can be forwarded, more than one maximumpartial-match may exist ? currently, only an arbitrary match is returned. Similarly, the solution maynot capture designer intent in situations where higher emphasis is placed on certain signals ? they mayprefer one high-value signal to be selected over multiple, lower-value ones. 
This is a scenario we plan toaddress in future work using maximum weighted matching techniques.6.3 Network ReconfigurationOnce the select bits for each of the Vrouting multiplexers in the overlay network have been computed,the final task is to program these bits into the FPGA. For this we propose two different approaches:one which requires the FPGA to be powered down and fully reprogrammed, and an alternative which,with the correct architectural support, would allow the signal selection of a live FPGA to be changedon-the-fly.6.3.1 Static ReconfigurationThe flow employed in existing work [8, 155] is to create a new point-to-point circuit mapping foreach new signal selection. The resulting bitstream would then be used to fully reprogram the all of theconfiguration memory on the FPGA device. This static reconfiguration procedure is identical to that110which is undertaken during the initial power-on of the FPGA, and is also responsible for resetting allflip-flop and memory contents to a known value, destroying any existing user-state. For this reason, afterreprogramming each new trace configuration, designers must then rerun their tests from scratch to collecttheir new signal trace.In our proposed flow, because we do not recompile the circuit between each debug turn, we do notautomatically generate a new bitstream. However, with exact knowledge of where the configuration bitsfor each routing multiplexer is located within this bitstream, it would be possible to directly modifyonly those bits necessary for configuring our overlay network. Then, when the FPGA device is staticallyreprogrammed, the desired signal selection is forwarded for observation. Graham et al. [53] adopt thisbitstream modification approach for creating point-to-point trace networks.6.3.2 Dynamic ReconfigurationAlternatively, it may be possible for the overlay network to be changed for a new signal selection withoutlosing user-state or interrupting live FPGA operation by using dynamic, partial reconfiguration. Thisfeature allows circuit designers to dynamically reprogram only a portion of their FPGA during runtime,whilst the rest of the device continues functioning as normal.Major FPGA vendors support dynamic reconfiguration in their high-end parts, and provide fine-grained, non-glitching support [24, 156] which does not corrupt user-state, showing the feasibility andviability of our application. Altera states that individual routing multiplexers in their fabric can bereconfigured; Vansteenkiste et al. [142] have also proposed that FPGA circuit-specialization be createdusing such fine-grained reconfiguration support.Interestingly, current architectural support for reconfiguration goes beyond the needs of our transpar-ent, observe-only trace-buffer network, as it enables all aspects of the FPGA to be reconfigured, includinglogic elements and lookup-tables, logic clusters, memory and DSP blocks, as well as their associatedrouting resources. Our network requires only the latter (specifically, only the configuration cells forall routing switch-boxes, as well as all connection-boxes for just the memory resources as shown inFig 6.3b). 
Unfortunately, due to the proprietary nature of commercial FPGAs, we are unable to quantifywhat savings can be made here, nor to test our techniques on a physical device.111Flip- FPGA LogicCircuit 6LUTs Flops Size Wmin I/O Clusters DSPs RAMsor1200 2963 691 25x25 72 779/800 298/475 1/18 2/12mkDelayWorker32B 5580 2491 42x42 76 1064/1344 560/1302 0/50 41/42stereovision1 10366 11789 43x43 70 278/1376 1365/1376 38/50 0/42stereovision0 11462 13405 45x45 44 354/1485 1479/1485 0/66 0/42LU8PEEng 21954 6630 59x59 86 216/1888 2583/2596 8/98 45/72stereovision2 29849 18416 84x84 118 331/2688 3635/5208 213/231 0/154bgm 30089 5362 69x69 88 289/2208 3419/3519 11/153 0/99LU32PEEng 75530 20898 110x110 130 216/3520 8861/9020 32/378 150/252mcml 99700 53736 119x119 86 69/3808 10436/10591 30/435 38/285Table 6.1: Benchmark summary (values in bold indicate the limiting resource)FPGA Architecture Parameter ValueLogic Cluster Size N 10Lookup Table Size (non-fracturable) K 6Inputs per Cluster I 33Channel Segment Length L 4Cluster Input Flexibility Fc in 0.15Cluster Output Flexibility (default: 0.10) Fc out 0.20Table 6.2: FPGA architecture used, based on Altera Stratix IV device family6.4 MethodologyTo evaluate the feasibility of our virtual overlay network, we implemented our techniques using the FPGACAD tool VPR, which forms part of the Verilog-To-Routing academic project [121]. Using VPR 6.0, wepacked, placed and routed a set of benchmark circuits as normal onto the default VPR architecture (givenin Table 6.2, but with an increased Fc out of 0.2) to generate the baseline data outlined in Table 6.1.In this flow, packing is performed with the objective of minimizing logic cluster usage, and placementis subsequently performed onto the minimum-sized FPGA array that will fit the circuit. The minimumchannel width, Wmin, is a measure for the routing efficiency of the CAD tools and FPGA architectureinvolved, and describes the absolute minimum number of routing tracks that is possible to implementthe circuit on the given FPGA.We make the same assumptions as in Chapter 5 that any free memory-block in the FPGA can betransformed into a trace-buffer for zero overhead, its contents can be extracted for free (using devicereadback techniques, or built in JTAG logic) and that triggering to control when to start and stop tracingis specified by the designer manually, or driven externally from a global pin. We do not believe these tobe unrealistic assumptions, given that the memory blocks inside the Xilinx Virtex family have built-inhard logic to implement FIFO functionality [153] with which we can build a ring-buffer to constantly112record signal samples until halted by the trigger. In the FPGA architecture used, the widest configurationfor each memory block is 72 bits (by 2048 entries) ? this is adopted for our trace-buffers.Rather than operating on each circuit at their minimum channel width, we inflate this value by asmall amount, 30%, in order to reflect a realistic commercial architecture in which routing resourceshave been over-provisioned above the very best-case; this is a common approach also taken by otherresearchers such as [125]. To perform our experiments, we use our custom version of VPR to install ourtrace-buffer network incrementally ? using only the spare resources not used in the original mapping ?using the techniques described in Chapter 5. 
However, instead of employing these techniques to build acustom point-to-point network, we have modified them to build our overlay network whilst preservingthe guarantee that no existing circuit blocks nor routing are moved or re-routed. We then sweep eachcircuit to find the maximum number of times that all circuit signals can be connected to a differenttrace-buffer pin, a parameter we refer to as network connectivity. Once a feasible overlay network isfound, we record the signals connected to each trace-pin into a text file which is subsequently used formatching at debug-time.Currently, the VTR flow supports mapping circuits with only a single clock-domain. However, webelieve that our approach can be extended to those with multiple clock-domains ? given that eachtrace-buffer can only record signals from a single domain [117], we can also build an separate virtualoverlay network to support all trace-buffers from each clock.6.4.1 Compile-Time ConstructionAt compile-time, the virtual overlay network is constructed once per circuit. The primary challenge inconstructing this overlay network using normal CAD tools is that these tools are designed to build acircuit mapping where each and every routing resource can, at most, be used once to connect one netsource to one (or more) net sinks. The proposed network requires the reverse of this ? we requiremultiple net sources to feed (multiplexed) a single trace-buffer sink.Within VPR 6.0?s routing stage, the PathFinder algorithm is employed to iteratively resolve routingresources that become overused, by slowly increasing their costs so that only the most critical nets canafford them, through a process known as negotiated congestion [96]. The goal of our algorithms is toattempt to connect all circuit signals ? both combination and sequential ? to the requested numberof trace-buffer inputs, using a directed-search strategy which terminates whenever any input is found.113ProtypePn PFGApDeoGsPpPyFiPgayyFgAoaynbnb rotypePb PGspcFPFSoGAoytPgayyFgAoayFigure 6.7: Circuit signals can either establish new trace connections (signal A) or share existingconnections (signal B).However, rather than directing each net towards its nearest trace-buffer so that its routing wirelength,and hence any routing congestion, is minimized, we have found experimentally that higher networkconnectivity is possible if each circuit signal was directed towards a randomly chosen trace-buffer. Webelieve this is because it is beneficial to establish connections to trace-buffers that circuit signals wouldnot normally prefer in order to fully utilize the flexibility provided by as many trace-pins as possible;this also has the added benefit that because signals are randomly distributed, higher quality signal totrace-pin matches, which are explained in the next section, can be achieved.Instead of building point-to-point trace connections, where each routing resource can be used onlyonce, we allow all Vrouting multiplexers to be overused, with the understanding that their select bits canbe determined at debug-time. During network insertion, circuit signals have two options: either they canestablish a new connection to a new trace-input (signal A in Fig. 6.7), or they can branch onto an existingconnection (signal B). A na??ve approach would be to force all signals to always take the latter optionwhenever a used Vrouting node is encountered. 
However, we found that this made the solution sensitive tothe order in which nets were routed: those processed first would be able to consume all the resources thatsuited itself most, with no regard for other nets, and hence causing all subsequent nets to work-aroundthose connections. This is not desirable; we need to allow existing connections to be ripped up andrelocated if it will lead to a globally better solution.This is accomplished by modifying the routing cost function used by the neighbour expansionprocedure in the directed search routing algorithm used to build the network. Although we do not wishto force nets, when encountering a Vrouting node that is part of a different connection, to connect to thesame trace-pin, we do wish to make it preferential to do so in order to minimize the routing search space114and hence runtime. By default, the routing cost function used for all nodes inside VPR is:cost = back cost + this cost +astar? expected cost (6.1)where back cost is the congestion cost up to the current node, plus the this cost of the node under con-sideration, and then the expected cost to the target scaled by an aggressiveness factor. Instead, for Vroutingnodes that are already part of another connection, we omit the expected cost and discount this cost bythe occupancy of the new node, which indicates how many nets are already using it:cost ? = back cost +this costnode occupancy(6.2)The intuition here is that the more nets that already pass through this node, the less likely it will be movedin subsequent routing iterations, and the more seriously the routing algorithm should consider it. A lowercost causes the preferred node to be removed from the heap much sooner than it would be otherwise,allowing the routing algorithm to follow the established connection to the trace-pin, yet does not forcethe router to take only this path. It must be noted that the new cost of a node must not take a value lessthan its predecessor (by discounting back cost or using a negative value for this cost) otherwise the toolwill enter an infinite loop in which the cost of each node is continuously reduced.6.4.2 Debug-Time MatchingGiven a designer-specified signal selection, we then process the text file describing the overlay networkto build a custom bipartite graph containing only the desired signals, before applying the Hopcroft-Karpalgorithm (as implemented in [45]) to find a maximum matching. A downstream tool can then be usedto determine which signals to connect to which trace-pins, and thus compute the routing multiplexer bitsrequired using a simple greedy algorithm. Subsequently, these bits can then be statically or dynamicallyreconfigured onto the FPGA.In the absence of a large collection of realistic signal selections for each of our benchmark circuits,we have evaluated the feasibility of our work using random signal selections. Although automated signalselection algorithms, such as those described in Chapter 3 or proposed by [82, 93] may be used, thesewould only generate a handful of data-points. Instead, we would like to understand how the trace-buffernetwork fares for any signal that a designer may wish to select by using a sufficiently large sample size.115Max Comb&Seq. |Vtrace|Circuit Conn. |Vsignals| Excl. 
Pinsor1200 15 3483 2 720mkDelayWorker32B 5 7439 1 72stereovision1 19 16211 11 3024stereovision0 20 14937 12 3024LU8PEEng 8 27657 3 1944stereovision2 23 46646 - 11088bgm 22 34966 66 7128LU32PEEng 11 95026 1 7344mcml 25 106555 4 17784Table 6.3: Maximum network connectivity results.For our experiments, we generated 100,000 signal selections randomly each at a different fraction ofthe trace-buffer network?s capacity: from 0.1 to 1.0 in 0.1 increments. For example, if the trace-buffernetwork had a total of 720 input pins as for the or1200 benchmark, then we randomly generated selectionsof 72 signals, 144, up to the full 720 for a total of 1,000,000 signal selections per circuit.6.5 Results6.5.1 Maximum Network ConnectivityTable 6.3 shows the maximum network connectivity ? the maximum number of trace-input trees thateach signal belongs to, the number of circuit signals and trace-buffer inputs that exist for each circuit.For all but one of the nine benchmarks, not every internal signal could be incrementally connected tothe trace-buffer network due to routing congestion. Upon further investigation, we found that only netsabsorbed locally within a logic cluster suffered from this difficulty, caused by an inability to exit thecluster due to a lack of free resources in its vicinity ? this phenomenon is investigated in Appendix B.Unlike those nets that already had a presence on the global interconnect, for these local nets a new globalroute needed to be made from scratch using only the resources leftover from the original mapping. Invery rare cases, this would be impossible. The number of signals that did fail in this manner are shownin the Excl. column, and represents at most 0.2% of all available combinational and sequential circuitsignals. The number of trace-buffer inputs that exist for each circuit are also shown. As described earlier,this network connectivity parameter represents the guaranteed minimum number of signals for which adesigner can observe any combination of, though in practice many more can be selected.1164 8 12 16 20 24 28 32 36 40 44 48 52 56 60 64 68 72Trace-Buffer Pin Index110100100010000Signal Fan-in (log scale) <=Avg>Avg(a) mkDelayWorker32B with 5 connections per signal.720 2880 5040 7200 9360 11520 13680 15840Trace-Buffer Pin Index1101001000Signal Fan-in (log scale) <=Avg>Avg(b) mcml with 21 connections per signal.Figure 6.8: Signal fan-in for overlay network ? histogram of the number of signals connecting toeach input.Figure 6.8a shows a histogram of the signal density at each trace-buffer input for the mkDelay-Worker32B benchmark circuit, which contains only a single free memory-block to be reclaimed as atrace-buffer. In this circuit, all but five combinational or sequential signals can each be connected to5 different trace-inputs. It would be expected that, on average, each input-pin would be connected to7438?572 ? 517 signals, though in this particular instance 35 trace-pins exist which connect less than thisvalue (with a minimum number of 5) leaving 37 pins which connect 517 or more signals (including amaximum of 1393 signals ? almost 20% of all on-chip signals). A histogram for the largest circuit atour disposal, mcml, is produced in Fig. 6.8b, where it would be expected that each trace-input would bethe target of approximately 126 signals. 
Here, a smaller proportion (40%) of trace-inputs are connectedto by more than this value, indicating that there are some trace-buffers which are easier to access (i.e.more centrally located, as indicated in the histogram which shows a vertical scan-line ordering acrossthe chip) than others, or that a tipping point is reached whereby it is cheaper for the routing tool to createnew branches onto existing trees than to create entirely new trees.117If full observability into all circuits signals was strictly necessary, it may be possible to achievethis by increasing the channel width or the cluster output flexibility (Fc out). We have observed thatalthough increasing the channel width slack from Wmin + 30% to +50% had only a minimal effect onits network connectivity, in five of the nine benchmarks, all circuit signals can now be connected to theoverlay network, whilst of the remaining four circuits, at most only 5 signals were impossible in theworst-case: bgm.6.5.2 Average Match SizeFigure 6.9 shows the average match size returned by the maximum matching algorithm, where each data-point represents a sample size of 100,000 randomly-generated signal selections. This figure representsthe average number of signals that can be simultaneously forwarded across the overlay network. Thedotted lines of this graph show the number of signals requested by a designer, whilst the solid linesrepresent the average number of signals that can be forwarded across the network for observation. Wherethe lines coincide indicate that a complete-match was made, where the lines diverge indicate that onlypartial-match was possible.In Figure 6.9a, which corresponds to the mkDelayWorker32B benchmark, we can see that the proba-bility of observing all of the desired signals decreases after approximately 40% of the network capacity? that is, after 29 signals are selected from a total capacity of 72. At full capacity, on average only 54 ofthe 72 signals requested can be matched. This is not a surprising result given the memory-constrainednature of this circuit, which contains only one free memory-block available for use as a trace-buffer. Asstated in the previous section, the average number of signals that each trace-pin is expected to support forthe mkDelayWorker32B circuit is 517 of the total 7439 signals; each time a trace-pin is used, 516 othersignals are blocked from using this same pin, drastically reducing the flexibility of the overlay network.In contrast, the remaining non memory-limited circuits presented in Figures 6.9b and 6.9c showmuch more promising results: in most cases, the network can fully connect up to 80?90% of the trace-buffer capacity before conceding. So far, we have not investigated how signals can be prioritized ? forexample, a signal bus may only be useful if all of its bits are observed simultaneously. This can beachieved currently by requesting fewer signals overall, to reduce matching congestion, however, in futurewe would like to investigate the use of weighted matching techniques to do so.118100% of any 29-bit signal76% of any 72-bit signal(a) Circuit: mkDW.or1200: any 648-bit signalLU8PE: any 1361-bit signalstereo1/0: 100% of any 2722-bit signal(b) Circuits: or1200, stereo1, stereo0, LU8PE.LU32PE: any 5141-bit signalbgm: any 6415-bit signalstereo2: any 8870-bit signalmcml: 100% of any 16006-bit signalmcml: 97% of any 17784-bit signal(c) Circuits: stereo2, bgm, LU32PE, mcml.Figure 6.9: Average match size ? 
number of arbitrary signals that can be simultaneously forwardedby the overlay network to trace-buffers; dotted lines indicate the requested number of signals,solid lines indicate the average number of signals matched119Figure 6.10: Network connectivity and match quality for bgm.Figure 6.10 graphs how the number of signals observable through the overlay network varies withthe network connectivity parameter, when applied to the bgm circuit. Intuitively, the more times thateach signal is connected to a trace-buffer input, the less likely it will be blocked when a different signalis picked. However, these results show that it may not be necessary to connect each signal as many pinsas possible ? reducing this network connectivity parameter to 10 or 15 has no effect on the signalsobservable when requesting 90% trace capacity, and only a 2 or 5% reduction at 100% capacity whencompared with the maximum connectivity value of 22.6.5.3 RuntimeFigure 6.11a shows the total VPR runtime for building the overlay network, for each connectivity pa-rameter possible, averaged over 10 tries. An X value of zero indicates the baseline measurement, whichdoes not include any trace-buffers nor an overlay network. X values greater than zero specify the totalruntime for the standard CAD stages: packing, placement, routing, as well as the additional stage ofincremental-routing to embed our overlay network. The difference in runtime represents the additionaloverhead of our overlay network; on average this is a 34% increase on the baseline, and in the worst-casethis reaches 76% for stereovision2. As expected, runtime increases with the network connectivity value,with the gradient increasing more rapidly towards the tail-end of each circuit as it reaches its breakingpoint. However, it may not be necessary to push each circuit to this point as Figure 6.10 from the previoussubsection showed. We anticipate that with greater focus on optimizing the CAD algorithms, we canfurther reduce this overhead.120(a) Compile-time: total VPR runtime for overlay network insertion at variousconnectivities; connectivity=0 represents baseline with no network or instrumen-tation.(b) Debug-time: maximum matching runtime, per selection.Figure 6.11: CAD overhead.The runtime for finding a maximum signal to trace-pin match is charted in Figure 6.11b ? this isthe average time required to recompute a matching network assignment to support new signal selections.In the worst case, for the largest mcml benchmark where the full trace-buffer capacity is requested,a solution can be computed in less than 50 seconds, with the relationship between runtime and thenumber of signals requested appearing to be linear. This contrasts with the time required to either fullyor incrementally recompile the circuit to create a new point-to-point trace-buffer configuration: in theprevious figure we observed 30,000 seconds to fully compile an uninstrumented instance of mcml, whilstthis was improved to a still-length 2,000 seconds in Chapter 5. 
Matching runtime can be improved furtherby implementing our techniques in a more efficient programming language instead of Python.121Original Instrumented ChangeCircuit T cpd (ns) T cpd (ns) (ns) (%)or1200 21.6 21.9 +0.3 +1.4%mkDelayWorker32B 7.4 7.6 +0.2 +2.7%stereovision1 5.1 6.1 +1.0 +19.6%stereovision0 4.1 5.8 +1.7 +41.4%LU8PEEng 134.0 136.0 +2.0 +1.5%stereovision2 15.0 16.6 +1.6 +10.7%bgm 25.8 27.1 +1.3 +5.0%LU32PEEng 134.2 137.4 +3.2 +2.4%mcml 96.7 98.7 +2.0 +2.1%Geomean 23.6 25.7 +2.0 +9.0%Table 6.4: Effect of overlay network on critical-path delay.6.5.4 Circuit DelayA comparison of the critical-path delay before and after inserting our virtual overlay network on eachof our benchmark circuits is shown in Table 6.4. Currently, because our CAD algorithms are routability-driven rather than timing-driven, on average the network incurs a 9.0% penalty (2.0ns) to the criticaldelay, with a worst-case of 41.4% (equating to 1.7ns) for stereovision0 which has the shortest critical-path. We believe that these results may be a little optimistic side due to the nature of the circuits andCAD tools involved, where the majority of the critical-path delay ? between 53% and 89% (geomeanat 72%) ? is made up of logic delay rather than routing delay.Given that we add our overlay network incrementally, that is, only after the original user circuit isfully-compiled, the critical-path delay of the newly instrumented design is due entirely to the connectionsadded by this network. If the observability that the trace infrastructure provides is not required, the circuitcan revert back to operating at its original, uninstrumented, clock frequency. During prototyping, however,it is unlikely that circuits will be operated at this maximum frequency, perhaps limited by off-chip (inter-FPGA) communication and hence timing degradation may not be a critical issue. Despite this, onepromising direction for future work is to apply pipelining techniques to the overlay network in order toreduce its effect on delay ? a technique particularly relevant for this application because any increasein signal latency will not affect its observability.6.6 SummaryThis chapter presented a method to allow any arbitrary combination of FPGA signals to be observedusing trace-buffers without recompilation. Typically, existing trace solutions require customized IP to be122inserted into the FPGA design prior to compilation. During debug, observation would then be restrictedto only those signals determined beforehand, until the design is recompiled with new IP. The contributionof this chapter is to propose that a virtual overlay network is added to the design to multiplex the inputsto trace-buffers. The key novelties of this work are in: (1) the network topology, which allows access toall on-chip signals whilst employing multiplexing resources efficiently, (2) the network implementation,which employs only the spare routing resources unused by the original circuit resulting in no logicoverhead, and (3) the CAD algorithms used to rapidly compute new network configurations for arbitrarysignal selections using bipartite-graph techniques.Experimental data shows that with a routing slack of 30%, for an overlay network that connects allsignals in a 100,000 LUT circuit to almost 18,000 trace-buffer inputs, any signal selection utilizing 90%of this trace capacity (16,000 signals) can be successfully connected in 50 seconds. 
The significance ofthis work is that designers can now eliminate compile-debug iterations during FPGA debug, achievinga turnaround time similar to logic simulation tools, but capable of operating the circuit many orders ofmagnitude faster.123Chapter 7ConclusionThis thesis has described four contributions that come together to enable more effective FPGA debug. Inthis concluding chapter, we first recap the key challenges of this task, before presenting a summary ofthe individual contributions, and the significance of each. Lastly, we will describe the current limitationsof this thesis, and suggest future research directions.7.1 Thesis SummaryCircuit verification is an increasingly difficult, but necessary, task within IC design. Traditional pre-silicon techniques, such as software simulation and formal verification, are being augmented with FPGAprototyping in order to maximize the verification coverage that can be achieved. Whilst FPGAs devicescan be used to implement circuits that can run at orders-of-magnitude faster than in simulation, the mainchallenge of using this platform is the lack of internal observability; when something goes wrong, thedesigner cannot immediately look at signal values inside the circuit to debug this error.An accepted solution for improving on-chip observability is to embed trace-buffers into the design.This allows a limited number of signal values to be recorded into on-chip memory, during real-timedevice operation, for off-line analysis. Key to the effectiveness of this trace instrumentation is whichsignals are selected for observation; the challenge is that these signals must be determined before thedesign is implemented, and before the nature of any bugs is known. Should the designer wish to observea different set of signals, the circuit must be recompiled at a cost of several hours. This can severelyimpact designer productivity.124The first contribution of Chapter 3 presented a post-silicon ?debug difficulty? metric to evaluate trace-based instruments embedded into both ASICs and FPGAs, as well as three methods for automaticallydetermining new signal selections to connect to those instruments. Our metric quantifies the accuracy atwhich the circuit?s present, past and future state can be resolved ? intuitively, this corresponds to whata debug engineer would want ? the more accurately a circuit?s state is known, and the more states thatcan be eliminated from consideration, the smaller the search space for any erroneous behaviour will be.A key challenge that was overcome in this chapter was how to compute the circuit state space for largecircuits, we do so by employing approximate techniques.To showcase this debug metric, we developed three different signal selection techniques on whichwe evaluated their effectiveness and scalability: a method which aims to minimize the expected debugdifficulty of the flattened circuit, a method which selects the most influential nodes of a circuit based onits logical connectivity using graph techniques, and a hybrid method which combines both of these tocompute the expected difficulty of large circuits through exploiting its inherent hierarchy. Experimentaldata showed that whilst the expected difficulty of the flat circuit gave the best debug metric, it was alsothe slowest technique, and infeasible to compute for large circuits. 
To showcase this debug metric, we developed three different signal selection techniques, on which we evaluated their effectiveness and scalability: a method which aims to minimize the expected debug difficulty of the flattened circuit, a method which selects the most influential nodes of a circuit based on its logical connectivity using graph techniques, and a hybrid method which combines both of these to compute the expected difficulty of large circuits by exploiting their inherent hierarchy. Experimental data showed that whilst the expected difficulty of the flat circuit gave the best debug metric, it was also the slowest technique, and infeasible to compute for large circuits. On the other hand, the graph technique was by far the fastest and worked on all circuits (computing the signal selection of a 50,000 flip-flop benchmark in less than 90 seconds) but returned lower-quality results. The hybrid method gave a good compromise between these two extremes. This level of scalability far exceeds that presented in prior work [82, 93, 160]. The significance of this contribution is that it can lead to a more effective use of debug instrumentation, and was published in papers [64, 67].

Chapter 4 described the second contribution of this thesis: the concept of speculative debug insertion for FPGAs. In the proposed flow, trace-buffers and signal observation circuitry are automatically and transparently inserted before compilation, allowing key internal signals to be recorded during normal circuit operation. The primary difference between our speculative flow and the more traditional flow is that in the latter, debug instrumentation is only added after a bug is found, often necessitating a long recompile cycle. Our speculative flow can give designers a head-start when debugging newly observed errors, and during the debugging process, can also be used to intelligently supplement a designer's manual signal selections, thereby leading to greater on-chip visibility. The automated nature of our flow may also be advantageous for engineers that are using third-party IP and may not have an intimate understanding of the internals of the circuit, and hence find it difficult to select important signals.

Two important considerations were identified to be necessary for the success of this speculative flow. First, such a flow must not require any user intervention, so it is necessary to have an algorithm that can automatically determine which signals may be valuable for debug before the circuit is compiled. This capability was enabled by the techniques described in Chapter 3. Second, it is important that the tool have an understanding of the overhead implications of inserting additional logic, to ensure that the inserted logic does not result in a reduction in speed, or an increase in power or CAD runtime. This challenge was addressed with experiments that investigated the limits of automatic insertion to determine how aggressive such a tool can be. Finally, we have encapsulated our ideas into an automated tool and showed a successful example application. The results in this chapter show, however, that as long as there are spare resources available, significant observability can be added without negatively affecting the quality of results. The significance of this contribution is that it can reduce the number of compilation turns necessary during FPGA debug, and was published in papers [63, 67]. A working implementation of these techniques was released to the community (available for download from http://ece.ubc.ca/~stevew).

The third contribution of this thesis can be found in Chapter 5, where a method of using incremental-compilation techniques to accelerate FPGA trace-insertion (and subsequent modification) is presented. Rather than recompiling the circuit from scratch, we propose that the original circuit mapping is completely preserved, and that new trace connections be made using only the spare resources that were not previously used.
For this to be feasible, we made several optimizations to the incremental CAD algorithms in order to exploit the unique nature of incremental-tracing: by recognizing that a circuit signal can be observed by connecting any point of its net to any available input pin of any available trace-buffer, and by taking advantage of the internal symmetry of the FPGA architecture, the flexibility for making incremental trace connections was vastly increased.

We evaluated our incremental-tracing technique, applied after the FPGA mapping procedure is complete, by comparing it with two other strategies for trace-insertion: inserting prior to the mapping procedure, and inserting part-way through this procedure. We found that, when targeting an FPGA with a channel width 20% greater than its minimum, our proposed post-map insertion technique is on average 98X faster than pre-map trace-insertion, and 22X faster than mid-map insertion. Furthermore, we find that the post-map solution has a smaller wirelength than solutions returned by both the pre- and mid-map strategies, and also that post-mapping insertion has only a small effect on the critical-path delay of the circuit, with less variance. The significance of this chapter is that engineers can now turn around between different debug instruments more quickly, and it is complementary to the idea of speculative debug insertion. Work from this chapter was published in papers [62, 65], and an implementation of the techniques described was released to the community.

The fourth and final contribution of this thesis is presented in Chapter 6. Existing academic work [53] on trace instruments, including that described in Chapter 5, and many of the current commercial offerings [8, 155] require a designer to preselect the signals they wish to observe at compile-time, after which dedicated point-to-point connections are made for each signal to a trace-buffer. In this chapter, we proposed a method which allows designers to look at any subset of combinational or sequential signals in their FPGA circuit at debug-time, relieving them of the need to predetermine a selection beforehand. Due to on-chip memory constraints, it is not possible to make a dedicated trace connection for each signal; hence, we pass each signal through an overlay network which multiplexes these connections between the available trace-buffers, thus allowing signal selection to be deferred to debug-time. Unlike a similar approach adopted in Certus [137], we do not use soft-logic for this purpose, opting instead to utilize spare switch- and connection-block routing multiplexers that form part of the FPGA fabric. Because we reclaim these routing multiplexers from those that were left over in the original circuit mapping, the area overhead of our technique is essentially zero.

Due to routing requirements, it would be impractical to build a fully-populated crossbar network in which any input combination can be forwarded to any outputs; hence we build a blocking network in which a reduced amount of connectivity exists. To decide which of the input signals to connect to the output trace-pins, we apply a maximum matching algorithm to the bipartite graph that represents our network to find the optimal solution with the highest number of observed signals. Once this assignment has been determined, the configuration memory of those routing multiplexers is reconfigured using either static or dynamic techniques.
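To illustrate the nature of this matching step, the fragment below sketches it using the NetworkX library; the function name, assign_signals, and the reachable_inputs data structure are ours for illustration, and the released tool does not necessarily follow this interface.

    import networkx as nx
    from networkx.algorithms import bipartite

    def assign_signals(selected_signals, reachable_inputs):
        """selected_signals: the signal names chosen at debug-time.
        reachable_inputs: maps each signal to the trace-buffer inputs it can
        reach through the overlay network (a hypothetical data structure)."""
        selected = set(selected_signals)
        g = nx.Graph()
        g.add_nodes_from(selected)
        for sig in selected:
            for pin in reachable_inputs.get(sig, ()):
                g.add_edge(sig, ("pin", pin))
        # Hopcroft-Karp maximum-cardinality matching: observe as many of the
        # selected signals simultaneously as the blocking network permits.
        matching = bipartite.maximum_matching(g, top_nodes=selected)
        return {s: v[1] for s, v in matching.items() if s in selected}

Any selected signal absent from the returned dictionary could not be accommodated by the blocking network for that particular selection.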
Our experiments have shown that for the majority of the benchmark circuits that were investigated, we were able to build an overlay network connecting to over 99.8% of all circuit signals whilst increasing initial CAD runtime by an average of 34%. However, once this network has been built, it can be reconfigured as many times as necessary to forward any set of signals through the overlay network to approximately 80–90% of the on-chip trace capacity (16,000 signals for our largest circuit) in no more than 50 seconds. The significance of this contribution is that designers can rapidly change the observed signal-set without recompilation, in a fashion that they are accustomed to when using logic simulators. This work was published in papers [66, 68].

Compared to the incremental-tracing techniques of Chapter 5, the virtual overlay network described in Chapter 6 does have a higher overhead of 34% for the initial compilation runtime. For early-stage designs undergoing frequent changes which necessitate a full recompilation, the most suitable debug technique may be to use incremental-tracing to insert trace-buffers as needed, on a per-revision basis. However, once the design has stabilized, rather than incrementally re-routing the circuit for each new debug turn, it may be beneficial to invest in the one-off cost required to construct an overlay network. This way, any overhead will be negated by the runtime savings made over multiple debug turns with the same user circuit.

The four contributions of this thesis link together in the following manner: the virtual overlay network described in Chapter 6 utilizes the incremental-tracing techniques from Chapter 5 to be installed efficiently. Whilst this network can be installed manually, there is also no reason why it cannot be speculatively inserted, as proposed in Chapter 4. Once the network has been installed, designers can use the automated signal selection techniques from Chapter 3 to make the most effective use of limited on-chip trace memory. As a consequence of this thesis, designers are able to verify integrated circuits more quickly, and achieve a faster time-to-market.

7.2 Current Limitations and Future Work

This section will elaborate on the current limitations of this thesis, and suggest directions in which future research can address these.

7.2.1 Post-Silicon Debug Metric and Automated Signal Selection for Trace Instruments

Whilst our post-silicon debug metric scales better than previous work [82], it is still only capable of analyzing circuits containing at most several thousand latches. In particular, we have observed that certain types of circuits, such as processor cores, were particularly challenging. We believe that the primary reason for this is that our debug metric relies on knowledge of the reachable state space of the circuit, which we currently compute using Binary Decision Diagram (BDD) based techniques. Even though we have applied approximate reachability methods, BDD-based techniques are known to still find specific logic structures, like the multiplier and divider blocks common inside processor cores, difficult to compute [61].

One possible solution to this problem is to utilize even coarser approximations for reachability computation: the state space decomposition method [31] that we currently employ is guaranteed to return only a superset of the exact reachability set. In this way, the decomposition method is suited for verifying circuit behaviour; for example, if an illegal circuit state was not present in the approximate reachable set, then it will also not be in the exact reachable set, and thus it can be guaranteed that the circuit can never transition into this erroneous state. However, for computing our debug metric, given that we require only an estimate of the size of each partition induced by traced signals, the strict guarantee of an over-approximation is not necessary. Another possibility would be to investigate satisfiability (SAT) based techniques, which have been shown to scale better than BDDs [55] and have been used successfully for automating root-cause analysis [78].

Another area of improvement lies with the definition of our debug metric. Currently, the metric captures the "volume" of states that expands from a single sample of traced signal data, captured in one clock cycle, and averages this across all samples in the trace-buffer. This method does not capture the intuition that trace data across different samples (clock cycles) can be used to refine each other's volumes. For example, as illustrated by Fig. 7.1, the image of possible states at time=0, I(R|t=0), can be used to constrain the set of possible states at time=1, R|t=1, by finding their intersection, and conversely so. Consider a 4-bit up-counter: by tracing the most-significant two bits for four cycles, it would be possible to compute all four bits of the counter for all four cycles, whilst this would not be possible if the least-significant two bits were traced. By analyzing more than one trace sample at a time, this relationship would be recognizable.
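This up-counter observation can be checked exhaustively with a few lines of Python; the fragment below is a toy sketch written purely for illustration of the point above.

    def consistent_starts(traced_bits, samples):
        """Start values of a free-running 4-bit up-counter whose traced bits
        match 'samples' over consecutive cycles."""
        matches = []
        for start in range(16):
            observed = [tuple((((start + t) % 16) >> b) & 1 for b in traced_bits)
                        for t in range(len(samples))]
            if observed == samples:
                matches.append(start)
        return matches

    # Tracing the two most-significant bits for four cycles always pins down a
    # unique start state, whereas tracing the two least-significant bits always
    # leaves four candidates (the untraced MSBs remain completely unknown).
    for start in range(16):
        msb = [tuple((((start + t) % 16) >> b) & 1 for b in (3, 2)) for t in range(4)]
        lsb = [tuple((((start + t) % 16) >> b) & 1 for b in (1, 0)) for t in range(4)]
        assert len(consistent_starts((3, 2), msb)) == 1
        assert len(consistent_starts((1, 0), lsb)) == 4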
Figure 7.1: Future work: refining the debug metric by constraining the set of possible circuit states at t=1 with the image of states from t=0.

7.2.2 Speculative Debug Insertion for FPGAs

The speculative debug insertion concept presented in Chapter 4 was validated on just two large, industrial-quality circuits, targeting an Altera Stratix III FPGA device. As part of future work, it would be pertinent to explore the effect of speculative instrumentation on more circuits, with different functionalities, and to investigate its effects on more modern FPGA architectures, which now contain more flip-flops to aid pipelined circuits [91]. In addition, we would like to explore the use of other debug IP products instead of Altera SignalTap II, which requires that each trace signal passes through (a non-configurable) four pipelining registers before arriving at the trace-buffer. Our results found that this was more pipelining than was necessary, and resulted in a significant logic overhead, as each register occupied one ALM (Logic Element) resource. For example, we discovered that multiple registers on the same trace signal would often be packed into the same logic cluster, defeating the purpose of pipelining. We believe that reducing this number would also reduce the runtime overhead of speculative insertion, given that this will reduce the number of objects that downstream CAD algorithms would be required to operate on. An alternative product to SignalTap II would be Tektronix Certus.

Presently, a limitation of our In-Spec tool is that it requires the circuit to be compiled by Altera Quartus II twice.
The first pass is necessary to produce a mapped circuit netlist, from which a signal selection can be automatically computed, and used to configure and instantiate the debug instruments. A second pass is then used to recompile the instrumented circuit. We believe that the impact of this two-pass approach can be reduced through using Quartus II's general-purpose incremental-compilation techniques, or perhaps eliminated altogether, by forcing the CAD tool to re-use all previous results through employing the methods described in the QUIP interface [6].

7.2.3 Accelerating FPGA Trace-Insertion Using Incremental Techniques

A key assumption made in Chapter 5 was that triggering, or determining when to start or stop recording signal data into trace-buffers, is controlled off-chip by an external source. Although the depth of trace-buffers (often hundreds, if not thousands, of samples deep) can be used to hide the added latency of off-chip triggering, commercial tools such as ChipScope, SignalTap II, Certus and Identify all support on-chip triggering. We believe that our techniques can also be applied to on-chip triggering; instead of incrementally-connecting circuit signals to spare RAMs and reclaiming them as trace-buffers, our techniques could be used to connect those same signals to spare general-purpose LUTs to implement the required trigger logic. However, a significant difference between the two tasks is that, whilst previously it was sufficient to independently connect a circuit signal to any free RAM input, this same flexibility does not exist during triggering, where multiple circuit signals must be reduced into a single signal, for example, to use as a clock- or write-enable.

In the most common case, triggering circuitry is used to monitor a small set of signals for a specific bit-pattern, such as a particular bus address, a value on a control register, or an illegal state within a state machine. This requires each trigger signal to be individually XOR-ed with its corresponding bit constant (which can be implemented as a 2:1 multiplexer) before reducing all results using an AND gate. For a 32-bit trigger, in the most area-efficient case (such as that returned by general-purpose incremental-compilation methods), this would require seven 6-input LUTs, all packed into one logic cluster, as illustrated in Fig. 7.2a. However, this would present a highly constrained routing problem, as all 32 trigger signals need to be routed to this same cluster. At the other extreme, in the least area-efficient case, this would require one LUT/logic cluster per trace signal for each 2:1 multiplexer, and a binary tree of thirty-one LUTs/logic clusters to implement the 32-input AND gate (Fig. 7.2b). In this scenario, incremental techniques would have maximum flexibility to dynamically distribute the placement and routing of this triggering logic across any of the spare LUTs available on the device, if sufficient resources existed, and at the expense of timing.

Figure 7.2: Reclaiming soft-logic for on-chip triggering circuitry (example: 32-bit pattern detection). (a) Max. area-efficiency, min. routing flexibility. (b) Min. area-efficiency, max. routing flexibility.
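The soft-logic cost at these two extremes can be estimated with a short Python sketch; this is only a back-of-the-envelope estimate under the stated assumptions (K-input LUTs, 2:1 multiplexers per bit), not a measured result.

    import math

    def dense_trigger_luts(n_bits, k=6):
        """Most area-efficient packing: each K-input LUT compares K trigger bits
        against their constants, then further LUT levels AND the partial matches."""
        total = level = math.ceil(n_bits / k)
        while level > 1:
            level = math.ceil(level / k)
            total += level
        return total   # dense_trigger_luts(32) == 7, packable into one cluster

    def sparse_trigger_luts(n_bits):
        """Least area-efficient packing: one LUT (and cluster) per bit comparison,
        plus a binary tree of 2-input ANDs occupying n_bits - 1 further clusters."""
        return n_bits + (n_bits - 1)   # sparse_trigger_luts(32) == 63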
An interesting CAD problem would be to explore trade-offs between these two extremes.

Alternatively, spare general-purpose soft-logic resources can be reclaimed to improve incremental-routing flexibility further, by operating LUTs as "pass-throughs". In this mode, LUTs are configured as single-input logic buffers which transmit the signal unaltered. The advantage of this is that whilst each routing track can typically access three other routing tracks from each switch-block, passing through a LUT inside a logic cluster can provide significantly more routing flexibility; typically, this can be 10% of all adjacent routing tracks. A complementary approach for making use of leftover soft-logic is to also reclaim the local routing multiplexers inside each cluster, as well as the flip-flops that are paired with each spare LUT to pipeline trace connections, in order to improve circuit timing. Eguro and Hauck describe a pipeline-aware routing algorithm in [43] to solve the N-delay routing problem, where signals must pass through exactly N pipeline registers to maintain correct circuit functionality. However, given that during debug, observed signals are simply recorded into deep trace-buffers and not consumed by the circuit, trace connections are latency-insensitive and need not abide by this constraint. Any unbalanced paths can be realigned automatically by off-line tools before trace data is returned to the user.

Lastly, we would like to investigate the feasibility of realizing our techniques on commercial FPGA devices. Although we have proved our techniques on a mature academic CAD toolchain targeting theoretical architectures [121], we strongly believe that our methods are applicable to commercial parts, which share the same fundamental, prefabricated, island-style structure. Traditionally, the barrier to pursuing this has been a lack of access into the low-level details of commercial architectures, in particular to routing resource data, which may be considered proprietary trade secrets. More recently, toolkits such as RapidSmith, Torc and GoAhead [17, 85, 128] have emerged to provide an API abstraction layer for accessing the internal resources of Xilinx architectures that are exposed by the Xilinx Design Language. However, these toolkits may only implement simplistic placement and routing algorithms (for example, non-timing-driven or breadth-first variants). We have already begun to address this limitation in [69], which allows the packing and placement (but currently, not routing) results produced by the academic flow to be translated exactly into an implementation on Xilinx FPGAs.

7.2.4 A Virtual Overlay Network for FPGA Trace-Buffers

The techniques of Chapter 6 assumed that the designer needed maximum observability, and hence, each trace-buffer was configured to be as wide, but as shallow, as possible: 72 bits by 2048 samples. Many alternative configurations exist, such as 36b by 4096, 18b by 8192, or 9b by 16384, and it would be interesting to explore how changing this affects the quality-of-results.
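All of these configurations simply repartition the same 147,456-bit block; as a trivial sketch of the resulting trade-off between trace inputs per block (overlay-network sinks) and capture depth:

    BLOCK_BITS = 72 * 2048   # capacity of one trace-buffer RAM block (147,456 bits)

    for width in (72, 36, 18, 9):
        depth = BLOCK_BITS // width
        print(f"{width:2d} trace inputs per block x {depth:5d} samples deep")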
Of particular interest is that, on one hand, decreasing the width of each trace-buffer would appear to reduce the flexibility of the virtual overlay network, given that each signal would have fewer trace-buffer inputs to connect to; however, fewer trace inputs would also result in more routing multiplexers being available for each remaining input.

Chapter 6 also follows the prior chapter in assuming that triggering is driven by an off-chip source, but as discussed in the previous subsection, many reasons exist for why this can be inadequate. However, whilst it would be desirable for this triggering circuitry to be integrated on-chip, the previous proposal for inserting trigger circuitry incrementally would also be insufficient for Chapter 6, which aims to eliminate all forms of recompilation. Instead, we would like to use the virtual overlay network to connect signals to trigger circuitry just as easily as it can do for trace-buffers.

We believe this can be achieved by utilizing spare RAM blocks to realize trigger functionality, rather than just as trace-buffers. More specifically, instead of using the virtual overlay network to connect circuit signals to the data-input lines of a RAM block as with a trace-buffer, the same network can also be used to connect those signals to the address lines of the RAM. Thus, each spare RAM block can also be reclaimed as a LUT (as proposed by [151]) with a number of inputs equivalent to the maximum number of address lines of the RAM (in the FPGA architecture based on the Stratix IV family used, this would be 14 bits). This would provide designers with the ability to trade off observability with trigger precision. An alternative approach would be to also connect the overlay network to spare DSP blocks in the FPGA and to reclaim their functionality; Xilinx DSPs support 48-bit pattern-detection [154]. Future work should evaluate both of these techniques.

Currently, the concrete upper bound for the number of signals (of any arbitrary selection) for which the virtual overlay network behaves as a non-blocking network is equal to its network connectivity, that is, the number of times each circuit signal has been connected to a different trace-buffer input. However, we believe the true value is much higher, as the current bound considers only the absolute worst case, which is when the network connectivity is equal to the number of trace-inputs. For example, this would hold if the number of trace-inputs and the network connectivity were both 2, meaning that each signal in the circuit was connected to both inputs, and hence any combination of 2 signals (or fewer) can be observed simultaneously. This is illustrated in Fig. 7.3a.

However, if the number of trace-inputs doubled to 4, but the network connectivity stayed the same (as in Fig. 7.3b), then it would be likely that the 2 connections that each circuit signal makes would be more evenly distributed across those 4 trace-inputs. In the small example pattern illustrated, this can be hand-calculated to be 3, meaning that any combination of up to 3 signals can be forwarded over this network (but not all 4 signals, failing on "ABCD" for example).

Figure 7.3: The difficulty with determining an upper-bound for non-blocking behaviour. (a) Non-blocking behaviour for any 2 signals. (b) Non-blocking behaviour for any 3 signals.
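To make this concrete, the short brute-force check below (using an assumed connection pattern, not the exact one drawn in Fig. 7.3b) confirms that a connectivity-2 network with four trace-inputs can accommodate any three signals yet still block one particular selection of four.

    from itertools import combinations, permutations

    # Hypothetical connectivity-2 pattern: four signals, four trace inputs,
    # with input 4 left unreachable by any signal.
    reach = {"A": {1, 2}, "B": {2, 3}, "C": {1, 3}, "D": {1, 2}}

    def routable(signals):
        # Brute force: does some assignment give every signal its own trace input?
        return any(all(pin in reach[s] for s, pin in zip(signals, pins))
                   for pins in permutations(range(1, 5), len(signals)))

    assert all(routable(c) for c in combinations("ABCD", 3))   # any 3 succeed
    assert not routable(tuple("ABCD"))                         # but "ABCD" is blocked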
In practice, our experiments have shown that trace-inputs far outnumber the maximum network connectivity (for mcml, these values are 17,784 and 25 respectively), which leads us to believe that the true bound is indeed higher than its connectivity; furthermore, the experimental results of Section 6.5.2 indicate that the overlay network for mcml appears to behave as a non-blocking crossbar for randomly-selected signals up until approximately a 90% load, which corresponds to 16,006 observed signals. Future work should explore methods to formally improve on this upper bound.

Another limitation of the current work is that whilst we allow the user to tap any signal (from all LUT, FF, DSP and RAM outputs used) for observation, these are gate-level signals. Due to the logic synthesis and technology-mapping stages of the CAD flow, such signals may not correspond directly to those in the original high-level circuit description. Take, for example, an AND of two signals at the RTL level: this may be packed into a 6-input LUT along with its fan-in or fan-outs, in which case the output of the AND gate cannot be directly observed. Unless flip-flop re-timing is enabled, CAD tools will typically preserve all flip-flops in the design, thereby providing several points of reference. We believe opportunities exist, particularly in the ASIC-prototyping use case where there may be sufficient spare logic and timing slack, for the circuit to be optimized less aggressively so that more (if not all) combinational signals can also be preserved, perhaps by using synthesis attributes like (* syn_keep *). Alternatively, if it were unacceptable for the circuit to be modified, relevant portions of the circuit may be duplicated, unoptimized, into other spare parts of the FPGA device so that they may be traced.

In common with the previous subsection, we would also like to investigate the feasibility of implementing our virtual overlay network onto real devices (likely also enabled by the aforementioned toolkits) and subsequently by using either static bitstream modification or dynamic partial reconfiguration techniques to change the routing multiplexers in our network. Reference [16] showed that making such low-level modifications is possible, and can even be used for malicious purposes.

7.2.5 Long Term Directions

Looking further, we would like to explore the possibility of using the virtual overlay network described in Chapter 6 for validation, and even non-validation, applications. As an example, we are interested in exploring opportunities to use our network to measure test coverage without recompilation, but unlike our proposal in [68], rather than using the RAM block as a trace-buffer to record a consecutive window of signal data, the memory could be used more efficiently to only record a flag of whether a rising or a falling transition had been observed. Alternatively, instead of simply observing signals, it may also be interesting to use the overlay network to control or override signal values in the circuit, to implement silicon fixes, or to evaluate error resilience.
Away from verification and validation, there may also be promise in applying the overlay network to facilitate regular on-chip communication, for example, as part of a Network-on-Chip.

Chapter 6 makes the abstraction that the virtual overlay network must be composed of a disjoint union of trees; thus, forwarding signals across the blocking network can be solved efficiently using bipartite graph techniques. During informal discussions at a conference, it was suggested that this restriction could be relaxed by representing the problem as a flow network, for which known algorithms exist to optimally compute the maximum flow. This would likely lead to even better connectivity for the overlay network. In brief, a flow network is a directed graph with weighted edges (arcs) that record the flow capacity that each arc can support; the maximum flow problem computes the maximum flow rate that the entire network can support, from all signal sources (attached to a supersource) to all signal sinks (attached to a supersink). This objective matches that of our virtual overlay network, which is to connect as many selected signals as possible, at most once, through to any trace-buffer pin, in any order, with each pin supporting at most one signal. Figure 7.4 illustrates the flow network corresponding to the tree network in Fig. 6.3b, in which all arcs have a flow capacity of 1; the constraint that each signal (source) and trace-buffer input (sink) is used only once is captured using virtual nodes with only one edge.

Figure 7.4: Network flow representation of virtual overlay network.

A self-imposed limitation placed on the work presented in Chapters 5 and 6 is that debug infrastructure is restricted to using only the resources that were left behind by the original circuit mapping. Whilst this led to some advantageous properties, such as the ability to avoid the chaotic nature of CAD heuristics, as well as being able to restore the original circuit timing by disregarding instrumentation, we would like to investigate the effect of relaxing this limitation. For example, by mapping the user circuit and overlay network simultaneously, as opposed to in two sequential passes, or perhaps by pre-reserving some routing wires specifically for debug, even more observability may be possible.

Lastly, all of the work presented in this thesis has targeted traditional FPGA architectures, whether that be academic or commercial variants. The possibility of changing this underlying device architecture opens up a new world of opportunity. One way to tune the FPGA itself to make it more amenable for debugging would be to revisit the architectural design decisions (such as the number of inputs to each LUT, the number of logic elements in each cluster, or the global interconnect pattern) which were originally made to optimize the area-delay-power metrics of the fabric, and explore how those decisions influenced the effectiveness of our techniques. Grouping 10 logic elements into one logic cluster may be optimal for area-delay [5], but may not be the best for "debug-ability", and analytical models used to guide new FPGA development [36] may be useful here.

Rather than simply tuning the architectural parameters, an even more radical approach would be to investigate the feasibility of employing silicon area specifically to aid debug. As a first step, we would consider integrating more hard-logic into the FPGA fabric, such as IP blocks dedicated to facilitating on-chip data compression or triggering.
Looking even further, we would like to explore the opportunities for embedding dedicated debugging infrastructure into the device, which can be used to provide guaranteed levels of observability, perhaps by using structures similar to global (clock) nets in existing FPGAs. A commercial example of a dedicated (but small) observability network can be found in Microsemi antifuse devices [100].

Bibliography

[1] H. Abdi and L. J. Williams. Principal Component Analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4):433-459, June 2010. → pages 47
[2] M. Abramovici, P. Bradley, K. Dwarakanath, P. Levin, G. Memmi, and D. Miller. A Reconfigurable Design-for-Debug Infrastructure for SoCs. In DAC '06: Proceedings of the 43rd annual Design Automation Conference, pages 7-12, July 2006. → pages 11, 20, 29, 35
[3] Achronix. Speedster22i HD FPGA Family. URL (Retrieved Feb. 2013): http://www.achronix.com/wp-content/uploads/docs/Speedster22iHD FPGA Family DS004.pdf, February 2013. → pages 16
[4] Aeroflex Gaisler. GRLIB IP Core User's Manual. URL (Retrieved Feb. 2013): http://www.gaisler.com/products/grlib/grip.pdf, January 2013. → pages 52, 66
[5] E. Ahmed and J. Rose. The Effect of LUT and Cluster Size on Deep-Submicron FPGA Performance and Density. In Proceedings of the 2000 ACM/SIGDA Eighth International Symposium on Field-Programmable Gate Arrays, FPGA'00, pages 3-12, February 2000. → pages 136
[6] Altera. Benchmark Designs For The Quartus University Interface Program (QUIP) Version 1.1. URL (Retrieved Feb. 2013): https://www.altera.com/support/software/download/altera design/quip/quip-download.jsp, February 2008. → pages 50, 52, 60, 61, 130
[7] Altera. Quartus II Handbook Version 11.1 Vol. 3: Verification. http://www.altera.com/literature/hb/qts/qts qii5v3.pdf, November 2011. → pages 77
[8] Altera. Quartus II Handbook Version 12.1 Volume 3: Verification. URL (Retrieved Feb. 2013): http://www.altera.com/literature/hb/qts/qts qii5v3.pdf, November 2012. → pages 8, 21, 27, 35, 44, 56, 75, 81, 104, 110, 127
[9] AMD. Revision Guide for AMD Family 10h Processors. URL (Retrieved Feb. 2013): http://support.amd.com/us/Processor TechDocs/41322 10h Rev Gd.pdf, March 2012. → pages 14
[10] E. Anis and N. Nicolici. On Using Lossless Compression of Debug Data in Embedded Logic Analysis. In Test Conference, 2007. ITC 2007. IEEE International, pages 1-10, October 2007. → pages 29, 38
[11] E. Anis and N. Nicolici. Low Cost Debug Architecture using Lossy Compression for Silicon Debug. In Design, Automation & Test in Europe Conference & Exhibition, 2007. DATE '07, pages 1-6, April 2007. → pages 29, 38
[12] S. Asaad, R. Bellofatto, B. Brezzo, C. Haymes, M. Kapur, B. Parker, T. Roewer, P. Saha, T. Takken, and J. Tierno. A Cycle-Accurate, Cycle-Reproducible Multi-FPGA System for Accelerating Multi-core Processor Simulation. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA'12, pages 153-162, February 2012. → pages 15, 21
[13] J. Babb, R. Tessier, M. Dahl, S. Hanono, D. Hoki, and A. Agarwal. Logic Emulation with Virtual Wires. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 16(6):609-626, June. → pages 16
[14] K. Balston, S. Wilton, A. Hu, and A. Nahir. Emulation in Post-Silicon Validation: Its Not Just for Functionality Anymore. In High Level Design Validation and Test Workshop (HLDVT), 2012 IEEE International, pages 110-117, November 2012. → pages 15
[15] K. Balston, M. Karimibiuki, A. Hu, A. Ivanov, and S. Wilton.
Post-Silicon Code Coverage forMultiprocessor System-on-Chip Designs. Computers, IEEE Transactions on, 62(2):242?246,February 2013. ? pages 31[16] C. Beckhoff, D. Koch, and J. Torresen. Short-Circuits on FPGAs Caused by Partial RuntimeReconfiguration. In Proceedings of the 2010 International Conference on Field ProgrammableLogic and Applications, FPL?10, pages 596?601, 2010. ? pages 135[17] C. Beckhoff, D. Koch, and J. Torresen. GoAhead: A Partial Reconfiguration Framework. InField-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th annualInternational Symposium on, pages 37?44, May 2012. ? pages 132[18] J. P. Bergmann and M. A. Horowitz. Improving Coverage Aanalysis and Test Generation forLarge Designs. In ICCAD ?99: Proceedings of the 1999 IEEE/ACM International Conference onComputer-Aided Design, pages 580?583, November 1999. ? pages 7, 31[19] Berkeley Logic Synthesis and Verification Group. ABC: A System for Sequential Synthesis andVerification, Release 70930. URL (Retrieved Feb. 2013):http://www.eecs.berkeley.edu/?alanmi/abc/, September 2007. ? pages 19[20] V. Betz, J. Rose, and A. Marquardt. Architecture and CAD for Deep-Submicron FPGAs. KluwerAcademic Publishers, Norwell, MA, USA, 1999. ISBN 0792384601. ? pages 17[21] H. Bian, A. C. Ling, A. Choong, and J. Zhu. Towards Scalable Placement for FPGAs. InFPGA?10: Proceedings of the 18th annual ACM/SIGDA International Symposium onField-Programmable Gate Arrays, pages 147?156, February 2010. ? pages 8[22] P. Bonacich and P. Lloyd. Eigenvector-like measures of centrality for asymmetric relations.Social Networks, 23/3:191?201, July 2001. ? pages 47[23] S. P. Borgatti. Centrality and network flow. Social Networks, 27(1):55?71, January 2005.? pages 46, 47[24] M. Bourgeault. Altera?s Partial Reconfiguration Flow. URL (Retrieved Feb. 2013):http://www.eecg.utoronto.ca/?jayar/FPGAseminar/FPGA Bourgeault June23 2011.pdf, June2011. ? pages 111138[25] Cadence. Cadence Palladium Series with Incisive XE Software ? Hardware/softwareco-verification and system-level verification. URL (Retrieved Feb. 2013):http://www.cadence.com/rl/resources/datasheets/incisive enterprise palladium.pdf, August2011. ? pages 21[26] A. Carbine and D. Feltham. Pentium Pro Processor Design for Test and Debug. Design & Test ofComputers, IEEE, 15(3):77?82, 1998. ? pages 23[27] C. Carmichael and C. W. Tseng. Correcting Single-Event Upsets with a Self-HostingConfiguration Management Core. URL (Retrieved Feb. 2013):http://www.xilinx.com/support/documentation/application notes/xapp989.pdf, April 2008.? pages 24[28] B. Caslis. Effectively Using Internal Logic Analyzers for Debugging FPGAs. FPGA andStructured ASIC Journal: URL (Retrieved Feb. 2013):http://www.eejournal.com/archives/articles/20080212 lattice/, February 2008. ? pages 22[29] D. Chen, J. Cong, and P. Pan. FPGA Design Automation: A Survey. Foundation and Trends inElectronic Design Automation, 1(3):139?169, January 2006. ? pages 19, 20, 34, 35, 75[30] S. Chin and S. Wilton. An Analytical Model Relating FPGA Architecture and Place and RouteRuntime. In Field Programmable Logic and Applications, 2009. FPL 2009. InternationalConference on, pages 146?153, August 2009. ? pages 4, 19[31] H. Cho, G. Hachtel, E. Macii, M. Poncino, and F. Somenzi. Automatic State SpaceDecomposition for Approximate FSM Traversal Based on Circuit Analysis. Computer-AidedDesign of Integrated Circuits and Systems, IEEE Trans. on, 15(12):1451?1464, December 1996.? pages 42, 57, 129[32] C.-L. Chuang, D.-J. Lu, and C.-N. J. Liu. 
A Snapshot Method to Provide Full Visibility forFunctional Debugging Using FPGA. In ATS ?04: Proceedings of the 13th Asian Test Symposium,pages 164?169, November 2004. ? pages 24[33] C.-L. Chuang, W.-H. Cheng, D.-J. Lu, and C.-N. J. Liu. Hybrid Approach to Faster FunctionalVerification with Full Visibility. IEEE Design and Test of Computers, 24(2):154?162, 2007.? pages 23, 24, 28, 32[34] Cisco Systems. Field-Programmable Device Upgrades. URL (Retrieved Feb. 2013):http://www.cisco.com/en/US/docs/routers/7200/configuration/feature guides/fpd.pdf, July 2007.? pages 14[35] O. Coudert, J. Cong, S. Malik, and M. Sarrafzadeh. Incremental CAD. In Computer AidedDesign, 2000. ICCAD-2000. IEEE/ACM International Conference on, pages 236?243, November2000. ? pages 34, 75[36] J. Das. Analytical Models for Accelerating FPGA Architecture Development. PhD thesis,University of British Columbia, 2012. ? pages 136[37] R. Datta, A. Sebastine, and J. Abraham. Delay Fault Testing and Silicon Debug Using ScanChains. In Test Symposium, 2004. ETS 2004. Proceedings. Ninth IEEE European, pages 46?51,May 2004. ? pages 23139[38] F. de Paula, A. Nahir, Z. Nevo, A. Orni, and A. Hu. TAB-BackSpace: Unlimited-length tracebuffers with zero additional on-chip overhead. In Design Automation Conference (DAC), 201148th ACM/EDAC/IEEE, pages 411?416, June 2011. ? pages 30[39] F. M. De Paula, M. Gort, A. J. Hu, S. J. E. Wilton, and J. Yang. BackSpace: Formal Analysis forPost-Silicon Debug. In FMCAD?08: Proceedings of the 2008 International Conference onFormal Methods in Computer-Aided Design, pages 1?10, November 2008. ? pages 28, 30, 64[40] S. Devadas, A. Ghosh, and K. Keutzer. An Observability-Based Code Coverage Metric forFunctional Simulation. In ICCAD ?96: Proceedings of the 1996 IEEE/ACM InternationalConference on Computer-Aided Design, pages 418?425, November 1996. ? pages 31[41] M. Dini. The Multi-FPGA Prototyping Platform. URL (Retrieved Feb. 2013):http://www.dinigroup.com/files/The%20DINI%20Group%20Multi-FPGA%20Prototyping%20Platform 9-4-10.pdf, September 2010. ? pages 15[42] S. Dutt, V. Shanmugavel, and S. Trimberger. Efficient Incremental Rerouting for FaultReconfiguration in Field-Programmable Gate Arrays. In Computer-Aided Design, 1999. Digestof Technical Papers. 1999 IEEE/ACM International Conference on, pages 173?176, November1999. ? pages 34[43] K. Eguro and S. Hauck. Armada: Timing-Driven Pipeline-Aware Routing for FPGAs. InProceedings of the 2006 ACM/SIGDA 14th International Symposium on Field-ProgrammableGate Arrays, FPGA?06, pages 169?178, February 2006. ? pages 132[44] J. Emmert and D. Bhatia. Incremental Routing in FPGAs. In ASIC Conference 1998.Proceedings. Eleventh annual IEEE International, pages 217?221, September 1998. ? pages 34,75[45] D. Eppstein. Hopcroft-Karp Bipartite Max-Cardinality Matching and Max Independent Set(Python Recipe). URL (Retrieved Feb. 2013):http://code.activestate.com/recipes/123641-hopcroft-karp-bipartite-matching/, April 2002.? pages 115[46] H. Foster. Challenges of Design and Verification in the SoC Era. URL (Retrieved Feb. 2013):http://testandverification.com/files/DVConference2011/2 Harry Foster.pdf, October 2011.? pages 2, 3, 20[47] T. Foster, D. Lastor, and P. Singh. First Silicon Functional Validation and Debug of MulticoreMicroprocessors. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 15(5):495?504, May 2007. ? pages 26[48] Free Software Foundation. GCC 4.8 Release Series: Changes, New Features, and Fixes. URL(Retrieved Feb. 
2013): http://gcc.gnu.org/gcc-4.8/changes.html, February 2013. ? pages 78[49] J. Gao, Y. Han, and X. Li. A New Post-Silicon Debug Approach Based on Suspect Window. InVLSI Test Symposium, 2009. VTS ?09. 27th IEEE, pages 85?90, May 2009. ? pages 7, 27, 32[50] R. Goering. Scan design called portal for hackers. URL (Retrieved Feb. 2013):http://eetimes.com/electronics-news/4050578/Scan-design-called-portal-for-hackers, October2004. ? pages 23140[51] F. Golshan. Test and On-line Debug Capabilities of IEEE Std 1149.1 in UltraSPARCTM-IIIMicroprocessor. In Test Conference, 2000. Proceedings. International, pages 141?150, October2000. ? pages 23[52] M. Gort and J. Anderson. Analytical Placement for Heterogeneous FPGAs. In FieldProgrammable Logic and Applications (FPL), 2012 22nd International Conference on, pages143?150, August 2012. ? pages 20[53] P. Graham, B. Nelson, and B. Hutchings. Instrumenting Bitstreams for Debugging FPGACircuits. In Field-Programmable Custom Computing Machines, FCCM?01. The 9th annual IEEESymposium on, pages 41?50, March 2001. ? pages 34, 77, 111, 127[54] X. Gu, W. Wang, K. Li, H. Kim, and S. Chung. Re-using DFT Logic for Functional and SiliconDebugging Test. In Test Conference, 2002. Proceedings. International, pages 648?656, October2002. ? pages 23[55] A. Gupta, Z. Yang, P. Ashar, and A. Gupta. SAT-Based Image Computation with Application inReachability Analysis. In Formal Methods in Computer-Aided Design, volume 1954 of LectureNotes in Computer Science, pages 391?408. November 2000. ? pages 129[56] S. Gupta, J. Anderson, L. Farragher, and Q. Wang. CAD Techniques for Power Optimization inVirtex-5 FPGAs. In Custom Integrated Circuits Conference, 2007. CICC?07. IEEE, pages 85?88,September 2007. ? pages 20[57] A. A. Hagberg, D. A. Schult, and P. J. Swart. Exploring Network Structure, Dynamics, andFunction using NetworkX. In Proceedings of the 7th Python in Science Conference (SciPy2008),pages 11?15, August 2008. ? pages 47[58] R. C. Ho and M. A. Horowitz. Validation Coverage Analysis for Complex Digital Designs. InICCAD?96: Proceedings of the 1996 IEEE/ACM International Conference on Computer-AidedDesign, pages 146?151, November 1996. ? pages 31[59] K. Holdbrook, S. Joshi, S. Mitra, J. Petolino, R. Raman, and M. Wong. MicroSPARC: ACase-Study of Scan Based Debug. In Test Conference, 1994. Proceedings., International, pages70?75, October 1994. ? pages 23[60] Y.-C. Hsu, F. Tsai, W. Jong, and Y.-T. Chang. Visibility enhancement for silicon debug. InDAC?06: Proceedings of the 43rd annual Design Automation Conference, pages 13?18, July2006. ? pages 11, 20, 29[61] A. J. Hu. Formal Hardware Verification with BDDs: An Introduction. In Proc. Pacific RimConference on Communications, Computers and Signal Processing, pages 677?682, August1997. ? pages 52, 129[62] E. Hung and S. J. E. Wilton. Incremental Signal-Tracing for FPGA Debug. IEEE Transactionson Very Large Scale Integration (VLSI) Systems, page (accepted for publication: March 2013).? pages iv, 6, 12, 76, 127[63] E. Hung and S. J. E. Wilton. Speculative Debug Insertion for FPGAs. In FPL 2011,International Conference on Field-Programmable Logic and Applications, pages 524?531,September 2011. ? pages iv, v, 6, 65, 76, 126141[64] E. Hung and S. J. E. Wilton. On Evaluating Signal Selection Algorithms for Post-Silicon Debug.In ISQED 2011, International Symposium on Quality Electronic Design, pages 290?296, March2011. ? pages iv, v, 6, 38, 59, 125[65] E. Hung and S. J. E. Wilton. 
Limitations of Incremental Signal-Tracing for FPGA Debug. InFPL 2012, International Conference on Field-Programmable Logic and Applications, pages49?56, August 2012. ? pages iv, v, 6, 85, 100, 127, 155[66] E. Hung and S. J. E. Wilton. Towards Simulator-like Observability for FPGAs: A Virtual OverlayNetwork for Trace-Buffers. In Proceedings of the 21st ACM/SIGDA International Symposium onField-Programmable Gate Arrays, pages 19?28, February 2013. ? pages iv, v, 6, 105, 128[67] E. Hung and S. J. E. Wilton. Scalable Signal Selection for Post-Silicon Debug. IEEETransactions on Very Large Scale Integration (VLSI) Systems, 21:1103?1115, June 2013.? pages iv, v, 6, 12, 38, 65, 85, 125, 126[68] E. Hung, B. Quinton, and S. J. E. Wilton. Linking the Verification and Validation of ComplexIntegrated Circuits Through Shared Coverage Metrics. IEEE Design & Test of Computers ?Special Issue: Silicon Debug and Diagnosis, page (accepted for publication: June 2013).? pages iv, v, 6, 105, 128, 135[69] E. Hung, F. Eslami, and S. J. E. Wilton. Escaping the Academic Sandbox: Realizing VPRCircuits on Xilinx Devices. In Proceedings of the 21st IEEE International Symposium onField-Programmable Custom Computing Machines, pages 45?52, April 2013. ? pages 12, 17,132[70] International Technology Roadmap for Semiconductors (ITRS). International TechnologyRoadmap for Semiconductors, 2007 Edition: Design. URL (Retrieved Feb. 2013):http://www.itrs.net/Links/2007ITRS/2007 Chapters/2007 Design.pdf, February 2008. ? pages 2,11, 20, 21[71] Y. S. Iskander, C. D. Patterson, and S. D. Craven. Improved Abstractions and Turnaround Timefor FPGA Design Validation and Debug. In FPL?11, Proceedings of the 2011 21st InternationalConference on Field Programmable Logic and Applications, pages 518?523, September 2011.? pages 25[72] P. Jamieson, K. B. Kent, F. Gharibian, and L. Shannon. Odin II - An Open-source Verilog HDLSynthesis tool for CAD Research. In Proceedings of the IEEE Symposium onField-Programmable Custom Computing Machines, pages 149?156, May 2010. ? pages 19[73] J.-Y. Jou and C.-N. J. Liu. Coverage Analysis Techniques for HDL Design Validation. In The 6thConference of Asia Pacific CHip Design Languages (APCHDL?99), pages 3?10, October 1999.? pages 31[74] R. Kaivola, R. Ghughal, N. Narasimhan, A. Telfer, J. Whittemore, S. Pandav, A. Slobodova?,C. Taylor, V. Frolov, E. Reeber, and A. Naik. Replacing Testing with Formal Verification inIntel(R) Core(TM) i7 Processor Execution Engine Validation. In CAV ?09: Proc. of the 21stInternational Conference on Computer Aided Verification, pages 414?429, June 2009.? pages 21[75] Kaivola, Roope and Ghughal, Rajnish and Narasimhan, Naren and Telfer, Amber andWhittemore, Jesse and Pandav, Sudhindra and Slobodova?, Anna and Taylor, Christopher and142Frolov, Vladimir and Reeber, Erik and Naik, Armaghan. Replacing Testing with FormalVerification in Intel(R) Core(TM) i7 Processor Execution Engine Validation. In CAV ?09:Proceedings of the 21st International Conference on Computer Aided Verification, pages414?429, June 2009. ? pages 2[76] C.-F. Kao, I.-J. Huang, and C.-H. Lin. An Embedded Multi-resolution AMBA Trace Analyzerfor Microprocessor-based SoC Integration. In DAC ?07: Proceedings of the 44th annual DesignAutomation Conference, pages 477?482, June 2007. ? pages 27[77] E. Keller. JRoute: A Run-Time Routing API for FPGA Hardware. In Proceedings of the 14thInternational Parallel and Distributed Processing Symposium, IPDPS?00, pages 874?881, May2000. ? pages 34[78] B. Keng, S. Safarpour, and A. 
Veneris. Bounded model debugging. Computer-Aided Design ofIntegrated Circuits and Systems, IEEE Transactions on, 29(11):1790?1803, 2010. ? pages 21,38, 129[79] J. Keshava, N. Hakim, and C. Prudvi. Post-silicon validation challenges: how EDA and academiacan help. In Proceedings of the 47th Design Automation Conference, DAC?10, pages 3?7, June2010. ? pages 21[80] M. Khalid and J. Rose. A Novel and Efficient Routing Architecture for Multi-FPGA Systems.Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 8(1):30?39, February 2000.? pages 16[81] N. Kitchen and A. Kuehlmann. Stimulus Generation for Constrained Random Simulation. InProceedings of the 2007 IEEE/ACM International Conference on Computer-Aided Design,ICCAD ?07, pages 258?265, 2007. ? pages 5[82] H. F. Ko and N. Nicolici. Algorithms for State Restoration and Trace-Signal Selection for DataAcquisition in Silicon Debug. Computer-Aided Design of Integrated Circuits and Systems, IEEETransactions on, 28(2):285?297, February 2009. ? pages 7, 29, 31, 32, 40, 54, 59, 60, 61, 62, 63,115, 125, 128[83] H. Krupnova. Mapping Multi-Million Gate SoCs on FPGAs: Industrial Methodology andExperience. In Design, Automation and Test in Europe Conference and Exhibition, 2004.Proceedings, pages 1236?1241, February 2004. ? pages 3[84] I. Kuon, R. Tessier, and J. Rose. FPGA Architecture: Survey and Challenges. Foundation andTrends in Electronic Design Automation, 2(2):135?253, February 2008. ? pages 1, 13, 14[85] C. Lavin, M. Padilla, J. Lamprecht, P. Lundrigan, B. Nelson, and B. Hutchings. RapidSmith:Do-It-Yourself CAD Tools for Xilinx FPGAs. In Proceedings of the 21st International Workshopon Field Programmable Logic and Applications (FPL?11), pages 349?355, September 2011.? pages 132[86] D. Lee and S. Reddy. On Determining Scan Flip-Flops in Partial-Scan Designs. InComputer-Aided Design, 1990. ICCAD-90. Digest of Technical Papers., 1990 IEEE InternationalConference on, pages 322?325, November 1990. ? pages 32[87] G. Lemieux and D. Lewis. Using Sparse Crossbars within LUT Clusters. In Proceedings of the2001 ACM/SIGDA Ninth International Symposium on Field-Programmable Gate Arrays,FPGA?01, pages 59?68, February 2001. ? pages 85143[88] D. Levi and S. A. Guccione. BoardScope: A Debug Tool for Reconfigurable Systems. In Societyof Photo-Optical Instrumentation Engineers (SPIE) Conference Series, pages 239?246, October1998. ? pages 25[89] M. E. Levitt, S. Nori, S. Narayanan, G. P. Grewal, L. Youngs, A. Jones, G. Billus, andS. Paramanandam. Testability, Debuggability, and Manufacturability Features of theUltraSPARCTM-I Microprocessor. In Proceedings of the IEEE International Test Conference,pages 157?166, October 1995. ? pages 23[90] D. Lewis, D. Galloway, M. Van Ierssel, J. Rose, and P. Chow. The Transmogrifier-2: a 1 milliongate rapid-prototyping system. Very Large Scale Integration (VLSI) Systems, IEEE Transactionson, 6(2):188?198, June 1998. ? pages 15[91] D. Lewis, D. Cashman, M. Chan, J. Chromczak, G. Lai, A. Lee, T. Vanderhoek, and H. Yu.Architectural Enhancements in Stratix VTM. In Proceedings of the ACM/SIGDA InternationalSymposium on Field-Programmable Gate Arrays, FPGA?13, pages 147?156, Feburary 2013.? pages 129[92] H.-M. Lin and J.-Y. Jou. On Computing The Minimum Feedback Vertex Set of a Directed Graphby Contraction Operations. Computer-Aided Design of Integrated Circuits and Systems, IEEETransactions on, 19(3):295?307, March 2000. ? pages 32[93] X. Liu and Q. Xu. 
Trace Signal Selection for Visibility Enhancement in Post-Silicon Validation.In Design, Automation and Test in Europe, DATE 2009, pages 1338?1343, April 2009.? pages 32, 63, 115, 125[94] X. Liu and Q. Xu. Interconnection Fabric Design for Tracing Signals in Post-Silicon Validation.In Design Automation Conference, 2009. DAC ?09. 46th ACM/IEEE, pages 352?357, July 2009.? pages 35[95] Maxeler Technologies. MaxCompiler White Paper. URL (Retrieved Feb. 2013):http://www.maxeler.com/media/documents/MaxelerWhitePaperMaxCompiler.pdf, February 2011.? pages 14[96] L. McMurchie and C. Ebeling. PathFinder: A Negotiation-Based Performance-Driven Router forFPGAs. In Proceedings of the 1995 ACM Third International Symposium onField-Programmable Gate Arrays, FPGA?95, pages 111?117, February 1995. ? pages 20, 82,113[97] Mentor Graphics. ModelSim: ASIC and FPGA Design. URL (Retrieved Feb. 2013):http://www.mentor.com/products/fv/modelsim/. ? pages 1[98] Mentor Graphics. A Closer Look at Veloce Technology: Taking Hardware-Assisted Verificationto the Next Level. URL (Retrieved Feb. 2013): http://www.mentor.com/products/fv/techpubs/a-closer-look-at-veloce-technology-taking-hardware-assisted-verification-to-the-next-level-49128,May 2009. ? pages 21[99] Microsemi. ProASIC3L FPGA Fabric User?s Guide. URL (Retrieved Feb. 2013):http://www.actel.com/documents/PA3L UG.pdf, September 2012. ? pages 24[100] Microsemi. Silicon Explorer II: User?s Guide. URL (Retrieved Feb. 2013):http://www.actel.com/documents/Silexpl ug.pdf, August 2012. ? pages 36, 136144[101] R. Mijat. Better Trace for Better Software ? Introducing the new ARM CoreSight System TraceMacrocell and Trace Memory Controller. URL (Retrieved Feb. 2013): http://www.arm.com/files/pdf/Better Trace for Better Software - CoreSight STM with LTTng - 19th October 2010.pdf,November 2010. ? pages 27[102] Y. O. M. Moctar, N. George, H. Parandeh-Afshar, P. Ienne, G. G. Lemieux, and P. Brisk.Reducing the Cost of Floating-Point Mantissa Alignment and Normalization in FPGAs. InProceedings of the 20th ACM/SIGDA International Symp. on Field-Programmable Gate Arrays,FPGA?12, pages 255?264, February 2012. ? pages 33[103] K. Morris. On-Chip Debugging - Built-in Logic Analyzers on your FPGA. FPGA and StructuredASIC Journal: http://www.fpgajournal.com/articles/debug.htm, January 2004. ? pages 22, 36[104] D. Moundanos, J. Abraham, and Y. Hoskote. Abstraction Techniques for Validation CoverageAnalysis and Test Generation. Computers, IEEE Transactions on, 47(1):2?14, January 1998.? pages 31[105] M. Nava, P. Blouet, P. Teninge, M. Coppola, T. Ben-Ismail, S. Picchiottino, and R. Wilson. AnOpen Platform for Developing Multiprocessor SoCs. Computer, 38(7):60?67, July 2005.? pages 21[106] S. Oldridge and S. Wilton. A Novel FPGA architecture Supporting Wide, Shallow Memories.Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 13(6):758?762, June 2005.? pages 33[107] OpenCores. The #1 Community Within Open Source Hardware IP-Cores. URL (RetrievedFeb. 2013): http://www.opencores.org. ? pages 52[108] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Orderto the Web. Technical Report 1999-66, Stanford InfoLab, November 1999. ? pages 47[109] F. M. d. Paula, A. J. Hu, and A. Nahir. nuTAB-BackSpace: Rewriting to NormalizeNon-determinism in Post-silicon Debug Traces. In CAV?12: Proc. of the 24th InternationalConference on Computer Aided Verification, pages 513?531, July 2012. ? pages 30[110] Z. Poulos, Y.-S. Yang, J. Anderson, A. Veneris, and B. 
Le. Leveraging Reconfigurability to RaiseProductivity in FPGA Functional Debug. In Design, Automation Test in Europe ConferenceExhibition (DATE), 2012, pages 292?295, March 2012. ? pages 34, 36[111] S. Prabhakar and M. Hsiao. Using Non-Trivial Logic Implications for Trace Buffer-Based SiliconDebug. In Asian Test Symposium, 2009. ATS ?09., pages 131?136, November 2009. ? pages 32[112] S. Prabhakar, R. Sethuram, and M. Hsiao. Trace Buffer-Based Silicon Debug with LosslessCompression. In VLSI Design (VLSI Design), 2011 24th International Conference on, pages358?363, January 2011. ? pages 35[113] B. Quinton. A Reconfigurable Post-Silicon Debug Infrastructure for Systems-on-Chip. PhDthesis, University of British Columbia, 2008. ? pages 35[114] B. Quinton. Bridging software and hardware to accelerate SoC validation. URL (RetrievedFeb. 2013): http://www.tek.com/document/product-article/ee-times-bridging-software-and-hardware-accelerate-soc-validation, February 2012.? pages 27145[115] B. Quinton and S. Wilton. Post-Silicon Debug using Programmable Logic Cores. In FieldProgrammable Technology, 2005. Proceedings. 2005 IEEE International Conference on, pages241?247, December 2005. ? pages 14, 35, 39[116] B. Quinton and S. Wilton. Concentrator Access Networks for Programmable Logic Cores onSoCs. In Circuits and Systems, 2005. ISCAS 2005. IEEE International Symposium on, pages45?48, May 2005. ? pages 35[117] B. R. Quinton, A. M. Hughes, and S. J. E. Wilton. Post-Silicon Debug of Complex Multi Clockand Power Domain SoCs. In Proceedings from 6th IEEE International Workshop on SiliconDebug and Diagnosis, March 2010. ? pages 26, 38, 113[118] R. K. Brayton, G. D. Hachtel, A. Sangiovanni-Vincentelli, F. Somenzi, A. Aziz, S. -T. Cheng, S.Edwards, S. Khatri, Y. Kukimoto, A. Pardo, S. Qadeer, R. K. Ranjan, S. Sarwary, T. R. Shiple, G.Swamy, and T. Villa. VIS: a System for Verification and Synthesis. In Proceedings of the EighthInternational Conference on Computer Aided Verification CAV, pages 428?432, August 1996.? pages 43[119] M. Riley and M. Genden. Cell Broadband Engine Debugging for Unknown Events. Design &Test of Computers, IEEE, 24(5):486?493, 2007. ? pages 24, 26[120] M. Riley, N. Chelstrom, M. Genden, and S. Sawamura. Debug of the CELL Processor: Movingthe Lab into Silicon. In Test Conference, 2006. ITC ?06. IEEE International, pages 1?9, October2006. ? pages 24, 26, 28[121] J. Rose, J. Luu, C. W. Yu, O. Densmore, J. Goeders, A. Somerville, K. B. Kent, P. Jamieson, andJ. Anderson. The VTR Project: Architecture and CAD for FPGAs from Verilog to Routing. InProceedings of the 20th ACM/SIGDA International Symposium on Field-Programmable GateArrays, pages 77?86, February 2012. ? pages 20, 52, 79, 82, 88, 112, 132[122] R. Y. Rubin and A. M. DeHon. Timing-Driven Pathfinder Pathology and Remediation:Quantifying and Reducing Delay Noise in VPR-Pathfinder. In FPGA?11, Proceedings of the 19thACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 173?176,February 2011. ? pages 79, 98[123] G. Schelle, J. Collins, E. Schuchman, P. Wang, X. Zou, G. Chinya, R. Plate, T. Mattner,F. Olbrich, P. Hammarlund, R. Singhal, J. Brayton, S. Steibl, and H. Wang. Intel NehalemProcessor Core Made FPGA Synthesizable. In Proceedings of the 18th annual ACM/SIGDAInternational Symposium on Field-Programmable Gate Arrays, FPGA?10, pages 3?12, February2010. ? pages 16[124] G. Semeraro, G. Magklis, R. Balasubramonian, D. Albonesi, S. Dwarkadas, and M. 
Scott.Energy-efficient processor design using multiple clock domains with dynamic voltage andfrequency scaling. In High-Performance Computer Architecture, 2002. Proceedings. EighthInternational Symposium on, pages 29?40, February 2002. ? pages 15[125] N. Shah and J. Rose. On the Difficulty of Pin-to-Wire Routing in FPGAs. In FPL 2012,International Conference on Field-Programmable Logic and Applications, pages 83?90, August2012. ? pages 113[126] J. Shen and J. A. Abraham. An RTL Abstraction Technique for Processor MicroarchitectureValidation and Test Generation. Journal of Electronic Testing: Theory and Applications, 16:67?81, February 2000. ? pages 31146[127] SpringSoft. Siloti: Visibility Automation System, Datasheet. URL (Retrieved Feb. 2013):http://www.springsoft.com/assets/files/Datasheets/SS Siloti Datasheet E5 US.pdf, October2008. ? pages 29[128] N. Steiner, A. Wood, H. Shojaei, J. Couch, P. Athanas, and M. French. Torc: Towards anOpen-Source Tool Flow. In Proceedings of the 19th ACM/SIGDA International Symposium onField-Programmable Gate Arrays, FPGA?11, pages 41?44, February 2011. ? pages 132[129] Sun Microsystems. OpenSPARC T1 Processor Design and Verification User?s Guide. URL(Retrieved Feb. 2013):http://download.oracle.com/technetwork/systems/opensparc/OpenSPARCT1.1.7.tar.bz2,March 2009. ? pages 52, 73[130] Synopsys. Identify: Simulator-like Visibility into Hardware Debug. http://www.synopsys.com/Tools/Implementation/FPGAImplementation/CapsuleModule/identify ds.pdf, August 2010.? pages 27, 75, 81[131] Synopsys. ZeBu-Server ? Enterprise Emulator. URL (Retrieved Feb. 2013):http://www.synopsys.com/Tools/Verification/hardware-verification/emulation/Pages/zebu-server-asic-emulator.aspx, October 2012. ? pages 21[132] Tabula. Spacetime Architecture: White Paper. URL (Retrieved Feb. 2013):http://www.tabula.com/technology/TabulaSpacetime WhitePaper.pdf, October 2010. ? pages 16[133] S. Tasiran and K. Keutzer. Coverage Metrics for Functional Validation of Hardware Designs.Design Test of Computers, IEEE, 18(4):36?45, July 2001. ? pages 7, 21, 31[134] A. Tavaragiri, J. Couch, and P. Athanas. Exploration of FPGA Interconnect for the Design ofUnconventional Antennas. In Proceedings of the 19th ACM/SIGDA International Symposium onField-Programmable Gate Arrays, FPGA?11, pages 219?226, February 2011. ? pages 33[135] S. Teig. Programmable logic devices in 2032? (FPGA2012 Pre-Conference Workshop).http://tcfpga.org/fpga2012/SteveTeig.pdf, February 2012. ? pages 10[136] Tektronix. Simplying Xilinx and Altera FPGA Debug. URL (Retrieved Feb. 2013): http://www.newark.com/pdfs/techarticles/tektronix/XylinxAndAlteraFPGA AppNote MSO4000.pdf,July 2007. ? pages 21[137] Tektronix. Certus Debug Suite. URL (Retrieved Feb. 2013): http://www.tek.com/sites/tek.com/files/media/media/resources/Certus Debug Suite Datasheet 54W-28030-1 4.pdf, July 2012.? pages 10, 21, 27, 29, 75, 78, 127[138] A. Tiwari and K. A. Tomko. Scan-Chain Based Watch-Points for Efficient Run-Time Debuggingand Verification of FPGA Designs. In ASP-DAC ?03: Proceedings of the 2003 Asia and SouthPacific Design Automation Conference, pages 705?711, January 2003. ? pages 25[139] J. N. Tombs, M. A. A. Echa?nove, F. M. Chavero, V. B. Lecuyer, A. J. T. Silgado,A. Fernandez-Leo?n, and F. Tortosa. The Implementation of a FPGA Hardware Debugger Systemwith Minimal System Overhead. In Field Programmable Logic and Application, 14thInternational Conference, FPL 2004, Proceedings, pages 1062?1066, August 2004. ? pages 25[140] S. Trimberger. 
Field-Programmable Gate Array Technology. Kluwer Academic Publishers,Norwell, MA, USA, 1994. ISBN 0792394194. ? pages 34147[141] G. Van Rootselaar and B. Vermeulen. Silicon Debug: Scan Chains Alone Are Not Enough. InTest Conference, 1999. Proceedings. International, pages 892?902, September 1999. ? pages 23[142] E. Vansteenkiste, K. Bruneel, and D. Stroobandt. Maximizing the Reuse of Routing Resources ina Reconfiguration-Aware Connection Router. In FPL 2012, International Conference onField-Programmable Logic and Applications, pages 322?329, August 2012. ? pages 111[143] Vennsa Technologies. OnPoint. URL (Retrieved Feb. 2013: http://www.vennsa.com/product.php,2011. ? pages 3[144] B. Vermeulen and S. Goel. Design for Debug: Catching Design Errors in Digital Chips. Design& Test of Computers, IEEE, 19(3):35?43, 2002. ? pages 22[145] B. Vermeulen, T. Waayers, and S. Goel. Core-Based Scan Architecture for Silicon Debug. In TestConference, 2002. Proceedings. International, pages 638?647, October 2002. ? pages 23, 24[146] P. H. Wang, J. D. Collins, C. T. Weaver, B. Kuttanna, S. Salamian, G. N. Chinya, E. Schuchman,O. Schilling, T. Doil, S. Steibl, and H. Wang. Intel Atom Processor Core MadeFPGA-Synthesizable. In Proceedings of the ACM/SIGDA International Symposium onField-Programmable Gate Arrays, FPGA?09, pages 209?218, February 2009. ? pages 14, 15[147] T.-H. Wang and C. G. Tan. Practical Code Coverage for Verilog. In Verilog HDL Conference,1995. Proceedings., 1995 IEEE International, pages 99?104, March 1995. ? pages 31[148] S. Wasson. Phenom TLB patch benchmarked: A look at how AMD?s BIOS workaround impactsPhenom Performance. URL (Retrieved Feb. 2013):http://techreport.com/review/13741/phenom-tlb-patch-benchmarked, December 2007.? pages 14[149] T. Wheeler, P. Graham, B. E. Nelson, and B. Hutchings. Using Design-Level Scan to ImproveFPGA Design Observability and Controllability for Functional Verification. In FPL?01:Proceedings of the 11th International Conference on Field Programmable Logic andApplications, pages 483?492, August 2001. ? pages 24[150] S. Wilton, B. Quinton, and E. Hung. Rapid RTL-based Signal Ranking for FPGA Prototyping. InFPT 2012: Procedings of the 11th International Conference on Field Programmable Technology,pages 1?7, December 2012. ? pages iv, 12, 32[151] S. J. E. Wilton. SMAP: Heterogeneous Technology Mapping for Area Reduction in FPGAs withEmbedded Memory Arrays. In Proceedings of the 1998 ACM/SIGDA Sixth InternationalSymposium on Field-Programmable Gate Arrays, FPGA?98, pages 171?178, February 1998.? pages 33, 133[152] M. Wirthlin, B. Nelson, B. Hutchings, P. Athanas, and S. Bohner. FPGA Design Productivity:Existing Limitations and Root Causes. URL (Retrieved Feb. 2013):http://www.chrec.org/ftsw/FDP Session1 Posted.pdf, June 2008. ? pages 105[153] Xilinx. Virtex-6 FPGA Memory Resources: User Guide (UG363 v1.6). URL (RetrievedFeb. 2013): http://www.xilinx.com/support/documentation/user guides/ug363.pdf, April 2011.? pages 27, 112148[154] Xilinx. Virtex-6 FPGA DSP48E1 Slice User Guide (UG369 v1.3). URL (Retrieved Feb. 2013):http://www.xilinx.com/support/documentation/user guides/ug369.pdf, February 2011.? pages 133[155] Xilinx. ChipScope Pro Software and Cores, User Guide UG029 (v14.3). URL (RetrievedFeb. 2013): http://www.xilinx.com/support/documentation/sw manuals/xilinx14 4/chipscope pro sw cores ug029.pdf, October 2012. ? pages 8, 27, 75, 78, 104, 110, 127[156] Xilinx. Partial Reconfiguration of Xilinx FPGAs Using ISE Design Suite (WP374 v1.2). URL(Retrieved Feb. 
2013): http://www.xilinx.com/support/documentation/white papers/wp374 Partial Reconfig Xilinx FPGAs.pdf, May 2012. ? pages 111[157] Xilinx. Hierarchical Design Methodology Guide (UG748 v14.1). URL (Retrieved Feb. 2013):http://www.xilinx.com/support/documentation/sw manuals/xilinx14 1/Hierarchical Design Methodology Guide.pdf, April 2012. ? pages 34[158] Xilinx. 7 Series FPGAs Configuration (UG470 v1.6). URL (Retrieved Feb. 2013):http://www.xilinx.com/support/documentation/user guides/ug470 7Series Config.pdf, January2013. ? pages 24[159] J.-S. Yang and N. Touba. Enhancing Silicon Debug via Periodic Monitoring. In Defect and FaultTolerance of VLSI Systems, 2008. DFTVS?08. IEEE International Symposium on, pages 125?133,October 2008. ? pages 23[160] J.-S. Yang and N. Touba. Automated Selection of Signals to Observe for Efficient Silicon Debug.In VLSI Test Symposium, 2009. VTS ?09. 27th IEEE, pages 79?84, May 2009. ? pages 7, 29, 32,63, 125[161] Y.-S. Yang, B. Keng, N. Nicolici, A. Veneris, and S. Safarpour. Automated Silicon Debug DataAnalysis Techniques for a Hardware Data Acquisition Environment. In Quality Electronic Design(ISQED), 2010 11th International Symposium on, pages 675?682, March 2010. ? pages 32, 38149Appendix ASignal Selection for leon3s nofpuThis appendix provides the full listing for the 128 signals selected by each of our three algorithms forthe leon3s nofpu circuit from the experiment in Section 3.9. For brevity, the hierarchical module nameshave been shortened, but signal leaves remain intact.150Flat EV Algorithm Hybrid EV Algorithm Graph Algorithmproc3|div32|r.cnt[0] proc3|div32|r.cnt[1] proc3|div32|r.state[0]proc3|div32|r.cnt[1] proc3|div32|r.cnt[2] proc3|div32|r.state[1]proc3|div32|r.cnt[2] proc3|div32|r.cnt[3] proc3|div32|r.state[2]proc3|div32|r.cnt[3] proc3|div32|r.cnt[4] proc3|iu3|ir.pwdproc3|div32|r.cnt[4] proc3|div32|r.state[2] proc3|iu3|r.a.ctrl.inst[20]proc3|div32|r.state[0] proc3|iu3|dsur.tt[4] proc3|iu3|r.a.ctrl.inst[21]proc3|div32|r.state[1] proc3|iu3|r.a.ctrl.inst[20] proc3|iu3|r.a.ctrl.inst[22]proc3|div32|r.state[2] proc3|iu3|r.a.ctrl.inst[25] proc3|iu3|r.a.ctrl.inst[23]proc3|iu3|r.a.ctrl.inst[22] proc3|iu3|r.a.ctrl.inst[26] proc3|iu3|r.a.ctrl.inst[24]proc3|iu3|r.a.ctrl.pc[4] proc3|iu3|r.a.ctrl.rd[1] proc3|iu3|r.a.ctrl.inst[30]proc3|iu3|r.a.ctrl.pc[6] proc3|iu3|r.a.ctrl.rd[3] proc3|iu3|r.a.ctrl.inst[31]proc3|iu3|r.a.ctrl.pc[7] proc3|iu3|r.a.ctrl.rd[7] proc3|iu3|r.a.ctrl.rd[0]proc3|iu3|r.a.ctrl.pc[9] proc3|iu3|r.d.cnt[0] proc3|iu3|r.a.ctrl.rd[1]proc3|iu3|r.a.ctrl.pc[10] proc3|iu3|r.d.cnt[1] proc3|iu3|r.a.ctrl.rd[2]proc3|iu3|r.a.ctrl.pc[11] proc3|iu3|r.e.ctrl.pc[4] proc3|iu3|r.a.ctrl.rd[3]proc3|iu3|r.a.ctrl.pc[12] proc3|iu3|r.e.op2[0] proc3|iu3|r.a.ctrl.rd[4]proc3|iu3|r.a.ctrl.pc[15] proc3|iu3|r.w.result[4] proc3|iu3|r.a.ctrl.rd[5]proc3|iu3|r.a.ctrl.pc[16] proc3|iu3|r.x.ctrl.tt[1] proc3|iu3|r.a.ctrl.rd[6]proc3|iu3|r.a.ctrl.pc[17] proc3|mmu cache|acache|r.ba proc3|iu3|r.a.ctrl.rd[7]proc3|iu3|r.a.ctrl.pc[18] proc3|mmu cache|acache|r.bg proc3|iu3|r.a.ctrl.wiccproc3|iu3|r.a.ctrl.pc[19] proc3|mmu cache|acache|r.bo[0] proc3|iu3|r.a.ctrl.wregproc3|iu3|r.a.ctrl.pc[23] proc3|mmu cache|acache|r.nba proc3|iu3|r.d.annulproc3|iu3|r.a.ctrl.pc[24] proc3|mmu cache|acache|r.retry proc3|iu3|r.d.cnt[0]proc3|iu3|r.a.ctrl.pc[25] proc3|mmu cache|acache|r.retry2 proc3|iu3|r.d.cnt[1]proc3|iu3|r.a.ctrl.pc[26] proc3|mmu cache|dcache|r.cctrl.dfrz proc3|iu3|r.d.cwp[0]proc3|iu3|r.a.ctrl.pc[27] proc3|mmu cache|dcache|r.dstate.dblwrite proc3|iu3|r.d.cwp[1]proc3|iu3|r.a.ctrl.pc[29] proc3|mmu 
cache|dcache|r.dstate.wflush proc3|iu3|r.d.cwp[2]proc3|iu3|r.a.ctrl.pc[30] proc3|mmu cache|dcache|r.faddr[1] proc3|iu3|r.d.inst[0][14]proc3|iu3|r.a.ctrl.rd[1] proc3|mmu cache|dcache|r.faddr[2] proc3|iu3|r.d.inst[0][15]proc3|iu3|r.d.cnt[0] proc3|mmu cache|dcache|r.faddr[3] proc3|iu3|r.d.inst[0][16]proc3|iu3|r.d.cnt[1] proc3|mmu cache|dcache|r.faddr[4] proc3|iu3|r.d.inst[0][17]proc3|iu3|r.d.pc[4] proc3|mmu cache|dcache|r.faddr[5] proc3|iu3|r.d.inst[0][18]proc3|iu3|r.d.pc[6] proc3|mmu cache|dcache|r.faddr[6] proc3|iu3|r.d.inst[0][19]proc3|iu3|r.d.pc[7] proc3|mmu cache|dcache|r.mmctrl1.ctxp[1] proc3|iu3|r.d.inst[0][20]proc3|iu3|r.d.pc[15] proc3|mmu cache|dcache|r.mmctrl1.ctxp[2] proc3|iu3|r.d.inst[0][21]proc3|iu3|r.d.pc[16] proc3|mmu cache|dcache|r.nomds proc3|iu3|r.d.inst[0][22]proc3|iu3|r.d.pc[17] proc3|mmu cache|dcache|r.vaddr[0] proc3|iu3|r.d.inst[0][23]proc3|iu3|r.d.pc[18] proc3|mmu cache|dcache|r.xaddress[5] proc3|iu3|r.d.inst[0][24]proc3|iu3|r.d.pc[23] proc3|mmu cache|icache|r.faddr[0] proc3|iu3|r.d.inst[0][25]Continued on next page151Continued from previous pageFlat EV Algorithm Hybrid EV Algorithm Graph Algorithmproc3|iu3|r.d.pc[30] proc3|mmu cache|icache|r.faddr[1] proc3|iu3|r.d.inst[0][26]proc3|iu3|r.d.pc[29] proc3|mmu cache|icache|r.faddr[2] proc3|iu3|r.d.inst[0][27]proc3|iu3|r.e.ctrl.pc[4] proc3|mmu cache|icache|r.faddr[3] proc3|iu3|r.d.inst[0][28]proc3|iu3|r.e.ctrl.pc[6] proc3|mmu cache|icache|r.faddr[4] proc3|iu3|r.d.inst[0][29]proc3|iu3|r.e.ctrl.pc[7] proc3|mmu cache|icache|r.faddr[5] proc3|iu3|r.d.inst[0][30]proc3|iu3|r.e.ctrl.pc[9] proc3|mmu cache|icache|r.faddr[6] proc3|iu3|r.d.inst[0][31]proc3|iu3|r.e.ctrl.pc[10] proc3|mmu cache|icache|r.flush proc3|iu3|r.d.inst[1][14]proc3|iu3|r.e.ctrl.pc[11] proc3|mmu cache|icache|r.flush2 proc3|iu3|r.d.inst[1][15]proc3|iu3|r.e.ctrl.pc[12] proc3|mmu cache|icache|r.istate.stop proc3|iu3|r.d.inst[1][16]proc3|iu3|r.e.ctrl.pc[14] proc3|mmu cache|icache|r.istate.streaming proc3|iu3|r.d.inst[1][17]proc3|iu3|r.e.ctrl.pc[15] proc3|mmu cache|icache|r.lock proc3|iu3|r.d.inst[1][18]proc3|iu3|r.e.ctrl.pc[16] proc3|mmu cache|icache|r.overrun proc3|iu3|r.d.inst[1][19]proc3|iu3|r.e.ctrl.pc[17] proc3|mmu cache|icache|r.rndcnt[0] proc3|iu3|r.d.inst[1][20]proc3|iu3|r.e.ctrl.pc[18] proc3|mmu cache|icache|r.setrepl[0] proc3|iu3|r.d.inst[1][21]proc3|iu3|r.e.ctrl.pc[19] proc3|mmu cache|icache|r.underrun proc3|iu3|r.d.inst[1][22]proc3|iu3|r.e.ctrl.pc[23] proc3|mmu cache|tlb|tag0|r.btag.ET[0] proc3|iu3|r.d.inst[1][23]proc3|iu3|r.e.ctrl.pc[24] proc3|mmu cache|tlb|tag0|r.btag.ET[1] proc3|iu3|r.d.inst[1][24]proc3|iu3|r.e.ctrl.pc[25] proc3|mmu cache|tlb|tag0|r.btag.I1[0] proc3|iu3|r.d.inst[1][25]proc3|iu3|r.e.ctrl.pc[26] proc3|mmu cache|tlb|tag0|r.btag.I1[1] proc3|iu3|r.d.inst[1][26]proc3|iu3|r.e.ctrl.pc[29] proc3|mmu cache|tlb|tag0|r.btag.PPN[18] proc3|iu3|r.d.inst[1][27]proc3|iu3|r.e.ctrl.pc[30] proc3|mmu cache|tlb|tag0|r.btag.PPN[21] proc3|iu3|r.d.inst[1][28]proc3|iu3|r.m.ctrl.pc[4] proc3|mmu cache|tlb|tag0|r.btag.ET[0] proc3|iu3|r.d.inst[1][29]proc3|iu3|r.m.ctrl.pc[6] proc3|mmu cache|tlb|tag0|r.btag.ET[1] proc3|iu3|r.d.inst[1][30]proc3|iu3|r.m.ctrl.pc[7] proc3|mmu cache|tlb|tag0|r.btag.I1[0] proc3|iu3|r.d.inst[1][31]proc3|iu3|r.m.ctrl.pc[9] proc3|mmu cache|tlb|tag0|r.btag.I1[1] proc3|iu3|r.d.set[0]proc3|iu3|r.m.ctrl.pc[10] proc3|mmu cache|tlb|tag0|r.btag.PPN[18] proc3|iu3|r.e.bpproc3|iu3|r.m.ctrl.pc[11] proc3|mmu cache|tlb|tag0|r.btag.PPN[21] proc3|iu3|r.e.ctrl.rd[0]proc3|iu3|r.m.ctrl.pc[12] proc3|mmu cache|tlb|tag0|r.btag.CTX[1] 
proc3|iu3|r.e.ctrl.rd[1]proc3|iu3|r.m.ctrl.pc[14] proc3|mmu cache|tlb|tag0|r.btag.M proc3|iu3|r.e.ctrl.rd[2]proc3|iu3|r.m.ctrl.pc[15] proc3|mmu cache|tlb|tag0|r.btag.SU proc3|iu3|r.e.ctrl.rd[3]proc3|iu3|r.m.ctrl.pc[16] proc3|mmu cache|tlb|tag0|r.btag.ET[0] proc3|iu3|r.e.ctrl.rd[4]proc3|iu3|r.m.ctrl.pc[17] proc3|mmu cache|tlb|tag0|r.btag.ET[1] proc3|iu3|r.e.ctrl.rd[5]proc3|iu3|r.m.ctrl.pc[18] proc3|mmu cache|tlb|tag0|r.btag.I1[0] proc3|iu3|r.e.ctrl.rd[6]proc3|iu3|r.m.ctrl.pc[19] proc3|mmu cache|tlb|tag0|r.btag.I1[1] proc3|iu3|r.e.ctrl.rd[7]proc3|iu3|r.m.ctrl.pc[22] proc3|mmu cache|tlb|tag0|r.btag.PPN[18] proc3|iu3|r.e.ctrl.wiccproc3|iu3|r.m.ctrl.pc[23] proc3|mmu cache|tlb|tag0|r.btag.PPN[21] proc3|iu3|r.e.ctrl.wregproc3|iu3|r.m.ctrl.pc[24] proc3|mmu cache|tlb|tag0|r.btag.ET[0] proc3|iu3|r.e.ctrl.wyproc3|iu3|r.m.ctrl.pc[25] proc3|mmu cache|tlb|tag0|r.btag.ET[1] proc3|iu3|r.m.ctrl.cnt[0]Continued on next page152Continued from previous pageFlat EV Algorithm Hybrid EV Algorithm Graph Algorithmproc3|iu3|r.m.ctrl.pc[26] proc3|mmu cache|tlb|tag0|r.btag.I1[0] proc3|iu3|r.m.ctrl.inst[20]proc3|iu3|r.m.ctrl.pc[28] proc3|mmu cache|tlb|tag0|r.btag.I1[1] proc3|iu3|r.m.ctrl.inst[21]proc3|iu3|r.m.ctrl.pc[29] proc3|mmu cache|tlb|tag0|r.btag.PPN[18] proc3|iu3|r.m.ctrl.inst[22]proc3|iu3|r.m.ctrl.pc[30] proc3|mmu cache|tlb|tag0|r.btag.PPN[21] proc3|iu3|r.m.ctrl.inst[23]proc3|iu3|r.x.ctrl.pc[4] proc3|mmu cache|tlb|tag0|r.btag.I3[4] proc3|iu3|r.m.ctrl.inst[24]proc3|iu3|r.x.ctrl.pc[6] proc3|mmu cache|tlb|tag0|r.btag.I3[5] proc3|iu3|r.m.ctrl.inst[30]proc3|iu3|r.x.ctrl.pc[7] proc3|mmu cache|tlb|tag0|r.btag.PPN[23] proc3|iu3|r.m.ctrl.inst[31]proc3|iu3|r.x.ctrl.pc[9] proc3|mmu cache|tlb|tag0|r.btag.ET[0] proc3|iu3|r.m.ctrl.wiccproc3|iu3|r.x.ctrl.pc[10] proc3|mmu cache|tlb|tag0|r.btag.ET[1] proc3|iu3|r.m.mulproc3|iu3|r.x.ctrl.pc[11] proc3|mmu cache|tlb|tag0|r.btag.I1[0] proc3|iu3|rp.errorproc3|iu3|r.x.ctrl.pc[14] proc3|mmu cache|tlb|tag0|r.btag.I1[1] proc3|iu3|rp.pwdproc3|iu3|r.x.ctrl.pc[16] proc3|mmu cache|tlb|tag0|r.btag.PPN[18] proc3|iu3|r.x.ctrl.annulproc3|iu3|r.x.ctrl.pc[17] proc3|mmu cache|tlb|tag0|r.btag.PPN[21] proc3|iu3|r.x.ctrl.inst[19]proc3|iu3|r.x.ctrl.pc[18] proc3|mmu cache|tlb|tag0|r.btag.ET[0] proc3|iu3|r.x.ctrl.inst[20]proc3|iu3|r.x.ctrl.pc[19] proc3|mmu cache|tlb|tag0|r.btag.ET[1] proc3|iu3|r.x.ctrl.inst[21]proc3|iu3|r.x.ctrl.pc[22] proc3|mmu cache|tlb|tag0|r.btag.I1[0] proc3|iu3|r.x.ctrl.inst[22]proc3|iu3|r.x.ctrl.pc[24] proc3|mmu cache|tlb|tag0|r.btag.I1[1] proc3|iu3|r.x.ctrl.inst[23]proc3|iu3|r.x.ctrl.pc[25] proc3|mmu cache|tlb|tag0|r.btag.PPN[18] proc3|iu3|r.x.ctrl.inst[24]proc3|iu3|r.x.ctrl.pc[26] proc3|mmu cache|tlb|tag0|r.btag.PPN[21] proc3|iu3|r.x.ctrl.inst[25]proc3|iu3|r.x.ctrl.pc[28] proc3|mmu cache|tlb|r.s2 data[4] proc3|iu3|r.x.ctrl.inst[26]proc3|iu3|r.x.ctrl.pc[29] proc3|mmu cache|tlb|r.s2 flush proc3|iu3|r.x.ctrl.inst[27]proc3|iu3|r.x.ctrl.pc[30] proc3|mmu cache|tlb|r.s2 needsync proc3|iu3|r.x.ctrl.inst[28]proc3|mmu cache|dcache|r.dstate.wtrans proc3|mmu cache|tlb|r.s2 tlbstate.idle proc3|iu3|r.x.ctrl.inst[29]proc3|mmu cache|dcache|r.faddr[0] proc3|mmu cache|tlb|r.s2 tlbstate.pack proc3|iu3|r.x.ctrl.inst[30]proc3|mmu cache|dcache|r.faddr[1] proc3|mmu cache|tlb|r.s2 tlbstate.walk proc3|iu3|r.x.ctrl.inst[31]proc3|mmu cache|dcache|r.faddr[2] proc3|mmu cache|tlb|r.walk transdata.data[4] proc3|iu3|r.x.ctrl.pvproc3|mmu cache|dcache|r.faddr[3] proc3|mmu cache|tlb|r.walk transdata.data[12] proc3|iu3|r.x.ctrl.trapproc3|mmu cache|dcache|r.faddr[4] proc3|mmu cache|tlb|r.walk use 
proc3|iu3|r.x.debugproc3|mmu cache|dcache|r.faddr[5] proc3|mmu cache|tlb|syncram|memarr 153 proc3|iu3|r.x.mexcproc3|mmu cache|dcache|r.faddr[6] proc3|mmu cache|tw|r.state.lv1 proc3|iu3|r.x.rstate.dsu1proc3|mmu cache|dcache|r.su proc3|mmu cache|tw|r.state.lv2 proc3|iu3|r.x.rstate.dsu2proc3|mmu cache|icache|r.faddr[0] proc3|mmu cache|tw|r.walk op proc3|iu3|r.x.rstate.runproc3|mmu cache|icache|r.faddr[1] proc3|mmu cache|r.cmb s1.op.flush op proc3|mmu cache|acache|r.baproc3|mmu cache|icache|r.faddr[2] proc3|mmu cache|r.cmb s1.tlbowner proc3|mmu cache|acache|r.bo[0]proc3|mmu cache|icache|r.faddr[3] proc3|mmu cache|r.cmb s2.op.flush op proc3|mmu cache|acache|r.bo[1]proc3|mmu cache|icache|r.faddr[4] proc3|mmu cache|r.mmctrl2.fa[3] proc3|mmu cache|dcache|r.holdnproc3|mmu cache|icache|r.faddr[5] proc3|mmu cache|r.mmctrl2.fa[5] proc3|mmu cache|icache|r.holdnproc3|mmu cache|icache|r.faddr[6] proc3|mmu cache|r.mmctrl2.fa[13] proc3|mmu cache|dcache|r.mmctrl1.tlbdisproc3|mul32|p i[1][17] proc3|mmu cache|r.mmctrl2.fa[19] proc3|mmu cache|tlb|r.s2 entry[0]proc3|mul32|p i[1][26] proc3|mmu cache|r.mmctrl2.fs.fav proc3|mmu cache|tlb|r.s2 entry[1]proc3|mul32|p i[1][33] proc3|mmu cache|r.mmctrl2.fs.ft[0] proc3|mmu cache|tlb|r.s2 entry[2]regfile 3p|din[23] proc3|mmu cache|r.mmctrl2.fs.ft[1] proc3|mmu cache|tlb|r.s2 flushregfile 3p|din[28] proc3|mmu cache|r.mmctrl2.fs.ow proc3|mmu cache|tlb|r.s2 hmregfile 3p|din[30] regfile 3p|wa[2] proc3|mmu cache|tlb|r.s2 needsyncregfile 3p|din[31] regfile 3p|wa[3] proc3|mmu cache|tlb|r.s2 readregfile 3p|memarr rtl 0 bypass[1] regfile 3p|wa[7] proc3|mmu cache|tlb|r.s2 suregfile 3p|memarr rtl 0 bypass[7] regfile 3p|wr proc3|mmu cache|tlb|r.s2 tlbstate.idleregfile 3p|memarr rtl 0 bypass[40] regfile 3p|memarr 0 proc3|mmu cache|tlb|r.s2 tlbstate.packregfile 3p|memarr rtl 0 bypass[45] regfile 3p|memarr 5 proc3|mmu cache|r.cmb s2.tlbactiveregfile 3p|memarr rtl 0 bypass[47] regfile 3p|memarr 7 proc3|mmu cache|r.cmb s2.tlbownerregfile 3p|memarr rtl 0 bypass[48] regfile 3p|memarr 15 rst

Table A.1: Comparison of leon3s nofpu, 128 signal selection, full listing.

Appendix B
Incremental-Tracing with Local Nets

The results presented in Chapter 5 showed that whilst a large majority of the signals selected for observation could be successfully connected to trace-buffers, it is often not possible to connect all signals. In Chapter 5, we observed that the most challenging nets to trace were those signals that were absorbed locally into a logic cluster, due to reduced routing flexibility. In [65], this was studied further; this analysis is presented here.

The nets (and their cluster output pins, OPINs) that exist within a placed-and-routed design can be divided into two categories: local and global. Local nets are used to make intra-cluster connections, whilst global nets are used for inter-cluster communication. The proportion of each type is shown in Figure B.1, sub-divided into those that are accessible (labelled 'A', solid) and inaccessible ('I', shaded). From this chart, we can see that local OPINs dominate the majority of failing cases. The reasoning for this trend is intuitive: global OPINs connect onto the global interconnect, making it possible to tap any point on the existing route to branch-off to a trace-pin.
No such luxury exists for local OPINs, however, and so there is far less flexibility for trace connections to be made.

This is reinforced by Figure B.2, which shows a histogram of the maximum Manhattan distance reached during breadth-first search routing at minimum channel width and with 20% slack, distinguishing between accessible OPINs (indicating the distance of the successful route) and inaccessible OPINs (the maximum distance reached before giving up, due to an empty priority queue). As seen, the vast majority of inaccessible OPINs possess a distance of zero, which indicates that they are unable to exit the cluster to reach the global network due to routing congestion. Even though we have only shown the two circuits in which this finding is most prominent, this trend holds for all the other circuits that we investigated.

Figure B.1: Distribution of local and global OPINs, accessible (A) and inaccessible (I), for the circuits mkSM4B, or1200, mkDW32B, LU8PE and mcml.

Figure B.2: Histogram of the maximum Manhattan distance reached during breadth-first search; (a) or1200: W_min (left), W_min+20% (right); (b) mkDelayWorker32B: W_min (left), W_min+20% (right).

The results presented in this appendix can be used to explain the effectiveness of the logic element symmetry CAD optimization described in Section 5.2.2, which allows local nets to access other logic cluster outputs.

Appendix C
Pseudo-code

This appendix lists pseudo-code for each of the four contributions of this thesis.

C.1 Post-Silicon Debug Metric and Automated Signal Selection

C.1.1 Debug Difficulty Metric

Given a circuit netlist, the set of signals selected for observation, and a list of trace-buffer samples of each signal, this pseudo-code computes the average debug difficulty across all samples.

Inputs: circuit netlist, signal selection, trace-buffer contents.
Outputs: average debug difficulty.

procedure computeAvgDebugDifficulty(netlist, selection, tracebuffer)
    // First, find an approximation of the reachable state space using VIS
    R = VIS.computeApproxReachableStateSet(netlist)
    V = selection
    debugDiffProduct = 1
    // Process each time slice of the trace buffer
    for t = 0 ... tracebuffer.numSamples
        // Compute the region of R that was observed, based on the signals selected
        // and the state recorded in the trace buffer
        i = tracebuffer[t]
        RV_i = computeObservedRegion(R, V, i)
        DV_i = computeDebugDifficulty(RV_i)
        debugDiffProduct = debugDiffProduct * DV_i
    // Find the nth root
    geoAvg = pow(debugDiffProduct, 1 / tracebuffer.numSamples)
    return geoAvg

procedure computeDebugDifficulty(RV_i)
    // "volume" represents the number of possible paths (trajectories) that exist through the region RV_i
    volume = RV_i.numStates
    // Compute the volume of 10 pre-images
    image = RV_i
    for u = 0 ... 10
        image = computePreImage(image)
        volume = volume * image.numStates
    // Compute the volume of 10 images
    image = RV_i
    for v = 0 ... 10
        image = computeImage(image)
        volume = volume * image.numStates
    return volume
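For readers who prefer executable code, the following Python sketch mirrors the aggregation step above: it computes the geometric mean of the per-sample difficulty values, accumulating in log space so that the product of very large volumes does not overflow. The two helper callables stand in for the BDD-based routines (computeObservedRegion and computeDebugDifficulty); their names and signatures are illustrative and are not part of the thesis implementation.

    import math

    def average_debug_difficulty(samples, reachable_states, selection,
                                 compute_observed_region, compute_debug_difficulty):
        # Geometric mean of the debug difficulty over all trace-buffer samples
        log_sum = 0.0
        for sample in samples:
            # Restrict the reachable state space to the states consistent with this sample
            region = compute_observed_region(reachable_states, selection, sample)
            difficulty = compute_debug_difficulty(region)
            # The nth root of a product is the exponential of the mean of the logs
            log_sum += math.log(difficulty)
        return math.exp(log_sum / len(samples))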
C.1.2 Automated Signal Selection

Flat EV Method

Given the circuit netlist and the number of signals to be selected, this pseudo-code returns a signal selection based on minimizing the expected debug difficulty of the entire circuit.

Inputs: circuit netlist, selection size.
Outputs: signal selection.

procedure computeFlatSelection(netlist, selsize)
    // First, find an approximation of the reachable state space using VIS
    R = VIS.computeApproxReachableStateSet(netlist)
    selectedSet = emptySet()
    // And until enough signals have been selected
    for i = 0 ... selsize
        // Find the set of unselected signals
        unselectedSet = netlist.signals - selectedSet
        bestEV_i = infinity
        // For every signal that hasn't yet been selected
        foreach signal in unselectedSet
            // Compute the expected difficulty if this signal was selected
            EV_i = computeExpectedDifficulty(R, selectedSet + signal)
            // And record it if it was the best seen so far
            if EV_i < bestEV_i then
                bestEV_i = EV_i
                best_signal = signal
        // Add the best signal found to the selected set
        selectedSet = selectedSet + best_signal
    return selectedSet

procedure computeExpectedDifficulty(R, V)
    sum = 0
    // Weight each possible observation by the number of reachable states consistent with it
    foreach i in V.allPossibleValues
        RV_i = computeObservedRegion(R, V, i)
        DV_i = computeDebugDifficulty(RV_i)
        sum = sum + DV_i * RV_i.numStates
    EV_i = sum / R.numStates
    return EV_i
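To illustrate how the expectation is formed, the toy Python example below enumerates every value the selected signals can take, builds the observed region explicitly as a list of states, and weights each region's difficulty by its size. It deliberately simplifies the difficulty function to the region size (the thesis metric also multiplies in image and pre-image volumes), and every name here is illustrative rather than drawn from the thesis code.

    from itertools import product

    def expected_difficulty(reachable, selection, difficulty=len):
        # reachable: list of states, each a dict {signal_name: 0 or 1}
        # selection: list of signal names being observed
        # difficulty: maps an observed region (list of states) to a difficulty score
        total = 0.0
        for values in product([0, 1], repeat=len(selection)):
            # Observed region: reachable states consistent with this observation
            region = [s for s in reachable
                      if all(s[sig] == v for sig, v in zip(selection, values))]
            if region:
                total += difficulty(region) * len(region)
        return total / len(reachable)

    # Example: observing 'a' splits four reachable states into two regions of two states,
    # giving a lower expected difficulty than observing nothing at all.
    states = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
    print(expected_difficulty(states, ["a"]))   # 2.0
    print(expected_difficulty(states, []))      # 4.0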
Graph Method

Given the circuit netlist and the number of signals to be selected, this pseudo-code returns a signal selection based on the eigenvector centrality of flip-flops in the circuit.

Inputs: circuit netlist, selection size.
Outputs: signal selection.

procedure computeGraphSelection(netlist, selsize)
    // Extract the flip-flop connectivity graph from the netlist
    connectivityGraph = extractConnectivity(netlist)
    // Find the eigenvector centrality of all signals
    centralityList = computeEigenvectorCentrality(connectivityGraph)
    // Sort this list based on their centrality values
    centralityList = sortDescending(centralityList)
    // Extract the signals with the highest centrality values
    selectedSet = centralityList[0:selsize]
    return selectedSet

Hybrid EV Method

Given the circuit netlist and the number of signals to be selected, this pseudo-code returns a signal selection based on first decomposing the circuit hierarchically based on the centrality of its modules, before applying the flat method to each of these sub-circuits separately.

Inputs: circuit netlist, selection size, merge threshold.
Outputs: signal selection.

procedure computeHybridSelection(netlist, selsize, mergethresh)
    // Extract the module hierarchy of the circuit
    moduleGraph = extractHierarchy(netlist)
    // Merge small modules together from the bottom up, until they exceed the threshold
    moduleGraph = bottomUpMerge(moduleGraph, mergethresh)
    // Annotate all edges of the moduleGraph with the pairwise connectivity between source and target modules
    moduleGraph = annotateConnectivity(moduleGraph, netlist)
    // Find the eigenvector centrality of all modules (with edge weights)
    centralityList = computeEigenvectorCentrality(moduleGraph)
    selectedSet = emptySet()
    foreach module in centralityList
        // Compute the number of signals to select from this module
        moduleSelSize = selsize * module.centrality
        // Apply the flat selection to the netlist for this module only
        moduleSel = computeFlatSelection(module.netlist, moduleSelSize)
        selectedSet = selectedSet + moduleSel
    return selectedSet
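To make the centrality computation concrete, the sketch below estimates eigenvector centrality by power iteration on a flip-flop connectivity graph held as an adjacency dictionary, then returns the highest-ranked signals. It is an illustrative stand-in for computeEigenvectorCentrality above, assuming an unweighted graph whose neighbour sets only reference known nodes; the graph direction, edge weighting and convergence criteria used in the thesis may differ.

    def eigenvector_centrality(graph, iterations=100, tol=1e-9):
        # graph: {node: set of neighbouring nodes}; returns {node: centrality score}
        nodes = list(graph)
        score = {n: 1.0 / len(nodes) for n in nodes}
        for _ in range(iterations):
            new = {n: 0.0 for n in nodes}
            for n in nodes:
                for m in graph[n]:
                    new[m] += score[n]        # each node passes its score to its neighbours
            norm = sum(v * v for v in new.values()) ** 0.5 or 1.0
            new = {n: v / norm for n, v in new.items()}
            converged = max(abs(new[n] - score[n]) for n in nodes) < tol
            score = new
            if converged:
                break
        return score

    def graph_selection(graph, selsize):
        # Pick the selsize flip-flops with the highest centrality
        score = eigenvector_centrality(graph)
        return sorted(graph, key=lambda n: score[n], reverse=True)[:selsize]

    # Example: 'c' is the most central flip-flop in this tiny graph
    g = {"a": {"b", "c"}, "b": {"c"}, "c": {"a"}, "d": {"c"}}
    print(graph_selection(g, 2))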
C.2 Speculative Debug Insertion

Given a placed-and-routed Quartus II project, this pseudo-code will speculatively instrument the circuit using SignalTap II. The trace-buffer configuration is intended to use 10% of all on-chip logic resources, and as much spare RAM as possible. Signals are selected using the graph-based method.

Inputs: post place-and-route Quartus II project, instrumentation area model.
Outputs: speculatively instrumented Quartus II project.

procedure speculativeInsertion(project, areamodel)
    // Extract the existing resource utilization
    utilization = extractUtilization(project)
    // Compute the logic resources available (capped at 10%)
    logicBudget = min(1.0 - utilization.logic, 0.1)
    // Compute the RAM resources available (uncapped)
    ramBudget = 1.0 - utilization.ram
    // Estimate the number of signals that can be traced, based on model
    selsize = areamodel.compute(logicBudget, ramBudget)
    // Compute a graph-based signal selection
    selection = computeGraphSelection(project.netlist, selsize)
    // Use SignalTap II to instrument these signals
    project.instrumentWithSignalTapII(selection)
    // Recompile the project from scratch
    project.fullRecompile()
    return project

C.3 Incremental Tracing

Given a VPR project that has been place-and-routed (composed of a packed circuit netlist, placement and routing) and a signal selection, this pseudo-code will attempt to connect all signals incrementally to any free RAM input using only the spare resources not used by the original circuit.

Inputs: post place-and-route VPR project, signal selection.
Outputs: incrementally traced VPR project.

procedure buildIncrementalTraces(project, selection)
    // Load project (packed netlist, placement, routing) into VPR
    vpr.loadProject(project)
    // Reclaim all data inputs from spare RAM blocks as eligible routing targets
    markAllSpareRamInputs()
    // Pre-assign a suggested target for each selected signal
    preassignTarget(selection)
    // Perform PathFinder-style routing, for up to 5 routing iterations
    for i = 0 .. 5
        foreach signal in selection
            // Perform timing-driven search, towards suggested target, for each signal:
            //   a. Using only free routing resources not consumed by the user circuit
            //   b. If local, leaving from any free or local OPIN in the logic cluster;
            //      if global, expanding from any point on the existing net
            //   c. Stopping if any RAM target is found
            timingDrivenDirectedSearch(signal)
        // If routing is feasible (i.e. no resources are used more than once) then finish
        if vpr.isLegal() then break
        // Otherwise increase cost of all overused resources, and rip-up all debug routing
        vpr.increaseOveruseCost()
        vpr.ripUpDebugRouting()
    // If after 5 routing iterations the solution is still illegal, iteratively discard each
    // illegal signal until all is legal
    if vpr.isLegal() == false then
        foreach signal in selection
            if signal.isLegal() == false then signal.ripUpDebugRouting()
    // Place trace-buffers into the RAM blocks that were routed to
    placeTraceBuffers()
    // Save as a new project
    tracedProject = vpr.saveProject()
    return tracedProject
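The routing loop above follows the PathFinder negotiated-congestion scheme [96]: each debug net is routed greedily through spare resources, any resource claimed by more than one net has its cost raised, and all debug routes are then ripped up and retried. The Python sketch below is a deliberately simplified, software-only illustration of that negotiation on a generic resource graph; the function names (route_all, dijkstra) and the cost model are illustrative and do not correspond to the VPR code used in the thesis.

    import heapq

    def dijkstra(graph, src, targets, cost):
        # Cheapest path from src to any node in targets, or [] if unreachable
        targets = set(targets)
        dist, prev, pq = {src: 0.0}, {}, [(0.0, src)]
        while pq:
            d, u = heapq.heappop(pq)
            if u in targets:
                path = [u]
                while path[-1] != src:
                    path.append(prev[path[-1]])
                return path[::-1]
            if d > dist.get(u, float("inf")):
                continue
            for v in graph[u]:
                nd = d + cost(v)
                if nd < dist.get(v, float("inf")):
                    dist[v], prev[v] = nd, u
                    heapq.heappush(pq, (nd, v))
        return []

    def route_all(nets, graph, base_cost, max_iters=5):
        # nets: {name: (source, targets)}; graph: {node: list of neighbours}
        history = {n: 0.0 for n in graph}              # accumulated congestion penalty
        for _ in range(max_iters):
            use = {n: 0 for n in graph}
            routes = {}
            for name, (src, targets) in nets.items():
                path = dijkstra(graph, src, targets,
                                cost=lambda n: base_cost[n] + history[n] + 2.0 * use[n])
                routes[name] = path
                for n in path:
                    use[n] += 1                        # present congestion
            overused = [n for n, c in use.items() if c > 1]
            if not overused:
                return routes                          # legal: no resource is shared
            for n in overused:
                history[n] += 1.0                      # negotiation: contested nodes get pricier
        return None                                    # unresolved after max_iters, as in C.3

In the thesis flow, the equivalent negotiation is carried out by the VPR router over the device's routing-resource graph rather than over a software graph like this one.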
C.4 Virtual Overlay Network

Given a VPR project that has been place-and-routed (composed of a packed circuit netlist, placement and routing) and a network connectivity parameter, this pseudo-code will build an overlay network which incrementally connects all circuit signals to multiple free RAM inputs using only the spare resources not used by the original circuit.

Inputs: post place-and-route VPR project, network connectivity parameter.
Outputs: VPR project with overlay network.

procedure buildOverlayNetwork(project, connectivity)
    // Load project (packed netlist, placement, routing) into VPR
    vpr.loadProject(project)
    // Reclaim all data inputs from spare RAM blocks as eligible routing targets
    markAllSpareRamInputs()
    // Perform PathFinder-style routing, for up to 50 routing iterations (VPR default)
    for i = 0 .. 50
        // For every LUT, FF, RAM or DSP output of the circuit
        foreach signal in project.signals
            // For as many unique connections as requested
            for j = 0 .. connectivity
                // Perform directed search in a random direction:
                //   a. Using only free routing resources not consumed by the user circuit
                //   b. If local, leaving from any free or local OPIN in the logic cluster;
                //      if global, expanding from any point on the existing net
                //   c. Stopping if any new RAM target for this signal is found
                //   d. Preferring to share existing connections found by other signals, to
                //      minimize the search space
                randomDirectionSearch(signal)
        // If routing is feasible (i.e. no resources are used more than once) then finish
        if vpr.isLegal() then break
        // Otherwise increase cost of all overused resources, and rip-up all debug routing
        vpr.increaseOveruseCost()
        vpr.ripUpDebugRouting()
    // If after 50 routing iterations the solution is still illegal, iteratively discard each
    // illegal signal until all is legal
    if vpr.isLegal() == false then
        foreach signal in project.signals
            if signal.isLegal() == false then signal.ripUpDebugRouting()
    // Place trace-buffers into the RAM blocks that were routed to
    placeTraceBuffers()
    // Save as a new project
    overlayProject = vpr.saveProject()
    return overlayProject
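Once the overlay has been built, every circuit signal is pre-wired to a small number of spare trace-buffer inputs, so producing a new debug configuration amounts to matching the signals requested on a given debug turn onto distinct trace inputs, with no recompilation. The sketch below performs that assignment with a simple augmenting-path bipartite matching; the connectivity map and function names are assumptions for illustration, not the data structures used in the thesis.

    def configure_overlay(requested, reachable):
        # requested: signals to observe on this debug turn
        # reachable: {signal: set of trace-buffer inputs wired to it by the overlay}
        owner = {}                                   # trace input -> signal assigned to it

        def assign(signal, visited):
            for pin in reachable.get(signal, ()):    # try each overlay connection of this signal
                if pin in visited:
                    continue
                visited.add(pin)
                # Use a free pin, or displace the current owner onto one of its other pins
                if pin not in owner or assign(owner[pin], visited):
                    owner[pin] = signal
                    return True
            return False

        for signal in requested:
            if not assign(signal, set()):
                raise ValueError("no free trace input reachable from " + signal)
        return {sig: pin for pin, sig in owner.items()}

    # Example: 'a' and 'b' both reach t0, but 'a' can be displaced onto t1
    print(configure_overlay(["a", "b"], {"a": {"t0", "t1"}, "b": {"t0"}}))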
