UBC Theses and Dissertations


Techniques for enabling in-system observation-based debug of high-level synthesis circuits on FPGAs Goeders, Jeffrey 2016



Techniques for Enabling In-System Observation-based Debug of High-Level Synthesis Circuits on FPGAs

by

Jeffrey Goeders

BASc Computer Engineering, University of Toronto, 2010
MASc Computer Engineering, The University of British Columbia, 2012

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Electrical and Computer Engineering)

The University of British Columbia
(Vancouver)

September 2016

© Jeffrey Goeders, 2016

Abstract

High-level synthesis (HLS) is a rapidly growing design methodology that allows designers to create digital circuits using a software-like specification language. HLS promises to increase the productivity of hardware designers in the face of steadily increasing circuit sizes, and to broaden the availability of hardware acceleration, allowing software designers to reap the benefits of hardware implementation. One roadblock to HLS adoption is the lack of an in-system debugging infrastructure. Existing debug technologies are limited to software emulation and cannot be used to find bugs that only occur in the final operating environment.

This dissertation investigates techniques for observing HLS circuits, allowing designers to debug the circuit in the context of the original source code, while it executes at-speed in the normal operating environment.

This dissertation comprises four major contributions toward this goal. First, we develop a debugging framework that provides users with a basic software-like debug experience, including single-stepping, breakpoints, and variable inspection. This is accomplished by automatically inserting specialized debug instrumentation into the user's circuit, allowing an external debugger to observe the circuit. Debugging at-speed is made possible by recording circuit execution in on-chip memories and retrieving the data for offline analysis. The second contribution contains several techniques to optimize this data capture logic.
Program analysis is performed to develop circuitry that is tailored to the user's individual design, capturing a 127x longer execution trace than an embedded logic analyzer.

The third contribution presents debugging techniques for multithreaded HLS systems. We develop a technique to observe only user-selected points in the program, allowing the designer to sacrifice complete observability in order to observe specific points over a longer period of execution. We present an algorithm to allow hardware threads to share signal-tracing resources, increasing the captured execution trace by 4x for an eight-thread system.

The final contribution is a metric to measure observability in HLS circuits. We use the metric to explore trade-offs introduced by recent in-system debugging techniques, and show how different approaches affect the likelihood that variable values will be available to the user, and the duration of execution that can be captured.

Preface

The contributions presented in this thesis have been published in journals and conference proceedings [1, 2, 4, 6], as well as a workshop paper [3] and a book chapter [5].

Content from Chapter 3 was published as a conference paper [1] and a workshop paper [3]. Large portions of Chapter 4 were published in paper [2]. Papers [1] and [2] were combined and extended with the remaining content from Chapter 4 as the journal article in [6]. These contributions were prototyped in an open-source academic tool, which was described in the book chapter in [5]. Content from Chapter 5 was published as a conference paper [4]. Content from Chapter 6 was also published as a conference paper [7].

In all of these contributions, I was primarily responsible for conducting the research, prototyping techniques, and performing experiments. This was done under the guidance and direction of my advisor, Dr. Steve Wilton. Dr. Wilton also provided editorial support for all manuscripts.

[1] Jeffrey Goeders and Steven J. E. Wilton.
"Effective FPGA debug for high-level synthesis generated circuits". In: International Conference on Field Programmable Logic and Applications. Sept. 2014, pp. 1–8.

[2] Jeffrey Goeders and Steven J. E. Wilton. "Using Dynamic Signal-Tracing to Debug Compiler-Optimized HLS Circuits on FPGAs". In: International Symposium on Field-Programmable Custom Computing Machines. May 2015, pp. 127–134.

[3] Jeffrey Goeders and Steven J. E. Wilton. "Allowing Software Developers to Debug HLS Hardware". In: International Workshop on FPGAs for Software Programmers. Aug. 2015.

[4] Jeffrey Goeders and Steven J. E. Wilton. "Using Round-Robin Tracepoints to Debug Multithreaded HLS Circuits on FPGAs". In: International Conference on Field-Programmable Technology. Dec. 2015, pp. 40–47.

[5] Andrew Canis, Jongsok Choi, Blair Fort, Bain Syrowik, Ruo Long Lian, Yuting Chen, Hsuan Hsiao, Jeffrey Goeders, Stephen Brown, and Jason Anderson. "LegUp high-level synthesis". Chapter in: FPGAs for Software Programmers. Springer, 2016.

[6] Jeffrey Goeders and Steven J. E. Wilton. "Signal-Tracing Techniques for In-System FPGA Debugging of High-Level Synthesis Circuits". In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2016). To be published in 2016.

[7] Jeffrey Goeders and Steven J. E. Wilton. "Quantifying Observability for In-System Debug of High-Level Synthesis Circuits". In: International Conference on Field Programmable Logic and Applications. Aug. 2016.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments

1 Introduction
  1.1 Motivation
    1.1.1 The Increasing Demand for High-Level Synthesis
    1.1.2 The Need for In-System Debugging of HLS Circuits
  1.2 Challenges and Objectives
    1.2.1 Challenges of Source-Level, In-System Debugging
    1.2.2 Research Objectives
  1.3 Contributions
    1.3.1 A Source-Level, In-System HLS Debugging Framework
    1.3.2 Optimizing Data Capture
    1.3.3 Debugging Parallel Systems
    1.3.4 Quantifying Observability
  1.4 Organization

2 Background
  2.1 Current Approaches for Debugging HLS Circuits
    2.1.1 Debugging and Validation Scenarios
    2.1.2 Debugging by Software Emulation
    2.1.3 Debugging by Hardware Simulation
    2.1.4 In-System Hardware Debugging
    2.1.5 Filling a Need: Source-Level, In-System Debugging
  2.2 In-System Hardware Debug Techniques
    2.2.1 Scan-based Debugging
    2.2.2 Trace-based Debugging
  2.3 High-Level Synthesis
    2.3.1 History and Motivation
    2.3.2 Present-Day HLS Tools
    2.3.3 The HLS Flow
  2.4 Related Work
    2.4.1 In-System HLS Debugging
    2.4.2 Debugging Optimized Software
  2.5 Approaches and Assumptions In This Work
    2.5.1 Context with Related Work
    2.5.2 Fault Model
  2.6 Summary

3 The Debugging Framework
  3.1 The Debugging Framework
    3.1.1 Adding Debug to the HLS Flow
    3.1.2 Modes of Operation
    3.1.3 Observability-Based Debug
    3.1.4 Debugging Context
  3.2 Mapping Software to Hardware
    3.2.1 Scope of Mapping
    3.2.2 Control Flow
    3.2.3 Data Flow
    3.2.4 Required Modifications to the HLS Tool
    3.2.5 Properties of Benchmark Circuits
  3.3 Circuit Instrumentation
    3.3.1 Instrumentation Components
    3.3.2 Required Modifications to the HLS Tool
  3.4 The Debugger Application
    3.4.1 Gantt Chart
    3.4.2 Debug Modes
    3.4.3 Instruction-Level Parallelism
    3.4.4 IR Instructions
    3.4.5 Compiler Optimizations
  3.5 Summary

4 Optimizing Data Capture
  4.1 Baseline Architecture
  4.2 Split Trace Buffer Architecture
  4.3 Control Trace Buffer Optimizations
  4.4 Datapath Registers Trace Buffer Optimizations
    4.4.1 Dynamic Tracing Architecture
    4.4.2 Delay-Worst Signal-Trace Scheduling
    4.4.3 Delay-All Signal-Trace Scheduling
    4.4.4 Dual-Ported Memory Signal-Trace Scheduling
    4.4.5 Dynamic Tracing Results
  4.5 Memory-Update Trace Buffer Optimization
    4.5.1 Case 1 – Tracing a Single Memory Controller
    4.5.2 Case 2 – Multiple Memory Controllers, Combined Tracing with the Datapath Registers
    4.5.3 Case 3 – Substitute Memory Controller Signals with IR Registers
    4.5.4 Case 4 – When Possible, Use Memory Controller Signals to Deduce IR Signals
  4.6 Challenges of a Split Buffer Architecture
    4.6.1 Event Ordering
    4.6.2 Trace Buffer Balancing
  4.7 Signal Restoration
    4.7.1 The Signal-Trace Selection Problem
    4.7.2 Solving the Optimization Problem
    4.7.3 Restoration Effectiveness and Limitations
  4.8 Results
    4.8.1 Debug Architecture Optimizations
    4.8.2 Area Breakdown
    4.8.3 Impact and Considerations
  4.9 Summary

5 Debugging Parallel Systems
  5.1 Framework Requirements for Debugging Parallel HLS Systems
  5.2 Debugging with Tracepoints
  5.3 Basic Tracepoint Architecture
    5.3.1 Mapping Tracepoints to RTL Signals
    5.3.2 Signal-Tracing Architecture
    5.3.3 Issues with the Basic Architecture
  5.4 Round-Robin Tracepoint Architecture
    5.4.1 Using Signal Update Frequency to Group Threads
    5.4.2 Round-Robin Circuit
    5.4.3 Data Buffering
    5.4.4 Limitations to Buffer Sharing
  5.5 Evaluation
    5.5.1 Experiment Methodology
    5.5.2 Evaluation Metric
    5.5.3 Results
    5.5.4 Area Overhead and Timing Considerations
  5.6 Summary

6 Quantifying Observability
  6.1 The Observability Metric
    6.1.1 Availability
    6.1.2 Observability
  6.2 Comparing Observability of HLS Signal-Tracing Techniques
    6.2.1 Signal-Tracing Approaches
    6.2.2 Measurement Methodology
    6.2.3 Results and Analysis
  6.3 Quantifying Observability of Variable Selection Methods
    6.3.1 Motivations
    6.3.2 Variable Selection Methods
    6.3.3 Measurement Methodology
    6.3.4 Results and Analysis
  6.4 Summary

7 Conclusion and Future Work
  7.1 Dissertation Summary
    7.1.1 Summary of Contributions
    7.1.2 Dissertation Impact
  7.2 Limitations and Short-Term Future Work
    7.2.1 Extending to Commercial HLS Tools
    7.2.2 Widespread Support for Optimizations
    7.2.3 Source-to-Source Transformations
    7.2.4 Scalability to Large Benchmarks
    7.2.5 Enhancements to the Debugger GUI
    7.2.6 Exploring Off-Chip Signal-Tracing
  7.3 Long-Term Future Directions
    7.3.1 Integrated Software/Hardware Debugging
    7.3.2 Helping the User Create High-Quality HLS Systems

Bibliography

List of Tables

Table 3.1 CHStone benchmark properties
Table 4.1 Dynamic signal-tracing results, improvement to trace length
Table 4.2 Signal restoration results
Table 4.3 Debugging architectures, trace length improvements and area and timing overheads
Table 4.4 Detailed area breakdown of debugging instrumentation
Table 5.1 Tracepointing logic area cost
Table 5.2 Round-robin unit area cost
Table 6.1 Observability of signal-trace techniques
Table 6.2 Selection methods for partial variable tracing
Table 7.1 Average memory bandwidth required by trace buffers

List of Figures

Figure 2.1 Classification of bugs in an HLS system
Figure 2.2 Hardware view of a debug trace
Figure 2.3 The trace-based debug process
Figure 2.4 The HLS flow
Figure 2.5 Example mapping of C code to IR, and scheduling of IR instructions
Figure 2.6 High-level structure of circuit produced by HLS
Figure 2.7 Event Observability Ports (EOPs) from [53]
Figure 3.1 The HLS flow, modified to provide debugging support
Figure 3.2 The debugging instrumentation
Figure 3.3 Screenshot of our debugger prototype
Figure 3.4 Gantt chart showing loop unrolling optimization
Figure 3.5 Gantt chart showing code reordering optimization
Figure 4.1 Baseline signal-tracing architecture
Figure 4.2 Layout of the control flow buffer
Figure 4.3 Memory efficiency of control buffer, when varying number of sequential counter bits
Figure 4.4 Dynamic tracing architecture
Figure 4.5 Signal-trace scheduling enhancements for the dynamic signal-tracing architecture
Figure 4.6 Example state graphs where delayed tracing cannot be performed
Figure 4.7 Signal-tracing alternatives for capturing updates to variables in memory
Figure 4.8 Example of a C program containing memory writes, and the resulting scheduled IR code (Figure 4.9 shows the resulting circuitry)
Figure 4.9 The generated memory controller circuitry for the example program in Figure 4.8
Figure 4.10 Histogram showing the number of cycles of history required to restore a signal value
Figure 5.1 Tracepoints in a multithreaded application
Figure 5.2 Timeline of tracepoint event data
Figure 5.3 High-level view of tracepointing architectures
Figure 5.4 Tracepointing logic for each thread in the basic architecture
Figure 5.5 Example of the benefit of buffer sharing
Figure 5.6 Tracepoint interval
Figure 5.7 Round-robin tracepoint architecture
Figure 5.8 Increase to trace length provided by buffer sharing
Figure 5.9 Buffer sharing between threads in different locations of the FPGA
Figure 6.1 Example of captured execution trace with variable reads and writes
Figure 6.2 Variable availability for Techniques 2 and 3
Figure 6.3 Availability, duration and observability provided by variable selection methods

Glossary

HLS     High-Level Synthesis
IC      Integrated Circuit
HDL     Hardware Description Language
RTL     Register-Transfer Level
VLSI    Very-Large-Scale Integration
FPGA    Field-Programmable Gate Array
ASIC    Application-Specific Integrated Circuit
OpenCL  Open Computing Language
ELA     Embedded Logic Analyzer
GUI     Graphical User Interface
FSM     Finite State Machine
CPU     Central Processing Unit
GPGPU   General-Purpose Graphics Processing Unit
SDK     Software Development Kit
IR      Intermediate Representation
SSA     Static Single Assignment
ILP     Instruction-Level Parallelism
EOP     Event Observability Port
ALM     Adaptive Logic Module
ALUT    Adaptive Look-up Table
CAD     Computer-Aided Design
SIMD    Single Instruction, Multiple Data
MIMD    Multiple Instruction, Multiple Data

Acknowledgments

First and foremost, I would like to express my thanks to my advisor, Steve Wilton. His advice and guidance, personal example, and generous support of time have been instrumental in the development of this work.
He has always been dedicated to my success, not just to the success of this research, but to my own success as a researcher and contributor to the academic community.

I would also like to thank the many other graduate students at UBC who have provided their time and support discussing my work, listening to practice presentations, and providing great advice. This includes Eddie Hung, Fatemeh Eslami, Assem Bsoul, Jose Pinilla, Sarah Mashayekhi, Aaron Severance, Pavan Bussa, Rehan Ahmed, Joydip Das, Stuart Dueck, Kyle Balston, and several others. Several faculty at UBC also deserve acknowledgement for their professional advice, mentorship and guidance, including Guy Lemieux, Karthik Pattabiraman, Sathish Gopalakrishnan, Tor Aamodt, Mieszko Lis, and Paul Davies.

Financial support for this work was provided by the Natural Sciences and Engineering Research Council of Canada, as well as support from Altera Corporation. Xilinx and Altera were kind enough to allow me to present my work to their research teams, and provided helpful feedback and directions.

Finally, and most importantly, I would like to thank my family for their unrelenting support during these years of my PhD. My parents raised me to value education and have always supported my decisions and goals. The greatest help and largest sacrifice has been from my wife Jessie, who has spent countless hours caring for our children, cleaning our home, feeding our family, and supporting me. Her dedication and support cannot be overstated.

Chapter 1

Introduction

1.1 Motivation

High-Level Synthesis (HLS) is a design process by which a software program can be automatically synthesized to a digital hardware circuit. This design process is attractive as it has the potential to offer the best aspects of both the hardware and software domains. Developing a software program is
Developing a software program istypically simpler and faster than traditional digital circuit design methods, offering faster time-to-marketand reduced developments costs, while a hardware implementation often provides higher performanceand/or lower power consumption than running an equivalent design as software on a processor. HLScan potentially offer both of these benefits by combining software-like design methods with hardwareimplementation.1.1.1 The Increasing Demand for High-Level SynthesisHLS first began as academic research projects throughout the 1970s and 1980s, and later was used insmall scale commercially at such companies as IBM [8], Siemens [9], NEC [10] and several others.However, in the last decade HLS has gained considerable attention, and is moving from a niche designmethod, to a widely available, general-purpose design flow for creating digital circuits. This shift towardHLS design methods is being motivated by many factors, the chief of which are 1) the need for a fastermethod for developing digital circuits, and 2) the need for a framework to easily accelerate softwareworkloads using application-specific digital circuits.11.1. MotivationRaising the Abstraction of Hardware DesignThe first major factor, the need for faster methods of designing digital circuits, is due to the ever in-creasing sizes of Integrated Circuits (ICs). Exponentially increasing circuit sizes mean that the design,test, and debug of ICs is requiring more and more man-hours. In the 1980s, Hardware DescriptionLanguages (HDLs), such as Verilog and VHDL were introduced in order to raise the design abstrac-tion, allowing designers to create circuits using Register-Transfer Level (RTL) descriptions instead ofmanually designing using gates and transistors. This transition made it easier for designers to createmuch larger circuits, such as Very-Large-Scale Integration (VLSI) type designs, and today, RTL designis the dominant method for creating digital circuits. 
However, as today's circuit sizes grow to billions of transistors, researchers are again looking to raise the abstraction level. This is motivating a shift to higher-level design abstractions, such as embedded processors coupled with FPGA accelerators [11], embedded vector processors [12], and HLS design flows. HLS tools allow designers to specify algorithms using high-level software languages, which are then compiled to RTL for use in standard RTL tool flows; this offers faster design times than manual RTL design [13–16].

This push to higher abstraction levels and faster design flows is especially important for reconfigurable computing devices, which primarily includes Field-Programmable Gate Arrays (FPGAs) [17]. This is because FPGAs are typically chosen for their time-to-market advantages over Application-Specific Integrated Circuits (ASICs); an HDL circuit can be compiled and programmed onto an FPGA in minutes or hours, whereas ASICs often take months to produce [18]. For this reason, FPGA vendors have invested heavily in HLS technologies; Altera (recently acquired by Intel) released the Altera SDK for Open Computing Language (OpenCL) in 2013 [19], Xilinx released the Vivado HLS tool in 2012 [20], and several other HLS vendors offer FPGA-specific design flows [21, 22].

A Framework for Accelerating Software Workloads

The second major factor driving HLS adoption is the need for a framework to accelerate software workloads. This is due to the increasing adoption of FPGAs as general-purpose accelerators for software programs. For example, in 2014, Microsoft developed the Catapult system, which integrates FPGAs into their data center compute platforms in order to accelerate the Bing web search engine. By accelerating portions of the software code via hardware implementation on FPGAs, they were able to double throughput or reduce latency by 29% [23].
In 2016, Intel will be releasing a new type of processing chip, which combines a Xeon Central Processing Unit (CPU) with an Altera FPGA on the same package [24]. Likewise, both IBM and Qualcomm have partnered with Xilinx to bring FPGA acceleration to their cloud computing environments [25, 26]. By migrating portions of a software workload to execute as a hardware circuit on an FPGA, both increased performance and lower power consumption can be obtained, versus a CPU-only implementation. For example, Microsoft used their Catapult system to accelerate a machine learning image classification system, and achieved 100 times better operations per second per Joule than using a pure software implementation running on a CPU. Intel predicts that by 2020, a third of the data center market will be using FPGAs to accelerate workloads [27].

Although large performance improvements are possible, hardware implementation is a challenging task. Traditionally, creating a hardware circuit to accelerate an algorithm requires RTL design, which is much more time consuming than software design. Cycle-by-cycle timing, memory interfaces, and scheduling of operations are all abstracted away when designing software; however, when designing an RTL circuit all of these are critical parts of the design. In addition, there are relatively few hardware designers in the workforce, as software developers outnumber hardware designers by a factor of ten [28]. HLS aims to solve these issues by allowing hardware circuits to be created using software as an input. The goal is to allow software developers to create hardware implementations of their designs without hardware expertise.
Today, the major FPGA vendors produce HLS tools specifically aimed at software developers without hardware expertise, including OpenCL-based flows from both Altera and Xilinx [19, 29].

1.1.2 The Need for In-System Debugging of HLS Circuits

In the past, most work related to HLS development has focused on performance, where the goal is to produce the highest performance hardware circuit based on the software specification. There are many examples of such work [30–40]. Of course this emphasis on performance is expected, as a high-quality HLS compiler is requisite for migration to HLS design flows. If HLS-produced circuits perform much worse than custom RTL circuits, hardware designers will be hesitant to use HLS, even if design time is reduced. Likewise, software designers need speed or power improvements to justify migrating software code to a hardware implementation.

However, in order for HLS to receive widespread adoption, there needs to be more than just a good compiler. A complete ecosystem of tools that supports design, test, debug, and optimization is essential. In addition, since HLS is targeted at both hardware and software designers, these technologies must be accessible to both of these groups. Support for effective debugging is an important part of this ecosystem, and is the focus of this work.

Current HLS tools do not offer practical methods to perform in-system debugging of the resulting circuit. Most tools offer isolated debugging of HLS blocks by porting and running the software code on a workstation. With this technique, standard software debuggers, such as GDB [41], can be used to debug the software program. However, this relies upon the user providing accurate inputs to the software program. In practice, the actual operating environment of the hardware circuit produced by HLS may consist of other hardware modules, processors, or I/O devices.
These other blocks may interact with the HLS-produced circuit, and the occurrence of bugs may depend upon the full system interaction between these modules. Performing software-based debugging would require perfectly replicating this entire system in software, which may be impractical or near impossible. Software-based debugging also assumes the HLS tool is perfect; if the HLS tool is buggy, or the user has used it incorrectly, the resulting circuit may behave differently than the software.

Some HLS tools also offer debugging using RTL simulation [19, 42]. In these tools the debugger has the same look and feel as a standard software debugger; however, the control flow and data flow are obtained through RTL simulation, typically with Modelsim, rather than by executing the original software. This is useful for verifying that the HLS process was performed correctly, and that the RTL circuit matches the behaviour of the original source code. However, this debug flow also relies upon the user providing system inputs that perfectly replicate the final hardware system, which may not be possible. In addition, hardware simulation typically executes several orders of magnitude slower than actual hardware execution [43, 44], so even if realistic I/O traffic can be simulated, it may simply take too long to simulate the design to the point where the bug occurs.

Although software-based debug and simulation-based debug are useful for capturing many bugs in HLS systems, inevitably there will be bugs that cannot be solved using these techniques, due to the circumstances mentioned above.
In such cases it is necessary to perform in-system debugging, that is, to debug the circuit in the hardware environment, while it executes and interacts with other parts of the system.

In-system hardware debugging can be performed with the help of Embedded Logic Analyzers (ELAs) such as Altera SignalTap II [45], Xilinx ChipScope Pro [46] or Mentor's Certus tool [47], all of which provide visibility into a hardware design. Unfortunately, these tools provide visibility at the RTL abstraction level, which only has meaning to someone who understands the underlying hardware. A software designer typically views a design as a set of processes, each consisting of sequential control flow code, while the generated hardware consists of dataflow components operating in parallel across multiple clock cycles. HLS tools typically perform scheduling optimizations, moving operations across cycle boundaries, and allocation strategies that make the relationship between software variables and hardware entities difficult to understand. This mismatch between a software view of the design and the generated hardware makes debugging difficult. In the short term, HLS will be used primarily by hardware designers seeking a more productive design environment, and the productivity advantages of HLS will be lost if these designers need to think about their design in terms of the underlying hardware. In the long term, HLS will be used by software designers; for these designers, reasoning about the behaviour of the hardware generated by HLS tools may be impossible.
In either case, requiring HLS users to understand the underlying hardware negates many of the benefits promised by this technology.

Rather than requiring the user to debug the circuit using the RTL code, this work takes the approach of performing debugging using the original source code (we refer to this as source-level debugging). This means that designers can step through instructions in their original program, and inspect variable values from their source code, in a manner similar to a standard software debugger. However, instead of performing software simulation, the control and data flow is obtained by inspecting signals from the executing hardware circuit. This is a much preferred method to RTL debug, as it allows users to design their system and perform debugging at the same abstraction level, removing the need to understand the full details of how their software is implemented in the low-level hardware.

1.2 Challenges and Objectives

In this dissertation we present techniques for source-level, in-system debug of HLS circuits, and address the challenges associated with such an approach.

1.2.1 Challenges of Source-Level, In-System Debugging

There are two main challenges present when performing source-level, in-system debugging of HLS circuits:

1. Tackling the hardware/software mismatch: Source-level, in-system debugging means that the user is performing debugging in the context of the original source code; however, the actual execution is taking place in hardware. Thus, relationships need to be established between the source code and the hardware circuit. This is challenging for a number of reasons. Inherent differences between the sequential nature of software and the parallel nature of hardware mean that the operations in the software will not necessarily be completed in the expected order or in a sequential manner.
Also, during the HLS flow many transformations and optimizations are performed, which will prevent a one-to-one mapping between software and hardware entities.

2. FPGA Observability: The key to debugging a digital circuit is understanding what is happening inside the circuit as it executes. This requires a method for obtaining the values of internal circuit signals during execution. HLS debugging typically takes place on FPGAs (even if the end goal is to fabricate an ASIC, the debug phase will likely take place on an FPGA), where, due to the limited number of I/Os, it is impossible to observe all needed internal signals in real time during execution.

1.2.2 Research Objectives

The objectives of this work are to:

• Develop a framework for source-level, in-system debugging of HLS circuits, that is accessible to both software and hardware designers. This includes building relationships between source-code entities and signals in the hardware circuit, developing debug circuitry to access relevant internal signals during circuit execution, using these signal values to construct the control and data flow of the original program, and providing this to the user in a meaningful format.

• Design and evaluate methods for increasing observability within HLS circuits. Current hardware debug tools provide limited observability into circuit execution, due to limited on-chip resources. These existing tools are built to accommodate any digital circuit a user provides, which leaves little room for optimization. In contrast, in the HLS flow, circuits are automatically created and have a set pattern and structure. This provides opportunity to build specialized debug circuitry that is tailored to each individual circuit, allowing for better use of hardware resources and greater observability into circuit execution.

• Investigate how debugging techniques can be adapted to handle optimizations that are most common in HLS tools.
Although there are many different optimization techniques in use by today's HLS tools, to focus our efforts, we address those that are most fundamental to HLS. This includes developing debug techniques that work with standard compiler optimizations and parallel hardware systems.

1.3 Contributions

The contributions of this dissertation are divided into four sections:

• A framework for source-level in-system debugging of HLS-produced circuits.

• Techniques for optimizing on-chip data capture of HLS systems, increasing the amount of execution that can be captured and used for debugging.

• An adapted debug framework to support parallel HLS systems, and additional data capture optimizations for these systems.

• A metric for quantifying observability into HLS systems, and an exploration of the observability provided by different HLS debugging approaches.

The following subsections discuss these contributions in further detail.

1.3.1 A Source-Level, In-System HLS Debugging Framework

This dissertation presents a new approach to debugging HLS-produced circuits, which allows the user to debug in the context of the source code, while running the circuit in-situ. This is accomplished through a combination of debug circuitry that is automatically added to the user circuit produced by the HLS tool, and a debugger application that runs on the user's workstation and interacts with the circuit on the FPGA.

The debug circuitry we developed enables two types of debugging flows. First, an interactive debugging flow allows users to single-step through the program, and the HLS circuit is started and stopped for each stepping operation. The value of source code variables can be obtained by reading the value of certain hardware registers or memories. The main drawback of this approach is that the circuit must be halted at each step to retrieve the data.
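The interactive flow can be pictured with a small sketch. This is a hypothetical illustration only — the `DebugController` class, its methods, and the variable-to-register map are invented names, not the actual LegUp debugger interface — but it captures the essential idea: each source variable is resolved, via mappings recorded during HLS, to the hardware register that holds it, and that register is read while the circuit is halted.

```python
# Hypothetical sketch of the interactive debug flow: single-step the
# circuit, then resolve a source variable to the hardware register that
# holds it and read that register. All names are illustrative only.

class DebugController:
    """Models the debug instrumentation added to the HLS circuit."""

    def __init__(self, var_to_reg, registers):
        self.var_to_reg = var_to_reg    # C variable name -> register name
        self.registers = registers      # register name -> current value
        self.halted = False

    def single_step(self):
        # Advance the circuit by one source-level step, then halt it so
        # that register contents can be read out.
        self.halted = True

    def inspect(self, var_name):
        # Variable values can only be read while the circuit is halted.
        assert self.halted, "circuit must be halted to read registers"
        return self.registers[self.var_to_reg[var_name]]

# Example: suppose the HLS tool mapped variable 'i' to register 'r3'.
dbg = DebugController({"i": "r3"}, {"r3": 7})
dbg.single_step()
print(dbg.inspect("i"))  # prints 7, the value of 'i' at this step
```

In this model, every `inspect` call requires the circuit to be stopped before its registers can be read, which is the behaviour that motivates the at-speed flow described next.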
This starting and stopping may interfere with interactions with other hardware modules, masking bugs or creating new ones. In such cases it may be necessary to debug at-speed, where the circuit executes at normal operating speed.

In the second debug flow a record and replay approach is used to enable at-speed debugging. This flow uses on-chip memories to record the value of certain hardware signals while the circuit executes at speed. Once the circuit reaches some condition specified by the user, the recorded data is retrieved and the user can debug the program using the recorded data. Although this allows for a true in-situ debugging flow, limited memory resources mean only a portion of the circuit execution can be recorded.

To demonstrate the type of debug experience such an approach provides to the user, this work includes a prototype of a debugger application. This Graphical User Interface (GUI)-based debugger connects to the FPGA board, controls the user's circuit and retrieves execution information that is stored in the FPGA on-chip memories. The debugger provides features commonly found in software debuggers such as single-stepping, breakpoints, and inspecting variable values. It also includes features to help the user understand how the software code is mapped to the underlying hardware, including a Gantt chart which helps the user to understand the scheduling of each software instruction and the effect of compiler optimizations. As a proof of concept, the debugging framework presented in this section is fully prototyped and integrated into LegUp [48], a widely used open-source academic HLS tool.

1.3.2 Optimizing Data Capture

Capturing live circuit execution requires using on-chip FPGA memory to record the behaviour of many internal signals. Since FPGAs have limited on-chip memory resources, only a small portion of the total circuit execution can typically be recorded.
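The capacity limitation can be illustrated with a toy model of a trace buffer. This is a generic sketch, not the dissertation's actual capture circuitry: on-chip memory is treated as a fixed-depth circular buffer, so when the user's stop condition fires, only the most recent window of cycles is still available for replay.

```python
from collections import deque

# Generic sketch of a record-and-replay trace buffer. On-chip memory is
# modeled as a fixed-depth circular buffer: once full, each newly
# captured cycle evicts the oldest one, so only the most recent window
# of execution survives to be replayed offline.

DEPTH = 4  # illustrative trace-buffer depth, in captured cycles

trace = deque(maxlen=DEPTH)
for cycle in range(10):             # circuit runs for 10 cycles
    trace.append({"cycle": cycle})  # capture this cycle's signal values

# After the trigger fires, only cycles 6..9 remain for offline replay.
print([entry["cycle"] for entry in trace])  # prints [6, 7, 8, 9]
```

Deeper buffers capture longer windows but consume more on-chip memory, which is precisely the trade-off that data-capture optimizations target.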
This means that designers often have to perform multiple executions of the design in order to iteratively narrow down the root cause of a bug. This process can be very time consuming, especially if the bug occurs infrequently or is hard to reproduce. The iterative debug process can be accelerated if data can be more efficiently stored in the trace buffers, such that a longer trace can be recorded each time the circuit is executed, giving the designer more insight into the circuit at each iteration.

In this portion of the dissertation we present novel optimizations to more efficiently capture circuit execution into on-chip memories. Information from within the HLS process is used to build debug and capture circuitry that is tailored and optimized to the individual circuit. We present optimizations for capturing both the program control flow and data flow from the resulting circuit. One major optimization we explore is to use the HLS scheduling information to determine the circuit state in which different hardware signals need to be captured. Based on this schedule of when data is generated, a tracing schedule is established to determine when each piece of data is traced into on-chip memories. This balances the amount of data that needs to be captured cycle-by-cycle and makes more efficient use of on-chip memory.

We also explore using signal reconstruction techniques; that is, recording only a subset of the necessary signals, and using known properties of the program to reconstruct missing data offline, after the circuit has completed execution. We present an algorithm to minimize the number of signals that need to be recorded, increasing the length of execution trace that can be captured, and decreasing the area overhead of the debugging instrumentation.

Using these optimization techniques we are able to record, on average, 6362 cycles (28.9 µs) of execution per 100 Kb of memory allocated to signal tracing.
This is a 127x increase in captured execution versus using conventional signal capture techniques (ELAs).

1.3.3 Debugging Parallel Systems

The success of HLS tools in delivering higher performance than software hinges on their ability to exploit the vast parallel resources available in a hardware implementation. Thus, a significant amount of effort is spent in HLS tools extracting parallelism from the source code. Recent HLS tools maximize parallelism by allowing the user to provide multi-threaded code, which is synthesized to multiple hardware execution units that operate in parallel [19, 29, 49].

In these parallel systems there is a greater likelihood of encountering obscure or non-deterministic bugs that may not be reproducible through software simulation, increasing the need for in-system debugging [19]. Such issues include thread communication or synchronization problems, race conditions, deadlocks, or thread imbalance and starvation. Users trying to locate these types of bugs may need to observe the behaviour of multiple threads over long run times. In such cases it may be desirable to sacrifice complete observability in order to capture longer execution traces.

In this portion of the work we present a debugging framework for parallel HLS systems, which utilizes partial observability to obtain long execution traces. This framework is centered upon tracepoints, a technique used in current software debuggers for highly parallel software systems [50, 51]. Tracepoints are points in the program, selected by the user, which are logged into an execution trace each time they are encountered by a thread. The recorded tracepoints can then be provided to the user as a cycle-accurate timeline of events to aid them in debugging multithreaded HLS circuits.

We also present a novel technique to optimize how tracepoint data is captured in on-chip memory, increasing the length of execution that can be recorded.
This technique involves program analysis to determine which hardware threads can share trace buffer resources. This is accomplished by analyzing the HLS schedule to determine the maximum rates at which each thread can encounter tracepoints, using this information to select which threads can share trace buffers, and automatically generating round-robin circuitry to arbitrate access to the buffers. Using this technique we are able to obtain an average 4x improvement in trace length for an 8-thread system. This provides users with a longer timeline of execution and greater visibility into the execution of multithreaded HLS circuits.

1.3.4 Quantifying Observability

There have been several other research works, completed in parallel with this work, that also present techniques for in-system debugging of HLS circuits [52–56]. Each of these works has proposed solutions that embed instrumentation into the user's circuit to observe values during execution and relate them back to the original source code. These approaches use several different observation techniques, ranging from using commercial ELAs to observe hardware signals, to specialized debug instrumentation tailored to the individual circuit using information from within the HLS flow.

However, all of these techniques face the same challenge: due to limited on-chip resources only a small fraction of the circuit execution can be captured. This means designers must find the root cause of a bug while only being able to see a small piece of the full picture, making in-system debugging a challenging and time consuming process. Successful debugging solutions will be those that make efficient use of resources, in order to give designers the most insight into the system behaviour, making it easier and faster to locate bugs. To develop such systems, it is essential that there be a consistent method to quantify the observability into an HLS circuit.
This allows researchers to measure the effectiveness of their approach, understand circumstances where it is most effective, and compare and contrast it against other techniques.

In this portion of the dissertation we present a metric for measuring the observability of an executing HLS circuit. This metric measures the likelihood that variable values are available to a user as they step through the program execution, taking into account differences in variable importance, and the duration of execution trace that can be captured. This metric can be used to understand how different circuit observation networks can provide the user with different levels of observability into the HLS circuit execution.

As a demonstration of the applicability of the metric, we first study differences between recent debugging approaches for HLS circuits, and quantify the level of observability provided by such architectures. We then explore different schemes to select which variables are accessible in the observation network, and measure the impact on variable availability and the length of the captured execution trace.

1.4 Organization

This dissertation is organized as follows. Chapter 2 provides background information for this work, describing debugging scenarios for HLS circuits, and the need for in-system debug. The chapter then explains the HLS process, including how software is mapped to hardware; techniques for debugging digital circuits, and how they can be applied to HLS circuits; and finally, a description of related work toward in-system debugging of HLS circuits.

Chapter 3 describes our source-level, in-system debugging framework for HLS circuits. This includes a description of how a designer would use this debugging framework, focusing on both the interactive debugging and record and replay approaches to debugging, and their relative merits.
This chapter presents our debug circuitry to support these debugging flows, as well as a proof-of-concept debugger application that a designer uses to interact with the instrumented circuit.

Chapter 4 presents novel techniques to increase the amount of execution trace that can be captured using the record and replay approach. Information from within the HLS flow is leveraged to create circuit-specific optimizations for capturing the control and data flow. In addition, this chapter contains our algorithms for optimized signal reconstruction, where non-recorded signals are computed offline using information from the original software program. Results are presented to measure the increase to the length of execution that can be captured, as well as an analysis of the impact of the debug instrumentation on area and timing of the user's design.

Chapter 5 presents an adapted debug framework, specifically optimized for parallel HLS circuits. This chapter describes our tracepointing circuitry, which can be added to hardware threads to give designers insight into the long-term execution of the thread, as well as interaction with other threads in the system. We present an algorithm that uses program analysis to determine which threads can share access to trace buffer resources, increasing the length of execution trace that can be captured.

In Chapter 6, we present a metric for quantifying the observability into an executing HLS circuit. This metric is used to evaluate observability trade-offs provided by different techniques for debugging HLS circuits. In addition, several experiments are presented to show how the metric could be applied to cases where only certain source code variables are observed.

Chapter 7 concludes the dissertation, providing a summary of contributions, their impact, and limitations. This section also provides direction for future work.

Chapter 2

Background

This chapter provides background and related work relevant to this dissertation.
The chapter begins by describing debugging scenarios for HLS (Section 2.1), motivating the need for source-level, in-system debug of HLS circuits. Next, we describe existing techniques for performing in-system debug of general hardware circuits (Section 2.2); although these RTL-level debug tools are not practical for debugging HLS circuits, their underlying techniques are useful in designing a debug framework for HLS.

To understand how to perform source-level, in-system debug of HLS circuits, it is first necessary to understand the HLS flow, and how the software maps to hardware. Section 2.3 provides background information on HLS, and describes types of modern HLS tools, providing context for this work. The section then describes the steps performed in the HLS process as software is synthesized to hardware. Section 2.4 describes related work, and Section 2.5 describes the major approaches and assumptions made in this dissertation, and how they differ from the other works. Section 2.6 summarizes the chapter.

2.1 Current Approaches for Debugging HLS Circuits

2.1.1 Debugging and Validation Scenarios

When debugging and validating a design implemented using HLS techniques, there are three (often intertwined) tasks that an engineer needs to be concerned with:

1. The designer must ensure correct logical behaviour of the algorithm, and when incorrect behaviour is observed, identify the root cause. Typical errors include source-code mistakes, incorrect state transitions, or logic errors.

2. The designer must ensure correct interaction between various blocks in the system. A large design will typically contain many blocks, only some of which may be created using HLS-based techniques [57].
Other blocks may be RTL-level hardware circuits, soft or hard processors, or even legacy designs for which full source code is not available or not understood by the designer. The design may contain off-chip communication with other chips or I/O devices, and may contain complex traffic patterns, such as video input/output or network traffic.

3. Despite significant progress over the past decade, HLS tools are still relatively immature, and designers must have confidence that the hardware created by the HLS tool is correct. A functionality mismatch between the source code and generated hardware may result if there are bugs in the HLS tool, or simply if the user has configured the tools incorrectly for the desired hardware functionality. Providing a mechanism for designers to examine and understand the generated hardware is essential.

Later, in Chapter 3 of this dissertation we describe our framework that addresses each of these requirements. In the remainder of this section we describe current approaches to debugging HLS designs, and how they fall short in meeting all of these requirements.

2.1.2 Debugging by Software Emulation

In HLS-based methodologies, designs are written in software-like languages (such as C). A straightforward debugging approach is to execute the software-like code on a stand-alone processor, and use standard software debugging technologies. This is attractive for two reasons: (1) software debugging technology is mature, and these debuggers provide numerous features to help pin-point problems; (2) software developers (who are envisaged to be a primary user of HLS technologies) are already familiar with software-based debugging technologies, meaning they can use these tools with high confidence.

However, this approach of using software debug technologies is effective only for certain types of bugs in HLS systems.
As shown in Figure 2.1, HLS bugs can be classified into several categories. Those bugs that are localized to a single HLS module, we refer to as module-level bugs.

Figure 2.1: Classification of bugs in an HLS system (module-level bugs: debug C code on workstation with GDB; RTL-level bugs: run C/RTL co-simulation on workstation; system-level bugs: debug on FPGA, which requires observing internals of the FPGA)

These bugs may be errors in the algorithm specification, for example, errors in loop bounds, incorrect functions, or algorithmic errors. Commonly these bugs are easy to reproduce since, often, these bugs lead to incorrect behaviour every time the circuit is run. These bugs can typically be activated and corrected by executing the software code in isolation, which is why software emulation is effective at locating them. We anticipate that many of the logical errors in designs will fall into this category, and can be uncovered using software emulation.

However, such an approach has limitations. First, it cannot be used to uncover errors in interfaces between blocks, when some of these blocks are not developed using an HLS-based methodology.
While it would be possible to create a software model of non-HLS blocks, this is error-prone and it is difficult to ensure that the software model would faithfully describe the behaviour of the hardware in all possible scenarios (especially if the block was designed by someone else). This may be particularly challenging as it would require writing software to emulate the behaviour of a piece of hardware, and the software language may lack the constructs required to precisely describe the low-level, cycle-by-cycle, data-parallel behaviours of the hardware circuit. Even if the other blocks are created by HLS, and have a software source, they often will communicate on-chip using a bus or other network (e.g., [57]), and the correctness of the interface between the block and the bus may need to be examined. For streaming applications, it is often necessary to debug while running in-situ, that is, in the target system, so the block can be evaluated using realistic input traffic. The system may contain video input/output, network traffic, or other data patterns that may be very difficult to model perfectly in software.

Second, software execution is slower than hardware – this is the reason for accelerating software using hardware in the first place. This may make it impossible to run a sufficient number of tests on a software model of the system.

Finally, such an approach assumes that the HLS tool is “perfect”; if the tool creates incorrect hardware, then debugging the software model will not help find the problem.

Based on these limitations, there may be many bugs that cannot be located using software emulation techniques. For such bugs it will be necessary to debug in the hardware domain.

2.1.3 Debugging by Hardware Simulation

In addition to emulated software debug, some HLS tools also offer debugging using RTL simulation [19, 42].
In these tools the debugger often has the same look and feel as a standard software debugger; however, the control flow and data flow are obtained through RTL simulation, typically using Modelsim, rather than by executing the original software.

This technique solves some of the issues of software emulation. One benefit is that if the HLS module interacts with other hardware modules which are designed using RTL, the interaction can be more accurately replicated. Since the RTL-level blocks can also be simulated using Modelsim, it removes the error-prone step of creating a software model of the hardware.

Another benefit to hardware simulation is that it can be used to perform automatic verification of the RTL. RTL-level bugs, shown as the second category in Figure 2.1, may be caused by errors in the HLS tool itself, or errors in how the HLS tool is used. Some HLS vendors provide the ability to uncover these bugs using a co-simulation approach where the C and RTL code is simulated on a workstation [42]. Even if the code is correct, this level of system verification is essential; until HLS tools are fully mature, many designers will appreciate the confidence they achieve from a successful RTL simulation.

Despite these benefits, there are still several limitations and issues with using a simulation-based debugging flow. First, it still relies upon the user providing system inputs that perfectly replicate the final hardware system.
As mentioned previously, this may be nearly impossible for communication with legacy black-box type modules, or complex I/O traffic patterns. In addition, hardware simulation typically executes much slower than actual hardware execution (typically 20 to 200 times slower [43, 44, 58–61]), so even if realistic I/O traffic can be simulated, it may simply take too long to simulate the design to the point where the bug occurs.

2.1.4 In-System Hardware Debugging

The bottom of Figure 2.1 shows the last classification of HLS bugs, system-level bugs. These are bugs that cannot be located using software emulation or hardware simulation, due to the previously explained limitations of these approaches, and require debugging in the physical system. These types of bugs may be dependent on the interactions with other modules, specific I/O traffic patterns, or simply occur too infrequently to make emulation or simulation practical. For these bugs, the ability to debug the hardware version of a design, while in its normal operating environment, is essential. Even though these may be a small minority of total bugs, there must be a framework in place to help users find such bugs, otherwise they will be deterred from using HLS design methodologies.

Presently, there are no commercial technologies for practical in-system debugging of HLS circuits. Currently, to perform in-system debug, designers are limited to using general-purpose hardware debugging technologies, such as Embedded Logic Analyzers (ELAs). These tools, such as ChipScope Pro from Xilinx [46], SignalTap II from Altera [45], and Certus from Mentor Graphics [47], provide a user with observability into a hardware circuit, but in the context of the RTL circuit. These tools are not designed for HLS, and offer no insight into how the RTL circuit maps to the original source code.

Operating at the RTL level is far from ideal, and not suited as a technique for debugging HLS circuits.
One reason is that the RTL circuit will look nothing like the original software program. The HLS process, which is described later in this chapter, contains several steps of transformations as the design is transformed from the software domain, characterized by a sequential set of instructions, to the hardware domain, characterized by many dataflow elements operating in parallel (Figure 2.4). Hardware designers, likely using HLS to speed up development, will have to spend significant time understanding the produced circuit, and how it relates back to their original software, possibly negating the design time saved by using HLS.

Figure 2.2: Hardware view of a debug trace: Difficult for a software designer to understand!

Although RTL-level debug is challenging and time consuming for hardware designers, it is near impossible for software designers. A software designer typically would not have an understanding of the underlying hardware; in fact, this is the primary reason that HLS methodologies are able to deliver high design productivity. A software designer views a design as a set of functions, each consisting of sequential control flow code, while the underlying hardware consists of data flow components operating in parallel across multiple clock cycles. Figure 2.2 shows a screenshot of the output of one of these RTL-level debug tools; in this example, the behaviour of signals is illustrated using waveforms, a concept likely unfamiliar to many software designers. Even if understanding a waveform diagram is not a barrier, there will not be a one-to-one mapping between signals in the waveform and variables in the original C code. Further, since the HLS tool typically reorders instructions and extracts fine-grained parallelism, it is often difficult for a software designer to recognize the order of events and relate them to the order of instructions in the original C code.
All of these factors make it very difficult for a software designer to use these hardware-oriented tools.

2.1.5 Filling a Need: Source-Level, In-System Debugging

It is therefore essential to create a debugging framework that provides visibility into a hardware design, but in a way that is meaningful for both software and hardware designers. We anticipate this framework will have two main requirements:

1. Software-level debug experience. Debugging will need to be performed using the original source code, not the RTL. HLS offers rapid design times and accessibility to software developers because it uses a high-level software input. The debugging phase of the design process must also take place at the software abstraction level, and in a manner familiar to software designers. Basic software debugging features such as single-stepping, breakpoints and variable inspection are critical.

2. In-system execution. To capture the interaction with other blocks in the system, the circuit must be debugged while executing in its normal operating environment, at full speed.

In Chapter 3 we describe our HLS debugging framework that meets these requirements.

2.2 In-System Hardware Debug Techniques

In this section we provide a detailed description of current technologies for in-system debug of general hardware circuits. As mentioned previously, these technologies, and their related commercial tools, are impractical to use for debugging of HLS circuits because they operate at the RTL level. However, a source-level HLS debug framework will likely build on these tools, and include an additional layer to map the RTL information back to the source code before presentation to the user. Therefore, we describe the main approaches taken by these tools, and identify which are best suited for debugging an HLS-based system.

To debug an integrated circuit, the main obstacle is a lack of observability.
Determining the root cause of a bug may require traversing hundreds or thousands of internal signals to locate the source of the error. Although simulation allows full observability into these signal values, in an integrated circuit this is not possible due to the limited number of I/O pins. There have been two main approaches to observing internal hardware signals: scan-based and trace-based [62–64].

2.2.1 Scan-based Debugging

A scan chain is a technique whereby all of the flip-flops in the circuit are connected together sequentially, allowing the stored values to be observed or controlled. This provides greater observability into an integrated circuit, and can be used for debugging, as demonstrated in [65, 66]. Although scan chains are common in ASICs [67], only certain FPGA vendors support them [68, 69]. In [70], the authors propose a method of automatically adding scan chains to circuits implemented on FPGAs, although this increases the circuit area by 60–84%. Even if supported by the FPGA, readback, the process of retrieving data from the circuit, takes 0.5–1s for a single bit, and 2–8s for multi-bit registers [71, 72]. Perhaps the largest drawback of a scan-based approach is that reading the state of the flip-flops is potentially destructive, as the chip must be halted in order to read the data [62]. Stopping and starting the circuit may break interactions with other blocks [73]. In [73, 74], the authors use modified scan chains with shadow registers to take a snapshot of the register values without needing to halt the circuit; however, this can only be done periodically, and FPGAs do not support the feature.

In order to provide controllability of the circuit during scan-based debug, additional design-for-debug modules are often added to the user design [62]. These modules provide the user with the ability to start and stop the circuit, reset, step one or more cycles, and configure breakpoints.
Examples of these debug modules can be seen in [63, 66, 75]. In [72], FPGA dynamic reconfiguration is used to modify breakpoints without recompiling the circuit.

2.2.2 Trace-based Debugging

The other technique for in-system debugging, and the more widely used approach for FPGA designs, is trace-based debugging.

Trace-based debugging consists of inserting memory buffers into the design to record a set of internal signals. This approach allows circuits to be run at full speed, without having to halt the circuit execution to perform readback [76–78]. This method requires instrumenting the user circuit with a sample unit to capture signal values into memory, a trigger unit to determine when to initiate capture [79, 80], and an offload unit to transfer captured data to an external workstation [81]. Several commercial tools exist, such as Altera SignalTap II [45], Xilinx ChipScope Pro [46], and Mentor Certus [47].

Figure 2.3: The trace-based debug process [82].

The limited capacity of FPGA on-chip memory means that only a portion of the total signals can be recorded. In addition, even for a small set of signals, memory constraints restrict recording to a fraction of the entire execution time. To handle the problem of overflowing buffers, trace buffers are configured as circular memories so that they always overwrite the oldest data. For a trace memory of width w and depth d, the user will be able to observe w different signals, for a duration of d cycles.

Figure 2.3 illustrates the process of using a trace-based debugging system. Generally, when performing debugging, a user has noticed some incorrect behaviour at the outputs of the system, and needs to locate the root cause of this behaviour. The system may execute for a period of time before the bug is activated, and there may be another period of time before the problem propagates to the outputs and is visible to the user.
Commonly, the user would start investigation by recording some execution slice near where the problem is observed, and working backwards until the root cause can be determined.

When attempting to locate the root cause, the user may have to repeatedly choose a different set of signals to observe, or alter the trigger condition to capture a different time slice of execution. Due to limited on-chip memory, the section of execution that can be captured, which we refer to as the trace window, is often on the order of tens to thousands of cycles (depending on available memory, and number of signals traced). In hardware systems typically executing at MHz or GHz speeds, this may be a very small portion of the total execution, especially since some systems may execute for minutes, hours, or even days before a bug is both activated and observed at the outputs. Each time the user records a portion of the execution, he or she can gain additional information about the circuit behaviour, which will eventually allow the user to work backwards and locate the root cause of the bug. However, if this window is very short, this process may require many iterations of executing and recording the circuit behaviour until the root cause is narrowed down.

Since this process can be very time consuming, much work has tried to improve it by automatically choosing a good set of signals to trace, reconstructing non-traced signals, or compressing data to provide longer trace lengths [47, 83–91].

In [92–94] the authors present methods of back-stepping the circuit state, to determine values of signals for cycles prior to those that are traced. In [95–97] the authors present methods of using traced signals to reconstruct, or restore, the values of non-traced signals. Many methods of signal selection have been presented that increase the number of signals that can be restored [97–99].
In [86, 87], the authors provide greater signal visibility by dynamically changing the set of signals that are recorded. To provide longer trace lengths, [84, 85] propose data compression methods that are applicable to general hardware circuits. Recent work has investigated combining trace-based debugging with periodic scan chain capture to achieve higher signal visibility [89, 100]. In [101], this approach is refined to use smaller scan chains, sampled at a more frequent interval, to obtain even better signal visibility.

When trying to determine the root cause of a bug, the user may have to repeatedly select new sets of signals to monitor, requiring time-consuming re-synthesis. In [102, 103], Quinton proposes creating an interconnection network for SoCs to connect a large number of internal signals to a small number of trace buffer inputs using a concentrator network [104]. This allows the user to choose which signals are traced at run-time. Later work by Liu explored methods of reducing the network area based on signal correlations [105], as well as other types of multiplexer networks [106]. Hung presented several trace interconnection solutions for FPGAs, showing that signals could be routed to trace buffers incrementally [107, 108], and that higher observability could be obtained using novel routing methods [109, 110].

Although there has been a vast amount of work toward making trace-based debugging more effective (described above), all of these approaches are designed to support generic hardware circuits, and are limited by the fact that they have no knowledge of the circuit behaviour. In Chapter 4 we present novel signal-tracing techniques specific to HLS circuits, and show how information from the HLS flow can be used to greatly increase signal-tracing efficiency.

2.3 High-Level Synthesis

High-Level Synthesis (HLS) is a technology that allows for automatic creation of digital circuits from a software specification.
This section describes the HLS process and its major components, and provides the required background for establishing the software-to-hardware mapping, which is necessary for source-level debugging of HLS circuits.

2.3.1 History and Motivation

High-level synthesis tools first began appearing as academic and research endeavours during the 1970s and 1980s [18], and later in the 1990s as internally developed tools at major design companies such as IBM [8], Siemens [9], NEC [10] and several others. These tools typically used proprietary input languages and never gained widespread use [18]. In the 2000s, a new generation of HLS tools began to emerge, using C or C-like input languages, with many tools specifically targeting FPGA implementation [21, 22], including tools from the major FPGA vendors [19, 20, 29, 111].

There are two major factors driving the increased adoption of HLS of late; although these were described in greater detail in the introduction (Section 1.1.1), we briefly describe them here. The first factor motivating HLS is that digital circuit designers need a flow that enables faster development of digital systems. Today, most digital circuits are designed at the RTL abstraction level using an HDL such as Verilog or VHDL; however, as designs continue to grow larger and larger, this complex, low-level design process is becoming too time consuming, and a higher design abstraction level is required. HLS fills that need, as by designing at the software level, much of the low-level, cycle-by-cycle complexity present in RTL designs can be avoided [18, 112].

The second factor driving HLS adoption is the advent of FPGAs as general-purpose computing devices for software workloads. FPGAs are entering the data center market as a means of accelerating workloads that traditionally would run on CPUs [23]. Intel is currently releasing an integrated CPU and FPGA for data center processing [24], and IBM and Qualcomm are both bringing FPGAs to their cloud computing environments [25, 26].
In order for FPGAs to be accessible to software developers, there needs to be a development framework that allows them to accelerate their software workloads. Again, HLS fills this role, as it provides automatic software-to-hardware translation.

2.3.2 Present-Day HLS Tools

The following describes some of the most prominent types of HLS tools available today:

General-Purpose Software to RTL Tools

General-purpose software-to-RTL tools allow users to specify designs as standard software programs, which are then synthesized to RTL. Most commonly, these tools support the ANSI C language, although typically they have some restrictions on language coverage, such as not permitting dynamic memory allocation, recursion, floating point operations, or some pointer operations. General-purpose HLS tools are the most prevalent in the market today, and in this dissertation we focus on debugging techniques for this type of HLS flow.

There are several commercial tools that fall into this category, such as Xilinx's Vivado HLS [20], Catapult-C [113], Synopsys Synphony C Compiler [22], NEC's CyberWorkBench [42], and several others. However, to our knowledge, all of these commercial tools are closed-source. Since the approach taken in this work is to modify the HLS tool to automatically embed debug circuitry into the user's design, it was necessary to use an open-source tool in order to perform experiments and measure the effectiveness of our techniques.

In 2011, Canis et al., from the University of Toronto, released LegUp, an academic open-source HLS tool [48]. At that time, as well as when the work in this dissertation began, LegUp was the only open-source, actively maintained, general C-to-HDL tool. For this reason, we chose LegUp as a platform to prototype the concepts presented in this dissertation. LegUp is now widely used in the academic community, and has been downloaded by hundreds of research groups worldwide.
Since that time there have been a few other academic HLS tools released [114, 115].

One notable feature in certain general-purpose HLS tools is the ability for the user to provide parallel software code, which is then compiled to multiple hardware modules that operate simultaneously. This allows users to take advantage of the vast parallel resources on an FPGA, and achieve higher performance. Although HLS tools will aggressively auto-parallelize designs, this is typically fine-grained parallelization, and user-provided parallel software code is required to achieve coarse-grained parallelism. Examples of this feature can be seen in Xilinx's Vivado HLS tool, which supports SystemC threads [20], the LegUp HLS tool, which supports Pthreads and OpenMP [49], and in OpenCL-based tools, which we describe next.

OpenCL Tools

Most recently, FPGA vendors have also provided OpenCL-based HLS tools. For example, Xilinx offers the SDAccel development environment [29], and Altera provides an OpenCL Software Development Kit (SDK) tool [19]. OpenCL is a framework of extensions to the C language designed to be a unified API for targeting CPUs, General-Purpose Graphics Processing Units (GPGPUs), and FPGAs. Although these tools compile OpenCL code to hardware, and are not typically commercially branded as HLS tools, the core back-end of the tools still relies upon the traditional C-to-RTL HLS flow [116].

Although the work in this dissertation does not explicitly target an OpenCL-based HLS process, since the core HLS process is similar to general-purpose C-to-RTL tools, much of what is presented in this dissertation would also be applicable to OpenCL-based flows.

Application-Specific Tools

In addition to general-purpose HLS tools, there exist many application-specific high-level design tools and approaches.
For example, both ROCCC [117] and GAUT [118] are specifically designed for DSP-based applications, where streaming-oriented loops in C are synthesized to hardware accelerators. In [119, 120] the authors present high-level design techniques for FIR filters and multiple constant multiplication, and the commercial tool Impulse CoDeveloper [21] specifically handles media streaming applications.

Such tools may be highly specialized for a specific algorithm, and the resulting circuit may be substantially different in structure than what would be produced by a general-purpose tool. For this reason, the techniques presented in this dissertation may not be applicable to such types of tools. Debugging techniques for these tools would need to be considered on a case-by-case basis, which is beyond the scope of this work.

2.3.3 The HLS Flow

In this subsection we describe the major components of the HLS flow. This description most accurately matches the flow employed by the open-source LegUp HLS tool, which is used for prototyping and experimentation of the concepts in this dissertation. It is our understanding that, at a high level, this flow is substantially similar to that of most major commercial general-purpose HLS tools. However, due to the closed-source nature of such tools, we do not know the exact differences between the commercial closed-source tools and the academic open-source tool we use.

Figure 2.4 provides an illustration of the phases in the HLS flow; these stages are described next.

Source-to-Source Transformations

Source-to-source transformations automatically modify or restructure the user's source code. In the case of a C-to-RTL flow, this involves modifying the original C code to produce new, restructured C code. Often these transformations are done to optimize the source code and make it more amenable to synthesis to hardware. For example, reordering loops has been shown to produce up to 6x faster circuits [30], and reduce on-chip memory by 40% [31].
Some research has also been done toward using source-to-source transformations to embed debugging instrumentation into the design [54–56]; this is discussed further in Section 2.4.1.

There are two main advantages to performing optimizations at this point in the flow. First, some optimizations may be easier to insert at a high level, before the code undergoes low-level optimizations. Second, and more significantly, the optimizations can be performed independent of the HLS tool being used (provided that the modified source code meets the language requirements of the target HLS tool). This provides the advantages that the optimizations can be applied to closed-source HLS tools, and that the same optimizer tool can be used as a pre-processor to several different HLS tool flows.

Software Compiler Front-End

A standard software compiler front-end is used to compile high-level code to a lower-level encoding. For example, LegUp [48], Altera's OpenCL compiler [116] and Xilinx's Vivado HLS [20, 121] all use the LLVM compiler suite [122] to compile C code to LLVM Intermediate Representation (IR). The LLVM tool suite offers multiple front-ends for compilation to IR; for the C language this includes both Clang (used by LegUp) and GCC.

Figure 2.4: The HLS flow

The LLVM IR is a machine-independent, assembly-like language.
The IR code is in Static Single Assignment (SSA) form, which means that each variable in the IR code is assigned a value only once. This means that a C source code variable that is assigned a value in multiple locations in the source will be split into multiple variables in the IR code (one IR variable per C assignment).

IR Optimizations

Once the high-level code is translated to low-level IR code, several optimization passes can be performed. Although Figure 2.4 shows these occurring in a specific order, the ordering of these optimizations is often interchangeable.

Compiler Optimizations. Standard, machine-independent compiler optimizations are performed on the IR code. In LegUp, this consists of using the standard -O3 compiler option in LLVM, which performs many different optimization passes, including loop unrolling, constant propagation, loop optimizations, memory-to-register replacement, dead code elimination, common subexpression elimination, and many others. In [32, 33] the authors show that these optimizations lead to higher-performance and lower-area HLS circuits.

HLS IR Restructuring. At the IR level, there may also be the need to perform HLS-specific restructuring of the IR code. For example, in this phase, LegUp replaces calls to library functions (memcpy, memset, etc.) with IR code that implements these functions. At this stage, other unsupported operations may also need to be replaced with supported IR code; for example, if the hardware lacks a floating point unit, such instructions would need to be replaced with an implementation using supported IR code.

HLS IR Optimizations. Next, HLS-specific optimizations can be performed on the IR code. These would be analogous to machine-dependent compiler optimizations. Unlike the -O3 optimizations that are built into the software compiler, these are HLS-specific optimizations, and would be dependent on the circuit architecture of the HLS tool.
For example, basic blocks could be restructured to increase Instruction-Level Parallelism (ILP), memory accesses may be reordered or restructured to take advantage of the memory architecture created by the HLS tool, or bit-width optimization may be performed to take advantage of the fine-grained hardware [34].

The HLS Back-End: IR to HDL Synthesis

The final phase of the HLS flow, and the core of the process, is the synthesis from IR to HDL. This is the phase that bridges the software/hardware domains, as it synthesizes the low-level, sequential IR software to a data-parallel digital circuit.

Allocation. The allocation phase determines which hardware entities will be used by the circuit. Each operation (add, multiply, load) in the IR code will require a corresponding functional unit (adder, multiplier, memory). When resources are limited, or a lower area cost is desired, these functional units can be shared between multiple instructions; however, performance may be reduced. In the case where an FPGA is the target device, sharing is typically avoided, as multiplexing is expensive to implement using FPGA logic. For example, if there are two 32-bit additions, it requires the same resources to create one 32-bit adder with multiplexers to share access as it would to simply create two 32-bit adders [35, 48]. In LegUp, which targets FPGA designs, sharing is only performed on divide/modulus operations, as they require significant logic resources, as well as on multipliers, when there are not sufficient hard multiplier blocks on the targeted FPGA device.

Scheduling. Next, scheduling is performed, which assigns operations to hardware cycles, and builds control logic to implement this schedule. In LegUp, this consists of creating a Finite State Machine (FSM) for each software module, and assigning each IR operation (add, multiply, store) to a state in the FSM.
The goal of scheduling is typically to minimize the number of cycles required to complete the operations, while meeting constraints imposed by both the hardware resources and data dependencies.

Figure 2.5 provides an example C program, which sums the even and odd elements in an array, and shows how this program is mapped to LLVM IR, and then scheduled to a state graph in LegUp. As can be seen in this example, multiple IR instructions will be scheduled to the same cycle, when possible. However, some operations may require multiple hardware cycles, which will necessitate extra states. In this example, memory read operations require multiple cycles, so State 3 is inserted after the %5 = load %A[%2] instruction to introduce delay. Control flow divergences, such as the branch instruction in State 4, also introduce extra states.

Figure 2.5: Example mapping of C code to IR, and scheduling of IR instructions.

    C source:
        int main() {
            int A[100];
            int i;
            // sum of odd elements
            int s_o = 0;
            // sum of even elements
            int s_e = 0;
            for (i = 0; i < 100; i++) {
                if (i % 2 == 0)
                    s_e += A[i];
                else
                    s_o += A[i];
            }
            return 0;
        }

    Scheduled state graph:
        State 1: %A = alloca [100 x i32]
        State 2: %2 = phi (0,S1),(%8,S7); %s_o = phi (0,S1),(%s_o.1,S7);
                 %s_e = phi (0,S1),(%s_e.1,S7); %3 = and %2, 1;
                 %4 = icmp eq %3, 0; %5 = load %A[%2]
        State 3: (wait for memory read)
        State 4: br %4, S5, S6
        State 5: %6 = add %5, %s_e
        State 6: %7 = add %5, %s_o
        State 7: %s_e.1 = phi (%6,S5),(%s_e); %s_o.1 = phi (%7,S6),(%s_o);
                 %8 = add %2, 1; %exit = cmp eq %8, 100; br %exit, S8, S2
        State 8: return 0

Figure 2.6: High-level structure of circuit produced by HLS (control FSMs, datapath, and on-chip memory within the FPGA, connected to a memory bus).

In the above example, only one iteration of the loop is executed at a time. However, most HLS tools, including LegUp, will perform loop pipelining, where multiple iterations of the loop will execute simultaneously.
A major role of the scheduling phase is to determine how often these iterations can be issued [123].

Binding. The binding phase is responsible for assigning items from the IR code to the hardware units. This consists of assigning both operations (add, multiply, etc.) to functional units, as well as assigning variables within the IR code to memories and registers in the circuit. For functional units that are shared, arbitration logic (typically consisting of multiplexers) is created to handle access to the unit, which is controlled by the current state of the circuit. In some cases it may be possible to multi-pump functional units, where a double clock rate is used and data is provided at a double rate [36].

For variables in memories, arbitration logic is added to the memory controller such that each load and store operation can access the memory. Variables in IR registers are mapped to physical registers on the device. When the target device is an FPGA, each register is usually mapped to a unique physical register, since FPGAs have registers with every logic element, and sharing access would require expensive multiplexing logic.

Output Processing. The final step is to generate an HDL file representing the circuit. In LegUp, the resulting circuit has a structure consisting of FSMs to implement the scheduling of operations, datapath logic to implement each IR operation, and datapath registers and local memories to store the program variables. Figure 2.6 provides an illustration of the structure of the circuit.

2.4 Related Work

This section describes both past and current work that relates to this dissertation. This includes several other works related to in-system debugging of HLS circuits.
Since one goal of this work is to bring software-like debug to HLS circuits, we include a brief description of relevant work on debugging in the software domain.

2.4.1 In-System HLS Debugging

Sea Cucumber Debugger

The earliest work toward in-system debugging for HLS circuits was the source-level debugger [124, 125] created for the academic Sea Cucumber HLS tool [126], in 2003. This work does not use trace buffers, but rather leverages the device readback feature of certain FPGAs, which allows all registers within the FPGA to be read externally (see Section 2.2.1). The major drawbacks of this approach are: 1) it is very slow, requiring several seconds to read values from the FPGA at each timestep [127]; 2) it is only supported by certain FPGA families; and, most importantly, 3) it prevents the circuit from executing at speed and interacting normally with the rest of the system. In order to perform in-system debugging, the circuit would need to be stopped after each instruction in order to perform the readback and obtain the data. Since this behaviour is disruptive to interactions with other blocks in the system, only non-interacting circuits could be properly tested. As explained in Section 2.1.2, if the circuit has no external interactions, it often can more easily be tested using software debugging.

The Inspect Debugger

Inspect [52] is a source-level debugger targeted to the LegUp HLS tool. It allows users to manually select a set of desired signals from the datapath, which are recorded during execution using a commercial ELA tool, Altera SignalTap II. During execution, the logic created by SignalTap II will record the selected signals into on-chip trace buffers. Once stopped, a debugger GUI application connects to the FPGA, retrieves the recorded values, and replays the circuit execution.
The debugger application uses the trace data to allow replay in the context of the original source code, and provides standard software debugging features: single-stepping, inspecting variables, and setting breakpoints.

The debug flow provided by this tool is similar in nature to the debug framework that we present in Chapter 3. Inspect and our framework were developed in parallel, and both were published at the International Conference on Field Programmable Logic and Applications (FPL) in 2014 [1, 52]. Since that time we have worked with the authors to combine the ideas from their work with our debugger prototype into a single open-source tool, which was included in the 4.0 release of the LegUp high-level synthesis tool [5, 48].

One shortcoming of the Inspect tool is that it uses an existing general-purpose ELA tool to perform recording of the circuit execution. As described in Section 2.2.2, ELA tools are designed to work with any digital circuit, and use an unsophisticated approach of simply recording all desired signals every cycle. For HLS circuits, this is fairly inefficient; due to limited on-chip FPGA memory, the user must either choose a small number of variables to observe, or end up with only a very short captured execution trace. In our signal-tracing techniques, described in Chapter 4, we create specialized debug circuitry, optimized for HLS circuits, which allows for much more efficient capturing of the circuit execution.

Event Observability Ports

In [53], Monson and Hutchings apply trace-based debug techniques to the Vivado HLS tool [20], and tackle the problem of observing variables that reside in datapath registers.
The authors make a key observation that the HLS schedule can be used to determine the states in which a signal is computed, and that signals only need to be recorded in these states (refer to Figure 2.5 for an example of HLS scheduling).

Their architecture consists of ports, which the authors refer to as Event Observability Ports (EOPs), which are manually added to relevant signals in the RTL circuit. These ports are comprised of an enable and a data signal, and are connected to a trace buffer memory to record signal values only when the circuit is in the appropriate state. The enable signal can usually be driven using existing one-hot state signals from the FSM, at no extra area cost. Using this selective tracing technique, signal history can be recorded using much less memory than with conventional ELAs, which record all signals every cycle.

Figure 2.7: Event Observability Ports (EOPs) introduced by Monson and Hutchings in [53]. In this example signals r1 and r2 are recorded in mutually exclusive states, S1 and S2, and so they share the same trace buffer.

To facilitate recording only some signals each cycle, the authors propose splitting up the monolithic trace buffer used by conventional ELAs into multiple buffers which can be enabled separately. The most straightforward approach is to assign each signal to its own buffer. However, since each buffer can be enabled independently, what often happens is that one buffer becomes saturated very quickly, while other buffers may be largely unused. This results in wasted memory, which, if better allocated, could be used to provide a longer execution trace. Solving this memory utilization issue is the key to obtaining the full benefits offered by selective tracing.

The authors explore two approaches to solving this issue. The first is to try to determine the relative fill rates of each buffer, and size them accordingly.
Since automatically predicting program behaviour is often impossible, they suggest user-provided directives to aid in this process. This may work well for smaller designs, but would be burdensome on the user for larger designs with hundreds of relevant signals. The second approach used by the authors is to combine multiple signals to be traced in the same buffer. The HLS schedule is used to identify signals which are recorded in mutually exclusive states, and merge them into the same memory. This improves memory utilization, but at a cost of extra logic. An OR gate is added to combine the enable signals, and a multiplexer to combine the data signals. Figure 2.7 illustrates this architecture.

Our architecture for tracing datapath registers (Section 4.4) builds upon these techniques and provides further enhancements to improve memory efficiency and reduce area overhead.

C-Level Instrumentation

Recent work has investigated inserting debug instrumentation via modifications to the C source code, rather than instrumenting the RTL circuit. In [54], Monson and Hutchings show how the C code can be automatically modified, such that for every variable assignment in the C code, a duplicate assignment is inserted, where the value is assigned to a top-level pointer. When the modified C code is synthesized by the HLS tool (the authors use Vivado HLS [20]), this causes top-level ports to be added to the circuit, which are driven by the various inserted assignment statements. The authors propose that these ports could be connected to trace buffers to achieve a similar effect to their EOP ports from [53] (discussed earlier in this subsection).

In [55], Monson and Hutchings expand upon their work, and provide an in-depth analysis of how adding these extra assignments to the C code impacts the produced circuit, both in terms of area overhead, as well as execution time.
In [56], the authors further expand the work to include tracking of pointer values in the C code, which were unavailable in their initial works.

Work in progress by Jose Pinilla at the University of British Columbia expands upon the work by Monson and Hutchings, and includes methods of modifying the C code to not only produce the data ports, but also to include the trace buffer memories to record the data, as well as logic to retrieve the saved values from the circuit.

Compared to instrumenting the HLS-produced RTL circuit, modifying the C source code offers both advantages and disadvantages. The main advantage of C-level instrumentation is portability; C code modified with debugging instrumentation could be input into several different closed-source HLS tools. In contrast, RTL-level modification either requires modifying the HLS tool to automatically add debugging circuitry to the RTL circuit, or requires tool-specific information about the structure of the produced RTL circuit, and how it maps back to the original software code.

There are several disadvantages to C-level instrumentation versus RTL-level instrumentation. First, by changing the C code, it is possible to change the behaviour of the user's circuit, including modifying how operations are scheduled. This may introduce delay into the user's design, and could even mask bugs and interfere with debugging. Second, adding debugging at the C level can have a larger area overhead, as it lacks the low-level control that RTL modification offers. Finally, as we show in this dissertation, RTL-level debug instrumentation offers potential for customized debug circuitry that is highly efficient for capturing circuit execution.

HLS Circuit Verification

In [128–130], Yang et al. present techniques for validation of HLS circuits. The techniques presented in these works are designed to detect discrepancies between the C source code and the produced RTL circuit.
The approach taken is to instrument the LLVM IR to record the value of each update of a source code variable. The produced RTL code is also instrumented to record the same values. The LLVM IR code is then executed using an LLVM emulator, the RTL is simulated using ModelSim, and the two traces are compared to determine if there are any discrepancies.

Such an approach is useful for verification purposes; that is, to verify that an HLS tool is producing RTL circuits that are functionally equivalent to the C code. However, one drawback of this approach is that it is reliant upon the user-supplied test vector. It is possible for the produced RTL code to be functionally different from the C code, but produce the same trace values for certain input sets. In these cases this approach would not detect any mismatches.

The authors also show that under certain circumstances these techniques could also be useful for debugging, not just verification; however, such cases are very limited. Only cases where a bug causes non-deterministic behaviour, and the RTL circuit behaves differently than the C code, will be detected. The authors show this can occur for out-of-bound memory accesses, or accesses to uninitialized memory. However, for most bugs in the C code, such as logic mistakes, algorithm errors, etc., the bug will also be present in the produced RTL code and no discrepancy will be detected.

2.4.2 Debugging Optimized Software

One obstacle in debugging HLS circuits using the source code is the wide range of optimizations that may be applied to the source code prior to conversion to RTL. As described in Section 2.3.3, HLS tools typically use standard software compiler optimizations to improve performance of the IR code (for example, the -O3 option).

In the software domain, debugging optimized software code is a well-studied problem. Originally, many methods avoided debugging with optimizations [131, 132].
However, over time, it became more accepted that users would need to debug using the optimized code. In [133], John Hennessy provides three reasons for debugging software with optimizations: 1) the compiler may not support disabling all optimizations, 2) disabling optimizations may change the program behaviour and mask the bug, often due to external factors, and 3) because of time or size constraints, optimizations may be necessary. These same reasons exist in the HLS realm, and perhaps are even more applicable. Interactions with other blocks are a primary debug case, and disabling optimizations could easily mask bugs. In terms of size, the unoptimized code may require significantly more area, and may not fit on the FPGA, as has been our experience using the LegUp HLS tool.

In [133], John Hennessy outlines challenges of creating a debugger that works with optimized source code. One major challenge is that optimizations may remove or reorder store operations, making the memory value of a variable incorrect, commonly referred to as noncurrent. The correct, or current, value of a variable may reside in a register, or be optimized away due to compiler optimizations such as copy propagation. Hennessy provides a solution to noncurrent variables, which includes adding additional information to the debug information that is generated at compile time. The debugger can use this information to determine whether a variable value currently resides in memory or a register.

In [134] the authors extend Hennessy's work and present a debugger that is able to provide current variable values by using the program counter to determine where the current value resides.
LLVM uses a similar technique and inserts metadata instructions to indicate the current location of a variable. We describe this process, and how we use this information to track variable values from software to hardware, in Section 3.2.3.

In [135], the author includes ways to visualize reordered code, such as highlighting lines of a basic block using different colours to show which instructions have already executed, and which will execute next. We use a similar technique in our debugger GUI to indicate when multiple C instructions are executing in the hardware simultaneously.

2.5 Approaches and Assumptions In This Work

This section describes the major assumptions and approaches made in the contributions presented in this dissertation. We first put this work into context with other related work, and then provide a fault model describing the type of bugs that a user could resolve using the techniques presented in this work.

2.5.1 Context with Related Work

Trace-Based Debugging: Like [52–56], we chose to use a trace-based debugging approach, where on-chip memories are used to record circuit execution, and debugging is performed using the captured data (rather than a readback-based approach, such as in [124, 125]). This choice is made to ensure that the circuit can execute normally, and interact with the rest of the operating environment in real time.

RTL-level Instrumentation: Similar to [52, 53], we insert debug instrumentation into the produced RTL circuit. This is done so that the original user circuit is functionally identical to the case where no debug instrumentation is added (this is not true if C-level instrumentation is performed, such as in [54–56]). Again, this is to ensure that bugs are not masked, and even elusive system-level bugs can be located.

Source-level Debug: Like other in-system HLS debugging solutions, we chose to develop techniques that allow for debugging using the original source code.
This ensures that the debug process is accessible to all users of HLS, both hardware and software designers.

Automated Instrumentation: Unlike some past works that have relied upon manually adding debug circuitry ([52, 53]), we felt it important to prototype an automated instrumentation process. This has allowed us to perform experiments on large benchmarks that would be impractical to modify by hand, and better understand how our debugging techniques scale with circuit size.

Observability-based Debug: The approach we take in this work is to allow the user to debug their designs by providing observability into the executing circuit. Although software-based debuggers typically offer both observability and controllability, we focus on the former. While we can provide limited controllability (starting/stopping the design), the controllability offered by many software debuggers includes the ability to change variable values during debug, move the execution point, or even change the program source code. Since our work employs a trace-based debugging approach where the user is recording at-speed execution and debugging using recorded data, many such controllability features are not possible.

Optimizing Observability: Much of this dissertation focuses on providing users with as much observability as possible into the executing hardware circuit. In Chapter 4 we present techniques to maximize the length of circuit execution that can be captured into on-chip trace memories, in Chapter 5 we extend these techniques to parallel circuits, and in Chapter 6 we introduce metrics to quantify the observability provided by our techniques and others.
We believe providing higher levels of observability is key to reducing debugging times, and shortening the full development cycle, which is of high importance to HLS users.

2.5.2 Fault Model

The goal of this work is to give a designer visibility into the control and data flow of their program as it executes as a circuit on the FPGA. As such, the primary use model of this work is for a designer to locate functional errors in his or her source program. These source code errors could be as simple as a typo in the source code or an incorrect algorithm implementation. As described in Sections 2.1.2 and 2.1.3, many such errors could be resolved using software emulation or hardware simulation; however, if the bug requires a long run-time to activate, our in-system hardware debug techniques may be necessary. Other source-code errors that would require in-system debug may include incorrect logic in communications with other blocks or incorrect assumptions about the data format provided by other modules or I/O devices. Source code errors may not just be functional bugs, but could also include performance issues, where the implemented circuit does not achieve the desired performance. These performance issues may also lead to functional errors, for example, if the circuit cannot process data fast enough to handle requests by other modules in the system. All of these cases follow the model of the user observing the circuit execution, locating the error in their original software specification, and correcting it.

There are many other types of faults which may be present in an HLS system. In some cases these faults may be observable using our techniques for circuit observation; however, they are not a result of incorrect program specification provided by the user, and thus are not the target of this work. For example, this work is not targeted at finding physical faults in the FPGA.
Although such faults may manifest as the user observes their circuit execution, the techniques we describe in this dissertation are not useful for locating the root cause of physical errors. In addition, this work does not target errors that may be introduced by the FPGA Computer-Aided Design (CAD) tool flow which synthesizes the RTL circuit (produced by the HLS tool) to a placed and routed FPGA configuration. Such errors could include incorrect logic synthesis, or errors in placement and routing leading to setup or hold time violations. Again, these faults could affect the circuit behaviour and may be observable using the techniques of this work, but we do not provide techniques to locate the source of such errors within the FPGA CAD tools. This work assumes the FPGA CAD tool flow is error-free.

Generally, in this work we also assume the HLS tool itself is error-free. However, even with this assumption, in certain cases our work may be useful for locating errors within the HLS tool. For example, if the HLS tool schedules operations to execute in an incorrect order, leading to data dependency violations, it may be visible to the user using our debugging techniques. Since the user is debugging the hardware circuit, in the context of the source code, the user may also be able to observe functional mismatches between what is supposed to occur according to the source code, and what is actually happening in the hardware circuit. Thus, the work in this dissertation may be useful to a designer of an HLS tool, attempting to resolve these mismatches. However, not all bugs in an HLS tool may necessarily be located using our techniques. In addition, as we describe in Section 3.1, we rely upon information from the HLS tool to perform debugging, and thus we require that these portions of the HLS tool be error-free. So, if the HLS tool is itself error-prone, one may not be able to rely upon the accuracy of our debugging methods.
This dependency should be considered before using our techniques to debug the HLS tool itself. Furthermore, it should be noted that in one optimization in this dissertation, the signal restoration technique presented in Section 4.7, we elect to restore data offline rather than directly observing the hardware execution. In doing so we assume these signals will be correctly implemented in the hardware, and are assuming the HLS tool implements the hardware exactly as specified by the source program.

2.6 Summary

This chapter provided background information relating to the work in this dissertation. As discussed, current techniques for debugging HLS designs require the user to debug using either software emulation or hardware simulation, and cannot capture the interactions between the HLS design and other hardware blocks or I/O devices that may be present in the final hardware system. To capture these system-level bugs, there needs to be a new framework that provides in-system debugging of HLS circuits. This framework must be accessible to both hardware and software designers, allowing them to debug in the context of their original software program, while the hardware executes in the normal operating environment.

In this chapter we laid the groundwork for such a framework. We first discussed techniques for in-system debug of general digital circuits, as these same techniques can be applied toward an HLS-based debug flow. To understand the structure of HLS circuits, and how software code is mapped to a hardware circuit, we included a description of the steps in the HLS flow. Finally, we provided a description of related works toward the goal of in-system debugging of HLS circuits, describing both their accomplishments and shortcomings.
This was followed by a description of the approaches and assumptions made in this dissertation, and how they compare to these related works.

The next chapter of the dissertation presents our framework for source-level, in-system debugging of HLS circuits.

Chapter 3

The Debugging Framework

This chapter describes our framework for source-level, in-system debugging of HLS circuits.

Section 3.1 introduces the debugging framework, including a description of how the HLS flow is modified to support debugging, the type of user debugging experience that our framework provides, and context information describing the type of HLS tools that our framework applies to. This debug flow and instrumentation architecture form the context in which the optimizations and evaluation methods from Chapters 4 to 6 are presented. In addition, the development of this infrastructure represents a novel contribution in itself. The novelties include 1) debug logic that is added to the user's design at compile time, which allows for external control and observation of the HLS circuit, 2) a debugger application that runs on the user's workstation, communicates with the instrumented circuit, and helps the user understand how their software program is mapped to hardware, and 3) techniques to track software entities through the HLS flow in order to establish a hardware-to-software mapping.

These components are described in detail in the following sections. Section 3.2 describes how the software-to-hardware mapping is obtained, and details the signals within the hardware circuit that need to be observed in order to extract the control and data flow of the source program. Section 3.3 then describes the debugging circuitry that is automatically added to users' designs in order to provide this observability into the circuit.
Finally, Section 3.4 describes our prototyped debugger application, including a description of features that help a software developer understand how their source code is implemented in the hardware circuit. Section 3.5 summarizes the chapter.

Parts of this chapter were published in [1, 3].

Figure 3.1: The HLS flow, modified to provide debugging support

3.1 The Debugging Framework

This section introduces our debugging framework, describes how a user would perform debugging, and provides context on the types of HLS tools this framework is suited for.

3.1.1 Adding Debug to the HLS Flow

Figure 3.1 shows the HLS design process, modified to support our proposed debug framework. The user provides their source code to the HLS tool for compilation in the normal fashion, with no modification to the source code required. The HLS tool, which is modified to insert debug instrumentation, then compiles the C source code to an RTL Verilog circuit. The produced Verilog circuit is identical to what would be produced by an unmodified HLS tool, except that it also contains the debug instrumentation which allows observation into the circuit as it executes. The user then proceeds as normal, and uses the standard FPGA tool chain to compile the Verilog circuit to a bitstream, which is then used to configure the FPGA.

At this point the user can launch the debugger tool on their workstation, which connects to the FPGA device and communicates with the circuit via the added debug logic. Through this communication, the debugger can control the circuit, starting and stopping execution and observing internal signal values. The debugger looks and behaves similar to a software debugger, and allows the user to step through instructions, add breakpoints, and inspect variables, all in the context of the original software program.

In order to support a software abstraction level, the debugger application requires information on how the hardware maps back to the original software. Thus, the secondary role of the modified HLS tool (aside from adding debug circuitry to the design) is to output this mapping information. This information is stored in a Debug Database, which includes information about how the software entities (functions, instructions, variables, etc.) relate to the entities in the hardware circuit (FSM states, registers, memories, etc.).

3.1.2 Modes of Operation

To provide effective debug productivity to HLS users, it is essential to create a debugging framework that provides visibility into a hardware design, but in a way that is familiar and meaningful for software designers. Software designers expect to be able to set breakpoints and single-step their design, and are accustomed to full visibility into the value of any variable at any point in the program. This is challenging since, as described in Section 2.1.4, finding certain types of bugs requires running the circuit at-speed. Providing enough infrastructure (trace buffers and associated logic) to provide full visibility into all signals in a hardware design running at-speed would require too much overhead to be practical.

To address this, we have implemented two different debug modes: (1) a live interaction mode, in which the user can have full visibility, but does not run the circuit at speed, and (2) a record and replay mode, in which the user can run the circuit at speed, but only has full visibility for a portion of the execution. Each of these is described below.

Live Interaction Mode

In the live interaction mode, the system operates similar to a software debugger.
The user can create breakpoints (limited by the number of hardware breakpoint units included in the instrumentation) and single-step through the code. Upon hitting a breakpoint, or completing a single step, the tool disables the controlling finite state machine (FSM), essentially pausing the system. Once stopped, the state of the circuit can be retrieved from the FPGA, including all variable values, by retrieving the most recent entries in the trace buffers.

Debugging using this mode involves stopping the circuit every time the user wants to inspect a value or check the program state; this is similar to the technique used in [125]. As discussed previously, if the circuit is meant to interact with other hardware modules in the design, or other external devices, it is likely that starting and stopping the circuit will break these interactions.

Record and Replay Mode

The record and replay mode provides the ability to run the circuit at-speed while preserving a software-like debug experience. We believe that this mode is likely the most important feature in our framework and best illustrates how the software and hardware worlds can be bridged.

This mode operates as follows. While in the live interaction mode, the user can set a breakpoint and run the circuit to the breakpoint. As the circuit runs, instrumentation added to the circuit records the changes to signals as well as the control flow executed by the program (these values are stored on-chip in trace buffer memories). After the program hits the breakpoint, the values in the trace buffer are read into the debug tool, and the user can enter the replay mode. While in replay mode, the user can still single-step and set breakpoints as before; however, all variable values and control flow information is obtained from the trace buffer data rather than the live values from the chip.
In this way, the user can observe what the chip did during the at-speed run, while maintaining a software-like debug interface.

Since the user is debugging using recorded data, they can single-step their design, both forwards and backwards, providing the illusion of running the chip in reverse. We anticipate that this feature will be important as users wish to “work backwards” to determine the root cause of unexpected behaviour. In fact, the technique of using a recorded execution trace, and debugging in reverse, is already used in the software domain; examples include GDB's Reverse Debugging and Microsoft Visual Studio's IntelliTrace.

Because on-chip resources are limited, the debug circuitry can only store data for a limited number of instructions; we refer to the length of code for which data can be stored as the replay window (refer to Figure 2.3 for an illustration). Within the replay window a complete control-flow trace is available, allowing the user to observe which instructions are active for each execution step. Any variables that are updated within the replay window are also available for inspection, but only after the point they are updated; their value is unknown prior to the first update. While in the replay mode, the user can step forwards and backwards through all instructions in the replay window, but cannot step outside it. If the user wishes to go outside the replay window, they will have to re-execute their design and move the breakpoint to a new location in the code.

3.1.3 Observability-Based Debug

Software-based debuggers typically offer both observability and controllability. Observability means that the user can view what occurs during execution, including the control-flow, which shows the ordering of executed instructions, as well as the data-flow, providing the value of variables throughout execution.
To provide controllability, many software debuggers provide the ability to change variable values during debug, move the current execution point, or even change the program source code.

Our debug framework focuses mainly on the Record and Replay Mode of operation, where at-speed execution is captured and the user debugs using the recorded data. Since we envision the debug process taking place with recorded data, we focus on observability-based debug, rather than controllability. Our framework provides observability into the circuit execution, allowing the user to view the control-flow of their design, as well as inspect the value of variables during the captured execution. However, controllability is limited to starting and stopping the circuit, including a breakpoint to allow the circuit to stop at a certain program state. Since the user performs debugging using recorded data, certain controllability features, such as changing variable values during execution, or artificially altering the control flow, are not possible.

During the debugging process the user may wish to change the functionality of the design. This may be done to fix bugs and test that the solution is correct, or may simply be done to change the program behaviour to help locate the root cause of the bug. However, any such changes to the source code would require a full recompile, including re-executing the HLS flow to generate a new HDL circuit, and then performing the usual FPGA CAD flow (place, route, etc.), before the physical circuit could be updated. This process may be very time consuming, which would likely discourage this approach from being used while locating bugs, and may limit its use to only fixing the bug once found. However, future work could explore methods to implement minor changes to the design without the need to perform a full recompile.
3.1.4 Debugging Context

In Section 2.3.2 of the background, we described the major classes of HLS tools, which include both general-purpose tools and application-specific tools. Figure 2.6 showed the high-level structure of a circuit produced by a general-purpose C-to-RTL tool.

We anticipate our framework could work with most general-purpose C-to-RTL HLS tools. However, to make our framework proposal concrete, we describe it in the context of the academic open-source LegUp HLS tool [48]. LegUp is widely used in the academic community, and has been downloaded by hundreds of research groups worldwide. The debug framework we present in this chapter has been prototyped in LegUp, and is now part of the official release of LegUp. Our prototype is added in LegUp version 4.0, which is built using LLVM version 3.5.

Although we describe our framework in the context of the LegUp flow, we believe it could be extended to any general-purpose HLS tool, provided the tool produces a circuit similar in structure to Figure 2.6, containing a controlling FSM, datapath logic, and on-chip memories.

In order to apply these techniques to other general-purpose HLS tools, it would be necessary to obtain the mapping information detailing how both the control flow and data flow of the source code map to FSMs, datapath elements and memories in the resulting circuit. For the LegUp flow, we describe how we obtain this mapping in Section 3.2; likely a similar process could be followed in other tools. In addition, the instrumented debug logic, which we describe in Section 3.3, would need to be added to the produced circuit, and the circuit would need to be modified to bring the relevant signals to the top level to connect to the debug logic.

Outside of the category of general-purpose tools, there exist many application-specific high-level design tools and approaches that target, for instance, DSP or media streaming applications (see Section 2.3.2).
Such tools may be highly specialized for a specific algorithm, and the resulting circuit may be substantially different in structure than what is assumed in this work. For this reason, the techniques in this dissertation may not be applicable to such types of tools. Debugging techniques for these tools would need to be considered on a case-by-case basis, which is beyond the scope of this work.

3.2 Mapping Software to Hardware

Section 2.3.3 provided background information on the major steps in the HLS flow (shown in Figure 2.4). In this section we describe how information is tracked through this flow in order to map entities in the hardware back to the original software. We modify the HLS tool to automatically store this information into a database at compile time, which we refer to as the Debug Database.

3.2.1 Scope of Mapping

The mapping process described in this section covers what is sufficient in order to provide source-level debugging in the default configuration of the LegUp HLS tool. Other HLS tools may include additional transformations and optimizations that may require further steps in the mapping process. In addition, LegUp provides some optional optimizations that are not enabled by default, such as loop pipelining, which must be manually enabled via pragmas in the source code. Although we do not explain the mapping process for all such optimizations, we believe that with additional effort it would be possible to do so. As described in Section 1.2.2, it is not the objective of this dissertation to engineer a debugging tool that works with every possible HLS optimization; rather, we show how this process takes place for the core HLS flow.

There are a few omitted optimizations that should be noted. First, we do not tackle the challenge of supporting source-to-source transformations, where the C code is automatically modified before entering the HLS flow.
If the designer used a third-party tool to optimize their C code, they would need to debug using the modified C code, which may not be ideal if transformations were extensive. In addition, it is our understanding that LegUp performs relatively little IR-level restructuring and optimization compared to commercial tools. In tools that restructure the IR more aggressively, extra care would need to be taken to preserve the mapping information through these optimizations.

Despite these limitations, our framework does handle most commonly used optimizations. We track the software as it is transformed to IR code, through the -O3 level of compiler optimizations, and finally through the allocation, scheduling, and binding phases as it is synthesized to RTL. Most of this mapping information comes from existing data already stored within the data structures of the HLS tool. Our contribution is demonstrating how this information can be extracted and used to build a complete, end-to-end software to hardware mapping.

3.2.2 Control Flow

In LegUp, Clang first compiles C code to IR. Functions in the C code are mapped to functions in the IR assembly, and all C instructions are decomposed into lower-level IR instructions. To enable debugging, Clang must be called with the -g flag, which enables debug metadata. This metadata, which is a feature built into Clang, includes information which maps each IR instruction to the line of C code from which it was derived.

Next, LLVM performs compiler optimizations which may modify the control-flow of the IR code. For example, IR instructions may be re-ordered, dead code may be removed, functions may be automatically inlined, and more. LLVM is already designed to preserve the debugging metadata throughout these optimizations, so at this point it is still possible to map IR instructions back to their C origin.

The next step is synthesis of the IR code to RTL. For every IR function, the HLS tool will create a Verilog module.
Within this Verilog module, an FSM is created to implement the scheduling information (see Figure 2.5 for an example of the scheduling process). Every IR instruction is assigned to begin execution at an FSM state and, if the operation is a multi-cycle instruction, to complete execution at another FSM state.

Using this information, we can build an end-to-end mapping detailing how the control flow of the RTL maps back to the C code; that is, for each cycle of circuit execution it is possible to determine which instructions from the C code are executing. By observing the state of the FSMs and using the scheduling information, we can determine which IR instructions are executing, and using the LLVM metadata, determine the corresponding C instructions.

3.2.3 Data Flow

We now describe how the HLS tool synthesizes data elements from software to hardware. When the C code is compiled to LLVM IR by Clang, all variables are mapped to memory and accessed via load and store operations. In the LLVM IR code, memory is allocated for variables using the alloca instruction, which returns a pointer for use in load and store instructions. When debug metadata is enabled (-g flag), extra pseudo-instructions (llvm.dbg.declare) are added to the IR program to indicate how each variable maps back to the C source code (this is an existing feature built into LLVM). The following shows how a 32-bit variable named sum is initialized to zero in IR code, and how metadata maps it back to the C variable:

    %sum = alloca i32, align 4
    store i32 0, i32* %sum
    call void @llvm.dbg.declare(%sum, C code metadata of 'sum')

Although the front-end compiler maps all variables to memory, once compiler optimizations are performed on the IR, many of these variables will be mapped to other locations. Constant propagation will cause some of the variables to be replaced with constants. Dead-code elimination, loop transformations, and other optimizations may completely optimize away some variables.
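To make the control-flow mapping of Section 3.2.2 concrete, the following sketch chains the two lookups the debugger performs: FSM state to IR instructions via the HLS schedule, then IR instruction to C line via the Clang -g metadata. The table contents here are invented for illustration; the real Debug Database stores equivalent information.

```python
# Hypothetical excerpt of the HLS schedule: which IR instructions
# begin executing in each FSM state.
schedule = {
    1: ["%a = load i32", "%b = load i32"],   # state 1
    2: ["%c = add i32 %a, %b"],              # state 2
    3: ["store i32 %c"],                     # state 3
}
# Hypothetical excerpt of the Clang -g metadata: IR instruction -> C line.
ir_to_c_line = {
    "%a = load i32": 12,
    "%b = load i32": 12,
    "%c = add i32 %a, %b": 13,
    "store i32 %c": 13,
}

def c_lines_for_state(state):
    """Map an observed FSM state to the set of C lines executing there."""
    return sorted({ir_to_c_line[ir] for ir in schedule.get(state, [])})

# Replaying a recorded control-flow trace of FSM states:
trace = [1, 2, 3]
replay = [c_lines_for_state(s) for s in trace]
```

For each recorded cycle, the debugger can thus highlight the C line(s) in flight, even when several IR instructions execute in the same state.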
A large number of variables, particularly local scalar variables, will be mapped to IR registers, removing the need to perform load and store operations.

This is further complicated by the fact that a C code variable may reside in different locations throughout the program execution; for example, a variable could be mapped to memory initially, have a constant value for another portion of the program, and be mapped to one or more registers for other parts of the program. In fact, because the IR code is in SSA form, for variables mapped to registers, each instance in the code where the variable is updated will be assigned to a new IR register. Handling these complications is a necessary part of debugging optimized software (refer to Section 2.4.2).

During the compiler optimizations, LLVM automatically inserts pseudo-instructions which indicate the current location of a variable (llvm.dbg.value instructions). The following shows an example of such a case, where a C code variable, i, is incremented:

    %i.new = add i32 %i.old, 1
    call void @llvm.dbg.value(%i.new, C code metadata of 'i')

The second argument of the llvm.dbg.value instruction identifies the source code variable, and the first argument indicates the IR entity that currently holds the variable's value. Once the control flow reaches the llvm.dbg.value instruction, it is meant to indicate that the source location of the variable has been updated. This construct works fine if the code is executed sequentially, but is not as straightforward in an HLS system, where the scheduler may execute instructions simultaneously or out-of-order. This is somewhat problematic, as LLVM has no method to indicate which instruction the llvm.dbg.value instruction depends upon. We make the assumption that it is linked to the immediately preceding instruction, and consider it to take effect in the state where that instruction is scheduled to be executed.
This may not always be accurate, as the change in variable location may actually depend on some earlier instruction; unfortunately, improving this would require significant modifications to the LLVM tool and the optimization passes which maintain the metadata.

Although the mapping from C variable to IR location is fairly complicated, the mapping from IR to HDL is fairly straightforward. During synthesis to hardware, LegUp creates a memory for each alloca instruction. Depending on whether these variables are accessed within a single function or multiple functions, these memories may be instantiated locally within a module, or may be located in a shared global memory at the top level of the circuit. Each memory is accessed through a memory controller, and each load and store instruction in the IR is synthesized to datapath logic with shared access to the memory controller.

For variables mapped to IR registers, the mapping is simpler still. Due to the plentiful number of physical registers on the FPGA, each IR register is usually assigned to a unique Verilog register. The few exceptions to this rule occur when the register is the output of a shared functional unit; however, as explained in Section 2.3.3, this is uncommon, as functional units are typically not shared on FPGAs unless the functional unit requires a large amount of logic, such as a divider or floating-point unit.

Once this mapping information has been established, our debugger can determine the value of a C code variable using the following process. First, the control flow trace is searched for the most recent state which contains a llvm.dbg.value instruction for the variable; this indicates the variable's current location. For variables optimized to constants, the constant value can be provided to the user without the need to inspect the hardware. If the variable is located in an IR register, the corresponding hardware register needs to be inspected.
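The lookup procedure just described can be sketched as follows. The record format is invented for illustration; the actual Debug Database holds equivalent (state, variable, location) information derived from the llvm.dbg.value metadata and the HLS schedule.

```python
# Hypothetical (state, variable) -> location records.
locations = [
    # (fsm_state, variable, kind, where)
    (2, "i", "register", "i_reg_1"),
    (5, "i", "constant", 0),
    (9, "i", "register", "i_reg_2"),
]

def resolve(var, state_trace):
    """Scan the recorded control-flow trace backward for the most
    recent state with a dbg.value entry for `var`."""
    for state in reversed(state_trace):
        for s, v, kind, where in locations:
            if s == state and v == var:
                return kind, where
    return None  # variable location unknown in this trace window

# Example: the trace ended in state 7, so the state-5 entry is the
# most recent; `i` currently holds the constant 0.
kind, where = resolve("i", [1, 2, 5, 7])
```

A "register" result directs the debugger to a recorded hardware register value; a "constant" result can be shown to the user without inspecting the hardware at all.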
For variables in memory, the debugger must monitor the memory controller signals, or have external access to the memory controller.

Table 3.1: CHStone benchmark properties, when compiled using LegUp.

    Benchmark  Lines of  ------ # Variables ------  FSM   Registers   Global RAMs  Distributed
               C Code    Total  Opt.Away  Opt.Const Bits  (bits)      (bits)       RAMs (bits)
    adpcm      550       248    72        15        10    120 (3175)  1 (192)      12 (663)
    aes        711       45     4         4         11    37 (937)    1 (192)      4 (304)
    blowfish   962       42     1         10        10    45 (1366)   1 (192)      7 (296)
    dfadd      488       127    1         14        9     67 (2556)   1 (96)       1 (32)
    dfdiv      429       130    2         26        9     63 (2944)   0 (0)        0 (0)
    dfmul      369       83     2         19        8     41 (1665)   0 (0)        0 (0)
    dfsin      747       328    1         60        12    164 (6799)  1 (96)       1 (32)
    gsm        376       81     1         21        10    72 (1400)   1 (192)      5 (262)
    jpeg       1003      263    22        14        13    168 (4278)  1 (192)      28 (1327)
    mips       225       19     3         1         9     23 (553)    1 (192)      2 (150)
    motion     568       99     11        54        10    27 (621)    1 (192)      4 (230)
    sha        182       36     5         12        8     14 (363)    1 (192)      7 (310)

3.2.4 Required Modifications to the HLS Tool

Obtaining the hardware-to-software mapping information described in this chapter will likely require modifying the HLS tool to export this information. In our prototype debugger using the LegUp HLS tool, we modified the tool to output this information into a database. Although we modified the tool to export this information, it should be noted that extracting it did not require modifications to the HLS process itself. It is important to distinguish between the need to modify the HLS tool for the purposes of extracting information, versus modifying the actual synthesis process. Obtaining this mapping information requires the former, but not the latter; the HLS flow described in Section 2.3.3, and thus the resulting circuit, including its operations, scheduling, and functionality, does not need to be modified.
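As a sketch of what exporting this information might look like, the following serializes a few mapping records at compile time. The JSON layout and field names are invented for illustration; the actual Debug Database schema is not reproduced here.

```python
import json

# Invented example records: per-function module names, the HLS
# schedule, IR-to-C line metadata, and variable locations.
debug_db = {
    "functions": {"main": {"verilog_module": "main"}},
    "schedule": {"main": {"1": ["%a = load i32"]}},  # FSM state -> IR ops
    "ir_to_c": {"%a = load i32": {"file": "sort.c", "line": 12}},
    "var_locations": [{"var": "i", "state": 2, "kind": "register",
                       "where": "i_reg_1"}],
}

def export_db(path, db):
    """Write the mapping records gathered during HLS compilation."""
    with open(path, "w") as f:
        json.dump(db, f, indent=2)

def load_db(path):
    """Read the records back in the debugger application."""
    with open(path) as f:
        return json.load(f)
```

The point of the exercise is that this is pure bookkeeping: the synthesis process itself is untouched, and the debugger only reads what the tool already knows.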
The one exception to this is that, as described in this section, the -g flag is provided to LLVM to enable debug metadata in the IR; however, our experience is that this produces little to no change in the behaviour of the IR code.

3.2.5 Properties of Benchmark Circuits

We now provide details on our benchmark circuits, in order to give context into how many signals in the circuit need to be observed to obtain the control and data flow of the C program. For the experiments in this dissertation we use the CHStone benchmark suite, a set of C programs designed for HLS [136]. This is the most widely used benchmark suite in recent academic work on HLS. Table 3.1 provides properties of the benchmark programs when compiled using the LegUp HLS tool. We use the version of the benchmarks included with version 4.0 of LegUp. The source programs range in size from 182 to 1003 lines of code, and contain 19 to 328 source code variables.

As described in the previous subsection, once mapped to hardware, the control and data flow can be determined by observing the FSM states, as well as variables in memories and datapath registers. To extract the control flow, the number of FSM bits that need to be observed ranges from 8 to 13. For the data flow, compiler optimizations will optimize away some of the variables, while others are optimized to constants, as indicated in the table. Of the remaining variables, some are mapped to datapath registers, for which the number of bits that need to be observed ranges from 363 to 6799. The remaining variables are mapped to memories. LegUp can optionally produce a circuit with a single global memory controller, or with multiple distributed memories, as will be further discussed in Section 4.5. The table shows how many memories exist in both cases, as well as the number of bits of memory controller signals that need to be observed.
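As a quick tally of these figures, the total number of bits to observe per benchmark can be read directly off Table 3.1; a small sketch, using the global-memory configuration for three of the rows:

```python
# Observed-signal bit counts from Table 3.1:
# (FSM bits, datapath register bits, global memory controller bits).
benchmarks = {
    "adpcm": (10, 3175, 192),
    "dfsin": (12, 6799, 96),
    "sha":   (8, 363, 192),
}

def total_observed_bits(name):
    fsm, regs, mem = benchmarks[name]
    return fsm + regs + mem

# e.g. dfsin requires 12 + 6799 + 96 = 6907 observed bits in total,
# dominated by the datapath registers.
```

The register bits dominate in every case, which motivates the datapath-specific tracing optimizations of Chapter 4.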
Note that the number of bits can vary depending on the width and depth of the memory, as well as whether one or two ports are used. The dfdiv and dfmul circuits have no memory controller signals because they do not contain any variables mapped to memory.

Next, we describe the instrumentation that is added to these circuits in order to observe these signals.

3.3 Circuit Instrumentation

In order for the debugger to control the circuit and observe relevant signals, the HLS tool must insert debug instrumentation into the user's RTL circuit. As part of our contribution we have designed this debug instrumentation, which is shown, combined with the user's circuit, in Figure 3.2. The user's circuit, produced by the HLS tool, contains FSMs, a datapath, and on-chip memories. The elements that form the user's original circuit are shown in grey. The debugging circuitry, which we have developed, is shown as the coloured components in the figure. This debugging circuitry is automatically added to the user's design at compile time.

Figure 3.2: The debugging instrumentation, and how it interfaces with the user's circuit. The user's original circuit is shown in grey, and the added debug logic is shown as the coloured blocks.

3.3.1 Instrumentation Components

The debug instrumentation includes the following components:

Communication Interface  The communication interface handles all communication with the debugger application, and forwards requests to the other subsystems.

Stepping and Breakpoint Unit  This unit outputs an enable signal which is used to start and stop the FSMs in the user circuit, and also contains a hardware breakpoint unit.
To facilitate breakpoints, the debugger configures the breakpoint unit to watch for a certain circuit state which, when hit, will disable the FSMs.

Memory Arbiter  The memory arbiter multiplexes access to the memory controller between the user circuit and the debugger, allowing the debugger to access program variables located in memory.

State Encoder  Collects the one-hot state signals from all FSMs within the user circuit, and encodes them into a single fully encoded format. The encoded format is more appropriate for recording into trace buffer memories, as it is much more condensed and requires less memory to record.

Record and Replay Unit  This unit records a trace of the control FSM, updates to datapath registers, and updates to variables in memory. As the circuit executes, data is constantly written to the buffers, overwriting the oldest data. This execution occurs at normal operating speed, allowing the circuit to properly interact with other components in the system. Once the circuit is halted, either manually or via breakpoint, the debugger can retrieve the data stored in the trace buffers and replay the circuit execution. In the next chapter we describe the architecture of the Trace Recorder in greater detail.

3.3.2 Required Modifications to the HLS Tool

The above components provide visibility into, and basic control of, the executing circuit. As shown in Figure 3.2, the majority of these components are located separately from the user's circuit, and do not require any modifications to the user's design. Aside from bringing signals to the top level of the design, the user's circuit is only modified in two ways: 1) an enable signal is added to the FSMs to allow the circuit to be started and stopped, and 2) the memory bus is rerouted through an arbitrating multiplexer. However, as described in Section 4.8, this memory arbiter is actually removed for many of our debugging configurations, leaving only the first modification.
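The interaction of the enable signal with the breakpoint unit described above can be modelled behaviourally as follows. This is only a software sketch of the hardware unit; the class and signal names are invented.

```python
class SteppingBreakpointUnit:
    """Behavioural model: the enable output gates the user circuit's
    FSMs; a configured breakpoint state drops enable when reached."""

    def __init__(self):
        self.enable = True
        self.breakpoint_state = None

    def set_breakpoint(self, state):
        # The debugger writes the encoded state to watch for.
        self.breakpoint_state = state

    def on_cycle(self, encoded_state):
        # Evaluated each clock against the encoded FSM state.
        if encoded_state == self.breakpoint_state:
            self.enable = False
        return self.enable

unit = SteppingBreakpointUnit()
unit.set_breakpoint(5)
enables = [unit.on_cycle(s) for s in [1, 2, 5, 6]]  # halts at state 5
```

Once enable drops, the FSMs hold their state, and the debugger can read back the trace buffers before resuming or single-stepping.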
Aside from these changes, no modifications are needed to the user's circuit.

In the prototype of our debug framework, we altered the LegUp HLS tool so that these modifications were automatically made to the user's design. We also modified the HLS tool to automatically add the debug circuitry and connect it to the user's circuit. However, since the debug instrumentation is separate from the user's circuit, it would be possible to add this logic outside of the HLS tool.

3.4 The Debugger Application

In this section we describe the final part of our debug framework: a debugger application that runs on the user's workstation, connects to the circuit executing on the FPGA, and provides a software-like debug experience.

The purpose of this section is to show how such a debugging approach can address the challenges faced by software designers debugging HLS circuits. We make our exposition concrete by presenting details of our prototype debug tool, relating each of our ideas to features of this tool. Our prototype demonstrates that it is indeed possible to create a tool that resembles a software debugger, yet can be used to debug hardware designs running on an FPGA. Figure 3.3 shows a screenshot of our tool; in the following discussions, we will refer to specific aspects of this figure.

Much of the debugger looks like a standard software debugger. The left panel contains the source code of the program, and the executing instruction(s) are highlighted. The parallel nature of the circuit means that multiple C instructions can execute simultaneously, which is a difference from normal software debuggers; when this occurs we highlight multiple lines simultaneously. The right panel contains a list of the source code variables and allows users to inspect variable values, including support for both scalar variables and aggregate types (arrays, structs, etc.).

The following describes how the debugger deviates from a standard software debugger.
We believe these features are key in helping software developers understand the behaviour of the hardware system.

Figure 3.3: Screenshot of our debugger prototype: a hardware debugger that a software engineer can understand.

3.4.1 Gantt Chart

As shown in Figure 3.3, our tool presents a Gantt chart that shows the execution of each line of code over time. This provides much of the same information as a waveform diagram, in what we believe is a more software-friendly way. The diagram provides a mechanism for the designer to understand the fine-grained parallelism that has been uncovered by the HLS tool. As an example, in Figure 3.3, the assignments to beg[0] and end[0], and the initialization operation of the while loop, are started during the same cycle, and this is shown graphically in the Gantt chart. Similarly, some instructions may take more than one control step (in the hardware this corresponds to more than one clock cycle, but a clock cycle is abstracted as a control step in our tool). Long operations such as divides, or instructions that operate on variables in memory, typically take more than one control step; examples of the latter can be seen in Figure 3.3. The boxes of the Gantt chart represent individual instructions of the underlying IR code.

Note that in the presence of compiler optimizations, the Gantt chart may not be as straightforward as that in Figure 3.3; we discuss the impact of compiler optimizations in Section 3.4.5.

3.4.2 Debug Modes

Our debug instrumentation provides both the live interaction and record-and-replay modes described in Section 3.1.2.
In the live interaction mode, the circuit is started and stopped as the user steps through the circuit. In the record-and-replay mode, the circuit is run at-speed and a portion of the execution is captured in on-chip trace buffers. Once the execution information is retrieved from the FPGA, the user can debug using the recorded data. A slider bar at the top of the debugger allows the user to quickly navigate through the captured trace window. At any point within the trace, the user can single-step both forward and backward, as well as inspect variable values.

3.4.3 Instruction-Level Parallelism

It is important to note that there is a significant difference between a single step in our framework and a single step in a software debugger. A single step in a software debugger will typically advance one source code statement. A single step in our framework may advance through multiple instructions at the same time, if those instructions are mapped to hardware that executes in the same clock cycle. As an example, in Figure 3.3, if the system is advanced one "instruction" beyond the red vertical line, both the assignment to piv and part of the (L < R) check will be performed. Because this is hardware, and both are mapped to the same clock cycle, it is not possible to decouple these two operations and single-step through only one of them. It may be possible to address such situations by creating a tool that modifies the user circuit to either temporally separate the two operations or provide selective gating to each operation; however, that would result in significant changes to the user circuit, meaning the circuit being debugged may be very different from the original circuit.

In [125] the authors explored another approach. They provided virtual serialization, making it appear to the user that the statements were executed serially, when in reality they were executed in parallel in the hardware.
Indeed, this makes the debugger behave more like a traditional software debugger; however, it hides the parallelism from the user, and prevents them from restructuring their C code to explore different fine-grained parallelism optimizations.

3.4.4 IR Instructions

As discussed in Section 2.3.3, most HLS tools compile software to hardware in several stages; a key step in this process is translating the software to IR before synthesizing to hardware.

Ideally, the software designer should be sheltered from the IR, and should not have to know of its existence. This is similar to how someone developing software should not have to know about the underlying assembly language of their compiled software. We anticipate, however, that there are several reasons the HLS designer might want to inspect and understand the IR. First, C instructions often take multiple control steps, and it may sometimes become important to understand when primitive operations occur (this may be especially important when debugging multi-threaded applications). Second, during performance optimization, it may become important to understand why certain instructions take longer to execute than others; this information could be used to restructure the C code to lead to different fine-grained parallelism optimizations. Third, this information could be used by HLS tool vendors to help them optimize and debug their own compilers. For all these reasons, we have elected to expose the IR to the user. As shown in Figure 3.3, the bottom panel of the screen shows the IR for the currently executing instructions.

Figure 3.4: Gantt chart showing a loop unrolling optimization.

Figure 3.5: Gantt chart showing a code reordering optimization.

3.4.5 Compiler Optimizations

Most HLS tools provide the ability to select the level of optimization applied to the code before hardware is generated. For HLS tools built around the LLVM framework, the user can specify the familiar -O0, -O3, etc. flags.
The higher the level of optimization, the more restructuring of the code is performed before the hardware is built.

Debugging optimized code (such as that generated with the -O3 flag) is notoriously difficult. Because of this, when designing software it is common practice to debug using the unoptimized (-O0) version. As already discussed in Section 2.4.2, we believe that this strategy does not work well for HLS-generated circuits. First, changing the level of optimization will significantly change the timing of the resulting circuit. Since many of the bugs we anticipate are in the interfaces between blocks, changing the timing may result in significantly different behaviours. Thus, we believe that if the -O3 version of a circuit is going to be "shipped", it is important to debug the -O3 version of the circuit directly. Second, in many cases we have found that compiling with -O0 leads to much larger circuits, some of which would not fit on the target FPGA, making on-chip debugging impossible. This differs from software, where a program can be debugged as long as it fits in (virtual) memory.

Debugging optimized code is challenging. The temporal relationship between various instructions may vary greatly; the result of a single-step may not be intuitive, since several instructions that are not together in the original program are executed out of order or simultaneously. We believe that it is important to provide the user this level of visibility (rather than abstracting the sequencing to program order), since understanding the actual order of execution of instructions may be very important when debugging block-level interfaces or other timing-related problems.

Figures 3.4 and 3.5 provide examples of how the Gantt chart aids users in debugging optimized code. In Figure 3.4, a sorting operation is called 10 times in a loop. The optimizing compiler has completely unrolled the loop and replaced the code with 10 subsequent calls to the function.
In Figure 3.5, the compiler has performed code reordering optimizations. In this case, the two instructions immediately after the if block are independent of the contents of the if block and are executed before it. In both cases the Gantt chart helps the user understand the optimization.

3.5 Summary

This chapter described our framework for source-level, in-system debugging of HLS circuits. The framework consists of three major components: debug logic that is automatically added to the user's design to provide internal observability; a software-like debugger application which connects to the circuit via the debug logic; and a method for mapping signals within the circuit back to the original software program. Using this framework, a designer can debug their circuit while it executes in the normal operating environment, with a software-like debug experience including single-stepping, breakpoints, and variable inspection.

Our framework enables two different modes of debugging. The first mode is a live interaction mode, where single-stepping is accomplished by starting and stopping the circuit execution. This provides observability at any point in the program execution, but at the cost of halting the circuit, and likely breaking interactions with other modules in the system.

The second mode, the record-and-replay mode, works by using on-chip memory resources to record the circuit execution, after which this data is extracted by the debugger application and the user is able to perform the same software-like debugging, but with the recorded data. This approach is key to locating system-level bugs, as it allows the circuit to operate at-speed, interacting with the rest of the system. However, due to limited memory resources, only a small portion of the execution can be captured. In order to locate the cause of a bug, the user may have to repeatedly execute the system, changing which portion of execution is recorded. This process can be very time consuming.
In the next chapter we tackle this issue, and show how the tracing architecture can be optimized to capture longer execution traces.

Chapter 4
Optimizing Data Capture

(Parts of this chapter were published in [1, 2, 6].)

For many debugging tasks, it is necessary to run the circuit at-speed without interruption. Doing so requires signal-tracing, where hardware signals are captured into on-chip memories during circuit execution. In this chapter we present efficient techniques for performing signal-tracing within an HLS circuit.

The major challenge in signal-tracing is making use of limited on-chip memory resources. These limited resources restrict how many signals can be observed, and the duration for which execution can be captured. As discussed in Section 2.2.2, this limited observability means a designer will likely have to repeatedly execute the system, changing what portion of execution is captured into the buffers, until the root cause of a bug can be located. This process can be very time consuming, which is problematic since HLS technologies are often used by those who want a rapid design flow and fast time-to-market. If the data capture process can be optimized to record more information about the circuit behaviour in each iteration, the user may be able to narrow down the source of a bug in fewer debug iterations, significantly reducing the time spent on debugging.

In this chapter we show that by optimizing the trace buffer architecture for the particular class of circuits generated by HLS tools, we can make significantly better use of the debugging infrastructure. We show that by using information about the user's design from within the HLS tool, we can develop debug circuitry, customized to each design, which captures much longer execution traces than conventional
ELA signal-tracing tools.

As shown in Figure 3.2, we develop separate tracing approaches, optimized for different parts of the circuit: the control flow, as well as variables in both the datapath and memories. This chapter describes each of these techniques in detail, and provides results on the costs and benefits. Section 4.1 first describes existing ELA technologies, which form the baseline for our experiments, and Section 4.2 then explains why we move away from the ELA technique of a single trace buffer to an architecture with multiple trace buffers. Next, the architecture of each buffer is described in detail: Section 4.3 describes how the control flow information is captured into trace buffers, Section 4.4 describes the process for variables in the datapath, and Section 4.5 describes the process for variables in memories. With the description of each buffer completed, Section 4.6 explains some of the challenges introduced by our architecture, and describes how we address these issues.

Although our architecture provides much more efficient signal-tracing than a conventional ELA tool, there are still substantial overhead costs, especially for large circuits with many signals to observe. In Section 4.7 we present a technique to address this issue by reducing the amount of data that must be captured by our signal-tracing architecture. By using information about the structure of the source program, we are able to reconstruct certain variable values offline, eliminating the need to record them into memory.

Section 4.8 provides results for all of these techniques, measuring the increase in the length of execution that can be captured, as well as the area and timing impact. Section 4.9 summarizes the chapter.

4.1 Baseline Architecture

ELA tools such as Altera's SignalTap II [45], Xilinx's ChipScope Pro [46], and Mentor's Certus [47] use trace buffers to record the history of selected signals over a portion of the circuit execution.
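In such a baseline scheme, every observed bit is stored every cycle, so the replay window is simply the trace memory divided by the per-cycle entry width. A small sketch of this capacity arithmetic (the memory size is an illustrative assumption; the signal count is loosely based on the adpcm row of Table 3.1):

```python
def baseline_replay_window(memory_bits, observed_bits_per_cycle):
    """All observed signals are written every cycle, so the window
    (in cycles) is total trace memory over the entry width."""
    return memory_bits // observed_bits_per_cycle

# e.g. with 64 Kb of trace memory and ~3377 observed bits per cycle
# (FSM + registers + memory controller for adpcm), few cycles fit.
window = baseline_replay_window(64 * 1024, 3377)
```

With thousands of bits captured per cycle, even generous on-chip memory yields a replay window of only tens of cycles, which motivates the optimizations that follow.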
These tools are generally optimized for generic RTL-designed hardware circuits, in which datapaths, storage elements, and interconnect are hand-crafted by a hardware designer. As such, these tools typically use a single large trace buffer to record, every cycle, the behaviour of all signals that the user wishes to observe. Figure 4.1 provides an illustration of this technique.

Figure 4.1: Baseline signal-tracing architecture, equivalent to using an ELA tool. All variables are recorded each cycle in a single wide memory.

In the following sections we show how the trace buffer architecture can be optimized for HLS circuits, making more efficient use of the trace buffer memory and increasing the trace window size. Although the discussion focuses on circuits generated by the LegUp tool, many of the characteristics we describe are common to other HLS tools, so we believe many of these optimizations can be generalized.

4.2 Split Trace Buffer Architecture

In traditional trace buffer architectures, such as those in [78, 137–139], signals that contain information related to the state of the system (state bits) and signals that contain data flowing through the system (data bits) are treated the same, and stored in the trace buffer together. This provides maximum flexibility: when debugging designs that are control-intensive, more trace buffer capacity can be devoted to state bits, and when debugging designs that are data-intensive, more trace buffer capacity can be devoted to data bits.
This is essential given the vast diversity possible in hand-crafted RTL designs.

Designs produced by an HLS tool, in contrast, tend to have a predetermined structure with a small number of easily identifiable important state bits and data bits. In the LegUp tool we are using, the generated hardware always contains a set of FSM controllers, datapath logic, and one or more memory controllers, as was illustrated in Figure 2.6.

The fact that these important control and data signals are easily identifiable opens an opportunity not available to hardware-oriented trace buffers. Rather than a generic trace buffer architecture that can be used to record any signal, we use a split trace buffer architecture, where control and data signals are stored in separate buffers, as shown in Figure 3.2. We can then optimize these buffers separately; since control and data signals have very different access patterns, we expect the optimal architectures for the two buffers to differ.

4.3 Control Trace Buffer Optimizations

The control trace buffer is used to store the state bits of the controlling FSMs. Although there may be multiple FSMs in the user's design (one per module in circuits produced by LegUp), the State Encoder module, described in Section 3.3, re-encodes the bits of the various FSMs into a single encoded state signal.

If the baseline (ELA) architecture were used, the encoded FSM state would be recorded each cycle, requiring an entry in the trace buffer every cycle. Thus, if there are m entries in the control trace buffer, an ELA would store state information for m cycles, limiting the replay window to m cycles.

Rather than record the FSM state every cycle, we introduce a compression scheme which reduces the amount of data that needs to be recorded.

Figure 4.2: Layout of the control flow buffer.
The compression technique is motivated by the fact that in most cases, as the circuit transitions from one state to the next, the FSM state value increments sequentially. For example, in the scheduling example shown in Figure 2.5, once in State 1, the circuit will always progress sequentially to State 4, and depending on the branch condition, possibly even to State 5. To optimize for this, we add an extra field to the control flow buffer, sequence count, which counts the number of cycles that the control flow progresses sequentially. This optimized buffer layout is shown in Figure 4.2. For a sequence count width of s bits, each entry in the buffer can record the circuit state for up to 2^s cycles, provided that the FSM progresses sequentially. Circuits with many sequential computations benefit most from this optimization, while circuits with substantial control flow variation (function calls and branch statements) will benefit less.

To determine the optimal number of bits to use for the sequence counter, s, we simulated each benchmark using ModelSim, extracted the entire control flow trace, and calculated the effective compression ratio. Figure 4.3 shows the results of the experiment, where s is swept from 0 to 10 (s = 0 is equivalent to no compression, i.e. the baseline). The best result occurs when s = 6, achieving an average compression ratio of 4.0.

Figure 4.3: Memory efficiency of control buffer, when varying number of sequential counter bits.

This optimization is conditional upon state values being encoded with sequential values for straight-line execution through the state graph. LegUp always produces FSMs that are structured in this way; however, other tools may not number the states in this fashion.
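The compression scheme just described can be modeled in a few lines. In this sketch (Python), an entry is a (state, sequence count) pair; this tuple representation is chosen for illustration and is not LegUp's actual layout. The counter saturates at the s-bit limit, forcing a new entry:

```python
def compress_control_trace(states, s):
    """Compress a cycle-by-cycle FSM state trace into (state, count) entries.

    An entry (q, c) means: the FSM was in state q, then advanced sequentially
    (q+1, q+2, ...) for c further cycles.  c is capped at 2**s - 1, the
    largest value an s-bit sequence counter can hold.
    """
    entries = []
    for q in states:
        if entries:
            prev_q, c = entries[-1]
            # Extend the previous entry only if this state continues the
            # sequential run and the s-bit counter has not saturated.
            if q == prev_q + c + 1 and c < 2 ** s - 1:
                entries[-1] = (prev_q, c + 1)
                continue
        entries.append((q, 0))
    return entries

# A run of sequential states compresses into a single entry; a
# non-sequential transition (a branch) starts a new one.
trace = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 9]
print(compress_control_trace(trace, s=6))
# -> [(1, 4), (1, 4), (9, 0)]
```

With s = 0 the counter can never extend an entry, reproducing the uncompressed baseline; in the run above, eleven cycles compress into three entries. This is the mechanism behind the measured average compression ratio of 4.0 at s = 6.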
Despite this complication, this optimization may still be used; since the state encoder is used to re-encode the FSMs into a single encoding, sequential values could be applied at that stage.

4.4 Datapath Registers Trace Buffer Optimizations

As described in Section 3.2.3, an HLS tool employing an optimizing compiler will store some variables in centralized memories and implement others as physical registers distributed throughout the datapath. This section describes techniques to record updates to the datapath registers. Table 3.1 lists the number of datapath registers for the CHStone benchmarks, which range from 363 to 6799 bits. In the baseline approach, all of these signals are recorded every cycle, consuming memory very quickly and netting a very short replay window. In this section we present our dynamic tracing architecture, which includes circuitry to change which signals are recorded cycle-by-cycle as the circuit executes.

4.4.1 Dynamic Tracing Architecture

Figure 4.4 shows our dynamic register tracing architecture. Compared to the baseline architecture, there are two improvements. First, the HLS scheduling information is used to determine which signals are updated in each state, and only those signals are recorded in the trace buffer. This is accomplished by using existing one-hot signals from the state machine to control when each signal should be recorded.

Figure 4.4: Dynamic tracing architecture: only relevant signals are recorded each cycle.
The second improvement over the baseline architecture is that, in each cycle, our architecture packs all the variables that change into one line of the memory. This significantly reduces the width of the trace buffer, since it is no longer necessary to have a memory that is as wide as the number of bits to be traced. Instead, the width of the memory is determined by the maximum number of those bits that change in any one cycle.

As shown in Figure 4.4, this functionality is performed by the signal-trace scheduler, which consists of combinational logic to multiplex the register inputs, r1..n, and produce a single output signal, ractive, which is fed to the trace buffer. These improvements provide massive memory savings over recording every signal each cycle, and allow for a much longer trace window.

This architecture is very similar to the techniques presented by Monson et al. in [53]. In both cases the HLS schedule dictates when to record signals. In [53] multiple buffers are used, and signals only share buffers when they are recorded in mutually exclusive states. In our approach we aggressively pack signals into a single buffer, which offers two main advantages. First, if there is a mixture of signal widths, our method allows for more sharing of buffers. For example, in Figure 4.4, r1 and r3 can be packed into the same memory columns as r9. This packing provides better memory utilization, and thus longer execution traces, but at a cost of increased multiplexing logic. Second, as we present in the following subsections, further improvements to this architecture are possible precisely because we are using a single-buffer design.

4.4.2 Delay-Worst Signal-Trace Scheduling

In the architecture shown in Figure 4.4, all signals are recorded in the cycle in which they are generated. However, this is not necessary. Some signals could be delayed within the trace circuitry, and be recorded in the memory during the following cycle, or even several cycles later.
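The width saving from the packing in Section 4.4.1 can be estimated directly from the HLS schedule. A toy calculation follows (Python); the per-state schedule, signal names, and widths are illustrative, not taken from any benchmark:

```python
# Hypothetical per-state schedule: state -> widths (bits) of the distinct
# registers written in that state (each register appears in exactly one
# state in this toy example).
schedule = {
    "S1": {"r1": 32, "r3": 8},
    "S2": {"r4": 32},
    "S3": {"r8": 32, "r9": 16, "r10": 16},   # the busiest (worst-case) state
    "S4": {},
}

# Baseline (ELA-style) buffer: wide enough to record every traced signal,
# since all of them are captured every cycle.
baseline_width = sum(w for regs in schedule.values() for w in regs.values())

# Dynamic tracing: signals updated in the same state are packed into one
# entry, so the buffer only needs to be as wide as the busiest state.
dynamic_width = max(sum(regs.values()) for regs in schedule.values())

print(baseline_width, dynamic_width)   # -> 136 64
```

Here packing cuts the required buffer width from 136 bits to 64; the delayed scheduling described next attacks exactly this worst-case state to shrink the width further.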
Delayed tracing can provide better memory efficiency for the trace buffers. By identifying the worst-case state, and delaying some signals to be traced in a later cycle, the trace buffer width can be reduced.

We use a greedy algorithm, shown in Algorithm 1, that repeatedly identifies the worst-case state and attempts to move any signal traced in that state to a later state. Figure 4.5a provides an example, where the tracing of r8 is delayed one cycle to be recorded in state S7 instead of state S6. The state graph is used to determine the successor states which can be used to trace the signal. If the state has multiple successor states, then the delayed signal will need to be traced at each successor.

Figure 4.5: Signal-trace scheduling enhancements for the dynamic signal-tracing architecture: (a) Delay-Worst Scheduling, (b) Delay-All Scheduling, (c) Dual-Port Scheduling.
In order for a signal to be successfully delayed to a later state, the new trace width of the candidate state must not exceed the original worst-case width. If the immediate successor state(s) do not meet this requirement, their successors are searched recursively. However, if at any point a candidate state has multiple predecessors, the search is terminated. This ensures that the signal tracing is not moved into a loop (Figure 4.6a), or to a state that may be entered from a different path (Figure 4.6b). This restriction guarantees that there is never any extra tracing performed versus a non-delayed schedule. However, there are cases where this conservative approach may prohibit delaying signals that would otherwise be valid, such as the case in Figure 4.6c.

Function Delay-Worst
    foreach state
        sort state.signals by width
    canDelay <- true
    while canDelay
        Sw <- worst-case state
        ww <- Sw.width
        w <- min signal width in Sw
        if not StateCanReceive(Sw, w, ww)
            canDelay <- false
        if canDelay
            signal <- remove first signal from Sw
            foreach S in Sw.successors
                add signal to S
                sort S.signals by width

Function StateCanReceive(S, w, ww)
    result <- true
    foreach Ss in S.successors
        if Ss.numPredecessors > 1
            result <- false
        elif Ss.width + w >= ww and not StateCanReceive(Ss, w, ww)
            result <- false
    return result

Algorithm 1. Delay-Worst Scheduling

It is important to highlight that modifying the signal-trace schedule has no impact on the user circuit; all operations are still performed in the same order as originally determined by the HLS scheduler. Signals are delayed only within the trace circuitry, allowing them to be recorded in memory one or more cycles later than they were computed.

There are some caveats to delaying signal tracing. First, the signal must be available and unchanged at the later state. Many signals will already be saved in a register since they are used in a later state in the HLS schedule.
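Algorithm 1 can also be rendered as a short executable sketch. The version below (Python) is illustrative rather than our implementation: the state-graph representation is invented for the example, and it adds one simplifying rule, conservatively treating a state with no successors as unable to receive a delayed signal, so that signals are never pushed off the end of the schedule.

```python
class State:
    """A toy FSM state in the trace schedule (representation is ours)."""
    def __init__(self, name, signals):
        self.name = name
        self.signals = dict(signals)   # signal name -> width in bits
        self.successors = []
        self.num_predecessors = 0

    @property
    def width(self):
        return sum(self.signals.values())

def state_can_receive(s, w, ww):
    if not s.successors:               # simplification: nowhere to pass it on
        return False
    for succ in s.successors:
        if succ.num_predecessors > 1:  # would move tracing into a loop, or
            return False               # onto a path not always taken
        if succ.width + w >= ww and not state_can_receive(succ, w, ww):
            return False
    return True

def delay_worst(states):
    while True:
        sw = max(states, key=lambda s: s.width)    # worst-case state
        if not sw.signals:
            return
        ww = sw.width
        name, w = min(sw.signals.items(), key=lambda kv: kv[1])
        if not state_can_receive(sw, w, ww):
            return
        del sw.signals[name]                       # delay the narrowest signal
        for succ in sw.successors:                 # trace it in each successor
            succ.signals[name] = w

# Toy linear schedule: A is the worst-case state at 64 bits.
a, b, c = State("A", {"x": 32, "y": 32}), State("B", {"z": 16}), State("C", {})
a.successors, b.successors = [b], [c]
b.num_predecessors = c.num_predecessors = 1

print(max(s.width for s in (a, b, c)))   # -> 64 (trace buffer width before)
delay_worst([a, b, c])
print(max(s.width for s in (a, b, c)))   # -> 32 (after delaying x and z)
```

On this three-state chain the worst-case width drops from 64 bits to 32: x is delayed from A into B, and z from B into C, the same kind of move shown in Figure 4.5a.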
However, some signals are generated and used in the same cycle, and so a register needs to be added to the circuit in order to trace them at a later cycle. In our testing, adding these registers to the user circuit does not increase the area, and in many cases actually reduces it. This is likely due to the nature of FPGA architectures, which provide a register for each combinational logic unit, and the fact that directly recording a combinational signal introduces fan-out, which may necessitate duplicating some logic in order to meet timing requirements.

Figure 4.6: Example state graphs where signals cannot be delayed from the worst-case state, Sw, to a candidate state, Sc, since Sc has multiple predecessor states.

In order to delay signals to be traced in a later state, they must remain unchanged until they are eventually written to memory.
In most cases this property is guaranteed since: (1) the SSA form used by LLVM guarantees that each IR register is updated only once, (2) the HLS binding stage will map each IR register to a unique hardware register in the RTL (registers are usually not shared, since they are cheaper than multiplexers in FPGAs), and (3) tracing cannot be delayed to the next occurrence of the same state (a loop) due to the multiple-predecessors restriction. In the infrequent case that registers are shared, such as the output registers of a floating-point unit, the signals are prevented from being delayed to any other state in which they are updated. One minor issue is that when tracing terminates, some signals may not have been traced yet. This is remedied by executing for an extra few cycles, which has negligible impact since the overall trace typically has thousands of cycles. In addition, since the debugger application is aware of the trace schedule, the delay in tracing can be completely hidden from the user, and has no impact on variable visibility.

This approach significantly increases the utilization of the trace buffers, increasing the trace window length. More complex algorithms could be developed, such as a non-greedy method of choosing which signal to delay, delaying only some bits of a signal, or more aggressively identifying candidate delay states; however, we expect diminishing returns from these techniques.

4.4.3 Delay-All Signal-Trace Scheduling

The previously described delay-worst scheduling scheme works to reduce the width of the trace memory; in this next technique, which we refer to as delay-all scheduling, we reduce the rate at which entries in the buffer are used. For a given state, if all registers that are updated during that state can be delayed and combined with registers from a subsequent state(s), without increasing the required buffer width, then the trace buffer write for the original state need not occur.
Figure 4.5b provides an example: r3 and r1 can be delayed and recorded in state S2, eliminating the need to use a buffer entry in state S1.

The algorithm is the same as that used in the delay-worst approach (Algorithm 1), with two modifications:

1. It iterates over all trace-enabled states (not just the worst-case state).
2. The StateCanReceive function is modified with the added condition that candidate states must have a non-zero trace width (moving signals to an empty state will not save any memory).

4.4.4 Dual-Ported Memory Signal-Trace Scheduling

Although delayed signal-tracing will improve trace buffer memory utilization, there still may be cases where memory is underutilized. We can further optimize the architecture by taking advantage of the fact that FPGAs typically offer dual-ported memories. We cut the trace buffer width in half and record half of the ractive signal in the current entry of the trace buffer, and the other half in the next entry of the buffer. The memory savings occur in states where less than half of the trace buffer width is required, as only a single entry in the buffer is consumed. Figure 4.5c provides an example of this optimization; S3 consumes only a single entry in the half-width buffer, reducing the total memory usage.

4.4.5 Dynamic Tracing Results

Table 4.1 provides results of the effectiveness of the dynamic tracing architecture at increasing the trace length captured in the datapath registers buffer. Employing dynamic tracing allows the datapath registers to be recorded for, on average, 86x longer than if an ELA were used. When the delay-worst, delay-all, and dual-port optimizations are included, this rises to 196x. The final column of the table represents the optimal improvement, if perfect memory utilization could be obtained. Our architecture
achieves 78% of the optimal case, leaving little room for further improvement.

Benchmark   Dynamic Tracing   + Delay-Worst   + Delay-All   + Dual-Port   Optimal
adpcm       36x               81x             111x          149x          237x
aes         59x               165x            165x          169x          173x
blowfish    53x               200x            200x          215x          223x
dfadd       38x               38x             46x           55x           70x
dfdiv       84x               121x            133x          172x          222x
dfmul       14x               19x             23x           27x           36x
dfsin       124x              195x            218x          276x          392x
gsm         163x              163x            168x          235x          290x
jpeg        211x              428x            478x          689x          1056x
mips        14x               14x             39x           41x           63x
motion*     11669x            12342x          13371x        16045x        17405x
sha         49x               40x             59x           79x           100x
GeoMean     86x               130x            160x          196x          252x

Table 4.1: Improvement to captured trace length due to dynamic signal-tracing. All values represent the factor increase to the replay window length vs the baseline ELA architecture. *Dynamic tracing offers very large improvements in benchmarks that update variables in datapath registers very infrequently.

4.5 Memory-Update Trace Buffer Optimization

The previous sections have presented optimizations for recording the FSM states and datapath registers. As described in Section 3.2.3, most HLS compilers will also store some variables in one or more on-chip memories within the circuit. In this section, we present several techniques for recording updates to these variables in trace buffers.

It is first necessary to understand how memory writes are implemented within the HLS circuit. Figure 4.8 provides an example portion of a C program containing four writes to a single memory, as well as the resulting IR code, which correspondingly contains four store operations scheduled to FSM states. Figure 4.9 shows some of the circuitry generated for this program, particularly the logic inserted to perform the store instructions. This logic consists of multiplexers that, based on the state, forward the appropriate IR registers in the datapath(1) to the memory controller inputs.

(1) When we mention IR registers in the circuit, we are actually referring to the physical registers that the IR registers are mapped to.
Since there is a one-to-one mapping between IR register and physical register, this shortcut in terminology is used.

Figure 4.7: Signal-tracing alternatives for capturing updates to variables in memory. Green signals are datapath registers corresponding to source code variables. Purple signals are memory bus signals, as well as datapath registers used by memory store operations.

In order to track updates to variables located in memory, it is necessary to record each store operation, including both the value written into the memory, as well as the address. In this section we explore different techniques to capture these store operations. We explore recording either the memory controller signals, or the IR datapath registers, either of which is sufficient to determine how the memory was updated. We also explore alternatives for where this data is stored: either in a dedicated trace buffer, or in the same trace buffer used for capturing datapath registers. Figure 4.7 illustrates the different options we explore, which we describe in the following subsections.

4.5.1 Case 1 – Tracing a Single Memory Controller

This case represents the approach taken in our early work [1, 2], where we assumed the circuit contained a single global memory. The memory controller signals (mem_addr and mem_in) are fed directly into a dedicated trace buffer, which is enabled using the write enable signal of the memory controller. This case is shown as (1) in Figure 4.7, and is also reflected in the system diagram in Figure 3.2. This results in a trace buffer which has minimal overhead logic and very good memory utilization, but is only suitable for the case of a single global memory.

In modern HLS tools (including LegUp 4.0), multiple memories can be distributed throughout the circuit, removing the bottleneck of a single memory controller, allowing for improved scheduling and
higher performance. However, this makes recording updates to these variables in trace buffers more challenging; instead of a single set of memory controller signals, there are now many memory controllers, each with circuits like those in Figure 4.9. Recall that Table 3.1 shows the number of memory controller signals for the CHStone benchmarks, which range from 0 to 1327 bits in the distributed memory scenario.

Figure 4.8: Example of a C program containing memory writes, and the resulting scheduled IR code. Figure 4.9 shows the resulting circuitry.

Figure 4.9: The generated memory controller circuitry for the example program in Figure 4.8.
Next, we present three alternatives for handling the case of multiple memories in the circuit.

4.5.2 Case 2 – Multiple Memory Controllers, Combined Tracing with the Datapath Registers

In the case where multiple memories are present, it would require a very wide trace buffer to capture all memory controller signals. In addition, since store operations to the different memories would occur in different states, there would be significant wasted memory space. However, with several memory controller signals to record, each needing to be recorded in specific states, the scenario is actually identical to tracing the updates to the datapath registers, and the same architecture as described in Section 4.4 can be used. In fact, the memory controller signals (mem_addr and mem_in for each memory) can be fed directly into the signal-trace scheduler, as shown in (2) of Figure 4.7. Thus, the signal-trace scheduling, determined using the delay-worst, delay-all, and dual-port optimizations, will now take into account both datapath registers as well as memory controller signals.

This approach makes it possible to capture memory controller signals from many different memories, each needing to be recorded in different FSM states.

4.5.3 Case 3 – Substitute Memory Controller Signals with IR Registers

Our next configuration is motivated by the fact that when recording both updates to datapath registers and updates to variables in memory, duplicate data may be stored. This occurs because the memory controller signals are fed by signals directly from the datapath, and if these datapath signals correspond to source code variables, they will already be recorded in the datapath trace buffer.

This can best be understood through the example in Figures 4.8 and 4.9. In this case array A is written to in four different locations in the code. In the resulting circuit, the address and data signals for memory A will be driven by IR registers %i and %w in state S2, %tmp and %x in S3, %k and %y in
S5, and %m and %z in S6. These IR registers, with the exception of %tmp, correspond to variables in the original source code, and so updates to these registers will already be recorded. If we also record the mem_addr and mem_in signals in these states, it will result in duplicate data being stored in the trace buffer.

In this configuration we attempt to reduce this data duplication. There are two approaches that can be taken: 1) record the datapath registers, and use those to determine the values of the memory controller signals, or 2) record the memory controller signals, and use those to determine the values of the datapath signals. There are subtle differences between the two approaches, which impact both the length of execution trace that can be captured and the area overhead.

In this configuration (Case 3) we explore the first option, recording the datapath registers and using that data to determine updates to variables in memory. In the next configuration we will consider the reverse case.

We elect to record all of the datapath signals which drive the memory controllers. By doing so, we can determine the value of the memory controller signals whenever a memory write occurs, and thus capture all of the memory writes, without needing to directly monitor the memory controller signals. This approach, compared to the previous configuration, has a number of impacts on the resulting debug circuitry:

1. It completely eliminates data duplication between the datapath signals and the memory controller signals, since memory controller signals are no longer traced. This results in better memory efficiency, and thus a longer trace window.

2. The datapath signals that feed memory controllers but do not correspond to source code variables will be new signals recorded in this configuration (such as %tmp in the example). Recording these signals has a subtle benefit.
In the next section we explore signal restoration, where non-traced IR registers are reconstructed using the values of traced IR registers. This means that knowing the value of datapath signals, such as %tmp, can aid in reconstructing the value of other signals, eliminating the need to trace them. Again, this improves memory efficiency. This could not be done in the previous case, where the memory controller signals were captured, as this restoration is done using the IR code, and unlike the datapath signals, the memory controller signals do not have a one-to-one mapping with the IR.

3. Although this approach improves memory efficiency, it comes with an area penalty. This configuration involves recording many more unique signals. In the example, instead of recording the two memory controller signals feeding memory A, we are now recording eight different datapath signals. As discussed, many of these may already have been traced; however, if there are several new signals, such as %tmp in the example, the signal-trace scheduler, which multiplexes all of the signals into the trace buffer, may significantly increase in size.

Thus, this configuration presents a trade-off between the length of captured execution trace and the area overhead. Eliminating data duplication means that longer execution traces can be captured; however, the addition of new signals to trace means the multiplexing logic feeding the trace buffer increases in size.

4.5.4 Case 4 – When Possible, Use Memory Controller Signals to Deduce IR Signals

The previous case explored tracing datapath signals instead of memory controller signals. This eliminated data duplication, but required extra multiplexing.

In this final configuration we attempt the reverse: to use the memory controller signals to determine the value of IR datapath signals.
The reason for doing so is to reduce the multiplexing area penalty; that is, if we can record a few memory controller signals to determine the value of many datapath signals, multiplexing area will be saved. What is actually happening with this approach is that the multiplexing is being performed by the multiplexing logic that is already built into the user's circuit (see Figure 4.9).

However, unlike the previous configuration, where the datapath registers can always be used to determine the memory controller signals, memory controller signals cannot always be used to determine the datapath signals. Again, this is best understood by considering the example in Figures 4.8 and 4.9. Consider the events that occur within State 2, and suppose we intend to record the memory controller signals instead of the datapath signals. If the memory controller signals for A are recorded, it would be possible to determine the values of variable i (which drives the address of A) and variable w (which drives the data input of A); however, this is contingent upon the control flow entering State 2. If the control flow takes the other branch in the program, and goes to State 3 after State 1, then the store operation in State 2 will not be encountered, and the memory controller signals will never be recorded. If this happens, the values of variables i and w will be lost. Thus, in this case, we cannot use the memory controller signals to reliably determine the datapath signals, as there is no guarantee that the store operation will be encountered.

However, if it can be proven that the state where the datapath register is used in a store instruction will always be reached, then the memory controller signal can be recorded in place of the datapath signal. In the example, this can be seen in States 4, 5, and 6, with the variables k, y, m, and z.
In this case it is guaranteed that after the four variables are computed in State 4, the control flow will reach State 5 and State 6, and thus recording the memory controller signals in these states is sufficient to determine the values of the four datapath signals.

In this configuration we analyze the state graph to determine instances where the state in which a datapath signal is updated will lead directly to (and only to) the state where it is used in a store operation. In such cases we elect to record the memory controller signals, making use of the existing multiplexing in the user's design. In all other cases we maintain the approach used in the previous case, where the datapath signals are recorded.

This approach will result in a lower area overhead than Case 3, and maintains the property of Case 3 that all duplicate data is eliminated. Case 4 is not explicitly shown in Figure 4.7; however, it is essentially a hybrid between Cases 2 and 3, with a mixture of datapath registers and memory controller signals being fed to the signal-trace scheduler.

Although Cases 2-4 were motivated by the need to handle distributed RAMs, they can also be applied in the case of a single global memory controller, as we show in the results.

4.6 Challenges of a Split Buffer Architecture

We have shown how using multiple trace buffers allows for optimizing each of them independently. However, such an approach also has complications compared to the ELA approach of using a single monolithic trace buffer.

4.6.1 Event Ordering

One challenge introduced by using a split trace buffer approach is establishing a temporal ordering between the events recorded in the different trace buffers. This is trivial for the baseline case, since there is only one trace buffer and it is updated once every cycle.
However, in our architecture each buffer can be incremented independently, so establishing an ordering is not straightforward.

In our proposed trace buffer architecture, the variable updates captured in the data-flow buffers need to be linked to the control flow buffer, in order to determine when in the program execution these variables were updated. In our early work [1, 2] we added extra fields to the buffers to track the link between the data flow and the control flow; however, this consumes trace buffer memory, and is not necessary. Rather, the link between the buffers can be determined offline, after the data is extracted from the trace buffers. To do this, we keep track of which FSM states add entries to the data-flow buffers, and whether they add one or two entries (dual-ported trace memories). To reconstruct the link, we first extract and decompress the control flow trace, which provides a cycle-by-cycle trace of FSM states. Next, we iterate over the data-flow buffers in reverse, and use the saved information to link each data-flow update to the correct control-flow cycle.

4.6.2 Trace Buffer Balancing

Another issue that arises when using multiple, independently triggered trace buffers is that it is difficult to predict at compile time how much memory should be allocated to each buffer. Since the buffers may be filled at different rates, once the circuit is stopped, each buffer may contain a different number of cycles of execution history, even if the buffers are sized equally. It is important to note that the available execution history is the minimum number of cycles captured by any one buffer. For example, if the three buffers contain c1, c2 and c3 cycles of history, the replay window will be restricted to min(c1, c2, c3) cycles, since that is the only portion of execution for which complete data is available.

This introduces the difficult problem of determining how to best allocate memory between the trace buffers.
To address this memory-allocation challenge we elected to use a static analysis of the scheduled IR code. Our technique is to analyze the design to determine the percentage of states that need to record data in each trace buffer, and use this data to predict the relative fill rates of the buffers. The major limitation of a static analysis approach is that it does not reflect the fact that some parts of the program will execute more often than others.

We also experimented with two other approaches. One option was to always allocate a constant fraction of the memory to each buffer. However, even when we located the best average breakdown for our benchmark suite, it performed worse than our static analysis approach, since it did not operate on a circuit-by-circuit basis. The other technique, which can achieve better results than our static analysis approach, is to perform a dynamic analysis by simulating the benchmark in ModelSim, and determining the actual fill rates of each buffer. Although this provides the best results, it may not always be possible to simulate the circuit, or to provide realistic inputs that will stimulate the circuit correctly.

4.7 Signal Restoration

This concludes the description of our signal-tracing architecture. Before providing an analysis of the architecture, we present one more major optimization: a technique to reduce the number of signals that need to be recorded by our architecture. All of the descriptions up to this point have assumed that all circuit signals corresponding to source-code variables are recorded into the trace buffers. However, in many cases it is possible to use information from within the IR code to automatically reconstruct signals based on the values of other signals, thereby reducing the number of signals that need to be recorded.
Consider the following sample IR code:

    %5 = load %4
    %10 = add %5, %6
    %11 = mul %10, 7
    %12 = div %11, %6

In this case register %10 can be restored offline by performing an add operation using the values of registers %5 and %6. In many cases there may be a chain of dependencies; for example, once the values of %5 and %6 are known, they can be used to determine the values of registers %10, %11, and %12.

In this section we present a technique to minimize the number of registers that need to be recorded, providing two major benefits for our signal-tracing architecture. First, it reduces the amount of data being recorded in the trace buffer, allowing for a longer execution to be captured, and second, it reduces the multiplexing logic in the signal-trace scheduler.

4.7.1 The Signal-Trace Selection Problem

Automatically determining which signals to record, and which to restore, is a challenging optimization problem. There may be tens or hundreds of IR registers, leading to an exponentially growing space to explore. A valid solution is one where the value of all needed registers (those corresponding to source-code variables) can be obtained, either because the register is traced, or because its value can be restored. The optimal solution is the one which minimizes the number of traced bits.

To illustrate the complexity of the problem, assume that from the sample IR code above we only need the value of %10 (the other registers may not correspond to user variables in the C code). Thus we have the option of directly tracing %10, or alternatively tracing %5 and %6 and using them to restore %10. In this case it would seem better to trace %10 directly, since it involves tracing only one register instead of two; however, %5 and %6 may help restore other signals in the program, complicating the decision process.
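To make the search space concrete, the following toy sketch solves a tiny instance of the selection problem by brute force. It is illustrative only — our actual approach uses an ILP formulation, and the dependency map and bit widths here are assumptions based on the example above.

```python
from itertools import combinations

# Toy instance of the signal-trace selection problem.
# deps[r] lists the operands from which register r can be restored;
# registers absent from deps (e.g. the result of a load) are not restorable.
deps = {"%10": ["%5", "%6"], "%11": ["%10"], "%12": ["%11", "%6"]}
width = {"%5": 32, "%6": 32, "%10": 32, "%11": 32, "%12": 32}
needed = {"%10"}  # registers that map to source-code variables


def available(traced):
    """Close the traced set under restorability: a register is available
    if it is traced, or if all of its operands are available."""
    avail = set(traced)
    changed = True
    while changed:
        changed = False
        for r, ops in deps.items():
            if r not in avail and all(o in avail for o in ops):
                avail.add(r)
                changed = True
    return avail


def best_selection():
    """Exhaustively find the cheapest traced set (in bits) that makes
    every needed register available."""
    regs = sorted(width)
    best = None
    for k in range(len(regs) + 1):
        for traced in combinations(regs, k):
            if needed <= available(traced):
                cost = sum(width[r] for r in traced)
                if best is None or cost < best[0]:
                    best = (cost, set(traced))
    return best
```

Exhaustive search is exponential in the number of registers, which is exactly why a real instance with tens or hundreds of registers calls for an ILP solver rather than enumeration.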
Furthermore, not all registers are weighted equally; registers have varied bit widths, which further complicates the process of minimizing the number of bits to trace.

4.7.2 Solving the Optimization Problem

Our approach to solving the optimization problem is to use an integer linear programming (ILP) solver. We now describe how the ILP constraints are derived.

Optimization Goal

For each IR register, r, we define two boolean variables: a_r represents whether the value of the variable is available (either because the register is traced, or because the value is restorable), and t_r represents whether the register is traced. The registers which correspond to a source-code variable are constrained to be available, by adding the constraint a_r = 1. We set the goal of the ILP solver to minimize the number of traced bits, as shown in Equation 4.1, where w_r is the bit width of register r.

    ∑_{∀r} w_r · t_r (4.1)

Recursive Constraints

Next we add constraints to reflect the conditions in which a register is available. In most cases the output of an IR instruction can be computed from its operands. For each restorable register we add the constraint in Equation 4.2, which reflects that the register's value is available if it is traced, or if the values of all of its operands are available. Some registers cannot be restored from their operands; for example, a load operation would require knowledge of the entire memory space in order to be reliably restored. For these types of registers the value is only available if it is traced, so the constraint in Equation 4.3 is used. We do not consider restorations across function call boundaries; thus function parameters and function return values are not restorable.

    a_r = t_r ∨ (a_op1 ∧ a_op2 ∧ ...) (4.2)
    a_r = t_r (4.3)

Expanded Constraints

The recursive approach breaks down when there are dependency loops; for example, consider this IR code, which initializes %i to 0 and then increments it forever:

    Loop:
    %i = phi [0, %LoopInit], [%incr, %Loop]
    %incr = add %i, 1
    br label %Loop

In this example %i can be restored from %incr, and %incr can be restored from %i, which will result in the following recursive constraints:

    a_i = t_i ∨ a_incr (4.4)
    a_incr = t_incr ∨ a_i (4.5)

However, these constraints are not sufficient, as the ILP solver may reach a solution where a_i = 1, a_incr = 1, t_i = 0, and t_incr = 0, which is legal based on the constraints, but does not reflect the fact that neither register is available unless at least one of them is traced.

To rectify this issue, we introduce expanded constraints. We first determine the set of registers, C, which have a cyclic self-dependency, by performing a depth-first search on the operand tree of each register to determine if it contains a loop back to itself. If a loop is found, the register is added to the set C. Next, we iterate over all registers in C, adding an expanded version of the constraint in Equation 4.2, where any a term on the right-hand side whose corresponding register is in the set C is expanded out and substituted with the right-hand side of Equation 4.2.

Equation 4.6 provides an example of an expanded constraint, where the a_op2 term from Equation 4.2 has been expanded. This example contains only one expansion; however, it must be repeated until the right-hand side contains no a terms that correspond to registers in C.
Inevitably this expansion will result in a loop, and requires a terminating condition; thus during expansion all visited a terms are tracked, and if one is encountered again, the non-restorable constraint in Equation 4.3 is used instead to terminate the recursion.

    a_r = t_r ∨ (a_op1 ∧ (t_op2 ∨ (a_op2,op1 ∧ a_op2,op2 ∧ ...)) ∧ ...) (4.6)

Depending on the depth of the dependency loop, multiple expansion iterations may be needed before termination, and this constraint can grow very large, significantly impacting the run-time of the ILP solver. It is for this reason that we only apply the expanded constraint when necessary, that is, only for those registers in C, while the other registers use the recursive constraints.

We tested other techniques for handling dependency loops, but could not develop another solution that functioned correctly, allowed for an optimal solution, and had less impact on the ILP solver. For example, one potential solution is simply constraining the design such that at least one register in the dependency loop is traced. Although this will ensure a valid solution is reached, it overconstrains the problem, and may prevent the optimal case from being discovered. For example, the optimal solution may involve a register in the loop being restored from registers outside the loop, which could then cascade and provide availability of all registers in the loop. This solution would not be obtained if we
overconstrained the problem such that at least one register in the loop was traced.

4.7.3 Restoration Effectiveness and Limitations

Table 4.2 provides details on the signal restoration process for the CHStone benchmarks. On average, the benchmarks have 3301 signal bits that correspond to source-code variables and need to be available. Using the ILP solver we are able to obtain the optimal solution for each benchmark; on average, 1206 bits need to be traced, which equates to an average restoration ratio of 2.7.

    Benchmark   Registers Needed   Registers Traced   Restor.   ILP Solver
                (# bits)           (# bits)           Ratio     Time (ms)
    adpcm       4987               2248               2.2       266.2
    aes         4329               2856               1.6       16.9
    blowfish    2878               1837               1.6       38.6
    dfadd       2684               518                5.2       3.0
    dfdiv       2944               581                5.1       5.4
    dfmul       1665               229                7.3       2.4
    dfsin       7535               1830               4.1       7.5
    gsm         2599               812                3.2       26.1
    jpeg        10447              6202               1.7       36.5
    mips        2237               775                2.9       3.1
    motion      2621               1668               1.6       40.1
    sha         1707               981                1.7       52.3
    GeoMean     3301               1206               2.7       16.1

    Table 4.2: Signal restoration results

For this experiment we assumed that updates to variables in memories are recorded using Case 4, which uses datapath registers to capture memory updates when possible. This maximizes the number of signals that are considered for restoration, and maximizes the difficulty of the restoration problem. Based on our testing, other configurations, which require fewer datapath registers to be traced, achieve similar restoration ratios.

The execution time of the ILP solver varies from 2.4 ms to 266.2 ms. The Gurobi Optimizer ILP solver was used (version 6.5), run on an Ubuntu 14.04 virtual machine with an Intel i7-2600K CPU and 8 GB RAM. Generally the circuits with more variables, and thus more constraints, take longer to solve. Circuits with many large expanded constraints take longer to solve, while simple circuits with primarily recursive constraints can be solved rapidly.
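As a side note on where those expanded constraints come from: the cyclic self-dependency check of Section 4.7.2 amounts to a depth-first search over the operand graph. A minimal sketch (the dependency-map format is an assumption for illustration):

```python
def cyclic_registers(deps):
    """Return the set C of registers whose operand chains loop back to
    themselves. deps maps each register to the list of registers it is
    computed from (its operands)."""
    C = set()
    for start in deps:
        # Depth-first search from start's operands, looking for start.
        stack, seen = list(deps.get(start, [])), set()
        while stack:
            r = stack.pop()
            if r == start:
                C.add(start)
                break
            if r not in seen:
                seen.add(r)
                stack.extend(deps.get(r, []))
    return C
```

On the phi example above this yields C = {%i, %incr}; every register outside C can safely keep the cheaper recursive constraint of Equation 4.2.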
If run-times for larger circuits become prohibitive, one solution may be to consider each function independently, as restoration is not performed across function boundaries. Alternatively, a heuristic approach could be developed, which may run faster, but would not be guaranteed to achieve the optimal solution.

It is important to note that past works on signal restoration [96–98, 105, 140] define a restoration ratio which represents the number of signal states that can be restored per signal state traced, where a signal state is the value of a signal for a single cycle of the trace window. Thus, the restoration ratio in previous works reflects the product of how many signals can be restored and how many cycles the values are available for. In our work, where signal restoration is not done cycle-by-cycle, the restoration ratio reflects only how many signals can be restored, not how many cycles they are restored for; thus, it cannot be directly compared to past work.

One issue with our approach is that the availability constraints do not consider the restrictions imposed by a finite trace buffer. For example, the optimal solution will not trace a register if it can be restored from other traced registers; however, those registers may have been computed, and thus recorded in the trace buffer, much earlier in the program execution. Since the trace buffers are being filled continuously, and the oldest data is always being evicted, the data necessary to perform the restoration may be missing from the buffer. Unfortunately it is difficult to incorporate these restrictions into the problem constraints; static program analysis often cannot determine how many cycles will elapse between parts of a program, as the outcomes of loops and branches cannot always be predicted.

To model how frequently this issue will arise, we simulated execution of all of the CHStone benchmark circuits with our restoration algorithm applied.
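The measurement loop of such a simulation can be sketched as follows. The event format is hypothetical — in the real flow the trace comes from an RTL simulation — but the distance computation is the quantity being modelled:

```python
# Sketch: for each restored (non-traced) register update, measure how many
# cycles back its earliest recorded dependency lies. A restoration fails if
# this distance exceeds the history held in the circular trace buffers.

def restoration_distances(events, traced):
    """events: list of (cycle, reg, operands) in execution order.
    traced: set of registers recorded directly into the trace buffers.
    Returns one history distance per restored register update."""
    last_recorded = {}  # register -> cycle its value last became known
    distances = []
    for cycle, reg, operands in events:
        if reg in traced:
            last_recorded[reg] = cycle
        else:
            # The earliest dependency bounds the history needed to restore.
            dep_cycles = [last_recorded[op] for op in operands
                          if op in last_recorded]
            if dep_cycles:
                distances.append(cycle - min(dep_cycles))
            # Once restored, the value can feed later restorations.
            last_recorded[reg] = cycle
    return distances
```

A histogram of these distances over a full execution is what the experiment below reports.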
Each time a value needed to be restored, we recorded how many cycles back in the execution the earliest dependency occurred. Figure 4.10 provides a histogram of the results, as well as the cumulative percentage. The results show that over 50% of restorations require only signals recorded in either the same cycle or the previous cycle, and that 99.8% of restorations require less than 2048 cycles of history. Since our trace buffer architecture can capture tens of thousands of cycles of execution using a small amount of trace memory, it would be rare for a user to encounter a failed restoration. If this did occur, and the user needed the value, it could be constrained to be traced directly, and a new restoration solution could be obtained, although the user would need to recompile and re-execute the circuit.

    Figure 4.10: Histogram showing the number of cycles of history required to restore a signal value.

Another issue to note is that although our technique finds the optimal solution based on the number of bits to trace, it does not consider the fact that some registers may be updated more often than others. By minimizing the number of traced registers our solution will minimize the multiplexing logic, but may not obtain the absolute minimal trace buffer memory usage. To do so would require knowledge of how the program control flow will execute, which, as mentioned previously, usually cannot be determined beforehand.

Finally, signal restoration relies upon the HLS tool being correct, and will not properly reflect the hardware behaviour if the HLS tool incorrectly implements the logic for a reconstructed signal. If there were a bug in the HLS tool, and the signal was computed incorrectly in hardware, the user would not be aware, since the correct value would be computed offline.
This dependence on the correctnessof the HLS tool is discussed further in Section ResultsIn this section, we evaluate our debug framework from Chapter 3, which includes the trace buffer archi-tecture optimizations presented in this chapter. We evaluate our architecture in terms of 1) the increaseto the replay window versus using an ELA to perform signal-tracing, and 2) the area overhead and904.8. Resultsimpact on timing ( fmax) versus a circuit with no debug instrumentation.Results are obtained using LegUp 4.0, modified to test the proposed optimizations. Circuits arecompiled with the default options, which includes -O3 optimizations. The circuits used are the CHStoneHLS benchmarks [136], included with LegUp 4.0. Area and timing values are obtained by synthesizingthe Verilog circuit to an FPGA device using Altera Quartus II, version 15.0. To calculate the trace lengthof the full system, we first simulated each circuit in Modelsim to obtain a full execution trace. We theninput these traces into scripts that simulated filling the trace buffers cycle-by-cycle. At each cycle werecorded the replay window length, which is the minimum number of cycles available in any buffer, andaveraged this across the entire circuit execution. Based on the average number of cycles captured in thereplay window, we estimate the number of executed lines of C code (LoC) that this represents. This isdone using the approximation that LoC/cycle = Total Lines o f Code#FSM states . It should be noted that although thisis an approximation to the number of executed lines of code that our architecture captures, the factorimprovement over the baseline is based on number of cycles and is an exact measurement. Buffer sizesare determined using the static analysis approach described in Section Debug Architecture OptimizationsTable 4.3 provides results for the various architectures and optimizations discussed in this chapter. 
The table includes two sections, one where LegUp is configured to implement a single global memory, and one where memories are distributed throughout the user circuit. For both sections of the table there are results for various debug architecture optimizations, starting with an architecture with no debug support (vanilla LegUp), and then incrementally applying the stated optimizations.

The left-hand side of the table indicates the trace length provided by the signal-tracing architecture. For convenience, the trace length is provided as an absolute value in terms of cycles, time, and executed lines of code. The right-most column of the trace-length section indicates the relative trace length, compared to the baseline (ELA) architecture.

The right-hand side of the table indicates the impact on area and timing. Although the optimizations are added incrementally, all area and timing results show the difference relative to the "No Debug" configuration for that section.

    Debug Architecture        Cycles   Time       LoC (est.)   vs. ELA   Area          fmax
    Single Global RAM
    No Debug                  -        -          -            -         3520          201 MHz
    Baseline                  54       0.25 µs    132          1.0x      +292 (est.)   -
    Debugging (Mem. Case 1)   2161     9.80 µs    5221         40x       +1566         -10.9%
    + No Live Mem Access      2161     9.80 µs    5221         40.0x     +1197         +5.8%
    + Restoration             2905     13.18 µs   7019         53.8x     +744          +5.5%
    + Mem. Case 2             4015     18.21 µs   9699         74.3x     +906          +2.0%
    + Mem. Case 3             6984     31.68 µs   16870        129.3x    +981          +5.1%
    + Mem. Case 4             7052     31.98 µs   17035        130.5x    +862          +4.5%
    Distributed RAMs
    No Debug                  -        -          -            -         3093          214 MHz
    Baseline                  50       0.23 µs    121          1.0x      +487 (est.)   -
    Debugging Framework       -        -          -            -         -             -
    + No Live Mem Access      -        -          -            -         -             -
    + Restoration             -        -          -            -         -             -
    + Mem. Case 2             4055     18.39 µs   7996         81.0x     +879          -2.3%
    + Mem. Case 3             6208     28.16 µs   14998        124.0x    +979          +1.2%
    + Mem. Case 4             6362     28.86 µs   15369        127.1x    +858          +3.0%

    Table 4.3: Debug architectures and overheads. LoC = executed lines of C code captured per 100 Kb of trace memory.
Trace length increase is versus the baseline trace buffer architecture, equivalent to using an ELA for signal tracing. Area and fmax values are versus the No Debug architecture. All area values are given in number of Stratix V ALMs.

The table does not contain results for the distributed RAMs case until Memory Case 2 is applied, as that is the first architecture we consider that supports multiple memories. The following subsections provide an analysis of the entries within the table.

Baseline

This row of the table represents the case where the baseline signal-tracing approach is used. This uses the same signal-tracing technique as an ELA, where all signals are recorded every cycle. In such an architecture, the length of the execution trace (in cycles) can be calculated as in Equation 4.7.

    Trace Length (cycles) = ⌊(Total # Memory Bits) / (# Signal Bits to Trace Each Cycle)⌋ (4.7)

We did not actually build and prototype an ELA-based tool, so the area results are indicated as estimated values. These values were obtained by taking the area values from our architecture and subtracting the area belonging to the modules that perform our signal-tracing optimizations. This provides an estimate of the area required for an ELA-based tool. Although we do not have timing values, we expect that, like our architecture, it would not have a significant impact on the user circuit.

Debugging Framework

This configuration includes the core architecture of our debug framework, as illustrated in Figure 3.2. This includes the compression technique for capturing the control flow (Section 4.3), the dynamic signal-tracing architecture for capturing the datapath registers (Section 4.4), and the memory update trace buffer configured using Case 1 (Section 4.5.1). By employing these signal-tracing optimizations, this configuration achieves a 40x longer execution trace than if an ELA were used to record the relevant circuit signals.
This demonstrates the benefit of optimizing the trace architecture for HLS circuits, versus using a generic signal-tracing solution. By providing the user with a dramatic increase in trace length, they will often be able to locate the root cause of bugs faster, reducing time spent on debugging. The overhead for this core configuration is fairly high; it requires, on average, an extra 1566 ALMs, and reduces the circuit frequency by 10.9%.

No Live Memory Access

One of the primary reasons for the 10.9% reduction in operating frequency is the Memory Arbiter shown in Figure 3.2. This module provides the debugger with live access to the contents of the memories while the circuit is halted. As described in Section 3.3, this ensures that if a variable in memory is never updated during the replay window, it will still be available to the user, since the debugger can fetch it directly from memory. The impact on circuit frequency is due to the fact that the global memory controller logic often becomes the critical path of the circuit, and because the global memory controller output is fed through an arbitrating multiplexer, the critical path lengthens.

The No Live Mem Access entry in the table shows the results when the memory access module is removed, and shows that circuit frequency improves to be comparable to the circuit with no debug circuitry added. With the live memory access removed, the user will not have access to the values of variables located in memory unless they are updated in the replay window. By removing the memory arbiter from the circuit, the operating frequency is equal to or greater than the frequency obtained by the original uninstrumented circuit.

Signal Restoration

Next we incrementally add the signal restoration optimization from Section 4.7.
This reduces the number of datapath registers that need to be recorded, which provides two benefits: 1) it allows a longer execution trace to be captured, and 2) it reduces the multiplexing logic in the signal-trace scheduler, reducing the area overhead.

We have elected to apply the signal restoration scheme before considering the alternate cases of recording the memory updates. The reason is that Memory Cases 3 and 4, which dramatically increase the number of traced datapath registers, are only beneficial if signal restoration is employed.

Adding signal restoration significantly reduces the average area overhead of the debug instrumentation, from 1197 ALMs to 744. The trace length is increased to 54x compared to the baseline.

Mem. Case 2

Next we consider Memory Case 2 (Section 4.5.2), where the memory controller signals are fed into the signal-trace scheduler and recorded in the same buffer as the datapath registers. As expected, this results in a longer trace length, increasing to 74x versus the baseline, as the trace buffers can be better utilized. However, this significantly increases the area overhead (from 744 to 906 ALMs), as it requires much more logic in the signal-trace scheduler. It also lowers the circuit frequency by 3.5%, since, as described previously, the memory controller is often on the critical path, and feeding these signals into the multiplexing logic of the signal-trace scheduler increases the path length.

Mem. Case 3

Next we consider Memory Case 3 (Section 4.5.3), where memory updates are recorded using the underlying datapath registers instead of the memory controller signals. This eliminates data duplication between tracking the datapath registers and memory updates, and significantly increases the duration of execution that can be captured. The trace length (vs. ELA) improves from 74x in the previous case to 129x for the single RAM case.
For the distributed RAMs configuration, the increase is from 81x to 124x.

However, it also results in more unique signals being fed into the signal-trace scheduler, increasing the multiplexing logic size, and thereby increasing the overall area overhead of the debugging circuitry from 906 to 981 ALMs for the single RAM case, and from 879 to 979 ALMs for the distributed RAMs case.

Mem. Case 4

Finally, we consider Memory Case 4 (Section 4.5.4), where program analysis is used to determine when memory operations can be used to determine the value of datapath signals, thereby reusing multiplexing logic from within the user's circuit and reducing the area overhead. Area overhead is reduced from 981 to 862 ALMs for the single RAM case, and from 979 to 858 ALMs for the distributed RAMs case.

4.8.2 Area Breakdown

Although these optimizations may not seem to drastically reduce area overhead, it is important to recognize that there are many fixed-cost components in the debug circuitry, and our optimizations target only one part of it, the signal-trace scheduler.

In this section we give further insight into what contributes to the area of the debugging logic, by providing a detailed per-circuit area breakdown in Table 4.4. The data provided is for the last architecture in Table 4.3 (distributed RAMs and Memory Case 4). The Base column refers to the area of the circuit without any debug circuitry. The Incr. to Base column indicates the increase to the user's circuit when the debug circuitry is added. Even though the user's circuit is not modified, except for bringing signals to the top level, the area increases. This is a by-product of signal tracing: when the RTL is compiled to gates, logic optimization is performed, which often results in
nets being optimized away or restructured and absorbed into other logic. However, when these nets are fed into a trace buffer, they must be preserved in the circuit, and some optimizations are prevented, resulting in increased area.

    Benchmark     Base           Incr. to Base   Fixed   State Enc.   Trace Sched.   Debug Total   Total (Base + Debug)
    adpcm         5045 (2.1%)    237             383     60           438            1118          6163 (2.6%)
    aes           3908 (1.7%)    104             380     84           241            809           4717 (2.0%)
    blowfish      2230 (1.0%)    173             394     62           322            951           3181 (1.4%)
    dfadd         1742 (0.7%)    69              374     30           86             559           2301 (1.0%)
    dfdiv         3295 (1.4%)    38              381     63           91             573           3868 (1.6%)
    dfmul         1053 (0.4%)    -9              368     18           27             404           1457 (0.6%)
    dfsin         8672 (3.7%)    253             390     45           494            1182          9854 (4.2%)
    gsm           3064 (1.3%)    201             388     84           209            882           3946 (1.7%)
    jpeg          15716 (6.7%)   683             390     127          1675           2875          18591 (7.9%)
    mips          918 (0.4%)     89              386     34           141            650           1568 (0.7%)
    motion        5820 (2.5%)    225             377     98           519            1219          7039 (3.0%)
    sha           1292 (0.6%)    56              386     21           136            599           1891 (0.8%)
    Arith. Mean   4396 (1.9%)    177             383     61           365            985           5381 (2.3%)
    Geo. Mean     3093 (1.3%)    -               133     52           219            858           4025 (1.7%)

    Table 4.4: Area breakdown. All area values are given in number of Stratix V ALMs. The Fixed column includes all fixed-cost logic, including the Communication & Debug Manager, Trace Recorder, and Stepping & Breakpoint Unit. Percentage values indicate resource usage for a Stratix V 5SGXEA7N2F45C2 FPGA.

As shown in the table, a large portion of the debugging overhead is a fixed cost, and does not scale with program size. One module that does scale up is the State Encoder, which encodes one-hot FSM states into a fully-encoded format. This scales with the number of states in the system, but remains modest in size for even the largest benchmarks. The other module that fluctuates in size is the signal-trace scheduler, which scales with the number of signals that must be multiplexed into the trace buffers. For large designs this becomes the dominant contributor to the debugging area. It was due to
It was due tothis scaling that we spent most of our effort reducing the area of the signal-trace scheduler.964.8. Results4.8.3 Impact and ConsiderationsWith all optimizations applied, our final architecture provides a 127x improvement to trace length versusan ELA. For every 100Kb of memory allocated to trace buffers, the architecture is capable of recordingthe execution of 15369 lines of C code. This is achieved with a moderate amount of area overhead (858ALMs), and no impact on operating frequency.A 127x increase to the trace length means the designer can capture a longer execution, and poten-tially find bugs faster. However, another way to view the improvement is that with our optimizations,we can provide the same trace length as an ELA tool, but requiring 1/127th the amount of memory. Thismakes our work very beneficial in cases where memory is very limited, possibly because the designeris using a small FPGA device, and/or because the user’s design is already using a large portion of thememory resources. Requiring significantly less trace memory may allow the designer to continue to usetheir existing FPGA device, and not require them to migrate their design to a larger device, which wouldbe both more expensive, and require additional redesign time.These results obtained by our architecture are dependent on both properties of the user’s design,as well as the structure of the circuit produced by the HLS tool. For example, much of the benefit torecorded execution trace can be attributed to the fact that our architecture only records values as theychange, not in every cycle as done by an ELA tool. This approach of time-multiplexing which sig-nals are stored in the trace buffer is most beneficial when fewer signals need access to the buffer eachcycle. Thus, in designs where large portions of the variables are updated each cycle, our architecturewill offer less benefit, while designs which update a smaller fraction of variables each cycle will haveincreased benefit. 
There may be several factors which affect this property. For example, certain HLS optimizations, such as loop pipelining, may increase the parallelism, reducing the benefit of our approach. Another factor to consider is that as designs grow larger, and have more variables, the fraction of variables updated in each cycle will decrease, and our architecture will provide greater benefit.

It should be noted that there is a correlation between the benefit of our dynamic signal-tracing technique and the area overhead. As more signals are time-multiplexed into the trace memories, the captured execution trace will increase; however, the area overhead will also increase, as more multiplexing resources are required.

4.9 Summary

This chapter described and evaluated our techniques for signal-tracing within HLS circuits. Signal-tracing, which involves feeding signals into on-chip memories to record circuit execution, is necessary for performing at-speed, in-system debugging. However, due to limited on-chip memory, only a small portion of circuit execution can typically be captured, which leaves designers to debug by performing multiple executions of the system, capturing small portions of the execution until the root cause of a bug can be located.

In this chapter we showed that by using information about the design from within the HLS tool, we could perform more efficient signal-tracing than is possible using conventional signal-tracing tools, allowing substantially longer execution traces to be captured. We demonstrated that information about the program could be used to compress the FSM data, determine which datapath signals should be recorded cycle-by-cycle, and determine how observing the values of certain signals allows the debugger to reconstruct the values of other signals. Compared to conventional signal-tracing tools, our architecture can record, on average, an execution trace that is 127x longer.
For every 100 Kb of memory allocated to signal tracing, an average of 15369 executed lines of C code can be captured. This is accomplished using a moderate amount of debugging logic, totalling, on average, 858 ALMs.

The HLS debugging framework presented in Chapter 3 and the optimized signal-tracing architecture described in this chapter are limited to operating on HLS circuits created from single-threaded source code. In the next chapter we explore the challenges introduced when debugging multithreaded applications.

Chapter 5

Debugging Parallel Systems

In the previous chapters we presented a debug framework and signal-tracing architecture for in-system debug of HLS circuits. However, this work, as well as other related work [52, 53, 124, 125], assumed the source program was a single thread. Modern HLS tools provide the ability to synthesize circuits from multithreaded source code [29, 49, 111, 116]; this is important because, although it is possible to extract fine-grained parallelism opportunities from single-threaded code, the real benefits of hardware can only be attained by taking advantage of the coarse-grained parallelism that may be available in the underlying algorithm. Debugging multithreaded HLS systems is challenging, as the resulting circuit will have multiple hardware execution units that execute simultaneously. Although existing methods, such as those in Chapters 3 and 4 and [52, 53, 124, 125], could support multithreaded code by duplicating the debug circuitry for every thread in the system, doing so may not provide the most effective visibility to support designers as they uncover bugs in complex multithreaded systems.

In this chapter we present and evaluate a debugging infrastructure which provides users with observability into multithreaded HLS systems.
Like our work presented so far, the goal is to provide a source-level, in-system debug framework that would allow both hardware and software designers to understand the behaviour of their circuit.

This chapter is organized as follows: Section 5.1 motivates the need for a new debug framework for multithreaded circuits. This section explains the shortcomings in directly applying past work to multithreaded HLS designs, and how the types of bugs encountered in these systems warrant a different approach to debugging. Section 5.2 describes the approach we use in this work, which is to provide the user with observability into the design using tracepoints. Unlike the past work, which provided a single-stepping debugger interface, this approach focuses on giving designers insight into key points in the system, sacrificing complete observability in order to obtain more "big picture" information.

The remainder of the chapter presents our architecture to enable this style of debugging. Section 5.3 details our Basic Tracepoint Architecture that we have developed to provide tracepoint-based debugging to multithreaded HLS circuits. In Section 5.4 we present an enhancement, the Round-Robin Tracepoint Architecture, which uses novel techniques to enable sharing of trace buffer resources between threads in order to capture longer execution traces. Section 5.5 evaluates our techniques, and Section 5.6 summarizes the chapter.

Parts of this chapter were published in [4].

5.1 Framework Requirements for Debugging Parallel HLS Systems

Our framework presented in Chapters 3 and 4 was designed and optimized for a single software thread. The debugging instrumentation relied upon there being a single control flow in the circuit, which could be captured by feeding the current state of the FSM into a trace buffer.
Since all variable updates were scheduled to an FSM state, the recording of their values could also be coordinated with the FSM state.

In contrast, when the user provides multithreaded software, each software thread is synthesized to a different hardware unit, and these units execute in parallel. In this case there are multiple control flows, meaning multiple FSMs, all executing simultaneously and independently, each controlling the update of their data flow units.

It would be possible to debug such a system with our existing techniques; however, it would require duplicating the full debug logic for every thread in the system. If the user was able to achieve a high degree of parallelism, that is, splitting their design into many threads, it is likely the overhead of the debug logic would become burdensome. In addition, our previous technique was to provide the user with a software-like debug experience, where they could step through the code and inspect all variables. This requires full observability into the system, and substantial memory resources to track at-speed circuit execution.

Figure 5.1: Tracepoints in a multithreaded application. Tracepoints are user-selected locations in the thread control flow which, when encountered, cause data to be logged.

If there were many threads in the system executing simultaneously, the memory
resources would be depleted at an even faster rate, leaving the user with full visibility into every thread, but for only a tiny portion of the execution time.

Unfortunately, sacrificing trace duration in order to obtain full observability may be the exact opposite behaviour of what the user needs for multithreaded debugging. This is because the same issues that make debugging multithreaded software notoriously difficult are present when debugging multithreaded hardware. This may include finding bugs in thread communication, deadlocks, livelocks, race conditions, and thread starvation. Often these bugs are non-deterministic, hard to reproduce, or may require long run times to uncover. We believe this will lead users to want to give up observability into every variable in order to capture key values for a longer period of execution. Thus, the development of an infrastructure to support this kind of debugging is an essential part of the HLS ecosystem.

This requires a different approach to debugging; rather than providing the user with a GDB-like experience which requires the full control and data flow, this new framework needs to provide the user with meaningful data at key points in their program.

Although we are giving up some observability, the debugging requirements first described in Section 2.1.5 still apply. The framework must provide information to the user in the context of their original source program, and debugging must be completed in-system, allowing the circuit to execute and interact with the full system without interruption.

Figure 5.2: Timeline of tracepoint event data, provided to the user after execution is completed.

    Thread ID | Cycle # | Tracepoint  | Variables Logged
    1         | 26452   | main(): 113 | mutex = 0
    2         | 28591   | foo(): 41   | x = 7, y = 3
    4         | 28037   | bar(): 3    | <None>
    ...       | ...     | ...         | ...

5.2 Debugging with Tracepoints

To provide the user with partial observability into their system, we have elected to use tracepoint-based debugging, which is a debug construct used in many current software debugging tools. Tracepoints are key locations in the source program, identified by the user, which are logged each time they are encountered by a thread. Figure 5.1 illustrates the concept.

The process of selecting tracepoints is similar to that of breakpoints; a user typically identifies a line of code of key importance in their source program and marks it as a tracepoint.
Unlike breakpoints, which cause the execution to halt, when tracepoints are encountered by threads, key information is entered into a log, and the thread continues to execute without interruption. The information added to the log may be as simple as recording which thread encountered which tracepoint. However, it often also includes a timestamp to indicate the absolute time the tracepoint was encountered, as well as the value of certain variables that the user indicates he or she wants to record at that point in the program. Once program execution is complete, the tracepoint logs can be provided to the user as a timeline of events; Figure 5.2 provides an example.

We provide three reasons why a tracepoint-based approach is suited for multithreaded HLS circuits:

Length of Recorded Execution: Providing full visibility, as was done in our past work, depletes on-chip memory resources very quickly, resulting in only a small fraction of the total execution time being captured. As the number of threads in the system increases, there will be more signals to record, and the trace buffers will be depleted even faster. However, bugs in multithreaded systems, such as problems in thread communications, deadlocks, livelocks, race conditions or thread starvation, may require very long run times to expose, especially if they are non-deterministic. To find such bugs it may be necessary to trade off some observability to increase the length of execution trace that can be captured. In addition, in multithreaded systems it is often necessary to locate performance issues between threads, which may require observing key data points for long periods of time to capture the "big picture" behaviour. The tracepoint-based debugging model allows for this flexibility. The user can select as many or as few tracepoints as they desire.
By adding very few tracepoints, users can gain insight into circuit behaviour over very long run times.

Instrumentation Overhead: Providing full visibility requires significant area cost, as hundreds or thousands of signals need to be fed into the trace buffers (Section 4.8). This will scale up even higher as more threads are added, and the area cost may be prohibitive. Tracepoint-based debugging allows for a lightweight approach; by only inspecting a handful of values, area costs can be kept low, as we show in Section 5.5.4.

Software-like Debug Experience: In keeping with the goal of allowing software designers to debug HLS circuits, we looked for an approach that was similar to the software domain. Tracepoint-based debugging is already used in the software community where systems with many threads need to be observed. Tools such as Allinea DDT [50] and Microsoft Visual Studio [51] allow for recording tracepoints during program execution, which are provided to the user for offline debugging after the program has completed execution. Similar approaches have also been used for observing multiprocessor systems in SoC environments [141]. The reasoning for such an approach is similar to that in the hardware domain; executing at-speed and debugging offline is needed to activate elusive bugs in multithreaded environments, and recording all program values during execution can be too intrusive or will generate too much data to record [142].

By using a tracepoint-based approach we give the user the option of recording few variables in order to capture long execution traces. This approach can help find the obscure bugs in multithreaded systems, while keeping the area overhead minimal and providing an experience familiar to the software community. It should be noted that our framework does not automatically solve or locate bugs in multithreaded systems. The user is responsible for choosing where to insert tracepoints, and which
variables, such as locks or communication channels, should be observed. A debugging tool using our technique may elect to automatically identify such types of variables to the user; however, we do not explore such tools in this work.

One concern when using tracepoint-based debugging in software debuggers is that, in order to record the data into the log, the debugger may stall program execution for a number of cycles. This introduces a probe effect, where adding debugging may change the program behaviour and mask bugs [50]. This is not an issue in our framework. Our tracepointing architecture is inserted at the RTL level, not into the source program, which provides the fine-grained control necessary to ensure that all tracepoint data can be logged without stalling the circuit, or changing its functionality in any way.

5.3 Basic Tracepoint Architecture

In this section we present our basic architecture to provide tracepoint debugging of multithreaded HLS circuits. This architecture is an extension of our signal-tracing techniques described in Chapter 4, modified to enable partial observability, where only select portions of the control and data flow are recorded. In this section, the tracepointing architecture we present is replicated for each thread in the circuit (Figure 5.3a). In Section 5.4, we will enhance this basic architecture using novel techniques to share resources between threads (Figure 5.3b), increasing the observability.

5.3.1 Mapping Tracepoints to RTL Signals

Although the user selects tracepoint locations as specific lines of code in the C source program, the instrumentation to record the values is inserted at the RTL level. Thus a relationship between source code and hardware entities must be established. We use the techniques described in Section 3.2 and modify the HLS tool to populate a debug database during synthesis. Each statement in the source code is mapped to one or more FSM states in the RTL.
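As an illustration, the debug database can be pictured as a small set of mapping tables. The following Python sketch is hypothetical; the record names, fields, and example entries are ours for illustration, not the HLS tool's actual schema:

```python
# Hypothetical sketch of the debug database populated during synthesis:
# each source line maps to one or more FSM states, and each variable
# records how it is implemented in the RTL.
from dataclasses import dataclass, field

@dataclass
class VariableRecord:
    name: str
    storage: str        # "register", "wire", or "memory"
    rtl_signal: str     # RTL signal or memory implementing the variable

@dataclass
class DebugDatabase:
    line_to_states: dict = field(default_factory=dict)  # (file, line) -> [FSM states]
    variables: dict = field(default_factory=dict)       # name -> VariableRecord

    def tracepoint_states(self, filename, line):
        """FSM states that should trigger recording for a tracepoint."""
        return self.line_to_states.get((filename, line), [])

# Example entries (invented): line 113 of main.c maps to two FSM states.
db = DebugDatabase()
db.line_to_states[("main.c", 113)] = ["S7", "S8"]
db.variables["mutex"] = VariableRecord("mutex", "register", "mutex_reg")

assert db.tracepoint_states("main.c", 113) == ["S7", "S8"]
```

With a structure like this, translating a user-selected tracepoint into trigger states and signals to record is a pair of dictionary lookups.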
Thus, when the user selects the location for a tracepoint, it can be determined exactly which state in the FSM will trigger recording into the trace buffers. The debug database also contains information about source code variables, including how they are implemented in the RTL, and whether they are mapped to datapath signals (output of registers or combinational logic) or memories. Using this approach, the tracepoints for each thread can be translated to a set of FSM states, with zero or more RTL signals that need to be recorded at each state.

Figure 5.3: High-level view of tracepointing architectures. (a) Basic Tracepointing Architecture (Section 5.3); (b) Round-Robin Architecture (Section 5.4).

5.3.2 Signal-Tracing Architecture

For variables mapped
to datapath registers in the generated hardware circuit, the register output can be fed directly to the trace buffer for recording. However, for variables mapped to a memory, it is necessary to perform a memory read to retrieve the value so that it can be written to the trace buffer when the circuit reaches a tracepoint. This memory read operation can be inserted directly into the HLS schedule, provided there is an available memory port during that state. In our experience, the availability of memory ports is usually very high; for example, in the CHStone [136] benchmarks synthesized with the LegUp HLS tool [48], on average, memory controllers have an available port in 97% of states.

Figure 5.4: Tracepointing logic for each thread in the basic architecture.

Based upon the user-selected tracepoints, we automatically insert customized hardware to record the necessary signals into trace buffers. Figure 5.4 shows the circuitry for each thread. A separate trace buffer is added for each thread in the design, which is responsible for recording values for all tracepoints in the thread. Custom logic is added which multiplexes the different data signals into the trace buffer, controlled by the FSM state of the thread. An enable signal ensures that data is only recorded in states with tracepoints.
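To make the behaviour concrete, the following Python sketch models the per-thread recorder behaviourally, not as RTL: the FSM state drives both the enable and the multiplexer select, and an entry is written only in tracepoint states. The state names, tracepoint IDs, and data values are invented for illustration:

```python
# Behavioural model (not RTL) of the per-thread tracepointing logic:
# the current FSM state selects the data signal (the mux) and gates
# the write (the enable), so entries appear only in tracepoint states.

class ThreadTraceRecorder:
    def __init__(self, tracepoint_states):
        # tracepoint_states: FSM state -> tracepoint ID (the encoder output)
        self.tracepoint_states = tracepoint_states
        self.buffer = []  # trace buffer entries: (time, tracepoint ID, value)

    def clock(self, cycle, fsm_state, data_signals):
        tp_id = self.tracepoint_states.get(fsm_state)
        if tp_id is not None:                    # enable: tracepoint state only
            value = data_signals.get(fsm_state)  # mux select driven by FSM state
            self.buffer.append((cycle, tp_id, value))

# Hypothetical thread with tracepoints in states S3 and S9.
rec = ThreadTraceRecorder({"S3": 0, "S9": 1})
rec.clock(100, "S1", {})            # no tracepoint: nothing recorded
rec.clock(101, "S3", {"S3": 0x2A})  # tracepoint 0 hit: entry written
assert rec.buffer == [(101, 0, 0x2A)]
```

The `time` field here corresponds to the global timer discussed below; in hardware the mux and enable are combinational logic rather than dictionary lookups.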
If a tracepoint requires multiple RTL signals to be recorded simultaneously, the signals are concatenated into a single wide memory access.

The ID field indicates which tracepoint is being recorded and is fed by an encoder which translates the FSM state (usually in one-hot encoding) to a fully encoded format. This field is usually very small, as the number of bits it requires is only log2 of the number of tracepoints for the thread. This architecture is a direct adaptation from Section 4.4, with the modification that instead of recording all signals from the design, only those that correspond to user-selected tracepoints are recorded, allowing for much smaller multiplexing logic.

In our previous signal-tracing architecture (Chapter 4), cycle-by-cycle data is tracked; however, in our tracepoint-based approach there may be large gaps between tracepoint events. To recreate the timeframe of these events we add a global timer to the design, and a time field to each trace buffer, which records the time at which each signal update occurs. Once the trace buffer data is retrieved, this information can be used to provide the user with a cycle-accurate timeline of all of the thread tracepoint data (Figure 5.2). The width of the timer can be configured based on the anticipated run-time of the design. Alternatively, the timer can be modified to only increment when a record event occurs. This allows for a smaller counter and less memory overhead, but provides only a relative ordering of events, not a cycle-accurate timeline.

5.3.3 Issues with the Basic Architecture

The main drawback of this architecture is that each thread requires its own trace buffer. If one thread produces data more rapidly, it will fill its trace buffer quickly, and the buffers for the other threads will be under-utilized.
This limits the length of the execution trace that can be captured, and reduces the visibility the user has into the design.

Unfortunately, it is very difficult to predict how to size the different buffers. Each thread may have a different lifetime; the design may consist of all threads running the entire time, or it may have multiple phases where different threads are active at different times. Especially during the initial design and debug phases, the user may find that one thread is dominating execution and starving out other threads. Even if all threads are working and balanced well, the tracepoints of one thread may be encountered much more often than others; the user may place a tracepoint for one thread in a small loop, while another thread has tracepoints only in rarely executed locations.

If buffers can be shared between the threads in the system, these issues can be alleviated. Figure 5.5 provides an example of the benefits of buffer sharing. Each subfigure shows the same execution of four threads that encounter tracepoints at different rates. Figure 5.5a shows the case where no buffer sharing occurs, equivalent to our basic architecture. In this case the buffer for Thread 2 contains a very short history of execution, because it encountered many tracepoints soon before execution was halted.

We define the execution trace window as the period of execution for which we know the behaviour of all threads; in the example diagram, this is illustrated as the red arrow. Prior to this point, only partial tracepoint data is available, and the behaviour of some threads will be unknown. We choose not to consider this partial data, as it will likely introduce confusion into the debugging process. Figure 5.5b shows how the trace window increases when the buffers can be shared between two threads, and Figure 5.5c shows how it is increased even further for a single buffer shared between all four threads.
Figure 5.5: Example of the benefits of buffer sharing. (a) No buffer sharing; (b) Buffer shared between two threads; (c) Single buffer shared between all four threads. Each subfigure shows the same execution of four threads that encounter tracepoints at different rates. The red arrows show how far back in execution the trace buffer data provides complete information for. Although each scenario uses the same total amount of memory, the more sharing that is possible, the longer the execution trace will be.

Note that the total trace buffer memory in the system is not changing between these scenarios; the only thing changing is the level of sharing.

In the next section we present an enhanced architecture that allows threads to share trace buffers, increasing the average length of execution trace that can be captured.

5.4 Round-Robin Tracepoint Architecture

As described in the previous section, it is desirable to combine the storage from different threads into the same trace buffer. In a system with multiple threads, it is often impossible to predict the temporal relationship between these threads. Each thread is controlled by an FSM, and these FSMs execute independently. If multiple threads share the same buffer, and encounter tracepoints simultaneously, they may produce tracepoint data within a single cycle that exceeds what can be stored by the trace buffer. If this occurs, tracepoint data would be silently lost.
We consider this unacceptable, since it will likely introduce confusion into the debugging process. The following describes our technique to address this problem by identifying threads that can share the same buffer and building custom logic to ensure that data is never lost.

5.4.1 Using Signal Update Frequency to Group Threads

As shown in Figure 5.3b, our approach is to share trace buffers among multiple threads. Determining how the threads should be partitioned among trace buffers is an optimization problem. In this section, we describe our solution to this problem.

We first employ the data within the state graph to determine the maximum tracepoint bandwidth of each thread. For each thread we identify the FSM states for which a tracepoint has been added, and then use Dijkstra's algorithm [143] to determine the shortest possible path through the graph back to the same state, or any other state with a tracepoint. This provides the minimum number of cycles between a thread producing consecutive tracepoint data. We refer to this value as the tracepoint interval, and denote it as I_t for a given thread, t. Figure 5.6 provides an example. Over a long period, it is guaranteed that the fraction of cycles for which a thread will write to a trace buffer will be at most 1/I_t.

Figure 5.6: The state graph is analyzed to determine the tracepoint interval, I_t, which is the minimum number of cycles between a thread encountering two tracepoints.
In this example, the shortest path is between two different tracepoints; however, the shortest path may also be a path back to the same state.

Therefore, multiple threads t_1, t_2, ..., t_n could share a single trace buffer if Equation 5.1 is satisfied.

    Σ_{i=1}^{n} 1/I_i ≤ 1        (5.1)

However, in grouping threads, we instead chose to use the more restrictive constraint shown in Equation 5.2. This is sufficient to guarantee Equation 5.1 is satisfied, while also allowing for simpler arbitration circuitry within the thread group.

    ∀ t_i ∈ {t_1, ..., t_n}: I_i ≥ n        (5.2)

When Equation 5.2 is satisfied, a simple round-robin circuit, shown in the next section, will guarantee that each thread has access to the trace buffer for at least 1/I_t of the cycles (since I_i ≥ n, and a round-robin will give access to each thread every n cycles). In contrast, if the constraint in Equation 5.2 is not enforced, it may be possible to group threads more aggressively, and achieve fewer thread groups; however, more complicated arbitration circuitry would be required.

In Section 5.5 we evaluate the benefit of sharing trace buffers using round-robin arbitration. We also performed this evaluation using the more permissive grouping constraint in Equation 5.1, which provides the maximum amount of grouping without violating the bandwidth properties of the trace memory. While this allows for slightly more buffer sharing versus the more restrictive Equation 5.2, we found that in practice this offered no significant increase to the recorded execution trace, and thus is not worth the more complicated arbitration circuitry. Therefore, we elected to use the constraint in Equation 5.2, which allows for very simple round-robin arbitration circuitry. However, it is possible that for other benchmark circuits, the more aggressive grouping strategy could offer more benefits.

The following greedy algorithm is used to group the threads and achieve the minimum number of thread groups, while satisfying Equation 5.2:

1.
Use Dijkstra’s algorithm to determine the tracepoint interval, It , of each thread.2. Add all threads to a list, and sort by It , ascending.3. Remove the first thread, designated t f .4. Remove the next It f −1 threads, and group with the t f thread.5. If the list is not empty, go back to step Round-Robin CircuitThe grouping algorithm ensures that each thread group does not violate the long-term bandwidth of thebuffer. To satisfy the short-term bandwidth we ensure that data is buffered and recorded into the tracebuffer within It cycles (before the next tracepoint can be encountered). We do this using a round-robincircuit, as shown in Figure 5.7, which arbitrates access to each trace buffer. The circuit includes thefollowing components:FSM & Multiplexer An FSM containing one state per thread is included, where the current state pro-gresses to the next thread every cycle. Based on the FSM state, a multiplexer controls whichthread’s data is forwarded to the trace buffer. This implements the round-robin access to the tracebuffer.1115.4. Round-Robin Tracepoint ArchitectureTime Counters Each thread has a counter, which can count up to n, where n is the number of threadsin the group. When an enable signal is received from a thread, indicating a tracepoint has beenhit, the counter resets to 0. Once the FSM selects the thread for access to the trace buffer, thethread’s counter is subtracted from the global timer value, and the difference is stored in the timefield of the buffer. This allows for the tracepoint data to be recorded out-of-order and re-orderedwhen the data is retrieved.Data Pending Registers (not shown) Each thread has a register which is set when an enable signal isreceived from the thread, and cleared once the FSM reaches the thread’s state. These registers areused to generate an output enable signal, which enables writing to the trace buffer.Tracepoint ID Encoder (not shown) Each thread provides an ID indicating which tracepoint it hasencountered. 
The IDs from each thread are re-encoded into an ID that is unique across all threads in the group.

Figure 5.7 shows a group of three threads sharing a trace buffer. Since the trace buffer is constructed using a dual-ported RAM, two thread groups can actually share a single trace buffer. Each thread group has an independent copy of the round-robin circuit shown in Figure 5.7, meaning that the two thread groups can consist of a different number of threads and still share the same buffer.

5.4.3 Data Buffering

For a round-robin unit with n threads, data from the user circuit must be buffered for up to n cycles until it is written into the trace buffer. In many cases the user circuit already provides this buffering, and no extra registers are required. In cases where the source of the tracepoint data is a register in the user circuit, the state graph can be used, again with Dijkstra's algorithm, to identify the minimum number of cycles until the register will be updated. We anticipate that users will commonly place tracepoints in the states where the variables they are observing are updated. Since the state graph is typically in SSA form, and register sharing is usually avoided due to the high cost of implementing multiplexers in FPGA logic (Section 2.3.2), usually there will only be a single state in which each register is updated, and the signal will not change until the tracepoint is encountered again. This means that for the common case
where the user places a tracepoint that corresponds to an update of a non-shared register, no buffering is required. If state graph analysis determines that it is possible for the signals to change sooner than the tracepoint interval, a register can be added to the inputs of the round-robin unit, enabled when the tracepoint is active.

Figure 5.7: Round-robin tracepoint architecture.

5.4.4 Limitations to Buffer Sharing

There are several further considerations that arise when sharing a buffer among multiple threads. One issue is that each thread may require a buffer of a different width. For example, the tracepoint in one thread may record a 64-bit register, while another thread only records an 8-bit register. This difference of widths can become even worse if a thread has a tracepoint where multiple variables are recorded in the same cycle. Furthermore, when two thread groups share a dual-ported buffer, a size mismatch is also possible. There are two approaches to deal with this issue: 1) ignore the differences in width, and allow for partially used entries in the trace buffer, or 2) only group threads that have the same width. In our testing we found that the former option provided the most efficient use of trace buffer memory, and thus resulted in the longest execution traces.

Another potential limitation of buffer sharing is the number of tracepoints that the user selects for a thread.
The more tracepoints that are selected for one thread, the smaller the tracepoint interval (I_t) will become, and the opportunities for buffer sharing will be reduced.

Finally, the buffer sharing technique is only beneficial if the threads encounter tracepoints at differing rates. This would likely only occur in Multiple Instruction, Multiple Data (MIMD) style designs. For designs where the threads are homogeneous in nature (data-parallel, Single Instruction, Multiple Data (SIMD)), there is often no reason to employ buffer sharing. This may often be the case with OpenMP designs, where a loop is identically split between threads, or pthread-based designs where all threads execute the same function. However, even in these cases, if the control flow is heavily dependent on the data content, tracepoints could be encountered at different rates. The types of designs where buffer sharing helps the most are those where the threads are heterogeneous in nature (task-parallel, MIMD), where each thread is performing a different task. Despite this restriction, we believe that, generally, heterogeneous systems will be more likely to need in-system debug. The non-uniform nature of the system will make issues related to thread synchronization, balancing, starvation, etc. more likely to occur.

Although not explored in this dissertation, there may be different optimizations that would benefit a data-parallel system. For example, if such a system could be classified as a SIMD style design, there may be some benefit from recording only a single instruction trace. However, in our experience, capturing the data flow requires much more trace memory, so optimizations that apply only to the control flow will have minimal overall benefit.

5.5 Evaluation

In this section we evaluate our basic tracepoint architecture from Section 5.3, and the round-robin enhancement from Section 5.4, in terms of the execution traces they provide and the associated area costs.
5.5.1 Experiment Methodology

The benefit provided by the round-robin architecture is dependent on the hit rates of the tracepoints, which depend entirely on where the user chooses to locate the tracepoints. Since it is user-controlled, and highly benchmark dependent, it is difficult to quantify the expected benefit. However, to provide some insight into what a user can expect to observe, we have experimented with randomly adding tracepoints to circuits, simulating the circuit to measure tracepoint hit rates, and using our evaluation metric (described in the next section) to calculate the increase to the recorded execution length.

The benchmarks we used for the experiment are synthetically created from the CHStone benchmark suite [136], and synthesized using the LegUp HLS tool [48], version 4.0, configured with the -O3 optimization flag. To generate a system with n threads we randomly select n items from the CHStone benchmarks (allowing for duplicates), and construct a system where all threads start simultaneously and run to completion. There is no synchronization or communication between the threads in the synthetically created design.

In our experiment we considered systems with 2, 4, 8, 16 and 32 threads, with 1, 2, 4, 8, or 16 tracepoints per thread. Tracepoints were chosen by identifying points in the program where registers (corresponding with source-code variables) were updated, and randomly choosing a subset of these points. For each combination of number of threads and number of tracepoints we performed 1000 trials. In each trial a new circuit was randomly created, and tracepoint locations were randomly chosen. In all experiments the timer size was 32 bits.

For the basic tracepoint architecture, the fraction of memory allocated to each thread was chosen such that each thread would have an identical number of entries in the trace buffer.
For the enhanced round-robin architecture, the memory was allocated such that each thread group would have an identical number of entries. We also experimented with allocating memory such that the number of entries in the group was proportional to the number of threads in the group; however, this provided worse results.

We chose to use synthetic benchmarks because it was challenging to obtain many different parallel HLS designs with heterogeneous threads. Although it is easy to obtain homogeneous systems by just dividing some computational work between multiple threads (e.g. parallelizing matrix multiplication), trace buffer sharing is not very beneficial for such systems, as previously explained in Section 5.4. Although we were unable to find several different heterogeneous applications, we do not believe this indicates a lack of need to debug such systems. On the contrary, with Xilinx and Altera both recently releasing OpenCL-based HLS tools [29, 116], including a focus on data-center usage, we anticipate these complex heterogeneous designs to become more frequent in the future. Furthermore, we expect the complexity and irregular patterns present in such systems will necessitate more frequent debugging.

5.5.2 Evaluation Metric

Our evaluation metric is the length of the execution trace that can be recorded for a multithreaded application. This subsection discusses how this quantity is obtained.

In the basic architecture, where each thread has its own buffer, the length of the execution trace in cycles, c_i, that can be recorded by a thread, i, is calculated using Equation 5.3.

    c_i = e_i / h_i    (5.3)

In this equation e_i is the number of entries in the trace buffer, and h_i is the hit rate, which represents the fraction of cycles in which a tracepoint occurs. If there are multiple threads, each may fill its buffer at a different rate. Once execution is halted, each thread will have captured c_i cycles of execution.
Since we only want to provide the user with complete data, the length of execution history, c, for the full system of n threads is the minimum captured by any thread, min{c_i}.

For the round-robin architecture, multiple threads can share the same trace buffer. Assuming a thread group has a buffer with e_g entries, the length of the execution trace for the group can be calculated using Equation 5.4, where n_g is the number of threads in the group.

    c_g = e_g / (∑_{i=1}^{n_g} h_i)    (5.4)

The length of execution trace for the entire system, c_rr, will be the minimum execution history over all thread groups, min{c_g}. Using these definitions, we can measure the increase to execution length that the round-robin buffer sharing provides using Equation 5.5.

    improvement = c_rr / c = min{c_g} / min{c_i}    (5.5)

5.5.3 Results

Figure 5.8 provides a summary of the results. Each bar in the graph represents a different number of threads and tracepoints. For each of these we performed 1000 trials and measured the impact to execution trace length according to Equation 5.5. The box-and-whisker plot shows the distribution of the different trials. The whiskers show the range from minimum to maximum improvement. The lower box shows the range from the first to the second quartile, and the upper box shows the range from the second to the third quartile.

The results show that on average, the round-robin enhancement provides a significant increase to the length of the recorded execution trace. For example, with eight threads and one tracepoint per thread, the median improvement is 4.1x. As expected, adding more tracepoints per thread reduces this improvement. For the case with eight threads, when each thread has four tracepoints, the median improvement is reduced to 2.2x.

The maximum increase that can be obtained by buffer sharing amongst n threads is a factor of n improvement. This occurs when one thread has a much higher hit rate than the other threads, effectively filling its buffer while no other thread has written data.
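Equations 5.3 to 5.5 can be sketched directly in code. The buffer sizes and hit rates below are illustrative, not taken from any measured benchmark:

```python
def basic_history(entries, hit_rates):
    # Equation 5.3 per thread, then the system minimum min{c_i}:
    # thread i owns entries[i] buffer entries and has hit rate h_i.
    return min(e / h for e, h in zip(entries, hit_rates))

def round_robin_history(groups):
    # Equation 5.4 per group, then the system minimum min{c_g}.
    # Each group is (group_entries, [hit rates of its threads]).
    return min(e_g / sum(rates) for e_g, rates in groups)

# Worked example: four threads, 4096 total entries, with one thread
# encountering tracepoints far more often than the others.
hits = [0.5, 0.1, 0.1, 0.1]
c = basic_history([1024] * 4, hits)          # 1024 / 0.5 = 2048 cycles
c_rr = round_robin_history([(4096, hits)])   # 4096 / 0.8 = 5120 cycles
print(round(c_rr / c, 6))                    # Equation 5.5: 2.5x
```

For a single group of n threads the improvement reduces to n·max{h_i}/∑h_i, so it approaches the ideal factor of n as one thread's hit rate dominates; the ID-field overhead, not modelled here, keeps the real hardware slightly below that bound.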
In such a case, if the threads can share a single buffer, the active thread will have an n times longer buffer to use. Practically, we cannot obtain exactly an n-times improvement because of the ID field in the buffer, which adds overhead to identify which thread is responsible for each entry in the buffer. The results show that close to n improvement is indeed obtained in at least one trial for the 2, 4, and 8 thread systems with few tracepoints. However, as more threads and more tracepoints are introduced, the likelihood of all threads sharing a single buffer is reduced and the maximum observed improvement never reaches n.

On the other end of the spectrum, the theoretical worst-case result from buffer sharing is 1.0x, i.e. no change from the basic approach. This occurs when all threads generate tracepoints at the exact same rate, resulting in all threads filling their buffers evenly with no wasted space.

Figure 5.8: Increase to trace length provided by buffer sharing. (Box-and-whisker plots of the increase to execution trace length, c_rr/c, versus the number of tracepoints per thread (1, 2, 4, 8, 16), for 2, 4, 8, 16 and 32 thread systems; the y-axis spans 0.0x to 14.0x, with one clipped maximum annotated at 19.7x.)

In reality, due to non-ideal overheads, buffer sharing can actually degrade the length of execution trace obtained, as shown by minimums that reach below 1.0x. One reason this occurs is the additional overhead of adding the ID field, although this has only marginal impact. The larger culprit is the fact that different threads have different widths of data to be recorded, as discussed in Section 5.4.4. By restricting thread groupings to have the same buffer width this overhead can be eliminated; however, the average case improvement is reduced, which is why we decided against this approach.

Our synthetic benchmarks consist of joining together multiple CHStone benchmarks to execute in parallel.
However, each of these benchmarks has a different execution length, meaning that in most cases some threads will finish executing before others. This is a significant contributor to varying tracepoint hit rates between the threads, and is one reason for the success of the round-robin buffer sharing. To remove this factor we also performed trials where the hit rates of each benchmark were scaled up to represent the case where all threads execute for the same duration. In this case the median benefit for four threads with one tracepoint dropped from 2.7x to 2.4x, and for eight threads from 4.0x to 3.2x. Similar decreases were observed for other system configurations. Ultimately we chose to leave the varying execution durations in place as we believe it more accurately reflects the type of system a user may need to debug. Poor workload balancing, thread starvation, or just the inherent design of the system may result in cases where some threads stall for extended periods of time.

Another consideration is that our synthetic benchmarks do not have any communication or synchronization between threads. However, as can be seen from Equations 5.3 to 5.5, the only factor in the design behaviour that affects the improvement to trace length is the relative rates at which each thread encounters tracepoints. While real parallel designs will contain thread communication and synchronization, we do not expect that to change the fact that different threads encounter tracepoints at different rates, leading to buffer sharing benefits.

We also tested several homogeneous parallel systems, including parallelized versions of matrix multiplication, Mandelbrot and Black-Scholes. However, for these cases the threads perform the same computational work, and control flow is independent of data content. Assuming tracepoints are put in the same location for each thread, the hit rates between threads will be identical and buffer sharing will offer no improvement.
Our experiments with these circuits verified this behaviour. If each thread had tracepoints in different locations, buffer sharing could provide some benefit; however, since all threads are identical, it seems unlikely that a user would place tracepoints in such a manner.

5.5.4 Area Overhead and Timing Considerations

Table 5.1 provides the area overhead of adding the basic tracepointing logic from Figure 5.4. These area values are calculated by synthesizing a Verilog implementation of the logic using Altera Quartus II, version 15.0. The majority of this logic is the multiplexing required to feed the data from the tracepoints into the thread's trace buffer. For this reason, the area scales up with the tracepoint data width and number of tracepoints.

                     Area (Stratix IV ALUTs)
                          # Tracepoints
    Data Width     1     2     4     8    16
        16         0    17    19    53    90
        32         0    33    35   101   170
        64         0    65    67   197   330

Table 5.1: Per-thread area cost of tracepointing logic (Figure 5.3).

If the round-robin enhancement is added as shown in Figure 5.7, additional area is required per thread group. Table 5.2 provides the area of this additional logic. A large portion of the logic is to handle multiplexing the signals from each thread in the group, and increases with the width of the tracepoint data. The data provided in the chart is assuming there are four tracepoints per thread. Varying the number of tracepoints from 1 to 16 changes the size by only ±5% (it only affects the logic needed to generate the ID entry in the buffer).

                            Area (Stratix IV ALUTs)
                                  # Threads
    Data Width                2     4     8    16    32
        16    Total          58    80   173   338   739
              (per thread)   29    20    22    21    23
        32    Total          74    96   238   467  1010
              (per thread)   37    24    30    29    32
        64    Total         106   128   365   718  1552
              (per thread)   53    32    46    45    49

Table 5.2: Round-robin unit area cost (Figure 5.7).

Based on these results, assuming a round-robin style architecture with 8 threads and 4 32-bit tracepoints per thread, the area overhead is 65 Adaptive Look-up Tables (ALUTs) per thread. This does not include buffering registers, as they were not required for our experiment (see Section 5.4.3). If buffering is necessary, each thread will require a set of registers equal to the tracepoint data width. In addition, these area values do not include the logic surrounding the trace buffers, which is common to all trace-based debugging systems. This includes a counter to increment the address of each buffer, and logic to externally retrieve the data, including a UART core (or similar) for communication.

When multiple threads share a single trace buffer, timing issues can arise. Routing signals from multiple threads to one buffer, especially if the threads are not physically adjacent, has the potential to violate the user's timing requirements. In such a case additional registers can be added to pipeline the signals, as shown in Figure 5.9. To handle this delay in data, the input to the data pending registers in the round-robin unit must be pipelined by the same degree, and the extra cycles must be subtracted from the time entry in the buffer.

Figure 5.9: Buffer sharing between threads in different locations of the FPGA. Pipeline registers are inserted to prevent the long paths from becoming the critical path.

5.6 Summary

This chapter presented a framework for debugging HLS circuits produced from multithreaded software. In such systems, complicated non-deterministic bugs may necessitate observing the system for long execution times in order to locate the source of the bug.
In these cases it is desirable to trade off observability for execution duration, that is, record only some of the program control and data flow, in order to observe execution for longer periods of time.

In this chapter we presented an approach to this problem: we use user-selected tracepoints, a concept already used in parallel software debuggers, to allow the user to identify key points in the program they wish to observe. Lightweight debug instrumentation, which is added to the user's circuit at compile time, logs the tracepoint data into on-chip memories, which can be retrieved after the program is halted, and provided to the user as a timeline of events.

We also presented a novel technique to share trace buffer memory resources between multiple threads in a system. The technique involves analyzing the scheduled software program to determine how frequently different threads will require access to the trace buffers, and using this information to determine suitable groupings of threads that can share a single trace buffer. This allows for better memory utilization, and increases the length of execution that can be captured. We demonstrated that for a system of eight threads, with one tracepoint each, the execution trace length is increased by a factor of four.

Chapter 6

Quantifying Observability

The contributions in this dissertation, as well as related work discussed in Section 2.4.1, have explored several different techniques to gain observability into executing HLS circuits. For example, in [52], the authors use an ELA tool to monitor user-selected variables during execution. In Chapters 3 and 4 we present a framework which uses information from within the HLS flow to build specialized debug circuitry which records variables only when they are updated, using much less memory and allowing for longer execution traces to be captured.
In Chapter 5 we describe a tracepoint-based framework that requires even less memory, as it only records select points in the program execution, rather than observing the entire control and data flow. Other researchers have investigated adding debug instrumentation to the C source program [54–56] instead of modifying the RTL circuit.

All of these works are aimed at providing at-speed, in-system debugging, and thus use on-chip memory resources to record selected signals while the circuit executes at speed. The captured data is retrieved, and used to provide the user with some insight into what occurred during the circuit execution. However, all of these techniques face the same challenge: due to limited on-chip resources only a small fraction of the circuit execution can be captured. This means designers must find the root cause of a bug, while only being able to see a small piece of the full picture, making in-system debugging a challenging and time consuming process.

(Parts of this chapter were submitted to a conference [7].)

Successful debugging solutions will be those that make efficient use of resources, in order to give designers the most insight into the system behaviour, making it easier and faster to locate bugs. To develop such systems, it is essential that there be a consistent method to quantify the observability into an HLS circuit. This allows researchers to measure the effectiveness of their approach, understand circumstances where it is most effective, and compare and contrast it against other techniques. For example, in Chapter 5 we argued that parallel systems, which are more likely to contain non-deterministic bugs, would require debugging solutions that offer "big picture" observability into the system over longer periods of time. Understanding the relative merits of different techniques is crucial in determining how HLS debugging problems should be tackled, and what future technologies are needed.
Ideally, this work would lead to debugging solutions that choose the optimal observation technique based on the user's design, or the type of bug being investigated.

In this chapter we present a metric for measuring observability into an executing HLS circuit. This metric measures the likelihood that variable values are available to a user as they step through the program execution, taking into account differences in variable importance, and the duration of execution trace that can be captured. Although there have been metrics developed to measure observability within general purpose digital circuits [83, 90], they are designed to measure observability at the RTL level. In contrast, users who design and optimize HLS circuits operate at the software abstraction level, and primarily care about whether they can obtain the values of their software variables, not flip-flops or logic gates in the resulting circuit.

To demonstrate the utility of our metric, we measure the observability provided by recent HLS debugging techniques, and explore how signal-tracing decisions affect the availability of variable values, as well as the duration of execution that can be captured. In addition, we also use the metric to study the case where not all source code variables are recorded, and explore how different variable selection methods affect the observability.

This chapter is organized as follows: Section 6.1 presents our observability metric, describing how it incorporates the availability of variable values, their relative importance, as well as the duration of time for which they can be observed. The next sections demonstrate how the metric can be used to evaluate the observability of different HLS debugging choices. Section 6.2 uses the metric to study differences between recent HLS debugging techniques, and Section 6.3 studies how tracing a subset of the source code variables affects observability.
Section 6.4 summarizes the chapter.

6.1 The Observability Metric

This section describes our metric to measure the level of observability into an executing HLS circuit. The metric is designed to evaluate the observability provided to a user while progressing sequentially through the execution of an HLS program. This scenario would be applicable if a designer was single-stepping through the program execution, as would be common in interactive debuggers, such as those in [1, 52].

The metric is designed to measure the percentage of all variable accesses (reads or writes) for which the variable value is known. Thus, if a user was single-stepping through a program, it would reflect the likelihood that the values of variables accessed at a given point in time are available to the user. The metric is also designed to reflect the fact that not all variables are equal; some may be more important than others when trying to understand and debug the system.

Our metric for observability takes into account both the likelihood that a variable value is known (availability), as well as the duration of execution for which this availability is provided to the user. In this section we first define availability, then show how it is incorporated into the full metric for observability.

6.1.1 Availability

We define availability as the fraction of variable accesses for which the value is known, weighted by variable importance, as defined in Equation 6.1.

    Availability (A) = (∑_{i∈vars} f_i · v_i) / (∑_{i∈vars} f_i · a_i)    (6.1)

The a_i term is the number of times a variable is accessed (read or written), and v_i represents the number of those times for which the value is known. The f_i terms are the favourability coefficients, to indicate the relative importance of each variable.
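Equation 6.1 translates directly into code. The following is an illustrative sketch with hypothetical access counts; it is not tied to any particular benchmark:

```python
def availability(accesses, known, weights):
    # Equation 6.1: weighted fraction of variable accesses (a_i) for
    # which the value was known (v_i), with favourability weights f_i.
    num = sum(weights[v] * known[v] for v in accesses)
    den = sum(weights[v] * accesses[v] for v in accesses)
    return num / den

# Hypothetical trace window in which variable B's value is never known.
a = {'A': 3, 'B': 2, 'C': 4}        # a_i: total accesses per variable
v = {'A': 2, 'B': 0, 'C': 4}        # v_i: accesses with a known value
f = {'A': 1, 'B': 1, 'C': 1}        # f_i: uniform importance
print(availability(a, v, f))        # 6 of 9 accesses known
```

Setting f_B = 0 would mark B as irrelevant, raising the availability from 6/9 to 6/7, since B's unknown accesses no longer count against the metric.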
These coefficients can be used to indicate that certain variables are critical to understanding the program behaviour, while others may be less important, or irrelevant. These coefficients could be determined by user feedback, by program analysis, or could be set uniformly to have no effect.

6.1.2 Observability

We define the observability as the product of the availability, and the duration of observed execution, t, shown in Equation 6.3. This reflects that capturing a longer execution provides the designer with more insight into the circuit execution.

    Observability = Availability · Time    (6.2)
                  = A · t                  (6.3)

Observability for Signal-Tracing Systems

In the remainder of this chapter we measure the observability provided by various signal-tracing approaches. In such systems, the duration of execution that can be observed is proportional to the amount of memory that the user chooses to allocate to signal tracing. To remove this dependence on user design choice, we use a modified version of the metric, as shown in Equation 6.4, which measures observability as a function of the memory the user allocates to signal tracing.

    Observability per Kb = A · e_tb    (6.4)

This equation factors out the size of the trace memory by replacing t with e_tb, where e_tb is the memory efficiency of the trace buffer, defined as the average number of cycles captured per bit of memory. For convenience, we use the units of cycles/Kb when measuring e_tb. In the remainder of the chapter we measure the observability of various systems; to be concise we use the term observability, but in all cases we are actually referring to observability per Kb (Equation 6.4).

6.2 Comparing Observability of HLS Signal-Tracing Techniques

In this section we use the metric to characterize the levels of observability provided by different in-system HLS debug techniques. Although the techniques we explore in this section are designed to
observe all source code variables, we show that the actual availability of variable values is not always 100%, and observability varies with the signal-tracing technique that is used. We explore four different signal-tracing techniques: the first is from existing work in [52], the second from our work in Chapters 3 and 4, and the remaining two are proposed modifications to our work, which offer trade-offs for observability.

This section begins by describing the different signal-tracing techniques, then explains the methodology in measuring the observability, and ends with an analysis of the results.

6.2.1 Signal-Tracing Approaches

In this experiment we explore four different approaches to signal-tracing:

Technique 1: Embedded Logic Analyzer

Embedded Logic Analyzers (ELAs), such as Altera's SignalTap II [45] or Xilinx's ChipScope Pro [46], can be used to observe signals within the circuit while it executes. For example, the Inspect HLS debugger [52] uses SignalTap II to record signals that correspond to variables in the C source code. ELA tools, which are designed to work with any digital circuit, have no knowledge of the behaviour of the circuit being observed, and so they work by simply recording all selected signal values for every cycle of execution, as shown in Figure 4.1.

By recording the signal values at every cycle of execution, the values of the variables will be known throughout the entire portion of captured execution.
However, such an approach will surely result in data duplication, since it is common for variable values to be unchanged for many cycles.

Technique 2: Record Updates Only

To minimize this data duplication, some techniques, such as the work by Monson and Hutchings [53], as well as our debugging framework presented in Chapters 3 and 4, will only record variable updates into the trace buffers.

Unlike the ELA approach that always records the same signals each cycle, the approach of only recording variable updates necessitates a dynamic signal-tracing architecture, that is, one that permits changing which signals are recorded cycle by cycle. We presented such an architecture in Section 4.4, where the HLS scheduling information was used to determine the FSM states where each variable was updated, and specialized logic was built to multiplex the updated variables into the trace memory each cycle. This architecture was illustrated in Figure 4.4.

One issue with this approach is that it does not actually provide complete availability of all variable values within the execution trace. One might expect that recording the value of all variable updates is sufficient to determine the value of a variable at any point in a program execution; however, this is not the case. When taking into account the fact that trace buffers can only capture a portion of the entire execution, there may be cases where a variable is read within the trace window, but the update occurred before the trace window. In this case, the value of the variable is not stored in the trace buffer, meaning its value is unknown during debugging. This is illustrated in the example in Figure 6.1, which shows a sample execution trace for three variables, A, B, and C, between cycles 150 and 320.

Figure 6.1: Example of captured execution trace with variable reads and writes. (The figure plots the read (R) and write (W) accesses to variables A, B, and C against time in cycles, with the point where the circuit halted marked at the end of the window.)
Since only variable writes are recorded in the trace buffer, the value of A prior to cycle 270 will be unknown during debugging, and the value of B throughout the entire trace will be unknown during debugging. Calculating the availability (Equation 6.1) within this example execution trace results in A = 7/9, or 78% (since the variable values are known at seven of the nine read or write accesses). Thus, even when recording the updates of all variables, the availability of variable values may be less than 100%.

The impact of this effect was not explored in our work in Chapters 3 and 4, or by the authors in [53]. Therefore, in this chapter we perform experiments which use our metric to quantify the impact of this effect on observability.

How often this issue occurs depends on both the behaviour of the benchmark program, and the depth of the trace buffer. Recall that the trace buffers are configured in a circular fashion, always evicting the oldest data. Thus, for an architecture where only variable updates are traced, if many other variables are updated before the variable is read, it will be more likely that the recorded update will have been evicted from the trace buffer. Thus, programs with large gaps between when variables are updated and when they are read will be more susceptible to this issue. Correspondingly, the more memory a user allocates to signal-tracing, resulting in a deeper trace buffer, the less likely this problem is to occur. In the experiments in this section we explore several different benchmarks, while varying the trace buffer depth, in order to understand when this issue occurs.

Technique 3: Record Updates + Memory Reads

One solution to this issue is to also record the value of variables into the buffers when the variable is read. This approach will consume much more trace memory as the program executes, meaning that a shorter execution trace will be captured.
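The trade-off between recording only updates and also recording reads can be illustrated with a small simulation. This is a deliberately simplified model, with a hypothetical event trace and a single circular buffer of (cycle, variable) records; the real architecture described in Chapter 4 is more involved:

```python
def window_stats(trace, depth, record_reads=False):
    """trace: (cycle, variable, 'R' or 'W') events in program order.
    A circular buffer keeps the newest `depth` records; update-only
    tracing records writes, read recording stores reads as well.
    Returns (availability within the window, window length in cycles)."""
    recorded = [(c, v) for c, v, k in trace
                if k == 'W' or (record_reads and k == 'R')]
    kept = recorded[-depth:]            # older records were evicted
    if not kept:
        return 0.0, 0
    start = kept[0][0]                  # oldest cycle still observable
    window = [(c, v, k) for c, v, k in trace if c >= start]
    known = sum(1 for c, v, k in window
                if k == 'W'
                or (record_reads and k == 'R')
                or any(wv == v and wc <= c for wc, wv in kept))
    return known / len(window), trace[-1][0] - start + 1

events = [(1, 'B', 'W'), (2, 'A', 'W'), (3, 'C', 'W'), (4, 'C', 'R'),
          (5, 'A', 'R'), (6, 'B', 'R'), (7, 'A', 'W'), (8, 'C', 'W')]
print(window_stats(events, depth=4))                     # updates only
print(window_stats(events, depth=4, record_reads=True))  # updates + reads
```

For this trace, update-only recording yields an availability of 6/7 over a 7-cycle window (B's only write has been evicted, so its read is unknown), while also recording reads yields 100% availability over a window of only 4 cycles — the availability/duration trade-off discussed above.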
Although this may reduce the total observability according to our metric (Equation 6.4), it should increase availability during the trace window.

As described in Section 2.3.3, HLS tools typically compile the source code to low-level IR code before synthesis to HDL. Within the IR, registers are typically used for local variables, and to store values that will have limited lifetime, whereas memories are used for global variables, arrays and values that are accessed over larger portions of the program. This pattern leads us to believe that variables in memories will usually have longer gaps between the read of a value and the previous update to the value, and will be more likely to exceed the capacity of the trace buffer. For this reason, we first consider the case where we also record the value of variables as they are read, but only for those variables stored in memories.

Technique 4: Record Updates + Reads

In the final technique we record all variable reads, including variables in registers. This will provide 100% availability within the trace buffer, but will consume memory faster, resulting in a shorter trace.

6.2.2 Measurement Methodology

To measure the observability provided by the different signal-tracing approaches, we synthesize the CHStone benchmark suite [136] using the LegUp HLS tool [48]. Table 6.1 provides details of the CHStone programs when compiled using LegUp 4.0. The n column indicates the number of source code variables that are mapped to entities in the hardware circuit, and the n_regs and n_mem columns give a breakdown of variables residing in registers and those assigned to memory. Note that this does not cover all source code variables, since on average 33% of variables are optimized away by LLVM or reduced to constants/ROMs, and are not included in this study. Recording all of the variables mapped to hardware requires tracing, on average, 1826 signal bits.
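For the ELA case, these traced widths determine the capture duration directly via Equation 6.5. The following sketch assumes 1 Kb = 1000 bits, which appears consistent with the per-circuit ELA durations reported in Table 6.1 (e.g. sha, with ∑w_i = 563, gives roughly 1.78 cyl/Kb):

```python
def ela_etb_per_kb(total_traced_bits, bits_per_kb=1000):
    # Equation 6.5: an ELA stores every traced bit on every cycle, so
    # it captures 1 / sum(w_i) cycles per bit of trace memory.
    return bits_per_kb / total_traced_bits

def observability_per_kb(availability, etb_cycles_per_kb):
    # Equation 6.4: availability multiplied by memory efficiency.
    return availability * etb_cycles_per_kb

# sha traces 563 bits in total; ELA availability is 100% (A = 1.0).
print(round(observability_per_kb(1.0, ela_etb_per_kb(563)), 2))  # 1.78
```

The same helper reproduces the geometric-mean case: with the average of 1826 traced bits, the duration comes out to about 0.55 cyl/Kb, matching the ELA column of Table 6.1.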
Of these bits, 96–192 are attributed to tracing the global memory controller signals, and the remaining are driven by datapath registers (Section 3.2.5 provides additional details on the benchmark circuits). In the experiments in this section we assume that variables have uniform importance (∀i ∈ vars, f_i = 1).

For Technique 1 (using an ELA) the availability can be calculated as the fraction of variables that are traced, weighted by their favourability coefficients. Since the experiments in this section assume all variables are traced, the availability for the ELA case will always be 100%. For this technique, the memory efficiency of the tracing architecture (e_tb), which measures the number of cycles captured per bit of memory, can be calculated as the inverse of the number of bits that need to be traced each cycle. This is shown in Equation 6.5.

    e_tb = 1 / ∑w_i    (6.5)

For Techniques 2–4 we use our debug architecture described in Chapters 3 and 4, which is prototyped and included in LegUp 4.0. We simulate each of the CHStone benchmarks using Modelsim to obtain an execution trace, and use this trace to simulate the filling of trace buffers for various buffer depths. To measure the availability (Equation 6.1) we sweep the captured execution interval from the start of the program until the end, in increments of one cycle, measure the availability during the interval, and calculate the average. To determine the memory efficiency, e_tb, we count the total number of cycles of execution, and divide by the total number of bits of data recorded in the trace buffer.

Figure 6.2: Variable availability for Techniques 2 and 3. (The four panels plot availability against trace buffer depths from 10^0 to 10^6: (a) Technique 2, registers only; (b) Technique 2, memory only; (c) Technique 2, all variables; (d) Technique 3, all variables.)

6.2.3 Results and Analysis

Table 6.1 contains the observability results for the four techniques.

Technique 1: Embedded Logic Analyzer

The average observability (Equation 6.3) of the ELA approach is O1 = 100% · 0.55 cyl/Kb. This means that within the captured execution trace 100% of variable values are known; however, the duration of capture is only 0.55 cycles per Kb of memory that the user allocates to signal-tracing. The ELA technique can be characterized as one that provides a high level of availability, but with a very short duration.

Technique 2: Record Updates Only

For the second technique, which is our debug framework that captures only variable updates, the availability of variable values within the trace buffer is plotted in Figures 6.2a to 6.2c. To illustrate the different behaviour of variables in registers versus those in memory, we have plotted them separately; Figure 6.2a shows the availability of variables in registers, Figure 6.2b shows variables in memory, and Figure 6.2c shows the combined data.

Considering only those variables located in registers, most benchmarks reach close to 100% availability at a trace buffer depth of 10^2, while jpeg, which has a much longer run time, requires a depth of 10^3. The blowfish benchmark is an outlier, and has some accesses that require a depth of 10^5. The reason for this is that it has an array initialized at the start of the program and accessed near the end. While normally this array would reside in memory, it is only eight entries long, and LLVM replaces it with eight registers.

For the variables located in memory, shown in Figure 6.2b, much deeper trace buffers are required to achieve high levels of availability.
In many cases memory entries are written near the beginning of a program, and then read much later in the program, requiring a trace buffer large enough to capture the entire execution. This makes the approach of recording only updates somewhat impractical for variables in memory. This is especially true for longer running programs; in fact, the CHStone benchmarks all have very short run-times, completing execution in less than a second. For longer running benchmarks the availability would likely be even lower than shown here.

Figure 6.2c shows the availability when considering all variables, both those in registers, as well as those in memories. Our architecture (shown in Figure 3.2) contains separate trace buffers for variables in registers and memory. The irregular curves on the graph occur because of discretization effects when buffer entries are split between the trace buffers, and this is most prominent for very low trace buffer depths.

To calculate the observability we multiply the availability (assuming a trace buffer depth of 10^3, which corresponds to about 100Kb of memory), with the measured memory efficiency (e_tb). This provides an average observability for Technique 2 of O2 = 94% · 22.0 cyl/Kb, which is a 38x improvement over the ELA approach. Per-circuit results are provided in Table 6.1.
Although this technique provides the highest observability, the availability is only 94%, meaning that 6% of variable accesses will be unavailable to the user.

Benchmark   Cycles    n    n_regs  n_mem  Σw_i   O1 (ELA)             O2 (Chapter 4)       O3 (+mem reads)      O4 (+reg reads)
adpcm       14496     149  129     20     3377   100% · 0.30 cyl/Kb    97% · 14.6 cyl/Kb    99% ·  8.1 cyl/Kb   100% ·  6.4 cyl/Kb
aes         9487      32   18      14     1140   100% · 0.88 cyl/Kb    98% · 16.4 cyl/Kb   100% ·  7.6 cyl/Kb   100% ·  6.9 cyl/Kb
blowfish    181229    28   15      13     1568   100% · 0.64 cyl/Kb    62% · 23.7 cyl/Kb    98% · 10.1 cyl/Kb   100% ·  9.2 cyl/Kb
dfadd       7728      109  108     1      2661   100% · 0.38 cyl/Kb   100% · 12.6 cyl/Kb   100% · 10.8 cyl/Kb   100% ·  3.5 cyl/Kb
dfdiv       39188     99   99      0      2953   100% · 0.34 cyl/Kb   100% · 53.1 cyl/Kb   100% · 29.5 cyl/Kb   100% · 10.7 cyl/Kb
dfmul       5268      59   59      0      1673   100% · 0.60 cyl/Kb   100% · 14.0 cyl/Kb   100% ·  6.9 cyl/Kb   100% ·  3.0 cyl/Kb
dfsin       59138     264  263     1      6907   100% · 0.14 cyl/Kb   100% · 31.2 cyl/Kb   100% · 26.1 cyl/Kb   100% · 12.1 cyl/Kb
gsm         29315     56   52      4      1602   100% · 0.62 cyl/Kb    90% · 21.0 cyl/Kb   100% · 10.1 cyl/Kb   100% ·  6.5 cyl/Kb
jpeg        1320736   204  162     42     4483   100% · 0.22 cyl/Kb    88% · 35.1 cyl/Kb    97% · 15.9 cyl/Kb   100% · 15.3 cyl/Kb
mips        6229      14   12      2      754    100% · 1.33 cyl/Kb   100% · 15.7 cyl/Kb   100% ·  8.0 cyl/Kb   100% ·  4.8 cyl/Kb
motion      8268      28   24      4      823    100% · 1.22 cyl/Kb   100% · 27.0 cyl/Kb   100% · 17.8 cyl/Kb   100% · 16.8 cyl/Kb
sha         209415    15   9       6      563    100% · 1.78 cyl/Kb    99% · 24.6 cyl/Kb   100% · 11.3 cyl/Kb   100% · 10.5 cyl/Kb
GeoMean     31575     58   47      6      1826   100% · 0.55 cyl/Kb    94% · 22.0 cyl/Kb    99% · 12.0 cyl/Kb   100% ·  7.7 cyl/Kb
(10kB Mem)                                       100% · 0.55 cyl/Kb    88% · 22.0 cyl/Kb    98% · 12.0 cyl/Kb   100% ·  7.7 cyl/Kb

Table 6.1: Observability of signal-trace techniques. The per-circuit breakdown and GeoMean rows are for the case where there is 100Kb of trace memory. The final row of the table lists the geometric mean for the case of 10Kb of trace memory.
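The GeoMean row of Table 6.1 can be reproduced from the per-circuit columns. The sketch below recomputes the Technique 2 geometric means (availability and duration) from the table values; the dictionary layout is ours, for illustration.

```python
import math

# Per-circuit Technique 2 (record updates only) results from Table 6.1:
# (availability %, duration in cycles per Kb), with 100Kb of trace memory.
technique2 = {
    "adpcm":  (97, 14.6),  "aes":    (98, 16.4),  "blowfish": (62, 23.7),
    "dfadd":  (100, 12.6), "dfdiv":  (100, 53.1), "dfmul":    (100, 14.0),
    "dfsin":  (100, 31.2), "gsm":    (90, 21.0),  "jpeg":     (88, 35.1),
    "mips":   (100, 15.7), "motion": (100, 27.0), "sha":      (99, 24.6),
}

def geomean(values):
    """Geometric mean, as used for the GeoMean row of Table 6.1."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

avail_gm = geomean([a for a, _ in technique2.values()])     # ~94%
duration_gm = geomean([d for _, d in technique2.values()])  # ~22.0 cyl/Kb
```

Running this recovers the reported geometric means of 94% availability and 22.0 cyl/Kb duration.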
Table 6.1 also shows how the observability changes for the case where trace memory is very limited; this is shown in the last row, which lists the geometric mean observability for 10Kb of trace memory. In this case the availability drops to 88%, doubling the number of unavailable variable accesses.

These results show the importance of a standardized observability metric when comparing between signal-tracing techniques. Earlier in this dissertation, in Section 4.8, we showed that our signal-tracing architecture provides a 40x longer execution trace than an ELA tool¹. Although this is true, when we use our metric to measure observability, it becomes more apparent that this benefit comes at a price. Depending on how much trace memory is available, our architecture, limited by the approach of only recording variable updates, will fail to provide the user with variable values in about 6–12% of variable accesses.

Technique 3: Record Updates + Memory Reads

When memory reads are also recorded in the trace buffer, availability increases dramatically. Figure 6.2d plots the availability for this case, which is noticeably better when compared to Figure 6.2c. Although recording memory reads provides a higher availability, this comes at a price, and the duration that can be captured is reduced significantly from 22.0 to 12.0 cyl/Kb. This results in an average observability of O3 = 99% · 12.0 cyl/Kb, which is 24x higher than the ELA approach.

Technique 4: Record Updates + Reads

When all variable reads are recorded the availability can be further increased, now attaining 100% availability, but again at the cost of a decrease in duration. The resulting average observability is O4 = 100% · 7.7 cyl/Kb, a 14x improvement over the ELA approach. One may assume that this approach would always be preferable to the ELA approach (Technique 1), since it also provides 100% availability, but for a longer duration.
However, as explored in Chapter 4, changing which signals are recorded each cycle requires more area than an ELA because of the additional multiplexing logic (shown in Figure 4.4).

In summary, if the user wishes to maximize the observability, the approach of recording only updates should be used. If it is unacceptable to have unavailable variable accesses, the user may sacrifice some duration by also recording memory reads and/or register reads. The ELA case is only preferable when the user has no spare logic resources to perform the multiplexing required by a dynamic tracing approach.

¹ In Section 4.8 we showed that this 40x could be increased to 127x using signal restoration and alternative options to record updates to variables in memory. However, in this chapter, the experiments are done with our debugging architecture without those additional options enabled; thus we report the trace duration improvement as 40x versus an ELA.

6.3 Quantifying Observability of Variable Selection Methods

In this section we use the metric to explore debugging methods that only observe a subset of the source code variables. We experiment with different methods to select which variables to observe, and measure their impact on observability. The objective of this section is not to determine an optimal variable selection algorithm, or to prove that a certain algorithm will be best for debugging; rather, it shows how our observability metric can be used to gain understanding into different signal-tracing choices.

6.3.1 Motivations

There are two main scenarios that would motivate a choice to record a subset of the source code variables. The first may be that a designer wishes to trade off variable availability for duration; by recording fewer signals in trace buffers, a longer execution trace can be captured.
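The intuition that tracing fewer, or narrower, variables lengthens the captured window can be sketched with a simple fill-rate model. This is our own illustrative model, not the memory-efficiency calculation used in the experiments, and the variable widths and update rates below are invented.

```python
# Simple model of trace duration: a trace buffer of a given size fills at a
# rate equal to the total bits written per cycle by the traced variables.
# (Illustrative only -- real efficiency depends on the trace architecture.)
def trace_duration_cycles(buffer_kb, traced_vars):
    """traced_vars: list of (bit_width, avg_updates_per_cycle) tuples."""
    bits_per_cycle = sum(width * rate for width, rate in traced_vars)
    return (buffer_kb * 1024) / bits_per_cycle

# Hypothetical design: trace all four variables, or only the two narrow ones.
all_vars = [(32, 0.50), (32, 0.25), (8, 0.50), (1, 1.00)]
narrow_vars = [(8, 0.50), (1, 1.00)]

full_trace = trace_duration_cycles(100, all_vars)        # shorter window
partial_trace = trace_duration_cycles(100, narrow_vars)  # longer window
```

In this toy example, dropping the two wide variables stretches the captured window from about 3.5K cycles to about 20K cycles for the same 100Kb buffer.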
This may be helpful if the designer knows the general location of a bug in the code, such as which function it is contained in, and wishes to observe that portion for as long as possible. It may also be useful if the designer simply wants to understand the "big picture" of what is happening over a longer period of time, and is willing to sacrifice knowing the value of all variables. The other scenario where partial-tracing would be applicable is if the user has limited logic or routing resources available for debugging instrumentation, as the more variables that are fed into the trace buffer, the larger the access network becomes.

In this section we perform experiments to measure the effect on observability when different variables are selected to be traced. One may suppose that such a result would be straightforward to obtain, and if we record, for example, 50% of the variables, it would require 50% of the memory, allowing us to record execution for twice as long. In reality this is not the case, as the cost of observing variables in the hardware is not uniform across all source code variables. There are a number of reasons for this: first, variables may be different bit widths, not only because of varied data types in the source code, but because of bit width minimization performed by the HLS tool [34]. Second, variables are accessed at different rates, and those that are accessed more often will have a greater influence on the observability. Third, there is not a one-to-one relationship between source code variable and hardware signal, but rather a many-to-many relationship. Each variable in the C code may map to multiple different entities in the hardware circuit. For example, when the source code is transformed to IR in SSA form, each source code update of a variable will produce a unique variable in the IR, and usually a unique signal in the hardware. This means that to observe one C variable, many hardware signals may need to be traced. The opposite may also be true, where many C variables map to the same hardware signal. One example where this occurs is function parameters; if a single value is passed through multiple function calls, there would be a C variable for each layer in the call stack, whereas the hardware may contain only a single signal.

6.3.2 Variable Selection Methods

Method                        Description
Uniform Variable Importance
1. Random Selection           Variables are randomly selected; ten trials are performed and the average result is reported.
2. Reads + Writes             Static analysis of the program is performed; the variables with the most reads + writes are selected.
3. Reads + Writes Dynamic     Dynamic analysis is performed via Modelsim simulation. The variables with the most reads + writes are selected.
4. Reads / Writes             Static analysis of the program is performed; the variables with the most reads per write are selected.
5. 1/Width                    For each variable, the bit widths of all of its hardware signals are summed. Variables with fewest bits are selected.
6. (R/W)/Width                Product of the previous two ranking systems.
Centrality-Based Variable Importance
7. Random Selection           Same as #1 above.
8. Importance                 Variables of highest importance (closeness centrality) are selected.
9. (R/W)/Width                Same as #6 above.

Table 6.2: Selection methods for partial variable tracing.

Table 6.2 describes the different selection methods we test. Methods 1-6 are evaluated assuming variables have uniform importance (∀i ∈ vars, f_i = 1). However, for a designer gaining insight into the execution of a design, not all variables will be equally useful. For this reason, we also model the case where variables have non-uniform importance.
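As a concrete sketch, selection method 6 from Table 6.2 ((R/W)/Width) can be implemented as a ranking over per-variable statistics. The variable names and access counts below are hypothetical, invented for illustration.

```python
# Sketch of selection method 6 from Table 6.2: rank variables by
# (reads per write) divided by total hardware bit width, and trace the
# top fraction. Variable names and statistics are invented.
def select_variables(stats, fraction):
    """stats: {name: (reads, writes, total_bits)}; returns selected names."""
    def score(item):
        reads, writes, bits = item[1]
        return (reads / max(writes, 1)) / bits
    ranked = sorted(stats.items(), key=score, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return [name for name, _ in ranked[:k]]

stats = {
    "loop_i":   (120, 40, 8),   # score (120/40)/8  = 0.375
    "state":    (300, 10, 32),  # score (300/10)/32 = 0.9375
    "checksum": (10, 10, 32),   # score (10/10)/32  = 0.03125
    "flag":     (50, 50, 1),    # score (50/50)/1   = 1.0
}
top_half = select_variables(stats, 0.5)  # flag and state rank highest
```

Narrow, read-heavy variables win under this ranking, which is why the (R/W)/Width selector favours memory efficiency over raw availability.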
Many different methods could be used to decide the relative importance of each variable; feedback could be obtained from the user, or program analysis could try to determine which variables are most important in understanding the system. There have been many methods proposed to determine the best signals to trace in a digital circuit [83, 90, 96]. Although these apply to RTL-level circuits, similar rankings could be done in the software domain. We choose to apply a technique proposed in [90], which is to use graph centrality to determine which variables have the most influence in a system. In our case we use closeness centrality [144] on the FSM state graph, which gives a measure of how far each state is from all other states. The centrality, C(x), of a node is calculated using Equation 6.6, where d(x,y) is the hop distance from node x to node y in the state graph.

C(x) = ∑_{y≠x} 1/d(x,y)    (6.6)

For each variable we determine all states where it is accessed, and sum the C(x) values for each state node. This summation is used as the f_i values when calculating the availability.

6.3.3 Measurement Methodology

The methodology to measure observability is the same as described in Section 6.2.2. The main difference is that in this experiment, only the variables mapped to registers are traced, compared to the previous experiment in Section 6.2 where the full circuit (Figure 2.6), including the control flow and variables in memory, was observed. The reason for this difference is that some of the selection methods we use, such as counting the number of read and write accesses to each variable, rely on static program analysis, and cannot be evaluated for variables in memory which use pointers.
The reason for this is that when software contains pointers, it is possible for the same load and store instructions to access several different source code variables.

In this experiment we sweep the number of variables observed from 10% to 100% in increments of 10%, and measure the availability, duration, and resulting observability. We use the architecture that only records updates to variable values (Technique 2 from the previous section).

Figure 6.3: Availability, duration and observability provided by variable selection methods. Panels: (a) Availability, Uniform Importance; (b) Availability, Non-Uniform Importance; (c) Duration (cyl/Kb), Uniform Importance; (d) Duration (cyl/Kb), Non-Uniform Importance; (e) Observability (% cyl/Kb), Uniform Importance; (f) Observability (% cyl/Kb), Non-Uniform Importance. Each panel sweeps the percentage of traced variables; legends list the selection methods (1. Random, 2. R+W, 3. R+W Dyn., 4. R/W, 5. 1/Bits, 6. (R/W)/Bits; 7. Random, 8. Import., 9. (R/W)/Bits, 10. Import.*(R/W)/Bits).

6.3.4 Results and Analysis

Figure 6.3 provides the availability, duration and observability results for the various selection techniques.
The graphs on the left hand side (Figures 6.3a, 6.3c and 6.3e) show the results for uniform variable importance, and the graphs on the right (Figures 6.3b, 6.3d and 6.3f) are for non-uniform variable importance. We first discuss the results for the cases that assume uniform variable importance.

The Reads+Writes technique is aimed at providing a high level of availability, as variables that are accessed most often will be traced. However, it relies on static analysis of the program, and as evident in the results, only slightly improves the availability over a random selection. This shows that there is not a strong correlation between the number of accesses in the source code and the number of accesses actually performed during execution. This is not surprising, as one would expect the number of accesses to be more correlated to whether the access is in a frequently executed loop, rather than simply the number of accesses in the source code. When the R+W Dynamic approach is used, which uses the number of actual accesses from a simulated execution, the availability is significantly higher. The increased availability comes at a cost, as recording more accesses shortens the duration, and the resulting observability, which is the product of availability and duration, is comparable to the random selector.

The Reads/Writes technique aims to trace those variables that are read much more often than they are updated. Since only updates are recorded in the buffer, this aims for increased memory efficiency, and the results show a clear increase in duration over the random selector, as well as a moderate increase in observability. The 1/Bits selector and the combined (R/W)/Bits selector aim to provide even higher memory efficiency by targeting variables that have the fewest signal bits in the hardware circuit. This provides much longer execution traces; in fact, at the very low end where only 10-20% of variables are traced, the pay-off is very large.
This may be useful if the designer simply wants to get some insight into the circuit over a very long execution. However, it is contingent upon these variables actually being useful. In many cases these variables may be insignificant; in such cases the f_i coefficient could be set to zero, or a very low value, to ignore the variable and get a better representation of observability.

Next we consider the cases where variable importance is non-uniform, and is proportional to the graph centrality. The Import. selector simply records the most important variables. This would likely be a common case if the user supplies variable importance, as they would expect to observe the variables they deem to be important. This selector achieves significantly higher availability than the random case, which is to be expected, since the definition of availability is a function of variable importance. The duration is slightly lower than the random case, which suggests that the graph centrality approach we used is more likely to trace variables with wider bit widths. This is likely due to the fact that we sum the centrality values for each state in which a variable is updated, so variables that are updated in more places will have both higher importance as well as more signals to record (remember that SSA form means there is usually a separate hardware signal for each location a variable is updated).

We also test the (R/W)/Bits selector in the context of non-uniform variables. It performs similarly to when applied to the uniform importance case, maximizing the trace buffer duration, especially for cases where few signals are traced.

The results show that our metric can capture behaviours exhibited by various variable selection techniques.
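The centrality weighting used above (Equation 6.6, summed over the states where a variable is accessed to give its f_i) can be sketched as follows. The tiny state graph and variable-to-state mapping are invented for illustration.

```python
from collections import deque

# Closeness centrality per Equation 6.6: C(x) = sum over y != x of 1/d(x,y),
# where d(x,y) is the hop distance in the FSM state graph. Unreachable
# states contribute nothing to the sum.
def centrality(graph, x):
    dist = {x: 0}
    queue = deque([x])
    while queue:                      # BFS for hop distances from x
        node = queue.popleft()
        for succ in graph[node]:
            if succ not in dist:
                dist[succ] = dist[node] + 1
                queue.append(succ)
    return sum(1.0 / d for y, d in dist.items() if y != x)

# Hypothetical three-state cycle: S0 -> S1 -> S2 -> S0.
fsm = {"S0": ["S1"], "S1": ["S2"], "S2": ["S0"]}

# A variable's importance f_i is the sum of C(x) over the states in which
# it is accessed, as described in Section 6.3.2.
accessed_in = ["S0", "S2"]
f_i = sum(centrality(fsm, s) for s in accessed_in)
```

For the three-state cycle every state has C(x) = 1 + 1/2 = 1.5, so a variable accessed in two states receives f_i = 3.0; variables touched in more states naturally accumulate higher importance, which matches the bias discussed above.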
Depending on the chosen technique, the impact may primarily be to the availability of variable values, or to the duration of execution trace that can be captured, or may have some effect on both.

6.4 Summary

As high-level synthesis systems become more widespread, there is a greater need for a complete development ecosystem, supporting design, test, debug and optimization. There have been several recent works, including the techniques presented in this dissertation, that aim to provide a user with observability into an executing HLS circuit, for the purposes of debug and optimization. Typically these approaches use on-chip memory resources to record a portion of the circuit execution; however, due to limited resources, it is usually not possible to capture all data for the entire execution. With only limited information about the circuit execution, debugging can be a challenging and time consuming task. Effective debugging tools will be those that make the best use of these limited resources, providing the user with the most insight into the circuit behaviour.

In this chapter we presented a metric for measuring the level of observability provided by these new HLS debugging technologies. This metric measures the likelihood that variable values will be available to the user as they inspect the circuit execution, and reflects both the relative importance of variable values, as well as the duration of execution for which observability is provided.

We showed how this metric could be applied to understand the levels of observability provided by recent HLS debugging techniques. These experiments showed that while our signal-tracing architecture (Chapters 3 and 4), which only records variable updates, is able to capture a much longer duration of execution than a traditional ELA tool, this improvement comes at a price. For 6–12% of variable accesses, the value will be unknown because it was updated too far in the past, and has been evicted from the trace buffer.
We showed that by also recording variable reads, the value of every variable access would be available; however, this increases memory usage and reduces the duration for which execution can be captured.

Finally, we used our metric to demonstrate how observability changes when only a portion of variables are recorded. The experiments showed that the cost of recording variables is not uniform across all source code variables, and that different methods of variable selection could be employed to increase either the availability of variable values, or the length of execution that can be captured.

The HLS field is very active, quickly changing, and full of exciting opportunities. We hope this observation metric can help future researchers understand the level of observability provided by their tools, compare and contrast them to existing technologies, and motivate better HLS debug technologies.

Chapter 7

Conclusion and Future Work

This chapter provides a summary of the dissertation, outlines the significance of each contribution, and discusses their impact and limitations. Both short-term future work and long-term research directions are presented.

7.1 Dissertation Summary

In the last few years there has been an increasing demand for High-Level Synthesis (HLS) technologies; these development frameworks allow designers to create digital circuits by providing a software specification of their design. This demand is being motivated by two major factors: first, as chip capacities increase, and time-to-market becomes critical, hardware designers need a higher level of design abstraction that enables creating large digital designs quickly. Second, reconfigurable computing, especially FPGAs, are emerging as viable platforms for acceleration of software workloads [23, 24].
To enable this transition, there needs to be a design flow that allows software designers to create digital hardware systems; HLS is a leading candidate to fill this role.

This shift to HLS-based design flows has driven innovation and new products in both the commercial and academic communities. Most of this work has focused on the quality of the HLS compiler itself, developing techniques to produce the best-performing circuit. However, in order for HLS to receive widespread adoption, there needs to be a complete ecosystem of design tools; beyond just a compiler, designers also need tools that allow for debugging, validation, and optimization. This dissertation addresses the challenge of debugging HLS designs.

Although commercial HLS tools offer some techniques for debugging HLS systems, these involve either using software emulation of the hardware system, or some form of circuit simulation. No tools exist that offer practical in-system debugging of the HLS circuit, that is, allowing the designer to observe the circuit as it executes at-speed and interacts with other modules in the system. However, uncovering certain bugs, especially those that occur when the circuit interacts with its surroundings, requires in-system hardware debugging.

Although there exist many technologies to debug digital circuits, these operate at the register-transfer level (RTL), and require an understanding of the signals, bits and registers within the digital circuit. In contrast, designers using HLS, especially software designers who lack hardware expertise, are operating only in the software domain, without knowledge of the underlying circuit produced by the HLS tools. Thus there was a need for technologies to enable debugging in the context of the original source code, but while interacting with the underlying digital circuit.
Providing this debugging capability to designers is a challenging problem; users will create their designs at the software level, which is characterized as high-level and sequential; however, the system being debugged is in hardware, which consists of low-level data-flow elements operating in parallel. The objective of this dissertation was to address this challenge, specifically how we can provide designers with effective source-level, in-system debugging capabilities for circuits produced by HLS compilers.

In Chapter 3 we presented a framework for source-level in-system debug of HLS circuits. Central to our approach is a signal-tracing architecture which records elements of the circuit while it executes at speed. We presented this signal-tracing architecture, and several optimizations to improve its efficiency, in Chapter 4. In Chapter 5 we tackled the challenges introduced when debugging parallel HLS systems, and in Chapter 6 we developed a metric for measuring observability, which allows different debugging approaches to be compared in a consistent manner. The remainder of this section provides a summary of these contributions, and the impact of this dissertation.

7.1.1 Summary of Contributions

An In-System Debugging Framework for HLS Circuits

The first contribution of this thesis, presented in Chapter 3, is a framework for source-level, in-system debugging of HLS circuits. This framework allows a designer to debug their circuit using their original software code, while connected to the actual hardware circuit which is executing at-speed and interacting with the full hardware environment.

This debugging flow is enabled through three components. First, we developed novel debug instrumentation which is automatically added to the user's design at compile time. This debug circuitry allows for external control of the circuit, and observation of key internal signals.
Second, through a prototype tool, we demonstrated how a software-like debug experience could be obtained, complete with single-stepping, breakpoints and inspection of variable values. We showed how the debugger could be augmented with additional information to help the user understand how their software executes in the hardware, including the scheduling of different operations, and the effect of compiler optimizations on their design. Finally, we demonstrated techniques to track key information through the HLS flow, building a mapping between hardware and software, allowing for the reconstruction of the program control and data flow by inspecting key hardware values.

Central to our debugging framework is the record and replay technique, where the on-chip memory resources are used to record circuit execution, allowing the circuit to run at-speed and interact with the full operating environment. Once the circuit is halted, these values are retrieved, and the user is able to perform debugging using a portion of the recorded circuit execution. This technique of capturing in-system execution allows our debugging framework to be used to uncover hardware bugs that are dependent on the operating environment. At the time of our publication [1], there were no other HLS debug solutions that provided this capability.

Optimizing Data Capture

Our debugging framework relies upon capturing the HLS circuit behaviour into on-chip memory for later offline analysis. Existing ELA technologies can be used to perform signal-tracing; however, these tools are designed to work with general RTL circuits, and are not optimized for HLS circuits. For larger HLS designs, where there may be many variables that need to be observed, these tools will only be capable of capturing a very small portion of the total circuit execution. For our benchmark circuits, an ELA tool would provide, on average, only 51 cycles of execution, assuming 100Kb of memory is used for trace buffers (Table 6.1).
This gives the designer very limited insight into the behaviour of the circuit, and may make debugging a challenging and time consuming process.

The contribution of Chapter 4 is a collection of signal-tracing techniques to optimize data capture of an HLS circuit. We demonstrate separate optimizations for capturing the control flow, updates to variables on the datapath, as well as updates to variables located in on-chip memories. Our novel signal-tracing techniques use the HLS scheduling information to build debug logic, tailored to the individual design, which controls which signals are recorded cycle-by-cycle. This technique is used both to record variables located in the datapath, as well as to record memory transactions to capture updates to variables located in memory. We also develop a novel technique to determine which hardware signals do not need to be recorded on-chip, but rather can be computed offline based on the value of other recorded signals. Our technique uses an ILP solver to determine an optimal solution, minimizing the amount of data that needs to be recorded.

With all optimizations applied, our signal-tracing architecture provides a 127x improvement to trace length versus an ELA. For 100Kb of memory allocated to trace buffers, this increases the 51 cycles of execution captured by an ELA tool to 6957. This makes it practical to perform software-like debug, where the designer has access to all variable values. This observability is achieved with a moderate amount of area overhead (858 ALMs), and no impact to operating frequency.

Debugging Parallel HLS Systems

Recent HLS tools allow designers to maximize the parallelism of their circuit by providing multithreaded source code. Although it is possible to debug such systems by using our framework and duplicating the debug instrumentation to every thread in the system, such an approach would likely require too much overhead, especially if the user created a highly parallelized system.
In addition, our software-like debug approach where all variables are observed for all threads would likely consume on-chip memory too quickly, netting a short trace duration.

The contribution of Chapter 5 is a debugging approach more suited to parallel HLS systems. The debug framework we present consists of allowing users to select tracepoints, which are key points in the program execution, which trigger data logging when encountered by a thread. This provides a lightweight debugging technique, where the user can choose to observe as few points of data within a thread as they like. This allows the user to gain insight into the "big picture" of a system execution, as trace memory can be used to record a few variables for a long period of execution. This is particularly useful for resolving non-deterministic bugs in parallel systems, such as deadlocks or thread starvation, that may require long run times to expose.

As part of this work we address a key issue in our architecture, which is that each thread requires its own trace buffer, and it is challenging to determine how memory should be allocated between threads at compile time. We present a technique to reduce this problem by introducing an architecture that allows threads to share memory resources for logging tracepoints. This consists of an algorithm which performs program analysis to determine which threads can share access to a single buffer. Results show that this buffer sharing technique increases the average trace length that can be obtained. This increase depends on the system configuration; as an example, an eight thread system with one tracepoint per thread achieves a 4x improvement in captured trace length.

Quantifying Observability

The final contribution of this dissertation, presented in Chapter 6, is a metric for quantifying the level of observability into an executing HLS circuit.
With many recently published techniques for in-system HLS debug, including both the work in this dissertation, as well as work by others [52–56], it became apparent that there needed to be a standard method to compare debug techniques, in order to measure how much data they provide to the user about the executing HLS circuit.

The metric we present measures the likelihood that variable values will be available to the user as they inspect execution of their HLS system, and takes into account differences in variable importance, as well as the duration of execution for which observability is provided to the user.

As part of this contribution we have used our metric to evaluate recent techniques for in-system HLS debug. We show that using an ELA tool to perform signal tracing provides complete availability of variable values, but for only a short duration of execution. We contrast this with our signal-tracing technique which records only those variables that are updated each cycle, and show that our technique achieves 38x higher observability than the ELA approach. While our approach is much more memory efficient, it is susceptible to a problem when variables are read long after they are updated, as the updated value may have been evicted from the trace buffer. To solve this issue we introduce the technique of recording both variable reads and writes; however, this consumes more memory, reducing the increase in observability over an ELA tool to 14x.

The final part of this contribution uses the metric to show how recording different variables from the source code can affect observability in different ways.
Depending on whether the goal is to increase the availability of variable values, or to increase the duration of captured execution, different variable selection methods can be employed.

7.1.2 Dissertation Impact

This dissertation has shown that it is possible to provide both hardware and software designers with HLS debugging techniques that allow them to debug their circuit while it executes in-system, while operating in the context of their original software program. In addition, we have shown that this capability can be provided in the presence of compiler optimizations, across fundamental differences between the software and hardware domains, and with limited resource impact on the user's circuit. Depending on available resources, observability can be varied from a few select points in the program (using the tracepoint debugging technique), to complete observability into all variables with a debug experience similar to a software debugger.

These capabilities are not yet offered in any commercial HLS products. Currently, users are forced to perform debugging using software emulation or hardware simulation; however, even HLS vendors concede that this is not sufficient to capture all system-level bugs, especially those that may be non-deterministic in nature, or require interaction with other modules in the system [19]. The concepts presented in this dissertation solve those issues, and allow users to locate complex, non-deterministic bugs. We have presented the contributions of this dissertation to major HLS vendors, and from personal discussions, we expect that source-level, in-system debugging technologies will be released in future iterations of their HLS tools, including some concepts presented in this dissertation.

Enabling effective debugging, accessible to software and hardware designers, is a key step toward widespread HLS adoption.
Effective HLS tools will allow future hardware designers to rapidly create large, complex digital circuits, with faster time-to-market than is possible today. For software designers, accessible HLS tools allow them to easily accelerate their software workloads using hardware, achieving higher performance and lower power; this is a key step in bringing reconfigurable computing to data centers, commodity workstations, and mobile devices.

7.2 Limitations and Short-Term Future Work

This section describes the major limitations of this work, and describes short-term future work that could address these limitations.

7.2.1 Extending to Commercial HLS Tools

All of the contributions in this dissertation have been prototyped and tested in the LegUp HLS tool. LegUp is an academic, open-source HLS tool that is widely used in research groups worldwide. Despite its popularity, this tool suffers the same issues as most academic projects: it is not supported by the same resources as large commercial tools, and most contributors to the project are doing so to test their own research, not necessarily to engineer all features that may be found in commercial tools. This introduces the concern that commercial tools, which include many more features, may contain techniques that would invalidate the approaches presented in this dissertation.

One example of how commercial tools differ is that they typically offer many more methods to transfer data in and out of the hardware circuit, such as streaming ports, several different standards of bus connections, etc., whereas LegUp only supports data input through function parameters and shared memory. We provide this as an example difference; however, we do not expect this difference will cause any fundamental problems with our debug techniques, aside from the fact that perhaps different signals in the hardware will need to be observed.
However, since the commercial tools are closed-source, and we do not have access to them, it is not possible for us to claim with certainty that such fundamental differences do not exist.

To address these concerns we have presented this work several times to major HLS vendors. Feedback from these commercial HLS developers has been very positive, and we have not received any indications that our approaches would be invalidated by the techniques used in commercial tools.

To make the techniques in this dissertation concrete, it would be ideal for them to be prototyped in a commercial tool. When we began this work, choosing a commercial tool was not practical, as we wanted the flexibility to develop debugging techniques that required modifying the HLS tool. In the end, we demonstrated our techniques by modifying an open-source HLS tool; however, our resulting debug architecture does not necessarily require modifying the tool. The debug instrumentation could be added to the circuit after being produced by a closed-source tool, provided that sufficient information about the design was supplied by the HLS tool. As a future project, it would be worthwhile to investigate what information needs to be provided by a closed-source HLS tool to enable our debug techniques. This would include information on both the structure of the resulting circuit, as well as how signals in the produced circuit map back to the original software.

Another advantage of implementing this work in a commercial tool is that it would allow us to evaluate our debugging framework in terms of its effectiveness at helping users find bugs in real designs. In this dissertation we evaluate our debugging framework in terms of the observability it provides designers into the hardware execution of their HLS design.
In future work it would be beneficial to measure the real impact this has on debugging, both in terms of the success of designers in locating bugs in their design, as well as the length of time it takes to locate the cause of the bug. Such a study would likely consist of injecting bugs into existing commercial designs, having volunteer designers use our framework to locate the bug, and measuring their success rate.

7.2.2 Widespread Support for Optimizations

Providing source-level debugging requires information on how hardware signals map back to instructions and variables in the source code. Building this information requires tracking the software entities through all optimizations and transformations as the software is synthesized to hardware. This is problematic, as it means that debug mapping must be maintained through all HLS optimizations. The approach of simply disabling a certain optimization to enable debugging is not practical; this may change circuit behaviour and mask the bug. Ideally, debugging should be permitted in the presence of all possible optimizations provided by the HLS tool.

In our proof-of-concept debugger presented in this dissertation, we tackled only the core HLS optimizations. This included compiler optimizations, for which LLVM already maintains debug mapping, as well as the IR-to-RTL synthesis steps of allocation, scheduling and binding. However, if this work were to be incorporated by a commercial HLS developer, they would likely need to modify all of their optimizations to maintain this debug data. This may require a shift in design practice, as future optimizations may be limited by the need to maintain debug data.

In addition, providing support for certain optimizations introduces an additional challenge of having to expose to the user how behaviour in the circuit has changed; this may be particularly challenging if such behaviour is foreign to a software developer using the debugging tool.
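One example of such hardware-specific behaviour is a pipelined loop, where several iterations execute at once; the toy schedule model below (all parameter values are illustrative assumptions, not taken from any particular HLS schedule) shows what a debugger would have to visualize:

```python
# Toy model of loop pipelining: with initiation interval II, a new loop
# iteration starts every II cycles and each runs for 'latency' cycles, so
# several iterations are in flight at once. Parameter values are
# illustrative assumptions, not taken from any particular HLS schedule.
def active_iterations(n_iters, latency, ii):
    """Map each cycle to the loop iterations executing during it."""
    total_cycles = ii * (n_iters - 1) + latency
    return {c: [i for i in range(n_iters) if ii * i <= c < ii * i + latency]
            for c in range(total_cycles)}

# Four iterations, 3-cycle latency, a new iteration every cycle (II = 1):
print(active_iterations(4, 3, 1)[2])  # [0, 1, 2] -- three iterations at once
```

A source-level debugger must somehow render all three concurrent iterations of the same C statement, which is exactly the kind of behaviour that has no direct software analogue.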
For example, one optimization that is common in HLS tools, but which we have not addressed in this work, is loop pipelining. This optimization allows multiple iterations of a software loop to be executed simultaneously. Observing multiple iterations of the loop within the circuit is not difficult; it simply requires observing extra hardware signals. However, within the debugger this data must be presented to the user in a meaningful format. For a software developer this concept may be foreign, and thus care must be taken to visualize it in an effective manner. Although we did not address loop pipelining in our work, we did add visualizations to help the user understand when multiple C instructions would execute simultaneously; this was done by highlighting multiple lines of code and providing a Gantt chart.

7.2.3 Source-to-Source Transformations

Along the same theme as the previous discussion, another optimization that our work does not address is source-to-source transformation. This occurs when a tool (sometimes a third-party tool) automatically modifies the user's code before inputting it into an HLS tool. This was discussed in greater detail in Section 2.3.3. This introduces additional challenges, as even if the HLS tool tracks the mapping from software to hardware, it will be in the context of the modified software program, not the user's original program.

To address this, future work could investigate the types of transformations performed by these source-to-source transformation tools, and whether it would be possible to automatically map debugging information back to the original program in a way that would be intuitive to the developer.
For example, most source-to-source tools use the ROSE framework [31, 145, 146], and it may be possible to modify this framework in order to track the transformations to the source code, with the goal of enabling the user to debug using the original source code, without needing to be aware of any transformations. If an automated process proves too difficult, another approach would be to display the transformations to the user, aiding them in understanding how the new source code maps back to their original code.

7.2.4 Scalability to Large Benchmarks

One limitation of the experiments in this dissertation is that there is not a wide variety of HLS benchmarks available to test how debugging overheads change with vastly different benchmarks. The test circuits we use are the CHStone benchmarks, which are the most commonly used benchmarks in recent academic publications on HLS. Properties of these benchmarks were given in Table 3.1. Unfortunately, these benchmarks do not vary greatly in size; the number of lines of code ranges from 182 to 1003 (a factor of 5.5), the number of source code variables from 19 to 328 (a factor of 17), and as shown in Table 4.4, the resulting circuits range in area from 918 to 15716 logic elements (a factor of 17). The largest circuit requires only 2.5% of the logic resources on a moderately sized Stratix V FPGA (Table 4.4).

Future work could explore how the techniques in this dissertation scale to much larger benchmark sizes. In discussions with commercial HLS developers, they have indicated that their clients will often use much larger percentages of the FPGA; one company indicated this was typically in the ballpark of 30%, while another had clients who create designs that fill FPGAs to maximum capacity. In Section 4.6.2 we showed that much of the debug instrumentation is a fixed cost, and it is mainly the signal-trace multiplexing that scales with the benchmark size. By observing fewer variables this logic can be reduced.
A future study could explore this relationship, and demonstrate what fraction of variables can be practically observed as designs grow very large.

7.2.5 Enhancements to the Debugger GUI

Several enhancements could be made to the debugger GUI, which was described in Section 3.4. For example, to help users work backwards to find the root cause of the bug, it may be beneficial to show the causal link between each instruction, allowing a user to determine where in the design a value was last computed. Illustrating this link could also help the user understand dependencies in their design, and allow them to re-order instructions or alter the design to improve scheduling. Other features could be added to help the user understand the potential impact to the design if certain instructions were removed or re-ordered. All of these features, as well as other similar features, would help a designer locate bugs and understand how changes to their software will affect the operation of the hardware.

            ELA                 Our Work
Benchmark   bits/cycle  Gb/s    bits/cycle  Gb/s   Improvement
adpcm       3848        750.4   16.4        3.2    234x
aes         1252        225.7   9.7         1.8    129x
blowfish    1671        416.9   11.6        2.9    143x
dfadd       2627        766.8   27.3        8.0    96x
dfdiv       2985        774.7   4.1         1.1    725x
dfmul       1704        493.5   19.8        5.7    86x
dfsin       6843        1457.2  8.8         1.9    775x
gsm         1918        425.0   14.0        3.1    137x
jpeg        5618        692.0   10.9        1.3    516x
mips        711         145.5   20.8        4.3    34x
motion      861         165.6   12.0        2.3    72x
sha         681         204.1   13.8        4.1    49x
GeoMean     1965        433.2   12.8        2.8    154x

Table 7.1: Average memory bandwidth required by trace buffers.

7.2.6 Exploring Off-Chip Signal-Tracing

In this dissertation, signal-tracing is done by recording signal values into on-chip memories. Although on-chip FPGA memory is very limited, it is often used for signal-tracing due to its high bandwidth.
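The Gb/s columns of Table 7.1 follow directly from the bits/cycle columns once a clock frequency is fixed. The sketch below assumes a single 195 MHz clock, chosen only because it roughly reproduces the adpcm row; in reality each benchmark presumably runs at its own Fmax:

```python
# Convert trace width (bits recorded per cycle) into sustained bandwidth.
# The 195 MHz clock is an assumed figure that roughly reproduces the adpcm
# row of Table 7.1; it is not a value reported in the dissertation.
def trace_bandwidth_gbps(bits_per_cycle, clock_mhz=195):
    return bits_per_cycle * clock_mhz * 1e6 / 1e9

print(round(trace_bandwidth_gbps(3848), 1))  # 750.4 -- ELA trace for adpcm
print(round(trace_bandwidth_gbps(16.4), 1))  # 3.2 -- optimized trace for adpcm
```

At these rates, the ELA-style trace far exceeds typical off-chip memory bandwidth, while the optimized trace does not, which is the feasibility question examined next.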
In some cases it may be possible to use off-chip memory, such as DDR memory, which typically provides much higher capacity, allowing for longer execution traces, but with lower bandwidth [82].

In Table 7.1 we show the memory bandwidth required to perform trace-based debugging as described in our debug framework from Chapters 3 and 4. If an ELA tool is used, which records all necessary signals every cycle, the average bandwidth for the CHStone benchmarks, compiled using LegUp 4.0, is 433.2 Gb/s. This exceeds the bandwidth provided by DDR3 memory, which ranges from approximately 50 to 200 Gb/s. Thus, if conventional signal-tracing techniques are used, off-chip DDR memory is not sufficient, and on-chip memory is necessary. However, when our optimizations are applied the bandwidth is significantly reduced. Table 7.1 shows the reduced bandwidth for our work (we assume the best-case configuration from our work, which is the last row of Table 6.1). In this case the average bandwidth is reduced to 2.8 Gb/s, well within the limitations of DDR3 memory. Thus, it may be possible to achieve much longer execution traces by utilizing off-chip memory.

Further work would need to be done to determine whether this is feasible, including a study of peak bandwidth, how to efficiently buffer data if peak bandwidth exceeds the memory bandwidth, as well as techniques to handle the case where the user's circuit is already using the off-chip memory, and the interface needs to be shared with the debug logic.

7.3 Long-Term Future Directions

This section discusses long-term research directions that are motivated by the work in this dissertation.

7.3.1 Integrated Software/Hardware Debugging

This dissertation focused on allowing designers to debug HLS circuits; however, the research largely ignored the parent system in which an HLS circuit would be used.
A typical usage scenario for HLS-produced circuits is as a means of accelerating certain portions of a software workload [23, 29, 111]. In this case the FPGA acts as a co-processor and is tightly bound to a general processor performing a software workload. Academic projects already exist which automatically identify portions of a software program to migrate to hardware for acceleration, and allow for seamless sharing of workload between an HLS-produced circuit and a processor embedded in the FPGA [48]. Currently this is mainly used to interface with ARM processors within the FPGA; however, in the future this will also include x86 processors. This is evident as newly announced Xeon processors from Intel will contain on-chip FPGA logic [24].

Debugging such hybrid systems introduces a new set of challenges, beyond what is explored in this work. As execution context can switch back and forth between the CPU and FPGA, or execute simultaneously, it is likely that a unified debugging process will be required that allows the user to debug both sides of the system simultaneously. New techniques will need to be developed to observe resources that are shared between the CPU and FPGA, including shared memory and communication channels. In some cases it may be necessary to explore how these structures can be altered to better support observability for debugging. Debug circuitry must provide observability without changing CPU or FPGA behaviour, otherwise a probe effect may be introduced and bugs may be masked.

Another challenge that must be considered when debugging such a system is the possibility of dynamic reconfiguration, where the circuit implemented on the FPGA changes throughout the software workload. This may occur simply because different phases of the software execution require different algorithms to be accelerated. It can also occur as a result of datacenter virtualization, where hardware resources are shared between many different workloads.
In these cases the debug circuitry must be designed to handle reconfiguration, and ensure that critical debug data is not lost during this process.

Addressing such challenges in debugging will be necessary to allow hardware acceleration to receive widespread adoption in data centers and other environments that are typically dominated by software applications.

7.3.2 Helping the User Create High-Quality HLS Systems

The work in this dissertation focused on providing the user with as much observability as possible into the executing design. This allows designers to perform functional debug, that is, locating errors in their design that prevent correct behaviour. This observability into the circuit execution also helps the user to perform performance debug, that is, locating bottlenecks or other issues in the code that contribute to lower than desired performance.

However, providing HLS users with functional and performance debugging is just one part of helping them make high-quality designs. In order for designers to make better HLS systems, they need to understand how their software code is transformed into a hardware circuit. In some respects our debugging framework already provides this information. For example, in Section 3.4 we showed how a Gantt chart in the debugger application could help the user understand how their software was parallelized in the hardware implementation, and may help them restructure their C code to obtain better instruction scheduling and higher performance. However, aiding users in creating high-quality HLS circuits goes beyond just solving bugs and maximizing performance. It also includes the problems designers face of reducing circuit area, and reducing power consumption. Unfortunately our debug framework does little to help them understand how to restructure their code to reduce area or power.
Future work could address both of these issues.

Power consumption is an especially important aspect of HLS design, as one of the most attractive reasons for offloading software workloads to hardware is power savings. This is especially applicable for data center applications, where power consumption is a first-class concern. Future work could develop techniques to help the user understand how the choices they make in their software design affect the power of the resulting circuit. This would involve techniques to identify which portions of software code are responsible for the largest power consumption and conveying this information to the user. This could potentially be integrated as a visualization tool into the debugger application, helping the user understand how power consumption changes throughout the circuit execution.

Similar approaches could be taken for area optimization. Within a debugging context, it may be beneficial to see which instructions contribute most to the area of the resulting circuit, and possibly include suggestions to the user about how to restructure their code to reduce circuit area.

The main takeaway from this research direction is that in an HLS system, debugging may go far beyond just functional or performance debug. When the success of a hardware design hinges on many factors, including performance, area and power, it is necessary to help the user understand how choices they make in the design of their source program affect these different aspects of the final circuit.

Bibliography

[1] Jeffrey Goeders and Steven J. E. Wilton. “Effective FPGA debug for high-level synthesis generated circuits”. In: International Conference on Field Programmable Logic and Applications. Sept. 2014, pp. 1–8.

[2] Jeffrey Goeders and Steven J. E. Wilton. “Using Dynamic Signal-Tracing to Debug Compiler-Optimized HLS Circuits on FPGAs”. In: International Symposium on Field-Programmable Custom Computing Machines. May 2015, pp. 127–134.

[3] Jeffrey Goeders and Steven J. E.
Wilton. “Allowing Software Developers to Debug HLS Hardware”. In: International Workshop on FPGAs for Software Programmers. Aug. 2015.

[4] Jeffrey Goeders and Steven J. E. Wilton. “Using Round-Robin Tracepoints to Debug Multithreaded HLS Circuits on FPGAs”. In: International Conference on Field-Programmable Technology. Dec. 2015, pp. 40–47.

[5] Andrew Canis, Jongsok Choi, Blair Fort, Bain Syrowik, Ruo Long Lian, Yuting Chen, Hsuan Hsiao, Jeffrey Goeders, Stephen Brown, and Jason Anderson. “LegUp high-level synthesis”. In: Chapter in FPGAs for Software Engineers. Springer, 2016.

[6] Jeffrey Goeders and Steven J. E. Wilton. “Signal-Tracing Techniques for In-System FPGA Debugging of High-Level Synthesis Circuits”. In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2016). To be published in 2016.

[7] Jeffrey Goeders and Steven J. E. Wilton. “Quantifying Observability for In-System Debug of High-Level Synthesis Circuits”. In: International Conference on Field Programmable Logic and Applications. Aug. 2016.

[8] R. A. Bergamaschi, R. A. O’Connor, L. Stok, M. Z. Moricz, S. Prakash, A. Kuehlmann, and D. S. Rao. “High-level synthesis in an industrial environment”. In: IBM Journal of Research and Development 39.1.2 (Jan. 1995), pp. 131–148.

[9] J. Biesenack, M. Koster, A. Langmaier, S. Ledeux, S. Marz, M. Payer, M. Pilsl, S. Rumler, H. Soukup, N. Wehn, and P. Duzy. “The Siemens high-level synthesis system CALLAS”. In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems 1.3 (Sept. 1993), pp. 244–253.

[10] Kazutoshi Wakabayashi. “C-based Behavioral Synthesis and Verification Analysis on Industrial Design Examples”. In: Asia and South Pacific Design Automation Conference. 2004, pp. 344–348.

[11] O. Mencer, R. Dimond, M. J. Flynn, and O. Pell. “MAXware: Acceleration in HPC”. In: IEEE HOT CHIPS. Maxeler Technologies, Aug. 2008.

[12] Jason Yu, Christopher Eagleston, Christopher Han-Yu Chou, Maxime Perreault, and Guy Lemieux.
“Vector Processing As a Soft Processor Accelerator”. In: ACM Transactions on Reconfigurable Technology and Systems 2.2 (June 2009), 12:1–12:34.

[13] Wim Meeus, Kristof Van Beeck, Toon Goedemé, Jan Meel, and Dirk Stroobandt. “An overview of today’s high-level synthesis tools”. In: Design Automation for Embedded Systems 16.3 (2012), pp. 31–51.

[14] Luka Daoud, Dawid Zydek, and Henry Selvaraj. “A Survey of High Level Synthesis Languages, Tools, and Compilers for Reconfigurable High Performance Computing”. In: Advances in Systems Science. Vol. 240. 2014, pp. 483–492.

[15] R. Nane, V. M. Sima, C. Pilato, J. Choi, B. Fort, A. Canis, Y. T. Chen, H. Hsiao, S. Brown, F. Ferrandi, J. Anderson, and K. Bertels. “A Survey and Evaluation of FPGA High-Level Synthesis Tools”. In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2016). To be published in 2016.

[16] S. Windh, X. Ma, R. J. Halstead, P. Budhkar, Z. Luna, O. Hussaini, and W. A. Najjar. “High-Level Language Tools for Reconfigurable Computing”. In: Proceedings of the IEEE 103.3 (Mar. 2015), pp. 390–408.

[17] David Bacon, Rodric Rabbah, and Sunil Shukla. “FPGA Programming for the Masses”. In: Queue 11.2 (Feb. 2013), 40:40–40:52.

[18] J. Cong, Bin Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Zhiru Zhang. “High-Level Synthesis for FPGAs: From Prototyping to Deployment”. In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 30.4 (Apr. 2011), pp. 473–491.

[19] Altera. Altera SDK for OpenCL: Programming Guide. Tech. rep. UG-OCL002. Nov. 2015.

[20] Xilinx. Vivado Design Suite User Guide: High-Level Synthesis. Tech. rep. UG902 (v2014.1). May 2014.

[21] Impulse Accelerated Technologies. Impulse CoDeveloper. URL: http://www.impulseaccelerated.com/products.htm (visited on 05/25/2016).

[22] Synopsys. Synphony C Compiler: High-Level Synthesis from C/C++ to RTL.
URL: https://www.synopsys.com/Tools/Implementation/RTLSynthesis/Pages/SynphonyC-Compiler.aspx (visited on 05/25/2016).

[23] Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jordan Gray, et al. “A reconfigurable fabric for accelerating large-scale datacenter services”. In: 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA). 2014, pp. 13–24.

[24] Prabhat K. (PK) Gupta. Intel Xeon + FPGA Platform for the Data Center. Workshop presentation, Reconfigurable Computing for the Masses, Really? at FPL 2015. Sept. 2015.

[25] Julien Happich. Cognitive Computing Platform Unites Xilinx and IBM. EE Times. Apr. 2016. URL: http://www.eetimes.com/document.asp?doc_id=1329377 (visited on 04/27/2016).

[26] Qualcomm and Xilinx Collaborate to Deliver Industry-Leading Heterogeneous Computing Solutions for Data Centers with New Levels of Efficiency and Performance. Qualcomm. URL: https://www.qualcomm.com/news/releases/2015/10/08/qualcomm-and-xilinx-collaborate-deliver-industry-leading-heterogeneous (visited on 04/27/2016).

[27] Stacey Higginbotham. Why Intel will spend $16.7 billion on Altera. Fortune. URL: http://fortune.com/2015/08/27/why-intel-altera/ (visited on 04/27/2016).

[28] United States Bureau of Labor Statistics. Occupational Outlook Handbook. 2012.

[29] Xilinx. SDAccel Development Environment: User Guide. Tech. rep. UG1023 (v2015.1). May 2015.

[30] Wei Zuo, Yun Liang, Peng Li, Kyle Rupnow, Deming Chen, and Jason Cong. “Improving High Level Synthesis Optimization Opportunity Through Polyhedral Transformations”. In: International Symposium on Field Programmable Gate Arrays. 2013, pp. 9–18.

[31] J. Cong, Peng Zhang, and Yi Zou. “Optimizing memory hierarchy allocation with loop transformations for high-level synthesis”. In: Design Automation Conference. June 2012, pp. 1229–1234.

[32] Qijing Huang, Ruolong Lian, A. Canis, Jongsok Choi, R. Xi, S. Brown, and J.
Anderson. “The Effect of Compiler Optimizations on High-Level Synthesis for FPGAs”. In: International Symposium on Field-Programmable Custom Computing Machines. Apr. 2013, pp. 89–96.

[33] Jason Cong, Bin Liu, Raghu Prabhakar, and Peng Zhang. “A Study on the Impact of Compiler Optimizations on High-Level Synthesis”. In: Languages and Compilers for Parallel Computing. Vol. 7760. 2013, pp. 143–157.

[34] M. Gort and J. H. Anderson. “Range and bitmask analysis for hardware optimization in high-level synthesis”. In: Asia and South Pacific Design Automation Conference. Jan. 2013, pp. 773–779.

[35] Stefan Hadjis, Andrew Canis, Jason H. Anderson, Jongsok Choi, Kevin Nam, Stephen Brown, and Tomasz Czajkowski. “Impact of FPGA Architecture on Resource Sharing in High-level Synthesis”. In: International Symposium on Field Programmable Gate Arrays. Monterey, California, USA, 2012, pp. 111–114.

[36] Andrew Canis, Jason H. Anderson, and Stephen D. Brown. “Multi-pumping for resource reduction in FPGA high-level synthesis”. In: Design, Automation and Test in Europe Conference. Mar. 2013, pp. 194–197.

[37] Cheng-Tsung Hwang, Jiahn-Hurng Lee, and Yu-Chin Hsu. “A formal approach to the scheduling problem in high level synthesis”. In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 10.4 (1991), pp. 464–475.

[38] Pierre G. Paulin and John P. Knight. “Scheduling and binding algorithms for high-level synthesis”. In: 26th Design Automation Conference. 1989, pp. 1–6.

[39] Miguel R. Corazao, Marwan A. Khalaf, Lisa M. Guerra, Miodrag Potkonjak, and Jan M. Rabaey. “Performance optimization using template mapping for datapath-intensive high-level synthesis”. In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 15.8 (1996), pp. 877–888.

[40] David C. Ku and G. De Micheli. “Relative scheduling under timing constraints: Algorithms for high-level synthesis of digital circuits”.
In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 11.6 (1992), pp. 696–718.

[41] GDB: The GNU Project Debugger. URL: https://www.gnu.org/software/gdb/ (visited on 04/29/2016).

[42] K. Wakabayashi. “CyberWorkBench: integrated design environment based on C-based behavior synthesis and verification”. In: International Symposium on VLSI Design, Automation and Test. Apr. 2005, pp. 173–176.

[43] Sameh Asaad, Ralph Bellofatto, Bernard Brezzo, Chuck Haymes, Mohit Kapur, Benjamin Parker, Thomas Roewer, Proshanta Saha, Todd Takken, and José Tierno. “A Cycle-accurate, Cycle-reproducible multi-FPGA System for Accelerating Multi-core Processor Simulation”. In: International Symposium on Field Programmable Gate Arrays. 2012, pp. 153–162.

[44] Roope Kaivola, Rajnish Ghughal, Naren Narasimhan, Amber Telfer, Jesse Whittemore, Sudhindra Pandav, Anna Slobodová, Christopher Taylor, Vladimir Frolov, Erik Reeber, and Armaghan Naik. “Replacing Testing with Formal Verification in Intel Core i7 Processor Execution Engine Validation”. In: International Conference on Computer Aided Verification. 2009, pp. 414–429.

[45] Altera. Quartus II Handbook Volume 3: Verification; 13. Design Debugging Using the SignalTap II Logic Analyzer. Tech. rep. QII5V3. Nov. 2013.

[46] Xilinx. ChipScope Pro Software and Cores: User Guide. Tech. rep. UG2029 (v14.3). Oct. 2012.

[47] Mentor Graphics. Certus ASIC Prototyping Debug Solution. URL: http://www.mentor.com/products/fv/certus (visited on 12/03/2014).

[48] Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona, Tomasz Czajkowski, Stephen D. Brown, and Jason H. Anderson. “LegUp: An Open-source High-level Synthesis Tool for FPGA-based Processor/Accelerator Systems”. In: ACM Transactions on Reconfigurable Technology and Systems 13.2 (2013), 24:1–24:27.

[49] Jongsok Choi, S. Brown, and J. Anderson. “From software threads to parallel hardware in high-level synthesis for FPGAs”.
In: International Conference on Field-Programmable Technology. Dec. 2013.

[50] White Paper: Scalable Offline Debugging. Tech. rep. Allinea, Nov. 2011.

[51] Microsoft. Debug Multithreaded Applications in Visual Studio. URL: https://msdn.microsoft.com/en-us/library/ms164746.aspx.

[52] N. Calagar, S. D. Brown, and J. H. Anderson. “Source-level debugging for FPGA high-level synthesis”. In: International Conference on Field Programmable Logic and Applications. Sept. 2014, pp. 1–8.

[53] Joshua S. Monson and Brad Hutchings. “New approaches for in-system debug of behaviorally-synthesized FPGA circuits”. In: International Conference on Field Programmable Logic and Applications. Sept. 2014, pp. 1–6.

[54] Joshua S. Monson and Brad Hutchings. “Using Source-Level Transformations to Improve High-Level Synthesis Debug and Validation on FPGAs”. In: International Symposium on Field-Programmable Gate Arrays. 2015, pp. 5–8.

[55] Joshua S. Monson and Brad Hutchings. “Using Source-to-Source Compilation to Instrument Circuits for Debug with High Level Synthesis”. In: International Conference on Field-Programmable Technology. Dec. 2015.

[56] J. S. Monson and B. Hutchings. “Using shadow pointers to trace C pointer values in FPGA circuits”. In: 2015 International Conference on ReConFigurable Computing and FPGAs. Dec. 2015, pp. 1–6.

[57] Kazutoshi Wakabayashi. Reconfigurable chip advantage compared with GPGPU from the compiler perspective. Keynote Speech, International Conference on Field-Programmable Technology. Dec. 2013.

[58] K. H. Tsoi, K. H. Lee, and P. H. W. Leong. “A massively parallel RC4 key search engine”. In: Symposium on Field-Programmable Custom Computing Machines. Apr. 2002, pp. 13–21.

[59] J. S. Beeckler and W. J. Gross. “FPGA particle graphics hardware”. In: Symposium on Field-Programmable Custom Computing Machines. Apr. 2005, pp. 85–94.

[60] K. Whitton, X. S. Hu, C. X. Yu, and D. Z. Chen. “An FPGA Solution for Radiation Dose Calculation”. In: Symposium on Field-Programmable Custom Computing Machines.
Apr. 2006, pp. 227–236.

[61] J. Lee, L. Shannon, M. J. Yedlin, and G. F. Margrave. “A multi-FPGA application-specific architecture for accelerating a floating point Fourier Integral Operator”. In: International Conference on Application-Specific Systems. July 2008, pp. 197–202.

[62] B. Vermeulen and S. K. Goel. “Design for debug: catching design errors in digital chips”. In: IEEE Design & Test of Computers 19.3 (May 2002), pp. 35–43.

[63] Hari Angepat, Gage Eads, Christopher Craik, and Derek Chiou. “NIFD: Non-intrusive FPGA Debugger – Debugging FPGA ’Threads’ for Rapid HW/SW Systems Prototyping”. In: International Conference on Field Programmable Logic and Applications. 2010, pp. 356–359.

[64] Bart Vermeulen. “Functional Debug Techniques for Embedded Systems”. In: IEEE Design and Test of Computers 25.3 (2008), pp. 208–215.

[65] K. A. Tomko and A. Tiwari. “Hardware/software co-debugging for reconfigurable computing”. In: High-Level Design Validation and Test Workshop. Nov. 2000, pp. 59–63.

[66] Brad Hutchings, Brent Nelson, Mike Wirthlin, and Doran Wilde. Unified Debug Environment for Adaptive Computing Systems. Tech. rep. Brigham Young University, Sept. 2003.

[67] EE Times. Scan design called portal for hackers. Oct. 2004. URL: http://www.eetimes.com/document.asp?doc_id=1151658 (visited on 05/16/2014).

[68] Xilinx. Virtex-6 FPGA Configuration: User Guide. Tech. rep. UG360 (v3.9). Nov. 2013.

[69] Altera. Protecting the FPGA Design From Common Threats. Tech. rep. WP-01111-1.0. June 2009.

[70] Timothy Wheeler, Paul Graham, Brent Nelson, and Brad Hutchings. “Using Design-Level Scan to Improve FPGA Design Observability and Controllability for Functional Verification”. In: International Conference on Field-Programmable Logic and Applications. Aug. 2001, pp. 483–492.

[71] A. Tiwari and K. A. Tomko. “Scan-chain based watch-points for efficient run-time debugging and verification of FPGA designs”. In: Asia and South Pacific Design Automation Conference. Jan. 2003, pp.
705–711.160Bibliography[72] Yousef Iskander, Cameron Patterson, and Stephen Craven. “High-Level Abstractions andModular Debugging for FPGA Design Validation”. In: ACM Transactions ReconfigurableTechnology and Systems 7.1 (Feb. 2014), 2:1–2:22.[73] D. Josephson and Bob Gottlieb. “The crazy mixed up world of silicon debug [IC validation]”.In: Custom Integrated Circuits Conference. Oct. 2004, pp. 665–670.[74] Joon-Sung Yang and N.A. Touba. “Enhancing Silicon Debug via Periodic Monitoring”. In:Defect and Fault Tolerance of VLSI Systems. Oct. 2008, pp. 125–133.[75] G.-J. van Rootselaar and B. Vermeulen. “Silicon debug: scan chains alone are not enough”. In:International Test Conference. Sept. 1999, pp. 892–902.[76] C. MacNamee and D. Heffernan. “Emerging on-ship debugging techniques for real-timeembedded systems”. In: Computing Control Engineering Journal 11.6 (Dec. 2000),pp. 295–303.[77] M. Abramovici, P. Bradley, K. Dwarakanath, P. Levin, G. Memmi, and D. Miller. “Areconfigurable design-for-debug infrastructure for SoCs”. In: Design Automation Conference.July 2006, pp. 7–12.[78] E. Hung and S. J E Wilton. “Speculative Debug Insertion for FPGAs”. In: InternationalConference on Field Programmable Logic and Applications. Sept. 2011, pp. 524–531.[79] Ho Fai Ko and N. Nicolici. “Mapping Trigger Conditions onto Trigger Units duringPost-silicon Validation and Debugging”. In: IEEE Transactions on Computers 61.11 (Nov.2012), pp. 1563–1575.[80] M.H. Neishaburi and Z. Zilic. “Hierarchical Embedded Logic Analyzer for AccurateRoot-Cause Analysis”. In: Defect and Fault Tolerance in VLSI and Nanotechnology Systems.Oct. 2011, pp. 120–128.[81] H.F. Ko and N. Nicolici. “Combining scan and trace buffers for enhancing real-timeobservability in post-silicon debugging”. In: European Test Symposium. May 2010, pp. 62–67.[82] Frederic Leens. Debugging FPGAs at full speed. Exostiv Labs. Apr. 2016. URL:http://www.exostivlabs.com/logic-observations/ (visited on 06/10/2016).[83] E. 
Hung and S.J.E. Wilton. “Scalable Signal Selection for Post-Silicon Debug”. In: IEEETransactions on Very Large Scale Integration (VLSI) Systems 21.6 (June 2013), pp. 1103–1115.[84] E. Anis and N. Nicolici. “Low Cost Debug Architecture using Lossy Compression for SiliconDebug”. In: Design, Automation Test in Europe. Apr. 2007, pp. 1–6.[85] E. Anis Daoud and N. Nicolici. “Real-Time Lossless Compression for Silicon Debug”. In:IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 28.9 (Sept.2009), pp. 1387–1400.[86] Xiao Liu and Qiang Xu. “On Multiplexed Signal Tracing for Post-Silicon Validation”. In:IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 32.5 (May2013), pp. 748–759.[87] S. Prabhakar and M.S. Hsiao. “Multiplexed trace signal selection using non-trivialimplication-based correlation”. In: International Symposium on Quality Electronic Design.Mar. 2010, pp. 697–704.161Bibliography[88] Joon-Sung Yang and Nur A. Touba. “Efficient Trace Signal Selection for Silicon Debug byError Transmission Analysis”. In: IEEE Transactions on Computer-Aided Design of IntegratedCircuits and Systems 31.3 (Mar. 2012), pp. 442–446.[89] Ho Fai Ko and N. Nicolici. “Automated trace signals selection using the RTL descriptions”. In:International Test Conference. Nov. 2010, pp. 1–10.[90] Steven JE Wilton, Bradley R Quinton, and Eddie Hung. “Rapid RTL-based signal ranking forFPGA prototyping”. In: International Conference on Field-Programmable Technology. 2012,pp. 1–7.[91] K. Rahmani, P. Mishra, and S. Ray. “Scalable trace signal selection using machine learning”.In: International Conference on Computer Design. Oct. 2013, pp. 384–389.[92] Johnny J. W. Kuan and Tor M. Aamodt. “Progressive-BackSpace: Efficient PredecessorComputation for Post-Silicon Debug”. In: Microprocessor Test and Verification Conference.2012, pp. 70–75.[93] Flavio M. de Paula, Amir Nahir, Ziv Nevo, Avigail Orni, and Alan J. Hu. 
“TAB-BackSpace:Unlimited-length Trace Buffers with Zero Additional On-chip Overhead”. In: DesignAutomation Conference. 2011, pp. 411–416.[94] M. Gort, F.M. De Paula, J.J.W. Kuan, T.M. Aamodt, A.J. Hu, S. J E Wilton, and Jin Yang.“Formal-Analysis-Based Trace Computation for Post-Silicon Debug”. In: IEEE Transactionson Very Large Scale Integration (VLSI) Systems 20.11 (Nov. 2012), pp. 1997–2010.[95] Yu-Chin Hsu, Furshing Tsai, Wells Jong, and Ying-Tsai Chang. “Visibility enhancement forsilicon debug”. In: Design Automation Conference. July 2006, pp. 13–18.[96] H.F. Ko and N. Nicolici. “Algorithms for State Restoration and Trace-Signal Selection for DataAcquisition in Silicon Debug”. In: IEEE Transactions on Computer-Aided Design of IntegratedCircuits and Systems 28 (Feb. 2009), pp. 285–297.[97] S. Prabhakar and M. Hsiao. “Using Non-trivial Logic Implications for Trace Buffer-BasedSilicon Debug”. In: Asian Test Symposium. Nov. 2009, pp. 131–136.[98] K. Basu and P. Mishra. “RATS: Restoration-Aware Trace Signal Selection for Post-SiliconValidation”. In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 21.4 (Apr.2013), pp. 605–613.[99] Xiao Liu and Qiang Xu. “On Signal Selection for Visibility Enhancement in Trace-BasedPost-Silicon Validation”. In: IEEE Transactions on Computer-Aided Design of IntegratedCircuits and Systems 31.8 (Aug. 2012), pp. 1263–1274.[100] K. Basu, P. Mishra, and P. Patra. “Efficient combination of trace and scan signals for postsilicon validation and debug”. In: International Test Conference. Sept. 2011, pp. 1–8.[101] Kamran Rahmani and Prabhat Mishra. “Efficient Signal Selection Using Fine-grainedCombination of Scan and Trace Buffers”. In: International Conference on VLSI Design. 2013,pp. 308–313.[102] B.R. Quinton and S. J E Wilton. “Concentrator access networks for programmable logic coreson SoCs”. In: International Symposium on Circuits and Systems. May 2005, pp. 45–48.[103] B.R. Quinton and S. J E Wilton. 
“Post-silicon debug using programmable logic cores”. In:International Conference on Field-Programmable Technology. Dec. 2005, pp. 241–247.162Bibliography[104] M.J. Narasimha. “A recursive concentrator structure with applications to self-routing switchingnetworks”. In: IEEE Transactions on Communications 42.234 (Feb. 1994), pp. 896–898.[105] Xiao Liu and Qiang Xu. “Trace Signal Selection for Visibility Enhancement in Post-siliconValidation”. In: Design, Automation and Test in Europe. Nice, France, 2009.[106] Xiao Liu and Qiang Xu. “On efficient silicon debug with flexible trace interconnection fabric”.In: International Test Conference. Nov. 2012, pp. 1–9.[107] E. Hung and S.J.E. Wilton. “Incremental Trace-Buffer Insertion for FPGA Debug”. In: IEEETransactions on Very Large Scale Integration (VLSI) Systems 22.4 (Apr. 2014), pp. 850–863.[108] Eddie Hung, Jeffrey B. Goeders, and Steven J.E. Wilton. “Faster FPGA Debug: EfficientlyCoupling Trace Instruments with User Circuitry”. In: Reconfigurable Computing:Architectures, Tools, and Applications. 2014, pp. 73–84.[109] E. Hung, A.-S. Jamal, and S.J.E. Wilton. “Maximum flow algorithms for maximumobservability during FPGA debug”. In: International Conference on Field-ProgrammableTechnology. Dec. 2013, pp. 20–27.[110] Eddie Hung and Steven J. E. Wilton. “Accelerating FPGA Debug: Increasing Visibility Using aRuntime Reconfigurable Observation and Triggering Network”. In: ACM Transactions onDesign Automation of Electronic Systems 19.2 (Mar. 2014), 14:1–14:23.[111] Xilinx. SDSoC Development Environment. Tech. rep. Accessed: 2015-06-25.[112] J. Cong and W. Rosenstiel. “The Last Byte: The HLS tipping point”. In: IEEE Design Test ofComputers 26.4 (July 2009), pp. 104–104.[113] Mentor Graphics. Catapult High-Level Synthesis. URL:https://www.mentor.com/hls-lp/catapult-high-level-synthesis/ (visited on 05/09/2016).[114] C. Pilato and F. Ferrandi. 
“Bambu: A modular framework for the high level synthesis ofmemory-intensive applications”. In: 2013 23rd International Conference on Fieldprogrammable Logic and Applications. Sept. 2013, pp. 1–4.[115] R. Nane, V. M. Sima, C. P. Quoc, F. Goncalves, and K. Bertels. “High-Level Synthesis in theDelft Workbench Hardware/Software Co-design Tool-Chain”. In: International Conference onEmbedded and Ubiquitous Computing. Aug. 2014, pp. 138–145.[116] Tomasz S Czajkowski, David Neto, Michael Kinsner, Utku Aydonat, Jason Wong,Dmitry Denisenko, Peter Yiannacouras, John Freeman, Deshanand P Singh, andStephen D Brown. “OpenCL for FPGAs: Prototyping a Compiler”. In: InternationalConference on Engineering of Reconfigurable Systems and Algorithms. 2012, pp. 3–12.[117] Jason Villarreal, Adrian Park, Walid Najjar, and Robert Halstead. “Designing modularhardware accelerators in C with ROCCC 2.0”. In: International Symposium onField-Programmable Custom Computing Machines. IEEE. 2010, pp. 127–134.[118] Eric Martin, Olivier Sentieys, He´le`ne Dubois, and Jean-Luc Philippe. “Gaut: An architecturalsynthesis tool for dedicated signal processors”. In: European Design Automation Conference.1993, pp. 14–19.163Bibliography[119] Jiajia Chen, Chip-Hong Chang, Feng Feng, Weiao Ding, and Jiatao Ding. “Novel designalgorithm for low complexity programmable FIR filters based on extended double base numbersystem”. In: IEEE Transactions on Circuits and Systems I: Regular Papers, 62.1 (2015),pp. 224–233.[120] F. Feng, J. Chen, and C. H. Chang. “Hypergraph Based Minimum Arborescence Algorithm forthe Optimization and Reoptimization of Multiple Constant Multiplications”. In: IEEETransactions on Circuits and Systems I: Regular Papers PP.99 (2016), pp. 1–12.[121] Zhiru Zhang, Yiping Fan, Wei Jiang, Guoling Han, Changqi Yang, and Jason Cong.“AutoPilot: A Platform-Based ESL Synthesis System”. In: High-Level Synthesis. Ed. byPhilippe Coussy and Adam Morawiec. Springer Netherlands, 2008, pp. 99–112.[122] C. 
Lattner and V. Adve. “LLVM: a compilation framework for lifelong program analysistransformation”. In: Code Generation and Optimization. Mar. 2004.[123] A. Canis, S. D. Brown, and J. H. Anderson. “Modulo SDC scheduling with recurrenceminimization in high-level synthesis”. In: International Conference on Field ProgrammableLogic and Applications. Sept. 2014, pp. 1–8.[124] K.S. Hemmert and B. Hutchings. “Issues in debugging highly parallel FPGA-basedapplications derived from source code”. In: Asia and South Pacific Design AutomationConference. Jan. 2003, pp. 483–488.[125] K.S. Hemmert, J.L. Tripp, B.L. Hutchings, and P.A. Jackson. “Source level debugger for theSea Cucumber synthesizing compiler”. In: Symposium on Field-Programmable CustomComputing Machines. Apr. 2003, pp. 228–237.[126] Justin L. Tripp, Preston A. Jackson, and Brad Hutchings. “Sea Cucumber: A SynthesizingCompiler for FPGAs”. In: International Conference on Field-Programmable Logic andApplications. 2002, pp. 875–885.[127] Y.S. Iskander, C.D. Patterson, and S.D. Craven. “Improved Abstractions and Turnaround Timefor FPGA Design Validation and Debug”. In: International Conference on FieldProgrammable Logic and Applications. Sept. 2011.[128] Liwei Yang, Magzhan Ikram, Swathi Gurumani, Suhaib A Fahmy, Deming Chen, andKyle Rupnow. “JIT trace-based verification for high-level synthesis”. In: InternationalConference on Field Programmable Technology (2015), pp. 228–231.[129] Liwei Yang, Swathi Gurumani, Suhaib A Fahmy, Deming Chen, and Kyle Rupnow.“Automated Verification Code Generation in HLS Using Software Execution Traces”. In:International Symposium on Field-Programmable Gate Arrays. 2016, pp. 278–278.[130] Liwei Yang, Swathi Gurumani, Deming Chen, and Kyle Rupnow. “AutoSLIDE: AutomaticSource-Level Instrumentation and Debugging”. In: International Symposium onField-Programmable Custom Computing Machines. 2016.[131] Lawrence W Zurawski and Ralph E Johnson. “Debugging optimized code with expectedbehavior”. 
In: Unpublished draft from Department of Computer Science, University of Illinoisat Urbana-Champaign (1991).[132] Urs Ho¨lzle, Craig Chambers, and David Ungar. “Debugging Optimized Code with DynamicDeoptimization”. In: Conference on Programming Language Design and Implementation.1992, pp. 32–43.164Bibliography[133] John Hennessy. “Symbolic Debugging of Optimized Code”. In: ACM Transactions onProgramming Languages and Systems 4.3 (July 1982), pp. 323–344.[134] D. S. Coutant, S. Meloy, and M. Ruscetta. “DOC: A Practical Approach to Source-levelDebugging of Globally Optimized Code”. In: Conference on Programming Language Designand Implementation. 1988, pp. 125–134.[135] Gary Brooks, Gilbert J. Hansen, and Steve Simmons. “A New Approach to DebuggingOptimized Code”. In: Conference on Programming Language Design and Implementation.1992, pp. 1–11.[136] Yuko Hara, Hiroyuki Tomiyama, Shinya Honda, and Hiroaki Takada. “Proposal andQuantitative Analysis of the CHStone Benchmark Program Suite for Practical C-basedHigh-level Synthesis”. In: Journal of Information Processing 17 (2009), pp. 242–254.[137] P. Graham, B. Nelson, and B. Hutchings. “Instrumenting Bitstreams for Debugging FPGACircuits”. In: International Symposium on Field-Programmable Custom Computing Machines.Mar. 2001.[138] M. Riley, N. Chelstrom, M. Genden, and S. Sawamura. “Debug of the CELL Processor:Moving the Lab into Silicon”. In: International Test Conference. Oct. 2006.[139] Joon-Sung Yang and N.A. Touba. “Expanding Trace Buffer Observation Window forIn-System Silicon Debug through Selective Capture”. In: VLSI Test Symposium. Apr. 2008.[140] D. Chatterjee, C. McCarter, and V. Bertacco. “Simulation-based signal selection for staterestoration in silicon debug”. In: International Conference on Computer-Aided Design. Nov.2011, pp. 595–601.[141] E. Matthews, L. Shannon, and A. Fedorova. “A configurable framework for investigatingworkload execution”. In: International Conference on Field-Programmable Technology. 
Dec.2010, pp. 409–412.[142] Thomas J LeBlanc and John M Mellor-Crummey. “Debugging parallel programs with instantreplay”. In: IEEE Transactions on Computers 100.4 (1987), pp. 471–482.[143] E.W. Dijkstra. “A note on two problems in connexion with graphs”. In: NumerischeMathematik 1.1 (1959), pp. 269–271.[144] Yannick Rochat. “Closeness centrality extended to unconnected graphs: The harmoniccentrality index”. In: ASNA. EPFL-CONF-200525. 2009.[145] Markus Schordan and Dan Quinlan. “A Source-To-Source Architecture for User-DefinedOptimizations”. In: Modular Programming Languages. Ed. by Lszl Bszrmnyi andPeter Schojer. Vol. 2789. Lecture Notes in Computer Science. Springer Berlin Heidelberg,2003, pp. 214–223.[146] Felix Winterstein, Samuel Bayliss, and George Constantinides. “Separation Logic-AssistedCode Transformations for Efficient High-Level Synthesis”. In: International Symposium onField-Programmable Custom Computing Machines. May 2014, pp. 1–8.165

