Open Collections

UBC Theses and Dissertations

An FPGA overlay architecture supporting software-like compile times during on-chip debug of high-level… Jamal, Al-Shahna 2018

An FPGA Overlay Architecture Supporting Software-Like Compile Times during On-Chip Debug of High-Level Synthesis Designs

by

Al-Shahna Jamal

BASc. in Computer Engineering, University of British Columbia, 2016

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Applied Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Electrical and Computer Engineering)

The University of British Columbia
(Vancouver)

August 2018

© Al-Shahna Jamal, 2018

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

An FPGA Overlay Architecture Supporting Software-Like Compile Times during On-Chip Debug of High-Level Synthesis Designs

submitted by Al-Shahna Jamal in partial fulfillment of the requirements for the degree of Master of Applied Science in Electrical and Computer Engineering.

Examining Committee:

Steve Wilton, Computer Engineering (Supervisor)
Guy Lemieux, Computer Engineering (Examining Committee Member)
Sudip Shekhar, Computer Engineering (Examining Committee Member)

Abstract

High-Level Synthesis (HLS) promises improved designer productivity by allowing designers to create digital circuits targeting Field-Programmable Gate Arrays (FPGAs) using a software program. Widespread adoption of HLS tools is limited by the lack of an on-chip debug ecosystem that bridges the software to the generated hardware, and that addresses the challenge of long FPGA compile times.

Recent work has presented an in-system debug framework that provides a software-like debug experience by allowing the designer to debug in the context of the original source code.
However, like commercial on-chip debug tools, any modification to the on-chip debug instrumentation requires a system recompile that can take several hours or even days, severely limiting debug productivity.

This work proposes a flexible debug overlay family that provides software-like debug turn-around times for HLS generated circuits (on the order of hundreds of milliseconds). This overlay is added to the design at compile time, and at debug time can be configured many times to implement specific debug scenarios without a recompilation.

We propose two sets of debug capabilities, and their required architectural and CAD support. The first set forms a passive overlay, the purpose of which is to provide observability into the underlying circuit without changing it. In this category, the cheapest overlay variant allows selective variable tracing with only a 1.7% increase in area overhead from the baseline debug instrumentation, while the deluxe variant offers a 2x-7x improvement in trace buffer memory utilization with conditional buffer freeze support.

The second set of capabilities is control-based, where the overlay is leveraged to make rapid functional changes to the design. Supported functional changes include applying small deviations in the control flow of the circuit, or the ability to override signal assignments to perform efficient "what if" tests. Our overlay is specifically optimized for designs created using an HLS flow; by taking advantage of information from the HLS tool, the overhead of the overlay can be kept low. Additionally, all the proposed capabilities require the designer to only interact with their original source code.

Lay Summary

Field-Programmable Gate Arrays (FPGAs) are becoming prominent hardware acceleration platforms in data centers, with companies like Microsoft, Amazon and Baidu using them to drive compute-heavy applications, such as Artificial Intelligence workloads.
Hardware expertise is typically required to use FPGAs, yet more recent High-Level Synthesis (HLS) tools allow these devices to be programmed using a software language, potentially enabling software engineers to reap the speed and energy advantages of FPGAs. However, long FPGA compile times remain a significant bottleneck during the development phase, lowering debug productivity (the most expensive part of a project). In this thesis we propose an HLS-oriented debug overlay that allows the designer to both view the internal activity of the circuit while it executes at-speed, and apply limited functional changes to the design, without a system recompile. We believe such an overlay can improve debug productivity and increase the adoption of FPGAs by software engineers.

Preface

The contributions in Chapters 3 and 4 of this thesis have been published in conference proceedings [1], and the contributions in Chapters 5 and 6 have been accepted for publication in [2]. Both works were funded by Intel's Strategic Research Alliance (ISRA) grant.

All the research presented in this thesis was primarily conducted by the author, Al-Shahna Jamal. Domain knowledge was provided by Dr. Jeffrey Goeders, who is a co-author on both conference papers. This was done under the supervision and guidance of Dr. Steve Wilton, who also provided editorial support for all written publications.

[1] Al-Shahna Jamal, Jeffrey Goeders, Steven J.E. Wilton. "Architecture Exploration for HLS-Oriented FPGA Debug Overlays." ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Feb 2018. pp. 209-218.

[2] Al-Shahna Jamal, Jeffrey Goeders, Steven J.E. Wilton. "An FPGA Overlay Architecture Supporting Rapid Implementation of Functional Changes during On-Chip Debug." International Conference on Field Programmable Logic and Applications. To appear Aug 2018.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgments

1 Introduction
  1.1 High-Level Synthesis
  1.2 HLS Debug and Challenges
    1.2.1 Motivation for On-Chip Debug
    1.2.2 Challenges with On-Chip Debug for HLS Designs
  1.3 Contributions
  1.4 Thesis Organization

2 Related Work
  2.1 A Summary of the HLS Compiler Flow
    2.1.1 Front-End Software Compilation
    2.1.2 IR Optimizations
    2.1.3 Back-End RTL Generation
  2.2 HLS Tools
    2.2.1 Vendor Support for HLS Debug
  2.3 Related Work
    2.3.1 In-System HLS Debugging
    2.3.2 Debug Overlays
    2.3.3 Support for Circuit Rectification
  2.4 The Debug Framework
  2.5 HLS Benchmarks
  2.6 Summary
3 Passive Overlay Capabilities and Architecture
  3.1 Passive Overlay Debug Capabilities
  3.2 Passive Overlay Architecture
    3.2.1 Selective Variable Tracing
    3.2.2 Selective Function Tracing
    3.2.3 Conditional Buffer Freeze
  3.3 Summary

4 Passive Overlay Results
  4.1 Selective Variable Trace: Variant A
  4.2 Impact of Line Packer: Variant B
  4.3 Selective Function Tracing
  4.4 Conditional Buffer Freeze
  4.5 Impact on FPGA Compile Times
  4.6 Results Summary and Discussion

5 Control-Based Overlay Capabilities and Architecture
  5.1 Motivation - Example Use Cases
  5.2 Control Overlay Debug Capabilities
    5.2.1 Support for Altering Control Flow
    5.2.2 Variable Assignment Override
  5.3 Architecture for Control Flow Changes
    5.3.1 Control Flow (CF) Variant - No Conditional Support
    5.3.2 Conditional Control Flow (CCF) Variant
  5.4 Architecture for Overriding Variables at Runtime
    5.4.1 Dataflow (DF) Variant - No Conditional Support
    5.4.2 Conditional Dataflow (CDF) Variant
  5.5 Summary

6 Control-Based Overlay Results
  6.1 Control Flow Overlay Opportunities
  6.2 Dataflow Overlay Opportunities
  6.3 Impact of Control Flow Overlays
  6.4 Impact of Dataflow Overlays
  6.5 Discussion: Circuit Optimizations

7 Conclusion and Future Work
  7.1 Contributions
  7.2 Debug Overlay Usage
  7.3 Limitations and Future Work
    7.3.1 Control-Flow versus Data-Flow Circuit Models
    7.3.2 Debugging Optimized Circuits
    7.3.3 Debugging Parallel Circuits
    7.3.4 Evaluating our Debug Overlay using other HLS Benchmarks
    7.3.5 Overlay Configurability

Bibliography

List of Tables

Table 2.1 Previous Work - Instrumentation Overhead
Table 3.1 Number of States versus Trace States
Table 4.1 Trace Window Length
Table 4.2 Control Flow Tracing Results for adpcm
Table 5.1 Summary of Proposed Overlay Capabilities
Table 6.1 Acronyms used for Control-Based Overlay Variants
Table 6.2 Overhead of Base Overlays with and without Signal Reconstruction
Table 6.3 Quantifying Overlay Control Flow Support for -O3 Compiled Benchmarks
Table 6.4 Data Override Opportunities for -O3 Circuits
Table 6.5 Impact of Control Flow Overlay Variants on Fmax
Table 6.6 Impact of Dataflow Overlays on Fmax (MHz)
List of Figures

Figure 1.1 Types of bugs in an HLS System [26]
Figure 1.2 Waveform viewing of trace buffer recording - not practical for a software engineer
Figure 1.3 Previous debug workflow
Figure 1.4 Proposed debug workflow
Figure 2.1 The HLS Flow [48]
Figure 2.2 Input Code on the left transformed into Static Single Assignment (SSA) form on the right [51]
Figure 2.3 Baseline Instrumentation from [25] shows interaction with User Circuit and Debug Workstation
Figure 2.4 Baseline Trace Scheduler Instrumentation and trace buffer contents of hypothetical program
Figure 3.1 Debug Overlay inserted with the user circuit at compile time, and configurable at debug time
Figure 3.2 Selective Variable Tracing - Variant A
Figure 3.3 Selective Variable Tracing Examples. Variant A (a and b) versus additional compression achieved with Variant B (c)
Figure 3.4 Selective Variable Selection - Variant B
Figure 3.5 Line packer architecture with G=4
Figure 3.6 Conditional Buffer Freeze Architecture
Figure 4.1 Impact of Line Packer Granularity (G) on Trace Window Size - Variant B
Figure 4.2 Impact of Line Packer Granularity (G) on Area - Variant B
Figure 4.3 Impact of Number of Sub-units (C) on Area for Conditional Buffer Freeze Architecture
Figure 4.4 FPGA Compile Time versus Overlay Configuration Time (seconds)
Figure 5.1 Supported Control Flow Changes. Arrows represent conditional branches that can be modified.
Figure 5.2 Variable assignment override. The value assigned to a variable can be modified.
Figure 5.3 Control Flow Instrumentation for Branches
Figure 5.4 Conditional Control Flow (CCF) Overlay
Figure 5.5 Re-ordering optimization performed by HLS tool - Example
Figure 5.6 Data Assignment Override (Dataflow) Overlay
Figure 5.7 Conditional Dataflow (CDF) Overlay
Figure 6.1 Impact of Control Flow Overlays on Area
Figure 6.2 Impact of Dataflow Architectures on Area

List of Abbreviations

ALM     Adaptive Logic Module
CAD     Computer Aided Design
CPU     Central Processing Unit
ELA     Embedded Logic Analyzer
FPGA    Field-Programmable Gate Array
FSM     Finite State Machine
GPU     Graphics Processing Unit
GUI     Graphical User Interface
HDL     Hardware Description Language
HLS     High-Level Synthesis
ILP     Instruction-Level Parallelism
IR      Intermediate Representation
OPENCL  Open Computing Language
RTL     Register-Transfer Level
SDK     Software Development Kit
SSA     Static Single Assignment

Acknowledgments

To my supervisor Steve Wilton, thank you for encouraging me to pursue graduate studies and for championing me through countless opportunities. Your mentorship and guidance cannot be acknowledged enough. Through your example, excellent teaching, leadership and kindness, you have allowed me to develop as a researcher, in my professional life and in my personal life.
I am deeply grateful for all the time and energy you devote to your students.

Special thanks to the (past and present) members in our research group: Jeffrey Goeders, Eddie Hung, Fatemeh Eslami, Jose Pinilla, Daniel Holanda, Bahar Salehpour, Pavan Bussa, and Sarah Mashayeki; and several others in our lab: Hossein Omidian, Max Golub, Amin Azar, Dave Evans, Mohammed Ewais, Mohammed Omran, Khaled Essam and John Deppe. You have all provided me with encouragement during the hectic paper deadlines, advice during roadblocks I encountered during my research, and incredibly helpful feedback on my presentations. I am continually impressed and motivated by how driven our group is, both in research and in seeking adventure. A number of you have encouraged me to do more of the latter and I will absolutely miss all the coffee break walks, lunches, dinners, puzzle-time, hikes, yoga, ice-skating, volleyball, travel and more! To everyone in the SoC lab at UBC, you made coming in to work every day for the last two years an absolute pleasure.

Financial support for this work was provided by the Intel Strategic Research Alliance (ISRA) grant. Many thanks to the ISRA program participants for their helpful feedback during our research updates.

Finally, I would like to express my deepest gratitude to my family and friends, whose love and support has been instrumental in allowing me to achieve my goals. To my parents Salim and Shirzad, your hard work and dedication to my siblings and myself is humbling and inspiring. Thank you for believing in us and providing for us. To my siblings Shahzaleen, Jahan Ara, Aly-Khan and Aly-Shah, words cannot express how much I love you all, and how appreciative I am of your collective wisdom and emotional support through both my brightest and darkest hours. To Benafsha and Yohan, my two cheerleaders and confidantes; thank you for reserving a spot on your couch for me and my never-ending stories, I always leave your home feeling energized.
To my extended family and friends, near, far and in different time zones, I am grateful for your continued love and support, always.

Chapter 1

Introduction

1.1 High-Level Synthesis

Recent years have seen the emergence of Field-Programmable Gate Arrays (FPGAs) as mainstream compute accelerators. These chips are integrated circuits characterized by a reconfigurable fabric that can implement any digital circuit [46]. Companies such as Amazon, IBM, Baidu, and Microsoft have invested significant resources in understanding how FPGAs can be used to accelerate cloud-based computing. For FPGAs to thrive in this new role, new programming frameworks are essential. FPGA vendors such as Xilinx and Intel have responded by creating high-level synthesis (HLS) tools [3, 58] which allow designers to specify behaviour in a software language (C/OpenCL) and automatically compile this code to a hardware implementation. HLS technology promises significant productivity improvements for hardware engineers and may someday open the door for software designers to enjoy the speed and energy advantages of using FPGAs.

Traditionally, FPGAs are programmed using a hardware description language (HDL) such as SystemVerilog or VHDL. These languages look very different from a software language and require a steep learning curve. A trained hardware engineer can use an HDL to specify the low-level structure and cycle-by-cycle behaviour (i.e. timed behaviour) of a digital circuit; an FPGA vendor toolchain such as Intel's Quartus II or Xilinx's Vivado Design Suite is then used to synthesize the HDL description into a graph or "netlist" of logic gates, IP cores and flip-flops, which is then mapped, placed and routed onto the FPGA.
Using an HDL not only requires hardware expertise, it is known to be a cumbersome process that can lead to long development cycles, and this impacts the design's time to market [46].

HLS raises the abstraction from specifying the low-level or Register-Transfer Level (RTL) architecture of the design to specifying the functional description of the design (untimed) in a software language. This benefits hardware engineers because it significantly decreases development time and allows them to explore different design solutions faster, and it also benefits software engineers, who can enjoy the computational and energy advantages of using an FPGA without requiring the same level of hardware expertise. An HLS tool is then responsible for transforming the software functional specification into a fully timed RTL description, complete with a datapath and a controller [18, 46].

The key steps of an HLS tool, as covered by [18], are to compile the software design, analyze its computational needs and allocate hardware resources (functional units and storage components), schedule the required operations into clock cycles, bind the operations to the functional units and the variables to storage elements, and finally, generate the RTL architecture. This HLS-generated RTL can then be passed as input to a vendor's compiler for the place and route process.

The time to place and route a design is not insignificant; the FPGA toolchain comprises a series of complex heuristic algorithms that can take several hours or even days to complete for large designs [55]. This poses a significant challenge during on-chip debug, the motivations for which are discussed in Section 1.2.
If the design exhibits unwanted behaviour on chip, and the process of debugging and fixing the design requires multiple place and route iterations, it becomes clear that debug and verification is a severe bottleneck during the FPGA design process. The situation becomes even more complex when debugging a design that has been generated by an HLS tool, as discussed in Section 1.2.2.

Figure 1.1: Types of bugs in an HLS System [26]

1.2 HLS Debug and Challenges

1.2.1 Motivation for On-Chip Debug

When using an HLS flow, because designs are written in a software language (like C or OpenCL), the first approach to debug is to execute the software design on a standalone processor using existing mature software debug technologies. For a design written in C, one would use GDB [56]. This approach of software emulation is only useful for catching bugs that are self-contained within the software design itself (i.e. the kernel-level bugs in Figure 1.1). Kernel-level bugs are easily reproducible as they exhibit their behaviour every time the software design is executed. Examples of these include syntactic errors, algorithmic errors, and incorrect loop bounds. However, the software emulation approach will not help with finding the RTL-level bugs or system-level bugs shown in Figure 1.1.

RTL-level bugs are bugs that may be caused by errors in the HLS tool, or in how the tool is used. For example, the designer may specify a desired circuit behaviour using a software construct or pragma that is dealt with in an unexpected way by the HLS tool. A common method for uncovering these types of bugs is co-simulation, where the original C code and the HLS generated RTL code are simulated at a workstation and mismatches in the output help locate the source of an error [10]. RTL simulation tools simulate the detailed hardware execution and provide full visibility into all the internal signals of the design.
Co-simulation is often used as a confidence measure for the designer to confirm that the HLS generated RTL produces the same outputs as their software design for a number of test inputs. However, there are some cases where RTL simulation falls short. It is limited to kHz speeds [53], which is orders of magnitude slower than running the design on the FPGA itself (MHz or GHz). In larger systems, some bugs may only appear after long run-times (for example, after the billions of cycles required to boot an operating system [53]), meaning they cannot practically be observed using simulation. If the HLS generated block is connected to a larger system containing "black-box" legacy IP blocks for which simulation models are not available, then a hardware simulation will not catch integration errors. Additionally, it is difficult during simulation to exercise the design-under-test (DUT) with all possible inputs that it could be exposed to in production (e.g. network traffic). Even if that were possible, as already mentioned, simulation is orders of magnitude slower than actual hardware execution, making this impractical. The types of bugs that emanate from interfaces, that are dependent on I/O data patterns, that are hard to reproduce or that require long run-times before they manifest are system-level bugs. These are the truly difficult bugs that require actually running the design in hardware, at speed.

1.2.2 Challenges with On-Chip Debug for HLS Designs

Currently, there are no commercial offerings for in-system debug tools designed specifically for an HLS flow. Instead, general purpose hardware debugging technology (familiar to hardware RTL designers) called Embedded Logic Analyzers (ELAs) can be used.
The key to ELA technology is the instrumentation that is added to the user circuit at the RTL level to record its behaviour as the circuit runs.

Debugging an HLS generated design running on an FPGA is challenging for three reasons: (a) there is limited observability into the internal behaviour of the hardware design compared to the software design, (b) existing ELA tools provide visibility in the context of the generated RTL design, which may not have meaning to an HLS user, and (c) any modification to the debug instrumentation requires a lengthy recompile.

Figure 1.2: Waveform viewing of trace buffer recording - not practical for a software engineer

First, unlike the full visibility offered during software or RTL simulation, when debugging hardware, the limited number of I/O pins means that it is difficult to observe the internal behaviour of all signals in the design. FPGA vendors offer ELA tools such as SignalTap II [1] and Vivado's ILA core [59], which store the behaviour of selected signals on-chip for later interrogation. This is a trace-based technique, where the design is instrumented to store the activity of user-selected signals into on-chip memories called trace buffers. The trace data can then be downloaded onto a workstation for offline inspection to gain an understanding of the behaviour of the design and to identify the root cause of observed errors. Since on-chip memory is limited, trace buffers are configured as circular memories so that new data from the instrumented signals overwrites the oldest data. This means that only a slice of the circuit execution is available in the trace buffers, and multiple circuit runs are required to gather data from different portions or time slices of the design. Depending on the number of signals being traced and the amount of trace buffer memory available, this slice is anywhere from tens to thousands of cycles and is called the "trace window" [25].
By observing a smaller percentage of signals within a debug iteration, longer trace window lengths may be recorded [7].

Second, existing ELA tools provide visibility in the context of the RTL design rather than the original source code. An example of this is shown in Figure 1.2, where the recording from the trace buffer is shown in a waveform format that is familiar to hardware engineers. Understanding the waveforms produced by these tools and relating them to the original software-like code is difficult, since the HLS generated hardware looks very different from the original software. Several optimizations and transformations occur during the HLS flow, which means that there will not be a one-to-one mapping from signals in the hardware to variables in the software. Additionally, HLS tools extract parallelism from the software in order to generate accelerated hardware, and so there may be instruction reordering and parallel computations that were not described in the software description. There has been a significant amount of work in recent years to address this problem. Prior work such as [9, 25, 29, 41, 45] provides the ability to debug an HLS design as it is running on an FPGA. Importantly, these systems provide the ability to run the design at-speed, and record variables in on-chip memories for later replay using a familiar software-like debugging GUI. Specifically, [25] focuses on creating a mapping between the software and the hardware that allows the designer to debug in the context of the source code, and uses specialized on-chip debug instrumentation that achieves a 127x improvement in "trace window" length compared to traditional ELA tools by taking advantage of information from the HLS tool.

The third challenge with trace-based debugging is that if the user wants to modify the circuit to gather more information, the instrumentation needs to be updated and the entire circuit must be recompiled. An example of the typical debug workflow is shown in Figure 1.3.
In the figure, a debug turn involves instrumenting and compiling the design, running the chip, observing the behaviour and using this information to deduce the root cause or set up another debug turn. Typically, many debug turns are required to find the root cause of unexpected behaviour, and so it is critical to realize rapid instrumentation techniques between debug turns to speed up these iterations. Efforts have been made to reduce these compile times in [7, 19, 33]. For example, in [7], Bussa explored incremental compilation features in a commercial FPGA tool, where only the modified debug instrumentation is recompiled and the user circuit remains as untouched as possible. However, gains of only a 40% reduction in compile time were achieved, which falls far short of the software-like turn-around times that would be expected by HLS designers.

Figure 1.3: Previous debug workflow

1.3 Contributions

In this work, we present an HLS-oriented debug overlay which provides software-like turn-around times between debug turns for circuits created by an HLS tool (i.e. on the order of hundreds of milliseconds). As shown in Figure 1.4, an RTL description of the overlay architecture is added to the user circuit before it is compiled - one slow compile to FPGA bitstream. The overlay is flexible enough to implement a variety of debug scenarios; debug scenarios may describe specific variables that should be captured or regions of code that should be traced. At debug time, between debug turns, the user can configure the overlay to implement a specific debug scenario without recompiling the design or overlay.

Figure 1.4: Proposed debug workflow

In this way, the user can rapidly switch between debug scenarios as his or her understanding of the behaviour of the circuit evolves, while also ensuring that the user circuit does not change between debug turns.
In this thesis, two types of overlays are presented: a passive overlay whereby the designer can observe the underlying user circuit, and a control-based overlay that allows the user to make limited functional changes to the underlying circuit to potentially aid debugging. Unlike previous debug overlays [19], our architecture is optimized for HLS circuits and is intended to be tightly integrated into an HLS tool.

One of the main research challenges that arise when using FPGA-based overlays is area overhead [55], and this is true in our architecture. The debug scenario implemented by our overlay is encoded in a set of "overlay configuration bits" which represent area overhead in the fabric. Clearly, there is a trade-off between the amount of flexibility and the fabric's overhead. In addition to describing the overlay architecture, this thesis also presents an architecture study to better understand this trade-off. We identify a set of "capabilities" and determine how much overhead is incurred if each capability is supported by the fabric.

The contributions of our work are as follows:

1. The architecture and associated CAD for a passive overlay family. The capabilities supported by this overlay family enable the user to flexibly observe the underlying user circuit, and permit user-programmable conditions that specify when the debug instrumentation should stop tracing (i.e. around a point of interest).

2. Results that show how our passive overlay family affects the clock speed and area of the original user circuit, and how parameters in our architecture impact the achievable trace window length.

3. The architecture and associated CAD for a control-based overlay family. The capabilities supported by this overlay enable the user to make functional changes to the user circuit to further accelerate debugging, or to perform quick "what if" tests.

4.
Experimental results that quantify to what extent the user can use our control-based overlay to change the behaviour of a design, as well as the area and delay overhead incurred.

We believe that the results of our architecture study would be essential information for FPGA vendors that wish to create an in-system HLS debug overlay such as ours.

1.4 Thesis Organization

This thesis is organized as follows. Chapter 2 presents related work on in-system debug for both RTL and HLS based designs. Section 2.4 details the debug framework we build upon in this work to develop our overlay architecture, and how we leverage this previous work to access important information from the HLS tool and continue to provide a software-like debug experience.

Chapter 3 details the passive overlay architecture and the supported capabilities, while Chapter 4 presents the results quantifying the impact of the passive overlay on clock speed and area of the circuits. Our control-based overlay capabilities and architecture are described in Chapter 5, while the results quantifying the control overlay opportunities and impact on delay and area are discussed in Chapter 6. In Chapter 7 we discuss the limitations of our work, and future research directions to address these limitations. Chapter 7 also concludes the thesis.

Chapter 2

Related Work

This chapter first provides an overview of the high-level synthesis flow, in order to provide the necessary background for understanding how the input software is translated into hardware. This chapter then describes popular commercial and academic HLS compilers available today, followed by related work for on-chip debug and how our contributions in this thesis differ from these works. Finally, Section 2.4 provides a detailed overview of the framework upon which we prototype our ideas.

2.1 A Summary of the HLS Compiler Flow

This section provides a summary of the major components of an HLS flow.
This summary is largely based on the academic tool LegUp, which is an open-source HLS flow [47]. Based on [18, 46], other academic HLS tools and commercial HLS tools are similar; however, their back-end pass may be optimized for specific target architectures and may also contain other proprietary optimizations.

As shown in Figure 2.1, the HLS flow has three main components: a front-end which translates the input software program (in this case C code) into a machine-independent assembly language or intermediate representation (IR), standard compiler optimizations that are performed on this IR, and a back-end that translates the optimized IR into RTL. Each component will be described below.

Figure 2.1: The HLS Flow [48]

2.1.1 Front-End Software Compilation

Typically, a compiler front-end is used to compile the input software program to a machine-independent lower-level encoding [46]. LegUp is built within a state-of-the-art compiler suite called LLVM, which offers multiple front-ends to compile high-level software languages to its internal assembly-like language called LLVM Intermediate Representation (IR) [42]. For example, for an input C program, LLVM offers Clang to compile the C code to IR. Commercial HLS tools such as Xilinx's Vivado HLS [58] and Intel's OpenCL [3] both use LLVM.

LLVM IR is in Static Single Assignment (SSA) form, where each variable in the IR code is assigned a value only once, as shown in the example in Figure 2.2. SSA is a standard compiler method that improves the efficiency of further compiler optimizations [4].
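As a small hand-worked illustration of SSA (the IR below approximates what Clang/LLVM would emit after promoting the variable to a register; the function and value names are illustrative, not actual LegUp output), a C variable that is assigned twice becomes two distinct SSA values:

```llvm
; C source:   int x = a + b;  x = x * 2;  return x;
define i32 @f(i32 %a, i32 %b) {
entry:
  %x1 = add nsw i32 %a, %b    ; first assignment to x
  %x2 = mul nsw i32 %x1, 2    ; second assignment creates a new SSA value
  ret i32 %x2
}
```

Each assignment to x in the C code yields its own IR value (%x1, %x2), so a single source variable can correspond to several IR values.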
It is important to note that there will be a unique IR value for each variable assignment in the C code, resulting in a possible one-to-many mapping between a C variable and its IR values.

The front-end also generates a Control and Data Flow Graph (CDFG) representation of the high-level code. The CDFG is comprised of nodes called "Basic Blocks". A basic block is a piece of straight-line code with no jumps in or out of the middle of the block; control instructions form edges between basic block nodes in the CDFG [16].

Figure 2.2: Input Code on the left transformed into Static Single Assignment (SSA) form on the right [51].

2.1.2 IR Optimizations

Once the high-level code has been transformed into IR, a series of optimization passes can be performed on the IR. There are two main kinds of optimization passes: standard compiler optimizations and HLS-based IR optimizations. Standard compiler optimizations are machine-independent and can be applied using the -O3 option (available in LLVM), which performs several passes such as loop unrolling, inlining, common subexpression elimination, and scalar promotion. HLS-based optimizations are target-dependent. These optimizations are specific to the HLS tool and the target hardware architecture. Examples of HLS-based optimizations are bit-width optimizations to reduce area in the final hardware [39], and extracting Instruction-Level Parallelism (ILP) within a basic block [46] to achieve hardware acceleration.

User directives, sometimes called pragmas, can also be supplied to the HLS tool at the source level to enable further hardware optimizations. Examples of these include specifying which loops to pipeline, or how to perform memory banking to achieve parallel memory accesses [3, 46, 60].

2.1.3 Back-End RTL Generation

This stage of the HLS flow transforms the HLS-optimized IR into the final hardware design.
The back-end consists of four main components, as shown in Figure 2.1: Allocation, Scheduling, Binding and RTL Generation.

In the allocation phase, the hardware resources required by the circuit are determined. For each IR operation, a functional unit (adder, multiplier, memory resource) required to complete that operation is identified. In the case where resources (dividers, multipliers) are limited, resource sharing is applied between IR operations.

In the scheduling phase, each IR operation is assigned to a specific hardware cycle during the hardware execution. Control logic is then built to implement the schedule. In LegUp, the control logic is a Finite State Machine (one for each generated hardware module) where each state drives one or more IR operations depending on data dependencies and available hardware resources. One of the most common algorithms for scheduling is the System of Difference Constraints (SDC) formulation [12]; this algorithm is used in LegUp.

During binding, IR operations are assigned to actual functional units, and IR variables to memories and registers in the final circuit. If a hardware resource, such as a divider, is scarce and needs to be shared between multiple operations, multiplexing logic is generated. Since registers are ubiquitous in FPGAs, IR variables that can be mapped to registers are typically assigned to unique physical resources.

The final step is RTL generation, where an HDL file containing the Verilog implementation is generated. After this step, vendor EDA tools can be used for synthesis and place and route.

2.2 HLS Tools

In this section we highlight commercial and academic HLS tools and the debug infrastructure available with these tools. An extensive list of existing HLS tools can be found in [46]; here we only highlight the tools offered by the largest FPGA vendors.

There are two types of HLS tools offered by FPGA vendors today: C-based IP development tools and OpenCL-based frameworks that target heterogeneous platforms.
When using an HLS tool to generate an RTL IP core for an algorithm, the designer still needs to integrate the IP core with the rest of their system, which must contain infrastructure to stream data in and out of the IP core. In an OpenCL flow, the tool automatically generates and integrates hardware for data transfer between a host (usually an ARM core) and the FPGA. The OpenCL programming paradigm is to specify which code runs on the host, and which code (called kernels) should be accelerated on the FPGA. HLS routines are applied to the specified kernels and the tool automatically sends data between the host and FPGA via PCIe [60].

Intel's HLS compiler and Vivado's HLS compiler are both IP development tools. Intel's HLS compiler supports a subset of C++ and converts individual functions into RTL modules [17]. Vivado HLS enables IP core generation by accepting C, C++ or SystemC as input [58]. Vivado HLS also supports threads in SystemC and generates parallel hardware modules for each thread that can be run simultaneously, allowing the designer to specify more coarse-grained parallelism [58].

Intel and Xilinx also both offer their own OpenCL tools (Intel's OpenCL SDK for FPGAs and Xilinx's SDx flow). Under the hood, these tools still perform the same HLS methods for the kernels that are to be accelerated using the FPGAs, but contain extra infrastructure that automatically takes care of communication between the host and accelerator.

All of these commercial tools are closed-source. In our work, we require an open-source HLS tool because we use information from the HLS tool to construct our debug overlay, and we modify the tool to automatically insert the overlay with the user design. In this work we have chosen LegUp as our research platform. A list of other academic HLS tools can also be found in [46].
LegUp was first released in 2011 by the University of Toronto, which has since released four major versions of its HLS tool for academic research purposes, specifically targeting Intel FPGAs. In July 2017, LegUp was commercialized, and a fifth version of the tool was released supporting FPGAs from Xilinx, Lattice, Microsemi and Achronix [14]. Prior to the commercial release, a survey of HLS tools in [46] showed that LegUp produces similar quality of results when compared to commercial tools. Since LegUp has been used by several HLS researchers internationally, has been shown to produce state-of-the-art hardware, and has an open-source academic version of the tool, we use LegUp in this work to prototype and evaluate our work.

2.2.1 Vendor Support for HLS Debug

In terms of the debug infrastructure that commercial HLS tools offer, both vendor IP HLS and OpenCL tools offer two levels of verification. The first is to compile the C-based design to an x86-64 architecture that can be debugged using existing debuggers from the GNU compiler collection. Intel's HLS compiler supports run-time tracking of datatypes during the x86-64 emulation, specifically checking for overflow [17]. When the designer is satisfied with their software implementation and is ready to target an FPGA device, they run their code through the HLS tool, which produces both the RTL design and an RTL simulation model of the circuit given an input software testbench. Co-simulation can then be executed to confirm that the HLS tool produced a functionally correct design.
Vivado HLS supports a source-level debugger in simulation mode, instead of providing a waveform [58]. However, neither commercial tool supports HLS-oriented in-system debugging; rather, designers must resort to general hardware debug methods using ELA tools such as ChipScope and SignalTap II [1, 57], whose limitations when applied to HLS designs were discussed previously.

2.3 Related Work

2.3.1 In-System HLS Debugging

Early work on developing an HLS-oriented debugger for the JHDL-based Sea Cucumber framework was discussed in [29]. This debugger used an FPGA's readback feature to take a snapshot of the state and map it back to the original Java code. Later frameworks were described in [9, 25, 45], all of which revolve around trace buffers. The work in [25] is the framework in which we evaluate our ideas; details of their instrumentation architecture will be provided in Section 2.4.

The frameworks in [9, 25, 45] include instrumentation that is added to the design at compile time, and require the designer to select a priori the signals to be traced. If the user wants to change the instrumentation at debug time, for example to trace a different set of signals, the instrumentation and user circuit need to be recompiled. This is addressed in [7], which shows that it is possible to use a commercial incremental design flow to recompile the changing debug instrumentation. However, only a 40% reduction in debug turn-around time was achieved. This is because commercial incremental compile is designed to preserve the compilation results of unchanged logic. In-system debug instrumentation is typically intrusive to the design in that it "taps" into the user circuit. This causes additional processing overhead during the incremental compile flow.

The amount of trace data that can be stored on chip is limited, and so signal selection plays an important role in capturing signals that the designer deems important.
However, this is challenging because at compile time the designer does not yet know where the bugs will occur, and when attempting to find the root cause of a bug, repeatedly selecting different signals to observe becomes a time-consuming process. Several algorithms have been proposed in the literature for automating signal selection. Examples of these are selecting signals that maximize state restoration offline, a technique where observed signal values are used to deduce the value of unobserved signals [40], and exploiting error propagation in sequential circuits [61], where flip-flops that are the most sensitized to other flip-flops are marked as candidate signals. These techniques were applied in the context of general digital circuits, whereas in [25], Goeders presents a restoration solution specifically for HLS circuits, minimizing the set of source code variables that need to be recorded on-chip to achieve the highest restoration ratio of non-observed variables. Orthogonally, [23] explores off-chip capture of the trace data, since off-chip storage does not suffer from the capacity limitations of on-chip memory. Typically, streaming trace data to off-chip memory is not feasible due to limited memory bandwidth; however, Goeders leverages the structure of HLS circuits combined with static analysis to reduce the bandwidth of the execution trace, making off-chip tracing a viable option. The work in [23] still requires the user to select signals a priori, and this selection remains fixed unless the circuit is recompiled.

There has also been work that investigates instrumenting the design at the source level (the C code) instead of at the RTL level [29, 45, 49]. The main advantage of source-level instrumentation compared to RTL-level instrumentation is portability to multiple HLS tools. RTL-level instrumentation is usually tied to the HLS tool, requiring specifics about decisions made by the tool.
The main disadvantage of source-level instrumentation is that, by modifying the source code, the HLS tool may schedule operations differently, which could mask bugs or change the behaviour. In this thesis, we modify the HLS tool to automatically insert the debug overlay at the RTL level. By doing this, we ensure that the original HLS scheduling intended for the user design is not affected, and we have low-level control of the design of the instrumentation.

There has been work on automatically finding and correcting bugs (e.g. [13]), often using formal methods. For example, in SAT-based debugging [13], the RTL is instrumented with multiplexers at the output of each gate. A SAT formulation specifying the circuit's expected behaviour is then solved and used to find a set of suspect error locations; the instrumentation can then be used to rectify the circuit if possible. This is somewhat orthogonal to our approach. As stated in [43], formal verification is very useful in certain situations, such as verifying protocols or individual arithmetic units. However, it does not scale well for system-level verification. We believe that the most difficult and complex bugs that are the focus of run-time debugging are often multi-faceted in nature, and thus require a human engineer in the loop to analyze the data. Our goal is to give the user as much information as possible, and as quickly as possible, to help him or her find the root cause of a bug.

2.3.2 Debug Overlays

Previous FPGA-based overlays have been proposed to support the rapid reconfiguration of debug logic. The works in [19, 33] proposed a fine-grained overlay in which debug instrumentation is added to the FPGA on-the-fly using resources not used by the user circuit.
This differs from our approach in two ways: a) these overlays target a generic RTL design as opposed to an HLS-based design, and b) the debug instrumentation is constructed after the user design has been placed and routed, giving the user design first priority to FPGA resources and optimizations. The latter is advantageous for on-chip debug because it is less intrusive to the user circuit; however, this flow is not available using commercial tools. The works in [32, 50] proposed a virtual overlay network and a concentrator network, respectively, that multiplex all signals to on-chip trace buffers at compile time, and can be configured at debug time to observe a subset without a recompilation. This addresses the challenge with commercial ELA tools like ChipScope and SignalTap that require the user to predetermine a signal set to observe at compile time. A similar solution of multiplexing all signals to trace buffers was proposed in [41], in which the user circuit and instrumentation are compiled into a parameterized bitstream. A parameterized bitstream is one where some of the bits in the bitstream are expressed as Boolean functions of parameters. At debug time, the parameterized bitstream can be used to rapidly generate a specialized bitstream to do things like change the set of observed signals within tens of milliseconds. In summary, the previous work on debug overlays for FPGAs targets general RTL designs. Our goal is to provide an HLS-oriented debug overlay.

2.3.3 Support for Circuit Rectification

In-system debug frameworks have also included ways to rectify, or apply small fixes to, the circuit. Earlier work that supported applying functional changes on-chip can be found in [21, 22, 37]. Techniques from [22, 37] employed rectification algorithms to change the functionality of an existing LUT in the design to implement the new desired behaviour, and required a recompilation if the functional change did not fit within the existing netlist connections.
Our control-based overlay has similarity to [21], which instruments registers in the design with multiplexers in order to override signals at runtime. However, [13, 21] use formal verification techniques to perform automatic bug detection; this is orthogonal to our work, which instead leverages the debug instrumentation to provide information about the design to the user, to help him or her root-cause observed errors.

Our debug instrumentation takes the form of an overlay to allow rapid configuration of the debug logic at runtime. Similar solutions are proposed in [19, 41]; however, these works target generic RTL designs and focus on increased signal observability. Like [9, 24, 29], our work provides source-level debugging for HLS designs such that the user need only interact with their original source code.

2.4 The Debug Framework

The contributions presented in this thesis build upon the HLS debug framework developed by Goeders [25]. In this framework, the user first compiles a C program to an HDL representation using LegUp [11]. In [25], the HLS tool is modified to automatically add instrumentation to the user circuit (at the RTL level), and create a debug database, which contains a mapping between C-code variables and signals/memories in the RTL code, as well as scheduling information regarding when variables are updated. The circuit is then compiled using a vendor-specific tool-chain (in this case Quartus II) and implemented on an FPGA. As the FPGA runs, the instrumentation records a history of selected variables and control flow information into on-chip memory (the trace buffer). When the buffer fills up, old data is replaced with new data.

Figure 2.3: Baseline Instrumentation from [25] shows interaction with User Circuit and Debug Workstation
The buffer continues to record data until a user-specified breakpoint is reached. When the breakpoint is reached, the user launches a software-like debug Graphical User Interface (GUI) application, which connects to the FPGA and downloads the trace history. The GUI provides a software-like debug experience, allowing the user to single-step through the design, and view changes to user variables based on the stored execution history (similar to the debuggers in [26] and [9]). The data is taken from the recorded trace buffer, meaning the user is, in essence, replaying the execution that happened while the chip was running at-speed (the GUI in [26] also has a "live mode" in which the design is single-stepped on the FPGA itself; however, this does not allow for running at-speed, which we believe is essential to capture many hard-to-find bugs).

Figure 2.4: Baseline Trace Scheduler Instrumentation and trace buffer contents of hypothetical program

This debugger has been integrated with LegUp's source-level debugger called Inspect, and is available in LegUp 4.0 [9]. LegUp's original Inspect debugger provided two modes of execution: RTL simulation and in-system execution. Both modes allow the user to debug at the source level using information gathered from the HLS tool and stored in a debug database. However, Inspect's original in-system debug used the ELA approach (specifically Intel's SignalTap II) to record variables into trace buffers. A major limitation of the ELA approach in the context of an HLS-generated circuit is poor utilization of the trace buffers. This is because ELA-instrumented signals are recorded into the trace buffer every cycle, whether they are changing or not.
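To see why every-cycle recording wastes trace capacity, consider a toy software model (the buffer size and update pattern below are invented for illustration): with a fixed number of buffer entries, a recorder that writes an entry every cycle captures a window of exactly that many cycles, while one that writes an entry only in cycles where a traced variable changes captures a window several times longer.

```c
#include <assert.h>

enum { BUFFER_ENTRIES = 8 };  /* invented buffer depth */

/* updated[c] != 0 means some traced variable changes in cycle c.
 * Returns the length (in cycles) of the execution window whose
 * history fits in the buffer, counting back from the last cycle. */
int window_cycles(const int *updated, int ncycles, int record_every_cycle) {
    int entries = 0, window = 0;
    for (int c = ncycles - 1; c >= 0; c--) {
        if (record_every_cycle || updated[c]) {
            entries++;
            if (entries > BUFFER_ENTRIES)
                break;  /* buffer full: older history is lost */
        }
        window++;
    }
    return window;
}
```

If a variable changes in only one of every four cycles, the change-driven recorder's window is roughly four times longer for the same buffer, which is the intuition behind the much larger gains reported for the Baseline Instrumentation.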
Goeders addressed this poor memory utilization by designing customized debug instrumentation built upon one key idea from the work of Monson and Hutchings [44]: given access to the HLS scheduling of the generated circuit, only record signals in the cycles in which they are updated. In this section, we describe Goeders' customized debug circuitry. Since our work builds upon Goeders' framework, we refer to it as the Baseline Instrumentation.

Figure 2.3 shows the instrumentation inserted by the HLS tool [25], where the component labeled "Trace Scheduler" is the customized debug instrumentation explained below. We assume that there is a single trace buffer, which is used to store the history of user-visible variables (both those stored within the user circuit datapath and those that are mapped to memories), as well as sufficient control-flow information such that the control path can be reconstructed by the off-line debug GUI. Since the trace buffer is of a limited size (100Kb in [25]), we cannot store the entire run-time history of all variables and trace information. This means that, when debugging, the user can only view the behaviour of variables for a portion of the execution (called the trace window). While searching for the cause of unexpected behaviour, the user may have to run the application several times (debug turns) with different breakpoints or recording a different set of variables.

The Trace Scheduler details are shown in Figure 2.4. It is a large multiplexer whose inputs are RTL signals from the user circuit that map to variables in the source code. Here, the trace buffer is updated every cycle in which any user variable is updated, or a new basic block in the source code is entered. In the example of Figure 2.4, in the first cycle (State S1), user variables r3 and r1 are updated, so their values are stored in the first line of the trace buffer. Control flow information is also stored in this cycle to indicate the execution path of the user circuit.
In the second cycle (State S2), user variable r4 is updated, along with a variable that has been mapped to the global memory within the user circuit, so both updates are stored in the trace buffer. Notice that the width of the trace buffer must at least match the number of bits to be written in the worst-case state. Reference [25] describes optimization algorithms (named delay-worst and delay-all) which attempt to strategically delay writes to the trace buffer to balance the trace buffer width; these optimizations have been used in our framework.

Importantly, this trace scheduler is constructed on a circuit-by-circuit basis and is optimized for both the user circuit and the set of variables to be recorded. Reference [25] shows that this leads to an improvement in the trace window size of 127x when compared to the traditional ELA approach. If the user decides to change the set of variables that are recorded, this trace scheduler has to be reconstructed, meaning the entire design has to be recompiled.

2.5 HLS Benchmarks

In this thesis, we evaluate our debug overlay on HLS designs from the CHStone and MachSuite benchmark suites [28, 52]. These two C-based benchmark suites have been widely used in academic research to evaluate HLS tools. CHStone consists of 12 programs from various application domains, such as arithmetic, media processing, security and microprocessors. MachSuite consists of 19 programs that contain more diverse workloads not covered by CHStone (such as graph traversal and spectral methods). C-based HLS tools support only a subset of the C language. For example, dynamic memory allocation, function pointers and recursion are not typically supported. CHStone complies with the language subset supported by LegUp and was directly usable for synthesis and evaluation. Many of the benchmarks in MachSuite are similar in size to the CHStone benchmarks, and not all are synthesizable by LegUp because they use an unsupported fixed-point representation.
In this thesis, in addition to the 12 programs from CHStone, we also use the largest benchmark from MachSuite, Fast Fourier Transform (FFT). While there exist other more recent HLS benchmark suites such as [62], we chose to use these 13 benchmarks in order to directly compare with the previous work.

In Table 2.1 we show the area overhead of Goeders' instrumentation for the CHStone benchmarks and FFT from MachSuite. Each benchmark is mapped to a Stratix IV FPGA device using the Quartus II toolchain; the numbers reported are from the post place-and-route reports. Column two shows the size of the original user circuit without any debug instrumentation (4943 ALMs on average). Columns 3-5 show the area results for instrumented benchmarks. The user circuit expands by 16% on average to account for the additional ports required to connect signals from the user circuit to the trace scheduler (these results are for 100% signal tracing). The HLSD component is the fixed debug infrastructure that contains the RS232 communication interface between the Debugger GUI running on the host CPU and the FPGA, the stepping and breakpoint unit that facilitates hardware breakpoints, and the trace buffer itself.
As can be seen, the trace scheduler itself is the most expensive component of the baseline instrumentation, shown in column 5, and is on average 23.8% the size of the user circuit.

Table 2.1: Previous Work - Instrumentation Overhead

                                         Baseline Instrumentation
Benchmark       Original User   User Circuit   Fixed hlsd   Trace Scheduler   Debug
                Circuit (ALMs)  (ALMs)         (ALMs)       (ALMs)            Proportion
adpcm           6704            7500           638          1749              23.3%
aes             6273            7306           610          1014              13.9%
blowfish        2700            3081           595          830               26.9%
dfadd           3031            3777           624          1171              31.0%
dfdiv           5314            6230           641          1148              18.4%
dfmul           1445            1934           628          801               41.4%
dfsin           10024           12138          681          3146              25.9%
gsm             3662            4313           1080         920               21.3%
jpeg            17540           19239          635          3466              18.0%
mips            1385            1771           780          572               32.3%
motion          6546            6827           577          992               14.5%
sha             1718            1703           597          333               19.5%
FFT Transpose   39713           46351          875          10518             22.7%
MEAN            4943            5743           678          1304              23.8%

2.6 Summary

This chapter describes the HLS flow, and previous work relating to RTL and HLS debug frameworks. The baseline framework, within which we build and evaluate our HLS-oriented debug overlay in the coming chapters, is also presented. While HLS tools promise increased designer productivity by allowing the designer to program an FPGA using a software description, debug productivity remains a significant challenge. The related work presented in this chapter shows that efforts have been made to bridge the gap between a software program and the HLS-generated hardware during the debug process; however, there still exists the challenge of long FPGA compile times. When finding the root cause of a bug, several iterations of the chip execution may be required in order to capture enough trace data, given the limited capacity of on-chip memory. The designer may choose to first observe several variables, and then narrow their search as their understanding of the design behaviour evolves. In this chapter, we show that all in-system HLS debug frameworks require a recompile any time the debug instrumentation is changed, limiting debug productivity.
The following chapters describe the architecture of a debug overlay targeting HLS-generated circuits that provides the designer with configurable runtime capabilities, eliminating the need to recompile between debug turns.

Chapter 3

Passive Overlay Capabilities and Architecture

This chapter describes our passive overlay architecture family and algorithms to map a debug scenario to the overlay. As mentioned in Chapter 2, one of the main challenges with overlays is area overhead [55]. The more general or flexible an overlay is, the higher the area resources required to construct it. Therefore, our research approach is to first define a set of capabilities that we believe could be useful to the designer during in-system debug of an HLS-generated circuit, and then architect an overlay that is just flexible enough to support these capabilities.

Figure 3.1 shows the overall approach. The user circuit is first instrumented with the flexible debug overlay and compiled once to the FPGA. Before running the circuit, the user sets up a debug scenario. A debug scenario is a collection of variables or a region of code that is to be traced; Section 3.1 will describe the set of debug scenarios that we consider for the passive overlay. The debug scenario is then mapped to the overlay. The flexibility of the overlay to implement a variety of debug scenarios comes from a set of overlay configuration bits, as shown in Figure 3.1, similar to FPGA configuration bits. Since our overlay is just flexible enough to implement common debug scenarios, we can keep the number of these overlay configuration bits low, limiting overhead.
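To make the role of the configuration bits concrete, the following toy software model (the encoding and names are our own invention, not the actual overlay format) shows how a loadable bitmask could select which updated variables are forwarded to the trace buffer; switching debug scenarios then amounts to rewriting the mask, with no recompilation of the circuit.

```c
#include <assert.h>

enum { NVARS = 4 };  /* invented: number of traceable user variables */

/* One cycle of tracing: 'config_mask' is the debug-time configuration
 * (bit v set => trace variable v), 'updated' flags which variables the
 * user circuit changed this cycle. Selected values are forwarded to
 * 'out' (standing in for the trace buffer); the count is returned. */
int trace_cycle(unsigned config_mask, unsigned updated,
                const int values[NVARS], int out[NVARS]) {
    int n = 0;
    for (int v = 0; v < NVARS; v++)
        if ((config_mask & (1u << v)) && (updated & (1u << v)))
            out[n++] = values[v];  /* traced: write to buffer */
    return n;
}
```

In the real overlay, these bits live in hardware (RAM blocks and registers) and steer multiplexers rather than a software loop, but the effect is the same: the same compiled circuit supports many debug scenarios.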
In our overlay, some configuration bits are stored in on-chip RAM blocks (as described in Section 3.2.1), meaning we can use the vendor's in-system memory content editor to update these bits at runtime, and others are stored in registers that are accessible through the debug UART port (as described in Section 3.2.3).

Figure 3.1: Debug Overlay inserted with the user circuit at compile time, and configurable at debug time

We first present the set of capabilities supported by our overlay in Section 3.1, and then Section 3.2 describes the architecture. It is important to note that the capabilities presented in this chapter are not an exhaustive list; these are the capabilities we use to demonstrate an HLS-oriented overlay architecture. In Chapter 4, we quantify how much overhead is incurred if each capability is supported by the overlay fabric. Other useful capabilities could be determined through a user study, but this is out of the scope of the current thesis.

A version of this work was published in [35].

3.1 Passive Overlay Debug Capabilities

To quantify the flexibility of the passive overlay, we define three "capabilities" that the passive instrumentation can have: Selective Variable Tracing, Selective Function Tracing, and Conditional Buffer Freeze. An overlay can have any of these capabilities, or a combination of these capabilities. Each is described below.

Selective Variable Tracing

Selective Variable Tracing refers to the capability to configure the overlay, at debug time, to specify which variables should be traced. In Goeders' debug framework [25], signals corresponding to all user-visible variables in the source code are traced. This provides the most software-like debug experience, since it allows the user to trace through a recorded execution and display the value of every variable at every step.
However, for large designs, and in particular designs with many parallel functional units, recording the behaviour of all variables may quickly consume the limited on-chip trace buffer. If fewer variables are recorded, then a longer trace window can be stored in the trace buffer, possibly making it easier for the user to understand the behaviour of the design and deduce the root cause of a bug.

In our implementation, the overlay is flexible enough that the user can specify any number of variables to trace (including all of them) within a debug turn. The user can therefore trade off the number of traced variables against the trace window depth. Initially, the user may trace many variables to get a quick idea of the overall state of the system. In later debug iterations, the user may decide to focus on only a small subset of variables that he or she believes will reveal the problem.

Selective Function Tracing

Selective Function Tracing refers to the capability to specify, at debug time, specific function(s) of interest. Once the user identifies specific functions, the fabric records activity only within those functions. This capability may be useful if the user has narrowed down the cause of a bug to a specific erroneous function output. By recording only data within the function, a longer trace of the function's behaviour can be obtained.

There are two variants we will explore: (a) the user may wish to limit variable tracing to the specified function(s) but trace all control information (whether it is in the selected function or not), or (b) the user may wish to limit both variable and control tracing to the specified function(s).
The former mode would allow the user to single-step outside the function to understand the call path that led up to the invocation, and then focus on the variables within the function, while the latter would allow for a larger trace history within the function, but would not allow the user to single-step outside the selected function(s).

Conditional Buffer Freeze

Conditional Buffer Freeze refers to the capability to specify, at debug time, a condition that, when true, causes recording of data in the trace buffer to halt. After recording halts, the user can read out the information in the trace buffer to understand what led up to that point. As an example, the user may set a freeze point to occur when a particular error flag goes high, or when an argument to a function falls outside an expected range. We distinguish between a freeze point and a breakpoint: with a freeze point, the execution of the chip may continue; however, by freezing the contents of the trace buffer, the execution history up until the freeze point is preserved.

For efficiency reasons, our buffer freeze points are associated with assignments to a variable rather than with the variable itself. For example, the user cannot specify that the buffer be frozen whenever a specified variable takes on a certain value. Instead, the user specifies that a particular assignment to that variable (identified by source code line number) results in the variable receiving a certain value. The reasons for this design decision are described in Section 3.2.

Note that conditional buffer freeze differs from the conditional capture capability that may be familiar to hardware designers. Using conditional capture, designers can request that the trace buffer record signals only when a certain condition is true. If the condition changes over the execution of the circuit, this can create "gaps" in the recorded behaviour.
In a hardware debug environment, these gaps can be clearly shown to the user in a waveform, with X (unknown) regions denoting the gaps. In a software debug framework, however, it would be confusing if the user, while replaying the execution of the code by single-stepping, encountered unexpected short "dead zones" where the values of variables could not be displayed. In our system, the buffer freezes when a certain condition occurs, and does not start recording again until a new freeze point is configured or the design is re-run. This ensures that the region of code that can be replayed is contiguous, and provides the maximum visibility around a specified point of interest.

Figure 3.2: Selective Variable Tracing - Variant A

Although in our system the designer specifies the conditional buffer freeze point in an interactive fashion, it would also be possible to create freeze points by automatically extracting assertions in the code, similar to [5]. In Section 3.2, we consider several variants of an architecture that supports conditional buffer freeze; these variants differ in how complex a condition can be specified.

3.2 Passive Overlay Architecture

3.2.1 Selective Variable Tracing

We first describe the architectural enhancements that allow the user to select, at runtime, a subset of variables to trace. We present two variants of the architecture: Variant A, which has the least overhead but may not make effective use of the trace buffer, and Variant B, which makes more efficient use of the trace buffer, but at the expense of more area overhead.

Selective Variable Tracing Architecture - Variant A

The first variant we consider is shown in Figure 3.2.
The Signal Trace Scheduler and the Trace Buffer (both shown in green) are the same as in the baseline architecture (Figure 2.4). The 1-bit wide Config ROM and the multiplexer that feeds the ROM's address lines (shown on the bottom left) are new. The Config ROM contains the overlay configuration bits that encode which user variable(s) are to be recorded. Intuitively, it would make sense to include one configuration bit for each variable in the user circuit (to indicate whether that variable should be recorded); however, we found that this approach leads to large decoding logic which increases the overhead of our overlay unacceptably. The reason is that the SSA nature of an HLS compiler means there is not a one-to-one mapping between source code variables and RTL signals in the generated design: each variable assignment in the source code typically maps to a unique RTL signal in the hardware, resulting in a much larger number of RTL signals than states. Furthermore, steering logic would be required to pack different combinations of signals together into the trace buffer each clock cycle depending on the variables selected for tracing, leading to higher area overhead.

Instead, we reuse the access network called the Trace Scheduler from [25], since it already provides intelligent access to the important signals in the user circuit, and we associate each configuration bit with one state of the user circuit. Recall that the Trace Scheduler records only the signals that change in each state, based on the schedule produced by the HLS compiler. If a user variable is to be recorded, the configuration bits corresponding to all states in which that variable is updated are set to 1; this asserts the write enable line of the trace buffer during those states.
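As an illustration, the per-state configuration just described can be sketched in software. The schedule and variable names below are hypothetical stand-ins for the per-state information in the debug database produced by the HLS tool; this is a sketch, not the thesis implementation.

```python
# Sketch (not the thesis implementation): computing Variant A overlay
# configuration bits.  Each trace state gets one bit; the bit is set when
# any user-selected variable is updated in that state, or when the state
# records control-flow information (which is always traced).

def map_debug_scenario(trace_states, selected_vars):
    """trace_states: list of (vars_updated, has_control_flow) per trace state.
    Returns the 1-bit Config ROM contents, one bit per trace state."""
    bits = []
    for vars_updated, has_control_flow in trace_states:
        enable = has_control_flow or any(v in selected_vars for v in vars_updated)
        bits.append(1 if enable else 0)
    return bits

# Hypothetical schedule: state 0 updates r3 and r1; state 1 updates r8, r6, r5;
# state 2 records control flow only; state 3 updates r9.
schedule = [
    ({"r3", "r1"}, False),
    ({"r8", "r6", "r5"}, False),
    (set(), True),
    ({"r9"}, False),
]
print(map_debug_scenario(schedule, {"r1", "r9"}))  # -> [1, 0, 1, 1]
```

Note that, as in the architecture, selecting r1 also enables the state in which r3 is written, so unselected variables sharing a state are recorded as well.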
Since we wish to record all control flow information regardless of the user's variable selection, configuration bits corresponding to states in which control flow information is to be stored are also set to 1. As an example, Figure 3.3(a) shows the trace buffer contents after executing a hypothetical program in which all user variables are traced. If the user wishes to record only r1 and r9, then the trace buffer is not updated during states S2 and S6, leading to the more efficient packing in Figure 3.3(b) (note that the buffer must still be updated in states S3 and S4 because control-flow information is written during those states). This strategy causes the trace buffer to fill more slowly, meaning that at the end of the run a larger window of execution is available in the trace buffer, providing more information for the user as he or she seeks the root cause of a bug.

Table 3.1: Number of States versus Trace States

Benchmark        # of States   # of Trace States
adpcm                    204                  70
aes                      271                 102
blowfish                 194                  78
dfadd                    112                  68
dfdiv                    211                  56
dfmul                     63                  44
dfsin                    465                 207
gsm                      246                 157
jpeg                    1003                 372
mips                     121                  47
motion                   296                 149
sha                       80                  36
FFT Transpose           1697                 868
Geomean                  240                 107

Note that, since several variables may be updated in the same state, unselected variables may also be inadvertently recorded. In Figure 3.3(b), even though the user has not selected r3, the value of r3 is still recorded in the first state, since r1 has been selected. Similarly, r12 and r10 are recorded in the fourth line, since the control flow information in that state must be recorded. This leads to slightly less efficient use of trace buffer space than approaches such as [7], in which the compression circuitry is optimized to store only the selected variables; we quantify this impact in Chapter 4.

In a naïve implementation, the depth of the 1-bit wide ROM would be equal to the number of states in the user circuit.
However, in some states no user variables are updated and no control flow data needs to be captured, so no configuration bit is necessary. Thus, we use a multiplexer (the left-most multiplexer in Figure 3.2) to recode the state number (obtained from the user circuit) into a linear sequence of states in which at least one variable is updated (we call these trace states). The recoded state is then used to address the Config ROM to obtain the corresponding configuration bit for the trace buffer. In our benchmark circuits, listed in Table 3.1, we found that this approach reduces the depth of the ROM by approximately 55%. More importantly, it provides compatibility with HLS tools that do not encode their states sequentially. Note that we use the CHStone benchmark suite and the FFT Transpose benchmark from MachSuite [28, 52].

The inputs to this recoding multiplexer are constants, inserted into the RTL by the modified HLS tool. It is important to emphasize that the circuitry in Figure 3.2 is constructed on a circuit-by-circuit basis. When the HLS tool compiles the user circuit, it knows the schedule and state encodings, and thus can create the instrumentation circuitry, including the values of these constants, optimized for that specific circuit. Since the multiplexer inputs are constant, significant area is reclaimed by the logic synthesis algorithm in the FPGA CAD suite as it optimizes the circuit.

In our implementation, the 1-bit ROM is implemented using one or more embedded FPGA memory blocks. This allows us to change the configuration bits at debug time, to implement a new debug scenario, without recompiling the circuit.

At debug time, an algorithm is needed to map a debug scenario (specified by the user through a GUI) to the values stored in the overlay configuration cells. In the architecture of Figure 3.2, each overlay configuration bit corresponds to one state in the schedule of the user circuit.
When the HLS tool creates the user circuit, it also creates a debug database which contains a list of all user variables that are updated in each state, as well as a list of when control flow information is generated. At debug time, when the user selects signals to observe, it is then straightforward to use this debug database to set the overlay configuration bits appropriately.

The flexibility of selective variable tracing offered by Variant A comes at a cost. When the overlay is constructed (at compile time), the trace buffer width is set equal to the largest number of bits that must be stored in a single state. In the example of Figure 3.3(a), the worst-case state is S6, so the width of the trace buffer is set to the sum of the widths of all variables to be recorded in S6. At debug time, the trace buffer width cannot be changed, meaning we may waste space in the trace buffer. In Figure 3.3(b), the trace buffer is not updated in S6, so the upper-order bits of every line in the trace buffer remain unused. The severity of the problem increases as the number of selected variables is reduced. Later in this section, we present an alternative architecture, Variant B, which addresses this concern.

Recording Control Flow

In order to replay the in-system recording of the circuit execution to the user in the debug GUI, control flow information must be traced. In our overlay, all states that contain control flow are force-traced regardless of the variable selection specified by the user. In our single trace buffer architecture, variable updates and control flow are recorded together, with control information stored in the least significant bits (LSBs) of particular entries. Control flow is recorded only in the parent state of a state that has multiple direct predecessors, in order to know which state the circuit execution came from. When decoding the recording for the debug GUI, the last thing written to the trace buffer must contain control flow information.
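The backward decoding that this property enables can be sketched as follows. The entry format and the `preds`/`writes` tables below are simplified stand-ins for the packed trace bits and the debug database; this is a sketch under those assumptions, not the thesis implementation.

```python
# Sketch (simplified) of replaying the trace buffer backwards.  Each entry
# is modelled as (ctrl_state, {var: value}) instead of packed bits; the
# ctrl_state field stands in for the control flow bits in an entry's LSBs.

def decode_trace(buffer, writes, preds):
    """buffer: oldest-to-newest entries; writes[s]: does state s write?;
    preds[s]: possible predecessor states of s.  Returns the replayed
    (state, values) history, oldest first."""
    buffer = list(buffer)
    ctrl, values = buffer.pop()        # newest entry names the ending state
    state = ctrl
    history = [(state, values)]
    while buffer:
        if len(preds[state]) == 1:     # unique predecessor: no ctrl needed
            state = preds[state][0]
            values = buffer.pop()[1] if writes[state] else {}
        else:                          # ambiguous: next entry's ctrl bits tell us
            ctrl, values = buffer.pop()
            state = ctrl
        history.append((state, values))
    return list(reversed(history))
```

For example, with an execution S1, S2, S3 where S3 has two possible predecessors, the decoder pops the S3 entry to find the ending state, then uses the S2 entry's control bits to resolve the ambiguity.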
These bits can be retrieved from the last entry to determine the ending state of the recording. Decoding is then just a matter of working backwards. If there is only one possible previous state, and that state writes to the trace buffer, then its entry is retrieved and, using the debug database, the bits are decoded into the respective variables written in that state. If there are multiple possible previous states, the next item is pulled from the trace buffer and its LSB bits are again control flow information. This process is repeated until every item has been removed from the trace buffer.

Selective Variable Tracing Architecture - Variant B

In Variant B, we address the memory wasted by Variant A as fewer variables are selected for tracing by the designer during debug. The major difference from the previous variant is the introduction of a line packer module, which is designed to pack partially used lines of trace data together into a single line in the trace buffer, thus making better use of memory.

Figure 3.3: Selective Variable Tracing Examples. Variant A (a and b) versus additional compression achieved with Variant B (c). (a: trace all variables; b: trace selective variables r1, r9; c: trace selective variables r1, r9, with a G=2 line packer.)

Figure 3.4 illustrates this module, and shows how data from two different states (S1 and S7) are combined into a single line in the trace buffer. The line packer works by breaking each incoming data line into G equally sized words (G is an architectural parameter representing the granularity of the line packer). As an example, Figure 3.4 shows a scenario where G=2. Increasing G splits the incoming data into smaller words, allowing for more fine-grained packing and saving memory.
However, as we will show in the results in Chapter 4, increasing G also increases the area of the line packer.

Figure 3.4: Selective Variable Selection - Variant B

In order for the line packer to operate, it must know which of the incoming G words contain important data to save to the buffer, and which contain data that can be discarded. To accomplish this, the 1-bit Config ROM from Variant A is replaced with a multibit ROM. While in Variant A a single bit indicates whether or not the data line for a given state is recorded, in Variant B the ROM indicates how many words of the data line should be saved. Thus, if the data line is split into G words, the width of the Config ROM must be ⌈log2(G+1)⌉ bits.

It should be noted that, for a value n retrieved from the Config ROM, the line packer saves the lower contiguous n words of the data line. For example, in Figure 3.4, this means that if the user did not choose to observe the mem update in S4, then only the first word is saved. If mem was selected, then the entire line would be saved regardless of what the user chose to observe in the lower word. Although it would be possible to design a line packer that saved arbitrary (not necessarily contiguous) words from the data line, we found that such a module would require a prohibitively large amount of area.

Figure 3.5: Line packer architecture with G=4

Like Variant A, Variant B is accompanied by a mapping algorithm that maps the user's variable selection to the ROM words.
In this case, the algorithm can use the HLS schedule to determine how many selected variables are written each cycle, and the position of these variables in the trace buffer output, and set the ROM bits accordingly.

The Line Packer

In the remainder of this subsection we describe how the line packer is constructed. Figure 3.5 illustrates the architecture of a G=4 line packer; we describe its operation in the context of this size of line packer. The incoming data line is divided into G words (w3..w0). These incoming words are stored in a set of word-sized registers (f6..f0), which collect the incoming words and store them in the right-most unoccupied positions. Once enough words are stored to fill an entire line of the trace buffer (when f3..f0 are occupied), the four words are written out to the trace buffer.

While waiting for f3..f0 to populate, more words may arrive in a cycle than can be saved in the unpopulated f3..f0 registers. For this reason, the line packer includes overflow registers (f6..f4). In the G=4 line packer, the worst case occurs when three words are stored in registers (f2..f0) and four words arrive at the same time. This necessitates a total of seven registers to prevent data loss (or 2G−1 in the general case). After f3..f0 is full and the data is written out to the trace buffer, the occupied overflow words f6..f4 are transferred to f2..f0, and any incoming words are stored into the lowest empty positions, again using the overflow registers if necessary.

All of this data movement is controlled by the num_words signal coming from the Config ROM, which indicates how many words of data enter the line packer each cycle. This signal drives the steering logic that controls all of the multiplexers shown in the example.

Although the example shown in Figure 3.5 is for G=4, the same design can be used for any value of G. However, as G increases, the number of inputs to the multiplexers increases at the same rate.
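The register behaviour described above can be modelled at a high level. The following is a behavioural sketch, not the RTL; the word values and per-cycle counts are illustrative.

```python
# Behavioural sketch of the G-word line packer (not the RTL).  Incoming
# words fill the lowest empty register positions; when the low G positions
# (f(G-1)..f0) are full, one packed line is written to the trace buffer and
# any overflow words shift down.

def line_packer(cycles, G):
    """cycles: per-cycle lists of incoming words (at most G per cycle);
    the length of each list plays the role of the num_words signal.
    Returns (packed_lines, leftover_words_still_in_registers)."""
    regs = []                          # models f0..f(2G-2), lowest first
    lines = []
    for words in cycles:
        assert len(words) <= G
        regs.extend(words)             # store into lowest empty positions
        assert len(regs) <= 2 * G - 1  # 2G-1 registers suffice (worst case)
        if len(regs) >= G:             # f(G-1)..f0 occupied: flush one line
            lines.append(regs[:G])
            regs = regs[G:]            # overflow words move down
    return lines, regs

lines, leftover = line_packer([["r3", "r1"], ["ctrl"], ["r9"]], G=2)
# lines == [["r3", "r1"], ["ctrl", "r9"]], leftover == []
```

The second assertion mirrors the worst-case argument in the text: at most G−1 words are held between flushes, and at most G arrive per cycle, so 2G−1 registers are always enough.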
In the example, the multiplexers have up to four inputs; if G were increased to 16, the multiplexers would have up to 16 inputs.

An alternative packing method that we explored was a mixed-width FIFO, such as the IP core offered by Intel [34]. However, only specific mixed-width data ratios are supported (e.g., 1 to 32), and the input and output data widths must be powers of two; this does not always fit the structure of the generated circuit and overlay. Furthermore, if such a FIFO were used, additional buffering logic would be required to avoid losing trace data in the worst-case write, similar to the overflow registers required by our line packer architecture. In general, there is no one-size-fits-all solution to line packing and buffering, so our line packer is specifically constructed to accommodate any trace buffer width required by the generated circuit and our overlay, with the additional compression capability.

3.2.2 Selective Function Tracing

This section describes the enhancements needed to enable Selective Function Tracing, which allows the user to indicate specific functions in the source code that should be traced.

To implement this capability, no changes to the architecture described in Subsection 3.2.1 are required. Changes are required only in the CAD algorithm that maps the debug scenario to the overlay configuration bits. When the user selects one or more functions to be traced, the algorithm uses the debug database to determine which user variable assignments are associated with each selected function, and which states correspond to these assignments. Using the techniques from Subsection 3.2.1, the algorithm can then turn on the configuration bits for those states. Control flow information can be handled in one of two ways, as described in Section 3.1.
In the "full control flow" mode, the algorithm turns on the overlay configuration bits corresponding to all states in which control flow information is written. In the "partial control flow" mode, the algorithm does not turn on bits for states outside the selected function(s) in which control flow information is written.

A complexity arises if the HLS tool makes extensive use of function inlining. In such cases, it is often difficult to crisply delineate which state(s) correspond to the inlined function and which correspond to the parent function, since operations from each function can be mapped to the same state. To accommodate inlining, if an inlined function is selected for tracing, we conservatively trace all states in the parent function as well.

Because no changes to the architecture are required, there are no area implications of supporting this capability. However, this capability does affect the achievable trace window; this is investigated in Chapter 4.

3.2.3 Conditional Buffer Freeze

This section describes the architectural enhancements necessary to support the conditional buffer freeze capability.

Our conditional buffer freeze architecture is parameterized by C, the number of conditions upon which a freeze point can depend. As shown in Figure 3.6, the architecture consists of C comparison subunits and a single Trace Buffer Write Controller. Each comparison subunit monitors the running circuit for a single condition, and the supported operations of the comparator are =, >, <, ≥, ≤, and ≠.
Each subunit contains a wide Configuration Register that can be configured at debug time via the debug UART port connected to the GUI; the values of this register are discussed next.

Figure 3.6: Conditional Buffer Freeze Architecture

As described in Section 3.1, our freeze point architecture triggers when a specific assignment assigns a value that meets a specified condition. In the baseline architecture, the value of each assignment appears at the output of the trace scheduler multiplexer during a specific state. When the user wishes to set a conditional freeze point, the debug database is used to determine the state in which the value from this assignment will appear at the output of the trace multiplexer, and this state is stored in the state field of the Configuration Register. Since several variable values may appear at the output of the trace multiplexer in the same cycle, it is necessary to select the proper subset of bits from the multi-bit trace multiplexer output. Rather than providing a barrel shifter (which would be large), a data mask is generated (by the algorithm that maps the debug scenario to the overlay) to isolate the selected variable, along with a target value which has been scaled appropriately so that the comparison can be performed without shifting. Both the mask and target values are fields within the Configuration Register, and each is as wide as the trace multiplexer output.
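The masked comparison a subunit performs can be sketched as follows; the field values and bit positions are illustrative assumptions, and this models the behaviour, not the RTL.

```python
# Sketch of one comparison subunit's check (not the RTL).  The mask isolates
# the selected variable's bits in the trace multiplexer output, and the
# target value has been pre-scaled so the compare needs no shifting.
import operator

OPS = {"==": operator.eq, "!=": operator.ne, ">": operator.gt,
       "<": operator.lt, ">=": operator.ge, "<=": operator.le}

def freeze_condition_met(current_state, trace_mux_out, cfg):
    """cfg models the Configuration Register fields: state, mask, target, op."""
    if current_state != cfg["state"]:
        return False                   # only compare in the assignment's state
    return OPS[cfg["op"]](trace_mux_out & cfg["mask"], cfg["target"])

# Hypothetical scenario: the variable of interest occupies bits [15:8] of the
# trace multiplexer output; freeze when its value exceeds 0x20 in state 7.
cfg = {"state": 7, "mask": 0xFF00, "target": 0x20 << 8, "op": ">"}
```

Pre-scaling the target (here, `0x20 << 8`) is what lets the comparison operate directly on the masked, unshifted multiplexer output.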
During operation, the comparison subunit monitors the state of the user circuit, and when it matches the state specified in the Configuration Register, it masks the appropriate field and performs the comparison operation described in the op field of the Configuration Register. If the condition is true, a signal is sent to the Trace Buffer Write Controller.

The Trace Buffer Write Controller receives signals from all C comparison subunits, and when any of these signals becomes true, it stops writing to the trace buffer. In this way, the Trace Buffer Write Controller performs an "or" of the C comparison results. An "and" reduction is not supported: unlike in a design specified at the RTL level, in an HLS design the user cannot be sure whether any two assignments occur simultaneously in the hardware, due to optimizations that may be performed by the HLS tool. Because it may be useful to have some data in the trace buffer after the trigger, it is possible to continue storing data for several additional cycles, providing a sliding window of data around the point of interest selected by the user. The number of extra cycles could be specified by the user in another Configuration Register within the Trace Buffer Write Controller; however, we do not implement this feature in our current architecture.

The motivation for basing each condition on a particular assignment rather than on a variable's value can now be explained. Many FPGA-based HLS tools (including LegUp) use a Static Single Assignment (SSA) form of the circuit's Intermediate Representation (IR). Because of the relatively high area required to implement multiplexers in an FPGA, HLS tools that target FPGAs often build hardware with distinct registers for each SSA IR assignment. If we were to monitor a variable independent of a specific assignment, we would have to build a multiplexer to select the current value from all hardware registers corresponding to that variable.
Even if there were only one register associated with each variable, since at compile time we do not know which variable the user will select, we would have to build a multiplexer to select from among all user variables in the circuit, which would be prohibitive in terms of area. By reusing the trace multiplexer, our overlay suffers much less overhead.

There is an extra complexity if the selected assignment is part of a function that the HLS tool inlines into multiple parents. In that case, it is impossible to uniquely identify which copy of the assignment should be used to perform the comparison. Our approach is to expose this complexity to the user if it occurs, asking him or her to identify the specific instantiation of the inlined function that is of interest. It is important to note that, from a user's perspective, setting a condition can be very user-friendly in the debugger GUI (i.e., clicking on a line of source code and entering a condition on the variable of interest).

3.3 Summary

This chapter describes the architecture of a flexible debug overlay that provides software-like debug turnaround times for HLS circuits. At compile time, the overlay is constructed and inserted alongside the user circuit without changing the original schedule generated by the HLS tool (we use LegUp). At debug time, the user can configure the overlay, without performing a lengthy recompile, by modifying a set of configuration bits. These configuration bits are either stored in on-chip RAM blocks (and modified using the vendor's In-System Memory Content Editor) or in registers that are accessible through a debug UART port. The debug capabilities supported by the overlay architecture presented in this chapter are passive, meaning that they do not change the functionality of the underlying user circuit. They are selective variable tracing, selective function tracing, and conditional buffer freeze.
In the next chapter, we evaluate the area overhead incurred by supporting each of these capabilities, and how the overlay affects the clock delay of the user circuit.

Chapter 4

Passive Overlay Results

The architectural variants described in Chapter 3 all allow the user to trade off trace window size against area overhead, with essentially zero compile time between debug iterations. In this chapter, we explore this trade-off for the various variants and show the impact of several architectural parameters.

4.1 Selective Variable Trace: Variant A

We first evaluate the basic selective variable tracing architecture (Variant A, without the line packer) shown in Figure 3.2, and compare it to previous work in which the trace scheduler is recompiled between each debug iteration [7, 25]. Table 4.1 shows the impact on trace window size as we vary the proportion of user-visible variables that are traced. To gather these results, we used the CHStone benchmark suite and the FFT Transpose benchmark from MachSuite [28, 52]. For each benchmark, we simulated the design using ModelSim and measured the number of cycles stored within the trace window, averaged over the run of the program (a higher number means that the trace buffer is being used more efficiently, and that more information is available to the off-line debugging GUI). In all cases, a 100 Kb trace buffer was assumed.
In selecting a subset of variables to observe, we selected variables randomly; we average the results over five runs with different seeds to minimize the impact of especially bad or good variable selections.

Table 4.1: Trace Window Length

Benchmark     Baseline from [25]       Configurable Trace, Variant A
              100%    50%    25%       100%    50%    25%
adpcm         2247   2876   3562       2247   2412   2870
aes           3650   8972  17165       3650   5762   7321
blowfish      6113   8266  10525       6113   6873   9482
dfadd         1047   1366   1822       1047   1056   1095
dfdiv         3391   4363   5073       3391   3461   3490
dfmul          960   1169   1458        960   1043   1145
dfsin         2101   2410   2970       2101   2164   2230
gsm            386   1597   2391        386    386    386
jpeg          2201   3638   4652       2201   2233   2285
mips           739   1104   1949        739    739    739
motion        6212   6823   8771       6212   6222   6229
sha           3574   6702   9431       3574   3739   3790
FFT            636   1324   1627        636    662    675
Geomean       1860   2967   4035       1860   1991   2144
Improvement          1.60x  2.17x             1.07x  1.15x

In this table, Columns 2-4 show the results for the baseline architecture, in which the trace scheduler is recompiled between each debug iteration. Columns 5-7 show the results for our architecture. From this table we can make two observations. First, recording fewer user variables increases the achievable trace length using either architecture; we see an increase of 2.17x when recording 25% of the user variables using the baseline architecture, and 1.15x using our architecture. This increase justifies the importance of being able to tailor the debug scenario as debug proceeds (rather than simply recording everything). Second, our achievable trace length increases more slowly than the baseline's as the number of selected variables is reduced. This is one of the prices we pay for software-like debug turnaround times.

The other price we pay for software-like turnaround times is area. To measure area, we instrumented each benchmark circuit and mapped the results to a Stratix IV FPGA.
We report post place-and-route numbers to account for any physical design optimizations that Quartus II is able to perform. Over all benchmarks, the average area of the baseline architecture (based on the previous work [25]) was 2232 ALMs, not including the trace buffer itself. Our enhanced architecture, Variant A, required a total of 2271 ALMs plus one M9K memory block (an increase of 39 ALMs and one memory block). Of the 39-ALM increase, 59% (on average) was due to the logic required to make the memory block accessible using Intel's In-System Memory Content Editor [2].

We found little impact on the maximum clock speed of the circuit, ranging from -9% to +12% (average +1%), which we attribute to algorithmic noise in the CAD tool.

Even though our technique suffers in terms of area and trace window length, for our benchmarks the time to personalize the overlay ranges from 0.887 seconds to 1.751 seconds (dominated by the time to update the ROM via the memory content editor). In contrast, the approach in [25] suffers a compile time of 454 seconds on average for the same benchmarks. In [7], this is improved to 231 seconds (assuming 50% traced signals) using Intel's incremental compilation flow in Quartus II v16.0.

4.2 Impact of Line Packer: Variant B

To improve the trace buffer capacity, Section 3.2.1 proposes Variant B, which contains a line packer. Figure 4.1 shows the impact on trace length as a function of the line packer's granularity (G) for this variant. Compared to the trace lengths achieved by the baseline and Variant A architectures, G=2 reclaims the trace length lost with Variant A, and G=16 performs 2x-7x better than the baseline. Figure 4.2 shows the overhead logic area versus the baseline as a function of G (all implementations also require one M9K memory block). As the graphs show, increasing G has a significant impact on the trace length.
The area, on the other hand, grows more quickly as G increases, primarily due to the increased area of the steering logic.

Compared to Variant A, this variant requires more memory bits; however, for all benchmarks, only one M9K block was required to store all configuration bits, even for G=16.

Figure 4.1: Impact of Line Packer Granularity (G) on Trace Window Size - Variant B [bar chart: trace window length in cycles for the Baseline, Variant A (G=0), and Variant B (G=2, 4, 8, 16) overlays, at 100%, 50%, 25%, and 10% variable selection]

Once again, we saw only a small impact on the clock frequency, except for the Variant B architecture where G=16, as shown in Figure 4.2. This is due to the increased chain of steering logic for a higher-granularity line packer. At lower G, the frequency dropped slightly as compared to the baseline. The critical path of our instrumented circuits typically falls in the user circuit itself rather than the instrumentation. If we were instrumenting an extremely high-frequency circuit, we could pipeline our instrumentation to match the clock speed.

4.3 Selective Function Tracing

As described in Section 3.2.2, adding the selective function tracing capability may allow a user to use trace buffer space more efficiently by focusing on specific functions of interest. The benefit of this capability on trace buffer length is clearly very circuit-dependent and function-dependent, and thus overall averages may not be meaningful. However, as a data point, we gathered the results in Table 4.2.
This table shows the impact of tracing two functions on trace window size for a single benchmark circuit, adpcm, for two variants of the overlay mapping algorithm: partial control flow, where control flow information that is outside the selected function is not recorded, and full control flow, where all control flow information is recorded, regardless of whether it is outside the traced function.

Figure 4.2: Impact of Line Packer Granularity (G) on Area - Variant B [chart: area overhead in ALMs and Fmax in MHz for the Baseline, Variant A (G=0), and Variant B (G=2, 4, 8, 16) architectures]

Table 4.2: Control Flow Tracing Results for adpcm

Function traced   Inlined?   Full Control Flow   Partial Control Flow
none              -          2613                2613
encode only       yes        4024                4278
upzero only       no         4932                6532

In this experiment, the baseline trace window size (where everything is traced) was 2613 cycles. When we only trace the encode function, this rises to 4024 cycles for the full control flow method, and 4278 for the partial control flow method. For the upzero function, the trace window size is 4932 cycles for the full control flow method and 6532 cycles for the partial control flow method. Clearly, upzero benefits more from using partial control flow. One reason is that encode is inlined by the HLS tool; in this situation, we are conservative and trace the entire parent function.
Thus, we would expect that the benefits of not tracing outside the selected function (in this case, the parent function) would be smaller than for a function that is not inlined, such as upzero.

As described in Section 3.2.2, there is no area impact in supporting the selective function tracing capability.

Figure 4.3: Impact of Number of Sub-units (C) on Area for Conditional Buffer Freeze Architecture [chart: area overhead in ALMs and Fmax in MHz for C=1 through C=4]

4.4 Conditional Buffer Freeze

The area impact of the conditional buffer freeze capability is shown in Figure 4.3 as a function of C (the number of subunits, which affects the complexity of the conditional function that can be used). The left vertical axis is the increase in area when this capability is added to Variant B. As the graph shows, as C increases, the overall area increases significantly. As before, we saw negligible impact on clock speed as C increases.

The conditional buffer freeze capability has no impact on achievable trace length. The capability is meant to increase the likelihood that useful information is stored in the trace buffer. This is difficult to quantify and would be very specific to a particular circuit, designer, and bug.

4.5 Impact on FPGA Compile Times

Figure 4.4 shows the impact on FPGA compile time when inserting the debug overlay with the user circuit.

Figure 4.4: FPGA Compile Time versus Overlay Configuration Time (seconds) [bar chart: Stratix IV compile time for the user circuit only (377 s), the user circuit with baseline instrumentation (501 s), the user circuit with the debug overlay, G=2, C=1 (616 s), and the time to configure the overlay]

The first bar shows the average compile time (377 seconds) for the user circuit only, without any debug instrumentation. The second bar shows the average compile time of the user circuit with the Baseline instrumentation shown in Figure 2.3 (501 seconds).
The third bar shows the average compile time of inserting our overlay with the user circuit (616 seconds). The overlay variant used in Figure 4.4 includes a line packer with G=2 granularity and one conditional buffer freeze unit. Compiling the debug overlay with the user circuit to the FPGA results in a 23% increase in fitter runtime over the baseline instrumentation. Importantly, this compile time only occurs once. The fourth bar shows the time required to personalize the overlay; this ranges from 0.887 seconds to 1.751 seconds for our benchmarks (dominated by the time to update the ROM via the memory content editor).

4.6 Results Summary and Discussion

Overall, these results show the trade-off between trace window size and area overhead for the various capabilities we have discussed, along with the impact of various architectural parameters. These results are likely to be useful to an FPGA vendor who wishes to create an ecosystem containing an overlay such as ours. In such an ecosystem, the HLS tool could determine, based on an estimate of the amount of space left on the FPGA, how large an overlay to construct. If it is estimated that there is little space available, it might construct an "economy" version with only one or two capabilities. If more space is available, it may choose to construct a "deluxe" version that supports all capabilities we have discussed (and perhaps others). The results showed that the cost of our Selective Trace architecture is very small (Variant A required only 39 ALMs and one M9K block beyond the baseline), and that, when the user can be selective in which variables should be traced, the impact of including this capability on trace length is significant. This suggests that almost any overlay should likely have at least this capability.
The conditional buffer freeze capability is more expensive (between 200 and 1000 ALMs depending on the complexity of the condition supported), so an automatic tool might only choose to include this capability in the overlay if there is sufficient space available.

We have purposefully not addressed the question of how useful each capability is in finding a bug (other than the extent to which some capabilities increase the trace window size). Understanding the trade-off between increased window size and improved buffer freeze capabilities, for example, would require user studies or interactions with a large user base that could provide experience and insight into what features help designers find their most complex bugs.

Chapter 5

Control-Based Overlay Capabilities and Architecture

In this chapter we present the architecture for our control-based overlay. The overlay in Chapter 3 was designed to passively monitor the user circuit and record its behaviour. This is valuable in helping the user understand the behaviour of the design. However, there are situations in which the user may wish to control certain aspects of the circuit as it runs. Examples of these functional changes may be small deviations in control flow to avoid or exercise problem areas during debug, or the ability to override signal assignments to perform efficient "what if" tests. In this chapter we show how the passive overlay can be extended to provide control-based capabilities. Section 5.1 provides example use cases that motivate the need for such an overlay, and then Section 5.2 provides a more precise description of the capabilities of the overlay. Sections 5.3 and 5.4 describe the architectural variants supporting these capabilities.

A version of this work was published in [36].

5.1 Motivation - Example Use Cases

The capabilities described in this section provide the designer with the ability to make rapid functional changes to the user circuit in order to further accelerate debugging.
In order to control the area overhead of the overlay, the functional changes that are supported are limited. In this section, three example use cases are provided to motivate the need for such an overlay; a more precise description of the capabilities is then provided in Section 5.2.

The following describes three use cases that motivate our approach:

Use Case 1: A circuit is crashing because an alarm signal goes high erroneously. The user understands the reason, and will fix it later, but wants to let the circuit 'limp along' to continue debugging.

Use Case 2: The user has determined that a particular loop is problematic. The user wants to skip the loop entirely, since it is not germane to what he or she is worried about right now.

Use Case 3: While debugging, the user is puzzled by a particular assignment to a variable in a segment of code written by someone else. He or she wants to do a quick "what if" test by changing the value assigned to the variable, in an attempt to better understand the code.

In all three of these use cases, the functionality of the user design needs to be modified before the next debug turn. Although it would be possible to edit the source code and recompile, our approach allows the user to make these small changes by reconfiguring the overlay.

Our overlay supports two types of changes to the functionality of the user circuit: the ability to modify the control flow of the user circuit (in a limited context that is described below) and the ability to override individual variable assignments. Each is described below.

5.2 Control Overlay Debug Capabilities

5.2.1 Support for Altering Control Flow

Use Cases 1 and 2 in Section 5.1 require the ability for the user to change the control flow of the design between debug turns.
For reasons that will become clear in Section 5.3, rather than providing the user the ability to arbitrarily modify control flow between any two lines of C code, our overlay restricts the control flow transitions that can be modified to those that occur as the result of a conditional branch that stems from 'if/else', 'for', or 'while' constructs in the user's C code.

Figure 5.1: Supported Control Flow Changes. Arrows represent conditional branches that can be modified.

   a) If/Else construct:
      if (condition A) {
         // do something
      } else if (condition B) {
         // do something else
      }

   b) For loop construct:
      for (i = 0; condition; i++) {
         // do something
      }

   c) While loop construct:
      while (condition) {
         // do something
      }

Figure 5.2: Variable assignment override. The value assigned to a variable can be modified.

      ...
      a = something;
      ...

More specifically, as shown in Figure 5.1, for an 'if' construct, the condition related to the 'if' clause as well as any conditions related to 'else if' clauses can be modified. For a 'for' or 'while' loop construct, the condition that ends the loop can be modified.

We consider two variants of this control flow capability. In the first variant, the user can statically indicate the outcome of a branch to either always be taken, or never be taken. In the second, the overlay is architected such that the user can impose a condition that must be satisfied before altered control flow is imposed onto the user circuit. This means that until this condition is true, the user circuit will operate as normal. An example use case of this capability would be to exit a loop early (it may be sufficient to trace only a few iterations of the loop in order to understand the loop's behaviour).
We limit the "support" variables for the replacement condition (that is, the variables upon which the condition is based) to be the same as, or a subset of, the variables in the original condition. For example, if a "for" loop was written to exit when variable i reaches a certain value, only i can be used in the replacement condition. This design decision will be justified in Section 5.3.

HLS tools typically optimize the code before generating hardware. In some cases, loops written in the C code may be translated into predicated operations and then optimized in a way that there is no explicit condition variable. In such cases, the user would not be able to alter the control flow for such constructs. We anticipate a GUI would indicate which constructs can be instrumented and which cannot; in Chapter 6, we will quantify what proportion of constructs can be instrumented in our benchmark circuits.

5.2.2 Variable Assignment Override

Use Case 3 in Section 5.1 requires the ability to change the value of an assignment (Figure 5.2). As will be shown in Section 5.4, providing this capability is more intrusive on the user circuit than providing the capability to change the control flow. Thus, we require the user, at compile time, to select which variables he or she would like to override, and our tool instruments all the assignments to these variables (providing the user with the flexibility to alter the variable at any point in the program). At run-time, the user can override any of the assignments that have been instrumented. For efficiency reasons, the replacement value for the right-hand side of the assignment must be a constant. If the user wishes to override an assignment that has not been instrumented, he or she needs to modify the source code and recompile.
While this falls short of the debug experience that software designers might like (being able to override any assignment in the code), it provides a balance between area/delay overhead and flexibility.

As in Section 5.2.1, we consider two variants of the architecture. In the first variant, the circuit modification is static, meaning that if an assignment is overridden, every time the assignment occurs, the replacement value is used. In the second variant, the replacement value is only used if a specified condition is true; if the condition is false, the circuit operates as normal.

Figure 5.3: Control Flow Instrumentation for Branches [block diagram: a BR CONF RAM, addressed via br_conf_addr through an address decoder; each two-bit word br_conf_val drives a multiplexer (select: br_conf_val[1], data input: br_conf_val[0]) inserted on each branch signal original_br_sig_0 through original_br_sig_N between the next state logic and the datapath of the user circuit module instance]

5.3 Architecture for Control Flow Changes

For the remainder of this chapter, we describe our overlay architecture that supports both control-flow modifications and variable assignment override (or dataflow modifications), and show how the overlay can be adapted to the number of control flow constructs and selected variable assignments in the source code. Throughout, we assume that these capabilities are being added to an overlay that supports the passive observation capabilities described in Chapter 3.

5.3.1 Control Flow (CF) Variant - No Conditional Support

As discussed in Section 5.2, support for this capability involves being able to configure the outcome of a conditional branch operation. Recall that when compiling the user's C code, most HLS tools first translate the source code to an internal representation. In our case, we use the LegUp HLS tool, which translates C code into LLVM's Intermediate Representation (IR). Back-end routines are then used to translate the IR to hardware.
By identifying the conditional temporary register in the IR, it is possible to determine the hardware signal that corresponds to this conditional temporary register. Our HLS tool identifies all such conditional temporary registers, and for each, inserts a multiplexer into the path of the corresponding hardware signal in the user circuit. This is shown in Figure 5.3. In this figure, each original_br_sig represents one of these signals.

The select lines of each multiplexer, as well as the lower data input of each multiplexer, come from a memory that is part of the instrumentation. The memory contains one word for each conditional branch. Each word is two bits; the first bit is used to control whether the corresponding conditional branch should be overridden, and the second bit is the value to use if the branch is overridden. The memory is implemented in one or more on-chip memory blocks. The user can update the contents of this memory before a debug turn without requiring a recompilation, using vendor tools (we use Intel's In-System Memory Content Editor). The address line of the memory, labeled br_conf_addr in Figure 5.3, is asserted in a state previous to the branch evaluation state in question to account for the 1-cycle latency of the memory.

Note that an alternative, more flexible, implementation is possible. Rather than providing the ability to add a multiplexer to each branch condition signal, it would be possible to create an overlay in which the multiplexer is added before each state bit, allowing the user to implement arbitrary state transitions without performing a recompilation. We chose not to implement such an architecture for two reasons. First, the area overhead would be significant; the on-chip memory would require one word per state, and each word would need to be at least log2(number of states) bits wide. Second, HLS tools typically perform aggressive scheduling optimizations in which multiple lines of code are mapped to the same state.
If a user created a transition in an attempt, for example, to skip an arbitrary line of code, it would be likely that additional lines of code would be skipped, since they are mapped to the same state. This would create confusion, since it would not be clear from the software view why certain lines are executed and certain lines are not. By limiting the control changes to those constructs that have a clear relationship between the hardware and source code, we can avoid this complexity.

Figure 5.4: Conditional Control Flow (CCF) Overlay [block diagram: the conditional unit's comparator (with data_mask, target_value, state and op fields) monitors signals from the trace scheduler; in branch-override mode, its output gates the read_enable of the BR CONF RAM, whose words drive the branch-override multiplexers on original_br_sig_0 through original_br_sig_N; a mode bit selects between conditional buffer freeze and conditional branch override]

5.3.2 Conditional Control Flow (CCF) Variant

In the Conditional Control Flow variant (which we refer to as CCF), the user can specify a condition that, when true, allows the control flow to be modified. For clarity, the term branch condition refers to the IR variable and corresponding hardware signal that indicate whether a particular branch is taken, and the term user condition refers to the condition the user specifies to determine whether a particular branch condition should be overridden during execution.

The overlay architecture for the CCF variant has two parts. The first part is a programmable comparison circuit that monitors variables in the user circuit. We use an enhanced version of the conditional buffer freeze (C) unit from Figure 3.6 for this part of the overlay.
The enhanced C unit, shown on the left side of Figure 5.4, contains a number of programmable fields which the user can configure to indicate the variable to be used as part of the user condition, the state in which the variable is updated, and the type of comparison used in the user condition. As in Section 3.2.3, the supported operations are =, >, <, ≥, ≤, and ≠. The user can configure these fields through the debug UART port, allowing the user to change the user condition between debug turns without requiring a recompilation.

Figure 5.5: Re-ordering optimization performed by the HLS tool - Example

   Original code:
      b = something;
      if (a > 0) {
         // do something
      }

   After optimization in the HLS tool:
      if (a > 0) {
         // do something
      }
      b = something;

Now, we need to restrict the variables that can be used in the user condition to the "support" variables used in the original branch condition. This is due to the aggressive scheduling operations performed by many HLS tools. An example is illustrated in Figure 5.5; if the branch condition we wish to override depends on variable a, then if we use a different variable b in the user condition, it is possible that scheduling operations may update b after the original branch is taken, as shown in the optimized code in the figure. In this case, the user condition will not affect how the branch is evaluated. By requiring both the user condition and branch condition to use the same set of variables, this can be avoided.

Many times these "support" variables or expressions in the branch condition do not map back to source code variables, and are instead intermediate or temporary values. Tracing these in addition to signals that map back to source code variables significantly increases the area of the trace scheduler. Therefore, we employ signal reconstruction from [25] to minimize the number of signals traced, with the constraint that all "support" variables must be part of the base set of traced signals.
Signal reconstruction allows us to perform offline reconstruction of certain variable values using the known values of traced signals.

The second part of the overlay is shown on the right side of Figure 5.4. This part is the same as in the original Control Flow (CF) variant, containing a memory that can be used to describe whether each branch condition can be overridden, in this case, under control of the user condition. The read enable signal of the memory is driven by the output of the C unit; if the read enable is zero (meaning the user condition is not satisfied), the memory output is zero. Since one of the memory word bits drives the select lines of the multiplexers, in this case, the multiplexer selects the original branch condition from the user circuit, meaning the user circuit operates as normal. If the C unit output is one, the memory is read as before, possibly causing the branch condition to be overridden depending on the contents of the memory.

Note that since we are re-using the C unit, this requires adding a mode bit and a small amount of extra logic (shown in Figure 5.4), and means that the user cannot perform a conditional buffer freeze and a conditional branch override at the same time.

Figure 5.6: Data Assignment Override (Dataflow) Overlay [block diagram: per-module DF_CONF_RAMs with 8-bit, 16-bit, 32-bit and 64-bit words, an address decoder, and multiplexers that insert override values for each selected_signal into the datapath of the user circuit module instance]

5.4 Architecture for Overriding Variables at Runtime

5.4.1 Dataflow (DF) Variant - No Conditional Support

To use this capability, the user must first select variables they may be interested in overriding at compile time (we refer to this as the Dataflow (DF) overlay variant). Our tool determines all of the RTL signals that map to assignments to these variables and instruments only these, in a similar fashion to the conditional branch instrumentation.
Since data assignments can have different bit widths, our tool will size appropriate memories for 8-bit, 16-bit, 32-bit and 64-bit words per user circuit module, as shown in Figure 5.6. Each word in the memory corresponds to one variable assignment. Within each word, one bit is used as the select line for the corresponding multiplexer, and the other bits are used to contain the override value. As with the conditional branch instrumentation, the contents of these memories can be written between debug turns using vendor tools without requiring a recompilation.

Figure 5.7: Conditional Dataflow (CDF) Overlay [block diagram: the conditional (C) unit's comparator output drives the df_read_enable signals of the DF_CONF_RAMs, whose words feed the override multiplexers in the user circuit module instance]

5.4.2 Conditional Dataflow (CDF) Variant

A conditional version of this architecture can be created in a manner similar to the Conditional Control Flow (CCF) variant in Section 5.3. As shown in Figure 5.7, the C unit evaluates a condition, and the result is used to drive the read enable signals of the memories. In this way, if the condition is not true, no assignments are overridden, and the circuit operates as normal. We assume one C unit for all assignments; it would be possible to use multiple C units to selectively control different assignments; however, the overhead in doing so would be large.

One design decision that was taken during the construction of the CDF variant was to limit the user condition to be based on the selected signals that have been instrumented for data override in the overlay. The reason for this is as follows. The C unit receives its data from the trace scheduler.
Optimizations from [25] called delayAll and delayWorst strategically delay writes to the trace buffer in order to balance the trace buffer width; these optimizations are supported in our framework as they significantly increase the achievable "trace window" length. Suppose the user chose to modify the value of selected signal X based on another signal Y. If signal Y had been delayed by the trace scheduler to write to the trace buffer in a state later than the state it was updated in, then the C unit would not receive Y until after the user circuit had already updated it and potentially moved on to other data operations, potentially missing the opportunity to override signal X. Therefore, we restrict the user to only impose conditions on selected signals, and we constrain the delay optimizations to not be performed on these selected signals. In this way we can ensure that the C unit receives data in the state it was updated in, and that a user condition is evaluated in a timely manner. The impact on the area of the trace scheduler of constraining the delay optimizations will be evaluated in Chapter 6.

5.5 Summary

This chapter describes the architectural support required for each control-based debug capability listed in Table 5.1. We show how the previous passive overlay can be extended to support small functional changes to the underlying user circuit, namely changes to the control flow of the design and the ability to override selected assignments to variables.
In Chapter 6, we quantify to what extent the user can use our overlay to apply functional changes to their design, as well as the impact on delay and area overhead incurred by each of the architectural variants presented in this chapter.

Table 5.1: Summary of Proposed Overlay Capabilities

- Selective Variable Tracing (Passive): Allows the user to flexibly select which variables to trace or observe, at runtime.
- Selective Function Tracing (Passive): Allows the user to trace functions with either "full" or "partial" control flow information.
- Conditional Buffer Freeze (Passive): Allows the user to specify a condition that, when true, causes recording of data in the trace buffer to halt.
- Control Flow Variant - No Conditional Support (CF) (Control): Allows the user to modify the control flow of the user circuit, limited to altering the outcomes of conditional branch signals that map back to "if", "for" or "while" loop constructs in the source code.
- Conditional Control Flow (CCF) (Control): Same as CF, except the user can now impose a condition that must be satisfied before the altered control flow is forced onto the user circuit.
- Dataflow Variant - No Conditional Support (DF) (Control): Instruments signals that map to variables selected by the user to provide runtime override capabilities.
- Conditional Dataflow Variant (CDF) (Control): Same as DF, but with the ability to impose a condition on the user circuit before the altered values are used.

Chapter 6

Control-Based Overlay Results

In this section, we evaluate the ability of the overlay to implement changes to the functionality of the user circuit, as well as the area and clock delay overhead imposed by our overlay.

Table 6.1 provides a summary of the overlay variants discussed in this chapter. As described in the previous chapter, our control-based overlay is an extension of the passive overlay. Therefore, in this chapter when we discuss the overhead of the control and dataflow overlay variants, we present the data relative to the passive overlay described in Chapter 3.
We refer to this as the Base Overlay, which includes a line packer (with G=2 configuration) and one conditional buffer freeze unit (C unit).

Table 6.1: Acronyms used for Control-Based Overlay Variants

Overlay Variant      Details
Base Overlay Recon   Passive Overlay with G=2 Line Packer and 1 C unit and signal reconstruction
Base Overlay         Passive Overlay with G=2 Line Packer and 1 C unit (no signal reconstruction)
CF                   Control Flow Variant - No Conditional Support
CCF                  Conditional Control Flow
DF                   Dataflow Variant - No Conditional Support
CDF                  Conditional Dataflow Variant

In Table 6.1 we reference two base overlay variants. "Base Overlay Recon" is a passive overlay that employs signal reconstruction during construction of the trace scheduler, while the "Base Overlay" alone traces 100% of the source code variables. As explained in Section 5.3.2, the CCF overlay uses signal reconstruction to minimize the area impact to the trace scheduler. Therefore, when we discuss the overhead results of the control flow overlays in Section 6.3, we perform this comparison relative to a base overlay that was also constructed using signal reconstruction. The dataflow overlays discussed in Section 6.4 are compared to a base overlay without signal reconstruction.

Table 6.2: Overhead of Base Overlays with and without Signal Reconstruction

                          Fmax     User      Trace       Line     Conditional     Content
                          (MHz)    Circuit   Scheduler   Packer   Buffer Freeze   Editor
                                   (ALMs)    (ALMs)      (ALMs)   Unit (ALMs)     (ALMs)
Base Overlay (Recon)      152.32   5228      735         128      181             77
Base Overlay (No Recon)   140.57   5699      1425        263      249             70

To provide the reader with some context on the savings achieved with signal reconstruction, Table 6.2 details the area overhead of both types of Base Overlays. Both the user circuit and trace scheduler of a base overlay without reconstruction are larger, to account for the higher number of signals being traced.
This also affects the size of the line packer and C unit, because both of these components are sized to accommodate the maximum width of the signals being traced in a given cycle.

6.1 Control Flow Overlay Opportunities

In Table 6.3, we compare the number of control flow constructs in the hardware that the overlay is capable of altering at runtime to the total number of control flow constructs (for/while/if/else) available in the source code, for 13 circuits from the CHStone and MachSuite benchmark suites [28, 52]. These benchmarks were chosen in order to directly compare with the previous work in [35]. In columns 6 to 8, the number of for loops, while loops and if/else statements are based on optimized
The reason why our overlay cannotalter 100% of these constructs is in cases of loops that have been completely un-rolled, or if/else statements that have been promoted to ternary operations becausein these cases, there is no conditional branch available in the hardware to modify.65Table 6.3: Quantifying Overlay Control Flow Support for -O3 Compiled BenchmarksSource Code Control Flow OverlayBenchmarksNumberof ForLoopsNumberof WhileLoopsNumberof If/ElseTotalControl FlowOpportunitiesNumberof ForLoopsNumberof WhileLoopsNumberof If/ElseTotalOverlayOpportunitiesOverlay vs.Source Codeadpcm 17 0 29 46 4 0 3 7 15.2%aes 19 1 31 51 12 0 3 15 29.4%blowfish 5 5 13 23 3 4 2 9 39.1%dfadd 2 0 48 50 2 0 21 23 46.0%dfdiv 1 1 28 30 2 2 18 22 73.3%dfmul 2 0 26 28 2 0 15 17 60.7%dfsin 1 2 75 78 1 2 34 37 47.4%gsm 18 1 17 36 11 0 15 26 72.2%jpeg 34 11 74 119 19 7 49 75 63.0%mips 4 2 5 11 2 1 4 7 63.6%motion 7 5 27 39 1 2 9 12 30.8%sha 8 4 5 17 5 2 1 8 47.1%FFT 16 2 73 91 4 2 33 39 42.9%AVG 48.5%666.2 Dataflow Overlay OpportunitiesTo quantify dataflow opportunities, we investigate the number of variables thatour overlay is able to override. In Table 6.4, column 2 shows the total number ofvariables for each benchmark (each inlined variable is counted separately because itis treated as a unique signal in the hardware and may also be optimized differently).Column 3 shows the number of variables that are optimized away; our overlaycannot alter these because there are no RTL sources to override. Columns 4 and5 show the number of variables that are optimized to constants, and that reside inmemories respectively. While we do not currently support overriding these typesof variables, we could do so with more engineering effort. 
Finally, column 6 shows the number of variables that reside in registers in the final hardware; these are the ones that can be altered by our dataflow overlays, and the last column shows this count as a percentage of the total (56.8% on average for our benchmarks).

Table 6.4: Data Override Opportunities for -O3 Circuits

Benchmark    Total Vars.  Opt. Away  Opt. to Const.  In Mem.   Regs    Regs %
adpcm                249         72              16       32    129    51.81%
aes                   46          4               4       19     19    41.30%
blowfish              42          1              10       16     15    35.71%
dfadd                128          1              14        5    108    84.38%
dfdiv                131          2              26        4     99    75.57%
dfmul                 84          2              19        4     59    70.24%
dfsin                328          1              60        4    263    80.18%
gsm                   82          1              21        7     52    63.41%
jpeg                 263         22              14       64    163    61.98%
mips                  21          3               2        3     13    61.90%
motion               100         11              54        8     27    27.00%
sha                   36          5              12       10      9    25.00%
FFT                  766          2             295       10    459    59.92%
AVG                                                                    56.80%

6.3 Impact of Control Flow Overlays

Each of our benchmark circuits was compiled with LegUp, which was modified to insert our overlay. The result was then mapped to a Stratix IV device using Intel's Quartus Prime. Table 6.5 shows the impact of CF and CCF on the maximum clock frequency (Fmax) compared to the Base Overlay with signal reconstruction; these results are averaged over 3 fitter seeds. For Fmax, a 1ns clock constraint was specified for the benchmarks to encourage the tool to achieve a high-speed circuit implementation. The CF overlay only adds the branch instrumentation shown in Figure 5.3, while the CCF overlay integrates the branch instrumentation with the conditional unit shown in Figure 5.4. From Table 6.5, we can see that the CF overlay does not impact average Fmax compared to the Base overlay (152.32MHz versus 152.02MHz), but there is variation across the circuits. This is because the branch RTL signals may lie on existing critical paths for only some of the benchmarks (e.g. dfadd), and in these cases the branch instrumentation may increase the delay on these paths. The CCF overlay drops the Fmax to 146.13MHz on average; this is caused by the conditional unit, which now drives the read enable port on the configurable branch memories.
It is interesting to note that some of the benchmarks experience an increase in Fmax when the CCF overlay is inserted compared to the Base or CF overlays (e.g. adpcm and sha). This is due to the nature of the CAD tool, which heuristically explores a very wide solution space and is susceptible to sources of variation [54]. The base overlays for adpcm and sha both experience higher variation (standard deviation) over 3 seeds (11.26 and 8.12, respectively), and so we conclude that the Fmax for the CCF variants is acceptable, since it falls within the fitter's variation for these benchmarks. In terms of fitter runtime, the average place and route time for the original user circuit is 377.35 seconds for our benchmarks, and this increases to 525.47 seconds with the insertion of the CCF overlay.

Table 6.5: Impact of Control Flow Overlay Variants on Fmax

Fmax (MHz)        Base    stdev      CF    stdev     CCF    stdev
adpcm           123.60    11.26  127.27    10.53  127.09     9.78
aes             127.67     1.07  130.22     1.17  124.04     3.21
blowfish        167.63     5.94  175.26     7.74  160.57     7.44
dfadd           194.25     3.29  188.84     2.98  170.00     5.81
dfdiv           191.06     4.35  186.27     2.74  181.22     6.69
dfmul           182.95     0.39  180.19     1.86  177.53     5.03
dfsin           161.79     9.33  165.55     0.73  146.42     1.87
gsm             167.70     3.93  162.67     5.03  154.86    12.73
jpeg             96.39     4.79   90.76     1.52   87.10     0.83
mips            170.59     6.46  169.55     3.47  171.50     0.77
motion          153.14     4.75  151.45     6.36  151.51     1.74
sha             216.33     8.12  219.69    11.36  224.34     3.13
FFT Transpose    89.50     1.80   91.20     1.40   86.97     3.15
AVG +/- SD      152.32     5.04  152.02     4.38  146.13     4.78

In Figure 6.1, all area numbers shown are the relative impact when compared to the Base Overlay with signal reconstruction. The branch instrumentation (labeled "functional override" on the graph) is on average 122 ALMs for the CF overlay, and 275 ALMs for the CCF overlay. The former is the logic for the branch instrumentation multiplexers, while the latter also accounts for the increased number of "taps" from the user circuit to the existing trace scheduler.
Recall that the branch operand signals only need to be traced in the conditional variant, so that the conditional unit has access to this data. For the CCF variant, we can see that the trace scheduler increases by 307 ALMs due to the logic for tracing the branch condition "support" signals, which were not required to be traced in the CF variant. The impact on the content editor is due to the increased number of configuration bits in the overlay for this branch override capability compared to the base overlay.

Figure 6.1: Impact of Control Flow Overlays on Area (overhead in ALMs by instrumentation component: functional override instrumentation, conditional unit, content editor and trace scheduler, for the CF and CCF variants)

6.4 Impact of Dataflow Overlays

The overhead of the dataflow overlay depends on how many variables the user has selected for instrumentation. We compare the impact of the dataflow overlay variants to a base overlay with no signal reconstruction.

Table 6.6 shows the effect on Fmax of instrumenting a random set of 10%, 20% and 30% of the signals from the user circuit, for both the static and conditional dataflow overlay variants. We can see that the more variables are selected, the higher the impact on average Fmax. Some of the impact occurs when the data-override multiplexers fall on critical paths, but most of it is due to increased stress on the routing.

Table 6.6: Impact of Dataflow Overlays on Fmax (MHz)

Overlay Variant    Fmax (MHz)
Base Overlay           140.57
DF 0.10                139.31
CDF 0.10               139.02
DF 0.20                135.18
CDF 0.20               134.19
DF 0.30                134.22
CDF 0.30               131.92
The modified C unit for the CDF variants also causes small drops in Fmax, because an increased amount of logic is required to drive the read enable port on the configuration memories, as shown in Figure 5.7.

Figure 6.2: Impact of Dataflow Architectures on Area (overhead in ALMs by instrumentation component: functional override instrumentation, conditional unit, content editor and trace scheduler, for the DF and CDF variants at 10%, 20% and 30% signal selection)

In terms of area, Figure 6.2 shows the impact of the DF and CDF overlays at 10%, 20% and 30% signal selection on the various instrumentation components in the overlay. The functional override instrumentation consists of the data multiplexers that are inserted within the user circuit before each of the selected signals. This overhead varies between 330 ALMs and 729 ALMs depending on the percentage of selected signals, and is much higher than the branch instrumentation due to the higher bit widths that must be supported depending on the variable type (8 to 64 bits for our benchmarks).

The increase between the DF and CDF variants in the C unit is due to the extra mode logic from the conditional unit required to drive the read enable port on the data configuration memories. The trace scheduler is impacted for the CDF variants because extra care is taken in its construction to ensure that selected variables are traced in the same state in which they are updated. This is done so that the C unit has access to the updated data of these signals as soon as it is available (as described in Section 5.4). This limits some of the "delay" optimizations that can be applied to reduce the size of the trace scheduler as in [25], which accounts for the small increases in trace scheduler overhead for all of the CDF variants.

6.5 Discussion: Circuit Optimizations

Throughout this work, our goal has been to provide a software-like debug experience.
Ideally, the user can interact with the source code as software, without worrying about the specifics of the underlying hardware implementation. If possible, this will not only open the door for software engineers, but will also improve the productivity of hardware designers by raising the abstraction with which they view their design.

One of the challenges is related to the optimizations performed by the HLS tool. We differentiate between two classes of optimizations: compiler optimizations and back-end optimizations. Compiler optimizations include loop unrolling, function inlining, common subexpression elimination, and dead code elimination, and are typically performed on the internal representation of the source code. On the other hand, back-end optimizations are performed as the hardware is generated, and mostly focus on extracting any parallelism available in the design during the scheduling algorithm.

Several of the design decisions described in Chapter 5 were motivated by the presence of scheduling optimizations. For example, the restriction that the support variables in a replacement condition must be selected from the variables in the original condition ensures that scheduling optimizations do not lead to comparisons using stale values. However, our approach does not guarantee a software-like view in all cases. For example, common subexpression elimination, in which assignments to variables are restructured, may mean that changes to variables made using our architecture do not propagate as expected. In this way, we fall somewhat short of our goal of providing a strictly software-like debug experience. Debugging optimized software is notoriously difficult, and is the subject of ongoing work. In the meantime, we provide "hardware-oriented" debug information, such as a cycle-accurate Gantt chart, to help expert designers understand their optimized code.

Chapter 7

Conclusion and Future Work

This chapter summarizes the contributions of this thesis, and discusses the limitations of our work.
Directions for future work are also discussed.

7.1 Contributions

The computational demands of machine learning applications, such as Deep Neural Networks (DNNs) for computer vision and machine reading comprehension, increasingly exceed the compute performance that can be achieved by CPUs alone. While GPUs have historically been the acceleration platform of choice, in recent years FPGAs have resurged as an attractive choice for Microsoft, Baidu, and Amazon due to their lower power and their ability to be quickly reprogrammed as machine learning algorithms evolve over time [15, 20]. This is also driving the market for HLS technologies, which promise decreased development time for applications targeting FPGAs and increased accessibility of these devices to existing hardware engineers as well as developers with limited hardware expertise. However, an HLS tool by itself is not enough; an ecosystem of support tools for debug and design optimization is required to aid the designer through the development process.

In this thesis, we propose an in-system debug solution for HLS-generated circuits targeting FPGAs. Chapters 3 and 4 present a flexible debug overlay that provides software-like debug turnaround times for HLS circuits. At compile time, an overlay is constructed on a circuit-by-circuit basis, taking advantage of HLS scheduling information to maximize trace buffer utilization. At debug time, the user can configure the overlay without having to perform a lengthy recompilation. The range of debug scenarios that can be implemented depends on the overlay itself; we presented architectural support for selective variable tracing, selective function tracing, and conditional buffer freeze.
Compared to the debug cores presented in previous work, our overlay adds only a small amount of extra overhead, but enables software-like debug turnaround times, which we believe is essential if FPGAs are to reach their true potential as mainstream compute accelerators.

In Chapters 5 and 6, we extend our debug overlay to support functional changes to the underlying design. The overlay architectures presented allow the user to alter the control flow of a design as it executes on chip, as well as to override selected variable assignments. We present a variety of overlay variants and quantify how each capability affects the overhead.

7.2 Debug Overlay Usage

The insertion of our debug overlay will have some impact on the user circuit. In general, adding in-system instrumentation of this kind will affect the place and route of the original circuit, leading to a slight decrease in the overall clock frequency, as shown in Chapters 4 and 6. This is due to the additional "taps" that need to be inserted into the circuit, and to the fact that the extra circuitry required by the instrumentation places additional stress on the routing fabric.

For many bugs, a change in clock frequency during debugging will not be a problem. Bugs related to incorrect usage of IP cores, mishandling of corner cases, etc., are more likely "digital" bugs that will occur even when the clock frequency is slightly reduced. However, some bugs may be caused by subtle timing behaviours, and these may be more affected, or even masked, when using our debug overlay. In cases where the HLS core being debugged no longer meets timing when our debug overlay is inserted, pipelining the debug instrumentation to match the circuit's original clock speed or slowing down the system clock are approaches that can be used until the bug has been found. In cases of tight area constraints, i.e. when the user circuit uses virtually the entire device, inserting our debug overlay may not be possible.
However, we suspect that in some cases, depending on production volumes, it may be worth going to a slightly larger FPGA to gain the benefits of faster debug.

Although our technique suffers a similar impact on clock speed as the instrumentation in [25], an important property of the overlay that is the focus of this work is that, between debug iterations, the user circuit does not change (since there is no recompilation). Thus, whatever impact our technique has on the user circuit does not change between debug iterations. We feel this is important: when hunting for an elusive bug, if the place and route changed between every debug iteration, bugs might appear and disappear as the debugging proceeds. By ensuring that the user circuit does not change between debug iterations, this is avoided.

Finally, once the design is finalized, the user may wish to remove the instrumentation. This would require a complete place and route, and may lead to different timing behaviour. We suggest that in many cases, it might make sense to simply leave the debug overlay in (but disable it, to prevent the introduction of a "back door" and to minimize power consumption). If, before production, it became necessary to reclaim every bit of silicon area because of the desire to go to a smaller FPGA, then there admittedly would be some danger that new bugs could be introduced when the instrumentation is removed. However, we assert that our debug overlay would still have served its purpose of helping to find the bugs that appeared on the way to a working prototype.

7.3 Limitations and Future Work

7.3.1 Control-Flow versus Data-flow Circuit Models

In our work, we have constructed our overlay assuming a statically scheduled HLS circuit. This means that the HLS tool has generated the circuit based on a strictly calculated schedule, in which each operation has been assigned to occur in a specific state of the circuit's execution. We use this up-front knowledge of the circuit scheduling (or control flow) to optimize how signals are traced.
Recently, Josipović et al. introduced an HLS tool that generates dynamically scheduled circuits from C code, where there is no centralized control FSM as in LegUp [38]. Instead, these pure-datapath circuits make decisions at runtime based on the availability of operands (via a handshake between operational units), allowing for dynamic out-of-order execution of instructions. This HLS paradigm is able to achieve increased performance when the application contains a large number of irregular control and memory accesses that cannot be resolved at compile time, unlike a statically scheduled circuit, which must generate a conservative schedule.

Debugging dataflow-oriented applications is an important area due to the rise of machine learning applications, which are primarily based on dataflow graphs. Similar to our work, [8, 30] modify their compilers to instrument GPU streaming applications and TensorFlow applications to perform telemetry, or to detect overflow and out-of-bounds memory accesses. For example, TensorFlow's debugger inserts special-purpose debugging nodes into the computation graph that are responsible for exporting runtime activity into log files, which can be visualized in tandem using their framework. However, to the best of our knowledge, there are no in-system debug frameworks for FPGAs developed specifically for HLS circuits based on a dataflow model like [38]. Josipović's tool is not yet public; however, if it were to become so, exploring how our overlay techniques should be adapted to a dynamically scheduled circuit would be an area for future work. A significant challenge with debugging dynamic dataflow circuits is presenting the trace data to the user at the source level in a meaningful way.

Our understanding is that certain commercial OpenCL HLS compilers also use a more dataflow-centric model (no centralized control), where pipelined datapaths are generated for each basic block in the IR and these are connected by FIFOs.
However, our understanding of this model is that while there is no centralized control, the circuit is still statically scheduled, in such a way that data flows through in a predetermined number of cycles. Our overlay currently does not support dataflow-driven circuits; however, we believe that the techniques presented in this thesis can be adapted to support this circuit model, since a static schedule still exists.

7.3.2 Debugging Optimized Circuits

Most HLS tools are implemented within software compiler frameworks (such as LLVM or GCC). Before the final HLS pass that transforms the internal representation (IR) of the source code into hardware, standard compiler optimizations are performed on the IR to improve its performance [31]. Therefore, it is important to provide debug infrastructure for compiler-optimized HLS circuits, as this is what would typically be used in production. While the overlay architectures proposed in this thesis target optimized circuits, as discussed in Section 6.5, significant challenges remain.

Our work is prototyped in LegUp, which is built within the LLVM compiler framework. Therefore, we use the existing LLVM debug metadata to map source code variables in the C code to RTL signals in the generated hardware. Specifically, this metadata keeps track of the transformations that are performed on the IR through compiler optimizations, and our tool constructs a debug database from this metadata. However, not all 56 compiler optimizations supported by LLVM are applied during the standard -O3 pass [31], and standard optimizations such as array vectorization are not supported by LegUp 4.0, the version used in this thesis. If we were to evaluate the overlay techniques proposed in this thesis using commercial HLS tools, it would be important to understand which unique optimizations they support, and how these optimizations affect the structure and schedule of the final circuit.
Commercial HLS tools are closed-source, and so such an analysis is not possible at this time; however, if they were to adopt our techniques, it would be important to understand the relationship between their supported optimizations and debugging using our overlay.

Our control-based overlay is affected the most by optimizations, as optimizations such as instruction reordering and loop rotation may restrict the types of functional changes that can be applied to the underlying circuit. The Gantt chart provided with our debug GUI helps the user visualize instruction reordering and whether multiple instructions execute in the same cycle; however, this may not be enough to communicate all optimizations. For example, the loop rotation pass tries to restructure while loops into do-while loops in order to eliminate a branch instruction per loop iteration [31]. If a user wanted to entirely skip a loop during debug using our control flow overlay, this may not be possible for loops that have been restructured to always perform the first iteration. Designing a source-level interface within the debug GUI that communicates the trace data to the user within the context of more complex optimizations may be even more challenging.

The SDC scheduler that LegUp uses operates at the basic block level, exploiting the available parallelism between instructions in a basic block [31]. That is, there is no optimization across basic blocks, and the blocks are executed sequentially via the state machine. It is possible that commercial tools use aggressive scheduling that speculatively executes basic blocks to increase parallelism. This is an example of an optimization that a debug framework such as ours would need to carefully consider in terms of how the trace data is presented to the user at the source level.

Another common optimization that is not addressed in this work is loop pipelining, which allows successive loop iterations to overlap during execution.
Our overlay could easily be extended to observe the required extra hardware signals, but again, how to present this trace data to the user at the source level in a meaningful way should be carefully considered.

7.3.3 Debugging Parallel Circuits

LegUp, like many other HLS tools, supports the synthesis of multi-threaded programs. Pthreads and OpenMP are the most popular approaches for the parallel design of C-based programs, and LegUp supports both. With Pthreads, LegUp compiles each thread into an accelerator that executes in parallel with the other threads. OpenMP can be used to guide which loops should be executed in parallel.

We have not evaluated our debug overlays on multi-threaded applications; however, previous debug work supports multi-threaded HLS systems by either duplicating the debug circuitry for each thread (which is expensive in terms of area and quickly depletes trace buffer resources), or through the use of tracepoints [27]. In terms of overlays, it is possible that an entirely different overlay is required to support multi-threaded circuits. For example, it might be useful to perform runtime checks at points of communication or synchronization between the hardware threads, where the statistic being checked is configurable at runtime. This is left as future work.

7.3.4 Evaluating our Debug Overlay using other HLS Benchmarks

We primarily use benchmarks from CHStone and MachSuite to evaluate the impact on area and clock speed caused by our different overlay variants. The benchmarks we use are from a variety of domains (e.g. multimedia, arithmetic, communications and encryption), and are the same set of benchmarks used by the wider HLS academic community. However, these benchmarks are quite small: their logic utilization ranges from 1% to 16% on a Stratix IV device. Recently, a new benchmark suite called Rosetta was released that contains six applications from the machine learning and video processing domains [62].
These benchmarks present a more realistic set of compute kernels, as they each explore diverse sources of parallelism, using HLS features from Xilinx's SDAccel. A short-term direction for future work would be to synthesize these benchmarks, with their varying optimizations, through LegUp and see how our overlay scales in terms of area.

7.3.5 Overlay Configurability

In this thesis, the overlay configuration bits are housed in on-chip memories that can be updated at run time using vendor tools, such as Intel's in-system memory content editor supported by the virtual JTAG IP core [2]. This method was selected for two reasons. First, the Quartus II toolchain, which we used for placement and routing of the benchmarks, already supported this feature, which meant we did not have to design our own configuration controller. Second, it was important to have a way of updating the overlay configuration bits without recompiling the rest of the design, which the memory content editor allowed us to do. However, this method has some limitations. By using the hard on-chip memories for our overlay, we are limiting the memory resources available to the user circuit, and on-chip memory resources are already limited. Additionally, the memories reclaimed by the overlay were not always used at their full capacity (in Chapter 4 we show that all of our benchmarks required only one M9K memory, but none of the benchmarks actually required 9Kbits). Finally, hard on-chip memories usually require at minimum a one-cycle read latency, and this had to be accounted for within the addressing logic of the overlay.

An alternative option is to make use of LUTRAMs or MLABs, which are soft memories built from the core logic of the FPGA. This type of memory is a better choice if the required memory capacity is small, as it is with our overlay, and it does not have a one-cycle read latency.
Since these soft memories can be embedded within the core logic of the overlay, it is possible that the impact on the frequency of the user circuit could be lessened, since signals would not need to be routed to and from the hard memory blocks located on specific tiles within an FPGA device. However, we found that there is currently no vendor support for updating soft memories like LUTRAMs or MLABs at runtime, and so a configuration controller would also need to be designed. This was out of the scope of this thesis; however, academic research has been conducted in this area, such as the area-efficient controller design presented in [6], which uses LUTRAMs for its overlay. Adopting this architectural implementation into our overlay is left as future work.

Bibliography

[1] Altera. Quartus Prime Pro Edition Handbook, volume 3, chapter 9: Design Debugging Using the SignalTap II Logic Analyzer. November 2015. → pages 5, 16

[2] Altera. Altera Virtual JTAG (altera_virtual_jtag) IP Core User Guide., October 2016. → pages 45, 79

[3] Altera. SDK for OpenCL., 2016. → pages 1, 12, 13

[4] J. Aycock and N. Horspool. Simple generation of static single-assignment form. In D. A. Watt, editor, Compiler Construction, pages 110–125, Berlin, Heidelberg, 2000. Springer Berlin Heidelberg. ISBN 978-3-540-46423-5. → page 12

[5] M. Boule and Z. Zilic. Automata-based Assertion-Checker Synthesis of PSL Properties. ACM Transactions on Design Automation of Electronic Systems, 13(1):4.1–4.21, January 2008. → page 30

[6] A. D. Brant. Coarse and fine grain programmable overlay architectures for FPGAs. PhD thesis, University of British Columbia, 2013. URL → page 80

[7] P. Bussa, J. Goeders, and S. Wilton. Accelerating in-system FPGA debug of high-level synthesis circuits using incremental compilation techniques. In International Conference on Field-Programmable Logic and Applications, Sept 2017. → pages 6, 16, 32, 43, 45

[8] S. Cai, E. Breck, E. Nielsen, M. Salib, and D.
Sculley. TensorFlow Debugger: Debugging Dataflow Graphs for Machine Learning. 2016. → page 76

[9] N. Calagar, S. Brown, and J. Anderson. Source-level Debugging for FPGA High-Level Synthesis. In International Conference on Field Programmable Logic and Applications, Sept 2014. → pages 6, 16, 19, 20, 21

[10] K. A. Campbell, D. Lin, S. Mitra, and D. Chen. Hybrid Quick Error Detection (H-QED): Accelerator Validation and Debug Using High-level Synthesis Principles. In Proceedings of the 52nd Annual Design Automation Conference, DAC '15, pages 53:1–53:6, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3520-1. doi:10.1145/2744769.2753768. URL → page 4

[11] A. Canis, J. Choi, et al. LegUp: An Open-source High-level Synthesis Tool for FPGA-based Processor/Accelerator Systems. ACM Transactions on Embedded Computing Systems, 13(2):24:1–24:27, Sept. 2013. ISSN 1539-9087. → page 19

[12] A. Canis, S. Brown, and J. Anderson. Modulo SDC scheduling with recurrence minimization in high-level synthesis. In International Conference on Field Programmable Logic and Applications, Sept 2014. doi:10.1109/FPL.2014.6927490. → page 14

[13] Y. Chen, S. Safarpour, J. Marques-Silva, and A. Veneris. Automated Design Debugging With Maximum Satisfiability. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 29(11):1804–1817, Nov. 2010. → pages 18, 19

[14] J. Choi. LegUp 5.1 is released., July 2017. → page 15

[15] E. Chung, J. Fowers, K. Ovtcharov, M. Papamichael, A. Caulfield, T. Massengill, M. Liu, M. Ghandi, D. Lo, S. Reinhardt, S. Alkalay, H. Angepat, D. Chiou, A. Forin, D. Burger, L. Woods, G. Weisz, M. Haselman, and D. Zhang. Serving DNNs in Real Time at Datacenter Scale with Project Brainwave. IEEE, March 2018. URL → page 73

[16] C. Collberg. Basic Blocks and Flow Graphs - Compilers and Systems Software., 2009. → page 13

[17] Intel Corporation. Intel High Level Synthesis Compiler User Guide., May 2018. → pages 15, 16

[18] P. Coussy, D. D. Gajski, M.
Meredith, and A. Takach. An introduction to high-level synthesis. IEEE Design & Test of Computers, 26(4):8–17, July 2009. ISSN 0740-7475. doi:10.1109/MDT.2009.69. → pages 2, 11

[19] F. Eslami and S. J. E. Wilton. An adaptive virtual overlay for fast trigger insertion for FPGA debug. In International Conference on Field Programmable Technology, pages 32–39, Dec 2015. → pages 6, 8, 18, 19

[20] K. Freund. Amazon's Xilinx FPGA Cloud: Why This May Be A Significant Milestone., December 2016. → page 73

[21] M. Fujita. Methods for automatic design error correction in sequential circuits. In European Conference on Design Automation with the European Event in ASIC Design, 1993. → page 19

[22] M. Fujita and Y. Kukimoto. Patching method for lookup-table type FPGA's. In International Conference on Field-Programmable Logic and Applications, Sept 1992. → page 19

[23] J. Goeders. Enabling Long Debug Traces of HLS Circuits Using Bandwidth-Limited Off-Chip Storage Devices. In International Symposium on Field-Programmable Custom Computing Machines, pages 136–143, April 2017. doi:10.1109/FCCM.2017.29. → page 17

[24] J. Goeders and S. Wilton. Effective FPGA debug for high-level synthesis generated circuits. In International Conference on Field Programmable Logic and Applications, Sept 2014. doi:10.1109/FPL.2014.6927498. → page 19

[25] J. Goeders and S. Wilton. Signal-Tracing Techniques for In-System FPGA Debugging of High-Level Synthesis Circuits. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 36(1):83–96, Jan 2017. → pages xi, 5, 6, 16, 17, 19, 20, 22, 28, 31, 43, 44, 45, 58, 60, 72, 75

[26] J. Goeders and S. E. Wilton. Allowing Software Developers to Debug HLS Hardware. In Workshop on FPGAs for Software Programmers, August 2015. → pages xi, 3, 20, 21

[27] J. Goeders and S. J. E. Wilton. Using Round-Robin Tracepoints to debug multithreaded HLS circuits on FPGAs. In International Conference on Field Programmable Technology, pages 40–47, Dec 2015. doi:10.1109/FPT.2015.7393128.
→ page 78

[28] Y. Hara, H. Tomiyama, S. Honda, and H. Takada. Proposal and Quantitative Analysis of the CHStone Benchmark Program Suite for Practical C-based High-level Synthesis. Journal of Information Processing, 17:242–254, 2009. → pages 23, 33, 43, 64

[29] K. Hemmert, J. Tripp, B. Hutchings, and P. Jackson. Source level debugger for the Sea Cucumber synthesizing compiler. In International Symposium on Field-Programmable Custom Computing Machines, pages 228–237, April 2003. → pages 6, 16, 17, 19

[30] Q. Hou, K. Zhou, and B. Guo. Debugging GPU Stream Programs Through Automatic Dataflow Recording and Visualization. In ACM SIGGRAPH Asia 2009 Papers, SIGGRAPH Asia '09, pages 153:1–153:11, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-858-2. doi:10.1145/1661412.1618499. URL → page 76

[31] Q. Huang, R. Lian, A. Canis, J. Choi, R. Xi, N. Calagar, S. Brown, and J. Anderson. The Effect of Compiler Optimizations on High-Level Synthesis-Generated Hardware. ACM Transactions on Reconfigurable Technology and Systems, 8(3):14:1–14:26, May 2015. ISSN 1936-7406. doi:10.1145/2629547. URL → pages 77, 78

[32] E. Hung and S. J. Wilton. Towards simulator-like observability for FPGAs: A virtual overlay network for trace-buffers. In International Symposium on Field Programmable Gate Arrays, FPGA '13, pages 19–28, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-1887-7. doi:10.1145/2435264.2435272. URL → page 18

[33] E. Hung and S. J. E. Wilton. Accelerating FPGA Debug: Increasing Visibility Using a Runtime Reconfigurable Observation and Triggering Network. ACM Transactions on Design Automation of Electronic Systems, 19(2):14:1–14:23, Mar. 2014. ISSN 1084-4309. doi:10.1145/2566668. → pages 6, 18

[34] Intel. FIFO Intel FPGA IP User Guide., 2018. → page 38

[35] A. Jamal, J. Goeders, and S. Wilton. Architecture exploration for HLS-Oriented FPGA debug overlays. In International Symposium on Field-Programmable Gate Arrays, Feb 2018. → pages 27, 64

[36] A. Jamal, J. Goeders, and S. Wilton.
An FPGA Overlay Architecture Supporting Rapid Implementation of Functional Changes during On-Chip Debug. In International Conference on Field-Programmable Logic and Applications, Aug 2018. → page 51
[37] S. Jo. Rectification of advanced microprocessors without changing routing on FPGAs (poster). In International Symposium on Field-Programmable Gate Arrays, Feb 2013. → page 19
[38] L. Josipović, R. Ghosal, and P. Ienne. Dynamically Scheduled High-level Synthesis. In International Symposium on Field-Programmable Gate Arrays, FPGA '18, pages 127–136, New York, NY, USA, 2018. ACM. ISBN 978-1-4503-5614-5. doi:10.1145/3174243.3174264. → page 76
[39] A. Klimovic and J. Anderson. In International Conference on Field-Programmable Technology. → page 13
[40] F. Ko and N. Nicolici. Algorithms for state restoration and trace-signal selection for data acquisition in silicon debug. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 28(2):285–297, Feb. 2009. → page 17
[41] A. Kourfali and D. Stroobandt. Efficient hardware debugging using parameterized FPGA reconfiguration. In International Parallel and Distributed Processing Symposium Workshop, pages 277–282, May 2016. → pages 6, 19
[42] C. Lattner and V. Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, CGO '04, pages 75–86, 2004. ISBN 0-7695-2102-9. → page 12
[43] S. Mitra, S. A. Seshia, and N. Nicolici. Post-silicon validation opportunities, challenges and recent advances. In Proceedings of the 47th Design Automation Conference, pages 12–17. ACM, 2010. → page 18
[44] J. S. Monson and B. Hutchings. New approaches for in-system debug of behaviorally-synthesized FPGA circuits. In International Conference on Field Programmable Logic and Applications, Sept 2014. doi:10.1109/FPL.2014.6927495. → page 21
[45] J. S. Monson and B. L. Hutchings.
Using Source-Level Transformations to Improve High-Level Synthesis Debug and Validation on FPGAs. In International Symposium on Field-Programmable Gate Arrays, pages 5–8, 2015. → pages 6, 16, 17
[46] R. Nane, V. M. Sima, C. Pilato, J. Choi, B. Fort, A. Canis, Y. T. Chen, H. Hsiao, S. Brown, F. Ferrandi, J. Anderson, and K. Bertels. A Survey and Evaluation of FPGA High-Level Synthesis Tools. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, PP(99), 2016. ISSN 0278-0070. doi:10.1109/TCAD.2015.2513673. → pages 1, 2, 11, 12, 13, 14, 15
[47] University of Toronto. High-level synthesis with LegUp, 2016. → page 11
[48] J. P. Pinilla. Source-level instrumentation for in-system debug of high-level synthesis designs for FPGA. PhD thesis, University of British Columbia, 2016. → pages xi, 12
[49] J. P. Pinilla and S. J. Wilton. Enhanced source-level instrumentation for FPGA in-system debug of high-level synthesis designs. In International Conference on Field-Programmable Technology, Xi'an, China, 2016. IEEE. ISBN 978-1-5090-5601-9. → page 17
[50] B. Quinton and S. Wilton. Concentrator Access Networks for Programmable Logic Cores on SoCs. In International Symposium on Circuits and Systems, pages 45–48, 2005. → page 18
[51] K. Randall. Generating Better Machine Code with SSA, July 2017. → pages xi, 13
[52] B. Reagen, R. Adolf, Y. Shao, G.-Y. Wai, and D. Brooks. MachSuite: Benchmarks for accelerator design and customized architectures. In International Symposium on Workload Characterization, pages 110–119, Oct 2014. → pages 23, 33, 43, 64
[53] F. Schirrmeister. Post Silicon Debug: A Prototyping with FPGA Approach, February 2015. → page 4
[54] R. Scoville. Fitter Algorithms, Seeds, and Variation. and SeedSweeps.pdf, November 2011. → page 68
[55] H. K.-H. So and C. Liu. FPGA Overlays, pages 285–305. Springer International Publishing, Cham, 2016. ISBN 978-3-319-26408-0. doi:10.1007/978-3-319-26408-0_16. → pages 2, 8, 26
[56] R. Stallman, R. Pesch, and S. Shebs.
Debugging with gdb: Altering execution. node/gdb 110.html#SEC115, Feb 2002. → page 3
[57] Xilinx. ChipScope Pro Software and Cores: User Guide, October 2012. → page 16
[58] Xilinx. Vivado Design Suite User Guide: High-Level Synthesis. manuals/xilinx2016 2/ug902-vivado-high-level-synthesis.pdf, June 2016. → pages 1, 12, 15, 16
[59] Xilinx. Integrated Logic Analyzer v6.1: LogiCORE IP Product Guide. documentation/ila/v6 1/pg172-ila.pdf, April 2016. → page 5
[60] Xilinx. SDAccel Development Environment, 2016. → pages 13, 15
[61] J. S. Yang and N. A. Touba. Automated Selection of Signals to Observe for Efficient Silicon Debug. In 2009 27th IEEE VLSI Test Symposium, pages 79–84, May 2009. doi:10.1109/VTS.2009.51. → page 17
[62] Y. Zhou, U. Gupta, S. Dai, R. Zhao, N. Srivastava, H. Jin, J. Featherston, Y.-H. Lai, G. Liu, G. A. Velasquez, W. Wang, and Z. Zhang. Rosetta: A Realistic High-Level Synthesis Benchmark Suite for Software-Programmable FPGAs. International Symposium on Field-Programmable Gate Arrays, Feb 2018. → pages 23, 79

