
Accelerating in-system debug of high-level synthesis generated circuits on field-programmable gate arrays… Bussa, Pavan Kumar 2017

Accelerating In-System Debug of High-Level Synthesis Generated Circuits on Field-Programmable Gate Arrays using Incremental Compilation Techniques

by

Pavan Kumar Bussa

B.Tech Electrical Engineering, Indian Institute of Technology Jodhpur, 2015

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF APPLIED SCIENCE

in

The Faculty of Graduate and Postdoctoral Studies
(Electrical and Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

August 2017

© Pavan Kumar Bussa, 2017

Abstract

High-Level Synthesis (HLS) has emerged as a promising technology that allows designers to create a digital hardware circuit using a high-level language like C, allowing even software developers to obtain the benefits of hardware implementation. HLS will only be successful if it is accompanied by a suitable debug ecosystem. Existing debugging methodologies based on software simulation are not suitable for finding bugs that occur only during the actual execution of the circuit. Recent efforts have presented in-system debug techniques which allow a designer to debug an implementation, running on a Field-Programmable Gate Array (FPGA) at its actual speed, in the context of the original source code. These techniques typically add instrumentation to store a history of all user variables in a design on-chip. To maximize the effectiveness of the limited on-chip memory and to simplify the debug instrumentation logic, it is desirable to store only selected user variables. Unfortunately, this may lead to multiple debug runs. In existing frameworks, changing the variables to be stored between runs changes the debug instrumentation circuitry.
This requires a complete recompilation of the design before reprogramming it on an FPGA.

In this thesis, we quantify the benefits of recording fewer variables and solve the problem of lengthy full compilations in each debug run using incremental compilation techniques present in the commercial FPGA CAD tools. We propose two promising debug flows that use this technology to reduce the debug turn-around time for an in-system debug framework. The first flow, in which the user circuit and instrumentation are co-optimized during compilation, gives the fastest debug clock speeds but suffers in user circuit performance once the debug instrumentation is removed. In the second flow, the optimization of the user circuit is sacrosanct. It is placed and routed first without any constraints, and the debug instrumentation is added later, leading to the fastest user circuit clock speeds, but performance suffers slightly during debug. Using either flow, we achieve a 40% reduction in debug turn-around times, on average.

Lay Summary

Designing modern digital electronic systems can be expensive and time consuming. Ensuring a design does not contain errors is especially challenging. This task, known as debugging, is often hampered by the fact that there is limited visibility into the internal operation of a circuit. Recent work has proposed methods to enhance this visibility. These methodologies often involve running the circuit many times; each run requires significant setup (compilation) time in which the design is automatically instrumented with different observation circuitry. In this thesis, we show how this compilation time can be dramatically reduced. The key insight is that the vast majority of the circuit does not change between runs; we present techniques that allow the compilation tool to focus only on the parts of the circuit that do change.
This leads to significantly faster debug, potentially lowering the cost of producing working digital systems.

Preface

This thesis is related to the PhD dissertation of Dr. Jeffrey Goeders, a recent graduate from UBC who is now an Assistant Professor at Brigham Young University, Utah. He helped me set up the framework, answered all my technical questions and pointed me to the right place to start.

This work has been accepted as a poster paper titled "Accelerating In-System FPGA Debug of High-Level Synthesis Circuits using Incremental Compilation Techniques" at the International Conference on Field-Programmable Logic and Applications 2017 (FPL'17) and will be published in the conference proceedings. I was primarily responsible for conducting the research, performing the experiments and summarizing the results. This was done under the guidance of my advisor Dr. Steve Wilton. Dr. Wilton and Dr. Goeders also provided editorial support for all of my submitted works.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
    1.1 Motivation
    1.2 High Level Synthesis
    1.3 HLS Debug
        1.3.1 Software-like Debug
        1.3.2 RTL Simulation
        1.3.3 In-system Debugging
    1.4 Selective Variable Tracing
    1.5 Incremental HLS Debug
    1.6 Contributions
    1.7 Thesis Outline
2 Background
    2.1 Overview
    2.2 HLS Flow
    2.3 HLS Debugging Techniques
        2.3.1 In-system Debugging Approaches
        2.3.2 Source Level In-system Debugging Approaches
    2.4 Incremental debugging Approaches
    2.5 Summary
3 The Debugging Framework
    3.1 Overview
    3.2 Framework
        3.2.1 Debug Instrumentation
    3.3 Trace Buffer Optimizations
        3.3.1 Control Flow Optimization
        3.3.2 Dynamic Tracing of Datapath Register Signals
    3.4 Debug Flow
    3.5 Selective Variable Tracing
    3.6 Summary
4 Incremental Debug Flows
    4.1 Overview
    4.2 Incremental Flows
        4.2.1 A Naive Incremental Flow
        4.2.2 Incremental Flow with Permanent Taps
        4.2.3 Incremental Flow with Permanent Taps and Late Binding
    4.3 Automated GUI
    4.4 Summary
5 Results
    5.1 Overview
    5.2 Methodology
    5.3 Impact on the Trace Window Size
        5.3.1 Variation in the Trace Window Size
    5.4 Impact on Debug Instrumentation Area
    5.5 Impact on the Compile time
    5.6 Impact on the Frequency of User Circuit
    5.7 Summary
6 Conclusions and Future Work
    6.1 Overview
    6.2 Summary
    6.3 Future Work
Bibliography
Appendix
A A Guide to our GUI Framework

List of Tables

5.1 Trace Window Size (For Flow 3)
5.2 Trace Scheduler Area (For Flow 3)
5.3 Total Compile Time for Flow 3 in Seconds
5.4 Total Compile Time for Flow 4 in Seconds
5.5 Area Breakdown (For Flow 3)
5.6 Incremental Compile Overhead for Flow 3 in Seconds
5.7 Frequency Results: Flow 3 vs Flow 1
5.8 Frequency Results: Flow 4 vs Flow 1

List of Figures

2.1 HLS Flow
3.1 Debug Instrumentation
3.2 Control Flow Optimization
3.3 Dynamic Signal Tracing Optimizations
3.4 Original Debug Flow
4.1 A Naive Incremental Flow (Flow 2)
4.2 Incremental Flow with Permanent Taps (Flow 3)
4.3 Incremental Flow with Permanent Taps and Late Binding (Flow 4)
4.4 Modified GUI to support Selective Variable Recording
5.1 Trace Window Size Variation [Trace_subset / Trace_full]
5.2 Frequency Variation with the number of taps

Acknowledgements

I would like to thank my advisor Dr. Steve Wilton for his constant motivation and support throughout my master's journey. This research would not have been possible without his guidance.

Special thanks to Jeff Goeders for helping me understand his work and providing the backbone of the framework for me to build upon. I would also like to thank the other graduate students in Wilton's group for their suggestions and feedback, especially during our weekly group meetings. My friends have played an important role in helping me maintain a good work-life balance, which was much needed at times.

Financial support for this work has been provided by the Natural Sciences and Engineering Research Council of Canada, as well as by Intel Corporation.

Last but not least, I would like to thank my parents for believing in me and providing everything I needed to reach this point.

Chapter 1

Introduction

1.1 Motivation

Recent years have seen a tremendous demand for faster computation as applications become larger and more complex. However, processor performance has not been increasing fast enough to meet such computation demands, mainly because of the difficulty in reducing transistor sizes and because of increased power dissipation.
This has triggered the community to move towards parallel multi-core processing architectures where several processors/cores of the same kind (homogeneous computing) or different kinds (heterogeneous computing) are used to gain performance or energy efficiency. Heterogeneous computing has shown promising results in situations where only a specific part of the application is to be accelerated. One way to accelerate algorithms using these heterogeneous systems is to use a customized application-specific processor. This differs from a homogeneous system, where a general-purpose processor would have to be used irrespective of the computing workload. However, designing and manufacturing an Application Specific Integrated Circuit (ASIC) like a custom processor is very time consuming, and the associated non-recurring engineering costs are high.

An alternative to using ASICs is to use devices which are more flexible and easier to design with while providing benefits similar to those of an ASIC. Field Programmable Gate Arrays (FPGAs) have emerged as reconfigurable devices with the capability of emulating any custom circuit, leading to performance gains over a wide range of applications. For every new application, an ASIC has to be manufactured from scratch, while an FPGA can be transformed into any custom circuit just by reprogramming it. This makes FPGAs well suited to frequently changing applications.

Because of their shorter time-to-market [72] and reasonable performance gains, many companies and academic groups have already started using FPGAs to accelerate their complex applications. Microsoft [64], Intel [39], IBM [27] and Qualcomm [65] have started using FPGAs to accelerate their mainstream computing. Recently, Amazon has started providing FPGA instances [2] in their cloud services which can be used on a per-hour basis, making FPGAs accessible to everyone.

However, implementing a design on an FPGA is not as easy as implementing it on a processor.
As with an ASIC, the design is specified in a hardware description language (HDL) like VHDL or Verilog/SystemVerilog, which usually contains circuit descriptions at a much lower level (flip-flops/registers, logic blocks and the interconnections between them) known as the Register-Transfer Level (RTL). This representation is later compiled and programmed onto FPGAs by vendor-specific Computer-Aided Design (CAD) tools. Doing this is a much more challenging and time-consuming task than developing a software program for a given problem, since it requires the designers to take care of the memory interfaces and the detailed scheduling of operations. As designs become more complex, these approaches become tedious and error-prone unless extreme care is taken.

1.2 High Level Synthesis

High Level Synthesis (HLS) is an automated design process which converts a software-like program (usually written in C, C++ or Java) to a hardware/FPGA implementation. HLS raises the abstraction of the design specification to a much higher level, eliminating the need for the designer to take care of finer details like scheduling, binding and memory interfacing, which are now performed by the HLS compiler itself. A higher level of abstraction reduces the design time considerably and also makes it feasible for software developers to create a hardware implementation for all or part of their design. As software designers outnumber hardware designers [60], leading FPGA companies like Xilinx and Intel have invested heavily in HLS technology in order to attract them towards FPGAs and increase their market sizes, and have developed commercial HLS packages like Vivado HLS [76] and SDK for OpenCL [41].
HLS not only makes it possible for software designers to use FPGAs but also makes it easy for hardware developers to specify complex designs, improving design productivity.

Most existing HLS tools, including the one considered in this work (LegUp [12]), use C as the means for design entry [57] and LLVM [47] as the C compiler. The front-end of the compiler converts the C design to an Intermediate Representation (IR) and the back-end produces an implementation specific to the target architecture. For this thesis, the target architecture is an FPGA, and in this case the back-end performs allocation, scheduling and binding automatically in order to generate the HDL representation.

However, as users are unaware of the HDL generated, they need some means to verify the correctness of their implementation. For an HLS tool to gain widespread adoption, it may need to provide a complete ecosystem like that of any software/hardware Integrated Development Environment (IDE), including support for efficient debug.

1.3 HLS Debug

Bugs might arise at different stages of an HLS design process. Finding the root cause of a bug can be difficult without an effective debugging framework. HLS debugging techniques such as software-based debug, RTL simulation and in-system debug can be used to help find the root cause of bugs that become visible at different levels of abstraction (from software program to hardware implementation).

1.3.1 Software-like Debug

Most existing HLS tools offer the ability to debug a design by running the software code on a workstation. Standard debuggers like GDB [17] can be used to debug the software program. Logic bugs can be found easily through this method. However, in the final operating environment of the hardware circuit generated by the HLS tool, there may be other IP cores, processors or I/O devices interacting with each other and the HLS generated circuit.
Many bugs related to the interfaces with these blocks, which are not produced by the HLS tool, may not be visible in a software-based debug flow. Also, as this type of debugging is performed before the HLS process, it does not help the user find bugs arising from the HLS tool itself.

1.3.2 RTL Simulation

Some HLS tools offer debugging through RTL simulation [41][76]. This is very similar to the software-like debug technique, except that the control and data flow are now obtained from simulating the RTL generated by the HLS tool instead of from normal execution of the software code. This allows the designer to verify that the RTL circuit generated by the HLS tool matches the behavior of the original software code. However, hardware simulation is much slower than actual hardware execution, and it can take a long time to simulate the design to the point where some bugs occur. Also, as RTL simulation is cycle-accurate, some transient/asynchronous events which might happen between cycle transitions cannot be observed.

1.3.3 In-system Debugging

For bugs which cannot be found using software debug or RTL simulation, the only solution is to debug the circuit in the actual operating environment where it interacts with the other blocks present in the system, also known as in-system debugging.

In-system FPGA debugging involves running the circuit at speed on the FPGA. At such speeds, the control and data-flow information of the circuit is updated rapidly. Given the throughput of these updates, it is almost impossible to show them to the user directly because of the limited I/O resources on the FPGA. Also, unlike in software debugging approaches, it is not practical to run the circuit step by step, pausing in between to analyze the updated values. Even if this were possible, reading the values from the FPGA each time the circuit is paused would be time consuming [68], which is not desirable.
Because of these issues, most in-system debugging tools adopt an alternative approach, trace-based debugging, in which additional circuitry (also known as debug instrumentation) is added to store the values of the signals until a set breakpoint is reached; the circuit behavior is then replayed with the help of the captured data. This instrumentation is built using the resources of the same FPGA on which the user circuit executes.

Embedded Logic Analyzers (ELAs) such as SignalTap II [40] and Vivado's Integrated Logic Analyzer (ILA) [75] provide visibility into a hardware design by recording the circuit's execution in on-chip memories (also known as trace buffers) when a predefined trigger condition is met and then presenting the captured data in the form of waveforms. However, this visibility is provided at the RTL abstraction level, which makes sense only to those who understand the underlying hardware, so it is not a feasible debug option for software designers who may want to use HLS tools.

Source-Level In-System Debugging for HLS

A software designer views the design as a set of sequential statements, while the generated hardware consists of several concurrently operating components. This mismatch between the software designer's view and the actual hardware running on the FPGA makes in-system debugging at the RTL level impractical. This becomes even worse if the HLS tool performs optimizations such as moving operations across cycle boundaries, leading to a schedule which might be unfamiliar to the user. To avoid these mismatches, it is preferable to have an in-system debugging flow at the same abstraction level as the one in which the design is implemented.
This would also eliminate the need for users to understand the detailed implementation of their software in RTL, thus maintaining the productivity promised by HLS.

Recent work has presented debugging tools and instrumentation that allow designers to debug their hardware implementation as if it were software [10, 20, 29, 52, 55, 62]. These systems allow users to debug code in an environment with which they are familiar, allowing them to single-step, set breakpoints, and examine variables. Critically, they allow the hardware to run at-speed, recording behavior in on-chip memory for later replay. The implementation described in [10, 23] was incorporated in the recent release of LegUp HLS. It inserts custom debug instrumentation at the RTL level and maintains a database relating the source code variables to LLVM's IR signals and to the hardware signals in the final generated Verilog, which makes source-level in-system debugging possible.

To achieve a software-like debug experience, frameworks such as those in [20] record the history of all user-visible variables (or enough information that all user-visible variables can be reconstructed off-line). This provides the visibility that software designers expect, but it comes at a cost: since on-chip memory is restricted in size, only a limited portion of the circuit execution (the trace window) can be stored. In [21], the follow-up work to [20], the authors focused on optimizing the on-chip memory utilization in order to store longer execution histories. The goal was to have longer execution histories, which allow the user to find bugs easily without having to run the design multiple times (also known as debug iterations), recording a limited portion of the circuit execution each time.

1.4 Selective Variable Tracing

In this thesis we focus on a technique known as Selective Variable Tracing to achieve longer trace window sizes.
Rather than storing a history for all user-visible variables, it is possible to store the history of only a subset of variables, perhaps variables in a function of interest or variables that the designer deems to be important. This could lead to much more efficient use of on-chip memory space, providing longer trace histories and possibly making it easier to determine the cause of observed incorrect behavior. In addition, recording fewer variables reduces the routing complexity and simplifies the compression logic which connects signals to trace memories, thus reducing the area of the debug instrumentation. Various automated signal selection techniques like those described in [46, 49] could be used to connect fewer, more interesting signals to the debug instrumentation, as implemented in [31]. However, in this thesis, signals were selected manually, as the focus of this work is to show the impact of incremental compilation techniques in reducing the compilation time between debug iterations, not to find efficient ways of selecting signals. Future work would be to incorporate these automated signal selection techniques and perform incremental debug using our proposed flows.

As debugging proceeds and users refine their understanding of how the circuit is operating, they may wish to change the subset of variables observed. In frameworks like [10, 21, 23], where the debug instrumentation is added at the RTL level, this would require a recompilation of the design (in order to reconfigure the FPGA on which the design was running), as the generated RTL changes each time the user changes the subset of variables to be observed. Recompilation is often slow, and may even lead to different place and route results, causing timing differences and possibly hiding an elusive bug (like those related to asynchronous interfaces). To be practical, we need a faster and less intrusive compilation path.
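
To make the trade-off between recorded variables and trace window size concrete, here is a minimal sketch of a circular trace buffer that records either every variable or only a watched subset. It is a toy software model, not the instrumentation used in this thesis: the class name TraceBuffer, the 64-entry capacity, the four variable names and the workload in which every variable is updated once per cycle are all assumptions chosen for illustration.

```python
from collections import deque


class TraceBuffer:
    """Toy circular buffer: once full, the oldest entries are overwritten."""

    def __init__(self, capacity_entries, watched=None):
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.entries = deque(maxlen=capacity_entries)
        self.watched = watched  # None means "record every variable"

    def record(self, cycle, var, value):
        # Selective tracing: ignore updates to variables that are not watched.
        if self.watched is None or var in self.watched:
            self.entries.append((cycle, var, value))

    def window_cycles(self):
        """Number of cycles of history currently held in the buffer."""
        if not self.entries:
            return 0
        return self.entries[-1][0] - self.entries[0][0] + 1


def run(buffer, n_cycles, variables):
    # Hypothetical workload: every variable is updated once per cycle.
    for cycle in range(n_cycles):
        for i, var in enumerate(variables):
            buffer.record(cycle, var, cycle * 10 + i)
    return buffer.window_cycles()


variables = ["a", "b", "c", "d"]
full_window = run(TraceBuffer(64), 1000, variables)
subset_window = run(TraceBuffer(64, watched={"a", "b"}), 1000, variables)
print(full_window, subset_window)
```

In this idealized model, halving the number of recorded variables doubles the window. The gains reported for the real framework are smaller (1.6x when recording 50% of the variables), and the model makes no attempt to capture the actual encoding or compression performed by the instrumentation.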
To avoid the recompilation process, we propose using incremental compilation techniques.

1.5 Incremental HLS Debug

Using incremental compilation, the placement and routing of the user's design can be frozen between debug iterations, and only the instrumentation circuitry (which is added to gain observability and which depends on the set of variables to be observed) changes. This leads to faster debug iterations while at the same time maintaining the timing of the user design between debug iterations.

Incremental compilation techniques are well-supported in commercial FPGA CAD tools, and have even been applied to hardware-oriented debug infrastructures such as SignalTap II [37, 40]. However, there are at least four unique characteristics of the debugging framework considered in this thesis [23] that make a straightforward application of existing techniques sub-optimal:

1. Compared to hardware-oriented debug infrastructures such as SignalTap II [40], this framework contains much more instrumentation that needs to be recompiled in each debug iteration. When applying incremental compilation to SignalTap II, the incremental portion primarily consists of routing connections, while in this framework, the incremental portion primarily consists of debug instrumentation, which is a large design-specific custom compression circuit.

2. Although changing the set of variables to be recorded primarily impacts the instrumentation, it is also extremely intrusive to the user circuit, since a large number of "taps" are needed. Without careful consideration, this could dramatically increase the amount of logic that needs to be recompiled between iterations, limiting the effectiveness of the incremental techniques.

3. For many applications, it would be desirable to have the user circuit run as fast as possible, and not be slowed down by the presence of the instrumentation.
This implies that it would be advantageous to compile the user circuit first, and only use "left over" resources to implement the instrumentation.

4. After debugging has been completed, it may be desirable to remove the instrumentation. Although it is possible to ship the design with instrumentation still present (but disabled), some security-conscious designers may worry that this creates a "back door" into the design. Therefore, our incremental flow has to support removing instrumentation without performing a complete recompilation, which would affect the timing of the circuit, possibly exposing new bugs.

1.6 Contributions

In this thesis we adopt a source-level in-system FPGA debug framework for HLS generated circuits [23], briefly described in Chapter 3, to implement our ideas. The following contributions have been made in this thesis to accelerate the HLS debugging flow for the considered framework.

1. We first quantify the impact of selective instrumentation on trace window size and debug instrumentation area by recording different 50% and 25% subsets of the variables present in the user's source code. This was done to estimate how much benefit can be achieved by the selective variable tracing approach. The analysis is complicated by the fact that a single user variable may correspond to multiple IR signals. If such a variable is selected/unselected, then all of its IR signals are recorded/ignored, which makes the instrumentation insertion algorithm difficult to realize.

2. To avoid the recompilation process when doing selective signal tracing during each debug iteration, we propose using incremental compilation techniques.
However, based on the unique characteristics of the framework considered [23] and the limitations of the incremental techniques present in commercial FPGA CAD tools, as described in Section 1.5, several modifications had to be made to the RTL generated by the HLS tool, including adding permanent taps to the user design and creating design partitions to use the incremental compilation flow effectively.

3. We present two promising debug flows using commercial incremental compilation techniques which balance the compile time, design performance and area overhead while maximizing the amount of user data that can be stored in the on-chip trace buffer memories.

(a) In the first flow, after the design partitions are created, the place and route for both the user and the instrumented debug partitions is performed simultaneously. This leads to the co-optimization of the user and the debug partitions, resulting in the fastest clock speeds during debug. Once the debugging is done and the user decides to remove the debug instrumentation, it can be removed without performing a full compilation. However, the remaining user circuit would now run at a lower speed (when compared to the speed of the user circuit with no debug instrumentation), as it was not well optimized during the initial compilation because of the presence of the debug instrumentation circuitry.

(b) In the other promising flow, the design is first compiled with an empty debug partition, which allows the user partition to be placed and routed without any restrictions, optimizing as much as possible to get better performance. Later, the debug partition is added incrementally, preserving the placement and routing for the user partition from the previous compilation.
This might lead to slower debug clock speeds, as the debug instrumentation now uses spare resources which might be spread around the FPGA. However, when the debug instrumentation is removed after debugging is done, the user circuit returns to the speed obtained during the first compilation.

Using either flow, we obtain a 40% improvement in compile time per debug iteration, as we now perform an incremental compilation for each iteration rather than a full compilation. We also achieve increases of 1.6x and 2.6x in trace buffer window size when recording 50% and 25% of the variables, respectively, instead of recording all the variables. Together, this may lead to faster and more effective debug turns, resulting in higher productivity for designers creating complex FPGA applications.

1.7 Thesis Outline

The rest of the thesis is organized as follows. Chapter 2 presents the background required to understand HLS, the need for in-system debug (with the main focus on source-level debug), the way incremental compilation works and how to use it effectively for debugging. It also summarizes recent research related to in-system debug, techniques to improve trace buffer utilization in order to record longer execution histories, and the use of overlays/incremental compilation in debug.

In Chapter 3, the source-level in-system debugging framework used in this research is briefly explained. Different optimizations aimed at improving the trace buffer utilization are presented. Next, the advantages of selective signal tracing and the problems associated with its practical implementation are discussed, along with the proposed solution of using incremental compilation techniques.

Chapter 4 describes the different incremental HLS debug flows proposed in this thesis to accelerate the debugging process by having faster and more effective debug iterations.
It also explains how the GUI from [23] is modified in order to automate the proposed incremental debug flows. Chapter 5 provides extensive results, comparing the different incremental flows proposed in terms of trace window size, area overhead, compile time and the user design performance both during and after debugging.

Chapter 6 concludes this thesis and suggests possible ideas for future work.

Chapter 2
Background

2.1 Overview

This thesis relies upon three major concepts: high level synthesis (HLS), in-system debug and incremental debug. In this chapter each of these is briefly discussed to provide the background required to understand this work. Previous works related to these areas are also presented.

This chapter is organized as follows. Section 2.2 describes the steps involved in an HLS flow, followed by a description of the LegUp [12] HLS framework which is used in the research described in this thesis. Section 2.3 emphasizes the need for in-system debug and summarizes some of the previous research in this field. Section 2.4 describes the incremental compilation features present in FPGA CAD/EDA tools along with their support for incremental debug; it also describes the debugging frameworks which make use of overlays/incremental compilation techniques to accelerate the debug flow. Section 2.5 describes how this thesis differs from the related works and concludes this chapter.

2.2 HLS Flow

HLS has simplified the design process of a digital system by allowing designers to specify the design requirements at a higher abstraction level. The CAD tools associated with this process automatically convert this representation to an RTL-level specification optimized for performance, area and power requirements, which could later be used as a reference for manufacturing an ASIC or configuring an FPGA to emulate the required design. Because of an FPGA's shorter time to market and other benefits when compared to an ASIC [72], several companies have either started using FPGAs to accelerate their complex workloads or started providing FPGA services for customers to meet their needs. Amazon has recently announced its EC2 F1 instances [2], which allow users to pay for the instances by the hour without the need to buy an FPGA. Microsoft's Catapult project to accelerate its cloud services [64] using FPGAs, Intel's acquisition of a leading FPGA manufacturer, Altera [39], and IBM and Qualcomm's interest in using FPGAs to accelerate their cloud services [27, 65] all hint at an increasing demand for FPGAs in the near future. A recent survey on HLS tools [57] presents and evaluates different tools that have emerged from both industry and academia [12, 36, 50, 56, 58, 61, 76]. These tools allow both hardware designers and software designers (who outnumber hardware developers [60]) to use FPGAs for accelerating their workloads, and they improve user productivity by eliminating the need to focus on timing and other interface-related details.

A typical HLS flow is shown in Figure 2.1. As discussed in Section 1.2, the majority of HLS frameworks use synthesizable subsets of C for specifying a design at an algorithmic level, mainly because of its simplicity and widespread adoption, with LLVM as the C compiler. Any HLS framework can be functionally categorized into a frontend, an optimizer and a backend. The frontend compiles the high level behavioral representation of a design (C source code) and converts it to Intermediate Representation (IR) code or a Control and Data Flow Graph (CDFG) [13].
The optimizer then performs several code optimizations such as dead code elimination, false data dependency elimination, function in-lining and loop transformations. Typically, the tool goes through multiple passes, primarily to reduce the number of resources required on the target architecture and to increase the speed of the design.

The backend consists of a series of steps: allocation, scheduling, binding and RTL generation. Together these steps convert the optimized intermediate representation code into an HDL specification. During allocation, the tool determines the type and number of hardware resources (such as functional units or memory components) required to satisfy the design constraints. In scheduling, the operations assigned to each of the functional units are scheduled into cycles based on dependency analysis. Each operation could be scheduled for one or several cycles, depending on the functional unit to which it is mapped and on the complexity of the operation. Binding assigns the operations present in the IR code to specific functional units and also optimizes the resource utilization by allowing non-conflicting operations to share the same functional units by multiplexing them.

The order in which allocation, scheduling and binding are performed may vary depending on the algorithms used by the HLS tool and the given design constraints (area minimization or latency minimization) [13]. All of these steps are highly inter-dependent and some of them may be performed simultaneously based on the tool's objectives. For example, scheduling tries to reduce the number of control steps required, subject to the number of available hardware resources, which depends on the result of allocation.
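The interaction between allocation and scheduling described above can be illustrated with a small resource-constrained list scheduler. This is only a sketch under stated assumptions: the function name `list_schedule`, the operation names and the data layout are hypothetical, functional units are assumed to be non-pipelined, and this is not the SDC scheduler that LegUp uses.

```python
def list_schedule(ops, deps, units):
    """Greedy resource-constrained list scheduling (illustrative only).

    ops   : dict, operation name -> (unit type, latency in cycles)
    deps  : dict, operation name -> set of operations it depends on
    units : dict, unit type -> number of units provided by allocation
    Returns a dict, operation name -> start cycle.
    Assumes an acyclic dependency graph and at least one unit per type.
    """
    start, finish = {}, {}
    busy = {}  # (unit type, cycle) -> number of units in use
    remaining = list(ops)
    cycle = 0
    while remaining:
        for op in list(remaining):
            kind, latency = ops[op]
            # Ready only when every predecessor has already finished.
            if not all(p in finish and finish[p] <= cycle
                       for p in deps.get(op, ())):
                continue
            span = range(cycle, cycle + latency)
            # A non-pipelined unit must be free for the whole operation.
            if any(busy.get((kind, c), 0) >= units[kind] for c in span):
                continue
            for c in span:
                busy[(kind, c)] = busy.get((kind, c), 0) + 1
            start[op] = cycle
            finish[op] = cycle + latency
            remaining.remove(op)
        cycle += 1
    return start


ops = {"m1": ("mul", 2), "m2": ("mul", 2), "a1": ("add", 1)}
deps = {"a1": {"m1", "m2"}}
print(list_schedule(ops, deps, {"mul": 1, "add": 1}))
print(list_schedule(ops, deps, {"mul": 2, "add": 1}))
```

With a single multiplier the two multiplications are serialized and the addition starts in cycle 4; allocating a second multiplier lets it start in cycle 2, which is the area/latency trade-off mentioned in the text.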
Once these steps are performed and the optimizations are applied, the final step is to generate RTL from the intermediate data structures which were used to store the decisions made in the previous steps.

[Figure 2.1: HLS Flow. Source Code (C) -> Frontend -> IR Code -> Optimizer -> Optimized IR Code -> Backend (Allocation, Scheduling, Binding, RTL Generation) -> RTL Code (Verilog)]

LegUp HLS

LegUp is an open source HLS framework developed by the University of Toronto [12] for research/academic purposes. It takes standard C as input (without recursive functions or dynamic memory allocations) and automatically generates an RTL-level implementation (Verilog) which could be synthesized for some of the Intel and Xilinx FPGAs. LegUp is not only capable of generating a pure hardware implementation but can also generate a hybrid hardware/software system. It does so by first running the design on an FPGA-based MIPS processor (Tiger MIPS [59]) and profiling the execution to determine the program segments that need to be accelerated in hardware, while the rest of the code runs on the processor itself.

LegUp uses the low level virtual machine (LLVM) [47] as the frontend compiler to convert the C code to IR, which is then modified by a series of optimization passes. Newly created backend passes then schedule the instructions of this modified IR into specific cycles and finally generate the RTL based on these descriptions.

By default, LegUp uses a System of Difference Constraints (SDC) [11] approach for scheduling and the Bipartite Weighted Matching [30] technique for the binding step. However, users could implement their own algorithms for each of these passes.

The latest release, LegUp 4.0, also includes a source level in-system debugging framework which gives users a software-like debug experience for a hardware implementation [21]. In this thesis we use the same debugging framework to prototype our ideas, which are described in Chapter 4.
Some minor modifications had to be made to LegUp, which are described in Appendix A.

2.3 HLS Debugging Techniques

To verify the correctness of the generated RTL, HLS tools should be able to provide a debugging platform. Only then will HLS be widely adopted. There are different types of debugging approaches for an HLS design flow: software-like debug, RTL simulation and in-system debugging. Each of these is capable of detecting bugs arising at a different level of abstraction, and they were discussed briefly in Section 1.3. In software-like debugging and RTL simulation techniques, it is difficult to provide the exact system inputs and replicate the final operating environment of the circuit, making some elusive bugs invisible. The most accurate and natural method of debugging would be to run the generated circuit at speed on an FPGA and then analyze its execution with respect to the source code. Several works have been published which leverage the advantages of in-system debugging using different techniques. They can mostly be classified into scan-based or trace-based approaches. In a scan-based approach, the circuit execution is paused and the circuit state is retrieved. In a trace-based approach, the circuit execution is recorded in an on-chip trace buffer and the execution is replayed later for debugging purposes. The following subsections describe several of these approaches which are relevant to this thesis.

2.3.1 In-system Debugging Approaches

Using External Probes

With the help of simple debug code, selected RTL signals in a circuit could be connected to the FPGA I/O pins. An external analyzer could then collect the signals from these pins and display them as waveforms for the user to analyze.
However, because of the limited I/O resources on an FPGA, it is not feasible to collect the data generated by the circuit in each cycle (especially when wide signal buses are to be observed) [51], limiting the in-system debugging capabilities of such approaches. Even if it were possible to collect the required data, it would not make much sense to an HLS user, as the design is being developed in C and the user may or may not have knowledge of the circuit implemented on the FPGA. Therefore, debugging at the RTL level without being aware of the correspondence between the source code variables and the RTL signals is not practical for an HLS user.

Scan Based Debugging

In this approach, a scan chain is created by connecting the internal flip-flops in a user design sequentially to a JTAG interface, allowing the user to observe the values stored in the flip-flops. This is very commonly used for ASICs, where test inputs are applied using a scan input pin and the flip-flop values are scanned out through the scan out pin, providing observability into the circuit. Some FPGAs also provide scan chains and enable reading the state of the internal flip-flops through their readback feature [38, 74].

Several works have used scan chains for debugging purposes [3, 43, 68–71]. The benefit of this approach is that no additional on-chip memory is required to record the circuit state. If the scan chains are not already present, then implementing them would take up some resources [71]. In order to read the state of the flip-flops in the scan chain, the circuit must be paused, which might affect the interactions among some of the blocks, potentially altering the circuit state when it is resumed.
In addition, as the debugging is performed at the RTL level, it would not be beneficial to use this with an HLS design flow unless there is a way of mapping these signals back to the source code variables.

Embedded Logic Analyzers

A logic analyzer is a customized hardware unit which has the capability to capture and store selected signals from a user circuit based on predefined trigger conditions. These signals are stored continuously, cycle-by-cycle, in a ring-type trace buffer as long as the trigger conditions are satisfied. Later, the recorded signal values are retrieved from the buffers for further analysis. Advanced logic analyzers can store the signals using segmented buffers and multiple trigger conditions: the signals are recorded until one trigger condition is met, the recording stops when another trigger condition is activated, it is resumed again when a different trigger condition is met, and so on. Trigger conditions are used to start/stop the recording of signals in order to use the trace buffers effectively.

Many commercial logic analyzers have been released by FPGA companies for their devices, including SignalTap II [40] from Intel and the Integrated Logic Analyzer (ILA) [75] from Xilinx, and also by some third parties like Synopsys (Identify RTL Debugger [67]), which provides support for Intel, Xilinx and Microsemi devices. These logic analyzers are added automatically by the corresponding FPGA Electronic Design Automation (EDA) tools and are implemented on the same FPGA fabric as the user circuit. Most of the EDA tools interact with the logic analyzers through the JTAG port, in order to simplify the interface protocol. Usually, the signals are recorded cycle-by-cycle in circular trace buffers, overwriting the previous values until a trigger condition is met. The EDA tool then reads the captured data and displays the signals in a waveform representation.
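The circular recording behaviour described above can be modelled in a few lines. The sketch below is a software illustration only: the class name `RingTraceBuffer` and its methods are hypothetical, a single stop trigger is assumed, and it is not meant to reflect the internals of SignalTap II or the ILA.

```python
class RingTraceBuffer:
    """Software model of a circular trace buffer with one stop trigger."""

    def __init__(self, depth):
        self.depth = depth
        self.slots = [None] * depth
        self.write_ptr = 0
        self.count = 0
        self.stopped = False

    def clock(self, sample, trigger=False):
        """Record one sample per cycle; stop after the triggering sample."""
        if self.stopped:
            return
        # The oldest entry is overwritten once the buffer has wrapped.
        self.slots[self.write_ptr] = sample
        self.write_ptr = (self.write_ptr + 1) % self.depth
        self.count = min(self.count + 1, self.depth)
        if trigger:
            self.stopped = True

    def read_back(self):
        """Return the recorded samples, oldest first."""
        if self.count < self.depth:
            return self.slots[:self.count]
        return self.slots[self.write_ptr:] + self.slots[:self.write_ptr]


buf = RingTraceBuffer(depth=4)
for cycle in range(10):
    buf.clock(sample=cycle, trigger=(cycle == 7))
print(buf.read_back())
```

Because one entry is consumed in every cycle, a buffer of depth 4 retains only the four cycles leading up to the trigger, which is the limitation on trace length discussed in this section.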
As these logic analyzers have to be compatible with any circuit, the instrumentation added is simple and records all the selected signals in each cycle, even though some of the signal values do not change.

Even in these trace-based approaches, the signals displayed would correspond to those at the RTL level; this requires the user to have an understanding of the mapping between the source code and the generated RTL in order to get the most from these waveforms. Also, as the signals are being recorded in every cycle, the on-chip trace buffers can record only a limited portion of the circuit execution, making it difficult to find those bugs whose effects are only noticeable after some time has elapsed.

All of the in-system debugging approaches described above provide sufficient observability into a circuit's execution at the RTL level. However, as described in Section 1.3.3, for an HLS user it would only be meaningful if the debugging was performed at the same abstraction level at which the design is specified, i.e. at the source (C code) level. The following subsections describe the related works which focus on providing source level in-system debugging opportunities for an HLS generated circuit.

2.3.2 Source Level In-system Debugging Approaches

Sea Cucumber Debugger

Sea Cucumber (SC) [34] is a circuit synthesizer which takes an input behavioral description written in its own programming model (based on Java threads) and generates a circuit implementation (JHDL [8]) that could be executed on an FPGA. The JHDL representation could be converted into a netlist that could be programmed to an FPGA through JHDL frameworks. These frameworks provide in-system debugging facilities at the JHDL level by using the readback features available in some FPGAs.

Hemmert et al. proposed a source level debugger for the SC framework in [29]. This debugger used the scan based approach described in Section 2.3.1 for communicating with the FPGA.
To be specific, it used the readback feature of an FPGA to capture a snapshot of the circuit state by freezing the clock or, in other words, pausing the circuit execution. This information was then remapped to the source code with the help of a database containing the mapping information between the source code variables, IR values and the circuit level signals. This database was created during the synthesis of the circuit and also held information about the optimizations performed by the compiler. This allowed the user to have a software-like debug experience while having a well-optimized circuit running on an FPGA. It also provided support for single-stepping, breakpointing, and watching and setting the values of variables, providing both observability and controllability. The major drawback is that the circuit had to be paused after every instruction to allow effective debugging, which is the same problem that affects the other scan based debugging approaches described in Section 2.3.1.

Event Observability Ports

In [53], Monson and Hutchings describe a trace based debugging approach and a new technique to improve the trace memory utilization in order to obtain longer traces of the circuit execution, which is a major concern for any of the trace-based in-system debugging approaches [10, 20, 40]. In this method, top-level ports were manually added to the generated RTL to obtain information about what signals should be recorded and when, unlike an ELA where the signals are recorded every cycle. These were called Event Observability Ports (EOPs), and each consisted of a data signal and an enable signal. The enable was set high only when the corresponding signal value was updated. These ports were added to the relevant signals which the user intended to trace and were connected to the trace buffer memories, thus recording the data signal only when the enable is high.
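The benefit of gating the recording with an enable signal can be seen in a small software model. The sketch below is illustrative only: the function names are hypothetical, the port is modelled as a per-cycle stream of (enable, data) pairs, and the cycle number stored with each entry is an assumption made here so that the two traces can be compared; it is not taken from [53].

```python
from collections import deque


def record_on_enable(port_samples, depth):
    """EOP-style recording: store data only in cycles where enable is high."""
    buffer = deque(maxlen=depth)  # oldest entries fall out when full
    for cycle, (enable, data) in enumerate(port_samples):
        if enable:
            buffer.append((cycle, data))
    return list(buffer)


def record_every_cycle(port_samples, depth):
    """ELA-style recording: store data in every cycle, ignoring enable."""
    buffer = deque(maxlen=depth)
    for cycle, (_, data) in enumerate(port_samples):
        buffer.append((cycle, data))
    return list(buffer)


# The variable is updated in cycles 0, 3 and 7 only.
samples = [(1, 5), (0, 5), (0, 5), (1, 7), (0, 7), (0, 7), (0, 7), (1, 9)]
print(record_on_enable(samples, depth=3))
print(record_every_cycle(samples, depth=3))
```

With the same three-entry buffer, the enable-gated trace still contains all three updates from cycle 0 onwards, whereas the every-cycle trace only covers cycles 5 to 7.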
This work also suggests the use of multiple trace buffers instead of a single buffer, to allow more flexibility and more possible optimizations.

They used Vivado HLS [76] to generate the RTL from the user specified C code, then manually instrumented the RTL and finally programmed it to a Xilinx FPGA to run the circuit at speed. They also maintained a mapping between the C code and the RTL to provide a source level debugging experience for a user. Clearly, to do selective signal tracing with this approach, the RTL would have to be modified to add new ports. This would lead to a full recompilation of the RTL code in order to reprogram the FPGA with the modified design, which is very time consuming. Our work investigates incremental solutions to avoid these lengthy compilations.

LegUp Debugger

As mentioned in Section 2.2, the latest release of LegUp includes a trace/instrumentation based source level in-system debugger which was developed by combining the ideas from the following two similar works:

Inspect Debugger: In this work [10], a source level debugger with the capabilities of a software debugger such as gdb [17] (single stepping, breakpointing and variable inspection) was proposed for the LegUp C-to-RTL synthesizer. No controllability was provided, i.e., a user could only view the values of the variables and did not have an option to update the variable values using this framework. Like the SC debugger [29], a debug database was maintained to relate a source code variable all the way to the optimized RTL-level signals.

It provided two different modes of circuit execution. First, it allowed debugging at the source level using RTL simulation (using the ModelSim simulator). In this mode, single stepping through a line in the source code corresponded to simulation of the circuit for a specific number of cycles (this information is obtained from the debug database).
In the other mode, it added the SignalTap II [40] logic analyzer to the RTL generated from the user design to record the selected signals in on-chip memories while running the circuit at speed. Once the trigger condition is satisfied, this data is read by the GUI and displayed to the user in the context of the source code. It also provided support for SW/HW discrepancy detection by running the software or RTL simulations in tandem with the hardware execution and comparing the variable values in both cases as they execute.

The major limitation of this approach is the use of an ELA to record a trace of the circuit's execution. As described in Section 2.3.1, this approach records selected signals in every cycle, leading to poor utilization of the on-chip memories and restricting the debugging ability in any given debug iteration. Moreover, in order to do selective signal tracing to obtain better trace lengths as described in Section 1.4, it requires a recompilation of the circuit, as the RTL generated would be different for each subset of signals being recorded. This is time consuming. The major goal of this thesis is to target this problem.

Goeders's Debugger: This work [20] is closely related to Inspect [10]. The major difference is that instead of using an ELA it uses its own customized debug instrumentation. This circuitry records the signals only in those cycles when they are updated (using techniques similar to those in [53]) and reduces the amount of control and information to be recorded by performing several optimizations. Follow-up works [21, 23] show how it leverages the scheduling information from the HLS compiler to perform dynamic signal tracing and how it uses signal restoration techniques to improve the on-chip memory (trace buffer) utilization by a significant factor when compared to a standard ELA. However, similar to Inspect [10], changing the variables to be observed between debug iterations requires a recompilation of the full design, as the RTL generated would now be different: the taps into the user circuit change, and the debug instrumentation also changes as it is highly customized based on the variables being recorded.

In this thesis, it is this framework which was modified to implement our proposed ideas. The optimizations used and its limitations are described in more detail in Chapter 3.

Using Source Level Instrumentations

In this approach, the debug instrumentation is added at the source level by modifying the C code. This is different from the previously described works [10, 20, 21, 29, 53], which insert the debug instrumentation after the RTL is generated by the HLS tool. Monson and Hutchings, in their recent work [55], use this approach to produce the EOPs [53] (described in a previous subsection) automatically in the generated RTL, which could later be connected to a trace buffer. This was possible by assigning the required variables/expressions in the source code to newly created top-level pointers, which are finally converted into top-level ports by the HLS tool. In their follow-up work [54], they add support for adding EOPs for pointer variables using shadow pointers. Pinilla and Wilton's work [62] was built upon [55] and described a way to instrument even the trace buffer and the associated circuitry at the source level, which might allow for better optimizations. It also proposed an Array Duplicate Minimization (ADM) technique which improves the trace buffer utilization by using the values of an array variable from the user circuit memories themselves (whenever possible) while reading back the trace data, removing the necessity to record them in the trace buffers.

These works provided in-system debugging capabilities for an HLS user at the source level by adding instrumentation at the source level.
However, to leverage the use of selective variable tracing as described in Section 1.4, the source code had to be changed. This means that the HLS tool must compile the code again after instrumentation, which in turn requires a full compilation by the EDA tool to generate the bitstream for configuring the FPGA. If the instrumentation was added at the RTL level, then the RTL corresponding to the user circuit would never change and only the RTL for the debug instrumentation might be changed in each debug iteration. This could be compiled incrementally by some existing EDA tools by following proper techniques, which we describe in the following subsections. If the instrumentation was added at the source level, however, this would not be possible, as the source code itself is changed and there is no guarantee that the RTL corresponding to the user circuit is the same as before.

In-System Assertion Based Verification

A common approach for debugging a software application is to use assertions. These assertion checks could be used for C code which is to be converted into an RTL implementation by the HLS tool; however, they would not be helpful for finding those bugs which occur only during the actual execution of the circuit. Works like [14, 26] make assertion based verification practical for an HLS application by actually synthesizing these assertion statements into assertion checker circuits (which are implemented on the same FPGA as the user circuit) and notifying the user of assertion failures by verifying them against the actual execution of the circuit running at speed on an FPGA. Clearly, as the number of assertions increases, the area utilized by the assertion checkers would also increase, which is usually not desirable. Also, if the assertions had to be changed, it would require running the HLS flow again along with a full compilation in order to reprogram the FPGA.
Our work focuses on avoiding such full recompilations between successive debug iterations using incremental compilation techniques.

The majority of the in-system HLS debugging approaches described in the previous subsections [10, 20, 21, 53, 55, 62] were trace-based, i.e. they use on-chip trace buffers to store the circuit execution and later replay it to the user, providing the same debugging environment as that of a software simulation. Some of the works [21, 53] also tried to improve the trace buffer utilization by proposing different optimization techniques. However, by using selective variable tracing, the trace length could be much higher since fewer variables are recorded. This approach may require running the design several times, recording a different subset of variables each time. Doing this requires a recompilation of the RTL. A few recently published approaches, described in Section 2.4, try to reduce the recompilation time between successive debug turns.

2.4 Incremental debugging Approaches

Incremental debugging Using Commercial EDA Tools

Commercial EDA tools like Intel Quartus Prime and Xilinx Vivado have incremental compilation features [42, 73] which allow the user to preserve placement and routing for the unchanged portion of a design from a previous compilation. With the help of these features, the SignalTap II [40] and Vivado ILA [75] ELAs can be used to perform incremental debugging. The following paragraphs describe the steps to be followed to achieve this.

Using Intel's SignalTap II: Incremental debug can be performed by using SignalTap II with the help of Quartus Prime's incremental compilation techniques.
In general, to perform an incremental compilation in Quartus Prime, the project has to be divided into design partitions [40] (by default it has one top partition which encompasses the whole design) and the preservation level for these partitions must be set to POST FIT; this directs the tool to use the place and route results for this partition from the previous compilation. If no such results exist, then the tool performs a full compilation for that partition. The idea is to place and route each design partition separately in the first compilation, avoiding any cross-partition optimizations. This makes it possible to make changes in one or more design partitions and compile just those partitions while preserving the results for the other, unchanged partitions, leading to a significant reduction in compile times. We use this incremental compilation technique in our debug flows (refer to Chapter 4) to allow in-system source level incremental debugging for HLS generated circuits. Quartus Prime also provides Rapid Recompile, in which creating design partitions is not mandatory and the tool automatically tries to preserve as much of the placement and routing from the previous compilation as possible. However, through a series of experiments we determined that this option is optimized for minor changes and was not effective for the amount of change that occurs when we modify our debug instrumentation. Hence, we do not consider it for our work.

In a normal debugging flow using SignalTap II, the user has to first determine the signals to be observed (which are visible after running an Analysis and Elaboration step or a full compilation of the user design), define trigger conditions and other configuration details in a SignalTap file (.stp file in Quartus Prime) and add it to the project.
Then, a full compilation must be done in order to insert the SignalTap II IP core into the design. The user circuit and the SignalTap II logic are treated as a single design partition and are placed and routed simultaneously without any restrictions. Therefore, if the signals to be recorded are changed, the SignalTap II logic and hence the whole design partition is considered changed, requiring a full recompilation.

To perform incremental debugging using SignalTap II, incremental compilation for the design should be enabled by changing the preservation level to POST FIT for the default top partition, or by creating other partitions if required and changing the preservation settings accordingly. Next, the same steps described in the previous paragraph are followed to create and add a .stp file to the project. When the first compilation is run, the tool now considers the SignalTap II logic as a separate design partition and isolates it from the user design partitions. If the signals to be observed or the trigger conditions are changed, then the tool tries to incrementally route the new signals to the SignalTap II logic. When this incremental routing succeeds, only the SignalTap II partition changes while the user design partitions are preserved, thus reducing the compilation time significantly. If the routing is not possible, the tool has to recompile the user partition too.

However, when we inserted customized debug instrumentation instead of the SignalTap II ELA core and performed an incremental compilation (using Quartus Prime) after changing the signals to be observed, the user design partitions were also considered changed, as the debug taps into the user circuit were changing each time. This is described in Chapter 4 along with our proposed solution to avoid such situations.
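The reason changing taps defeats partition-level reuse can be shown with a toy model in which a partition is recompiled whenever its RTL text differs from the previous iteration. This is a deliberately simplified sketch: the function names and the RTL strings are hypothetical, and comparing source text is an assumption made for illustration; it does not describe how Quartus Prime actually detects changes.

```python
import hashlib


def fingerprint(rtl_text):
    """Stable digest of a partition's RTL source text."""
    return hashlib.sha256(rtl_text.encode("utf-8")).hexdigest()


def partitions_to_recompile(previous, current):
    """Return the partitions whose RTL is new or differs from last time.

    previous, current : dict, partition name -> RTL source text
    """
    return sorted(
        name for name, rtl in current.items()
        if name not in previous
        or fingerprint(previous[name]) != fingerprint(rtl)
    )


first = {
    "user": "module user(input clk); endmodule",
    "debug": "module debug(input [7:0] a); endmodule",
}
# Only the debug instrumentation changes: the user partition is reused.
only_debug = dict(first, debug="module debug(input [7:0] b); endmodule")
# A new tap adds a port to the user partition, so it changes as well.
new_tap = dict(only_debug,
               user="module user(input clk, output [7:0] tap_b); endmodule")

print(partitions_to_recompile(first, only_debug))
print(partitions_to_recompile(first, new_tap))
```

In this model, keeping the user partition's ports fixed across debug iterations is what keeps it out of the recompile list, which is the motivation for the permanent taps mentioned in Section 1.6.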
When SignalTap II is used, the taps are added/changed automatically by the tool, and it does this after preserving the placement and routing for the user design partitions from the previous compilation. When customized debug instrumentation is added, however, the taps need to be changed at the RTL level itself by modifying the ports of certain design partitions (since we cannot modify the internals of the tool). Moreover, with SignalTap II the debugging would be at the RTL level, as the information gathered by the EDA tool (Quartus Prime) would correspond to the RTL signals (as described in Section 2.3.1). In this thesis, these techniques are applied to a source level debugging flow [20] to make the debug process easier, more efficient and faster for an HLS user.

Using Xilinx's Vivado ILA: Xilinx's Vivado design suite also provides support for incremental compilation. It does not have the concept of design partitions, but instead uses a design checkpoint (DCP) file as a reference to preserve the placement and routing data from any of the previous compilations. After a full implementation is run for a design, a new incremental implementation can be started after making any changes to the design and providing the routed design from the previous implementation as the DCP file. This reuses the placement and routing for the unchanged logic from the DCP file and incrementally places and routes the changed portion of the design.

This feature can be used along with the ILA [75] to perform incremental in-system debugging. In the first implementation, an ILA debug core can be added to the design and interesting signals can be selected for observation. In the subsequent implementations, if the set of signals selected is changed, then an incremental implementation can be run by preserving the placement and routing from previous compilations using a checkpoint file. Once the implementations are run, the recorded RTL signals can be read from the debug core and interpreted as waveforms using Vivado's interface. If this is to be used with the RTL generated by the HLS tool, then the debugging would be at the RTL level and additional mapping information would be required to relate these signals to the source code variables in order to make it more meaningful for an HLS user.

Using RapidSmith

RapidSmith [48] is an open-source set of tools and APIs that could be used to perform any stage of a CAD flow, such as placement or routing, for Xilinx FPGAs. This is possible by reading the intermediate results from the Xilinx Vivado design suite after a particular stage (using XDL [7]), performing the required tasks using RapidSmith and then communicating the results back to the Xilinx Vivado design suite, which could complete the rest of the stages in the CAD flow and finally generate the bitstream for an FPGA. Using RapidSmith, an incremental place and route tool could be created based on the user requirements instead of using the commercial incremental flow from Vivado, in order to have fine-grain control over the incremental algorithms. However, implementing this is very time consuming, and RapidSmith lacks the documentation and support of a commercial tool because it is known to a smaller community.

The work by Hutchings and Keeley [35] is closely related to this thesis. They use RapidSmith to incrementally add or modify the debug instrumentation, using the leftover resources, after performing the place and route for the user circuit. Their instrumentation is simple and consists of a small number of trace buffers with a small trigger circuit. When inserting this instrumentation, the required RTL signals are connected (routed) to the trace buffer and the trigger circuit inputs. The trace buffers record the value of the signals connected to them in each cycle, as there is no additional compression/scheduling logic inserted, resulting in poor utilization of the memories.
In the subsequent incremental compilations, when the set of signals to be traced is changed, only a few signals have to be re-routed. If the trigger circuitry needs to be changed, then a small amount of logic has to be re-placed and re-routed, leading to a significant reduction in the compile times.

In this thesis, we implement the idea of adding the instrumentation after the place and route of the user circuit in one of our debug flows (in order not to disturb the user circuit), but our work differs from theirs [35] in the following aspects: 1) This thesis focuses on source-level incremental debug for HLS-generated circuits rather than RTL-level debug. 2) We use a commercial incremental flow available in Intel Quartus Prime, which is mature compared to that of an open-source tool (RapidSmith). 3) The instrumentation that needs to be changed during an incremental recompile in our case is much more than just a few added routes, and the work in [35] does not quantify how effective their incremental algorithms are for recompiling such a significant amount of changes incrementally.

Using Debug Overlays

A debug overlay is a virtual customized fabric which can be implemented on the same FPGA along with the user circuit during debugging. This overlay could be compiled simultaneously with the user circuit or could be compiled to the FPGA after the place and route of the user circuit (in which case the presence of the overlay would not perturb the user circuit). The idea is to create an overlay architecture which could be quickly configured to implement the required debug instrumentation circuitry during each debug iteration.
This avoids recompilation of the whole design and could be faster than incremental debugging using ELAs, as the amount of logic that has to be reconfigured or recompiled in the case of overlays is usually much smaller than when recompiling the whole debug partition as in the ELA approach.

Hung and Wilton proposed a virtual overlay network for trace buffers [33] in which they used the leftover routing resources to create an overlay where all the user signals were connected to the inputs of at least one trace buffer. After compiling this overlay to an FPGA, during debugging the user can select a subset of signals for tracing (limited by the number of trace buffer inputs). In the subsequent iterations, the user can change the signals he or she wishes to observe, and this would just require the reconfiguration of routing multiplexers internal to the overlay. Such reconfiguration can be done using partial reconfiguration or bitstream modifications (as in [24]) within a few seconds. This work does not implement any trigger circuitry or compression logic, which leads to poor utilization of the trace buffers. The follow-up works by Eslami and Wilton [15, 16] presented an overlay architecture for inserting trigger circuits incrementally. They used the leftover logic blocks and routing resources to create this overlay, which was compiled to the FPGA after the routing of the user circuit. While debugging, a trigger function is mapped to this overlay using their own routing-aware placement algorithm (because of the limited routing flexibility in the overlay). If the trigger function has to be changed, then it is remapped to the overlay. This is faster than the incremental recompilation of the whole debug partition directly to an FPGA without an overlay.
In all of these overlay works [15, 16, 32], the debugging is done for circuits described at the RTL level, and their feasibility has not yet been investigated for debugging HLS-generated circuits.

However, creating an overlay architecture supporting a wide range of trigger circuits and trace signal connections would be expensive in terms of area, and sometimes there may not be enough leftover resources to build the overlay. Moreover, as the size of the overlay architecture grows, the time taken to incrementally map the debug instrumentation to it would also grow. Considering the complexity and uniqueness of our debug instrumentation (customized for each subset of signals to be observed), building an overlay architecture flexible enough to support an incremental flow would be very difficult without significant area and performance overheads and reduced compile-time benefits. To our knowledge, overlays have not been targeted for debugging HLS-generated circuits at the source level.

Other Techniques

Other works used different approaches to increase the size of the circuit’s execution trace stored in the trace buffers to accelerate the in-system debugging process. Most of them focus on debugging at the RTL level, but could be extended to debugging at the source level for an HLS user by maintaining additional mapping information to relate the RTL signals back to the source code variables.

Bitstream Modifications: In [24], Graham et al. modify the bitstream itself (which is used to configure the FPGA to implement the desired circuit) to instrument the debugging hardware. The debugging circuit, essentially an embedded logic analyzer, was added to the design before compiling the user circuit; however, no connections were made between the user and the debug logic.
During debugging, when the user selects or changes the signals to be traced (to gain the benefits of selective signal tracing), JRoute, a run-time routing API [45], was used to determine the routing for the selected signals, and the JBits API interface [25] was used to modify the bitstream accordingly. In [63], the authors use a similar approach and pre-connect (before the first compilation of the user circuit) the signals of interest to routing muxes which forward the signals to FPGA I/O pins instead of on-chip trace buffers. An external analyzer was then used to analyze the signals coming from the FPGA.

These approaches for routing a small number of signals incrementally are very fast compared to an incremental compilation, but if the entire debug instrumentation has to be changed, then the runtime advantages are not clear. Moreover, bitstream modifications are supported by only a few commercial FPGAs, which limits the use of such approaches.

Lossy/Lossless Compression Techniques: Generic data compression techniques could be used to compress the debug data before writing it to the trace buffers in order to pack in more information. However, the amount of compression achieved would vary significantly depending on the debug data, as these techniques are not customized. In [5], the authors investigate the use of the Bentley-Sleator-Tarjan-Wei (BSTW) [9] and a modified Lempel-Ziv-based [44] lossless data compression algorithms for embedded logic analysis of a circuit running on an FPGA and propose architectures for doing so efficiently. In [4], the authors make use of lossy compression techniques in the first debug iteration to record the intervals of circuit execution in the form of a signature. The failing signatures were then detected, and only the signals from these intervals were recorded in the next debug iterations, ignoring the signals from other error-free intervals and increasing the observability into the circuit.
These compression techniques can be used along with our incremental flows with selective tracing to further accelerate the debugging process.

Using Off-Chip Memories: External storage devices can be used to store the debug data instead of the limited on-chip memories. The debugging time would then be reduced to a greater extent, since the execution traces stored would be much longer. However, the amount of debug data generated in each cycle (bandwidth) by a circuit running at speed is much higher than the bandwidth of these external memories [6], meaning that only fewer signals can be traced in each debug iteration. Alternatively, it is possible to use an on-chip buffer to hold the data until it is written to the off-chip memory [1].

Recent work by Jeff Goeders [18] focused on optimization techniques to reduce the bandwidth of the data generated in order to use off-chip storage devices to achieve long debug traces for HLS-generated circuits.

The most common method of reducing bandwidth requirements is to observe fewer selected variables (selective variable tracing). Combining the use of off-chip memories with selective variable tracing would further ease and accelerate the debug process. However, as described in Section 1.4, this may require several debug iterations, with a different subset of variables being observed each time, to find a bug. This requires a full recompilation between debug iterations. This thesis focuses on solving this issue, and our proposed incremental debug flows, as described in Chapter 4, eliminate these full recompilations.

2.5 Summary

This chapter described previous work which focused on in-system debugging of circuits running on an FPGA. The majority of these works used a trace-based approach in which debug instrumentation is added to record the circuit execution and replay it later. This has numerous advantages over other scan-based approaches.
Some of these previous works were specific to circuits described at the RTL level and required the user to understand the underlying hardware to get the most out of the debug data, while others targeted HLS-generated circuits and allowed for debugging at the source level by mapping the RTL signals back to the source code variables, making them more appropriate for an HLS user.

The goal of any trace-based approach is to record as much data as possible in order to provide greater visibility into a circuit’s execution and speed up the debugging process. To achieve this, some of the works described in this chapter used customized instrumentation to compress the debug data that must be recorded in order to store longer circuit traces. This makes it easier to find the root cause of a bug by providing observability into a significant portion of the circuit execution. On the other hand, some works made use of incremental techniques or overlays to observe a different subset of signals (selective signal tracing) in each debug iteration without the need for full recompilations. When fewer signals are recorded in a debug iteration, the trace buffers are updated less frequently, resulting in longer circuit execution traces which help pin-point the bug. If the selected signals do not provide sufficient information to identify the bug, the selected signals can be changed and another iteration performed. With incremental compilation, the turn-around time between the iterations is significantly reduced. In addition, these incremental techniques could also be used to add the debug instrumentation after the place and route of the user circuit, thus not affecting the timing of the user circuit.

In this thesis, we accelerate a source-level in-system debugging approach for HLS-generated circuits [23] by combining both of these approaches. Customized compression logic is added to the instrumentation to pack the debug data efficiently, and incremental compilation techniques are used to record selected variables in each debug iteration, leading to much better utilization of the trace buffers.

Chapter 3

The Debugging Framework

3.1 Overview

This chapter describes the debugging framework used to illustrate our ideas. Section 3.2 gives a brief overview of the debugging framework. Section 3.3 describes the optimizations used by the framework to improve the trace buffer utilization in order to have a larger trace window. Section 3.5 focuses on the limitations of the framework in performing selective variable tracing and our proposed methodology to overcome them. Finally, Section 3.6 concludes this chapter.

3.2 Framework

The adopted debugging framework was developed by Jeffrey Goeders during his Ph.D. at UBC and is referred to as Goeders’s framework hereafter. References [19–23] provide the details of the framework and the optimizations used. Although it was introduced in Section 2.3.2, in this and the following sections we briefly describe the debugging flow and the trace buffer optimizations, which are most relevant to this thesis.

Goeders’s framework has been designed to work with the LegUp [12] HLS tool; however, it could be easily extended to support other HLS tools. This framework uses a trace-based approach along with a debug database (which contains a mapping between the source code variables and the RTL signals) to provide a source-level in-system debugging facility. It can be operated in two modes: live interactive mode and replay mode. In live mode, the user can single-step through the circuit execution and retrieve the state of the circuit from the FPGA by pausing the circuit and reading the latest entries
from the trace buffers. However, interrupting the circuit’s execution may introduce additional bugs and is not always practical. In this thesis, we focus on replay mode, in which the circuit is executed at speed until a set breakpoint while recording the circuit’s state in trace buffers. The circuit’s execution is then replayed to the user using the information from these trace buffers. The debug instrumentation required to support the replay mode, including the trace buffers, is added at the RTL level. This is done by modifying the LegUp tool to automatically insert the required debug circuitry into the generated RTL.

Figure 3.1: Debug Instrumentation. Blocks shown on the FPGA: the User Circuit (Datapath, Variables in on-chip memory, Control FSMs), the Stepping and Breakpoint Unit, the Debug Manager and Communications block, the State Encoder, and the Trace Recorder (Trace Scheduler and Control/Data Trace Buffer).

3.2.1 Debug Instrumentation

Figure 3.1 shows the instrumentation added by the HLS tool. It consists of the following components:

Debug Manager: It acts as a communication bridge between the debugger application (which can be launched as a GUI on the user’s workstation) and the instrumented debug modules. It receives all the message requests from the GUI through a serial interface and forwards them to the respective module, and vice versa.
It is this module which is responsible for reading the trace buffers and sending the information to the GUI in order to display the variable values while replaying the circuit’s execution.

Stepping and Breakpoint Unit: This unit controls the execution of the circuit by starting or stopping it whenever the set breakpoints are reached. The debug manager forwards the information about the added breakpoints, and this module then generates an enable/disable signal to start or pause the circuit running on the FPGA when the circuit reaches the corresponding state.

State Encoder: It is necessary to record the control flow information (sequence of circuit states) in addition to the dataflow information (variable values) to replay the circuit behavior accurately. The circuit generated by LegUp uses one-hot encoding for its FSMs (Finite State Machines). These one-hot state signals are very wide and are therefore encoded (compressed) by a state encoder module before being recorded, in order to improve the trace buffer utilization.

Trace Recorder: The heart of the instrumentation is an on-chip memory, known as a trace buffer. It is used to record the information necessary to replay the circuit execution for debugging purposes. Typically, it stores three types of data: (1) a history of variables that have been mapped to registers and logic in the datapath of the user circuit, (2) a history of variables that have been mapped to a global memory in the user circuit, and (3) a history of control flow information that describes the sequence of basic blocks executed. Each of these could be stored in a single buffer or in separate buffers.

The Trace Recorder shown in Figure 3.1 encompasses this trace buffer along with the associated logic to improve its utilization. The following sections describe the architecture of the trace recorder along with the optimizations used to pack the trace buffer efficiently.
Figure 3.2: Control Flow Optimization (control flow graph with states S1 to S6)

3.3 Trace Buffer Optimizations

Since the trace buffer is of a limited size (100 Kb in [23]), we cannot store the entire run-time history of all variables and control flow information. Instead, the buffer is configured as a circular memory, so that new entries evict the oldest entries. This means that, when debugging, the user can only view the behavior of variables and control for a sliding portion of the execution (called the “trace window”). Clearly, the more efficiently data can be stored in the trace buffer, the longer the trace window, and the fewer debug turns that will typically be required.

We used a combined buffer to record the required control signals, memory signals and datapath register signals in each cycle of the circuit execution. However, each of these signals is compressed using different optimization techniques [23], and they are then packed together into a single word, which is written to the trace buffer. The effects of these optimizations are recorded in the database in order to accurately link the information back to the source code variables.

3.3.1 Control Flow Optimization

In order to trace the control flow of a circuit’s execution, frameworks like [10, 40] record the FSM state value each cycle. The number of bits required to trace this information grows with the total number of states present in the control flow graph of the design. However, we find that instead of recording every state transition, it is enough to record only in those states that precede a state with multiple predecessors (for the rest of the states, the transition is obvious, as each can only be reached from a single predecessor). With the help of this information and the control flow graph (which is available from the HLS compilation), the full control flow of the circuit execution can be reconstructed off-line.
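As a minimal sketch of the selection rule described above, the following Python fragment marks the states that precede a state with multiple predecessors and counts the bits needed to number them. The six-state graph (one branch, one merge) and all function names are illustrative assumptions and are not taken from the framework itself.

```python
def states_to_record(cfg):
    """Return the states whose ID must be traced: every predecessor of a
    state that has more than one predecessor."""
    preds = {}
    for src, dsts in cfg.items():
        for dst in dsts:
            preds.setdefault(dst, set()).add(src)
    recorded = set()
    for state, incoming in preds.items():
        if len(incoming) > 1:
            recorded |= incoming
    return recorded


def bits_needed(n):
    """Bits required to give n items distinct numbers (at least 1)."""
    return max(1, (n - 1).bit_length())


# Hypothetical control flow graph: S2 branches to S3 or S4, which merge at S5.
cfg = {
    "S1": ["S2"],
    "S2": ["S3", "S4"],
    "S3": ["S5"],
    "S4": ["S5"],
    "S5": ["S6"],
    "S6": [],
}

recorded = states_to_record(cfg)
renumber = {state: i for i, state in enumerate(sorted(recorded))}
all_bits = bits_needed(len(cfg))       # numbering every state
opt_bits = bits_needed(len(recorded))  # numbering only the recorded states
```

For this illustrative graph only S3 and S4 need an identifier, so one bit suffices where numbering all six states would take three.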
Therefore, as we are only recording in certain states, we can renumber these states, which requires fewer bits than numbering all the states. Our framework identifies the states that may need to be recorded and renumbers them. We achieve a significant improvement using this optimization, as we only have to record the state information (with a reduced number of bits) in the few states which are followed by a state with multiple predecessors.

Figure 3.2 shows an example control flow graph (CFG). Frameworks like [10, 40] would require 3 bits (in order to distinguish 6 possible states) of control flow information to be recorded in each cycle. However, if we look at the CFG, there are only two possible executions (S1-S2-S3-S5-S6 or S1-S2-S4-S5-S6), and therefore it is enough to record only when the circuit is in S3 or S4 (shown in red in Figure 3.2), which requires only 1 bit. This significantly reduces the amount of control flow information to be recorded.

3.3.2 Dynamic Tracing of Datapath Register Signals

Unlike commercial ELAs [40, 75], which record all the datapath registers in each cycle, our framework adds custom circuitry known as a trace scheduler, which dynamically selects which signals should be recorded in each cycle. The trace scheduler block uses the HLS scheduling information to record only the active datapath registers (only those which are updated), similar to the approach used in [53]. In addition, Goeders and Wilton propose several other optimizations in [23] to further improve the trace buffer utilization in order to have a longer trace window.

Delay-Worst Signal-Trace Scheduling: Instead of recording the signals in the same cycle as they are generated, they could be delayed and recorded in any of the following cycles. Any such rescheduling of the signals is stored in the database, which allows for the perfect reconstruction of the trace information.
Delayed recording of some signals in a worst-case state could reduce the width of the trace buffers and improve the trace window size significantly, especially when there are few entries with the worst-case width and the rest of the entries are partially filled. Our framework recursively identifies the worst-case state and tries to delay some signals to achieve the best possible width. Figure 3.3(a) shows an example of delay-worst scheduling. As seen, by delaying r10 and recording it in S6 instead of S2, the trace buffer width could be reduced.

Delay-All Signal-Trace Scheduling: Unlike delay-worst scheduling, this optimization tries to delay the entry of all the signals that are updated in a state to a later state without increasing the width of the buffer. The algorithm iterates through the entries of all the states, combining them whenever possible. As a result, the rate at which entries are recorded in the trace buffer is reduced, leading to a much larger trace window. Figure 3.3(b) shows an example of this optimization, where the entry for S1 is removed by moving those signal updates to S7, providing space for more updates to be recorded.

In addition to the datapath registers, the LegUp HLS tool stores certain variables (like arrays or global variables) in on-chip memories driven by memory controller logic. In order to replay the circuit execution, it is also necessary to record updates to these memory variables. There may be several on-chip memories, and different memory controller signals could be driving them. It is enough to store only those memory controller signals which are updated in each cycle instead of recording all of them. This is identical to the datapath register signal tracing problem, and hence the same signal-trace scheduling optimizations are used for memory updates.
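The delay-worst idea can be illustrated with a small greedy sketch. This is not the algorithm from [23]: the greedy move selection, the 32-bit signal widths and the `can_delay_to` map (which states a signal may legally be postponed to) are all assumptions made for illustration; a real tool would derive the legal delays from the HLS schedule.

```python
def width(entry):
    """Total number of bits recorded in one state's trace entry."""
    return sum(entry.values())


def max_after(schedule, sig, src, dst):
    """Buffer width (widest entry) if `sig` were moved from `src` to `dst`."""
    widths = {state: width(entry) for state, entry in schedule.items()}
    bits = schedule[src][sig]
    widths[src] -= bits
    widths[dst] += bits
    return max(widths.values())


def delay_worst(schedule, can_delay_to):
    """Repeatedly take the widest state and delay one of its signals to a
    permitted later state, as long as the overall buffer width shrinks."""
    moves = []
    while True:
        worst = max(schedule, key=lambda s: width(schedule[s]))
        current = width(schedule[worst])
        candidates = [
            (max_after(schedule, sig, worst, dst), sig, dst)
            for sig in schedule[worst]
            for dst in can_delay_to.get(sig, [])
        ]
        if not candidates or min(candidates)[0] >= current:
            return moves  # no move reduces the required width any further
        _, sig, dst = min(candidates)
        schedule[dst][sig] = schedule[worst].pop(sig)
        moves.append((sig, worst, dst))


# Hypothetical schedule: state -> {signal: bit width}.
schedule = {
    "S1": {"r1": 32, "r4": 32},
    "S2": {"r6": 32, "r8": 32, "r10": 32},
    "S6": {"r12": 32},
}
can_delay_to = {"r10": ["S6"]}  # assumed legality information

before = max(width(e) for e in schedule.values())
moves = delay_worst(schedule, can_delay_to)
after = max(width(e) for e in schedule.values())
```

With these assumed widths, moving r10 from S2 to S6 narrows the widest entry from 96 to 64 bits, so a buffer of fixed size can hold proportionally more entries.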
In fact, the same trace scheduler block is used for both the datapath register and the memory signals to output the active signals in the form of a single word (as can be seen in Figure 3.1).

As mentioned earlier in this section, we use a single trace buffer to record the control, memory and datapath register updates. Therefore, in a given cycle, the output of the combined datapath register and memory signal-trace scheduler and the control state signal (if it needs to be recorded in that cycle) are combined into a single entry, which is written to the trace buffer.

Dual-Ported Memory Signal-Trace Scheduling: In addition to the signal-trace scheduling optimizations described above, the dual-ported memory architecture offered by FPGAs can be leveraged to further improve trace buffer utilization. The single word which is to be written to the trace buffer can now be split in half and written as two entries in the same clock cycle using the two ports available on the memory buffer. The improvement from this optimization occurs in entries where less than half of the trace buffer width is required, as such an update now occupies a single entry instead of two. Figure 3.3(c) shows an example where S5 requires a single entry instead of two entries.

With all these dynamic signal-tracing optimizations, the length of the circuit execution that could be traced is improved by more than 127X compared to commercial ELAs such as [40, 75].

Linking the Control and Data Flow Information

During the HLS compilation process, the information on the mapping of the source code variables to the RTL signals, on which FSM states have an entry in the trace buffer, and on which signals are updated in those states is saved in a debug database. While debugging, the information from the trace buffer is retrieved and the control/data flow of the circuit’s execution is reconstructed off-line.
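The circular behaviour of the trace buffer, and its retrieval from the newest entry backwards, can be modelled in software as follows. This is an illustrative model only: the class name, the depth of four entries and the tuple format of each word are assumptions, and the real buffer is an on-chip memory rather than a Python object.

```python
from collections import deque


class TraceBuffer:
    """Circular trace buffer: once full, each write evicts the oldest entry."""

    def __init__(self, depth):
        self.entries = deque(maxlen=depth)

    def write(self, word):
        self.entries.append(word)

    def read_backwards(self):
        """Return the surviving entries, newest first."""
        return list(reversed(self.entries))


buf = TraceBuffer(depth=4)
for cycle in range(10):
    buf.write(("cycle", cycle))  # placeholder for a packed trace word

window = buf.read_backwards()
```

After ten writes into a four-entry buffer, only the entries for cycles 6 to 9 survive; this surviving span corresponds to the trace window.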
When filling the last entry in the trace scheduler, it is ensured that the control state information is also recorded, even though the control flow optimization may decide not to record it. With the help of this state information and the HLS scheduling information, we can read the trace buffer backwards and link the signal update information to the corresponding states, which are later mapped back to the source code variables using the mapping information stored in the debug database.

3.4 Debug Flow

Figure 3.4 shows the overall debug flow for this framework. The user first compiles C code to HDL using LegUp [12], an open-source HLS tool. The HLS tool automatically adds instrumentation to the RTL circuit to record the behavior of all user-visible variables. The circuit is then compiled using a vendor-specific tool-chain and implemented on an FPGA. As the FPGA runs, the instrumentation records a history of variables and control flow information in an on-chip memory. After the run is complete or when a breakpoint is reached, the user launches a debug GUI on a workstation, which connects to the FPGA and downloads the trace history. The GUI uses this history to provide a software-like debug experience. The user is able to step through the recorded execution in an attempt to understand the operation of the design and deduce the root cause of any unexpected behavior. As the user refines his or her view of the operation of the circuit, he or she may run the circuit again, possibly with a different breakpoint, providing visibility into a different part of the circuit execution. Re-running the circuit in this way is called a “debug turn”; often many debug turns are required to pin-point a bug.

3.5 Selective Variable Tracing

Even though different parts of the circuit execution can be observed using the original debug flow, the size of the trace window is fixed in each debug turn.
This is because all user-visible variables are recorded in each debug turn. Being able to examine the value of any variable mimics the software debug experience. For some debug scenarios, however, it may be sufficient to record fewer variables. As an example, if the user has narrowed down the cause of a bug to a particular function, it may be sufficient to trace only variables within that function. Alternatively, if the user knows that certain variables are unimportant to a particular bug, they can be excluded from tracing. Recording a smaller number of variables can increase the size of the trace window in two ways. First, fewer variables may mean the trace memory can be narrower. Given a fixed trace buffer size, this means a deeper trace buffer, so a longer history for each variable can be recorded. Second, recording fewer variables may mean that there are more cycles in which no value is recorded, so the trace buffer is filled more slowly. Increasing the trace window will provide more visibility into the execution of the hardware, hopefully reducing the number of debug turns required to uncover the root cause of unexpected behavior.

A second advantage of recording only a subset of variables is that the instrumentation logic itself will be smaller. For designs that are area constrained, there may not be sufficient unused resources to implement and route the debug logic which captures all variables. For these cases, selecting a subset of variables to observe may be the only possible way of providing the user with debugging capabilities.

For recording only a subset of user-visible variables, the original flow from Figure 3.4 can be used directly. However, as debugging proceeds and the user narrows down the cause of a bug, he or she may wish to change which variables are instrumented. In our considered framework [23], changing the instrumentation in this way requires a rerun of the HLS flow to generate
the new required connections between the user and debug modules, as well as different trace-scheduling logic to efficiently pack the signals to be traced in each cycle using the optimizations described in Section 3.3, resulting in different RTL. This would require a complete recompile of the design, including a lengthy place-and-route to re-implement the circuit on an FPGA. This significantly impacts debug productivity.

In this thesis, we address the problem of lengthy recompilations using incremental design techniques offered by a commercial FPGA vendor tool, Intel Quartus Prime, and significantly reduce the turn-around time between successive debug turns. Chapter 4 describes our incremental debug flows.

3.6 Summary

In this chapter, the debugging framework used to prototype our ideas was described, along with the different signal-trace optimizations used to improve the utilization of the trace buffer (trace window). In addition to these optimizations, we elaborated, in the context of our adopted framework, on how selective variable tracing improves the trace window size, leading to fewer debug turns. Finally, the problems associated with selective variable tracing and our proposed approach to overcome them were presented.

Figure 3.3: Dynamic Signal Tracing Optimizations: (a) Delay-Worst Scheduling, (b) Delay-All Scheduling, (c) Dual-Port Scheduling. ’Si’ indicates the state and ’ri’ indicates the datapath/memory signals that are being recorded.

Figure 3.4: Original Debug Flow. Flowchart boxes: Design; Compile C to RTL and add instrumentation; Place and Route and Configure FPGA; Run; View and Analyze Captured Data; Root Cause Determined? (Yes: Optional: remove instrumentation; No: Change Breakpoint).

Chapter 4

Incremental Debug Flows

4.1 Overview

In this chapter we compare different incremental debug flows for the adopted HLS debug framework [23]. As described in Section 1.6, unlike the incremental debug flows associated with commercial ELAs, the amount of logic that needs to be recompiled in each debug iteration in our case is significant.

FPGA vendors provide the ability to guide the place and route tools to recompile only the parts of a circuit that have changed since a previous compilation rather than recompiling the whole circuit. This promises not only to decrease debug turn-around time, but also to maintain timing in the user circuit between debug iterations. However, a direct use of such incremental features does not work well; Section 4.2.1 describes this approach.

Using incremental compilation requires a careful selection of the flow to balance the impact on the area and delay of the user circuit, the area and delay of the instrumentation, and the compilation run-time. Further, we desire a flow which allows the instrumentation logic to be removed after debugging has completed, with as little impact on circuit timing as possible.

We developed two promising flows that try to balance these metrics. Section 4.2.2 presents an incremental debug flow in which we modify the user partition to add permanent taps to facilitate efficient incremental recompilation. Section 4.2.3 describes another incremental debug flow which is non-intrusive to the user circuit, as the debug instrumentation is added after the compilation of the user partition. In Section 4.3, we introduce the GUI framework developed to automate these incremental flows. Section 4.4 concludes this chapter.

Baseline Flow

The baseline flow (Flow 1) is as described in Section 3.4 and shown in Figure 3.4. All user-visible variables are recorded and the circuit is compiled only once.
However, by using various breakpoints, different time periods within the circuit execution can be observed. This flow does not use incremental compilation.

4.2 Incremental Flows

Commercial ELAs like SignalTap II [40] offer support for incremental RTL debug. As described in Section 2.4, since these IP blocks have to support a wide range of circuits, they are very simple and consist of generic trigger circuitry and trace buffers. During incremental debug, as the signals to be recorded are changed, the tool just needs to perform incremental routing in order to change the necessary connections. However, our case is more complicated because the debug instrumentation is customized based on the signals that are being recorded, mainly due to the trace buffer optimizations described in Section 3.3. In addition, as the debug instrumentation is added at the RTL level, the design partitions are modified each time the recorded variables are changed.

4.2.1 A Naive Incremental Flow

In this incremental debug flow (Flow 2), we allow the user to select a subset of the user-visible variables to record. The source code is then compiled using the LegUp [12] HLS tool, which is modified to insert the required debug instrumentation as in the original framework [23]. However, as shown in Figure 4.1, separate design partitions (to enable incremental compilation) for the user circuit and the instrumentation are constructed, without restricting them to a specific physical region on the FPGA, using fully automated scripts and the GUI (a modified version of that from [23], described in Section 4.3). Based on the variables selected, the user circuit and instrumentation partitions are modified; the user circuit partition is modified to insert “taps” and the debug partition is modified to compress and record those signals in trace buffers. To obtain this debug logic, the LegUp HLS tool is run again with the selected signals.
The design is then compiled using incremental techniques (from Intel's Quartus Prime 16.0). The circuit is run, and the history of the selected variables is stored in the trace buffer; after the run is complete, the trace information is extracted and used with our modified debug GUI, which allows the user to replay the execution in the context of the original software. The user may then choose to instrument a different set of variables, in which case the instrumentation is modified and the incremental compilation is repeated; the tool tries to incrementally place and route the changed partitions while preserving the partitions which have not been changed, as described in Section 2.4.

Intuitively, compared to Flow 1, this should result in better utilization of the trace buffer as fewer variables are being recorded, leading to longer trace windows and reduced debug instrumentation area. However, because both the user circuit (taps) and the debug instrumentation (Trace Scheduler) change every iteration, the ability of the incremental algorithm to reduce compile time is limited. It may also result in a recompilation of the user circuit, which is undesirable as the timing paths in the circuit could be altered. As a result, even if the user only wants to remove the instrumentation after debugging, the user partition may have to be recompiled. In the next section, we describe another incremental flow which tries to preserve the place-and-route results for the user partition and only recompile the debug partition when the variables to be recorded are changed.

4.2.2 Incremental Flow with Permanent Taps

In this incremental flow (Flow 3), changes are localized to the debug instrumentation. The key idea is to ensure that all taps in the user circuit are maintained, even if the corresponding variables are not observed. In this flow, as shown in Figure 4.2, we once again use separate user circuit and debug instrumentation partitions.
Unlike Flow 2, taps for all user-visible variables are added within the user partition, and these taps do not change during the entire debugging process. The taps (ports) which are not used in the current debug turn are not optimized away, as the EDA tool (Quartus Prime) treats unused ports across partitions as virtual pins that are temporarily mapped to logic elements (Look-Up Tables, LUTs), which can later be connected to the required ports in subsequent debug turns without the need for a full recompilation. In order to add these taps, the RTL generated by the framework in [23] was restructured and modified accordingly. We developed fully automated scripts and a GUI to perform these steps, hiding everything from the user's point of view, which we think is very important for user productivity. The GUI is briefly described in Section 4.3.

Because of these permanent taps, as the user changes which variables are observed, only the debug partition is modified and the user partition is not changed. Intuitively, this will achieve the same trace buffer utilization as Flow 2; however, it will be able to take better advantage of the incremental compilation features in the place and route tool, leading to faster debug iterations. On the other hand, the fact that taps are added within the user circuit for all variables means that the instrumented user circuit may run somewhat slower than the uninstrumented design.

As the debug instrumentation partition and the user circuit partition are compiled at the same time, the performance of the user circuit may be affected as it is not able to use all the available FPGA resources.
In this flow, if the user chooses to remove the debug instrumentation after debugging has been completed, it can be removed incrementally without the need for a full compilation. However, since the first compilation is not fully optimized for the user circuit, the user would need to be satisfied with the performance obtained.

4.2.3 Incremental Flow with Permanent Taps and Late Binding

In Flows 2 and 3, the presence of the instrumentation may negatively impact the performance obtained for the user circuit (for example, by “inflating” the user circuit if some of the instrumentation logic is placed within the boundaries of the user circuit). This is undesirable for two reasons: (1) adding the instrumentation during the first compilation changes timing paths within the user circuit, possibly hiding bugs that are being sought or exposing new ones, and (2) after debug is complete and the instrumentation is removed, either a full recompile of the user circuit is required, possibly leading to new timing behaviors, or the designer must be satisfied with the lower performance of the design.

In Flow 4, as shown in Figure 4.3, we maintain user and debug instrumentation partitions along with the permanent taps added to the user partition, as in Flow 3. However, unlike Flows 2 and 3, we perform an initial compilation with an empty debug partition (no logic). No physical region on the FPGA is reserved for this empty debug partition initially. This allows the user partition to be well optimized, as the debug instrumentation partition does not interfere with it. During debugging, we then replace the empty debug partition with a partial trace scheduler corresponding to the variables that are being recorded, and debug as before. This debug partition is now placed and routed using the leftover FPGA resources.

Intuitively, compared to Flow 3, this flow will lead to the fastest performance of the user circuit.
However, when the instrumentation is added, we would expect the performance of the instrumented circuit to be lower than that in Flow 3, since the instrumentation is optimized separately from the user circuit and may now contain the timing-critical paths. This means that, during debugging, it may be necessary to slow the clock slightly; whether this is acceptable depends on the design and the system in which the design is being used. Once debugging has been completed, the user may choose to remove the instrumentation incrementally without the need for a full recompilation, and the circuit can then be run at the original (uninstrumented) clock frequency that was obtained during the first compilation.

Figure 4.1: A Naive Incremental Flow (Flow 2)
[Flowchart: add instrumentation to record select signals; compile (first iteration: complete compilation, subsequent iterations: incremental); run chip, download trace, examine execution with GUI; if the root cause is not determined, change the instrumented variables and repeat; otherwise, optionally remove the instrumentation, leaving an empty debug partition.]

Figure 4.2: Incremental Flow with Permanent Taps (Flow 3)
[Flowchart: same steps as Figure 4.1, with the user circuit partition and the debug partition (trace scheduler and other debug circuitry) under the top level.]
Figure 4.3: Incremental Flow with Permanent Taps and Late Binding (Flow 4)
[Flowchart: create partitions with an empty debug partition; complete compilation of the user circuit with taps only; modify instrumentation to record select signals; incremental compilation of the debug partition only; run chip, download trace, examine execution with GUI; if the root cause is not determined, change the instrumented variables and repeat; otherwise, optionally remove the instrumentation.]

Figure 4.4: Modified GUI to support Selective Variable Recording

4.3 Automated GUI

In order for Quartus Prime to support incremental compilation features, the overall design has to be divided into partitions. Moreover, for Flow 3 and Flow 4, the user partition has to be modified to insert permanent taps into it. We have developed a GUI framework (modified from that in [23]) to do all of this automatically in the background. It has been made available on-line as an open source debugging framework at https://bitbucket.org/pavankumarbussa/.

Figure 4.4 shows a screenshot of our GUI. A brief tutorial on how to use this GUI effectively is provided in Appendix A. From the user's point of view, the user just needs to open a design, optionally set breakpoints, select the source code variables of interest, select Flow 3 or Flow 4, click the compile button and finally program the bitstream to the FPGA once the compilation is successful. Under the hood, our GUI runs LegUp HLS to generate the RTL and create a debug database, restructures the RTL code as required (such as making the partitions and adding taps), reruns LegUp HLS to get the necessary debug instrumentation whenever the selected variables are changed, and finally determines whether a full or incremental compilation has to be performed.
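The final step, deciding between a full and an incremental compilation, might be sketched as follows. This is only an illustrative sketch inferred from the flow descriptions in Section 4.2; the names `CompilePlan` and `plan_compile` are hypothetical and are not taken from the actual GUI code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompilePlan:
    mode: str             # "full" or "incremental"
    recompile: frozenset  # partitions that are placed and routed again
    preserve: frozenset   # partitions whose earlier results are reused


def plan_compile(flow: int, first_compile: bool) -> CompilePlan:
    """Hypothetical decision logic for Flow 3 and Flow 4 (not the GUI's real code)."""
    if flow not in (3, 4):
        raise ValueError("only Flow 3 and Flow 4 are modelled in this sketch")
    if first_compile:
        # Flow 3: user circuit (with taps) and instrumentation compiled together.
        # Flow 4: user circuit (with taps) compiled with an empty debug partition.
        return CompilePlan("full", frozenset({"user", "debug"}), frozenset())
    # Permanent taps keep the user partition unchanged, so later debug turns
    # only need to place and route the debug partition.
    return CompilePlan("incremental", frozenset({"debug"}), frozenset({"user"}))


plan = plan_compile(flow=4, first_compile=False)
print(plan.mode, sorted(plan.recompile), sorted(plan.preserve))
```

In this simplified model, Flow 3 and Flow 4 differ only in what the first full compilation contains, which is why both return the same plan for later debug turns.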
Once the circuit has run to completion or a breakpoint has been reached, the user can enter the replay mode, where the GUI reads the data serially from the trace buffer (and performs off-line analysis to reconstruct and link the control flow information with the variable updates) and replays the circuit execution by showing the updates for the selected variables as the circuit execution progresses. The user can then select a different subset of variables and repeat the same steps until satisfied with the design.

4.4 Summary

In this chapter, different incremental flows developed for the framework in [23] were described. These incremental flows enable 'Selective Variable Tracing' without the need for a full recompilation. With the help of the trace buffer optimizations (described in Section 3.3) and selective variable tracing, the trace buffer can be packed much more efficiently, leading to larger trace window sizes. As a result, fewer debug turns might be enough to pinpoint a bug, and the turn-around time between successive turns is reduced significantly as there is no need for full recompilations. Each of the proposed flows has its own advantages and disadvantages. If the user circuit is timing critical, then Flow 4 would be the better option, but if one desires to debug at relatively higher clock speeds, then Flow 3 would be a good option. Lastly, we introduced our GUI framework, which abstracts away complexities from the user and provides a faster and more efficient debug environment.

Chapter 5
Results

5.1 Overview

In this chapter, we present the results obtained for different sets of experiments conducted to evaluate our proposal. Section 5.2 describes our methodology and the benchmarks used to evaluate our flows. In Section 5.3 and Section 5.4, we quantify the impact of reducing the number of variables that are recorded (Selective Variable Tracing) on the trace window size and the debug instrumentation area respectively.
Intuitively, reducing the number of traced variables will increase the trace window size, but the amount of increase is not clear, since there is the possibility of a one-to-many relationship between the source code variables and the RTL signals.

In Section 5.5, we illustrate the impact of incremental compilation on debug turn-around time for the flows described in Chapter 4. We show that by using our flows, a 40% reduction in compile times can be achieved when doing selective variable tracing, compared to the original flow proposed in [23].

The presence of debug instrumentation and permanent taps may affect the performance of the user design. The variation in the frequency of the design is described in Section 5.6, where we show how the use of an empty debug partition in the first compilation reduces this performance loss once the debug instrumentation is removed. Finally, Section 5.7 concludes this chapter.

5.2 Methodology

As described in Chapter 3, we assume the architecture from [23]. A total trace buffer size of 100 Kilobytes is assumed. We do not split this buffer, but rather use a single buffer to store the updates to variables that are in both the user circuit's datapath and memories, as well as the control flow information. We chose this configuration as it provided better trace buffer utilization by improving the packing efficiency. It should be noted that our flows would work without any modifications for any other trace buffer configuration. For our experiments, we used the circuits from the CHStone benchmark suite [28] and compiled them using our HLS debug framework. The generated RTL was synthesized to a Stratix IV FPGA (EP4SGX530NF45C3) using Intel's Quartus Prime 16.0 on a workstation with a quad-core Intel Xeon CPU E3-1225 V2 processor.

In addition to the CHStone benchmarks that come with the LegUp HLS tool, we used the largest circuit (FFT Transpose) from another benchmark suite, Machsuite [66], to demonstrate the scalability of our approach.
All other benchmarks from Machsuite were of similar size to those in the CHStone suite and hence were not considered. Moreover, these benchmarks (which are not provided by LegUp) could not be compiled successfully because they used an unsupported fixed-point representation. We had to replace these datatypes with those that are supported by LegUp, which was time consuming.

Also, in order to do 'Selective Variable Tracing' in each debug turn, we do not incorporate any special signal selection algorithms like those of [46, 49], as the focus of this thesis is on improving the debug turn-around time when a different set of variables is recorded and not on how these variables are selected. Therefore, for the purpose of our experiments, we select a unique subset of variables for each debug turn by shuffling the list of variables (using a deterministic seed) and then selecting the required proportion.

Table 5.1: Trace Window Size (For Flow 3)

Benchmark   100% variables traced   50% variables traced   25% variables traced
            Window size (cycles)    Window size (cycles)   Window size (cycles)
adpcm       2755                    3628                   5460
aes         4849                    10171                  22834
blowfish    6586                    9961                   15546
dfadd       1265                    1650                   2342
dfdiv       4159                    5640                   7078
dfmul       1126                    1361                   1944
dfsin       2869                    3547                   4509
gsm         732                     1413                   3401
jpeg        3521                    5832                   7567
mips        1103                    1968                   3559
motion      6232                    10691                  15160
sha         4370                    8248                   13376
FFT         1127                    1976                   2471
Average     3130                    5083 (1.6x)            8096 (2.6x)

The values in parentheses indicate the improvement over the trace window size corresponding to 100% of variables recorded (Column 2).

5.3 Impact on the Trace Window Size

The following set of experiments was performed to analyze the advantages of 'Selective Variable Tracing'. Although reducing the number of variables to be recorded will increase the trace window size, the degree to which this occurs is not clear because of the Static Single Assignment (SSA) representation used for the IR code by the LLVM compiler, which is the front end of our HLS (LegUp [12]) debug framework.
In SSA, each update of a source code variable is represented by a unique IR signal. A frequently updated variable would therefore result in many more IR signals than a variable that is updated less frequently. Thus, a single source code variable may correspond to multiple IR signals, which may finally be converted to multiple RTL signals by the LLVM compiler. Hence, if a user chooses not to record a variable, this might result in ignoring several RTL signals. The mapping information between a source code variable and the RTL signals is stored in the database created by our framework and is used for linking the information stored in the trace buffer back to the corresponding variable. As a result, there is no direct linear relationship between the number of variables being recorded and the improvement in trace buffer utilization (trace window size).

Table 5.1 shows the impact on trace window size as we vary the proportion of user-visible variables that are traced using our incremental flow with permanent taps (Flow 3). Flow 2 and Flow 4 (incremental flow with permanent taps and late binding) would give similar trace window sizes, as this quantity primarily depends on the number of variables being recorded. Column 2 shows the number of execution cycles for which the history of all the user-visible variables is stored in the trace buffer at the end of the run. To obtain this information, we used the same approach as [23], where the circuit is simulated using Modelsim to extract the required execution trace and the filling of the trace buffer cycle by cycle. With the help of this data, the trace buffer size in terms of number of execution cycles was obtained.

Column 3 shows the same quantity assuming 50% of the user-visible variables are recorded and Column 4 shows the same assuming 25% of the
user-visible variables are recorded. To gather these results, we ran six experiments, each with a different subset of the variables selected, and averaged the results.

Figure 5.1: Trace Window Size Variation [Trace_subset / Trace_full]
[Whisker plots of trace window size improvement for the aes and gsm benchmarks, for the 50% and 25% cases; vertical axis from 1 to 17.]

As the table shows, reducing the number of variables recorded increases the amount of execution history that can be stored in the trace buffers by 1.6x for the 50% case and 2.6x for the 25% case. Intuitively, the increase depends on the relative update frequencies of the variables that are being recorded. If variables that are rarely updated during the execution of the program are selected, we would expect the improvement to be large, because the trace buffer would fill slowly and thus a large portion of the circuit execution could be traced.

5.3.1 Variation in the Trace Window Size

There are two benchmarks (aes and gsm) for which the trace window size improved by more than 4x when the number of variables recorded was reduced from 100% to 25%. In these experiments, the variables that were selected tended to be updated very infrequently. To better understand how this variability in the variable access rate affects trace window size, we took these two benchmarks and ran 100 experiments with different variable selections (unique subsets) for each of the 50% and 25% cases. For each experiment, we measured the trace window size, and created the whisker plot shown in Figure 5.1. In this diagram, the box shows the second and third quartiles of trace window size. The first quartile (highest trace window sizes) is shown as a dotted line above the box, and the fourth quartile (lowest trace window sizes) is shown as a dotted line below the box. The median is represented by a red line, and the average by a small red square.
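A summary of this kind could be computed as in the following sketch. It assumes that each experiment yields one trace window size in cycles, and it normalizes each value by the window size obtained with all variables recorded, as in the caption of Figure 5.1. The function name `summarize_windows` and the subset window sizes in the usage example are made up for illustration; only the full-observability value for gsm (732 cycles) is taken from Table 5.1.

```python
import statistics


def summarize_windows(subset_windows, full_window):
    """Normalize trace window sizes and return box-plot style statistics."""
    ratios = sorted(w / full_window for w in subset_windows)
    # Three cut points splitting the sorted ratios into four groups.
    q1, median, q3 = statistics.quantiles(ratios, n=4, method="inclusive")
    return {
        "min": ratios[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ratios[-1],
        "mean": statistics.mean(ratios),
    }


# Hypothetical window sizes (cycles) for five variable selections of gsm.
stats = summarize_windows([732, 1464, 2196, 2928, 3660], full_window=732)
print(stats)
```

The box in the plot would span `q1` to `q3`, the whiskers would extend to `min` and `max`, and `median` and `mean` correspond to the red line and the red marker.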
In all cases, these trace window sizes (Trace_subset) are normalized to the trace window size obtained when all the user-visible variables are recorded (Trace_full). Even though the average trace window size improvement for these benchmarks is nearly 2x for the 50% case and 4x for the 25% case (the small red markers in Figure 5.1), there are some cases where the improvement is as high as 17x (for the gsm benchmark) if we choose a subset in which the variables are not updated frequently. It should be noted that, in every case, the trace window size when a partial subset of variables is recorded (Trace_subset) is almost the same as or larger than that when all the user-visible variables are recorded (Trace_full).

5.4 Impact on Debug Instrumentation Area

As stated in Section 3.5, a second advantage of 'Selective Variable Tracing' is the reduction in the debug instrumentation area as the number of variables recorded is reduced. The debug instrumentation logic in our case is the same as that of [23]. It consists of fixed-cost logic, namely the communication and debug manager, the trace recorder, and the stepping and breakpoint unit, which are independent of the benchmark considered and the number of variables recorded. In addition, it consists of a state encoder (described in Chapter 3 and used for recording control information) and a trace scheduler block (for compressing and recording the appropriate signals), which depend on the benchmark circuit, as the number of states and the variables vary across the different benchmarks. However, for a given
benchmark, the number of states that must be encoded remains the same even if a different number of variables is recorded. Therefore, it is the trace scheduler logic which changes when the number of variables recorded is changed. This is because of the change in the number of corresponding RTL signals that must be multiplexed by the trace scheduler block.

Table 5.2 shows the area of the trace scheduler block in number of Stratix IV ALMs as the number of variables recorded is reduced from 100% to 25% for Flow 3 (once again, the area reduction would be by the same factor for Flow 2 and Flow 4, as this primarily depends on the number of variables recorded). The trace scheduler area is reduced, on average, by 1.7x for the 50% case and 2.8x for the 25% case. This is because fewer signals need to be time-multiplexed into the trace buffers, leading to much simpler trace scheduling logic.

Table 5.2: Trace Scheduler Area (For Flow 3)

Benchmark   100% variables   50% variables   25% variables
            Area             Area            Area
adpcm       2064             1231            684
aes         883              343             170
blowfish    965              577             271
dfadd       1628             1079            625
dfdiv       1441             1021            573
dfmul       1023             622             378
dfsin       4016             2837            1718
gsm         1029             539             293
jpeg        3478             2168            1315
mips        991              339             169
motion      900              501             328
sha         500              204             88
FFT         13095            7757            4759
Average     2463             1478 (1.7x)     875 (2.8x)

The values reported are the number of Stratix IV ALMs obtained after compiling the circuit using Quartus Prime 16.0. The values in parentheses indicate the reduction in area compared to the area corresponding to 100% of variables recorded (Column 2).

Clearly, incorporating this selective variable tracing into a debug flow would have a significant impact on debug efficiency, by providing the user with an option for not recording redundant or unimportant variables.
It also reduces the amount of debug instrumentation that has to be added, which is especially helpful when the user circuit itself is very large and consumes most of the resources on the FPGA. However, as described in Chapter 3, performing selective variable tracing using the framework from [23] requires a full recompilation. We have developed different incremental flows (described in Chapter 4) to address this issue. In one of these flows, Flow 2, the whole design was recompiled when the variables to record were changed, even after creating design partitions, as both the user partition and the debug circuit partition were changed (as described in Chapter 4). That is, the results were virtually the same as those of Flow 1 (Baseline Flow), and hence we do not include the results for Flow 2 separately. The following sections evaluate our flows (Flow 3/Flow 4 vs Flow 1) in terms of savings in the compile times between successive debug turns and the variation in frequency for the overall debug flow.

Table 5.3: Total Compile Time for Flow 3 in Seconds

            Flow 1       Analysis   Flow 3: Variables Recorded          Flow 3      Flow 3
Benchmark   100% Obs.*   overhead   50%*    50%     25%*    25%    0%   Reduction   Reduction
                                                                         (for 25%)   (for 50%)
adpcm       351          31         323     176     305     179    165  49.0%       49.8%
aes         283          25         282     158     273     172    136  39.2%       44.0%
blowfish    168          19         179     110     158     108    100  35.6%       34.5%
dfadd       165          17         157     121     141     120    104  27.0%       26.5%
dfdiv       243          27         239     156     221     158    126  34.8%       36.0%
dfmul       135          14         141     106     132     105    90   22.2%       21.3%
dfsin       438          34         407     244     368     227    196  48.1%       44.4%
gsm         210          24         196     133     181     135    102  35.8%       36.9%
jpeg        939          47         897     368     848     365    317  61.1%       61.0%
mips        138          16         138     100     129     100    89   27.7%       27.4%
motion      287          24         268     140     252     136    130  52.5%       51.1%
sha         150          16         120     102     117     92     89   38.7%       32.1%
FFT         2398         231        2503    1088    2335    958    809  60.0%       54.6%
Average     454          40         450     231     420     220    189  40.9%       40.0%

* indicates that it is a full compilation and not incremental.
0% means that the debug instrumentation is removed.

Table 5.4: Total Compile Time for Flow 4 in Seconds

            Flow 1       Analysis   Flow 4: Variables Recorded    Flow 4      Flow 4
Benchmark   100% Obs.*   overhead   0%*     50%     25%     0%    Reduction   Reduction
                                                                   (for 25%)   (for 50%)
adpcm       351          31         314     177     174     158   50.5%       49.6%
aes         283          25         266     157     165     129   41.6%       44.7%
blowfish    168          19         153     112     111     96    34.2%       33.3%
dfadd       165          17         137     126     120     99    27.6%       23.5%
dfdiv       243          27         224     160     155     126   36.1%       34.1%
dfmul       135          14         118     110     106     89    21.5%       18.4%
dfsin       438          34         368     254     230     186   47.5%       42.0%
gsm         210          24         185     135     135     105   35.9%       35.8%
jpeg        939          47         950     369     359     309   61.8%       60.7%
mips        138          16         123     102     102     87    26.4%       25.9%
motion      287          24         247     146     143     128   50.1%       49.1%
sha         150          16         108     105     96      88    36.1%       30.3%
FFT         2398         231        2865    1258    996     767   58.5%       47.6%
Average     454          40         466     247     222     182   40.6%       38.1%

* indicates that it is a full compilation and not incremental. 0% means that the debug instrumentation is not present (removed).

5.5 Impact on the Compile Time

The following experiments were performed to demonstrate the reduction in the time taken for recompilation when a user wishes to change the debug scenario (recording different variables in our case) using our debug flows.

Table 5.3 shows the impact on the total compile time of each circuit for Flow 1 (original flow) and Flow 3 (incremental flow with permanent taps). It includes the total time from Analysis and Synthesis until the generation of the FPGA bitstream. Column 2 shows the overall compile time of each circuit in seconds for the baseline flow (Flow 1), which does not use incremental techniques (as in [23]). In order to create design partitions for Flow 3, we had to first run the Analysis and Elaboration step to get the hierarchy of the design.
This overhead time is shown in Column 3.

Columns 4-8 show the compile time for Flow 3 (incremental flow with permanent taps), in which the permanent taps are added to the user circuit and the changes to the RTL are localized to the debug partition. After partitioning the design, the circuit is compiled from scratch with the instrumentation added for 50% observability (we anticipate that Flows 3 and 4 would mostly be used with reduced observability to gain faster debug turn-around times). This is the first compilation, and the user circuit is co-optimized along with the debug instrumentation. It is not incremental (the compile time is shown in Column 4). Then, the variables are changed to a different 50% subset and the circuit is recompiled incrementally. Six different subsets of the variables were used for these experiments and the averaged results are shown in Column 5. Next, we repeat the same experiment by starting with the instrumentation required for tracing 25% of the variables (Columns 6-7). At any point the designer could decide to remove the instrumentation; this would not need a full recompile as the user partition is not modified (Column 8 shows the time taken for this step). The last two columns show the improvement in compile time for Flow 3 (25% and 50% observability) compared to the baseline flow (Flow 1).

Table 5.4 shows the same quantities for Flow 4 (incremental flow with permanent taps and late binding), in which the user circuit is placed and routed first and then the debug instrumentation is added incrementally. Columns 2-3 are the same as those of Table 5.3, namely the compile time for the baseline flow (Flow 1) and the analysis overhead respectively. Columns 4-7 show the compile time for this flow. Column 4 shows the time taken for the first compilation with no debug instrumentation. Columns 5-6 show the averaged results for six different subsets of 50% and 25% of variables recorded, which are incremental compilations.
Column 7 shows the time taken to remove the debug instrumentation for this flow, and the last two columns show the improvement in compile time for Flow 4 (25% and 50% observability) compared to the baseline flow (Flow 1).

The overall compile time for Flow 3 (incremental flow with permanent taps) reduces by 40.9% for the 25% case and 40.0% for the 50% case. For Flow 4 (incremental flow with permanent taps and late binding) it reduces by 40.6% for the 25% case and 38.1% for the 50% case. This means that, although the initial run is somewhat longer (due to the analysis overhead) for Flows 3 and 4, each additional debug turn, in which the subset of user variables to be recorded is changed, is 38-40% faster. It should be noted that for the largest designs (jpeg and FFT), which have the largest compile times, the run-time was reduced the most (61.8% and 60.0% respectively). This suggests that the technique is scalable and will be especially helpful for large designs, where users are most affected by running a full recompile with each debug iteration.

To better understand these results, we measured the size of the debug instrumentation as a fraction of the overall instrumented circuit size for each circuit using our Flow 3 (incremental flow with permanent taps) implementation (the values would be almost the same if the Flow 4 implementation was considered); these results are shown in Table 5.5. The last column indicates the percentage of the debug instrumentation in the overall circuit (column 3/(column 2 + column 3)). The ratio varied from 17% to 49%. For benchmarks such as dfmul, where the debug instrumentation (49.1%) is as large as the user circuit, the compile time benefits observed were the least (21%) because the tool had to recompile a major portion of the design in these cases.

Table 5.5: Area Breakdown (For Flow 3)

            User      Instrumentation (100%)   Debug
Benchmark   Circuit   Trace Sched.   Other     Partition
adpcm       7881      2064           700       26%
aes         7700      883            734       17.3%
blowfish    3410      965            685       32.6%
dfadd       3528      1628           654       39.2%
dfdiv       5950      1441           706       26.5%
dfmul       1702      1023           619       49.1%
dfsin       12066     4016           851       28.74%
gsm         4224      1029           946       31.8%
jpeg        21095     3478           1124      17.9%
mips        1826      991            712       48.3%
motion      7092      900            730       18.7%
sha         2167      500            622       34.1%
FFT         51703     13095          1750      22.3%
Average     10026     2463           833       30.2%

All area values are given in number of Stratix IV ALMs. The "Other" column represents the rest of the instrumentation logic, namely the Communication & Debug Manager, Trace Recorder, State Encoder and Stepping & Breakpoint Unit, which were described in Chapter 3.

We also estimated the overhead in an incremental compile by running an incremental compilation twice in a row with no changes in between (Table 5.6); the run-time of the second compile averaged 160 seconds, which we believe is due to the analysis performed by the tool to determine changes in the design, if any. This overhead limits the benefits that are obtained using our current setup for the smaller circuits, whose compile time is close to this average overhead; clearly, as circuits grow or changes are better localized, the impact of this overhead can be reduced.

Table 5.6: Incremental Compile Overhead for Flow 3 in Seconds

Benchmark   Incremental Compile Overhead
adpcm       146
aes         135
blowfish    108
dfadd       97
dfdiv       128
dfmul       86
dfsin       179
gsm         115
jpeg        287
mips        85
motion      128
sha         88
FFT         502
Average     160

5.6 Impact on the Frequency of User Circuit

The presence of debug instrumentation, design partitions and permanent taps in our flows might affect the performance of the user circuit. Because of these factors, there might still be a loss in the performance of the user circuit even after removing the debug instrumentation, once the debugging has been finished.
The following experiments quantify this loss in frequency, if any, for our incremental debug flows (Flow 3 and Flow 4).

Table 5.7 shows the variation in the frequency of the design for Flow 3 (incremental flow with permanent taps) compared to that of the original uninstrumented user circuit. Columns 2-3 show the fmax values of the design without and with the debug instrumentation (corresponding to 100% observability) respectively for Flow 1. These fmax values were obtained from Quartus Prime 16.0 after compiling the design. On average there is a loss of 4.5% for the instrumented design compared to the fmax of the original user circuit. This clearly shows that the instrumentation circuitry may perturb the user design, changing some of its timing paths.

Columns 4-6 show the fmax results for Flow 3 (incremental flow with permanent taps). First, a full compilation is run after instrumenting the logic required to trace a subset of 25% of the user variables (Column 4). Then we run six compilations using different subsets of 25% of the variables and average the results (Column 5). There is a loss of 9% in fmax compared to the original circuit. When the instrumentation is removed, we get some of the performance back; Column 6 shows that there is a loss of 4.5% compared to the original circuit. This may be attributed to the creation of partitions, the addition of permanent taps and the co-optimization of the user and debug partitions in the initial compilation.

Similarly, Table 5.8 shows the variation in the frequency of the design for our Flow 4 (incremental flow with permanent taps and late binding) compared to that of the original uninstrumented user circuit. Columns 2-3 show the fmax values of the design without and with the debug instrumentation for Flow 1 (the same as in Table 5.7). Columns 4-6 show the fmax results for Flow 4.
In Flow 4, we start with an empty debug partition to ensure the user partition is optimized as much as possible. As Column 4 shows, fmax for this initial compilation is roughly the same as that of the original uninstrumented user circuit (Column 2). We then replace the empty debug partition with instrumentation (required for 25% observability) using an incremental compilation. As the results in Column 5 show, this has a negative impact on the performance of the overall instrumented circuit; compared to the original circuit, adding the instrumentation lowers fmax by 15.7%. In some cases, the drop is larger; in jpeg, the overall instrumented circuit runs 41% slower than the uninstrumented version. We have observed that, in some cases, routing between the user circuit and the instrumentation becomes difficult due to congestion, causing nets to take circuitous routes. This does not occur to the same extent in Flow 3, since in that case the user circuit and instrumentation are optimized simultaneously, meaning the user circuit can be adjusted to allow for connections within the instrumentation if necessary. Finally, when we remove the instrumentation, the value of fmax returns to a frequency very close to that of the original uninstrumented circuit; Column 6 shows that the frequency of the circuit after instrumentation has been removed is 1.3% slower than the uninstrumented circuit. The 1.3% loss is primarily due to the taps that are added and left in the user circuit once the instrumentation has been removed. However, it should be noted that the frequency which was obtained for the user circuit in the first compilation (with empty debug partition) is maintained throughout the debugging process and could be recovered even after the
debug instrumentation is removed (as can be seen from Columns 4 and 6 of Table 5.8).

Figure 5.2: Frequency Variation with the number of taps (plot of fmax against % of taps: 100%, 75%, 50%, 25%, No Taps; series sha_org, sha_instrumented, dfmul_org, dfmul_instrumented)

To better understand this, we took two benchmark circuits, dfmul and sha, for which the loss was very high, and obtained the frequencies for these designs as the number of taps was reduced. Figure 5.2 shows how fmax changes as the taps into the user partition are reduced. We randomly deleted the required number of taps for the purpose of these experiments. For both circuits, as the number of taps is reduced, the fmax of the designs (sha_instrumented and dfmul_instrumented) approaches the value of the corresponding original user circuits (sha_org and dfmul_org). This shows that the presence of such taps might slightly affect the frequency of the user circuit; however, it enables efficient incremental recompilations. As seen from Figure 5.2, the fmax values are not exactly the same as those of the original circuits even when all the taps are removed. This slight variation may be attributed to the presence of partitions in our design (even when we have no taps, we have an empty debug partition).

Table 5.7: Frequency Results: Flow 3 vs Flow 1
Benchmark  Flow 1    Flow 1           Flow 3       Flow 3        Flow 3
           Original  With             First        Incremental   Instrumentation
           User      Instrumentation  Compilation  Compilations  Removed
           Circuit   100% Obs.        25% Obs.     25% Obs.      0% Obs.
           (MHz)     (MHz)            (MHz)        (MHz)         (MHz)
adpcm      134       134 (-0.6%)      129          129           129 (-4.1%)
aes        133       134 (0.6%)       126          125           126 (-5.3%)
blowfish   207       191 (-7.9%)      197          173           197 (-5.0%)
dfadd      190       196 (3.0%)       201          201           201 (5.6%)
dfdiv      196       184 (-6.5%)      190          190           179 (-9.1%)
dfmul      177       158 (-10.8%)     158          158           158 (-10.5%)
dfsin      179       168 (-6.4%)      173          164           173 (-3.3%)
gsm        162       171 (5.6%)       158          158           158 (-2.6%)
jpeg       98        94 (-3.4%)       95           84            98 (0.6%)
mips       173       172 (-0.6%)      154          158           163 (-5.8%)
motion     160       138 (-14.2%)     156          126           156 (-2.5%)
sha        222       205 (-7.9%)      222          200           230 (3.3%)
FFT        112       100 (-9.9%)      99           92            90 (-19.2%)
Average    165       157 (-4.5%)      158          151           158 (-4.5%)
For Columns 3 and 6, the values in parentheses indicate the % variation of the frequency with respect to that of the original user circuit (Column 2).

Table 5.8: Frequency Results: Flow 4 vs Flow 1
Benchmark  Flow 1    Flow 1           Flow 4       Flow 4        Flow 4
           Original  With             First        Incremental   Instrumentation
           User      Instrumentation  Compilation  Compilations  Removed
           Circuit   100% Obs.        0% Obs.      25% Obs.      0% Obs.
           (MHz)     (MHz)            (MHz)        (MHz)         (MHz)
adpcm      134       134 (-0.6%)      135          134           135 (0.3%)
aes        133       134 (0.6%)       128          115           128 (-3.6%)
blowfish   207       191 (-7.9%)      215          173           215 (3.9%)
dfadd      190       196 (3.0%)       188          188           188 (-1.1%)
dfdiv      196       184 (-6.5%)      193          163           193 (-1.8%)
dfmul      177       158 (-10.8%)     154          154           154 (-12.9%)
dfsin      179       168 (-6.4%)      176          151           176 (-1.9%)
gsm        162       171 (5.6%)       166          166           166 (2.8%)
jpeg       98        94 (-3.4%)       100          59            100 (1.9%)
mips       173       172 (-0.6%)      172          163           170 (-1.7%)
motion     160       138 (-14.2%)     165          116           165 (3.2%)
sha        222       205 (-7.9%)      214          191           214 (-3.8%)
FFT        112       100 (-9.9%)      108          65            110 (-1.5%)
Average    165       157 (-4.5%)      163          141           163 (-1.3%)
For Columns 3 and 6, the values in parentheses indicate the % variation of the frequency with respect to that of the original user circuit (Column 2).
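The percentage variations quoted in this section are simple relative changes in fmax. The sketch below recomputes two of them from the rounded MHz values in the jpeg row of Table 5.8. The function name pct_change and the dictionary keys are illustrative only and are not part of the thesis tooling. Since the tables report rounded frequencies, recomputed values can differ slightly from the printed percentages, which were presumably derived from unrounded data.

```python
def pct_change(fmax_mhz: float, baseline_mhz: float) -> float:
    """Relative change of fmax with respect to a baseline, in percent."""
    return (fmax_mhz - baseline_mhz) / baseline_mhz * 100.0


# Rounded values (MHz) copied from the jpeg row of Table 5.8.
# The key names are hypothetical labels for the table columns.
jpeg = {
    "original": 98,        # Column 2: uninstrumented user circuit
    "first_compile": 100,  # Column 4: Flow 4, empty debug partition
    "incremental": 59,     # Column 5: Flow 4, 25% observability
    "removed": 100,        # Column 6: instrumentation removed
}

# Variation of the circuit after instrumentation removal vs. the original.
after_removal = pct_change(jpeg["removed"], jpeg["original"])

# Variation of the instrumented circuit vs. the first (empty) compilation.
while_debugging = pct_change(jpeg["incremental"], jpeg["first_compile"])

print(f"after removal:   {after_removal:+.1f}%")
print(f"while debugging: {while_debugging:+.1f}%")
```

The same helper can be applied to any row of Tables 5.7 and 5.8; only the choice of baseline column changes the result.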
5.7 Summary

This chapter provided a detailed analysis of the results obtained for various experiments which were conducted to quantify the impact of selective variable tracing and also the impact of our incremental debug flows on compile time and the frequency of the user circuit.

As shown in Section 5.3, Selective Variable Tracing could achieve significant improvements in the trace window size and also a reduced debug instrumentation area. Our flows enable selective variable tracing to reduce the number of debug turns and also leverage incremental compilation techniques to accelerate the debug turn-around times by almost 40%, on average.

Like any other debug flow, our flows may also potentially interfere with the user circuit partition, reducing its maximum possible operating frequency. Section 5.6 shows that for Flow 3, the performance of the user circuit degrades to a much greater extent (loss of 4.5%) when compared to our Flow 4 (loss of 1.3%). However, the performance of the overall instrumented circuit is better for Flow 3, as the user and the debug partitions are co-optimized, when compared to Flow 4, where the debug partition uses the leftover FPGA resources, possibly creating more critical paths. Clearly, there is a trade-off between our flows, and one has to choose between them according to the application’s requirements.

Chapter 6
Conclusions and Future Work

6.1 Overview

This chapter summarizes the significance of the work done in this thesis along with the contributions made and the important research findings (Section 6.2). In Section 6.3, we also present possible ideas to further explore in the direction of this work.

6.2 Summary

High Level Synthesis (HLS) simplifies the design process of a digital hardware system and makes it possible for software developers to make use of hardware accelerators for their complex applications. However, for it to become successful, there is a need for an efficient debug infrastructure.
In-system debug is becoming an important part of the HLS ecosystem because of its advantages over the usual simulation based approaches. Existing HLS debug techniques allow the user to debug a circuit at the source level as it runs on an FPGA, providing visibility into the run-time operation of the circuit. In order to do this, most such flows contain tools that automatically add additional circuitry (debug instrumentation) to record the circuit execution and then replay this information to provide a software-like debug experience. Typically, these tools record the updates to all the user-visible variables and the control flow information.

In this thesis, we improved an existing in-system HLS debug framework by adding the ability to selectively record only some user-visible variables in on-chip trace buffers. This leads to reduced instrumentation logic (by 1.7x when 50% of the variables are recorded and 2.8x when 25% of the variables are recorded) and a longer trace window (1.6x for the 50% case and 2.6x for the 25% case), meaning fewer debug turns may be required when searching for an elusive bug. In our original framework, if the user needs to change the variables to be recorded, a full recompilation of the design is required, which is generally not desirable. To make this practical, incremental compilation techniques are essential to reduce the debug turn-around time.

Although commercial FPGA tools contain extensive support for incremental compilation, we found that, due to several unique characteristics of the considered debug instrumentation (like the customized trace scheduler logic), careful application of these techniques is required. We outlined several flows to perform incremental compilation for our problem.
Of the flows we examined, two were deemed promising. The first, in which the user circuit and instrumentation are co-optimized during compilation, gives the fastest debug clock speeds, but suffers in user circuit performance once the debug instrumentation is removed (a frequency loss of 4.5%). In the other promising flow, the user circuit is first placed and routed without the instrumentation logic, giving the best possible performance of the user circuit. Then, the instrumentation is added incrementally without changing the user circuit. This flow suffers somewhat in terms of debug performance; however, when the instrumentation is removed, the circuit runs almost as fast as the original uninstrumented user circuit (only a 1.3% loss in frequency). Using either flow, we achieve a significant reduction (40%, on average) in debug turn-around times, leading to more effective debug and higher productivity.

To our knowledge, ours is the first work to incorporate incremental compilation techniques into an in-system HLS debug flow. There are many works which focus on incremental debug for circuits implemented at the RTL level; however, the debug of HLS generated circuits is different, as the instrumentation inserted is highly customized based on the variables selected (to achieve a better trace window size). As a result, the amount of logic that needs to be recompiled when a user wants to record different variables is much larger.

6.3 Future Work

Short Term Goals

In order to incrementally place-and-route the changed logic between successive debug turns, we used the incremental techniques from a commercial tool (Quartus Prime 16.0) and modified our design accordingly (creating design partitions and adding taps) to get the most out of this tool.

Similar to Quartus Prime, there are several other FPGA CAD tools like Vivado which also provide support for incremental recompilations.
One direction for future work is to understand the incremental flows offered by these tools and use them in the proposed debug flows to see if we can achieve greater reductions in compile time compared to our current results.

Another possible work is to investigate the use of generic or customized lossless data compression schemes to compress the data generated by the trace scheduler block before writing it to the trace buffer (in order to increase the trace window size) and then evaluate the benefits obtained using our incremental flows. We anticipate that the results might vary slightly based on the amount of area overhead and also the amount of logic that would change with the change in the variables being recorded.

In this thesis, we do not use any special variable selection algorithms to guide the user to select important variables for recording. We randomly select the variables in each debug turn to evaluate the effectiveness of our incremental debug flows. It would be interesting to evaluate our flows with proper variable selections in each debug turn, as a user would actually do while debugging (for example, selecting variables function-by-function).

As most of the available HLS benchmarks are very small and have a compile time of less than a few minutes (except one or two), we feel that our results might understate the benefits. We anticipate that the compile time reductions offered by our incremental debug flows would go up if they were used with bigger benchmarks. In the future, we expect bigger HLS benchmarks or open source practical HLS applications to become available, allowing us to evaluate our flows further and gain more insight.
Long Term Goals

The use of commercial incremental techniques in this work was just the first step towards our ultimate long term goal of developing a fully accelerated HLS debug framework, allowing users to debug quickly and efficiently. The benefits achieved by using these commercial frameworks cannot be improved further, as we do not have access to the internals of the tools and hence cannot modify the algorithms used in their incremental flows. An obvious solution would be to develop a customized incremental place-and-route tool tailored to our own requirements using open source FPGA CAD tools like RapidSmith [48].

Another possible orthogonal approach to our work is to investigate the feasibility of an overlay for HLS debug, similar to those used for incremental RTL based debug [15, 16, 32]. However, as described in Section 2.4, we feel that this is not so easy given the uniqueness of our debug instrumentation for each subset of variables that are being recorded.

Lastly, we feel that there is another direction to explore in which there could be no need for recompilation at all. For this purpose, it is necessary to develop a generic trace scheduler and a compression circuit which achieves a compression ratio close to that of a customized trace scheduler for each subset of variables being recorded. Then, by using a scan register, we could mask off the signals that are not being recorded in the current debug turn without any recompilation. Clearly, there is a trade-off between recompilation and the trace buffer utilization or the trace window size (which depends on the compression achieved by the trace scheduler block). This trade-off will not be clear until the generic compression circuit is developed, which we leave as future work.

Bibliography

[1] Christopher M. Abernathy, Lydia M. Do, Ronald P. Hall, and Michael L. Karm. System and Method for Streaming High Frequency Trace Data Off-Chip. http://www.freepatentsonline.com/y2008/0016408.html, Jan 2008.
(visited on August 8, 2017).
[2] Amazon. Amazon EC2 F1 Instances with Custom FPGAs. https://aws.amazon.com/ec2/instance-types/f1/, 2016. (visited on August 8, 2017).
[3] Hari Angepat, Gage Eads, Christopher Craik, and Derek Chiou. NIFD: Non-intrusive FPGA debugger – debugging FPGA ’threads’ for rapid HW/SW systems prototyping. In FPL, pages 356–359. IEEE Computer Society, 2010.
[4] E. Anis and N. Nicolici. Low cost debug architecture using lossy compression for silicon debug. In Design, Automation Test in Europe Conference Exhibition, pages 1–6, April 2007.
[5] E. Anis and N. Nicolici. On using lossless compression of debug data in embedded logic analysis. In IEEE International Test Conference, pages 1–10, Oct 2007.
[6] ARM. Differences between On-chip and Off-chip Storage. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dgi0012d/Babhaifj.html. (visited on August 8, 2017).
[7] Christian Beckhoff, Dirk Koch, and Jim Tørresen. The Xilinx design language (XDL): Tutorial and use cases. In ReCoSoC, pages 1–8. IEEE, 2011.
[8] P. Bellows and B. Hutchings. JHDL-an HDL for reconfigurable systems. pages 175–184, 1998.
[9] Jon Louis Bentley, Daniel D. Sleator, Robert E. Tarjan, and Victor K. Wei. A locally adaptive data compression scheme. Commun. ACM, 29(4):320–330, April 1986.
[10] N. Calagar, S.D. Brown, and J.H. Anderson. Source-level Debugging for FPGA High-Level Synthesis. In International Conference on Field Programmable Logic and Applications, Sept 2014.
[11] A. Canis, S.D. Brown, and J.H. Anderson. Modulo SDC scheduling with recurrence minimization in high-level synthesis. In Field Programmable Logic and Applications (FPL), 2014 24th International Conference on, Sept 2014.
[12] Andrew Canis, Jongsok Choi, et al. LegUp: An Open-source High-level Synthesis Tool for FPGA-based Processor/Accelerator Systems. ACM Trans. Embed. Comput. Syst., 13(2):24:1–24:27, September 2013.
[13] Philippe Coussy, Daniel D. Gajski, Michael Meredith, and Andres Takach.
An introduction to high-level synthesis. IEEE Design & Test of Computers, 26(4):8–17, 2009.
[14] John Curreri, Greg Stitt, and Alan D. George. High-level Synthesis of In-circuit Assertions for Verification, Debugging, and Timing Analysis. Int. J. Reconfig. Comput., 2011:1:1–1:17, January 2011.
[15] F. Eslami and S. J. E. Wilton. Incremental distributed trigger insertion for efficient FPGA debug. In International Conference on Field Programmable Logic and Applications (FPL), pages 1–4, Sept 2014.
[16] F. Eslami and S. J. E. Wilton. An adaptive virtual overlay for fast trigger insertion for FPGA debug. In Field Programmable Technology (FPT), 2015 International Conference on, pages 32–39, Dec 2015.
[17] GDB: The GNU Project Debugger. https://www.gnu.org/software/gdb/. (visited on August 8, 2017).
[18] J. Goeders. Enabling Long Debug Traces of HLS Circuits Using Bandwidth-Limited Off-Chip Storage Devices. In 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 136–143, April 2017.
[19] J. Goeders and S. J. E. Wilton. Using round-robin tracepoints to debug multithreaded HLS circuits on FPGAs. In Field Programmable Technology (FPT), 2015 International Conference on, pages 40–47, Dec 2015.
[20] J. Goeders and S.J.E. Wilton. Effective FPGA debug for high-level synthesis generated circuits. In Field Programmable Logic and Applications (FPL), 2014 24th International Conference on, Sept 2014.
[21] J. Goeders and S.J.E. Wilton. Using Dynamic Signal-Tracing to Debug Compiler-Optimized HLS Circuits on FPGAs. In International Symposium on Field-Programmable Custom Computing Machines, pages 127–134, May 2015.
[22] Jeffrey Goeders. Techniques for In-System Observation-based Debug of High-Level Synthesis Generated Circuits on FPGAs. PhD thesis, The University of British Columbia (Vancouver), September 2016.
[23] Jeffrey Goeders and Steven J. E. Wilton.
Signal-tracing techniques for in-system FPGA debugging of high-level synthesis circuits. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 36(1):83–96, January 2017.
[24] Paul Graham, Brent Nelson, and Brad Hutchings. Instrumenting Bitstreams for Debugging FPGA Circuits. In Proceedings of the 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, FCCM ’01, pages 41–50, 2001.
[25] Steve Guccione, Delon Levi, and Prasanna Sundararajan. JBits: Java based interface for reconfigurable computing. In Second Annual Military and Aerospace Applications of Programmable Devices and Technologies (MAPLD), Sep.
[26] M. Ben Hammouda, P. Coussy, and L. Lagadec. A Design Approach to Automatically Synthesize ANSI-C Assertions During High-Level Synthesis of Hardware Accelerators. In 2014 IEEE International Symposium on Circuits and Systems (ISCAS), pages 165–168, June 2014.
[27] Julien Happich. Cognitive Computing Platform Unites Xilinx and IBM. http://www.eetimes.com/document.asp?doc_id=1329377, Apr 2016. (visited on August 8, 2017).
[28] Yuko Hara, Hiroyuki Tomiyama, Shinya Honda, and Hiroaki Takada. Proposal and Quantitative Analysis of the CHStone Benchmark Program Suite for Practical C-based High-level Synthesis. Journal of Information Processing, 17:242–254, 2009.
[29] K.S. Hemmert, J.L. Tripp, B.L. Hutchings, and P.A. Jackson. Source level debugger for the Sea Cucumber synthesizing compiler. In Symposium on Field-Programmable Custom Computing Machines, pages 228–237, April 2003.
[30] Chu-Yi Huang, Yen-Shen Chen, Youn-Long Lin, and Yu-Chin Hsu. Data path allocation based on bipartite weighted matching. In Proceedings of the 27th ACM/IEEE Design Automation Conference, DAC ’90, pages 499–504, New York, NY, USA, 1990. ACM.
[31] E. Hung and S. J. E. Wilton. Speculative Debug Insertion for FPGAs. In 2011 21st International Conference on Field Programmable Logic and Applications, pages 524–531, Sept 2011.
[32] Eddie Hung and Steven J. E.
Wilton. Accelerating FPGA Debug: Increasing Visibility Using a Runtime Reconfigurable Observation and Triggering Network. ACM Trans. Des. Autom. Electron. Syst., 19(2):14:1–14:23, March 2014.
[33] Eddie Hung and Steven J.E. Wilton. Towards simulator-like observability for FPGAs: A virtual overlay network for trace-buffers. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 19–28, 2013.
[34] B. Hutchings, P. Bellows, J. Hawkins, S. Hemmert, B. Nelson, and M. Rytting. A CAD suite for high-performance FPGA design. In Field-Programmable Custom Computing Machines, 1999. FCCM ’99. Proceedings. Seventh Annual IEEE Symposium on, pages 12–24, 1999.
[35] B. L. Hutchings and J. Keeley. Rapid post-map insertion of embedded logic analyzers for Xilinx FPGAs. In IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, pages 72–79, May 2014.
[36] Impulse Accelerated Technologies. CoDeveloper from Impulse Accelerated Technologies. http://www.impulseaccelerated.com/ReleaseFiles/Help/iAppMan.pdf, 2015. (visited on August 8, 2017).
[37] Intel. Increasing Productivity With Quartus II Incremental Compilation. White Paper WP-01062-1.0, version 1.0, May 2008.
[38] Intel. Protecting the FPGA Design From Common Threats. White Paper WP-01111-1.0, version 1.0, June 2009.
[39] Intel. Intel Completes Acquisition of Altera. https://newsroom.intel.com/press-kits/intel-acquisition-of-altera/, December 2015. (visited on August 8, 2017).
[40] Intel. Quartus Prime Pro Edition Handbook, volume 3, chapter 9: Design Debugging Using the SignalTap II Logic Analyzer. November 2015.
[41] Intel. SDK for OpenCL. https://www.altera.com/products/design-software/embedded-software-developers/opencl/overview.html, 2016. (visited on August 8, 2017).
[42] Intel. Quartus Prime standard edition handbook: Design and synthesis.
https://www.altera.com/en_US/pdfs/literature/hb/qts/qts-qps-handbook.pdf, May 2017. (visited on August 8, 2017).
[43] Yousef Iskander, Cameron Patterson, and Stephen Craven. High-level abstractions and modular debugging for FPGA design validation. ACM Trans. Reconfigurable Technol. Syst., 7(1):2:1–2:22, Feb 2014.
[44] J. Jiang and S. Jones. Word-based dynamic algorithms for data compression. IEE Proceedings I - Communications, Speech and Vision, 139(6):582–586, Dec 1992.
[45] Eric Keller. JRoute: A run-time routing API for FPGA hardware. In Proceedings of the IPDPS Workshops on Parallel and Distributed Processing, pages 874–881, 2000.
[46] H. F. Ko and N. Nicolici. Algorithms for state restoration and trace-signal selection for data acquisition in silicon debug. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 28(2):285–297, Feb 2009.
[47] Chris Lattner and Vikram Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, CGO ’04, pages 75–86, 2004.
[48] C. Lavin, M. Padilla, J. Lamprecht, P. Lundrigan, B. Nelson, and B. Hutchings. RapidSmith: Do-it-yourself CAD tools for Xilinx FPGAs. In International Conference on Field Programmable Logic and Applications, pages 349–355, Sept 2011.
[49] X. Liu and Q. Xu. On signal selection for visibility enhancement in trace-based post-silicon validation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 31(8):1263–1274, Aug 2012.
[50] Mentor. Catapult high-level synthesis. https://www.mentor.com/hls-lp/catapult-high-level-synthesis/, 2016. (visited on August 8, 2017).
[51] Microsemi. In-Circuit FPGA Debug: Challenges and Solutions. https://www.microsemi.com/document-portal/doc_view/133662-in-circuit-fpga-debug-challenges-and-solutions. (visited on August 8, 2017).
[52] J. S. Monson and B. Hutchings.
New approaches for in-system debug of behaviorally-synthesized FPGA circuits. In Int’l Conf. on Field-Programmable Logic and Applications, pages 1–6, Sept 2014.
[53] J. S. Monson and B. Hutchings. New approaches for in-system debug of behaviorally-synthesized FPGA circuits. In International Conference on Field Programmable Logic and Applications, Sept 2014.
[54] J. S. Monson and B. Hutchings. Using shadow pointers to trace C pointer values in FPGA circuits. In International Conference on ReConFigurable Computing and FPGAs (ReConFig), pages 1–6, Dec 2015.
[55] J. S. Monson and Brad L. Hutchings. Using Source-Level Transformations to Improve High-Level Synthesis Debug and Validation on FPGAs. In International Symposium on Field-Programmable Gate Arrays, pages 5–8, 2015.
[56] R. Nane, V. M. Sima, B. Olivier, R. Meeuws, Y. Yankova, and K. Bertels. DWARV 2.0: A CoSy-based C-to-VHDL hardware compiler. In 22nd International Conference on Field Programmable Logic and Applications (FPL), pages 619–622, Aug 2012.
[57] R. Nane, V. M. Sima, C. Pilato, J. Choi, B. Fort, A. Canis, Y. T. Chen, H. Hsiao, S. Brown, F. Ferrandi, J. Anderson, and K. Bertels. A Survey and Evaluation of FPGA High-Level Synthesis Tools. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, PP(99), 2016.
[58] R. Nikhil. Bluespec System Verilog: efficient, correct RTL from high level specifications. In Formal Methods and Models for Co-Design, 2004. MEMOCODE ’04. Proceedings. Second ACM and IEEE International Conference on, pages 69–70, June 2004.
[59] University of Cambridge. The Tiger MIPS processor. https://www.cl.cam.ac.uk/teaching/0910/ECAD+Arch/mips.html, 2010. (visited on August 8, 2017).
[60] United States Bureau of Labor Statistics. Occupational Outlook Handbook, 2012.
[61] C. Pilato and F. Ferrandi. Bambu: A modular framework for the high level synthesis of memory-intensive applications.
In 2013 23rd International Conference on Field programmable Logic and Applications, Sept 2013.
[62] J. P. Pinilla and S. J. E. Wilton. Enhanced source-level instrumentation for FPGA in-system debug of high-level synthesis designs. In International Conference on Field-Programmable Technology (FPT), pages 109–116, Dec 2016.
[63] Z. Poulos, Y. S. Yang, J. Anderson, A. Veneris, and B. Le. Leveraging reconfigurability to raise productivity in FPGA functional debug. In 2012 Design, Automation Test in Europe Conference Exhibition (DATE), pages 292–295, March 2012.
[64] A. Putnam, A.M. Caulfield, E.S. Chung, D. Chiou, K. Constantinides, J. Demme, et al. A reconfigurable fabric for accelerating large-scale datacenter services. In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on, pages 13–24, June 2014.
[65] Qualcomm. Qualcomm & Xilinx Collaborate to Deliver Industry-Leading Heterogeneous Computing Solutions for Data Centers with New Levels of Efficiency and Performance. https://www.qualcomm.com/news/releases/2015/10/08/qualcomm-and-xilinx-collaborate-deliver-industry-leading-heterogeneous, 2015. (visited on August 8, 2017).
[66] B. Reagen, R. Adolf, Y.S. Shao, Gu-Yeon Wei, and David Brooks. MachSuite: Benchmarks for accelerator design and customized architectures. In Int’l Symposium on Workload Characterization, pages 110–119, Oct 2014.
[67] Synopsys. Identify: Simulator-like Visibility into FPGA Hardware Operation. https://www.synopsys.com/implementation-and-signoff/fpga-based-design/identify-rtl-debugger.html. (visited on August 8, 2017).
[68] Anurag Tiwari and Karen A. Tomko. Scan-chain based watch-points for efficient run-time debugging and verification of FPGA designs. In Proceedings of the 2003 Asia and South Pacific Design Automation Conference, Jan.
[69] K. A. Tomko and A. Tiwari. Hardware/software co-debugging for reconfigurable computing.
In Proceedings of the IEEE International High-Level Validation and Test Workshop (HLDVT’00), HLDVT ’00, Washington, DC, USA, 2000. IEEE Computer Society.
[70] Bart Vermeulen and Sandeep Kumar Goel. Design for debug: Catching design errors in digital chips. IEEE Des. Test, 19(3):37–45, May 2002.
[71] Timothy Wheeler, Paul S. Graham, Brent E. Nelson, and Brad L. Hutchings. Using design-level scan to improve FPGA design observability and controllability for functional verification. In Field-Programmable Logic and Applications, 11th International Conference, FPL, pages 483–492, Aug 2001.
[72] Xilinx. FPGA vs. ASIC. https://www.xilinx.com/fpga/asic.htm. (visited on August 8, 2017).
[73] Xilinx. Vivado Design Suite User Guide: Implementation. https://www.xilinx.com/support/documentation/sw_manuals/xilinx2012_4/ug904-vivado-implementation.pdf, Dec 2012. (visited on August 8, 2017).
[74] Xilinx. Virtex-6 FPGA Configuration: User Guide. https://www.xilinx.com/support/documentation/user_guides/ug360.pdf, Nov 2013. (visited on August 8, 2017).
[75] Xilinx. Integrated Logic Analyzer v6.1: LogiCORE IP Product Guide. http://www.xilinx.com/support/documentation/ip_documentation/ila/v6_1/pg172-ila.pdf, April 2016. (visited on August 8, 2017).
[76] Xilinx. Vivado Design Suite User Guide: High-Level Synthesis. http://www.xilinx.com/support/documentation/sw_manuals/xilinx2016_2/ug902-vivado-high-level-synthesis.pdf, June 2016. (visited on August 8, 2017).

Appendix A
A Guide to our GUI Framework

Steps to use Our Proposed Debug Flows:

1. Open a Design:
– Once the GUI is loaded, click on the open file icon and select the design folder which consists of the .c, make and the config files.
– After selecting the folder and clicking the "Open" button, LegUp is run in the background (assuming all variables are being traced) to create the database and also to have an initial design with all the taps present in the user module.
This is an RTL generation step and no (Quartus Prime) compilation is necessary at this moment. At the same time, all the debug modules present in the design are isolated into separate Verilog files, as the partitions must be in separate files for the tool to perform the incremental compilation effectively. All these (main) files are stored in a new sub-folder "quartus_proj", so as to avoid any overwriting or deletion when LegUp is rerun for the current design.
– Next, the information from the database is read and populated into the "Vars" tab (shown in Figure A.1). A list of all the variables present in the source code can be seen under this tab.

2. Select the variables which are to be recorded:
– In the "Vars" tab, a list of all the variables is displayed, each with a corresponding checkbox.
– Once the required variables are selected, click on the "Trace Selected Vars" button. In the background, this triggers the creation of a file named "selectedVars.txt" and also reruns LegUp, which has been modified to make use of the information from this file and generate the RTL (Verilog file) with the instrumentation for recording the selected variables. The .v file generated by LegUp is in the design folder and has not been moved into our main "quartus_proj" sub-folder. We only need partial contents from this file, which are copied to our main files when a Quartus Prime compilation is to be run.
– Next, as LegUp is rerun, a new database is created. Therefore, the connection to the database is refreshed and all the relevant information is populated again from the database.

3. Creating a new Quartus Prime project (if it does not exist in the "quartus_proj" folder):
– After selecting the variables, go to the "FPGA" tab (shown in Figure A.2). Click on the "Create Quartus Project" button. If a project already exists, a pop-up indicating this will be displayed. If not, a new project is created.
Presently, the target device is Intel’s Cyclone V DE1-SoC.
– After the project is created, an "Analysis and Elaboration" step is run in order to create design partitions using appropriate Tcl files.

4. Running a full compilation (if this is the first compilation):
– Before running the compilation, the user has the option to start with empty debug partitions, which allows the user design to be placed and routed efficiently (Flow 4). If this is selected, the partition preservation settings for the debug partitions are set to EMPTY and the user partition is set to SOURCE, as this is the first compilation. If the option is not selected, all the partitions including the debug modules are set to SOURCE. If a partition is set to SOURCE, the tool recompiles it from scratch.
– Next, we copy the traceScheduler logic and some other parameters which change with every run of LegUp from the .v file present in the design folder into the .v files (present in our "quartus_proj" folder) which need them. This is done to ensure that we are running the compilation for the design with instrumentation added for only the selected variables.
– After this, a full compilation for the design is run.

5. Program the bitstream to the FPGA:
– After compiling the design, click the "Program Bitstream" button to program the DE1-SoC FPGA.

6. Connect to the FPGA through the RS232 interface.

7. Run the design on the FPGA and analyze the variables:
– Once connected to the FPGA, the design can be run (until a breakpoint or until it is completed) by clicking the "run" icon.
– After running the design, go to "FPGA Replay Execution" mode to see how the variable values changed as the design was running. You can use the slider to go back and forth, or use the single step/step back icons to move one step at a time.

8. Changing the variables to be recorded and performing an incremental compilation:
– During debugging, if you feel some variables are not necessary, you may not want to record them (thus saving trace memory and increasing the trace window length).
– You can select which variables you want to record (follow the same procedure as in Step 2).
– Next, DON’T create a new Quartus Prime project, as we already have a project. Also, DON’T run a full compilation if you want to speed up your compilation. You can use incremental compilation, as you already have the results of a previous compilation.

9. Incremental Compilation:
– When you click this button, the partition preservation settings are changed to POST_FIT using Tcl files, which directs the tool to reuse the results for the partitions which did not change from the previous compilations. In our case, the user partition does not change, as only the debug instrumentation is changed when the variables to be traced are changed. This preserves the results for the user partition and saves about 40% of the runtime, on average.
– Before running the compilation, we again copy (from the design folder to our "quartus_proj" folder, as in Step 4) the traceScheduler logic and other parameters which might have changed, as LegUp would have been rerun when the variables to be recorded were changed.
– Now a compilation is run, which is incremental because the partition preservation settings were changed to POST_FIT.
– Follow Steps 5 and 6 to program and connect to the FPGA. Then (as in Step 7) we can analyze the updates to the variable values during the design execution.

10. If the bug is not found, repeat the steps starting from Step 8 until the root cause of the bug is identified.

Note:
– For some variables, even if they are not selected for recording, you could see the values showing up for them while debugging.
This is because of pointer aliasing involving that variable; we are conservative in such cases and record these variables anyway.
– For now, we use the split buffer architecture (one buffer each for datapath registers, memory, and control signals). GUI support for the single trace buffer configuration is still under development.
Figure A.1: 'Vars' Tab
Figure A.2: 'FPGA' Tab
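The partition preservation choices described in Steps 4 and 9 can be summarized in a short sketch. This is only an illustration under stated assumptions, not the GUI's actual implementation: the partition names are hypothetical, the sketch assumes that every partition is set to POST_FIT for an incremental run (Step 9 does not spell this out per partition), and the rendered Tcl lines use the Quartus `PARTITION_NETLIST_TYPE` assignment as a guess at what the framework's Tcl files might contain.

```python
from typing import Dict, List

# Hypothetical partition names; the guide does not list the real ones.
USER_PARTITION = "user_design"
DEBUG_PARTITIONS = ["trace_scheduler", "trace_buffers"]


def preservation_settings(first_compilation: bool,
                          start_with_empty_debug: bool = False) -> Dict[str, str]:
    """Return the preservation setting of each partition for the next run."""
    if not first_compilation:
        # Step 9 (assumed to apply to every partition): reuse post-fit
        # results wherever the partition did not change.
        return {name: "POST_FIT" for name in [USER_PARTITION] + DEBUG_PARTITIONS}

    # Step 4: the user partition is always compiled from source the first time.
    settings = {USER_PARTITION: "SOURCE"}
    # Optional Flow 4: start with empty debug partitions.
    debug_type = "EMPTY" if start_with_empty_debug else "SOURCE"
    for name in DEBUG_PARTITIONS:
        settings[name] = debug_type
    return settings


def to_tcl(settings: Dict[str, str]) -> List[str]:
    """Render the settings as Tcl assignments (assumed form, see lead-in)."""
    return [
        f'set_global_assignment -name PARTITION_NETLIST_TYPE {netlist_type} '
        f'-section_id "{name}"'
        for name, netlist_type in settings.items()
    ]


if __name__ == "__main__":
    # First compilation with empty debug partitions, then an incremental run.
    for line in to_tcl(preservation_settings(True, start_with_empty_debug=True)):
        print(line)
    for line in to_tcl(preservation_settings(False)):
        print(line)
```

Keeping the decision in one function mirrors the guide's point that only the preservation settings differ between a full and an incremental compilation; the compilation command itself stays the same.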
