UBC Theses and Dissertations


Source-level instrumentation for in-system debug of high-level synthesis designs for FPGA Pinilla, Jose Pablo 2016


Source-Level Instrumentation for In-System Debug of High-Level Synthesis Designs for FPGA

by

Jose Pablo Pinilla

B.Eng., Universidad Pontificia Bolivariana, 2012

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Electrical and Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

September 2016

© Jose Pablo Pinilla 2016

Abstract

High-Level Synthesis (HLS) has emerged as a promising technology to reduce the time and complexity that is associated with the design of digital logic circuits. HLS tools are capable of allocating resources and scheduling operations from a software-like behavioral specification. In order to maintain the productivity promised by HLS, it is important that the designer can debug the system in the context of the high-level code. Currently, software simulations offer a quick and familiar method to target logic and syntax bugs, while software/hardware co-simulations are useful for synthesis verification. However, to analyze the behaviour of the circuit as it is running, the user is forced to understand waveforms from the synthesized design.

Debugging a system as it is running requires inserting instrumentation circuitry that gathers data regarding the operation of the circuit, and a database that maps the record entries to the original high-level variables. Previous work has proposed adding this instrumentation at the Register Transfer Level (RTL) or in the high-level source code. Source-level instrumentation provides advantages in portability, transparency, and customization.
However, previous work using source-level transformations has focused on the ability to expose signals for observation rather than the construction of the instrumentation itself, thereby limiting these advantages by requiring lower-level code manipulation.

This work shows how trace buffers and related circuitry can be inserted by automatically modifying the source-level specification of the design. The transformed code can then be synthesized using the regular HLS flow to generate the instrumented hardware description. The portability of the instrumentation is shown with synthesis results for Vivado HLS and LegUp, compiled for Xilinx and Altera devices respectively. Using these HLS tools, the impact on circuit size varies from 15.3% to 52.5% and the impact on circuit speed ranges from 5.8% to 30%. We also introduce a low overhead technique named Array Duplicate Minimization (ADM) to improve trace memory efficiency. ADM improves overall debug observability by removing up to 31.7% of data duplication created between the trace memory and the circuit’s memory structures.

Preface

This dissertation is original, independent work by the author, J. Pinilla.

The work on this thesis, under the supervision of Prof. Steve Wilton, led to the submission of the research paper titled Enhanced Source-Level Instrumentation for FPGA In-System Debug of High-Level Synthesis Designs, accepted for oral presentation at the International Conference on Field Programmable Technology 2016 (FPT16).

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
List of Listings
1 Introduction
    1.1 Field Programmable Gate Arrays
    1.2 High Level Synthesis
    1.3 HLS Debug
        1.3.1 Software Simulation
        1.3.2 Co-Simulation
        1.3.3 In-System Debugging
        1.3.4 Instrumentation
    1.4 Contributions
    1.5 Thesis Outline
2 Related Work
    2.1 High-Level Synthesis Frameworks
        2.1.1 HLS Flow
        2.1.2 Vivado HLS
        2.1.3 LegUp
        2.1.4 Benchmarks
    2.2 FPGA Debug
        2.2.1 Embedded Logic Analyzers
        2.2.2 HLS Verification & Source-Level Debug
    2.3 HLS On-Chip Monitors
        2.3.1 Assertion Based Verification
        2.3.2 Control Flow Integrity Verification
        2.3.3 Embedded Signature Monitoring
    2.4 HLS Source-Level Debug
        2.4.1 JHDL Debug
        2.4.2 LegUp Debug
        2.4.3 Event Observability Ports
    2.5 Source-to-Source Transformations
        2.5.1 ROSE Compiler Infrastructure
    2.6 Summary
3 Source-Level Instrumentation
    3.1 Source-Level Debug Framework
        3.1.1 Instrumentation
    3.2 Control Flow
        3.2.1 Control Constructs
        3.2.2 Code Example
    3.3 Data Capture
        3.3.1 Assignment Statements
        3.3.2 Code Example
    3.4 Trace Readback
        3.4.1 Code Example
    3.5 Trace Reconstruction
    3.6 Summary
4 Array Duplicate Minimization
    4.1 Array Duplication
    4.2 Merged Instrumentation
    4.3 Old Value Store
        4.3.1 Code Example
        4.3.2 Trace Example
        4.3.3 Trace Analysis
    4.4 Observability
        4.4.1 History Coverage
        4.4.2 Total History Coverage
    4.5 Summary
5 Experiments and Results
    5.1 Experiment 1: Single Instrumentation Point
    5.2 Experiment 2: Complete Instrumentation
        5.2.1 Latency Impact
        5.2.2 Resource Utilization
    5.3 Experiment 3: Partial Instrumentation
    5.4 ADM Evaluation
        5.4.1 Observability
        5.4.2 Resource Utilization
    5.5 Summary
6 Conclusion
    6.1 Limitations
    6.2 Future Work
Bibliography
Appendices
    A LegUp Interface Directive
    B EOP Experiment
    C Memory Access Profiles

List of Tables

4.1 MIPS Benchmark Example History
4.2 Trace Buffer Contents
4.3 MIPS History Metrics Example
5.1 Latency Impact Comparison for ADPCM Single Assignment
5.2 LUT Impact Comparison for ADPCM Single Assignment
5.3 FF Impact Comparison for ADPCM Single Assignment
5.4 LE/LC* Impact Comparison for ADPCM Single Assignment
5.5 LegUp Cycles (Slowdown)
5.6 Vivado HLS Cycles (Slowdown)
5.7 LegUp Complete Instrumentation Logic Elements (Overhead)
5.8 Vivado HLS Complete Instrumentation Logic Cells (Overhead)
5.9 Vivado HLS EOP vs Data Capture Overhead
5.10 LegUp HLS Debugger vs Data Capture Logic Elements (Overhead)
5.11 Duplication Metrics
5.12 ADM Resource Utilization for ADPCM Instances

List of Figures

1.1 HLS Debugging Techniques
2.1 HLS Flow
2.2 Hierarchy ROSE AST Nodes
2.3 Example ROSE AST Node
3.1 HLS In-System Debug Framework with Source-Level Instrumentation
3.2 Hierarchy of Assignment Statements
4.1 Data Duplication
4.2 Trace Buffer Analysis
5.1 Latency Histogram for Variation of ADPCM
5.2 Latency Histogram for Variation of ADPCM All-Inlined
B.1 LegUp EOP Writes
B.2 Vivado HLS EOP Writes
C.1 ADPCM
C.2 AES
C.3 BLOWFISH
C.4 GSM
C.5 JPEG
C.6 MIPS
C.7 SHA

List of Listings

3.1 C Control Flow Instrumentation
3.2 C with Data Instrumentation and Readback
4.1 C with Array Duplicate Minimization
4.2 Alternative to Store Old Value
A.1 C with EOP Instrumentation
A.2 Vivado HLS “directives.tcl”
A.3 LegUp “config.tcl”

Chapter 1

Introduction

Computation scaling through recent years has not seen the same return in terms of higher operational frequency and power efficiency that was obtained from shrinking transistor sizes. As processor performance plateaus, computer architectures have moved to parallel execution models. Such models either replicate processing units or use heterogeneous computing. Heterogeneous computing merges different computing architectures into one system, where each architecture targets a specific task or type of tasks.

There are two domains in which this trend is especially important. The first is embedded devices, in which energy efficiency can be achieved by offloading tasks from the main processor core to custom cores that consume less power, allowing the more powerful resources to go to a standby state. The second is High Performance Computing (HPC), or any compute-demanding application, where higher performance can only be obtained with application-specific logic, and both the sequential processing unit(s) and custom logic can execute in tandem. However, design and fabrication of Application Specific Integrated Circuits (ASICs) is not only time consuming, but also often unaffordable.

More flexible heterogeneous computing systems integrate programmable General Purpose (GP) devices instead of specialized ASICs. This balances the customizability and performance of ASICs, while preserving the flexibility and affordability of software programmable cores. For example, Systems on Chip (SoC) can integrate two Central Processing Units (CPUs) with architectures optimized for performance and for power efficiency, respectively [7], or the CPU(s) can be integrated with a GPGPU (General Purpose Graphics Processing Unit) [25, 58], which has become attractive in scientific computing and other HPC applications.
The reprogrammable many-core Single-Instruction Multiple-Data (SIMD) execution model offered by GPUs can be used to solve graph and matrix based problems, among other applications. A more suitable approach for some applications is to provide finer-grained reprogrammability, to also allow Multiple-Instruction Multiple-Data (MIMD) parallelism with custom logic design. This execution model is offered by Programmable Logic Devices (PLDs), namely, Field-Programmable Gate Arrays (FPGAs).

1.1 Field Programmable Gate Arrays

The fine-grained nature of FPGAs allows them to emulate the behaviour of any digital logic circuit. Their architecture is based on Look-Up Tables (LUTs) for digital gates, memory blocks for RAMs and ROMs, registers for sequential logic, programmable I/O blocks, and programmable switch blocks for custom interconnection. More specialized blocks are also seen in state-of-the-art devices. This extensive flexibility made these devices a suitable match for prototype development, for ASIC emulation in design verification, and for use as interfacing devices in system-level design, otherwise called “glue logic”. More recently, FPGA resource density has increased to such a degree that complex, multi-core, multi-clock domain systems can be fully implemented in programmable logic [26]. The easily replicable homogeneity of modern FPGA architectures has led to impressive records for transistor density in one chip [4, 64]. FPGAs also enable a faster time-to-market, which is an advantage over ASIC design [65].
At the same time, FPGAs can offer a significant increase in performance compared to CPU and GPU implementations [66].

This great potential for acceleration has been successfully put into practice for multiple applications where custom functional units and pipelined execution can be designed to take advantage of existing parallelism [55, 78]. Notable applications exist in high-demand cloud computing [11, 60], where each server node is augmented with one FPGA configured with multiple custom Processing Elements (PE). Recently, Intel Corporation acquired one of the largest FPGA companies, Altera Corporation [18], while other companies have also made major efforts to incorporate FPGAs into mainstream computing, especially for machine learning on the cloud [8, 23, 78]. However, programming FPGAs requires greater effort than programming GPUs or CPUs in order to extract optimal efficiency [10, 26, 55, 66], mainly because these designs need to be described at a lower abstraction level, requiring ample knowledge of the device architecture and hardware-specific design methodologies.

Traditionally, FPGA applications are specified using Hardware Description Languages (HDLs) such as VHDL or Verilog, which require hardware expertise. HDLs are used to write a structural or behavioral description of the circuit, in which low-level logic (logic functions and flip-flops) and detailed timing requirements (rising and falling edge triggering) are specified; this is defined as the Register-Transfer Level (RTL). Other alternatives such as block diagram specifications are available, but these are often tedious to read and analyze with larger and more complex designs.

1.2 High Level Synthesis

FPGA vendors and academic research have invested significant effort into providing a software-like design environment for FPGAs in the form of High-Level Synthesis (HLS) tools. This involves automatically transforming a behavioral description into a digital circuit design.
High-Level Synthesis (HLS) has emerged as a leading technology to reduce the design time and complexity that is associated with FPGAs, and to enable software programmers to use FPGAs in such a way that their expertise can be put into practice for compute acceleration without a steep learning curve [36, 37]. In order to do this, the behavioral description language and programming flow need to resemble those of software.

This software-like behavioral description can be written using Domain Specific Languages (DSLs) (e.g., SystemC, BSV [57]), or subsets and extensions of existing software programming languages (e.g., Java, C, C++). The preferred language for existing and recent HLS tools is C [56]. C-based tools often use either GCC or LLVM [45], which are open source C/C++ compiler frameworks. These compilers are modified to include a new backend. The backend is the last stage in the compilation flow, where the Intermediate Representation (IR), that is, code written with generic low-level instructions, is converted to comply with the target architecture; in the software approach this means x86, ARM, MIPS, or other binary executables. When targeted towards FPGAs or logic design, this backend translates the IR into an HDL specification.

Most software concepts can remain unchanged in the language-subset approach, such as “calling function”, “jumping to instructions” and “variable pointers”, except these are translated into digital logic instead of binary instructions. This is attractive to software developers and hardware developers alike for two main reasons: portability and ease of transition. Existing software can be compiled for an FPGA device, and because HLS tools use a well-known programming language, it is more likely that the developer is already familiar with it and can focus on optimization rather than the initial implementation. Along with FPGAs and HLS tools to program them, it is necessary to have a development infrastructure.
This should include assistance in the code writing process, as found in multiple Integrated Development Environments (IDE) [22], but should also allow the users to analyze and debug the behaviour of their design. Specifically, debugging should be possible in the context of the abstraction level at which the logic circuit is being designed.

1.3 HLS Debug

High-level synthesis compilers are not enough on their own; an entire ecosystem including support for debug and optimization is required. During the design process, the developers move through multiple iterations of the design. Flaws or bugs in the execution of the circuit can appear in different stages of development and can be found using different debugging techniques. Starting from the lowest level of abstraction, debugging can target electrical bugs such as manufacturing defects, device wear out, and unmet timing. Moving up the abstraction level, bugs can generally be classified as logic, arithmetic, or syntactic. Most syntax bugs can be found statically; these are code structures that do not comply with the programming language, or are mistyped expressions (variable name mismatch, wrong operator, wrong variable type, etc.). Logic and arithmetic bugs refer to operational discrepancies from the expected behaviour (incorrect statements, loop bounds, division by zero), whether these are caused by a flawed description of the design, or by a mistake in one of the synthesis (or compile) stages. As represented in Figure 1.1, an HLS design can be debugged at three different levels: software simulation, co-simulation, and in-system debug.

Figure 1.1: HLS Debugging Techniques

1.3.1 Software Simulation

Currently, HLS tools incorporate software simulations. These offer a quick and familiar method to target logic and syntax bugs following the same standards applicable to software debugging. As previously mentioned, C-based HLS tools use standard C compiler frameworks, and the resemblance of the HLS programming flow to a software programming counterpart is such that the behavioral specification can be compiled and executed on the workstation using the regular software compilers. This type of source simulation helps find bugs early in the design, without the need to execute the synthesis flow.

1.3.2 Co-Simulation

Some HLS tools also provide software/hardware (C/RTL) co-simulations, which encompass the bug coverage provided by C simulations but are also useful for synthesis verification, i.e., checking for tool bugs or tool usage errors. Here, a cycle-accurate simulation of the generated circuit runs along with the binary executable. Inconsistencies can be checked during execution or by comparing return values to golden data. This is useful for uncovering the root cause of errors or performance bottlenecks that arise when the code is synthesized to hardware, possibly due to incorrect compiler settings (such as pragmas), and for providing confidence that the HLS tool produced correct hardware.

1.3.3 In-System Debugging

Although software simulation and hardware/software co-simulation are an essential part of the HLS ecosystem, they are not sufficient to find the root cause of all bugs. Many of the most elusive bugs do not become apparent unless the design is run in-system, exercised by real input traffic, at-speed, for long periods of time. As a specific example, simulating an SoC as it boots Linux would take years in modern simulators, yet there are many types of bugs that need at least this much execution before appearing. Further, many elusive bugs (such as those related to the timing interactions between HLS and legacy RTL blocks) may not occur unless the system is running at-speed, which in FPGA design can be accomplished from an early stage of development.

Running at-speed, however, means that data regarding the state of the circuit is being updated constantly.
If exposed to the user, the amount of information regarding all available signals in the circuit would require very high throughput and I/O resources. Also, in contrast to software debugging, step-by-step execution of a digital logic design often cannot be done while preserving correct operation. This is due to interfaces with additional modules that can only be observed at runtime, whether because the source code is inaccessible or nonexistent at the same high level, or because the module requires external inputs from peripherals. These cases are common due to the use of external IP in most designs.

The alternative approach found in commercial in-system debugging solutions is to use a trace record-replay technique. Instrumentation, or additional circuitry, is added to the design in order to store the changes of signals of interest until a predefined event is encountered. At that point in the execution, all captured data is retrieved and the behaviour of the circuit is “replayed”, presented as waveforms to the designer. In Chapter 2, several examples of in-system debug are presented.

Source-Level In-System Debugging for HLS

In HLS systems, providing support for in-system debug is especially challenging, due to the mismatch between the hardware running on-chip and the HLS (software) designer’s view of the system [8, 27]. A software designer views the design as a set of sequential statements with limited parallelism, while the actual hardware consists of many sequential and combinational hardware units running at the same time. A software designer does not consider the notion of a “clock” when specifying a design, yet the cycle-by-cycle behaviour is inherent in the structure and operation of the hardware.
This is especially challenging if the HLS tool performs many optimizations on the code, leading to a structure and schedule that may be very unfamiliar to the designer.

Although there are many debuggers that provide visibility into the design at the system level, these tools commonly present information as signal waveforms that only have meaning to a hardware designer. These signals are generated by the synthesis tool, and can be completely detached from the initial source code nomenclature and design intuition. For example, the generated circuit will likely contain Finite State Machines (FSMs); the value of the register holding the current state might be of great interest to a hardware designer, but it is meaningless from a software perspective.

To be effective and maintain the productivity promised by HLS, the in-system debug technique must present the execution in terms of C-level variables and C-level control flow, rather than presenting cycle-by-cycle waveforms that the designer must manually relate to the original C design [8]. The objective is to have source-level in-system debug for HLS.

This has led several research groups to develop techniques to provide a software-like view of running hardware, allowing the software designer to observe variable values and single-step code as if it were software. Early work presented a system for the JHDL-based Sea Cucumber (SC) framework [39] allowing software optimizations instead of using a debug version of the code. More recently, LegUp’s release included source-level in-system debugging support [12, 27] using custom RTL instrumentation and a database to relate hardware signals to source-code statements, LLVM’s Intermediate Representation (IR), and Verilog.
Subsequent research [30] focused on optimizing resource utilization, successfully storing longer execution histories.

1.3.4 Instrumentation

Key to any in-system debug approach is efficient and effective instrumentation that is added to the user design to record the behaviour of the design as it runs. Since I/O pins are limited, and since it is desired to run the chip at-speed, existing systems instrument the user design by adding memories (trace buffers) and support circuitry to record the behaviour of key signals into these trace buffers. After the chip has been run, these trace buffers can be interrogated using a software-like debugging tool, allowing the designer to understand the behaviour of the system.

The instrumentation is built using resources of the same reconfigurable fabric used for the user’s circuit. As such, this instrumentation can be added at any stage of the design process: at the bitstream level [31], at the gate level at compile time [2, 43], incrementally after place and route [24], at the RTL [12, 27], or in the high-level source code [53]. These different approaches either need to modify the synthesis tools to insert the instrumentation while performing the synthesis process, or modify (instrument) the code before putting it through the subsequent synthesis stage.

Source-Level Instrumentation

High-level source code instrumentation, or source-level instrumentation, is an HLS specific approach in which the user’s source code is transformed before any lower-level code generation. The motivation behind source-level instrumentation is threefold.

• First, inserting instrumentation at the source level creates a runtime-verifiable design that is portable between any HLS tools that use the same high-level language. Source-level instrumentation avoids the difficult task of mapping circuit-level structures to C-level elements. In solutions such as [12, 27], such a mapping requires access to a debug database from the HLS tool, making it difficult to apply the technique to a commercial HLS flow.

• Second, code transformations can be written in such a way that they take advantage of the HLS and logic synthesis optimizations; both the source code and the instrumentation will be optimized together.

• Third, the instrumented C code is readable and familiar to the designers, allowing them to better understand the role of the instrumentation in the debugging process.

A key challenge of instrumenting at the C level is the possibility that the overhead may quickly become overwhelming if too much instrumentation is added. In [52], it is shown experimentally that this overhead can be kept reasonable, suggesting this technique is feasible.

1.4 Contributions

In this work, a methodology is described to use source-level instrumentation for C-based HLS tools, to create memories and related circuitry to gather trace data that provides visibility into the operation of the circuit. This trace is then matched with a database that maps each memory entry to the high-level variables that have meaning to the designer during debug. The contributions of this work are presented here.

1. In previous work [52, 53], the focus was on connecting internal signals to Event-Observability Ports (EOPs) to provide access points in the design. These are left unconnected and meant to be connected to memories by modifying the generated RTL, or through proprietary Embedded Logic Analyzers (ELAs) [54]. In this work, we implement not only the connections to access points in C, but also insert trace buffer memories, as well as trace readback and related circuitry in the C-level design. This eliminates the need to make modifications after compilation, and as we will show, can lead to further co-optimization opportunities between the trace buffer memories and the circuit memories. We experimentally evaluate the overhead associated with such instrumentation.

2. Exactly which signals or events should be instrumented is vital to the effectiveness of this technique. In this work, we distinguish between two strategies, control flow instrumentation and data capture instrumentation, each of which provides different views of the circuit’s behaviour to the debugger, and evaluate the overhead for each of these strategies.

3. We introduce and evaluate a low overhead technique named Array Duplicate Minimization (ADM) to improve trace memory efficiency. ADM improves overall debug observability by removing up to 31.7% of data duplication created between the trace memory and the circuit’s memory structures. This optimization is enabled by the inclusion of the trace buffer memories in the original C code.

1.5 Thesis Outline

This thesis is organized as follows. Chapter 2 presents the corresponding background about FPGA programming and debugging solutions. This is backed by references from recent surveys and applications developed using HLS that demonstrate the growing availability and usability of this type of programming environment. These examples also emphasize the need for a debugging infrastructure that allows the user to remain in the higher abstraction level while still taking advantage of the flexibility of the underlying hardware. Related projects on source-level debugging are also contained in Chapter 2 to establish the state of the art in instrumentation techniques.

Chapter 3 presents the debugging flow that the user can expect from using the tools created for this project and also describes the methodology followed to implement those tools. Both Control Flow and Data Capture instrumentation are presented by using code examples.
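
As a rough illustration of the data capture idea described in Sections 1.3.4 and 1.4, the sketch below shows what a C function might look like once a trace buffer and readback routine have been written directly into the source. It is not one of the thesis listings: the names TRACE_DEPTH, trace_record, trace_readback, sum_of_squares, and the variable identifiers are hypothetical, and the actual transformation is the one presented in Chapter 3.

```c
#include <stdint.h>

/* Illustrative sketch only; all names here are assumptions, not thesis code. */
#define TRACE_DEPTH 8u   /* number of entries kept in the circular buffer */

typedef struct {
    uint8_t id;          /* identifies which source-level variable was written */
    int32_t value;       /* value assigned to that variable */
} trace_entry;

/* An HLS tool would typically map this array to on-chip memory. */
static trace_entry trace_buf[TRACE_DEPTH];
static uint32_t trace_count = 0;   /* total number of recorded writes */

/* Record one assignment; the oldest entry is overwritten once the buffer is full. */
static void trace_record(uint8_t id, int32_t value)
{
    trace_entry e = { id, value };
    trace_buf[trace_count % TRACE_DEPTH] = e;
    trace_count++;
}

enum { VAR_SQ = 0, VAR_ACC = 1 };  /* entries of the variable-mapping database */

/* User function with a data capture call inserted after each assignment. */
int32_t sum_of_squares(const int32_t *x, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++) {
        int32_t sq = x[i] * x[i];
        trace_record(VAR_SQ, sq);
        acc += sq;
        trace_record(VAR_ACC, acc);
    }
    return acc;
}

/* Readback: copy the surviving entries, oldest first; returns how many were copied. */
uint32_t trace_readback(trace_entry *out)
{
    uint32_t n = trace_count < TRACE_DEPTH ? trace_count : TRACE_DEPTH;
    uint32_t start = trace_count - n;
    for (uint32_t k = 0; k < n; k++)
        out[k] = trace_buf[(start + k) % TRACE_DEPTH];
    return n;
}
```

Because the buffer is ordinary C, the same file can be compiled for the workstation or passed to any C-based HLS tool, which is the portability argument made above. Each recorded identifier is resolved against the mapping (here the enum) to recover the source-level variable name during debug.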
Chapter 4 presents the motivation behind our Array Duplicate Minimization (ADM) technique and the methodology followed for its implementation.

In Chapter 5, the quantification of the instrumented designs is presented. These results are compared with data available from related work and the feasibility, trends, and corner cases of the proposed technique are analyzed. Chapter 6 concludes and suggests future work.

Chapter 2
Related Work

This chapter presents related work, describing the state of the art in HLS in-system debugging. Initially, this chapter introduces the HLS flow and several HLS frameworks. The HLS tools chosen to evaluate the proposed approach are presented in more detail. This is followed by Section 2.2, which contains an introduction to Embedded Logic Analyzers (ELAs), a standard in-system debugging approach for RTL-based FPGA designs. Verification and debugging methods for HLS frameworks are then described in Sections 2.3 and 2.4, covering several tools and instrumentation levels. Then, in Section 2.5, source-to-source transformation, its applications, and the framework used for this purpose are reviewed.

2.1 High-Level Synthesis Frameworks

HLS tools for digital design have been a focus of research for more than three decades [70]. The idea behind HLS is to specify a behavioral description and have the Computer-Aided Design (CAD) tools find the best use of logic resources to implement the operations and variables using functional units and registers.

Recently, with the proliferation of FPGAs, a wide range of tools have been made available by the device manufacturers, CAD tool companies, and by academic research efforts [15, 19, 44, 57, 67–69, 75, 77]. These and more are described in the latest HLS tools surveys [8, 21, 49], with the most recent one [56] listing and categorizing many of these tools.
A common observation gathered from these surveys is that the quality and ease of adoption of C-based HLS tools exceed those of other HLS tools using DSLs or other languages. This is measured using multiple factors, such as ease of implementation, abstraction level, supported data types, exploration, verification, area results, documentation, and learning curve.

In addition to the frameworks included in said surveys, the latest OpenCL-based products from Intel (Altera) [3] and Xilinx [74] make use of a similar infrastructure to compile C-based kernels into IP modules with the corresponding interface, to be called from a host device. Most recently, Altera announced the A++ compiler [1] for standalone IP design from C/C++ specifications. These design environments provide compelling productivity improvements for FPGA designers, and may open the field of FPGA acceleration to more designers than ever before.

2.1.1 HLS Flow

Figure 2.1 is a representation of the internal stages identified in most HLS tools. Similar to a software compilation flow, the HLS flow can be divided into three main stages: input code parsing, optimization, and unparsing. For this reason, HLS tools are commonly based on open source software-compiler infrastructures (e.g. LLVM or GCC), borrowing the nomenclature for many of their components. These three stages correspond to the frontend, optimizer, and backend.

Figure 2.1: HLS Flow

An HLS flow can be seen as a software compilation, where the backend is modified in order to generate an HDL specification instead of an architecture-specific binary file. However, the frontend and optimizer are also modified to target this specific flow, avoid unsupported constructs, and enable optimization.

The frontend is in charge of generating a formal representation, be it an Intermediate Representation (IR) code (a generic low-level version of the user code), a Control and Data Flow Graph (CDFG), or both.
The frontend can be set to recognize statements that are incompatible with hardware design, and therefore unsupported (e.g. dynamic memory allocation or system calls). These can in turn be transformed into supported constructs, ignored (printf() statements), or the synthesis flow can be halted. The optimizer, through multiple passes, transforms the IR to minimize the number of low-level instructions and, in general, the amount of resources required according to the constraints of the target architecture.

The backend consists of a set of steps that create an internal representation that can be unparsed into an HDL specification. These steps can be classified as allocation, scheduling, binding, and RTL generation.

Allocation: The number and type of operations identified in the IR are mapped into functional units. All variables, registers or memory values are assigned to physical structures of the corresponding type.

Scheduling: Through dependency analysis heuristics, such as the System of Difference Constraints (SDC) [14, 17], the allocated functional units are scheduled, and a set of states is defined for the execution of one or multiple threads.

Binding: When possible, non-conflicting functional units can be merged into one unique instance that is multiplexed throughout the execution; this mostly applies to scarce, resource-demanding structures such as dividers and multipliers.

RTL Generation: Once optimized using the least resources, the data structure used to represent the logic system architecture is transformed into an RTL representation that can be compiled using the corresponding EDA tool.

2.1.2 Vivado HLS

Previously owned by AutoESL under the name AutoPilot [79], this HLS tool was acquired by Xilinx in 2011 and is offered alongside the Vivado Design Suite package. The Vivado HLS IDE uses a familiar GUI, resembling that of the C/C++ Development Tool (CDT) for Eclipse [22].
As such, the IDE offers different perspectives or environments for Synthesis, Analysis, and Debug. Synthesis of C, C++, SystemC, and OpenCL kernels is supported, with a large set of standard circuit interfacing options and the use of TCL directives to control code optimizations such as unrolling, pipelining, inlining, chaining, and memory partitioning. The compile flow uses LLVM for its frontend and optimization passes. Visualization of the generated schedule of execution is possible using a Gantt Diagram representation, as well as a raw data report for more advanced analysis.

The Debug perspective is a C/C++ Software Debugging tool; it therefore features a complete set of software debugging capabilities, such as breakpoint insertion, variable monitoring, custom expression analysis, and register monitoring. The Debug perspective is activated when using the C simulation tools. This does not take into account any of the synthesis results and, instead, executes a sequential version compiled for the host architecture. Therefore, as mentioned before, bugs created during the synthesis flow or activated during in-system interfaces will not be captured.

For synthesis verification, Vivado HLS offers the C/RTL Co-simulation. This is an RTL simulation using one of the provided hardware simulators (e.g. Vivado Simulator or ModelSim) and requiring a testbench as a wrapper of the synthesizable code. The testbench can contain a simple return value comparison or complex user-designed Vector Based verification procedures. The testbench return value is the measure of success, and the tool reports this and the Co-simulation latency, or number of cycles of execution observed in the simulation.

In order to perform in-system debugging of the generated hardware, one must first export the generated RTL to a Vivado RTL project.
Once synthesized, before implementation (optimization, place, route, and bitstream generation), the in-system debug flow requires the insertion of the proprietary Debug Cores provided by Xilinx, configuration of the cores’ properties, and connection of the cores’ probe ports to signals of interest, either as data, triggers, or both. There are no HLS related configurations or features at this design level.

2.1.3 LegUp

The LegUp framework, being developed at the University of Toronto, is an open-source HLS research tool [15]. LegUp takes ANSI-C as an input and is currently capable of generating Verilog circuits for a small set of Altera and Xilinx devices. The framework lacks a GUI interface for code editing and project management, but offers a considerable set of features configurable through TCL scripts for code optimizations and options for hardware generation, such as loop pipelining, loop unrolling, function inlining, and RAM grouping. A LegUp design choice is to group global arrays and arrays referenced in multiple functions into one memory with a single memory controller interface. This is contrary to Vivado HLS, in which these are assigned to independent memory blocks.

The LegUp design flow offers three options for design compilation: a hardware-only implementation, a software-only alternative including a soft MIPS processor in the FPGA to execute the software provided, or a hybrid version that runs top-level functions in the soft processor and calls hardware accelerators for user-chosen functions. A profiler is also provided to inform the user regarding the behaviour of the program on the soft processor and aid in the choice of functions for acceleration.

LegUp offers an integrated in-system debugging experience. In version 4.0, LegUp provides the HLS Debugger infrastructure, which is part of the open source repository. With this tool, users can recompile the code and include debugging instrumentation.
Section 2.4.2 provides a more detailed description of this infrastructure.

In order to fully support the approach used in this work, LegUp was modified to provide port declaration in the same way as is done in Vivado HLS. Appendix A explains the modification.

2.1.4 Benchmarks

The CHStone benchmark suite is widely used to evaluate HLS tools. This benchmark suite is based on a selection and creation of programs that span multiple domains with various quantifiable characteristics [38].

Although C-based HLS tools take C-like programs as input, such tools do not completely comply with the ANSI/ISO C standard and only support a subset of the C language. This benchmark suite is designed according to these limitations by avoiding, for instance, the use of dynamic memory allocation and recursion. This, however, does not guarantee that all programs can be synthesized on every HLS tool since every tool has specific requirements for the input code. For the HLS tools used in the experiments for this work, 3 out of 12 of the original programs did not compile in at least one of the HLS tools; these have been removed from the results. The remaining majority of programs give a good representation of the behaviour of the tools with and without the debugging techniques presented herein.

2.2 FPGA Debug

Debugging a digital circuit on FPGA requires a thorough understanding of the implementation. In particular, understanding the behaviour of a design often requires observing its internal signals over time. Due to resource constraints, it is not possible to observe the behaviour of all signals; therefore, it is important for the designer to be able to select and analyze only the most relevant signals of the design [42]. Once identified, the behaviour of those signals of interest can be exposed to the user through built-in readback mechanisms or trace recording.
Built-in logic such as the JTAG interface, also used for programming, provides read access from all circuit nodes [6]. However, using the JTAG scan-based infrastructure to observe the execution in real-time and in situ can be destructive. The execution of the circuit needs to be paused to read out all signals of interest.

The alternative trace record-replay approach described in Section 1.3.3 is achieved through the use of Embedded Logic Analyzers (ELAs). We describe ELAs below, followed by similar approaches targeting HLS design flows.

2.2.1 Embedded Logic Analyzers

Embedded Logic Analyzers enable the monitoring of selected hardware signals, in situ and at runtime, through the insertion of trace buffers. The ELAs store the signals of interest cycle-by-cycle and use a trigger unit to choose when to read back these buffers. Traced signals and signals for trigger conditions are chosen before synthesis, while trigger conditions are often allowed to vary after place and route. Some reconfigurability is allowed to avoid recompilation. These signals are presented as waveforms to the designer and labeled using the name of the HDL or the post-fitting (place and route) resource.

Commercial ELA tools such as SignalTap II [2], ChipScope (now called LogiCORE Integrated Logic Analyzer [73]), and Certus [50] use this approach and incorporate multiple optimizations to maximize the use of memory resources. Recent work on ELAs focuses on resource optimization and incremental instrumentation. The latter is concerned with the reduction of compilation time by creating a field-configurable trigger network overlay [24, 43].

2.2.2 HLS Verification & Source-Level Debug

When debugging circuits created using an HLS flow, analyzing signals at the hardware level is challenging. The user provides a C specification and ultimately implements a circuit on an FPGA; therefore, debugging should be performed in the context of the original source code. In related work there are two main approaches: HLS verification and HLS source-level debug infrastructures. The first category is based on On-Chip Monitors (OCMs), which are created to automatically identify inconsistencies between the execution of the generated hardware and the software-like specification. Work in this category includes Assertion Based Verification (ABV) [20, 35, 63], Control Flow Integrity (CFI) Verification [9], and Embedded Signature Monitoring (ESM) [13].

Instrumentation for HLS source-level debug, on the other hand, has the objective of allowing the user to observe a step-by-step execution of the code, as introduced in Section 1.3.

Although verification and source-level debug infrastructures have different objectives, both approaches require instrumentation or hardware resources in order to capture, store, or monitor the behaviour of the user’s design.

2.3 HLS On-Chip Monitors

On-Chip Monitors (OCMs) are runtime verification tools which can target different behaviours for analysis. OCMs automatically recognize unexpected behaviour in the generated hardware and notify the user of such events. Counter reactions can be implemented in order to correct the errors detected, although work on reactive OCMs, to our knowledge, has not yet been developed for HLS. Relevant work in this area is presented below.

2.3.1 Assertion Based Verification

Although this approach is commonly used when designing hardware at the RTL level, related work for HLS is more limited. Examples of HLS flows with the capability to recognize assertion statements have been developed by upgrading the tool to support assertions for simulation [63], to synthesize them into custom OCMs [35], or by performing source-to-source transformations into synthesizable constructs [20].

Work on temporal assertions by Ribon in 2011 [63] made use of simulations and HDL built-in assertion statements.
These were inserted automatically during program synthesis by translating behavioural assertion statements into temporal assertions. Temporal assertions are assertions with timing specifications, and are therefore described in an HDL. This translation included information on operation scheduling and data availability, and used the corresponding signals associated with the high-level variables.

In order to complement the verification and debugging approach given by assertions in software, HLS can also make use of timing information to generate OCMs. In 2011, Curreri’s ABV approach [20] used Impulse-C and the time library to automatically insert clock() calls in the circuit and evaluate the time passed between calls, which is then compared with a predefined value to determine timing compliance. This is only applicable to the CoDeveloper HLS tool developed by Impulse [44]. The inclusion of the time library and other standard libraries is a feature that is missing from most HLS tools.

In 2014, Hammouda proposed two flows that can be applied to any HLS tool. The first flow is to automatically synthesize assertion statements as OCMs [35], while the second flow automatically generates a CFI verification OCM infrastructure [9]; the latter is explored in the next subsection. The ANSI-C assertion synthesis flow differs from previous work in that it creates an FSM and Datapath, independent from that of the user circuit. This requires access to the internals of the tools’ source code, something that is not assumed by Curreri, but necessary in order to target any HLS tool by using the CDFG (Control and Data Flow Graph) representation.

2.3.2 Control Flow Integrity Verification

Control Flow Integrity (CFI) in software is a safety property to detect attacks that provoke unintended software behaviours. In a broader sense, CFI verification for programmable hardware can help detect unintended circuit behaviours when compared to the high-level source code.
For this purpose, only control flow information is required.

Hammouda’s CFI work [9] uses the same independent circuit approach seen in [35], which is only feasible if there is access to modify the internals of the HLS tool. This circuit takes the STATUS of the user circuit as its input and recognizes control flow discrepancies during execution. The STATUS signal is the name given by the authors to a generic set of signals that indicate the state of execution of the synthesized circuit, for example the state of an FSM and/or flags in a Status Register (SR). This approach then relies on static analysis of the CDFG of the program and the generation of a fairly complex OCM architecture.

At the same time, the OCM architecture uses an I/O Control Unit (IOCU) to verify that register loads are performed in the corresponding states, providing both control flow and I/O timing behaviour monitoring. The proposed flow targets any HLS tool, although experimental results are obtained using a combination of GCC and GAUT [19].

2.3.3 Embedded Signature Monitoring

More recent work from Chen [13] makes use of Embedded Signature Monitoring (ESM). In ESM, the synthesized hardware is extended with the instrumentation for signature generation. Signatures are produced from several circuit state signals, including memory interface signals (address, data in, data out), and FSM and datapath registers. This work uses a co-simulation approach, in which the source code is instrumented both for hardware and software execution; the two versions are executed in tandem to generate and verify signatures using the software execution as the golden model.
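As a rough illustration of signature generation, the sketch below compacts a stream of sampled values into a 16-bit signature with an LFSR-style register; the same routine could be run over the hardware and software traces and the two results compared. It is not taken from [13]: the function names, the 16-bit width, and the tap mask 0xB400 are assumptions chosen only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature update: advance a 16-bit Galois LFSR by one step
 * and fold the new sample into the register. The tap mask is an assumed
 * value, not one specified by the referenced work. */
static uint16_t sig_update(uint16_t sig, uint16_t sample)
{
    uint16_t lsb = (uint16_t)(sig & 1u);
    sig = (uint16_t)(sig >> 1);
    if (lsb)
        sig = (uint16_t)(sig ^ 0xB400u);
    return (uint16_t)(sig ^ sample);
}

/* Compact an array of samples into one signature, starting from zero. */
static uint16_t sig_of(const uint16_t *samples, size_t n)
{
    uint16_t sig = 0u;
    for (size_t i = 0; i < n; i++)
        sig = sig_update(sig, samples[i]);
    return sig;
}
```

Two runs that sample the same values in the same order produce the same signature, and a single differing sample always changes it. Streams that differ in several samples can occasionally alias to the same signature, which is the usual trade-off of compaction.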
Several challenges were addressed in this work, including memory address matching between SW and HW, LFSR implementation for signature generation, and area optimization through instrumentation resource binding.

2.4 HLS Source-Level Debug

HLS source-level debugging allows the designer to find and analyse unexpected circuit behaviour in situ and by inspecting the high-level source code. This debugging approach borrows part of the hardware debugging methodology that allows the visualization of the state of the circuit by inserting instrumentation to record a set of signals of interest into a collection of memories. After running the design, these memories will contain a history of the execution which can then be read, and the behaviour of the circuit can be reconstructed. For this to be useful to the designer in an HLS workflow, the signals have to be related back to source-level variables and statements.

The user can then, offline, replay a step-by-step reproduction of the execution of the program to look for unexpected behaviour, iterating through the complete runtime by pausing program execution, collecting data from the trace buffers, and then resuming the execution. The size of the memories, often called trace buffers, and their efficiency in the use of storage determine the observability of the program. Better observability improves the probability of finding the root cause of the unexpected behaviour. In order to allow greater coverage of the execution history, the trace buffer memories have a rollover behaviour, meaning that once a buffer is full the oldest entry is evicted on each new write access.

2.4.1 JHDL Debug

Previous work presented in 2003 developed the debugging infrastructure for the JHDL-based Sea Cucumber (SC) framework [39]. This project presented the basis for a debug system that relates the hardware signals of synthesized circuits with the high-level language statements.
Special attention was put on allowing software optimizations, since most compiler passes can substantially modify the execution schedule and variable nomenclature. Instrumentation for the SC framework made use of device-specific readback mechanisms and additional debug circuitry.

2.4.2 LegUp Debug

Recently, two similar approaches were presented for the C-based LegUp HLS framework, namely Inspect [12] and Goeders’ HLS Debugger [27]. The latest version of the LegUp repository contains code from both of these projects [59].

Inspect

Inspect allows single-stepping through either HW cycles or source code statements by using SignalTap II for data capture and storage, and a database to relate signals to LLVM’s [45] Intermediate Representation (IR) nodes, and to Verilog. Inspect works as both a debugging infrastructure and an OCM due to its capability of comparing the software execution with the RTL simulation, but more importantly by integrating a discrepancy detection flow. These features allow for RTL verification at the same time as providing the user with the source-level debugging information. On the other hand, due to Inspect’s use of proprietary circuitry for instrumentation (i.e. SignalTap II), the debugging infrastructure is not optimized for HLS. Signals for instrumentation must be selected manually and are limited by the available on-chip RAM.

Goeders’ HLS Debugger

Parallel work presented similar features with the insertion of a custom debugging system during the synthesis process [29], instead of using SignalTap II. This work and subsequent research [28, 30], collected in [27], have focused on optimizing resource utilization, allowing multiple thread instrumentation, and successfully obtaining longer replay window lengths.
This translates into more lines of code available in a step-through interface comparable to gdb. The architecture of this instrumentation includes the following modules:

Debug Manager: This module is the communication manager, receiving and transmitting commands from and to the workstation in order to start data collection routines, start/pause/stop execution, etc. Communication is done through an RS232 serial connection with custom commands.

Stepping and Breakpoint Unit: This unit enables or disables a clock buffer, controlling the execution of the user circuit. The user is allowed to put a breakpoint in the source code in the same way as it is done in software IDEs. This breakpoint then translates into conditions triggered by a certain circuit state, memory address, or chosen variable value.

Memory Arbiter: The instrumentation can take control of the main memory controller in the circuit and use it to retrieve all memory values. The arbiter is used to grant access to either the user circuit or the instrumentation.

State Encoder: One of the values recorded into the trace buffer is a state identifier. This is used to relate the execution with one or more statements scheduled concurrently. However, the architecture of the synthesized circuit uses a one-hot encoding which needs to be re-encoded or compressed to be stored in the trace buffer.

Trace Recorder: This is the bulk of the instrumentation. Besides containing the memory blocks to store data and states, this includes a Signal-Trace Scheduler. Only signals relevant to the state in execution are stored, and due to previous knowledge of these signals’ sizes and schedule, it is possible to rearrange them and make better use of the memory space.

This instrumentation and its optimizations rely heavily on LegUp’s memory architecture and synthesis flow. Most of the debugging system stands between the main module and the main memory controller.
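The re-encoding performed by a state encoder can be illustrated in software: a one-hot state vector with N bits is replaced by a binary index of roughly log2(N) bits before it is written to the trace buffer. The sketch below only illustrates that idea and is not LegUp's encoder; the function name and the 32-bit state width are assumptions.

```c
#include <stdint.h>

/* Hypothetical helper: convert a one-hot FSM state vector into a compact
 * binary state ID. Returns -1 if the input is not one-hot. */
static int onehot_to_id(uint32_t onehot)
{
    /* Exactly one bit must be set: nonzero and a power of two. */
    if (onehot == 0u || (onehot & (onehot - 1u)) != 0u)
        return -1;

    int id = 0;
    while ((onehot & 1u) == 0u) {
        onehot >>= 1;   /* shift until the single set bit reaches bit 0 */
        id++;
    }
    return id;
}
```

For a circuit with 32 states, this reduces the recorded state field from 32 bits to 5 bits per trace entry.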
This LegUp architecture is advantageous for the debugging system, but often impacts design flexibility and performance when compared to Vivado’s distributed memories, representing a bottleneck for some array accesses.

2.4.3 Event Observability Ports

The work most closely related to this thesis is on Event Observability Ports (EOPs). Work on EOPs has focused on adding points of connection in the circuit [51] to extract data. It further showed a method of providing these access points through source-to-source transformations [53], followed by a set of experiments demonstrating their feasibility by showing no significant impact on circuit performance and resource utilization [52].

In the encompassing thesis [54], the researcher presented additional methods to instrument variable pointers and to add compatibility with both Vivado HLS and LegUp. This work, however, did not consider the C-level instrumentation of the trace buffers or associated circuitry, and is mainly focused on guaranteeing that this observability does not significantly impact the original behaviour of the circuit.

This work remarks that source-level instrumentation adds reasonable overhead to the user’s circuit. In Section 5.1 we compare the results from EOP instrumentation to our own approach, which includes trace buffers and the buffer access circuitry. In Appendix B we show a method to insert EOPs, different from the one found in [54], and show the results of additional experimentation.

The main drawback of this implementation is that even though this brings the instrumentation to a higher level of abstraction, it is still necessary to perform significant transformations in RTL or rely on proprietary tools. This work suggests the use of custom logic or ELAs in combination with the EOPs in order to add the triggering logic and trace buffers.
In [51], a problem of buffer unbalancing is pointed out; this occurs when the EOPs are connected to individual Event Observability Buffers (EOBs). Due to varying refreshing rates for each variable, these buffers are often not used very efficiently. In response, the authors presented the Relative Assertion Rate (RAR) metric and its use for the allocation of a relative size of EOB per variable by using dynamic analysis [51].

2.5 Source-to-Source Transformations

Source-to-source transformation is the manipulation of lines of code, statements, and expressions. In order to allow this, a piece of code or a collection of source code files needs to be read into a data structure. The Abstract Syntax Tree (AST) representation is used for this purpose in compilers. An AST is elaborated following a dictionary and a set of syntactic rules, creating nodes representing blocks of code, statements, expressions, and symbols. Once parsed, modifications to the AST can be either translated into a lower-level representation, or unparsed in order to generate the original code with the modifications.

2.5.1 ROSE Compiler Infrastructure

The ROSE compiler infrastructure is an open source project by the U.S. Department of Energy with the goal of designing code optimization tools. Through the repository, ROSE provides an Application Program Interface (API) for the inclusion of compiler techniques and optimizations, and also the infrastructure to develop source-to-source transformation tools for custom purposes.

The ROSE API provides a parsing tool that supports multiple files and preprocessing of the source code. A set of query functions allows finding AST nodes according to different attributes and classes; these queries can be very specific or generic in order to do a custom post-filtering of the statements of interest once the query is finished.
Statement generation using the ROSE API requires creating the expressions bottom-up, incrementally generating the desired line of code or statement. When unparsing, the ROSE API allows the designer to run a full set of tests called Sanity Check, not only to make sure the AST is unparsed correctly but to make sure it is consistent.

Figure 2.2 presents the Node hierarchy with the main Nodes of interest used in this work. These nodes are represented in an Intermediate Representation (IR) code specific to ROSE called Sage III, and automatically generated using ROSETTA [62]. The Sg- prefix is part of the IR nomenclature.

Figure 2.2: Hierarchy ROSE AST Nodes

• An SgLocatedNode is a node with a specific location in the source files, also called instructions.
  – SgExpressions are instructions that have return values, such as operations, and variable/function references.
  – SgStatements are any type of instruction, such as control flow modifiers (SgIfStmt, SgForStatement), declarations, and expression statements which are made out of one or multiple SgExpressions.
• SgType nodes do not have a specific location in the source files, but are used to define each variable type.
• SgSymbol nodes are shared and unique for each variable, function, enumerator, label, or any other symbol in the source code.
• SgSupport nodes represent all other types of nodes, such as attributes, modifiers (constant, volatile, etc.), and higher level data structures including ROSE projects, files, and tables for internal use by ROSE.

Each node has a set of attributes that indicate its branches or leaves and the characteristics of that class of node.
Figure 2.3 is the listing of an example node taken from one of the benchmarks used for the experiments described in this thesis.

Figure 2.3: Example ROSE AST Node

Code Transformations

One example of using the ROSE Compiler Infrastructure with HLS applications is a tool flow that can automatically parallelize loops in C/C++ programs that use pointer arithmetic [71]. This work also distributes dynamically allocated data structures between on-chip memory, adding support for dynamic memory allocation calls. Both transformations are performed at the source level, resulting in generic implementations that can be put through various compatible HLS tools.

Code Restructuring

Other work targeting HLS flows has been done in order to optimize the synthesis results through source-level transformations. In [48] and [47] the authors studied the use of loop transformations and pre-optimized templates to aid the user in the process of circuit optimization. Future work, however, is aimed towards automatic code restructuring using source-to-source transformation tools to replace the user’s code with a more suitable construct, according to the findings of this work.

Others

There is a great amount of research in source-to-source transformations for software applications, which is often relevant for HLS. This is a great advantage of using HLS with the language subset approach. Relevant work in this area has been presented using the ROSE compiler infrastructure, with two notable examples.

In [46], the researchers present a code outliner. This is a transformation tool capable of generating modular kernels out of whole programs without affecting performance.
This is targeted towards parallel computing applications since it includes the use of OpenMP to further optimize the kernels. Moreover, due to recent advances in the use of OpenMP and Pthreads for HLS [16], this work can be applicable to HLS.

Similar work without using ROSE can also be found for parallelization optimizations embedded within HLS tools such as SPARK [34] and ROCCC [33]. Work on source-to-source transformations for specific languages such as X10 [40] has also contributed to developing optimization tools at this level of abstraction. In [40], researchers recently presented methods to incorporate Loop Unrolling, Inlining, Stack allocation, and Loop Invariant Hoisting into the ROSE API.

2.6 Summary

This chapter described the previous work in this area. The objective of this thesis and referenced work is to bridge the gap between software programmers and FPGA implementations. A debug infrastructure is necessary to achieve this goal and it needs to work at the design level (C code), using the information from the implementation level (hardware signals). Capturing those signals is possible through the use of instrumentation, which can be specified and inserted at various levels of the design. Source-level instrumentation, however, offers greater portability than other levels of instrumentation, and previous explorations of this approach have been able to achieve this with low impact on the user’s circuit performance and area.

Chapter 3
Source-Level Instrumentation

An important part of an effective in-system debug infrastructure is the instrumentation that provides access to internal signals in the design. Such infrastructure needs to consume as little area as possible, and result in the least intrusion possible into the user design. While previous work proved the feasibility of adding EOPs to the user’s circuit at the source level, the work in this thesis includes the trace buffer and associated circuitry.
This is beneficial for portability, eliminates the need for RTL editing, and allows further optimization during synthesis.

This chapter presents our instrumentation, which provides both control flow and data capture capabilities, as well as the methods we use to insert it. Section 3.1 presents the debug framework, which is needed to insert this instrumentation and to retrieve and interpret the data in relation to the original source code. Sections 3.2 and 3.3 present each proposed feature of the debug framework instrumentation (Control Flow and Data Capture), with the corresponding insertion method and code examples of the resulting source code.

3.1 Source-Level Debug Framework

Figure 3.1: HLS In-System Debug Framework with Source-Level Instrumentation

Figure 3.1 shows our overall in-system debug framework. Starting at the top-left, the original user C code is parsed into the Abstract Syntax Tree (AST) representation. Instrumentation is automatically inserted by a custom tool built using the ROSE source-to-source compiler infrastructure API [61]. Simultaneously, the database is created, mapping the IDs used in these new statements to the user’s code constructs, statements, and/or variable symbols. The last stage in the ROSE compiler unparses the AST to a modified set of C source files.

Initial stages in HLS tools can also perform AST parsing and editing, meaning these could be modified to incorporate an instrumentation stage and would, therefore, not require unparsing. Keeping the instrumentation and synthesis separate has an insignificant impact on processing time, while favoring portability.

The modified C code is then compiled using the HLS tool of choice. In our experiments this is either Vivado HLS or LegUp. The synthesized RTL description is then instantiated by a generic top-level module that contains elements of the debugging infrastructure (i.e. 
communication with the workstation, clock buffering, circuit resetting), and is then put through the EDA compilation flow to generate the FPGA bitstream containing the user’s circuit and the instrumentation.

3.1.1 Instrumentation

The inserted instrumentation, as seen in Figure 3.1, consists of (1) a network that taps off key signals in the user design, (2) a collection of memories, referred to as trace buffers, which store a history of the behaviour of these signals, and (3) the serial connection between the chip and the debug workstation, used to retrieve the data after the circuit has run. The amount of history available depends on the size of the trace buffers, as well as the efficiency with which data is stored in them [30]. This trace record-replay technique is preferable to on-line debugging, where signals are retrieved on every step of execution, because the latter prevents the circuit from running at speed.

An Integrated Development Environment (IDE) may provide a software-like debugging experience, as described in related work [29], by communicating with the instrumentation provided in this work. The IDE allows the user to run the design at speed, stop at a pre-determined breakpoint, and then retrieve the current value of variables of interest. To provide more information, the trace also contains a history of variable values and control-flow information; the larger this history, the easier it will be for the designer to narrow down the root cause of a bug. The code version used through this IDE is the original code. The instrumented code is only for internal use by the HLS process.

In the following sections we describe how the designs are instrumented, first, to record control-flow information, useful for Control Flow Integrity (CFI) verification. 
Then, as the main objective is to provide a software-like debugging infrastructure, we extend the instrumentation to record variable data and provide a trace readback mechanism.

3.2 Control Flow

We first consider instrumentation that provides sufficient data for an IDE to replay circuit execution, allowing the designer to understand how the design behaved while it was running. Such data can also be used for CFI verification, to find discrepancies between the recorded path and the designer’s expectation, by comparing the captured trace to a software simulation or to a Control Flow Graph (CFG) generated using static analysis.

3.2.1 Control Constructs

In our system, each control construct in the AST of the original code is instrumented. Each control construct is a collection of statements (e.g. for-loop bodies, if-true and if-false bodies, while-loop bodies, function definitions). At the same time, a database is populated with unique IDs mapped to those constructs. As the circuit executes, each time a control construct is encountered, the instrumentation stores the ID of that construct in the trace buffer.

In the ROSE compiler framework, control constructs can be identified using the SgBasicBlock node class. SgBasicBlocks do not follow the same definition as the basic blocks found in compiler frameworks. There, a basic block is a sequence of lower-level instructions (in the compiler’s IR) with only one entry point and one exit point. An SgBasicBlock, on the other hand, is a sequence of statements of the original source code and does not have this restriction.

3.2.2 Code Example

Listing 3.1 shows an example of this instrumentation. In this example, every execution of the function call pushDbgCF(<ID>) “pushes” the ID of the construct of interest into the trace buffer. 
Lines 1-8, 10, and 13 are the instrumentation code; lines 1-8 are inserted in the global scope while lines 10 and 13 are prepended in each control construct. Line 1 creates the trace buffer. A monolithic approach (single trace buffer) was chosen, rather than multiple buffers, to avoid the need for buffer balancing and event-sequence reconstruction. The trace buffer is configured as a circular memory, meaning that once it is full the oldest entries are overwritten. Lines 2 and 3-8 provide a trace buffer index and access function; the pushDbgCF(<ID>) function contains only the buffer write and a conditional statement that wraps the index to avoid invalid addressing. Experimentally, we have found that this if-statement does not affect the store routine latency when synthesized.

Listing 3.1: C Control Flow Instrumentation
 1 volatile int TRACEBUFFER[TRACESIZE];
 2 unsigned int bufIndex=0;
 3 void pushDbgCF(int ID){
 4   TRACEBUFFER[bufIndex++] = ID;
 5   if (bufIndex==TRACESIZE){
 6     bufIndex=0;
 7   }
 8 }
 9 void foo(){
10   pushDbgCF(CDBGID_STMNT1);
11   ...
12   if(...){
13     pushDbgCF(CDBGID_STMNT2);
     ...

3.3 Data Capture

In order to build a debugging infrastructure, the instrumentation needs to capture the data that is produced while the circuit is running. The second instrumentation strategy we consider gathers data regarding the history of variable values over the run of the circuit. Such data can be used in conjunction with an IDE such as that in [27] to let the user observe variable values in terms of the original code, helping them understand the overall behaviour of the circuit.

Figure 3.2: Hierarchy of Assignment Statements

3.3.1 Assignment Statements

In the Data Capture strategy, each location in the code in which a variable is updated is assigned a unique ID. An ID is assigned to basic assignments (a=b, a=fn()), compound assignments (a+=b), and unary operations (a++). More complex statements are split recursively, and IDs are assigned to those basic statements. 
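As a rough illustration of these statement classes, the sketch below shows how basic, compound, and unary assignments might look once each is followed by a capture call, and one plausible way a chained assignment could be reduced to basic statements. The `pushDbg` body, the buffer depth, the `ID_*` constants, and the `assert` (a software-only sanity check) are hypothetical stand-ins and not the output of the actual transformation tool.

```c
#include <assert.h>
#include <stdint.h>

#define TRACESIZE 8u   /* assumed buffer depth, for illustration only */

/* Hypothetical IDs; the real tool generates them together with its database. */
enum { ID_BASIC = 1, ID_COMPOUND, ID_UNARY, ID_SPLIT_INNER, ID_SPLIT_OUTER };

volatile uint64_t TRACE[TRACESIZE];
unsigned int bI = 0;

/* Store one {ID, 32-bit data} record and wrap the index (circular buffer). */
void pushDbg(int data, int ID)
{
    assert(bI < TRACESIZE);   /* software-only check of the index invariant */
    TRACE[bI++] = ((uint64_t)(uint32_t)ID << 32) | (uint32_t)data;
    if (bI == TRACESIZE) {
        bI = 0;
    }
}

int instrumented_example(int b)
{
    int a, x, y;

    a = b;                       /* basic assignment */
    pushDbg(a, ID_BASIC);

    a += b;                      /* compound assignment */
    pushDbg(a, ID_COMPOUND);

    a++;                         /* unary operation */
    pushDbg(a, ID_UNARY);

    /* Original statement: x = y = a;  shown here as two basic assignments. */
    y = a;
    pushDbg(y, ID_SPLIT_INNER);
    x = y;
    pushDbg(x, ID_SPLIT_OUTER);

    return x;
}
```

Each capture records the value immediately after the update, so the trace order mirrors the execution order of the assignments.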
Figure 3.2 shows various examples of splitting higher-hierarchy classes (left) until the basic statements of interest are found (shaded blocks on the right). For each assignment statement found in the original code, the transformation tool inserts a call to pushDbg(<data>,<ID>).

Implicitly, this instrumentation acts as triggering circuitry. Triggering circuitry specifies when and which signals to store in the trace buffer. In a custom HDL-based approach, a trigger condition needs to be specified for signals of interest, and inserted at design time or incrementally using spare resources [24, 43]. In HLS instrumentation, triggering conditions are implicitly defined by the assignment statements, meaning that a signal is only stored when the state machine reaches the state where its value is modified.

Note that instrumenting for data capture also provides control flow information. As long as each control construct contains at least one data capture, we have sufficient data to reconstruct the control flow, meaning this method subsumes that of Section 3.2. If a basic block does not contain a data capture (which we did not encounter in our benchmarks), control flow logging for that particular block can be added. Relying on the data captures in this way avoids separate control-flow entries and saves trace buffer space, allowing more updates to be stored in a fixed amount of memory.

3.3.2 Code Example

Listing 3.2 is an example of this technique; lines 1-17, 23, 26, and 28 represent the instrumentation. In this example, two trace buffer access functions are used: one for 32-bit quantities (line 4) and one for double-length quantities (line 8). A casting statement using pointer indirection is used in line 26 in order to support storing floating-point data types in the same buffer. Experimentally, we found that having multiple operations in the store routine (i.e. cast, shift, or, and) does not affect the latency of the synthesized hardware.
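The record layout used in Listing 3.2 can be sketched in isolation as follows: an ID in the upper 32 bits and a payload in the lower 32 bits, with a double split across two records and reassembled on the host side. The helper names, the fixed-width types, and the use of `memcpy` in place of the pointer cast are assumptions made for illustration; they are not taken from the thesis tool.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack a 32-bit ID (upper half) and a 32-bit payload (lower half). */
uint64_t pack_entry(uint32_t id, uint32_t data)
{
    return ((uint64_t)id << 32) | data;
}

uint32_t entry_id(uint64_t entry)   { return (uint32_t)(entry >> 32); }
uint32_t entry_data(uint64_t entry) { return (uint32_t)entry; }

/* Split a double into two records, high word first, as pushDbgLong does. */
void pack_double(double value, uint32_t id, uint64_t out[2])
{
    uint64_t bits;
    assert(sizeof bits == sizeof value);   /* assumes a 64-bit double */
    memcpy(&bits, &value, sizeof bits);    /* bit-for-bit copy */
    out[0] = pack_entry(id, (uint32_t)(bits >> 32));
    out[1] = pack_entry(id, (uint32_t)bits);
}

/* Host side: rebuild the double from two consecutive records. */
double unpack_double(const uint64_t in[2])
{
    uint64_t bits = ((uint64_t)entry_data(in[0]) << 32) | entry_data(in[1]);
    double value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

The conversion of the payload to an unsigned 32-bit type before the OR matters: a negative int converted directly to 64 bits would be sign-extended and would overwrite the ID field.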
Listing 3.2: C with Data Instrumentation and Readback
 1 volatile Ulong TRACE[TRACESIZE];
 2 unsigned int bI = 0;
 3 volatile Ulong traceOut;
 4 void pushDbg(int data,int ID){
 5   TRACE[bI++] = (((Ulong)ID)<<32) | ((UInt)data);
 6   ...//Conditional for invalid addressing
 7 }
 8 void pushDbgLong(long dataL,int ID){
 9   TRACE[bI++] = (((Ulong)ID)<<32) | ((UInt)(dataL>>32));
10   ...//Conditional for invalid addressing
11   TRACE[bI++] = (((Ulong)ID)<<32) | ((UInt)dataL);
12   ...//Conditional for invalid addressing
13 }
14 void traceUnload(){
15   Unload: for (int i = 0; i < TRACESIZE; i++)
16     traceOut = TRACE[i];
17 }
18 int main(){
19   int temp[SIZE];
20   double result;
21   ...
22   temp[i]=fn(); //Assignment Type: int
23   pushDbg(temp[i], CDBGID_STMNT);
24   ...
25   result += temp[i]*3.14159f; //Assignment Type: double
26   pushDbgLong(*(long*)&result, CDBGID_STMNT2);
27   ...
28   traceUnload();
29   return result;
30 }

3.4 Trace Readback

The instrumentation techniques described above do not include any trace reading mechanism. One approach is to instrument the synthesized memory in Verilog by tapping into the Address, Data, and Control signals connected to it. This would reduce the high-level portability of the rest of the instrumentation, although not drastically, since memory primitives have very similar interfaces. However, to keep the level of abstraction at the source level, it is possible to insert a new function call that unloads the trace memory.

A read-back of the trace buffer must be triggered by an event. This causes the circuit execution to be paused and all data to be sent over the serial connection to the user’s workstation. In our implementation, rather than inserting the communication controller in the C code, we halt execution and expose the contents of the trace buffer to be transmitted by a generic serial communication controller in the top module. A simple RS232 interface was implemented for communication between the workstation and the circuit. 
In HLS, the readback-trigger events coincide with the use of breakpoints; wherever the user sets a breakpoint in an IDE, a trace-unloading routine needs to be inserted.

3.4.1 Code Example

Listing 3.2 shows an example in which the output port declaration and trace-unloading function definition are in lines 3 and 14-17, respectively. Line 28 calls the trace-unloading routine. Although such a call can be added anywhere in the code, care must be taken to avoid adding these calls inside latency-sensitive blocks, since interfaces can time out. Preferably, unload calls are inserted after a critical section. A call to traceUnload() is added by default before the main return statement.

3.5 Trace Reconstruction

Trace reconstruction is the step that relates the trace obtained from the circuit to the database created during the source transformations. The generated AST needs to be stored rather than regenerated, because address assignments during source code parsing are nondeterministic. Even though the resulting AST structure is deterministic, the addresses associated with each node can vary; in Figure 2.3 these values can be seen as pointer:0x1c75f40 for that specific node, and for that node’s parent and branches (i.e. p_parent, p_lhs_operand, p_rhs_operand).

The database contains a one-to-one relation between the identifiers assigned for each pushDbg(<ID>) call and the pointer value of one statement. These statements are control constructs or assignment statements and, therefore, contain information about the exact location of the instruction and its attributes.

3.6 Summary

This chapter described our approach for inserting instrumentation into a user design. Unlike previous work, our instrumentation includes the trace buffer and associated circuitry in order to capture the data flow of the execution. 
In addition, this chapter presented the Control Flow instrumentation, allowing the user to opt for a less detailed, less invasive debugging flow.

A set of transformation tools built using the ROSE compiler infrastructure was developed to test these features through the experiments described in Chapter 5. All the tools follow the same automatic process described in Section 3.1 and can be easily configured to perform either Control Flow or Data Capture instrumentation insertion, to change the buffer size, or to enable or disable optimizations like the one described in the following chapter.

Chapter 4

Array Duplicate Minimization

The instrumentation approach described in this work can lead to unnecessary data duplication. The fact that we use instrumentation at the source level, however, provides a unique opportunity to address the duplication. This chapter explains this opportunity for optimization, followed by two strategies to improve data observability and minimize data duplication. Afterwards, an example is used to demonstrate the advantages of the proposed Array Duplicate Minimization (ADM) strategy, and to present the metrics that allow us to quantify how ADM benefits observability.

4.1 Array Duplication

Figure 4.1: Data Duplication

Data duplication occurs every time a value that lives in a structure of the user’s circuit (memory or register) is stored in the trace buffer; some transient values may only exist in wires and are not duplicated. Until the variable or array entry is changed, the value lives in two places: the user circuit and the trace buffer. This duplication is justified by the fact that, ultimately, the trace buffer will contain a history of the values and not just the current content. In the case of array variables, updates are less frequent than for scalar variables. 
Therefore, array duplicates are more likely to be evicted from the trace buffer before the next update, meaning that they did not contribute to the trace history.

Figure 4.1 is an example of a section of user code with instrumentation inserted at the source level. The original code, along with the instrumentation, is shown in the center of the figure. The boxes on the left side of the figure represent on-chip storage that is part of the user circuit; this storage holds values of user variables and arrays as the code executes. The right side of the figure shows the trace buffer; this is part of the debug instrumentation, which stores a history of variable and array values. The arrows on the right indicate the rollover behaviour of the memory, in which new entries replace the oldest. Blocks with values that are contained in both the user circuit and the trace buffer are shaded; this represents the duplication that we seek to reduce or eliminate in this chapter.

4.2 Merged Instrumentation

In this strategy, we select specific arrays in the user circuit. For all updates to these arrays, we do not add instrumentation code to store values in the trace buffers. Instead, we modify the trace unload function so that when the trace buffer is read, the selected arrays are unloaded along with it. The IDE software can then use this information to determine the latest values for all elements in the array. Intuitively, this strategy will result in fewer trace buffer updates, meaning a longer history of the other user variables can be stored (recall that the trace buffers are configured as circular buffers, and when full, old data is evicted). In addition, fewer trace buffer updates lead to less contention for memory, reducing the impact on circuit latency.

However, this method may make it impossible to reconstruct control flow history. 
Recall from Section 3.3 that the data capture instrumentation strategy subsumes the control flow strategy, assuming there is at least one data update in each control flow construct. By removing writes to the trace buffer, we increase the likelihood that the IDE does not have sufficient information to reconstruct the control flow. Although we can add control flow instrumentation within the affected control flow constructs, this eliminates the advantage of this strategy.

4.3 Old Value Store

In this strategy, we also modify the trace unload function so that when the trace buffer is read, the selected arrays are unloaded. For all updates to the selected arrays, however, we add instrumentation to store the old value of the array element rather than the new value. Although this does not reduce pressure on the trace buffers, it does increase the history available for the elements of the selected arrays, while also providing information that can be used to reconstruct control flow. A longer history for these array values means that an engineer using the IDE has more information to help locate the root cause of observed incorrect behaviour.

Listing 4.1: C with Array Duplicate Minimization
 1 ...//Buffer, index, functions and output declaration
 2 void traceUnload();
 3 volatile ARRAY1TYPE_T array1[ARRAY1SIZE];
 4 int main(){
 5   ...
 6   pushDbg(array1[x],CDBGID_LINE);
 7   array1[x]=a;
 8   ...
 9 }
10 void traceUnload(){
11   traceUnload: for (int i = 0; i < TRACESIZE; i++)
12     traceOut = TRACEBUFFER[i];
13   array1Unload: for (int i = 0; i < ARRAY1SIZE; i++)
14     traceOut = (long)array1[i];
15 }

Listing 4.2: Alternative to Store Old Value
   ... // This method favors Common Subexpression Elimination
 6 int *temp_ptr_array1_803_866 = &array1[x];
 7 pushDbg(*temp_ptr_array1_803_866,CDBGID_LINE);
 8 *temp_ptr_array1_803_866 = a;
   ...

4.3.1 Code Example

Example instrumentation for this technique is shown in Listing 4.1; the array array1 is moved to global scope, and is read by the traceUnload function.
The function call in line 6 of Listing 4.1 is inserted before the assignment statement. This has two benefits: it allows us to reconstruct the control flow (as long as there is at least one access in each control construct) and it extends the effective history of the array data. These modifications are uniquely possible at the source level, where array reference and pointer dereference expressions are explicitly identifiable. An alternative way to call the pushDbg routine, which we found gives better synthesis results, is shown in Listing 4.2.

Note that only the array accesses are affected; writes to scalar variables are instrumented as before. This technique does not need to be applied to every array in the user circuit. The selection of arrays to which it is applied is currently done manually. This selection is important, since applying the technique indiscriminately may raise resource utilization. The investigation of automatic array selection policies is left as future work. Note that this technique could also be applied to individual variables in the user circuit (not just arrays); however, we feel that the overhead of doing so would be too high.

4.3.2 Trace Example

To better understand how ADM extends the effective history of array data, consider the example of Table 4.1. This table shows the history of both array and variable updates for a particular run of the MIPS simulator program from the CHStone benchmark set [38]. Each row of the table shows an assignment statement as well as the line number where it can be found in the source code. The table includes an event number (EX) 
for each statement; these will be used in the following discussion.

Table 4.1: MIPS Benchmark Example History
Event | Line | Data Assignment Statement
E0 | 142 | pc = pc + 4;
E1 | 258 | dmem[DADDR (reg[RS] + (ins & 0xffff))] = reg[RT];
E2 | 141 | ins = imem[IADDR (pc)];
E3 | 142 | pc = pc + 4;
E4 | 241 | reg[RT] = reg[RS] + (ins & 0xffff);
E5 | 141 | ins = imem[IADDR (pc)];
E6 | 142 | pc = pc + 4;
E7 | 258 | dmem[DADDR (reg[RS] + (ins & 0xffff))] = reg[RT];

Table 4.2: Trace Buffer Contents

(a) Data Capture
Variable | ID | Data
pc | 142 | pc+4
dmem[x] | 258 | reg[RT]
ins | 141 | imem[x]
pc | 142 | pc+4
reg[y] | 241 | reg[RS] + ...
ins | 141 | imem[y]
pc | 142 | pc+4
dmem[z] | 258 | reg[RT]

(b) Data Capture with ADM
Variable | ID | Data
pc | 142 | pc+4
dmem[x] | 258 | dmem[i] (old)
ins | 141 | imem[x]
pc | 142 | pc+4
reg[y] | 241 | reg[j] (old)
ins | 141 | imem[y]
pc | 142 | pc+4
dmem[z] | 258 | dmem[k] (old)

Assuming a trace buffer size of eight entries, Table 4.2 shows the contents of the trace buffer after the code is executed for two scenarios: without ADM (4.2a) and with ADM (4.2b). As described previously, each entry in the trace buffer contains an ID and the data update. In the table corresponding to the ADM scenario, three data elements are shaded (the entries marked “old”); these elements are memory entries (rather than scalar variable accesses). In these cases, the old value of the array element is stored in the trace buffer as described above.

4.3.3 Trace Analysis

After running the circuit, when the execution is replayed using the off-line software tool, the amount of information available to the user is limited by what can be obtained through the readback routine.

Figure 4.2 shows a representation of the history of data values available for each variable accessed in the code. The horizontal axis in each diagram represents time, going backwards from the instant when execution of the code was halted. The left-most record (E0) is the oldest record and the right-most record (E7) occurred immediately before the readback routine. The top diagram shows the scenario without ADM. 
In that case, a history of three values is available for pc; these values correspond to the updates labeled E0, E3, and E6 in Table 4.1. Correspondingly, only one value is available for reg[y], from E4, and so on.

The lower diagram shows the scenario with ADM. In this case, a history of two values is available for reg[y]: the value corresponding to update E4 (retrieved from the array itself) as well as an older value (update E9 in the diagram). This older value represents the assignment that was made to reg[y] before the update E4. In this way, the ability to retrieve the old values for array accesses increases the amount of history available, providing more information to the user and possibly allowing them to form a more complete picture of the operation of the circuit.

Figure 4.2: Trace Buffer Analysis. (a) Trace without ADM. (b) Trace with ADM.

4.4 Observability

We expect that an increase in observability, provided by ADM, would have a beneficial impact on the debugging experience. With more observability, the user may be able to replicate the bug and find the root cause of that behaviour using fewer iterations.

Using Figure 4.2, we illustrate how much of the circuit’s execution can be observed with that trace sample. To measure this, we need a unit of time. For our instrumentation, even though timestamps or execution cycles are not stored, the sequence of events can be used as a proxy for time.

4.4.1 History Coverage

In the sequence of events represented in Figure 4.2a, the entry for event E1 was recorded earlier than the entry for E7. To quantify this relation, we use the number of entries as a proxy for the time during which each entry is valid.

After the entry for event E1 was recorded, six more entries were recorded before readback. The entry for event E1 is therefore valid for seven entries, counting its own.

On the other hand, E7 was the last event recorded in the trace buffer before readback, which means zero entries were recorded after it. 
The entry for event E7 is valid for one entry.

Consequently, in Figure 4.2a we can observe variable dmem[x] for 7 entries and variable dmem[z] for 1 entry. We call this a variable’s History Coverage.

Multiple Entries per Variable

Some variables have multiple entries, e.g. variable pc has three entries (E0, E3, and E6). In those cases, an entry is valid until readback or until the next update. The entries for events E0 and E3, for example, are each valid for three entries, while the entry for E6 is valid for two entries. The History Coverage for variable pc is the sum of those values. With this in mind, we obtain column 2 of Table 4.3.

Table 4.3: MIPS History Metrics Example
Variable | History Coverage | History Coverage w/ ADM | History Coverage Increase
pc | 8 | 8 | 0
dmem[x] | 7 | 8 | 1
ins | 6 | 6 | 0
reg[y] | 4 | 8 | 4
dmem[z] | 1 | 8 | 7
Total | 26 | 38 | 12 (46%)

Old Value Entries

In the sequence of events represented in Figure 4.2b, the History Coverage needs to consider the use of old array values. These are values in entries associated with arrays that were selected during the ADM optimization (i.e. E1, E4, and E7); we call these Old Value Entries.

For example, the Old Value Entry provided by event E1 is a value that was assigned to dmem[x] on a previous event (E8). That entry, however, is only valid for one entry. We cannot assume the observability of dmem[x] further back, because the oldest recorded entry is for event E0, meaning any event previous to E0 could have modified dmem[x].

On the other hand, the value read directly from the circuit memory synthesized for dmem[x] corresponds to the value assigned to dmem[x] on event E1. This value is, therefore, valid for the same seven entries as in the previous case without ADM, from event E1 until readback.

The History Coverage for variable dmem[x] is now the sum of the valid entries of the Old Value Entry and of the value read directly from the circuit memory. 
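The counting rules above can be checked with a small sketch that recomputes the totals of Table 4.3 from the event sequence of Table 4.1. It relies on one simplifying observation, which is an interpretation rather than something stated in the thesis: the validity intervals of a variable's entries tile the span from its first recorded update to readback, and with ADM the Old Value Entry extends that span back to the start of the buffer. The variable indices and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical indices for the variables of the MIPS example. */
enum { VAR_PC, VAR_DMEM_X, VAR_INS, VAR_REG_Y, VAR_DMEM_Z, VAR_COUNT };

/* Events E0..E7 of Table 4.1: the variable updated by each trace entry. */
const int example_trace[8] = {
    VAR_PC, VAR_DMEM_X, VAR_INS, VAR_PC,
    VAR_REG_Y, VAR_INS, VAR_PC, VAR_DMEM_Z
};

/* 1 if the variable belongs to an array selected for ADM (dmem[], reg[]). */
const int example_adm[VAR_COUNT] = { 0, 1, 0, 1, 1 };

/* History Coverage of one variable, counted in trace entries. */
int history_coverage(const int *trace, size_t n, int var, int adm_selected)
{
    assert(trace != NULL);
    for (size_t first = 0; first < n; first++) {
        if (trace[first] == var) {
            /* Without ADM: from the first recorded update until readback.
               With ADM: the old value also covers the entries before it. */
            return adm_selected ? (int)n : (int)(n - first);
        }
    }
    return 0;   /* no update recorded inside the window */
}

/* Total History Coverage: the sum over all variables. */
int total_history_coverage(const int *trace, size_t n,
                           const int *adm, int use_adm)
{
    int total = 0;
    for (int var = 0; var < VAR_COUNT; var++) {
        total += history_coverage(trace, n, var, use_adm && adm[var]);
    }
    return total;
}
```

For the eight-entry example, this yields 26 entries without ADM and 38 with ADM, matching the last row of Table 4.3.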
Column 3 of Table 4.3 shows the History Coverage when using ADM with arrays dmem[] and reg[] selected for optimization. Column 4 shows the increase in observability for each variable.

4.4.2 Total History Coverage

The total history coverage at one point in the execution is the sum of the History Coverage of all variables. The last row of Table 4.3 gives this value for the sample trace of the MIPS program: without ADM in column 2, with ADM in column 3, and the overall increase in column 4.

Overall, from Figure 4.2 we can measure a 46% History Coverage increase (from 26 to 38 entries), achieved with the use of ADM selecting arrays dmem[] and reg[]. These metrics are collected automatically during the experiments presented in Chapter 5.

4.5 Summary

Data duplication is caused by the use of the record and replay technique for debugging instrumentation. The trace buffer needs to temporarily store the same values contained in the circuit’s structures. This duplication can be reduced by using ADM.

ADM is an optimization focused on removing data duplication for selected arrays. This is done by adding instrumentation to read directly from arrays and by recording the old value that was stored at the array index whenever it is updated. This allows us to reconstruct the control and data flow of the execution without duplication and with additional information; arrays selected in ADM are fully visible, and old values may provide more observability.

In this chapter we presented two trace examples, with and without the use of ADM. We showed the trace buffer contents from the execution of the MIPS benchmark in the CHStone benchmark suite in those two cases. Finally, this chapter introduced a way to determine how much of the circuit execution can be observed with the content of the trace buffer at one point in the execution. 
This observability metric is used in the experiments for ADM in the following chapter.

Chapter 5

Experiments and Results

In this chapter, we experimentally evaluate the size and performance overhead caused by the insertion of instrumentation at the C level. Experiment 1 is inspired by [52]. As in the previous work, our goal is to determine the overhead required by our instrumentation techniques. Unlike the previous work, our instrumentation includes the trace buffer memories and associated circuitry, while the instrumentation in [52] is aimed at connecting key points in the circuit to EOPs.

Experiment 2 evaluates the overhead added by the instrumentation when all identified statements of interest are instrumented. Control Flow and Data Capture alternatives are explored for latency and resource overheads.

Experiment 3 investigates partial instrumentation and measures the variability that is observed. A variability agent is identified and a new configuration of the HLS tool is evaluated using the same experiment. ADM experimentation follows a similar methodology, again focusing on latency and resource overhead.
Table 5.1: Latency Impact Comparison for ADPCM Single Assignment
Tool | Min | Avg | Max | Stdev
EOP[52]-Vivado HLS | -15.00% | 0.00% | 3.10% | 1.00%
Data Capture-Vivado HLS | -15.41% | -0.09% | 10.39% | 2.40%
Data Capture-LegUp | -0.02% | 1.99% | 20.00% | 4.90%
Control Flow-Vivado HLS | -15.41% | -0.35% | 0.69% | 2.33%
Control Flow-LegUp | 1.08% | 6.03% | 21.25% | 5.85%

Table 5.2: LUT Impact Comparison for ADPCM Single Assignment
Tool | Min | Avg | Max | Stdev
EOP[52]-Vivado HLS | -1.10% | 0.20% | 4.80% | 0.60%
Data Capture-Vivado HLS | -3.51% | -0.09% | 3.17% | 0.69%
Data Capture-LegUp | -4.41% | 0.04% | 16.38% | 1.84%
Control Flow-Vivado HLS | -3.30% | -0.01% | 2.47% | 0.94%
Control Flow-LegUp | -4.27% | 0.38% | 3.69% | 1.36%

Table 5.3: FF Impact Comparison for ADPCM Single Assignment
Tool | Min | Avg | Max | Stdev
EOP[52]-Vivado HLS | -7.40% | 0.20% | 10.70% | 1.20%
Data Capture-Vivado HLS | -6.62% | 0.22% | 8.26% | 1.50%
Data Capture-LegUp | -4.45% | 0.75% | 19.79% | 2.20%
Control Flow-Vivado HLS | -6.25% | 0.40% | 3.55% | 1.27%
Control Flow-LegUp | -4.73% | 0.72% | 4.88% | 1.79%

Table 5.4: LE/LC* Impact Comparison for ADPCM Single Assignment
Tool | Min | Avg | Max | Stdev
EOP[52]-Vivado HLS† | N/A | N/A | N/A | N/A
Data Capture-Vivado HLS | -1.33% | 1.18% | 6.89% | 0.98%
Data Capture-LegUp | -4.43% | 0.54% | 18.91% | 2.04%
Control Flow-Vivado HLS | -1.58% | 1.73% | 3.81% | 1.22%
Control Flow-LegUp | -4.59% | 0.62% | 4.33% | 1.57%
* Logic Elements (LEs) for LegUp and Logic Cells (LCs) for Vivado HLS
† EOP results do not include this metric

5.1 Experiment 1: Single Instrumentation Point

We first consider the impact of instrumenting single points of interest. In this experiment, we cycled through all possible data and control instrumentation points (assignments and control constructs) and instrumented each in isolation. We assumed a 256-entry trace buffer and performed our experiments using the CHStone benchmark suite. 
As in [52], we were unable to compile three of the CHStone benchmarks for Vivado HLS without hand-coded modifications, and so they were omitted from the experiments.

Tables 5.1-5.4 show the results for one benchmark circuit, ADPCM. Experiments using Vivado HLS targeted the Xilinx Artix-7 XC7A35T [72] FPGA device, while those using LegUp targeted the Altera Stratix V 5SGXEA7 [5]. The changes in latency, number of Look-Up Tables (LUTs), number of Flip-Flops (FFs), and number of Logic Elements (LEs) or Logic Cells (LCs) are shown. LEs and LCs are, respectively, the names given by Altera and Xilinx to a unit of configurable fabric that includes LUTs and FFs. For each metric, we tabulate the minimum, maximum, average, and standard deviation over all instrumentation points in the circuit. Results are shown using both LegUp and Vivado HLS, using both Control Flow and Data Capture, as well as results from [52]. A positive latency result represents a slowdown and a negative one a speedup.

Comparing the first two rows of numbers in Tables 5.1-5.3 (EOP and Data Capture-Vivado HLS), we see that the overhead results of our approach match those in [52] closely. The results for the other circuits in the benchmark suite also matched the trends presented in [52]. This suggests that the extra overhead due to the trace buffers and extra control logic for single assignments does not significantly affect the size and performance of the instrumented circuit. Comparing these to the third row of Tables 5.1-5.3 (Data Capture-LegUp), we see that the latency overhead is higher if LegUp rather than Vivado HLS is used. We expect that this is because the version of LegUp we used groups all global variables in a single monolithic memory, meaning it is more likely that trace buffer writes interfere with variable updates in the user circuit. 
On the other hand, Vivado HLS instantiates multiple memory arrays and controllers, meaning it can better tolerate the extra memory updates added with our instrumentation.

Rows four and five of Tables 5.1-5.4 have the results of these experiments for control flow instrumentation. The area and performance overhead results for single control flow statements are similar to those for data capture. Table 5.4 shows the changes in LEs or LCs according to the HLS tool: LegUp or Vivado HLS, respectively. These numbers represent circuit size overhead in one single value.

5.2 Experiment 2: Complete Instrumentation

For our second experiment, we considered the overhead if all control constructs or assignment statements are instrumented simultaneously. The latter would be required in a flow similar to the one found in [27], which provides access to all variables within a window of interest. Such a strategy is important if we wish to provide a debug experience similar to software; in software debuggers such as GDB, access to all variables is provided. However, due to the overhead of hardware debug instrumentation, the common HLS debug workflow will focus on a selection of variables. CFI verification scenarios would also benefit from complete control flow information. In this experiment, we gather separate results for control flow instrumentation (as described in Section 3.2) as well as data capture instrumentation (as described in Section 3.3) using a 256-entry buffer.

Numbers for latency are presented first. Latency is a critical metric for debug instrumentation. A modification that affects latency could also remove the root cause of a system-level bug, or create one, making the instrumentation detrimental for the debugging process. A change in latency can be caused by the addition or removal of states in the state machine or stages in the datapaths, or different assignments of resources during allocation and binding.
These translate to changes to the critical path, the creation of helpful or harmful delay slots, or the causing or prevention of conflicting uses of resources. This type of bug, which may appear or disappear due to the insertion of instrumentation, is classified as a Heisenbug [32]. For source-level instrumentation, maintaining a low impact on latency is a greater challenge because there is less control over resource allocation and operation scheduling; these tasks are left to the HLS tool. During our experiments, all benchmarks executed correctly before and after instrumentation, meaning we did not observe Heisenbugs.

5.2.1 Latency Impact

The latency results for control and data instrumentation are shown in Tables 5.5 and 5.6. Columns 3 and 4 show both the total latency of the resulting instrumented circuit, and the overhead percentage over the original latency. The percentage is used in order to analyze how the original latency affects the results. Here, instrumentation on circuits with short original latencies tends to generate a higher percentage overhead (e.g., DFADD, DFMUL). However, this is also the case for other benchmarks (e.g., MIPS), as will be described below.

When instrumentation is added, latency tends to increase, partly because source-level instrumentation may interfere with optimizations that would otherwise remove certain operations or signals. When optimized, however, signals are prone to be removed and their behavior is not available during an RTL-instrumented debug. A source-instrumented version will preserve all signals directly relatable to source variables, producing a functionally-identical circuit but possibly interfering with an optimization that could have produced a more efficient design.
The impact of this, however, was found to be small.

Control Flow

As in Experiment 1, we see that the latency overhead is more significant when compiling the design with LegUp, due to the increased congestion for access to the single memory caused by the instrumentation. The ADPCM circuit on LegUp shows the most significant impact, which seems to be caused by a more frequent use of global arrays when compared to other circuits. This increases congestion with accesses to the trace buffer.

Table 5.5: LegUp Cycles (Slowdown)

Benchmark    Original    Control             Data
ADPCM        13221       31585 (138.9%)      65308 (394.0%)
AES          9193        10892 (18.5%)       11538 (25.5%)
BF           163925      186862 (14.0%)      188983 (15.3%)
DFADD        673         3166 (370.4%)       1308 (94.4%)
DFDIV        1916        2806 (46.5%)        2674 (39.6%)
DFMUL        224         1113 (396.9%)       983 (338.8%)
DFSIN        59061       88162 (49.3%)       51090 (-13.5%)
GSM          4771        6059 (27.0%)        5710 (19.7%)
MIPS         5035        10892 (116.3%)      7805 (55.0%)
Average      28669       37949 (32.4%)       37267 (30.0%)

Table 5.6: Vivado HLS Cycles (Slowdown)

Benchmark    Original    Control             Data
ADPCM        28880       31633 (9.5%)        41249 (42.8%)
AES          3159        3223 (2.0%)         3623 (14.7%)
BF           107429      107950 (0.5%)       112889 (5.1%)
DFADD        405         519 (28.1%)         433 (6.9%)
DFDIV        1980        2043 (3.2%)         1979 (-0.1%)
DFMUL        215         284 (32.1%)         227 (5.6%)
DFSIN        51635       55547 (7.6%)        50810 (-1.6%)
GSM          3728        3963 (6.3%)         3721 (-0.2%)
MIPS         2541        3251 (27.9%)        5535 (117.8%)
Average      22219       23157 (4.2%)        24496 (5.8%)

Data Capture

Data capture results reveal the same trend as the previous experiments. The impact on latency for LegUp instrumentation is higher.

In more detail, the higher latency overhead found for the MIPS benchmark generated by both HLS tools is caused by multiple concurrent assignments that could not be scheduled efficiently. These assignments belong to decoded signals from the ins (instruction) variable, which is split into multiple variables for decoding.
A constraint of the trace buffer limits the number of inputs to two on each cycle (inferred dual-port RAM); however, the scheduler also uses available slots in the following cycles. This situation is witnessed in other benchmarks but, in this particular case, all available slots are occupied. This affects all loop iterations of the program, causing a significant impact. The ADPCM benchmark using LegUp presents an enormous overhead, which is caused by a combination of the previous factors: multiple assignments are scheduled concurrently and the memory bottleneck is significantly stressed. In this case, ADPCM's higher impact over that on MIPS appears to be due to the former's heavier use of global memory, as seen with the Control Flow instrumentation.

On the other hand, some benchmarks show unchanged latency or even a speedup of up to 20%. Speedups, although beneficial during normal synthesis, are undesirable during debug instrumentation, since they imply a change that could cause Heisenbugs. These results, however, suggest there is another consideration that causes scheduling changes; this is further explored in Section 5.3.

Table 5.7: LegUp Complete Instrumentation Logic Elements (Overhead)

Benchmark    Original    Control            Data
ADPCM        22216       26511 (19.3%)      29962 (34.9%)
AES          26546       27241 (2.6%)       30818 (16.1%)
BF           14778       16568 (12.1%)      17213 (16.5%)
DFADD        12324       14555 (18.1%)      14161 (14.9%)
DFDIV        18809       20209 (7.4%)       18771 (-0.2%)
DFMUL        8901        10502 (18.0%)      8929 (0.3%)
DFSIN        39096       44851 (14.7%)      45639 (16.7%)
GSM          18705       21612 (15.5%)      19662 (5.1%)
MIPS         6541        9189 (40.5%)       8456 (29.3%)
Average      18657       21249 (13.9%)      21512 (15.3%)

5.2.2 Resource Utilization

Area results are presented in terms of Logic Elements in Table 5.7 and in terms of Logic Cells in Table 5.8. Columns 3 and 4 show both the total resources required by the resulting instrumented circuit, and the overhead percentage over the original implementation. In these results, we see the percentages are rather stable.
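The overhead percentages in the tables of this section express the instrumented result relative to the original, uninstrumented circuit. A minimal sketch of that calculation follows, assuming the conventional relative-change formula; the function names `overhead_pct` and `print_cell` are illustrative and are not part of the thesis tooling. With the LegUp ADPCM values from Table 5.7 (22216 LEs originally, 26511 LEs with control flow instrumentation), it yields roughly 19.3%, in line with the tabulated figure.

```c
#include <stdio.h>

/* Relative overhead, in percent, of an instrumented measurement
 * (cycles, LEs, LCs, ...) over the original measurement.
 * A negative result means the instrumented circuit is smaller or faster. */
double overhead_pct(long original, long instrumented)
{
    return 100.0 * (double)(instrumented - original) / (double)original;
}

/* Print one table-style cell in the form "<total> (<overhead>%)". */
void print_cell(long original, long instrumented)
{
    printf("%ld (%.1f%%)\n", instrumented,
           overhead_pct(original, instrumented));
}
```

A negative value, as for DFDIV with data capture in the same table (18809 to 18771 LEs), indicates that the instrumented circuit came out slightly smaller than the original.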
As the number of signals and, therefore, the size of the original circuit increase, the instrumentation also grows, because it must create the access network from the signals to the trace buffer.

Overall, area overhead when using LegUp is lower than that when using Vivado HLS for data capture. However, the total resource utilization for LegUp is significantly higher even though all optimizations (-O3) were enabled. Outlying results, such as ADPCM, are found to be caused mostly by inlining changes; this will be further explored in Section 5.3.

For implementation, a Stratix V 5SGXEA7 [5] was chosen for LegUp and an Artix-7 XC7A35T [72] FPGA for Vivado HLS. The resource utilization numbers are presented in Tables 5.7 and 5.8. An increase can be seen between control and data instrumentation. Since data instrumentation requires connecting all active bits of each variable, the multiplexing logic required to share access to the trace buffer is significantly larger than that for control flow. Control flow instrumentation only needs to multiplex identifiers, which are constants and can be optimized during logic synthesis.

Table 5.8: Vivado HLS Complete Instrumentation Logic Cells (Overhead)

Benchmark    Original    Control            Data
ADPCM        6764        7064 (4.4%)        11305 (67.1%)
AES          2386        3936 (65.0%)       4562 (91.2%)
BF           1645        1642 (-0.2%)       4185 (154.4%)
DFADD        2826        2716 (-3.9%)       3892 (37.7%)
DFDIV        2894        3011 (4.0%)        3707 (28.1%)
DFMUL        1397        1594 (14.1%)       2310 (65.4%)
DFSIN        11549       12035 (4.2%)       15351 (32.9%)
GSM          3871        4193 (8.3%)        5518 (42.5%)
MIPS         1329        1333 (0.3%)        2036 (53.2%)
Average      3851        4169 (8.8%)        5874 (52.5%)

Comparison with Previous Work

In Tables 5.9 and 5.10 we compare the numbers gathered from our experiments with those from the EOP [52] and Goeders' RTL instrumentation [30], respectively.
Goeders' results were obtained using the same Stratix V 5SGXEA7 FPGA as used in our LegUp experiments, while the EOP experiments were performed on a Xilinx Zynq XC7Z020 FPGA [76]; both the Artix-7, used in our experiments, and Zynq devices use 6-input LUTs. Although common debug instrumentation would only target a set of signals, these results with all statements instrumented reveal characteristics of the different approaches.

Table 5.9: Vivado HLS EOP vs Data Capture Overhead

             EOP [52]*          Data Capture
Benchmark    LUTs     FFs       LUTs     FFs
ADPCM        25%      2%        76%      37%
AES          22%      2%        119%     41%
BF           50%      25%       413%     76%
DFADD        70%      10%       43%      17%
DFDIV        27%      2%        42%      10%
SHA          27%      5%        55%      19%
Average      37%      8%        58%      25%

* EOP results do not include a trace buffer or a buffer access network

EOP. EOP results taken from [52] are only given as LUT and FF overhead percentages for a smaller subset of the CHStone benchmark suite. The numbers for Data Capture in Table 5.9 show the impact caused by the multiplexing logic needed to share access to the trace buffer. The trace buffer itself uses on-chip RAM. The numbers from [52] only represent the impact to the original circuit and do not include the multiplexing logic of the buffer access network. The LUT overhead in our experiments increases significantly as the number of variables increases, due to the need for more multiplexing logic. The increase in FF numbers may be due to an increase in the number of states and impact on optimizations. The EOP experiments in [52] or [54] do not present latency impact for all statements instrumented.

Goeders'. The results of using Goeders' HLS Debugger are compared with our results using LegUp in Table 5.10. Multiple factors make a direct comparison challenging. In terms of the modules included, Goeders' results are affected by the inclusion of the RS232 communication manager and the scheduling logic in the trace recorder. Our results do not include either, as they focus on the impact caused to the original circuit. Adding the estimated LEs necessary to include the communication manager, which is similar between both approaches, increases the average impact to approximately 21%. However, Goeders' trace scheduler, which allows longer trace recordings in the RTL approach [27], represents a large part of the overhead and has no analogy in our approach.

Table 5.10: LegUp HLS Debugger vs Data Capture Logic Elements (Overhead)

             Goeders' [30]                This thesis*
Benchmark    Original    Debug            Original    Data Capture
ADPCM        17900       22244 (24%)      22216       29962 (35%)
AES          16966       18372 (8%)       26546       30818 (16%)
BF           10448       13692 (31%)      14778       17213 (16%)
DFADD        7261        10473 (44%)      12324       14161 (15%)
DFDIV        12233       16018 (31%)      18809       18771 (0%)
DFMUL        4171        6414 (54%)       8901        8929 (0%)
DFSIN        24982       31928 (28%)      39096       45639 (17%)
GSM          9096        11415 (25%)      18705       19662 (5%)
JPEG         35691       38305 (7%)       48354       69110 (43%)
MIPS         3162        4380 (39%)       6541        8456 (29%)
Average      14191       17324 (22%)      18657       21512 (15%)

* These results do not include the RS232 communication manager.

The comparison between these two approaches is made harder by the signal selection stage of each process. In the source-level approach, variables selected in the original source code are instrumented regardless of the resulting circuit. An RTL approach, however, instruments an optimized description of the logic, starting with high-level optimization passes, and followed by logic synthesis optimizations.
High-level optimization passes like dead code elimination (DCE) apply to both cases, where "unreachable" instructions are eliminated, including the instrumentation in the source-level approach. However, optimizations that apply constant folding techniques can result in (a) constant inputs to the trace buffer in a source-level approach, and (b) signals that are not available in hardware and ideally inferred during synthesis.

Therefore, a direct comparison would require listing these cases and applying source-level instrumentation only to the variables for which RTL instrumentation was inserted. This experiment is out of the scope of this work but is proposed as future work, along with the integration of source-level optimizations as discussed in Section 6.1.

Overall, RTL instrumentation for LegUp often provides observability with lower impact to the original circuit than source-level instrumentation. Additional resources can then be used to optimize this architecture to allow longer trace records. This, however, is the result of tool-specific design and optimizations. Goeders' instrumentation relies heavily on the memory architecture used in LegUp.

5.3 Experiment 3: Partial Instrumentation

[Figure 5.1: Latency Histogram for Variation of ADPCM]

The data capture overhead numbers in Tables 5.5 and 5.6 are large; this implies that it may be appropriate to consider instrumenting only a subset of the available signals. Intuitively, and as shown by [52], cumulatively increasing the number of statements that are instrumented increases the cost of that instrumentation.
However, the total cost of instrumenting multiple assignments may be less than multiplying the number of assignments by the cost of instrumenting each single assignment individually, due to resource usage overlap.

For further experimentation, we created a set of 1000 different instances of the ADPCM benchmark, each instance with 10 random statements instrumented, to observe the variability of the results depending on what assignments are chosen. The sample set is a small subset of all possible combinations ($\binom{119}{10}$); however, the distribution trend was validated with different set sizes and random selection seeds. Each instance was compiled using Vivado HLS, and hardware simulation latency results for all 1000 instances are shown in the histogram of Figure 5.1. The left vertical axis shows the count of instances that performed with a latency matching the ranges in the horizontal axis. The right vertical axis is the cumulative percentage. From these results, 60.4% of the instances require no more cycles than the original design (28880 cycles). Average latency is 0.6% lower than the original, and standard deviation is 4.4%. Average resource utilization is 1.7% higher, with a standard deviation of 152 LCs (2.2%).

[Figure 5.2: Latency Histogram for Variation of ADPCM All-Inlined]

The observed latency changes were caused, in part, by the considerations mentioned above. However, changes during the inlining optimization of the HLS tool were found to be more significant. The HLS tools can be configured to use a different inline threshold or to inline every function in the program. An experiment using the same set of 1000 instances of the ADPCM benchmark with all functions inlined produced a distribution of latency results with less variation.
For the data in Figure 5.1, the standard deviation was 1263 cycles (4.4%), and the average 28694 cycles; the always-inlined version of these instances showed a standard deviation of 604 cycles (2.1%), with an average of 28180 cycles (2.4% lower). This lower and less variable latency, however, requires higher resource utilization [41]. Average resource utilization is 4.6% higher, with a standard deviation of 328 LCs (5.1%).

5.4 ADM Evaluation

In this section we evaluate the ADM strategies. To understand their potential, we first measured how often data is duplicated between memory arrays and the trace buffer. Results for the following experiments on each benchmark program are presented in Table 5.11. We made these measurements using a subset of the CHStone benchmark suite, excluding the arithmetic domain programs (DFADD, DFDIV, DFMUL, DFSIN) because of their algorithmic and structural simplicity: they read from ROM, compute, and compare, which does not require arrays for computation.

Table 5.11: Duplication Metrics

Benchmark    Average Memory Entries    Average Duplicates    Coverage Increase
ADPCM        75.09 (29.3%)             41.73 (16.3%)         43.76%
AES          112.98 (44.1%)            48.36 (18.9%)         90.37%
BF           122.61 (47.9%)            55.41 (21.6%)         91.31%
GSM          76.03 (29.7%)             69.23 (27.0%)         59.52%
JPEG         76.59 (29.9%)             76.14 (29.7%)         95.44%
MIPS         23.65 (9.2%)              16.13 (6.3%)          65.41%
SHA          80.36 (31.4%)             80.36 (31.4%)         124.47%
Average      81.04 (31.7%)             55.34 (21.6%)         84.28%

On average, from the second column of Table 5.11, we found that 31.7% of the trace buffer entries are from user memories. These would allow the Merged Instrumentation technique from Section 4.2 to use up to 31.7% more entries for scalar values, depending on which arrays are selected. Moreover, we found that 21.6% of the trace buffer entries are from user memory entries that are not evicted (multiple assignments to the same entry in an array are counted as one).
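One plausible way to take such a measurement offline is sketched below. This is an interpretation, not the procedure used in the thesis: it assumes a captured trace window is available as an array of records, that each record is flagged as either a scalar assignment or an array write, and that a "duplicate" is an array write whose location is not overwritten later in the same window (so its value is still held in user memory). The type and function names (`trace_entry_t`, `dup_stats_t`, `count_duplicates`) are hypothetical.

```c
#include <stddef.h>

/* Hypothetical decoded trace record (layout assumed for illustration). */
typedef struct {
    int is_array_write; /* nonzero if the entry records a write to a user array */
    int array_id;       /* which user array was written */
    int index;          /* which element of that array was written */
} trace_entry_t;

typedef struct {
    size_t memory_entries; /* trace entries that came from user memories */
    size_t duplicates;     /* of those, entries never overwritten later */
} dup_stats_t;

/* Scan a trace window of n entries (oldest first). Repeated writes to the
 * same array element contribute only one duplicate: the most recent one. */
dup_stats_t count_duplicates(const trace_entry_t *trace, size_t n)
{
    dup_stats_t stats = {0, 0};
    for (size_t i = 0; i < n; i++) {
        if (!trace[i].is_array_write)
            continue;
        stats.memory_entries++;

        int overwritten = 0;
        for (size_t j = i + 1; j < n; j++) {
            if (trace[j].is_array_write &&
                trace[j].array_id == trace[i].array_id &&
                trace[j].index == trace[i].index) {
                overwritten = 1;
                break;
            }
        }
        if (!overwritten)
            stats.duplicates++;
    }
    return stats;
}
```

Dividing each count by the trace buffer depth (256 entries in these experiments) would give percentages of the kind reported in Table 5.11. The quadratic scan is acceptable at this buffer size; a larger window would call for a hash set of written locations.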
Therefore, a Data Capture instrumentation with Old Value Store ADM has the potential to reduce the amount of data duplication by 21.6%. Appendix C shows the array access profiles of the CHStone benchmark programs obtained during the implementation of the previous experiment.

5.4.1 Observability

We performed an experiment with the observability metrics proposed in Section 4.4 to evaluate the average increase in history coverage we can obtain from the Old Value Store ADM technique.

We gathered the results in the third column of Table 5.11 based on our experiments on the CHStone benchmarks. This shows that an average of 84.2% more history coverage can be stored in the trace buffer using this technique. This demonstrates the benefit of implementing Old Value Store ADM, suggesting that HLS-generated circuits contain enough identifiable arrays, such that storing the value being evicted from those arrays can result in a significant contribution to circuit execution observability.

5.4.2 Resource Utilization

Table 5.12: ADM Resource Utilization for ADPCM Instances

Instance               BRAM    FFs (Overhead)    LUTs (Overhead)
(a) Original           7       5135              9200
(b) Data Capture       9       7448 (45.0%)      14203 (54.4%)
(c) ADM array100       9       7464 (45.4%)      14284 (55.3%)
(d) ADM array24        9       7460 (45.3%)      14288 (55.3%)
(e) ADM (both)         9       7476 (45.6%)      14307 (55.5%)
(f) ADM (13 arrays)    9       7579 (47.6%)      14616 (58.9%)

To analyze the impact on area overhead of ADM, we measured the resource utilization after place and route of (a) the original ADPCM benchmark, (b) the circuit instrumented with data capture instrumentation, (c) the same instrumentation with our data duplication minimization approach applied to one array of size 100, (d) to one different array of size 24, (e) to both arrays, and (f) to all data accesses in the circuit (13 arrays were identified). Table 5.12 shows that it is increasingly less expensive to add multiple arrays, and that memory utilization is kept constant, as expected.
Moreover, for the full instrumentation of the ADPCM circuit (f), adding a further 2.6% of FFs and 4.5% of LUTs (relative to the original circuit) on top of (b) allowed 16.3% duplication to be removed, with 43.76% more history coverage.

5.5 Summary

This chapter presented the methodology and results of various experiments to evaluate the impact caused by the insertion of debug instrumentation at the source level. In our first set of experiments we instrumented single points of interest. This showed that, on average, instrumenting these single points causes low impact on the area and speed of the original designs. This result, also observed in previous work on EOPs, is extended here to include the insertion of a trace buffer and related circuitry. The second experiment, instrumenting all points of interest, showed how the impact on the original circuit increases, suggesting that source-level debugging using source-level instrumentation may benefit from prior selection of regions of interest in the source code to avoid impractical overheads. The third experiment identified the causes of performance overhead when partial instrumentation is applied. Occasionally, states are added to the circuit FSMs to schedule trace buffer writes. However, results would vary significantly due to changes in the resulting code after inlining optimizations.

We also presented experiments to evaluate our ADM approach. The use of ADM in our instrumentation allows us to extend circuit observability with low overhead. Using our metric for History Coverage we found that ADM can extend circuit observability by an average of 84%. This is the result of being able to observe the values of select arrays before the time they were evicted.

Chapter 6

Conclusion

High-level synthesis tools promise increased productivity for designers, allowing them to create compute accelerators more rapidly, and test them in the application environment from an early stage of development.
This promise, however, will only be realized if the compilers are accompanied by an entire ecosystem including a debugging framework that allows designers to debug their designs in the context of the original C code, while running in silicon.

In-system source-level debugging tools for HLS rely on instrumentation to record the behaviour of the design as it executes, for later interrogation by an off-line software debugger. We showed a source-level instrumentation technique that includes the trace buffer and related circuitry. This instrumentation is inserted by automatic source-to-source transformation tools to record both the control flow and data assignments into trace buffers. The impact on circuit size varies from 15.3% to 52.5% and the impact on circuit speed ranges from 5.8% to 30% when all assignment statements are instrumented. The impact on circuit size for single point instrumentation ranges from -4.6% to 18.9% and the impact on circuit speed ranges from -15.4% to 21.3%. The inlining optimization is identified as a source of variability and evaluated.

During these transformations, the tools also create a database that maps source code statements to unique identifiers. Matching the trace captured after execution with the offline database allows mapping the behaviour of the circuit back to the original C code.

We also showed how our design can be optimized to make better use of resources by eliminating up to 31.7% of the data duplication between circuit memories and trace buffers. This helps extend visibility into the execution history, resulting in 84.2% more coverage, which we anticipate will translate into a faster and easier debugging experience, requiring fewer executions to find the root cause of observed incorrect behaviour.

6.1 Limitations

When instrumenting high-level code there are changes that can affect lower level optimizations. The nature of the statements being inserted can especially affect loop optimizations.
A noticeable effect is found on loop-invariant code; this consists of variables or expressions inside a loop body that are unaffected by all loop iterations and can, therefore, be moved outside of the loop body. This is called loop-invariant code motion, hoisting or scalar promotion. Adding a function call to pushDbg() inside the loop with that expression as an argument creates a false dependency that invalidates the optimization.

In the scope of this work, the impact of source code manipulation on loop-invariant code motion was not considered. However, there is a considerable amount of work on source-level optimizations, including the use of the ROSE compiler framework for those specific optimizations [40], as well as others, mentioned in Section 2.5. Therefore, these optimizations can be incorporated as a step previous to the instrumentation insertion, either with automatic code transformations or using programmer feedback to indicate the loop-invariant expression.

An addition that can be suggested for a better debugging experience is the inclusion of execution cycle numbers with every entry in the trace buffer. Although this is not necessary for a behavioral representation of the code, this is useful for performance analysis and more advanced debugging. RTL instrumentation techniques [12, 27] implement custom counters in logic for this purpose. However, when using source-level instrumentation, this would have to rely on tool-specific features to incorporate custom Verilog; this approach has been used for timing analysis assertions [20]. A more suitable alternative is the inclusion of the <time.h> library as an API for HLS tools. For the HLS tool user, this would mean being able to call functions like clock() and time() and meaningful macros such as CLOCKS_PER_SEC in their programs.
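As a software-only illustration of this idea, the sketch below stamps each trace entry with the standard C `clock()` value. It is an assumption-laden sketch, not the thesis instrumentation: the entry layout, the buffer, and the function names (`push_dbg_ts`, `elapsed_seconds`) are hypothetical, and no claim is made that any current HLS tool can synthesize `<time.h>` calls. Only `clock()`, `clock_t`, and `CLOCKS_PER_SEC` come from the C standard library.

```c
#include <stdint.h>
#include <time.h>

#define TRACE_DEPTH 256 /* matches the 256-entry buffer used in the experiments */

/* Hypothetical timestamped trace entry. */
typedef struct {
    uint32_t id;    /* statement identifier from the offline database */
    uint32_t value; /* captured value */
    clock_t stamp;  /* processor time at capture, as returned by clock() */
} ts_entry_t;

static ts_entry_t trace_buf[TRACE_DEPTH];
static unsigned trace_head = 0; /* total number of entries pushed so far */

/* Record one entry in a circular buffer, overwriting the oldest when full. */
void push_dbg_ts(uint32_t id, uint32_t value)
{
    ts_entry_t *entry = &trace_buf[trace_head % TRACE_DEPTH];
    entry->id = id;
    entry->value = value;
    entry->stamp = clock();
    trace_head++;
}

/* Convert the distance between two stamps to seconds. */
double elapsed_seconds(clock_t start, clock_t end)
{
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

In a hardware implementation the stamp would more naturally be a cycle counter; using `clock_t` here simply keeps the source portable between software simulation and any tool that chose to support the API.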
For debugging purposes, using these same functions wouldallow adding timestamps to trace buffer entries.6.2 Future WorkAs mentioned above, the incorporation of timestamps into the trace bufferwould be beneficial to expand the capabilities of this approach towards tim-ing analysis and performance debugging. Working with open source HLS846.2. Future Worktools to add API support of the time.h library could be appealing for devel-opers and suggest a standard for proprietary HLS tools to follow. The workin [20] lays an approach with the basic requirements.Future work specific to ADM involves developing heuristics to selectmemories for minimization, as well as quantifying and comparing the per-formance (e.g. lines of code in a replay window) of this, other types ofoptimizations, and other types of instrumentation.Work on EOPs [54] made a compatibility study of other HLS tools,without the need for empirical testing. This was accomplished by identi-fying the requirements to allow such transformations, and found that only1 (Shang [80]) out of 12 tools was incompatible; LegUp was also deemedincompatible for not being able to schedule concurrent I/O, but this limita-tion, although unnecessary for the enhanced instrumentation approach, canbe overcome through the modifications explained in Appendix A.With the upcoming release of Altera’s A++ HLS framework [1], empir-ical testing of RTL generation of source-level instrumented circuits on thistool will be necessary.85Bibliography[1] Altera. Altera Announces New Spectra-Q Engine for Industry-leading Quartus II Software to Accelerate FPGA and SoCDesign. http://newsroom.altera.com/press-releases/nr-altera-spectraq-quartusii-software-fpga-soc.htm, May2015.[2] Altera. Quartus Prime Pro Edition Handbook, volume 3, chapter 9:Design Debugging Using the SignalTap II Logic Analyzer. November2015.[3] Altera. Altera SDK for Opencl. 
https://www.altera.com/products/design-software/embedded-software-developers/opencl/overview.html, 2016.[4] Altera. Stratix 10 FPGA and SoC - Overview. https://www.altera.com/products/fpga/stratix-series/stratix-10/overview.html,2016.[5] Altera. Stratix V FPGAs - Overview. https://www.altera.com/products/fpga/stratix-series/stratix-v/overview.html, 2016.[6] H. Angepat, G. Eads, C. Craik, and D. Chiou. Nifd: Non-intrusive fpga86Bibliographydebugger – debugging fpga ’threads’ for rapid hw/sw systems prototyp-ing. In 2010 International Conference on Field Programmable Logic andApplications, pages 356–359, Aug 2010.[7] ARM. big.LITTLE Technology: The Future of Mobile.https://www.arm.com/files/pdf/big_LITTLE_Technology_the_Futue_of_Mobile.pdf, 2016.[8] David Bacon, Rodric Rabbah, and Sunil Shukla. FPGA Programmingfor the Masses. Queue, 11(2):40:40–40:52, February 2013.[9] Mohamed Ben Hammouda, Philippe Coussy, and Loic Lagadec. ADesign Approach to Automatically Generate On-chip Monitors DuringHigh-level Synthesis of Hardware Accelerator. In Proceedings of the24th Edition of the Great Lakes Symposium on VLSI, GLSVLSI ’14,pages 273–278. ACM, 2014.[10] John Bodily, Brent Nelson, Zhaoyi Wei, Dah-Jye Lee, and Jeff Chase. AComparison Study on Implementing Optical Flow and Digital Commu-nications on FPGAs and GPUs. ACM Trans. Reconfigurable Technol.Syst., 3(2):6:1–6:22, May 2010.[11] S. Byma, J. G. Steffan, H. Bannazadeh, A. L. Garcia, and P. Chow.FPGAs in the Cloud: Booting Virtualized Hardware Acceleratorswith OpenStack. In Field-Programmable Custom Computing Machines(FCCM), 2014 IEEE 22nd Annual International Symposium on, pages109–116, May 2014.87Bibliography[12] N. Calagar, S.D. Brown, and J.H. Anderson. Source-level Debuggingfor FPGA High-Level Synthesis. In Field Programmable Logic and Ap-plications (FPL), 2014 24th International Conference on, Sept 2014.[13] Keith A. 
Campbell, David Lin, Subhasish Mitra, and Deming Chen.Hybrid Quick Error Detection (H-QED): Accelerator Validation andDebug Using High-level Synthesis Principles. In Proceedings of the 52ndAnnual Design Automation Conference, DAC ’15, pages 53:1–53:6, NewYork, NY, USA, 2015. ACM.[14] A. Canis, S.D. Brown, and J.H. Anderson. Modulo SDC scheduling withrecurrence minimization in high-level synthesis. In Field ProgrammableLogic and Applications (FPL), 2014 24th International Conference on,Sept 2014.[15] Andrew Canis, Jongsok Choi, et al. LegUp: An Open-source High-levelSynthesis Tool for FPGA-based Processor/Accelerator Systems. ACMTrans. Embed. Comput. Syst., 13(2):24:1–24:27, September 2013.[16] Jongsok Choi, S. Brown, and J. Anderson. From software threadsto parallel hardware in high-level synthesis for FPGAs. In Field-Programmable Technology (FPT), 2013 International Conference on,pages 270–277, Dec 2013.[17] J. Cong and Zhiru Zhang. An efficient and versatile scheduling algo-rithm based on SDC formulation. In Design Automation Conference,2006 43rd ACM/IEEE, pages 433–438, 2006.88Bibliography[18] Intel Corporation. Intel Completes Acquisition of Altera. http://www.intc.com/releasedetail.cfm?ReleaseID=948014, December 2015.[19] Philippe Coussy, Cyrille Chavet, Pierre Bomel, Dominique Heller, EricSenn, and Eric Martin. GAUT: A High-Level Synthesis Tool for DSPApplications, pages 147–169. Springer Netherlands, Dordrecht, 2008.[20] John Curreri, Greg Stitt, and Alan D. George. High-level Synthesis ofIn-circuit Assertions for Verification, Debugging, and Timing Analysis.Int. J. Reconfig. Comput., 2011:1:1–1:17, January 2011.[21] Luka Daoud, Dawid Zydek, and Henry Selvaraj. Advances in Sys-tems Science: Proceedings of the International Conference on SystemsScience 2013 (ICSS 2013), chapter A Survey of High Level SynthesisLanguages, Tools, and Compilers for Reconfigurable High PerformanceComputing, pages 483–492. Springer International Publishing, 2014.[22] Eclipse CDT. 
Eclipse CDT (C/C++ Development Tooling). http://www.eclipse.org/cdt/, 2016.[23] Chris Edwards. Growing pains for deep learning. Commun. ACM,58(7):14–16, June 2015.[24] F. Eslami and S. J. E. Wilton. An adaptive virtual overlay for fasttrigger insertion for FPGA debug. In Field Programmable Technology(FPT), 2015 International Conference on, pages 32–39, Dec 2015.[25] Zhe Fan, Feng Qiu, A. Kaufman, and S. Yoakum-Stover. Gpu cluster89Bibliographyfor high performance computing. In Supercomputing, 2004. Proceedingsof the ACM/IEEE SC2004 Conference, pages 47–47, Nov 2004.[26] Philip Garcia, Katherine Compton, Michael Schulte, Emily Blem, andWenyin Fu. An Overview of Reconfigurable Hardware in EmbeddedSystems. EURASIP J. Embedded Syst., 2006(1), January 2006.[27] J. Goeders and S. Wilton. Signal-Tracing Techniques for In-SystemFPGA Debugging of High-Level Synthesis Circuits. IEEE Transac-tions on Computer-Aided Design of Integrated Circuits and Systems,PP(99):1–1, 2016.[28] J. Goeders and S. J. E. Wilton. Using round-robin tracepoints to debugmultithreaded hls circuits on fpgas. In Field Programmable Technology(FPT), 2015 International Conference on, pages 40–47, Dec 2015.[29] J. Goeders and S.J.E. Wilton. Effective FPGA debug for high-level syn-thesis generated circuits. In Field Programmable Logic and Applications(FPL), 2014 24th International Conference on, Sept 2014.[30] J. Goeders and S.J.E. Wilton. Using Dynamic Signal-Tracing to DebugCompiler-Optimized HLS Circuits on FPGAs. In Field-ProgrammableCustom Computing Machines (FCCM), 2015 IEEE 23rd Annual Inter-national Symposium on, pages 127–134, May 2015.[31] Paul Graham, Brent Nelson, and Brad Hutchings. Instrumenting Bit-streams for Debugging FPGA Circuits. In Proceedings of the the 9thAnnual IEEE Symposium on Field-Programmable Custom ComputingMachines, FCCM ’01, pages 41–50, 2001.90Bibliography[32] Michael Grottke and Kishor S Trivedi. A classification of soft-ware faults. 
Journal of Reliability Engineering Association of Japan, 27(7):425–438, January 2005.
[33] Zhi Guo, Betul Buyukkurt, Walid Najjar, and Kees Vissers. Optimized Generation of Data-Path from C Codes for FPGAs. In Proceedings of the Conference on Design, Automation and Test in Europe - Volume 1, DATE ’05, pages 112–117, March 2005.
[34] Sumit Gupta, Manev Luthra, Nikil Dutt, Rajesh Gupta, and Alex Nicolau. Hardware and Interface Synthesis of FPGA Blocks Using Parallelizing Code Transformations. In Proceedings of the International Conference on Parallel and Distributed Computing and Systems, November 2003.
[35] M. Ben Hammouda, P. Coussy, and L. Lagadec. A Design Approach to Automatically Synthesize ANSI-C Assertions During High-Level Synthesis of Hardware Accelerators. In 2014 IEEE International Symposium on Circuits and Systems (ISCAS), pages 165–168, June 2014.
[36] Frank Hannig, Dirk Koch, and Daniel Ziener, editors. Proceedings of the First International Workshop on FPGAs for Software Programmers (FSP 2014), August 2014.
[37] Frank Hannig, Dirk Koch, and Daniel Ziener, editors. Proceedings of the Second International Workshop on FPGAs for Software Programmers (FSP 2015), August 2015.
[38] Yuko Hara, Hiroyuki Tomiyama, Shinya Honda, and Hiroaki Takada. Proposal and Quantitative Analysis of the CHStone Benchmark Program Suite for Practical C-based High-level Synthesis. Journal of Information Processing, 17:242–254, 2009.
[39] K.S. Hemmert, J.L. Tripp, B.L. Hutchings, and P.A. Jackson. Source level debugger for the Sea Cucumber synthesizing compiler. In Field-Programmable Custom Computing Machines, 2003. FCCM 2003. 11th Annual IEEE Symposium on, pages 228–237, April 2003.
[40] Michihiro Horie, Mikio Takeuchi, Kiyokuni Kawachiya, and David Grove. Optimization of X10 Programs with ROSE Compiler Infrastructure.
In Proceedings of the ACM SIGPLAN Workshop on X10, X10 2015, pages 19–24, 2015.
[41] Qijing Huang, Ruolong Lian, Andrew Canis, Jongsok Choi, Ryan Xi, Nazanin Calagar, Stephen Brown, and Jason Anderson. The Effect of Compiler Optimizations on High-Level Synthesis-Generated Hardware. ACM Trans. Reconfigurable Technol. Syst., 8(3):14:1–14:26, May 2015.
[42] E. Hung and S. J. E. Wilton. Speculative Debug Insertion for FPGAs. In 2011 21st International Conference on Field Programmable Logic and Applications, pages 524–531, Sept 2011.
[43] Eddie Hung and Steven J. E. Wilton. Accelerating FPGA Debug: Increasing Visibility Using a Runtime Reconfigurable Observation and Triggering Network. ACM Trans. Des. Autom. Electron. Syst., 19(2):14:1–14:23, March 2014.
[44] Impulse Accelerated Technologies. CoDeveloper from Impulse Accelerated Technologies. http://www.impulseaccelerated.com/ReleaseFiles/Help/iAppMan.pdf, 2015.
[45] Chris Lattner and Vikram Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, CGO ’04, pages 75–86, 2004.
[46] Chunhua Liao, Daniel J. Quinlan, Richard Vuduc, and Thomas Panas. Effective source-to-source outlining to support whole program empirical optimization. In Guang R. Gao, Lori L. Pollock, John Cavazos, and Xiaoming Li, editors, Languages and Compilers for Parallel Computing: 22nd International Workshop, LCPC 2009, Newark, DE, USA, October 8-10, 2009, Revised Selected Papers, pages 308–322. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.
[47] J. Matai, D. Lee, A. Althoff, and R. Kastner. Composable, parameterizable templates for high-level synthesis. In 2016 Design, Automation Test in Europe Conference Exhibition (DATE), pages 744–749, March 2016.
[48] J. Matai, D. Richmond, D. Lee, and R. Kastner. Enabling FPGAs for the Masses.
In Proceedings of the First International Workshop on FPGAs for Software Programmers, pages 15–20, Aug 2014.
[49] Wim Meeus, Kristof Van Beeck, Toon Goedemé, Jan Meel, and Dirk Stroobandt. An overview of today’s high-level synthesis tools. Des. Autom. Embedded Syst., 16(3):31–51, September 2012.
[50] Mentor Graphics. Certus Silicon Debug. https://www.mentor.com/products/fv/certus-silicon-debug, 2016.
[51] J. S. Monson and B. Hutchings. New approaches for in-system debug of behaviorally-synthesized FPGA circuits. In Field Programmable Logic and Applications (FPL), 2014 24th International Conference on, Sept 2014.
[52] J. S. Monson and B. Hutchings. Using Source-to-Source Compilation to Instrument Circuits for Debug with High-Level Synthesis. In Field Programmable Technology (FPT), 2015 International Conference on, pages 48–55, Dec 2015.
[53] J. S. Monson and Brad L. Hutchings. Using Source-Level Transformations to Improve High-Level Synthesis Debug and Validation on FPGAs. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’15, pages 5–8, 2015.
[54] Joshua Scott Monson. Using Source-to-Source Transformations to Add Debug Observability to HLS-Synthesized Circuits. PhD thesis, Brigham Young University, March 2016.
[55] Roger Moussalli, Ildar Absalyamov, Marcos R. Vieira, Walid Najjar, and Vassilis J. Tsotras. High performance FPGA and GPU complex pattern matching over spatio-temporal streams. GeoInformatica, 19(2):405–434, 2015.
[56] R. Nane, V. M. Sima, C. Pilato, J. Choi, B. Fort, A. Canis, Y. T. Chen, H. Hsiao, S. Brown, F. Ferrandi, J. Anderson, and K. Bertels. A Survey and Evaluation of FPGA High-Level Synthesis Tools. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, PP(99), 2016.
[57] R. Nikhil. Bluespec System Verilog: efficient, correct RTL from high level specifications. In Formal Methods and Models for Co-Design, 2004. MEMOCODE ’04. Proceedings.
Second ACM and IEEE International Conference on, pages 69–70, June 2004.
[58] NVIDIA. NVIDIA® Tegra® X1 - NVIDIA’s New Mobile Superchip. http://international.download.nvidia.com/pdf/tegra/Tegra-X1-whitepaper-v1.0.pdf, January 2015.
[59] University of Toronto. High-level synthesis with LegUp. http://legup.eecg.toronto.edu/, 2016.
[60] A. Putnam, A.M. Caulfield, E.S. Chung, D. Chiou, K. Constantinides, J. Demme, et al. A reconfigurable fabric for accelerating large-scale datacenter services. In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on, pages 13–24, June 2014.
[61] Dan Quinlan and Chunhua Liao. The ROSE Source-to-Source Compiler Infrastructure. In Cetus Users and Compiler Infrastructure Workshop, in conjunction with PACT 2011, October 2011.
[62] Dan Quinlan and Bobby Philip. ROSETTA: The Compile-Time Recognition Of Object-Oriented Library Abstractions And Their Use Within Applications. In Proceedings of the PDPTA’2001 Conference, 2001.
[63] A. Ribon, B. Le Gal, C. Jégo, and D. Dallet. Assertion Support in High-Level Synthesis Design Flow. In Specification and Design Languages (FDL), 2011 Forum on, pages 1–8, Sept 2011.
[64] Mike Santarini. Xilinx ships industry’s first 20-nm all programmable devices. Xcell Journal, 86:9–11, 2014. First Quarter 2014.
[65] Semico. How an FPGA Approach to Complex System Design Can Improve Profitability: Real Case Studies. Case Studies CC303-12, Semico Research Corporation, Apr 2012.
[66] Scott Sirowy and Alessandro Forin. Where’s the Beef? Why FPGAs Are So Fast. Technical Report MSR-TR-2008-130, Microsoft Research, September 2008.
[67] Calypto Design Systems. Catapult C synthesis. http://calypto.agranderdesign.com/catapult_c_synthesis.php, 2016.
[68] J. Villarreal, A. Park, W. Najjar, and R. Halstead. Designing Modular Hardware Accelerators in C with ROCCC 2.0. In Field-Programmable Custom Computing Machines (FCCM), 2010 18th IEEE Annual International Symposium on, pages 127–134, May 2010.
[69] K.
Wakabayashi and T. Okamoto. C-based SoC design flow and EDA tools: An ASIC and system vendor perspective. Trans. Comp.-Aided Des. Integ. Cir. Sys., 19(12):1507–1522, November 2006.
[70] Robert A. Walker and Raul Camposano, editors. A Survey of High-Level Synthesis Systems. Springer Science Business Media, 1991.
[71] Felix Winterstein, Samuel Bayliss, and George A. Constantinides. Separation Logic-Assisted Code Transformations for Efficient High-Level Synthesis. In Proceedings of the 2014 IEEE 22nd International Symposium on Field-Programmable Custom Computing Machines, FCCM ’14, pages 1–8, 2014.
[72] Xilinx. 7 Series FPGAs overview. http://www.xilinx.com/products/silicon-devices/fpga/artix-7.html, May 2015.
[73] Xilinx. Integrated Logic Analyzer v6.1: LogiCORE IP Product Guide. http://www.xilinx.com/support/documentation/ip_documentation/ila/v6_1/pg172-ila.pdf, April 2016.
[74] Xilinx. SDAccel Development Environment. http://www.xilinx.com/products/design-tools/software-zone/sdaccel.html, 2016.
[75] Xilinx. Vivado Design Suite User Guide: High-Level Synthesis. http://www.xilinx.com/support/documentation/sw_manuals/xilinx2016_2/ug902-vivado-high-level-synthesis.pdf, June 2016.
[76] Xilinx. Zynq-7000 All programmable SoC. http://www.xilinx.com/products/silicon-devices/soc/zynq-7000.html, 2016.
[77] YXI products. eXCite: C to RTL Behavioral Synthesis. http://www.yxi.com/products.php, 2016.
[78] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’15, pages 161–170, 2015.
[79] Zhiru Zhang, Yiping Fan, Wei Jiang, Guoling Han, Changqi Yang, and Jason Cong. AutoPilot: A Platform-Based ESL Synthesis System, pages 99–112. Springer Netherlands, Dordrecht, 2008.
[80] Hongbin Zheng, Swathi T. Gurumani, Kyle Rupnow, and Deming Chen.
Fast and Effective Placement and Routing Directed High-level Synthesis for FPGAs. In Proceedings of the 2014 ACM/SIGDA International Symposium on Field-programmable Gate Arrays, FPGA ’14, pages 1–10, New York, NY, USA, 2014. ACM.

Appendix A
LegUp Interface Directive

Vivado HLS provides a comprehensive set of interfacing options for IP integration in the Vivado logic synthesis flow. LegUp, on the other hand, uses a relatively limited memory-mapped approach, although it is possible to use other LegUp features to generate interface descriptions similar to what is provided in Vivado HLS (e.g. Custom Verilog code insertion [54]). The work in this thesis included a modification of LegUp’s source code to allow the creation of I/O ports with validity signals, in order to have a more direct comparison between LegUp and Vivado HLS implementations.

To resemble the format used by Vivado HLS, the implemented LegUp feature supports adding a TCL directive to convert a global variable into an I/O port. Listing A.1 is an example of C source code with EOP instrumentation for the “sum” variable. Listings A.2 and A.3 show the TCL directives for each tool to convert the “eop_14_sum_16” variable into an output port. Vivado’s implementation allows modifying the mode and location. These examples use the default settings to create a port with a validity signal located in the main “function/module”.

In LegUp, the value assignments to the specified global variables are scheduled using the System of Difference Constraints (SDC) heuristic [17] without any resource constraints. Multiple concurrent port writes are allowed by both Vivado HLS and this LegUp implementation.
The variables are then synthesized into I/O ports (input-only if no values are assigned to the variable; output-only if its value is never read).

Listing A.1: C with EOP Instrumentation

    // Non-volatile for output-only in Vivado HLS
    volatile int eop_14_sum_16;
    int main() {
        ...
        sum += A1[i][k] * B1[k][j];
        eop_14_sum_16 = sum;
        ...
    }

Listing A.2: Vivado HLS “directives.tcl”

    # set_directive_interface [OPTIONS=-mode(ap_vld)] <location=main> <port>
    set_directive_interface eop_14_sum_16

Listing A.3: LegUp “config.tcl”

    # set_interface_port <global_variable1> <global_variable2> ...
    set_interface_port eop_14_sum_16

Appendix B
EOP Experiment

A custom source-to-source transformation tool for EOP insertion was developed using the ROSE compiler infrastructure for evaluation of this approach. For Vivado HLS, this was achieved using global variables and the “interface” directive to instruct the HLS tool to convert these into port interfaces. Vivado HLS automatically includes a “valid” signal that is true during the state with the variable assignment. On the other hand, a new I/O feature had to be implemented in LegUp in order to allow the same type of transformations to be synthesized; this is explained in Appendix A.

Although exhaustive experimentation is presented in [52], our approach was used to examine a behavior that had not been explored. An experiment was set up to understand the motivation for implementing independent EOBs and, more generally, the number of concurrent port writes that are scheduled by either LegUp or Vivado HLS. The CHStone benchmark [38] was instrumented with EOPs for all identified assignments, synthesized, and wrapped by a top module that included a set of Concurrent Port Writes Counters. These counters keep track of the number of valid EOPs at each execution cycle. Figures B.1 and B.2 show the results of this experiment for all the programs in the CHStone benchmark. The X axis shows the number of EOPs, the left Y axis shows how often the given number of EOPs were found valid, and the right Y axis shows that count as a percentage of the total number of execution cycles. In general, there is at least 1 valid EOP for 20-30% of the execution time. Also, even though HLS is set to identify all possible parallelism and schedule simultaneous assignments, the number of valid ports is generally 1 or 2, and drops abruptly after 2. Using this information for EOB allocation would require dynamic analysis to guarantee an optimal EOB balancing. Thorough dynamic analysis, however, is not feasible for programs with input-dependent behaviour.

This metric can also be useful for performance profiling, determining the amount of extracted parallelism by finding the number of simultaneous variable assignments that were scheduled.

Figure B.1: LegUp EOP Writes
Figure B.2: Vivado HLS EOP Writes

Appendix C
Memory Access Profiles

The setup of the experiments used in Chapter 5 that allowed us to measure array duplication in the CHStone benchmark suite can also be used to obtain a Memory Access Profile of the program. For these experiments, each program was first instrumented for Data Capture in C using our ROSE-based transformation tool. Using Vivado HLS, the C code was synthesized and then added to a Vivado project in order to perform logic synthesis. These files can then be used for simulation to obtain behavior details through Verilog System Tasks and Functions (i.e. the $readmem and $writemem families of tasks). Using file I/O functions, we can dump the contents of the trace memory periodically, or at any chosen time, without the need for trace read back routines; this approach is used in other debugging techniques and this data is what is matched with the database.
In our experiments, the in-system debug is the main contribution and simulation is only used for benchmark characterization and analysis.

The set of files obtained from each circuit can then be analyzed by reading and matching their contents with the database generated during instrumentation. Trace entries belonging to arrays can be identified, as well as the array index used in that statement, in order to count the number of duplicates found in memory. Accesses to the same array index are counted only once, since only the last entry is duplicated, that is, available in both the trace buffer and the circuit memory.

The following graphs indicate the number of unique array entries in the trace buffer on the Y axes and the execution cycles on the X axes. Although some profiles of the CHStone benchmark are generally balanced (e.g. MIPS, ADPCM, SHA), most can be used to determine possible bottlenecks in circuit implementations. In general, it can be seen that memory accesses are common and, as seen in Table 5.11, can represent an average of more than 30% of the assignments carried out in the execution. This is a strong motivation for ADM and should be considered as the target for optimizations in future work.

Figure C.1: ADPCM
Figure C.2: AES
Figure C.3: BLOWFISH
Figure C.4: GSM
Figure C.5: JPEG
Figure C.6: MIPS
Figure C.7: SHA
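The counting rule used for the memory access profiles (repeated accesses to the same array index count once, because only the last entry is duplicated) can be illustrated with a minimal C sketch. This is not the analysis tool used in the thesis: the TraceEntry record, its array_id and index fields, and the function count_unique_array_entries are hypothetical names introduced here, and the sketch assumes the trace dump has already been matched against the instrumentation database so that each array write is reduced to an (array, index) pair.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical decoded trace record: which array was written and at
 * which index. The real trace format is defined by the instrumentation
 * database and is not reproduced here. */
typedef struct {
    unsigned array_id;
    unsigned index;
} TraceEntry;

/* Count the distinct (array_id, index) pairs in a trace window.
 * An entry is counted only if no later entry targets the same element,
 * so each element contributes exactly once (its last write). */
size_t count_unique_array_entries(const TraceEntry *trace, size_t n)
{
    size_t unique = 0;

    assert(trace != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        int overwritten_later = 0;
        for (size_t j = i + 1; j < n; j++) {
            if (trace[j].array_id == trace[i].array_id &&
                trace[j].index == trace[i].index) {
                overwritten_later = 1;
                break;
            }
        }
        if (!overwritten_later) {
            unique++;
        }
    }
    return unique;
}
```

Calling the function on successive trace dumps would give one point per dump for a profile of the kind plotted in Figures C.1 to C.7. The quadratic scan keeps the sketch short; a hash set keyed on the (array, index) pair would be the natural choice for long traces.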

