Syncopation: An Adaptive Clock Management Technique for High-Level Synthesis Generated Circuits on FPGA. Gibson, Kahlan, 2020.

Syncopation: An Adaptive Clock Management Technique for High-Level Synthesis Generated Circuits on FPGA

by

Kahlan Gibson
B.A.Sc., The University of British Columbia, 2018

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in The Faculty of Graduate and Postdoctoral Studies (Electrical & Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)
October 2020
© Kahlan Gibson 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Syncopation: An Adaptive Clock Management Technique for High-Level Synthesis Generated Circuits on FPGA

submitted by Kahlan Gibson in partial fulfillment of the requirements for the degree of Master of Applied Science in Electrical & Computer Engineering.

Examining Committee:
Steve Wilton, Computer Engineering (Supervisor)
Guy Lemieux, Computer Engineering (Supervisory Committee Member)
André Ivanov, Computer Engineering (Supervisory Committee Member)

Abstract

High-Level Synthesis (HLS) tools improve hardware designer productivity by enabling software design techniques to be used during hardware development. While HLS tools are effective at abstracting the complexity of hardware design away from the designer, producing high-performance HLS-generated circuits still generally requires awareness of hardware design principles. Designers must often understand and employ pragma statements at the software level or have the capability to make adjustments to the design in Register-Transfer Level (RTL) code.

Even with designer hardware expertise, the HLS-generated circuits can be limited by the algorithms themselves. For example, during the HLS flow the delay of paths can only be estimated, meaning the resulting circuit may suffer from unbalanced computational distribution across clock cycles. Since the maximum operating frequency of synchronous circuits is determined statically using the worst-case timing path, this may lead to circuits with reduced performance compared to circuits designed at a lower level of abstraction.

In this thesis, we address this limitation using Syncopation, a performance-boosting fine-grained timing analysis and adaptive clock management technique for HLS-generated circuits. Syncopation instrumentation is implemented entirely in soft logic without requiring alterations to the HLS-synthesis toolchain or changes to the FPGA, and has been validated on real hardware. The key idea is to use the HLS scheduling information along with the placement and routing results to determine the worst-case timing path for individual clock cycles. By adjusting the clock period on a cycle-by-cycle basis, we can increase the performance of an HLS-generated circuit. Our experiments show that Syncopation improves performance by 3.2% (geomean) across all benchmarks (up to 47%). In addition, by employing a targeted synthesis technique called Enhanced Synthesis along with Syncopation we can achieve a 10.3% performance improvement (geomean) across all benchmarks (up to 50%).

Lay Summary

Specialized tools can reduce the time and effort to design circuits. However, these tools may produce slow circuits because some design decisions are based on estimates of path timing in the final design, which cannot be determined in advance. Typically, low circuit performance is due to a slow clock frequency, as the clock must be slow enough for the longest paths in the design.
To improve the performance of circuits developed with these tools, we propose a method of recovering performance by adjusting the clock frequency in a fine-grained manner. This allows any design to operate faster without requiring additional compilation time for circuit designers.

Preface

The work presented in this thesis was originally published in part as a short paper in the 2020 International Conference on Field-Programmable Logic and Applications (FPL).* Section 4.5 elaborates upon the Enhanced Synthesis technique, which was initially proposed in experiments conducted by E. Roorda. The implementation of Enhanced Synthesis presented in this thesis was extended, implemented, and evaluated by Kahlan Gibson.

All the research, implementation, and experimentation presented in this thesis was primarily conducted by Kahlan Gibson. This research was conducted under the supervision of Dr. S. Wilton, who provided editorial support for the publication. Additional inspiration, guidance, and editorial support was provided by D. H. Noronha, who co-authored the publication.

Syncopation is available as an open-source tool online at https://github.com/kahlangibson/Syncopation.

* K. Gibson, E. Roorda, D. H. Noronha, and S. Wilton. Syncopation: An Adaptive Clock Management Technique for HLS-Generated Circuits on FPGA. In 2020 30th International Conference on Field Programmable Logic and Applications (FPL), 2020.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
List of Programs
Acronyms
Acknowledgements
Dedication

1 Introduction
  1.1 Motivation
  1.2 Key Idea
  1.3 Contributions
  1.4 Thesis Organization

2 Background
  2.1 Static Timing Analysis
  2.2 High-Level Synthesis

3 Related Work
  3.1 Pipelining and Retiming
    3.1.1 Clock Skew Scheduling
  3.2 HLS Scheduling
    3.2.1 Multi-Cycle Operation
    3.2.2 Dynamic Scheduling
  3.3 Overclocking
  3.4 Multiple Clock Domains
  3.5 Adaptive Clock Management

4 Syncopation
  4.1 Overview
  4.2 Overall Flow
  4.3 Adaptive Clock Management Circuitry
    4.3.1 Circuit Operation
    4.3.2 Divisor Memories
    4.3.3 Divisor Selection Logic
    4.3.4 Clock Generator Circuitry and Clock Distribution
  4.4 Fine-Grained Static Timing Analysis
    4.4.1 Per-State Timing and Synthesis Directives
    4.4.2 Control Logic Constraints
    4.4.3 Divisor Calculation
  4.5 Enhanced Synthesis
  4.6 Summary

5 Adaptive Clock Management Using Syncopation
  5.1 Introduction
    5.1.1 Evaluation Metrics
  5.2 Validation
    5.2.1 Validation in Hardware
    5.2.2 Validation in Simulation
  5.3 Performance Results
    5.3.1 Current State Syncopation
    5.3.2 Next State Syncopation
    5.3.3 Enhanced Synthesis
    5.3.4 Instrumentation Overhead
    5.3.5 Memory Overhead
    5.3.6 Comparison to Previous Version of Syncopation
  5.4 Discussion
    5.4.1 Custom Place-and-Route Toolchain
    5.4.2 Custom HLS Toolchain
    5.4.3 Security Applications
    5.4.4 Syncopation for Arbitrary Circuits

6 Conclusions

Bibliography

List of Tables

4.1 The Area and Performance Impact of Synthesis Directives
5.1 Validation Results in Hardware
5.2 Performance of Current State Syncopation (Sync-CS)
5.3 Performance of Next State Syncopation (Sync-NS)
5.4 Performance of Syncopation Compared to the Theoretical Performance Achievable without an Integer Divider Clock Generator
5.5 Performance of Enhanced Synthesis
5.6 Syncopation Instrumentation Overhead for device 5CSEMA5F31C6N, 32070 ALMs
5.7 Syncopation Memory Utilization
5.8 Performance of Syncopation without Active Module Identification Logic (Previous Work)
5.9 Per-Module Maximum Operating Frequency

List of Figures

1.1 Number of Cycles Across Execution Spent in States with Various State Delays for the adpcm CHStone Benchmark
2.1 The Performance-Programmability Computing Landscape at Low Volumes
2.2 The LegUp high-level synthesis flow
2.3 Program 2.1, scheduled into states
3.1 Pipelining some computations to balance path delays
3.2 Retiming some computations to balance path delays
4.1 A simple Finite State Machine
4.2 Syncopation overall flow
4.3 A Syncopation Circuit
4.4 Current State based divisor selection logic
4.5 Current State divisor memory packing based on the Finite State Machine in Fig. 4.1
4.6 The general form of a Finite State Machine
4.7 Next State divisor memory packing based on the Finite State Machine in Fig. 4.1
4.8 Next State based divisor selection logic
4.9 A sequence diagram describing the operation of a High-Level Synthesis-generated circuit
4.10 Integer divisor clock generator circuit
5.1 Validation circuit for Syncopation
5.2 Syncopation Performance on CHStone Benchmark Circuits

List of Programs

2.1 A simple example program written in an assembly-like language
4.1 A Syncopation schedule with various path delays and a function-call wait state
4.2 A Syncopation schedule including control logic constraints

Acronyms

ACM: Adaptive Clock Management
ALM: Adaptive Logic Module
ASIC: Application-Specific Integrated Circuit
CAD: Computer-Aided Design
CGRA: Coarse-Grained Reconfigurable Architecture
CPU: Central Processing Unit
CSS: Clock Skew Scheduling
DM: Divisor Memory
ES: Enhanced Synthesis
FPGA: Field-Programmable Gate Array
FSM: Finite State Machine
GPP: General Purpose Processor
GPU: Graphics Processing Unit
HDL: Hardware Description Language
HLS: High-Level Synthesis
LUT: Lookup Table
MCD: Multiple Clock Domain
PLL: Phase-Locked Loop
RTL: Register-Transfer Level
SDC: Systems of Difference Constraints
SDC: Synopsys Design Constraints
SM: State Machine
STA: Static Timing Analysis

Acknowledgements

First and foremost, thank you to my supervisor Steve Wilton, whose immeasurable support, patience, and teaching was inspirational throughout my tenure at UBC.

Special thanks to the members of my research group for their ideas, feedback, and comradery: Daniel Holanda Noronha, Jose Pinilla, Bahar Salehpour, Esther Roorda, Eddie Hung, and Nikhil Ganathe; and to the many others who were my colleagues, friends, mentors, and idols over the years: Dave Evans, Amin Azar, Khalid Essam, Xiaowei Ren, John Deppe, Mohamed Omran, William Yang, Deval Shah, Max Golub, Hossein Omidian, and Al-Shahna Jamal.
Your advice, feedback, high-spirited chats, and lunch breaks have made these years a delight.

I would also like to thank the broader faculty of Electrical & Computer Engineering, who provided seven years of challenging learning and allowed me to discover my passion for teaching and my career.

A final round of thank-yous to my colleagues at Microsoft, who bolstered my professional growth by fostering an innovative and engaging culture over two fantastic summers.

Funding for this research was provided by Intel and the Natural Sciences and Engineering Research Council of Canada.

Dedication

My deepest gratitude to my friends who have remained my greatest fans, critics, and fellow adventurers through it all, whether they were an editor for a day, a lab partner for months, or a confidante and companion for years.

To my parents, Allison and Randall, and Kyle: your foundation is an unwavering source of energy. I am lucky to have three of the most resilient people I know on my team, forever.

Chapter 1
Introduction

As Field-Programmable Gate Array (FPGA) architecture and Computer-Aided Design (CAD) tools evolve, designers are using FPGAs to implement larger and more complex circuits than ever before. Many of these circuits have tight timing requirements, and achieving timing closure is time-consuming and often requires extensive optimization. For expert hardware designers with familiarity with Register-Transfer Level (RTL)-based design flows, techniques such as pipelining and strategically imposing timing constraints can help achieve timing closure. Increasingly, non-domain-expert designers are using High-Level Synthesis (HLS) tools such as LegUp, which automatically transform a software-oriented language (e.g. the C language) to hardware [10]. For these designers, optimization methods requiring hardware knowledge may not be feasible, either because the designer does not have the expertise or because the effort to effectively use these techniques defeats the purpose of accelerating the design flow using HLS. For this reason, recent work has focused on improving tools to automatically optimize circuit performance, power, and area in an attempt to achieve custom-implementation results without requiring additional designer expertise.

1.1 Motivation

In the HLS toolflow, RTL code is generated using estimated delays for individual paths. The RTL code is then synthesized by place and route tools which attempt to balance path lengths, often by giving preference to timing-critical nets. The degree of path balancing achievable by synthesis tools is limited without changing the scheduling decisions made by the HLS tool, leading to a longer-than-necessary clock period.

To demonstrate this and to motivate our technique, we synthesize several CHStone benchmark circuits [51] using the LegUp open source HLS tool [10] and use Intel Quartus Pro to perform placement, routing, and timing analysis. For each implementation, we profiled the design to determine the minimum clock period broken down by circuit state, and then determined the number of cycles across execution spent in each state. Fig. 1.1 shows the results of this experiment for the adpcm benchmark.

For the adpcm benchmark, the critical path delay is close to 15 ns. However, only a small proportion of execution time is spent in states which exercise this path. The vast majority of execution time is spent in states in which the longest exercised path is at least 30% faster. This means that although the operating frequency of this circuit would normally be estimated as 64 MHz, this circuit could operate with much lower latency if we could tune the clock on a cycle-by-cycle basis according to the requirements of the active state. This would effectively increase the operating frequency to 98 MHz.

Figure 1.1: Number of Cycles Across Execution Spent in States with Various State Delays for the adpcm CHStone Benchmark (x-axis: state-by-state timing in ns; y-axis: number of cycles across execution).
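The arithmetic behind this comparison can be made concrete with a short sketch that computes the effective operating frequency from a per-state profile. The state delays and cycle counts below are illustrative placeholders, not the measured adpcm data; only the roughly 15 ns critical path and the fixed-versus-adaptive comparison idea come from the discussion above.

    # Hypothetical per-state profile: (longest exercised path in ns, cycles spent there).
    profile = [(15.0, 1000), (10.5, 20000), (9.0, 60000), (7.5, 19000)]

    # Fixed clock: every cycle pays for the worst-case path in the whole design.
    worst_ns = max(delay for delay, _ in profile)
    total_cycles = sum(cycles for _, cycles in profile)
    fixed_time_ns = worst_ns * total_cycles

    # Adaptive clock: each cycle only pays for its own state's path delay.
    adaptive_time_ns = sum(delay * cycles for delay, cycles in profile)

    fixed_mhz = 1e3 / worst_ns                              # about 67 MHz for a 15 ns path
    effective_mhz = 1e3 * total_cycles / adaptive_time_ns   # cycles completed per microsecond
    print(f"fixed clock:    {fixed_mhz:.0f} MHz")
    print(f"adaptive clock: {effective_mhz:.0f} MHz "
          f"({fixed_time_ns / adaptive_time_ns:.2f}x speedup)")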
1.2 Key Idea

In this thesis, we present Syncopation, a technique to recover execution time lost to path length variation after placement and routing. (In music, syncopation is the temporary displacement of a beat that disrupts the expected rhythm.) The key idea is to use Adaptive Clock Management (ACM), in which the clock frequency is adjusted on a cycle-by-cycle basis based on the needs of the current circuit state. This is made possible by leveraging three observations. First, computational path delay variations exist across state machine states. Second, in any circuit, the outputs of all paths are not required in every cycle; only paths that compute scheduled instructions are required to meet timing in each circuit state. The clock period does not need to accommodate timing for paths which are not active in a given circuit state. Finally, HLS-generated circuits are particularly amenable to ACM techniques as HLS tools automatically generate instruction schedules which identify the paths that are important for each state, therefore requiring no additional time to identify the relevant paths for each state. By combining the information from the detailed instruction schedules with Static Timing Analysis (STA) results for each path, we can identify the minimum clock period required on a state-by-state basis.

1.3 Contributions

This thesis describes and evaluates our technique in combination with LegUp HLS and Intel Quartus Pro software. Specifically, we make the following contributions:

1. We implement a fine-grained adaptive clock using a dynamic clock generator based on [40] and a memory-based lookup controller system. The clock generator is capable of producing a glitch-free global clock with a different clock period in every cycle.

2. We propose a method for fine-grained static timing analysis. This method uses the schedule generated by the HLS tool to determine which values are computed in each circuit state, at the cost of using synthesis directives to prevent register optimization during place and route. The adaptive clock controller is programmed with path-length data and achieves a 3.2% increase in performance on average (up to 47%).

3. We investigate integrating the results of fine-grained STA with commercial timing-driven place-and-route tools by automatically generating SDC files of timing constraints used in an additional synthesis pass to further enhance performance (a sketch of such constraint generation follows this list). We obtain a 10.3% increase in performance on average (up to 50%) without altering the place-and-route algorithms.
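As an illustration of the third contribution, the sketch below shows one way a constraints file could be emitted from per-state timing results. The register name patterns, delays, and output path are hypothetical, and set_max_delay is used only as an example of a standard SDC constraint; the constraints Syncopation actually generates are described in Chapter 4.

    # Minimal sketch: turn per-state timing results into SDC set_max_delay
    # constraints. All names and delay values below are hypothetical examples.
    per_state_paths = {
        # state: (source register patterns, destination register patterns, required delay in ns)
        "state_3": (["r_a*", "r_b*"], ["r_out*"], 6.0),
        "state_7": (["r_c*"],         ["r_out*"], 14.0),
    }

    with open("syncopation_constraints.sdc", "w") as sdc:
        for state, (srcs, dsts, delay_ns) in per_state_paths.items():
            for src in srcs:
                for dst in dsts:
                    sdc.write(
                        f"# constraint derived from {state}\n"
                        f"set_max_delay -from [get_registers {{{src}}}] "
                        f"-to [get_registers {{{dst}}}] {delay_ns}\n"
                    )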
1.4 Thesis Organization

The remainder of this thesis is organized as follows. Chapter 2 presents relevant background on FPGA tools, including static timing analysis and high-level synthesis. Chapter 3 summarizes common techniques to improve performance in circuits, both in general-purpose processors and custom applications. Chapter 4 details our technique and describes the additional hardware required to implement the Syncopation clock management strategy. Chapter 5 presents experimental results which show the impact of applying Syncopation to a set of benchmark circuits.

Chapter 2
Background

Designers selecting a computing platform consider tradeoffs between programmability, performance, and cost. On one end of the spectrum, Central Processing Units (CPUs) offer a highly programmable platform capable of executing any software, but may suffer in performance due to (1) the lack of customized logic optimized for a specific application and (2) the assumption of a sequential compute paradigm. On the other end of the spectrum, fully-custom Application-Specific Integrated Circuits (ASICs) are not generally programmable, and instead are designed to maximize performance (minimize latency, maximize throughput) to complete a specific task. For low-to-medium volume applications, the cost of using a device fabricated at scale, such as a CPU, is significantly lower than the costs of developing a custom-designed ASIC. At these lower volumes, the non-recurring ASIC development costs (including engineering development, fabrication, and validation) make fully-custom solutions economically unfavourable. In high-volume applications, the cost of ASIC development is offset by the volume produced and sold, making fully-custom solutions more appealing.

In applications which require higher performance at volumes not sufficient to offset the initial design costs, FPGAs provide a platform to develop custom solutions at a lower price-point. FPGAs are highly programmable arrays of gates which are mass fabricated, eliminating one-time engineering costs. Since FPGAs are programmed to implement circuits at the gate level and can be reprogrammed for testing without waiting for fabrication, FPGAs offer hardware designers a unique opportunity to quickly design high performance accelerators at a lower cost compared to ASICs. Due to these advantages, FPGAs have become a common technology in consumer devices and high performance applications in big data, networking, and artificial intelligence. In recent years, FPGA instances have become a standard feature in major cloud computing infrastructures (including Amazon AWS [6] and Microsoft Azure [41]), and devices such as the latest 10 nm process Intel Agilex FPGAs have been designed specifically for datacenter use [26]. Widespread utilization of FPGAs has been bolstered by the improvement of development environments and tools such as HLS, which strives to make programming FPGAs accessible to designers without hardware design experience.

Figure 2.1: The Performance-Programmability Computing Landscape at Low Volumes (CPU, GPU, FPGA, and ASIC positioned by cost per unit, performance, and programmability/flexibility).

Fig. 2.1 describes the computational platform landscape, including Graphics Processing Units (GPUs), which are not discussed further here. CPUs occupy the low-cost, low-performance, high-programmability corner of the landscape. ASICs occupy the high-cost, high-performance, low-programmability corner. FPGAs combine these characteristics into a package which provides the specialized compute desired for high performance applications at a lower cost than ASIC development and in a platform that can be reprogrammed when needed.

FPGA development and implementation quality are highly dependent on the tools and methodology provided by commercial vendors. Intel [2] and Xilinx [5] have the largest market share, followed by a small number of other companies including Achronix Semiconductor [1], Lattice Semiconductor [3], and Microsemi Corporation [4].
Each company's FPGAs are programmed using a tightly coupled closed-source toolflow. The toolflows use intimate knowledge of the FPGA architecture as well as circuit timing information to synthesize a designer's description of a circuit into a gate-level implementation and to generate a bitstream which is used to populate memories, implement logic, and connect specified wires on the FPGA device. In addition to realizing the designer's circuit description using device resources, synthesis tools are responsible for fine-tuning the circuit implementation. Given a behavioural description of a circuit, the synthesis tool should implement the circuit on the FPGA device to minimize area, latency, and power utilization. Since there are many ways to implement a given circuit on FPGA resources, determining the best implementation is a complicated problem.

The complexity of determining a high-quality implementation is further complicated when designers choose to use HLS tools. HLS tools enable designers to describe the operation of their circuit using a high-level software language, enabling faster development and abstracting away the hardware-level details. However, the high-level description does not explicitly define the circuit operation and leaves even more implementation details to the tools to determine. The algorithms which transform high-level software languages into circuits may be heuristic or stochastic in nature to encourage design space exploration, and significant research effort has been expended to improve the quality of HLS-generated circuits while minimizing the hardware knowledge required by designers.

This chapter presents the background information relevant to this thesis. It begins with an overview of static timing analysis and how timing information is used to guide circuit implementation in synthesis tools. Then, it discusses high-level synthesis tools and how they utilize timing information to generate hardware descriptions from software algorithms.

2.1 Static Timing Analysis

FPGA devices consist of an array of programmable "blocks" which can be wired together to form an arbitrary circuit. The wiring between blocks consists of a routing "fabric" of switch blocks to connect required gates together. The blocks on an FPGA include hardened arithmetic units, I/O blocks, on-chip memories, routing switch blocks, and generic programmable blocks (Adaptive Logic Modules (ALMs) in the Intel vernacular [24]).

To design circuits using an FPGA, hardware designers write Hardware Description Language (HDL) code (such as Verilog) which describes circuit behaviour. Then, specialized vendor tools (such as Intel's Quartus) synthesize the code into a circuit implementation which maps onto the FPGA resources. Synthesis consists of multiple stages, including netlist generation, technology mapping, clustering, placement, and routing, as shown in Fig. 2.2. During netlist generation, the Verilog is transformed into a gate-level description of the circuit. Then, technology mapping transforms groups of gates in the netlist into components available on the FPGA, such as N-input Lookup Tables (N-LUTs).
Then, the N-LUTs are clustered to encourage highly connected sections of the circuit to be within the same logic block cluster on the chip (to reduce latency and routing congestion). Finally, the N-LUT contents are assigned to specific ALMs on the FPGA, and routing is performed to make connections between the ALMs.

To evaluate the performance of a circuit implementation, STA measures each path in the circuit to determine whether it meets timing requirements. Additionally, the longest path (the critical path) is recorded to determine the maximum operating frequency, since the clock period must be long enough to accommodate the timing of the longest path. The clock frequency determined during STA determines the performance of the design, as higher frequency circuits have lower latency. STA must be performed after synthesis for an accurate result, as the specific implementation produced by the tool directly impacts the critical path delay.

Figure 2.2: The LegUp high-level synthesis flow (C code is compiled to an optimized IR, then allocation, scheduling, binding, and RTL generation produce Verilog RTL and a schedule).

Although STA identifies the critical path of the post-synthesis circuit, it is desirable to use timing information throughout synthesis. This technique is referred to as Timing-Driven Synthesis, which incorporates timing information into the algorithms to continuously perform optimizations aimed at reducing the critical path of the circuit. A simple example of a non-timing-driven clustering algorithm is VPack [7], which attempts to efficiently cluster the design to have high utilization of physical resources while minimizing signal latency. The VPack clustering algorithm attempts to cluster ALMs to meet the following goals:

1. Minimize the total number of ALM clusters used on chip, to encourage high cluster utilization.

2. Minimize the number of inter-cluster connections used in the design, since these wires have longer latency.

In VPack, an unclustered ALM in the design with the most inputs is selected as the "seed", or first ALM, in a cluster. Then, ALMs with the most shared inputs to the current cluster population are iteratively added to the cluster until no further packing into the cluster is possible.

On the other hand, a timing-driven clustering algorithm, such as T-VPack [37], prioritizes clustering of the critical path by selecting an unclustered ALM on the critical path with the most inputs as the seed for the cluster. Additionally, T-VPack uses more complex selection criteria when adding to the cluster. In particular, ALMs will be prioritized for cluster addition if they (1) share many inputs with the current population, (2) are close in connectivity to the critical path, and (3) are involved in multiple timing-critical paths. Adding timing awareness to clustering incentivizes the implementation of solutions for which the critical path spans the fewest number of clusters.
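To make the greedy seed-and-grow idea concrete, the sketch below implements a much-simplified clusterer in the spirit of VPack. It is not the published VPack or T-VPack algorithm; the netlist, the cluster capacity, and the attraction function (number of shared input nets) are illustrative assumptions.

    # Simplified seed-and-grow clustering sketch (VPack-like, not the real tool).
    # Each "ALM" is described only by the set of input nets it consumes.
    alms = {
        "a": {"n1", "n2", "n3"},
        "b": {"n2", "n3"},
        "c": {"n3", "n4"},
        "d": {"n5", "n6"},
        "e": {"n6"},
    }
    CLUSTER_CAPACITY = 2  # ALMs per cluster (hypothetical architecture limit)

    unclustered = set(alms)
    clusters = []
    while unclustered:
        # Seed: the unclustered ALM with the most inputs.
        seed = max(unclustered, key=lambda a: len(alms[a]))
        cluster, cluster_inputs = [seed], set(alms[seed])
        unclustered.remove(seed)
        # Grow: repeatedly add the ALM sharing the most inputs with the cluster.
        while len(cluster) < CLUSTER_CAPACITY and unclustered:
            best = max(unclustered, key=lambda a: len(alms[a] & cluster_inputs))
            cluster.append(best)
            cluster_inputs |= alms[best]
            unclustered.remove(best)
        clusters.append(cluster)

    print(clusters)  # e.g. [['a', 'b'], ['d', 'e'], ['c']]

A timing-driven variant in the spirit of T-VPack would additionally weight candidate ALMs by how close they sit to the current critical path.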
Another target for timing-driven algorithms is to attempt to balance path lengths such that the amount of compute taking place in each clock period fully utilizes the time available in each cycle. However, there are limits on how evenly the path lengths can be balanced, due to inherent limitations of the tools and FPGA architectures. First, the circuit description is explicitly defined at the RTL level, and significant changes to when computations are being performed are not desirable. For example, the synthesis tools are not given the ability to broadly reschedule large numbers of operations in order to improve circuit performance. For this reason, the first performance optimizations are typically performed by hardware designers at the RTL level. Second, the granularity of synchronization elements and delay differences in paths through elements such as routing switch blocks on FPGAs make producing precise path lengths difficult. While ASIC designers may manually lay out gates and wires to achieve very similar path delays, the synthesis tools for FPGAs are limited by the availability of the remaining resources to achieve matching path lengths. Finally, synthesis, place and route, and STA are time-consuming, and calculating timing for all paths would cause intractable increases in compilation time. To reduce synthesis run-time, tools use timing estimates and focus optimizations primarily on the few most-critical paths, unless specifically incentivized to reduce timing for all paths.

2.2 High-Level Synthesis

HLS is the process of converting high-level software language code into a circuit which implements the described algorithm. HLS tools reduce the hardware knowledge required to use FPGAs as general acceleration platforms, increase designer productivity, and boost debug productivity. However, the level of abstraction that makes HLS so attractive also removes much of the control designers have over the quality of the final circuit implementation and may lead to longer critical paths compared to what may be possible with RTL-oriented design techniques. Details such as the schedule of instructions and computations in the datapath are removed from the designer's control and are left to the discretion of the HLS tool, unless the designer has sufficient experience to change the code at the Verilog level. There are multiple stages to the HLS toolflow, all of which have unique opportunities to influence the final outcome of the synthesis flow.

At the input to the HLS flow, depicted in Fig. 2.2, the tool takes the high-level behavioural description of the algorithm and information about available hardware resources to generate RTL which describes the software. To generate the RTL, the tool generally completes the following steps:

1. Compilation of the high-level code to a formal representation, including basic optimizations
2. Hardware resource allocation
3. Instruction scheduling
4. Binding computations to functional units (multipliers, dividers)
5. Binding variables to storage elements (registers or memories)
6. Binding data transfers to buses
7. Generation of the RTL description

In each of these steps, there are opportunities to improve circuit performance. In particular, the scheduling of instructions can directly influence the critical path length. The challenge is that instruction scheduling at the HLS level can only use rough timing estimates to inform decisions, as no information about the final implementation is available yet.

Consider the instructions provided in Program 2.1. An approximate delay for each computation is provided as an example. There are multiple ways to schedule these computations into states.

Program 2.1: A simple example program written in an assembly-like language.

    main:
    1  add r3, r1, r2   # add: delay 2
    2  sub r4, r3, r1   # sub: delay 2
    3  add r5, r4, r2   # add: delay 2
    4  sub r6, r1, r5   # sub: delay 2
    5  mul r7, r3, r6   # mul: delay 10
    6  mul r8, r5, r6   # mul: delay 10
    7  add r9, r7, r8   # add: delay 2
    8  add r0, r9, r1   # add: delay 2

Figure 2.3: Program 2.1, scheduled into states (State 1: instructions 1-4, delay 8; State 2: instructions 5-6, delay 10; State 3: instructions 7-8, delay 4).
We assume this code segment is part of a larger algorithm, and minimizing the critical path length of this code segment is critical for circuit performance. There are two goals to consider when scheduling these instructions. First, we would like to minimize the longest path in a single state. Since multiplication is typically slower than addition, we are limited by the delay of the multiply instructions on lines 5-6, which can be parallelized. No other instructions can be added to this state without increasing the path length. Second, we would like to minimize the total number of states. We can combine instructions 1-4 and instructions 7-8 into two other states without increasing the critical path length. The proposed schedule is shown in Fig. 2.3.

Although Fig. 2.3 meets our scheduling goals, we still have path length variation between our states, resulting in 27% of the program's execution time being unutilized due to short computational paths. Although this example is simple, it is clear how the scheduling problem is complicated for HLS tools to solve without significant designer effort either at the RTL level or through the use of compiler directives. ACM techniques forgo the need for this extra effort and can reduce the underutilized compute time.
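The 27% figure follows directly from the schedule. The sketch below reproduces the arithmetic using the per-state delays of Fig. 2.3 and a fixed clock period set by the slowest state; treating each state as executing exactly once is an assumption made only for this illustration.

    # Per-state delays from the schedule in Fig. 2.3 (arbitrary time units).
    state_delays = {"State 1": 8, "State 2": 10, "State 3": 4}

    # With a single fixed clock, every state pays for the slowest state's delay.
    period = max(state_delays.values())            # 10
    total_time = period * len(state_delays)        # 30, assuming each state runs once
    useful_time = sum(state_delays.values())       # 22

    unused = 1 - useful_time / total_time
    print(f"fixed period {period}: {unused:.0%} of execution time unutilized")
    # -> fixed period 10: 27% of execution time unutilized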
Chapter 3
Related Work

At the RTL level, hardware designers can carefully tune circuits to maximize circuit operating frequency without introducing erroneous computation. However, modern synthesis tools have been designed to automatically perform circuit tuning to make hardware design faster and more accessible. In this chapter, we introduce multiple approaches in recent work to automate improved circuit performance through multiple stages of the synthesis toolflow, including circuit-level optimizations, HLS pass optimizations, and clocking schemes.

3.1 Pipelining and Retiming

An RTL description of a circuit describes operation on a state-by-state basis. Operations occurring in each state are scheduled such that data dependencies and timing requirements are satisfied. When these scheduling decisions are made by a hardware designer writing RTL, the designer can carefully optimize the code to achieve a target performance. Once the RTL is written, there are few opportunities for automated tools to further optimize the circuit implementation since they lack knowledge of the overall operation.

However, pipelining and retiming can be automatically employed in synthesis tools as they often do not require significant changes to the schedule of operations as described in the RTL. Fig. 3.1 and Fig. 3.2 depict examples of pipelining and retiming simple circuits. Propagation delays have been arbitrarily assigned between the registered stages indicated as flip-flops, which could be due to scheduled computations or data transfers over wires. In Fig. 3.1, there are two combinational components (labelled A and B) which are connected sequentially. Each component takes some amount of computational time to produce a result. In this case, the compute times are 10 ns each, for a total path delay of 20 ns. If this path is the critical path of the circuit, the maximum frequency the circuit can operate at is 50 MHz. Pipelining can increase the maximum frequency by adding an additional register stage to break up this path, as shown in the second path in Fig. 3.1. In this circuit, the path has been broken up by an additional register into two paths with 10 ns delays, increasing the maximum frequency of the circuit to 100 MHz.

Figure 3.1: Pipelining some computations to balance path delays.

In Fig. 3.2, retiming is performed to balance out paths and reduce the critical path length without adding additional registers. In this circuit, there are two paths consisting of combinational logic separated by a register. Path A is the critical path of the circuit with a delay of 22 ns, and Path B has a shorter delay of 18 ns. The maximum operating frequency due to Path A is 45 MHz. It is possible that some of the computation that is performed in Path A could be moved to Path B to improve circuit performance. In Fig. 3.2, retiming is performed to balance Path A and Path B. The new paths now have delays of 20 ns each, and the new maximum circuit frequency is 50 MHz.

Figure 3.2: Retiming some computations to balance path delays.
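The frequency numbers quoted for these two examples follow directly from the relation f_max = 1 / t_critical. The sketch below simply restates that arithmetic; the stage delays are the ones given in the figures.

    def fmax_mhz(stage_delays_ns):
        """Maximum clock frequency set by the slowest register-to-register stage."""
        return 1e3 / max(stage_delays_ns)

    print(fmax_mhz([20]))             # Fig. 3.1 before pipelining: 50 MHz
    print(fmax_mhz([10, 10]))         # Fig. 3.1 after pipelining: 100 MHz
    print(round(fmax_mhz([22, 18])))  # Fig. 3.2 before retiming: ~45 MHz
    print(fmax_mhz([20, 20]))         # Fig. 3.2 after retiming: 50 MHz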
Modern FPGA tools support automated pipelining and retiming in synthesis. For example, Intel's Hyperflex architecture in the top-of-the-line Stratix 10 and Agilex devices and Quartus Pro allow users to enable automated pipelining and retiming [24]. Although leveraging the fine-grained registered architecture in these devices can improve circuit performance, finding appropriate paths can increase synthesis time due to added complexity. Furthermore, not all circuits are amenable to these optimizations at the synthesis stage. For HLS-generated circuits in particular, academic research has proposed leveraging these device architectures to improve timing. In particular, [12] demonstrated improved retiming techniques in Intel Hyperflex devices. Additionally, changes to the loop pipelining algorithms in HLS tools have been shown to increase circuit throughput [15]. These methods require changes to HLS algorithms, and may increase the complexity of the tools or increase compilation time. Unlike previous work, we do not propose changes to the HLS or commercial synthesis algorithms, and Syncopation does not rely on a specific FPGA architecture. Instead, we exploit the path imbalances which persist in HLS-generated circuits despite these algorithmic improvements. In fact, Syncopation could be used to further improve previous work by exposing slack that was not recovered through pipelining and retiming.

3.1.1 Clock Skew Scheduling

Clock Skew Scheduling (CSS) is a technique for achieving performance-boosting path balancing comparable to retiming, but it is implemented by intentionally inserting clock skew instead of physically moving logic between flip-flops [19]. CSS has also specifically been applied to FPGAs to improve performance [47]. FPGA architecture changes have also been proposed to benefit the performance of circuits [16].

3.2 HLS Scheduling

In high-level synthesis, scheduling is the process of mapping the software-level operations into clock steps. Scheduling is a complicated optimization problem typically solved with heuristic algorithms, which may not result in the highest-performance result. Furthermore, since the detailed path-timing information used to inform scheduling decisions is not known until after place-and-route is complete, HLS tools use timing estimates to assign software instructions to computational states. When determining the instruction schedule, the HLS tool attempts to maximize performance by balancing path lengths across cycles and to minimize power utilization and circuit area by efficiently utilizing on-chip resources.

The open source HLS tool LegUp 4.0 uses a Systems of Difference Constraints (SDC)-based scheduling algorithm [34]. SDC-based algorithms find solutions to problems such as scheduling by solving a mathematical formulation of the constrained system. Multiple scheduling algorithms have been proposed in academic research which improve upon the LegUp SDC-based scheduler. In [22], instruction-level data dependency graphs were utilized to improve scheduling across basic blocks to reduce circuit latency. Other scheduling algorithms have attempted to optimize for circuit area and congestion, such as in [17], by constraining the scheduler by both latency and resource usage. Stochastic optimization methods which use randomisation to perform broad design space exploration have also been proposed. These algorithms, such as simulated annealing (proposed in [39]) and genetic algorithms (proposed in [44]), explore the multi-dimensional design spaces to determine schedules which meet area, delay, and power requirements.

3.2.1 Multi-Cycle Operation

Another technique that has been applied to improve the performance of HLS circuits is multi-cycling. Multi-cycling is a technique which allows some paths within a circuit to have longer path delays than the clock period, allowing those paths more than one clock cycle to complete execution. HLS tools are particularly suitable for multi-cycling as the high-level algorithm and scheduling analysis occurring during HLS reveals which computational results are not required in the next clock cycle, and therefore may benefit from multi-cycling techniques. Furthermore, the HLS scheduling algorithms can be altered to automatically employ multi-cycling for improved performance [23, 52].

3.2.2 Dynamic Scheduling

Taken to an extreme, multi-cycling leads to a dynamic synchronous dataflow-like architecture incorporating handshake protocols to signal the completion of individual operations. Dynamic scheduling algorithms have been explored for HLS circuits to gain additional performance. Typically, HLS tools schedule instructions statically, meaning that even if the time required for an operation to complete is data-dependent, the worst-case timing estimate of that operation is assumed. Static scheduling techniques suffer performance degradation as the worst-case timing estimate is conservative in the typical case. To overcome this limitation, dynamic HLS scheduling has been proposed [13, 31, 32]. Dynamic scheduling prevents variable-latency operations from blocking future computations at the overhead of additional circuit area for handshaking protocols (2.48x LUTs [13]). To mitigate the area cost, the authors proposed a hybrid static-dynamic scheme which dynamically schedules the sub-circuits with the most improvement due to dynamic scheduling (1.52x LUTs [13]).
Another technique for improving software execution is speculative execution, a performance-boosting technique in which a processor will predict what to execute when a data-dependent branch is reached. In [33], data-dependent branch operations are speculatively computed in parallel ahead of the branch resolution to achieve additional performance. Utilizing speculative execution reduces latency by up to 22.4%, but the additional hardware resources to perform the predictions and computations increase circuit area by up to 19.7% [33].

Syncopation does not require changes to the HLS scheduling algorithms, and could be implemented in conjunction with existing scheduling algorithms to further improve performance. For example, a hybrid dynamic-static scheduling paradigm such as in [13] could employ Syncopation in statically-scheduled sub-circuits which would otherwise not see performance improvement from dynamic scheduling. Additionally, Syncopation can be made compatible with speculative execution techniques, data-dependent dynamic scheduling, and strategic multi-cycling for more aggressive performance enhancements.

3.3 Overclocking

Overclocking is a method of speculatively increasing the clock frequency of a circuit to achieve higher performance. Overclocking is a speculative technique because increasing the clock frequency beyond the maximum frequency computed during static timing analysis may produce erroneous results. Overclocking is widely employed to increase the operating speed of CPUs and GPUs by increasing the base clock multiplier or through front-side-bus overclocking [49]. In online dynamic overclocking, the clock frequency is dynamically tuned while the circuit is operating, with monitoring for errors to either ensure none have occurred or to keep the number of timing faults below an acceptable threshold for the application. In [48], online dynamic overclocking is combined with an error detection mechanism to prevent faults in superscalar CPUs. Razor [18] is another technique which enables overclocking by using Razor flip-flops, which automatically identify timing faults by double-sampling computational paths.

Overclocking is also employed to improve the performance of FPGA circuits, both with and without automated error detection mechanisms. In [45], the authors propose using overclocking as an alternative performance-boosting method to reduced-precision computing, as both methods increase performance at the cost of some acceptable threshold of accuracy loss. In applications which require no accuracy loss, automated techniques to perform overclocking have been proposed for arbitrary circuits [8][42], as well as algorithm-level error detection in convolutional neural networks [38]. Others have proposed changes to datapath architecture to reduce the severity of timing-induced failures to enable aggressive overclocking [46]. Additionally, work has been conducted to adapt the Razor timing speculation technique for paths with bidirectional communication, including computational arrays, Coarse-Grained Reconfigurable Architectures (CGRAs), and FPGAs [9].

Syncopation is not a dynamic overclocking technique, and does not require error detection hardware. We perform fine-grained STA for each state in the HLS schedule, and select a clock frequency based on the result. Since the clock frequency is not set higher than the provided value, there is no risk of corrupting the data.
However, online adaptive overclocking techniques could be used in combination with Syncopation to further increase performance.

3.4 Multiple Clock Domains

Using Multiple Clock Domains (MCDs) is a method to improve performance by operating portions of a circuit at a higher frequency, dictated by the sub-circuit critical path. MCDs require the addition of clock domain crossing hardware to prevent corruption of data sampled during a period of signal metastability. A circuit that would benefit the most from using MCDs is one which spends a significant proportion of run-time performing computations with paths much shorter than the critical path. Multiple clock domains for HLS-generated circuits have been proposed by partitioning the clock domains on the boundaries of hardware modules [43]. This minimizes the complexity of the boundary crossing hardware (due to fewer signals passing data between software functions than within them) and improves wall-clock time by 33% at an additional hardware cost (7-9%) [43]. To improve the performance of this technique, designers need to partition the software-level code intelligently. For example, the code could be partitioned so that operations with long computational paths are separated from functions which dominate program execution time, so that the amount of time operating at a higher frequency is maximized. However, since operation latency is not a typical concern for software developers, the authors of [36] proposed an automated framework for optimizing MCD boundaries based on dataflow graph representations of the source code. In addition to using multiple clock domains to achieve higher performance in some applications, MCDs can be used to slow down some domains in power-sensitive applications. In [35], an automated multi-clock domain scheme is used to slow down some parts of the design to reduce clock network complexity.

Syncopation does not use multiple clock frequencies distributed spatially across the design, as is the case with MCD. Instead, Syncopation uses the global clock routing resources to distribute a single clock signal which varies in frequency temporally. Syncopation could be used in conjunction with MCD techniques to further boost HLS-generated circuit performance. While using multiple clock domains would be very effective if one module was limiting the performance of all other modules, the performance benefit would be limited in cases where all the modules are limited by a similar critical path. In these cases, Syncopation could be used to control the clock in the domains of modules which did not benefit from a large increase in operating frequency.

3.5 Adaptive Clock Management

ACM is the dynamic adjustment of the processor clock period on a per-cycle granularity based on in-flight instructions. ACM has primarily been targeted at General Purpose Processors (GPPs) such as CPUs and GPUs [14][30][29]. In general, ACM works by adjusting the clock period dynamically based on the operation code (op-code) of the instructions in-flight, as the op-code indicates which functional units will be operating. Operations with longer critical paths through the functional units will be given longer clock periods to complete operation. In this application, ACM must be applied dynamically due to the out-of-order execution of instructions employed in GPPs. ACM has also been proposed for FPGA-based soft processors. To achieve an adjustable clock period, previous work ([20]) has proposed a hybrid clock multiplexing and clock stretching technique called Hybrid Adaptive Clock Management (HACM).
HACM amortizes the long switching delays associated with clock multiplexing by incorporating clock stretching to recover additional slack.

Syncopation is different from previous ACM techniques as it is applicable to arbitrary circuits synthesized with HLS tools, whereas previous ACM techniques relied upon known delays through GPP functional units and knowledge of in-flight instructions. Applying Syncopation ACM to arbitrary circuits is made possible by using fine-grained timing analysis of each state's path in the state machine at synthesis time. While Syncopation could be extended to arbitrary FPGA circuits not generated by HLS (expanded upon in Sect. 5.4.4), our technique currently relies upon the static schedule of instructions already produced by HLS tools. Furthermore, the Syncopation clock generator avoids the long clock switching times of clock multiplexing and provides greater dynamic range than clock stretching techniques by using a high-frequency clock divider.

Chapter 4
Syncopation

4.1 Overview

Syncopation is an adaptive clock management strategy which improves the performance of FPGA circuits. The key observations which make Syncopation possible are (1) not all circuit paths are active in every clock cycle, (2) the clock period only needs to be long enough to accommodate the computational delay of the active paths, and (3) HLS-generated circuits are particularly amenable to ACM techniques as HLS tools automatically generate instruction schedules which reveal the active paths in each clock cycle of the hardware state machine. By combining these observations with post-place-and-route static timing analysis and using a small amount of additional instrumentation, Syncopation tunes the clock period on a cycle-by-cycle basis to reduce latency and boost performance.

To demonstrate Syncopation, we use the Finite State Machine (FSM) in Fig. 4.1. The Syncopation clock generator is a clock divider which receives an integer divisor every cycle which corresponds to the desired clock period. Assuming a 500 MHz reference clock (selection discussed in Sect. 4.3), we can use the HLS information to determine the minimum integer divisor based on the active path delays within each state; each state in Fig. 4.1 is annotated with this divisor. The fastest state in this circuit has a divisor of three and the slowest has a divisor of ten; these states can operate at 167 MHz and 50 MHz, respectively. Using a single-frequency clock, this circuit would operate at 50 MHz for the entirety of its operation. However, Syncopation effectively increases the clock frequency by enabling fine-grained adjustment of the clock period to accommodate only the active computational paths.

Figure 4.1: A simple Finite State Machine (each state annotated with its clock divisor).

While Syncopation does not require changes to the HLS or place-and-route algorithms, the per-state timing information can be used to target synthesis tool effort in timing-driven place-and-route to further boost performance.
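A minimal sketch of the divisor calculation described above follows. It assumes the 500 MHz reference clock from the text and a set of per-state worst-case path delays; the delay values themselves are hypothetical, and the full divisor calculation (including control-logic constraints) is described in Sect. 4.4.

    import math

    F_REF_MHZ = 500.0                      # reference clock assumed in the text
    REF_PERIOD_NS = 1e3 / F_REF_MHZ        # 2 ns per reference cycle

    def state_divisor(path_delay_ns):
        """Smallest integer number of reference periods covering the state's path."""
        return math.ceil(path_delay_ns / REF_PERIOD_NS)

    # Hypothetical worst-case active-path delays per FSM state (ns).
    state_delays = {"s_fast": 5.1, "s_mid": 11.8, "s_slow": 19.6}

    for state, delay in state_delays.items():
        d = state_divisor(delay)
        print(f"{state}: divisor {d}, clock {F_REF_MHZ / d:.0f} MHz")
    # s_fast: divisor 3, clock 167 MHz
    # s_mid:  divisor 6, clock 83 MHz
    # s_slow: divisor 10, clock 50 MHz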
4.2 Overall Flow

The Syncopation flow is shown in Fig. 4.2. First, LegUp [10] transforms each function in the software description into an FSM. An algorithm consisting of multiple software functions will result in a circuit of multiple FSMs. The instruction schedule describing the computations performed in each state is also generated in this step.

Next, Syncopation clock generator circuitry (described in Sect. 4.3) is inserted into the LegUp-generated RTL and a Divisor Memory (DM) is allocated for each FSM. DM allocation and operation is described in Sect. 4.3.2. In addition, synthesis directives are inserted into the RTL to enable fine-grained static timing analysis (details in Sect. 4.4).

Once the Syncopation hardware has been inserted, the design is synthesized using commercial tools (we use Quartus Pro 15.0). Fine-grained STA is performed using the post-place-and-route implementation to determine circuit timing on a per-path basis. The measured path lengths are then used in conjunction with the HLS schedule to populate the DMs, as described in Sect. 4.4. The DM contents can be updated without re-synthesizing the design.

Finally, and optionally, a fine-grained timing constraints file is created and used to improve performance in successive synthesis passes, which we call Enhanced Synthesis (ES). ES is described in Sect. 4.5.

Figure 4.2: Syncopation overall flow (LegUp HLS tool, memory allocation and RTL instrumentation, commercial synthesis tool, fine-grained static timing analysis, and iterative synthesis; artifacts include C code, Verilog RTL, the schedule, a project file (.qpf), a bitstream (.sof), and timing constraints (.sdc)).

4.3 Adaptive Clock Management Circuitry

The Syncopation ACM circuitry is inserted into the user circuit RTL before place-and-route. Consisting of the DMs, divisor selection logic, and the clock generator, the Syncopation circuitry monitors the user circuit and tunes the clock period according to the circuit state. In this section, the Syncopation circuitry is described in detail.

4.3.1 Circuit Operation

A simple Syncopation circuit is depicted in Fig. 4.3. The LegUp-generated state machine is the user circuit, and the Divisor Memory, Divisor Selection Logic, and Clock Generator comprise the Syncopation circuitry.

Figure 4.3: A Syncopation Circuit (LegUp-generated state machines, each with a Divisor Memory, feeding Divisor Selection Logic and a Clock Generator).

During circuit operation, the state signals from each state machine are used to address the corresponding DM. The DMs are populated with divisor values. The values obtained from the memory reads are propagated to the Divisor Selection Logic, which determines the longest active path in the design across all the state machines and selects the corresponding divisor to tune the Clock Generator.

To ensure stability of our clock signal, we require that the clock divisor is set before the beginning of the next clock cycle. Since the divisor memories are registered, the entire clock tuning procedure requires three clock cycles: one to initiate the DM read, one to select the divisor value, and a third in which the clock is tuned using that value. This procedure implies that the circuit state is known two cycles in advance, which is not possible in designs with branch instructions. To successfully implement the divisor lookup system, we propose two architectures: one which performs this lookup by monitoring the Current State of the user circuit, and another which monitors the Next State of the circuit. In our studies, we implemented and tested both of these configurations.
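The selection performed by the Divisor Selection Logic reduces, in software terms, to taking the maximum over the divisors returned by each FSM's DM in a given cycle: a larger divisor means a longer required clock period, so the largest one covers the longest active path. The sketch below illustrates only that reduction; the FSM names and values are hypothetical.

    # Sketch of the Divisor Selection Logic across several FSMs: each DM read
    # returns a divisor for its FSM, and the largest divisor (the longest
    # active path, hence the longest required clock period) wins.
    def select_divisor(dm_reads):
        return max(dm_reads.values())

    # Example: values a set of DM reads might return in one cycle (hypothetical).
    print(select_divisor({"fsm_main": 4, "fsm_accel": 9, "fsm_io": 3}))  # -> 9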
Electing to use the current state signal to monitor the user circuit means the exact circuit state cannot be determined in advance. To ensure that the clock is tuned to accommodate the longest path, the divisor values packed into the DM correspond to the slowest possible state that the state machine may be in, given the current state. This is determined by inspecting the HLS-generated state machine. Fig. 4.5 depicts the DM contents for the example state machine in Fig. 4.1. Each address corresponds to a state, and the corresponding line contains the tags and divisors used to tune the clock generator. Note that this implementation of Syncopation does not require storing all the state divisors in the DMs due to the selection of the maximum divisor value. The maximum divisor is determined offline through inspection of the state machine and the results from static timing analysis. In this example, the divisor values for States 1, 2, 4, and 5 are not present in the DM.

Figure 4.5: Current State divisor memory packing based on the Finite State Machine in Fig. 4.1.

The maximum divisor selection results in a slower clock period (on average) compared to the ideal operation. To demonstrate operation of Current State Syncopation, consider the example state machine in Fig. 4.1. In this example, we will walk through how Current State Syncopation uses the state tags and divisors from the DM to tune the clock period for Cycle 3 and Cycle 4.

To determine the divisor value for Cycle 3, the memory read to line 0 is initiated in Cycle 1 and the divisor is selected and sent to the clock generator in Cycle 2. In Cycle 2, the current state is State 1, and the next state divisors are 2, 9, and 10. Since we must accommodate the longest computational path for Cycle 3, the divisor value selected to send to the clock generator is 10. Finally, in Cycle 3, the clock is tuned to have a period corresponding to a divisor value of 10. However, the computational path would not utilize the entire clock period unless the circuit is in State 3 in Cycle 3. This corresponds to a longer-than-required clock period compared to an implementation capable of determining exactly the required divisor value. In particular, State 2 can operate 5× faster than State 3, resulting in 80% of the clock period being unutilized if the state machine transitions to State 2.

Determining the clock divisor for Cycle 4 follows a similar procedure. In Cycle 2, the DM read to line 1 is initiated. In Cycle 3, the set of tags and divisors corresponding to the possible states State 2, State 3, and State 4 are obtained. For each of these states, the corresponding divisor is the worst-case divisor for the next cycle. For State 2, this is a divisor of 8; for State 3, a divisor of 8; and for State 4, a divisor of 3. Similarly to Cycle 3, the divisor selection is conservative, and the worst-case divisor corresponds to a longer clock period than the active path if the state machine transitions from State 2 to State 5.
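To make the offline packing step concrete, the following Python sketch shows one way the Current State DM entries could be computed from the per-state divisors and the state machine's transition map. The toy FSM, divisor values, and function names here are hypothetical illustrations, not the exact structure of Fig. 4.1 or code from the Syncopation scripts.

    # Sketch: offline packing of a Current State Divisor Memory (DM).
    # For each current state, the DM line holds, for every possible next
    # state, the worst-case (maximum) divisor among that next state's own
    # successors, since the state two cycles ahead is not known exactly.

    def pack_current_state_dm(transitions, divisors):
        """transitions: dict state -> list of possible next states
           divisors:    dict state -> minimum integer divisor for that state
           returns:     dict state -> list of (next_state_tag, max_divisor)"""
        dm = {}
        for state, next_states in transitions.items():
            line = []
            for ns in next_states:
                # Worst case for the cycle after next: the slowest successor of ns.
                worst = max(divisors[s] for s in transitions[ns])
                line.append((ns, worst))
            dm[state] = line
        return dm

    # Hypothetical toy FSM (illustrative only).
    transitions = {0: [1], 1: [2, 3, 4], 2: [5], 3: [5], 4: [5], 5: [0]}
    divisors    = {0: 5, 1: 6, 2: 2, 3: 10, 4: 9, 5: 7}

    for state, line in pack_current_state_dm(transitions, divisors).items():
        print(f"DM line {state}: {line}")

In the Next State implementation described below, the same loop would simply store each possible next state's own divisor rather than the maximum over its successors, since the exact state is resolved one cycle earlier.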
Figure 4.6: The general form of a Finite State Machine.

The three-cycle clock tuning procedure using the current state signal as the selector operates as follows, assuming the general form of the state machine depicted in Fig. 4.6.

1. In Cycle i, the current state of the state machine, which we will denote as state_j, initiates a read to line j of the DM. The memory line j contains a set of N tags and N divisors, where N is the possible number of next states for the state machine during Cycle i+1 and the set of tag values is { state_{j+a} | 1 ≤ a ≤ N }.

2. In the next Cycle i+1, the circuit is in some state state_{j+n} and the DM line is obtained from memory. The value state_{j+n} is equal to one of the tag values returned from the DM, and is compared to the tags to select the divisor value to propagate to the clock generator. In this case, the selected divisor will correspond to the longest path delay among the possible states for Cycle i+2. For example, given M possible states for Cycle i+2 corresponding to M possible divisor values, the max divisor across the set of states { state_{j+N+b} | 1 ≤ b ≤ M } corresponds to the longest path. For simplicity of the Syncopation hardware, this selection is determined offline before the tags and divisor values are packed into the DMs.

3. Finally, in Cycle i+2, the circuit is in some state state_{j+N+m} and the clock is tuned to the period determined by the max divisor, which corresponds to the longest path among all M possible states. Non-ideal performance will be achieved in cases where the required divisor for state_{j+N+m} is less than the max divisor.

Figure 4.7: Next State divisor memory packing based on the Finite State Machine in Fig. 4.1.

Next State Syncopation

In the Next State implementation of Syncopation, the next state signal of each state machine is connected to the DM address port and the Divisor Selection Logic, as shown in Fig. 4.8. Unlike the Current State implementation of Syncopation, the Next State implementation allows the clock period to be tuned to exactly the required clock period.

Figure 4.8: Next State based divisor selection logic.

The packed tags and divisor values for the state machine in Fig. 4.1 are shown in Fig. 4.7. Like the current state DMs, each line corresponds to a state and contains both tags and divisors. However, this implementation of Syncopation captures all the divisor values and corresponding tags.

The three-cycle tuning procedure using the next state signal as the selector operates as follows, again referencing the generic state machine depicted in Fig. 4.6.

1. In Cycle i, the next state signal of the state machine, which has a value we will denote as state_{j+n} where 1 ≤ n ≤ N, initiates a read to line j+n of the DM. The memory line j+n contains a set of M tags and M divisors, where M is the number of possible states the circuit can be in during Cycle i+2 and the set of tag values is { state_{j+N+b} | 1 ≤ b ≤ M }.

2. In Cycle i+1, the DM line is obtained. During this cycle, the next state signal will resolve to a state value for Cycle i+2, state_{j+N+m} where 1 ≤ m ≤ M. By comparing this value to the state tags obtained from the DM, the divisor corresponding to the active path length can be determined and sent to the clock generator.

3. In Cycle i+2, the clock period is tuned to the minimum clock period required to accommodate the active paths for the current state state_{j+N+m}.
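To illustrate why the two-cycle-early read still yields the exact divisor, the following Python sketch models the three-cycle lookup as a simple software pipeline over a state trace. It is a behavioural illustration only, not RTL from the Syncopation implementation, and it uses the same hypothetical toy FSM as the earlier sketch (redefined here so the example is self-contained).

    # Behavioural model of Next State divisor selection (illustrative only).
    # trace[i] is the state occupied in cycle i; the next-state signal in
    # cycle i therefore carries trace[i+1].

    transitions = {0: [1], 1: [2, 3, 4], 2: [5], 3: [5], 4: [5], 5: [0]}
    divisors    = {0: 5, 1: 6, 2: 2, 3: 10, 4: 9, 5: 7}

    def next_state_divisor_trace(trace):
        applied = {}
        for i in range(len(trace) - 2):
            # Cycle i: next-state signal is trace[i+1]; read the DM line for it,
            # which holds tags/divisors for the possible states of cycle i+2.
            line = {tag: divisors[tag] for tag in transitions[trace[i + 1]]}
            # Cycle i+1: next-state signal resolves to trace[i+2]; tag compare.
            selected = line[trace[i + 2]]
            # Cycle i+2: clock generator is tuned with the selected divisor.
            applied[i + 2] = selected
            assert selected == divisors[trace[i + 2]]   # exact, not worst-case
        return applied

    print(next_state_divisor_trace([0, 1, 3, 5, 0, 1, 2, 5]))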
By using the next state signal of the state machine, we can determine exactly the minimum clock divisor for the state, even though the memory read is initiated two cycles in advance.

While Next State Syncopation enables additional performance over Current State Syncopation by enabling selection of exactly the required divisor instead of the max divisor, there is also some possibility that performance degradation can occur. Consider Fig. 4.4 and Fig. 4.8. In these circuit diagrams, we can clearly see the connection of the Current State and Next State signals from the user circuit leading to the Syncopation instrumentation. However, there is one key difference: in Current State Syncopation, the signal is connected directly from the state register, and the propagation delay of the Current State signal to the instrumentation is only wiring. This is not the case in Next State Syncopation, where the Next State computational logic lies on the path between the state register and the Syncopation instrumentation. In cases where the user state machine may have relatively simple computations in a state, this state path may be long. By adding additional instrumentation to this path, some designs which are already performance-limited by the complexity of the control logic may see reduced performance using Next State Syncopation.

4.3.2 Divisor Memories

Clock generator divisor values are determined through post-synthesis fine-grained timing analysis. In order to utilize these values during circuit operation, they are packed into the DMs to enable single-read access to the values corresponding to all possible circuit states. The DMs are sized statically before place-and-route for each FSM, and the memory contents are updated with divisor values after timing analysis without re-synthesizing the design.

The values used in each DM depend on whether the Current State or Next State implementation of Syncopation is being used. The size of the DMs is the same across both Syncopation variants. In both cases, the DMs are sized statically before place-and-route by inspection of the state machine. For each FSM, the DM size in bits is:

bits = N_states × max(N_next_states) × (log2(N_states) + D)

where N_states is the number of states in the state machine, corresponding to the number of lines in the memory; N_next_states is a set containing the number of next states reachable from each state in the state machine; max(N_next_states) is the maximum number of next states from the set N_next_states, corresponding to the number of elements contained in each memory line; log2(N_states) is the number of bits required to represent the state tag; and D is the divisor width in bits.

One limitation of this approach is that it may not size the memories efficiently using other common state encoding techniques, such as 1-hot encoding. This could be rectified using an encoder on the memory address port. However, the additional logic required to implement the encoder may reduce the benefit of other encoding techniques altogether. Since LegUp HLS encodes states with integer values, we did not perform studies to evaluate the performance of other encoding methods or evaluate other DM addressing techniques. This would be an interesting area for future work.

4.3.3 Divisor Selection Logic

The Syncopation clock period is directly related to the state of the user circuit, so the DMs are addressed by the state bits.
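Before turning to how a divisor is chosen each cycle, the short Python sketch below applies the sizing rule from Sect. 4.3.2 to a hypothetical FSM; the state counts and the 4-bit divisor width are illustrative assumptions, and the integer state value is used directly as the DM address.

    import math

    def dm_size_bits(n_states, next_state_counts, divisor_width=4):
        """Divisor Memory size in bits for one FSM, per the formula above."""
        tag_bits = math.ceil(math.log2(n_states))   # bits to encode a state tag
        line_entries = max(next_state_counts)       # widest line in the memory
        return n_states * line_entries * (tag_bits + divisor_width)

    # Hypothetical FSM: 8 states with the per-state successor counts listed below.
    print(dm_size_bits(8, [1, 3, 1, 1, 1, 1, 2, 1]))   # 8 * 3 * (3 + 4) = 168 bits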
During each cycle, a41divisor is read from the memory address corresponding to the state of eachstate machine. This lookup method generates multiple candidate divisors:one for each module instantiated on chip. The role of the divisor selectionlogic is to determine which of these candidate divisors will be propagated tothe clock generator for clock tuning.The simplest method to select the appropriate candidate divisor is touse a max() function to select the largest candidate divisor. Using thistechnique, the generated clock will always have a period long enough to ac-commodate the active path, ensuring that circuit operation will be correct,even with instantiated modules performing operations in parallel. However,the max() divisor selection technique assumes that all instantiated mod-ules are performing computations in every clock cycle. This assumption isoverly conservative in our implementation, since LegUp-generated circuitsuse a start-finish protocol to perform function calls. Using the start-finishprotocol, only one module instantiation is performing computations in anygiven cycle, and all other modules are waiting to be called or are waiting forcomputed results to be returned.In order to improve performance over the generic max() selection logic,we implement a mechanism to identify function calls and precisely determinewhich module instantiation is currently performing operations. To do this,the HLS schedule is inspected pre-synthesis to identify function call wait-states. These wait-states indicate when a “caller” module is waiting on theresult of a computation being performed in a “callee” function. Then, duringcircuit operation, the state of each module is monitored to dynamicallyevaluate whether the “caller” or one of the “callees” is currently operating.42mainstartmain:func1finishstartmain:func1:func2startfinishmain:func3startfinishfinishRegular Operationfunc1() callfunc2() callfunc3() callLegendFigure 4.9: A sequence diagram describing the operation of a High-LevelSynthesis-generated circuit.43As an example, consider the sequence diagram in Fig. 4.9. The mainfunction of the circuit is the top level C function, and every other moduleis indicated hierarchically under the top-level design. In this case, the mainfunction instantiates two modules (func1 and func3) and the func1 moduleinstantiates another module (func2). During circuit operation, each calleemodule may eventually issue a function call to one of the instantiated sub-modules. In this case, the first function call is issued by main to func1.While the submodule func1 is operating, the main state machine remainsin a wait-state until the finish signal is obtained. Similarly, func1 issues afunction call to func2 at some point during circuit operation. While thisoccurs, func1 remains in a wait-state. Then, once both func2 and func1finish operating, the main function continues operation until a function callis issued to func3. Finally, func3 returns and main finishes operation.To implement divisor selection logic for this circuit, a small amount ofinstrumentation is inserted in every caller function; in this case, the callerfunctions are main and func1. 
This instrumentation monitors the next statesignal of the caller function state machine to determine whether the divi-sor for the callee or the caller function should be propagated to the clockgenerator.4.3.4 Clock Generator Circuitry and Clock DistributionThe Syncopation clock is produced by a custom soft-logic clock generator.This custom clock generator is capable of producing clock signals at a widerange of frequencies (50–250 MHz) and is capable of changing the clockfrequency, glitch-free, on a per-cycle basis. Glitches are transient logic errors44which can cause corruption of circuit data when a clock signal glitch causesflip-flops to prematurely register signals which have not yet settled to theirfinal computed value.While there are other clock generating techniques (such as Phase-LockedLoops (PLLs), clock stretching, and clock multiplexing) available for FP-GAs, these options do not meet the specifications for our implementationeither due to limited dynamic range or long switching times. For exam-ple, dynamically reconfiguring PLLs requires tens of cycles to change theconfiguration and lock the clock output. Clock multiplexing also requiresmultiple cycles to switch between clock sources to match clock phases andprevent glitches. Clock stretching techniques, such as in [11], do not incura long switching overhead, but can only change the clock period by 25%.In previous work proposing ACM, hybrid clocking schemes were adaptedusing both clock stretching and clock multiplexing techniques to overcomethe limitations of both [20].Our clock generator is the 50% duty cycle clock generator from [40],shown in Fig. 4.10. We choose the 50% duty cycle implementation to ac-commodate dual-edge logic in some memories, although this requirementis likely conservative for the majority of applications. Our implementationuses a 4-bit integer divisor and a 500 MHz reference clock. The referenceclock drives a counter which is provided along with a divisor value to twocomparators. The first comparator determines the clock period by resettingthe counter once the divisor value is reached by the counter. The secondcomparator determines the clock phase by comparing the counter to half ofthe divisor value. At the output of the clock generator, labelled “clock”,45the signal is promoted to the global clock distribution network resourcesavailable on the FPGA. This promotion is automatically performed duringplace-and-route and is confirmed by analysing the synthesis messages foreach benchmark.The clock generator produces a clock signal which can vary in period on aper-cycle basis. However, there are some timing requirements which must besatisfied to prevent glitches. First, to guarantee that an update to the divisorvalue does not cause a glitch on the output of the clock phase comparator, weregister the divisor so it can not be updated mid-cycle. Second, we require aminimum divisor value of 2 to prevent a race condition between the outputsof the period-reset comparator and the phase comparator.It is possible to implement a more aggressive form of Syncopation whichwould forgo our requirement to have the clock divisor arrive before therising edge of the clock period, therefore relaxing the three-cycle clock tuningprocedure presented in Sect. 4.3.1. In fact, the divisor only needs to arrive atthe clock generator in time to properly tune the falling edge of the 50% dutycycle phase, although insuring that the Syncopation control logic alwayssatisfies this timing is more complex than our implementation. 
Additionally, it is likely that the 50% duty cycle requirement could be relaxed entirely, given that no dual-edge triggered logic is used in the design. This would relax the divisor selection procedure and timing even further.

This clock generator produces frequencies between 33 MHz and 250 MHz, which is sufficient for the observed frequencies in the CHStone [51] benchmark circuits (63–197 MHz). We did not evaluate clock generators with counter bits > 4, as the additional range provides no further benefit for the circuits we surveyed.

Figure 4.10: Integer divisor clock generator circuit.

4.4 Fine-Grained Static Timing Analysis

Once the HLS-generated RTL and instruction schedules have been analyzed and modified to include the DMs, Divisor Selection Logic, and clock generator, the design is synthesized using vendor place-and-route tools. No changes are made to the place-and-route algorithms. Once place-and-route finishes, timing analysis is performed to evaluate design performance. For a statically-clocked design, STA involves identifying the longest path in the design and determining the path delay to calculate the maximum operating frequency. For a Syncopation design, timing analysis also includes what we refer to as fine-grained static timing analysis, which evaluates cycle-specific path timing to generate data for the Syncopation DMs.

Fine-grained STA is a three-stage process. First, timing between registered instruction nodes in the design is measured for each state in each state machine. Once the timing for these paths has been measured, we update the project timing constraints such that the per-state path delays satisfy the design timing requirements. Once these constraints are in place, we can request the maximum operating frequency of the unconstrained paths in the design on a per-module granularity. Typically, the longest unconstrained path consists of State Machine (SM) control logic which cannot be attributed to specific cycles in circuit operation. This maximum frequency is then used to limit all frequencies in the design to ensure that timing requirements for the control logic will not be violated. Finally, these results are combined and translated into per-state divisor values and packed into the DMs.

4.4.1 Per-State Timing and Synthesis Directives

Per-state timing is estimated using a custom Tcl script which is automatically generated by analyzing the LegUp-generated instruction schedule and the RTL. In the instruction schedule, each instruction is analyzed to identify operands that are produced or consumed. These operands are then matched to the corresponding signal names in the Verilog RTL, which are used to identify signals during post-place-and-route timing analysis. To determine the path delay between these operands for each state, Syncopation automatically generates a Tcl script which requests timing data between all operands in a state, and labels the resulting data with the corresponding state. We perform our requests using the Quartus Tcl command report_timing, which provides slack data between two named registers in the post-synthesis netlist. Since all registered nodes in the design are clocked using our promoted Syncopation clock signal, this measurement automatically includes conservative timing estimates to account for process variation and clock skew.

Ideally, it would be possible to request timing data to or from any named register in the RTL using Quartus Tcl commands. However, this is not always the case.
During synthesis, extensive register optimizations takeplace which merge, duplicate, and rename registers in the RTL [27]. Thepost-synthesis netlist does not retain information about optimized registers,and therefore it is not possible to obtain timing information about opti-mized nodes. However, if synthesis directives are inserted into the LegUp-generated RTL to prevent register optimizations it is possible to obtain thistiming data at the cost of performance and resource utilization. For thisreason, we have included synthesis preserve and synthesis noprune direc-tives to retain information about registered nodes used to compute resultsof HLS instructions in the RTL. The preserve attribute prevents removalor minimization of a register without preventing other optimizations suchas duplication; and the noprune attribute prevents removal of fanout-freenodes, which we require to estimate timing for intermediate computationalsteps [25]. The noprune directive is only included for a subset of the regis-ters; in particular, the intermediate computational steps used for retimingof multi-cycle operations such as multiply operations are retained.Table 4.1 shows the overhead of inserting these synthesis directives intothe baseline CHStone benchmarks.Inserting the synthesis directives into the baseline benchmarks reducestheir performance by 10.8% (geomean), and more than doubles ALM utiliza-49Table 4.1: The Area and Performance Impact of Synthesis DirectivesBaseline SD BaselineFMax ALMs FMax ALMs(MHz) (%) (MHz) (%) (%)adpcm 87.3 16.6 64.3 -26 32.1aes 90.6 15.8 86.5 -5 27.9bf 124.2 7.0 117.7 -5 22.2dfadd 135.0 4.5 112.3 -17 18.9dfdiv 102.9 9.9 103.4 +0 30.3dfmul 98.8 9.9 93.0 -6 12.8dfsin 96.4 26.2 83.4 -13 62.1gsm 94.4 10.2 78.5 -17 25.0jpeg 72.6 52.2 70.2 -3 75.1mips 89.3 3.3 81.5 -9 9.9motion 101.9 17.7 81.5 -20 26.0sha 151.4 4.2 142.7 -6 10.1Geomean 101.7 9.9 90.7 -10.9 24.4tion. However, the performance of the statically clocked baseline circuits isdetermined solely by the critical path delay. Since the performance of Syn-copation circuits is not solely determined by the critical path delay, it is stillpossible to regain and exceed the performance of the baseline circuits with-out synthesis directives using our technique. In Chapter 5, we will quantifyto what extent this is true.4.4.2 Control Logic ConstraintsOnce the per-state timing is determined, it is necessary to determine themaximum operating frequency of the control logic in each module to ensuretiming requirements are not violated. As an example, consider Program 4.1.In Program 4.1, per-state path delays have been determined by fine-50Program 4.1 A Syncopation schedule with various path delays and afunction-call wait state.Top Design:Module A:State 1: Delay 6.67 nsState 2: Delay 5.00 nsState 3: Function call B()State 4: Delay 11.11 nsModule B:State 1: Delay 8.33 nsState 2: Delay 8.33 nsState 3: Delay 5.00 nsState 4: Delay 5.00 nsgrained static timing analysis. However, these delays were determined basedon the timing for instructions in each state which produce and consumeoperands, not the control logic of the circuit state machines. To ensurethat the individual state timing doesn’t violate the minimum delay for thecontrol logic, we separately measure the longest control logic delay of eachmodule. To do this, we generate an Synopsis Design Constraints (SDC) filecontaining set max delay constraints for the paths already measured duringfine-grained static timing analysis to ensure that these paths won’t generateexceptions. 
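The generated SDC content is simple in shape. The Python sketch below shows how such set_max_delay constraints could be emitted for the paths measured in the previous step; the record format, file name, and register names are hypothetical assumptions, not the actual Syncopation scripts.

    # Sketch: emit set_max_delay constraints for paths already measured by
    # fine-grained STA, so they are excluded from the later query for the
    # longest remaining (control logic) paths in each module.

    measured_paths = [
        # (module, state, from_register, to_register, measured_delay_ns)
        ("main",  "state_1", "main_inst|a_reg",   "main_inst|sum_reg", 6.67),
        ("func1", "state_1", "func1_inst|x_reg",  "func1_inst|y_reg",  8.33),
    ]

    with open("syncopation_measured_paths.sdc", "w") as sdc:
        for module, state, src, dst, delay_ns in measured_paths:
            sdc.write(f"# {module}, {state}\n")
            sdc.write(
                f"set_max_delay -from [get_registers {{{src}}}] "
                f"-to [get_registers {{{dst}}}] {delay_ns:.2f}\n"
            )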
Then, we generate timing reports requesting the longest delays within each module that were not measured during fine-grained timing analysis. Then, the delays obtained in the reports are used to constrain the timing of individual states according to the following rule:

d_updated = max(d_sta, d_control)

where d_sta is the delay measured during fine-grained static timing analysis, d_control is the longest delay in the state machine control logic, and d_updated is the updated clock period for the state which will not violate timing of the state machine. Program 4.2 shows how the example program in Program 4.1 would be updated after the control logic delays are measured.

Program 4.2 A Syncopation schedule including control logic constraints.
Top Design:
  Module A: Longest Control Logic Delay 3.33 ns
    State 1: Delay 6.67 ns
    State 2: Delay 5.00 ns
    State 3: Function call B()
    State 4: Delay 11.11 ns
  Module B: Longest Control Logic Delay 5.55 ns
    State 1: Delay 8.33 ns
    State 2: Delay 8.33 ns
    State 3: Delay 5.55 ns *
    State 4: Delay 5.55 ns *

To constrain timing for each state and avoid control logic timing violations, the first step is to measure the longest control logic delay for each module. In this circuit there are two modules instantiated: Module A and Module B. The longest control logic path for each of these modules has been measured as 3.33 ns and 5.55 ns, respectively. Now, we can update the state delays accordingly. As indicated in Program 4.2 with asterisks, the state delays for State 3 and State 4 in Module B have been lengthened, since using the previously measured delays of 5.00 ns violated control logic timing requirements. Instead, these states will be executed with a clock period of 5.55 ns.

After the timing constraints are updated for all states in the design, the new per-state delays are converted into divisor values and packed into the DMs.

4.4.3 Divisor Calculation

Given a list of maximum frequencies per state, constrained to both the maximum per-module frequency and the maximum per-design frequency, the divisor values for the clock generator can be calculated. The divisor value corresponding to a state is calculated using the state frequency and the reference clock frequency for the clock generator. The divisor value D is defined as:

D = ⌈F_ref / F_state⌉

where F_ref is the reference clock frequency for the clock generator and F_state is the maximum clock frequency for a given state. A non-integer result is "snapped" to the larger integer, which corresponds to a longer clock period. Once the divisor values for all states have been computed and packed into the DMs, the circuit is operational.

4.5 Enhanced Synthesis

Syncopation circuits are able to gain performance by adjusting the clock frequency according to the individual clock cycle. However, the integer clock divider does result in non-ideal performance compared to a clock divider with infinite frequency tuning granularity. Ideally, the place-and-route tool would be aware of the integer clock divider and attempt to optimize each path to meet timing for a higher frequency bucket. It would be possible to implement this custom place-and-route algorithm using an open-source tool chain; however, we opted to leave a custom tool chain to future work.
Instead,we show that we can achieve better performance without implementing acustom algorithm by using vendor tools, fine-grained timing targets, and anadditional synthesis pass using a technique we call Enhanced Synthesis.Generally, timing-driven place-and-route tools target effort to optimizethe longest paths in the circuit in order to shorten the critical path length.The critical path optimization target improves the performance of statically-clocked designs because the clock frequency is only dependent on the timingof the longest paths in the design. Since the performance of Syncopationdesigns is not entirely dependent on the critical path delay, the additionaleffort to optimize the most-critical paths may be disproportionate to thepossible gain. Instead, synthesis effort would be better targeted to optimizethe delay of every state’s computational path to meet timing for the higherfrequency produced by the clock generator.As an example, consider a path (which we will denote as Path A, withdelay j) which is in a design with critical path J . In this example, j < Jso Path A is not a critical path. In a statically clocked design, Path Awould be executed at the frequency Fmax = 1/J , as no instrumentation isin place for clock tuning. By adding the Syncopation instrumentation to thedesign and assuming an ideal (infinite granularity, for now) clock generator,we can calculate the minimum clock period required to meet timing forPath A and tune the clock to the corresponding frequency FA = 1/j in thecorresponding states. In this case, since j < J , we will achieve lower delay(higher performance) by using the Syncopation design.54Now consider the same circuit, but assume an integer divider clock gen-erator. In the best case, Path A meets the following requirements: (1) thepath length corresponds to the minimum-delay implementation achievableby the synthesis tool (i.e. cannot be significantly more optimized by increas-ing synthesis effort on this path) and (2) the maximum operating frequencyFA = 1/j corresponds exactly to a frequency generated by the clock gener-ator. This ideal case makes additional synthesis effort to optimize this pathredundant.In reality, a path in the circuit implementation is unlikely to meet thesetwo requirements. Paths are likely to fail the first requirement becausethere may be many ways to implement a path, and it is unlikely to alwaysselect the lowest-delay implementation. Furthermore, the lowest-delay im-plementation may only be determined by the synthesis tool given sufficientmotivation to optimize a path with a delay lower than the critical path de-lay. Paths are also likely to fail the second requirement for similar reasons;without sufficient incentivization to achieve a target delay, paths delays areequally likely to be just longer than a clock period generated using our tech-nique, resulting in performance loss due to “snapping” of the clock period toa longer value to prevent timing violations. In the same context of the pre-vious example, consider another path (Path B) with delay k < J . Since thesynthesis tool is not incentivized to optimize this path, the implementationmay not be optimized to minimize the path delay. 
Furthermore, since thesynthesis tool has no awareness of the integer clock divider, the worst-casefor Path B is that the path delay k is slightly longer than a clock period pro-duced by the clock generator, and therefore incurs the maximum overhead.55For example, our 500 MHz, 4-bit clock generator produces clocks with 4 nsand 6 ns periods, but in the case where k = 4.1 ns, the clock period will“snap” to 6 ns, a 46% increase in delay.For Syncopation circuits, a better synthesis tool would target resourcesand computational effort to optimize paths close to the boundary betweengenerated frequencies, not just paths with the longest delays in the design.To achieve targeted performance improvement, we automatically generateper-path synthesis timing constraints using the results of the fine-grainedstatic timing analysis in an additional synthesis pass. The timing con-straints improve the place-and-route implementation by setting additionalconstraints for all paths, such that the synthesis tool has target timing con-straints for all paths in the design, instead of just for paths which violatethe critical path frequency. In our scripts, the individual path constraintsare set using the Tcl command set max delay. Since Syncopation uses aninteger clock divider, we set the maximum delay for each path to the next-fastest clock period, relative to the previously calculated per-state timing.Using the previous example, this would mean Path B with delay k = 4.1 nswould have a target maximum delay of 4 ns.It is possible additional performance could be obtained using more ag-gressive timing constraints. In our existing procedure, we set timing con-straints for paths to match the next highest frequency bucket for each path.To achieve further improvement, it would be necessary to optimize pathssuch that they meet an even higher frequency bucket. We investigated thisexperimentally, and found that this led to < 1% additional performance overEnhanced Synthesis.56Once the timing constraints files are generated, a second synthesis pass isperformed to improve the circuit implementation, and fine-grained synthesisis performed again to update timing analysis results and the DM contents.4.6 SummaryIn this chapter, we presented Syncopation: an ACM technique which boostsperformance of HLS-generated circuits by tuning the clock period on a cycle-by-cycle basis according to the operations currently being performed. Atcompile time, the RTL and instruction schedules produced by the HLS tool(LegUp) is profiled and instrumented with the Syncopation hardware, in-cluding the divisor memories, divisor selection logic, clock generator, andsynthesis directives. We presented two methods of instrumenting the usercircuit to select the appropriate clock divisor: Next State and Current StateSyncopation. After synthesis, fine-grained static timing analysis is per-formed to determine circuit timing requirements on a per-state, per-module,and per-design basis. These timing measurements are then used to calculatedivisor values which are packed into the divisor memories to tune the clockgenerator at run-time. Additionally and optionally, the per-state timing con-straints are also used to target resources during an additional synthesis passto minimize delay for paths in the design in a technique we call EnhancedSynthesis. The techniques described in this chapter do not require changesto the FPGA architecture, HLS tools, or place and route algorithms. 
In the next chapter, we evaluate the performance and overhead of Syncopation by measuring the frequency, area, and memory requirements across a set of benchmarks.

Chapter 5
Adaptive Clock Management Using Syncopation

In Chapter 4, we introduced Syncopation, an adaptive clock management strategy for HLS-generated circuits. Syncopation uses a memory-based divisor reference system and dynamic selection logic to select a pre-calculated divisor value, and tunes the clock period according to the divisor value to accommodate the critical path of each circuit state on a cycle-by-cycle basis. In this chapter, we evaluate the performance benefit achieved using Syncopation for general HLS circuits.

We begin with an introduction to the methodology, evaluation metrics, and the studied configurations of Syncopation. Then we present the experimental set-up used to validate circuit functionality in hardware and simulation. Finally, we present the results achieved using Syncopation and summarize our performance improvement on generic HLS circuits.

5.1 Introduction

To evaluate Syncopation, we use the CHStone [51] HLS benchmarks compiled with LegUp 4.0 and synthesized with Quartus Pro 15.0 Web Edition, targeting Intel Cyclone V FPGAs (5CSEMA5F31C6N). To ensure validity of the HLS instruction schedule, we turned off retiming to implement Syncopation circuits. The baseline circuit performance is measured with retiming enabled.

5.1.1 Evaluation Metrics

Typically, the performance of digital circuits is quantified by Fmax, which is the maximum frequency at which the circuit can operate. In normal STA, the Fmax is determined from the critical path delay and is calculated as:

Fmax = 1 / (t_delay + t_setup + t_skew)

where t_delay is the combinational delay of the critical path logic, t_setup is the minimum time before the rising edge of the clock that the data must be stable at the input of a flip-flop in order to be latched correctly, and t_skew is the difference in time that a clock edge arrives at a destination component compared to the source.

Fmax is not sufficient to describe the performance of Syncopation circuits, where the clock period changes every cycle according to the delay of the active paths. Instead of using Fmax, we propose describing the performance of Syncopation circuits using an effective Fmax, which we denote Feff. Feff is defined as:

Feff = (t_sta / t_sync) × Fmax

where the wall-clock time for a statically clocked circuit is denoted as t_sta and the wall-clock time for a Syncopation circuit is denoted as t_sync. Since the wall-clock time required for Syncopation circuits to complete operation depends on which states are active, calculating Feff requires simulating the designs using sample inputs. This is different from evaluating the performance of statically clocked circuits, which only require STA to determine Fmax, since the clock frequency is independent of which states are activated.

In terms of performance to the end-user, an improvement in effective frequency is proportional to the improvement in operation latency for a set of inputs applied to a benchmark circuit. Specifically, a 10% increase in effective frequency corresponds to a 10% decrease in operation latency for the same benchmark and input vector.

5.2 Validation

To validate the operation of the Syncopation circuits, we programmed the target FPGA with all the benchmark circuits presented in this dissertation.
Additionally, we confirmed that timing was properly constrained forall paths in simulation using vendor static timing analysis tools.In this section, we describe how we confirmed functionality of the Syn-copation circuits was confirmed both in hardware and using simulation.5.2.1 Validation in HardwareIn hardware, circuit operation was validated by executing all circuits on aknown set of inputs and confirming receipt of the expected result. Fig. 5.1depicts the block diagram of our validation setup. At the beginning ofoperation, the LegUp-generated circuit reads the input data from an on-61LegUp-GeneratedCircuitInput MemoryOutput MemoryChecksumCalculation!= ?ExpectedChecksum ErrorFigure 5.1: Validation circuit for Syncopation.chip memory. Then, circuit operation is performed as usual, with the clockcontrolled by Syncopation. Finally, the result is written back into an outputmemory and a checksum is generated. The checksum is then compared tothe expected value determine using simulation. If the produced checksumdoes not match the expected value, an error signal is generated and recorded.To ensure that circuit functionality is reliable, our validation setup alsoautomatically reset and repeated circuit operation 1, 024 times. A counterwas used to record the total number of errors observed over all executions foreach benchmark. The number of clock cycles each benchmark was observedfor (across all Syncopation implementations) as well as the number of errorsobserved for each benchmark is included in Table 5.1.By evaluating the performance of all our benchmark circuits over thismany clock cycles, we seek to confirm three hypotheses: first, that the fine-grained STA technique used to determine the minimum Syncopation clock62Table 5.1: Validation Results in HardwareTotal Clock Cycles Observed Error Countadpcm 40157184 0aes 27801600 0bf 503580672 0dfadd 1978368 0dfdiv 5790720 0dfmul 666624 0dfsin 174305280 0gsm 14635008 0jpeg 3998300160 0mips 15470592 0motion 25377792 0sha 512305152 0period is accurate; second, that the control logic for the Syncopation logicand the LegUp-generated state machines can be properly reset; and third,that glitches and jitter on our clock signal do not cause failures during typicalcircuit operation.In our validation studies, we did not perform rigorous investigation to de-termine test vectors which would activate the longest data-dependent pathsof each circuit in the design to attempt to cause failures of each circuit. In-stead, we chose to use simulation techniques to evaluate whether the pathsassociated with each state had positive slack relative to their Syncopationclock period. Since the STA techniques employed by Quartus use conser-vative timing estimates for both the circuit implementation and the clockdistribution networks, passing simulation is equivalent to ensuring that weare not overclocking any paths within our design.635.2.2 Validation in SimulationTo validate circuit operation using simulation, SDC files were generatedwhich used set max delay commands to describe the timing requirementsfor all paths in the Syncopation circuit. The command set max delay gen-erates an exception for paths which are longer than the maximum permitteddelay for that specific path. 
This enables us to ensure that vendor statictiming analysis will only generate timing errors if a path’s delay is longerthan the clock period selected by Syncopation.Using this technique, the Syncopation-determined constraints were con-firmed for all benchmarks for both the Next State and the Enhanced Syn-thesis implementations of Syncopation.5.3 Performance ResultsIn Chapter 4, multiple methods of implementing the Syncopation instru-mentation were introduced. In our experiments, we evaluate Syncopationusing both the Current State and Next State configurations (described inSect. 4.3.1), active module divisor selection logic (described in Sect. 4.3.3),and per-module timing constraints (described in Sect. 4.4.2). EnhancedSynthesis (described in Sect. 4.5) was also presented as a technique to tar-get synthesis to overcome some of the limitations of using an integer-divisorclock generator.In this section, first we compare the Current State and Next State im-plementations of Syncopation, and discuss performance tradeoffs. Next, wepresent our performance using Enhanced Synthesis and evaluate the bene-64fits of a targeted synthesis pass by comparing our results to the maximumtheoretical effective frequency achievable assuming a clock generator withthe ability to tune to exactly the required clock period, as opposed to rely-ing on the integer-based mechanism. Finally, we compare our results to ourprevious version of Syncopation3 which did not contain the active moduledivisor selection logic.5.3.1 Current State SyncopationUsing the current state (CS) to select the divisor, there is inherent loss as thestate is not known in advance, and the max divisor among the possible nextstates is selected for clock tuning, as described in Sect. 4.3.1. However, it ispossible that using the current state instead of the next state implementa-tion avoids additional performance-limiting delay incurred by inserting theDivisor Selection Logic after the Next State Computational Logic, describedin Sect. 4.3.1. The results using this configuration are shown in Table 5.2.Using the current state of the circuit does provide additional perfor-mance over the baseline with synthesis directives (9.9% on average (ge-omean)). However, this benefit is not significant enough to overcome theperformance overhead of the synthesis directives themselves. 
Compared tothe baseline, the performance improvement of individual benchmarks rangein performance from -25 to +25%, with an average performance change of-2% (geomean).Given the previous description of Current State and Next State Syncopa-3Published as a short paper in the 2020 International Conference on Field-Programmable Logic and Applications.65Table 5.2: Performance of Current State Syncopation (Sync-CS)Baseline SD Baseline Sync-CSFMax FMax FEff(MHz) (MHz) (MHz) (%)adpcm 87.3 64.3 84.4 -3aes 90.6 86.5 95.1 5bf 124.2 117.7 110.5 -11dfadd 135.0 112.3 101.1 -25dfdiv 102.9 103.4 99.8 -3dfmul 98.8 93.0 112.1 13dfsin 96.4 83.4 89.0 -8gsm 94.4 78.5 94.2 0jpeg 72.6 70.2 77.8 7mips 89.3 81.5 111.3 25motion 101.9 81.5 105.3 3sha 151.4 142.7 125.1 -17Geomean 101.7 90.7 99.6 -2.0tion presented in Chapter 4, it is possible that additional performance can beachieved using Next State Syncopation, which does not require conservativemax divisor selection.5.3.2 Next State SyncopationTo improve performance over the Current State implementation of Synco-pation, we can use the Next State implementation of Syncopation to ensurethat the clock is tuned to exactly the clock period required by the activestate. Table 5.3 presents the performance of the CHStone benchmarks usingthis technique.As expected, tuning the clock period exactly to the required clock periodaccording to the per-state timing measurements prevents unnecessarily in-66Table 5.3: Performance of Next State Syncopation (Sync-NS)Baseline SD Baseline Sync-NSFMax FMax FEff(MHz) (MHz) (MHz) (%)adpcm 87.3 64.3 97.5 12aes 90.6 86.5 95.9 6bf 124.2 117.7 126.7 2dfadd 135.0 112.3 111.3 -18dfdiv 102.9 103.4 99.5 -3dfmul 98.8 93.0 101.9 3dfsin 96.4 83.4 80.9 -16gsm 94.4 78.5 97.8 4jpeg 72.6 70.2 80.2 10mips 89.3 81.5 85.6 -4motion 101.9 81.5 111.2 9sha 151.4 142.7 221.7 47Geomean 101.7 90.7 105.0 +3.2creasing design latency and improves performance. In our experiments, wesee a 15.9% average (geomean) performance improvement over the baselinewith synthesis directives, and a 3.2% average (geomean) performance im-provement over the baseline. Overall, using the Next State implementationof Syncopation improves performance by approximately 5% over CurrentState Syncopation on average. The performance improvement for individ-ual benchmarks ranges from -18% to +47%.While the performance improved on average across the benchmark de-signs, some benchmarks saw reduced performance using the Next State ver-sion of Syncopation. As elaborated in Sect. 4.3.1, Next State Syncopationadds additional logic between the Next State logic and the Syncopationhardware, which may increase the longest computational path length in67designs which are performance limited by control logic instead of computa-tional logic. In particular, dfmul, dfsin, and mips lose performance (-10%,-10%, and -23%, respectfully). However, other designs do gain significantperformance (up to 77% for sha), suggesting that Next State Syncopationis generally a favourable implementation over the Current State version.5.3.3 Enhanced SynthesisAs described in Sect. 4.5, Enhanced Synthesis is a technique to regain someof the performance lost due to the integer clock generator by targetingtiming-driven synthesis to optimize each path in the design to meet tim-ing for a higher clock frequency produced by the clock generator. 
To furthermotivate the need for this technique, Table 5.4 compares the performanceof Syncopation to the theoretical performance achievable by Syncopation ifthe clock period was tuned to exactly the delay required by the active state.This theoretical performance quantifies the performance we could gain witha clock generator capable of tuning to any arbitrary frequency.On average, the theoretically achievable performance is 12.6% over thebaseline circuit implementations, which is 9.4% higher than the performanceachieved using Syncopation. However, it is possible to recover some of thisunrealized potential by revealing details about the clock generator to theplace-and-route tool. Currently, the place-and-route algorithms are not at-tempting to optimize paths to meet the frequencies generated by our clockgenerator, and in fact are not incentivized to optimize the non-critical cir-cuit paths at all. Enhanced Synthesis uses SDC file constraints to targetpath optimization efforts across all paths in the design, instead of only the68Table 5.4: Performance of Syncopation Compared to the Theoretical Per-formance Achievable without an Integer Divider Clock GeneratorBaseline SD Baseline Sync-NS TheoreticalFMax FMax FEff FEff(MHz) (MHz) (MHz) (%) (MHz) (%)adpcm 87.3 64.3 97.5 12 108.0 24aes 90.6 86.5 95.9 6 102.5 13bf 124.2 117.7 126.7 2 147.8 19dfadd 135.0 112.3 111.3 -19 121.8 -10dfdiv 102.9 103.4 99.5 -4 101.6 -1dfmul 98.8 93.0 101.9 3 118.7 20dfsin 96.4 83.4 80.9 -16 88.6 -8gsm 94.4 78.5 97.8 4 103.9 10jpeg 72.6 70.2 80.2 10 85.1 17mips 89.3 81.5 85.6 -4 94.4 6motion 101.9 81.5 111.2 9 124.5 22sha 151.4 142.7 221.7 47 230.9 53Geomean 101.7 90.7 105.0 +3.2 114.6 +12.6most-critical paths. Table 5.5 shows the performance of a single EnhancedSynthesis pass.As expected, Enhanced Synthesis typically improves the performanceof Syncopation by targeting synthesis effort to optimize paths near clockgenerator frequency boundaries, with few exceptions. On average, the per-formance across all benchmarks improves 23.8% compared to the baselinewith synthesis directives, and 10.3% compared to the baseline. The per-formance of individual benchmarks improve between -20 and +50%, withbetween -4 to +21% performance improvement over Syncopation alone. Ad-ditional iterations of Enhanced Synthesis do not significantly improve designperformance (<1%), so we have not included these results in this thesis.69Table 5.5: Performance of Enhanced SynthesisBaseline SD Baseline Sync-NS Sync-NS+ESFMax FMax FEff FEff(MHz) (MHz) (MHz) (%) (MHz) (%)adpcm 87.3 64.3 97.5 12 119.4 37aes 90.6 86.5 95.9 6 97.7 8bf 124.2 117.7 126.7 2 124.2 0dfadd 135.0 112.3 111.3 -18 108.5 -20dfdiv 102.9 103.4 99.5 -3 99.0 -4dfmul 98.8 93.0 101.9 3 117.6 19dfsin 96.4 83.4 80.9 -16 93.1 -4gsm 94.4 78.5 97.8 4 102.0 8jpeg 72.6 70.2 80.2 10 76.9 6mips 89.3 81.5 85.6 -4 104.7 17motion 101.9 81.5 111.2 9 125.0 23sha 151.4 142.7 221.7 47 226.9 50Geomean 101.7 90.7 105.0 +3.2 112.2 +10.35.3.4 Instrumentation OverheadThe bulk of the additional area utilization for Syncopation is due to thesynthesis directives. Tab. 5.6 shows the ALM utilization across the designswe studied.Without Syncopation instrumentation or synthesis directives, the bench-mark circuits utilize 9.9% of our device resources. With synthesis directives,this utilization increases to 24.4%. This overhead is not insignificant anddegrades circuit performance. 
Attempts to minimize this cost included em-ploying these directives only for datapath nodes and studying the Quartussynthesis reports for evidence of optimizations. Unfortunately, sufficient de-tail was not found in the reports to forgo the need for the directives andthus we were not able to remove them for this work.70Table 5.6: Syncopation Instrumentation Overhead for device5CSEMA5F31C6N, 32070 ALMs.Baseline SD Baseline Syncopation(ALM %) (ALM %) (ALM %) (%)adpcm 16.6 32.2 32.3 16aes 15.8 27.9 28.0 12bf 7.0 22.2 22.0 15dfadd 4.2 18.9 19.1 15dfdiv 9.9 30.3 30.4 21dfmul 3.4 12.8 12.8 9dfsin 26.2 62.1 62.3 36gsm 10.2 25.0 25.3 15jpeg 52.2 75.1 75.3 23mips 3.3 9.9 10.2 7motion 17.7 26.0 26.4 9sha 4.2 10.1 10.1 6Geomean 9.9 24.4 24.5 +14.6Adding in the Syncopation hardware, the utilization increases to 24.5%of the device, which is an additional 15% of the device compared to the base-line. However, only 0.1% of this device utilization is due to the Syncopationinstrumentation, and as previously mentioned, a vendor or custom-tool im-plementation of Syncopation would not require the insertion of SynthesisDirectives.5.3.5 Memory OverheadThe number and size of the Divisor Memories may vary significantly depend-ing on the design. Table 5.7 shows the memory utilization of the baselineand Syncopation circuits. For simplicity, the change in memory utilizationis also indicated.71Table 5.7: Syncopation Memory Utilization.Baseline Syncopation(kB) (kB)adpcm 1.1 1.8 +0.7aes 4.3 5.0 +0.7bf 18 19 +0.5dfadd 1.0 1.5 +0.5dfdiv 0.2 1.9 +1.7dfmul 1.0 1.1 +0.1dfsin 1.1 3.6 +2.4gsm 1.2 1.7 +0.5jpeg 56 68 +12mips 0.4 3.3 +2.9motion 4.0 4.7 +0.7sha 16.4 16.6 +0.2Geomean 2.4 4.4 +2.0Syncopation DMs are sized to contain the state tags and divisors foreach state machine. For example, the number of lines required in each DMis determined by the number of states in the corresponding state machineand the width of each memory line is determined by the state with the mostnext states in the state machine. Additionally, since a State Machine isgenerated for each function in the software level description, the number ofDMs is equal to the number of functions in the software level description.On average, the Syncopation DMs require an additional 2 kB (geomean)of on-chip memory in the benchmarks we studied. However, the DM sizeis sensitive to features of the RTL implementation which designers may nothave the expertise to alter. Additionally, since State Machines are auto-matically generated by the LegUp HLS tool, the designer does not directlycontrol these parameters; a future extension of Syncopation may include72alterations to the LegUp source code to take designer pragma directives toconstrain DM size. Without altering the LegUp tool, a designer could in-directly influence the DM size by partitioning code into multiple functions,instead of a large main function.5.3.6 Comparison to Previous Version of SyncopationA previous publication that described Syncopation used a Next State imple-mentation with a simple max() function to select the longest active paths,instead of the active module identification logic. The active module identi-fication logic works with LegUp-generated circuits because only one moduleis performing computations in each clock cycle, which may not be true ingeneral HLS-generated circuits which incorporate parallelization. 
Table 5.8includes the results from this original publication.Without the active module identification Logic, Syncopation does im-prove performance on average over the baseline with synthesis directives,but does not overcome the performance loss due to insertion of the direc-tives, resulting in a 2.7% loss in performance compared to the baseline.5.4 DiscussionWe have shown that the Syncopation Adaptive Clock Management tech-niques can improve the performance of High-Level Synthesis-generated cir-cuits, and using Enhanced Synthesis along with Syncopation can furtherimprove performance.Fig. 5.2 summarizes the baseline, theoretical performance, Next State73Table 5.8: Performance of Syncopation without Active Module IdentificationLogic (Previous Work)Baseline SD Baseline Sync-NS Sync-1.0 [21]FMax FMax FEff FEff(MHz) (MHz) (MHz) (%) (MHz) (%)adpcm 87.3 64.3 97.5 12 95.4 9aes 90.6 86.5 95.9 6 99.6 10bf 124.2 117.7 126.7 2 111.9 -10dfadd 135.0 112.3 111.3 -18 100.0 -26dfdiv 102.9 103.4 99.5 -3 103.4 0dfmul 98.8 93.0 101.9 3 109.6 11dfsin 96.4 83.4 80.9 -16 84.8 -12gsm 94.4 78.5 97.8 4 97.6 3jpeg 72.6 70.2 80.2 10 73.3 1mips 89.3 81.5 85.6 -4 95.1 6motion 101.9 81.5 111.2 9 100.8 -1sha 151.4 142.7 221.7 47 126.5 -16Geomean 101.7 90.7 105.0 3.2 99.0 -2.7Syncopation, and Enhanced Synthesis performance across the CHStone bench-mark circuits. While insertion of the synthesis directives reduces the perfor-mance of the benchmarks, Syncopation typically recovers this loss and gainsadditional performance over the baseline design (3.2% on average), and us-ing Syncopation with Enhanced Synthesis achieves even better performance(10.3% on average and up to 50%).Our implementation of Syncopation achieved this performance boostwith minimal instrumentation and on-chip memory utilization and with-out changing the HLS or place-and-route toolchains. However, additionalbenefits could be achieved by customizing these tools. In this section, wewill discuss some of these extensions. Additionally, we will discuss extended74Figure 5.2: Syncopation Performance on CHStone Benchmark Circuits.applications which could benefit from Syncopation.5.4.1 Custom Place-and-Route ToolchainOur implementation of Syncopation requires synthesis directives to enablethe use of our fine-grained static timing analysis scripts. Without syn-thesis directives, many named nodes in the RTL cannot be located in thepost-synthesis netlist, and therefore path timing cannot be mapped backto the original state machine state. As stated previously, this could berectified if Syncopation was implemented as a vendor tool, in which casethe node optimizations or transformations which occurred in intermediatenetlist optimizations could be tracked and recorded for timing reports. An-75other method to eliminate the need for these synthesis directives is to im-plement a custom place-and-route tool for Syncopation which automaticallyperforms fine-grained timing analysis.In addition to automatic fine-grained timing analysis, a custom place-and-route tool would forgo the necessity of an additional synthesis pass forEnhanced Synthesis. Currently, Enhanced Synthesis is only possible aftercomputing the clock generator frequency for each path in the design and au-tomatically generating a timing constraint file to use in a second, full place-and-route pass through a vendor synthesis tool. While the second synthesispass is shown to improve design performance, it doubles the synthesis time,which is not desirable for large or complex designs. 
Implementing a cus-tom place-and-route tool would enable targeted timing driven synthesis toautomatically adapt path timing targets according to current path timing,eliminating the need for additional synthesis time.5.4.2 Custom HLS ToolchainA custom HLS toolchain, or alterations to the LegUp HLS tool, could alsoimprove the performance of Syncopation and enable higher levels of controlfor designers with hardware expertise. By altering the HLS toolchain, wecan include options for pragmas to control the size and shape of the state ma-chines, divisor memories, and enable additional transformations to improveperformance such as Syncopation-aware pre-synthesis pipelining and retim-ing. Additionally, a custom HLS tool could be designed to identify modulesin the user circuit which benefit from Syncopation and modules which sufferin performance, and automatically generate a multi-clock domain scheme to76achieve additional performance.Multiple Clock Domain OptimizationAs presented in this thesis, Syncopation does not utilize multiple clock do-mains and instead uses a single clock for the entire design. Using multipleclock domains is a common technique to improve performance of circuits.Using multiple clock domains with Syncopation could improve design perfor-mance further by enabling the option to employ static clock and Syncopationclock domains according to which would achieve better performance. In Ta-ble 5.9, we determined the maximum operating frequency of each modulewithin the benchmark circuits we surveyed to quantify whether a multipleclock domain scheme would improve performance.Table 5.9: Per-Module Maximum Operating FrequencyPer-Module Frequencies(MHz)adpcm 87.3, 355.6aes 131.3, 147.23, 199.3, 231.2bf 146.4, 184.5dfadd 135.0, 144.7dfdiv 102.9dfmul 98.8dfsin 96.4, 97.2, 120.2gsm 94.4jpeg 80.1, 83.8, 84.62, 116.1, 121.27, 129.2, 132.36mips 89.3motion 125.5, 142.7sha 162.3, 168.1As an example, consider the module frequencies for the adpcm bench-77mark: 87.3 and 355.6 MHz. The maximum clock frequency produced by ourclock generator is 250 MHz, so it is guaranteed that the 355.6 MHz modulewould have reduced performance using Syncopation. By separating thesemodules into multiple clock domains, the 87.3 MHz module could remain ina Syncopation domain while the 355.6 MHz module could be in a statically-clocked domain. Clock domain partitions could be specified manually bythe designer using pragma statements or automatically determined duringHLS using timing techniques similar to those already performed during in-struction scheduling.5.4.3 Security ApplicationsA final extended application of Syncopation which we will discuss is thesecurity of multi-tenancy FPGAs in modern cloud computing. The multi-tenancy paradigm has recently become more attractive as it enables fine-grained allocation of FPGA resources among end users. However, multi-tenancy opens user applications, which are performing operations on po-tentially sensitive or secret data, to side-channel power analysis attacks.Previous work [28, 50] has shown that secret data in victim circuits can beobfuscated from these attack methods by using randomized or time-fracturedclock frequencies.Extending Syncopation to perform data-protecting clock randomizationwould require minimal additional processing and instrumentation. For ex-ample, a simple clock randomization scheme could be implemented usinga random number generator connected to the integer divisor clock genera-tor. 
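As a rough sketch of that simple scheme (purely illustrative, and not part of the Syncopation flow evaluated in this thesis), a small linear-feedback shift register could supply a pseudo-random divisor each cycle, clamped so that it never drops below the minimum divisor required by the active state:

    # Illustrative only: a 4-bit LFSR supplying pseudo-random clock divisors.
    # Clamping to the state's minimum divisor keeps timing safe, while the
    # added cycle-to-cycle variation obscures the power profile.

    def lfsr4(seed=0b1001):
        state = seed
        while True:
            bit = ((state >> 3) ^ (state >> 2)) & 1   # taps for a 4-bit LFSR
            state = ((state << 1) | bit) & 0xF
            yield state

    def randomized_divisor(rng, min_divisor, max_divisor=15):
        return max(min_divisor, min(next(rng), max_divisor))

    rng = lfsr4()
    for min_div in [3, 3, 10, 5, 2]:                  # per-state minimum divisors
        print(randomized_divisor(rng, min_div))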
5.4.3 Security Applications

A final extended application of Syncopation which we will discuss is the security of multi-tenancy FPGAs in modern cloud computing. The multi-tenancy paradigm has recently become more attractive as it enables fine-grained allocation of FPGA resources among end users. However, multi-tenancy exposes user applications, which are performing operations on potentially sensitive or secret data, to side-channel power analysis attacks. Previous work [28, 50] has shown that secret data in victim circuits can be obfuscated from these attack methods by using randomized or time-fractured clock frequencies.

Extending Syncopation to perform data-protecting clock randomization would require minimal additional processing and instrumentation. For example, a simple clock randomization scheme could be implemented using a random number generator connected to the integer divisor clock generator. More sophisticated methods of data obfuscation could time-multiplex or partition the user circuit into Syncopation (performance-oriented) or Obfuscation (data security-oriented) clock schemes. This technique could balance the performance of a data processing design as a whole with the need to protect sensitive, unencrypted data processing.
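As a behavioral sketch of the simple scheme described above, the model below draws a random divisor that is never smaller than the timing-safe divisor stored for the current state, so the clock period varies unpredictably without violating timing. The divisor bound, names, and interfaces are illustrative only and are not part of our instrumentation.

```python
import random

# Behavioral sketch of randomized divisor selection for obfuscation.
# The divisor memory and clock generator are modelled abstractly; the divisor
# range and the RNG choice are placeholders.

MAX_DIVISOR = 16   # assumed upper bound supported by the clock generator

def obfuscated_divisor(divisor_memory, state, rng=random.SystemRandom()):
    """Pick a random divisor no smaller than the timing-safe one for `state`,
    so the period varies cycle to cycle without violating timing."""
    safe = divisor_memory[state]
    return rng.randint(safe, MAX_DIVISOR)

divisor_memory = {"state_0": 2, "state_1": 5}          # per-state safe divisors
for state in ("state_0", "state_1", "state_0"):
    print(state, "->", obfuscated_divisor(divisor_memory, state))
```

The trade-off is the one noted above: larger-than-necessary divisors cost performance, which is why an obfuscation clock scheme would likely be confined to the portions of the design handling sensitive data.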
5.4.4 Syncopation for Arbitrary Circuits

In this thesis, we investigated applying adaptive clock management techniques to HLS-generated circuits. Although HLS circuits are particularly well suited to this type of optimization due to the instruction schedules available from the HLS tools, Syncopation could hypothetically be applied to any arbitrary state machine circuit. In particular, Automated Test and Formal Analysis tools exist which can determine which regions of a circuit are exercised during specific execution states. By incorporating an Automated Test Tool into the Syncopation flow, the paths activated by each state in an arbitrary FPGA circuit could be identified and measured to determine the minimum clock period, enabling Syncopation.

Chapter 6

Conclusions

This thesis presents Syncopation, a novel adaptive clock management strategy to improve the performance of High-Level Synthesis-generated user circuits by dynamically adjusting the clock period. Syncopation is made possible by leveraging three key observations. First, computational path delay variation exists in HLS circuits due to complexities in instruction scheduling algorithms. Second, not every path in a design is active in every clock cycle, and therefore the clock period only needs to be long enough to accommodate the active paths. Finally, Syncopation leverages the detailed instruction schedules already automatically generated by HLS tools to determine exactly which paths are active in each state of the circuit's operation.

By combining these observations with customized post-place-and-route timing analysis techniques, our results show that it is possible to tune the clock period on a cycle-by-cycle basis using only a small amount of additional instrumentation. Additionally, we show that our timing analysis technique can reveal detailed path timing information that can be used to target iterative synthesis passes to further improve performance, in a method we call Enhanced Synthesis.

In this thesis, we detailed the key design elements which enable the Syncopation adaptive clock management technique. In particular, we presented the instrumentation, mechanism, and performance of both Current State and Next State Syncopation, which monitor the user state machines using different metrics to determine the per-cycle clock tuning. Additionally, we described the Divisor Memory, Divisor Selection Logic, and Clock Generator instrumentation in detail.

Overall, we show that Syncopation can be used to improve the performance of our benchmark circuits by between -18% and +47% (3.2% on average). Furthermore, the Enhanced Synthesis technique improves design performance by between -10% and +50% (10.3% on average) over Syncopation alone.

Over time, making high-performance circuit development easier and more accessible to designers without low-level hardware experience will require better tools that exploit the limitations of the existing heuristic algorithms. Syncopation is an automated tool which enables additional performance with no changes to standard HLS and place-and-route tool algorithms, and it does not require additional synthesis time unless Enhanced Synthesis is performed. Even higher performance is possible by developing custom toolflows to automatically employ Syncopation without the need for synthesis directives. Additionally, the cycle-by-cycle adjustment of the design clock period could easily be extended to other applications, such as security. Detailed experiments implementing these tools were out of scope for this thesis, but they are interesting areas for future work.

Bibliography

[1] Achronix Semiconductor Corporation. https://www.achronix.com. Accessed: 2020-05-01.
[2] Intel. https://www.intel.ca/. Accessed: 2020-05-01.
[3] Lattice Semiconductor. https://www.latticesemi.com. Accessed: 2020-05-01.
[4] Microsemi. https://www.microsemi.com. Accessed: 2020-05-01.
[5] Xilinx. https://www.xilinx.com/. Accessed: 2020-05-01.
[6] Amazon. Amazon EC2 F1 Instances. https://aws.amazon.com/ec2/instance-types/f1/, 2020.
[7] V. Betz and J. Rose. Cluster-based logic blocks for FPGAs: Area-efficiency vs. input sharing and size. In Proceedings of CICC 97 - Custom Integrated Circuits Conference, pages 551–554, 1997.
[8] Jacob A. Bower, Wayne Luk, Oskar Mencer, Michael J. Flynn, and Martin Morf. Dynamic clock-frequencies for FPGAs. Microprocessors and Microsystems, 30:388–397, 2006.
[9] A. Brant, A. Abdelhadi, D. H. H. Sim, S. L. Tang, M. X. Yue, and G. G. F. Lemieux. Safe Overclocking of Tightly Coupled CGRAs and Processor Arrays using Razor. In 2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines, pages 37–44, 2013.
[10] Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona, Tomasz Czajkowski, Stephen Brown, and Jason Anderson. LegUp: An Open-Source High-Level Synthesis Tool for FPGA-Based Processor/Accelerator Systems. ACM Transactions on Embedded Computing Systems (TECS), 13, September 2013.
[11] K. Chae, S. Mukhopadhyay, Chang-Ho Lee, and J. Laskar. A dynamic timing control technique utilizing time borrowing and clock stretching. In IEEE Custom Integrated Circuits Conference 2010, pages 1–4, 2010.
[12] Y. T. Chen, J. H. Kim, K. Li, G. Hoyes, and J. H. Anderson. High-level synthesis techniques to generate deeply pipelined circuits for FPGAs with registered routing. In 2019 International Conference on Field-Programmable Technology (ICFPT).
[13] Jianyi Cheng, Lana Josipović, George A. Constantinides, Paolo Ienne, and John Wickerson. Combining Dynamic & Static Scheduling in High-Level Synthesis. In The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA '20, pages 288–298, New York, NY, USA, 2020. Association for Computing Machinery.
[14] J. Constantin, A. Bonetti, A. Teman, C. Müller, L. Schmid, and A. Burg. DynOR: A 32-bit microprocessor in 28 nm FD-SOI with cycle-by-cycle dynamic clock adjustment. In ESSCIRC Conference 2016: 42nd European Solid-State Circuits Conference, pages 261–264, Sep. 2016.
[15] L. de Souza Rosa, V. Bonato, and C. Bouganis. Scaling Up Loop Pipelining for High-Level Synthesis: A Non-iterative Approach. In 2018 International Conference on Field-Programmable Technology (FPT), pages 62–69, Dec 2018.
[16] X. Dong and G. G. F. Lemieux. PGR: Period and glitch reduction via clock skew scheduling, delay padding and GlitchLess. In 2009 International Conference on Field-Programmable Technology, pages 88–95, 2009.
[17] S. Dutt and O. Shi. A Fast and Effective Lookahead and Fractional Search Based Scheduling Algorithm for High-level Synthesis. In 2018 Design, Automation Test in Europe Conference Exhibition (DATE), pages 31–36, 2018.
[18] D. Ernst, Nam Sung Kim, S. Das, S. Pant, R. Rao, Toan Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, and T. Mudge. Razor: a low-power pipeline based on circuit-level timing speculation. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-36), pages 7–18, 2003.
[19] J. P. Fishburn. Clock Skew Optimization. IEEE Transactions on Computers, 39(7):945–951, 1990.
[20] A. Gheolbănoiu, L. Petrică, and S. Coţofană. Hybrid adaptive clock management for FPGA processor acceleration. In 2015 Design, Automation Test in Europe Conference Exhibition (DATE), pages 1359–1364, March 2015.
[21] K. Gibson, E. Roorda, D. H. Noronha, and S. Wilton. Syncopation: Adaptive Clock Management for HLS-Generated Circuits on FPGAs. In 2020 30th International Conference on Field Programmable Logic and Applications (FPL), 2020.
[22] Z. Gu, W. Wan, and C. Wu. Latency Minimal Scheduling with Maximum Instruction Parallelism. In 2019 IEEE 13th International Conference on ASIC (ASICON), pages 1–4, 2019.
[23] S. Hadjis, A. Canis, R. Sobue, Y. Hara-Azumi, H. Tomiyama, and J. Anderson. Profiling-driven multi-cycling in FPGA high-level synthesis. In 2015 Design, Automation Test in Europe Conference Exhibition (DATE), pages 31–36, 2015.
[24] Mike Hutton. Understanding How the New Intel HyperFlex FPGA Architecture Enables Next-Generation High-Performance Systems. 2017.
[25] Intel. Verilog HDL Synthesis Attributes and Directives. https://www.intel.com/content/www/us/en/programmable/quartushelp/17.0/hdl/vlog/vlog_file_dir.htm, 2017.
[26] Intel. Intel Ships First 10nm Agilex FPGAs. https://newsroom.intel.com/news/intel-ships-first-10nm-agilex-fpgas/, Aug. 2019.
[27] Intel. Intel® Quartus® Prime Pro Edition User Guide: Design Optimization. https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/ug/ug-qpp-design-optimization.pdf, 2020.
[28] D. Jayasinghe, A. Ignjatovic, and S. Parameswaran. RFTC: Runtime Frequency Tuning Countermeasure Using FPGA Dynamic Reconfiguration to Mitigate Power Analysis Attacks. In 2019 56th ACM/IEEE Design Automation Conference (DAC), pages 1–6, June 2019.
[29] T. Jia, R. Joseph, and J. Gu. 19.4 An Adaptive Clock Management Scheme Exploiting Instruction-Based Dynamic Timing Slack for a General-Purpose Graphics Processor Unit with Deep Pipeline and Out-of-Order Execution. In 2019 IEEE International Solid-State Circuits Conference (ISSCC), pages 318–320, Feb 2019.
[30] T. Jia, R. Joseph, and J. Gu. An Instruction-Driven Adaptive Clock Management Through Dynamic Phase Scaling and Compiler Assistance for a Low Power Microprocessor. IEEE Journal of Solid-State Circuits, 54(8):2327–2338, Aug 2019.
[31] L. Josipović, P. Brisk, and P. Ienne. From C to elastic circuits. In 2017 51st Asilomar Conference on Signals, Systems, and Computers, pages 121–125, 2017.
[32] Lana Josipović, Radhika Ghosal, and Paolo Ienne. Dynamically Scheduled High-Level Synthesis. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA '18, pages 127–136, New York, NY, USA, 2018. Association for Computing Machinery.
[33] V. Lapotre, P. Coussy, C. Chavet, H. Wouafo, and R. Danilo. Dynamic branch prediction for high-level synthesis. In 2013 23rd International Conference on Field Programmable Logic and Applications, pages 1–6, Sep. 2013.
[34] LegUp. LegUp 4.0 Programmer's Manual. http://legup.eecg.utoronto.ca/docs/4.0/programmermanual.html.
[35] G. Lhairech-Lebreton, P. Coussy, and E. Martin. Hierarchical and Multiple-Clock Domain High-Level Synthesis for Low-Power Design on FPGA. In 2010 International Conference on Field Programmable Logic and Applications, pages 464–468, 2010.
[36] T. Liang, J. Zhao, L. Feng, S. Sinha, and W. Zhang. Hi-ClockFlow: Multi-Clock Dataflow Automation and Throughput Optimization in High-Level Synthesis. In 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 1–6, 2019.
[37] Alexander (Sandy) Marquardt, Vaughn Betz, and Jonathan Rose. Using Cluster-Based Logic Blocks and Timing-Driven Packing to Improve FPGA Speed and Density. In Proceedings of the 1999 ACM/SIGDA Seventh International Symposium on Field Programmable Gate Arrays, FPGA '99, pages 37–46, New York, NY, USA, 1999. Association for Computing Machinery.
[38] T. Marty, T. Yuki, and S. Derrien. Enabling Overclocking Through Algorithm-Level Error Detection. In 2018 International Conference on Field-Programmable Technology (FPT), pages 174–181, 2018.
[39] J. A. Nestor and G. Krishnamoorthy. SALSA: A New Approach to Scheduling with Timing Constraints. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 12(8):1107–1122, 1993.
[40] B. Pontikakis, H. T. Bui, F. Boyer, and Y. Savaria. A Low-Complexity High-Speed Clock Generator for Dynamic Frequency Scaling of FPGA and Standard-Cell Based Designs. In 2007 IEEE International Symposium on Circuits and Systems, pages 633–636, May 2007.
[41] Andrew Putnam, Adrian Caulfield, Eric Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Jan Gray, Michael Haselman, Scott Hauck, Stephen Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, Eric Peterson, Aaron Smith, Jason Thong, Phillip Yi Xiao, Doug Burger, Jim Larus, Gopi Prashanth Gopal, and Simon Pope. A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services. In Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA), pages 13–24. IEEE Press, June 2014. Selected as an IEEE Micro Top Pick.
[42] R. Ragavan, C. Killian, and O. Sentieys. Adaptive Overclocking and Error Correction Based on Dynamic Speculation Window. In 2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pages 325–330, 2016.
[43] O. Ragheb and J. H. Anderson. High-Level Synthesis of FPGA Circuits with Multiple Clock Domains. In 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 109–116, April 2018.
[44] D. S. H. Ram, M. C. Bhuvaneswari, and S. M. Logesh. A Novel Evolutionary Technique for Multi-objective Power, Area and Delay Optimization in High Level Synthesis of Datapaths. In 2011 IEEE Computer Society Annual Symposium on VLSI, pages 290–295, 2011.
[45] K. Shi, D. Boland, and G. A. Constantinides. Accuracy-Performance Tradeoffs on an FPGA through Overclocking. In 2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines, pages 29–36, 2013.
[46] K. Shi, D. Boland, E. Stott, S. Bayliss, and G. A. Constantinides. Datapath Synthesis for Overclocking: Online Arithmetic for Latency-Accuracy Trade-offs. In 2014 51st ACM/EDAC/IEEE Design Automation Conference (DAC), pages 1–6, 2014.
[47] Deshanand P. Singh and Stephen D. Brown. Constrained Clock Shifting for Field Programmable Gate Arrays. New York, NY, USA, 2002. Association for Computing Machinery.
[48] V. Subramanian, M. Bezdek, N. D. Avirneni, and A. Somani. Superscalar Processor Performance Enhancement through Reliable Dynamic Clock Frequency Tuning. In 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07), pages 196–205, June 2007.
[49] D. Thomas and M. Shanmugasundaram. A Survey on Different Overclocking Methods. In 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA), pages 1588–1592, 2018.
[50] Stephen M. Williams and Mingjie Lin. Reactive Signal Obfuscation with Time-Fracturing to Counter Information Leakage in FPGAs. In The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA '20, page 322, New York, NY, USA, 2020. Association for Computing Machinery.
[51] Yuko Hara, Hiroyuki Tomiyama, Shinya Honda, Hiroaki Takada, and Katsuya Ishii. CHStone: A benchmark program suite for practical C-based high-level synthesis. In 2008 IEEE International Symposium on Circuits and Systems, pages 1192–1195, May 2008.
[52] H. Zheng, S. T. Gurumani, L. Yang, D. Chen, and K. Rupnow. High-level synthesis with behavioral level multi-cycle path analysis. In 2013 23rd International Conference on Field Programmable Logic and Applications, pages 1–8, 2013.
