A Mix-Grained Architecture for ImprovingHLS-Generated Controllers on FPGAsbyShadi AssadikhomamiB.Sc., Sharif University of Technology, 2013A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of Applied ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Electrical and Computer Engineering)The University Of British Columbia(Vancouver)June 2017c© Shadi Assadikhomami, 2017AbstractWith the recent slowdowns in traditional technology scaling, hardware accelera-tors, such as Field Programmable Gate Arrays (FPGAs), offer the potential for im-proved performance and energy efficiency compared to general purpose processingsystems. While FPGAs were traditionally used for applications such as signal pro-cessing, they have recently gained popularity in new, larger scale domains, such ascloud computing. However, despite their performance and power efficiency, pro-gramming FPGAs remains a hard task due to the difficulties involved with the low-level design flow for FPGAs. High-Level Synthesis (HLS) tools aim to assist withthis time-consuming task by supporting higher level programming models whichsignificantly increases design productivity. This also makes the use of FPGAs forlarge scale design development for evolving applications more feasible.In this thesis we explore the potential of modifying the current FPGA architec-ture to better support the designs generated by HLS tools. We propose a specializedmix-grained architecture for Finite State Machine (FSM) implementation that canbe integrated into existing FPGA architectures. The proposed mix-grained archi-tecture exploits the characteristics of the controller units generated by HLS tools toreduce the control-path area of the design. We show that our proposed architecturereduces the area of the next state calculation in FSMs by more than 3X withoutimpacting the performance and often reducing the critical path delay of the nextstate calculation in FSMs.iiLay SummaryProgramming low-level, dedicated hardware computing systems, such as Field-Programmable Gate Arrays (FPGAs), is more challenging and time consumingcompared to programming higher-level software for general-purpose processors.Despite the difficulties associated with programming hardware, FPGAs still remainan appealing solution over general-purpose processors for many applications dueto their higher efficiency. High-Level Synthesis (HLS) aims to ease the hardwareprogramming by enabling the use of higher-level software languages to programFPGAs. However, there is generally a trade-off between programmability and ef-ficiency when using HLS tools, which can often result in a less efficient hardwaredesign than programming FPGAs using low-level programming languages. In thisdissertation, we aim to narrow the gap between programmability and efficiencywhen programming FPGAs using HLS tools. We propose a novel modification tocurrent FPGA architectures that exploits common properties of HLS-generated de-signs to improve the FPGAs efficiency by reducing the total area of the hardwaredesign.iiiPrefaceThis dissertation is based on a research project conducted by myself under the su-pervision and guidance of Professor Tor M. Aamodt. I assisted with defining theproblem space and was responsible for identifying challenges within this problemspace, and designing and modelling the architecture to evaluate the proposed so-lution. I also conducted the experiments and collected all of the data representedin this dissertation, except the data shown in Section 3.1.4 which has been col-lected by Jennifer Angelica with my assistance under the supervision of ProfessorAamodt.ivTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xList of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xivDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.3 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.1 FPGA Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 62.2 Finite State Machines . . . . . . . . . . . . . . . . . . . . . . . . 72.2.1 Finite State Machine Definition and Representation Models 7v2.2.2 Finite State Machine Implementations . . . . . . . . . . . 92.3 Hardware Design Flow . . . . . . . . . . . . . . . . . . . . . . . 102.4 High-Level Synthesis . . . . . . . . . . . . . . . . . . . . . . . . 113 Control-Path Optimization . . . . . . . . . . . . . . . . . . . . . . . 133.1 Finite State Machine Analysis . . . . . . . . . . . . . . . . . . . 133.1.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . 143.1.2 FSM Characteristic . . . . . . . . . . . . . . . . . . . . . 143.1.3 HLS-Generated Finite State Machines . . . . . . . . . . . 163.1.4 Data Flow Height Experiment . . . . . . . . . . . . . . . 173.2 Specialized FSM Block . . . . . . . . . . . . . . . . . . . . . . . 193.2.1 Design Space Exploration . . . . . . . . . . . . . . . . . 213.2.2 Mix-Grained Architecture . . . . . . . . . . . . . . . . . 233.2.3 Input Sequence Encoder Unit . . . . . . . . . . . . . . . 243.2.4 Coarse-Grained Fabric . . . . . . . . . . . . . . . . . . . 243.3 Fracturable FSM Hard Blocks . . . . . . . . . . . . . . . . . . . 293.3.1 FSM Partitioning . . . . . . . . . . . . . . . . . . . . . . 293.3.2 Fracturable FSM Block Architecture . . . . . . . . . . . . 303.4 State Assignment Algorithm . . . . . . . . . . . . . . . . . . . . 313.5 Mapping to the Specialized FSM Architecture . . . . . . . . . . . 403.5.1 Applying the Size Checking Pass . . . . . . . . . . . . . 403.5.2 Fine-Grained Mapping . . . . . . . . . . . . . . . . . . . 413.5.3 Coarse-Grained Mapping . . . . . . . . . . . . . . . . . . 413.6 Putting it All Together . . . . . . . . . . . . . . . . . . . . . . . 413.6.1 Generating the FSM . . . . . . . . . . . . . . . . . . . . 433.6.2 State Assignment . . . . . . . . . . . . . . . . . . . . . . 454 Experimental Methodology . . . . . . . . . . . . . . . . . . . . . . . 514.1 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514.2 FSM Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 534.3 Area and Delay Model . . . . . . . . . . . . . . . . . . . . . . . 554.4 CAD Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574.5 Mapping to the Next State Generation Block . . . . . . . . . . . . 57vi5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 585.1 Next State Generation Block Size . . . . . . . . . . . . . . . . . 585.2 Area Improvement . . . . . . . . . . . . . . . . . . . . . . . . . 615.3 Delay Improvement . . . . . . . . . . . . . . . . . . . . . . . . . 635.4 Resource Usage of the Mix-Grained Architecture . . . . . . . . . 655.5 FSM Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675.6 Impact of HLS Directives on the Generated FSMs . . . . . . . . . 685.7 Efficiency of the Fracturable FSM Block . . . . . . . . . . . . . . 716 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 747 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 778 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80viiList of TablesTable 3.1 Input sequence encoder generation: the original transition tableis given in (a), the reduced table is given in (b). {cs, activeinput(2bits)} will be used to address the memory instead of {cs,inputs(10 bits)}. . . . . . . . . . . . . . . . . . . . . . . . . . 42Table 3.2 Memory content . . . . . . . . . . . . . . . . . . . . . . . . . 48Table 4.1 Number of lines of actual C code, excluding comments, for theevaluated benchmarks . . . . . . . . . . . . . . . . . . . . . . 52Table 4.2 Characteristics of the FSMs extracted from MachSuite . . . . . 53Table 4.3 Characteristics of the FSMs extracted from HLS datacenter bench-marks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54Table 5.1 The sizing configuration of the elements of the next state gen-eration block . . . . . . . . . . . . . . . . . . . . . . . . . . . 61Table 5.2 Fraction of next state calculation logic area to total design areafor sqlite lookupName and sqlite getToken functions from theHLS datacenter benchmark set . . . . . . . . . . . . . . . . . 68Table 5.3 Block size checking for the sql ln fsm1 indicating that it doesnot fit to one FSM block, since it requires larger memory unitand more state bits than what is provided by the FSM block. . 71viiiTable 5.4 Memory unit size requirement for each partition after partition-ing the FSM. Row 4 indicates that FSM partition A requires amemory unit of size 120 and FSM partition B requires a mem-ory unit of size 88, thus they both can be mapped to a fracturableFSM block. . . . . . . . . . . . . . . . . . . . . . . . . . . . 73ixList of FiguresFigure 2.1 Basic FPGA architecture [15] . . . . . . . . . . . . . . . . . 7Figure 2.2 Digital systems structure . . . . . . . . . . . . . . . . . . . . 8Figure 2.3 The state transition graph of a Mealy FSM. . . . . . . . . . . 9Figure 2.4 General structure of finite state machines . . . . . . . . . . . 10Figure 2.5 Memory-based implementation of FSMs . . . . . . . . . . . 11Figure 3.1 Branch-free path (shown in green). Each path starts and endswith a vertex with fan-out degree of greater than 1 (shown inred). Note that vertices that belong to a branch-free path canhave more than one fan-in edge. . . . . . . . . . . . . . . . . 15Figure 3.2 An example of the state encoding for the states that belong tobranch-free paths. . . . . . . . . . . . . . . . . . . . . . . . . 16Figure 3.3 Equivalent FSM construction . . . . . . . . . . . . . . . . . . 20Figure 3.4 Fan-out degree frequency in the equivalent FSMs of Mach-Suite benchmarks and Spec CPU2006 INT . . . . . . . . . . 21Figure 3.5 High-level view of the specialized FSM architecture . . . . . 23Figure 3.6 Number of active inputs per state calculated as average over46 FSMs extracted from 21 benchmarks generated by HLS.Details of the benchmarks are described in Chapter 4. . . . . 25Figure 3.7 Edge distribution: number of transitions per state calculatedas an average over 46 FSMs extracted from 21 benchmarksgenerated by HLS. Details of the benchmarks are described inChapter 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . 25Figure 3.8 Next state generation block . . . . . . . . . . . . . . . . . . . 26xFigure 3.9 Memory content . . . . . . . . . . . . . . . . . . . . . . . . 28Figure 3.10 Path refinement . . . . . . . . . . . . . . . . . . . . . . . . . 35Figure 3.11 An example C code: the example shows a while loop with fourconsecutive instructions each with a data dependency on theprevious instruction. Additionally each instruction performsan operation that has a different latency in hardware such asdivision, multiply, shift, and add. . . . . . . . . . . . . . . . . 43Figure 3.12 State assignment example . . . . . . . . . . . . . . . . . . . 44Figure 3.13 Next state generation block - transition from a memory state toa memory state . . . . . . . . . . . . . . . . . . . . . . . . . 48Figure 3.14 Next state generation block - transition from a memory state toa branch-free path state . . . . . . . . . . . . . . . . . . . . . 49Figure 3.15 Next state generation block - transition from a state of a branch-free path to another state on the same path . . . . . . . . . . . 49Figure 3.16 Next state generation block - transition from the last state of abranch-free path to a memory state . . . . . . . . . . . . . . . 50Figure 5.1 Area breakdown of the coarse-grained fabric . . . . . . . . . 59Figure 5.2 FSM coverage vs. memory depth in number of entries. Ap-proximately 98% of the evaluated FSMs fit into a memory witha depth of 128 entries. . . . . . . . . . . . . . . . . . . . . . 59Figure 5.3 Area improvement of the specialized FSM architecture, whichincludes the area of the input sequence encoder and next stategeneration block relative to the baseline . . . . . . . . . . . . 63Figure 5.4 FSM size along with the breakdown of the states that are partof branch-free paths and states that reside in memory . . . . . 64Figure 5.5 Critical path delay improvement of the specialized FSM archi-tecture which includes the critical path of the input sequenceencoder and next state generation block relative to the baseline 66Figure 5.6 Area breakdown of the mix-grained FSM architecture for theFSMs extracted from the evaluated benchmarks . . . . . . . . 67xiFigure 5.7 Impact of applying HLS optimization directives on backprop,aes, and radix sort benchmarks from MachSuite. The x-axisshows the number of reachable next states per state. For ex-ample, fan-out 1 is an indicator for states that only have onenext state. This figure shows that the optimization increasesthe percentage of branch-free path. . . . . . . . . . . . . . . . 69Figure 5.8 Impact of applying HLS optimization directives on aes, back-prop, and radix sort benchmarks from MachSuite . . . . . . . 70Figure 5.9 Area overhead of using a fracturable FSM block to map a largeFSM as opposed to having one large specialized FSM block tofit the FSM. The overhead due to making the FSM block frac-turable is negligible compared to the area improvement gainedby mapping the FSM to the hard FSM block. . . . . . . . . . 73xiiList of AbbreviationsCPU Centeral Processing UnitGPU Graphic Processing UnitDSP Digital Signal ProcessorASIC Application-Specific Integrated CircuitHLS High-Level SynthesisFPGA Field Programmable Gate ArrayCGRA Coarse Grain Reconfigurable ArchitectureFSM Finite State MachineCDFG Control/Data Flow GraphDCG Directed Cyclic GraphRAM Random Access MemoryLUT Look-Up TableHDL Hardware Description LanguageRTL Register-Transfer LevelCAD Computer Aided DesignSOC System On a ChipxiiiAcknowledgmentsI would like to thank my supervisor, Professor Tor M. Aamodt, for the valuableguidance, support, and insight he provided during these years and for giving me theopportunity to do high-quality research. This work would have not been possiblewithout him. I also gratefully acknowledge the funding provided by the NaturalSciences and Engineering Research Council of Canada (NSERC) that made myresearch possible.I would like to thank everyone in the computer architecture group at UBC, in-cluding the graduate students and undergraduate interns, who I had the opportunityto work with. I would also like to thank my former lab-mates, Tim Rogers, AhmedElTantawy, and Andrew Boktor, and my friends from the SOC group, Fatemeh Es-lami and Hossein Omidian, for their kind help, advice, and sharing their knowledgeand experience with me. I would like to specially thank my best friend and a seniorPhD students in our group, Tayler Hetherington, for his help, support, feedback,and everything I have learned from him during these years.And last but not least, I would like to thank my family who taught me to nevergive up and fight for my goals, and for their invaluable support throughout my life.xivDedicationTo my parentsxvChapter 1IntroductionSince their emergence, computing systems have undergone a series of revolution-ary improvements in their performance, energy efficiency and cost-effectiveness.These improvements were achieved by architectural innovations and advancementsin the semiconductor industry. Advancements in semiconductor technologies pro-vide large improvements to computing systems by drastically increasing the amountof processing capability per unit of area and power. Historically, these advance-ments have followed Moore’s law [22], which states that the number of transis-tors on a chip will double approximately every two years and Dennard Scaling[8], which states that the power density of transistors remains constant as their sizescales down which enables smaller and faster transistors. However, in recent years,Moore’s law and Dennard scaling have slowed, resulting in diminishing returnsfrom semiconductor improvements.Additionally, to broaden the scope of applications able to benefit from suchcomputing systems, architectures were designed with generality in mind, such asthe CPU. However, due to the slowdowns in the rate of improvements for comput-ing systems, there has been a shift towards using alternative architectural designsand specialized hardware accelerators to keep up with the growing computationaldemands of today’s applications.Hardware accelerators are customized circuits that are designed for performinga particular set of tasks [28]. They have shown great potential to improve the per-formance and energy efficiency of applications by eliminating the overheads that1come with having a more general purpose architecture. Graphic Processing Units(GPUs), Application-Specific Integrated Circuits (ASICs), Digital Signal Proces-sors (DSPs), and Field Programmable Gate Arrays (FPGAs) are examples of themost common hardware accelerators [28].These accelerators range in their level of specialization and programmability.Similar to CPUs, GPUs offer a high degree of programmability, however, are de-signed to accelerate a class of applications with large amounts of data-level paral-lelism. In contrast, ASICs are designed to perform a specific set of tasks with ded-icated hardware at the cost of little to no programmability. FPGAs bridge the gapbetween programmable processors and dedicated hardware accelerators by provid-ing a reconfigurable and programmable hardware platform. FPGAs improve theflexibility over ASICs, while maintaining a portion of the improvements in perfor-mance and energy efficiency of a hardware design compared to a general purposearchitecture.More recently, FPGAs have been gaining popularity in domains they have nottypically been used for, such as cloud computing [[39], [38]]. Some of of theworld’s biggest datacenters, such as Microsoft and Baidu, are now deploying FP-GAs in their servers [[44], [23], [6]], and Amazon is now offering FPGA cloudinstances in the amazon web services[3]. Additionally, with the acquisition of Al-tera by Intel in 2015 [13], FPGAs may become more closely tied to general purposearchitectures, making them more accessible and increasing the use in new markets,such as Cloud computing.FPGAs are traditionally programmed using hardware design languages (HDLs),such as Verilog or VHDL. Hardware design is notoriously more difficult comparedto software development. This is one of the main issues with using FPGAs foraccelerating large scale applications. However, recent advances in high-level syn-thesis (HLS) significantly increase the productivity of hardware design by enablingthe designers to use higher level software programming languages, such as C/C++and OpenCL, which makes FPGAs easier to use for accelerating larger scale ap-plications. Therefore, HLS is now becoming a part of the main hardware designflow [[7], [42]]. This raises the question - can we modify the FPGA architectureand CAD flow such that they can be more efficiently used by HLS tools? In thisdissertation we aim to answer this question by exploring the potential of improving2the architecture of FPGAs to better tune them for HLS design flow by analyzingthe characteristics and requirements of designs generated by HLS tools.1.1 MotivationFPGA architecture consists of an array of generic programmable logic blocks andprogrammable routing switches that enables them to implement any logic func-tion. This flexibility comes with the cost of area, performance, and power over-head that causes the FPGAs implementation of a given design to be at least anorder of magnitude larger than the ASIC implementation, with a critical path de-lay ratio of about 3 to 4 [14]. To bridge this gap, FPGA designers have intro-duced hard blocks such as mutiplier/accumulator, block memories, and floatingpoint units to modern FPGA architecture to mimic efficiency of ASICs for a com-mon set of operations[17]. Hard blocks are ASIC-like hardware units that are lessprogrammable, but more efficient than programmable logic blocks. Despite theirefficiency improvements, the area of underutilized hard blocks is wasted, therefore,the hard block architecture must consist of function units and logic operations thatare commonly used among a representative set of important FPGA applications.The existing hard blocks on FPGA architectures have been designed to accel-erate the operations that are common among the original application domains thatwere using FPGAs. However, the recent shift to use FPGAs in new domains withvarying processing requirements raises the question - are there other common op-erations among these new application domains that can benefit from being mappedto hard blocks? The same question can be asked regarding the recent increasedpopularity in the use of HLS tools - Due to their automated nature to generatehardware designs, as opposed to a human hardware designer approach, is it pos-sible that they generate any special structure in hardware that can be exploited bynew hard blocks?In this work, we aim to answer this question by studying the controller unithardware generated by HLS tools. HLS tools often generate large explicit con-troller units that are modelled by finite state machines. These control units canhave a big influence on the total area of the design in cases where the realiza-tion of the data path requires a large number of states and control signals [20].3In this dissertation we analyze the characteristics of the finite state machines thatare generated by HLS tools and argue that these state machines all share commonbehaviours that can be exploited to design an alternative hardware implementationfor such FSMs. We evaluate our proposed architecture by detecting and extractingthe FSM as a standalone circuit from the application and compare it against thebaseline FSM implemented purely in FPGA soft logic. We show that our proposedarchitecture has a great potential to reduce the area implementation of the next stategeneration logic in FSMs as well as reducing its critical path delay.1.2 ContributionsThis thesis makes the following contributions:• Identifying common characteristics among state machines generated by HLS-tools.• Proposing a novel architecture to improve area efficiency of next state calcu-lation logic in FSM implementation without affecting performance.• Proposing a novel state encoding technique which exploits certain propertiesof HLS-generated FSM described in Section 3.1.2.• Evaluating the the area and delay improvement of the proposed architecturecompare to a baseline FPGA architecture which shows an average area re-duction of 70% as well as critical path delay reduction of 45%.1.3 OrganizationThe rest of this dissertation is organized as follows:• Chapter 2 first details the background FPGA architecture used in this dis-sertation. It then provides the necessary background on finite state machinesas an important part of digital systems.4• Chapter 3 performs analysis on FSM characteristics and presents a novelconfigurable architecture to improve the area efficiency of FSM implemen-tation on FPGAs.• Chapter 4 presents the methodology used to implement and evaluate thedesigns introduced in Chapter 3.• Chapter 5 evaluates the area/delay improvements of the proposed special-ized FSM architecture.• Chapter 6 discusses related work.• Chapter 7 discusses directions for potential future work.• Chapter 8 concludes the dissertation.Next section describes the necessary background for this work.5Chapter 2BackgroundThis chapter presents the necessary background information to understand the con-tributions of this Thesis. First, this chapter describes the architecture of contem-porary FPGAs. It then discusses the required background to understand the finitestate machines and their traditional implementation methods on FPGAs. Finally,this chapter provides a brief summary of the standard hardware design flow and theaddition of High-Level Synthesis (HLS) tools to this flow.2.1 FPGA ArchitectureA traditional FPGA architecture consists of an array of generic logic blocks thatare connected via configurable routing channels. The main components of theselogic blocks are n-input (normally 6-input) Look-Up Tables (LUTs), small one-bithard adders, and optional flip-flops that enable registering the output of the block.A n-input LUT can be configured to implement any logic function that maps then-bit input to a 1-bit output. Therefore, using LUTs in logic blocks turns them intogeneric flexible blocks that are capable of implementing any logic function [15].As discussed in the Chapter 1, in modern FPGA architectures some of thesegeneric blocks are replaced by hard blocks such as multiply-add, floating pointoperations, and memory blocks to improve the efficiency of these specific set ofoperations [14]. This is shown in Figure 2.1.6Figure 2.1: Basic FPGA architecture [15]2.2 Finite State MachinesIn this section we present background information for Finite State Machines (FSMs)as an important part of digital circuit design and review the alternative methods ofimplementing FSMs on FPGAs.2.2.1 Finite State Machine Definition and Representation ModelsLogic circuits consist of two main parts: data-path and control-path. The control-path is also known as the control unit.The general block diagram of digital circuitsis shown in Figure 2.2. The data-path can be described as functional units thatperform the computational tasks (data operations) in an application [21]. The con-trol unit, on the other hand, generates the control signals required to direct theoperation of data-path according to the timing constraints, data, and control de-pendencies in an application. Finite state machines are a common way to describethe control path in logic circuits. As the name suggests, an FSM is composed of alimited set of states and the corresponding transitions between these states. Each7Figure 2.2: Digital systems structurestate corresponds to a specific state in the real design. The transition between thesestates happens based on the current state of the system and the set of inputs to theFSM. Each states has a set of associated control signals that are dependant on thecurrent state of the system and, potentially, the input signals. In a Moore FSM theoutput signals are defined only based on the current state of the system, where asin a Mealy model both inputs and current state are used to determine the value ofoutput signals.State Transition TableA state transition table is one of the common ways of representing an FSM. Thestate transition table is a truth table, where the inputs and current state form theinput column of the table, while the output column contains the next state valueand outputs of the FSM. It is a simple method to define the state transitions and thevalues of output signals based on the current state and inputs.State Transition DiagramThe state transition diagram is the equivalent graph-based representation of thestate transition table [21]. A state transition diagram is a directed cyclic graph(DCG) G = (V,E) where each vertex vi ∈V represent a unique state and each edgeei j ∈ E shows a transition from the corresponding state vi to the v j. The edge labelsindicates the input sequence that causes the corresponding transition. Dependingon the FSM model, Mealy or Moore, the output of each states will be either part of8Figure 2.3: The state transition graph of a Mealy FSM.the edge or vertex label respectively. This is shown in Figure 2.32.2.2 Finite State Machine ImplementationsThere are two main approaches to implement FSMs on FPGAs, which are dis-cussed below.LUT-Based ImplementationA LUT-based implementation is the common conventional way to implement FSMson FPGAs. Figure 2.4 shows the block diagram of an FSM. It consists of state reg-isters to hold the current state value, and combinational logic to calculate the nextstate value and output signals. The combinational logic is implemented using FP-GAs’ LUT-based logic blocks. However, the flexibility of LUTs to implement anylogic function comes at cost of increased area, power, and performance. Logic min-imization algorithms and state assignment techniques are used to find the optimalcombination circuits, which realize the state transfer function and output function[21].RAM-Based ImplementationAfter embedded block RAMs were introduced to FPGA architectures, a large bodyof research investigated the benefits of using block RAMs as an efficient method9Figure 2.4: General structure of finite state machinesfor implementing FSMs [[26], [34], [30], [31]]. RAM-based FSM implementa-tions have a great potential to reduce the area usage by utilizing less of the FPGA’srouting and logic resources, which consequently improves the area and power con-sumption of the design. Figure 2.5 shows an example of a RAM-Based FSMimplementation. In this example, the FSM has q inputs, r outputs, p states, whichrequires a n-bit encoding. The state value will be stored in a n-bit register and to-gether with the input, form the address to the memory unit to look up the value ofthe next state and output signals. Such a memory unit will have 2(n+q) entries ofsize (n+ r) to accommodate the next sate and output values for all the combina-tions of current state and input values. However, one potential problem with suchimplementation is the exponential growth in memory size with an increase in num-ber of states and inputs. For the scenario where there are several inactive inputs ateach states that do not contribute to the next state calculation, a potential solutionhas been proposed that utilizes a selecting mechanism to choose the active inputsat each state to address the memory locations in order to avoid the unnecessaryincrease in the memory size [10].2.3 Hardware Design FlowProgramming hardware tend to be more difficult compared to software develop-ment. The traditional hardware design flow requires designers to use low-levelhardware description languages such as Verilog and VHDL, to directly describe agiven high-level algorithm. This description is typically at register transfer level(RTL) where a circuit is described by its logic operation, registers, and their cor-responding data flow. The RTL design will then be mapped to an FPGA using theElectronic Design Automation (EDA) tools , which after synthesizing the RTL de-10Figure 2.5: Memory-based implementation of FSMssign into a gate-level netlist and applying the logic optimization techniques, try tomap the design onto an FPGA architecture in an iterative manner.The very low-level nature of RTL design, various design constraints and re-quirements, and long EDA process makes the hardware design a very challengingand time-consuming task compared to the typical sequential programming in soft-ware. On the other hand, the large scale and evolving nature of applications in newdomains, such as cloud computing, makes the hardware design for applications insuch domains even more challenging. Therefore, to make FPGAs a more feasiblesolution, there needs to be a mechanism to ease the hardware programming, suchas high-level synthesis (HLS), which is described below.2.4 High-Level SynthesisHigh-level synthesis (HLS) tools try to assist with this issue by raising the levelof abstraction and letting the designer use a high-level language such as C/C++ orOpenCL for describing the desired algorithm to generate an RTL design.HLS tools are becoming increasingly popular due to the recent improvementsin their underlying algorithms, which has enabled them to generate RTL designsthat have comparable quality with a hand-coded RTL design by an expert hardwaredesigner [7]. High-level synthesis tools use the control/data flow graph (CDFG) ofa given program as the main starting point to generate the corresponding RTL de-11sign. Similar to the logic circuit describe in Section 2.2.1, the generated hardwareis composed of two main parts: (1) datapath and (2) control path.The datapath corresponds to the operations and data flow in the the given high-level program while also taking the resource constraints of the target FPGA archi-tecture, such as number of available specific hardware units, into account [7]. Thecontrol path is described using an FSM which is constructed after performing twomain tasks: (1) scheduling and (2) binding. Scheduling is the process of identify-ing the cycle in which each operation can be performed given the timing/resourceconstraints of the target architecture and control and data dependencies in the in-put application [7]. Binding is the process of mapping the given operations andvariables to hardware units that are capable of implementing them while also tak-ing the resource constraints into account [7]. For example, an addition operationis mapped to an adder on the FPGA. If the schedule allows for 20 additions tobe performed on the same cycle given the data dependencies, but there are only10 hardware addition units, the binding task will modify the schedule to performthese operations over two cycles. In a scenario where same hardware unit is sharedbetween multiple operations, the input sources to each operation and the outputconnections will be defined based on the output of the FSM.12Chapter 3Control-Path OptimizationThis chapter studies the potentials of using specialized hardware blocks for im-plementing Finite State Machines (FSMs) to improve the area efficiency and per-formance of the control unit portion of the RTL designs generated by high-levelsynthesis tools. We propose a novel configurable mixed-grained architecture, thatmakes use of unique characteristics of the FSMs generated by HLS tools to reducethe silicon area that is required for FSM implementation. This is achieved with-out affecting the control unit performance and, in most of the cases, improves thecritical path delay as well.The rest of the chapter is organized as follows: First we perform analysis onselected characteristics of finite state machines. We show that these characteris-tics can be used to design an architecture that more efficiently uses the silicon areacompared to conventional LUT-based implementation of state machines. We thendescribe our proposed architecture in detail and a proposed state encoding tech-nique that has been developed in order to better exploit these specialized FSMblocks. Finally we describe the technology mapping algorithm that we have devel-oped to map a given finite state machine to specialized FSM blocks.3.1 Finite State Machine AnalysisIn this section we define and analyze specific characteristics of Finite State Ma-chines that can be exploited to design a custom FSM block. We then present the13required state encoding technique to be able to efficiently utilize the custom FSMblocks. Finally, we show that Finite State Machines generated using High-LevelSynthesis tools always demonstrate such characteristics, hence they are great can-didates to benefit from our proposed customized blocks.3.1.1 PreliminariesThis section presents the preliminaries for our Finite State Machine analysis.Definition 1. State Transition Diagram: Finite state machines can be representedby their state transition diagram. State transition diagram is a directed cyclic graph(DCG) G = (V,E) where each vertex vi ∈V represent a unique state and each edgeei ∈ E shows a transition between two corresponding states (vertecis). In the restof this chapter, we refer to vertices and states interchangeably.Definition 2. Directed Path: Directed path is a finite sequence of edges followingthe same direction which connect a sequence of vertices.Definition 3. Vertex degree: The degree of a vertex of a graph is defined as thenumber of edge incidents to the vertex. In DCGs, the vertex degree can be groupedinto fan-in degree and fan-out degree which represent the number of incomingedges and outgoing edges of a vertex respectively.Definition 4. Branch-Free Path: Given a DCG, we define a branch-free pathto be a directed path where each vertex has at most one fan-out edge but can havemore than one fan-in edge. An example of a graph with branch-free paths is shownin Figure 3.1.3.1.2 FSM CharacteristicUsing the definitions in the previous section, we now can describe two specificproperties of FSMs that can be exploited to reduce the area usage and improvecritical path delay of FSM implementations.14Figure 3.1: Branch-free path (shown in green). Each path starts and endswith a vertex with fan-out degree of greater than 1 (shown in red). Notethat vertices that belong to a branch-free path can have more than onefan-in edge.Abundance of Branch-Free PathsIf the state transition graph of a finite-state machine has long branch-free paths,then consecutive states in each path can be assigned consecutive state values (stateencoding) such that next state value can be calculated with a simple incrementoperation. This leads to a new state encoding where branch-free paths have simpleincreasing state encoding. This is shown with an example in Figure 3.2. Thegraph represents part of the state transition diagram of an FSM which contains twobranch-free paths labelled with the proposed encoding. Note that the blank statesare not part of any branch-free path since they have fan-out degree of 2. Considerthe top path with the length equal to n, if the first state in this path is assigned thestate encoding X , then following states in the path will be assigned X + 1, X + 2,... , X + n− 2, and X + n− 1 until a non branch-free state is reached. The samerule applies to the second path with the length equal to m where the first state ofthe path is assigned the state encoding Y and the following states in the path willbe assigned Y +1, Y +2, ... , Y +n−2, and Y +n−1. Hardware implementationfor such state machine has an opportunity to reduce the silicon area, since the nextstate calculation logic for states that belong to branch-free paths can be realizedwith a simple adder along with small control logic in hardware. Section 3.2.4provides more detail on how this adder unit is utilized in our proposed architecturefor implementing FSMs.15 XYX+1 X+2 X+n-2 X+n-1 Y+1 Y+2 Y+m-2 Y+m-1Figure 3.2: An example of the state encoding for the states that belong tobranch-free paths.Low Fan-Out DegreeFor a given FSM, the maximum number of possible next states for any given statecan be calculated using the following equation:min(2q, p)Where q is equal to the total number of inputs to the state machine and p repre-sent the total number of states. However, not all of the input signals are active indifferent states, therefore the number of reachable states from a given state can be,and often is, far less than the maximum. For each given state, the fan-out degreerepresent the number of reachable states from that given state.For the state machines with abundance of branch-free paths, the remainingstates which are not part of any branch-free path form a smaller subset of the statemachine. If the states that belong this subset have low fan-out degree, there is po-tential for a hybrid memory-based FSM implementation that is independent of theinput size. Section 3.2.4 provides more detail on how a small memory unit is in-corporated to our proposed architecture for implementing the next state calculationof the states the do not belong to a branch free path.3.1.3 HLS-Generated Finite State MachinesThe results of the analysis on the finite state machines extracted from two sets ofHLS benchmarks used in this thesis are presented in Figure 3.6 and Figure 3.7.The details of these benchmark sets are described in Chapter 4. The RTL code forthese benchmarks is generated using Vivado HLS, an HLS tool by Xilinx [42].For the MachSuite benchmarks, we used the default set of HLS optimizationdirectives that were shipped with the benchmarks such as loop unrolling, loop16pipelining, and memory partitioning. In Section 5.6, we specifically analyze theimpact of applying HLS optimization directives on generated FSMs by lookingat three benchmarks from MachSuite. The HLS directives are obtained using themethodology and results described in [18] which aim to minimize the area-delayproduct of the generated RTL design. As the results suggest, the size of FSMsand fraction of branch-free paths are not negatively impacted (i.e. the branch-freepaths still exist and are a large fraction of total states). In fact, for these threebenchmarks, the fraction of branch-free paths actually increases.For the datacenter benchmarks, BZIP, Lucy, and SQLite (getToken function),HLS optimization directives were applied while generating the RTL design. Forthe the remaining benchmarks/functions in this benchmark set, no optimizationwere applied. However, based on our analysis in Section 5.6, we expect a similarbehaviour to the result shown for optimizing Machsuite. Applying and evaluatingfull optimizations on all benchmarks is left to future work.Figure 3.6 shows that more than 80% of the states in each FSM do not re-quire any input and only have one possible next state, which means they belongto a branch-free path. Figure 3.7, which shows the fan-out-degree (transitionsper state) statistics also indicates that there is at most 4 reachable next states forany given state. Therefore, finite state machines coming from HLS-generated RTLcodes have a great potential to benefit from our proposed architecture.3.1.4 Data Flow Height ExperimentThe FSM analysis on the HLS benchmark sets exposed two common patterns: lowfan-out degree and long branch-free paths. This raises the question: what is thecause of the low fan-out degree and long branch-free paths in FSMs among allof the HLS generated RTL codes? Our hypothesis is that these common patternsare caused by data dependent instructions and the latency of instructions within abasic block. To evaluate our hypothesis of data-dependence leading to branch-freepaths, we look at the mechanisms used by HLS tools to generate the RTL code fora given application. As discussed in the background section, HLS tools rely on thecontrol/data flow graph (CDFG) of an application and consider the resource and17timing constraints of the target hardware to perform scheduling. The outcome ofscheduling is used to generate the control unit and, consequently, the FSM that willdirect the operations of the data path.To understand the impact of data-dependence on scheduling, we mimic thebehaviour of an HLS scheduler by constructing a simplified equivalent FSM of agiven program from the control flow and data flow graph. Our simplified equivalentFSM assumes that there are infinite resources on the FPGA, the latency of anyinstruction is one cycle, and that data-dependent instructions cannot take place onthe same cycle. These simplifications aim to limit the scheduling to data dependentinstructions.The following steps describe how the simplified equivalent FSM is constructed.Figure 3.3 visualizes this process for the program shown in Figure 3.3a.• Step 1: Construct the control flow graph (CFG) of the program (Figure3.3a).• Step 2: Construct the data flow graph (DFG) for each of the basic blocks inthe CFG. Each node of a DFG shows an operation and edges are representa-tive of data dependencies among these operations (Figure 3.3b).• Step 3: Apply unconstrained list scheduling [21] separately on each of thedata flow graphs, with the simplifications described above (Figure 3.3c).• Step 4: Given that each of these data dependent operations may be per-formed by functional units that require appropriate control signals, each ofthese operations needs to be a separate state in the equivalent FSM. Replaceevery cycle of each scheduled DFG with a corresponding state in the equiv-alent FSM (Figure 3.3d).• Step 5: Finally, connect the states to construct the equivalent FSM. For thestates belonging to the same scheduled DFG (within a basic block), apply anedge directly between the states. To construct the transitions between statesin different DFGs, replace each control edge between two basic blocks inthe CFG with an equivalent edge between states in the FSM. The equivalentedge connects the last state of the predecessor basic block (i.e., cycle N of18the DFG for the predecessor basic block) with the first state in the successorbasic block (i.e., cycle 0 of the DFG for the successor basic block)(Figure3.3d).The equivalent FSM constructed by this approach is a naive representation ofthe FSM that is generated by HLS tools for a given program. For example, multipleoperations may be able to be performed on a single cycle, long latency instructionsmay result in multiple states, or there may be resources limitations in the numberof operations that can occur per cycle. However, the simplified FSM maintains theimpact of data dependence. We use our simple approach to perform analysis on theequivalent FSMs of SPEC2006 INT benchmarks[32]. Along with the MachsuiteHLS benchmarks, we chose the SPEC2006 INT benchmark suite to compare thebehaviour of the benchmarks that are and are not necessarily written for HLS.Figure 3.4a presents the fan-out degree of the equivalent FSM for both theMachsuite and SPEC CPU2006 INT benchmark suites. As can be seen, bothbenchmarks demonstrate very similar behaviour in the fan-out degree, with over85% of the states having a single next state. Based on our construction of the equiv-alent FSM, these single fan-out edges are caused by data dependencies. However,they are independent of input as the timing schedule is predetermined in advance.Although the simplifications in the equivalent FSM may affect the result ofthe fan-out degree experiment, the impact will mostly affect the number of stateswith fan-out degree equal to 1. The result of this experiment (Figure 3.4) shows avery large ratio between single and multi fan-out degrees. Hence, we believe thateven with the assumptions discussed above, the equivalent FSM provides a goodapproximation of the actual FSM to highlight the existence of a large fraction ofnodes with a single fan-out edge.3.2 Specialized FSM BlockIn this section, we first discuss the potential approaches to optimize the implemen-tation of HLS-generated FSMs. We then describe our proposed specialized config-urable architecture that takes advantage of the HLS-generated FSM characteristicsto implement the FSMs in a more area/delay efficient manner. We also provide19(a) C code (b) CFG(c) DFG scheduling (d) EquivalentFSMFigure 3.3: Equivalent FSM construction20(a) MachSuite(b) Spec CPU2006 INTFigure 3.4: Fan-out degree frequency in the equivalent FSMs of MachSuitebenchmarks and Spec CPU2006 INTdata that justifies the main design decisions that lead to the final architecture of thishybrid architecture.3.2.1 Design Space ExplorationThere are different approaches to exploit the properties of HLS-generated FSMs tooptimize the FSM implementation on FPGAs. These approaches along with theirbenefits and drawbacks are described below:21Specialized Hard BlocksOne potential solution is to introduce a specialized hard block to FPGA Architec-ture that is designed to only implement the HLS-generated FSMs. Such specializedhard block is extremely efficient compared to the FPGA LUT-based logic blocksdue to the reduced overhead that comes with the flexibility of LUT-based logicblocks. In this thesis, we propose and evaluate a novel mix-grained architecturethat makes use of such hard blocks to improve the efficiency of FSM implementa-tion on FPGAs.Soft-Logic and Block RAMsAnother potential approach is to use FPGA soft-logic to implement an adder unitthat interacts with the existing embedded memories on FPGAs to implement theFSM in a more efficient manner. In this approach, the synthesis tool is responsibleto transform the FSM description to a logic circuit that implements the FSM usingan adder, control logic, and memory.This approach does not require any modification to the existing FPGA archi-tecture. However, due to the overhead of the soft-logic implementation of the adderand control logic along with using programmable routing between the adder, con-trol logic, and memory unit, this solution is not as efficient as having specializedhard blocks. Evaluation of this approach is left to future work.Modified Block RAMsAnother solutions is to modify the existing block RAMs on FPGAs by adding ahard adder and control logic to these blocks such that they can support our proposedFSM implementation. This is a promising approach, however, the existing blockRAMs on FPGAs are typically synchronous memories. Therefore, digital designsthat are not latency tolerant are not able to be mapped to such block RAMs, sincea synchronous memory read adds one cycle delay to the state calculation process.Evaluation of this approach is left to future work.22Figure 3.5: High-level view of the specialized FSM architecture3.2.2 Mix-Grained ArchitectureOur proposed architecture consists of both fine-grained (soft) and coarse-grained(hard) logic that are connected via hard (or flexible FPGA) routing, which togetherform the mix-grained architecture. The high-level architecture is illustrated in Fig-ure 3.5. The coarse-grained part of this architecture implements the next statecalculation and consists of two main units, the accumulator unit and memory unit.The accumulator unit takes care of state calculation for the states that belong to thebranch-free paths, while the memory unit stores the next states for the remainingstates, along with some metadata that will be described later. As previously men-tioned, these remaining states tend to have a low fan-out degree, which makes themwell suited to be stored in memory, since fan-out degree directly corresponds to thenumber of entries per-state in memory. The fine-grained part of this architecturetakes the current state and input signals and tries to minimize the address space ofthe memory unit, and hence, the memory size. As mentioned in Section 3.1.3,the reduction of the state address space is possible since the number of reachablestates from a given state is often much less than the maximum possible number ofreachable states.The following sections describe the the coarse-grained and fine-grained partsof the proposed architecture in more detail.233.2.3 Input Sequence Encoder UnitThe input sequence encoder unit implements a configurable encoder using theFPGA soft logic. Figure 3.6 and Figure 3.7 explain the intuition behind hav-ing such unit. At each state of the state machine only a subset of input signalsimpact the state transition. This subset of inputs are called active inputs. FSMsextracted from our benchmarks sets have variable number of inputs ranging from3 to 56, however, the number of active inputs at each state is much less for thesebenchmarks (characteristics of FSMs are listed in detail in Chapter 4). As shownin Figure 3.6, the number of state machine active inputs per state varies from 0 to 5,however, the number of next reachable states from a given state (i.e number of fan-outs per node in the state transition graph) does not exceed 4. This means that thechoice of next state, which corresponds to the memory address, can be representedby only 2 bits instead of 56. Therefore, we use a simple encoder that maps thepossible large input sequence for the state machine to a smaller sequence of lengthlog2 (maximum number of reachable states per state). This significantly reducesthe size of the memory unit that is used for next state calculation as it enables us toavoid storing don’t care data for unreachable states. The input sequence encoderunit can be easily implemented on a LUT-based cluster as part of the conventionalFPGA architecture.3.2.4 Coarse-Grained FabricThe coarse-grained fabric corresponds to the “Next State Generation” block in Fig-ure 3.5. By analyzing the edge distribution of the state transition graphs amongour benchmark suite (discussed more in Chapter 4), we observed abundance ofbranch-free paths, states with only one next state where the transition betweenstates is input-independent. In Section 3.4 we describe a simple encoding that en-ables using a single accumulator in order to calculate the next state value for suchstates.Figure 3.8 presents a detailed breakdown of the next state generation blockshown in Figure 3.5. The following subsections describe each of the componentsin more detail.There are timing requirements for the FSM block that require delay of certain24Figure 3.6: Number of active inputs per state calculated as average over 46FSMs extracted from 21 benchmarks generated by HLS. Details of thebenchmarks are described in Chapter 4.Figure 3.7: Edge distribution: number of transitions per state calculated asan average over 46 FSMs extracted from 21 benchmarks generated byHLS. Details of the benchmarks are described in Chapter 4.data. For example, metadata read from an entry in memory corresponds to the nextstate and in case of the branch-free paths, metadata is used for the entire path as25such we require a mechanism to save the metadata. That is why we use the registersto delay the metadata by one cycle such that they apply to the next state and in thecase of the branch-free path to the l next following state where l is the length of thepath. ”Path Final State Register”, ”Branch Address Register”, and ”State ControlRegister” are the registers that we have used for this purpose which are explainedin detail below.Figure 3.8: Next state generation blockAccumulator UnitThis unit is responsible for calculating the next state values for the trailing input-independent states on a branch-free path. After applying the proposed state en-coding, the next state value for states that belong to a given branch-free path ina state transition graph can be calculated using a simple accumulator along withsome extra information, which are described below:• Adder: The adder is the main part of the accumulator unit. It takes in the cur-rent state and increments the value to calculate the next state in the branch-free path. It has two inputs: a single 1-bit value set to one, and the currentstate value coming from the output of the state register.26• Control logic: While an adder is enough to calculate the next state value forthe branch-free states, it is not sufficient to determine when we have reachedthe end of the branch-free path. Additionally, once we have reached the endof the branch-free path, we need to read the next state value from memory.However, the address to this state is not just the current state encoding, sincethe memory address space is separate from the state encoding for branch-freepaths. Therefore, we use two metadata registers to store this information foreach path.– The Path Final State Register is set to the state value of the last stateon the path. This is used to mark the ending state of the path.– The Branch Address Register is set to the address we should readfrom the memory once we have reached the end of the branch-freepath.– The comparator is used to compare the values of the path final stateregister with the output of the accumulator, and then generates the con-trol signal which decides if the next state value should come from theaccumulator or the Branch Address Register.Memory UnitThe memory unit is responsible for storing the next state value for the sates thatdo not belong to a branch-free path along with some metadata to assist with thetransition from the accumulator unit to memory unit. To avoid adding an extracycle delay to the next state calculation, an asynchronous memory block must beused for this unit. Figure 3.9 shows the content of a row in memory. It consist offour main fields: (1) Next State Value (2), Path Final State, (3) Branch Address,and (4) State Control bit. The first and fourth fields always have a valid value,however the second and third fields will only be valid in the case where next statebelong to a branch-free path. In this case, the contents of these two fields will beregistered into the registers described in the accumulator unit, as described above.The last field, state control bit, is used to determine if the source of the next statevalue should be the accumulator unit or memory unit. This field will be registeredinto the control unit register that will be described below.27The depth of the memory is dependent on the non-branch-free states and edgedistribution and the width is based on the next state plus metadata. In Chapter 5we describe how we size the memory unit in detail.Figure 3.9: Memory contentControl UnitThe control unit is responsible for selecting the source of the next state value be-tween the accumulator unit and memory unit using a multiplexer which is con-trolled by the “State Control Register”. The State Control Register can be set intwo different ways: (1) The State Control Field of the memory unit for the givenstate Figure 3.9, or (2) the result of the comparator in the accumulator unit whichmarks the termination of the branch-free path, Figure 3.8). At any given time,either the memory unit or the accumulator unit is active and responsible for calcu-lating the next state value. The active unit is responsible for selecting whether thesame unit is active on the next cycle or the other unit is active on the next cycle.This is implemented as a feedback loop from the State Control Register to the se-lect signal of the multiplexer feeding the State Control Register. This continues toselect the same unit until that unit signals a transition by toggling zero to one orvice versa.State DecoderWe provide an optional binary to one-hot decoder at the output of this block toenable more efficient binary to one-hot conversion if required by the rest of thecircuit.283.3 Fracturable FSM Hard BlocksThe size of finite state machines can vary significantly among different applica-tions. As mentioned in Chapter 2, any hard block on FPGAs will be wasted ifnot fully utilized by the applications, leading to fragmentation in the hard blocks.Therefore, to be able to efficiently accommodate state machines with various num-ber of states, we propose fracturable FSM hard blocks. The main idea behind hav-ing fracturable FSM blocks is to tailor the block size such that it accommodate thestate machines with an average size while supporting combination of two blockssuch that they can accommodate large FSMs that do not fit into just one block.Tomap a large state machine to multiple smaller combined blocks, the state machineneeds to be partitioned to multiple sub state machines, and the architecture shouldenable fast transition between these blocks. In this work we only look at parti-tioning state machines to two sub-machine which enables us to accommodate allthe FSM benchmarks that we use, however, our approach can be easily applied tomore partitions. We first describe our partitioning method and then propose minormodifications to the FSM hard block to support FSM block combination.3.3.1 FSM PartitioningGiven an input graph G = (V,E), the objective of bi-partitioning problem is to par-tition the vertex set V into two disjoint subsets with the main goal of minimizingthe number of edges between two subsets. The FSM partitioning (decomposi-tion) problem is a famous problem [21] with plenty of possible solutions whichare mainly proposed to target complex state machines. For the purpose of parti-tioning HLS-generated FSMs, which are less complex in terms of the number oftransition between different states, we chose a classic algorithm known as Fiduccia-Matheyses partitioning algorithm [9] and, with an example in Chapter 5, show thatit works very well for the state machines that are generated by HLS tools. Fiduccia-Matheyses partitioning algorithm is an iterative mincut heuristic algorithm with alinear computation time with respect to the size of the input graph.293.3.2 Fracturable FSM Block ArchitectureWhen splitting an FSM over multiple fracturable blocks, every state transitionacross two different blocks requires control signals that enable switching betweenthese blocks. For example, if the current state X is mapped to the fracturable blockA and the next state Y is mapped to the fracturable block B, then when the transi-tion occurs, the state register of block A must enter the idle state (described later),and the state register of block B must be updated to Y. To enable this switchingbetween the blocks, state X must carry the metadata that controls this transition. Apotential candidate to store this metadata is the memory unit. If the states that markthe transition across fracturable blocks are stored in memory, an extra field on eachmemory row can be used to store the required metadata for transitioning across theblocks. In this example, state X must be stored in memory. For this work, we onlyallow splitting the FSM over two fracturable blocks, thus a single bit in memoryis sufficient to indicate whether the next state should be calculated in this block orthe other fracturable block that implements the same FSMBy transitioning to another fracturable block, we enter a new state which caneither be mapped to the memory unit or accumulator unit. As described in Sec-tion 3.2.4, if this state is mapped to the accumulator unit the control registers,specifically, Path Final State Register, Branch Address Register, and State controlRegister must be updated as well.This requires extra multiplexer logic that allowssetting the value of these registers from multiple sources which are the memoryunit on the same block as well as the memory unit on the other fracturable block.To simplify this scenario and reduce the overhead logic, we decide to only mapthis state to the memory unit to avoid the need for updating the control registersdescribed above.To summarize, for any transition across two fracturable blocks, both currentstate and next state must be stored in memory. Although this increases the requiredmemory size to accommodate the FSM, a proper partitioning algorithm that aimsto reduce the number of transitions between blocks can limit this memory overheadto only a few extra entries.Additionally, we must add a multiplexer before the state register in each frac-turable block to allow updating the state value using the data stored in the other30fracturable block, for the scenario where there is a transition between two blocks.We dedicate the state value zero as the idle state. Once the state calculation tasksgets transferred over to the other fracturable block, the inactive block will enter theidle state by updating its state register to zero.The overheads and issues of splitting the FSM over more than two fracturableblocks are discussed in Chapter 5.3.4 State Assignment AlgorithmThe state assignment (encoding) problem is defined as determining the binary rep-resentation of the states in a finite-state machine such that each state has a uniquevalue to separate it from the other states [21]. The state encoding directly af-fects the circuit area and performance as different encoding results in differentcircuit complexity. The choice of circuit implementation, such as two-level logic,multiple-level logic, or in our case a mix-grained architecture that contains a spe-cialized ASIC-like FSM block, also plays an important role in finding the stateencoding that optimizes the circuit area and/or performance. For the purpose ofmapping to our specialized FSM architecture, the circuit area is measured by thenumber of bits required for state encoding, number of states that have to be mappedto the memory unit, and the logic complexity of the input sequence encoder. Wepropose a novel state assignment technique for the FSM targeting our FSM block.This technique aims to minimizes the FSM area by mapping as many states to theaccumulator logic as possible and minimizing the number of states that reside inmemory, hence reducing the input encoder logic complexity.Our state assignment algorithm consists of two main parts: (1) identifying thestate categories and (2) performing state encoding separately on each state cate-gory. Before describing how we categorize the states, we first explain why weneed different categories. Our proposed FSM block contains two main parts thatperform the next state calculation: the memory unit and accumulator unit. Eachunits is responsible for determining the next state value for a subset of states. Aproper state encoding for each subset must consider the limitations and require-ments of the unit that is in charge of the next state calculation for this subset. Thislead us to group the states into two categories based on whether their next state31is calculated by the memory unit or accumulator unit. Below, we discuss the re-quirements of each unit in detail and explain the actions required to meet theserequirements. Then we explain how to categorizes the states.Memory Unit RequirementsThe main requirements are the memory size and address signal generation. Theread address signal of the memory unit is formed by concatenating the value ofthe current state and encoded input signals that comes from the input sequenceencoder. However, only a subset of states of the FSM reside on the memory, hencenot all the bits of the current state signal are necessary for addressing the memory.For example, if the number of the states that are stored on memory is equal ton, then only log2 n bits of the current state signal are required for addressing thememory. Therefore, the state encoding for these states must be between zero and nto minimize the size of memory.Accumulator Unit RequirementsAs described in the Section 3.2.4, the accumulator unit performs the state calcu-lation for the states that belong to branch-free paths, hence it is necessary for theconsecutive states of each path to have consecutive state values. However, theremust be one and only one state encoding for each individual state, therefore in ascenario where two branch free paths overlap, such as path A and path C shown inFigure 3.10b and Figure 3.10c, we must first refine the paths such that they do notoverlap to avoid two encoding values for the same state.Path RefinementAs we discussed in the previous section, on any transition from a memory state toa counter state, there is metadata for the corresponding branch-free path that mustbe provided to the accumulator unit. To store this metadata, we use the memorylocation for a state that transitions to a branch-free state. For any given path, theprevious state that branches to this path must reside in memory so it can store thismetadata. As such, there must be a gap of at least one memory state between the32vertices of any two branch-free paths. Note that due to the definition of a branch-free path, any two non-overlapping branch-free paths satisfy this requirement, sincea branch-free path begins right after, and terminates where, there is a state with afan-out degree greater than 1 (divergent vertex), which corresponds to a state storedin memory. Thus, any two non-overlapping paths will be at least one memory stateaway from each other.Two branch-free paths can never overlap at the starting vertex since they willbe equivalent. However, they can overlap on any other vertex, in which case theremaining vertices will also overlap. Therefore, if two branch-free paths overlapon any of their vertices, they will definitely overlap on the ending vertex as well.The ending vertex is always a predecessor to a divergent vertex. This means thatbranch-free paths that have different starting vertices but share a common termina-tion divergent vertex, might partially overlap with each other. We use this conditionto find the potentially overlapping paths by grouping the paths that share a commontermination divergent vertex. In a scenario where the branch-free paths overlap, wemust refine the paths such that the refined paths are at least one memory state awayfrom each other as described above.The pseudo code of our proposed path refinement algorithm is shown in Al-gorithm 1. The input to the algorithm is a set of branch-free paths which sharea common termination vertex. This means that the ending node of all paths inthis set is a predecessor to a common divergent vertex (a vertex with fan-out de-gree greater than one). Note that due to the definition of a branch-free path, pathsthat do not share a common termination node will never overlap, hence this is arequired condition that indicates the potential of overlapping. After applying thepath refinement algorithm, the output is (1) a set of refined branch-free paths and(2) a set of independent vertices which, contains the states that initially belong tooverlapping branch-free paths, but are no longer part of the refined paths after ap-plying refinement. The path refinement algorithm will be used as an intermediatestep by our state assignment algorithm, Algorithm 2, which will be described af-ter the path refinement algorithm. We start by describing the details of the pathrefinement algorithm and then use an example to better illustrate these steps.• Step 1: At the initial step, set S, which will eventually contain all vertices33that belong to the refined branch-free paths is empty. This set is used tokeep track of the vertices that belong to the refined branch-free paths overdifferent iterations of this algorithm to help detect the overlaps. Second, wesort the paths that belong to the input set G based on their path length andthen add them to SPL, a sorted list of all paths from G (lines 3-4).• Step 2: At this step, as long as SPL is not empty, we select the longest pathLP from SPL to apply the path refinement process on it (lines 5-6).• Step 3: Next we traverse LP and compare each of its vertices with everyvertex of S, until we find a common vertex between LP and S or we reachthe end of path LP (lines 7-13). Note that when we first start the algorithm,set S is empty, thus none of the vertices of LP will overlap with the verticesof set S for the first path.• Step 4: After we detect two overlapping vertices at vi, we must terminate LPat vi−1. This requires cutting LP such that the predecessor of vi−1, vi−2, is theending vertex of the refined path. By doing so, LPre f ined no longer overlapswith any of the paths that have already been refined. Vertex vi−1 will nowbecome an independent state and be added to the set of independent statesIG which will be stored in memory. This independent state, vi−1, separatesLPre f ined from all others refined paths (lines 9-10).• Step 5: Next, we add the refined path, LPre f ined , to the set of refined pathsGre f ined and add all of its vertices to the set of refined path vertices S (lines15-16) to be used for the next iterations of the while loop (line 5).• Step 6: Once the while loop is completed, Gre f ined will contain the set ofrefined branch-free paths and IG will include the independent states.An example of a scenario when two branch-free paths of a state transition graphoverlap is illustrated in Figure 3.10. Figure 3.10a shows part of the state transitiongraph of an FSM (labels are omitted for simplicity), which contains 3 branch-free paths. Figure 3.10b, Figure 3.10c, and Figure 3.10d highlight these pathsindividually. These three paths all share a common termination node (show in redin Figure 3.10a), thus they might overlap. In this case, the last two states of path34(a) state diagram (b) path A (c) path B(d) path Cindep.(e) refined pathsFigure 3.10: Path refinement35Algorithm 1 Path RefinementInput: G→ Set of branch-free paths grouped by common termination nodeOutput: Gre f ined → Set of refined branch-free paths from GOutput: IG → Set of independent vertices from G1: S→ Set of refined path vertices2: SPL→ sorted list of branch-free paths3: S = ∅4: SPL = sort(G);5: while SPL! =∅ do6: LP = select the path with the longest length from SPL;7: for all vi ∈ LP do8: if vi ∈ S then9: LPre f ined = terminate(LP, i−1);10: add vi−1 to IG;11: break;12: end if13: end for14: if LPre f ined! =∅ then15: add LPre f ined to Gre f ined16: add all the vertices of LPre f ined to S17: end if18: end whileA and C overlap, therefore the path refinement algorithm must be applied on thesepaths. An example used to illustrate this algorithm is described below:In this example, the input to the algorithm is a set of branch-free paths thatcontains path A, path B, and path C which all share a common termination node(show in red in Figure 3.10a). At step 1, these paths are sorted based on theirpath length in the following order, path A (5 vertices), path C (4 vertices), andpath B (3 vertices). At the first iteration of the algorithm, path A will be selected.However, as the set of refined path vertices S is empty, path A do not requirerefinement and will be added to the set of refined paths Gre f ined as is. All of itsvertices will then be added to the set of refined path vertices S (step 2 through 5).At the second iteration, path C will be selected. By comparing each of its vertices36with the vertices of set S, which now contains all the vertices of path A, we findthat the third vertex of path C already exist in set S. Therefore, we terminatepath C at its second vertex by cutting it after its first vertex. This means thatone independent memory state, the second vertex of path C, will be used beforeoverlapping with path A to store the necessary metadata to join path A. Figure3.10e illustrates the effect of the terminate subroutine applied to path C from Figure3.10d. After applying terminate to the middle path, the refined path C now onlyhas one state. The one state gap that separates path A from the refined path C islabelled “indep.” (independent) state. At the third iteration of the algorithm pathB, the only remaining path, will be selected. Since non of the vertices of path Boverlap with any of the refined paths, it will be added to the set of reined pathsGre f ined as is. At this point the algorithm is completed and the output is (1) the setof refined paths A, B, and C (shown in Figure 3.10e), and (2) a set of independentstates which contains the vertex labelled “indep.”.State AssignmentNext, we will describe the full state assignment algorithm, which is presented inAlgorithm 2. As mentioned in the beginning of this section, the state assignmentalgorithm consists of two main parts: (1) identifying the state categories and (2)performing state encoding separately on each state category. These state categoriesare described below:• branch-free states: States that belong to non-overlapping branch-free paths.• independent states: All remaining states that either have a fan-out degreegreater than one (divergent states), or states that are initially part of the over-lapping branch-free paths but do not qualify to remain part of the path afterapplying path refinement.Below we describe the details of the state assignment algorithm (Algorithm 2):• Step 1 (Identify divergent vertices): Identify and add vertices with a fan-out degree greater than one (two or more successors) to the set of divergentvertices D (lines 9-10).37Algorithm 2 State AssignmentInput: G f sm = (V,E)→ FSM state transition graphOutput: Gencoded− f sm = (V,E)→ Encoded FSM state transition graph where eachvertex is labelled by its state encoding value1: Pnon−re f ined → Set of non-refined branch-free paths2: Pre f ined → Set of refined branch-free paths3: Pk−non−re f ined → Set of non-refined branch-free paths that share common ter-minating divergent vertex dk (Pk−non−re f ined ⊂ Pnon−re f ined)4: Pk−re f ined → Set of refined branch-free paths after applying refinement algo-rithm on Pk−non−re f ined (Pk−re f ined ⊂ Pre f ined)5: I→ Set of independent vertices6: Ik → Set of independent vertices found after applying path refinement algo-rithm on Pk−non−re f ined7: D→ Set of divergent vertices8: Si→ Set of successors of divergent vertex di9: /*find all divergent vertices in the state transition graph*/10: traverse G f sm and populate D with the vertices that have fan-out greater than 111: /*find all branch free paths in the state transition graph*/12: for all di ∈ D do13: for all s j ∈ Si do14: add the branch-free path p j that starts from s j to Pnon−re f ined15: end for16: end for17: /*group together the branch-free paths that share common terminating diver-gent vertex*/18: for all dk ∈ D do19: add every branch-free path from Pnon−re f ined that share common terminatingvertex dk to Pk−non−re f ined20: end for21: /*Apply the path refinement algorithm (Algorithm 1)*/22: for all dk ∈ D do23: (Pk−re f ined , Ik) = path refinement (Pk−non−re f ined) /* Algorithm 1*/24: Pre f ined = Pre f ined ∪Pk−re f ined25: I = I∪ Ik26: end for27: I = I∪D3828: /*state assignment*/29: for all vi ∈ I do30: assign a state encoding in an incrementing manner starting from zero31: end for32: for all Pi ∈ Pre f ined do33: for all v j ∈ Pi do34: assign a state encoding in an incrementing manner starting from the lastvalue that was used for the previous path +135: end for36: end for• Step 2 (Identify branch-free paths): Find all of the branch-free paths be-tween every two divergent vertices that have been marked in the first step andadd them to the set of non-refined branch-free paths Pnon−re f ined (lines 11-16). To identify a branch-free path, we start from a successor of a divergentvertex and add its consecutive vertices to the path by traversing the graphbefore arriving at another divergent vertex. By doing so, all the vertices onthis path will only have a fan-out degree of one, hence the correspondingpath meets the requirements of a branch-free path.• Step 3 (Group the paths based on their termination vertex): At this step,the branch-free paths that share a common termination divergent vertex dkwill be grouped together and added to Pk−non−re f ined , since this is a precon-dition for potential overlapping paths (lines 17-20).• Step 4 (Apply path refinement): Apply the path refinement algorithm (Al-gorithm 1) on each group of branch-free paths with a common terminationvertex that were obtained in step 3, Pk−non−re f ined (line 23). The output ofthis step is the subset of refined branch-free paths, Pk−re f ined , and the subsetof independent states, Ik, that are no longer part of the refined paths (de-scribed in detail in Algorithm 1).• Step 5 (Update state categories-1): Add the subset of paths that were re-fined in step 4, Pk−re f ined , to the final set of refined branch-free paths, Pre f ined(line 24). Update the set of independent vertices I by adding the vertices thatwere obtained in step 4, Ik, to this set (line 25).39• Step 6 (Update state categories-2): Add the divergent vertices, D, to thelist of independent vertices I. Set I indicates all of the vertices(states) thatwill be mapped to the memory unit (line 27).• Step 7 (State assignment-1): Finally, for the independent vertices, I, thatwere identified in step 1 through step 6, assign incremental values to thevertices(states) starting from zero (lines 29-31).• Step 8 (State assignment-2): For each branch-free path in the refined pathset Pre f ined , assign incremental values to the consecutive vertices (states).For the first path, the starting state value will be the value assigned to thelast independent state (step 7) plus one. For all remaining paths, the startingstate value is one greater than the last state value of the previous path (lines32-36).3.5 Mapping to the Specialized FSM ArchitectureIn this section we describe the steps required to map a given state machine to ourspecialized FSM architecture. The mapping process consists of 3 main steps whichare listed below:• Applying the size checking pass• Fine-grained mapping• Coarse-grained mapping3.5.1 Applying the Size Checking PassAt this step we check two required conditions to verify whether the input FSM,described by its state transition table, is suitable to be mapped to the next stategeneration block: (1) Whether the number of bits required for the encoding ofthe state machine is smaller than the maximum bit-width of the adder unit, (2) ifthe total number of states that reside in memory are smaller than the size of thememory unit. This step is performed after applying the state assignment algorithm40described in Algorithm 2. A more detailed description of this step is described inthe Chapter 5.3.5.2 Fine-Grained MappingThis part corresponds to mapping the corresponding part of the FSM to the inputsequence encoder. To do so, we must form the logic function that implements theinput sequence encoder.This is achieved by performing a transformation on the state transition table.The goal of this transformation is to reduce the number of inputs to what we callthe encoded input sequence. This is shown with an example in Table 3.1. Table3.1a shows the original state transition table of an example FSM. This table showsthe choice of next state value for a given state based on the current state and inputvalue. This FSM has 10 inputs, however, each state has no more than 3 next states.Therefor, an encoded input sequence with only 2 bits is sufficient to distinguishamong the next states of any given current state. To obtain such encoded inputsequence, we first transform the original state transitions table to a reduced tablewhich only contains the states that have more than one next state. The reduce tableis shown in Table 3.1b. We then use the reduced table as a truth table to imple-ment a logic function that takes the state machine inputs as input and generates anencoded input sequence as output.3.5.3 Coarse-Grained MappingThe next step, coarse-grained mapping, generates the memory contents for theFSM. At this point, the state assignment algorithm (Algorithm 2) has been appliedto the state machine. Hence the states that reside in memory, and their correspond-ing metadata have been determined. Using this information, the memory contentsare generated in the format shown in Figure 3.9.3.6 Putting it All TogetherIn this section, we present a complete example from C code to implementationand operation of the corresponding FSM on our proposed specialized FSM block.41Input Current state (cs) Next state (ns)10’bx s0 s110’bx s1 s210’bx s2 s310’bxxx11xxx0x s1710’bxxx11xxx1xs3s410’bx s4 s510’bx s5 s610’bx s6 s710’bx s7 s810’bx s8 s910’bx s9 s1010’bx11xxx1xxx s1110’b101xxx11xx s310’bx11xxx1xx0s10s1710’bx s11 s1210’bx s12 s1310’bx s13 s1410’bx s14 s1510’bx s15 s1610’bx s16 s1710’bx s17 s0(a) Original state transition tableOriginal input Current state (cs) Encoded input10’bxxx11xxx0xs32b0010’bxxx11xxx1x 2b0110’bx11xxx1xxxs102b0010’b101xxx11xx 2b0110’bx11xxx1xx0 2b10(b) Truth table for the Input Sequence EncoderTable 3.1: Input sequence encoder generation: the original transition table isgiven in (a), the reduced table is given in (b). {cs, active input(2bits)}will be used to address the memory instead of {cs, inputs(10 bits)}.Additionally, this example highlights different sources of branch-free paths in HLSgenerated benchmarks.421. int foo(int A, int B, int C, int N)2. {3. int x, y, z, result;4.5. x = y = z = result = 0;6. if (B > A){7. int tmp = B;8. B = A;9. A = tmp;10. }11. while( (N-- > 0) && (0 < A) ) {12. x = A / B;13. A = x;14. y = x << C;15. z = y + x;16. result += z;17. }18.19. return result;20. }Figure 3.11: An example C code: the example shows a while loop with fourconsecutive instructions each with a data dependency on the previousinstruction. Additionally each instruction performs an operation thathas a different latency in hardware such as division, multiply, shift,and add.3.6.1 Generating the FSMFigure 3.11 presents a simple microbenchmark with a function, foo, that containsconditional code (line 6), a loop (line 11), data-dependent instructions (lines 7-9and 12-16), and instructions with different latencies (e.g., divide on line 12 andshift on line 14).Figure 3.12a presents the FSM generated by Vivado HLS as part of the gener-ated RTL design for the C function shown in Figure 3.11.In this example, state S0 is the initial state, which waits for a start signal beforetransitioning to S1. State S0 and S1 correspond to the initialization code (line 5)and swap code (lines 6-10). While there is a conditional branch at line 6, the codeis easily mapped to a multiplexer in hardware, so there is no corresponding branch43s0s1s2s3s4s39s40(a) Original state diagrams0s1s2s3s4s39s40(b) Branch-free path 1s0s1s2s3s4s39s40(c) Branch-free path 2S0: 6’d0s1: 6’d2s2: 6’d40S3: 6’d1s4: 6’d3s39: 6’d38s40: 6’d39(d) New EncodingFigure 3.12: State assignment example44node in the state graph. States S2 - S40 correspond to the while loop (lines 11-17) 1.State S3 evaluates the loop condition and returns to S0 if the loop is complete,or transitions to S4 if not. The long branch-free path from S4 - S2 (37 states)corresponds to the loop body (lines 12-16) and is a result of the data dependencebetween instructions (e.g., lines 14 and 15) and the latency of the instructions. Forexample, if the divide operation at line 12 is replaced with a multiply operation,the length of the path changes relative to the difference in latency between the twooperations.3.6.2 State AssignmentThe state transition graph of the example is shown in Figure 3.12a. We first cate-gorize the states, then perform the state encoding on each category separately.Categorizing states: In step 1 of Algorithm 2, we add S0 and S3, the states withmore than one next state, to the set of divergent states. These states are shown inred in Figure 3.12a. In step 2 of Algorithm 2, we find all the branch-free paths thatstart from successors of S0 and S3. This step results in finding path1 =< S1,S2 >(Figure 3.12b) and path2 =< S4,S5,S6, ...,S39,S40,S2 > (Figure 3.12c).In steps 3 and 4, overlapping paths are identified and path refinement (Algo-rithm 1) is applied. In this example, the two branch-free paths overlap at S2, whichrequires path refinement. After applying the path refinement algorithm, the longerpath, path1, remains the same while path2 will no longer exist since S1 becomesthe independent state that separate these two paths. S1 stores metadata to supportthe case that the FSM transitions to path2 via S1 to S2.After the above steps, the only branch-free states are the states of path2. Theremaining states, along with the divergent states, are marked as independent states(blue states in Figure 3.12d). This corresponds to steps 5 and 6 of Algorithm 2.Now that we have categorized the states, we can perform state assignment oneach category according to steps 7 and 8 of Algorithm 2. The result of state assign-1The loop condition, (0 < A), was added so the loop would not be simplified by Vivado. Oth-erwise, we found Vivado would replace the while loop with the expression “N*result” making theexample less interesting for demonstrating our state assignment algorithm.45ment is shown in Figure 3.12d.Memory Unit ContentTo simplify this example, we assume that the memory unit is sized exactly to fitthis FSM, which has a depth of 6 entries and a width of 15-bits. Note that NextState and Path Final State fields (Figure 3.9) are 6 bits since we require 6 bits toencode all of the states ( f loor(log2(40states)) = 6), however, the Branch Addressfield is only 2 bits since there are only 3 states that reside in memory, hence weonly require 2 bits to distinguish among these 3 states.The memory contents for this example is shown in Table 3.2. The first twocolumns, state label and address, are not part of the memory but have been addedto the table to help with understanding which entry corresponds to which addressand state. Note that the state encoding shown corresponds to that generated byAlgorithm 2 and corresponds to the state values shown in Figure 3.12d. As such,the state encodings may be different from the state label (e.g., S0 = 6’d0, S1 = 6’d2,and S3 = 6’d1). Since the fan-out degree of the states in this example is at most 2,each state will have two corresponding rows in memory, which only requires 1 bitfor the Encoded input to select the next state. Each memory location containingan x indicates that the content of the corresponding field is unused for this state(memory row). This occurs for the state transitions where both current state andnext state reside in memory, since the Path Final State and Branch Address entriesare only used for the states that belong to branch-free paths. As mentioned earlier,independent states (including divergent states) reside in memory, so any transitionbetween two independent states contains unused memory fields.Specialized FSM Block OperationNext, we describe the operation of our proposed Specialized FSM Block usingthe example FSM in Figure 3.11. There are four possible operating conditions:transitioning from a memory state to a memory state, from a memory state to abranch-free state, from a branch-free state to a branch-free state on the same path,and from a branch-free state to a memory state. Each of these cases is describedbelow and Figure 3.13 to Figure 3.16 highlight the relevant active portions of the46specialized FSM block.• Memory state to memory state transition (e.g., S0 to S1 in Figure 3.12d):In this case, illustrated in Figure 3.13 , the FSM block behaves simply likea memory-only FSM implementation. In Figure 3.13, the current state (S0)and encoded input are used to address the memory unit, and the next state(S1) is read from memory. Using the State Control bit from memory, thecontrol unit selects the next state output from the memory to write to thestate register. Aside from the State Control bit, the corresponding metadatain the memory is unused.• Memory state to branch-free path state (e.g. S3 to S4 in Figure 3.12d):In this case, illustrated in Figure 3.14, the control registers, specifically,the Path Final State Register and Branch Address Register, must be updatedto control the branch-free path state generation for subsequent cycles. Thenext-state (i.e., the start of the branch-free path, S4) is loaded into the stateregisters and the metadata, as described in Section 3.2.4, is loaded into thePath Final State (S2) and Branch Address (S3) registers.• Branch-free path state to Branch-free path state on the same path (e.g.,S39 to S40 in Figure 3.12d): In this case, illustrated in Figure 3.15, theadder in the accumulator unit is used to increment the current state (S39 withencoding 6’d38) to the next state (S40 with encoding 6’d39). The comparatorcompares the next state with the final state of the path in the Path FinalRegister (S2 with encoding 6’d40). Since the value of the adder 6’d39 (S40)is not greater than 6’d40 (S2), the accumulator unit and control unit pass thenext state (S40) to the state registers.• Branch-free path state to Memory state (e.g., S2 to S3 in Figure 3.12d):Finally in this case, illustrated in Figure 3.16, the adder unit incrementsthe current state (S2 with encoding 6’d40) and the comparator compares thevalue of the next state from the adder (6’d41) with the value in the PathFinal State Register (S2 with encoding 6’d40). Since the value of the adderis greater than the Path Final State Register, the comparator sets the controlsignal to select the value in the Branch Address Register (S3 with encoding476’d1) to send to the state registers. This transitions out of the branch-freepath as the next state is used to address the memory unit.While not a separate case, the transition from S1 to S2 transitions from amemory state to branch-free path state that is not the initial state on the path.This behaves identically to the memory state to branch-free path state tran-sition described above, with the only difference being the initial state that isloaded into the state registers.Figure 3.13: Next state generation block - transition from a memory state toa memory stateState Address [3bits] Memory Content{CS, Encoded input} Next State Final state Target Mem/addS0{00,0} 6’d0 x x 1{00,1} 6’d2 x x 1S1{10,0} 6’d40 6’d40 2’d1 0{10,1} x x x xS3{01,0} 6’d3 6’d40 2’d1 0{01,1} 6’d0 x x 1Table 3.2: Memory content48Figure 3.14: Next state generation block - transition from a memory state toa branch-free path stateFigure 3.15: Next state generation block - transition from a state of a branch-free path to another state on the same path49Figure 3.16: Next state generation block - transition from the last state of abranch-free path to a memory state50Chapter 4Experimental Methodology4.1 BenchmarksIn this work we use two sets of C/C++ benchmarks to assist with the design andevaluation of our proposed architecture. Both benchmark sets have been developedto be used by HLS tools. The first benchmark set, MachSuite [25], is a collectionof benchmarks for evaluating accelerator design and customized architectures. Thesecond benchmark sets, HLS datacenter benchmark, is developed in our computerarchitecture group as a joint effort among a group of three students [24]. TheHLS datacenter benchmark set consists of high impact functions, in terms of runtime, extracted from Lucy [4], SQLite [12], and BZIP [32] benchmarks, and aimsto represent benchmarks that may be commonly run in a datacenter. Some partsof these benchmarks have been re-written to replace C/C++ features that are notsupported by Vivado HLS.Table 4.1 shows the number of lines of C/C++ code for the benchmarks in eachbenchmark set. This is used to highlight the size and complexity of the benchmarksto better understand the resulting FSMs from HLS.We convert the benchmarks from C/C++ to Verilog HDL using Vivado HLS.For the main part of our evaluation we use the default HLS directives provided inMachsuite to improve the quality of the generated Verilog code, however, the de-fault HLS directives might not necessarily lead to the most optimized design. Inorder to directly evaluate the impact of HLS optimization for certain optimization51Benchmark LOCbackprop 159aes 167viterbi 44spmv crs 34spmv ellpack 34nw 98bfs bulk 64bfs queue 71fft transpose 363fftstrided 43sort merge 57sort radix 116kmp 52stencil3d 46stencil2d 39md knn 71md grid 78gemm ncubed 41gemm blocked 43(a) Machsuite Bench-marksBenchmark LOCbzip 747lucy sn 78lucy sv 82lucy sa 66sqlite ln 561sqlite gt 410(b) HLS DatacenterBenchmarksTable 4.1: Number of lines of actual C code, excluding comments, for theevaluated benchmarksgoals (such as area, delay, and area-delay product) on the generated FSMs, we usethe model described in Lo et al. [18]. This work uses sequential model-based op-timization methods to automatically select the set of HLS directives that optimizethe design for different optimization goals. We use the data provided by Lo et al.to obtain the HLS directive settings that minimize the area-delay product of thegenerated RTL design for the aes, backprop and sort radix benchmarks. The resultof this analysis is provided in Section 5.652Benchmark States Inputs Max fanoutaes fsm1 47 6 2aes fsm2 76 14 2bckp fsm1 11 11 2bckp fsm2 158 10 2bckp fsm3 69 6 2bfs b fsm 8 7 3bfs q fsm 8 6 2fft st fsm 24 5 2fft tr fsm1 17 8 2fft tr fsm2 24 6 2fft tr fsm3 219 14 2fft tr fsm4 10 6 2fft tr fsm5 66 5 2gemm fsm1 10 8 2kmp fsm1 7 4 2kmp fsm2 10 6 2md gr fsm 15 10 2md knn fsm 98 5 2sort m fsm1 4 5 2sort m fsm2 7 5 2sort r fsm1 15 11 2sort r fsm2 6 4 2sort r fsm3 6 4 2spmv crs fsm 10 6 2smpv elpk fsm 9 6 2stencil fsm 4 4 2viterbi fsm 8 6 2Table 4.2: Characteristics of the FSMs extracted from MachSuite4.2 FSM ExtractionTo evaluate our proposed mix-grained architecture, we must extract the finite statemachines from each benchmark. This is achieved as follows: We use Yosys syn-thesis tool [37] to synthesize each benchmark to an RTL netlist. We then use theFSM detection and FSM extraction passes provided in Yosys to detect and extract53Benchmark States Inputs Max fanoutlucy sh fsm 71 3 2sql ln fsm1 508 56 4sql ln fsm2 7 6 3sql ln fsm3 5 6 3sql ln fsm4 10 10 3sql ln fsm5 4 4 2sql ln fsm6 4 4 2lucy sn fsm 25 5 2lucy sv fsm 12 10 4bzip fsm1 72 19 3bzip fsm2 41 11 2bzip fsm3 67 28 4bzip fsm4 17 9 3bzip fsm5 43 4 2bzip fsm6 61 19 3bzip fsm7 36 13 2bzip fsm8 117 34 3sql gt fsm1 61 48 4sql gt fsm2 12 9 2Table 4.3: Characteristics of the FSMs extracted from HLS datacenter bench-marksthe state machines from the rest of the design. These passes implement an algo-rithm similar to the algorithm proposed in [29] to extract the FSM from a flattennetlist. The extracted FSM is in KISS [27] format, a simple format to store theFSM transition table. We have developed an FSM generator in C++ which, givena finite-state machine described in KISS format, generates the Verilog HDL codethat describes this FSM. For the purpose of this work, we were only interested inthe RTL code of the next state calculation logic, hence our FSM generator onlygenerates the RTL design for the next state calculation logic and does not includethe output calculation logic in the generated design. Using this flow we are able toextract the FSM from any given benchmark and generate a stand-alone RTL designthat describes this state machine.The statistics of the FSM that we have extracted from the MachSuite and data54center benchmarks are shown in Table 4.2 and Table 4.3.4.3 Area and Delay ModelNext State Generation Block Area ModelTo model the next state generation block, which correspond to the coarse-grainedpart of the FSM architecture of Figure 3.5, we have described the architecture ofthis block in Verilog HDL. This excludes the area model used for Input SequenceEncoder which is described in next section. The memory unit is modelled usingthe Artisan synchronous SRAM compiler [5]. As described in Section 3.2.4, theMemory Unit in Figure 3.8 is an asynchronous memory. However, since we did nothave access to an asynchronous SRAM memory compiler, we used a synchronousmemory unit to model the area. We believe that the area of the asynchronous mem-ory will be comparable to the synchronous memory unit. However, a small errorin the area estimation will have a minimal affect on the total area of the proposedspecialized FSM architecture, since the Next state generation block counts for lessthan half of the block area for small FSMs, and is much less than half for the largerFSMs.The RTL design has been synthesized using the Synopsis Design Compiler vH-2013.03-SP5-2 [33] with the TSMC 65nm library. The area estimations presentedin this dissertation are pre place-and-route. We estimate the routing area of the nextstate generation block, which is not calculated by the Synopsys design compiler asfollows: We exclude the area of the RAM (since the internal routing has alreadybeen modelled by the SRAM compiler), then we multiply the area of the remainingunits, which is reported by design compiler, by a factor of 2X. Note that by usingthis approach, we are overestimating the area of the block, since the routing insidethe next state generation unit is very limited. Thus, the presented area estimationsare conservative.Input Sequence Encoder Area ModelWe have developed an input sequence encoder generator in C++. It takes the FSMdescribed in KISS format and generates the Verilog HDL that implements this55encoder, as described in Section 3.5.2.The RTL design for the input sequence encoder is then implemented onto theFPGA soft logic. We use the FPGA architecture k6 frac N10 40nm provided inVTR [19] to model the area of the input sequence encoder, and map the inputsequence encoder described in verilog to the FPGA soft logic. We then use thefollowing formula, which is also used by VTR, to convert the logic and routingarea reported by VTR in Minimum Width Transistor Area (MWTA) to um2:1∗MWTA = 70∗ (λ )2Where λ is equal to 65nm.Specialized FSM Architecture Delay ModelNext we describe the delay model used for the proposed specialized FSM archi-tecture which consists the delay of both input sequence encoder and next stategeneration block. Looking at Figure 3.8, the critical path delay reported by designcompiler for the next state generation block starts from the output of the state reg-ister through the adder and two multiplexers back to the input of the state register.Note that, for the scenario when the next state calculation is solely calculated usingthe accumulator unit, the total critical path delay of the FSM architecture is equalto the critical path delay of the next state generation block. However, for the casewhere the next state calculation is performed through input sequence encoder andmemory unit, the output of the state register is fed back to the input of the inputsequence encoder. Therefore the critical path delay of the input sequence encoderalong with the critical path delay of the next state generation block form the totaldelay of the architecture.The delay of the input sequence encoder is obtained from VTR by mapping theencoder onto the baseline architecture. The delay values for the next state genera-tion block are obtained from the design compiler. To account for the effect of theplace and route on the delay values, we use the same experience-based estimationapproach stated in [45], which suggests on average paths degrade by a factor of1.6X after layout.Note that we provide an optional binary-to-onehot decoder at the output of the56FSM block (Figure 3.8). This decoder is located after the state registers, henceafter obtaining the total critical path of design as mentioned above, we also add thelatency of this decoder to the total critical path of the specialized FSM architecture.Baseline FPGA architectureThe baseline FPGA architecture is also k6 frac N10 40nm. We selected the simplearchitecture without any hard block as the baseline to minimize the area overheadof unused hard blocks that the FSM will not benefit from.4.4 CAD FlowTo synthesize our benchmarks onto the FPGA baseline and our proposed architec-ture, we use VTR 7.0 [19]. VTR provides the full synthesis, technology mapping,placement, and routing steps required to compile the proposed next state gener-ation hard block and input sequence encoder soft block onto the baseline FPGAarchitecture.4.5 Mapping to the Next State Generation BlockAs described in Section 3.5.3, for a given state machine to fit into the next stategeneration block, there are two required conditions that must be met: (1) the num-ber of bits required for the state encoding should not exceed the maximum bit-width of the adder, 2) the number of states that reside in memory must be less thanthe memory size.To evaluate these two conditions, we first apply the state assignment algorithm,Algorithm 2, on the given FSM. After performing the state encoding, we will havethe number of state bits required to encode the state values and the total number ofstates that will be assigned to the memory unit. In case any of these two require-ments are not met, we can use the FSM partitioning technique described in Section3.3 to map the FSM to two or more combined fracturable blocks.57Chapter 5Experimental ResultsThis chapter presents and discusses our experimental results. We first use the resultof applying our state assignment technique on the finite state machines extractedfrom MachSuite and the datacenter benchmarks to explain the sizing of the FSMblock. We then evaluate the overall area and delay improvement of our proposedarchitecture over these benchmarks. We also provide the detail characteristics ofeach FSM to fully explain the variation in the result of area/delay improvementover these benchmarks. We then demonstrate the outcome of applying HLS op-timization to three MachSuite benchmarks on the characteristics of the generatedstate machines. We finally assess the functionality of the FM partitioning algorithmon an FSM that does not fit into one FSM block and measure the overhead of ourproposed modifications to support fracturable FSM blocks.5.1 Next State Generation Block SizeIn this section we explain the reason behind the size decisions for elements of thenext state generation block. The best sizing for the FSM block will accommo-date the common FSM size, while reducing the amount of wasted resources if thecommon FSMs are smaller than the selected FSM block area.Figure 5.1 shows the area breakdown of each unit of the next state generationblock as a fraction of the total area for the block configuration given in Table 5.1.As can be seen in this figure, the memory block is the main contributor to the next58Figure 5.1: Area breakdown of the coarse-grained fabricFigure 5.2: FSM coverage vs. memory depth in number of entries. Approx-imately 98% of the evaluated FSMs fit into a memory with a depth of128 entries.state generation block area. We have measured the area breakdown of the block forvarious block configurations by sweeping the memory size, however, the memoryunit always remains the main contributor to the block area since the area of theremaining units also scale accordingly as the memory size varies. Therefore, it is59important to select a memory unit with the proper size to minimize the total areaof our proposed FSM block.We have collected the required memory depth, in terms of number of entries(independent states), for our evaluated benchmarks. Figure 5.2 presents the frac-tion of the FSMs that will fit in a certain memory depth of 32, 64, 128, and 256entries. For our workloads, 98% of the FSMs fit into a depth of 128. Thus, for theremainder of our evaluation, we selected a memory size with a depth of 128 entriesto accommodate the common FSM sizes.The second design decision is the bit-width of the adder, control registers, andencoding bits. To choose these values, we need to answer the following question:What are the total number of states for a state machine that uses all the 128 memoryentries? To answer this question we require two data points: (1) what percentage ofthe states are typically allocated in memory, and (2) what is the maximum fan-outdegree over our evaluated FSM. The second question helps determine how manymemory rows are needed for each state.The answer to the first question is shown in Figure 3.6. On average approxi-mately 18% of the states reside in memory. The second question can be answeredby looking at Figure 3.7 which shows the maximum number of fan-out per stateis equal to 4. Therefore, given a memory unit that has 4 memory rows associatedwith each state, and where the number of memory states is 20% of the total numberof states, the total number of states in an FSM that can map to this memory unit isequal to ( 128 states4 rows per state×120% of total states = 160 states). Hence we use 8 bitsto represent the states in such a state machine. For any state machines that requiremore bits for the state encoding, there is a high chance that the memory size willnot be able to accommodate all the states.Using the format of memory content presented in Figure 3.9, the memorywidth should be equal to (size of Next State value+size of Path Final State value+size of State Control value) which is 8+8+5+1 = 22 bits. Putting it all together,the size of the units in the next state generation block can be seen in Table 5.160Total Memory size 128x22 bitsAdder size 8 bitsState Register size 8 bitsEncoded Input Sequence size 2 bitsTable 5.1: The sizing configuration of the elements of the next state genera-tion block5.2 Area ImprovementThis section presents the improvement in FSM area using our proposed specializedFSM block compared to the baseline LUT-based FSM implementation. The area ofthe next state generation block for the configuration given in Section 5.1 is equal to15662 um2 which is calculated as described in Chapter 4. The area improvementfor the MachSuite and HLS datacenter benchmarks is presented in Figure 5.3. Thesubsequent figure, Figure 5.4, presents a breakdown of state categories to assistwith analyzing the variation in area savings. The breakdown of state categories iscollected after performing state assignment on each FSM.In Figure 5.3, the x-axis shows the FSMs extracted from the benchmark setsand the y-axis shows the relative area of our proposed architecture compared tothe baseline LUT-based implementation. The gap between the bars on the x-axisseparates the FSMs that have less than 10 states (on the left) from the FSMs withmore than 10 states (on the right). In the extreme cases where the FSM only hasa few states, less than 10, the number of states on the branch-free paths and thenumber of states to be stored in memory are so limited that it does not justify thearea overhead of using the FSM hard block with a memory depth of 128. This issuecan be addressed with two different approaches. First, a simple predictor based onthe FSM size could be used during the synthesis to decide whether the FSM shouldbe mapped to the proposed FSM hard block or should be implemented using thesoft logic on FPGAs. Second, the FSM block can be sized down to accommodatesmaller FSMs. However, this also results in a lower percentage of FSMs fittinginto a single block. The FSMs that do not fit into a single block will be splitover multiple fracturable blocks. The overheads of having fracturable blocks arediscussed in Section 5.7.61As shown in Figure 5.3, in average the area of our proposed architecture isapproximately 36% of the baseline FPGA architecture for the MachSuite bench-marks, and is approximately 30% of the baseline area for the HLS datacenterbenchmarks. These averages are not including the benchmarks that have FSMswith fewer than 10 states.The main trend that can be seen is that the area improvement increases as theFSM size increases. This is due to the increase in the amount of soft logic thatis required to implement the baseline FSM, which is replaced by our fixed-sizespecialized FSM block.Figure 5.4 can be used to help explain the area improvements for differentFSMs. The x-axis shows the FSMs from our evaluated benchmarks and the y-axisshows the total number of states as a breakdown of branch-free and memory states.The main trend that we see is that as the number of states that can be mappedto the branch-free paths increases, the area savings also increase. For the statemachines that have the same number of states but a different area improvement, thecomplexity of the input sequence encoder is the main reason for the area difference.As the number of states that need to be stored in memory increases, the logic toimplement the input sequence encoder will be more complex, resulting in havinga larger area. This can be seen for bzip fsm6 and sql gt fsm1. These benchmarkshave the same number of states (61 states), however, the total number of states thatreside in memory for sql gt fsm1 is equal to 30 while it is only 10 for bzip fsm6.Consequently, as shown in Figure 5.3, bzip fsm 6 has a smaller area (14% of thebaseline) compared to sql gt fsm1 (27% of the baseline). However, one exceptionis with benchmarks lucy sv fsm and sql gt fsm2 where benchmark lucy sv fsm hasmore memory states and better area improvement than sql gt fsm2. We expect thatthis is due to the higher complexity of the next state calculation logic for benchmarklucy sv fsm than benchmark sql gt fsm2, which results in a greater area reductionwhen mapping to a simple memory lookup. Further analysis to this scenario is leftto future work.62(a) MachSuite(b) HLS datacenterFigure 5.3: Area improvement of the specialized FSM architecture, which in-cludes the area of the input sequence encoder and next state generation blockrelative to the baseline5.3 Delay ImprovementThe input to output delay of the next state generation block for the configurationgiven in Section 5.1 is equal to 0.5 ns, which is calculated as described in Chapter4. The delay improvement achieved by our specialized FSM block is shown in63(a) MachSuite(b) HLS datacenterFigure 5.4: FSM size along with the breakdown of the states that are part ofbranch-free paths and states that reside in memoryFigure 5.5 for the evaluated benchmarks. The x-axis shows the FSMs from thedifferent benchmarks and the y-axis shows the critical path, relative to the baseline.64As above, the FSMs with less than 10 states are separated from the FSMs with morethan 10 states by a gap. As with area savings, the FSMs with at least 10 states willbenefit from our specialized hard block, and the critical path delay improves as thesize and complexity of the FSM increases. This is due to the fact that, for smallerFSMs, the overhead of the extra control logic in the FSM blocks is not negligiblecompared to the critical path delay of the LUT-based portion of the FSM.Similar to the area results, the complexity of the input sequence encoder is alarge contributor to the critical path of the total design, which is indicated by thenumber of states that are mapped to memory, as shown in Figure 5.4.5.4 Resource Usage of the Mix-Grained ArchitectureFigure 5.6 illustrates the area of each unit as a fraction of the total area of themix-grained architecture for the same workloads presented in Figure 5.3. As thisresult shows, about 50% of the area is consumed by the input sequence encoder.This amount varies among the benchmarks as size of the FSM and more specifi-cally, number of states that reside in memory varies. However, in addition to thenumber of memory states, the complexity of the boolean function that defines thetransition between states also affects the complexity and size of the input sequenceencoder. As can be seen in Figure 5.4a, number of memory states among Mach-Suite benchmarks is mainly less than 10, independent of the FSM size. This resultsin small variation in size of the input sequence encoder among MachSuite bench-marks. However, for the Datacenter benchmarks (Figure 5.4b), there is a highervariation in number of memory states among different benchmarks, hence there ismore variation in size of the input sequence encoder for these benchmarks as well.The area of the hard block, consisting of the memory, adder unit, and outputdecoder is always fixed. This explain the increase in area savings for the largerFSMs, since the overhead of the control logic in the hard block will be negligiblecompared to the input encoder implemented using the FPGA soft logic.65(a) MachSuite(b) HLS datacenterFigure 5.5: Critical path delay improvement of the specialized FSM architecturewhich includes the critical path of the input sequence encoder and next stategeneration block relative to the baseline66(a) MachSuite(b) HLS datacenterFigure 5.6: Area breakdown of the mix-grained FSM architecture for the FSMsextracted from the evaluated benchmarks5.5 FSM AreaAlong with the area improvement of the FSMs, we are also interested in the fractionof the FSM next state calculation area to the total application design area (controlplus data-path). However, for most of the evaluated benchmarks, the generationof IP (Intellectual Property) cores as part of the design by Vivado HLS limits the67benchmarkarea percentage of theFSM next statecalculation logictotal number ofthe statessqlite ln 11.27% 508sqlite gt 9.12% 73Table 5.2: Fraction of next state calculation logic area to total design area forsqlite lookupName and sqlite getToken functions from the HLS datacen-ter benchmark setability to implement the whole design onto the baseline FPGA using VTR. There-fore, we were unable to measure the total area of the designs for these benchmarks,since Vivado synthesis tool only provides the resource utilization of the final syn-thesized design, not the total area. We were able to measure this fraction for theSQLite benchmark (from the datacenter benchmark set) which does not containany IP cores. The percentage of the area for the next state calculation logic for twofunctions of SQLite benchmark is shown in Table 5.2. On average for these twofunctions, the next state calculation logic area is approximately 10.19% of the totaldesign area. We leave evaluating the total area for all benchmarks, including IPcores, to future work.5.6 Impact of HLS Directives on the Generated FSMsTo evaluate the impact of HLS optimization on FSM characteristics, we have ap-plied a set of HLS directives that minimize the area-delay product of the aes, back-prop, and sort radix benchmarks. In Chapter 4, we have described how these HLSsettings have been obtained.The impact of applying HLS directives on the three mentioned MachSuitebenchmarks is shown in Figure 5.7, and is averaged across these benchmarks.These benchmarks were arbitrarily chosen to show the impact of HLS directives.The x-axis is labelled by the number of fan-outs per state, and the y-axis indicatesthe fraction of total states that have the corresponding fan-out degree. As can beseen, on average, the optimized designs (opt) have a higher number of branch-free paths than the non-optimized designs (no-opt), e.g., fan-out 1 is higher for the68Figure 5.7: Impact of applying HLS optimization directives on backprop,aes, and radix sort benchmarks from MachSuite. The x-axis shows thenumber of reachable next states per state. For example, fan-out 1 is anindicator for states that only have one next state. This figure shows thatthe optimization increases the percentage of branch-free path.pragma optimized (opt) versus non-optimized (non-opt) versions. The affect ofapplying the HLS directive on each individual benchmarks as an average over allthe FSMs extracted from each benchmark is shown in Figure 5.8. The x-axis andy-axis are the same as described for Figure 5.7. By looking at the impact of HLSdirectives on each individual benchmark, we observed that the HLS directives donot necessarily change the ratio of fan-out degrees 1 and 2, however, they result ingenerating more FSMs for different parts of the same design which, in the case ofthese three benchmarks, contain a higher ratio of branch-free paths.Many of the HLS directives attempt to exploit more parallelism, for example,by loop unrolling and loop pipelining. In these cases, it results in an increase inthe number of states to generate the control signals for the unrolled and pipelineloops, adding more branch-free states in between divergent states used to controlthe loops.69(a) backprop(b) aes(c) sort radixFigure 5.8: Impact of applying HLS optimization directives on aes, backprop, andradix sort benchmarks from MachSuite70Benchmark StatesRequiredstate bitsMemorystatesRequiredmem. depthMaxfan-outsql ln fsm1 508 9 40 160 4Table 5.3: Block size checking for the sql ln fsm1 indicating that it does notfit to one FSM block, since it requires larger memory unit and more statebits than what is provided by the FSM block.5.7 Efficiency of the Fracturable FSM BlockIn this section we evaluate the efficiency of our proposed solution for the scenariowhere a large FSM does not map to one FSM block. We perform analysis onan FSM with 508 states, which is extracted from the sqlite lookupName function.The corresponding FSM is named sql ln fsm1. Below we describe the the requiredsteps for mapping this FSM to two fracturable FSM blocks.Block size: Table 5.3 shows the block size information for the sql ln fsm1FSM. The FSM has 508 states, which requires 9 state bits, thus it is too large tomap to our FSM block with only 8 state bits and an 8 bit-wide adder. Additionally,the result of the state encoding shows that 40 states must be mapped to the memoryunit. The maximum number of fan-out per state for this FSM is equal to 4, thus weneed 2 bits to represent the encoded input, which allows each memory states to havea maximum of four corresponding memory rows to store the data for potential statetransitions. Therefore, a memory unit with 160 entries is required to accommodatethis FSM, which will not fit into our proposed FSM block with 128 entries.Partitioning: Table 5.4 describes the result of applying the Fiduccia-Matheysespartitioning algorithm on sql ln fsm1, as described in Section 3.3.1, and then re-performing the state assignment on each of these partitions. The first row indicatesthe number of states that are required to be mapped to the memory unit in eachFSM partition. The second row presents the overhead of partitioning in terms ofthe number of states that are required to be mapped to the memory to store theinformation corresponding to the transitions across the the fracturable blocks. Thethird and fourth row show the total number of memory states for each FSM parti-tion and the required memory size to accommodate them.The values for the refined required memory sizes indicate that the partitioning71can result in an unbalanced division of the required memory size between parti-tions. The partitioning algorithm aims to minimize the cut set value between twopartitions, however, it is possible to have different number of branches within eachpartition. A more sophisticated partitioning algorithm can aim to also balance thisnumber in addition to minimizing the cut set value to better utilize each fracturableblock. Evaluating the impact of different partitioning algorithms is left to futurework.Area saving:Figure 5.9 shows the area overhead of using a fracturable FSMblock to map a large FSM as opposed to having one large specialized FSM blockto fit the FSM. LUT-based implementation of the FSM in FPGA soft logic is usedas the baseline. The area overhead due to making the FSM block fracturable isnegligible compared to the area improvement gained by mapping the FSM to theFSM hard block.The results of splitting a large FSM over two fracturable blocks, Figure 5.9,show the efficiency of this approach for a medium size FSM block (a memory unitwith 256 entries). As shown in Table 5.4, partitioning an FSM results in storingadditional states in memory. For the smaller FSM blocks, e.g. a memory unitwith 128 entries, there are only 32 states that can be stored in memory (assumingeach state has 4 memory locations for 4 potential next states). This memory sizeoffers a very limited space for storing states. By adding the overhead of additionalstates that are caused by partitioning, this memory unit can easily become fullwhich leads to requiring more than two fracturable blocks to accommodate a givenmedium size FSM. This might result in extra overhead that is more than the amountshown in Figure 5.9. In this work we only look at splitting the FSM across twofsm blocks. Evaluating the efficiency of mapping an FSM over multiple blocks isleft to future work.72Partition A Partition BInitial number of memory states 24 16Number of overhead memory states 6 6Refined number of memory states 30 22Refined required memory size 120 88Table 5.4: Memory unit size requirement for each partition after partitioningthe FSM. Row 4 indicates that FSM partition A requires a memory unitof size 120 and FSM partition B requires a memory unit of size 88, thusthey both can be mapped to a fracturable FSM block.Figure 5.9: Area overhead of using a fracturable FSM block to map a largeFSM as opposed to having one large specialized FSM block to fit theFSM. The overhead due to making the FSM block fracturable is negli-gible compared to the area improvement gained by mapping the FSM tothe hard FSM block.73Chapter 6Related WorkThis chapter summarizes and contrasts the work done in this dissertation againstrelated work on FPGA hard blocks and configurable FSM architectures.There is a large body of work looking at using specialized hard blocks as partof the FPGA’s architecture. Wilton et al. [36] examines the architecture of FP-GAs containing coarse-grained memory blocks, Langhammer et al. [16] proposesDSP blocks that support floating point operation, and all modern FPGAs now con-tain specialized hard blocks as part of their architecture [[43], [2], [41], [40], [1]].Similar to our work, these works share a common goal of introducing specializedblocks to the FPGA’s architecture that perform a set of specific tasks in a moreefficient manner in terms of area, performance, and power. However, they mainlylook at improving the complex operations and functional units that are common inthe data-path part of hardware designs, whereas we propose a hard block that isdesigned to better implement the control-path portion of digital systems.Garcia-Vargas et al. [10] proposes to use block RAMs provided in modernFPGA architecture to implement FSMs. This work looks at implementing the nextstate/output calculation for every state using memory. They reduce the size ofthe memory by multiplexing the FSM inputs to choose the set of active inputs ateach state. Additionally, along with the next state and output values, they storeextra control signals at each memory location that helps reduce the complexity ofcontrolling the multiplexer. Similar to Garcia-Vargas et al., we also try to reducethe memory size by exploiting the fact that only a subset of inputs are active at74each state. However, we further optimize the memory size by introducing an inputencoder which exploits the fact that not all the combinations of the active inputscontribute to different choices of next state selection. For example, in a scenariowhere the maximum number of active inputs at one state is 3, previous works looksat having 8(23) memory locations for choosing the potential next state for a givenstate. However, we show that the number of memory locations can be further re-duced to the maximum number of reachable next states per state which is normallyless than 2(number of active inputs). Moreover, our work further reduces the num-ber of states that must be implemented in memory by looking at the characteristicsof HLS-generated benchmarks. We show that a hybrid FSM implementation canbe used to divide the task of next state calculation between the memory unit and anaccumulator unit, resulting in significant reduction in number of states that mustbe mapped to the memory and consequently reducing the memory size.Glaser et al. [11] presents a reconfigurable FSM architecture, TR-FSM, to beimplemented as part of the ASICs or System On a Chip (SOC) designs. The pro-posed architecture offers reduced area, delay, and power consumption comparedto an FPGA baseline. However, their architecture must be sized according to thespecifications of the FSMs that are going to be implemented onto this architecture,otherwise the extra resources will be wasted. TR-FSM is possible in case of ASICand SOC design for a certain class of applications where they can profile FSMsprior to generating the TR-FSM block. However, their work cannot be utilized as ageneral architecture where the size of FSMs is not known in advance, limiting thefeasibility for integrating their proposed architecture into the FPGA architecture.However, in our work, we are able to design specialized FSM blocks that can ben-efit the common FSM sizes, while still allowing the mapping of larger FSMs usingthe proposed fracturable architecture.Wilson et al. [35] propose a low overhead FSM overlay based on a multi-RAM architecture. They aim to improve the area usage of the previously proposedmemory-based overlays by grouping the state machines to different subsets basedon the number of active inputs at each state. The states at each subset are thenmapped to separate memory units such that each memory address space can betailored to the number of active inputs in each subset. Their solution, however, stillhas a larger area compared to the LUT implementation, since the main goal of their75work is to reduce the FPGA compilation time.76Chapter 7Future WorkIn this dissertation, we explore the potential of adding a specialized hard blockfor FSM implementation to existing FPGA architectures to reduce the area of thenext state calculation logic in FSMs that are generated by HLS tools. In the future,we plan to explore the possibility of extending the proposed architecture to alsoimprove the area efficiency of the output calculation part of FSMs.Additionally, we plan to fully integrate our specialized block to a baselineFPGA architecture. To realize this goal, we also need to extend our mappingflow to support the full CAD flow that can compile the FSM onto the modelledFPGA architecture. The full CAD flow requires support for technology mapping,placement, and routing of the applications onto the modified FPGA architecture.Another future direction to further optimize our proposed architecture is toexplore the potentials of adding hard routing between the fracturable FSM blocksto improve the area and delay for the large FSMs that need to be mapped to acombination of fracturable FSM blocks. In order to support the modified routingstructure, the placement and routing algorithms need to be modified accordingly tomaximize the efficiency of hard routing between the fracturable blocks.Our work shows that there are potentials for introducing new hard blocks to theexisting FPGA architectures due to the adoption of HLS tools among hardware de-signers. The recent adoption of FPGAs in new domains, such as cloud computing,also suggests a future direction to explore the potentials of adding new hard blocksto the current FPGA architecture that are well suited for accelerating the common77operations in these new domains. One potential solution to approach this problemis by using graph isomorphism algorithms to find the common structure and logicoperations in this new domains which can be used as a starting point to design newhard blocks to be integrated to FPGAs.78Chapter 8ConclusionIn this dissertation, we analyzed the control-unit portion of RTL designs that aregenerated by HLS tools. HLS-generated control units, modelled by finite-statemachines, often have a large influence on the total area of the design in applicationswhere data-path realization requires a large number of states and control signals.We show that these FSMs demonstrate common properties that can be exploited toimprove the area of FSM implementations.We propose a novel mix-grained architecture that takes advantage of these char-acteristics to improve the total area for implementing the next state calculationlogic in FSMs. The proposed architecture can be integrated to modern FPGA ar-chitectures. We introduce a new state assignment technique that enables FSMs tobetter map to our proposed architecture. We evaluate our proposed architecture ona group of RTL designs generated by a commercial HLS tool. Finally, we show thatthe proposed architecture is on average 3X smaller than LUT-based FSM imple-mentations on a baseline FPGA. The reduction in area is achieved without affectingthe performance of the design.79Bibliography[1] Altera. FPGAs for High-Performance DSP Applications.https://www.altera.com/en US/pdfs/literature/wp/wp dsp comp.pdf, . →pages 74[2] Altera. Floating-point DSP Energy Efficiency on Altera 28 m, FPGAs.https://www.altera.com/content/dam/altera-www/global/en US/pdfs/literature/wp/wp-01192-bdti-altera-fp-dsp-energy-efficiency.pdf, . → pages74[3] Amazon. Amazon EC2 F1 Instances-Run Customizable FPGAs in the AWSCloud. https://aws.amazon.com/ec2/instance-types/f1/. → pages 2[4] Apache Software Foundation. LuCy. http://lucy.apache.org/, 2016. URLhttp://lucy.apache.org/. [Online; accessed 19-May-2016]. → pages 51[5] ArmDeveloper. Artisan Memory Compiler.https://developer.arm.com/products/physical-ip/memory-compilers. →pages 55[6] A. M. Caulfield, E. S. Chung, A. Putnam, H. Angepat, J. Fowers,M. Haselman, S. Heil, M. Humphrey, P. Kaur, J.-Y. Kim, et al. A cloud-scaleacceleration architecture. In Microarchitecture (MICRO), 2016 49th AnnualIEEE/ACM International Symposium on, pages 1–13. IEEE, 2016. → pages2[7] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang.High-level synthesis for fpgas: From prototyping to deployment. IEEETransactions on Computer-Aided Design of Integrated Circuits and Systems,30(4):473–491, 2011. → pages 2, 11, 1280[8] R. H. Dennard, F. H. Gaensslen, and K. Mai. Design of Ion-ImplantedMOSFET’s with Very Small Physical Dimensions. In IEEE Journal ofSolid-State Circuits, October 1974. → pages 1[9] C. M. Fiduccia and R. M. Mattheyses. A linear-time heuristic for improvingnetwork partitions. In Papers on Twenty-five years of electronic designautomation, pages 241–247. ACM, 1988. → pages 29[10] I. Garcia-Vargas, R. Senhadji-Navarro, G. Jimenez-Moreno,A. Civit-Balcells, and P. Guerra-Gutierrez. Rom-based finite state machineimplementation in low cost fpgas. In Industrial Electronics, 2007. ISIE2007. IEEE International Symposium on, pages 2342–2347. IEEE, 2007. →pages 10, 74[11] J. Glaser, M. Damm, J. Haase, and C. Grimm. Tr-fsm: transition-basedreconfigurable finite state machine. ACM Transactions on ReconfigurableTechnology and Systems (TRETS), 4(3):23, 2011. → pages 75[12] Hwaci. SQLite. https://www.sqlite.org, 2016. URL https://www.sqlite.org.[Online; accessed 02-June-2016]. → pages 51[13] Intel. Intel Completes Acquisition of Altera.https://newsroom.intel.com/news-releases/intel-completes-acquisition-of-altera/. → pages2[14] I. Kuon and J. Rose. Measuring the gap between fpgas and asics. IEEEtransactions on computer-aided design of integrated circuits and systems, 26(2):203–215, 2007. → pages 3, 6[15] I. Kuon, R. Tessier, and J. Rose. Fpga architecture: Survey and challenges.Foundations and Trends in Electronic Design Automation, 2(2):135–253,2008. → pages x, 6, 7[16] M. Langhammer and B. Pasca. Floating-point dsp block architecture forfpgas. In Proceedings of the 2015 ACM/SIGDA International Symposium onField-Programmable Gate Arrays, pages 117–125. ACM, 2015. → pages 74[17] M. Langhammer and B. Pasca. Floating-point dsp block architecture forfpgas. In Proceedings of the 2015 ACM/SIGDA International Symposium onField-Programmable Gate Arrays, pages 117–125. ACM, 2015. → pages 381[18] C. Lo and P. Chow. Model-based optimization of high level synthesisdirectives. In Field Programmable Logic and Applications (FPL), 2016 26thInternational Conference on, pages 1–10. IEEE, 2016. → pages 17, 52[19] J. Luu, J. Goeders, M. Wainberg, A. Somerville, T. Yu, K. Nasartschuk,M. Nasr, S. Wang, T. Liu, N. Ahmed, et al. Vtr 7.0: Next generationarchitecture and cad system for fpgas. ACM Transactions on ReconfigurableTechnology and Systems (TRETS), 7(2):6, 2014. → pages 56, 57[20] C. Menn, O. Bringmann, and W. Rosenstiel. Controller estimation for fpgatarget architectures during high-level synthesis. In Proceedings of the 15thinternational symposium on System Synthesis, pages 56–61. ACM, 2002. →pages 3[21] G. D. Micheli. Synthesis and optimization of digital circuits. McGraw-HillHigher Education, 1994. → pages 7, 8, 9, 18, 29, 31[22] G. E. Moore. Cramming more components onto integrated circuits.Electronics, 38(8):114–117, 1965. → pages 1[23] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides,J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray, et al. Areconfigurable fabric for accelerating large-scale datacenter services. InComputer Architecture (ISCA), 2014 ACM/IEEE 41st InternationalSymposium on, pages 13–24. IEEE, 2014. → pages 2[24] R. David Evans. Architecture Synthesis from High-Level Specifications. →pages 51[25] B. Reagen, R. Adolf, Y. S. Shao, G.-Y. Wei, and D. Brooks. Machsuite:Benchmarks for accelerator design and customized architectures. InWorkload Characterization (IISWC), 2014 IEEE International Symposiumon, pages 110–119. IEEE, 2014. → pages 51[26] R. Senhadji-Navarro, I. Garcia-Vargas, and J. L. Guisado. Performanceevaluation of ram-based implementation of finite state machines in fpgas. InElectronics, Circuits and Systems (ICECS), 2012 19th IEEE InternationalConference on, pages 225–228. IEEE, 2012. → pages 10[27] E. M. Sentovich, K. J. Singh, L. Lavagno, C. Moon, R. Murgai, A. Saldanha,H. Savoj, P. R. Stephan, R. K. Brayton, and A. Sangiovanni-Vincentelli. Sis:A system for sequential circuit synthesis. 1992. → pages 5482[28] Y. S. Shao and D. Brooks. Research infrastructures for hardwareaccelerators. Synthesis Lectures on Computer Architecture, 10(4):1–99,2015. → pages 1, 2[29] Y. Shi, C. W. Ting, B.-H. Gwee, and Y. Ren. A highly efficient method forextracting fsms from flattened gate-level netlist. In Circuits and Systems(ISCAS), Proceedings of 2010 IEEE International Symposium on, pages2610–2613. IEEE, 2010. → pages 54[30] V. Sklyarov. Synthesis and implementation of ram-based finite statemachines in fpgas. Field-Programmable Logic and Applications: TheRoadmap to Reconfigurable Computing, pages 718–727, 2000. → pages 10[31] V. Sklyarov. An evolutionary algorithm for the synthesis of ram-based fsms.In International Conference on Industrial, Engineering and OtherApplications of Applied Intelligent Systems, pages 108–118. Springer, 2002.→ pages 10[32] SPEC. CPU 2006. https://www.spec.org/cpu2006/. → pages 19, 51[33] Synopsys. Design Compiler. http://www.synopsys.com/Tools/Implementation/RTLSynthesis/ DesignCompiler/Pages/default.aspx. →pages 55[34] A. Tiwari and K. A. Tomko. Saving power by mapping finite-state machinesinto embedded memory blocks in fpgas. In Proceedings of the conference onDesign, automation and test in Europe-Volume 2, page 20916. IEEEComputer Society, 2004. → pages 10[35] D. Wilson and G. Stitt. A scalable, low-overhead finite-state machineoverlay for rapid fpga application development. arXiv preprintarXiv:1705.02732, 2017. → pages 75[36] S. J. E. Wilton, J. Rose, and Z. Vranesic. Architectures and algorithms forfield-programmable gate arrays with embedded memory. University ofToronto, Toronto, Ont., Canada, 1997. → pages 74[37] C. Wolf. Yosys open synthesis suite, 2015. → pages 53[38] Xilinx. Xilinx and IBM to Enable FPGA-Based Acceleration withinSuperVessel OpenPOWER Development Cloud.https://www.xilinx.com/news/press/2016/xilinx-and-ibm-to-enable-fpga-based-acceleration-within-supervessel-openpower-development-cloud.html,. → pages 283[39] Xilinx. Qualcomm and Xilinx Collaborate to Deliver Industry-LeadingHeterogeneous Computing Solutions for Data Centers with New Levels ofEfficiency and Performance.https://www.xilinx.com/news/press/2015/qualcomm-and-xilinx-collaborate-to-deliver-industry-leading-heterogeneous-computing-solutions-for-data-centers-with-new-levels-of-efficiency-and-performance.html, . → pages2[40] Xilinx. High-Volume Spartan-6 FPGAs: Performance and Power Leadershipby Design.https://www.xilinx.com/support/documentation/white papers/wp396 S6 HV Perf Power.pdf,. → pages 74[41] Xilinx. Spartan-7 FPGAs: Meeting the cost-Sensitive Market Requiremnts.https://www.xilinx.com/support/documentation/white.../wp483-spartan-7-intro.pdf, . → pages74[42] Xilinx. Vivado High-Level Synthesis.http://www.xilinx.com/products/design-to ols/vivado/, . → pages 2, 16[43] Xilinx. Embedded Vision with INT8 Optimization on Xilinx Devices.https://www.xilinx.com/support/documentation/white papers/wp490-embedded-vision-int8.pdf, . → pages74[44] Xilinx-Xcell Daily Blog. Baidu Adopts Xilinx Kintex UltraScale FPGAs toAccelerate Machine Learning Applications in the Data Center.https://forums.xilinx.com/t5/Xcell-Daily-Blog/Baidu-Adopts-Xilinx-Kintex-UltraScale-FPGAs-to-Accelerate/ba-p/728667. → pages2[45] G. Zgheib, L. Yang, Z. Huang, D. Novo, H. Parandeh-Afshar, H. Yang, andP. Ienne. Revisiting and-inverter cones. In Proceedings of the 2014ACM/SIGDA international symposium on Field-programmable gate arrays,pages 45–54. ACM, 2014. → pages 5684
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- A mix-grained architecture for improving HLS-generated...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
A mix-grained architecture for improving HLS-generated controllers on FPGAs Assadikhomami, Shadi 2017
pdf
Page Metadata
Item Metadata
Title | A mix-grained architecture for improving HLS-generated controllers on FPGAs |
Creator |
Assadikhomami, Shadi |
Publisher | University of British Columbia |
Date Issued | 2017 |
Description | With the recent slowdowns in traditional technology scaling, hardware accelerators, such as Field Programmable Gate Arrays (FPGAs), offer the potential for improved performance and energy efficiency compared to general purpose processing systems. While FPGAs were traditionally used for applications such as signal processing, they have recently gained popularity in new, larger scale domains, such as cloud computing. However, despite their performance and power efficiency, programming FPGAs remains a hard task due to the difficulties involved with the low-level design flow for FPGAs. High-Level Synthesis (HLS) tools aim to assist with this time-consuming task by supporting higher level programming models which significantly increases design productivity. This also makes the use of FPGAs for large scale design development for evolving applications more feasible. In this thesis we explore the potential of modifying the current FPGA architecture to better support the designs generated by HLS tools. We propose a specialized mix-grained architecture for Finite State Machine (FSM) implementation that can be integrated into existing FPGA architectures. The proposed mix-grained architecture exploits the characteristics of the controller units generated by HLS tools to reduce the control-path area of the design. We show that our proposed architecture reduces the area of the next state calculation in FSMs by more than 3X without impacting the performance and often reducing the critical path delay of the next state calculation in FSMs. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2018-06-30 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0348677 |
URI | http://hdl.handle.net/2429/62125 |
Degree |
Master of Applied Science - MASc |
Program |
Electrical and Computer Engineering |
Affiliation |
Applied Science, Faculty of Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2017-09 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2017_september_shadi_assadikhomami.pdf [ 2.56MB ]
- Metadata
- JSON: 24-1.0348677.json
- JSON-LD: 24-1.0348677-ld.json
- RDF/XML (Pretty): 24-1.0348677-rdf.xml
- RDF/JSON: 24-1.0348677-rdf.json
- Turtle: 24-1.0348677-turtle.txt
- N-Triples: 24-1.0348677-rdf-ntriples.txt
- Original Record: 24-1.0348677-source.json
- Full Text
- 24-1.0348677-fulltext.txt
- Citation
- 24-1.0348677.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0348677/manifest