Automated Space/Time Scaling of Streaming Task Graphs on Field-Programmable Gate Arrays

by

Hossein Omidian Savarbaghi

B.Sc. in Computer Engineering, Isfahan University of Technology, 2007
M.Sc. in Computer Engineering, Science and Research Branch of IAU, 2011

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Electrical and Computer Engineering)

The University of British Columbia
(Vancouver)

October 2018

© Hossein Omidian Savarbaghi, 2018

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the dissertation entitled:

Automated Space/Time Scaling of Streaming Task Graphs on Field-Programmable Gate Arrays

submitted by Hossein Omidian Savarbaghi in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering.

Examining Committee:

Guy Lemieux, Electrical and Computer Engineering
Supervisor

Prof. Mieszko Lis, Electrical and Computer Engineering
Supervisory Committee Member

Prof. Andre Ivanov, Electrical and Computer Engineering
University Examiner

Prof. Alan Hu, Computer Science
University Examiner

Prof. Russell Tessier, Electrical and Computer Engineering
External Examiner

Abstract

Parallel computing platforms provide good performance for streaming applications within a limited power budget. However, these platforms can be difficult to program. Moreover, when the size of the computing platform target changes, users must manually reallocate resources and parallelism. This thesis provides a framework to retarget applications described by a Streaming Task Graph (STG) for implementation on different platforms, where the framework can automatically scale the solution size to fit available resource or performance targets.

First, we explore automated space/time scaling for STGs targeting a pipelined coarse-grained architecture. We produce a tool that analyzes the degrees of parallelism in a general stream application and finds possible bottlenecks. We introduce possible optimization strategies for STGs, and two algorithmic approaches: a classical approach based upon Integer Linear Programming (ILP), and a novel heuristic approach which introduces a new optimization and produces better results (using 30% less area) with a shorter run-time.

Second, we explore automated space/time scaling for STGs targeting a fine-grained architecture (Field-Programmable Gate Array (FPGA)). We propose a framework on top of a commercial High-Level Synthesis (HLS) tool which adds the ability to automatically meet a defined area budget or target throughput. Within the framework, we use similar ILP and heuristic approaches. The heuristic automatically achieves over 95% of the target area budget on average while improving throughput over the ILP. It can also meet the same throughput targets as the ILP while saving 19% area.

Third, we investigate automated space/time scaling of STGs in a hybrid architecture consisting of a Soft Vector Processor (SVP) and select custom instructions. To achieve higher performance, we investigate using dynamic Partial Reconfiguration (PR) by time-sharing the FPGA resources. The performance results achieve speedups far beyond what a plain SVP can accomplish: an 8-lane SVP achieves a speedup of 5.3 on the Canny-blur application, whereas the PR version is another 3.5 times faster, with a net speedup of 18.5 over the ARM Cortex A9 processor.
By increasing the dynamic PR rate beyond what is available today, we also show that a further 5.7 times speedup can be achieved (105.9x speedup versus the Cortex A9).

Lay Summary

Software applications are often computationally intensive. To maximize the use of resources on a range of hardware platforms with differing amounts of parallel resources, and to minimize the runtime of software applications as much as possible, engineers need to manually modify and optimize each application for each platform with different resource restrictions or desired performance targets. This process is time consuming and can be prone to error. We propose an approach to automatically explore different ways of implementing an application for a performance target or a resource restriction defined by users.

Preface

The following publications have been adapted for inclusion in the dissertation:

• Automated Space/Time Scaling of Streaming Task Graph [66]. Published in the 2016 International Workshop on Overlay Architectures for FPGA (OLAF 2016). Authored by Hossein Omidian and Guy Lemieux. Appears in chapter 3.
For this paper, I performed all of the design methodology and the implementation of the compiler and simulator. I also did the benchmarking, simulation and evaluation. I also formulated the ILP optimization and proposed and implemented the heuristic optimization approach presented in this paper. Guy Lemieux served in an advisory fashion.

• Exploring Automated Space/Time Tradeoffs for OpenVX Compute Graphs [67]. Published in the 2017 International Conference on Field-Programmable Technology (FPT 2017). Authored by Hossein Omidian and Guy Lemieux. Appears in chapter 4.
I performed all the design methodology, implementation and benchmarking as well as all the simulation and evaluation for this paper. Guy Lemieux served in an advisory fashion.

• JANUS: A Compilation System for Balancing Parallelism and Performance in OpenVX [65]. Presented in the 2018 International Conference on Machine Vision and Information Technology (CMVIT 2018) and published in Journal of Physics: Conference Series. Authored by Hossein Omidian and Guy Lemieux. Appears in chapter 3 and chapter 4.
I performed all the implementation and benchmarking for this paper. Guy Lemieux served in an advisory fashion.

• An Accelerated OpenVX Overlay for Pure Software Programmers. Published as a short paper (poster) in the 2018 International Conference on Field-Programmable Technology (FPT 2018). Authored by Hossein Omidian, Nick Ivanov and Guy Lemieux. Appears in chapter 5.
I performed all the design methodology, implementation and benchmarking as well as all the simulation and evaluation for this paper. Nick Ivanov helped implement the runtime results for the SVP and ARM. Guy Lemieux served in an advisory fashion.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1 Introduction
  1.1 Motivation
  1.2 Approach
    1.2.1 Experimental Architecture Models
    1.2.2 Experimental Methodology
  1.3 Contributions
  1.4 Dissertation Organization
2 Background
  2.1 Finding Parallelism in a General Program
  2.2 Programming Models
  2.3 Reconfigurable Computing Platforms
    2.3.1 FPGA
    2.3.2 Massively Parallel Processor Array
  2.4 OpenVX
  2.5 Partial Reconfiguration
3 MPPA Space/Time Scaling
  3.1 Introduction
  3.2 Finding Different Implementations
    3.2.1 Intra-Node Optimizer
    3.2.2 Inter-Node Optimizer
    3.2.3 Example: N-Body Problem
  3.3 Trade-off Finding Formulation and Solutions
    3.3.1 Integer Linear Programming Algorithm
    3.3.2 Heuristic Algorithm
  3.4 Experimental Results
    3.4.1 StreamIt
    3.4.2 JPEG
  3.5 Summary
4 FPGA Space/Time Scaling
  4.1 Introduction
  4.2 Approach
  4.3 Tool Flow for OpenVX-based HLS
    4.3.1 OpenVX Programming Model
    4.3.2 Finding Different Implementations
    4.3.3 CV Accelerator on FPGA
    4.3.4 Heavily Parameterized C++-based OpenVX Kernels
    4.3.5 Intra-node Optimizer
    4.3.6 Inter-node Optimizer
    4.3.7 Trade-off Finding Formulation and Solutions
  4.4 Experimental Results
  4.5 Summary
5 FPGA Overlay Space/Time Scaling with Custom Instructions
  5.1 Introduction
  5.2 System Overview
  5.3 Mapping OpenVX Applications to FPGA Overlay
    5.3.1 Finding Different Implementations
    5.3.2 Execution Time Analysis
    5.3.3 Solving the Space/Time Tradeoff
  5.4 Experimental Results
  5.5 Summary
6 Conclusions
Bibliography
List of Tables

Table 3.1 Different operations with their initiation intervals
Table 3.2 Number of different implementations found by the tool for StreamIt benchmarks
Table 3.3 Implementation library for JPEG encoder
Table 3.4 Heuristic vs ILP for many-core system
Table 5.1 vxMagnitude kernel throughput running on different platforms
Table 5.2 Some of the common patterns used for pre-synthesized node fusion
Table 5.3 List of CV kernels
Table 5.4 List of CV kernels implemented as VCIs in Figure 5.8
Table 5.5 List of CV kernels implemented as VCIs on SVP-V4 in Figure 5.9

List of Figures

Figure 2.1 Example FPGA architecture
Figure 2.2 Ambric bric organization [12]
Figure 2.3 Structural object programming model [12]
Figure 2.4 OpenVX source code for Canny
Figure 2.5 OpenVX graph for Canny
Figure 2.6 Area saving by reconfiguring only the currently required accelerator module to the FPGA. Configurations are fetched from the module repository at runtime.
Figure 2.7 System acceleration due to partial reconfiguration. By spending temporarily more area for each function, the overall latency is reduced [47].
Figure 3.1 Tool flow
Figure 3.2 Pipelined force calculation
Figure 3.3 A node with inverse-throughput = 4
Figure 3.4 Expanding node using replication to improve throughput
Figure 3.5 Expanded force calculation
Figure 3.6 Inverse-throughput/area relation for different implementations of force calculation
Figure 3.7 Minimum and expected inverse-throughput
Figure 3.8 Throughput analysis example
Figure 3.9 Throughput propagation and balancing
Figure 3.10 Node combining in Bottleneck Optimizer
Figure 3.11 Node combining in Bottleneck Optimizer
Figure 4.1 Tool flow
Figure 4.2 OpenVX source code
Figure 4.3 Sobel graph
Figure 4.4 Two different approaches for satisfying Θ = 5
Figure 4.5 System view implemented on Xilinx FPGA
Figure 4.6 Internal view of a general node in CV hardware accelerator
Figure 4.7 Pixel2Pixel kernel example, WT = 4, WF = 2
Figure 4.8 Window2Pixel kernel example, WT = 4, WF = 2
Figure 4.9 Window2Pixel kernel
Figure 4.10 Area, throughput and tile-width correlation for Gaussian3x3 kernel
Figure 4.11 Pixel2Pixel replication
Figure 4.12 Window2Pixel replication
Figure 4.13 Node combining
Figure 4.14 LUT usage percentage for Sobel implementations on different FPGA sizes
Figure 4.15 Throughput achieved for Sobel on different FPGA sizes
Figure 4.16 Percentage of LUT usage for different Xilinx FPGAs
Figure 4.17 vxMagnitude Area/Throughput results for different throughput targets
Figure 4.18 Area cost results for different throughput targets
Figure 4.19 Heuristic vs ILP runtime speedup for Harris corner detection
Figure 4.20 Area cost results for Harris using Heuristic and ILP approaches
Figure 4.21 Heuristic vs ILP throughput results for Harris corner detection
Figure 4.22 Area/throughput results for implementing Sobel on Xilinx ZedBoard
Figure 5.1 Running an application on the hybrid system
Figure 5.2 System overview
Figure 5.3 Node clustering and bypassing the scratchpad
Figure 5.4 VCI chaining versus node fusion
Figure 5.5 ARM Cortex-A9 (667MHz) vs SVP-V4 and SVP-V8 (100MHz)
Figure 5.6 Graph representation of Sobel application with 6 nodes
Figure 5.7 Graph representation of Canny-Blur application with 10 nodes
Figure 5.8 Throughput vs area for V4 and V8 with/without VCI (Canny-Blur, Figure 5.7)
Figure 5.9 Sobel speedup by adding static VCIs (standalone and bypassing) to SVP-V4 compared to ARM
Figure 5.10 V4 Dynamic PR and Static PR speedup vs ARM for Sobel application (4500 LUT budget, image size 1920×1080)
Figure 5.11 V4 Dynamic PR and Static PR speedup vs ARM for Canny-blur application (4500 LUT budget, image size 1920×1080)
Figure 5.12 V8 Dynamic PR and Static PR speedup vs ARM for Canny-blur application (14000 LUT budget, image size 1920×1080)
Figure 5.13 V4 Dynamic PR speedup vs ARM for Canny-Blur for different image sizes (4500 LUT budget)
Figure 5.14 V8 Dynamic PR speedup vs ARM for Canny-Blur for different image sizes (4500 LUT budget)
Glossary

ALU  Arithmetic Logic Unit
ASIC  Application-Specific Integrated Circuit
BRAM  Block RAM
CAD  Computer Aided Design
CLB  Configurable Logic Block
CPU  Central Processing Unit
CV  Computer Vision
DAG  Directed Acyclic Graph
DIG  Different Implementation Generator
DMA  Direct Memory Access
DSP  Digital Signal Processing
FIFO  First In, First Out
FPGA  Field-Programmable Gate Array
GLPK  GNU Linear Programming Kit
HDL  Hardware Description Language
HLL  High-Level Language
HLS  High-Level Synthesis
HPC  High Performance Computing
IC  Integrated Circuit
ICAP  Internal Configuration Access Port
ILP  Integer Linear Programming
JPEG  Joint Photographic Experts Group
KPN  Kahn Processing Network
LUT  Look-Up Table
MIMD  Multiple Instruction, Multiple Data
MPPA  Massively Parallel Processor Array
MUX  Multiplexer
MXP  Matrix Processor
NOC  Network on Chip
PE  Processing Element
PR  Partial Reconfiguration
PRR  Partial Reconfiguration Region
RAM  Random Access Memory
RTL  Register Transfer Level
SDA  Stream Data Adjuster
SDFG  Synchronous Data Flow Graphs
SDK  Software Development Kit
SIMD  Single Instruction, Multiple Data
SRAM  Static RAM
STG  Streaming Task Graph
SVP  Soft Vector Processor
VCI  Vector Custom Instruction
VLIW  Very Long Instruction Word

Acknowledgments

Baba, thanks for all the support. You taught me how to think differently. Maman, thanks for your unconditional love and for helping me through the tough times. Thanks, grandpa, for inspiring me even after you left us; you always wanted me to improve myself.

Guy, thank you for everything you've done for me. You literally changed my life. I cannot thank you enough. Thanks Kia Bazargan for all the support during my Masters. You are the reason I started my PhD; you are my hero. Thank you Jayme Carvey for everything. You helped me get my speech confidence back. Aaron, thanks for helping me start my research.

Thanks to the SOC faculty and members. It was a pleasure working with everyone in the SOC lab.

Thanks to NSERC for funding.

Chapter 1
Introduction

1.1 Motivation

Applications that are structured around the notion of a "stream" are increasingly important and widespread. One study shows that streaming applications are already consuming most of the cycles on consumer machines, and their use is continuing to grow [71]. Many of these are media and vision applications, and most of them are computationally intensive. There is no doubt that increasingly more complex streaming applications will continue to be introduced, so the demand for higher performance will continue.

Increasing the clock frequency was the simple and traditional way to achieve high-performance computing. However, clock frequency scaling has essentially stopped due to power constraints, an issue tied to the end of Dennard scaling [25], so computer designers and architects have focused on delivering increased levels of parallelism to improve both performance and performance-per-Watt [68]. Several different approaches have been introduced to address this issue. At one end is coarse-grained parallelism, often implemented through multi-core processors, usually through a high-level language. At the other end is fine-grained parallelism, often implemented through Field-Programmable Gate Array (FPGA) and Application-Specific Integrated Circuit (ASIC) devices by designing hardware-level solutions.

Unfortunately, both coarse-grained and fine-grained parallelism can be challenging to program. In software, it can be challenging to expose sufficient parallelism in common languages like C, and difficult to describe some types of computation in CUDA or OpenCL, for example.
In hardware, Register Transfer Level(RTL) languages such as VHDL and Verilog achieve very good results, but it is notaccessible to users without hardware design knowledge. RTL is also very tediousto program, as it requires describing everything on a cycle-by-cycle basis. Thismakes the above-mentioned approaches either inaccessible to general users or timeconsuming and hard to implement for more expert users.In FPGAs, design is made easier through High-Level Synthesis (HLS) tools.These are tools that convert a High-Level Language (HLL) (e.g., Java or C) intoRTL (e.g., an Hardware Description Language (HDL) such as Verilog or VHDL).However, they impose their own challenging constraints in writing the HLL and inwhat can be parallelized [23, 62]. Several industrial and academic HLS tools havebeen developed to provide an environment for users to describe their applicationin a HLL such as C/C++ to avoid the difficulty of HDL programming. However,current HLS tools require the user to explicitly manage resources at every stage intheir algorithm in order to meet a specified area target or throughput target. Thisis done by manual control of things such as loop unrolling or pipelining by controlpragmas to the code to change how it is synthesized. Moreover, in order to fit adesign to different FPGA sizes or achieve a different throughput, users need to alterthe design manually, which involves changing the pragmas at best, but in othercases may require more significant rewriting of the source code. Instead, userswould benefit from an HLS tool that can automatically investigate different degreesof parallelism, explore space/time tradeoffs and find a suitable implementation foreither a throughput target or an area budget.As another example, coarse-grained streaming architectures such as Ambric’sMassively Parallel Processor Array (MPPA) device [13] also require programm-ming in a HLL. In Ambric’s case, Java is used to describe each thread and to ex-plicitly allocate instances, where each thread has streaming input and output ports,and the Astruct custom language is used to connect these ports together [12]. Thecompiler creates a graph of communicating object-threads which must be placedand routed, much like an FPGA. This can be a great platform for low-power, high-performance computing [40]. However, the Astruct language design does not al-2low for the creation of dynamic connections; these must be known explicitly atcompile-time. Although using object-oriented Java helps make this platform acces-sible to general users, to achieve high performance, users still need to manually dothe resource management by partitioning the application into small atomic objects.This requires the user to learn the Ambric architecture and parallelism methods.Moreover, users need to go through the same process again if they want to increaseperformance, shrink the size, or target another MPPA with a different architectureand resources.The approaches described above are specialized for getting high performancefrom specific hardware targets. In both cases, however, when the number of avail-able resources changes, or the target changes, users need to change the sourcecode of the implementation to satisfy the new constraints. For example, let us gothrough four different scenarios which need changing these targets. One scenariocan be implementing an application on a different platform with different comput-ing resources (from a small embedded device to a big data center). 
In this case, theuser needs to fill up different chips (different area budgets) while maximizing thethroughput. A second scenario is when the user has purchased a pre-made design,wants to change/customize one block within the whole design, and there is a lim-ited amount of leftover area available. This needs an area budget. A third scenariocan be with a big team of hardware and software developers, where the hardwaredevelopers assign a budget (throughput or area) to the software developers to im-plement their algorithm (e.g., part of a floorplanning process for a whole team). Afourth scenario is when a user wishes to put multiple applications onto a chip, andis trying to pack as many applications (or application instances) as possible; thisneeds setting throughput targets or area budgets for each instance, but the overallgoal is to achieve maximum throughput for the entire chip. A final use-case is thata user may wish to develop several different products or solutions at different price(and performance) points; there is a need for a tools that allow them to more easilyscale it to fit different chips or budgets.This process of retargetting and scaling an application in all the scenarios men-tioned above can be time consuming and expensive, and in many cases program-mers need extensive skill as well as knowledge of the target platform. There is aneed to have general, scalable and flexible approach which allows users to describe3streaming applications in a retargetable way.Streaming applications can be described by a Streaming Task Graph (STG)which are collections of tasks with dependencies between their inputs and outputs.This means they can leverage pipelined architectures. This work focuses on STGswith Directed Acyclic Graph (DAG) topologies. These type of graphs can easilybe manipulated to adopt different methods such as pipelining, node replicating andnode combining in order to find different degrees of parallelism. This makes anSTG suitable model for exploring space/time implementation tradeoffs on pipelinedarchitectures. The scope of automatically exploring space/time tradeoffs for STGshas motivated a considerable amount of research to improve the usability of parallelresources in different pipelined architectures.The goal of this dissertation is to broaden the overall usability of parallel re-sources by providing an environment which allows users to automatically explorespace/time tradeoffs and find suitable implementations regarding a throughput tar-get or an area budget on a wide range of different pipelined architectures. Thiscan be added to a HLL or to the HLS process to make them more flexible, scalableand more area efficient. Moreover this makes a broad range of pipelined archi-tectures with different granularity and resources accessible to people with limitedspecialized knowledge. They can easily define an application as a STG in an HLLinstead of dealing with using low level RTL or parallel programming languagessuch as OpenCL. At the same time, we wish to ensure that, by using this approach,the HLS approach is able to automatically find a broad range of different solutionswhich can each compete with manually optimized implementations. 
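To make the STG model concrete, the sketch below (our illustration in C, not the actual data structures of the tools described in later chapters) shows one way such a graph can be represented: each node carries a set of candidate implementations, each trading area against inverse throughput, and edges record the producer/consumer dependencies of the DAG.

/* Illustrative sketch only (not the thesis tool's actual data structures). */
typedef struct {
    int area;                 /* resource cost, e.g., PEs or LUTs           */
    int inverse_throughput;   /* cycles per output token                    */
} stg_impl;

typedef struct stg_node {
    const char       *name;      /* e.g., "Gaussian3x3"                     */
    stg_impl         *impls;     /* candidate implementations for this node */
    int               num_impls;
    int               chosen;    /* index picked by the space/time optimizer*/
    struct stg_node **succ;      /* successor nodes (DAG edges)             */
    int               num_succ;
} stg_node;

Choosing one implementation per node, or replicating and combining nodes, is exactly the space/time decision that the later chapters automate against an area budget or a throughput target.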
Moreover, theHLS tools need to automatically explore the different solutions and find a suitableimplementation based on different defined restrictions or desired targets.1.2 ApproachAs mentioned above, the main goal of this dissertation is to broaden the overall us-ability of parallel resources in different pipelined architectures by making scalingan implementation easier for general users. This can be done by providing an envi-ronment which allows users to define applications as STGs, explore the space/timedesign space, and find a suitable solution for a defined restriction or a desired tar-4get. In other words, we wish to address the automated space/time scaling problemfor STGs on different pipelined architectures. We start with addressing this prob-lem targeting a coarse-grained architecture (an MPPA), then target a fine-grainedarchitecture (an FPGA), and finally use the experience learned to target a hybrid ar-chitecture with both coarse and fine components. Below, we discuss our approachin greater detail.1.2.1 Experimental Architecture ModelsOur experiments into automated space/time scaling for STGs examine three differ-ent parallel architecture models.First, we examine pipelined coarse-grained parallelism modelled after the Am-bric MPPA. Like Ambric, we develop a Java-based compiler tool chain targetingthe MPPA. The compiler attempts to find the maximum degrees of parallelism inthe stream application, then performs throughput analysis, throughput propagationand looks for possible bottlenecks. We study different STG manipulations to finddifferent implementations in the space/time tradeoff space. We show that a clas-sical Integer Linear Programming (ILP) optimization strategy, based upon priorwork, finds a locally optimal design point subject to the given area or throughputtarget. In addition, we introduce a heuristic approach that runs faster and achievesbetter results than the ILP approach because it has fewer restrictions. We discussour solutions to this coarse-grained model in detail in chapter 3.Second, we examine pipelined fine-grained parallelism available on a XilinxFPGA. We investigate automating the ability to make space/time tradeoffs in thecommercial HLS tool, Xilinx Vivado HLS [43, 96], targeting the FPGA fabric. Sincewe do not have access to the Xilinx Vivado HLS source code, we propose a frame-work on top of it which evaluates many ways pragmas can be used to parallelizea STG. To represent STGs, we restrict ourselves to the OpenVX standard whichenables the creation of Computer Vision (CV) applications as compute graphs. OurOpenVX HLS system uses heavily parameterized CV kernels as well as multipleoptimization approaches to automatically expand and explore the space/time de-sign space. Ultimately, our system finds a suitable fine-grained implementation fora given area or throughput target. We compare both the classical ILP and novel5heuristic optimization approaches, and compare these to manually written imple-mentations. We discuss this model in detail in chapter 4.Third, we examine a hybrid approach that uses both coarse-grained and fine-grained parallelism. For coarse-grained parallelism, we target a Soft Vector Pro-cessor (SVP) architecture implemented in the Xilinx FPGA fabric that streams datathrough a wide Single Instruction, Multiple Data (SIMD) datapath. For fine-grainedparallelism, we target custom instructions implemented in the Xilinx FPGA fabricbuilt using the HLS tool developed earlier. 
Each Vector Custom Instruction (VCI)offers horizontal parallelism up to the same SIMD width as the SVP, as well asvertical parallelism created with pipelining. Furthermore, additional coarse-grainpipeline parallelism can be created by cascading the output of one VCI to another,producing a chain of VCI operations. The entire VCI chain must respect the two-input, one-output restriction of a single VCI, and must be of the same SIMD width.To assemble such a system, we rely upon a feature of FPGAs known as PartialReconfiguration (PR), where a Partial Reconfiguration Region (PRR) can be recon-figured dynamically at run time. The VCI chain is connected to the SVP usinga multiplexer network added to the PRR logic. Using this system, we examinethe ability to implement OpenVX compute graphs with space/time tradeoffs on anFPGA device with limited resources. This means our approach decides which partof the graph runs on the SVP and which part runs as a VCI or VCI chain. Similarto previous approaches, we use a heuristic approach to leverage runtime optimiza-tion techniques which cannot be done as easily in the classical ILP approach. Wediscuss this model in detail in chapter 5.1.2.2 Experimental MethodologyFor each of the three experimental architectural models, we use a slightly differentmethodology for building our tools and collecting results.In the first model, we develop a Java-based compiler platform similar to theAmbric system. This compiler performs optimization and code generation for ourtarget device which is similar to the Ambric MPPA. However, we did not follow pre-cise instruction encodings, and we did not attempt to run the code on a real Ambricdevice. Instead, we developed a cycle-accurate simulator which measures execu-6tion time. Benchmarks are written in Java and run on this simulator. Although realAmbric devices are no longer available, similar coarse-grained overlays are beingdeveloped on FPGAs such as the GRVI Phalanx [34] as well as an Ambric clone. 1In the second model, we use the actual Xilinx tools and devices to producea real bitstream. In particular, the Vivado HLS tool is used to synthesize C forthe FPGA. Benchmarks are written in C using the OpenVX API. We develop afront-end tool which synthesizes an FPGA implementation of an entire OpenVXcompute graph. This tool is given an overall area or throughput target; it analyzesthe full graph, optionally transforms it, and determines the best space/time tradeoffto meet the target. Each node is an OpenVX compute kernel which is written in Cand fully annotated with pragmas for maximum parallelism with Vivado HLS. Ourtool determines which pragmas are needed for each kernel instance in the graphto meet the overall target. The final output is a C program which is compiledby the Vivado tools into an FPGA bitstream, where the area usage, clock speed,and throughput metrics can be verified. These graphs use AXI-stream inputs andoutputs, allowing them to be easily connected to a full system.In the third model, we continue to use Xilinx tools and devices together withthe VectorBlox Matrix Processor (MXP) SVP, but we produce final results using aperformance equation. For each OpenVX compute kernel, we use three implemen-tations: scalar ARM Cortex-A9 code produced using regular C with gcc, vectorizedcode for the ARM and VectorBlox MXP, and a VCI. The scalar code is only used toproduce a performance baseline. 
The VCI implementations are produced by com-piling the OpenVX kernels (from the previous model) for varying SIMD widths,up to the width of the SVP, and compiled into the smallest area required. To savetime in this work, we did not implement the multiplexer network required to con-nect VCI chains into the MXP, nor did we implement the PR control logic. Instead,we model the time required to perform PR based upon performance specificationsgiven by Xilinx for their Internal Configuration Access Port (ICAP) on-chip config-uration controller. We are particularly interested making the ICAP controller faster,so we model this speed as a variable, and use performance equations to model thespeed of the overall system. With this feasibility study, we have confidence that a1Michael Butts, private communication.7working system could be implemented to verify the estimated gains.1.3 ContributionsThe contributions of this dissertation are summarized in the following paragraphs.• Chapter 3 is a demonstration of the benefits of automated space/time scalingfor STGs mapped onto MPPA overlays rather than manually implementing,scaling and optimizing. We introduce an HLS tool that automatically allowsexploring area/throughput tradeoffs for STGs. We improve upon the classicalILP approach by introducing a heuristic approach.Our approach differs from existing approaches because:1. It automatically investigates partitioning and finding different imple-mentations.2. It combines module selection and replication methods with node com-bining and splitting in order to automatically find a better area/through-put tradeoff.3. It presents a novel heuristic approach which is more flexible and canfind design points not feasible to find with a classical ILP approach.• Chapter 4 is a demonstration of our solution for automating the ability tomake space/time tradeoffs in a commercial HLS tool. This leads to a C-to-RTL tool which can automatically explore the space/time problem spacefor implementing OpenVX applications defined as STGs on FPGAs. Basedon our knowledge, this cannot be done with existing C-to-RTL tools. Theexperiment shows the proposed approach is able to achieve the same perfor-mance as manually written implementations and also it finds several moresolutions for a variety of different throughput targets which leads to a 30%area reduction. The tool is able to achieve over 95% of the target area bud-get on average while improving the throughput. The Inter-node Optimizerstep of our heuristic is able to hit the same throughput targets while reducingthe area cost by 19% on average compared to the ILP approach. In termsof efficient use of parallel resources on the chip, the experiment shows the8tool manages to satisfy different throughput targets while using parallel re-sources efficiently. For example, it achieves up to 5.5 GigaPixel/sec for theSobel application on a small Xilinx 7Z020 device.• Chapter 5 is a demonstration of run-time acceleration using dynamic par-tial reconfiguration. More specifically, a SVP software system with coarse-grained parallelism is further accelerated using rapid reconfiguration of VCIchains. The experiment shows speedups far beyond what a plain SVP canaccomplish. For example, an 8-lane SVP achieves a net speedup of 18 versusthe scalar ARM processor for running the Canny− blur application. Thiswas achieved by using automated space/time scaling, node clustering anddynamic PR. 
However, if FPGA vendors can provide a much faster PR rate, anet speedup of 106 is possible.Our approach differs from existing approaches because:1. It is the first work to explore PR and time-sharing of VCI for speedingup SVP.2. It further improves performance by adding scratchpad bypass with VCIchaining.3. It uses pre-synthesized node fusion of common VCI chains to save area.1.4 Dissertation OrganizationChapter 2 presents the background of this dissertation, including the general ap-proach to find different degrees of parallelism in an application and programmingmodels to describe the application. Moreover, it presents the background of dif-ferent reconfigurable computing platforms we used in this dissertation. Chapter 3details our implementation of automated space/time scaling of STG on a coarse-grained architecture (MPPA). Chapter 4 details our approach of exploring auto-mated space/time tradeoffs for CV application described as an OpenVX computegraph on a fine-grained architecture (FPGA). Chapter 5 details our implementationof a compilation system which uses run-time reconfiguration to accelerate applica-tions on a hybrid SVP/FPGA architecture.9Chapter 2BackgroundThis chapter provides the necessary background information for this dissertation.First, it presents an overview of finding parallelism in a general program to showdifferent transformations that find different degrees of parallelism in a general pro-gram. Next, it describes different programming models and their pros and cons. Togive the necessary background to understand different reconfigurable computingplatforms used in this dissertation, FPGA, MPPA and SVP are introduced. Thisleads to discussing OpenVX applications on reconfigurable computing platforms.Finally, it discusses the advantages and limitations of partial reconfiguration.2.1 Finding Parallelism in a General ProgramRegardless of what programming model we will use for describing a computationalproblem, the first step to get better throughput and use parallel resources efficientlyis analyzing the program and finding the potential parallelism in it. Getting highperformance on a platform with parallel resources requires not only finding paral-lelism in the program but also minimizing the synchronization overhead becausethe synchronization process may stall the system; a processing element may haveto wait for another processing element to reach the corresponding synchroniza-tion point and make data ready. High synchronization frequency generally comeswith high levels of data communication between processing elements, which mightreduce the performance of the system. As a result, a program with fine-grain syn-10chronization can run on a multi-core system even slower than one processor [56].It is therefore important to find parallelism that requires minimal synchronization.In other words the final goal is identifying the coarsest granularity of parallelismin a program by finding the largest set of independent computations that can be runby different processing elements in a synchronization-free manner. As mentionedabove, it is important to find parallelism before implementing a design. A paral-lelism finder algorithm analyzes and transforms the program to find all the degreesof parallelism in it.There have been several studies on program transformation to achieve paral-lelism and data locality for a program. 
These transformations are generally limitedto loops that use affine functions for representing bounds and array accesses [6].Below we describe an algorithm that finds the maximum degree of parallelismin a general program with nested loops and affine index expressions for array ac-cesses proposed by Lim et al. [56]. Index expressions are affine if it involvesmultiplying the loop index variables by constants and adding constants. All the in-structions in a program are identified by the loop index values of their surroundingloops, and affine expressions are used to map these loop index values to a partitionnumber. Partition numbers are used for two different purposes, space partition-ing and time partitioning. Operations belonging to the same space partition aremapped to the same processing element. On the other hand operations belongingto time partition i should execute before those in partition i+ 1. We try to find acombination of affine space and time partition mappings that maximizes the de-gree of parallelism with successively greater degree of synchronization. Severaltransformations are described in [5, 7, 8, 14, 44, 74, 92]. We can achieve all of theloop-level parallelism with a combination of these transformations such as:• Fusion• Fission• Re-indexing• Scaling• Reversal11• Permutation• SkewingFirst we should explain the different forms of parallelism and present our prob-lem statement. A program has k degrees (number of dimensions) of parallelismif O(nk) units of computations can be executed in parallel, where n is the numberof iterations in a loop. Also we say that different degrees of parallelism in a loopnest exist at the same nesting level if and only if they require the same amountof synchronization. The algorithm described in this section locates all the degreesof parallelism in a program. In other words it finds the maximum degree of par-allelism at each level of granularity, starting from coarsest to finest. It also findsopportunities for pipelining. This algorithm assumes there is an infinite number ofvirtual processors, which means for each independent thread of computation thereis a processor. To generate code for a specific number of processing elements ina target architecture we can simply combine multiple of these parallel threads andassign them to the same processing element.We can describe the overall problem of finding the maximum degree of par-allelism into subproblems such as: how to maximize the degree of parallelismthat requires 0, O(1), and O(n) amount of synchronization, where n is the numberof iterations in a loop. By solving each of these problems in turn, the algorithmfinds successively more degrees of parallelism at a higher cost of synchronization.The above-mentioned algorithm repeats these steps to find parallelism requiringO(n2),O(n3), ... synchronization until it finds sufficient parallelism to occupy allof the available hardware. Below we describe these subproblems with more detail.The subproblem of maximizing synchronization-free parallelism studies theproblem of parallelizing an application without allowing any communication orsynchronization between processors at all. In other words it is formulated as parti-tioning the dynamic operations, into the largest number of independent partitions.More specifically, it finds an affine partition mapping for each instruction that max-imizes the degree of parallelism. 
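As a concrete illustration (this example is ours and is not taken from [56]), consider the simple loop nest below: each outer iteration reads and writes only row i of the array, so the affine space partition p = i gives every row to its own virtual processing element and no communication is needed between partitions.

/* Illustrative example (not from the thesis): a loop nest with a
 * synchronization-free affine space partition. */
#define N 1024
float A[N][N];

void smooth_rows(void) {
    for (int i = 0; i < N; i++) {        /* space partition: p = i            */
        for (int j = 1; j < N; j++) {    /* dependence on A[i][j-1] stays     */
            A[i][j] = 0.5f * (A[i][j] + A[i][j - 1]); /* inside one partition */
        }
    }
}
/* Row i depends only on elements of row i, so all N rows can run in
 * parallel with zero synchronization between processing elements. */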
By using a set of space-partition constraints in theaffine partition mapping process, we ensure that the processing elements executingoperations in different partitions need no synchronization with each other.The next subproblem is to find parallelism with O(1) synchronization which12means the number of synchronizations must be independent of the number of it-erations in a loop. This algorithm divides instructions into a sequence of stages(strongly connected components) and locates synchronization-free parallelism withineach stage, then it inserts barrier synchronization before and after each parallelizedstage.Finally, to find parallelism with O(n) synchronization, the algorithm tries tofind an affine time partition mapping for each instruction. By using a set of time-partition constraints in the affine mapping process, we ensure that data dependencescan be satisfied by executing the partitions sequentially. The goal is to find affinemapping that yields the maximum parallelism among operations within each of thetime partitions.The time partitions and space partitions are similar in many ways and areamenable to the same kind of techniques. The affine form of the Farkas lemma[29] has been used to transform the constraints into linear inequalities. The prob-lem of finding a partition mapping that gives the maximum degree of loop-leveland pipelined parallelism while satisfying the space-partition or time-partition con-straints reduces to finding the null space of a system of equations. This affinepartition mapping can be found easily with a set of simple algorithms.2.2 Programming ModelsIn this section we go through some programming models and their pros and cons.For more information see the Tessier et al. survey paper [80]. In the early daysof reconfigurable computing, there was no overlap between programming generalpurpose computers and FPGAs or other reconfigurable computing platforms. Whileprocedural languages such as C were generally used to target microprocessors,most FPGAs application designers were still drawing schematics or using HDLssuch as Verilog or VHDL. In order to make the reconfigurable computing plat-forms more accessible for general users without any hardware knowledge, severaldifferent programming models have been introduced.A key goal in the early days was making the programming environment for re-configurable computing platforms as similar as possible to microprocessor-basedsystems to make it attractive to general users. Since C was primary language of13the day, many C-based programming models were introduced. An initial C-to-hardware compiler [89] could do source to source transformation for a simple chainof operations (e.g. add, shift) into HDL code. Wo et al. [91] extended this ideaby adding a simple state machine to execute multiple sequential hardware opera-tions (considering data dependencies). To express parallelism better, features wereadded to the language. Early compilers often relied on users to manually expressparallelism and synchronization using “pragma” statements [32]. Modern systemsare closer to extracting parallelism from C source code [87] and consider the in-terface between the synthesized hardware and memory [99]. Application code canalso be profiled using software execution [15] to determine target code for hard-ware acceleration. An important aspect of C-to-FPGA compilers is the estimationof functional unit area and performance [21], an issue made easier by the recentinclusion of hard macro blocks in FPGAs. 
Moreover, specialized systems have alsobeen introduced to target synthesis with floating-point data types [85]. The amountof research and number of commercial procedural language-to-hardware tools (in-cluding Xilinx’s Vivado, Calypto Catapult C, and Cadence’s C-to-silicon) in recentyears, show the demand of targeting users without hardware knowledge.The similarity between objects in object-oriented programming models andinstantiated hardware modules has led to a number of attempts to represent recon-figurable computing as communicating objects. Predictable communication flowand limited communication dependencies are key aspects of these models. Thisis similar to pipelined implementations. Streaming applications typically havecoarse-grain compute objects that communicate with adjacent blocks via buffersor synchronized communication channels. Moreover, results are often sent alongpredetermined communication paths at predictable rates. This model is suitablefor defining a series of signal processing blocks that require minimal control flowor global synchronization. PamBlox focused on the ability to define hierarchies ofC++ objects [63] and the use of embedded block memories. These blocks couldthen be organized into streams. The Streams-C model [31] introduced a series ofcommunicating sequential processes that used small local memories. Streams-Ctools were later commercialized into the Impulse-C compiler. Perhaps the mostcomprehensive stream-based, object-oriented environment to date was the com-mercial Ambric model [12]. In this model, a Processing Element (PE) can exe-14cute one or more user-defined objects that communicate with objects running onother PEs via self-synchronizing, dataflow channels. The commercial BluespecSystemVerilog [64] hardware synthesis system is also based on the manipulationof objects.The ability to abstract away details of implementing reconfigurable platformsfrom the users makes stream-based environments attractive to general users. Sev-eral projects have considered the possibility of combining stream-oriented com-putation with run-time reconfiguration. JHDL [10] allows defining objects whosefunctionality can be dynamically changed. Development tools allow for evaluationof system performance using both simulation and in-circuit execution. The SCOREproject [16] explores swapping stream-based objects on-demand at run-time. Ob-jects can be swapped if the number of objects in an application is too large to fitin the hardware. As a result, the same application could be mapped to hardwareplatforms of many different sizes.Data flow models are often used for specifying the behaviour of signal pro-cessing and streaming applications as a set of tasks, actors or processes with dataand control dependencies. The differences between various dataflow models canbe characterized by their expressive power and the availability of techniques foranalyzing correctness and performance properties like absence of deadlock andthroughput. The Kahn Processing Network (KPN) [30], for example, can capturemany of the dynamic aspects of these systems, but evaluating their correctnessand performance is in general undecidable. On the other hand, Synchronous DataFlow (SDF) [53] models do allow analysis of many correctness and performanceproperties but they lack support for expressing any form of dynamism.2.3 Reconfigurable Computing Platforms2.3.1 FPGAFPGAs are Integrated Circuits (IC) that are used to implement digital logic func-tions. In contrast to an ASIC, an FPGA is field-programmable. 
This means that logic functions must be programmed after the device has been manufactured. On the other hand, an ASIC implements digital logic functions by placing and connecting transistors (layout), which cannot be changed once the chip is fabricated. An FPGA implements digital logic functions with Configurable Logic Block (CLB) modules (typically containing Look-Up Tables (LUTs) and flip-flops (FFs)) and configurable interconnect between them. Most modern FPGAs can be programmed by loading data into Static RAM (SRAM) cells, which can be reconfigured practically an unlimited number of times [51]. Since normal Integrated Circuit (IC) technology has been used to fabricate the SRAM cells, CLBs and routing components on FPGAs, it is reasonable to describe an FPGA as a type of ASIC that emulates other ASICs. Traditionally, FPGAs are used for applications such as ASIC prototyping, telecommunications equipment (where low volumes and changing standards make ASICs less attractive), and as interfaces between other ICs ("glue logic"). In recent years, modern FPGAs have been introduced which provide more performance by adding different hard-core components such as memory blocks and multipliers. This makes them more desirable for running computationally intensive applications such as computer vision.

Architecture and Design Flow

A common model of an FPGA is a two-dimensional grid of blocks which are connected by a mesh routing network. The blocks may consist of CLBs or different hard blocks. The hard blocks are frequently used functions that are too expensive to implement as soft logic using CLBs. The most common hard blocks implemented by FPGA vendors are memories (Block RAM (BRAM) modules) and multipliers or other expensive arithmetic and logical functions (Digital Signal Processing (DSP) blocks). FPGA blocks communicate through a configurable routing network, usually using horizontal and vertical routing wires meeting at reconfigurable switch blocks. Additionally, which wires within a routing channel a block input or output connects to is configurable. The input/output blocks with pins that connect the FPGA to the outside world are also configurable, supporting multiple voltage levels and signalling styles.

FPGA architecture is relatively generic, so user logic can be implemented in any set of CLBs. This means there are many possible ways to implement a general digital circuit on an FPGA. This process is not straightforward: modern FPGAs are large, so mapping a large design to a large FPGA in an optimal way is computationally intensive. To appreciate the problem size, consider that the largest Xilinx Virtex UltraScale+ FPGA has 3.8 million programmable logic cells, 94 Mb of BRAM and 12,288 DSP slices.

The Computer Aided Design (CAD) flow for translating a design to an FPGA configuration bitstream varies for different vendors. For more information, refer to the survey paper by Chen et al. [17]. The differences are not important for the purposes of this dissertation, but it is necessary to understand a CAD flow to understand the reasons the traditional FPGA design cycle is long compared to the design cycle for software. RTL synthesis is the process of translating the input circuit as specified by the user to a netlist of Boolean functions and macros such as hard blocks. The Boolean functions get technology-mapped to FPGA programmable logic blocks. Placement then selects locations for each of these units, and routing determines how to configure the communication network to connect logic block inputs and outputs.
Finally, assembly creates the bitstream that is used to program the FPGA. Note that the term synthesis is often used to refer to the entire process; it includes the time taken by all of the steps. For large FPGAs, synthesis can take hours or even days. This fact makes automatically exploring the space/time tradeoffs and using a library of pre-synthesized bitstream implementations desirable for users.

Figure 2.1: Example FPGA architecture (I/O blocks, logic columns, DSP columns, memory columns, configurable routing)

2.3.2 Massively Parallel Processor Array

A Massively Parallel Processor Array (MPPA) is a type of embedded platform which has an array of hundreds or thousands of processing elements (processors) and memories (Random Access Memory (RAM)). Processors in the system pass work to one another through a reconfigurable interconnect of channels (e.g., First In, First Out (FIFO) buffers). A general MPPA is a Multiple Instruction, Multiple Data (MIMD) architecture, often with local distributed memory. Communication between processors is realized in the configurable interconnect. Each processor can often run independently at its own speed. Several different architectures for MPPAs have been introduced by both industry and academia. Companies such as Aspex (Ericsson), Ambric, PicoChip, Intel, IntellaSys, GreenArrays, ASOCS, Tilera, Kalray, Coherent Logix, Tabula, and Adapteva have introduced their MPPAs in recent years. Academia has introduced some architectures as well, such as AsAP [100]. Below we discuss some popular MPPAs.

PicoChip (acquired by Intel) developed a multi-core digital signal processor, the picoArray [27]. This integrates 250-300 individual DSP cores onto a single die (depending on the specific product) and as such it can be described as an MPPA. Each of these cores is a 16-bit processor with a Harvard architecture, local memory and 3-way Very Long Instruction Word (VLIW) execution. The picoArray is programmed using a mixture of VHDL [2], ANSI/ISO C and assembly language. The VHDL is used to describe the structure of the overall system, including the relationship between processes, and the signals which connect them together. Each individual process is programmed in conventional C (albeit with additional communication functions), or in assembly language.

The Epiphany architecture consists of a low-power, multi-core, scalable, parallel, distributed shared memory embedded system created by Adapteva [1]. The Epiphany IV Network on Chip (NOC) co-processor contains 64 cores (referred to as eCores) organized in a 2D mesh, with future versions expected to house up to 4096 eCores. The Epiphany chip can be programmed using C, and has a Software Development Kit (SDK), but users need to manually find and express parallelism.

The Kalray MPPA was introduced as a single-chip many-core processor that integrates 256 user cores and 32 system cores in 28nm CMOS technology [24]. These cores are distributed across 16 compute clusters of 16+1 cores, and 4 quad-core I/O subsystems. Each compute cluster and I/O subsystem owns a private address space, while communication and synchronization between them uses a NOC.
This processor targets embedded applications whose programming modelsfall within the following classes: KPN, as motivated by media processing; singleprogram multiple data (SPMD), traditionally used for numerical kernels; and time-triggered control systems.University of California, Davis, introduced Asynchronous Array of SimpleProcessors (AsAP) [100] which contains an array of simple RISC processors with anine-stage pipeline with small instruction and data memories. Processors commu-nicate only with adjacent processors to permit full-rate communication with lowenergy. Each processor can receive data from any two neighbors and send data toany combination of its four neighbors. The first generation was introduced with36 processors. The second generation has 167 processors [86] for DSP, communi-cation, and multimedia workloads. It contains 164 programmable processors withdynamic supply voltage and dynamic clock frequency circuits, three algorithm-specific processors, and three 16 KB shared memories, all clocked by independentoscillators and connected by configurable long-distance-capable links.AmbricAmbric Inc. was a high performance computing company founded in 2003 in Ore-gon state, United State of America. Their Am2045 MPPA chips have 336 32-bitRISC-DSP fixed-point processors and run up to 300 MHz. They were designed forhigh-performance embedded systems such as medical imaging, video, and signal-19Figure 2.2: Ambric bric organization [12]processing. The Am2045 is internally organized into a 5×9 array of bric modules.Figure 2.2 shows one bric and its neighbouring brics. Each bric contains twokinds of 32-bit Central Processing Unit (CPU)s. SRD processors contain 3 Arith-metic Logic Unit (ALU)s and provide math-intensive instructions to support DSPoperations. Each SRD processor contains a dedicated 256-word RAM for instruc-tions and data. This memory can be augmented though direct connections to bricmemory objects. SR processors are lighter weight with only 1 ALU. They con-tain a dedicated 128-word memory for programs and data but do not have directconnections to memory objects. Each of the two memory objects (RU) in a bric isorganized as 4 independent RAM banks.20Ambric introduced the Am2045 and its software tools in 2007, but fell victimto the 2008 worldwide financial crises [90].Microprocessor Report gave a 2006 MPR Analysts’ Choice Award for inno-vation for the Ambric architecture “for the design concept and architecture of itsmassively parallel processor, the AM2045”. Although Ambric Inc. is defunct, in2013 the Ambric architecture received the Top 20 award from the IEEE Interna-tional Symposium on Field-Programmable Custom Computing Machines, recog-nizing it as one of the 20 most significant publications in the 20-year history of theconference.Software written for Ambric devices is based on the Structural Object Program-ming Model [12]. Each processor is programmed in conventional Java (a strictsubset) and/or assembly code (Figure 2.3). A programmed processor or memory iscalled a primitive object. Objects run independently at their own speeds. They arestrictly encapsulated, execute with no side effects on one other, and have no implic-itly shared memory. Objects intercommunicate through channels (FIFO-buffers)in hardware. Channels carry both data and control tokens. Channel hardware syn-chronizes its elements at each end, not at compile time but dynamically as neededat run time. Inter-processor communication and synchronization are combined inthese channels. 
The transfer of a word on a channel is also a synchronizationevent. Processors, memories, and channel hardware handle their own synchroniza-tion transparently, dynamically, and locally, so it doesn’t have to be done by thedeveloper or the tools. Channels provide a common hardware-level interface forall objects. This makes it simple for nodes to be assembled into higher-level com-posite nodes. Because nodes are encapsulated and interact only through channels,composite nodes work the same way as primitive node objects. The developersexpress object-level parallelism using a block diagram. First the hierarchical struc-ture of primitive and composite objects is defined, connected by channels. Thenordinary sequential software is written to implement the primitive objects. We haveused an architecture and programming similar to Ambric in chapter 3.21Figure 2.3: Structural object programming model [12]2.4 OpenVXOpenVX is a cross-platform, C-based API standard for Computer Vision applica-tions. OpenVX is a good programming model for embedded systems because itenables performance and power-optimized CV processing to be written in a waythat is independent of the target architecture. At the lowest level are OpenVXkernel functions; these are implemented as a library by developers with detailedknowledge of the target accelerator. In OpenVX, CV applications are implementedas a set of these vision kernel functions which communicate through streamingchannels. An OpenVX application assembles these kernels, or nodes, into a graph,where edges convey an image passed between kernels. For example, Figure 2.4shows the OpenVX code for Canny edge detection and Figure 2.5 shows the cor-responding graph.The OpenVX run-time system can break a large image into smaller imagetiles, allowing each small tile to pass through the entire graph to completion be-fore writing back to external memory. This provides excellent memory localityand improves both performance and power keeping the tile and its transformationson-chip for as long as possible. Most kernel functions can work with tiles becausethey need only local information. The few kernel functions that need global in-formation cannot be executed until the full image output of the preceding kernelfunctions is computed; this can be achieved by cutting the graph at this point and22/ / Canny examplevx node nodes [ ] = {vxColorConvertNode ( graph , rgb , gray ) ,vxGaussian3x3Node ( graph , gray , gauss ) ,vxSobel3x3Node ( graph , gauss , gradx , grady ) ,vxMagnitudeNode ( graph , gradx , grady ,mag) ,vxPhaseNode ( graph , gradx , grady , phase ) ,vxNonMaxima ( graph ,mag, phase ,nm) ,vxThreshold ( grpah ,nm, output )} ;Figure 2.4: OpenVXsource code forCannyColor ConvertGaussian 3x3graySobel3x3gaussgradxgradyMagnitudePhasergbmagphaseHyst ThreshnmNon-Maxima outputFigure 2.5: OpenVX graphfor Cannyonly operating on full images at these cut points.OpenVX applications can be defined as streaming applications: each stage re-ceives stream of image pixels, rows or frames, processes them, and sends the resultsas stream of data to the next stage. This means we can describe a computation ascompute graphs or Synchronous Data Flow Graphs (SDFG) [53]. Previous studieshave shown that custom hardware implementations on FPGAs as well as MPPAshave the potential to dramatically increase the performance/Watt for computation-ally intensive SDFGs [4, 40].2.5 Partial ReconfigurationThe functionality of an FPGA is created by loading its configuration with a set ofbits called a bitstream. 
In most applications, the entire FPGA is configured at oncewith a single bitstream. The Partial Reconfiguration (PR) feature available in mostSRAM-based FPGAs allow only a portion of the FPGA to be reconfigured, withthe rest of the bits staying intact. PR allows changing behaviour of partitions in theFPGA architecture while the remaining logic is still running.There are two main uses for PR: to save parallel resources and power [42, 69],or to increase performance. There have been several studies on the use of PR.For example [19, 50] demonstrate using partial FPGA reconfiguration for imageprocessing, [20, 83] propose self-adaptive control systems, and [28, 72] investigatethe use of PR for High Performance Computing (HPC).23Radio CryptoProtocol Processing FM FSK QRM RC5 AES 3DESHTTP VoIP SSH FTPConfiguration MemoryADCDACFigure 2.6: Area saving by reconfiguring only the currently required acceler-ator module to the FPGA. Configurations are fetched from the modulerepository at runtime.Below we will go through two examples to illustrate the benefits of PR [47].First, Figure 2.6 shows an adaptive communications device where each of threestages (the front-end radio, the cryptographic accelerator, and the protocol process-ing engine) can be loaded with three or four different ‘algorithms’. The algorithmswithin each stage are never needed at the same time, i.e., they are mutually ex-clusive. In a traditional hardware system, all of these algorithms would be imple-mented at the same time and use considerable area. In a PR system, only enougharea for the largest algorithm in each stage needs to be reserved, leading to consid-erable savings.There are more potential benefits than only power and area savings. If wecan implement a system with a smaller FPGA, we might be able to use a smallerpackage, which is especially important for mobile applications. In the case ofmore complex systems that demand multiple FPGAs, PR may reduce the total FPGAcount, thereby simplifying PCB and system design. Due to higher integration, wemay be able to perform more data processing on-chip, thereby reducing energy forchip-to-chip communications and/or memory writes.Second, partial reconfiguration can help increase system performance. For ex-ample, in Figure 2.7, the PR system on the right can perform more work to achieve24Figure 2.7: System acceleration due to partial reconfiguration. By spend-ing temporarily more area for each function, the overall latency isreduced[47].lower latency and higher throughput than the one on the left (without PR). If weassume that modules A, B and C are executed one after another, the system onthe left requires 3 time steps to operate on a single block of data. In contrast, thesystem on the right can dedicate all resources to step A before using PR to move tostep B.A practical example for this is a hardware accelerated secure SSL connection.In this protocol, we first exchange a session key using asymmetric key exchange.After this, the session key is used by a symmetric cipher for the actual data transfer.Consequently, both steps are executed strictly sequentially and we can dedicatemore resources for each step and reduce latency by using PR.Partial run-time reconfiguration has been an active research field for morethan two decades. Much research has been done on increasing the reconfigu-ration speed and reducing the reconfiguration time. Many different approachesand ideas have been proposed. 
Hauck introduced the concept of configurationprefetching to swap modules in the background to hide latency and improve thereconfiguration time [36]. Dittmann et al. used reconfiguration port schedulingas well as configuration preemption to improve the efficiency of the reconfigu-ration interface and to decrease the reconfiguration time [26]. Lange et al. in-troduced a hyper-reconfigurable architecture to reduce the amount of configuration25data needed to perform reconfiguration, which results in lower reconfiguration timeand faster reconfiguration speed [52]. Different studies have investigated bitstream(de)compression to reduce the required bandwidth of configuration storage and toreduce the reconfiguration time [39, 48, 54]. Moreover, several different types ofreconfiguration controllers, with or without Direct Memory Access (DMA) capabil-ities, have been investigated to improve the reconfiguration speed and reduce thereconfiguration time [18, 58, 61].Although FPGA PR rates are relatively low, it is possible to obtain increasedperformance. For example, Xilinx FPGAs provides 400MByte/sec configurationspeed with its on-chip ICAP controller ICAP. Hansen et al. showed it possible toachieve 2.2GByte/sec with existing Xilinx FPGAs [35] by overclocking the ICAPcontroller. In addition, other studies have suggested different architectures to im-prove PR time. For example, Trimberger et al. proposed a time-multiplexed FPGAarchitecture with a 33GByte/sec reconfiguration rate [84]. While faster reconfigu-ration times are possible, FPGA vendors have not yet seen sufficient need or demandto provide this feature.26Chapter 3MPPA Space/Time ScalingIn this chapter, we automate space/time scaling for STGs by developing a Java-based compilation tool chain targeting a pipelined coarse-grained architecture thatis very similar to the Ambric MPPA chip architecture. Similar to the Ambric tools,our compiler accepts input written in a subset of Java. Unlike the Ambric tools, ourcompiler analyzes the parallelism internal to each node and evaluates the through-put and area of several possible implementations. After finding different imple-mentations for each node, it then analyzes the full graph for bottlenecks or excesscompute capacity, and selects an implementation for each node while either mini-mizing area (for a fixed throughput target), or maximizing throughput (for a fixedarea target). To find a better area/throughput tradeoff, we use node combining andsplitting in the graph. We present two optimization approaches, a formal ILP for-mulation based on prior work and a novel heuristic solution. Results show thatthe heuristic is more flexible and can find design points that are computationallyinfeasible to find using the ILP, thereby achieving superior results with a fasterruntime.3.1 IntroductionIn this chapter, we will investigate whether an array of ALUs or very lightweightprocessors, described best as a massively parallel processor array or MPPA, canachieve sufficient levels of performance, and make design entry sufficiently easy,27to make them an interesting alternative to more traditional design methods forrunning STGs. Moreover, we propose a novel approach to automatically explorespace/time tradeoffs that can produce different optimized implementations for dif-ferent throughput targets or different area budgets.To explore the MPPA as an alternative target, we need a programming modeland a tool flow that can compile algorithms into the target and use parallel resourcesefficiently. 
To make an MPPA a truly high performance platform, the programmingmodel should support some of the strengths of FPGAs, especially pipelined paral-lelism. This led us to start with the explicit streaming model and architecture thatwas defined by Ambric [12, 13].In the Ambric model, a Java object is created for each thread. Threads arenodes in a graphs defined as instances of objects that communicate together throughexplicitly defined blocking FIFO communication channels. The node can be a prim-itive node ranging from a single operation to multiple loop nests and complex con-ditions, or a composite node containing more than one primitive node. The objectsand channels are placed and routed onto an array of 336 processors with a meshNoC. Each object contains local state and a processing thread. Processing an objectcan be variable latency, but computation between objects is synchronized throughthe blocking FIFOs. Objects may be replicated, thus facilitating some re-use of aprogram, but all instances are explicitly allocated and defined by the programmerat compile-time. This is very similar to a KPN [30], except that in a KPN the FIFOsare assumed to be infinitely deep. The resulting process network exhibits determin-istic behaviour that does not depend on the various computation or communicationdelays.One of the drawbacks of the Ambric framework is the need for explicit allo-cation of all objects and channels. The number of objects, and the computationaldelays within each object, define the amount of parallelism and the throughputof the application. Thus, scaling a program to a larger or smaller processor ar-ray requires manually re-programming all objects and channels. For the Ambriccommercial solution consisting of a single device, this is an acceptable trade-off.However, for a research platform, we must investigate a variety of array sizes, aswell as simpler or more complex processors, which requires automatically trans-forming a streaming application to use more or less space, thereby increasing or28decreasing throughput.In our model there is no long global interconnect. Instead, each PE can onlysend/receive data to/from its immediate neighbours. In order to send/receive datafrom/to more PEs, intermediate PEs are needed to disseminate the values. To eval-uate functionality of our model and measure execution time, we developed a cycle-accurate simulator. Note, the simulator doesn’t support 2D placement which needsto be addressed in the future work. Although the simulator doesn’t support 2Dplacement, we manually placed sample benchmarks to make sure the model is notdependent on long placement delays.In this chapter, we describe our compilation tool that can perform automatedspace/time tradeoffs. The user describes an initial program in Ambric-style Java,and then defines either a throughput target, or an area budget. The compiler ana-lyzes the processing rate of each object (or thread), and propagates these through-puts across the entire computational graph (defined by the communication chan-nels). It also analyzes each thread to determine the degree of internal parallelism.Using this information, it transforms the compute graph to meet the area or through-put target. There are a variety of transformations such as replicating objects (re-quiring a split/join on the data), subdividing objects into a deeper pipeline (in-creasing throughput), and merging objects together (decreasing area). 
At all times,a whole-program approach is taken to optimization, so portions of a program thatare not performance-critical will be merged to use less area, and more area willbe allocated to performance-critical regions. This alleviates some effort from theprogrammer, and creates a scalable/re-targetable implementation.Our tool uses two internal optimization approaches. The first is based uponILP and is based on prior work on task graph optimizations by Cong et al. [22].The second, based upon a heuristic approach, is our own novel contribution. Al-though the ILP approach works well, it can be difficult or impossible to representsome types of optimizations as ILP constraints. In particular, our heuristic ap-proach is able to perform object coalescing, which is difficult to model within theILP formulation; doing so adds a large number of additional constraint variableswhich quickly make the ILP computationally infeasible. This additional optimiza-tion gives the heuristic considerable area savings over the ILP approach.Figure 3.1 illustrates the flow for the proposed tool. It compiles a program29Figure 3.1: Tool flowdescribed in Ambric-style Java (compute) and aStruct (communication), and cre-ates a STG complete with composite nodes communicating with each other throughchannels (edges of the graph). It uses Intra-Node Optimizer and Inter-Node Opti-mizer in order to find different implementations for each composite node. It usesTrade-off Finder to find a good trade off between throughput and area.3.2 Finding Different ImplementationsConsider an application with N composite nodes f1, f2, ..., fN in its STG. For eachcomposite node fm, our tool tries to find different implementations P1m,P2m, ...,PSmmwhere each implementation Psm can perform the functionality of fm with area costA(Psm) and initiation interval II(Psm). For node fm and its implementation Psm, theminimal input “inverse-throughput” ϑin(Psm) and output inverse-throughput ϑout(Psm)are calculated asϑin(Psm) =II(Psm)In( fm),ϑout(Psm) =II(Psm)Out( fm)(3.1)30where In( fm) and Out( fm) equal the number of data tokens that fm consumes onthe input data channel and produces on the output data channel during each firing,respectively. Note that inverse-throughput shows the number of cycles used toconsume/produce per datum in its input/output channel. Intra-Node Optimizer andInter-Node Optimizer modules have been implemented in our tool to automaticallyfind these abovementioned implementations for each composite node.3.2.1 Intra-Node OptimizerAffine loop transformation strategies in [6–8, 56, 74] are used in Intra-Node Op-timizer to find the maximum degree of parallelism for each composite node. Af-ter finding the maximum degree of parallelism, Intra-Node Optimizer tries to findthe best throughput (minimizing the inverse-throughput) for each node for differ-ent area costs. Since each operation needs a different number of clock cycles toprovide its output (different inverse-throughput), Intra-Node Optimizer expands,combines, splits and, pipelines nodes regarding the inverse-throughput of opera-tions inside the composite node in order to find an implementation with highestthroughput for each composite node for different area costs. 
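Concretely, every candidate implementation the optimizers produce can be summarized by a few numbers. The sketch below is a minimal model of that per-node bookkeeping, following Equation 3.1; the structure and field names are ours, not the tool's internal representation.

    #include <vector>

    // A minimal model of the data kept per composite node: every candidate
    // implementation carries an area cost, an initiation interval, and the
    // token counts used in Equation 3.1.
    struct Implementation {
        double area;       // in primitive-operation (PE) units
        double ii;         // initiation interval II(P)
        double inTokens;   // In(f):  tokens consumed per firing
        double outTokens;  // Out(f): tokens produced per firing

        double invThroughputIn()  const { return ii / inTokens;  }  // Equation 3.1
        double invThroughputOut() const { return ii / outTokens; }  // Equation 3.1
    };

    struct CompositeNode {
        std::vector<Implementation> candidates;  // filled by the two optimizers
    };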
Below, the different optimization techniques used in Intra-Node Optimizer are listed.

• Loop Fusion
• Loop Fission
• Loop Re-indexing
• Loop Scaling
• Loop Reversal
• Loop Permutation
• Loop Skewing
• Node Replication
• Node Combining
• Node Splitting
• Pipelining

3.2.2 Inter-Node Optimizer

After finding the implementation with the highest throughput for each composite node at each defined area cost, Inter-Node Optimizer performs a clustering operation that finds further implementations. The clustering operation was implemented based on a previous study by Amit Singh et al. [77]. Each cluster will be mapped to one Processing Element. Inter-Node Optimizer moves operations back and forth between clusters to find the optimum area cost for each overall inverse-throughput target. The example below demonstrates how the Intra-Node and Inter-Node Optimizers work.

3.2.3 Example: N-Body Problem

The N-body problem simulates a 3D universe, where each celestial object is a body, or particle, with a fixed mass. Over time, the velocity and position of each particle is updated according to interactions with other particles and the environment. In particular, each particle exerts a net force (i.e., gravity) on every other particle. The computational complexity of the basic all-pairs approach we use is O(n^2). The run-time is dominated by the gravity force calculation, shown below:

\vec{F}_{i,j} = \frac{G M_i M_j}{r^2} = \frac{0.0625\, M_i M_j}{|\vec{P}_i - \vec{P}_j|^3}\,(\vec{P}_i - \vec{P}_j)    (3.2)

where \vec{F}_{i,j} is the force particle i imposes on particle j, P_i is the position of particle i, and M_i is the size or "mass" of particle i. When mapping the force calculation, because of the dependencies between instructions in this code, our tool first pipelines it. A simplified 2D pipelined STG (with per-operator inverse-throughputs) for the gravity force calculation is shown in Figure 3.2, and its corresponding Java source code is shown in Listing 3.1.

Figure 3.2: Pipelined force calculation (STG of multiply, add, square-root and divide operators over the inputs xi, xj, yi, yj, mi, mj, annotated with per-operator inverse-throughputs)

Listing 3.1: Force calculation Java code

    void force_calc() {
        for (int i = 0; i < NUMBER_OF_PARTICLES; i++) {
            gmm = ref_gmm * m[i];
            dx  = ref_px - px[i];
            dy  = ref_py - py[i];
            dx2 = dx * dx;
            dy2 = dy * dy;
            r2  = dx2 + dy2;
            r   = sqrt(r2);
            rr  = 1 / r;
            gmm_rr  = rr * gmm;
            gmm_rr2 = rr * gmm_rr;
            gmm_rr3 = rr * gmm_rr2;
            dfx = dx * gmm_rr3;
            dfy = dy * gmm_rr3;
            result_x[i] = result_x[i] + dfx;
            result_y[i] = result_y[i] + dfy;
        }
    }

Table 3.1: Different operations with their initiation intervals

    Operation    Initiation Interval
    ADD / SUB    1
    MUL          2
    SQRT         4
    DIV          8

Figure 3.3: A node with inverse-throughput = 4

Figure 3.4: Expanding a node using replication to improve throughput

Consider mapping each operation to a simple PE. Since everything is pipelined, we get the highest throughput per datum when each operation consumes/produces one datum in one clock cycle (inverse-throughput = 1). As shown in Table 3.1, each operator has a different initiation interval. Since each operation (primitive node) consumes/produces one datum from/to its input/output in each firing, its inverse-throughput is equal to its initiation interval. For example, division needs eight cycles to provide its output (inverse-throughput equal to 8), which makes it the slowest node in the STG.
Because of dependencies between nodes, faster operations have to stall for division. This means the best overall inverse-throughput we can get with this mapping is 8. To get the highest throughput, Intra-Node Optimizer uses an "expanding" approach to parallelize those nodes with high inverse-throughput (slow nodes).

One approach to expanding is replication. Figure 3.3 shows a node (fn) with inverse-throughput equal to 4. This means it takes 4 clock cycles for the node to generate an output. To improve the throughput, it is possible to replicate node fn four times, pass the inputs in round-robin order to each copy, and then gather the outputs coming from each replicated node in the same order. Figure 3.4 shows the expanded node for this situation. The expanded node improves throughput by a factor of 4 and is capable of receiving or sending one datum per clock cycle. Similarly, we can use this expanding method on all slow nodes in the force calculation STG. Figure 3.5 shows the improved, expanded STG for the force calculation after using Intra-Node Optimizer, where the overall inverse-throughput equals 1.

Figure 3.5: Expanded force calculation

After finding the highest possible throughput, Inter-Node Optimizer tries to cluster and combine nodes again to find several implementations with different throughput and area. In other words, Inter-Node Optimizer sacrifices throughput to save area. It continues this procedure until it assigns the entire composite node to one PE (area cost = 1). Figure 3.6 shows the inverse-throughput/area relation for different implementations of the gravity force calculation function. Here, the inverse-throughput varies from 1 to 33. Moreover, to achieve the best throughput (inverse-throughput = 1), we can either replicate the slowest implementation (inverse-throughput = 33) into 33 copies or use the fastest implementation directly (instruction-level parallelism or data-level parallelism).

Figure 3.6: Inverse-throughput/area relation for different implementations of the force calculation

We used the gravity force calculation as a simple example to demonstrate how the tool deals with a general STG. Of course, depending on the STG topology and data dependencies, the tool might not be able to use all of the parallelism methods.

After finding several different implementations for an STG, the tool needs to find a suitable implementation for a given user restriction. There are two modes: a defined throughput target, or a defined area budget. To do so, the tool needs to automatically explore space/time tradeoffs and find a suitable tradeoff for either mode. Below, we define the problem for these two modes and use two different approaches to solve the optimization problem.

3.3 Trade-off Finding Formulation and Solutions

To solve the trade-off finding problem we used an ILP approach as well as a heuristic approach.
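Before formulating the two objectives, it helps to see how the expansion step just described turns a single implementation into the (area, inverse-throughput) points that the trade-off finder chooses among. The sketch below is a minimal C++ model; the type and function names, and the fork/join overhead value, are ours and purely illustrative.

    #include <cmath>
    #include <cstdio>

    // One candidate implementation of a composite node: area in
    // primitive-operation (PE) units and inverse-throughput in cycles/datum.
    struct Impl {
        double area;
        double invThroughput;
    };

    // Replicating an implementation r times with round-robin fork/join
    // divides its inverse-throughput by r at the cost of r copies plus the
    // fork/join nodes that distribute and collect the data.
    Impl replicate(const Impl& base, int r, double forkJoinOverhead) {
        return { base.area * r + forkJoinOverhead, base.invThroughput / r };
    }

    int main() {
        // Slowest single-PE point of the force calculation (Figure 3.6):
        Impl slowest = { 1.0, 33.0 };
        // Reaching inverse-throughput = 1 needs ceil(33 / 1) = 33 replicas.
        int r = static_cast<int>(std::ceil(slowest.invThroughput / 1.0));
        Impl fast = replicate(slowest, r, /*forkJoinOverhead=*/2.0);
        std::printf("replicas=%d  area=%.1f  inverse-throughput=%.2f\n",
                    r, fast.area, fast.invThroughput);
        return 0;
    }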
For each of these two approaches (ILP and heuristic), there are two modes or objectives from which a user can choose:

• Given an available area target A_tgt and different implementations for each node f_j, which implementation P^i_j should be selected, and how many replicas n^i_j are needed, in order to minimize the application inverse-throughput ϑ_A, subject to the constraint that the application area cost A_A is not bigger than A_tgt.

• Given an inverse-throughput target ϑ_tgt and different implementations for each node f_j, which implementation P^i_j should be selected, and how many replicas n^i_j are needed, in order to minimize the area cost A_A, subject to the constraint that the application inverse-throughput ϑ_A is not bigger than ϑ_tgt.

3.3.1 Integer Linear Programming Algorithm

The first Trade-off Finding algorithm simply defines the problem as an Integer Linear Programming (ILP) problem based on prior work by Cong et al. [22]. The goal is to find binary integers x_{j,1}, x_{j,2}, \ldots, x_{j,S_m} indicating the implementations to be selected, and integers nr_{ij} indicating the number of replicas needed. Equation 3.3 shows the formulation for finding a suitable implementation for a given area budget.

Minimize \vartheta_A subject to:

\forall j \in \{1, \ldots, N\}: \quad \sum_{j=1}^{N} \sum_{i=1}^{S_m} nr_{ij}\, A(P^i_j)\, x_{j,i} < A_{tgt} \quad \text{and} \quad \sum_{i=1}^{S_m} x_{j,i} = 1.    (3.3)

Equation 3.4 shows the formulation for finding a suitable implementation for a given throughput target.

Minimize A_A subject to:

\forall j \in \{1, \ldots, N\}: \quad \sum_{i=1}^{S_m} \frac{1}{nr_{ij}}\, \vartheta(P^i_j)\, x_{j,i} < \vartheta_{tgt} \quad \text{and} \quad \sum_{i=1}^{S_m} x_{j,i} = 1.    (3.4)

An ILP solver such as the GNU Linear Programming Kit (GLPK) [60] can go through all the possibilities of x_{j,i} and nr_{ij} and find the optimum solution to this problem, subject to the constraints. Although ILP solvers can solve these problems, the approach has two shortcomings:

• Lack of flexibility: it is difficult and sometimes impossible to represent some types of optimizations as ILP constraints. In particular, combining or splitting nodes requires adding all combinations of merges and splits to the ILP problem formulation, which introduces a very large number of additional constraint variables. This quickly becomes computationally infeasible to solve.

• Time inefficiency: in our experiments, ILP (without the node combining or splitting optimization) is usually slower than our heuristic algorithm.

Figure 3.7: Minimum and expected inverse-throughput

3.3.2 Heuristic Algorithm

Before describing our heuristic approach, we must first define Throughput Analysis, Throughput Propagation and the Bottleneck Optimizer.

Throughput Analysis

Each node achieves its maximum output throughput if and only if all its input data are ready when the node expects them. To make the throughput analysis more straightforward we use inverse-throughput instead of throughput. To achieve the minimum output inverse-throughput ϑ_mo, the input data have to be ready with the expected inverse-throughput ϑ_ei. We define the inverse-throughput slack ϑ_s for each channel as:

\vartheta_s = \vartheta_{mo} - \vartheta_{ei}    (3.5)

Figure 3.7 shows a simple example in which two nodes A and B are connected together. Node A is a potential bottleneck if it does not provide data fast enough to satisfy node B's expectation (ϑ_mo > ϑ_ei ⟺ ϑ_s > 0). Node B is a possible bottleneck if node A provides data faster than node B consumes it (ϑ_mo < ϑ_ei ⟺ ϑ_s < 0).

Throughput analysis helps us find possible bottleneck nodes in a system as well as unnecessarily high-throughput nodes. Figure 3.8 shows an example with seven nodes with different ϑ_mo and ϑ_ei.
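Before working through the larger example of Figure 3.8, the two-node case of Figure 3.7 can be captured in a few lines. This is only a sketch: the numeric values for the two nodes are invented for illustration, and the names are ours.

    #include <cstdio>

    // Inverse-throughput slack of a channel (Equation 3.5): the producer's
    // minimum output inverse-throughput minus the consumer's expected
    // input inverse-throughput.
    double slack(double producerMinOut, double consumerExpectedIn) {
        return producerMinOut - consumerExpectedIn;
    }

    int main() {
        // Figure 3.7: node A feeds node B over one channel.  Assume A can
        // produce a datum every 4 cycles at best and B expects one every 2.
        double theta_mo_A = 4.0;
        double theta_ei_B = 2.0;
        double s = slack(theta_mo_A, theta_ei_B);
        if (s > 0)      std::printf("A is a potential bottleneck (slack = %.1f)\n", s);
        else if (s < 0) std::printf("B is a potential bottleneck (slack = %.1f)\n", s);
        else            std::printf("the channel is balanced\n");
        return 0;
    }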
For the example in Figure 3.8, we calculated the slack ϑ_s for each channel. As we can see, ϑ_s for the input channels going to f_3 is smaller than ϑ_s for the other input channels. Also, ϑ_s for the output channel from f_3 is bigger than ϑ_s for the other output channels. This shows that f_3 is a potential critical bottleneck. To find potential bottlenecks in an STG, we define the weight W_m for each node f_m as:

W_m = \frac{\sum_{j=1}^{N_{out}} \vartheta_{s_j} - \sum_{i=1}^{N_{in}} \vartheta_{s_i}}{N_{out} + N_{in}}    (3.6)

where ϑ_{s_i} are the input throughput slacks on the incoming channels and ϑ_{s_j} are the output throughput slacks on the outgoing channels of f_m. N_in denotes the number of inputs, and N_out the number of outputs, of node f_m. A higher weight means that the node is unable to provide/consume the expected outgoing/incoming data to/from its neighbours on most of its channels, and that the throughput differences between this node and its neighbours are critical, which makes that node a potential bottleneck. A set of weights is calculated for all nodes in the graph and is used in the heuristic to find the critical bottlenecks in the graph. This weight set is updated every time a bottleneck is improved by the Bottleneck Optimizer.

Figure 3.8: Throughput analysis example

Throughput Propagation

Although it may seem that the trade-off finder should simply select, for each node, an implementation that increases its throughput, increasing the throughput of a node will not necessarily increase the overall throughput. For example, as shown in Figure 3.9, optimizing block B does not increase the overall throughput of the STG (in either case), because the STG always has to wait for block A, which takes 9 clock cycles to generate its output. In this case, either we need to improve the throughput of node A, or we can sacrifice the throughput of block B and make it slower to release some area. This example shows that we cannot consider improving a node independently without looking at the whole STG, examining each node's predecessors and successors. In other words, we need a balanced throughput throughout the STG.

Figure 3.9: Throughput propagation and balancing

To balance the throughput for each node and for the entire STG, we have to propagate the target inverse-throughput to all nodes in the application. For propagating the input inverse-throughput to the output inverse-throughput of a single node, we used a strategy similar to Jason Cong's previous work [22]. For a node f_m, the numbers of input and output channels are denoted numIn(f_m) and numOut(f_m), the number of data tokens that f_m consumes/produces on input channel j / output channel k is denoted In_j(f_m) / Out_k(f_m), and the inverse-throughput on input channel j / output channel k is denoted ϑ^j_in(f_m) / ϑ^k_out(f_m), where 1 ≤ j ≤ numIn(f_m) and 1 ≤ k ≤ numOut(f_m). Given the input inverse-throughput targets ϑ^j_in(f_m), the output inverse-throughput target ϑ^k_out(f_m) is calculated as Equation 3.7.

\vartheta^k_{out}(f_m) = \frac{\min_j \left\{ \vartheta^j_{in}(f_m)\, In_j(f_m) \right\}}{Out_k(f_m)}    (3.7)

Equation 3.7 allows propagating throughput from the input nodes of the STG, through the other nodes, all the way to the output nodes. It lets the tool find the overall application throughput as well as possible bottlenecks, and it is what the heuristic uses to balance the throughput of all nodes in the STG; a small sketch of this step is shown below.
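The following sketch expresses the per-node propagation of Equation 3.7 and the weight of Equation 3.6 as straightforward functions. The container types and names are ours, and the code assumes that slacks have already been computed for every channel.

    #include <algorithm>
    #include <vector>

    // Equation 3.7: propagate input inverse-throughput targets to output
    // channel k of one node.  thetaIn[j] is the target on input channel j,
    // tokensIn[j] is In_j (tokens consumed per firing), and tokensOutK is
    // Out_k (tokens produced per firing on channel k).
    double propagateOut(const std::vector<double>& thetaIn,
                        const std::vector<double>& tokensIn,
                        double tokensOutK) {
        double firingPeriod = thetaIn[0] * tokensIn[0];
        for (std::size_t j = 1; j < thetaIn.size(); ++j)
            firingPeriod = std::min(firingPeriod, thetaIn[j] * tokensIn[j]);
        return firingPeriod / tokensOutK;
    }

    // Equation 3.6: bottleneck weight of a node from the slacks of its
    // incoming and outgoing channels.
    double weight(const std::vector<double>& slackIn,
                  const std::vector<double>& slackOut) {
        double sum = 0.0;
        for (double s : slackOut) sum += s;
        for (double s : slackIn)  sum -= s;
        return sum / static_cast<double>(slackIn.size() + slackOut.size());
    }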
Propagating targets in this way helps the tool eventually generate a balanced implementation for a throughput target or an area budget.

Bottleneck Optimizer

The Bottleneck Optimizer is very similar to the ILP approach in that it makes replicas of a bottleneck to increase throughput. However, the ILP replicates the bottleneck without any attention to its neighbouring nodes, so it can miss opportunities for lower area overhead. To overcome this deficiency, we propose a method that relies on the fact that each node can send/receive data to/from up to FanIn/FanOut nodes without any area overhead. If more than FanIn/FanOut replicas are required, some overhead cost is inevitable. For example, in our Ambric replica, a FanOut beyond 4 requires extra area. In this situation, to connect the replicas to more successor/predecessor nodes, new fork/join nodes are needed to send/receive data to/from each replica. Let us go through a simple example to show the overall idea of our Bottleneck Optimizer approach. Figure 3.10(a) shows an example in which two nodes, S with inverse-throughput ϑ_S and D with inverse-throughput ϑ_D, are connected together. Node S sends data to D over a channel. Assume node D is a bottleneck and we want to match its throughput to node S's throughput. To match the throughput, we need n_r replicas of node D, calculated as

n_r = \left\lceil \frac{\vartheta_D}{\vartheta_S} \right\rceil    (3.8)

In order to connect node S to n_r replicas of node D, we have to use several nodes in between to gather data and then send it to each replica in round-robin order. Assume each node has FanIn/FanOut equal to n_f, which means each node can send data to a maximum of n_f nodes. We define H, the number of layers of nodes needed to send data from one node to n_r nodes:

H = \left\lceil \log_{n_f}(n_r) \right\rceil    (3.9)

Figure 3.10: Node combining in the Bottleneck Optimizer

Assuming n_r = n_f^H (Figure 3.10(b)), the area overhead A_O for connecting node S to n_r replicas of node D is calculated as

A_O = \sum_{i=0}^{H-1} n_f^i    (3.10)

In our approach, we try to combine nodes together in order to save this area overhead. As shown in Figure 3.10(b), in each layer h there are n_f^{h-1} nodes with input inverse-throughput ϑ^h_in and output inverse-throughput ϑ^h_out, which are calculated as

\vartheta^h_{in} = \vartheta_S\, n_f^{\,h-1} = \frac{\vartheta_D}{n_f^{\,H+1-h}}    (3.11)

\vartheta^h_{out} = \vartheta^h_{in}\, n_f    (3.12)

So if we can find an implementation S' of node S with inverse-throughput equal to ϑ^h_in, we can combine node S' with n_f copies of node D without any area overhead (Figure 3.10(c)) and name the result node C, with input inverse-throughput ϑ_C:

\vartheta_C = \frac{\vartheta_D}{n_f}    (3.13)

To match the inverse-throughput of node C to ϑ_C, we have to make n_r' replicas of it, with area overhead A_O':

n_r' = \frac{n_r}{n_f}    (3.14)

A_O' = \sum_{i=0}^{H-2} n_f^i    (3.15)

Assuming that the inverse-throughput/area relation between node S and node S' is linear, we can save n_f^{H-1} nodes. For example, in the case n_f = 4, more than 75% of the overhead area is saved. As this example shows, it is possible to decrease the area overhead by combining nodes, an approach that is computationally infeasible to model in the ILP formulation.

Now that the key modules of our heuristic approach have been defined, we will next explain how the heuristic approach works.

Heuristic Approach Description

As mentioned before, the trade-off finding process has two modes.
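Before walking through the two modes, the fan-out arithmetic just derived (Equations 3.8–3.10 and 3.13–3.15) can be condensed into a short helper. This is a sketch with names of our own choosing; n_f is the FanIn/FanOut limit of a PE.

    #include <cmath>

    // Replication plan for one bottleneck node D fed by node S
    // (Figure 3.10).  Overheads count fork/join helper nodes.
    struct ReplicationPlan {
        int replicas;          // n_r   (Equation 3.8)
        int layers;            // H     (Equation 3.9)
        int forkOverhead;      // A_O   (Equation 3.10), without combining
        int combinedOverhead;  // A'_O  (Equation 3.15), after combining with S'
    };

    ReplicationPlan planReplication(double thetaD, double thetaS, int nf) {
        ReplicationPlan p;
        p.replicas = static_cast<int>(std::ceil(thetaD / thetaS));
        p.layers = (p.replicas > 1)
                       ? static_cast<int>(std::ceil(std::log(p.replicas) / std::log(nf)))
                       : 0;
        p.forkOverhead = 0;
        for (int i = 0; i <= p.layers - 1; ++i)
            p.forkOverhead += static_cast<int>(std::pow(nf, i));
        p.combinedOverhead = 0;
        for (int i = 0; i <= p.layers - 2; ++i)
            p.combinedOverhead += static_cast<int>(std::pow(nf, i));
        return p;
    }

    // Example: thetaD = 16, thetaS = 1, nf = 4 gives 16 replicas in H = 2
    // layers; the fork/join overhead drops from 1 + 4 = 5 nodes to 1 node
    // once the first layer is combined into S'.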
The user defines either an area target or a throughput target. Algorithm 1 shows pseudocode of the heuristic approach for a defined area target (A_tgt).

Algorithm 1 Heuristic algorithm
 1: Application ← Compiler(source_files)
 2: A ← GlobalPartitioner(Application)                ▷ A = {f_1, f_2, ..., f_N}
 3:                                                   ▷ either aStruct or MinCut is used to generate the STG
 4: for each composite node f_m do
 5:     MDP_m ← ParallelismFinder(f_m)                ▷ maximum degree of parallelism
 6:     P_m ← IntraNodeOptimizer(f_m, MDP_m)          ▷ different implementations
 7:     P̃_m ← InterNodeOptimizer(f_m)
 8: end for
 9: A_H ← SelectImplementationsWithHighestThroughput(P̃)
10:                           ▷ A_H = {P^H_1, P^H_2, ..., P^H_N} where P^H_m = {P^s_m | argmin_s(ϑ^s_m A^s_m)}
11: while TRUE do
12:     for each composite node f_m do
13:         if f_m and its successors and predecessors are visited then
14:             (W_m) ← ThroughputAnalysis()          ▷ Equation 3.5 and Equation 3.6
15:         end if
16:     end for
17:     set of critical bottlenecks B ← ThroughputPropagation()    ▷ Equation 3.7
18:     while !(α_l A_tgt < A < α_u A_tgt) & applicationIsBalanced do    ▷ budgeting
19:         budget the most critical bottleneck in the set
20:         propagate the throughput
21:         calculate the estimated application area cost
22:     end while
23:     while !(all bottlenecks are visited) do
24:         BottleneckOptimizer()
25:         B ← ThroughputPropagation()
26:     end while
27:     A_A ← AreaCost()
28:     if δ < A_A / A_tgt < β then
29:         break
30:     end if
31: end while

The Trade-off Finder heuristic starts by selecting, for each node, the implementation with the highest throughput (lowest inverse-throughput) achieved per area unit. It analyzes the full application and calculates the expected input inverse-throughput and minimum output inverse-throughput for each channel using Throughput Propagation. Then, it calculates slacks for all channels and weights for all nodes. Trade-off Finder identifies the most critical bottlenecks as the nodes with the largest weights. Next, Trade-off Finder calculates the application area and the available area for this implementation. Considering the defined area target, Trade-off Finder budgets the most critical bottleneck and propagates the throughput to the other nodes. It continues budgeting the other nodes considering the propagated throughput. After budgeting all nodes, it calculates an approximate area cost for the application considering the new throughput of each node. Trade-off Finder accepts an area cost bigger than the target area on the chip, within a margin. In other words, it overshoots and hopes to release area later in the process from fast nodes. If the approximate area cost is above the margin, Trade-off Finder decreases the target throughput budget and repeats the same procedure.

After finding a budgeting which satisfies the targeted area, Trade-off Finder starts from the most critical bottleneck on the critical path (i.e., the path with the slowest throughput) and uses the Bottleneck Optimizer to make replicas of that bottleneck to get better throughput. Trade-off Finder starts from the optimized bottleneck and moves toward the output until it reaches a node which is located on another critical path. After reaching this node, Trade-off Finder goes backward to visit the other bottleneck and uses the Bottleneck Optimizer to match its throughput to the throughput expectations of the other nodes. Note that the main idea is to prevent optimizing a bottleneck without addressing the expected inverse-throughput of its predecessors. The process continues until it balances all the other nodes.
Trade-off Finder visits the other nodes in breadth-first search order and makes sure that each node does not affect nodes in other critical paths.

The Trade-off Finder uses the same approach described above for a defined throughput target. The difference is in the budgeting process. The Trade-off Finder calculates an estimated application area cost based on the throughput target and budgets all the nodes based on it. Next, it goes through the Bottleneck Optimizer stage as before and makes sure all the bottlenecks are optimized in order to satisfy the throughput target. After finding an implementation which satisfies the throughput target, it budgets all the nodes again with a smaller area budget (considering the throughput achieved per area unit) and runs the process again. It then reduces or increases the overall area budget based on the success or failure of the last attempt in order to find alternative solutions. It continues this process until it finds a local optimum of the area cost function.

3.4 Experimental Results

Our experiments are carried out in two steps. We first test our tool using the StreamIt benchmarks [81]. Then, we examine an implementation of a Joint Photographic Experts Group (JPEG) encoder produced using our tool.

3.4.1 StreamIt

In order to test our tool, 7 of the 9 benchmarks in the StreamIt benchmark set [33] are implemented as STGs using our Java-based programming model. For these test benchmarks, each benchmark is a single STG node. An architectural simulator has been implemented to validate the results generated by our tool. Our tool was able to find a number of different implementations for each benchmark with different area costs and throughputs. Table 3.2 shows the number of different implementations found by our tool for the benchmarks. The functionality of all implementations has been verified with the simulator as well. Due to limited time, we did not implement two of the benchmarks in the StreamIt benchmark set; the seven benchmarks used here were enough to evaluate the tool's ability to automatically find different implementations. Since the StreamIt benchmarks are small, we only use them here for verifying the front-end and finding different implementations.

Table 3.2: Number of different implementations found by the tool for the StreamIt benchmarks

    Benchmark                  Number of different implementations
    FFT                        34
    FIR                        42
    Radar Application          73
    Filter Bank                52
    FM Software Application    39
    Vocoder                    92
    gsm                        82
In the next section, we will use JPEG as our benchmark in order to examine the different implementations provided by our Trade-off Finder.

3.4.2 JPEG

Figure 3.11 shows the block diagram of the JPEG compression algorithm [88]. The JPEG compression algorithm contains four major producer/consumer relationships, shown as four blocks in the figure. The benchmark is written in this fashion, as four connected objects, with multiple loops and conditionals within each object. Despite this coarse level of implementation, our tool can break it down to a fully unrolled graph with one operation per node if necessary.

Figure 3.11: Block diagram of the JPEG compression algorithm (Color Conversion → DCT → Quantization → Encoding)

The tool uses the Intra-Node Optimizer and Inter-Node Optimizer modules on the STG, finding different implementations for each block. In particular, our tool found 11 different implementations for the "Color Conversion" and "Quantization" modules, 17 different implementations for "DCT", and only one implementation for "Encoding". Table 3.3 shows a selection of these implementations.

Table 3.3: Implementation library for JPEG encoder

    Module            Versions              Inverse Throughput        Area¹
    Color Conversion  v1 / v2 / v3 / v4     1 / 2 / 4 / 8             512 / 256 / 128 / 64
    DCT               v1 / v2 / v3 / v4² / v5   1 / 2 / 4 / 6 / 32    800 / 400 / 224 / 160 / 50
    Quantization      v1 / v2 / v3 / v4 / v5    1 / 2 / 4 / 8 / 128   512 / 256 / 128 / 64 / 4
    Encoding          v1                    512                       22

    ¹ Area units are the number of simple operations.
    ² The tool could not find an implementation with inverse throughput of 8, but it found a faster implementation instead.

Both the ILP and heuristic approaches have been used by our tool in order to find a trade-off between area and throughput for different inverse-throughput targets. Table 3.4 gives the results generated by these two approaches for the given inverse-throughput targets. We list the selected implementation and number of replicas for each module. The heuristic approach finds a better area/throughput trade-off than the ILP approach. For example, for an inverse-throughput target of 2, the heuristic approach used 37% less area than the ILP. The ILP solver we use is GLPK [60], and the area cost unit is the number of primitive nodes, i.e., a simple operation such as add or subtract. A primitive node can be implemented with a very simple PE. This approach can also be used to map code to large processor arrays such as GRVI Phalanx [34]. Note that the complexity of the PEs can be reduced while moving towards high-throughput implementations. For example, in an implementation in which all the loops are unrolled and parallelized, a primitive node can be a simple ALU (or a simple operator). This motivated us to explore space/time scaling for fine-grained architectures, described in chapter 4.

Table 3.4: Heuristic vs. ILP for the many-core system

    Inverse     Method     Color Conv.   DCT          Quantization  Encoding    Fork/join  Total
    Throughput             (impl×rep¹)   (impl×rep)   (impl×rep)    (impl×rep)  Overhead   Area
    1           ILP        v1×1          v1×1         v1×1          v1×512      10880      23968
    1           Heuristic  v1×1          v5×32        v5×128        v1×512      640        13888
    2           ILP        v2×1          v2×1         v2×1          v1×256      5376       11920
    2           Heuristic  v2×1          v5×16        v5×64         v1×256      256        7456
    4           ILP        v3×1          v3×1         v3×1          v1×128      2688       5984
    4           Heuristic  v3×1          v5×8         v5×32         v1×128      128        3600
    8           ILP        v4×1          v4×1         v4×1          v1×64       1280       2976
    8           Heuristic  v4×1          v5×4         v5×16         v1×64       0          1736

    ¹ rep indicates how many replicas are used; overhead and area refer to the total number of simple operations.

3.5 Summary

This chapter investigates two ways of automatically finding the area/throughput trade-off of streaming applications being mapped onto MPPA overlays.
We introducea new tool that compiles a streaming application written in Java, partitions it intocomposite nodes, finds all degrees of parallelism for each, finds different imple-mentations for each node, and finally selects a good trade off between area andthroughput. For optimization, we used both a classical ILP formulation as wellas a novel heuristic. The heuristic combines module selection and replicationmethods with node combining and splitting in order to more quickly find a bet-ter area/throughput trade-off than what can be readily modelled in the ILP formu-lation; the ILP formulation can quickly become computationally infeasible. Thisapproach has been verified with small designs in StreamIt and one larger design, aJPEG encoder. This approach can also be used to map code to large processor ar-rays. We stopped evaluating this approach after JPEG since our results showed theapproach is reliable and it also can be more beneficial targeting fine-grained archi-tectures. This leads us to investigate a better approach with OpenVX and FPGAs,described in chapter 4.48Chapter 4FPGA Space/Time ScalingIn the previous chapter, we discussed how to automatically trade-off space vs timeto implement an STG by targeting a given area budget or a given throughput budget.That architecture consisted of coarse-grained PEs.In this chapter we consider a new approach that directly targets fine-grainedparallelism available in FPGAs. We build a practical tool for automatically explor-ing space/time tradeoffs that is used together with a commercial HLS tool, XilinxVivado HLS [43, 96], for implementing Computer Vision (CV) applications.4.1 IntroductionWith the rise of FPGA-based CV applications, there is increasing need for a pro-gramming method that achieves the target throughput or area budget. HLS toolsprovide the opportunity to simplify design entry and debug. Unfortunately, a mod-ern tool like Vivado HLS cannot automatically produce a range of implementationsacross the space/time spectrum from a single source file. However, back in 1994,in an attempt to address automatic space/time scaling, Synopsys Behavioral Com-piler was introduced [46] with the promise that user can start with a high-leveldescription containing few implementation details and the tool would handle re-source allocation and scheduling automatically. Synopsys was not able to fullydeliver and the project was shut down after 10 years. The idea of having a toolwhich automatically explores space/time tradeoffs motivated us to add this capa-49bility to the Vivado HLS tool.In this chapter, we provide a framework for doing this with compute graphsspecified in OpenVX, a C-based programming environment for computer vision.To do this, we build our own OpenVX system on top of Xilinx Vivado HLS [96],and add an algorithmic layer which allows the user to specify an area budget (whilemaximizing throughput) or a throughput target (while minimizing area).Our OpenVX system consists of a series of compute kernels, prewritten inC++ for Vivado HLS and heavily parameterized. However, the key contribution inthis chapter is an algorithm to automatically select from among these prewrittenkernels. This is based upon the Intra-node and Inter-node Optimizers presented inchapter 3. With OpenVX, the implementations can also break an image down intomultiple tiles, where the tile size is selected to maximize on-chip data reuse, thusimproving delay and power. 
The runtime system must determine an appropriatetile size.We evaluate the system on typical OpenVX benchmarks under a variety offixed area constraints, and find that our system is able to automatically achievebetween 92% and 100% of the target area utilization. We also evaluate the sys-tem with same benchmarks under variety of fixed throughput targets, and find oursystem saves up to 30% in area cost compared to manually parallelized implemen-tations. Our heuristic approach is able to hit the same throughput targets and save19% area on average compared to existing ILP approaches. In terms of efficientuse of parallel resources on chip, the tool manages to satisfy different throughputtargets, getting up to 5.5 GigaPixel/sec for the Sobel edge detection application ona small FPGA with only 53,200 LUTs and 220 DSP slices.The most similar existing approach to ours is Xilinx’s reVISION [95] whichprovides a set of hardware-accelerated OpenCV kernels. While existing methodscan easily achieve a single design point, they are unable to automatically generatea set of solutions from the same source; a prominent capability embedded in ourtool.504.2 ApproachExisting methods of programming FPGAs with HLS require the user to explicitlymanage resources at every stage in their algorithm in order to meet a specified areatarget or throughput target by manual loop unrolling or adding different “pragmas”to the code. Below, we describe a novel approach to explore the space/time trade-offs for OpenVX [45] compute graphs in order to find optimum solutions, meetingdifferent area budgets or throughput targets.We analyzed OpenVX kernels (nodes) to increase parallelism based on pipelineopportunities and different loop transformation strategies [55, 73]. After analyzingeach kernel, we rewrote them in C++ while heavily parameterizing for HLS. Userscan describe a CV computation as an OpenVX compute graph (STG) and then de-fine either a throughput target, or an area budget. Our OpenVX system analyzesthe compute graph and generates different implementations for each node with dif-ferent area, IO and throughput characteristics by creating different HLS projectsand passing them to Xilinx Vivado HLS. In order to get precise throughput/areainformation for different FPGA targets, and avoid implementations with deadlock,our tool automatically generates Vivado projects including a System-Verilog test-bench for each implementation. In addition, our tool uses node replication andnode combining to cover more possible solutions by either increasing the imagetile size, or improving throughput for existing implementations.Moreover, our tool uses the two internal optimization approaches discussed inchapter 3. The first, based upon ILP, is similar to previous work on task graphoptimizations by Cong et al [22]. The second is a heuristic approach that we tunedand improved for implementing CV applications targeting FPGAs. Similar to chap-ter 3, we observed that although the ILP approach works well, maintaining the ILPoptimization model within the tool precludes the use of certain optimizations. In-stead, the heuristic approach is able to perform object coalescing which can becomputationally infeasible in the ILP formulation. This leads to area saving andless runtime compared to the ILP approach.There have been several recent studies on implementing image processing andOpenVX applications on FPGAs and exploring the area/throughput trade-off suchas [78], [37] and [38]. 
These prior approaches either use a specific programming51model, which requires the user to learn a new programming language, or theyimplement a soft multi-core platform on the FPGA and then run the applicationon it. An OpenVX acceleration framework introduced by Taheri et al [79] cangenerate different hardware implementations by manually specifying the amountof pixel parallelism (different tile-size). Our approach is more general: to satisfya given area target or throughput constraint, not only does it automatically exploredifferent levels of pixel parallelism, but it can also automatically generate a varietyof different implementations with different throughput or area cost at each level inorder to meet the given constraint.Next, we describe our proposed OpenVX based HLS system in detail.4.3 Tool Flow for OpenVX-based HLSFigure 4.1 illustrates the detailed flow of the proposed tool. It receives an OpenVXcompute graph as an STG and analyzes it to obtain a matched kernel for each node.Each kernel is a heavily parameterized C++ function with Xilinx AXI-Stream [94]in/out arguments. Taking into account the user area budget or throughput target,our tool creates different Vivado HLS projects in order to generate different im-plementations for each kernel. To minimize the search space, our tool prunes thedominated points in the design space using an algorithm by Zhong et.al. [101].Previous studies [57] have shown that the report generated by the HLS tool isnot accurate and should therefore not be used for exploring the problem space. Toobtain a precise throughput and area cost, our tool also generates Vivado projectsusing the HDL output generated from the HLS tool. Using Vivado design suite, thetool is able to find the throughput/area correlation for each kernel. This data willbe eventually used later in the trade-off finding process. In addition, our tool usesa step called Intra-node Optimizer to generate more implementations to increasethe trade-off solution space by filling gaps in the throughput/area solution space.Finally, the tool uses Trade-off Finder to find a good compromise between area andthroughput. Below, we discuss all of these steps in detail.52Kernel AnalyzerIntra-Node Optimizer1324DB of Implementations for each nodeTrade-off FinderSOC3214Compute GraphC++-based CV KernelsC++C++C++C++Different Implementations GeneratorTrade-off123 4Design Evaluator + Throughput CalculatorArea/Throughput correlation for each nodeDB of Implementations for each node11223344𝝁𝒑FPGA FabricFigure 4.1: Tool flow4.3.1 OpenVX Programming ModelOpenVX is a cross-platform, C-based API standard for Computer Vision. Most CVapplications can be described as a set of vision kernels (nodes) which communi-cate through input/output data dependencies. OpenVX describes this set of visionkernels in a graph-oriented execution model based on DAGs. Figure 4.2 shows anOpenVX code example for the Sobel application and Figure 4.3 shows the corre-53/ / vxSobel3x3 examplevx node nodes [ ] = {vxColorConvertNode ( graph , rgb , gray ) ,vxGaussian3x3Node ( graph , gray , gauss ) ,vxSobel3x3Node ( graph , gauss , gradx , grady ) ,vxMagnitudeNode ( graph , gradx , grady ,mag) ,vxPhaseNode ( graph , gradx , grady , phase )} ;Figure 4.2: OpenVXsource codeColor ConvertGaussian 3x3graySobel3x3gaussgradxgradyMagnitudePhasergbmagphaseFigure 4.3: Sobel graphsponding STG for it.Since OpenVX compute graphs are DAGs, it makes them suitable candidates tobe implemented as pipelined hardware accelerators on FPGAs. 
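To give a concrete flavour of the heavily parameterized C++ kernels mentioned above, the skeleton below shows the general shape of a Pixel2Pixel kernel written for Vivado HLS. It is only a sketch: the function name, the template parameters and the trivial per-pixel body are ours, not the actual library code.

    #include "hls_stream.h"
    #include "ap_int.h"

    // Sketch of a parameterized Pixel2Pixel kernel: the number of pixels per
    // stream transfer (NP) and the pixel width are template parameters, so
    // the same source yields implementations at different area/throughput
    // points once resynthesized with different values.
    template <int NP, int PIXEL_BITS>
    void colour_convert_kernel(hls::stream<ap_uint<NP * PIXEL_BITS> > &in,
                               hls::stream<ap_uint<NP * PIXEL_BITS> > &out,
                               int tile_pixels) {
    #pragma HLS INTERFACE axis port=in
    #pragma HLS INTERFACE axis port=out
        for (int i = 0; i < tile_pixels / NP; ++i) {
    #pragma HLS PIPELINE II=1
            ap_uint<NP * PIXEL_BITS> word   = in.read();
            ap_uint<NP * PIXEL_BITS> result = 0;
            for (int p = 0; p < NP; ++p) {
    #pragma HLS UNROLL
                ap_uint<PIXEL_BITS> px =
                    word.range((p + 1) * PIXEL_BITS - 1, p * PIXEL_BITS);
                // ... the per-pixel colour conversion arithmetic would go here ...
                result.range((p + 1) * PIXEL_BITS - 1, p * PIXEL_BITS) = px;
            }
            out.write(result);
        }
    }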
Below we discuss how our tool generates a variety of different implementations for those hardware accelerators, and then describe our approach for finding different implementations for each CV kernel.

4.3.2 Finding Different Implementations

Consider an application described as an STG with N nodes f_1, f_2, ..., f_N. For each node f_m our tool tries to find different implementations P_m^1, P_m^2, ..., P_m^{S_m}, where each implementation P_m^s can perform the functionality of f_m with area cost A(P_m^s), number of pixels it can consume/produce NP(P_m^s), and initiation interval II(P_m^s). For implementation P_m^s, the area cost on FPGAs is calculated as:

    A(P_m^s) = w_lut · LUT(P_m^s) + w_dsp · DSP(P_m^s) + w_bram · BRAM(P_m^s)    (4.1)

where LUT(P_m^s), DSP(P_m^s) and BRAM(P_m^s) are the LUT, DSP and BRAM counts for implementation P_m^s. Note that the LUT weight (w_lut), DSP weight (w_dsp) and BRAM weight (w_bram) differ across FPGA architectures and are chosen to reflect the relative size of each resource in silicon. We have set these weights to slightly different values for the Xilinx, Altera and VPR architectures.

For node f_m and its implementation P_m^s, the input "inverse throughput" ϑ_in(P_m^s) and output inverse throughput ϑ_out(P_m^s) for each input/output are calculated as:

    ϑ_in(P_m^s) = II(P_m^s) / In(f_m),    ϑ_out(P_m^s) = II(P_m^s) / Out(f_m)    (4.2)

where In(f_m) and Out(f_m) are the number of data tokens that f_m consumes on the input data channel and produces on the output data channel during each firing, respectively. Note that inverse throughput is the number of cycles needed to consume/produce one datum on the corresponding input/output channel. For most CV kernels, the input and output channels have matched inverse throughput, ϑ_IO(P_m^s). Kernel throughput is the number of pixels consumed/produced in each clock cycle:

    Θ(P_m^s) = NP(P_m^s) / ϑ_IO(P_m^s)    (4.3)

The Different Implementation Generator (DIG) module in our tool automatically finds the above-mentioned implementations of each node using the heavily parameterized C++-based kernels. The DIG needs to automatically find a wide range of different implementations to cover the solution space as much as possible.

To better understand the complexity of this problem and the variety of possible solutions, consider the simple example of a 3-node graph shown in Figure 4.4a. Figures 4.4.b and 4.4.c show two different approaches to satisfy a target throughput of 5: one reads 20 pixels and picks implementations with II = 4 for each node; the other reads 5 pixels and picks implementations with II = 1. Further, Figure 4.4.c shows another approach for node f_2 in which, instead of picking an implementation with NP = 5, it picks two implementations with NP = 3 and NP = 2 and splits the data between them. Figures 4.4.b and 4.4.c are just two examples of the many combinations of II, NP and splitting/joining nodes that can be used to find a solution. In addition, the strategy for reading image data from main memory can vary for different implementations when NP and II change. This strategy affects the DMA configuration and the data alignment network design, which leads to a different overhead cost for each.

[Figure 4.4: Two different approaches for satisfying Θ = 5.]

The above-mentioned example shows that for every area budget or throughput target, there is a variety of acceptable solutions. This makes the trade-off finding problem a rich design space.
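As a concrete illustration of this cost model, the sketch below evaluates Equations 4.1–4.3 for one candidate implementation. The record layout and the resource weights are assumptions chosen only for illustration; the actual per-architecture weights used by the tool are not reproduced here.

    // Illustrative model of one candidate implementation P_m^s (Eqs. 4.1-4.3).
    #include <iostream>

    struct Impl {
      int lut, dsp, bram;        // resource usage after place-and-route
      int np;                    // NP: pixels consumed/produced per firing
      int ii;                    // II: initiation interval in clock cycles
      int in_tokens, out_tokens; // In(f_m), Out(f_m): tokens per firing
                                 // (1 if a token is one NP-pixel tile)
    };

    // Eq. 4.1: weighted area cost; weights are placeholders, chosen per
    // FPGA family (Xilinx, Altera, VPR) to reflect relative silicon cost.
    double areaCost(const Impl &p, double w_lut = 1.0,
                    double w_dsp = 60.0, double w_bram = 80.0) {
      return w_lut * p.lut + w_dsp * p.dsp + w_bram * p.bram;
    }

    // Eq. 4.2: cycles per datum on the input/output channels.
    double invThroughputIn (const Impl &p) { return double(p.ii) / p.in_tokens;  }
    double invThroughputOut(const Impl &p) { return double(p.ii) / p.out_tokens; }

    // Eq. 4.3: pixels produced per clock cycle (matched IO assumed).
    double throughput(const Impl &p) { return p.np / invThroughputOut(p); }

    int main() {
      Impl p = {1200, 4, 2, /*np=*/4, /*ii=*/1, /*in=*/1, /*out=*/1};
      std::cout << "area=" << areaCost(p)
                << " throughput=" << throughput(p) << " px/clk\n";
    }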
The example also shows the importance of generating a variety of implementations with different NP, II and area costs to cover the solution space as much as possible.

4.3.3 CV Accelerator on FPGA

Before describing the tool flow in more detail, it is helpful to go through the overall system description first. As mentioned earlier, the main goal of this study is to automatically find a good area/throughput trade-off for CV applications by generating different implementations, which is done by changing the image tile width and/or using different function implementations inside the kernel.

Figure 4.5 gives a high-level system view of a CV accelerator implemented on Xilinx FPGAs. The host processor (i.e., an ARM-based processor [88]) is responsible for configuring the DMA (e.g., the Xilinx AXI DMA IP core [98]) to read/write image data as vertical strips from main memory. On the other end, the DMA sends/receives image data to/from the accelerator using the AXI-Stream protocol. Since the DMA data width must be a power of two and the CV accelerator may read data at a different width in pixels, a Data Alignment Network, implemented as mixed-width FIFOs, is needed in between to align the data sent back and forth between the DMA and the accelerator.

[Figure 4.5: System view implemented on a Xilinx FPGA (the host processor and the AXI DMA share an AXI interconnect to main memory; the DMA streams image tiles over AXI-Stream through the Alignment Network to the CV Accelerator).]

Figure 4.6 provides an internal view of a general node in a CV hardware accelerator. Representing the image tile width as W_T, a general node m with implementation P_m^s consumes W_T pixels (NP_in(P_m^s) = W_T) as stream-in and generates W_T pixels as stream-out with inverse throughput equal to ϑ_IO. Since the hardware function inside the node might consume/produce a different number of pixels, two Stream Data Adjusters (SDAs) are added, one at either end. Assuming the hardware function can consume/produce W_F pixels on its input/output channels, the input SDA takes W_T pixels from the input and passes W_F pixels to the function with inverse throughput equal to (W_F/W_T)·ϑ_IO. On the other end, the output SDA takes W_F pixels with inverse throughput ϑ_fm and provides W_T pixels with inverse throughput equal to (W_T/W_F)·ϑ_fm. As Figure 4.6 shows, four layers of FIFO are added in between in order to match the different throughputs of the various stages. To prevent data loss, the FIFO depths must be carefully selected.

[Figure 4.6: Internal view of a general node in a CV hardware accelerator (input SDA, hardware function f_m and output SDA, connected by FIFOs that convert between W_T pixels at ϑ_IO and W_F pixels at ϑ_fm).]

Stream Data Adjuster in more detail

The SDA handles two different types of kernels: Pixel2Pixel kernels and Window2Pixel kernels.

Pixel2Pixel kernels, such as vxConvertColor, produce one pixel for each pixel received:

    P̃_{i,j} = f(P_{i,j})    (4.4)

Figure 4.7 shows how the SDA acts as an adjuster for a simple Pixel2Pixel kernel which receives/produces 4 pixels every 2 clock cycles while its hardware function consumes/produces 2 pixels every clock cycle. In order to match the stream rate between the IO and the function, the SDA simply performs an upstream-to-downstream transformation by splitting the data at its input and sending it to the function; on the other end it joins the data coming from the function and sends it to the output. In this example W_T/ϑ_IO is equal to W_F/ϑ_fm, so it only needs to capture W_T pixels in its 2×2 FIFO every two clock cycles.

Window2Pixel kernels, such as vxSobel3x3, consume a window of pixels for every produced output pixel, as follows:
    P̃_{i,j} = f( P_{i−δ,j−δ}  ⋯  P_{i−δ,j}  ⋯  P_{i−δ,j+δ}
                  ⋮               ⋮              ⋮
                  P_{i,j−δ}    ⋯  P_{i,j}    ⋯  P_{i,j+δ}
                  ⋮               ⋮              ⋮
                  P_{i+δ,j−δ}  ⋯  P_{i+δ,j}  ⋯  P_{i+δ,j+δ} ),    δ = (w−1)/2    (4.5)

A kernel with an image tile width of W_T and a filter window size of w×w consumes W_T + w − 1 pixels and produces W_T pixels in every firing. For each firing, it slides down the image by one row, thus producing a vertical stripe of output, W_T pixels wide.

Figure 4.8 shows how the SDA handles the stream adjustment for a Window2Pixel kernel with W_T equal to 4, W_F equal to 2 and a filter window size of 3×3. This kernel receives 6 pixels and produces 4 pixels every 2 clock cycles. The hardware function produces 2 pixels every clock cycle, which means it needs to get 4 pixels every clock cycle. In this case, the SDA splits the input stream data while maintaining some data overlap. This data overlap has two consequences: an overhead of 2 duplicated pixels for every 6 pixels consumed by the kernel in each firing, and the need for a line buffer with a minimum depth of 5. Figure 4.9 illustrates a general Window2Pixel kernel with an image tile width of W_T and a filter window size of 3×3, with W_F equal to W_T/N. Since the function needs to receive W_T/N + 2 pixels per firing, the overhead is 2N. Also, to produce the first output, the function needs a line buffer with a minimum depth of 2N+1.

[Figure 4.7: Pixel2Pixel kernel example, W_T = 4, W_F = 2.]
[Figure 4.8: Window2Pixel kernel example, W_T = 4, W_F = 2.]
[Figure 4.9: Window2Pixel kernel.]

4.3.4 Heavily Parameterized C++-based OpenVX Kernels

A set of heavily parameterized C++-based OpenVX kernels with AXI-Stream input/output has been implemented to generate different implementations for each kernel. Each kernel is parameterized at two levels, the IO level and the core level. The number of pixels that a kernel consumes/produces at its IO, as well as the number of pixels that must be provided to/gathered from its core, are parameterized at the IO level. For each kernel, the main core function has been manually analyzed to find all degrees of parallelism and then heavily parameterized. This is done by labeling all loops and generating a set of suitable pragmas, saved as a JSON file for each kernel. Given this parameterization, the tool can generate implementations with different numbers of inputs (NP(P_m^s)), area costs (A(P_m^s)) and initiation intervals (II(P_m^s)). Figure 4.10 shows the area, initiation interval and tile-width (number of input pixels) correlation of different implementations for the Gaussian3x3 kernel; each dot represents an implementation.

4.3.5 Intra-node Optimizer

In addition, an Intra-node Optimizer step in the tool generates a wider range of implementations. The Intra-node Optimizer replicates and combines existing implementations in order to fill gaps in the solution space. Node replication can be used either to increase the throughput or to widen the tile width. Figure 4.11.a shows a general Pixel2Pixel kernel with inverse throughput ϑ_IO and tile width W_T. To improve the throughput (reduce ϑ_IO), the tool replicates the node and sends data to each replica in round-robin order.
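The replication step just described can be modeled purely at the level of implementation records. The sketch below shows roughly how such derived design points could be computed; the struct layout and the splitter-overhead constant are assumptions for illustration, and Window2Pixel replication would additionally account for the overlapping boundary pixels discussed above.

    // Illustrative model of Pixel2Pixel node replication.
    struct ImplPoint {
      double area;        // weighted area cost A(P)
      double throughput;  // pixels per clock cycle
      int    tile_width;  // NP: pixels per firing
    };

    // Replicate to boost throughput (cf. Figure 4.11.b): same tile width,
    // roughly N times the throughput and N times the area, plus a
    // placeholder cost for the round-robin splitter/joiner.
    ImplPoint replicateForThroughput(const ImplPoint &p, int n,
                                     double splitter_cost_per_replica = 50.0) {
      ImplPoint r;
      r.area       = n * p.area + n * splitter_cost_per_replica;
      r.throughput = n * p.throughput;
      r.tile_width = p.tile_width;
      return r;
    }

    // Replicate to widen the tile (cf. Figure 4.11.c): N times the tile
    // width and area; each replica keeps its original firing rate.
    ImplPoint replicateForTileWidth(const ImplPoint &p, int n) {
      ImplPoint r;
      r.area       = n * p.area;
      r.throughput = n * p.throughput;   // N*WT pixels at the same rate
      r.tile_width = n * p.tile_width;
      return r;
    }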
Figure 4.11.b shows how the tool improves the throughput N-fold by making N replicas of the original kernel. Figure 4.11.c shows how the tool increases the tile width by replicating the kernel. Because of the data dependencies in Window2Pixel kernels, replication can only be used to increase the tile width for them. Our tool replicates Window2Pixel kernels taking the window size into account and handles the data passing. Figure 4.12 demonstrates how the tool passes data to each replica when windows must overlap, e.g. in 2D convolutions.

[Figure 4.10: Area, throughput and tile-width correlation for the Gaussian3x3 kernel.]
[Figure 4.11: Pixel2Pixel replication.]

4.3.6 Inter-node Optimizer

Moreover, we have also added an Inter-node Optimizer capability, which combines existing implementations and then replicates the combined node on the fly. Figure 4.13 shows a simple node-combining example. Assume node f_n's throughput is N times larger than node f_m's throughput. Two different approaches to match the throughputs are shown in Figure 4.13. The first approach replicates node f_m N times and uses a 1→N data splitter so that node f_n can send data to those replicas in round-robin order (Figure 4.13.a). In the second approach, shown in Figure 4.13.b, another implementation of node f_n is found (f'_n) with a throughput equal to twice node f_m's throughput. The nodes f'_n and f_m are then combined, and the combined node is replicated N/2 times. Note that the second approach needs only a 1→N/2 data splitter, which is much smaller than the 1→N data splitter of the first approach.

All of the above-mentioned techniques are used in the Intra-node and Inter-node Optimizer steps to find a wide range of implementations for each kernel (either in the pre-synthesis step or during the trade-off finding process), which widens the solution space for the area/throughput scaling problem. Below we discuss our trade-off formulation and solutions.

[Figure 4.12: Window2Pixel replication.]
[Figure 4.13: Node combining.]

4.3.7 Trade-off Finding Formulation and Solutions

To solve the described trade-off finding problem we used an ILP approach as well as a heuristic approach. For each of these approaches, there are two different modes or objectives:

• Given an area target A_tgt and different implementations for each node f_j, which implementation P_j^i should be selected, and how many replicas nr_j^i are needed, in order to maximize the application throughput Θ_A, subject to the constraint that the application area cost A_A is not larger than A_tgt.

• Given a throughput target Θ_tgt and different implementations for each node f_j, which implementation P_j^i should be selected, and how many replicas nr_j^i are needed, in order to minimize the area cost A_A, subject to the constraint that the application throughput Θ_A is larger than Θ_tgt.

Integer Linear Programming Algorithm

This problem can be defined as an ILP model similar to the previous chapter's formulation and solved with GLPK; it goes through all the possibilities in the solution space and finds the optimum solution, subject to the constraints. Similar to the previous chapter, we can define an ILP problem formulation for both modes.
The goal is to find binary integers x_{j,1}, x_{j,2}, ..., x_{j,S_m}, indicating which implementation is selected for each node, and integers nr_j^i, indicating the number of replicas needed. Equation 4.6 gives the formulation for finding an implementation with maximum throughput for a given area budget:

    Maximize Θ_A subject to:
    Σ_{j=1}^{N} Σ_{i=1}^{S_m} nr_j^i · A(P_j^i) · x_{i,j} < A_tgt    and    ∀ j ∈ {1,...,N}: Σ_{i=1}^{S_m} x_{i,j} = 1    (4.6)

Equation 4.7 gives the formulation for finding an implementation with minimum area cost for a given throughput target:

    Minimize A_A subject to:
    ∀ j ∈ {1,...,N}: Σ_{i=1}^{S_m} (NP(P_j^i) / ϑ(P_j^i)) · nr_j^i > Θ_tgt    and    Σ_{i=1}^{S_m} x_{i,j} = 1    (4.7)

Heuristic Algorithm

To overcome the shortcomings of the ILP approach mentioned in subsection 3.3.1, we also used a heuristic approach similar to the one discussed in subsection 3.3.2. We use throughput analysis and throughput propagation, as well as node replication and node combining, to find space/time tradeoffs for CV applications on FPGAs for a defined throughput target or a defined area budget.

4.4 Experimental Results

Our experiments are carried out in two steps. First, we evaluate our strategies for finding a good area/throughput tradeoff by targeting different FPGA architectures with an area target. Then we evaluate our tool by setting different throughput targets.

In our evaluation, we use the following benchmarks implemented as OpenVX compute graphs:

• Sobel, a Sobel-filter-based edge detector with 5 nodes.
• Canny, a Canny edge detector with 6 nodes.
• Harris, a Harris corner detector with 6 nodes.

All the kernels inside each of the aforementioned benchmarks are analyzed and rewritten as parameterized C++-based kernels with stream-in/stream-out arguments. Using our tool, a library of different implementations for each kernel is generated. To examine whether our tool can cover a wide range of FPGA sizes, we evaluated it with VPR [11], part of the academic Verilog-To-Routing project [59]. Using VPR, different-size FPGAs were generated based on Altera Stratix IV [41], with logic cluster size N = 10, look-up table size K = 6, and channel segment
Due to limited time (as all design-ers will experience), we had to choose just a handful of implementation strategieswhich gave similar optimal area and throughput as our tool. However, to achievedesigns with throughputs in between the optimal points, these implementationswere scaled in the most logical way possible, but as they were scaled, they becameless efficient; naturally they used more area as we moved further away from theoptimal design points. Figure 4.17 shows how our tool covers a large design spaceand hits all targets more efficiently for the vxMagnitude kernel. We compared ourtool results for different CV kernels with our manual HLS and observed that ourapproach found an implementation for each throughput target that saves up to 30%area.Figure 4.18 shows area per throughput, normalized by the median value foreach benchmark, for a range of throughput targets. As shown, for throughput tar-gets larger than roughly 5 Pixel/clk, our tool finds good area/throughput tradeoffs66889092949698100102Used LUT(Percentage)FPGA SizeFigure 4.14: LUT usage percentage for Sobel implementations on differentFPGA sizes05001000150020002500300020x2023x2326x2629x2932x3235x3538x3841x4144x4447x4750x5053x5356x5659x5962x6265x6568x6871x7174x7477x7780x8083x8386x8689x89LUT/ThroughputFPGA SizeFigure 4.15: Throughput achieved for Sobel on different FPGA sizesfor each throughput target. For throughput targets less than 5, line buffer and SDAoverhead became a big portion of the area cost and increased the area per through-put ratio.To solve the trade-off finding problem, we used both ILP (similar to prior workby Cong et al. [22]) and the heuristic approaches. We also evaluated both ap-67889092949698100102Used LUT(%)Xilinx 7 SeriesSobel HarrisFigure 4.16: Percentage of LUT usage for different Xilinx FPGAs1000  1100  1200  1300  1400  1500  1600  1700  1800  1900  2000  5   10   15   20   25   30   35   40  Area/Throughput  Throughput  Target  Our  Approach  Manually  designed  Figure 4.17: vxMagnitude Area/Throughput results for different throughputtargetsproaches with different Xilinx 7-series FPGAs. Both heuristic and ILP approachescould fill over 95% of the chip area on average. The heuristic approach, how-ever, improves the runtime up to 3.6x compared to the ILP approach. Figure 4.19shows the runtime results for both heuristic and ILP approaches for implementingHarris application on different Xilinx FPGAs. Although the results are no shown,the heuristic was also able to fill closer to the area target on average.680.911.11.21.31.41.51.61.71.81.922.12.22.30 10 20 30 40Scaled Area/Throughput by MedianThroughput Targets(Pixel/Clk)CannyMagnitude_testSobelGuassian_filterHarrisFigure 4.18: Area cost results for different throughput targetsFigure 4.19: Heuristic vs ILP runtime speedup for Harris corner detectionTo better demonstrate the ability of the heuristic to save area, we first used theILP approach to meet the area target for Harris. Then, we used the ILP’s achievedthroughput on each FPGA size as a throughput-target for the heuristic approach.Figure 4.20 shows the area cost comparison between the heuristic and the ILP. Theheuristic approach saves 19% area on average while decreasing the throughput by6900.20.40.60.811.21.41.6Scaled Used Area by Chip SizeXilinx 7 SeriesILP HeuristicFigure 4.20: Area cost results for Harris using Heuristic and ILP approaches-2.5-2-1.5-1-0.50Percentage of Throughput decreasing Heuristic vs. 
ILPXilinx 7 SeriesFigure 4.21: Heuristic vs ILP throughput results for Harris corner detectionless than 2%. Unfortunately, however, it also failed to fit for three of the FPGAsizes. The corresponding (small) drop in throughput is shown in Figure 4.21.Next, to evaluate the real performance of the tool, we used the ZedBoard de-7001000020000300004000050000600000 1 2 3 4 5 6Arae cost(LUT)Throughput (Giga-Pixel/sec)Figure 4.22: Area/throughput results for implementing Sobel on Xilinx Zed-Boardvelopment platform [3] and set up different throughput targets for implementingSobel. Figure 4.22 shows the results. As shown, our tool is able to hit all thethroughput targets while increasing the area cost linearly. The tool is able toachieve up to 5.5GigaPixel/sec throughput for Sobel running at 105MHz. Sincethe Sobel benchmark requires about 70 operations per pixel, this is equivalent toroughly 350 GOPS.4.5 SummaryIn this chapter, we studied the problem of automatically finding an area/throughputtrade-off of CV applications by mapping OpenVX compute graphs onto FPGAs. Weproposed a framework on top of the Xilinx Vivado HLS tool which receives C++based CV kernels and uses different approaches in order to find many different im-plementations for each kernel. It compiles an OpenVX compute graph, analyzesit and finds a good trade-off between area and throughput. Our approach is dif-ferentiated from the existing HLS approaches as 1) it automatically investigatesfinding different implementations, and 2) it combines module selection and repli-71cation methods as well as changing tile-size width, node combining, and splittingin order to automatically find a better area/throughput tradeoff. This approach wasverified with different OpenVX benchmarks targeting several different FPGA sizes.Our tool is able to automatically achieve over 95% of the target area budget on av-erage while maximizing the throughput. Our tool also can automatically satisfy avariety of throughput targets while minimizing the area cost. The proposed systemsaves up to 30% of the area cost compared to manually written and heavily par-allelized implementations. Using Inter-node Optimizer step, our heuristic tradeofffinder is able to hit the same throughput targets as the ILP algorithm while saving19% area on average.72Chapter 5FPGA Overlay Space/TimeScaling with Custom InstructionsSo far, this dissertation has investigated space/time scaling to implement STG ap-plications on MPPA and FPGA architectures. In this chapter, we investigate anFPGA-based overlay architecture for accelerating OpenVX applications withoutcreating any application-specific logic. The overlay consists of a Soft Vector Pro-cessor (SVP) for general acceleration, and the ability to implement Vector CustomInstruction (VCI) on the fly using a Partial Reconfiguration Region (PRR). The PRRis a reserved area of the chip that is initially blank and made available at run-timeto allow instances of VCIs to be defined as they are needed. A pipeline of computekernels can sometimes be realized by chaining multiple VCI together using a cus-tom multiplexer network within the PRR. Using PR, this method obtains speedupsfar beyond what a plain SVP can accomplish. For example, on the Canny-blurapplication, an 8-lane SVP is 18 times faster than the plain ARM Cortex-A9. How-ever, using ultra-fast PR, which is technically feasible but not yet supported onmodern FPGAs, a speed of 106 times faster is possible. 
This allows OpenVX programmers, who have no FPGA design knowledge, to achieve hardware-like speeds within their vision application.

Table 5.1: vxMagnitude kernel throughput running on different platforms

    Running Platform                              Throughput (MegaPixel/sec)   Speedup vs. ARM
    ARM Cortex-A9 (667MHz)                        10.31                        1.0
    VectorBlox SVP-V4 (100MHz / 12,989 LUTs)      65.54                        6.3
    VectorBlox SVP-V8 (100MHz / 22,517 LUTs)      128.92                       12.5
    Custom hardware (100MHz / 3,000 LUTs)         1176.04                      114

5.1 Introduction

Previously, in chapter 4, we augmented the Vivado HLS tools to implement CV applications described using OpenVX with maximum throughput for a given area target, or with minimum area for a given throughput target. These solutions are custom-generated for the OpenVX application, and require running the full Xilinx place-and-route tool flow to generate the solution. While they are highly optimized in terms of performance and area, the overall design flow still requires the expertise of hardware designers who are familiar with FPGAs.

In this chapter, we consider accelerating OpenVX applications on FPGAs for software developers with no hardware experience, while still exploiting the ability to make space/time tradeoffs. The simplest implementation method would simply use a host processor – either a fast, hard ARM processor present on many modern FPGAs, or a slower soft processor built using FPGA logic. To provide performance that scales with area, an SVP implemented in the FPGA logic achieves greater performance per unit area than multi-core soft processors [76].

Unfortunately, the total performance per unit area of an SVP still falls short of what can be achieved with the dedicated hardware generated in chapter 4. For example, consider Table 5.1, which shows the throughput achieved by the vxMagnitude OpenVX kernel on different platforms. The vectorized software implementation running on an SVP with four vector lanes (V4) uses about 13,000 LUTs and achieves 6.3 times higher throughput than a basic software implementation running on the Cortex-A9. Note that the SVP achieves this while running at a clock frequency roughly one-sixth that of the ARM processor, making it an efficient platform. Increasing the number of vector lanes to 8 nearly doubles the speedup to 12.5 and uses 23,000 LUTs. However, a custom hardware implementation of the vxMagnitude kernel that is area-constrained to just 3,000 LUTs achieves 114 times higher throughput than the Cortex-A9.

While it may appear that custom hardware is the way to go, there are three main constraints. First, and most importantly, it requires the expertise of an FPGA hardware engineer. Second, the total area available is limited, so a system like the one proposed in chapter 4 is required to achieve a proper space/time tradeoff. Third, unlike the SVP, it is completely inflexible and cannot be 'reprogrammed' to do other tasks.

Instead, we can take advantage of the SVP's ability to add specialized vector custom instructions (VCIs) to provide greatly increased performance for a small increase in area. Unlike custom instructions in a traditional processor, where performance gains are limited by the amount of data available, a VCI has access to a significant amount of high-bandwidth data. It naturally accepts two wide, streaming data sources and produces one wide, streaming data result in a pipelined fashion. Thus, a VCI is a natural candidate for an OpenVX compute kernel, which also streams data and can operate in a pipelined fashion.
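To make this streaming shape concrete, the sketch below shows an HLS-flavoured model of a VCI-like datapath: two streaming source operands in, one streaming result out, fully pipelined. This is not the actual VectorBlox MXP VCI port interface; the names, widths and the example operation (an absolute-difference kernel) are assumptions chosen only to illustrate the two-input, one-output structure.

    // Illustrative two-in/one-out streaming datapath in the spirit of a VCI.
    #include <hls_stream.h>
    #include <ap_int.h>

    typedef ap_uint<32> lane_word_t;   // one 32-bit vector lane word (assumed)

    void vci_absdiff_lane(hls::stream<lane_word_t> &portA,   // source operand A
                          hls::stream<lane_word_t> &portB,   // source operand B
                          hls::stream<lane_word_t> &portC,   // destination
                          int n_words) {
      for (int i = 0; i < n_words; ++i) {
    #pragma HLS PIPELINE II=1
        ap_int<32> a = (ap_int<32>)portA.read();
        ap_int<32> b = (ap_int<32>)portB.read();
        ap_int<32> d = a - b;
        // A magnitude-style VCI would look similar, finishing with a square
        // root (e.g. a CORDIC stage) instead of the absolute value.
        ap_uint<32> absd = (d < 0) ? ap_uint<32>(-d) : ap_uint<32>(d);
        portC.write(absd);
      }
    }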
One major restriction of theVCI, however, is a limit of only two input operands and one output operand.Thus, to accelerate OpenVX compute graphs for software developers, we canproduce an FPGA overlay consisting of a processor resource, such as a host ARMprocessor and SVP for some acceleration, and a pool of VCI modules. However,choosing the right VCI modules is a difficult problem. Not all OpenVX computekernels will be used by an application, so it would be wasteful to dedicate areato implement all of them at once. Instead, to fit a limited budget, we can reservesome FPGA area as a partially reconfigurable region and dynamically load the par-tial bitstream for each VCI as needed. Using an SVP with multiple VCI modulesand partial reconfiguration, we can create an FPGA overlay for software program-mmers using a limited area budget, yet still greatly accelerate OpenVX computegraphs.Figure 5.1 shows four different scenarios (a, b, c and d) of running a simpleapplication consisting of three nodes (A, B and C) on the proposed system. Oneway, shown in Figure 5.1.a, is to run all the nodes sequentially as software. Imple-menting one node as a VCI module, shown in Figure 5.1.b, increases throughput.A second VCI module might improve things further, but the two modules will com-pete for area if they must be instantiated at the same time. Instead, if we canreconfigure the reserved FPGA region from supporting node A to node C while the75Figure 5.1: Running an application on the hybrid systemSVP runs node B, we can realize a greater speedup. Note that, after each nodeis run, the results are written back into the SVP scratchpad, and must be read outagain by the next node. This wastes time. To speed things up, it may be possi-ble to directly stream results from A to B by connecting two VCI modules in apipelined fashion, as shown in Figure 5.1.d. This is called VCI chaining. Sincewe do not know beforehand which two modules must be connected, we can builda multiplexer network to forward the data, effectively bypassing the scratchpad inbetween.As mentioned before, chip area is limited, so not all compute nodes can beimplemented at the same time. However, with timesharing these resources throughdynamic partial reconfiguration, we can better dedicate area to nodes that areneeded to improve throughput.There are various ways for running an application on such a system. The keyquestion here is which nodes should be selected to run as software, and which ones76as hardware. Since the VCI needs to achieve fixed levels of throughput, the toolused in chapter 4 is needed to produce an implementation with minimum area. Fur-thermore, when different VCI modules are chained, implementations with matchedthroughputs are required to avoid the complications of internal buffering.This chapter presents a method for the run-time acceleration of an OpenVXapplication on an SVP system that uses a PRR which can host VCIs. To do this,we pre-generate a library of different VCI implementations that fit the PRR usingthe tool from chapter 4. Next, using the knowledge of PR speed and how long itwill take to instantiate a VCI, we determine which OpenVX nodes should be runas software and which should use a VCI. For a very slow PR speed, we cannotdynamically reconfigure and must select a static set of nodes for acceleration byVCIs that fit the PRR area. Each VCI instance takes a different load time dependingupon the bitstream size needed to program that region. 
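As a rough illustration of this decision (formalized in the execution-time analysis of Section 5.3.2), the sketch below estimates a VCI's reconfiguration time from its bitstream size and checks whether offloading a kernel to that VCI beats the pure-SVP implementation. The kernel timing numbers and names are assumptions; only the two PR rates (the documented ~400 MB/s ICAP rate and the ~2.2 GB/s overclocked rate discussed later in this chapter) come from the text.

    // Illustrative static-vs-dynamic VCI check (cf. Eqs. 5.5 and 5.7).
    #include <cstdio>

    struct KernelTiming {
      double t_svp_per_tile;   // t(S_m): software time per tile on the SVP (s)
      double t_vci_per_tile;   // t(P_m): VCI time per tile (s)
      double bitstream_bytes;  // partial bitstream size for this VCI
    };

    // t_PR = PR_size / PR_rate
    double prTime(double bitstream_bytes, double pr_rate_bytes_per_s) {
      return bitstream_bytes / pr_rate_bytes_per_s;
    }

    // Using the VCI pays off if N_T*t(S_m) > N_PR*t_PR + N_T*t(P_m).
    bool vciWorthwhile(const KernelTiming &k, int n_tiles, int n_reconfigs,
                       double pr_rate_bytes_per_s) {
      double sw = n_tiles * k.t_svp_per_tile;
      double hw = n_reconfigs * prTime(k.bitstream_bytes, pr_rate_bytes_per_s)
                + n_tiles * k.t_vci_per_tile;
      return sw > hw;
    }

    int main() {
      KernelTiming gauss = {40e-6, 4e-6, 300e3};   // placeholder numbers only
      std::printf("400MB/s: %d  2.2GB/s: %d\n",
                  vciWorthwhile(gauss, 1000, 100, 400e6),
                  vciWorthwhile(gauss, 1000, 100, 2.2e9));
    }

With these placeholder numbers, the faster PR rate flips the decision in favour of dynamically loading the VCI, which is exactly the trade-off the run-time system must evaluate per kernel.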
If the underlying FPGA cansupport a faster PR speed, it may become possible to dynamically switch the VCIloaded on the fly. Conceptually, for an infinitely fast PR speed, we can change theVCI for each node in the graph and produce the highest speeds. When VCI chainingis possible, matched throughput implementations must be selected and loaded intothe PRRs, and the multiplexer network must be configured accordingly.Thus, the speed of the PR, the bitstream size of the VCI, the size of the PRRsupported, the size of the image, the tile size used, and the OpenVX graph (eg,which nodes, how they are arranged, and whether VCI chaining is possible) will allaffect performance.Since we cannot easily modify the speed of the PR (it is fixed by the FPGAvendor), we build a framework that takes these variables into account and estimatesoverall performance.There have been several recent studies on implementing image processing andOpenVX applications on reconfigurable platforms [37, 38, 70, 78]. The contribu-tions listed below distinguish our work from previous approaches:1. Adding automated space/time tradeoffs to the process of generating VCI forSVP in order to use parallel resources more efficiently.2. Adding PR to VCI implementation to time-share the parallel resources.773. Adding scratchpad bypass capability to improve performance with VCI chain-ing.4. Using pre-synthesized node fusion as well as a heuristic approach to savearea compare to classic ILP approaches.5.2 System OverviewThe FPGA-based overlay consists of an ARM Cortex-A9 host processor, the Vec-torBlox MXP SVP [75], and an empty PRR reserved for vector custom instructions.To use a VCI, the associated bitstream must first be loaded into the reconfigurableregion. Static PR loads bitstreams for VCIs until the PRR is full. Dynamic PR allowsthe bitstream to change quickly on the fly, enabling the use of more VCIs than thePRR can hold at once. Figure 5.2 provides a view of the overall system.The MXP has its own scratchpad and supports up to 16 different VCI opcodes.The scratchpad provides two streaming source operands on PortA and PortB, andaccepts results on PortC. Each VCI can be implemented within the reconfigurableregion and connect to these ports. The Multiplexer (MUX) network is implementedas part of each VCI and connects it with the SVP; the VCIs can forward their resultsdirectly to each other through the MUX network in a way that bypasses writing theresult to the scratchpad.The VCIs for OpenVX are generated using the synthesis system described ear-lier in chapter 4, thus ensuring maximum throughput is achieved within the areaconstraints provided. The OpenVX compute graph is analyzed and each computekernel is run as either regular software or a custom instruction in an acceleratedhardware pipeline.According to Xilinx, the tool support for partial reconfiguration is based uponlarge, non-overlapping regions with predefined boundaries. However, this is a cur-rent limitation of the tools, and not the underlying device architecture. For ex-ample, tools such as GoAhead [9] enables fine-grained boundaries, fine-grainedvertical relocation of the bitstream, and the ability to lock down interconnect wiresfeeding these reconfigurable regions from non-reconfigurable regions. These tools78Figure 5.2: System overviewcan be also extended to modern devices under Vivado. 
1 In our system, we assumetools like GoAhead will be available for modern devices, but until then this workis limited to being a conceptual study.1Dirk Koch, private communication.795.3 Mapping OpenVX Applications to FPGA OverlayThe process of mapping an OpenVX application involves two steps. The first stepis similar to chapter 4 and done in advance by FPGA experts: the OpenVX kernels,written in C++ for Vivado HLS, are mapped for defined levels of throughput. Thethroughput levels are defined to match the width of the SVP VCI, which can beanywhere from 1 to N lanes, where N is a power of 2 that matches the largestSVP to be used. This produces a minimal-area VCI for each width. Each VCIimplementation is noted with its throughput, area requirement, and bitstream size.Each OpenVX kernel also has a pure software implementation on the SVP or hostprocessor, where the throughput, in terms of pixels per second it can process, isrecorded at each image tile size.The second step is executed at application time by the OpenVX run-time sys-tem. Given an OpenVX compute graph and a target image size, it uses an executiontime model to determine the best tile size, which nodes should be implemented as aVCI, and the VCI implemention to use. It also determines whether to use bypassingand node fusion.All of these steps are described below.5.3.1 Finding Different ImplementationsConsider an application described as an STG G with N nodes f1, f2, ..., fN .G = (V,E) (5.1)V = { f1, f2, ..., fN} (5.2)For each node fm we can find NSV P different SVP implementations S1m,S2m, ...,SNSV Pmas well as NVCI different VCI hardware implementations P1m,P2m, ...,PNVCIm . EachSVP implementation Ssm can perform functionality of fm on an image tile, in t(Ssm)time. Each VCI implementation Psm can perform functionality of fm with areacost A(Psm), number of pixels it can consume/produce NP(Psm) each firing (i.e.,tile width), and initiation interval II(Psm).Considering available resources, such as the size of the PRR, scratchpad ca-pacity, and PR reconfiguration speed, the OpenVX run-time system decides which80node should be implemented as SVP software and which one should be imple-mented as VCI hardware. After enumerating different implementations for eachOpenVX node, to minimize the search space, the OpenVX run-time system prunesany dominated implementation points. Moreover, it uses “execution time analysis”to consider other factors, namely DMA time and PR time, to find out the overallexecution time for each implementation.5.3.2 Execution Time AnalysisIn order to execute a general OpenVX compute graph, the run-time system needsto fetch each image tile from main memory to the scratchpad and execute the wholecompute graph one node at a time (either as SVP or VCI implementations). In everystage of traversing the graph, the intermediate data is saved in the scratchpad. Forlarge graphs, a significant amount of intermediate data may need to be buffered.This means the tile size must be calculated based on the available scratchpad sizeand amount of buffering required. After finishing executing all the nodes in thegraph on a tile, results are written back to main memory before fetching the nexttile. The execution time for these components is discussed below.SVP Software ImplementationThe execution time of executing a compute graph G with N nodes f1, f2, ..., fN andsubset of selected SVP implementations S1,S2, ...,SN on image tile Tj is tTj . 
This is calculated as:

    t_{T_j} = t_{DMA_M2S} + [ Σ_{i=1}^{N} t(S_i) ] + t_{DMA_S2M}    (5.3)

The overall execution time for the N_T tiles in the image is calculated as:

    t_A = Σ_{j=1}^{N_T} t_{T_j}    (5.4)

Accelerated VCI Implementation

Consider node f_m in the compute graph G. Instead of using SVP implementation S_m with execution time t(S_m), it is possible to use a VCI hardware implementation P_m with execution time t(P_m) and PR reconfiguration time t_PR to improve the execution time. Assuming we need to reconfigure this node N_PR times while processing the entire image, the execution time is improved if:

    N_T · t(S_m) > N_PR · t_PR + N_T · t(P_m)    (5.5)

In chapter 4 we showed that for most CV kernels implemented as pipelined hardware accelerators, we can define the kernel throughput Θ(P_m) as the number of pixels consumed/produced in each clock cycle. The same formulation can be used here to calculate the VCI execution time t(P_m):

    t(P_m) = SetupTime(P_m) + TileSize / (Θ(P_m) · F_max)    (5.6)

where F_max is the clock speed of the SVP (here, 100MHz). In addition:

    t_PR = PR_size / PR_rate    (5.7)

Node Chaining

Each VCI normally implements one OpenVX kernel, and only one VCI executes at a time. However, if the graph topology allows, it is possible to find a cluster of nodes where a series of VCIs can be chained together, sending the output of one directly to the input of the next, without writing intermediate results to the scratchpad. This scratchpad bypassing allows us to take advantage of pipeline parallelism by overlapping VCI execution.

Although chaining improves performance, it requires all the VCI implementations in the chain to be active at the same time. That is, there must be sufficient area in the PRR to hold the entire VCI chain. In addition, the overall VCI chain still needs to follow the VCI topology restrictions: overall, there can be at most 2 input operands (PortA and PortB) and one destination operand (PortC).

[Figure 5.3: Node clustering and bypassing the scratchpad.]

For example, Figure 5.3.a shows a graph with three nodes and its execution timeline. Standalone VCI implementations are used for each node in the graph. This means each VCI implementation needs to write its results to the scratchpad for its successors. It also means each node must wait for its predecessors to finish before it can begin. Since the executions of nodes A, B and C do not overlap, the system only needs to keep one VCI configured at a time within the PRRs.

In contrast, nodes B and C can be chained as shown in Figure 5.3.b. The chain bypasses the scratchpad for writing, so the result of node B is sent directly by the MUX network to the VCI implementing node C. For this to work, the VCI for node A must be executed first, and the MUX network must be configured to stream data through the VCIs for both B and C, which must be active simultaneously.

In a different example, shown in Figure 5.3.c, nodes B and C both use the result of node A. This concept of fan-out was not present in the previous two examples. To avoid writing the result of A to the scratchpad, two VCI chains must be formed: A and B, as well as A and C. This example shows that clustering needs to consider all uses of the intermediate data between nodes.

Now that we have explained VCI chaining, we describe the execution time analysis for standalone VCIs as well as VCI chains.

Pruning Slow Standalone VCIs

To enhance performance and reduce the search space, standalone VCI implementations that are slower than the SVP implementations are pruned.
Hence, we only keep VCIs that satisfy the following equation:

    t(S_m) > (PR_size · N_PR) / (PR_rate · N_T) + TileSize / (Θ(P_m) · F_max) + SetupTime(P_m)    (5.8)

Bypassing the Scratchpad

Similarly, we prune VCI chains that are slower than the SVP implementations. Assuming we can implement a sequence of N_C nodes as a VCI chain, we only keep VCI chains that satisfy the following equation:

    Σ_{j=1}^{N_C} t(S_m^j) > Σ_{j=1}^{N_C} (PR_size^j · N_PR^j) / (PR_rate^j · N_T) + max_j [ 1/Θ(P_m^j) ] · TileSize / F_max + Σ_{j=1}^{N_C} SetupTime(P_m^j)    (5.9)

We prune the problem space by eliminating all implementations that cannot satisfy Equation 5.5, Equation 5.8 and Equation 5.9.

Pre-synthesized Node Fusion

Instead of VCI chaining, it is possible to fuse nodes together. This accomplishes a similar result, but the VCI must be pre-synthesized, so the pair of nodes to be fused must be known in advance. Hence, this can be done as long as there is sufficient time to precompute a library of all pairs of OpenVX nodes.

The difference between VCI chaining and node fusion is shown in Figure 5.4. With chaining, the MUX network is used to steer the output of A to the input of B. With node fusion, the connection is made directly and the entire fused function is synthesized into a single VCI. This can yield higher performance within an area budget; for example, with node chaining, each VCI might be limited to 2 pixels per firing, whereas node fusion might be able to support 4 pixels per firing within a similar budget.

[Figure 5.4: VCI chaining versus node fusion.]

It may not be feasible to pre-synthesize all pairs of OpenVX nodes, and fusing more than two nodes is even more computationally demanding. However, a subset might be feasible if the graph topology is known in advance. In particular, there are several common sub-graph patterns in OpenVX applications which can be anticipated. For example, the Color Convert kernel followed by the Gaussian Filter kernel is a common 2-node sequence. Table 5.2 lists a few common patterns with up to 5 nodes that might be useful. These patterns must still conform to the overall two-input, one-output operand structure, but with node fusion it is possible to encapsulate more complex internal structures.

Table 5.2: Some common patterns used for pre-synthesized node fusion

    ConvertColor, Gaussian
    Gaussian, Sobel X
    Gaussian, Sobel Y
    ConvertColor, Gaussian, Sobel X
    ConvertColor, Gaussian, Sobel Y
    Sobel X, Sobel Y, Magnitude
    Sobel X, Sobel Y, Phase
    Gaussian, Sobel X, Sobel Y, Magnitude
    Gaussian, Sobel X, Sobel Y, Phase
    ConvertColor, Gaussian, Sobel X, Sobel Y, Magnitude
    ConvertColor, Gaussian, Sobel X, Sobel Y, Phase
    Magnitude, Phase, Non-Maxima
    Sobel X, Sobel Y, Magnitude, Phase, Non-Maxima

5.3.3 Solving the Space/Time Tradeoff

After pruning the problem space, the next step is exploring space/time tradeoffs to find suitable implementations for each OpenVX node and solving the scheduling problem, i.e., deciding which nodes should run on the SVP and which should be implemented as VCIs or VCI chains.

Previous studies have shown that this scheduling problem can be defined as an ILP problem and solved by ILP solvers [49, 82]. In chapter 4, we also demonstrated an ILP approach for automatically exploring space/time tradeoffs. Combining these two ILP formulation approaches, we can easily produce an ILP model and use a solver such as GLPK [60].

Although ILP solvers can solve these problems, they lack flexibility. In particular, the ILP problem cannot model all possible constraints, and it can quickly become computationally infeasible to solve.
In other words, as we have shownelsewhere, combining or splitting nodes are not feasible while using ILP. In addi-tion, as shown in chapter 4, ILP solvers can be slower than heurstics.To overcome those mentioned shortcomings, we developed a heuristic approachwhich allows us to use node fusion. The heuristic approach is similar to the ap-proach we used in chapter 4 in the sense that it uses the “node combining” ability(explained subsection 3.3.2) to allow for node fusion. Going through different pos-sible nodes to fuse, the heuristic uses exhaustive search similar to ILP to find whichnodes need to be implemented as SVP and which ones need to be implemented asVCI.5.4 Experimental ResultsIn this section, we will investigate the speedup provided by the SVP and variousVCI configurations. The performance of three different hardware configurations aremeasured: the baseline, consisting of OpenVX kernels written in C and running onthe ARM Cortex-A9 processor at 667MHz; the SVP running at 100MHz withoutany VCI, consisting of OpenVX kernels written in C using the VectorBlox MXPAPI and running on a combination of the Cortex-A9 and MXP; and the SVP withdifferent VCI options at 100MHz.The OpenVX kernels implemented for this study are shown in Table 5.3. Forthe HLS versions, a library of different PR bitstream implementations for each VCIinstance were generated to facilitate the trade-off finding process. These VCIs weregenerated using the HLS tool described in chapter 4.The baseline SVP and ARM code was implemented by Nick Ivanov of Vec-torBlox. Mr. Ivanov generated two versions of the SVP hardware, and using thathardware he gathered runtime data for the OpenVX kernels. The two SVP versions87Figure 5.5: ARM Cortex-A9 (667MHz) vs SVP-V4 and SVP-V8 (100MHz)AColor ConvertBGaussian 3x3CSobel_X_3x3EMagnitudeDSobel_Y_3x3FPhaseFigure 5.6: Graph representation of Sobel application with 6 nodesgenerated were a 4-lane configuration, where four 32-bit ALUs are assembled andcalled SVP-V4, and an 8-lane configuration, called SVP-V8.The throughput results for four specific kernels (ColorConvert, Gaussian, So-bel and Magnitude) running on 3 different platforms (ARM Cortex-A9, SVP-V4and SVP-V8) are shown in Figure 5.5. On average across those kernels, the SVPachieves a 4.6 times speedup on SVP-V4 over the Cortex-A9, and an 8 timesspeedup using SVP-V8.Unfortunately, only a few general algorithms have been released as publicbenchmarks for OpenVX. This means there is no standard OpenVX benchmarksuite available. 
To show the capabilities of our approach, we implemented follow-ing simple benchmarks as OpenVX compute graphs:88Table 5.3: List of CV kernelsCV kernel ARM SVP HLSColor Conversion X X XChannel Extract X X XGaussian filter X X XSobel filter X X XPhase X X XMagnitude X X XNon-maxima Suppression X X XThresholding X X XMedian Filter X X XBox Filter X X XChannel Combine X X XAbsolute Difference 7 X XAccumulate 7 X XAccumulate Squared 7 X XAccumulate Weighted 7 X XArithmetic Addition 7 X XArithmetic Subtraction 7 X XBitwise AND 7 X XBitwise OR 7 X XBitwise XOR 7 X XBitwise NOT 7 X XConvert Bit Depth 7 X XDilate Image 7 X XErode Image 7 X XLBP 7 X XPixelwise Multiplication 7 X XMagnitude-cordic 7 7 XSqrt-cordic 7 7 XArctan-cordic 7 7 X89AColor ConvertBGaussian 3x3CSobel_X_3x3EMagnitudeDSobel_Y_3x3FPhaseGNon-MaximaHHyst ThreshIGaussian 3x3JMagnitudeFigure 5.7: Graph representation of Canny−Blur application with 10 nodesV4-1 V4-2 V4-3 V4-4 V4-5 V8-1 V8-2 V8-3 V8-4 V8-5 051015202512000 17000 22000 27000 32000MPixel/SecLUT utilizationV4 V4 with VCI V8 V8 with VCI ARMFigure 5.8: Throughput vs area for V4 and V8 with/without VCI (Canny−Blur Figure 5.7)• Sobel application with 6 nodes, shown in Figure 5.6.• Canny-blur application with 10 nodes, shown in Figure 5.7.All the kernels in both applications can be run using image tiles except the hys-teresis thresholding kernel (node H) in Canny−blur. The hysteresis thresholdingkernel needs the global image perspective, so it must DMA all tile results prior tothat node before running the subsequent nodes. Thus, to run Canny− blur, weneed to run the first part on whole image, save to memory, read the results backand then run the second part on the whole image generated by first part.90Table 5.4: List of CV kernels implemented as VCIs in Figure 5.8Implementation Kernels implemented as VCIsV4-1 (AB)V4-2 (AB), DV4-3 (ABC), (ABD)V4-4 (ABC), (ABD), H, IV4-5 (ABC), (ABD), H, (IJ)V8-1 (AB)V8-2 (AB), CV8-3 (ABC), (ABD)V8-4 (ABC), (ABD), H, IV8-5 (ABC), (ABD), H, (IJ)Figure 5.9: Sobel speedup by adding static VCIs (standalone and bypassing)to SVP-V4 compared to ARMBenefits of Static VCIsAlthough SVPs are able to achieve better throughput than ARM, the use of VCIs of-fers significant improvement. Figure 5.8 shows the results of running the Canny−blur benchmark on SVP-V4 and SVP-V8, both without and with VCIs. The Y-axisshows the throughput achieved in Mega-Pixels per second, while the X-axis showsthe LUT utilization, i.e. the amount of FPGA area used by each implementation.91We set different PRR sizes for implementing VCIs connecting to V4 or V8.Considering the area budget, our tool explored the solution space and decidedwhich nodes need to be implemented as VCIs and which ones as SVPs in orderto maximize the throughput. When PRR size increases, throughput increases. FiveVCI solutions were generated for each of V4 and V8; these are shown in Table 5.4.In this table, the letter indicates which node(s) from the graph in Figure 5.7 areaccelerated with a VCI; a grouping with parentheses indicates a VCI chain thatbypasses the scratchpad.Note that in this figure, all of the VCIs are implemented in a fully static manner.That is, NPR = 0 in Equation 5.5, so the VCI is only configured once at applicationload time. Below, we will demonstrate the benefits and limitations of using dy-namic PR, thereby changing which VCI is implemented as an image tile is passedfrom node to node in the graph.Impact of BypassingThe previous subsection included bypassing to achieve better results. 
In this sub-section, we remove bypassing to show the gains it provides. Results in Figure 5.9show speedup before and after bypassing on an SVP-V4 when a fixed area budgetis given to VCIs. For a small area budget, bypassing does not yield significant im-provements. However, as the area budget grows, the benefit of bypassing increases;in the version with the largest area budget, the speedup with bypassing is over 2times faster. A breakdown of the VCIs used in this figure is shown in Table 5.5.Impact of Dynamic Partial ReconfigurationInstead of statically assigned VCIs within the PRR, it is also possible to dynamicallyreconfigure each VCI while evaluating a graph. To simplify discussion, supposeNPR = NT in Equation 5.5. This might be the case when the first use of a VCIincurs latency, but future uses within a graph can hide the latency (e.g., throughprefetching). In this case, the time to reconfigure and run on the VCI must also befaster than the software-only SVP implementation. That is,t(Sm)> tPR+ t(Pm). (5.10)92Table 5.5: List of CV kernels implemented as VCIs on SVP-V4 in Figure 5.9Implementation Kernels implemented as VCIsSA1 A, BBP1 ABSA2 A, B, CBP2 (AB), CSA3 A, B, DBP3 (AB), DSA4 A, B, C, DBP4 (ABC), (ABD)SA5 A, B, C, D, E, FBP5 (ABCDE) , (ABCDF)When we allocate a new VCI to the PRR, we need to select a precise location.First, we do a simple first-fit search strategy in the free space. If that fails, we swapout the VCI that has been idle for the longest time. We find that most VCIs withthe same bandwidth require similar amount of space within the PRR.This allocation heuristic is far from perfect, and it may lead to internal frag-mentation within the PRR. Since the entire graph is known, a better scheduler canlook at whether kernels are used multiple times in the graph. Also, since the multi-ple image tiles will be passed through the entire graph, the reconfiguration scheduleis cyclic. Using these properties, a better scheduler can optimize for the fewest re-configurations while also avoiding fragmentation. Rather than attempting to createsuch an optimal heuristic, we went with a simple one (which has more reconfig-urations than necessary, thereby conservatively underestimating the performanceimpact) and ignored the fragmentation problem (since it is likely solvable). Thesepractical issues should be addressed in future work.Impact of PR Rate and Image SizeOne concern of using dynamic PR is the PR rate, or the speed at which a new con-figuration can be loaded. Considering Equation 5.7 and Equation 5.10, this rate cansignificantly influence the benefits of using dynamic PR. Although in this study weused Xilinx FPGAs, the overall PR rate for Intel FPGAs is similar [93]. As a start-ing point, Xilinx documents a maximum 400MByte/sec PR rate using their on-chip93Figure 5.10: V4 Dynamic PR and Static PR speedup vs ARM for Sobel Ap-plication (4500 LUT budget, image size 1920×1080)ICAP interface; this is specified as a 32b value every 100MHz clock cycle. How-ever, Hansen et al. demonstrated this can be overclocked by more than a factorof 5 to achieve 2.2GByte/sec on real devices [35]. Other studies have suggesteddifferent architectures to achieve even faster PR rates. For example, Trimberger etal. proposed a time-multiplexed FPGA architecture which provides 33GByte/secreconfiguration rate [84]. Unfortunately, FPGA vendors have not made it a prior-ity to provide fast PR rates; we believe this work provides some of that missingincentive.Figure 5.10 shows the impact that PR rate has on speedup. 
In this figure, theSobel application is run on an SVP-V4 with both with dynamic PR and static PR onan image of size 1920×1080. As the PR rate increases along the X-axis, the overallspeedup (relative to the ARM) also increases. At low PR rate values of 3.2GB/s orlower, dynamic PR is not considered because it is slower than static. At 6.4GB/sand above, dynamic PR becomes faster than static. Bypassing benefits more fromfast PR because it covers a greater proportion of the total runtime. These sametrends are demonstrated on the Canny-blur application on both an SVP-V4 versionwith area budget of 4500 LUTs in Figure 5.11, and an SVP-V8 version with areabudget of 14000 LUTs in Figure 5.12.Similar to how increasing the PR rate improves bypassing more, increasing the94Figure 5.11: V4 Dynamic PR and Static PR speedup vs ARM for Canny-blurApplication (4500 LUT budget, image size 1920×1080)Figure 5.12: V8 Dynamic PR and Static PR speedup vs ARM for Canny-blurApplication (14000 LUT budget, image size 1920×1080)image size can also improve performance, particularly at low PR rates. This isshown in Figure 5.13 where speedup improves over larger images. This is becausethe larger image size allows the use of a taller image tile, allowing the current nodecomputation to better hide the latency of configuration for future nodes. At low PRrates, the proportion of time spent doing PR relative to the total computational size950.5124816320.4 0.8 1.6 3.2 6.4 12.8 25.6 51.2 102.4 204.8 409.6 819.2 1638.4SpeedupParial Reconfiguration Bandwidth(GB/sec)352x240480x4801920x10803840x2160V4Figure 5.13: V4 Dynamic PR speedup vs ARM for Canny-Blur for differentimage sizes (4500 LUT budget)0.512481632640.4 0.8 1.6 3.2 6.4 12.8 25.6 51.2 102.4 204.8 409.6 819.2 1638.4SpeedupParial Reconfiguration Bandwidth(GB/sec)352x240480x4801920x10803840x2160V8Figure 5.14: V8 Dynamic PR speedup vs ARM for Canny-Blur for differentimage sizes (4500 LUT budget)goes down with larger images. At high PR rates, the PR overhead has already beenminimized so changing the image size has little impact on performance.96Impact of Node Fusion HeuristicSo far, all of the above results have been generated using the ILP approach. We alsotested the heuristic approach, which enables node fusion as a new optimization,using the same throughput targets or area budgets. Just like the individual kernels,the candidates for fusion, taken from Table 5.2, were presynthesized. We haveomitted the detailed results showing that the heuristic is able to match all of thethroughput results achieved by the ILP while using 9% less area. This area savingscould be useful to reduce the number of reconfigurations, thereby reducing thereconfiguration time, or to increase the number of VCIs or their width, therebyimproving performance.5.5 SummaryThis chapter presented a method for the run-time acceleration of an OpenVX appli-cation on an FPGA-based overlay system that uses a PRR which can host VCIs. Todo this, we pre-generated a library of different VCI implementations that fit the PRRprofile using the automated FPGA space/time scaling tool we introduced in chap-ter 4. The OpenVX application’s compute graph is analyzed and each computekernel is run as either regular software or as a custom instruction in an acceleratedhardware pipeline configured on the PRR. A pipeline of compute kernels can some-times be realized by chaining multiple VCI together using a custom multiplexernetwork. 
5.5 Summary

This chapter presented a method for the run-time acceleration of an OpenVX application on an FPGA-based overlay system that uses a PRR which can host VCIs. To do this, we pre-generated a library of different VCI implementations that fit the PRR profile using the automated FPGA space/time scaling tool introduced in chapter 4. The OpenVX application's compute graph is analyzed, and each compute kernel is run either as regular software or as a custom instruction in an accelerated hardware pipeline configured on the PRR. A pipeline of compute kernels can sometimes be realized by chaining multiple VCIs together using a custom multiplexer network. With partial reconfiguration, this method can obtain speedups far beyond what a plain SVP can accomplish. For example, on the Canny-blur application, an 8-lane SVP is 18 times faster than the plain ARM Cortex-A9. However, using ultra-fast partial reconfiguration, which is technically feasible but not yet supported on modern FPGAs, a speedup of 106 times is possible. This allows OpenVX programmers, who have no FPGA design knowledge, to achieve hardware-like speeds within their vision applications.

Chapter 6

Conclusions

This work broadens the overall usability of parallel resources by providing an environment which allows users to automatically explore space/time tradeoffs and find implementations that satisfy a throughput target or an area budget on a wide range of different pipelined architectures. The contributions made in this dissertation are summarized below.

First, we added automated space/time scaling for streaming applications to compilation tools for MPPAs (a coarse-grained architecture). This was done by proposing a Java-based HLS tool chain which finds all degrees of parallelism in a general stream application and then uses throughput analysis and throughput propagation to find possible bottlenecks. The proposed approach combines module selection and replication methods with node combining and splitting in order to automatically find a better area/throughput tradeoff. It also presents a heuristic approach which is more flexible and can find design points that are computationally infeasible or difficult to model using a classic ILP formulation. We examined different benchmarks, and the tool satisfied a variety of throughput targets and area budgets. Our heuristic approach could achieve the throughput targets using 30% less area compared to the ILP approach.

Second, automated exploration of space/time tradeoffs was added to a commercial HLS tool for implementing CV applications targeting FPGAs. OpenVX was used to define CV applications as STGs. This approach was verified with different OpenVX benchmarks targeting several different FPGA sizes. Our tool is able to automatically achieve over 95% of the target area budget on average while improving throughput. Our tool can also automatically satisfy a variety of throughput targets while minimizing the area cost. The proposed system saves up to 30% of the area cost compared to manually written implementations. Using the Inter-node Optimizer step, our heuristic tradeoff finder is able to hit the same throughput targets while saving 19% area on average compared to existing ILP approaches.

Third, an automated space/time tradeoff approach was used for implementing CV applications targeting an FPGA-based overlay architecture that combines a processor, a vector execution unit (SVP), and reconfigurable custom vector instructions. The SVP allows us to target more general applications, and coupling it with the FPGA fabric provides an environment to accelerate computationally intensive applications. Automated space/time scaling was used to automatically generate different VCIs for the SVP in order to use parallel resources more efficiently. Also, Partial Reconfiguration (PR) was used for implementing VCIs in order to time-share the parallel resources. Scratchpad bypassing capability was added to VCIs in order to improve performance. Moreover, node clustering and node combining approaches were used to avoid local memory accesses as much as possible. Last, pre-synthesized node fusion allows the heuristic approach to exploit additional optimization opportunities which are difficult to model or computationally infeasible in a classic ILP approach. The heuristic approach matches all the throughput results achieved by the ILP approach while using 9% less area on chip.
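To illustrate the throughput analysis and propagation mentioned in the first contribution, the sketch below propagates per-node rates through a small DAG to find the graph throughput and the bottleneck node that replication or splitting would target. It is a simplified illustration only: the node names and rates are invented, and the actual tool additionally models module selection, replication, and node combining/splitting.

# Simplified illustration of throughput propagation on a streaming task
# graph (a DAG): a node cannot produce faster than its own rate or the
# slowest of its producers, so the sink rate is the graph throughput and
# the minimum-rate node is the bottleneck to replicate or split.
# Node names and rates (pixels/cycle) are invented for this example.

from graphlib import TopologicalSorter

def propagate(rates, edges):
    preds = {n: set() for n in rates}
    for src, dst in edges:
        preds[dst].add(src)
    achieved = {}
    for n in TopologicalSorter(preds).static_order():
        upstream = min((achieved[p] for p in preds[n]), default=float("inf"))
        achieved[n] = min(rates[n], upstream)
    sinks = [n for n in rates if n not in {s for s, _ in edges}]
    bottleneck = min(rates, key=rates.get)
    return min(achieved[s] for s in sinks), bottleneck

rates = {"blur": 1.0, "sobel": 1.0, "magnitude": 0.25, "threshold": 1.0}
edges = [("blur", "sobel"), ("sobel", "magnitude"), ("magnitude", "threshold")]
print(propagate(rates, edges))   # (0.25, 'magnitude'): the bottleneck node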
Overall, the performance results achieve speedups far beyond what a plain SVP can accomplish. For example, an 8-lane SVP achieves a speedup of 5.3, whereas a VCI version is another 3.5 times faster, with a net speedup of 18.5 versus the ARM Cortex-A9 for the Canny-blur application. This was achieved by using automated space/time scaling, node clustering and dynamic PR (considering today's device restrictions). These speedup results can be improved significantly, to up to 106 times faster than the ARM, if the PR rate increases in the future.

The continuing trend toward designing larger pipelined architectures, as well as introducing more complex streaming applications, requires smarter and more capable approaches in order to find all degrees of parallelism in the application and use the available parallel on-chip resources efficiently. In this work, we proposed an environment in which users can explore automated space/time tradeoffs for STGs; however, we only addressed STGs with DAG topologies. Automated space/time tradeoffs for cyclic graphs remain to be investigated. Moreover, we believe our approach can be used not only for CV applications but also for a wide range of streaming applications, which we did not address because they were out of the scope of this study.

Bibliography

[1] Adapteva-Inc. Epiphany-iv 64-core 28nm microprocessor. 2014. URL http://www.adapteva.com/products/silicon-devices/e64g401/. → page 19
[2] P. J. Ashenden. The designer's guide to VHDL, volume 3. Morgan Kaufmann, 2010. → page 18
[3] Avnet-inc. Zedboard product briefs. 2017. URL http://www.zedboard.org/sites/default/files/product briefs/5066-PB-AES-Z7EV-7Z020-G-V1b.pdf. → page 71
[4] M. Awad. Fpga supercomputing platforms: a survey. In Field Programmable Logic and Applications, 2009. FPL 2009. International Conference on, pages 564–568. IEEE, 2009. → page 23
[5] E. Ayguadé and J. Torres. Partitioning the statement per iteration space using non-singular matrices. In Proceedings of the 7th international conference on Supercomputing, pages 407–415. ACM, 1993. → page 11
[6] D. F. Bacon, S. L. Graham, and O. J. Sharp. Compiler transformations for high-performance computing. ACM Computing Surveys (CSUR), 26(4):345–420, 1994. → pages 11, 31
[7] U. Banerjee. Unimodular transformations of double loops. University of Illinois at Urbana-Champaign, Center for Supercomputing Research and Development, 1990. → page 11
[8] U. Banerjee. Loop transformations for restructuring compilers: the foundations. Springer Science & Business Media, 2007. → pages 11, 31
[9] C. Beckhoff, D. Koch, and J. Torresen. Go ahead: A partial reconfiguration framework. In Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on, pages 37–44. IEEE, 2012. → page 78
[10] P. Bellows and B. Hutchings. Jhdl - an hdl for reconfigurable systems. In FPGAs for Custom Computing Machines, 1998. Proceedings. IEEE Symposium on, pages 175–184. IEEE, 1998. → page 15
[11] V. Betz and J. Rose. Vpr: A new packing, placement and routing tool for fpga research. In International Workshop on Field Programmable Logic and Applications, pages 213–222.
Springer, 1997. → page 65[12] M. Butts, A. M. Jones, and P. Wasson. A structural object programmingmodel, architecture, chip and tools for reconfigurable computing. InField-Programmable Custom Computing Machines, 2007. FCCM 2007.15th Annual IEEE Symposium on, pages 55–64. IEEE, 2007. → pagesxii, 2, 14, 20, 21, 22, 28[13] M. Butts, B. Budlong, P. Wasson, and E. White. Reconfigurable workfarms on a massively parallel processor array. In Field-ProgrammableCustom Computing Machines, 2008. FCCM’08. 16th InternationalSymposium on, pages 206–215. IEEE, 2008. → pages 2, 28[14] D. Callahan, K. Kennedy, et al. Automatic decomposition of scientificprograms for parallel execution. In Proceedings of the 14th ACMSIGACT-SIGPLAN symposium on Principles of programming languages,pages 63–76. ACM, 1987. → page 11[15] A. Canis, J. Choi, B. Fort, B. Syrowik, R. L. Lian, Y. T. Chen, H. Hsiao,J. Goeders, S. Brown, and J. Anderson. Legup high-level synthesis. InFPGAs for Software Programmers, pages 175–190. Springer, 2016. →page 14[16] E. Caspi, M. Chu, R. Huang, J. Yeh, J. Wawrzynek, and A. DeHon. Streamcomputations organized for reconfigurable execution (score). InInternational Workshop on Field Programmable Logic and Applications,pages 605–614. Springer, 2000. → page 15[17] D. Chen, J. Cong, P. Pan, et al. Fpga design automation: A survey.Foundations and Trends R© in Electronic Design Automation, 1(3):195–330,2006. → page 17[18] C. Claus, F. H. Muller, J. Zeppenfeld, and W. Stechele. A new frameworkto accelerate virtex-ii pro dynamic partial self-reconfiguration. In Paralleland Distributed Processing Symposium, 2007. IPDPS 2007. IEEEInternational, pages 1–7. IEEE, 2007. → page 26102[19] C. Claus, W. Stechele, M. Kovatsch, J. Angermeier, and J. Teich. Acomparison of embedded reconfigurable video-processing architectures. InField Programmable Logic and Applications, 2008. FPL 2008.International Conference on, pages 587–590. IEEE, 2008. → page 23[20] S. Commuri, V. Tadigotla, and L. Sliger. Task-based hardwarereconfiguration in mobile robots using fpgas. Journal of Intelligent andRobotic Systems, 49(2):111–134, 2007. → page 23[21] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang.High-level synthesis for fpgas: From prototyping to deployment. IEEETransactions on Computer-Aided Design of Integrated Circuits andSystems, 30(4):473–491, 2011. → page 14[22] J. Cong, M. Huang, B. Liu, P. Zhang, and Y. Zou. Combining moduleselection and replication for throughput-driven streaming programs. InDesign, Automation & Test in Europe Conference & Exhibition (DATE),2012, pages 1018–1023. IEEE, 2012. → pages 29, 37, 40, 51, 67[23] L. Daoud, D. Zydek, and H. Selvaraj. A survey of high level synthesislanguages, tools, and compilers for reconfigurable high performancecomputing. In Advances in Systems Science, pages 483–492. Springer,2014. → page 2[24] B. D. de Dinechin. Kalray mppa R©: Massively parallel processor array:Revisiting dsp acceleration with the kalray mppa manycore processor. InHot Chips 27 Symposium (HCS), 2015 IEEE, pages 1–27. IEEE, 2015. →page 19[25] R. H. Dennard, F. H. Gaensslen, V. L. Rideout, E. Bassous, and A. R.LeBlanc. Design of ion-implanted mosfet’s with very small physicaldimensions. IEEE Journal of Solid-State Circuits, 9(5):256–268, 1974. →page 1[26] F. Dittmann and S. Frank. Caching in real-time reconfiguration portscheduling. In Field Programmable Logic and Applications, 2007. FPL2007. International Conference on, pages 740–744. IEEE, 2007. → page25[27] A. Duller, G. 
Panesar, and D. Towner. Parallel processing-the picochipway. Communicating Processing Architectures, 2003:125–138, 2003. →page 18103[28] E. El-Araby, I. Gonzalez, and T. El-Ghazawi. Exploiting partial runtimereconfiguration for high-performance reconfigurable computing. ACMTransactions on Reconfigurable Technology and Systems (TRETS), 1(4):21,2009. → page 23[29] P. Feautrier. Some efficient solutions to the affine scheduling problem. i.one-dimensional time. International journal of parallel programming, 21(5):313–347, 1992. → page 13[30] K. Gilles. The semantics of a simple language for parallel programming.Information processing, 74:471–475, 1974. → pages 15, 28[31] M. Gokhale, J. Stone, J. Arnold, and M. Kalinowski. Stream-oriented fpgacomputing in the streams-c high level language. In Field-ProgrammableCustom Computing Machines, 2000 IEEE Symposium on, pages 49–56.IEEE, 2000. → page 14[32] M. B. Gokhale and J. M. Stone. Napa c: Compiling for a hybrid risc/fpgaarchitecture. In FPGAs for Custom Computing Machines, 1998.Proceedings. IEEE Symposium on, pages 126–135. IEEE, 1998. → page 14[33] M. I. Gordon, W. Thies, M. Karczmarek, J. Lin, A. S. Meli, A. A. Lamb,C. Leger, J. Wong, H. Hoffmann, D. Maze, et al. A stream compiler forcommunication-exposed architectures. In ACM SIGOPS OperatingSystems Review, volume 36, pages 291–303. ACM, 2002. → page 46[34] J. Gray. Grvi phalanx: A massively parallel risc-v fpga acceleratoraccelerator. In Field-Programmable Custom Computing Machines(FCCM), 2016 IEEE 24th Annual International Symposium on, pages17–20. IEEE, 2016. → pages 7, 48[35] S. G. Hansen, D. Koch, and J. Torresen. High speed partial run-timereconfiguration using enhanced icap hard macro. In Parallel andDistributed Processing Workshops and Phd Forum (IPDPSW), 2011 IEEEInternational Symposium on, pages 174–180. IEEE, 2011. → pages 26, 94[36] S. Hauck. Configuration prefetch for single context reconfigurablecoprocessors. In Proceedings of the 1998 ACM/SIGDA sixth internationalsymposium on Field programmable gate arrays, pages 65–74. ACM, 1998.→ page 25[37] J. Hegarty, J. Brunhaver, Z. DeVito, J. Ragan-Kelley, N. Cohen, S. Bell,A. Vasilyev, M. Horowitz, and P. Hanrahan. Darkroom: Compiling104high-level image processing code into hardware pipelines. ACM Trans.Graph., pages 144:1–144:11, 2014. → pages 51, 77[38] J. Hegarty, R. Daly, Z. DeVito, J. Ragan-Kelley, M. Horowitz, andP. Hanrahan. Rigel: Flexible multi-rate image processing hardware. ACMTrans. Graph., pages 85:1–85:11, 2016. → pages 51, 77[39] M. Huebner, M. Ullmann, F. Weissel, and J. Becker. Real-timeconfiguration code decompression for dynamic fpga self-reconfiguration.In Parallel and Distributed Processing Symposium, 2004. Proceedings.18th International, page 138. IEEE, 2004. → page 26[40] B. Hutchings, B. Nelson, S. West, and R. Curtis. Optical flow on theambric massively parallel processor array (mppa). In Field ProgrammableCustom Computing Machines, 2009. FCCM’09. 17th IEEE Symposium on,pages 141–148. IEEE, 2009. → pages 2, 23[41] Intel-inc. Overview for the stratix iv device family. 2016. URL https://www.altera.com/en US/pdfs/literature/hb/stratix-iv/stx4 siv51001.pdf. →page 65[42] C. Kao. Benefits of partial reconfiguration. Xcell journal, 55:65–67, 2005.→ page 23[43] R. Kastner, J. Matai, and S. Neuendorffer. Parallel programming for fpgas.arXiv preprint arXiv:1805.03648, 2018. → pages 5, 49[44] K. Kennedy and K. S. McKinley. Optimizing for parallelism and datalocality. 
In Proceedings of the 6th international conference onSupercomputing, pages 323–334. ACM, 1992. → page 11[45] Khronos-Group. Openvx. 2017. URL https://www.khronos.org/openvx/.→ page 51[46] D. W. Knapp. Behavioral synthesis: digital system design using thesynopsys behavioral compiler. Prentice Hall PTR, 1996. → page 49[47] D. Koch. Partial reconfiguration on FPGAs: architectures, tools andapplications, volume 153. Springer Science & Business Media, 2012. →pages xii, 24, 25[48] D. Koch, C. Beckhoff, and J. Teich. Bitstream decompression for highspeed fpga configuration from slow memories. In Field-ProgrammableTechnology, 2007. ICFPT 2007. International Conference on, pages161–168. IEEE, 2007. → page 26105[49] H. Kooti and E. Bozorgzadeh. Reconfiguration-aware task graphscheduling. In Embedded and Ubiquitous Computing (EUC), 2015 IEEE13th International Conference on, pages 163–167. IEEE, 2015. → page 87[50] B. Krill, A. Ahmad, A. Amira, and H. Rabah. An efficient fpga-baseddynamic partial reconfiguration design flow and environment for image andsignal processing ip cores. Signal Processing: Image Communication, 25(5):377–387, 2010. → page 23[51] I. Kuon, R. Tessier, and J. Rose. Fpga architecture: Survey and challenges.Foundations and trends in electronic design automation, 2(2):135–253,2008. → page 16[52] S. Lange and M. Middendorf. Hyperreconfigurable architectures for fastrun time reconfiguration. In Field-Programmable Custom ComputingMachines, 2004. FCCM 2004. 12th Annual IEEE Symposium on, pages304–305. IEEE, 2004. → page 26[53] E. A. Lee and D. G. Messerschmitt. Synchronous data flow. Proceedings ofthe IEEE, pages 1235–1245, 1987. → pages 15, 23[54] Z. Li and S. Hauck. Don’t care discovery for fpga configurationcompression. In Proceedings of the 1999 ACM/SIGDA seventhinternational symposium on Field programmable gate arrays, pages 91–98.ACM, 1999. → page 26[55] A. W. Lim and M. S. Lam. Maximizing parallelism and minimizingsynchronization with affine transforms. In Proceedings of the 24th ACMSIGPLAN-SIGACT Symposium on Principles of Programming Languages,POPL ’97, pages 201–214, 1997. ISBN 0-89791-853-3. → page 51[56] A. W. Lim and M. S. Lam. Maximizing parallelism and minimizingsynchronization with affine transforms. In Proceedings of the 24th ACMSIGPLAN-SIGACT symposium on Principles of programming languages,pages 201–214. ACM, 1997. → pages 11, 31[57] D. Liu and B. C. Schafer. Efficient and reliable high-level synthesis designspace explorer for fpgas. In 2016 26th International Conference on FieldProgrammable Logic and Applications (FPL), pages 1–8, 2016. → page 52[58] M. Liu, W. Kuehn, Z. Lu, and A. Jantsch. Run-time partial reconfigurationspeed investigation and architectural design space exploration. In FieldProgrammable Logic and Applications, 2009. FPL 2009. InternationalConference on, pages 498–502. IEEE, 2009. → page 26106[59] J. Luu, J. Goeders, M. Wainberg, A. Somerville, T. Yu, K. Nasartschuk,M. Nasr, S. Wang, T. Liu, N. Ahmed, K. B. Kent, J. Anderson, J. Rose, andV. Betz. Vtr 7.0: Next generation architecture and cad system for fpgas.ACM Trans. Reconfigurable Technol. Syst., pages 6:1–6:30, 2014. → page65[60] A. Makhorin. Glpk (gnu linear programming kit). http://www. gnu.org/software/glpk/, 2008. → pages 37, 48, 87[61] P. Manet, D. Maufroid, L. Tosi, G. Gailliard, O. Mulertt, M. Di Ciano, J.-D.Legat, D. Aulagnier, C. Gamrat, R. Liberati, et al. An evaluation ofdynamic partial reconfiguration for signal and image processing inprofessional electronics applications. 
EURASIP Journal on EmbeddedSystems, 2008:1, 2008. → page 26[62] G. Martin and G. Smith. High-level synthesis: Past, present, and future.IEEE Design & Test of Computers, 26(4):18–25, 2009. → page 2[63] O. Mencer, M. Morf, and M. J. Flynn. Pam-blox: High performance fpgadesign for adaptive computing. In FPGAs for Custom ComputingMachines, 1998. Proceedings. IEEE Symposium on, pages 167–174. IEEE,1998. → page 14[64] R. Nikhil. Bluespec system verilog: efficient, correct rtl from high levelspecifications. In Formal Methods and Models for Co-Design, 2004.MEMOCODE’04. Proceedings. Second ACM and IEEE InternationalConference on, pages 69–70. IEEE, 2004. → page 15[65] H. Omidian and G. G. Lemieux. Janus: A compilation system for balancingparallelism and performance in openvx. In Journal of Physics: ConferenceSeries, volume 1004, page 012011. IOP Publishing, 2018. → page vi[66] H. Omidian and G. G. F. Lemieux. Automated space/time scaling ofstreaming task graph. International Workshop on Overlay Architectures forFPGA (OLAF), abs/1606.03717, 2016. → page vi[67] H. Omidian and G. G. F. Lemieux. Exploring automated space/timetradeoffs for openvx compute graphs. In 2017 International Conference onField-Programmable Technology (FPT), Dec 2017. → page vi[68] J. Parkhurst, J. Darringer, and B. Grundmann. From single core tomulti-core: preparing for a new exponential. In Proceedings of the 2006IEEE/ACM international conference on Computer-aided design, pages67–72. ACM, 2006. → page 1107[69] K. Paulsson, M. Hu¨bner, S. Bayar, and J. Becker. Exploitation of run-timepartial reconfiguration for dynamic power management in xilinx spartaniii-based systems. In ReCoSoC, pages 1–6, 2007. → page 23[70] J. Pu, S. Bell, X. Yang, J. Setter, S. Richardson, J. Ragan-Kelley, andM. Horowitz. Programming heterogeneous systems from an imageprocessing dsl. ACM Trans. Archit. Code Optim., pages 26:1–26:25, 2017.URL http://doi.acm.org/10.1145/3107953. → page 77[71] S. Rixner, W. J. Dally, U. J. Kapasi, B. Khailany, A. Lo´pez-Lagunas, P. R.Mattson, and J. D. Owens. A bandwidth-efficient architecture for mediaprocessing. In Proceedings of the 31st annual ACM/IEEE internationalsymposium on Microarchitecture, pages 3–13. IEEE Computer SocietyPress, 1998. → page 1[72] M. Saldan˜a, A. Patel, C. Madill, D. Nunes, D. Wang, P. Chow, R. Wittig,H. Styles, and A. Putnam. Mpi as a programming model forhigh-performance reconfigurable computers. ACM Transactions onReconfigurable Technology and Systems (TRETS), 3(4):22, 2010. → page23[73] V. Sarkar and R. Thekkath. A general framework for iteration-reorderingloop transformations. SIGPLAN Not., pages 175–187, 1992. → page 51[74] V. Sarkar and R. Thekkath. A general framework for iteration-reorderingloop transformations. In ACM SIGPLAN Notices, volume 27, pages175–187. ACM, 1992. → pages 11, 31[75] A. Severance and G. G. F. Lemieux. Embedded supercomputing in fpgaswith the vectorblox mxp matrix processor. In 2013 InternationalConference on Hardware/Software Codesign and System Synthesis(CODES+ISSS), pages 1–10, 2013. → page 78[76] A. Severance, J. Edwards, H. Omidian, and G. Lemieux. Soft vectorprocessors with streaming pipelines. In Proceedings of the 2014ACM/SIGDA International Symposium on Field-programmable GateArrays, FPGA ’14, pages 117–126, 2014. → page 74[77] A. Singh, G. Parthasarathy, and M. Marek-Sadowska. Efficient circuitclustering for area and power reduction in fpgas. ACM Transactions onDesign Automation of Electronic Systems (TODAES), 7(4):643–663, 2002.→ page 32108[78] G. 
Tagliavini, G. Haugou, A. Marongiu, and L. Benini. Adrenaline: Anopenvx environment to optimize embedded vision applications onmany-core accelerators. In 2015 IEEE 9th International Symposium onEmbedded Multicore/Many-core Systems-on-Chip, pages 289–296, 2015.→ pages 51, 77[79] S. Taheri, J. Heo, P. Behnam, P. Veidenbaum, and A. Nicolau. Accelerationframework for fpga implementation of openvx graph pipelines. Technicalreport, Center for Embedded and Cyber-Physical Systems, University ofCalifornia, Irvine, 2018. → page 52[80] R. Tessier, K. Pocek, and A. DeHon. Reconfigurable computingarchitectures. Proceedings of the IEEE, 103(3):332–354, 2015. → page 13[81] W. Thies, M. Karczmarek, and S. Amarasinghe. Streamit: A language forstreaming applications. In International Conference on CompilerConstruction, pages 179–196. Springer, 2002. → page 46[82] M. F. Tompkins. Optimization techniques for task allocation andscheduling in distributed multi-agent operations. PhD thesis,Massachusetts Institute of Technology, 2003. → page 87[83] S. Toscher, T. Reinemann, and R. Kasper. An adaptive fpga-basedmechatronic control system supporting partial reconfiguration of controllerfunctionalities. In Adaptive Hardware and Systems, 2006. AHS 2006. FirstNASA/ESA Conference on, pages 225–228. IEEE, 2006. → page 23[84] S. Trimberger, D. Carberry, A. Johnson, and J. Wong. A time-multiplexedfpga. In Field-Programmable Custom Computing Machines, 1997.Proceedings., the 5th Annual IEEE Symposium on, pages 22–28. IEEE,1997. → pages 26, 94[85] J. L. Tripp, M. B. Gokhale, and K. D. Peterson. Trident: From high-levellanguage to hardware circuitry. Computer, 40(3), 2007. → page 14[86] D. N. Truong, W. H. Cheng, T. Mohsenin, Z. Yu, A. T. Jacobson,G. Landge, M. J. Meeuwsen, C. Watnik, A. T. Tran, Z. Xiao, et al. A167-processor computational platform in 65 nm cmos. IEEE Journal ofSolid-State Circuits, 44(4):1130–1144, 2009. → page 19[87] J. Villarreal, A. Park, W. Najjar, and R. Halstead. Designing modularhardware accelerators in c with roccc 2.0. In Field-Programmable CustomComputing Machines (FCCM), 2010 18th IEEE Annual InternationalSymposium on, pages 127–134. IEEE, 2010. → page 14109[88] G. K. Wallace. The jpeg still picture compression standard. IEEEtransactions on consumer electronics, 38(1):xviii–xxxiv, 1992. → pages47, 56[89] M. Wazlowski, L. Agarwal, T. Lee, A. Smith, E. Lam, P. Athanas,H. Silverman, and S. Ghosh. Prism-ii compiler and architecture. In FPGAsfor Custom Computing Machines, 1993. Proceedings. IEEE Workshop on,pages 9–16. IEEE, 1993. → page 14[90] L. Wirbel. Ambric lives on, in a parallel universe, 2011. URLhttps://www.edn.com/electronics-blogs/fpga-gurus/4409421/Ambric-Lives-On-in-a-Parallel-Universe. → page 21[91] D. Wo and K. Forward. Compiling to the gate level for a reconfigurableco-processor. In FPGAs for Custom Computing Machines, 1994.Proceedings. IEEE Workshop on, pages 147–154. IEEE, 1994. → page 14[92] M. E. Wolf. Improving locality and parallelism in nested loops. PhDthesis, Citeseer, 1992. → page 11[93] Z. Xiao, D. Koch, and M. Lujan. A partial reconfiguration controller foraltera stratix v fpgas. In Field Programmable Logic and Applications(FPL), 2016 26th International Conference on, pages 1–4. IEEE, 2016. →page 93[94] Xilinx-inc. Axi4 stream interconnect. 2017. URL https://www.xilinx.com/products/intellectual-property/axi4-stream interconnect.html. → page 52[95] Xilinx-inc. revision enables responsive and reconfigurable vision systems.2018. 
URLhttps://www.xilinx.com/products/design-tools/embedded-vision-zone.html.→ page 50[96] Xilinx-inc. Vivado high-level synthesis. 2018. URL https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html.→ pages 5, 49, 50[97] Xilinx-inc. 7 series fpgas data sheet: Overview. 2018. URLhttps://www.xilinx.com/support/documentation/data sheets/ds180 7Series Overview.pdf. → page 66[98] Xilinx-inc. Dma v7. 1, logicore ip product guide, vivado design suite,2018. 2018. URL https://www.xilinx.com/products/intellectual-property/axi4-stream interconnect.html. → page 56110[99] Y. Yankova, G. Kuzmanov, K. Bertels, G. Gaydadjiev, Y. Lu, andS. Vassiliadis. Dwarv: Delftworkbench automated reconfigurable vhdlgenerator. In Field Programmable Logic and Applications, 2007. FPL2007. International Conference on, pages 697–701. IEEE, 2007. → page14[100] Z. Yu, M. Meeuwsen, R. Apperson, O. Sattari, M. Lai, J. Webb, E. Work,T. Mohsenin, M. Singh, and B. Baas. An asynchronous array of simpleprocessors for dsp applications. In Solid-State Circuits Conference, 2006.ISSCC 2006. Digest of Technical Papers. IEEE International, pages1696–1705. IEEE, 2006. → pages 18, 19[101] G. Zhong, V. Venkataramani, Y. Liang, T. Mitra, and S. Niar. Design spaceexploration of multiple loops on fpgas using high level synthesis. In 2014IEEE 32nd International Conference on Computer Design (ICCD), pages456–463, 2014. → page 52111