Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Broadening the applicability of FPGA-based soft vector processors Severance, Aaron 2015

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


24-ubc_2015_may_severance_aaron.pdf [ 2.64MB ]
JSON: 24-1.0167154.json
JSON-LD: 24-1.0167154-ld.json
RDF/XML (Pretty): 24-1.0167154-rdf.xml
RDF/JSON: 24-1.0167154-rdf.json
Turtle: 24-1.0167154-turtle.txt
N-Triples: 24-1.0167154-rdf-ntriples.txt
Original Record: 24-1.0167154-source.json
Full Text

Full Text

Broadening the Applicability of FPGA-basedSoft Vector ProcessorsbyAaron SeveranceBachelor of Science in Computer Engineering, Boston University, 2004Master of Science in Electrical Engineering, University of Washington, 2006A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFDoctor of PhilosophyinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Electrical and Computer Engineering)The University of British Columbia(Vancouver)March 2015c© Aaron Severance, 2015AbstractA soft vector processor (SVP) is an overlay on top of FPGAs that allows data-parallel algorithms to be written in software rather than hardware, and yet stillachieve hardware-like performance. This ease of use comes at an area and speedpenalty, however. Also, since the internal design of SVPs are based largely oncustom CMOS vector processors, there is additional opportunity for FPGA-specificoptimizations and enhancements.This thesis investigates and measures the effects of FPGA-specific changesto SVPs that improve performance, reduce area, and improve ease-of-use; therebyexpanding their useful range of applications. First, we address applications needingonly moderate performance such as audio filtering where SVPs need only a smallnumber (one to four) of parallel ALUs. We make implementation and ISA designdecisions around the goals of producing a compact SVP that effectively utilizesexisting BRAMs and DSP Blocks. The resulting VENICE SVP has 2× betterperformance per logic block than previous designs.Next, we address performance issues with algorithms where some vector ele-ments ‘exit early’ while others need additional processing. Simple vector predica-tion causes all elements to exhibit ‘worst case’ performance. Density time masking(DTM) improves performance of such algorithms by skipping the completed ele-ments when possible, but traditional implementations of DTM are coarse-grainedand do not map well to the FPGA fabric. We introduce a BRAM-based implemen-tation that achieves 3.2× higher performance over the base SVP with less than 5%area overhead.Finally, we identify a way to exploit the raw performance of the underlyingFPGA fabric by attaching wide, deeply pipelined computational units to SVPsiithrough a custom instruction interface. We support multiple inputs and outputs,arbitrary-length pipelines, and heterogeneous lanes to allow streaming of datathrough large operator graphs. As an example, on an n-body simulation problem,we show that custom instructions achieve 120× better performance per area thanthe base SVP.iiiPrefaceThe following publications have been adapted for inclusion in this dissertation:• VENICE: A Compact Vector Processor for FPGA Applications [57]Published in the 2012 International Conference on Field-ProgrammableTechnology (FPT 2012). Authored by Aaron Severance and Guy Lemieux.Appears in Chapter 3.I performed all of the design and implementation and did the primary bench-marking and analysis presented in this paper. Guy Lemieux served in anadvisory fashion. A separate work [46] detailing a compiler for VENICEwas published earlier; the included paper is a distinct contribution detailingthe design and architecture of the VENICE SVP.• Wavefront Skipping using BRAMs for Conditional Algorithms on Vec-tor Processors [61]Published in the 2015 ACM/SIGDA International Symposium on Field-programmable Gate Arrays (FPGA 2015). Authored by Aaron Severance,Joe Edwards, and Guy G.F. Lemieux. Appears in Chapter 4.I performed the design, implementation, and benchmarking in this paperwith the following exceptions:– Joe Edwards developed the scalar and basic vector (without density-time masking) code for Viola-Jones face detection and FAST9 featuredetection.– Guy Lemieux developed the scalar and basic vector (without density-time masking) code for Mandelbrot set generation.ivGuy Lemieux also served in an advisory fashion.• Soft Vector Processors with Streaming Pipelines [60]Published in the 2014 ACM/SIGDA International Symposium on Field-programmable Gate Arrays (FPGA 2014). Authored by Aaron Severance,Joe Edwards, Hossein Omidian, Guy Lemieux. Appears in Chapter 5.I performed the design, implementation, and benchmarking done in this pa-per with the following exceptions:– Guy Lemieux and Joe Edwards developed the scalar and basic vector(without custom instructions) code for the N-Body problem.– Hossein Omidian designed and implemented the high-level synthesisapproach for creating custom instructions. His contribution to the paperwas removed from the body of this dissertation.Guy Lemieux also served in an advisory fashion.vTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xGlossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Soft Vector Processors . . . . . . . . . . . . . . . . . . . . . . . 41.3 SVP Weaknesses . . . . . . . . . . . . . . . . . . . . . . . . . . 51.4 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.5 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.6 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71.7 Dissertation Organization . . . . . . . . . . . . . . . . . . . . . . 92 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.1 FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.1.1 Architecture and Design Flow . . . . . . . . . . . . . . . 11vi2.2 Overlays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142.3 Vector Processing and SIMD Overview . . . . . . . . . . . . . . 152.4 Soft Vector Processors (SVPs) . . . . . . . . . . . . . . . . . . . 192.5 Divergent Control Flow . . . . . . . . . . . . . . . . . . . . . . . 232.5.1 Execution Pipeline Customization . . . . . . . . . . . . . 273 VENICE: Optimizing for Small but Capable . . . . . . . . . . . . . 293.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293.2 Design and Architecture . . . . . . . . . . . . . . . . . . . . . . 313.3 VENICE Implementation . . . . . . . . . . . . . . . . . . . . . . 323.3.1 Removal of Vector Address Register File . . . . . . . . . 323.3.2 2D and 3D Vector Instructions . . . . . . . . . . . . . . . 343.3.3 Operations on Unaligned Vectors . . . . . . . . . . . . . 353.3.4 New Vector Conditional Operations . . . . . . . . . . . . 363.3.5 Streamlined Instruction Set . . . . . . . . . . . . . . . . . 373.3.6 FPGA Architecture-Specific Optimizations . . . . . . . . 373.4 Native Programming Interface . . . . . . . . . . . . . . . . . . . 393.5 Evalution Results . . . . . . . . . . . . . . . . . . . . . . . . . . 423.5.1 Area and Clock Frequency . . . . . . . . . . . . . . . . . 423.5.2 Benchmark Performance . . . . . . . . . . . . . . . . . . 443.5.3 Speedup versus Area . . . . . . . . . . . . . . . . . . . . 463.5.4 Case Study: DCT . . . . . . . . . . . . . . . . . . . . . . 483.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504 Wavefront Skipping on Soft Vector Processors . . . . . . . . . . . . 514.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514.2 BRAM Based Wavefront Skipping . . . . . . . . . . . . . . . . . 534.2.1 Full Wavefront Skipping . . . . . . . . . . . . . . . . . . 544.2.2 Wavefront Partitioning . . . . . . . . . . . . . . . . . . . 584.2.3 Application Example: Viola-Jones Face Detection . . . . 594.2.4 Comparison with Vector Compress . . . . . . . . . . . . 624.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 634.3.1 Area Results . . . . . . . . . . . . . . . . . . . . . . . . 63vii4.3.2 BRAM Usage . . . . . . . . . . . . . . . . . . . . . . . . 644.3.3 Mandelbrot Benchmark . . . . . . . . . . . . . . . . . . 664.3.4 FAST9 Feature Detection . . . . . . . . . . . . . . . . . 674.3.5 Viola-Jones Face Detection . . . . . . . . . . . . . . . . . 684.3.6 MMVL Tradeoffs . . . . . . . . . . . . . . . . . . . . . . 694.3.7 Results Summary . . . . . . . . . . . . . . . . . . . . . . 724.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 735 Attaching Streaming Pipelines to Soft Vector Processors . . . . . . . 755.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 755.2 Custom Vector Instructions (CVIs) . . . . . . . . . . . . . . . . . 785.2.1 CVI Design Approach . . . . . . . . . . . . . . . . . . . 785.2.2 CVI Interface . . . . . . . . . . . . . . . . . . . . . . . . 795.2.3 Large Operator Support . . . . . . . . . . . . . . . . . . 815.2.4 CVIs with Deep Pipelines . . . . . . . . . . . . . . . . . 835.3 Multi-Operand CVI . . . . . . . . . . . . . . . . . . . . . . . . . 845.3.1 N-Body Problem . . . . . . . . . . . . . . . . . . . . . . 845.3.2 Multi-Operand CVI Dispatch . . . . . . . . . . . . . . . 865.3.3 Face Detection CVI Example . . . . . . . . . . . . . . . 905.4 CVI Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . 915.5 CVI Design Methodologies . . . . . . . . . . . . . . . . . . . . . 925.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 935.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 966 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 986.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 996.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1006.2.1 Initial Work . . . . . . . . . . . . . . . . . . . . . . . . . 1006.2.2 Long Term Directions . . . . . . . . . . . . . . . . . . . 1016.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104viiiList of TablesTable 3.1 Resource Usage Comparison . . . . . . . . . . . . . . . . . . 42Table 3.2 Area Breakdown (ALMs) . . . . . . . . . . . . . . . . . . . . 43Table 3.3 Benchmark Properties . . . . . . . . . . . . . . . . . . . . . . 45Table 3.4 Benchmark Performance . . . . . . . . . . . . . . . . . . . . 45Table 4.1 Resource Usage . . . . . . . . . . . . . . . . . . . . . . . . . 63Table 5.1 Results with MXP Compared to Nios II/f, Intel, and ARM Pro-cessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94ixList of FiguresFigure 1.1 An FPGA System with a Soft Processor . . . . . . . . . . . . 2Figure 1.2 Design Flow using a Soft Processor . . . . . . . . . . . . . . 3Figure 2.1 Example FPGA Architecture . . . . . . . . . . . . . . . . . . 12Figure 2.2 Data Parallel Execution on Different Architectures . . . . . . 16Figure 2.3 Psuedocode for Different Parallel Paradigms . . . . . . . . . 17Figure 2.4 The VEGAS Soft Vector Architecture [15] . . . . . . . . . . 20Figure 2.5 Psuedocode for Divergent Control Flow . . . . . . . . . . . . 23Figure 2.6 Compress and Gather Operations . . . . . . . . . . . . . . . 24Figure 3.1 VENICE Architecture . . . . . . . . . . . . . . . . . . . . . 31Figure 3.2 VEGAS Code (Requiring Vector Address Register Setting) vsVENICE Code . . . . . . . . . . . . . . . . . . . . . . . . . 33Figure 3.3 Example of Misaligned Operation . . . . . . . . . . . . . . . 35Figure 3.4 Fracturable Multiplier Styles . . . . . . . . . . . . . . . . . . 38Figure 3.5 VENICE Code to Add 3 Vectors . . . . . . . . . . . . . . . . 39Figure 3.6 FIR Kernel Using 2D Vector Instructions . . . . . . . . . . . 41Figure 3.7 Area Breakdown (ALMs) . . . . . . . . . . . . . . . . . . . 44Figure 3.8 Speedup (Geometric Mean of 9 Benchmarks) vs Area Scaling 46Figure 3.9 Computational Density with V1 SVPs . . . . . . . . . . . . . 47Figure 3.10 16-bit 4x4 DCT Varying 2D/3D Dispatch, SVP Width, and #of SVPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49Figure 4.1 Wavefront Skipping on a 4 Lane SVP . . . . . . . . . . . . . 52Figure 4.2 VectorBlox MXP with Four 32-bit Lanes . . . . . . . . . . . 54xFigure 4.3 Code Example: Double Every Fifth Element of a Vector . . . 55Figure 4.4 Data Written to Mask BRAM (Every Fifth Element Valid) . . 57Figure 4.5 Pseudocode for Viola-Jones Face Detection . . . . . . . . . . 59Figure 4.6 Haar Features Calculated at Each Candidate Face Locationfor Different Groupings (Percentage Work Done is CalculatedRelative to Minimum) . . . . . . . . . . . . . . . . . . . . . 60Figure 4.7 Vector Compress Operations Needed for 5x1 Stencil Filter . . 62Figure 4.8 BRAM Usage vs Wavefront Partitions (MMVL = 1024) . . . 65Figure 4.9 Mandelbrot Benchmark . . . . . . . . . . . . . . . . . . . . . 66Figure 4.10 FAST9 Feature Detection . . . . . . . . . . . . . . . . . . . . 67Figure 4.11 Viola-Jones Face Detection Speedup . . . . . . . . . . . . . . 69Figure 4.12 Viola-Jones Face Detection Speedup Vs Area . . . . . . . . . 70Figure 4.13 BRAM Usage for Masks When Varying MMVL for 1 (top)and 4 (bottom) Partitions . . . . . . . . . . . . . . . . . . . . 71Figure 4.14 Effect of Changing MMVL on Viola-Jones Face Detection . . 72Figure 4.15 Speedup from Wavefront Skipping . . . . . . . . . . . . . . . 73Figure 4.16 Speedup per Area (eALMs) from Wavefront Skipping . . . . 74Figure 5.1 Internal View of VectorBlox MXP . . . . . . . . . . . . . . . 77Figure 5.2 Examples of Custom Vector Instructions . . . . . . . . . . . 80Figure 5.3 Custom Vector Instructions With Fewer Lanes Than the SVP 82Figure 5.4 Force Summation Pipeline . . . . . . . . . . . . . . . . . . . 85Figure 5.5 Multi-Operand Custom Vector Instructions . . . . . . . . . . 87Figure 5.6 Using 3D Vector Operations for Multi-Operand Dispatch . . 88Figure 5.7 Multi-Operand Custom Vector Instruction Funnel Adapters . 89Figure 5.8 Face Detection Pipeline . . . . . . . . . . . . . . . . . . . . 90Figure 5.9 FLOAT Custom Vector Pipeline in Altera’s DSP Builder . . . 93Figure 5.10 Area of Gravity Pipeline Systems . . . . . . . . . . . . . . . 94Figure 5.11 Performance and Performance-per-Area of Gravity Pipeline . 95xiGlossaryThe following acronyms are used in this dissertation:ALM adaptive logic moduleALU arithmetic logic unitAoS array of structuresAPI application programming interfaceASIC application specific integrated circuitBRAM block RAMCAD computer aided designCAM content addressable memoryCGRA coarse-grained reconfigurable arrayCMOS complementary metal-oxide semiconductorCMOV conditional moveCPU central processing unitCVI custom vector instructionDCT discrete cosine transformDMA direct memory accessxiiDLP data level parallelismDSP digitial signal processingDFG data flow graphDRAM dynamic random-access memoryDSPBA Altera’s DSP Builder Advanced Blockset for SimulinkDTM density-time maskingeALM equivalent ALMESL electronic system levelFF flip-flopFFT fast Fourier transformFIFO first-in first-outFIR finite impulse responsefmax maximum operating frequencyFPGA field-programmable gate arrayFU functional unitGDB the GNU debuggerGUI graphical user interfaceGPU graphics processing unitHDL hardware description languageHLS high level synthesisIDE integrated development environmentIC integrated circuitxiiiILP instruction level parallelismIP intellectual propertyISA instruction set architectureLAB logic array blockLUT look-up tableMAC multiply accumulateMMVL maximum masked vector lengthMSB most significant bitMVL maximum vector lengthMXP A SVP from VectorBlox Computing, Inc. used in this dissertationNOP no operationOoOE out-of-order executionPC program counterPE processing elementPVFB pending vector fragment bufferRTL register transfer levelSAD sum of absolute differencesSIMD single-instruction multiple-dataSIMT single-instruction multiple-threadSoA structure of arraysSoC system on chipSRAM static RAMxivSVP soft vector processorVARF vector address register fileVEGAS A second generation SVP from The University of British ColumbiaVENICE A new SVP that is part of this dissertationVESPA A first generation SVP from University of TorontoVIRAM The VIRAM project; a hard vector processor from BerkeleyVIPERS A first generation SVP from The University of British ColumbiaVL vector lengthVLIW very-long instruction wordVRF vector register filexvAcknowledgmentsPapa, thanks for the support (moral and otherwise) and advice. Mama, thanks forhelping me through the tough times. Sister, thanks for being a tiny potato. Sean,you are a great sounding board and full of useful advice. Thanks for helping mefind reasons to try.Thanks Guy; you were the impetus for this whole thing and were there thewhole way for me. Steve, thanks for taking over; you’re a nice guy to a fault. Joe,you showed me it was good to be excited (and your work on the benchmarks wasgreat). Hossein, thanks for your small part in this; you’ll go on to great thingsI’m sure. Alex, Ameer, Chris, Chris, Dave, Doug, Keith, Mike, Tom, Usman, andZhiduo, it was great working with you.Thanks to NSERC, MITACS, and VectorBlox Computing for funding, and toAltera for donating hardware and software licenses.xviTo whom it may concern.xviiChapter 1IntroductionWhat we call the beginning is often the end. And to make an end is tomake a beginning. The end is where we start from. — T. S. Eliot1.1 MotivationA field-programmable gate array (FPGA) is a configurable logic device that canbe programmed at bit-level granularity. Instead of running a sequential or parallelsoftware program like a central processing unit (CPU) or graphics processing unit(GPU), it is programmed with a bitstream that tells its logic and interconnect how tobehave. This extremely fine granularity allows FPGAs to expose massive amountsof parallelism to the designer and provide better performance and efficiency thanCPUs and GPUs on certain applications [13]. At the same time, this flexibilityalso makes it difficult for applications programmers to implement algorithms onFPGAs.Since FPGAs can be used to emulate digital circuits, they are commonly pro-grammed the same way: using a register transfer level (RTL) description. RTLis a textual description of a circuit in which behavioral and structural statementsdescribe elements which execute in parallel and communicate through signals. Forprogrammers familiar with the sequential computing model used for software (ontop of which parallelism can be added, but at a much coarser granularity) RTL de-sign can be difficult to learn and understand. Even for programmers accustomed1FPGASoft VectorProcessorOther LogicMemory Controllers, Custom RTL, Etc.Synthesis ParametersMemory Size, ALUs, Etc.MemoryBus/NoCCustomInstructionsVectorizedSoftware ProgramFigure 1.1: An FPGA System with a Soft Processorto writing RTL, the design process is time-consuming. An RTL design will takefrom several minutes to many hours to synthesize for an FPGA depending on itssize and complexity, versus seconds to compile software. Additionally debug ismore difficult for RTL designs; there is limited visibility when they are imple-mented on an FPGA, and simulations run at several orders of magnitude slowerthan the actual design. An additional complexity is that modern FPGAs are nothomogenous; in addition to configurable logic there are hardened memories (blockRAMs (BRAMs)) and arithmetic units (digitial signal processing (DSP) blocks)that the programmer must utilize to take full advantage of the FPGA.A goal for researchers and industry has thus become to achieve close to soft-ware levels of productivity for FPGA design. There are many approaches to this,with some trying to improve productivity by improving synthesis speed [47, 69]or debug visibility [26], some trying to make RTL more software-like [7, 51], andsome using a software program as input to generate RTL [11, 19]. This workconcentrates on soft processors: processor-style intellectual property (IP) blocksinstantiated on FPGAs that offer a more software-like design flow.Figure 1.1 shows a conceptual view of a soft processor. It has three levels:the FPGA hardware, the soft processor (which is implemented on the FPGA hard-ware), and the user program (which runs on the soft processor). Soft processorsbring the afforementioned benefits of software programmability and debugability2SynthesizeFPGA Design(Hours)ConfigureSoftProcessorDesignSoftwareAlgorithmand TestsCompileSoftware(Seconds)PassesTests?DebugSoftware(GDB)YesNoDone!Figure 1.2: Design Flow using a Soft Processorto FPGAs. In Figure 1.2 the soft processor design flow is shown. First, the de-signer has to instantiate a soft processor into their FPGA design and synthesizethe design (taking minutes to hours). Next, changes to the algorithm running onthe soft processor can happen in seconds; the designer simply has to recompilethe software and update the program memory. Debugging can proceed with toolsfamiliar to software programmers such as the GNU debugger (GDB). By contrast,in an RTL design each time the design is changed FPGA resynthesis must occur.This means the designer must wait hours to see the results of their changes. Ad-ditionally, if there is not enough information visible to diagnose a bug, insertingadditional debug capability requires another run of FPGA synthesis.Another benefit of implementing a design using a soft processor is the abilityto easily reuse the same hardware for multiple functions or applications. To reuseFPGA hardware that is not programmable (such as fixed RTL IP blocks) on thesame hardware, some FPGAs provide the ability to reconfigure parts of the deviceat runtime. However, even the fastest partial reconfiguration times are on the orderof a millisecond [54]. By contrast, a soft processor need only switch instructionstreams, which can be done orders of magnitude faster.Soft processors are already familiar to FPGA programmers; each major FPGAvendor offers a soft CPU [4, 43, 72]. Soft CPUs are useful for controlling data-path logic and interfacing with low speed I/O interfaces. However, emulating ascalar CPU on an FPGA has certain challenges [71]; techniques that provide tradi-tional CPUs with high performance (e.g., superscalar, out-of-order execution) areexpensive to implement in FPGA fabric. Because of this, different approaches to3organizing processors (or extremely simple processors known as processing ele-ments (PEs)) have been proposed for FPGAs. Multithreaded [42] and very-longinstruction word (VLIW) [52] processors can provide additional performance, andmultiple PEs can be put together into 2D-grids as coarse-grained reconfigurablearrays (CGRAs) [12]. In contrast, this dissertation is concerned with another ar-rangement of PEs into a 1D single-instruction multiple-data (SIMD) array withvector processor-style control, the soft vector processor (SVP).1.2 Soft Vector ProcessorsSVPs draw from both SIMD processors and traditional vector processing models.SIMD processors execute the same instruction on multiple data elements at thesame time; for example, Intel’s latest SIMD extention to the x86 instruction setarchitecture (ISA) can perform an operation on sixteen single-precision or eightdouble-precision floating-point operands with one instruction [27]. Multiple copiesof each functional unit (FU) are needed to process the data in parallel, and if theamount of data to be processed is larger than the number of FUs the processingmust be broken up into multiple instructions (a process known as strip mining [70]).Traditional vector processors also operate on multiple data elements with a singleinstruction, using a single FU and streaming the data through sequentially overmultiple clock cycles. The number of elements is configurable through a vectorlength (VL) register; strip mining is not needed unless data is larger than somemaximum vector length (MVL) threshold.Vector-SIMD hybrids are an extension of traditional vector processors to usemultiple parallel FUs to speed up computation. The vector-SIMD paradigm hasbeen shown to map well to embedded media processing applications [37]. Exist-ing vector-SIMD hybrids were the basis for initial SVP implementations [74, 78].While these initial implementations were closely modeled on ‘hard’ vector pro-cessors, they had the benefit of being ‘soft’; i.e., programmer-configurable. Theprogrammer has the option of setting multiple parameters when instantiating anSVP such as the number of parallel arithmetic logic units (ALUs) and the sizeof the local memory/register file. This configurability allows the programmer tocustomize the processor for their application, only using the resources required to4meet their performance target.1.3 SVP WeaknessesHowever, SVPs face many challenges before being viable as a mainstream optionfor implementing algorithms on FPGAs. First, there is a large penalty in speedand/or area compared to an RTL design. This penalty can vary greatly dependingon the application; a comparison of an RTL implementation of a motion estimationkernel (used in video encoding) ran 21× faster than a SVP implementation givensimilar resource usage [78], while a study across several embedded benchmarksfound that an SVP with 16 FUs was 17× slower than hardware but used 64×morearea than a hardware implementation [77]. Reducing this penalty is essential ifSVPs are to be a viable choice for many applications. To do this, the design ofSVPs must be rethought with FPGAs in mind; some initial work has been donein this area [15] but there are still many performance and area optimizations thatcan be made. FPGAs have a very different cost model than application specificintegrated circuits (ASICs) for memories and multipliers due to the hardened on-chip BRAMs and DSP blocks. SVPs should be designed such that they can makeas much use of these hardened blocks as possible.Also, the performance penalty can be reduced by better targetting likely userapplications. Targetting large designs that can fill up multi-thousand dollar FPGAsmakes for impressive benchmark numbers but is an unlikely use-case for SVPs.Rather, targetting smaller applications where the goal is to get ‘good-enough’ per-formance quickly is more realistic. An example is accelerating the audio decodingportion of video playback; the video decoding core(s) may be need preconfiguredIP or custom RTL to achieve the necessary performance. However, an SVP caneasily support a multitude of audio codecs, switching between them at runtime inmicroseconds. Though the area of the SVP may be larger than a single audio codecdesigned in RTL, one SVP replaces several codecs and will not be large comparedto the video codec(s).Another challenge is expanding the number of different types of applicationsthat SVPs can handle. An algorithm with a simple data flow graph (DFG) can al-ready be implemented via RTL or other methods; an SVP should allow a program-5mer to implement algorithms with that would be difficult to implement otherwise.One example is algorithms with data-dependent behavior. Current SVPs allow alimited form of data-dependent behavior. Data dependent branches can be skippedonly if every element on the vector has not taken that branch. By allowing skip-ping within vectors, rarely taken branches can be accelerated. This is important foralgorithms that do most of their useful computation on only a small subset of theirdata.Finally, to truly take advantage of the FPGA, the programmer needs to be ableto interface with external logic. Current SVPs only share data through main mem-ory, which has limited bandwidth and is cumbersome to synchronize. CurrentFPGA programmers may therefore be wary of an SVP implementation; if it is un-able to meet their performance goals there is no graceful way to implement logicto increase performance, so the SVP design may have to be abandoned completely.Giving users a tightly coupled interface to external logic allows for a smoothertransition between software and RTL.1.4 GoalsThe goal of this dissertation is to broaden the overall capabilities of SVPs by mak-ing them more area-efficient, achieve higher performance on many applications,remain general enough for general computations, and provide a low-level inter-face which allows advanced users a way to access the underlying FPGA fabric.By harnessing these gains in efficiency, it will become more feasible to implementapplications in the SVP framework instead of the conventional approach of usingcustom RTL. At the same time, we wish to ensure the SVP becomes easier touse from a programmer’s perspective. The overall improved ease-of-use shouldmake SVP a much more compelling framework for solving the types of processingproblems encountered by FPGA users.1.5 ApproachOur approach is threefold. First, we address the area/performance penalty of SVPsby developing VENICE, an SVP specifically targetted for moderate performanceapplications. Scalar soft processors are not adequate for such applications, but a6small SVP (with one to four lanes) can meet the desired peformance levels. Weprovide a direct comparison of VENICE to previous SVPs to show the impact ofthese design choices.Second, we investigate accelerating data-dependent algorithms as a way tofurther differentiate SVPs from RTL. We investigate density-time masking(DTM) [63] (also known as wavefront skipping), a technique from traditional vec-tor processors, but implemented in an FPGA-specific way. This allows us to havemuch more efficient FU usage, which we demonstrate on applications includingViola-Jones face detection [68].Third, we add a new interface for sharing data between the SVP and externallogic. Our custom vector instructions (CVIs) have an external channel for datato flow to and from logic the programmer has created using either RTL or visu-ally with Altera’s DSP Builder Advanced Blockset for Simulink (DSPBA). Byimplementing an N-body simulation application using these CVIs, we show howthe main kernel of an application can be greatly accelerated with a small amountof RTL programming. The programmer does not have to manage data movementand synchronization, and non-critical pre-processing/post-processing can be doneusing normal SVP instructions.1.6 ContributionsThe contributions of this dissertation are:1. A demonstration of the benefits of designing SVPs for performance densityrather than scalability and emulation of traditional vector processors. TheVENICE SVP is able to achieve 2× better performance per logic block thanprevious SVPs. The main contributions that combine to achieve this are:(a) The vector address register file was removed to save area; addresses arestored in scalar registers and transferred with each instruction.(b) 2D and 3D vector addressing modes were added to increase perfor-mance on code that contains loops with fixed increments.(c) Alignment networks were added in the pipeline, removing the need forseparate vector align/rotate instructions. This is especially important7for sliding window/stencil filter algorithms.Several other minor contributions were made and will be discussed withinthe body of this dissertation.2. A demonstration of a method for run-time skipping of masked-off vector el-ements in SVPs. A novel implementation called wavefront skipping takesadvantage of FPGA BRAMs. On early exit algorithms, such as Viola-Jonesface detection, this gives up to an 3× increase in performance with a max-imum of 5% area overhead. Additionally, partitioned wavefront skippingshowed additional performance benefits for an extra cost in area.3. A demonstration of a method for adding streaming pipelines to SVPs ascustom vector instructions (CVIs). An end user can create a CVI withminimal effort compared to a full hardware algorithm implementation, andachieve performance similar to custom RTL while still having software pro-grammable control. Specific contributions are:(a) Expansion of the custom instruction to variable width vector units.(b) Allowing an arbitrary number of CVI lanes to be used (from one tothe number of SVP lanes) without adding additional multiplexing orbuffering. This allows the user to more finely tune the area/perfor-mance tradeoff of using CVIs.(c) Creation of a method for time multiplexing inputs and outputs for com-plex CVIs. Instead of simple two-input, one-output instructions, com-plex CVIs can have 2*N inputs and N outputs where N ranges from 1 tothe number of CVI lanes. This allows instructions to replace large dataflow graphs (DFGs) that would otherwise take multiple instructions tobe implemented.(d) Demonstration of alternate design menthods that do not require RTLdesign skills. Users can design a CVI within MATLAB’s Simulinkgraphical programming environment using DSPBA.81.7 Dissertation OrganizationChapter 2 presents the background of SVPs, including traditional vector proces-sors. Chapter 3 details the design of VENICE, our SVP targetting narrow, FPGA-centric design. Chapter 4 details our implementation of wavefront skipping forconditionally divergent data-parallel algorithms. Chapter 5 details our CVI imple-mentation method and results. Finally, Chapter 6 provides some conclusions aboutwhat has been done and what has yet to be done.9Chapter 2BackgroundThose who cannot remember the past are condemned to repeat it.— George SantayanaThis chapter provides the necessary background information for this disser-tation. First, it presents an overview of FPGAs architecture and design flow tointroduce the problem. Next, it provides is a discussion of overlays on FPGAsand why they are a useful tool for designers. To give the necessary background tounderstand soft vector processors (SVPs), vector processing and SIMD executionare then introduced. This leads to a discussion of existing SVPs. Finally, somemore specific detail into alternatives to and predecessors of the contributions ofthis dissertation is given.2.1 FPGAsFPGAs are integrated circuits (ICs) that are used to implement digital logic func-tions. The difference between an FPGA and an application specific integrated cir-cuit (ASIC) is that the FPGA is field-programmable; that is, the logic functionsmust be programmed after the device has been manufactured. An ASIC imple-ments digital logic functions via the placement and connections of transistors,which cannot change once the chip has been fabricated. An FPGA implementsdigital logic functions with configurable basic logic elements (typically look-uptables (LUTs)) and configurable interconnection between them. Most modern10FPGAs are programmed by loading data into static RAM (SRAM) cells, whichcan be reconfigured practically an unlimited number of times [41]. These SRAMcells, as well as the logic elements and routing are created with normal IC technol-ogy, so it is reasonable to consider an FPGA a type of ASIC that can emulate otherASICs. FPGAs have traditionally found use in applications such as ASIC prototyp-ing, telecommunications equipment (where low volumes and changing standardsmake ASICs less attractive), and as interfaces between other ICs (“glue-logic”).The cost of programmability is that an FPGA implementation of a circuit performsapproximately 4× slower than an ASIC implementation, and requires 14× as muchdynamic power and 35× as much area [40].A note on terminology: in relation to FPGAs the adjective soft refers to a logicfunction or digital circuit implemented by configuring the FPGA, while hard refersto something implemented at the transistor level. For instance, an FPGA configuredwith a CPU circuit would be said to have a soft CPU; implementing the same CPUcircuit on an ASIC (hardening it) would make it a hard CPU.2.1.1 Architecture and Design FlowA simple model of an FPGA is a two-dimensional grid of blocks connected bya mesh routing network. The blocks may consist of programmable logic or hardblocks. The hard blocks are functions that are frequently used and would be muchmore expensive to implement in soft logic than as hardened structures. The mostcommon of these are memories (block RAMs (BRAMs)) and multipliers or otherexpensive arithmetic functions (digitial signal processing (DSP) blocks). Fig-ure 2.1 shows an example architecture with columns of logic blocks, memoryblocks, and DSP blocks. Signals between blocks are carried by a configurablerouting network; a common configuration is to have horizontal and vertical rout-ing channels with programmable switch blocks where they meet. Additionally,the wire within a routing channel to which a block input or output connects isconfigurable. The input/output blocks with pins that connect the FPGA to the out-side world are also configurable, supporting multiple voltage levels and signallingstyles.The architecture is therefore relatively generic; user logic can be implemented11LogicColumnMemoryColumnDSPColumnConfigurableRoutingI/O BlockFigure 2.1: Example FPGA Architecturein any programmable logic block and if hardened blocks are specified there aremultiple locations they could be placed. It is not a straightforward problem toautomatically map a large design to a large FPGAs; to appreciate the problemsize consider that the largest Altera Stratix V FPGA has 359,200 adaptive logicmodules (ALMs) (its basic programmble logic element), 2,640 BRAMs, and 352DSP blocks [5].The computer aided design (CAD) flow for translating a design to an FPGAconfiguration bitstream varies for different vendors’ CAD tools; a survey can be12found in [14]. The differences are not important for the purposes of this disserta-tion, but it is necessary to understand a CAD flow to understand the reasons thetraditional FPGA design cycle is long compared to the design cycle for software.Basic CAD steps include RTL synthesis, technology mapping, placement, rout-ing, and assembly. RTL synthesis is the process of translating the input circuitas specified by the user to a netlist of boolean functions and macros such as hardblocks. The boolean functions get technology-mapped to FPGA programmablelogic blocks. Placement then selects locations for each of these units, and routingdetermines how to configure the communication network to connect logic blockinputs and outputs. Finally, assembly creates the bitstream that is used to programthe FPGA. Note the term synthesis is often used to refer to the entire processes;saying it takes hours to synthesize a design includes the time taken by all of thesteps.Each of these steps is time consuming, but placement is particularly so. Forinstance, the Stratix V FPGA mentioned has 359,200 ALMs, which are groupedinto logic array blocks (LABs) of 10 ALMs before placement. Any LAB that canbe placed can therefore go into one of 35,920 locations. LABs that communicatewith each other should be placed close together, as longer distance between themresults in longer circuit delays (reducing the maximum operating frequency ( fmax)of the circuit) and higher routing resource usage. For an arbitrary circuit, findingthe optimal placement (according to a metric such as fmax) is not computationallyfeasible. In practice, methods such as simulated annealing [35] are used to try tofind a high quality placement.Simulated annealing is a stochastic method which works by swapping theplacement of LABs if they meet some threshold of a cost function. Initially thethreshold is set to allow swaps even if the cost function worsens somewhat; as thealgorithm progresses the threshold is slowly lowered until eventually only swapsthat improve the cost function are allowed. This slow convergence prevents theplacement from being stuck in local optima (as a greedy algorithm would be) butrequires a huge number of potential swaps. In turn this means that placement onlarge FPGAs can run for many hours. There are techniques to reduce the CADtime, such as incremental compilation [3] and parallel placement [47, 69], but theystill do not provide software-like design cycles.132.2 OverlaysFPGA overlays are an intermediate layer between the programmable FPGA logicand the programmer’s input, designed to simplify design and improve productivity.The overlay itself is a premade IP block and may be distributed as configurableRTL or a pre-synthesized netlist. What separates overlays from other IP blocks isthat an overlay provides an additional level of programmability; the overlay can beconfigured for different applications without having to resynthesize the IP block.Instead, the overlay is programmed by changing some configuration state, typicallythe contents of one or more BRAMs. An overlay can be added to a design withexternal logic via some communication channels, or in special cases it may bethe majority or entirety of the design. These special cases include designs withsignificant time-to-market constraints and those for which only low volumes areneeded; given enough time and volume it would make more sense to make a designthat was only an overlay into an ASIC.Overlays improve user productivity by allowing programming and debug at ahigher level, shortening design cycle times, and speeding up reconfiguration. Anoverlay that implements a CPU, for instance, can reuse tools that have been de-signed for CPUs, such as compilers, integrated development environments (IDEs),and debuggers. Not only is the level of abstraction in programming and debuggingraised, but compiling for a CPU is orders of magnitude faster than synthesizing anFPGA design. By making a coarser layer which is less flexible than the underlay-ing FPGA fabric, CAD tools have less design space to explore, shortening compiletime. Also, in general, an overlay is only suited to a particular class of applications,and so the mapping from application to overlay is more straightforward.Overlays can implement many functions, such as networks [32], simula-tors [31], or even a different reconfigurable logic fabric (a meta-FPGA) [8].The overlays dealt with in this work are SVPs; before discussing them directlyit is useful to know how other soft processors are designed and why they are de-signed that way. A canonical example of a scalar soft processor is Altera’s Nios IIprocessor [4]. Nios II is actually a family of processors, from a multicycle econ-omy variant (Nios II/e) to a six-stage pipelined high performance variant (NiosII/f). The ability to drop in a different ISA-compatible variant allows the designer14to allocate more or less resources in exchange for more or less performance withouthaving to rewrite the application software running on top of the soft processor. Ad-ditionally, each variant has a multitude of configuration options such as cache sizesand whether to use hardware for certain instructions or trap to software emulation.Implementing a soft processor does not involve the same tradeoffs as creat-ing a custom CMOS CPU or even a processor embedded within an ASIC. Asummary of the differences in FPGA and custom CMOS with respect to proces-sor design can be found in [71]. They find that designs that simply map customCMOS CPUs to FPGAs have a delay 18 to 26× greater and use 17 to 27× morearea. Key to the differences are the relatively higher costs of multipexors andcontent addressable memorys (CAMs), which are necessary to implement data for-warding nextworks and reordering structures such as those used in out-of-orderexecution (OoOE). Multiported register files are also relatively more expensive inFPGAs, which makes it more difficult to implement superscalar or VLIW tech-niques where multiple operands are read and written each cycle.We are therefore motivated to look at other processor organizations that canoffer increased peformance without requiring structures that are difficult or costlyto implement on an FPGA. Particularly well-suited organizations include vectorprocessing and SIMD processing, which use deep pipelines and replicated PEsrespectively.2.3 Vector Processing and SIMD OverviewThere are many different parallel processor organizations; a good overview can befound in [20]. SVPs take core concepts from both the vector and SIMD paradigms.Vector processing has been applied in supercomputers on scientific and engi-neering workloads for decades [21]. It exploits the data-level parallelism readilyavailable in scientific and engineering applications by performing the same opera-tion over all elements in a vector or matrix. It is also well-suited for image process-ing. The first commercially succesful (and certainly most iconic) vector processorwas the Cray-1 [56]. The Cray-1 improved on prior vector processors in severalways, most notably by having a vector register file (VRF) instead of streaming vec-tor data from memory. Each vector register could hold up to 64 elements of data15Timea0 a1 a2 a3 a4 a5 a6 a7b) Vector* * * * * * * *b0 b1 b2 b3 b4 b5 b6 b7c0 c1 c2 c3 c4 c5 c6 c7+ + + + + + + +t0 t1 t2 t3 t4 t5 t6 t7c) Vector-SIMDa0 a1 a2 a3a4 a5 a6 a7********b0 b1 b2 b3b4 b5 b6 b7c0 c1 c2 c3c4 c5 c6 c7++++++++t0 t1 t2 t3t4 t5 t6 t7a0a1a) SIMDb0b1t0t1**c0c1++a2a3b2b3t2t3**c2c3++a4a5b4b5t4t5**c4c5++a6a7b6b7t6t7**c6c7++Figure 2.2: Data Parallel Execution on Different Architectureswhich could be streamed through a FU at the rate of one element per cycle. Thenumber of elements actually processed by any given instruction was controlled bythe vector length (VL) register. Only one instruction could issue per clock cycle,but because each vector instruction would take multiple clock cycles, more thanone instruction could be executing at the same time. In the case of dependent vec-tor operations, the output of one FU could feed into the input of another FU; this isa process called chaining. Multiple FUs could be chained together this way, witheach FU operating at the same time to provide parallel execution.SIMD processing, by contrast, utilizes multiple parallel FUs executing thesame instruction at the same time. Modern microprocessors are augmented withSIMD processing instructions to accelerate data-parallel workloads. These operateon short, fixed-length vectors (e.g., only 128b, or four 32b words). Significant over-head comes from instructions to load/pack/unpack these short vectors and looping(incrementing, comparisons, and branches are still necessary). Intel’s Haswell pro-cessors implement AVX2, the latest version of their mainstream SIMD extensions,with a maximum vector length of 256b (8x32b words) [27]. In contrast, the In-tel Many Integrated Core (MIC) microarchitecture Xeon Phi supports 512b vectorlengths (16x32b words) [28].Figure 2.2 shows data being processed in a vector processor (with instruction16//a) Scalarfor i in 0..7temp = a[i] * b[i]d[i] = temp + c[i]//b) 2-wide SIMDfor i in 0..3 //2 elements per instruction, so only 4 iterationstemp[0:1] = a[2*i:(2*i)+1] * b[2*i:(2*i)+1]d[2*i:(2*i)+1] = temp[0:1] + c[2*i:(2*i)+1]//c) Vector and Vector-SIMDset_vector_length(8) //Subsequent instructions process 8 elementstemp[] = a[] * b[] //Elementwise vector multiplyd[] = temp[] + c[] //Elementwise vector addFigure 2.3: Psuedocode for Different Parallel Paradigmschaining), a SIMD processor, and a hybrid vector-SIMD processor. Pseudocodefor this example is shown in Figure 2.3. The code multiplies two eight-element ar-rays together and adds them to a third eight-element array. In the code for a scalarprocessor (Figure 2.3a), this requires eight iterations through a loop; in each iter-ation a multiply and add instruction are executed. Additionally, the data elementsare read from and stored back to memory in eight separate memory operations perarray. If data is not available in the processor’s cache or there are hazards then therewill be additional cycles in which no work is being done. Thus, even though thereare 16 cycles of computation (eight multiplies and eight adds) a scalar processormay take several times that many cycles to execute the whole loop.Figure 2.3b shows the code for a 2-wide SIMD implementation. The codeclosely matches scalar code except that every operation works on two elements,so only four loop iterations are needed. Most SIMD architectures also have SIMDload and store instructions, so the overhead of issuing memory instructions is sim-ilarly reduced. Figure 2.2a shows how the data moves through a SIMD processorfor this code; there are eight SIMD instructions which will take at least eight cyclesto execute. The total execution time for SIMD code rarely scales perfectly with thenumber of parallel ALUs; hazards and stalls are not reduced by increasing SIMDwidth and so they quickly dominate runtime.Figure 2.3c shows the vector code. Instead of having a preset amount of dataprocessed by each instruction, there is a VL register that controls the amount of17data processed. After it is set to eight, the subsequent instructions will processeight data elements each. Figure 2.2b shows the data movement in a classic vectorprocessor; there are only two vector instructions to execute. Because the control forlooping and addressing is in the vector processor’s hardware, once the instructionhas started, executing the computation can proceed at one cycle per operation untilthe entire array has been processed. Vector processors have vector-load and vector-store instructions to address memory in large chunks, which can allow for bettermemory bandwidth utilization.Figure 2.2c shows the code running on a vector-SIMD hybrid; it combinesthe efficiency of vector execution and the parallelism of SIMD. While there are thesame number of execution cycles as a SIMD implementation, only two instructionsneed to be issued and there is no control flow or address manipulation in software.Note that the same code runs on a classic vector processor or vector-SIMD regard-less of SIMD width; the vector hardware executes for as many cycles as is neededdepending on the current VL and the number of ALUs.The vector-SIMD model allows parallelism in a more scalable manner thanchaining. Chaining and SIMD are not orthogonal; for instance, the T0 vector pro-cessor could chain together three instructions on eight vector pipelines to achieve24 operations per cycle [6]. Chaining takes advantage of instruction level paral-lelism (ILP), while SIMD takes advantage of data level parallelism (DLP). Whilemeasuring the amount of ILP present in a program is non-trivial [55], in practicevector processors have been limited to chaining three instructions, and wide su-perscalar CPUs such as Intel’s latest Haswell architecture can only commit fouroperations per cycle [30]. Not only is there a limit to how much ILP can be ex-tracted, exploiting it requires reading and writing multiple operands per cycle froma flat register file. In SIMD execution, by contrast, each ALU writes back to itsown independent bank of the vector register file, so parallelism is achieved withoutneeding multiple read and write ports [36].The case for embedded vector processors was made by the VIRAM [36]project. It showed that vector processing was more efficient than VLIW or su-perscalar techniques for embedded media applications. VIRAM used an existingMIPS CPU to handle control flow, and dispatched vector instructions to a copro-cessor vector core. The vector core consisted of four 64-bit lanes, which could18alternately process eight 32-bit or sixteen 16-bit elements per cycle. This abilityto split lanes and process more elements of narrow widths is common in SIMD ar-chitectures and is referred to as intra-lane SIMD or sub-word SIMD. VIRAM wasa load/store architecture and connected to memory through a crossbar connectedto eight DRAM banks. VIRAM was the architecture the first SVPs were modeledupon.2.4 Soft Vector Processors (SVPs)The first SVPs were developed in parallel, VIPERS [78] from The University ofBritish Columbia and VESPA [74, 76] from the University of Toronto.VIPERS demonstrated that programmers can explore the area-performancetradeoffs of data-parallel workloads without any hardware design expertise. The re-sults of three benchmark kernels demonstrate a scalable speedup of 3–30× over thescalar Nios II processor. Additionally, a speedup factor of 3–5× can be achievedby unrolling the vector assembly code. VIPERS uses a Nios II-compatible multi-threaded processor called UT-IIe [22], but control flow execution is hindered by themultithreaded pipeline. The UT-IIe is also cacheless; it contains a small on-chip in-struction memory and accesses all data through vector read/write crossbars to faston-chip memory. VIPERS instructions are largely based on VIRAM. VIPERS alsooffered a comparison to high level synthesis (HLS) (Altera’s C2H compiler [1])and custom RTL circuits. The HLS comparison showed VIPERS to be larger inarea but faster; they were roughly even in terms of performance per area. The RTLcomparison only used one (motion estimation) benchmark, for which VIPERS was82× slower at the same amount of area, though that could be reduced to 21× withminor ISA enhancements.VESPA is a MIPS-compatible scalar core with a VIRAM compatible vectorcoprocessor. The original VESPA at 16 lanes can acheive an average speedup of6.3× the scalar core over six EEMBC benchmarks. Furthermore, an improvedVESPA achieves higher performance by adding support for vector chaining with abanked register file and heterogeneous vector lanes [75]. Over the 9 benchmarkstested, the improved VESPA averages a speedup of 10× at 16 lanes and 14× at 32lanes. The MIPS core uses a 4kB instruction cache, and shares a data cache with19Figure 2.4: The VEGAS Soft Vector Architecture [15]the vector coprocessor. Smaller (1- or 2-lane) vector coprocessors use an 8kB datacache, while larger ones use 32kB.Both VIPERS and VESPA offer a wide range of configurability. For example,the parallel vector lanes can be specified at FPGA compile-time to be 8, 16 or 32bits wide. However, when mixed-width data is required, the vector engine must bebuilt to the widest data. Therefore, when processing smaller data, load instructionswill zero-extend or sign-extend to the full width, and store instructions will truncatethe upper bits. Since the vector register file must store all data (even byte-sizeddata) at the widest width, VIPERS and VESPA can be very inefficient: byte-widedata is stored in the on-chip main memory or data cache, then expanded to word-wide inside the register file. On top of that, to implement dual read ports, VIPERSand VESPA duplicate the vector register file. Hence, a single byte of data mayoccupy up to 9 bytes of on-chip storage (including the copy in the data cache).The VEGAS soft vector architecture [15] uses a Nios II/f with a soft vectoraccelerator. Figure 2.4 shows the architecture of VEGAS. Each vector instruction20is encoded into the free bits of a Nios custom instruction. Instead of vector dataregisters, VEGAS uses an 8-entry vector address register file (VARF) that pointsinto a large, banked scratchpad memory. The scratchpad can be partitioned intoany number of vectors of any length. During execution, the ALUs in VEGAScan be fractured to support subword arithmetic. This means a fixed-width vectorengine can operate on more elements if they are halfword or byte sizes. Also, itmakes more effective use of the scratchpad memory, since smaller operands arenot expanded to fill an entire word as done with prior architectures. The executionpipeline is followed by a data alignment network that is used when operands startin different banks of the scratchpad. If the source operands are aligned to eachother but the destination has a different alignment, this occurs in a pipelined fash-ion. However, if the source operands are not aligned to each other the vector coremust stall the current instruction and insert a vector move operation to correct thealignment mismatch.Both VIPERS and VESPA also share a similar memory-system design. Theyuse vector load and vector store instructions to transfer blocks of data, with optionalstrides, from main memory to the vector data register file. Separate read and writecrossbars shuffle data during loads and stores.In contrast, VEGAS does not support vector load and store instructions. In-stead, it uses direct memory access (DMA) block-read and block-write commandsto copy between the scratchpad and main memory. All vector alignment / byteshuffling is done at runtime in the writeback pipeline stages by passing vector re-sults through a shuffle network. Except for lengthening the pipeline, there is norun-time penalty if the destination vector is misaligned. However, if vector sourceoperands are misaligned, VEGAS must first copy one operand to a temporary lo-cation in the scratchpad. This extra copy operation cuts performance roughly inhalf.The vector register file is connected to an on-chip memory (VIPERS) or on-chip data cache (VESPA) through separate read and write crossbars. These cross-bars are used when striding through memory during vector load and store instruc-tions; they must shuffle bytes/halfwords/words from their byte-offset in memoryinto word size at a new word-offset in the vector register file. The size of the cross-bars are constrained on one end by the overall width of the vector register file, and21on the other end by the overall width of the on-chip memory/cache. As the vectorprocessor is scaled to contain more lanes, one end of the crossbars increases in sizewhile the other end is fixed by the memory width. To quickly load a large registerfile, the on-chip memory/cache width must be similarly increased to match. Thearea to implement these crossbars is significant, and grows as the product of thetwo widths.While VEGAS serves as the starting point for our work, there have been otherSVPs or SVP-like processors implemented as well. The FPVC, or floating pointvector coprocessor, was developed by Kathiara and Leeser [33]. It adds a floating-point vector unit to the hard Xilinx PowerPC cores which can exploit SIMD par-allelism as well as pipeline parallelism. The FPVC fetches its own VIRAM-likeinstructions and has its own private register file. Unlike most other vector architec-tures, it can also execute its own scalar operations separate from the host PowerPC.Convey’s HC-2 [18] is a vector computer built using several FPGAs. It is tar-getted at high performance applications such as geological exploration and utilizesthe entirety of the FPGA rather than being a building block in an embedded systemlike the other SVPs mentioned.Bluevec [50] was developed to perform neural-network simulations on FPGAs.The authors created a custom SVP in Bluespec [51]. They compared a custompipeline (also written in Bluespec) to the SVP implementation and found that per-formance was within a factor of two given similar resource usage, and that fur-ther performance improvement was not possible with either design due to memorybandwidth limitations.Bluevec also has a C++ compiler designed to allow programming at a higherlevel than vector assembly or C macros [49]. Their compiler is a run-time systemthat compiles and downloads a kernel to the SVP; the first time a kernel is en-countered it is compiled, and then that compiled kernel can be reused if the kernelis executed again. This contrasts to a compiler for the VENICE processor (to bepresented in Chapter 3) which does a source-to-source translation of Microsoft’sAccelerator object-based langauge [65] to VENICE C code [46].Additionally, the work on VENICE lead to a commercial SVP, the VectorBloxMXP [59]. MXP is used for the experiments performed in Chapter 4 and Chapter 5.MXP is similar to VENICE in design, but with added features such as fixed-point22//a) Scalarfor i in 0..7if mask[i] /= 0 //Skip computation if mask element is not zeroa[i] = (b[i] * c[i]) + (d[i] * e[i])//b) Vector and Vector-SIMDset_vector_length(8) //Subsequent instructions process 8 elementstemp[] = (b[] * c[]) + (d[] * e[]) //Compute result for all elements//select elements of temp[] if the corresponding mask is not zero//otherwise keep the same value in a[]a[] = merge_nonzero(mask[], temp[], a[])Figure 2.5: Psuedocode for Divergent Control Flowarithmetic, 2D-DMA support, and a C++ object based application programminginterface (API) for higher level programming.2.5 Divergent Control FlowChapter 4 deals with a method of speeding up execution of algorithms with diver-gent control flow. Figure 2.5 gives an example of code that has control flow di-vergence. The scalar version has a branch within the inner loop (the if statement).For each iteration i of the loop, if the mask[i] element is nonzero a calculation isperformed and the result stored into a[i]. If the mask[i] element is zero, nothinghappens and the scalar processor can jump to the next iteration of the loop. Fig-ure 2.5b gives the code example for a vector processor. Instead of performing thecalculation only for the elements where mask[i] is nonzero, the calculation is per-formed for all elements and stored in temp[i]. Then a merge instruction is usedto select whether to write temp[i] to a[i] or leave the value unchanged (by writingback the current value of a[i]). Predicated instructions can also be used. Predicatedinstructions use a writeback enable signal to prevent the update of the result vectordepending on the value of a flag vector that was previously set.Support for divergent control flow in vector processors is well researched; agood summary is [63]. For short conditionals, implementations that use merge orpredicated operations can give good performance. These operations are performedon all data elements, with only the elements that pass some conditional test writtenback; the rest of the results are discarded. For longer conditional branches, how-231 0 0 1 0 1 1 0D7 D6 D5 D4 D3 D2 D1 D0D7 D4 D2 D1a) Compress Operationb) Gather OperationMask VectorData Vector7 4 2 1Offset VectorBase Address (Scalar)&DExternal MemoryPipelinedLoadUnitD7 D4 D2 D1Result VectorResult VectorFigure 2.6: Compress and Gather Operationsever, this can lead to a large percentage of execution time spent on unused results.In these instances, it is desirable to skip elements that are not on the current branch.Three strategies can be employed (separately or together): compress/expand, scat-ter/gather, and wavefront skipping (also referred to as density-time masking).Compress vector operations are a way to take a source vector and a mask andproduce a new vector that only contains the valid (unmasked) elements withoutgaps, while expand vector operations take a compressed source vector and a maskand fill in the unmasked slots in the destination vector with the source data [39].Figure 2.6a shows an example of a compress operation. The compress operationtakes in a mask vector and a data vector, and outputs a vector with only the dataelements that correspond to locations in the mask vector that are set. Locationsthat are masked off (mask vector equal to zero) are discarded. VIRAM imple-mented VCOMPRESS and VEXPAND operators. During a long branch, the source24operands can be compressed, followed by processing only the shorter (compressed)vector operations, and finally the result expanded. Since execution time is propor-tional to vector length, operations on compressed vectors are potentially faster thanpredicated operations on the uncompressed vectors. The main drawback to this ap-proach is that all source operands must be compressed, and all destination operandsmust be expanded. In a MxN conditional stencil filter computation, for instance,the MxN input pixels would all have to be separately compressed.Scatter/gather is a method of performing indexed memory accesses (indexedstore is a scatter, indexed load is a gather), either to the local memory store orexternal memory [29]. Along with a compress operation or special index calcu-lating instructions, scatter/gather can be used to speed up conditional execution.Figure 2.6b shows an example of using a gather operation. A scalar base addressalong with a vector of offsets is used to calculate the addresses of each data vec-tor element in external memory. The resulting addresses pass through a memoryunit and are loadied into the desired result vector. For the conditional stencil filterexample, the indices of pixels to be processed could be compressed, and then thepixel data needed for each location could be loaded using gather operations. Themain drawback of this approach is that parallelizing scatter and gather operationsrequires parallel memory acceses, which are nontrivial. VESPA could performparallel scatter gather accesses within a single cache line. A special throughputcache [58] was developed for MXP to support scatter/gather operations, but evenin the best case where all data could fit in a statically allocated multiple-bank on-chip memory, speedups were modest.Wavefront skipping, by contrast, uses knowledge of the mask register to skipelements that do not need to be processed. Early implementations scanned themask register during instruction execution to determine if subsequent wavefrontscould be skipped. This introduces enough latency that a way to reduce the over-head by only skipping powers of 2 elements was patented [62]. Though this re-duces latency, some extra work is done because the skipping is not exact. Giventhat an FPGA implementation is already slower than a hard processor and we aredouble-clocking our scratchpad to provide additional ports, we wished to avoid theadditional latency in our design. Our implementation uses a special mask setupinstruction to store the offsets of valid wavefronts in one or more BRAMs within25the FPGA. Prior work [63] suggested the idea that each lane can skip forwardindividually rather than lockstep as entire wavefronts, but no implementation wascreated.A similar concept exists in GPUs based computing [53]. GPUs divide up akernel (consisting of hundreds of threads) into warps, which are analogous to ourwavefronts, usually 16 or 32 threads. Each warp maintains a program counter(PC) and a mask of active elements. Inside a kernel, branch instructions allowdifferent threads to diverge. If threads within a warp take different paths becauseof a branch instruction, then a separate PC and active mask must be maintained forboth branches. However, if all threads take the same path, the PC for the not-takenpath can be discarded. This is somewhat analagous to unpartitioned wavefrontskipping in vector processors, where a wavefront that is completely masked-offdoes not need to be executed.There has been some work published about, but to our knowledge no actualimplementations of, GPUs skipping at a finer grain than the warp level. Warpcompaction for single-instruction multiple-thread (SIMT) processors and threadblock compaction were simulated [23, 24]. This approach works within blocks ofwarps and moves threads between warps to reduce the amount of unnecessary workdone. A similar approach uses ‘large warps’ which are much larger than the SIMDwidth of the actual FUs and then executes elements from different subwarps thatare all active [48].Similarly, an architectural study involving the vector-thread architecturalparadigm looked at divergence, including density-time execution [45]. Vector-thread architectures can be thought of as a hybrid between vector processors andGPUs. A single control thread executes per core, and this control thread can issuea vector-fetch instruction that starts parallel execution. This vector-fetch is not asingle instruction, but the start of a kernel of microthreads that can include controlflow and diverge. When diverged, an active mask is used similar to a GPU, withpending PCs placed in a pending vector fragment buffer (PVFB). Microthreads donot reconverge in the baseline, but keep executing until all encounter a stop instruc-tion and the microkernel finishes. With a simple first-in first-out (FIFO) PVFB,adding density-time execution reduced energy use by more than 2×. Adding a dy-namic reconverge scheme lessens the impact of density-time execution. This work26only considered single lane implementations of density-time execution.2.5.1 Execution Pipeline CustomizationChapter 5 explores our method for customizing the SVP pipeline with custom op-erators created in FPGA logic. There has been some related work in data parallelprocessors with configurable or extensible pipelines.Some scalar processors have support for extensible pipelines. The TensilicaXtensa [25] is a synthesizable CPU for system on chip (SoC) designs which re-serves certain opcodes for custom instructions. Some soft processors, such as Al-tera’s Nios II [4], support custom instructions directly in their execution pipeline.Others, such as Xilinx’s MicroBlaze [72] implement tightly coupled queues to in-terface to custom logic in a similar manner.VESPA included support for heterogeneous vector lanes [75]; e.g., havingfewer multipliers than general-purpose ALUs. Due to the mismatch between thevector register file width and execution unit width, a parallel load queue was usedto buffer a vector for heterogeneous operations, and a separate output queue wasused to buffer results before writeback. The amount of customization was limitedto subsetting the existing ISA.Work by Cong et al. [16] created composable vector units. At compilationtime, the DFG of a vector program was examined for clusters of operations thatcan be composed together to create a new streaming instruction that uses multipleoperators and operands. This was done by chaining together existing functionalunits using an interconnection network and multi-ported register file. This is sim-ilar to traditional vector chaining, but it was resolved statically by the compiler(not dynamically in the architecture) and encoded into the instruction stream. Thisprovided pipeline parallelism, but was limited by the number of available operatorsand available register file ports. The reported speedups were less than a factor oftwo.The Convey HC-2 [18] can adopt one of several ‘personalities’, each of whichprovides a domain-specific vector instruction set. User-developed personalities arealso possible. Designed for high-performance computing, the machine includesa high bandwidth, highly interleaved, multi-bank DRAM array to reduce strided27access latency. Since the personalities come preconfigured it may be more useful tothink of the HC-2 as a family of processors rather than a processor with a extensibleexecution units.28Chapter 3VENICE: Optimizing for Smallbut CapableIt takes a tough man to make a tender chicken. — Frank PerdueOur first approach to broadening the applicability of SVPs was to reduce thearea/performance penalty of using SVPs to make them more competitive withhand-crafted RTL designs. In modest performance applications where only a smallSVP is needed, the area overhead incurred (compared to RTL) will be small rela-tive to the area used in the rest of the system. This contrasts to high performanceapplications, where the area penalty will dominate the area of the whole system,and therefore there is more incentive to spend time developing a custom RTL so-lution. This lead us to investigate an SVP design that targetted a small number oflanes; we envision that such a solution that could be a building block in a largersystem rather than a stand-alone processor that takes up the entire FPGA.3.1 IntroductionThis chapter presents VENICE, an SVP designed for maximum throughput with asmall number (one to four) of ALUs. VENICE demonstrates that applications thatneed somewhat higher performance than a scalar soft processor can provide canbe implemented while using only modestly more resources. VENICE can achieveover 2x better performance-per-logic block than VEGAS, the previous best SVP as29of the time of its development (2011-2012). It achieves this through a combinationof increasing clockspeed, eliminating bottlenecks in ALU utilization, and reducingarea through FPGA-specific design. While VENICE can scale to a large number ofALUs, a multiprocessor system of smaller VENICE SVPs is shown to scale betterfor benchmarks with limited inner-loop parallelism. VENICE is also simpler toprogram than previous SVPs, since its instructions are C macros using pointers intoa scratchpad memory rather than requiring assembly and/or manually allocatingvector registers.VENICE is smaller and faster than all SVPs published before it. Not only isit roughly 2× better performance per logic block (speedup per ALM) than VE-GAS [15], but it is also 5.2× better than Altera’s fastest Nios II/f processor. As aresult, less area is needed to achieve a fixed level of performance, reducing devicecost or saving room for other application logic. Alternatively, it enables larger mul-tiprocessors to be built using VENICE, resulting in greater computational through-put for the fixed area of a specific device.The key contributions that lead to this are:• Removal of vector address register file (area)• Use of 2D and 3D vectors (performance)• Operations on unaligned vectors (performance)• New vector conditional implementation (area)• FPGA optimized fracturable multiplier (area)Programming VENICE requires little specialized knowledge, utilizing theC programming langauge with simple extensions for data parallel computation.Changes to algorithms require a simple recompile taking a few seconds rather thanseveral minutes or hours for FPGA synthesis. In particular, the removal of thevector address register file makes VENICE easier to program than VEGAS. Inaddition, VENICE does not suffer a performance penalty when using unalignedoperands, freeing programmers from worrying about the placement of data inmemory. Finally, as described in a separate publication, Zhiduo Liu developed30DDR2 D$I$ NiosII/fCPUABSAccum.MULSHIFTROTATEAlign 1Align 2ALUEXTENDCMOVAlign 3VENICE Vector EngineInstruction QueueScratchpadMemory2kB - 2MB2x clkAddress LogicAltera AvalonFabric(2ndpipestage)DMACustom Instruction PortFigure 3.1: VENICE Architecturea compiler for VENICE based on Microsoft Accelerator [46], demonstrating thathigh level code can be translated to VENICE’s programming model.3.2 Design and ArchitectureA block diagram of the VENICE (Vector Extensions to NIOS Implemented Com-pactly and Elegantly) architecture is shown in Figure 3.1. Similar to previousSVPs, VENICE requires a scalar core as the control processor; in this case, a NiosII/f executes all control flow instructions. In our work, Nios is configured to usea 4kB instruction cache and a 4kB data cache. Unfortunately, Nios lacks supportfor hardware cache coherence, so the programmer must sometimes explicitly flushdata from the cache to ensure correctness.VENICE vector instructions are encoded into one or two tandem Nios custominstructions and placed inline with regular instructions. These custom instructionsare written into a 4-entry vector instruction queue. Typically, after writing thevector instruction to the queue, the Nios core can continue executing its next in-struction. However, the Nios core will stall if the instruction queue is full, or if thecustom instruction synchronizes by explicitly waiting for a result.The VENICE vector engine implements a wide, double-clocked scratchpad31memory which holds all vector data. Operating concurrently with the vector ALUsand the Nios core, a DMA engine transfers data between the scratchpad and mainmemory. The DMA engine also has a 1-entry control queue, allowing the nexttransfer to be queued.It is very important for the DMA engine to operate concurrently with the vec-tor ALUs, because this hides the memory latency of prefetching the next set ofdata elements. All of our benchmarks implement some form of double-buffering,allowing us to hide most of the memory latency.All vector ALU operations are performed memory-to-memory on data storedin the scratchpad. A configurable number of vector lanes (32-bit vector ALUs)provides additional parallelism. Each 32-bit ALU supports subword-SIMD oper-ations on halfwords or bytes, thus doubling or quadrupling the parallelism withthese smaller data types, respectively.3.3 VENICE ImplementationBelow, we will describe each of the key improvements made to VENICE. As aresult of these and other optimizations, the design was pipelined to reach 200MHz+(roughly 50–100% higher than previous SVPs). This also allows the SVP to runsynchronously at the same full clock rate as the Nios II/f, simplifies control, addsthe predictablity of a fully synchronous design, and reduces resource usage.3.3.1 Removal of Vector Address Register FileOne major difference from VEGAS is removal of the vector address register file(VARF). In VEGAS, the vector instructions emitted from the scalar core includetwo source register numbers and one destination register number. They are indicesinto the VEGAS 8-entry VARF, which produces addresses that index into the VE-GAS scratchpad memory. Three VARF entries need to be read for every VEGASinstruction, and possibly modified due to instructions that automatically incrementaddress registers. To support three reads and three writes concurently the VARFwas not implemented in BRAM , but rather using LUTs and flip-flops (FFs). Theresult was that the VARF used 377 ALMs in a V1 VEGAS implementation, whichis 10% of its total ALM usage.32// VEGAS code to add two vectors already in scratchpad// Clobbers VARF registers V1, V2, V3void vegas_add(int length, int *v_a, int *v_b, int *v_dest){//Set the vector lengthvegas_set( VCTRL, VL, length );//Copy the pointer values from the scalar register file to the VARFvegas_set( VADDR, V1, v_a );vegas_set( VADDR, V2, v_b );vegas_set( VADDR, V3, v_dest );//Perform the operation (VARF values index into scratchpad)vegas_vvw( VADD, V3, V1, V2 );}// VENICE code to add two vectors already in scratchpadvoid venice_add(int length, int *v_a, int *v_b, int *v_dest){//Set the vector lengthvector_set_vl( length );//Perform the operation (pointers index directly into scratchpad)vector( VVW, VADD, v_dest, v_a, v_b);}Figure 3.2: VEGAS Code (Requiring Vector Address Register Setting) vsVENICE CodeInstead, VENICE relies upon the Nios register file to contain the vector ad-dresses. Instead of emitting register numbers, the vector instruction is emitted withtwo source addresses and one destination address; these addresses directly indexthe scratchpad memory. Nios instructions can run concurrently with vector oper-ations, calculating addresses for the next vector instruction while the current oneexecutes. Additionally, the 2D and 3D vector instructions (which will be explainedin Section 3.3.2) replace the functionality of automatically incrementing addressregisters.In VEGAS, copying values into the VARF and spilling/filling them is a cum-bersome task for a programmer. Code demonstrating how the VARF is used inVEGAS along with the corresponding code for VENICE is shown in Figure 3.2 (amore complete explanation of the VENICE programming model will be given inSection 3.4). Both VEGAS and VENICE work with pointers into the scratchpadmemory, but programming VEGAS requires copying pointers into local vector ad-33dress registers (V1, V2, and V3 in this example). Vector operations then use thevector address register number rather than the pointer, making code less readableand harder to maintain. Additionally, if more than eight vectors are needed theVARF values will have to be spilled and filled between vector operations, hurtingperformance. In VENICE there is no need to do this extra work. The lack of VARFalso makes the task of writing a compiler for VENICE easier.One drawback of this approach is that increased instruction issue bandwidth isrequired between the Nios and VENICE vector engine. VEGAS only has to trans-mit three constant 3-bit VARF indices along with its opcode and modifier bits foreach instruction, allowing a VEGAS instruction to be encoded as a single Nios II/fcustom instruction. VENICE instructions, by contrast, need to read three operandaddresses from the Nios II/f register file. Since a Nios II/f custom instruction canonly access two register values per cycle, two Nios II/f custom instructions areneeded to issue each VENICE instruction.3.3.2 2D and 3D Vector InstructionsThe execution of some programs is limited by the instruction dispatch rate, espe-cially when short vectors are used. This makes it difficult to achieve high vectorALU utilization.To get around this limitation, VENICE implements 2D and 3D vector instruc-tions. A 1D vector is an instruction applied to a configurable number of elements,which can be thought of as columns in a matrix. The 2D vector extends this torepeat the 1D vector operation across a certain number of rows. In between eachrow, the operand address will be incremented by a configurable stride value, allow-ing for selection of arbitrary submatrices. Each operand uses its own unique stridevalue. When executing, the vector core dispatches a separate vector instruction foreach row, and in parallel adds the strides for each operand to determine the addressof the next row. As a result, VENICE can issue up to 1 row per cycle. This hasa direct extension to 3D instructions for operations on 3D data, which can processmultiple 2D matrices with a single instruction.An example of programming using 2D vector instructions will be given in Sec-tion 3.4.34Figure 3.3: Example of Misaligned Operation3.3.3 Operations on Unaligned VectorsWhen input and/or output vectors are not aligned, data needs to be shuffled betweenlanes. For example, Figure 3.3 shows how VENICE uses three alignment networksto shuffle unaligned data. A four bank scratchpad memory is shown, with Bank 0at the top of the figure. The three operand vectors are striped across the scratchpadmemory, with the first input starting in Bank 1, the second input starting in Bank3, and the destination vector starting in Bank 2. When an instruction is issuedusing these operands, the two source operands are first aligned into a canonicalposition such that Element 0 is moved to the top ALU. Next, the vector data isprocessed by the ALUs. Finally, the result is re-aligned, moving Element 0 ofthe result vector into the scratchpad memory bank that corresponds to the start ofthe destination operand target position. Doing vector alignment in the processorspipeline is especially useful when doing convolution, as one operand’s startingaddress is continuously changing and seldom aligned.Note that only two alignment networks are required to align the two inputoperands and one output operand; the extra alignment network was put in placeto allow for future work when operand data sizes mismatch. The ability to dooperand resizing in the pipeline was not implemented in VENICE (though it was35in the commercial MXP processor [59]), so the third alignment network could beremoved from VENICE to save area without affecting performance.The older VEGAS processor uses a single alignment network because this isthe only component that grows super-linearly with the number of vector lanes. Thisis a concern as the number of lanes increases, but not for the small number of lanesthat VENICE is designed for, as we will show in Section 3.5.1. The drawbackof VEGAS’s single alignment network is that it introduces a performance penaltywhen input operands are not aligned. In VEGAS, when the two input operandswere not aligned, a copy operation is automatically inserted to align them. Theseextra copies can potentially halve performance on code where inputs are mostlyunaligned, such as sliding window algorithms. Additionally, space is reserved atthe end of the scratchpad memory for these temporary copies, meaning that thefull scratchpad can not be used by the program unless all operations are provablyaligned.3.3.4 New Vector Conditional OperationsPrevious SVPs based on the VIRAM architecture used eight vector mask registersto store flags which could be used to perform conditional operations. VEGAS useda similar approach, storing flags in separate BRAMs from the scratchpad.VENICE takes a novel approach to data-conditional operations. Vector flagsfor compare instructions and arithmetic overflow are written alongside each byte,using the 9th bit in BRAMs that have an extra data bit for parity or optional storage(such as the M9K in Altera Stratix devices). This reduces BRAM usage, but doesnot allow for a VIRAM-style register file of flags. Instead, conditional move op-erations that read the flag bits as one of the input operands are used to implementsimple predication.The stored flag value depends upon the operation: unsigned addition/subtrac-tion stores the carry-out/borrow, while signed addition/subtraction stores the over-flow. The flag and most significant bit (MSB) result bits can thus be used to checkout-of-range results, less-than/greater-than (using the negative-result flag), or ex-tended precision (64-bit) arithmetic.363.3.5 Streamlined Instruction SetVENICE implements a simple instruction set, consisting of 24 basic operationsversus over 50 instructions in VEGAS. VENICE also has several mode bits toindicate more complex operations. Because two Nios custom instructions are re-quired to dispatch a VENICE instruction, there are enough usable bits to encodethe ISA while allowing any combination of mode bits, making the ISA fully or-thogonal. Absolute value, 2D/3D instructions, and accumulation reduction can allbe combined with any instruction and each other. This allows for complex instruc-tions to be built that perform multiple operations between reading and writing tothe scratchpad. All operations can take a scalar operand as one input, and the scalaralways replaces operand A, in contrast to VEGAS where the scalar would replaceoperand A or B depending on the operation.3.3.6 FPGA Architecture-Specific OptimizationsFigure 3.4 shows VENICE’s method of implementing fracturable multipliers ver-sus the partial products method used in VEGAS. The previous method used four18-bit multipliers. Byte and halfword operations could be performed directly usingthe 18-bit multipliers, while word operations used the four multipliers to computepartial products which were then added together. Since this addition was not per-formed for all operations, the adder could not be implemented using one of the DSPBlock’s internal adders. The VENICE method, by contrast, uses a simpler parallelimplementation where there is one 36-bit multiplier that can perform a word/half-word/byte multiply, one 18-bit multiplier that can perform a halfword/byte mul-tiply, and two 9-bit multipliers that can perform byte multiplies. The VENICEmethod uses the same number of DSP Blocks as the partial products method, butdoes not require additional adders or multiplexers on the inputs. This leads to bothlower ALM usage and lower delay in Stratix IV FPGAs. The lower ALM usage ofthe multiplier is the primary reason for the size difference of the ALUs of VEGASand VENICE that will be shown in Section 3.5.1.The only drawback of the new multiplier organization is that a single lane’smultipliers cannot be packed into a single Stratix IV DSP Block due to configura-tion limitations. However, two lanes worth of multipliers may be packed into two3736x3618x189x99x96416byte 0 / halfword 0 / wordbyte 3byte 2 / halfword 1byte 13216b) VENICE Fracturable Multiplier2 lanes pack into 2 DSP blocks and use fewer ALMs<<16<<3218x1818x1818x1818x18643232wordbyte 0 /halfword 0byte 3 /halfword 1byte 2byte 11616a) VEGAS Fracturable Multiplier1 lane packs into 1 DSP block36x369x99x936x369x99x9One DSP Blockcontains two36b MultipliersEach 36b Multipliercan be divided intotwo 18b Multipliers orfour 9b Multipliers18x1818x1818x1818x1818x1818x18Figure 3.4: Fracturable Multiplier StylesDSP Blocks. Hence, V1 may use an additional half DSP Block, but larger designsuse the same amount as VEGAS.Despite its lower latency compared to previous multiplier implementations, itis necessary to pipeline the multiplier over two stages to achieve high freqencyoperation. This leaves an extra cycle for processing simple (non-multiplier) ALUoperations which can complete in a single cycle. A general absolute value stagewas added after the integer ALU, allowing operations such as absolute differencein a single instruction. When followed by VENICE’s reduction accumulators,operations such as sum-of-absolute-differences (used for motion estimation) andmultiply-accumulate instructions (for matrix multiply) can be implemented in asingle instruction.38#include "vector.h"int main(){const int length = 8;int A[length] = {1,2,3,4,5,6,7,8};int B[length] = {10,20,30,40,50,60,70,80};int C[length] = {100,200,300,400,500,600,700,800};int D[length];// if A,B,C are dynamically modified,// then flush them from the data cache hereconst int num_bytes = length * sizeof(int);// alloc space in scratchpad, DMA from A to vaint *va = (int *) vector_malloc( num_bytes );vector_dma_to_vector( va, A, num_bytes );// alloc and DMA transfer, in one simple callint *vb = (int *) vector_malloc_and_dmacpy( B, num_bytes );int *vc = (int *) vector_malloc_and_dmacpy( C, num_bytes );// setup vector length, wait for DMAvector_set_vl( length );vector_wait_for_dma(); // ensure DMA donevector( VVW, VADD, vb, va, vb );vector( VVW, VADD, vc, vb, vc );// transfer results from vc to Dvector_instr_sync(); // ensure instructions donevector_dma_to_host( D, vc, num_bytes );vector_wait_for_dma();vector_free();}Figure 3.5: VENICE Code to Add 3 Vectors3.4 Native Programming InterfaceThe native VENICE API is similar to inline assembly in C, using C preprocessormacros. The use of macros for instruction encoding was introduced in VEGAS,and makes vector instructions look like C functions without any run time over-head. Previous SVPs had required the programmer to write assembly directly.VENICE further improves on the VEGAS programming model by not having avector address register file (VARF) (as explained in Section 3.3.1), so no registermanagment is needed in VENICE code. An example of VENICE code to add three39vectors together is shown in Figure 3.5.Each macro dispatches one or more vector assembly instructions to the vectorengine. Depending upon the operation, these may be placed in the vector instruc-tion queue, or the DMA transfer queue, or executed immediately. A macro thatemits a queued operation will finish as soon as the operation is enqueued, and sub-sequent macros may run before the enqueued operation is executed. Some macrosare used to restore synchrony and explicitly wait until the vector engine or DMAengine is finished.Programs using the VENICE programming model follow seven steps:1. Allocation of memory in scratchpad2. Optionally flush data in data cache3. DMA transfer data from main memory to scratchpad4. Setup for vector instructions (e.g., the vector length)5. Perform vector operations6. DMA transfer resulting data back to main memory7. Deallocate memory from scratchpadThe basic instruction format is vector(MODE, FUNC, VD, VA, VB),where the values of MODE and FUNC must be predefined symbols, while the val-ues of VD and VB must be scratchpad pointers and VA can be a scalar value orscratchpad pointer.For example, to add two unsigned byte vectors located in the scratchpad byaddress pointers va and vb, increment the pointer va, and then store the result ataddress pointer vc, the required macro would be:vector( VVBU, VADD, vc, va++, vb);In the example, the MODE specifier of VVBU refers to a ‘vector-vector’ op-eration (VV) on byte-size data (B) that is unsigned (U). The vector-vector part can40void vector_fir(int num_taps, int num_samples,int16_t *v_output, *v_coeffs, *v_input){// Set up 2D vector parameters:vector_set_vl( num_taps ); // inner loop countvector_set_2D( num_samples, // outer loop count1*sizeof(int16_t), // dest gap(-num_taps )*sizeof(int16_t), // srcA gap(-num_taps+1)*sizeof(int16_t) ); // srcB gap// The instruction below repeats a 1D dot-product (ie,// 1D accumulate) operation once for every output sample.// After each 1D dot product, it adds the gap values (set// by vector_set_2D) to each of the three pointers.vector_acc_2D( VVH, VMULLO, v_output, v_coeffs, v_input );}Figure 3.6: FIR Kernel Using 2D Vector Instructionsinstead be scalar-vector (SV), where the first source operand is a scalar value pro-vided by Nios instead of an address. These may be combined with data sizes ofbytes (B), halfwords (H) and words (W). A signed operation is designated by (S)or by simply omitting the unsigned specifier (U). For example, computing a vec-tor of signed halfwords in vresult by subtracting a vector vinput from thescalar variable k would be written as vector( SVH, VSUB, vresult, k,vinput).Space can be allocated in vector scratchpad memory using vector malloc(num bytes ) which returns an aligned pointer. As it is common to allocatespace and immediately DMA a vector from main memory to scratchpad memory,the vector malloc and dmacpy() function combines both operations into asingle function call. The vector free() call frees all previous scratchpad al-locations; a single macro which frees all allocated memory was provided since thecommon case is to utilize the scratchpad for one kernel/function after which it canbe reused for the next kernel/function. DMA transfers and instruction synchroniza-tion are handled by macros as well.Figure 3.6 gives an example of using 2D vector instructions: a kernel that per-forms a finite impulse response (FIR) filter with 16-bit input and output data. Thisfunction can be performed in a single VENICE instruction using the reduction ac-cumulators. Without 2D instructions, the scalar processor must repeat a 1D vector41Table 3.1: Resource Usage ComparisonVEGAS VENICEDevice or CPU ALMs DSPs M9Ks Fmax ALMs DSPs M9Ks FmaxStratix IV EP4SGX530 212,480 128 1,280 – 212,480 128 1,280 –Nios II/f 1,223 1 14 283 1,223 1 14 283Nios II/f + V1 (8kB) 3,831 2 35 131 2,529 2.5 27 206Nios II/f + V2 (16kB) 4,881 3 49 131 3,387 3 40 203Nios II/f + V4 (32kB) 6,976 5 77 130 5,096 5 66 190instruction (of num taps length) using a for() loop of length num samples.This requires the scalar processor to increment the source and destination addressesand count the number of iterations.With 2D instructions, the loop counting and operand address incrementing isperformed in hardware. The vector set 2D() call specifies the number of iter-ations (first parameter), and three gaps for the destination and two source operands,respectively. Each gap is a distance, measured in bytes, between the end of the rowand the start of the next row. Hence, a gap of 0 implies a packed 2D matrix, whilea negative gap produces a sliding window effect as required by the FIR filter. Thegap for operand A, the FIR coefficients, produces a stationary window. The desti-nation gap is 2 bytes because each 1D operation is accumulated, producing a resultrow which contains just 1 element (i.e., the dot product).3.5 Evalution ResultsAll soft processor results in this paper are measured on an Altera DE4-530 de-velopment system using Quartus II version 11.0. All software self-verifies itselfagainst a sequential C solution.3.5.1 Area and Clock FrequencyThe overall resource usage and clock frequency for VENICE compared with VE-GAS is in Table 3.1. In this table, the overall Stratix IV-530 device capacity isshown in terms of ALMs, DSP Blocks, and M9K BRAMs. It is important to notethat an Altera DSP Block is a compound element consisting of two 36b multipliers.Alternatively, each 36b multiplier can be statically configured as two 18b multipli-42Table 3.2: Area Breakdown (ALMs)VEGAS VENICE SavingsFracturable ALU 771 471 300Control/Pipeline 1200 538 662DMA 501 181 320Alignment (V1) 136 116 20Alignment (V4) 448 855 -407ers or four 9b multipliers. Altera literature usually quotes device capacity as thenumber of internal 18b multipliers; there are eight 18b multipliers internally in aDSP Block, but only four can be used indepedently.Table 3.1 also gives area and clock frequency results for several VEGAS andVENICE configurations, each of which includes a Nios II/f base processor. TheV1, V2, and V4 notation indicates one, two and four 32b vector lanes, respectively.The (8kb), (16kB), and (32kB) notation indicates the scratchpad sizes used in eachconfiguration. VENICE uses fewer ALMs and fewer M9Ks than VEGAS in allconfigurations. Except for the smallest V1 configuration, VEGAS and VENICEuse the same number of DSP Blocks. In the V1 configuration, VENICE is using2.5 DSP Blocks, but these are not filled to capacity; in addition to the 0.5 DSPblock being available, there is also room for one 18b and two 9b multipliers. Theclock frequency achieved with VENICE is also 50% higher than VEGAS. Exceptfor the V1 DSP block anomaly, VENICE completely dominates VEGAS in areaand clock frequency.Figure 3.7 gives a more detailed area breakdown of VEGAS and VENICE.VENICE has consistently lower area for everything but the alignment network(when using multiple lanes). Precise area values for the V1 configuration are shownin Table 3.2. The savings in each area is primarily due to:• Fracturable ALU savings is primarily due to the new parallel multiplier,which uses fewer adders and requires less input and output multiplexing.Additional savings were obtained by streamlining the ISA. Note that thissavings is per lane.• Control/Pipeline savings is primarily due to removal of the 8-entry vectoraddress register file. Since each operand must support an auto-increment43Figure 3.7: Area Breakdown (ALMs)mode, VEGAS implemented these in ALMs and FFs.• The VEGAS DMA engine includes an alignment network, which allows datato be loaded into an aligned position to help avoid the misalignment perfor-mance penalty. This is not necessary in VENICE, since it runs full-speed onunaligned data. Instead, VENICE performs DMA alignment in software, sounaligned DMA operations may exhibit a slowdown.• The VEGAS alignment network was designed to scale to a large number oflanes, while the VENICE alignment network was optimized for only a fewlanes.3.5.2 Benchmark PerformanceThe characteristics of nine application kernels are reported in Table 3.3. The inputand output data types for each kernel are shown in the Data Type column, includingwhether a higher precision is used during intermediate calculations. The overallinput data set size is also shown, along with the size of a filter window in the Tapscolumn. Some of these kernels come from EEMBC [17], others are from VIRAM,44Table 3.3: Benchmark PropertiesData Type Benchmark PropertiesBenchmark In/Out Intermed. Data Set Size Taps Originautocor halfword word 1024 16 EEMBCrgbcmyk byte - 896×606 EEMBCrgbyiq byte word 896×606 EEMBCimgblend halfword - 320×240 VIRAMfilt3x3 byte halfword 320×240 3×3 VIRAMmedian byte - 128×21 5×5 custommotest byte - 32×32 16×16 customfir halfword - 4096 16 custommatmul word - 1024×1024 customTable 3.4: Benchmark PerformancePerformance (Million elements per sec) SpeedupBenchmark Nios II/f V1 V2 V4 V1 V2 V4autocor 0.46 5.94 11.11 18.94 12.9 24.2 41.2rgbcmyk 4.56 17.68 21.41 22.72 3.9 4.7 5.0rgbyiq 5.20 6.74 11.09 15.61 1.3 2.1 3.0imgblend 4.83 77.63 145.57 251.18 16.1 30.1 52.0filt3x3 2.11 16.82 26.95 36.42 8.0 12.7 17.2median 0.10 0.74 1.45 2.69 7.3 14.4 26.6motest 0.09 2.37 4.18 6.29 27.4 48.2 72.4fir 3.32 20.11 34.95 41.67 6.1 10.5 12.5matmul 11.7 148.20 322.22 593.75 12.6 27.4 50.6Geomean 7.95 13.8 20.6and others are written by the authors. The ‘median’ benchmark performs a 5x5median filter on a greyscale input image. The ‘motest’ benchmark performs (lumaonly) motion estimation of a 16x16 reference block across a 32x32 search range.Note that ‘motest’ only computes the set of sum of absolute differences (SAD)values, it does not include the time to scan and find the lowest SAD location. The‘fir’ benchmark is a 16-tap, 16-bit FIR filter. The ‘matmul’ benchmark performs32-bit integer dense matrix multiplication.The Nios II/f processor was run at 283MHz with a 200MHz Avalon intercon-nect and 200MHz DDR2 controller (i.e., at the limit of the DDR2-800 SODIMM).The VENICE V1 and V2 configurations were run synchronously at 200MHz foreverything, including the Nios II/f, VENICE engine, Avalon interconnect, andDDR2 controller, while the V4 configuration was run with everything at 190MHz.45Figure 3.8: Speedup (Geometric Mean of 9 Benchmarks) vs Area ScalingThe performance of these kernels is shown in Table 3.4. The results are in unitsof millions of elements computed per second. For the first eight kernels, an ele-ment is in units of the output data type. This is meaningful because the amount ofcompute work is linear to the number of output bytes. For example, the motest ker-nel computes 0.09 million SAD calculations per second, where each calculationresults in one byte of output. In matrix multiply, however, the amount of com-putation is not a linear factor of the number of output elements. For that kernelonly, we report performance as millions of multiply accumulates (MACs) per sec-ond (MAC/s). The peak performance of VENICE V4 at 190MHz is 760 millionMAC/s, so our matrix multiply code is running at 78% of peak performance.3.5.3 Speedup versus AreaThe VENICE processor is designed to offer higher performance and use less areathan VEGAS. Figure 3.8 demonstrates this with a speedup versus area plot. Thespeedup and area results are normalized to the Nios II/f results, and these are usedto compute a geometric mean across nine benchmark programs. Area results ac-460.1	  1	  10	  100	  autcor	  rgbcmyk	  rgbyiq	  imgblend	  filt3x3	  median	  motest	   fir	  int	  matmul	  geomean	  Speedup	  per	  ALM	  (rela?ve	  to	  Nios	  II/f)	  	  VEGAS-­‐V1	   VENICE-­‐V1	  Figure 3.9: Computational Density with V1 SVPscount for ALMs only; VENICE uses fewer M9K’s but a similar number of DSPBlocks.Results for a single Nios II/f processor are presented at the <1.0,1.0> positionon the graph. Diamond-shaped data points extrapolate the performance and area ifmultiple Nios processor instances are created (with no interconnect overhead) andrun perfectly in parallel. Triangular-shaped data points for V1, V2, and V4 config-urations of VEGAS clearly outperform Nios II/f, achieving a maximum speedup of17.2× at an area overhead of 5.7×. Circular-shaped data points for similar config-urations of VENICE show it dominates VEGAS in both area and speed, achievinga maximum speedup of 20.6× at an area overhead of 4.0×.Speedup divided by area produces a metric of performance per unit area. Thiscan also be considered a measure of the computational density of an FPGA. Fig-ure 3.9 compares the computational density for VEGAS and VENICE using a smallV1 configuration.Since the configuration (and therefore area) used is the same across all nine47benchmarks, computational density differences between them correspend onlyto performance differences. Simple benchmarks such as rgbcmyk, rgbyiq,imgblend and median achieve the smallest performance increase over VEGAS.These benchmarks have large vector lengths and no misaligned vectors, and so thespeedup comes mostly from the clock speed increase. Convolution benchmarkslike fir and autocor benefit from the lack of misalignment penalty on VENICE.The 2D vectors accelerate autocor, motest, and fir. On matmul, using 3D vectorsand the accumulators allow VENICE to achieve 3.2× the performance of VEGAS.For one application, rgbyiq, the computational density falls below 1.0 onVENICE, meaning Nios II/f is better. This is because the area overhead of 1.8×exceeds the speedup of 1.3×. The limited speedup is due to a combination ofmemory access patterns (r,g,b triplets) and wide intermediate data (32b) to preventoverflows. However, on average, VENICE-V1 offers 3.8× greater computationaldensity than Nios II/f, and 2.3× greater density than VEGAS-V1.Comparing V4 configuration results (not shown), the computational density ofVENICE is 5.2×, while VEGAS is 2.7× that of Nios.3.5.4 Case Study: DCTVENICE was designed to exploit vector parallelism, even when vectors are short.By remaining small, VENICE can be efficiently deployed in multiprocessor sys-tems to efficiently exploit other forms of parallelism (eg, thread-level) on top of thevector-level parallelism.In this section, we use VENICE to perform a 4x4 discrete cosine transform(DCT) with 16-bit elements on a total of 8192 different matrices. Each DCT is im-plemented using two matrix multiplies followed by shifts for normalization. Vec-tors are short, limited to four halfwords for the matrix multiply.In Figure 3.10, the first set of bars shows the benefit of using 2D and 3D vectoroperations with a V2 VENICE configuration. In the 1D case, VENICE is issue-ratelimited and gets poor speedup. To compute one element in the output matrix (i.e.,compute the dot product of one row with a transposed row), the Nios processormust compute the addresses and issue two custom instructions. Still, it managesto get a speedup of 4.6× over the Nios II/f. With 2D instructions, an entire row480	  10	  20	  30	  40	  50	  60	  1D	  (V2)	   2D	  (V2)	   3D	  (V2)	   V1	   V2	   V4	   1xV2	  (3.4k	  ALMs)	  2xV2	  (8.8k	  ALMs)	  4xV2	  (19k	  ALMs)	  Speedup	  vs	  Nios	  II/f	   Speedup	  per	  ALM	  Figure 3.10: 16-bit 4x4 DCT Varying 2D/3D Dispatch, SVP Width, and # ofSVPsin the output matrix can be computed with a single VENICE instruction, giving aspeedup of 3.6× over the 1D case. With 3D instructions, it can compute the entireoutput matrix with one instruction, running 1.4× faster than the 2D case. With 3Dinstructions, one DCT takes 43.9 cycles on average. The theoretical minimum is40 cycles, giving 91% ALU utilization.The second set of bars shows performance scaling from one to four vectorlanes. A V1 VENICE can achieve a speedup of 10.8×Nios II/f, and the benchmarkscales well to V2 (1.87× faster than V1). However, V4 is only 1.05× faster thanV2. During the matrix multiply step, the maximum vector length is 4 halfwords,so it does not benefit from more than 2 lanes; only the normalization step benefits.The final set of bars shows the results of a simple multiprocessor VENICE sys-tem, consisting of one, two, and four V2 VENICE processors. This system wasconstructed by connecting together several Nios II/f processors, with VENICE ac-celerators, using the default Altera Avalon fabric. Scaling from one to two pro-49cessors leads to a system that is 1.98× faster. Scaling from two to four processorsachieves a speedup of only 1.31×. This is lower than expected, because we arenot yet bandwidth-limited (the memory bandwidth limit is 8 cycles per DCT, butwe are achieving only 17.0 cycles per DCT at this point). Further investigation isrequired.The set of red bars in the figure indicates speedup per ALM. The multipro-cessors use a large number of additional ALMs in the Avalon fabric. This greatlydeteriorates the speedup-per-ALM advantage, and suggests that Avalon is not themost efficient way to build a VENICE multiprocessor.3.6 SummaryThis chapter investigates ways to reduce SVP area and increase SVP performanceby targetting modest performance applications and using FPGA-specific optimiza-tions. VENICE is designed to accelerate applications that require higher perfor-mance than a scalar soft processor can provide, but do not need to scale to a large,expensive FPGA.Speedups over 70× a Nios II/f were demonstrated, with 3.8× to 5.2× bet-ter performance per logic block from V1 to V4. VENICE is also both smallerand faster than the VEGAS, the previous best SVP at the time of development.VENICE offers over 2× the performance per logic block of VEGAS while usingfewer BRAMs and the same number of DSP blocks. The use of 2D and 3D in-structions allows for high ALU utilization (91% in DCT) even with small vectorlengths.VENICE can be used on its own to implement applications with modest per-formance needs, giving FPGA designers a tool that is quick to iterate designs with.Additionally, VENICE could be conceived as a building block in a heterogeneoussystems such as a vector/thread hybrid solution [38].50Chapter 4Wavefront Skipping on SoftVector ProcessorsLet us redefine progress to mean that just because we can do a thing,it does not necessarily mean we must do that thing. — FederationPresident, Star Trek VIIn the previous chapter, we discussed how VENICE can replace RTL formodest-performance data-parallel applications with much lower overhead than pre-vious SVPs. This gives designers a tool for quickly developing applications thathave straightforward and regular data parallelism. To enable more applicationson SVPs, we investigated broader classes of algorithms. This chapter presents amethod for accelerating certain irregular data parallel applications, such as facedetection. We utilize the cheap and abundant FPGA BRAMs to avoid wasted com-putational cycles.4.1 IntroductionSVPs process multiple data elements in parallel; the data processed in a singlecycle is called a wavefront. If the vector length is longer than the number of lanes,the instruction will sequentially process one wavefront per cycle until the wholevector instruction has completed. For algorithms that use conditional execution,i.e., branching, the vector processor must execute both paths of the branch and51Figure 4.1: Wavefront Skipping on a 4 Lane SVPmask off writes for elements not on the active branch. This can result in a largeportion of execution unit results that are not used, especially for algorithms thatcan stop processing some data elements early depending on a conditional check.In order to speed up these conditional algorithms, masked-off elements mustnot utilize execution slots. It is possible to compress vectors in order to removemasked-off elements [63], but we will show later that this is costly for algorithmssuch as stencil filters. Rather, we would like to skip over elements without rearrang-ing data, which led us to implement wavefront skipping. In wavefront skipping, awavefront where all the elements are masked off is skipped. We also implementedpartial wavefront skipping, by which the wavefront is divided into partitions thatcan skip independently. This fine-grained skipping can lead to performance in-creases, at the cost of requiring more resources.Figure 4.1 helps illustrate how wavefront skipping works. The values shownare the mask bits corresponding to each data element; a zero indicates that elementis masked off. In Figure 4.1a one of the eight wavefronts can be skipped since all52of its elements are masked off. Normally this instruction would take eight cycles toprocess, but with wavefront skipping it can complete in seven. Figure 4.1b showstwo wavefront partitions; the finer granularity means more partial wavefronts canbe skipped, and so the instruction can run in only four cycles (the partition withonly three sub-wavefronts of active data ends up being idle during the last cycle).Figure 4.1c shows four wavefront partitions, at which point the partition size is asingle element, and the instruction can execute in two cycles.This chapter gives the first implementation of wavefront skipping on SVPs. Ituses a different approach than is used on fixed vector processors, where the maskregister is read out in parallel and the number of leading masked-off elements iscomputed each cycle. Instead, we take advantage of the relative abundance ofBRAMs on FPGAs by computing wavefront offsets beforehand in a setup instruc-tion and storing them. Additionally, we implement partitioned wavefront skipping,where instead of entire wavefronts skipping together, partial wavefronts (down toindividual bytes) can skip by different amounts. To our knowledge this is the firstimplementation of this idea in any vector processor. The full wavefront skippingimplementation requires no more than 5% increase in area and a single BRAM andachieves speedups of up to 3.2× for the early exit algorithms we have tested. Theaddition of partial wavefront skipping provides additional gains of up to 5.3× butrequires up to 27% additonal area. While this might not be a good tradeoff in ahard vector processor, it may make sense as a configuration option for an SVP.Because the SVP can be configured differently for each design, partial wavefrontskipping may make sense if the design has some unused BRAMs within the deviceor the algorithm is particularly sensitive to wavefront partitioning.4.2 BRAM Based Wavefront SkippingThis work builds on the VectorBlox MXP [59], a VENICE-like commercial SVP.An architectural diagram of MXP can be seen in Figure 4.2. Prior to this workMXP supported predicated execution through conditional move (CMOV) instruc-tions only. This differs from wavefront skipping in that the CMOV operation doesboth a condition check and move in the same instruction, and it operates on theentire vector instead of just the valid wavefronts. The CMOV instruction checks53Align BAlign AAlign CScratchpadAddress GenerationDMA and Vector Instruction QueuesSystemBus orNoCALUsΣDMAMSRdSrcARdSrcBWrDstCR/WDMAA B C DFrom Scalar CoreMasked UnitFigure 4.2: VectorBlox MXP with Four 32-bit Lanesconditions using a flag bit associated with each element; each 9-bit wide scratch-pad BRAM stores 8-bits of data and one flag bit. The flag bit is set by earliervector instructions; for instance an add instruction stores overflow while a shiftright stores the bit shifted out. The flag bit is used along with whether the elementis zero or negative to perform several different CMOV operations. The most com-mon CMOV operations first perform a subtraction-based comparison; the result ispredicated based on whether the result is less than zero (LTZ), less than or equal tozero (LTE), etc. The CMOV hardware is part of the ALUs.4.2.1 Full Wavefront SkippingFigure 4.3 gives an example of how to use wavefront skipping to perform stridedoperations, such as operating on every fifth element as in Figure 4.1. The first loopsets every fifth element of v temp to be 0 (this could be done with vector divideor modulo instructions if the SVP supports them, but is shown in scalar code for54#define STRIDE 5//Toy functition to double every fifth element//in the vector v_avoid double_every_fifth_element( int *v_a,int *v_dest,int *v_tempint vector_length ){int i;//Initialize every fifth element of v_temp to zerofor( i = 0; i < vector_length; i++ ) {v_temp[i] = i % STRIDE;}//Set the vector length which will be processedvbx_set_vl( vector_length );//Set mask for every element equal to zerovbx_setup_mask( CMOV_Z, v_temp );//Perform the wavefront skipping operation://multiply all non-masked off elements of v_a by 2//and store the result in v_destvbx_masked( SVW, MUL, v_dest, 2, v_a );}Figure 4.3: Code Example: Double Every Fifth Element of a Vectorclarity). The vbx setup mask instruction takes as input a comparison operation(CMOV Z, or use the CMOV hardware to test for equal to zero) and a pointer tothe vector operand (v temp, which is a pointer to a vector of 32-bit data).After the mask is set, the vbx masked instruction is executed. The ‘SVW’type specifier indicates that it operates on scalar (2) and vector (v a) inputs andthat the values are words (32-bit). The elements in the destination vector (v dest)that are masked-off will not be written, while the valid elements will be set to thecorresponding element of v a multiplied by 2. Although this is a trivial exampleand is not faster than using a conditional move, complex algorithms that reuse themask several times and/or have sparse valid elements can give significant speedup.In order to support wavefront skipping, we need to alter the scratchpad addressgeneration logic of MXP. For normal vectors, the addresses of the operands areincremented by one wavefront every cycle. So, with a V2 MXP (meaning it hastwo 32-bit vector lanes), each address is incremented by 8 bytes (the width of one55wavefront) each cycle. To support wavefront skipping, we want to add a variablenumber of wavefronts to the operand addresses depending on the value of a maskthat was observed earlier. We accomplish this by storing the offsets of wavefrontsthat have valid elements in a BRAM inside the ‘masked unit’. Because a wavefrontcontains multiple elements, we also have to store a valid bits for each element.These are stored per byte, and during subsequent execution become byte enablesupon writeback. Finally, the length of a wavefront skipped vector may be shorterthan the length of the full vector, so we have to mark the last wavefront. We usean extra bit to mark the wavefront that is the end of the skipped vector. This is animplementation detail that was convenient in our design and could be replaced bystoring the number of wavefronts in a register.BRAMs are limited in depth, however, and MXP’s scratchpad can hold vectorsof any length. To allow for wavefront skipping of vectors of the whole scratchpadlength, the masked unit would need a BRAM as deep as the scratchpad itself. Thiswould be wasteful for many applications, so we allow the user to configure amaximum masked vector length (MMVL) as an SVP build option. The minimumdepth of BRAMs in the Stratix IV FPGAs we tested on is 256 words, so settingMMVL to be less than 256 will result in underutilized BRAMs. The fact thatwavefront skipping instructions have a length restriction must be known by theprogrammer. In real applications data does not fit entirely in scratchpad memoryand so is operated on in chunks (also known as strip-mining [70]) even without anMMVL; the MMVL only affects the size of the chunks.To generate the wavefront offsets, we added the mask setup instruction. Thismask setup instruction uses the conditional move logic already in MXP to generatethe valid bits set in the mask BRAM. Figure 4.4a shows the data written the maskBRAM during the mask setup instruction. For each wavefront processed, the offsetand valid bits are sent to the masked unit. If no valid bits are set, as in wavefront4 (data elements 16-19), the masked unit does not write into its BRAM. If any ofthe valid bits are set, the masked unit writes the current offset and the valid bitsto its current BRAM write address and increments the write address. The finalwavefront causes the end bit to be written to the mask BRAM along with its offsetand byte enables, provided there are any byte enables set. In this example the finalwavefront has valid byte enables, but in instances where the final wavefront has no56Figure 4.4: Data Written to Mask BRAM (Every Fifth Element Valid)byte enables set we redo the write of the last valid wavefront with the end bit set.In the case of algorithms with multiple early exit tests, the set of elements beingmasked off grows with each test. In this case, we wish to generate a new mask thatis based on the previous mask, except that a superset of elements will be maskedoff. In this situation, it is desirable to have a masked ‘setup mask’ instruction. Thisis a trivial extension of the normal mask setup instruction; instead of the wavefrontnumbers progressing linearly, they are taken from the output of the masked unit.In this way the number of valid elements can decrease until either the algorithmfinishes or no more valid elements are left. When no valid elements are left, anywavefront skipping instructions will execute as no operations (NOPs), taking onlya single cycle in the vector engine to execute. However, dispatching the vectorinstructions and calculating vector addresses still takes time in the scalar engine,so it is desirable to exit the algorithm completely if no elements are valid. For thiscase, we have a mask status register that the scalar core can query to determine if57any valid elements are left.4.2.2 Wavefront PartitioningSo far we have only discussed skipping whole wavefronts. When dealing with wideSVPs, it may be of little value to skip whole wavefronts, since it will be rare thatthe entire wavefront can be skipped. As a trivial example, the code from Figure 4.3will skip 4/5 of wavefronts on a V1, 1/5 of wavefronts on a V4, and no wavefrontson wider MXP’s (every wavefront will contain at least one valid word). To speedupthis code on a wide MXP, we need to support skipping at a narrower granularitythan the wavefront. We support this by partitioning the wavefront into narrowerunits which each have separately controllable BRAMs; for instance, a V4’s maskedunit can have 1 BRAM (whole wavefront skipping), 2 BRAMs (pairs of lanes canskip together), or 4 BRAMs (each lane can skip individually). If halfword or byteoperations are frequently done, the masked unit can even have 8 or 16 BRAMs,respectively. Each BRAM requires the same number of bits to store offsets and theend-of-masked-vector bit, but the byte-enable bits are specific to each partition.Figure 4.4b shows the data written to the 2 mask partition BRAMs for thecode from Figure 4.3. This mask will take 4 cycles to execute (the maximum ofthe depth written to all of the partitions), compared to the 7 cycles needed for thesingle partition shown in Figure 4.4a.One complication with wavefront partitioning in our architecture is the map-ping from partitions to scratchpad BRAMs. In an architecture with a simple regis-ter file or a scratchpad that did not allow unaligned accesses, each BRAM wouldget its offset from a fixed partition; in a V2 with two partitions, Lane 1 would getits offset from Partition 1 and Lane 2 from Partition 2. However, our scratchpadsupports unaligned addresses. This means that partitions are not directly associ-ated with a scratchpad BRAM; instead, depending on the alignment of the vectoroperands a scratchpad BRAM address may come from any of the partitions. Thismeans we had to implement an offset mapping network to map wavefront offsetsto scratchpad BRAMs. Note that there is no such overhead with full wavefrontskipping (single partition) because the same offset goes to all BRAMs, which isjust a broadcast of the offset which requires no additional logic.58for stage in classifier:for row in image:vector::init-maskfor feat in stage:for rect in feat:vector::feat.sum += vector::image[rect]if vector::feat.sum > feat.threshold:vector::stage.sum += feat.passelse:vector::stage.sum += feat.failif vector::stage.sum < stage.threshold:vector::update-maskif all elements masked:exitFigure 4.5: Pseudocode for Viola-Jones Face Detection4.2.3 Application Example: Viola-Jones Face DetectionFigure 4.5 gives high level pseudocode for one of the benchmarks we have im-plemented, Viola-Jones face detection. Face detection attempts to detect a face atevery (x,y) pixel location in an input image. To vectorize this, we will test severalpossible starting locations in parallel: a 2D vector of starting locations character-ized by (x+i,y+j). In this way, we will be testing for i∗ j face locations in the vectorsimultaneously.Viola-Jones face detection must compute several thousand values, called Haarfeatures, for each pixel location. Features are grouped into stages. The featuresin a stage must pass a threshold test if a face is to be detected at that location. Ifany stage fails this threshold test, there is no point in testing further features at thatlocation. Hence, each stage is an ‘early exit’ test for each pixel starting location.Features are calculated in-order across this vector of locations. When utilizingpredicated instructions, the algorithm must continue processing each feature for alllocations, even though some locations may have already failed an earlier thresholdtest and do not need to participate in further calculations. If even a single locationin the 2D vector may still have a face, the algorithm must continue to compute forall locations. Here, long vector lengths work less efficiently. In this model, paral-lelism due to vectorization runs against the ability to exit early; longer vectors are59Figure 4.6: Haar Features Calculated at Each Candidate Face Location forDifferent Groupings (Percentage Work Done is Calculated Relative toMinimum)more likely to contain locations requiring computation of many features, leading toextensive processing for all elements in the vector. By using wavefront skipping,we can avoid processing elements that are already known not to contain a face.Further features are computed only for those locations still in question, effectivelyshortening the vector length to the relevant elements.The only differences between the predicated/CMOV version (without wave-front skipping) and the wavefront skipping version are the init-mask and update-mask instructions and the low level details of the test checking whether all ele-ments are masked. Porting an existing predicated algorithm to wavefront skippingis therefore straightforward and requires minimal effort.Figure 4.6 compares the amount of work done for different strategies. Eachplot shows the number of Haar features calculated at each candidate face location,60from the minimum of three (in black) to a maximum of 2135 (in white). Differentstrategies result in doing different amounts of work at each location; the less workthat is done (less white and more black in each heatmap) the faster execution canbe. The total work is also calculated for each strategy and shown as a percentrelative to the minimum. The fact that the difference between the minimum andmaximum number of features that must be calculated is three orders of magnitudeshows the need for some smarter form of predication than simply computing theworst case on all pixels.In order to run the application on an SVP, groups of pixels must be operatedon as parallel vectors. Larger groupings provide more parallelism, but withoutwavefront skipping (only using CMOV instructions) every pixel in a group mustpass the maximum number of Haar feature tests of any pixels within the group.It can be difficult to balance the CMOV implementation; it is tempting to use theminimum vector length possible to reduce the amount of work done, but below acertain level the lack of parallelism means runtime actually increases. The resultsin Figure 4.6 are from a 4 lane MXP implementation, and we found the runtimeoptimal vector length for CMOV implementations empirically.A naive vector layout would be to have every row be a vector (Figure 4.6a),which does almost 8× as much work as the optimal. A slightly better methoduses rectangular blocks to get better spatial locality (Figure 4.6b); getting suffi-cient parallelism still requires vectors long enough that almost 6× as much workas the minimum is done. In contrast, simply by changing the implementation touse wavefront skipping, the effective grouping is the wavefront partition width(Figure 4.6c). Wavefront skipping removes the need to profile an application to de-termine the best vector length to use; the wavefront skipping implementation canuse the longest vector length possible and will always achieve at least the perfor-mance of the CMOV implementation. Finally, a fully partitioned design performsthe minimal number of feature tests (Figure 4.6d), which is the same number offeatures that a scalar implementation would require.61Figure 4.7: Vector Compress Operations Needed for 5x1 Stencil Filter4.2.4 Comparison with Vector CompressAn alternative method mentioned in Section 2.5 is a vector compress operation.The compress operation removes masked-off elements from a vector, creating ashorter vector as a result. Figure 4.7 demonstrates how this would work for asimple 5x1 stencil filter. The stencil filter takes as inputs shifted versions of theinput data, from an offset of -2 to +2. Before running the stencil filter, each ofthe shifted versions must be separately compressed. The five versions of the datawith different offsets must be compressed into five scratchpad locations, each ofwhich uses as much storage as the original in the worst case. After compressingthe inputs, subsequent vector operations can operate on the shorter vectors at fullspeed.The drawbacks of requiring a compress instruction and additional storage spacefor each input make it impractical in many stencil filter algorithms. For instance,on the Viola-Jones face detection application the amount of work and extra spaceneeded would be prohibitive. The Haar feature tests operate on a sliding 20x20window; each feature selects a subset of the window. Compressing the inputswould require compressing each location, or 400 compress operations and 40062Table 4.1: Resource UsageMXP Vector Lanes - Logic Memory DSP Blocks Total Area Area Increase fmaxWavefront Partitions ALMs M9Ks eALMs % Over CMOV MHz(CMOV) V1-P0 4,697 96 2 7,502 206V1-P1 5,034 97 2 7,877 5.0% 200(CMOV) V4-P0 9,732 152 5 14,243 173V4-P1 10,210 153 5 14,750 3.6% 176V4-P4 13,276 156 5 17,902 25.7% 183(CMOV) V32-P0 63,334 414 33 76,198 144V32-P1 63,027 418 33 76,005 -0.3% 144V32-P4 78,698 422 33 91,791 20.5% 149V32-P32 83,220 446 33 97,002 27.3% 146Nios II/f 1,370 19 1 1,945 283DE4-230 Maximum 91,200 1,235 161 131,434 –temporary vectors. The wavefront skipping method does not need to manipulatethe input data, meaning MXP only needs one copy of the chunk of the image beingscanned.Compress operations also need the inverse vector expand operation to restorethe output. This is less critical for algorithms like stencil filters where the numberof outputs is smaller than the number of inputs. Note that the compress operationis only of use when the vector being compressed is used multiple times, becausecompressing the vector takes an extra instruction. Setting up a mask for wavefrontskipping also takes an extra instruction, but the same mask can be used for multipleoffsets in a stencil filter.4.3 ResultsOur results were obtained using Altera Stratix IV GX230 FPGAs on the TerasicDE4-230 development board. FPGA builds were done using Quartus II 13.0sp1.We used one 64-bit DDR2 channel as our external memory.4.3.1 Area ResultsTable 4.1 shows the resources used and maximum frequency achieved for var-ious configurations of MXP. In addition to the actual FPGA resources used(ALMs, M9Ks, and DSP blocks), we also report the area in equivalent ALMs63(eALMs) [71]. By factoring the approximate silicon area of all of the resourcesused, eALMs are a convenient way to compare architectures that use different mix-tures of logic, memory, and multipliers. For the Stratix IV family of FPGAs eachM9K memory block counts as 28.7 eALMs and each DSP Block counts as 29.75eALMs (in addition to each ALM, which counts as one eALM).All MXP configurations shown have a maximum masked vector length (MMVL)of 256 wavefronts (the effect of changing MMVL will be investigated later). MXPconfigurations are listed as VX PY where X is the number of 32-bit vector lanes andY is the number of mask partitions. P0 means that masked instructions are disabledand only CMOV instructions can be used for conditionals. V1s are configured with64kB of scratchpad memory, V4s are configured with 128kB of scratchpad mem-ory, and V32s are configured with 256kB of scratchpad memory. The area numbersfor MXP include the Nios scalar core used for control flow and vector instructiondispatch, as well as prefix sum and square root custom vector instructions that areused in the face detection benchmark.The eALM area penalty is 5.0% from no wavefront skipping to wavefront skip-ping with full wavefront skipping on a V1. Most of this area penalty is in ALMs, asonly a single extra BRAM is used and no extra DSP Blocks. Since the number ofALMs for full wavefront skipping is roughly constant with respect to the numberof lanes, the overhead drops to 3.6% on a V4 and less than 1% on a V32. For mul-tiple partitions, the area overhead is much higher (up to 27.3% more eALMs in theV32-P32 case), because of the partition to scratchpad BRAM mapping needed (asexplained in Section 4.2.2). We implemented this mapping using both a switchingnetwork and multiplexers; the area results were similar but the multiplexer imple-mentation had lower latency and so was used for our results.4.3.2 BRAM UsageBRAM usage is minimal for the V1 and V4 as each mask partition uses just 1BRAM. For the V32, a single partition uses 4 BRAMs since it has to store 128-bits worth of byte enables, eight bits of offset, and one end bit, which fits in four36-bit wide M9Ks. With more partitions, the number of byte enables per par-tition is reduced, so that the V32-P32 only uses one BRAM per partition. The64Figure 4.8: BRAM Usage vs Wavefront Partitions (MMVL = 1024)masked unit has no critical timing paths; increasing the number of BRAMs useddoes make the placement job harder for the CAD tools, but the critical path alwaysremains outside the masked unit. Since the contribution of this work is not affect-ing fmax directly, we decided to remove this variability from our results by runningall benchmarks shown here at 125MHz.Figure 4.8 shows how many BRAMs are used in our partitioned wavefrontskipping with a maximum masked vector length (MMVL) of 1024 wavefronts.With a small number of partitions, as the vector processor gets wider, multipleBRAMs per partition are required to store the byte enables. With a large numberof partitions, the byte enables are divided up into small enough chunks that theyalways fit in a single BRAM per partition. There is a wide range in the number ofBRAMs that can be used by MXP. A designer has the freedom to allocate BRAMson the FPGA device to achieve the best performance with their algorithm, eitherby increasing scratchpad capacity, or by increasing flexibility for conditional exe-cution.65Figure 4.9: Mandelbrot Benchmark4.3.3 Mandelbrot BenchmarkFigure 4.9 shows speedup results for computing the Mandelbrot set (geometricmean of 23 frames shown in a visual demo). Results are shown compared to ascalar version run on the Nios II/f as a baseline. The Mandelbrot computationiterates a complex valued equation at each pixel until either the pixel reaches a setcondition (the early exit) or else a maximum iteration count is reached. Withoutmasked instructions (CMOV configurations) we can only exit early after all pixelsin the group agree to exit early. We show results for two versions of the algorithm:the ‘line’ implementation which naively computes pixels in raster order (row byrow), and the ‘square’ implementation which computes a 2D block of pixels at atime. The ‘line’ implementation is more straightforward. However, since the earlyexit pixels are correlated spatially, selecting a group of pixels closer together in ablock results in less wasted computation and therefore higher performance.For the CMOV configurations, the difference between the line and square im-plementations is vast; for V32 the square implementation has a speedup that isalmost a 4× higher than the line implementation. With the masked implementa-tions, the difference between line and square are much less–at most 10%. Themasked implementations always outperform the CMOV implementations. Parti-66Figure 4.10: FAST9 Feature Detectiontioned wavefront skipping has little effect in Mandelbrot and so is not shown; thedata is grouped together in such a way that wavefronts exit at the same or nearly thesame time. The CMOV implementation also uses a vector length determined byprofiling; too long and early exits are not helpful, and too short and instruction dis-patch rate and data hazards reduce performance. The masked implementation justuses the maximum vector length it can, which is either determined by the MMVLof the masked instructions or the size of the scratchpad and number of vectorsneeded. Between not having to profile to find the best vector length, and being ableto use the naive ‘line’ implementation with little performance penalty, the maskedimplementation is likely much easier for the programmer.4.3.4 FAST9 Feature DetectionFigure 4.10 shows the results for FAST9 feature detection. FAST9 determines ifa pixel is a feature by examining a circle of pixels around the target location andlooking for a consecutive series of pixels that are darker or lighter by a threshold.An early exit can happen if certain conditions hold, such as if neither of two pixelson opposite sides are above or below the threshold. All operations are byte-wide,67taking advantage of MXP’s subword-SIMD which executes byte operations 4×faster than word operations. While running FAST9 on simple input data with theCMOV code, we found checking early exit (as done in Mandelbrot) to be helpful.However, on a real image (Lenna), the early exit code only helped on a V1. Thus,the CMOV code is doing the full calculation for all pixels on V4 and V32. Incontrast, the masked code is able to achieve speedup by skipping at a more fine-grained level. However, if the mask is too coarse, such as at V32-P1 which is 128bytes wide, very little is gained. Only with 4 or 32 partitions is speedup achieved.It may be possible to gain even more speed by using byte-wide partitions, but ourMXP test configurations had a minimum partition size of one 32-bit lane wide.4.3.5 Viola-Jones Face DetectionFor Viola-Jones face detection, we used the same algorithm settings as those usedin an FPGA implementation created in RTL [10]. No reference image was spec-ified in the publication, so we used the standard ‘Lenna’ image. Their hardwareimplementation with 32 PEs was able to achieve 30fps. Our SVP implementationhas an inner loop containing 9 or 13 instructions (depending on the Haar classifier),instead of being fully pipelined as in a hardware implementation, so it is expectedto be somewhat slower. Also, our implementation includes full data movement,from input image to frame buffers driving a DVI display.Figure 4.11 shows the results for CMOV only and masked implementationsincluding partial wavefront skipping. The masked implementations with a singlepartition are up to 3× faster than CMOV, and partitioning increases it to up to4.1×. For a V4 processor, partitioning gives a 29% improvement from P1 to P4,and for a V32 processor, partitioning gives a 35% improvement from P1 to P4and an additional 22% improvement from P4 to P32 for an overall 65% improve-ment. While the impact of partitioning is not as dramatic as the impact of switchingfrom CMOV instructions to wavefront skipping, it provides a performance increasewithout having to rewrite software.Figure 4.12 presents the results as speedup versus area (eALMs). MXP withoutwavefront skipping is roughly the same the performance per area as a Nios II/f(which in practice can not scale ideally to achieve higher performance). Wavefront68Figure 4.11: Viola-Jones Face Detection Speedupskipping provides significantly higher performance per area, since the area impactis minimal compared to the performance gained. The highest performance perarea is 4.1× that of Nios II/f (37.8× faster with 9.2× more area for the V4-P4configuration).With our fastest configuration, we achieve 3 frames per second, or about 1/10ththat of the previous hardware result, on the Lenna input image. Not all of the pro-cessing time of our implementation is in Haar classifier tests; there is some over-head such as loading data from memory and resizing images. To test the amountof overhead we used a blank image (where all pixels exit early), which achieved60 frames per second.4.3.6 MMVL TradeoffsSo far we have examined tradeoffs in additional logic and BRAMs when us-ing multiple partitions, but it’s also possible to use more BRAMs to allow forlonger masked vector instructions. Each BRAM needs to store an offset of widthlog2(MMV L) as well 4 byte-enables per lane and an end bit. With multiple parti-tions, the byte enables are split between BRAMs, but the other data is replicated.69Figure 4.12: Viola-Jones Face Detection Speedup Vs AreaFigure 4.13 shows the BRAM usage for two configurations of MXP as the MMVLis varied. As explained in Section 4.2, we default to a MMVL of 256 wavefrontssince that is the minimum depth of M9Ks. Hence, a MMVL of 128 or 256 wave-fronts almost always has the same resource usage, except on a V16 with one parti-tion where the width of the BRAM data goes from 72 and 73 bits and can no longerfit in two 36-bit wide M9Ks. Increasing MMVL to 512 does have a cost in BRAMsonce the width of the BRAM data gets past 18-bits. An MMVL of 1024 requireseven more resources, as BRAMs are only 9-bits wide at that depth.Figure 4.14 shows the results of changing MMVL on the face detection bench-mark. For V1, there is a reasonable gain when increasing MMVL; there is 16%better performance as the number of wavefronts is increased from 128 to 256, withan additional 13% improvement as the number of wavefronts is increased from256 to 512. The results are even more dramatic on a V4 with 4 mask partitions:100% faster as the number of wavefronts is increased from 128 to 1024. The factthat a V4-P4 with MMVL 128 is actually slower than a V4-P1 with MMVL 256suggests that using BRAM for increased depth can be better than using it for morepartitions. The reason that increased MMVL makes such a large difference is thatas masked instructions have fewer and fewer valid wavefronts, they start taking sofew cycles that efficiency drops (either data hazards or instruction dispatch rates70Figure 4.13: BRAM Usage for Masks When Varying MMVL for 1 (top) and4 (bottom) Partitionsdominate). Longer MMVL means that fewer masked instructions are needed, andless time is spent in this low efficiency regime. However, in the V32 processor, themasked vector length is no longer limited by MMVL but by the number of vectorsneeded and the size of the scratchpad. In this case, changing MMVL had no effect71Figure 4.14: Effect of Changing MMVL on Viola-Jones Face Detectionon the results so they are not shown. The trend for V1 and V4 does suggest thatincreased vector length (in this case by having a larger scratchpad) would providehigher performance on this application.4.3.7 Results SummaryFigure 4.15 summarizes the speedup of wavefront skipping over the previousCMOV implementation on our three benchmarks, while Figure 4.16 shows thespeedup per area (eALMs). The maximum speedups for a single partition is 3.3×on FAST9 (V1-P1) and 3.2× on Viola-Jones (V4-P1). The largest speedup frompartitioning over full wavefront skipping is 1.65× on Viola-Jones (V32-P32 vsV32-P1). All of the benchmarks gain at least as much in speedup as the cost inarea of the wavefront skipping implementation, with the minimum gain being 1.1×better performance per area for FAST9 on a V32 with one wavefront partition. Al-though an increased number of partitions causes a corresponding increase in thearea of MXP, for FAST9 and Viola-Jones the performance per area continues to in-crease as the number of partitions increases. Wavefront skipping is therefore usefulfor all the benchmarks we tested, and partitioning is useful for two of the three.For a programmer implementing a conditional algorithm, the configuration of72Figure 4.15: Speedup from Wavefront SkippingBRAM usage for scratchpad size, wavefront partitions, and maximum masked vec-tor length will depend on the application. The designer will generally want todevote as much BRAM as possible to scratchpad data, as longer vectors will ben-efit all parts of an application. However, for heavily conditional applications likeViola-Jones, having masked vectors that are long enough to keep the vector corebusy as the number of valid wavefronts shrink is also important. Few BRAMs areneeded for just a single partition; using multiple partitions adds several BRAMs,but this may be justified when performance is absolutely necessary or the targettedFPGA has leftover BRAMs with a given design.4.4 SummaryThis chapter shows how a class of irregular data parallel algorithms that are dif-ficult or inefficient to implement with other approaches can be accelerated usingwavefront skipping. Our implementation of wavefront skipping in soft vector pro-cessors (SVPs) can be done efficiently in terms of logic and BRAM usage. Wave-front skipping allows for higher performance due to skipping masked off elements.In addition, from our experience, masked execution is also much easier to use73Figure 4.16: Speedup per Area (eALMs) from Wavefront Skippingthan checking early exit conditions on blocks of elements using predicated/CMOVinstructions. Our implementation stores offsets in BRAMs, which are relativelyplentiful and high-performance in FPGAs. The alternative, which uses a count-leading-zeros operation, would have higher latency and also limit the number ofwavefronts skipped in one cycle. Our approach keeps the mask logic simple andoff the critical path, and can also skip an arbitrary number of wavefronts.When not partitioned, our wavefront skipping implementation uses less than5% extra area and gives 3.2× better performance on Viola-Jones face detection.Extra logic and BRAMs can be used to gain additional performance by partition-ing, allowing parts of a wavefront to have different offsets. Although costly interms of area, partitioning gives up to 1.65× extra performance on face detection.Partitioned wavefront skipping may not be a reasonable design tradeoff in a fixedvector processor. In an FPGA, partitioned wavefront skipping gives a designer anextra tool to tradeoff additional logic and BRAM for application specific perfor-mance.74Chapter 5Attaching Streaming Pipelines toSoft Vector ProcessorsThe expert knows more and more about less and less until he knowseverything about nothing. — Mahatma GandhiSo far, this dissertation has shown how to better optimize SVPs for increasedperformance per area and how to enable applications that are traditionally difficultto accelerate. These approaches help make SVPs more attractive for applicationsrequiring modest performance or those with divergent control flow. However, theydo not take full advantage of the underlying FPGA fabric, and they do not offer apath for migrating a design from an SVP program to an RTL pipeline. This chapterexplores the challenges and benefits of interfacing custom pipelines to an SVP.The SVP is able to manage data movement and perform computations that are lessperformance critical, while dispatching heavy computation to custom engines thatcan fully exploit the underlying FPGA fabric. In this manner, applications canmigrate fixed parts of their algorithms to get the full performance the FPGA canprovide while maintaining software design and control at each step.5.1 IntroductionFPGAs are most useful when designs exploit deep pipelines, utilize wide data par-allelism, and perform operations not found in a typical CPU or GPU. This typically75requires a hardware designer to design a custom system in an RTL such as VHDLor Verilog. Recently, the emergence of electronic system level (ESL) tools such asVivado HLS and Altera’s OpenCL compiler allow software programmers to pro-duce FPGA designs using C or OpenCL. However, since all ESL tools translate ahigh-level algorithm into a hardware description language (HDL), they share com-mon drawbacks: changes to the algorithm require lengthy FPGA recompiles, re-compiling may run out of resources (eg, logic blocks) or fail to meet timing, debug-ging support is very limited, and high-level algorithmic features such as dynamicmemory allocation are unavailable. This suggests ESL users need some degree ofhardware design expertise. Hence, ESL tools may not be the most effective way tomake FPGAs accessible to software programmers.An SVP achieves high performance through wide data parallelism, efficientlooping, and prefetching. The main advantages of an SVP over RTL are scalableperformance and a traditional software programming model. Performance scalingis achieved by adding more ALUs, but beyond a certain point (eg, 64 ALUs) theincreases in parallelism are eroded by clock frequency degradation. Additionally,even if performance scales perfectly, performance per area will lag behind that ofa custom design.To increase performance of SVPs even further, they must also harness deeppipeline parallelism. For example, most types of encryption (such as AES) needdeep pipelines to get significant speedup. Although processors (and SVPs) are notbuilt to exploit deep pipeline parallelism, FPGAs support it very well.Several questions then arise: How can a SVP be augmented to exploit bothwide and deep parallelism? What differs in interfacing an SVP to external logicversus a scalar soft processor? How can control flow be coordinated between deepcustom pipelines and the fixed execution stages of the SVP? Can more complexinstructions than the standard two-input, one-output format be supported, and howmuch benefit do they provide?We investigated these questions using the VectorBlox MXP SVP. We deviseda way to add custom vector instructions (CVIs) to the processor, shown in Fig-ure 5.1. Such CVIs can be simple operations, or they may contain wide and deeppipelines. The interface is kept as simple as possible so that software programmerscan eventually develop these custom pipelines using C; we also show that a simple76Align BAlign AAlign CScratchpadAddress GenerationDMA and Vector Work Queues, Instruction Decode & ControlAvalonBusCustom PipelinesΣDMAMSRdSrcARdSrcBWrDstCR/WDMAA B C DCustom InstructionsP3P1 P2 P4Figure 5.1: Internal View of VectorBlox MXPhigh-level synthesis tool can be created for this purpose. To demonstrate speedups,we selected the N-body gravity problem as case study. In this problem, each bodyexerts an attractive force on every other body, resulting in an O(N2) computation.The size and direction of the force between two bodies depends upon their twomasses as well as the distance between them. Solving the problem requires squareroot and divide, neither of which are native operations to the MXP. Hence, we startby implementing simple custom instructions for the reciprocal and square root op-erations. Then, we implement the entire gravity equation as a deep pipeline.The main contribution of this chapter is the introduction of a modular way ofallowing users to add streaming pipelines into SVPs as custom vector instructionsto get significant speedups. On the surface, this appears to be a simple extensionof the way custom instructions are added to scalar CPUs such as Nios II. However,there are unique challenges that must be addressed to enable streaming data frommultiple operands in a SVP. Also, scalar CPU custom instructions are often datastarved, limiting their benefits. We show that SVPs can provide high-bandwidthdata streaming to properly utilize custom instructions.775.2 Custom Vector Instructions (CVIs)The VectorBlox MXP was designed to have a minimal core instruction set. It isimportant to keep this instruction set minimal in an SVP because the area requiredby an operation will be replicated in every vector lane, thus multiplying its costby the number of lanes. In MXP, multiply/shift/rotate instructions are included ascore instructions because they share the use of the hard multipliers in the FPGAfabric. However, divide and modulo are not included as core instructions because apipelined implementation requires more than three pipeline stages and more logicthan all other operators combined.There are many instructions that could be useful in certain applications; someare large and complex but there are also simple, stateless 1- or 2-input, single-output operators with no state. These include arithmetic (e.g., divide, modulo,reciprocal, square root, maximum, minimum, reciprocal square root), bit manip-ulation (e.g., population count, leading-zero count, bit reversal), and encryptionacceleration (e.g., s-box lookup, byte swapping, finite field arithmetic). Some ofthese operators require very little logic, while others demand a significant amountof logic. However, supporting all such operators is prohibitively expensive. Also,it is unlikely that a single application would make use of almost all of these spe-cialized instructions. Finally, even when they are used by an application, theseinstructions may not appear frequently in the dynamic instruction mix. For thisreason, the base MXP ISA excludes many operations that could potentially be use-ful, but were considered to be too specialized.Instead, we would like to add such specialized instructions on a per-applicationbasis. We would also like to allow the user to decide how many of these operatorsshould be added to the pipeline, since it may not make sense to replicate large op-erators in every lane. Additionally, operators with deep pipelines should be able toco-exist along with the existing three stage execution pipeline without lengtheningthe number of cycles taken to process normal instructions.5.2.1 CVI Design ApproachOur add-on custom vector instruction (CVI) approach is different from the VESPAapproach [74], which allows selective instruction subsetting from a master instruc-78tion set. Subsetting allows a reduction in area when an application does not usespecific instructions, but does not allow for further customization. Additionally,since no base instruction set is defined, it may be difficult to change an alogirthmrunning on a subsetted processor without resynthesizing the processor to add backneeded instructions.In contrast, the MXP approach defines a core set of instructions to increase soft-ware portability, while user-specified CVIs can be added to accelerate application-specific operations that are rarely needed by other applications. CVIs use an exter-nal port interface to MXP, allowing the addition to be done without modifying theprocessor source HDL. This modularity may also make it easier to add CVIs us-ing run-time reconfiguration. Some CVIs will be application specific, while others(such as square root) may be resuable.5.2.2 CVI InterfaceA typical CVI is executed in MXP like a standard arithmetic instruction, with twosource operands and one destination operand. The main difference is that data issent out of MXP through a top-level port, processed by the CVI, then multiplexedback into the MXP pipeline before writeback. One individual CVI may consist ofmany parallel execution units, processing data in both a parallel and a streamingfashion.In Altera’s Qsys environment, CVIs are implemented as Avalon componentswith a specific conduit interface. Qsys conduits are a way to bundle signals to-gether when connecting modules in the graphical Qsys interface. Vector data andcontrol signals are exported out of the top level of MXP and connected to the CVIautomatically through the conduit. The signals in the conduit interface can be seenin Figure 5.2. The left side (Figure 5.2a) shows an example of a simple CVI, the‘difference-squared’ operation. This CVI does the same action on all data, so itsindividual execution units are simply replicated across the number of CVI lanes. Inthe simplest case, the number of CVI lanes will match the number of MXP lanes.This is adequate if the CVI is small, or there is sufficient area available. In Sec-tion 5.2.3, we will consider the area-limited case when there must be fewer CVIlanes than MXP lanes.79clockstartvalidopsizeopcode22A0B0A1B1A2B2A3B3C0wr0C1wr1C2wr2C3wr3SUB x2SUB x2SUB x2SUB x2clockstartvalidopsizeopcode22A0B0A1B1A2B2A3B3C0wr0C1wr1C2wr2C3wr3ADDADDADDADDDQ00 01a) Custom instructionwithin lanesb) Custom instructionprefix sum across lanesFigure 5.2: Examples of Custom Vector InstructionsThe right side (Figure 5.2b) shows a more complicated example, a prefix sum,where data is communicated across lanes. The prefix sum calculates a new vectorthat stores the running total of the input vector appearing on operand A. This makesit a very different type of operation than the difference-squared operation, whichdoes not have communication between lanes. As a result, computing a prefix sum isa difficult operation for wide vector engines; it is best implemented in a streamingfashion. Since a vector may be longer than the width of the SVP, it is important toaccumulate the value across multiple clock cycles in time. To support this, the CVIinterface provides a clock and vector start signal. Furthermore, a data valid signalindicates when each wave of input data is provided, and individual data-enableinput signals (not shown for clarity) are provided for each lane.Additional signals in the CVI interface include an opsize (2 bits) that indicateswhether the data is to be interpreted as bytes, halfwords, or words. Also, outputbyte-enable signals allow the instruction to write back only partial vector data,or to implement conditional-move operations, or write back a last (incomplete)wave. Finally, an opcode field is provided to allow the selection of multiple CVIs.80Alternatively, the opcode can be passed to a single CVI and used as a mode-selectfor different functions, such as sharing logic between divide and modulo, or toimplement different rounding modes. The opcode field is shown as two bits, butthis can be easily extended.One notable exclusion from the interface is a stall signal or other means ofproviding back pressure. Allowing a CVI to stall the MXP front end was consid-ered undesirable from a clock speed standpoint. More precisely measuring andaddressing this limitation is left to future work.5.2.3 Large Operator SupportCustom operators may be prohibitively large to add to each vector lane. For ex-ample, a fully pipelined 32-bit fixed-point divider with 16 fractional bits (Q16.16format) requires 2,652 ALMs to implement in a Stratix IV FPGA. This is morelogic than an entire vector lane in the MXP. Thus, it may be desirable to use fewerdividers than the number of lanes (depending upon the number of divides in the dy-namic instruction mix). We have designed an interface that allows using narrowerCVIs with minimal overhead by reusing the existing address generation and dataalignment logic.Figure 5.3 shows how CVIs with a different number of lanes are added to theexisting MXP datapath. During normal operation, the address generation logicincrements each address by the width of the vector processor each cycle. For theexample shown, i.e., an MXP with four 32-bit lanes (written as a ‘V4’), each sourceand destination address is incremented by 16 bytes, until the entire vector is pro-cessed. MXP’s input alignment networks align the start of both source vectors tolane 0 before the data is processed by the ALUs. After execution, the destinationalignment network is used to align the result to the correct bank in the scratchpadfor writeback.During a CVI, the address generation logic increments the source and destina-tion addresses by the number of CVI lanes times the (4-byte) width of each lane.In the example shown, the addresses are incremented by 12 bytes for each wave,regardless of the SVP width; operand A starts at address 0x04 on the first cycle(Figure 5.3a) and increments to 0x10 on the second cycle (Figure 5.3b). As in81012345670123Align A Align BCustomALUsScratchpadAlign C0123456745671014181C0004080C3034383C2024282C5054585C4044484C7074787C6064686C9094989C8084888CALUsb) Funneling elements 3, 4, 5 through three custom ALUs012345670123Align BALUsScratchpadAlign C0123456745671014181C0004080C3034383C2024282C5054585C4044484C7074787C6064686C9094989C8084888CCustomALUsa) Funneling elements 0, 1, 2 through three custom ALUsAlign AFigure 5.3: Custom Vector Instructions With Fewer Lanes Than the SVPnormal execution, the alignment networks still align source data to start at lane 0before data is processed in the custom ALUs. In this case, the fourth lane wouldnot contain any data, so its data-enable input would be inactive. After execution,the CVI result is multiplexed back into the main MXP pipeline, and finally the re-82sulting data is aligned for writeback into the scratchpad. On CVI writeback, theoutput byte enables are then used to write out data for only the first 12 bytes of eachwave; the destination alignment network then repositions the wave to the correcttarget address.5.2.4 CVIs with Deep PipelinesMXP uses an in-order, stall-free backend for execution and writeback to achievehigh frequencies. The CVIs are inserted in parallel to the regular 3-stage arithmeticpipeline of MXP, which means they can also have 3 internal register stages. If fewerstages are needed, it must be padded to 3 stages.Some operations, such as divide or floating point, require much deeperpipelines. If the user naively creates a pipeline that is longer than 3 cycles, thefirst wave of data would appear to the writeback stage later than the writebackaddress, and the last wave of data would not reach the writeback stage at all.To address the latter problem, we have devised a very simple strategy for in-serting long pipelines. In software, we extend the vector length to account forthe additional pipeline stages (minus the 3 normal stages). This solves part of theproblem, allowing the last wave of vector data to get flushed out of the pipelineand appear at the writeback stage. During the last cycles, the pipeline will readdata past the end of the input operands, but their results will never be written back.However, the beginning of the output vector will have garbage results.To eliminate this waste of space, the MXP could simply delay the writebackaddress by the appropriate number of clock cycles. A second approach is to allowthe CVI itself to specify its destination address for each wave. This requires theMXP to inform the CVI of the destination addresses, and rely upon the CVI todelay them appropriately. We have chosen this latter technique, as it allows formore complex operations where the write address needs to be controlled by theCVI, such as vector compression. Because this can write to arbitrary addresses,any CVI using this mode must set a flag which tells MXP to flush its pipelineafter the CVI has completed. The flag is set as a top-level parameter to the MXPinstance.Although this approach does not require additional space, it still requires ex-83tending the length of the vector and flushing the pipeline. The extended vectorlength makes the instruction take a number of additional cycles equal to the num-ber of pipeline stages in the CVI, while the MXP pipeline flush costs the depthof the MXP pipeline (eight to ten stages, depending on the MXP configuration).During normal execution instructions can issue back-to-back and multiple instruc-tions can be in the MXP pipeline in parallel, but deep-pipeline CVIs cannot executeconcurrently with other instructions in our approach.5.3 Multi-Operand CVIThe CVIs described in the previous section are intended for one or two inputoperands, and one destination operand. However, the DFGs of large compute ker-nels may require multiple inputs and outputs, and require both scalar and vectoroperands. In this section, we describe how to support multiple-input, multiple-output CVIs. As a motivating example, we have chosen the N-body gravitationalproblem. We have modified the problem slightly to produce a pleasing visualdemonstration: keep calculations to only 2 dimensions, use a repelling force ratherthan an attracting force, and allow elastic collisions with the screen boundary.5.3.1 N-Body ProblemThe traditional N-body problem simulates a 3D universe, where each celestial ob-ject is a body, or particle, with a fixed mass. Over time, the velocity and positionof each particle is updated according to interactions with other particles and theenvironment. In particular, each particle exerts a net force (i.e., gravity) on everyother particle. The computational complexity of the basic all-pairs approach weuse is O(N2). Although advanced methods exist to reduce this time complexity,we do not explore them here.In our modified version, we consider a 2D screen rather than a 3D universe.The screen is easier to render than a 3D universe, but it also has boundaries. Also,we change the sign of gravity so that objects repel each other, rather than attract.(Attractive forces with screen boundaries would result in the eventual collapse intoa moving black hole, which is not visually appealing.) Like the traditional N-body problem, we also treat particles as point masses, i.e., there are no collisions84x idxy idyScalardata in:y iFx iFDataout:1.0rsqrtrcustom instruction  full custom pipelineStreaming vectordata in:x j x j+1x j-1. . . . . .y j y j+1y j-1. . . . . .GmimjGm i. . . . . .m j-1 jm m j+1(48)(16)xdFydF(65)(65)(65)682169 70 71 72 73 7419Figure 5.4: Force Summation Pipelinebetween particles. We have also adjusted the gravitational constant to producevisually pleasing results.The run-time of the N-body problem is dominated by the gravity force calcu-lation, shown below:~Fi, j = GMiM jr2= 0.0625MiM j|~Pi−~Pj|3(~Pi−~Pj)where ~Fi, j is the force particle i imposes on particle j, ~Pi is the position of particlei, and Mi is the size or ‘mass’ of particle i. When computing these forces, we chosea fixed-point Q16.16 fixed-point representation, where the integer component of ~Prepresents a pixel location on the screen.When a particle reaches the display boundary, its position and velocity areadjusted to reflect off the edge (towards the center) after removing some energyfrom the particle. These checks do not dominate the run-time as they are onlyO(N).An implementation of the gravity computation as a streaming pipeline is shownin Figure 5.4. This is a fixed-point pipeline with 74 stages; the depth is dominatedby the fixed-point square root and division operators which require 16 and 48 cy-cles, respectively.1 For each particle, its x position, y position, and mass (premul-tiplied by the gravitational constant) are loaded into scalar data registers within theinstruction. This is the reference particle, Pi. Then, three vectors representing the1We used Altera’s LPM primitives for these operators. The pipeline would benefit from a com-bined reciprocal square root operator, but it does not exist in the Altera library.85x position, y position and mass of all particles, Pj, are streamed through the vec-tor pipeline. The pipeline integrates the forces exerted by all these particles, andaccumulates a net force on the reference particle.Overall, the pipeline requires three scalar inputs (reference particle properties)and three vector inputs (all other particles). It also produces two vector outputs (anx vector and a y vector), although the output vectors are of length 1 because of theaccumulators at the end of the pipeline. Hence, this gravity pipeline is a 3-input,2-output CVI.All of the MXP vector instructions, including the custom type, only have twoinputs and one output. This is a limitation in the software API, where only twoinputs and one output can be specified, as well as the hardware dispatch, whereonly two source vector addresses and one destination vector address can be issued.Loading of scalar data can be accomplished by using vector operations with length1, and either using an opcode bit to select scalar loading versus vector execution,or by fixed ping-ponging between scalar loading and vector execution. We use theping-pong approach to save opcodes.Supporting multiple vector operands is not as simple, however, and will bediscussed below.5.3.2 Multi-Operand CVI DispatchFigure 5.5 shows two approaches to dispatching multi-operand CVIs. A ‘wide’approach requires data to be laid out spatially, such that operand A appears asvector element 0, operand B appears as vector element 1, and so forth. This isshown in Figure 5.5a. In other words, the operands are interleaved in memoryas if packed into a C structure. To stream these operands as vectors, an array ofstructures (AoS) is created. Ideally, the input operands would precisely fit into thefirst wave; with two read ports, the amount of input data would be twice the vectorengine width. If more input data is required, then multiple waves will be required,which will be similar to the depth approach below. If less input data is required,then the CVI does not need to span the entire width of the SVP. In this case,it may be possible to provide multiple copies of the pipeline to add SIMD-levelparallelism to the CVI.86a) Wide multi-operand streaming datapaths require interleaved datab) Deep multi-operand streaming datapaths can avoid interleaved data001122330011Align BScratchpadAlign C0011223322331014181C0004080C3034383C2024282C5054585C4044484C7074787C6064686C9094989C8084888CStreaming PipelineAlign A012301230123Align BScratchpadAlign C0123012301231014181C0004080C3034383C2024282C5054585C4044484C7074787C6064686C9094989C8084888CStreaming PipelineAlign AFigure 5.5: Multi-Operand Custom Vector InstructionsThe main drawback of the wide approach is that the data must be interleavedinto an AoS. In our experience, SVPs work better when data is arranged into astructure of arrays (SoA). The SoA layout assures that each data item is in its ownarray, so SVP instructions can operate on contiguouly packed vectors.For example, suppose image data is interleaved into an AoS as {r,g,b} triplets.With this organization, it is difficult to convert the data to {y,u,v} triplets becauseeach output data item requires a different equation. When the image data is blockedas in a SoA, it is easy to compute the {y} matrix based upon the {r}, {g}, and {b}matrices. Furthermore, converting between AoS and SoA on the fly requires data87020x3000x3100x3200x33008 090x2900x4800x4900x4A00x4B000 01 0203 04 05 0607 08 09v_A1= 0x300v_A2= 0x48400 01 02 0304 05 06 070x0 0x4 0x8 0xC0x0200x0300x0400x05008 090x0100x8300x8400x8500x8600001 02 03 0405 06 07 0809v_B1= 0x028v_B2= 0x83C0003 04 0506 070x0 0x4 0x8 0xC01VL = 2 (number of elements to keep together)num1 = (number of arrays to interleave)     = 2num2 = (number of elements/VL)     = 10/VL = 5srcAStride1 = (v_A2 - v_A1)            = 0x184srcAStride2 = (v_A1[VL] - v_A1[0])            = 0x08vbx_set_vl( VL );vbx_set_2D( num1, dstStride1, srcAStride1, srcBStride1 );vbx_set_3D( num2, dstStride2, srcAStride2, srcBStride2 );vbx_3D( VVW, VCUSTOM1, v_dest, v_A1, v_B1 );srcBStride1 = (v_B2 - v_B1)            = 0x814srcBStride2 = (v_B1[VL] - v_B1[0])            = 0x08Interleaved read-out order:00 0100 0102 0302 0300 0100 0102 0302 03Operand A Operand BCycle1234B1B2B1B2A1A2A1A2vbx_interleave_4_2( VVW, VCUSTOM1, num_elem, VL,   v_D1, v_D2, v_A1, v_A2, v_B1, v_B2 );vbx_interleave_4_2( int TYPE, int INSTR, int NE, int VL, int8 *v_D1, int8 *v_D2, int8 *v_A1, int8 *v_A2, int8 *v_B1, int8 *v_B2 ){  vbx_set_vl( VL );  vbx_set_2D( 2, v_D2-v_D1, v_A2-v_A1, v_B2-v_B1 );  vbx_set_3D( NE/VL, v_D1[VL]-v_D1[0],              v_A1[VL]-v_A1[0], v_B1[VL]-v_B1[0] );  vbx_3D( TYPE, VINSTR, v_D1, v_A1, v_B1 );}Figure 5.6: Using 3D Vector Operations for Multi-Operand Dispatchcopying and can be time consuming. Hence, it is better for regular SVP instructionsto use SoA format.An alternative ‘deep’ approach to multiple-operand CVIs requires data to beinterleaved in time. This is shown in Figure 5.5b, where a streaming datapath onlyhas access to two physical ports, operands A and B of one vector lane. This can becombined with wide parallelism by replicating the deep pipeline. It is not desirableto simply fully read two input vectors and then read the third input, though, as theCVI would have to buffer the full length of the instruction. In MXP, vector lengthsare limited only by the size of the scratchpad, so the buffering could be costly.Rather, it is desirable to only buffer a single cycle’s worth of inputs.We accomplish this in MXP by using its 2D and 3D instruction dispatch toissue a single wavefront of data from each input on alternating cycles. The 2Dinstructions work by first executing a normal (1D) vector instruction to read onewavefront of data, then applying a different stride to each of the input addresses andoutput address and to switch among input arrays and thus interleave the wavefronts.The strides and repetitions can be set at runtime using a separate set 2D instruction.The 3D instructions are an extension of this, where 2D instructions are repeatedusing another set of strides.Figure 5.6 illustrates how these 2D/3D ops are used to dispatch CVIs withmultiple operands. In this example, a CVI with 4 inputs (A1, A2, B1, and B2) and88012301230123Align BScratchpadAlign C0123012301231014181C0004080C3034383C2024282C5054585C4044484C7074787C6064686C9094989C8084888CStreaming PipelineAlign AFigure 5.7: Multi-Operand Custom Vector Instruction Funnel Adapters2 outputs (D1 and D2) is to be executed. The desired result is that the CVI willalternate A1/B1 and A2/B2 inputs each cycle, and alternate D1/D2 outputs eachcycle.To get this outcome, first the 1D vector length (VL) is set to the number of CVIlanes, and the 2D strides are set to the difference between input addresses (A2-A1,B2-B1) and output addresses (D2-D1). Since the inner vector length is the sameas the number of custom instruction lanes, each row is dispatched as one wavein a single cycle, followed by a stride to the next input. The 2D vector length isset to the total number of cycles required (max(inputs/2, outputs/1)). Note that ifmore than 2 cycles (4 inputs or 2 outputs) are needed, sets of additional inputs andoutputs will need to be laid out with a constant stride from each other.Since a 2D operation merely alternates between sets of inputs (and outputs), a3D instruction is used to stream through the arrays of data. Each 2D instructionprocesses one wavefront (of CVI lanes) worth of data, so the 3D instruction is set tostride by the number of CVI lanes. The number of these iterations (the 3D length)is set to the data length divided by the number of CVI lanes.In Figure 5.6, the complex setup routine (top) can be abstracted away to a singlefunction call, vbx interleave 4 2() (middle). One possible implementationof this call is shown at the bottom of the figure.On the hardware side, data is presented in wavefronts and needs to be multi-plexed into a pipeline. Because a new set of inputs only arrives every max(inputs/2,89c2j c2j+1c2 j-1. . . . . .a2j a2j+1a2 j-1. . . . . .d2j d2j+1d2 j-1. . . . . .b2j b2j+1b2 j-1. . . . . .w2j w2j+1w2j-1. . . . . .c1j c1j+1c1 j-1. . . . . .a1j a1j+1a1 j-1. . . . . .d1j d1j+1d1 j-1. . . . . .b1j b1j+1b1 j-1. . . . . .w1j w1j+1w1j-1. . . . . .t j t j+1t j-1. . . . . .varp j p j+1p j-1. . . . . .f j f j+1f j-1. . . . . . 01sc3j c3j+1c3 j-1. . . . . .a3j a3j+1a3 j-1. . . . . .d3j d3j+1d3 j-1. . . . . .b3j b3j+1b3 j-1. . . . . .w3 j w3j+1w3j-1. . . . . .Figure 5.8: Face Detection Pipelineoutputs) cycles, the pipeline would be idle part of the time if it had the same widthand clockrate as the CVI interface. We can recover the lost performance, and savearea, by interleaving two or more logical streams into one physical pipeline. To dothis, we have created ‘funnel adapters’ which are used to accept the spatially dis-tributed wave and feed it to the pipeline over time. This is illustrated in Figure 5.7.The funnel adapter for our 3-input, 2 output particle physics pipeline, whichhas inputs arriving every 2 cycles (and outputs leaving every 2 cycles), allows twoMXP lanes worth of data to share a single physical streaming pipeline.5.3.3 Face Detection CVI ExampleAs another example, we have also outlined the design of a multiple-input/outputCVI for Viola-Jones face detection. The face detection pipeline is shown in Fig-ure 5.8. Unlike the gravity pipeline, the face detection requires far more inputs – atotal of 18 vector inputs and 1 scalar input. It produces a single vector output.Using regular SVP instructions, this face detection requires a total of 19 in-90structions, requiring 19 clock cycles per wave of data. In contrast, due to the largenumber of vector input operands, the face detection pipeline takes 9 clock cyclesper wave of data. Hence, the best-case speedup expected from this custom pipelineis 199 , or roughly 2 times. Even though face detection contains a large number ofoperators, the number of input operands limits the overall speedup. Hence, not allapplications will benefit significantly from custom pipelines.5.4 CVI LimitationsOur CVI implementation was designed to address a wide range of operations andsupport different use cases, but it is not without limitations. These include theinability of a CVI to provide flow control or back pressure to the vector processor,inability to write longer vectors than the inputs, loss of scratchpad bandwidth whenusing narrow CVIs, and the inability to pipeline deep CVIs with regular vectorinstructions.There are situations where it may be useful to provide back pressure from theCVI to the vector processor. For example, a CVI that accesses an external memoryor network interface may need to stall for a variable number of cycles. However,our CVI interface only works with a fixed-length pipeline. Adding back pressurewould require changing MXP’s entire pipeline, which uses a fixed-length, stall-freeexecution backend. The benefits of this would not be measureable unless we hada CVI which needed this feature, and the costs would not only include increasedarea but also possibly decreased frequency, as any stall signal would have to bepropagated to all lanes of the vector processor. Still, there may be situtations whereit would be useful to have back pressure.Most of the CVIs we have implemented write back an output vector of the samelength as the inputs, and we have provided a write address port so that the CVI canwrite a shorter vector (as in a vector compress operation) or write to arbitrary ad-dresses. However, this is not general enough to implement operations that requiremore write cycles than read cycles. For instance, a parallel vector scatter operationmay require writing back multiple wavefronts given a single cycle of reading ad-dresses and data to scatter. This could be implemented if back pressure on readingdata in was available while still writing output data.91When using fewer CVI lanes than the width of the vector processor, more datais read from and scratchpad memory than is used, and less data is written to scratch-pad memory than is possible. Since we have more scratchpad bandwidth than isneeded for the CVI, it is reasonable to investigate if we can do something usefulwith that extra read and write bandwidth. For instance, if there are half the num-ber of CVI lanes as vector lanes, it could be possible to issue a half-speed vectorinstruction using the other half of the scratchpad memory bandwidth. How best toutilize the extra bandwidth (if it can be utilized usefully at all) is left as an openquestion.Finally, as mentioned in Section 5.2.4, when executing deep pipeline CVIs weextend the vector length so that the instruction takes a number of cycles equal to thepipeline depth. This means in an N-stage pipeline CVI, the first N cycles are spentreading data into the pipeline, and no data is written to the scratchpad memory.In the last N cycles, no more data is needed to be read into the pipeline while theCVI writes back results already in its pipeline. Currently we deal with the N cyclepenalty of deep pipeline CVIs by amortizing this pipeline filling overhead over verylong vector lengths, but a solution that allowed other instructions to execute usingthe unutilized scratchpad write and read cycles would be preferable. However, thisis not trivial, especially since write bandwidth is available at the beginning of theCVI execution and read bandwidth at the end.5.5 CVI Design MethodologiesWhile implementing a CVI to accelerate a SVP program is much easier than writ-ing a complete accelerator, implementing them in HDL is not desirable for our tar-get users, software programmers. Hence, we have explored an additional methodfor generating CVI pipelines that uses a high-level tool from Altera.Altera’s DSP Builder Advanced Blockset for Simulink (DSPBA) [2] is a block-based toolset integrating into Matlab and Simulink to allow for push-button genera-tion of RTL code. Figure 5.9 shows a floating-point version of our physics pipelineimplemented in DSPBA. DSPBA was able to create the entire pipeline, includ-ing accumulation units, and design was significantly faster than manually buildingthe fixed-point version in VHDL. Although we were not able to create a fixed-92Figure 5.9: FLOAT Custom Vector Pipeline in Altera’s DSP Builderpoint version of our pipeline in DSPBA because it lacks fixed-point reciprocal andsquare root, we generated a floating-point version as an additional data point.Glue logic was needed to integrate the pipeline into a CVI, however, becauseour CVI pipelines require a clock enable signal, which DSPBA-generated logicdoes not have. Rather than attempt to modify the output of DSPBA (includinglibraries used), we built a FIFO buffer to retime data appropriately, which addsminimal logic and uses one additional M9K memory per lane. This glue logic issufficiently generic to allow any DSPBA-generated pipeline to be integrated into aCVI.An additional method of creating CVIs was designed by Hossein Omidian us-ing HLS techniques to generate RTL from a C function. More details can be foundin [60].5.6 ResultsAll FPGA results are obtained using Quartus II 13.0 and a Terasic DE4 develop-ment board which has a Stratix IV GX530 FPGA and a 64-bit DDR2 interface. Forcomparison, Intel Core i7-2600 and ARM Cortex-A9 (from a Xilinx Zynq-basedZedBoard) performance results are shown. Both fixed-point (fixed) and floating-point (float) implementations were used. MXP natively supports fixed-point multi-plication in all lanes. The Nios II/f contains an integer hardware multiplier andhardware divider; additional instructions are required to operate on fixed-point93Table 5.1: Results with MXP Compared to Nios II/f, Intel, and ARM Proces-sorsProcessor Area DSPs fmax s/frame GigaOp/s pairs/s SpeedupALMs 18-bit MHzNios II/f (fixed) 1,223 4 283 231.6 0.004 0.3M 1.0Cortex A9 (fixed) – – 667 52.1 0.02 1.3M 4.5Cortex A9 (float) – – 667 14.0 0.07 4.8M 16.6Core i7-2600 (fixed) – – 3400 6.5 0.15 10.3M 35.6Core i7-2600 (float) – – 3400 1.6 0.63 41.9M 144.8MXP V32 (fixed) 46,250 132 193 73.8 0.14 9.1M 31.4MXP V32+16FLOAT 115,142 644 122 0.041 24.6 1,326M 5,669MXP V32+16FIXED 86,642 740 153 0.032 31.3 2,087M 7,203Figure 5.10: Area of Gravity Pipeline Systemsdata. The Intel, ARM and Nios II versions are written with the same C sourceusing libfixmath [9]. We developed a vectorized version of this library for use withMXP. Nios II/f and MXP results use gcc-4.1.2 with ‘-O2’. The Core i7 results usegcc-4.6.3 and ‘-O2 -ftree-vectorize -m64 -march=corei7-avx’. ARM results usegcc-4.7.2 and reports the best runtime among ‘-O2’ and ‘-O3’. Our Intel and ARMcode is typical of what a C programmer would start with, not highly optimizedcode.When gathering the MXP results we varied the number of SVP lanes (V2, V8,and V32) and the number of CVI lanes. Three types of CVIs are generated: onecontaining separate fixed-point divide and square root instructions (DIV/SQRT),one containing a manually generated fixed-point gravity pipe (FIXED), and anDSPBA pipe (FLOAT).94Figure 5.11: Performance and Performance-per-Area of Gravity PipelineFigure 5.10 shows the area, in Adaptive Logic Modules (ALMs) on the leftand DSP Block 18-bit elements on the right. The DIV/SQRT configurations takeroughly the same area (in ALMs) as the FIXED pipeline. However, FIXED re-quires more multipliers. The FLOAT pipelines require about 5,500 ALMs and 38DSP elements per lane versus 3,000 and 32 per lane for FIXED.Figure 5.11 shows the speedup of an algorithm to solve the N-body problemwith 8192 particles. Speedup is relative to a Nios II/f soft processor, and is shownfor the various MXP configurations as well as a 3.4GHz Intel Core i7-2600 anda 667MHz ARM Cortex A9. As mentioned earlier, the Intel and ARM code isbasic C code and not highly optimized, but implements the algorithm exactly asour MXP code does. A highly optimized single-core AVX implementation of theN-Body problem for i7-2600 matches our best MXP performance at 2×109 pairsper second [64]. An exact performance comparison between MXP and an Intelprocessor is not intended; rather, this shows that the level of performance of anSVP with CVIs is on par with that of a hard processor.Comparing results within the MXP designs shows the usefulness of CVIs.Without any CVIs, MXP achieves comparable performance to Nios II/f per lane,and its performance scales nearly linearly from V2 to V32. MXP is runningthe fixed-point divide and square root operations in software for these builds,which hampers its overall performance. Adding in divide and square root CVIsgreatly improves performance, to the extent that a V2 with a single lane of di-vide and square root CVIs (V2+1DIV/SQRT configuration) outperforms a V3295without any CVI. Adding more lanes of the divide and square root CVIs onlyslightly increases performance; from V8+1DIV/SQRT to V8+4DIV/SQRT andfrom V32+4DIV/SQRT to V32+16DIV/SQRT performance per area decreases.An additional order of magnitude more performance is seen going from theV2+1DIV/SQRT configuration to the V2+1FIXED configuration which has thefull N-body CVI. This configuration has two orders of magnitude higher perfor-mance per area than Nios II/f. Performance of the V8+1FIXED and V32+1FIXEDis essentially the same as the V2+1FIXED, showing that the kernel of the com-putation is running entirely on the N-body CVI, with only some housekeepingoperations running on the SVP. Additional lanes of the N-body CVI increase per-formance to 7200× that of Nios II/f for the V32+16FIXED configuration. TheFLOAT configurations created with DSPBA have slightly higher area and slightlylower performance (due to having a longer pipeline) compared to the FIXED con-figurations.5.7 SummaryThis chapter has presented a method of attaching custom vector instructions (CVIs)to SVPs, allowing the SVP to take advantage of custom data processing pipelinesimplemented in the FPGA fabric. This approach broadens the design space inwhich SVPs are useful by allowing the designer to acheive increased performanceby migrating the most compute-intensive parts of the algorithm into custom logic.Instead of designing entirely in software or RTL, the designer is able to start insoftware on an SVP and create a working solution first, and only then incrementallyadd CVIs as needed.Our approach reuses existing structures in SVPs to attach a variable numberof streaming pipelines with minimal resource overhead. These can be accessedin software as an extension to the SVP’s ISA. Logic-intensive operators, such asfixed-point divide, should not be simply replicated across all vector lanes. Doing sowastes FPGA area unnecessarily. Instead, it is important to consider the frequencyof use of the specialized pipeline, and add enough copies to get the most speed-upwith minimal area overhead. Methods for dispatching complex CVIs were pre-sented, including a time-interleaved method that allows an arbitrary number of96inputs and outputs using funnel adapters.The performance results achieve speedups far beyond what a plain SVP canaccomplish. For example, a 32-lane SVP achieves a speedup of 31.4, whereas aCVI-optimized version is another 230 times faster, with a net speedup of 7,200versus Nios II/f. As a point of comparison, this puts the MXP roughly at par withreported results for an AVX-optimized Intel Core i7 implementation of the N-Bodyproblem [64].97Chapter 6ConclusionsWhat we call the beginning is often the end. And to make an end is tomake a beginning. The end is where we start from. — T. S. EliotThis work has expanded the applicability of soft vector processors (SVPs),making them more efficient for applications they could already handle, and en-abling new classes of applications to execute on them. Soft processors have advan-tages and drawbacks compared to RTL design. The software style design flow iseasier to understand and faster to debug and iterate. In particular, SVPs are a goodmatch for embedded media applications, where data can be processed as large vec-tors. In contrast, RTL design gives full control to do exactly what the algorithmneeds with the only practical constraint being area and power budgets.We started from the results of previous work which showed that SVP are useful,and improved upon that work with better results and new features and applications.It is important to step back and look at this work at its conclusion and try to gaugewhat its impact may be. In addition to the academic work (ours and the parallelwork mentioned in Chapter 2), there is a commercial entity started as a result of thiswork. VectorBlox Computing Inc. [67] was founded in 2011, and the author hasworked with that company through NSERC and MITACS scholarships. The abilityof this startup company to get funding and employ multiple engineers speaks tothe current relevance of SVPs, and the research work presented here in particular.That said, this work is an academic thesis and stands on its own as a set of ideas,execution of those ideas, and observations and conclusions that are useful to those98who might build upon them.6.1 ContributionsThis work has demonstrated three main contributions. First, the area and perfor-mance penalty is reduced; our VENICE processor has 2× the performance per areaof the previous best SVP. This is done through a combination of FPGA-specificoptimizations and architectural enhancements that focus on efficiency rather thanmaximum performance/scaling. Instead of implementing an existing vector ISA asour starting point, we designed VENICE around the FPGA hardware and therebywere able to both increase performance and reduce area.Second, we enabled a new class of applications with wavefront skipping. Forapplications with divergent control flow, such as those which do a variable amountof computation depending on the input data, wavefront skipping allows for muchhigher performance; 3× higher performance than the base SVP was observed withless than 5% area overhead. This enables SVPs to tackle computer vision algo-rithms such as object and face detection without wasting cycles doing unnecessaryprocessing. These types of algorithms are also difficult to implement efficiently inhardware due to the divergent control flow, making SVPs an attractive target.Finally, we directly addressed the gap between SVPs and RTL by allowingthe use of custom vector instructions (CVIs). These allow the user to extend theSVP by connecting additional processing pipelines, created using a small amountof RTL or using a GUI. Our CVIs can attach arbitrary length pipelines to the fixed-length pipeline of a simple SVP, and can be used with an arbitrary number of inputsand outputs with little additional buffering required. This allows a user to put themost compute-intensive parts of their applications into a fixed pipeline, while stillbeing able to do auxillary processing in software. We show how on an N-bodyproblem we can gain speedups of 230× that of the base SVP by implementingsimple CVIs to perform the main force calculation, while doing all of the particlemovement, bounds checking, etc. with software.996.2 Future Work6.2.1 Initial WorkOur current SVPs use an existing scalar core to perform address calculations, highlevel control flow, etc. VENICE uses Altera’s Nios II/f, while MXP can use NiosII/f, Xilinx’s MicroBlaze, or ARM’s Cortex-A9. This was a pragmatic design de-cision; we can take advantage of the existing toolchain and get industry standardscalar performance without having to redesign a high performance scalar core.However, this means that no resources (such as DSP blocks) can be shared be-tween the scalar core and vector core. Additionally, vector instructions are issuedusing multiple scalar instructions, which limits vector instruction issue speed andties up the scalar processor when it could be making other progress. Fully integrat-ing the vector and scalar cores would allow for more sharing and better instructionstream optimization. This would require either modifying an existing open-sourcesoft processor (such as OpenRISC []) or implementing a new processor compatiblewith an open toolchain (such as the RISC-V project [66]).Our wavefront skipping work was based upon the vision algorithms we studied,but it could be made more general. We mentioned in Section 4.3.6 that it wouldbe possible to support multiple masks at the same time. Early exit algorithmsonly need a single mask, as elements are either being processed currently or elsefinished and do not need to be revisited. However, arbitrary branching in algorithmswould currently require saving and restoring the contents of the mask BRAM. Itmay be useful to support multiple masks instead of having to save and restorewhen switching between branches. With a short enough maximum masked vectorlength (MMVL) each mask is not as deep as a full BRAM, so multiple masks couldshare a BRAM.We may also be able to repurpose our multiple partition addressing logic toincrease the speed of transposed matrix accesses. We can already set up stridedaccesses using masked vector instructions; they will run at full speed providedthey do not cause bank conflicts (by being a power of two, for instance). Wecould also do transposed accesses in this manner for matrices that are of the correctdimensions (2N ± 1) or were padded to the correct dimensions. However, in this100case, we would want to write out the destination at different offsets than we readthe input, so we would have to either have a mixed addressing mode (mask inputs,flat addressed outputs) or else use multiple mask RAMs to store different maskoffsets for the inputs and the outputs.We might also want to directly accelerate strides that are powers of two. Index-ing into scratchpad banks is done by taking the address modulo the width of thescratchpad, which is a power of two. Thus, power of two strides always fall withinthe same scratchpad bank and cannot be accelerated via wavefront skipping. Powerof two strides are useful for algorithms such as fast Fourier transforms (FFTs) aswell as matrix transposition. To allow parallel memory access to elements at powerof two strides, they cannot be stored in the same memory bank. Prime numberbanking [44] can achieve this at the cost of requiring complex addressing logic.For fixed sizes, Latin squares [34] allow fully parallel access to row/column/diago-nals of a matrix. Either of these alternatives could be applied to scratchpad bankingor a future banked external memory system.Regarding CVIs, larger companies such as Xilinx and Altera are seeking tomove to higher level design through high level synthesis (HLS) tools such asOpenCL compilers [19, 73]. These tools allow designers to specify data flow al-gorithms as a series of multithreaded kernels which can be mapped into FPGAlogic. While this gives a data parallel method of programming FPGAs, it still re-quires performing FPGA synthesis during the algorithm design and debug stages.It may be possible to combine the best of both OpenCL HLS and SVPs. Com-piling OpenCL to an SVP should be relatively straightforward, though additionalfeatures would have to be added (e.g., atomic memory operations) for full sup-port. Integrating SVPs and OpenCL HLS could be done through a custom vectorinstruction (CVI) interface or something similar.6.2.2 Long Term DirectionsA major point this thesis does not address is accelerating scatter/gather and powerof two stride support on SVPs. Scatter/gather is the vector parlance for indexed(data-dependent) memory operations; a scatter takes an address vector and ’scat-ters’ data to memory while a gather uses an address vector to ’gather’ data from101various locations in memory and load it into a packed data vector. These operationsare necessary for algorithms such as sorting, graph traversal, and sparse matrix ma-nipulation. However, fulfilling multiple memory requests per cycle is non-trivial,though. A first step towards accelerating scatter/gather operations was done withTputCache [58] which is a high-throughput cache that allowed MXP to increaseperformance on scatter/gather benchmarks. TputCache does not perform any mem-ory access coalescing, but a future version could have multiple cache banks sharea wide back-end that interfaces to external memory, increasing bandwidth utiliza-tion.Longer term ambitions might include creating a wider family of soft parallelprocessors that can be targetted by the same software and looking at architecturalsupport in the FPGA fabric.One direction towards increasing performance would be to keep the vector pro-gramming model and creating a family of processors that could even better exploitexisting parallelism than the straightforward SVPs we have used. We have notexplored superscalar or VLIW type approaches on top of the vector paradigm,which would allow for multiple functional units (FUs) to be utilized simultane-ously. Vector chaining is not straightforward on our SVPs due to the use of scratch-pad pointers rather than vector registers, but would be possible for some instructionsequences. We treat our SVPs and its banked scratchpad monolithically, dispatch-ing a single instruction per cycle, but it may be advantageous to fracture it intomultiple running threads which could synchronize when possible.Support in the FPGA fabric could come in a number of forms. Hardeningof individual functions, such as having a full integer ALU as an element withinthe FPGA fabric, would be useful in reducing area and power, though less flexi-ble than using soft logc. Fully hardening the vector processor itself and keepingthe tightly coupled CVI interfaces to the FPGA fabric would reduce power and in-crease performance even more, though with even less flexibility. A hardened vectorcore could operate at higher frequency than the FPGA logic, but the vector modelmakes it straightforward to use funnel adapters where the CVI would operate atone half or one fourth of the frequency of the vector core, but use two or four timesas many lanes.1026.3 SummaryThis dissertation demonstrates that SVPs can be applicable in a range of appli-cations for which they were previously not well suited. VENICE significantlyreduces the performance/area penalty of using SVPs compared to RTL, showingthat SVPs need not use large amounts of area to accelerate modest-performanceapplications. Our work on wavefront skipping shows how SVPs can be used to ac-celerate applications that may be difficult to implement in RTL. Our CVI interfacegives a way of interfacing deep pipelines with multiple inputs and outputs to SVPs,enabling designers to harness the power of the FPGA fabric directly while retain-ing the software control that the SVP provides. Together, these contributions helpSVPs to be a much more useful tool for implementing data parallel applications onFPGAs.103Bibliography[1] Altera Corporation. Nios II compiler user guide, . URL nios2 c2h compiler.pdf. AccessedDecember 2009. → pages 19[2] Altera Corporation. DSP builder, . URL Accessed September2013. → pages 92[3] Altera Corporation. Increasing productivity with Quartus II incrementalcompilation, . URL September 2014. → pages 13[4] Altera Corporation. Nios II processor: The world’s most versatile embeddedprocessor, . URL AccessedSeptember 2014. → pages 3, 14, 27[5] Altera Corporation. Stratix V FPGA family overview, . URL September 2014. → pages 12[6] K. Asanovic. Vector Microprocessors. PhD thesis, University of Californiaat Berkeley, May 1998. Technical Report UCB-CSD-98-1014. → pages 18[7] J. Bachrach, H. Vo, B. Richards, Y. Lee, A. Waterman, R. Avizˇienis,J. Wawrzynek, and K. Asanovic´. Chisel: constructing hardware in a Scalaembedded language. Proceedings of the 49th Annual Design AutomationConference, pages 1216–1225, 2012. → pages 2[8] A. Brant and G. Lemieux. Zuma: An open FPGA overlay architecture.IEEE Symposium on Field Programmable Custom Computing Machines,pages 93–96, 2012. → pages 14104[9] B. Brewer. libfixmath - cross platform fixed point maths library. URL Accessed September 2013. → pages94[10] B. Brousseau and J. Rose. An energy-efficient, fast FPGA hardwarearchitecture for OpenCV-compatible object detection. InternationalConference on Field-Programmable Technology, pages 166–173, Dec 2012.→ pages 68[11] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, T. Czajkowski,S. D. Brown, and J. H. Anderson. LegUp: An open-source high-levelsynthesis tool for FPGA-based processor/accelerator systems. ACMTransactions on Embedded Computing Systems, 13(2):24:1–24:27, 2013. →pages 2[12] D. Capalija and T. Abdelrahman. A high-performance overlay architecturefor pipelined execution of data flow graphs. International Conference onField Programmable Logic and Applications, pages 1–8, Sept 2013. →pages 4[13] S. Che, J. Li, J. W. Sheaffer, K. Skadron, and J. Lach. Acceleratingcompute-intensive applications with GPUs and FPGAs. Symposium onApplication Specific Processors, pages 101–107, 2008. → pages 1[14] D. Chen, J. Cong, and P. Pan. FPGA design automation: A survey.Foundations and Trends in Electronic Design Automation, 1(3):139–169,2006. → pages 13[15] C. H. Chou, A. Severance, A. D. Brant, Z. Liu, S. Sant, and G. G. Lemieux.VEGAS: Soft vector processor with scratchpad memory. ACM/SIGDAInternational Symposium on Field Programmable Gate Arrays, pages15–24, 2011. → pages x, 5, 20, 30[16] J. Cong, M. A. Ghodrat, M. Gill, H. Huang, B. Liu, R. Prabhakar,G. Reinman, and M. Vitanza. Compilation and architecture support forcustomized vector instruction extension. Asia and South Pacific DesignAutomation Conference, pages 652–657, 2012. → pages 27[17] T. E. M. B. Consortium. EEMBC - the embedded microprocessorbenchmark consortium. URL Accessed January 2015.→ pages 44105[18] Convey Computer. The Convey HC-2 computer: Architectural overview.URL HC-2 Architectual Overview.pdf. Accessed September 2013. →pages 22, 27[19] T. S. Czajkowski, U. Aydonat, D. Denisenko, J. Freeman, M. Kinsner,D. Neto, J. Wong, P. Yiannacouras, and D. P. Singh. From OpenCL tohigh-performance hardware on FPGAs. International Conference on FieldProgrammable Logic and Applications, pages 531–534, 2012. → pages 2,101[20] R. Duncan. A survey of parallel computer architectures. IEEE Transactionson Computers, 23(2):5–16, 1990. → pages 15[21] R. Espasa, M. Valero, and J. E. Smith. Vector architectures: Past, presentand future. International Conference on Supercomputing, pages 425–432,1998. → pages 15[22] B. Fort, D. Capalija, Z. G. Vranesic, and S. D. Brown. A multithreaded softprocessor for SoPC area reduction. IEEE Symposium on FieldProgrammable Custom Computing Machines, pages 131–142, 2006. →pages 19[23] W. Fung and T. Aamodt. Thread block compaction for efficient SIMTcontrol flow. International Symposium on High Performance ComputerArchitecture, pages 25–36, Feb 2011. ISSN 1530-0897. → pages 26[24] W. W. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic warp formationand scheduling for efficient gpu control flow. IEEE/ACM InternationalSymposium on Microarchitecture, pages 407–420, 2007. → pages 26[25] R. E. Gonzalez. Xtensa: A configurable and extensible processor.IEEE/ACM International Symposium on Microarchitecture, 20(2):60–70,2000. → pages 27[26] E. Hung and S. J. Wilton. Accelerating FPGA debug: Increasing visibilityusing a runtime reconfigurable observation and triggering network. ACMTransactions on Design Automation of Electronic Systems, 19(2):14:1–14:23, 2014. → pages 2[27] Intel Corporation. Intel instruction set extensions, . URL Accessed December2014. → pages 4, 16106[28] Intel Corporation. Intel Xeon Phi coprocessor: Datasheet, . URL Accessed December 2014. → pages16[29] H. Ishihata, T. Horie, S. Inano, T. Shimizu, and S. Kato. An architecture ofhighly parallel computer AP 1000. IEEE Pacific Rim Conference onCommunications, Computers and Signal Processing, pages 13–16, 1991. →pages 25[30] D. Kanter. Intel’s Haswell CPU microarchitecture. URL Accessed December 2012. →pages 18[31] N. Kapre and A. DeHon. Accelerating SPICE model-evaluation usingFPGAs. IEEE Symposium on Field Programmable Custom ComputingMachines, pages 37–44, 2009. → pages 14[32] N. Kapre, N. Mehta, M. Delorimier, R. Rubin, H. Barnor, M. J. Wilson,M. Wrighton, and A. Dehon. Packet switched vs. time multiplexed FPGAoverlay networks. IEEE Symposium on Field Programmable CustomComputing Machines, pages 205–216, 2006. → pages 14[33] J. Kathiara and M. Leeser. An autonomous vector/scalar floating pointcoprocessor for FPGAs. IEEE Symposium on Field Programmable CustomComputing Machines, pages 33–36, 2011. → pages 22[34] K. Kim and V. K. Prasanna. Latin squares for parallel array access. IEEETransactions on Parallel and Distributed Systems, 4(4):361–370, Apr. 1993.ISSN 1045-9219. → pages 101[35] S. Kirkpatrick, C. D. Gelatt, M. P. Vecchi, et al. Optimization by simmulatedannealing. Science, 220(4598):671–680, 1983. → pages 13[36] C. Kozyrakis. Scalable Vector Media Processors for Embedded Systems.PhD thesis, University of California at Berkeley, May 2002. TechnicalReport UCB-CSD-02-1183. → pages 18[37] C. Kozyrakis and D. Patterson. Vector vs. superscalar and VLIWarchitectures for embedded multimedia benchmarks. IEEE/ACMInternational Symposium on Microarchitecture, pages 283–293, 2002. →pages 4107[38] R. Krashinsky, C. Batten, M.Hampton, S. Gerding, B. Pharris, J. Casper, andK. Asanovic. The vector-thread architecture. International Symposium onComputer Architecture, pages 52–63, June 2004. → pages 50[39] D. J. Kuck and R. Stokes. The Burroughs Scientific Processor (BSP). IEEETransactions on Computers, C-31(5):363–376, May 1982. ISSN 0018-9340.→ pages 24[40] I. Kuon and J. Rose. Measuring the gap between FPGAs and ASICs. IEEETransactions on Computer-Aided Design of Integrated Circuits and Systems,26(2):203–215, 2007. → pages 11[41] I. Kuon, R. Tessier, and J. Rose. FPGA architecture: Survey and challenges.Foundations and Trends in Electronic Design Automation, 2(2):135–253,2008. → pages 11[42] C. E. LaForest and J. G. Steffan. OCTAVO: an FPGA-centric processorfamily. ACM/SIGDA International Symposium on Field Programmable GateArrays, pages 219–228, 2012. → pages 4[43] Lattice Semiconductor Corporation. LatticeMico32 open, free 32-bit softprocessor. URL AccessedDecember 2014. → pages 3[44] D. H. Lawrie and C. Vora. The prime memory system for array access.IEEE Transactions on Computers, C-31(5):435–442, May 1982. ISSN0018-9340. → pages 101[45] Y. Lee, R. Avizienis, A. Bishara, R. Xia, D. Lockhart, C. Batten, andK. Asanovic´. Exploring the tradeoffs between programmability andefficiency in data-parallel accelerators. ACM Special Interest Group inComputer Architecture News, 39(3):129–140, 2011. → pages 26[46] Z. Liu, A. Severance, S. Singh, and G. G. Lemieux. Accelerator compilerfor the VENICE vector processor. ACM/SIGDA International Symposium onField Programmable Gate Arrays, pages 229–232, 2012. → pages iv, 22, 31[47] A. Ludwin and V. Betz. Efficient and deterministic parallel placement forFPGAs. ACM Transactions on Design Automation of Electronic Systems, 16(3):22:1–22:23, 2011. → pages 2, 13108[48] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, andY. N. Patt. Improving GPU performance via large warps and two-level warpscheduling. IEEE/ACM International Symposium on Microarchitecture,pages 308–317, 2011. → pages 26[49] M. Naylor and S. W. Moore. Rapid codesign of a soft vector processor andits compiler. International Conference on Field Programmable Logic andApplications, pages 1–4, 2014. → pages 22[50] M. Naylor, P. Fox, A. Markettos, and S. Moore. Managing the FPGAmemory wall: Custom computing or vector processing? InternationalConference on Field Programmable Logic and Applications, pages 1–6, Sept2013. → pages 22[51] R. Nikhil. Bluespec System Verilog: efficient, correct RTL from high levelspecifications. ACM and IEEE International Conference on Formal Methodsand Models for Co-Design, pages 69–70, 2004. → pages 2, 22[52] K. Ovtcharov, I. Tili, and J. Steffan. TILT: A multithreaded VLIW softprocessor family. International Conference on Field Programmable Logicand Applications, pages 1–4, Sept 2013. → pages 4[53] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C.Phillips. GPU computing. Proceedings of the IEEE, 96(5):879–899, 2008.→ pages 26[54] K. Papadimitriou, A. Dollas, and S. Hauck. Performance of partialreconfiguration in FPGA systems: A survey and a cost model. ACMTransactions on Reconfigurable Technology and Systems, 4(4):36:1–36:24,2011. → pages 3[55] B. R. Rau and J. A. Fisher. Instruction-level parallel processing: History,overview, and perspective. The Journal of Supercomputing, 7(1-2):9–50,1993. → pages 18[56] R. M. Russell. The CRAY-1 computer system. Communications of the ACM,21(1):63–72, Jan. 1978. ISSN 0001-0782. → pages 15[57] A. Severance and G. Lemieux. VENICE: A compact vector processor forFPGA applications. International Conference on Field-ProgrammableTechnology, pages 261–268, Dec 2012. → pages iv109[58] A. Severance and G. Lemieux. TputCache: High-frequency, multi-waycache for high-throughput FPGA applications. International Conference onField Programmable Logic and Applications, pages 1–6, Sept 2013. →pages 25, 102[59] A. Severance and G. G. Lemieux. Embedded supercomputing in FPGAswith the VectorBlox MXP matrix processor. International Conference onHardware/Software Codesign and System Synthesis, pages 6:1–6:10, 2013.→ pages 22, 36, 53[60] A. Severance, J. Edwards, H. Omidian, and G. Lemieux. Soft vectorprocessors with streaming pipelines. ACM/SIGDA International Symposiumon Field Programmable Gate Arrays, pages 117–126, 2014. → pages v, 93[61] A. Severance, J. Edwards, and G. Lemieux. Wavefront skipping usingBRAMs for conditional algorithms on vector processors. ACM/SIGDAInternational Symposium on Field Programmable Gate Arrays, Feb 2015.→ pages iv[62] J. E. Smith. Density dependent vector mask operation control apparatus andmethod, 1996. URL US Patent5,940,625. → pages 25[63] J. E. Smith, G. Faanes, and R. Sugumar. Vector instruction set support forconditional operations. ACM Special Interest Group in ComputerArchitecture News, 28(2):260–269, May 2000. ISSN 0163-5964. → pages7, 23, 26, 52[64] A. Tanikawa, K. Yoshikawa, K. Nitadori, and T. Okamoto.Phantom-GRAPE: numerical software library to accelerate collisionlessN-body simulation with SIMD instruction set on x86 architecture.,Oct. 2012. → pages 95, 97[65] D. Tarditi, S. Puri, and J. Oglesby. Accelerator: Using data parallelism toprogram GPUs for general-purpose uses. Technical ReportMSR-TR-2005-184, Microsoft Research, October 2006. URL → pages22[66] University of California, Berkeley. RISC-V. URL AccessedDecember 2014. → pages 100[67] VectorBlox Computing, Inc. VectorBlox - embedded supercomputing. URL Accessed September 2014. → pages 98110[68] P. Viola and M. Jones. Rapid object detection using a boosted cascade ofsimple features. IEEE Computer Society Conference on Computer Visionand Pattern Recognition, 1:511–518, 2001. → pages 7[69] C. Wang and G. Lemieux. Scalable and deterministic timing-driven parallelplacement for FPGAs. ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pages 153–162, 2011. → pages 2, 13[70] M. Weiss. Strip mining on SIMD architectures. International Conference onSupercomputing, pages 234–243, 1991. → pages 4, 56[71] H. Wong, V. Betz, and J. Rose. Comparing FPGA vs. custom CMOS and theimpact on processor microarchitecture. ACM/SIGDA InternationalSymposium on Field Programmable Gate Arrays, pages 5–14, 2011. →pages 3, 15, 64[72] Xilinx, Inc. MicroBlaze soft processor core, . URL Accessed September 2014. →pages 3, 27[73] Xilinx, Inc. All programmable abstractions, . URL Accessed December2014. → pages 101[74] P. Yiannacouras, J. G. Steffan, and J. Rose. VESPA: portable, scalable, andflexible FPGA-based vector processors. International Conference onCompilers, Architectures and Synthesis for Embedded Systems, pages 61–70,2008. → pages 4, 19, 78[75] P. Yiannacouras, J. G. Steffan, and J. Rose. Fine-grain performance scalingof soft vector processors. International Conference on Compilers,Architectures and Synthesis for Embedded Systems, pages 97–106, 2009. →pages 19, 27[76] P. Yiannacouras, J. G. Steffan, and J. Rose. Data parallel FPGA workloads:Software versus hardware. International Conference on FieldProgrammable Logic and Applications, pages 51–58, 2009. → pages 19[77] P. Yiannacouras, J. G. Steffan, and J. Rose. Data parallel fpga workloads:Software versus hardware. International Conference on FieldProgrammable Logic and Applications, pages 51–58, 2009. → pages 5111[78] J. Yu, C. Eagleston, C. H. Chou, M. Perreault, and G. Lemieux. Vectorprocessing as a soft processor accelerator. ACM Transactions onReconfigurable Technology and Systems, 2(2):1–34, 2009. → pages 4, 5, 19112


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items