Real-time Computer Vision in Software using Custom Vector Overlays

by

Joseph James Edwards

B.A.Sc., University of British Columbia, 2012

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Applied Science in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Electrical and Computer Engineering)

The University of British Columbia (Vancouver)

July 2018

© Joseph James Edwards, 2018

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Real-time Computer Vision in Software using Custom Vector Overlays

submitted by Joseph James Edwards in partial fulfillment of the requirements for the degree of Master of Applied Science in Electrical and Computer Engineering.

Examining Committee:
Guy Lemieux, Electrical and Computer Engineering (Supervisor)
Mieszko Lis, Electrical and Computer Engineering (Supervisory Committee Member)
Sid Fels, Electrical and Computer Engineering (Supervisory Committee Member)

Abstract

Real-time computer vision places stringent performance requirements on embedded systems. Often, dedicated hardware is required. This is undesirable, as hardware development is time-consuming, requires extensive skill, and can be difficult to debug. This thesis presents three case studies of accelerating computer vision algorithms using a software-driven approach, where only the innermost computation is performed with dedicated hardware. As a baseline, the algorithms are initially run on a scalar host processor. Next, the software is sped up using an existing vector overlay implemented in the FPGA fabric, manually rewriting the code to use vectors. Finally, the overlay is customized to accelerate the critical inner loops by adding hardware-assisted custom vector instructions. Collectively, the custom instructions require very few lines of RTL code compared to what would be needed to implement the entire algorithm in dedicated hardware. This keeps design complexity low and yields a significant performance boost. For example, in one system, we measured a performance advantage of 2.4× to 3.5× over previous state-of-the-art dedicated hardware systems while using far less custom hardware.

Preface

In all chapters, the soft vector processor and the software libraries used to program it were provided by VectorBlox Computing. Portions of Chapter 3 of this thesis appear in "Real-time object detection in software with custom vector instructions and algorithm changes" by Edwards and Lemieux [12]. Both hardware customization and software design were performed by the author, with supervision provided by Guy Lemieux. In Chapter 4, all software design was performed by the author. Initial hardware customization was also written by the author, but the final designs used in this thesis were written by Aaron Severance at VectorBlox Computing. Supervision was provided by Guy Lemieux. Portions of Chapter 5 of this thesis appear in "TinBiNN: Tiny Binarized Neural Network Overlay in Less Than 5,000 4-LUTs" by Edwards, Vandergriendt, Severance, Raouf, Watzka, Singh, and Lemieux [13]. Software was written by the author, with hardware customization provided by Joel Vandergriendt at VectorBlox Computing. Board design was performed by the authors at Lattice Semiconductor. Supervision was provided by Guy Lemieux.
Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1 Introduction
  1.1 Motivation
  1.2 Approach
  1.3 Contributions
  1.4 Thesis Organization
2 Background
  2.1 AdaBoost Object Detection
    2.1.1 Algorithm Overview
    2.1.2 Feature Types: Haar vs LBP
  2.2 Convolutional Neural Networks
    2.2.1 Algorithm Overview
    2.2.2 Binary Neural Networks
  2.3 VectorBlox MXP Architecture
    2.3.1 Parameterization
    2.3.2 Programming Model
    2.3.3 Custom Vector Instructions
    2.3.4 Wavefront Skipping
3 Local Binary Pattern Custom Vector Overlay
  3.1 Approach
  3.2 Adaptation
    3.2.1 Restricting LBP Block Sizes
    3.2.2 ILP Formulation to Reduce Data Size
  3.3 Vectorization
    3.3.1 Applying Wavefront Skipping
  3.4 Custom Vector Instructions
    3.4.1 LBP Table Lookup Instruction
    3.4.2 LBP Pattern Instruction
  3.5 Results
    3.5.1 Experimental Setup
    3.5.2 Performance
    3.5.3 Area
  3.6 Previous Work
  3.7 Design Complexity
  3.8 Summary
4 Convolutional Neural Network Custom Vector Overlay
  4.1 Approach
  4.2 Adaptation
    4.2.1 Tiny YOLOv2
    4.2.2 VGG16 SVD500
    4.2.3 Quantization
    4.2.4 Quantization Impact on Accuracy
  4.3 Vectorization
  4.4 Custom Vector Instructions
    4.4.1 3x3 Convolution Instruction
  4.5 Results
    4.5.1 Experimental Setup
    4.5.2 Overlay Instances
    4.5.3 Performance
    4.5.4 Area
  4.6 Previous Work
  4.7 Design Complexity
  4.8 Summary
5 Binary-weight Neural Network Custom Vector Overlay
  5.1 Approach
  5.2 Adaptation
    5.2.1 Network Reduction
  5.3 Vectorization
  5.4 Custom Vector Instructions
    5.4.1 3x3 Binary Convolution Instruction
  5.5 Results
    5.5.1 Experimental Setup
    5.5.2 Performance
    5.5.3 Area
  5.6 Previous Work
  5.7 Design Complexity
  5.8 Summary
6 Conclusions
  6.1 Summary
  6.2 Limitations
  6.3 Future Work
Bibliography

List of Tables

Table 3.1 Accuracy of frontal face cascades run on the MIT-CMU test set
Table 3.2 Resource usage and Fmax
Table 3.3 Previous work comparison
Table 4.1 Tiny YOLOv2 VOC hyperparameters
Table 4.2 VGG16 SVD500 Convolutional Neural Network (CNN) hyperparameters
Table 4.3 Mean average precision of various versions of Tiny YOLOv2 VOC
Table 4.4 Tiny YOLOv2 VOC runtime breakdown (ms)
Table 4.5 Tiny YOLOv2 VOC throughput breakdown (GOPS)
Table 4.6 VGG16 SVD500 runtime breakdown (ms)
Table 4.7 VGG16 SVD500 throughput breakdown (GOP/s)
Table 4.8 Comparing inference speed to the Darknet framework
Table 4.9 Resource usage
Table 4.10 Previous work comparison
Table 5.1 Runtime of reduced networks, desktop vs custom overlay
Table 5.2 Resource usage
Table 6.1 Comparing the various overlays explored in this thesis

List of Figures

Figure 1.1 Results after running frontal face detection
Figure 1.2 Results after running the Tiny YOLO v2 (COCO) network
Figure 1.3 Example CIFAR-10 "horse" image
Figure 2.1 A search window traverses an image pyramid, evaluating a cascade of classifiers at each position
Figure 2.2 Examples of typical features
Figure 2.3 Overview of MB-LBP feature computation
Figure 2.4 Example CNN network, showing multiple layers
Figure 2.5 Overview of the VectorBlox MXP processor
Figure 2.6 Code required to allocate and move data inside the scratchpad
Figure 2.7 Calculating a fully-connected layer on a host processor versus the equivalent VectorBlox MXP instructions
Figure 3.1 Distribution of block sizes of Local Binary Pattern (LBP) features in the trained classifiers
Figure 3.2 Minimal code modifications to lbpfeatures.cpp generate cascades with restricted block sizes
Figure 3.3 Sample ILP constraints for a stage with 5 features using Z3
Figure 3.4 The number of features calculated at every location, comparing scalar, row-based, and masked row-based execution
Figure 3.5 The inner loop using a custom vector instruction
Figure 3.6 Accelerating the pre-computation of LBP patterns
Figure 3.7 Photo of video output showing 49 of 50 faces detected on a 1080p image in 36 ms
Figure 3.8 Performance in milliseconds (speedup) on 320×240 image pyramid, 1.1 scale factor, unit stride
Figure 4.1 Average precision curve for "person" class
Figure 4.2 Convolution instruction (showing a V4 system with 2 CNN super-kernels attached)
Figure 4.3 Example YOLO detection
Figure 5.1 Reduced binary CNN containing 89% fewer operations than BinaryConnect
Figure 5.2 Samples of the CIFAR-10 dataset
Figure 5.3 Person detector, sample results
Figure 5.4 Binary convolution custom vector instruction
Figure 5.5 System diagram

Glossary

ASIC    Application-specific Integrated Circuit
BNN     Binary Neural Network
BRAM    Block Random Access Memory
CNN     Convolutional Neural Network
CVI     Custom Vector Instruction
DMA     Direct Memory Access
DRAM    Dynamic Random Access Memory
DSP     digital signal processor
FPGA    Field-programmable Gate Array
GOPS    giga operations per second
GPU     Graphics Processing Unit
ILP     Integer Linear Programming
IoT     internet-of-things
LBP     Local Binary Pattern
LUT     Lookup table
LVE     Lightweight Vector Extensions
MAC     multiply-accumulate
mAP     mean average precision
MB-LBP  Multi-Block Local Binary Pattern
MIMD    Multiple Instruction, Multiple Data
PE      processing element
RAM     Random Access Memory
ReLU    Rectified Linear Unit
ROM     Read-only Memory
RTL     Register Transfer Language
SIMD    Single Instruction, Multiple Data
SVD     Singular-Value Decomposition
SoC     System-on-Chip
TOPS    tera operations per second
YOLO    You Only Look Once
Acknowledgments

I would like to thank Dr. Guy Lemieux for his support and guidance during my time as a graduate student.

I would also like to thank both Lattice and Xilinx for donating hardware and software licenses used in this research, and NSERC for providing funding.

Chapter 1: Introduction

1.1 Motivation

Advanced computer vision algorithms and sophisticated image processing are becoming important workloads for embedded systems. This push for embedded systems to run demanding computations is partly fueled by the growing internet-of-things (IoT) movement and the need for edge processing. Advanced embedded systems are needed for security analysis, self-navigating automobiles or drones, quality control systems, and a host of other applications.

Higher frame rates, higher resolutions, and more complex operations push embedded systems well into the GOPS range, demanding a higher degree of parallelization. On desktop machines, where these algorithms are first developed, neither cost nor power is paramount. Real-time performance is often achieved using tuned libraries leveraging multi-core execution, SIMD instructions, or general-purpose GPU processing. On embedded systems, where these algorithms are deployed and run at scale, both cost and power matter.

Custom hardware solutions are often required to meet both performance and power constraints.
In many cases, Field-programmable Gate Arrays (FPGAs) represent ideal platforms for creating a System-on-Chip (SoC), as the configurable logic can be leveraged to perform computationally demanding processing in a highly parallel, efficient manner. FPGAs are also intrinsically low power, interface directly with other devices, and typically only need external memory to complete a system. The flexibility and performance of FPGAs come at a price, however, requiring complex Register Transfer Language (RTL) design and considerable time to implement and debug. To reduce development effort, FPGA users should aim to minimize the amount of domain-specific RTL their designs require. A trade-off between performance and design complexity may be needed.

1.2 Approach

To mitigate this trade-off, our approach allows for, but minimizes, the amount of domain-specific RTL needed. We propose a hybrid hardware/software approach where the majority of development effort is kept in software, keeping design complexity low (measured in lines of RTL hardware versus lines of software).

Our approach is summarized in the following steps. First, a baseline is established by running the program on a host processor on the FPGA. Second, the software is rewritten to take advantage of a soft vector overlay, increasing performance over the baseline. Third, after profiling the application, targeted hardware is added to the vector overlay in the form of custom vector instructions to produce a large performance boost. This last step can be revisited as current bottlenecks are removed and new bottlenecks are discovered. By iterating through bottlenecks until performance requirements are met, hardware development is kept to a minimum. The majority of processing does not require acceleration beyond the initial software-based vectorization, as the majority of the runtime is usually isolated to a small region of code that can be further accelerated by a small amount of custom RTL. Our approach quickly produces a custom vector overlay capable of billions of operations per second (GOPS).

Several case studies follow, showing the applicability of our approach. We develop a custom vector overlay for both traditional machine learning pipelines and state-of-the-art deep learning pipelines. In all case studies, we produce complete end-to-end demonstration systems. We present the speedup of our accelerated version relative to the host processor, report the additional lines of RTL required (including whitespace and comments), and compare to previous implementations when possible.

1.3 Contributions

Our first case study presents a real-time face detection system, using a variation of the classic Viola-Jones algorithm. Our result is competitive with similar systems designed completely in hardware. This method of face detection serves as a good example of our hybrid approach, both due to the resulting performance of the complete system and because parallelizing the algorithm is not straightforward. Our custom overlay achieves a speedup of 248.4× compared to the hard ARM Cortex-A9 processor and requires less than 800 lines of custom RTL. Previous efforts implementing face detection on FPGAs have produced good results [6, 7, 16].
However, these solutions are single-use, fixed implementations that can be difficult to maintain, and the developed hardware is only useful for object detection. Our approach results in faster detection rates and can be modified to include other types of pre-processing and post-processing. In developing our solution, we also provide a novel contribution, using Integer Linear Programming (ILP) to quantize our generated object detection cascades. Our face detection system produces the results seen in Figure 1.1.

Figure 1.1: Results after running frontal face detection

The second case study accelerates Convolutional Neural Network (CNN) inference for multi-class object detection and localization. CNNs represent the state of the art in object detection. The You Only Look Once (YOLO) project [28] uses a new approach focusing on real-time operation. The project's reduced-size network can reach up to 200 fps on a (then) top-of-the-line 16 nm Nvidia Titan-X Graphics Processing Unit (GPU) using 416×416 input images. However, the Titan-X GPU is not suitable for embedded applications due to power, form factor, the need for an x86 host, and an inability to directly communicate with sensors and I/O devices.

Figure 1.2: Results after running the Tiny YOLO v2 (COCO) network

We produce a CNN overlay that uses a reduced data width of 8 bits. We run several example networks, including 80-category and 20-category variants of the YOLO object detector. Sample outputs of Tiny YOLO are shown in Figure 1.2. Our CNN overlay, in a single-core configuration, reaches 12.4 fps. A dual-core system, showing nearly perfect 2:1 scaling, achieves 24.4 fps. This is a 1359.0× speedup over the ARM processor. The addition of the convolution accelerator requires only 1300 lines of additional RTL. Scaling is expected to continue as we increase the number of cores and move to larger, more modern FPGAs with more DSP blocks. This will close the gap with GPUs and previous work on FPGAs that implements dedicated CNN accelerators. The custom vector overlay approach is flexible, offers end-to-end processing, and can easily be extended to handle changes in neural network workloads.

In the third and final case study, several Binary Neural Network (BNN) classifiers derived from BinaryConnect [8] are accelerated using a lightweight overlay. Typically, the gains that deep learning provides in accuracy are offset by the number of multiply-accumulate (MAC) operations required. BinaryConnect networks, however, use 1-bit weights to eliminate multiplications and achieve state-of-the-art error rates using only addition. The 1-bit weights also reduce storage requirements and improve power efficiency.

The small BNN networks examined here are accelerated using a lightweight implementation of the vector processor to produce a minimal-area solution that still benefits from the custom overlay approach. The reduced-precision arithmetic improves the size, cost, power, and performance of neural networks in digital logic. This lightweight overlay uses only about 5000 4-input Lookup tables (LUTs) and fits into the lowest-cost FPGA. We show it can run CIFAR-10 [20], a 10-category classification task, and that our low-precision fixed-point activations do not increase the error. A typical input image is shown in Figure 1.3.

A secondary network, trained as a single-category person classifier, is also presented. The error rate is less than 1%.
Our custom vector overlay improves the runtime 71× over the ORCA RISC-V host, and only requires 200 lines of custom RTL.

Figure 1.3: Example CIFAR-10 "horse" image

1.4 Thesis Organization

This thesis presents our approach across three separate case studies, and is organized as follows. Chapter 2 provides background material for all examples, including Viola-Jones object detection, CNNs, and the soft vector processor overlay. Chapter 3 presents the face detection example. Chapter 4 presents the CNN overlay. Chapter 5 presents the lightweight BNN overlay. Chapter 6 concludes the thesis and lists future work.

Chapter 2: Background

In this chapter we discuss essential background needed for several real-time computer vision algorithms, as well as the vector processor used to realize them. In particular, we cover the seminal work by Viola and Jones on AdaBoost object detection and recent advances in deep learning that use neural networks for multi-object classification and detection. To accelerate both classes of computer vision algorithms on an embedded system, we require a high degree of hardware parallelism. The VectorBlox MXP, a soft vector overlay, is used to provide this parallelism. The VectorBlox MXP is fully software programmable, allowing for rapid development, and is extensible, allowing for simple integration of custom hardware for domain-specific acceleration. Details of the operation and key features of the soft vector processor are described in the last section.

2.1 AdaBoost Object Detection

One of the most popular computer vision algorithms is AdaBoost-based object detection, popularized by Viola and Jones [34]. This algorithm was an important move towards accurate, real-time object detection. AdaBoost remains a popular form of object detection in embedded systems. Uses for detection can include improving auto-focus where faces are found, or producing safety warnings when pedestrians are detected in the paths of vehicles. The detection algorithm finds objects at any size (scale-invariant) and any location (spatially invariant) in an image. This is done with high accuracy, at high resolutions and, if parallelized well, at high frame rates.

2.1.1 Algorithm Overview

For this algorithm, several image pre-processing steps are required. This begins with an initial grey-scaling of the image, followed by the generation of an image pyramid. The images in the pyramid decrease in size until either the width or height is less than the search window dimensions. A search window is then slid across each location of each image in the pyramid. At every location, a detection cascade, consisting of a series of weak classifiers, is evaluated to determine if an object is present. Figure 2.1 shows the search window scanning the image pyramid.

Figure 2.1: A search window traverses an image pyramid, evaluating a cascade of classifiers at each position.

Each weak classifier looks for a particular feature in the search window, using a quick computation to determine if it is present or not. Evaluating these weak classifiers is the central computation in the algorithm. During training, features are ranked by how discriminative they are and are then grouped into stages. Each stage eliminates a percentage of false positives and allows the search at a given location to terminate early as soon as enough weak classifiers in a stage fail. This early-exiting, variable-execution property of the algorithm significantly speeds computation when performed sequentially, but makes feature computation across multiple locations inefficient with Single Instruction, Multiple Data (SIMD)-style execution. In the worst case, all of the locations being processed in parallel must be computed.

OpenCV, a popular open source library for computer vision, provides efficient implementations of both training and detection routines for these classifiers and contains support for classifiers that look for various types of features [5].
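To make the control flow concrete, the following scalar sketch shows how a cascade is evaluated at one window position, with the early exit after any failing stage. The data structures and the evaluate_feature() helper are hypothetical placeholders used only to mirror the description above; they are not the implementation developed later in this thesis.

#include <stdint.h>

typedef struct { int n_features; int threshold; } stage_t;
typedef struct { int n_stages; const stage_t *stages; } cascade_t;

/* Hypothetical helper: returns the pass/fail score of one weak classifier
 * evaluated at window position (x, y). */
int evaluate_feature(const uint8_t *img, int width, int x, int y,
                     int stage, int feature);

/* Returns 1 if an object is detected at (x, y), 0 otherwise. */
int detect_at(const cascade_t *c, const uint8_t *img, int width, int x, int y)
{
    for (int s = 0; s < c->n_stages; s++) {
        int sum = 0;
        for (int f = 0; f < c->stages[s].n_features; f++)
            sum += evaluate_feature(img, width, x, y, s, f);
        if (sum < c->stages[s].threshold)
            return 0;               /* early exit: reject this window */
    }
    return 1;                       /* all stages passed */
}

In a scalar implementation most window positions fail within the first stage or two, so the early exit saves the bulk of the work; the difficulty, revisited in Chapter 3, is preserving that saving under SIMD execution.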
Figure 2.2: Examples of typical features: (a) Haar features, (b) LBP features

2.1.2 Feature Types: Haar vs LBP

Viola and Jones used Haar features in their work. Haar features, as shown in Figure 2.2a, use the average intensity of several rectangular regions. A check is performed to see if a specified subregion (shown in white) is sufficiently brighter or darker than another subregion (shown in black). To make computing these regions efficient, integral images are often used. Integral images, or sum area tables, contain the sum of all pixels above and to the left of each location. This means the sum of pixels for any given rectangular area can be obtained with only four lookups, one at each corner. The variance of the search window, calculated by summing the square of the difference between each pixel and the mean value, is also needed. Variance normalization makes the features more robust to lighting conditions. To efficiently calculate this, a squared version of the integral image is also generated.

Alternative feature types have been used successfully and are more amenable to quick calculation. Local Binary Pattern (LBP) features, shown in Figure 2.2b, rely on generating an 8-bit pattern based on comparing a central region, or cell, to its eight neighboring regions. Each comparison yields 1 bit of this result. This 8-bit pattern then serves as an index into a feature's lookup table, returning a pass or fail value. The operations used in calculating LBP features map well to embedded processors, using integer arithmetic and avoiding division and square-root operations. Although more regions are needed to calculate each feature, the sizes of these regions are regular and have the potential for reuse. Also, each LBP feature tends to be more expressive than an individual Haar feature, which results in cascades with fewer features overall.

Liao et al. introduced Multi-Block Local Binary Pattern (MB-LBP) features, where each of the cells used to generate the LBP pattern may be larger than one pixel [25]. Integral images may also be used to calculate these regions faster. MB-LBP features increased accuracy over single-pixel LBP patterns. Examples of typical Haar and MB-LBP features are contrasted in Figure 2.2, shown as they would appear inside a search window. MB-LBP features are used in our implementation due to their computational advantages.

An overview of MB-LBP feature calculation is shown in Figure 2.3, highlighting the calculation of one of the weak classifiers in the cascade. Each classifier compares the average brightness of a center cell to its 8 neighbour cells, producing an 8-bit comparison bitfield. This 8-bit LBP pattern serves as an index into the feature's LUT, which produces either a PASS or FAIL value. This value is summed with all other features in the stage. A stage fails if the sum is less than its required threshold. Objects are detected at a given location only if all stages pass.

Figure 2.3: Overview of MB-LBP feature computation: (a) compute average brightness, (b) compare to neighbors, (c) look up pattern, (d) add pass or fail.
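The per-classifier computation in Figure 2.3 can be written in a few lines of scalar C using the integral image described above. The sketch below is illustrative only: it assumes an exclusive-prefix integral image with one extra row and column, and a particular neighbour-to-bit ordering, neither of which is specified by the cascade format. Because all nine cells share the same size, comparing raw sums is equivalent to comparing average brightness.

#include <stdint.h>

/* Sum of the s-by-s cell whose top-left corner is (x, y), using the
 * standard four-corner integral image lookup. */
static inline uint32_t cell_sum(const uint32_t *ii, int stride,
                                int x, int y, int s)
{
    return ii[(y + s) * stride + (x + s)] - ii[y * stride + (x + s)]
         - ii[(y + s) * stride + x]       + ii[y * stride + x];
}

/* MB-LBP pattern for the 3x3 grid of s-by-s cells whose top-left corner
 * is (x, y): one bit per neighbour-versus-center comparison. */
uint8_t mblbp_pattern(const uint32_t *ii, int stride, int x, int y, int s)
{
    uint32_t center = cell_sum(ii, stride, x + s, y + s, s);
    uint8_t pattern = 0;
    int bit = 0;
    for (int dy = 0; dy < 3; dy++) {
        for (int dx = 0; dx < 3; dx++) {
            if (dx == 1 && dy == 1)
                continue;                            /* skip the center cell */
            uint32_t n = cell_sum(ii, stride, x + dx * s, y + dy * s, s);
            pattern |= (uint8_t)((n >= center) << bit++);
        }
    }
    return pattern;   /* index into the feature's 256-entry pass/fail LUT */
}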
A stage fails if the sum is less than its requiredthreshold. Objects are detected at a given location only if all stages pass.13(a) compute average brightness (b) compare to neighbors(c) lookup pattern (d) add pass or failFigure 2.3: Overview of MB-LBP feature computation.2.2 Convolutional Neural NetworksThere has been a growing trend away from traditional machine learning that useshand-tuned features, such as Haar and LBP, toward features that are learned. Thisgoes under the title of representation learning or feature learning. Coinciding withthis move to feature learning are the use of deeper, hierarchical models, whichcombine learned features of earlier layers into more complex features. This allowsfor more abstract, less rigid factors of variation to be learned by the models.These large shifts to deep learning have been made possible by vastly increased14datasets, computational resources, and algorithmic advances, allowing large net-works to train towards better and better solutions. Hinton’s introduction of back-propagation [29] and use of deep feed-forward networks [20] represent two keycontributions in this shift. LeCun’s work using CNNs for MNIST digit recogni-tion [22] proved these networks were adept for computer vision tasks.Convolutional neural networks are now the leading approach across severalcomputer vision challenges [20]. These include MNIST digit recognition [22],CIFAR classification tasks [17] and the much larger classification and localizationchallenges on ImageNet [11]. A pivotal moment occurred in 2012 when AlexNet,a deep CNN, beat previous, traditional approaches by a large margin of 10% inImageNet’s 1000-category classification task [21]. Since this point, CNNs havecontinued to be the dominant approach.Though CNNs have been gaining in accuracy, most localization or detectionimplementations, like R-CNN, require running classification multiple times in asliding window approach, similar to what was discussed for Viola Jones. Thismakes running these algorithms in real time a difficult challenge, as running thenetwork even at a single window location requires a GPU for good performance,much less a resource-constrained embedded system. Improvements such as Faster-CNN have lowered this significantly, but recent research, in the YOLO project,have introduced a new CNN-based architecture that makes real-time inferencepractical.2.2.1 Algorithm OverviewCNN networks typically consist of three layer types: convolution layers, pooling(or subsampling) layers and fully connected layers. In convolution layers, output15maps are generated by convolving each input map with its unique kernel, summingthe results at each location. A bias is usually applied to the output map. This isrepeated for each output map in the layer. For a given layer with maps of sizeN ×M, containing C input maps and K output maps, C×K kernels are appliedover N×M pixels. This leads to a large number of multiply-accumulate (MAC)operations. The kernels are typically 3×3, but can be larger or smaller, dependingon the network. Pooling layers take a given output map and reduce its size. A2D subregion of a map, often 2×2, is reduced to a single value. Common modesof pooling include max pooling and average pooling, either taking the maximumvalue or average value respectively. Fully connected layers, or dense layers, take aninput vector and multiply it against a matrix of weights to produce an output vector.A bias term at every position in the output vector is applied. 
As a matrix of sizeN×M is required for an input vector of N and output vector of M, and inputs andoutputs can commonly be greater than 1024 in number, memory bandwidth tends todominate in fully connected layers. Some networks replace fully connected layerswith 1×1 convolution layers, decreasing this bandwidth demand tremendously.Activation functions often follow convolution and fully connected layers. Theseare usually pointwise functions. Which function is used varies from network to net-work. They can include the sigmoid, tanh, and softmax functions. Simpler func-tions such as Rectified Linear Unit (ReLu), which requires only setting negativevalues to zero, have been used successfully in many networks and is commonlyused after convolution layers. A typical network, like those implemented on theVectorBlox MXP vector overlay is shown in Figure 2.4.Unlike the AdaBoost approach, modern CNNs require minimal preprocessingas features are learned directly from raw pixel data. CNN networks typically take16Figure 2.4: Example CNN network, showing multiple layers.the 3 color channels (red, green and blue) as inputs, and proceed through manylayers of convolution layers and pooling, ending with one or more fully connectedlayers (or 1× 1 convolution layers). Map sizes tend to decrease in later layersdue to pooling, but the number of maps tend to increase to preserve informationcontent. Execution is not variable and operations are simpler and more regular.Instead a much larger, deterministic number of operations are required.2.2.2 Binary Neural NetworksThe large number of multiply-accumulate (MAC) operations (often greater than90% of the operations in CNNs) and apparent robustness of large networks has ledresearchers to explore low-precision weights. A promising area is in the extremecase of binary weights, where weights take on one of two values, usually +/-1.Several methods have been introduced that allow networks to be trained where theweights in convolution and fully connected layers are fixed to +/-1. Examples ofthis include BinaryConnect [8], BinaryNet [9] and XNOR-Net [27].Binary weights are ideal for eliminating costly multiply operations, replacingthem with additions. Memory bandwidth and power are also significantly reduced.This research is promising as it allows even highly constrained embedded systems17Figure 2.5: Overview of the VectorBlox MXP processorto leverage the accuracy gains provided by deep learning. We explore acceleratingthese lightweight networks in our final case study.2.3 VectorBlox MXP ArchitectureFor this thesis, we adopt a parameterizable and extensible soft vector processor forour overlay. The VectorBlox MXP [30] is shown in Figure 2.5. The main reason forselecting a soft vector processor is to minimize the development effort required toaccelerate algorithms. Software-based iterations greatly reduce compilation timeand enable rapid debugging with familiar software development tools. The pro-cessor itself has been pre-designed and pre-verified, saving significant effort forproducing computationally demanding solutions.At a high level, VectorBlox MXP consists of a host processor coupled to a18vector core and Direct Memory Access (DMA) engine. The host processor handlesall control flow and I/O, and dispatches instructions to the vector core and DMAengine. All data processed by VectorBlox MXP is held in a memory-mapped,multi-bank scratchpad memory. A vector can be of any length and start at anyaddress in the scratchpad. 
Data is transferred in and out of the scratchpad using theDMA engine. If there are data hazards between the vector instruction and a DMAtransfer, the vector engine stalls until the hazard is resolved.When a new vector instruction is ready to issue, the address generation logicwill generate the addresses for the vector operands. The vector operands are pro-cessed one wavefront at a time, where the size of a wavefront corresponds with thenumber of parallel ALUs. The addresses are incremented each cycle until the endof the vector is reached.Vector instructions are 2-input, single-output, variable-length SIMD instruc-tions. The engine size is configured as the number of 32-bit vector lanes. Subword-SIMD is supported, so 8-bit and 16-bit data types have 4× and 2× more paral-lelism, respectively. 1D, 2D, and 3D variants of the instructions are supported forvector and matrix operations.Given the operands must be stored in the scratchpad memory, a DMA enginemoves the vectors in between the scratchpad and main memory. This is handledby the programmer. Double buffering can be used to hide data transfer overheadthrough pre-fetching. Long vectors help reduce potential pipeline stalls due tohazards, allowing the engine to stay fully utilized.192.3.1 ParameterizationThe vector engine is parameterized in terms of number of vector lanes and thescratchpad memory size. This allows various systems to be instantiated for the tar-get applications. The ability to trade performance for area is important for differenttargets.Multi-core systems can be instantiated, allowing the work to split between themultiple vector engines. This is useful when long vector lengths are difficult toachieve and lane scaling leads to diminishing returns as clock speed diminisheswith size.2.3.2 Programming ModelFrom a users perspective, programming the VectorBlox MXP requires three steps.First, the operands must be moved into the scratchpad. The requires allocatingspace for all operands in the scratchpad and calling the DMA engine to moveoperands from main memory into the scratchpad. Second, the vector length mustbe specified. Third, the vector operations must be executed using intrinsics. Theresulting output can be used by further vector operations or written back to mainmemory via DMA.Figure 2.6 shows the steps needed to allocate and move two arrays into thescratchpad to be operated upon. A second code example shown in Figure 2.7demonstrates how the vector engine would be used to compute a fully-connectedlayer, versus the host processor. 
This consists of multiplying an input vector by aweight matrix, producing and output vector, and adding a bias vector to this result.To fully utilize the vector engine, the programmer should aim for long vec-tors, overlap any DMA transfers with computation, and use small data operands to20int *arrA = (int*)malloc(10*sizeof(int));int *arrB = (int*)malloc(10*sizeof(int));vbx_word_t *vA = (vbx_word_t*)vbx_sp_malloc(10*sizeof(vbx_word_t));vbx_word_t *vB = (vbx_word_t*)vbx_sp_malloc(10*sizeof(vbx_word_t));vbx_dma_to_vector(vA, arrA, 10*sizeof(int));vbx_dma_to_vector(vB, arrB, 10*sizeof(int));Figure 2.6: Code required to allocate and move data inside the scratchpadscalar code:for (int o = 0; o <= vector_out_length; o++) {vector_out[o] = 0;}for (int o = 0; o <= vector_out_length; o++) {for (int i = 0; i <= vector_in_length; i++) {vector_out[o] += vector_in[i] * weights[o*vector_out_length+i];}}for (int o = 0; o <= vector_out_length; o++) {vector_output[o] += bias[o];}MXP code:// set inner loop vector length and outer loops iterationsvbx_set_vl(vector_in_length, vector_out_length);// set outer loop increments; output++, reuse inputs, row++ of weightsvbx_set_2D(1, 0, vector_out_length);vbxx_acc(VMUL, v_out, v_in, v_weights);// set inner loop vector lengthvbx_set_vl(output_vector_len);vbxx(VADD, v_output, v_output, v_biases);Figure 2.7: Calculating a fully-connected layer on a host processor versus theequivalent VectorBlox MXP instructionsexploit the available subword SIMD speedup.2.3.3 Custom Vector InstructionsThe VectorBlox MXP allows the use of custom vector instructions (CVI) [31].Once added, these CVIs are used by a programmer the same as any other vectorinstruction, taking full advantage of the soft vector processor’s pipelining and data21marshalling. The number of lanes for each CVI can be set independently fromthe vector processor, allowing the designer to instantiate only the number of lanesrequired for performance, minimizing the area used. This becomes especially im-portant for custom instructions that are too large to replicate across all vector lanes.CVIs provide the ability to leverage custom hardware while keeping the design ef-fort to a minimum.2.3.4 Wavefront SkippingA common problem that inhibits SIMD parallelism is control flow divergence.Branches in SIMD engines are commonly handled by masked or predicated in-structions, but this requires all elements in the vector to be processed even ifmasked. This wastes performance by occupying potential execution slots with no-operations.To reduce the impact of control flow divergence, the VectorBlox MXP sup-ports wavefront skipping [32]. This is beneficial when multiple instructions areexecuted with the exact same mask. On a first pass, the mask is analyzed to mem-orize those wavefronts where all elements are masked off. On subsequent passes,during instruction execution, these ‘empty’ wavefronts can then be skipped entirelyand occupy 0 cycles of the Arithmetic Logic Units (ALUs). This results in fasterexecution as a mask narrows to have fewer and fewer enabled elements. Customvector instructions in the VectorBlox MXP can also be masked if the number ofcustom instruction lanes matches the number of vector processor lanes.22Chapter 3Local Binary Pattern CustomVector OverlayDetecting and locating objects is a necessary first step in complex vision appli-cations. In this chapter we implement an LBP-based object detection overlay andproduce an end-to-end, real-time system for high accuracy face detection. 
Runningon the soft vector overlay, we first achieve a 25× speedup over the baseline ARMCortex-A9 processor. We then accelerate the critical inner loops by adding twohardware-assisted custom vector instructions to the overlay, for an additional 10×speedup. The custom vector overlay yields a total 248.4× speedup over the initialCortex-A9 baseline. Collectively, the custom vector overlay requires fewer than800 additional lines of custom RTL, including comments and blank lines. Com-pared to a previous hardware-only face detection system of comparable size, thiswork is 2.6× faster.233.1 ApproachIn object detection systems, objects can be found anywhere in an image frame.Traditional scanning approaches solve this by looking for a small fixed-size objectat all positions in the image. This requires massive parallelism to process frames inreal time. Viola and Jones introduced several key optimizations to make this man-ageable. First, integral images were used to accelerate feature calculation. Second,by using cascades of features, early, more discriminative features are used to re-ject candidate locations, which avoids evaluating any remaining features. Theseoptimizations still do not provide enough performance at high resolutions. To gofaster, many custom hardware engines have been developed that extract SIMD par-allelism, testing the same feature at many locations in parallel. However, variableexecution due to exiting early means typical SIMD processing is not efficient. Thislimits scaling of these solutions, and in turn, their performance.In the approach presented here, we use a software-programmable vector pro-cessor, creating a vector overlay that uses SIMD parallelism with variable-lengthvector instructions. The processor contains a masking feature, which allows us toaddress the exit-early nature of the algorithm, skipping entire sections of a vec-tor that have exited. We also focus on subword data sizes, which allows for moreparallel computation. Our contributions include the creation of a custom vectoroverlay for LBP detection and quantifying the performance of each step in the op-timization process. This includes the initial vectorization, the pre-computation ofrestricted MB-LBP patterns, the use of vector masking and the addition of eachCVI (targeting LBP pattern generation and the LBP LUT operation). We also in-troduce a novel ILP formulation to solve for 8-bit representations of feature pass24and fail values. Our software-driven approach is 2.4× to 3.5× faster than previ-ous custom hardware solutions, showing only minimal additional RTL is needed toachieve state-of-the-art results.3.2 AdaptationIn this section, we adapt the algorithm to make full use of the soft vector overlay.Two changes allow for efficient vectorization. The first change restricts the blocksizes of LBP features, allowing pre-computation of patterns to become feasible.The second change introduces a novel way to quantize stage thresholds and featurepass/fail values to 8 or fewer bits, using an ILP formulation. This allows the innerloop to produce and consume 8-bit values exclusively, enabling a high degree ofsubword SIMD parallelism.3.2.1 Restricting LBP Block SizesMulti-block LBP patterns allow any block size to be used for a given feature. UsingOpenCV’s frontal face detector as an example, the histogram distribution of block(cell) sizes across features is shown in Figure 3.1(a). 
This graph indicates thatsmaller block sizes are used more often and that blocks typically have matchingwidth and height.By restricting block sizes, two things occur. First, when using only powers oftwo, the computation better fits SIMD-style execution. Second, adopting the sameblock size across many features leads to computational efficiency due to aliasing,which allows us to memoize. This is explained as follows: given a single window ata specific x0,y0 position, each feature uses a different offset, so there is no aliasingand all computations are necessary. However, across multiple windows at different25(a) Unrestricted (b) Restricted (c) Restricted SquareFigure 3.1: Distribution of block sizes of LBP features in the trained classi-fiersstarting positions, the computation for some feature i may be aliased to a featurej with the same block size. Thus, we can transform the innermost computation,which performs redundant work due to aliasing, into a lookup operation, where thework is done just once as a pre-computation. We must pre-compute an entire imageworth of results, once for each block size. Reducing the number of block sizesalso increases the amount of aliasing, which further improves the result. Whenrestricting the block widths and heights to powers of two, we get the distributionshown in Figure 3.1(b). If add the additional restriction that width must equalheight, we get the distribution shown in Figure 3.1(c).Recent work by Bilaniuk et al. [4] had made similar observations: restrictingblock sizes to powers of two is better for SIMD engines, and using just three blocksizes (1× 1, 2× 2, and 4× 4 ) makes it more efficient to use pre-computation. Intheir results, these restrictions did not significantly change true positive detectionrates, but it did increase false positives from about 1% to 5%.To measure the accuracy tradeoff with restricted block sizes, we created threeversions of the cascade. We used OpenCV’s traincascade program to train2640,41c40,41< for( int w = 1; w <= winSize.width / 3; w++ )< for( int h = 1; h <= winSize.height / 3; h++ )---> for( int w = 1; w <= winSize.width / 3; w=w*2)> for( int h = 1; h <= winSize.height / 3; h=h*2 )(a) Restricted40,41c40,42< for( int w = 1; w <= winSize.width / 3; w++ )< for( int h = 1; h <= winSize.height / 3; h++ )---> for( int w = 1; w <= winSize.width / 3; w=w*2 ){> int h = w;> if (h <= winSize.height / 3)(b) Restricted SquareFigure 3.2: Minimal code modifications to lbpfeatures.cpp generatecascades with restricted block sizesthe new cascade. By modifying lbpfeatures.cpp, we can exclude unwantedfeatures sizes from the training process. Three versions of traincascade wereused: an unmodified, unrestricted version; a set where width and height are individ-ually restricted to powers of two; and a final restricted set, where width and heightmust match and be restricted to a power of two. This requires minimal changes tothe source code, with differences shown in Figure 3.2.Each cascade produced is valid and can be verified and used outside of ourembedded implementation. Like Viola and Jones, we use the MIT-CMU test set(sets A, B, C) to quantify the trade off in accuracy and restricted features. Theresults for OpenCV’s default frontal face cascades and the newly trained classifiersare presented in Table 3.1. The unrestricted and restricted classifiers perform betterthan OpenCV’s default LBP frontal face cascade.Our results are similar to Bilaniuk et al. 
Accuracy remains high across the restricted sets, though false positives do increase by a small amount. We proceed to develop an implementation taking advantage of the restricted block sizes.

Table 3.1: Accuracy of frontal face cascades run on the MIT-CMU test set

  Cascade                            | True Positives | False Positives | Positive Predictive Value
  OpenCV lbpcascade_frontalface.xml  | 396            | 27              | 0.94
  unrestricted_lbp_frontalface.xml   | 385            | 9               | 0.98
  restricted_lbp_frontalface.xml     | 381            | 12              | 0.97
  restricted2_lbp_frontalface.xml    | 386            | 17              | 0.96
  Bilaniuk et al. [4]                | 364            | 23              | 0.94

3.2.2 ILP Formulation to Reduce Data Size

By supporting subword SIMD, the VectorBlox MXP provides increased performance for the smaller data sizes of 8 or 16 bits. Our vector code aims to use the 8-bit data size to maximize performance.

In most AdaBoost implementations, a stage passes or fails after a series of 32-bit floating-point values (representing pass or fail scores) are added together and compared to a 32-bit floating-point threshold. This uses more bits than necessary and may even be susceptible to round-off errors, since floating-point addition is not associative. In this work, we show that this computation can use precise 8-bit integers instead. Integer computation is not susceptible to round-off errors. However, the pass/fail values for each feature must be carefully chosen to avoid overflows, and the number of features per stage must be limited.

In training, the AdaBoost algorithm determines which features in a stage must pass for the stage to pass. More discriminative features are deemed more important, and are given larger weight in the form of increased values in their pass or fail scores. Ultimately, however, this decision logic is encoded into a summation and a comparison to a determined threshold. In our implementation, we use the exact same decision logic to write out a series of constraints for an ILP solver (Microsoft Research's Z3 theorem prover [10]). These constraints allow the ILP solver to assign 8-bit pass and fail values, pi and fi, for each feature i, thus preserving the original logic. We are also able to assign a threshold of 0, so a positive stage total passes and a negative stage total fails.

For example, in a stage with 5 features, there are 2^5 = 32 constraints representing all combinations of each feature either passing or failing. Each constraint is given one clause to pass the stage (result ≥ 0) or fail the stage (result < 0). However, a complementary second clause is also necessary to avoid overflows (result ≤ 127 or result ≥ -128). An additional 10 constraints force each of the pi and fi values to be within the 8-bit signed integer range. Z3 is used to solve for these 10 values. An example of these constraints is shown in Figure 3.3.

Subject To:
// fail stage if all features fail
f0+f1+f2+f3+f4 <= -1
f0+f1+f2+f3+f4 >= -128
// pass stage if only 4 passes
f0+f1+f2+f3+p4 >= 0
f0+f1+f2+f3+p4 <= 127
// fail stage if only 3 passes
f0+f1+f2+p3+f4 <= -1
f0+f1+f2+p3+f4 >= -128
// pass stage if both 3 and 4 pass
f0+f1+f2+p3+p4 >= 0
f0+f1+f2+p3+p4 <= 127
...
// pass stage if all features pass
p0+p1+p2+p3+p4 >= 0
p0+p1+p2+p3+p4 <= 127

Bounds:
-128 <= f0 <= 127
-128 <= f1 <= 127
...
-128 <= p4 <= 127

Figure 3.3: Sample ILP constraints for a stage with 5 features using Z3

One limitation of this approach is that Z3 starts having trouble with stages that have too many features (e.g., 15 or more) using signed 8-bit constraints. We avoid this in practice by limiting the number of features per stage. Our final trained cascade uses 98 features across 12 stages, with at most 9 features per stage. Limiting the features per stage changes how often we perform the early-exit check, but does not affect accuracy.

Note that the vectorized version thus far remains bottlenecked on the preceding LUT computation that generates the pass or fail result for each feature. Hence, there is little performance advantage in switching from 32-bit weights to 8-bit weights in software at this point. However, in the next section, we will add a custom vector instruction to directly support the LUT operation and to include this accumulating logic to ultimately determine whether the stage will pass. At that point, the inner loop produces and consumes 8-bit operands exclusively, greatly increasing performance. This is an example of a hardware-motivated software change.
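The guarantee provided by constraints like those in Figure 3.3 is easy to check mechanically: for every subset of passing features, the 8-bit sum must have the correct sign and stay within the signed 8-bit range. The following small C checker illustrates this; it is an illustrative sketch, and any concrete pass/fail values used with it would be hypothetical, not the values solved for our cascade.

/* Verify that solved pass/fail values reproduce the desired stage decisions
 * for every combination of feature outcomes, without 8-bit overflow.
 * wanted(mask) encodes the original (floating-point) decision logic, where
 * bit i of mask is 1 if feature i passes. Returns 1 if all cases check out. */
int check_stage(const int *p, const int *f, int n, int (*wanted)(unsigned mask))
{
    for (unsigned mask = 0; mask < (1u << n); mask++) {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += ((mask >> i) & 1) ? p[i] : f[i];
        if (sum < -128 || sum > 127) return 0;      /* would overflow   */
        if ((sum >= 0) != wanted(mask)) return 0;   /* wrong decision   */
    }
    return 1;                                       /* all 2^n cases ok */
}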
However, in the next section, we will add a customvector instruction to directly support the LUT operation and to include this accu-mulating logic to ultimately determine whether the stage will pass. At that point,the inner loop produces and consumes 8-bit operands exclusively, greatly increas-ing performance. This is an example of a hardware-motivated software change.29Subject To:// fail stage if all features failf0+f1+f2+f3+f4 <= -1f0+f1+f2+f3+f4 >=-128// pass stage if only 4 passesf0+f1+f2+f3+p4 >= 0f0+f1+f2+f3+p4 <= 127// fail stage if only 3 passesf0+f1+f2+p3+f4 <= -1f0+f1+f2+p3+f4 >=-128// pass stage if both 3 and 4 passf0+f1+f2+p3+p4 >= 0f0+f1+f2+p3+p4 <= 127...// pass stage if all features passp0+p1+p2+p3+p4 >= 0p0+p1+p2+p3+p4 <= 127Bounds:-128 <= f0 <= 127-128 <= f1 <= 127...-128 <= p4 <= 127Figure 3.3: Sample ILP constraints for a stage with 5 features using Z3303.3 VectorizationThe initial scalar implementation was compiled with gcc -O3, running bare metalon the ARM Cortex-A9 processor at 667 MHz. An object dump shows that thecompiled code contains NEON instructions, but we did not try to optimize thescalar code to aid NEON auto-vectorization.Next, manual vectorization for VectorBlox MXP across rows is quickly imple-mented and verified against the scalar implementation. This initial vectorizationgives us an understanding of which functions are amenable to SIMD paralleliza-tion, and how performance may scale as more ALUs are added. We did not vec-torize all of the code; we used profiling to identify and vectorize only the mostcompute-intensive loops.Using a 320× 240 test image, we perform a dense scan (scaling factor = 1.1,stride = 1) to compare with previous work [6]. The scalar core takes 1.9 secondsto complete. Our initial vectorized version of the code, running on 16 vector lanes,processes the test image in 0.76 seconds, a improvement of nearly 2.5×. The vectorprocessing is done at 183 MHz, which is less than 1/3 of the clock rate of the scalarengine.Only the innermost loops, which test many possible x,y starting positions of agiven feature, are easily vectorized by SIMD instructions. These loops are nestedwithin a loop testing all features within a stage, which is in turn, nested in a looptesting each stage in the cascade. These outermost loops cannot be easily SIMD-vectorized because each feature and each stage have unique properties. Further-more, the early exit condition means that as soon as a stage fails, no other stagesneed to be tested at that position. This performs well in scalar or Multiple Instruc-31tion, Multiple Data (MIMD) implementations, but leads to low utilization in SIMDengines as the early exit positions are masked off within the vector, but continue touse execution slots as no-ops until all positions fail or all stages are tested.After the initial vectorization, the inner loop computing each feature of a givenstage continues to consume the most runtime (93%). Other components, includinggrey-scaling and image pyramid creation (via bilinear interpolation), consume thenext-most runtime (7%). This downscaling later becomes a key bottleneck andis later vectorized. The merging of detected features to produce the final resultsconsumes minimal runtime (<1%) and is not vectorized.The innermost loop consists of two parts: computing a LBP pattern for thecurrent feature, and performing a table-lookup using this pattern as the index to de-termine the contribution towards passing the stage. 
By using the restricted featuresdiscussed above, it is fairly simple to hoist the computation of the LBP pattern outof the inner loop into a pre-compute step. This means that the inner loop needs totouch less data and fewer computations are required. Also, overlapping featuresdue to aliasing do not need to be recomputed. The switch to restricted square fea-tures with pre-computed features produces an additional 1.8× speedup over theinitial vectorized solution when using 16 lanes.3.3.1 Applying Wavefront SkippingLong vector lengths, often good for performance, can act against the benefit ofexiting early. This is seen in Figure 3.4, where image (b) shows the amount ofcomputation done at each location in the original image (a); bright pixels indicatehighly probable locations for a face, where most features will pass. In contrast,black pixels indicate an exit-early condition. The extra work done by row-based32vectorization is shown in image (c). Here, we see computation for the whole rowmust continue until the last pixel is done. Finally, image (d) shows how some workcan be skipped when using VectorBlox MXP masks: entire wavefronts within avector that are masked off can be skipped entirely and do not consume executionslots.After setting up a mask for the vector, we iterate through stages and update themask after every stage. When all locations within a wavefront exit early, the vectorengine skips that wavefront. The speedup when using masked instructions is 6×versus the non-masked version.3.4 Custom Vector InstructionsIn this section, we accelerate the computation further with two key custom vectorinstructions (CVIs). After the algorithmic refinements above, profiling reveals thatthe table lookup operation dominates runtime. This operation is an ideal candidatefor implementation as a custom vector instruction. Once accelerated, the LBP pre-computations become the bottleneck. The LBP pre-computation also requires aCVI. These two custom vector instructions, one for table lookup and one for LBPpre-computation, are described below.3.4.1 LBP Table Lookup InstructionThe LBP table lookup operations are accelerated with the first custom instruction.Algorithmically, this instruction implements the two steps shown in Figure 2.3(c)and (d), representing the lookup followed by an addition. The input is an 8-bit valuecorresponding to the LBP pattern, shown as the value 56 in the figure, which is theresult of the 8-way LBP comparison. (Further details about the LBP comparison33(a) Input image (b) Scalar(c) Simple rows (d) Masked rowsFigure 3.4: The number of features calculated at every location is shown. Thebottom demonstrate parallelizing across a row, with the latter takingadvantage of masked instructionsare given in the next section.) The output is an 8-bit value, produced by the secondstep where two parallel table lookup results are added together.The first step is a table lookup into a 256-entry table with a 1-bit output. Theoutput bit selects one of two 8-bit values for the feature (aptly named PASS orFAIL). Thus, each feature requires 272 bits of storage.To maximize performance, we perform two of these table lookups in parallel,using the two 8-bit input operands of the custom instruction for two subsequentfeatures. Since the custom instruction can only provide a single output, the PASS34or FAIL results of each of the two features are added internally into an 8-bit partialstage total. 
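Per pair of features, the first custom instruction therefore behaves like the following scalar helper, applied to every 8-bit lane in parallel. Names are ours; the 272 bits of per-feature storage correspond to the lut, pass and fail fields.

#include <stdint.h>

/* Per-feature data held in the instruction's internal memory:
 * a 256-entry, 1-bit lookup table plus the 8-bit PASS and FAIL scores. */
typedef struct {
    uint8_t lut[32];      /* 256 x 1 bit = 256 bits                      */
    int8_t  pass, fail;   /* 2 x 8 bits  =  16 bits, 272 bits in total   */
} lut_feature_t;

/* Functional model of one 8-bit lane of the table-lookup CVI:
 * two LBP patterns in, one partial stage total out. The ILP formulation
 * (Section 3.2.2) guarantees the 8-bit addition cannot overflow. */
int8_t lut_cvi_lane(uint8_t lbp_a, uint8_t lbp_b,
                    const lut_feature_t *fa, const lut_feature_t *fb)
{
    int8_t a = ((fa->lut[lbp_a >> 3] >> (lbp_a & 7)) & 1) ? fa->pass : fa->fail;
    int8_t b = ((fb->lut[lbp_b >> 3] >> (lbp_b & 7)) & 1) ? fb->pass : fb->fail;
    return (int8_t)(a + b);
}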
This partial stage total is the output of the second step of our custom instruction.

To determine whether a stage passes or fails, the results of all features in the stage need to be accumulated. This is done using a regular 8-bit vector add instruction, outside of this custom instruction. It accumulates all the partial stage totals into a final stage total. After processing all features, this final stage total is compared to a threshold of zero to determine whether the stage passes.

As mentioned, the custom instruction invokes two table lookups in parallel in each lane. This is done by reading two different 8-bit LBP patterns and presenting them to the CVI as operands A and B, respectively. This requires a 544-bit wide memory. However, since all 8-bit vector lanes are processing the same feature, this memory is shared across the entire vector engine. Back-to-back custom instructions that perform table lookups automatically increment a feature counter, which is used to address into the 544-bit wide memory. This memory must be large enough for all features across all stages; the starting address is determined by the stage number. This face detection cascade contains a total of 12 stages and 98 features. With 2 features per row, and 8 stages having an odd number of features, a total of 53 rows of memory are required (less than 32 kB in total). The memory itself is initialized using another simple custom instruction.

Four of these dual-8-bit lookups can fit into a 32-bit lane of a custom vector instruction. For example, in a 16-lane configuration, 128 table lookups are done every cycle.

Without the CVI, this table lookup requires approximately 15 regular vector instructions, several of which operate on 32-bit operands. The majority of the runtime was originally spent computing and checking the features of each stage. Even after restricting LBP block sizes and transforming this work into a pre-computation, the lookup table operation and stage pass/fail computation still occupies 57% of the runtime. Two of these table lookups (30 regular vector instructions) are reduced into a single CVI. Since the CVI operates only on 8-bit data, more parallelism is available than with regular 32-bit vector instructions.

#define VLUT VCUSTOM0
for (f = 0; f < cascade[stage].n; f += 2) {
    // sz = MB-LBP size, w = image_width
    feat_a = cascade[stage].feats[f];
    feat_b = cascade[stage].feats[f+1];
    v_lbp_a = v_lbp[feat_a.sz] + feat_a.dy*w + feat_a.dx;
    v_lbp_b = v_lbp[feat_b.sz] + feat_b.dy*w + feat_b.dx;
    vbx_masked(VVB, VLUT, v_lut, v_lbp_a, v_lbp_b);
    vbx_masked(VVB, VADD, v_sum, v_sum, v_lut);
}

Figure 3.5: The inner loop using a custom vector instruction

Due to the ILP formulation, the summation of 8-bit values is guaranteed to be sufficient without risk of overflow or rounding. This allows each 8-bit input pattern to produce an 8-bit output.

The final code of the innermost loop is presented in Figure 3.5. The scalar housekeeping is done in parallel with two vector operations: vector table lookup and vector add.

Since the LUT contents are not hardcoded, but loaded at runtime, this system can be used to detect different types of objects beyond faces. It can also be used in detection chains, e.g., first detect a face, then detect eyes within a face.

Adding the custom table lookup instruction to the masked vectorized algorithm results in an additional 4.2× speedup.

3.4.2 LBP Pattern Instruction

After the table lookup is sped up with the first custom vector instruction, computation of the LBP patterns becomes the bottleneck.
This is the process of computingthe 8-way neighbour comparison to the center block in each 3×3 windowed regionof the source image. This must be repeated for 1×1, 2×2 and 4×4 block sizes inour restricted design, producing three separate 8-bit arrays as output. Due to featurereuse at different offsets, this is done as a pre-computation rather than on-the-fly.Restricting the variety of LBP features (in terms of block sizes used) increases thefrequency of reuse, and improves the advantage of the pre-computation.Since regular ALU operations have two inputs and one output, computing LBPpatterns with 9-inputs and 1-output for a 3×3 window is not straight-forward. Thefan-in is larger still for the larger block sizes.To solve this, we use a stateful pipeline that processes columns of data, onevertical stripe at a time. The output stripe width is a set of 8-bit values that matchthe wavefront width of the overlay’s SIMD execution unit, as shown in Figure 3.6.Producing output values at the far left or right edges of this stripe requires read-ing source image pixels past the left edge or right edge. To accomplish this, thecustom instruction is supplied with two overlapping rows as input operands: oneoperand includes the input pixels past the left edge of the output stripe, while theother operand is offset enough to include the input pixels past the right edge of theoutput stripe. The custom vector instruction iterates one row at a time down theimage. The custom instruction has three modes, one for each of the block sizes,and therefore requires three passes, before moving over to the next column. Whenstarting the next column, a 2D DMA is performed to convert the vertical stripe in37Figure 3.6: Accelerating the pre-computation of LBP patternsthe source image into a packed image array in the internal scratchpad.Each pass of the custom vector operation can be separated into two steps. Thefirst step reduces (adds) the rows and columns of byte-sized image data accordingto the LBP block size. This is skipped when block size is one, but requires reducing2× 2 and 4× 4 regions for other block sizes. The second step takes the reducedvalues, and compares the center to its 8 neighbours, producing the 8-bit LBP pat-38terns. Adding the CVI to pre-compute the LBP patterns results in an additional2.4× speedup.3.5 ResultsBelow, we first detail the performance and area of the LBP overlay at each stage ofcustomization. In the following sections, we will compare our results to previouswork.3.5.1 Experimental SetupResults were obtained using a Xilinx ZC706 FPGA development board with a Zynq7Z045-2 device. FPGA builds were done using Vivado 2014.2. All designs are syn-thesized for a 200 MHz target clock speed, with the reported worst-case negativeslack used to compute Fmax. The Zynq 7000 series of FPGAs are all manufacturedon a 28 nm silicon technology node.Processing time is reported for one 320×240 image, producing an image pyra-mid with scale factor of 1.1, and using a stride of 1 to produce a dense scan. Thesedefault settings match those of Brousseau [6]. Generally, however, 1080p60 videocan be consumed and produced, and processing can be sped up with larger scalefactors and strides.A full face-detection system was implemented with a 1080p HDMI video cam-era, FPGA board, and display. A photo of the display output is shown in Figure 3.7,where a test image is presented to the camera using an iPhone. 
In this case, 49 of50 faces are detected in 36 ms, despite the small face sizes and the angle of presen-tation.39Figure 3.7: Photo of video output showing 49 of 50 faces detected on a 1080pimage in 36 ms3.5.2 PerformanceThe contributions of each optimization to performance are shown in Figure 3.8.Wavefront skipping and the LUT custom vector instruction boost performance themost. With all optimizations in place, diminishing returns with respect to vectorlength become apparent, showing a limit to our SIMD approach. For additionalperformance gains, a multi-core implementation would be required.3.5.3 AreaThe FPGA resources required for the base system (containing only HDMI I/O) andvarious vector engine sizes are shown in Table 3.2. As a rough comparison, thehardware face detection system developed by Brousseau [6] runs at 125 MHz on aStratix IV 530 (the largest available) and uses similar area to our largest configura-tion, but our work is 2.6× faster, offers a complete end-to-end (camera to display)40Software Only With Custom InstructionsVector Cumulative optimizationsLanes Scalar +Vector +Restrict +Masked +CVI:LUT +CVI:Pattern4 1,885.4 (1.0×) 1,731.8 1,646.6 125.2 (15.1×) 28.0 10.3 (183.9×)8 1,885.4 (1.0×) 1,150.4 823.6 88.5 (21.3×) 21.0 8.5 (222.9×)16 1,885.4 (1.0×) 831.4 452.5 75.6 (24.9×) 18.0 7.6 (248.4×)Figure 3.8: Performance in milliseconds (speedup) on 320×240 image pyra-mid, 1.1 scale factor, unit strideMem. Mem. # Fmax# FF # LUTs (LUTs) (BRAM) DSP (MHz)ZC706 Max. 437,200 218,600 70,400 545 900 -video only 8,873 7,028 514 12 8 236# Lanes CVIs4 no 16,843 16,437 869 54.5 36 197yes 25,679 25,237 1,194 88.5 36 1998 no 22,235 23,223 1,123 47 64 198yes 37,856 39,377 1,651 81 64 17516 no 36,165 36,672 1,672 48 120 183yes 65,430 68,444 2,629 82 120 166Table 3.2: Resource usage and Fmaxsystem, and keeps the majority of the face detection system in software.3.6 Previous WorkEmbedded face detection systems have been explored on various platforms. Wecompare our work with several FPGA implementations use custom hardware en-41gines and a SIMD-based Application-specific Integrated Circuit (ASIC) implemen-tation. We match input image sizes and scale factors.Brousseau [6] implemented a 32-core hardware engine on Stratix IV 530 FPGA,using Haar features. The design runs at 125 MHz and uses similar number of LUTsand digital signal processors (DSPs) as our largest configuration, using approxi-mately 160K LEs and 550 18-bit multipliers. They were able to process 320×240images in 20 ms. Each processing element (PE) processes a different location, andeach PE stops processing once the cascade exits, saving power. Images are sentover USB and post-processing of detected features is performed on an external PC.This design computes integral images to assist with feature computation.Cho [7] implemented an 8-core hardware engine, processing Haar features, ona Xilinx Virtex-5 FPGA. The design uses nearest-neighbour interpolation with ascaling factor of 1.2 and can synthesize to run 320× 240 and 640× 480 images.The design could process 320× 240 images in 16.4 ms, and 640× 480 images in37.4 ms. Like the previous design, integral images are used to minimize featurecomputation.Gao [16] implemented a 16-core hardware engine, again processing Haar fea-tures with a scaling factor of 1.2. Images are restricted to 256× 192 pixels andcan be processed at 10.2 ms. The design runs at 125 MHz, implemented on a Xil-inx Virtex-5 FPGA. 
Only the Haar classifier step was implemented on the FPGA,leaving pre-processing and post-processing on the host PC.The previous approaches provide good performance, but are fixed in their im-plementation details. Our custom overlay approach can accelerate both pre- andpost-processing and allows scaling factors and interpolation schemes to be changedquickly without changing the hardware.42Prior Feature Image Scale Prior This SpeedupWork Type Platform Res. Factor (fps) (fps)Brousseau [6] Haar FPGA (40nm) 320x240 1.1 50 132 2.6Cho [7] Haar FPGA (65nm) 320x240 1.2 61 214 3.5640x480 1.2 16 56 3.5Gao [16] Haar FPGA (65nm) 256x192 1.2 98 236 2.4Bilaniuk [4] LBP ASIC (unknown) 640x480 1.1 5 34 6.8Table 3.3: Previous work comparisonUsing a programmable approach, Bilaniuk [4] implemented their embeddedface detection system on a SIMD-based ASIC with 96 16-bit lanes. Like ouroverlay, LBP features were used for computational efficiency. The authors alsorestricted the LBP cell sizes to powers of two (1×1, 2×2 and 4×4). The cascadethey produced increased false positives, similar to what we observed. Their designwas able to process 640×480 images in 200 ms. This approach is close to our softvector overlay, but as custom hardware is not available, the algorithm cannot beaccelerated beyond the use of its dedicated SIMD instructions.Table 3.3 compares our performance, in frames per second, to the prior workdiscussed above. Our performance is 2.4× to 3.5× faster than these pure hardwareimplementations. The last row of this table is a fixed-width SIMD CPU imple-mented as an ASIC where our work is 6.8× faster. In all of these cases, we wereable to set the image resolution, scaling factor and stride to match the prior works.3.7 Design ComplexityThe majority of the code is accelerated by software-based vector instructions. Thebottlenecks in the accelerated algorithm, used for LBP feature calculation and ta-ble lookup, were accelerated as CVIs. These two CVIs require an additional 795lines of RTL. Together, they are responsible for the calculation of LBP features,43accounting for 87 lines of software. The remainder of the software requires ap-proximately 2500 lines of C code.The first CVI, used to pre-calculate the LBP patterns, requires 420 lines ofRTL. This hardware replaces approximately 60 lines of the original scalar code.The second CVI requires 375 lines of RTL. This hardware replaces approximately30 lines of the original scalar code.3.8 SummaryThe LBP custom vector overlay achieves a speedup of 248.4× over the ARMCortex-A9 software implementation. It represents a software-driven, flexible so-lution that is 2.4× to 3.5× faster than previous custom hardware solutions and6.8× faster than an ASIC-based SIMD processor. The overlay can be adjusted torun at various resolutions and can add or modify additional processing without ahardware rebuild. Customizing the overlay only requires about 800 lines of customRTL, keeping the majority of the development effort on software.44Chapter 4Convolutional Neural NetworkCustom Vector OverlayRapid advances in deep learning are vastly improving the accuracy of object recog-nition and localization. In this chapter, a custom vector overlay is developed to ac-celerate inference computations in convolutional neural networks with 8-bit quan-tized activations. Several popular neural network instances are used to demonstratethe capability of the overlay. A dual-core overlay is implemented, which uses botha hard ARM Cortex-A9 processor and a soft MicroBlaze processor. 
Each processoris augmented with vector instructions and convolution accelerators. At 116 MHzon a 28 nm Xilinx Zynq 7Z045-2, the dual-core overlay has a peak performance of352.6 GOPS. On one example network, a throughput of 175.0 GOPS is realised.In comparison, the same network running under the Darknet software frameworkachieves roughly 7.0 GOPS throughput on a single core of an Intel i7-2600 CPUimplemented in 32 nm technology.45To produce this custom overlay, only a single custom vector instruction to ac-celerate convolutions is required. The instruction is described in 1300 lines of cus-tom RTL. We anticipate that larger systems can be built on more modern FPGAsand achieve a level of performance that rivals GPUs. A key benefit of this software-driven overlay approach is that new networks of any size can be compiled and runjust like a CPU or GPU, without regenerating the FPGA bitstream. This is impor-tant in embedded systems where frequent updates or changes to the neural networksare required.4.1 ApproachThe explosion in popularity of deep learning and convolutional neural networks hasresulted in many hardware-based approaches for fast inference of CNNs. Most ofthese approaches result in a fixed solution which has limited flexibility and requiresregeneration if the network changes significantly. These systems are quite complexin that they must analyze the network to determine an appropriate set of buffer sizesand the size of hardware components to compute each layer in a balanced way. Incontrast, this chapter follows the theme of the thesis in applying a software-basedapproach to accelerate CNNs. This approach retains enough flexibility that manynetwork changes can be incorporated without regenerating the hardware, yet stillprovides excellent performance that rivals the hardware approaches.In particular, on VGG, we find that a custom vector overlay with a single cus-tom vector instruction is enough to achieve a speedup of over 1450 times comparedto the host ARM Cortex-A9 processor. The vector overlay sufficiently acceleratesall other processing not done by the CVI, including fully-connected layers andactivation functions. In this chapter we optimize several popular networks, reduc-46ing the precision from 32-bit floating point to smaller, fixed-point representations.These are implemented both with and without the CVI that accelerates the 3× 3convolutions required by the networks. This instruction requires 1300 lines of cus-tom RTL. We also show that a dual-core implementation achieves nearly perfect2× speedup over a single core.4.2 AdaptationIn this section, we examine and adapt network properties needed to support com-puter vision within the overlay framework. First, we observed the trend in CNNsis towards smaller filter sizes of 3× 3 and 1× 1 with unit strides, so we selectedtwo popular networks for implementation, YOLO and VGG, that follow this trend.Next, to maximize the parallelism available in FPGAs, we verified that fixed-pointquantization does not lead to significant loss of precision.4.2.1 Tiny YOLOv2Two variants of Tiny YOLOv2 are used, trained with both the 20-category VOCand the 80-category COCO challenge datasets. Tiny YOLOv2 VOC is used tomeasure the effect of the network quantization and reduction described below. Ingeneral, these networks consist of alternating 3× 3 convolution layers and maxpooling layers. A 1×1 pointwise convolution layer is used for the last layer. 
Thenetworks’ hyperparameters are both the same except for this last layer, as COCOhas four times as many categories as VOC. The Tiny YOLOv2 VOC network struc-ture is described in Figure 4.1.The output stage, interpreting the network’s output values, is performed by theARM Cortex-A9 host processor. This stage currently uses expensive, floating-point47Type Filters Size/Stride OutputConvolutional 16 3×3 416×416Maxpool 2×2/2 208×208Convolutional 32 3×3 208×208Maxpool 2×2/2 104×104Convolutional 64 3×3 104×104Maxpool 2×2/2 52×52Convolutional 128 3×3 52×52Maxpool 2×2/2 26×26Convolutional 256 3×3 26×26Maxpool 2×2/2 13×13Convolutional 512 3×3 13×13Maxpool 2×2/1 13×13Convolutional 1024 3×3 13×13Convolutional 1024 3×3 13×13Pointwise Convolutional 125 1×1 13×13Table 4.1: Tiny YOLOv2 VOC hyperparametersoperations and adds 6-18 ms to every frame processed, depending on the network.No attempt was made to speed this up or translate it to fixed-point. Without thespeed of the ARM core, this could be vectorized as well. No other stage of thedetection process depends upon the ARM core, so it potentially can be replaced bysoft cores with minimal performance impact.Input maps to the convolution layers are zero padded, keeping the output mapsthe same size. Pooling layers in this network take a 2× 2 region and reduce it toa singular, maximum value. Padding and pooling take minimal time compared tothe convolution computations.4.2.2 VGG16 SVD500VGG16 is a very deep 1000-category ImageNet classifier, consisting of 16 layers.Like the YOLO networks, VGG16 consists mainly of alternating 3×3 convolution48layers and max pooling layers. The padding and pooling seen in YOLO matchthis network. Unlike YOLO, however, VGG16’s last layers consist of large fully-connected layers. These layers are difficult to accelerate, as they quickly becomememory bound, containing millions of weights. One approach to reducing the sizeof fully-connected layers is Singular-Value Decomposition (SVD) which splits afully-connected layer into two smaller fully-connected layers. We adapt the sameSVD decomposition as Qui et al [26], allowing implementations to be compared.The final network structure of VGG16 is presented in Figure 4.2.Post-processing is significantly simpler than YOLO, as each of the 1000 out-puts represents a score for a particular category and only requires sorting.4.2.3 QuantizationMost neural networks are trained with 32-bit floating-point to maintain high ac-curacy during back propagation. However, low-precision fixed-point values canbe used during training as well as inference [18] [19] [27]. In the next chap-ter, we explore networks where low precision is taken to the extreme, using binaryvalues for weights. In this chapter, however, we reduce networks originally trainedwith floating-point values to use low-precision fixed-point integers, a process calledquantization. Our goal is not to find the best possible quantization method. Instead,it is simply to show that it can be done with low accuracy loss, allowing us to uselow precision in our custom vector overlay.Low-precision, fixed-point quantization of floating-point networks can be seenin Tensorflow, a leading neural network framework [1]. Earlier work by Farabetand LeCun exploring CNN inference on FPGAs used 16-bit weights, 8-bit acti-vations, and larger 48-bit values for accumulation [14]. 
Following this reduced-49Type Filters Size/Stride OutputConvolutional 64 3×3 224×224Convolutional 64 3×3 224×224Maxpool 2×2/2 112×112Convolutional 128 3×3 112×112Convolutional 128 3×3 112×112Maxpool 2×2/2 56×56Convolutional 256 3×3 56×56Convolutional 256 3×3 56×56Convolutional 256 3×3 56×56Maxpool 2×2/2 28×28Convolutional 512 3×3 28×28Convolutional 512 3×3 28×28Convolutional 512 3×3 28×28Maxpool 2×2/2 14×14Convolutional 512 3×3 14×14Convolutional 512 3×3 14×14Convolutional 512 3×3 14×14Maxpool 2×2/2 7×7Fully-Connected 25088 500Fully-Connected 500 4096Fully-Connected 4096 1024Fully-Connected 1024 1000Table 4.2: VGG16 SVD500 CNN hyperparametersprecision quantization trend, we quantize the networks to expose more parallelismin the soft vector overlay by exploiting subword SIMD instructions and buildinglow-precision structures in the custom vector instructions.The quantization scheme is kept simple. First, we run several hundred sampleimages through the original floating-point network, recording the ranges of valuesseen in each layers’ output. Weights and biases are then scaled on a layer-by-layer basis, normalizing a layer’s outputs to 1.0. This allows for a simple fixed-point representation. More complex quantization schemes could be explored in50the future. Code book compression schemes like those discussed in [26] may benecessary to minimize memory bandwidth with larger fully-connected layers, butwere not needed for this work.Both an initial 32-bit floating-point and 16-bit fixed-point versions were imple-mented on the ARM host processor. These serve as baseline versions to compare tolower-precision versions or their vectorized equivalents. Vectorization of the 16-bitversion is described later in the chapter.Translating to 8-bit or even 16-bit fixed-point values can be problematic. Thiswas explored with the bit-accurate VectorBlox MXP simulator, where significantproblems with underflows and overflows were noticed. In particular, a quick glanceat the network structure in Table 4.1 shows values from up to 1024 input maps mustbe summed to produce each output map. If the summation occurs with just simple8-bit arithmetic, it is clear that overflows will be a problem. We also detectedsignificant problems with underflows; values that were normally very close to zero,but negative, tended to get truncated down to -1 rather than 0; these accumulatevery quickly across 1024 inputs.For the 16-bit fixed-point version, VectorBlox MXP’s fixed-point specific ad-dition and multiply instructions address these issues. The fixed-point instructionssaturate outputs to prevent overflow, and the fixed-point multiply rounds outputs,which mitigates underflow. When we explored further reducing data sizes, i.e.,setting both weights and activations to 8-bits, significant loss in accuracy was ob-served. As the vector processor requires both input operands to be of the same size,neither weights nor activations can be switched to 8-bits independently.To get around this restriction posed by the processor, i.e. address the loss inaccuracy of 8-bit operands, and accelerate inference on the overlay, a CVI is added51to perform convolutions. The CVI, discussed later in the chapter, loads in higher-precision weights ahead of time so low-precision 8-bit activations can benefit fromthe increased parallelism. The CVI internally adds results from multiple inputmaps at once and carries forward up to 29 bits of precision. The partial summationsproduced by the CVI are rounded down to 16 bits. 
These partial output maps aresummed to produce the final output maps, which can be eventually truncated to 8bits. In experiments running on the simulator, this strategy achieves good accuracy.4.2.4 Quantizational Impact on AccuracyTo measure the impact of quantization for each implementation, we run Tiny YOLOv2VOC and calculate the mAP, the standard method for measuring accuracy. Calcu-lating the mAP requires several steps. First, inference is run on the VOC2007challenge test set. The set contains 4952 images. The floating-point version ofTiny YOLOv2 VOC runs on Darknet, while the reduced-precision versions run onthe overlay. Second, for each of the 20 output categories, precision-recall curvesare generated and the average precision for each category is measured, taking thearea under the curve. Recall is the percentage of actual objects found and precisionis the percentage of true positives in the found results. The recall is set to specificvalues by raising or lowering the cutoff threshold. The precision is plotted for theserecall values. A typical curve for the “person” class is shown in Figure 4.1. Finally,the mean of the average precision values found from each category is calculated,to produce the final accuracy metric. Taking the average precision across all cate-gories, we calculate the mAP for each quantization strategy or implementation.The resulting mAP of the different implementations is presented in Table 4.3.We start with implementations using standard vector instructions. The 16-bit fixed-52Figure 4.1: Average precision curve for “person” classpoint version sees a minimal drop in mAP score, while reducing both activationsand weights further to 8-bit sees a large decrease.The last three rows of the table shows implementations that use the CNN cus-tom instruction. Input and weight data sizes can vary independently, as weightsare pre-loaded separately. Weights, which are reused across input maps, can be setat higher precision, while inputs can be set to 8 bits to maximize throughput. Thecustom instruction is parameterized by input and weight size. Depending on thenumber of bits used for inputs and weights, the number of DSP blocks requiredvaries. We find the accuracy loss becomes large as both operands approach 8 bits.The quantization scheme we choose for the custom vector overlay uses 8-bit inputs53Operand type Activation bits Weight bits Platform mAPfloating-point 32 32 Darknet 56.5floating-point 32 32 ARM A9 56.5fixed-point 16 16 Overlay 56.1fixed-point 8 8 Overlay 40.5fixed-point 8 16 Custom Overlay 54.9fixed-point 8 12 Custom Overlay 54.3fixed-point 8 8 Custom Overlay 42.8Table 4.3: Mean average precision of various versions of Tiny YOLOv2 VOCand 12-bit weights, as accuracy remains high while the CVI still maps very effi-ciently to the DSPs in the FPGA. Note the fully-connected layers in VGG16 usestandard vector instructions, hence inputs and weights remain at 16 bits.4.3 VectorizationWith the quantization effects understood, we start by vectorizing the 16-bit fixed-point scalar implementation. Inference consists of three steps; pre-processing,computing each layer in the network, and post-processing.Pre- and post-processing are needed to produce a complete system. Pre-processingconsists of separating interleaved RGB pixels into input maps, followed by down-scaling. This is similar to the pre-processing required in the previous chapter, andthe vectorized routines are reused here. 
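As a point of reference, the interleaved-to-planar step of this pre-processing amounts to the following scalar loop; buffer and function names are ours, and the actual system performs the equivalent operation with the vectorized routines mentioned above.

#include <stddef.h>
#include <stdint.h>

/* Split packed 24-bit RGB pixels into three planar 8-bit input maps. */
void deinterleave_rgb(const uint8_t *rgb, size_t n_pixels,
                      uint8_t *r, uint8_t *g, uint8_t *b)
{
    for (size_t i = 0; i < n_pixels; i++) {
        r[i] = rgb[3 * i + 0];
        g[i] = rgb[3 * i + 1];
        b[i] = rgb[3 * i + 2];
    }
}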
Post-processing for YOLO is kept on theARM host, as it accounts for a small amount of the runtime. The operations couldbe vectorized, but need not be. For VGG, post-processing requires a simple sortingof the output vector, and is again kept on the host. Our approach allows softwarereuse, and requires that we only vectorize what is necessary for performance.The majority of vectorization effort is spent creating efficient vectorized ver-sions of each layer, in particular convolution and fully-connected layers. Although54the majority of the operations in these networks are in the convolution kernel, addi-tional operations require vectorization to make a fully functioning system withoutbottlenecks. This includes padding, pooling and activation functions.As inputs and outputs must be operated on inside the scratchpad, maps are con-tinuously being transferred by DMA between off-chip DRAM and the scratchpad.To minimize memory transfers, when processing a map, all additional operations,including padding, pooling and activation functions, are chained with the core con-volution or matrix multiply operations. This allows the outputs to be fully pro-cessed before being sent back to main memory. Larger maps, too big to fit fully inthe scratchpad, are processed in a tiled fashion.Standard vector instructions efficiently accelerate these additional operations.The ReLU and leaky ReLU activation functions, which retain zero and 10% ofnegative values, respectively, are used in the example networks. Maxpool layers,commonly performing a 4:1 data reduction, keeping the maximum value acrossevery 2× 2 pixel group, are also used heavily. These operations make heavy useof the comparison (subtract) and conditional-move instructions provided by theVectorBlox MXP processor.The vectorized implementation, which makes use of saturation and roundingincorporated into the fixed-point multiply and addition instructions, serves as thesoft vector overlay result. A speedup of 11.6× over the baseline is observed.4.4 Custom Vector InstructionsIn this section, we accelerate the computation further with a single custom vectorinstruction (CVI). The vast majority of operations are in the convolution kernel; allinput maps are convolved and summed repeatedly to produce each output, while all55other operations are applied only once at the end of each layer. The custom vectorinstruction to accelerate 3×3 convolutions is described below.4.4.1 3x3 Convolution InstructionAs kernels are kept constant across input maps, bandwidth is wasted if kernels areused as one of the input vectors. By pre-loading weights, multiple input maps caninstead be processed at once, increasing throughput and decreasing the amount ofpartial summations required outside the CVI. This is amplified by interleaving theinput maps so every 32-bit element of the input vector contains four 8-bit elementsfrom four successive maps. Instead of accelerating a single convolution kernel, theinstruction accelerates eight kernels simultaneously, which we refer to as a super-kernel.The 3× 3 CVI is shown in Figure 4.2 for a V4 configuration with 2 ‘CNNsuper-kernel’ operators. The maximum number of super-kernels is always kernel width−1 less than the number of 32-bit lanes. 
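Functionally, one output value of one super-kernel reduces to the following scalar computation. This is a reference sketch with our own names and an illustrative shift amount; the CVI evaluates it for many adjacent output positions per cycle.

#include <stdint.h>

#define N_MAPS 8    /* input maps convolved and reduced by one super-kernel */

/* Scalar model of a single super-kernel output: eight 3x3 convolutions over
 * eight 8-bit input maps, summed with a wide accumulator, then shifted,
 * rounded and saturated down to 16 bits (an arithmetic right shift is
 * assumed for negative values). */
int16_t superkernel_output(const uint8_t *maps[N_MAPS], int width,
                           const int16_t weights[N_MAPS][9],
                           int x, int y, int shift)
{
    int32_t acc = 0;
    for (int m = 0; m < N_MAPS; m++)
        for (int ky = 0; ky < 3; ky++)
            for (int kx = 0; kx < 3; kx++)
                acc += (int32_t)maps[m][(y + ky) * width + (x + kx)]
                     * weights[m][ky * 3 + kx];

    if (shift > 0)
        acc = (acc + (1 << (shift - 1))) >> shift;   /* round to nearest    */
    if (acc >  32767) acc =  32767;                  /* saturate to 16 bits */
    if (acc < -32768) acc = -32768;
    return (int16_t)acc;
}

The 12-bit weights of our chosen quantization scheme fit comfortably in the int16_t weight type, and the wide accumulator mirrors the extended precision carried inside the instruction before results are reduced to 16 bits.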
Interleaved input matrices are passed intothe instruction, and each super-kernel convolves these 8 input matrices (A, B, C,D, E, F, G and H) in parallel and sum-reduces them into a single output, enablinga 8:1 sum-reduction to be done with extended precision inside the CVI.Each convolution contains 17 operators, so each super-kernel contains 143 op-erators (17×8+4+2+1). The implementation uses a V16 CVI with 13 parallelCNN super-kernels, totalling 1859 parallel operations per cycle. While upto 14parallel CNN super-kernels could be used here, we only instantiated it 13 times tosave a bit of area. In one cycle, the CVI receives 15 horizontally adjacent 8-bit pix-els from each of the eight input matrices, and computes 13 horizontally adjacent16-bit outputs. The CVI proceeds to the next row each cycle. After computing56CNN1ΣLane 3Lane 2dstsrcBsrcALane 1Lane 0Lane 1Lane 04 x (4x8b) 4 x (2x16b)4 x (4x8b)Lane 2Lane 3CNN0Σ Shift, Round& SaturateShift, Round& Saturateweights(9 x 12b)cnn kernel output++++++++x x x x x x x x xcol0col1col2row0row1row2Figure 4.2: Convolution instruction (showing a V4 system with 2 CNNsuper-kernels attached)57an entire column, the CVI advances horizontally by 13 pixels to the next column.Input padding ensures all output data is computed.Inputs to the inner convolutions are 8-bit, and internal widths increase as re-quired. The vector srcA and srcB operands each hold four matrices of interleaved8-bit data. The output, after the 8:1 reduction uses 29 bits. After this, shift-ing/rounding/saturating logic reduces the output to 16 bits. These latter operationsare not counted as ops by YOLO, since they are not needed in a floating-pointimplementation. We don’t count them here either.4.5 ResultsResults for the custom vector overlay are presented in this section. Initially, weused a single ARM Cortex-A9 as the host processor. To investigate potential formulti-core scaling, we also add implementations containing a second core, alsoequipped with a VectorBlox MXP processor. The second core host processor is amuch slower MicroBlaze soft processor.4.5.1 Experimental SetupResults were obtained using a Xilix ZC706 FPGA development board with a Zynq7Z045-2 device. FPGA builds were done using Vivado 2017.1. The ARM Cortex-A9 processor and camera/video framebuffers primarily use memory attached tothe processing system (32-bit wide DDR3), while the VectorBlox MXPs primarilyuses the faster memory attached to programmable logic (64-bit DDR3) for weightsand activations. Example detections produced when running Tiny YOLOv2 VOCare shown in Figure 4.3.58Figure 4.3: Example YOLO detection594.5.2 Overlay InstancesIncreasingly performant CNN implementations will be discussed below:1. 32-bit floating-point host baseline, single-core (ARM Cortex-A9)2. 16-bit fixed-point soft vector overlay, single-core (ARM Cortex-A9)3. 8-bit fixed-point custom vector overlay, single-core (ARM Cortex-A9)4. 8-bit fixed-point custom vector overlay, dual-core (ARM Cortex-A9 + Mi-croBlaze)This overlay instantiates multiple cores. The first host processor (ARM Cortex-A9) is used to control any slave processors (MicroBlaze). Each processor has avector engine attached. With two cores running at 116 MHz, each capable of 1859operations per cycle, the overlay has a theoretical peak throughput of 431.3 GOPS.To minimize latency of inference, the processing of each layer in the networkis split between the two cores. 
This is accomplished by generating half the outputmaps on each core, with both cores using the same input maps in shared memory.Using simple message passing, the host starts the slave core to generate the slave’soutputs of the current layer before beginning its own. Upon finishing, the hostblocks until the slave is finished, and then moves on to the next layer. Workloadsare relatively balanced and can continue to be divided as more cores are added.Multi-core scaling is discussed below.4.5.3 PerformanceA layer-by-layer runtime breakdown for Tiny YOLOv2 VOC is presented in Ta-ble 4.4. The soft vector overlay provides a significant performance increase, with60Type Filters ARM baseline 16-bit soft vector 8-bit custom vectorsingle core dual core single core dual coreConv 16 1294.37 46.42 23.37 8.86 4.67Conv 32 3224.92 115.14 57.68 5.71 2.99Conv 64 3170.09 122.02 61.08 4.18 2.18Conv 128 3048.22 122.24 61.19 3.54 1.83Conv 256 2988.95 156.99 78.53 3.49 1.77Conv 512 3038.33 626.48 313.38 4.53 2.28Conv 1024 12163.71 2502.87 1251.88 15.18 7.61Conv 1024 24505.00 5004.52 2503.17 29.17 14.61Conv 125 653.36 615.83 308.04 3.61 1.86Total 54087.00 (1×) 9313.4 (5.8×) 4658.9 (11.6×) 78.3 (690.8×) 39.8 (1359.0×)Table 4.4: Tiny YOLOv2 VOC runtime breakdown (ms)Type Filters GOP/Layer 8-bit custom vector 8-bit custom vector (dual) Scaling factorConv 16 0.15 16.9 32.0 1.89Conv 32 0.40 69.7 133.2 1.91Conv 64 0.40 95.2 182.5 1.92Conv 128 0.40 112.6 217.2 1.93Conv 256 0.40 114.1 224.8 1.97Conv 512 0.40 88.0 174.7 1.99Conv 1024 1.59 105.0 209.5 1.99Conv 1024 3.19 109.3 218.2 2.00Conv 125 0.04 12.0 23.3 1.94Total 6.97 89.3 175.0 1.97Table 4.5: Tiny YOLOv2 VOC throughput breakdown (GOPS)the dual-core implementation speeding up the baseline by 11.6×. However, notuntil the CVI is added, does the overlay achieve real-time performance. The dual-core custom vector overlay provides a speedup of 1359.0× over the hard ARMbaseline, an additional 117.2× over the soft vector overlay.Tiny YOLOv2 VOC and Tiny YOLOv2 COCO networks require 6.97 and 7.07GOPS, respectively. Throughput for Tiny YOLOv2 VOC is presented in Table 4.5,showing GOPS on a layer-by-layer basis for both single-core and dual-core over-lays. Nearly perfect scaling can be seen when moving to the dual-core overlayand scaling is expected to continue as cores are added. The layer with the high-61Type Filters ARM baseline 16-bit soft vector 8-bit custom vectorsingle core dual core single core dual coreConv 64 1442.21 52.40 26.35 11.22 5.86Conv 64 29597.55 1009.19 504.78 34.94 17.73Conv 128 14658.75 539.57 269.93 17.79 9.08Conv 128 29302.65 1076.98 538.64 30.99 15.72Conv 256 14250.50 540.73 270.49 16.15 8.16Conv 256 28504.01 1080.06 540.16 30.34 15.27Conv 256 28510.84 1080.37 540.30 30.12 15.16Conv 512 13977.83 626.55 313.45 18.21 9.13Conv 512 28070.75 1252.16 626.32 35.20 17.64Conv 512 28074.34 1252.58 626.50 35.19 17.62Conv 512 7003.18 1251.45 625.99 12.75 6.39Conv 512 7003.12 1251.41 626.00 12.75 6.39Conv 512 7004.00 1251.84 626.18 12.62 6.32FC 500 164.43 7.20 3.68 7.20 3.69FC 4096 26.36 3.80 2.03 3.80 2.02FC 4096 226.39 11.85 6.05 11.88 6.06FC 1000 55.25 2.89 1.51 2.90 1.51Total 237872.23 (1×) 12291.6 (19.4×) 6148.8 (38.7×) 324.1 (733.9×) 163.8 (1452.2×)Table 4.6: VGG16 SVD500 runtime breakdown (ms)est throughput reaches 224.8 GOPS. 
The last convolutional layer is 1× 1, so theconvolution CVI, designed to accelerate 3×3 kernels, is not fully utilized.Including pre- and post-processing, the Tiny YOLO VOC network takes 50.3 msto run, while the Tiny YOLO COCO takes 68.3 ms. The networks are identical ex-cept for the size of the last layer. The networks require an additional 6-15 msfor post-processing which is excluded from the reported speeds. As this now is asignificant part of the computation time, it may require vectorization in the future.The VGG16 SVD500 network takes 164.3 ms to run. A breakdown of runtimeis shown in Table 4.6. The throughput breakdown is shown in Table 4.7, with thehighest layer throughput reaching 243.9 GOPS. The fully-connected layers, whichare bandwidth-limited in the dual-core system, reach 6.8 GOPS. Post-processingis greatly simplified, only requiring a final sort of the 1000-entry output.We also compared performance to a desktop-based solution. Table 4.8 shows62Type Filters GOP/Layer 8-bit custom vector 8-bit custom vector (dual) Scaling factorConv 64 0.17 15.4 29.6 1.91Conv 64 3.70 105.9 208.6 1.97Conv 128 1.85 104.0 203.5 1.96Conv 128 3.70 119.4 235.2 1.97Conv 256 1.85 114.5 226.5 1.98Conv 256 3.70 121.9 242.2 1.99Conv 256 3.70 122.8 243.9 1.99Conv 512 1.85 101.5 202.4 1.99Conv 512 3.70 105.1 209.7 2.00Conv 512 3.70 105.1 210.0 2.00Conv 512 0.92 72.5 144.6 1.99Conv 512 0.92 72.5 144.6 1.99Conv 512 0.92 73.3 146.2 2.00FC 500 0.03 3.5 6.8 1.95FC 4096 0.04 1.1 2.0 1.88FC 4096 0.03 2.8 5.5 1.96FC 1000 0.01 2.9 5.6 1.92Total 30.76 88.8 176.3 1.98Table 4.7: VGG16 SVD500 throughput breakdown (GOP/s)the runtime for the networks tested, including post-processing, compared to theruntime of the original Darknet framework, for both a NVIDIA 1080 GPU, andrunning on a single-core of a 4.00 GHz Intel i7-4790k CPU. The overlay is about20 times faster than the CPU, but is still nearly an order of magnitude slower thanthe GPU when running Tiny YOLOv2. This gap is expected to narrow if a largerFPGA is targeted. Note that though inference on both Tiny YOLOv2 networkswere greatly sped up when NVIDIA’s cuDNN library was used, VGG saw littleimprovement, leading to our implementation running slightly faster.4.5.4 AreaThe area required for single- and dual-core overlay is reported in Table 4.9. Thedesigns use the 28 nm 7Z045-2, and met timing at 116 MHz. In the dual-core de-sign Block Random Access Memory (BRAM) becomes the most utilized resource,63Network Darknet GPU Darknet CPU Dual-core custom vector overlayTiny YOLOv2 VOCms 4.6 992.0 41.0fps 218.1 1.0 24.4Tiny YOLOv2 COCOms 4.6 1186.3 42.5fps 216.5 1.2 23.5VGG16 SVD500ms 230.2 4977.8 164.3fps 4.34 0.2 4.9Table 4.8: Comparing inference speed to the Darknet frameworkTable 4.9: Resource UsageFully Implemented System (Zynq-7000, 28nm, 8-bit fixed-point)Logic 64b LUT BRAM6-LUTs RAMs FFs (36kB) DSP7Z045 Device 218,600 70,400 437,200 545 900VectorBlox MXP (V16) 29712 1053 25602 72 80MicroBlaze 4490 926 3025 19 5CNN CVI (13 super-kernels) 13897 0 24886 12 144Per-core Interconnect/Peripherals 6710 205 10590 4 01-Core (MXP+CVI+Interconnect) 54809 2184 64103 107 2291-Core Device Utilization 25.1% 3.1% 14.7% 19.6% 25.4%2-Cores (1 using ARM) 98485 3234 114300 191 453AXI Fabric 13314 1332 28412 32 0MIG Memory Controller 10408 2234 9231 1 0Video etc. 5834 370 9845 33 8Complete System 128041 7170 161788 257 461Device Utilization 58.6% 10.2% 37.0% 47.2% 51.2%as they are used by both the MXP and the MicroBlaze. 
Optimizing the memoryfootprint of each host, and using smaller scratchpad sizes, is needed to fit morecores on this device.4.6 Previous WorkAccelerating training and inference of CNNs has been an active research area. Fastinference of deep networks using FPGAs is important both in the data center andin the field.64Early work by Farabet [14], accelerated the face detection using a 5 layer CNNnetwork, consisting of 3 convolutional layers and 2 fully connected layers. The de-sign follows a flexible, vectorized approach and allows reprogramming of softwarewithout reconfiguration. Custom hardware is designed to accelerate each layertype. It was implemented on a Virtex-4 SX35 and ran close to around 4 GOPS.This design used 16-bit quantization.More recently, Zhang et al. [35] implemented a convolution accelerator on Xil-inx Virtex7 485T, at a much higher performance of 62.62 GFLOPS. The synthe-sized design ran at 100 MHz using Vivado HLS. Only the convolution computa-tion was accelerated, ignoring the pointwise operations and pooling layer. Full32-floating point operands were used, and design consumed 2240 DSP blocks.Qiu et al. [26] implemented a CNN engine, which ran a modified VGG16 net-work, with reduced fully-connected layers. The design used the same developmentboard as our system, a Xilinx ZC706, and ran at 116 MHz. Quantization uses16 bits. Convolutional layers were measured at 187.5 giga operations per sec-ond (GOPS), while fully connected layers measured 1.2 GOPS. For the YOLOnetworks, where nearly all computation was in the convolutional layers, they mea-sured 137.5 GOPS, using 90% of the DSP blocks in the design.Most recently, Ayondat et al. [3] implemented a deep learning accelerator usingOpenCL. It targeted the Intel Arria 10 1150 FPGA. The authors benchmarked theirdesign on AlexNet, which contains CNN kernels greater than 3×3. The Winogradtransform is used to boost performance of the of their design, which significantlyspeeds larger kernels. Using the Arria 10, which is a much faster chip than ours,the design runs at 303 MHz, nearly 3× the clock speed of our current design.This puts the throughput of this custom overlay above the performance of65[14] [35] [26] [3] OursOperands 18-bit fixed 32-bit float 16-bit fixed 16-bit float 8-bit + 12-bit fixedFPGA Virtex-4 Virtex7 VX485T 7Z045 Arria 10 7Z045MHz 200 100 150 303 116LUT - 186251 182616 246k alms 681K reg 128041DSP 192 2240 780 1476 461GOPS 4 61.62 137.5 1382 175.0GOPS/GLUT - 0.33 0.75 5.6 1.3GOPS/DSP 0.021 0.028 0.176 0.936 0.361Table 4.10: Previous work comparisonZhang et al. and Qui et al. yet significantly below the performance of Aydonatet al.. Moving to a larger, newer FPGA, scaling up the overlay, and increasing thenumber of DSP blocks can increase performance. However, to obtain state-of-the-art performance per unit area, clock speed improvements are also necessary.4.7 Design ComplexityThis custom vector overlay only requires a single CVI as the vast majority of theoperations are in the convolution kernel. The soft vector processor is capable of ac-celerating the remaining operations, including pointwise activation functions andpooling. The convolution CVI is designed only for 3×3 kernels and was developedfairly rapidly. The instruction requires 1300 lines of RTL, as it contains optimiza-tions to maintain precision and high throughput. 
The remainder of the overlayconsists of C code totaling approximately 4500 lines of software.4.8 SummaryThe CNN custom vector overlay is a performant, flexible solution for acceleratinginference on FPGAs. With the fastest layer achieving 243.9 GOPS, the design iscompetetive with the state-of-the-art on FPGAs. Further refinement to the CVI and66scaling to larger FPGAs will help close the gap with GPUs.As the overlay is software-driven, the addition of other network layers can beadded rapidly, making it a useful tool for the fast-moving field of deep learning.The custom vector overlay approach minimizes the use of custom RTL, to achieveresults competetive with custom hardware solutions.67Chapter 5Binary-weight Neural NetworkCustom Vector OverlayReduced-precision arithmetic improves the size, cost, power and performance ofneural networks in digital logic. CNNs using 1-bit weights can achieve state-of-the-art error rates while eliminating costly multiplications, reducing memory band-width and improving power efficiency. The BinaryConnect binary-weight network,for example, achieves 9.9% error using floating-point activations on the CIFAR-10dataset.In this chapter, we implement a lightweight vector overlay for accelerating in-ference computations with 1-bit weights and 8-bit activations. The entire overlayis very small, using about 5000 4-input LUTs, and fits into a low cost iCE40 Ul-traPlus FPGA from Lattice Semiconductor. Beyond the base vector processor, theoverlay only requires 200 additional lines of custom RTL, keeping the majorityof development effort in software. To show small networks can be useful, we run68two classification networks produced by shrinking the original BinaryConnect net-work. The first is a 10-category classifier with a 89% smaller network that runs in1315 ms and achieves 13.6% error. The second is an even smaller, single categoryclassifier that runs in 195 ms, and has only 0.4% error. In both classifiers, the er-ror can be attributed entirely to training and network size, not to the reduction inprecision of activations.5.1 ApproachDeep CNNs provide increasingly accurate solutions but require an increasing num-ber of MAC operations. Large networks can contain tens of GOPS on a singleinference pass. Not all neural networks are as expensive to run, however, and ad-vances for deep learning can benefit even small devices. Binary versions of CNNsallow deep learning networks to be deployed on small, low-cost FPGA chips, or tomaximize the operations per second on a larger chip.Using our software-based approach, we develop a lightweight, custom vectoroverlay for accelerating networks with binary weights. We target a low-cost iCE40UltraPlus FPGA, and use a soft RISC-V processor as the host. The RISC-V pro-cessor contains lightweight vector instructions called LVE, which accelerate theoverlay. In this chapter we make three contributions. First, we optimize the Bina-ryConnect system by reducing the network size and computing precision. Second,for performance, we implement a CVI that accelerates binary convolutions for net-works with binary weights and 8-bit inputs, requiring only 200 lines of customRTL. Third, we produce a lightweight custom overlay for accelerating binary net-works, capable of targeting the small FPGAs.695.2 AdaptationIn this section, we examine and adapt networks that allow deep-learning to be uti-lized in small, highly-constrained devices. First, we observe the research showingnetworks reduced to the extreme can produce state-of-the-art results. 
We take thework of BinaryConnect [8], where weights contain only binary values, to producenetworks for a lightweight overlay. Next, we verify fixed-point quantization of theremaining calculations does not lead to loss in accuracy.As in the previous chapter, the core compute kernel is a 2D convolution (again3× 3). This computation is repeated at every position in the padded input map,producing an output map of the same size. These outputs become inputs for thenext layer, and the process is repeated. In this chapter, however, we explore net-works with weights fixed to ±1. Using the training method described in Bina-ryConnect, 1-bit weights are used to represent ±1. This saves memory storageand bandwidth, and replaces all multiplications with addition and subtraction. Al-though using binary weights, BinaryConnect achieves 9.9% state-of-the-art errorrates on CIFAR-10 [20]. Binary and ternary weights (where 0 is also allowed) al-low FPGAs to exploit bit-level parallelism. We further save area and energy byprocessing fixed-point activations throughout the network. Importantly, our re-duced fixed-point quantization does not introduce any further error.Two networks are explored and adapted in this chapter; a downscaled versionof BinaryConnect’s CIFAR-10 network, and a custom, single category classifier,intended to be used in a low-power, always-on sensor handling sleep/wake eventsfor a larger electronic device. The first network is larger and requires overlappingcomputation with DMA transfers of weights from the SPI flash ROM, while the70128 maps4 x 4128 maps8 x 83 maps32 x 3248 maps32 x 32bgr48 maps32 x 3296 maps16 x 1696 maps16 x 16128 maps8 x 848 maps16 x 1696 maps8 x 8convolve convolve maxpool convolve convolve maxpool convolve convolve maxpool256x 1256x 110x 1dense densedenseFigure 5.1: Reduced binary CNN containing 89% fewer operations than Bi-naryConnectsecond can fit all parameters in the scratchpad, avoiding transfers beyond the initialbootup.5.2.1 Network ReductionWe start by optimizing the BinaryConnect system by reducing both the networksize and computed precision described below. We reduced the network size from:(2×128C3)-MP2-(2×256C3)-MP2-(2×512C3)-MP2-(2×1024FC)-10SVMto:(2×48C3)-MP2-(2×96C3)-MP2-(2×128C3)-MP2-(2×256FC)-10SVMwhere C3 is a 3×3 ReLU convolution layer, MP2 is a 2×2 max-pooling layer, FCis a fully connected layer, and SVM is a L2-SVM output layer.This new network, shown in Figure 5.1, contains 89% fewer operations thanthe BinaryConnect reproduction and achieves 11.8% error on CIFAR-10. For per-formance, we also dropped ZCA whitening, increasing error to 13.6%.All computations are converted to fixed-point, creating both an initial ver-sion using 32-bit values throughout, as well as a reduced-precision version. Thereduced-precision network’s inputs and activations use 8-bit unsigned integers andthe intermediate sums use 16-bit and 32-bit signed integers. Importantly, the reduced-precision network maintains the same error rate of 13.6%.71Figure 5.2: Samples of CIFAR-10 datasetFor a the second network, we start with the CIFAR-10 dataset, shown in Fig-ure 5.2, but modify it slightly by replacing the ‘deer’ category with ‘people’ super-class from CIFAR-100. The ‘people’ images were duplicated to match the numberof images in each of the CIFAR-10 categories. This change allows the network tofunction as a “person” detector.Sample results are shown in Figure 5.3. 
The two bars for each category showthe different scores between floating-point computation (lighter) and our 8-bit fixed-point computation (darker). A more positive score is a higher confidence classifi-cation.After this initial network is successfully running on our overlay, we modify the72Figure 5.3: Person detector, sample resultssecond network to become a binary classifier, targeting only the “person” category.The network was shrunk in size and retrained on a larger database. As expected, ahigher accuracy for this simpler task is achieved, with less than 1% error.5.3 VectorizationWith the two networks, we start by vectorizing a 32-bit fixed-point scalar imple-mentation. The soft vector overlay consists of an ORCA [24] soft RISC-V proces-sor augmented with Lightweight Vector Extensions (LVE) [23] 1. Like VectorBlox1ORCA implements a pipelined RV32IM instruction set. LVE is a proprietary extension thatdiffers from the proposed RISC-V vector extension, which is not yet finalized.73MXP, LVE enables efficient vector and matrix operations without any loop, mem-ory access, or address generation overhead. LVE streams data through the RISC-VALU, so subword operations are unavailable. LVE does allow CVIs to be inserted.After completing this initial vectorization we see a speedup of 7.43× over theRISC-V scalar implementation, speeding inference from 93233 ms to 12543 . Pro-filing confirms the convolution layers consume the vast majority of the runtime.The fully-connected layers take only a small percentage of the runtime and do notrequire further acceleration.5.4 Custom Vector InstructionsTo accelerate the overlay further, with convolution operations consuming the ma-jority of the runtime, a CVI is used. The binary convolution CVI is discussedbelow. In addition to this instruction, two minor CVIs are added to allow certainsubword SIMD processing: a quad 16-bit to 32-bit SIMD add, and a 32-bit to 8-bitactivation function. These two custom instructions allow us to maintain perfor-mance while avoiding overflows by accumulating the 16-bit convolution outputsinto 32-bit sums, before ultimately producing 8-bit activations. As these latter twoCVIs are straightforward, they aren’t further discussed, but do count towards cus-tom lines of RTL.5.4.1 3x3 Binary Convolution InstructionThe convolution CVI accelerates CNNs with binary weights and 8-bit inputs. Theinstruction computes two overlapping convolutions in parallel. In use, input data isfetched down a column, accepting 8 consecutive bytes each cycle as its two 32-bitoperands. Two passes of the same input data are required to compute four columns74weights(9x1 bits)srcA (4x8b) srcB (4x8b)convLorow2 row1 row0sel01/23convHi0 1 2 3 4 5 6 7dstLo(2x8b)dstHi(2x8b)0 12 3programmablenegationof 8b inputsFigure 5.4: Binary convolution custom vector instructionof 16-bit convolutions from all four possible byte offsets within a word. After that,the inputs advance by 4 bytes to maintain alignment. A diagram of the CVI isshown in Figure 5.4.5.5 ResultsResults for the performance and area of the lightweight custom vector overlay arepresented in this section.5.5.1 Experimental SetupThe entire system, shown in Figure 5.5, was built around a Lattice iCE40 UltraPlusMobile Development Platform (MDP). Unlike the previous development board,there is no DRAM, and SPI Flash ROM is used to store the binary weights. 
5.5 Results

Results for the performance and area of the lightweight custom vector overlay are presented in this section.

5.5.1 Experimental Setup

The entire system, shown in Figure 5.5, was built around a Lattice iCE40 UltraPlus Mobile Development Platform (MDP). Unlike the previous development board, there is no DRAM, and SPI flash ROM is used to store the binary weights.

[Figure 5.5: System diagram. The Lattice iCE40 UltraPlus FPGA contains the RISC-V core with LVE and CVIs, a 128 kB scratchpad, an SPI DMA engine connected to the SPI flash ROM, and an RGB DMA path from a VGA camera (controlled over I2C) through a 16×16 downscaler.]

In this minimal design, the input and output maps must remain entirely within the 128 kB scratchpad memory. A lightweight port of VectorBlox's vector processor, titled LVE, is used in place of MXP. Like MXP, LVE instructions operate on data in a dedicated scratchpad, which is single-ported Random Access Memory (RAM). The scratchpad operates at 72 MHz to provide two reads and one write per cycle of the 24 MHz CPU clock. Operating concurrently with the CPU, a DMA engine transfers multiple 32-bit values from the SPI flash ROM, which stores the binary weights (about 270 kb), into the scratchpad. A VGA-resolution RGB camera (640×480 pixels) using RGB565 is downscaled to 40×30 pixels in hardware, and uses DMA to write 32-bit aligned RGBA8888 pixels into the scratchpad. Software de-interleaves the RGBA8888 pixels into separate 8-bit R, G, and B pixel planes, from which the algorithm uses a 32×32 centred and padded region of interest.
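The de-interleaving and region-of-interest step is simple enough to sketch directly. The version below assumes an R, G, B, A byte order within each 32-bit pixel and zero padding for the two rows outside the 30-row camera image; these details, along with the function and constant names, are assumptions rather than the overlay's actual scratchpad code.

    #include <stdint.h>
    #include <string.h>

    /* De-interleave 40x30 camera pixels (4 bytes each) into separate 8-bit
     * planes, keeping a 32x32 centred, zero-padded region of interest. */
    #define CAM_W 40
    #define CAM_H 30
    #define ROI   32

    static void deinterleave_and_crop(const uint8_t *rgba,   /* CAM_W*CAM_H*4 bytes */
                                      uint8_t r[ROI * ROI],
                                      uint8_t g[ROI * ROI],
                                      uint8_t b[ROI * ROI])
    {
        /* Zero padding covers the rows outside the camera image. */
        memset(r, 0, ROI * ROI);
        memset(g, 0, ROI * ROI);
        memset(b, 0, ROI * ROI);

        int x0 = (CAM_W - ROI) / 2;   /* crop horizontally: 40 -> 32 */
        int y0 = (ROI - CAM_H) / 2;   /* pad vertically:    30 -> 32 */

        for (int y = 0; y < CAM_H; y++) {
            for (int x = 0; x < ROI; x++) {
                const uint8_t *px = &rgba[4 * (y * CAM_W + (x + x0))];
                int dst = (y + y0) * ROI + x;
                r[dst] = px[0];
                g[dst] = px[1];
                b[dst] = px[2];
            }
        }
    }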
5.5.2 Performance

On the Mobile Development Platform, inference for the CIFAR-10 network takes 1315 ms. The accelerator improves the RISC-V runtime of the convolution layers by 73×, and LVE improves the runtime of the fully-connected layers by 8.3×, for an overall speedup of 70.9×. In comparison, on a 4.00 GHz Intel i7-4790k desktop using Python/Lasagne, the network takes 6.4 ms.

Table 5.1: Runtime of reduced networks, desktop vs. custom overlay

  Network                          Lasagne CPU   Custom vector overlay
  Reduced CIFAR-10           ms    6.4           1315
                             fps   156.3         0.8
  Reduced person classifier  ms    2.0           195
                             fps   500.0         5.1

The reduced, binary classifier runs in 195 ms (2.0 ms on the i7-4790k). The complete system, including camera overhead, runs at 230 ms per frame.

5.5.3 Area

Resource utilization for the Lattice iCE40 UltraPlus-5K FPGA can be seen in Table 5.2. The RISC-V CPU runs at 24 MHz. The whole system uses 5036 of 5280 4-input logic cells, 4 of 8 (16-bit) DSP blocks, all thirty 4 kb (0.5 kB) BRAMs, and all four 32 kB single-ported RAMs.

Table 5.2: Resource Usage

  Component                                  Logic (4-LUTs)   BRAM (0.5 kB)   SPRAM (32 kB)   DSP
  iCE40 UltraPlus totals                     5,280            30              4               8
  Binary-weight CNN overlay, complete system 5,036            30              4               4
  Device utilization                         95.4%            100.0%          100.0%          50.0%

5.6 Previous Work

Research into training binary CNNs is still fairly recent; however, several custom hardware inference engines have already been developed. Unlike current GPUs, these custom hardware solutions can fully exploit the reduced-bit computation and avoid heavy usage of resource-constrained DSP blocks.

In 2016, Andri et al. produced YodaNN [2], a low-power ASIC for accelerating binary-weight neural networks. Their design is capable of 1.5 tera operations per second (TOPS) at 1.2 V. The maximum performance seen on binary versions of several popular networks is 525.5 GOPS.

Zhao et al. [36] synthesized a binary-weight accelerator on a 7Z020 FPGA, targeting inference of CIFAR-10 networks. The design was capable of 207.8 GOPS across layers.

Umuroglu et al. [33] produced a binary neural network framework using a 7Z045 FPGA, targeting several small networks, including CIFAR-10 and SVHN. The design used 1-bit inputs and activations, not just binary weights, and accumulates using XNOR and popcount operations. A maximum throughput of 11.6 TOPS was reported for a fixed 3×512 fully connected topology. Fraser et al. [15] expanded upon this work and targeted larger networks on a larger KU115 FPGA. They reported a maximum throughput of 14.8 TOPS.

5.7 Design Complexity

The majority of the code is accelerated by lightweight vector instructions. As Lightweight Vector Extensions (LVE) does not support subword SIMD, two basic operations on halfwords and bytes are added, as well as the important binary convolution instruction. These small CVIs, adding subword support, required a total of 100 lines of RTL. These instructions replace approximately 10 lines of the original scalar code.

The CVI used to accelerate the binary-weight convolution kernels requires 100 lines of RTL. This hardware replaces approximately 6 lines of the original scalar code. The entire design requires a total of 700 lines of C code.

5.8 Summary

This custom vector overlay demonstrates that our approach scales down to the smallest FPGAs. Using an embedded RISC-V processor with lightweight vector extensions, we create a minimal soft vector overlay. Adding the binary-weight convolution CVI allows our design to run 70.9× faster. The design runs in real time despite the restricted resources available, while requiring only an additional 200 lines of RTL.

Chapter 6

Conclusions

This thesis uses three case studies to explore a hardware/software approach for accelerating computer vision algorithms using custom vector overlays. The algorithms are implemented with a software-driven approach and achieve performance comparable with full hardware solutions. These overlays require only a minimum of custom RTL, keeping the development effort on software. Our approach works on both traditional vision algorithms using AdaBoost cascades and state-of-the-art neural networks.

6.1 Summary

A summary of results for the three case studies is shown in Table 6.1.

Table 6.1: Comparing the various overlays explored in this thesis

  Algorithm                 LBP              CNN              BNN
  Cores                     1                2                1
  Host processor            Hard ARM A9      Hard ARM A9      Soft RISC-V
  Vector processor          VectorBlox MXP   VectorBlox MXP   VectorBlox LVE
  Vector lanes              16               16               1
  CVIs                      2                1                2
  Lines of RTL              800              1300             200
  Lines of software         2500             4500             700
  Vector speedup            24.9×            11.6×            7.43×
  CVI speedup               248.4×           1359.0×          70.9×
  CVI speedup over vector   10×              117×             9.6×

In the first case study, accelerating LBP face detection, the custom vector overlay achieves a speedup of 248.4× over an ARM Cortex-A9 software implementation. This gain can be decomposed into software and hardware contributions. Software-only vectorization using the general-purpose soft vector overlay provides a 25× speedup, using the 16 parallel 32-bit ALUs. An additional 10× speedup is achieved by customizing the vector overlay, implementing two custom vector instructions (CVIs): one for quickly computing the 8-way comparison that produces the 8-bit LBP (sketched in scalar form below), and another for the table-lookup operation.

The software-driven LBP-based face detection system runs 2.4× to 3.5× faster than previously published face detection systems implemented purely in hardware. Furthermore, the custom vector overlay only needs about 800 lines of custom RTL, keeping the majority of the development effort on software. This approach permits easy integration of additional processing, where the same overlay can be used.

A novel contribution made in this case study is the ILP formulation used to replace 32-bit floating-point or fixed-point computation with 8-bit integers. This simplifies hardware, produces a speedup, and provides an accuracy guarantee. We also show how restricting block sizes to be square powers of two allows the LBP computation to be performed in an efficient pre-computation step. Although this has been reported before on fixed-width SIMD systems [4], we found that it works for variable-length vector systems as well.
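For reference, the 8-way comparison mentioned above can be sketched in scalar form as follows. The comparison direction and bit ordering follow the usual LBP definition and are assumptions about details not restated here; on the overlay, this comparison is performed across an entire vector of pixels by a single CVI.

    #include <stdint.h>

    /* 8-way comparison producing an 8-bit local binary pattern: each of the
     * eight neighbour values sets one bit of the code when it is greater
     * than or equal to the centre value. */
    static uint8_t lbp8(const uint8_t neighbour[8], uint8_t centre)
    {
        uint8_t code = 0;
        for (int i = 0; i < 8; i++)
            code |= (uint8_t)((neighbour[i] >= centre) << i);
        return code;
    }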
The second case study implements a CNN overlay running VGG and two networks based on YOLO [28]. The software-programmable vector overlay enables rapid algorithm development, an important property for this fast-moving research field. Only a single CVI, requiring 1300 lines of RTL, was needed to create a custom vector overlay that supports these CNNs. The multi-core custom overlay is 1359.0× faster than the hard ARM core. The performance of the overlay is potentially competitive with top-of-the-line GPUs; this work achieved hundreds of GOPS on a relatively small FPGA. We found that using 8-bit precision is essential for performance and allows maximal use of the DSP blocks as long as 12-bit weights are used. The CVI contributed a 145× speedup over the base soft vector overlay.

The final case study implements a lightweight, binary-weight neural network overlay to address the heavy DSP usage in the previous example. We use a very low-cost FPGA and show the effectiveness of the design using several networks based on BinaryConnect. The overlay is very small, using about 5000 4-input LUTs, and only 200 lines of custom RTL were required for this design. The accelerator improves the ORCA RISC-V runtime of the convolution layers by 73×, and LVE improves the runtime of the dense layers by 8×, for an overall speedup of 71×. The CVI contributed a 9.6× speedup over the base LVE. The 1-category classifier runs in 195 ms (2.0 ms on the i7-4790k) with 0.48% error and consumes 21.8 mW. A power-optimized version, designed to run at one frame per second, consumes just 4.6 mW. In the classifiers we tested, the error can be attributed entirely to training and not to reduced precision.

6.2 Limitations

We have only assessed our algorithmic changes using the cascades and networks discussed. For generality, we should further validate with cascades trained to detect other objects and explore further networks in detail.

6.3 Future Work

Many of our optimizations may be applicable to GPUs or other SIMD systems as well, and applying them there would be useful.

One obvious way to improve performance of our approach is to build larger, multi-core systems. Both the traditional Viola-Jones algorithm and neural networks can be divided efficiently between cores. We have shown this with our dual-core CNN overlay, which achieves near-perfect scaling. Vector overlays can be parameterized both in terms of cores and vector lanes, allowing systems of any size to be targeted.

Beyond scaling to many-core systems, we would like to expand the range of neural networks that can be run on the overlays. An overlay represents a practical way to accelerate other layers and incorporate the latest algorithmic advances. To make our overlay useful for many applications, an automated tool, targeting one of the popular deep learning frameworks, should take an arbitrary network and translate it to run on our overlay.

Finally, accelerating the training step, instead of only inference, would be interesting in domains that would benefit from low-latency, online training.

Bibliography

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016. → pages 49

[2] R. Andri, L. Cavigelli, D. Rossi, and L. Benini. YodaNN: An ultra-low power convolutional neural network accelerator based on binary weights.
In VLSI (ISVLSI), 2016 IEEE Computer Society Annual Symposium on, pages 236–241. IEEE, 2016. → pages 78

[3] U. Aydonat, S. O'Connell, D. Capalija, A. C. Ling, and G. R. Chiu. An OpenCL deep learning accelerator on Arria 10. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 55–64. ACM, 2017. → pages 65, 66

[4] O. Bilaniuk, E. Fazl-Ersi, R. Laganiere, C. Xu, D. Laroche, and C. Moulder. Fast LBP face detection on low-power SIMD architectures. In IEEE Computer Vision and Pattern Recognition Workshops, pages 630–636, 2014. → pages 26, 28, 43, 81

[5] G. Bradski. The OpenCV Library. Dr. Dobb's Journal of Software Tools, 2000. → pages 11

[6] B. Brousseau and J. Rose. An energy-efficient, fast FPGA hardware architecture for OpenCV-compatible object detection. In ICFPT, pages 166–173, 2012. → pages 3, 31, 39, 40, 42, 43

[7] J. Cho, B. Benson, S. Mirzaei, and R. Kastner. Parallelized architecture of multiple classifiers for face detection. In IEEE ASAP, pages 75–82, 2009. → pages 3, 42, 43

[8] M. Courbariaux, Y. Bengio, and J.-P. David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123–3131, 2015. → pages 6, 17, 70

[9] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016. → pages 17

[10] L. De Moura and N. Bjørner. Z3: An efficient SMT solver. In Tools and Alg. for the Construction and Analysis of Systems, pages 337–340. Springer, 2008. → pages 29

[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009. → pages 15

[12] J. Edwards and G. G. Lemieux. Real-time object detection in software with custom vector instructions and algorithm changes. In Application-specific Systems, Architectures and Processors (ASAP), 2017 IEEE 28th International Conference on, pages 75–82. IEEE, 2017. → pages iv

[13] J. Edwards, J. Vandergriendt, A. Severance, A. Raouf, T. Watzka, S. Singh, and G. G. Lemieux. TinBiNN: Tiny binarized neural network overlay in less than 5,000 4-LUTs. In Proceedings of the 3rd International Workshop on Overlay Architectures for FPGAs (OLAF 2017), pages 23–25, 2017. → pages iv

[14] C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun. CNP: An FPGA-based processor for convolutional networks. In Field Programmable Logic and Applications, 2009. FPL 2009. International Conference on, pages 32–37. IEEE, 2009. → pages 49, 65, 66

[15] N. J. Fraser, Y. Umuroglu, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers. Scaling binarized neural networks on reconfigurable logic. In Proceedings of the 8th Workshop and 6th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms, pages 25–30. ACM, 2017. → pages 78

[16] C. Gao and S.-L. Lu. Novel FPGA based Haar classifier face detection algorithm acceleration. In FPL, pages 373–378, 2008. → pages 3, 42, 43

[17] B. Graham. Fractional max-pooling. arXiv preprint arXiv:1412.6071, 2014. → pages 15

[18] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, pages 1737–1746, 2015. → pages 49

[19] I. Hubara, M. Courbariaux, D.
Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016. → pages 49

[20] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Univ. of Toronto, 2009. → pages 6, 15, 70

[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012. → pages 15

[22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. → pages 15

[23] G. Lemieux and J. Vandergriendt. FPGA-optimized lightweight vector extensions for VectorBlox ORCA RISC-V. 4th RISC-V Workshop, July 2016. → pages 73

[24] G. Lemieux and J. Vandergriendt. ORCA FPGA-optimized RISC-V. 3rd RISC-V Workshop, January 2016. → pages 73

[25] S. Liao, X. Zhu, Z. Lei, L. Zhang, and S. Z. Li. Learning multi-scale block local binary patterns for face recognition. Advances in Biometrics, pages 828–837, 2007. → pages 13

[26] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, et al. Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 26–35. ACM, 2016. → pages 49, 51, 65, 66

[27] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016. → pages 17, 49

[28] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016. → pages 4, 81

[29] D. E. Rumelhart, G. E. Hinton, R. J. Williams, et al. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988. → pages 15

[30] A. Severance and G. Lemieux. Embedded supercomputing in FPGAs with the VectorBlox MXP matrix processor. In CODES+ISSS, pages 1–10, 2013. → pages 18

[31] A. Severance, J. Edwards, H. Omidian, and G. Lemieux. Soft vector processors with streaming pipelines. In FPGA, pages 117–126, 2014. → pages 21

[32] A. Severance, J. Edwards, and G. Lemieux. Wavefront skipping using BRAMs for conditional algorithms on vector processors. In FPGA, pages 171–180, 2015. → pages 22

[33] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers. FINN: A framework for fast, scalable binarized neural network inference. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 65–74. ACM, 2017. → pages 78

[34] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In IEEE Computer Vision and Pattern Recognition, volume 1, pages 511–518, 2001. → pages 9

[35] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 161–170. ACM, 2015. → pages 65, 66

[36] R. Zhao, W. Song, W. Zhang, T. Xing, J.-H. Lin, M. B. Srivastava, R. Gupta, and Z. Zhang. Accelerating binarized convolutional neural networks with software-programmable FPGAs. In FPGA, pages 15–24, 2017. → pages 78
