@prefix vivo: . @prefix edm: . @prefix ns0: . @prefix dcterms: . @prefix skos: . vivo:departmentOrSchool "Applied Science, Faculty of"@en, "Electrical and Computer Engineering, Department of"@en ; edm:dataProvider "DSpace"@en ; ns0:degreeCampus "UBCV"@en ; dcterms:creator "Liu, Zhiduo"@en ; dcterms:issued "2012-10-17T18:07:50Z"@en, "2012"@en ; vivo:relatedDegree "Master of Applied Science - MASc"@en ; ns0:degreeGrantor "University of British Columbia"@en ; dcterms:description "This thesis describes the compiler design for VENICE, a new soft vector processor (SVP). The compiler is a new back-end target for the Microsoft Accelerator, a high-level data parallel library in C/C++ and C\\#. This allows automatic compilation from high-level programs into VENICE assembly code, thus avoiding the process of writing assembly code used by previous SVPs. Experimental results show the compiler can generate scalable parallel code with execution times that are comparable to human-optimized VENICE assembly code. On data-parallel applications, VENICE at 100MHz on an Altera DE3 platform runs at speeds comparable to one core of a 2.53GHz Intel Xeon E5540 processor, beating it in performance on four of six benchmarks by up to 3.2x. The compiler also delivers near-linear scaling performance on five of six benchmarks, which exceed scalability of the Multi-core target of Accelerator."@en ; edm:aggregatedCHO "https://circle.library.ubc.ca/rest/handle/2429/43442?expand=metadata"@en ; skos:note "Accelerator Compiler for the VENICE Vector Processor by Zhiduo Liu B.A.Sc, Harbin Institute of Technology, 2009 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE STUDIES (Electrical and Computer Engineering) The University of British Columbia (Vancouver) October 2012 © Zhiduo Liu, 2012 Abstract This thesis describes the compiler design for VENICE, a new soft vector processor (SVP). The compiler is a new back-end target for the Microsoft Accelerator, a high- level data parallel library in C/C++ and C#. This allows automatic compilation from high-level programs into VENICE assembly code, thus avoiding the process of writing assembly code used by previous SVPs. Experimental results show the compiler can generate scalable parallel code with execution times that are compa- rable to human-optimized VENICE assembly code. On data-parallel applications, VENICE at 100MHz on an Altera DE3 platform runs at speeds comparable to one core of a 2.53GHz Intel Xeon E5540 processor, beating it in performance on four of six benchmarks by up to 3.2. The compiler also delivers near-linear scaling performance on five of six benchmarks, which exceed scalability of the Multi-core target of Accelerator. ii Preface [1] Zhiduo Liu, Aaron Severance, Satnam Singh and Guy G. F. Lemieux, “Accelerator Compiler for the VENICE Vector Processor,\" in Proceedings of the 20th ACM/SIGDA International Symposium on Field Programmable Gate Arrays. ACM, 2012, pp. 229 - 232. Portions of chapter 3 and 5 have been published at FPGA 2012 [1]. iii Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii Glossary . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Motivation and Research Goals . . . . . . . . . . . . . . . . . . . 1 1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . 9 2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.1 Vector Processors . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.1.1 Soft Vector Processors . . . . . . . . . . . . . . . . . . . 11 2.2 VENICE Architecture . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.1 Native VENICE Programming Interface . . . . . . . . . . 15 2.3 Vectorizing Compilers . . . . . . . . . . . . . . . . . . . . . . . 21 2.4 Microsoft Accelerator System . . . . . . . . . . . . . . . . . . . 23 iv 2.4.1 Accelerator Language Fundamentals . . . . . . . . . . . . 24 2.4.2 The Accelerator Front-end . . . . . . . . . . . . . . . . . 26 2.5 The Sethi-Ullman Algorithm and Appel’s Generalization . . . . . 30 3 Compiler Implementation . . . . . . . . . . . . . . . . . . . . . . . . 32 3.1 Compiler Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.1.1 Compiler Overview . . . . . . . . . . . . . . . . . . . . . 33 3.1.2 Accelerator Front-end . . . . . . . . . . . . . . . . . . . 35 3.1.3 Target-specific Optimizations . . . . . . . . . . . . . . . 37 3.1.4 Convert to LIR . . . . . . . . . . . . . . . . . . . . . . . 43 3.1.5 Code Generation . . . . . . . . . . . . . . . . . . . . . . 44 3.2 Implementation Limitations . . . . . . . . . . . . . . . . . . . . . 49 3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4 Performance Enhancement . . . . . . . . . . . . . . . . . . . . . . . 52 4.1 Dynamic Partitioning . . . . . . . . . . . . . . . . . . . . . . . . 52 4.2 Combining Operations . . . . . . . . . . . . . . . . . . . . . . . 60 4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 64 5.1 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 5.2 Experimental Strategy . . . . . . . . . . . . . . . . . . . . . . . 66 5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 5.3.1 Execution Time . . . . . . . . . . . . . . . . . . . . . . . 69 5.3.2 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . 73 5.3.3 Compile Time . . . . . . . . . . . . . . . . . . . . . . . 77 5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 6 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 6.1 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 6.1.1 Accelerator . . . . . . . . . . . . . . . . . . . . . . . . . 79 6.1.2 VENICE . . . . . . . . . . . . . . . . . . . . . . . . . . 82 6.1.3 Compiler Design . . . . . . . . . . . . . . . . . . . . . . 82 6.1.4 Evaluation Methodology . . . . . . . . . . . . . . . . . . 85 v 6.2 Ideas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 6.2.1 JIT mode . . . . . . . . . . . . . . . . . . . . . . . . . . 86 6.2.2 Instruction Scheduling . . . . . . . . . . . . . . . . . . . 87 6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 7 Conclusions . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . 88 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 vi List of Tables Table 4.1 Performance before and after combining operations for motion estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 Table 5.1 Benchmark descriptions . . . . . . . . . . . . . . . . . . . . . 65 Table 5.2 Speedups of compiler generated over hand-written code . . . . 70 Table 5.3 VENICE and single-CPU runtimes (ms) and speedups of VENICE vs. single CPU . . . . . . . . . . . . . . . . . . . . . 72 Table 5.4 Benchmarks’ natural data types . . . . . . . . . . . . . . . . . 73 Table 5.5 Speedups for benchmarks operating on byte vs. word . . . . . 73 Table 5.6 Speedups for benchmarks operating on half word vs. word . . 73 Table 5.7 Input sizes used for Multi-core and VENICE execution . . . . 75 Table 5.8 VENICE target compile time (ms) . . . . . . . . . . . . . . . 77 vii List of Figures Figure 1.1 Design goal for VENICE compiler . . . . . . . . . . . . . . . 4 Figure 2.1 VENICE architecture (gray vertical bars are pipeline registers) 13 Figure 2.2 Native VENICE API to add 3 vectors . . . . . . . . . . . . . 16 Figure 2.3 Extracting sub-matrix from a 2D array . . . . . . . . . . . . . 18 Figure 2.4 VENICE execution flow . . . . . . . . . . . . . . . . . . . . 20 Figure 2.5 VENICE pipeline structure . . . . . . . . . . . . . . . . . . . 21 Figure 2.6 Accelerator code to add 3 vectors . . . . . . . . . . . . . . . 24 Figure 2.7 Lowering memory transform operations . . . . . . . . . . . . 29 Figure 2.8 From IR to code generation . . . . . . . . . . . . . . . . . . 29 Figure 2.9 Sethi-Ullman labeling algorithm . . . . . . . . . . . . . . . . 31 Figure 3.1 Development flows with native VENICE and Accelerator . . . 33 Figure 3.2 Accelerator compiler flow . . . . . . . . . . . . . . . . . . . 34 Figure 3.3 Accelerator front-end flow . . . . . . . . . . . . . . . . . . . 35 Figure 3.4 Example on conversion from user-written code to IR graphs . 36 Figure 3.5 Target-specific optimizations . . . . . . . . . . . . . . . . . . 38 Figure 3.6 Establish evaluation order . . . . . . . . . . . . . . . . . . . 39 Figure 3.7 Reference-counting process . . . . . . . . . . . . . . . . . . 41 Figure 3.8 Convert IR to LIR . . . . . . . . . . . . . . . . . . . . . . . 43 Figure 3.9 Code generation flow . . . . . . . . . . . . . . . . . . . . . . 44 Figure 3.10 Memory transform examples . . . . . . . . . . . . . . . . . . 45 Figure 3.11 Data initialization for multiple memory transforms combined together . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 Figure 3.12 Example of generating code from LIR . . . . . . . . . . . . . 49 viii Figure 4.1 VENICE execution flow for memory-bound applications . . . 55 Figure 4.2 Impact of vector length selection for input size of 8192 (words) with different instruction counts . . . . . . . . . . . . . . . . 57 Figure 4.3 Impact of vector length selection for instruction count of 16 with different input data sizes . . . . . . . . . . . . . . . . . 58 Figure 4.4 Vector length look-up table for V16 . . . . . . . . . . . . . . 59 Figure 4.5 Complete VENICE compiler flow . . . . . . . . . . . . . . . 63 Figure 5.1 Speedups over Nios II implementation for compiler generated code and hand-written code . . . . . . . . . . . . . . . . . . 70 Figure 5.2 Hand-written compare and swap . . . . . . . . . . . . . . . . 71 Figure 5.3 Compiler generated compare and swap . . . . . . . . . . . . 
71 Figure 5.4 Scaling performance of the Accelerator multi-core target on AMD Opteron . . . . . . . . . . . . . . . . . . . . . . . . . 74 Figure 5.5 Scaling performance of the Accelerator multi-core target on Intel Xeon . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 Figure 5.6 Scaling performance of the Accelerator VENICE target . . . . 76 Figure 6.1 Accelerator compiler flow . . . . . . . . . . . . . . . . . . . 83 ix Glossary ALU Arithmetic Logic Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 ARBB Array Building Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 DAG Directed Acyclic Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 IR Intermediate Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 LIR Linear Intermediate Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 CSE Common Subexpression Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 PA Parallel Array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 SVP Soft Vector Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 GPU Graphics Processing Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 ISA Instruction Set Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 SOC System on Chip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 CUDA Compute Unified Device Architecture FPGA Field Programmable Gate Array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 API Application Programming Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 VL Vector Length . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 DMA Direct Memory Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 GPR General Purpose Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 SSE Streaming SIMD Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 SIMD Single Instruction, Multiple Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 x SPMD Single Program, Multiple Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .23 JIT Just-In-Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 xi Acknowledgments I would like to thank my supervisor Dr. Guy Lemieux for all his support, fanancially and academically, over the past three years. I am very lucky to become a student of Professor Lemieux and being involved in cutting edge FPGA technology research projects. This thesis would not become possible without his wealth of experience and knowledge, his innovative ideas, and his insights on the project. I want to give special thanks to Microsoft Accelerator group for accepting me as an intern and teaching me everything about the Accelerator system. They are also very generous and granted me access to part of the Accelerator source code, which makes it possible for me to conduct this study. Thanks to all the instructors of all the courses I’ve taken at UBC. 
I really appreciate their fascinating lectures, their patience for students, and their great kindness and help. All I’ve learned in the past three years have been solid foundation for the completion of my thesis, and will become a precious gift in the future of my life. Thanks to all the members of the SoC lab for the support and help. To David Grant, who inspired me and helped me on several course projects. To Aaron Severance, who provided valuable suggestions and continuously efforts on vector processor design and maintainance. To Chris Wang, who made jokes all the time and brought laughters to the lab. xii In the end, I want to give my appreciation to my parents who teach me right, and always support me in all aspects. To my husband Zheng, I feel so blessed for marrying you and having you always on my side taking care of me. xiii Chapter 1 Introduction 1.1 Motivation and Research Goals As modern processors have hit the power wall, we are embracing the parallel pro- cessing era with a wide range of hardware architectures, parallel programming languages, and auto-parallelizing compilers being extensively studied across the world. Today’s hardware offerings range from general-purpose chips with a few cores, to graphic processors that support large-scale data-parallel applications, to recon- figurable hardware with massive bit-level parallelism. Additionally, several spe- cialized Single Instruction, Multiple Data (SIMD) platforms such as vector pro- cessors are being actively investigated. The fast growth of novel architectures with diverse components and increas- ing levels of parallelism, either for general acceleration or specialized for certain domain applications, raises great challenges for software developer’s to effectively program them. 1 Using traditional sequential programming languages such as C/C++ to program modern parallel platforms creates substantial difficulty for compiler developers. It requires the compiler to identify the parallelizable part, analyze complex data dependencies, and orchestrate data movement and synchronization among parallel units in the target hardware architecture. This approach is complex and usually renders limited performance. Several specialized library languages, such as CUDA and OpenMP, have demonstrated some success. Most of these languages expose low-level details of the device architecture. The user has to perform operations such as explicitly spec- ifying data movement between different memory spaces, creating a certain number of threads, and synchronizing among threads to better exploit the processing re- sources. These explicit operations provide a great help to the compiler towards parallelizing the kernel. Many publications have shown that these languages can yield much easier compiler development and good runtime performance. However, they require users to apply knowledge of hardware details and advanced program- ming skills. These languages usually target only one platform, being either GPU, or Multi-core/Multi-threaded architecture, or computer clusters. Users have to learn a new language or even several new languages for each new device coming to the market. In addition, in order to optimize for performance, users are usually re- quired to learn hardware architecture in detail. This is obviously not a sustainable solution. The ideal solution is to move towards hardware independent, portable library languages based on existing primitive ones today. 
Hardware-independent lan- guages abstract away hardware details and present the user with a simple and high- level perspective. This higher abstraction is considered to be more portable and 2 less error-prone and it delivers higher programmer productivity. In addition, pro- grammers would prefer these languages to completely new ones because they are more similar to current primitive languages. There are several such projects being actively researched around the world. To name a few: Lime from IBM is a Java based language that can be ported among GPU, Multi-core and FPGA devices; Array Building Blocks (ARBB) from In- tel is a C++ based library that can be compiled to Multi-core and heterogeneous many-core architectures; and Microsoft Accelerator supports multiple primitive languages including C, C++, C#, and F#. Accelerator can be compiled to GPU, Multi-core, and FPGA using the same source code. These projects open the opportunity to provide effective, scalable solutions for future parallel architectures. For this thesis, it is compelling to investigate the feasibily of targeting one of these new high-level abstract languages to vector pro- cessors. In particular, the VENICE architecture [2] is one of the newest and unique vector architectures. VENICE is a research project conducted at the System on Chip (SOC) lab of the University of British Columbia. It has evolved through several generations since 2007 [2–5] and is presently being commercialized. VENICE inherited the SIMD model from general vector processing that provides massive data paral- lelism. It also provides great flexibility with low cost as being a Soft Vector Pro- cessor (SVP) running on Field Programmable Gate Arrays (FPGAS). It can be configured to instantiate multiple vector lanes, where each lane contains a 32-bit Arithmetic Logic Unit (ALU). In addition, it has several novel features: • It is smaller and faster than all previously published SVPs. 3 Figure 1.1: Design goal for VENICE compiler • It uses a multi-ported scratchpad memory for concurrent data accesses by the ALUS and the Direct Memory Access (DMA) engine. • Each ALU can be further fractured into sub-word ALUs that work concur- rently, enabling better performance on vectors of bytes or half words. • It can be programmed with a C-like inline assembly library system. • It provides user-level flexibility with multiple configurable components. • It uses a novel predicated move and flag system to handle data overflow and comparisons. Although the C-like inline assembly somewhat eases the programmability of VENICE compared to previous generations, it can still be difficult to program. For example, the two code fragments in figure 1.1 show native VENICE code on the left adding two vectors, and an idealized array-based object system similar to 4 Microsoft Accelerator describing the same computation on the right. The cumber- some assembly programming style, detailed hardware manipulation, and difficulty of debugging create the need for a higher-level programming model to provide simplicity, efficiency, and convenience. As explained, developing a new high-level language specially for VENICE is not practical. Programming VENICE with an existing portable higher-level abstraction could save development effort, and al- lows the user to write simple and short expressions at the same time. 1.2 Contributions This work presents a compiler design that serves as a vectorizing compiler/back- end code generator for VENICE based on Microsoft’s Accelerator framework. 
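To ground the discussion, the sketch below abbreviates the two styles contrasted in Figure 1.1: the native VENICE calls a programmer writes today, and the Accelerator-style expression this compiler accepts instead. It is a two-input variant of the full listings that appear later as Figures 2.2 and 2.6, and the two fragments would normally live in separate programs (a Nios II build and a host build).

#include "vector.h"        // native VENICE API, as in Figure 2.2
#include "Accelerator.h"   // Microsoft Accelerator, as in Figure 2.6
using namespace ParallelArrays;

// Native VENICE style: the programmer manages scratchpad allocation, DMA
// transfers, vector-length setup, and synchronization explicitly.
void add_native(int *A, int *B, int *D, int length)
{
    const int data_len = length * sizeof(int);
    int *va = (int *) vector_malloc(data_len);
    int *vb = (int *) vector_malloc(data_len);
    flush_cache_all();
    vector_dma_to_vector(va, A, data_len);
    vector_dma_to_vector(vb, B, data_len);
    vector_wait_for_dma();
    vector_set_vl(length);
    vector(VVW, VADD, vb, va, vb);        // vb = va + vb
    vector_instr_sync();
    vector_dma_to_host(D, vb, data_len);
    vector_wait_for_dma();
    vector_free();
}

// Accelerator style: the same computation as one array expression; the
// compiler described in this thesis emits code like add_native() from it.
void add_accelerator(Target *tgt, int *A, int *B, int *D, int length)
{
    IPA a = IPA(A, length);
    IPA b = IPA(B, length);
    IPA d = a + b;
    tgt->ToArray(d, D, length);
}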
Our main contributions are as follows: 1. To our knowledge, this is the first compiler designed for a vector processor with a scratchpad memory. VENICE operates on a scratchpad memory, which can contain a variable number of vectors with arbitrary vector length. The multi-banked scratchpad memory allows concurrent read/write by multiple ALUs and the DMA engine. The compiler simplifies the challenges raised by the novel features of VENICE by treating the scratchpad memory as a ‘virtual vector registerfile’. 2. The compiler produces highly optimized VENICE assembly, and achieves robust end-to-end performance that is close to, or even better than, hand-optimized code. The compiler applies a double-buffering technique to effectively hide memory-transfer overhead. It uses a modified Sethi-Ullman algorithm to determine the evaluation order of sub-expressions, and a reference-counting method to calculate the exact number of vector registers required by a computation. A profiling approach is adopted to tune compiler performance. 3. The compiler greatly improves the programming and debugging experience for VENICE. Before this work, VENICE could only be programmed using C-like inline assembly. Accelerator is a high-level abstraction language that eliminates the manual effort of assembly programming for operations such as memory transfers and synchronization of vector instructions. The Visual Studio debugger is also available for Accelerator, which avoids having to download code to an FPGA board and check results with print statements. Although data management for architectures with scratchpad memory has been studied before, and is still an ongoing research area, such work focuses exclusively on scalar architectures (single ALU) co-existing with data caches. There is also some work on compiler support for hard vector processors, such as VIRAM’s vcc compiler and Cray’s compiler, but these vector processor designs have a fixed-size register file that is divided into exactly N1 vectors with exactly N2 maximum elements each. In contrast, the scratchpad memory in VENICE is flexible and can be of arbitrary size. It supports an arbitrary number of vectors, each of an arbitrary length, subject only to the maximum capacity. Furthermore, the DMA engine connecting the scratchpad and main memory has a dedicated read/write port to the scratchpad. The DMA queue and vector instruction queue in VENICE allow the control processor to return immediately after dispatching a DMA or vector instruction. Together, these allow the scalar core, the DMA engine, and the vector engine to operate in parallel. Compiler design for an architecture with such novel features has not been studied before. The experimental results show that, across a set of selected benchmarks, the VENICE compiler delivers a speedup (using a Nios II processor as the baseline) of up to 362x with the biggest VENICE configuration. Compared to a modern Intel processor, it provides a speedup of up to 3.2x. Furthermore, the compiler-generated code achieves speedups between 0.81x and 2.24x relative to hand-tuned VENICE assembly code. In addition, when compared to the Accelerator Multi-core target, the compiler shows almost linear scaling from 1 to 64 ALUs, whereas the Multi-core target performance saturates beyond 4 cores. 1.3 Approach When searching for a suitable compiler framework for VENICE, ease-of-use and ease-of-compiler-development were the two most important factors under consideration.
In particular, it is desirable to have a clean implementation language for the user, and have parallelism as explicit as possible in order to avoid complex dependency analysis algorithms within the compiler. In comparison with other systems, Microsoft Accelerator has a number of fea- tures that make it ideal for our goals and it is chosen as the target high-level lan- 7 guage for our compiler: 1. It operates as an embedded library in C/C++, C# and F#. Therefore, it pro- vides the user with a certain amount of flexibility in choosing a language. 2. It has a vector-like description model which matches the VENICE architec- ture extremely well. In many cases, an almost direct translation approach can be used by the compiler. Special parallel-array data types simplify the task of identifying the parallelizable part and managing data movement between main memory and scratchpad memory. 3. The built-in Accelerator front-end provides an Intermediate Representation (IR) with sufficient static information for the back-end target to analyze, optimize, and parallelize. 4. It has built-in support for pluggable 3rd-party developed targets (back-end compilers), so adding a new target is easy. Most other systems do not open this option for 3rd-party developers. 5. It provides a sufficiently rich expressiveness to efficiently implement our benchmarks, and yet restricts users from using operations that would destroy data parallelism through unstructured array accesses. 6. Source code is accessible directly from Microsoft. In this thesis, the compiler operates as a source-to-source translation compiler. It compiles the Accelerator code into the low-level native VENICE code. The resulting C code needs to be compiled with gcc before being downloaded to an FPGA board. 8 As previously mentioned, the scratchpad memory is treated as a ‘virtual vector registerfile’. However, the number of vector registers and vector length can be arbitrary. They are dynamically decided by the compiler based on user input data array sizes and computation tree structure. A modified Sethi-Ullman algorithm is adopted to minimize the number of vector registers used by the generated code. A double-buffering technique is used to hide memory latencies caused by transferring data between scratchpad memory and main memory. Highly-efficient usage of the scratchpad memory is the main optimization goal of this work. 1.4 Thesis Organization The remainder of this thesis is organized as follows: Chapter 2 presents back- ground knowledge of the VENICE architecture, the VENICE native programming interface, the Microsoft Accelerator system, basic compiler algorithms this work is based on, and related works. Chapter 3 describes design details of this compiler including platform-specific optimizations and Chapter 4 shows how compiler per- formance is further improved. Chapter 5 describes the benchmarks used as well as experimental methods and results. Chapter 6 lists some future works under consid- eration. In the end, Chapter 7 concludes the whole thesis. 9 Chapter 2 Background This chapter will first introduce the VENICE vector processor architecture, with an emphasis on key features that are important to compiler design. Then it will discuss some state-of-art parallel programming languages and projects similar to Microsoft’s Accelerator, including Intel’s ArBB, IBM’s Lime, MATLAB’s parallel computing toolbox, and the VIRAM vcc compiler. The Microsoft Accelerator system and programming model will then be described. 
This chapter will also introduce some general compiler algorithms related to this thesis. 2.1 Vector Processors Vector processing has been applied on scientific and engineering workloads for decades [6, 7]. It exploits the data-level parallelism readily available in applications by performing the same operation over all elements in a vector or matrix. It is also well-suited for image processing, signal processing, and multi-media applications. The Cray-1 is the main ancestor of modern vector processors. Developed dur- ing the early 1970s, Cray was the first vector processor to adopt a RISC-like load/s- 10 tore architecture, where all vector operations are executed register-to-register. The Cray processor spanned several hundred circuit boards filling a large six-foot tall cabinet. VIRAM is a widely-cited embedded vector processor developed in the research labs at the University of California Berkeley [8]. It also uses a load/store architec- ture like Cray, but it is implemented on a single chip. 2.1.1 Soft Vector Processors Compared to hard processors or other complex heterogeneous architectures, soft vector processors running on FPGAs offer a fast, low-cost, and configurable solu- tion with ultra-high performance. They are designed to do data-parallel processing with less development effort than hardware design using VHDL or Verilog. The VIPERS soft vector processor by Yu from the University of British Columbia [3, 4] is the first published soft vector processor with configurable pa- rameters. It serves as a general-purpose data-parallel accelerator on an FPGA. It uses a Nios II-compatible multi-threaded processor called UT-IIe as control pro- cessor. The VESPA soft vector architecture developed by Yiannacouras at the Univer- sity of Toronto [9, 10] implements a MIPS-compatible scalar core with a VIRAM- compatible vector coprocessor. Like Cray and VIRAM, both VIPERS and VESPA adopt a load/store archi- tecture to transfer blocks of data into a vector data register file. In each case, the register file storage is divided evenly among a fixed number of vector registers. Hence, each vector register contains a fixed number of words. In order to offer dual read ports for simultaneous reading of two operands, the vector register file is 11 duplicated, which is costly for precious on-chip memory. The VEGAS project [5] made a significant departure from past vector archi- tectures. VEGAS shares most similarities with VENICE, which is the target archi- tecture of this thesis. It uses a Nios II/f as a control processor. Instead of vector data registers, VEGAS uses a large multi-banked scratchpad memory. Vectors in scratchapd are indexed by an 8-entry vector address register file. This organiza- tion allows for a large number of vectors in the scratchpad, but only 8 of them can be accessed immediately by any instruction; accessing the other vectors involves spilling the vector address register file contents. VEGAS uses DMA block-read and block-write commands to copy between the scratchpad and an off-chip main memory (DDR2). The scratchpad memory completely removes the load/store latencies and data duplication caused by using vector data registers. As a result, VEGAS operates memory-to-memory on the scratchpad, and can store a large number of vectors of arbitrary length up to the maximum scratchpad capacity. Furthermore, the ALUs can be fractured to support sub-word arithmetic. 
The ability to use arbitrarily long vectors and have more parallelism on byte or half word data make VEGAS much more efficient. 2.2 VENICE Architecture The VENICE soft vector processor was developed as an improvement to the VE- GAS [5] architecture. A block diagram of the VENICE architecture is shown in figure 2.1. Similar to previous SVPS, VENICE requires a scalar core as the control pro- cessor; in this case, a Nios II/f executes all control flow instructions. 12 D D R 2 D$ I$ Nios II/f CPU ABS Accum. MUL SHIFT ROTATE Align 1 Align 2 ALU EXTEND CMOV Align 3 VENICE Vector EngineInstruction Queue Scratchpad Memory 2kB - 2MB 2x clk Address Logic Altera Avalon Fabric (2nd pipe stage) DMA Custom Instruction Port Figure 2.1: VENICE architecture (gray vertical bars are pipeline registers) 13 VENICE vector operations are dispatched through the Nios custom instruction interface to the vector accelerator. Similar to VEGAS, the vector engine imple- ments a wide multi-ported scratchpad memory, in which vector operations take place, along with a DMA engine for controlling transfers to and from main mem- ory. The scratchpad memory has two read and one write ports dedicated to each vector lane (32-bit ALU). This allows reading two operands for ALU execution and writing one result back in every cycle. It also has a dedicated read/write port for the DMA engine, which allows reading/writing data in parallel with ALU oper- ations. The configurable number of vector lanes is the main form of parallelism for VENICE. All vector lanes receive the same vector instruction from the instruction queue and perform the operation concurrently. Similar to VEGAS, the 32-bit ALUs can be collectively fractured to support sub-word SIMD parallelism, enabling op- erations on vectors of bytes or half words with more parallelism. Memory-level parallelism and instruction-level parallelism are attained through concurrent op- erations of the scalar core, the DMA engine, and the vector core. The arbitrary vector length, as well as the ability to achieve memory-level and instruction-level parallelism raise interesting challenges for compiler design. Besides the inherited features, VENICE makes the following improvements upon the VEGAS architecture: • Unlike VEGAS, all vector address registers are removed. This saves the programmer from tracking and spilling the vector address registers. It would also be an extra layer of work for compiler. Instead of the vector address registers, C pointers are directly used as operands for vector instructions. • Scratchpad-based designs require alignment networks to allow for cases 14 where the input operands and output are not aligned. VENICE uses 3 align- ment networks in the pipeline to remove the performance penalty in VEGAS, which has only one alignment network. • Support for 2D data structures achieves a high instruction dispatch rate even in the case of short vectors. These 2D instructions eliminate the need for the auto-increment mode of the vector address registers in VEGAS. Further- more, since the 2D instruction can perform in a strided manner, it is easy to extract sub-matrices. • The shared multiplier/shift/rotate structure requires two cycles operational latency, allowing a general absolute value stage (the ABS block in figure 2.1) to be added after the integer ALU. As a result of these and other optimizations, the design was pipelined to reach speeds of 200MHz, which is roughly 50–100% higher than previous SVPs. 
This also allows the SVP to run synchronously at the same full clock rate as the Nios II/f. All of these unique features of VENICE are under consideration during the development of the compiler. 2.2.1 Native VENICE Programming Interface Using inline C function to program VENICE In order to simplify programming, C macros are used to make VENICE instruc- tions look like C functions without adding any runtime overhead. The sample code in figure 2.2 adds three vectors together using the native VENICE Application Pro- gramming Interface (API). 15 1 #include \"vector.h\" 2 3 4 int main() 5 { 6 const int length = 8; 7 int A[length] = {1,2,3,4,5,6,7,8}; 8 int B[length] = {10,20,30,40,50,60,70,80}; 9 int C[length] = {100,200,300,400,500,600,700,800}; 10 int D[length]; 11 12 // allocate space in the scratchpad 13 const int data_len = length * sizeof(int); 14 int *va = (int *) vector_malloc( data_len ); 15 int *vb = (int *) vector_malloc( data_len ); 16 int *vc = (int *) vector_malloc( data_len ); 17 18 flush_cache_all(); 19 20 // transfer from main memory to scratchpad 21 vector_dma_to_vector( va, A, data_len ); 22 vector_dma_to_vector( vb, B, data_len ); 23 vector_dma_to_vector( vc, C, data_len ); 24 vector_wait_for_dma(); 25 26 // perform vector setup and operations 27 vector_set_vl( length ); 28 vector( VVW, VADD, vb, va, vb ); 29 vector( VVW, VADD, vc, vb, vc ); 30 31 // wait and transfer results 32 vector_instr_sync(); 33 vector_dma_to_host( D, vc, data_len ); 34 vector_wait_for_dma(); 35 36 vector_free(); 37 } Figure 2.2: Native VENICE API to add 3 vectors Each macro dispatches one or more vector instructions to the vector engine. Depending upon the operation, these may be placed in the vector instruction queue, 16 or the DMA transfer queue, or executed immediately. A macro that emits a queued operation may return immediately before the operation is finished. This allows several instructions to be executed concurrently if they operate on different com- ponents. In addition, some macros are used to restore synchrony and explicitly wait until the vector engine or DMA engine is finished. As implied by figure 2.2, after initializing input data array (line 7-10), the VENICE programming model follows a few general steps: 1. Allocation of memory in scratchpad (line 14-16) 2. Optionally flush data in data cache (line 18) 3. DMA transfer data from main memory to scratchpad (line 21-23) 4. Wait for DMA transaction to be completed (line 24) 5. Setup for vector instructions, e.g., the Vector Length (VL) (line 27) 6. Perform vector operations (line 28-29) 7. Wait for all vector operations to be completed (line 32) 8. DMA transfer resulting data back to main memory (line 33) 9. Wait for DMA transaction to be completed (line 34) 10. Deallocate memory from scratchpad (line 36) The basic vector instruction format is vector(VVWU, FUNC, VD, VA, VB). The VVWU specifier refers to ‘vector-vector’ operation (VV) on 32-bit integer type data (W) that is unsigned (U). The vector-vector part can instead be scalar- vector (SV), where the first source operand is a scalar value provided by Nios. 17 These may be combined with data sizes of bytes (B), half words (H) or words (W). A signed operation is designated explicitly by using the signed specifier (S) or implicitly by omitting the unsigned specifier (U). Figure 2.3: Extracting sub-matrix from a 2D array One important feature of VENICE mentioned in the previous section is the ability to operate on two-dimensional arrays. 
This allows a smooth flow of vector operations on a 2D array instead of individual 1D operations on each row. The user can also operate on a sub-matrix of a 2D array by specifying strides for each vector operand or destination vector. For example, to extract the shaded m×n sub-matrix from the original M×N matrix in figure 2.3, the following vector setup instructions are needed: vector_set_vl(n); vector_setup_2D(m, 4*(N-n), 0, 4*(N-n)); The vector length is specified by the set_vl instruction, indicating that n elements will be processed for each row. In the setup_2D instruction, the first number, m, indicates the number of rows that will be involved in the following 2D vector instructions. The second number is the row stride for the destination vector, meaning 4*(N-n) bytes (N-n words) will be skipped at the end of each row after a vector length of n is written into the destination vector. The third number indicates there is no stride for the first source operand. The last number means 4*(N-n) bytes will be skipped at the end of each row for the last source operand. Now, adding a scalar value of 1 to the sub-matrix to obtain the right-side matrix in figure 2.3 can be as simple as: vector(SVW, VADD, VA, 1, VA); In addition to basic arithmetic, conditional moves of individual vector elements are achieved via flag bits. These flags are efficiently encoded into the 9th bit of every byte after a vector arithmetic instruction.1 The stored flag value depends upon the operation: unsigned addition stores the carry-out, whereas signed addition stores the overflow. Optimizing for performance The latest VENICE architecture can be configured with up to 64 vector lanes (V64) on a Stratix III chip. With sufficient data parallelism in the application, a V64 could have a 64x speedup over a V1 configuration in the ideal case. Since VENICE is built upon an FPGA, transferring data to and from main memory can be an expensive overhead. From our experience, a simple and effective technique to hide this memory latency is double-buffering. As stated in the previous section, user data has to be sent to the on-chip scratchpad memory in order to perform vectorized operations. Double-buffering allocates two buffers in the scratchpad memory for each user input array. First, the input arrays are partitioned into several pieces. Then the first piece of the input array is sent to scratchpad memory. After this is done, vector processing on the first piece of data happens at the same time as the transfer of the second piece of input data. By the time the results
Therefore, for the first vector instruction entering the pipeline, there will be a 6 cycle latency. Any instruction that has dependency on a previous one has to wait for 6 cycles as well to enter the pipeline. Within one instruction, all elements of a specified vector will stream through the pipeline with no stalls. Therefore, in order to keep high throughput of the vector engine, an as-long-as possible vector length is desired. However, this may expose the long data transfer overhead of the first and last pieces. Figure 2.5: VENICE pipeline structure 2.3 Vectorizing Compilers This section will introduce a few works similar to the Microsoft Accelerator system and compiler work in this thesis. VIRAM was the pioneer of applying the vector processor in the embedded do- main. The VIRAM vcc compiler was based on Cray’s compiler and was developed simultaneously with the VIRAM hardware. It is claimed to be able to automatically 21 vectorize C/C++ directly without any special pragma or language annotations in most cases. However, it still needed pragmas to help with analysis on applications with complex data dependencies [8]. A deep analysis of an incomplete version of the vcc compiler was performed in [11]. The results showed that the compiler could not detect vectorizable loops consistently. All results were derived from the Dongarra Loops instead of real world benchmarks. Results from using the VIRAM simulator were reported in [12]. However, there wasn’t enough evidence to prove the completeness and efficiency of the compiler. Intel’s ARBB [13] consists of a virtual machine and a C++ API that defines new parallel types such as collections to avoid direct data dependencies. These collections are treated like values and the Just-In-Time (JIT) compilation engine optimizes these and extract thread and data parallelism. It targets multi-core pro- cessors, GPU and Intel Many Integrated Core Architecture processors. IBM’s Lime [14] is a Java-based language capable of generating multiple low- level programming languages such as OpenCL for GPUs and Verilog for FPGAs. It relies on a user-built task graph and explicit data flow specification to identify the parallelizable kernel. It requires special value keywords to achieve data im- mutability and gets hints from user-declared private, local or global data types to manage complicated device memory systems. For the OpenCL target, Lime generates a mix of Java ByteCode and OpenCL code for GPU kernel computation. For the FPGA target, it generates a mix of Java ByteCode and Verilog code. The advantage of Lime is less restriction to applications since it uses standard array types. However, there is a substantial overhead caused by Java to C and C back to Java conversion [15]. It is reported that it could achieve equivalent performance of hand-tuned native OpenCL code 22 on GPUs. However, it only measures the computation kernel runtime without all the data transfer and other overheads introduced by the framework. Results for the FPGA target have not been published yet. MATLAB’s parallel computing toolbox provides a flexibility of combining hardware independent and hardware dependent code together to achieve maximum performance, which gives the user the choice of whether to get involved in hard- ware manipulation. It is available for multi-core processors, GPUs, and computer clusters. The built-in libraries could be a time saver for developers. It also pro- vides task parallelism among multiple applications and Single Program, Multiple Data (SPMD) mode. 
The disadvantage of the MATLAB PCT is that programs are not portable between GPU and multi-core or computer clusters. There is a vast amount of other works for new languages and its associated compilation and run-time techniques as well such as StreamIt [16], hiCUDA [17], X10 [18], Sponge [19], etc. They will not be introduced one by one in detail since they do not offer both high-level abstraction and diverse platform compatibility. 2.4 Microsoft Accelerator System The Accelerator system developed by Microsoft Research [20, 21] is a domain- specific language aimed at manipulating arrays. It presents to the user a set of high-level data parallel operations and object types that can be embedded in mul- tiple primitive languages such as C/C++, C# and F#. The key assets of the Accel- erator system is its ability to target entirely different devices with a single source language description. Accelerator provides a level of abstraction that completely hides hardware details from the programmer, and it auto-parallelizes the code for each target. It currently supports three different back-end targets: Graphics Pro- 23 cessing Units (GPUS) using DX9 and CUDA, Multi-core using Streaming SIMD Extensions (SSE), and FPGAs using VHDL [22]. The language and some com- piler front-end basics will be described in this section. 2.4.1 Accelerator Language Fundamentals 1 #include \"Accelerator.h\" 2 #include \"MulticoreTarget.h\" 3 4 using namespace ParallelArrays; 5 using namespace MicrosoftTargets; 6 7 int main() 8 { 9 Target *tgtMC = CreateMultiCoreTarget(); 10 11 const int length = 8; 12 int A[length] = {1,2,3,4,5,6,7,8}; 13 int B[length] = {10,20,30,40,50,60,70,80}; 14 int C[length] = {100,200,300,400,500,600,700,800}; 15 int D[length]; 16 17 // constructors copy user data to PA objects 18 IPA a = IPA( A, length ); 19 IPA b = IPA( B, length ); 20 IPA c = IPA( C, length ); 21 22 // assignment statements build an expression tree 23 IPA d = a + b + c; 24 25 // the ToArray() call evaluates the expression tree 26 tgtMC->ToArray( d, D, length ); 27 28 tgtMC->Delete(); 29 } Figure 2.6: Accelerator code to add 3 vectors This section serves as a brief introduction on how to program using Accelerator. Figure 2.6 shows sample code for Accelerator adding three vectors together. 24 Accelerator declares and stores data arrays as Parallel Array (PA) objects. The PA objects are largely opaque to the programmer and restrict them from manipu- lating the array by index. Five primitive parallel array data types are supported by Accelerator which are boolean, integer, float, double, and Float4 (a set of 4 floats). Note there is no native support for byte or halfword (short) data types. When an Accelerator PA object is instantiated, it is automatically initialized by making a copy of the original user array. Accelerator does a lazy functional evaluation of operations with PA objects. That is, expressions and assignments involving PA objects are not evaluated instantly, instead they are used to build up the expression graph. Loops and conditionals must be analyzable (i.e., not data- dependent on the parallel computation) so they can be fully unrolled, if necessary. At the end of a series of operations, the Accelerator ToArray() function must be called. This results in the expression graph being optimized, translated into native code using a JIT compilation process, and evaluated. 
At its core, Accelerator allows easy manipulation of arrays using a rich variety of element-wise operations, including both binary and unary arithmetic, compar- isons, logical operations, and type-conversions. It also supports reductions (sum, product, max, min, or, and) that can be applied to the entire array or just rows of a 2D array, as well as linear algebra operations (inner product and outer product). Finally, it also provides a number of transforms which can shift, expand, stretch, transpose, or otherwise modify the shape and relative positions of the vector/matrix entries. As shown in figure 2.6, with user data arrays declared and initialized (line 12- 14), programming in Accelerator is straightforward as follows: 25 1. Create a target (Line 9) 2. Create parallel array objects for each input data array (Line 18-20) 3. Write expressions with parallel array objects and operations from the Accel- erator library (Line 23) 4. Call ToArray() function to evaluate the result and copy it back to a regular C array (line 26) In figure 2.6, the CreateMCTarget() function indicates that a subsequent ToArray() call will be evaluated on a Multi-Core platform. The IPA type repre- sents an integer parallel array object. Similarly, the FPA type represents a floating point parallel array object. Here, A, B, and C are declared as input PA objects that are initialized by user arrays a, b, and c. D is the output PA which is converted to user array d in the end. The ToArray() triggers the compiler to start the evalu- ation of D. As stated previously, except for creating the proper target, the program is unaware of all hardware-related details. Targeting a different device can be done simply by changing the CreateMCTarget() function. 2.4.2 The Accelerator Front-end The Accelerator front-end [23] is a built-in process common to all targets of Accelerator. Accelerator uses deferred evaluation. It builds a Directed Acyclic Graph (DAG) that consists of operations and associated data but it does not im- mediately perform any computations until the evaluation method – ToArray() is called. The DAG expression graph is composed of singly-linked expression node objects. All of the loops will be analyzed and unrolled during this process 26 which might produce a large number of duplicated sub-expressions in the expres- sion graph. The expression graph will be handed over to the compiler front-end by a root operation node. After obtaining the full graph, the expression graph will be converted to an IR graph which is also a DAG. The IR graph is usually a near- duplicate of the expression graph but represented by a different object which serves as a working copy of the expression graph. The IR graph is easier to further analyze and optimize. After the IR graph is validated, the system performs initial optimizations to the graph including constant folding and Common Subexpression Elimination (CSE). The CSE process will detect all of the common sub-expressions, including those introduced by loop unrolling when building the initial expression graph, and mark necessary breaks on relevant IR nodes. One of the most important optimizations performed by the Accelerator front- end is analyzing all of the memory-transforming operations, combining them to- gether and binding them to a leaf node (input array) in the IR graph in the form of index bound objects. Memory-transforming operations here refer to operations that rearrange the elements of a PA object or change the array dimensions by adding or deleting elements. 
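The paragraphs that follow describe how the front-end records such transforms as ‘index bound’ objects attached to the leaf arrays instead of executing them as separate copy operations. As a rough sketch of that idea (the field, type, and function names below are invented for illustration and are not the actual Accelerator classes):

#include <vector>

// Instead of materializing a shift or rotate as a copy, the front-end can
// record an access pattern on the leaf array it applies to.
enum class Boundary { Wrap, Clamp, DefaultValue };

struct DimBound {
    int start;          // may be negative for an out-of-bounds shift
    int length;         // may extend past the original array extent
    int stride;
    Boundary boundary;  // how out-of-bounds accesses are resolved
};

struct IndexBounds {
    std::vector<DimBound> dims;   // one entry per array dimension
};

// A 1-D shift by k then becomes a change of 'start' on the leaf's bounds,
// so no Shift node needs to remain in the IR graph.
IndexBounds shiftBy(IndexBounds b, int k) {
    b.dims[0].start += k;
    return b;
}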
An index bound is an object that stores a certain access pattern of a PA object in the form of a start point, an array length, a stride value, and a boundary condition for each dimension of the data array. The start point could be a negative value if there is an out-of-bounds access. The array length could also exceed the original array boundary as well. All these ‘out-of-bounds’ situations will be handled according to the boundary condition, which can be ‘Wrap’ - assign values from the opposite edge of the array to the new elements, ‘Clamp’ - assign values from the nearest element to the new elements, or ‘DefaultValue’ - assign 27 a specified value to the new elements. The goal of this optimization is to put the transform information at the point that data is retrieved from memory, rather than performing the transform later as a separate operation. Figure 2.7 explains how the Accelerator front-end converts memory transform operations into index bound objects of leaf data nodes. Here, the shift operation performed on intermediate data X will be detected as a memory transform oper- ation. The front-end optimizer will perform a series steps upon this observation. First, the Shift and ’+’ operations are swapped, which results in the Shift op- erating on the two child data nodes. These Shift operations then can be pushed to leaf data nodes as attached bounds objects. Therefore, the Shift operations can be completely removed from the IR graph. This process is called a lowering process of memory transform operations. Multiple composable transforms can be combined together during the lowering process. In-composable transforms will be marked as forced breaks of the original IR graph. Memory transforms with the same boundary condition are considered to be composable. With all memory transforming operations pushed down to leaf nodes, no IR graph should contain any such operations. In the end, the original IR graph are subdivided into a collection of small graphs. Break points can be caused by the CSE process as common sub-expression, or by memory transformation analysis process for in-composable memory trans- forms, or by user-specified Evaluation points using the Evaluate() method. The Accelerator framework provides the target developer enough flexibility here to do any target-specific optimizations on the IR graphs from this point. After all the optimizations, the IR graph is ready for conversion to a Linear Intermediate Representation (LIR). The LIR is a linearized format that consists of operation 28 Figure 2.7: Lowering memory transform operations and data nodes in stack-machine order. The primary purpose of the LIR is to repre- sent an optimized IR graph in a format that can be readily converted to processor- specific code. The built-in common interface for IR to LIR conversion provides a convenient mechanism for generating processor-specific code. Figure 2.8 shows a conceptional flow from IR graph to final processor-specific code generation that is usually common to all targets. Figure 2.8: From IR to code generation All targets have their own back-end which starts from the semi-optimized IR. Each back-end compiler can perform its own target-specific optimizations on the 29 IR graphs before serialization to LIR. The code generation process will also be different for different devices. 2.5 The Sethi-Ullman Algorithm and Appel’s Generalization As introduced in previous section, the Accelerator front-end produces a set of IR graphs representing the whole computation. 
Each target needs its own back- end compiler to perform target-specific optimizations, code generation, and execu- tion. The pipeline structure of VENICE, introduced in section 2.2, indicates that short vector lengths cause idle cycles for ALUs and negatively affect performance. Therefore, a vector length beyond a certain value is usually favored for avoiding pipeline bubbles. In this thesis, the concept of ‘virtual vector registers’ is used to hold vector data in the scratchpad. Due to the limited on-chip scratchpad memory, a minimum number of virtual vector registers is desired. The Sethi-Ullman algorithm [24] is a common technique used in compilers to obtain an optimal ordering of a computation tree, which results in the fewest number of registers required for storing intermediate results. The algorithm can be applied to any binary computation tree. The algorithm works bottom-up on the tree. Each node is labeled with a ‘Sethi- Ullman number’, which is the number of registers required during its computation. This number is equal to 1 if the node is a leaf node, or the maximum of the num- bers of its children if they are unequal, or to the number of either child plus one if two children have equal numbers. During the code generation stage, each node will first recursively generate code for the child with a bigger number, then recur- sively generate code for the other child then generate its own operation instruction. 30 Figure 2.9 shows an example of how the labeling process is done. This is important because registers are usually a scarce resource in most archi- tectures. In this work, the entire scratchpad memory is treated as a set of fixed-size virtual vector registers. The fewer vector registers needed, the more memory can be assigned to each one. However, the IR produced by the Accelerator front-end is not always a binary tree. Therefore, a modified version of Sethi-Ullman algorithm is adopted. Appel and Supowit generalized the Sethi-Ullman algorithm [25] to accommo- date operations with three or more operands. In the case that a node has three or more children, it will be labeled the maximum of all its children’s numbers if they differ from each other. If there are two or more children with the same number that are equals to the maximum number of all children, each additional child node that has a number equal to the maximum number will add one to the parent node number. For example, a parent node with three children numbered 2, 2, and 2 respectively will be labeled 4. Figure 2.9: Sethi-Ullman labeling algorithm 31 Chapter 3 Compiler Implementation This chapter describes the end-to-end flow of the VENICE compiler, which serves as a source-to-source translator. It compiles Accelerator code into VENICE in- structions and stores them in a C file. The Accelerator built-in front-end is used for building the expression graph and initial conversion from expression graph to Intermediate Representation (IR). It also applies basic common optimizations to the IR. The VENICE back-end implementation, which is the main topic of this the- sis, starts from target-specific optimizations focusing mainly on the efficient usage of the scratchpad memory. Then it converts the IR to a stack-machine-style LIR. Code generation from LIR follows the steps for programming VENICE introduced in chapter 2. This chapter also talks about some limitations of the compiler. 
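A minimal sketch of this labeling rule follows. The Node type is a hypothetical stand-in for the Accelerator IR classes, and the refinement used by the VENICE back-end (scalar operands labeled 0, described in Chapter 3) is omitted here.

#include <memory>
#include <vector>

struct Node {
    std::vector<std::shared_ptr<Node>> children;  // empty => leaf (input array)
    int label = 0;                                // Sethi-Ullman number
};

// Appel's generalization: a leaf needs one register; an interior node needs
// the maximum of its children's labels, plus one for each additional child
// that ties that maximum (so children labeled 2, 2, 2 give 2 + (3 - 1) = 4).
int labelNode(Node &n) {
    if (n.children.empty())
        return n.label = 1;

    int maxLabel = 0, tiesAtMax = 0;
    for (auto &child : n.children) {
        int l = labelNode(*child);                // label bottom-up
        if (l > maxLabel)       { maxLabel = l; tiesAtMax = 1; }
        else if (l == maxLabel) { ++tiesAtMax; }
    }
    return n.label = maxLabel + (tiesAtMax - 1);
}

// During code generation, children are visited in decreasing label order,
// so the subtree that needs the most registers is evaluated first.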
32 Figure 3.1: Development flows with native VENICE and Accelerator 3.1 Compiler Flow 3.1.1 Compiler Overview Figure 3.1 shows the user flow for (a) programming VENICE directly, and (b) programming in Accelerator. Unlike Multi-core, CUDA and DX9 targets, which use a JIT compilation process, the VENICE back-end for Accelerator adopts an off- line approach similar to the FPGA target. It serves as a source-to-source compiler by writing out another C program that uses the VENICE APIs. When writing native VENICE code and debugging, the user has to download the compiled assembly to an FPGA board to check the correctness of results. In contrast, the existing JIT mode targets of Accelerator allow fast debugging within the Visual Studio debugger. Once the Accelerator code is complete, it only requires a one-line change of the code to switch to the VENICE target and generate code 33 Figure 3.2: Accelerator compiler flow for VENICE. The output is written into a C file, which can be compiled with gcc and downloaded to an FPGA board for execution. Figure 3.2 illustrates the high-level flow of Accelerator compilers. The Ac- celerator front-end (upper box in figure 3.2) is common to all targets. It builds an expression graph from user-written Parallel Array (PA) expressions, and pro- duces an Intermediate Representation (IR) that serves as a working copy of the expression graph. Optimizations can be performed on the IR. After the IR graph is produced, each back-end compiler (lower box in figure 3.2) performs target- specific optimizations on the IR before code generation. The remainder of this 34 section will walk through the steps in the back-end compiler needed to generate native VENICE code that efficiently uses the available scratchpad memory. The back-end first sets an evaluation order of each IR graph. Then, it counts the num- ber of vector registers needed across all the IR graphs. Next, the IR graphs are converted to a LIR format before final code generation. Steps in code generation include data allocation and initialization, data transfers, and mapping Accelerator APIs to VENICE instructions. 3.1.2 Accelerator Front-end Figure 3.3: Accelerator front-end flow 35 Figure 3.4: Example on conversion from user-written code to IR graphs 36 The Accelerator front-end completes several tasks shown in figure 3.3. It con- verts the user-written Accelerator code into an expression graph. The expression graph is copied to an IR graph for initial validations and optimizations. It performs constant folding, Common Subexpression Elimination (CSE), memory transform combination, and memory transform lowering. In the end, it sub-divides the IR graph into smaller IRs. Details on these processes of the front-end were introduced in chapter 2, and will not be repeated here. Figure 3.4 shows an example of how the Accelerator front-end produces the resulting IR graphs. In the example, two rotations of parallel array A are added by scalar values. The use of Evaluate() on these two additions forms two break points in the IR. The final resulting parallel ar- ray D is an expression of user input array A, and intermediate results B and C. The rotate operations in the original IR are combined into leaf node A. The IR is then sub-divided into three smaller IRs due to the user-specified early evaluation points. This example will continue to be used throughout this chapter to demonstrate the steps in the back-end compiler. 
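For reference, the figure 3.4 example can be reconstructed in the same shorthand used for Accelerator expressions elsewhere in this chapter. The rotation amounts, scalar constants, and the exact form of D below are illustrative, since the precise values are not given in the text, and the ToArray() call is shown schematically rather than with its real signature.

// Shorthand reconstruction of the running example; values are illustrative.
B = Evaluate(Rotate(A, [1])  + 2);   // first user-specified evaluation point
C = Evaluate(Rotate(A, [-1]) + 3);   // second user-specified evaluation point
D = A + B + C;                       // D is an expression of A, B, and C
result = ToArray(D);                 // triggers generation of the output C file

The two Evaluate() calls create the two break points, so the front-end hands the back-end three small IR graphs, one producing B, one producing C, and one combining A, B, and C into D, with both Rotate operations folded into the bounds of leaf node A.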
3.1.3 Target-specific Optimizations As the first step in the Accelerator back-end compiler, this sub-section describes some target-specific optimizations on the IR graphs in preparation for conversion to the LIR. It includes three steps shown in figure 3.5. The main goal of target- specific optimizations is the efficient use of scratchpad memory. The back-end follows the intuitive approach of treating the scratchpad space as a pseudo-vector- registerfile. Similar approaches can be found in [26, 27] for scalar architectures. It first uses an algorithm based on Appel’s generalization of the Sethi-Ullman algo- rithm introduced in section 2.5 to determine the evaluation order of each IR graph. 37 Figure 3.5: Target-specific optimizations Then it uses a reference-counting algorithm that walks through all the IR graphs in the established evaluation order and calculates the exact number of vector registers needed across all IRs. Since the entire scratchpad memory can be partitioned into arbitrary number of vector registers with arbitrary size, this number is crucial for optimized scratchpad memory partitioning. Constant Folding It is found beneficial to perform another round of constant folding in the back-end in addition to the existing pass done in the front-end. This is necessary because the Accelerator front-end is provided in pre-compiled form by Microsoft, so this part of the flow could not be modified due to limited access to the source code. Evaluation Order To simplify the management of scratchpad memory and code generation, the scratchpad is divided into the fewest equal-size ‘virtual-vector-registers’ that are 38 needed. Several techniques are employed to limit the number of vector registers used, so as to maximize their size. Chapter 2 introduced Appel’s generalization of Sethi-Ullman algorithm for de- termining the evaluation order of computation trees. The algorithm can handle op- erations that have three or more operands, where the Sethi-Ullman algorithm only handles binary computation trees. Since Accelerator contains functions that have more than two operands, Appel’s algorithm is adopted to determine the evaluation order of sub-expressions in each IR graph. However, modifications have to be made to accommodate the scratchpad archi- tecture. The Sethi-Ullman labels are used to represent the number of vector regis- ters in scratchpad memory. For common processor architectures, variables and im- mediate values usually require a register to temporarily store the value. However, in the VENICE architecture, all input data arrays are loaded to scratchpad memory. Immediate values (scalar values) come directly from the instruction queue, and do not take any space in the scratchpad. Therefore, all the scalar nodes are labeled 0. Figure 3.6: Establish evaluation order Figure 3.6 shows an example on how evaluation order is set for the IR graphs produced at the end of figure 3.4. In part a) of figure 3.6, all nodes are labeled 39 with the Sethi-Ullman number. The circled numbers in part b) indicate the final evaluation order of each IR graph. Register Counting In order to partition the scratchpad, the compiler must know the exact number of vector registers needed by the computation. The labeling is done individually on each IR graph. The compiler must combine all the IRs together to obtain a total number of registers. Furthermore, the labeling does not take vector register re-use into account. 
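The modified labeling just described, in which scalar operands contribute nothing and nodes may have more than two children, can be sketched as a small generalization of the binary rule. The IRNode type and its fields are assumptions for illustration.

#include <vector>

// Sketch of the Appel-style labeling used for VENICE virtual vector registers.
struct IRNode {
    bool isScalar = false;       // immediate operand: lives in the instruction queue
    bool isLeafArray = false;    // input data array: occupies one vector register
    std::vector<IRNode*> children;
    int label = 0;
};

int labelNode(IRNode* n) {
    if (n->isScalar)    return n->label = 0;   // scalars take no scratchpad space
    if (n->isLeafArray) return n->label = 1;
    int maxChild = 0, ties = 0;
    for (IRNode* c : n->children) {
        int l = labelNode(c);
        if (l > maxChild)                 { maxChild = l; ties = 0; }
        else if (l == maxChild && l > 0)  { ++ties; }
    }
    // Appel's generalization: each extra child tied at the maximum adds one,
    // so three children labeled 2, 2, 2 give a parent label of 4.
    return n->label = maxChild + ties;
}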
To reduce unnecessary data transfers, it is desirable to re-use vector registers as much as possible. A reference-counting method is adopted to achieve this goal. Initially, each leaf data node is associated with a vector register. Each vector resister may be re-used several times during the whole computation. To re-use a vector register, the back-end keeps a list of leaf data nodes, plus the number of re- maining references to each of them. Since each IR graph produces an intermediate result that will become the leaf node of other IR graphs later in the computation, not only are these nodes kept in the reference-counting list, but the back-end also keeps a list that records when an intermediate data node becomes available and when a data node dies after its reference count becomes 0. The back-end calculates the exact total number of vector registers needed using these two lists. This process follows the evaluation order established in the previous pass, and mimics the code generation process. Figure 3.7 demonstrates this register-counting process. The computation con- sists three IR graphs, which are results of figure 3.6. Node A is a user input array. Node B and C are intermediate results. Node D is the final output. When an IR 40 Figure 3.7: Reference-counting process is highlighted, it means the IR is being processed, and when an IR is in dashed outline, it has finished being processed. All of the IR graphs are already organized in a desired evaluation order. The value ‘active’ in the active list indicates a leaf 41 data node is reserving a vector register. The value ‘no’ means the opposite. Two variables are used to keep track of the number of registers used by each IR graph. NumLoads is the number of active leaf nodes that are reserving a vector register. NumTemps is the number of temporary registers requested by operation IR nodes that cannot re-use any of its children’s registers. The number of registers needed for leaf nodes (numLoads) is always the sum of total active data nodes in the active list. Part a) shows initially only user input array A is active. Therefore, numLoads is 1. Whenever an IR graph is completed, the temporary result will activate the corresponding intermediate leaf data node in the active list and update the numLoads. Part c) shows B becoming active after the first IR is completed, and numLoads becomes 2. Each time an input data node is consumed, its reference count will be reduced by 1. Whenever a reference count becomes 0, that vector register is no longer needed to hold an input array or in- termediate result and is available for re-use immediately. When encountering an operation node, the compiler will check if it can re-use any of its children’s regis- ters. If one of the children is an operation node, that register is always considered reusable. If all children are leaf data nodes, the reference count of each leaf node is checked to see if it has become 0. If no re-usable register is found, numTemps will be increased. In the example shown in figure 3.7, when the first ‘+’ node in the first IR graph is being processed, it finds that its left child A has a non-zero refer- ence count, and its right child is a scalar node. Therefore, it adds 1 to numTemps. The numTemps variable will be reset to 0 after an IR graph is processed. The total number of registers needed (numTotal) is the sum of registers used to hold leaf data nodes (numLoads) plus temporary registers needed by operation nodes that cannot re-use its children’s registers (numTemps). 
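A compact sketch of this counting pass is given below, using the numLoads, numTemps, numTotal and maxTotal names from the text; the surrounding data structures and traversal details are illustrative assumptions rather than the compiler's actual code.

#include <algorithm>
#include <vector>

struct LeafInfo { int refsRemaining; bool active; };
struct OpNode   { std::vector<int> leafChildren;   // indices into `leaves`
                  int operationChildren; };        // children that are op nodes

int countRegisters(std::vector<LeafInfo>& leaves,
                   const std::vector<std::vector<OpNode>>& irGraphs) {
    int maxTotal = 0;
    int numLoads = 0;                          // active leaves holding a register
    for (const LeafInfo& l : leaves) if (l.active) ++numLoads;

    for (const auto& graph : irGraphs) {       // graphs in evaluation order
        int numTemps = 0;                      // temporaries this graph needs
        for (const OpNode& op : graph) {       // nodes in evaluation order
            bool reusable = op.operationChildren > 0;   // an op child's register
            for (int idx : op.leafChildren) {           // consume leaf operands
                if (--leaves[idx].refsRemaining == 0) { // register frees up
                    leaves[idx].active = false;
                    --numLoads;
                    reusable = true;
                }
            }
            if (!reusable) ++numTemps;          // no child register to overwrite
            maxTotal = std::max(maxTotal, numLoads + numTemps);
        }
        // The graph's result activates an intermediate leaf for later graphs.
        // (In the real pass this leaf already exists with its true reference
        // count; it is created here with a placeholder count for brevity.)
        leaves.push_back({1, true});
        ++numLoads;
        maxTotal = std::max(maxTotal, numLoads);
    }
    return maxTotal;   // number of virtual vector registers to partition
}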
The maxTotal vari- 42 able keeps track of the maximum value of numTotal. At the end of processing all of the IR graphs, the maxTotal variable indicates the number of virtual vector registers needed to perform the computation. 3.1.4 Convert to LIR Figure 3.8: Convert IR to LIR The Accelerator framework recommends use of the LIR for the convenience of code generation. The LIR consists of node objects similar to IR nodes, but in a stack-machine style. Figure 3.8 shows how the optimized IR graphs are converted to LIR. The compiler creates LIR nodes in the established evaluation order. The bound objects associated with each input data, the total number of registers com- puted from previous section, and all other necessary information that is useful for code generation, are passed over to the LIR during the conversion. When the LIR is complete, it is ready for code generation. 43 3.1.5 Code Generation Figure 3.9: Code generation flow The code generation follows the steps required for VENICE assembly pro- gramming. This flow is depicted in figure 3.9. This sub-section will go through these steps one by one. Memory Allocation and Data Initialization The compiler calculates the size of each vector register and allocates memory in the scratchpad according to the number of vector registers computed from sec- tion 3.1.3. As discussed in chapter 2, one convenience feature of Accelerator is the efficient handling of out-of-bounds array indices that comes from memory- transforming operations such as Shift() and Rotate(). In the front-end, Ac- celerator combines memory transforms and propagates the array bounds to leaf data nodes, so the maximum extents are known. The back-end takes this infor- mation into account, and allocates extra memory in the scratchpad for these out- of-bound cases. All remaining scratchpad memory is allocated equally among the 44 Figure 3.10: Memory transform examples number of vector registers. The initialization stage copies user data to the output C file, and prepares for memory transforms by padding the input array with proper values for out-of- bounds accesses. Figure 3.10 demonstrates how input data padding is done. Part a) shows a rotation performed on a 1D array. The original array is white, with the out-of- bounds elements shaded. In this case, the last element 1 appears padded after the last element. The new 1 5 array formed by the Rotate() is indicated by the lower bold black bar. Therefore, a vector register size of 6 words will be allocated for A. Part b) shows a shift by 1 row and 1 column on a 2D array. Values past the bounds are initialized with the specified default value of k. The new 3 3 array is highlighted by a bold black box in the foreground. When allocating memory in scratchpad, a total of 44 words will be assigned to B. Each input data array might be referenced multiple times with different bound conditions. For example, if the 1D array A in figure 3.10 is rotated to the left and to the right by two different data nodes, not only will 1 be padded after the last element, but the element 5 is also padded before the first element. This is shown in 45 Figure 3.11: Data initialization for multiple memory transforms combined together the upper half of figure 3.11. Similarly, if the 2D array B in figure 3.10 is shifted up and left by 1, and later it is shifted down and right by 1, the compiler must prepare for both cases during data initialization. Therefore, the initialization of this 2D array will look like the lower half of figure 3.11. 
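The amount of scratchpad space to allocate for each such leaf follows directly from its combined bounds: per dimension, take the most negative start index and the largest end index seen across all of the leaf's uses. A sketch of that size computation is shown below; the real compiler's bookkeeping differs in detail.

#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch: words to allocate for one leaf's vector register. For figure 3.10,
// the rotated 1x5 array needs 6 words, and the shifted 3x3 array grows by one
// extra row and one extra column.
int paddedWords(const std::vector<int>& minStart,   // most negative start per dim
                const std::vector<int>& maxEnd,     // largest start+length per dim
                const std::vector<int>& extent) {   // original array extent per dim
    int words = 1;
    for (std::size_t d = 0; d < extent.size(); ++d) {
        int lo = std::min(0, minStart[d]);           // padding before the array
        int hi = std::max(extent[d], maxEnd[d]);     // padding after the array
        words *= hi - lo;
    }
    return words;
}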
However, if two shift operations on the 2D array in figure 3.11 need to be padded with different values, the in-composable memory-bound conditions will result in creation of a new 2D array. All scratchpad memory is freed at the end of the entire program. This is mainly because VENICE does not support partial de-allocation of the scratchpad memory. Data Transfer DMA transfer instructions are generated after memory allocation and data initial- ization. Input data arrays are transferred to the scratchpad memory all at once. A wait-for-DMA instruction is generated before vector instructions. All the results 46 will be transferred back to the main memory after vector instructions are com- pleted. Generate Vector Instructions The ability to manipulate arrays is intrinsic to both Accelerator and VENICE. Therefore, in many cases, a direct mapping of an Accelerator operation to a VENICE instruction for basic element-wise operations is possible. For example, the simple ‘+’ operator in Accelerator is mapped to VADD instruction in VENICE. In a few cases that Accelerator operators are not directly supported by VENICE, such as divide, modulo, power, and matrix multiply, pre-written library code is required. Memory transforms were discussed in the previous sub-section, and they were handled by properly initializing the input data array. Extra elements might be padded outside of the normal array bounds. In the code generation stage, the com- piler must extract desired data from the original array based on the bounds object associated with the individual data node. Take the arrays in figure 3.10, for example. With all data properly initialized, extracting partial data from the 1D array is simply done by adding an offset to the starting address in the scratchpad memory. The Accelerator expression: D = Rotate(A, [1]) + 2; will be translated into the following VENICE assembly: vector_set_vl(5); vector(SVW, VADD, VD, 2, VA+1); For 2D arrays, the row stride amounts can be adjusted to step over any padding elements at both ends. The Accelerator expression: 47 D = Shift(B, {1,1}) + 2; will be translated into the following VENICE assembly: vector_set_vl(3); vector_setup_2D(3, 0, 0, 1); vector_2D(SVW, VADD, VD, 2, VB+1); Accelerator provides APIs to do element-wise comparisons between arrays such as Max() and Min(). There is also a Select() API that does a conditional move operation. Chapter 2 introduced that VENICE supports conditional moves by us- ing flag bits. This feature is used to perform Accelerator comparison operations. For example, the Accelerator expression: D = Select(A-B, A, B); which moves an element in A to D if an element in A-B is greater than or equal to 0 and moves the element in B to D otherwise, will be converted into VENICE assembly as follows: vector(VVW, VSUB, VTemp, VA, VB); vector(VVW, VCMV_GTEZ, VD, VA, VTemp); vector(VVW, VCMV_LTZ, VD, VB, VTemp); The subtract instruction sets the flag bits associated with each element in VTemp according to whether VTemp is positive or negative. The first conditional move instruction VCMV_GTEZmoves elements in VA to VD if flag bits of VTemp are not set. The last line of code moves elements in VB to VD if flag bits of vector VTemp are set. Figure 3.12 shows an example of the final stage of compiler that generates a human-readable C program with VENICE APIs. 48 Figure 3.12: Example of generating code from LIR 3.2 Implementation Limitations The compiler does not support all of the Accelerator APIs. 
One of the reasons is that VENICE cannot perform floating-point arithmetic. Therefore, float and double data types are not supported by the compiler. All APIs that require these data types, such as Log() and Sin(), are hence omitted. Some rarely used operations, such as OuterProduct(), Replicate() and Pad(), are also not supported due to the low priority of the task during compiler development. On the other hand, there are several VENICE instructions that cannot be ex- pressed by Accelerator APIs, such as the multiply-high instruction which returns the upper N bits of an N N multiply. However, the generated C program is human-readable, and can be easily modified to utilize such features of VENICE that Accelerator cannot exploit. 49 As described in chapter 2, the Accelerator front-end has a built-in CSE algo- rithm to eliminate duplicated evaluation of sub-expressions that are considered to be expensive. However, VENICE is unlike other targets where sometimes com- puting a sub-expression locally may be faster than storing intermediate data into memory and reading from memory again. In VENICE, reading from and writing to the scratchpad memory requires no extra overhead. Any duplicated computation is an overhead. Therefore, the built-in CSE algorithm is not suitable for the VENICE target. Attempts have been made to replace the built-in CSE algorithm with a more aggressive version, but it is not yet integrated into the Accelerator front-end. Instead, CSE must be performed manually by the user using the Evaluate() operation. The ToArray() call triggers generation of a C file with VENICE APIs. Appli- cations that require multiple ToArray() calls require the manual effort of the user to merge multiple C files and the associated memory management. The scratchpad memory is treated as a medium that can be sub-divided into as many pieces as possible. Demand for a large number of vector registers will only result in a small size for each one. Demand for an exceptionally large number of registers might result in a register size that degrades performance. For extreme cases that require thousands of vector registers resulting in a register size that is smaller than 1 word, the compiler reports an error. 3.3 Summary The back-end compiler starts with an intermediate representation produced by the Accelerator front-end, and follows a flow recommended by the Accelerator frame- work. The back-end compiler performs a series of device-specific optimizations. 50 The key optimizations are the optimized evaluation order specific to VENICE re- sulting in a minimal number of vector registers required by the whole computation, and the precise calculation of the exact number of vector registers needed. This number is used to partition the scratchpad memory. These optimizations provide a maximum vector length for vector execution. The resulting C program with in-line VENICE API calls is generated from a linearized intermediate representation. The entire compiler implementation required approximately 1 year of work and approximately 6000 lines of code. This chapter has listed some of the limitations of the compiler. More will be discussed in detail in chapter 6, where possible future work is presented. 51 Chapter 4 Performance Enhancement This chapter describes two improvements made to enhance the performance of the VENICE back-end compiler. First, the compiler handles the case when data arrays cannot fit into the scratchpad memory by sub-dividing data arrays and performing data transfers in a double-buffered fashion. 
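The double-buffered schedule the compiler emits follows the usual ping-pong pattern: while one segment is being processed, the DMA engine fetches the next segment and drains the previous result. The sketch below is in the style of the generated C code; the allocation and DMA helper names (spad_alloc, dma_read, dma_write, dma_wait) are placeholders rather than the exact VENICE API spelling, VL and num_segments are assumed to be set by the compiler, and the DMA queue is assumed to preserve ordering with the vector instructions it overlaps.

// Ping-pong double buffering over num_segments segments of VL words each.
int *in_buf[2]  = { spad_alloc(VL), spad_alloc(VL) };
int *out_buf[2] = { spad_alloc(VL), spad_alloc(VL) };

dma_read(in_buf[0], input, VL);                        // fetch first segment
for (int s = 0; s < num_segments; s++) {
    int cur = s & 1, nxt = cur ^ 1;
    dma_wait();                                        // segment s has arrived
    if (s + 1 < num_segments)                          // queue the next fetch so it
        dma_read(in_buf[nxt], input + (s + 1) * VL, VL);   // overlaps the compute below
    vector_set_vl(VL);
    vector(SVW, VADD, out_buf[cur], 2, in_buf[cur]);   // vector work on segment s
    dma_write(output + s * VL, out_buf[cur], VL);      // write-back overlaps next iteration
}
dma_wait();                                            // drain the final write-back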
Based on the input sizes and number of operations in the application, the compiler references a look-up table that stores the near-optimal vector lengths to partition the data arrays. These near-optimal vector lengths are obtained by profiling on synthetic benchmarks. Second, the back-end compiler is able to reduce vector-instruction count by combining multiple Accelerator operations into one VENICE instruction. Experi- mental results are provided to demonstrate the impacts of these optimizations. 4.1 Dynamic Partitioning As mentioned in chapter 3, when the VENICE back-end performs target-specific optimizations, it always precisely calculates a minimal number of vector registers needed by the computation, so as to maximize register sizes. This is because the 52 vector-register size directly becomes the vector length for vector execution, which affects the pipeline utilization and end-to-end runtime performance. In the case where the user array size exceeds this maximized register size, the back-end needs to sub-divide data array into smaller segments. In order to hide data-transfer over- head, data movement can be done in a double-buffered fashion by pre-fetching the next-needed pieces. This fully exploits the capability that DMA transactions and vector computation can take place at the same time. Overlapping the two can al- most completely hides the overhead of the data transfers. However, even when the entire computation can be performed on VENICE without being segmented, applying double-buffering may still be beneficial. Unlike a GPU or multi-core architecture, memory latency could dominate VENICE’s performance. The Multi-core target relies upon caching to hide mem- ory latency. GPUs rely upon switching between thousands of threads to tolerate memory latency. However, VENICE uses a DMA engine to transfer data between scratchpad memory and main memory. Although the DMA engine owns a ded- icated 4-byte read/write port to the entire width of the scratchpad memory, the main memory doesn’t support multiple memory banks. This constrains the mem- ory bandwidth to be at most 32 bytes, which is the maximum bandwidth of the main memory at the board-level in the DE3 development system used by this work. For smaller VENICE configurations using less than 8 lanes (4-byte wide each), the bandwidth will be further limited by connections between DMA and scratchpad memory. For example, a VENICE V4 only has a memory bandwidth of 16 bytes. Furthermore, in addition to the bulk data transfer time, there is the additional la- tency from the DMA instruction being dispatched until the data actually arrives the destination. Taken together, these factors strongly impact VENICE’s performance. 53 Trade-offs have to be considered when performing double-buffering. Treating the entire input data as a whole will result in no overlap for data transfer and vector operations. With all the memory transfer overhead being exposed, performance will be significantly degraded, especially for small benchmarks. On the other hand, dividing data array into too small segments will result in a short vector length that cannot maintain high throughput in the VENICE pipeline. The initiation of memory transfer will also introduce additional delays. Taking another look at the VENICE execution example in figure 2.4, vector execution time is shown taking longer than the memory transfers. 
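For intuition only (this simplified model is not what the compiler uses; as described below, it relies on a profiled look-up table instead), the total runtime of a computation split into $k$ double-buffered segments behaves roughly as

T \approx T_{\mathrm{in}}(1) \;+\; \sum_{i=1}^{k} \max\bigl(T_{\mathrm{vec}}(i),\, T_{\mathrm{dma}}(i)\bigr) \;+\; T_{\mathrm{out}}(k),

where $T_{\mathrm{in}}(1)$ is the transfer of the first input segment, $T_{\mathrm{out}}(k)$ is the write-back of the last result segment, and each middle term is dominated by whichever of vector execution and DMA is slower for that segment.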
Most of the overhead comes from transferring the first segment of the input data to scratchpad and transferring the last segment of the result back to main memory. A smaller vector length might actually reduce these overheads and improve performance. In contrast, applications can also be memory-bound as demonstrated in fig- ure 4.1. Memory transfers become dominant, and vector executions are hidden by the memory latency. The total runtime is basically the time for transferring all the inputs to scratchpad and transferring the results back. Due to the additional latency introduced by each data transfer instruction, the less data transfer instructions ini- tiated, the less time it takes to complete transferring the entire input and output. In these cases, a larger vector length is preferred to reduce the total overhead of initiating all data transfers. Based on these observations, an assumption can be made that for each con- figuration of VENICE, there should always be an optimal vector length for each application that will achieve the best balance of the data transfer and vector execu- tion and yields a minimal total runtime. As mentioned in chapter 2, VENICE has 7 pipeline stages in total, which causes 54 Figure 4.1: VENICE execution flow for memory-bound applications a 6 cycle delay between any consecutive instructions with a data dependency. Since Accelerator does not require the programmer to be aware of any hardware details, which may affect parallelism, data dependencies are prevalent in many bench- 55 marks. 1 The VENICE back-end does not perform deep analysis on such data dependencies as this would radically change the structure of a given computation tree. Instead, this work focuses on the workload balancing optimization by care- fully choosing a vector length that is large enough to limit the impact of pipeline bubbles caused by data dependencies and also small enough to avoid long latencies caused by the first and last piece of data transfers. For each application, data-transfer time and vector-execution time can be ex- pressed by complex equations based on the input data sizes, the number of opera- tions performed on each data element, the number of vector lanes available, mem- ory bandwidth, memory latency, function call overhead, and the to-be-determined vector length. Each application can be further categorized into three different sit- uations based on the data-transfer and vector-execution time – compute-bound, memory-bound, and balanced. Each situation would use a different equation to cal- culate a vector length that achieves the shortest total runtime. The calculation might involve deciding clear boundaries of variables, calculating derivatives of equations, and computing a vector length that minimizes the equation. However, the sensi- tivity of this approach to the accuracy of the equations and detailed modeling of the the VENICE architecture and application characteristics could easily make this method imprecise or incorrect. Therefore, instead of designing a complicated algorithm to compute the opti- mal vector length for each program, a simple look-up table is created containing selected vector lengths for each pair of input size and operations per data element (instruction count). In order to achieve this goal, a synthetic benchmark is used 1Accelerator assumes there is sufficient data -level parallelism in each operation. Hence, it does not focus on parallelism between operations. 56 with controllable input sizes and number of vector instructions. 
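The synthetic benchmark can be pictured as a generated kernel with the same shape as the compiler's usual output, parameterized by input size, vector length, and the number of back-to-back dependent vector instructions. The code below is an illustrative reconstruction, not the exact program used for profiling; the DMA helper names are placeholders and double-buffering details are omitted for brevity.

// num_instructions dependent vector operations applied to input_size words,
// processed VL words at a time; timing this loop fills one cell of the table.
for (int s = 0; s < input_size / VL; s++) {
    dma_read(v_in, input + s * VL, VL);           // fetch one segment
    dma_wait();
    vector_set_vl(VL);
    vector(SVW, VADD, v_tmp, 1, v_in);            // first operation
    for (int i = 1; i < num_instructions; i++)
        vector(VVW, VADD, v_tmp, v_tmp, v_tmp);   // dependent back-to-back ops
    dma_write(output + s * VL, v_tmp, VL);        // write the segment back
}
dma_wait();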
By sweeping each parameter, a performance table recording the runtime for each given input size and number of instructions is derived. The vector length that results in the lowest runtime is stored in the final look-up table. Figure 4.2: Impact of vector length selection for an input size of 8192 words with different instruction counts Figure 4.2 shows the number of clock cycles required when performing a different number of operations on each input array element with a fixed size (8192 words) of input data. The Y axis is the total cycle count; a smaller number indicates better performance. The X axis is the vector length, i.e. the vector register size used. When there are only 1-2 vector instructions, the application is memory-bound, and a vector length of 4096 words is preferred since it is the biggest size that can be used with double-buffering. When more instructions are involved, for example 5-8, the application moves towards compute-bound, and a vector length of 1024 words has the best performance. This choice provides a good balance between memory transfer and vector execution. When the input size changes, the optimal vector length also changes. Figure 4.3: Impact of vector length selection for an instruction count of 16 with different input data sizes Figure 4.3 shows the cycle count per element when a fixed number of vector operations (16) is performed with different input sizes and different vector lengths. The Y axis is switched to cycle count per element in order to compensate for the large differences in total cycle count. In figure 4.3, almost all cases are compute-bound. It shows that the time for transferring the first and last piece of data becomes less noticeable as the input size increases to a large value. A larger vector length is again preferred for fewer pipeline bubbles.

Optimal buffer sizes for V16 (vector length in words)

                          Input data size (words)
Instruction    8192  16384  32768  65536  131072  262144  524288  1048576
count
   1           4096   8192   8192   8192    8192    8192    8192     8192
   2           4096   8192   8192   8192    8192    8192    8192     8192
   3           2048   2048   4096   4096    8192    8192    8192     8192
   4-16        1024   2048   2048   4096    4096    8192    8192     8192

Figure 4.4: Vector length look-up table for V16

Several tables similar to figure 4.2 and figure 4.3 were collected for each VENICE configuration. A set of these tables forms a final look-up table, like the one shown in figure 4.4, for each VENICE configuration. The vector length (in words) that gives the best performance is listed in the table. The compiler hashes the computation tree into an entry in the table based on its input size and number of operation nodes. As can be seen from the table, when the instruction count exceeds a certain number, the vector length saturates. Since it is impossible to have a table with an infinite number of rows, the table stops at 16, a point where most of the vector lengths have saturated. Similarly, when the input array exceeds a certain size, the optimal vector length stops changing. Therefore, a 16x8 table like figure 4.4 is used for each VENICE configuration.
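At compile time, the choice then reduces to a table lookup. A sketch of the selection step is shown below; the table layout corresponds to figure 4.4, and the exact rounding of out-of-range values is simplified here (the behaviour the compiler actually uses is described next).

#include <algorithm>

// Sketch: choose a vector length from the profiled 16x8 table for one VENICE
// configuration. Row = instruction count, column = input size in words.
int chooseVectorLength(int instrCount, long inputWords,
                       const int table[16][8], const long sizeColumns[8]) {
    int row = std::min(std::max(instrCount, 1), 16) - 1;    // counts saturate at 16
    int col = 0;
    while (col < 7 && inputWords > sizeColumns[col]) ++col;  // snap to a nearby column
    return table[row][col];
}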
For values that fall outside of the boundaries in the look-up table, they are simply rounded to the closest value in the table. Vector lengths are chosen to be power of 2 because these values used most often in hand-written code. A more fine-grained vector lengths might be beneficial. However, the look-up tables only serve as references when choosing a vector length for real applications. Pursuing a perfect optimal point from a synthetic program is meaningless. In addition, curves in figure 4.3 and figure 4.2 become nearly flat when they get close to the lowest point, which leaves a wide margin for a vector length that leads to the near-optimal performance. Using double-buffering will increase the total number of vector registers be- cause two vector registers will be assigned to each user input data array. 4.2 Combining Operations Although Accelerator and VENICE are both designed for operating on data arrays, the two instruction sets do not match perfectly. As mentioned in chapter 3, the back-end sometimes has to write a sequence of code for certain Accelerator APIs. There are also opportunities where VENICE benefits from combining certain short sequences of simple operators into a single 60 compound VENICE operation. For example, VENICE has an absolute-value unit following each ALU within the pipeline. Taking an absolute value on the result of most vector instructions is free. This does not apply to multiplication and shift op- erations, which take place in the two-cycle multiply ALU. Therefore, the compiler can take advantage of this feature, and reduce instruction count in order to increase performance. For example, an Accelerator expression that takes an absolute differ- ence of parallel array A and scalar value of 1 like: D = Abs(A - 1); can be translated into one vector instruction: vector_abs_2D(SVW, VSUB, VD, 1, VA); This also applies to accumulate operations as well, since VENICE has a ded- icated accumulator inside the pipeline. The compiler will walk through all the IR graphs, find these sequences of operations, and replace them with new VENICE operations. Out of the 6 selected benchmarks, motion estimation benefits the most from this optimization, because it calculates the absolute differences of two image frames. A total instruction count reduction of 30% is achieved, which saves the execution time by 32% (a 1.47 speedup) on average across different VENICE configurations. Table 4.1 shows the performance improvement for motion estimation in detail. V4 V16 V64 runtime before combine operators (ms) 2.03 0.55 0.30 runtime after combine operators (ms) 1.36 0.37 0.21 speedups 1.49 1.48 1.43 Table 4.1: Performance before and after combining operations for motion es- timation 61 4.3 Summary This chapter walks through a few target-specific optimizations made upon the ba- sic implementation described in chapter 3. Experimental results showed solid evi- dence on how the selection of vector length affects runtime. It demonstrated that, for each application, there exists a vector length that achieves optimal performance based on its input array sizes and the number of operations performed on each array element. In addition, to take advantage of special function units in the VENICE architecture, such as the absolute value unit, the compiler makes an additional pass to combine certain sequences of Accelerator operations into a single VENICE in- struction. This further improved performance on some of benchmarks. Figure 4.5 summarizes the final compiler flow with all the optimization stages. 
Figure 4.5: Complete VENICE compiler flow

Chapter 5

Experimental Results

This chapter introduces the benchmarks selected to test the quality of the compiler design, followed by the experimental strategy. Runtime results will be presented showing that the compiler-generated code achieves performance close to, or better than, human-optimized assembly. By comparing to execution time on a single Intel core, VENICE is shown to be competitive with modern processors. Finally, evidence is given to show that the VENICE target delivers better scalability than the Multi-core target of Accelerator.

5.1 Benchmarks

To measure the quality of the compiler, a set of 6 highly data-parallel benchmarks is selected to test how well it performs on a variety of applications on both small and large SVPs.

fir (Finite Impulse Response filter). Input 8192, output 8176, instruction count 32. Fixed-point arithmetic; multiply and add operations on 16 shifted variants of a 1D input array.

2D fir (two-dimensional FIR filter). Input 320x240, output 318x238, instruction count 18. Fixed-point arithmetic; multiply and add operations with a tap size of 3x3.

life (Conway's Game of Life). Input 256x256, output 256x256, instruction count 11. Integer arithmetic; sums the values of neighboring grid cells and applies comparison operations. Performs one iteration on the initial state of the grid.

imgblend (image blend). Input 320x240x2, output 320x240, instruction count 4. Fixed-point arithmetic; combines two images into one by multiplying the pixels of the two input images by different coefficients.

median (median filter). Input 128x128, output 124x124, instruction count 678. Integer arithmetic; replaces each pixel of the input image with the median of a 5x5 nearby window. The median value is calculated using a bubble sort algorithm.

motest (motion estimation). Input 48x48, output 32x32, instruction count 512. Integer arithmetic; sum of absolute differences between the input image and a sample image within a 16x16 2D window.

Table 5.1: Benchmark descriptions (input and output sizes in words)

Table 5.1 lists the set of benchmarks used for performance evaluation. These highly data-parallel applications are shown to scale fairly well. The selection of benchmarks was also influenced by the ease of expressing the computation using the subset of Accelerator APIs supported by the VENICE target. Other applications are expected to benefit in the same way. The benchmarks cover both 1D applications, such as fir, and 2D applications, such as motest. Some of the benchmarks are memory-bound, such as imgblend, while others are compute-intensive, such as median. The instruction count column indicates the number of operations performed on each element of the input array. Medium input sizes are used for all benchmarks, which provides enough data parallelism and fast collection of results.

5.2 Experimental Strategy

In order to demonstrate the validity, effectiveness and efficiency of the compiler, experiments are designed to answer the following questions: 1. Is the generated code correct? 2. Is the generated code of good quality? 3. How well does the performance scale over SVPs of different sizes? 4. How much overhead does the compiler introduce? To answer these questions, all the benchmarks are coded in both Accelerator and native VENICE assembly. The assembly code is carefully optimized for performance following the principles described in chapter 2. During the result-collecting stage, some of the hand-written benchmarks were found to be much slower than the compiler-generated ones. In these cases, they were re-written to get better performance.
Both compiler-generated code and human-optimized assembly are run on VENICE with 4 different configurations – V1, V4, V16, and V64. The number after V means the number of ALUs instantiated in the vector engine. This also affects the size of scratchpad memory, since a fixed 8KB of memory is as- signed to each vector lane. V64 is an exception with only 4KB memory available for each vector lane. This is due to the limited on-chip memory in the Stratix III FPGA used by this work. Due to a shortage of DSP blocks, V64 is also modified to fit in a DE3 system by removing 8-bit and 16-bit multiply support; Accelerator does not support these modes either. The execution time for human-optimized assembly and compiler-generated code are compared side by side to see if the compiler could intelligently translate Accelerator code into VENICE assembly. Performance of the same benchmark running on VENICE with a different number of ALUs also tests if the compiler can generate scalable code. The scalability of the code generated by the VENICE back-end is also com- pared to performance of Accelerator’s Multi-core target. Ideally, the experiments should show the scaling speedups from the Multi-core target when it uses a differ- ent number of CPUs on the same machine. However, since Accelerator is designed to completely hide hardware information from the user, it is impossible to directly control resource usage. The Multi-core target always uses all resources available in the hardware. Instead, the VirtualBox hypervisor is used. Varying hardware con- figurations for the multi-core device is accomplished by assigning different number of processors to the virtual machine. 67 Due to the inseparable JIT time and relatively high JIT overhead of the Multi- core target, larger input arrays are used for multi-core execution. Since only scala- bility is being compared, the change of input sizes should not affect the validity of the results. The Multi-core target always generates SSE3-compliant code, which mostly works with single-precision floating-point data. This lack of efficient inte- ger arithmetic in SSE3 forces the Multi-core target to unpack values already loaded to the SSE registers and send them to general purpose registers. The unpacking oc- curred in almost every operation, which becomes a huge overhead. To avoid this drawback, the benchmarks are re-written to use float data type for the Multi-core target. This change of data type is considered to be harmless since float and inte- ger types are of the same size (4 bytes), which puts the same burden on memory accesses. However, floating-point arithmetic is not supported by VENICE. Also because Accelerator does not support the byte or short data types, integer is used as a unified data type for all benchmarks in VENICE assembly and the VENICE Accelerator back-end. The best runtime results of VENICE produced by the compiler-generated code are selected to compare to an Intel CPU to show that even running at a low fre- quency, VENICE can be competitive with a modern CPU running at a much higher frequency. Each benchmark self-verifies the results of vector execution against a simple, optimized sequential C solution. This sequential C code is used for both the single- core Intel CPU and the Nios II execution baseline. All soft processor results are measured by running the benchmarks on an Al- tera DE3-150 development system with a Nios II/f processor running at 200MHz. 68 The Nios II/f processor is configured to use a 4kB instruction cache and a 4kB data cache. 
All vector engines run at a fixed clock rate of 100MHz. (Small SVPs such as V4 can run at a much higher frequency, but the maximum frequency drops quickly when the vector engine increases to 64 lanes.) Single-CPU results are derived from an Intel Xeon E5540 running at 2.53GHz. The Multi-core target is tested on two machines. One is the same Intel Xeon machine used for sequential execution. This Intel machine has 4 cores and an 8MB cache; VirtualBox is configured to use 16GB of memory. The other is an AMD Opteron 6128 machine with a total of 32 cores, each running at 2GHz. There is only a 512KB cache on the AMD machine. The 32 cores are distributed among 4 sockets with 8 cores per socket; 16GB of memory is assigned to the VirtualBox on the AMD machine as well.

5.3 Results

5.3.1 Execution Time

Speedups for the different VENICE configurations – V1, V4, V16, and V64 – over the serial Nios II/f implementation are shown in figure 5.1. All VENICE results are the minimum runtime of 10 consecutive runs. For both hand-written and compiler-generated code, only the part related to vector execution, from allocating memory in the scratchpad until all scratchpad memory is freed, is timed. Time for code generation and initialization of input arrays is not included.

Figure 5.1: Speedups over Nios II implementation for compiler-generated code and hand-written code

Table 5.2 is a detailed listing of speedups of compiler-generated code versus hand-written code on the 4 VENICE configurations. As shown, the compiler-generated code out-performs the manually-written code in most cases. This is because Accelerator puts more effort into the optimization process than a human:

1. Accelerator fully unrolls inner loops to reduce loop overhead;
2. Accelerator carefully chooses the vector length by a profiled look-up table rather than conservatively rounding down or guessing;
3. Accelerator always double-buffers data transfers when necessary;
4. Accelerator inlines all function calls to avoid function call overhead.

        fir    2Dfir  life   imgblend  median  motest  geomean
V1      1.04   0.97   1.01   1.00      0.99    0.82    0.97
V4      1.01   1.12   1.10   1.02      1.07    1.01    1.05
V16     1.09   1.12   1.38   0.90      0.96    1.01    1.07
V64     1.30   1.42   2.24   0.92      0.81    1.04    1.22

Table 5.2: Speedups of compiler-generated over hand-written code

The fastest, life, achieves a 370x speedup compared to a Nios II/f, and a 2.24x speedup compared to the hand-written code on V64. However, humans can sometimes do far better than the compiler: motion estimation can achieve another 1.5x speedup if the VENICE accumulator is used in a clever way, where the compiler cannot do so automatically.

input_type *v_min = v_input1;
input_type *v_max = v_input2;
// copy min values to tmp
vector( VVW, VOR, v_tmp, v_min, v_min );
// v_sub flags are set if max