An Analytical Model of Logic Resource Utilization for FPGA Architecture Development by Andrew H. Lam B.A. Sc., University of Toronto, 2006 A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Applied Science in The Faculty of Graduate Studies (Electrical and Computer Engineering) The University of British Columbia (Vancouver) February, 2010 c© Andrew H. Lam 2010 Abstract Designers constantly strive to improve Field-Programmable Gate Array (FPGA) per- formance through innovative architecture design. To evaluate performance, an un- derstanding of the effects of modifying logic blocks structures and routing fabrics on performance is needed. Current architectures are evaluated via computer-aided design (CAD) simulations that are labourious and computationally-expensive experiments to perform. A more scientific method, based on understanding the relationships be- tween architectural parameters and performance will enable the rapid evaluation of new architectures, even before the development of a CAD tool. This thesis presents an analytical model that describes such relationships and is based principally on Rent’s Rule. Specifically, it relates logic architectural parameters to the area efficiency of an FPGA. Comparison to experimental results show that our model is accurate. This accuracy combined with the simple form of the model’s equations make it a powerful tool for FPGA architects to better understand and guide the development of future FPGA architectures. ii Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii Nomenclature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Research Goals and Contributions . . . . . . . . . . . . . . . . . . . 4 1.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.4 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.1 FPGA Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.1.1 Logic Block and Cluster Architecture . . . . . . . . . . . . . 9 2.1.2 Routing Architecture . . . . . . . . . . . . . . . . . . . . . . 12 iii Table of Contents 2.2 FPGA CAD Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.2.1 Technology Mapping . . . . . . . . . . . . . . . . . . . . . . . 15 2.2.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2.3 Placement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2.4 Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.3 Previous Research on Analytical Models . . . . . . . . . . . . . . . . 19 2.3.1 Rent’s Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.3.2 Wirelength Models . . . . . . . . . . . . . . . . . . . . . . . . 20 2.3.3 Stochastic Fanout Model . . . . . . . . . . . . . . . . . . . . 22 2.4 FPGA-Specific Analytical Models . . . . . . . . . . . . . . . . . . . 24 2.4.1 Relating Circuit Rent Parameter to FPGA Architecture . . . 24 2.4.2 Minimum Needed Channel Width . . . . . . . . . . . . . . . 25 2.4.3 FPGA Area Model . . . . . . . . . . . . . . . . . . . . . . . . 27 2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3 Model Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.1 Technology Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.1.1 Number of K-LUTs to Implement a Circuit . . . . . . . . . . 30 3.1.2 Average Number of Used Inputs for a K-LUT . . . . . . . . . 31 3.2 Single-Level Clustering . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.2.1 Number of Single-Level Clusters Needed to Implement a Circuit 33 3.2.2 Average Number of Used Inputs for a Single-Level Cluster . . 38 3.3 Multi-Level Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.3.1 Number of Multi-Level Clusters Needed to Implement a Circuit 41 3.3.2 I-Limited Clustering . . . . . . . . . . . . . . . . . . . . . . . 41 3.3.3 N -Limited Clustering . . . . . . . . . . . . . . . . . . . . . . 43 iv Table of Contents 3.3.4 Boundary Condition . . . . . . . . . . . . . . . . . . . . . . . 43 3.3.5 Number of Used Inputs for a Multi-Level Cluster . . . . . . . 43 3.4 Model Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4 Model Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.1 Experimental Methodology . . . . . . . . . . . . . . . . . . . . . . . 46 4.2 Technology Mapping Model . . . . . . . . . . . . . . . . . . . . . . . 49 4.3 Single-Level Clustering Model . . . . . . . . . . . . . . . . . . . . . . 51 4.4 Multi-Level Clustering Model . . . . . . . . . . . . . . . . . . . . . . 51 4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 5 Applications of Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 5.1 The Classical Experimental Method . . . . . . . . . . . . . . . . . . 58 5.2 Single-Level Cluster Architecture . . . . . . . . . . . . . . . . . . . . 59 5.3 Multi Level Cluster Architecture . . . . . . . . . . . . . . . . . . . . 63 5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 6.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 6.2.1 Model Improvements . . . . . . . . . . . . . . . . . . . . . . . 69 6.2.2 Analytical Modeling Project . . . . . . . . . . . . . . . . . . 71 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 Appendices A Area Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 v Table of Contents A.1 Transistor Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 A.2 Basic Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 A.2.1 SRAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 A.2.2 Multiplexer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 A.2.3 Buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 A.3 Routing Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 A.3.1 Inter-Cluster Routing . . . . . . . . . . . . . . . . . . . . . . 83 A.3.2 Intra-Cluster Routing . . . . . . . . . . . . . . . . . . . . . . 87 A.4 Complete Area Model . . . . . . . . . . . . . . . . . . . . . . . . . . 88 vi List of Tables 3.1 γ Values From 20 MCNC Benchmarks [41] . . . . . . . . . . . . . . . 32 3.2 Model Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.1 Benchmark List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.2 Cluster architectures that achieve 90% logic block utilization . . . . . 52 5.1 List of the best cluster architectures with an overall number of LUTs of 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 vii List of Figures 1.1 Model overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1 Example of an island-style FPGA architecture . . . . . . . . . . . . . 9 2.2 Example a) two input look-up table (2-LUT) and b) basic logic block 10 2.3 Nth level cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.4 Routing architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.5 Types of switches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.6 Unidirectional - single-driven routing architecture . . . . . . . . . . . 14 2.7 FPGA CAD flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.8 Placement hierarchy [17] . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.9 Example of a tile from [21] . . . . . . . . . . . . . . . . . . . . . . . . 27 3.1 Example modeled circuit . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.2 Example of a 2-input circuit tech-mapped into 4-LUTs. Notice not all inputs are utilized in the LUT on the left . . . . . . . . . . . . . . . . 32 3.3 Example of a 3-LUT tech-mapped circuit clustered into groups of 3. In this case nk = 15 and nc = 6 . . . . . . . . . . . . . . . . . . . . . 34 3.4 Example of I-limited clustering (K = 3, N = 3, I = 5). Notice the restriction I = 5 prevents all clusters in the example to be fully populated with three LUTs. . . . . . . . . . . . . . . . . . . . . . . . 35 viii List of Figures 3.5 Example of N-limited clustering (K = 3, N = 3, I = 9). Notice with I = 9, there is no longer a restriction on fully packing all N = 3 clusters. However, N = 3 does limit the entire circuit from being packed into a single cluster, as in the case N = 6. . . . . . . . . . . . 37 3.6 Rent’s rule applied to a single cluster (K = 3, N = 3, I = 8) . . . . . 39 3.7 Example of multi-level clustering . . . . . . . . . . . . . . . . . . . . 40 4.1 Verification flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.2 Logic per LUT vs. LUT size . . . . . . . . . . . . . . . . . . . . . . . 50 4.3 Logic packed per cluster vs. cluster size . . . . . . . . . . . . . . . . . 52 4.4 Logic packed per cluster vs. input size . . . . . . . . . . . . . . . . . 53 4.5 Used inputs per cluster vs. cluster Size . . . . . . . . . . . . . . . . . 54 4.6 Used inputs per cluster vs. LUT size . . . . . . . . . . . . . . . . . . 54 4.7 Number of 4-LUTs vs. cluster level . . . . . . . . . . . . . . . . . . . 55 4.8 Choosing an architecture for verification at level 1 . . . . . . . . . . . 55 4.9 Used inputs per cluster vs. cluster levels . . . . . . . . . . . . . . . . 56 5.1 Classic architecture evaluation flow . . . . . . . . . . . . . . . . . . . 59 5.2 Area vs. LUT size . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.3 Area vs cluster size . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 5.4 Area vs number of inputs . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.5 Identifying limitations of Fang’s model [18] . . . . . . . . . . . . . . . 64 5.6 Number of minimum-transistor widths (MTW) for the best architectures 66 A.1 Six-transistor SRAM cell schematic . . . . . . . . . . . . . . . . . . . 81 A.2 Multiplexer designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 A.3 Inter-cluster routing connections . . . . . . . . . . . . . . . . . . . . . 84 ix List of Figures A.4 Switch components . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 A.5 Intra-cluster routing . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 A.6 A tile composed of a cluster and its two local routing channels . . . . 88 x Nomenclature γ Estimated number of unused LUT inputs λ Experimentally estimated number of used cluster inputs r Average wirelength aB Estimated area of one programming bit aF Estimated area of other fixed logic aL Estimated logic area of a single tile aR Estimated routing area of a single tile C Number of basic logic blocks for a partition of a circuit c Average number of utilized LUTs in a cluster Fc Connection block flexibility fo Number of fanout connections Fs Switch block flexibility foMax Maximum fanout I Number of cluster inputs i Estimated number of used cluster inputs xi Nomenclature ik Number of K-LUT inputs utilized K Number of LUT inputs k Wirelength L Level of a multi-level cluster N Number of LUTs in a cluster N2 Number of simple 2-input gates in a circuit nc Estimated number of clusters to implement a circuit Ng Number of gates in a circuit NK Number of K-LUTs in a circuit nk or nL Estimated of K-LUTs needed to implement a circuit Net(fo) Number of nets with a fanout of fo o Estimated number of used outputs per cluster p Rent’s parameter r Estimated wirelength distribution function T Number of external inputs for a partition a circuit t Rent’s constant the average number of used connections on a logic block wK Estimated required channel width wabs min Channel width requirement of a fully flexible FPGA xii Nomenclature wneed Estimated channel width requirement xiii Acknowledgements First, I would like to thank my supervisor, Professor Steve Wilton, for providing me with this opportunity and his invaluable advice and teaching during the course of my research. Through his consistent encouragement, understanding, and patience, I’ve been able to conquer many great feats during my time as a graduate student. To all of the members of the System on a Chip group, thank you for making the lab a great place to work. I especially thank Darius, Cindy, Daryl, Paul, Mark, and Dave for your helpful insight and great company. Finally, to Mom, Carolyn, and Paul, thank you for your support, encouragement, and inspiration. xiv Dedication To Dad My inspiration Your memories and spirit are with me forever xv Chapter 1 Introduction 1.1 Motivation A field-programmable gate array (FPGA) is an integrated circuit (IC) containing a reprogrammable logic array and an interconnecting routing fabric that can implement virtually any digital logic application. The ability to program, erase, and reprogram the reconfigurable logic and flexible routing fabric make FPGAs suitable for rapid testing and prototyping of digital logic designs. Unlike application specific integrated circuits (ASICs), FPGAs do not require the time and cost intensive process of custom designing and manufacturing of masks. Therefore, the utilization of FPGAs leads to a reduction in the design cycle time and upfront costs compared to ASICs. However, the conveniences of pre-manufactured silicon come at a price. FPGAs have lower performance than ASICs that are custom designed at the transistor level and optimized for clock speed, logic density, and power consumption. The exact performance gap between the two technologies is application specific, but on average FPGAs are more than 40 times larger, 3.2 times slower, and 12 times more power (dynamic) hungry than ASICs [26]. That said, many digital logic designs do not demand high-end performance. In these cases, FPGAs are a cost-efficient choice for many low to medium volume products that require quick design turn-around times and time-to-market. Since their introduction, FPGAs have evolved considerably in an attempt to re- 1 Chapter 1. Introduction duce the performance gap and capture more of the higher-end IC market. These advancements can be categorized and attributed to the development in three fields or research: transistor technology, FPGA computer-aided design (CAD) tools, and FPGA architectures. First, advancements in transistor technology have led to a dra- matic increase in the number of available transistors and a radical change in the functionality of FPGAs. At their inception, FPGAs could implement just over 1000 gates [25]. Today’s FPGAs can implement over a million gates [12, 24], large enough to implement complete digital system designs. With the size of present day FPGAs, it’s unrealistic for the user to manually program each individual programming bit. Instead, a digital logic circuit is typically described at a high-level of abstraction, usually in the form of a hardware-description language (HDL), that is then compiled using a CAD tool. The CAD tool inter- prets the HDL and determines a mapping of the appropriate values for the FPGA’s programming bits to implement the described digital circuit. Typically multiple map- pings exist for the same circuit, some yielding higher performance than others. Much research has gone into incorporating models for metrics such as area, delay, and power into the CAD algorithms [3, 6, 9, 10, 29, 38, 40, 44]. These models provide the algo- rithm estimated metric values and enables the CAD tool to make informed decisions when generating mappings. This allows the CAD tool to create an optimized mapping based on one or a combination of the metrics listed. Much of the improvement in FPGA technology is a result of developments in the field of FPGA architecture. The architecture of an FPGA refers to the structure and interconnections of the configurable elements inside the device. In early FPGAs, for example, logic was implemented using 4-input lookup tables (LUTs) [5], while in modern FPGAs, more complex fracturable LUTs which can be used in many different modes are employed [12, 24]. The new architectures are designed to provide 2 Chapter 1. Introduction higher density, lower power, and/or faster circuit implementations. FPGA companies expend tremendous effort and money evaluating architectural enhancements for every generation of their devices. During the design of a new FPGA, each architectural enhancement has to be evaluated, to determine whether it should be incorporated in the new device. This evaluation is typically done using an experimental approach. The new architecture is modeled and experimental CAD tools are used to map a set of benchmarks to the new architecture. Detailed area, delay, and power models are then used to evaluate the resulting implementation of each benchmark on the new architecture. Based on the results, the architectural enhancements may be deemed worthwhile, in which case it may be incorporated into the device. Often, the results suggest modifications to the enhancement, and these modifications are then evaluated using the same experimental techniques. This is often repeated numerous times until a suitable architecture is found. This process occurs both within FPGA companies and in academia. The above experimental process can be slow. To properly exercise an architecture, many benchmark circuits are required. If not enough benchmarks are used, it is possible to create an architecture tuned for specific circuits rather than a architecture that is suitable for a wide range of customers. In academia, researchers typically use roughly 20 benchmark circuits [5], but in industry, many more benchmark circuits are employed. In the experimental approach, each of these circuits must be mapped to all potential variants of the architecture under investigation. Modern CAD tools can take several hours to compile as its algorithm searches for the optimum mapping. This slow process limits the number of alternative architectures that can be considered, and thus limits the abilities of FPGA companies to explore new structures that may lead to more efficient FPGAs. Architectural investigation can be accelerated using analytical models that de- 3 Chapter 1. Introduction scribe some aspect of an architecture. Analytical models relate parameters describing an FPGA architecture to area, delay, or power efficiency of an FPGA. Such analytical models usually take the form of simple expressions, and thus searching for efficient architectures can be fast and does not require time-consuming experiments. The development of such models is an emerging area of FPGA research [13, 21, 28, 45]. These models can be used to accelerate the architectural investigation process in two ways. First, understanding the relationships between architectural parame- ters enables early-stage architecture development [18] in which the design space can be searched quickly using analytical models. Once a promising region of the ar- chitecture space has been identified, traditional experimental methods can be used to choose precise architectural parameters. This would significantly accelerate the FPGA architecture design process. It may also allow the study of a wider variety of “interesting” architectures since experimental CAD tools need not be developed for each architecture under consideration. Second, the development of such theory will encourage researchers to understand why certain architectures work well, and may eventually provide bounds on the capabilities and efficiencies of programmable logic. 1.2 Research Goals and Contributions The primary goal of this thesis is to develop and evaluate an analytical model that describes the relationship between the logic block architecture and the area-efficiency of the resulting FPGA. This model must balance complexity and accuracy. In gen- eral, simple equations provide significantly more insight into architectural tradeoffs than expressions which are unnecessarily complex. On the other hand, the model must be accurate enough that it leads to useful conclusions when investigating new architectures. 4 Chapter 1. Introduction An important aspect of our model is that we wish to derive relations between architectural parameters that are independent of the circuit to implement on the FPGA. For example, we would prefer a relation between logic block size and number of inputs that is independent of the circuit to be implemented. This is different from much prior estimation work, in which the goal is to predict the area, speed, or power for a given circuit [4]. That being said, it is impossible to completely ignore the impact of the circuit; we used the Rent Parameter [30] and size of the circuit prior to technology mapping as parameters for our model. The secondary goal of this thesis is to show an application of the model and how it can be used to accelerate FPGA design. Through two examples, we show that the model is indeed useful in FPGA architecture evaluation. 1.3 Approach In order to make the task tractable, we divide the model into three sections. As shown in Figure 1.1, each section reflects one step of the CAD flow typically used to map circuits to an FPGA. For each section, we estimate the logic resource usage, specifically the number of logic blocks needed and the number of logic block inputs used to implement a circuit as a function of the architectural parameters. The three sections are: 1. Technology Mapping - Our model outputs the estimate for the logic block us- age of K-input look-up tables given key circuit parameters that can be easily computed from a netlist of simple two-input gates. 2. First-level clustering - From the output of our model’s technology mapping level, our model estimates the resource usage of first-level clusters packed with 5 Chapter 1. Introduction K-LUTs. 3. Multi-level clustering - From the output of the first-level clustering level, our model estimates the resource usage of clusters at all other hierarchical levels. Throughout, we strived to develop the model by deriving relations analytically, without relying on curve-fitting or experimental techniques. This ensures that we captured the “essence” of programmable logic, and didn’t create a model that is limited to a particular CAD flow or tool suite. As will be discussed in Chapter 3, all aspects of our model were derived analytically. To evaluate the model, we employ an experimental approach. We compare predic- tions from the model to results obtained from a commonly used experimental CAD flow. Doing this comparison required the development of a detailed area model. The details of the comparison and this area model is also described in this thesis. 1.4 Organization This thesis is organized as follows. Sections 2 reviews the FPGA architectures and CAD tools targeted by this work, and briefly describes related analytical models. Section 3 details the derivation of the proposed model, and the model is validated against experimental results in Section 4. Section 5 gives an example of how the model can be applied to FPGA architectural investigations. Section 6 summarizes this thesis, draws conclusions and suggests future work. The material in Sections 3, 4, and 5, have been published in [28]. 6 Chapter 1. Introduction Technology Mapping Section 3.1 Single / First-Level Clustering Section 3.2 1 2 2 1 1 2 3 3 2 2 2 2 1 2 2 Multi-Level Clustering Section 3.3 3 4 7 26 6 2 Input Gates K-LUTs K-LUTs First Level Clusters First Level Clusters Multi Level Clusters Figure 1.1: Model overview 7 Chapter 2 Background This chapter begins by reviewing FPGA architectures, including the logic block and routing architecture assumed by our model. This is followed by a summary of the CAD algorithms used in our model’s verification. The final section gives an overview of general ASIC and FPGA-specific models that provide insight into our model’s derivation. 2.1 FPGA Architecture Many classes of FPGA architecture have been proposed in literature including island- style, row-based, and hierarchical [7]. This work targets the widely-used island style architecture due to its relevance in the FPGA industry and its wide-spread use in academia. Commercially, island-style architectures are implemented on the leading edge FPGAs offered by the two largest FPGA companies, Altera [12] and Xilinx [24]. The remainder of this section exclusively focuses on this style of FPGA architecture. From a high level view, as seen in Figure 2.1, a simple FPGA consists of three major reprogrammable resources: logic blocks, I/O blocks, and programmable rout- ing. Arranged in an array, each logic block implements a small portion of a circuit’s logic. The reprogrammable interconnect surrounding the cluster blocks provide com- munication and allows an FPGA to implement any digital logic circuit. I/O blocks at the periphery of the FPGA also connect to the cluster blocks via interconnect. 8 Chapter 2. Background LB LB LB LB LB LB LB LB LB LB LB LB LB LB LB LB LB LB LB LB LB LB LB LB LB≈ I/O I/O I/O I/O I/O I/O I/O I/O I/O I/O I/OI/OI/OI/OI/O I/OI/OI/OI/OI/O Figure 2.1: Example of an island-style FPGA architecture The remainder of this section provides an in-depth description of the logic block and reprogrammable interconnect. 2.1.1 Logic Block and Cluster Architecture This subsection is a bottom-up description of the typical hierarchical structure of FPGA logic cluster blocks. The basic unit of FPGA logic is the lookup-table (LUT). A K-LUT can implement any K-input logic function by programming the appropriate truth table in the 2K Static Random Access Memory (SRAM) cell array [7]. Using a multiplexer, as shown in Figure 2.2(a) for a 2-LUT, the inputs of the K-LUT select the appropriate SRAM cell to be connected to the output. A basic logic element (BLE) contains a LUT and data flip-flop (DFF) as shown in Figure 2.2(b). The output of each LUT is connects to a DFF, and a multiplexer 9 Chapter 2. Background 1 or 0 1 or 0 1 or 0 1 or 0 P ro g ra m m a b le L o g ic S R A M C e lls Output 2 Inputs (a) 2 - Input LUT DFF Clock Inputs Output SRAM (b) Figure 2.2: Example a) two input look-up table (2-LUT) and b) basic logic block selects either the LUT or DFF output as the output of the BLE. This allows both combinational and sequential logic to be implemented. In a typical circuit, BLEs will have input signals in common due to multi-fanout nets. Rather than individually connecting BLEs to the programmable routing fabric, the BLEs are usually grouped together into cluster logic blocks (CLBs). Absorbing connections completely within CLBs localizes shared connections and reduces the number of wires and switches. This results in an overall improvement in area and delay efficiency [5]. A similar argument can be made for recursively grouping CLBs 10 Chapter 2. Background into increasingly large mega CLBs [1]. In this case, complete clusters are made of recursively grouped logic blocks. For example, the Lth level logic block (L-LB) contains NL number of (L − 1)-LE and IL number of inputs, and the smallest logic block, 0-LE is simply a BLE. N LBs Logic Block #1 Logic Block #N Clock I N N Outputs I Inputs Figure 2.3: Nth level cluster In this work, clusters are fully connected internally at all levels, meaning each (L- 1)-LB input can access any of the IL cluster inputs or NL BLE outputs. This is the common design for single-level logic blocks (L = 1) used in academia [5]. However, research has shown multi-level clusters (L > 1) benefit from a depopulated internal logic block routing scheme [39]. We assume a fully connected scheme in this work to simplify the custom CAD tools needed for verification. The resulting logic block is as shown in Figure 2.3. The sizing of the cluster block parameters K, IL, and NL influence both the area and speed efficiency of an FPGA. Previous research has shown that for single-level logic blocks K = 4, N1 = [8 · · · 20], and I1 = K/2 · (N1 + 1) are good architectural choices [2]. Exact optimal sizing parameters for mutli-level logic blocks is dependent 11 Chapter 2. Background on the choice of switch depopulation schemes, physical layout, and CAD tools, all of which are not standardized in academia. 2.1.2 Routing Architecture As briefly mentioned in the previous section, cluster blocks are interconnected using programmable routing. Although this work concentrates on cluster block utilization, the choice of routing architecture directly affects FPGA area, a metric estimated in the application in Chapter 5. Conceptually, the interconnect is composed of three components: routing chan- nels, switch blocks, and connection blocks, as shown in Figure 2.4. Routing channels lay horizontally and vertically between the rows and columns of cluster blocks. Each channel contains W parallel routing tracks. Each track comprises wire segments that span L number of cluster blocks. In FPGA literature, the parameters W and L are often referred to as channel width and wire segment length[32]. The inputs and outputs of cluster block, known as pins, connect to adjacent channels via connection blocks. Each pin is connected to a programmable switch, that subsequently connects to Fc number of independent channel wires. Switches can be configured to allow connections from each pin to some or all of the connected channel wires, as needed by a circuit implementation. Switch blocks are located at the intersection of horizontal and vertical channels; a switch block enables wires to stitch and turn onto other wires. Programmable switches connect each wire that enters a switch block to a total of Fs number of wires in the other three directions. In the above description, switch and connection blocks are described as independent sets of programmable switches. However in modern FPGA layouts these switch sets are integrated together [31]. 12 Chapter 2. Background ccccccccccc LB LB LB LB LB LB LB LB LB I/O I/O I/O I/OI/OI/O S S S C C C S S S C C C S S S C C C C C C C C C C C C LB LB Programmable Switch Connection Block Switch Block in out Long Wire (L>1) LB - Logic Block C - Connection Block S - Switch Block Figure 2.4: Routing architecture c) Buffered Bi-Directionalb) Buffered Uni-Directionalb) Unbuffered Figure 2.5: Types of switches The programmable switches in the connection and switch blocks consists of mul- tiplexers controlled by configuration memory bits. Switches are either buffered or un- buffered, and buffered switches are either uni-directional or bi-directional, as shown in Figure 2.5. Buffered switches require more area compared to an unbuffered design, but reduces signal propagation delay[32]. Modern commercial FPGAs use unidirec- tional switches, where each routing wire drives a single buffer and a multiplexer that aggregates connection box and switch box inputs, as shown in Figure 2.6. Unidi- 13 Chapter 2. Background rectional switches offer better delay optimization and a denser layout compared to bidirectional switches[31]. LB LB LB LB LB LB LB LB LB I/O I/O I/O I/OI/OI/O S S S C C C S S S C C C S S S C C C C C C C C C C C C in1 out1 Programmable Switch Long Wire (L>1) Figure 2.6: Unidirectional - single-driven routing architecture 2.2 FPGA CAD Tools The FPGA CAD tools industry is driven by the continuous evolution of semiconductor technology leading to the reduction in transistor size and increase in FPGA logic density. Modern FPGAs contain close to a million LUTs [12] and it is infeasible for 14 Chapter 2. Background a designer to program each LUT individually. Instead, designers create digital logic circuits at a higher level of abstraction, commonly a hardware description language, that is then compiled using FPGA CAD tools. There are five steps in the CAD flow, as shown in Figure 2.7, that ultimately yield a programming bit stream that sets the FPGA routing control and LUT SRAMs to implement the circuit design. The first step is called high-level synthesis, which takes a set of design files as input, interprets them, and outputs an equivalent circuit netlist comprising simple 2-input logic gates and flip flops. This step is technology independent for pure logic; the results are targeted for either ASICs and FPGAs. The remaining four steps are FPGA specific, and will be described in the remainder of this section. 2.2.1 Technology Mapping A technology mapper converts a synthesized two-input gate netlist into a circuit of FPGA K-LUTs and D flip-flops. The technology mapper can optimize the resulting circuit for one or a combination of metrics, such as area, delay and power. For example, the two technology mapping algorithms used in this work, Flowmap[9] and Emap[29] optimize for a combination of area, delay, and power. Both algorithms generate a directed acyclic graph (DAG) representation of the original netlist, where the DAG’s nodes represent gates and edges represent nets. The algorithms then create K-LUT assignments, by grouping nodes to have at most K-input and one output. Minimizing the depth of a circuit ensures minimum delay and is accomplished by reducing the number of continuous K-LUT groups between primary inputs, flip flops, and outputs. Flowmap further reduces delay by duplicating nodes providing the technology mapper more flexibility in node grouping. Emap reduces power by 15 Chapter 2. Background High Level Synthesis Technology Mapping Clustering Placement Routing F P G A S p e c ifi c Circuit Design Bit Stream Figure 2.7: FPGA CAD flow grouping sets of nodes that share nets that are expected to have a high switching activity. 2.2.2 Clustering The clustering step involves taking a technology mapped netlist, creating BLEs, and then packing these BLEs into clusters. In doing so, it must adhere to the pin and logic 16 Chapter 2. Background size constraints of the cluster architecture. The key goal of clustering is to optimize the utilization of a cluster’s internal routing that contain shorter and less capacitive wires compared to external wires. This leads to less delay and a reduction in the amount of programmable routing needed to route a circuit. The Vpack [38] algorithm is a clustering algorithm optimized for logic resource and routing wire utilization, and is the predecessor of the two algorithms used in this work. In this tool clusters are built one at a time. The process starts by selecting one BLE that acts as a seed for the cluster. Subsequent BLEs are then greedily added to the cluster based on the number of shared nets between candidate BLE and the BLEs grouped in the cluster already, until the cluster is full or no more BLEs can be added. T-VPack [38], a timing-driven variant of Vpack, adds a critical path term to the cost function grouping BLEs on the critical path. iRAC [44], another bottom- up algorithm, attempts to improve circuit routability by adding a cost function that encourages the encapsulation of nets within a cluster. This reduces the number of total external nets in the circuit, making it easier to route. 2.2.3 Placement The placement tool assigns each cluster to a physical position in the FPGA. The final goal of placement and subsequent routing phase is to legally implement the circuit and meet timing. In order to do so, the router must be able to find a solution to connect the logic clusters through the programmable routing. To facilitate this, the placer attempts to minimize the routing wirelength between clusters and external I/O pins. The place implements a bounding box approximation since performing full routing to evaluate possible placements is computationally prohibitive. The current two popular classes of placement algorithms are simulated annealing [6] and analytical methods 17 Chapter 2. Background [22, 46]. Simulated annealing begins with an initial (often random) legal cluster placement. Then random swaps (or “moves”) are made and evaluated using a cost function that factors in bounding-box wirelength, delay, and other quality metrics. Beneficial moves that reduce the cost are always taken. Bad moves are sometimes taken to avoid settling in a local minimum. The chance of accepting such a move is an exponential function of how bad the move is. The temperature is a parameter that decreases over time so that fewer bad moves are accepted as the placement settles into a final solution. When moves no longer yield beneficial returns, the placement is complete and the algorithm terminates. Analytical approaches attempt to formulate the placement problem as a mathe- matical or physical system and then find the optimal solution to that system. For example, an analytical approach can use the bounding box cost function to produce a set of simultaneous equations for the position of each cluster. Solving these equations produces an optimal but illegal placement of clusters with floating-point coordinates. A legalizing algorithm refines the analytical result, leading to a placement on an inte- ger grid where no two clusters overlap. Many analytic solvers use iterative refinement techniques, such as a low-temperature simulated annealing (an anneal that accepts a much smaller percentage of bad swaps to reduce runtime) to improve the final results. 2.2.4 Routing Routing is the final step of the CAD flow. A routing algorithm assigns the appropriate switches and wires in the routing fabric. Typical ASIC routers perform routing in two steps: 1) a global routing step assigns nets to channels, 2) a detailed routing step specifies the exact wire within the assigned channel that will implement a net. 18 Chapter 2. Background Two-step FPGA routers also exist [33], however, unlike ASICs, FPGAs contain fixed routing resources that imposes additional constraints, making the detailed routing step difficult. Instead, many modern FPGA routers perform the global and detailed routing steps concurrently. For example, VPR is a single-step router that implements a modified version of the iterative Pathfinder algorithm [40]. The algorithm initially routes nets independently, which results in multiple nets possibly sharing a single wire. A cost-function calculation is made for each wire that is proportional to the number of nets assigned to that wire. All nets are ripped up and rerouted, with the cost functions penalizing and discouraging the use of highly utilized wires. The algorithm repeats until a legal routing solution is found. After routing, the memory states of the logic blocks and routing resources needed to implement a circuit are determined. This information is used to create a circuit’s bit stream that is then loaded on FPGA to physically set the appropriate values in the SRAMs. Also, at this stage, the exact FPGA routing resources are known and can be used to accurately estimate performance results that are essential to architectural evaluation. In this work these estimations are used to validate the accuracy of our model’s equations. 2.3 Previous Research on Analytical Models This section begins by reviewing a series of analytically derived ASIC models that compute the expected wirelength distribution, routing demand, and average fanout. While these models are not directly applicable to FPGAs, they form the basis for many of the FPGA specific models covered in the next section. 19 Chapter 2. Background 2.3.1 Rent’s Rule Rent’s rule is the power law relationship between the number of external inputs and outputs, T , and number of basic logic blocks, C, for a partition of a circuit design. E. F. Rent at IBM in 1960 was the first to be credited with the discovery of the relationship. To investigate the hierarchical structure of circuits, the data from partitions of portions of the IBM series 1400 computers were used in a plot of T versus C on a log - log plot. The result was a straight line implying a power law relationship. Landman was the first to formalize the relationship into equations [30]. In his work, Landman performed a series of controlled experiments on a set of partitioned circuits with varying size and complexity. In the experiments the maximum number of partition inputs and outputs, t, was swept while T and C in the circuit were measured. Generalizing the results led to Equation 2.1, T = t · Cp (2.1) where t, also known as Rent’s constant, is the average number of used connections on a logic block, and p is Rent’s parameter that typically ranges from [0,1]. Rent’s parameter is dependent on the partitioning algorithm chosen and the circuit’s struc- ture. Circuits with highly serialized and parallelized structures typically have values of p around 0.5 and 0.75 respectively. 2.3.2 Wirelength Models Wirelength estimates provide insight into the overall area, delay, and power used to implement a circuit. Donath [17] developed the first wirelength model for realistic 20 Chapter 2. Background circuits. In this work an average wirelength upper bound, r, was derived based on two assumptions, 1) logic is placed on a two-dimensional hierarchical square array where each higher level element consists of 4 lower level elements, seen in Figure 2.8, and 2) Rent’s parameter is constant throughout all levels. r is given by Figure 2.8: Placement hierarchy [17] r ∝ Cp− 1 2 if p > 1 2 logC if p = 1 2 F (p) if p < 1 2 (2.2) where C is circuit size and p is the Rent parameter. In a majority of cases, when compared against experimental results, the actual interconnect wirelength was found to be about half of the estimated upper bound. In [16] Donath extended his work to model wirelength distribution. Donath’s 21 Chapter 2. Background wirelength distribution function, r, is inversely related to wirelength, as seen here r(k) = g/k3−2p for 1 ≤ k ≤ L (2.3) r(k) = 0 for k > L (2.4) where r(k) is the fraction of wires with length k, g is a normalization constant, and L is a constant that is proportional to W/2 for an array size WxW . In [19] Feuer extended Donath’s work by deriving wirelength distribution and average wirelength models for non-uniform circuit partitions. Feuer states that for good placements, Rent’s rule holds on average for any circuit partition. Furthermore, the wirelength distribution of a circuit is f(r) ∝ r2p−4 (2.5) and the average wirelength of a partition is r ∝ Cp− 12 (2.6) both for Rent exponent p > 1 2 . Though derived using two distinct methods, the agreement between Donath’s and Feuer’s wirelength reinforce the validity of each other and the underlying Rent’s model that they are derived from. 2.3.3 Stochastic Fanout Model To develop a full interconnect model, fanout must be considered. One model that does this is Zarkesh-Ha’s stochastic fanout model for circuits. In [50], Zarkesh-Ha developed a stochastic model predicting fanout distribution 22 Chapter 2. Background in circuits. Based on Rent’s rule and the assumption that its relationship holds throughout an entire system, Zarkesh-Ha predicted a circuit’s fanout distribution as a function of the number of used logic block inputs, t, and Rent’s parameter, p, as seen in Equation 2.7, Net(fo) = t ·Ng(f p−1o − (fo + 1)p−1) fo + 1 (2.7) where Net(fo) is the number of nets with a fanout of fo and Ng is the number of gates in the circuit. The first order Taylor expansion of Equation 2.7 was taken and the maximum fanout, foMax >> 1 was assumed. This produced an equation for maximum fanout as shown in Equation 2.8. foMax = (t ·Ng(1− p))( 1 3−p ) (2.8) Equation 2.7 was summed from 1 to foMax to give the total number of nets in the circuit as seen in the series expansion version of equation: Ψ(P, foMax) = foMax∑ n=1 np n2(n+ 1) NetTot = t ·Ng[1− (foMax+1)p−2−Ψ(P, foMax)] (2.9) This led to the average wirelength of a circuit that is simply foAvg = foMax∑ fo=1 fo ·Net(fo) NetTot (2.10) which can be expanded and simplified to foAvg = 1− (foMax + 1)p−1 1− (foMax + 1)p−2 −Ψ(P, foMax) − 1 (2.11) 23 Chapter 2. Background In Chapter 3 we make use of Zarkesh-Ha’s average fanout equation to formulate a logic block utilization model for FPGAs. 2.4 FPGA-Specific Analytical Models The previous section reviewed both classic and contemporary applications of Rent’s rule to develop models to estimate ASIC routing. This section focuses on FPGA specific models. 2.4.1 Relating Circuit Rent Parameter to FPGA Architecture Using empirical results from a Cyclone FPGA architecture [11], Pistorius [42] per- formed a set of four experiments pertaining to Rent’s rule: 1) contrasting various new and old techniques for calculating Rent’s parameter, p, for placed circuits; 2) exploring the correlation between the Rent parameter and circuit type; 3) comparing Rent parameters for purely-timing and congestion-based placement algorithms; and 4) observing the temporal behavior of the Rent parameter in simulated annealing and the relationship between cost function and wirelength. Inspired by Feuer theory, Pistorius developed a novel and more natural method of choosing sample regions to measure Rent’s parameter that minimize sample size and position bias compared to the traditional recursive partitioning technique [19]. Comparing block sample position and size distributions of the new techniques, the least biased method defined a sample region as all blocks within a radius R of chosen random grid point, (x,y). This method led to Rent parameters between 0.5 and 0.65 for the large experimental benchmarks used. In general these values are larger 24 Chapter 2. Background compared to the Rent parameter calculated using regions based on the PART recursive partition algorithm [47, 49]. The PART based regions led to more accurate wirelength predictions when a modified version of Feuer’s model [19] is used with the Rent exponent measured with both region defining methods. Pistorius showed there is little correlation between Rent parameter and circuit functionality or parallelism in contrast to previous beliefs [30]. The Rent parameters of categorized benchmark circuits were compared and showed amongst the measured control logic, digital signal processing (DSP), image processing, and networking cir- cuits, the average Rent parameter was between 0.58 and 0.60. This result was ex- plained by the fact that designs of this size are composed of smaller functional blocks of various types, and any trends that exist are averaged over the entire system-level design. Finally, the previous works showed that there exists a strong linear relationship between wirelength and Rent parameter. Measurements were taken during a sim- ulated annealing placement and showed both wirelength and the Rent parameter decreased over time. This reaffirmed the belief Rent parameter can be viewed as a metric of quality placement and can be used to evaluate algorithms [8, 48]. 2.4.2 Minimum Needed Channel Width Fang [18] developed an empirical model which estimates the number of tracks per channel required to implement a circuit, often referred to as routing demand or wneed, for island style FPGAs. As with this work’s model, the analytical wneed model facil- itates early-stage estimation of essential FPGA resources. The inputs to this model were the estimated wirelength r, estimated logic block input usage i, switch block flexibility Fs, and connection block flexibility parameters Fcin and Fcout. The model 25 Chapter 2. Background was developed in stages, beginning with a routing demand estimate for a fully flexible routing architecture consisting of a crossbar in the switch and connection blocks and wires of length one. An FPGA architecture analogous to El Gamal’s ASIC master slice interconnect model [20] was assumed. Fang names this architecture’s routing demand wabs min and estimates it as: Wavg = λr 2 wabs min = pWavg (2.12) where the empirically measured p is the peak factor, that scales Wavg , El Gamal’s average channel width estimate, to the maximum used channel width. From experi- mental data, r = 4.43 and p = 1.4 were estimated as constants, and led to accurate wabs min results for a wide range of cluster sizes,. In each subsequent part of the derivation, key parameters were added describing the routing architecture’s flexibility. Fang’s model built on the equation for wabs min incorporating terms for switch block, connection block, and wirelength. The final FPGA routing demand model was summarized as: wneed = wabs min + 1 β ( wabs min Fs )( wabs min Fcin )αin ( wabs min Fcout )αout + λ(L− 1) 4 (1 + 1 Fcαinin ) where Fs, Fcin, and Fcout are the switch and connection block flexibility, L is wire segment length, λ is the estimated number of used inputs, and β = 3, αin = 0.5, and αout = 2.5 are empirically measured values. The equation was curved fitted and the relationship λ = 0.88N + 3.2 was derived. In Section 5, Fang’s model in conjunction with this work’s model are applied in an area estimation, demonstrating the potential of early-stage architecture evaluation of 26 Chapter 2. Background nascent designs without the need to first create architecture specific CAD tools. 2.4.3 FPGA Area Model In [21], Gao derived an area model based on the geometric structure of an island-style FPGAs. Gao assumed the FPGA can be divided into NK tiles. Each tile was made of a logic block and the routing resources to its right and bottom sides as shown in Figure 2.9. The tile’s logic blocks consisted of a BLE made of a K-input LUT and a flip-flop. Logic Block AL LA LA bA W Figure 2.9: Example of a tile from [21] The area model derivation begun by finding the area of a single tile. The area of a tile was calculated as the sum of the logic block and routing resource areas, as shown in Equations 2.13 to 2.15 aL = aB · 2K + aF (2.13) aR = aB · w2K + 2 · √ aB · √aL · wK (2.14) ATotal = ( √ aB · wK +√aL)2 ·N (2.15) where aL is the logic area, aR is the routing area, aB is the area of one programming bit, aF is the area of other fixed logic (internal routing resources and DFF), wK is the require channel width, and NK is the number of used logic blocks. The channel 27 Chapter 2. Background width, wK can be computed from El Gamal’s channel width model and an FPGA wirelength model from [35] as shown here, wK = K + 1 2 · 2 3 · 1− p p− 1/2 · [(NK) p−1/2 − 1] (2.16) The number of logic blocks required to implement a circuit, NK , is derived from Rent rule[30]. Gao assumed, for a K-input LUT, the average number of used pins is K inputs plus one output leading to a ratio between N2 and NK as shown here N2 NK = ( K + 1 3 )1/p (2.17) where the rent exponent is assumed to be p=0.75. Similar to Gao’s model, our work relates logic-block and area utilization to FPGA architectural parameters. Our model improves upon this work by incorporating clus- tered architectures and using cluster architectural parameters in our equations. 2.5 Summary In this chapter, we described the architecture of an FPGA and presented definitions of architectural parameters used to described the FPGA. We also presented previous wirelength and FPGA-specific models. In subsequent chapters, we use these defi- nitions and models to derive an analytical model that relates logic parameters to the area-efficency of an FPGA. Since this work’s publication in [28] it has been im- plemented in several other publications. Smith et al. incorporates our model with Feuer’s [19] and Davis’s [14, 15] wirelength models to estimate post-placement wire- length for homo- and hetero-geneous FPGAs [45]. Das et al applies our model to relate logic architecture of cluster-based FPGAs to its expected speed [13]. 28 Chapter 3 Model Derivation This chapter describes the mathematical derivation of our model that estimates the resource utilization of logic blocks at the various different hierarchical levels within an FPGA. Our model is divided into three stages: Technology Mapping, Single/First- Level Clustering, and Multi-Level Clustering, as described in Section 1.2. At each level we derive an estimate for the logic block resource usage, specifically the number of logic blocks and the number of logic block inputs used in each circuit implemen- tation. The following sections describe each stage and the derivation of the model’s equations. 3.1 Technology Mapping The first stage of the FPGA CAD flow and our model is technology mapping, where simple 2-input gates are mapped into K-LUTs. In this section we seek to derive an equation to estimate the two major logic block resources at this stage, the number of LUTs, nK , and the number of LUT inputs, iK utilized in implementing a circuit. Our model’s three inputs are K, LUT size; ns, the number of simple 2-input gates in the circuit; and p, Rent’s parameter. The remainder of this section presents a physical description behind the derivation for nK and iK . 29 Chapter 3. Model Derivation 3.1.1 Number of K-LUTs to Implement a Circuit Consider a pre-techmapped circuit consisting of ns simple two-input LUTs. During technology mapping, these ns gates will be mapped into a smaller number of LUTs, which we will denote nk. In this section, we seek a closed form expression for nk. Consider a portion of the pre-techmapped circuit, Figure 3.1(a), consisting of x two-input gates (1 < x ≤ ns). Let y denote the number of signals that connect across the boundary of this region. Since each gate has three pins (two inputs and one output), we can use Rent’s Rule to write: y = 3xp (3.1) where p is the Rent parameter of the circuit. Now suppose this same region, Figure 3.1(b), is mapped to z K-LUTs using a technology mapping algorithm. The number of signals that connect across the boundary of this region is still y. The number of pins used in each K-LUT is 1 + iK ; the first term corresponds to the output and iK is the expected number of used inputs. In the next section we define γ, the estimated number of unused K-LUT inputs, which leads to iK = K − γ. We can then write y = (K + 1− γ)zp (3.2) Since y is the same in Equations 3.1 and 3.2, we can eliminate y to get z x = p √ 3 K + 1− γ . (3.3) Intuitively, this ratio is a measure of how much logic can be packed into each lookup- table. 30 Chapter 3. Model Derivation Finally, using Equation 3.3, we can write: nk = ns p √ 3 K + 1− γ (3.4) (a) Netlist of 2-input gates (b) Possible technology mapping covering for 3-LUTs Figure 3.1: Example modeled circuit 3.1.2 Average Number of Used Inputs for a K-LUT Of the K inputs of a lookup table, not all of them are always used in a K-LUT. Figure 3.2 illustrates this fact with a plausible 3-LUT covering of an example 2-input gate circuit. In this figure, many of the LUTs utilize only two of the three LUT inputs. In one case, the LUT in the bottom left of the figure utilizes only one of the three LUT inputs. This is indicated by having only one arrow head entering the dark shaded region representing the LUT. The under-utilization of LUT inputs is characterized by the term γ, which is the expected number of inputs to a K-LUT that are not used. We can then write, iK = K − γ. (3.5) 31 Chapter 3. Model Derivation as used in Equations 3.2, 3.3, and 3.4. a b c d e f g (a) Original Netlist a b c d e f g (b) Possible Covering a b c d e LUT 1 LUT 1 f g (c) LUT Mapping of Covering Figure 3.2: Example of a 2-input circuit tech-mapped into 4-LUTs. Notice not all inputs are utilized in the LUT on the left We have not found a way to accurately model this analytically, however, exper- imentally we have found that γ (as a function of K) is extremely consistent across all benchmark circuits we considered. Table 3.1 shows our measured values of γ; the derivation of a closed form for this expression is an interesting topic of future work. Table 3.1: γ Values From 20 MCNC Benchmarks [41] K 2 3 4 5 6 7 γ 0.000 0.261 0.466 0.701 0.996 1.232 32 Chapter 3. Model Derivation 3.2 Single-Level Clustering The second part of the model mirrors single-level clustering, in which K-LUTs are packed into clusters with a pre-defined capacity and a pre-defined number of unique inputs [5]. The inputs of this part of the model are nk, the number of K-LUTs in a circuit; N , the number of LUTs in a cluster; and I, the maximum number of available input pins on a cluster. The outputs are two quantities that describe the clustering: the expected number of LUTs that can be packed into each cluster, and the expected number of inputs to each cluster that are used. 3.2.1 Number of Single-Level Clusters Needed to Implement a Circuit Consider a technology mapped circuit consisting of nk K-LUTs. During clustering, these nk LUTs are packed into a smaller number of clusters, which we will denote nc as shown in Figure 3.3. In this section, we seek closed-form expressions for nc. In the next section, we derive an expression for the expected number of inputs used per cluster, i. Each cluster can contain up to N LUTs and have up to I unique inputs. In an architecture with a large value of I and small value of N , it is likely that most clusters will be completely filled. On the other hand, architectures with a small value of I some clusters may not be completely filled because of the limitation on the number of unique inputs. Since we wish our model to apply to both types of architectures, we consider each case separately below. We also define an equation for the boundary between the two cases. 33 Chapter 3. Model Derivation (a) 3-LUT Tech-Mapped Cir- cuit A B C D E F (b) Possible Covering A B C FE D 2 22 (c) Clustered Circuit Figure 3.3: Example of a 3-LUT tech-mapped circuit clustered into groups of 3. In this case nk = 15 and nc = 6 I-Limited Clustering We first consider architectures in which I is small, and the expected number of LUTs packed into each cluster is dictated by the number of physical pins on each cluster. Figure 3.4 shows an example I-limited clustering, where the restriction I = 5 prevents all clusters to be fully populated with three 3-LUTs. In such cases, the expected number of LUTs packed into each cluster, c = nk/nc, will be smaller than the capacity of the cluster N . To estimate c, and hence nc, we employ Rent’s Rule as follows. Consider the same region of the technology-mapped circuit from Section 3.1.1 which contains z K-LUTs and has y signals that connect outside the region. When the same region is mapped to clusters, we can write y = (i+ o)vp (3.6) where v is the number of clusters needed for this region (v ≤ z), i is the average number of used inputs per cluster, and o is the average number of used outputs per cluster. By the definition of I-limited clustering, all available cluster inputs are used. 34 Chapter 3. Model Derivation For the remainder of this derivation i = I. (a) Netlist of 3-LUTs B C (b) Possible Covering for N = 3, I = 5 B C 2 (c) Clustered Netlist Figure 3.4: Example of I-limited clustering (K = 3, N = 3, I = 5). Notice the restriction I = 5 prevents all clusters in the example to be fully populated with three LUTs. Eliminating y from Equations 3.2 and 3.6 and solving for c gives: c = p √ I + o K + 1− γ , (3.7) In this equation, o is the only value we do not have an expression for. The following presents the two different substitutions for o explored: Architectural Approximation: o = c Cluster outputs are only driven by utilized LUTs, hence the average number of used output pins is the number of utilized LUTs in the cluster. Substituting o = c into Equations 3.7 leads to a non-polynomial equation that is difficult to solve. Applying a first-order series expansion to this equation, we can write 35 Chapter 3. Model Derivation c = 1 ( I (k+1−γ )(1/p) − 1 pI (3.8) and nc = nk( 1 ( I (k+1−γ )(1/p) − 1 pI ) (3.9) Circuit Fanout Approximation: o = i/fo Alternatively, the average number of used outputs can be written as o = i/fo where fo is the average fanout of the circuit (this term will be computed below). Solving for c and nc from Equation 3.6 leads to: c = p √ I(1 + 1 fo ) K + 1− γ (3.10) and nc = nk p √ K + 1− γ I(1 + 1 fo ) , (3.11) The fanout f in Equations 3.11 and 3.10 can be calculated using a formula from [50]: f = 1− (foMax + 1)(p−1) 1− (foMax + 1)(p−2) − Φ(p, foMax) − 1 (3.12) where: Φ(p, foMax) = foMax∑ n=1 np n2(n + 1) , (3.13) and foMax is the maximum fanout written as: foMax = [(i+N) nk N (1− p)]( 13−p ). (3.14) In Equation 3.14 we approximated the number of outputs as N and the number of clusters as nk/N ; experimentally, we have found that the fanout is only a weak function of foMax, and thus these approximations lead to only a small error. 36 Chapter 3. Model Derivation N-Limited Clustering This case is trivial. As in the Figure 3.5 example, cluster input pins are plentiful and clusters can be filled to capacity. Hence, c = N and nc = nk N (3.15) (a) Netlist of 3-LUTs B (b) Possible Covering for N = 3, I = 9 B (c) Clustered Netlist Figure 3.5: Example of N-limited clustering (K = 3, N = 3, I = 9). Notice with I = 9, there is no longer a restriction on fully packing all N = 3 clusters. However, N = 3 does limit the entire circuit from being packed into a single cluster, as in the case N = 6. Boundary Condition For architectures in which c < N , clustering is I-limited, otherwise it isN -limited. We found that deriving a boundary condition from Equation 3.8, leads to an unintuitive equation and is not used in our model. Instead, using Equation 3.10, we can write 37 Chapter 3. Model Derivation the following condition that indicates that clustering is I-limited: p √ I(1 + 1 fo ) K + 1− γ < N (3.16) This can be rearranged to produce: I < Np K + 1− γ 1 + 1 fo (3.17) For all values of I in which Inequality 3.17 is true, clustering is I-limited. 3.2.2 Average Number of Used Inputs for a Single-Level Cluster Again, we consider I-limited and N -limited architectures separately. The boundary condition between the two types of architectures is the same as in the previous section. I-Limited Clustering In these architectures, we would expect all cluster input pins to be used. Thus, we can write i = I (3.18) N-Limited Clustering Consider applying Rent’s Rule to a single cluster as shown in Figure 3.6. Setting the Rent’s Rule boundary to encompass this single cluster, the number of utilized inputs and outputs crossing the boundary is T = i+ o since i of the I cluster inputs and o of the N cluster outputs are expected to be utilized. Also, the number of basic logic 38 Chapter 3. Model Derivation blocks, in this case K-LUTs, contained in the bounded region is N . Substituting these values into Equation 2.1 leads to: i+ o = t ·Np (3.19) As shown in Section 3.1.2, the average number of used inputs, t, for a K-LUT can be estimated as K − γ and since all LUT output are always utilized, Equation 3.20 can be rewritten as: i+ o = (K + 1− γ) ·Np (3.20) N LUTs K-LUT #1 K-LUT #N I N N Outputs I Inputs K K Cluster Rent's Rule Boundary Figure 3.6: Rent’s rule applied to a single cluster (K = 3, N = 3, I = 8) 39 Chapter 3. Model Derivation By substituting o = i/fo into Equation 3.20 and solving for i, we obtain i = (K + 1− γ)Np 1 + 1 fo . (3.21) 3.3 Multi-Level Clustering The final part of the model mirrors multi-level clustering. The first-level of clustering is the same as single-level clustering, where K-LUTs are grouped together into first- level clusters. First-level clusters are then recursively grouped together into larger clusters with a pre-defined capacity and number of inputs at each level as illustrated in Figure 3.7. The model’s inputs are analogous to the single-level model: N is replaced by NL and I is replaced by IL, where the L subscript indicates the level in the hierarchy that the value describes. (a) Netlist of L=1 Clusters B C (b) Possible Covering for N2 = 2, I2 = 6 B C (c) Netlist of L=2 Clusters Figure 3.7: Example of multi-level clustering 40 Chapter 3. Model Derivation 3.3.1 Number of Multi-Level Clusters Needed to Implement a Circuit We now model the multi-level clustering process, for an architecture with M hier- archical levels. Consider a technology mapped circuit consisting of nk K-LUTs. In multi-level clustering, these nk LUTs will be packed into a number of first-level clus- ters, which we will denote n1. Level-1 clusters are then recursively grouped into progressively smaller number of nL sized clusters, where L denotes the hierarchical level in the FPGA architecture and L ≤ M . In this section, we seek a closed-form expressions for nL. In the next section, we derive an expression for the expected number of inputs used per cluster for the Lth level, iL. At each level L, clusters contain up to NL (L − 1)-level clusters and have up to IL unique inputs. Each individual level of an architecture is categorized as either I or N -limited. Hence, it is possible for an architecture to be both I and N limited at different levels and each case is handled separately below. The simple modification of replacing N and I with N1 and I1 in Equations 3.11, 3.15, 3.18, and 3.21 of our single-level clustering model produces our first-level model for multi-level clustering. These modifications can be seen in the summary of equations in Section 3.5. The remainder of this section derives the capacity of the model’s subsequent levels. 3.3.2 I-Limited Clustering Consider the L-level cluster in which IL is small, and the expected number of NL−1 clusters packed into each cluster is dictated by the number of physical pins. In this case, the expected number of (L−1) clusters packed into each L cluster, cL = nL−1/nL, will be smaller than the capacity of the cluster NL. Consider the same circuit region as the previous derivations. When that region is mapped to L-level clusters, we can 41 Chapter 3. Model Derivation write y = (iL + oL)w p L (3.22) where wL is the number of L-level clusters needed for this region (wL ≤ wL−1 ≤ · · · ≤ w1 ≤ v), iL is the average number of used inputs per L-level cluster and oL is the average number of used outputs per level-L cluster. In this case, by definition, all cluster inputs are utilized and for the remainder of the derivation iL = IL. Clustering at the L-level results in wL−1 (L − 1)-level clusters packed into wL L-level clusters. Applying Equation 3.22 to the L and (L − 1) cases, eliminating y, and solving for cL leads to: cL = wL−1 wL = p √ iL + oL iL−1 + oL−1 (3.23) In this equation, a derivation for the values of the expected number of used outputs, oL and oL−1, is needed. As discussed in Section 3.2.1, there are two methods of approximating o. In both cases we chose the approximation that led to the most intuitive and simple equation. In the L-level case, applying the average fanout based approximation provides a solution without performing a complicated series expansion. In the (L − 1)-level case we prefer the architectural based approximation, oL−1 = cL−1 since it’s a simpler expression compared to the fanout approximation and the value cL−1 is easily accessible when applying this model recursively. Making these substitutions into Equation 3.23 gives: cL = p √ iL(1 + 1 foL ) iL−1 + cL−1 (3.24) 42 Chapter 3. Model Derivation and nL = nL−1 p √ iL−1 − cL−1 iL(1− 1foL ) (3.25) 3.3.3 N-Limited Clustering This is the trivial case. Since cluster inputs are plentiful, clusters can be filled to capacity. At an clustering L-level, cL = NL and nL = nL−1 NL (3.26) 3.3.4 Boundary Condition For architectures in which cL < NL, clustering is I-limited, otherwise it is N -limited. Using Equation 3.24, we can write the following condition that indicates that clus- tering is I-limited: p √ iL(1 + 1 foL ) iL−1 + cL−1 < NL (3.27) This can be rearranged to produce: IL < N p L iL−1 + cL−1 (1 + 1 fo(L−1) ) (3.28) For all values of I which inequality 3.28 holds, clustering is I-limited. 3.3.5 Number of Used Inputs for a Multi-Level Cluster Again, I-limited and N -limited architectures are treated separately. The boundary condition between the two types of architectures is the same as in the previous section 43 Chapter 3. Model Derivation I-Limited Clustering In these architectures, we would expect all cluster input pins to be used. Thus we can write: iL = IL (3.29) N-Limited Clustering By applying Rent’s Rule to a L-level cluster that has iL + oL external I/O pins, we obtain: iL + oL = (iL−1 + oL−1)N p L−1 (3.30) since each of the NL−1 (L − 1)-level clusters contained in the L-Level cluster has (iL−1 + oL−1) external connections. By substituting oL = iL/fL and oL−1 = cL−1 and solving for iL, we obtain iL = iL−1 + cL−1N p L−1 1 + 1 foL (3.31) 3.4 Model Summary Our model is summarized in Table 3.2. In this chapter we derived and presented our model’s equations in terms of CAD tool stages. However, our model equations are intrinsically related and can be further expanded by using simple substitutions, exposing subtle relationships. For exam- ple, when Equation 3.4 was substituted into Equation 3.11, the resulting equation nc = ns p √ 3 I(1+1/fo) was the number of utilized clusters under I-limited clustering and contained only architectural and circuit parameters. This form revealed that cluster utilization is a weak function of LUT size. 44 Chapter 3. Model Derivation In subsequent chapters we validate our model’s derived equations and applied them in an analytical model that relates logic parameters to the area efficiency of an FPGA. Description Equ. # Equation Technology Mapping Number of Utilized K-LUTs 3.4 nk = ns p √ 3 K+1−γ Expected K-LUT Input Utilization - K − γ Single/First-Level Clustering Number of Utilized Clusters 3.11 nc = nk p √ K+1−γ I(1+ 1 fo ) for I < Np K+1−γ 1+ 1 fo 3.15 nc = nk N for I > Np K+1−γ 1+ 1 fo Expected Cluster Input Utilization 3.18 i = I for I < Np K+1−γ 1+ 1 fo 3.21 i = (K+1−γ)N p 1+ 1 fo for I > Np K+1−γ 1+ 1 fo Multi-Level Clustering Number of Utilized Clusters 3.25 nL = nL−1 p √ iL−1−cL−1 iL(1− 1 foL ) for IL < N p L iL−1+cL−1 (1+ 1 fo(L−1) ) 3.26 nL = nL−1 NL for IL > N p L iL−1+cL−1 (1+ 1 fo(L−1) ) Expected Cluster Input Utilization 3.29 iL = IL for IL < N p L iL−1+cL−1 (1+ 1 fo(L−1) ) 3.31 iL = iL−1+cL−1N p L−1 1+ 1 foL for IL > N p L iL−1+cL−1 (1+ 1 fo(L−1) ) Table 3.2: Model Summary 45 Chapter 4 Model Verification In this chapter, we evaluate the accuracy of the model derived in Chapter 3. We employed an experimental approach. Experimental data was generated by synthesiz- ing and compiling benchmark circuits using the FPGA CAD flow and then compared against our model’s estimations. The chapter is organized as follows. The methodology is described in Section 4.1. The verification of the technology mapping, single-level, and multi-level models are described in Sections 4.2 - 4.4. Section 4.5 summarizes the chapter. 4.1 Experimental Methodology Two sets of benchmarks were used in this study: 1) twenty benchmark circuits from Microelectronics Centre of North Carolina (MCNC) [41] and 2) five synthetically generated benchmarks created using a system-level stochastic circuit generator [37]. The benchmark circuit names, along with their size in 2-LUTs, and Rent parameter are shown in Table 4.1. The Rent parameter of each circuit was measured using the Gortian partitioner [23]. The partitioner was recursively applied to each circuit, and the Rent parameter was evaluated at each iteration. The final value for the Rent parameter was calculated as the arithmetic average of the measured Rent parameters at all but two levels. The Rent parameter measured during the first two levels are often influenced by I/O 46 Chapter 4. Model Verification Name Size (# of 2-input Gates) Rent’s Parameter MCNC s38417 13307 0.588249 ex1010 8020 0.662364 pdc 8408 0.734196 spla 7438 0.690866 frisc 6002 0.655749 elliptic 5464 0.634743 bigkey 2979 0.544277 dsip 2531 0.569324 s298 4268 0.551897 des 2901 0.614122 apex2 3165 0.665979 seq 2939 0.70254 diffeq 2544 0.577743 alu4 2732 0.647396 apex4 2196 0.716016 tseng 1858 0.583609 misex3 2557 0.687191 ex5p 1779 0.723639 i10 1668 0.637836 C6288 1820 0.637836 Synthetically Generated net7 0 56953 0.69094 net7 1 33374 0.656576 net7 2 28538 0.76029 net7 3 42200 0.666351 net7 4 39853 0.62213 Table 4.1: Benchmark List 47 Chapter 4. Model Verification restrictions caused by packaging, and are omitted to ensure we capture the circuit’s “natural” Rent parameter. This Rent parameter, along with the number of 2-input gates in each circuit were then used as inputs to the models described in Chapter 3. The model predictions were compared to results obtained from an experimental flow as shown in Figure 4.1. In this experimental flow, the two-input gates in each circuit were first packed into look-up tables (as described in Section 2.2.1, this process is referred to as technology mapping). We employed two different technology mapping algorithms: Emap [29] and Flowmap [9]. After being technology mapped, we counted the number of lookup-tables required to implement each circuit, and compared this to the results of Equation 3.4. The comparison is presented in Section 4.2. Each technology-mapped circuit is then packed into logic blocks (as described in Section 2.2.2, this process is referred to as clustering.) Again, we use two different clustering algorithms: T-VPack [38] and an implementation of iRAC algorithm [44]. After clustering, we count the number of clusters required to implement each circuit and compare this to that predicted by Equations 3.11, 3.15, 3.18, and 3.21. These comparisons are presented in Section 4.3. Finally, we employed a multi-level clustering algorithm to cluster the lower-level clusters into higher-level clusters. Since a multi-level clustering algorithm was not immediately available, we modified the T-VPack algorithm to perform recursive clus- tering for multi-level architectures and named it MT-VPack. MT-VPack shares T-VPack’s core algorithm, a two step algorithm that 1) selects a seed logic block (LB) for the new cluster, 2) then greedily packs subsequent LBs based on an attraction function. The attraction function is based on two components, a shared net and critical path cost function, as described in Section 2.2.2. The main difference between the two programs is in the calculation of the two cost functions between the unpacked LBs and the new cluster. Differences between the cost func- 48 Chapter 4. Model Verification tion calculations originate from the additional output connections that need to be considered when packing multiple output low-level clusters in MT-Vpack in contrast to single output BLEs in T-Vpack. With the exact quality of MT-VPack’s heuristics undetermined, the use of this clustering tool in our verification may be questionable. However, in the absence of a fully evaluated multi-level clustering tool, MT-VPack fills our verification needs. Though MT-VPack is not an optimal clusterer, it is based on the ubiquitous single- level clusterer, T-Vpack, and it’s expected to yield reasonable data with similar trends. In verification, our model’s tracking accuracy is essential to its use in ar- chitecture evaluation. Accurate absolute value estimates of optimally packed cluster is of secondary importance. At each level of the clustering, we employed MT-VPack and counted the number of clusters required to implement each circuit, as well as the average number of inputs used in each cluster, and compared those numbers to Equations 3.25, 3.26, 3.29, and 3.31. The results are presented in Section 4.4. 4.2 Technology Mapping Model Figure 4.2 illustrates the accuracy of our technology mapping model. Table 4.1 shows the results for two different technology mapping algorithms, EMap[29] and Flowmap[9], averaged over the benchmark circuits. We obtained the analytical re- sults from Equation 3.4 using the average Rent parameter from all benchmark circuits. As the graph shows, the analytical results track the experimental results very closely. 49 Chapter 4. Model Verification Synthesized Benchmark Circuits nk=F(n2,K,p) Emap/Flowmap n1=F(nk, N1, I1, p) i1=F(nk, N1, I1, p) T-Vpack/iRAC nL=F(nL-1, NL-1,IL-1, p) iL=F(nL-1, NL-1,IL-1, p) Modified T-Vpack Gortian Partitioner Measure # of 2 Input Gates p n2 Technology Mapping Single / First-Level Clustering Multi-Level Clustering Section 4.3 Section 4.2 Section 4.1 Measured Circuit Parameters Figure 4.1: Verification flow Model Experimental (EMap) Experimental (Flowmap) 0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 2 3 4 5 6 7 Inputs per LUT (K) n 2 / n k Figure 4.2: Logic per LUT vs. LUT size 50 Chapter 4. Model Verification 4.3 Single-Level Clustering Model Figure 4.3 and 4.4 illustrate the accuracy of our clustering model. Figure 4.3 shows the ratio n2 nc as a function of cluster size. In all cases, we assume that the lookup- table size, K, was 4 and the number of available cluster inputs, I, was 0.88N + 3.2 from [18] (this ensures that our clustering is I-limited, which is the interesting case for this graph). The analytical results were obtained using Equation 3.4 and 3.11, while the experimental results were obtained using two separate clustering algorithms, T- Vpack [5] and an implementation of iRAC [44]. Again, the analytical results are very consistent with both sets of experimental results. Figure 4.4 shows the same ratio as a function of the number of input pins per cluster, I. In all cases, K = 4 and N = 20. The boundary between I-limited and N -limited architectures is also shown. The graph shows that our model tracks the experimental results well in both regions. Figure 4.5 and Figure 4.6 shows the average number of used inputs per cluster as a function of cluster size. Although our model tracks both sets of experimental results, it matches the iRAC results more closely. This is expected, since iRAC explicitly tries to minimize the use of cluster pins. 4.4 Multi-Level Clustering Model The previous section focuses on verifying our model for varying LUT size, cluster size, and cluster inputs for single-level clustering. In this section we concentrate on verifying our model across a number of levels, L, for a multi-level FPGA. Figure 4.7 shows the accuracy of our model for up to 4 levels. In this experiment K = 4 and a cluster size of 4 was used at each level (ncL = 4 for L = [1, 4]). At L = 4, 51 Chapter 4. Model Verification 0 5 10 15 20 25 30 35 4 8 12 16 20 Cluster Size (N) n 2 / n c Model Experimental (TVPack) Experimental (iRAC) Figure 4.3: Logic packed per cluster vs. cluster size Level (L) 1 2 3 4 Number of Inputs (I) 7 18 49 136 Total Number of LUTs 4 16 64 256 Table 4.2: Cluster architectures that achieve 90% logic block utilization the clusters contained 256 4-LUTs, large enough to encapsulate the largest MCNC benchmark in less than 15 clusters. To improve the statistical significance, we used a synthetic circuit generator [37] to create larger benchmark circuits that were greater than 10 times larger than the largest MCNC benchmark (s38417). At each level, we set IL to a value that led to 90% logic block utilization, ensuring I limited clustering. For example, Figure 4.8 shows at Level 1, I1 = 7. Table 4.2 summarizes the exact architectures used at each level. Figures 4.9 shows the average number of used inputs, iL as a function of cluster levels, L. In both the multi-level versions of the cluster utilization and used input 52 Chapter 4. Model Verification 0 5 10 15 20 25 30 35 40 45 0 10 20 30 40 80 Inputs per Cluster (I) n 2 / n c Model Experimental (TVPack) Experimental (iRAC) Boundary between I and N-limited clustering Figure 4.4: Logic packed per cluster vs. input size models we see an increase in error between the model and experimental data as L increases. The two errors may be associated to an inaccuracy in the boundary condition. Specifically in these cases our model assumes solely I-limited clustering when in reality, the experimental data indicates some N -limited clustering. This is evident in Figure 4.9, where iL = IL, a clear indication that our model assumed I-limited clustering for all clusters. However, in the experimental data iL < IL, suggesting a significant number of clusters are actually N-limited. This misprediction in the boundary condition can also be seen in Figure 4.7, since assuming I- instead of N -limited clustering would lead to an under-estimate in cluster packing density, nk/ncL. Though both single- and multi-level models share similar boundary condition, the recursive nature of our multi-level model leads to compounded errors and becomes more pronounced at higher hierarchical levels. 53 Chapter 4. Model Verification 5 10 15 20 25 30 35 4 8 12 16 20 Cluster Size (N) U s e d i n p u t s p e r C l u s t e r Estimated Experimental (TV-Pack) Experimental (IRAC) Model Figure 4.5: Used inputs per cluster vs. cluster Size 5 10 15 20 25 30 3 4 5 6 7 LUT Size U s e d I n p u t s p e r C l u s t e r ( i ) Estimated Experimental (TV-Pack) Experimental (IRAC) Model Figure 4.6: Used inputs per cluster vs. LUT size 54 Chapter 4. Model Verification !" #!" $!" %!" &!" '!!" '#!" '$!" '%!" '&!" '" #" (" $" ! " #! $% & '()*+,-&%,.,(&/%0& )*+,-" ./0,123,456-" Figure 4.7: Number of 4-LUTs vs. cluster level !"# $!"# %!"# &!"# '!"# (!"# )!"# *!"# +!"# ,!"# $!!"# $$!"# '# (# )# *# +# ,# $!# $$# $%# $&# $'# $(# $)# ! "# $ %& '( $ ) * ") +" ,) - &. "/ %) .0 " 123456")+"7%28956":*;298"<:=" Chosen Architecture Figure 4.8: Choosing an architecture for verification at level 1 55 Chapter 4. Model Verification 4.5 Summary In this chapter we demonstrated that our model tracks experimental results across the four swept architectural parameters, LUT size, cluster size, number of cluster inputs, and depth. In the next chapter we apply our model in two sample applications and show our model is sufficiently accurate to base architectural decision on it’s estimates. In these two applications we exhibit our model’s ability to rapidly search architectural spaces for optimal configurations for both single- and multi-level cluster FPGAs. !" #!" $!" %!" &!" '!!" '#!" '$!" '%!" '" #" (" $" ! " # $ % &' ( )' * +% , '- . / " 0+ '1 2 3 4' 3%5%6+'134' )*+,-" ./0,123,456-" IL - # Input Pins / Cluster Figure 4.9: Used inputs per cluster vs. cluster levels 56 Chapter 5 Applications of Model One of the main contributions of our model is to allow for early architectural evalua- tion. In this chapter, for both single and multi-level FPGAs, we show how the model, along with the channel width model from [18], can be used to estimate the routing area in an FPGA as a function of architectural parameters such as K, N , and I. In modern FPGAs routing area accounts for the majority of the overall area [5] and is an essential metric in architectural evaluation. We will investigate whether an analytical flow employing our model leads to similar conclusions that would be obtained by a more time-consuming experimental methodology. This chapter is divided into three sections. The first section provides a brief summary of the classical experimental method used to perform similar performance evaluations. This is followed by an example application of the single-level model used in the estimation of routing area as a function of K, N , and I. The resulting estimations are compare against experimental results obtained by performing the classical experimental methods. The last section investigates the use of our multi- level cluster model to rapidly sweep a large architectural space in search of an optimal architecture. 57 Chapter 5. Applications of Model 5.1 The Classical Experimental Method During the design of a new FPGA, the area implications of each architectural en- hancement has to be evaluated, to determine whether it should be incorporated in the new device. The current method of evaluating routing area is to experimentally synthesize a series of benchmark circuits using a CAD flow into different FPGA ar- chitectures of interest, and then measure the resulting area. This method is common in both academia [2, 5, 27] and industry [34]. A typical CAD flow used in evaluation is shown in Figure 5.1 and is the flow im- plemented in the verification of our model’s applications as seen later in this chapter. In the first step the SIS [43] program performs synthesis, which takes a benchmark circuit and converts them into an equivalent circuit netlist of simple 2-input logic gates and flip flops. This netlist is then inputted into Flowmap [9] that performs technology mapping and converts the circuit in a netlist of K-input LUTs. Then, all the LUTs are packed into logic clusters using either T-VPack [38] for single-level clus- ter architectures or MT-VPack for multi-level cluster architectures. This is followed by the placement and routing of all clusters using VPR 5.0 [36]. Experiments evaluating the impact of logic architecture, described by K, N , and I, on routing area are conducted by performing sweeps of each parameter. During these experiments a reasonable FPGA routing architecture is chosen and remains con- stant throughout all experiments. From the placement information and the routing architecture description, the VPR 5.0 router determines the needed routing resources to implement the circuit and calculates the needed routing area using an area model similar to that shown in Appendix A. 58 Chapter 5. Applications of Model Circuit High Level Synthesis (SIS) Technology Map to K-LUTs (Flowmap) Clustering (T-VPack or MT-VPack) Placement (VPR) Routing (VPR) FPGA Logic Block Architecture FPGA Routing Architecture K K, N, I N,I Description Determine Transistor Area to Build FPGA Routing Figure 5.1: Classic architecture evaluation flow 5.2 Single-Level Cluster Architecture In this section we evaluate the effectiveness of our single-level cluster model to esti- mate the inter-cluster routing area as a function of K, N , and I. Our area model was derived from the area model calculations found in VPR 5.0 [36] and a full derivation can be found in Appendix A. We considered three flows: 1. Purely analytical, 2. Purely experimental, and 3. An intermediate flow. Each is described in detail below followed by a comparison of flow results and a discussion of sources of errors. The three flows we consider are: 1. Flow 1: Purely analytical - We used the model described in Chapter 3, to estimate nc, the number of utilized clusters, and i, the expected number of utilized cluster inputs. These quantities were used in conjunction with the 59 Chapter 5. Applications of Model channel width model in [18] to determine the amount of routing needed for a given architecture. We then used our area model to estimate logic block, connection block, and switch block area usage. These equations are similar to those used within VPR 5.0 that assumed an architecture with uni-directional and single-driver wires. Multiplexers were assumed to be implemented using a two-tiered structure with one-hot select lines (as in VPR 5.0, see Appendix A). Finally, these area estimates were used to determine the area required for each of twenty large benchmark circuits. Note that this flow was purely analytical and does not require experimental CAD tools. 2. Flow 2: Purely Experimental - As described in Section 5.1, each of the twenty benchmark circuits were full synthesized using the CAD flow in Figure 5.1. The final area required for reach of the twenty benchmarks was calculated using the model within VPR 5.0. 3. Flow 3: Intermediate - As an intermediate between the first two flows, we used the model in Chapter 3 to estimate nc and i, but used VPR 5.0 to determine the actual channel width (instead of using the model from [18]). The analytically estimated values for nc and i and the experimentally measured values for channel width were the inputs to our area model. This flow is included to provide insight into each portion of the analytical model. Figure 5.2-5.4 shows the results for an architecture with Fs = 9, Fcin = 20, and Fcout = 4. Figure 5.2 shows the area as a function of K, Figure 5.3 shows the area as a function of N , and Figure 5.4 shows the area as a function of I. All cases resulted in close tracking between Flow 2 (purely experimental) and Flow 3 (analytical technology mapping and clustering but experimental routing). This was expected, given our model’s accuracy. 60 Chapter 5. Applications of Model 0 2000000 4000000 6000000 8000000 10000000 12000000 14000000 16000000 18000000 20000000 3 4 5 6 7 N u m b e r o f M in im u m T ra n s is to r W id th s LUT Size (K) Flow 1 - Analytical Flow 2 - Experimental Flow 3 - Intermediate Figure 5.2: Area vs. LUT size When the model from [18] is used (Flow 1), the analytical estimates still tracked the experimental results, however, not as closely. The largest differences arose for architectures with a small number of cluster inputs. These architectures contained far fewer inputs per cluster than the equations in [18] were intended to model. From further experimentation, we found that for such “extreme” architectures, the average wirelength is 25% larger than the architectures considered in [18]. We saw a direct relationship between required routing resources and wirelengh from Equation 2.13 indicating this was a plausible cause of the over-approximation in the routing area. The wirelength approximation error could be explained, primarily because as I decreased, the sharing of LUT inputs within a cluster was encouraged. This tended to decrease the average fanout, as seen in Figure 5.5. Intuitively, decreasing the average fanout tends to increase the average wirelength. The role of a single net in an 61 Chapter 5. Applications of Model 0 5000000 10000000 15000000 20000000 25000000 4 8 12 16 20 N u m b e r o f M in im u m T ra n s is to r W id th s Cluster Size (N) Flow 1 - Analytical Flow 2 - Experimental Flow 3 - Intermediate Figure 5.3: Area vs cluster size FPGA can be modeled as a spanning tree, where vertices model I/Os and switches and edges model wires. The branches of the branch-trunk structure of a spanning tree reduces the wirelength compared to a strictly trunk structure, like a direct point- to-point routing architecture. A decrease in fanout both likely reduces the number of branches in a route and increases the number of trunks, limiting the wirelength savings of the tree structure, and leading to an overall increase in wirelength. The limitations of Fang’s model is an important observation – although the model presented correctly modeled architectures with small values of I, a complete analytical flow that accurately models these architectures does not yet exist. Addressing this limitation is an interesting area for future research, since if such a model did exist, the tracking between Flows 2 and 3 suggests the potential of completely replacing the experimental approach with the the analytical approach presented. 62 Chapter 5. Applications of Model 0 5000000 10000000 15000000 20000000 25000000 30000000 35000000 40000000 4 5 6 7 8 9 10 11 12 13 14 15 16 N u m b e r o f M in im u m T ra n s is to r W id th s Number of Cluster Inputs (I) Flow 1 - Analytical Flow 2 - Experimental Flow 3 - Intermediate Figure 5.4: Area vs number of inputs 5.3 Multi Level Cluster Architecture In our second application of our model, we demonstrate a method of rapidly exploring a multi-level FPGA design space. In this case we determine to acquire the optimal values for NL and IL for a multi-level cluster containing exactly eight 4-LUTs. One may consider using an experimental approach and performing an exhaustive search to solve this problem. However this approach would be time prohibitive, since the design space consists of more than 800 cluster architectures. To put the computation time in perspective, if an architecture evaluation was performed using 20 benchmarks, each on average taking 30 minutes for VPR to compile, this entire experiment would take over 7500 processor · hours. This is a generous prediction since typically modern digital designs require several hours to compile. Instead, in this application we take an analytical approach to rapidly search and 63 Chapter 5. Applications of Model 0 0.5 1 1.5 2 2.5 3 3.5 0 1 2 3 4 5 6 4 5 6 7 8 9 10 11 12 13 14 15 16 A v e ra g e F a n o u t (f ) A v e ra g e W ir e le n g th ( R ) Number of Cluster Inputs (I) Wirelength Fanout Figure 5.5: Identifying limitations of Fang’s model [18] reduced the design space to only interesting architectures. We use the area model described in Appendix A in conjunction with the logic block and routing resource estimations from the equations in Chapter 3 and [18] to estimate the area required for all architectures in the design space. Table 5.1 presents the three most area efficient cluster input configurations for each cluster size combination and forms a reduced design space by choosing. Finally, we perform a more thorough and time-intensive CAD experiment on the top architectures to determine the optimal architecture. These results were verified using the similar three flows used in our single cluster application for our reduced design space. 1. Flow 1: Purely analytical - We used the model described in Chapter 3 and [18] to estimate logic block and routing resource usage, respectively. These values were applied to the multi-level cluster model described in Appendix A to 64 Chapter 5. Applications of Model Architecture Index Cluster Sizes Cluster Inputs N1 N2 N3 I1 I2 I3 1 2 2 2 5 7 12 2 2 2 2 5 8 13 3 2 2 2 5 8 14 4 2 4 1 5 13 1 5 2 4 1 5 14 1 6 2 4 1 6 13 1 7 4 2 1 8 14 1 8 4 2 1 8 15 1 9 4 2 1 8 13 1 10 8 1 1 13 1 1 11 8 1 1 14 1 1 12 8 1 1 16 1 1 Table 5.1: List of the best cluster architectures with an overall number of LUTs of 8 determine the area required for each of the twenty large benchmarks circuits. These area values were the same as those used in the rapid architecture space search. 2. Flow 2: Purely Experimental - Each of the 20 large benchmark were mapped into 4-LUTs using flow map and then packed into the cluster architectures described in Table 5.3 using T-VPack at the first-level and MT-VPack for subsequent levels. The resulting clusters were then placed and routed with VPR 5.0 and we used the CAD tool’s internal model to calculate the area for each of the twenty benchmarks. 3. Flow 3: Intermediate - In this flow we used the equations from Chapter 3 to determine the logic block resource usage, but used the VPR 5.0’s model to determine the actual routing resource demand. These values were then used applied to our multi-level cluster area model. This flow provided insight into each portion of the analytical model. 65 Chapter 5. Applications of Model Figure 5.6 shows the results for each of the 12 architectures in Table 5.1. As in the single-level clustering application, the inaccuracies of Flow 1 compared to the other two flows show the limitations of the Fang model. Flows 2 and 3 track very well and suggests with an accurate routing resource model our rapid search approach would be effective way to reduce the design space and is possibly accurate enough to entirely base our architecture decisions on. !" #!!!!!!" $!!!!!!!" $#!!!!!!" %!!!!!!!" %#!!!!!!" $" %" &" '" #" (" )" *" +" $!" $$" $%" ! " # $ % & '( )' * +, +# " # '- &. , / +/ 0( & '1 +2 0 3 / ' 4&53+0%50"&%'6,2%7' ,-./"$" ,-./"%" ,-./"&" Figure 5.6: Number of minimum-transistor widths (MTW) for the best architectures 5.4 Summary This chapter first compared our single-level clustering model against results from clas- sical experiments and showed our model’s potential to develop into a viable analytical area model. In the second section, we evaluated multi-level cluster architectures using analytical means. Analytically we performed an exhaustive sweep of the architectural space, to identify a region of interest. A comparison between experimental and esti- 66 Chapter 5. Applications of Model mated results showed our model was adequately accurate within that targeted region. In the next chapter we summarize our overall results and expand on the further im- provements for our models. 67 Chapter 6 Conclusions 6.1 Contributions In this thesis, we presented an analytical model describing the relationship between FPGA logic block architecture and logic block resource utilization. Our model was built in three stages, each estimating measurable values from a corresponding FPGA CAD flow stage. The first stage, analogous to technology mapping, estimated the number of K-LUTs needed to implement a circuit. The second and third stages paralleled single- and multi-leveled clustering. For these two clustering stages our model estimated two values: 1) the number of clusters needed to implement a circuit and 2) the expected number of used cluster inputs used for each cluster. All three stages were verified against experimental data produced by the corresponding CAD tool and were shown to have good accuracy. To illustrate the power of our model, we applied our logic block utilization model with Fang’s routing resource utilization model [18] to perform two architecture eval- uations focused on area-optimization. Our first application demonstrated how our model could be used to search for the best LUT size (K), number of cluster inputs (I), and cluster size (N) for single-level FPGA architectures. Comparing the ana- lytical against experimental results, our model showed adequate accuracy and led to conclusions similar to previous academic papers [2, 5]. Our second application demonstrated our model in a multi-level architecture sweep 68 Chapter 6. Conclusions to identify a set of area-efficient architectures. In this case, verifying all possible architectures was time prohibitive. Instead, we utilized the area model and performed a rapid search to reduce the number of possible optimal architectures in the design space to twelve. These twelve architectures were then evaluating using a full CAD flow compilations, and the resulting CAD model estimations were compared against our model’s area estimations. The results showed our model had good fidelity and Flows 2 and 3 track very well, suggesting that with an accurate routing resource model our rapid search approach would be effective way to reduce the design space and is possibly accurate enough to entirely base our architecture decisions on. 6.2 Future Work 6.2.1 Model Improvements In Section 3.1.2 we introduced the value γ, the expected number of inputs to a K- LUT that are not used. We have not found a way to accurately model γ analytically and remains an interesting area of research for the future. Modern FPGAs contain advanced BLEs, dedicated digital signal processing (DSP), and memory blocks to improve area density and eliminate performance bottlenecks in DSP designs. Although these components were not included in this work, they need to be correctly modeled to enable future architectural exploration. The current model does not handle advanced features present in modern FPGA architectures. The model could be enhanced to cover the components discussed below. 69 Chapter 6. Conclusions Advanced BLE In most datapath circuits the critical path is contained in the carry chain used in arithmetic and logic operations. BLE’s contain specialized high-performance circuitry that directly connect adjacent BLEs and bypasses the slower LUT logic to improve performance in carry chain operations. Such an architecture shifts carry logic from a BLE’s LUT to the carry chain circuitry resulting in a reduction in the number of utilized BLEs. To improve area-efficiency, modern BLEs contain fracturable adaptive-LUTs (ALUT) capable of being configured to effectively implement one large or two smaller sized LUTs. For example, Altera’s Stratix III [12] proprietary BLE contains a single 8- input ALUT, which at the tech mapping CAD stage can be configured to implement a single 6-input function or two smaller LUTs which sizes depend on the number of shared inputs between the two LUTs. Dedicated Blocks In modern FPGAs, some clusters are replaced with large memory blocks and DSP blocks to achieve higher logic density and reduced delay. FPGA memory blocks facilitate large data storage and also provide an area-efficient implementation of first- in first-out (FIFO) buffers and shift registers. DSP blocks contain specialized high- performance circuitry to support a large number of parallel multipliers. Again logic is shifted from LUTs to these specialized blocks resulting in a reduced number of utilized LUTs and more area-efficient implementation. 70 Chapter 6. Conclusions 6.2.2 Analytical Modeling Project This body of work is the first of a larger analytical modeling project to provide a tool to enable early-stage architecture development and provide insight to understand why certain architectures work well. Recently this work has been incorporated in two analytical wirelength models for FPGAs [13, 45]. The results from Chapter 5 motivates the need for an analytical model of routing resource utilization that is accurate for a wider range of architectures than current models. This would enable a fully analytical evaluation of area-usage of FPGA architectures. Other future works include developing analytical models for depth and routing resource utilization that would in turn facilitate the creation of models for power, delay, compilation time, and all other important metrics in evaluating and designing optimal architectures for the next generation of FPGAs. 71 Bibliography [1] A.A. Aggarwal and D.M. Lewis. Routing architectures for hierarchical field programmable gate arrays. In Computer Design: VLSI in Computers and Pro- cessors, 1994. ICCD ’94. Proceedings., IEEE International Conference on, pages 475–478, 1994. [2] Elias Ahmed and Jonathan Rose. The effect of lut and cluster size on deep- submicron fpga performance and density. pages 3–12, 2000. [3] J.H. Anderson and F.N. Najm. Power-aware technology mapping for LUT-based FPGAs. Field-Programmable Technology, 2002. (FPT). Proceedings. 2002 IEEE International Conference on, pages 211–218, Dec. 2002. [4] S. Balachandran and D. Bhatia. A priori wirelength estimation and interconnect estimation based on circuit characteristics. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 24(7):1054–1065, July 2005. [5] V. Betz, J. Rose, and A. Marquardt. Architecture and CAD for Deep-submicron FPGAs. Kluwer Academic Publishers, 1999. [6] Vaughn Betz and Jonathan Rose. VPR: A new packing, placement and routing tool for FPGA research. In FPL ’97: Proceedings of the 7th International Work- shop on Field-Programmable Logic and Applications, pages 213–222, London, UK, 1997. Springer-Verlag. 72 Bibliography [7] S.D. Brown, J. Rose, and Z. Vranesic. Field-Programmable Gate Arrays. Kluwer Academic Publishers, 1992. [8] Philip Christie and Dirk Stroobandt. The interpretation and application of Rent’s rule. IEEE Trans. Very Large Scale Integr. Syst., 8(6):639–648, 2000. [9] J. Cong and Yuzheng Ding. FlowMap: an optimal technology mapping algorithm for delay optimization in lookup-table based FPGA designs. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 13(1):1–12, Jan 1994. [10] J. Cong and Yuzheng Ding. On area/depth trade-off in LUT-based FPGA tech- nology mapping. Very Large Scale Integration (VLSI) Systems, IEEE Transac- tions on, 2(2):137–148, Jun 1994. [11] Altera Corp. Cyclone family data-sheet. Available at www.altera.com. [12] Altera Corp. Stratix III device handbook. Available at www.altera.com. [13] Joydip Das, Steven J. E. Wilton, Philip Leong, and Wayne Luk. Modeling post- techmapping and post-clustering FPGA circuit depth. In FPL ’09: Proceedings of the 19th International Workshop on Field-Programmable Logic and Applica- tions, 2009. [14] J. A. DAVIS. A stochastic wire-length distribution for gigascale integration (gsi)- part i : Derivation and validation. IEEE Trans. Electron Devices, 45(3):580–589, 1998. [15] J. A. DAVIS. A stochastic wire-length distribution for gigascale integration (gsi)-part ii. applications to clock frequency, power dissipation, and chip size estimation. IEEE Trans. Electron Devices, 45(3):590–597, 1998. 73 Bibliography [16] W Donath. Wire length distribution for placements of computer logic. IBM Journal of Research and Development, 25(1):152–155, 1981. [17] W.E. Donath. Placement and average interconnect lengths of computer logic. IEEE Trans. on Circuits and Systems, 26(4):272–277, 1979. [18] Wei Mark Fang and Jonathan Rose. Modeling routing demand for early-stage FPGA architecture development. In FPGA ’08: Proceedings of the 16th in- ternational ACM/SIGDA symposium on Field programmable gate arrays, pages 139–148, New York, NY, USA, 2008. ACM. [19] M Feuer. Connectivity of random logic. IEEE Transaction on Computers, C- 31(1):29–33, Jan. 1982. [20] A.E. Gamal and Z. Syed. A stochastic model for interconnections in custom integrated circuits. Circuits and Systems, IEEE Transactions on, 28(9):888–894, Sep 1981. [21] Haixia Gao, Yintang Yang, Xiaohua Ma, and Gang Dong. Analysis of the effect of LUT size on FPGA area and delay using theoretical derivations. In Int’l Symposium on Quality Electronic Design, pages 370–374, March 2005. [22] Padmini Gopalakrishnan, Xin Li, and Lawrence Pileggi. Architecture-aware FPGA placement using metric embedding. In DAC ’06: Proceedings of the 43rd annual conference on Design automation, pages 460–465, New York, NY, USA, 2006. ACM. [23] Marcel Gort. Gortian partitioner (ece583: Assignment 3). ECE583, 2007. [24] Xilinx Inc. Virtex-5 family overview. Available at www.xilinx.com. 74 Bibliography [25] Xilinx Inc. XC2064 datasheet. Available at www.xilinx.com. [26] Ian Kuon and Jonathan Rose. Measuring the gap between FPGAs and ASICs. In FPGA ’06: Proceedings of the 2006 ACM/SIGDA 14th international symposium on Field programmable gate arrays, pages 21–30, New York, NY, USA, 2006. ACM. [27] Ian Kuon, Russell Tessier, and Jonathan Rose. Fpga architecture: Survey and challenges. Found. Trends Electron. Des. Autom., 2(2):135–253, 2008. [28] Andrew H. Lam, Steven J.E. Wilton., Philip Leong, and Wayne Luk. An ana- lytical model describing the relationships between logic architecture and FPGA density. In FPL ’08: Proceedings of the 18th International Workshop on Field- Programmable Logic and Applications, pages 221–226, 2008. [29] J. Lamoureux and S.J.E. Wilton. On the interaction between power-aware FPGA CAD algorithms. Computer Aided Design, 2003. ICCAD-2003. International Conference on, pages 701–708, Nov. 2003. [30] B. Landman and R. Russo. On a pin vs. block relationship for partitions of logic graphs. IEEE Trans. on Computers, C-20:1469–1479, 1971. [31] G. Lemieux, E. Lee, M. Tom, and A. Yu. Directional and single-driver wires in FPGA interconnect. Field-Programmable Technology, 2004. Proceedings. 2004 IEEE International Conference on, pages 41–48, Dec. 2004. [32] Guy Lemieux and David Lewis. Design of Interconnection Networks for Pro- grammable Logic. Kluwer Academic Publishers, Norwell, MA, USA, 2004. 75 Bibliography [33] Guy G. F. Lemieux, Stephen D. Brown, and Daniel Vranesic. On two-step routing for FPGAs. In ISPD ’97: Proceedings of the 1997 international symposium on Physical design, pages 60–66, New York, NY, USA, 1997. ACM. [34] David Lewis, Elias Ahmed, Gregg Baeckler, Vaughn Betz, Mark Bourgeault, David Cashman, David Galloway, Mike Hutton, Chris Lane, Andy Lee, Paul Leventis, Sandy Marquardt, Cameron McClintock, Ketan Padalia, Bruce Peder- sen, Giles Powell, Boris Ratchev, Srinivas Reddy, Jay Schleicher, Kevin Stevens, Richard Yuan, Richard Cliff, and Jonathan Rose. The stratix ii logic and routing architecture. In FPGA ’05: Proceedings of the 2005 ACM/SIGDA 13th interna- tional symposium on Field-programmable gate arrays, pages 14–20, New York, NY, USA, 2005. ACM. [35] Wei Li and D. K. Banerji. Routability prediction for hierarchical FPGAs. In GLS ’99: Proceedings of the Ninth Great Lakes Symposium on VLSI, page 256, Washington, DC, USA, 1999. IEEE Computer Society. [36] Jason Luu, Ian Kuon, Peter Jamieson, Ted Campbell, Andy Ye, Wei Mark Fang, and Jonathan Rose. Vpr 5.0: FPGA CAD and architecture exploration tools with single-driver routing, heterogeneity and process scaling. In FPGA ’09: Proceeding of the ACM/SIGDA international symposium on Field programmable gate arrays, pages 133–142, New York, NY, USA, 2009. ACM. [37] Cindy Mark, Ava Shui, and Steven J.E. Wilton. A system-level stochastic circuit generator for FPGA architecture evaluation. Field Programmable Technology, 2008. FPT 2008. IEEE International Conference on, 2008. [38] Alexander (Sandy) Marquardt, Vaughn Betz, and Jonathan Rose. Using cluster- based logic blocks and timing-driven packing to improve FPGA speed and den- 76 Bibliography sity. In FPGA ’99: Proceedings of the 1999 ACM/SIGDA seventh international symposium on Field programmable gate arrays, pages 37–46, New York, NY, USA, 1999. ACM. [39] Z. Marrakchi, H. Mrabet, and H. Mehrez. A new multilevel hierarchical MF- PGA and its suitable configuration tools. In Emerging VLSI Technologies and Architectures, 2006. IEEE Computer Society Annual Symposium on, volume 00, page 6 pp., 2006. [40] Larry McMurchie and Carl Ebeling. Pathfinder: a negotiation-based performance-driven router for FPGAs. In FPGA ’95: Proceedings of the 1995 ACM third international symposium on Field-programmable gate arrays, pages 111–117, New York, NY, USA, 1995. ACM. [41] Microelectronics Centre of North Carolina. Microelectronics centre of north car- olina benchmarks. Available at http://www.ncren.net/. [42] J. Pistorius and M. Hutton. Placement Rent exponent calculation methods, temporaral behavior and FPGA architectural evaluation. In SLIP Workshop, pages 31–38, April 2003. [43] Ellen M. Sentovich, Kanwar Jit Singh, Luciano Lavagno, Cho Moon, Rajeev Murgai, Alexander Saldanha, Hamid Savoj, Paul R. Stephan, Robert K. Bray- ton, and Alberto Sangiovanni-vincentelli. SIS: A system for sequential circuit synthesis. 1992. [44] A. Singh and M. Marek-Sadowska. Efficient circuit clustering for area and power reduction in FPGAs. In Int’l Symposium on FPGAs, pages 59–66, February 2002. 77 Bibliography [45] Alastair M. Smith, Steven J.E. Wilton, and Joydip Das. Wirelength model- ing for homogeneous and heterogeneous FPGA architectural development. In FPGA ’09: Proceeding of the ACM/SIGDA international symposium on Field programmable gate arrays, pages 181–190, New York, NY, USA, 2009. ACM. [46] Guenter Stenz. US patent 6957406 - Analytical placement methods with mini- mum preplaced components. US Patent, Oct. 2005. [47] P. Verplaetse, J. Dambre, D. Stroobandt, and J. Van Campenhout. On parti- tioning vs. placement Rent properties. In SLIP ’01: Proceedings of the 2001 in- ternational workshop on System-level interconnect prediction, pages 33–40, New York, NY, USA, 2001. ACM. [48] P. Verplaetse, J. Van Campenhout, and D. Stroobandt. On synthetic bench- mark generation methods. Circuits and Systems, 2000. Proceedings. ISCAS 2000 Geneva. The 2000 IEEE International Symposium on, 4:213–216 vol.4, 2000. [49] Xiaojian Yang, Elaheh Bozorgzadeh, and Majid Sarrafzadeh. Wirelength esti- mation based on Rent exponents of partitioning and placement. In Proc. Inter- national Workshop on System-Level Interconnect Prediction, pages 25–31, 2001. [50] Payman Zarkesh-Ha, Jeffrey A. Davis, William Loh, and James D. Meindl. Pre- diction of interconnect fan-out distribution using Rent’s rule. In Proceedings of the 2000 international workshop on System-level interconnect prediction, pages 107–112, New York, NY, USA, 2000. ACM. 78 Appendix A Area Models This appendix provides a detailed bottom-up description of the FPGA area model applied in our model’s example applications in Chapter 5. Beginning at the lowest of our three-leveled hierarchical model, the first section of this Appendix provides the schematics and assumptions at the transistor level. The second section describes the model in terms of basic components, such as SRAMs, multiplexers, and buffers. The final section describes the FPGA’s inter- and intra-routing structures. Our model was derived from the area model calculations found in VPR 5.0 [36]. A.1 Transistor Level As with VPR5.0’s area model, our area model is built on counting the number of minimum-width transistor areas (MWTA) needed to implement an FPGA architec- ture. A MWTA is simply the area of the smallest transistor plus the minimum spacing to the right and below the transistor for a transistor to properly operate for a given technology. By counting the number of MWTA rather than the number of sub-microns, our area estimation is process independent. Transistors required to drive larger loads, for example longer routing wires, need to be wider than the minimum-width to increase drive strength. In these cases, tran- sistors are implemented with parallel diffusion regions. Given the geometric layout of the transistor, the size of the boosted drive strength transistor a function of drive 79 Appendix A. Area Models strength, as described by: MWTA(RequiredDriveStrength) = RequiredDriveStrength 2 ·DriveStrength(MinimumWidth) . In the next section we implement our transistor level model to estimate the area of basic FPGA components. A.2 Basic Components An FPGA is of 3 basic circuit components: SRAM, buffers, and multiplexers. In this section we describe the assumptions used when creating the area model for each. A.2.1 SRAM We assume the common six-transistor SRAM cell shown in Figure A.1 in our model. The total area of a SRAM cell is approximated as six minimal-transistor widths. Typically the latch transistors M1−4 are sized larger than M5−6 to increase propaga- tion delay and minimize power consumption. Despite this fact, SRAM cells are the most optimized cells in an FPGA and are designed using a relaxed set of rules, and therefore the assumption of six-minimum width transistors is reasonable. A.2.2 Multiplexer Multiplexers are the basic components for switches used in switch blocks, connec- tion blocks, and intra-cluster routing. At the transistor level, multiplexers are im- plemented by a set of NMOS pass-transistors that, depending on the gate signals, selects which one of many inputs as an output. Depending on the number of mul- 80 Appendix A. Area Models Program Data_in Data_in Data Data M1 M2 M3 M4 M5 M6 Figure A.1: Six-transistor SRAM cell schematic tiplexer inputs, one of two designs is applied. Multiplexers with four or less inputs use a single-level design while those with more than four inputs are realized with a two-level design as shown in Figure A.2. Though additional levels would minimize the number of transistors and area needed to implement the multiplexer, restricting the design to a maximum of two levels reduces propagation delay. FPGA architects trade-off size for speed when sizing the multiplexer’s pass-transistors, but usually they are sized in the range of 5-10 minimum-width transistors. Our model estimates the number of needed minimal-width transistors to implement a multiplexer as: TransPerMux(IMux) = IMux · PassTransArea (IMux ≤ 4) (⌊√IMux⌋+ IMux) · PassTransArea (IMux > 4) , where IMux is the number of multiplexer inputs. 81 Appendix A. Area Models input_1 input_2 Select_1 Select_2 Output (a) Single-Stage 2-input Multiplexer input_1 input_2 Select_1 Select_2 input_3 input_4 Select_1 Select_2 input_5 input_6 Select_1 Select_2 Select_3 Select_4 Output (b) Two-Stage 6-input Multiplexer Figure A.2: Multiplexer designs 82 Appendix A. Area Models A.2.3 Buffer Long metal wires that make up routing wire segments are driven by large buffers to minimize propagation delay [32]. Buffers consume a large portion of the overall FPGA area [32]. We model buffers as buffer chains with stage ratios of 4, as commonly used in FPGA to minimize its contribution to area. We assume a PMOS to NMOS ratio of 1.9:1, a design that results in a switching threshold of 1.35 V. This ensures balanced swing times for signals. The area of a single buffer is a function of drive strength, and in terms of minimum-width transistors is: NumBufferTrans(DriveStrength) = NumStages∑ i=0 (1 + 1.45 ·DriveStrengthi) , where NumStages = log4(DriveStregth). A.3 Routing Structures This section builds on the basic component model introduced in the section before, and estimates the area estimation of the top-level FPGA routing structures. The section is organized into two subsections summarizing inter- and intra-routing struc- tures. The next section concludes the appendix with a final model incorporating all the previously presented components. A.3.1 Inter-Cluster Routing We assume a modern day unidirectional - single-driven routing architecture that is described in Section 2.1 and is implemented on most commercial products [24] [12]. The majority of routing area is comprised of programmable switch transistors. In this architecture, programmable switches are deployed in two locations, within a switch 83 Appendix A. Area Models block or connection block, as shown in Figure A.3. The following sub-sections present a transistor area equation that estimates the number of total transistors for each type of routing block. LB LB LB LB LB LB LB LB LB I/ O I/ O I/ O I/ O I/ O I/ O S S S C C C S S S C C C S S S C C C C C C C C C C C C in1 out1 Switch Switch Figure A.3: Inter-cluster routing connections Switch Block Switch blocks are comprised of a number of individual switches that perform two functions, connecting neighbouring logic block outputs to surrounding routing fabric and facilitating the extension and turning of local routing wires. Each switch is 84 Appendix A. Area Models comprised of a SRAM controlled multiplexer with a buffered output that aggregates the outputs of local logic blocks and switch block inputs as shown in Figure A.4. By this definition, the average number of inputs to each of these multiplexer is NumSwitchBlockInputs = Fs + 2 · NumLogicBlockOutputs 4 · Fcout W . where the first term accounts for the number of wire inputs to each switch and the second term is the number of logic block outputs connected to the multiplexer. input output SRAM Figure A.4: Switch components Since all routing wires are driven by a switch, the total number of switches in a switch block is 2 ·W , one for each of the 1 2 ·W wires from the four neighbouring channels. The total number of switches in an FPGA is approximated as the number of logic blocks in an FPGA, that is nx · ny, nx and ny represents the number of logic blocks in the horizontal and vertical direction respectively. Employing the multiplexer and buffer approximations from the previous section the total number of switch block transistors per channel is 85 Appendix A. Area Models NumSwitchBlockTrans = ChannelWidth ·(NumBufferTrans(DriveStrength) +NumMultiplexerTrans(NumSwitchBlockInputs)) Connection Block Connections from routing wires to logic block inputs are made via switches within a connection block. As in the switch block, each of these switches are made of an output- buffered multiplexer. Of the ChanelWidth number of channel wires, the multiplexer connects fcin to each logic block input. Applying the buffer and multiplexer equations from the previous sections, the total number of connection block transistors per cluster is NumConnectionBlockTrans = I · (NumBufferTrans(DriveStrength) +NumMultiplexerTrans(fcin)). In the applications in Chapter 5, we assumed that the buffers for both connection and switch block switches are 4x minimum drive strength to be consistent with the value applied in VPR 5.0, and in terms of area and delay this is a reasonable choice in sizing [31]. 86 Appendix A. Area Models A.3.2 Intra-Cluster Routing In a fully connected cluster, as assumed in this work, the majority of the routing area is implemented in a cluster’s multiplexers. These multiplexers act as switches, mainly connecting the I cluster inputs and N logic block outputs to the inputs of each logic block, as shown in A.5. Our intra-cluster routing area model is based on the number of multiplexer transistors needed in a cluster. We began with the single level case and expanded upon to apply to multi-level clusters. For a single-level cluster, a cluster’s routing area is SingleLevelRoutingArea = N ·K ·NumMultiplexerTrans(I +N) since there are N number of BLEs, each with K inputs that are driven by a single multiplexer that aggregates the I cluster inputs and N BLE outputs. N LBs Logic Block #1 Logic Block #N Clock I N N Outputs I Inputs Figure A.5: Intra-cluster routing 87 Appendix A. Area Models A.4 Complete Area Model In this final section, we complete our overall FPGA area model by implementing a tile approach similar to [21] to estimate the number of basic components in the whole FPGA. In an island style FPGA, as we assumed in this work, a tile is defined as the composition of a cluster and the inter-cluster routing resources of two local channels, as shown in Figure A.6. The number of minimum-width transistor required for each tile is estimated as: L L L L S S C C S S C C C C C C Figure A.6: A tile composed of a cluster and its two local routing channels TotalT ileArea = NumConnectionBlockTrans + 2 ·NumSwitchBlockTrans 88 Appendix A. Area Models The final step in our model’s derivation is summing the number of transistors for each tile. This can be expressed as: TotalFPGAArea = nx∑ x=1 ny∑ y=1 TotalT ileArea = x · y · TotalT ileArea, where we approximate the total number of logic block inputs as nx ∗ ny, the total number of logic blocks multiplied by the number of inputs per logic block. Also, we assumed a square layout and defined nxy = nx = ny as the number of logic blocks in any direction. Applying our logic block approximation from Chapter 4, nxy = ⌈ √ n⌉, where n is the estimated number of post clustering logic blocks. The work presented here is only a starting point in our group’s research towards an accurate area model. Though there is still much to improve, the results from Chapter 5 shows this is a quality model to build upon. 89
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- An analytical model of logic resource utilization for...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
An analytical model of logic resource utilization for FPGA architecture development Lam, Andrew H. 2010
pdf
Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
Page Metadata
Item Metadata
Title | An analytical model of logic resource utilization for FPGA architecture development |
Creator |
Lam, Andrew H. |
Publisher | University of British Columbia |
Date Issued | 2010 |
Description | Designers constantly strive to improve Field-Programmable Gate Array (FPGA) performance through innovative architecture design. To evaluate performance, an understanding of the eﬀects of modifying logic blocks structures and routing fabrics on performance is needed. Current architectures are evaluated via computer-aided design (CAD) simulations that are labourious and computationally-expensive experiments to perform. A more scientiﬁc method, based on understanding the relationships between architectural parameters and performance will enable the rapid evaluation of new architectures, even before the development of a CAD tool. This thesis presents an analytical model that describes such relationships and is based principally on Rent’s Rule. Speciﬁcally, it relates logic architectural parameters to the area eﬃciency of an FPGA. Comparison to experimental results show that our model is accurate. This accuracy combined with the simple form of the model’s equations make it a powerful tool for FPGA architects to better understand and guide the development of future FPGA architectures. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2010-02-08 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0064963 |
URI | http://hdl.handle.net/2429/19753 |
Degree |
Master of Applied Science - MASc |
Program |
Electrical and Computer Engineering |
Affiliation |
Applied Science, Faculty of Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2010-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2010_spring_lam_andrew.pdf [ 3.47MB ]
- Metadata
- JSON: 24-1.0064963.json
- JSON-LD: 24-1.0064963-ld.json
- RDF/XML (Pretty): 24-1.0064963-rdf.xml
- RDF/JSON: 24-1.0064963-rdf.json
- Turtle: 24-1.0064963-turtle.txt
- N-Triples: 24-1.0064963-rdf-ntriples.txt
- Original Record: 24-1.0064963-source.json
- Full Text
- 24-1.0064963-fulltext.txt
- Citation
- 24-1.0064963.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
data-media="{[{embed.selectedMedia}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0064963/manifest