Activity-based Power Estimation and Characterization of DSP and Multiplier Blocks in FPGAs

by

Nathalie Chan King Choy

B.A.Sc., University of Toronto, 2004

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE

in

The Faculty of Graduate Studies (Electrical and Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA

October, 2006

© Nathalie Chan King Choy, 2006

Abstract

Battery-powered applications and the scaling of process technologies and clock frequencies have made power dissipation a first-class concern among FPGA vendors. One approach to reduce power dissipation in FPGAs is to embed coarse-grained fixed-function blocks that implement certain types of functions very efficiently. Commercial FPGAs contain embedded multipliers and "Digital Signal Processing (DSP) blocks" to improve the performance and area efficiency of arithmetic-intensive applications. In order to evaluate the power saved by using these blocks, a power model and tool flow are required.

This thesis describes our development and evaluation of methods to estimate the activity and the power dissipation of FPGA circuits containing embedded multiplier and DSP blocks. Our goal was to find a suitable balance between estimation time, modeling effort, and accuracy. We incorporated our findings to create a power model and CAD tool flow for these circuits. Our tool flow builds upon the Poon power model and the Versatile Place and Route (VPR) CAD tool, which are both standard academic experimental infrastructure.
Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
  1.1 Motivation
  1.2 Research Goals
  1.3 Research Approach
  1.4 Organization of Thesis
2 Background and Previous Work
  2.1 FPGA Architectures
    2.1.1 Island-style Architectures
    2.1.2 Enhanced LUT Architectures and Carry Chains
    2.1.3 Platform-style Architectures
  2.2 FPGA CAD Flow
  2.3 Power
  2.4 Power Estimation
    2.4.1 Simulation-based vs. Probabilistic
    2.4.2 Techniques: Circuit-level
    2.4.3 Techniques: Gate-level
    2.4.4 Techniques: RT-level
    2.4.5 FPGA-specific Power Estimation Tools
  2.5 Focus and Contribution of Thesis
3 Framework
  3.1 Overall Flow
  3.2 Versatile Place and Route (VPR)
    3.2.1 Architectural Assumptions
    3.2.2 T-VPack
    3.2.3 Placement and Routing Engine
    3.2.4 Architectural Flexibility
  3.3 Poon Power Model
    3.3.1 Architectural Assumptions
    3.3.2 Activity Estimation
    3.3.3 Power Estimation
    3.3.4 Architectural Flexibility
  3.4 DSP Block Power Model and Tool Flow
    3.4.1 Requirements
    3.4.2 Extending PVPR Flow
  3.5 Chapter Summary
4 Activity Estimation
  4.1 Motivation for developing Activity Estimation Techniques
  4.2 Techniques for Activity Estimation of an Embedded Block
    4.2.1 Gate-Level Technique
    4.2.2 Independent Output Technique
  4.3 Methodology and Results for Activity Estimation
    4.3.1 Output Nodes of DSP Block
    4.3.2 Downstream Nodes of DSP Block
  4.4 Conclusions for Activity Estimation
  4.5 Chapter Summary
5 Power Estimation
  5.1 Techniques for Power Estimation
    5.1.1 Objectives
    5.1.2 The Constant Technique
    5.1.3 The Lookup Technique
    5.1.4 The PinCap Technique
  5.2 Methodology for Power Characterization
    5.2.1 Constant Technique
    5.2.2 Lookup Technique
    5.2.3 PinCap Technique
  5.3 Evaluation
  5.4 Conclusions for Power Estimation
  5.5 Chapter Summary
6 Power Estimation Tool Flow
  6.1 Overall Flow
    6.1.1 Functionality
    6.1.2 Modifications to PVPR
  6.2 Flow Demonstration and Comparison to Gate-Level Simulation
    6.2.1 Motivation
    6.2.2 Terminology
    6.2.3 Methodology
    6.2.4 Results
  6.3 Chapter Summary
7 Conclusion
  7.1 Summary and Contributions
  7.2 Future Work
    7.2.1 Benchmarks
    7.2.2 Glitch Characterization
    7.2.3 Board-Level Verification
Bibliography
A Glitching Characterization Attempt
B Detailed Results from Comparison

List of Tables

6.1 Parameters added to the PVPR Architecture File Format
6.2 VPR Power Analysis Results
6.3 Overall Percentage Error for FIR Filter Results
6.4 Overall Percentage Error for Differential Equation Solver Results
B.1 FIR Filter Results for 0.25 Average Input Activity
B.2 FIR Filter Results for 0.50 Average Input Activity
B.3 FIR Filter Results for 0.75 Average Input Activity
B.4 Differential Equation Solver Results for 0.25 Average Input Activity
B.5 Differential Equation Solver Results for 0.50 Average Input Activity
B.6 Differential Equation Solver Results for 0.75 Average Input Activity

List of Figures

1.1 18-bit x 18-bit multipliers combined for 36-bit x 36-bit multiplier
2.1 Island-style Architecture (from [1])
2.2 2-input LUT (from [2])
2.3 LUT with Flip-flop (from [2])
2.4 Cluster-based logic block
2.5 Programmable Switches
2.6 Carry Chain Connections to a 4-LUT
2.7 Multi-bit Logic Block (from [3])
2.8 Steps in the FPGA CAD Flow
2.9 Abstraction Levels for Power Analysis (from [4])
3.1 Overall CAD Flow (from [5])
3.2 CAD Flow Enhanced to Support DSP Block Power Estimation
4.1 Gate-Level Technique
4.2 Independent Output Technique
4.3 Output Pin Activities for Unregistered Multiplier
4.4 Output Pin Activities for Registered Multiplier
4.5 Output Pin Activities for Unregistered DSP Block
4.6 Output Pin Activities for Registered DSP Block
4.7 18-bit x 18-bit Multipliers Combined for 36-bit x 36-bit Multiplier
4.8 FIR Filter Used for Activity Experiments
4.9 Activity Results for the FIR Filter
4.10 Activity Results for the Converter
5.1 Methodology for Power Estimation Experiments
5.2 Power Estimation Experimental Methodology
5.3 Power Estimation Experiment Results
5.4 Power Estimation Experiment Results
6.1 Power Estimation Tool Flow
6.2 Terminology
6.3 fir.3_8.8: FIR Filter Circuit
6.4 Differential Equation Solver Circuit
6.5 18-bit x 18-bit multipliers combined for 36-bit x 36-bit multiplier
6.6 Simulation Flow Pseudocode
6.7 Flow for Determining Simulation Power Estimate of Each DSP Block Instance
A.1 Power Characterization Flow
A.2 Testbench Pseudocode

Acknowledgements

I would like to thank my supervisor, Professor Steve Wilton, for teaching me so much these last two years and for providing many opportunities to interact with the FPGA community and to gain valuable teaching experience.

I would like to thank everyone in the Socwilton, Soclemieux, and other System-on-Chip groups for their ideas, advice, and good company. Peter Jamieson (from the University of Toronto), Brad, Dipanjan, Julien, and Victor were incredibly helpful when I was getting started with all the tools. Roozbeh and Roberto were also great resources.

I am grateful to Altera and the Natural Sciences and Engineering Research Council of Canada for funding my project and to my defense committee for volunteering their time.

Thank you to Scott and Lesley for your emotional support during the academic rough patches and for knowing exactly what I was going through. Also thank you to Steph and Jess for your emotional support and encouragement during the non-academic rough patches, the roommate antics, and the entertaining adventures.

Finally, I would like to dedicate this work to my family, friends, and my dragon boat family (Swordfish, Roli, and UC Water Dragons crews) who kept me sane through a very intense last two years.

Chapter 1

Introduction

1.1 Motivation

Power dissipation has become increasingly important in the electronics industry. There are two reasons for this:

1. There is a growing number of portable electronic applications, which have battery life constraints.
2. Process technologies in the nanometer scale and clock frequencies in the gigahertz range result in very hot chips and increased cooling costs.

These issues are relevant to both Application-Specific Integrated Circuits (ASICs) and to Field-Programmable Gate Arrays (FPGAs).
With increasing mask costs for ASICs, FPGAs have become an attractive implementation alternative for low and medium volume production. However, FPGAs lag behind ASICs significantly in their power efficiency. The additional transistors required for programmability result in higher power dissipation per function. Leakage power is also dissipated in both the used and unused parts of FPGAs, and most commercial FPGAs do not have a sleep mode to reduce power in unused parts (the Actel IGLOO flash-based FPGA is an exception [6]).

Figure 1.1: 18-bit x 18-bit multipliers combined for 36-bit x 36-bit multiplier

Over the past few years, significant advances in reducing FPGA power have been made at the circuit level [7] [8], architecture level [9], and CAD tool level [10] [11] [12]. Still, current FPGAs have been found to be, on average, 12 times less power efficient than ASICs [13].

A basic FPGA is composed of programmable logic elements (called Look-up Tables or LUTs), wire segments, and programmable connections/switches that are typically laid out in a regular pattern. The LUTs in FPGAs typically have 4 to 6 inputs and can implement any Boolean function of these inputs. By configuring each LUT and connecting them together using the programmable interconnect, users can implement virtually any digital circuit. To reduce area and improve circuit speed, larger logic blocks, called clusters, are used; a cluster is typically composed of a collection of 4 to 10 LUTs.

One promising approach to reducing power in FPGAs at the architectural level is to embed coarse-grained fixed-function blocks that implement certain types of functions very efficiently. Commercial FPGAs contain embedded multipliers and embedded "Digital Signal Processing (DSP) blocks" to improve the performance and area efficiency of arithmetic-intensive applications, such as DSP applications [14] [15] [16] [17].
A simplified example of a DSP block is shown in Figure 1.1; in this diagram, four 18x18-bit multipliers are programmably connected to create a 36x36-bit multiplier, with optional registers at the multiplier inputs, outputs, and adder stage outputs. Other examples of DSP block configurations are adder/subtractors, multiply-accumulators, and multiply-adders. Arithmetic-intensive applications can be implemented in LUTs, but even if LUTs are enhanced with dedicated interconnect for fast carry chains [18] and arithmetic chains [15], significant performance, density, and power improvements can be obtained by implementing the arithmetic parts of the applications in the embedded blocks [19]. Reference [20] found that an average area savings of 55% and an average increase in clock rate of 40.7% could be obtained for floating point applications by embedding floating point DSP blocks.

There are several disadvantages of including DSP blocks in an FPGA fabric. If the blocks are not used, (1) the area is wasted, and (2) the blocks will still dissipate leakage power. This makes it important to try to optimize the size and quantity of embedded resources included. However, it is difficult to determine what proportion of the FPGA area should be devoted to embedded blocks because it is difficult to determine what typical usage is for a device that can implement almost any digital circuit.

In order to meaningfully evaluate the power savings that can be obtained by using these blocks, an experimental flow that can estimate the power dissipated by FPGAs containing embedded DSP blocks is required. Commercial tools from FPGA vendors provide power estimation that includes these blocks; however, these tools are tailored for specific devices, and do not provide the flexibility needed to investigate alternative embedded block architectures.
The Versatile Place and Route (VPR) tool suite [21] with the Poon power model [22] has become standard experimental infrastructure to study FPGA architecture, CAD, and power issues (we will refer to the combination of VPR and the Poon power model as PVPR). The estimates from the power model can be used by FPGA architects to evaluate architectures for power efficiency, by FPGA users to make power-aware design decisions, and by FPGA CAD tool developers to create power-aware CAD algorithms. However, PVPR only supports homogeneous FPGA architectures containing LUTs and a routing fabric, and thus is not sufficient to examine architectures containing embedded DSP blocks.

1.2 Research Goals

The objective of this research is to create an enhanced FPGA CAD flow that can be used to evaluate the power dissipation of FPGAs containing embedded DSP and arithmetic blocks. We have imposed the following requirements:

• As PVPR is standard experimental infrastructure for academic FPGA architecture, CAD, and power studies, our power estimation of the DSP and multiplier blocks must be compatible with the PVPR flow.
• As PVPR results are frequently used to perform architectural evaluations, our power estimation must be fast, to facilitate iterations over tens or hundreds of architectural alternatives and benchmark circuits.
• To facilitate iterations over many architectural alternatives, our method should aim to minimize the modeling effort required when introducing a new architecture for evaluation.
• In order for the power estimates to be meaningful, our method should aim to be as accurate as possible, within the constraints imposed by the previous requirements.

Fast power estimation techniques make use of average quantities, such as probabilities, to reduce runtime. Details are lost during this averaging, resulting in a runtime versus accuracy tradeoff.
Similarly, reducing characterization effort typically results in the capturing of fewer details, again resulting in an effort versus accuracy tradeoff. As the requirements that we have set for our flow involve both fast estimation and low characterization effort, we expect that we will be sacrificing some accuracy. Therefore, we must find a suitable balance between speed, characterization effort, and accuracy.

1.3 Research Approach

As will be explained in Chapter 2, power estimation involves two steps: (1) activity estimation, and (2) power estimation using the activities.

Dynamic power dissipation is proportional to how often nodes in a circuit switch values (from logic-0 to logic-1, or logic-1 to logic-0). Activity estimation is the calculation of how much switching is expected at each node in the circuit. In implementing the activity estimation for FPGA architectures containing embedded DSP blocks, we came across the following challenge: it is not clear how to propagate activity calculations through a DSP block. Traditional activity estimation techniques propagate activities through small gates (LUTs) using transition probability or transition density models [23]; however, these approaches do not scale well to embedded blocks with tens (or even hundreds) of inputs.

The next step is to estimate the power dissipated by the FPGA implementing a user's circuit. In implementing algorithms to perform this estimation, we identified the following challenge: once the pin activities of each embedded DSP block are known, it is still not clear how to use them to estimate the power dissipated by the embedded DSP block. There are many possibilities, each providing a different trade-off between power estimation time, modeling effort when a new block is to be investigated, and accuracy.
The second technical challenge is to choose a power estimation technique that provides the best balance between these factors.

In this thesis we treat each of these two challenges as a separate problem. For each problem, we develop and evaluate solutions. Then, we use the best of our solutions to each problem to create our experimental CAD flow for evaluating FPGA architectures containing DSP blocks.

1.4 Organization of Thesis

This thesis is organized as follows. Chapter 2 provides an overview of FPGA architectures, the FPGA CAD flow, power, and power estimation. It also describes previous work related to power estimation. Chapter 3 describes the existing PVPR CAD flow and how our DSP and arithmetic block power model fits into this framework. Chapter 4 describes our solutions and their evaluation for the problem of activity estimation. Chapter 5 describes our solutions and their evaluation for the problem of power estimation. Chapter 6 describes how we integrated our solutions from Chapters 4 and 5 to create a Power Estimation Tool Flow for evaluating FPGA architectures containing DSP blocks, presents results for two benchmark circuits, and compares the results to those obtained using gate-level simulation. Finally, Chapter 7 summarizes the conclusions and provides suggestions for future work.

Chapter 2

Background and Previous Work

2.1 FPGA Architectures

The most basic building blocks of FPGAs are programmable logic blocks, wire segments, and programmable connections/switches that are typically laid out in a regular pattern. By programming basic functions into each block and connecting them together using the programmable interconnect, users can implement virtually any digital circuit.

A number of architectures have been developed to optimize for criteria such as routability, area efficiency, and timing.
Architectures are frequently classified based on their routing architecture because most of the area in an FPGA is devoted to routing. Reference [1] lists the three main categories of commercial FPGA architectures that were available when that book was written: island-style, row-based, and hierarchical. Today, platform-style FPGAs are available, which include coarse-grained embedded components.

This thesis will focus on FPGAs based on island-style architectures. Section 2.1.1 will introduce basic island-style architecture components, Section 2.1.2 will describe modern enhancements to the basic components, and Section 2.1.3 will describe platform-style FPGAs. The framework that we build upon is described in Chapter 3.

2.1.1 Island-style Architectures

Figure 2.1 shows an island-style architecture, where programmable logic blocks are "islands" in a "sea" of programmable routing fabric [5]. In the traditional island-style architecture, the logic blocks are clusters of look-up tables (LUTs). The routing fabric is composed of wires, programmable connection blocks, and programmable switch blocks.

Figure 2.1: Island-style Architecture (from [1])

Look-up Tables

Look-up tables (LUTs) are the most basic elements for implementing logic in an FPGA. Figure 2.2 shows a 2-input LUT. A K-input LUT (K-LUT) is a memory with 2^K bits, K address inputs, and one output port. By setting the memory bits, this structure can be used to implement any combinational function of K inputs. Typical values for K are 4 to 6.

Figure 2.2: 2-input LUT (from [2])

Figure 2.3: LUT with Flip-flop (from [2])
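As a minimal sketch of the K-LUT described above (it is not part of the thesis tool flow), the LUT can be modelled in Python as a table of 2^K configuration bits addressed by the K inputs. The class name `LUT`, its methods, and the convention that the first input is the most significant address bit are assumptions made purely for illustration.

```python
from itertools import product


class LUT:
    """Behavioural model of a K-input look-up table (illustrative only)."""

    def __init__(self, k, sram_bits):
        # A K-LUT is a memory with exactly 2**K configuration bits.
        if len(sram_bits) != 2 ** k:
            raise ValueError("a K-LUT needs exactly 2**K configuration bits")
        self.k = k
        self.sram_bits = list(sram_bits)

    @classmethod
    def from_function(cls, k, func):
        # "Program" the LUT by evaluating func on every input combination,
        # in address order (first input = most significant address bit).
        bits = [int(bool(func(*inputs))) for inputs in product((0, 1), repeat=k)]
        return cls(k, bits)

    def evaluate(self, *inputs):
        # The K inputs form the address of the memory bit driven to the output.
        if len(inputs) != self.k:
            raise ValueError("expected %d inputs" % self.k)
        address = 0
        for bit in inputs:
            address = (address << 1) | (bit & 1)
        return self.sram_bits[address]


# A 2-input LUT programmed as an AND gate: bits 0, 0, 0, 1 at addresses 00..11.
and2 = LUT.from_function(2, lambda a, b: a & b)
```

Because the function is stored as a truth table, any combinational function of K inputs fits in the same structure; the cost is that the table doubles in size with each additional input.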
To implement sequential logic, a LUT is typically paired with a flip-flop to form a Basic Logic Element (BLE), as shown in Figure 2.3. The output of the BLE is selected from either the registered or the unregistered version of the LUT output, depending on the select bit of the output multiplexer.

Cluster-based Logic Blocks

The use of larger logic blocks helps to increase circuit speed and reduces circuit area and CAD tool processing time. Unfortunately, LUT complexity grows exponentially with the number of inputs. The use of cluster-based logic blocks addresses this problem. Clusters are typically composed of a number of BLEs, internal cluster routing, and possibly specialized internal cluster connections, such as carry chains. Within a cluster, BLE inputs are typically connected to the cluster inputs and BLE outputs by a multiplexer-based crossbar [2].

An example of a cluster-based logic block is shown in Figure 2.4. It has 8 inputs, contains 4 BLEs, and has 4 outputs. The LUT inputs can be connected to either the cluster inputs or the BLE outputs via the internal routing. By grouping BLEs that share signals, placement and routing processing time is reduced because the number of inter-block connections is reduced. This internal routing is also shorter and faster than inter-block connections, reducing circuit delay.

Figure 2.4: Cluster-based logic block

Routing

Logic blocks are connected together and to I/O resources using a routing "fabric" that is composed of:

• pre-fabricated metal tracks
• programmable switch blocks
• programmable connection blocks

Figure 2.1 shows the components of the routing fabric. The tracks are arranged in channels, typically horizontally and vertically, to form a grid pattern. Wires run along these tracks.
The connection blocks connect the wires to the inputs and outputs of logic blocks adjacent to the channel. The switch blocks connect the horizontal and vertical wires together to form longer wires and make turns. It should be noted that, although we consider the switch blocks and connection blocks as separate entities, they are often not separate in the FPGA circuit layout.

The programmable switches in the switch blocks and connection blocks can be either buffered or unbuffered. Typically they are buffered to reduce delays in long connections; however, this increases routing area. Buffered switches can be either unidirectional or bidirectional. Modern FPGAs use unidirectional switches to get better delay optimization and a denser routing fabric [24]. Figure 2.5 shows examples of switches typically found in FPGAs: (a) unbuffered, (b) buffered unidirectional, and (c) buffered bidirectional.

Figure 2.5: Programmable Switches: (a) Unbuffered, (b) Buffered unidirectional, (c) Buffered bidirectional

The routing wires inside an FPGA can be categorized as either (a) segmented local routing for connections between logic blocks, or (b) dedicated routing for global networks, such as clocks or reset/preset lines. The local routing segments can span a single logic block or multiple blocks; typically, not all segments are of the same length. The dedicated routing tracks are designed to ensure low-skew transmission. Commercial FPGAs typically also include phase-locked loops (PLLs) and delay-locked loops (DLLs) for clock de-skew on these dedicated lines. They also may include clock multiplexers and clock management circuitry to reduce power consumption by disabling unused parts of the clock network [25].

2.1.2 Enhanced LUT Architectures and Carry Chains

Commercial FPGAs contain improvements to the basic clustered K-LUT and interconnect FPGAs described in Section 2.1.1.

Adaptive Logic Module (ALM)

The Altera Adaptive Logic Module (ALM) is a BLE that can implement a single function of six inputs or two functions that share four inputs [15]. It is based on research that found that 6-LUTs have better area-delay performance than the typical 4-LUT, but can result in wasted resources if many of the logic functions do not require six inputs.

Configurable Logic Block (CLB)

The Xilinx Virtex-5 CLB is a BLE based on a 6-LUT with two outputs so that it can either implement a single function of six inputs or two functions of five (out of the same six) inputs [26]. As with the ALM, it is based on research showing that 6-LUTs have better performance, but may waste resources.

Carry chains

Carry chains are dedicated connections between logic blocks that aid in the efficient implementation of arithmetic operations. They also can be used in the efficient implementation of logical operations, such as parity and comparison. Fast carry chains are important because the critical path for these operations is often through the carry.

Figure 2.6 shows a simple carry chain architecture. Each 4-LUT in a BLE can be fractured to implement two 3-LUTs; this is sufficient to implement both the sum and carry, given two input bits (a and b) and a carry input. The carry out signal from one BLE would typically be connected to the carry in of an adjacent BLE using a fast dedicated connection. The Z input is used to break the carry chain before the first bit of an addition.

Figure 2.6: Carry Chain Connections to a 4-LUT

More complex carry schemes have been published.
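The simple carry chain described above can be sketched in Python as follows. This is only an illustrative behavioural model under stated assumptions: the function names are hypothetical, each BLE is reduced to the two 3-input functions (sum and carry) that the fractured 4-LUT would implement, and the Z input is modelled simply as a carry of 0 entering the first bit.

```python
def sum_lut(a, b, carry_in):
    """3-LUT configured as the sum bit of a full adder."""
    return a ^ b ^ carry_in


def carry_lut(a, b, carry_in):
    """3-LUT configured as the carry-out (majority) of a full adder."""
    return (a & b) | (a & carry_in) | (b & carry_in)


def ripple_carry_add(x, y, width):
    """Add two unsigned width-bit integers, one BLE (bit position) at a time."""
    carry = 0  # chain is broken before the first bit, so no carry enters
    total = 0
    for bit in range(width):
        a = (x >> bit) & 1
        b = (y >> bit) & 1
        total |= sum_lut(a, b, carry) << bit
        # The carry-out feeds the adjacent BLE over the dedicated connection.
        carry = carry_lut(a, b, carry)
    return total, carry
```

For example, `ripple_carry_add(5, 3, 4)` yields a 4-bit sum of 8 with no carry out. The loop also makes the critical path visible: the carry of each bit depends on the carry of the previous one, which is why a fast dedicated connection between adjacent BLEs matters.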
In [18], carry chains based on carry select, variable block, and Brent-Kung schemes are described; the Brent-Kung scheme is shown to be 3.8 times faster than the simple ripple carry adder in Figure 2.6. The selection of a carry scheme by the FPGA architect involves weighing the area cost of a faster, more complex carry chain against its performance benefits, because the carry chain is only beneficial when it is used. Actel and Xilinx devices include support for carry-lookahead, and Altera devices include support for carry select capabilities.

2.1.3 Platform-style Architectures

With Moore's Law and the increase in the possible number of transistors on a die, FPGA manufacturers could afford to sacrifice some silicon area to application-specific circuits. In 2000, to increase the efficiency and density of designs containing components such as processors, large memories, and complex arithmetic circuits, FPGA manufacturers began to release Platform-style FPGAs containing dedicated circuitry for these parts.

In the context of FPGAs, Intellectual Property (IP) cores are typically reusable circuit modules or high-level building blocks that can be combined to create a system. These cores can be soft or hard. Soft cores are implemented in LUTs and hard cores are implemented as embedded ASIC blocks in the FPGA. Platform FPGAs contain hard IP cores, such as processors, large memory blocks, fast multipliers, and Digital Signal Processing (DSP) blocks. They also contain dedicated interconnect for fast communication between certain types of adjacent cores.

DSP blocks

To improve the density and address the performance requirements of DSP and other arithmetic-intensive applications, FPGA manufacturers typically include dedicated hardware multipliers in their devices.
Altera Cyclone II and Xilinx Virtex-II/-II Pro devices contain embedded 18x18-bit multipliers, which can be split into 9x9-bit multipliers [27]. The Virtex-II/-II Pro devices are further optimized with direct connections to the Xilinx block RAM resources for fast access to input operands.

Higher-end FPGAs include DSP blocks, which are more complex dedicated hardware blocks optimized for a wider range of DSP applications. Altera Stratix and Stratix II DSP blocks support pipelining and shift registers, and can be configured to implement 9x9-bit, 18x18-bit, or 36x36-bit multipliers that can optionally feed a dedicated adder/subtractor or accumulator [15]. Xilinx Virtex-4 XtremeDSP slices contain a dedicated 18x18-bit 2's complement signed multiplier, adder logic, a 48-bit accumulator, and pipeline registers. They also have dedicated connections for cascading DSP slices, with an optional wire-shift, without having to use the slower general routing fabric [27].

This inclusion of dedicated multipliers or DSP blocks to complement the general logic resources results in a heterogeneous FPGA architecture. Research has considered what could be gained from tuning FPGA architectures to specific application domains, in particular datapaths and DSP.

The work in [28] and [3] is tuned for datapath (arithmetic) circuits. Datapath circuits are often composed of regularly structured components, called bit-slices. The authors propose a multi-bit logic block that uses configuration memory sharing to exploit this regularity and save area. In typically-sized cluster-based logic blocks (containing 4 to 10 BLEs), configuration SRAM cells consume 39% to 48% of the total cluster area.
If each cluster is used to implement a single bit-slice of the datapath circuit and adjacent clusters are used to implement adjacent bit-slices, the configuration memory for corresponding BLEs in the adjacent slices can be shared. A multi-bit logic block is illustrated in Figure 2.7, indicating resources that can share configuration memory. The authors also propose bus-based connections to exploit this regularity to achieve 14% routing area reduction for implementing datapath circuits. A disadvantage of multi-bit architectures is that if implementation circuits require a different bit-width, some resources may be wasted.

Figure 2.7: Multi-bit Logic Block (from [3])

The work in [29] deliberately avoids creating a heterogeneous architecture because the authors found that DSP applications contain both arithmetic and random logic, but that a suitable ratio between arithmetic and random logic is difficult to determine. Instead they develop two "mixed-grain" logic blocks that are suitable for implementing both arithmetic and random logic by looking at properties of arithmetic operations and properties of the 4-LUT. Their logic blocks are coarse-grained: each block can implement up to 4-bit addition/subtraction, 4 bits of an array multiplier, a 4-bit 2:1 multiplexer, or wide Boolean functions. Each logic block can, alternatively, implement single output random logic functions like a normal LUT. Their architecture reduces configuration memory requirements by a factor of four, which is good for embedded systems or those with dynamic reconfiguration. Compared to arithmetic-optimized FPGA architectures, it offers efficient support of a wider range of DSP applications that vary in the amount of random logic they contain. However, the cost of this flexibility is additional area overhead.
If applications with very little arithmetic are implemented, further area penalties are incurred in the case of all of these architectures with specialized arithmetic support, because most of the arithmetic support logic is left unused.

Figure 2.8: Steps in the FPGA CAD Flow (Behavioral Description; High-Level Synthesis; Register Transfer Level (RTL); Logic Synthesis: Logic Optimizations, Technology Mapping; Netlist of Interconnected Cells; Physical Design: Packing, Placement, Routing; Design for download)

2.2 FPGA CAD Flow

In order to implement large designs on FPGAs, CAD tools are essential. They allow users to work at a high abstraction level, automating optimization and transformation to a low-level implementation. This section will give a brief overview of the steps in the FPGA CAD flow. Figure 2.8 illustrates the steps in the flow.

The input to the top level of the FPGA CAD flow is a behavioral description, typically in the form of Hardware Description Language (HDL) code or a schematic that describes what the circuit does, with little or no reference to its structure.

During high-level synthesis, an initial compilation of the design is done, some initial optimizations are performed, and functional units are scheduled and assigned to the operations required by the circuit [30]. The result of high-level synthesis is a Register Transfer Level (RTL) description of the design. An RTL description is typically written in HDL. RTL code is characterized by a straightforward flow of control, with subcircuits that are connected together in a simple way [31].

During the optimization stage of logic synthesis, technology-independent optimizations are applied to the logic functions that the RTL describes; the result is a netlist of gates.
During the technology mapping stage of logic synthesis, this netlist of gates is mapped to the set of cells that are available to build the functions. In the case of FPGAs, these cells are LUTs and flip-flops. The result is a structural netlist of interconnected LUTs and flip-flops.

Physical design has three stages: (1) packing, (2) placement, and (3) routing. During the packing stage, the LUTs and flip-flops are packed into coarser-grained cluster-based logic blocks, with the goals of improving routability and optimizing circuit speed. Packing together BLEs that share signals minimizes the number of connections between logic blocks, thus enhancing routability. Packing together BLEs likely to be on the critical path makes connections between them go via the fast local interconnect, thus optimizing circuit speed.

During the placement stage, the cluster-based blocks are assigned locations in the FPGA. Goals during placement are to minimize wiring by placing connected blocks close together, to enhance routability by balancing the wiring density across the FPGA, and to maximize circuit speed by putting blocks likely to be on the critical path close together.

Finally, during the routing stage, paths are found in the channels for the wires that connect the logic blocks. The result is typically referred to as the placed and routed design. At that point it could be converted to a bitstream for programming an FPGA.

The widely used Versatile Place and Route (VPR) academic CAD tool for packing, placement, and routing will be described in Chapter 3.

2.3 Power

The formula for the instantaneous power dissipated at time $t$, $p(t)$, is:

\[ p(t) = i(t) \cdot V_{dd} \tag{2.1} \]

where $V_{dd}$ is the supply voltage and $i(t)$ is the instantaneous current drawn from the supply.
To obtain the average power over a time interval, we replace the instantaneous current by the average current drawn over that interval in Equation 2.1. Knowing the peak instantaneous power is useful when sizing supply lines, whereas knowing the average power helps in the calculation of battery life and cooling requirements. In this work we focus on average power.

Power dissipation can be broken down into dynamic and static components. Dynamic power is dissipated when a gate is switching; it is due to: (1) the charging and discharging of parasitic capacitances and (2) temporary short circuits between the high and low supply voltage lines. The average dynamic power of the gate is given by the equation

\[ P = 0.5 \cdot \alpha C V_{dd}^2 f \tag{2.2} \]

where $\alpha$ is the activity of the gate, $C$ is the parasitic capacitance of the gate, and $f$ is the clock frequency. Static power is dissipated when the gate is not switching; it is due to leakage currents. Typically, in an FPGA, the majority of power dissipated is dynamic [22].

2.4 Power Estimation

It is important to distinguish between power estimation and power measurement. Power measurement involves obtaining voltage and current values from a real physical apparatus. Power estimation involves predicting what the power dissipation would be, based on a number of assumptions. One reason for performing power estimation is that a physical apparatus is not always available. Another reason is that a wider range of designs can be considered and evaluated more quickly when we are not constrained to using physical implementations. Performing power estimates earlier in the design flow is desirable to help guide design decisions or identify problems in the design.

Power estimation can take place at any stage in the FPGA CAD flow (Figure 2.8). The stages higher in the flow are at a higher abstraction level and do not involve implementation details.
As we get lower in the flow, more physical details of the design have been determined. Performing power estimates at the lower stages in the flow will generally give more accurate estimates; however, it will take more computational resources to take into account these physical details. In our work, we will be performing estimates after placement and routing.

Power estimation can be done at different abstraction levels, as shown in Figure 2.9. At each stage, we need the following types of information so that we can perform the power analysis:

1. activity estimates, so we can compute the dynamic power dissipation,
2. a description of what the design looks like - this can be either an architectural description or a netlist - so that we know what components are used and how they are connected,
3. models of these components and connections

Figure 2.9: Abstraction Levels for Power Analysis (from [4]) (levels of abstraction: system-level, algorithm-level, RTL, gate-level, circuit-level)

In a platform-style FPGA, the placed and routed netlist contains representations of circuit components at multiple levels of abstraction: LUTs, flip-flops, and wires are essentially at the gate level, while DSP blocks and memories can be thought of as being RTL components, and hard processors can be thought of as being system-level components. For this work, we are interested in the lower three abstraction levels in Figure 2.9: circuit-level, gate-level, and RTL. In the following subsections, we will first describe two categories into which power estimation techniques can be classified. Then we will describe a number of existing techniques at the three lower abstraction levels. Finally, we will describe some FPGA-specific power estimation tools.

2.4.1 Simulation-based vs. Probabilistic

Power estimation techniques can be divided into two categories: (1) simulation-based and (2) probabilistic.

Simulation-based techniques simulate the circuit to gather data about the switching of circuit nodes or even determine the waveform of the current being drawn. However, simulation-based techniques require complete and specific information about the input signals. The accuracy of the simulation results is dependent on how realistic the inputs are. Consequently, reference [23] calls simulation-based power estimation techniques strongly pattern-dependent.

To avoid the problem of determining complete and specific input signal characteristics, probabilistic techniques are based on typical input signal behaviour. They represent the average behaviour of the inputs using probabilities. Although the estimation is still dependent on the probabilities provided, it is sufficient to supply typical behaviour instead of specific behaviour. Thus [23] calls probabilistic power estimation techniques weakly pattern-dependent. Since calculations need only be performed once on the average data, instead of on a large number of simulation inputs, probabilistic techniques tend to require less computational resources than simulation-based techniques; however, some accuracy is sacrificed by the use of averaging.

2.4.2 Techniques: Circuit-level

Typically, at the circuit-level users seek very precise estimates. As a result, circuit-level techniques tend to be simulation-based, while probabilistic techniques tend to be applied only at the gate-level and above [32].

SPICE

SPICE provides very detailed, low-level simulation data for a circuit. SPICE stands for Simulation Program with Integrated Circuit Emphasis and is a general purpose analog circuit simulator for nonlinear DC, nonlinear transient, and linear AC analyses.
It uses mathematical models to represent the devices in the circuit, such as resistors, capacitors, and transistors [33]. This very detailed simulation can result in high accuracy estimates, but it requires substantial computational resources, making it unsuitable for large circuits.

SPICE was used in the creation of the Poon power model, which is discussed in Chapter 3. However, the work in this thesis will be done at higher levels of abstraction, due to runtime and complexity constraints.

2.4.3 Techniques: Gate-level

Simulation-based gate-level analysis is very mature. The most popular type of gate-level analysis uses event-driven logic simulation, where switching events at the inputs of a logic gate trigger events at the output after a pre-defined delay. Probabilistic gate-level techniques exist as well, to reduce the execution time of estimates. We used both simulation-based and probabilistic gate-level techniques in this work.

Synopsys PrimePower

Synopsys PrimePower is a simulation-based dynamic power analysis tool for gate-level power verification that can be used on multimillion-gate designs. It combines gate-level simulation results with delay and capacitance information from technology libraries to get detailed power information. In addition to average power numbers, PrimePower reports instantaneous power consumption in different parts of the design.

Transition Probability

The Transition Probability Technique relates the average dynamic power of nodes to the likelihood that they will switch. To use the Transition Probability Technique, we need the signal probability and the transition probability of each node. The signal probability, $P_{signal}$, of a node is the average fraction of clock cycles in which the steady state value of the node is logic high.
The transition probability, $P_t$, of a node is the average fraction of clock cycles in which the steady state value of the node is different from its initial value [23].

The Transition Probability Technique makes some simplifying assumptions: zero-delay, spatial independence of inputs and internal nodes, and temporal independence of signal values. The assumption of zero-delay means there is, at most, a single transition of each signal per clock cycle; in reality, there are delays and they can cause the output of a gate to transition multiple times before settling at its final value for the clock cycle. The assumption of spatial independence means we assume that there is no correlation between nodes, although, in reality, the value of one signal may affect the value of another signal, in the same cycle. The assumption is made because calculating the correlation between signals for a large circuit is prohibitively expensive. The assumption of temporal independence means that we assume, for a given signal, that values in consecutive clock cycles are independent of each other.

With those assumptions, the average power can be calculated using Equation 2.3:

\[ P = 0.5 \cdot V_{dd}^2 f \sum_{\text{all nodes}} C_i P_{t,i} \tag{2.3} \]

where $V_{dd}$ is the supply voltage, $f$ is the clock frequency of the circuit, $C_i$ is the total capacitance at node $i$, and $P_{t,i}$ is the transition probability at node $i$. Because of the zero-delay assumption, Equation 2.3 only gives a lower bound on the power - unmatched delays cause multiple transitions at gate outputs. With the temporal independence assumption, the transition probability can be calculated from the signal probability using Equation 2.4:

\[ P_t = 2 \cdot P_{signal} (1 - P_{signal}) \tag{2.4} \]

Transition Density

The Transition Density Technique is more accurate than the Transition Probability Technique and more computationally efficient than event-driven logic simulation. The advantage of the Transition Density Technique over the Transition Probability Technique is that it distinguishes between multiple transitions of a node in a single cycle, making it more accurate. Switching activity can also be thought of as transition density, $D(x)$ (for node $x$), which is the average number of transitions of node $x$ per unit time. Formally, it is given by Equation 2.5,

\[ D(x) \triangleq \lim_{T \to \infty} \frac{n_x(T)}{T} \tag{2.5} \]

where $T$ is the length of the time interval and $n_x(T)$ is the number of transitions in the time interval of length $T$. Given the transition density of all the nodes, the average power dissipation can be calculated using Equation 2.6:

\[ P = 0.5 \cdot V_{dd}^2 \sum_{\text{all nodes}} C_i D(x_i) \tag{2.6} \]

where $V_{dd}$ is the supply voltage, $C_i$ is the capacitance at node $i$, and $D(x_i)$ is the transition density of node $i$.

There are two important quantities in the calculation of activities for all nodes in the circuit using the Transition Density model: static probability and transition density. Static probability is the probability that the signal is high. To calculate the activity of each node in the circuit, the transition density for each node is computed, gate-by-gate, going from the primary inputs to the primary outputs. If we assume that all inputs are uncorrelated, we can use the relationship

\[ D(y) = \sum_{\text{all input pins}} P\left(\frac{\partial f(x)}{\partial x_i}\right) D(x_i) \tag{2.7} \]

where $f(x)$ is the logic function of the gate, $\frac{\partial f(x)}{\partial x_i} = f(x)|_{x_i=1} \oplus f(x)|_{x_i=0}$ is the Boolean difference at the output port with respect to each input, $D(x_i)$ is the transition density at input $x_i$, and $D(y)$ is the transition density at the output, $y$. $P\left(\frac{\partial f(x)}{\partial x_i}\right)$ can be calculated from the static probabilities of the inputs $x_i$ using the relationships:

• $P(\overline{X}) = 1 - P(X)$
• $P(XY) = P(X) \cdot P(Y)$
• $P(X + Y) = P(X) + P(Y) - P(X) \cdot P(Y)$

where $P(X)$ is $P_1(X)$, the static probability of $X$.
Lag-one Model

The Transition Density model assumes that there is no temporal correlation. The purpose of using the lag-one model is to relax this assumption; the lag-one model assumes that the current value of a signal may depend on the value immediately preceding it. Using the lag-one model, the switching probability can be calculated using Equation 2.8:

\[ P_s = 2 \cdot \sum_{x_i \in X_1} P(x_i) \cdot \sum_{x_j \in X_0} P(x_i, x_j) \tag{2.8} \]

For a Boolean function, $f$, $X_1$ is the set of input states such that $f(x_i) = 1\ \forall\, x_i \in X_1$ and $X_0$ is the set of input states such that $f(x_j) = 0\ \forall\, x_j \in X_0$, $P(x_i)$ is the probability that the current input state is $x_i$, and $P(x_i, x_j)$ is the probability that the input state will be $x_j$ at the end of a clock cycle if the state was $x_i$ at the beginning of the clock cycle. This equation represents the summation of probabilities over all pairs of input states $x_i$, $x_j$ such that $f(x_i) \neq f(x_j)$, where an input state is a row of the truth table for $f$.

2.4.4 Techniques: RT-level

RT-level estimators are typically based on macro-modeling. Macro-modeling involves creating power macro-models for the basic functional components in the RTL libraries and characterizing them [4]. The user of an RTL estimator sees the macro-models as black boxes. However, creating a macro-model of a component involves characterizing its representation at a lower level of abstraction [32]. For example, to do power characterization for an adder, we might estimate its gate-level implementation and use information about the gates to derive overall values for its power characteristics.

Although power estimates at higher levels of abstraction are less accurate, they still provide valuable information. With the increase in the size and complexity of designs, it is desirable for designers to be able to estimate the power at a high level of abstraction so that the information can guide early architectural decisions.
Another motivating factor is that the largest power reductions often come from architectural and algorithmic modifications [34], which are least costly to make early in the design flow.

However, although RTL estimators are available in commercial tools, they have not yet gained widespread acceptance in design practice. Reference [4] attributes this to the difficulty of quantifying the accuracy gap between gate-level and RTL power estimation in an industrial setting. Another deterrent noted by reference [4] is the fact that a large amount of characterization must be done to make a library of macro-models; this process must be automated to be efficient.

Dual Bit Type Method

The Dual Bit Type (DBT) Method [34] is an architecture-level strategy for generating accurate black-box models of datapath power consumption. Its creators note that, while typical strategies quantify activity and physical capacitance for their estimates, the strategies do not account for the effect of signal statistics on the activity. In particular, the authors identify the correlation between sign bits of two's complement operands as being an important source of error when using the assumption of randomized inputs to the block being modeled. As an example, consider an FPGA with 8-bit adders and a user circuit where all the operands are 5 bits wide. The lower 5 bits could be adequately represented by uniform white noise (UWN) inputs, but the upper 3 bits would always be identical (correlated) sign-extension values.

The creators of the DBT method propose to account for two input bit types: (1) correlated sign bits, and (2) UWN operand bits. Recall Equation 2.2:

\[ P = 0.5 \cdot \alpha C V_{dd}^2 f \tag{2.9} \]

To account for the two bit types, instead of using a single capacitive coefficient based on UWN inputs, they use multiple capacitive coefficients that account for transitions on each type of data on each input to the block.
However, a two-input single-function module requires 73 capacitive coefficients; the number increases for semi-configurable multi-function DSP blocks that are found in FPGAs.

Entropy-based

In reference [35], the authors propose to characterize the average switching activity of a module by using the average switching activity of a typical signal line in the module. Their goal is to obtain an acceptable estimate with a limited number of design details and at a significantly lower computational cost. They derive simple closed form expressions to approximate the switching activity in the RTL blocks using the concepts of entropy and informational energy. However, to manage the complexity of their calculations, they make the following simplifying assumptions:

• Simplified, uniform network structure: Each level of the circuit has the same number of nodes and all the gates on each level are assumed to get their inputs from the previous level.
• Asymptotic network depth: The number of levels in the circuit is large enough to be considered infinite.

Unfortunately, DSP and arithmetic blocks in FPGAs do not have a uniform network structure and are not so large that we can approximate their network depth as infinite.

2.4.5 FPGA-specific Power Estimation Tools

Spreadsheets

The most accurate power estimation results for an FPGA design will be after the design has been implemented (i.e. placed, routed, and then simulated with accurate stimulus vectors). However, it is valuable to understand the impact of early high-level design decisions on power dissipation. Power estimation spreadsheets can be used in the pre-implementation phase to obtain a rough idea of power dissipation for a design. These spreadsheets contain detailed device data constants from FPGA manufacturer datasheets.
The user enters environmental conditions, voltage and clock information, logic utilization, and toggle rates. Early spreadsheets only calculated total power dissipated for voltage sources and components [36] [37]. The spreadsheets for the latest FPGA families from Altera and Xilinx are newer and calculate the static, dynamic, and total power consumption [38] [39]. The Xilinx Virtex-4 spreadsheet also provides graphical representations of power, voltage, and temperature relationships and power used by each type of component.

It should be noted that these spreadsheets compute power in a device-specific manner, based on constants. The user is expected to provide toggle activity information for each block, but (s)he might not know what values to use at such an early stage.

CAD tools

Industrial CAD tools that offer more accuracy than spreadsheets are Xilinx XPower and Altera PowerPlay Power Analyzer. They are used in the implementation phase, when design details such as placement and routing have been established.

XPower requires either user supplied toggle rates, as with the spreadsheets, or post-implementation simulation data to estimate the power consumed [40]. PowerPlay is similar, but also includes (for some device families) vectorless activity estimation to statistically estimate the signal activity of a node using the activities of the signals feeding the node and the logic function implemented by the node [41]. The limitation of these tools is that they apply only to some of the Altera and Xilinx devices.

The Poon power model is a freely available, detailed, flexible power model that has been integrated into the Versatile Place and Route (VPR) CAD tool. It estimates the dynamic, short-circuit, and leakage power consumed for a wide variety of user-specified FPGA architectures. It is described in detail in Chapter 3.
2.5 Focus and Contribution of Thesis

Section 2.1 describes the basic island-style architecture and the improvements that exist in commercial FPGAs to improve density and speed. Unfortunately, available academic power estimation tools only support basic island-style architecture components. The goal of this research project is to enable fast and accurate estimation of power dissipated in FPGA designs that include embedded multiplier and DSP blocks (for the remainder of the thesis, both embedded multipliers and DSP blocks will be referred to as DSP blocks).

Our project uses both simulation-based and probabilistic information at the gate-level to create a Power Estimation Tool Flow that includes automated RT-level embedded DSP block macro-model characterization. This work builds upon the Poon power model and the widely used VPR CAD tool, which are described in Chapter 3.

The contributions of this thesis can be summarized as:

1. Identification of a fast and accurate technique to estimate the switching activity of an embedded DSP block
2. Identification of a fast and accurate technique to estimate power dissipated by DSP blocks
3. A tool flow for estimating embedded DSP block power in the context of FPGA designs.

The impact of our enhanced tool flow is threefold; the existence of a freely available, architecturally flexible FPGA CAD tool that includes power modeling for embedded DSP blocks enables:

1. the investigation of power-aware architectures containing embedded DSP blocks
2. the investigation of power-aware CAD algorithms for FPGA circuits containing embedded DSP blocks
3. the incorporation of power tradeoffs in the design of user circuits.

Chapter 3 Framework

This chapter introduces the existing experimental CAD tool suite that forms the basis of our work. Section 3.1 describes the flow of the framework.
Section 3.2 describes the T-VPack tool for packing basic logic elements into cluster-based logic blocks and the original VPR CAD tool. Section 3.3 describes the Poon power model and the improved activity estimation tool, ACE-2.0. Section 3.4 describes how our work fits into the existing framework and the requirements for our work.

3.1 Overall Flow

Our work is based upon the VPR CAD tool suite, enhanced with the Poon power model (together PVPR). Frequently, "VPR" refers to the pair of tools T-VPack and VPR, since they are typically used together. In this thesis, we will do the same.

Figure 3.1 illustrates the steps in the PVPR tool flow. The left side is the original VPR flow. For the Poon power model, activity estimation was added; this is shown to the right of the original VPR flow.

Figure 3.1: Overall CAD Flow (from [5]) (Netlist; Logic Optimization and Technology Mapping; Mapped Netlist; Parameterized Architecture Description; Power Estimates)

The first input to the flow is a netlist describing the user's circuit. This netlist must be pre-processed to generate the correct data and data format required by VPR. This pre-processing involves logic optimization using SIS [42] and technology mapping to LUTs and flip-flops using FlowMap + FlowPack [43]. The result of technology mapping is a netlist mapped to the desired FPGA architecture.

This mapped netlist and input stimuli are inputs to the activity estimation module, ACE-1.0, which is based on the Transition Density model. The output of ACE-1.0 is switching activity information for each node in the mapped netlist.

The mapped netlist, switching activity information, and cluster architecture parameters are then input to T-VPack, which packs the LUTs and flip-flops into cluster-based logic blocks. The cluster-based blocks are placed using the VPR placement engine.
The connections between the placed blocks are then routed using the VPR routing engine.

VPR generates reports of placement and routing statistics. Architectural investigations can then be performed by varying the parameters in the parameterized architecture description and examining the resulting statistics. Algorithmic investigations can also be performed by making modifications to the packing, placement or routing engines and examining the statistics for a set of benchmark circuits. The Poon power model adds the generation of power statistics for the clock, logic, and interconnect to VPR. These power statistics can be used in both architectural and algorithmic studies for basic island-style FPGA architectures.

3.2 Versatile Place and Route (VPR)

VPR is a freely available CAD tool that is widely used for performing FPGA architectural studies. It is composed of a packing tool, a placement and routing engine, and a detailed area and delay model.

3.2.1 Architectural Assumptions

There are a large number of architectural alternatives for FPGAs and not all are supported by VPR. VPR targets SRAM-based island-style FPGAs with cluster-based logic blocks and perimeter I/O. Each SRAM cell is made of six minimum-sized transistors with gate voltage boosting to overcome the body effect. Four types of switch block architectures are supported for the programmable connection of routing tracks: Disjoint [44], Universal [45], Wilton [46], and Imran [47].

3.2.2 T-VPack

T-VPack is a timing-driven CAD tool; it takes a circuit netlist that has been technology mapped to LUTs and flip-flops and packs these basic logic elements into larger cluster-based logic blocks. Before the placement stage of VPR, the circuit netlist is processed using the T-VPack tool.
As described in Section 2.1.1, the use of coarse-grained logic blocks results in faster, denser circuits, and in faster place and route runtimes.

T-VPack has the optimization goals of:

• Minimizing the number of inter-cluster connections on the critical path of the circuit
• Reducing the number of connections required between clusters by minimizing the number of inputs to the clusters
• Minimizing the number of clusters needed

3.2.3 Placement and Routing Engine

The placement tool assigns the cluster-based logic blocks to locations in the FPGA. The FPGA is modeled as a set of legal locations where logic blocks or I/O pads can be placed. An initial random placement is constructed, then simulated annealing is used to improve the solution. Optimization goals involve minimizing wiring and maximizing circuit speed. As will be described in Section 6.1.2, we modified the placement tool to place DSP blocks as well.

Once placement is complete, the routing tool determines which programmable switches to turn on to make the required inter-logic block connections in the FPGA. VPR represents the routing architecture of the FPGA as a directed graph called the routing resource graph. Two routing algorithms are available: a purely routability-driven algorithm and a timing- and routability-driven algorithm.

3.2.4 Architectural Flexibility

The reason for VPR's versatility is its flexible representation of architectures that the user specifies in an architecture file. The following features can be specified:

• Logic block architecture
• Detailed routing architecture
• Channel width
• Timing analysis parameters
• Process technology parameters and capacitances

3.3 Poon Power Model

3.3.1 Architectural Assumptions

As the Poon model is incorporated into VPR, it uses the architectural assumptions made by VPR.
However, the original version of VPR assumes that the clock and other global signals are implemented using special dedicated resources. The version of VPR enhanced with the Poon model assumes an H-Tree clock distribution network and uses the total capacitance of the clock network for power estimation.

3.3.2 Activity Estimation

In the Poon model, the first step is to estimate the activities of the nodes in the FPGA. The activity estimation tool for the power model is called ACE. To distinguish between major revisions of the tool, the original will be referred to as ACE-1.0 and its successor will be referred to as ACE-2.0 [48]. This section will describe the techniques used for estimation in ACE-1.0 and ACE-2.0.

ACE-1.0

ACE-1.0 is the original activity estimation tool for the Poon model. It estimates the static probability ($P_1$), switching probability ($P_s$), and switching activity ($A_s$) for combinational and sequential gate-level circuits using the Transition Density signal model. The original Transition Density model only handles combinational circuits, but was enhanced to support sequential circuits. To support circuits with sequential feedback, an iterative technique is used to update the switching probabilities at the output of the flip-flops, using the expressions $P_1(Q) = P_1(D)$ and $P_s(Q) = 2 \cdot P_1(D) \cdot (1 - P_1(D))$. The original Transition Density model was also enhanced to account for logic gate inertial delays by adding an analytical low-pass filter to filter out very short glitches.

The authors of [48] found ACE-1.0 to be inaccurate for large and/or sequential circuits. They found that ACE-1.0 overestimates activities and suggest that the low-pass filter function is insufficient for reducing glitching.
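The flip-flop update expressions quoted above can be illustrated with a small fixed-point iteration. The sketch below is an assumption about how such an iteration might be organized, not the ACE-1.0 source: the callback `p1_d_given_q`, the convergence tolerance, and the toggle-with-enable example circuit are all hypothetical and chosen only to make the behaviour of the two expressions visible.

```python
def iterate_flip_flop(p1_d_given_q, p1_q=0.0, tol=1e-9, max_iter=1000):
    """Iterate P1(Q) = P1(D) until the value stops changing.

    p1_d_given_q is a hypothetical callback that returns P1(D), the static
    probability at the flip-flop input, for a given P1(Q) at its output.
    """
    p1_d = p1_q
    for _ in range(max_iter):
        p1_d = p1_d_given_q(p1_q)
        if abs(p1_d - p1_q) < tol:
            break
        p1_q = p1_d                      # P1(Q) = P1(D)
    ps_q = 2.0 * p1_d * (1.0 - p1_d)     # Ps(Q) = 2 * P1(D) * (1 - P1(D))
    return p1_d, ps_q


def toggle_with_enable(p_enable):
    """Feedback logic D = Q XOR En, with En independent of Q."""
    def propagate(p1_q):
        # D is high when exactly one of Q and En is high.
        return p1_q * (1.0 - p_enable) + (1.0 - p1_q) * p_enable
    return propagate


p1_q, ps_q = iterate_flip_flop(toggle_with_enable(0.1))
```

For this example the iteration settles at a static probability of about 0.5 and therefore a switching probability of about 0.5, whereas a flip-flop that toggles only when an independent enable is high changes value in a fraction of cycles equal to the enable probability (0.1 here). This is one illustration of how the simple update expressions can misestimate activity in circuits with sequential feedback.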
They also attribute the poor sequential circuit performance to the simple expressions used in the iterative technique for updating the switching probabilities at the outputs of flip-flops. The next subsection describes the activity estimator from [48], ACE-2.0, which addresses the weaknesses of ACE-1.0.

ACE-2.0

ACE-2.0 is a faster and more accurate probabilistic activity estimator for the Poon power model. It has three stages that address the weaknesses of ACE-1.0:

1. Simulation of sequential feedback loops
2. Calculation of $P_1$ and $P_s$ values for nodes not in sequential feedback loops using the Lag-one model
3. Calculation of $A_s$ using a probabilistic technique that accounts for glitching

The first stage improves the accuracy of activity estimation in sequential circuits. Since simulation techniques were avoided because of runtime issues, ACE-2.0 only simulates the logic in sequential feedback loops.

In the second stage, ACE-2.0 obtains the $P_1$ and $P_s$ values using the Lag-one model for the parts of the circuit not simulated, which produces exact switching probabilities if we assume that inputs are not correlated [48]. ACE-1.0 uses the Transition Density model, which assumes that there is no temporal correlation. The purpose of using the lag-one model is to relax this assumption; the lag-one model assumes that the current value of a signal may depend on the value immediately preceding it. The most efficient known implementation of the Lag-one model uses a Binary Decision Diagram (BDD). However, there is an exponential relationship between BDD size and the number of inputs, making this implementation impractical for large circuits. ACE-2.0 combines BDD pruning with a partial collapsing technique to give smaller BDDs.

In the third stage, ACE-2.0 calculates the switching activities.
It uses a generalization of the Lag-one model and accounts for glitching by incorporating the concept of a minimum pulse width for passing glitches.

3.3.3 Power Estimation

The Poon model uses estimated capacitances at the transistor level for each component inside the FPGA. Then, using the capacitance values and switching activity estimates, the average power dissipation is calculated. The model was compared against HSPICE simulations. The Poon model dynamic power estimates were found to be within 4.8% for routing and 8.4% for logic. For leakage, the average difference between the estimates and the HSPICE results was 13.4%.

Dynamic power is the dominant component of the total power in an FPGA. The Poon model calculates capacitance values at the transistor level to determine the power dissipation of LUTs, multiplexers, and buffers inside logic blocks. It also uses the metal capacitance of each routing track and the parasitic capacitance of all switches attached to the track, specified using the process technology parameters in the architecture file, to calculate the power dissipated in the FPGA routing. The routing power is a large portion of the dynamic power dissipated. Since the SRAM programming bits in the FPGA do not change value after configuration, they are not included in the dynamic power calculations. The dynamic power is calculated using Equation 3.1, which sums the contribution of each node over all nodes in the circuit.

The short circuit power is modeled as 10% of the dynamic power, based on extensive HSPICE simulations in [5]. The leakage power has two components: reverse bias leakage and subthreshold leakage. As the Poon model was calibrated using a 0.18 µm process technology, it assumes that the reverse bias leakage is negligible. To calculate the subthreshold leakage, the Poon model uses the equation:

$$P_{leak} = I_{drain(weak\ inversion)} \cdot V_{supply} \qquad (3.2)$$

It uses a first order analytical estimation model to estimate the subthreshold current.
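The three power components described in this section can be combined in a short sketch. The 10% short circuit factor and the leakage product of Equation 3.2 come from the text. The per-node dynamic term is an assumption: the sketch uses the common switched-capacitance form 0.5 · C · V_supply · V_swing · activity · f_clk, which may differ from the exact expression in the Poon model. All function names, parameter names, and numeric values are hypothetical.

```python
def dynamic_power(nodes, v_supply, v_swing, f_clk):
    """Sum an assumed switched-capacitance term over all nodes.

    nodes: iterable of (capacitance_in_farads, switching_activity) pairs.
    The per-node form 0.5 * C * Vsupply * Vswing * activity * f is an
    assumption, not a transcription of Equation 3.1.
    """
    return sum(0.5 * c * v_supply * v_swing * activity * f_clk
               for c, activity in nodes)


def short_circuit_power(p_dynamic):
    """Short circuit power modeled as 10% of the dynamic power."""
    return 0.10 * p_dynamic


def subthreshold_leakage_power(i_drain_weak_inversion, v_supply):
    """Equation 3.2: P_leak = I_drain(weak inversion) * V_supply."""
    return i_drain_weak_inversion * v_supply


if __name__ == "__main__":
    # Hypothetical values, chosen only to exercise the functions.
    nodes = [(1e-15, 0.5), (2e-15, 0.25)]
    p_dyn = dynamic_power(nodes, v_supply=1.8, v_swing=1.8, f_clk=1e8)
    p_sc = short_circuit_power(p_dyn)
    p_leak = subthreshold_leakage_power(1e-9, 1.8)
    print(p_dyn, p_sc, p_leak, p_dyn + p_sc + p_leak)
```

Keeping the supply and swing voltages as separate parameters mirrors the architecture file, which allows both to be specified.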
3.3.4 Architectural Flexibility

Enhancements for the Poon model add support to the architecture file for the flexible specification of:

• Supply, swing, and gate-source voltage levels
• Leakage and short circuit power parameters
• NMOS and PMOS transistor characteristics
• Clock network architecture parameters

3.4 DSP Block Power Model and Tool Flow

The DSP block power model that we propose is an extension for the PVPR flow. Section 3.4.1 discusses the requirements of the power model we have developed as part of this work. Section 3.4.2 explains where our work fits into the PVPR flow.

3.4.1 Requirements

As stated earlier, our power model must fit into an existing FPGA CAD tool flow. In order to allow the investigation of future architectures, instead of simply existing commercial architectures, we prefer that this CAD tool be architecturally flexible; it should be possible to specify a wide range of logic block, routing, clock, and DSP block architectures.

In an architectural investigation, many iterations of PVPR are executed to gather data about the impact of varying certain architectural parameters. In order to not hinder the use of PVPR for an investigation requiring tens (or even hundreds) of iterations, our power estimation must be fast. Furthermore, in order for the power estimates from the investigation to be meaningful, they must be accurate.

The previous requirements pertain to the tool flow that is visible to the PVPR user. An important input to the tool flow in Figure 3.2 is the DSP block characterization data. As described in Section 2.4.4, a deterrent to the use of macro-modeling at the RT-level is the fact that a large amount of characterization must be done to make a library of macro-models; this process must be automated to be efficient.
Therefore, to make our tool flow attractive, we must minimize the effort required when adding models for new DSP blocks and automate characterization.

3.4.2 Extending PVPR Flow

Figure 3.2 shows how our model fits into the PVPR flow. Pre-processing of the DSP blocks in the mapped netlist is required before the activities can be generated for the nodes in a user's circuit that contains DSP blocks. Additional characterization data is also required in order to estimate the power of the DSP blocks in the final stage of processing.

Figure 3.2: CAD Flow Enhanced to Support DSP Block Power Estimation

The addition of our work to the PVPR flow expands the support of PVPR to FPGA architectures that contain DSP blocks, thus enabling architectural and algorithmic investigations with circuits that contain these blocks.

3.5 Chapter Summary

Our work modifies the widely used PVPR tool flow. PVPR is composed of the VPR engines and the Poon power model. The VPR engines are the T-VPack clustering algorithm, the VPR placement engine, and the VPR routing engine. The Poon power model adds activity estimation and power estimation for logic blocks, interconnect, and the clock. Our work modifies the PVPR flow to add activity estimation for DSP blocks and power estimation of DSP blocks using characterization information. The requirements for our work are:

• The DSP block power estimation must fit into an existing architecturally flexible FPGA CAD tool flow
• The power estimation must be fast
• The power estimation must be accurate
• The characterization effort must be low and should be automated

As mentioned in Section 1.2, the last three of these requirements are competing factors; fast estimation and low characterization effort will generally lead to less accurate results. Thus, we must find a suitable balance between speed, characterization effort, and accuracy.

Details of how we determined accurate methods for performing activity estimation and power estimation for DSP blocks are discussed in Chapters 4 and 5, respectively. The complete automated tool flow, including DSP block characterization, is described in Chapter 6.

Chapter 4
Activity Estimation

4.1 Motivation for developing Activity Estimation Techniques

An important part of estimating the power dissipated in an FPGA is estimating the activity of each connection in the circuit. In the Poon power model, activities are calculated gate-by-gate, starting from the primary inputs. Since each gate (LUT) is small, the Transition Density or Lag-one model can be used to calculate the activity of the output of each LUT as a function of the activity of its inputs. It is not feasible, however, to propagate activities through a DSP block using the Transition Density or Lag-one model since the computation performed using these models (and other related models) is $O(2^k)$, where $k$ is the number of inputs to the block. Thus, a new technique is required. In this chapter, two alternative techniques to estimate the activities of each DSP output pin are considered. The two techniques are compared to determine which is suitable for use within the experimental CAD flow that was developed in this work.
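To see where the exponential cost comes from, consider a minimal sketch of Transition Density propagation through a single k-input gate. It assumes spatially independent inputs and the usual Boolean-difference formulation, in which the output density is the sum over inputs of P(output is sensitive to input i) times the density of input i. The function name and the example gates are hypothetical and are not taken from ACE.

```python
from itertools import product


def transition_density(func, p1, density):
    """Propagate transition density through one gate by brute-force enumeration.

    func:    callable mapping a tuple of k bits to 0 or 1
    p1:      static probability of each input being 1
    density: transition density of each input
    For every input, all 2**(k-1) assignments of the other inputs are visited,
    so the cost grows exponentially with k.
    """
    k = len(p1)
    total = 0.0
    for i in range(k):
        p_sensitive = 0.0  # probability that toggling input i toggles the output
        for others in product((0, 1), repeat=k - 1):
            low = others[:i] + (0,) + others[i:]
            high = others[:i] + (1,) + others[i:]
            if func(low) != func(high):
                prob = 1.0
                for j, bit in enumerate(others):
                    src = j if j < i else j + 1  # skip over input i
                    prob *= p1[src] if bit else 1.0 - p1[src]
                p_sensitive += prob
        total += p_sensitive * density[i]
    return total


if __name__ == "__main__":
    and2 = lambda x: x[0] & x[1]
    xor2 = lambda x: x[0] ^ x[1]
    print(transition_density(and2, [0.5, 0.5], [0.2, 0.4]))
    print(transition_density(xor2, [0.5, 0.5], [0.2, 0.4]))
```

For a 4-input LUT the inner loop visits 8 assignments per input. For a block with 36 inputs, such as an 18-bit x 18-bit multiplier treated as a single gate, it would visit 2^35 assignments per input, which illustrates why the block cannot be handled directly in this way.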
It is important to note that the techniques described below are only being used to estimate the activities of the output pins of each embedded block. Power estimation of these blocks will be discussed in Chapter 5.

Figure 4.1: Gate-Level Technique. A) Flattened DSP Blocks for Gate-Level Technique. B) Gate-Level Technique Flow.

4.2 Techniques for Activity Estimation of an Embedded Block

This section describes two techniques for estimating the activity in circuits containing embedded DSP and multiplier blocks. Although Platform-style FPGAs also contain memories, processors, and other features, we limit our experiments to circuits containing only DSP and multiplier blocks, logic, and interconnect. This ensures that the results we obtain are not obscured by assumptions about the other Platform-style features.

4.2.1 Gate-Level Technique

The first technique is an extension of the Poon model activity estimation method. The embedded blocks are too large for us to apply the Transition Density or Lag-one model to them directly. To be able to use the Transition Density or Lag-one models, this technique involves representing the embedded blocks by their gate-level implementations. The flattened circuit would then consist of LUTs (for the parts of the circuits not in the embedded blocks) and the gates that make up each embedded block. The Transition Density or Lag-one model can then be applied directly to this flattened netlist. This technique will be referred to as the Gate-Level Technique.
Figure 4.1(A) shows the circuit with the flattened DSP blocks in grey and the LUT-and-interconnect part of the circuit in white. Figure 4.1(B) shows a flow that employs this technique. In this flow, the DSP block to be evaluated is described in Verilog. Synopsys Design Compiler is used to map the block to gates, using TSMC 0.18 µm technology library information. The parent circuit to be considered is technology mapped using Quartus II in order to determine what gets mapped to DSP blocks. The parent circuit is then flattened and the DSP blocks are replaced with their gate-level representations. Then, activity estimation is performed. Currently, the use of Quartus II for technology mapping restricts us to Altera-style DSP blocks; however, Quartus II could be replaced with another technology mapping tool to evaluate non-Altera-style DSP blocks.

It is important to emphasize that we do not modify the netlist that will be implemented in the FPGA. The flattened netlist is generated only during activity estimation.

The principal advantage of this technique is that it accounts for the correlation between the input activities and output activities of the DSP blocks. Another advantage of this technique is that it allows the use of ACE-2.0 to estimate the activity for all the nodes in the circuit, including the DSP block nodes.

Figure 4.2: Independent Output Technique. A) Independent Output Technique Terminology. B) Independent Output Technique Flow.

The disadvantage of this technique is that it requires a gate-level implementation of the DSP block and proprietary technology library information.
Since we expect that this technique would be used for evaluating a large number of architectures, a non-trivial amount of effort would be involved in generating gate-level implementations for each DSP block to be evaluated.

4.2.2 Independent Output Technique

The second technique that was evaluated addresses the disadvantage of the previous technique. If sufficiently accurate, we would prefer to use a technique that does not require proprietary technology or implementation details.

In this technique we propose to model the embedded blocks as if they are external to the circuit. The remainder of the circuit is composed of LUTs and interconnect, so the Poon model can be applied to that part. To model the DSP blocks as external to the circuit, the inputs to the embedded block are treated as primary outputs of the circuit and the outputs of the embedded block are treated as primary inputs to the circuit. We refer to the DSP block inputs and outputs as pseudo-outputs and pseudo-inputs of the circuit, respectively, as shown in Figure 4.2(A).

In this technique, the pseudo-inputs are assigned values in the same way that the primary inputs are assigned values by the Poon model. Random input vectors with a specified average activity are applied to the inputs and pseudo-inputs. The activities are then propagated through the circuit using ACE-2.0. When the inputs to a DSP block (the pseudo-outputs) are encountered, the activities are not propagated through the DSP block. Instead, the pseudo-outputs are treated in the same way as the primary outputs. The activity calculations for the nodes downstream from the DSP block proceed using the pseudo-input values. In effect, the estimated output activities of a DSP block are then independent of the input activities to the DSP block. This technique will be referred to as the Independent Output Technique.

Figure 4.2(B) shows a flow that employs this technique.
The parent circuit is technology mapped using Quartus II. The DSP blocks are removed from the netlist and the nodes that were formerly outputs of the DSP block are represented as inputs to the circuit. Input vectors are then applied to the inputs and pseudo-inputs and activity estimation is performed using ACE-2.0.

The advantage of this technique is that it does not require gate-level implementation and technology information. The disadvantage of this technique is that inaccuracies are being introduced because, in general, DSP block output transitions are not independent of their input transitions. Reference [49] found that for word-parallel or bit-serial arithmetic, the average activity at the output of an adder can be closely approximated by the maximum of the average activities of the two inputs, implying that there is a dependence.

4.3 Methodology and Results for Activity Estimation

In comparing the accuracy of the two techniques, the Gate-Level Technique will be used as the baseline. To determine how well the simpler Independent Output Technique correlates with the more accurate Gate-Level Technique, two quantities will be compared:

1. The activities at the outputs
2. The activities at the downstream nodes

In order to accurately estimate the power dissipated by the nets driven by the DSP block, accurate activities for these nets are needed. However, if the activities of all the DSP output pins are similar, then using a single average value could be sufficient.

When we refer to the downstream nodes of a DSP block, we mean the nodes between the DSP block outputs and the circuit outputs. Inaccurate activity estimates at the DSP block outputs may lead to inaccurate activity estimates for the downstream nodes due to the iterative nature of activity estimation algorithms (the output activity of each node is estimated based on the activities of the node inputs).
Since there are typically many more downstream nodes than there are DSP output pins, inaccuracies in these downstream nodes may have a larger impact on the accuracy of the estimation than would inaccuracies in the DSP outputs. In this section, both of these quantities were measured to determine whether the Independent Output Technique provides sufficient accuracy.

4.3.1 Output Nodes of DSP Block

In this subsection, the activities at the outputs are compared. To begin the investigation, the activity of each output pin of a 9-bit x 9-bit multiplier was examined. Random inputs (with a known average activity) were applied to the inputs of the multiplier, and ACE-2.0 was used to estimate the output activity of each pin. The results are plotted in Figure 4.3. The horizontal axis spans the set of 19 multiplier outputs (sign, LSB to MSB) and the vertical axis is the estimated activity of each of these outputs. Each line corresponds to a different average input activity.

Figure 4.3: Output Pin Activities for Unregistered Multiplier

Figure 4.4: Output Pin Activities for Registered Multiplier

For all values of the average input activity, the conclusion is the same: the estimated activity differs across the output pins. The pins that are on the far left and right of the graph (the least and most-significant bits) have low activities, while the activities of the middle bits are large. This implies that there is a specific distribution for the activities of the output pins, and that choosing these activities randomly (as is done in the Independent Output Technique) will lead to inaccurate activity estimates for these nodes.

Note that the activities reported in Figure 4.3 are large, mostly greater than one.
This is because multipliers tend to produce a large number of glitches on their outputs [50]. Most DSP blocks, however, contain registers on their output pins, which will remove these glitches. Figure 4.4 shows the results of the same experiment in which registers are added at the output of each multiplier. As the graph shows, the distribution is still there, especially for the extreme least and most significant bits.

Figure 4.5: Output Pin Activities for Unregistered DSP Block

Figure 4.6: Output Pin Activities for Registered DSP Block (one line each for average input activities of 0.3, 0.5, 0.7, and 0.9)

Figure 4.7: 18-bit x 18-bit Multipliers Combined for 36-bit x 36-bit Multiplier

Figure 4.8: FIR Filter Used for Activity Experiments

Figures 4.5 and 4.6 show the results of the same experiment on a more complicated DSP block. The block is shown in Figure 4.7. It is similar to an Altera DSP block configuration: it combines four 18-bit x 18-bit multipliers and an adder to give a 36-bit x 36-bit multiplier. Again the conclusion is the same: a single average value to represent the activities will not capture the distribution of activities at the output pins.

4.3.2 Downstream Nodes of DSP Block

The results in Section 4.3.1 were for the DSP output pins only. In this section, we consider the nodes that lie downstream from the DSP blocks. Because ACE-2.0 propagates activities from inputs to outputs, inaccuracies in the DSP output activities will lead to inaccuracies in these downstream activities. The purpose of the remainder of this section is to understand how inaccurate these activities will be.

To perform these experiments, the FIR filter shown in Figure 4.8 was used.
This circuit contains a bank of four multipliers followed by an adder tree; registers are included after the multipliers and within the adder tree to support pipelining, and to reduce glitch power. The activities of all nodes in the circuit were estimated using two methods: the Independent Output Technique flow shown in Figure 4.2(B) and the Gate-Level Technique flow shown in Figure 4.1(B).

Figure 4.9: Activity Results for the FIR Filter

Figure 4.10: Activity Results for the Converter

Figure 4.9 shows the results. In this graph, each dot corresponds to a node in the circuit; only nodes that are "downstream" (to the right of) the multipliers are included, starting with the outputs of the multiplier output registers. The x-coordinate of a dot is the activity predicted for the corresponding node by the Gate-Level (more accurate) Technique, while the y-coordinate of the dot is the activity predicted for the corresponding node by the Independent Output (less accurate) Technique. In this plot, a straight line at y = x would indicate that there is perfect correlation between the two estimation techniques. As the graph shows, the correlation is good; the $R^2$ correlation metric is 0.8091. This is surprising, since the activities of the multiplier outputs are as shown in Figure 4.4 for the Gate-Level Technique, but random for the Independent Output Technique. The reason has to do with the nature of the downstream circuit. As mentioned in Section 4.2.2, [49] found that for word-parallel or bit-serial arithmetic, the average activity at the output of an adder can be closely approximated by the maximum of the average activities of the two inputs. Other adder configurations were considered with similar results.
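The comparison between the two techniques can be sketched as follows. The text does not state how the R^2 metric was computed; this sketch assumes it is the squared Pearson correlation coefficient between the per-node activities from the two techniques. The function name and the activity values are hypothetical and are not measurements from these experiments.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences.

    This is one common meaning of an R^2 correlation metric; it is an
    assumption that the reported values were computed this way.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return (cov * cov) / (var_x * var_y)


if __name__ == "__main__":
    # Hypothetical downstream-node activities, one entry per node.
    gate_level = [0.10, 0.25, 0.40, 0.55, 0.70]
    independent_output = [0.15, 0.20, 0.45, 0.50, 0.80]
    print(r_squared(gate_level, independent_output))
```

A value near 1 means that the Independent Output activities track the Gate-Level activities closely across the downstream nodes. Under this definition a high value indicates a strong linear relationship, which is a weaker condition than all points lying on the line y = x.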
The good correlation values do not hold for other downstream circuits, however. The adder tree in Figure 4.8 was replaced with a signed-magnitude to 2's complement converter, and we found that $R^2 = 0.542$, as shown in Figure 4.10, which is not nearly as good.

4.4 Conclusions for Activity Estimation

Based on the results from the previous subsection, it was concluded that the activities obtained using the Independent Output Technique do not correlate well with those obtained using the more accurate Gate-Level Technique for all downstream circuits. Since the Power Estimation Tool Flow must work for a wide variety of DSP architectures and downstream circuits, it was concluded that the Gate-Level Technique is required for the flow.

Note that the results in Figures 4.4 and 4.6 suggest a third activity estimation technique (instead of the Independent Output and the Gate-Level techniques). Rather than generate the multiplier output activities randomly (as in the Independent Output Technique), it may be possible to construct a distribution function, and generate activities based on this distribution function. While this would be possible, it would require a significant amount of characterization effort each time a new embedded DSP block is to be evaluated, since the distribution function can be significantly different for different DSP block architectures. Given that ACE-2.0 is fast and accurate [48], and the Gate-Level Technique is easier, the extra characterization effort for this third method is not warranted.

4.5 Chapter Summary

The motivation for developing a new activity estimation technique is runtime. It is not feasible to propagate activities through a DSP block using the Transition Density or Lag-one model since the computation performed using these models (and other related models) is $O(2^k)$, where $k$ is the number of inputs to the block.
The Gate-Level Technique was introduced as an extension of the Poon model activity estimation method. The Independent Output Technique was introduced because we would prefer to use a technique that does not require proprietary technology or implementation details. Our experiments showed that the more accurate Gate-Level Technique is required for our Power Estimation Tool Flow.

Chapter 5
Power Estimation

Once the activity estimates for each input and output pin of the DSP block have been obtained, the Power Estimation Tool Flow must estimate the power dissipated within each DSP block. This chapter describes and evaluates techniques for doing characterization-based power estimates.

5.1 Techniques for Power Estimation

5.1.1 Objectives

Since one of the objectives for this flow is to allow architectural exploration and experimentation, power estimation must be fast. This will facilitate many iterations of the flow in architectural parameter sweeps. Although it would be possible to create a gate-level model and use gate-level power simulation (such as with PrimePower), this would be far too slow to include in the inner loop of the Power Estimation Tool Flow. Therefore, a method is needed to quickly estimate the power of the embedded block, without resorting to modeling every internal node in the block.

Reduced estimation time typically comes at the cost of accuracy. In this chapter, we compare the accuracy of three fast and relatively simple techniques for estimating the power of an embedded DSP block against simulation results. For each technique, offline characterization is used to obtain data that can be quickly referenced at runtime. A limited amount of data is found once for each DSP block architecture, offline, using PrimePower. The following sections describe the three techniques we considered in increasing order of modeling effort.
5.1.2 The Constant Technique

The first technique is the simplest. For this technique, the power dissipated by a DSP block is assumed to be a constant, dependent on the DSP block type and independent of the activities of the input and output pins. This technique will be referred to as the Constant Technique.

The advantages of this technique are that it is simple and fast. The disadvantage of this technique is that it may lead to inaccurate estimates, since the power dissipated in an embedded block does depend on the input pin activities. However, this technique could be sufficient if the dependence is weak and the deviation from the average power is small.

5.1.3 The Lookup Technique

For the second technique, we approximate the power dissipated by the embedded block as a function of the average activity of all the DSP block inputs. The function need not be linear, and may be implemented as a look-up table rather than as a closed-form function. This technique will be referred to as the Lookup Technique.

The advantage of this technique is that it is still relatively simple and fast, though it does involve more modeling effort than the Constant Technique. The disadvantage of this technique is that it assumes all input pins contribute equally to the power dissipated by the block. If we consider a multiplier block, for example, it will have pins corresponding to the multiplicand and multiplier operands. The fanout logic from the multiplicand operand pins may be very different from the fanout logic from the multiplier operand pins; thus, we expect that their contribution to the power dissipation will differ. However, if the difference does not cause substantial variation in the total power dissipation from an average over many trials, then this technique could be sufficient.

5.1.4 The PinCap Technique

The Lookup Technique may provide inaccurate power estimates, since only the average input activity is used to estimate the power. As mentioned in the previous section, in reality, not all input pins are equal; activity on some pins may have more impact on the power dissipation of a block than the same activity on other input pins. To take this into account, a third technique is considered, called the PinCap Technique. For this technique we estimate, for each input pin in isolation, how much of an impact that pin has on the overall power dissipation of the embedded block. This is quantified by calculating an effective capacitance, $C_i$, for each input pin $i$. Then, the total power is calculated using Equation 5.1, a sum over all input pins in which $f$ is the frequency of the circuit and $a_i$ is the activity of pin $i$. Intuitively, this technique might provide accurate results, at the expense of more characterization effort.

Figure 5.1: Methodology for Power Estimation Experiments

5.2 Methodology for Power Characterization

Each of the three techniques requires some amount of offline characterization. This section describes the characterization methodology for each technique. Figure 5.1 shows the flow used to obtain this characterization data. A Verilog description of the DSP block was synthesized to gates using Synopsys Design Compiler. A Verilog testbench was used to simulate a set of input vectors applied to the gate-level description of the DSP block in Verilog-XL. The simulation data was then fed to the Synopsys PrimePower simulator to obtain characterization information. This is done for a training set of input vectors, according to the characterization technique used.
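The three techniques of Section 5.1 can be contrasted in a short sketch of how each would turn characterization data into a runtime estimate. Several details are assumptions made for illustration: the look-up table is linearly interpolated between characterized points and clamped at its ends, and the per-pin term of the PinCap sum is taken to be 0.5 · C_i · V^2 · f · a_i, which may differ from the exact form of Equation 5.1. All names and numeric values are hypothetical.

```python
from bisect import bisect_left


def constant_power(characterized_power):
    """Constant Technique: one characterized value, independent of activity."""
    return characterized_power


def lookup_power(table, avg_activity):
    """Lookup Technique: power as a function of the average input activity.

    table: list of (average_activity, power) pairs sorted by activity.
    Linear interpolation and end clamping are assumptions of this sketch.
    """
    activities = [a for a, _ in table]
    if avg_activity <= activities[0]:
        return table[0][1]
    if avg_activity >= activities[-1]:
        return table[-1][1]
    hi = bisect_left(activities, avg_activity)
    (x0, y0), (x1, y1) = table[hi - 1], table[hi]
    return y0 + (y1 - y0) * (avg_activity - x0) / (x1 - x0)


def pincap_power(effective_caps, activities, v_supply, freq):
    """PinCap Technique: sum of independent per-pin contributions.

    The 0.5 * C * V^2 * f * a term is an assumed form of the per-pin power.
    """
    return sum(0.5 * c * v_supply ** 2 * freq * a
               for c, a in zip(effective_caps, activities))


if __name__ == "__main__":
    table = [(0.1, 1.0e-3), (0.5, 3.0e-3), (0.9, 4.0e-3)]  # hypothetical data
    print(constant_power(2.5e-3))
    print(lookup_power(table, 0.3))
    print(pincap_power([1e-12, 2e-12], [0.5, 0.25], v_supply=1.8, freq=1e8))
```

The sketch makes the difference in inputs explicit: the Constant Technique needs no activities, the Lookup Technique needs a single average, and the PinCap Technique needs one activity and one characterized capacitance per pin.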
5.2.1 Constant Technique

For the Constant Technique, we repeated this characterization task nine times for the given DSP block for average input activity values in the set {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}. For each of the 9 average activity values, a training set of 10 x 5000 input vectors was simulated to get 10 power estimates. We took the average of all 90 results to obtain a single value to use as the constant power value for the DSP block.

It should be noted that there is a one-to-many relationship between the average activity for a set of input vectors and power estimates for that average input activity. Different sets of input vectors may differ in the individual activities of each input pin and, as described in Section 5.1.4, different pins may contribute to the power of the block differently.

5.2.2 Lookup Technique

For the Lookup Technique, we reused the data from the Constant Technique characterization. We averaged the 10 estimates for each of the 9 activity values to obtain 9 data points. These 9 average input activities and their corresponding average power estimates became the activity-power estimate pairs for the Lookup Technique look-up table. This table was then included in the power model.

5.2.3 PinCap Technique

The input vector sets used for characterization for the PinCap Technique were different from those used for the Constant and Lookup Technique characterization. In those vector sets, all bits would toggle. For this technique, to assess the contribution of individual pins, all other pins except the one in question were held constant; the pin in question was then toggled with the given activity. The resulting power from PrimePower and Equation 5.1 was then used to determine the effective capacitance for pin $j$, $C_j$, because $a_i = 0$ for all $i \neq j$. This was repeated for each input pin to the DSP block. Once the effective capacitance for each input pin was determined, the values were included in the power model. Input sets with average activities of 0.2 and 0.5 were used for the characterization. Conclusions about the PinCap Technique could be drawn from the results for these two average input activities, so PinCap characterization for the remaining activities between 0.1 and 0.9 was not necessary.

Figure 5.2: Power Estimation Experimental Methodology

5.3 Evaluation

In this section, we evaluate the accuracy of each of the three power estimation techniques. The experimental methodology is shown in Figure 5.2. For each technique, we performed the characterization tasks described in Section 5.2 for a 9x9-bit multiplier. We then compared the power estimated by each technique to that estimated by PrimePower simulation (which is presumably more accurate than any of our three techniques). For this experiment, we used 30 x 5000 randomly generated input vectors at each of the average input activity values in the set {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9} (this is more than was used during the offline characterization).

Figure 5.3: Power Estimation Experiment Results (horizontal axis: Average Activity for Vector Set)

Figure 5.3 shows the results for the Constant Technique and the Lookup Technique. Each point on the graph represents the results for one set of 5000 vectors. The x-coordinate of the point is the average activity of all input vectors within the set, and the y-coordinate of the point is the estimated power. The triangular dots represent the estimates using the Constant Technique, while the circular dots represent the predictions from the Lookup Technique.
For comparison, the PrimePower estimates are also plotted on the same graph; they are shown as diamonds. As the graph shows, the Constant Technique (in which a single power value is used, independent of the input activity) provides poor estimates compared to the PrimePower results. The Lookup Technique, on the other hand, produces results that track the PrimePower results well. The average difference between the Lookup Technique and PrimePower estimates was 8%. The experiment was repeated with a DSP block rather than a multiplier, and the conclusions were the same; the average difference between the Lookup Technique and the PrimePower results was 5%. This suggests that it is important to take the input activities into account when estimating the power of the block, but that the average of the input activities is enough information to get reasonably accurate power estimates.

Figure 5.4: Power Estimation Experiment Results (PrimePower and PinCap estimates; vector sets sorted by PrimePower power value, in increasing order)

For the PinCap Technique, the results were not as good. Figure 5.4 shows the estimates obtained using the PinCap Technique, along with the estimates obtained using PrimePower. In this graph, only vector sets with an average input activity of 0.2 were considered. Each point in this graph corresponds to one such set. The points are distributed evenly across the x-axis, sorted by the values of the PrimePower estimate. As the graph shows, even for these sets, which all correspond to the same average activity, the power estimates predicted by the PinCap Technique vary widely and have no correlation to the PrimePower estimates.
The PinCap Technique always overestimates the actual power; this is because the technique assumes that the power contribution of each input pin is independent of the power contribution of every other input pin, when in fact it is not. The total power is not simply the sum of the contributions from each pin. Even if the PinCap results are scaled by a constant value in an attempt to reflect this, the results do not track the simulation results well (this is also shown in Figure 5.4). Other input activities were also attempted, and no simple way was found to scale the results to take into account the overlap between the contributions of multiple input pins.

5.4 Conclusions for Power Estimation

Based on the results from the previous section, we concluded that the Lookup Technique is most appropriate for the Power Estimation Tool Flow. Not only does it provide reasonably accurate results, but the characterization effort is relatively simple, and the run-time of the power estimate is small.

5.5 Chapter Summary

The objective of the power estimation experiments was to determine a fast and accurate technique for estimating the power of DSP blocks, given input activities. We introduced three techniques, in increasing order of complexity: (1) the Constant Technique, (2) the Lookup Technique, and (3) the PinCap Technique. Power estimates using these techniques were performed and compared against PrimePower simulations for accuracy. We found that it is important to consider the input activities when estimating the power and that the Lookup Technique is most suitable for our Power Estimation Tool Flow.

Chapter 6

Power Estimation Tool Flow

This chapter describes how we combined our Gate-Level activity estimation technique from Chapter 4 and our Lookup power estimation technique from Chapter 5 with the PVPR framework to obtain a complete power estimation CAD tool flow.
Section 6.1 describes the overall flow and the modifications we made to PVPR. Section 6.2 describes the processing of two benchmark circuits through the entire flow and compares our estimates against PrimePower simulations of the circuits.

6.1 Overall Flow

6.1.1 Functionality

Our CAD tool flow for estimating the power dissipated in FPGA circuits containing embedded DSP blocks is shown in Figure 6.1. The activity estimation and power estimation are performed as described in Chapters 4 and 5.

Figure 6.1: Power Estimation Tool Flow

The steps in the box on the left correspond to activity estimation. For activity estimation, there are three inputs: (1) the user's circuit, (2) a Verilog description of the DSP block, and (3) input statistics (either as vectors for the circuit inputs or as the transition density and static probability of each input). To facilitate architectural investigations, we created a parameterized DSP block generator to automate the production of Verilog descriptions for a large number of DSP architectures. The DSP block description is synthesized to a gate-level netlist using Synopsys Design Compiler and a TSMC 0.18 µm technology library. The HDL description of the user's circuit is technology mapped to LUTs, flip-flops, and DSP blocks using Quartus II to produce a mapped netlist of these elements and their connections.
The mapped netlist is flattened by replacing the DSP block instances with the gate-level implementation obtained from Design Compiler. ACE-2.0 is then used to obtain the activity of each node in the flattened circuit, using the input statistics that we provided.

The steps in the box on the right correspond to power characterization. Recall that, whereas the activity estimation steps must be repeated each time the user's circuit changes, the power characterization steps need only be done once, when the DSP block is designed, and the results can then be stored as library data. For power characterization, there are two inputs: (1) the gate-level implementation of the DSP block from Design Compiler, and (2) a training set of input vector files. The gate-level implementation is simulated using Verilog-XL, and the resulting power is determined by PrimePower for the training set of input vector files. The average activity and resulting power for each input vector file in the training set are saved as an activity-power pair in the Lookup Technique table in the PVPR architecture file. The DSP block timing characteristics are also stored in the PVPR architecture file.

The activity data for the circuit and the characterization data for the DSP block are then input to PVPR for packing, placement, routing, and power analysis. Currently, the use of Quartus II for technology mapping restricts us to Altera-style DSP blocks; however, Quartus II could be replaced with another technology mapping tool to evaluate different DSP block architectures.

6.1.2 Modifications to PVPR

Neither the original version of VPR nor PVPR supports circuits with embedded DSP blocks. We enhanced the BLIF netlist format [42] to allow for the specification of DSP blocks. We modified the PVPR architecture file format [21] to include power and timing
numbers for these blocks (including the power look-up table proposed in Chapter 5); Table 6.1 describes the parameters we added. We assumed that DSP blocks are arranged in columns, as in Altera and Xilinx devices. We modified the placement algorithm to correctly position these blocks, modified the timing analysis algorithm to estimate the delay through these blocks, and used this information to calculate the critical path of an implementation [51]. We then modified the power model to use the Lookup Technique for DSP blocks.

Table 6.1: Parameters added to the PVPR Architecture File Format

Parameter    Meaning
Start        First column of DSP blocks
Repetition   Number of columns before next column of DSP blocks
Class        Nature of the DSP input and output pins
Location     Locations around the DSP block where the input and output pins can be programmably connected to the routing fabric
Leakage      Leakage power dissipated by the DSP block
Activity     Look-up value for activity in an activity-power pair
Energy       Energy dissipated for a given Activity in an activity-power pair (Energy is used to be independent of clock frequency)

6.2 Flow Demonstration and Comparison to Gate-Level Simulation

6.2.1 Motivation

The experiments in Chapter 5 consider the DSP blocks as stand-alone elements; random inputs are used and all bits have approximately the same average activity. When a DSP block exists in a larger circuit, however, it may be located downstream from other logic, so each bit in the input operands may have a different activity. This section demonstrates the functionality of our CAD tool flow using two benchmark circuits from [52] and compares our results with a more accurate method (using the same inputs) to see how much accuracy is sacrificed to obtain fast estimates.
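To make the distinction drawn in Section 6.2.1 concrete, the following minimal sketch computes per-bit and average activities for two small vector sets. It assumes that the activity of a bit is the fraction of consecutive vector pairs in which that bit toggles; the function names `bit_activities` and `average_activity` and both example vector sets are hypothetical and are not part of the tool flow.

```python
from typing import List


def bit_activities(vectors: List[int], width: int) -> List[float]:
    """Return, for each bit (index 0 = least significant bit), the fraction
    of consecutive vector pairs in which that bit toggles (assumed definition)."""
    if len(vectors) < 2:
        raise ValueError("need at least two vectors to observe a transition")
    toggles = [0] * width
    for prev, curr in zip(vectors, vectors[1:]):
        changed = prev ^ curr  # bits that differ between consecutive vectors
        for bit in range(width):
            toggles[bit] += (changed >> bit) & 1
    cycles = len(vectors) - 1
    return [count / cycles for count in toggles]


def average_activity(vectors: List[int], width: int) -> float:
    """Average of the per-bit activities; this is the single number that
    indexes the look-up table in the Lookup Technique."""
    return sum(bit_activities(vectors, width)) / width


# Hypothetical 2-bit example: bit 0 toggles every cycle, bit 1 never toggles.
uneven = [0b00, 0b01, 0b00, 0b01, 0b00]
# Hypothetical 2-bit example: both bits toggle in half of the cycles.
even = [0b00, 0b11, 0b11, 0b00, 0b00]

print(bit_activities(uneven, 2), average_activity(uneven, 2))
print(bit_activities(even, 2), average_activity(even, 2))
```

Both example sets have the same average activity even though their per-bit activities differ, which is exactly the situation in which a table indexed only by average activity cannot distinguish between them.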
6.2.2 Terminology

As this section discusses the simulation of multiple circuits, it is important to clarify the terminology we will use. The test circuits (the FIR filter and the differential equation solver) will be referred to as the parent circuit or circuit. The circuit will contain multiple instances of a DSP block. The DSP block is described by the DSP Block Verilog Description in Figure 6.1. The instances will be referred to as DSP block instances. Figure 6.2 shows the circuit (in white) with instances (in grey) of a DSP block, which is shown to the right.

Figure 6.2: Terminology

6.2.3 Methodology

Demonstration of Power Estimation Tool Flow

To demonstrate the functionality of our flow, we ran it on two benchmark circuits from [52]. The first circuit is fir_3_8_8, the FIR filter shown in Figure 6.3, which uses embedded multipliers; it consists of 272 LUTs, 148 flip-flops, and 4 multipliers. The second circuit is diffeq_paj_convert, a differential equation solver, shown in Figure 6.4; it consists of 850 LUTs, 193 flip-flops, and 3 DSP blocks. The differential equation solver uses a DSP block configuration similar to one found in Altera Stratix devices. This block is shown in Figure 6.5; it combines four 18-bit multipliers with a dedicated adder to make a 36-bit multiplier.

Figure 6.3: fir_3_8_8: FIR Filter Circuit

For both example DSP blocks, a look-up table of activity-power pairs had been created offline, using Lookup Technique characterization as described in Chapter 5. For each circuit, we performed the characterization part of the flow only once and ran an iteration of the rest of the flow for each input vector file. For each activity in the set {0.25, 0.50, 0.75}, we created five input files of 5000 vectors having that average activity, for a total of 15 input files.
Note that the width of these vectors is equal to the number of inputs of the parent circuit. The purpose of providing vectors instead of activity and static probability values for each bit is that ACE-2.0 performs simulation for sequential feedback loops. If only activity and static probability values are given to ACE-2.0, then it randomly generates vectors for simulation. However, since we wanted to compare our results against a PrimePower simulation, we needed to provide the same set of vector files to both estimation tools.

Figure 6.4: Differential Equation Solver Circuit

Figure 6.5: 18-bit x 18-bit multipliers combined for 36-bit x 36-bit multiplier

Comparison against PrimePower

For these experiments, we compared the results from our Power Estimation Tool Flow against PrimePower. The pseudocode for the methodology is given in Figure 6.6 and described below. The same input vectors were used for both the Power Estimation Tool Flow and the simulation approaches. We assumed that the primary inputs to the parent circuit were free of glitching.

To determine the power dissipated by each DSP block instance using simulation, we used the flow shown in Figure 6.7. To determine the power of each DSP block instance, we had to determine the input waveforms to each DSP block instance separately. It is incorrect to assume that the DSP block instances will have identical input waveforms, because the logic upstream of (leading up to) each DSP block instance in the parent circuit may not be identical.
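The input vector files used in this methodology (five files of 5000 vectors for each average activity in {0.25, 0.50, 0.75}) could be produced along the following lines. This is only a sketch under an assumption: each bit is toggled independently with probability equal to the target activity. The function names, the seeding scheme, and the 16-bit width in the usage lines are illustrative and do not describe the scripts actually used.

```python
import random
from typing import Dict, List, Tuple


def make_vector_set(num_vectors: int, width: int, activity: float, seed: int) -> List[int]:
    """Generate vectors in which every bit toggles between consecutive
    vectors with probability `activity` (assumed generation scheme)."""
    rng = random.Random(seed)
    current = rng.getrandbits(width)  # random starting vector
    vectors = [current]
    for _ in range(num_vectors - 1):
        toggle_mask = 0
        for bit in range(width):
            if rng.random() < activity:
                toggle_mask |= 1 << bit
        current ^= toggle_mask  # flip the selected bits
        vectors.append(current)
    return vectors


def measured_activity(vectors: List[int], width: int) -> float:
    """Average toggles per bit per transition, used to check a generated set."""
    toggles = sum(bin(a ^ b).count("1") for a, b in zip(vectors, vectors[1:]))
    return toggles / (width * (len(vectors) - 1))


def make_all_sets(width: int) -> Dict[Tuple[float, int], List[int]]:
    """Build the 15 vector sets: 3 target activities x 5 trials each."""
    sets: Dict[Tuple[float, int], List[int]] = {}
    for avg_act in (0.25, 0.50, 0.75):
        for trial in range(1, 6):
            # A distinct seed per set keeps the sets reproducible but different.
            sets[(avg_act, trial)] = make_vector_set(5000, width, avg_act, seed=len(sets))
    return sets


sets = make_all_sets(width=16)  # 16 primary inputs is an arbitrary example width
for (avg_act, trial), vecs in sorted(sets.items()):
    print(avg_act, trial, round(measured_activity(vecs, 16), 3))
```

Because the toggling is random, the measured average activity of each set is only close to the target value, not exactly equal to it; in practice the measured value is what would be reported for the set.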
Synthesize Verilog description of DSP block to gates in Design Compiler
    (already done for Power Estimation Tool Flow);
Create simulation model of circuit in Quartus II;
For avg_act in (0.25, 0.50, 0.75) {
    For trial in (1 to 5) {
        Create input vector set V of 5000 vectors with activity avg_act
            and width equal to number of primary inputs in circuit;
        Apply input vector set to circuit sim model using testbench_cct;
        Use ModelSim to simulate and generate VCD;
        For each DSP block instance i in circuit {
            Parse VCD for input waveforms to this DSP instance;
            Apply waveforms at DSP instance inputs using testbench_dsp_i;
            Use Verilog-XL to simulate;
            Use PrimePower to calculate power of DSP instance i
                (when circuit is stimulated by input vector set V);
        }
    }
}

Figure 6.6: Simulation Flow Pseudocode

To obtain the input waveforms to each DSP block instance, we used Quartus II to generate a simulation model of the parent circuit. Since we used Quartus II to technology map the circuits in the Power Estimation Tool Flow, the mapped implementations for both methods match. We used ModelSim to simulate the circuit and generate a Value Change Dump (VCD) of the simulation, which is an ASCII file that describes the waveforms for the circuit's internal nodes. We then used a VCD parser to extract the waveforms corresponding to the inputs of each DSP block instance. To obtain the simulation power estimates for each vector set, we used the input waveforms for each DSP block instance to simulate the
gate-level implementation of the DSP block using Verilog-XL and PrimePower.

Figure 6.7: Flow for Determining Simulation Power Estimate of Each DSP Block Instance

6.2.4 Results

Demonstration of Power Estimation Tool Flow

The results of the power analysis are shown in Table 6.2. In both circuits, the DSP blocks dominate the power dissipation. This may be surprising because routing power generally dominates the power dissipation in an FPGA. However, the circuits are DSP kernels and not complete systems. Thus, they contain only a small amount of non-DSP logic, and substantial routing is internal to the DSP blocks.

Comparison against PrimePower

Tables 6.3 and B.1 to B.3 show the detailed results for the FIR filter circuit. Tables 6.4 and B.4 to B.6 show the detailed results for the differential equation solver circuit. The average difference between the Power Estimation Tool Flow and PrimePower results for the FIR circuit was 20.4%. In the differential equation circuit, 2 of the 3 DSP blocks had an average difference of 22-26%; however, the third had glitching on one input bus and was off by 77% on average. When creating the look-up table, the blocks had been characterized only for activities 0.1 to 0.9, as it was not clear how to properly imitate glitching. Although we use linear interpolation to determine the power corresponding to average input activities that are not in the Lookup Technique table, the power relationship is not necessarily linear with respect to DSP input activities. Consequently, it is not surprising that estimates for DSP block instances having an average input activity greater than 1 are not well represented by a linear extrapolation from the points for activities 0.8 and 0.9.
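The look-up step with linear interpolation can be sketched as follows. This is a minimal illustration and not the PVPR implementation: the function name `lookup_energy`, the table values (in arbitrary energy units), and the choice to extend the nearest table segment for activities outside the characterized range are all assumptions made for the example.

```python
from bisect import bisect_left
from typing import List, Tuple


def lookup_energy(table: List[Tuple[float, float]], activity: float) -> float:
    """Estimate energy for `activity` from (activity, energy) pairs sorted by
    activity. Inside the table range this is linear interpolation; outside it,
    the line through the two nearest points is extended (assumed behaviour)."""
    if len(table) < 2:
        raise ValueError("need at least two activity-energy pairs")
    activities = [a for a, _ in table]
    # Index of the upper point of the segment, clamped to a valid segment.
    hi = bisect_left(activities, activity)
    hi = min(max(hi, 1), len(table) - 1)
    (a0, e0), (a1, e1) = table[hi - 1], table[hi]
    return e0 + (e1 - e0) * (activity - a0) / (a1 - a0)


# Hypothetical characterization data, NOT values from the thesis.
example_table = [(0.1, 1.0), (0.5, 3.0), (0.9, 4.0)]

print(lookup_energy(example_table, 0.3))  # interpolated between 0.1 and 0.5
print(lookup_energy(example_table, 1.3))  # extended beyond the last table entry
```

For an activity above the last table entry, the sketch simply continues the final segment, so any curvature in the true activity-energy relationship, such as that caused by glitching, is not captured.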
The error for the glitching instance indicates the importance of a look-up table whose characterization data accounts for glitching. Appendix A describes preliminary unsuccessful attempts at including glitching during PrimePower characterizations.

Table 6.2: VPR Power Analysis Results

Power          fir_3_8_8          diffeq_paj_convert
Routing        5.46 mW, 18.4%     8.67 mW, 10.0%
Logic Blocks   4.04 mW, 13.6%     4.68 mW, 5.6%
Clock          1.61 mW, 5.4%      2.57 mW, 2.9%
DSP            18.59 mW, 62.6%    70.82 mW, 81.5%
Total          29.71 mW           86.74 mW

Table 6.3: Overall Percentage Error for FIR Filter Results

                                          Arithmetic Mean
average percentage error                  20.4%
average for just multiplier instance #0   21.6%
average for just multiplier instance #1   19.7%
average for just multiplier instance #2   19.6%
average for just multiplier instance #3   20.5%

Table 6.4: Overall Percentage Error for Differential Equation Solver Results

                                          Arithmetic Mean
average percentage error                  41.7%
average for just DSP block instance #0    76.8%
average for just DSP block instance #1    22.5%
average for just DSP block instance #2    25.8%

6.3 Chapter Summary

In this chapter, we have demonstrated the functionality of our Power Estimation Tool Flow for estimating the power of FPGA circuits containing embedded DSP blocks, which addresses the requirements laid out in Chapter 3. We have compared our results against simulation using PrimePower. Our fast estimates are within 19% to 26% of the simulated results, except in the case where there is significant glitching at the inputs of the DSP blocks; the glitching case resulted in 77% error. We believe that adding characterization of glitches on the inputs of the DSP blocks is necessary to improve the accuracy of our method; however, it is not immediately clear how this can be done. This will be discussed as future work in Section 7.2.2.
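The overall figures in Tables 6.3 and 6.4 are consistent with taking the arithmetic mean of the per-instance averages, as the short check below shows. The per-instance values are copied from the two tables; the definition of percentage error in `percent_error` (absolute difference relative to the reference value) is an assumption, and the helper and variable names are illustrative.

```python
from statistics import mean


def percent_error(estimate: float, reference: float) -> float:
    """Absolute difference relative to the reference value, in percent
    (assumed definition of the error metric)."""
    return abs(estimate - reference) / abs(reference) * 100.0


# Per-instance average errors (%) as listed in Table 6.3 (FIR filter).
fir_instance_errors = [21.6, 19.7, 19.6, 20.5]
# Per-instance average errors (%) as listed in Table 6.4 (differential equation solver).
diffeq_instance_errors = [76.8, 22.5, 25.8]

fir_overall = mean(fir_instance_errors)
diffeq_overall = mean(diffeq_instance_errors)

print(f"FIR overall: {fir_overall:.2f}%")
print(f"Differential equation solver overall: {diffeq_overall:.2f}%")
```

The FIR result differs from the tabulated 20.4% only in the last digit, which is what one would expect if the per-instance values in the table are themselves rounded.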
Chapter 7

Conclusion

7.1 Summary and Contributions

In this thesis, we have described an experimental CAD flow that can be used to estimate the power dissipation of FPGA circuits containing embedded DSP blocks. We identified two technical challenges in creating such a flow: (1) estimating the activity of all nodes in a circuit containing one or more DSP blocks, and (2) estimating the power dissipated within a DSP block quickly and accurately.

The first challenge arises because standard activity estimation techniques cannot propagate activities through these DSP blocks. We address this by replacing each DSP block with a gate-level representation of the block and using the standard activity estimation techniques on the resulting circuit.

The second challenge arises because it is not possible to pre-characterize the DSP block for all possible input patterns and activities. We have shown that reasonable estimates can be obtained by creating a look-up table of power values. In the power model, the look-up table is indexed using the average activity of the block input nodes.

We then combined our findings to create a Power Estimation Tool Flow based on the PVPR framework. The impact of our enhanced tool flow is threefold; the existence of a freely available, architecturally flexible FPGA CAD tool that includes power modeling for embedded DSP blocks enables:

1. the investigation of power-aware architectures containing embedded DSP blocks
2. the investigation of power-aware CAD algorithms for FPGA circuits containing embedded DSP blocks
3. the incorporation of power tradeoffs in the design of user circuits

This work is also one of a collection of projects at the University of British Columbia System-on-Chip Lab that each take a step towards the larger goal of enabling power estimation for platform-style FPGA architectures that contain embedded DSP blocks, embedded memories, embedded processors, and multiple clock domains. A poster of our contributions will appear at the 2006 IEEE International Conference on Field Programmable Technology in Bangkok, Thailand.

7.2 Future Work

Given our tool flow, there are three enhancements that would be necessary before performing power-aware FPGA architecture studies that include embedded DSP blocks: (1) a suite of integer benchmarks representative of DSP and arithmetic-intensive user designs, (2) the incorporation of glitch characterization into the look-up data, and (3) board-level verification.

7.2.1 Benchmarks

A suite of integer benchmarks representative of DSP and arithmetic-intensive user designs is essential to be able to draw generalizable conclusions from an architectural study targeting embedded DSP blocks. The standard benchmark suite for FPGA studies is the collection of Microelectronics Center of North Carolina (MCNC) circuits; however, these circuits are not very representative of DSP applications. Freely available circuits were obtained from [52] and [53]; however, most were not suitable for our experiments for the reasons listed below:

• Some circuits used floating point arithmetic. DSP blocks in commercial FPGAs target integer arithmetic, so our flow does the same.
• Some circuits used very simple and small DSP blocks, which would not exercise many of the features in DSP blocks embedded in commercial FPGAs.
• Some of the circuits were automatically generated using high-level synthesis. The RTL signal and module names were automatically generated alphanumeric character sequences. This hid the flow of control and data in the circuits and prevented analysis and debugging of the circuits.

A very useful future research project would be the creation of circuits for integer DSP and arithmetically intensive applications at the register transfer level.

7.2.2 Glitch Characterization

The comparison to gate-level simulation in Section 6.2 revealed that glitching may take place on the inputs to the DSP blocks and that it is important to include characterization data in the power estimation look-up table for cases where the activity at the inputs to the DSP blocks is greater than 1. When performing characterization, it was not clear how to properly imitate glitching in our testbenches. Appendix A describes our attempts. Reference [30] describes word-level and bit-level glitch generation and propagation models for characterization of datapath circuits. It would be interesting to incorporate this into our flow and evaluate its effectiveness.

7.2.3 Board-Level Verification

When performing power estimation at higher levels of abstraction, as we do with the Lookup Technique, we are trading off accuracy for fast estimation. Consequently, at this level, fidelity is what we seek to provide (i.e., relative accuracy instead of absolute accuracy). In order to verify that our tool flow will provide consistent estimates that yield usable trends, we must compare our results to physical measurements on an FPGA board.

The use of board-level measurements to verify our power model is not trivial. It is not simply a case of downloading our test circuits to boards with FPGAs containing the appropriate DSP blocks and comparing the measured power to our estimates. First, it is not feasible to create custom FPGAs for the set of DSP block architectures under evaluation; layout and fabrication costs are excessive.
Second, the power of the DSP blocks alone cannot be measured; typically, we can only measure the total active and quiescent power of the system. Consequently, we cannot simply compare our estimates against the results for a set of commercial FPGA boards. For example, Altera Cyclone II and Xilinx Virtex-II devices contain embedded 18x18-bit multipliers, and Altera Stratix and Xilinx Virtex-4 devices contain DSP blocks. One possibility is to compare our power estimates for a set of benchmark circuits against the power estimates for four boards containing each of these devices; however, their logic and routing architectures differ, making it impossible to distinguish between deficiencies in the power models for the DSP blocks, the logic blocks, and the interconnect.

A starting point for board-level verification could be to compare trends in measured values on a particular FPGA board against trends in the power estimates using the corresponding architecture description file, for a set of benchmark circuits. The desired result would be to see that the measurements and estimates both rank the power dissipation of the benchmark circuits in the same order. A prerequisite for this verification is a representative set of benchmark circuits, described in Section 7.2.1.

Bibliography

[1] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs. Springer, March 1999.
[2] S. J. E. Wilton, S. Chin, and K. K. W. Poon, Field-Programmable Gate Array Architectures. Taylor and Francis, 2006.
[3] A. Ye and J. Rose, "Using Multi-Bit Logic Blocks and Automated Packing to Improve Field-Programmable Gate Array Density for Implementing Datapath Circuits," in Proceedings of the 2004 IEEE International Conference on Field-Programmable Technology, 2004, pp. 129-136.
[4] C. Piguet, Ed., Low Power CMOS Circuits: Technology, Logic Design and CAD Tools.
New York: Taylor and Francis, 2006.
[5] K. K. Poon, "Power Estimation for Field Programmable Gate Arrays," Master's thesis, University of British Columbia, August 2002.
[6] Actel, Actel IGLOO Flash Freeze Technology and Low Power Modes Application Note, 2006.
[7] C. T. Chow, L. S. M. Tsui, P. H. W. Leong, W. Luk, and S. J. E. Wilton, "Dynamic Voltage Scaling for Commercial FPGAs," in Proceedings of the 2005 IEEE International Conference on Field-Programmable Technology, 2005, pp. 173-180.
[8] F. Li, Y. Lin, and L. He, "FPGA Power Reduction Using Configurable Dual-Vdd," in Proceedings of the 41st Design Automation Conference, 2004, pp. 735-740.
[9] V. George, H. Zhang, and J. Rabaey, "The Design of a Low Energy FPGA," in Proceedings of the 1999 International Symposium on Low Power Electronics and Design, 1999, pp. 188-193.
[10] J. H. Anderson and F. N. Najm, "Power-aware Technology Mapping for LUT-based FPGAs," in Proceedings of the 2002 IEEE International Conference on Field-Programmable Technology, 2002, pp. 211-218.
[11] J. H. Anderson and F. N. Najm, "Active Leakage Power Optimization for FPGAs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 3, pp. 423-437, 2006.
[12] R. Tessier, V. Betz, D. Neto, and T. Gopalsamy, "Power-aware RAM Mapping for FPGA Embedded Memory Blocks," in FPGA '06: Proceedings of the International Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM Press, 2006, pp. 189-198.
[13] I. Kuon and J. Rose, "Measuring the Gap Between FPGAs and ASICs," in FPGA '06: Proceedings of the International Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM Press, 2006, pp. 21-30.
[14] Altera, Cyclone II Device Handbook, 2005.
[15] Altera, Stratix II Device Handbook, 2005.
[16] Xilinx, Virtex II Platform FPGAs: Complete Datasheet, 2005.
[17] Xilinx, UG073: XtremeDSP for Virtex-4 User Guide, 1st ed., 2005.
[18] S. Hauck, M. M. Hosler, and T. W. Fry, "High-performance Carry Chains for FPGAs," in FPGA '98: Proceedings of the 1998 ACM/SIGDA Sixth International Symposium on Field-Programmable Gate Arrays. New York, NY, USA: ACM Press, 1998, pp. 223-233.
[19] S. D. Haynes, A. B. Ferrari, and P. Y. K. Cheung, "Flexible Reconfigurable Multiplier Blocks Suitable for Enhancing the Architecture of FPGAs," in Proceedings of the IEEE 1999 Custom Integrated Circuits Conference, 1999, pp. 191-194.
[20] M. J. Beauchamp, S. Hauck, K. D. Underwood, and S. K. Hemmert, "Embedded Floating-Point Units in FPGAs," in Proceedings of the 2006 ACM/SIGDA Fourteenth International Symposium on Field Programmable Gate Arrays, 2006, pp. 12-20.
[21] V. Betz, VPR and T-VPack User's Manual, ver 4.30, March 2000.
[22] K. K. W. Poon, S. J. E. Wilton, and A. Yan, "A Detailed Power Model for Field-Programmable Gate Arrays," ACM Transactions on Design Automation of Electronic Systems (TODAES), 2004.
[23] F. N. Najm, "A Survey of Power Estimation Techniques in VLSI Circuits," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 2, no. 4, pp. 446-455, 1994.
[24] G. Lemieux, E. Lee, M. Tom, and A. Yu, "Directional and Single-driver Wires in FPGA Interconnect," in Proceedings of the 2004 IEEE International Conference on Field-Programmable Technology, 2004, pp. 41-48.
[25] A. Yang, "Design Techniques to Reduce Power Consumption," XCell Journal, 2005.
[26] Xilinx, Virtex-5 LX Platform Overview, 2006.
[27] Xilinx, UG070: Virtex-4 User Guide, 2006.
[28] A. Ye and J.
Rose, "Using Bus-based Connections to Improve Field-Programmable Gate Array Density for Implementing Datapath Circuits," in Proceedings of the 2005 ACM/SIGDA Thirteenth International Symposium on Field Programmable Gate Arrays, 2005, pp. 3-13.
[29] K. Leijten-Nowak and J. L. van Meerbergen, "An FPGA Architecture with Enhanced Datapath Functionality," in FPGA '03: Proceedings of the 2003 ACM/SIGDA Eleventh International Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM Press, 2003, pp. 195-204.
[30] A. Raghunathan, N. K. Jha, and S. Dey, High-level Power Analysis and Optimization. Springer, November 1997.
[31] S. Brown and Z. Vranesic, Fundamentals of Digital Logic with Verilog Design. McGraw-Hill Science/Engineering/Math, August 2002.
[32] G. K. Yeap, Practical Low Power Digital VLSI Design. Springer, August 1997.
[33] T. Quarles, D. Pederson, R. Newton, A. Sangiovanni-Vincentelli, and C. Wayne, "Interactive SPICE User Guide," website maintained by Jan Rabaey. [Online]. Available: http://bwrc.eecs.berkeley.edu/Classes/IcBook/SPICE/
[34] P. E. Landman and J. M. Rabaey, "Architectural Power Analysis: The Dual Bit Type Method," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 3, no. 2, pp. 173-187, 1995.
[35] D. Marculescu, R. Marculescu, and M. Pedram, "Information Theoretic Measures for Power Analysis," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 15, no. 6, pp. 599-610, 1996.
[36] Xilinx, "Xilinx Web Power Tools," April 2006. [Online]. Available: http://www.xilinx.com/products/design_resources/power_central/
[37] Altera, PowerPlay Early Power Estimator User Guide for Stratix, Stratix GX, and Cyclone FPGAs, October 2005.
[38] Xilinx, "Virtex-4 XPower - Early Power Estimator v8.1," April 2006. [Online].
Available: http:/ /www.xilinx.com/ise/power_tools/l icense_virtex4.htm Al tera , PowerPlay Early Power Estimator User Guide for Stratix II, Stratix II GX and Hardcopy II, December 2005. X i l i n x , Xilinx Development System Reference Guide 8.H, December 2005. Al tera , PowerPlay Power Analyzer, October 2005. E . M . Sentovich, K . J . Singh, L . Lavagno, C . M o o n , R . Murga i , A . Saldanha, H . Savoj, P. R . Stephan, R . K . Brayton, and S. A . Vincentel l i , "SIS: A System for Sequential Ci rcui t Synthesis," U C Berkeley, Tech. Rep. , 1992. [Online]. Avai lable: http: //citeseer. ist. psu. edu/sentovich92sis. h tml Bibliography 94 [43] J . Cong arid Y . Ding , "F lowMap: an Opt ima l Technology Mapping Algor i thm for Delay Optimizat ion in Lookup-table Based F P G A Designs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 13, no. 1, pp. 1-12, 1994. [44] G . G . F . Lemieux, S. D . Brown, and D . Vranesic, "On Two-step Rout ing for F P G A S , " in ISPD '97: Proceedings of the 1997 International Symposium on Physical Design. New York , N Y , U S A : A C M Press, 1997, pp. 60-66. [45] Y . - W . Chang, D . F . Wong, and C . K . Wong, "Universal Switch Modules for F P G A Design," ACM Transansactions on Design Automation of Electronic Systems, vol. 1, no. 1, pp. 80-101, January 1996. [46] S. J . E . Wi l t on , "Architectures and Algori thms for Field-Programmable Gate Arrays wi th Embedded Memory," P h . D . dissertation, University of Toronto, 1997. [47] I. M . Masud and S. J . E . W i l t o n , " A New Switch Block for Segmented F P G A s , " in FPL '99: Proceedings of the 9th International Workshop on Field-Programmable Logic and Applications. London, U K : Springer-Verlag, 1999, pp. 274-281. [48] J . Lamoureux and S. J . E . W i l t o n , "Ac t iv i ty Es t imat ion For Field-Programmable Gate Arrays," in International Conference on Field-Programmable Logic and Appli-cations, August 2006. [49] A . Chatterjee and R . 
Appendix A

Glitching Characterization Attempt

This appendix describes several unsuccessful attempts to account for glitches in the power characterization scheme described in Chapter 5.

Figure A.1 illustrates the flow we used for characterizing the power of the DSP blocks. A Verilog description of the DSP block was synthesized to gates using Synopsys Design Compiler. A Verilog testbench was used to simulate a set of input vectors applied to the gate-level description of the DSP block in Verilog-XL. The simulation data was then fed to the Synopsys PrimePower simulator to obtain characterization information.

For input activities less than one, the testbench applied exactly one vector to the inputs of the DSP block each clock cycle. The activity at each input could be controlled by either keeping or changing the value of the bit from one vector to the next.
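As an illustration of controlling input activity by keeping or changing each bit, the following Python sketch generates a vector sequence in which every bit toggles between consecutive vectors with a chosen probability. The thesis does not say how its vector files were produced, so the random per-bit toggling, the function names, and the seed are assumptions made only for this example.

```python
import random


def generate_vectors(num_bits, num_cycles, activity, seed=0):
    """Return num_cycles + 1 input vectors (lists of 0/1 bits).

    Assumption for illustration: each bit independently toggles between
    consecutive vectors with probability `activity` (0.0 to 1.0).
    """
    if not 0.0 <= activity <= 1.0:
        raise ValueError("with one vector per cycle, activity must be in [0, 1]")
    rng = random.Random(seed)
    current = [rng.randint(0, 1) for _ in range(num_bits)]
    vectors = [current]
    for _ in range(num_cycles):
        # Keep the bit, or change it with probability `activity`.
        current = [bit ^ 1 if rng.random() < activity else bit for bit in current]
        vectors.append(current)
    return vectors


def measured_activity(vectors):
    """Average number of transitions per bit per cycle in a vector sequence."""
    toggles = sum(
        a != b
        for prev, nxt in zip(vectors, vectors[1:])
        for a, b in zip(prev, nxt)
    )
    return toggles / ((len(vectors) - 1) * len(vectors[0]))


if __name__ == "__main__":
    vecs = generate_vectors(num_bits=16, num_cycles=2000, activity=0.25)
    print(f"target 0.25, measured {measured_activity(vecs):.3f}")
```

With one vector per clock cycle, the measured activity of such a sequence cannot exceed 1.0, which is why higher activities need more than one vector per cycle.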
[Figure A.1: Power Characterization Flow. Inputs: technology library and DSP block Verilog description. Steps: synthesize to gates with DC; simulate with Verilog-XL using the Verilog testbench and input vectors; analyze power with PrimePower. Output: power estimates.]

To attempt to imitate glitching, we generated testbenches where more than one vector was applied during each clock cycle. For example, to attempt to characterize the power for input activities of 1.5, a vector file with activity 0.5 (assuming 1 vector per clock cycle) was read in and 3 vectors were applied each clock cycle.

We observed that applying the vectors for a particular clock cycle at equally spaced intervals in the clock cycle, as shown in the testbench pseudocode in Figure A.2, does not imitate glitching properly. $T_fraction is the length of the equally spaced intervals and $integer_multiple is the number of vectors to apply during each clock cycle.

    reg [0:$inbits_ind] test_vector_input[0:N-1];
    reg [0:$inbits_ind] inputs;

    task test_top;
      integer i, j;
      begin
        @(posedge clock);
        @(negedge clock);
        reset = 0;
        @(posedge clock);
        for (i=0; (i+$integer_multiple) <= N; i=i+$integer_multiple)
        begin
          inputs = test_vector_input[i];
          #$T_fraction;
          inputs = test_vector_input[i+1];
          #$T_fraction;
          inputs = test_vector_input[i+($integer_multiple-1)];
          #$T_fraction;
          @(posedge clock);
        end
      end
    endtask

Figure A.2: Testbench Pseudocode

For example, from Figure 5.3 we would expect that using a vector file with activity 0.5 (assuming 1 vector per clock cycle) and applying two vectors per cycle to achieve an effective input activity of 1.0 would give PrimePower estimates greater than the results for 0.9 input activity.
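The arithmetic behind the effective input activity and the equally spaced application times can be sketched in a few lines of Python. The function names and the time units are hypothetical; only the relationships (effective activity equals file activity times vectors per cycle, and an interval equal to the period divided by the number of vectors) come from the description above.

```python
def effective_activity(file_activity, vectors_per_cycle):
    """Transitions per clock cycle when several vectors are applied per cycle.

    `file_activity` is the activity of the vector file assuming one vector
    per clock cycle.
    """
    return file_activity * vectors_per_cycle


def apply_times(period, vectors_per_cycle):
    """Offsets from the rising clock edge at which each vector is applied,
    using equally spaced intervals of period / vectors_per_cycle."""
    t_fraction = period / vectors_per_cycle
    return [k * t_fraction for k in range(vectors_per_cycle)]


if __name__ == "__main__":
    # Activity 0.5 file, three vectors per cycle: effective activity 1.5.
    print(effective_activity(0.5, 3))
    # Two vectors per cycle with an illustrative period of 10 time units.
    print(apply_times(10.0, 2))
```

For two vectors per cycle the second vector lands exactly half a period after the rising edge, so the two input changes fall in different halves of the clock cycle.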
For a DSP block where only input transitions during the high part of the clock cycle have an effect, the PrimePower estimates were approximately equal to the results for 0.5 input activity instead, because only half the transitions had any effect.

To avoid this problem, a second method was attempted where $T_fraction was set to half the period divided by $integer_multiple, so that all the vectors would be applied during the high part of the clock cycle. This led to overestimates of the power because, in reality, glitching at the inputs to the DSP block can take place in any part of the clock cycle.

To properly imitate glitching for characterization, a more sophisticated testbench generator would be required that incorporates some sort of statistical or probabilistic model that dictates when vectors should be applied to the DSP block inputs.

Appendix B

Detailed Results from Comparison

Table B.1: FIR Filter Results for 0.25 Average Input Activity

Vector Set | Multiplier | HHVPR Energy | PrimePower Energy | % Error
vec_025_0_a | mult_rtl_0 | 6.93E-11 | 6.31E-11 | 9.8%
vec_025_0_a | mult_rtl_1 | 6.92E-11 | 6.01E-11 | 15.1%
vec_025_0_a | mult_rtl_2 | 6.92E-11 | 5.87E-11 | 17.8%
vec_025_0_a | mult_rtl_3 | 6.93E-11 | 6.07E-11 | 14.2%
vec_025_0_b | mult_rtl_0 | 6.95E-11 | 6.14E-11 | 13.1%
vec_025_0_b | mult_rtl_1 | 6.92E-11 | 6.05E-11 | 14.3%
vec_025_0_b | mult_rtl_2 | 6.93E-11 | 5.82E-11 | 19.1%
vec_025_0_b | mult_rtl_3 | 6.97E-11 | 6.08E-11 | 14.5%
vec_025_0_c | mult_rtl_0 | 6.95E-11 | 6.32E-11 | 10.0%
vec_025_0_c | mult_rtl_1 | 6.93E-11 | 5.92E-11 | 17.1%
vec_025_0_c | mult_rtl_2 | 6.95E-11 | 5.95E-11 | 16.8%
vec_025_0_c | mult_rtl_3 | 6.95E-11 | 6.12E-11 | 13.5%
vec_025_0_d | mult_rtl_0 | 6.92E-11 | 6.24E-11 | 10.9%
vec_025_0_d | mult_rtl_1 | 6.91E-11 | 5.90E-11 | 17.2%
vec_025_0_d | mult_rtl_2 | 6.96E-11 | 5.93E-11 | 17.3%
vec_025_0_d | mult_rtl_3 | 6.96E-11 | 6.02E-11 | 15.5%
vec_025_0_e | mult_rtl_0 | 6.95E-11 | 6.25E-11 | 11.2%
vec_025_0_e | mult_rtl_1 | 6.95E-11 | 6.06E-11 | 14.7%
vec_025_0_e | mult_rtl_2 | 6.95E-11 | 5.92E-11 | 17.3%
vec_025_0_e | mult_rtl_3 | 6.97E-11 | 6.17E-11 | 12.9%

Average percentage error: 14.6%
Average for just multiplier instance #0: 11.0%
Average for just multiplier instance #1: 15.7%
Average for just multiplier instance #2: 17.7%
Average for just multiplier instance #3: 14.1%

Table B.2: FIR Filter Results for 0.50 Average Input Activity

Vector Set | Multiplier | HHVPR Energy | PrimePower Energy | % Error
vec_050_0_a | mult_rtl_0 | 8.92E-11 | 1.09E-10 | 18.2%
vec_050_0_a | mult_rtl_1 | 8.90E-11 | 1.04E-10 | 14.4%
vec_050_0_a | mult_rtl_2 | 8.95E-11 | 1.04E-10 | 14.0%
vec_050_0_a | mult_rtl_3 | 8.93E-11 | 1.06E-10 | 15.5%
vec_050_0_b | mult_rtl_0 | 8.92E-11 | 1.10E-10 | 19.2%
vec_050_0_b | mult_rtl_1 | 8.92E-11 | 1.04E-10 | 14.1%
vec_050_0_b | mult_rtl_2 | 8.94E-11 | 1.03E-10 | 13.4%
vec_050_0_b | mult_rtl_3 | 8.92E-11 | 1.07E-10 | 16.2%
vec_050_0_c | mult_rtl_0 | 8.93E-11 | 1.11E-10 | 19.4%
vec_050_0_c | mult_rtl_1 | 8.92E-11 | 1.04E-10 | 14.2%
vec_050_0_c | mult_rtl_2 | 8.95E-11 | 1.04E-10 | 14.2%
vec_050_0_c | mult_rtl_3 | 8.95E-11 | 1.06E-10 | 15.4%
vec_050_0_d | mult_rtl_0 | 8.91E-11 | 1.07E-10 | 16.8%
vec_050_0_d | mult_rtl_1 | 8.94E-11 | 1.05E-10 | 15.0%
vec_050_0_d | mult_rtl_2 | 8.90E-11 | 1.01E-10 | 12.3%
vec_050_0_d | mult_rtl_3 | 8.94E-11 | 1.07E-10 | 16.2%
vec_050_0_e | mult_rtl_0 | 8.93E-11 | 1.12E-10 | 20.6%
vec_050_0_e | mult_rtl_1 | 8.92E-11 | 1.04E-10 | 14.5%
vec_050_0_e | mult_rtl_2 | 8.94E-11 | 1.02E-10 | 12.2%
vec_050_0_e | mult_rtl_3 | 8.92E-11 | 1.06E-10 | 15.7%

Average percentage error: 15.6%
Average for just multiplier instance #0: 18.9%
Average for just multiplier instance #1: 14.4%
Average for just multiplier instance #2: 13.2%
Average for just multiplier instance #3: 15.8%

Table B.3: FIR Filter Results for 0.75 Average Input Activity

Vector Set | Multiplier | HHVPR Energy | PrimePower Energy | % Error
vec_075_0_a | mult_rtl_0 | 9.97E-11 | 1.52E-10 | 34.3%
vec_075_0_a | mult_rtl_1 | 9.96E-11 | 1.42E-10 | 29.7%
vec_075_0_a | mult_rtl_2 | 9.95E-11 | 1.38E-10 | 27.9%
vec_075_0_a | mult_rtl_3 | 9.96E-11 | 1.44E-10 | 30.7%
vec_075_0_b | mult_rtl_0 | 9.96E-11 | 1.52E-10 | 34.6%
vec_075_0_b | mult_rtl_1 | 9.96E-11 | 1.41E-10 | 29.2%
vec_075_0_b | mult_rtl_2 | 9.97E-11 | 1.38E-10 | 27.7%
vec_075_0_b | mult_rtl_3 | 9.96E-11 | 1.43E-10 | 30.5%
vec_075_0_c | mult_rtl_0 | 9.96E-11 | 1.49E-10 | 33.3%
vec_075_0_c | mult_rtl_1 | 9.95E-11 | 1.41E-10 | 29.4%
vec_075_0_c | mult_rtl_2 | 9.95E-11 | 1.38E-10 | 27.7%
vec_075_0_c | mult_rtl_3 | 9.96E-11 | 1.47E-10 | 32.4%
vec_075_0_d | mult_rtl_0 | 9.96E-11 | 1.55E-10 | 35.8%
vec_075_0_d | mult_rtl_1 | 9.97E-11 | 1.39E-10 | 28.3%
vec_075_0_d | mult_rtl_2 | 9.97E-11 | 1.41E-10 | 29.4%
vec_075_0_d | mult_rtl_3 | 9.97E-11 | 1.50E-10 | 33.6%
vec_075_0_e | mult_rtl_0 | 9.96E-11 | 1.55E-10 | 35.9%
vec_075_0_e | mult_rtl_1 | 9.96E-11 | 1.39E-10 | 28.4%
vec_075_0_e | mult_rtl_2 | 9.97E-11 | 1.37E-10 | 27.4%
vec_075_0_e | mult_rtl_3 | 9.96E-11 | 1.44E-10 | 30.9%

Average percentage error: 30.9%
Average for just multiplier instance #0: 34.8%
Average for just multiplier instance #1: 29.0%
Average for just multiplier instance #2: 28.0%
Average for just multiplier instance #3: 31.6%
Table B.4: Differential Equation Solver Results for 0.25 Average Input Activity

Vector Set | DSP Block | PVPR Energy | PrimePower Energy | % Error
vec_025_0_a | dspblock0 | 9.51E-10 | 4.16E-09 | 77.1%
vec_025_0_a | dspblock1 | 1.01E-09 | 1.52E-09 | 33.3%
vec_025_0_a | dspblock2 | 1.12E-09 | 1.65E-09 | 32.2%
vec_025_0_b | dspblock0 | 9.04E-10 | 4.03E-09 | 77.6%
vec_025_0_b | dspblock1 | 1.04E-09 | 1.33E-09 | 21.8%
vec_025_0_b | dspblock2 | 1.23E-09 | 1.58E-09 | 22.2%
vec_025_0_c | dspblock0 | 8.95E-10 | 3.88E-09 | 76.9%
vec_025_0_c | dspblock1 | 1.01E-09 | 1.28E-09 | 21.2%
vec_025_0_c | dspblock2 | 1.23E-09 | 1.49E-09 | 17.4%
vec_025_0_d | dspblock0 | 1.05E-09 | 4.24E-09 | 75.4%
vec_025_0_d | dspblock1 | 1.04E-09 | 1.31E-09 | 21.0%
vec_025_0_d | dspblock2 | 1.17E-09 | 1.73E-09 | 32.1%
vec_025_0_e | dspblock0 | 8.02E-10 | 3.76E-09 | 78.7%
vec_025_0_e | dspblock1 | 1.00E-09 | 1.22E-09 | 17.7%
vec_025_0_e | dspblock2 | 1.12E-09 | 1.27E-09 | 12.0%

Average percentage error: 41.1%
Average for just DSP block instance #0: 77.1%
Average for just DSP block instance #1: 23.0%
Average for just DSP block instance #2: 23.2%

Table B.5: Differential Equation Solver Results for 0.50 Average Input Activity

Vector Set | DSP Block | PVPR Energy | PrimePower Energy | % Error
vec_050_0_a | dspblock0 | 8.88E-10 | 3.79E-09 | 76.6%
vec_050_0_a | dspblock1 | 1.06E-09 | 1.22E-09 | 12.9%
vec_050_0_a | dspblock2 | 1.26E-09 | 1.51E-09 | 16.9%
vec_050_0_b | dspblock0 | 9.24E-10 | 3.88E-09 | 76.2%
vec_050_0_b | dspblock1 | 1.06E-09 | 1.36E-09 | 21.9%
vec_050_0_b | dspblock2 | 1.29E-09 | 1.66E-09 | 22.5%
vec_050_0_c | dspblock0 | 8.78E-10 | 4.14E-09 | 78.8%
vec_050_0_c | dspblock1 | 1.05E-09 | 1.73E-09 | 39.2%
vec_050_0_c | dspblock2 | 1.20E-09 | 1.75E-09 | 31.6%
vec_050_0_d | dspblock0 | 9.59E-10 | 3.97E-09 | 75.9%
vec_050_0_d | dspblock1 | 1.04E-09 | 1.38E-09 | 24.8%
vec_050_0_d | dspblock2 | 1.28E-09 | 1.54E-09 | 16.9%
vec_050_0_e | dspblock0 | 9.19E-10 | 3.87E-09 | 76.2%
vec_050_0_e | dspblock1 | 1.06E-09 | 1.25E-09 | 15.1%
vec_050_0_e | dspblock2 | 1.31E-09 | 1.62E-09 | 19.0%

Average percentage error: 40.3%
Average for just DSP block instance #0: 76.7%
Average for just DSP block instance #1: 22.8%
Average for just DSP block instance #2: 21.4%

Table B.6: Differential Equation Solver Results for 0.75 Average Input Activity

Vector Set | DSP Block | PVPR Energy | PrimePower Energy | % Error
vec_075_0_a | dspblock0 | 8.95E-10 | 3.89E-09 | 77.0%
vec_075_0_a | dspblock1 | 1.03E-09 | 1.44E-09 | 28.4%
vec_075_0_a | dspblock2 | 1.23E-09 | 1.61E-09 | 23.5%
vec_075_0_b | dspblock0 | 8.70E-10 | 3.82E-09 | 77.2%
vec_075_0_b | dspblock1 | 1.01E-09 | 1.25E-09 | 19.3%
vec_075_0_b | dspblock2 | 1.27E-09 | 1.64E-09 | 22.7%
vec_075_0_c | dspblock0 | 8.83E-10 | 3.78E-09 | 76.6%
vec_075_0_c | dspblock1 | 1.01E-09 | 1.26E-09 | 19.7%
vec_075_0_c | dspblock2 | 1.27E-09 | 1.62E-09 | 21.6%
vec_075_0_d | dspblock0 | 9.13E-10 | 3.94E-09 | 76.8%
vec_075_0_d | dspblock1 | 1.03E-09 | 1.40E-09 | 26.2%
vec_075_0_d | dspblock2 | 1.24E-09 | 1.69E-09 | 26.5%
vec_075_0_e | dspblock0 | 9.37E-10 | 3.84E-09 | 75.6%
vec_075_0_e | dspblock1 | 1.01E-09 | 1.19E-09 | 14.9%
vec_075_0_e | dspblock2 | 1.18E-09 | 3.84E-09 | 69.3%

Average percentage error: 43.7%
Average for just DSP block instance #0: 76.7%
Average for just DSP block instance #1: 21.7%
Average for just DSP block instance #2: 32.7%
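The % Error columns in Tables B.1 to B.6 appear consistent with the magnitude of the difference between the tool estimate (HHVPR or PVPR) and the PrimePower energy, taken relative to the PrimePower energy. That relationship is inferred here from the tabulated numbers rather than quoted from the text, and the function names below are hypothetical. A minimal Python sketch that reproduces a few entries:

```python
def percent_error(estimate, reference):
    """Percentage error of `estimate` relative to `reference`.

    Assumed form: |estimate - reference| / reference * 100, with the
    PrimePower energy used as the reference.
    """
    return abs(estimate - reference) / reference * 100.0


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


if __name__ == "__main__":
    # Table B.1, vec_025_0_a, mult_rtl_0
    print(round(percent_error(6.93e-11, 6.31e-11), 1))
    # Table B.4, vec_025_0_a, dspblock0
    print(round(percent_error(9.51e-10, 4.16e-09), 1))
    # Table B.1, per-instance average for multiplier instance #0
    print(round(mean([9.8, 13.1, 10.0, 10.9, 11.2]), 1))
```

Because the tabulated energies are rounded to three significant figures, recomputed errors for other rows may differ from the printed values in the last digit.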
Item Metadata
Title | Activity-based power estimation and characterization of DSP and multiplier blocks in FPGAs |
Creator | Choy, Nathalie Chan King |
Publisher | University of British Columbia |
Date Issued | 2006 |
Description | Battery-powered applications and the scaling of process technologies and clock frequencies have made power dissipation a first class concern among FPGA vendors. One approach to reduce power dissipation in FPGAs is to embed coarse-grained fixed-function blocks that implement certain types of functions very efficiently. Commercial FPGAs contain embedded multipliers and "Digital Signal Processing (DSP) blocks" to improve the performance and area efficiency of arithmetic-intensive applications. In order to evaluate the power saved by using these blocks, a power model and tool flow are required. This thesis describes our development and evaluation of methods to estimate the activity and the power dissipation of FPGA circuits containing embedded multiplier and DSP blocks. Our goal was to find a suitable balance between estimation time, modeling effort, and accuracy. We incorporated our findings to create a power model and CAD tool flow for these circuits. Our tool flow builds upon the Poon power model, and the Versatile Place and Route (VPR) CAD tool, which are both standard academic experimental infrastructure. |
Genre | Thesis/Dissertation |
Type | Text |
Language | eng |
Date Available | 2010-01-08 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0064950 |
URI | http://hdl.handle.net/2429/17877 |
Degree | Master of Applied Science - MASc |
Program | Electrical and Computer Engineering |
Affiliation | Applied Science, Faculty of; Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2006-11 |
Campus | UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |