Activity-based Power Estimation and Characterization of DSP and Multiplier Blocks in F P G A s by Nathalie Chan King Choy B . A . S c , University of Toronto, 2004 A THESIS S U B M I T T E D IN P A R T I A L F U L F I L M E N T O F THE REQUIREMENTS FOR T H E DEGREE OF MASTER OF APPLIED SCIENCE in T h e Faculty of Graduate Studies (Electrical and Computer Engineering) T H E UNIVERSITY OF BRITISH C O L U M B I A October, 2006 © Nathalie C h a n K i n g Choy, 2006 11 Abstract Battery-powered applications and the scaling of process technologies and clock frequencies have made power dissipation a first class concern among F P G A vendors. One approach to reduce power dissipation i n F P G A s is to embed coarse-grained fixed-function blocks that implement certain types of functions very efficiently. C o m m e r c i a l F P G A s contain embedded multipliers and " D i g i t a l Signal Processing ( D S P ) blocks" to improve the performance and area efficiency of arithmetic-intensive applications. In order to evaluate the power saved by using these blocks, a power model and t o o l flow are required. T h i s thesis describes our development and evaluation of methods to estimate the act i v i t y and the power dissipation of F P G A circuits containing embedded multiplier and D S P blocks. O u r goal was to find a suitable balance between estimation time, modeling effort, and accuracy. W e incorporated our findings to create a power model and C A D t o o l flow for these circuits. O u r t o o l flow builds upon the P o o n power model, and the Versatile Place and Route ( V P R ) C A D tool, which are b o t h standard academic experimental infrastructure. iii Table of Contents Abstract ii T a b l e of C o n t e n t s iii List of Tables vi List of Figures vii Acknowledgements 1 2 ix Introduction 1 1.1 Motivation 1 1.2 Research Goals 4 1.3 Research A p p r o a c h 5 1.4 Organization of Thesis 7 B a c k g r o u n d a n d Previous W o r k 8 2.1 8 F P G A Architectures 2.1.1 Island-style Architectures 2.1.2 E n h a n c e d L U T Architectures and C a r r y Chains 13 9 2.1.3 Platform-style Architectures 15 2.2 F P G A C A D Flow 19 2.3 Power 21 2.4 Power E s t i m a t i o n 22 2.4.1 Simulation-based vs. P r o b a b i l i s t i c 24 2.4.2 Techniques: Circuit-level 25 2.4.3 Techniques: Gate-level 25 2.4.4 Techniques: R T - l e v e l . 2.4.5 F P G A - s p e c i f i c Power E s t i m a t i o n Tools 2.5 Focus a n d C o n t r i b u t i o n of Thesis ' . . 30 32 34 Table of Contents 3 Framework .3.1 3.2 3.3 3.4 3.5 4 36 Overall F l o w 36 Versatile Place and R o u t e ( V P R ) 38 3.2.1 A r c h i t e c t u r a l Assumptions 38 3.2.2 T-VPack 39 3.2.3 Placement and R o u t i n g Engine 39 3.2.4 Architectural Flexibility 40 P o o n Power M o d e l 40 3.3.1 A r c h i t e c t u r a l Assumptions 40 3.3.2 Activity Estimation 41 3.3.3 Power E s t i m a t i o n 43 3.3.4 Architectural Flexibility . 44 D S P B l o c k Power M o d e l and T o o l F l o w 44 3.4.1 Requirements 45 3.4.2 Extending P V P R Flow Chapter S u m m a r y 45 .• • • 46 Activity Estimation 48 4.1 M o t i v a t i o n for developing A c t i v i t y E s t i m a t i o n Techniques 48 4.2 Techniques for A c t i v i t y E s t i m a t i o n of an E m b e d d e d B l o c k 49 4.2.1 Gate-Level Technique 49 4.2.2 Independent O u t p u t Technique 51 4.3 5 iv Methodology and Results for A c t i v i t y E s t i m a t i o n 53 4.3.1 O u t p u t Nodes of D S P Block . . 54 4.3.2 Downstream Nodes of D S P B l o c k 58 4.4 Conclusions for A c t i v i t y E s t i m a t i o n 60 4.5 Chapter S u m m a r y 61 Power Estimation 63 5.1 Techniques for Power E s t i m a t i o n 63 5.1.1 Objectives 63 5.1.2 T h e Constant Technique 64 5.1.3 T h e L o o k u p Technique 64 5.1.4 T h e P i n C a p Technique 65 5.2 Methodology for Power Characterization 66 5.2.1 Constant Technique 67 5.2.2 L o o k u p Technique 67 Table of Contents 5.2.3 6 P i n C a p Technique 67 5.3 Evaluation 68 5.4 Conclusions for Power E s t i m a t i o n 71 5.5 Chapter S u m m a r y 72 Power E s t i m a t i o n T o o l F l o w 73 6.1 Overall Flow 73 6.1.1 Functionality 73 6.1.2 Modifications to P V P R 75 6.2 6.3 7 v F l o w Demonstration and C o m p a r i s o n to G a t e - L e v e l S i m u l a t i o n 76 6.2.1 Motivation 76 6.2.2 Terminology 77 6.2.3 Methodology 78 6.2.4 Results 82 Chapter S u m m a r y 84 Conclusion 86 7.1 S u m m a r y and Contributions 86 7.2 Future W o r k 87 7.2.1 Benchmarks 88 7.2.2 G l i t c h Characterization 89 7.2.3 B o a r d - L e v e l Verification 89 : Bibliography 91 A d i t c h i n g Characterization Attempt 95 B Detailed Results from C o m p a r i s o n 98 vi List of Tables 6.1 Parameters added to the P V P R Architecture F i l e Format 76 6.2 V P R Power A n a l y s i s Results 6.3 O v e r a l l Percentage E r r o r for F I R F i l t e r Results 84 6.4 O v e r a l l Percentage E r r o r for Differential E q u a t i o n Solver Results 84 B.l F I R F i l t e r Results for 0.25 Average Input A c t i v i t y 98 B.2 F I R F i l t e r Results for 0.50 Average Input A c t i v i t y 99 B.3 F I R F i l t e r Results for 0.75 Average Input A c t i v i t y . 100 B.4 Differential E q u a t i o n Solver Results for 0.25 Average Input A c t i v i t y . . . . 101 B.5 Differential E q u a t i o n Solver Results for 0.50 Average Input A c t i v i t y . . . . 102 B.6 Differential E q u a t i o n Solver Results for 0.75 Average Input A c t i v i t y . . . . 103 -. 83 List of Figures 1.1 18-bit x 18-bit multipliers combined for 36-bit x 36-bit multiplier 2 2.1 Island-style Architecture (from [1]) 9 2.2 2-input L U T (from [2]) 10 2.3 L U T w i t h Flip-flop (from [2]) 10 2.4 Cluster-based logic block 11 2.5 Programmable Switches 13 2.6 C a r r y C h a i n Connections to a 4 - L U T 15 2.7 M u l t i - b i t Logic B l o c k (from [3]) 18 2.8 Steps i n the F P G A C A D F l o w 19 2.9 A b s t r a c t i o n Levels for Power Analysis (from [4]) 23 3.1 Overall C A D F l o w (from [5]) 37 3.2 C A D F l o w E n h a n c e d to Support D S P Block Power E s t i m a t i o n 46 4.1 Gate-Level Technique 49 4.2 Independent O u t p u t Technique 51 4.3 O u t p u t P i n A c t i v i t i e s for Unregistered M u l t i p l i e r 54 4.4 O u t p u t P i n A c t i v i t i e s for Registered M u l t i p l i e r 55 4.5 O u t p u t P i n A c t i v i t i e s for Unregistered D S P B l o c k 56 4.6 O u t p u t P i n A c t i v i t i e s for Registered D S P B l o c k 56 4.7 18-bit x 18-bit M u l t i p l i e r s Combined for 36-bit x 36-bit M u l t i p l i e r 57 ' 4.8 F I R F i l t e r Used for A c t i v i t y Experiments 58 4.9 A c t i v i t y Results for the F I R F i l t e r 59 4.10 A c t i v i t y Results for the Converter 60 5.1 Methodology for Power E s t i m a t i o n Experiments 66 5.2 Power E s t i m a t i o n E x p e r i m e n t a l Methodology 68 5.3 Power E s t i m a t i o n E x p e r i m e n t Results 69 5.4 Power E s t i m a t i o n E x p e r i m e n t Results 70 6.1 Power E s t i m a t i o n T o o l F l o w . . '. 74 List of Figures viii 6.2 Terminology 77 6.3 fir.3_8.8: F I R F i l t e r C i r c u i t 78 6.4 Differential E q u a t i o n Solver C i r c u i t 79 6.5 18-bit x 18-bit multipliers combined for 36-bit x 36-bit multiplier 80 6.6 Simulation F l o w Pseudocode 81 6.7 F l o w for Determining Simulation Power Estimate of E a c h D S P B l o c k Instance 82 A.l Power Characterization F l o w 96 A.2 Testbench Pseudocode 96 ix Acknowledgements I w o u l d like to thank my supervisor, Professor Steve W i l t o n , for teaching me so much these last two years and for providing many opportunities to interact w i t h the FPGA c o m m u n i t y and to gain valuable teaching experience. I would like to thank everyone i n the Socwilton, Soclemieux, and other System-on-Chip groups for their ideas, advice, and good company. Peter Jamieson (from the University of Toronto), B r a d , D i p a n j a n , Julien, and V i c t o r were incredibly helpful when I was getting started w i t h a l l the tools. R o o z b e h and R o b e r t o were also great resources. I a m grateful to A l t e r a and the N a t u r a l Sciences and Engineering Research C o u n c i l of C a n a d a for funding my project and to my defense committee for volunteering their time. T h a n k you to Scott and Lesley for your emotional support during the academic rough patches and for knowing exactly what I was going through. A l s o thank you to Steph and Jess for your emotional support and encouragement during the non-academic rough patches, the roommate antics, and the entertaining adventures. F i n a l l y , I would like to dedicate this work .to my family, friends, and m y dragon boat family (Swordfish, R o l i , and U C Water Dragons crews) who kept me sane through a very intense last two years. Chapter 1. Introduction 1 Chapter 1 Introduction 1.1 Motivation Power dissipation has become increasingly important i n the electronics industry. There are two reasons for this: 1. There is a growing number of portable electronic applications, which have battery life constraints 2. Process technologies i n the nanometer scale a n d clock frequencies i n the gigahertz range result i n very hot chips a n d increased cooling costs. These issues are relevant to b o t h Application-Specific Integrated C i r c u i t s ( A S I C s ) a n d to Field-Programmable G a t e A r r a y s ( F P G A s ) . W i t h increasing mask costs for A S I C s , F P G A s have become a n attractive implementation alternative for low a n d m e d i u m volume production. However, F P G A s lag behind A S I C s significantly i n their power efficiency. T h e additional transistors required for programmability result i n higher power dissipation per function. Leakage power is also dissipated i n both the used a n d unused parts of F P G A s , and most commercial F P G A s do not have a sleep mode to reduce power i n unused parts (The A c t e l I G L O O flash-based F P G A is an exception [6]). Chapter 1. Introduction 2 Figure 1.1: 18-bit x 18-bit multipliers combined for 36-bit x 36-bit multiplier Over the past few years, significant advances have been made at the circuit level [7] [8], architecture level [9], and C A D tool level [10] [11] [12]. S t i l l , current F P G A s have been found to be, on average, 12 times less power efficient t h a n A S I C s [13]. A basic F P G A is composed of programmable logic elements (called L o o k - u p Tables or LUTs), wire segments, and programmable connections/switches that are typically laid out i n a regular pattern. T h e L U T s i n F P G A s typically have 4 to 6 inputs and can implement any boolean function of these inputs. B y configuring each L U T and connecting t h e m together using the programmable interconnect, users can implement virtually any digital circuit. To reduce area and improve circuit speed, larger logic blocks are used (typically composed of a collection of 4-10 L U T s ) ; they are called clusters. 3 Chapter 1. Introduction One promising approach to reducing power in F P G A s at the architectural-level is to embed coarse-grained fixed-function blocks that implement certain types of functions very efficiently. Commercial F P G A s contain embedded multipliers and embedded "Digital Signal Processing (DSP) blocks" to improve the performance and area efficiency of arithmetic-intensive applications, such as D S P applications [14] [15] [16] [17]. A simplified example of a D S P block is shown in Figure 1.1; in this diagram, four 18xl8-bit multipliers are programmably connected to create a 36x36-bit multiplier, with optional registers at the multiplier inputs, outputs, and adder stage outputs. Other examples of D S P block configurations are adder/subtractors, multiply-accumulators, and multiply-adders. Arithmetic-intensive applications can be implemented in LUTs, but even if L U T s are enhanced with dedicated interconnect for fast carry chains [18] and arithmetic chains [15], significant performance, density, and power improvements can be obtained by implementing the arithmetic parts of the applications in the embedded blocks [19]. Reference [20] found that an average area savings of 55% and an average increase in clock rate of 40.7% could be obtained for floating point applications by embedding floating point DSP blocks. There are several disadvantages of including DSP blocks in an F P G A fabric. If the blocks are not used (1) the area is wasted, and (2) the blocks will still dissipate leakage power. This makes it important to try to optimize the size and quantity of embedded resources included. However, it is difficult to determine what proportion of the F P G A area should be devoted to embedded blocks because it is difficult to determine what typical usage is for a device that can implement almost any digital circuit. Chapter 1. Introduction 4 In order to meaningfully evaluate the power savings that can be obtained by using these blocks, an experimental flow that can estimate the power dissipated by F P G A s containing embedded DSP blocks is required. Commercial tools from F P G A vendors provide power estimation that includes these blocks; however, these tools are tailored for specific devices, and do not provide the flexibility needed to investigate alternative embedded block architectures. The Versatile Place and Route (VPR) tool suite [ 2 1 ] with the Poon power model [ 2 2 ] has become standard experimental infrastructure to study F P G A architecture, C A D , and power issues (we will refer to the combination of V P R and the Poon power model as P V P R ) . The estimates from the power model can be used by F P G A architects to evaluate architectures for power efficiency, by F P G A users to make power-aware design decisions, and by F P G A C A D tool developers to create power-aware C A D algorithms. However, P V P R only supports homogeneous F P G A architectures containing L U T s and a routing fabric, and thus is not sufficient to examine architectures containing embedded DSP blocks. 1.2 Research Goals The objective of this research is to create an enhanced F P G A C A D flow that can be used to evaluate the power dissipation of F P G A s containing embedded DSP and arithmetic blocks. We have imposed the following requirements: • As P V P R is standard experimental infrastructure for academic F P G A architecture, C A D , and power studies, our power estimation of the DSP and multiplier blocks Chapter 1. must be compatible w i t h the P V P R Introduction 5 flow. • A s P V P R results are frequently used to perform architectural evaluations, our power estimation must be fast, to facilitate iterations over tens or hundreds of architectural alternatives and benchmark circuits. • T o facilitate iterations over many architectural alternatives, our method should a i m to minimize the modeling effort required when introducing a new architecture for evaluation. • In order for the power estimates to be meaningful, our method should a i m to be as accurate as possible, w i t h i n the constraints imposed by the previous requirements. Fast power estimation techniques make use of average quantities, such as probabilities, to reduce runtime. Details are lost during this averaging, resulting i n a runtime versus accuracy tradeoff. Similarly, reducing characterization effort typically results i n the capturing of fewer details, again resulting i n an effort versus accuracy tradeoff. A s the requirements that we have set for our flow involve b o t h fast estimation and low characterization effort, we expect that we w i l l be sacrificing some accuracy. Therefore, we must find a suitable balance between speed, characterization effort, and accuracy. .1.3 Research Approach A s w i l l be explained i n Chapter 2, power estimation involves two steps: estimation, and (2) power estimation using the activities. (1) activity Chapter 1. Introduction 6 D y n a m i c power dissipation is proportional to how often nodes i n a circuit switch values (from logic-0 to logic-1, or logic-1 to logic-0). A c t i v i t y estimation is the calculation of how much switching is expected at each node i n the circuit. In implementing the a c t i v i t y estimation for F P G A architectures containing embedded D S P blocks, we came across the following challenge: it is not clear how to propagate activity calculations through a D S P block. T r a d i t i o n a l activity estimation techniques propagate activities through s m a l l gates ( L U T s ) using transition probability or transition density models [23]; however, these approaches do not scale well to embedded blocks w i t h tens (or even hundreds) of inputs. T h e next step is to estimate the power dissipated by the F P G A implementing a user's circuit. In implementing algorithms to perform this estimation, we identified the following challenge: once the p i n activities of each embedded D S P block are known, it is still not clear how to use them to estimate the power dissipated by the embedded D S P block. There are many possibilities, each providing a different trade-off between power estimation time, modeling effort when a new block is to be investigated, and accuracy. T h e second technical challenge is to choose a power estimation technique that provides the best balance between these factors. In this thesis we treat each of these two challenges as a separate problem. For each problem, we develop and evaluate solutions. T h e n , we use the best of our solutions to each problem to create our experimental C A D flow for evaluating F P G A containing D S P blocks. architectures Chapter 1. 1.4 Introduction 7 Organization of Thesis T h i s thesis is organized as follows. Chapter 2 provides an overview of F P G A architectures, the F P G A C A D flow, power, and power estimation. It also describes previous work related to power estimation. Chapter 3 describes the existing P V P R C A D flow and how our D S P and arithmetic block power model fits into this framework. C h a p t e r 4 describes our solutions and their evaluation for the problem of activity estimation. Chapter 5 describes our solutions and their evaluation for the problem of power estimation. C h a p t e r 6 describes how we integrated our solutions from Chapters 4 and 5 to create a Power Estimation Tool Flow for evaluating F P G A architectures containing D S P blocks, presents results for two benchmark circuits, and compares the results to those obtained using gatelevel simulation. F i n a l l y , Chapter 7 summarizes the conclusions and provides suggestions for future work. 8 Chapter 2 Background and Previous Work 2.1 F P G A Architectures T h e most basic building blocks of F P G A s are programmable logic blocks, wire segments, and programmable connections/switches that are typically laid out i n a regular pattern. B y programming basic functions into each block and connecting t h e m together using the programmable interconnect, users can implement virtually any digital circuit. A number of architectures have been developed to optimize for criteria such as routability, area efficiency, and t i m i n g . Architectures are frequently classified based on their routing architecture because most of the area i n an F P G A is devoted to routing. Reference [1] lists the three m a i n categories of commercial F P G A architectures that were available when t h a t book was written: island-style, row-based, and hierarchical. Today, platform-style F P G A s are available, which include coarse-grained embedded components. T h i s thesis w i l l focus on F P G A s based on island-style architectures. Section 2.1.1 w i l l introduce basic island-style architecture components, Section 2.1.2 w i l l describe modern enhancements to the basic components, and Section 2.1.3 w i l l describe platform-style F P G A s . T h e framework that we b u i l d u p o n is described i n Chapter 3. Chapter 2. Background and Previous Work 9 Programmable routing switch Figure 2 . 1 : Island-style Architecture (from [1]) 2.1.1 Island-style Architectures Figure 2 . 1 shows an island-style architecture, where programmable logic blocks are "islands" i n a "sea" of programmable routing fabric [5]. In the traditional island-style architecture, the logic blocks are clusters of look-up tables ( L U T s ) . T h e routing fabric is composed of wires, programmable connection blocks, and programmable switch blocks. Look-up Tables L o o k - u p tables ( L U T s ) are the most basic elements for implementing logic i n an F P G A . Figure 2 . 2 shows a 2-input L U T . A K - i n p u t L U T ( K - L U T ) is a memory w i t h 2 K bits, K Chapter 2. Background and Previous Work 10 Inputs SRAM cells 0 00 0 01 0 10 1 11 Output Figure 2.2: 2-input L U T (from [2]) Select bit P Inputs »> Output 4-input LUT Clock D Flip-flop Figure 2.3: L U T w i t h Flip-flop (from [2]) address inputs, and one output port. B y setting the memory bits, this structure can be used to implement any combinational function of K inputs. T y p i c a l values for K are 4-6. To implement sequential logic, a L U T is typically paired w i t h a flip-flop to form a Basic Logic Element (BLE), as shown i n F i g u r e 2.3. T h e output of the B L E is selected from either the registered or the un-registered version of the L U T output, depending on the select bit of the output multiplexer. Chapter 2. Background and Previous Work 11 Inputs F i g u r e 2.4: Cluster-based logic block Cluster-based Logic B l o c k s The use of larger logic blocks helps to increase circuit speed and reduces circuit area and C A D tool processing time. Unfortunately, L U T complexity grows exponentially w i t h the number of inputs. T h e use of cluster-based logic blocks addresses this problem. Clusters are typically composed of a number of B L E s , internal cluster routing, and possibly specialized internal cluster connections, such as carry chains. W i t h i n a cluster, B L E inputs are typically connected to the cluster inputs and B L E outputs by a multiplexer-based crossbar [2]. A n example of a cluster-based logic block is shown i n F i g u r e 2.4. It has 8 inputs, contains 4 B L E s , and has 4 outputs. T h e L U T inputs can be connected to either the cluster inputs or the B L E outputs v i a the internal routing. B y grouping B L E s that share signals, placement and r o u t i n g processing time is reduced because the number of inter-block connections is reduced. T h i s internal routing is also shorter and faster than Chapter 2. Background and Previous Work 12 inter-block connections, reducing circuit delay. Routing Logic blocks are connected together and to I / O resources using a routing "fabric" that is composed of: • pre-fabricated metal tracks • programmable switch blocks • programmable connection blocks Figure 2.1 shows the components of the routing fabric. T h e tracks are arranged i n channels, t y p i c a l l y horizontally and vertically, to form a grid pattern. W i r e s r u n along these tracks. T h e connection blocks connect the wires to the inputs and outputs of logic blocks adjacent to the channel. T h e switch blocks connect the horizontal and vertical wires together to form longer wires and make turns. It should be noted that, although we consider the switch blocks and connection blocks as separate entities, they are often not separate i n the F P G A circuit layout. T h e programmable switches i n the switch blocks and connection blocks can be either buffered or unbuffered. T y p i c a l l y they are buffered to reduce delays i n long connections; however, this increases routing area. Buffered switches can be either unidirectional or bidirectional. M o d e r n F P G A s use unidirectional switches to get better delay o p t i m i z a t i o n and a denser routing fabric [24]. Figure 2.5 shows examples of switches t y p i c a l l y found i n F P G A s : (a) unbuffered, (b) buffered unidirectional, and (c) buffered bidirectional. Chapter 2. Background and Previous Work 13 P (a) Unbuffered (b) Buffered unidirectional (c) Buffered bidirectional Figure 2.5: Programmable Switches T h e routing wires inside an F P G A can be categorized as either (a) segmented local routing for connections between logic blocks, or (b) dedicated routing for global networks, such as clocks or reset/preset lines. T h e local routing segments can span a single logic block or multiple blocks; typically, not a l l segments are of the same length. T h e dedicated routing tracks are designed to ensure low-skew transmission. C o m m e r c i a l F P G A s t y p i c a l l y also include phase-locked loops ( P L L s ) and delay-locked loops ( D L L s ) for clock de-skew on these dedicated lines. T h e y also may include clock multiplexers and clock management circuitry to reduce power consumption by disabling unused parts of the clock network [25]. 2.1.2 Enhanced L U T Architectures and Carry Chains C o m m e r c i a l F P G A s contain improvements to the basic clustered K - L U T and interconnect F P G A s described i n Section 2.1.1. Chapter 2. Background and Previous Work 14 A d a p t i v e Logic M o d u l e ( A L M ) T h e A l t e r a Adaptive Logic Module (ALM) is a B L E t h a t can implement a single function of six inputs or two functions that share four inputs [15]. It is based on research that found that 6 - L U T s have better area-delay performance t h a n the typical 4 - L U T , but can result i n wasted resources if many of the logic functions do not require six inputs. Configurable Logic B l o c k ( C L B ) T h e X i l i n x V i r t e x - 5 C L B is a B L E based on a 6 - L U T w i t h two outputs so that it can either implement a single function of six inputs or two functions of five (out of the same six) inputs [26]. A s w i t h the A L M , it is based on research showing that 6 - L U T s have better performance, but may waste resources. C a r r y chains C a r r y chains are dedicated connections between logic blocks that aid i n the efficient implementation of arithmetic operations. T h e y also can be used i n the efficient implementation of logical operations, such as parity and comparison. Fast carry chains are important because the critical p a t h for these operations is often through the carry. Figure 2.6 shows a simple carry chain architecture. E a c h 4 - L U T i n a B L E can be fractured to implement two 3 - L U T s ; this is sufficient to implement b o t h the s u m and carry, given two input bits (a and b) and a carry input. T h e carry out signal from one B L E would typically be connected to the carry i n of an adjacent B L E using a fast dedicated connection. T h e Z input is used to break the carry chain before the first bit of Chapter 2. Background and Previous Work a 15 P b 2-LUT 2-LUT 2-LUT C a r r y in - - *~T' 7 3-LUT 2-LUT Carry out 3-LUT S u m out Figure 2.6: C a r r y C h a i n Connections to a 4 - L U T an addition. M o r e complex carry schemes have been published. In [18], carry chains based on carry select, variable block, and B r e n t - K u n g schemes are described; the B r e n t - K u n g scheme is shown to be 3.8 times faster than the simple ripple carry adder i n F i g u r e 2.6. The selection of a carry scheme by the F P G A architect involves weighing the area cost of a faster more complex carry chain against the performance benefits, because they are only beneficial when used. A c t e l and X i l i n x devices include support for carry-lookahead and A l t e r a devices include support for carry select capabilities. 2.1.3 Platform-style Architectures W i t h M o o r e ' s L a w and the increase i n the possible number of transistors on a die, F P G A manufacturers could afford to sacrifice some silicon area to application-specific circuits. In 2000, to increase the efficiency and density of designs containing components such as processors, large memories, and complex arithmetic circuits, F P G A manufacturers began to release Platform-style F P G A s containing dedicated circuitry for these parts. Chapter 2. Background and Previous Work 16 In the context of F P G A s , Intellectual Property (IP) cores are t y p i c a l l y reusable circuit modules or high-level b u i l d i n g blocks that can be combined to create a system. These cores can be soft or hard. Soft cores are implemented i n L U T s a n d hard cores are implemented as embedded A S I C blocks i n the F P G A . P l a t f o r m F P G A s contain h a r d I P cores, such as processors, large memory blocks, fast multipliers, a n d Digital Signal Processing (DSP) blocks. T h e y also contain dedicated interconnect for fast communication between certain types of adjacent cores. DSP blocks To improve the density a n d address the performance requirements of D S P a n d other arithmetic-intensive applications, F P G A manufacturers t y p i c a l l y include dedicated hardware multipliers i n their devices. A l t e r a Cyclone II a n d X i l i n x V i r t e x - I I / - I I P r o devices contain embedded 1 8 x l 8 - b i t multipliers, w h i c h can be split into 9x9-bit multipliers [27]. T h e V i r t e x - I I / - I I P r o devices are further optimized w i t h direct connections to the X i l i n x block R A M resources for fast access to input operands. Higher-end F P G A s include D S P blocks, which are more complex dedicated hardware blocks optimized for a wider range of D S P applications. A l t e r a S t r a t i x a n d S t r a t i x II D S P blocks support pipelining, shift registers, a n d can be configured to implement 9x9-bit, 18xl8-bit, or 36x36-bit multipliers that can optionally feed a dedicated adder/subtractor or accumulator [15]. X i l i n x V i r t e x - 4 X t r e m e D S P slices contain a dedicated 1 8 x l 8 - b i t 2's complement signed multiplier, adder logic, 48-bit accumulator, a n d pipeline registers. They also have dedicated connections for cascading D S P slices, w i t h an optional wire-shift, without having to use the slower Chapter 2. Background and Previous 17 Work general routing fabric [27]. T h i s inclusion of dedicated multipliers or D S P blocks to complement the general logic resources results i n a heterogeneous F P G A architecture. Research has considered w h a t could be gained from t u n i n g F P G A architectures to specific application domains, i n particular datapaths a n d D S P . T h e work i n [28] a n d [3] is tuned for datapath (arithmetic) circuits. D a t a p a t h circuits are often composed of regularly structured components, called bit-slices. T h e authors propose a multi-bit logic block that uses configuration memory sharing t o exploit this regularity a n d save area. I n typically-sized cluster-based logic blocks (containing 4 t o 10 B L E s ) , configuration S R A M cells consume 39% t o 48% of the t o t a l cluster area. If each cluster is used to implement a single bit-slice of the datapath circuit and adjacent clusters are used to implement adjacent bit-slices, the configuration memory for corresponding B L E s i n the adjacent slices c a n be shared. A m u l t i - b i t logic block is illustrated i n F i g u r e 2.7, indicating resources that can share configuration memory. T h e authors also propose bus-based connections to exploit this regularity to achieve 14% routing area reduction for implementing datapath circuits. A disadvantage of multi-bit architectures is that if implementation circuits require a different b i t - w i d t h , some resources may be wasted. T h e work i n [29] deliberately avoids creating a heterogeneous architecture because the authors found that D S P applications contain b o t h arithmetic and random logic, but that a suitable ratio between arithmetic and random logic is difficult to determine. Instead they develop two "mixed-grain" logic blocks that are suitable for implementing b o t h arithmetic and random logic by looking at properties of arithmetic operations a n d properties Chapter 2. Background and Previous Work 01 < 8 to 3 P BLE ,BLE #1 *1 18 BLE #1 *C IBLE #N BLE #N 3 C J Cluster #1 J L L Cluster #3 Cluster #2 Figure 2.7: M u l t i - b i t Logic B l o c k (from [3]) of the 4 - L U T . T h e i r logic blocks are coarse-grained: each block can implement up to 4-bit addition/subtraction, 4 bits of an array multiplier, a 4-bit 2:1 multiplexer, or wide B o o l e a n functions. E a c h logic block can, alternatively, implement single output random logic functions like a normal L U T . T h e i r architecture reduces configuration memory requirements b y a factor of four, w h i c h is good for embedded systems or those w i t h dynamic reconfiguration. Compared to arithmetic-optimized F P G A architectures, it offers efficient support of a wider range of D S P applications that vary i n the amount of r a n d o m logic they contain. However, the cost of this flexibility is additional area overhead. If applications w i t h very little arithmetic are implemented, further area penalties are incurred i n the case of a l l of these architectures w i t h specialized arithmetic because most of the arithmetic support logic is left unused. support, Chapter 2. Background and Previous Work 19 Behavioral Description Hii/i-L'wrl tVil-w.is Register Transfer Level (RTL) , Logic Synthesis [ Logic Optimizations! [ ' | Technology Mapping Netlist of Interconnected Cells T ^Physical Design j Packing ^ [' : ! Placement |^,| Routing | *J Design for download Figure 2.8: Steps i n the F P G A C A D F l o w 2.2 F P G A C A D Flow In order to implement large designs on F P G A s , C A D tools are essential. T h e y allow users to work at a high abstraction level, automating o p t i m i z a t i o n and transformation to a low-level implementation. T h i s section w i l l give a brief overview of the steps i n the F P G A C A D flow. Figure 2.8 illustrates the steps i n the flow. T h e input to the top level of the F P G A C A D flow is a behavioral description, typically i n the form of Hardware Description Language ( H D L ) code or a schematic that describes what the circuit does, w i t h little or no reference to its structure. D u r i n g high-level synthesis, an i n i t i a l compilation of the design is done, some i n i t i a l optimizations are performed, and functional units are scheduled and assigned to the operations required by the circuit [30]. T h e result of high-level synthesis is a Register Transfer Level ( R T L ) description of the design. A n R T L description is t y p i c a l l y w r i t t e n i n H D L . R T L code is characterized by a straightforward flow of control, w i t h subcircuits that are Chapter 2. Background and Previous Work 20 connected together i n a simple way [31]. D u r i n g the optimization stage of logic synthesis, technology-independent optimizations are applied to the logic functions that the R T L describes; the result is a netlist of gates. D u r i n g the technology m a p p i n g stage of logic synthesis, this netlist of gates is m a p p e d to the set of cells that are available to b u i l d the functions. In the case of F P G A s , these cells are L U T s and flip-flops. T h e result is a structural netlist of interconnected L U T s and flip-flops. Physical design has three stages: (1) packing, (2) placement, and (3) routing. D u r i n g the packing stage, the L U T s and flip-flops are packed into coarser-grained cluster-based logic blocks, w i t h the goals of improving r o u t a b i l i t y and optimizing circuit speed. P a c k i n g together B L E s that share signals minimizes the number of connections between logic blocks, thus enhancing routability. P a c k i n g together B L E s likely to be on the critical p a t h makes connections between them go v i a the fast local interconnect, thus o p t i m i z i n g circuit speed. D u r i n g the placement stage, the cluster-based blocks are assigned locations i n the F P G A . Goals during placement are to minimize w i r i n g by placing connected blocks close together, to enhance routability by balancing the w i r i n g density across the F P G A , and to maximize circuit speed by p u t t i n g blocks likely to be on the critical path close together. F i n a l l y during the routing stage, paths are found i n the channels for the wires that connect the logic blocks. T h e result is typically refered to as the placed and routed design. A t t h a t point it could be converted to a bitstream for programming an F P G A . T h e widely used Versatile Place and R o u t e ( V P R ) academic C A D tool for packing, 21 Chapter 2. Background and Previous Work placement, and routing w i l l be described i n Chapter 3. 2.3 Power T h e formula for the instantaneous power dissipated at time t, p(t), is: Pit) = i(t) • V (2.1) dd where Vdd is the supply voltage and i(t) is the instantaneous current drawn from the supply. To obtain the average power over a time interval, we replace the instantaneous current by the average current drawn over that interval i n E q u a t i o n 2.1. K n o w i n g the peak instantaneous power is useful when sizing supply lines, whereas knowing the average power helps i n the calculation of battery life and cooling requirements. In this work we focus on average power. Power dissipation can be broken down into dynamic and static components. D y n a m i c power is dissipated when a gate is switching; it is due to: (1) the charging and discharging of parasitic capacitances and (2) temporary short circuits between the high and low supply voltage lines. T h e average dynamic power of the gate is given by the equation P = 0.5 • aCVfj (2.2) where a is the activity of the gate, C is the parasitic capacitance of the gate, and / is the clock frequency. Static power is dissipated when the gate is not switching; it is due to leakage currents. Typically, i n an F P G A , the majority of power dissipated is dynamic [22]. Chapter 2. Background and Previous Work 2.4 22 Power Estimation It is important to distinguish between power estimation and power measurement. Power measurement involves obtaining voltage and current values from a real physical apparatus. Power estimation involves predicting what the power dissipation would be, based on a number of assumptions. One reason for performing power estimation is that a physical apparatus is not always available. A n o t h e r reason is that a wider range of designs can be considered and evaluated more quickly when we are not constrained to using physical implementations. Performing power estimates earlier i n the design flow is desirable to help guide design decisions or identify problems i n the design. Power estimation can take place at any stage i n the F P G A C A D flow (Figure 2.8). T h e stages higher i n the flow are at a higher abstraction level and do not involve implementation details. A s we get lower i n the flow, more physical details of the design have been determined. Performing power estimates at the lower stages i n the flow w i l l generally give more accurate estimates; however, it w i l l take more computational resources to take into account these physical details. In our work, we w i l l be performing estimates after placement and routing. Power estimation can be done at different abstraction levels, as shown in Figure 2.9. A t each stage, we need the following types of information so that we can perform the power analysis: 1. activity estimates, so we can compute the dynamic power dissipation, 2. a description of what the design looks like - this can be either an architectural Chapter 2. Background and Previous Work 23 Level of Abstraction System-level Algorithm-level |^onthrt:pf^circyit| Figure 2.9: A b s t r a c t i o n Levels for Power Analysis (from [4]) description or a netlist - so that we know what components are used a n d how they are connected, 3. models of these components and connections In a platform-style F P G A , the placed a n d routed netlist contains representations of circuit components at multiple levels of abstraction: L U T s , flip-flops, a n d wires are essentially at the gate level, while D S P blocks a n d memories can be thought of as being R T L components, a n d hard processors can be thought of as being system-level components. For this work, we are interested i n the lower three abstraction levels i n F i g u r e 2.9: circuit-level, gate-level, and R T L . In the following subsections, we w i l l first describe two categories into which power estimation techniques can be classified. T h e n we w i l l describe a number of existing techniques at the three lower abstraction levels. F i n a l l y we Chapter 2. Background and Previous Work 24 w i l l describe some F P G A - s p e c i f i c power estimation tools. 2.4.1 Simulation-based vs. Probabilistic Power estimation techniques can be divided into two categories: (1) simulation-based a n d (2) probabilistic. Simulation-based techniques simulate the circuit to gather d a t a about the switching of circuit nodes or even determine the waveform of the current being drawn. However, simulation-based techniques require complete and specific information about the input signals. T h e accuracy of the simulation results is dependent on how realistic the i n puts are. Consequently, reference [23] calls simulation-based power estimation techniques strongly pattern-dependent. T o avoid the problem of determining complete and specific input signal characteristics, probabilistic techniques are based on typical input signal behaviour. T h e y represent the average behaviour of the inputs using probabilities. A l t h o u g h the estimation is still dependent on the probabilities provided, it is sufficient to supply t y p i c a l behaviour instead of specific behaviour. T h u s [23] calls probabilistic power estimation techniques weakly pattern-dependent. Since calculations need only be performed once on the average data, instead of on a large number of simulation inputs, probabilistic techniques tend to require less computational resources than simulation-based techniques; however, some accuracy is sacrificed by of the use of averaging. Chapter 2. Background and Previous Work 2.4.2 25 Techniques: Circuit-level Typically, at the circuit-level users seek very precise estimates. A s a result, circuit-level techniques tend to be simulation-based, while probabilistic techniques tend to be applied only at the gate-level and above [32]. SPICE S P I C E provides very detailed, low-level simulation data for a circuit. S P I C E stands for Simulation P r o g r a m w i t h Integrated C i r c u i t s E m p h a s i s and is a general purpose analog circuit simulator for nonlinear D C , nonlinear transient, and linear A C analyses. It uses mathematical models to represent the devices i n the circuit, such as resistors, capacitors, and transistors [33]. T h i s very detailed simulation can result i n high accuracy estimates, but it requires substantial computational resources, m a k i n g it unsuitable for large circuits. S P I C E was used i n the creation of the P o o n power model, w h i c h is discussed i n C h a p t e r 3. However, the work i n this thesis w i l l be done at higher levels of abstraction, due to runtime and complexity constraints. 2.4.3 Techniques: Gate-level Simulation-based gate-level analysis is very mature. T h e most popular type of gate- level analysis uses event-driven logic simulation, where switching events at the inputs of a logic gate trigger events at the output after a pre-defined delay. P r o b a b i l i s t i c gate- level techniques exist as well, to reduce the execution time of estimates. simulation-based and probabilistic gate-level techniques i n this work. W e used b o t h Chapter 2. Background and Previous 26 Work Synopsys P r i m e P o w e r Synopsys PrimePower is a simulation-based d y n a m i c power analysis took for gate-level power verification that can be used on multimillion-gate designs. It combines gate-level simulation results w i t h delay a n d capacitance information from technology libraries to get detailed power information. In addition to average power numbers, P r i m e P o w e r reports instantaneous power consumption i n different parts of the design. Transition P r o b a b i l i t y T h e Transition P r o b a b i l i t y Technique relates the average dynamic power o f nodes t o the likelihood that they w i l l switch. T o use the T r a n s i t i o n P r o b a b i l i t y Technique, we need the signal probability a n d the transition probability of each node. T h e signal probability, Psignah of a node is the average fraction of clock cycles i n which the steady state value of the node is logic high. T h e transition probability, P , of a node is the average fraction of t clock cycles i n which the steady state value of the node is different from its i n i t i a l value [23]. T h e Transition P r o b a b i l i t y Technique makes some simplifying assumptions: delay, spatial independence of inputs and internal nodes, and temporal zero- independence of signal values. T h e assumption of zero-delay means there is, at most, a single transition of each signal per clock cycle; i n reality, there are delays and they can cause the output of a gate to transition multiple times before settling at its final value for t h e clock cycle. T h e assumption of spatial independence means we assume that there is no correlation between nodes, although, i n reality, the value of one signal may affect the value of another Chapter 2. Background and Previous Work 27 signal, in the same cycle. T h e assumption is made because calculating the correlation between signals for a large circuit is prohibitively expensive. T h e assumption of temporal independence means that we assume, for a given signal, that values i n consecutive clock cycles are independent of each other. W i t h those assumptions, the average power can be calculated using E q u a t i o n 2.3: P = 0.5-V J 2 d where V dd J2 all nodes C ^ (-) 2 3 is the supply voltage, / is the clock frequency of the circuit, C , is the t o t a l capacitance at node i, and P j is the transition probability at node i. Because of the zerot delay assumption, E q u a t i o n 2.3 only gives a lower bound on the power - unmatched delays cause multiple transitions at gate outputs. W i t h the temporal independence assumption, the transition probability can be calculated from the signal probability using E q u a t i o n 2.4: Pt — 2 • PsignaliX Psignal) (2-4) Transition Density T h e Transition Density Technique is more accurate than the Transition P r o b a b i l i t y Technique and more computationally efficient than event-driven logic simulation. T h e advantage of the Transition Density Technique over the Transition P r o b a b i l i t y Technique is that it distinguishes between multiple transitions of a node i n a single cycle, m a k i n g it more accurate. Switching a c t i v i t y can also be thought of as transition density, D(x) (for Chapter 2. Background and Previous 28 Work node x ) , w h i c h is the average number of transitions of node x per unit time. Formally, it is given by E q u a t i o n 2.5, Z?(x)4lim!^) (2.5) 1 —>oo 1 where T is the length of the time interval and n (T) x is the number of transitions i n the time interval of length T. G i v e n the transition density of all the nodes, the average power dissipation can be calculated using E q u a t i o n 2.6: P = 0.5-V CiD(xi) 2 d d (2.6) all nodes where Vdd is the supply voltage, Q is the capacitance at node i, and D{xi) is the transition density of node i . There are two important quantities i n the calculation of activities for a l l nodes i n the circuit using the Transition Density model: static probability and transition density. Static probability is the probability that the signal is high. To calculate the activity of each node i n the circuit, the transition density for each node is computed, gate-by-gate, going from the p r i m a r y inputs to the p r i m a r y outputs. If we assume that a l l inputs are uncorrelated, we can use the relationship D(y) = . P df(x) (2.7) D{xi all input pins where f(x) is the logic function of the gate, — f{x)\ ~\ Xi © f(x)\ =o Xi is the boolean Chapter 2. Background and Previous Work D(xi) difference at the output port w i t h respect to each input 29 is the transition density at input Xi and D(y) is the transition density at the output, y. P (j^^J can be calculated from the static probabilities of the inputs x using the relationships: { • P ( X ) • P ( X Y ) • P { X where P ( X ) 1- = + = Y) is P ( X ) P ( X ) = • P ( X ) P\(X), P{Y) + P ( Y ) - P ( X ) • P{Y) the static probability of X. Lag-one M o d e l T h e Transition Density model assumes that there is no temporal correlation. T h e purpose of using the lag-one model is to relax this assumption; the lag-one model assumes that the current value of a signal may depend on the value immediately preceding it. U s i n g the lag-one model, the switching probability can be calculated using E q u a t i o n 2.8: ( i) P • X ^ (2.8) ( i> j) P X X XJ&XI xiEXo For a boolean function, / , Xi is the set of input states such that / (xi) = 1 V Xi € X\ and X 0 is the set of input states such that / ( i j ) = 0 V Xj € X , the current input state is x 0 iy P ( X J ) is the probability that and P (x,, Xj) is the probability that the input state w i l l be Xj at the end of a clock cycle if the state was X{ at the beginning of the clock cycle. T h i s equation represents the summation of probabilities over all pairs of input states x , Xj such t that / (pa) = f (XJ), where an input state is a row of the t r u t h table for / . Chapter 2. Background and Previous Work 2.4.4 30 Techniques: RT-level RT-level estimators are typically based on macro-modeling. Macro-modeling involves creating power macro-models for the basic functional components i n the R T L libraries and characterizing them [4]. T h e user of an R T L estimator sees the macro-models as black boxes. However, creating a macro-model of a component involves characterizing its representation at a lower level of abstraction [32]. For example, to do power characterization for an adder, we might estimate its gate-level implementation and use information about the gates to derive overall values for its power characteristics. A l t h o u g h power estimates at higher levels of abstraction are less accurate, they s t i l l provide valuable information. W i t h the increase i n the size and complexity of designs, it is desirable for designers to be able to estimate the power at a high level of abstraction so that the information can guide early architectural decisions. A n o t h e r motivating factor is that the largest power reductions often come from architectural and algorithmic m o d i fications [34], which are least costly to make early i n the design flow. However, although R T L estimators are available i n commercial tools, they have not yet gained widespread acceptance i n design practice. Reference [4] attributes this to the difficulty of quantifying the accuracy gap between gate-level and R T L power estimation i n an industrial setting. Another deterrent noted by reference [4] is the fact that a large amount of characterization must be done to make a l i b r a r y of macro-models; this process must be automated to be efficient. Chapter 2. Background and Previous Work Dual 31 Bit Type Method T h e D u a l B i t T y p e ( D B T ) M e t h o d [34] is an architecture-level strategy for generating accurate black-box models of datapath power consumption. Its creators note that, while t y p i c a l strategies quantify activity and physical capacitance for their estimates, the strategies do not account for the effect of signal statistics on the activity. In particular, the authors identify the correlation between sign bits of two's complement operands as being an important source of error when using the assumption of randomized inputs to the block being modeled. A s an example, consider an F P G A w i t h 8-bit adders and a user circuit where a l l the operands are 5 bits wide. T h e lower 5 bits could be adequately represented by uniform white noise ( U W N ) inputs, but the upper 3 bits would always be identical (correlated) sign-extension values. T h e creators of the D B T method propose to account for two input b i t types: (1) correlated sign bits, and (2) U W N operand bits. Recall E q u a t i o n 2.2: P = 0.5 -aCV J 2 d (2.9) To account for the two bit types, instead of using a single capacitative coefficient based on U W N inputs, they use multiple capacitative coefficients that account for transitions on each type of data on each input to the block. However, a two-input single-function module requires 73 capacitive coefficients; the number increases for semi-configurable multi-function D S P blocks that are found i n F P G A s . Chapter 2. Background and Previous Work 32 Entropy-based In reference [35], the authors propose to characterize the average switching activity of a module by using the average switching activity of a typical signal line i n the module. T h e i r goal is t o obtain an acceptable estimate w i t h a limited number of design details and at a significantly lower computational cost. T h e y derive simple closed form expressions to approximate the switching activity i n the R T L blocks using the concepts of entropy and informational energy. However, to manage the complexity of their calculations, they make the following simpifying assumptions: • Simplified, uniform network structure: E a c h level of the circuit has the same number of nodes a n d a l l the gates on each level are assumed t o get their inputs from the previous level. • A s y m p t o t i c network depth: T h e number of levels i n the circuit is large enough to be considered infinity. Unfortunately, D S P a n d arithmetic blocks i n F P G A s do not have a uniform network structure a n d are not so large that we can approximate their network depth as infinite. 2.4.5 FPGA-specific Power Estimation Tools Spreadsheets T h e most accurate power estimation results for an F P G A design w i l l be after the design has been implemented (i.e. placed, routed, a n d then simulated w i t h accurate stimulus vectors). However, it is valuable to understand the impact of early high-level design Chapter 2. Background and Previous Work 33 decisions on power dissipation. Power estimation spreadsheets can be used i n the preimplementation phase to obtain a rough idea of power dissipation for a design. These spreadsheets contain detailed device data constants from F P G A manufacturer datasheets. T h e user enters environmental conditions, voltage and clock information, logic utilization, and toggle rates. E a r l y spreadsheets only calculated total power dissipated for voltage sources and components [36] [37]. T h e spreadsheets for the latest F P G A families from A l t e r a and X i l i n x are newer and calculate the static, dynamic, and total power consumption [38] [39]. T h e X i l i n x V i r t e x - 4 spreadsheet also provides graphical representations of power, voltage, and temperature relationships and power used by each type of component. It should be noted that these spreadsheets compute power i n a device-specific manner, based on constants. T h e user is expected to provide toggle activity information for each block, but (s)he might not know what values to use at such an early stage. C A D tools Industrial C A D tools that offer more accuracy t h a n spreadsheets are X i l i n x X P o w e r and A l t e r a PowerPlay Power A n a l y z e r . T h e y are used i n the implementation phase, when design details such as placement and routing have been established. X P o w e r requires either user supplied toggle rates, as w i t h the spreadsheets, or postimplementation simulation d a t a to estimate the power consumed [40]. P o w e r P l a y is similar, but also includes (for some device families) vectorless a c t i v i t y estimation to statistically estimate the signal activity of a node using the activities of the signals feeding the Chapter 2. Background and Previous Work 34 node and the logic function implemented by the node [41]. T h e l i m i t a t i o n of these tools is t h a t they apply only to some of the A l t e r a and X i l i n x devices. T h e P o o n power model is a freely available, detailed, flexible power m o d e l that has been integrated into the Versatile Place and Route ( V P R ) C A D tool. It estimates the dynamic, short-circuit, and leakage power consumed for a wide variety of user-specified F P G A architectures. It is described i n detail i n Chapter 3. 2.5 Focus and Contribution of Thesis Section 2.1 describes the basic island-style architecture and the improvements that exist i n commercial F P G A s to improve density and speed. Unfortunately, available academic power estimation tools only support basic island-style architecture components. T h e goal of this research project is to enable fast and accurate estimation of power dissipated i n F P G A designs that include embedded multiplier and D S P blocks (for the remainder of the thesis, both embedded multipliers and D S P blocks w i l l be referred to as D S P blocks). O u r project uses b o t h simulation-based and probabilistic information at the gate-level to create a Power Estimation Tool Flow that includes automated RT-level embedded D S P block macro-model characterization. T h i s work builds upon the P o o n power model and the widely used V P R C A D T o o l , which are described i n Chapter 3. T h e contributions of this thesis can be summarized as: 1. Identification of a fast and accurate technique to estimate the switching a c t i v i t y of an embedded D S P block Chapter 2. Background and Previous Work 35 2. Identification of a fast and accurate technique to estimate power dissipated by D S P blocks 3. A t o o l flow for estimating embedded D S P block power i n the context of F P G A designs. T h e impact of our enhanced tool flow is threefold; the existence of a freely available, architecturally flexible F P G A C A D t o o l that includes power modeling for embedded D S P blocks enables: 1. the investigation of power-aware architectures containing embedded D S P blocks 2. the investigation of power-aware C A D algorithms for F P G A circuits containing embedded D S P blocks 3. the incorporation of power tradeoffs i n the design of user circuits. 36 Chapter 3 Framework T h i s chapter introduces the existing experimental C A D t o o l suite that forms the basis of our work. Section 3.1 describes the flow of the framework. Section 3.2 describes the T - V P a c k t o o l for packing basic logic elements into cluster-based logic blocks and the original V P R C A D t o o l . Section 3.3 describes the P o o n power model and the improved activity estimation t o o l , A C E - 2 . 0 . Section 3.4 describes how our work fits into the existing framework and the requirements for our work. 3.1 Overall Flow Our work is based u p o n the V P R C A D t o o l suite, enhanced w i t h the P o o n power m o d e l (together P V P R ) . Frequently, " V P R " refers to the pair of tools T - V P a c k and V P R , since they are t y p i c a l l y used together. In this thesis, we w i l l do the same. Figure 3.1 illustrates the steps i n the P V P R t o o l flow. T h e left side is the original V P R flow. For the P o o n power model, activity estimation was added; this is shown to the right of the original V P R flow. T h e first input to the flow is a netlist describing the user's circuit. T h i s netlist must be pre-processed to generate the correct data and d a t a format required by V P R . T h i s pre- Chapter 3. Framework 37 Netlist Logic Optimization and Technology Mapping J Mapped Netlist Parameterized Architecture • Description Power Estimates Figure 3.1: Overall C A D F l o w (from [5]) processing involves logic o p t i m i z a t i o n using SIS [42] and technology m a p p i n g to L U T s and flip-flops using F l o w M a p + F l o w P a c k [43]. T h e result of technology m a p p i n g is a netlist mapped to the desired F P G A architecture. T h i s mapped netlist a n d input s t i m u l i are inputs to the activity estimation module, A C E - 1 . 0 , which is based on the Transition Density model. T h e output of A C E - 1 . 0 is switching activity information for each node i n the mapped netlist. T h e mapped netlist, switching activity information, and cluster architecture parameters are then input to T - V P a c k , which packs the L U T s and flipflops into cluster-based logic blocks. T h e cluster-based blocks are placed using the V P R placement engine. T h e connections between the placed blocks are then routed using the V P R r o u t i n g engine. V P R generates reports of placement and routing statistics. A r c h i t e c t u r a l investiga- tions can then be performed by varying the parameters i n the parameterized architecture Chapter 3. Framework description and examining the resulting statistics. 38 A l g o r i t h m i c investigations can also be performed by m a k i n g modifications to the packing, placement or routing engines and examining the statistics for a set of benchmark circuits. T h e P o o n power model adds the generation of power statistics for the clock, logic, and interconnect to V P R . These power statistics can be used i n b o t h architectural and algorithmic studies for basic Island-style F P G A architectures. 3.2 Versatile Place and Route (VPR) V P R is a freely available C A D tool that is widely used for performing F P G A architectural studies. It is composed of a packing tool, a placement and routing engine, and a detailed area and delay model. 3.2.1 Architectural Assumptions There are a large number of architectural alternatives for F P G A s and not a l l are supported by V P R . V P R targets S R A M - b a s e d Island-style F P G A s w i t h cluster-based logic blocks and perimeter I / O . E a c h S R A M cell is made of six minimum-sized transistors w i t h gate voltage boosting to overcome the B o d y effect. Four types of switch block architectures are supported for the programmable connection of routing tracks: Disjoint [44], Universal [45], W i l t o n [46], and I m r a n [47]. Chapter 3. Framework 3.2.2 39 T-VPack T - V P a c k is a timing-driven C A D tool; it takes a circuit netlist that has been technology mapped to L U T s and flip-flops and packs these basic logic elements into larger clusterbased logic blocks. Before the placement stage of V P R , the circuit netlist is processed using the T - V P a c k tool. A s described i n Section 2.1.1, the use of coarse-grained logic blocks results i n faster, denser circuits, and i n faster place and route runtimes. T - V P a c k has the o p t i m i z a t i o n goals of: • M i n i m i z i n g the number of inter-cluster connections on the critical p a t h of the circuit • R e d u c i n g the number of connections required between clusters by m i n i m i z i n g the number of inputs to the clusters • M i n i m i z i n g the number of clusters needed 3.2.3 Placement and Routing Engine T h e placement t o o l assigns the cluster-based logic blocks to locations i n the F P G A . T h e F P G A is modeled as a set of legal locations where logic blocks or I / O pads can be placed. A n initial r a n d o m placement is constructed, then simulated annealing is used to improve the solution. O p t i m i z a t i o n goals involve m i n i m i z i n g wiring and m a x i m i z i n g circuit speed. A s w i l l be described i n Section 6.1.2, we modified the placement tool to place D S P blocks as well. Once placement is complete, the routing tool determines which programmable switches to t u r n on to make the required inter-logic block connections i n the F P G A . V P R represents Chapter 3. Framework 40 the routing architecture of the F P G A as a directed graph called the routing resource graph. T w o routing algorithms are available: a purely routability-driven algorithm and a t i m i n g and routability-driven algorithm. 3.2.4 Architectural Flexibility T h e reason for V P R ' s versatility is its flexible representation of architectures that the user specifies i n an architecture file. T h e following features can be specified: • Logic block architecture • Detailed routing architecture • Channel w i d t h • T i m i n g analysis parameters • Process technology parameters and capacitances 3.3 3.3.1 Poon Power Model Architectural Assumptions A s the P o o n model is incorporated into V P R , it uses the architectural assumptions made by V P R . However, the original version of V P R assumes that the clock and other global signals are implemented using special dedicated resources. T h e version of V P R enhanced w i t h the P o o n model assumes an H-Tree clock distribution network a n d uses the total capacitance of the clock network for power estimation. Chapter 3. Framework 3.3.2 41 Activity Estimation In the P o o n model, the first step is to estimate the activities of the nodes i n the F P G A . T h e activity estimation t o o l for the power model is called A C E . T o distinguish between major revisions of the tool, the original w i l l be referred to as A C E - 1 . 0 and its sucessor w i l l be referred to as A C E - 2 . 0 [48]. T h i s section w i l l describe the techniques used for estimation i n A C E - 1 . 0 and A C E - 2 . 0 . ACE-1.0 A C E - 1 . 0 is the original a c t i v i t y estimation t o o l for the P o o n model. It estimates the static probability ( P i ) , switching probability (Ps), and switching activity (A ) s for combi- national and sequential gate-level circuits using the Transition Density signal model. T h e original Transition Density model only handles combinational circuits, but was enhanced to support sequential circuits. To support circuits w i t h sequential feedback, an iterative technique is used to update the switching probabilities at the output of the flip-flops, using the expressions P^Q) = P (D) X and P (Q) S = 2 • P^D) • (1 - P i (£>)). T h e original Transition Density model was also enhanced to account for logic gate inertial delays by adding an analytical low-pass filter to filter out very short glitches. T h e authors of [48] found A C E - 1 . 0 to be inaccurate for large a n d / o r sequential circuits. T h e y found that A C E - 1 . 0 overestimates activities and suggest that the low-pass filter function is insufficient for reducing glitching. T h e y also attribute the poor sequential circuit performance to the simple expressions used i n the iterative technique for updating the switching probabilities at the outputs of flip-flops. T h e next subsection describes the 42 Chapter 3. Framework activity estimator from [48], A C E - 2 . 0 , which addresses the weaknesses of A C E - 1 . 0 . ACE-2.0 A C E - 2 . 0 is a faster and more accurate probabilistic activity estimator for the P o o n power model. It has three stages that address the weaknesses of A C E - 1 . 0 : 1. Simulation of sequential feedback loops 2. C a l c u l a t i o n of P i and P values for nodes not i n sequential feedback loops using the s Lag-one model 3. C a l c u l a t i o n of A s using a probabilistic technique that accounts for glitching T h e first stage improves the accuracy of activity estimation i n sequential circuits. Since simulation techniques were avoided because of runtime issues, A C E - 2 . 0 only simulates the logic i n sequential feedback loops. In the second stage, A C E - 2 . 0 obtains the Pi and P values using the Lag-one model for s the parts of the circuit not simulated, w h i c h produces exact switching probabilities if we assume that inputs are not correlated [48]. A C E - 1 . 0 uses the Transition Density model, which assumes that there is no temporal correlation. T h e purpose of using the lag-one model is to relax this assumption; the lag-one model assumes that the current value of a signal may depend on the value immediately preceding it. T h e most efficient known implementation of the Lag-one model uses a B i n a r y Decision D i a g r a m ( B D D ) . However, there is an exponential relationship between B D D size and the number of inputs, m a k i n g this implementation i m p r a c t i c a l for large circuits. ACE-2.0 combines B D D pruning w i t h a partial collapsing technique to give smaller B D D s . Chapter 3. Framework 43 In the t h i r d stage, A C E - 2 . 0 calculates the switching activities. It uses a generalization of the Lag-one model and accounts for glitching by incorporating the concept of a m i n i m u m pulse w i d t h for passing glitches. 3.3.3 Power Estimation T h e P o o n model uses estimated capacitances at the transistor level for each component inside the F P G A . T h e n , using the capacitance values and switching activity estimates, the average power dissipation is calculated. T h e model was compared against HSPICE simulations. T h e P o o n model d y n a m i c power estimates were found to be w i t h i n 4.8% for routing and 8.4% for logic. For leakage, average difference between the estimates and the H S P I C E results was 13.4%. D y n a m i c power is the dominant component of the t o t a l power i n an F P G A . T h e P o o n model calculates capacitance values at the transistor level to determine the power dissipation of L U T s , multiplexers, and buffers inside logic blocks. It also uses the metal capacitance of each routing track and the parasitic capacitance of a l l switches attached to the track, specified using the process technology parameters i n the architecture file, to calculate the power dissipated i n the F P G A routing. T h e routing power is a large portion of the dynamic power dissipated. Since the S R A M programming bits in the F P G A do not change value after configuration, they are not included i n the d y n a m i c power calculations. T h e dynamic power is calculated using the equation: (3.1) all nodes Chapter 3. Framework 44 T h e short circuit power is modeled as 10% of the d y n a m i c power, based on extensive H S P I C E simulations i n [5]. T h e leakage power has two components: reverse bias leakage and subthreshold leakage. A s the P o o n model was calibrated using a 0.18 /zm process technology, it assumes that the reverse bias leakage is negligible. T o calculate the subthreshold leakage the P o o n model uses the equation: Fleak Idrain (weak inversion) ' ^supply (3*2) It uses a first order analytical estimation model to estimate the subthreshold current. 3.3.4 Architectural Flexibility Enhancements for the P o o n model add support to the architecture file for the flexible specification of: • Supply, swing, and gate-source voltage levels • Leakage and short circuit power parameters • N M O S and P M O S transistor characteristics • Clock network architecture parameters 3.4 DSP Block Power Model and Tool Flow T h e D S P block power model that we propose is an extension for the P V P R flow. Section 3.4.1 discusses the requirements of the power model we have developed as part of this work. Section 3.4.2 explains where our work fits i n to the P V P R flow. 45 Chapter 3. Framework 3.4.1 Requirements A s stated earlier, our power model must fit into an existing F P G A C A D t o o l flow. In order to allow the investigation of future architectures, instead of simply existing commercial architectures, we prefer that this C A D t o o l be architecturally flexible; it should be possible to specify a wide range of logic block, routing, clock, and D S P block architectures. In an architectural investigation, many iterations of P V P R are executed to gather d a t a about the impact of varying certain architectural parameters. In order to not hinder the use of P V P R for an investigation requiring tens (or even hundreds) of iterations, our power estimation must be fast. Furthermore, i n order for the power estimates from the investigation to be meaningful, they must be accurate. T h e previous requirements pertain to the t o o l flow that is visible to the P V P R user. A n important input to the tool flow i n F i g u r e 3.4.2 is the DSP block characterization data. A s described i n Section 2.4.4, a deterrent to the use of macro-modeling at the RT-level is the fact that a large amount of characterization must be done to make a library of macro-models; this process must be automated to be efficient. Therefore, to make our tool flow attractive, we must minimize the effort required when adding models for new D S P blocks and automate characterization. 3.4.2 Extending P V P R Flow Figure 3.2 shows how our model fits into the P V P R flow. Pre-processing of the D S P blocks i n the mapped netlist is required before the activities can be generated for the nodes i n a user's circuit that contains D S P blocks. A d d i t i o n a l characterization d a t a Chapter 3. Framework 46 Netlist I Logic Optimization » " and Technology Mapping Netlisl Pre-processing for - ; * DSP Blocks j J Mapped Netlist Input Stimuli J , 1 Acttvfty _'(ACE-2 _ _ _ _0)^Switching Activity Information ',' * . £ 'Packing ^Modified T VPACK) • : Parameterized Architecture • Description Placement (VPR) • DSP Block Characterization Data Power Estimation (Implemented in VPR) j Power Estimates Figure 3.2: C A D F l o w Enhanced to Support D S P B l o c k Power E s t i m a t i o n is also required i n order to estimate the power of the D S P blocks i n the final stage of processing. T h e addition of our work to the P V P R flow expands the support of P V P R to F P G A architectures that contain D S P blocks, thus enabling architectural and algorithmic investigations w i t h circuits that contain these blocks. 3.5 Chapter Summary Our work modifies the widely used P V P R tool flow. P V P R is composed of the V P R engines and the P o o n power model. T h e V P R engines are the T - V P a c k clustering algorithm, the V P R placement engine, and the V P R routing engine. T h e P o o n power model adds activity estimation and power estimation for logic blocks, interconnect, and the clock. O u r work modifies the P V P R flow to add activity estimation for D S P blocks and power estimation of D S P blocks using characterization information. T h e requirements for our Chapter 3. Framework 47 work are: • T h e D S P block power estimation must fit into an existing architecturally flexible F P G A C A D tool flow • T h e power estimation must be fast • T h e power estimation must be accurate • T h e characterization effort must be low and should be automated A s mentioned i n Section 1.2, the last three of these requirements are competing factors; fast estimation and low characterization effort w i l l generally lead to less accurate results. Thus, we must find a suitable balance between speed, characterization effort, and accuracy. Details of how we determined accurate methods for performing activity estimation and power estimation for D S P blocks are discussed i n Chapters 4 and 5, respectively. T h e complete automated t o o l flow, including D S P block characterization, is described i n C h a p t e r 6. 48 Chapter 4 Activity Estimation 4.1 Motivation for developing Activity Estimation Techniques A n important part of estimating the power dissipated i n an F P G A is estimating the act i v i t y of each connection i n the circuit. In the P o o n power model, activities are calculated gate-by-gate, starting from the p r i m a r y inputs. Since each gate ( L U T ) is small, the T r a n sition Density or Lag-one model can be used to calculate the activity of the output of each L U T as a function of the a c t i v i t y of its inputs. It is not feasible, however, to propagate activities through a D S P block using the Transition Density or Lag-one model since the computation performed using these models (and other related models) is 0(2 ) k where k is the number of inputs to the block. T h u s , a new technique is required. In this chapter, two alternative techniques to estimate the activities of each D S P output p i n are considered. T h e two techniques are compared to determine which is suitable for use w i t h i n the experimental C A D flow that was developed i n this work. It is important to note that the techniques described below are only being used to estimate the activities of the output pins of each embedded block. Power estimation of Chapter 4. Activity Estimation Circuit Inputs uutiitututt Circuit DSP Block Circuit 49 DSP Block Technology Verilog Description Library Tech Map A I h Quartus II S,rt'li />• \ i Gates with DC II DSP Block |Fptten Circuit [^Pi;to:gates; ; Random Vectors for Inputs ACEv2 0 TTTTTTTTT Circuit Outputs Activity Estimates A) Rattened DSP Blocks for Gate-Level Technique Figure 4.1: Gate-Level B) Gate-Level Technique Flow Technique these blocks w i l l be discussed i n Chapter 5. 4.2 Techniques for Activity Estimation of an Embedded Block T h i s section describes two techniques for estimating the activity i n circuits containing embedded D S P and multiplier blocks. A l t h o u g h Platform-style F P G A s also contain memories, processors, and other features, we l i m i t our experiments to circuits containing only D S P and multiplier blocks, logic, and interconnect. T h i s ensures that the results we obtain are not obscured by assumptions about the other Platform-style features. 4.2.1 Gate-Level Technique T h e first technique is an extension of the P o o n model activity estimation method. The embedded blocks are too large to for us to apply the Transition Density or Lag-one model to them directly. To be able to use the Transition Density or Lag-one models, this Chapter 4. Activity 50 Estimation technique involves representing the embedded blocks by their gate-level implementations. The flattened circuit would then consist of L U T s (for the parts of the circuits not i n the embedded blocks) and the gates that make up each embedded block. T h e Transition Density or Lag-one model can then be applied directly to this flattened netlist. This technique w i l l be referred to as the Gate-Level Technique. F i g u r e 4.1(A) shows the circuit w i t h the flattened D S P blocks i n grey a n d the L U T - a n d - i n t e r c o n n e c t part of the circuit i n white. Figure 4.1(B) shows a flow that employs this technique. In this flow, the D S P block to be evaluated is described i n Verilog. Synopsys Design Compiler is used to map the block to gates, using T S M C 0.18 / i m technology library information. T h e parent circuit to be considered is technology m a p p e d using Quartus II i n order to determine what gets mapped to D S P blocks. T h e parent circuit is then flattened a n d the D S P blocks are replaced w i t h their gate-level representations. T h e n , activity estimation is performed. Currently, the use of Quartus II for technology mapping restricts us to Altera-style D S P blocks; however, Quartus II could be replaced w i t h another technology m a p p i n g tool to evaluate non-Altera-style D S P blocks. It is important to emphasize that we do not modify the netlist that w i l l be implemented i n the F P G A . T h e flattened netlist is generated only during activity estimation. The principal advantage of this technique is that it accounts for the correlation between the input activities and output activities of the D S P blocks. A n o t h e r advantage of this technique is that it allows the use of A C E - 2 . 0 to estimate the activity for a l l the nodes i n the circuit, including the D S P block nodes. Chapter 4. Activity 51 Estimation Circuit Inputs Circuit Circuit 1 Pseudo-outputs ;Tech Map with I Quartus II Jc Jc Jc Pseudo-outputs DSP Block * • I .. Remove DSP: biocks •* 4- 4Pseudo-inputs l i Pseudo-inputs ACEv2 0 TTTTTTTTT Activity Estimates Circuit Outputs A) Independent Output Technique Terminology Random Vectors for Inputs and Pseudo-inputs B) Independent Output Technique Flow F i g u r e 4.2: Independent O u t p u t Technique T h e disadvantage of this technique is that it requires a gate-level implementation of the D S P block and proprietary technology library information. Since we expect that this technique would be used for evaluating a large number of architectures, a n o n - t r i v i a l amount of effort would be involved i n generating gate-level implementations for each D S P block to be evaluated. 4.2.2 Independent Output Technique T h e second technique that was evaluated addresses the disadvantage of the previous technique. If sufficiently accurate, we would prefer to use a technique that does not require proprietary technology or implementation details. In this technique we propose to model the embedded blocks as if they are external to the circuit. T h e remainder of the circuit is composed of L U T s and interconnect, so the P o o n model can be applied to that part. To model the D S P blocks as external to the circuit, the inputs to the embedded block are treated as p r i m a r y outputs of the circuit and the outputs of the embedded block are Chapter 4. Activity Estimation 52 treated as p r i m a r y inputs to the circuit. We refer to the D S P block inputs a n d outputs as pseudo-outputs a n d pseudo-inputs of the circuit, respectively, as shown i n F i g u r e 4 . 2 ( A ) . In this technique, the pseudo-inputs are assigned values i n the same way that the p r i m a r y inputs are assigned values by the P o o n model. R a n d o m input vectors w i t h a specified average activity are applied to the inputs a n d pseudo-inputs. T h e activities are then propagated through the circuit using A C E - 2 . 0 . W h e n the inputs t o a D S P block, pseudo-outputs, are encountered, the activities are not propagated through the D S P block. Instead, the pseudo-outputs are treated i n the same way as the p r i m a r y outputs. T h e activity calculations for the nodes downstream from the D S P block proceed using the pseudo-input values. In effect, the estimated output activities of a D S P block are then independent of the input activities to the D S P block. T h i s technique w i l l be referred to as the Independent Output Technique. Figure 4.2(B) shows a flow that employs this technique. T h e parent circuit is technology mapped using Quartus II. T h e D S P blocks are removed from the netlist a n d the nodes that were formerly outputs of the D S P block are represented as inputs t o the circuit. Input vectors are then applied to the inputs a n d pseudo-inputs a n d activity estimation is performed using A C E - 2 . 0 . T h e advantage of this technique is that it does not require gate-level implementation and technology information. The disadvantage of this technique is that inaccuracies are being introduced because, in general, D S P block output transitions are not independent of their input transitions. Chapter 4. Activity Estimation 53 Reference [49] found that for word-parallel or bit-serial arithmetic, the average activity at the output of an adder can be closely approximated by the m a x i m u m of the average activities of the two inputs, i m p l y i n g that there is a dependence. 4.3 Methodology and Results for Activity Estimation In comparing the accuracy of the two techniques, the Gate Level Technique w i l l be used as the baseline. To determine how well the simpler Independent Output Technique correlates w i t h the more accurate Gate-Level Technique, two quantities w i l l be compared: 1. T h e activities at the outputs 2. T h e activities at the downstream nodes In order to accurately estimate the power dissipated by the nets driven by the D S P block, accurate activities for these nets are needed. However, if the activities of a l l the D S P output pins are similar, then using a single average value could be sufficient. W h e n we refer to the downstream nodes of a D S P block, we mean the nodes between the D S P block outputs and the circuit outputs. Inaccurate activity estimates at the D S P block outputs may lead to inaccurate activity estimates for the downstream nodes due to the iterative nature of activity estimation algorithms (the output activity of each node is estimated based on the activities of the node inputs). Since there are typically many more downstream nodes t h a n there are D S P output pins, inaccuracies i n these downstream Chapter 4. Activity Estimation 54 Block Output Pin F i g u r e 4.3: O u t p u t P i n Activities for Unregistered M u l t i p l i e r nodes may have a larger impact on the accuracy of the estimation than would inaccuracies i n the D S P outputs. In this section, b o t h of these quantities were measured to determine whether the Independent Output Technique provides sufficient accuracy. 4.3.1 Output Nodes of DSP Block In this subsection, the activities at the outputs are compared. T o begin the investigation, the activity of each output pin of a 9-bit x 9-bit multiplier was examined. R a n d o m inputs (with a known average activity) were applied to the inputs of the multiplier, and A C E - 2 . 0 was used to estimate the output activity of each p i n . T h e results are plotted i n F i g u r e 4.3. T h e horizontal axis spans the set of 19 multiplier outputs (sign, L S B to M S B ) a n d the vertical axis is the estimated activity of each of these outputs. E a c h line corresponds Chapter 4. Activity Estimation O.fl 55 - 0.5 — i 04 u < 0.3 - Q. 9 a. 3 o 02 0 1 0 - ^ ^ i!? ^ 1 ^ y / <S ^ N 4 J5 n ^ ^ & & # 4 f ^ ^ y Block Output Pin X* 1 # A*" # f f f Figure 4.4: O u t p u t P i n Activities for Registered M u l t i p l i e r to a different average input activity. For a l l values of the average input activity, the conclusion is the same: the estimated activity differs across the output pins. T h e pins that are on the far left a n d right of the graph (the least a n d most-significant bits) have low activities, while the activities of the middle bits are large. T h i s implies that there is a specific distribution for the activities of the output pins, and that choosing these activities randomly (as is done i n the Independent Output Technique) will lead to inaccurate a c t i v i t y estimates for these nodes. Note that the activities reported i n Figure 4.3 are large, mostly greater t h a n one. T h i s is because multipliers tend to produce a large number of glitches o n their outputs [50]. M o s t D S P blocks, however, contain registers on their output pins, w h i c h w i l l remove these glitches. F i g u r e 4.4 shows the results of the same experiment i n w h i c h registers are added at the output of each multiplier. A s the graph shows, the d i s t r i b u t i o n is still there, especially for the extreme least and most significant bits. Chapter 4. Activity Estimation Block Output Pin Figure 4.5: O u t p u t P i n A c t i v i t i e s for Unregistered D S P B l o c k S 01 > ™ o < .£ Q. 0.3 — ( 3 Q. o 0 2 0.3 -0.5 - 0.7 0.9 avg input activity avg input activity avg input activity avg input activity Block Output Pin Figure 4.6: O u t p u t P i n A c t i v i t i e s for Registered D S P B l o c k Chapter 4. Activity Estimation Figure 4.7: 18-bit x 18-bit Multipliers Combined for 36-bit x 36-bit Multiplier 57 Chapter 4. Activity Estimation 58 FIR Filter F i g u r e 4.8: F I R F i l t e r Used for A c t i v i t y Experiments Figures 4.5 and 4.6 show the results of the same experiment on a more complicated D S P block. T h e block is shown i n Figure 4.7. It is similar to an A l t e r a D S P block configuration: it combines four 18-bit x 18-bit multipliers and an adder t o give a 36-bit x 36-bit multiplier. A g a i n the conclusion is the same: a single average value to represent the activities w i l l not capture the distribution of activities at the output pins. 4.3.2 Downstream Nodes of DSP Block T h e results i n Section 4.3.1 were for the D S P output pins only. In this section, we consider the nodes that lie downstream from the D S P blocks. Because A C E - 2 . 0 propagates activities from inputs to outputs, inaccuracies i n the D S P output activities w i l l lead to inaccuracies i n these downstream activities. T h e purpose of the remainder of this section is to understand how inaccurate these activities w i l l be. To perform these experiments, the F I R filter shown i n Figure 4.8 was used. T h i s circuit contains a bank of four multipliers followed by an adder tree; registers are included after the multipliers and w i t h i n the adder tree to support pipelining, and to reduce glitch Chapter 4. Activity 0.2 0.4 59 Estimation 0.6 0.8 1 1.2 1.4 Downstream Activities (Gate Level) Figure 4.9: A c t i v i t y Results for the F I R F i l t e r power. T h e activities of all nodes i n the circuit were estimated using two methods: the Independent Output Technique flow shown i n F i g u r e 4.2(B) a n d the Gate-Level Technique flow shown i n Figure 4.1(B). Figure 4.9 shows the results. I n this graph, each dot corresponds to a node i n the circuit; only nodes that are "downstream" (to the right of) the multipliers are included, starting w i t h the outputs of the multiplier output registers. T h e x-coordinate of a dot is the activity predicted for the corresponding node b y the Gate-Level (more accurate) Technique, while the y-coordinate of the dot is the a c t i v i t y predicted for the corresponding node by the Independent Output (less accurate) Technique. In this plot, a straight line at y = x would indicate that there is perfect correlation between the two estimation techniques. A s the graph shows, the correlation is good; the R 2 correlation metric is 0.8091. T h i s is surprising, since the activities of the multiplier outputs are as shown i n Figure 4.4 for the Gate-Level Technique, but r a n d o m for the Independent Output Technique. T h e reason has to do w i t h the nature of the downstream Chapter 4. Activity Estimation 60 0.8 0 0.1 0.2 0.3 0.4 Downstream activities (Gate Level) 0.5 0.6 Figure 4.10: A c t i v i t y Results for the Converter circuit. A s mentioned i n Section 4.2.2, [49] found that for word-parallel or bit-serial arithmetic, the average a c t i v i t y at the output of an adder can be closely approximated by the m a x i m u m of the average activities of the two inputs. Other adder configurations were considered w i t h similar results. T h e good correlation values do not hold for other downstream circuits, however. T h e adder tree i n F i g u r e 4.8 was replaced w i t h a signedmagnitude to 2s complement converter, and found that P = 0 . 5 4 2 , as shown i n F i g u r e 2 4.10, which is not nearly as good. 4.4 Conclusions for Activity Estimation Based on the results from the previous subsection, it was concluded that the activities obtained using the Independent O u t p u t technique do not correlate well w i t h those obtained using the more accurate Gate-Level technique for a l l downstream circuits. Since Chapter 4. A c t i v i t y Estimation 61 the Power Estimation Tool Flow must work for a wide variety of D S P architectures and downstream circuits, it was concluded that the Gate-Level technique is required for the flow. Note that the results i n Figures.4.4 and 4.6 suggest a t h i r d activity estimation technique (instead of the Independent Output and the Gate-Level techniques). R a t h e r t h a n generate the multiplier output activities randomly (as i n the Independent Output Technique), it may be possible to construct a distribution function, and generate activities based on this distribution function. W h i l e this would be possible, it w o u l d require a significant amount of characterization effort each time a new embedded D S P block is to be evaluated, since the distribution function can be significantly different for different D S P block architectures. G i v e n that A C E - 2 . 0 is fast and accurate [48], a n d the Gate- Level Technique is easier, the extra characterization effort for this t h i r d m e t h o d is not warranted. 4.5 Chapter Summary T h e motivation for developing a new activity estimation technique is runtime. It is not feasible to propagate activities through a D S P block using the T r a n s i t i o n Density or Lag-one model since the computation performed using these models (and other related models) is 0(2 ), where k is the number of inputs to the block. T h e G a t e - L e v e l Technique k was introduced as an extension of the P o o n model activity estimation method. The Independent O u t p u t Technique was introduced because we would prefer to use a technique that does not require proprietary technology or implementation details. O u r experiments Chapter 4. Activity Estimation 62 showed that the more accurate Gate-Level Technique is required for our Power Estimation Tool Flow. 63 Chapter 5 Power Estimation Once the activity estimates for each input and output p i n of the D S P block have been obtained, the Power Estimation Tool Flow must estimate the power dissipated w i t h i n each D S P block. T h i s chapter describes and evaluates techniques for doing characterizationbased power estimates. 5.1 5.1.1 Techniques for Power Estimation Objectives Since one of the objectives for this flow is to allow architectural exploration and experimentation, power estimation must be fast. T h i s w i l l facilitate many iterations of the flow in architectural parameter sweeps. A l t h o u g h it would be possible to create a gate-level model and use gate-level power simulation (such as w i t h PrimePower), this w o u l d be far too slow to include i n the inner loop of the Power Estimation Tool Flow. Therefore, a method is needed to quickly estimate the power of the embedded block, w i t h o u t resorting to modeling every internal node i n the block. Reduced estimation time typically comes at the cost of accuracy. In this chapter, we compare the accuracy of three fast and relatively simple techniques for estimating Chapter 5. Power Estimation 64 the power of an embedded DSP block against simulation results. For each technique, offline characterization is used to obtain data that can be quickly referenced at runtime. A limited amount of data is found once for each DSP block architecture, offline, using PrimePower. The following sections describe the three techniques we considered in increasing order of modeling effort. 5.1.2 The Constant Technique The first technique is the simplest. For this technique, the power dissipated by a DSP block is assumed to be a constant, dependent on the DSP block type and independent of the activities of the input and output pins. This technique will be referred to as the Constant Technique. The advantages of this technique are that it is simple and fast. The disadvantage of this technique is that it may lead to inaccurate estimates, since the power dissipated in an embedded block does depend on the input pin activities. However, this technique could be sufficient if the dependence is weak and the deviation from the average power is small. 5.1.3 The Lookup Technique For the second technique, we approximate the power dissipated by the embedded block as a function of the average activity of all the DSP block inputs. The function need-not be linear, and may be implemented as a look-up table rather than as a closed-form function. This technique will be referred to as the Lookup Technique. Chapter 5. Power Estimation 65 The advantage of this technique is that it is still relatively simple and fast, though it does involve more modeling effort than the Constant Technique. The disadvantage of this technique is that it assumes all input pins contribute equally to the power dissipated by the block. If we consider a multiplier block, for example, it will have pins corresponding to the multiplicand and multiplier operands. The fanout logic from the multiplicand operand pins may be very different from the fariout logic from the multiplier operand pins; thus, we expect that their contribution to the power dissipation will differ. However, if the difference does not cause substantial variation in the total power dissipation from an average over many trials, then this technique could be sufficient. 5.1.4 The P i n C a p Technique The Lookup Technique may provide inaccurate power estimates, since only the average input activity is used to estimate the power. As mentioned in the previous section, in reality, not all inputs pins are equal; activity on some pins may have more impact on the power dissipation of a block than the same activity on other input pins. To take this into account, a third technique is considered, called the PinCap Technique. For this technique we estimate, for each input pin in isolation, how much of an impact that pin has on the overall power dissipation of the embedded block. This is quantified by calculating an effective capacitance, Cj, for each input pin i. Then, the total power can be calculated as: (5.1). all-input jpins where / is the frequency of the circuit and is the activity of p i n i. Intuitively, this Chapter 5. Power Estimation 66 Technology DSP Block Library Verilog Description i Verilog Testbench + Input Vectors Synthesize to Gates with DC * r Simulate with Verilog-XL Analyze Power with PnmePower — J — Power Estimates Figure 5.1: Methodology for Power E s t i m a t i o n Experiments technique might provide accurate results, at the expense of more characterization effort. 5.2 Methodology for Power Characterization E a c h of the three techniques requires some amount of offline characterization. T h i s section describes the characterization methodology for each technique. Figure 5.1 shows the flow used to obtain this characterization data. A Verilog description of the D S P block was synthesized to gates using Synopsys Design Compiler. A Verilog testbench was used to simulate a set of input vectors applied to the gate-level description of the D S P block i n V e r i l o g - X L . T h e simulation d a t a was then fed to the Synopsys P r i m e P o w e r simulator to obtain characterization information. T h i s is done for a training set of input vectors, according to the characterization technique used. Chapter 5. 5.2.1 Power Estimation 67 C o n s t a n t Technique For the Constant Technique, we repeated this characterization task nine times for the given D S P block for average input activity values i n the set {0.1, 0.2, 0.3, 0.4, 0:5, 0.6, 0.7, 0.8, 0.9}. For each of the 9 average activity values, a t r a i n i n g set of 10 x 5000 input vectors was simulated to get 10 power estimates. W e took the average of a l l 90 results to obtain a single value to use as the constant power value for the D S P block. It should be noted that there is a one-to-many, relationship between the average activity for a set of input vectors and power estimates for that average input activity. Different sets of input vectors may differ i n the individual activities of each input p i n and, as described i n Section 5.1.4, different pins may contribute to the power of the block differently. 5.2.2 L o o k u p Technique For the Lookup Technique, we reused the data from the Constant Technique characterization. W e averaged the 10 estimates for each of the 9 activity values to obtain 9 d a t a points. These 9 average input activities and their corresponding average power estimates became the activity-power estimate pairs for the Lookup Technique look-up table. T h i s table was then included i n the power model. 5.2.3 P i n C a p Technique T h e input vector sets used for characterization for the PinCap Technique were different from those used for the Constant and Lookup Technique characterization. In those vector sets, a l l bits would toggle. For this technique, to assess the contribution of individual Chapter 5. Power Estimation 68 Training Set of 10x5,000 vectors at each activity 30 x 5,000 vectors at each activity Characterization (Section 5.2) Characterization data Estimation (3 techniques) PrimePower Power estimate Power estimate Figure 5.2: Power E s t i m a t i o n E x p e r i m e n t a l Methodology pins, a l l other pins except the one i n question were held constant; the p i n i n question was then toggled w i t h the given activity. T h e resulting power from PrimePower and E q u a t i o n 5.1 was then used to determine the effective capacitance for p i n j, Cj, because a* = 0 for all i ^ j. T h i s was repeated for each input p i n to the D S P block. Once the effective capacitance for each input p i n was determined, the values were included i n the power model. Input sets w i t h average activities of 0.2 and 0.5 were used for the characterization.. Conclusions about the P i n C a p technique could be drawn from the results for these two average input activities, so P i n C a p charaterization for the remaining activities between 0.1 and 0.9 was not necessary. 5.3 Evaluation In this section, we evaluate the accuracy of each of the three power estimation techniques. T h e experimental methodology is shown i n Figure 5.2. For each technique, we performed 69 Chapter 5. Power Estimation 4.0E-03 i 5.0E-04 O.OE+00 J 0 1 0.2 1 0.4 1 0.6 0.8 Average Activity for Vector Set 1 Figure 5.3: Power E s t i m a t i o n E x p e r i m e n t Results the characterization tasks described i n Section 5.2 for a 9x9-bit m u l t i p l i e r . W e then compared the power estimated by each technique to that estimated by P r i m e P o w e r simulation (which is presumably more accurate than any of our three techniques). F o r this experiment, we used 30 x 5000 randomly generated input vectors at each of the average input a c t i v i t y values i n the set {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9} (this is more t h a n was used during the offline characterization). F i g u r e 5.3 shows the results for the Constant Technique a n d the L o o k u p Technique. E a c h point on the graph represents the results for one set of 5000 vectors. T h e x-coordinate of the point is the average activity of all input vectors w i t h i n the set, a n d the y-coordinate of the point is the estimated power. T h e triangular dots represent the estimates using the Constant Technique, while the circular dots represent the predictions from the L o o k u p Technique. For comparison, the PrimePower estimates are also plotted on the same graph; the P r i m e P o w e r estimates are shown as diamonds. A s the graph shows, the Constant Chapter 5. Power 70 Estimation 4.5E-03 —•"PrimePower 4.0E-03 3.5E-03 _ . 3.0E-03 CD I 2.5E-03 J>o 2.0E-03 ? a. 1.5E-03 1.0E-03 5.0E-04 0.0E+0O J J Vector sets sorted by PrimePower power value, in increasing order F i g u r e 5.4: Power E s t i m a t i o n E x p e r i m e n t Results Technique (in w h i c h a single power value is used, independent of the input activity) provides poor estimates, compared to the P r i m e P o w e r results. T h e L o o k u p Technique, on the other hand, produces results that track well w i t h the P r i m e P o w e r results. The average difference between the L o o k u p Technique and P r i m e P o w e r estimates was 8%. T h e experiment was repeated w i t h a D S P block rather t h a n a multiplier, and the conclusions were the same; the average difference between the L o o k u p Technique and the P r i m e P o w e r results was 5%. T h i s suggests that it is important to take the input activities into account when estimating power of the block, but that the average of the input activities is enough information to get reasonably accurate power estimates. For the P i n C a p Technique, the results were not as good. Figure 5.4 shows the es- timates obtained using the P i n C a p Technique, along w i t h the estimates obtained using PrimePower. In this graph, only vector sets w i t h an average input a c t i v i t y of 0.2 were considered. E a c h point i n this graph corresponds to one such set. T h e points are dis- Chapter 5. Power Estimation 71 t r i b u t e d evenly across the x-axis, sorted b y the values of the PrimePower estimate. A s the graph shows, even for these sets, which a l l correspond to the same average activity, the power estimates predicted by the P i n C a p technique vary widely, a n d have no correlation to the PrimePower estimates. T h e PinCap Technique always overestimates the actual power; this is because the technique assumes that the power contribution of each input p i n is independent of the power contribution of each other input p i n , when i n fact, it is not. T h e t o t a l power is not simply the s u m of the contributions from each p i n . E v e n if the P i n C a p results are scaled by a constant value i n an attempt to reflect this, the results do not track the simulation results well (this is also shown i n Figure 5 . 4 ) . Other input activities were also attempted and no simple way was found to scale the results to take into account overlap between the contributions of multiple input pins. 5.4 Conclusions for Power Estimation Based on the results from the previous section, we concluded that the L o o k u p Technique is most appropriate for the Power Estimation Tool Flow. N o t only does it provide reasonably accurate results, but the characterization effort is relatively simple, a n d the run-time of the power estimate is small. Chapter 5. Power Estimation 5.5 72 Chapter Summary T h e objective of the power estimation experiments was t o determine a fast a n d accurate technique for estimating the power of D S P blocks, given input activities. W e introduced three techniques, i n increasing order of complexity: (1) the Constant Technique, (2) the L o o k u p Technique, a n d (3) the P i n C a p Technique. Power estimates using these techniques were performed a n d compared against PrimePower simulations for accuracy. W e found that it is important to consider the input activities when estimating the power a n d that the L o o k u p Technique is most suitable for our Power Estimation Tool Flow. 73 Chapter 6 Power Estimation Tool Flow T h i s chapter describes how we combined our Gate-Level activity estimation technique from Chapter 4 and our L o o k u p power estimation technique from C h a p t e r 5 w i t h the P V P R framework to obtain a complete power estimation C A D t o o l flow. Section 6.1 describes the overall flow and the modifications we made to P V P R . Section 6.2 describes the processing of two benchmark circuits through the entire flow and compares our estimates against P r i m e P o w e r simulations of the circuits. 6.1 Overall Flow 6.1.1 Functionality O u r C A D t o o l flow for estimating the power dissipated i n F P G A circuits containing embedded D S P blocks is shown in Figure 6.1. T h e activity estimation and power estimation are performed as described i n Chapters 4 and 5. T h e steps i n the box on the left correspond to activity estimation. For a c t i v i t y estimation, there are three inputs: (1) the user's circuit, (2) a Verilog description of the D S P block, and (3) input statistics (either as vectors for the circuit inputs or the transition density and static probability of each input). To facilitate architectural investigations, Chapter 6. Power Estimation Parameterized DSP Block verilog Generator User's Circuit 1 Tool Flow 74 Technology Library DSP Block Verilog Description Training Set of Input Vector Files Synthesize to Gates with DC Simulate with Vonlog-XL Tech Map with Quartus II i I 1__L~ Flatten Circuit Analyze Power with PrimePower DSPs to Gates Vectors, or D(x) and P i ACE-2.0 Timing charact. of DSP Power charact. of DSP Architecture File Activity Data Enhanced T-VPACK/VPP. with Poon model Power Analysis Report for Circuit with DSP Blocks F i g u r e 6.1: Power E s t i m a t i o n T o o l F l o w we created a parameterized D S P block generator to automate the production of Verilog descriptions for a large number of D S P architectures. T h e D S P block description is synthesized to a gate-level netlist using Synopsys Design C o m p i l e r and a T S M C 0.18 fj,m technology library. T h e H D L description of the user's circuit is technology mapped to L U T s , flip-flops and D S P blocks using Quartus II to produce a mapped netlist of these elements and their connections. T h e mapped netlist is flattened by replacing the D S P block instances w i t h the gate-level implementation obtained from Design Compiler. A C E - 2 . 0 is then used to obtain the activity of each node in the flattened circuit, using the input Chapter 6. Power Estimation Tool Flow 75 statistics that we provided. T h e steps i n the box on the right correspond to power characterization. Recall that, whereas the activity estimation steps must be repeated each time the user's circuit changes, the power characterization steps need only be done once when the D S P block is designed and can then be stored as library data. For power characterization, there are two inputs: (1) the gate-level implementation of the D S P block from Design Compiler, and (2) a t r a i n i n g set of input vector files. T h e gate-level implementation is simulated using V e r i l o g - X L and the resulting power determined by P r i m e P o w e r for the training set of input vector files. T h e average activity and resulting power for each input vector file i n the training set is saved as an activity-power pair i n the L o o k u p Technique table i n the P V P R architecture file. T h e D S P block t i m i n g characteristics are also stored in the P V P R architecture file. T h e activity d a t a for the circuit and the characterization data for the D S P block are then input to P V P R for packing, placement, routing, and power analysis. Currently, the use of Quartus II for technology mapping restricts us to Altera-style D S P blocks; however, Quartus II could be replaced w i t h another technology m a p p i n g tool to evaluate different D S P block architectures. 6.1.2 Modifications to P V P R Neither the original version of V P R nor P V P R support circuits w i t h embedded D S P blocks. We enhanced the B L I F netlist format [42] to allow for the specification of D S P blocks. W e modified the P V P R architecture file format [21] to include power and t i m i n g Chapter 6. Power Estimation Tool Flow 76 numbers for these blocks (including the power look-up table proposed i n Chapter 5); Table 6.1 describes the parameters we added. We assumed that D S P blocks are arranged i n columns, as i n A l t e r a and X i l i n x devices. W e modified the placement algorithm to correctly position these blocks, modified the t i m i n g analysis algorithm to estimate the delay through these blocks, and use this information to calculate the c r i t i c a l path of an implementation [51]. W e then modified the power model to use the Lookup Technique for D S P blocks. Table 6.1: Parameters added to the P V P R Architecture F i l e F o r m a t Parameter Meaning Start Repetition F i r s t column of D S P blocks N u m b e r of columns before next column of D S P blocks Nature of the D S P input and output pins Locations around the D S P block where the input and output pins can be programmably connected to the routing fabric Leakage power dissipated by the D S P block L o o k - u p value for activity i n an activity-power pair Energy dissipated for a given A c t i v i t y i n an activity-power pair (Energy is used to be independent of clock frequency) Class Location Leakage Activity Energy 6.2 Flow Demonstration and Comparison to Gate-Level Simulation 6.2.1 Motivation T h e experiments i n Chapter 5 consider the D S P blocks as stand-alone elements; random inputs are used and all bits have approximately the same average activity. W h e n a D S P Chapter 6. Power Estimation Tool Flow 77 block exists i n a larger circuit, however, it may be located downstream from other logic; each bit i n the input operands may have a different activity. T h i s section demonstrates the functionality of our C A D tool flow using two benchmark circuits from [52] and compares our results w i t h a more accurate method (using the same inputs) to see how much accuracy is sacrificed to obtain fast estimates. 6.2.2 Terminology A s this section discusses the simulation of multiple circuits, it is important to clarify the terminology we w i l l use. T h e test circuits (the F I R filter and differential equation solver) w i l l be refered to as the parent circuit or circuit. T h e circuit w i l l contain multiple instances of a DSP block. T h e DSP block is described by the D S P B l o c k Verilog Description i n F i g u r e 6.1. T h e instances w i l l be referred to as DSP block instances. (in white) w i t h instances F i g u r e 6.2 shows the (in grey) of a DSP block, which is shown to the right. circuit DSP1 DSP block instances DSP2 Figure 6.2: Terminology DSP block circuit Chapter 6. Power Estimation Tool Flow 78 FIR Filter Figure 6.3: fir_3_8_8: F I R F i l t e r C i r c u i t 6.2.3 Methodology D e m o n s t r a t i o n of Power E s t i m a t i o n T o o l F l o w To demonstrate the functionality of our flow, we ran it on two benchmark circuits from [52]. T h e first circuit is fir_3_8_8, the F I R filter shown i n Figure 6.3, which uses embedded multipliers; it consists of 272 L U T s , 148 flip-flops, and 4 multipliers. T h e second circuit is diffeq_paj_convert, a differential equation solver, shown i n Figure 6.4; it consists of 850 L U T s , 193 flip-flops, and 3 D S P blocks. T h e differential equation solver uses a D S P block configuration similar to one found i n A l t e r a S t r a t i x devices. T h i s block is shown i n Figure 6.5; it combines four 18-bit multipliers w i t h a dedicated adder to make a 36-bit multiplier. For b o t h example D S P blocks, a look-up table of activity-power pairs had been created offline, using L o o k u p Technique characterization as described i n Chapter 5. For each circuit we performed the characterization part of the flow only once and ran an iteration of the rest of the flow for each input vector file. For each activity i n the set {0.25, 0.50, 0.75}, we created five input files of 5000 vectors having that average activity, for a t o t a l of 15 input files. Note that the w i d t h of these vectors is equal to the number Chapter 6. Power Estimation Tool Flow 79 of inputs of the parent circuit. T h e purpose of providing vectors instead of activity and static probability values for each bit is that A C E - 2 . 0 performs simulation for sequential feedback loops. If only a c t i v i t y and static probability values are given to A C E - 2 . 0 , then it randomly generates vectors for simulation. However, since we wanted to compare our results against a PrimePower simulation, we needed to provide the same set of vector files to b o t h estimation tools. Figure 6.4: Differential E q u a t i o n Solver C i r c u i t C o m p a r i s o n against P r i m e P o w e r For these experiments we compared the results from our Power Estimation Tool Flow against PrimePower. T h e pseudocode for the methodology is given i n F i g u r e 6.6 and Chapter 6. Power Estimation Tool Flow 80 Figure 6.5: 18-bit x 18-bit multipliers combined for 36-bit x 36-bit multiplier described below. T h e same input vectors were used for b o t h the Power Estimation Tool Flow and the simulation approaches. W e assumed that the p r i m a r y inputs to the parent circuit were free of glitching. T o determine the power dissipated by each DSP block instance using simulation, we used the flow shown i n Figure 6.7. To determine the power of each D S P block instance, we had to determine the input waveforms to each D S P block instance separately. It is incorrect to assume that the D S P block instances w i l l have identical input waveforms, because the logic upstream of (leading up to) each D S P block instance i n the parent circuit may not be identical. Chapter 6. Power Estimation Tool Flow 81 Synthesize Verilog description of DSP block to gates in Design Compiler (already done for Power Estimation Tool Flow); Create simulation model of circuit in Quartus II; For avg_act in (0.25, 0.50, 0.75) { For trial in (1 to 5) { Create input vector set V of 5000 vectors with activity avg_act and width equal to number of primary inputs in circuit; Apply input vector set to circuit sim model using testbench_cct; Use ModelSim to simulate and generate VCD; For each DSP block instance i in circuit { Parse VCD for input waveforms to this DSP instance; Apply waveforms at DSP instance inputs using testbench_dsp_i; Use Verilog-XL to simulate; Use PrimePower to calculate power of DSP instance i (when circuit is stimulated by input vector set V); } } } Figure 6.6: Simulation F l o w Pseudocode To obtain the input waveforms to each D S P block instance, we used Quartus II to generate a simulation model of the parent circuit. Since we used Quartus II to technology map the circuits i n the Power Estimation Tool Flow, the mapped implementations for b o t h methods match. W e used M o d e l S i m to simulate the circuit and generate a Value Change D u m p ( V C D ) of the simulation, which is an A S C I I file that describes the waveforms for the circuit internal nodes. W e then used a V C D parser to extract the waveforms corresponding to the inputs of each D S P block instance. To obtain the simulation power estimates for each vector set, we used the input waveforms for each D S P block instance to simulate the 82 Chapter 6. Power Estimation Tool Flow DSP[3) •—i Y Quartus II DSP[2] To Power Estimation Tool Flow... Input Vectors for Parent Circuit _ Technology Mapped Netlist Simulation Model of Parent Circuit VCD (waveforms of circuit internal nodes) Waveforms for DSP[1] inputs DSP Block GateLevel Description Verilog-XL + PnmePower Waveforms for DSP[2] inputs - • . PnmePower Verilog-XL + PnmePower Simulation power esttorDSP[2J Simulation power esttorDSP[3] 1 T Simulation power estlorDSP[1] Waveforms for DSP[3] inputs r lllllilllPltll T Figure 6.7: F l o w for Determining Simulation Power E s t i m a t e of E a c h D S P B l o c k Instance gate-level implementation of the DSP block using V e r i l o g - X L a n d PrimePower. 6.2.4 Results D e m o n s t r a t i o n of Power E s t i m a t i o n T o o l F l o w T h e results of the power analysis are shown i n Table 6.2. In b o t h circuits, the D S P blocks dominate the power dissipation. T h i s may be surprising because routing power generally dominates the power dissipation i n an F P G A . However, the circuits are D S P kernels a n d not complete systems. Thus, they contain only a small amount of n o n - D S P logic a n d substantial routing is internal to the D S P blocks. Chapter 6. Power Estimation Tool Flow 83 C o m p a r i s o n against P r i m e P o w e r Tables 6.3 and B . l to B . 3 show the detailed results for the F I R filter circuit. Tables 6.4 and B . 4 to B . 6 show the detailed results for the differential equation solver circuit. T h e average difference between the Power E s t i m a t i o n T o o l F l o w and P r i m e P o w e r results for the F I R circuit was 20.4%. In the differential equation circuit, 2 of the 3 D S P blocks had an average difference of 22-26%, however the t h i r d had glitching on one input bus and was off by 77% on average. W h e n creating the lookup table, the blocks had been characterized only for activities 0.1 to 0.9, as it was not clear how to properly imitate glitching. A l t h o u g h we use linear interpolation to determine-the power corresponding to average input activities that are not i n the L o o k u p Technique table, the power relationship is not necessarily linear w i t h respect to D S P input activities. Consequently, it is not surprising that estimates for D S P block instances having an average input a c t i v i t y greater t h a n 1 are not well represented by a linear interpolation using the points for activities 0.8 and 0.9. T h i s indicates the importance of a lookup-table that includes d a t a that considers glitching. A p p e n d i x A describes preliminary unsuccessful attempts at i n c l u d i n g glitching during P r i m e P o w e r characterizations. Table 6.2: V P R Power Analysis Results Power Routing fir_3_8_8 diffeq_paj .convert 5.46 m W , 18.4% 4.04 m W , 13.6% 8.67 m W , 10.0% 4.68 m W , 5.6% 2.57 m W , 2.9% DSP 1.61 m W , 5.4% 18.59 m W , 62.6% 70.82 m W , 81.5% Total 29.71 m W 86.74 m W Logic Blocks Clock 84 Chapter 6. Power Estimation Tool Flow Table 6.3: Overall Percentage E r r o r for F I R F i l t e r Results Arithmetic Mean average percentage error average for just multiplier instance # 0 average for just multiplier instance # 1 average for just multiplier instance # 2 20.4% 21.6% 19.7% 19.6% average for just multiplier instance # 3 20.5% Table 6.4: Overall Percentage E r r o r for Differential E q u a t i o n Solver Results Arithmetic Mean average percentage error average for just D S P block instance # 0 average for just D S P block instance # 1 average for just D S P block instance # 2 6.3 41.7% 76.8% 22.5% 25.8% Chapter Summary In this chapter we have demonstrated the functionality of our Power Estimation Tool Flow for estimating the power of F P G A circuits containing embedded D S P blocks, w h i c h addresses our requirements laid out i n Chapter 3. W e have compared our results against simulation using PrimePower. O u r fast estimates are w i t h i n 19% to 26% of the simulated results, except i n the case where there is significant glitching at the inputs of the D S P blocks; the glitching case resulted i n 77% error. W e believe that adding characterization of glitches on the inputs of the D S P blocks is necessary to improve the accuracy of our method; however, it is not immediately clear how this can be done. T h i s w i l l be discussed Chapter 6. Power Estimation Tool Flow as future work in Section 7.2.2. 85 Chapter 7. Conclusion 86 Chapter 7 Conclusion 7.1 Summary and Contributions In this thesis we have described an experimental C A D flow that can be used to estimate the power dissipation of F P G A circuits containing embedded D S P blocks. W e identified two technical challenges i n creating such a flow: (1) estimating the activity of a l l nodes i n a circuit containing one or more D S P blocks, and (2) estimating the power dissipated w i t h i n a D S P block quickly and accurately. T h e first challenge arises because standard activity estimation techniques cannot propagate activities through these D S P blocks. We address this by replacing each D S P block w i t h a gate-level representation of the block, and using the standard a c t i v i t y techniques on the resulting circuit. T h e second challenge arises because it is not possible to pre-characterize the D S P block for all possible input patterns and activities. W e have shown that reasonable estimates can be obtained by creating a look-up table of power values. In the power model, the look-up table is indexed using the average activity of the block input nodes. We then combined our findings to create a Power Estimation Tool Flow based on the P V P R framework. T h e impact of our enhanced tool flow is threefold; the existence of Chapter 7. Conclusion 87 a freely available, architecturally flexible F P G A C A D t o o l that includes power modeling for embedded D S P blocks enables: 1. the investigation of power-aware architectures containing embedded D S P blocks 2. the investigation of power-aware C A D algorithms for F P G A circuits containing embedded D S P blocks 3. the incorporation of power tradeoffs i n the design of user circuits T h i s work is also one of a collection of projects at the University of B r i t i s h C o l u m b i a System-on-Chip L a b that each take a step towards the larger goal of enabling power estimation for platform-style F P G A architectures that contain embedded D S P blocks, embedded memories, embedded processors, and multiple clock domains. A poster of our contributions w i l l appear at the 2006 I E E E International Conference on F i e l d Programmable Technology i n Bangkok, T h a i l a n d . 7.2 Future Work G i v e n our t o o l flow, there are three enhancements that w o u l d be necessary before performing power-aware F P G A architecture studies that include embedded D S P blocks: (1) a suite of integer benchmarks representative of D S P and arithmetic-intensive user designs, (2) the incorporation of glitch characterization into the look-up data, and (3) board-level verification. Chapter 7. Conclusion 7.2.1 88 Benchmarks A suite of integer benchmarks representative of D S P and arithmetic-intensive user designs is essential to be able to draw generalizable conclusions from an architectural study targeting embedded D S P blocks. T h e standard benchmark suite for F P G A studies is the collection of Microelectronics Center of N o r t h C a r o l i n a ( M C N C ) circuits; however, these circuits are not very representative of D S P applications. Freely available circuits were obtained from [52] and [53]; however, most were not suitable for our experiments for the reasons listed below: • Some circuits used floating point arithmetic. D S P blocks i n commercial F P G A s target integer arithmetic, so our flow does the same. • Some circuits used very simple and small D S P blocks, w h i c h w o u l d not exercise many of the features i n D S P blocks embedded i n commercial F P G A s . • Some of the circuits were automatically generated using high-level synthesis. The R T L signal and module names were automatically generated alphanumeric character sequences. T h i s h i d the flow of control and d a t a i n the circuits and prevented analysis and debugging of the circuits. A very useful future research project would be the creation of circuits for integer D S P and arithmetically intensive applications at the register transfer level. Chapter 7.2.2 7. 89 Conclusion Glitch Characterization T h e comparison to gate-level simulation i n Section 6.2 revealed that glitching may take place on the inputs to the D S P blocks and that it is important to include characterization data i n the power estimation look-up table for cases where the activity at the inputs to the D S P blocks is greater than 1. W h e n performing characterization, it was not clear how to properly imitate glitching i n our testbenches. A p p e n d i x A describes our attempts. Reference [30] describes word-level and bit-level glitch generation and propagation models for characterization of datapath circuits. It would be interesting to incorporate this into our flow and evaluate its effectiveness. 7.2.3 Board-Level Verification W h e n performing power estimation at higher levels of abstraction, as we do w i t h the L o o k u p Technique, we are trading off accuracy for fast estimation. Consequently, at this level, fidelity is what we seek to provide (i.e. relative accuracy, instead of absolute). In order to verify that our t o o l flow w i l l provide consistent estimates that provide usable trends, we must compare our results to physical measurements on ah F P G A board. T h e use of board-level measurements to verify our power model is not t r i v i a l . It is not simply a case of downloading our test circuits to boards w i t h F P G A s containing the appropriate D S P blocks and comparing the measured power to our estimates. F i r s t , it is not feasible to create custom F P G A s for the set of D S P block architectures evaluation; layout and fabrication costs are excessive. under Second, the power of the D S P blocks alone cannot be measured; typically, we can only measure the t o t a l active and Chapter 7. Conclusion 90 quiescent power of the system. Consequently, we cannot simply compare our estimates against the results for a set of commercial F P G A boards. For example, A l t e r a Cyclone II and X i l i n x V i r t e x - I I devices contain embedded 1 8 x l 8 - b i t multipliers, and A l t e r a S t r a t i x and X i l i n x V i r t e x - 4 devices contain D S P blocks. One possibility is to compare our power estimates for a set of benchmark circuits against the power estimates for four boards containing each of these devices; however, their logic and r o u t i n g architectures differ, making it impossible to distinguish between deficiencies i n the power models for the D S P blocks, the logic blocks, and the interconnect. A starting point for board-level verification could be to compare trends i n measured values on a particular F P G A board against trends i n the power estimates using the corresponding architecture description file, for a set of benchmark circuits. T h e desired result would be to see that the measurements and estimates b o t h rank the power dissipation of the benchmark circuits i n the same order. A prerequisite for this verification is a representative set of benchmark circuits, described i n Section 7.2.1. 91 Bibliography [1] V . B e t z , J . Rose, a n d A . M a r q u a r d t , Architecture and CAD for Deep-Submicron FPGAs. Springer, M a r c h 1999. [2] S. J . E . W i l t o n , , S. C h i n , and K . K . W . P o o n , Field-Programmable Gate Array Architectures. Taylor a n d Francis, 2006. [3] A . Y e and J . Rose, "Using M u l t i - B i t Logic B l o c k s a n d A u t o m a t e d P a c k i n g to I m prove Field-Programmable Gate A r r a y Density for Implementing D a t a p a t h C i r c u i t s , " i n Proceedings of the 2004 IEEE International Conference on Field-Programmable Technology, 2004, pp. 129-136. [4] C . Piguet, E d . , Low Power CMOS Circuits: Technology, Logic Design and CAD Tools. N e w Y o r k : Taylor and Francis, 2006. [5] K . K . P o o n , "Power E s t i m a t i o n for F i e l d Programmable Gate A r r a y s , " Master's thesis, University of B r i t i s h C o l u m b i a , A u g u s t 2002. [6] A c t e l , Actel IGLOO Note, 2006. Flash Freeze Technology and Low Power Modes Application [7] C . T . C h o w , L . S. M . T s u i , P . H . W . Leong, W . L u k , and S. J . E . W i l t o n , " D y n a m i c Voltage Scaling for C o m m e r c i a l F P G A s , " i n Proceedings of the 2005 IEEE International Conference on Field-Programmable Technology, 2005, p p . 173-180. [8] F . L i , Y . L i n , a n d L . H e , " F P G A Power R e d u c t i o n U s i n g Configurable D u a l - V d d , " in Proceedings of the 41st Design Automation Conference, 2004, pp. 735-740. [9] V . George, H . Zhang, a n d J . Rabaey, "The Design of a L o w Energy F P G A , " i n Proceedings of the 1999 International Symposium on Low Power Electronics and Design, 1999, pp. 188-193. [10] J . H . Anderson a n d F . N . N a j m , "Power-aware Technology M a p p i n g for L U T - b a s e d F P G A s , " i n Proceedings of the 2002 IEEE International Conference on Field-Programmable Technology, 2002, pp. 211-218. [11] J . H . Anderson a n d F . N . N a j m , "Active Leakage Power O p t i m i z a t i o n for F P G A s , " IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 3, pp. 423-437, 2006. Bibliography 92 [12] R . Tessier, V . B e t z , D . Neto, a n d T . Gopalsamy, "Power-aware R A M M a p p i n g for F P G A Embedded M e m o r y Blocks," i n FPGA '06: Proceedings of the International Symposium on Field Programmable Gate Arrays. N e w Y o r k , N Y , U S A : A C M Press, 2006, pp. 189-198. [13] I. K u o n and J . Rose, "Measuring the G a p Between F P G A s a n d A S I C s , " i n FPGA '06: Proceedings of the International Symposium on Field Programmable Gate Arrays. New York, N Y , U S A : A C M Press, 2006, p p . 21-30. [14] A l t e r a , Cyclone II Device Handbook, 2005. [15] A l t e r a , Stratix II Device Handbook, 2005. [16] X i l i n x , Virtex II Platform FPGAs: Complete Datasheet, 2005. [17] UG073: XtremeDSP for Virtex-J,. User Guide, 1st ed., 2005. [18] S. Hauck, M . M . Hosier, a n d T . W . Fry, "High-performance C a r r y Chains for F P G A s , " i n FPGA '98: Proceedings of the 1998 ACM/SIGDA Sixth International Symposium on Field-Programmable Gate Arrays. N e w Y o r k , N Y , U S A : A C M Press, 1998, p p . 223-233. [19] S. D . Haynes, A . B . Ferrari, a n d P . Y . K . Cheung, " F l e x i b l e Reconfigurable M u l t i p l i e r Blocks Suitable for E n h a n c i n g the Architecture of F P G A s , " i n Proceedings of the IEEE 1999 Custom Integrated Circuits Conference, 1999, p p . 191-194. [20] M . J . Beauchamp, S. Hauck, K . D . Underwood, a n d S. K . Hemmert, "Embedded Floating-Point U n i t s i n F P G A s , " i n Proceedings of the 2006 ACM/SIGDA Fourteenth International Symposium on Field Programmable Gate Arrays., 2006, p p . 12-20. [21] V . B e t z , VPR and T-VPack User's Manual, ver 4.30, M a r c h 2000. [22] K . K . W . P o o n , S. J . E . W i l t o n , a n d A . Y a n , " A Detailed Power M o d e l for Field-Programmable Gate A r r a y s , " ACM Transactions on Design Automation of Electronic Systems (TODAES), 2004. [23] F . N . N a j m , " A Survey of Power E s t i m a t i o n Techniques i n V L S I C i r c u i t s , " IEEE Transactions on Very Large Scale Integration (VLSI) Systems, v o l . 2, no. 4, pp. 446-455, 1994. [24] G . Lemieux, E . Lee, M . T o m , a n d A . Y u , "Directional a n d Single-driver W i r e s i n F P G A Interconnect," i n Proceedings of the 2004 IEEE International Conference on Field-Programmable Technology, 2004, pp. 41-48. [25] A . Y a n g , "Design Techniques to Reduce Power Consumption," XCell Journal, 2005. [26] X i l i n x , Virtex-5 LX Platform Overview, 2006. [27] X i l i n x , UG070: Virtex-4 User Guide, 2006. Bibliography 93 A . Y e and J . Rose, "Using Bus-based Connections t o Improve Field-Programmable Gate A r r a y Density for Implementing D a t a p a t h C i r c u i t s , " i n Proceedings of the 2005 ACM/SIGDA Thirteenth International Symposium on Field Programmable Gate Arrays, 2005, p p . 3-13. K . Leijten-Nowak a n d J . L . van Meerbergen, " A n F P G A Architecture w i t h Enhanced D a t a p a t h Functionality," i n FPGA '03: Proceedings of the 2003 ACM/SIGDA Eleventh International Symposium on Field Programmable Gate Arrays. N e w Y o r k , N Y , U S A : A C M Press, 2003, p p . 195-204. A . Raghunathan, N . K . J h a , and S. Dey, High-level Optimization. Springer, November 1997. Power Analysis and S. B r o w n a n d Z . Vranesic, Fundamentals of Digital Logic with Verilog Design. M c G r a w - H i l l S c i e n c e / E n g i n e e r i n g / M a t h , A u g u s t 2002. G . K . Yeap, Practical Low Power Digital VLSI Design. Springer, A u g u s t 1997. T . Quarles, D . Pederson, R . Newton, A . Sangiovanni-Vincentelli, a n d C . Wayne, "Interactive S P I C E User Guide," website maintained by J a n Rabaey. [Online]. Available: h t t p : / / b w r c . e e c s . b e r k e l e y . e d u / C l a s s e s / I c B o o k / S P I C E / P. E . L a n d m a n a n d J . M . Rabaey, " A r c h i t e c t u r a l Power Analysis: T h e D u a l B i t T y p e M e t h o d , " IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 3, no. 2, p p . 173-187, 1995. D . Marculescu, R . Marculescu, a n d M . P e d r a m , "Information Theoretic Measures for Power Analysis," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, v o l . 15, no. 6, pp. 599-610, 1996. X i l i n x , " X i l i n x W e b Power Tools," A p r i l 2006. [Online]. Available: / / w w w . x i l i n x . c o m / products / design_resources / power . c e n t r a l / http: A l t e r a , PowerPlay Early Power Estimator User Guide for Stratix, Stratix GX, and Cyclone FPGAs, October 2005. X i l i n x , "Virtex-4 X P o w e r - E a r l y Power E s t i m a t o r v8.1," A p r i l 2006. [Online]. Available: http://www.xilinx.com/ise/power_tools/license_virtex4.htm A l t e r a , PowerPlay Early Power Estimator User Guide for Stratix II, Stratix II GX and Hardcopy II, December 2005. X i l i n x , Xilinx Development System Reference Guide 8.H, December 2005. A l t e r a , PowerPlay Power Analyzer, October 2005. E . M . Sentovich, K . J . Singh, L . Lavagno, C . M o o n , R . M u r g a i , A . Saldanha, H . Savoj, P . R . Stephan, R . K . B r a y t o n , a n d S. A . Vincentelli, "SIS: A System for Sequential C i r c u i t Synthesis," U C Berkeley, Tech. R e p . , 1992. [Online]. Available: http: //citeseer. ist. psu. edu/sentovich92sis. h t m l Bibliography 94 [43] J . C o n g arid Y . D i n g , " F l o w M a p : a n O p t i m a l Technology M a p p i n g A l g o r i t h m for Delay O p t i m i z a t i o n i n Lookup-table Based F P G A Designs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 13, no. 1, pp. 1-12, 1994. [44] G . G . F . Lemieux, S. D . B r o w n , a n d D . Vranesic, " O n Two-step R o u t i n g for F P G A S , " i n ISPD '97: Proceedings of the 1997 International Symposium on Physical Design. New Y o r k , N Y , U S A : A C M Press, 1997, pp. 60-66. [45] Y . - W . Chang, D . F . W o n g , a n d C . K . W o n g , "Universal Switch Modules for F P G A Design," ACM Transansactions on Design Automation of Electronic Systems, vol. 1, no. 1, pp. 80-101, January 1996. [46] S. J . E . W i l t o n , "Architectures a n d A l g o r i t h m s for Field-Programmable G a t e A r r a y s w i t h Embedded Memory," P h . D . dissertation, University of Toronto, 1997. [47] I. M . M a s u d a n d S. J . E . W i l t o n , " A N e w Switch B l o c k for Segmented F P G A s , " in FPL '99: Proceedings of the 9th International Workshop on Field-Programmable Logic and Applications. L o n d o n , U K : Springer-Verlag, 1999, pp. 274-281. [48] J . Lamoureux and S. J . E . W i l t o n , " A c t i v i t y E s t i m a t i o n For Field-Programmable Gate A r r a y s , " i n International Conference on Field-Programmable Logic and Applications, August 2006. [49] A . Chatterjee a n d R . K . Roy, "Synthesis of L o w Power Linear D S P C i r c u i t s U s i n g A c t i v i t y Metrics," i n Proceedings of the Seventh International Conference on VLSI Design, 1994, p p . 265-270. [50] S. J . E . W i l t o n , S. S. A n g , a n d W . L u k , " T h e Impact of P i p e l i n i n g o n Energy per Operation i n Field-Programmable G a t e A r r a y s , " i n International Conference on Field-Programmable Logic and its Applications. Springer-Verlag, A u g u s t 2004, p p . 719-728. [51] S. J . E . W i l t o n , " H H V P R M a n u a l , " Internal document, University of B r i t i s h C o l u m b i a , Tech. Rep., J u l y 2005. [52] P . Jamieson and J . Rose, " A Verilog R T L Synthesis T o o l for Heterogeneous F P G A s , " i n International Conference on Field Programmable Logic and Applications, 2005, 2005, pp. 305-310. [53] C . H . H o , P. H . W . Leong, W . L u k , S. J . E . W i l t o n , a n d S. Lopez-Buedo, " V i r t u a l Embedded Blocks: A Methodology for E v a l u a t i n g Embedded Elements i n F P G A s , " in 1 4 t h Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 2006. Appendix A. Glitching Characterization Attempt 95 Appendix A Glitching Characterization Attempt T h i s A p p e n d i x describes several unsuccessful attempts to account for glitches i n the power characterization scheme described i n Chapter 5. Figure A . l illustrates the flow we used for characterizing the power of the D S P blocks. A Verilog description of the D S P block was synthesized to gates using Synopsys Design Compiler. A Verilog testbench was used to simulate a set of input vectors applied to the gate-level description of the D S P block i n V e r i l o g - X L . T h e simulation d a t a was then fed to the Synopsys PrimePower simulator to obtain characterization information. For input activities less t h a n one, the testbench applied exactly one vector to the inputs of the D S P block each clock cycle. T h e activity at each input could be controlled by keeping the value of the bit or changing the value. T o attempt to imitate glitching, we generated testbenches where more than one vector was applied during each clock cycle. For example, to attempt to characterize the power for input activities of 1.5, a vector file w i t h activity 0.5 (assuming 1 vector per clock cycle) was read i n and 3 vectors were applied each clock cycle. We observed that applying the vectors for a particular clock cycle at equally spaced intervals i n the clock cycle, as shown i n the testbench pseudocode i n Figure A . 2 , does not imitate glitching properly. $T_fraction is the length of the equally spaced intervals and Appendix A. Glitching Characterization Attempt Technology DSP Block Library Verilog Description Synthesize to Gates with DC Verilog Testbench + Input Vectors Simulate with Verilog-XL Analyze Power with PrimePower Power Estimates Figure A . l : Power Characterization Flow reg [0:$inbits_ind] test_vector_input[0:N-1]; reg [0:$inbits_ind] inputs; task test_top; integer i, j ; begin © ( p o s e d g e clock); @ ( n e g e d g e clock); reset = 0; © ( p o s e d g e clock); for (i=0; (i+$integer_multiple) <= N; i=i+$integer_multiple) begin inputs = test_vector_input[i]; #$T_fraction; inputs = test_vector_input[i+1]; #$T_fraction; inputs = test_vector_input[i+($integer_multiple-1)]; #$T_fraction; © ( p o s e d g e clock); end end endtask Figure A.2: Testbench Pseudocode 96 Appendix A. Glitching Characterization Attempt 97 ^integer.multiple is the number of vectors t o apply during each clock cycle. For example, from Figure 5.3 we would expect that using a vector file w i t h activity 0.5 (assuming 1 vector per clock cycle) a n d a p p l y i n g two vectors per cycle t o achieve an effective input activity of 1.0 would give P r i m e P o w e r estimates greater t h a n the results for 0.9 input activity. For a D S P block where only input transitions during the high part of the clock cycle have an effect, the P r i m e P o w e r estimates were approximately equal to the results for 0.5 input activity instead because only half the transitions h a d any effect. T o avoid this problem, a second method was attempted where %T.fraction was set to half the period divided by $integer_multiple, so that a l l the vectors w o u l d be applied during the high part of the clock cycle. T h i s led t o overestimates of the power because, i n reality, glitching at the inputs to the D S P block can take place i n any part of the clock cycle. To properly imitate glitching for characterization, a more sophisticated testbench generator would be required that incorporates some sort of statistical or probabilistic model that dictates when vectors should be applied to the D S P block inputs. Appendix B. Detailed Results from Comparison 98 Appendix B Detailed Results from Comparison Table B . l : F I R F i l t e r Results for 0.25 Average Input A c t i v i t y Vector Set Multiplier HHVPR Energy PrimePower Energy % Error vec_025_0_a mult_rtl_0 mult_rtl_l mult_rtl_2 6.93E-11 6.92E-11 6.31E-11 6.01E-11 5.87E-11 6.07E-11 9.8% 15.1% . 17.8% 14.2% 6.14E-11 6.05E-11 mult_rtl_2 mult_rtl_3 6.95E-11 6.92E-11 6.93E-11 6.97E-11 13.1% 14.3% 19.1% 14.5% mult_rtl_0 mult_rtl_l 6.95E-11 6.93E-11 mult_rtl_2 mult_rtl_3 6.95E-11 6.95E-11 6.32E-11 5.92E-11 5.95E-11 mult_rtl_0 6.92E-11 6.91E-11 6.24E-11 5.90E-11 6.96E-11 6.96E-11 6.02E-11 mult_rtl_3 vec_025_0_b vec_025.0_c vec_025_0_d mult_rtl_0 mult_rtl_l mult_rtl_l mult_rtl_2 mult_rtl_3 vec_025_0_e 6.92E-11 6.93E-11 5.82E-11 6.08E-11 6.12E-11 5.93E-11 mult_rtl_0 mult_rtl_l mult_rtl_2 6.95E-11 6.95E-11 6.95E-11 ' 6.25E-11 6.06E-11 5.92E-11 mult_rtl_3 6.97E-11 6.17E-11 average percentage error 10.0% 17.1% 16.8% 13.5% 10.9% 17.2% 17.3% 15.5% 11.2% 14.7% 17.3% 12.9% average for just multiplier instance # 0 14.6% 11.0% average for just multiplier instance # 1 average for just multiplier instance # 2 average for just multiplier instance # 3 15.7% 17.7% 14.1% Appendix B. Detailed Results from Comparison 99 Table B . 2 : F I R F i l t e r Results for 0.50 Average Input A c t i v i t y Vector Set Multiplier HHVPR Energy PrimePower Energy vec_050_0_a mult_rtl_0 mult_rtl_l mult_rtl_2 mult_rtl_3 8.92E-11 8.90E-11 8.95E-11 8.93E-11 1.09E-10 1.04E-10 1.04E-10 1.06E-10 18.2% 14.4% 14.0% 15.5% vec_050_0.b mult_rtl_0 mult_rtl_l mult_rtl_2 mult_rtl_3 8.92E-11 8.92E-11 8.94E-11 8.92E-11 1.10E-10 1.04E-10 1.03E-10 1.07E-10 19.2% 14.1% 13.4% 16.2% vec_050_0_c mult_rtl_0 8.93E-11 8.92E-11 1.11E-10 1.04E-10 19.4% 14.2% 8.95E-11 8.95E-11 1.04E-10 1.06E-10 14.2% 15.4% mult_rtl_0 mult _rtl_l 8.91E-11 8.94E-11 mult_rtl_2 mult_rtl_3 8.90E-11 8.94E-11 1.07E-10 1.05E-10 1.01E-10 1.07E-10 16.8% 15.0% 12.3% 16.2% mult_rtl_0 mult_rtl_l 8.93E-11 8.92E-11 1.12E-10 1.04E-10 20.6% 14.5% mult_rtl_2 mult_rtl_3 8.94E-11 8.92E-11 1.02E-10 1.06E-10 12.2% 15.7% average percentage error average for just multiplier instance #0 average for just multiplier instance #1 15.6% average for just multiplier instance # 2 average for just multiplier instance # 3 13.2% mult_rtl_l mult_rtl22 mult_rtl_3 vec_050_0_d yec_050.0.e % Error 18.9% 14.4% 15.8% Appendix B. Detailed Results from Comparison 100 Table B . 3 : F I R F i l t e r Results for 0.75 Average Input A c t i v i t y Vector Set Multiplier HHVPR Energy PrimePower Energy % Error vec_075_0_a mult_rtl_0 9.97E-11 9.96E-11 9.95E-11 9.96E-11 1.52E-10 1.42E-10 1.38E-10 1.44E-10 34.3% 29.7% 27.9% 30.7% mult_rtl_0 mult_rtl_l mult_rtl_2 mult_rtl_3 9.96E-11 9.96E-11 9.97E-11 9.96E-11 1.52E-10 1.41E-10 1.38E-10 1.43E-10 34.6% 29.2% 27.7% mult_rtl_0 mult_rtl_l mult_rtlJ2 mult_rtl_3 9.96E-11 9.95E-11 9.95E-11 1.49E-10 33.3% 9.96E-11 1.41E-10 1.38E-10 1.47E-10 29.4% 27.7% 32.4% mult_rtl_0 mult_rtl_l mult_rtl_2 9.96E-11 9.97E-11 9.97E-11 1.55E-10 1.39E-10 1.41E-10 35.8% 28.3% 29.4% mult_rtl_3 9.97E-11 1.50E-10 33.6% mult_rtl_0 mult_rtl_l 9.96E-11 9.96E-11 9.97E-11 9.96E-11 1.55E-10 1.39E-10 35.9% 28.4% 1.37E-10 1.44E-10 27.4% 30.9% average percentage error average for just multiplier instance #0 30.9% 34.8% average for just multiplier instance #1 average for just multiplier instance # 2 29.0% 28.0% average for just multiplier instance 31.6% mult_rtl_l mult_rtl_2 mult_rtl_3 vec_075_0_b vec_075_0_c vec_075_0.d vec_075_0_e mult_rtl_2 mult_rtl_3 . #3 30.5% Appendix B. Detailed Results from Comparison 101 Table B . 4 : Differential E q u a t i o n Solver Results for 0.25 Average Input A c t i v i t y Vector Set DSP Block PVPR Energy PrimePower Energy % Error vec_025_0_a dspblockO dspblockl dspblock2 9.51E-10 1.01E-09 1.12E-09 4.16E-09 1.52E-09 1.65E-09 77.1% 33.3% 32.2% dspblockO dspblockl dspblock2 9.04E-10 4.03E-09 1.33E-09 1.58E-09 77.6% 21.8% 22.2% dspblockO dspblockl dspblock2 8.95E-10 3.88E-09 1.28E-09 1.49E-09 76.9% 21.2% 17.4% dspblockO dspblockl dspblock2 1.05E-09 1.04E-09 1.17E-09 4.24E-09 75.4% 1.31E-09 1.73E-09 21.0% 32.1% dspblockO dspblockl dspblock2 8.02E-10 3.76E-09 1.22E-09 78.7% 17.7% 12.0% vec.025.0_b vec.025.0-c vec.025.0_d vec.025.0_e 1.04E-09 1.23E-09 1.01E-09 1.23E-09 1.00E-09 1.12E-09 1.27E-09 average percentage error average for just D S P block instance # 0 average for just D S P block instance #1 average for just D S P block instance # 2 41.1% 77.1% 23.0% 23.2% Appendix B. Detailed Results from Comparison Table B . 5 : Differential E q u a t i o n Solver Results for 0.50 Average Input A c t i v i t y Vector Set DSP Block PVPR Energy PrimePower Energy vec.050_0_a dspblockO dspblockl dspblock2 8.88E-10 1.06E-09 3.79E-09 1.26E-09 1.22E-09 1.51E-09 dspblockO dspblockl dspblock2 9.24E-10 3.88E-09 1.06E-09 1.29E-09 1.36E-09 1.66E-09 dspblockO dspblockl dspblock2 8.78E-10 1.05E-09 4.14E-09 dspblockO 9.59E-10 1.04E-09 vec_050_0_b vec_050.0.c % Error 76.6% 12.9% 16.9% 76.2% 21.9% 22.5% 1.73E-09 1.75E-09 78.8% 39.2% 31.6% 3.97E-09 1.38E-09 1.54E-09 75.9% 24.8% 16.9% 3.87E-09 1.25E-09 1.62E-09 76.2% 15.1% 19.0% average percentage error average for just D S P block instance #0 40.3% 76.7% average for just D S P block instance #1 average for just D S P block instance # 2 22.8% 21.4% vec.050.0_d dspblockl dspblock2 vec.050_0_e dspblockO dspblockl dspblock2 1.20E-09 1.28E-09 9.19E-10 1.06E-09 1.31E-09 102 Appendix B. Detailed Results from Comparison 103 Table B . 6 : Differential E q u a t i o n Solver Results for 0.75 Average Input A c t i v i t y Vector Set DSP Block PVPR Energy PrimePower Energy vec_075_0_a dspblockO dspblockl dspblock2 8.95E-10 3.89E-09 1.03E-09 1.23E-09 1.44E-09 1.61E-09 dspblockO dspblockl dspblock2 8.70E-10 1.01E-09 3.82E-09 1.27E-09 1.25E-09 1.64E-09 dspblockO dspblockl 8.83E-10 3.78E-09 1.01E-09 1.27E-09 1.26E-09 dspblock2 dspblockO dspblockl dspblock2 9.13E-10 1.03E-09 1.24E-09 3.94E-09 1.40E-09 1.69E-09 76.8% 26.2% 26.5% dspblockO 9.37E-10 3.84E-09 75.6% dspblockl dspblock2 1.01E-09 1.18E-09 1.19E-09 3.84E-09 14.9% 69.3% average percentage error 43.7% average for just D S P block instance # 0 average for just D S P block instance #1 average for just D S P block instance # 2 76.7% 21.7% 32.7% vec_075_0_b vec_075_0_c vec_075.0_d vec_075_0_e 1.62E-09 % Error 77.0% 28.4% 23.5% 77.2% 19.3% 22.7% 76.6% 19.7% 21.6%
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Activity-based power estimation and characterization...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Activity-based power estimation and characterization of DSP and multiplier blocks in FPGAs Choy, Nathalie Chan King 2006
pdf
Page Metadata
Item Metadata
Title | Activity-based power estimation and characterization of DSP and multiplier blocks in FPGAs |
Creator |
Choy, Nathalie Chan King |
Date Issued | 2006 |
Description | Battery-powered applications and the scaling of process technologies and clock frequencies have made power dissipation a first class concern among FPGA vendors. One approach to reduce power dissipation in FPGAs is to embed coarse-grained fixed-function blocks that implement certain types of functions very efficiently. Commercial FPGAs contain embedded multipliers and "Digital Signal Processing (DSP) blocks" to improve the performance and area efficiency of arithmetic-intensive applications. In order to evaluate the power saved by using these blocks, a power model and tool flow are required. This thesis describes our development and evaluation of methods to estimate the activity and the power dissipation of FPGA circuits containing embedded multiplier and DSP blocks. Our goal was to find a suitable balance between estimation time, modeling effort, and accuracy. We incorporated our findings to create a power model and CAD tool flow for these circuits. Our tool flow builds upon the Poon power model, and the Versatile Place and Route (VPR) CAD tool, which are both standard academic experimental infrastructure. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2010-01-08 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0064950 |
URI | http://hdl.handle.net/2429/17877 |
Degree |
Master of Applied Science - MASc |
Program |
Electrical and Computer Engineering |
Affiliation |
Applied Science, Faculty of Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2006-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- 831-ubc_2006-0394.pdf [ 5.54MB ]
- Metadata
- JSON: 831-1.0064950.json
- JSON-LD: 831-1.0064950-ld.json
- RDF/XML (Pretty): 831-1.0064950-rdf.xml
- RDF/JSON: 831-1.0064950-rdf.json
- Turtle: 831-1.0064950-turtle.txt
- N-Triples: 831-1.0064950-rdf-ntriples.txt
- Original Record: 831-1.0064950-source.json
- Full Text
- 831-1.0064950-fulltext.txt
- Citation
- 831-1.0064950.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0064950/manifest