
UBC Theses and Dissertations


A power evaluation framework for FPGA applications and CAD experimentation Dueck, Stuart 2013


Full Text


A Power Evaluation Framework for FPGA Applications and CAD Experimentation

by Stuart Dueck

B.A.Sc., University of British Columbia, 2010

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE STUDIES (Electrical and Computer Engineering)

The University Of British Columbia (Vancouver)

June 2013

© Stuart Dueck, 2013

Abstract

Field-Programmable Gate Arrays (FPGAs) consume roughly 14 times more dynamic power than Application Specific Integrated Circuits (ASICs) making it challenging to incorporate FPGAs in low-power applications. To bridge the gap, power consumption in FPGAs needs to be addressed at the application, Computer-Aided Design (CAD) tool, architecture, and circuit levels. The ability to properly evaluate proposals to reduce the power dissipation of FPGAs requires a realistic and accurate experimental framework. Mature FPGA power models are flexible, but can suffer from poor accuracy due to estimations on signal activity and simplifications. Additionally, run-time increases with the size of the design. Other techniques use unrealistic assumptions while physically measuring the power of a circuit running on an FPGA. Neither of these techniques can accurately portray the power consumption of FPGA circuits.

We propose a framework to allow FPGA researchers to evaluate the impact of proposals for the reduction of power in FPGAs. The framework consists of a real-world System-on-Chip (SoC) and can be used to explore algorithmic and CAD techniques, by providing the ability to measure the power at run-time. High-level access to common low-level power-management techniques, such as clock gating, Dynamic Frequency Scaling (DFS), and Dynamic Partial Reconfiguration (DPR), is provided. We demonstrate our framework by evaluating the effects of pipelining and DPR on power.
We also reason why our framework is necessary by showing that it provides different conclusions than those of previous work.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1  Introduction
    1.1  Motivation
    1.2  Research Contributions and Significance
    1.3  Thesis Organization
2  Background and Previous Work
    2.1  Overview of FPGAs
        2.1.1  FPGA Architecture
        2.1.2  Power Implications of FPGAs
        2.1.3  CAD Tool Flow
    2.2  Dynamic Partial Reconfiguration
        2.2.1  Dynamic Partial Reconfiguration Design
        2.2.2  Dynamic Partial Reconfiguration Simulation
    2.3  Power Analysis
        2.3.1  Power Estimation
        2.3.2  Direct Power Measurement
    2.4  Power Analysis Frameworks
        2.4.1  FPGA-Specific Power Estimation
        2.4.2  FPGA Power Measurement Frameworks
    2.5  Focus and Contribution of this Thesis
3  Desirable Characteristics of the Platform
    3.1  Realistic
    3.2  Flexible
    3.3  Available
    3.4  Easy to Use
        3.4.1  Hooks
        3.4.2  Obtaining Results
    3.5  Summary
4  Description of our Framework
    4.1  Overview
    4.2  The System
        4.2.1  Hardware
        4.2.2  Software
    4.3  Development Environment and Tools
    4.4  Extending the LEON3 System-on-Chip
        4.4.1  Measurement Results
        4.4.2  Dynamic Partial Reconfiguration and Internal Configuration Access Port
        4.4.3  Dynamic Clock Generation
        4.4.4  Clock Gating
        4.4.5  Kernel Modules
    4.5  Potential Experiments for this Framework
    4.6  Summary
5  Characterization and Evaluation of our Framework
    5.1  Characterization of our Framework
        5.1.1  Methodology
        5.1.2  Results and Analysis
    5.2  Framework Demonstration: An Evaluation of Pipelining
        5.2.1  Motivation for Pipelining
        5.2.2  Methodology
        5.2.3  Results and Analysis
    5.3  An Evaluation of Dynamic Partial Reconfiguration
        5.3.1  Methodology
        5.3.2  Results and Analysis
    5.4  Summary
6  Conclusion
    6.1  Thesis Summary
    6.2  Limitations
    6.3  Future Work
        6.3.1  Short-Term
        6.3.2  Long-Term
    6.4  Public Distribution
Bibliography

List of Tables

Table 2.1  A List of Simulation Techniques and Simulation Simplifications
Table 2.2  A List of Probabilistic Attributes and Simplifications
Table 2.3  Comparison and Breakdown of Previous SoC FPGA Frameworks
Table 5.1  Configuration of the LEON3 SoC
Table 5.2  Xilinx Version 14.1 CAD Tool Run-times for Base Implementation

List of Figures

Figure 2.1  Basic island style FPGA architecture.
Figure 2.2  A basic logic element without the configurable SRAM bits shown.
Figure 2.3  Typical FPGA CAD Flow.
Figure 2.4  Example usage of DPR.
Figure 2.5  Proxy logic.
Figure 2.6  Power analysis - levels of abstraction.
Figure 2.7  Error in power measurement.
Figure 3.1  Test circuit input assumptions.
Figure 4.1  High-level view of the LEON3 SoC.
Figure 4.2  The LEON3 SoC design flow.
Figure 4.3  Overview of GRMON.
Figure 4.4  Overview of the dynamic region setup.
Figure 4.5  System Monitor primitive.
Figure 4.6  ICAP controller implementation.
Figure 4.7  A simple representation of the FPGA configuration layer.
Figure 4.8  MMCM primitive.
Figure 4.9  Clock gating circuitry with clock domain crossing synchronizer.
Figure 4.10  Hardware experiment.
Figure 4.11  Software experiment.
Figure 4.12  DPR experiment.
Figure 4.13  Accelerator experiment.
Figure 5.1  Modifying the GroundHog AES circuit.
Figure 5.2  Idle power versus LEON3 implementation.
Figure 5.3  AES power versus LEON3 implementation.
Figure 5.4  AES temperature versus LEON3 implementation.
Figure 5.5  CAD tool flow for integrated, dynamic, and stand-alone AES circuits.
Figure 5.6  Stand-alone and LEON3 AES circuit architecture.
Figure 5.7  The power consumption of running AES encryption versus number of pipelining stages.
Figure 5.8  The temperature of running AES encryption versus number of pipelining stages.
Figure 5.9  The power consumption of the LEON3 running AES encryption versus number of pipelining stages.
Figure 5.10  The temperature of the LEON3 running AES encryption versus number of pipelining stages.
Figure 5.11  Total CAD tool run-times versus number of pipelining stages.
Figure 5.12  CAD tool run-times for LEON3 with integrated AES versus number of pipelining stages.
Figure 5.13  Configuration and reboot time for full and partial reconfiguration.
Figure 5.14  Implementation before, during, and after internal partial reconfiguration versus power.
Figure 5.15  The power consumption of various AES implementations versus number of pipelining stages.
Figure 5.16  The relative power consumption compared to combinational AES encryption versus number of pipelining stages.
Glossary

AES  Advanced Encryption Standard
AHB  Advanced High-performance Bus
AMBA  Advanced Microcontroller Bus Architecture
APB  Advanced Peripheral Bus
ASIC  Application-Specific Integrated Circuit
BCC  Bare C Cross-Compiler
BRAM  Block Random Access Memory
CAD  Computer-Aided Design
CPU  Central Processing Unit
DFS  Dynamic Frequency Scaling
DMA  Direct Memory Access
DPR  Dynamic Partial Reconfiguration
DRP  Dynamic Reconfiguration Port
FPGA  Field-Programmable Gate Array
GRLIB  Gaisler Research IP Library
GRMON  Gaisler Research Monitor
HDL  Hardware Description Language
ICAP  Internal Configuration Access Port
IP  Intellectual Property
LFSR  Linear Feedback Shift Register
MMCM  Mixed Mode Clock Manager
RM  Reconfigurable Module
RP  Reconfigurable Partition
OS  Operating System
SoC  System-on-Chip
SysMon  System Monitor
VHDL  Very-High-Speed Integrated Circuits Hardware Description Language
XST  Xilinx Synthesis Technology

Acknowledgments

First I would like to thank my supervisor, Dr. Steven Wilton, for his support, advice, and vision during my time at the University of British Columbia. This thesis would not have been possible without his time and guidance. Dr. Wilton’s contribution to my education and research has been invaluable. I wish to thank Dr. Brad Quinton for his mentoring and advice.

I am thankful for the company and insight into my research provided by the other members of the group: Kyle Balston, Assem Bsoul, Scott Chin, Joydip Das, Jeff Goeders, Marcel Gort, and Eddie Hung, as well as any others. My time in the lab was greatly improved by our daily interactions. Also, I would like to thank Roberto Rosales and Roozbeh Mehrabadi for their help and management of the System-on-Chip group and labs.

I would especially like to thank my family for their unwavering support and encouragement in all my years of schooling. I thank my parents for their influence in my life and on my core values.
Finally, I could not have done any of this without the motivation and patience of my wife, Vanessa. I dedicate this thesis to her and our shortly expected baby boy.

Chapter 1

Introduction

1.1  Motivation

Over the past 50 years, advances in Integrated Circuit (IC) fabrication technology have provided the ability to implement more transistors than ever before on an integrated circuit. This has led to the availability of an unprecedented amount of cheap computing power, which has had a dramatic impact on almost every aspect of society. This exponential progress in semiconductor technology, however, has increased the complexity (and hence cost) of designing and implementing an integrated circuit. Today, only a very few large enterprises are capable of creating state-of-the-art chips; only the large microprocessor, graphics engine, System-on-Chip (SoC) and Field-Programmable Gate Array (FPGA) companies are able to build leading-edge devices. As a result, many companies are now turning to FPGAs to implement digital circuitry. FPGAs can be configured to implement almost any digital circuit, allowing designers to immediately test designs without the cost, risk and delay of producing a Very Large Scale Integration (VLSI) implementation.

In the past, FPGAs have been primarily used for prototyping; today, they are being widely used in manufactured projects. We assert that in the near future, generic programmable logic devices such as FPGAs will be the substrate upon which virtually all applications are built.

One of the key areas where FPGAs have not yet become ubiquitous is in low-power applications, which are becoming an ever-larger portion of the electronics market. FPGAs have not been employed in many low-power systems because their energy consumption (or dissipation) is simply too high. Studies have shown that FPGAs consume roughly 14 times more dynamic power than Application Specific Integrated Circuits (ASICs) [61].
Mobile devices are very sensitive to energy consumption: a modern BlackBerry device consumes 9.25 mW of power while idle and 740 mW during a phone call [104]. With current battery technologies providing between 3700 mWh and 5550 mWh, anything that increases the energy dissipation of the device is unwelcome.

Reducing the power dissipation of FPGAs has been an active research area. Researchers have proposed changes to FPGA architectures and FPGA circuit design, in an attempt to lower the power of a circuit implemented on an FPGA [15, 33, 42, 62, 65, 71, 77, 100, 116, 123]. In parallel, researchers have investigated how the associated Computer-Aided Design (CAD) synthesis algorithms can be modified to better optimize a user circuit for power [10, 11, 28, 64, 113, 114]. Finally, researchers have addressed power consumption at an application level; often by reformulating the problem or modifying the algorithm to be implemented, significant power savings can be obtained [13, 76, 87, 146]. Only by attacking the power problem at each of these levels can low-power implementations be achieved.

A key challenge in all of these studies is the evaluation of each proposal. Ideally, researchers proposing new architectures or circuits would build new chips and test the power dissipation of these chips directly. The cost, delay, and complexity of implementing a chip in a modern process make this infeasible for even the largest research groups. Researchers proposing new CAD algorithms would ideally implement these algorithms in commercial tools, use them to map real applications to an FPGA, and again measure the power dissipation directly [46, 128]. Although some support for integrating new algorithmic ideas into commercial tools exists [67, 80, 118], in many cases, these “hooks” supplied by FPGA vendors do not provide enough control. In addition, realistic circuits and environments in which to test the CAD tools are often lacking.
As a result, many researchers turn to detailed power models [98] and experimental open-source CAD tools [105]. These experimental CAD tools map benchmark circuits to a modeled architecture, and the power model estimates the power dissipation of the implementation. While very flexible, the accuracy of these power models can be poor. In addition, these models evaluate the power dissipation of a benchmark in isolation, often assuming random inputs. In reality, circuits are usually integrated together either on a chip or between chips, and the interaction between these circuits will have a significant impact on the power dissipation. In particular, the interaction between circuits is essential for understanding input activity, which is critical to achieve realistic power estimates.

In this thesis, we describe an experimental platform (or framework) designed to allow FPGA researchers to better evaluate the impact of proposals for reducing power in FPGAs. This platform addresses some of the shortcomings of the modeling and experimental approaches described above. Rather than relying on questionable power models and activity estimation techniques, our framework can be used with measurement equipment to obtain accurate power measurements (although it can also be used with existing estimation techniques if measurement equipment is not available). The platform provides a way of evaluating an optimized circuit in the framework of a large SoC (we use the LEON3 SoC from Aeroflex Gaisler running a Linux operating system), with realistic interfaces between the optimized circuit and the rest of the system. Our platform is flexible enough to be useful for a large number of power dissipation studies. It is intended for researchers proposing CAD or application-level techniques. We believe architectural and circuit-level techniques are still best evaluated using a modeling framework as in [45].
1.2  Research Contributions and Significance

More precisely, this research has the following two contributions:

1. To design an experimental platform that will allow FPGA researchers to evaluate proposals for power reduction techniques. The platform must allow researchers to evaluate their ideas in the context of a large real-world SoC. The framework must be flexible enough to accommodate a variety of FPGA investigations, and accurate enough that the evaluation results are meaningful. The platform should also afford the user the opportunity to swap test circuits in and out in a timely manner.

2. To demonstrate the use of this platform when evaluating the power implications of three common power reduction techniques: clock gating, pipelining, and Dynamic Partial Reconfiguration (DPR).

Achieving these goals will be significant. Future researchers will be able to use our framework as an integral part of their experimental flow. This will enable researchers to better evaluate their ideas, ensuring that research in this area leads to meaningful power savings. This, in turn, will lead to lower-power FPGAs, providing the benefits of flexibility to many more designers than ever before.

1.3  Thesis Organization

This thesis is organized as follows. Chapter 2 will provide an overview of existing FPGA architecture and CAD techniques, and describe ideas that have been proposed to reduce the power of these devices, with an emphasis on how these ideas were evaluated. Chapter 3 outlines the desirable features of our experimental platform (the requirements). Chapter 4 describes our platform, which was built around the existing LEON3 SoC. Chapter 5 shows an example of how our framework can be used to evaluate common power reduction techniques. Finally, Chapter 6 concludes this thesis.

Chapter 2

Background and Previous Work

This chapter begins with an overview of FPGAs in Section 2.1 and an introduction to DPR in Section 2.2.
This is followed by an exploration of power analysis techniques and methodologies used in past research. Next, in Section 2.4, we provide a discussion of previous power analysis frameworks. Lastly, Chapter 2 is concluded with the focus and contributions of this thesis in Section 2.5.

2.1  Overview of FPGAs

This section provides an overview of FPGA architecture, a background of the FPGA CAD tool flow, and the power implications of FPGAs. An FPGA is a prefabricated integrated circuit which can be configured and reconfigured to implement a variety of digital circuits. FPGAs are commonly used for prototyping, emulating, and implementing hardware designs. Almost any digital logic function that can be executed on an ASIC can also be implemented on an FPGA. The main advantages of FPGAs over ASICs are their flexibility, ability to be reprogrammed, and low cost for small volumes. This programmability is typically provided through the use of programmable switches, which for the sake of simplicity can be thought of as a Static Random Access Memory (SRAM) configuration bit attached to the gate of a pass transistor, although other types of FPGAs also exist [23]. Hardware designs are typically written using a Hardware Description Language (HDL), such as Very-High-Speed Integrated Circuits Hardware Description Language (VHDL) or Verilog, to specify the behaviour which is then translated into gates and mapped to an FPGA by CAD tools.

2.1.1  FPGA Architecture

A simple island style FPGA is comprised of a two-dimensional array of logic resources, called logic blocks, surrounded by vertical and horizontal routing tracks, along with Input/Output (I/O) pads, as shown in Figure 2.1. Basic logic cells are packed together into logic blocks (or clusters). A block will have a fixed number of inputs and outputs with an intra-cluster routing network. These local connections are shorter and have lower latency than global connections between clusters.
Logic blocks are connected to routing tracks via connection blocks. Each I/O pin of the logic block can connect to a subset of the adjacent routing tracks through these connection blocks. A group of horizontal or vertical routing tracks is referred to as a channel. Channels are connected to each other in switch blocks through the use of programmable switches.

[Figure 2.1: Basic island style FPGA architecture, showing the I/O pads, logic blocks, and routing network.]

Each logic block typically contains several logic cells. A basic logic cell contains a configurable Look-Up-Table (LUT) to evaluate Boolean logic, along with a flip-flop and multiplexer to allow a combinational or synchronous output. Figure 2.2 shows an example of a basic logic element. The multiplexer is controlled by a configurable SRAM bit and, depending on its value, enables a combinational or synchronous output. A LUT can implement any Boolean logic function of its inputs; the number of possible functions depends on the number of inputs. A K-input LUT stores 2^K configuration bits, one for each input combination, and can therefore implement any of the 2^(2^K) Boolean functions of its K inputs. Due to this flexibility, LUTs require more area and use more energy than a fixed logic gate. Recently Xilinx has moved from four inputs to six inputs for all LUTs as they claim the latter gives a better trade-off between performance and area [130]. For example, a basic six-input LUT has sixty-four (2^6) configuration bits. Real LUTs are even more complicated as they can often be configured into two fracturable LUTs that can share some of their inputs [70]. This extra circuitry allows LUTs to be flexible, but also increases the total area and adds capacitance to the internal nodes. This causes extra dynamic power to be dissipated when these nodes switch.

[Figure 2.2: A basic logic element (a LUT followed by a flip-flop, FF) without the configurable SRAM bits shown.]

Similar to LUTs, configurable switches are also used to connect routing tracks to one another or to logic block I/O pins [23].
These configurable switches add to the area and capacitance of routing wires, which have been noted to dissipate more than 50% of the dynamic power in FPGAs [62, 110].

In addition, most modern FPGAs also include hard logic blocks, also known as embedded blocks, to allow faster, denser, lower power implementations of common hardware such as Digital Signal Processors (DSPs), memory blocks, or processors. This results in a more efficient use of the FPGA area for many circuits which need these blocks. Most FPGAs also include dedicated global and regional clock circuitry to reduce skew and other timing problems that could result from routing clock signals on the general purpose interconnect.

2.1.2  Power Implications of FPGAs

FPGAs are less efficient than ASICs in several ways. While modern ASIC processors can obtain clock speeds of 3 GHz and above, clock signals on state-of-the-art FPGAs are limited to 1 GHz [142]. However, typical designs often run in the hundreds of MHz range. As mentioned previously, the configurability and flexibility of FPGAs is afforded through extra programmable switches. Routing tracks and switch box topologies are fabricated without any knowledge of the future circuits that will be run on the FPGA. This generic implementation of FPGAs causes more parasitic capacitance compared to an ASIC counterpart. In turn, this increases the power consumption of FPGA devices. According to a recent study, for logic-only implementations FPGAs use on average 35 times the area, run up to 4.6 times slower, and consume 14 times more dynamic power than ASICs [61].

2.1.3  CAD Tool Flow

The FPGA CAD tool flow consists of several steps, as shown in Figure 2.3. Logic synthesis is the conversion of a circuit description into an optimized gate representation. Technology mapping takes the gate logic and maps it to LUTs and flip-flops.
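To make the technology mapping step more concrete, the following minimal sketch shows how a small Boolean function could be turned into the configuration bits of a single K-input LUT, and how the LUT then evaluates its inputs by indexing into those bits. It assumes an idealized, non-fracturable LUT; the names `lut_config_bits`, `lut_eval`, and `majority` are hypothetical and are not taken from the thesis or from any vendor tool.

```python
from itertools import product


def lut_config_bits(func, k):
    """Return the 2**k configuration bits of an idealized k-input LUT.

    Bit i holds the output of `func` for the input pattern whose binary
    value is i (the first input is the most significant bit).
    """
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=k)]


def lut_eval(config, inputs):
    """Evaluate the LUT by using the inputs as an index into the bits."""
    index = 0
    for bit in inputs:
        index = (index << 1) | bit
    return config[index]


def majority(a, b, c):
    """Hypothetical 3-input gate-level function used as the example."""
    return (a & b) | (a & c) | (b & c)


config = lut_config_bits(majority, 3)
print(len(config), config)
print(lut_eval(config, (1, 0, 1)))
```

A real mapper must also decide how to split a large gate network across many LUTs; this sketch covers only the final step of filling in one LUT.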
The packing phase groups basic logic elements together into logic blocks based on several possible factors such as speed, power, and area. Placement positions each logic block on the FPGA. Routing programs the interconnect to connect the inputs and outputs of logic blocks with each other and with the I/O pins of the FPGA. Lastly, a bitfile is generated which has the configuration data for every programmable SRAM.

There are many constraints in the CAD flow, such as speed, power, and area, and also FPGA architecture parameters such as the number of inputs per LUT, inputs per logic block, routing channel width, and routing track length. The CAD tools need to take this all into account and produce a working bitfile. This can take several hours for large designs. As FPGA and design sizes grow, so will the time to synthesize these designs. Incremental CAD is a technique that can help speed up synthesis. Normally any change in a design, no matter how small, requires the full synthesis flow to be rerun. Incremental CAD uses the previous results to help the next iteration run faster. If the two input circuits are similar then it is possible that the previous iteration choices can be duplicated to save time [35].

2.2  Dynamic Partial Reconfiguration

Dynamic Partial Reconfiguration (DPR) [138] (also referred to as run-time configuration) is a technique for reconfigurable devices where the configuration of part of the device can be modified while the remainder of the device is running. This procedure allows FPGA resources to be time multiplexed, as shown in Figure 2.4, thereby allowing a large circuit to fit on a smaller FPGA and/or reducing power dissipation. A region is defined as a combination of FPGA resources: LUTs, hard blocks (e.g., DSPs and BRAMs), and routing fabric [132].
A dynamic region is the area(s) of the FPGA which has been allocated to allow reconfiguration, while static is used to describe those areas which must keep a permanent configuration for the lifetime of the application. This distinction is determined during the design phase for DPR compatible Xilinx FPGAs. A group of dynamic resources is termed a Reconfigurable Partition (RP), while a block that is placed into a dynamic region is termed a Reconfigurable Module (RM) [138]. RPs (dynamic regions) can be thought of as containers in which a single RM (dynamic logic) can be placed. The design logic that is not allocated to an RP, but is still being used, is deemed static logic. The designer is required to specify at design-time one or more dynamic areas of the FPGA which are reserved for run-time configuration. There is no limit to the number of RMs that can be configured for a single RP, but only one RM can be active at any time. Each RP needs to contain the appropriate types and number of resources for the most demanding RM in its set. For example, if RM A and RM B both belonged to the same RP, the partition would need to contain at least the larger of the slice counts required by A and B. In actuality, RPs typically need to be over-provisioned to account for routing congestion and other issues due to the extra constraints [138]. Also, only the RP being reconfigured is affected; any other RPs will continue running as normal.

[Figure 2.3: Typical FPGA CAD flow: circuit description (input), logic synthesis, technology mapping, packing, placement, routing, FPGA bitfile (output).]

[Figure 2.4: Example usage of DPR: Reconfigurable Modules 1, 2, and 3 are time multiplexed into one Reconfigurable Partition next to the static logic on the FPGA.]

FPGAs are typically programmed externally from a workstation or internally from on-board Read-Only Memory (ROM).
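The RP sizing rule described above (an RP must hold, for every resource type, the most demanding RM assigned to it, plus some over-provisioning) can be sketched as follows. This is only an illustration under stated assumptions: the function name `size_partition`, the resource names, the counts, and the 20% margin are all hypothetical, and real tools also impose placement and routing constraints that are not modelled here.

```python
def size_partition(modules, margin_percent=0):
    """Return the per-resource capacity an RP needs for a set of RMs.

    modules: dict mapping RM name -> dict of resource type -> count.
    margin_percent: integer over-provisioning margin (an assumption; the
    text only says RPs typically need to be over-provisioned).
    """
    required = {}
    for resources in modules.values():
        for rtype, count in resources.items():
            # Only one RM is active at a time, so take the maximum,
            # not the sum, for each resource type.
            required[rtype] = max(required.get(rtype, 0), count)
    # Integer ceiling of count * (100 + margin_percent) / 100.
    return {
        rtype: (count * (100 + margin_percent) + 99) // 100
        for rtype, count in required.items()
    }


# Hypothetical resource requirements for two RMs sharing one RP.
rms = {
    "RM_A": {"slices": 400, "bram": 2},
    "RM_B": {"slices": 250, "bram": 6, "dsp": 4},
}
print(size_partition(rms))
print(size_partition(rms, margin_percent=20))
```

Taking the maximum rather than the sum is what lets DPR save area: the two modules never occupy the partition at the same time.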
Configuration data is stored in a bitstream file which contains the configuration of the programmable routing, logic blocks, and embedded blocks. A full bitfile is used to program every configurable SRAM bit on the whole FPGA, while partial reconfiguration is used to set the configuration bits in a specified region using a partial bitstream file. For this reason a partial bitstream file is smaller and takes less time when being used to reconfigure a device.

A black box instance is an empty netlist which can be used to save power [75]. It is used to create a blanking partial bitfile meant to “unconfigure” the partition. The use of blank partial bitfiles in blocks where the outputs are not being used can eliminate the switching of signals. This is similar to clock gating, where synchronous elements would be restricted from switching, but where there is still static power required to hold nodes at their previous logic value. If it is not necessary to hold these node voltages constant then these blocks can be replaced with a black box (blank logic). This could also reduce power by decreasing the capacitive load on routing wires. Additionally, DPR can be used to save area by allowing two independent blocks to occupy the same resources at different times. The downside is the time and the power overhead of reconfiguration. The end user needs to decide when DPR is feasible through analysis of the duration and power required by DPR. This was explored in [76].

2.2.1  Dynamic Partial Reconfiguration Design

DPR has existed in Xilinx FPGAs for several years, but has been difficult to implement due to its steep learning curve and the difficulty of using the associated CAD tools. Xilinx has developed a floorplanning tool called PlanAhead [137] which aids the designer in developing DPR designs through a Graphical User Interface (GUI).
This tool can help in the design flow by providing a general layout of the FPGA and helping the user understand where the resources are located. It allows the designer to set the size, location, and resources for RPs. When designing a Xilinx DPR compliant system, the implementation of the static logic must remain constant across all configurations. Because the static logic does not change, the Xilinx tools require modules that are allocated to the same reconfigurable region to have the same boundary-crossing signals. In other words, any dynamic cores that will occupy the same physical part of the chip must have identical port lists. This ensures that the Xilinx tools can keep the routing from the static region to the dynamic region constant, as shown in Figure 2.5. It is also possible for static signals to be routed through a dynamic partition. In these cases, the static routing would be integrated uniformly into each RM for that partition. PlanAhead uses proxy logic [138], essentially a 1-input LUT, to interface between static and dynamic signals. These proxy LUTs are located in the dynamic region and serve as a fixed sink and source for every signal into and out of the dynamic module. This causes a significant area overhead for each signal. Previous work has looked into more efficient partial reconfiguration design techniques. Beckhoff et al. [17] propose GoAhead, a tool that synchronizes the routing tracks between static and dynamic partitions to avoid using a proxy LUT. The area savings of this technique scale with the number of region-crossing signals. GoAhead also allows swapping of dynamic modules between dynamic partitions, something that is not supported with the bitfiles created using PlanAhead. Another DPR design framework was presented in [29]. This research automated and sped up the process of taking processor-centric designs from Xilinx Embedded Development Kit (EDK) to PlanAhead to make them DPR capable.
Other DPR design frameworks are mentioned in [58, 106, 115].

[Figure 2.5: Proxy logic. Labels: Reconfigurable Partition, Proxy LUTs, Static Routing, Dynamic Logic, Reconfigurable Modules 1 and 2.]

2.2.2  Dynamic Partial Reconfiguration Simulation

DPR also increases the complexity of simulation. Changing the underlying FPGA configuration layer during a simulation run was not something that early simulation tools were capable of handling. Recently, there have been many academic studies on how best to handle the partial reconfiguration process during simulation. One of the earliest and most straightforward techniques was to include all reconfigurable modules in simulation and choose between them using multiplexers [78]. Gibson et al. [43] changed module generics during simulation to mimic partial reconfiguration. These techniques were appropriate for functional verification, but did not account for the time duration of partial reconfiguration. Dynamic circuit switching techniques [79, 103] insert simulation-only constructs into the Register-Transfer Level (RTL) of a design to monitor specific signals. Depending on these signals, the outputs of a specific task or block will be driven. Inactive tasks output a 'Z' (weak), tasks under configuration are driven with an 'X' (unknown), and active tasks drive the output signal strongly, thereby overriding any other tasks trying to drive the outputs. High-level DPR models have also been proposed. Several make use of SystemC, but do not accurately depict the low-level details of reconfiguration [21, 95, 99]. The complexity of simulating DPR has led to many techniques that attempt to capture the true effects of what happens on the real chip. As we will discuss in the next section, accurate simulation is an important factor in power analysis. The accuracy of an estimate largely depends on an accurate modelling of the hardware.
In this case, Xilinx does not release detailed information about the configuration layer, which makes it hard to estimate the power of DPR.

2.3  Power Analysis

Low-power design is composed of analysis and optimization techniques [145]. Analysis concerns the problem of accurately estimating the absolute power consumption of a design. Optimization, on the other hand, is used to make design choices based on an analysis of the power effects. Absolute power analysis is used for power budgeting, while relative power analysis compares the difference in power dissipated by two designs. Low-power design choices almost always come at a cost to the circuit delay, which affects maximum frequency, or to the area, which affects manufacturing cost or, in the case of FPGAs, how large a design one can fit on an FPGA [145]. At the same time, as clock frequencies and chip transistor counts have been increasing, so has the power dissipation of these chips. The International Technology Roadmap for Semiconductors predicts the power consumption of logic and memory to steadily increase over the next 15 years [50]. These factors have made reducing power consumption through tradeoffs (optimization) one of the most important design decisions, as it plays a role in battery life, performance, temperature, packaging, and reliability [47, 84]. Our platform can be used to aid in power optimization by analyzing the power dissipation of FPGA circuits through direct physical measurement. Power dissipation in VLSI Complementary Metal-Oxide-Semiconductor (CMOS) circuits is the result of dynamic and static power and can be estimated as the product of the supply voltage and current of a circuit, shown in Equation 2.1. Dynamic power (see Equation 2.2) is due to short-circuit current, glitches, and the charging and discharging (or switching) of capacitance.
The switching power has traditionally been the key component of power dissipation and can be estimated using the switching capacitance model (see Equation 2.3). This equation sums the switching power of all the nodes in a circuit to calculate the total switching power. Here α is the switching coefficient, or the ratio of the number of clock cycles in which the node switches (from high to low or low to high) to the total number of clock cycles. V_DD is the supply rail voltage and V_swing is the voltage difference before and after the charging or discharging, often equivalent to V_DD. f is the frequency of the circuit clock, which directly affects the rate at which logic values change. α and f are sometimes combined to represent the switching frequency of a node. Lastly, C is the capacitance of each node. This capacitance includes planned capacitors and parasitic capacitance from wires and devices. Static power is mainly due to leakage current in CMOS circuits and is playing an increasing role in total power as process node sizes decrease, due to the decreasing threshold voltages and shrinking gate oxide thickness [126]. Static power is complex and its low-level details are outside the scope of this thesis. The interested reader can refer to [47] for more details.

$$P = V_{DD} \cdot I_{DD} = P_{dynamic} + P_{static} \tag{2.1}$$

$$P_{dynamic} = P_{switching} + P_{short\text{-}circuit} + P_{glitch} \tag{2.2}$$

$$P_{switching} = \sum_{\text{all nodes}} \alpha \, V_{DD} \, V_{swing} \, f \, C \tag{2.3}$$

Estimating the supply voltage and current of a circuit at a single point in time would provide the instantaneous power. To obtain average power (see Equation 2.4), the instantaneous power can be integrated over a period of time. In this thesis, we focus on average power, which directly affects the battery life.

$$P_{average} = \frac{1}{T} \int_{t_0}^{t_1} P(t) \, dt \tag{2.4}$$

All analysis can be divided into either estimation or measurement techniques.
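As a minimal sketch of the switching capacitance model in Equation 2.3, the following Python fragment sums the per-node contribution α · V_DD · V_swing · f · C. The `Node` class, the function name, and all numeric values are hypothetical and chosen only for illustration; they are not taken from any measured circuit.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """One circuit node (hypothetical container for this sketch)."""
    alpha: float        # switching coefficient: fraction of clock cycles in which the node toggles
    capacitance: float  # node capacitance in farads


def switching_power(nodes, vdd, freq, vswing=None):
    """Evaluate Equation 2.3: sum of alpha * VDD * Vswing * f * C over all nodes."""
    if vswing is None:
        vswing = vdd  # full-rail swing, the common case noted in the text
    return sum(n.alpha * vdd * vswing * freq * n.capacitance for n in nodes)


# Illustrative values only: two nodes, a 1.0 V supply, and a 100 MHz clock.
example_nodes = [Node(alpha=0.5, capacitance=1e-12), Node(alpha=0.1, capacitance=2e-12)]
example_power = switching_power(example_nodes, vdd=1.0, freq=100e6)
print(f"switching power: {example_power:.2e} W")
```

Because each node contributes independently, the estimate is only as good as the per-node α and C values supplied to it, which is the sensitivity that the following sections return to.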
Estimation, though typically less accurate, is essential to compare optimizations early on in the design phase, since that is when design choices are easiest to modify. Measurement, on the other hand, requires part or all of the design to be implemented in hardware. As a consequence more implementation is required, but the results can be more accurate. Estimation techniques are also much better at providing a breakdown of the power usage of specific types of hardware in a design, while doing so with measurement would require hardware overhead. There has been some work using FPGAs to measure power consumption breakdown by resource type. The studies in [62] and [110] used a two-step approach to provide breakdowns of dynamic power in the logic, interconnect, clock network, and I/O blocks. In the first step they added a specific resource to a baseline circuit, measured the relative power difference, and extracted estimated capacitance values from the results. These capacitance values were then used in Equation 2.3 to estimate switching power. Jevtic and Carreras [55] used similar methods to create a high-level model that provides a breakdown of static, logic, global interconnect, and clock network power.

2.3.1  Power Estimation

There are many different power estimation techniques, each with different accuracy and computational efficiency [145]. Early in the design process quick power estimates are required and low accuracy is allowed; as more details in the design are exposed, longer run times are accepted and more accurate power analysis results are required [145]. Higher-level descriptions often lack the details required to obtain accurate estimates. Therefore, the fundamental tradeoff in power dissipation estimation is that detailed, low-level descriptions are more accurate, but require more time and/or resources to compute [144]. This is depicted in Figure 2.6 [145]. Power estimation techniques can be grouped into two broad categories: simulation-based and probabilistic.
Simulation methods use input vectors to a circuit to calculate the values of all internal and output signals. In general a design will be simulated for many cycles with many different input vectors and the results will be averaged. Probabilistic methods use input attributes, which are then propagated to all other signals. These only need to be propagated once, allowing probabilistic techniques to be much faster than simulation-based techniques, although with reduced accuracy, as we will discuss.

[Figure 2.6: Power analysis - levels of abstraction (adapted from [145]). Levels from highest to lowest abstraction: Algorithm, Software and System, Hardware Behaviour, Register Transfer (Architecture), Gate (Logic), Circuit, Device. Accuracy ranges from worst to best and computing resources from least to most.]

To accelerate analysis, techniques often make use of simplifications that can decrease run-time at the expense of accuracy. A list of generic simplifications is as follows:

• Constant voltage supply
• Ignoring glitch, short-circuit, or static power
• Assuming a constant relationship between two equations
• Zero-delay timing model for routing or logic
• Random or pseudo-random inputs
• Constant transistor capacitance regardless of operating mode

The first simplification listed is assuming a constant voltage on supply and ground rails. In reality, the voltage cannot always be considered constant everywhere in the design. One reason is voltage droop, caused by varying loads on the supply rails drawing different amounts of current. Another case is simultaneous switching noise, which can cause a reduction in the supply voltage (known as IR drop) and an increase in the ground voltage (known as ground bounce) [47]. Chen and Ling [30] estimated the supply voltage noise by modeling voltage drops at the package and the chip level. This can be done with lengthy run-times to estimate the various supply voltages at a coarse granularity.
Assuming a constant voltage helps decrease the run time and allows higher abstraction levels, but contributes to error in the results. Some estimation techniques also simplify their calculations by ignoring short-circuit current during switching [57] or assuming it is a fixed ratio of dynamic power [73, 98]. These currents occur in CMOS logic when there is a current path between the voltage supply and ground due to N-type Metal-Oxide-Semiconductor (NMOS) and P-type Metal-Oxide-Semiconductor (PMOS) networks both partially conducting before settling to their final value. Nose and Sakurai [86] estimated that short-circuit power was between 10% and 20%, depending on typical values of many circuit parameters. Another simplification is ignoring glitches in combinational logic. Glitch power can be a substantial part of the total power and appears when inputs arrive at a gate at different times. Shen et al. [112] reasoned that glitch power typically represents 20% of the total power in ASICs, but could contribute up to 70% in circuits such as a combinational adder. More recently it was suggested that glitch power comprises about 31% of dynamic power and 23% of total power [65]. Another study claimed it accounts for 26% of dynamic power [113]. Many power estimation techniques ignore glitch power, because it is computationally expensive to compute for all internal signals due to the complex timing relationships involved [83].

Simulation

Table 2.1: A List of Simulation Techniques and Simulation Simplifications

Simulation Techniques             Simulation Simplifications
SPICE                             Modeling transistors as switches
Gate-level simulation             Using a single clock frequency for a whole circuit
Macromodels (Characterization)
Statistical simulation

The abstraction levels discussed for power analysis can also be applied to simulation. Simulation-based power estimation is referred to as strongly input pattern-dependent because the power consumption is directly related to the inputs provided.
The most basic power estimation technique is to run a circuit simulation and observe the power supply voltage and current over time [83]. Simulation Program with Integrated Circuit Emphasis (SPICE) is the de facto tool for low-level circuit simulation and power estimation [144]. It uses transistor-level models involving complex mathematical equations, which are very accurate, but not suitable for large designs as the run time becomes unreasonable. Switch-level modelling treats transistors as switches that are fully on or off. For designs with millions to billions of transistors this is still too demanding to model the whole system. Thus gate-level modelling is often used to allow simulation of larger designs. At the gate level, hardware is modelled as a collection of logic gates, registers, and other blocks connected by nets. A list of common simulation techniques and simplifications can be found in Table 2.1. For all simulation-based techniques a set of inputs (test vectors) must be provided to the simulation, and the accuracy of the results depends on how realistic these inputs are. Test vectors can often be difficult to determine for a multitude of reasons: some or all of the remaining parts of the system may not be designed or available, the designer might not know how the block in question will typically be used by an end user, or the block may be sufficiently flexible that it is unrealistic to capture all typical behaviours. An alternative technique is statistical simulation. The principal idea of statistical simulation is to execute a limited number of deterministic simulations using randomly generated input vectors. In this method, the user can specify an accuracy and confidence level, which can be calculated using statistical estimation techniques [84]. The resulting power is the average across all runs. This is referred to as the Monte Carlo Simulation method [24, 38, 107].
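The stopping rule behind statistical simulation can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure of any cited work: `simulate_power` is a hypothetical stand-in that draws from a normal distribution instead of running a real circuit simulation, and the confidence interval uses a normal approximation (z = 1.96 for roughly 95% confidence).

```python
import random
import statistics


def simulate_power(rng):
    """Hypothetical stand-in for one deterministic simulation run with
    randomly generated input vectors; returns the power of that run in watts."""
    return 1.0 + rng.gauss(0.0, 0.1)


def monte_carlo_power(max_error=0.01, z=1.96, min_runs=30, max_runs=100_000, seed=0):
    """Run simulations until the confidence interval half-width falls below
    max_error (relative to the running mean), then return (mean, runs)."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < max_runs:
        samples.append(simulate_power(rng))
        if len(samples) >= min_runs:
            mean = statistics.fmean(samples)
            half_width = z * statistics.stdev(samples) / len(samples) ** 0.5
            if half_width <= max_error * mean:
                break
    return statistics.fmean(samples), len(samples)


estimated_power, runs_used = monte_carlo_power()
print(f"estimated power {estimated_power:.3f} W after {runs_used} runs")
```

Tightening `max_error` or raising the confidence level increases the number of runs, which is the run-time cost of this approach.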
Statistical simulation is known to increase the accuracy of simulation techniques without needing realistic inputs, but can require longer run-times depending on how many runs are required to reach the user-specified accuracy and confidence. Characterization is a technique which allows some of the accuracy of low-level modeling with the speed of higher abstractions using macromodels [72, 109, 124, 148]. The macromodeling approach is classified as bottom-up, as a block or group of transistors is studied through low-level simulation to determine a relationship between the inputs and the power dissipation. This relationship typically takes the form of a mathematical expression or a look-up table. Macromodels are "trained" with circuits and input vectors, so their correctness largely depends on the training set. For this reason any differences between the classes of data used to train the macromodel and the designs being analyzed can lead to error in the estimation.

Probabilistic

Table 2.2: A List of Probabilistic Attributes and Simplifications

Probabilistic Attributes     Probabilistic Simplifications
Signal probability           Spatial independence
Transition probability       Temporal independence
Transition density           Assuming all primary inputs have the same attribute value
Equilibrium probability      Treating embedded block outputs as primary inputs

Probabilistic methods were created to help deal with the problem of input dependence. They are characterized as weakly input pattern-dependent as they do not require input vectors. Instead the user specifies primary input attributes that are propagated to the rest of the design. This allows these techniques to run faster, with reduced accuracy compared to simulation. The most basic attribute is signal probability [94]. This is the expected proportion of clock cycles in which a signal settles to a logic high value compared to the total number of clock cycles.
Another common attribute is transition probability [83], which is defined as the expected proportion of clock cycles in which a signal has a different value at the end of the clock cycle from its value at the beginning of the clock cycle, compared to the total number of clock cycles. This considers transitions from high to low and vice versa, but does not consider glitches. Probabilistic techniques can be used after simulating a circuit with realistic test vectors to obtain representative activity attributes, or by assuming uniform activity attributes for all primary inputs and propagating these through the circuit. A list of attributes and simplifications for probabilistic techniques can be found in Table 2.2. Propagating these two attributes to all other nodes can be difficult. This is due to two complications called spatial and temporal dependence [83]. Spatial dependence is defined as a correlation between two or more signal values. For example, if two signals are always (or never) logic high at the same time then they are spatially dependent. Temporal dependence occurs when the value of a signal in the next clock cycle depends on the value in the current clock cycle. Without assuming spatial and temporal independence, computing and using the myriad signal correlations is too computationally expensive for complex circuits [84]. Signal probability and transition probability do not take into account the settling time of signals in combinational logic. This simplification, which assumes all combinational logic is instantaneously resolved, is called the zero-delay timing model. A technique that uses a zero-delay model does not account for short-circuit or glitch power; therefore more sophisticated attributes are required. Najm presented transition density, a measure of activity defined as the average switching rate of a signal, along with a method to propagate it [85].
To complement this, he also presented equilibrium probability, which is the expected proportion of time during which a signal is high compared to total time [82]. In the zero-delay case transition density is equal to transition probability and equilibrium probability is equal to signal probability. One problem with these techniques is that probabilistic power estimation is highly sensitive to input parameters. Poon and Wilton [97] show that incorrect primary input attributes, such as transition density, can lead to significant error in results. However, obtaining realistic transition density and equilibrium probability values depends on knowledge of how the circuit will ultimately be used. When using a technique involving switching activity, the error of the technique is added to the user error of estimating the switching attributes, making it difficult to obtain accurate estimates.

2.3.2  Direct Power Measurement

In many cases the most accurate way to determine power usage is to directly measure the current and voltage going into the chip. Measurement can be used while the circuit under investigation is running at full speed and has the potential to be more accurate than estimation. However, it does require a functioning implementation of the circuit, which can take several hours to reimplement after a change. There are also simplifications and errors in measurement techniques, but even with those problems direct measurement is still widely considered the most accurate technique for power analysis.

Measurement Techniques

There are several different methods to physically measure the power consumption of FPGAs. Many estimation studies compare their results with a physical measurement [1, 88, 108] and often a measurement is treated as the exact power consumption [16, 96, 110]. In reality the measurement results are not exact, but are largely considered the best approximation.
Measurement is typically accomplished by measuring the current on the voltage supply rail with a multimeter or oscilloscope and using the current and voltage to calculate the power. Experiments in the past have used small, external shunt resistors across the supply rail in conjunction with a multimeter to measure the current [75, 76]. Those experiments also assumed a constant supply voltage. Becker et al. [16] characterized the reconfigurable fabric of various FPGAs by inserting precision current sense resistors onto the supply rail and measuring the voltage rather than inserting a current meter. This was done to keep the supply voltage within 50 mV of its nominal value. Conversely, Pitkänen et al. [96] used an intricate system, involving an external voltage source, which measured the supply voltage and adjusted it to keep it constant based on the current that was drawn by the FPGA. Capacitor-based techniques avoid the sampling error and constant voltage supply assumption with a charge transfer technique. In previous work one or more capacitors alternated between being charged and being connected to the power supply rail [26, 27]. The charge lost from each capacitor was recorded and the process was repeated. With this method the energy dissipated can be calculated from the charge transferred and, if the capacitors are recharged often enough, the supply rail can stay in the proper operating voltage range. This technique can be used to accurately measure power spikes that traditional instruments miss due to their limited sampling rate. One problem with power measurement techniques is that it can be necessary to separate the power dissipation of the circuit under investigation from the power dissipation of other resources. For example, realistic circuits have many primary input and output pins, and FPGAs require a large amount of power to drive data on and off chip. This can obscure the internal power dissipation that is being observed. Chin et al.
[32] investigated the power implications of using embedded memories for logic in FPGAs. They avoided this problem of I/O power clouding their results by generating the input data on-chip through a Linear Feedback Shift Register (LFSR). Similarly, all outputs of the benchmark were connected to an exclusive-or gate so that only one output was being driven off-chip. These techniques can cause some error in the power measured compared to a real implementation of the benchmark, since the modified circuit has extra logic and unrealistic inputs.

Measurement Simplifications and Causes of Error

Sample-based measurement measures the instantaneous power at discrete points in time. These techniques are still prone to error in the measurement apparatus, can be limited by the sampling rate, and make some simplifying assumptions. For example, the Xilinx System Monitor (SysMon) [131] can measure the chip-level current and voltage at 200 kilosamples per second, or every 5 microseconds. Therefore, any fluctuation of the power between sampling points is lost when computing average power. Sample-based measurement must assume that the power varies linearly between any two samples. This is not always the case and can lead to overestimates and underestimates of average power, as shown in Figure 2.7. In the first example the sampling points miss the high power spike, causing an underestimation. In the second example, one of the sampling points hits the power spike at its peak, causing an overestimation. The red line is the continuous power estimate based on the discrete sampling points. If these causes of underestimates and overestimates are equally likely then they should average out with enough samples.

[Figure 2.7: Error in power measurement. Two panels, Underestimation and Overestimation, each plotting power against time with sampling points t0 to t3 and the estimated power.]
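The two cases in Figure 2.7 can be reproduced with a small Python sketch that approximates Equation 2.4 from discrete samples, assuming a linear progression between samples (the trapezoidal rule). The power trace, its spike, and the sampling times below are hypothetical values chosen only to trigger each case.

```python
def power_at(t):
    """Hypothetical power trace in watts: a 1 W baseline with a
    5 W spike between t = 1.4 and t = 1.6 (arbitrary time units)."""
    return 5.0 if 1.4 <= t <= 1.6 else 1.0


def sampled_average(sample_times):
    """Average power from discrete samples, assuming power varies
    linearly between consecutive samples (trapezoidal rule)."""
    energy = 0.0
    for t_a, t_b in zip(sample_times, sample_times[1:]):
        energy += 0.5 * (power_at(t_a) + power_at(t_b)) * (t_b - t_a)
    return energy / (sample_times[-1] - sample_times[0])


# Exact average over [0, 3]: baseline area plus the extra area of the spike.
TRUE_AVERAGE = (1.0 * 3.0 + (5.0 - 1.0) * 0.2) / 3.0

underestimate = sampled_average([0.0, 1.0, 2.0, 3.0])  # every sample misses the spike
overestimate = sampled_average([0.0, 1.5, 3.0])        # one sample lands on the spike

print(f"true {TRUE_AVERAGE:.3f} W, under {underestimate:.3f} W, over {overestimate:.3f} W")
```

Neither sampling schedule is wrong in itself; the error comes from where the samples happen to fall relative to the spike, which is why a faster sampling rate or more samples reduces it.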
When measuring the power of a circuit at the voltage input pin, there is also an underlying assumption that the voltage on the supply rail is uniform. This is not always the case, as was explained in Section 2.3.1. Lastly, measurement results are non-deterministic and can be skewed for a specific chip or environment due to process variation or environmental factors, such as changes in ambient temperature at the time of measurement. A high-quality instrument can provide less error and a faster sampling rate, but there is always some error in the final power results with sample-based measurement techniques.

2.4  Power Analysis Frameworks

In this section we discuss power analysis frameworks provided in previous works. When we say framework we mean a system that provides high-level analysis, typically above the gate-level abstraction. There have been many frameworks for estimating power in microprocessors [22, 73, 143], Systems-on-Chip [18, 44], and Networks-on-Chip [25, 57, 91], in addition to the FPGA power models that will be discussed in Sections 2.4.1 and 2.4.2. As previously mentioned, many studies have also directly measured the power consumption of FPGAs and of applications running on FPGAs. Architecture-level descriptions are often used to design digital VLSI circuits; therefore there is much appeal for accurate architecture-level power models. Higher-level models have also been created and will be discussed here. One example of an architecture-level technique is Functional Level Power (FLP) analysis [1, 18, 40, 66]. This macromodel approach estimates the power consumption of a block using architectural-level parameters. In these studies a sweep of the parameters is performed and the power is measured or estimated at a lower level on blocks or sub-blocks of the design. Finally the power dissipation models are created using curve fitting. This provides a less complex model at the cost of reduced accuracy. Elléouet et al.
[40] used a method based on FLP analysis to break down the power in basic blocks such as adders, memories, multipliers, and registers. Their results showed errors of 25%-30%. Brooks et al. [22] presented an architecture-level framework, Wattch, for investigating dynamic power consumption of new microprocessor architectures. Their methodology was based around parameterized power models for common building blocks of microprocessors. The authors claimed Wattch was 1000x faster than existing layout-level power tools and their error was between 6% and 30% for tests over 3 different industry processors. Another architectural framework, McPAT, was presented by Li et al. as the "first integrated power, area, and timing modeling framework for multithreaded and multicore/many-core processors" [73]. It allowed the user to specify the architecture-level details and it automatically created realistic low-level implementations. The framework accounted for switching, leakage, and short-circuit power, but disregarded glitch power. They calculated switching power using a capacitive model, obtaining capacitance from analytical models and activity factor from architectural simulation. Short-circuit power was calculated using the trends in [86]. McPAT modeled power between 22 nm and 90 nm process technology with an error of 10%-23% compared to industry processors. Kahng et al. [57] presented ORION 2.0, an architecture-level parameterized power and area model for Network-on-Chip architectures. They modeled dynamic power using a switching capacitance model and leakage power following the technique proposed by Chen and Peh [31]. The results in [57] had a 7%-11% error compared to post-layout simulation results of two Intel chips. A software-level macromodeling technique similar to FLP is Instruction Level Power (ILP) modelling [69, 121]. This method is often used with processors to determine the power dissipation of software programs.
The main idea is that the total power consumption of a program is the sum of the power dissipation of individual processor instructions. For complex processors the resulting models can be enhanced to consider inter-instruction effects, which depend not only on the current instruction, but also on the history of previous instructions [122]. Such effects include cache behaviour, branch prediction, stalls, switching power of the instruction, data, and address registers, and any power management features of the processor [122]. Profiling these effects in isolation is difficult to do accurately, and maintaining all the low-level details of a simulation can lead to large run times. Instruction energy characteristics can be obtained using either simulation or direct measurement. In Platune [44] the instruction power was obtained by physically measuring the power on a real MIPS processor to create a parameterized power and performance tuning framework for SoCs. Platune used ILP to model power of the processor and peripherals of an SoC, but did not consider many of the inter-instruction effects and also ignored static power. Laurent et al. [66] modeled the power consumption of processors using a combined FLP and ILP approach. They characterized instruction power using measurement on several hard processors followed by curve fitting and regression. Their approach used algorithmic and architectural parameters as inputs to compute a power estimate. One limitation of this work is that the authors only compared their results with the processors used to characterize the tool. Atienza et al. [14] use a Virtex-2 FPGA to emulate their Multi-processor System-on-Chip (MPSoC) design and extract statistics for power estimation. They show cycle-accurate results with up to three orders of magnitude speedup over a similar software simulation. Their framework also allows them to test run-time thermal management techniques without performance loss.
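Returning to the ILP approach described above, its additive structure can be sketched in Python as a sum of per-instruction base costs plus an optional inter-instruction term. The instruction names, the energy values, and the restriction of inter-instruction effects to adjacent pairs are all assumptions made for illustration; real models are characterized by simulation or measurement and may depend on a longer instruction history.

```python
# Hypothetical per-instruction base energies in nanojoules.
BASE_ENERGY_NJ = {"add": 1.0, "mul": 3.0, "load": 2.5}

# Hypothetical extra energy when one instruction directly follows another,
# a simplified stand-in for inter-instruction effects.
PAIR_OVERHEAD_NJ = {("add", "mul"): 0.4, ("mul", "load"): 0.6}


def program_energy_nj(trace, include_inter_instruction=True):
    """Estimate program energy as the sum of base instruction energies,
    optionally adding an overhead for each adjacent instruction pair."""
    energy = sum(BASE_ENERGY_NJ[op] for op in trace)
    if include_inter_instruction:
        for prev_op, cur_op in zip(trace, trace[1:]):
            energy += PAIR_OVERHEAD_NJ.get((prev_op, cur_op), 0.0)
    return energy


example_trace = ["add", "mul", "load", "add"]
base_only = program_energy_nj(example_trace, include_inter_instruction=False)
with_pairs = program_energy_nj(example_trace)
print(f"base only: {base_only:.2f} nJ, with pair overheads: {with_pairs:.2f} nJ")
```

Dividing the estimated energy by the program's execution time would give its average power; the gap between the two results shows how much a purely additive model can miss when inter-instruction effects are ignored.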
There has also been work on simulation with high-level description languages. Ahuja et al. [6] proposed a methodology to combine probabilistic methods on a system-level circuit description translated into RTL. Several high-level languages based on C to describe circuits have existed for some time, such as SystemC [92], SpecC [41], and HardwareC [59]. These languages can be used with high-level simulators to estimate power consumption [7, 48, 60, 119]. For example, using SystemC, Kuehnle et al. [60] claimed a 28x speedup over RTL simulation with similar accuracy.

2.4.1  FPGA-Specific Power Estimation

Though simulation and probabilistic techniques were designed to be used in general power analysis, they can also be extended to be used in FPGA design. Commercial tools can be used to estimate the power in those vendors' devices, but are not typically flexible enough to be used to explore other architectures or circuit techniques. Many vendors, such as Altera [8, 9] and Xilinx [134, 136], have released proprietary CAD tools to help estimate power. Xilinx has two tools which are used to estimate and analyze power consumption of a hardware design. The first, Xilinx XPower Estimator, is a probabilistic spreadsheet-based power estimation tool often used in pre-design or pre-implementation [134]. The user can specify global parameters such as device, temperature, and voltage, while also identifying clock, logic, and I/O resources. The power consumption of these resources is directly proportional to the frequency, toggle rate, and fanout. As we have discussed previously, toggle rate can be difficult to establish for logic and I/O resources, especially during the design phase of complex or software-driven hardware. Second, Xilinx XPower Analyzer (XPA) [136] is a tool integrated into the Xilinx Integrated Software Environment (ISE) Design Suite [139]. XPA analyzes a circuit after the place and route phase has been completed.
Analyzing the power after the implementation of the physical layout leads to a better understanding of the wire lengths, fanouts, and resources used. However, it does not help with determining the primary input toggle rate. Other works have claimed that significant error exists in the results from XPower [39, 40, 54, 68]. For FPGA architecture and circuit exploration the de facto standard in the academic community is Verilog-to-Routing (VTR) [105]. This tool can take a design from HDL to physical layout and comprises ODIN-II [53], ABC [19], and Versatile Place and Route (VPR) 6.0. Several power models have been developed for VTR or earlier versions of VPR. One of the earliest, presented by Poon et al. [98], was integrated into VPR 4.3. This model estimated dynamic, short-circuit, and leakage power for island-style FPGAs with a variety of parameterized architectures. It used probabilistic techniques based on transition density and equilibrium probability. Their work extended the transition density model to work with sequential elements. Flip-flops were modeled with SPICE and a macromodel was created with curve fitting. Low-pass filters were modeled on the outputs of LUTs to filter out short glitches. The authors assumed that SRAM cell leakage was negligible, short-circuit power was 10% of dynamic power, the gate-source voltage of off transistors was half of the threshold voltage, and all primary inputs had transition density and equilibrium probability values of 0.5. These assumptions reduce the run-time of the tool, but can reduce the accuracy. This work was later extended by Jamieson et al. [52] to work with VPR 5.0. It used the same estimation techniques, but was enhanced to allow unidirectional routing architectures. VersaPower [45] modeled power for the latest VPR version 6.0. This work supported new features in VTR such as fracturable LUTs and hard blocks.
It accounted for configurable logic blocks, routing, and clock networks, which it reduced to wires, simple multiplexers (using NMOS pass transistors), and inverters. Li et al. later released the fpgaEVA-LP2 power model [72] that also worked with VPR 4.3. Their work used a hybrid approach of switch-level models for interconnects and macromodels for LUTs. They claimed cycle-accurate power simulation by using a gate-level netlist with post-layout capacitance and delay. A new attribute, effective transition density, was used to model glitches by allowing for partial voltage swings. These occur when the driver of a node changes from low (high) to high (low) before the node has settled. Short-circuit power in the interconnect was modeled as a linear function of switching power, where SPICE models were used along with curve fitting to create a variable function based on signal transition delay. The LUT macromodel assumed all input vectors are equally likely and used the average power consumption. This was a simplification, as the authors found through SPICE simulation that the power consumption of a four-input LUT depends on the input vector pairs applied. This work assumed a signal probability of 0.5 and transition probability of 0.85. Atitallah et al. [18] used the FLP approach to create a framework for MPSoCs. In their framework they created a complete system power model which includes a processor, a Xilinx Virtex-II FPGA, and an Operating System (OS). Their FPGA power model used frequency, switching activity, and area utilization. The authors did not show the results of the FPGA power model accuracy, but a similar approach was also taken in [1] where Abdelli et al. claimed a maximum error of 37%. Because FPGA power depends on the circuit being implemented, it can be very difficult for high-level FPGA power models to be accurate. Ou and Prasanna used the ILP method to create a profile for soft processors on FPGAs [90].
They estimated the energy costs of instructions by obtaining gate-level switching activity in ModelSim and importing these into Xilinx’s XPower tool. Other FPGA power models and estimation frameworks are discussed in [37, 56, 102].

2.4.2  FPGA Power Measurement Frameworks

In Table 2.3 we present a breakdown of previous FPGA power measurement frameworks. This is not meant to be an exhaustive list of all FPGA frameworks. Some works referenced in Table 2.3 do not analyze or estimate power, but have implemented their work on an FPGA. We feel that any FPGA implementation of a circuit can be used to measure power with the techniques discussed in previous sections and therefore can be considered a framework. The remainder of this section will provide further background on the studies shown in Table 2.3.

Table 2.3: Comparison and Breakdown of Previous SoC FPGA Frameworks

| Study | Processor | OS | DPR | Frequency Scaling | Clock Gating | Power Analysis | On-chip Power Measurement | FPGA |
|---|---|---|---|---|---|---|---|---|
| [129] | LEON3 | Linux | Y | N | N | N | N | Virtex-2 |
| [125] | None | N | N | N | Y | N | N | Spartan3 |
| [14] | PowerPC/Microblaze | N | N | N | Y | Emulation + Estimation | N | Virtex-2 |
| [111] | Microblaze | Y | N | Y | Y | Estimation | N | Virtex-2 |
| [93] | PowerPC | N | Y | N | N | N | N | Virtex-2 |
| [127] | Microblaze | uClinux | Y | N | N | N | N | Virtex-2 |
| [87] | LEON3 | No Mention | N | Y | N | Measurement | N | Virtex-4 |
| [146] & [147] | LEON3 | Linux | Y | N | N | Measurement | N | Virtex-4 |
| [101] | NIOS II | N | N | Y | Y | Measurement + Estimation | Y | Stratix IV |
| [76] | MIPS | N | Y | N | Y | Measurement | N | Virtex-4 |
| [81] | LEON3 | No Mention | Y | Y | Y | Measurement | N | Virtex-5 |
| This Work | LEON3 | Linux | Y | Y | Y | Measurement | Y | Virtex-6 |

Many studies have investigated new uses for [129], created an efficient implementation of [34, 63], or further analyzed the run-time, area, or power properties of DPR [76, 93, 127, 146, 147]. Outside of DPR’s power and area benefits, researchers have used DPR to mitigate single event upsets due to radiation in space applications [129]. Claus et al.
[34] created an area- and speed-efficient implementation of an Internal Configuration Access Port (ICAP) controller as a bus master. It also had Direct Memory Access (DMA) capabilities which made it less dependent on the Central Processing Unit (CPU). Lai and Diessel presented a reusable ICAP controller with DMA support that does not require a CPU [63]. They claimed higher throughput and lower area than other CPU-based implementations. Williams and Bergmann [127] presented a Microblaze implementation of an embedded Linux SoC with DPR capabilities, and Papadimitriou et al. [93] used a DPR system with a PowerPC to analyze configuration time. In both cases the authors did not evaluate the power impact of their designs. Zaidi et al. [146] evaluated the power and area of a LEON3 SoC running Linux on a Xilinx Virtex-4 FPGA. They also extended the LEON3 SoC with an ICAP driver to reconfigure the FPGA with partial bitfiles. Zaidi et al. [147] extended their previous work to show a power analysis of DPR for various arithmetic operators. Liu et al. [76] compared the measured power effects of clock gating and partial reconfiguration of a 64-bit division circuit. They showed that the benefit of each power technique depended on the amount of time the circuit remained idle: reconfiguration had a much higher overhead but affected both dynamic and static power, while clock gating had a low overhead but affected only dynamic power. They also showed that higher throughput in the ICAP driver led to lower energy of reconfiguration.

Several studies focused on power evaluation and low-power implementation frameworks. Ul Haq et al. [125] have created an SoC platform architecture for low-power SoCs. They claimed it was processor-independent, as they required the end user to implement an interface for whichever processor is chosen. They included a power management unit which can enable and disable the clock to manage power. However, their platform did not include a power analysis.
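The idle-time trade-off reported by Liu et al. [76] above can be made concrete with a small break-even calculation. This is a simplified sketch under stated assumptions, not their measurement method: the function names and all numeric values are hypothetical, clock gating is assumed to remove all dynamic power while leaving static power, and reconfiguring to a blank design is assumed to cost a fixed energy and leave only a residual static power.

```python
def idle_energy_clock_gating(idle_s, static_w, gate_overhead_j=0.0):
    """Idle energy with the clock gated: static power is still consumed."""
    return gate_overhead_j + static_w * idle_s


def idle_energy_reconfiguration(idle_s, reconfig_j, blank_static_w=0.0):
    """Idle energy after reconfiguring the region to a blank design.

    reconfig_j is the (hypothetical) total energy spent on reconfiguration.
    """
    return reconfig_j + blank_static_w * idle_s


def break_even_idle_s(static_w, reconfig_j,
                      gate_overhead_j=0.0, blank_static_w=0.0):
    """Idle time beyond which reconfiguration uses less energy than gating."""
    saved_w = static_w - blank_static_w
    if saved_w <= 0:
        return float("inf")  # reconfiguration never pays off
    return max(0.0, (reconfig_j - gate_overhead_j) / saved_w)


if __name__ == "__main__":
    # Hypothetical values: 10 mW of static power, 2 mJ per reconfiguration.
    print(break_even_idle_s(static_w=0.010, reconfig_j=0.002), "s")
```

Under this model a lower reconfiguration energy, for example from a higher-throughput ICAP driver, moves the break-even point toward shorter idle periods.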
She [111] presented a Microblaze-based emulation platform for MPSoCs. Her work used power equations, rather than measurement, to analyze the power consumption of the platform. Emulation was used instead of simulation to obtain parameters of the power equation in a more timely manner. Clock gating and frequency scaling were both integrated into her work; however, the frequency scaling was a simplistic version that used clock gating to disable the clock for a fraction of the cycles and therefore could only be used to reduce the clock frequency. Similarly, power gating was “emulated” using clock gating. Ravishankar et al. created an MPSoC prototyping environment using multiple NIOS II soft-cores on an Altera Stratix IV [101]. Their framework supported Dynamic Frequency Scaling (DFS), clock gating, and software thread migration between CPU cores. Their power analysis for every algorithm consisted of measuring the power at several different frequencies to empirically determine a coefficient through regression [101]. This coefficient was then used in an equation to scale the estimated dynamic power of each NIOS core based on its frequency. They claimed this gave a 0.33% error compared with measured data. However, they compared their results with the same circuits they used to fit their estimation equations. Nunez-Yanez et al. looked at the power effects of Dynamic Voltage and Frequency Scaling (DVFS) using external measurement on the LEON3 SoC [87]. They used a sourcemeter connected to the core supply voltage in order to dynamically adjust the voltage at run-time. This work was extended by Nabina et al. [81] to support power gating and DPR. They created a voltage scaling module on a printed circuit board which they used in place of the voltage regulator of their FPGA development board, and which interfaced with the LEON3 using General Purpose Input/Output (GPIO) pins.
This created a DVFS system, where voltage and frequency values could be adjusted automatically at run-time if the critical path length was provided by the user. They observed power savings of 85%, simply by reducing the core voltage, when running their circuits at 20 MHz.

2.5  Focus and Contribution of this Thesis

Although the techniques used in this framework have all been presented in previous studies, the novel contribution of this work is the idea of providing high-level access to these techniques and integrating them into a flexible framework for SoC-on-FPGA exploration. Our implementation serves as a proof-of-concept of these ideas. First, we implement power-management techniques on an open-source SoC running Linux. Second, we provide the infrastructure to control our framework through command line “hooks”. Lastly, we focus on evaluating the power impact of algorithmic or CAD tool changes through on-chip measurements of an FPGA.

Chapter 3  Desirable Characteristics of the Platform

As described in Chapter 1, to properly evaluate new proposals for power reduction in FPGAs, an experimental platform is essential. In this chapter, we outline the desirable characteristics of this platform; the next chapter describes our platform and discusses to what extent it satisfies these requirements. We have identified the following desirable characteristics: (1) the platform must provide for power evaluation in a realistic setting, (2) the platform must be flexible enough to be amenable to a variety of experiments, (3) the platform must be open-source or otherwise available, and (4) the platform must be easy to use in an experimental setting. Each of these is described separately in this chapter.

3.1  Realistic

A primary requirement of our platform is that it allows investigators to obtain results that realistically quantify the power impact of a proposal.
This requirement has two aspects: (a) the measurement technique itself is reasonably accurate and (b) the environment in which the measurements are performed is representative of real chips. The first aspect concerns the accuracy of the power estimation. It is important to distinguish between absolute accuracy and relative accuracy. Absolute accuracy is important when qualifying the power characteristics of a new design, to ensure it meets specifications or to evaluate packaging and/or cooling requirements. Relative accuracy (also called fidelity in [72, 74]) refers to the extent to which the investigator can correctly predict changes in power consumption between two implementation options. In most experimental evaluations the important task is to compare the power dissipation of an optimized version with the power dissipation of a baseline version, to determine the percentage improvement that can be attributed to a proposal. Since we are targeting an experimental environment, it is relative accuracy that we are most concerned with. As described in Chapter 2, power can be estimated through simulation, probabilistic models, or direct measurement. Of these, direct measurement has the potential to provide the best accuracy, so our platform will place the highest priority on this type of measurement. However, since some users may wish to use simulation or probabilistic methods, the platform should also support these techniques. It is common to simulate hardware designs with a testbench, running at a fraction of the speed of the real hardware. Direct measurement, which runs at full speed, can therefore be used to examine a larger number of input vectors or execution paths. Also, these testbenches often verify the functionality of the most important execution paths. However, this differs from the average usage of most hardware, since in many realistic circuits these paths would not achieve equal utilization.
For example, code pertaining to an error condition might consume significant power when it occurs, but that error is triggered so seldom that its effect on the total power can be considered negligible. The second aspect related to this requirement is that the environment in which the measurements are performed is representative of real designs. For FPGA power-aware CAD experiments it is common to implement a circuit using a proposed algorithm, and present results as if the circuit being optimized was the only circuit on the chip, as done in [64]. However, as the size and complexity of designs increase it is becoming more common to reuse Intellectual Property (IP) blocks along with established bus protocols to substantially reduce implementation and verification time of systems as a whole. While a stand-alone circuit may be typical of some design scenarios, in many other cases large pre-compiled or pre-optimized IP blocks are incorporated in an FPGA design. Therefore, the conclusions drawn by applying a power reduction technique on a stand-alone circuit may be different when examined in the scope of an overall SoC implementation. More important, however, are the interfaces between the IP blocks and the rest of the system. As described in Chapter 2, the activity of the input signals of a circuit directly impacts the power dissipation of the circuit. Many studies assume that the inputs to the circuit are primary inputs of the chip, and often assume that these inputs are driven randomly [32, 45]. In [45], in which power models are used to estimate power, an input activity of 20% is assumed for each input; that is, it is assumed that each input changes in 20% of the clock cycles, as shown in Figure 3.1a. In [32], in which measured power results are used, the inputs are driven by an LFSR, effectively creating a 50% activity for each input, as shown in Figure 3.1b. Neither of these assumptions is realistic.
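The gap between these input assumptions and software-driven inputs can be illustrated numerically. The sketch below is not taken from [32] or [45]: it uses a common 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) as a stand-in for an LFSR-driven input, and a hypothetical register bit that flips once every `write_interval` cycles as a stand-in for an input that changes only when software writes to it. The seed and the write interval are arbitrary illustrative choices.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one cycle."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def lfsr_toggle_rate(cycles, seed=0xACE1):
    """Fraction of cycles in which the LFSR output bit changes value."""
    state, toggles = seed, 0
    for _ in range(cycles):
        nxt = lfsr16_step(state)
        toggles += (nxt & 1) != (state & 1)
        state = nxt
    return toggles / cycles


def mmio_toggle_rate(cycles, write_interval):
    """Toggle rate of a register bit that flips on every software write.

    Assumes (hypothetically) one write every write_interval cycles, each of
    which changes the bit; this is an upper bound for that write rate.
    """
    return (cycles // write_interval) / cycles


if __name__ == "__main__":
    print("LFSR-driven input:      ", lfsr_toggle_rate(65535))
    print("Software-written input: ", mmio_toggle_rate(10000, 100))
```

With these illustrative numbers the LFSR-driven input toggles in roughly half of the cycles, while the software-written bit toggles in one cycle out of a hundred, which is the kind of difference that inflates estimated activity when random inputs are assumed.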
In many common SoC design methodologies, user IP blocks are connected, through a system bus, to a processor. In these cases, the activities of each input are certainly not random. As an example, many IP blocks are integrated into an SoC using a memory-mapped I/O interface, as shown in Figure 3.1c. Such memory-mapped bits change only when the software writes to these memory locations; assuming a random input pattern can lead to a large error in predicted input activity, often artificially inflating the activity significantly. This can lead to a substantial overestimation of the power (and possibly a considerable overestimation of the effectiveness of any power reduction proposal).

Figure 3.1: Test circuit input assumptions. (a) Typical simulation assumptions [45]: 20% input activity. (b) Typical measurement technique [32]: LFSR-driven inputs, 50% input activity. (c) Realistic input activity: inputs driven from a memory-mapped I/O area on the system bus.

It is also common for SoCs to run an OS. While an OS provides ease of use by raising the level of abstraction, it also introduces overhead. Many real SoC designs use an OS to allow flexibility at run-time. The OS takes care of various low-level system details and also allows for simple interfacing with a file system. We decided this was necessary when dealing with the complexity of an SoC. Without the OS and file system our framework would be limited in its run-time options. The framework uses the OS to move and manipulate data at run-time, and builds on top of it to create high-level “hooks” that can leverage this data. The same thing could be achieved without an OS, but most likely with more designer time and effort. We required a small, embedded OS, as the FPGA development board has a limited amount of storage and memory with which to work. A pre-existing infrastructure to load the OS onto the FPGA was also of vital importance.
The most straightforward way to ensure that the inputs have a realistic activity is to integrate the circuit under test into a realistic SoC running an OS. As will be discussed in Chapter 4, we built our platform around the LEON3 processor and SoC from Aeroflex Gaisler running a Linux operating system. The platform provides the ability to integrate user IP into this system, and to evaluate the power dissipation of this IP in a realistic scenario. We recognize that other techniques may be faster to implement, but our framework will be more realistic and thus more accurate.

3.2  Flexible

The second requirement for our platform was that it should be flexible enough to be used for a range of experiments. This section explains this requirement in more detail. Our framework aimed to provide several power management techniques for evaluating power. This was accomplished by using a single implementation that was suitable for several different experiments. The platform cannot be expected to include all past and future techniques, but it should be capable of being extended to support new techniques. Ideally our platform is the base infrastructure for power evaluations and constitutes the bulk of the implementation work, while it is simple to extend the platform for specific proposals that are not covered. We sought to provide the ability to examine the power, area, performance, run-time, and temperature of design implementations. The types of experiments we believed would be most fitting for this type of framework are as follows:

1. Hardware
This includes modifications to hardware algorithms and/or CAD tools. Examples could include altering an IP core to use clock gating or examining the results of a new power-aware clustering tool.

2. Software
These tests would be limited purely to changes in software code that run on the same hardware. This could include changes to algorithms, compiler optimizations, or even OS features.

3.
DPR Analysis
These experiments would be limited to assessing the effects of DPR. A sample experiment might measure the power or time required to dynamically reconfigure a portion of the FPGA.

4. Hardware/Software Codesign
In these tests the user could notice an opportunity for improvement in a software application and move part of the computation to a hardware accelerator. For example, this could be into either an embedded block or soft logic on the FPGA.

3.3  Available

By choosing open-source code at the hardware, OS, and software layers we can make our framework available to future researchers. We wanted to be able to configure our system at all three of these levels, which requires the source code, in order to customize the platform efficiently. An open-source and standard OS was required to reduce the barrier to entry and to enable future researchers to build on top of it. At the hardware level we wanted to create a system that is platform and vendor independent. Therefore, we obtained the hardware at the HDL level, in Verilog and VHDL, in order to give us the freedom to make low-level modifications. It was necessary for most, if not all, of the source code to be freely available and open-source. Choosing a popular architecture for the target processor was also valuable, as we were able to take advantage of robust compilers for the OS and software applications.

3.4  Easy to Use

We wanted to make our platform easy to use. This was accomplished through the use of high-level “hooks” to control hardware blocks. We also wanted to create a straightforward method to obtain results. This section explains these requirements in more detail.

3.4.1  Hooks

It was critical for our platform to allow the integration of power management techniques at a low level, while providing access to these techniques at a high level. In order to integrate high-level interaction into our framework, we created software programs and OS drivers.
The types of hooks we were able to implement largely depended on the FPGA, SoC, and OS we opted to use. A detailed description of the hooks we integrated into our framework is given in Chapter 4.

3.4.2  Obtaining Results

The main results we wanted to monitor were area, performance, run-time, power, and temperature. We wanted to create a platform that supported all of these and enabled the end user to obtain the results with little effort. Area and frequency are obtainable through the Xilinx CAD tools’ output logs. Run-time can be measured on most OSs using software calls or shell commands. Lastly, power and temperature results depend on the FPGA being used and/or external tools available. In our framework these can either be obtained using on-chip monitoring hardware on our FPGA board or by measuring the voltage, current, and/or temperature of the chip with external devices.

3.5  Summary

This chapter discussed the desirable characteristics of the platform we originally set out to create for this thesis. We wanted our framework to be representative of real-world systems so that the results are meaningful. Also, we wanted the platform to be flexible, such that it could be used for a range of experiments. It includes mostly open-source code to allow low-level modifications. Lastly, we provide high-level hooks to aid in power evaluation experiments and obtaining results. These desired characteristics of our platform can be summarized as follows:

1. Accurately measuring power usage of FPGA circuits in a realistic environment
2. Leveraging an OS-based SoC and high-level hooks to allow researchers to focus on experiments instead of infrastructure
3. Enabling DPR on an open-source soft-core processor
4.
Experimental evaluation of traditional power-saving methods as a proof-of-concept of our platform

Chapter 4  Description of our Framework

4.1  Overview

Our platform is constructed as an SoC which is based around an on-chip bus connecting a soft processor to several IP cores, as shown in Figure 4.1. The IP blocks in this system consist of external interfaces, bus controllers, and memory controllers for external memory. We have also included a customizable block in dynamic logic that can be reconfigured separately from the remainder of the system. The processor, on-chip bus, and IP blocks are considered static logic, meaning they are permanent when the system is running. Conversely, dynamic logic can have its functionality changed to adapt to system needs. It achieves this using DPR (i.e., time multiplexing), where at different points in time the dynamic region can be configured to contain one of several different designs. This dynamic region can be reconfigured at run-time without disturbing any other logic running on the FPGA outside of the RP.

Figure 4.1: High-level view of the LEON3 SoC (adapted from [4]).

As discussed in Chapter 3, we wanted a platform which was realistic, flexible, open-source, and easy to use. This chapter describes the implementation of the framework that was developed according to these requirements. It begins by providing an overview of the system hardware and software. Next we present a description of our development environment. This is followed by a description of how our work extended the system to provide hooks for users to evaluate power. Next, we outline some potential uses of our platform. This is followed by an evaluation of our framework implementation.
Lastly, we present a summary of this chapter.

4.2  The System

We have chosen to use the LEON3 processor and the associated Gaisler Research IP Library (GRLIB), both from Aeroflex Gaisler [4]. Aeroflex Gaisler has patched the Linux 2.6 kernel to work with the LEON3 and GRLIB. The GRLIB is a collection of IP cores from different vendors that can optionally be included in the end system. All source code is provided except for the floating point unit, which is provided in netlist format for Xilinx and Altera FPGAs. These netlists are synthesized for specific FPGA families (e.g., Xilinx Virtex-5), which reduces the portability of the LEON3. Most of the remaining GRLIB source code is vendor independent and can be used with several commercial CAD tools; however, in some cases the code does differentiate between FPGA families to make use of FPGA-specific embedded blocks. As a result, there exist system designs for FPGAs from Actel, Altera, Lattice, and Xilinx. The LEON3 processor is provided in source code and differs in this way from other popular soft-core processors such as the Xilinx Microblaze and Altera Nios II, which are proprietary.

4.2.1  Hardware

This section describes the hardware provided in the LEON3 SoC. The hardware consists of the LEON3 processor, the GRLIB, and the Advanced Microcontroller Bus Architecture (AMBA) bus interconnect used to link the system blocks. All the hardware is open-source, meaning the system can be synthesized for different platforms.

LEON3 Processor

Overview

The LEON3 CPU is a state-of-the-art industrial 32-bit processor with the Scalable Processor Architecture (SPARC) V8 architecture [3]. The LEON3 has been used in commercial ASIC products [117], and a variant of it has been sent to space [120]. We chose this processor because it is configurable (Section 3.2), open-source (Section 3.3), and well documented, and because we had previous experience with it.
Those last two reasons made it easier to implement the modifications required for our framework. The CPU connects to the system blocks using the AMBA bus and makes use of the plug&play method. There also exist a Fault-Tolerant (FT) version and a symmetric multiprocessing version with up to 16 cores of the LEON3 processor. Our work will focus on the single core non-FT implementation.

Architecture

At its base configuration the LEON3 processor is a pipelined processor with instruction and data caches, and an AMBA Advanced High-performance Bus (AHB) interface. A Floating-Point Unit (FPU) can be used to speed up non-integer (floating-point) operations using soft-logic resources. The Debug Support Unit (DSU) allows reading and writing of processor registers and memory, including instruction and bus trace buffers for hardware debugging. This will be discussed further in Section

The remainder of this thesis will use the terms LEON3, system, and SoC to refer to the combination of the LEON3 processor, system buses, and GRLIB IP cores. If we are talking only about the LEON3 processor it will be denoted explicitly using the words processor or CPU. Similarly, we will refer to our implementation of the LEON3 system extended with our work as the platform, framework, or environment.

Advanced Microcontroller Bus Architecture

The LEON3 system is built around the AMBA 2 bus. This strongly aligns with our requirements in Chapter 3 because the AMBA bus is realistic, as it is used in many chips, open-source, and well documented. The AMBA specification [12] is technology-independent, making it quite flexible for use with an FPGA from any vendor. The LEON3 system makes use of the AMBA AHB and the AMBA Advanced Peripheral Bus (APB) protocols. The AHB connects high performance, high frequency system blocks such as CPUs, on-chip memory, external memory, and other high-bandwidth blocks [12].
The GRLIB includes the AHB controller core for managing address decoding, storing AHB plug&play information, and acting as an arbiter for the AHB, to decide which master can drive the bus in a multi-master system. The APB is a secondary bus that is used in addition to the AHB; in the LEON3, slower system devices are located on the APB. Blocks on the APB are typically simpler, require less throughput, or do not require a high frequency clock input [12]. One disadvantage of the LEON3 system is that the GRLIB uses the dated AMBA 2 bus standard. Xilinx has released many open-source hardware modules in their EDK for use in their Microblaze soft core based on the AMBA 3 (Advanced eXtensible Interface (AXI) bus), which are not backwards compatible with AMBA 2. The ability to reuse Xilinx cores would have given us the benefits of reduced design and debug time.

Gaisler Research IP Library

The GRLIB [3, 4] is a collection of IP cores, in synthesizable VHDL, from several different vendors including Gaisler. The library is open-source and vendor-independent. Therefore the LEON3 system can be synthesized using several different CAD tools and target technologies for FPGAs or ASICs. All the GRLIB cores have a common AMBA 2 bus interface, so the library can be extended to include new hardware devices that utilize this bus standard. This modular design approach allows our framework to be flexible, since it allows us to add or remove blocks in the LEON3 system quite easily.

4.2.2  Software

The software running on the LEON3 consists of an operating system kernel, utilities, and user applications. Gaisler provides several compatible OS options for the LEON3 and we felt that the Linux 2.6.36 kernel with LEON patches was the best choice. The LEON version of Linux is popular, configurable, open-source, well supported, thoroughly documented, and intended for embedded systems. This is very much in line with our requirements in Chapter 3.
The LEON family of processors is officially supported in Linux and LEON patches are included in the release of Linux kernels [2]. LEON patches that are made between kernel releases are also provided by Gaisler and can be integrated into the kernel [2]. The LEON version of Linux is configured, patched, and built through a tool called Linuxbuild. All software in the system was written in C and cross-compiled for the LEON3 by the Bare C Cross-Compiler (BCC). LEON3 applications can be compiled to run either within Linux or as stand-alone programs without an OS. In essence, a stand-alone application will be bundled with a bootloader to run on the LEON3 CPU. The Gaisler Research Monitor (GRMON) loader is used to load, run, and debug these programs.

4.3  Development Environment and Tools

The development environment for the LEON3 SoC can be separated into hardware and software design tools. The hardware and software portions can be configured, compiled/synthesized, and tested in parallel before being integrated and tested together, as shown in Figure 4.2. We integrated the necessary tools to facilitate software and hardware development environments. This includes synthesis tools, simulators, compilers, and debug tools. Proprietary tools, for example Xilinx ISE, were not provided by Aeroflex Gaisler but can be integrated into their flow. This section will explain the development environment components.

Figure 4.2: The LEON3 SoC design flow. The software flow (configure with the Linuxbuild GUI or a text editor, compile with sparc-linux-gcc, verify) and the hardware flow (configure with the Xconfig GUI or a text editor, verify with Modelsim, synthesize and implement with ISE, load with iMPACT) run in parallel; the software is then loaded and the system is run with GRMON.

We installed the tools to configure, compile, simulate, and run the LEON3 SoC system on a Linux workstation.
We were required to license, install, and set up HDL synthesis and simulation tools. We configured the LEON3 and GRLIB through the provided GUI. We also used the pre-configured design template for the Xilinx Virtex-6 XC6VLX240T FPGA and ML605 development board, which set up the constraints and optimized the design based on available embedded blocks.

We compiled the software through a tool chain involving several different tools. The tools were used to cross-compile all C code for the SPARC architecture, to package the OS kernel, user files, utilities, and bootloader, and to create a memory image to run on the LEON3.

We used the Xilinx ISE tool suite to compile the HDL code and generate full and partial bitfiles that could be used to configure our FPGA. This also included the Xilinx PlanAhead tool and a command line flow for executing multiple runs automatically.

For verification we used Mentor's ModelSim program to simulate and debug the LEON3 and the target hardware. A pre-written VHDL testbench was used to instantiate the LEON3 SoC and various memory blocks, and to drive the system clocks. This allowed us to load software into memory and execute it on the CPU. The processor is then able to read from and write to memory-mapped addresses of hardware peripherals. To test our new hardware we created simple software functions to read and write to the corresponding address space. The resulting waveforms were then observed to aid in the iterative design and debug of our hardware. Because the ML605 uses the Xilinx Memory Interface Generator, we could not use the proper testbench in simulation. As an alternative we tested our hardware blocks with the ML509 system and testbench.

Gaisler Research Monitor

GRMON is the tool we used to access the LEON3 SoC hardware internals through a workstation.
With it we were able to load, run, and debug LEON3 software applications on the target system while it was running. Both stand-alone programs and the Linux kernel with file system were loaded into Random Access Memory (RAM) using GRMON. We were also able to access internal LEON3 processor registers and system memory through the DSU core, as shown in Figure 4.3.

Figure 4.3: Overview of GRMON (adapted from [5]).

4.4  Extending the LEON3 System-on-Chip

Our framework extends the LEON3 to provide researchers with high-level tools to accurately evaluate the impact of algorithmic and CAD techniques on power. We have provided several hooks in our design that the experimenter can utilize to apply various power-saving methods. Hooks are intended to offer high-level access to low-level power management techniques, while keeping in mind our goal of reducing the design time and complexity for experiments. The hooks we have included are used to obtain results, modify clock frequency, perform clock gating, and execute DPR. An overview of the hooks and how they contribute to the use of the dynamic region is presented in Figure 4.4.

Figure 4.4: Overview of the dynamic region setup.

4.4.1  Measurement Results

We have instantiated the Xilinx SysMon inside the LEON3 system and configured it to provide the on-chip temperature and voltage/current of the FPGA and development board.
SysMon is a Xilinx primitive (special purpose embedded block) that provides monitoring and averaging of off-chip analog inputs and on-chip sensors [131]. Through a multiplexer it can choose between the temperature and power supply of the FPGA or power supply of the development board [135], as shown in Figure 4.5. SysMon uses a 10-bit, 200 kilosamples-per-second Analog-to-Digital Converter (ADC) to convert sensor data and has a 16-bit data output [131]. We also configured SysMon to provide readings averaged over 256 samples and enabled offset and gain calibration of the ADC.

The FPGA core and board current and voltage are measured using shunt resistors. The core supply current (ICCint) is measured with a 5 mΩ current sensing shunt resistor [135]. An on-chip sensor measures the voltage difference across this resistor and uses that voltage to obtain the supply current. A separate sensor measures the FPGA core supply voltage (VCCint), which is expected to be close to 1 V. The core power is then obtained as the product of the two. The development board power can be determined in a similar way. Board current is measured with the help of a 2 mΩ current sensing shunt resistor and an amplifier [135]. Board voltage is expected to be around 12 V and is dropped over a pair of voltage divider resistors before being routed along with the board current to the external analog SysMon inputs.

Figure 4.5: System Monitor primitive (adapted from [131]).

The outputs of SysMon can either be used on-chip through the FPGA soft-logic or transferred to an external workstation via the Joint Test Action Group (JTAG) interface. Our implementation makes use of the latter, which uses Xilinx ChipScope to read from and write to SysMon while the LEON3 design is running.
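The core power calculation described above reduces to Ohm's law followed by a multiplication. The following is a minimal sketch, assuming the shunt voltage and VCCint have already been converted to volts; the function names and the sample readings are hypothetical and are not taken from the SysMon or ChipScope interfaces.

```python
# Sketch of the shunt-based power calculation. Function names and sample
# values are illustrative assumptions, not part of SysMon or ChipScope.

CORE_SHUNT_OHMS = 0.005  # 5 mOhm core current-sensing shunt mentioned in the text


def shunt_current(v_shunt, r_shunt):
    """Return the current (A) through a shunt from the voltage (V) across it."""
    if r_shunt <= 0:
        raise ValueError("shunt resistance must be positive")
    return v_shunt / r_shunt


def rail_power(v_rail, v_shunt, r_shunt):
    """Return the power (W) drawn from a rail: rail voltage times shunt current."""
    return v_rail * shunt_current(v_shunt, r_shunt)


if __name__ == "__main__":
    # Hypothetical readings: 10 mV across the shunt with VCCint at 1.0 V.
    print(rail_power(1.0, 0.010, CORE_SHUNT_OHMS))
```

Board power would follow the same pattern, but the amplifier gain and the voltage divider ratio would also have to be applied, and those values are not given here.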
ChipScope can be used to modify SysMon configuration registers at run-time and to plot SysMon data in real-time. It also provides the ability to write measured values to a file on the host computer. All the infrastructure to compute and read the values already exists in Xilinx ChipScope and we have simply leveraged this. Unfortunately, the fastest automatic sampling rate option in ChipScope is 1 Hz, which could severely reduce the accuracy of our results.

One property of SysMon in Virtex-6 FPGAs is that it always exists with a default configuration and can be used to measure core FPGA power through ChipScope even if it was not initialized in a design at compile time. We instantiated it in our design in order to connect the external board voltage and current pins for board power and also to have SysMon configured automatically at startup.

Alternatively, the Xilinx ML605 development board also provides physical pins to measure the voltage across the previously mentioned 5 mΩ shunt resistor, so that the core current can be computed using an external measurement device. We connected a Hewlett Packard 34401A multimeter to these pins to provide an alternative measurement method. This can be more accurate, but it was not calibrated to correct for offset and gain error, possibly reducing the accuracy. We implemented this to show the different options for measuring the FPGA core power, but our results in Chapter 5 use the JTAG interface to obtain their power measurements.

4.4.2  Dynamic Partial Reconfiguration and Internal Configuration Access Port

We have implemented a dynamic configuration block in the LEON3 system. This area has been constrained such that no LEON3 SoC logic is placed or routed through this region. Additionally, we have added a controller and storage area for partial bitfiles in order to dynamically reconfigure the dynamic logic.
This is done through a Xilinx ICAP primitive, which communicates with the underlying configuration layer of the FPGA, as shown in Figure 4.6.

Figure 4.6: ICAP controller implementation.

Dynamic Partial Reconfiguration

DPR is a flexible technology that can enable tradeoffs in area, power, and performance. It is supported by Xilinx and, due to the availability of a Xilinx FPGA development board, we have chosen to make the Xilinx tools and FPGAs our initial target. Including DPR in our framework gives end users three advantages: 1) they can run DPR experiments to evaluate benefits and tradeoffs, 2) it may allow faster synthesis times when making changes to the circuits that will be placed in the dynamic partition, and 3) it enables rapid reconfiguration of dynamic hardware in the LEON3 system. The first point has been discussed previously: DPR can aid in saving area or power, but also requires time and power to reconfigure the FPGA. The next two points are discussed in the following paragraphs.

With our DPR approach, we can modify the SoC by changing existing RMs or creating new RMs faster than in a non-DPR design, since these changes do not require the larger static logic to be reimplemented, which is a time-consuming process in large FPGA designs. Any modifications to a dynamic module only require reimplementation of dynamic logic and do not affect the rest of the design (i.e., other dynamic modules or the static partition). However, a change to the static logic (even one that does not affect the static-dynamic interface) does necessitate reimplementation of the static logic and all RMs when using the Xilinx tools. Keeping the static logic fixed can therefore reduce experimentation time. For example, if a researcher wanted to compare two encryption algorithms, he or she could do so using DPR. This could help ensure that they are placed and routed in roughly the same place.
Doing so would keep the distance from the processor (routing wire length), transistor variation, and temperature similar. Also, the total implementation time might be decreased as the smaller dynamic circuits could be synthesized separately from the large base system. We expect this benefit to grow as the difficulty in meeting timing constraints increases for the static partition, as this only needs to be done once. However, the static region still needs to be considered when implementing reconfigurable modules as the whole design still needs to meet timing.

With regard to the third advantage, parts of the FPGA can be reconfigured at run-time to reduce the time it takes to switch the hardware for an experiment. Synthesis of all the required hardware blocks is still required, but DPR allows the user to make run-time hardware choices after the design phase. This is faster, since a partial bitfile is smaller than a full bitfile. Also, loading a full bitfile would require shutting the operating system down, loading the bitfile, reloading the operating system into RAM, and executing the power-on sequence.

One effect of the DPR approach is a possible reduction in accuracy. If the circuit under test will not use DPR in the final system, then the constraints imposed on it could cause a different FPGA configuration for testing than will be used in the final system. The side effect could be a difference in the power and temperature measured. This is the tradeoff to consider: experimentation time versus accuracy. We believe having the option of this tradeoff is worthwhile, and researchers who do not wish to lose this accuracy need not use DPR.

Implementation

We have created the necessary hardware infrastructure to allow our system to perform DPR. A large dynamic region has been allocated in our environment, which can be broken up into smaller RPs to create one or more AMBA peripherals.
This has been done to devote a usable amount of resources to dynamic logic while allowing the static LEON3 system to be properly implemented using the remaining resources.

Our system allows the user to utilize a set of partial bitstreams to reconfigure the dynamic partition of the LEON3 system while it is running the Linux OS. If the bitstream file is on the file system it can be passed, via command line instructions, to the ICAP Linux driver. Every bitfile created by the Xilinx Bitgen tool has a single synchronize and desynchronize command, at the beginning and end respectively, to signify to the ICAP that data between these commands is for configuring the FPGA. We have designed a hardware controller to transfer the configuration data to the ICAP, a memory area to store data, and a Linux kernel driver to facilitate binary file input from the command line. This was a lengthy process with many technical hurdles: understanding the reconfiguration protocol along with designing and verifying the hardware, drivers, and software to move a bitfile through the ICAP was a difficult undertaking.

Configuration of the FPGA can be executed internally via the ICAP or externally over JTAG. The ICAP [133] is a Xilinx primitive on the FPGA that is used to internally access the FPGA configuration layer, as shown in Figure 4.7. The ICAP hard block is designed to run at 100 MHz and receive up to 32-bit wide data [138]. Therefore, the maximum theoretical throughput is 3.2 Gb/s, or 400 MB/s. It is similar to the SelectMAP [133] interface, which is used for configuring the FPGA from external input. It is worth noting that our framework can also be dynamically and partially reconfigured externally through Xilinx iMPACT, which uses the SelectMAP interface. With this method, a partial bitfile is transferred over JTAG from a workstation at a frequency of 6 MHz, but the throughput is quite low as JTAG transfers a single bit at a time.
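The throughput figures above follow directly from clock frequency multiplied by interface width. The sketch below reproduces that arithmetic and derives a lower bound on transfer time. The function names and the example bitfile size are hypothetical, and driver, controller, and memory overheads are ignored, so real configuration times will be longer.

```python
# Theoretical interface throughput and a lower bound on transfer time.
# The example bitfile size is a made-up value for illustration only.

ICAP_CLOCK_HZ = 100_000_000  # ICAP design frequency from the text
ICAP_WIDTH_BITS = 32         # ICAP data width from the text
JTAG_CLOCK_HZ = 6_000_000    # JTAG frequency from the text
JTAG_WIDTH_BITS = 1          # JTAG shifts one bit per clock


def throughput_bits_per_s(clock_hz, width_bits):
    """Peak throughput in bits per second: one transfer per clock cycle."""
    return clock_hz * width_bits


def min_transfer_time_s(size_bytes, clock_hz, width_bits):
    """Lower bound on the time (s) to move size_bytes, ignoring all overheads."""
    return (size_bytes * 8) / throughput_bits_per_s(clock_hz, width_bits)


if __name__ == "__main__":
    icap = throughput_bits_per_s(ICAP_CLOCK_HZ, ICAP_WIDTH_BITS)
    jtag = throughput_bits_per_s(JTAG_CLOCK_HZ, JTAG_WIDTH_BITS)
    print("ICAP:", icap / 1e9, "Gb/s =", icap / 8 / 1e6, "MB/s")
    print("JTAG:", jtag / 1e6, "Mb/s")

    example_bitfile_bytes = 1_000_000  # hypothetical partial bitfile size
    print(min_transfer_time_s(example_bitfile_bytes, ICAP_CLOCK_HZ, ICAP_WIDTH_BITS))
    print(min_transfer_time_s(example_bitfile_bytes, JTAG_CLOCK_HZ, JTAG_WIDTH_BITS))
```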
Configuration time depends on the clock frequency, the number of cycles to write the data to memory, driver and controller overheads, and the speed at which the data can be retrieved from memory. Our controller passes the data to the ICAP 32 bits at a time. It formats the input data, which needs to be byte-swapped, uses a state machine to drive the ICAP enable signal when data on the input is ready, and facilitates the transfer of the next data word from the Block Random Access Memory (BRAM) when the ICAP is ready. Our state machine enables the ICAP on the clock cycles where the data is valid and disables it to pause the reconfiguration when waiting for new data to populate the BRAM, or when waiting for the next word from memory. When the word has been received by the controller, it increments the BRAM address to obtain the next word. The ICAP also has the ability to read from the configuration layer, but we have omitted this feature to simplify the design and therefore only allow writing to the ICAP. (Footnote 1: The ICAP also has busy and status outputs that we use.)

Figure 4.7: A simple representation of the FPGA configuration layer.

We have created a 4 KB memory space which can be accessed from the OS and the ICAP controller. This memory has been mapped to an embedded BRAM on the FPGA to ease synthesis complexity and to allow faster memory access than if the data were stored in distributed LUTs or off-chip. The BRAM is used as a buffer to store the data sent from the CPU. Data is passed from user-space to our kernel-space driver 4 KB at a time. This block size is dictated by the OS, such that larger data transfers are broken up into 4 KB pieces. To reconfigure a region the user must write to an OS device file we have created at /dev/icap.
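The buffering scheme above can be illustrated with a short sketch that splits a bitfile into 4 KB pieces and then into 32-bit words. This is only a model of the data movement, not the driver or controller themselves: the function names are hypothetical, the big-endian word interpretation is an assumption, the byte swap performed by the controller is not modeled, and the zero-padding of a trailing partial word mirrors what the kernel module does for files that are not a multiple of 4 bytes.

```python
import struct

CHUNK_BYTES = 4096  # size of the BRAM buffer and of each OS write
WORD_BYTES = 4      # the controller feeds the ICAP 32 bits at a time


def pad_to_word(data):
    """Zero-pad data so that its length is a multiple of one 32-bit word."""
    remainder = len(data) % WORD_BYTES
    if remainder:
        data = data + b"\x00" * (WORD_BYTES - remainder)
    return data


def split_into_chunks(data):
    """Pad the data, then split it into pieces of at most 4 KB each."""
    padded = pad_to_word(data)
    return [padded[i:i + CHUNK_BYTES] for i in range(0, len(padded), CHUNK_BYTES)]


def chunk_to_words(chunk):
    """Interpret a chunk as 32-bit words (big-endian order is an assumption)."""
    return list(struct.unpack(">%dI" % (len(chunk) // WORD_BYTES), chunk))


if __name__ == "__main__":
    # Hypothetical payload: two full buffers plus six trailing bytes.
    payload = b"\xAA" * (2 * CHUNK_BYTES + 6)
    for index, chunk in enumerate(split_into_chunks(payload)):
        print(index, len(chunk), len(chunk_to_words(chunk)))
```

Each chunk corresponds to one fill of the BRAM buffer; the word count per chunk is the value the driver would report before starting the transfer to the ICAP.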
Once our kernel module is loaded into the OS it will monitor the /dev/icap file to observe when it is opened, closed, read from, or written to and call the corresponding function. When a write is issued to this file the kernel driver will write the binary data, one word at a time, to our memory-mapped BRAM. Lastly, it will write to special memory locations for ICAP commands to tell the controller how many words have been transferred and to indicate that it should start writing the BRAM contents to the ICAP. Once the data in the BRAM has been written to the ICAP the process can be repeated for any remaining configuration data (i.e., if more than 4 KB was written). Reading the /dev/icap file will display information about the ICAP, such as its status and whether it is busy. Bitfiles passed to the driver need to be properly word aligned such that the synchronization word sits on a word boundary. Additionally, any files that are not a multiple of 4 bytes are zero-padded by the kernel module on the last word.

4.4.3  Dynamic Clock Generation

In our framework we have included a clock frequency generator that can be adjusted at run-time, a technique otherwise known as Dynamic Frequency Scaling (DFS). It uses the Xilinx Mixed Mode Clock Manager (MMCM) primitive [141] to create multiple clock signals either from an FPGA input pin or an on-chip signal. There are several MMCMs in the Virtex-6 FPGAs and each can create up to seven output clocks with unique frequency and phase in reference to the input clock. We use the 60 MHz system clock as an input to the MMCM and use the output to clock our dynamic partition. Every MMCM primitive contains an internal Dynamic Reconfiguration Port (DRP) with registers that can be written to in order to modify clock outputs. These registers can be connected to other soft-logic in the FPGA.
We have memory-mapped these registers so that the LEON3 CPU can dynamically reconfigure the MMCM using the DRP, as shown in Figure 4.8. This is different from DPR and does not require the use of the ICAP or configuration layer.

Figure 4.8: MMCM primitive. (DRP signals shown: DClock, Ddata_in, DAddr, DEn, DWrEn, DRdy, Ddata_out.)

The protocol for writing to the MMCM DRP has been captured in an OS-level device driver. The user can specify the desired output frequency and the low-level details will be taken care of. This method could easily be extended to clock any of the cores in the static logic, for instance the CPU, or multiple CPUs in an MPSoC. The user can write to the OS device file /dev/mmcm to change the clock frequency of any of the seven output clocks. The output frequency is limited to between 4.69 and 700 MHz [140], and our clock generation block does not check the critical path to determine if the specified frequency will cause timing errors. The frequencies are specified in MHz and an example to change the first clock to 5 MHz is as follows:

    echo "clk0 5" > /dev/mmcm

4.4.4  Clock Gating

Clock gating is a power-saving technique for VLSI circuits and its benefits have been well-studied [49, 89]. In FPGAs the clock interconnect is said to contribute 19%-39% of the dynamic power [36, 123]. When the outputs of a functional unit will go unused, its clock signal can be gated with an enable signal. When the enable signal is logic low, the clock signal, and thus the synchronous state of the functional unit, will not change. This can save power by reducing the switching power of flip-flops and of the clock network that drives those flip-flops.

Figure 4.9: Clock gating circuitry with clock domain crossing synchronizer.

The architecture of our clock gating circuitry can be seen in Figure 4.9.
The clock enable signal is read from and written to on the same clock domain as the CPU and the rest of the system, running at a frequency of 60 MHz. In order to cross clock domains to the dynamic partition, we have created a synchronizer consisting of two flip-flops. The extra circuitry required includes a Xilinx primitive, a global buffer (BUFG), to drive the clock from the clock resources in the middle column of the FPGA to the synchronizer flip-flops. The output clock of our MMCM is fed into a global clock buffer (BUFGCE), another Xilinx primitive. The BUFGCE has two inputs (a clock and an enable) and a single clock output. Once the clock enable signal is properly synchronized it is used as an enable for the BUFGCE tri-state buffer. Due to our architecture choices for the clock gating circuitry, the clock enable signal should always switch at a low rate in order to ensure correct functionality.

The user can write to the OS device file /dev/cg to enable or disable clock gating for the dynamic region. The MMCM can produce up to seven output clocks and the clock gating device driver can handle enabling or disabling the clock on any MMCM output the user desires. This is accomplished by specifying the clock output, similar to the MMCM syntax presented in the previous section. To enable clock gating (stop the clock), the clock enable signal is set to logic low, so on the command line the clock output is followed by a 0; to disable clock gating (enable the clock) it is followed by a 1. A simple example that gates (stops) the first clock output of the MMCM is as follows:

    echo "clk0 0" > /dev/cg

4.4.5  Kernel Modules

In this thesis we have created device drivers as kernel modules for our new hardware. Linux kernel modules are device drivers that can be added to the kernel at run-time when needed.
This is a modular approach that allows the kernel to stay small by only including drivers for devices that are connected, while still allowing the system to adapt to new devices. Any kernel modules that are not in use can be located on the file system, rather than in memory as a drain on important system resources. In our case, these drivers were written, cross-compiled through the BCC, and loaded into the OS at run-time in order to allow us to write and read from memory-mapped hardware registers.

The ability to load and remove kernel modules at run-time makes the iterative software debug cycle much easier. Rather than modifying the driver and recompiling the whole kernel, only the driver needs to be recompiled and the system does not need to be restarted unless an error in the kernel has crashed the system.

We implemented our drivers as kernel modules rather than user applications since our device drivers need to access memory directly. This choice also eliminates context switches that occur when accessing hardware from user space, which could slow down the response time of transactions.

The disadvantages of implementing our driver as a kernel module include prolonged use of system resources, more difficult debugging, and absence of C library support. Debugging kernel modules is much more difficult than debugging user applications. One cannot use a debugger, such as the GNU Debugger (GDB), for breakpoints or to step through kernel code and an error can crash the whole system, not just the application. This was mitigated by simulating the hardware design without the driver in ModelSim and keeping the kernel driver simple before developing the software that uses the module. Lastly, we had to use kernel functions rather than linking to C library functions. This was a minor annoyance and was easily overcome by using the kernel-space equivalent functions, such as the kernel-space print function instead of the standard user-space printf().
Any other missing functions were manually coded into our module.

4.5  Potential Experiments for this Framework

This section outlines the potential experiments this framework could be used for by future researchers. In all of our experiments the goal is to examine the power dissipation of the FPGA by monitoring the current and voltage on VCCint, which is the supply voltage rail for the FPGA. Our framework allows several additional metrics to be observed, such as temperature, compile time, run time, and area. By allowing researchers to examine these metrics on an SoC FPGA implementation, our platform enables them to focus more on design decisions and less on implementing a test framework. The following list represents the types of experiments we envision for our platform:

1. Change Hardware: Our framework can be used to explore tradeoffs in hardware changes. In this scenario the same software and OS are run on different hardware configurations, as shown in Figure 4.10. This includes modifications to a circuit's algorithms and/or hardware that has been compiled with different CAD tools/options. The former is done by adding or modifying hardware modules on the AMBA bus, while the latter can be done with existing Xilinx tools or using academic tools, such as VTR. Any changes to the HDL would require the full design to be synthesized and implemented again.

Figure 4.10: Hardware experiment.

In order to examine the effects of hardware changes the user would create two or more circuits to be tested. A baseline hardware circuit would be integrated into the LEON3 SoC and the power consumption of the implementation would be measured. Then, the baseline test circuit would be modified with a different algorithm or power saving technique, reinserted into the system, and re-examined for power usage. When comparing two or more CAD tools one could start with a single HDL source.
First a baseline CAD tool would take the HDL through synthesis, implementation, and bitfile generation. The design can be loaded onto the FPGA to measure power if required, or the CAD run-time, clock frequency, and area can be observed from the CAD tool log files. Next, the user can change the CAD tools or options for one or more steps in the CAD flow and rerun those steps. At the end, the output metrics or measured power can be compared. The difference in the power output between the baseline and subsequent experiments would be the power savings of that algorithm or technique.

2. Change Software: The user can run different software implementations on the LEON3 processor. This could be two software applications with different algorithms, a single application with different compiler optimizations, or even possibly comparing OS features, as shown in Figure 4.11. In all cases the underlying LEON3 SoC hardware remains constant for the whole experiment. The applications would be cross-compiled and loaded into FPGA memory. Then the executables could be run one at a time while measuring power and run-time. Alternatively, instead of using multiple applications the user could use different compiler options on the same software application and follow a similar flow. Some specific examples for this scenario include profiling the power of the LEON3 CPU instructions, comparing software encryption implementations, or indirectly observing the effects of caching on different matrix multiplication algorithms.

Figure 4.11: Software experiment.

3. Analyze Dynamic Partial Reconfiguration: Experiments with DPR could be used to assess the effects of dynamically reconfiguring regions of the FPGA. This could be specifically for DPR systems, to reduce the time it takes to modify the LEON3, or to blank a region of the FPGA that is not being used. All cases look similar to the representation shown in Figure 4.12.
Figure 4.12: DPR experiment.

DPR can be used to swap system modules dynamically, either for time multiplexing to save area and power, or to reduce experimentation time. Swapping blocks at run-time allows the user to have different hardware configurations without the need to reload software.

Users can also dynamically "unconfigure" a region of the FPGA with black box bitfiles. This can reduce the power dissipation in a region and can be beneficial if that region is idle. When the block is going to be needed again, a bitfile can be sent to reconfigure the area for future activity. It is up to the user to understand how long the reconfiguration takes and how far ahead of time it needs to be reinitialized. The power saved can be compared against the power dissipation and time of the reconfiguration to determine if the black box process is desirable.

4. Codesign Hardware and Software (Accelerator): Most often a hardware implementation is more efficient than a general purpose processor at running a specific application. In most cases accelerators can complete the execution faster and with less power consumption. Our framework can be used to aid in hardware/software codesign, that is, in determining which parts of an application should run on a CPU and which parts can benefit from running on custom hardware, as shown in Figure 4.13. A typical flow for this type of experiment might proceed as follows. A user might profile his or her software application and notice that part of it could benefit from a hardware accelerator. The user would then create a hardware block that can be attached to the LEON3 system in the soft logic, using embedded logic, or both. The LEON3 CPU would be used to set up the accelerator with data and instructions when it is ready, retrieve the accelerator results when it is done, and integrate them back into the software application.
Once the user runs this accelerator implementation, they can compare run-time and power with the baseline design to evaluate the benefits.

Figure 4.13: Accelerator experiment.

4.6  Summary

This chapter gave a detailed explanation of our framework, the LEON3 SoC, how we extended it, the potential uses for our framework, and an evaluation of the implementation of our environment. Our framework can be leveraged by the end user to evaluate the impact of different algorithmic or CAD decisions on power. This is achieved through the use of hooks, which allow high-level access to low-level power management techniques. These techniques include DPR, dynamic clock generation, and clock gating. We also provide the user with a sophisticated method of observing and recording the power consumption of the FPGA and the development board. The following chapter includes an evaluation of our implementation and shows examples of the ways in which it can be used to evaluate the impact of hardware and software decisions on power.

Chapter 5

Characterization and Evaluation of our Framework

In this chapter we first present a characterization of the power dissipated by our LEON3 framework. We also demonstrate how our framework can be used to compare the impact of using a hardware accelerator versus a pure software implementation of an algorithm. We then demonstrate how our framework can be used in a typical experiment. As a demonstration, we use an experiment in which we seek to determine the impact of pipelining on the power efficiency of a circuit. This sort of study has been done previously in [128]; however, in this chapter we show that the conclusions drawn from the experiment using our platform are very different from those in [128]. Finally, we present the power and reconfiguration times of partial bitfiles using our ICAP architecture.
5.1  Characterization of our Framework

We will first characterize the power and temperature of our framework by evaluating the LEON3 system with an Advanced Encryption Standard (AES) RM, the system with the RP configured to be blank, and also with the LEON3 CPU running AES software.

5.1.1  Methodology

To build our target system we first synthesized the static LEON3 SoC logic and dynamic module HDL with Xilinx Synthesis Technology (XST). These netlists were then imported into PlanAhead to floorplan the system and determine a suitable dynamic region. We reserved two contiguous clock regions in the top right of the FPGA, as we noticed that this was the least used area when implementing the LEON3 without a dynamic region. Once an area was chosen and verified in PlanAhead, the user constraints file was modified with the new area constraints. This prevented static logic from being implemented in the dynamic region and dynamic logic from crossing over into the static region. The system was then implemented with Xilinx command line tools. We have implemented the LEON3 system at a conservative frequency of 60 MHz to reduce implementation challenges for the CAD tools. All circuits in this chapter have been synthesized and implemented with the Xilinx PlanAhead and ISE version 14.1 tools. After the implementation stage, full and partial bitstream files were generated through Bitgen.

In order to load the full bitfile onto the FPGA, we used the JTAG protocol over a Universal Serial Bus (USB) JTAG cable with the iMPACT tool. The Linux kernel and file system were generated through Linuxbuild using a makefile and transferred via Ethernet to the FPGA memory using GRMON. GRMON was also used to initiate and debug the system. There was no shutdown sequence required as we did not implement a permanent storage setup and therefore were not concerned about properly preserving data.
To measure the time of full bitfile configuration through iMPACT and the time to boot the OS we used a stopwatch.

We have created a pipelined AES circuit, modified from the AES benchmark found in the GroundHog 2009 benchmark suite [51]. The GroundHog AES benchmark makes use of a 128-bit data input and 128-bit key input to produce a 128-bit data output. It performs ten identical stages of bit manipulations on the input using the key value. The key value is also manipulated so that it is different for each stage. In the GroundHog version this is accomplished by reusing a single block with a counter to determine when the encryption or decryption is complete. We modified the GroundHog AES using “loop unrolling” on the main hardware block, as shown in Figure 5.1a. Pipeline registers were manually placed in the VHDL code between each AES stage and the code was synthesized with the register balancing option turned on to allow the logical placement of the registers to be optimized automatically. In the AES circuit the internal width of the datapath is 128 bits. Therefore, for every group of registers between stages we are adding 128 registers into the datapath. The key input is not changed during the encryption of a file. Therefore, we did not pipeline the key path as the dynamic power is minimal and adding registers would increase the static power. Our modified AES circuit is shown in Figure 5.1b.

The AES circuit was created as an RM with the DPR implementation flow. Along with the AES RM there is also a blank RM that unconfigures the RP, though the proxy LUTs and the routing between the static logic and proxy LUTs are still present in the dynamic region.

Figure 5.1: Modifying the GroundHog AES circuit. (a) Original GroundHog AES circuit. (b) Our modified AES circuit.
We created an AES software application to read data input and key out of binary files, and later write the output data back into a new binary file. The input data we used was random, taken from the Linux file /dev/urandom and stored in a file so that the same data was used in all tests. The output of all versions was captured and compared to one another to ensure correct functionality for different implementations of AES. In order to allow more consistent results we looped over the input file when we required a longer input stream.

Power and temperature results were taken using the ChipScope tool, which communicates with the Xilinx SysMon primitive. The status registers are read once per second and these values are averaged over a sixteen second sliding window for our recorded results. In order to ensure that the system was stable we waited until the average temperature and power values remained constant before recording the results. This typically took on the order of minutes. Other synthesis and implementation metrics were obtained from the ISE output logs.

There were many technical challenges in implementing our system. Some paths between the dynamic and static regions crossed clock domains and were not able to meet the default timing. This was solved by manually specifying in the constraints file those paths that could be ignored for timing. We also experienced difficulties with DPR on the Virtex-6 using either the Xilinx tools or our ICAP architecture. It was determined that, often during DPR, the FPGA supply voltage rail experienced an over-voltage fault that caused components on the ML605 board to shut off to avoid damage. This caused errors in reconfiguration and it took a lengthy amount of time to find the cause.

The AES software application was created using the LibTomCrypt encryption library and executes on the LEON3 CPU running at 60 MHz.
It uses the same file format and command line options for input and output as the AES benchmark circuit.

5.1.2  Results and Analysis

The static LEON3 system requires 27,922 LUTs (18.5% of the FPGA total) and 17,287 slice registers (5.7% of the total). It uses 2.7% of the BRAM and 2.6% of the DSPs. This shows that there is room to add additional test circuits. Table 5.1 provides a list of the LEON3 system specifications. Table 5.2 provides the base implementation tool CPU run-times for the various Xilinx tools.

Table 5.1: Configuration of the LEON3 SoC

  Development Board        ML605 Revision D
  FPGA                     Xilinx Virtex-6 XC6VLX240T-FF1156-1
  Frequency                60 MHz
  Processor                SPARC V8 LEON3
  Data Cache               8 KB; 2-way set associative; 4 KB/way; 16 B/line; Least Recently Used (LRU)
  Instruction Cache        16 KB; 2-way set associative; 8 KB/way; 32 B/line; LRU
  Debug Support Unit       4 KB instruction trace buffer; 4 KB AHB trace buffer
  Memory Management Unit   4 KB page size
  Memory Interfaces        Xilinx MIG DDR2/3 controller; 512 MB
                           LEON2 memory controller for Programmable ROM (PROM)/SRAM
  Other Features           Ethernet 10/100 Mbit; UART; GPIO; JTAG debug link; FPU; Branch prediction

Table 5.2: Xilinx Version 14.1 CAD Tool Run-times for Base Implementation

  Tool       CPU Time
  XST        10 mins 19 secs
  NGDBuild   1 min 0 secs
  Map        16 mins 45 secs
  PAR        11 mins 31 secs

We have observed the various times for booting the LEON3 system. Loading the full bitfile onto the FPGA took approximately 21.5 seconds for a 9.0 MB file. Moving the OS and file system RAM file to the FPGA memory via ethernet occurred at a transfer speed of about 60 Mbit/s. Our base kernel and file system was approximately 5 MB and took about 1 second to load onto the FPGA. The OS powerup sequence took approximately 13.5 seconds.

Figure 5.2: Idle power versus LEON3 implementation. (Power in W, idle and idle with clock gating, for the LEON3 baseline and the LEON3 with AES hardware.)
Figure 5.2 shows the idle power in the baseline LEON3 implementation and the baseline LEON3 with an AES module in the dynamic region. The graph also shows the idle power when clock gating is enabled. In the baseline implementation clock gating of the dynamic region has no effect, since there are no sequential resources being driven by the clock in that region. With an AES circuit as dynamic logic the idle power increases due to static and dynamic power consumption in the RM. In this case clock gating does cause a decrease in power consumption due to the synchronous resources being driven by the clock. The clock gated power dissipation of the latter implementation does not equal that of the former due to the static power dissipation of the dynamic resources.

Figure 5.3 shows the power and temperature of AES running on three different implementations. The first is the LEON3 baseline, which has a blank RM in the RP. There is no AES logic in the dynamic region, so the extra power dissipated in this case over the idle LEON3 baseline is solely due to reading from and writing to the AES registers. In the second implementation we have configured a pipelined AES circuit as the RM. The third implementation involves an AES software application being run on the LEON3 CPU. We can see in Figure 5.4 that the temperature results roughly follow the power consumption of the system. The SysMon only records temperature at a coarse granularity of half of a degree Celsius, so we do not expect to see much difference in the temperature results for similar circuits.

Figure 5.3: AES power versus LEON3 implementation. (Power in W for the LEON3 baseline, the LEON3 with AES hardware, and the LEON3 baseline running AES software.)

5.2  Framework Demonstration: An Evaluation of Pipelining

In this section, we present an example experiment using our framework.
The experiment we have selected is one in which we vary the degree of pipelining of a circuit to determine the impact on power dissipation. Previous work [128] has found that increasing the degree of pipelining yields a significant reduction in power. These previous experiments, however, assumed input activities that are unrealistic if the circuit is to be used in an SoC. We will show that much more modest gains are measured using our architecture, primarily because we are assuming input activities that would be common in an SoC.

Figure 5.4: AES temperature versus LEON3 implementation. (Temperature in °C for the LEON3 baseline, the LEON3 with AES hardware, and the LEON3 baseline running AES software.)

5.2.1  Motivation for Pipelining

As described in [128], increasing the degree of pipelining tends to reduce the dynamic power dissipated in a design. This is primarily because deep combinational circuits typically experience glitches on their internal signals due to uneven path delays through the logic. These glitches consume significant power. When a design is pipelined, the combinational depth is decreased, meaning these glitches become less severe. In [128], it was shown that a 40-90% reduction in energy per operation was seen when pipeline registers were added to typical combinational circuits.

5.2.2  Methodology

For this experiment the methodology is similar to that described in Section 5.1.1. We have created two implementations to evaluate. First, we integrated the AES circuits with the LEON3 globally rather than constraining them to a dynamic region. Second, we created stand-alone AES circuits that have their input data driven by an LFSR, the method which was often used before this framework to evaluate a circuit. Additionally, we have modified our pipelined AES circuit to create AES circuits with 0 to 10 pipelining stages to evaluate the effects of pipelining on power consumption.
In order to make a fair comparison we always run all eleven circuits at 25 MHz, which preserves the throughput, but produces higher latency for circuits with more pipelining stages.

The hardware CAD tool flow for the integrated and stand-alone implementations can be seen in Figure 5.5. All eleven AES circuits were synthesized once through XST and then combined at the netlist level with the stand-alone LFSR test harness. The integrated implementation combines the AES and LEON3 HDL before synthesis to allow full optimization.

In our stand-alone AES circuit we drive the data inputs with a 128-bit LFSR. There are three other inputs to the circuit, in addition to a 66 MHz clock signal. These are reset, key, and clock enable and are all tied to GPIO switches on the development board. The value on the key switch is sent to all 128 bits of the key input vector. Both the key and reset switches are held constant. The clock gets scaled through an MMCM to 25 MHz and gated with a BUFGCE. The 128-bit output is fed into an “AND” gate with the output of the gate going to a Light-Emitting Diode (LED). The reset and key switches and the output LED exist to prevent the CAD tools from optimizing away parts of the circuit, thereby maintaining a fair comparison with the LEON3 AES circuit. There is a low probability of all 128 output bits being logic high and therefore the power to drive the LED should be negligible, as the output will almost exclusively be logic low.

Figure 5.5: CAD tool flow for integrated, dynamic, and stand-alone AES circuits. (Each flow runs XST, NGDBuild, Map, PAR, and Bitgen; the integrated and stand-alone flows each produce 11 full bitfiles and the dynamic flow produces 12 partial bitfiles.)
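The input activity produced by an LFSR-driven test harness can be illustrated in software. The following is a minimal sketch, not the harness used in this work: the tap positions, the seed, and the function names are assumptions chosen for illustration, since the feedback polynomial of the 128-bit LFSR is not specified here. It counts how often the input bits toggle when a new LFSR word arrives every cycle, and, for contrast, when the source only updates every `hold` cycles (the value 20 is purely illustrative).

```python
import random

WIDTH = 128
MASK = (1 << WIDTH) - 1
# Tap positions (1-indexed) are an assumption for illustration only; the
# feedback polynomial of the LFSR in the test harness is not specified.
TAPS = (128, 126, 101, 99)


def lfsr_step(state: int) -> int:
    """Advance a Fibonacci LFSR by one clock cycle."""
    feedback = 0
    for tap in TAPS:
        feedback ^= (state >> (tap - 1)) & 1
    return ((state << 1) | feedback) & MASK


def lfsr_stream(seed: int, cycles: int) -> list[int]:
    """Input words seen when a new LFSR value is applied every cycle."""
    words, state = [], seed
    for _ in range(cycles):
        words.append(state)
        state = lfsr_step(state)
    return words


def held_stream(seed: int, cycles: int, hold: int) -> list[int]:
    """Input words seen when the source only updates every `hold` cycles."""
    words, state = [], seed
    for cycle in range(cycles):
        if cycle and cycle % hold == 0:
            state = lfsr_step(state)
        words.append(state)
    return words


def toggle_rate(words: list[int]) -> float:
    """Fraction of bit positions that change between consecutive words."""
    flips = sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))
    return flips / (WIDTH * (len(words) - 1))


if __name__ == "__main__":
    seed = random.Random(0).getrandbits(WIDTH)  # fixed, arbitrary seed
    print("every-cycle LFSR activity:", toggle_rate(lfsr_stream(seed, 4000)))
    print("held-input activity:      ", toggle_rate(held_stream(seed, 4000, 20)))
```

With a well-mixed seed the every-cycle stream toggles each bit on roughly half of the cycles, while the held stream toggles far less often. This is only a model of input switching; it says nothing about glitching inside the circuit.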
5.2.3  Results and Analysis

We first compared the power results that would be obtained from a stand-alone implementation of AES with inputs driven by LFSRs, as is done in [32], to those that would be obtained by incorporating an AES core in our framework. The stand-alone architecture is shown in Figure 5.6a and the LEON3 architecture is shown in Figure 5.6b.

Figure 5.6: Stand-alone and LEON3 AES circuit architecture. (a) Stand-alone AES, driven by an LFSR. (b) LEON3 with AES, driven by the LEON3 processor over the system bus.

The results from this experiment are shown in Figure 5.7. The results for the stand-alone version follow the trends in [128]; increasing the degree of pipelining decreases the power dissipation of the circuit. However, the results obtained from our framework show that the power dissipation is not significantly affected by the degree of pipelining.

Figure 5.7: The power consumption of running AES encryption versus number of pipelining stages. (Power in W for the LEON3 with integrated AES and for the stand-alone AES, 0 to 10 pipelining stages.)

The primary reason for this discrepancy is the activity of the inputs. In the stand-alone experiment, the inputs are driven by an LFSR, meaning the activity is roughly 50% (each input switches for approximately half of the cycles). In our system, however, the AES circuit is driven by the CPU via the system bus. Driving a new value on the system bus takes about 49 CPU clock cycles at 60 MHz, which is approximately 20 AES clock cycles at 25 MHz. Therefore, in our framework, the AES circuit is idle much of the time. Based on these results, the conclusion obtained for each experimental method is different: using the stand-alone infrastructure, we would conclude that power goes down as the degree of pipelining increases, whereas using our framework, we would conclude that pipelining has little impact.
Which conclusion is correct depends on the environment in which the circuit is going to be used. The important point from this discussion is that conclusions drawn from a stand-alone implementation may not be valid if the circuit is going to be implemented in an SoC. In this particular case, we argue that most circuits are included as part of a larger system such as an SoC, and thus the conclusion obtained from our framework is more meaningful. Figure 5.8 shows the temperature of the two implementations. The stand-alone version has more range in the temperature and is significantly below the temperatures of the LEON3 version.

Figure 5.8: The temperature of running AES encryption versus number of pipelining stages. (Temperature in °C for the LEON3 with integrated AES and for the stand-alone AES.)

Figure 5.9: The power consumption of the LEON3 running AES encryption versus number of pipelining stages. (Power in W for the LEON3 with AES and for the LEON3 with AES driven by an LFSR.)

Figure 5.10: The temperature of the LEON3 running AES encryption versus number of pipelining stages. (Temperature in °C for the LEON3 with AES and for the LEON3 with AES driven by an LFSR.)

To further illustrate that the difference in conclusions is due to a difference in the input activity, we have modified our LEON3 system so that the AES core is being driven by an on-chip LFSR rather than the CPU. This allows us to operate the circuit with different inputs every clock cycle and evaluate the maximum activity scenario. The data in Figure 5.9 show that the LFSR driving the AES leads to increased power and larger improvements when the number of pipelining stages increases.
Again, Figure 5.10 shows that the range of the temperatures for the LFSR version is larger than the realistic input scenario, due to the increased dynamic power.

Figure 5.11: Total CAD tool run-times versus number of pipelining stages. (Time in H:MM:SS for the dynamic, integrated, and stand-alone implementations.)

Figure 5.11 shows the total CAD tool run-times using CPU time for three implementations. As expected, the stand-alone implementation is much faster than the integrated implementation. In both cases the combinational circuit requires significantly more CAD tool time compared to the other pipelining stages. This is due to the increased difficulty in implementing the AES circuit at 25 MHz and might not exist for a more relaxed timing constraint.

Figure 5.12 shows a breakdown of the different CAD tool stages for the integrated implementation. The other two implementations shown in Figure 5.11 have similar breakdowns. We see that most of the tool times are constant when varying the number of pipelining stages, but the Place and Route (PAR) time for the combinational AES is much larger than for any other pipelining stage and is the reason for the much slower total run-time. This is not surprising as PAR is the last stage used to meet timing for the circuit. If any of the previous stages made poor timing choices, these would not become apparent until the PAR stage and would require extra effort by the PAR tool.

Figure 5.12: CAD tool run-times for LEON3 with integrated AES versus number of pipelining stages. (CPU time for XST, NGDBuild, Map, and PAR, and the total.)
5.3  An Evaluation of Dynamic Partial Reconfiguration

Our second experiment uses our framework to evaluate the power consumption and configuration time of DPR.

5.3.1  Methodology

The methodology in this experiment is also similar to that described in Section 5.1.1 and Section 5.2.2. We will use the same RP along with an AES and a blank RM to evaluate DPR. The partial bitfiles for these RMs are 1.5 MB each and were stored on the file system for our experiments. The configuration time for a partial bitfile of this size is less than a second and too small to allow us to accurately measure power and temperature, therefore we repeated the configuration many times to ensure accurate results. The time of partial reconfiguration through the ICAP was observed through the command line “time” utility to get real time. Figure 5.5 shows the dynamic CAD tool flow which produces twelve bitfiles: eleven for the pipelined AES circuits and one containing a blank RM.

The method for power and time measurements of full configuration was different. The power of loading a full bitfile and booting the OS was difficult to measure. SysMon and iMPACT both require the use of the JTAG interface, which means that during full reconfiguration we cannot observe power consumption. Also, while the OS is booting the power consumption is fluctuating, making a proper measurement problematic. For these reasons we did not record these results.

We have also modified the stand-alone AES circuit with constraints identical to that of the dynamic RMs, where it is restricted to resources in the two upper rightmost clock regions. In the constrained stand-alone implementation only the AES circuit is constrained, while the stand-alone test harness is not restricted in any way. This is closer to the way the dynamic LEON3 is implemented and should help us understand how the constraints affect the power dissipation.
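Repeating a sub-second action and dividing the elapsed real time by the number of repetitions, as described above, can be sketched as follows. This is an illustration under assumptions rather than the procedure used in this work, where the command line “time” utility was used: `average_duration` and `reconfigure_stub` are hypothetical names, and the stub merely sleeps in place of a real partial reconfiguration through the ICAP.

```python
import time
from typing import Callable


def average_duration(action: Callable[[], None], repetitions: int) -> float:
    """Return the mean wall-clock time in seconds of one call to `action`."""
    if repetitions <= 0:
        raise ValueError("repetitions must be positive")
    start = time.perf_counter()
    for _ in range(repetitions):
        action()
    return (time.perf_counter() - start) / repetitions


def reconfigure_stub() -> None:
    """Hypothetical stand-in for one partial reconfiguration of the RP."""
    time.sleep(0.001)


if __name__ == "__main__":
    mean_s = average_duration(reconfigure_stub, repetitions=100)
    print(f"mean time per reconfiguration: {mean_s:.4f} s")
```

Repeating the action also keeps the device in the reconfiguring state long enough for a once-per-second power reading to settle, which a single reconfiguration would not.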
5.3.2  Results and Analysis

The time we observed for configuration can be seen in Figure 5.13. The full configuration time does not take into account the time to load the OS into memory as that depends on the interface used and the size of the OS file. We also ignored any time related to user actions, such as moving or clicking the mouse and typing. Partial configuration does not require the OS to be reloaded and therefore only depends on the bitfile configuration time. We observed that in our implementation DPR achieves approximately an 87x speedup over full configuration in the time it takes to modify the dynamic logic (roughly 35 seconds for the full bitfile load and OS boot versus 0.40 seconds for the partial bitfile). As shown previously in Figure 5.11, the dynamic CAD tool flow is on average 74 seconds slower than the integrated flow when creating a single circuit. However, this does not take into consideration the ability to import the static implementation, which reduces the CAD tool time by about ten minutes for any dynamic implementations after the first. The times shown also do not consider user time during the design phase, such as learning new tools, setup time, and difficulty in achieving timing closure, all of which we encountered and estimate took many hours for the dynamic implementation.

Figure 5.13: Configuration and reboot time for full and partial reconfiguration. (Time in seconds. Full configuration: bitfile 21.5, OS 13.5. DPR: bitfile 0.40.)

Figure 5.14 shows the power of the LEON3 system before, during, and after DPR. As expected the LEON3 with a blank RM has a lower power than the LEON3 with an AES RM. During reconfiguration the power dissipated is essentially an overhead on the idle power consumption. This is why we observe a consistent increase in power during reconfiguration compared to before reconfiguration in either implementation. Our results show the power of repeatedly reconfiguring the RP using the same partial bitfile.
The power consumption of the first partial reconfiguration (e.g., switching from a blank RM to an AES circuit) will probably look different from that of reconfiguring an RM repeatedly with the same configuration, but the first configuration occurs too quickly to capture properly.

After reconfiguration we have observed a puzzling result. After dynamically reconfiguring the system with the blank partial bitfile the power consumption does not return to the value we would expect. Instead, the power after reconfiguring the RP with the blank RM is higher than the configuration with the AES circuit. This is something we noticed with all partial bitfiles and was not due to our ICAP implementation as we also tested the bitfiles with iMPACT. Our hypothesis is that this is an unintended side effect of DPR and perhaps related to a problem with our specific implementation of the DPR flow. We think that resources in the dynamic region may be optimized for low power when not configured and our partial reconfiguration implementation is not putting these resources back into their low power mode when unconfiguring them. The impact of configuration bits on leakage power is studied in [11], in which Anderson and Najm show through SPICE simulation that certain combinations of configuration bits can lead to much higher static power. Further understanding this issue is an interesting avenue for future work.

By using averages of our partial reconfiguration power (34.25 mW) and time (0.40 s) we can estimate an energy usage of 34.25 mW × 0.40 s = 13.7 mJ for a single partial reconfiguration in our framework. Therefore, to swap the blank RM into the RP and later reconfigure the RP with the AES RM would take twice the energy, or 27.4 mJ. Our idle AES circuit and our clock-gated AES circuit consume 30 mW and 17 mW more power respectively than the blank RM. This does not take into account any power consumption for enabling or disabling clock gating, which we consider to have a negligible overhead. Therefore, to save power by swapping the blank RM in and out we would require the AES circuit to be idle for 913 ms (27.4 mJ / 30 mW) or the clock-gated AES circuit to be idle for 1.612 seconds (27.4 mJ / 17 mW). This is assuming the partial blank bitfile returns the idle power consumption to the same values as the full bitfile, which we have not found to be the case.

Figure 5.14: Implementation before, during, and after internal partial reconfiguration versus power. (Power in W for the blank RM and the AES with pipelining.)

Figure 5.15 shows the effects of pipelining on the absolute power consumption for the AES circuit in six different implementations. The CPU-driven dynamic and integrated AES circuits see very little effect from pipelining, while the circuits with LFSRs see a significant decrease in power as the number of pipelining stages increases. These results are similar for all four LFSR implementations. In most cases the power consumption of circuits with the dynamic implementation is slightly above the same circuit with an integrated implementation. The same is true of circuits with the constrained stand-alone implementation compared to the unconstrained stand-alone implementation. In both cases this is expected due to global optimization and reduced constraints.

Figure 5.15: The power consumption of various AES implementations versus number of pipelining stages. (Power in W for dynamic LFSR, integrated LFSR, dynamic, integrated, constrained stand-alone LFSR, and unconstrained stand-alone LFSR.)

Figure 5.16 shows the change in power compared to the combinational circuit for each of the four LFSR implementations.
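The relative change plotted in Figure 5.16 is the percentage difference from the zero-stage (combinational) circuit. A minimal sketch shows how the same absolute saving in the AES logic appears as a different percentage depending on how much other logic shares the power budget. The wattages below are made-up values chosen only to make the arithmetic easy to follow; they are not measurements from this work, and the function name is hypothetical.

```python
def relative_change_percent(power_w: float, baseline_w: float) -> float:
    """Percentage change of `power_w` relative to `baseline_w`."""
    if baseline_w <= 0:
        raise ValueError("baseline power must be positive")
    return 100.0 * (power_w - baseline_w) / baseline_w


# Made-up example values (not measured data): pipelining saves the same
# 0.5 W in the AES logic in both systems.
AES_SAVING_W = 0.5
STANDALONE_BASELINE_W = 1.0  # hypothetical system that is almost entirely AES
SOC_BASELINE_W = 3.0         # hypothetical system with a large share of non-AES logic

standalone_change = relative_change_percent(
    STANDALONE_BASELINE_W - AES_SAVING_W, STANDALONE_BASELINE_W)
soc_change = relative_change_percent(
    SOC_BASELINE_W - AES_SAVING_W, SOC_BASELINE_W)

if __name__ == "__main__":
    print(f"stand-alone change: {standalone_change:.1f}%")
    print(f"SoC change:         {soc_change:.1f}%")
```

The smaller the share of total power that belongs to the circuit under test, the smaller the relative improvement for a given absolute saving.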
The stand-alone circuits see a bigger percentage improvement compared to the LEON3 circuits, as a higher percentage of the power dissipated in the implementation is due to AES compared to non-AES logic. The results between the constrained and unconstrained stand-alone implementations are within 3%. This may be because there is not a significant amount of logic in the test harness that can benefit from global optimization in the unconstrained implementation. Also, the stand-alone circuits were combined with the test harness at the NGDBuild stage, which eliminates any opportunity for optimization in the XST stage. The dynamic and integrated LEON3 circuits have a much larger difference as the static logic is a significant part of the SoC design and therefore provides many more opportunities for global optimization.

Figure 5.16: The relative power consumption compared to combinational AES encryption versus number of pipelining stages. (Power change in % for dynamic AES with LFSR, integrated AES with LFSR, constrained stand-alone AES with LFSR, and unconstrained stand-alone AES with LFSR.)

5.4  Summary

This chapter provided a characterization of our framework, while showing two example experiments. We provided an analysis of power, temperature, CAD time, and configuration time in our platform. We showed that stand-alone circuit tests can lead to the wrong power conclusions when they apply more input activity than would occur in a realistic SoC scenario. We also evaluated our DPR implementation and explained when it would be beneficial to use for power savings.

Chapter 6

Conclusion

6.1  Thesis Summary

The power gap between FPGAs and ASICs is significant and continues to restrict the use of FPGAs in low-power applications.
Evaluating power consumption and tradeoffs at the application and CAD tool levels is important and not adequately addressed by previous power models and measurement studies; assumptions on signal activity and other simplifications can lead to poor accuracy in results. Proper evaluation is key to allowing researchers to better explore the impact of FPGA power-reduction proposals.

To address these needs, this thesis presents an FPGA framework for power evaluation with high-level access to DFS, clock gating, DPR, and power and temperature measurements. Our implementation is integrated into a realistic SoC. Power-management techniques have been implemented and can be accessed through the command line in the Linux OS. This allowed us to evaluate the impact of algorithmic and CAD tool changes on power using on-chip FPGA measurements. Future researchers will be able to use this framework in their experimental flow to aid them in reaching accurate and meaningful power conclusions. In turn, this will guide experimenters to lower-power FPGAs, enabling FPGAs to be employed in low-power and mobile devices.

This thesis is structured as follows: Chapter 2 provides an overview of FPGAs, DPR, power analysis, and previous power analysis frameworks. Chapter 3 describes the desirable characteristics of our platform. Chapter 4 explains our framework implementation and how we extended the LEON3 with high-level hooks and power measurements. Chapter 5 gives a characterization of our framework and an evaluation of power in both pipelining and DPR example circuits.

Our contributions can be summarized as follows:

1. We designed and implemented an experimental platform which will allow researchers to evaluate power reduction techniques on a realistic SoC. Our framework supports a variety of experiments and the power results are meaningful and accurate. DPR allows the user to save experimentation time when swapping test circuits in and out of the system.

2. We demonstrated the use of this platform by characterizing the power consumption of our system and evaluating the power implications of clock gating, pipelining, and DPR. We showed that we were also able to measure chip temperature approximately.

6.2  Limitations

Our research has certain limitations which, in the interests of time, could not be addressed fully.

1. Our framework observes measurements using the System Monitor (SysMon), which is only available on Xilinx Virtex-5 and newer FPGAs. The SysMon has the ability to observe the FPGA core voltage, but measuring the current to determine power depends on the external implementation. The ML605 board provides the ability to measure FPGA core current, but other boards, such as the ML509, need to be physically modified. This limits the portability of our framework.

2. Our ICAP implementation and driver for DPR has been developed for Xilinx FPGAs and specifically for the Virtex-6. Using a different family of FPGAs would require just a small change in the instantiation of the ICAP primitive, but any future changes to the protocol would require more extensive changes to the code.

3. The power and temperature measurements of our framework are not repeatable. Measurement results could differ day-to-day, after the chip has warmed up due to temperature, or between FPGAs due to process variation. The power results are also limited by the sampling frequency, which could cause significant power data to be lost.

4. ChipScope can record power consumption data at most once per second, whereas the SysMon can sample the FPGA voltage and current once every 5 µs. This prohibits us from accurately obtaining power consumption for short actions, where we cannot allow the execution to reach a stable power and temperature. This is not a problem for repeatable actions, where we can measure the power over multiple runs, but becomes more difficult for non-repeatable actions, such as partial reconfiguration.

5. The SysMon sends data over the JTAG interface, which prevents the use of the JTAG port for communication with other devices. This could be overcome by recording SysMon data on-chip where it can be used later, or by modifying the SysMon to send data over a different interface.

6. Making changes in the static logic or static-dynamic interface of a DPR-capable system requires a reimplementation of all bitfiles. This can take hours to days when many bitfiles are used.

7. The implementation and timing closure of a dynamic partition is more difficult than for a system that does not enable DPR. The RP needs to be over-partitioned due to the extra constraints, which can lead to fragmentation of the FPGA resources. This means that a design which fits in a non-DPR system may not fit in a DPR system.

6.3  Future Work

Although we have shown why our framework is important and how it can be used, there are still several useful items for future research consideration that we have identified.

6.3.1  Short-Term

The short-term future work we have identified is:

An exploration of why power consumption is increasing when using a blank partial bitfile

We feel this is either due to an error being introduced into the measurement reporting system or the real power of the FPGA increasing. We hypothesize the increase in power is due to the constraints, implementation tools or options, or is inherent to the way DPR has been designed on our FPGA. This requires further investigation. Implementing the ICAP with readback capability would allow for verification that the FPGA configuration, after a partial reconfiguration, matches that of the full bitfile. Also, configuring then unconfiguring very specific resources, such as a group of registers, to explore the effect on power could provide insight into this occurrence of increased power.
An optimization of our ICAP implementation

Our implementation can be improved to allow shorter reconfiguration times by adding a data buffer or by setting up the ICAP to directly access system memory. Alternatively, this could be done by building on previous work in this area, such as [63]. In that study, Lai and Diessel provide a customizable and reusable bus-based ICAP implementation. Their work operates on any bus, without the need for a CPU, and supports DMA. This could be easily integrated into our framework and would be beneficial to researchers evaluating DPR.

A systematic way to automatically record results

It was tedious and time-consuming to manually reconfigure, observe, and record the power consumption and temperature results of our platform. Any changes to our circuits required us to repeat this process to update our results. We believe this could be automated by using system calls for reconfiguration, parsing the result data to determine when the results are no longer changing, and writing the end result to a formatted file. This would be useful for those planning to run many tests with this framework.

Using more benchmarks

We would have liked to investigate our results using multiple benchmarks. However, manually pipelining benchmarks, meeting timing closure in dynamic regions, and creating software applications and drivers is a lengthy process. Simplifying or automating this process would be valuable to reduce the implementation time for future benchmarks. This could be accomplished by making use of an automated pipelining tool or a DPR design tool, by better understanding the constraints required for dynamic logic resources, and by writing code that is easily reused. Potential choices for future benchmarks could be found in the GroundHog 2009 benchmark suite [51] and/or the VTR benchmark suite [105].

6.3.2 Long-Term

We have identified several avenues of long-term future work.
One such avenue would be to extend our evaluation platform into an adaptive DPR system. This would include being able to obtain and use bitfiles from a central database, for example over the Internet. Systems could receive hardware updates in the same way they currently receive software updates. OS scheduling in combination with DFS or DPR would allow the system to adapt to factors monitored in real time, such as its workload, power consumption, and temperature.

An adaptive system could also have applications in other areas, such as reliability. If a resource has failed or degraded due to aging, the critical path of a circuit could be affected and errors could result. Our framework could be used to reroute a path or move a logic block with partial bitfiles to correct these types of errors. Future research is necessary to explore how an FPGA should analyze its logic and understand the reason for a hardware error.

Another interesting area of research would be to merge our platform with a DPR design framework, such as GoAhead [17]. GoAhead includes communication infrastructure which becomes more useful when using many reconfigurable cores attached to a system bus. This could save area by eliminating the need for proxy LUTs and reduce implementation time by allowing RMs to be configured in multiple RPs in a system. Changes to the static logic would not require previous RMs to be reimplemented. Furthermore, by knowing the RP constraints, a partial bitfile containing a new RM could be made without requiring full knowledge of the static logic, and transferred to the system for partial reconfiguration.

Alternatively, combining our work with an FPGA overlay, such as that described in [20], would be immensely useful as it would allow a single bitfile to be used for configuration on many FPGAs from various vendors.
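
To make the adaptive behaviour described earlier in this section slightly more concrete, the following sketch shows one very simple policy that steps a clock frequency up or down based on monitored power and temperature. It is only an illustration under stated assumptions: the function name, the limits, the headroom factor, and the frequency table are all hypothetical, and a real system would obtain its readings from the SysMon and apply the result through its DFS mechanism.

```python
def next_frequency_index(current: int, n_levels: int,
                         power_w: float, temp_c: float,
                         power_limit_w: float, temp_limit_c: float,
                         headroom: float = 0.8) -> int:
    """Choose the next index into a table of clock frequencies sorted low to high.

    Step down when either limit is exceeded, step up only when both readings
    are comfortably below their limits, and otherwise hold the current level.
    """
    over_limit = power_w > power_limit_w or temp_c > temp_limit_c
    relaxed = (power_w < headroom * power_limit_w
               and temp_c < headroom * temp_limit_c)
    if over_limit:
        return max(current - 1, 0)
    if relaxed:
        return min(current + 1, n_levels - 1)
    return current


# Hypothetical frequency table (MHz) and limits, chosen only for illustration.
frequencies_mhz = [25, 50, 75, 100]
level = 2
level = next_frequency_index(level, len(frequencies_mhz),
                             power_w=5.0, temp_c=60.0,
                             power_limit_w=4.0, temp_limit_c=85.0)
print(frequencies_mhz[level])
```

The gap between the limit and the headroom threshold acts as a simple hysteresis band, so that a reading hovering near the limit does not cause the frequency to toggle on every decision.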
Lastly, since this framework focuses only on application and CAD tool evaluation, extending this work to circuit and architecture research would be challenging, but important. There are many power-saving techniques that are implemented at the circuit and architecture abstraction levels, and testing these on a realistic SoC would be beneficial. One interesting circuit-level application is modifying the supply rail voltages at run-time based on system measurements. This could be done using an external implementation, as in [81]. An FPGA overlay, as mentioned above, could also be used with this framework to explore the power implications of architecture choices. It would be necessary to explore how the underlying physical architecture affects the conclusions drawn about the overlay architecture, as this is not fully understood.

6.4 Public Distribution

One goal of our work was to aid future researchers in evaluating their power proposals. We have made our work publicly available for this purpose at∼stuartd/

Bibliography

[1] N. Abdelli, A. Fouilliart, N. Mien, and E. Senn. High-Level Power Estimation of FPGA. In Industrial Electronics, 2007. ISIE 2007. IEEE International Symposium on, pages 925–930. IEEE, June 2007. doi:10.1109/ISIE.2007.4374721. → pages 27, 31, 36
[2] Aeroflex Gaisler. LEON Linux 2.6 Development, December 2010. URL http: // → pages 56
[3] Aeroflex Gaisler. GRLIB IP Core Users Manual, January 2013. URL → pages 53, 55
[4] Aeroflex Gaisler. GRLIB IP Library Users Manual, January 2013. URL → pages 52, 55
[5] Aeroflex Gaisler. GRMON Users Manual, April 2013. URL → pages 59
[6] S. Ahuja, D. Mathaikutty, G. Singh, J. Stetzer, S. Shukla, and A. Dingankar. Power Estimation Methodology for a High-Level Synthesis Framework. In Quality of Electronic Design, 2009. ISQED 2009. Quality Electronic Design, pages 541–546, March 2009. doi:10.1109/ISQED.2009.4810352. → pages 34
[7] S. Ahuja, W. Zhang, and S. Shukla.
System Level Simulation Guided Approach to Improve the Efficacy of Clock-Gating. In High Level Design Validation and Test Workshop (HLDVT), 2010 IEEE International, pages 9–16, June 2010. doi:10.1109/HLDVT.2010.5496669. → pages 34
[8] Altera. PowerPlay Early Power Estimator User Guide, July 2012. URL epe.pdf. → pages 34
[9] Altera. Quartus II Handbook Version 12.1 Volume 3: Verification, July 2012. URL qii5v3.pdf. → pages 34
[10] J. Anderson and F. Najm. Power-Aware Technology Mapping for LUT-Based FPGAs. In Field-Programmable Technology, 2002. (FPT). Proceedings. 2002 IEEE International Conference on, pages 211–218, December 2002. doi:10.1109/FPT.2002.1188684. → pages 2
[11] J. Anderson and F. Najm. Active Leakage Power Optimization for FPGAs. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 25(3):423–437, March 2006. ISSN 0278-0070. doi:10.1109/TCAD.2005.853692. → pages 2, 98
[12] ARM. AMBA Specification (Rev 2.0), May 1999. URL∼magagni/amba99.pdf. → pages 54, 55
[13] W. Atabany and P. Degenaar. Parallelism to Reduce Power Consumption on FPGA Spatiotemporal Image Processing. In Circuits and Systems, 2008. ISCAS 2008. IEEE International Symposium on, pages 1476–1479, May 2008. doi:10.1109/ISCAS.2008.4541708. → pages 2
[14] D. Atienza, P. G. Del Valle, G. Paci, F. Poletti, L. Benini, G. De Micheli, and J. M. Mendias. A Fast HW/SW FPGA-Based Thermal Emulation Framework for Multi-Processor System-on-Chip. In Proceedings of the 43rd annual Design Automation Conference, DAC ’06, pages 618–623, New York, NY, USA, July 2006. ACM. ISBN 1-59593-381-6. doi:10.1145/1146909.1147068. URL 3. → pages 33, 38
[15] N. Azizi, F. Najm, and A. Moshovos. Low-Leakage Asymmetric-Cell SRAM. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 11(4):701–715, August 2003. ISSN 1063-8210. doi:10.1109/TVLSI.2003.816139. → pages 2
[16] T. Becker, P. Jamieson, W. Luk, P. Cheung, and T. Rissa.
Power Characterisation for the Fabric in Fine-Grain Reconfigurable Architectures. In Programmable Logic, 2009. SPL. 5th Southern Conference on, pages 77–82, April 2009. doi:10.1109/SPL.2009.4914910. → pages 27, 28
[17] C. Beckhoff, D. Koch, and J. Torresen. Go Ahead: A Partial Reconfiguration Framework. In Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on, pages 37–44, May 2012. doi:10.1109/FCCM.2012.17. → pages 15, 108
[18] R. Ben Atitallah, E. Senn, D. Chillet, M. Lanoe, and D. Blouin. An Efficient Framework for Power-Aware Design of Heterogeneous MPSoC. Industrial Informatics, IEEE Transactions on, 9(1):487–501, February 2013. ISSN 1551-3203. doi:10.1109/TII.2012.2198657. → pages 31, 36
[19] Berkeley Logic Synthesis and Verification Group. ABC: A System for Sequential Synthesis and Verification, 2011. URL∼alanmi/abc/. Retrieved on May 1, 2013. → pages 35
[20] A. Brant and G. Lemieux. ZUMA: An Open FPGA Overlay Architecture. In Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on, pages 93–96, April 2012. doi:10.1109/FCCM.2012.25. → pages 108
[21] A. Brito, M. Kuhnle, M. Hubner, J. Becker, and E. U. K. Melcher. Modelling and Simulation of Dynamic and Partially Reconfigurable Systems using SystemC. In VLSI, 2007. ISVLSI ’07. IEEE Computer Society Annual Symposium on, pages 35–40, May 2007. doi:10.1109/ISVLSI.2007.69. → pages 17
[22] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: a Framework for Architectural-Level Power Analysis and Optimizations. SIGARCH Comput. Archit. News, 28(2):83–94, May 2000. ISSN 0163-5964. doi:10.1145/342001.339657. URL → pages 31, 32
[23] S. Brown. An Overview of Technology, Architecture and CAD Tools for Programmable Logic Devices. In Custom Integrated Circuits Conference, 1994., Proceedings of the IEEE 1994, pages 69–76, May 1994. doi:10.1109/CICC.1994.379764. → pages 7, 9
[24] R. Burch, F. Najm, P. Yang, and T. Trick.
A Monte Carlo Approach for Power Estimation. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 1(1):63–71, March 1993. doi: → pages 24
[25] J. Chan and S. Parameswaran. NoCEE: Energy Macro-Model Extraction Methodology for Network on Chip Routers. In Proceedings of the 2005 IEEE/ACM International conference on Computer-aided design, ICCAD ’05, pages 254–259, Washington, DC, USA, November 2005. IEEE Computer Society. ISBN 0-7803-9254-X. URL → pages 31
[26] N. Chang and K. Kim. Real-Time Per-Cycle Energy Consumption Measurement of Digital Systems. Electronics Letters, 36(13):1169–1171, June 2000. ISSN 0013-5194. doi:10.1049/el:20000811. → pages 28
[27] N. Chang, K. Kim, and H. G. Lee. Cycle-Accurate Energy Consumption Measurement and Analysis: Case Study of ARM7TDMI. In Low Power Electronics and Design, 2000. ISLPED ’00. Proceedings of the 2000 International Symposium on, pages 185–190, July 2000. doi: → pages 28
[28] D. Chen, J. Cong, F. Li, and L. He. Low-Power Technology Mapping for FPGA Architectures with Dual Supply Voltages. In Proceedings of the 2004 ACM/SIGDA 12th international symposium on Field programmable gate arrays, FPGA ’04, pages 109–117, New York, NY, USA, February 2004. ACM. ISBN 1-58113-829-6. doi:10.1145/968280.968297. URL → pages 2
[29] E. Chen, W. Gruver, D. Sabaz, and L. Shannon. Facilitating Processor-Based DPR Systems for non-DPR Experts. In Field-Programmable Custom Computing Machines, 2008. FCCM ’08. 16th International Symposium on, pages 318–319, April 2008. doi:10.1109/FCCM.2008.57. → pages 15
[30] H. H. Chen and D. D. Ling. Power Supply Noise Analysis Methodology for Deep-Submicron VLSI Chip Design. In Proceedings of the 34th annual Design Automation Conference, DAC ’97, pages 638–643, New York, NY, USA, June 1997. ACM. ISBN 0-89791-920-3. doi:10.1145/266021.266307. URL → pages 22
[31] X. Chen and L.-S. Peh. Leakage Power Modeling and Optimization in Interconnection Networks.
In Proceedings of the 2003 international symposium on Low power electronics and design, ISLPED ’03, pages 90–95, New York, NY, USA, August 2003. ACM. ISBN 1-58113-682-X. doi:10.1145/871506.871531. URL → pages 32
[32] S. Chin, C. Lee, and S. Wilton. Power Implications of Implementing Logic Using FPGA Embedded Memory Arrays. In Field Programmable Logic and Applications, 2006. FPL ’06. International Conference on, pages 1–8, August 2006. doi:10.1109/FPL.2006.311200. → pages 29, 44, 45, 90
[33] L. Ciccarelli, A. Lodi, and R. Canegallo. Low Leakage Circuit Design for FPGAs. In Custom Integrated Circuits Conference, 2004. Proceedings of the IEEE 2004, pages 715–718, October 2004. doi:10.1109/CICC.2004.1358929. → pages 2
[34] C. Claus, F. Muller, J. Zeppenfeld, and W. Stechele. A New Framework to Accelerate Virtex-II Pro Dynamic Partial Self-Reconfiguration. In Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International, pages 1–7, March 2007. doi:10.1109/IPDPS.2007.370362. → pages 39
[35] O. Coudert, J. Cong, S. Malik, and M. Sarrafzadeh. Incremental CAD. In Proceedings of the 2000 IEEE/ACM international conference on Computer-aided design, ICCAD ’00, pages 236–244, Piscataway, NJ, USA, November 2000. IEEE Press. ISBN 0-7803-6448-1. URL → pages 11
[36] V. Degalahal and T. Tuan. Methodology for High Level Estimation of FPGA Power Consumption. In Design Automation Conference, 2005. Proceedings of the ASP-DAC 2005. Asia and South Pacific, volume 1, pages 657–660 Vol. 1, January 2005. doi:10.1109/ASPDAC.2005.1466245. → pages 70
[37] L. Deng, K. Sobti, and C. Chakrabarti. Accurate Models for Estimating Area and Power of FPGA Implementations. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 1417–1420, April 2008. doi:10.1109/ICASSP.2008.4517885. → pages 37
[38] C.-S. Ding, C.-T. Hsieh, and M. Pedram. Improving the Efficiency of Monte Carlo Power Estimation.
Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 8(5):584–593, October 2000. ISSN 1063-8210. doi:10.1109/92.894163. → pages 24
[39] D. Elléouet, N. Julien, D. Houzet, J.-G. Cousin, and M. Martin. Power Consumption Characterization and Modeling of Embedded Memories in Xilinx Virtex 400E FPGA. In Digital System Design, 2004. DSD 2004. Euromicro Symposium on, pages 394–401, August 2004. doi:10.1109/DSD.2004.1333302. → pages 35
[40] D. Elléouet, Y. Savary, and N. Julien. An FPGA Power Aware Design Flow. In J. Vounckx, N. Azemard, and P. Maurine, editors, Integrated Circuit and System Design. Power and Timing Modeling, Optimization and Simulation, volume 4148 of Lecture Notes in Computer Science, pages 415–424. Springer Berlin Heidelberg, September 2006. ISBN 978-3-540-39094-7. doi:10.1007/11847083_40. URL 40. → pages 31, 35
[41] D. D. Gajski, J. Zhu, R. Domer, A. Gerstlauer, and S. Zhao. SpecC: Specification Language and Methodology. Springer, 2000. ISBN 0792378229. → pages 34
[42] A. Gayasen, K. Lee, N. Vijaykrishnan, M. Kandemir, M. Irwin, and T. Tuan. A Dual-VDD Low Power FPGA Architecture. In J. Becker, M. Platzner, and S. Vernalde, editors, Field Programmable Logic and Application, volume 3203 of FPL ’04, pages 145–157. Springer Berlin Heidelberg, August 2004. ISBN 978-3-540-22989-6. doi:10.1007/978-3-540-30117-2_17. URL 17. → pages 2
[43] D. Gibson, M. Vasilko, and D. I. Long. Virtual Prototyping for Dynamically Reconfigurable Architectures using Dynamic Generic Mapping. In VHDL International Users Forum. Bournemouth University, Fern Barrow, Poole, Dorset, BH12 5BB, UK, March 1998. URL http: // → pages 17
[44] T. Givargis and F. Vahid. Platune: a Tuning Framework for System-on-a-Chip Platforms. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 21(11):1317–1327, November 2002. ISSN 0278-0070. doi:10.1109/TCAD.2002.804107. 1. → pages 31, 33
[45] J. Goeders and S. Wilton.
VersaPower: Power estimation for diverse FPGA architectures. In Field-Programmable Technology (FPT), 2012 International Conference on, pages 229–234, December 2012. doi:10.1109/FPT.2012.6412139. → pages 4, 36, 44, 45
[46] S. Gupta, J. Anderson, L. Farragher, and Q. Wang. CAD Techniques for Power Optimization in Virtex-5 FPGAs. In Custom Integrated Circuits Conference, 2007. CICC ’07. IEEE, pages 85–88, September 2007. doi:10.1109/CICC.2007.4405687. → pages 3
[47] D. Hodges, H. Jackson, and R. Saleh. Analysis and Design of Digital Integrated Circuits, volume 88. McGraw-Hill, New York, 2003. → pages 18, 19, 22
[48] C.-W. Hsu, J.-L. Liao, S.-C. Fang, C.-C. Weng, S.-Y. Huang, W.-T. Hsieh, and J.-C. Yeh. PowerDepot: Integrating IP-Based Power Modeling with ESL Power Analysis for Multi-Core SoC Designs. In Proceedings of the 48th Design Automation Conference, DAC ’11, pages 47–52, New York, NY, USA, June 2011. ACM. ISBN 978-1-4503-0636-2. doi:10.1145/2024724.2024736. URL → pages 34
[49] S. Huda, M. Mallick, and J. Anderson. Clock Gating Architectures for FPGA Power Reduction. In Field Programmable Logic and Applications, 2009. FPL 2009. International Conference on, pages 112–118, August 2009. doi:10.1109/FPL.2009.5272538. → pages 70
[50] International Technology Roadmap for Semiconductors. 2011 Edition System Drivers. Technical report, International Technology Roadmap for Semiconductors, July 2011. URL → pages 18
[51] P. Jamieson, T. Becker, W. Luk, P. Y. K. Cheung, T. Rissa, and T. Pitkanen. Benchmarking Reconfigurable Architectures in the Mobile Domain. In Field Programmable Custom Computing Machines, 2009. FCCM ’09. 17th IEEE Symposium on, pages 131–138, April 2009. doi:10.1109/FCCM.2009.13. → pages 81, 107
[52] P. Jamieson, W. Luk, S. Wilton, and G. Constantinides. An Energy and Power Consumption Analysis of FPGA Routing Architectures. In Field-Programmable Technology, 2009. FPT 2009.
International Conference on, pages 324–327, December 2009. doi:10.1109/FPT.2009.5377675. → pages 35
[53] P. Jamieson, K. Kent, F. Gharibian, and L. Shannon. Odin II - An Open-Source Verilog HDL Synthesis Tool for CAD Research. In Field-Programmable Custom Computing Machines (FCCM), 2010 18th IEEE Annual International Symposium on, pages 149–156, May 2010. doi:10.1109/FCCM.2010.31. → pages 35
[54] R. Jevtic and C. Carreras. Power Estimation of Embedded Multiplier Blocks in FPGAs. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 18(5):835–839, May 2010. ISSN 1063-8210. doi:10.1109/TVLSI.2009.2015326. → pages 35
[55] R. Jevtic and C. Carreras. Power Measurement Methodology for FPGA Devices. Instrumentation and Measurement, IEEE Transactions on, 60(1):237–247, January 2011. ISSN 0018-9456. doi:10.1109/TIM.2010.2047664. → pages 20
[56] T. Jiang, X. Tang, and P. Banerjee. Macro-Models for High-Level Area and Power Estimation on FPGAs. International Journal of Simulation and Process Modelling, 2(1):12–19, April 2006. doi: → pages 37
[57] A. B. Kahng, B. Li, L.-S. Peh, and K. Samadi. ORION 2.0: a Fast and Accurate NoC Power and Area Model for Early-Stage Design Space Exploration. In Proceedings of the Conference on Design, Automation and Test in Europe, DATE ’09, pages 423–428, 3001 Leuven, Belgium, April 2009. European Design and Automation Association. ISBN 978-3-9810801-5-5. URL → pages 22, 31, 32
[58] D. Koch, C. Beckhoff, and J. Teich. ReCoBus-Builder A Novel Tool and Technique to Build Statically and Dynamically Reconfigurable Systems for FPGAS. In Field Programmable Logic and Applications, 2008. FPL 2008. International Conference on, pages 119–124, September 2008. doi:10.1109/FPL.2008.4629918. → pages 16
[59] D. C. Ku and G. De Micheli. Hardware C-a Language for Hardware Design. Technical report, DTIC Document, 1988. → pages 34
[60] M. Kuehnle, A. Brito, C. Roth, M. Kruesselin, and J. Becker.
An Approach for Power and Performance Evaluation of Reconfigurable SoC at Mixed Abstraction Levels. In Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), 2011 6th International Workshop on, pages 1–8, June 2011. doi:10.1109/ReCoSoC.2011.5981516. 7. → pages 34
[61] I. Kuon and J. Rose. Measuring the Gap Between FPGAs and ASICs. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 26(2):203–215, February 2007. ISSN 0278-0070. doi:10.1109/TCAD.2006.884574. → pages 2, 10
[62] E. Kusse and J. Rabaey. Low-Energy Embedded FPGA Structures. In Low Power Electronics and Design, 1998. Proceedings. 1998 International Symposium on, pages 155–160, August 1998. URL org/stamp/stamp.jsp?tp=&arnumber=708181&isnumber=15337. → pages 2, 9, 20
[63] V. Lai and O. Diessel. ICAP-I: A Reusable Interface for the Internal Reconfiguration of Xilinx FPGAs. In Field-Programmable Technology, 2009. FPT 2009. International Conference on, pages 357–360, December 2009. doi:10.1109/FPT.2009.5377616. → pages 39, 106
[64] J. Lamoureux and S. J. E. Wilton. On the Interaction Between Power-Aware FPGA CAD Algorithms. In Proceedings of the 2003 IEEE/ACM international conference on Computer-aided design, ICCAD ’03, pages 701–, Washington, DC, USA, November 2003. IEEE Computer Society. ISBN 1-58113-762-1. doi:10.1109/ICCAD.2003.106. URL → pages 2, 44
[65] J. Lamoureux, G. Lemieux, and S. Wilton. GlitchLess: Dynamic Power Minimization in FPGAs Through Edge Alignment and Glitch Filtering. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 16(11):1521–1534, November 2008. ISSN 1063-8210. doi:10.1109/TVLSI.2008.2001237. → pages 2, 23
[66] J. Laurent, N. Julien, E. Senn, and E. Martin. Functional Level Power Analysis: An Efficient Approach for Modeling the Power Consumption of Complex Processors. In Proceedings of the conference on Design, automation and test in Europe - Volume 1, DATE ’04, pages 10666–, Washington, DC, USA, February 2004.
IEEE Computer Society. ISBN 0-7695-2085-5. URL → pages 31, 33
[67] C. Lavin, M. Padilla, P. Lundrigan, B. Nelson, and B. Hutchings. Rapid Prototyping Tools for FPGA Designs: RapidSmith. In Field-Programmable Technology (FPT), 2010 International Conference on, pages 353–356, December 2010. doi:10.1109/FPT.2010.5681429. → pages 3
[68] H. Lee, K. Lee, Y. Choi, and N. Chang. Cycle-Accurate Energy Measurement and Characterization of FPGAs. Analog Integrated Circuits and Signal Processing, 42:239–251, March 2005. ISSN 0925-1030. doi:10.1007/s10470-005-6758-5. URL → pages 35
[69] M. T.-C. Lee, V. Tiwari, S. Malik, and M. Fujita. Power Analysis and Minimization Techniques for Embedded DSP Software. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 5(1):123–135, March 1997. ISSN 1063-8210. doi:10.1109/92.555992. → pages 32
[70] D. Lewis, D. Cashman, M. Chan, J. Chromczak, G. Lai, A. Lee, T. Vanderhoek, and H. Yu. Architectural Enhancements in Stratix V. In Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays, FPGA ’13, pages 147–156, New York, NY, USA, February 2013. ACM. ISBN 978-1-4503-1887-7. doi:10.1145/2435264.2435292. URL → pages 9
[71] F. Li, Y. Lin, L. He, and J. Cong. Low-power FPGA using Pre-Defined Dual-Vdd/Dual-Vt Fabrics. In International Symposium on Field Programmable Gate Arrays: Proceedings of the 2004 ACM/SIGDA 12th international symposium on Field Programmable Gate Arrays (FPGA), volume 22, pages 42–50, February 2004. doi: → pages 2
[72] F. Li, Y. Lin, L. He, D. Chen, and J. Cong. Power Modeling and Characteristics of Field Programmable Gate Arrays. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 24(11):1712–1724, November 2005. ISSN 0278-0070. doi:10.1109/TCAD.2005.852293. → pages 24, 36, 43
[73] S. Li, J. H. Ahn, R. Strong, J. Brockman, D. Tullsen, and N. Jouppi.
McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures. In Microarchitecture, 2009. MICRO-42. 42nd Annual IEEE/ACM International Symposium on, pages 469–480, December 2009. URL → pages 22, 31, 32
[74] Y. Lin, F. Li, and L. He. Power Modeling and Architecture Evaluation for FPGA with Novel Circuits for Vdd Programmability. In Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays (FPGA), pages 199–207. ACM, February 2005. doi: → pages 43
[75] S. Liu, R. N. Pittman, and A. Forin. Energy Reduction with Run-Time Partial Reconfiguration. In Proceedings of the 18th annual ACM/SIGDA international symposium on Field programmable gate arrays, pages 292–292. ACM, September 2010. URL pubs/112466/energyreductionwithrun-timepartialreconfiguration.pdf. → pages 14, 28
[76] S. Liu, R. N. Pittman, A. Forin, and J.-L. Gaudiot. On Energy Efficiency of Reconfigurable Systems with Run-Time Partial Reconfiguration. In Application-specific Systems Architectures and Processors (ASAP), 2010 21st IEEE International Conference on, pages 265–272, July 2010. doi:10.1109/ASAP.2010.5540985. 12. → pages 2, 14, 28, 38, 39
[77] A. Lodi, L. Ciccarelli, and R. Giansante. Combining Low-Leakage Techniques for FPGA Routing Design. In Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays, FPGA ’05, pages 208–214, New York, NY, USA, February 2005. ACM. ISBN 1-59593-029-9. doi:10.1145/1046192.1046219. URL → pages 2
[78] W. Luk, N. Shirazi, and P. Y. K. Cheung. Compilation Tools for Run-Time Reconfigurable Designs. In Field-Programmable Custom Computing Machines, 1997. Proceedings., The 5th Annual IEEE Symposium on, pages 56–65, April 1997. doi:10.1109/FPGA.1997.624605. → pages 17
[79] P. Lysaght and J. Stockwood. A Simulation Tool for Dynamically Reconfigurable Field Programmable Gate Arrays.
Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 4(3):381–390, September 1996. ISSN 1063-8210. doi:10.1109/92.532038. → pages 17
[80] S. Malhotra, T. Borer, D. Singh, and S. Brown. The Quartus University Interface Program: Enabling Advanced FPGA Research. In Field-Programmable Technology, 2004. Proceedings. 2004 IEEE International Conference on, pages 225–230, December 2004. doi:10.1109/FPT.2004.1393272. → pages 3
[81] A. Nabina and J. L. Nunez-Yanez. Adaptive Voltage Scaling in a Dynamically Reconfigurable FPGA-Based Platform. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 5(4):20:1–20:22, December 2012. ISSN 1936-7406. doi:10.1145/2392616.2392618. URL 13. → pages 38, 41, 108
[82] F. Najm. Transition Density: a New Measure of Activity in Digital Circuits. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 12(2):310–323, February 1993. ISSN 0278-0070. doi:10.1109/43.205010. → pages 26
[83] F. Najm. A Survey of Power Estimation Techniques in VLSI Circuits. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 2(4):446–455, December 1994. ISSN 1063-8210. doi:10.1109/92.335013. → pages 23, 25, 26
[84] F. Najm. Power Estimation Techniques for Integrated Circuits. In Computer-Aided Design, 1995. ICCAD-95. Digest of Technical Papers., 1995 IEEE/ACM International Conference on, pages 492–499, November 1995. doi:10.1109/ICCAD.1995.480162. → pages 18, 24, 26
[85] F. N. Najm. Transition Density, a Stochastic Measure of Activity in Digital Circuits. In Proceedings of the 28th ACM/IEEE Design Automation Conference, DAC ’91, pages 644–649, New York, NY, USA, June 1991. ACM. ISBN 0-89791-395-7. doi:10.1145/127601.127744. URL → pages 26
[86] K. Nose and T. Sakurai. Analysis and Future Trend of Short-Circuit Power. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 19(9):1023–1030, September 2000. ISSN 0278-0070. doi:10.1109/43.863642. → pages 22, 32
[87] J.
Nunez-Yanez, V. Chouliaras, and J. Gaisler. Dynamic Voltage Scaling in a FPGA-Based System-on-Chip. In Field Programmable Logic and Applications, 2007. FPL 2007. International Conference on, pages 459–462, August 2007. doi:10.1109/FPL.2007.4380689. 8. → pages 2, 38, 40
[88] J. Oliver and E. Boemo. Power Estimations vs. Power Measurements in Cyclone III Devices. In Programmable Logic (SPL), 2011 VII Southern Conference on, pages 87–90, April 2011. doi:10.1109/SPL.2011.5782630. → pages 27
[89] J. Oliver, J. Curto, D. Bouvier, M. Ramos, and E. Boemo. Clock Gating and Clock Enable for FPGA Power Reduction. In Programmable Logic (SPL), 2012 VIII Southern Conference on, pages 1–5, March 2012. doi:10.1109/SPL.2012.6211782. → pages 70
[90] J. Ou and V. Prasanna. Rapid Energy Estimation of Computations on FPGA Based Soft Processors. In SOC Conference, 2004. Proceedings. IEEE International, pages 285–288, September 2004. doi:10.1109/SOCC.2004.1362437. → pages 37
[91] G. Palermo and C. Silvano. PIRATE: A Framework for Power/Performance Exploration of Network-on-Chip Architectures. In E. Macii, V. Paliouras, and O. Koufopavlou, editors, Integrated Circuit and System Design. Power and Timing Modeling, Optimization and Simulation, volume 3254 of Lecture Notes in Computer Science, pages 521–531. Springer Berlin Heidelberg, September 2004. ISBN 978-3-540-23095-3. doi:10.1007/978-3-540-30205-6_54. URL 54. → pages 31
[92] P. Panda. SystemC - a Modeling Platform Supporting Multiple Design Abstractions. In System Synthesis, 2001. Proceedings. The 14th International Symposium on, pages 75–80, September 2001. doi: → pages 34
[93] K. Papadimitriou, A. Anyfantis, and A. Dollas. An Effective Framework to Evaluate Dynamic Partial Reconfiguration in FPGA Systems. Instrumentation and Measurement, IEEE Transactions on, 59(6):1642–1651, June 2010. ISSN 0018-9456. doi:10.1109/TIM.2009.2026607. 5. → pages 38, 39
[94] K. Parker and E. McCluskey.
Probabilistic Treatment of General Combinational Networks. Computers, IEEE Transactions on, C-24(6):668–670, June 1975. ISSN 0018-9340. doi:10.1109/T-C.1975.224279. → pages 25
[95] A. Pelkonen, K. Masselos, and M. Cupak. System-Level Modeling of Dynamically Reconfigurable Hardware with SystemC. In Parallel and Distributed Processing Symposium, 2003. Proceedings. International, pages 8 pp.–, April 2003. doi:10.1109/IPDPS.2003.1213321. → pages 17
[96] T. Pitkänen, P. Jamieson, T. Becker, S. Moisio, and J. Takala. Power Consumption Benchmarking for Reconfigurable Platforms. Analog Integrated Circuits and Signal Processing, 73:649–659, November 2012. ISSN 0925-1030. doi:10.1007/s10470-012-9950-4. URL → pages 27, 28
[97] K. Poon and S. Wilton. Sensitivity of FPGA Power Evaluation. In Field-Programmable Technology, 2002. (FPT). Proceedings. 2002 IEEE International Conference on, pages 441–442, December 2002. doi:10.1109/FPT.2002.1188730. → pages 27
[98] K. K. W. Poon, S. J. E. Wilton, and A. Yan. A Detailed Power Model for Field-Programmable Gate Arrays. ACM Transactions on Design Automation of Electronic Systems (TODAES), 10(2):279–302, April 2005. ISSN 1084-4309. doi:10.1145/1059876.1059881. URL → pages 3, 22, 35
[99] A. Raabe and A. Felke. A SYSTEMC Language Extension for High-Level Reconfiguration Modelling. In Specification, Verification and Design Languages, 2008. FDL 2008. Forum on, pages 55–60, September 2008. doi:10.1109/FDL.2008.4641421. → pages 17
[100] A. Rahman and V. Polavarapuv. Evaluation of Low-Leakage Design Techniques for Field Programmable Gate Arrays. In Proceedings of the 2004 ACM/SIGDA 12th international symposium on Field programmable gate arrays, FPGA ’04, pages 23–30, New York, NY, USA, February 2004. ACM. ISBN 1-58113-829-6. doi:10.1145/968280.968285. URL → pages 2
[101] C. Ravishankar, S. Ananthanarayan, S. Garg, and A. Kennings.
EmPower: FPGA Based Rapid Prototyping of Dynamic Power Management Algorithms for Multi-Processor Systems on Chip. In Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on, pages 41–48, August 2012. doi:10.1109/FPL.2012.6339239. 11. → pages 38, 40
[102] A. Reimer, A. Schulz, and W. Nebel. Modelling Macromodules for High-Level Dynamic Power Estimation of FPGA-Based Digital Designs. In Proceedings of the 2006 international symposium on Low power electronics and design, ISLPED ’06, pages 151–154, New York, NY, USA, October 2006. ACM. ISBN 1-59593-462-6. doi:10.1145/1165573.1165609. URL → pages 37
[103] I. Robertson, J. Irvine, P. Lysaght, and D. Robinson. Improved Functional Simulation of Dynamically Reconfigurable Logic. In M. Glesner, P. Zipf, and M. Renovell, editors, Field-Programmable Logic and Applications: Reconfigurable Computing Is Going Mainstream, volume 2438 of Lecture Notes in Computer Science, pages 152–161. Springer Berlin Heidelberg, September 2002. ISBN 978-3-540-44108-3. doi:10.1007/3-540-46117-5_17. URL 17. → pages 17
[104] J. Rose. Private communication. Private Communication, 2012. → pages 2
[105] J. Rose, J. Luu, C. W. Yu, O. Densmore, J. Goeders, A. Somerville, K. B. Kent, P. Jamieson, and J. Anderson. The VTR Project: Architecture and CAD for FPGAs from Verilog to Routing. In Proceedings of the ACM/SIGDA international symposium on Field Programmable Gate Arrays, FPGA ’12, pages 77–86, New York, NY, USA, February 2012. ACM. ISBN 978-1-4503-1155-7. doi:10.1145/2145694.2145708. URL → pages 3, 35, 107
[106] M. Saldana, A. Patel, H. J. Liu, and P. Chow. Using Partial Reconfiguration in an Embedded Message-Passing System. In Reconfigurable Computing and FPGAs (ReConFig), 2010 International Conference on, pages 418–423, December 2010. doi:10.1109/ReConFig.2010.37. → pages 16
[107] V. Saxena, F. N. Najm, and I. Hajj. Monte-Carlo Approach for Power Estimation in Sequential Circuits.
In Proceedings of the 1997 European conference on Design and Test, EDTC ’97, pages 416–, Washington, DC, USA, March 1997. IEEE Computer Society. ISBN 0-8186-7786-4. URL → pages 24 [108] E. Senn, N. Julien, D. Elléouet, Y. Savary, and N. Abdelli. Building and Using System, Algorithmic, and Architectural Power and Energy Models in the FPGA Design-Flow. In Reconfigurable Communication-centric SoCs. LIRMM, July 2006. URL → pages 27 [109] L. Shang and N. Jha. High-Level Power Modeling of CPLDs and FPGAs. In Computer Design, 2001. ICCD 2001. Proceedings. 2001 International Conference on, pages 46 –51, September 2001. doi:10.1109/ICCD.2001.955002. → pages 24 [110] L. Shang, A. Kaviani, and K. Bathala. Dynamic Power Consumption in Virtex-II FPGA Family. In Proceedings of the 2002 ACM/SIGDA tenth international symposium on Field-programmable gate arrays, pages 157–164. ACM, February 2002. doi:10.1145/503048.503072. → pages 9, 20, 27 [111] D. She. FPGA Platform for Emulation of Composable and Predictable MPSoC Power Management. Master’s thesis, Eindhoven University of Technology, September 2009. URL 4. → pages 38, 40 [112] A. Shen, A. Ghosh, S. Devadas, and K. Keutzer. On Average Power Dissipation and Random Pattern Testability of CMOS Combinational 124  Logic Networks. In Proceedings of the 1992 IEEE/ACM international conference on Computer-aided design, ICCAD ’92, pages 402–407, Los Alamitos, CA, USA, November 1992. IEEE Computer Society Press. ISBN 0-89791-540-2. URL → pages 23 [113] W. Shum and J. Anderson. FPGA Glitch Power Analysis and Reduction. In Low Power Electronics and Design (ISLPED) 2011 International Symposium on, pages 27 –32, August 2011. doi:10.1109/ISLPED.2011.5993599. → pages 2, 23 [114] A. Singh, G. Parthasarathy, and M. Marek-Sadowska. Efficient Circuit Clustering for Area and Power Reduction in FPGAs. ACM Transactions on Design Automation of Electronic Systems (TODAES), 7(4):643–663, October 2002. ISSN 1084-4309. doi:10.1145/605440.605448. 
URL → pages 2 [115] A. Sohanghpurwala, P. Athanas, T. Frangieh, and A. Wood. OpenPR: An Open-Source Partial-Reconfiguration Toolkit for Xilinx FPGAs. In Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), 2011 IEEE International Symposium on, pages 228 –235, May 2011. doi:10.1109/IPDPS.2011.146. → pages 16 [116] S. Srinivasan, A. Gayasen, and N. Vijaykrishnan. Leakage Control in FPGA Routing Fabric. In Design Automation Conference, 2005. Proceedings of the ASP-DAC 2005. Asia and South Pacific, volume 1, pages 661 – 664 Vol. 1, January 2005. doi:10.1109/ASPDAC.2005.1466246. → pages 2 [117] D. Staunton. Successful use of an open source processor in a commercial asic, December 2005. URL successful-use-of-an-open-source-processor-in-a-commercial-asic.html.  Retrieved on May 1, 2013. → pages 53 [118] N. Steiner, A. Wood, H. Shojaei, J. Couch, P. Athanas, and M. French. Torc: Towards an Open-Source Tool Flow. In Proceedings of the 19th ACM/SIGDA international symposium on Field programmable gate arrays, FPGA ’11, pages 41–44, New York, NY, USA, February 2011. ACM. ISBN 978-1-4503-0554-9. doi:10.1145/1950413.1950425. URL → pages 3 [119] M. Streubuhr, R. Rosales, R. Hasholzner, C. Haubelt, and J. Teich. ESL Power and Performance Estimation for Heterogeneous MPSOCS using 125  SystemC. In Specification and Design Languages (FDL), 2011 Forum on, pages 1 –8, September 2011. URL → pages 34 [120] Tech Space Staff Writers. Leon: the space chip that europe built, January 2013. URL LEON the space chip that Europe built 999.html. Retrieved on May 1, 2013. → pages 53 [121] V. Tiwari, S. Malik, and A. Wolfe. Power Analysis of Embedded Software: a First Step towards Software Power Minimization. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 2(4):437 –445, December 1994. ISSN 1063-8210. doi:10.1109/92.335012. → pages 32 [122] V. Tiwari, S. Malik, A. Wolfe, and M. Tien-Chien Lee. Instruction Level Power Analysis and Optimization of Software. 
Journal of VLSI signal processing systems for signal, image and video technology, 13:223–238, August 1996. ISSN 0922-5773. doi:10.1007/BF01130407. URL → pages 33 [123] T. Tuan, S. Kao, A. Rahman, S. Das, and S. Trimberger. A 90nm Low-Power FPGA for Battery-Powered Applications. In Proceedings of the 2006 ACM/SIGDA 14th international symposium on Field programmable gate arrays, FPGA ’06, pages 3–11, New York, NY, USA, February 2006. ACM. ISBN 1-59593-292-5. doi:10.1145/1117201.1117203. URL → pages 2, 70 [124] S. Turgis and D. Auvergne. A Novel Macromodel for Power Estimation in CMOS Structures. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 17(11):1090 –1098, November 1998. ISSN 0278-0070. doi:10.1109/43.736183. → pages 24 [125] E. Ul Haq, M. Hafeez, M. Khan, S. Sial, and A. Riazuddin. FPGA Implementation of a Low Power, Processor-Independent and Reusable System-on-Chip Platform. In Emerging Technologies, 2009. ICET 2009. International Conference on, pages 337–341, October 2009. doi:10.1109/ICET.2009.5353150. 2. → pages 38, 40 [126] N. Weste and D. Harris. CMOS VLSI Design. Pearson/Addison Wesley, 2010. → pages 19  126  [127] J. Williams and N. Bergmann. Embedded Linux as a Platform for Dynamically Self-Reconfiguring Systems-on-Chip. In The International Conference on Engineering of Reconfigurable Systems and Algorithms, pages 163–169. CSREA Press, June 2004. URL 10330&dsID=ers2035-jwilliam.pdf. 6. → pages 38, 39  [128] S. Wilton, S.-S. Ang, and W. Luk. The Impact of Pipelining on Energy per Operation in Field-Programmable Gate Arrays. In J. Becker, M. Platzner, and S. Vernalde, editors, Field Programmable Logic and Application, volume 3203 of Lecture Notes in Computer Science, pages 719–728. Springer Berlin Heidelberg, August 2004. ISBN 978-3-540-22989-6. doi:10.1007/978-3-540-30117-2 73. URL 73. → pages 3, 79, 87, 88, 90 [129] X. Wu and T. Vladimirova. 
A Self-reconfigurable System-on-Chip Architecture for Satellite On-Board Computer Maintenance. In C. Jesshope and C. Egan, editors, Advances in Computer Systems Architecture, volume 4186 of Lecture Notes in Computer Science, pages 552–558. Springer Berlin Heidelberg, September 2006. ISBN 978-3-540-40056-1. doi:10.1007/11859802 57. URL 57. 0. → pages 38, 39 [130] Xilinx. Achieving Higher System Performance with the Virtex-5 Family of FPGAs, July 2006. URL papers/wp245.pdf. → pages 9 [131] Xilinx. Virtex-6 FPGA System Monitor User Guide, June 2010. URL guides/ug370.pdf. → pages 29, 60, 61 [132] Xilinx. Virtex-5 FPGA Configuration User Guide, October 2012. URL guides/ug191.pdf. → pages 11 [133] Xilinx. Virtex-6 FPGA Configuration User Guide, September 2012. URL guides/ug360.pdf. → pages 66  127  [134] Xilinx. Xilinx Power Estimator User Guide, December 2012. URL manuals/xilinx2012 4/ ug440-xilinx-power-estimator.pdf. → pages 34  [135] Xilinx. ML605 Hardware User Guide, October 2012. URL and kits/ug534.pdf.  → pages 60, 61 [136] Xilinx. Command Line Tools User Guide, July 2012. URL http://www. manuals/xilinx14 4/devref.pdf. → pages 34, 35 [137] Xilinx. PlanAhead User Guide, December 2012. URL manuals/xilinx14 4/ PlanAhead UserGuide.pdf. → pages 14  [138] Xilinx. Partial Reconfiguration User Guide, December 2012. URL http:// manuals/xilinx14 4/ug702.pdf.  → pages 11, 12, 13, 15, 66 [139] Xilinx. Xilinx Design Tools: Installing and Licensing Guide, December 2012. URL http: // manuals/xilinx2012 4/iil.pdf.  → pages 35 [140] Xilinx. Virtex-6 FPGA Data Sheet: DC and Switching Characteristics (DS152 v3.5), May 2013. URL sheets/ds152.pdf. → pages 69 [141] Xilinx. Virtex-6 FPGA Clocking Resources User Guide (UG362 v2.2), February 2013. URL guides/ug362.pdf. → pages 68 [142] Xilinx. Virtex-7 T and XT FPGAs Data Sheet:DC and Switching Characteristics (DS183 v1.11), January 2013. URL support/documentation/data sheets/ds183 Virtex 7 Data Sheet.pdf. → pages 10 [143] W. Ye, N. 
Vijaykrishnan, M. Kandemir, and M. J. Irwin. The Design and Use of Simplepower: a Cycle-Accurate Energy Estimation Tool. In Proceedings of the 37th Annual Design Automation Conference, DAC ’00, 128  pages 340–345, New York, NY, USA, January 2000. ACM. ISBN 1-58113-187-9. doi:10.1145/337292.337436. URL → pages 31 [144] G. Yeap and F. Najm. Low Power VLSI Design and Technology, volume 6. World Scientific Publishing Company Incorporated, 1996. → pages 20, 23 [145] G. K. Yeap. Practical Low Power Digital VLSI Design. Springer, August 1997. ISBN 0792380096. URL redirect?tag=citeulike07-20&path=ASIN/0792380096. → pages 17, 18, 20, 21 [146] I. Zaidi, A. Nabina, C. Canagarajah, and J. Nunez-Yanez. Power/Area Analysis of a FPGA-Based Open-Source Processor using Partial Dynamic Reconfiguration. In Digital System Design Architectures, Methods and Tools, 2008. DSD ’08. 11th EUROMICRO Conference on, pages 592 –598, September 2008. doi:10.1109/DSD.2008.92. 9. → pages 2, 38, 39 [147] I. Zaidi, A. Nabina, C. Canagarajah, and J. Nunez-Yanez. Evaluating Dynamic Partial Reconfiguration in the Integer Pipeline of a FPGA-Based Opensource Processor. In Field Programmable Logic and Applications, 2008. FPL 2008. International Conference on, pages 547 –550, September 2008. doi:10.1109/FPL.2008.4630005. 10. → pages 38, 39 [148] F. Zhou, N. Wu, Y. Zhang, and X. Ge. NoC Router Power Macro-modeling at High Level. In D. Zeng, editor, Advances in Computer Science and Engineering, volume 141 of Advances in Intelligent and Soft Computing, pages 199–206. Springer Berlin Heidelberg, January 2012. ISBN 978-3-642-27947-8. doi:10.1007/978-3-642-27948-5 28. URL 28. → pages 25  129  

