UBC Theses and Dissertations

FPGA architectures and CAD algorithms with dynamic power gating support Bsoul, Assem A. M. 2014

FPGA Architectures and CAD Algorithms with Dynamic Power Gating Support

by

Assem A. M. Bsoul

B.Sc., Jordan University of Science and Technology, 2006
M.Sc., Queen’s University, 2009

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

in

The Faculty of Graduate and Postdoctoral Studies
(Electrical and Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

October 2014

© Assem A. M. Bsoul 2014

Abstract

As CMOS technology scales down, fabricating an integrated circuit becomes more expensive. Field-programmable gate arrays (FPGAs) are becoming attractive because their fabrication costs are amortized among many customers. A major challenge facing the adoption of FPGAs in many low-power applications is their high power consumption, which is due to their reconfigurability. A significant portion of FPGAs’ power is dissipated in the form of static power during applications’ idle periods.

To address the power challenge in FPGAs, this dissertation proposes architecture enhancements and algorithm support to enable powering down portions of a chip during their idle times using power gating. This leads to significant energy savings, especially for applications with long idle periods, such as those in mobile devices. This dissertation presents three contributions that address the major challenges in adopting power gating in FPGAs.

The first contribution proposes architectural support for power gating in FPGAs. This architecture enables powering down unused FPGA resources at configuration time, and powering down inactive resources during their idle times with the help of a controller. The controller can be placed on the general fabric of an FPGA device. The proposed architecture provides flexibility in realizing varying numbers and structures of modules that can be powered down at run-time when inactive.

The second contribution proposes an architecture to appropriately handle the wakeup phase of power-gated modules.
During a wakeup phase, a large current is drawn from the power supply to recharge the internal capacitances of a power-gated module, leading to reduced noise margins and degraded performance for neighbouring logic. The proposed architecture staggers the wakeup phase to limit this current to a safe level. This architecture is configurable and flexible enough to realize different user applications, while having negligible area and power overheads and fast power-up times.

The third contribution proposes a CAD flow that supports mapping users’ circuits to the proposed architecture. Enhancements to the algorithms in this flow that reduce power consumption are studied and evaluated. Furthermore, the CAD flow is used to study the granularity of the architecture proposed in this dissertation when mapping application circuits.

Preface

The contributions presented in this dissertation have been published in journals and conference proceedings [1–5] and in a book chapter [6]. Content from Chapter 3 was first published in papers [1, 3], and later extended to a book chapter [6] and a journal paper [5]. Content from Chapter 4 was first published in [2], and later extended to a journal paper [4]. Part of the content from Chapter 5 was published in paper [3], and later extended to a journal paper [5].

I would like to recognize my co-authors from Imperial College London in [5] who provided the example application circuit that is studied in Chapter 5. I would also like to recognize my co-author, Mr. Stuart Dueck of NVIDIA, who provided editorial support for the book chapter [6]. In all of these works, I conducted the research, performed the experiments, and prepared the manuscripts for publication under the guidance of my supervisor, Dr. Steve Wilton, who also provided editorial support for all of my manuscripts.

[1] A. A. M. Bsoul and S. J. E. Wilton, “An FPGA Architecture Supporting Dynamically Controlled Power Gating,” in Proceedings of The IEEE International Conference on Field-Programmable Technology (FPT), pp. 1–8, Dec. 2010.
[2] A. A. M. Bsoul and S. J. E. Wilton, “A Configurable Architecture to Limit Wakeup Current in Dynamically-Controlled Power-Gated FPGAs,” in Proceedings of the International Symposium on Field-Programmable Gate Arrays, FPGA ’12, pp. 245–254, 2012.
[3] A. A. M. Bsoul and S. J. E. Wilton, “An FPGA with Power-Gated Switch Blocks,” in Proceedings of The IEEE International Conference on Field-Programmable Technology (FPT), pp. 87–94, Dec. 2012.
[4] A. A. M. Bsoul and S. J. E. Wilton, “A Configurable Architecture to Limit Inrush Current in Power-Gated Reconfigurable Devices,” J. Low Power Electronics, vol. 10, no. 1, pp. 1–15, 2014.
[5] A. A. M. Bsoul, S. J. E. Wilton, K. H. Tsoi, and W. Luk, “An FPGA Architecture and CAD Flow Supporting Dynamically-Controlled Power Gating,” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, submitted.
[6] A. A. M. Bsoul, S. Dueck, and S. J. E. Wilton, Green Communications: Theoretical Fundamentals, Algorithms and Applications, ch. Runtime-Controlled Energy Reduction Techniques for FPGAs, pp. 3–22. CRC Press, 2012.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acronyms
Acknowledgements
Dedication
1 Introduction
  1.1 Motivation
  1.2 Power Gating in ASICs vs. FPGAs
  1.3 Contributions
    1.3.1 FPGA Architecture Supporting DCPG
    1.3.2 A Configurable Inrush Current Handling Architecture
    1.3.3 Studying the Architecture Granularity and CAD Tool Optimizations
  1.4 Organization
2 Background and Related Work
  2.1 FPGA Architecture and CAD
    2.1.1 Architecture
    2.1.2 CAD Flow for FPGAs
  2.2 Power Dissipation
    2.2.1 Dynamic Power
    2.2.2 Static Power
    2.2.3 Power Dissipation Trends
    2.2.4 Power Dissipation in FPGAs
  2.3 FPGA Power Reduction Techniques
    2.3.1 Clock Gating
    2.3.2 Multiple Supply Voltages
    2.3.3 Controlled Body Biasing
    2.3.4 Dynamic Reconfiguration
    2.3.5 CAD Techniques
    2.3.6 Other Techniques
  2.4 Power Gating
  2.5 Related Work
    2.5.1 Power-Gated FPGAs
    2.5.2 Handling Inrush Current
    2.5.3 CAD Optimizations for Power-Gated FPGAs
  2.6 Summary
3 Dynamically-Controlled Power-Gated FPGA Architecture
  3.1 Power Gating Framework: A High Level Picture
  3.2 Power-Gated Logic Resources
    3.2.1 Fine-Grained Power Gating
    3.2.2 Region-Based Power Gating
  3.3 Power-Gated Routing
    3.3.1 Switch Block Power Gating
    3.3.2 Fine-Grained Switch Block Power Gating
  3.4 Supported Power Modes
  3.5 Modelling Application Behaviour
  3.6 Results
    3.6.1 Experimental Setup
    3.6.2 Architecture Parameters Sweep
    3.6.3 Application Model Results
  3.7 Summary
4 A Configurable Architecture to Limit Inrush Current in Power-Gated Reconfigurable Devices
  4.1 Motivational Example: Inrush Current Effect
    4.1.1 Baseline Power Grid Model
    4.1.2 Inrush Current of DCPG FPGA Architecture
    4.1.3 Voltage Droop Estimation
    4.1.4 Over-Provisioning the Power Grid
    4.1.5 Inserting Decoupling Capacitors
  4.2 Proposed Architecture
    4.2.1 Intra-Region Level
    4.2.2 Inter-Region Level
    4.2.3 Architectural Tradeoffs
  4.3 Design Flows
    4.3.1 Device Architecture Designers
    4.3.2 Device End Users
  4.4 Conditions for Energy Savings
  4.5 Experimental Setup
    4.5.1 Analysis Settings
    4.5.2 Power Grid
    4.5.3 DCPG FPGA Architecture with Wakeup Circuitry
  4.6 Results and Discussion
    4.6.1 Intra-Region Level
    4.6.2 Inter-Region Level
    4.6.3 Combined Intra- and Inter-Region Levels
    4.6.4 Optimal Design of The Intra- and Inter-Region Levels
    4.6.5 Minimum Idle Time to Achieve Energy Savings
    4.6.6 Soft-Logic vs. Embedded PDEs
    4.6.7 Decaps and Power Grid Size
  4.7 Summary
5 Physical-Level CAD Support for Power-Gated FPGAs
  5.1 CAD Flow For Power-Gated FPGAs
    5.1.1 Modelling DCPG Architecture
    5.1.2 Power Controller Placement and Routing and Circuit Placement
    5.1.3 Power Control Signals Routing and Circuit Routing
  5.2 PG-Aware Placement and Routing Optimizations
    5.2.1 Placement
    5.2.2 PG-Aware Routing
  5.3 Benchmark Circuits
    5.3.1 Synthetic Benchmarks Generation
    5.3.2 HLS Benchmark Circuits
    5.3.3 Robot Control System
  5.4 Impact of PG-Aware Placement and Routing
    5.4.1 Impact on Resources’ Power States
    5.4.2 Impact on Performance
    5.4.3 Impact on Leakage Power
    5.4.4 Impact on Congestion
    5.4.5 CAD Execution Time
  5.5 Granularity Study of the Power Gating Architecture
    5.5.1 Breakdown of Resources’ Power States
    5.5.2 Leakage Power
  5.6 Example Application: iSnake Robot Control System
  5.7 Summary
6 Conclusion and Future Work
  6.1 Dissertation Summary
    6.1.1 Summary of Contributions
    6.1.2 Dissertation Impact
  6.2 Limitations and Short-Term Future Work
    6.2.1 Architecture Framework
    6.2.2 Sleep Transistors Sizing
    6.2.3 Characterizing the Wakeup Phase of Power-Gated Modules
    6.2.4 CAD Study and Benchmarks
  6.3 Long-Term Future Directions
    6.3.1 High-Level CAD Support
    6.3.2 Self-Optimizing FPGAs
Bibliography
Appendices
  A FPGA Architecture Parameters

List of Tables

Table 3.1 Configurable power modes supported by the different components in a power gating region (PGR).
Table 3.2 Application parameters.
Table 3.3 Architecture parameters.
Table 3.4 Total area overhead (%) for the power gating architecture for different architecture granularities compared to an architecture with no power gating support (W = 80 and L = 4).
Table 4.1 Functional module and architecture parameters to find minimum idle time to achieve energy savings.
Table 4.2 Rconcur and PDE size for different designs of the intra-region architecture with Itile intra exceeding what the power grid supports in each tile location (Itile pg = 600 µA).
Table 5.1 Generated synthetic circuits.
Table 5.2 HLS circuits.
Table 5.3 Information about FPGA-based iSnake robot control system.
Table 5.4 Summary of the mapping techniques studied in this section.
Table 5.5 Parameter values for Li’s and PG-aware placement and routing techniques.
Table 5.6 Average ON and OFF power for all synthetic benchmark circuits for the different mapping techniques.
Table 5.7 Average ON and OFF power for HLS circuits for the different mapping techniques.
Table 5.8 Comparing Baseline and PG-aware P&R required minimum channel width to route the synthetic benchmark circuits.
Table 5.9 Execution time (seconds) of the Baseline and the PG-aware P&R for the synthetic benchmark circuits.

List of Figures

Figure 1.1 The basic idea of power gating for a circuit block.
Figure 1.2 An illustration of application mapping to a SCPG and DCPG FPGA architectures. PMU is a power management unit.
Figure 1.3 Illustration of the difference between chip design phase and application design phase in FPGAs vs ASICs.
Figure 2.1 General architecture for tile-based FPGAs.
Figure 2.2 A cluster of N BLEs.
Figure 2.3 Basic logic element.
Figure 2.4 An example implementation of the internals of a logic cluster.
Figure 2.5 Routing architecture for island-style FPGA.
Figure 2.6 The details inside an example connection block for an LC’s input pin.
Figure 2.7 General FPGA CAD flow.
Figure 2.8 Power vs. energy [7].
Figure 2.9 Switching current charges the load capacitance.
Figure 2.10 Short circuit current.
Figure 2.11 Leakage current paths in an NMOS transistor [8].
Figure 2.12 Trends for the scaling of the different power consumption components over the years [9].
Figure 2.13 Illustration of clock gating concept for a group of registers.
Figure 2.14 An example design with the critical path shown. The body voltage of logic blocks can be adjusted depending on if the block lies on the critical path [10].
Figure 2.15 Dynamic reconfiguration set up using: static logic, reconfigurable partitions, and reconfigurable modules. A reconfigurable partition can be configured with a reconfigurable module.
Figure 2.16 General structure for a power-gated application.
Figure 3.1 Example application mapped to an FPGA supporting DCPG.
Figure 3.2 Basic fine-grained DCPG for two neighbouring tiles. Shared bordering routing channels have their own power gating circuit.
Figure 3.3 Internals of an LC with DCPG showing pull-down NMOS devices at FF input and LC outputs. A status signal to indicate completion of power transition can be routed through one of the LC’s outputs (the top).
Figure 3.4 Example PGR of 2x2 tiles. Internal region’s LCs and CBs share the power switch. PGR’s SBs and bordering RCs have individual power switches.
Figure 3.5 Details of the power gating circuitry for the different components in a PGR. The internal LCs and RCs share the same power switch, individual bordering RCs have their own power switches, and individual SBs (or SB partitions) have their own power switches.
Figure 3.6 Power gating circuit for a switch block. SB outputs are pulled down to GND when the SB’s power is off (in sleep mode).
Figure 3.7 One power gating partition per SB side (details for two sides are shown). The power state for each partition can be configured separately.
Figure 3.8 Example incoming wire track from the left side of an SB, showing two cases: (left) wire ending in an SB, and (right) wire passing by an SB.
Figure 3.9 Experimentation procedure for sizing sleep transistors.
Figure 3.10 Results for sweeping LC’s cluster size (N). Switch blocks are not included in the results.
Figure 3.11 Results for sweeping routing channel width (W). Switch blocks are not included in the results.
Figure 3.12 Results for sweeping granularity of power gating regions (PGRs). Switch blocks are not included in the results.
Figure 3.13 Results for SBs power gating granularity by sweeping routing channel width (W). Segment length (L) = 2.
Figure 3.14 Results for SBs power gating granularity by sweeping routing channel width (W). Segment length (L) = 4.
Figure 3.15 Leakage energy reduction for clma circuit as a function of Fidle PGRs and Fidle time.
Figure 4.1 Hypothetical inrush current and the associated voltage droop.
Figure 4.2 An FPGA with mesh power grid structure. FPGA circuit components are modeled as current sources.
Figure 4.3 Modeling of metal segments in the power grid.
Figure 4.4 Inrush current for different architecture region sizes. SB is the total inrush current for all switch blocks in a region, RC is the total inrush current for bordering routing channels, and Internal part is the inrush current for the internal routing channels and logic clusters of a region.
Figure 4.5 IR drop as region size increases. (a) The inrush current for bordering RCs and internal PGR’s part is considered (b) The total inrush current for a PGR is considered (including SBs).
Figure 4.6 Cost of power grid sizing to handle region’s inrush current.
Figure 4.7 Proportion of chip area required as decaps to handle inrush current.
Figure 4.8 Power gating region with the inrush current limiting circuitry.
Figure 4.9 (a) Fixed and (b) programmable delay elements used in the proposed architecture.
Figure 4.10 An example power-gated module mapped to an FPGA. Each PGR is represented by its PDE.
Figure 4.11 Example DCPG FPGA region with the proposed inrush current handling circuitry.
Figure 4.12 Intra-region level (a) area overhead, (b) wakeup time, and (c) mode transition energy results.
Figure 4.13 PDE size for different Imodule wakeup budgets. PGR size = 4 × 4 tiles.
Figure 4.14 PDE area overhead for different Imodule wakeup budgets. PGR size = 4 × 4 tiles.
Figure 4.15 Total intra and inter-region level wakeup architecture results for different device’s wakeup current budgets (PGR size = 4 × 4 tiles). (a) Total area overhead, (b) total ON state leakage power, and (c) total OFF state leakage power.
Figure 4.16 Comparing the total area overhead for the inrush current limiting architecture for two design choices of the intra-region level. Two PGR granularities are studied (a) 1 × 1 tiles and (b) 4 × 4 tiles.
Figure 4.17 Minimum idle time to achieve energy savings. PGR size is 4x4 tiles and Itile intra = 600 µA.
Figure 4.18 Area overhead for soft logic and embedded logic implementation of inter-region architecture for different Imodule wakeup budgets. PGR size = 4 × 4 tiles.
Figure 5.1 General CAD flow for power-gated FPGAs.
Figure 5.2 Physical-level CAD flow for power-gated FPGAs.
Figure 5.3 The architecture model of an FPGA of size 5x5 tiles, and a core PGR size of 2x2 tiles.
Figure 5.4 Routing of a power control signal to PGRs of a power-gated module in a trunk-branches topology.
Figure 5.5 Illustrating placement effect on the always-on PGRs in a power-gated FPGAs.
Figure 5.6 Three modules mapped to three PGRs showing how the placement affects the number of routing and logic resources that are configured as dynamically-controlled. In (a), the LCs in each PGR are configured as dynamically-controlled, but a large number of SB partitions will be required to be always-on to ensure proper circuit operation. In (b), the LCs of the PGRs of modules M1 and M3 cannot be controlled dynamically because the PGRs have a mix of blocks from other modules. However, a fewer number of SB partitions in these PGRs are required to be always-on.
Figure 5.7 An application circuit with three power-gated modules mapped to an FPGA. The x-y layout of a connection between blocks M1.B2 and M3.B4 during placement is used to estimate always-on SBs required to route the connection. The larger of the two cases (1 or 2) is assumed.
Figure 5.8 Three power-gated modules placed on three PGRs with two ways shown to route the same net. The SBs used to route the net in M2 and M3 must be always-on (ON), other resources can be dynamically-controlled (DC).
Figure 5.9 Snake robot example application.
Figure 5.10 The percentage of logic clusters (LCs) configured as dynamically-controlled for different mapping techniques.
Figure 5.11 The effect of the different mapping techniques on reducing the always-on SB switches for the synthetic benchmark circuits.
Figure 5.12 The effect of the different mapping techniques on reducing the always-on SB switches for HLS benchmark circuits.
Figure 5.13 The effect of the different mapping techniques on the critical path delay (Tcrit).
Figure 5.14 The effect of the different mapping techniques on the power reduction of the synthetic benchmark circuits.
Figure 5.15 The effect of the different mapping techniques on the power reduction of the HLS circuits.
Figure 5.16 Two modules placed on a DCPG FPGA. (a) One PGR of size 4x4 tiles is configured as always-on (b) Four PGRs of size 2x2 tiles are configured as dynamically-controlled.
Figure 5.17 One module (M1) placed on a PGR of size 2x2 tiles. A signal from module M2 is routed through two SBs in the PGR. (a) PPS = 0, resulting in two SBs configured as always-on, one SB configured as dynamically-controlled, and one SB configured as always-off (b) PPS = 1, resulting in two SB partitions configured as always-on, two partitions configured as dynamically-controlled, and 12 partitions configured as always-off.
Figure 5.18 The percentage of SB switches in the different power states, averaged over all synthetic benchmarks mapped to architectures with different power gating granularities.
Figure 5.19 The percentage of LCs and RCs in the different power states, averaged over all synthetic benchmark circuits mapped to architectures with different power gating granularities.
Figure 5.20 Leakage power results for SBs averaged over all synthetic benchmark circuits mapped to architectures with different power gating granularities. ON leakage power is measured when all dynamically-controlled resources are powered. OFF leakage power is measured when all dynamically-controlled resources are put in sleep mode.
Figure 5.21 Leakage power results for PGRs only (no SBs included) averaged over all synthetic benchmark circuits mapped to architectures with different power gating granularities. ON leakage power is measured when all dynamically-controlled resources are powered. OFF leakage power is measured when all dynamically-controlled resources are put in sleep mode.
Figure 5.22 Total leakage power results averaged over all synthetic benchmark circuits mapped to architectures with different power gating granularities. ON leakage power is measured when all dynamically-controlled resources are powered. OFF leakage power is measured when all dynamically-controlled resources are put in sleep mode.
Figure 5.23 Total leakage power reduction for individual benchmark circuits. PPS = 3 and PGR size is 2x2 tiles.
Figure 5.24 Energy reduction for iSnake’s control system in DCPG FPGA.
Acronyms

ALM Adaptive Logic Module.
ASIC Application-Specific Integrated Circuit.
BLE Basic Logic Element.
CAD Computer-Aided Design.
CB Connection Block.
CLB Configurable Logic Block.
CPF Common Power Format.
DCM Digital Clock Manager.
DCPG Dynamically-Controlled Power Gating.
DFG Data Flow Graph.
DR Dynamic Reconfiguration.
DSP Digital Signal Processor.
FF Flip-Flop.
FPGA Field-Programmable Gate Array.
HLS High-Level Synthesis.
IC Integrated Circuit.
IP Intellectual Property.
LC Logic Cluster.
LUT Lookup Table.
MIM Metal-Insulator-Metal.
MTCMOS Multi-Threshold CMOS.
MWTA Minimum-Width Transistor Areas.
NRE Non-Recurring Engineering.
PDE Programmable Delay Element.
PGR Power Gating (Gated) Region.
RC Routing Channel.
RM Reconfigurable Module.
RP Reconfigurable Partition.
SB Switch Block.
SCPG Statically-Controlled Power Gating.
SRAM Static Random Access Memory.
ST Sleep Transistor.
UPF Unified Power Format.
VPR Versatile Placement and Routing - academic tool used in FPGA-related research.

Acknowledgements

First of all, I’d like to express my sincere thanks and gratitude to my research supervisor, Dr. Steve Wilton. Dr. Wilton’s constant advice, support, insights, and encouragement made my PhD study exciting and productive. I was fortunate to have him as my mentor and research supervisor!

I thank the external examiner, Dr. Kenneth B. Kent, for his feedback and suggestions. I thank my supervising committee members, Dr. Shahriar Mirabbasi and Dr. Tor Aamodt, for their feedback to improve the dissertation, and for the several discussions we had on research and teaching. I’d like to also thank my university examiners, Dr. Sathish Gopalakrishnan and Dr. Xiaodong Lu, for their constructive feedback during the examination process. I also thank Dr. Guy Lemieux and Dr.
Sudip Shekhar for their constructive feedback during the different stages of this dissertation.

I thank all the members of the System-on-Chip research group at UBC, and friends from other groups in the Electrical and Computer Engineering department who made it a friendly environment during my stay as a student. I’d like to mention Ameer Abdelhadi, Rehan Ahmed, Usman Ahmed, Samer Al-Kiswany, Scott Chin, Joydip Das, Stuart Dueck, Fatemeh Eslami, Abdullah Gharaibeh, Jeffrey Goeders, Eddie Hung, Sohaib Majzoub, Aaron Severance, Ahmad Sharkia, Chris Wang, and Grace Yu, in addition to several others whom I haven’t mentioned here.

Special thanks to my friends in Vancouver, whose company I enjoyed away from research and academia. Also special thanks to my brother and friend, Anas, for his support and encouragement; I wish him the best of luck in his PhD studies!

I am very grateful for the constant support and encouragement I received from my family. Without their love, support, and prayers I would not be the person I am today.

Several companies in the industry showed interest in and support for the work in this dissertation. I thank the employees of Recon Instruments, Altera, Xilinx, and Lattice Semiconductor for giving me the opportunity to present my work, and for providing constructive feedback.

Financial support for this project was provided by the Natural Sciences and Engineering Research Council of Canada, Altera, the University of British Columbia, and MITACS.

To my parents and siblings

Chapter 1
Introduction

1.1 Motivation

Field-programmable gate arrays (FPGAs) have become ubiquitous in many applications such as telecommunications, digital signal processing, and scientific computing because of their small non-recurring engineering (NRE) cost and short time-to-market [11]. They provide an implementation platform that is accessible to medium and small firms that cannot afford the cost of fabricating their own integrated circuits (ICs).
This is because the fabrication cost of an FPGA is amortized among all the users who buy the chip from an FPGA company. In contrast, the design and fabrication of an application-specific integrated circuit (ASIC) may cost several million dollars using a sub-90 nm process technology [12, 13]. With technology scaling, the cost of designing and fabricating an IC becomes prohibitive except at high volumes. In addition to their cost advantage over ASIC designs for small to medium volumes, the flexibility of FPGAs makes it possible to perform protocol updates and bug fixes in the hardware after product shipment, facilitating competition in the market.

Despite the cost advantages of FPGAs over their ASIC counterparts for small to medium volumes, FPGAs suffer from large performance, area, and power overheads. The penetration of FPGAs in the low-power applications domain in particular is still limited. This is largely due to their high power consumption rather than their speed or area [14]. Compared to ASIC implementations, FPGA implementations incur a 7 to 14 times power overhead [15]. This power overhead comes from the additional circuit components in an FPGA that are required to enable its reconfiguration in the field. High power consumption in ICs not only shortens the battery lifetimes of portable devices, but also degrades system reliability and increases the cost of cooling solutions.

To bring reconfigurable technology to low-power applications, new programmable devices that consume significantly less power are required. Many researchers have proposed optimizing the power dissipation of FPGAs using techniques that were originally applied to ASICs, such as guarded evaluation [16–19], clock gating [20, 21], power gating [22–28], employing dual supply voltages [29–34], power-aware computer-aided design (CAD) optimizations [35–37], and other circuit and process optimizations [10, 38–40].
Nevertheless, the power consumption of FPGAs remains prohibitive for some applications.

Integrated circuits dissipate both dynamic and static (leakage) power. Dynamic power is dissipated when a circuit performs activity by charging and discharging internal capacitances. Static, or leakage, power is dissipated even when the circuit is not active, i.e., in an idle, quiescent state. Dynamic power has historically been the major component of the total power dissipation in an IC. However, with process technology scaling, static power has become as important as dynamic power for sub-90 nm process technologies, and in some applications far more important than dynamic power. Recently, it has been reported that static power and dynamic power are roughly equal in FPGAs built using a 28 nm technology [41, 42].

Many of the aforementioned power reduction techniques (guarded evaluation, clock gating, dual supply voltages, CAD optimizations) focus on dynamic power optimization. These techniques have been proven to reduce the power consumption of FPGAs effectively. Static power reduction techniques (power gating of unused blocks and circuit/process optimizations) have been successful to some extent in reducing the power consumption of FPGAs. However, a large portion of static power is dissipated during the idle periods that the functional blocks of an application experience.

The power consumed during idle times does not contribute to the overall desired functionality of an application; thus, it is wasted. Clock gating, in which the clock signal, or a portion of the clock signal, is turned off for a period of time, eliminates the dynamic part of the idle-period power. Disabling the clock stops the switching in the clock network of the idle functional blocks, and it can stop the flow of data in the circuit components when the outputs are not needed. However, leakage power dissipation during idle periods is not reduced by simply disabling the clock.
Many optimization techniques that target leakage power during idle periods have been used both in ASICs and FPGAs. In ASIC-based implementations of applications that experience long idle times, power gating has been the most effective technique in reducing leakage power dissipation [43]. Using power gating, functional blocks’ power supplies can be turned off during their idle times [7, 43]. The basic idea of power gating is illustrated in Figure 1.1. When the power switch is turned off using a power controller unit, the leakage current for the functional block is limited to that of the power switch. The net effect is a reduction in the system’s overall energy, leading to longer battery life.

Figure 1.1: The basic idea of power gating for a circuit block. (Diagram labels: power supply line (VDD), power switch, power-gated functional block, power controller unit.)

Although power gating is a standard leakage power optimization technique in low-power ASIC designs, it has seen limited use in FPGAs. Today, state-of-the-art FPGAs provide power gating capability for only the embedded hard blocks, such as digital signal processors (DSPs), memory blocks, clock managers, etc. [41, 44]. These blocks, however, tend to consume much less power than the user logic blocks and routing resources on an FPGA since they are custom-designed blocks. Power gating is applied to the embedded hard blocks only at configuration time, and only if they are not used by the user’s circuit. This limits the advantage of power gating.

In the research literature, there have been power gating proposals for the basic configurable units in tile-based FPGAs. However, many of the proposed architectures revolve around the idea of turning off the power supply for resources that are unused by the user’s circuit at configuration time [22–28].
We refer to using power gating to turn off unused resources in an FPGA at configuration time as statically-controlled power gating (SCPG), because once the power state of a component is determined, it cannot be changed until a new configuration is loaded to the FPGA. The advantages of this technique may be limited because designers tend to select the smallest FPGA device that fits their designs, resulting in only a few unused FPGA blocks.

Figure 1.2: An illustration of application mapping to SCPG and DCPG FPGA architectures. PMU is a power management unit. (Diagram: an application with Modules 1 to 4 (M1 to M4) mapped to an SCPG FPGA and to a DCPG FPGA; regions are marked as always on, always off, or dynamically controlled.)

This dissertation proposes and evaluates a new FPGA architecture that supports dynamically-controlled power gating (DCPG). DCPG enables controlling the power state of the circuit resources in an FPGA at run-time. Thus, idle functional blocks in an FPGA can be turned off to reduce their leakage power dissipation. Figure 1.2 illustrates the difference between an SCPG architecture and the proposed DCPG architecture. The figure shows an example application that is mapped to both architectures. Using the SCPG architecture, only unused regions can be turned off at configuration time; this has limited effect in reducing the leakage power, as discussed earlier. The DCPG FPGA architecture, on the other hand, enables exploiting the temporal behaviour of the application to achieve power savings. The modules that experience active/inactive phases are configured with a dynamically-controlled power state, allowing the power to be turned off during idle times and turned on when the module is active.
This may lead to significant energy savings, especially for applications that experience long idle times.

For many applications, especially for hand-held mobile devices, DCPG may be far more beneficial than SCPG. It is conceivable that many mobile applications show only bursts of activity; that is, most of the functional blocks remain idle except for short bursts of activity. For example, a recent study showed that the microprocessor in a circuit designed for rendering video images in a head-mounted display spends more than 80% of the time without performing any useful task [45].

1.2 Power Gating in ASICs vs. FPGAs

DCPG in FPGAs differs from power gating in ASIC designs. FPGAs need to be flexible enough to support the implementation of virtually any digital circuit, while an ASIC is designed to perform a specific function. In order to understand the differences between power gating in ASICs and FPGAs, it is important to understand the design cycle in both implementation technologies. Figure 1.3 shows the design phases for FPGAs and ASICs. For FPGAs, there is a distinction between the architecture design phase and the application design phase. In the architecture design phase, the basic building blocks of an FPGA are designed to support a wide range of functions using a programmable technology; volatile memory-based programming is widely used in many commercial parts. The user’s application design phase, however, is different, since every user has different application requirements/characteristics. Thus, users have the ability to configure the device to realize the desired functionality. For ASICs, on the other hand, the design for the chip’s fabric is derived from the application’s requirements/characteristics, leading to a fixed-function chip that is suitable for only one application.
Figure 1.3: Illustration of the difference between chip design phase and application design phase in FPGAs vs. ASICs. (a) FPGA design phase vs. application design phase: the FPGA fabric is designed with power gating capability and goes through one fabrication and testing cycle, after which many chips with configurable functionality and configurable power modes are programmed with different user applications. (b) ASIC design phase: application design, chip design, and fabrication happen together, yielding many chips with fixed functionality and fixed power domains, each with always-on logic and a power controller.

The use of power gating in ASICs is well understood. Designers know ahead of time which functional blocks of the application can be turned off using power gating, and the conditions under which they can be turned off; thus, they can design their chips accordingly. This involves the appropriate design of the power distribution network, the power switches, the isolation mechanism of the outputs of a power-gated block when it is turned off, the retention registers, and the power controller unit.

In an FPGA, however, there is no information at chip design time (i.e., before an FPGA is fabricated) about the applications that will be mapped to the FPGA. Thus, the new FPGA architecture must enable various application structures and behaviours to be realizable, and the different power gating design issues must be abstracted away from the application designers.
To meet these requirements, the proposed DCPG FPGA architecture includes configurable power gating components that allow multiple power states to be realizable both at configuration time and at run-time.

One important issue related to power gating in ASICs is the handling of the large current surge that occurs when a functional block is turned on after being in sleep mode. This is known as inrush current. This current may cause errors in the design because of the resulting IR drop. In ASICs, because designers know ahead of time the application’s power requirements and the amount of inrush current that may occur, they can design their power-gating architecture to take this into consideration. One common technique used to solve this problem is to stagger the turn-on phase of a functional block into a few steps to limit the amount of inrush current to a safe level. In a power-gated FPGA, however, it is unknown how much inrush current will occur because the structure/size of the power-gated application is not known beforehand. Therefore, the proposed FPGA architecture must provide a solution to this problem that is configurable and suitable for a wide range of power-gated applications.

1.3 Contributions

The contributions of this dissertation are divided into three main themes:

• An FPGA architecture that supports static and dynamic power modes.
• A reconfigurable architecture to solve the inrush current problem in reconfigurable devices that support dynamic power modes.
• A CAD flow for power-gated FPGAs using existing placement and routing techniques. This CAD flow is used to perform two studies. First, it is used to evaluate different enhancements to the placement and routing algorithms that target increasing the power savings of the proposed architecture. Second, it is used to investigate the granularity of the proposed power gating architecture when application circuits are mapped to the architecture.
The following subsections discuss the contributions in this dissertation in more detail.

1.3.1 FPGA Architecture Supporting DCPG

The dissertation presents a novel FPGA architecture that supports different power modes. The architecture specifically targets low-power applications with modules that experience long idle periods.

The proposed DCPG FPGA supports various power modes for the logic and routing components with small speed/area overheads. The supported power modes are on, off, and dynamically-controlled. At configuration time, each basic power gating unit can be configured separately. The advantage of this flexibility is threefold. First, the resources that are not used by the application can be turned off at configuration time. This is similar to statically-controlled power gating architectures. Second, the power state for the resources that cannot be turned off at run-time or at configuration time can be set to always-on. Third, and most importantly, the power state for the blocks that experience periods of active and idle times can be controlled dynamically at run-time. This architecture is capable of supporting a varying number of power domains; it supports realizing power domains with different spatial and temporal characteristics that may co-exist in the same application.

The signals that are needed to control the power state of power-gated functional blocks can be routed on the general-purpose routing resources already available in the chip, in the same way as other signals. No dedicated routing links are required for the control signals. Thus, control signals can be routed from any location on the chip (or off the chip) to any power-gated block, regardless of its structure or size, making the architecture suitable for a wide range of applications.
Moreover, existing routing techniques/tools can be used to perform routing of the power control signals.

The proposed architecture does not require alterations to the basic FPGA architecture. Any application can be mapped to the DCPG FPGA architecture regardless of whether it is going to take advantage of the new power gating capability or not. Moreover, the CAD tools do not need to account for the DCPG capabilities if the application does not use this architecture feature. This is very important since FPGA users may not be comfortable with, or interested in, advanced power saving features such as power gating.

1.3.2 A Configurable Inrush Current Handling Architecture

One of the main issues associated with power-gated architectures is inrush current. Inrush current is the current surge that occurs when a power-gated functional block is turned on after being in sleep mode. This current may be large and may cause malfunction of the design or the device. This dissertation presents a solution to the inrush current problem that targets reconfigurable architectures that support power gating. The proposed architecture is configurable, meaning that it can be adapted based on the application that is being implemented on the device.

Similar to solutions for this problem in the ASIC domain, the proposed architecture has the capability of staggering the turn-on phase of the basic power gating units to which a functional block is mapped. To enable the flexible mapping of different applications, staggering is determined at configuration time. This is realized by the use of programmable delay elements.
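To make the idea of configuration-time staggering concrete, the following is a minimal sketch in Python. It is not the architecture or algorithm proposed in this dissertation: the function names, the greedy first-fit grouping, the per-unit inrush estimates, and the fixed delay per stage are all assumptions made for illustration. It also assumes that the inrush current of one stage has decayed before the next stage starts.

```python
def stagger_wakeup(inrush_peaks, current_limit):
    """Assign each power-gating unit to a wakeup stage (0 wakes first).

    inrush_peaks: dict of unit name -> estimated peak inrush current (mA).
    current_limit: largest total inrush current tolerated per stage (mA).
    Hypothetical greedy first-fit-decreasing grouping, for illustration only.
    """
    stage_totals = []  # running inrush total of each stage
    assignment = {}
    # Place the largest consumers first; break ties by name for determinism.
    ordered = sorted(inrush_peaks.items(), key=lambda kv: (-kv[1], kv[0]))
    for unit, peak in ordered:
        if peak > current_limit:
            raise ValueError(f"{unit} alone exceeds the inrush limit")
        for idx, total in enumerate(stage_totals):
            if total + peak <= current_limit:
                stage_totals[idx] = total + peak
                assignment[unit] = idx
                break
        else:
            # No existing stage has room, so open a new, later stage.
            stage_totals.append(peak)
            assignment[unit] = len(stage_totals) - 1
    return assignment


def delay_settings(assignment, stage_delay_ns):
    """Translate stage indices into per-unit delay values (ns), as they
    might be programmed into a delay element at configuration time."""
    return {unit: stage * stage_delay_ns for unit, stage in assignment.items()}


if __name__ == "__main__":
    peaks = {"pgr0": 40, "pgr1": 40, "pgr2": 30, "pgr3": 30}  # assumed values
    stages = stagger_wakeup(peaks, current_limit=80)
    print(delay_settings(stages, stage_delay_ns=5))
```

The point of the sketch is only that the grouping is computed per application and then stored as delay settings, which is why a configurable mechanism is needed when the application is unknown at chip design time.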
The area and power overheads of the proposed architecture are much lower than those incurred by alternative approaches.

1.3.3 Studying the Architecture Granularity and CAD Tool Optimizations

Mapping application circuits to the proposed power gating architecture provides insight into the best power gating granularity and the effect of the different optimizations that can be done in the mapping algorithms. Therefore, this dissertation presents a CAD flow for FPGAs that support dynamic power gating.

Using this CAD flow, we investigate different optimizations in the algorithms that are used to map application circuits to FPGAs with dynamic power gating. The dissertation focuses on the placement algorithm. Previous studies present placement algorithms that attempt to increase the number of logic resources that can be powered down at run-time. However, these studies overlook the effect on the number of routing resources that can be powered down. The study in this dissertation evaluates the impact of several optimizations in terms of the power savings and the effect on circuits’ timing.

We also use the CAD flow to study the impact of the power gating granularity. Finer granularity architectures enable a closer match between application requirements and the device’s capabilities, increasing the total chip area that can be powered down for a given application. However, finer granularity architectures have a higher overhead due to the additional power gating circuitry. Coarser granularity power gating, on the other hand, has a small overhead but makes it more challenging to find opportunities to power down resources. We study the granularity of the power gating architecture that achieves the lowest power consumption during idle periods.

The CAD flow is also used to study an example application that utilizes the new FPGA architecture.
The application is a control system for a snake robot that is designed for endoscopy operations. We identified a major functional unit that is idle most of the time, and show that the application’s power can be reduced by using a combination of power gating and clock gating during idle periods.

1.4 Organization

The dissertation is organized as follows. Chapter 2 provides a comprehensive background on the FPGA architecture model used in this study and its CAD flow, the sources of power dissipation, and a literature survey of previous power optimization techniques used for FPGAs.

Chapter 3 presents the proposed FPGA architecture that supports dynamically-controlled power gating (DCPG). The chapter focuses on defining the architecture and its capabilities, such as the ability to control the power state of FPGA resources at run-time, the ability to use the existing FPGA’s routing resources to route the control signals that are used to control the power state of the functional blocks of an application, the different power modes supported and how they can be realized, and the granularity of the architecture. This chapter also presents experimental results showing the possible power savings of the proposed architecture, and the impact of varying basic FPGA architecture parameters on area overhead and power savings.

Chapter 4 presents the proposed configurable architecture to handle inrush current during the wakeup phase of a power-gated functional block. The chapter provides an in-depth analysis of inrush current in the DCPG FPGA architecture, its effect, and a configurable solution for this problem that is suitable for a wide range of applications. This chapter also discusses a design flow that can be used to integrate the architecture with a reconfigurable architecture that supports DCPG, and a design flow for CAD tools/end users to use this architecture.
An evaluation of the architecture is presented, including comparisons with alternative solutions for the inrush current problem.

Chapter 5 presents a CAD flow for power-gated FPGAs that is based on existing FPGA CAD algorithms. The chapter discusses optimizations in the mapping algorithms to increase the power savings in DCPG FPGAs. Experimental results are presented for a set of synthetic benchmark circuits that evaluate the granularity of the architecture and the best CAD optimizations. Results are also presented for an example application that benefits from the proposed power gating architecture.

Finally, Chapter 6 concludes the dissertation by providing a summary of the contributions and findings. The chapter also presents the limitations of the studies in this dissertation and how they can be addressed in future research. Suggestions for long-term directions are also presented.

Chapter 2
Background and Related Work

This chapter describes the background and related work necessary to put the rest of the dissertation in context. The chapter first provides an overview and background on FPGA architectures and CAD flow. We focus on the academic FPGA model that is adopted in this dissertation. Our proposed techniques and architectures, however, can be integrated with other FPGA architectures. The chapter also provides an overview of the basic power dissipation mechanisms in integrated circuits, and discusses the power trends in integrated circuits in general, and in FPGAs in particular. An overview of power reduction techniques in FPGAs is presented, with a focus on run-time controlled techniques. Lastly, the chapter discusses power gating, as it forms the basis for the architectures presented in subsequent chapters of this dissertation, and provides an overview of related works.

2.1 FPGA Architecture and CAD

In this section, we describe the architecture (internal structure) of an FPGA, as well as the computer-aided design (CAD) algorithms that are used in the FPGA design flow.
The architecture described here is a simplified model that is general and representative of that in commercial FPGAs.

2.1.1 Architecture

Figure 2.1 shows the structure of a tile-based FPGA [46]. In this architecture the same tile is replicated many times throughout the chip. Each tile (or slice) consists of a logic cluster (LC) and the routing resources that provide connectivity between the cluster and the routing channels, and between the individual wires in the different routing channels [46]. FPGA vendors use different names for these resources. For example, a logic cluster is referred to as a configurable logic block (CLB) in Xilinx’s terminology [47], and as an adaptive logic module (ALM) in Altera’s terminology [48].

Figure 2.1: General architecture for tile-based FPGAs. (Diagram labels: logic clusters, routing fabric, I/O pad.)

Each logic cluster consists of a number of logic blocks that can implement sequential or combinational functions. These logic blocks are called basic logic elements (BLEs). Figure 2.2 shows a logic cluster (LC) of N BLEs. An intra-cluster switch matrix connects the inputs of the cluster and the outputs of the BLEs to the inputs of the BLEs. Each BLE (Figure 2.3) contains a K-input look-up table (K-LUT) and a flip-flop (FF). The output of a K-LUT can be registered or unregistered by appropriately configuring the 2:1 multiplexer shown in the figure. Each LUT is built using a multiplexer and $2^K$ configuration memory bits that drive the inputs of the multiplexer. Thus, a LUT is capable of implementing any function of up to K inputs by storing the function’s truth table in the $2^K$ configuration memory bits, and driving the select lines of the multiplexer with the function’s K inputs. Figure 2.4 shows an example implementation of the internals of an LC.

The logic clusters are surrounded by a sea of routing resources. Figure 2.5 shows the routing architecture for island-style FPGAs.
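As an aside on the K-LUT and BLE described earlier in this subsection, the following Python sketch models a LUT as a table of 2^K configuration bits indexed by the K inputs, and a BLE as that LUT followed by a flip-flop and a 2:1 output multiplexer. It is a behavioural illustration only; the function names and the choice of treating the first input as the most significant select bit are assumptions, not details taken from the dissertation.

```python
from itertools import product


def configure_lut(func, k):
    """Build the 2**k configuration bits (the truth table) for a k-input
    Boolean function. Entry i is the output for the input combination whose
    binary encoding is i, with the first input as the most significant bit."""
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=k)]


def make_lut(config_bits):
    """Return a callable that behaves like a K-LUT with the given bits."""
    size = len(config_bits)
    if size == 0 or size & (size - 1):
        raise ValueError("number of configuration bits must be a power of two")
    k = size.bit_length() - 1

    def lut(*inputs):
        if len(inputs) != k:
            raise ValueError(f"expected {k} inputs, got {len(inputs)}")
        # The K inputs act as the select lines of the multiplexer.
        index = 0
        for bit in inputs:
            index = (index << 1) | (1 if bit else 0)
        return config_bits[index]

    return lut


def ble(lut, inputs, ff_state, use_register):
    """Evaluate one BLE for one clock cycle.

    use_register models the configuration bit of the 2:1 output multiplexer.
    Returns (BLE output, next flip-flop state)."""
    lut_out = lut(*inputs)
    output = ff_state if use_register else lut_out
    return output, lut_out


if __name__ == "__main__":
    majority_bits = configure_lut(lambda a, b, c: (a + b + c) >= 2, 3)
    majority = make_lut(majority_bits)
    print(majority_bits, majority(1, 0, 1))
```

Because the function is stored entirely in the configuration bits, changing the implemented logic means rewriting the table rather than the hardware, which is the source of both the flexibility and the overhead of FPGAs discussed in Chapter 1.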
The routing resources consist of wires that can be connected using programmable switches. These switches are multiplexers that have their select lines controlled by configuration memory cells. The pre-fabricated wires lie in horizontal and vertical channels, and each wire typically spans one or more logic clusters. The length of a wire (L) is an architecture parameter that indicates the number of clusters a wire spans before it ends in a switch block. A switch block (SB) lies at the intersection of each horizontal and vertical routing channel. Each SB contains programmable switches that provide connectivity between the incident wires. Connection blocks (CBs) provide programmable connectivity between the fixed wires and the input pins of neighbouring logic clusters. Figure 2.6 shows the details inside an example connection block for one input pin of the two bordering logic clusters. A similar topology is used for the other input pins of logic clusters. Different devices have different switch and wire topologies in switch and connection blocks. Academic research and industrial optimization studies have evaluated various topologies in great detail [14]. A suitable topology balances the routability and speed of the device with the area and power overhead of the programmable switches.

Figure 2.2: A cluster of N BLEs. (Diagram labels: I cluster inputs, intra-cluster interconnect, N logic elements with K inputs each.)

Figure 2.3: Basic logic element. (Diagram labels: K-LUT, D flip-flop.)

Figure 2.4: An example implementation of the internals of a logic cluster. (Diagram labels: inputs In1 to In10, SRAM-controlled multiplexers, 4-LUT, register, 2:1 output multiplexer, clock, output Out1.)

In commercial FPGA devices, other resources exist that provide additional functionality and improved performance, power, and area.
Such resources include adder carry chains, embedded memory blocks, DSP blocks, embedded processor cores, and digitally-controlled clocking resources. The island-style FPGA architecture described in this chapter, however, is widely used in academic FPGA research because it resembles what exists in commercial devices. Therefore, this architecture is adopted as the underlying FPGA architecture in subsequent chapters.

2.1.2 CAD Flow for FPGAs

In order to implement an application on an FPGA, each configuration memory cell must be set appropriately. Typically, there are millions of these SRAM configuration memory cells in an FPGA, making it infeasible to generate the values of these cells without appropriate computer-aided design (CAD) tools.

The FPGA design flow starts with design entry, where the user specifies the circuit that will be implemented on the FPGA. Although it is possible to do this using schematic-based techniques, using hardware description languages such as VHDL and Verilog, or higher-level languages such as SystemC, is more common. The CAD tools then transform the high-level description of a circuit into the appropriate SRAM cell values; the task is decomposed into multiple steps to make the process more tractable. Figure 2.7 shows the steps in an FPGA CAD flow, with the output being a programming file that is used to configure the programmable resources in an FPGA device.

Figure 2.5: Routing architecture for island-style FPGA. (Diagram labels: clusters, channel, track, switch block, connection block, programmable connections.)

Figure 2.6: The details inside an example connection block for an LC’s input pin. (Diagram labels: routing wires, track isolation buffers, connection block multiplexer, outputs to LC input pins.)

The synthesis step transforms the high-level description of a circuit into a network of logic gates. During this step, logic optimizations are performed to simplify the logic and remove redundancies whenever possible.
The resulting logic netlist is mapped to LUTs and flip-flops (FFs) that correspond to the components in the target FPGA device; this is called technology-dependent mapping. These LUTs and FFs are packed into logic clusters (LCs) in the packing step. The final output of this step is a netlist of logic clusters that will be mapped to the physical locations on an FPGA device. Common algorithms to perform these tasks include [49, 50] for logic synthesis, [51–55] for FPGA technology mapping, and [56, 57] for clustering.

Figure 2.7: General FPGA CAD flow. (Flow: circuit description (VHDL, Verilog, SystemC, ...); synthesis (1. synthesize to logic network, 2. technology mapping, 3. packing into clusters); placement (assign logic blocks and I/O blocks to FPGA physical locations); routing (route connections between blocks on the FPGA routing resources); FPGA programming file (configuration bitstream).)

In the placement step, each block (LC or I/O block) is assigned to a physical location on the target FPGA device [46]. The main objective during placement is to produce a routable circuit, i.e., a circuit for which the router can route the nets that connect the different blocks on the available FPGA routing resources. Other objectives include minimizing the critical path delay [58] and the power [35] of the circuit. Analytical placement algorithms and simulated annealing are commonly used in this phase of the CAD flow [46, 59, 60].

The final step in the CAD flow is routing. In this step, the nets that connect the different clusters in a circuit are mapped to the available routing tracks in an FPGA device, and the switches that connect the routing tracks are set appropriately [61]. In addition to producing a legal routing by
using the available routing resources, routers strive to find solutions with good timing performance by keeping critical wires as short as possible [46].

Industrial CAD tools also provide platforms for developing embedded system-on-a-chip applications, such as the SOPC Builder and Qsys from Altera and the XPS from Xilinx [62, 63]. These tools facilitate designing large applications for FPGAs by providing the ability to integrate intellectual property (IP) cores into larger systems.

2.2 Power Dissipation

In this section we describe the sources of power dissipation in CMOS integrated circuits. We also describe the trends in power dissipation, and discuss the causes of power dissipation in FPGAs. Power can be dissipated in two ways: as a result of gates switching logic state (this is termed dynamic power) and as a result of static current that flows when gates are not switching (this is termed static or leakage power). Each of these will be described separately.

It is important to differentiate between power and energy. Power is measured in Watts while energy is expressed in Joules, where 1 Watt = 1 Joule/s. The instantaneous power through a device is computed by taking the product of the voltage across it and the current that flows through it.

\[ P(t) = I(t)\,V(t) \tag{2.1} \]

Energy can be computed as the integral of the instantaneous power over a time interval T, which is the area under the power curve.

\[ E = \int_{0}^{T} P(t)\,dt \tag{2.2} \]

The average power is then

\[ P_{ave} = \frac{E}{T} = \frac{1}{T}\int_{0}^{T} P(t)\,dt \tag{2.3} \]

In Figure 2.8 the height denotes the power and the area under the curve is the energy. Approach 1 uses more power for a shorter period of time, but both approaches use the same amount of energy. This distinction is especially important for battery-powered devices. For these
kinds of devices, the goal is usually to optimize the energy consumption since this would prolong the battery lifetime (i.e., the time between battery recharges).

Figure 2.8: Power vs. energy [7]. (Axes: time versus watts, with curves for Approach 1 and Approach 2. Power is the height of the curve; energy is the area under the curve.)

2.2.1 Dynamic Power

Dynamic power dissipation is due to switching and short-circuit current:

\[ P_{dynamic} = P_{switching} + P_{sc} \tag{2.4} \]

Switching power is due to the charging and discharging of load capacitances, as illustrated in Figure 2.9. The amount of switching power dissipated is given by

\[ P_{switching} = C_{Load}\,V_{DD}^{2}\,f_{switch} \tag{2.5} \]

where C_Load is the capacitive load and f_switch is the frequency at which the load capacitance is charged and discharged. For the inverter in Figure 2.9, f_switch is equivalent to the frequency of the input. Power is only dissipated when the load capacitance is charged and then discharged; the frequency of these events is typically not the same as the clock frequency of a circuit.

Figure 2.9: Switching current charges the load capacitance.

Short-circuit current, also referred to as crowbar current, occurs when both the PMOS and NMOS networks in a CMOS circuit are partially on. Figure 2.10 illustrates short-circuit current. The power dissipated is proportional to the duration of time the crowbar current flows, t_sc, the supply voltage, V_DD, the total switching current, I_peak, and the switching frequency, f_{0→1}. The short-circuit power can be written as:

\[ P_{sc} = t_{sc}\,V_{DD}\,I_{peak}\,f_{0 \to 1} \tag{2.6} \]

Figure 2.10: Short circuit current.

It is clear from Equations 2.5 and 2.6 that dynamic power can be optimized by reducing the voltage, the switching frequency, or the switched capacitance. The largest benefit could come from reducing the voltage since there is a quadratic relationship between switching power and the voltage.
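To make the quadratic dependence concrete, the following minimal sketch evaluates Equations 2.5 and 2.6 in Python. The function names and every numeric value (a 10 fF load, a 100 MHz switching frequency, and the two supply voltages) are illustrative assumptions and are not taken from this chapter.

```python
def switching_power(c_load, v_dd, f_switch):
    """Equation 2.5: C_Load * V_DD^2 * f_switch, in watts."""
    return c_load * v_dd ** 2 * f_switch


def short_circuit_power(t_sc, v_dd, i_peak, f_01):
    """Equation 2.6: t_sc * V_DD * I_peak * f_0->1, in watts."""
    return t_sc * v_dd * i_peak * f_01


def dynamic_power(c_load, v_dd, f_switch, t_sc, i_peak):
    """Equation 2.4, assuming f_0->1 equals f_switch for simplicity."""
    return (switching_power(c_load, v_dd, f_switch)
            + short_circuit_power(t_sc, v_dd, i_peak, f_switch))


# Hypothetical operating point, chosen only for illustration.
C_LOAD = 10e-15    # 10 fF load capacitance
F_SWITCH = 100e6   # 100 MHz switching frequency

p_nominal = switching_power(C_LOAD, 1.0, F_SWITCH)  # V_DD = 1.0 V
p_scaled = switching_power(C_LOAD, 0.8, F_SWITCH)   # V_DD = 0.8 V
ratio = p_scaled / p_nominal
print(f"switching power ratio at 0.8 V vs 1.0 V: {ratio:.2f}")
```

Under these assumed values, lowering the supply by 20% leaves about 64% of the original switching power, whereas the same 20% reduction in frequency or capacitance would leave 80%.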
However, reducing the voltage results in decreased performance, which impacts the total time required to perform an operation. This in turn might increase the energy per operation. Therefore, when optimizing for power, one should balance the different parameters and should take into consideration the impact on performance. More discussion on these relationships can be found in classical VLSI texts such as [8, 64, 65].

2.2.2 Static Power

Static power dissipation is due to three main sources of leakage current in a CMOS device. Figure 2.11 shows the leakage current paths for the three main sources [8].

• Subthreshold Leakage (I_sub): current flows from the drain to the source region when the transistor is in the off state.
• Gate Leakage (I_gate): current flows from the gate to the substrate through the silicon dielectric.
• Junction Leakage (I_junct): current across reverse-biased junctions between the diffusion regions of a transistor and the substrate.

Figure 2.11: Leakage current paths in an NMOS transistor [8].

Subthreshold leakage current is dominant in feature sizes below 90 nm. In 45 nm processes, gate leakage and subthreshold leakage are comparable. However, improved processes (such as high-K dielectrics) have made gate leakage less important than subthreshold leakage current.

As the size of transistors continues to decrease, the contribution of static power compared to dynamic power is increasing because of the increase in subthreshold leakage current. Subthreshold leakage current is mainly influenced by a transistor’s threshold voltage. The following equation shows one model that describes this leakage current [8].

\[ I_{ds} = I_{ds0}\, e^{\frac{V_{gs} - V_{t0} + \eta V_{ds} - k_{\gamma} V_{sb}}{n v_{T}}} \left(1 - e^{\frac{-V_{ds}}{v_{T}}}\right) \tag{2.7} \]

Interested readers can refer to [8] for details about the parameters in Equation 2.7. Lowering the threshold voltage (V_t) results in an exponential increase in subthreshold current.
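The exponential sensitivity can be illustrated with a minimal sketch that evaluates Equation 2.7 directly. The function name and all parameter values below (a 26 mV thermal voltage, n = 1.5, η = 0.1, zero source-to-body bias, and the two threshold voltages) are assumptions chosen for illustration; they do not describe any particular process.

```python
import math


def subthreshold_current(v_gs, v_ds, v_sb, v_t0, i_ds0,
                         eta, k_gamma, n, v_thermal=0.026):
    """Equation 2.7: subthreshold drain current for the given bias point."""
    exponent = (v_gs - v_t0 + eta * v_ds - k_gamma * v_sb) / (n * v_thermal)
    return i_ds0 * math.exp(exponent) * (1.0 - math.exp(-v_ds / v_thermal))


# Hypothetical off-state bias: gate at 0 V, drain at 1 V, no body bias.
common = dict(v_gs=0.0, v_ds=1.0, v_sb=0.0, i_ds0=1e-6,
              eta=0.1, k_gamma=0.0, n=1.5)

leak_high_vt = subthreshold_current(v_t0=0.4, **common)
leak_low_vt = subthreshold_current(v_t0=0.3, **common)
ratio = leak_low_vt / leak_high_vt
print(f"leakage increase for a 100 mV lower threshold: {ratio:.1f}x")
```

With these assumed parameters, a 100 mV reduction in threshold voltage raises the off-state current by roughly a factor of 13, since the ratio depends only on exp(0.1 / (n v_T)).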
Over successive MOSFET technology generations, the threshold voltage has been scaled down to maintain constant device overdrive and improve transistors’ switching speed [66]. This has exacerbated the effect of subthreshold leakage, and the amount is expected to increase in future technology nodes. Therefore, power optimization techniques for static power are necessary in order to maintain the pace of technology scaling, especially in battery-powered systems.

2.2.3 Power Dissipation Trends

It is widely believed that Moore’s law will hold true for the next several years. The law predicts that the number of transistors in a chip doubles every 18 months [67] because of the shrinking of transistor sizes. With this scaling of transistor dimensions, it became possible to integrate more transistors on the same chip area to provide more functionality. This resulted in an increase in the power density, which directly increases the generated heat [65].

Power consumption was historically dominated by dynamic power. Shrinking the transistor gate length below 90 nm made leakage power (mainly subthreshold leakage) another important component of the total power consumption. Figure 2.12 shows the contribution of the different power components over the years.

Figure 2.12: Trends for the scaling of the different power consumption components over the years [9]. (Plot of physical gate length (nm) and normalized total chip power dissipation versus year, 1990 to 2020, with curves for dynamic power, sub-threshold leakage, gate-oxide leakage, and gate length; the introduction of high-k dielectrics in mainstream production is marked.)

2.2.4 Power Dissipation in FPGAs

An FPGA contains a large number of circuit components to support device reconfigurability.
These circuit components include configuration static memory cells, lookup tables, a large number of routing wires and their drivers, and embedded blocks such as memories and multipliers. Therefore, FPGAs consume more power than ASIC devices when implementing the same circuit [15].

The power consumed in an FPGA’s routing resources dominates the total power consumption. Shang et al. found that 60% of the dynamic power is dissipated in routing resources in a Xilinx Virtex-II device, and the remaining 40% is dissipated in logic (16%), clocking (14%), and I/O (10%) resources [68]. Tuan et al. have reported a very similar breakdown for the dynamic power in a Xilinx Spartan-3 FPGA, which is based on a 90 nm CMOS process [26]. Routing resources consist of a large number of prefabricated wire segments and configurable interconnection points. Routing resources consume a large amount of power because the wire segments and fanout switches account for much of the interconnect capacitance.

The static power consumption in FPGAs has historically been dominated by the routing resources and the SRAM configuration memory cells. Tuan et al. reported that in a 90 nm Spartan-3 device, 36% of static power is consumed in routing, 44% is consumed in configuration memory, and 20% is consumed in logic resources [26]. As noted by Tuan et al., configuration memory cells do not switch during normal operation, thus they can be slowed down to reduce their static power consumption with no impact on the chip’s performance. Therefore, configuration SRAM cells can be implemented using a high-Vth, thick-oxide process to nearly eliminate both subthreshold and gate leakage currents. The same authors found that this technique can reduce the static power dissipated by the configuration memory by two orders of magnitude, which is about 43% of the total device’s static power [26].
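As a quick arithmetic check on the Spartan-3 breakdown quoted from [26], the sketch below removes the configuration-memory share and renormalizes what is left. Modelling "two orders of magnitude" as exactly a factor of 100 and neglecting the small residual configuration-memory leakage are simplifying assumptions made here, not statements from [26].

```python
# Static power shares reported for a 90 nm Spartan-3 device [26], in percent.
static_share = {"routing": 36.0, "configuration memory": 44.0, "logic": 20.0}

# Assumption: "two orders of magnitude" is modelled as a factor of 100.
REDUCTION_FACTOR = 100.0
config_after = static_share["configuration memory"] / REDUCTION_FACTOR
saved_points = static_share["configuration memory"] - config_after

# Assumption: the residual configuration-memory leakage is neglected
# when renormalizing the routing and logic shares.
remaining = static_share["routing"] + static_share["logic"]
routing_pct = 100.0 * static_share["routing"] / remaining
logic_pct = 100.0 * static_share["logic"] / remaining

print(f"saved: {saved_points:.1f} points of total static power")
print(f"routing: {routing_pct:.0f}%, logic: {logic_pct:.0f}%")
```

Under these assumptions the saving is close to the quoted figure of about 43% of total static power, and the remaining static power splits into roughly 64% routing and 36% logic.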
With this optimization, 64% of the static power is dissipated by routing resources and 36% is dissipated by logic resources in Spartan-3 devices.

As discussed in the previous subsection, static power has become more important. It has been reported that leakage and dynamic power in recent FPGAs built on 28 nm technology nodes are roughly equal [41, 42]. Routing resources are usually underutilized, which results in large static power consumption in these resources. Although some commercial FPGAs provide the capability to power down unused embedded blocks such as DSPs and block RAMs, none provide the capability to power down unused routing resources.

2.3 FPGA Power Reduction Techniques

This section describes techniques that are commonly used to reduce energy dissipation. These techniques include process, circuit, architecture, system, and CAD enhancements. Although these techniques will be described in the context of FPGA energy reduction, most of them also apply to non-configurable integrated circuits.

2.3.1 Clock Gating

The power dissipated in clock networks contributes up to 22% of the power in an FPGA tile [68]. The clock signal to idle functional units can be gated, when not needed, to reduce the dynamic power consumption. Clock gating is found in ASICs [69, 70] and microprocessors [71]. Figure 2.13 shows a clock signal for a group of registers. The clock and enable signals are inputs to an AND gate and the output connects to the clock ports of the registers. The gated clock signal can be connected to as many registers as desired to gate a region. If the enable signal is low, the values of the registers will not change, thus eliminating their dynamic power dissipation.

Figure 2.13: Illustration of clock gating concept for a group of registers.

This technique is utilized by state-of-the-art FPGAs to reduce clock tree power consumption.
Fine-grained clock gating is available through the use of clock-enable signals on registers in FPGAs from Altera, Xilinx, Microsemi and SiliconBlue [72–75]. Altera, Xilinx, and Lattice allow control of the clock tree using static or dynamic methods [20, 76, 77]. However, this feature is only available at the clock region granularity and often requires extra logic.

Huda et al. investigated clock gating granularity in FPGAs; user logic was used to control the enable signals of the clock switches [78]. The study in [78] reported large power reduction in the clock tree using clock gating and the necessary placement algorithms. Zhang et al. found that RTL-level clock gating is less efficient in FPGAs than in ASICs [21]. This is expected due to the large area overhead of RTL-level clock gating in FPGAs.

2.3.2 Multiple Supply Voltages

Reducing the supply voltage has a quadratic effect on dynamic power. Typically, SoCs use voltage islands to take advantage of this relationship. The supply rail voltage for functional blocks is allowed to vary depending on power and performance requirements [79]. Most FPGA-related works use only low (VDDL) and high (VDDH) voltage islands due to the complexity and hardware required to connect two voltage domains.

Li et al. proposed an architecture which supports a dual-VDD/dual-Vth FPGA fabric [29]. During chip fabrication, logic blocks are selected to be powered by VDDH or VDDL, while SRAM configuration cells use high-Vth to reduce leakage power. This work was extended by the same authors to allow voltage level configuration of some logic blocks at compile time [30]. Li et al. again followed up this work by enabling their architecture to employ programmable dual-VDD and power gating for routing resources [31]. Voltage level shifters are required to connect the output of VDDL logic blocks with the input of VDDH logic blocks [80].
2.3.3 Controlled Body Biasing

Body biasing is used to alter the threshold voltage (Vth) of a MOSFET by adjusting the voltage between the source and body. The relationship can be seen in the following equation:

\[ V_{th} = V_{t0} + \gamma\left(\sqrt{V_{SB} + 2\phi} - \sqrt{2\phi}\right) \tag{2.8} \]

where V_SB is the voltage difference between the source and the body, V_t0 is the threshold voltage with zero V_SB, γ is the body effect parameter, and 2φ is the surface potential parameter. Higher thresholds reduce static leakage power, while lower thresholds enable faster switching times. Altera employs a software-controlled back-biasing technique which can configure a tile to low-power or high-speed mode depending on performance requirements [10]. Devices which are not on the critical path can be biased to switch slower and to dissipate less leakage power.

Figure 2.14: An example design with the critical path shown. The body voltage of logic blocks can be adjusted depending on whether the block lies on the critical path [10].

Figure 2.14 shows an example FPGA design. Certain logic blocks are on the critical path and will benefit from high-speed logic. The rest of the FPGA can be configured with low-power logic as long as this does not change the critical path.

2.3.4 Dynamic Reconfiguration

Dynamic reconfiguration (DR) can be used to overwrite an FPGA region at run-time with a new configuration bitstream [81]. This does not affect the rest of the design, but incurs time and power overhead. Current FPGA compile tools allow DR at a coarse granularity.

Figure 2.15 shows an example of a system employing DR. Areas of the FPGA can be designated as reconfigurable partitions (RPs), which act as containers for different design configurations. A reconfigurable module (RM) is the logic that is loaded into an RP. In this example, two regions are designated as RPs, and each region has three associated RMs.
All RMs associated with a given RP must have the same port list, so that they can attach to the static logic in a uniform manner. An RP can support as many RMs as desired, but only one RM associated with an RP can be active at a time. At run-time, an active RM can be loaded into each RP without affecting the rest of the design. This “time-multiplexing” allows multiple hardware implementations in a smaller area.

Figure 2.15: Dynamic reconfiguration set up using: static logic, reconfigurable partitions, and reconfigurable modules. A reconfigurable partition can be configured with a reconfigurable module. (Reconfigurable Partition A is time-multiplexed among Reconfigurable Modules A0, A1, and A2; Reconfigurable Partition B is time-multiplexed among Reconfigurable Modules B0, B1, and B2; both attach to the static logic.)

DR can be used to dynamically scale the clock frequency of an on-chip digital clock manager (DCM) [82]. Liu et al. proposed using DR to turn off modules in a design which are not being used, by loading blank designs [83]. This resulted in a reduction in the dynamic power dissipation of arithmetic modules. However, the blank region still experiences the same static power dissipation.

2.3.5 CAD Techniques

Other power reduction techniques for FPGAs include enhancing the computer-aided design (CAD) algorithms so that they seek low-power implementations of user circuits. These techniques typically take advantage of detailed power models [84–86] to direct decision making in the compilation stage. Included in this group are clock-aware optimizations [20, 36, 78], power-aware placement and routing [35, 87, 88], and power-aware technology mapping [35, 89, 90]. There are other CAD techniques for specific low-power architectures.
These include dual VDD-aware voltage assignment [32] and mapping [91], synthesis of a power state controller using the data flow graph (DFG) of an application [24], and sleep mode-aware region-constrained placement [22].

2.3.6 Other Techniques

There exist many other static and dynamic power dissipation reduction techniques at the device and circuit level for FPGAs.

Altera and Xilinx both use transistors with different oxide thicknesses (triple gate oxide) in their FPGAs. This process enables a trade-off between performance and static power. They can reduce dynamic power by reducing capacitance through the use of low-K dielectrics. They additionally employ a low core voltage, larger transistor lengths, and higher Vth [10, 40].

Other published techniques include configuration-bit flipping [92], low-power circuits for routing resources [38, 39], and reduction of glitch power dissipation [93, 94].

2.4 Power Gating

In this section, we provide an overview of power gating as it forms the basis for the architecture proposed in this dissertation. Power gating, also known as multi-threshold CMOS (MTCMOS), is a technique in which parts of a chip can be turned off at run-time to reduce the static (leakage) power consumption [7, 43]. This is useful in reducing the static energy dissipated through subthreshold current during the idle periods of a circuit block; longer idle times lead to larger savings. Power gating is the most efficient technique to address idle-period power consumption [7, 43]. Power switches, also known as sleep transistors, can be realized with header and/or footer transistors that can be turned off during idle times to cut off the power supply for the power-gated circuit blocks. Figure 2.16 shows the general structure of an application circuit with power gating. This figure shows three power domains, namely A, B, and C. Domains A and B are power-gated, while domain C is always-on.
A power control unit provides the necessary control signals to switch the power state of the power-gated domains. Each power-gated domain has power switches that can be switched off to disconnect the power supply, isolation logic to isolate the floating outputs during the sleep state from the rest of the system, and a mechanism to retain the state of the circuit block (retention flops).

Figure 2.16: General structure for a power-gated application. (Domain C is always-on and contains the power gating controller. Domains A and B each have isolation logic and retention flops, are supplied from a virtual VDD rail that is switched from VDD, and are controlled by the SLEEP_A and SLEEP_B signals, respectively.)

Different sleep modes can be implemented based on the required response to sleep and wakeup events. If fast sleep/wakeup times are required then state retention might slow down the process. Application requirements must be taken into consideration when designing the state retention part of the power-gated circuit. For example, in a microprocessor, the processor’s logic and its cache memory can be split into two different power domains. Therefore, the processor can enter sleep mode while the cache memory can be kept powered to retain data. If a deep sleep is required, both the processor and its cache memory can enter sleep mode, but in this case the data in the cache memory must be properly handled before entering the sleep mode (by writing it back to memory, for example).

To properly realize power gating, there must be a sequence of activities when turning the circuit block on or off. Generally, to turn off a power-gated block with retention, the following sequence of events is required:

1. Stop all data transfer operations.
2. Stop the clock to prevent data movements during the turn-off phase.
3. Assert the isolation control signal to bring the outputs to a safe condition.
4. Assert the state retention save condition.
5. Assert the power gating control signal to turn off the circuit block.

To turn a power-gated block back on, the following sequence of events is generally required:

• De-assert the power gating control signal to power on the block. This is usually staggered to prevent drawing a large amount of current from the supplies (inrush current) that may cause problems for nearby circuit components.
• Assert the restore condition of state retention.
• De-assert the output isolation in order to restore the outputs to their current values.
• Start the clock in order to resume operation.

Isolation cells can be as simple as an AND or OR gate that can be used to clamp the output to logic 0 or logic 1. Other choices for the isolation mechanism include the use of pull-up or pull-down transistors at the outputs. The advantage of this scheme is the small area and power overhead, and the smaller number of nets that need to be routed. However, this increases the complexity of the design since a power-gated net would have more than one driver, which adds overhead for the verification process and may result in reliability problems even if small currents pass through the transistors. Therefore, the use of this isolation scheme is usually limited to custom designs where the circuit components are designed and verified carefully [7].

Different mechanisms can be used for state retention, including saving state to a memory (on- or off-chip) or using retention registers. The former is slower and suitable for cases where deep sleep mode is desired for extended periods of time and the area overhead is critical, while the latter is suitable for fast entry to and exit from sleep mode but requires more area overhead. Detailed discussion of the different state retention mechanisms can be found in [7].

Sleep transistors are typically implemented using a higher threshold voltage than that for the other parts of the chip to reduce their leakage current when they are turned off.
When high power switch efficiency is required (area vs. leakage power saving), sleep transistors with a typical threshold voltage can be used, but their gates are overdriven above the nominal voltage (in the case of header switches) when turned off to reduce their leakage current [8]. This requires the availability of multiple voltage levels in the chip.

Both header and footer switches can be included in a power-gated design. This results in the best power gating since the current paths both from the power supply and to ground are cut off when the sleep transistors are turned off. However, this configuration results in large area overhead and large performance impact. Typically, only an NMOS or a PMOS transistor is used as a sleep transistor, with PMOS transistors being more common since their leakage current is smaller than that of NMOS transistors, but their area overhead is larger because of their lower carrier mobility (µp) [8].

2.5 Related Work

This section provides an overview of additional work related to the contributions of this dissertation. We focus on works that proposed the use of power gating on FPGAs, inrush current in power-gated designs, and CAD optimizations for power-gated FPGAs.

2.5.1 Power-Gated FPGAs

Sleep mode is supported by Microsemi’s IGLOO FPGAs. In this mode, the FPGA core is turned off to reduce power consumption. Another low-power mode, called Flash*Freeze, is also supported, in which power can be reduced by controlling the clock and I/O while retaining the state of registers and RAM [75]. Similarly, some Xilinx devices support a low-power mode, called Suspend Mode, that reduces static power by about 40% [95, 96]. However, these modes apply to the whole FPGA device, and cannot be used to selectively choose which FPGA parts to put in low-power mode.

Lin et al.
studied fine-grained power gating for FPGAs to turn off unused resources at configuration time [23]; their study showed that the area overhead could be more than 100%, which is undesirable because of the associated degradation in power and timing, and the increase in cost.

Gayasen et al. proposed coarse-grained power gating by using a power switch for a region of logic blocks [22]. The use of dynamic reconfiguration was suggested to change the power state for the different regions in an FPGA based on their activity. However, this incurs power overhead and can only be applied at a very coarse granularity.

Tuan et al. proposed power gating for an architecture similar to the Xilinx Spartan-3 [26]. Their architecture supports sleep mode by using a sleep signal from an off-chip controller that is connected to all power switches in the FPGA; this scheme allows creating only one controllable power domain.

Bharadwaj et al. proposed synthesizing a power state controller (PSC) from the data flow graph (DFG) of an application; this controller could exploit the idle periods of the application to reduce the dissipated leakage energy in an FPGA [24]. They evaluated their work using the architecture from [22].

Li et al. proposed using a power control hard macro (PCHM) that is associated with each power-gated region in an FPGA to control its power state (clock and power gating) [97]. They use a power gating architecture that we proposed in one of our published papers [1].

Hoo et al. proposed fine-grained power gating for switch blocks (SBs) [98] and a routing algorithm to optimize the power savings. The proposed architecture, however, only supports powering down unused SBs at configuration time.

Mondal and Memik observed that an average of 47% of LUTs in an FPGA have at least one unused input.
Therefore, turning off some of the LUTs’ configuration memory is possible [25]. However, modern low-leakage processes are typically used to implement SRAM cells, which results in near-zero leakage for these components. Furthermore, the majority of leakage power dissipation in FPGAs comes from other sources such as intra- and inter-cluster routing resources, which have not been addressed in [25].

A common theme in all of the above works is that fine-grained power gating for FPGAs was considered infeasible because of its large area overhead. Therefore, most of these works focused on coarse-grained power gating. Another interesting observation is that none of the above works (except [97], which is based on our work) provided details for dynamic power gating.

The above works do not support dynamically powering down SBs at run-time during an application’s idle periods. In our work, however, we propose an architecture that supports dynamically-controlled power gating for SBs and logic blocks.

2.5.2 Handling Inrush Current

The works in [22, 24, 26, 97] discussed the use of dynamically-controlled power modes in FPGAs; however, these works did not address inrush current and the associated voltage droop in FPGAs during the wakeup phase. The work in [99] proposed the use of power gating in a CGRA to accommodate different application loads. However, the architecture does not have general-purpose inrush current handling circuitry that can be used with other architectures. Most of the works that addressed inrush current were focused on application-specific power-gated designs, as discussed below.

In [100], the authors proposed reducing glitches on the ground and power rails due to inrush current by dynamically controlling the gate-to-source voltage (Vgs) of sleep transistors.
They also proposed daisy-chaining the wakeup of the sleep transistors in a chain of size-increasing sleep transistors.

Howard and Shi proposed reducing inrush current by splitting the chip into logic rows, each powered by one or a few sleep transistors [101], with a controller to stagger their turn-on. They also proposed a two-stage power-on method as an alternative solution. A trickle chain of sleep transistors is turned on first to slowly charge the floating nodes. After some time, the main chain is turned on to fully charge the nodes.

Shi and Li proposed a programmable power gating unit [102]. The unit is composed of multiple daisy chains of sleep transistors, and can be configured to select which chains are turned on first to trickle charge the design, and which chains are activated later to fully charge the virtual nodes to VDD.

Calimera et al. presented a power gating reactivation technique based on modulating the size of sleep transistors with delay elements [103] in order to limit the wakeup current. The authors presented an algorithm that can find the optimal sizes of the sleep transistors (STs) in the delay chain for a specific standard cell library.

The solutions in [100–103] are suitable for handling inrush current in fixed-function chip designs where the functional blocks and their inrush current requirements are known when the chip is designed. Unlike ASICs, reconfigurable architectures are flexible, and they need a solution that is suitable for a wide range of applications. This is addressed later in this dissertation.

2.5.3 CAD Optimizations for Power-Gated FPGAs

A few works addressed the problem of optimizing the power-gated resources in an FPGA that supports power gating. Some of these works address placement; others address routing.

Region-Constrained Placement

Gayasen et al. used region-constrained placement to increase the number of resources that can be powered down at configuration time [22].
This is done by using the placement constraints supported in Xilinx CAD tools. The authors also proposed module-level region-constrained placement, in which placement constraints are applied for individual modules in a circuit. The work in [22] does not evaluate how the use of placement constraints would impact the routing resources that can be powered down, even though routing resources consume a significant amount of leakage power.

Li’s Placement - Fill Power Gating Regions

The works in [97, 104] enhanced an FPGA placement algorithm to increase the number of power-gated FPGA regions that can be powered down at configuration time, and to increase the number of regions that can be powered down when a power domain is idle.

The placement cost function in the VPR tool [105] was changed to include a term that represents the cost of power gating regions, as follows:

\[ Cost = (1-\gamma)\,\underbrace{\bigl((1-\tau)\,Cost_{W} + \tau\,Cost_{T}\bigr)}_{\text{Original cost function in VPR}} + \gamma\,Cost_{SR} \tag{2.9} \]

where Cost_W and Cost_T are the wire length and timing costs, and τ is a constant between 0 and 1 that is used to trade off routability and timing of the placement. The work in [97, 104] introduced two additional terms to the cost function, specifically γ and Cost_SR. Cost_SR is a metric used to indicate how many power gating regions can be powered down, and γ is a value between 0 and 1 that is used to trade off the original cost function against the new cost.

There are two main issues in the techniques presented in [97, 104]. The first is that they did not consider the impact of their placement algorithm on the routing resources that can be powered down during idle times. The second is that the benchmark circuits used to evaluate the proposed algorithms consisted only of two power domains that are not connected by any signals, which is not realistic.
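The blend in Equation 2.9 can be sketched in a few lines of Python. This is only a direct transcription of the equation as written above, not code from VPR or from [97, 104]; the function names and the numeric costs are hypothetical.

```python
def vpr_cost(cost_w, cost_t, tau):
    """Original VPR-style blend of wire length and timing cost."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    return (1.0 - tau) * cost_w + tau * cost_t


def pg_aware_cost(cost_w, cost_t, cost_sr, tau, gamma):
    """Equation 2.9: blend the original cost with the region cost Cost_SR."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return (1.0 - gamma) * vpr_cost(cost_w, cost_t, tau) + gamma * cost_sr


# Hypothetical normalized costs for one candidate placement.
base = vpr_cost(0.6, 0.4, tau=0.5)
blended = pg_aware_cost(0.6, 0.4, 0.9, tau=0.5, gamma=0.25)
print(f"original cost: {base:.3f}, power-gating-aware cost: {blended:.3f}")
```

Setting gamma to 0 recovers the original cost, while setting it to 1 makes the placer consider only the power gating region cost, which shows how γ trades placement quality against the number of regions that can be powered down.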
Real circuits may have multiple interconnected power-gated and always-on domains, and the algorithms must take this into consideration.

PG-Aware Routing

The work in [98] proposed a power-gated routing architecture and the necessary improvements to the routing algorithm to optimize the number of power-gated routing resources. The following cost function was used to evaluate the cost of node n that is investigated during the inner loop of the routing algorithm in VPR:

\[ Cost(n) = \underbrace{crit_{net} \times t_{net} + (1 - crit_{net}) \times congCost}_{\text{Original cost function in VPR}} + k \times baseCost \times e^{-numNetsInRegion} \tag{2.10} \]

where crit_net is the timing criticality of the connection being routed, t_net is the timing cost of the route to reach node n, congCost is the congestion cost of using node n, k is a scaling factor, baseCost is the power-gating base cost of using a specific node, and numNetsInRegion is the number of nets that have used the power-gated unit of routing.

The modified routing cost function increases the cost of using routing resources that have already been used to route only a small number of nets, which discourages using them in subsequent routing iterations. However, this only focuses on optimizing the number of resources that can be powered down at configuration time, and is not designed to increase the number of resources that can be powered down at run-time. Ideally, the architecture should support powering down routing resources at configuration time and at run-time, and the routing algorithms should optimize power for both types of resources by taking into consideration the behaviour of the user’s design. As will be discussed in Chapter 5, the finer the granularity of power gating for routing resources, the easier it is to power down unused resources at configuration time.

2.6 Summary

In this chapter, we reviewed the island-style FPGA architecture that forms the basis for many commercial FPGA architectures, and several different power dissipation and optimization techniques.
The power trends suggest that static power, which is mostly dissipated during idle periods, is becoming more important with technology scaling. Most of the previous works on FPGA power optimization do not address the static power consumption during idle periods. Even though power gating was presented in the literature as an efficient way to reduce the static power of FPGAs, the focus of previous works was on powering down unused resources in a chip at configuration time. This does not provide significant power savings except for applications that use a small portion of a chip's logic resources, which is often not the case. Subsequent chapters in this dissertation present our proposed architecture that enables powering down idle functional blocks in an FPGA in order to reduce the total application energy consumption.

Chapter 3
Dynamically-Controlled Power-Gated FPGA Architecture

This chapter presents the proposed dynamically-controlled power-gated (DCPG) FPGA architecture. The power-gated resources include logic clusters (LCs), routing channels (RCs), and switch blocks (SBs). The chapter describes the details of the power gating circuitry for these components, and the architectural features that enable controlling the power state of these resources using signals that can be routed on the available routing resources. Furthermore, the chapter describes how different granularities of the architecture can be realized. Results are shown for area and power of the proposed architecture for different architecture parameters. Much of this chapter was published in [1, 3, 5, 6].

3.1 Power Gating Framework: A High Level Picture

Figure 3.1 shows an example system of three modules that are mapped to an FPGA that supports dynamically-controlled power gating (DCPG). Each module occupies a number of regions in the target FPGA architecture. We refer to these regions as power gating regions (PGRs).
Each of these PGRs is composed of a number of FPGA tiles; the number of tiles in a region dictates the architecture granularity. The power state of each PGR can be configured as always-on, always-off, or dynamically-controlled. As will be explained later in this section, the power state for some of the internal components of a PGR can be configured to a different power state than that for the encapsulating PGR.

Figure 3.1: Example application mapped to an FPGA supporting DCPG.

In this example, two of the functional modules, M1 and M2, experience long idle periods, thus it is desired to power them down during these times to reduce the leakage power consumption. This leads to a reduction in the overall energy consumed by the application. The power states of the PGRs of M1 and M2 are configured to "dynamically-controlled", which allows controlling their power state at run-time. Power control signals are routed from a power controller module to control the power states of modules M1 and M2. The third module, M3, does not experience idle periods, thus its power state is configured to be "always-on". Similarly, the power state for the power controller is configured to be always-on. The power state for routing resources that are used to route the power control signals is configured to always-on.

The proposed architecture enables realizing power domains with different temporal (idle/active periods) and spatial characteristics (sizes and locations), thus it is suitable for a wide range of applications. There is no need to have fixed tracks in the FPGA fabric dedicated as control signals; rather, power control signals can be routed on the pre-existing FPGA routing fabric similar to any other user signal.
The following subsections describe the details of this architecture.

Note that the steps of power gating described in Section 2.4 must also be performed for circuits designed using the proposed architecture in this chapter. However, many of the issues, such as output isolation and state retention, are supported by the proposed architecture. We assume that clock gating is used in conjunction with the power gating architecture in this chapter. In addition, the control circuitry that controls the power gating process is provided by the user as it is done in ASIC designs, or ideally by smart CAD tools that can derive the controller logic using information about the behaviour of the application.

3.2 Power-Gated Logic Resources

3.2.1 Fine-Grained Power Gating

In this subsection, we describe a fine-grained version of the proposed power gating architecture. Figure 3.2 shows two tiles of an FPGA; some details are not shown for the sake of clarity. The basic power gating architecture supports power gating at the granularity of one tile, thus a PGR is one tile. Three power modes are supported for the power-gated resources by configuring the SRAM cells that control the select lines of the 3:1 multiplexers that drive the power switches. The supported power modes are always-on, always-off, and dynamically-controlled. The novelty of the proposed architecture lies in its support for controlling the power state of individual LCs and the routing resources (input pin connection boxes, track isolation buffers, and switch blocks) dynamically at run-time. This makes the proposed power gating scheme suitable for various tile-based FPGA architectures.

Figure 3.2: Basic fine-grained DCPG for two neighbouring tiles. Shared bordering routing channels have their own power gating circuit.

The always-on mode sets the resources in a powered state. This is useful for resources that need to be available all the time, such as power control signals or application modules that do not experience idle times. The always-off power state puts the resources in a low-leakage sleep mode. This is useful for resources that are not utilized by the application that is mapped to the device. Dynamically-controlled means that the power state of a resource can be controlled at run-time (alternating between powered and sleep modes based on activity), and can be changed by changing the value on the power control signal.

Figure 3.2 shows that one of the bordering input pins of each LC can be used to route the power control signal to the power switch (control signals are labelled PG_CNTL1 and PG_CNTL2). If an LC is configured as dynamically-controlled, its power gating multiplexer (the 4:1 multiplexer in the figure) is used to route the power control signal by configuring the SRAM cells that control the select lines of the 4:1 multiplexer. The input pin used to route the power control signal, therefore, cannot be used by the logic implemented in the LC. Variations of this organization, in which only a subset of the input pins are used as inputs to the power gating multiplexer, can also be realized.

For correct operation of the dynamically-controlled mode, the power state of the routing channel that is used to route the power control signal must be configured as always-on. To support this, separate power gating circuitry is used for the bordering routing channels of an LC. The lower part of Figure 3.2 shows the details. The power state of a routing channel can be configured to the different power modes discussed previously.
When configured to the dynamically-controlled mode, the AND gate ensures that the shared routing channel is turned off only when both neighbouring LCs are turned off (when PG_CNTL1=1 and PG_CNTL2=1). This ensures that any of the bordering routing channels of an LC can be used as the entry point for the power control signal. Therefore, such a signal could be routed from an on-chip power controller to the target logic clusters in the same way that any other user circuit signal is routed.

The same power control signal can be routed to any number of LCs that belong to the same power-gated module, which forms one power domain. The SBs' power state can be configured in the same manner discussed above. More details about SB power gating will be discussed later in this section.

In the proposed architecture, configuration memory cells and the flip-flops (FFs) in an LC are not power-gated. The configuration SRAM cells are typically implemented using a low leakage, high-Vth process, such as the medium oxide thickness transistors used in configuration SRAM cells in some commercial FPGAs [40]. The area of flip-flops within an LC is relatively small, and thus they consume only a small amount of leakage power. Therefore, these components are kept on all the time instead of using other state-saving mechanisms that would increase the architecture complexity and power consumption.

Figure 3.3 shows the details of an LC in the proposed architecture. Pull-down NMOS transistors are used to isolate the outputs of the LC when the LC is in sleep mode. This prevents large short circuit current in SB buffers that are driven by the LC outputs. Similarly, the inputs of the FFs inside an LC are also isolated to prevent large short circuit current in the FFs. Notice that we assume that clock gating is used in association with dynamically-controlled power gating mode during the idle times; this guarantees that the values stored in the FFs do not get corrupted during sleep mode.
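The mode selection and the shared-channel rule described in this subsection can be summarised behaviourally. The sketch below is only an illustration under stated assumptions: the names `PowerMode`, `sleep_signal`, `shared_rc_sleep`, and `isolated_output` are hypothetical, and the convention that a value of 1 means "sleep" is taken from the PG_CNTL1=1 and PG_CNTL2=1 condition above. It models logic values only, not the transistor-level circuit.

```python
from enum import Enum


class PowerMode(Enum):
    """The three configurable power modes (hypothetical encoding)."""
    ALWAYS_ON = "ON"
    ALWAYS_OFF = "OFF"
    DYNAMIC = "DC"


def sleep_signal(mode: PowerMode, pg_cntl: int) -> int:
    """Behavioural view of the mode multiplexer: 1 = sleep, 0 = powered."""
    if mode is PowerMode.ALWAYS_ON:
        return 0
    if mode is PowerMode.ALWAYS_OFF:
        return 1
    return pg_cntl  # dynamically-controlled: follow the routed control signal


def shared_rc_sleep(mode: PowerMode, pg_cntl1: int, pg_cntl2: int) -> int:
    """A shared bordering routing channel sleeps only if both neighbours sleep."""
    return sleep_signal(mode, pg_cntl1 & pg_cntl2)


def isolated_output(mode: PowerMode, pg_cntl: int, logic_value: int) -> int:
    """LC outputs are pulled down to 0 while the LC is in sleep mode."""
    return 0 if sleep_signal(mode, pg_cntl) else logic_value
```

For instance, `shared_rc_sleep(PowerMode.DYNAMIC, 1, 0)` evaluates to 0: the channel stays powered because one of the two neighbouring LCs is still on.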
The gates of the pull-down NMOS transistors are controlled using the output of the 3:1 multiplexers that drive the power switch.

Figure 3.3: Internals of an LC with DCPG showing pull-down NMOS devices at FF input and LC outputs. A status signal to indicate completion of power transition can be routed through one of the LC's outputs (the top).

The architecture also supports sending a feedback signal to indicate that a power domain has completed a power transition. Figure 3.3 shows that an inverted version of PLDN_CNTL, which is the output of the related 3:1 multiplexer in Figure 3.2, can be routed through one of the LC's outputs. Thus, it can be routed to the power controller or other modules in the application circuit. Note that in a power-gated module, only one LC needs to send this feedback signal. Timing analysis can be used to determine which LC is the last one to be turned on/off, and it can be used to send the feedback signal.

3.2.2 Region-Based Power Gating

The area and power overheads associated with the architecture in the previous subsection are due to the LC and routing channels' sleep transistors, the power gating multiplexer of an LC, the 3:1 multiplexers that drive the gate of the sleep transistors, the AND gates required to implement proper power gating for the routing channels, and the extra required SRAM configuration memory cells.

Typically, when an application is mapped to an FPGA, blocks that are part of the same functional module are placed close to each other in order to minimize a target cost function, such as timing and wiring cost [58]. Thus, it is likely that a group of LCs and routing channels that are spatially close to each other share the same power state.
It is, therefore, feasible to support power gating at a coarser granularity level than what is described in the previous subsection in order to reduce the area and power overheads of the power gating circuitry.

The concept of coarse-grained PGRs is presented here. Each PGR is composed of a number of LCs and their associated routing resources. Similar to the tile-level architecture in the previous subsection, the SRAM configuration memory cells and flip-flops are powered on all the time.

Figure 3.4 shows an example dynamically-controlled power gating region (PGR) of size 2x2 tiles. Some details are omitted for clarity. The internal part of the region in the large dark box in the figure (LCs and internal RCs) shares the same power switch; thus, its power state can be configured as one unit. The individual SBs (or SB partitions as will be explained in Section 3.3) of the region and the individual bordering RCs, on the other hand, have their own power switches and their power states can be configured separately; however, their power switches can still be controlled using the region's control signal (labelled PG_CNTL in the figure) when they are configured as dynamically-controlled. Figure 3.5 shows the details of the three main power-gated components in a region. The bordering RCs in the coarse-grained PGRs have the same structure and functionality described for the RCs in Figure 3.2; they can be used to route the power control signal to a PGR.

Figure 3.4: Example PGR of 2x2 tiles. Internal region's LCs and CBs share the power switch. PGR's SBs and bordering RCs have individual power switches.

Different PGR sizes can be realized in the same manner. For example, a 3x3 PGR consists of 3x3 tiles (this PGR has 12 bordering RCs). Larger PGRs make it more challenging for the CAD tools to group related blocks in the same PGR, resulting in smaller power savings. On the other hand, the area and power overheads in smaller PGRs are larger. In terms of application mapping, a small PGR size means that an application occupies a larger number of PGRs; more routing resources would be needed to route the power control signals, which may negatively impact routability and require more always-on routing resources. The relationship between the power gating granularity and power savings is investigated in more detail in Chapter 5.

3.3 Power-Gated Routing

3.3.1 Switch Block Power Gating

In the previous subsections, we described the proposed power gating architecture for LCs and RCs (track isolation buffers and connection boxes). This subsection focuses on describing the power gating circuitry for SBs in a PGR.

Figure 3.5: Details of the power gating circuitry for the different components in a PGR: (a) Internal Part, (b) Region's SBs, (c) Bordering RCs. The internal LCs and RCs share the same power switch, individual bordering RCs have their own power switches, and individual SBs (or SB partitions) have their own power switches.

The example PGR in Figure 3.4 has a size of 2x2 tiles. The power control signal that is used to control the power state of the LCs region (PG_CNTL) is also used to selectively control the power state of the individual SBs that belong to the same region. For each LC, the SB that belongs to the same region as the LC lies in the bottom-right corner of that LC.

Figure 3.6 shows the power gating circuitry for an SB. This circuitry is similar to that for the other components as described in the previous subsections.
The power state for each SB in a PGR can be configured to always-off, always-on, or dynamically-controlled by setting the proper values for the configuration SRAM cells of the 3:1 multiplexer. Minimum-sized pull-down NMOS transistors are placed at the outputs of the SB to pull them to ground during sleep mode to ensure proper output isolation. The gate input of the pull-down transistors is the same as the gate input to the power switch.

Figure 3.6: Power gating circuit for a switch block. SB outputs are pulled down to GND when the SB's power is off (in sleep mode).

3.3.2 Fine-Grained Switch Block Power Gating

The power-gated SB architecture in the previous subsection enables configuring an SB's power state as one unit. However, our experiments for many application circuits showed that more than 50% of the SBs' switches are not used (see Figure 5.18(b)). Supporting finer granularity power gating for SBs, therefore, may result in a larger number of switches that can be turned off either statically or dynamically at run-time compared to coarse-grained SB power gating. This would have a significant impact on the total power consumption since an SB consumes more than 70% of a tile's leakage power.

Figure 3.7 shows how all switches in a specific SB side are grouped into one power gating partition to implement a finer granularity power gating. Partitions per side (PPS) is used as an architecture parameter to describe the power gating granularity of an SB. For example, PPS = 1 for the SB in Figure 3.7. Increasing PPS results in finer granularity power gating. PPS = 0 represents an architecture where the power state for an SB is configured as one unit (coarse-grained power gating), whilst PPS = N_switch indicates the finest power gating granularity where the power state for each switch can be controlled individually. N_switch is the number of switches that exist in an SB's side. The cost of increasing PPS is the additional area and leakage power due to the additional power gating circuit components and the increase in the total effective sleep transistor size.

Figure 3.7: One power gating partition per SB side (details for two sides are shown). The power state for each partition can be configured separately.

Figure 3.8 shows examples of incoming wire tracks in an SB and their connections. The outputs of the incoming wires' buffers are connected to inputs of switches in the different SB sides. Turning the incoming wires' buffers off requires that all of the switches they drive are turned off. This requires extra logic to check, which results in large area and power overheads. Therefore, when PPS is larger than 0, the buffers of incoming wire tracks are not part of the power gating circuit of an SB (i.e., they are always powered if PPS > 0).

Figure 3.8: Example incoming wire track from the left side of an SB, showing two cases: (left) wire ending in an SB, and (right) wire passing by an SB.

3.4 Supported Power Modes

The presented architecture enables different power mode configurations for the different components in a PGR that are illustrated in Figure 3.5. Table 3.1 shows the supported power modes. For example, if the internal part of a PGR (LCs and internal RCs) is configured as dynamically-controlled, there is flexibility in configuring the power state for the individual SBs (or SB partitions) and bordering RCs. This flexibility is very useful especially when some SBs need to be always-on to route important signals such as power control signals or inter-module signals.

Table 3.1: Configurable power modes supported by the different components in a power gating region (PGR).

Internal Components (a)    Each Border RC         Each SB Partition
-----------------------    -------------------    -----------------
OFF                        OFF / ON / DC (b)      OFF / ON / DC
ON                         OFF / ON / DC          OFF / ON / DC
DC                         OFF / ON / DC          OFF / ON / DC

(a) The PGR's internal LCs and RCs that share the same power switch.
(b) OFF = always-off, ON = always-on, DC = dynamically-controlled.

3.5 Modelling Application Behaviour

The overall power savings that the proposed power gating architecture could achieve depend very much on the application. Applications with blocks having long idle times would result in significant power savings because these blocks could be powered down for long periods. In this section, we present a simple model to characterize applications and we use it to evaluate the upper bound on the reduction in leakage energy during sleep mode.

An application can be characterized by the percentage of logic clusters and routing that could be power gated, and the percentage of time they stay in the sleep mode. These two parameters could be derived from a more complex model that involves more details about the application.

Table 3.2: Application parameters.

Term           Meaning
-----------    -------
R_PSC          The number of PGRs the power state controller occupies
R_app          The number of PGRs the application occupies
F_idle_PGRs    The fraction of R_app regions that can be power gated because they go through some idle periods
F_idle_time    The fraction of the application execution time during which F_idle_PGRs × R_app regions are idle

Table 3.3: Architecture parameters.

Term             Meaning
-------------    -------
R_device         The number of PGRs in an FPGA device
P_leakUG         The leakage power of a PGR of the ungated architecture
P_leakSG_ON      The leakage power of a PGR of the static gating architecture when not in sleep mode
P_leakSG_OFF     The leakage power of a PGR of the static gating architecture when in sleep mode
P_leakDG_ON      The leakage power of a PGR of the dynamic gating architecture when not in sleep mode
P_leakDG_OFF     The leakage power of a PGR of the dynamic gating architecture when in sleep mode

For the model in this section, we consider three architectures:

• Ungated: this is the baseline FPGA architecture that does not support power gating.
• Static-gating: this is an architecture that supports statically-controlled power gating, such as the one presented in [26]. The power state for this architecture can only be set at configuration time.
• Dynamic-gating: this is the dynamically-controlled power gating architecture that we proposed in Sections 3.2 and 3.3.

Table 3.2 shows definitions for the application parameters and Table 3.3 shows definitions for the architecture parameters we use in the application model described here. Let R_SG denote the number of regions that could be put into sleep mode at configuration time because they are not used by the application, and R_DG denote the number of regions that could be turned off during application execution for F_idle_time of the application running time.
We can calculate these two values as follows:

$$R_{SG} = R_{device} - (R_{PSC} + R_{app}) \tag{3.1}$$

$$R_{DG} = \lfloor F_{idle\_PGRs} \times R_{app} \rfloor \tag{3.2}$$

Let T_app be the application execution time on an ungated architecture, and T*_app be the application execution time on a gated architecture, i.e., T*_app accounts for performance degradation due to using sleep transistors. Then, we can calculate the dissipated leakage energy for an application for the three architectures using the following equations:

$$E_{ungated} = (R_{SG} \times P_{leakUG} + R_{DG} \times P_{leakUG} \times F_{idle\_time}) \times T_{app} \tag{3.3}$$

$$E_{static} = (R_{SG} \times P_{leakSG\_OFF} + R_{DG} \times P_{leakSG\_ON} \times F_{idle\_time}) \times T^{*}_{app} \tag{3.4}$$

$$E_{dynamic} = (R_{SG} \times P_{leakDG\_OFF} + R_{DG} \times P_{leakDG\_OFF} \times F_{idle\_time}) \times T^{*}_{app} \tag{3.5}$$

The above model uses the simplification that the number of regions that could be powered down dynamically at run-time can be determined from the application structure (R_app and R_PSC) and behaviour (F_idle_time and F_idle_PGRs). To simplify the equations, no parameters account for the power dissipation of the PSC in the case of the dynamic gating architecture; the power dissipation of the PSC is highly dependent on its structure and the structure of the user's circuit. The equations above also do not model the required wakeup time and energy for the idle blocks during their state transition; this is highly dependent on the structure of the wakeup circuitry as will be discussed in Chapter 4. The model in this section is used to find the upper limit on static energy savings in Subsection 3.6.3.

Figure 3.9: Experimentation procedure for sizing sleep transistors.

3.6 Results

3.6.1 Experimental Setup

We used HSPICE simulations to obtain the leakage power of the PGRs.
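The per-PGR leakage values obtained this way are the P_leak inputs of the model in Section 3.5. As a rough illustration only, Equations 3.1 to 3.5 can be transcribed into Python as below. The class name, the function names, and every number in the usage lines are hypothetical placeholders chosen for the sketch; they are not measurements from this work.

```python
import math
from dataclasses import dataclass


@dataclass
class AppModel:
    """Application and device parameters from Tables 3.2 and 3.3."""
    r_device: int       # PGRs in the FPGA device
    r_psc: int          # PGRs occupied by the power state controller
    r_app: int          # PGRs occupied by the application
    f_idle_pgrs: float  # fraction of r_app that goes through idle periods
    f_idle_time: float  # fraction of execution time those PGRs are idle

    @property
    def r_sg(self) -> int:
        # Equation 3.1: regions unused by the application
        return self.r_device - (self.r_psc + self.r_app)

    @property
    def r_dg(self) -> int:
        # Equation 3.2: regions that can be turned off at run-time
        return math.floor(self.f_idle_pgrs * self.r_app)


def leakage_energy(m: AppModel, p_unused: float, p_idle: float, t: float) -> float:
    """Common form shared by Equations 3.3, 3.4 and 3.5."""
    return (m.r_sg * p_unused + m.r_dg * p_idle * m.f_idle_time) * t


def e_ungated(m, p_leak_ug, t_app):
    return leakage_energy(m, p_leak_ug, p_leak_ug, t_app)        # Eq. 3.3


def e_static(m, p_sg_off, p_sg_on, t_app_gated):
    return leakage_energy(m, p_sg_off, p_sg_on, t_app_gated)     # Eq. 3.4


def e_dynamic(m, p_dg_off, t_app_gated):
    return leakage_energy(m, p_dg_off, p_dg_off, t_app_gated)    # Eq. 3.5


# Placeholder inputs (arbitrary units), for illustration only.
m = AppModel(r_device=100, r_psc=2, r_app=60, f_idle_pgrs=0.5, f_idle_time=0.8)
print(m.r_sg, m.r_dg)
print(e_ungated(m, 1.0, 1.0), e_static(m, 0.08, 1.02, 1.1), e_dynamic(m, 0.1, 1.1))
```

Because the three energy expressions differ only in which leakage value is applied to unused and to idle regions, a single helper covers all of them; like the equations, the sketch ignores the controller's own power and the wakeup cost.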
We used the number of minimum-width transistors as in [46] to estimate the area. We used the 45 nm HP technology from the predictive technology models website [106], with supply voltage VDD = 1.0 V and temperature T = 85 °C to measure the worst case power and timing.

For the power-gated architectures, the threshold voltage of sleep transistors has been increased by 100 mV by changing the Vth0 parameter in the technology files. Sleep transistors have been iteratively sized to constrain the performance degradation to 10% compared to an architecture that does not support power gating. We assume 20% activity in doing this. Figure 3.9 shows the experimentation procedure used to size the sleep transistors and collect the results.

We assume that SRAM cells are built using six minimum-sized transistors. All multiplexers used in our architecture are based on pass transistors; each multiplexer is followed by a level restorer [107] and a buffer.

The SPICE netlists for LCs have been generated as follows. The size of the last inverter in a buffer is found by dividing the number of equivalent min-sized load inverters by four, and internal stages are sized by a stage ratio of four. Roughly, this sizing results in minimum delay. Lookup tables are built using transmission gates as in [108], with inverters inserted after the second and last stages to reduce the delay of series connected transmission gates.

The SPICE netlists for switch blocks (SBs) have been generated as follows. We assume the routing wire segments have the length of four tiles (L = 4). We used a unidirectional, single-driver routing architecture [109]. The output buffers of SBs are built using multiple stages of inverters. The stages are sized using a stage ratio of four. The capacitance of wire segments was obtained using the model in [110]. We assume a switch block flexibility Fs = 3, input pin connection box flexibility Fc,in = 0.2, and output pin connection box flexibility Fc,out = 0.1.
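The stage-ratio-of-four buffer sizing used for both the LC and SB netlists above can be sketched as follows. This is a minimal illustration, not the netlist generator used in this work: the function name `buffer_stage_sizes` is hypothetical, and the stopping rule (keep adding stages until a stage would be no larger than a minimum-sized inverter, then use a minimum-sized first stage) is an assumption, since the text only fixes the last-stage size and the stage ratio.

```python
from typing import List


def buffer_stage_sizes(load_min_inverters: float, stage_ratio: float = 4.0) -> List[float]:
    """Inverter sizes (in multiples of a minimum-sized inverter), first stage first.

    The last stage is load / stage_ratio, and each earlier stage is smaller
    by the same ratio. Assumption: the chain starts with a minimum-sized stage.
    """
    sizes = []
    size = load_min_inverters / stage_ratio  # last inverter in the buffer
    while size > 1.0:
        sizes.append(size)
        size /= stage_ratio                  # next stage towards the input
    sizes.append(1.0)                        # assumed minimum-sized first stage
    return sizes[::-1]


# Example: a load equivalent to 64 minimum-sized inverters.
print(buffer_stage_sizes(64))
```

A load of 64 minimum-sized inverters gives a three-stage chain, each stage four times larger than the one before it.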
Throughout the dissertation, we use an architecture with 4-LUTs. The outputs of the LCs connect directly to the switch blocks through isolation buffers without the need for output pin connection blocks; this is similar to the architecture assumptions made in VPR 5.0 [105].

3.6.2 Architecture Parameters Sweep

In this subsection, we evaluate the power gating efficiency for the different FPGA and power gating architecture parameters. The following list defines three architectures that are evaluated in our experiments:

• Ungated: this is the baseline FPGA architecture that does not support power gating.
• Static-gating: this is an architecture that supports statically-controlled power gating, such as the one presented in [26]. The power state for this architecture can only be set at configuration time.
• Dynamic-gating: this is the dynamically-controlled power gating architecture that we proposed in Sections 3.2 and 3.3.

Power Gated Tiles

We first study a basic architecture that has only one tile (without the switch block); we vary two architecture parameters, the cluster size (N) and the width of the routing channels (W). When varying N, we also vary the number of input pins (I) of the logic cluster. When varying N, we set W = 90, and when varying W we set N = 6.

Figure 3.10: Results for sweeping LC's cluster size (N). Switch blocks are not included in the results. [Panels, each plotted against cluster size (N): (a) Area Overhead (%), (b) Leakage Power (µW), (c) Leakage Power Reduction (%).]

Figures 3.10 and 3.11 show the effect of the cluster size (N) and routing channel width (W) on power gating.
The results shown in the figures are for a tile that supports power gating, not including the switch block, compared to a tile that does not support power gating (ungated).

The area overhead decreases as the cluster size and the channel width increase (Figures 3.10(a) and 3.11(a)); this is because a larger number of circuit components are powered through a single sleep transistor. However, the correlation between the area overhead and the channel width is not high. The area overhead for the static-gated architecture is lower than that for the proposed dynamic-gated architecture. This is because of the additional circuit components that are required to support dynamic power state control.

Figure 3.11: Results for sweeping routing channel width (W). Switch blocks are not included in the results. [Panels, each plotted against channel width (W): (a) Area Overhead (%), (b) Leakage Power (µW), (c) Leakage Power Reduction (%).]

The leakage power reduction increases as the cluster size and the channel width increase (Figures 3.10(c) and 3.11(c)). The results show that the leakage power reduction in the off state (compared to an ungated architecture) can be up to about 91% for a cluster size of 10 (channel width of 160). Figures 3.10(b) and 3.11(b) show that the proposed architecture has slightly larger leakage power in the off state than the static-gated architecture. This is because the dynamic-gated architecture requires more circuit components to support controlling its power state dynamically.

Figure 3.12: Results for sweeping granularity of power gating regions (PGRs). Switch blocks are not included in the results. [Panels, each plotted against region size (RxR tiles, 1x1 to 7x7): (a) Area Overhead (%), (b) Leakage Power (µW), (c) Leakage Power Reduction (%).]

Power Gated Regions

In this subsection, we study the granularity of the proposed architecture. Figure 3.12 shows the different results as we increase the region size.

As the region size increases, the area overhead decreases. The area overhead includes that of the sleep transistors and the circuit components required to support configuring the different power states of a PGR. The area overhead decreases as the region size increases because more circuit components are powered through a single sleep transistor, and the circuit components required to support the different power states of a region are shared among a larger number of circuit components. The area overhead is as small as 1% for a PGR of 4x4 tiles.

The off state leakage power of the power-gated architectures is much lower than that for the ungated architecture, leading to a leakage power reduction of more than 90% (about 95% for a PGR of 4x4 tiles). Increasing the region size beyond 4x4 tiles does not significantly increase the leakage power savings. As can be seen in Figure 3.12(c), the leakage power reduction in a static-gated architecture is slightly larger than that in the proposed dynamic-gated architecture. This is because of the additional circuit components that are required in the proposed architecture to support changing its power state at run-time.
Figure 3.13: Results for SBs power gating granularity by sweeping routing channel width (W). Segment length (L) = 2. [Panels, each plotted against routing channel width (W) for PG granularities Coarse Grain, PPS = 1 to 4, and Fine Grain: (a) Area Overhead (%), (b) ON Leakage Power (µW), including the ungated architecture, (c) OFF Leakage Power (µW), (d) Leakage Power Reduction (%).]

Power Gated Switch Blocks

In this subsection, we vary the architecture parameters that describe the switch blocks (SBs). Results are shown for SB architectures that have different segment lengths, L = 2 in Figure 3.13 and L = 4 in Figure 3.14, compared to an architecture that does not support power gating. In addition to varying the routing channel width (W), we also vary the power gating granularity of SBs by varying the number of partitions per side (PPS) (larger PPS means finer granularity).

Figure 3.14: Results for SBs power gating granularity by sweeping routing channel width (W). Segment length (L) = 4. [Panels, each plotted against routing channel width (W) for PG granularities Coarse Grain, PPS = 1 to 4, and Fine Grain: (a) Area Overhead (%), (b) ON Leakage Power (µW), including the ungated architecture, (c) OFF Leakage Power (µW), (d) Leakage Power Reduction (%).]
In fine-grained SB power gating, each output buffer in an SB has a power gating circuit, whereas in coarse-grained power gating, one power gating circuit is used for all circuit components in an SB.

In general, Figures 3.13 and 3.14 show that the area overhead and leakage power when L = 4 are lower than those for L = 2. This is because an FPGA routing architecture that has shorter segments contains more circuit components in SBs [109]; as the number of power-gated circuit components increases, the area overhead and the leakage power increase.

The area overhead (Figures 3.13(a) and 3.14(a)) of the power gating circuitry decreases as the channel width increases. This is because as we increase W the sleep transistor size increases at a lower rate than the number of circuit components that are powered through it.

The power gating granularity (PPS) also affects the area overhead. Fine-grained power gating results in prohibitive area overhead (more than 60% for L = 4 and about 100% for L = 2). For large values of W, the area overhead for granularities down to PPS = 4 ranges between 10% and 16%. This area overhead, however, is only for SBs. The overall area overhead for the power gating architecture is lower. Recall from the previous subsection that the area overhead (without SBs) for a PGR of size 4x4 tiles is about 1%. The overall area overhead for the same PGR with SBs ranges between 4.7% and 10.3% for PPS from 0 to 4.

Figures 3.13(b) and 3.14(b) show the ON leakage power for the gated architecture (for different PPS) and the ungated architecture. The ON leakage increases as PPS increases; this is because finer granularity power gating requires more circuit components and larger sleep transistors. The ON leakage overhead for W = 100 and PPS = 0 to 4, for example, is about 6-10% for L = 2 and 3.5-7.7% for L = 4.
As we will see later, finer granularity SB power gating increases the number of always-off resources, resulting in lower total ON leakage power for an application circuit.

Figures 3.13(c) and 3.14(c) show the leakage power for the power-gated SBs in the OFF state, i.e., static power when SBs are powered down, and Figures 3.13(d) and 3.14(d) show the leakage power reduction compared to SBs with no power gating. The OFF leakage power for SBs with PPS = 0 is the smallest because all of the SB circuit components are included in the power-gated circuit. SBs with larger PPS incur larger leakage power overhead in the OFF state because of the additional power gating circuit components, and because all the buffers for incoming wire tracks are designed to be powered during the OFF state (see Subsection 3.3.2). The leakage power reduction for the proposed architecture is more than 95% for the coarse-grained power-gated SBs, and could reach more than 90% for PPS = 1 to 4. For fine-grained power gating, the leakage power reduction is about 70%.

Overall Area Overhead

Table 3.4 shows the total area overhead for different architecture granularities of the power gating architecture. The area overhead ranges between 3.9% and 34.8%. The area overhead results in longer routing wire segments. This leads to larger interconnect capacitance, and potentially larger dynamic power. This overhead is larger for finer granularity power gating. For example, fine-grained power gating for SBs would result in about 60% area overhead (assuming L = 4) in SBs. The corresponding total area overhead for a PGR of size 2x2 tiles is about 34% (Table 3.4), which roughly translates to a 16% increase in each dimension of a tile (see footnote 1). This represents a loose upper bound on the increase in interconnect capacitance [111]. The increase in dynamic power would be lower than this bound because dynamic power is highly dependent on the activity of the application (i.e., no dynamic power during idle times).
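The conversion from area overhead to per-dimension growth quoted above can be checked with a short calculation. This is a minimal sketch that assumes a square tile, as the text does; the function name `dimension_increase_pct` is illustrative and does not come from the dissertation.

```python
import math


def dimension_increase_pct(area_overhead_pct: float) -> float:
    """Percent growth of each side of a square tile whose area grows by area_overhead_pct."""
    # A square of area A has side sqrt(A), so scaling the area by (1 + a)
    # scales each side by sqrt(1 + a).
    return (math.sqrt(1.0 + area_overhead_pct / 100.0) - 1.0) * 100.0


# 34% is the total area overhead quoted in the text for fine-grained SB gating.
growth = dimension_increase_pct(34.0)
print(f"{growth:.1f}% increase per tile dimension")
```

Because wire length scales with the tile dimension, the result is used only as a loose upper bound on the interconnect capacitance increase, not as a measured value.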
As we will see in Section 5.5, when application circuits with power-gated modules are mapped to the proposed architecture, the best granularity in terms of leakage power in the OFF state is obtained when the PGR size is larger than 1x1 and PPS = 3 or 4. The area overhead for these granularities is between 9.6% and 12.2%. This translates to an increase in interconnect capacitance of 4.7% to 5.9%, which is again a loose upper bound on the capacitance increase. The impact of the increase in interconnect capacitance has not been evaluated in this dissertation, and is left for future research.

Footnote 1: We assume that a tile has a square layout [111]. Increasing the tile area by 34% would result in each dimension of the tile being √1.34 of its original length.

Table 3.4: Total area overhead (%) for the power gating architecture for different architecture granularities compared to an architecture with no power gating support (W = 80 and L = 4).

SB / PGR          1x1    2x2    3x3    4x4
Coarse Grained    5.4    4.3    4.1    3.9
PPS = 1           6.9    5.8    5.7    5.5
PPS = 2           8.3    7.2    7.1    6.9
PPS = 3          10.7    9.6    9.4    9.2
PPS = 4          13.7   12.6   12.4   12.2
Fine Grained     34.8   33.7   33.5   33.3

3.6.3 Application Model Results

We apply the model in Section 3.5 to the clma circuit (775 tiles) from the Microelectronics Center of North Carolina (MCNC) benchmark suite [112]. We assume a device size such that 70% of its logic blocks are utilized. We assume a PSC size of 5% of the number of logic blocks in the application. Therefore, about 25% of the device's resources are not used. Furthermore, we use the following parameters for the architecture generation and HSPICE simulations: R = 3, PPS = 0, W = 90, N = 6, and I = 16; the other parameters are similar to the ones described in Subsection 3.6.1.

Figure 3.15 shows the static energy reduction for both the statically- and the dynamically-controlled power gating architectures compared to the ungated architecture over different fractions of idle time (F_idle_time) and idle PGRs (F_idle_PGRs).
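The comparison that Figure 3.15 makes can be approximated with a toy calculation. The sketch below is not the model of Section 3.5; it assumes that every PGR leaks the same power when on, that wasted static energy is the leakage of unused PGRs over the whole run plus the leakage of idle application PGRs during their idle time, and that gating a PGR removes 97% of its leakage (a value borrowed from the savings the text reports for the dynamically-controlled architecture). All names and default values are illustrative.

```python
def static_energy_savings(f_idle_pgrs, f_idle_time,
                          unused_frac=0.25, app_frac=0.70,
                          gated_reduction=0.97):
    """Return (static_gated, dynamic_gated) savings as fractions of wasted static energy.

    Assumed toy model, not the dissertation's: energy is expressed in units of
    (leakage power of one PGR) x (total execution time) x (number of PGRs).
    """
    wasted_unused = unused_frac                           # unused PGRs leak all the time
    wasted_idle = app_frac * f_idle_pgrs * f_idle_time    # idle PGRs leak while idle
    wasted_total = wasted_unused + wasted_idle

    # Static (configuration-time) gating can only turn off the unused PGRs.
    static_gated = gated_reduction * wasted_unused / wasted_total
    # Dynamic (run-time) gating turns off unused PGRs and idle PGRs.
    dynamic_gated = gated_reduction * (wasted_unused + wasted_idle) / wasted_total
    return static_gated, dynamic_gated


no_idle = static_energy_savings(0.0, 0.0)
half_idle = static_energy_savings(0.5, 0.8)
print(no_idle, half_idle)
```

Under these assumptions the static-gated saving equals the dynamic-gated saving only when no application PGR is ever idle, and falls as either fraction grows, while the dynamic-gated saving stays at the assumed per-PGR reduction.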
Static energy is consumed by unused or idle resources in an FPGA. We can see that the static energy savings using the SCPG architecture are large only when there are no idle PGRs in the application. In this case, the only PGRs that waste static energy are the 25% of the device's PGRs that are not used by the application circuit (clma in this example), and these are powered down at configuration time. When F_idle_time increases, F_idle_PGRs of the application's regions are idle for F_idle_time of the execution time of the application. The SCPG architecture cannot power down these PGRs. However, using the DCPG architecture, the unused resources are powered down at configuration time and the idle PGRs are powered down during their idle periods. Therefore, the static energy savings using the DCPG architecture are always 97%.

Figure 3.15: Leakage energy reduction for clma circuit as a function of F_idle_PGRs and F_idle_time. Axes: F_idle_PGRs (%), F_idle_time (%), and static energy saving (%); surfaces are shown for the DCPG and SCPG architectures.

The graph in Figure 3.15 shows that the static energy savings by the SCPG architecture are only achieved for the unused resources. The proposed DCPG architecture, on the other hand, achieves large static energy savings because it enables powering down idle resources and unused resources.

3.7 Summary

This chapter presented an FPGA power gating architecture that supports both configuration-time and run-time control of power modes. The architecture enables realizing power domains of different sizes, numbers, and locations. The basic unit of power gating is a power gating region (PGR). A PGR includes the power-gated logic and routing resources. The architecture enables routing power control signals on the existing FPGA routing resources without requiring dedicated architectural resources for this purpose.
The architecture supports handling the intricacies of power gating designs by providing automatic output isolation during sleep mode, retaining data by keeping the storage elements powered at all times, and enabling the routing of power status feedback signals from a power-gated module to other parts of the system.

We evaluated the presented architecture by studying its area and power for different basic FPGA architecture parameters and power gating granularities. Our studies were performed by generating HSPICE models for many power gating regions (PGRs). Compared to an architecture that has no power gating, our results show that the total area overhead of the proposed architecture ranges from 4% for coarse-grained SBs power gating and a PGR size of 4x4 tiles, to 35% for fine-grained SBs power gating and a PGR size of 1x1 tiles. The power savings of the proposed architecture range between 90% and 95% for LCs and RCs in PGRs of sizes between 1x1 and 4x4 tiles, and between 71% and 98% for the SBs with power gating granularities ranging between fine-grained and coarse-grained (W = 100 and L = 4).

Chapter 4
A Configurable Architecture to Limit Inrush Current in Power-Gated Reconfigurable Devices

One of the important challenges in the power gating technique is dealing with inrush current or wakeup current. Inrush current is the large current drawn from the power supplies during the turn-on phase to charge the internal floating capacitance of a power-gated module. This current can be large and can cause a temporary voltage droop across the power rails. This voltage droop may cause functional errors due to reduced noise margins and degraded performance [7]. Figure 4.1 shows an illustration of the inrush current and the associated voltage droop.
As the amount of inrush current increases, the droop on the power grid increases, leading to a violation of power integrity constraints, which may result in functional errors.

Figure 4.1: Hypothetical inrush current and the associated voltage droop. The plot shows current (mA) and voltage against time, marking the Turn_ON event, the wakeup phase, the inrush current, the voltage droop, and the normal operation current.

In this chapter, we present a configurable architecture to limit the inrush current in reconfigurable architectures that support power gating, such as the power-gated FPGAs of Chapter 3 and CGRAs. The structure and behaviour of applications implemented using such architectures are unknown before fabrication takes place, thus requiring the inrush current handling circuitry to be programmable to meet the different applications' requirements.

In application-specific integrated circuits (ASICs) that employ power gating, the amount of inrush current is known at design time before fabrication. Thus, a fixed solution can be designed along with the other components in the chip. Reconfigurable architectures, on the other hand, are flexible in order to implement many applications, and may even implement multiple kernels during an application's execution through time-multiplexing. This means that the structures of power-gated modules that are mapped to a power-gated architecture are not known at fabrication time. Therefore, it is not feasible to implement fixed inrush current handling circuitry in such architectures. Instead, the circuitry must be flexible enough to support a variety of scenarios.

The proposed architecture in this chapter takes into consideration the inherent flexibility of reconfigurable architectures. Our proposed architecture has two levels. The lower level, which we refer to as the intra-region level, consists of circuitry to support turning on a single architectural power gating region (PGR) without causing a large voltage droop.
This level is not configurable since the inrush current requirements are known when the architecture for a region is designed. The upper level, which we refer to as the inter-region level, consists of circuitry to sequence turning on regions in the same power-gated module to ensure voltage droop constraints are not violated. This level is configurable using static RAM bits because it is not known at fabrication time how big power-gated modules will be or where they will be on the chip. This combination of static and configurable wakeup circuitry is unique to reconfigurable architectures. Much of this chapter was published in [2, 4].

4.1 Motivational Example: Inrush Current Effect

In this section, we study the inrush current effect in an example power gating architecture, and we show that it is large enough to cause a voltage droop, motivating our design for the reconfigurable inrush current handling circuitry in Section 4.2. The example architecture used in this section is similar to the dynamically-controlled power gating (DCPG) FPGA that has been described in Chapter 3.

Our estimation methodology is as follows. We first model a power grid based on estimates of the current drawn from the power supply during normal operation of the baseline DCPG FPGA. We then model the impact of the additional current drawn when a region is turned on, and show that, with the same size power grid, an unacceptable voltage droop occurs. Finally, we show that if we were to increase the size of the power grid to supply the required additional current, or insert decoupling capacitors, the resulting overhead is significant, motivating our alternative approach.

4.1.1 Baseline Power Grid Model

Our model of the power grid is similar to that in [113, 114]. We assume a mesh-like power grid structure [115], in which each metal layer has alternating VDD and GND lines. The number of lines in each layer is determined based on the width and spacing between the lines.
Vias connect VDD (GND) lines of adjacent layers at their intersection points. This builds a large VDD (GND) net that can provide power for the transistors in the device. Clean VDD (GND) sources (power supplies) are evenly distributed across the top metal layer of the power grid. Figure 4.2 shows the FPGA power grid model assumed in this section.

Figure 4.2: An FPGA with mesh power grid structure. FPGA circuit components are modeled as current sources. Labels in the figure: FPGA with tiles, clean Vdd, PG metal line, current source.

In order to determine the width of the VDD and GND lines, and the number of clean power supplies at the top metal layer in a typical power grid, we must first estimate the expected current requirements of the device during normal operation of the DCPG (at times other than when regions of the chip are turning on or off). To do this, we first mapped the twenty largest Microelectronics Center of North Carolina (MCNC) benchmark circuits [112], which are commonly used in FPGA architecture studies, to an FPGA that has parameters similar to those described in Section 4.5. We then used an enhanced version of the Poon power model [86] (modified to better model leakage power based on curve fitting of HSPICE simulation results) to estimate the power dissipated by each design per FPGA tile, averaged over all tiles in the design. We then used the maximum such power per tile across all designs. We found that the amount of power per tile is 636.6 µW, assuming 45 nm technology.

We then created a parameterized HSPICE model of the power grid. For the static analysis, each metal segment in the power grid is modeled as a resistance. For the dynamic analysis, each metal segment is modeled as a series-connected resistance and inductance, and a capacitance to ground. Vias are modeled as resistances. Figure 4.3 shows how each metal segment in the power grid is modeled. FPGA tiles are modeled as independent current sources.
These current sources correspond to the current drawn by each FPGA tile during normal operation (which depends on the estimated worst case power per tile).

Figure 4.3: Modeling of metal segments in the power grid. Panels: (a) portion of a power grid, (b) static model, and (c) dynamic model.

We assume the connections between the power grid and each tile are distributed in an array of size n × n, where n represents the granularity at which we model the power grid at the lowest metal layers. A large value of n would lead to more accurate results at the cost of longer simulation times; in our experiments, we found that n = 2 gives adequate accuracy (1.6% error compared to n = 8) with reasonable simulation time. This is shown on the right side of Figure 4.2, where there are 2 × 2 current sources per tile that represent VDD connections. The amount of current drawn by each current source is an FPGA tile's power divided by $V_{DD} \times n^2$.

We then performed HSPICE simulations, and iteratively adjusted the width of each metal line in the power grid and the number of clean power sources until the IR drop across the power grid was less than 5% of VDD (in our case, this corresponds to 50 mV). This represents the size of the power grid that can supply current to the DCPG architecture during normal operation. Note that this approximation is both pessimistic and optimistic. It is pessimistic because we have assumed the largest current seen by all of our benchmark circuits, and assumed this current is drawn from each tile. It is unlikely that during normal operation all tiles would draw this maximum current all the time. It is optimistic because in a real FPGA the power grid would likely be over-designed (i.e., provisioned to supply more current than the benchmarks might predict).
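The per-source current used in the grid model follows directly from the numbers given above. The sketch below is a worked instance of that division; it assumes VDD = 1.0 V, inferred from the statement that 5% of VDD corresponds to 50 mV, and the function name is illustrative.

```python
def current_per_source_uA(tile_power_uW: float, vdd_V: float, n: int) -> float:
    """Current drawn by each of the n x n current sources that model one tile.

    Power in microwatts divided by volts gives microamps.
    """
    return tile_power_uW / (vdd_V * n ** 2)


# 636.6 uW per tile and n = 2 are taken from the text; VDD = 1.0 V is inferred.
i_source = current_per_source_uA(tile_power_uW=636.6, vdd_V=1.0, n=2)
print(f"{i_source:.2f} uA per current source")
```

With n = 2 each tile is represented by four sources, so each carries a quarter of the tile's worst-case current.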
We consider the impact of over-provisioning the power network in Subsection 4.1.4.

4.1.2 Inrush Current of DCPG FPGA Architecture

In order to estimate the inrush current for the DCPG FPGA architecture, we created detailed transistor-level models of power gating regions (PGRs) with different sizes using the FPGA architecture parameters described in Section 4.5. We then used HSPICE to determine the amount of current drawn for each region when it is turned on. Figure 4.4 shows the inrush current for different region sizes. The per tile inrush current in a PGR can be obtained by dividing the total inrush current of the PGR by the number of tiles in the PGR.

Figure 4.4: Inrush current for different architecture region sizes: (a) 1x1, (b) 2x2, (c) 3x3, and (d) 4x4. Each panel plots current (mA) against time for SB, RC, Internal part, and Total. SB is the total inrush current for all switch blocks in a region, RC is the total inrush current for bordering routing channels, and Internal part is the inrush current for the internal routing channels and logic clusters of a region.

Recall from Chapter 3 (Figure 3.5) that each region has three main components: the internal routing channels and logic clusters, the switch blocks, and the bordering routing channels. The inrush currents for these components are shown in Figure 4.4. As the figures show, the inrush currents for these components appear in different time intervals. This is attributed to the sizing of the sleep transistors and their drivers. The sleep transistors of individual bordering RCs are small, making their turn on faster than that of the other components. The sleep transistors of individual SBs are larger than those of bordering RCs, but they are smaller than the sleep transistor of the internal part of the region.
However, the total size of the SBs' sleep transistors is larger than that for the internal part of the region. SBs have the largest inrush current since the turn on for individual SBs in a region happens simultaneously. The internal part of the region has a large sleep transistor. Thus, it takes a long time to turn it on compared to the other components in a region.

As can be observed in Figure 4.4, the turn on for the SBs and the internal part of a region of size 1x1 tiles happens within the same time interval, resulting in a large total inrush current. This phenomenon happens because the sizes of the sleep transistors of the internal part of the region and the switch block are roughly similar. In larger regions, the size of the sleep transistor for an SB is smaller than the size of the sleep transistor for the internal part of the region. This makes turning on individual SBs faster than turning on the internal part of the region. In regions larger than 1x1 tiles, the maximum inrush current is mainly due to turning on SBs.

4.1.3 Voltage Droop Estimation

The power grid described in Subsection 4.1.1 was sized to provide adequate current during normal operation. When a region is turned on, there will be additional inrush current, which will cause voltage droop on the power rails.

To estimate the magnitude of the voltage droop, we used the estimated inrush currents as described in Subsection 4.1.2. We then used the original power grid model from Subsection 4.1.1, replacing the normal operating current with this new (larger) current. Using HSPICE, we then measured the maximum voltage droop that occurs on the voltage rails.

Figure 4.5 shows the voltage droop as a function of the FPGA power gating region size (architecture granularity of DCPG FPGA).
As can be seen in the figure, the droop is in the range of 83-335 mV when the inrush current for a PGR's bordering RCs and internal part is considered, and it exceeds 300 mV when the total inrush current of a PGR is considered. In both cases, the voltage droop exceeds our target of 50 mV (5% of VDD). This could lead to incorrect operation of the FPGA.

Figure 4.5: IR drop (mV) as region size increases (1x1 to 4x4 tiles), shown against the IR drop constraint of 50 mV. (a) The inrush current for bordering RCs and internal PGR's part is considered. (b) The total inrush current for a PGR is considered (including SBs).

We can see that the voltage droop is smaller for larger region sizes in Figure 4.5(a). Although larger regions have larger total inrush current, the amount per tile is smaller than that for smaller regions. This is attributed to the power gating circuit implementation. In our implementation, we chose to use a fixed-size buffer to drive a region's sleep transistor, regardless of the sleep transistor's size. This means that for larger sleep transistors, the driver will take a longer time to turn it on, lowering the maximum current of the sleep transistor during the wakeup phase. In Figure 4.5(b), the amount of voltage droop is largest for a region size of 1x1 tiles because it has the largest per tile total inrush current. The voltage droop for larger regions is about 330 mV because they have the same amount of inrush current per tile, which is dominated by that of SBs as explained in Subsection 4.1.2.

4.1.4 Over-Provisioning the Power Grid

One naïve solution to this problem is to over-provision the power grid.
To investigate this, we again iteratively adjusted the metal width and the number of clean VDD supplies until HSPICE simulation results showed that the voltage droop due to the inrush current is below our 50 mV target. Using an area model, we were able to estimate the area impact of doing this.

Figure 4.6 shows the number of the clean VDD sources required by the modified power grid normalized to the number of clean VDD sources in the original power grid, as a function of the power gating region size. This can be 2-7X the number of clean VDD sources in the original power grid. Clearly, the overhead of over-provisioning the power grid to handle inrush current is significant.

Figure 4.6: Cost of power grid sizing to handle region's inrush current. The plot shows clean Vdd sources normalized to the base power grid against power gating region size (1x1 to 4x4 tiles).

4.1.5 Inserting Decoupling Capacitors

We further investigated another solution, which is to insert enough on-chip decoupling capacitors (decaps) in order to provide a temporary charge reservoir that can be used to reduce the amount of inrush current. We have investigated this solution by distributing 0.1 nF MOS decaps [116, 117] uniformly in the power grid. Each of these capacitors has an area of about 1.3X that of a tile of the DCPG FPGA architecture used in this section. For the inserted decaps, we removed the equivalent number of tiles from the power grid model in order to approximately represent a fixed-size chip.

Figure 4.7 shows, for different power gating region sizes, i.e. different granularities of the DCPG FPGA architecture, the proportion of the chip's area that is occupied by the inserted decaps. The area overhead is larger than 23% for a region size of 4 × 4 tiles, and it is larger for smaller region sizes. As we will see in Section 4.6, our proposed solution has an overhead that is smaller than that required by decaps.
Moreover, inserting decaps results in large leakage current (not evaluated in our experiments), which negatively affects the energy savings obtained by using a power-gated architecture. Although it is possible to reduce decaps' leakage by increasing oxide thickness (tox), this reduces the capacitance, thus leading to more area overhead to obtain the same capacitance. Note that we do not consider other decoupling capacitor types, such as metal-insulator-metal (MIM) or metal decoupling capacitors, that may have lower area overhead but may occupy resources in the different metal layers [118].

Figure 4.7: Proportion of chip area required as decaps to handle inrush current. The plot shows area (%) against power gating region size (1x1 to 4x4 tiles).

4.2 Proposed Architecture

In this section, we describe our architecture for handling inrush current in reconfigurable architectures that support power gating. The architecture consists of strategically placed configurable and non-configurable delay elements and sleep transistors that can be used to ensure that inrush current does not violate the constraints set by the power grid. The proposed architecture has two levels: a fixed intra-region level and a configurable inter-region level. Each is described below.

4.2.1 Intra-Region Level

As described in Chapter 3, a power gating region (PGR) consists of one or more tiles, and is the smallest unit of granularity that can be turned on or off. The purpose of the intra-region architecture is to limit the amount of current drawn by a single region to a level that does not cause a large IR drop. By limiting the current drawn by a single region, it becomes possible to turn on multiple regions simultaneously; this tradeoff will be revisited in Subsection 4.2.3.

The top portion of Figure 4.8 shows the intra-region architecture. Instead of using a single sleep transistor (ST) for the whole region, a set of parallel STs is used for the region.
At a wakeup event, the STs are turned on sequentially in order to limit the inrush current of the region to a set value (I_region_intra). Clearly, I_region_intra must be small enough that voltage droop does not occur on the power rail when a single region is turned on; as will be described in Subsection 4.2.3, there are tradeoffs that may motivate a significantly smaller value of I_region_intra. Note that in subsequent sections in this chapter, we will refer to the value used to design the intra-region architecture as I_tile_intra, where $I_{tile\_intra} = I_{region\_intra} / \mathit{num\_tiles\_in\_region}$.

The value of I_region_intra is fixed, and is determined when the chip is designed (before fabrication). As a result, the delay elements and the sleep transistors (STs) do not need to be configurable. For a given I_region_intra, the sizes of the delay elements and the STs can be found using the sizing algorithm proposed in [103]. This algorithm takes the current constraint, the supported delay values, and a representation of the circuitry that makes up the region, and produces a list that contains the sizes of STs and the delays that determine their turn on sequence.

Figure 4.8: Power gating region with the inrush current limiting circuitry. Labels in the figure: VDD, Virtual VDD, PG_CTRL, Pd1 to Pdn-1, ST1 to STn, From routing, PDE, SRAM, Intra-region, Inter-region.

For the evaluation in Section 4.6, the smallest delay element in our library has a delay of 100 ps. Figure 4.9(a) shows an example delay element that consists of a series of transmission gates. Larger delays can be realized by increasing the chain length or by increasing the length of the gates of individual transistors.

Figure 4.9: (a) Fixed and (b) programmable delay elements used in the proposed architecture. In (b), a multiplexer controlled by SRAM bits selects among taps of a chain of ∆T delay blocks between in and out.

4.2.2 Inter-Region Level

The circuitry described in the previous subsection ensures that a single region can be turned on while limiting the inrush current to I_region_intra.
However, in practice, it is unlikely that the part of a user circuit that will be turned on simultaneously will be confined to a single architectural region. When multiple regions of a power-gated module are turned on simultaneously, the device's total inrush current will be the sum of the inrush current in all of the regions of the module. In addition to this wakeup current, there will be more current drawn for “normal” activities in the chip for active modules. In order to ensure that the wakeup current does not affect the functionality of the design on chip, the maximum current that is drawn from the external power supply of the chip for turning on power-gated regions must be limited to a specific amount. We denote this amount as I_module_wakeup. The amount of I_module_wakeup is specified at architecture design time, i.e., by the device architects.

The proposed inter-region level architecture provides architecture-level support to facilitate meeting the constraint on the maximum inrush current a power-gated module is allowed to draw from the power supply. For a power-gated module, the pattern by which regions will turn on together is specific to the user circuit, making it impossible to design a fixed inter-region level architecture that is suitable for all application circuits. This is different from the more traditional problem of designing wakeup circuitry for a fixed-function chip (such as an ASIC); our architecture must be flexible enough to work for a wide variety of user circuit scenarios.

Our approach is to provide a programmable delay element (PDE) with each PGR on the chip. As shown in Figure 4.8, the PDE is inserted on the path of the power control signal. Figure 4.9(b) shows our implementation of a PDE. Each of the blocks labeled ∆T represents a delay element that delays the signal by the amount of time that is required to wake up a single PGR. The SRAM memory bits are used to configure the PDE to select the required delay to turn on a region.
The values of the SRAM bits are set by the CAD tools at configuration time. Note that the PDEs of multiple regions can be configured to have the same delay in order to turn them on simultaneously.

The proposed architecture allows the CAD tool to configurably delay the turn on of individual regions to limit the number of regions that turn on simultaneously. The maximum number of regions that can be turned on concurrently is dictated by the architecture, and will be denoted R_concur. The value of R_concur depends on the design of the intra-region level of the architecture as well as the value of I_module_wakeup. For example, if the intra-region level is designed with I_region_intra as explained in the previous subsection, then $R_{concur} = \lfloor I_{module\_wakeup} / I_{region\_intra} \rfloor$.

Note that the size of the multiplexer in a PDE dictates the maximum number of regions that can be turned on due to one wakeup event, i.e., using one control signal. The turn on of a large module can be done by turning on a partition of R_concur PGRs in each step. The turn on of each partition can be delayed by appropriately configuring the inter-region level PDEs in the regions of the partition (the PDEs of PGRs in each partition will be configured with the same delay). If the number of partitions in the module exceeds the number of delays supported by the PDE, the power management unit must be designed to provide another delayed signal that can be used to turn on the additional regions. This is discussed further in Subsection 4.3.2.

4.2.3 Architectural Tradeoffs

There is a complex set of tradeoffs between the parameters of the proposed wakeup architecture on one hand, and the area of the architecture and the power and wakeup time of applications on the other.

The value of I_region_intra, selected when designing the intra-region architecture, determines the sizes of the region's sleep transistors and delay elements.
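The partitioned turn-on described in Subsection 4.2.2 can be sketched as a small assignment routine of the kind a CAD tool might run. This is an illustrative sketch, not the dissertation's CAD algorithm: the function name, the region identifiers, and the current values are hypothetical, currents are given as integer microamps, and regions are grouped simply in list order.

```python
def assign_pde_delay_steps(region_ids, i_module_wakeup_uA, i_region_intra_uA):
    """Map each region of a power-gated module to a PDE delay step index.

    Regions that share a step index are configured with the same delay and
    therefore turn on together; at most R_concur regions share a step.
    """
    # R_concur = floor(I_module_wakeup / I_region_intra)
    r_concur = i_module_wakeup_uA // i_region_intra_uA
    if r_concur < 1:
        raise ValueError("a single region already exceeds the module wakeup limit")
    # Step k delays the power control signal by k wakeup times of a single PGR.
    return {region: index // r_concur for index, region in enumerate(region_ids)}


# Hypothetical module of seven PGRs, 10 mA wakeup limit, 3 mA per region.
steps = assign_pde_delay_steps(["r0", "r1", "r2", "r3", "r4", "r5", "r6"],
                               i_module_wakeup_uA=10_000,
                               i_region_intra_uA=3_000)
print(steps)
```

The largest step index plus one is the number of partitions; if it exceeds the number of delays the PDE multiplexer supports, the text's fallback of an additional delayed control signal from the power management unit would be needed.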
Whilst it may seem desirable to make this as large as possible to minimize the area of the intra-region level circuitry, and to minimize the time to turn on a single region, doing so reduces the achievable value of Rconcur, since each region draws more instantaneous current. A lower value of Rconcur means that more “steps” are required to turn on a large power-gated module. In addition, a lower value of Rconcur would imply more fingers in the PDE multiplexer, increasing its area and leakage power. This tradeoff is investigated experimentally in Subsection 4.6.4.

Instead of embedding a PDE in each power gating region to sequence the wakeup of the regions in a functional module, soft logic-based delay elements can be realized on the resources of a reconfigurable architecture (e.g., the lookup tables in an FPGA). The area overhead of this approach depends on many factors, such as the amount of the required delays, whether these delays are purely combinational or not, and the frequency of the clock used to generate these delays. This tradeoff is experimentally investigated in Subsection 4.6.6.

4.3 Design Flows

The architecture shown in Section 4.2 can be integrated in a tile-based reconfigurable architecture that supports power gating, such as an FPGA that supports dynamically-controlled power gating. In this section, we explain the design flow that is used to integrate the proposed architecture in a power-gated FPGA architecture in order to limit the inrush current during wakeup events to a safe level (Subsection 4.3.1). This is relevant to FPGA vendors who design FPGA devices. We also show the design procedures that must be followed by the end users of a power-gated FPGA that uses our architecture (Subsection 4.3.2).

4.3.1 Device Architecture Designers

In this subsection, we present a design flow for FPGA vendors that can be used to design an FPGA that employs our proposed inrush current limiting architecture.
We explain how the FPGA architects can decide on the values for the different architecture parameters.

Intra-Region Level

An architect first decides on a value for Iregion_intra (or Itile_intra, where Iregion_intra = Itile_intra × num_tiles_in_region) that will be used to design the intra-region level of the architecture. This value can be the same as what the chip's power grid could support for each tile (Itile_pg) without violating IR drop constraints, or a larger value might be selected in order to reduce the area overhead of the intra-region level of the architecture; selecting a larger value will reduce the number of regions that can be turned on simultaneously. This tradeoff is investigated later in this chapter. Using Itile_intra, the sizes of the sleep transistors of a region and the fixed delays between them are found using the algorithm described in [103].

Inter-Region Level

The main parameter that needs to be selected to design the inter-region level of the architecture is the PDE size. Given a specific Imodule_wakeup constraint on the device's inrush current to turn on power-gated modules, and a value selected for Iregion_intra to design the intra-region architecture as stated in the previous subsection, the maximum number of regions that can be turned on concurrently (Rconcur) is calculated as $R_{concur} = \lfloor I_{module\_wakeup} / I_{region\_intra} \rfloor$.

The PDE size indicates the maximum number of partitions of Rconcur regions that can be turned on using only one power control signal. Assuming Rmodule is the largest module size whose power state we want to be able to control using only one control signal, the PDE size can be calculated as $PDE_{size} = \lceil R_{module} / R_{concur} \rceil$. Rmodule can be selected based on the nature of the expected applications that will utilize the power gating capability of the chip and the experience of the architects.
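The two sizing formulas above can be checked with a short calculation. The following Python sketch is illustrative only: the function names are hypothetical, currents are passed as integer microamps so that the floor is exact, and the 0.5 A budget with Itile_intra = 600 µA and 4×4-tile regions is just a sample operating point. The Rconcur values 484 and 144 in the usage lines are the ones reported in Table 4.2 for single-tile PGRs with Rmodule = 1000 tiles.

```python
def max_concurrent_regions(i_module_wakeup_ua: int,
                           i_tile_intra_ua: int,
                           tiles_per_region: int) -> int:
    """R_concur = floor(I_module_wakeup / I_region_intra).

    I_region_intra is I_tile_intra times the number of tiles in a region.
    All currents are integer microamps (an assumption of this sketch).
    """
    i_region_intra_ua = i_tile_intra_ua * tiles_per_region
    if i_region_intra_ua <= 0:
        raise ValueError("region inrush current must be positive")
    return i_module_wakeup_ua // i_region_intra_ua


def pde_size(r_module: int, r_concur: int) -> int:
    """PDE_size = ceil(R_module / R_concur), with both sizes counted in PGRs."""
    if r_concur < 1:
        raise ValueError("at least one region must be allowed to turn on")
    return -(-r_module // r_concur)  # integer ceiling division


# Sample point: 0.5 A wakeup budget, 600 uA per tile, 16 tiles per region.
print(max_concurrent_regions(500_000, 600, 16))  # 52

# R_concur values taken from Table 4.2 (PGR 1x1 tiles, R_module = 1000).
print(pde_size(1000, 484))  # 3
print(pde_size(1000, 144))  # 7
```

Working in integer microamps avoids the rounding surprises that floating-point division can introduce right at a floor boundary.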
Notice that a large PDE size results in large area and power overheads, while a small PDE size might require a more complicated power controller design, as will be explained in the next subsection; thus, a suitable PDE size should be selected to be useful for a large range of applications, and to avoid offsetting the benefits of power gating.

Note that it may seem desirable to select a large value of Imodule_wakeup in order to allow all regions to turn on simultaneously, thus totally eliminating the need for the inter-region level circuit. However, the architects must take into consideration the capabilities of the on-chip power grid, and whether it could provide the desired current without violating the IR drop constraints or being threatened by the electromigration reliability problem. In addition, the final user of the chip will have to employ a power supply at the board level that can deliver Imodule_wakeup current in addition to the worst case “normal operation” current, which may lead to larger power supply cost and size.

4.3.2 Device End Users

In this subsection, we describe how a user can implement their application using an FPGA that supports power gating with our proposed inrush current limiting architecture.

In an FPGA design flow, the design entry is performed by the user, e.g. at the RTL level or at higher levels. If the application contains power-gated modules, the CAD tools partition the application into power-gated modules that will be controlled at run-time. In addition to the application circuit, the user also specifies the required signals to control the power states of the different modules in the application (or the CAD tool does this), such as the power control signals and any other required signals (clock gating signals, handshaking signals, etc.).

All the subsequent steps in the normal FPGA design flow are performed whilst keeping modules' partitioning information.
This information is used in the packing, placement, and routing phases in order to group related blocks in the same power gating regions (PGRs).

After placement and routing are performed, the CAD tool derives the configuration for the different PDEs in the chip to allow at most Rconcur regions to turn on simultaneously. A module is therefore divided logically into partitions of Rconcur PGRs. The CAD tools configure the PDEs in each partition, ensuring that when a wakeup event occurs, no more than Rconcur regions are turned on at a time, thus limiting the current drawn from the power supply for the wakeup phase to Imodule_wakeup.

At the board level, the user must design a power supply system for the device that provides at least Imodule_wakeup current in addition to the current required for normal operations.

Post Placement Modifications

The number of PGRs that a module occupies after the placement phase might be different from the number estimated when the power management unit was designed. This is not a problem as long as the number of Rconcur partitions of a module does not exceed the PDE size in the inter-region level of the architecture. However, if the number of a module's partitions exceeds the PDE size, then the power controller must be redesigned to account for this; this means that the FPGA flow must be redone for the power management unit and the placement and routing must be redone for the whole circuit.

An example of using the inter-region architecture is shown in Figure 4.10. The figure shows an example of a power-gated module that occupies six PGRs on an FPGA. Each PGR is represented in the figure by its corresponding PDE. In this example Rconcur = 1, which means that a single PGR is turned on in each step. The first four PGRs are controlled by signal C1, and the PDEs in these PGRs are configured to enable turning on one region at a time. Another control signal C2 is generated in order to turn on the remaining two regions of the module. Signal C2 is generated by the power controller by delaying C1 by 4×∆T.

Figure 4.10: An example power-gated module mapped to an FPGA. Each PGR is represented by its PDE.

4.4 Conditions for Energy Savings

Power-gating a functional module is only beneficial if the saved energy during sleep mode is larger than the energy overhead associated with power gating (turning on/off and power controller energy). Achieving this is equivalent to finding if the idle time of a functional module is larger than a certain threshold. We call this threshold the minimum idle time to achieve energy savings (Tidle_min). To establish a mathematical model for this time, we use the terms in Table 4.1 that are related to a functional module and the DCPG FPGA architecture.

Table 4.1: Functional module and architecture parameters to find minimum idle time to achieve energy savings.

Term         Meaning
Rmodule      Number of PGRs a module occupies
Tidle        The idle period of the module
Rconcur      Number of PGRs that can be turned on simultaneously
Pon_leak     Average leakage power for an inactive PGR when not in sleep mode
Poff_leak    Average leakage power for a PGR in sleep mode
tturn_on     Time to turn on a PGR
tturn_off    Time to turn off a PGR
Eturn_on     Energy consumed to turn on a PGR
Eturn_off    Energy consumed to turn off a PGR

If the functional module is not placed in sleep mode during its idle period, then its energy consumption during the idle period using the power gating architecture is:

$$E_{idle} = T_{idle} \times R_{module} \times P_{on\_leak} \tag{4.1}$$

On the other hand, if the module is placed in sleep mode during its idle period, then the energy consumption during sleep mode and to enter and exit sleep mode is:

$$E_{sleep} = T_{idle} \times R_{module} \times P_{off\_leak} + R_{module} \times (E_{turn\_on} + E_{turn\_off}) + E_{DPT} \tag{4.2}$$

where EDPT is the energy consumed during power transitions; that is, during the time of the turn-on phase and during the time of the turn-off phase.
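Equations 4.1 and 4.2 translate directly into a few lines of code. The sketch below is a minimal illustration, not part of the original experimental flow: the function names are hypothetical and the parameter values in the usage lines are made up. Times are given in ns and powers in nW, so energies come out in attojoules (1e-18 J), and EDPT is treated as a caller-supplied value.

```python
def idle_energy(t_idle, r_module, p_on_leak):
    """Equation 4.1: energy of a module left on while idle."""
    return t_idle * r_module * p_on_leak


def sleep_energy(t_idle, r_module, p_off_leak,
                 e_turn_on, e_turn_off, e_dpt=0):
    """Equation 4.2: energy of a module that sleeps during its idle period.

    e_turn_on and e_turn_off are per-PGR transition energies; e_dpt is the
    power-transition term E_DPT, supplied by the caller in this sketch.
    """
    return (t_idle * r_module * p_off_leak
            + r_module * (e_turn_on + e_turn_off)
            + e_dpt)


def sleep_saves_energy(t_idle, r_module, p_on_leak, p_off_leak,
                       e_turn_on, e_turn_off, e_dpt=0):
    """True when sleeping costs less than staying on (E_idle > E_sleep)."""
    return idle_energy(t_idle, r_module, p_on_leak) > sleep_energy(
        t_idle, r_module, p_off_leak, e_turn_on, e_turn_off, e_dpt)


# Made-up numbers: 4 PGRs, 60 nW on-leakage, 5 nW off-leakage per PGR,
# 20000 aJ to turn a PGR on and 10000 aJ to turn it off.
print(sleep_saves_energy(1000, 4, 60, 5, 20000, 10000))  # True
print(sleep_saves_energy(100, 4, 60, 5, 20000, 10000))   # False
```

With these illustrative values, a 1000 ns idle period is long enough to recover the transition energy, while a 100 ns idle period is not.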
The EDPT term is explained below.

During the period of turning off a power-gated module, only Rconcur regions can be turned off simultaneously. Therefore, if the size of the module is larger than Rconcur, some of the regions will be dissipating leakage energy until the whole power-gated module is powered down. Similarly, during the power-up phase, some of the regions will dissipate leakage energy until the whole power-gated module is turned on. The leakage energy dissipated for these regions during both the turn-on and turn-off phases of a power-gated module is accounted for in Equation 4.3 if Rmodule > Rconcur. Otherwise, EDPT = 0.

$$E_{DPT} = (t_{turn\_on} + t_{turn\_off}) \times (P_{off\_leak} + P_{on\_leak}) \times \sum_{i=1}^{m-1} (R_{module} - i \times R_{concur}) \tag{4.3}$$

where $m = \lceil R_{module} / R_{concur} \rceil$.

Putting a functional module in sleep mode is more energy efficient than not doing so when Eidle > Esleep. From this relationship we can find the minimum idle time to achieve energy savings for a functional module (Tidle_min):

$$T_{idle\_min} = \frac{R_{module} \times (E_{turn\_on} + E_{turn\_off}) + E_{DPT}}{R_{module} \times (P_{on\_leak} - P_{off\_leak})} \tag{4.4}$$

Note that the analysis above does not consider other parameters that play a role in deciding whether using sleep mode is feasible or not, such as the energy consumed by a power controller. However, such terms can be integrated when dealing with higher level energy models that describe the energy for a complete system.

4.5 Experimental Setup

In this section, we describe the setup used for the experiments in Section 4.6.

4.5.1 Analysis Settings

We assume a 45 nm technology node [119] with VDD = 1 V. Power consumption and duration of power state transitions were measured assuming the worst case temperature of 85°C. However, we assume a temperature of 25°C when measuring the inrush current, and in the algorithm that finds the sizes of the sleep transistors in the intra-region architecture. This is because it is possible that an on transition happens after a long idle period, in which the temperature has gone down.
At this temperature, a sleep transistor can deliver current that is larger than at higher temperatures.

4.5.2 Power Grid

We use the methodology explained in Section 4.1 to build the power grid used in this study. We use the inductance model from the predictive technology models website [106], and the capacitance model from [110] to generate the dynamic power grid model that is used in Subsection 4.1.5. The power grid is generated for a chip of 3×3 mm size (about 57×57 FPGA tiles). The power grid has been synthesized manually with a pitch of 30 µm for the top metal layers. Clean VDD and GND sources were distributed at a spacing of 400 µm.

For the lowest metal layer, which has the current sources that represent the circuits of the chip (FPGA tiles in our experiments), we used a pitch that is equivalent to the physical width of a tile divided by the number of sources (n) in each metal segment passing over a tile. Thus, there are n×n current sources connected to the lowest metal layer lines over each tile; n = 2 in our experiments as has been explained in Subsection 4.1.1. The physical length of a tile was found by mapping the number of minimum-width transistor areas (MWTA) of a tile to the physical dimensions of the tile, assuming square tiles [46].

We assume that the maximum allowed voltage drop at a virtual VDD node is 100 mV (10% of VDD). This drop consists of 50 mV from the IR drop on the power grid and 50 mV across the sleep transistors.

4.5.3 DCPG FPGA Architecture with Wakeup Circuitry

Figure 4.11 shows the proposed wakeup architecture integrated with a simplified region of the DCPG FPGA architecture of size 1×1 tiles.

Figure 4.11: Example DCPG FPGA region with the proposed inrush current handling circuitry.

The wakeup mechanism for a region is as follows.
First, all of the bordering routing channels (RCs) are turned on to ensure that the inputs from bordering RCs have a well-defined state while the internal part of the region is turned on in the subsequent step, thus preventing the large leakage current that floating inputs would cause during the power-on phase. Next, the internal part of the region along with the switch blocks (SBs) are turned on. The necessary delay elements are inserted to ensure that turning on the different components inside a single PGR follows this sequence.

FPGA Architecture Parameters and Area Model

We use the following FPGA architecture parameters: logic cluster size N = 6, inputs per cluster I = 16, routing channel width W = 90, switch box flexibility Fs = 3, input pins connection box flexibility Fc,in = 0.2, and output pins connection box flexibility Fc,out = 0.1. To calculate area, we used the min-width transistor areas (MWTA) model from [46]. Note that in some cases, as in delay elements, we had to increase the resistance of transistors (to achieve larger delays) by increasing the gate length of transistors. To account for this in the area model, we first calculated the area for the minimum sized transistor in the 45 nm technology node, assuming MOSIS scalable CMOS design rules [120]. Then we calculated the area for the transistor as we increase the gate length. We found that increasing the gate length by a factor of l results in 0.125 × l increase in the area of the transistor. We used the same scaling factor to scale results from the MWTA area model for transistors that have larger-than-minimum gate length.

Power Gating Architecture

To size the sleep transistors, we first used HSPICE simulations of the transistor-level model of a power gating region (PGR) to determine the total effective width of the sleep transistors, and then used the algorithm from [103] to break this into individual sleep transistors as discussed in Section 4.2.
In choosing the total effective width, we assumed a maximum allowable voltage drop of 50 mV on the sleep transistor of a power gating region during normal operation. Our sizing method requires an estimate of the activity of the nodes in a region; rather than performing extensive power analysis using the Poon power model, we assume an activity of 30% for these nodes. We found that this approximation over-estimates the power and results in larger than necessary sleep transistors. However, the overall impact on the results is small.

4.6 Results and Discussion

In this section, we experimentally study the proposed architecture and the different tradeoffs that have been discussed in Subsection 4.2.3. First, we evaluate different design points of the architecture according to the design flow that has been described in Subsection 4.3.1. Then, we evaluate the use of soft logic-based delay elements instead of embedded PDEs for the inter-region level architecture.

4.6.1 Intra-Region Level

As mentioned in Subsection 4.3.1, the area overhead of the intra-region level of the proposed architecture depends on the selected value for Itile_intra during the architecture design (see footnote 2). We vary Itile_intra to show the area overhead, the wakeup time, and the energy dissipation of the intra-region level during the turn on/off phases. We also vary the size of the power gating region (the granularity) in the underlying DCPG FPGA architecture since the inrush current of a power gating region depends on its size.

Figure 4.12(a) shows the area overhead of the intra-region architecture compared to an architecture that does not support power gating, and Figure 4.12(b) shows the wakeup time per tile. We can see that as we increase Itile_intra, the area overhead and the wakeup time of the intra-region architecture decrease. This is due to the reduction in the number of stages of delay elements and the reduction in the size of each delay element as Itile_intra is increased.
We can also see that larger power gating region sizes have smaller area overheads and smaller wakeup times per tile; larger regions have smaller inrush current per tile as has been explained in Subsection 4.1.3.

Figure 4.12: Intra-region level (a) area overhead, (b) wakeup time, and (c) mode transition energy results. [Plot axes: (a) area overhead (%), (b) wakeup time per tile (ns), and (c) mode transition energy per tile (fJ), each versus max inrush current per tile (Itile_intra) (mA); curves for PGR sizes of 1x1, 2x2, 3x3, and 4x4 tiles.]

Footnote 2: Recall that Itile_intra = Iregion_intra / num_tiles_in_region.

Figure 4.12(c) shows the energy dissipation during power mode transitions (sum of the energy dissipated to power on and to power off a PGR). This energy is due to the power dissipated in delay elements as well as the power dissipated due to inrush current. We can see from the figure that as we increase Itile_intra, the energy due to power mode transition decreases. This is due to the decrease in the number and sizes of delay elements required in the intra-region level circuitry, and the corresponding decrease in the turn on/off time. The results given here are important especially when evaluating whether turning off a module can achieve energy savings or not.
That is, to justify using power gating for a module, the energy saved when the module is turned off during its idle times must be larger than the energy cost of turning the module off and then on plus the energy dissipated by the power management unit.

4.6.2 Inter-Region Level

According to the design flow described in Subsection 4.3.1, a value must be chosen for Rmodule, which indicates the largest power-gated module that can be turned on using one power control signal. For the experiments in this section, we assume Rmodule = 1000 tiles (the size of a simple MIPS processor, for example, is 905 tiles). Also, the design flow specifies a constraint on the chip's current that is dedicated to turning on power-gated modules, denoted Imodule_wakeup; this constraint is varied to investigate different design options of the architecture.

We evaluate the area overhead of the inter-region architecture by varying two design parameters: we vary Imodule_wakeup to represent different design choices for the constraint on the chip's current dedicated to turning on power-gated modules, and we vary the constraint on the current used to design the intra-region architecture (Itile_intra), since finding the size of the PDEs in the inter-region level depends on this parameter as well. The results in this subsection are given for a PGR of size 4×4 tiles.

Figure 4.13 shows the required PDE size, i.e., the number of configurable delays in each PDE, for different values of Imodule_wakeup, and Figure 4.14 shows the corresponding area overhead compared to a PGR that does not support power gating. As can be seen in the figures, as Imodule_wakeup increases, the area overhead of the inter-region level decreases because a larger number of PGRs can be turned on simultaneously (Rconcur), which leads to smaller PDE sizes. Also, we can see that larger values of Itile_intra result in larger area overhead because a smaller number of PGRs can be turned on simultaneously (Rconcur), which leads to larger PDE sizes. Nevertheless, the area overhead of the inter-region level is very small (less than 0.2%). Clearly, if larger module sizes were used in the experiments (larger Rmodule), then the PDE sizes would be larger, leading to larger area overheads.

Figure 4.13: PDE size for different Imodule_wakeup budgets. PGR size = 4×4 tiles. [Plot axes: PDE size versus constraint on chip's wakeup current Imodule_wakeup (A); curves for max. tile's inrush current Itile_intra = 400, 600, 800, and 1000 µA.]

Figure 4.14: PDE area overhead for different Imodule_wakeup budgets. PGR size = 4×4 tiles. [Plot axes: PDE area overhead (%) versus constraint on chip's wakeup current Imodule_wakeup (A); curves for max. tile's inrush current Itile_intra = 400, 600, 800, and 1000 µA.]

4.6.3 Combined Intra- and Inter-Region Levels

In this subsection, we combine the results from the previous two subsections in order to show the overall area and leakage power overheads of the proposed inrush current handling architecture. Figure 4.15 shows the totals for area overhead and leakage power per tile in the ON and OFF states for different values of Imodule_wakeup. The results in this subsection are given for a PGR size of 4×4 tiles.

Figure 4.15: Total intra and inter-region level wakeup architecture results for different device's wakeup current budgets (PGR size = 4×4 tiles). (a) Total area overhead, (b) total ON state leakage power, and (c) total OFF state leakage power. [Plot axes: (a) area overhead (%), (b) ON leakage per tile (nW), and (c) OFF leakage per tile (nW), each versus constraint on chip's wakeup current Imodule_wakeup (A); curves for max. tile's inrush current Itile_intra = 400, 600, 800, and 1000 µA.]

As can be seen in the figure, the area overhead is smaller than 1% for most of the cases, and it decreases as Imodule_wakeup increases. This follows the same trend that was shown in Figure 4.14 since increasing Imodule_wakeup would result in a decrease in the PDE area in the inter-region level of the architecture. Also, we can see in Figure 4.15 that the area overhead and ON/OFF leakage powers drop off quickly, and then flatten out because no PDEs are required, i.e. the area overhead is that for the intra-region architecture only.

Figure 4.15 also shows the leakage power of the inrush current handling circuitry in the ON and OFF states. We can see that in the ON and OFF states, leakage power decreases as Imodule_wakeup increases; this is due to the reduction in the sizes of PDEs used in the inter-region level of the architecture; the ON (OFF) state leakage power is strongly correlated to the area overhead results.

The leakage power dissipation due to the inrush current handling circuitry (mainly the delay elements) represents an overhead that is incurred in the power gating regions. For example, at Imodule_wakeup = 0.5 A, the OFF state leakage power saving for a PGR of 4×4 tiles with our proposed architecture is about 91.6%, whilst it is 93.2% for a PGR that does not have our proposed architecture.
The leakage power saving in a PGR is defined as $100\% \times \frac{leakage\_ON - leakage\_OFF}{leakage\_ON}$, where leakage_ON and leakage_OFF are the leakage power of a region without power gating and the leakage power of a region that employs power gating when it is turned OFF, respectively.

4.6.4 Optimal Design of The Intra- and Inter-Region Levels

In the previous two subsections, we assumed that the Itile_intra used to design the intra-region architecture does not exceed what the power grid supports for each tile. Larger values of Itile_intra mean that the current drawn when turning on a PGR exceeds what the power grid supports for each tile. This does not necessarily cause the IR drop constraint (50 mV in our case) to be violated unless a “large number” of regions are turned on simultaneously. However, the effect of increasing Itile_intra is a reduction in the number of PGRs that can simultaneously turn on without unacceptable voltage droop (Rconcur); this results in larger PDEs for the inter-region level architecture. At the same time, increasing Itile_intra reduces the area overhead of the intra-region level. Based on this discussion, there is an optimal point where the overall area overhead of the inrush current handling circuitry is minimum. In this subsection, we experimentally investigate this optimal design point.

In order to study this tradeoff, we used Itile_pg = 600 µA + 10% to synthesize a power grid. We then varied the Itile_intra that is used to design the intra-region architecture, and for each case we found the number of regions (in a square area of the chip) that can be turned on simultaneously without violating the voltage droop constraints on the power grid (we use 50 mV as the constraint). We use a pessimistic assumption where all other chip regions are drawing worst case “normal operation” current. We investigated this using extensive HSPICE simulations.
In these experiments we used two sizes for the power gating regions in the underlying DCPG FPGA architecture to study different granularities of the architecture, and we assume Rmodule = 1000 tiles. We also assume that Imodule_wakeup is unlimited since we want to evaluate the Rconcur capability of the power grid.

Table 4.2 shows Rconcur for different values of Itile_intra. As can be seen in the table, Rconcur decreases as Itile_intra increases, leading to larger PDE sizes.

Figures 4.16(a) and (b) show the total area overhead of the intra- and inter-region levels of the wakeup architecture for two PGR granularities. In each of these figures, we study two cases: the first is when the intra-region level is designed assuming Itile_intra is set to the same amount as what the power grid supports for each tile (Itile_intra = 600 µA), and the second case is for increasing Itile_intra beyond what the power grid supports for each tile. Although increasing Itile_intra results in smaller area for the intra-region architecture as has been shown in Subsection 4.6.1, the area of the inter-region architecture dominates the area overhead, and thus the total area is increasing as shown in the figure. The smallest area overhead is achieved by setting Itile_intra to the same amount as what the power grid supports.
This is true for different granularities of the DCPG FPGA architecture as can be seen in Figures 4.16(a) and (b).

Table 4.2: Rconcur and PDE size for different designs of the intra-region architecture with Itile_intra exceeding what the power grid supports in each tile location (Itile_pg = 600 µA).

                    PGR 1×1 tiles          PGR 4×4 tiles
Itile_intra (µA)    Rconcur   PDE Size     Rconcur   PDE Size
700                 484       3            25        3
800                 144       7            9         7
900                 81        13           4         16
1000                49        21           1         62

Figure 4.16: Comparing the total area overhead for the inrush current limiting architecture for two design choices of the intra-region level. Two PGR granularities are studied: (a) 1×1 tiles and (b) 4×4 tiles. [Plot axes: total area overhead (%) versus max inrush current per tile (Itile_intra) (µA); the Itile_intra = 600 µA design point is compared against increasing Itile_intra.]

4.6.5 Minimum Idle Time to Achieve Energy Savings

Figure 4.17 shows the minimum idle time (Tidle_min) that is required for a module in order to achieve energy savings when turned off using the proposed architecture. The results in the figure are shown for four module sizes. Equation 4.4 is used to obtain the results in this figure. The results are reported assuming a PGR of size 4×4 tiles and Itile_intra = 600 µA. As Imodule_wakeup increases, Rconcur increases, leading to smaller PDE sizes, lower power overhead, and shorter turn on time for a module. For modules of sizes 250 - 2000 tiles, Tidle_min ranges between 100 - 110 ns when Imodule_wakeup = 0.5 A. This corresponds to 34 - 37 cycles on a 300 MHz clock.

Figure 4.17: Minimum idle time to achieve energy savings. PGR size is 4×4 tiles and Itile_intra = 600 µA. [Plot axes: Tidle_min (ns) versus constraint on chip's wakeup current Imodule_wakeup (A); curves for module sizes of 250, 500, 1000, and 2000 tiles.]

4.6.6 Soft-Logic vs. Embedded PDEs

There is an alternative approach for implementing the inter-region level of the wakeup circuitry; instead of having dedicated hardware for the inter-region level (i.e., embedded PDEs), the user of the chip can implement the required delays using the soft logic of the chip. In this subsection, we compare the two implementation options. To perform this analysis, we used a PGR of size 4×4 tiles, Itile_intra = 600 µA, and assumed a power-gated module of 1000 tiles. Note that this tradeoff might not be available in other types of reconfigurable architectures that do not support using soft logic at a fine granularity level to implement delay elements.

A Johnson counter was used to generate the delayed signals. A simple implementation of a Johnson counter is a shift register with the feedback signal inverted. This implementation efficiently provides the required functionality of a sequence of delayed turn on signals.

In this implementation, a clock is required for the counter. The number of flip flops (FFs) required to implement a ∆T delay unit that is required to turn on Rconcur regions is calculated as $\lceil \Delta T / T_{clk} \rceil$, where Tclk is the clock period. For cases where ∆T < Tclk, each delay period will require one FF in the counter. Each generated delay period in this case will be larger than the required ∆T. This increases the turn on time of a power-gated module, and hence the dissipated energy during power mode transitions. For cases where ∆T > Tclk, more than one FF is required to implement one delay period; the number of such FFs increases as the clock frequency increases. We investigated the effect of different clock frequencies on the area of the required counters to generate the required delays.
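The flip-flop count and the behaviour of the counter can be illustrated with a small software model. This Python sketch is an assumption-laden illustration rather than the circuit used in the experiments: the function names are hypothetical, the 4 ns value for ∆T is invented for the example, and the counter is modelled simply as a shift register whose last stage is inverted and fed back to the first.

```python
def ffs_per_delay(delta_t_ps: int, f_clk_mhz: int) -> int:
    """Number of FFs for one delta-T delay: ceil(delta_T / T_clk).

    T_clk in ps is 1e6 / f_clk_mhz, so the ratio delta_T / T_clk equals
    delta_t_ps * f_clk_mhz / 1e6; integer ceiling division keeps it exact.
    """
    if delta_t_ps <= 0 or f_clk_mhz <= 0:
        raise ValueError("delay and clock frequency must be positive")
    return -(-(delta_t_ps * f_clk_mhz) // 1_000_000)


def johnson_states(num_ffs: int, num_clocks: int) -> list:
    """States of a Johnson counter after each clock edge, starting from zero.

    Stage 0 samples the inverted output of the last stage; every other
    stage samples the stage before it.
    """
    state = [0] * num_ffs
    history = []
    for _ in range(num_clocks):
        state = [1 - state[-1]] + state[:-1]
        history.append(state)
    return history


# Illustrative delta-T of 4 ns (4000 ps) at two clock frequencies.
print(ffs_per_delay(4000, 100))  # 1  (T_clk = 10 ns is longer than delta-T)
print(ffs_per_delay(4000, 400))  # 2  (T_clk = 2.5 ns)

# Each stage output rises one clock after the previous one, which yields
# a sequence of staggered turn-on signals.
print(johnson_states(3, 3))  # [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
```

In the 100 MHz case the single flip-flop produces a 10 ns delay for a 4 ns requirement, which is the over-long delay period the text describes for ∆T < Tclk.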
Note that this analysis only considers the area overhead; we do notinclude the routing overhead or the power overhead for the clock signal or the power control signalsthat are generated from the soft logic-based delay elements. It is conceivable that as the numberof signals that need to be routed increases, the routability and speed of the application might beaffected negatively.Figure 4.18 shows the area overhead of the inter-region level of the inrush current handlingarchitecture for different values of Imodule wakeup. Imodule wakeup has been varied to evaluate differentconstraints on Rconcur, and thus, different sizes of required delay elements. The figure shows theresults obtained for different clock frequencies for the soft logic-based implementation. We can seethat as Imodule wakeup increases, the area overhead decreases because the sizes of PDEs decrease (noPDEs are required if Imodule wakeup ≥ 0.7 A). We can also see that as the clock frequency increases,the area overhead of the soft logic implementation increases because more FFs are required in thecounter to generate the desired delays. An interesting observation is that the area of embeddedPDEs can be larger than the soft logic implementation when a low-frequency clock is used for thecounter. This is because in these cases Tclk > ∆T , which requires implementing each ∆T delayperiod as only one FF. However, as mentioned earlier, this will generally increase the turn on timeand the power mode transition energy, making it more difficult to achieve energy savings in anOFF state.Although the area overhead of soft logic-based delay elements might be smaller than that ofembedded PDEs in some cases, using embedded PDEs in the inter-region level of the inrush currenthandling circuit is more appealing for many reasons. The area overhead of using embedded PDEs904.6. 
is very small (less than 0.5%), they simplify the design of power-gated circuits, and using them partially excludes the consequences of design errors that may result because of using soft logic-based delays. For other architectures where using soft logic delays provides significant area savings, the other metrics (power and turn-on time) must be studied as well to best evaluate the design choices of the inter-region architecture.

Figure 4.18: Area overhead for soft logic and embedded logic implementation of inter-region architecture for different $I_{module\_wakeup}$ budgets. PGR size = 4 × 4 tiles. (Plot of area overhead (%) versus the constraint on the chip's wakeup current $I_{module\_wakeup}$ (A); series: hard embedded PDEs, and soft logic with Fclk = 100, 200, 300, and 400 MHz.)

4.6.7 Decaps and Power Grid Size

In our previous experiments, we did not use explicit decoupling capacitors in our chip. Typically, there will be some amount of explicit decaps present in the chip to reduce the voltage droop during normal circuit operation. Recall that a decap in a standard CMOS process is implemented as an NMOS transistor with its source/drain tied to GND and its gate tied to VDD. In our power-gated architecture, we imagine that such decaps would be part of the PGRs in order to reduce their leakage power consumption when the PGRs are powered off. However, in an alternative implementation, the decaps could be designed to be in an "always on" state. Such an alternative implementation can reduce the inrush current in the PGRs, which results in less area overhead for the proposed architecture at the cost of more leakage power dissipation because of the decaps.

Another related issue is that the widths of the power rails can be increased to support more current. This will result in an increase in $R_{concur}$ in the proposed architecture since more current
can be dedicated to turning on power-gated modules without affecting the normal operation of the circuit (increasing the power grid size is equivalent to having larger $I_{module\_wakeup}$). However, as the results in Subsections 4.6.1 and 4.6.3 suggest, the area overhead is dominated by that of the intra-region level circuit. Thus, increasing $R_{concur}$ might reduce the area overhead of the PDEs in the inter-region level, but the total area overhead reduction will be relatively small. The question of what is the best combination of power-grid area and intra- and inter-region level area has not been answered in this chapter since it is highly dependent on how the layout of the architecture is performed. We hope that this study will motivate FPGA researchers to investigate this tradeoff in more detail.

It is worth mentioning that increasing the power rail widths cannot alone solve the problem of inrush current. As has been shown in Subsection 4.1.4, the only way at the power grid level to reduce the voltage droop resulting from the inrush current is by using more clean VDD/GND sources, which translates to more power pins on the chip. This might not be desirable, especially for small FPGAs that do not have too many pads.

4.7 Summary

In this chapter, we presented an architecture that can be used to solve the inrush current problem in reconfigurable devices that support dynamic power gating. It enables sequencing the turn-on phase of a power domain such that the voltage droop remains within an acceptable level that does not affect the functionality of the other circuit components. The architecture is configurable, and can be used to support a wide range of power domain structures and sizes. We evaluated other techniques to handle inrush current, namely using decoupling capacitors or over-provisioning the power distribution network. We found that these techniques result in large area overheads.

We evaluated the proposed architecture's area, power, and timing overheads.
The area overhead of the presented architecture is less than 4% for a PGR of size 1x1 tiles, and about 1% for a PGR of size 4x4 tiles, and the power saving from the DCPG architecture is reduced by 2% for a PGR of size 4x4 tiles. The wakeup time per tile was found to range from 6 ns down to less than 1 ns for PGRs of sizes between 1x1 and 4x4 tiles. Given that many tiles can be woken up simultaneously, the wakeup time for a power-gated module that has 1000 tiles is less than 33 ns (about 10 clock cycles on a 300 MHz clock), when the module is mapped to a DCPG FPGA architecture with a PGR size of 4x4 tiles, and $R_{concur}$ = 25.

Note that this chapter does not discuss how the instability during the wakeup phase is handled. We expect that turning on the PGRs in a module will result in instability in the power-gated module. As mentioned in Chapter 3, it is left to the user to design the circuit and the power controller to ensure that entering sleep mode is preceded by stopping data transfers and the clock, and exiting power gating is followed by enabling the clock again. The architecture in Chapter 3 enables providing a feedback signal (see Subsection 3.2.1) that can be routed by the CAD tools to the power controller, or other components in the system, that might need to know when the power transition of a module is completed, and hence start/stop data transfers and the clock.

Chapter 5
Physical-Level CAD Support for Power-Gated FPGAs

With a few additional steps in the lower-level FPGA CAD flow, existing placement and routing algorithms can be used as is to map power-gated circuits to the proposed power-gated FPGA architecture. This chapter starts by describing a CAD flow that is used to map application circuits having power-gated modules to power-gated FPGAs. This is followed by a discussion on optimizations in the low-level CAD algorithms that help improve the power savings achievable by the power gating architecture presented in Chapter 3.
Two studies are performed using the presented CAD flow. Section 5.4 focuses on the impact of the physical-level CAD algorithms on the power reduction in power-gated FPGAs, and Section 5.5 focuses on the granularity of the power gating architecture, and its effect on the achievable power reduction. The architecture file in Appendix A is used for the experiments in this chapter. Part of the work in this chapter was published in [3, 5].

5.1 CAD Flow For Power-Gated FPGAs

This section presents the CAD flow used for the power-gated FPGA architecture presented in Chapter 3. The CAD flow is based on the VPR placement and routing tool (version 5.0) [105], which is commonly used in FPGA architecture and CAD research. This section focuses on the minimum enhancements required in the CAD flow in order to make it suitable for power-gated FPGAs. This flow uses existing placement and routing algorithms without modification. In Section 5.2, we discuss enhancements to the placement and routing tools to improve power savings of the power gating architecture.

Figure 5.1 shows the general CAD flow used for power-gated FPGAs, which is similar to the one shown in Section 2.1 with some modifications. Information about the power-gated modules is maintained through the flow to guide the different steps. Logic synthesis and technology mapping are done for individual power-gated functional blocks (or modules) and always-on components in the circuit. The output from the technology mapping step is a collection of netlist files (.blif files). In the diagram, PG modules refer to the power-gated modules in the circuit, and AO module refers to the remaining parts of the circuit that are always-on. This is followed by a new step to reconstruct the circuit from the technology-mapped netlists of individual modules. The packing phase has been modified to keep information about the power-gated modules in a circuit.
This is done by allowing only basic logic elements (BLEs) and flip flops (FFs) that belong to the same module to be packed in the same logic cluster (LC). The output from the packing step is a .net file. Note that in an alternative approach, packing can be performed for individual .blif files of the power-gated modules, and then the circuit can be reconstructed from the individual .net files that represent the modules of the circuit. This alternative approach is used to build the synthetic benchmarks as explained in Subsection 5.3.1.

Figure 5.2 shows the low-level CAD flow for the power-gated FPGAs. This flow is based on the VPR tool [105]. The circuit netlist (.net file) is annotated with power-gated modules information from the previous packing step. This is done by prefixing the names of the blocks that belong to a module with the module's name. In addition to the application circuit netlist, a power controller netlist is provided to the CAD flow. Currently, the CAD flow assumes that the power controller is provided as a separate netlist. Note that ideally the CAD tools should be able to generate the power controller from information about the circuit structure and behaviour. The power controller receives signals from the I/Os or from the application circuit, and it drives the power control signals that control the power states of the different power-gated modules in the application circuit. This is explained in more detail in the following subsections.

Figure 5.1: General CAD flow for power-gated FPGAs. (Flowchart boxes: circuit (Verilog, VHDL, etc.); logic synthesis (Odin II, Quartus II, etc.); technology mapping (ABC); AO module (BLIF); PG module (BLIF); reconstruct circuit (generate BLIF); packing (T-VPack); PG modules information; place and route (VPR); architecture description; configuration bitstream.)

5.1.1 Modelling DCPG Architecture

The VPR tool was originally designed as an architecture exploration tool.
An FPGA architecture model is generated in the tool based on an input architecture file [46]. We extended the supported architecture parameters in VPR to enable exploring many power gating architecture granularities in addition to exploring different basic FPGA architecture parameters. The new power gating parameters include the PGR's width and height, and the SB's partitions per side (PPS). These parameters are used to generate a floorplan of the PGRs in an FPGA.

A coordinate system that identifies the PGR that each component (SBs, LCs, and RCs) belongs to is established in this step. This helps in determining the power configuration for the different components after placement and routing as will be discussed later in this chapter. Given an FPGA size in terms of the number of columns and rows of tiles in the chip, the chip is divided into equally-sized PGRs starting from its bottom-left corner.

Figure 5.2: Physical-level CAD flow for power-gated FPGAs. (Flowchart inputs: architecture (PGR_width, PGR_height, PPS, etc.); circuit (module1, module2, …); power controller. Steps: generate PG architecture model; place and route power controller; place circuit; determine PGRs power state and i/p pins for power control signals; route power control signals to PGRs; route circuit; determine power state for SB partitions. Output: placed and routed circuit and power controller with power domains.)

Figure 5.3: The architecture model of an FPGA of size 5x5 tiles, and a core PGR size of 2x2 tiles. (Labelled regions: core PGR (2x2); remaining column PGR (1x2); remaining row PGR (2x1); corner PGR (1x1).)

Figure 5.3 shows an example FPGA that has a height of 5 rows of tiles and a width of 5 columns of tiles. The floorplan shows how the chip is divided into PGRs, each of size 2x2 tiles.
Since the number of rows and columns of the example FPGA does not divide evenly by 2, the top row of tiles and the right-most column of tiles have PGRs of size 2x1 and 1x2 tiles, respectively. Note that it is possible to size up the chip to generate same-size regions in the top row and the right-most column of PGRs; larger regions can result in more power savings and smaller area overhead as discussed in Chapter 3. However, this is not expected to significantly impact our results since its effect is limited to a few regions in a chip.

5.1.2 Power Controller Placement and Routing and Circuit Placement

In this step, the power controller's blocks are placed and their internal connections are routed on the available FPGA resources. This is done using the existing placer and router in VPR. The power control signals (outputs of the power controller unit) are not routed in this step because there is no information yet about which PGRs implement which power-gated module.

The next step is the placement of the application circuit using the VPR's placer. After the circuit is placed, the module that each PGR belongs to is determined in order to know where the power control signals should be routed (routing is explained in the next subsection). This is done based on the circuit's blocks that occupy the LCs in a PGR. If at least half of the LCs in a PGR are occupied by blocks that belong to a certain power-gated module, then the PGR belongs to that power-gated module. The justification for this is as follows. If all of a PGR's LCs are occupied by blocks that belong to a certain power-gated module, then the power state for the LCs and internal RCs of the PGR will be configured as dynamically-controlled.
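The floorplan generation and the "at least half" ownership rule can be sketched in Python as follows. This is an illustrative sketch only, not the modified VPR code: all function names are hypothetical, tiles are indexed from the bottom-left corner as in the text, and breaking a tie between two modules that each occupy exactly half of the LCs is an assumption, since the text does not specify that case.

```python
import math
from collections import Counter


def pgr_of_tile(x, y, pgr_width, pgr_height):
    """PGR coordinates of tile (x, y), counting from the bottom-left corner."""
    return (x // pgr_width, y // pgr_height)


def pgr_sizes(cols, rows, pgr_width, pgr_height):
    """Map each PGR coordinate to its (width, height) in tiles.

    PGRs in the right-most column and top row shrink when the chip
    dimensions are not multiples of the PGR dimensions.
    """
    sizes = {}
    for px in range(math.ceil(cols / pgr_width)):
        for py in range(math.ceil(rows / pgr_height)):
            w = min(pgr_width, cols - px * pgr_width)
            h = min(pgr_height, rows - py * pgr_height)
            sizes[(px, py)] = (w, h)
    return sizes


def owning_module(lc_modules, pg_modules):
    """Module that a PGR belongs to, or None.

    lc_modules: one entry per LC in the PGR (module name, or None if empty).
    pg_modules: names of the power-gated modules.
    Assumption: on a tie, the module encountered first wins.
    """
    counts = Counter(m for m in lc_modules if m in pg_modules)
    if not counts:
        return None
    module, count = counts.most_common(1)[0]
    return module if 2 * count >= len(lc_modules) else None


if __name__ == "__main__":
    # The 5x5 example of Figure 5.3 with 2x2 core PGRs.
    for coord, size in sorted(pgr_sizes(5, 5, 2, 2).items()):
        print(coord, size)
    print(owning_module(["M1", "M1", None, "M2"], {"M1", "M2"}))
```

For the 5x5 example, the sketch yields four 2x2 core PGRs, two 1x2 PGRs in the right-most column, two 2x1 PGRs in the top row, and one 1x1 corner PGR.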
If at least half of the LCs in a PGR, but not all of them, are occupied by blocks that belong to a certain power-gated module, then it is very likely that most of the routing resources in the PGR will be used to route the same module's signals, and their power state will be configured as dynamically-controlled.

Next, the power state configuration for the LCs and internal RCs in each PGR is determined based on the following (see footnote 3):

• Dynamically-controlled: if only one power-gated module is mapped to the LCs in the PGR.
• Always-off: if all of the PGR's LCs are empty.
• Always-on: all other cases.

Footnote 3: Recall from Chapter 3 that each PGR has three main components: LCs with internal RCs, border RCs, and individual SBs (or individual SB partitions). The power state for each of these components can be configured separately.

5.1.3 Power Control Signals Routing and Circuit Routing

After the PGRs of each power-gated module are identified in the previous step, the net for each power control signal is constructed. The source terminal of a net is known since it is one of the outputs of the power controller. The sinks are found as follows. For each PGR that belongs to the power-gated module under consideration, a free input pin from its bordering routing channels is selected (if one is available) to act as a sink for the control signal. Once the nets of the power control signals are constructed, the VPR router is used to route these signals. The method used for selecting the sinks for each power control signal (explained below) will likely result in a route with a trunk-branches topology. This topology results in the smallest number of SB partitions that are required to route a power control signal.

Figure 5.4 shows an example routing of a power control signal to PGRs of size 2x2 tiles ($R = 2$) in a trunk-branches topology. Note that the size of the power gating region has an effect on the number of SB partitions that are required to route the power control signals. As the region size is increased, i.e., a coarser-granularity architecture, a smaller number of PGRs can be used to implement the same-size power-gated module. This leads to fewer branches that are required in the route of the control signal, and results in the use of fewer SB partitions.

Figure 5.4: Routing of a power control signal to PGRs of a power-gated module in a trunk-branches topology. (Labels: power controller; control signal.)

After the routing of the power control signals, the power state configuration for each bordering RC is determined as follows:

• Dynamically-controlled: if both neighbouring PGRs that share the bordering RC belong to the same power-gated module, and the power control signal of the module is not routed through this bordering RC.
• Always-off: if none of the input pins that are driven by connection boxes in this bordering RC is used.
• Always-on: all other cases.

The final step is the routing of the application circuit's signals. Routing resources that are occupied by the power control signals cannot be used in this step. The routing is done using the timing-driven router in the VPR tool.

The power state for each SB partition is determined after the routing of the power control signals and the circuit's signals. The SB partitions that are used to route the power control signals are set as always-on to ensure that the power control signals are available all the time. The power state for each of the remaining SB partitions is determined as follows:

• Dynamically-controlled: if the SB partition is used to only route signals that belong to the power-gated module that the PGR belongs to.
• Always-off: if the SB partition is not used to route signals.
• Always-on: all other cases.
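The power state rules for the LCs (with their internal RCs) and for the SB partitions can be summarized in a short Python sketch. The names `PowerState`, `lc_power_state`, and `sb_partition_power_state` are hypothetical and serve only to illustrate the rules listed in this section; the bordering RC rules are omitted, and the data structures do not correspond to those used in VPR.

```python
from enum import Enum


class PowerState(Enum):
    DYNAMIC = "dynamically-controlled"
    ALWAYS_OFF = "always-off"
    ALWAYS_ON = "always-on"


def lc_power_state(lc_modules, pg_modules):
    """State of the LCs and internal RCs of one PGR.

    lc_modules: one entry per LC (module name, or None if the LC is empty).
    pg_modules: names of the power-gated modules.
    """
    occupants = {m for m in lc_modules if m is not None}
    if not occupants:
        return PowerState.ALWAYS_OFF
    if len(occupants) == 1 and occupants <= set(pg_modules):
        return PowerState.DYNAMIC
    return PowerState.ALWAYS_ON


def sb_partition_power_state(net_modules, pgr_owner, carries_control_signal):
    """State of one SB partition after routing.

    net_modules: set of modules whose signals are routed through the partition.
    pgr_owner: power-gated module that the enclosing PGR belongs to, or None.
    carries_control_signal: True if a power control signal uses the partition.
    """
    if carries_control_signal:
        return PowerState.ALWAYS_ON  # control signals must always be available
    if not net_modules:
        return PowerState.ALWAYS_OFF
    if pgr_owner is not None and set(net_modules) == {pgr_owner}:
        return PowerState.DYNAMIC
    return PowerState.ALWAYS_ON


if __name__ == "__main__":
    print(lc_power_state(["M1", "M1", None, None], {"M1", "M2"}))
    print(sb_partition_power_state({"M1", "M2"}, "M1", False))
```

In both functions the always-on state is the fallback, which mirrors the "all other cases" entries in the lists above.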
Selecting the Sinks of a Power Control Signal

In order to guide the router to generate a trunk-branches routing topology for a power control signal as explained above, the sinks for a power control signal are selected as described in Algorithm 1.

Algorithm 1: Algorithm for building the power control signal for a power-gated module

    function build_module_pcs(module_index, chip_PGRs)
        net_sinks ← {}
        (PGR_x1, PGR_y1), (PGR_x2, PGR_y2) ← Identify the module's rectangular area
        trunk_orientation ← Find orientation for min. number of SB partitions (using Equ. 5.1)
        if trunk_orientation = vertical then
            num_pcs_branches ← ⌈(PGR_y2 − PGR_y1 + 1) / 2⌉    // Number of power control signal branches
            branch ← 0
            while branch < num_pcs_branches do
                first_PGR_y ← PGR_y1 + branch × 2    // Y-coordinate of the first PGRs row
                opp_PGR_y ← first_PGR_y + 1    // Y-coordinate of the opposite PGRs row
                i ← PGR_x1
                while i ≤ PGR_x2 do
                    if chip_PGRs[i][first_PGR_y].module_index = module_index then
                        add_sink(net_sinks, chip_PGRs[i][first_PGR_y], TOP)
                    end if
                    if chip_PGRs[i][opp_PGR_y].module_index = module_index then
                        add_sink(net_sinks, chip_PGRs[i][opp_PGR_y], BOTTOM)
                    end if
                    i ← i + 1
                end while
                branch ← branch + 1
            end while
        else
            // Does the same as the if part but for the vertical branches
        end if
        return net_sinks
    end function

    function add_sink(sinks_arr, PGR, side)
        // Finds an empty sink on PGR on the specified side and adds it to sinks_arr.
        // If no empty pin is found on side, the other sides of PGR are searched.
    end function

In Algorithm 1, Equation 5.1 is evaluated two times to determine the orientation of the potential route that results in a smaller number of SB partitions. Equation 5.1 finds an estimate of the number of SB partitions that would be required to route the power control signal in a module that occupies a rectangular portion of an FPGA,
\[ SB_{parts} = \left\lceil \frac{R \times PGR_{trunk} - 1}{L} \right\rceil + \frac{PGR_{trunk}}{2} \times \left\lceil \frac{R \times PGR_{branch} - 2}{L} \right\rceil \tag{5.1} \]

where $R$ is the architectural region size (a PGR has $R \times R$ tiles), $L$ is the routing segment length (i.e., the length of a routing wire in the FPGA architecture), $PGR_{trunk}$ is the number of PGRs in the trunk of the route, and $PGR_{branch}$ is the number of PGRs in each branch of the route.

In Equation 5.1, $\lceil (R \times PGR_{trunk} - 1)/L \rceil$ represents the number of SB partitions required to route the trunk of the route, and the remaining parts of the equation represent the number of SB partitions required to route the branches ($PGR_{trunk}/2$ is the number of branches and $\lceil (R \times PGR_{branch} - 2)/L \rceil$ is the number of SB partitions in each branch). This equation gives a lower bound estimate of the SB partitions required to route a control signal. This equation assumes that all wire segments are internally connection-block populated [46]. The equation applies to rectangular modules; therefore, before the routing of the power control signals is performed, the extent of each module is determined using the coordinates of the PGR in its bottom-left corner and the PGR in its top-right corner.

5.2 PG-Aware Placement and Routing Optimizations

The placement and routing algorithms in VPR can be used without modifications in order to correctly map application circuits to DCPG FPGAs as explained in the previous section. However, certain optimizations can be applied to these algorithms in order to increase the number of power-gated resources and hence reduce leakage power. When a circuit is implemented on a DCPG FPGA, some of the resources must be configured as always-on in order to ensure proper circuit operation. For example, it will not be possible to completely power down a PGR that is filled by blocks that do not belong to the same module. Always-on resources also include SB partitions that are used to route important signals, such as the power control signals and inter-module signals.
In this section, we propose and evaluate optimizations for the placement and routing algorithms. In Section 5.5, we compare the results of the optimizations with those that have been proposed in the literature.

5.2.1 Placement

Issues in PG-Aware Placement

For applications that contain power-gated modules, it is important to direct the placement algorithm to minimize the number of resources that must be always-on while minimizing the impact on speed and area of the implemented circuit. This improves the power reduction using the power-gated FPGA architecture. This subsection describes how placement affects the number of resources that can be powered down when idle.

Figure 5.5 shows two power-gated modules (Modules 1 and 2) and one always-on module (Module 3) placed on nine PGRs, with their power controller also placed on the same FPGA. Some of the PGRs are occupied by blocks that belong to only one module, while others contain blocks that belong to more than one module. The PGRs that are occupied by the blocks of the same power-gated module can be completely powered down when that module is inactive. Examples of this are PGRs (0,0), (0,1), (1,1), and (1,2). For PGR (0,2), there are two possibilities for its power configuration. One possibility is to configure the SB partitions that are used to route Module 1's signals as dynamically-controlled and its LCs and internal RCs as always-on. This may lead to some power savings, but it depends on how many SB partitions are occupied by Module 1. This also applies to PGR (1,0) because one of its LCs is occupied by the power controller, which must be always-on. The second possibility is to turn off the PGR only when both occupying modules are inactive at the same time. This requires a separate power control signal and depends on the behaviour of the two occupying modules.
Note also that the PGR at (2,0) is occupied by the power controller and it must be configured as always-on.

Based on the above discussion, a PG-aware placement might be conceived as a placement in which the number of PGRs that can be powered down is maximized. Previous works discussed in Subsection 2.5.3 try to achieve this through placement constraints [22] or by filling as many PGRs as possible with blocks that belong to the same module [97]. These concepts were borrowed from ASIC flows. In ASIC designs, a floorplan for a design that has power-gated domains or multiple voltages is created before placement and routing [121]. The power domains, or voltage areas, are fixed, rectilinear areas. This early design stage is employed to make it feasible to implement a specific design, mainly because the power gating circuitry and the power distribution network are relatively expensive to implement in ASICs, and need to be determined ahead of time. However, this is not necessarily true for power-gated FPGAs.

Figure 5.5: Illustrating the effect of placement on the always-on PGRs in a power-gated FPGA. (A 3 × 3 grid of PGRs, (0,0) through (2,2); legend: Module 1 (dynamically-controlled); Module 2 (dynamically-controlled); Module 3 (always-on); controller (always-on).)

Consider the example placement shown in Figure 5.6 (a). In this example there are three modules that have been placed on three PGRs. The number of PGRs that can be powered down has been maximized in this placement. Block M1.B2 has three connections that have sinks in module M3's PGR. These connections can only be routed through the routing resources that lie in module M2's PGR. In addition, some of the routing resources in module M3's PGR will also be used to route these connections to their sinks.
Therefore, these routing resources must be configured as always-on for the circuit to operate correctly.

A potentially better placement in terms of the number of always-on routing resources is shown in Figure 5.6 (b). In this example, although the PGRs of modules M1 and M3 have not been filled by blocks that belong to the same module, there is a possibility to reduce the number of always-on routing resources that lie in the PGR of module M2 and are required to route the connection from M1.B2 to its sinks in module M3. As has been demonstrated in Chapter 3, a switch block (SB) consumes about 70% of the leakage power of a tile. Therefore, trying to minimize the number of always-on SBs would lead to a reduction in the overall leakage power dissipation.

Figure 5.6: Three modules mapped to three PGRs showing how the placement affects the number of routing and logic resources that are configured as dynamically-controlled. (a) Placement that results in a large number of always-on routing resources and three PGRs that are configured as dynamically-controlled. (b) Placement that potentially results in a smaller number of always-on routing resources and one PGR that is configured as dynamically-controlled. In (a), the LCs in each PGR are configured as dynamically-controlled, but a large number of SB partitions will be required to be always-on to ensure proper circuit operation. In (b), the LCs of the PGRs of modules M1 and M3 cannot be controlled dynamically because the PGRs have a mix of blocks from other modules. However, fewer SB partitions in these PGRs are required to be always-on.

PG-Aware Placement

In this subsection, simple enhancements to the placement algorithm that address the issues discussed above are presented.
These enhancements focus on minimizing the number of resources that cannot be powered down (always-on resources). Unlike the algorithms discussed in Subsection 2.5.3, the target of our optimizations is the routing resources because the majority of leakage power is dissipated by these resources.

The following equation shows the modified cost function used to evaluate a placement solution

\[ Cost = (1 - \gamma)\,\underbrace{\big((1 - \tau)\,Cost_W + \tau\,Cost_T\big)}_{\text{Original cost function in VPR}} + \gamma\,Cost_{PGR} \tag{5.2} \]

where $Cost_W$ and $Cost_T$ are the wire length and timing costs, and $\tau$ is a constant between 0 and 1 that is used to trade off routability and timing of the placement. $\gamma$ is a weighting parameter between 0 and 1 that is used to trade off the original cost function and $Cost_{PGR}$. $Cost_{PGR}$ represents the cost of power gating. This parameter is used as a metric for the quality of the placement in terms of power gating (i.e., larger values of $Cost_{PGR}$ indicate more always-on resources).

$Cost_{PGR}$ is calculated as follows

\[ Cost_{PGR} = \sum_{net_i} \sum_{sink_{i,j}} NumAlwaysOnSBs(src(net_i), sink_{i,j}) \tag{5.3} \]

where $NumAlwaysOnSBs(src(net_i), sink_{i,j})$ is a function used to estimate the number of always-on SB partitions that may result when the connection between the source of $net_i$ and its sink $sink_{i,j}$ is routed.

The exact number of always-on SB partitions can be determined by routing the circuit for each placement solution. However, this is infeasible due to the excessive execution time of the algorithm. Therefore, we estimate $NumAlwaysOnSBs$ by assuming that a connection would have an x-y layout with at most one turn. Figure 5.7 shows an example of a connection between a source block in module M1 and a sink block in module M3.
In general, there will be either one or two routing paths; we consider both and count the number of always-on SB partitions along each path. Always-on SB partitions result when a connection is routed through an area in the chip that has its power state controlled by a different signal than that which controls the power state of the source module of the connection. Case 1 in the figure results in 4 always-on SBs, while case 2 results in 3 always-on SBs. The larger of these two cases is used in the cost function in Equation 5.3.

This requires knowing the coordinates of the different power-gated modules on a chip during the placement phase (i.e., finding a rough floorplan of the modules on chip). These coordinates change during the placement process. In each outer iteration of the placement algorithm, the information about each power-gated module's location is updated based on the blocks that occupy the PGRs of the chip.

Figure 5.7: An application circuit with three power-gated modules mapped to an FPGA. The x-y layout of a connection between blocks M1.B2 and M3.B4 during placement is used to estimate always-on SBs required to route the connection. The larger of the two cases (1 or 2) is assumed.

During early stages of the placement, the locations of blocks may be far from their eventual final locations due to the random nature of the simulated annealing algorithm. Therefore, we modify the function in Equation 5.3 to include an additional term that controls the importance of PGR cost based on the accuracy of the floorplan of power-gated modules.
The modification is shown below

\[ Cost_{PGR} = e^{-KT} \times \sum_{net_i} \sum_{sink_{i,j}} NumAlwaysOnSBs(src(net_i), sink_{i,j}) \tag{5.4} \]

where $K$ is a constant that is chosen empirically, and $T$ is the annealing temperature of the placement algorithm.

Equation 5.4 ensures that during the early stages of the placement (high annealing temperature $T$), $Cost_{PGR}$ is very small since the floorplan of the power-gated modules is not close to that of the eventual circuit placement. During later placement stages, the floorplan becomes closer to the final placement, and therefore the value of the exponential term becomes very close to 1.

5.2.2 PG-Aware Routing

Due to the complexity of the routing topology of multiple-module power-gated circuits, some SB partitions are required to be always-on. Figure 5.8 shows an example of three power-gated modules mapped to three PGRs. Two possible ways of routing the net from M1 to its sinks in M1 and M3 are shown (denoted R1 and R2). In both R1 and R2, SBs in M2 and M3 are required to be always-on to ensure proper operation. For example, when M2 is powered down, the SBs in M2's PGR that route the net need to be powered.

Figure 5.8: Three power-gated modules placed on three PGRs with two ways shown to route the same net. The SBs used to route the net in M2 and M3 must be always-on (ON), other resources can be dynamically-controlled (DC). (Legend: DC SB; ON SB; DC LC; ON SB for R2, otherwise DC; ON SB for R1, otherwise DC; Route 1 (R1); Route 2 (R2).)

In Figure 5.8, R2 has a smaller number of always-on SBs (larger number of always-off and dynamically-controlled SBs) compared to R1, which improves the power savings during idle periods. In this subsection, we present the enhancements made to the router in order to increase the number of SBs that can be powered down. We modified the timing-driven router in VPR to implement these enhancements.

VPR uses the Pathfinder negotiated congestion-delay router [46].
The routing resources are represented by a routing-resource graph. In this graph, nodes represent wire segments and logic block pins, and edges represent switches. In the inner loop of the algorithm, when searching for a route from a source node to a sink node, nodes of the graph are visited and added with a cost value (path cost) to a priority queue. These nodes are used later to iteratively investigate other nodes connected to them until the sink is reached. The path cost to reach a node from the source of the net is the sum of the costs of nodes in that path. The function that is used to measure the cost of using a node has timing and congestion terms as in Equation 5.5 (from [46]), where $Crit_{ij}$ is the timing criticality of the connection (i, j), $T(n)$ is the timing cost of the route to reach node n from the source, and $Congs(n)$ is the congestion cost of using node n.

$$ Cost(n) = Crit_{ij} \times T(n) + (1 - Crit_{ij}) \times Congs(n) \tag{5.5} $$

For the PG-aware router, we modified the congestion term of the cost function as in the following equation:

$$ Congs(n) = Congs(n)_{old} \times (1 + Cost_{partition}) \tag{5.6} $$

where $Congs(n)_{old}$ is the original congestion cost, and $Cost_{partition}$ is used to represent the cost of using a specific power-gated SB partition. $Cost_{partition}$ is calculated based on the power status for the PGR of the SB partition, and its current occupancy.

The following function is used to calculate $Cost_{partition}$:

$$ Cost_{partition} = \begin{cases} K, & \text{if } \delta_{PGR,net} = 0 \\ -L, & \text{if } \delta_{PGR,net} = 1 \end{cases} \tag{5.7} $$

where K and L are weighting parameters determined empirically, and $\delta_{PGR,net}$ is a binary function that has the value of 1 if the connection being routed belongs to the same module as that of the node's PGR, and 0 otherwise.
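As a rough illustration of how Equations 5.5 to 5.7 combine, the following minimal sketch computes the PG-aware cost of a single routing-resource node. The function and parameter names are hypothetical and do not correspond to identifiers in VPR; module membership is assumed to be available as a simple label for both the net and the node's PGR, and the default values of K and L are the ones reported in this chapter.

```python
def partition_cost(net_module, pgr_module, k=3.0, l=0.2):
    """Cost_partition (Equation 5.7): -L if the net belongs to the
    module that owns the node's PGR (delta = 1), K otherwise (delta = 0)."""
    same_module = net_module == pgr_module
    return -l if same_module else k


def pg_aware_congestion(cong_old, net_module, pgr_module, k=3.0, l=0.2):
    """PG-aware congestion term (Equation 5.6)."""
    return cong_old * (1.0 + partition_cost(net_module, pgr_module, k, l))


def node_cost(crit, timing_cost, cong_old, net_module, pgr_module,
              k=3.0, l=0.2):
    """Node cost (Equation 5.5) using the PG-aware congestion term."""
    cong = pg_aware_congestion(cong_old, net_module, pgr_module, k, l)
    return crit * timing_cost + (1.0 - crit) * cong


# A node inside the net's own module versus a node inside a foreign module.
own = pg_aware_congestion(1.0, net_module="M1", pgr_module="M1")
foreign = pg_aware_congestion(1.0, net_module="M1", pgr_module="M2")
```

With these defaults, a node in the net's own PGR has its congestion cost scaled by 1 - L, while a node in a foreign PGR has it scaled by 1 + K. For a fully timing-critical connection (criticality of 1) the congestion term vanishes, so the PG-aware adjustment has no effect on that connection.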
Each wire segment (node) in an FPGA architecture is driven by an SB switch; we consider a node to belong to a PGR if the switch driving it belongs to an SB in that PGR.

$Cost_{partition}$ is used to change the weight given to the congestion cost of the node being investigated. If the module of the node's PGR is the same as that of the net being routed, then the congestion cost is decreased (scaled by 1 - L) to encourage routing through the node. Routing nets that belong to the same module as that of the PGR through an SB partition in the PGR enables configuring the partition as dynamically-controlled. On the other hand, if the net does not belong to the same module as that of the node's PGR, then the cost is increased (scaled by 1 + K) to discourage routing the net through that node; if the net is routed through that node, then the SB must be set as always-on.

We found that L = 0.2 and K = 3 give good results with negligible timing overhead (less than 1% on average). Larger values of L may result in circuits that cannot be routed because the router will not be able to resolve congestion. Larger values of K may result in a large congestion cost, which may negatively impact the critical path delay.

5.3 Benchmark Circuits

This section describes the benchmark circuits used for evaluation. The benchmark circuits include synthetic multi-module circuits and a robot control system. Details are provided in the following subsections.

5.3.1 Synthetic Benchmarks Generation

We used the largest 20 Microelectronics Center of North Carolina (MCNC) benchmark circuits available with the VPR download [105] as sub-circuits (or modules) in the generated circuits. Each of the generated circuits is composed of two or more modules (up to nine), connected to each other using the primary I/Os of the sub-circuits.
Table 5.1 shows the details of the generated circuits. The modules in each circuit are connected together after performing the packing phase using T-VPack [122], i.e., after each circuit's LUTs and FFs are grouped in LCs. This guarantees that the subcircuits used in stitching closely represent independent functional modules in an application.

Table 5.1: Generated synthetic circuits.

Circuit  Modules                                                         # LCs
c2_1     elliptic, s38417                                                1498
c2_2     clma, s298                                                      1625
c3_1     s298, elliptic, s38417                                          1833
c3_2     diffeq, frisc, s38584.1                                         1705
c4_1     ex5p, s298, apex2, seq                                          1104
c4_2     spla, diffeq, misex3, s38417                                    2106
c5_1     apex4, tseng, seq, pdc, clma                                    2709
c5_2     s38417, ex5p, diffeq, ex1010, s298                              2536
c6_1     seq, tseng, apex4, pdc, spla, misex3                            2275
c6_2     spla, ex5p, s298, seq, apex2, ex1010                            2504
c7_1     misex3, spla, s38417, pdc, seq, tseng, apex4                    3312
c8_1     clma, ex1010, s298, spla, misex3, spla, apex2, seq              4473
c9_1     s298, apex2, seq, alu4, elliptic, tseng, s38417, apex4, ex5p    3219

We assume that the power state for each module in the circuits of Table 5.1 can be dynamically controlled. Thus, a power controller that has an output power control signal for each module is generated for each circuit. Timers are used to generate the power control signals in a power controller. A timer is sized assuming the sleep signal of a module is asserted after 50 ms. This amount has been chosen arbitrarily. Longer timer periods may increase the number of resources required to implement the controller circuit on an FPGA. Notice that more sophisticated power controllers can be implemented in the synthetic circuits. However, the goal of our study is to use the power controller circuits to evaluate the number of resources (especially routing resources) that are occupied by power control signals.
This provides an estimate of the FPGA resources that will be in the different power states, and hence the potential power savings by using the DCPG FPGA architecture.

5.3.2 HLS Benchmark Circuits

We used LegUp [123], a high-level synthesis (HLS) tool, to generate a number of circuits that we use in our experiments. A number of CHStone benchmark applications [124] were used as inputs to the LegUp HLS tool. LegUp generates a hardware accelerator that represents each function call in an input program. By analyzing the execution and idle times of each accelerator in each application, we found that the circuits in Table 5.2 have accelerators that can be power-gated dynamically at run-time. Table 5.2 provides a summary of each circuit, the sizes of its power-gated modules, and the size of its always-on parts.

Table 5.2: HLS circuits.

Circuit      # Always-on LCs  # Power-gated modules LCs
BF           749              M1:866, M2:118
MPEG2        199              M1:510, M2:441
JPEG         5258             M1:710, M2:4156, M3:2563
AES          2274             M1:1360, M2:390, M3:390
SHA_DRIVER   549              M1:1185, M2:2001

Figure 5.9: Snake robot example application. Left: snake robot inside organ. Right: block diagram of the control system (Condition, Delta, and Min modules, with inputs, outputs, and a select signal).

5.3.3 Robot Control System

The application presented in this subsection is used to evaluate the proposed architecture. This application represents a control system for a snake-shaped robot, called iSnake, that is used in endoscopy [125]. The left side of Figure 5.9 shows the robot inside an organ, in two different states.

The robot's control system provides haptic feedback to the surgeon to prevent harming the patient's organs during an operation. A proximity query (PQ) algorithm is used to approximate the distance between iSnake and the surface of the patient's organ [125], which is computationally intensive and requires a high-performance implementation.

We developed an FPGA-based implementation of the control system. The right side of Figure 5.9 shows the main modules in the system.
The datapath performs stream processing for input data. The Delta module is only activated when the robot touches the organ's surface. This module can be put in sleep mode when its output is not required. The select output from the Condition module can be used as a power control signal for the Delta module.

The FPlibrary [126] was used to implement the required floating point operators in the PQ algorithm. Quartus II was used to generate a technology-mapped netlist of the circuit [127]. The generated netlist was then annotated with information about the modules of each circuit component. A modified version of T-VPack was then used to pack the circuit; this version ensures that each module's blocks and FFs are clustered in the same LCs, thus generating a netlist that contains three interconnected modules.

Table 5.3 shows information about the robot control system. The size of the Delta module is about 25% of the system, indicating that properly managing its power state may result in a large energy reduction.

Table 5.3: Information about FPGA-based iSnake robot control system.

Module Name  # LCs   LCs %   Potential Power State
Condition    25342   75.16   always-on
Delta        8352    24.77   dynamically-controlled
Min          22      0.07    always-on

5.4 Impact of PG-Aware Placement and Routing

This section provides a detailed study of the impact of the physical-level CAD tools on the number of power-gated resources and the leakage power reduction in FPGAs that support dynamic power gating. The focus of this section is to study the different techniques used in the placement for power-gated FPGAs. Further, we show that the simple enhancements for placement and routing presented in Subsections 5.2.1 and 5.2.2 can be used to optimize the power savings in power-gated FPGA architectures.

In addition to the baseline unmodified placement and routing algorithms, we study four different approaches. Table 5.4 summarizes these approaches.
The details of the Li and Constraints approaches have been described in Section 2.5.3, and the details of the PG-aware placement and routing (P&R) have been described in Section 5.2. Table 5.5 shows the values for the parameters used in Li and the PG-aware P&R techniques. Note that Li's technique in [97] uses γ = 0.1, while we use γ = 0.05 in our experiments because we found that with γ = 0.1 many circuits cannot be routed. The experiments in [97] were performed using circuits with only two power-gated modules that have no connections between them. In contrast, our experiments are performed using circuits with multiple power-gated modules that are interconnected as described in Subsection 5.3.1.

Table 5.4: Summary of the mapping techniques studied in this section.

Approach          Summary
Baseline          The unmodified placement and routing algorithms in VPR
Li [97]           Modified placement algorithm that tries to increase the number of power-gated PGRs
Constraints [22]  Imposing placement constraints on each power-gated module to be placed in a rectangular area on the chip
PG-aware P&R      The enhancements to the placement and routing algorithms shown in Section 5.2
Constr+P&R        Constraints combined with PG-aware P&R

Each circuit is mapped to a minimum-sized array of tiles that can fit the circuit. The target architecture has a cluster size N = 6, and a routing channel width that is 20% larger than the minimum channel width required to route the circuit. A routing wire segment has a length of four tiles (L = 4). The remaining basic FPGA architecture parameters (Fcin, Fcout, Fs, etc.) and settings (voltage, temperature, etc.) are similar to those used in Chapter 3. The SBs power gating granularity is PPS = 0, which means coarse-grained SBs power gating, and the PGR size is 4x4 tiles.
Note that the mapping techniques would have less impact for finer granularities.

Table 5.5: Parameter values for Li's and PG-aware placement and routing techniques.

Approach            Parameters
Li                  γ = 0.05
PG-aware Placement  γ = 0.03, K = 20
PG-aware Routing    K = 3, L = 0.2

5.4.1 Impact on Resources' Power States

We first provide analysis of the impact of the different mapping techniques on the percentage of FPGA resources that are configured in the different power states. This is directly related to the potential power savings of the DCPG FPGA architecture using the different mapping techniques.

Figure 5.10: The percentage of logic clusters (LCs) configured as dynamically-controlled for different mapping techniques. (a) Synthetic circuits. (b) HLS circuits.

Figure 5.10 shows the percentage of dynamically-controlled logic clusters (LCs) for the different mapping techniques. The results show that for the synthetic circuits (Figure 5.10 (a)) with a small number of power-gated modules (c2_1 to c3_2), none of the techniques is clearly superior to others. However, for circuits with a larger number of modules, the Constraints technique improves the number of dynamically-controlled LCs. The Constraints technique also produces the largest number of dynamically-controlled LCs in the HLS circuits (Figure 5.10 (b)). The HLS circuits have a relatively small number of modules (up to 4 modules). The effectiveness of the Constraints technique can be explained as follows.
Using the Baseline placement, the PGRs on the borders of a power-gated module must be configured as always-on because they contain blocks from different bordering modules. Using constraints helps in encouraging bordering PGRs to be filled by blocks that belong to the same module, and thus the LCs in these PGRs are configured as dynamically-controlled. From these results, we conclude the following:

Trying to improve the number of dynamically-controlled LCs by using a placement cost function that also balances timing and wire-length, as in Li's technique, does not help. Using placement Constraints is the best approach to do this.

Figures 5.11 and 5.12 show the impact of the different mapping techniques on the percentage of always-on switches for the synthetic and the HLS circuits, respectively. In addition to showing the percentage of SB switches that are configured as always-on, the figures also show the amount by which each mapping technique reduces the number of always-on SB switches. As can be seen in the figures, all techniques, except the Li technique, consistently reduce the number of always-on switches for all circuits. For example, the Constraints technique reduces always-on SB switches by up to 47% (circuit c2_2) and the PG-aware P&R reduces the always-on switches by up to 45% (circuit c2_1). The Li technique does not try to reduce the number of always-on SB switches. Rather, it tries to increase the number of PGRs that are dynamically-controlled using a new placement cost function. As explained above, using such a placement cost function does not help in building a larger number of dynamically-controlled PGRs, and hence would not help in reducing the number of always-on SB switches.

Using the Constraints technique results in large reductions in the number of always-on switches in many circuits.
This is because it places the blocks of a module within a rectangular area in the target chip; this helps in potentially increasing the number of PGRs filled by LCs of the same module as explained above, and hence the number of SB switches that can also be configured as dynamically-controlled.

Figure 5.11: The effect of the different mapping techniques on reducing the always-on SB switches for the synthetic benchmark circuits. (a) Always-on SB switches. (b) Reduction in always-on switches.

The PG-aware P&R technique does fairly well in reducing the number of always-on switches for circuits with a small number of power-gated modules because the cost function in the placement and routing was designed to do so. For larger circuits, the technique does not show large reductions in the always-on SB switches compared to the Constraints technique. We attribute this to the complexity of the routing for these circuits. However, this technique has no negative impact on the number of always-on switches as can be seen in the figures. Combining both the Constraints technique and PG-aware P&R results in the largest reduction in always-on SB switches in almost all circuits.

Figure 5.12: The effect of the different mapping techniques on reducing the always-on SB switches for HLS benchmark circuits. (a) Always-on SB switches. (b) Reduction in always-on switches.

5.4.2 Impact on Performance

Figure 5.13: The effect of the different mapping techniques on the critical path delay (Tcrit). (a) Effect on Tcrit for synthetic circuits. (b) Effect on Tcrit for HLS circuits.

Figure 5.13 shows the impact of the different mapping techniques on the critical path delays of the different circuits. On average, there is no impact on performance for both the HLS and synthetic circuits. However, some circuits experience performance degradation while others experience performance improvement; we attribute this to the noise in the algorithms. For example, using the Constraints technique causes the critical path of c7_1 to increase by about 10%, and using the Li technique increases the critical path for c5_1 by about 9%. The PG-aware P&R technique results in a performance degradation of up to 5%. For the HLS circuits, the maximum performance degradation caused by the different techniques is less than 2%.

5.4.3 Impact on Leakage Power

Figures 5.14 and 5.15 show the leakage power reduction for each of the mapping techniques. The techniques that can better reduce the number of always-on SB switches achieve more power savings. For circuits with a small number of power-gated modules (up to three modules), PG-aware P&R results in the best power savings, and for circuits with a larger number of power-gated modules, using constraints results in the best power savings. Notice that the power reductions in HLS circuits are relatively low compared to the synthetic circuits.
This is because every HLS circuit contains a relatively large always-on module as described in Table 5.2, while all the modules in the synthetic circuits are power-gated.

Figure 5.14: The effect of the different mapping techniques on the power reduction of the synthetic benchmark circuits.

Figure 5.15: The effect of the different mapping techniques on the power reduction of the HLS circuits.

Tables 5.6 and 5.7 show the average leakage power in the ON and OFF states using the different mapping techniques. All techniques roughly have the same amount of ON leakage power. However, the OFF leakage power using Constraints combined with PG-aware P&R is the smallest. For the synthetic circuits, the OFF leakage power can be reduced by about 16% using these combined techniques compared to the baseline technique, while in the HLS circuits the OFF leakage power can be reduced by about 7%.

Both the PG-aware P&R technique and the Constraints technique result in similar reduction in the average OFF leakage power (about 9% for the synthetic circuits and 4% for the HLS circuits). The Li technique does not help in reducing the average OFF leakage power. One important aspect of the Constraints technique (and the combined Constr+P&R technique) is that it requires one extra design step in the design flow. In our experiments, determining the proper placement constraints was done manually by visually observing how VPR would place a circuit (without constraints), and then deriving a constraints file for the circuit that roughly resembles the placement done by VPR. This may be impractical to do manually especially for designs with a large number of blocks and power-gated modules.
The automation of this step has been investigated in the literature and found to generate good timing results while improving CAD tools execution time [128].

Table 5.6: Average ON and OFF power for all synthetic benchmark circuits for the different mapping techniques.

            Baseline  Li     Constraints  PG-aware P&R  Constr+P&R
PON (mW)    33.98     34.0   34.83        34.01         34.38
POFF (mW)   12.32     12.28  11.12        11.31         10.41

Table 5.7: Average ON and OFF power for HLS circuits for the different mapping techniques.

            Baseline  Li     Constraints  PG-aware P&R  Constr+P&R
PON (mW)    84.82     84.81  84.80        84.78         84.76
POFF (mW)   60.62     60.47  58.44        58.28         56.39

5.4.4 Impact on Congestion

The presented placement and routing enhancements in Section 5.2 trade off congestion for better power gating results. In this subsection, we study the impact of the enhancements on the congestion of the circuits. Table 5.8 compares the Baseline and the PG-aware P&R in terms of the minimum channel width required to route each of the synthetic benchmark circuits. On average, the minimum channel width using PG-aware P&R is increased by 3.2%. In most cases, there is no significant increase in the minimum channel width, which indicates that the PG-aware P&R does not have a significant impact on congestion.

Unlike the other circuits in Table 5.8, c5_1 and c9_1 show a significant increase in their minimum channel width. This is attributed to the experimental methodology used in the VPR tool. The VPR tool searches for the minimum channel width needed to route a circuit by increasing the channel width by steps of 24. This may result in skipping some of the valid channel widths, and therefore overestimating the minimum channel width required to route a circuit.

5.4.5 CAD Execution Time

The execution time overhead for the PG-aware P&R enhancements presented in Section 5.2 can be large. The execution time increases with the increase in the number of power-gated modules in a circuit.
We found that it can reach 3X for the largest circuit (c9_1) that has nine power-gated modules. This overhead is mostly due to the placement enhancements because the placement algorithm checks in each swap whether the connections to the blocks under consideration might result in always-on SBs because they pass through a specific power-gated module. In order to reduce the execution time overhead, we studied a simplified version of the PG-aware placement algorithm. The results obtained using the simplified version are roughly similar to those obtained using the original algorithm presented in Subsection 5.2.1. The average power reduction using the simplified version is 68% while it is 67.7% using the original version.

Table 5.8: Comparing Baseline and PG-aware P&R required minimum channel width to route the synthetic benchmark circuits.

Circuit  Baseline  PG-aware P&R  Increase (%)
c2_1     48        48            0.0
c2_2     60        58            -3.3
c3_1     44        46            4.5
c3_2     50        48            -4.0
c4_1     50        52            4.0
c4_2     56        56            0.0
c5_1     66        74            12.1
c5_2     56        56            0.0
c6_1     68        66            -2.9
c6_2     56        60            8.0
c7_1     68        66            -2.9
c8_1     62        60            -3.2
c9_1     56        74            32.1
Average  56.9      58.8          3.2

The following simplifications are done to the original algorithm to reduce its execution time. The first simplification is done by estimating the number of always-on SBs for a connection by considering one x-y layout of a connection rather than checking the two possible x-y connections. The second simplification is done by finding the number of always-on SBs for only one sink of a net and multiplying the result by the number of the net's sinks rather than finding the number of always-on SBs for each sink of the net.
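The difference between the original and the simplified PGR cost can be sketched as follows. This is only an illustrative model under stated assumptions, not the code used in VPR: the chip is abstracted as a grid of tiles, `owner` is a hypothetical map from each tile to the module that controls its PGR, one SB is counted per tile along an L-shaped (x-y) route, and a tile counts as always-on when its owner differs from the source module. The helper `annealing_weight` reproduces only the exponential factor of Equation 5.4, with the value of K reported for PG-aware placement.

```python
import math


def xy_path(src, sink, horizontal_first):
    """Tiles visited by an L-shaped route from src to sink (endpoints included)."""
    (x0, y0), (x1, y1) = src, sink
    step_x = 1 if x1 >= x0 else -1
    step_y = 1 if y1 >= y0 else -1
    xs = list(range(x0, x1 + step_x, step_x))
    ys = list(range(y0, y1 + step_y, step_y))
    if horizontal_first:
        return [(x, y0) for x in xs] + [(x1, y) for y in ys[1:]]
    return [(x0, y) for y in ys] + [(x, y1) for x in xs[1:]]


def always_on_sbs(owner, src, sink, both_layouts=True):
    """Estimated always-on SBs for one connection.

    both_layouts=True takes the larger of the two x-y layouts (original
    algorithm); False checks a single layout (first simplification)."""
    src_module = owner[src]

    def count(horizontal_first):
        path = xy_path(src, sink, horizontal_first)
        return sum(1 for tile in path if owner[tile] != src_module)

    if both_layouts:
        return max(count(True), count(False))
    return count(True)


def cost_pgr(owner, nets, simplified=False):
    """PGR cost summed over nets; each net is (source tile, list of sink tiles)."""
    total = 0
    for src, sinks in nets:
        if simplified:
            # Second simplification: first sink only, scaled by the sink count.
            total += len(sinks) * always_on_sbs(owner, src, sinks[0],
                                                both_layouts=False)
        else:
            total += sum(always_on_sbs(owner, src, s) for s in sinks)
    return total


def annealing_weight(temperature, k=20.0):
    """Exponential factor of Equation 5.4: near 0 at high T, 1 at T = 0."""
    return math.exp(-k * temperature)


# Hypothetical 3x2 floorplan with three modules and one two-sink net.
owner = {(0, 0): "M1", (1, 0): "M2", (2, 0): "M3",
         (0, 1): "M1", (1, 1): "M1", (2, 1): "M3"}
nets = [((0, 0), [(1, 1), (2, 1)])]
original = cost_pgr(owner, nets)
simplified = cost_pgr(owner, nets, simplified=True)
```

In this toy floorplan the two estimates differ, which shows that the simplified cost is an approximation of the original one; the simplified version evaluates one layout for one sink per net instead of two layouts for every sink, which is where the execution time saving comes from.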
In the second simplification, Equation 5.3 is modified as follows:

$$ Cost_{PGR} = \sum_{net_i} numSinks(net_i) \times NumAlwaysOnSBs(src(net_i), sink_{i,1}) \tag{5.8} $$

where $numSinks(net_i)$ is the number of sinks for net i, and $sink_{i,1}$ is the first sink of net i.

Table 5.9 shows the total execution time for the Baseline and the simplified PG-aware P&R. The average execution time overhead is 63%, and the largest overhead is 99% for circuit c9_1. These simplifications reduced the execution time overhead of the original PG-aware P&R by about 50% whilst maintaining similar quality of results. As in commercial tools, we expect that PG-aware P&R would be used as an option in a CAD tool to allow users to trade off execution time and power savings.

Table 5.9: Execution time (seconds) of the Baseline and the PG-aware P&R for the synthetic benchmark circuits.

Circuit  Baseline  Simplified PG-aware  Overhead (%)
c2_1     101.81    138.88               36
c2_2     124.12    180.59               45
c3_1     130.95    189.32               45
c3_2     196.81    312.66               59
c4_1     88.87     137.71               55
c4_2     162.44    259.51               60
c5_1     384.04    615.62               60
c5_2     219.15    350.17               60
c6_1     202.28    348.30               72
c6_2     230.13    380.32               65
c7_1     421.57    743.64               76
c8_1     588.82    1087.60              85
c9_1     423.31    841.48               99
Average  251.87    429.68               63

5.5 Granularity Study of the Power Gating Architecture

The granularity of the DCPG FPGA architecture presented in Chapter 3 affects the power savings of application circuits mapped to the architecture. For a given application circuit mapped to a DCPG FPGA, the amount of FPGA resources that are configured in the different power states depends on the architecture's granularity. The granularity of the DCPG FPGA architecture is determined by two architecture parameters, the PGR size and the number of power-gated SB partitions per side (PPS).
In this section we study both granularity parameters when application circuits are mapped to the DCPG FPGA architecture.

In order to understand the effect of the PGR granularity on power savings, consider the example in Figure 5.16 that shows two modules mapped to an FPGA that has 16 LCs. In Figure 5.16 (a), both modules are placed on one PGR that has a size of 4x4 tiles. If the two modules have distinct idle times, then the PGR must be configured as always-on for the circuits to function correctly. In Figure 5.16 (b), the modules are placed on an architecture with a PGR size of 2x2 tiles. In this case, all the PGRs are configured as dynamically-controlled, which enables powering down each module when it is idle. This leads to large energy savings compared to the first case.

Figure 5.16: Two modules placed on a DCPG FPGA. (a) One PGR of size 4x4 tiles is configured as always-on. (b) Four PGRs of size 2x2 tiles are configured as dynamically-controlled.

The granularity of SBs power gating also has a significant impact on power savings when application circuits are mapped to the DCPG FPGA architecture. Consider the example in Figure 5.17 that shows a module (M1) mapped to a PGR of size 2x2 tiles, and a signal from module M2 is routed through two SBs in the PGR. In Figure 5.17 (a), because PPS = 0, the SBs that are used to route the signal from M2 are configured as always-on to ensure that the signal is available even when module M1 is in sleep mode. In Figure 5.17 (b), because PPS = 1 (i.e., each power-gated partition contains 1/4 of an SB's switches), only two SB partitions are configured as always-on to ensure that M2's signal is available all the time. Many SB partitions are configured as always-off because they are not used.
This leads to significant power savings compared to the case where PPS = 0.

Figure 5.17: One module (M1) placed on a PGR of size 2x2 tiles. A signal from module M2 is routed through two SBs in the PGR. (a) PPS = 0, resulting in two SBs configured as always-on, one SB configured as dynamically-controlled, and one SB configured as always-off. (b) PPS = 1, resulting in two SB partitions configured as always-on, two partitions configured as dynamically-controlled, and 12 partitions configured as always-off.

The study in this section assumes that all of the modules in a benchmark circuit can be powered down during their idle periods. Real applications, however, can include modules that are always-on in addition to the power-gated modules. Nevertheless, in both cases similar techniques would be used to map applications to power-gated FPGAs, which means that the trends for FPGA resource utilization will be similar.

The synthetic circuits presented in Subsection 5.3.1 are mapped to DCPG FPGAs with different power gating granularities using the CAD flow presented in this chapter. The results in this section are averages over all benchmark circuits. We use architecture parameters similar to those described in the previous section, except for power gating granularity, which is varied in our experiments.

5.5.1 Breakdown of Resources' Power States

As stated earlier, the power savings of the DCPG FPGA architecture depend on the number of FPGA components that are configured in the different power states after application mapping, and this depends on the granularity of the power gating architecture. In this subsection, we report the percentages of PGR components and SB switches that can be configured in the different power states.
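To make the dependence on PGR size concrete, the following minimal sketch classifies PGRs for a given placement, in the spirit of the example of Figure 5.16. It is a simplified model under stated assumptions rather than the flow used in our experiments: the placement is a hypothetical 2-D grid of module labels (None for an empty tile), every module is assumed to be power-gated with its own idle times, and routing through the PGR is ignored.

```python
def classify_pgrs(placement, pgr_size):
    """Return {(pgr_row, pgr_col): state} for square PGRs of pgr_size tiles.

    A PGR with no blocks is always-off, a PGR holding blocks of a single
    module is dynamically-controlled, and a PGR shared by several modules
    must stay always-on."""
    rows, cols = len(placement), len(placement[0])
    states = {}
    for r0 in range(0, rows, pgr_size):
        for c0 in range(0, cols, pgr_size):
            modules = {
                placement[r][c]
                for r in range(r0, min(r0 + pgr_size, rows))
                for c in range(c0, min(c0 + pgr_size, cols))
                if placement[r][c] is not None
            }
            if not modules:
                state = "always-off"
            elif len(modules) == 1:
                state = "dynamically-controlled"
            else:
                state = "always-on"
            states[(r0 // pgr_size, c0 // pgr_size)] = state
    return states


# Two modules on a 4x4 array of LCs, each occupying one half of the chip.
placement = [["M1", "M1", "M2", "M2"] for _ in range(4)]
coarse = classify_pgrs(placement, pgr_size=4)  # one 4x4 PGR
fine = classify_pgrs(placement, pgr_size=2)    # four 2x2 PGRs
```

For this placement, the single 4x4 PGR is shared by both modules and is classified as always-on, while each of the four 2x2 PGRs holds blocks of only one module and is classified as dynamically-controlled. Counting the states returned for each granularity gives a breakdown of the kind reported in this subsection.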
For PGRs, we report both the internal LCs of a PGR and the total RCs that include internal and border RCs (see Section 3.2 for details of these components). For SBs, we report the number of switches. A multiplexer followed by a buffer in an SB is considered a switch. An SB power-gated partition may contain one or more switches based on the partition size.

SB Switches

Figure 5.18: The percentage of SB switches in the different power states, averaged over all synthetic benchmarks mapped to architectures with different power gating granularities. (a) Always On. (b) Always Off. (c) Dynamically Controlled. (d) Always Off + Dynamically Controlled. In each panel, the x-axis is the PGR granularity (1x1, 2x2, 3x3, and 4x4 tiles) and the bars are the SB gating granularities (PPS = Coarse, 1, 2, 3, 4, and Fine).

Figure 5.18 shows the breakdown of SB switches that are configured in the different power states for different architecture granularities after the benchmark circuits were mapped to the DCPG FPGA architecture. The x-axis is the PGR granularity in terms of the number of tiles of a PGR, and the different bars in each sub-figure are for different SB power gating granularities. This represents 20 architectures that have different power gating granularities.

First, we study the effect of the PGR granularity on the percentage of always-on SB switches (Figure 5.18 (a)). For a PGR size of 1x1 tile, the percentage of always-on SB switches is larger than that for larger PGR sizes. For example, consider the point at PPS = Coarse. When the PGR size is 1x1 tile, 51% of SB switches are configured as always-on on average (ranging between 34% and 64% for individual circuits). Larger PGR sizes result in a smaller percentage of always-on switches (42% for PGR size 2x2 tiles, and 40% for PGR sizes 3x3 and 4x4 tiles). The large percentage of always-on SB switches when the PGR size is 1x1 tile is attributed to the fact that each LC that is configured as dynamically-controlled would require a power control signal to be routed to it. This means that a large number of SBs need to be configured as always-on in order to keep the power control signal powered all the time.

Figure 5.18 (c) shows the switches that are configured as dynamically-controlled. Larger PGRs result in a larger percentage of SB switches that are configured as dynamically-controlled. For example, consider the case of PPS = Coarse. When the PGR size is 1x1 tile, the percentage of SB switches configured as dynamically-controlled is 48%. Larger PGR sizes result in a larger percentage of dynamically-controlled switches (57-59% for PGR sizes between 2x2 and 4x4 tiles).

Second, we study the effect of the granularity of SBs power gating (PPS) on the number of switches configured in the different power states. This is shown in Figure 5.18 (the bars in the figure indicate PPS). As the granularity becomes finer (larger PPS), the percentage of always-on SB switches decreases as shown in Figure 5.18 (a). For example, for a PGR size of 1x1 tile, the percentage of always-on SB switches ranges from 9% for fine-grained power gating to 51% for coarse-grained power gating. The small percentage of always-on SB switches when PPS = Fine is a result of the large number of unused switches that are configured as always-off. The large percentage of always-on SB switches for coarse-grained power gating results because if an SB is used, then its power configuration must be either always-on or dynamically-controlled.
Many SBs might be used to route critical signals (power control signals or inter-module signals), thus their power state must be configured as always-on.

The percentage of always-off switches for fine-grained SBs power gating (Figure 5.18 (b)) is more than 50% of the switches. The ON leakage power is lower for finer granularity SBs power gating because a larger number of switches can be powered down at configuration time. This will be demonstrated in Subsection 5.5.2.

An interesting observation is related to the percentage of switches that are configured as dynamically-controlled, as shown in Figure 5.18 (c). The largest percentage of dynamically-controlled switches results when PPS is 1 and 2, regardless of the PGR size. SBs power gating with finer granularity than this (larger PPS) results in a decrease in the number of dynamically-controlled switches. This is because a larger number of switches can be configured as always-off since they belong to SB partitions that are not used.

[Figure 5.19 plots omitted. Panels: (a) Always On, (b) Always Off, (c) Dynamically Controlled, (d) Always Off + Dynamically Controlled. x-axis: PGR granularity (tiles): 1x1, 2x2, 3x3, 4x4. y-axis: LCs & RCs (%). Series: LCs, RCs.]

Figure 5.19: The percentage of LCs and RCs in the different power states, averaged over all synthetic benchmark circuits mapped to architectures with different power gating granularities.

LCs and RCs

Figure 5.19 shows the percentage of PGRs (LCs and RCs) in the different power states for different PGR granularities after the benchmark circuits are mapped to the DCPG FPGA architecture. These are averages over all benchmark circuits.
As can be expected, finer granularity architectures (smaller PGR sizes) result in a smaller number of resources that are configured as always-on. This leads to an overall increase in the number of LCs and RCs that are configured as dynamically-controlled. The only exception to this is the large percentage of always-on border RCs for a PGR of size 1x1 tile. At this PGR size, a large number of border RCs are used to route the power control signal to an input pin in each PGR.

The always-on LCs range between 7-21% for PGR sizes between 1x1 and 4x4 tiles. LCs that are configured as dynamically-controlled range from 91% down to 79% for PGR sizes between 1x1 and 4x4 tiles. The number of resources that are configured as always-off is small (less than 3% of the LCs and less than 3% of the RCs) because each benchmark circuit was mapped to the smallest FPGA chip to which it can fit. We can see in Figure 5.19 (c) that the largest percentage of dynamically-controlled RCs is when the PGR size is 2x2 (86% of the RCs).

5.5.2 Leakage Power

Figure 5.20 shows the leakage power for SBs averaged over all benchmark circuits. The results are shown for different power gating granularities of PGRs and SBs. The ON leakage power is measured when all routing resources that are configured as dynamically-controlled are powered on, and the OFF leakage power is measured when all routing resources that are configured as dynamically-controlled are powered down. The ON leakage power is roughly constant over different PGR granularities. However, the OFF leakage power is higher for finer PGR granularities. This is because smaller PGR sizes result in a larger number of always-on switches as demonstrated in the previous subsection.

The largest ON leakage power is observed when the coarse-grained SB power gating is used (PPS = Coarse).
At this granularity, each SB represents one power-gated unit, and it must not be configured as always-off if any of its switches are used for routing. Only 1% of the SB switches can be powered down at configuration time when coarse-grained SB power gating is used, as demonstrated in the previous subsection. Although architectures with a finer SBs power gating granularity have a higher leakage power overhead as demonstrated in Chapter 3, some of the SB partitions are powered down at configuration time (always-off) because they are not used for routing application circuit signals, which results in a reduction in the net ON leakage power. In the results, the ON leakage power when PPS = 1 is roughly equal to the leakage power of the ungated architecture. The ON leakage power goes down for power gating granularities that are finer than PPS = 1.

[Figure 5.20 plots omitted. Panels: (a) SBs Leakage Power (mW), with ON and OFF series for PPS = Coarse, 1, 2, 3, 4, Fine, plus the ungated architecture; (b) SBs Leakage Power Reduction (%), with one series per SB gating granularity. x-axis: PGR granularity (tiles): 1x1, 2x2, 3x3, 4x4.]

Figure 5.20: Leakage power results for SBs averaged over all synthetic benchmark circuits mapped to architectures with different power gating granularities. ON leakage power is measured when all dynamically-controlled resources are powered. OFF leakage power is measured when all dynamically-controlled resources are put in sleep mode.

For the OFF leakage power, SB power gating granularities (PPS) of 2, 3, and 4 result in the lowest SB leakage power (largest reduction in Figure 5.20(b) of up to 76% when PPS = 3 compared to an ungated architecture).
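The reduction percentages quoted here are presumably computed relative to the leakage of the ungated architecture. Under that assumption (the symbols below are introduced for illustration and do not appear in the dissertation), the metric can be written as:

```latex
\[
  \text{Reduction}_{\mathrm{OFF}}
  \;=\;
  \frac{P_{\mathrm{ungated}} - P_{\mathrm{OFF}}}{P_{\mathrm{ungated}}} \times 100\%
\]
```

With this definition, a 76% reduction corresponds to an OFF leakage power of about 0.24 times the ungated leakage power.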
Although SB power gating granularity of PPS = Fine results in the largest percentage of switches that are configured as always-off, the large leakage power overhead associated with its power gating circuitry makes the net OFF leakage power at this granularity higher than when PPS = 2, 3, and 4.

[Figure 5.21 plots omitted. Panels: (a) PGRs Leakage Power (LCs and RCs) in mW, with series ON, OFF, and Ungated; (b) PGRs Leakage Power Reduction (LCs and RCs) in %. x-axis: PGR granularity (tiles): 1x1, 2x2, 3x3, 4x4.]

Figure 5.21: Leakage power results for PGRs only (no SBs included) averaged over all synthetic benchmark circuits mapped to architectures with different power gating granularities. ON leakage power is measured when all dynamically-controlled resources are powered. OFF leakage power is measured when all dynamically-controlled resources are put in sleep mode.

Figure 5.21 shows the leakage power for LCs and RCs (SBs are not included) for different architecture granularities after the circuits are mapped to the DCPG FPGA architecture, averaged over all benchmark circuits. The ON leakage power for LCs and RCs is roughly equal to that of the ungated architecture. The OFF leakage power is minimum when the PGR size is 2x2 tiles, which results in a PGR leakage reduction of 85%. This is because a PGR size of 2x2 tiles results in the highest percentage of dynamically-controlled LCs and RCs, as demonstrated in the previous subsection. A PGR size of 1x1 results in roughly the same percentage of dynamically-controlled LCs, but a lower percentage of dynamically-controlled RCs. With coarser granularities, the OFF leakage power increases because each PGR that is not filled by blocks that belong to the same module is configured as always-on. This increases the number of always-on LCs and RCs.
The number of such PGRs increases in coarser granularities of the architecture.

Figure 5.22 shows the total ON and OFF leakage power (for all resources) averaged over all the benchmark circuits. The trend is similar to that in the SBs and PGRs leakage power results. The best power gating granularities in terms of the OFF leakage power are PPS = 2, 3, and 4 with a PGR size of 2x2 tiles. When PPS = 3 and the PGR size is 2x2 tiles, the average leakage power reduction is about 78%. The leakage power reduction for individual benchmark circuits with this architecture granularity is shown in Figure 5.23. The savings range between 71-83%. The smallest savings can be observed for circuits with a larger number of modules, and the largest savings can be observed for circuits with a smaller number of modules. This is expected since increasing the number of modules in a circuit would increase the complexity of interconnections and routing, which results in a larger percentage of always-on routing resources.

[Figure 5.22 plots omitted. Panels: (a) Total Leakage Power (mW), with ON and OFF series for PPS = Coarse, 1, 2, 3, 4, Fine, plus the ungated architecture; (b) Total Leakage Power Reduction (%), with one series per SB gating granularity. x-axis: PGR granularity (tiles): 1x1, 2x2, 3x3, 4x4.]

Figure 5.22: Total leakage power results averaged over all synthetic benchmark circuits mapped to architectures with different power gating granularities. ON leakage power is measured when all dynamically-controlled resources are powered. OFF leakage power is measured when all dynamically-controlled resources are put in sleep mode.

[Figure 5.23 plot omitted. x-axis: benchmark circuit (c2_1, c2_2, c3_1, c3_2, c4_1, c4_2, c5_1, c5_2, c6_1, c6_2, c7_1, c8_1, c9_1). y-axis: Leakage Power Reduction (%).]

Figure 5.23: Total leakage power reduction for individual benchmark circuits. PPS = 3 and PGR size is 2x2 tiles.

5.6 Example Application: iSnake Robot Control System

In this section, we show the energy saving results for the iSnake robot control system described in Subsection 5.3.3. Figure 5.24 shows the energy savings as a function of the amount of time the Delta module is active. We compared mapping the application on the proposed power gating architecture with two baseline implementations. The first is a system that has no power optimizations. Normally, one would implement clock gating for modules that experience inactivity periods. We assume that the first baseline does not include this. The second baseline is an implementation that includes clock gating for the Delta module.

At 5% activity level, the results show that compared with the first baseline (no clock gating), the proposed architecture coupled with clock gating could achieve 19% energy savings. Compared with the second baseline (includes clock gating), the proposed architecture could achieve 8% energy savings. Given that only a small portion of the application benefited from the power gating architecture (25% of the circuit), the results are quite promising.

[Figure 5.24 plot omitted. x-axis: Active time (%). y-axis: Energy Saving (%). Series: compared to baseline without clock gating; compared to baseline with clock gating.]

Figure 5.24: Energy reduction for iSnake’s control system in DCPG FPGA.
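The general shape of the curves in Figure 5.24 (savings shrinking as the gated module spends more time active) can be approximated with a simple duty-cycle model. The sketch below is a toy model for illustration only, not the methodology behind the reported results. All parameter names and values (`module_share`, `static_share`, `sleep_leak`) are assumptions, and wake-up energy, controller overhead and gating-circuit leakage are ignored.

```python
def normalized_energy(active, module_share, static_share, mode, sleep_leak=0.0):
    """Average power of the whole design, normalized to the never-gated design.

    active       -- fraction of time the gated module is active (0..1)
    module_share -- fraction of total power drawn by the gated module (assumed)
    static_share -- fraction of the module's power that is leakage (assumed)
    mode         -- "none", "clock" (clock gating) or "clock+power"
    sleep_leak   -- residual leakage fraction while power-gated (assumed)
    """
    if mode == "none":
        idle_power = 1.0                        # nothing is saved while idle
    elif mode == "clock":
        idle_power = static_share               # dynamic power removed only
    elif mode == "clock+power":
        idle_power = static_share * sleep_leak  # leakage mostly removed too
    else:
        raise ValueError(f"unknown mode: {mode}")
    module = active * 1.0 + (1.0 - active) * idle_power
    return (1.0 - module_share) + module_share * module


def energy_saving(active, module_share, static_share, sleep_leak, baseline):
    """Relative energy saving of clock+power gating versus a baseline mode."""
    gated = normalized_energy(active, module_share, static_share,
                              "clock+power", sleep_leak)
    base = normalized_energy(active, module_share, static_share,
                             baseline, sleep_leak)
    return 1.0 - gated / base
```

For example, `energy_saving(0.05, 0.25, 0.5, 0.1, "none")` evaluates the saving against a baseline without clock gating for hypothetical parameter values. In this model the saving against that baseline can never exceed `module_share * (1 - active)`, which is why gating only a quarter of a design bounds the achievable benefit.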
5.7 Summary

In this chapter, we presented a CAD flow that can be used to map application circuits to DCPG FPGAs. Using existing placement and routing algorithms, the low-level CAD flow requires few additional steps to enable creating power domains and connecting a power control signal to the PGRs in a power domain. Using this CAD flow, we studied the impact of enhancements to placement and routing on the potential power savings and performance of the circuits, and we compared the impact of the proposed enhancements to other placement approaches. In addition, the CAD flow has been used to study the granularity of the proposed DCPG FPGA architecture, and to evaluate an example application that has a module with long idle times.

Chapter 6
Conclusion and Future Work

This chapter provides a summary of the dissertation, the significance of each contribution, the limitations and their impact, and possible short-term and long-term future research directions.

6.1 Dissertation Summary

Technology scaling has resulted in increased functionality and speeds of ICs over the years. With recent technology nodes, however, it is more expensive to produce ICs due to the growing mask costs. Programmable technology, such as FPGAs, is becoming a more attractive solution from the cost point of view as they are usually fabricated using the latest technology nodes, and their fabrication costs are amortized among many customers. One major challenge in the adoption of FPGAs in many low-power applications is their high power consumption due to the additional circuitry that enables their reconfigurability. FPGAs suffer about 12 times power overhead, on average, compared to their ASIC counterparts [15].

A significant portion of FPGAs’ power is dissipated in the form of static power. Recent reports from leading FPGA vendors reveal that static power and dynamic power are roughly equal in FPGAs built using a 28 nm technology [41, 42]. Static power is mainly consumed when a functional block is inactive.
A common approach to address this problem in low-power ASIC designs is the use of power gating, which enables powering down inactive modules at run-time. ASICs are designed based on the desired application functionality and structure, which enables applying different power optimizations during the design process (i.e., the architecture design is derived from the application’s specifications). In contrast, the architecture design and the application design steps are decoupled in FPGAs. Therefore, an FPGA architecture must be general enough to support a wide range of applications. To address the power challenge and the generality requirement in FPGAs, we present a configurable dynamically-controlled power gating (DCPG) architecture for FPGAs. We also present the necessary physical-level CAD support to realize power gating in DCPG FPGAs. We believe that our architecture and CAD support will form the basis for future research on, and commercial adoption of, such technology.

6.1.1 Summary of Contributions

DCPG FPGA Architecture

The first contribution of this dissertation is an architecture supporting dynamic power gating in tile-based FPGAs. This has been presented in Chapter 3. Previous works focused on architectures that enable powering down unused FPGA resources only at configuration time [22–28]. The proposed architecture enables constructing power domains of arbitrary shapes, numbers, and sizes. The power state of these power domains can be controlled dynamically at run-time using an on-chip or off-chip controller. This enables powering down functional modules during their inactive periods in order to reduce their power consumption.

The proposed power gating architecture requires minor changes to the basic FPGA fabric, and its generality makes it suitable for any tile-based FPGA architecture.
The general-purpose routing resources in an FPGA that are typically used for routing user signals are used to route the power control signals that control the power state of a power domain. Therefore, no significant modifications were done to the functionality of the routing architecture. This is important since it frees FPGA vendors from re-designing the CAD tools required for circuits’ placement and routing.

The DCPG FPGA architecture supports power gating for logic clusters (LCs) and routing resources (the circuit components in routing channels and switch blocks). The intricacies of power gating that usually need to be handled by ASIC designers are taken care of in our architecture. This includes output isolation during sleep mode, state retention, and providing feedback signals to report the state of a power transition in a power gating region. We studied the effect of the different basic FPGA architecture parameters on the area overhead and the leakage power of the proposed architecture. We also studied the different power gating granularities of the architecture. Our studies were performed by generating HSPICE models for many power gating regions (PGRs), and by estimating their area overhead and measuring their power overhead and their potential power savings compared to architectures that do not support power gating. Our results show that the total area overhead of the proposed architecture can range between 4% for coarse-grained SBs power gating and a PGR size of 4x4 tiles, to 35% for fine-grained SBs power gating and a PGR size of 1x1 tiles. The total static power savings for the proposed architecture is 98% for coarse-grained SBs power gating (W=90 and L=4).

Handling Inrush Current in DCPG FPGAs

The second contribution of this dissertation is presented in Chapter 4. In this contribution, we present a configurable architecture that limits the current that results from waking up power domains.
This current is known as inrush current, and it must be handled appropriately in power-gated designs because it may cause a large IR drop for neighboring logic, which may cause the design to malfunction [7]. The significance of the presented architecture is that it can be used with any configurable architecture that supports power gating to solve the inrush current problem in a configurable way that meets the requirements of varying power domain structures, locations, and numbers. The area and power overheads of the presented architecture are negligible.

We studied the effect of inrush current on a baseline FPGA power grid that we synthesized based on application circuits’ power requirements. We found that the resulting IR drop can be significant. In addition, we investigated the possibility of using other techniques to handle inrush current, such as over-provisioning the power grid or using decoupling capacitors. Our results show that these alternatives require a large number of additional power supply pins (2-7X increase) or a large area overhead for decoupling capacitors (23-35%). The proposed architecture, on the other hand, has negligible area and power overheads.

The proposed architecture has two levels. The first level is a fixed, intra-region architecture that is based on staggering the turn-on phase within a single PGR by using fixed delay elements. The inrush current of a PGR is known when the FPGA architecture is designed; therefore, the intra-region level ensures that waking up a single PGR does not cause a large IR drop that may result in timing errors in neighbouring logic. The second level of the inrush current handling architecture is a configurable, inter-region level. This architecture consists of programmable delay elements (PDEs). A PDE is embedded within each PGR in the architecture.
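The inter-region idea, waking the PGRs of a power domain in batches so that only a bounded number turn on at once, can be sketched as follows. This is a minimal illustration under stated assumptions, not the PDE configuration procedure from Chapter 4. The function names, the notion of a discrete "slot" per PDE setting, the strictly sequential batches and the per-PGR wake-up time are all hypothetical; `r_concur` stands for the cap on concurrently waking PGRs.

```python
import math


def assign_wakeup_slots(num_pgrs, r_concur):
    """Return a slot index per PGR so that at most r_concur PGRs share a slot.

    A slot models one setting of the programmable delay elements: PGRs in
    slot k are assumed to start waking only after slot k-1 has finished.
    """
    if r_concur < 1:
        raise ValueError("r_concur must be at least 1")
    return [i // r_concur for i in range(num_pgrs)]


def module_wakeup_time_ns(num_tiles, tiles_per_pgr, r_concur, pgr_wakeup_ns):
    """Estimated wake-up time of a module, assuming sequential batches."""
    num_pgrs = math.ceil(num_tiles / tiles_per_pgr)
    slots = assign_wakeup_slots(num_pgrs, r_concur)
    num_slots = max(slots) + 1 if slots else 0
    return num_slots * pgr_wakeup_ns
```

As a usage example, a 1000-tile module on 4x4-tile PGRs (16 tiles each) occupies 63 PGRs; with a cap of 25 concurrent PGRs these fall into 3 slots. If a single PGR took 11 ns to wake (an assumed value, not a reported measurement), `module_wakeup_time_ns(1000, 16, 25, 11.0)` gives 33.0 ns, or 9.9 cycles of a 300 MHz clock.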
The PDEs in PGRs that belong to the same power domain can be configured to ensure that at most Rconcur PGRs are turned on simultaneously, where Rconcur is an architecture parameter that represents the maximum number of PGRs that can be woken up simultaneously without causing a large IR drop. This ensures that the resulting large inrush current does not cause a large IR drop in neighbouring logic. In addition, this architecture limits the current drawn from the power supply so that the operation of the other parts of the chip (active parts) is not affected.

In addition to presenting the details of the architecture, we also presented a design flow that can be used by device architects to include our architecture in DCPG FPGAs, and a design flow that end users can follow to design power-gated applications. We have studied the proposed inrush current handling architecture by generating HSPICE models of the architecture using different architecture and design parameters. The area overhead was estimated from the HSPICE models, and the power and performance measurements were obtained using circuit simulations. The area overhead of the presented architecture is less than 4% for a PGR of size 1x1 tiles, and about 1% for a PGR of size 4x4 tiles, and the power saving from the DCPG architecture is reduced by 2% for a PGR of size 4x4 tiles. The wakeup time per tile was found to range from 6 ns down to less than 1 ns for PGRs of sizes between 1x1 and 4x4 tiles. Notice that many tiles can be woken up simultaneously, which makes the total wakeup time for a module short. For example, for a power-gated module that has 1000 tiles mapped to a DCPG FPGA architecture with a region size of 4x4 tiles and Rconcur = 25, the wakeup time is less than 33 ns (about 10 clock cycles on a 300 MHz clock).

Low-Level CAD Support for DCPG FPGAs

The third contribution of this dissertation is presented in Chapter 5.
We present a mapping technique that can be used to map application circuits with power-gated modules, along with power controller modules, to our DCPG FPGA architecture. The significance of this contribution is twofold. First, we show that existing placement and routing algorithms can be used without modifications to perform the mapping phase. Few additional steps are required to connect the power control signals from a power controller to the different PGRs in power-gated modules using an existing router. Second, we use this CAD flow to perform two important studies. We present enhancements to the placement and routing algorithms, and compare these enhancements to other mapping techniques available in the literature (Section 5.4). We also use the CAD flow to study the granularity of the DCPG FPGA architecture in the context of application circuits (Section 5.5).

We found that existing placement and routing algorithms perform fairly well in finding solutions in which a large number of resources can be power-gated at run-time. We also found that using a placement cost function might not be helpful in reducing the number of resources that are configured as always-on while balancing timing and routability. The best way to reduce the number of resources that are configured as always-on is by using placement constraints. Our enhanced placement and routing techniques focus on reducing the number of always-on routing resources. Combined with placement constraints, our enhancements provide improved power savings. On average, combining placement constraints and PG-aware P&R reduces the OFF leakage power by 16% for our benchmark circuits. All of the investigated techniques were found to have some impact on the critical path delay of individual circuits.

The proposed CAD flow enables studying the granularity of the DCPG FPGA architecture by mapping benchmark circuits to the DCPG FPGA architecture and collecting resources usage statistics.
This is unlike the study in Chapter 3 that investigates the power gating granularity of standalone PGRs and SBs using HSPICE simulations. The results in Section 5.5 provide two insights. The first is related to the architecture granularity of PGRs (i.e., the size of each PGR). For the basic FPGA architecture parameters used, we found that a PGR size of 2x2 tiles is the best in terms of generating the largest number of routing resources and logic resources that are configured as dynamically-controlled, and thus results in the largest static power reduction. The second insight is related to the power gating granularity of SBs. Using finer granularity power gating for SBs leads to more routing resources that are powered down at configuration time because they are not used to route application circuits’ signals. In addition, finer granularity SBs power gating leads to a larger number of routing resources that are configured as dynamically-controlled. We found that for SB power gating granularities of PPS greater than 1, the net ON leakage power is lower than that of the ungated architecture. The smallest OFF leakage power results when PPS is 3 and 4. This granularity level is equivalent to creating power-gated SB partitions with 2 or 3 switches.

6.1.2 Dissertation Impact

The architectures, circuits, and CAD techniques presented in this dissertation target enabling dynamic power gating in reconfigurable devices. Significant energy savings can be obtained using dynamic power gating. However, commercial FPGAs lack this capability. We studied the major challenges associated with power gating in FPGAs, including the details of the architecture, handling the intricacies of power gating and the wakeup phase, and the CAD tools that map application circuits to power-gated FPGAs. We provided and evaluated solutions to these problems. Our solutions are general and can be integrated in commercial FPGA devices.
Adoption of the proposed architectures and design methodology by FPGA vendors would enable a new generation of low-power FPGAs that are attractive for many applications, especially for customers who cannot afford the costs of fabricating their own ASIC devices. Low-power FPGAs facilitate energy-efficient computation in several domains, including large data processing, mobile devices, and biomedical applications. With low-power FPGAs, small companies would have better chances to compete in markets that require fast, low-power devices. Many applications, such as mobile devices, can benefit from the reconfigurable technology by supporting protocol updates and bug fixes after product shipment. This improves competition in the market, and eventually leads to cheaper products and better user experience.

6.2 Limitations and Short-Term Future Work

In this section, we describe the limitations of the study in this dissertation, and provide directions for short-term future work.

6.2.1 Architecture Framework

The power gating architecture presented in Chapter 3 is based on an academic FPGA architecture that is commonly used in FPGA research [46]. Commercial FPGA architectures contain additional functional blocks that have not been studied in this dissertation, such as memory blocks, DSPs, clock managers, etc. [75–77, 81]. Nevertheless, we believe that power gating can also be implemented for these components in a similar manner as presented in this dissertation.

Supporting power gating for memory blocks, however, might be challenging because some usage scenarios require that memory data be retained during sleep mode. Using conventional retention techniques is not suitable for memory blocks because of the significant overhead (footnote 4). Other low-power modes, such as the VDD retention method and the source-diode biasing method, can be used to put memory blocks in low-power modes while maintaining their data [7].
The periphery circuitry can also be power-gated without losing data stored in the memory array, as some memory compilers support [129]. These techniques can be designed to be dynamically-controlled in FPGAs using a similar approach as proposed in this dissertation.

Footnote 4: Using retention memory implies that for each memory block there is another memory block that is used to retain a copy of its data during sleep mode, which is impractical.

6.2.2 Sleep Transistors Sizing

The sizing for the power switches (sleep transistors) in the proposed power gating architecture in Chapter 3 was performed assuming that 20% of the inputs are switching simultaneously. Switching activity affects the power consumption of a circuit, and hence the current drawn from a power supply. The sleep transistors must be sized to supply this amount of current with a low voltage drop. In a typical power gating design flow, the maximum power of a circuit is used to size the power switches so that a reasonable performance degradation is obtained, typically between 5-10% [8]. The power consumption depends on the application circuit and the input data. Whilst 20% is considered a high switching activity for many circuits, it may not be the case for FPGAs because these devices are supposed to be able to implement any digital circuit; this includes circuits that have higher than 20% activity in some of their components.

Our experiments show that for the MCNC benchmark circuits, the maximum power dissipated per tile is far less than the power dissipation when a single tile is simulated using HSPICE assuming 20% activity. This indicates that our assumed 20% activity is an overestimate. If a more conservative approach is to be taken by assuming that all inputs of a tile are switching at the same time, this would result in larger sleep transistor sizes, and hence larger area overhead.
This would impact the results obtained in this dissertation in terms of the power savings and the best architecture granularity. Investigating the effect of using different input activities for sizing sleep transistors might be an interesting avenue of future work.

6.2.3 Characterizing the Wakeup Phase of Power-Gated Modules

The architecture presented in Chapter 4 for handling inrush current depends on the underlying circuit-level support to stagger the turn-on phase. The worst-case inrush current was assumed in this dissertation in order to design the inrush current handling architecture.

A future research direction would include characterizing the effect of specific turn-on sequences on power and timing. The work in [130] proposes modifications to the FPGA placement and routing algorithms that can reduce the IR drop on the power grid for FPGAs. The work shows that high-activity portions of a circuit may experience high IR drop. A similar approach can be used to study the effect of the wakeup phase of a power-gated module in the context of larger application circuits. Optimizations can be done during early phases of the design flow of power-gated FPGAs by avoiding placing power-gated modules close to high-activity modules. This can help in reducing the IR drop on the power distribution network, and would result in better than worst-case power and timing.

6.2.4 CAD Study and Benchmarks

The CAD study performed in Chapter 5 to investigate the granularity of the proposed power gating architecture is based on benchmark circuits that do not contain information about the behaviour of the power-gated modules (i.e., no information about active and idle periods). Rather, it assumes that all power-gated modules are idle all the time in order to find the upper bound of power savings for each granularity of the power gating architecture. Ideally, application behaviour information should be included in such a study.
For example, power-gated modules that have long active periods might benefit more from finer-granularity SBs power gating since a large number of resources can be powered down at configuration time.

In order to account for the behaviour of power-gated modules, a representative set of benchmark circuits is required. Currently, there are no benchmark circuits that can be used to evaluate power-gated architectures. The application we use in Section 5.6 is an example of the desired benchmark circuits. There have been some attempts to provide benchmark circuits to better evaluate the energy/power consumption of FPGAs [131], or to better represent SoC designs [132]. The benchmark circuits in [131] represent real applications from different domains. However, they do not include information about modules that can be power-gated and their power controller units. The benchmark generator in [132] focuses on generating synthetic benchmark circuits that represent SoC designs, without taking into consideration the actual functionality and behaviour of the different modules. Extending these works to include information about power-gated modules and their behaviour is an interesting avenue for future work.

6.3 Long-Term Future Directions

This section provides long-term research directions.

6.3.1 High-Level CAD Support

The CAD support presented in Chapter 5 focuses on low-level CAD algorithms, specifically placement and routing of circuit blocks on the available resources of an FPGA. Our experiments were performed assuming that the power-gated modules in each benchmark circuit are known to the CAD flow. In current low-power ASIC flows, the identification of power-gated modules is performed manually by experienced designers. However, an ideal flow would be smart enough to identify power gating opportunities automatically from a high-level description of the application.

An interesting avenue of future research involves automating the process of identifying power gating opportunities at the high levels of the design flow. In addition to simplifying the design process, this would enable optimizations that can lead to further power reductions in FPGAs with dynamic power gating or other power reduction techniques. Currently, low-power ASIC design flows use power specifications that can be fed into the different stages of the design flow in order to guide CAD tools to account for the desired power-specific intentions [121]. The unified power format (UPF) [133] and the common power format (CPF) [134] are used in commercial CAD flows to do this. However, these power specifications are design-specific, must be provided for each new design manually, and largely depend on the designer’s experience to identify power saving opportunities.

Ideally, a CAD flow should support identifying power-saving opportunities, and applying the most suitable power saving techniques. This requires knowledge about the behaviour of the application that may be provided by the user at a high level. Such a CAD flow would be able to perform fast design space exploration to determine the combination of power saving techniques that may result in the lowest power consumption. For example, in high-level synthesis (HLS) flows, the HLS tool would have information about the high-level description of the application in the form of a software program. This high-level description is used to derive the control flow and data flow graphs of the application. Thus, it is possible for these tools to identify power saving opportunities, and perhaps optimize the design to reduce the power consumption. This could also be applied to other design flows that include a high-level description of the design.

In addition to identifying power saving opportunities at high levels, this information can help optimize the different subsequent steps in the design flow.
For example, for a module that is identified as a candidate for power gating because its behaviour indicates that it has very long idle times, the synthesis step in an FPGA design flow can focus more on improving timing while spending less effort on optimizing area. This is because such modules would spend most of their time in a low-power sleep mode, which makes their dynamic power consumption (which is related to the area) less important.

6.3.2 Self-Optimizing FPGAs

An interesting avenue of future work involves creating systems that support self-optimization in order to address the different scaling challenges of future technologies. This includes optimizing power, reliability, and performance. Many state-of-the-art SoCs contain programmable logic components in addition to the basic system components, such as the Zynq-7000 SoCs from Xilinx and the Cyclone V SoCs from Altera, which makes the idea of self-optimization very compelling for such systems.

One of the important scaling challenges facing designers is process variation, as it affects yield and reliability [135]. Process variation results in differences between the electrical characteristics of the transistors on a chip (within-die) and among different chips (die-to-die). This leads to differences in the performance, reliability, and power dissipation among different parts of the same chip. A large part of process variation is systematic, meaning that the transistors within a region in a chip have roughly similar characteristics.

With self-optimizing systems that contain programmable logic, it is possible to characterize the performance of the different components in the chip so that resources can be utilized judiciously. For example, critical tasks can be mapped to areas in the programmable fabric that have higher speed than others in order to improve energy efficiency.
Tasks can also be migrated from one region in the fabric to another to adapt to the requirements of the overall application (power, performance, etc.) by allowing other tasks with higher criticality to be mapped to fast areas in the chip. Some of these optimizations have been studied in the past, such as process variation-aware placement [136–138], but no framework was provided to enable devices to have autonomous control over their own resources.

Another interesting aspect of self-optimizing systems is healing from aging effects, such as negative-bias temperature instability (NBTI), that cause performance degradation with time. It is known that a large fraction of the effect of NBTI can be recovered by using appropriate input vectors for logic gates [139]. Self-optimizing systems would monitor the aging effects, and try to heal or distribute the aging within the fabric such that the overall lifetime of the system is improved. Power consumption can also be included in the cycle of self-optimizations by detecting opportunities to apply certain power optimizations, and applying them when possible. For example, power gating can be applied for resources that have been detected to be inactive for certain periods of time. The interaction between the different design metrics (power, performance, and reliability) can be part of the self-optimization effort in these futuristic platforms, as has been indicated in [140].

In order to support such self-optimizing systems, new programmable fabric architectures as well as CAD tools are needed. Architectural support would provide a means for self-controllability, such that the available components can control certain aspects of other components in the programmable fabric, such as voltage, speed, etc. In addition, the programmable fabric would need to be able to detect certain characteristics of the data being processed, and respond to that appropriately.
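The idea of power gating resources once they have been detected as inactive for a certain period can be illustrated with a simple idle-timeout policy. The following is only a minimal sketch under stated assumptions, not the controller proposed in this dissertation: the function names, the per-cycle activity trace, and the energy and power values are all hypothetical. It relies on the standard break-even argument that gating only pays off when the idle period is longer than the sleep/wake transition energy divided by the leakage power saved while asleep.

```python
# Hypothetical sketch of an idle-timeout power gating policy.
# All names and numbers are illustrative assumptions, not values
# or interfaces taken from the dissertation.

def break_even_time_s(transition_energy_j, leakage_saved_w):
    """Shortest idle time (seconds) for which gating saves energy.

    transition_energy_j: assumed energy cost of one sleep/wake cycle.
    leakage_saved_w: assumed leakage power saved while the region is gated.
    """
    return transition_energy_j / leakage_saved_w


def gating_decisions(activity, timeout_cycles):
    """Return, per cycle, whether a region would be power-gated.

    activity: sequence of booleans, True when the region is active.
    timeout_cycles: number of consecutive idle cycles tolerated
        before the region is put to sleep.
    """
    gated = []
    idle_run = 0
    for active in activity:
        # Any activity resets the idle counter (and implies a wakeup).
        idle_run = 0 if active else idle_run + 1
        gated.append(idle_run > timeout_cycles)
    return gated


if __name__ == "__main__":
    trace = [True, False, False, False, True]
    print(gating_decisions(trace, timeout_cycles=2))
    print(break_even_time_s(transition_energy_j=4e-6, leakage_saved_w=2e-3))
```

In such a scheme the timeout would presumably be chosen at or above the break-even time expressed in clock cycles, so that short idle gaps do not trigger unprofitable sleep/wake transitions.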
The CAD support is required to map application circuits to such self-optimizing systems, and perhaps be embedded in the overall system to support real-time circuit mapping to the programmable fabric, as in [141].

Bibliography

[1] A. A. M. Bsoul and S. J. E. Wilton, “An FPGA Architecture Supporting Dynamically Controlled Power Gating,” in Proceedings of The IEEE International Conference on Field-Programmable Technology (FPT), pp. 1–8, Dec. 2010.
[2] A. A. M. Bsoul and S. J. E. Wilton, “A Configurable Architecture to Limit Wakeup Current in Dynamically-Controlled Power-Gated FPGAs,” in Proceedings of the International Symposium on Field-Programmable Gate Arrays, FPGA ’12, pp. 245–254, 2012.
[3] A. A. M. Bsoul and S. J. E. Wilton, “An FPGA with Power-Gated Switch Blocks,” in Proceedings of The IEEE International Conference on Field-Programmable Technology (FPT), pp. 87–94, Dec. 2012.
[4] A. A. M. Bsoul and S. J. E. Wilton, “A Configurable Architecture to Limit Inrush Current in Power-Gated Reconfigurable Devices,” J. Low Power Electronics, vol. 10, no. 1, pp. 1–15, 2014.
[5] A. A. M. Bsoul, S. J. E. Wilton, K. H. Tsoi, and W. Luk, “An FPGA Architecture and CAD Flow Supporting Dynamically-Controlled Power Gating,” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, submitted.
[6] A. A. M. Bsoul, S. Dueck, and S. J. E. Wilton, Green Communications: Theoretical Fundamentals, Algorithms and Applications, ch. Runtime-Controlled Energy Reduction Techniques for FPGAs, pp. 3–22. CRC Press, 2012.
[7] M. Keating, D. Flynn, R. Aitken, A. Gibbons, and K. Shi, Low Power Methodology Manual: For System-on-Chip Design. Springer Publishing Company, Incorporated, 2007.
[8] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective. USA: Addison-Wesley Publishing Company, 4th ed., 2011.
[9] N. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J. Hu, M. Irwin, M. Kandemir, and V. Narayanan, “Leakage Current: Moore’s Law Meets Static Power,” Computer, vol. 36, pp.
68–75, Dec 2003.
[10] “40-nm FPGA Power Management and Advantages.” Altera, Corp. white paper WP-01059-1.2 (v1.2), August 2008.
[11] P. Garcia, K. Compton, M. Schulte, E. Blem, and W. Fu, “An Overview of Reconfigurable Hardware in Embedded Systems,” EURASIP Journal on Embedded Systems, vol. 2006, p. 19, 2006.
[12] D. Lammers, “Chip Industry Growing, but Experts Wonder for How Long,” EE Times, 2003.
[13] Z. Or-Bach, “Paradigm Shift in ASIC Technology: In-Standard Metal, Out-standard Cell.” eASIC white paper, 2005.
[14] I. Kuon, R. Tessier, and J. Rose, “FPGA Architecture: Survey and Challenges,” Found. Trends Electron. Des. Autom., vol. 2, pp. 135–253, Feb. 2008.
[15] I. Kuon and J. Rose, “Measuring the Gap Between FPGAs And ASICs,” in Proceedings of the 14th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 21–30, 2006.
[16] V. Tiwari, S. Malik, and P. Ashar, “Guarded Evaluation: Pushing Power Management to Logic Synthesis/Design,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, pp. 1051–1060, Oct. 1998.
[17] M. Münch, B. Wurth, R. Mehra, J. Sproch, and N. Wehn, “Automating RT-level Operand Isolation to Minimize Power Consumption in Datapaths,” in Proceedings of the Conference on Design, Automation and Test in Europe, pp. 624–633, 2000.
[18] A. Abdollahi, M. Pedram, F. Fallah, and I. Ghosh, “Precomputation-Based Guarding for Dynamic and Leakage Power Reduction,” in Proceedings of the 21st International Conference on Computer Design, pp. 90–97, 2003.
[19] J. H. Anderson and C. Ravishankar, “FPGA Power Reduction by Guarded Evaluation,” in Proceedings of the 18th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 157–166, 2010.
[20] Q. Wang, S. Gupta, and J. H. Anderson, “Clock Power Reduction for Virtex-5 FPGAs,” in Proceedings of the 17th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 13–22, 2009.
[21] Y. Zhang, J. Roivainen, and A.
Mammela, “Clock-Gating in FPGAs: A Novel and Comparative Evaluation,” in Proceedings of the 9th EUROMICRO Conference on Digital System Design, pp. 584–590, 2006.
[22] A. Gayasen, Y. Tsai, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, and T. Tuan, “Reducing Leakage Energy in FPGAs Using Region-Constrained Placement,” in Proceedings of the 12th International Symposium on Field-Programmable Gate Arrays, pp. 51–58, 2004.
[23] Y. Lin, F. Li, and L. He, “Routing Track Duplication with Fine-Grained Power-Gating for FPGA Interconnect Power Reduction,” in Proceedings of the 2005 Asia and South Pacific Design Automation Conference, pp. 645–650, 2005.
[24] R. P. Bharadwaj, R. Konar, P. T. Balsara, and D. Bhatia, “Exploiting Temporal Idleness to Reduce Leakage Power in Programmable Architectures,” in Proceedings of the 2005 Asia and South Pacific Design Automation Conference, pp. 651–656, 2005.
[25] S. Mondal and S. O. Memik, “Fine-Grain Leakage Optimization in SRAM Based FPGAs,” in Proceedings of the 15th ACM Great Lakes Symposium on VLSI, pp. 238–243, 2005.
[26] T. Tuan, S. Kao, A. Rahman, S. Das, and S. Trimberger, “A 90 nm Low-Power FPGA for Battery-Powered Applications,” in Proceedings of the 14th International Symposium on Field-Programmable Gate Arrays, pp. 3–11, 2006.
[27] T. Tuan, A. Rahman, S. Das, S. Trimberger, and S. Kao, “A 90-nm Low-Power FPGA for Battery-Powered Applications,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, pp. 296–300, Feb. 2007.
[28] P. Nair, S. Koppa, and E. John, “A Comparative Analysis of Coarse-Grain and Fine-Grain Power Gating for FPGA Lookup Tables,” in Proceedings of the 52nd IEEE International Midwest Symposium on Circuits and Systems, pp. 507–510, 2009.
[29] F. Li, Y. Lin, L. He, and J. Cong, “Low-Power FPGA using Pre-Defined Dual-Vdd/Dual-Vt Fabrics,” in Proceedings of the 12th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 42–50, 2004.
[30] F. Li, Y. Lin, and L.
He, “FPGA Power Reduction Using Configurable Dual-Vdd,” in Proceedings of the 41st Design Automation Conference, pp. 735–740, 2004.
[31] F. Li, Y. Lin, and L. He, “Vdd Programmability to Reduce FPGA Interconnect Power,” in Proceedings of the 2004 IEEE/ACM International Conference on Computer-Aided Design, pp. 760–765, 2004.
[32] Y. Lin and L. He, “Leakage Efficient Chip-Level Dual-Vdd Assignment with Time Slack Allocation for FPGA Power Reduction,” in Proceedings of the 42nd Design Automation Conference, pp. 720–725, 2005.
[33] Y. Lin, F. Li, and L. He, “Power Modeling and Architecture Evaluation for FPGA with Novel Circuits for Vdd Programmability,” in Proceedings of the 13th ACM/SIGDA International Symposium on Field-programmable Gate Arrays, pp. 199–207, 2005.
[34] F. Li, Y. Lin, and L. He, “Field Programmability of Supply Voltages for FPGA Power Reduction,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, pp. 752–764, April 2007.
[35] J. Lamoureux and S. J. E. Wilton, “On the Interaction Between Power-Aware FPGA CAD Algorithms,” in Proceedings of the 2003 IEEE/ACM International Conference on Computer-Aided Design, pp. 701–708, 2003.
[36] J. Lamoureux and S. Wilton, “Clock-Aware Placement for FPGAs,” in Proceedings of the International Conference on Field Programmable Logic and Applications, pp. 124–131, 2007.
[37] J. Lamoureux and W. Luk, “An Overview of Low-Power Techniques for Field-Programmable Gate Arrays,” in Proceedings of the 2008 NASA/ESA Conference on Adaptive Hardware and Systems, pp. 338–345, 2008.
[38] J. H. Anderson and F. N. Najm, “Low-Power Programmable Routing Circuitry for FPGAs,” in Proceedings of the 2004 IEEE/ACM International Conference on Computer-Aided Design, pp. 602–609, 2004.
[39] M. Lin and A. El Gamal, “A Low-Power Field-Programmable Gate Array Routing Fabric,” IEEE Trans. Very Large Scale Integr. Syst., vol. 17, no. 10, pp. 1481–1494, 2009.
[40] M.
Klein, “Power Consumption at 40 and 45 nm.” Xilinx, Inc. white paper WP298 (v1.0), April 2009.
[41] J. Hussein, M. Klein, and M. Hart, “Lowering Power at 28 nm with Xilinx 7 Series FPGAs.” Xilinx, Inc. white paper WP389 (v1.1), Jun. 2011.
[42] “Meeting the Low Power Imperative at 28 nm.” Altera Corporation, white paper WP-01158-2.0, Nov. 2011.
[43] S. Henzler, Power Management of Digital Circuits in Deep Sub-Micron CMOS Technologies (Springer Series in Advanced Microelectronics). Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2007.
[44] “Reducing Power Consumption and Increasing Bandwidth on 28-nm FPGAs.” Altera Corporation white paper, March 2012.
[45] A. A. M. Bsoul, R. Hoskinson, M. Ivanov, S. Mirabbasi, and H. Abdollahi, “Implementation of an FPGA-Based Low-Power Video Processing Module for a Head-Mounted Display System,” in Consumer Electronics (ICCE), 2013 IEEE International Conference on, pp. 214–217, 2013.
[46] V. Betz, J. Rose, and A. Marquardt, eds., Architecture and CAD for Deep-Submicron FPGAs. Norwell, MA, USA: Kluwer Academic Publishers, 1999.
[47] T. Ahmed, P. D. Kundarewich, J. H. Anderson, B. L. Taylor, and R. Aggarwal, “Architecture-specific Packing for Virtex-5 FPGAs,” in Proceedings of the 16th International ACM/SIGDA Symposium on Field Programmable Gate Arrays, FPGA ’08, (New York, NY, USA), pp. 5–13, ACM, 2008.
[48] D. Lewis, E. Ahmed, G. Baeckler, V. Betz, M. Bourgeault, D. Cashman, D. Galloway, M. Hutton, C. Lane, A. Lee, P. Leventis, S. Marquardt, C. McClintock, K. Padalia, B. Pedersen, G. Powell, B. Ratchev, S. Reddy, J. Schleicher, K. Stevens, R. Yuan, R. Cliff, and J. Rose, “The Stratix II Logic and Routing Architecture,” in FPGA ’05: Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays, pp. 14–20, 2005.
[49] A. Sangiovanni-Vincentelli, A. El Gamal, and J. Rose, “Synthesis Method for Field Programmable Gate Arrays,” Proceedings of the IEEE, vol. 81, pp. 1057–1083, Jul 1993.
[50] R. Brayton, G.
Hachtel, and A. Sangiovanni-Vincentelli, “Multilevel Logic Synthesis,” Proceedings of the IEEE, vol. 78, pp. 264–300, Feb 1990.
[51] R. Francis, J. Rose, and Z. Vranesic, “Chortle-crf: Fast Technology Mapping for Lookup Table-based FPGAs,” in Proceedings of the 28th ACM/IEEE Design Automation Conference, DAC ’91, (New York, NY, USA), pp. 227–233, ACM, 1991.
[52] R. Francis, J. Rose, and Z. Vranesic, “Technology Mapping of Lookup Table-Based FPGAs for Performance,” in Computer-Aided Design, 1991. ICCAD-91. Digest of Technical Papers., 1991 IEEE International Conference on, pp. 568–571, Nov 1991.
[53] K.-C. Chen, J. Cong, Y. Ding, A. Kahng, and P. Trajmar, “DAG-Map: Graph-Based FPGA Technology Mapping for Delay Optimization,” Design Test of Computers, IEEE, vol. 9, pp. 7–20, Sept 1992.
[54] J. Cong and Y. Ding, “FlowMap: An Optimal Technology Mapping Algorithm for Delay Optimization in Lookup-Table Based FPGA Designs,” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 13, pp. 1–12, Jan 1994.
[55] J. Cong and Y. Ding, “On Area/Depth Trade-off in LUT-Based FPGA Technology Mapping,” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 2, pp. 137–148, June 1994.
[56] V. Betz and J. Rose, “Cluster-Based Logic Blocks for FPGAs: Area-Efficiency vs. Input Sharing and Size,” in Custom Integrated Circuits Conference, 1997., Proceedings of the IEEE 1997, pp. 551–554, May 1997.
[57] A. S. Marquardt, V. Betz, and J. Rose, “Using Cluster-based Logic Blocks and Timing-driven Packing to Improve FPGA Speed and Density,” in Proceedings of the 1999 ACM/SIGDA Seventh International Symposium on Field Programmable Gate Arrays, FPGA ’99, (New York, NY, USA), pp. 37–46, ACM, 1999.
[58] A. Marquardt, V. Betz, and J. Rose, “Timing-Driven Placement for FPGAs,” in Proceedings of the Eighth ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 203–213, 2000.
[59] V. Betz and J.
Rose, “VPR: A New Packing, Placement and Routing Tool for FPGA Research,” in Proceedings of the 7th International Workshop on Field-Programmable Logic and Applications, pp. 213–222, 1997.
[60] A. B. Kahng, S. Reda, and Q. Wang, “Architecture and Details of a High Quality, Large-scale Analytical Placer,” in Proceedings of the 2005 IEEE/ACM International Conference on Computer-aided Design, ICCAD ’05, (Washington, DC, USA), pp. 891–898, IEEE Computer Society, 2005.
[61] L. McMurchie and C. Ebeling, “PathFinder: A Negotiation-based Performance-driven Router for FPGAs,” in Proceedings of the 1995 ACM Third International Symposium on Field-programmable Gate Arrays, FPGA ’95, (New York, NY, USA), pp. 111–117, ACM, 1995.
[62] Altera Corporation, Embedded Design Handbook, July 2011.
[63] Xilinx, Inc., Embedded System Tools Reference Manual, March 2013.
[64] J. M. Rabaey, A. Chandrakasan, and B. Nikolić, Digital Integrated Circuits: A Design Perspective. Prentice-Hall, Inc., 2nd ed., 2003.
[65] J. Rabaey, Low Power Design Essentials. Springer Publishing Company, Incorporated, 1st ed., 2009.
[66] H. Hassan and M. Anis, Low-Power Design of Nanometer FPGAs: Architecture and EDA. Morgan Kaufmann Publishers, 2010.
[67] G. E. Moore, “Cramming More Components Onto Integrated Circuits,” Electronics, vol. 38, Apr 1965.
[68] L. Shang, A. S. Kaviani, and K. Bathala, “Dynamic Power Consumption in Virtex™-II FPGA Family,” in Proceedings of the Tenth ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 157–164, 2002.
[69] M. Donno, A. Ivaldi, L. Benini, and E. Macii, “Clock-Tree Power Optimization Based on RTL Clock-Gating,” in Proceedings of the 40th Design Automation Conference, pp. 622–627, 2003.
[70] Q. Wang and S. Roy, “Power Minimization by Clock Root Gating,” in Proceedings of the 2003 Asia and South Pacific Design Automation Conference, pp. 249–254, 2003.
[71] V. Venkatachalam and M. Franz, “Power Reduction Techniques For Microprocessor Systems,” ACM Comput.
Surv., vol. 37, no. 3, pp. 195–237, 2005.
[72] F. Rivoallon, “Reducing Switching Power with Intelligent Clock Gating,” white paper wp370, Xilinx, Inc., May 2010.
[73] SiliconBlue, iCE65 Ultra Low-Power mobileFPGA Family Datasheet, May 2010.
[74] Altera Corporation, Stratix Device Handbook, July 2005.
[75] Actel Corporation, IGLOO Low-Power Flash FPGAs Handbook, November 2009.
[76] Altera Corporation, Stratix III Device Handbook, July 2010.
[77] Lattice Semiconductor Corp., LatticeECP3 Family Handbook, June 2010.
[78] S. Huda, M. Mallick, and J. Anderson, “Clock Gating Architectures for FPGA Power Reduction,” in Proceedings of the International Conference on Field Programmable Logic and Applications, pp. 112–118, 2009.
[79] D. E. Lackey, P. S. Zuchowski, T. R. Bednar, D. W. Stout, S. W. Gould, and J. M. Cohn, “Managing Power and Performance for System-on-Chip Designs using Voltage Islands,” in Proceedings of the 2002 IEEE/ACM international conference on Computer-Aided Design, pp. 195–202, 2002.
[80] R. Puri, L. Stok, J. Cohn, D. Kung, D. Pan, D. Sylvester, A. Srivastava, and S. Kulkarni, “Pushing ASIC Performance in a Power Envelope,” in Proceedings of the 40th Design Automation Conference, pp. 788–793, 2003.
[81] Xilinx, Inc., Virtex-5 FPGA Configuration User Guide, August 2009.
[82] K. Paulsson, M. Hübner, S. Bayar, and J. Becker, “Exploitation of Run-Time Partial Reconfiguration for Dynamic Power Management in Xilinx Spartan III-based Systems,” in ReCoSoC Workshop, 2007.
[83] S. Liu, R. N. Pittman, A. Forin, and J.-L. Gaudiot, “On Energy Efficiency of Reconfigurable Systems with Run-Time Partial Reconfiguration,” in Proceedings of the 21st IEEE International Conference on Application-Specific Systems, Architectures and Processors, pp. 265–272, 2010.
[84] F. Li, D. Chen, L. He, and J. Cong, “Architecture Evaluation for Power-Efficient FPGAs,” in Proceedings of the 11th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 175–184, 2003.
[85] J. H.
Anderson and F. N. Najm, “Power Estimation Techniques for FPGAs,” IEEE Trans. Very Large Scale Integr. Syst., vol. 12, no. 10, pp. 1015–1027, 2004.
[86] K. K. W. Poon, S. J. E. Wilton, and A. Yan, “A Detailed Power Model for Field-Programmable Gate Arrays,” ACM Trans. Des. Autom. Electron. Syst., vol. 10, no. 2, pp. 279–302, 2005.
[87] K. Vorwerk, M. Raman, J. Dunoyer, Y. chung Hsu, A. Kundu, and A. Kennings, “A Technique for Minimizing Power During FPGA Placement,” in Proceedings of the 2008 International Conference on Field Programmable Logic and Applications, pp. 233–238, 2008.
[88] Q. Dinh, D. Chen, and M. Wong, “A Routing Approach to Reduce Glitches in Low Power FPGAs,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 29, pp. 235–240, February 2010.
[89] J. Anderson and F. Najm, “Power-Aware Technology Mapping for LUT-Based FPGAs,” in Proceedings of the 2002 IEEE International Conference on Field-Programmable Technology, pp. 211–218, 2002.
[90] L. Cheng, D. Chen, and M. Wong, “GlitchMap: An FPGA Technology Mapper for Low Power Considering Glitches,” in Proceedings of the 44th ACM/IEEE Design Automation Conference, pp. 318–323, 2007.
[91] D. Chen and J. Cong, “Delay Optimal Low-Power Circuit Clustering for FPGAs with Dual Supply Voltages,” in Proceedings of the 2004 International Symposium on Low Power Electronics and Design, pp. 70–73, 2004.
[92] J. H. Anderson, F. N. Najm, and T. Tuan, “Active Leakage Power Optimization for FPGAs,” in Proceedings of the 12th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 33–41, 2004.
[93] J. Lamoureux, G. G. Lemieux, and S. J. E. Wilton, “Glitchless: An Active Glitch Minimization Technique for FPGAs,” in Proceedings of the 15th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 156–165, 2007.
[94] T. S. Czajkowski and S. D.
Brown, “Using Negative Edge Triggered FFs to Reduce Glitching Power in FPGA Circuits,” in Proceedings of the 44th Design Automation Conference, pp. 324–329, 2007.
[95] “Using Suspend Mode in Spartan-3 Generation FPGAs.” Xilinx Application Note, XAPP480 (v1.0), May 2007.
[96] “Spartan-6 FPGA Power Management User Guide.” Xilinx, UG394 (v1.2), May 2014.
[97] C. Li, Y. Dong, and T. Watanabe, “New Power-Aware Placement for Region-Based FPGA Architecture Combined with Dynamic Power Gating by PCHM,” in Proceedings of the 17th IEEE/ACM International Symposium on Low-Power Electronics and Design, ISLPED ’11, pp. 223–228, IEEE Press, 2011.
[98] C. H. Hoo, Y. Ha, and A. Kumar, “A Directional Coarse-Grained Power Gated FPGA Switch Box and Power Gating Aware Routing Algorithm,” in Field Programmable Logic and Applications (FPL), 2013 23rd International Conference on, pp. 1–4, Sept 2013.
[99] C. Kim, M. Chung, Y. Cho, M. Konijnenburg, S. Ryu, and J. Kim, “ULP-SRP: Ultra Low Power Samsung Reconfigurable Processor for Biomedical Applications,” in Proceedings of The International Conference on Field-Programmable Technology (FPT), pp. 329–334, Dec. 2012.
[100] S. Kim, S. V. Kosonocky, and D. R. Knebel, “Understanding and Minimizing Ground Bounce During Mode Transition of Power Gating Structures,” in Proceedings of the 2003 International Symposium on Low-Power Electronics and Design, pp. 22–25, ACM, 2003.
[101] D. Howard and K. Shi, “Power-On Current Control In Sleep Transistor Implementations,” in Proceedings of the 2006 International Symposium on VLSI Design, Automation and Test, pp. 1–4, 2006.
[102] K. Shi and J. Li, “A Wakeup Rush Current and Charge-up Time Analysis Method for Programmable Power-Gating Designs,” in IEEE International SOC Conference, pp. 163–165, 26-29 2007.
[103] A. Calimera, L. Benini, A. Macii, E. Macii, and M.
Poncino, “Design of a Flexible Reactivation Cell for Safe Power-Mode Transition in Power-Gated Circuits,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 56, pp. 1979–1993, Sept. 2009.
[104] C. Li, Y. Dong, and T. Watanabe, “New Power-Efficient FPGA Design Combining with Region-Constrained Placement and Multiple Power Domains,” in New Circuits and Systems Conference (NEWCAS), 2011 IEEE 9th International, pp. 69–72, June 2011.
[105] J. Luu, I. Kuon, P. Jamieson, T. Campbell, A. Ye, W. M. Fang, and J. Rose, “VPR 5.0: FPGA CAD and Architecture Exploration Tools with Single-Driver Routing, Heterogeneity and Process Scaling,” in Proceedings of the 17th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 133–142, 2009.
[106] “Predictive Technology Model (PTM).” http://www.eas.asu.edu/~ptm/.
[107] E. Hung, S. Wilton, H. Yu, T. Chau, and P. H. W. Leong, “A Detailed Delay Path Model for FPGAs,” in Field-Programmable Technology, 2009. FPT 2009. International Conference on, pp. 96–103, IEEE, 2009.
[108] T. Pi and P. J. Crotty, “FPGA Lookup Table with Transmission Gate Structure for Reliable Low-Voltage Operation,” Dec. 2003. US Patent 6,667,635.
[109] G. Lemieux, E. Lee, M. Tom, and A. Yu, “Directional and Single-Driver Wires in FPGA Interconnect,” in Proceedings of the IEEE International Conference on Field-Programmable Technology, pp. 41–48, 2004.
[110] S.-C. Wong, G.-Y. Lee, and D.-J. Ma, “Modeling of Interconnect Capacitance, Delay, and Crosstalk in VLSI,” Semiconductor Manufacturing, IEEE Transactions on, vol. 13, pp. 108–111, Feb. 2000.
[111] S. Huda, J. Anderson, and H. Tamura, “Charge Recycling for Power Reduction in FPGA Interconnect,” in Field Programmable Logic and Applications (FPL), 2013 23rd International Conference on, pp. 1–8, Sept 2013.
[112] S. Yang, “Logic Synthesis and Optimization Benchmarks User Guide, Version 3.0,” tech. rep., Jan 1991.
[113] A. Kumar and M.
Anis, “IR-Drop Management CAD Techniques in FPGAs for Power Grid Reliability,” in Proceedings of the 2009 10th International Symposium on Quality of Electronic Design, (Washington, DC, USA), pp. 746–752, IEEE Computer Society, 2009.
[114] A. Kumar and M. Anis, “IR-Drop Aware Clustering Technique for Robust Power Grid in FPGAs,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 19, pp. 1181–1191, July 2011.
[115] S. R. Nassif, “Power Grid Analysis Benchmarks,” in Proceedings of the 2008 Asia and South Pacific Design Automation Conference, ASP-DAC ’08, (Los Alamitos, CA, USA), pp. 376–381, IEEE Computer Society Press, 2008.
[116] B. Vaidyanathan, S. Srinivasan, Y. Xie, N. Vijaykrishnan, and L. Rong, “Leakage Optimized DECAP Design for FPGAs,” in IEEE Asia Pacific Conference on Circuits and Systems, pp. 960–963, Dec. 2006.
[117] M. Popovich, A. V. Mezhiba, and E. G. Friedman, Power Distribution Networks with On-Chip Decoupling Capacitors. Springer Publishing Company, Incorporated, 1st ed., 2007.
[118] T. Charania, A. Opal, and M. Sachdev, “Analysis and Design of On-Chip Decoupling Capacitors,” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 21, pp. 648–658, April 2013.
[119] W. Zhao and Y. Cao, “New Generation of Predictive Technology Model for Sub-45nm Design Exploration,” in International Symposium on Quality Electronic Design, pp. 585–590, 2006.
[120] “MOSIS Scalable CMOS Design Rules.” http://www.mosis.com/Technical/Designrules/scmos/scmos-main.html.
[121] “Synopsys Low-Power Flow User Guide.” Version D-2010.03, March 2010.
[122] A. R. Marquardt, “Cluster-Based Architecture, Timing-Driven Packing and Timing-Driven Placement for FPGAs,” Master’s thesis, University of Toronto, 1999.
[123] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, T. Czajkowski, S. D. Brown, and J. H. Anderson, “LegUp: An Open-source High-level Synthesis Tool for FPGA-based Processor/Accelerator Systems,” ACM Trans. Embed. Comput. Syst., vol.
13, pp. 24:1–24:27, Sept. 2013.
[124] Y. Hara, H. Tomiyama, S. Honda, H. Takada, and K. Ishii, “CHStone: A Benchmark Program Suite for Practical C-based High-Level Synthesis,” in ISCAS’08, pp. 1192–1195, 2008.
[125] K.-W. Kwok, K. H. Tsoi, V. Vitiello, J. Clark, G. Chow, W. Luk, and G.-Z. Yang, “Dimensionality Reduction in Controlling Articulated Snake Robot for Endoscopy Under Dynamic Active Constraints,” IEEE Transactions on Robotics, vol. 29, no. 1, pp. 15–31, 2013.
[126] J. Detrey and F. de Dinechin, “FPLibrary, A VHDL Library of Parametrisable Floating-Point and LNS Operators for FPGA.” http://www.ens-lyon.fr/LIP/Arenaire/Ware/FPLibrary/, 2004.
[127] “Quartus II University Interface Program.” http://www.altera.com/education/univ/research/quip/unv-quip.html.
[128] R. Tessier, “Fast Placement Approaches for FPGAs,” ACM Trans. Des. Autom. Electron. Syst., vol. 7, pp. 284–305, Apr. 2002.
[129] A. Mathur and L. Minwell, “Memory Power Reduction in SoC Designs Using PowerPro MG,” Design Reuse, 2009.
[130] A. Kumar and M. Anis, “IR-Drop Management in FPGAs,” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 29, pp. 988–993, June 2010.
[131] P. Jamieson, T. Becker, P. Y. K. Cheung, W. Luk, T. Rissa, and T. Pitkänen, “Benchmarking and Evaluating Reconfigurable Architectures Targeting the Mobile Domain,” ACM Trans. Des. Autom. Electron. Syst., vol. 15, no. 2, pp. 1–24, 2010.
[132] C. Mark, A. Shui, and S. Wilton, “A System-Level Stochastic Circuit Generator for FPGA Architecture Evaluation,” in Proceedings of the 2008 International Conference on Field-Programmable Technology, pp. 25–32, 2008.
[133] “IEEE Standard for Design and Verification of Low-Power Integrated Circuits,” IEEE Std 1801-2013 (Revision of IEEE Std 1801-2009), pp. 1–348, May 2013.
[134] “Si2 Common Power Format Specification 2.0,” 2011.
[135] “International Technology Roadmap for Semiconductors.” http://www.itrs.net/Links/2013ITRS/Home2013.htm, 2013.
[136] S. Srinivasan and V.
Narayanan, “Variation Aware Placement for FPGAs,” in Emerging VLSI Technologies and Architectures, 2006. IEEE Computer Society Annual Symposium on, pp. 422–423, March 2006.
[137] L. Cheng, J. Xiong, L. He, and M. Hutton, “FPGA Performance Optimization Via Chipwise Placement Considering Process Variations,” in Field Programmable Logic and Applications, 2006. FPL ’06. International Conference on, pp. 1–6, Aug 2006.
[138] A. Bsoul, N. Manjikian, and L. Shang, “Reliability- and Process Variation-Aware Placement for FPGAs,” in Design, Automation Test in Europe Conference Exhibition (DATE), 2010, pp. 1809–1814, March 2010.
[139] K. Ramakrishnan, S. Suresh, N. Vijaykrishnan, M. Irwin, and V. Degalahal, “Impact of NBTI on FPGAs,” in VLSI Design, 2007. Held jointly with 6th International Conference on Embedded Systems., 20th International Conference on, pp. 717–722, Jan 2007.
[140] A. Calimera, E. Macii, and M. Poncino, “Power-Gating: More Than Leakage Savings,” in Ph.D. Research in Microelectronics and Electronics (PRIME), 2010 Conference on, pp. 1–4, July 2010.
[141] R. Lysecky, G. Stitt, and F. Vahid, “Warp Processors,” ACM Trans. Des. Autom. Electron. Syst., vol. 11, pp. 659–681, June 2004.
[142] I. Kuon and J. Rose, “Area And Delay Trade-offs in the Circuit and Architecture Design of FPGAs,” in FPGA ’08: Proceedings of the 16th international ACM/SIGDA symposium on Field programmable gate arrays, pp. 149–158, 2008.

Appendix A
FPGA Architecture Parameters

The following listing is the basic architecture file used for the experiments in Chapter 5.
This architecture file has been obtained from the intelligent FPGA Architecture Repository (iFAR) website [142], and modified to reflect the architecture parameters used in the dissertation.

<!-- # This is a modified version of an iFAR - Architecture file -->
<!-- # Version 0.3 - 296 -->
<!-- # Copyright 2009 by Ian Kuon and Jonathan Rose -->
<!-- # Timing numbers for 45nm CMOS (BPTM) (1.0 V) with (area^1 * delay^1) optimization -->
<!-- # Name: n10k04l09.fc20fo10.area1delay1.cmos45nm.bptm -->
<!-- # link: http://www.eecg.utoronto.ca/vpr/architectures/vpr-0.3/n10k04l09.fc20fo10.area1delay1.cmos45nm.bptm.cmos45nm.xml -->
<!-- # row information in table: -->
<!-- # tile area: 1873.31 -->
<architecture>
  <layout auto="1.0"/>
  <device>
    <sizing R_minW_nmos="6065.520020" R_minW_pmos="18138.500000" ipin_mux_trans_size="1.093770"/>
    <timing C_ipin_cblock="5.07e-16" T_ipin_cblock="7.534000e-11"/>
    <area grid_logic_tile_area="7264.740234"/>
    <chan_width_distr>
      <io width="1.000000"/>
      <x distr="uniform" peak="1.000000"/>
      <y distr="uniform" peak="1.000000"/>
    </chan_width_distr>
    <switch_block type="wilton" fs="3"/>
  </device>
  <switchlist>
    <switch type="mux" name="0" R="0.000000" Cin="5.07e-16" Cout="4.8302929687499995e-15" Tdel="9.458000e-11" mux_trans_size="1.989000" buf_size="24.265699"/>
  </switchlist>
  <segmentlist>
    <segment freq="1.000000" length="4" type="unidir" Rmetal="51.8" Cmetal="10.608e-15">
      <mux name="0"/>
      <sb type="pattern">1 1 1 1 1</sb>
      <cb type="pattern">1 1 1 1</cb>
      <mux name="0"/>
    </segment>
  </segmentlist>
  <typelist>
    <io capacity="6" t_inpad="4.775000e-11" t_outpad="1.556000e-11">
      <fc_in type="frac">0.200000 </fc_in>
      <fc_out type="frac">0.100000 </fc_out>
    </io>
    <type name=".clb">
      <subblocks max_subblocks="6" max_subblock_inputs="4">
        <timing>
          <T_comb>
            <trow>2.029000e-10</trow>
            <trow>2.029000e-10</trow>
            <trow>2.029000e-10</trow>
            <trow>2.029000e-10</trow>
          </T_comb>
          <T_seq_in>
            <trow>1.838000e-10</trow>
          </T_seq_in>
          <T_seq_out>
            <trow>8.623000e-11</trow>
          </T_seq_out>
        </timing>
      </subblocks>
      <fc_in type="frac">0.200000 </fc_in>
      <fc_out type="frac">0.100000 </fc_out>
      <pinclasses>
        <class type="in"> 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 </class>
        <class type="out">16 17 18 19 20 21 </class>
        <class type="global">22 </class>
      </pinclasses>
      <pinlocations>
        <loc side="left">1 5 9 13 17 21 </loc>
        <loc side="right">3 7 11 15 19 </loc>
        <loc side="top">2 6 10 14 18 </loc>
        <loc side="bottom">0 4 8 12 16 20 </loc>
      </pinlocations>
      <gridlocations>
        <loc type="fill" priority="1"/>
      </gridlocations>
      <timing>
        <tedge type="T_sblk_opin_to_sblk_ipin">5.030000e-11</tedge>
        <tedge type="T_fb_ipin_to_sblk_ipin">5.297000e-11</tedge>
        <tedge type="T_sblk_opin_to_fb_opin">0.000000e+00</tedge>
      </timing>
    </type>
  </typelist>
  <!-- clock network -->
  <clocks>
    <clock buffer_R="31" buffer_Cin="20e-15" buffer_Cout="50e-15" Rwire="10" Cwire="5e-15" Cin_per_clb_clock_pin="2e-15"/>
  </clocks>
</architecture>
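The headline parameters of the architecture file above can be read back programmatically, which is a convenient sanity check after hand-editing such a file. The following is a minimal sketch in Python using the standard-library `xml.etree.ElementTree` module. The helper name `summarize_arch` and the dictionary keys are hypothetical, and the embedded XML is an abbreviated excerpt of the listing rather than the full file. Reading `max_subblocks` as the number of logic elements per cluster and `max_subblock_inputs` as the LUT input count is an assumption based on the usual meaning of these attributes in this architecture format.

```python
import xml.etree.ElementTree as ET

# Abbreviated excerpt of the Appendix A listing: only the elements read
# below are kept, and nested timing data is omitted for brevity.
ARCH_XML = """
<architecture>
  <device>
    <switch_block type="wilton" fs="3"/>
  </device>
  <segmentlist>
    <segment freq="1.000000" length="4" type="unidir"
             Rmetal="51.8" Cmetal="10.608e-15"/>
  </segmentlist>
  <typelist>
    <type name=".clb">
      <subblocks max_subblocks="6" max_subblock_inputs="4"/>
      <fc_in type="frac">0.200000 </fc_in>
      <fc_out type="frac">0.100000 </fc_out>
    </type>
  </typelist>
</architecture>
"""


def summarize_arch(xml_text):
    """Return a dict of headline parameters (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    clb = root.find("./typelist/type[@name='.clb']")
    sub = clb.find("subblocks")
    seg = root.find("./segmentlist/segment")
    sb = root.find("./device/switch_block")
    return {
        # Assumed meaning: logic elements per cluster and inputs per LUT.
        "elements_per_cluster": int(sub.get("max_subblocks")),
        "lut_inputs": int(sub.get("max_subblock_inputs")),
        "segment_length": int(seg.get("length")),
        "segment_type": seg.get("type"),
        "switch_block": sb.get("type"),
        "fs": int(sb.get("fs")),
        # float() tolerates the trailing space in the element text.
        "fc_in": float(clb.findtext("fc_in")),
        "fc_out": float(clb.findtext("fc_out")),
    }


summary = summarize_arch(ARCH_XML)
for key, value in summary.items():
    print(f"{key}: {value}")
```

To check the complete file instead of the excerpt, the same function can be given the file contents read from disk, since it only relies on element paths that appear in the listing.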
