"Applied Science, Faculty of"@en . "Electrical and Computer Engineering, Department of"@en . "DSpace"@en . "UBCV"@en . "Aken\u00E2\u0080\u0099Ova, Victor Olubunmi"@en . "2009-12-11T21:36:33Z"@en . "2005"@en . "Master of Applied Science - MASc"@en . "University of British Columbia"@en . "Potential cost savings that come from the ability to make post fabrication changes in System-on-Chip\r\n(SoC) designs make embeddable Field Programmable Gate Array (eFPGA) cores an attractive\r\ndesign option. However, they are only available as \"hard\" macros from vendors as a small number\r\nof fixed size cores, and may not be optimal in terms of area, power or delay for a given SoC. A\r\n\"soft\" eFPGA methodology [01] [02] based on the ASIC design flow was used to create small\r\namounts of programmable logic but incurs significant overhead. In this thesis, it is shown that this\r\noverhead can be reduced by deploying architecture-specific tactical standard cells in the ASIC flow,\r\nmaking eFPGA generation configurable, and imposing a regular structure on eFPGA architectures.\r\nFor the set of benchmarks considered, the use of tactical standard cells resulted in area and delay\r\nsavings of 58% and 40% respectively, when compared to cores implemented with generic standard\r\ncells [02]. Also, a proposed IP-generator-based approach for eFPGA design is shown to achieve\r\nresults that are competitive with commercial full-custom hard eFPGA cores. For example, for some\r\nlarge benchmark circuits (over 1000 4-LUTs) the generated eFPGA fabrics were up to 40% smaller\r\nthan available hard eFPGA cores. Finally, it is shown that a regular structured architecture makes it\r\npossible to generate fabrics with logic capacities that gready exceed what was previously possible\r\n[02] [15]. In addition, a structured layout approach yielded a 36% reduction (average) in wire lengths."@en . "https://circle.library.ubc.ca/rest/handle/2429/16577?expand=metadata"@en . "Bridging the Gap between Soft and Hard eFPGA Design by Victor Olubunmi Aken'Ova B.A.Sc. University of British Columbia, 2002 A thesis submitted in partial fulfillment of the requirements for the degree of Master of Applied Science Degree in The Faculty of Graduate Studies Electrical and Computer Engineering The University of British Columbia March 2005 \u00C2\u00A9 Victor Olubunmi Aken'Ova 2005 Abstract Bridging the Gap between Hard and Soft eFPGA Design by Victor Olubunmi Aken'Ova Potential cost savings that come from the ability to make post fabrication changes in System-on-Chip (SoC) designs make embeddable Field Programmable Gate Array (eFPGA) cores an attractive design option. However, they are only available as \"hard\" macros from vendors as a small number of fixed size cores, and may not be optimal in terms of area, power or delay for a given SoC. A \"soft\" eFPGA methodology [01] [02] based on the ASIC design flow was used to create small amounts of programmable logic but incurs significant overhead. In this thesis, it is shown that this overhead can be reduced by deploying architecture-specific tactical standard cells in the ASIC flow, making eFPGA generation configurable, and imposing a regular structure on eFPGA architectures. For the set of benchmarks considered, the use of tactical standard cells resulted in area and delay savings of 58% and 40% respectively, when compared to cores implemented with generic standard cells [02]. 
Also, a proposed IP-generator-based approach for eFPGA design is shown to achieve results that are competitive with commercial full-custom hard eFPGA cores. For example, for some large benchmark circuits (over 1000 4-LUTs) the generated eFPGA fabrics were up to 40% smaller than available hard eFPGA cores. Finally, it is shown that a regular structured architecture makes it possible to generate fabrics with logic capacities that gready exceed what was previously possible [02] [15]. In addition, a structured layout approach yielded a 36% reduction (average) in wire lengths. ii Table of Contents Abstract ii Table of Contents iii List of Figures vi List of Tables ix Acknowledgments x Chapter 1 Thesis Introduction 1 1.1 Research Motivation 1 1.2 Research Objectives 3 1.3 Thesis Organization 5 Chapter 2 Background and Related Previous Research 6 2.1 Overview of Integrated Circuit (IC) Design Techniques 6 2.2 Embedded Programmable Logic IC Design Techniques 12 2.2.1 Embedded Field Programmable Gate Array (eFPGA) 13 2.2.2 Embedded Mask Programmable Gate Array (MPGA) 17 2.2.3 Example Application: Bluetooth Base-band System-on-Chip 19 2.3 Embedded Programmable Logic as an Intellectual Property (IP) 21 2.3.1 Hard eFPGA IP.. 21 2.3.2 Soft eFPGA IP 23 2.4 Research Problem Definition and Thesis Research Focus 25 Chapter 3 An Embedded Programmable Logic Architecture Family 27 3.1 Island-Style eFPGA Architectures 27 3.1.1 Bidirectional Routing Architectures for eFPGA design 28 3.2 Architectural Issues 32 3.2.1 Input/Output Design 32 3.2.2 Design for Testability 36 3.2.3 SRAM Power up State 38 Chapter 4 Island-Style eFPGA Design with Generic Standard Cells 40 4.1 The Existing Design flow 40 iii 4.1.1 Front-End Flow 42 4.1.2 Back-End Flow 44 4.2 Design Flow Issues and Solutions 44 4.2.1 Combinational Loop-back 44 4.2.2 Architecture Discrepancies 49 4.2.3 Static Timing Exceptions 49 4.2.4 Configuration Power 51 4.3 Design Results 54 Chapter 5 Island-Style eFPGA Design with Custom Standard Cells 63 5.1 An Improved Design flow 63 5.2 Design of Custom Cells \u00E2\u0080\u00A2 65 5.2.1 SRAM Cell Circuit Design 65 5.2.2 Multiplexer Circuit Design 66 5.3 Layout Design 72 5.4 Layout Improvements 75 5.5 eFPGA Design Results 76 5.5.1 Area Improvements 76 5.5.2 Delay Improvements 77 5.6 Comparison to GILES 82 5.7 Sensitivity Case Study 85 5.8 Mux Switch Evaluation 89 Chapter 6 The Implications for eFPGA IP Design 90 6.1.1 Some \"Real world\" Case Studies 91 6.2 A New Paradigm for eFPGA IP Design 94 6.2.1 \"Open\" Architecture IP Library 95 6.2.2 Configurable Architecture IP 96 6.2.3 Domain-driven IP generation 96 6.2.4 Automated Layout generation 98 6.2.5 Summary 100 Chapter 7 Final Conclusions and Future Research Work 101 7.1 Conclusions 101 7.2 Future Work 104 iv 7.3 Contributions 104 Appendix A Standard Cell Based eFPGA implementation Results 107 Appendix B Area Model for Standard Cell based eFPGA area Estimation Il l Appendix C Details of Circuit Analysis Solutions for Multiplexer Circuits 113 Appendix D Area estimation method for GILES and Full-custom eFPGAs 116 Appendix E A sample structured eFPGA layout and Bluetooth SoC Floorplan 123 Bibliography 125 v List of Figures Figure 2.1: A general overview of existing integrated circuit design techniques 7 Figure 2.2: Sample generic standard cell layout for two CMOS logic functions 8 Figure 2.3: Typical standard cell chip layout and a closer view of cell rows in chip 9 Figure 2.4: A programmable platform IC with an embedded PowerPC\u00C2\u00AE processor 10 Figure 2.5: 
Typical flow for semi-custom cell, block, and core based ASIC design 11 Figure 2.6: 2x2 island style eFPGA core architecture and regular tile architecture 13 Figure 2.7: programmable logic CLB of size 4 and a disjoint switch block topology 14 Figure 2.8: Input and output connection blocks in the eFPGA routing architecture 15 Figure 2.9:VPR placement and routing of a sample user circuit in a 6x6 logic array 16 Figure 2.10: A MPGA logic tile layout, and a connectivity fabric of top metal layers 18 Figure 2.11: A System-on-a-Chip platform that implements the Bluetooth Protocol 20 Figure 2.12: An example of architectural inefficiencies in existing hard eFPGA IP 23 Figure 3.1: An island style eFPGA tile architecture with tri-stated routing switches 28 Figure 3.2: Island style eFPGA tile architecture with multiplexed routing switches 29 Figure 3.3: Island style eFPGA tile with improved multiplexed routing switch 1 30 Figure 3.4: Island-style eFPGA tile with improved multiplexed routing switch 2 31 Figure 3.5: Programmable I/O pad connections and unused switch connections 33 Figure 3.6: Equivalent external input routing in standalone and embedded FPGA 34 Figure 3.7: A limitation of the novel embedded FPGA external I/O architecture 35 Figure 3.8: An embedded FPGA IP configuration architecture designed for test 37 Figure 3.9: Logic to prevent driver contention in bidirectional routing architecture 39 Figure 4.1: Typical flow for semi-custom cell, block, and core based ASIC design 42 Figure 4.2: Combinational and sequential loops in an island-style eFPGA tile BLE 45 Figure 4.3: Example of combinational loops path in a multiplexed routing switch 46 Figure 4.4: Technique to avoid loop breaking during synthesis and verification 47 Figure 4.5: VPR critical riming path trace for \"golden-20\" benchmark circuit apex4 50 Figure 4.6: Proposed configuration scheme targeting a single tile and row of tiles 51 vi Figure 4.7: Proposed scheme for glitch isolation during soft eFPGA configuration 53 Figure 4.8: Critical path delay comparison of three jy?/? island-style architectures 55 Figure 4.9: Overall core area comparison of three soft island-style architectures 56 Figure 4.10: Comparison of CMOS logic and pass transistor based multiplexers 60 Figure 4.11: transistor level illustration of a simple single-edge-triggered flip-flop 61 Figure 4.12: pie charts showing area distribution in a soft island-style eFPGA. 
tile 62 Figure 5.1: an enhanced ASIC design flow for eFPGA design and implementation 64 Figure 5.2: SRAM transistor sizing for embedded FPGA configuration memory 66 Figure 5.3: multiplexer with minimum size output buffer and extra buffer stages 66 Figure 5.4: a level restoring circuit for pass-transistor-based multiplexing logic 67 Figure 5.5: Delay optimization problem specification for LUT input-selection paths 68 Figure 5.6: Critical delay paths for the different multiplexing circuits in an eFPGA 69 Figure 5.7: Issues around input loading for pass tree networks in the ASIC flow 70 Figure 5.8: two possible multiplexing-logic buffering schemes for eFPGA design 71 Figure 5.9: RC network representation of repeater insertion in an eFPGA 5-LUT 72 Figure 5.10: standard cell layout structure before and after n-well cutout is made 73 Figure 5.11: illustration of resource allocation in a multiplexer standard cell layout 74 Figure 5.12: double height standard cell layout of 32:1 multiplexer (2 metal layers) 74 Figure 5.13:1 flip-flop used for configuration memory in the previous approaches 75 Figure 5.14: area comparisons of customized tactical standard-cell-based eFPGA implementations with generic standard cell, and full-custom implementations 77 Figure 5.15: illustration of SPICE simulation setup to measure logic block delay 79 Figure 5.16: SPICE simulation setup to measure the routing switch element delay 80 Figure 5.17: delay comparison of customized tactical standard-cell-based eFPGA implementations with generic standard cell, and full-custom implementations 81 Figure 5.18: Core area comparison of product-term and island-style architectures 86 Figure 5.19: delay path comparison of product-term and island-style architectures 88 Figure 5.20: Area-delay-product comparison of product-term and island-style core 88 Figure 6.1: logic efficiency comparisons of a standard-cell-based eFPGA IP generator approach to a commercial hard IP approach for 9 MCNC benchmarks 92 vii Figure A.l: Block level diagram of the test interface module with an eFPGA fabric 107 Figure A.2: simulation waveform capture of eFPGA state transitions and response 108 Figure A.3: simulation waveform capture of glitches and glitch isolation in a BLE 108 Figure C.l: RC network representation of shared buffering scheme in an eFPGA 113 Figure C.2: RC network representation of repeater insertion in an eFPGA 5-LUT 114 Figure D . l : plot of layout area vs. inputs for general-purpose eFPGA multiplexers 116 Figure D.2: Plot of layout area (excluding the SRAMs) vs. select inputs for LUT 116 Figure D.3: Plot of layout area (excluding the function SRAMs) vs. inputs for LUT 117 Figure D.4: Area distribution between logic and routing components of eFPGA 119 Figure D.5: ASIC gate density scaling after estimated improvements to routing 119 Figure E.l: Structured eFPGA layout for Bluetooth Baseband Encryption Module 123 Figure E.2: Bluetooth baseband SoC showing underutilized area around the core 124 vui List of Tables Table 2.1: Comparison of eFPGA, MPGA and standard cell logic design methods 18 Table 4A: Area overhead of \"soft\" eFPGA design relative to full-custom approach 58 Table 4.2: Delay overhead of \"soft\" eFPGA design relative to a full-custom design 59 Table 5.1: layout area improvements with tactical cells vs. 
generic standard cells 75 Table 5.2: The scaling factors for multiplexer logic bloating in GILES Virtex-E tile 84 Table 6.1: Summary of Soft, Firm and Hard eFPGA implementation methodologies 100 Table A.l: Design results for 18 eFPGA cores with tri-state buffer based switches 109 Table A.2: Design results for 18 eFPGA cores with original multiplexer switches 110 Table A.3: Design results for 18 eFPGAs using the speedy multiplexer switch 1 110 ix Acknowledgments I would like to thank my supervisor Dr. Resve Saleh for the opportunity to work with him. His enthusiasm and support over the years are unmatched. He was easily my most vocal Cheerleader. I also would like to thank Dr. Guy Lemieux for teaching me everything I know about FPGAs. I am also grateful to all the other professors of the SoC research Lab because I learned something from all of them. I especially would like to thank Dr. Steve Wilton who spearheaded the Soft eFPGA concept at UBC that ultimately led to my research topic. Thanks for all the comments and insight. I would also like to thank the staff of the SoC lab. Roozbeh our CAD manager was a tremendous help and always tried his best to cater to my many requests. Roberto our Test Lab manager was always fantastic with the test lab equipment and great at finding solutions to all kinds of vexing problems. Sandy Scott our administrative assistant was always cheerful and happy to help. I am grateful to you all. I would also like to thank past and present student members of the SoC lab for their support. I especially thank Pedram Sameni, James Wu, Ronald Fung, Louis Hong, Mohsen Nahvi, Marvin Tom, Zion Kwok, Neda Nouri, Noha KafafL Julien Lamoureux, Rod Foist, Brad Quinton, and Kara Poon. Almost last, but certainly not least, I would like to thank my entire family for the tremendous love and support they have shown me my entire life. I am thankful to them for teaching me the value of hard work and dedication. A very gigantic hug also goes to my dear Laura for easily being my best friend. Many thanks to NSERC, PMC-Sierra, the Canadian Microelectronics Corporation and the University of British Columbia for the financial and tool support that made all this work possible. Viva Canada. x In loving memory of my dearest Mama XI Chapter 1 Thesis Introduction 1.1 Research Motivation The increasing density of integrated circuit (IC) designs due to shrinking transistor sizes has led to the emergence of System-on-Chip (SoC) design to cope with the resulting increase in design size and complexity. However, this has resulted in significant increases in the cost of IC designs due to corresponding increases in engineering and mask costs in the order of tens of millions of dollars. Therefore, designers are pursuing software and hardware methods to build programmable SoCs and avoid the extra costs of chip re-spins. A programmable device can be used to compensate for errors, adapt to changes in standards or design specifications and amortize costs over design derivatives. Embedded Programmable Logic Cores (ePLCs) have emerged as a natural hardware solution to meet this growing challenge because they allow logic functionality to be changed after fabrication. 
Such cores are generally suited to small and medium on-chip logic functions such as accelerator functions for on-chip processors to speed up embedded software, data encryption circuits in wireless devices that need to be changed from time to time, packet routing switches for the newly emerging Network-on-Chip design paradigm [71], and I/O standards for data communication. In spite of the potential cost benefits and useful applications of ePLCs in SoC design, commercial success has been limited by a number of difficult issues that arise. These range from design and implementation issues [14] to issues related to the nature of commercial ePLCs devices in general. For example, common design problems concern the selection of blocks to make programmable, the 1 integration of fixed and programmable blocks and the size of the programmable block. However, by far the biggest issue with ePLCs is the high area, power and delay overhead that they generally incur. These overhead issues are not unique to ePLCs alone; it has also become an important issue for stand-alone programmable logic devices like Field Programmable Gate Array (FPGA) chips. However, this issue is more critical for ePLCs because they are needed in high performance IC designs that have stringent area, speed, and power requirements. Further, this problem is made even worse because ePLCs, like FPGA chips, are available in limited sizes and ranges, and so designers must select the smallest core or chip that will accommodate their circuit(s). The selected ePLC could be much bigger than needed, slower, and consume more power, because vendors include excess resources for potential circuit implementations. As a result, prospective users are sometimes forced to abandon ePLCs altogether, and this has resulted in a general lack of interest in these devices. Current commercial programmable logic design techniques do not afford vendors much opportunity to address some of the reasons for the overhead in ePLCs because they use expensive full-custom design techniques to build ePLCs, and so have to find a cost effective tradeoff when designing their products. Hence, for cost reasons, these vendors design a niinimum set of cores that can serve the broadest range of user applications possible while keeping the device performance overhead within tolerable bounds. This model works well for standalone FPGA chips because they are not used in high performance situations and therefore users can tolerate the additional performance overhead. The above model is unsatisfactory for ePLCs because higher performance is often required. Furthermore, because their application space is limited and target user circuits often share common characteristics, surplus resources on the same scale as FPGA chips is not required. For example, if a 2 designer uses an ePLC for the purpose of allowing future \"bug fixes\" or design revisions, then it is reasonable to assume that potential future circuit implementations will be of about the same size. The existing model for ePLC design cannot take advantage of domain information, since tailoring ePLCs for application specific scenarios is expensive for vendors. However, if ePLCs are to make inroads into mainstream digital design, domain knowledge should be exploited [01] in some fashion. The potential benefits and the difficulties associated with embeddable programmable logic design motivate the research work presented in this thesis report. For most of the benefits to be realized, current ePLC design methodologies must be revisited. 
For example, rather than a "one-size-fits-all" model for designing programmable logic cores, an alternative approach that tailors cores to a particular application domain might be more appropriate. Therefore, in this research, we explore some new techniques that could make a "one-size-fits-few" design model feasible for ePLC design.

1.2 Research Objectives

Given the inherent difficulties associated with using ePLCs in high performance chip design, this work proposes some new ways to reverse the current trends and make the use of ePLCs more attractive to users. To achieve this goal, some important observations about current methodologies for embedded programmable logic design will be made and solutions investigated and evaluated. First, the advantages and disadvantages of the "soft" [02] and "hard" [44] ePLC design methodologies are evaluated. Following from this, an embeddable programmable logic architecture is implemented in a way that combines the best characteristics of the existing design methodologies. Second, inefficiencies in area and speed that result from implementing an ePLC within an automated, generic-cell-based design flow [02] are investigated, and then compared to previous research results. The goal here is to pinpoint areas of the design flow that could benefit the most from optimization. Third, architecture-specific tactical cells are designed to replace generic cells and eliminate inefficiencies resulting from the use of a generic-cell-based IC design flow [02] [14] [15]. Following custom cell design, area and speed improvements that can be achieved as a result of implementing an embeddable programmable logic architecture with custom cells in an automated design flow are reported. The results obtained are then compared with other previous approaches [02] [15] [68]. Fourth, certain inefficiencies that exist in current commercial design approaches are investigated by comparing our design results for area and speed to an existing commercial hard ePLC library [44]. Lastly, a new paradigm is introduced for embedded programmable logic design that combines domain-driven architectural exploration with a flexible and efficient semi-custom circuit design flow.

1.3 Thesis Organization

Chapter 2 of this thesis presents some general background on the research subject, summarizes work done in this area to date, describes benefits and difficulties associated with existing ePLC technologies and design methods, illustrates potential uses of programmable technology on a wireless IC (Bluetooth) and then introduces an approach for ePLC design explored in this thesis. Chapter 3 describes variants of a well-known architecture to be used in the proposed approach for embedded PLC design. This chapter includes a detailed description of some design issues that were resolved during architecture specification in order to ensure proper ePLC operation. Chapter 4 presents results obtained after implementing our ePLC in a "soft" design flow as described in [14]. In Chapter 5, an alternative approach similar to work presented in [01] [68] [69], which requires the design and implementation of architecture-specific tactical cells, is explored. The details of design and implementation issues related to this work are presented. Next, our results are compared with earlier results obtained using architectures and approaches described in [14] [15] [68].
Chapter 6 compares the results obtained in Chapter 5 with estimates for commercial hard eFPGAs [41], to illustrate key problems related to current design approaches. In addition, two blocks in a Bluetooth baseband SoC are used to investigate the impact of eFPGA cores on area. Finally, a novel design paradigm, for automatic embedded programmable logic generation is presented. Chapter 7 summarizes the work presented, suggests topics for future work and lists contributions. 5 Chapter 2 Background and Related Previous Research This chapter begins with a general overview of logic design methodologies and then focuses on programmable logic design. In particular, two commercial technologies for embedded programmable logic design are described. The description focuses on important features of both approaches and highlights their advantages and disadvantages. Next, a wireless SoC platform is used to illustrate potential applications of programmable logic in IC design. Also, issues related to different methodologies for reprogrammable logic implementation are discussed. Finally, a description of the research problem and an outline of the focus of this thesis report are presented. 2.1 Overview of Integrated Circuit (IC) Design Techniques Over the last three decades, several logic design methodologies have evolved to cope with technological advancements in semiconductor circuit design. As shown in Figure 2.1, these methodologies tend to fall into two main groups: full-custom and semi-custom IC design methods. Full-custom design relies to a large extent on manual effort for most design decisions. For example, design decisions such as transistor sizing, transistor layout, device placement and routing are all carried out manually with the aid of rudimentary Computer-Aided Design (CAD) tools. This technique offers the greatest flexibility from a designer perspective, because circuits can be tailored to specifications with superior performance in terms of area, delay or power. However, there is a high engineering cost overhead involved. Furthermore, given shrinking time-to-market windows and 6 shelf-life of IC products, it becomes more difficult to depend on full-custom techniques for IC design. For example, a large IC design house like Intel\u00C2\u00AE requires large teams of designers working for the equivalent of hundreds of man-years to deliver high-performance full-custom products such as the Pentium\u00C2\u00AE chip on schedule. This results in a huge expense that has an impact on the pricing of such chips and the products that use them (e.g., personal computers). Similarly, well-known programmable logic device vendors like Xilinx\u00C2\u00AE and Altera\u00C2\u00AE, use full-custom techniques to design their Field Programmable Gate Array (FPGA) devices, and this is part of the reason some of these devices can cost anywhere from a few hundred dollars to a few thousand dollars per chip. Digital IC design Methodologies Figure 2.1: A general overview of existing integrated circuit design techniques In order to reduce engineering effort of full-custom IC designs, some vendors resort to semi-custom design techniques to deliver new products much faster and at lower cost. Semi-custom design relies more on automated flows and Electronic Design Automation (EDA) tools to implement circuits. These EDA tools use libraries of pre-designed logic cells, blocks, and or cores to implement circuits in much shorter time but with some overhead costs, and certain restrictions on the IC designer. 
Some examples of semi-custom design include cell-based, block-based and/or core-based ICs. However, it is important to notice from Figure 2.1 that semi-custom designs do have elements of full-custom design and vice-versa. In other words, some components like cells, blocks, or cores in a semi-custom design are implemented using full-custom techniques to some degree. The extent to which this "crossover" occurs depends upon performance requirements and time-to-market constraints. Furthermore, full-custom designs can also make use of cells, blocks and cores as needed.

In standard-cell-based semi-custom designs, vendors develop cell libraries that implement generic logic gates such as NORs and NANDs with different drive strengths and in accordance with a constrained physical layout format. These libraries of logic gates are supplied to designers who then map Register Transfer Level (RTL) descriptions of desired hardware behavior, written in VHDL or Verilog, into gates using logic synthesis tools such as Design Compiler® and Cadence-PKS®. A more aggressive form of cell-based design allows designers to build custom cells and tools according to their own specifications and tailored for a particular circuit or group of circuits [46] [48] [53] [55]. Figures 2.2(a) and 2.2(b) below show typical standard cell physical layouts. Usually PMOS devices are in the top part of the cell near Vdd and NMOS devices are in the lower part of the cell near Gnd. Figures 2.3(a) and 2.3(b) show a chip implementation of [19] that uses standard cells.

(a) simple function (b) complex function
Figure 2.2: Sample generic standard cell layout for two CMOS logic functions

(a) Full chip layout (b) Top corner of chip
Figure 2.3: Typical standard cell chip layout and a closer view of cell rows in chip

In order to implement the standard cell ASIC shown in Figure 2.3(a), special EDA tools are used for placement and routing. Placement involves reading a net-list of gates generated through logic synthesis and then arranging the corresponding standard cells in rows as shown in Figure 2.3(b). Placement is automated and can be constrained for area and performance. Standard cells are arranged in rows so that power (Vdd) and ground (Gnd) rails of adjacent cells are connected by abutment. Routing tools connect nodes (with metal wires) to implement the desired logic function. In semi-custom block-based designs, vendors implement larger functions like arithmetic logic units (ALUs), multipliers, and adders as blocks that designers can also include in a standard cell design. One reason for doing this is that certain elements in a cell-based design are not efficiently implemented with generic standard-cell-based logic gates like NANDs and NORs. Therefore, circuit blocks like ALUs, adders and multipliers would be included in a cell design to improve performance. In core-based designs, vendors implement even larger and more complex circuits called cores, for inclusion with cells and blocks in ICs. The most common cores are microprocessor cores such as the ARM7®, MIPS, and PowerPC® cores. For example, Figure 2.4 shows a Virtex-II Pro® programmable platform IC [72] with an embedded PowerPC® microprocessor core (highlighted).

Figure 2.4: A programmable platform IC with an embedded PowerPC® processor

In order to design a semi-custom IC, designers often use a flow similar to the one shown in Figure 2.5. This flow is known as the ASIC flow [73].
As shown in Figure 2.5, design data that contains dimensions, timing, drive-strength, and power requirements for cells, blocks, and cores are used in EDA tools at all stages of the design flow. For example, during timing verification, timing and drive strength data for cells, blocks, and cores used in a design are combined with resistance and capacitance values extracted from metal traces after routing, and used to estimate the timing characteristics of all valid signal paths. If a placed and routed design satisfies all timing requirements, it is converted into a special layout file format called GDSII. A GDSII file contains all the design data required to manufacture a chip. If a design does not meet timing, a new and improved net-list must be created, placed, routed and again verified for timing until design goals are met (Figure 2.5). 10 Figure 2.5: Typical flow for semi-custom cell, block, and core based ASIC design Full-custom cell, block, and core based design flows includes most of the steps shown in Figure 2.5. Designs are created and tested at the transistor level via circuit schematics and analog simulations. In addition, full-custom designs are often hierarchical and structured so that complex circuit designs can be better managed and optimized. Designers often begin full-custom designs by first building cell-like entities that are then combined to form larger structures like blocks and cores. Placement of cells, blocks and cores relative to one another is done manually along with routing of interconnects. Manual place and route often gives the best results but it is very time consuming. Microprocessor chips and programmable logic devices are often designed in this manner. For example, programmable logic devices like FPGAs, are built from a single, higUy-optirnized full-custom programmable logic block or tile, that is replicated to form a two-dimensional logic array. 11 2.2 Embedded Programmable Logic IC Design Techniques Programmable logic ICs in the form of Field Programmable Gate Arrays (FPGA) have been available for over two decades. In the last few years, there has been a push to develop embedded FPGA cores that reside in a chip alongside hardwired or fixed logic. The idea is that the exact logic function of a programmable core can be defined after fabrication, much like any stand-alone FPGA. The advantages of this approach include the ability to cater to multiple customers with a single programmable chip, accommodate changes in standards or design specifications, or allow designers to fix design errors that are caught after chip tape-out (if they are suitable for correction by the embedded FPGA). An embedded FPGA or programmable fabric is usually arranged as a structured array of logic blocks, switches and routing tracks. A logic function is implemented on the array by using SRAM cells set to 0 or 1 to define logic functionality as well as define routing connections. While programmable fabrics afford designers a tremendous amount of flexibility, there is significant overhead associated with this approach when compared to fixed logic (hardwired) ASICs. Specifically, the area of programmable fabrics can be 50 to 100 times higher, speeds can be 2 to 10 times worse, and power dissipation is substantially higher. Over the past few years, the creation of efficient programmable logic cores has been an active area of research and development in an effort to reduce overhead, improve ease of use, and make this option more attractive to chip designers. 
So far, leading approaches for embedded programmable logic implementation include embedded Field Programmable Gate Arrays (eFPGAs) [08] [41] [44] and Mask Programmable Gate Arrays (MPGAs) [45] [60]. This section presents a general architectural overview of programmable logic 12 devices within an eFPGA context, followed by a general description of MPGAs. The important features of each technology are emphasized, as well as their associated advantages and disadvantages. 2.2.1 Embedded Field Programmable Gate Array (eFPGA) In this section, an overview of the popular island-style FPGA architecture [11] is presented. This architecture (Figure 2.6) is the basis for many commercial programmable logic devices used today. \"0 \u00E2\u0080\u0094 CLB 10 (a) eFPGA core (b) eFPGA tile Figure 2.6: 2x2 island style eFPGA core architecture and regular tile architecture In Figure 2.6(a), CLB is a configurable logic block, SBLK is a routing switch block, LET is a left edge tile, BET is a bottom edge tile, CON is a corner tile, REGT is a regular tile, BUF contains track buffers for input connection blocks, and OB is an output connection block. Also, Figure 2.6(a) shows an eFPGA core as a structured array of logic clusters, switches and routing tracks of width W. The enlargement in Figure 2.6(b) shows that each regular tile contains a CLB, a switch block, connection blocks and SRAM cells for configuration. The left and bottom edge tiles only contain switch-blocks and connection blocks for routing. The corner tile connects the LETs and BETs. 13 Because logic blocks implement logic functions, they constitute the logic architecture of a programmable fabric. Similarly, switch blocks and connection blocks constitute the routing architecture because they route signals between tiles, and between the eFPGA and external logic via I/O ports. SRAMs and decoders constitute the configuration architecture since they configure the fabric. In essence, programmable logic devices are a combination of logic, routing, and configuration architectures. Within the logic architecture, CLBs comprise one or more basic logic elements (BLEs). The number of BLEs in a CLB is its cluster size, N. For example, in Figure 2.7(a) the cluster size is 4. Each BLE in a CLB is comprised of a K-input look-up-table (K-LUT), an edge-triggered flop (LUT-flop) to generate a registered copy of the LUT output, and an output selection multiplexer (\"H\" in Figure 2.6(b)) to select between registered and unregistered LUT outputs. LUT input multiplexers (\"M\" in Figure 2.7(a)) are used to drive K inputs of all N LUTs in a CLB of size N ( 4 in this example ). (a) tile CLB of cluster size 4 (b) disjoint switch topology Figure 2.7: programmable logic CLB of size 4 and a disjoint switch block topology The switch block design of the routing architecture is typically based on a disjoint topology [11] as shown in Figure 2.7(b). This means a given net cannot hop between track \"lanes\". For example, a net that enters a switch on routing track 0 as shown in Figure 2.7(b) will exit and continue on track 0 14 until it terminates. Connection blocks connect CLB inputs and outputs to adjacent routing channels. For example, in Figure 2.8(a) an input connection block multiplexer, G, selects a net from track \"2\" in the routing channel to drive a cluster input I (the input track buffers are used to minimize loading). Output connection blocks drive cluster outputs (O in Figure 2.8) onto the routing channel. 
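To make the logic architecture just described more concrete, the following is a minimal behavioural sketch, in Python, of a single BLE: a K-input LUT defined by 2^K configuration bits, the LUT-flop, and the output-select multiplexer "H". The class, the bit ordering used to address the LUT, and the example configuration are illustrative assumptions only; they do not reproduce the exact configuration encoding of the architecture studied in this thesis.

```python
# Behavioural sketch of one basic logic element (BLE). The LUT address
# encoding (input i maps to address bit i) is an assumption made for
# illustration, not the actual configuration format.

class BLE:
    def __init__(self, k, lut_bits, use_register):
        assert len(lut_bits) == 2 ** k       # one SRAM bit per input combination
        self.k = k
        self.lut_bits = lut_bits             # LUT truth table (configuration SRAM)
        self.use_register = use_register     # SRAM bit controlling output mux "H"
        self.ff = 0                          # state of the LUT-flop

    def lut(self, inputs):
        """Combinational K-LUT output for the current input vector."""
        address = sum(bit << i for i, bit in enumerate(inputs))
        return self.lut_bits[address]

    def clock_edge(self, inputs):
        """On a rising clock edge the LUT-flop captures the LUT output."""
        self.ff = self.lut(inputs)

    def output(self, inputs):
        """Output mux "H": registered or combinational copy of the LUT output."""
        return self.ff if self.use_register else self.lut(inputs)

# Example: a 4-LUT programmed as a 4-input AND gate, used combinationally.
and4 = BLE(k=4, lut_bits=[0] * 15 + [1], use_register=False)
print(and4.output([1, 1, 1, 1]))   # -> 1
print(and4.output([1, 0, 1, 1]))   # -> 0
```

A CLB of cluster size N would simply contain N such elements, with the input-selection multiplexers "M" choosing each LUT input from the cluster inputs and feedback connections.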
An output connection block has a tri-state driver for each track that it can drive. The value in the SRAM on the enable input of the tristate-driver determines whether or not a routing track is driven. (a) input connection path (b) output connection path Figure 2.8: Input and output connection blocks in the eFPGA routing architecture The configuration architecture includes configuration SRAMs, decoders, and a configuration finite state machine. A row decoder selects a row in Figure 2.6a for programming, and combined with the column decoder, can select a single tile for programming. A state machine controls configuration. The bits (sequence of zeros and ones) needed to configure a programmable logic device are generated using specialized CAD tools such as VPR [11]. The details of this tool are outside the scope of this work. However, at a high level, these CAD tools contain detailed models of programmable logic architectures like the one shown in Figure 2.6. These models are used to decide suitable placements for circuits that have been mapped into LUTs [11]. Once a suitable placement 15 for a circuit is found, it is routed [11]. FPGA routing involves finding suitable paths between all the connected nodes (sources and sinks)in a circuit implementation given the available routing resources in the target programmable logic device and any user-supplied path timing constraints. Figure 2.9(a) shows a VPR screen capture of a 6x6 programmable logic device after placement. The darker blocks represent occupied logic clusters. Figure 2.9(b) shows the same architecture after routing. Figure 2.9(b) shows just routing tracks that are used. Placement and routing files for a given circuit implementation are generated within VPR and used in software programs that generate the appropriate sequence of zeros and ones needed to program a circuit on a programmable device. (a) after placement (b) after routing Figure 2.9:VPR placement and routing of a sample user circuit in a 6x6 logic array An eFPGA is essentially an unpackaged standalone FPGA (Figure 2.9) that has been stripped of its 1/O pad ring and adapted to function as an embeddable IP core. Like a standalone FPGA, the exact logic function of the eFPGA can be decided after fabrication. Typically, an eFPGA would be used to implement small to medium logic functions like microprocessor accelerator functions or hardware data encryption algorithms that need to be reprogrammed periodically to adapt to changes. 16 2.2.2 Embedded Mask Programmable Gate Array (MPGA) More recently, MPGA devices like the eASIC\u00C2\u00AE core [45] were introduced to eliminate some of the shortcomings of eFPGA cores. An eASIC core is modeled on SRAM programmable look-up-tables but the routing infrastructure between look-up-tables is quite different from eFPGAs. In particular, the routing interconnect is either metal/via or just via programmable (one-time programmable). The advantage here is that the extra logic needed for switch elements in the routing of a typical eFPGA are no longer required. Therefore MPGAs tend to be are smaller, faster and more power efficient. Furthermore, the wafers associated with a particular design can be processed in advance of final metal/via programming. Therefore, the turnaround time for a programmed design is only about a week or two compared to a few months for so-called hardwired ASICs that are not programmable. 
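Returning briefly to the island-style routing of Section 2.2.1, the defining property of the disjoint switch-block topology of Figure 2.7(b), namely that a net can never hop between track "lanes", can be captured in a few lines. The side names and the per-direction SRAM enable bits below are an illustrative encoding rather than the actual switch circuit.

```python
# Sketch of the disjoint switch-block property: a net entering on track i
# can only leave on track i of the other sides. 'turn_enable' stands in for
# the per-direction programmable connections (SRAM bits) of one switch.

def disjoint_switch_outputs(entry_side, track, turn_enable):
    """Return the (side, track) pairs a net may drive from this switch."""
    sides = ("north", "east", "south", "west")
    assert entry_side in sides
    return [(side, track)                     # the track index never changes
            for side in sides
            if side != entry_side and turn_enable.get(side, 0)]

# A net on track 0 entering from the west, configured to continue east and
# also turn north, leaves on track 0 in both of those channels:
print(disjoint_switch_outputs("west", 0, {"east": 1, "north": 1}))
# -> [('north', 0), ('east', 0)]
```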
Although current MPGA cores are denser, faster, and more power efficient than an eFPGA, there is a loss of flexibility because the routing interconnect is not reprogrammable. For example, once via connections between metal layers for a particular circuit implementation are made, they cannot be changed. Any further changes to the routing interconnect will require a new silicon die with an unprogrammed embedded MPGA. Although changes in LUT functionality are possible (reprogrammable SRAMs), the scope of such changes is limited once the routing has been finalized. Figure 2.10(a) below shows a transistor-level full-custom layout of an MPGA tile [45] modeled on the popular island-style architecture, and similar in many respects to the eFPGA tile shown in Figure 2.6(b). For example, the LUTs in an MPGA CLB are also programmable using SRAMs. The main difference, however, is the absence of reprogrammable routing switches in the MPGA tile. Instead, an upper metal layer grid of potential global routing connections is used to form an interconnect mesh or connectivity fabric [45] as Figure 2.10(b) shows. After fabrication, this grid has no connection to lower layers of metal. Instead, connections are only made after it has been determined what circuit will be implemented in the MPGA, and which connections to the lower metal layers (hardwired) will be needed to implement a logic function. Such connections are made with a via mask layer that connects the uncommitted connectivity fabric to the hardwired "base array" [45] in Figure 2.10(a).

(a) MPGA base array (b) Connectivity Fabric
Figure 2.10: An MPGA logic tile layout, and a connectivity fabric of top metal layers

Table 2.1: Comparison of eFPGA, MPGA and standard cell logic design methods

  0.18 micron CMOS        eFPGA   MPGA (eASIC®)   Standard Cell IC
  Density (gate/mm2)      1.5K    30K             60K
  Performance (MHz)       100     400             600
  Power (nW/Gate/MHz)     1000    40              20-30
  NRE ($)                 0       30K             500K
  Prototype TAT (days)    0       5-10            20-40

Table 2.1 [45] presents useful data on key features of eFPGA and MPGA technology that further illustrates important differences between these embedded programmable logic solutions. For example, the absence of reprogrammable routing in MPGAs makes them 20 times smaller than an eFPGA of the same logic size. The Non-Recurring Engineering (NRE) costs and development turn-around times (TAT) for an eFPGA are more or less zero because a designer simply needs to load the appropriate bitstream to get a working chip. On the other hand, designs with MPGA cores require one or two metal and/or via mask sets to program a core. In this case, IC manufacturing services are needed, and this accounts for the NRE and TAT overhead. The NRE costs and TAT for standard cell IC (non-programmable) designs are even higher because all IC layers must be fabricated to realize a working chip. However, standard cell IC designs have superior area, speed, and power efficiency. An MPGA offers "static" or one-time programmability that is suited to designs that do not require "on the fly" r1000 LUTs), the strategic use of the more explicit approach developed with [14] and described therein, completely eliminated false timing from timing reports. More specifically, applying stringent exceptions to the input and output connection blocks alone was sufficient. To confirm this, path delay reports from the ASIC tools were compared with those from VPR to ensure the paths were true paths and not false paths.
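The exact set of exceptions used is not reproduced here, but the sketch below illustrates one plausible way that per-connection-block timing exceptions could be generated for a static timing tool such as PrimeTime. The tile, instance and pin names (tile_r*_c*, icb*/sel*, ocb*/en*) are hypothetical placeholders; only the `set_false_path -through` SDC command itself is standard.

```python
# Generate SDC-style false-path exceptions for every configuration-controlled
# select/enable pin of the input and output connection blocks. Because the
# configuration bits are static after programming, paths through these pins
# need not be timed. All instance and pin names here are assumptions.

def connection_block_exceptions(rows, cols, inputs_per_clb, tracks):
    lines = []
    for r in range(rows):
        for c in range(cols):
            tile = f"tile_r{r}_c{c}"
            for i in range(inputs_per_clb):   # input connection block muxes "G"
                lines.append(f"set_false_path -through [get_pins {tile}/icb{i}/sel*]")
            for t in range(tracks):           # output connection block tri-state enables
                lines.append(f"set_false_path -through [get_pins {tile}/ocb{t}/en*]")
    return lines

# For a 2x2 array with 10 CLB inputs and channel width 8 this produces
# 2*2*(10+8) = 72 exception commands that can be sourced before analysis.
for cmd in connection_block_exceptions(2, 2, 10, 8)[:3]:
    print(cmd)
```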
For example, in Figure 4.5 the VPR-calculated critical path starting at "1" and ending at "6" matched exactly the critical timing path reported in the Primetime® static timing analysis tool.

Figure 4.5: VPR critical timing path trace for "golden-20" benchmark circuit apex4

4.2.4 Configuration Power

The configuration architecture of an eFPGA is used to load and hold the configuration bits that implement a particular user function. As mentioned in Section 2.3.2, the use of flip-flops rather than SRAMs for program storage is very costly, because flip-flops are larger and several are needed. Furthermore, the large number of flip-flops in eFPGAs also requires a large network of configuration-related signals that are spread throughout the eFPGA. For example, it is expected that the clock network for the configuration flip-flops will be quite extensive, and the results in [14] suggest that this is the case. This has implications for power dissipation in the clock network during programming. In addition, the programming scheme used in an architecture implementation also affects power dissipation. For example, in [14], a single shift chain of flip-flops was used to implement the eFPGA programming architecture. This approach forces all flip-flops in the chain to be clocked for the entire duration of the configuration phase. This means that all the flip-flops, and their entire clock tree network, could potentially draw large currents from off-chip power supplies.

(a) single tile programmed (b) entire row programmed
Figure 4.6: Proposed configuration scheme targeting a single tile and row of tiles

In order to make soft eFPGA configuration more power efficient, two schemes were proposed that are somewhat similar to the approach used in commercial SRAM-programmable FPGAs [56] [57] (see also Figure 3.8(d)). In the first scheme shown in Figure 4.6(a), a single tile in the soft eFPGA can be targeted for programming so that only the clock network and configuration flip-flops for that tile are activated during configuration. In the second scheme shown in Figure 4.6(b), a single row of the embedded FPGA can be programmed. Again, this means that just the clock network and configuration flip-flops for the target row are activated during programming. Individual rows or tiles are targeted for programming until the entire eFPGA has been programmed. Clearly, the first scheme is more power efficient since only the capacitive loading on a tile clock tree network and the total current (data dependent) drawn by all the flip-flops in the active tile contribute to power. However, as the number of flip-flops targeted scales upwards from a single row to one "global" program shift chain, power and energy increase. The programming time is roughly constant. Another disadvantage of using flip-flops for eFPGA configuration storage is the likelihood of glitch power dissipation. Glitches are rapid transitions that occur at the outputs of combinational logic in response to changing inputs. Because flip-flops in this configuration scheme are part of a shift chain, there is a good chance that the state of each flip-flop will toggle several times before the end of configuration of a single eFPGA tile or row. Some of these flip-flops are inputs to combinational logic and as such will cause transitions to occur at combinational logic outputs during configuration.
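As a rough illustration of why the tile- and row-based schemes of Figure 4.6 reduce configuration power relative to a single global shift chain, the sketch below counts flip-flop clock events for each scheme. The array size, the bit count per tile, and the activity metric are illustrative assumptions, not measurements from this work.

```python
# Compare configuration-clocking activity for three schemes: one global shift
# chain, one chain per row, and one chain per tile. A "clock event" is charged
# to every flip-flop on the active chain for every shift cycle.

def clock_events(rows, cols, bits_per_tile, scheme):
    tiles = rows * cols
    total_bits = tiles * bits_per_tile
    if scheme == "global":                    # all flip-flops clocked on every cycle
        return total_bits * total_bits
    if scheme == "row":                       # only one row of tiles active at a time
        row_bits = cols * bits_per_tile
        return rows * (row_bits * row_bits)
    if scheme == "tile":                      # only one tile active at a time
        return tiles * (bits_per_tile * bits_per_tile)
    raise ValueError(scheme)

rows, cols, bits_per_tile = 6, 6, 400         # hypothetical 6x6 array, 400 bits/tile
for scheme in ("global", "row", "tile"):
    print(f"{scheme:>6}: {clock_events(rows, cols, bits_per_tile, scheme):.3e} flip-flop clock events")
# The total shift cycles (hence programming time) stay roughly constant at
# rows*cols*bits_per_tile, while the clocking activity drops sharply.
```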
Glitches are rapid transitions that occur at the outputs of combinational logic in response to changing inputs. Because flip-flops in this configuration scheme are part of a shift chain, there is a good chance that the state of each flip-flop will toggle several times before the end of configuration of a single eFPGA tile or row. Some of these flip-flops are inputs to combinational logic and as such will cause transitions to occur at combinational logic outputs during configuration. 52 CLK1 CLR C L K 2 Figure 4.7: Proposed scheme for glitch isolation during soft eFPGA configuration Figure 4.7 shows a proposed scheme to minimize the propagation of glitches to global routing wires (primary source of dynamic power dissipation) and other parts of the logic architecture during configuration. A global masking input labeled \"G\" in Figure 4.7 is used to control the OR-gate output, so that the registered (non-toggling) output of a LUT is selected by the output multiplexer \"H\" during the configuration phase. The transition that occurs on the registered output (shown as a high-to-low transition in Figure 4.7) prior to configuration is due to a global reset (CLR) on the LUT-flop. The registered LUT output remains unchanged until programming is complete. The embedded FPGA system clock (CLK1) is also inactive during the configuration phase to prevent glitch propagation through the LUT-flop to tristate buffer inputs and other parts of the logic architecture. CLK2 is the programming clock. The \"Xs\" in Figure 4.7 indicate disabled paths. Without the above safeguard, glitches could be propagated through output drivers and onto routing tracks and contribute to dynamic power dissipation. Ultimately, the magnitude of glitch power dissipation is a function of the capacitance of routing tracks (function of the track wire length), the width, W, of the routing channels in a device, the number of connections from a given output to a routing channel, the total number of routing tracks in a device (related to core size), and how often 53 it is reprogrammed during operation. It is also important to notice that the glitch isolation scheme shown in Figure 4.7 was also used to prevent combinational loops during functional verification. Finally, a case study of an example design from [14] [15] [19] was used to successfully verify the island-style fabric implementation from synthesis through to static timing signoff as shown in the ASIC flow of Figure 4.1. Further details of this implementation are provided in Appendix A. 1. 4.3 Design Results Successful verification of the eFPGA RTL implementation meant that valid area and speed results for architecture derivatives presented in Chapter 3 could be obtained. It is useful to note that the current design supports only architectures that have routing tracks that span one tile. In other words, each switch block is directly connected to only switch blocks in the nearest tiles. These tracks are referred to as single segment length tracks [11]. Likewise, tracks spanning L tiles are segment L tracks. Figure 4.8 shows a plot of critical path delays for 18 MCNC benchmarks. All delays are normalized relative to the delay of an equivalent tri-buffered switch architecture. Results for 3 of 4 architectures described previously in Chapter 3 are presented. Based on the results, the eFPGA cores with tri-buffered switches are on average 24% faster than those implemented with the original multiplexed switch described in [20] [49]. 
In terms of area, the cores implemented with tristate-buffer-based switches (tri-buff) are on average about 3.3% larger than multiplexer-based switches (see Figure 4.9). Based on these results, the tri-buffered switches have the smallest critical path delay overhead and a relatively small area overhead when compared to the multiplexer-based switch presented in [49]. In an effort to reduce the delay overhead of the original multiplexer-based switch, a new switch element referred to in Section 3.1.1 as "Improved multiplexed switch 1" was designed. This design was contrived based on an important observation that is evident in Figure 4.5. In Figure 4.5, nets on the critical path span several CLBs vertically and horizontally without changing direction. For example, the path from "node 1" to "node 2" has long vertical and horizontal sections. Therefore, it was anticipated that delay overhead could be reduced by creating fast horizontal and vertical routes through the original multiplexer-based switch. Fast connections with single buffer delay overheads were incorporated in the original multiplexer switch design as shown in Figure 3.3(b). This approach is somewhat analogous to including long routing wires [11] [20] in programmable logic architectures.

Figure 4.8: Critical path delay comparison of three soft island-style architectures (normalized critical path delay vs. benchmark circuits, for the tri-buff, original multiplexer switch, and improved multiplexer switch 1 architectures)

Measured critical path delays showed that "improved multiplexer switch 1" is on average 11% faster than the original multiplexed switch and only 2.6% larger. The tri-state-based switch is still faster by about 13% and only 0.7% larger. From the plots in Figure 4.9 and data in Appendix A, fabrics with larger dimensions provided the biggest speed gains. For example, two large circuits from the MCNC benchmark suite, apex4 and ex5p, were sped up by 23% and 22% respectively. The larger circuits show the biggest gains because critical nets follow longer vertical and horizontal paths.

Figure 4.9: Overall core area comparison of three soft island-style architectures (normalized core area vs. benchmark circuits)

It is very important to note that the area and delay results presented above are sensitive to the composition of the standard cell library used in the ASIC design flow. For example, architectures using multiplexer-based switches and implemented with CMOS logic gates in generic standard cell libraries are at a disadvantage because a significant area and delay overhead is incurred relative to architectures that have multiplexers implemented with pass transistor logic [20]. On the other hand, architectures with tri-buffered switches (see Figure 3.1) are at an advantage when compared to multiplexer-based switches, because tri-state buffers are available as well-optimized cells in most commercial standard cell libraries while pass-transistor-based multiplexers are not available.
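A back-of-the-envelope transistor count helps explain this sensitivity. The per-element counts below (a static-CMOS 2:1 multiplexer of roughly a dozen transistors versus a single NMOS pass transistor per input) are textbook approximations rather than values measured in this work, but they show why gate-based multiplexers carry a large area penalty.

```python
# Rough transistor-count comparison of an N:1 multiplexer built from generic
# CMOS gates versus one built from pass transistors. The per-element counts
# are illustrative approximations, not figures from the thesis results.

def cmos_gate_mux_transistors(n_inputs, transistors_per_2to1=12):
    """An N:1 mux built as a tree of (N-1) static-CMOS 2:1 multiplexers."""
    return (n_inputs - 1) * transistors_per_2to1

def pass_transistor_mux_transistors(n_inputs, buffer_transistors=4):
    """A fully-decoded pass-transistor N:1 mux plus an output buffer
    (select decoding and level restoration are ignored here)."""
    return n_inputs + buffer_transistors

for n in (4, 8, 16, 32):
    gates = cmos_gate_mux_transistors(n)
    passes = pass_transistor_mux_transistors(n)
    print(f"{n:>2}:1 mux  gate-based ~{gates:>3}  pass-transistor ~{passes:>3}  ratio {gates/passes:.1f}x")
```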
56 For the above reasons, derivative architectures using the novel \"improved multiplexer switch 2\" and \"original multiplexer switch\" [20] [49] (see Figures 3.2, 3.4) are best implemented using pass transistor logic in order to fairly compare area and delay overheads relative to the tri-buffered switch architecture. For example, in the original multiplexer switch design, it has been observed that the multiplexer delay overhead dominates the tri-state buffer delay overhead by a factor of about 2.5. This imbalance is due in large part to the fact that the multiplexers are constructed from logic gates. Furthermore, it is expected that the area superiority of multiplexer-based switches can be improved. Given these observations, architectures based on the original multiplexer switch and \"improved multiplexer switch 2\" offer the best chance to further close the speed gap and improve area relative to the tri-buffered switch. This is because \"improved multiplexer switch 1\" has eight tristate buffers per switch, compared to four for \"original multiplexer switch\" and \"improved multiplexer switch 2\". Another example of the sensitivity of the ASIC-flow-based eFPGA design approach is evident in results in [02] and [14]. For example, in [14], although the same architecture was implemented for the same benchmarks circuits used in [02], the final area results differ quite significandy, and in some cases by over a factor of 2. The reason for this may be due to different emphasis during synthesis. For example, a delay constrained logic synthesis usually results in larger area. Therefore, in this work, similar constraints were applied to all circuits during logic synthesis to avoid any inconsistencies. In addition to the architecture comparisons described above, the area overhead of the standard cell approach relative to full-custom design was investigated. Since the architectures implemented here are identical to the architecture assumed in the VPR tool, it was possible to leverage the area model in VPR for this purpose. The VPR tool reports area based on the number of minimum sized 57 transistors in a programmable logic device, and this can be easily converted to area in sq. microns. For this comparison, the derivative architecture based on the original multiplexing switch was used because the version of VPR that has been characterized for 180nm technology is based on this architecture. This eliminates any bias that may result from comparing slighdy different architectures. From the results in Table 4.1, the standard cell approach has an average area overhead of about 6.8X relative to a full-custom design. Another important observation from Table 4.1 is the difference between the estimated area and the actual area obtained for the real design. This estimate is based on the area model in Appendix B that was developed as part of this research. Based on results in Table 4.1 the estimated area is on average within +5.8/-3.8% of the actual synthesis area. This implies that the area model is quite accurate. This model was useful for predicting the core area before ASIC implementation, so that only cores with acceptable area (and delay) overheads were implemented. Table 4.1: Area overhead of \"soft\" eFPGA design relative to full-custom approach Estimated Actual % Area VPR Soft vs. 
Table 4.1: Area overhead of "soft" eFPGA design relative to the full-custom approach

Circuit    Estimated area (µm²)   Actual area (µm²)   % difference   VPR area (µm²)   Soft vs. VPR area ratio
cc               419158               451298              7.7           62741.02            7.2
cm150a           221086               240130              8.6           33038.89            7.3
cmb              248244               257889              3.9           36705.79            7.0
comp             443320               475532              7.3           69590.83            6.8
cu               306440               290039             -5.4           39844.19            7.3
5xp1             370696               366387             -1.2           51354.28            7.1
i1               277208               267236             -3.6           40112.47            6.7
inc              370696               366387             -1.2           51354.28            7.1
unreg            569176               604075              6.1           88647.52            6.8
rd84            2506527              2401023             -4.2          362839.93            6.6
apex7           1483108              1455608             -1.9          239553.32            6.1
alu2            2744876              2936981              7.0          437718.72            6.7
clip            1955266              2049542              4.8          321019.09            6.4
9symml           929304               848430             -8.7          120710.88            7.0
term1           1216817              1233960              1.4          181702.80            6.8
cht             1211974              1289779              6.4          196652.12            6.6
ex5p           23128464             22168164             -4.2         2832898.74            7.8
apex4          36441909             38133916              4.6         4567776.75            8.4

Table 4.2 shows the results obtained for critical path delay. The second column shows delay results after placement, routing, and parasitic extraction. The third column shows delay results prior to placement (without metal wire delay). Based on these results, it was estimated that on average, wire delay is roughly 10.3% of the overall critical path delay. This result is not surprising since single-segment-length wires are used in all architectures considered. However, it is expected that in architectures that have long wires, the delay overhead due to wires would be much greater. It is also interesting to draw parallels between this result and the results obtained in [14]. In [14], a flow was developed to minimize the wire delay overhead on critical paths by rerouting nets along paths with less wire delay. The experiment in [14] yielded modest improvements, and the results obtained here show that a low wire delay overhead is the likely reason. Finally, Table 4.2 also shows that the critical path delay overhead of a standard cell eFPGA design relative to a full-custom equivalent is about 1.8X.

Table 4.2: Delay overhead of "soft" eFPGA design relative to a full-custom design

Circuit   Post-layout delay (ns)   Pre-layout delay (ns)   % wire overhead   VPR path delay (ns)   Soft vs. VPR delay ratio
cc               10.3                    9.6                  -7.8                 5.81                 1.8
cm150a           10.3                    9.6                  -6.7                 5.94                 1.7
cmb               9.7                    8.8                  -9.6                 5.45                 1.8
comp             17.7                   16.0                  -9.8                 9.79                 1.8
cu                9.9                    8.8                 -11.5                 4.68                 2.1
5xp1              7.6                    7.0                  -8.0                 4.85                 1.6
i1               11.3                    9.9                 -12.6                 6.09                 1.9
inc               8.3                    7.5                  -9.2                 4.84                 1.7
unreg            10.1                    9.3                  -7.9                 6.04                 1.7
rd84             28.0                   24.4                 -12.7                 15.2                 1.8
apex7            18.4                   17.1                  -6.8                 10.8                 1.7
alu2             28.7                   23.6                 -17.7                 15.6                 1.8
clip             23.7                   20.6                 -13.2                 13.7                 1.7
9symml           14.5                   12.2                 -16.0                 7.73                 1.9
term1            18.6                   15.6                 -16.0                 8.71                 2.1
cht               9.6                    8.9                  -7.3                 6.65                 1.4
ex5p             40.2                   37.3                  -8.0                 20.7                 1.9
apex4            56.7                   51.1                 -11.0                 36.9                 1.5

The 18 benchmarks used in the above experiments were selected based on availability and size. MCNC benchmarks have historically been used for FPGA research, and it was necessary to use benchmarks with sizes that span a wide range (32 to more than 1000 4-LUTs in this case).

Given the area and delay overhead results reported in Tables 4.1 and 4.2, it is necessary to find ways to reduce the overhead. The use of custom cell libraries to improve ASIC designs is well documented in [48]; therefore, it is expected that the same techniques are applicable here. However, before embarking on designing custom cells, it was necessary to first evaluate the sources of inefficiency in the existing approach. As previously reported in [14], it is expected that the biggest sources of area inefficiency come from the use of standard flip-flops for configuration memory and of discrete CMOS logic gates, such as NAND and NOR gates, for multiplexer implementation.
For example, as Figure 4.10 shows, a pass transistor network implementation of a 4:1 multiplexer contains a total of 13 transistors, including the inverters that generate the complementary select signals (S0 and S1). On the other hand, a CMOS logic implementation with discrete gates has over 40 transistors in total. This translates to an area overhead of about 3X for a 4:1 multiplexer. Further, the multiplexer area is expected to grow exponentially with the number of select inputs.

(a) 4:1 multiplexer in CMOS logic (b) 4:1 multiplexer in pass gate logic
Figure 4.10: Comparison of CMOS logic and pass transistor based multiplexers

Figure 4.11: Transistor-level illustration of a simple single-edge-triggered flip-flop

Similarly, the area overhead of flip-flops is due to the fact that flip-flops contain more transistors, and therefore occupy more area, than the 6T (six-transistor) SRAM cells typically used for program memory. The flip-flop design in Figure 4.11 comprises a total of 12 transistors; an SRAM cell with six transistors (see Figure 5.2) is therefore approximately half the size of a flip-flop.

The pie charts in Figure 4.12 show the area overhead distribution in the "soft" implementation of the island-style architecture described earlier. From the chart in Figure 4.12(a), it is clear that configuration memory (flip-flops) and multiplexers are the biggest contributors to area overhead. "Other" in Figure 4.12(a) includes the BLE output tristate buffers and the shared buffers used to drive the inputs of the connection block multiplexers. Figure 4.12(b) shows that the biggest contributors to multiplexing logic area are the LUTs and the LUT input selection multiplexers ("M" in Figure 2.6(b)). Since generic standard cell multiplexers and flip-flops make up 88% of the total eFPGA island-style tile area, and are not built efficiently using the current approach, they are prime candidates for custom standard cell substitution. Moreover, it is expected that replacing generic standard cell multiplexers will have a significant impact on delay, since multiplexers make up the logic and routing architecture, and the results in Table 4.2 have shown that gate delay comprises about 90% of the overall delay. Although the current research work is not focused on power optimization, it is expected that power savings will occur as a side-effect of logic optimization.

(a) tile area distribution (b) multiplexer area distribution
Figure 4.12: Pie charts showing area distribution in a soft island-style eFPGA tile

Finally, in the next chapter, circuit-level design techniques that could improve area and delay for standard-cell-based eFPGA fabrics are explored based on the observations in Figure 4.12.

Chapter 5  Island-Style eFPGA Design with Custom Standard Cells

5.1 An Improved Design flow

Hard and soft eFPGA fabric design approaches have several advantages, but also significant disadvantages. For example, a hard eFPGA has superior area, speed, and power characteristics compared to an equivalent soft eFPGA; however, hard eFPGAs are inflexible and can be very inefficient depending on logic requirements. Soft eFPGAs, on the other hand, are flexible and offer the best chance to match an eFPGA to an application, but have large area, delay, and power overheads.
In particular, the flip-flops used for configuration storage and the multiplexers used in the routing and the LUTs account for a significant proportion of the area overhead. The reason is simple: these two circuits are the most pervasive in programmable architectures, but generic standard cell libraries do not include the SRAM cells needed for configuration storage or the large fan-in multiplexers needed for parts of the routing and logic architectures. In hard eFPGAs, on the other hand, full-custom techniques are used to design much smaller configuration SRAMs and fast, wide fan-in multiplexers.

Given the inefficiencies that exist in the hard and soft design approaches, a compromise approach that combines the best features of both is needed. Such an approach should retain the soft IP advantage of configurable architectures, that is, it must remain flexible, but at the same time incorporate some full-custom design techniques to minimize area, delay, and power overheads. However, any improvements should fit within the ASIC digital design flow in order to maintain the ease of use of the soft approach. Since configurable RTL descriptions of eFPGA architectures have already been designed and implemented in Chapter 4, this requirement has been achieved. Therefore, the next stage of this research was to minimize area, delay, and power overheads using full-custom design techniques, in order to derive the benefits of the hard approach.

Previous work in [46] [48] [53] has demonstrated that generic standard-cell-based ASIC designs can be improved using customized standard cells, called tactical cells in [53]. Most of this previous work was aimed at nonprogrammable ASICs, but in [01] [68] some of the same techniques have been used to design standard-cell-based programmable architectures. However, this work did not take into account the significant area overhead due to wide fan-in multiplexing logic and the potential savings that come from optimizing it (see Figure 4.12). Also, the work in [01] showed area improvements of 16-19% for a data-path architecture, but it is anticipated that larger gains might be possible in an island-style architecture [11], because the architecture in [01] is not multiplexer-intensive and has fewer circuits that benefit from the use of custom cells.

Figure 5.1: An enhanced ASIC design flow for eFPGA design and implementation

A modified ASIC digital design flow using custom standard cell libraries is shown in Figure 5.1. The main difference is that customized standard cell data, rather than generic standard cell library data, is used for embedded programmable logic generation. The design and implementation of these customized standard cells are described in the next section.

5.2 Design of Custom Cells

Given the area distribution results obtained in the previous chapter, this section describes the detailed design and implementation of SRAM and multiplexing logic cells for use in eFPGA cores.

5.2.1 SRAM Cell Circuit Design

The SRAM circuit design used in this work is based on the 6T (six-transistor) cell [25] [56] [66]. However, the SRAM cell design for FPGA circuits differs from conventional SRAMs in one important regard: FPGA SRAM cells have dedicated read and write "ports" (one for writing and another for reading). In other words, the cell is unidirectional.
As a result, one of the inverters in the cell acts as a "charge keeper" (via positive feedback) and keeps the value written to the SRAM, "bit", from being lost, while the other inverter acts as a "sensor" that quickly detects a new value being written to the SRAM. Consider the 6T cell of Figure 5.2: because the "sense" inverter is driven by an NMOS pass transistor that passes a weak logic "1" value, the sense inverter was skewed (larger NMOS) to respond faster to a rising input (i.e., when writing a logic "1" value to memory and setting node "bit" to logic "1"). A slow response to a weak logic "1" input is not desirable because it would increase the programming time. Moreover, the sense inverter is sized larger because it also serves as an input driver in the function table of the LUTs; in this case, the LUT input is connected to "bitb". Figure 5.2 shows the transistor sizes obtained from SPICE simulations. All channel lengths are 0.18um.

Figure 5.2: SRAM transistor sizing for embedded FPGA configuration memory

5.2.2 Multiplexer Circuit Design

To minimize area overhead, pass transistor logic was used for the multiplexer circuit design. All pass transistors are minimum size because this gives the best area and speed tradeoff [66]. As a result, circuit design work focused on issues such as buffer sizing for the pass tree select lines, output driver sizing, and buffer insertion (repeaters) within the pass tree to improve speed in larger tree networks.

(a) single output driver (b) multiple drivers
Figure 5.3: Multiplexer with minimum size output buffer and extra buffer stages

The output driver inside each multiplexer circuit is a minimum-sized inverter (Figure 5.3(a)), since this presents the smallest load to the pass transistor network; higher output drive strength can be achieved with extra buffer stages as needed (see Figure 5.3(b)). In addition, because NMOS pass transistor gates transmit a weak high logic value (VDD - VT), a level restorer [11] [49] [66] was placed at the output. This ensures that the input to the output driver is equal to the supply voltage and prevents static power dissipation in the output driver. Since eFPGAs typically contain a significantly large proportion of multiplexers, it is important to eliminate static power dissipation in these drivers. Furthermore, the level restorer must be sized appropriately: if its transistor W/L ratio is too large, it can "overpower" the pass tree driver and the output buffer never switches.

Figure 5.4: A level restoring circuit for pass-transistor-based multiplexing logic

Buffer sizing of the select line drivers was done using logical effort techniques as described in [25] [48]. For each size of multiplexing logic, the number of buffer (inverter) stages that gave the best speed and area tradeoff was computed. As a result, the select lines for some multiplexer sizes are inverting, while others are non-inverting. Notwithstanding this fact, an important aspect of the design was to recognize the strong coupling between the LUT input multiplexers and the LUTs themselves. In particular, optimization of each LUT select-signal path must include the output buffer of the LUT input multiplexer (as shown in Figure 5.5) in order to achieve the best results.
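As a simple illustration of the logical effort technique referred to above, the sketch below sizes an inverter chain that drives a heavily loaded select line. The capacitance values and the target stage effort are illustrative assumptions, not the values used for the actual tactical cells.

```python
# Sketch: logical-effort-style sizing of a select-line buffer chain.
# A generic illustration of the technique named in the text, not the exact
# procedure or numbers used for the tactical cells in this work.
import math

def buffer_chain(c_load: float, c_in: float, stage_effort_target: float = 4.0):
    """Return (stages, realized stage effort, per-stage input capacitances)
    for an inverter chain driving c_load from an input capacitance c_in."""
    path_effort = c_load / c_in                       # electrical effort (G = B = 1 for inverters)
    n = max(1, round(math.log(path_effort, stage_effort_target)))
    f = path_effort ** (1.0 / n)                      # effort borne by each stage
    sizes = [c_in * f ** i for i in range(1, n + 1)]  # successive stage input capacitances
    return n, f, sizes

# Example: a select line loading 64 minimum-sized pass-transistor gates,
# driven from a minimum-sized inverter (capacitances in arbitrary units).
n, f, sizes = buffer_chain(c_load=64.0, c_in=1.0)
print(n, round(f, 2), [round(s, 1) for s in sizes])   # 3 stages, stage effort of about 4
```

Whether the resulting chain has an odd or even number of stages also determines whether the select line is inverting, which is why some multiplexer sizes ended up with inverting select inputs.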
Figure 5.5: Delay optimization problem specification for LUT input-selection paths

The above optimization is not required for the other multiplexing circuits in the island-style eFPGA architecture, because the critical path in all the other multiplexing logic circuits does not include the select signal path. For example, the select inputs of the LUT input multiplexer are driven by static values stored in SRAMs after configuration (see Figure 5.6(a)). Therefore, the critical delay path for this multiplexer is simply the path from the multiplexer input to its output, as shown in Figure 5.6(a). The critical delay path for a LUT, however, is shown in Figure 5.6(b), and includes not only the input-to-output path delay of Figure 5.6(a) but also the delay of the select-input path.

(a) "regular" multiplexer (b) LUT multiplexer
Figure 5.6: Critical delay paths for the different multiplexing circuits in an eFPGA

Also related to area reduction was the decision not to include buffers at each input of the wide fan-in multiplexers, but instead to leave them as source/drain inputs. For the multiplexers used to implement LUTs, this was an obvious choice since these inputs are SRAM-driven and the SRAMs had been sized for this purpose. However, for multiplexers used elsewhere in the eFPGA (e.g., the LUT-MUXs), this decision required further consideration. In particular, difficulties arise when sizing drivers for source/drain inputs, because conventional ASIC tools are designed for CMOS logic with gate inputs and not for pass transistor logic circuits with source/drain inputs. Furthermore, source/drain capacitances, when considered alone, underestimate the load of pass transistor logic paths, whereas loading CMOS gates of a given size allows accurate input load estimation. Figures 5.7(a) and 5.7(b) illustrate the problem that diffusion loads pose for standard ASIC tools. In Figure 5.7(a) it is evident that specifying the input diffusion capacitance (Cdiff) of a pass transistor as the input load is inaccurate, because the real load is the RC network highlighted from the multiplexer input to the output buffer input. Since standard ASIC tools are not designed to handle RC loads of this type, the buffered configuration shown in Figure 5.7(b) is normally recommended: the buffer gate capacitance (Cg) then becomes the input load, and ASIC tools are only able to interpret loads of this kind.

(a) exposed RC loads (b) isolated RC loads
Figure 5.7: Issues around input loading for pass tree networks in the ASIC flow

However, in our case, the multiplexer circuits can be designed with diffusion inputs because the eFPGA architecture allows multiplexers to benefit from buffer sharing. Therefore, rather than buffering each multiplexer input and increasing the total cell area, a single buffer chain is used to drive multiple shared inputs (shown in Figure 5.8). Furthermore, rather than using the ASIC tools to obtain an inaccurate estimate of buffer sizing for these multiplexer inputs, a special case of the Elmore equation [42] for distributed RC networks [43] is used to achieve accurate buffer sizing and improved circuit speed. Equations (C.0) through (C.2) in Appendix C.1, which are based on equations in [66], are used to estimate buffer sizing for a given pass transistor RC network (similar to Figures 5.8 and C.1).
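To illustrate why this matters, the sketch below gives an Elmore-style delay estimate for an NMOS pass-transistor chain of the kind being driven here, and shows the quadratic growth with depth that also motivates the repeater insertion discussed next. The resistance, capacitance, and repeater-delay values are illustrative assumptions, not extracted device data from this work.

```python
# Sketch: Elmore-style delay estimate for an NMOS pass-transistor chain,
# illustrating quadratic growth with chain depth and the benefit of
# inserting repeaters every few levels. All constants are assumptions.

R_PASS = 10.0       # assumed on-resistance of one pass transistor (kOhm)
C_NODE = 2.0        # assumed capacitance at each internal node (fF)
T_REPEATER = 60.0   # assumed delay of one skewed repeater (ps)

def elmore_chain_delay(depth: int) -> float:
    """Elmore delay (ps) of a chain of `depth` identical pass transistors:
    sum over nodes of (upstream resistance) * (node capacitance)."""
    return sum(R_PASS * (i + 1) * C_NODE for i in range(depth))

def delay_with_repeaters(depth: int, levels_per_segment: int = 4) -> float:
    """Break the chain into segments separated by repeaters."""
    full, rem = divmod(depth, levels_per_segment)
    segments = [levels_per_segment] * full + ([rem] if rem else [])
    return sum(elmore_chain_delay(s) for s in segments) + T_REPEATER * (len(segments) - 1)

for depth in (4, 8, 12):
    print(depth, elmore_chain_delay(depth), delay_with_repeaters(depth))
```

Because the unbuffered delay grows roughly with the square of the depth while each repeater adds only a fixed cost, segmenting the tree pays off once the chain exceeds a few levels.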
Logical effort techniques are then used to find the number of stages and the sizes of the buffers that drive the buffer obtained from these equations. The appropriate buffers are then instantiated as gates in the eFPGA description. This process is made easier because it is known exactly where these buffers will be used in the eFPGA, so worst-case loading constraints are easily estimated. An example calculation that uses these equations for buffer/driver sizing is presented in Appendix C.2.

Figure 5.8: Two possible multiplexing-logic buffering schemes for eFPGA design

In multiplexers with more than four levels of pass gates from input to output, repeaters are needed in the pass tree RC network to speed up the critical path. A significant degradation in speed occurs as the depth of the pass transistor tree increases with increasing input size. This result is not surprising, because the delay of a pass transistor network is quadratic in the number of pass transistors in the path from input to output (the depth). SPICE experiments showed that repeaters after every four levels of pass transistors sped up the circuits as needed. Also, since the pass tree transmits a weak high logic value, skewed gates are used in the repeaters; in particular, a W/L ratio of 1.5 was used because it gave the best area, speed, and power tradeoff. Finally, the analytical results for the repeater problem in NMOS-only pass transistor tree networks in Appendix C.2 corroborate the experimental results obtained through SPICE simulations.

Figure 5.9: RC network representation of repeater insertion in an eFPGA 5-LUT

5.3 Layout Design

Custom layout design was the next step after detailed transistor-level circuit design. Since this research is concerned with standard-cell-based design, and given that not all the cells in the eFPGA implementation may be substituted with custom tactical cells, it was important that our tactical cell layouts match the format of the original standard cell library [24] used in this work. For example, rules for cell height, cell width, pin location, and n-well overlap had to be enforced throughout.

Perhaps the most important layout challenge was how best to implement an NMOS pass transistor tree network within a standard cell format. As shown in Figure 2.2, a typical standard cell reserves the upper part of the cell (closer to Vdd) for the PMOS network of CMOS logic, while the lower part (closer to Gnd) is reserved for the NMOS transistors of the CMOS pull-down network. However, since the most area-efficient wide fan-in multiplexers are best implemented using NMOS pass transistors, there was a potential problem of wasting the space in the upper section of the standard cell, and it was necessary to find new ways to improve area efficiency.

To make the most efficient use of standard cell area for NMOS pass transistor network design, a technique was devised to use the otherwise wasted space reserved for PMOS transistors. In particular, a significant portion of the n-well was cut out from the middle section of each multiplexer cell so that more NMOS transistors could be included without increasing cell size. Figure 5.10(a) shows a normal cell before the n-well cut is made and Figure 5.10(b) shows the result of making the cut.
Notice that in Figure 5.10(b) the n-well around the fringes of the cell has been reserved so that level-restoring logic, which includes PMOS transistors, can still be implemented.

(a) before n-well cutout (b) after n-well cutout
Figure 5.10: Standard cell layout structure before and after the n-well cutout is made

Further, part of the n-well regions was used to implement the CMOS logic drivers (inverter chains) for the pass transistor gates. Also of importance is the fact that no design-rule violations occur when these cells are abutted against "normal" cells. In essence, the fringe n-well regions preserve the continuous n-well region that must extend across an entire standard cell row after cell placement and routing.

Figure 5.11: Illustration of resource allocation in a multiplexer standard cell layout

In addition to the above design strategies, multi-height standard cells [78] were also used to create more area-efficient designs. None of these new cells use more than two metal levels for routing. Figure 5.12 shows a double-height standard cell that uses two metal levels in a 32:1 multiplexer.

Figure 5.12: Double-height standard cell layout of a 32:1 multiplexer (2 metal layers)

5.4 Layout Improvements

Results from the tactical cell implementation are provided in Table 5.1. They show that an area improvement factor of 2.5X is achievable for a single SRAM cell when compared to a flip-flop. In addition, area reductions of between 3.5X and 7.6X were achieved for the multiplexer and LUT circuits.

Table 5.1: Layout area improvements with tactical cells vs. generic standard cells

Cell        Generic standard cell area (µm²)   Custom standard cell area (µm²)   Improvement factor
1-SRAM                    61                                  24                        2.5
8:1 Mux                  267                                  77                        3.5
16:1 Mux                 899                                 146                        6.1
32:1 Mux               2,228                                 293                        7.6
4-LUT                  1,875                                 530                        3.5
5-LUT                  4,180                               1,061                        3.9

The SRAM cell is smaller than a flip-flop because it has fewer transistors (compare Figures 4.11 and 5.2). The custom multiplexers are much smaller than their generic standard cell implementations because minimum-size NMOS pass transistors occupy less area than CMOS logic gates (see Figure 4.10). Furthermore, the area-saving layout techniques described in Section 5.3 also improve cell area efficiency.

(a) flip-flop (routing only) (b) SRAM (detailed)
Figure 5.13: The flip-flop used for configuration memory in the previous approaches, alongside the tactical SRAM cell

Figure 5.13 shows a tactical standard cell SRAM alongside a standard cell flip-flop to further illustrate the size differences that can exist between customized cell libraries and generic libraries.

5.5 eFPGA Design Results

Given the significant cell area reductions achieved with custom tactical standard cells relative to generic standard cells, the next step was to evaluate their impact on the overall area and performance of the standard-cell-based, island-style eFPGA architectures described in the previous chapters.

5.5.1 Area Improvements

In order to evaluate the impact of tactical cells on the area of eFPGAs implemented with generic standard cells, a detailed area breakdown of the island architectures implemented using generic standard cells was performed. These results were retrieved from reports generated by the ASIC design tools. From these reports it was possible to measure the area contribution of each cell or cell group used in the soft eFPGA implementation. With this information it was then straightforward to replace cell groupings in the eFPGA with their equivalent tactical standard cell implementations, as illustrated in the sketch below.
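The sketch below illustrates this kind of substitution-based area recalculation, starting from a per-cell-group breakdown of the sort reported by the ASIC tools. The cell names, instance counts, and substitution map are hypothetical; the tactical areas are loosely based on Table 5.1.

```python
# Sketch: recomputing an eFPGA core's area as if tactical cells had replaced
# their generic standard-cell equivalents. Hypothetical breakdown for illustration.

generic_breakdown = {            # cell group -> (instance count, generic area per instance, um^2)
    "4LUT":   (128, 1875.0),
    "MUX16":  (256,  899.0),
    "CFG_FF": (2048,   61.0),
    "OTHER":  (1, 200000.0),     # buffers and other cells that keep their generic implementation
}

tactical_area = {                # cell group -> tactical area per instance (um^2)
    "4LUT":   530.0,
    "MUX16":  146.0,
    "CFG_FF":  24.0,             # 1-bit configuration SRAM replacing the flip-flop
}

def core_area(breakdown, substitutions=None):
    """Total core area; cell groups with a tactical equivalent use the tactical area."""
    total = 0.0
    for group, (count, generic_um2) in breakdown.items():
        per_cell = substitutions.get(group, generic_um2) if substitutions else generic_um2
        total += count * per_cell
    return total

before = core_area(generic_breakdown)
after = core_area(generic_breakdown, tactical_area)
print(f"area reduction: {100 * (1 - after / before):.1f}%")
```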
Therefore, in order to estimate the impact of tactical cells on area, the total eFPGA core area was recalculated as if tactical cells had been used instead of generic standard cells. This was done for all cell groups for which custom tactical standard cell equivalents were designed and implemented. Experiments show that, on average, an area reduction of 58% is achievable for an island-style architecture using customized tactical standard cells rather than generic standard cells. In Figure 5.14, results are provided for a set of 9 MCNC benchmark circuits, and all area results have been normalized relative to the area per tile of a full-custom equivalent. Other circuits were considered, but only the results for some of the largest benchmarks (between 56 and 1300 4-LUTs) are shown. Figure 5.14 also compares the results obtained with custom tactical cells to a full-custom eFPGA of the same architecture. The area results for the full-custom implementations are based on estimates obtained from a version of the VPR tool characterized for a 0.18um process [49]. The results show that an equivalent full-custom eFPGA is about 2-3X smaller than the implementation that uses custom tactical standard cells; the average (geometric) overhead is 2.86X.

Figure 5.14: Area comparisons of customized tactical standard-cell-based eFPGA implementations with generic standard cell and full-custom implementations

5.5.2 Delay Improvements

In order to evaluate the impact of tactical standard cells on eFPGA speed, a scheme was devised that leveraged the existing static timing verification tools in the ASIC flow. In particular, the eFPGA gate netlist was "programmed" as described in the previous chapter, and a static timing report was then generated for the critical delay paths in the original eFPGA ASIC implementation. This report details the timing contribution of standard cells and wires in the critical path. From this report, it is possible to identify the generic standard cells or standard cell groups that would be replaced with tactical cells if the eFPGA were implemented using these cells. The main task was then to characterize the tactical cells for speed in a fashion that accounted for the loading and drive characteristics of the cells that remained unchanged (that is, cells not replaced with tactical standard cells).

SPICE simulations were used to characterize the tactical cells for speed. However, rather than simulate each cell in isolation, it was more efficient to simulate groups of tactical cells connected as they would be in an actual FPGA. For example, the schematic in Figure 5.15 models the path starting from an input connection block multiplexer input ("A" in Figure 5.15) and ending at the output of a LUT in a BLE ("D" in Figure 5.15). Notice that in this figure the loading due to other parts of the eFPGA architecture (e.g., the LUT-MUXs) has been included. Also modeled is the path starting at the output of a LUT and ending at the output of another LUT in the same CLB (e.g., the path from "D" to "G" in Figure 5.15). As shown in Figure 5.16, the path from the input of a BLE output tristate buffer through one level of switch element and ending at one of four possible switch outputs was also simulated (the choice of switch output is unimportant since each one is loaded about the same).
This is the path from "B" to "C" in Figure 5.16. The goal of this SPICE experiment was to determine the timing overhead from the input to the output of a switch element.

Figure 5.15: Illustration of the SPICE simulation setup used to measure logic block delay

During the experimental setup for speed characterization, it was important to ensure that we simulated the loading effects due to cells or cell groups that were not replaced with tactical standard cells. For example, the track buffers (see Figure 5.8) in the original standard cell implementation were of fixed drive strength for all eFPGA implementations. In the current implementation, these buffers are not replaced with tactical cells, so it was important to account for this in the SPICE simulations. Therefore, the track buffers used in the experiments depicted in Figures 5.15 and 5.16 are the same ones used in the original standard cell implementation. There are other instances where a similar approach was required, so special scripts were created to extract this information from the netlist and create a database. This was needed because the cells used to implement the same function can differ depending on the value of N, for example, or on timing constraints. Creating a database of the different implementations and the cells used in each case made it possible to determine the worst-case loading constraints for use in all the SPICE simulations. All tactical standard cell SPICE simulations were done under nominal process conditions for the 0.18um process technology node.

Figure 5.16: SPICE simulation setup used to measure the routing switch element delay

Once all paths of interest had been characterized for timing using extracted layouts of the tactical standard cells, the data was used to back-annotate the static timing reports generated during static timing verification of the original eFPGA implementations. Scripts were created to annotate and recalculate timing for all previously reported critical paths for the different eFPGA implementations (a simplified sketch of this back-annotation step is given below). The implementation based on customized tactical standard cells shows an average circuit speedup of 40% on the critical path, compared to an eFPGA implementation using generic standard cells. For the benchmark circuits considered (including some not shown in Figure 5.17), the circuit speedup ranged between 28% and 48%. Results for 9 MCNC benchmark circuits are shown in Figure 5.17 below, as well as comparisons with equivalent full-custom implementations. Our timing results for full-custom implementations of the same architecture are also based on estimates from a version of the VPR CAD tool characterized for a 0.18um process. The timing results achieved with our custom tactical standard cells are within 10% of a full-custom implementation.
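The sketch below illustrates the kind of back-annotation script described above: critical-path cells that have tactical equivalents take on their SPICE-characterized delays, while unchanged cells keep the delay from the original timing report. The path contents and the delay values are hypothetical placeholders, not measured data.

```python
# Sketch: back-annotating a static-timing critical path with tactical-cell delays.
# Hypothetical delay table and path; real values came from the ASIC timing
# reports and from SPICE characterization of the tactical cells.

tactical_delay_ns = {      # cell group -> characterized tactical delay (ns)
    "MUX16": 0.35,
    "4LUT":  0.55,
}

# Each path element: (cell group, generic delay from the timing report, ns)
critical_path = [
    ("TRACK_BUF", 0.15),   # not replaced: keep the reported delay
    ("MUX16",     0.58),
    ("4LUT",      0.92),
    ("MUX16",     0.58),
]

def annotated_delay(path, tactical):
    """Replace the delay of every cell that has a tactical equivalent."""
    return sum(tactical.get(group, generic) for group, generic in path)

original = sum(delay for _, delay in critical_path)
updated = annotated_delay(critical_path, tactical_delay_ns)
print(f"speedup on this path: {100 * (1 - updated / original):.1f}%")
```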
Figure 5.17: Delay comparison of customized tactical standard-cell-based eFPGA implementations with generic standard cell and full-custom implementations

Although the current work has not focused on power reduction per se, it is reasonable to assume that power savings have been achieved as a side-effect of the work done here. For example, since the overall cell area has been more than halved, gate capacitances and interconnect lengths can be expected to see similar reductions, thereby reducing power dissipation.

5.6 Comparison to GILES

It was also useful to compare the standard cell approach with the GILES approach [68] [69] [70]. GILES is essentially an automated, cell-based, semi-custom design strategy for implementing programmable logic tiles. This approach is similar to the standard cell approach used here, except that a new custom suite of "FPGA-aware" tools was developed for the purpose; specifically, an entirely new back-end suite of tools for cell placement and routing. Furthermore, the cells used to implement the programmable logic tiles do not adhere to the standard cell format; for example, cells are not necessarily the same height, so Vdd and Gnd lines may not always be abutted. In [68] the authors reported that their approach resulted in a Virtex-E FPGA tile implementation that was within 36% of the full-custom implementation of the same tile. Their results also suggested that a standard-cell-based implementation of the Virtex-E tile, using custom SRAMs and pass transistors in the output connection block [69], resulted in a tile implementation that was 75% larger than the full-custom version. The authors acknowledge in [68] [69] that the wide fan-in multiplexers in the standard cell implementation were not replaced with pass-transistor-based multiplexers, whereas the multiplexers in the GILES implementation did use pass transistors. Our work has already shown that this can have a significant impact on the densities achievable with standard cells.

In order to evaluate the impact of wide fan-in multiplexers on the standard cell implementation of the Virtex-E tile, it would have been ideal to have access to the netlist used to implement the tile. Since this information is not readily available, a reverse approach was used: rather than improving the standard cell implementation with pass-transistor-based wide fan-in multiplexers, the GILES implementation is made worse by "bloating" the multiplexers in the netlist so that the cell areas are at "pre-optimization levels". In other words, the cell areas are increased so that they equal the area of a generic standard cell implementation of the same multiplexer. This was possible because the database of all cells used in the GILES implementation was provided by [69] [70].

To scale the multiplexers in the GILES implementation appropriately, the areas of equivalent generic standard cell implementations had to be determined. For this purpose, the graphs in Figures D.1 through D.3 were used. The plots labeled "X" in all three figures are based on a database of area results from the synthesis of the different multiplexer sizes used throughout our research; for the synthesized multiplexers, each data point is the average synthesis area for that size of multiplexer. In addition, the plot labeled "Y" in Figure D.1 gives the scaling trend for pass-transistor-based multiplexers relative to the number of inputs, and is based on actual layout results.
Comparing our layout results with the layout results for the GILES cells showed that the areas are more or less identical. For example, based on our layout results (see plot "Y" in Figure D.1), a 25-input multiplexer has a layout area of 140 µm², while an equivalent GILES cell has an area of 174 µm² (based on supplied, unpublished data for the GILES cells). Figures D.2 and D.3 also use data based on synthesis results and actual layout results to predict the scaling trends. A different plot was needed for LUTs because full-custom layouts of LUTs include buffers for fast select inputs (Figure D.2).

Table 5.2 shows the layout areas and scaling factors for our tactical cell layouts and the GILES cells, versus the generic standard cell multiplexer and LUT implementations obtained using logic synthesis. In all synthesis experiments, constraints were set to give the best balance between total cell area and overall delay. The multiplexers listed in Table 5.2 are the same ones used in the GILES implementation of the Virtex-E architecture; 2-input multiplexers have been excluded because there are very few of them and they are fairly small. Based on these scaling factors, the area of the GILES implementation of the Virtex-E architecture increased by a factor of 1.62X. Given the initial results in [68] that suggested the standard cell implementation was 1.55X the GILES cell area, the new results obtained here suggest the tactical standard cell implementation is 0.96X the GILES area.

Table 5.2: The scaling factors for multiplexer logic bloating in the GILES Virtex-E tile

Multiplexer inputs   LUT size (K)   Tactical cell area (µm²)   GILES cell area (µm²)   Generic cell area (µm²)   Scaling factor
       -                  4                 149.02                   107.59                   810.78                5.44
      12                  -                  70.13                    71.87                   333.85                3.10
      16                  -                  91.48                   107.59                   448.93                4.17
      25                  -                 139.51                   174.24                   707.86                4.06
      26                  -                 144.85                   181.21                   736.63                4.07
       4                  -                  27.44                    24.39                   103.69                3.78
       6                  -                  38.11                    39.20                   161.23                4.11
       8                  -                  48.79                    52.71                   218.77                4.15

The results above would seem to place a tactical cell eFPGA within 36% of a full-custom design, which contradicts the estimates obtained from VPR. For this reason a methodology to estimate the area overhead was devised; it is explained in Appendix D. Our estimates (see Table D.1) place a tactical standard cell eFPGA at between 1.66X and 3.11X. However, the more likely estimate is probably between 1.66X and 2.84X (closer to 1.66X), since these estimates are based on an architecture (Virtex-II) that is similar to the island-style architecture used in this research.

It should be noted that these results are sensitive to the synthesis constraints used to obtain the generic standard cell areas. As a result, the scaling factors used here may be lower or higher depending on the design emphasis (speed, area, or both) of the original GILES implementation; this is not clear from the literature published on the topic. However, since all synthesis constraints were tuned for area and delay optimization, it is likely that our scaling factors are reasonable. It is not possible to investigate this further within the scope of the current work.

5.7 Sensitivity Case Study

In this section, architecture sensitivity to standard cell libraries is investigated. For example, the product-term architecture [15] [16] benefits from the fact that its product-term block (roughly 76% of core area) is based on logic gates like NANDs and NORs that are already well tuned in most commercial standard cell libraries. This is not the case for the island-style architecture.
Therefore, in this experiment, tactical cells are used to implement some parts of the island-style architecture. In particular, the logic component of the island-style architecture (the contents of the CLB, excluding the program flip-flops) can benefit from the inclusion of tactical standard cells. This is essentially equivalent to implementing the product-term blocks in [15] [16] with NANDs and NORs. The routing architecture of the island-style architecture (connection blocks and switch blocks) and of the product-term architecture (wide fan-in routing multiplexers) [15] is left unchanged.

Figures 5.18 to 5.20 show the experimental results. Island-MUX and Island-TRIBUF are island-style architectures with multiplexed and tri-state routing switches respectively, and PTB is the product-term-based architecture [15] [17]. The results in Figure 5.18 show that for the set of benchmarks considered, the product-term architecture is smaller than the LUT-based island-style architecture in all but one case. On average (arithmetic), the product-term architecture is 34.85% smaller than the island-style architecture. This is almost identical to the result obtained from comparisons in [15] [16] with a different LUT-based architecture [02] [14]. This would appear contrary to expectations, since it has been observed that the LUT-based island-style architecture requires less area on average than the current implementation of the LUT-based gradual architecture used in the comparisons in [15]; hence, one would expect a smaller (less than 34.85%) average area differential between the island-style and product-term architectures.

Figure 5.18: Core area comparison of product-term and island-style architectures

Furthermore, for this experiment, most of the benchmarks used are those for which the average area overhead of the LUT architecture [02] relative to the product-term architecture is very high (54.33% on average). Analysis of the LUT-based island-style architecture also showed that the area of the core is reduced by an average of 23.5% when pass-transistor-based multiplexers are used in the logic architecture. Therefore, the 34.4% area differential reported is in line with expected results. This means that if all the benchmarks used in the experiments in [15] had been considered here, the average area differential between the two architectures would likely be less than 34.85%. In addition to demonstrating architecture sensitivity to cell libraries, these numbers demonstrate that results obtained for different architectures are also sensitive to the benchmarks used in the experiments.

The delay results in Figure 5.19 show that the island-style architecture with multiplexed switch elements [20] (Figure 3.2) is faster than the product-term architecture for most of the benchmarks considered. These are the results obtained after the inclusion of tactical cells in the logic architecture. On average (arithmetic), and for the benchmarks considered, the LUT-based island-style architecture was 19.7% faster than the product-term core, and the tri-buffered switch architecture was on average 37.4% faster.
Although not shown, it was also found that even without the use of pass transistor multiplexers in the logic architecture of the island-style architectures, Island-MUX and Island-TRIBUF are respectively 8.8% and 29.4% faster than the product-term architecture. These results represent a significant change from the results presented in [17], even though the LUT-based architecture used for comparisons in [17] is different from the one used in these experiments. This could be related to the fact that the delay results used for the LUT-based architecture [02] are based on non-functional timing paths relative to the corresponding benchmarks; in other words, the fabric was not programmed. Furthermore, it has been noted previously that the programmed and un-programmed critical path delay results for the product-term architecture are very closely matched, whereas this is not the case for the gradual LUT-based architecture [02] that was used in [16] [17]. As a result of the significant differences in delay between the programmed and un-programmed versions of the LUT-based fabric used in [17], the speed gap between the product-term and the LUT-based architecture is much larger than it should be. This explains (in part) why the results obtained here seem like a large reversal in the delay trends of LUT-based versus product-term architectures.

Figure 5.20 compares the area-delay product of the product-term and island-style architectures. For the set of benchmarks considered, the island-style architecture Island-TRIBUF has on average (geometric) an area-delay product that is 0.99X that of the product-term architecture in [17].

Figure 5.19: Critical path delay comparison of product-term and island-style architectures
Figure 5.20: Area-delay-product comparison of product-term and island-style cores

The arithmetic average of the area-delay-product scaling factor for Island-TRIBUF is 1.11X relative to the product-term architecture. Further, Island-MUX has geometric and arithmetic average area-delay products of 1.26X and 1.38X respectively relative to PTB. These results show that the area-delay product of Island-TRIBUF is comparable to that of the product-term architecture [15] [16].

5.8 Mux Switch Evaluation

Finally, "Improved multiplexer switch 2" was evaluated as a possible candidate for tactical cell implementation in a future architecture implementation. This design is a variant of "improved multiplexer switch 1" because it also aims to speed up horizontal and vertical routes through a switch element. To evaluate its potential, results from SPICE simulations were used to annotate some MCNC benchmarks as described earlier. The SPICE simulations revealed that the fast switch route was 12.5% faster than the "slow" routes. For the smaller benchmarks this only translated to a 1% speedup; for larger benchmarks like ex5p and apex4, speedups of 3% and 6% respectively were achieved.
It was anticipated that even larger speedups could be achieved for switch multiplexers with wider fan-ins, since the slower paths would have more levels of pass gates relative to the fast path with a single pass transistor (plus buffer delay for all paths). However, an independent study in [21] showed that speedups of only about 6% were achieved in architectures of this type. It is not immediately obvious why the gains were not higher, but one reason could be that most of the delay reductions come from the use of longer wires (e.g., length-4 wires) in the architecture [21]. It would be interesting to see whether the proportion of length-4 wires in the architecture could be reduced in favor of these switches without adding significantly to area, while also reducing dynamic power dissipation.

Chapter 6  The Implications for eFPGA IP Design

The improved area and performance results of the previous chapter have some important implications for current eFPGA design and implementation strategies. Although the research work presented in this thesis is an initial exploration of potential implementation and design strategies for eFPGA circuits, and is largely independent of the larger question of how eFPGA IP should be delivered to the end-user, this chapter provides a position on this larger research question.

At present, the popular method for delivering eFPGA IP to end-users is hard IP. However, as illustrated in Figure 2.12, restrictions on core size introduce significant inefficiencies. An alternative is the soft approach [14] [16], where a configurable RTL description of an eFPGA is implemented using ASIC tools and generic standard cell libraries. This approach allows customization [48] that is impossible with hard IP (due to fixed hard IP core sizes and composition), but the reliance on generic standard cells introduces significant inefficiencies. A "middle-ground" approach is therefore desirable: one that combines the positive elements of the hard and soft IP approaches, allows customization to user specification, and retains superior area and performance metrics. Such an approach would require an IP generator similar to the commercial SRAM generators that exist today [18]. Assuming a generator of this kind existed for eFPGA IP and was based on customized libraries of standard cells, the results in the following subsection illustrate how such a generator would compare to the current hard IP approach.

6.1.1 Some "Real world" Case Studies

In order to compare an IP generator approach (based on customized standard cell libraries) with the existing hard IP approach, 9 benchmark circuits were used. Each benchmark circuit can be viewed as part of a different embedded application that might need to be modified in the future (for bug fixes or upgrades). Seven of these circuits range from 56 to 200 LUTs, and the 2 largest benchmark circuits have 1112 and 1340 LUTs respectively. For this experiment, the LUT input size (K) and cluster size (N) were fixed at 4 so that fair comparisons could be made with existing commercial hard eFPGA IP cores [44]. Also, the impact of embedding FPGA cores in a Bluetooth baseband SoC is evaluated using reprogrammable frequency hopping [74] and data encryption [74] modules. Using data available from IBM and Xilinx [41], it was estimated that on average a hard eFPGA IP core can implement anywhere from 800 to 1,371 equivalent ASIC gates per mm² in a 180nm process.
This estimate was obtained after correcting for process scaling (90nm to 180nm), and is somewhat similar to estimates published in [45]. However, these estimates assume that all the logic in the eFPGA is used to implement a target circuit or benchmark. This is often not the case, since the logic fabric is sometimes underutilized. We express logic underutilization in Equation (6.1):

    Underutilized Area per Logic Tile = (eFPGA core area) / (logic clusters used) = 1 / (Effective Logic Density)        (6.1)

Equation (6.1) measures logic inefficiency in eFPGA IP. It normalizes the overall core area relative to the actual number of logic clusters used to implement a target application circuit. Therefore, the fewer logic clusters a circuit implementation requires for a given core size, the higher the underutilization. This equation is used as the basis for the comparisons in this section; in essence, the goal is to compare the logic inefficiencies that exist in the IP generator approach and in the hard IP approach.

Results for 9 benchmarks are presented in Figure 6.1. In this figure, it is assumed that hard eFPGA core sizes of 512, 1024, and 2048 LUTs are available, as in Actel's VariCore [11]. To calculate core area (mm²), the reported ASIC gate capacity for each core [44] was multiplied by the estimated equivalent ASIC gates per mm² for the 180nm process technology.

Figure 6.1: Logic efficiency comparisons of a standard-cell-based eFPGA IP generator approach to a commercial hard IP approach for 9 MCNC benchmarks

The results in Figure 6.1 suggest that for small benchmark circuits, it is very important to match the logic capacity of the core to the circuit's needs. Even for larger benchmarks such as ex5p and apex4, with over 1000 LUTs, the IP generator approach can still achieve effective logic densities that are higher than those of the hard eFPGA IP approach. From the results for the two largest benchmarks, it is the use of customized standard cell libraries that makes the IP generator approach competitive. Furthermore, the use of configurable architectures in the IP generator approach makes it possible to prune routing overhead to save area, while hard IP tiles typically contain excess routing.

While an IP-generator-based approach using custom standard cell libraries generates cores that are larger than a full-custom eFPGA implementation of the same size and composition (Figure 5.14), it can still achieve effective logic densities (as defined in Equation (6.1)) that are superior or comparable to the best available hard eFPGA core (Figure 6.1). This stems from the fact that eFPGA vendors at present do not have the means to design full-custom eFPGAs for each application, or to customize one based on some other efficient circuit implementation method. Instead, cores of fixed size and composition are built, and the inefficiencies predicted in Figure 2.12 occur.

The impact of including standard-cell-based eFPGA cores in a Bluetooth baseband SoC was also evaluated. Based on actual eFPGA layouts and estimates of the improvements due to tactical standard cell substitution, an eFPGA that implements the baseband frequency hopping (FH) module would result in a core area increase of 31%.
If the encryption module eFPGA is included instead, the core area increases by 130% (2.3X). Both results assume an eFPGA with bidirectional routing and single-segment wires. Assuming a tactical standard cell implementation with length-4 (L4) directional wires, the corresponding core area increase from the encryption module eFPGA is about 80% (1.8X). The FH module eFPGA is not expected to benefit significantly, in terms of area, from length-4 and directional wires. To put these results in perspective, consider that the smallest available hard IP core in 180nm technology with a similar architecture is about 6.25mm², based on data in [41] [44]. The FH module eFPGAs based on generic and tactical standard cells are respectively 5mm² and 2.1mm² in area. Further, the larger encryption module eFPGA (784 4-LUTs) with length-4 wires and directional routing [20] has an estimated area of 5.3mm², which is still smaller than the smallest available hard IP core [44] in 180nm with a similar architecture. Lastly, the Bluetooth SoC is I/O limited (51% of the area in Figure E.1 is unused); hence, the inclusion of the FH module eFPGA or the encryption module eFPGA (L4 wires + directional routing + tactical cells) has no effect on die area!

Finally, although the results above show that the tactical cell approach surpasses the effective logic density of the hard IP approach, there must exist cases where this is not true (see Figure 2.12). However, with continued efforts to improve automated eFPGA generation methods (e.g., using other, more efficient circuit building blocks), it may be possible to surpass the effective logic densities of existing hard eFPGA IP cores [44] across all target circuit sizes. In essence, we advocate a departure from the existing commercial hard eFPGA IP approach towards a more adaptable and efficient methodology based on the eFPGA IP generator concept described herein. Previous work [01] [02], and the limited success of hard eFPGA IP, suggests the need for a paradigm shift in eFPGA design.

6.2 A New Paradigm for eFPGA IP Design

Despite the improvements reported here, much work remains if embedded eFPGA cores are to become mainstream. In particular, an efficient IP generator framework must be devised so that users can tailor programmable fabrics to their own needs. The results here have shown, using real-world examples, that significant inefficiencies exist in current hard IP design approaches due to the absence of application-specific customization. When users of hard eFPGA IP are restricted to IP with fixed sizes and resources, it is possible to end up with too much or too little of the resources needed for a given application. This has implications for area and speed, and presumably for power.

It has been shown in this thesis that standard cell libraries can be customized to improve the area and performance of standard-cell-based eFPGAs. However, augmenting an eFPGA RTL description with custom standard cell libraries is a deviation from the soft IP concept. For example, process independence, one of the hallmarks of the soft IP concept, would be sacrificed, because new customized standard cell libraries have to be designed for each process migration. Moreover, it is desirable to hide the details of an architecture implementation, since this is to some extent what defines the competitiveness of a product in the market. Therefore, an IP generator approach similar to SRAM generators [18] is proposed. Some of its proposed features are described next.
6.2.1 \"Open\" Architecture IP Library Similar to the \"open source\" concept used in Linux\u00C2\u00A9 software development, it is desirable to have a programmable logic architecture IP library that allows programmable logic architects and designers to augment new architectures and related CAD software to an existing IP generator. This approach makes it possible for designers to select the eFPGA architecture that best suits their design needs. For example, the results in Table 5.18 to 5.20 showed that the LUT-based island style architecture is the better choice for certain benchmark circuits, while the product term architecture is better choice for others. Clearly, having the option to choose between architectures can be quite beneficial. It is also expected that CAD infrastructure IP for architectures will include detailed area, speed and power models in addition to placement and routing software algorithms. This would allow users to evaluate the area, speed, and power characteristics of all architectures before deciding which to use. 95 6.2.2 Configurable Architecture IP An eFPGA IP generator should contain architectures that are configurable. This means that in addition to having a pool of high-level architectures as described above, each architecture in the IP library must itself be configurable (i.e., it must have an associated set of modifiable parameters). For example, if using the clustered island-style eFPGA architecture described in this thesis and in [11] [13] [20] [22] [28] for an embedded application, users should have the option to change the routing segment length distribution, LUT input size (K), channel width (W), core dimension (Dx * Dy) or other architecture parameters as needed. The absence of such a flexible system in existing hard IP approaches [44] is the reason for the inefficiencies that were illustrated in this thesis and Figure 2.12. Configurable eFPGA architectures may be described in RTL or in \"architecture files\" [11] [68]. 6.2.3 Domain-driven IP generation Different parameterized high-level architectures in an architecture library, makes domain-driven IP generation possible. In particular, it is possible to construct an eFPGA with the best available architecture for a specific application (e.g., selecting a product term architecture over a LUT-based architecture) and with optimal parameter settings (e.g. with suitable channel width or core size). The above approach is an improvement over existing approaches, because programmable logic designers in general implement programmable logic devices based on a large set of \"golden\" benchmarks, and do not cater to specific applications. This can result in large area overheads due to underutilized logic. Furthermore, it is expected that a large area overhead would also have a negative impact on speed and power (due to higher capacitance). To compensate, the authors of 96 [67] advocated architecture families comprised of sibling architectures that are in essence scaled down versions of the same \"parent\" architecture. The goal is that users will find an FPGA witnin the collection that minimizes overhead. However, inventory costs associated with having a large \"family\" limit the extent to which this can be done [67]. FPGA vendors use some of these same techniques and it is in a sense equivalent to having a collection hard IP core sizes as is done in [44]. 
Applications intended for an eFPGA are much smaller in number and better understood beforehand; therefore, the inefficiencies associated with the usual design approach for standalone FPGAs and hard eFPGAs referred to above can be avoided through domain-specific customization [01] of eFPGA IP. Standalone FPGAs do not present the same opportunities, because the potential application space is much larger and less is known about intended future applications. Deciding on the most suitable eFPGA IP architecture implementation involves experiments with example circuits that are representative of the target application, and a series of parameter sweeps to determine the most suitable architecture implementation. In an SoC design, for example, some of the target applications would be known in advance due to re-use of application IP, and therefore it is possible for SoC designers to tailor the generated eFPGA IP to the SoC application. It is expected that these optimization experiments, based on user-supplied design constraints, would be automated and run within the proposed IP generator tool, and would therefore not be visible to the user.

Finally, programmable logic architectures in general are simple enough that they are relatively predictable. For example, the relative sizes of the logic used will change for a particular architecture implementation (e.g., different N or K values), but the logic function will be the same. Furthermore, there is no need to build multiplexing logic with different drive strengths, because minimum-sized pass transistors have been found to give the best results in general [66], and multiplexer inputs and outputs can be buffered as needed. As a result, traditional and sometimes complex logic synthesis of the kind used in Chapter 3 and in [14] [15] is not required; only gate resizing of the input and output drivers is needed, in addition to selecting the size of the multiplexing logic (e.g., a 32:1 versus a 16:1 multiplexer). Therefore, a reasonable approach could involve combining gate resizing with architecture optimization, using techniques similar to those used in [31] [35]. The end result would then be a circuit-level architecture netlist that has been resized and is ready for layout.

6.2.4 Automated Layout generation

A layout implementation framework is needed to complement architectural flexibility, so that physical layouts of programmable logic architectures can be generated automatically. This could be achieved with a scripted and fully automated ASIC back-end design flow with modifications for eFPGA design [14] [75] [78], or with a new custom design flow as in [68]. Such a design flow would typically include support for design partitioning that allows regular programmable logic architectures with repeated structures (e.g., eFPGAs) to be tiled and replicated (e.g., Cadence First Encounter [23]). The need for an automated layout design flow that supports tile replication is based on the notion that architecture IP implemented in the proposed IP generator should have a regular structure that allows tile replication. Standalone FPGA chips and hard eFPGA cores are built using a structured approach that replicates a single well-optimized tile to form a programmable logic array or fabric. This approach allows tile re-use and improves the quality of the design. A regular arrayed architecture like the island architecture makes it possible to build an eFPGA fabric from a single replicated tile.
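The tile-replication step itself is very simple to express. The sketch below builds the instance list of a fabric by replicating one pre-implemented tile Dx x Dy times; it is only a schematic illustration of the approach, and the data structures and naming scheme are invented for this example (a real flow would emit structural RTL or a placed-and-routed tile macro).

```python
# Illustrative tile-replication step for a regular island-style fabric.
# The TileInstance representation is invented for illustration only.

from dataclasses import dataclass

@dataclass
class TileInstance:
    name: str
    x: int
    y: int

def replicate_tile(dx, dy, tile_name="island_tile"):
    """Return the instances of a dx-by-dy array built from a single tile.

    Because every tile is identical, timing extraction and characterization
    only need to be performed once on the master tile and can then be
    applied to every instance."""
    return [TileInstance(f"{tile_name}_x{x}_y{y}", x, y)
            for y in range(dy) for x in range(dx)]

# Example: a 28 x 28 array of single-BLE tiles contains 784 4-LUTs.
fabric = replicate_tile(28, 28)
print(len(fabric))   # 784
```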
In addition to facilitating tile re-use during design, a regular fabric also has a positive impact on the quality of results achieved with existing EDA tools. For example, it was observed that attempting "flat" logic synthesis and/or place and route of a large standard cell eFPGA fabric can take several hours and produce less than desirable results, or in some cases lead to tool failure. However, a "divide and conquer" strategy allows synthesis to run to completion and achieves good results for the largest MCNC benchmark circuits (> 1,000 4-LUTs). A regular fabric structure makes this process straightforward, because a single tile (small in comparison to the fabric) can be mapped to a gate-level netlist rather easily, and then replicated at a higher level to form the programmable fabric. This same principle of localized design optimization can be applied to physical layout design as well. The fabric layout shown in Figure E.1 of Appendix E (784 LUTs) was implemented in this way.

A tiled regular fabric also simplifies eFPGA characterization and eFPGA CAD design, because post-routing timing extraction and characterization need to be done only for a single tile (not the entire fabric) and then applied to all tiles, since identical nets in each tile will have the same wire length. Otherwise, the timing arcs in each tile must be extracted separately. This process can be time consuming and requires eFPGA CAD tools to store more complex databases. In a regular architecture like the island-style architecture with repeated tiles, the same results can be achieved in much less time.

Finally, it was observed that a regular layout structure (see Figure E.1) reduced wire lengths by about 21% on average. However, because wire delays make up only 10% of the overall delay on average, the wire length reductions translated into only a 3% reduction in delay for benchmarks regardless of size. This result is not surprising, since the architecture used here has only short wires (single segment length). It is expected that structure will have more of an impact in architectures with long wires [11]. Nonetheless, this result is significant, because it suggests that results similar to those obtained in [14] with a new CAD flow could have been achieved by imposing physical layout structure alone.

6.2.5 Summary

In the final analysis, the proposed eFPGA IP generator method, which includes an "open" architecture IP library, individually configurable programmable logic architectures, domain-driven IP generation, and an automated layout generator that relies on highly optimized custom cell libraries, lies somewhere between the hard and soft IP approaches described previously, because it aims to combine only the best properties of both IP approaches. We call the proposed approach the firm IP approach because it is neither truly soft nor hard. In Table 6.1 below, we summarize the key distinguishing characteristics of the soft, firm, and hard IP approaches.
Table 6.1: Summary of soft, firm and hard eFPGA implementation methodologies

Soft eFPGA | Firm eFPGA approach | Hard eFPGA
Behavioral RTL | Structural RTL or gate-level netlist | Transistor-level design
ASIC flow | Custom ASIC flow | Full-custom flow
Generic standard cells | Custom tactical cells and generic standard cells | Full-custom design
Logic synthesis required | No logic synthesis required | No logic synthesis required
Cells free to move | Regular, tiled structure | Regular, fixed-tile structure
Configurable architecture | Configurable architecture | Fixed architecture
Flexible size, no fixed shape | Flexible size, shape can be fixed | Fixed size and shape
Mixed with cells used for rest of design (fixed logic) | Designed as a separate core and inserted | Designed as a separate core and inserted
Small amount of programmable logic | Small-to-medium amount of programmable logic | Medium amount of programmable logic

Chapter 7 Final Conclusions and Future Research Work

7.1 Conclusions

In this thesis, a generic island-style eFPGA was successfully implemented using the ASIC flow. In the process, design issues were resolved that previously excluded this class of eFPGA architectures from being implemented within the ASIC design flow. In particular, a "workaround" was devised to prevent the occurrence of combinational loops in eFPGA fabrics during logic synthesis and functional verification. As a result, new architectures that could enter states that create combinational feedback loops can be explored without any design or implementation issues. Furthermore, this solution was leveraged to minimize the occurrence of glitch power dissipation during programming. Also, a novel technique for implementing I/O for island-style eFPGAs, which uses the switch blocks around the core periphery to implement I/O, was introduced and successfully implemented. Two novel multiplexing switches that improve the speed of eFPGA fabrics by as much as 24% for the set of benchmarks considered were also presented. A configuration scheme (similar to that used in commercial eFPGAs) that allows tiles in a fabric to be programmed individually or in a row was described and successfully implemented. This scheme has implications for testability and programming power. For example, the ability to program tiles individually facilitates fault diagnosis during testing. Also, because only the clock networks of targeted tiles are activated, switching power is minimized.

In this thesis it was also shown that significant improvements in area and speed could be achieved by using FPGA-centric tactical cells to implement eFPGAs rather than generic standard cells. An average core area reduction of 58% was achieved for the set of benchmarks considered. Compared to the area results reported by VPR for a full-custom equivalent, tactical cell based eFPGAs are about 2-3X larger. However, other estimates based on results for commercial fabrics show that this overhead lies somewhere between 1.66X and 3X. An average delay reduction of 40% was also achieved over the same set of benchmarks. These delays were found to be within 10% of a full-custom equivalent. In addition, results achieved with tactical standard cells were compared to results achieved with a new approach called GILES, which uses non-standard cells and custom tools in a flow similar to the ASIC flow. The results obtained showed that the cell and layout densities achievable with tactical standard cells are comparable to those obtained with the GILES approach.
Next, the results obtained here were compared to the product term architecture in [15] [16]. The purpose of this experiment was to measure how having close-to-optimal building blocks for each architecture might affect conclusions. Since the NANDs and NORs used in the product-term architecture are already close to optimal in most standard cell libraries, this architecture is at an advantage relative to LUT-based architectures that rely more heavily on multiplexers; multiplexers built with typical standard cell libraries are not optimal. In this experiment, it was found that including tactical multiplexers in the logic architecture of an island-style eFPGA (equivalent to having NANDs and NORs in product-term blocks) resulted in a delay improvement of about 35% in the LUT-based island architecture relative to the product term architecture. It was also found that, without tactical cells, the LUT-based island architecture is still about 20% faster on average. In terms of area, the island-style architecture is roughly 35% larger. It is expected that this difference would be even less if all the benchmarks used in [15] were considered. In terms of area-delay product, and for the set of benchmark circuits considered, both architectures are comparable.

In Chapter 6 it was shown that eFPGAs based on tactical standard cells are competitive with commercial hard eFPGA cores. For the set of benchmarks considered, it was observed that higher effective logic densities could be achieved using tactical standard cell eFPGAs. This is possible because tactical cell eFPGAs can be tailored for a given application, and so the extra routing and logic overcapacity associated with the hard eFPGA IP approach is avoided almost completely. It was also shown, using an actual Bluetooth baseband SoC design, that either the baseband frequency hopping module or the data encryption module could be replaced with its eFPGA equivalent without any impact on the overall die area of the Bluetooth SoC. This is possible because both modules together account for less than 0.7% of the SoC design area. Furthermore, this SoC design is I/O limited, and so a significant portion of the area within the I/O ring can be used at no cost.

Further, in Chapter 6, a new paradigm for eFPGA IP design was proposed to minimize the overhead normally associated with programmable logic fabrics and possibly make them more attractive to designers. In particular, a system based on an open source architecture library was proposed, so that designers are able to choose the most suitable architecture for a given application. The results in this thesis have already shown that some circuits, for reasons that are not yet fully understood, appear to "prefer" one architecture over another (other factors such as CAD tool algorithms cannot be ruled out [12]). Other recommended features include configurable architectures, such as the ones used here, and domain-driven IP generation to minimize overhead.

7.2 Future Work

The tactical standard cells developed in this work need to be fully incorporated into the ASIC design flow so that the results obtained in this thesis can be further corroborated. Furthermore, it is also expected that this will result in further savings due to reductions in the layout area overhead; in this thesis, a pessimistic estimate of the layout area overhead for tactical standard cell eFPGAs was assumed.
Techniques to further minimize the SRAM cell area overhead should be investigated, because SRAM cells constitute a large proportion of the fabric area and are relatively easy to design. Data in [25] suggest that an SRAM cell area of roughly 11um2 is possible in 180nm process technology. Given the workaround for combinational loops presented in this thesis, it would be interesting to see the improvements that can be made to the previously implemented architectures in [02] [15] [16]. Finally, a chip design identical to the one in [14] was implemented using the island architecture in a 180nm process. It would therefore be useful to also implement the same design using customized tactical standard cells. This would present the chance to compare the power dissipation of both approaches, since it is difficult to obtain exact estimates of power consumption with available ASIC tools. Moreover, this would also be an excellent opportunity to corroborate the current findings.

7.3 Contributions

• A parameterizable island-style architecture was implemented in RTL. In addition, a novel I/O design was developed that reuses excess routing resources around the eFPGA edges.

• A clever workaround was devised for the combinational loop problem that plagued previous architectures implemented in the ASIC flow. As a result, designers can now explore a broader range of architectures. For example, the dual-network architecture [15] could be improved, because the work done here implies that an extra routing network is no longer needed. Also, the gradual architecture [02] can now be modified to support sequential logic.

• Two novel switch designs, "Improved multiplexer switch 1" and "Improved multiplexer switch 2", were introduced. For the set of benchmarks considered, these switches resulted in speedups of between 1% and 24%. However, "Improved multiplexer switch 1" results in an area overhead of 3% for generic standard cells and about 13% for tactical standard cells. Although the maximum delay savings due to "Improved multiplexer switch 2" are not very high (6%), its area overhead is minimal or non-existent compared to the original multiplexer switch.

• The design and implementation of tactical standard cells for the island-style architecture implemented as part of this project resulted in area and delay savings of 58% and 40% respectively. Compared to a full-custom equivalent, our delay results are within 10%, and our area results are somewhere between 1.66X and 3X based on our estimates and those from VPR.

• It was shown that the densities achieved with the tactical standard cells developed as part of this work are comparable to the cells used in GILES. This is possible because a novel technique was devised to improve the layout area efficiency of pass-transistor-based multiplexers. As a result, with the standard tools of the ASIC flow, it was possible to achieve results identical to GILES, which used custom-designed tools and a non-standard cell layout.

• It was shown that standard-cell-based architectures are very sensitive to the cells that are available in a library. Using a product term architecture and the LUT-based architecture developed here, it was shown that conclusions can change significantly when each architecture is implemented with cells that are relatively well optimized for that architecture.

• It was shown that an eFPGA fabric based on the tactical standard cells developed as part of this work is competitive with a commercial library of hard eFPGA IP fabrics; in the example considered, the tactical cell approach is better in all cases and for a range of circuit sizes.
• Using an actual "real-world" wireless platform design, it was shown that two modules that lend themselves to programmability, namely the frequency hopping and data encryption modules, could be embedded in the SoC platform without any increase in die area.

• It was demonstrated in this thesis that a regular eFPGA architecture has numerous benefits. For example, it was possible to implement a fabric with 784 4-LUTs in a relatively short time (roughly 3 hours versus 17 hours for a flat design) by exploiting the regularity of the island-style eFPGA during synthesis and layout. This approach also resulted in a 21% reduction in wire lengths. Although this only translated into a 3% speedup of the fabrics in general, the implication is significant, because it suggests the same results obtained in [14] could have been achieved by imposing layout structure to better manage wire lengths.

• A novel paradigm for eFPGA design based on an open architecture library was proposed, to enable designers to select the best architecture for a given application and minimize overhead.

Appendix A Standard Cell Based eFPGA Implementation Results

A.1

In order to verify the RTL implementation of the island-style architectures described in Chapter 3, a reference design with an embedded core was implemented. This is the same reference design that was used in previous work [14] [77]. The design is a test interface for embedded core testing [76]. In essence, this design is an adaptor that allows an embedded IP core in a SoC design to "plug" into an on-chip test network and receive test packets depending on the destination address of the packet.

Figure A.1: Block-level diagram of the test interface module with an eFPGA fabric

As in previous work [14], the next state logic of the primary controller in this interface was replaced with programmable logic, leaving only the output logic of the controller as fixed logic (see Figure A.1). This design was successfully implemented, and Figure A.2 shows a screen capture of the gate-level simulation waveforms for the design. In Figure A.2 the states transition as expected (at the output of the embedded fabric), and the fixed output logic module responds to the state transitions as expected. Finally, the measured critical path delay through the eFPGA core (based on the original multiplexer switch) was 33.2ns, and 34.24ns for the entire chip. The eFPGA core area alone was 389,550um2.

Figure A.2: Simulation waveform capture of eFPGA state transitions and response
Figure A.3: Simulation waveform capture of glitches and glitch isolation in a BLE

Furthermore, all of the architecture, CAD, and design issues highlighted in the previous chapter and in this chapter were resolved. For example, Figure A.3 shows a simulation screen capture from the eFPGA configuration phase. As the figure shows, several transitions and glitches occur at the output of the LUT during configuration, as predicted (see Figure 4.7). Figure A.3 also shows that these glitches have been isolated from the multiplexer output ("mux_o" in Figure A.3) using the scheme illustrated in Figure 4.7. The power savings that result from combining this glitch isolation approach with tile-based and row-based configuration have not been measured here. This is left to future work, when the fabricated test interface chip is tested and measurements of power can be obtained.

A.2

The data presented in Tables A.1 through A.3 were used to generate the plots shown in Chapter 4 of this thesis. Synthesis results were obtained using Synopsys Design Compiler. Layout was done using Cadence First Encounter, and critical path delays were obtained using the PrimeTime static timing analysis tool from Synopsys. All the results are for a 180nm process node standard cell library [24]. All critical path delay results in Tables A.1 through A.3 are for the worst-case process corner.

Table A.1: Design results for 18 eFPGA cores with tri-state buffer based switches

Circuit | Estimated area (um2) | Synthesis area (um2) | % area difference | Layout area (um2) | Critical path delay (ns)
cc      | 439018   | 471158   | 7.32  | 612505   | 7.74
cm150a  | 229030   | 248074   | 8.32  | 322496   | 8.91
cmb     | 258836   | 268481   | 3.73  | 349025   | 8.54
comp    | 455733   | 487945   | 7.07  | 634328   | 16.30
cu      | 317032   | 300631   | -5.17 | 390820   | 8.25
5xp1    | 381288   | 376979   | -1.13 | 490073   | 5.14
i1      | 289621   | 279649   | -3.44 | 363543   | 9.66
inc     | 381288   | 376979   | -1.13 | 490073   | 5.40
unreg   | 600952   | 635851   | 5.81  | 826606   | 6.88
rd84    | 2570245  | 2464741  | -4.10 | 3204163  | 20.92
apex7   | 1546660  | 1519160  | -1.78 | 1974908  | 11.23
alu2    | 2792540  | 2984645  | 6.88  | 3880039  | 23.87
clip    | 2018818  | 2113094  | 4.67  | 2747022  | 18.05
9symml  | 939896   | 859022   | -8.60 | 1116729  | 12.79
term1   | 1249090  | 1266233  | 1.37  | 1646102  | 15.92
cht     | 1275526  | 1353331  | 6.10  | 1759330  | 5.39
ex5p    | 23446224 | 22485924 | -4.10 | 29231701 | 42.00
apex4   | 37571447 | 39263454 | 4.50  | 51042490 | 65.37

Table A.2: Design results for 18 eFPGA cores with original multiplexer switches

Circuit | Estimated area (um2) | Synthesis area (um2) | % area difference | Layout area (um2) | Critical path delay (ns)
cc      | 419158   | 451298   | 7.67  | 586687   | 10.43
cm150a  | 221086   | 240130   | 8.61  | 312169   | 10.28
cmb     | 248244   | 257889   | 3.89  | 335256   | 9.76
comp    | 443320   | 475532   | 7.27  | 618192   | 17.74
cu      | 306440   | 290039   | -5.35 | 377051   | 9.92
5xp1    | 370696   | 366387   | -1.16 | 476303   | 7.61
i1      | 277208   | 267236   | -3.60 | 347407   | 11.30
inc     | 370696   | 366387   | -1.16 | 476303   | 8.29
unreg   | 569176   | 604075   | 6.13  | 785298   | 10.12
rd84    | 2506527  | 2401023  | -4.21 | 3121330  | 27.98
apex7   | 1483108  | 1455608  | -1.85 | 1892290  | 18.37
alu2    | 2744876  | 2936981  | 7.00  | 3818075  | 28.73
clip    | 1955266  | 2049542  | 4.82  | 2664405  | 23.69
9symml  | 929304   | 848430   | -8.70 | 1102959  | 14.53
term1   | 1216817  | 1233960  | 1.41  | 1604148  | 18.58
cht     | 1211974  | 1289779  | 6.42  | 1676713  | 9.55
ex5p    | 23128464 | 22168164 | -4.15 | 28818613 | 61.59
apex4   | 36441909 | 38133916 | 4.64  | 49574091 | 98.27
Table A.3: Design results for 18 eFPGAs using the speedy multiplexer switch 1

Circuit | Estimated area (um2) | Synthesis area (um2) | % area difference | Layout area (um2) | Critical path delay (ns)
cc      | 434740   | 466880   | 7.39  | 606944   | 9.07
cm150a  | 227319   | 246363   | 8.38  | 320272   | 9.96
cmb     | 256554   | 266199   | 3.76  | 346059   | 8.86
comp    | 453059   | 485271   | 7.11  | 630852   | 15.99
cu      | 314750   | 298349   | -5.21 | 387854   | 9.10
5xp1    | 379006   | 374697   | -1.14 | 487107   | 6.31
i1      | 286947   | 276975   | -3.48 | 360067   | 10.72
inc     | 379006   | 374697   | -1.14 | 487107   | 6.49
unreg   | 594107   | 629006   | 5.87  | 817708   | 8.50
rd84    | 2556519  | 2451015  | -4.13 | 3186320  | 24.49
apex7   | 1532970  | 1505470  | -1.79 | 1957112  | 15.52
alu2    | 2782273  | 2974378  | 6.90  | 3866691  | 26.19
clip    | 2005128  | 2099404  | 4.70  | 2729226  | 19.25
9symml  | 937614   | 856740   | -8.63 | 1113763  | 14.16
term1   | 1242138  | 1259281  | 1.38  | 1637065  | 17.51
cht     | 1261836  | 1339641  | 6.17  | 1741534  | 7.25
ex5p    | 23377776 | 22417476 | -4.11 | 29142719 | 47.23
apex4   | 37328135 | 39020142 | 4.53  | 50726185 | 77.20

Appendix B Area Model for Standard Cell Based eFPGA Area Estimation

The model below gives the area breakdown of a single tile in a standard-cell-based island-style embedded programmable logic fabric; the tile area is the sum of the routing switch block, configurable logic block, connection block, peripheral block, and configuration/clock logic components.

The Routing Switch Block
Area_sblock = 1318 x W (original multiplexer switch)
Area_sblock = 1483 x W (tri-state buffered switch design)
Area_sblock = 1447 x W (improved multiplexer switch 1 design)

Configurable Logic Block
Area_LUT program flip-flops = a x N x 2^K
Area_BLE output select multiplexers and flip-flops = N x (b + c)
Area_reset logic (2-input OR gates and associated flip-flops) = 2N x (a + d)
The remaining CLB terms (the LUT input and output multiplexers and their configuration flip-flops) are functions of N, K and the multiplexer cell areas b and f.

Tile Connection Blocks
The connection block area comprises the track buffers, the CLB input multiplexers (one for each of the 2N + 2 block inputs), and their configuration flip-flops; these terms are functions of W, the connection block flexibility, N, and the cell areas q, s, b and f.

Peripheral Blocks
The peripheral blocks along the left and bottom edges of the fabric comprise buffer arrays, switches, tri-state buffers and their configuration flip-flops; these terms are functions of W and the cell areas s and t. The corner block has a fixed area of 11,080 um2.

Configure and Clock Logic
Area_clock logic (regular or edge tile) = 3 x r
Area_select logic (regular or edge tile) = m

where:
a = cell area of the program flip-flop; b = average cell area of a 2:1 CMOS multiplexer; c = cell area of the LUT output flip-flop; d = area of a 2-input OR gate (drive strength 1); e = area of a 2-input OR gate (drive strength 2); f = area of a 4:1 CMOS multiplexer; q = area of the intra-tile track buffers; s = area of the inter-tile track buffers; t = average tri-state buffer cell area; r = area of the clock gating logic (2-input AND); m = area of the tile selection logic (OR2D1 + MUX2D1).
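As a quick illustration of how the per-tile expressions above can be evaluated, the short sketch below computes the routing switch block area for the three switch styles. The per-track coefficients are taken directly from the Area_sblock expressions above; the channel width used in the example is an arbitrary placeholder value.

```python
# Evaluate the routing switch block portion of the tile area model.
# Coefficients (um^2 per routing track) come from the Area_sblock
# expressions above; the channel width W here is only an example value.

SBLOCK_AREA_PER_TRACK = {
    "original_mux_switch":    1318.0,
    "tristate_buffer_switch": 1483.0,
    "improved_mux_switch_1":  1447.0,
}

def switch_block_area(w, switch_style):
    """Return the routing switch block area (um^2) for channel width w."""
    return SBLOCK_AREA_PER_TRACK[switch_style] * w

if __name__ == "__main__":
    W = 24  # example channel width only
    for style in SBLOCK_AREA_PER_TRACK:
        print(f"{style:>24}: {switch_block_area(W, style):8.0f} um^2")
```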
/*\"^ V ^ ^ - p v>diff^-p ^diff^p \"*p U g \u00E2\u0080\u00A2O' \"\"sb\" \"d?\" - O ' ^ use logical effort C = 3W * 2.0 fF per micron H f f l f p W r f f ' p W l f f ^ T T U 5 Figure C.1: RC network representation of shared buffering scheme in an eFPGA Delay = (B-i)[(D-l)Rtrans2Cdiff+Rlrans(Cdijr + C j + Elmore Delay ofactive path (CO) Delay of active path = Rtran&Cdiff + Cself)+ {Rlrans + R)2Cdlff +(Rums +2R\Cdiff +Cg) (C.l) D+l i Equation (C.l) is based on the general Elmore delay equation: ^ T C J ^ T R J where D is the depth of i=i j=i the pass transistor tree, i is the number of nodes from source to sink, j is the number of resistors from the source to the current node (/'* node), and B is the number of branches in the pass tree. W 0.5um Combining Equations (CO) and (C.l) with B, D = 2, Rper square = 27 KQ, L 0.18wm Rlrans =^xR per square, Cdiff = C effWpass,C g = C\Wmm, Ceff,Cg =1.0, 2.0-^and solving for W (width of PMOS transistor network) in Figure 5.11 gives the results in Equation (C.2) below: 113 49x1fT18 49x1fT18 Delaylargel = m + 78xlQ-'2 .-. W (of PMOS in buffer) = ^ , _ m (C.2) \u00E2\u0080\u00A2W Delay l3XRet -78x10\" The value of ^ obtained in Equation (C.2) depends on the required target delay (Delaytasget)-Therefore, based on the results in Equation (C.2), this target delay must be greater than 78 x 10~12s because, this represents the smallest delay that is achievable in the design example considered here. C.2 iran R R R \u00E2\u0080\u0094I\u00E2\u0080\u0094MA\u00E2\u0080\u0094; + \u00C2\u00AB * C J ( C 3 ) Where w is the number of NMOS pass transistors between repeaters and n is the depth of the pass m transistor tree. This translates to \u00E2\u0080\u0094 sections of m pass transistors for the entire pass transistor tree. n Setting ^ ^ = 0 gives Rlran(c sel/ + Cg)+^RCdiff =0 om m I (C.4) Rearranging equation (5.4) and solving for m gives m , = _ 2R t r a n [C s e l f +C g ) opt ,| R C (C.5) >diff Where m , is the optimal number of pass transistors needed between repeaters to minimize delay. 114 w Finally, Equation (C.5) evaluates to approximately 4 pass transistors if a \u00E2\u0080\u0094 r a t i o of 1.5 is used. W pmos Notice that this analytical result is identical to the result obtained experimentally through SPICE. 115 Appendix D Area estimation method for GILES and Full-custom eFPGAs D.l 1.00E+03 n 0.00E+00 0 5 10 15 20 25 30 35 Number of Inputs Figure D.1: plot of layout area vs. inputs for general-purpose eFPGA multiplexers 8.00E+03 -i Number of Select Inputs Figure D.2: Plot of layout area (excluding the SRAMs) vs. select inputs for LUT 116 7.00E+03 -i 0 20 40 60 80 100 120 140 Number of LUT Inputs Figure D.3: Plot of layout area (excluding the function SRAMs) vs. inputs for LUT 117 D.2 Earlier area estimates obtained from the VPR tool show that a tactical standard cell implementation of a generic island-style eFPGA architecture is about 2.86X the size of the equivalent full-custom implementation. However, the results in [68] [69] have suggested the area overhead could be as low as 1.36X. We next propose an alternative method to estimate the area overhead of a standard cell eFPGA relative to a full-custom device and then compare our results with previous estimates. To improve the quality of results obtained, it was necessary to consider a larger set of benchmark circuits. In particular, it was necessary to add more circuits from the \"golden 20\" MCNC benchmarks so that a more realistic area estimates might be obtained. 
Appendix D Area Estimation Method for GILES and Full-Custom eFPGAs

D.1

Figure D.1: Plot of layout area vs. number of inputs for general-purpose eFPGA multiplexers
Figure D.2: Plot of layout area (excluding the SRAMs) vs. number of select inputs for a LUT
Figure D.3: Plot of layout area (excluding the function SRAMs) vs. number of LUT inputs

D.2

Earlier area estimates obtained from the VPR tool show that a tactical standard cell implementation of a generic island-style eFPGA architecture is about 2.86X the size of the equivalent full-custom implementation. However, the results in [68] [69] have suggested the area overhead could be as low as 1.36X. We next propose an alternative method to estimate the area overhead of a standard cell eFPGA relative to a full-custom device, and then compare our results with previous estimates. To improve the quality of the results obtained, it was necessary to consider a larger set of benchmark circuits. In particular, it was necessary to add more circuits from the "golden 20" MCNC benchmarks so that more realistic area estimates might be obtained. This is needed because the very large benchmarks in the MCNC suite require a much larger routing infrastructure compared to the logic components (LUTs). In other words, the routing infrastructure of an FPGA scales at a faster rate than the logic with increasing benchmark size, and a large routing interconnect degrades the area efficiency of programmable devices. Therefore, in order to capture this effect in our estimates, it was necessary to add more golden 20 benchmarks to the initial set of benchmarks used in this work. Figure D.4 shows the distribution of logic and routing for the standard cell implementations of all the benchmarks considered. The general trend in this figure confirms that routing becomes more dominant as the size of the logic implemented in an eFPGA increases.

Figure D.4: Area distribution between the logic and routing components of an eFPGA (plotted against equivalent ASIC gates, from 60 to 4413)

The proportion of routing and logic was more or less split evenly for the smaller benchmarks (less than 2128 equivalent ASIC gates in Figure D.4). For the larger benchmarks, routing made up 70% of the total area.