Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Detailed routing architectures for embedded programmable IP cores Hallschmid, Peter 2003

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata

Download

Media
831-ubc_2003-0410.pdf [ 4.32MB ]
Metadata
JSON: 831-1.0065360.json
JSON-LD: 831-1.0065360-ld.json
RDF/XML (Pretty): 831-1.0065360-rdf.xml
RDF/JSON: 831-1.0065360-rdf.json
Turtle: 831-1.0065360-turtle.txt
N-Triples: 831-1.0065360-rdf-ntriples.txt
Original Record: 831-1.0065360-source.json
Full Text
831-1.0065360-fulltext.txt
Citation
831-1.0065360.ris

Full Text

DETAILED ROUTING ARCHITECTURES FOR EMBEDDED PROGRAMMABLE IP CORES by PETER HALLSCHMID B.Eng., The University of Victoria, 1999 B . S c , The University of Victoria, 1996  A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of  MASTER OF APPLIED SCIENCE (M.A.Sc.) in The Faculty of Graduate Studies Department of Electrical & Computer Engineering  We accept this thesis as conforming to the required standard:  The University of British Columbia June 2003 © Copyright by Peter Hallschmid, 2003  U B C Rare Books and Special Collections - Thesis Authorisation Form  Page 1 of 1  In p r e s e n t i n g t h i s t h e s i s i n p a r t i a l f u l f i l m e n t of the requirements f o r an advanced degree a t the U n i v e r s i t y of B r i t i s h Columbia, I agree t h a t the L i b r a r y shall.make i t f r e e l y a v a i l a b l e f o r r e f e r e n c e and study. I f u r t h e r agree t h a t p e r m i s s i o n f o r e x t e n s i v e c o p y i n g o f t h i s t h e s i s f o r s c h o l a r l y purposes may be g r a n t e d by the head of my department o r by h i s o r her r e p r e s e n t a t i v e s . I t i s u n d e r s t o o d t h a t c o p y i n g o r p u b l i c a t i o n o f t h i s t h e s i s f o r f i n a n c i a l g a i n s h a l l not be a l l o w e d without my w r i t t e n p e r m i s s i o n .  The U n i v e r s i t y o f B r i t i s h Columbia Vancouver, Canada  http://www.library.ubc.ca/spcoll/thesauth.html  7/28/2003  Abstract As the complexity of integrated circuits increases, the ability to make postfabrication changes to fixed ASIC chips will become increasingly attractive. This ability can be realized using programmable logic cores. These cores are blocks of programmable logic that can be embedded into a fixed-function ASIC or a custom chip. Such cores differ from stand-alone FPGAs in that they can take on a variety of shapes and sizes. With this in mind, we investigate the detailed routing characteristics of rectangular programmable logic cores. We quantify the effects of having different X and Y channel capacities, and show that the optimum ratio between the X and Y channel widths for a rectangular core is between 1.2 and 1.5. We also present a new switch block family optimized for rectangular cores. Further, we quantify the effects of logic block pin placement. Comparing to a simple extension of an existing switch block, our new architecture leads to a density improvement of up to 11.9%. Finally, we show that if the channel width, switch block, and pin placement are chosen carefully then the penalty for using a rectangular core (compared to a square core with the same logic capacity) is small; for a core with an aspect ratio of 2:1, the area penalty is 1.6% and the speed penalty is 3.8%.  ii  Table of  Contents  ABSTRACT  II  T A B L E O F CONTENTS  Ill  T A B L E O F FIGURES  V  ACKNOWLEDGEMENTS  VIII  CHAPTER 1: INTRODUCTION AND OVERVIEW  1  1.2 RESEARCH APPROACH & OBJECTIVES  4  1.3 THESIS ORGANIZATION  5  CHAPTER 2: BACKGROUND AND PREVIOUS WORK 2.1 FIELD PROGRAMMABLE DEVICES  7 V  2.1.1 PROMs  7  2.1.2 PLAs andPALs  7  2.1.3 CPLDs  8  2.1.4 FPGAs  9  2.2 FPGA ARCHITECTURES  10  2.2.7 Achieving Programmability  10  2.2.1.1 Programming Technologies  10  2.2.1.2 Gate Level Programmability  11  2.2.2 "Island-Style" FPGA architecture  13  2.2.3 Logic Resources  15  2.2.4 Routing Resources  16  2.3 CAD FOR FPGAs  20  2.3.1 Routing Algorithms  22  2.4 FPGAs IN SoC  25  2.4.1 Applications  26  2.4.2 Implications on the FPGA  27  2.5 SUMMARY  29  CHAPTER 3: RECTANGULAR SWITCH B L O C K  30  3.1 MOTIVATION  30  3.2 EXISTING SWITCH BLOCKS  31  3.3 N E W SWITCH B L O C K  36  iii  3.3.1 Baseline Rectangular Switch Block  36  3.3.2 Family of Rectangular Switch Blocks  41  3.4 E X P E R I M E N T A L F R A M E W O R K  46  3.4.1 Experimental Methodology  46  3.4.2 Experimental Philosophy  49  3.4.3 Architectural Framework  49  3.4.4 Area & Delay Model  50  3.5 E X P E R I M E N T A L R E S U L T S  52  3.6 S U M M A R Y  55  CHAPTER 4: DIRECTIONALLY BIASED CHANNELS  56  4.1 M O T I V A T I O N  56  4.2 P R E V I O U S R E S E A R C H  57  4.3 N E W S E G M E N T A T I O N A R C H I T E C T U R E  58  4.4 N E W E X P E R I M E N T A L A P P R O A C H  63  4.5 E X P E R I M E N T A L R E S U L T S  64  4.6 S U M M A R Y  71  CHAPTER 5: PIN DISTRIBUTION  73  5.1 M O T I V A T I O N  73  5.2 E X I S T I N G P I N DISTRIBUTIONS  73  5.3 V A R I O U S P I N DISTRIBUTIONS  74  5.4 E X P E R I M E N T A L R E S U L T S  76  5.5 S U M M A R Y  83  CHAPTER 6: CONCLUSIONS  85  6.1 F U T U R E W O R K  87  6.2 L I S T O F C O N T R I B U T I O N S  88  BIBLIOGRAPHY  89  APPENDIX A  95  APPENDIX B  97  APPENDIX C  98  iv  T a b l e of  Figures  Figure 1: Hypothetical System-on-Chip with Programmable Logic IP Core  2  Figure 2: Programmable switches in an S R A M F P G A  12  Figure 3: The multiplexer or tri-state buffer solution for line selection  13  Figure 4: Island-style F P G A architecture [48]  14  Figure 5: Basic Logic Element (BLE)  16  Figure 6: A logic block consisting of multiple BLEs with an interconnect matrix  17  Figure 7: An F P G A tile consists of a logic block, vertical & horizontal channels, a switch block and connection blocks  18  Figure 8: A track segmentation including tracks of length 1, 2, & 4  20  Figure 9: Standard F P G A Flow  22  Figure 10: Hypothetical SoC chip with Test Controller implemented in an E P L C [38].. 29 Figure 11: Switch block models for the Disjoint, Universal, and Wilton Switch Block Topology [48]  34  Figure 12: Advantages/Disadvantages to Disjoint and Wilton Switch Block topologies for those wire that pass through the switch block versus those that terminate at the switch block [48]  36  Figure 13: An Imran switch block is a combination of a Disjoint and a Wilton switch block  38  Figure 14: A baseline rectangular switch block is composed of an integer number of subblocks and a possibly a partial subblock  38  Figure 15: Algorithm Description of the Baseline Switch Block pattern  40  Figure 16: Example Baseline Switch with t =12 and t -4  41  y  x  Figure 17: MOD2 and MOD4 switch block patterns. Each line represents line represents t combined for clarity  43  x  v  Figure 18: Algorithmic Description of the MODn switch block patterns  45  Figure 19: Standard V P R flow [4]  48  Figure 20: Fan-out implemented using buffer sharing for switch blocks and connection blocks  52  Figure 21: Tile Area vs. Switch Block Aspect Ratio for Core Aspect Ratio = 4 for three different Switch Blocks  53  Figure 22: Critical Path Delay vs. Switch Block Aspect Ratio for Core Aspect Ratio = 4 for three different Switch Blocks  54  Figure 23: Two different designs implemented on a rectangular FPGA. One has high logic utilization and the other has low  57  Figure 24: This diagram identifies two locations where a switch should exist in accordance to the Disjoint switch block pattern. In both cases, a conflict exists between horizontal and vertical constraints  60  Figure 25: "Staggered" segmentation architecture as defined by Betz and Rose [56] to solve horizontal and vertical constraints problem. Channel, track, and logic block labeling schemes have been included  61  Figure 26: Simple extension of Betz and Roses' segmentation architecture for a directionally-biased routing architecture. This algorithm describes horizontal tracks only  63  Figure 27: New segmentation pattern allows for rectangular switch blocks that adhere to the horizontal and vertical constraints issue  65  Figure 28: TileArea vs. Track Ratio (ty/tx) for various core aspect ratios. For each point on the graph, the best switch block topology is used for that track ratio  66  Figure 29: Critical Path Delay vs. Track Ratio (ty/tx) for various core aspect ratios. For each point on the graph, the best switch block topology is used for that track ratio. 67 Figure 30: Tile Area vs. core aspect ratio. The solid line uses the best switch block and track ratio for each point. The dashed line uses the Baseline switch block and track ratio=l  69 vi  Figure 31: Critical Path Delay vs. core aspect ratio. The solid line uses the best switch block and track ratio for each point. The dashed line uses the Baseline switch block and track ratio=l  70  Figure 32: Six different logic block pin distributions. "|" shaped pins represent inputs and "o" shaped pins represent outputs  75  Figure 33: Tile area vs. track ratio for various pin distributions. Core aspect ratio = 2:1. 77 Figure 34: Critical path delay vs. track ratio for various pin ratios. Core aspect ratio = 2. 78 Figure 35: A tile consisting of a logic block, switch block and vertical and horizontal channels. Each point represents a tri-state buffer used to connect a pin to a routing track. By moving an output pin from the bottom side of the logic block to the right side of the logic block, the number of tri-state buffer connected to this pin increases from t*0.5=5 to t *l.2*0.5 = 6  80  x  Figure 36: Tile area vs. core aspect ratio. The solid line uses the best switch block, track ratio, and pin distribution for each point. The dashed line uses the Baseline switch block, an even pin distribution, and track ratio=l  81  Figure 37: Critical path delay vs. core aspect ratio. The solid line uses the best switch block, track ratio, and pin distribution for each point. The dashed line uses the Baseline switch block, an even pin distribution, and track ratio=l  82  Figure A . l : "Screenshot" from VPR showing the routing resources available for the baseline rectangular switch block  95  Figure A.2: "Screenshot" from VPR showing the routing resources available for the MOD2 rectangular switch block. Routing resources are half that of the Baseline switch block  96  Figure B . l : A "screenshot" from V P R [4] showing the segmentation patterns  97  Figure C l : A "screenshot" from VPR [4] showing the entire E P L C after layout  98  vn  ACKNOWLEDGEMENTS First, I would like to thank my supervisor, Dr. Steve Wilton. FTis guidance and enthusiasm as a supervisor, teacher, and friend, is very much appreciated. I would like to thank the members of my research group, Ernie, Tony, Steve, Imran, Kara, and Dana for their continual support.  I would like to thank my wife, May. She has been an incredible source of encouragement and emotional support throughout this thesis. I would like to thank my parents for always encouraging me that I can do whatever I set my heart on. I would like to thank the rest of my wonderful family for all their support. This work was supported by the British Columbia Advanced Systems Institute, Micronet, the Engineering Research Council of Canada, and Altera. I greatly appreciate their financial support. I would like to thank the Canadian Microelectronics Corporation for their technical support.  viii  C H A P T E R 1: Introduction a n d  Overview  The use of Field Programmable Gate Array's (FPGAs) has grown dramatically since their introduction 18 years ago. In fact, the FPGA market consists of more than 9 vendors with annual sales of over $3 billion. Their popularity hinges on the fact that they can be programmed to implement any digital circuit within minutes and with little more than a personal desktop computer. The ability to implement a production circuit with such ease has many advantages over implementation using application specific integrated circuits (ASICs). These advantages include quick time-to-market, low non-recurring engineering (NRE) costs, and easy debugging. The flexibility of an FPGA comes with the price of being slower and less dense than an ASIC. In fact, a circuit implemented on an FPGA is typically at least 10 times less dense and 3 times slower then its implementation as an ASIC [1]. Due to the extra capacitance associated with programmable switches and the wasted wirelength associated with using long tracks when they are not needed, FPGAs consume more power ASICs. Narrowing this performance gap is the motivation behind most F P G A research.  Because of their reprogrammability and low N R E costs, FPGAs are best suited for prototyping and for small to medium scale volume designs. On the other hand, designs with high speed and density demands or designs with large-scale volume tend to be more suitable for ASICs.  To achieve the benefits of both FPGAs and ASICs, fixed and programmable logic could be combined on a single chip. This way, designers could enjoy the benefits of flexibility and configurability by creating a custom chip or cell-based ASIC with an 1  embedded programmable logic IP (Intellectual Property) core (EPLC). A chip designed this way would contain both fixed logic and programmable logic. Parts of the chip that are unlikely to change could be implemented using fixed ASIC circuitry or fixed IP (Intellectual Property) cores, while functions that may change can be implemented in the EPLC.  I/O Ring and Interface Circuitry  Embedded Processor  Fixed IP Block  On-Chip Memory  Fixed IP Block  Programmable Logic  I/O Ring and Interface Circuitry  Figure 1: Hypothetical System-on-Chip with Programmable Logic IP Core.  The use of EPLCs fits well with the emerging system-on-chip (SoC) design methodology, in which third-party cores are combined on a single chip. The chip in Figure 1 is a hypothetical chip that follows this design style. As the figure shows, the chip contains a processor core, on-chip memory, two application specific IP blocks (ASICs), and an EPLC. In this case, the EPLC is simply another core that the system-on-a-chip designer can buy from a third party. Today, several companies offer such cores [2] [3].  2  The potential benefits of integrating fixed and programmable logic as described above are so compelling that we feel that the ability to make post-fabrication changes in a fixed-function ASIC will eventually become as important as design-for-testability is today. In order for this to happen, however, there are a number of challenges that must be overcome. In the past, most programmable logic architecture and C A D tool research has focused on stand-alone FPGAs. Much of this will not carry over to EPLCs which could take on a variety of shapes and sizes. A second issue is the integration of the FPGA C A D flow into the existing ASIC design flow. Yet another challenge is the pre-tape-out verification of an ASIC with an EPLC; it is unclear how such a chip can be verified if the circuit to be included within the EPLC is unknown.  Until now, F P G A research has only addressed the architectural aspects relating to stand-alone FPGAs. The architecture of an EPLC would likely differ from a stand-alone FPGA to better mesh with the surrounding ASIC circuitry. This would include a more constrained I/O placement and the ability for the E P L C to take on a variety of shapes and sizes. With this in mind, we investigate the architecture of what we believe would be the most common shape for an EPLC: a rectangle.  Depending on the application, one SoC design may call for a relatively square EPLC, while another may call for a long narrow E P L C (perhaps placed along an entire edge of the chip to provide a programmable interface to the I/O ports). Yet another design may need several small EPLCs. The example in Figure 1 uses an L-shaped EPLC. Each of these sizes and shapes makes unique demands on the detailed routing architecture.  3  Given that most FPGA architectural research to date has focused on square-shaped FPGAs, it is unclear how well these architecture concepts will map to rectangular-shaped EPLCs. A major part of this thesis will be to determine how some of the key architectural aspects are affected by the aspect ratio of EPLCs. This will lead to a simple set of guidelines that could be used by an SoC designer when developing an EPLC.  1.2 Research Approach & Objectives The goal of this research is to evaluate key FPGA architecture parameters and to evaluate how well they apply to rectangular EPLCs. Further, we would like to improve on these architectural parameters such that tile area and critical path delay will be reduced. We would like to characterize these improvements as a function of the aspect ratio of the EPLC. This thesis will focus on the following three key parameters of the routing architecture: the switch block topology, the segmentation architecture, and the logic block pin placement. We believe that improvements to these parameters have the greatest potential to improving area and timing efficiency. To proceed, a layout tool is required which will allow us iterate through various architectures and to measure improvements. This tool would need to incorporate state of the art C A D algorithms such that our results will be as applicable as possible to the industry. Timing and density information should be reported by the tool to serve as a figure of merit.  4  Developing an entirely new tool to perform place and route architectures is not necessary given that a great deal of effort has been put into existing high quality, academically available, place and route tools. One such tool, which will serve as a starting point for the research in this thesis, is the Versatile Place and Route Tool (VPR) developed by Vaughn Betz [4]. Given an architectural description of the FPGA, V P R will place and route a design netlist using state of the art algorithms. As discussed in subsequent chapters, placement will be performed using an annealing engine with a bounding box based cost function. Routing is performed using a modified Pathfinder algorithm.  Although VPR is flexible enough to study various FPGA architectures, it was not designed to study EPLCs. After identifying which relevant aspects of an EPLC's architectures differed from a standard FPGA, VPR was modified to reflect these changes.  Once VPR was capable of mapping circuits to an EPLC, it was then used to conduct architecture experiments. This involved running a battery of sample circuits through the V P R flow and determining the average quality of results using critical path delay and tile density as figures of merit. By repeating this experiment while varying each key architectural parameter one at a time, it was possible to make incremental improvements.  1.3 Thesis Organization This thesis is organized into six chapters that include the study of three major architectural aspects of an EPLC. Chapters 1 and 2 include the introduction, background information, and previous work. Chapters 3,4, and 5 include an analysis of rectangular  5  switch block topologies, directionally biased routing channels, and logic block pin placement in rectangular EPLCs. Chapter 6 is a summary of all conclusions. Much of the research in Chapters 3 and 4 of this thesis has been published previously in [5].  6  C H A P T E R 2: B a c k g r o u n d  and Previous Work  2.1 Field Programmable Devices FPDs, or Field Programmable Devices, is a general term for all devices that can programmed (and possibly reprogrammed) after fabrication. Several standard approaches to programmability are used in the industry in the form of PROMs, PLAs, PALs, CPLDs, and FPGAs. These approaches vary significantly in their complexity (and subsequently their cost) and the applications for which they are best suited.  2.1.1 P R O M s PROMs, or programmable read-only memories, are nothing more than the implementation of combinational logic in memory. Each input is connected to an address line; each combination of inputs accesses a memory location holding the appropriate value of a function. The main disadvantage to PROMs over other FPDs (Field-Programmable Devices) is that they are less efficient because they contain large address decoders that expand exponentially with the number of inputs. Their advantage is that they allow for functions with a very large number of product terms; however, real circuits rarely require more than a few product terms [6].  2.1.2 P L A s a n d P A L s PLAs and PALs (programmable logic arrays and programmable array logic) are specifically designed for implementing combination logic and consist of an A N D plane followed by an OR plane which serve to implement functions as a sum of products. In PLAs, both planes are programmable whereas only the A N D plane is programmable in 7  PALs. The advantages of these devices are their low cost and high pin-to-pin performance. The disadvantage to these devices is that the size of their AND/OR plane structures grows rapidly with the number of inputs [7].  2.1.3 C P L D s One way to scale PALs and PLAs without a large area penalty is to place several of them on a chip and to connect them with a set of programmable interconnect wires. The disadvantage is that the C A D becomes more complex. Devices with this architecture are referred to as Complex Programmable Logic Device (CPLD) and include products such as the Altera Max 6000 and 7000 [8]. As a specific example of a CPLD, consider the Altera Max 7000. The most basic hierarchical unit of this device is the macrocell that consists of a product term select matrix, a register, and local interconnect. Sixteen of these macrocells are placed together to form a Logic Array Block (LAB) which also includes interconnect for product term sharing and an I/O control block. Inputs and outputs from each L A B feed into a Programmable Interconnect Array (PIA) which is used for communication with other LABs.  Compared to other programmable devices, CPLDs have the advantage that they are fast and have a wide range of capacities. The M A X 7000 macrocell was designed to handle five product terms but can be expanded by fifteen using neighboring macrocells within the same C L B . One advantage includes the ability to subdivide circuits such that each sub-circuit can be individually mapped to a different P A L / P L A array structure, resulting in a predictable implementation. Another advantage is that old designs targeted for PALs/PLDs can easily be transferred to CPLDs while maintaining timing expectations. 8  CPLDs are often good for designs that exploit wide AND/OR gates but do not require a large number of flip-flops; finite state machines are good candidates [6].  2.1.4 F P G A s Amongst the largest and fastest growing FPDs are Field-Programmable Gate Arrays (FPGAs). Although there are many types of FPGAs, all architectures include logic blocks, I/O blocks, and programmable routing, which are arranged in a regular pattern. FPGA provide narrow logic resources; in other words, their logic blocks are generally small and uncommitted. One advantage of an F P G A over other types of FPDs is that they generally have much higher logic capacities than other FPDs and offer a higher ratio of flip-flops to logic [6]. A higher ratio of flip-flops to logic is important because flip-flops are often the limiting factor in designs. FPGAs are the most common form of FPD offered by programmable logic vendors. One such vendor, Xilinx, offers several different "families" of FPGAs that target different design sizes, design speeds, and cost requirements. Some of the more popular devices include the XC4000, the Spartan series, and the Virtex U series [9]. Another vendor, Altera, offers several families that including the Flex 8000, Flex 10K, Apex 20K, Stratix, and Cyclone [8]. The market also includes many other competitive products from other vendors including Actel [10], Cypress [11], and Quicklogic [12].  A more detailed look into FPGA architectures will be provided in Section 2.2.2. This section will describe the baseline architecture used for this thesis, but it will also include a brief description of devices from Xilinx and Altera.  9  2.2 FPGA Architectures This section provides background information on F P G A architectures. First, a low level description of programming technologies is provided. Second, a discussion of gate level programmability is provided. Third, a discussion of the higher level architecture is provided and is presented with a separate discussion for logic resources and routing resources.  2.2.1  A c h i e v i n g Programmability A circuit is implemented on an FPGA by programming each available logic block  to implement a portion of the logic required by the circuit, and each I/O block to act as either an input or output pad. The programmable routing resources are configured to connect all necessary logic and I/O blocks to completely implement the circuit. The methods used to achieve programmability has implications on the timing and density of the final circuit implementation. Many gate level design decisions are based on the programming technology and many high level architectural aspects are based on decisions made at the gate level. Because of its importance, an introduction to programming technologies is appropriate before discussing high level aspects of architecture.  2.2.1.1 Programming  Technologies  There are several different ways by which FPDs are made programmable. The most popular technique is to configure pass transistors, multiplexers, and/or tri-state buffers using S R A M cells implemented in CMOS technology [1,6,13]. By using S R A M cells,  10  these devices have the advantage of being reprogrammable, but the disadvantage of being volatile. Vendors such as Xilinx[9] and Altera[8] actively use this technology. It should be noted that wire programmability could also be achieved using nMOS transistors [14,15] with gate boosting. By doing so, area improvements can be made by eliminating the pMOS transistor in every transmission gate, and delay improvements can be made because nMOS transistors have a lower parasitic capacitance.  An alternative to S R A M cells is antifuse technology [16], which is used by companies such as Actel [10]. This technology, which Actel refers to as PLICE (programmable logic interconnect circuit element), is a modified CMOS switch that acts as an open circuit until programmed. Once programmed, the circuit takes on a low resistance. Antifuse technology is fast, dense, secure, and non-volatile; however, devices implemented using this technology are not reprogrammable. Actel [10] also makes use of flash memory. The advantage to this is that it is reprogrammable (to a limited number of times) and non-volatile. Further, they are dense and highly secure. For reconfiguration applications, flash cells have the disadvantage that they can only be reprogrammed a finite number of times.  2.2.1.2  Gate Level  Programmability  The most common type of programming cell in FPGAs is the S R A M cell. There are three types of programmable switches (see Figure 2) all of which are programmed using S R A M cells; each type has different area and delay characteristics associated with it. First, there are pass transistors, which are made up of a single nMOS transistor with a boosted gate voltage. These switches generally have an area cost of 7 transistors, 1 for transmission 11  and 6 for the S R A M cell. Although they are small, pass transistors can contribute a great deal of capacitance (via an RC chain) without contributing active drive strength [17]. In fact, [17] shows that the delay caused by a chain of pass transistors grows quadratically with the length of the RC chain.  SRAM Jh-i  JP-I  SRAM  SRAM  rC rC  Mt Mt Pass Transistor  Tri-state Buffers  Figure 2: Programmable switches in an SRAM FPGA  Tri-state buffers are an alternative to pass transistors. Their disadvantage is they require at least 11 transistors (depending on the drive strength required), 6 for the S R A M cell and 5 for the buffer (two inverters and a pass transistor). Their advantage is that they terminate capacitive loads and provide active drive strength.  Line selection allows a wire to be connected to more than one wire depending on how it is programmed. An example of this is the input pin of a logic block which can be configured to connect to one of many tracks in the adjacent routing channel. This can be 12  accomplished using either multiplexers or multiple tri-state buffers. An example for each of these is shown in Figure 3. The solution on the left side of Figure 3 requires 34 transistors: 6 for the multiplexer, 12 for the S R A M cells and 16 for the buffers. The solution on the right side of Figure 3 requires 44 transistors: 24 for the S R A M cells and 20 for the tri-state buffers.  Each type of programmable switch has advantages and disadvantages depending on the application. Discovering which switch is optimal for area and delay savings is one aspect of architecture research.  S R A M  S R A M  S R A M S R A M  S R A M  S R A M  Multiplexer  Tri-state  Buffers  Figure 3: The multiplexer or tri-state buffer solution for line selection.  2.2.2  "Island-Style" F P G A architecture The general architecture used throughout this thesis is commonly referred to as an  "island-style" architecture. This architecture is used for research because it has a simple, 13  homogeneous structure that incorporates most of the features common to FPGAs. Figure 4 is an illustration of the basic island-style architecture.  The island-style architecture includes a two dimensional matrix of logic blocks. Each logic block is used to implement a small portion of a circuit. By connecting all logic blocks via a collection of horizontal and vertical programmable routing channels, a large circuit can be implemented.  Logic Cluster  Routing Channels  Figure 4: Island-style FPGA architecture [48].  There are several ways logic blocks can be designed to implement logic. Logic blocks are typically comprised of P A L / P L A structures and/or LUTs, flip-flops and local interconnect. The principal contents of logic blocks differs from vendor to vendor and Section 2.2.3 describes what is assumed for this thesis.  14  A row of I/O blocks runs along the perimeter of the FPGA and is often referred to as the I/O ring. Each block in the ring provides the necessary buffers and capacitors to provide the required drive strength and noise protection necessary for off-chip VO. Each I/O block is connected to another channel of routing which runs just inside the I/O ring.  Fixed-function blocks are often embedded into the programmable fabric [8,9]. In this way, functionality common to most customers can be made available with the density and speed advantages of fixed logic. The most common block of this sort is R A M , which is versatile for all applications because it can be used as storage or it can implement logic as a look-up table [18,19]. Other common blocks include multipliers, processors, and DSPs [8,9].  2.2.3 L o g i c R e s o u r c e s For most FPGAs, look-up tables (LUTs) are the basic element used to implement logic. LUTs are usually referred to using the notation k-LUT where k is the number of inputs. A 4-LUT, for example, has 4 inputs and can therefore implement all logic functions with 4 or few inputs. A L U T , a register, and a multiplexer together form a basic logic element (BLE)  1  (see Figure 5). Combinational logic can be implemented by configuring the multiplexer so that it bypasses the register and connects the L U T directly to the B L E output pin. Conversely, sequential logic can be implemented by using the L U T and register together; this can be selected by configuring the multiplexer so that it connects the L U T ' s output to  1  B L E s are called logic elements (LE) by Altera and logic cells (LC) by Xilinx.  15  the register. Other circuitry is often added to BLEs such as support for the efficient implementation of carry chains.  SRAM  k Inputs  k-LUT  Output Register  CLK Figure 5: Basic Logic Element (BLE)  A logic block, or logic cluster, is comprised of one or more BLEs (see Figure 6). An interconnect matrix provides each B L E with both inputs to the logic block and feedback wires from the B L E outputs. There is one logic block output for each of its BLEs. The purpose of clustering BLEs to form logic blocks is to reduce the demand for routing resources. This is accomplished in two ways. First, by combining BLEs with common input nets into the same cluster, fewer input pins are required. Second, nets that are exclusively connected to BLEs of the same cluster can be collapsed into the cluster.  2.2.4 R o u t i n g R e s o u r c e s The regularity of an island-style architecture lends to a simplified layout. A single tile can be defined, laid out and optimized, and then duplicated many times to create a 2dimensional matrix of tiles. This property allows FPGA vendors to develop devices with some of the largest transistor counts in the semiconductor industry. Further, an entire  16  family of devices based on different numbers of tiles can be offered easily. Figure 7 is an illustration of a single F P G A tile.  The basic components of a tile are a logic block, a horizontal and vertical channel, horizontal and vertical connection blocks, and a switch block. The method and pattern by which wires of a tile are connected is called the detailed routing architecture. The horizontal channel is a collection of horizontal tracks that connect to the tiles immediately east and west. Each track is a segment of much longer tracks that run horizontally across the chip. The purpose of horizontal channels is to carry signals horizontally along the chip and to allow access to input/output pins on the north and south sides of the logic block. The analogous is true for vertical channels.  Interconnect Matrix k Inputs  BLE  k Inputs  BLE  k Inputs  BLE  I Inputs  N Outputs  Figure 6: A logic block consisting of multiple BLFs with an interconnect matrix.  The inputs and outputs of a logic block are called the logic block pins. As an example, Figure 7 has its inputs pins on the west side of the logic block and its output pins 17  on the east. This pin placement is not the most efficient pattern; a detailed discussion of pin placement is provided in Chapter 5.  m n n m r.n r.n r.n  Connection Block  BLE BLE  . Logic Block  -=o-i  BLE Switch Block. Routing Channel (W tracks)  Figure 7: An FPGA tile consists of a logic block, vertical & horizontal channels, a switch block and connection blocks.  Connection blocks facilitate connectivity between logic block pins and the routing channels. Each input pin can be programmed to connect to one or more of the tracks in a channel using either a multiplexer or multiple pass transistors (see Figure 3). Output pins, 18  on the other hand, are connected to tracks using tri-state buffers. The number of tracks that a pin can connect to is called its connection block flexibility, or Fc. The pattern of connections between all pins and all tracks is referred to as the connection block pattern. Rose and Brown [20] have speculated that certain connection patterns lead to improvements in density and routability.  Switch blocks reside at the intersections of horizontal and vertical routing channels. They provide programmable switches used to connect tracks from both the vertical and horizontal channels incident to the switch. The number of outgoing tracks that each ingoing track can connect to is called its switch block flexibility. The pattern that dictates which tracks are connected to each other is called the switch block pattern. Switch block patterns are an important part of the research in this thesis and are discussed in detail in Chapter 3.  The length of tracks in a routing channel refers to the number of tiles that a track spans. The distribution of track lengths in a channel is referred to as the segmentation distribution. Figure 8 shows a horizontal channel running through 4 tiles; the segmentation distribution includes tracks of length 1, 2 and 4. It is important to recognize that tracks not only terminate at switch blocks, but also pass through them. The use of short tracks is advantageous in that it results in greater routing flexibility. On the other hand, longer tracks are advantages in that they can carry signals a long distance without having to continuously pass through switches. Several studies have been made on segmentation distribution and can be found in [14, 21, 22, 23, 24, 25].  19  To be fully switch-block populated and connection-block populated means that all tracks of length greater than 1 are connected to all switch blocks and connection blocks that they pass through [14]. If tracks are not connected to all switch blocks they pass through then the architecture is switch-block depopulated; if tracks are not connected to all connection blocks they pass through then the architecture is called connection-block depopulated.  2.3 CAD for FPGAs Computer-aided design (CAD) is a very important aspect of F P G A technology. It allows users to convert a circuit description represented in a hardware description language (HDL) or as a schematic to a bit stream that can be uploaded to an F P G A for programming. Examples of C A D software for FPGAs are Xilinx Alliance and Foundation, Altera Quartus II and Max+Plus II, and Actel Libero. Due to the narrow logic resources provided in FPGAs, the C A D is much more complex than the architecture and has a very heavy influence on the density and performance of the programmed FPGA.  Switch Block  Length 2 Tracks  Length 4 Tracks Figure 8: A track segmentation including tracks of length 1, 2, & 4.  20  The FPGA C A D flow consists of several steps as illustrated in Figure 9. First, an FTDL description of a circuit is passed through a synthesis tool to convert the circuit into basic logic gates. The netlist is then optimized using techniques such as factoring and substitution. Technology mapping is then performed to map the circuit of basic logic gates to the target technology. In the case of a LUT-based FPGA, this results in a netlist of LUTs and flip-flops. After technology dependant optimization, LUTs and flip-flops are paired together to form BLEs. Using a packing/clustering tool, the BLEs are then packed together to form logic clusters or logic blocks. The netlist of logic blocks is then fed into a placement tool that assigns each logic block in the netlist to a physical location in the FPGA. This is performed using a heuristic that tries either to minimize the distance between all connected logic blocks, to minimize the estimated delay of the placed circuit, to reduce routing congestion, or some combination thereof. Routing is the final stage of the C A D flow and involves making all the required connections between logic cells. Routing algorithms are quite complex and are discussed in detail in the next section.  21  signal a; signal b;  ( H D L Description^)  (  Netlist of basic gates  )—  )  D  SET  O  >  Technology-independent logic optimization  D  a  P  LUT  Technology map to look-up tables(LUTS)  LUT^  >  Pack LUTs into logic blocks  i-  >  (  Netlist of logic blocks  ) Ji  >  • • • •• • •• •  r  Place & Route  -  111111 11111 Figure 9: Standard FPGA Flow  2.3.1 R o u t i n g A l g o r i t h m s Routing is a very important part of the C A D flow for several reasons [26]. First, most of an FPGAs tile area is devoted to routing fabric. Any improvement in the architecture or routing algorithms that leads to a reduction in the required routing resources has a huge effect on the die size of the FPGA. Second, the critical path delay of a circuit is directly affected by the length of wire and number of switches that the nets of a circuit use. It is the routing algorithm that dictates which tracks and switches are used for each net. 22  Thus, routing algorithms are critical in improving the density and critical path delay of the programmed FPGA.  Not unlike other stages in the C A D flow, routing algorithms are based on heuristics. Due the complexity of routing problems, an optimal solution is not possible within a reasonable amount of time. As a result, the problem is solved using a set of "educated guesses" which, with hope, will lead to a "near optimal" solution. The advantage of a heuristic algorithm is that they make solving the problem feasible. The disadvantage of a heuristic algorithm is that the result is generally not optimal. Most routing algorithms start by creating a routing-resource graph representation of the FPGAs routing resources. This is a directed graph in which each node represents a wire or logic block pin and each edge represents a potential connection. Once the routingresource graph has been built, the routing problem is reduced to finding all vertex-disjoint paths necessary to complete the circuit such that all delay constraints are met. What makes the problem difficult is that there are many detailed choices that together form a large solution space.  Most routing algorithms are based on a maze router [27] algorithm that essentially runs Dijkstra's shortest path algorithm [28] on each net in the circuit. Because the quality of the routing solution is so heavily dependent on the order by which they are routed, an iterative solution is generally used. At the end of each iteration, some or all of the nets are ripped up and rerouted [29]. Ideally, each iteration is better than the last by using insight gained from previous iterations.  23  Many of today's routers are based on a sophisticated version of the maze router called the Pathfinder congestion-delay algorithm [30]. Pathfinder is a congestion and delay base router that routes and reroutes using a timing analysis [31], but also reroutes with a gradually increasing emphasis on congestion avoidance. For each iteration, a timing analysis determines the slack for each net. Slack times are used to assign criticalities to 2  nets; more critical nets are routed first and with a greater emphasis on delay reduction. Congestion is dealt with in a very different manner. On the first iteration, Pathfinder allows multiple nets to be routed using the same routing resources; this is called congestion. A gradual increase in the penalty associated with congestion is added to the router's cost function. The router is finished when it is able to find a valid route for which no two nets share routing resources, and when it has reached either a predetermined number of iterations or a predetermined timing constraint.  Routers can be categorized into two major types, global routers [32,33] and detailed routers [21,34,35,36]. Global routers fully route each net from a high level of abstraction such that each net includes an assignment to all channel segments and pins necessary to connect the source block to the destination blocks. Global routers are used either to reduce the solutions space before using the detailed router or to act as a congestion estimator. In both cases, global routers should be as fast as possible. Given a set of channel assignments and pin assignments determined by the Global Router, the Detailed Router determines precisely which tracks and switches are used to route each net.  2  Slack is the amount of delay that can be added to a connection before it becomes the  critical path.  24  Instead of performing global routing and detailed routing one after another, it is also possible to perform both at the same time [4,30,37]. This approach improves the quality of result because the detailed router is not constrained by decisions made by the global router. Decisions made by the global router can often turn out to be counter-productive because they are made with little foresight. While assigning nets to channels, the global router typically tries to prevent congestion by limiting the number of nets per channel. In many cases, poor decisions are made because the global router is either too optimistic or not optimistic enough with its measure of congestion. This can lead to channels that are too congested for the detailed router too resolve. The disadvantage to a combined solution, on the other hand, is that memory and run-time requirements can be large because of the enormous solution space.  2.4 FPGAs in SoC Traditionally, board development involves first choosing off-the-shelf components from third-party vendors. Next, these would be laid out on the PCB, wiring would be laid out, and very often an FPGA would be used as a controller. This approach is very common for several reasons. It allows small to mid-sized companies to develop products using components that would they would normally not be able to develop on their own due to limited resources. Further, it subdivides complex systems and greatly reduces time-tomarket.  System-on-a-chip development is a scaled version of board design. It allows system development companies to combine third-party intellectual property onto a single chip. It is similar to board development in that it allows relatively quick and cheap system  25  development. It differs from board design in that chips are faster and denser yet more costly to develop and more difficult to test.  Combining fixed and programmable logic as a single chip solution is an attractive alternative to ASICs and stand-alone FPGAs. By combining both approaches, a system can have the speed and area advantages of an ASIC with the reprogrammability of an FPGA. With System-on-a-Chip (SoC) methodologies emerging, including an FPGA core as part of a SoC system could have many advantages.  2.4.1  Applications The following is a list of scenarios in which incorporating an embedded  programmable logic core (EPLC) within a fixed ASIC would be advantageous [5, 38]: 1. Some design details can be left until late in the design cycle. In a communications application, for example, the development of the chip can proceed while standards are being finalized. Once the standards are set, they can be incorporated in the programmable portion of the chip. This is important, since time-to-market is so critical in industry today. 2. As products are upgraded, or as standards change, it may be possible to incorporate these changes using the programmable part of the chip, without fabricating an entirely new device. 3. In many cases, it may be possible to fabricate a single chip for an entire family of devices; the characteristics that differentiate each member of the family can be implemented using the programmable logic. This would, in effect, amortize the cost of developing the ASIC over several products. In a similar way, standard products can be customized for different customers by implementing the parts of the circuit specific to that customer in the programmable logic portion of the chip. 26  4. Testing structures can be implemented using the programmable logic. Consider a chip with a faulty functional unit. To debug the functional unit (so changes can be made in later spins of the chip), it may be very valuable to construct a small circuit that can stimulate or analyze the faulty functional unit. Figure 10 shows a hypothetical chip with programmable test structures in place [38]. Two of the cores within this chip use scan chains, and a third has external buses which could be used for observability and controllability. Also included is an on-chip jittermeasurement circuit and circuitry used for setting up delay calibration. The programmable component of the chip is used to tie together all of the observability and controllability features together into a test circuit. There are several advantages to implementing test circuitry using a programmable core. First, valuable area is saved because a test circuit can be broken into parts and can be used to test the chip one after the other. Second, the nature of errors is difficult to predict when the chip is initially designed; as errors found through testing, new test circuitry can be implemented to better test for this error or to find other similar errors. 2.4.2 Implications o n the F P G A One of the most significant differences between EPLCs and stand-alone FPGAs is that EPLCs can take on a variety of shapes and sizes to better mesh with the fixed ASIC circuitry. One SoC design may need a relatively square programmable logic block, while another may need a long narrow programmable logic block (perhaps placed along an entire edge of the chip to provide a programmable interface to the I/O ports). Yet another design may need several small cores. The example in Figure 1 uses an L-shaped core. Each of these sizes and shapes makes unique demands on the detailed routing architecture.  Previous work by Wong [39] studied non-rectangular EPLCs of shape " L " , " U " , and "O". Wong evaluated how existing place and route algorithms could be modified to 27  work better with non-rectangular EPLCs. Further, Wong quantified how the density and timing of an EPLC is affected if it is non-rectangular; however, [39] did not address how the detailed routing architecture of an EPLC should be optimized to accommodate for its shape. In the following chapters, three important aspects of routing architecture are studied in the context of rectangular EPLCs. The motivation behind this work is to determine the topology of these routing structures as a function of the aspect ratio of the EPLC. The results of this work can serve as a set of guidelines for EPLC architectures. With a simple set of guidelines defined, developing an EPLC specific to a design's size and performance needs could be relatively quick and seamless. An SoC designer would not need to send design specifics to an IP vendor and wait for the results but instead could use a core-generating tool and would quickly generate a "near" optimal architecture. This would reduce development costs and time-to-market.  28  Figure 10: Hypothetical SoC chip with Test Controller implemented in an E P L C [38].  2.5 Summary This chapter provided background information regarding FPGA architecture, C A D , and programming technology. System-on-Chip methodologies (SoC) were introduced and a description was provided on how programmable logic can be used as an embedded core (EPLC). Next, several potential applications for EPLCs in SoC were described. It was then suggested that there could be benefits to optimizing the architecture of FPGAs implemented as EPLCs.  29  CHAPTER 3.1  3: R e c t a n g u l a r S w i t c h  Block  Motivation Routing architecture has a significant impact on the density and critical path delay  of the resulting, programmed, FPGA. Most of an FPGA's tile area is made of routing resources. For every additional transistor added to a tile's-routing resources, the effect on the FPGA's area is multiplied by the number of tiles on the chip. Conversely, the more routing resources available, the easier is to route designs that highly utilize logic resources. This leads to an increase in implementation density. The critical path delay of a circuit is directly affected by the length of wires and number of switches used to route each net. Delay is affected by wirelength quadratically; however, the effect of switches depends on whether they are pass transistors or buffers. A pass transistor has a smaller intrinsic delay than a buffer, but it does not terminate capacitance. As a result, a chain of pass transistors has a delay that grows with the number of pass transistors. Buffers have a larger intrinsic delay, yet they terminate capacitance and provide active drive strength. Consequently, pass transistors are more suitable for short nets whereas buffers are more suitable for long, high fanout, nets.  A core component of an FPGA's routing architecture is the switch block. It resides at the intersection of each horizontal channel and each vertical channel. Each switch block provides programmable connections between the tracks of the incident channels. The switch block is a significant portion of routing; therefore, optimizing the architecture of the switch block structure is very important.  30  Until now, most academic research has assumed that the horizontal and vertical channels of an FPGA contain an equal number of tracks; an equal number of tracks enter all sides of the switch block. As a result, a great deal of research has focused on square switch blocks [19,40,41,42,43,44,45,46,47]. As will be described in Chapter 4, however, rectangular EPLCs will benefit from having a different number of tracks in the horizontal and vertical channels. Such an architecture is said to have a directionally biased routing architecture. A switch block exists wherever horizontal and vertical channels cross; because each channel has a different capacity, the number of tracks incident to the top and bottom sides of the switch block differ from the right and left sides. Thus, there exists a need for rectangular switch blocks.  To investigate rectangular switch blocks, Section 3.2 will provide background information on previous switch blocks and how they differ. Section 3.3 will introduce a new switch block architecture. Section 3.4 will provide an experimental framework. Finally, experimental results will be presented and interpreted in Chapters 3.5.  3.2 Existing Switch Blocks Previously proposed switch block patterns include the Disjoint [9], Universal [40,42,43,44,45,46,47], Wilton [19], and the Imran [41]. A l l of these switches have two things in common. First, they are all square; an equal number of tracks enter each of the four sides of the block. Second, all maintain a switch block flexibility of F =3. Switch s  block flexibility is the number of tracks to which an incoming track can be connected to via a programmable switch. Rose and Brown [20] found that a switch block flexibility of Fs = 3 or 4 provides the most area-efficient architectures.  31  What characterizes a switch block is its switch-block pattern. This is the pattern of programmable connections that connects incoming tracks to outgoing tracks. Past switchblock patterns have been proposed on the basis that they are area efficient or that they provide better routing flexibility. To illustrate the differences between switch blocks, switch block models such as those in Figure 11 are used. In these diagrams, each dotted line represents a programmable switch and can be implemented using either pass transistors or tri-state buffers. The numbering system shown is a common way of identifying the sides and tracks of a switchblock. The Disjoint pattern [9] is unique in that it is "symmetric". For each incoming track 'x' incident to a side of the switch-block, programmable connections will exist to track 'x' of each of the three remaining sides. A consequence of using this switch block topology is that all routing in the FPGA will be divided into routing "domains". Any net that is routed using track 'y' in its starting channel can only be routed, via switch-blocks, to track V' of all other channels. I i W is the number of tracks per channel, the result of using a Disjoint pattern is that the routing fabric of the chip is divided into W routing domains all of which are disjoint [42]. Each net within the circuit is assigned to a routing domain and must remain within this routing domain for its entire path. As a result, there is a reduction in the number of routing choices available to each net thus making the routing solution space smaller. As a consequence, finding solutions becomes more difficult and it is less likely that there exists a solution that meets all timing requirements.  32  Universal switch blocks as proposed by Chang, Wong, and Wong [40] are an attempt to maximize the routing flexibility of a switch block using combinatorial counting techniques. They present a set of switch blocks which are guaranteed routable given that there is no more than 6W switches when F =3 and that all incoming nets can be arbitrarily s  ordered on each side. They claim that Universal switch blocks can accommodate the maximum number of simultaneous routing instances which is 25% higher than the Disjoint switch block. They show through experimentation that the total number of tracks required to route a design is smaller when using Universal switch blocks than the Disjoint switch block or with the other modules presented in [20].  Experimentation conducted by Fan, Liu, Wu, and Cheung [46] provided results that showed how well the Universal switch compares to the Disjoint switch block at the chip level. There results show that, on average, the Universal switch block improves routability by 6% over the Disjoint switch block.  One problem with the Universal switch block occurs when a net is routed through more than one switch block. According to [41], the switch block guarantees routability given that there are no more than 6W switches when F =3 and that all incoming nets can be s  arbitrarily ordered on each side. It is trivial to adhere to these conditions for the first switch block in a path. For all other switch blocks within this path, however, it is no longer possible to adhere to the arbitrary ordering condition because the first switch block will have already assigned outgoing nets to tracks.  33  Q  1  ?  3  4  <?  1  ?  ?  f  6  i  2  i  4  6  i  2  3  4  a) Disjoint  b) Universal  9  6  1  i  2  ^  3  4  3  4  c) Wilton  Figure 11: Switch block models for the Disjoint, Universal, and Wilton Switch Block Topology [48].  Another switch block solution is the Wilton switch-block [19], which is illustrated in Figure 11. Results show that by using the Wilton switch block, speed performance is comparable to that of the Disjoint and Universal switch-blocks [48]. On the other hand, the area efficiency results of the Wilton are better than that of Disjoint and Universal.  Although the Wilton switch block is more area efficient than the Disjoint switch block, this is only true for segmentation distributions with all track lengths equal to 1 [48]. For architectures with all tracks >= 4, the Disjoint switch block is more area efficient. When track lengths are greater than 1 then it is necessary for them to pass through a switch block instead of terminating. Under these circumstances, the Wilton switch block requires more programmable switches, and hence area, than the Disjoint. Figure 12 provides examples that illustrate the situation when wires terminate at the switch block and when they pass through the switch block. Both switch blocks require 6 switches for wires that terminate at the switch block. For wires that pass through the switch block, 1 switch is required for the Disjoint switch block and 4 for the Wilton switch block.  34  The need for a switch block the works well for segmented routing architectures led to the proposal of the Imran switch block [41]. This switch block, in effect, combines the best features of the Wilton and the Disjoint switch blocks. This is achieved by making a distinction between two categories of tracks in a segmented architecture; tracks that end at the switch block are connected using a Wilton pattern, while tracks that pass through are connected using a Disjoint pattern (see Figure 13). Imran [41] provides experimental results that compare the critical path delay and tile area of the Universal, Disjoint, Wilton, and Imran switch block patterns. These results show that the Imran switch block has the best area and delay results for all segmentation lengths.  35  a) Disjoint  b) Wilton  Wires terminate at switch block  k  4  A*,  a) Disjoint  b) Wilton  Wires pass through switch block Figure 12: Advantages/Disadvantages to Disjoint and Wilton Switch Block topologies for those wire that pass through the switch block versus those that terminate at the switch block [48].  3.3 New Switch Block 3.3.1  Baseline Rectangular Switch Block As discussed in Section 3.1, a directionally biased routing architecture necessarily  has rectangular switch blocks. A rectangular switch block has a more tracks incident to the switch blocks in one dimension than the other.  36  In the previous section, it was shown that the Imran switch block patterns results in improved area efficiency and reduced critical path delay. As a result, we use this switch block as a building block for our family of rectangular switch blocks.  The baseline rectangular switch block is a natural extension of a square Imran switch block. The Imran pattern is replicated the appropriate number of times to form a rectangular pattern of the desired dimensions. Each iteration of the replicated pattern is referred to as a subblock. Figure 14 is an example of a rectangular switch block resulting from the composition of several subblocks. With t tracks incident to the switch block from x  the x-channel and t tracks incident from the y-channel, we assume that t >t . As a result, y  y  there are N = ceil(t/t ) subblocks (called SSBn where 0<h<N ). The (N -l)  x  th  s  x  s  s  subblock will  be a partial subblock if the number of vertical channels is not evenly divisible by the number of horizontal channels.  The switch block can be represented by a graph M(T,S) where each node in T represents a terminal (incident track) of the switch block and each edge in S represents a programmable switch that connects two terminals. Tis partitioned into 2+2N subsets, s  each with W=t terminals; two subsets represent the tracks incident to the short sides of the x  switch block and 2N subsets represent the tracks incident to the long sides. Each terminal S  in Tis labeled t  m>n  where m is the subset number (0<m<2N+l) and n is the terminal number  within the subset (0<h<W-l).  37  Imran Figure 13: An Imran switch block is a combination of a Disjoint and a Wilton switch block.  *2,0  *2,1  *2,2  *2,W-1  *4,0  *4,1  *2Ns,0 *2Ns,X  ^4,2 *4,W-1  t  • *1,W-1  O.W-1 •  •  SSB1  SSBO  SSB(N-1)|-^  t« 1,2 t  t ,1 — I 0  r— -J—y  ^3,0  *3,1  *3,2  *3,W-1  *5,0  *5,1  *5,2 *5,W-1  t  1,1 1,0  a—L_  *2Ns+1,0 *2Ns+1,X  Figure 14: A baseline rectangular switch block is composed of an integer number of subblocks and a possibly a partial subblock.  38  Figure 15 shows an algorithmic description of the baseline switch block assuming that t > y  t . The algorithm starts with an empty set of switch connections. The remainder of the x  algorithm iteratively adds connections to the set until the entire switch block pattern has been defined. To do this, there are two levels of a "for" loop. The outer level is used to iterate through a subblocks while the inner loop iterates through all possible track indices. The " i f statement on line 5 determines whether or not the tracks end at the switch block. If they do, then the Wilton portion of the algorithm is used. Otherwise, the algorithm skips to line 27 for tracks that pass through the switch block and the Disjoint part of the algorithm is used. As expected, the Disjoint portion of the algorithm is only be used for segmented architectures. For a given subblock and a given track number the following occurs in the algorithm: lines 9-10 adds horizontal connections; lines 12-13 adds connections from the top of the subblock to the bottom of the subblock; lines 15-16 adds connections from the left side of the switch to the top of the subblock; lines 18-19 adds connections from top of the subblock to the right of the switch; lines 21-22 adds connections from the right side of the switch to the bottom of the subblock; and lies 24-25 adds connections from the bottom of the subblock to the left side of the switch. For the Disjoint pattern, line 31 is used to connect horizontal and vertical lines that pass through the switch-block and have the same track index.  The connection pattern for a partial subblock is the same as a full subblocks with the exception that a connection is not made if, due to it being a partial sub-block, one or more of its incident tracks does not exist. This exception is incorporated into the algorithm using the function, Exists(/ / y, t 2 2), where t i i and t 2, 2 represent incident tracks. m  jn  m  >n  m  39  A  m  n  1 S=0 2 forj = 0 t o N - l fork = 0 t o W - l 3 s  A  5 f.  if incident wires to, k, ti, k,  hj+2, k, hj+i, k  end at switch block  \j  // Wilton Switch Block Pattern 7 Q 0 // horizontal if (j ==0) 9 10 S = S v { (to, , tj.k) } 11 12 // vertical if (Exists(t 2,k, t j+3,k)) 13 S = S V { (t 2, ,t 3,k) } 14 15 // left to top if ( Exists(to, , t2j+2, (W-k) mod w) ) 16 S = S V { (to, , t j , (W-k) mod w) } 17 18 // top to right if ( Exists(t j k, ti, ( i) mod w)) 19 S = S V { (t j k, ti,(k l)modw) } 20 21 // right to bottom if ( ExiSts(ti k, t2j+3, (2W-2-k) mod w) ) 22 S = S V { (ti,k, t j+3, (2W-2-k) mod w) } 23 24 // left to bottom if ( ExistsO^j.^ k, to, (k+1) mod w)) 25 S = S V { (t j 3, k, to, (k+1) mod w) } 26 27 else if incident wires t , , ti, , t , , t , k pass through switch block 28 29 // Disjoint Switch Block Pattern 30 31 S = S V { (to,k,t ,k) } 32 33 end 34 end 35 end k  2j+  2  2j+  k  2j+  k  k  2  2  +2j  +2  k+  2  +2-  +  >  2  2  +  0  k  k  2j+2  k  2 j + 3  2j+3  Figure 15: Algorithm Description of the Baseline Switch Block pattern.  As an example of a baseline rectangular switch block, Figure 16 shows a switch block pattern assuming t =12 and t =4 and assuming all tracks end at this switch block. Note that y  x  the last subblock has fewer than W vertical inputs and is therefore a partial subblock.  40  3.3.2 Family of R e c t a n g u l a r S w i t c h B l o c k s As the aspect ratio of the baseline rectangular switch block increases, the density degrades. The higher the aspect ratios of switch block, the more subblocks it contains (i.e. N gets larger). As N gets larger, the number of switches connected to tracks incident to S  S  the long sides of the switch block grows. More explicitly, the switch block flexibility for tracks incident to the long sides of the switch block are fixed at F =3 while it is F =1+2N S  S  S  for tracks incident to the short side. Such a growth rate in F is harmful for both area and S  delay.  S S B O  S S B 1  S S B 2  S S B O  S S B 1  S S B 2  Figure 16: Example Baseline Switch with t =12 and t =4. y  x  The larger F , the larger the capacitive load. The amount of excessive load depends S  on whether the switch is a pass-transistor or a tri-state buffer. With pass-transistors, capacitive load will be added due to the drain capacitance of all unused switch transistors  41  plus the wire's "downstream" capacitance. With tri-state buffers, capacitive load will be added. Any increase in capacitance will affect delay linearly . 3  Larger values of F also affect tile density significantly because they require more s  layout area to implement the added switches. With pass transistors, every added switch requires 7 transistors; 1 is needed for the switch and 6 for the S R A M bit. A tri-state buffer implementation is even worse requiring 22 transistors; 10 for two opposing tri-state buffers and 12 for their S R A M bits. In both cases, the area penalty for each connection added to the switch block is large when multiplied by the number of switch blocks on the chip. This motivates us to develop a new switch pattern with fewer programmable connections for each incoming track. Rather than defining a single switch block to solve this problem, we propose a family of switch blocks, where each member of the family is obtained by depopulating the baseline switch block by a different amount. As the aspect ratio of the switch-block increases and F grows, more programmable connections will be s  removed.  Members of the family of switch blocks will be referred to as MODn where n is a measure of the level of depopulation. Explicitly, a MODn has 1/n of the connections remaining after depopulation. An example of this is provided in Figure 17. In this example, a MOD2 and MOD4 switch block are shown under the assumption that ty/tx=4. For this example, the tx connections used to connect one the side of a subblock to a side of  3  Delay is modeled as an exponential decay function, t = R C In (Vdd/Vout), such that R is  the discharge resistance, C is the capacitive load, Vout is the switching voltage, Vdd is the supply voltage.  42  the switch block have been combined and are shown as a solid line for clarity. It is intended that the true pattern of connections represented by the solid line would follow an Imran pattern. The switch block flexibility for the sides and 2 for the long sides, and for the  MOD4  MOD2  switch-block is 5 for the short  switch-block it is 3 for the short sides and  2 for the long sides.  MOD2  MOD4  Figure 17: MOD2 and MOD4 switch block patterns. Each line represents line represents t combined for clarity. x  An algorithmic description of the family of depopulated switch blocks is given in Figure 18. The algorithm is the same as that of the baseline switch block with the exception that the depopulation pattern has been incorporated so that connections can be  43  removed. The depopulation pattern is dictated by what we call sub-block coefficients which are identified by a 4-tuple, AMODII=( CG, CCI, (Xi, 0C3). The values of each tuple depends on n; for a MOD2 switch block, we have chosen A OD2=(1, M  block, the pattern is  AMOD4=(2, 1, 3, 0).  0, 1, 0),  while for a  MOD4  These patterns correspond to the examples  presented in Figure 19.  Intuitively, our choices for sub-block coefficients should maximize routing flexibility. Routing flexibility is maximized if there is no directional bias. For the average track incident to the switch block (track ends at switch block), there should be an equal opportunity to turn left versus right. The MOD2 and MOD4 switch block patterns adhere to this for both vertical and horizontal tracks. Other than the choice of continuing straight, all tracks incident to the left an right sides of the switch block have a choice of either turning left or right. For tracks incident to the top and bottom of the switch block, half are able to turn right and the other half are able to turn left. The sub-block coefficients were chosen in a way that allows nets to be routed using the shortest path possible when routing in diagonal directions. As an example, consider a net that must be routed from the left-bottom side of the EPLC to the top-right side of the EPLC. The shortest possible path will always consists of a set of zero or more alternating left and right turns. The left turns are necessary to go from horizontal tracks to vertical tracks and the right turns are necessary to go from vertical tracks back onto horizontal tracks. To allow for this type of a route, our switch-block patterns allows for left-to-top turns to coincide with bottom-to-right turns. In other words, the set of vertical tracks used for left-to-top turns is same as the set of tracks used to make bottom-to-right turns. For this example, this set of vertical tracks is defined by the set of all odd numbered sub-blocks. 44  1 S=0 2 forj = 0to N -1 fork = 0to W - l 3 4 5 if incident wires to, k, ti, k, s  t2j+2, k. t2j+3, k  f. O  7  end at switch block  // Wilton Switch Block Pattern  o o  9 i f ( j = 0) 10 S = S v { (to, , t ) } 11 12 if (Exists(t j , k, t j+3, k ) ) 13 S = S V { (t ,k, t ,k) } 14 15 if ( ( j mod n = a ) && Exists(tn, , t , (w-k) mod w)) 16 S = S V { (to, k, t2j+2, (W-k) mod w) } 17 18 if ( ( j mod n = a , ) && Exists(t , , ti, i) mod w)) 19 S = S V { (t2j+2, k, ti,(k+l)modw) } 20 21 if ( ( j mod n = a ) && Exists^, , t , -2-k) mod w)) 22 S = S V { (ti,k, t2j+3, (2W-2-k) mod w) } 23 24 if ( ( j mod n = a ) && Exists(t , k, to, (k+i) mod w)) 25 S = S V { (t2j+3,k, to,(k+l)modw) } 26 27 else if incident wires to, k, ti, k, t j , k, t2j+3, k pass through switch block 28 // Disjoint Switch Block Pattern 29 30 31 S = S V { (t ,k,t 3,k) } 32 33 end 34 end 35 end k  2  +2  lik  2  2j+2  2 j + 3  0  k  2j+2  2j+2 k  2  k  3  2j+3  2  0  (k+  2j+3  (2W  +2  2j+  Figure 18: Algorithmic Description of the MODn switch block patterns.  45  3.4 Experimental Framework 3.4.1  Experimental Methodology To investigate the suitability of our new switch block, an experimental approach is  used. As part of this approach, benchmark circuits must be mapped to the FPGA to help analyze the effectiveness of the proposed architecture. For each circuit that is mapped, an area and timing model are applied to provide both area and critical path delay estimates. Throughout the reminder of this thesis, these two figures of merit will be used to evaluate all proposed architectures. It is not good enough to simply use one benchmark circuit to conduct architectural studies. The goal of F P G A research is to find an architecture that maximizes circuit density and minimizes critical path delay for the "average" circuit used in real world designs. To achieve good results for the "average" real world circuit, we used 20 benchmark circuits from the Microelectronics Corporation of North Carolina (MCNC) [49]. Because we were primarily interested in circuits that would fit into an embedded programmable logic core, we chose circuits that were approximately 10% of the size of circuits that could be implemented in commercially available stand-alone FPGAs. Half of the circuits were sequential and half were combinational. The C A D flow used for this thesis was a variant of that discussed in Chapter 2 and is illustrated in Figure 19. Each benchmark circuit was first optimized using SIS [50]. Next, Flowmap [51] was used to technology map each circuit to 4-LUTs and registers. Technology dependent optimization was then performed using Flowpack[51]. T-Vpack [52] was then used to pack the desired number of BLEs (4-LUTs + register) into logic 46  blocks (logic clusters). Versatile Place and Route (VPR) was then used to perform timing driven placement and routing [4]. Placement in V P R is based on a heuristic algorithm called simulated annealing [53]. Routing is based on a modified version of Pathfinder [30].  To determine how well optimized a particular architectures is with respect to density, routing was performed repetitively with various channel capacities. That is, all aspects of the E P L C s detailed architecture were held fixed while the minimum channel capacity required for 100% routability of the current benchmark circuit was found using a binary search methodology. By doing so, the amount of logic per a tile was fixed while the area required to implement each tile varied. Because density is a measure of the amount of logic per unit area, finding the minimum channel capacity for 100% routability is equivalent to finding the highest EPLC density achievable for a given architecture. Higher densities are favorable because they offer higher logic capacities for a cheaper die cost.  Our density performance metric, which is reported as tile area, is then calculated based on the sum of the areas used by the logic block, the switch block, and a vertical and horizontal segment of the minimum width channel. Commercial FPGAs are normally designed with enough routing so that "average" circuits have some spare routing available [48]. To model this, we perform a "low-stress" route by rerouting all benchmark circuits after the minimum with channel capacity has been found, but with 20% larger channel capacities.  47  Circuit  Logic Optimization (SIS) Technology Map (Flowmap)  Pack FFs andLUTs into logic blocks (VPack)  Place & Route (VPR)  No  Adjust Channel Width  Record Tile Area/ Critical Delay Figure 19: Standard VPR flow [4].  The I/O placement of an EPLC would likely be strongly constrained in an SoC application. There is either a set of predetermined signals from other cores within the chip or the E P L C s IOs must be connected to fixed signal selection hardware. To simulate this, all experiments were carried out with a randomly selected fixed 10 placement which was not optimized during layout.  VPR was developed with a typical "island-style" FPGA architecture in mind. Much of the architectural research presented in this thesis required modifications to the  48  placement tool, routing tool, and graphics features of VPR to allow for new architectures. The details of this work have not been included in this thesis.  3.4.2 E x p e r i m e n t a l P h i l o s o p h y A single architectural experiment requires a great deal of computer processing time. After placement, a circuit must then be routed as many times as it takes to complete a binary search for the minimum number of channels required to successfully route. This procedure must be repeated for all twenty benchmark circuits. There are many aspects to the detailed architecture of an EPLC. Finding the best combination of these aspects would require a "brute-force" method of trying every possible combination of an N-dimensional problem space. Due to the time-consuming nature of performing place and route over twenty benchmark circuits, this approach is not feasible. Instead, we make "intelligent" guesses as to which aspects we can hold fixed while varying others. Once we have found a "good" value for one, we can hold it fixed and begin varying another. This most certainly will not give us the most optimal solution but should guide towards a reasonably good solution.  3.4.3 Architectural F r a m e w o r k We restrict our attention to a rectangular island-style EPLC. Each E P L C consists of a matrix of n *n logic blocks. Betz and Rose showed that logic blocks with N=4 BLEs x  y  result in FPGAs with better area-efficiency [54]. Further, they showed that logic blocks should then have I = 2N+2 = 10 distinct inputs. Consequently, N = 4 and I = 10 are held fixed for all experiments presented in this thesis. Each logic block contains one clock input and one output for each for each B L E . Each B L E consists of one 4-LUT and a register. 49  Logic Blocks are surrounded by a grid of routing channels; each vertical channel contains t parallel tracks, and each horizontal channel contains y  parallel tracks. Further  we assume that t <t . y  x  We assume a segmented architecture in which every track spans 4 logic clusters. A l l channels are 100% connection block and switch block populated. At the intersection of each horizontal and vertical channel is a switch block that comprises of 50% passtransistors and 50% tri-state buffers. This was found to provide good results in conjunction with a segmentation distribution of 100% length 4 tracks [55].  The "starting points" for tracks of the same channel are staggered relative to each other. This enhances routability, since each logic cluster in the F P G A can then reach a logic cluster 3 units away in either direction using only one wire [56]. Channels are also staggered relative to each other. By doing so, all vertical and horizontal channels line up so that the ends of horizontal and vertical segments can be connected as dictated by the switch block pattern.  Programmable connections are made from routing tracks to logic block input pins using multiplexers. Connections between logic block output pins and routing tracks are achieved using pass transistors.  3.4.4  A r e a & Delay M o d e l The area model used by VPR is a count of the total number of minimum width  transistors required to implement a tile. This provides an accurate, technology independent, estimate of area without having to layout the FPGA. Clearly, this method of  50  estimating area is more accurate than using the number of tracks per channel as used in previous research [40,57].  Fan-out buffer sharing is assumed for all fan-outs within switch blocks and connection blocks. In other words, a single buffer is used to drive an entire fan-out rather than having a single buffer for each fan-out load [see Figure 20]. This decision makes a significant impact on the area model results of VPR and is used in this thesis because it is more realistic [48]. The delay model was created by first building SPICE simulation models for all transistor level components including multiplexers, buffers, look-up tables, and latches [48]. Next, the connection delay between logic blocks was calculated using simplified models of pass transistors, buffers, and metal capacitances and resistances to build an R C tree [48]. All models were built using timing information gained from running SPICE on a 0.18 urn CMOS process from T S M C [58]. Finally, an Elmore delay [17] approximation was used to find source to sink delays and then a path-based timing analysis was performed on the RC-tree to determine the critical path delay of the programmed FPGA.  51  Logic Block Logic Block  Logic Block  H Routing Wire  t> Logic Block  V  SRAM  _r~L  Routing Wire  SRAM  fSRAMl  Logic Block  SRAM  Routing Wire  Routing Wire  Routing Wire  Figure 20: Fan-out implemented using buffer sharing for switch blocks and connection blocks.  3.5 Experimental Results VPR was modified to allow for rectangular switch blocks. This required modifications to the routing resource graph generation for switch blocks, connection blocks, and for 170 connections blocks (see Appendix A for V P R "screenshots"). Figure 21 shows the suitability of the MOD2 and MOD4 switch blocks as compared to the baseline. The vertical axis for this plot is the geometric average of tile area over 20 M C N C benchmark circuits. The horizontal axis represents the aspect ratio of the switch block in terms of the number of incident vertical tracks divided by the number of incident horizontal tracks. Consequently, the left side of the horizontal axis represents a square switch block. Moving right along the horizontal axis represents rectangular switch blocks with progressively higher aspect ratios. Three different curves are plotted in Figure 21 representing the MOD2, MOD4, and baseline switch blocks. A greater number of points have been generated for low switch block aspect ratios because trends that are more interesting exist in this region. This plot represents results for a core aspect ratio of 4:1.  52  Figure 22 is similar to the Figure 21 with exception of the vertical axis which represents the geometric average of the critical path delay. Figure 21 provides a very important result; different members of the rectangular switch block pattern are better suited for different switch block aspect ratios. For cores with a switch block aspect ratio ranging from a square shape to an aspect ratio of 1.5:1 results, the densest solution uses the MOD2 pattern. For all aspect ratios higher than 1.5:1, MOD4 is the best choice. The baseline switch block is never the best choice. Although not shown, similar results were found for all other core aspect ratios (2:1, 6:1, 8:1, and 10:1).  18000  6000 -! 1  1 1.5  1 2  1 2.5  1 3  1 3.5  1 4  Switch Block Aspect Ratio  Figure 21: Tile Area vs. Switch Block Aspect Ratio for Core Aspect Ratio = 4 for three different Switch Blocks.  53  1.70E-08  1.65E-08 H  1.45E-08 •  1.40E-08 -I 1  1 1.5  1 2  1 2.5  , 3  1 3.5  4  Switch Block Aspect Ratio  Figure 22: Critical Path Delay vs. Switch Block Aspect Ratio for Core Aspect Ratio = 4 for three different Switch Blocks.  A very interesting behavior exists for the MOD4 curve in that it "spikes" up near the lower switch block aspect ratios (i.e. less than 1.4:1). This result is likely due to the extreme inflexibility caused by the MOD4 having so few programmable switches at low aspect ratios. In fact, VPR is not able to route all nets when using a square MOD4 switch block. The only turns available to tracks incident to the switch block is from the left side to the bottom side and vice versa. Routing is made difficult because of this bias along with fact that 3/4 of the tracks (which follow the Wilton pattern) have been removed.  The critical path delay results of Figure 22 are inconclusive. There are no noticeable patterns in this plot and all fluctuations that do exist are small enough to be considered as "noise". Effectively, the critical path delay is not influenced by the switch block aspect ratio or the switch block type.  54  3.6 Summary In this chapter, we introduced a family of rectangular switch blocks. A l l members are derived from the baseline switch block which is an extension of the Imran square switch block. To save area and reduce capacitive load, each member of the family has some proportion of its switches removed. Each family member is named MODn where l/n is the proportion of its switches that remain after depopulation. To help describe rectangular switch blocks, the concept of a subblock and partial subblock was introduced.  An experimental framework was defined and experiments were conducted for the baseline, MOD2, and MOD4 switch blocks. For different ranges of aspect ratio, experiments showed which switch block resulted in the densest EPLC. For aspect ratios less than 1.5:1, the MOD2 was found the be the best choice whereas the MOD4 was found the be the best choice for aspect ratios higher than 1.5:1. These results were consistent for all EPLC aspect ratios tested. Further, results showed that the critical path delay was not significantly influenced by the switch block aspect ratio or the switch block topology.  55  C H A P T E R 4: D i r e c t i o n a l l y 4.1  Biased  Channels  Motivation Intuitively, the routing architecture of an EPLC will be heavily influenced by  changes in the EPLC shape. One might expect that a rectangular core would have different routing capacity requirements in the long dimension versus the short dimension. The name for such an arrangement of routing resources is called directionally biased channels. Most placement tools make use of a wire length cost function which tends to pull together all instances in the netlist such that the average distance between any two is minimized. For netlists that with low logic and IO utilization, the shape of the placement will be very compact as shown in the left diagram in Figure 23. For netlists with high logic and IO utilization, the placement tends to take on the shape of the core by which it is constrained. An example of this case is shown in the right diagram in Figure 23.  With placement elongated, the average distant between placed instances will be further in one direction than the other. As a result, one would expect it to be beneficial if more routing resources were made available in this direction. This is the focus of this Chapter.  The experimental framework and philosophy used in this chapter is the same as in Chapter 4. Also adopted from this chapter are the architectural framework and area and delay models.  56  Low Logic Utilization  High Logic Utilization  Figure 23: Two different designs implemented on a rectangular FPGA. One has high logic utilization and the other has low.  4.2 Previous Research A first study of directionally biased channels was made by Betz and Rose [57]. Betz and Rose approached the problem without EPLCs in mind but instead with the intent of improving stand-alone FPGAs. They stated that because some commercial parts have biased [8] routing channels and other have unbiased [9] routing channels, they felt that an area efficiency investigation was necessary. In other words, they wanted to determine whether or not there was "an intrinsic property of circuits that makes a directional bias more area efficient." If so, they were interested in determining what that bias was.  Betz and Rose experimented with directionally biased FPGAs by using V P R to perform placement and then global routing. With global routing, the router only determines the pins and channels to which each net is mapped. Details of the segmentation, switch block topology, and connection block topology were not considered.  The area-efficiency metric used by Betz and Rose is based on the average number of track segments per tile. For an N by N F P G A with W tracks per channel, the total number of track segments is 2 W N and the number of track segments per tile is 2W. 2  57  The most relevant conclusions from Betz and Rose are the following:  1. For an unbiased global routing architecture, square FPGAs are more area efficient than rectangular FPGAs. Explicitly, they found that an F P G A with an aspect ratio of 3:1 requires 18% more tracks per channel than a square FPGA. 2. When comparing an F P G A with an aspect ratio of 3:1 to a square FPGA, the difference is only 4% when the best horizontal/vertical routing capacity ratio is used in both cases. They concluded from this that rectangular FPGAs do not result in significant area inefficiencies when the horizontal/vertical routing capacity ratio is properly adjusted.  4.3 New Segmentation Architecture The segmentation architecture is composed of two aspects. First, it defines what proportion of track lengths are in each routing channel. For example, one segmentation architecture might consist of 100% length 4 tracks whereas another might consist of 50% length 4 tracks and 50% length 1 tracks. Second, the segmentation architecture defines the relative placement of tracks within a channel and the "starting-point" of channels relative to each other.  Betz and Rose [56] define a segmentation architecture suitable for the most common square switch block patterns. This segmentation architecture is suitable for the Universal, Disjoint, Wilton, and Imran switch blocks for the most common combinations of wire lengths and with any level of switch block population. With the segmentation architecture so heavily dependant on the switch block, there is no published definition for a  58  segmentation architecture in the context of a directionally-biased routing channels or in the context of a rectangular switch block.  Before discussing segmentation architectures any further it is best to first introduce the F P G A coordinate system used by Betz and Rose. In this coordinate system, the lower left corner of the FPGA array has coordinates (1,1). The logic block to immediately to its right is (2,1) and the block immediately above it would is (1,2). Horizontal channels have the same y-coordinate as the logic block below it and vertical channels have the same xcoordinate as the logic block to its left. Tracks are numbered from left to right for vertical channels and from bottom to top for horizontal channels. As an example, the F P G A illustrated in Figure 24 is labeled with this coordinate system. In this example, all wires are of equal length 3. The "starting point" of a wire represents the location of the leftmost coordinate in the horizontal channel and the bottommost coordinate in the vertical channel. The "starting points" are staggered relative to each other within a channel. By doing so, routability is improved because any logic block in the FPGA can access any other logic blocks up to 3 tiles away using only one wire segment. As described by Betz and Rose, the most straightforward way to layout routing resources is to create one horizontal channel and one vertical channel and to replicate them along the array. Switch block connections following some pattern are then used to connect horizontal and vertical wires. Betz and Rose describe how this pattern is not sufficient because it results in a conflict between horizontal constraints and vertical constraints. As  59  illustrated in Figure 24, the horizontal and vertical tracks of most channels do not start and "end" in a way that is compatible with the switch block they are using [56]. To rectify this problem, Betz and Rose proposed a solution where instead of replicating each channel as is, the "starting points" of the segments of each channel is "staggered". This solution not only solves the constraints problem but also amounts to a more "easily" laid-out FPGA. Figure 25 illustrates an example implementation using their solution. Vertical Channel 0  Vertical Channel 2  Vertical Channel 1  Vertical Channel 3  O I 3" o tu z! =!  =. N O  2.  3  O I IT CD 3 3  CD  O =!. N O  3. ED  O I IT o 03 D 3  =!. N O  O  »  0 12 3 4 5  Conflicts Figure 24: This diagram identifies two locations where a switch should exist in accordance to the Disjoint switch block pattern. In both cases, a conflict exists between horizontal and vertical constraints.  60  Although Betz and Roses' solution was intended for a segmentation architecture with wires of length 3, an internal switch block population of 50%, and a Disjoint switch block topology, their segmentation architecture is a good overall solution. In fact, it solves the horizontal and vertical constraints conflict for an F P G A with any segmentation distribution and switch block topology as long as the architecture is fully internal switch block populated. Vertical Channel 1  Vertical Channel 0  (0,3)  (1,3)  Vertical Channel 2  (2,3)  Vertical Channel 3  (3,3)  (4,3) O X =r o 13  N  zz o  =3 r o SL  CD  (0,2)  (1.2)  (2,2)  (3,2)  (4,2) O I zr o SO 3 3  =. N O  ro u 92.  (0,1)  (1.1)  (2,1)  (3,1)  (4,1) O  I  =T Q> 3 =J  O 3. N O  ro  o (0,0)  (1,0)  (2,0)  (3,0)  3  9L  (4,0)  0 12 3 4 5  Figure 25: "Staggered" segmentation architecture as defined by Betz and Rose [56] to solve horizontal and vertical constraints problem. Channel, track, and logic block labeling schemes have been included.  Betz and Rose provide a solution for dealing with directionally unbiased architectures with square switch block patterns. This solution works well although it does  61  not address directionally biased routing architectures. Further, there is no other published work that addresses such a case.  Figure 26 is an algorithm that describes how wires can be laid out across a rectangular EPLC. Nht represents the number of tiles in the horizontal direction, N  vt  represents the number of tiles in the vertical direction, W is the set of all wires, Wjj is the set of all wires that originate at tile (i,j), C represent the channel capacity, Lw is the wire length (assuming all wires are of equal length), and w^j^u) is a unique wire that originates in the k* track of the tile (i,j). a is a repetition constant that represents the number of times the segmentation pattern is replicated in a channel.  Although there is a limited description of how tracks are laid out in [56], a careful look at VPR [4] can be used to determine the policy used by Betz and Rose. A straightforward extension of Betz and Rose's solution would have a = 1. The problem with this solution is that the vertical and horizontal constraints will be violated for the family of rectangular switch blocks described in Chapter 3. To fix this, the "staggered" pattern should be replicated once for each subblock in the switch block. This way, each subblock (and partial subblock) will be connected using a "staggered" pattern and will therefore solve the conflict between vertical and horizontal constraints. Figure 27 is a simple example of a segmentation architecture making use of this pattern.  62  I W= 0 2 3 // iterate over all vertical & horizontal tiles 4 fori = 0 t o N - l . 5 forj = 0 t o N - 1 6 7 // get all tracks of length Lt which "start" in the 8 // horizontal channel of tile (i,j) 9 Wy = 0 10 for k = 0 to C - 1 II if ( (j + i + 2 ) mod Lw == floor ( k * U, * a / C ) ) h t  v t  12  Wjj  = Wy  V W ij (  j k  , Lw)  13 end 14 end 15 16 W = W v Wij 17 end 18 end Figure 26: Simple extension of Betz and Roses' segmentation architecture for a directionally-biased routing architecture. This algorithm describes horizontal tracks only.  4.4 New Experimental Approach The research of Betz and Rose is not adequate for studying directionally biased channel because the detailed routing architecture is ignored. Without timing information after layout, it is difficult to make conclusions about architecture without being certain that critical path delay is not being adversely affected.  Betz and Rose use the number of tracks per channel as an area metric. The problem with this is that it ignores the heavy influence of switch block and connection block topology on the resulting area efficiency. As an example, a Wilton switch block will use more area than a Disjoint switch block for segmented architectures with the same channel capacity. In a similar fashion, the area of the connection block is directly proportional to  63  the channel width. Using a detailed router and detailed architectural model are a necessary to experiment with confidence.  4.5 Experimental Results VPR was modified to incorporate directionally biased architectures at the detailed level. Modifications were required in routing resource graph generation for switch blocks, connection blocks, and for I/O connections blocks (see Appendix B & Appendix C for VPR "screenshots"). Directionally-biased routing experiments were run while adhering to the conclusions made in Chapter 4. That is, the most appropriate switch block topology (Baseline, MOD2, or MOD4) was used for every given track ratio (ty/t ). Results were x  generated for five different core aspect ratios, 2:1,4:1, 6:1, 8:1, and 10:1 and are shown in Figure 28. The horizontal axis represents tile area and the vertical axis represents the track ratio (ty/t ). Figure 29 is the same as Figure 28 with the exception that the vertical axis x  represents the critical path delay.  64  Vertical C h a n n e l 0  012345  Vertical C h a n n e l  1  Vertical C h a n n e l 2  Vertical C h a n n e l  3  "Staggered" pattern replicated for each of the two subblocks  Figure 27: New segmentation pattern allows for rectangular switch blocks that adhere to the horizontal and vertical constraints issue.  65  13000  6000 -I 1  1 1.5  1  ,  2  2.5  1 3  1 3.5  I 4  Track Ratio  Figure 28: TileArea vs. Track Ratio (ty/tx) for various core aspect ratios. For each point on the graph, the best switch block topology is used for that track ratio.  There are several important observations that can be made from Figure 28. First, density degrades as the aspect ratio of the EPLC increases. This is expected because a higher aspect ratio of the EPLC acts as an additional constraint on layout which results in a reduction in placement quality. Due to the shape of the E P L C , the average horizontal distance between any two neighboring logic blocks is further than the average vertical distance between them. This makes if more difficult for the layout tool to tightly cluster the loads of a net around their driver. As a result, it is less likely that tightly coupled portions of logic will be packed together within a small area.  66  1.80E-08 1.75E-08  Track Ratio  Figure 29: Critical Path Delay vs. Track Ratio (ty/tx) for various core aspect ratios. For each point on the graph, the best switch block topology is used for that track ratio  Another important observation can be made from Figure 28. For all core aspect ratios, there is a steep negative slope for small track ratios. As the track ratio increases, a minimum is reached before the curve continues with a slight positive slope. This behavior appears to be the result of two competing trends.  The first trend is that the number of transistors in each switch block increases as the track ratio increases. As the track ratio increases, the number of tracks incident to the long side of the switch block grows relative to the short side. As the aspect ratio of the switch block grows, the number of sub-blocks increases thus requiring more programmable switches. For each additional switch added to the switch block, the area of the EPLC can increase by up to 22 transistors.  On the other hand, the channel capacity increases in the long direction of the E P L C , thus increasing its overall routing flexibility. This increase in flexibility leads to a  67  reduction in the overall number of tracks required in the EPLC and consequently a reduction in its overall area (or equivalently, an increase in the achievable density). This trend is responsible for the strong negative slope at lower track ratios.  The two competing trends in Figure 28 cause a distinct minimum. This minimum for each curve represents the "best" track ratio for that EPLC aspect ratio. The "best" track ratio is 1.3:1, 1.4:1, 1.5:1, 1.5:1, and 1.5:1 for EPLC aspect ratios 2:1,4:1,6:1, 8:1, and 10:1, respectively. The higher the aspect ratio of the EPLC, the higher the "best" track ratio. This indicates that the need to have more routing resources running down the long direction of the core becomes more important as its aspect ratio increases. From the curve in Figure 29, it is evident that the average critical path delay is not significantly affected by the track ratio. This was true for all aspect ratios tested. By comparing curves, it is evident that EPLCs with higher aspect ratios result in higher critical path delays.  Figures 30 and 31 compare the best results of Figure 28 with those obtained assuming ty/t -l. x  In other words, the best track ratio and switch block are used for a given  core aspect ratio which is represented by the horizontal axes. The left side of the graph represents a "square" EPLC and as we move along the horizontal axis, the aspect ratio increases. The dashed line represents an architecture that maintains a track ratio of 1:1 and a baseline switch block for all aspect ratios. The solid line represents an architecture that uses the best switch block and track ratio for a given aspect ratio.  We can conclude from Figure 30 that the density of an E P L C that uses rectangular switch blocks with directionally biased channel capacities is better, in terms of area 68  efficiency, than an architecture that uses square switch blocks with a track ratio of 1:1. In fact, the density improvements for an EPLC with aspect ratios 2:1, 4:1,6:1, 8:1, and 10:1 are 4.8%, 6.4%, 10.1%, 11.0% and 11.2%, respectively.  1  12000 -i  Figure 30: Tile Area vs. core aspect ratio. The solid line uses the best switch block and track ratio for each point. The dashed line uses the Baseline switch block and track ratio=l.  The results from Figure 31 show that the critical path delay performance results are opposite to that of density. For all aspect ratios tested, with the exception of one, the use of rectangular switch blocks and biased routing slows down the resulting EPLC. Critical path delay degradations for EPLCs with aspect ratios 2:1,4:1, 6:1, 8:1, and 10:1 is 4.1%, -1.1%, 4.2%, 1.6%, and 2.4%, respectively.  69  5  6  Aspect Ratio  -b e s t s w i t c h b l o c k / b e s t t r a c kr a t i o -- b a s e l i n e s w i t c hb l o c k / t r a c kr a t i o =1  Figure 31: Critical Path Delay vs. core aspect ratio. The solid line uses the best switch block and track ratio for each point. The dashed line uses the Baseline switch block and track ratio=l.  Figures 30 and 31 show that a square EPLC is always preferable in terms of both area and critical path delay. In fact, the higher that aspect ratio, the worse things get. This does not mean that rectangular EPLCs are a bad idea; in many applications, the fixed shapes and sizes of the other cores will dictate that a rectangular EPLC is to be used. For small aspect ratios, the area density penalty for using a rectangular EPLC is small; for an aspect ratio of 2:1, there is a 1.6% penalty. As the aspect ratio increases, however, the penalty increases. For an aspect ratio of 4:1, 6:1, 8:1, and 10:1, the penalty is 15.6%, 27.1%, 37.9%, and 47.7%, respectively. In terms of critical path delay, the penalty of using an aspect ratio of 2:1,4:1, 6:1, 8:1, and 10:1 is 3.8%, 3.54%, 12.36%, 13.9%, and 15.8%, respectively.  70  4.6 Summary In this chapter, we provide experimental results on directionally biased routing architectures. For various EPLC aspect ratios, we determined the impact of increasing the ratio of tracks in the x-direction versus the y-direction. Because we have a family of rectangular switch blocks defined in Chapter 3, we were able to perform experiments at the detailed level. Our modified version of VPR was able to considered all aspects of routing including pin placement, switch block topology, connection block topology, and segmentation architecture. To prevent vertical and horizontal constraints conflicts, a segmentation architecture was defined that works well with rectangular switch blocks.  We quantified improvements in density and speed when using various channel widths in the x and y direction (track ratios) for a given EPLC aspect ratio. For a given EPLC aspect ratio, we found that there exists a "best" track ratio where density is maximized. We found that this "best" track ratio starts at 1.3:1 for an EPLC aspect ratio of 2:1 and increases with higher aspect ratios. For an E P L C aspect ratio of 10:1, the "best" track ratio is 1.5:1.  We compared the benefits of using an optimized architecture that uses the best track ratio and best switch block topology for a given aspect ratio. For various aspect ratios, the optimized architecture was compared with an architecture that uses the baseline switch block and a track ratio of 1:1. Results showed that for an aspect ratio of 2:1, the optimized architecture improved area by 4.8%. Further, area improvements grew with higher aspect ratios; for an aspect ratio of 10:1, the improvement was 11.2%. There was, however, a small degradation in critical path delay. We conclude from these results that the optimized  71  architecture is best suited for applications where density (cost) is more important than speed.  We showed that no matter what the switch block topology and track ratio used, area and speed efficiency decline with greater EPLC aspect ratios. In other words, the most area efficient architecture is always square and it progressively becomes worse with higher EPLC aspect ratios. Compared to a square EPLC, an EPLC with an aspect ratio of 2:1 has an area penalty of 1.6% and a critical path delay penalty of 3.8%. For an aspect ratio of 10:1, this grows to 46.5% and 17.2% for density and critical path delay, respectively. Although there is a penalty for using rectangular EPLCs, this does not mean that they are a bad idea; in many applications, the fixed shapes and sizes of the other cores will dictate that a rectangular EPLC is to be used.  72  CHAPTER 5: Pin Distribution 5.1 Motivation Chapter 4 was concerned with the influence of a directionally biased routing channels on the density of an EPLC. For rectangular EPLCs, experimentation showed that density could be improved by increasing the capacity of routing channels in the long direction of the core. Given this, the next architectural aspect of interest is pin placement. For a given rectangular EPLC the switch block and track ratio optimized, how should pins be distributed around the logic block? Given that the number of tracks on the horizontal side of the logic block differs from the vertical side, it is natural to expect that new demands will be placed on logic block pin placement.  5.2 Existing Pin Distributions Betz and Rose [54] studied two different pin distributions within the context of a directionally biased routing structure; they called one a "full-perimeter" positioning with pins distributed evenly on all four sides and the other a "top/bottom" pin distribution with pins evenly distributed on the two horizontal sides. Each of these corresponds to approaches used by Xilinx[9] and Actel[8], respectively.  Experimentation by Betz and Rose was similar to that explained in Section 4.2. V P R was used to perform placement and global routing for 26 M C N C benchmark circuits. The results were averaged and the quality of results were measured using a rough estimate of area based on the number of track segments per tile needed to successfully perform a global route.  73  Betz and Rose arrived at several important conclusions. First, a full-perimeter pin distribution leads to the most area-efficient FPGA. In fact, the best full-perimeter pin solution is 8% better in terms of area efficiency than the best "top/bottom" solution. The full-perimeter architecture is more area-efficient because there is a greater chance that the block input pins are closer to their desired connections than for the "top/bottom" architecture. This allows for the most compact arrangement of the loads of a net around their driver. Second, Betz and Rose found that the best full-perimeter architecture has no directional bias. This is because an architecture with no bias makes the difficulty of routing to each of a logic block's neighbors roughly equal.  Third, they found that the most area-efficient solution for a "top/bottom" architecture has a directional bias with a 2 : 1 track ratio. In other words, twice as many tracks should reside on the "top" and "bottom" sides of the logic than the "right" and "left" sides. Every connection to a logic block must be connected to a horizontal track so there is extra pressure for more horizontal routing resources.  5.3 Various Pin Distributions The experimental approach used by Betz and Rose has several problems. First, they only use a global router thus ignoring the important details of segmentation, switch block topology, and connection block topology. Second, area measurements were inaccurate since they were estimated using the number of track segments per tile instead of transistor count.  74  We experimented with pin placement using the same experimental methodology, experimental philosophy, architectural framework, and area and delay models as used in Chapters 3 and 4. This provides us with a more accurate assessment of the effects of various pin placements. The results from Chapters 4 serve as a foundation for this chapter; for all track ratios, the most suitable switch block topology is used. To gain further insight into the problem, we consider more than the two pin distributions studied by Betz and Rose. In total, we define 6 pin distributions ranging from an "even" placement (type "0") in which input and output pins are evenly distributed around the logic block to a "biased" placement in which all pins are placed on the top and bottom of the block (type 6). For completeness, we also study an architecture with a slight "reverse" bias distribution (type " - 1 " ) in which there are more pins accessible from the narrow vertical routing channels than the horizontal channels. Figure 32 illustrates the six different pin configurations.  o  d  Figure 32: Six different logic block pin distributions. "|" shaped pins represent inputs and "o" shaped pins represent outputs.  75  One important assumption we make is that logic block pins are fully permutable. In other words, the router can choose any of the available physical input and output pins to represent any of the logical input and output pins. This assumption is in-line with many commercially available FPGAs. Without this assumption, many of our results would differ.  5.4 Experimental Results For each of the 6 pin distributions studied, results were generated for EPLCs with aspect ratios of 1:1, 2:1, 4:1, 6:1, 8:1, and 10:1. Figure 33 is a comparison of the density efficiency of EPLCs with an aspect ratio of 2:1 using each of the 6 different pin distributions. The vertical axis represents the geometric average of tile area over 20 M C N C benchmark circuits whereas the horizontal axis represents the track ratio (t/t ). x  Each curve represents one of the 6 pin placements studied. For each track ratio, we use the most suitable switch block topology (Baseline, MOD2, or MOD4) as discussed in Chapter 3.  Figure 34 is a plot of critical path delay vs. track ratio for a core aspect ratio of 2:1. Each curve represents a different logic block pin placement. It is difficult to draw conclusions from this plot since there are no apparent trends and the delay range is relatively small (14.5 ns to 15.2 ns). Figure 36 encapsulates all of the contributions from Chapters 3 and 4 with findings from this chapter. The vertical axis is the geometric average of tile area over 20 M C N C benchmark circuits and the horizontal axis is the aspect ratio of the EPLC. The solid line represents an E P L C using a standard FPGA architecture. In this case, we use a square 76  switch block topology with a track ratio of 1:1 with an "even" pin placement (type "0"). The dashed line is an architecture that incorporates the best track ratio and switch block topology for each E P L C aspect ratio. The dotted line is an F P G A that incorporates the best track ratio, switch block topology, and pin placement for a given E P L C aspect ratio.  7000 -I 1  1 1  1 .  5  2  1 2  1 .  5  3  Track Ratio  Figure 33: Tile area vs. track ratio for various pin distributions. Core aspect ratio = 2:1.  Several observations can be made from Figure 33. Just as in Chapter 4, the shape of each curve appears to be the result of two competing trends. As the track ratio increases, the switch block enlarges thus causing an increase in tile area. Conversely, an increase in the track ratio results in higher channel capacities in the "long" direction of the EPLC thus resulting in increased routability. This trend held for all pin placements we investigated.  77  1 6  j  1 5 . 8 1 5 . 6 -  1 4 . 4 1 4 . 2 1 4 1  1  .  5  2  Track Ratio  Figure 34: Critical path delay vs. track ratio for various pin ratios. Core aspect ratio = 2.  Several important trends can be observed when comparing the curves of Figure 33. Each of the curves in Figure 33 has a minimum that represents the "best" track ratio for a given EPLC aspect ratio, switch block topology, and pin distribution. As we move from a reverse-biased pin ratio to a strongly biased pin ratio, the minimum tends to progress to higher track ratios. For an EPLC of aspect ratio 2:1, the "best" track ratio is at 1.1:1, 1.2:1, 1.3:1, 1.4:1, 1.7:1, and 1.7:1 for pin placement types "-1", "0", "1", "3", "4", and "6", respectively. Intuitively, this means that the more pins that reside on a particular side of a logic block, the greater the channel capacity should be on that side. With a better match between pin placement and channel capacity, routing has better access to logic blocks. As a result, there is an increase in routing flexibility.  78  Another important trend can be observed in Figure 33. As the pin bias increases, the minimum moves towards higher track ratios; further, the tile area at that minimum increases. This means that an architecture with more biased routing resources tends to result in worse area efficiency. This penalty is likely due to two factors. First, higher track ratios result in reduced area efficiency due to a reduction in routing flexibility (see Chapter 4). Second, by moving pins to the sides of the logic block with larger channel capacities, the number of tri-state buffers used to connect to routing tracks is increased. This is because the number of tri-state buffers used to connect to routing tracks is proportional to the number of track.  As an example, consider an EPLC that has an aspect ratio of 2:1 and a track ratio of 1.2:1. For a logic block output pin that connects the logic block to 50% of the tracks in a horizontal channel of size t , it would require t * 0.5 tri-state buffers. If this pin were then x  x  moved to connect to a vertical side then it would require t * 1.2 * 0.5 tri-state buffers. The x  result is a 20% increase in the number of tri-state buffers used to connect tracks to this pin (see Figure 35).  79  Logic Block  Switch Block  Figure 35: A tile consisting of a logic block, switch block and vertical and horizontal channels. Each point represents a tri-state buffer used to connect a pin to a routing track. By moving an output pin from the bottom side of the logic block to the right side of the logic block, the number of tri-state buffer connected to this pin increases from t *0.5=5 to t *1.2*0.5 = 6 . x  x  Another interesting result from Figure 33 is the magnitude of the minima for the reverse biased curve (type "-1"). This result shows us that density is degraded when using a reverse biased pin placement. Further, the magnitude of the degradation is roughly equal to that of a forward bias of type "1"; however, the cause for the degradation is very different. For the reverse bias, there is a poor match between the number of logic pins on a side and the size of the routing channel adjacent to that side. The result in a reduction in routing flexibility. For the forward bias, extra area is consumed by an increase in the number of tri-state buffers needed to connect output pins to routing channels.  When comparing the type "-1" curve and the type "1" curve, it can be observed that the type  configuration is more area efficient at low track ratios (t/t < 1.2:1). For track x  ratios higher than 1.2:1, the type "1" configuration is more area efficient. This trend indicates how the impact of increased routing flexibility compares to the impact caused by  80  a reduction in the number of tri-state buffers. For higher track ratios, the relative importance of routing flexibility grows.  W r - -b a s e l i n e s w i t c h / t r a c k r a t i o = 1 -•  b e s ts w i t c h / b e s tt r a c kr a t i o / b e s tp i n r a t i o  Figure 36: Tile area vs. core aspect ratio. The solid line uses the best switch block, track ratio, and pin distribution for each point. The dashed line uses the Baseline switch block, an even pin distribution, and track ratio=l.  From Figure 36, it is evident that, by using the best switch block topology, the best pin distribution, and the best track ratio, there is a significant density improvement. For core aspect ratios of 2:1,4:1:, 6:1, 8:1, and 10:1, the density improved by 4.8%, 6.5%, 10.1%, 11.2%, and 11.9%, respectively. On the other hand, there was a degradation in the critical path delay (Figure 37). For core aspect ratios of 2:1,4:1:, 6:1, and 10:1, the critical path delay degraded by 4.1%, 3.6%, 1.1%, 2.3% and 3.7%, respectively.  81  In Chapter 4, a comparison was made between an architecture with the baseline switch block and a track ratio of 1, to an optimized architecture that uses the best switch block and track ratio. In this chapter, the optimized architecture uses the best pin placement instead of an even (type "0") pin placement. By comparing the results from Chapter 4 with those from this chapter, we can determine the impact of optimizing pin placement. For core aspect ratios of 2:1,4:1, 6:1, 8:1, and 10:1, the density improved by 0.0%, 0.1%, 0.0%, 0.2%, and 0.8%, respectively. Critical path delay degraded by 0.0%, 4.7%, -2.9%, 0.7%, and 1.2%, respectively. From this, it is evident that optimizing for pin placement has a negligible impact on density and a mixed impact on critical path delay.  - - » - - baseline switch/track ratio = 1 — •  b e s ts w i t c h / b e s tt r a c k r a t i o / b e s t p i n r a t i o 1  1.80E-08T 1 . 7 5 E - 0 8 A  1 . 4 0 E - 0 8 -I 1  1 2  1 3  4  5  1 6  7  8  1  1  9  1  ,  1 1  1 0  Aspect Ratio  Figure 37: Critical path delay vs. core aspect ratio. The solid line uses the best switch block, track ratio, and pin distribution for each point. The dashed line uses the Baseline switch block, an even pin distribution, and track ratio=l.  82  5.5 Summary In this chapter, we experimented with various logic block pin placements. For architectures with track ratios greater than 1:1, we studied the effects of placing an increased number logic block pins adjacent to the larger, horizontal channels. Our experiments were performed with five different pin placements that ranged from one with all input and output pins connected to horizontal channels (type "6") to one with all pins evenly distributed around the logic block (type "0"). For completeness, we experimented with a pin placement that has slightly more pins connected to the vertical channels than the horizontal channels (type We quantified improvements in density and speed when using various track ratios for a given EPLC aspect ratio and pin placement. For a given E P L C aspect ratio and pin placement, we found that there existed a "best" track ratio where density is maximized. We found that the "best" track ratio starts at 1.1:1 for a pin placement of type "-1" and increases to 1.7:1 for a pin placement of type "6". We compared the benefits of using an optimized architecture that uses the best track ratio, best switch block topology, and best pin placement for a given EPLC aspect ratio. For several aspect ratios, the optimized architecture was compared with an architecture that uses the baseline switch block, a pin placement of type "0", and a track ratio of 1:1. Results showed that for an aspect ratio of 2:1, the new architecture improved area by 4.8%. Area improvements grew with higher aspect ratios; for an aspect ratio of 10:1, the improvement was 11.9%. There was, however, a small degradation in critical path delay.  83  Next, we next compared an architecture that has an optimized switch block, track ratio, and pin placement, and compared that to one that optimizes the switch block and the track ratio only. Effectively, we were trying to determine how much was gained by optimizing pin placement. For all aspect ratios tested, the density improvement was negligible and the critical path delay degraded.  84  CHAPTER 6: Conclusions In this thesis, we investigated the detailed routing architectures of EPLCs. We suggested that EPLCs should be able to take on any rectangular shape so that they are better able to fit available locations within an SoC layout. With this in mind, we studied the following three aspects of routing architecture for a rectangular EPLC: switch block topology, directionally biasing channels, and logic block pin placement.  We presented a new family of switch blocks that works well with rectangular EPLCs. Each family member can be used for a variety of channel segmentation patterns and aspect ratios, yet each is more suitable for a given E P L C aspect ratio. We have provided a set of guidelines that determines when each member should be used. To avoid conflicts between horizontal and vertical constraints, we specified a segmentation architecture that works well with rectangular switch blocks.  Through experimentation using a modified version of VPR, we measured the density and speed of EPLCs with various track ratios for given E P L C aspect ratios. We showed that for EPLCs with aspect ratios of 2:1,4:1, 6:1, 8:1, and 10:1, the "best" track ratios are at 1.3,1.4,1.5,1.5, and 1.5, respectively.  Next, we found the "best" track ratio for a given logic block pin placement. Results showed that for pin placements of type "-1", "0", "1", " 3 " , "4", and "6", the best track ratios are at 1.1:1,1.2:1, 1.3:1, 1.7:1, and 1.7:1, respectively.  85  We quantified the benefits of an optimized architecture that uses the "best" track ratio, "best" switch block topology, and "best" pin placement for a given E P L C aspect ratio. For each aspect ratio, the optimized architecture was compared to one that uses the baseline switch block and a track ratio of 1:1. Results showed that for E P L C aspect ratios of 2:1, 4:1, 6:1, 8:1, and 10:1, the optimized architecture improved area by 4.8%, 6.5%, 10.1%, 11.2%, and 11.9%, respectively. For E P L C aspect ratios of 2:1,4:1, 6:1, 8:1, and 10:1, on the other hand, the critical path delay degraded by 4.1%, 3.6%, 1.1%, 2.3%, and 3.7%, respectively. From this, we conclude that the optimized architecture is best suited for EPLCs where density (cost) is more important than speed.  When compared to a square EPLC, the area penalties for using rectangular EPLCs with aspect ratios 2:1, 4:1, 6:1, 8:1, and 10:1 are 1.6%, 15.5%, 27.1%, 37.6%, and 46.5%, respectively. In terms of critical path delay, the penalties for aspect ratios 2:1,4:1, 6:1, 8:1, and 10:1 are 3.8%, 8.4%, 9.1%, 14.7%, and 17.2%, respectively. This does not mean that rectangular EPLCs are a bad idea; in many applications, the fixed shapes and sizes of the other cores will dictate that a rectangular EPLC is to be used.  Finally, we showed that it is not beneficial to optimize an EPLC for density using pin placement. For E P L C aspect ratios of 2:1,4:1,6:1, 8:1, and 10:1, the density was improved by only 0.0%, 0.1%, 0.0%, 0.2%, and 0.8%, respectively. These negligible improvements in density come with the expense that the critical path delay is degraded by 0.0%, 4.7%, -2.9%, 0.7%, and 1.2% for the same aspect ratios.  86  6.1 Future Work This thesis is a first attempt at optimizing the routing architecture of an EPLC. There is a great deal of architectural work that remains unexplored. Experiments were conducted with a segmentation distribution with 100% of the wires with length 4. It would be interesting to determine what the best segmentation distribution is as a function of the EPLC aspect ratio. Further, it would be interested to determine whether the segmentation pattern should be directionally biased in a similar way to channel capacities. If this were beneficial, a new family of switch blocks would need to be defined to accommodate for this. There are many unanswered questions with respect to the VO architecture of an EPLC. First, it would interesting to determine how often "real world" designs would require that the I/O placement of an E P L C be defined by other cores within the system. If most applications require such a constraint, it would be interesting to quantify what effect this has on performance.  Once all of the various aspects of architecture have been determined for EPLCs, these design rules could be used to develop an EPLC generator. With little knowledge of FPGA architecture concepts, this tool could be used by an SoC designer to quickly develop a "near" optimized EPLC. A project is to develop such a tool is currently under way at the University of British Columbia.  87  6.2 List of Contributions The contributions of our work are summarized as follows: 1. We proposed a novel family of switch block topologies suitable for various E P L C aspect ratios. 2. We quantified the impact of directional biased channels on density and the critical path delay as a function of an E P L C s aspect ratio. This required us to propose a segmentation pattern suitable for directionally biased architectures. Further, we quantified the penalty -Of using rectangular EPLCs over square EPLCs. 3. We defined a set of logic block pin placements and quantified their impact on the density and critical path delay of the E D L C .  88  Bibliography [I]  S. Brown, R. Francis, J. Rose, and Z. Vranesic, "Field-Programmable Gate Arrays," Kluwer Academic Publishers, 1992.  [2]  Leopard Logic Corporate Website, http://www. leonardlogic. com, 2003.  [3]  eASIC Corporate Website, http://www.easic.com, 2003.  [4]  V . Betz and J. Rose, "VPR: A New Packing, Placement, and Routing Tool for F P G A Research," Int. Workshop on Field-Programmable Logic and Applications, pp. 213-222, August 1997  [5]  P. Hallschmid, S.J.E. Wilton, ""Detailed Routing Architectures for Embedded Programmable Logic IP Cores", in the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, C A , Feb. 2001, pp. 69-74.  [6]  S. Brown, J. Rose, "FPGA and CPLD Architectures: A Tutorial", IEEE Design & Test of Computers, vol. 13, no. 2, pp. 42-57, 1996.  [7]  J. L . Kouloheris, A . E l Gamal, "PLA-based F P G A Area versus Cell Granularity," Proceedings of the IEEE 1992 Custom Integrated Circuits Conference, pp. 4.3.14.3.4, 1992.  [8]  Altera Inc., Data Book, 2003.  [9]  Xilinx, Inc., The Programmable Logic Data Book, 2003.  [10] Actel Inc., FPGA Data Book and Design Guide, 2003. [II] Cypress Semiconductor Corporate Website, http://www.cypress.com, 2003. [12] QuickLogic Corporate Website, http://quicklogic.com, 2003. [13] J. Rose, A . E l Gamal and A . Sangiovanni-Vincentelli, "Architectures of FieldProgrammable Gate Arrays," Proceedings of the IEEE, Vol. 81, No. 7, July 1993, pp. 1013 - 1029.  89  [14] P. Chow, S. Seo, J. Rose, K . Chung, I Rahardja, and G. Paez, "The Design of an SRAM-Based Field-Programmable Gate Array: Part I: Architecture," in IEEE Transactions on VLSI, Vol. 7 No. 2, June 1999, pp. 191-197 [15] M . Khellah, S. Brown and Z. Vranesic," Modelling Routing Delays in S R A M based FPGAs," Proc. Canadian Conf. on VLSI, 1993, pp. 6B.13 - 6B.18. [16] J. Greene, E. Hamdy and S. Beal, "Antifuse Field Programmable Gate Arrays," Proceedings of the IEEE, July 1993, vol.81 no. 7, pp. 1042 - 1056. [17] W.C. Elmore, "The Transient Response of Damped Linear Networks with Particular Regard to Wideband Amplifiers," / . Applied Physics, Vol. 19, pp. 55-63, January 1948. [18] S.J.E. Wilton, " Implementing Logic in F P G A Embedded Memory Arrays: Architectural Implications", IEEE Custom Integrated Circuits Conference, pp. 12.4.1-12.4.4, May 1998. [19] S.J.E. Wilton, "Architectures and Algorithms for Field-Programmable Gate Arrays with Embedded Memory," Ph.D. thesis, University of Toronto, 1997. [20] J. Rose and S. Brown, "Flexibility of Interconnection Structures for Field Programmable Gate Arrays," IEEE Journal of Solid-State Circuits, Vol. 26 No. 3, March 1991, pp. 277-282. [21] J. Greene, V . Roychowdhury, S. Kaptanoglu and A . E l Gamal, "Segmented Channel Routing," Design Automation Conference, 1990, pp. 567 - 572. [22] K . Roy and M . Mehendale, "Optimization of Channel Segmentation for Channelled Architecture FPGAs," Custom Integrated Circuits Conference, 1992, pp.4.4.1 4.4.4. [23] S. Brown, M . Khellah and G. Lemieux, "Segmented Routing for SpeedPerformance and Routability in Field-Programmable Gat Arrays," Journal of VLSI Design, Vol. 4, No. 4, 1996, pp. 275 - 291.  90  [24] M . Khellah, S. Brown and Z. Vranesic, "Minimizing Interconnection Delays in Array-Based FPGAs," Custom Integrated Circuits Conference, 1994, pp. 181 — 184. [25] S. Brown, M . Khellah and Z. Vranesic, "Minimizing F P G A Interconnect Delays," IEEE Design and TEST Magazine, Winter 1996, pp. 16 - 23. [26] S. Trimberger and M . Chene," Placement-based partitioning for lookup-table-based F P G A ' s , " Proc. Intl. Conf. Computer Design, VLSI in Computer and Processors, pp. 91 - 94, 1992. [27] C. Y . Lee, "An Algorithm for Path Connections and its Applications," IRE Trans. Electron. Comput., Vol. EC=10, 1961, pp. 346 - 365. [28] E. Dijkstra, " A Note on Two Problems in Connexion with Graphs," Numer: Math., Vol. 1, 1959, pp. 2 6 9 - 171. [29] F. Rubin, "The Lee path connection algorithm", IEEE Trans. Electron. Computers, pp. 907-914, September 1974. [30] C. Ebeling, L . McMurchie, S. A. Hauck and S. Burns, "Placement Routing Tools for the Triptych F P G A , " IEEE Trans on VLSI, Dec. 1995, pp. 473 - 482. [31] J. Frankle, "Iterative and Adaptive Slack Allocation for Performance-Driven Layout and F P G A Routing," Design Automation Conference, 1992, pp. 536 - 542. [32] J.S. Rose, "Parallel Global Routing for Standard Cells," IEEE Transactions on Computer-Aided Design of Circuits and Systems, Vol. 9, No. 10, October 1990, pp. 1085-1095. [33] Y . Chang, S. Thakur, K. Zhu and D. Wong, " A New Global Routing Algorithm for FPGAs," International Conference on Computer-Aided Design, 1994, pp. 356 361. [34] S. Brown, J. Rose, Z. G. Vranesic, " A Detailed Router for Field-Programmable Gate Arrays," IEEE Trans, on Computer-Aided Design of Circuits and Systems, May 1992, pp. 620 - 628.  91  [35] G . Lemieux, S. Brown, " A Detailed Router for Allocating Wire Segments in FPGAs," ACM/SIGDA Physical Design Workshop, 1993, pp. 215 - 226. [36] G . Lemieux, S. Brown, D . Vranesic, "On Two-Step Routing for FPGAs," ACM Symp. on Physical Design, 1991, pp. 60 - 66. [37] M . J. Alexander, G. Robins, "New Performance-Driven F P G A Routing Algorithms," Design Automation Conference, 1995, pp. 562 - 567. [38] S.J.E. Wilton, R. Saleh, Progammable Logic IP Cores in SoC Design: vv  Opportunities and Challenges", in the IEEE Custom Integrated Circuits Conference, San Diego, C A , May 2001, pp. 63-66. [39] T. Wong, Non-Rectangular Embedded Programmable Logic Cores", M.A.Sc. vv  Thesis, May 2002. [40] Y.-W. Chang, D . Wong, and C. Wong, "Universal switch modules for F P G A design, "ACM Transactions on Design Automation of Electronic Systems, vol. 1, pp. 80-101, January, 1996. [41] M.I. Masud and S.J.E. Wilton, "A New Switch Block for Segmented FPGAs," International Workshop on Field-Programmable Logic and Applications, Glasgow, U.K., September 1999. Included in Lecture Notes in Computer Science 1673, Springer-Verlag, pp. 274 - 281. [42] G . - M . W u and Y.-W. Chang, "Quasi-universal switch matrices for FPD design," in IEEE Trans, on Computers (TC), Vol. 48, No. 10, pp. 1107-1122, Oct. 1999. [43] M . Shyu, Y . - D . Chang, G . - M . Wu, and Y.-W. Chang, "Generic Universal Switch Blocks," in IEEE Trans, on Computers (TC), vol. 49, no. 4, pp. 348-359, April 2000. [44] Y.-W. Chang, K . Zhu, D. F. Wong, and C. K . Wong, "Analysis of FPGA/FPIC switch modules," Dept. of Computer Sciences, University of Texas at Austin, TR9632, September 1996.  92  [45] J. Depreitere, H . Van Marck, and J. Van Campenhout, "Evaluation of F P G A Switch Matrices using Monte Carlo Approach", Proceedings of the Sixth Japanese FPGA/PLD  Design Conference, pp. 303-306, June 1998.  [46] H.B. Fan, J.P. Liu, Yu-Liang Wu, and C.C. Cheung, "On Optimum Designs of Universal Switch Blocks", Field Programmable Logic and Applications, Sept. 2002, pp. 142-151. [47] H.B. Fan, J.P. Liu, and Y . L . Wu, "General Models for Optimum ArbitraryDimension F P G A Switch Box Designs," Proc. IEEE International Conference on Computer-Aided Design, pp. 93-98, Nov. 2000. [48] V . Betz, J. Rose, A . Marquardt, "Architecture and C A D for Deep-Submicron FPGAs," Kluwer Academic Publishers, 1999. [49] S. Yang, "Logic Synthesis and Optimization Benchmarks, Version 3.0," Tech. Report, 1991. [50] E . M . Sentovich et al, "SIS, A System for Sequential Circuit Analysis," Tech. Report No. UCB/ERLM92/41,  1992.  [51] J. Cong and Y . Ding, "Flowmap: A n Optimal Technology Mapping Algorithm for Delay Optimization in Lookup-Table Based F P G A Designs," IEEE Trans, on Computer-Aided Design of Integrated Circuits and Systems, vol. 13, no. 1, January 1994, pages 1-12. [52] A . Marquardt, V . Betz, J. Rose, "Using Cluster-Based Logic Blocks and TimingDriven Packing to Improve F P G A Speed and Density," in the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, C A , 1999, pp. 37-46. [53] S. Kirkpatrik, C. Gelatt and M . Vecchi, "Optimization by Simulated Annealing," Science, May 13, 1983, pp. 671 - 680. [54] V . Betz and J. Rose, "Cluster-Based Logic Blocks for FPGAs: Area-Efficiency vs Input Sharing Size", IEEE Custom Integrated Circuit Conference1997, Santa Clara, C A , pp. 551-554.  93  [55] V . Betz and J. Rose, "FPGA Routing Architecture: Segmentation and Buffering to Optimize Speed and Density," ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 1999, pp 59-68. [56] V . Betz and J. Rose, "Automatic Generation of F P G A Routing Architectures from High-Level Descriptions," ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2000, pp 175-84. [57] V . Betz and J. Rose, "Directional and Non-Uniformity in F P G A Global Routing Architectures," IEEE/ACM International Conference on ComputerAided Design, 1996,pages 652-659. [58] Canadian Microelectronics Corporation, "0.18 u,m Mixed-Mode Polycide HSPICE Models," Confidential Process Documentation, 2000.  94  Appendix A Figure A . l & A.2 are "screen shots" from VPR [4]. Modifications were made to allow for a family of rectangular switch block topologies. A . l illustrates the routing resources available for baseline rectangular switch block topology and A.2 for the  MOD2  topology. The MOD2 topology has exactly half the switch block routing resources and is evident in these "screen shots".  V  VPR: Versatile Place and Route for FPGAs  Figure A . l : "Screenshot" from VPR showing the routing resources available for the baseline rectangular switch block.  95  V VPR: V e r s a t i l e Place a n d R o u t e f o r FPGAs  Figure  A.2:  "Screenshot" from VPR showing the routing resources available for the M O D 2 rectangular switch block. Routing resources are half that of the Baseline switch block.  96  Appendix B VPR [4] was modified to accommodate a new segmentation architecture that would solve the horizontal and vertical constraints problem for rectangular directionally biased routing architectures. This is shown in Figure B . l . .JOLsl  Figure B . l : A "screenshot" from VPR [4] showing the segmentation patterns.  97  Appendix C Figure C . l is a V P R [4] "screenshot" showing an entire E P L C after layout is complete. This E P L C has an aspect ratio of 2:1, a track ratio of 2:1, and a baseline switch block topology. .„lD|x|  PostScript itProceed  Routing succeeded with a channel width factor of 23.  Figure C . l : A "screenshot" from VPR [4] showing the entire E P L C after layout.  98  

Cite

Citation Scheme:

        

Citations by CSL (citeproc-js)

Usage Statistics

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0065360/manifest

Comment

Related Items