Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Congestion-driven re-clustering CAD flow for low-cost FPGAs Chiu, Darius 2009

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata

Download

Media
24-ubc_2009_fall_chiu_darius.pdf [ 729.05kB ]
Metadata
JSON: 24-1.0067718.json
JSON-LD: 24-1.0067718-ld.json
RDF/XML (Pretty): 24-1.0067718-rdf.xml
RDF/JSON: 24-1.0067718-rdf.json
Turtle: 24-1.0067718-turtle.txt
N-Triples: 24-1.0067718-rdf-ntriples.txt
Original Record: 24-1.0067718-source.json
Full Text
24-1.0067718-fulltext.txt
Citation
24-1.0067718.ris

Full Text

Congestion-Driven Re-Clustering CAD Flow for Low-Cost FPGAs by Darius Chiu B.Sc., The University of British Columbia, 2006 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in The Faculty of Graduate Studies (Electrical and Computer Engineering)  THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver) September, 2009 c Darius Chiu 2009  Abstract FPGA device area is dominated by the on-chip interconnect. For this reason, the amount of interconnect provided must be limited. This limit is usually imposed by designing an FPGA device family with a fixed channel width. CAD tools must meet this hard channel-width constraint for a circuit to be successfully mapped to a device from this family. Previous work has shown that if a design cannot be mapped to a device due to insufficient interconnect availability, it is possible to identify regions of high interconnect demand, and spread out or depopulate the logic in this area into surrounding regions. This is done by re-packing logic in the affected regions into an increased number of CLBs. This increases the effective amount of interconnect available to these high-demand areas. This methodology has been shown to significantly reduce channel width, at the expense of CLB count and runtime. In this work, we extend this previous algorithm in two ways: we present novel region selection techniques to optimize the selection of which regions should be depopulated, and we introduce a local channel-width demand model which can used to more accurately determine the amount of white space insertion at each iteration. Together, these techniques lead to significant run-time improvements and reduce the area of the resulting FPGA implementations. We were able to improve runtime by a factor of up to 5.5 times while reducing area by up to 20% when compared to previous methods.  ii  Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  ii  Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  iii  List of Tables  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  v  List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  vii  Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  viii  Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  ix  Dedication  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  x  1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . .  1 3 4  2 Background . . . . . . . . . . . . . . 2.1 FPGA Architecture . . . . . . . 2.2 FPGA CAD Flow . . . . . . . . 2.2.1 Synthesis . . . . . . . . . 2.2.2 Technology Mapping . . . 2.2.3 Clustering . . . . . . . . 2.2.4 Placement . . . . . . . . 2.2.5 Routing . . . . . . . . . . 2.3 Un/DoPack CAD Flow . . . . . 2.3.1 Baseline Un/DoPack . . . 2.3.2 Fine-Grained Un/DoPack 2.3.3 Multiregion Un/DoPack .  . . . . . . . . . . . .  . . . . . . . . . . . .  . . . . . . . . . . . .  . . . . . . . . . . . .  . . . . . . . . . . . .  . . . . . . . . . . . .  . . . . . . . . . . . .  . . . . . . . . . . . .  . . . . . . . . . . . .  . . . . . . . . . . . .  . . . . . . . . . . . .  . . . . . . . . . . . .  . . . . . . . . . . . .  . . . . . . . . . . . .  . . . . . . . . . . . .  . . . . . . . . . . . .  . . . . . . . . . . . .  . . . . . . . . . . . .  3 Multiple Region Depopulation with Congestion-Driven Metrics 3.1 Budgeted Multiregion Un/DoPack (BMR) . . . . . . . . . . . . . 3.1.1 Region Selection . . . . . . . . . . . . . . . . . . . . . . . 3.1.2 Whitespace Insertion . . . . . . . . . . . . . . . . . . . . 3.2 Congestion-Model Multiregion Un/DoPack (CMR) . . . . . . . .  5 5 9 9 10 10 11 13 14 17 18 19 21 21 22 25 26  iii  Table of Contents  3.3  3.2.1 Modeling Regional Interconnect Demand . . . . . . . . . 3.2.2 Modeling Internal Demand . . . . . . . . . . . . . . . . . Congestion-Aware Placement . . . . . . . . . . . . . . . . . . . .  4 Results . . . . . . . . . . . . . . . . . . . . . 4.1 Experimental Methodology . . . . . . . . 4.2 Previous Un/DoPack Schemes . . . . . . 4.3 Budgeted Multiregion Un/DoPack . . . . 4.4 Congestion-Model Multiregion Un/DoPack 4.5 Critical-Path Comparison . . . . . . . . . 4.6 CMR with Congestion-Aware Placer . . .  . . . .  . . . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  27 28 29  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  33 33 35 37 38 42 43  5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 5.1 Future Work . . . . . . . . . . . . . . . . . . . . . . 5.1.1 Influence from Neighbouring Regions . . . . . 5.1.2 Congestion-Driven Placement and Clustering  . . . .  . . . .  . . . .  . . . .  . . . .  . . . .  . . . .  47 47 47 48  Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  49  Appendices A Baseline Un/DoPack . . . . . . . . . . . . . . . . . . . . . . . . . .  56  B Fine-Grained Un/DoPack . . . . . . . . . . . . . . . . . . . . . . .  59  C Multiregion Un/DoPack . . . . . . . . . . . . . . . . . . . . . . . .  62  D Budgeted Multiregion Un/DoPack  65  . . . . . . . . . . . . . . . . .  E Congestion-Driven Multiregion Un/DoPack  . . . . . . . . . . .  F CMR Un/DoPack with Congestion-Aware Placement  . . . . .  68 71  iv  List of Tables 3.1 3.2  Maximum MRCW Comparison of Placement Schemes . . . . . . . Runtime Comparison of Placement Schemes . . . . . . . . . . . .  31 32  A.1 A.2 A.3 A.4 A.5 A.6 A.7  Baseline Baseline Baseline Baseline Baseline Baseline Baseline  B.1 B.2 B.3 B.4 B.5 B.6 B.7  Fine-Grained Fine-Grained Fine-Grained Fine-Grained Fine-Grained Fine-Grained Fine-Grained  C.1 C.2 C.3 C.4 C.5 C.6 C.7  Multiregion Multiregion Multiregion Multiregion Multiregion Multiregion Multiregion  D.1 D.2 D.3 D.4 D.5 D.6 D.7  BMR BMR BMR BMR BMR BMR BMR  Un/DoPack Un/DoPack Un/DoPack Un/DoPack Un/DoPack Un/DoPack Un/DoPack  -  Stdev0 . Stdev002 Stdev004 Stdev006 Clone . . Stdev010 Stdev012  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  56 56 57 57 57 58 58  Stdev0 . Stdev002 Stdev004 Stdev006 Clone . . Stdev010 Stdev012  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  59 59 60 60 60 61 61  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  62 62 63 63 63 64 64  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  . . . . . . .  65 65 66 66 66 67 67  E.1 CMR Un/DoPack - Stdev0 . . . . . . . . . . . . . . . . . . . . . .  68  Un/DoPack Un/DoPack Un/DoPack Un/DoPack Un/DoPack Un/DoPack Un/DoPack  Un/DoPack Un/DoPack Un/DoPack Un/DoPack Un/DoPack Un/DoPack Un/DoPack  Un/DoPack Un/DoPack Un/DoPack Un/DoPack Un/DoPack Un/DoPack Un/DoPack  -  -  -  . . . . . . .  . . . . . . .  Stdev0 . Stdev002 Stdev004 Stdev006 Clone . . Stdev010 Stdev012  Stdev0 . . Stdev002 Stdev004 Stdev006 Clone . . Stdev010 Stdev012  . . . . . . .  . . . . . . .  . . . . . . .  v  List of Tables E.2 E.3 E.4 E.5 E.6 E.7  CMR CMR CMR CMR CMR CMR  Un/DoPack Un/DoPack Un/DoPack Un/DoPack Un/DoPack Un/DoPack  -  Stdev002 Stdev004 Stdev006 Clone . . Stdev010 Stdev012  F.1 F.2 F.3 F.4 F.5 F.6 F.7  CMR CMR CMR CMR CMR CMR CMR  Un/DoPack Un/DoPack Un/DoPack Un/DoPack Un/DoPack Un/DoPack Un/DoPack  (Congestion (Congestion (Congestion (Congestion (Congestion (Congestion (Congestion  . . . . . .  . . . . . .  . . . . . .  . . . . . .  Aware Aware Aware Aware Aware Aware Aware  . . . . . .  . . . . . .  . . . . . .  . . . . . .  . . . . . .  . . . . . .  . . . . . .  . . . . . .  Placement) Placement) Placement) Placement) Placement) Placement) Placement)  -  . . . . . .  . . . . . .  . . . . . .  . . . . . .  . . . . . .  . . . . . .  . . . . . .  . . . . . .  68 69 69 69 70 70  Stdev0 . Stdev002 Stdev004 Stdev006 Clone . . Stdev010 Stdev012  . . . . . . .  . . . . . . .  . . . . . . .  71 71 72 72 72 73 73  vi  List of Figures 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.1 3.2  FPGA Logic and Routing Layout . . . . . . . . . . . . . . . . . . Basic Logic Element . . . . . . . . . . . . . . . . . . . . . . . . . Generalized Configurable Logic Block (from [30]) . . . . . . . . . Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Typical FPGA CAD Flow . . . . . . . . . . . . . . . . . . . . . . Illustration of Placement (from [16]) . . . . . . . . . . . . . . . . . Un/DoPack CAD Flow (From [29]) . . . . . . . . . . . . . . . . . Region Selection and Whitespace Insertion - Baseline Un/DoPack Region Selection and Whitespace Insertion - Fine-Grained Un/DoPack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  5 6 7 8 9 11 15 18  Example of Region Selection . . . . . . . . . . . . . . . . . . . . . Region Selection and Whitespace Insertion - BMR Un/DoPack . .  23 26  Area versus Runtime - Fine-Grained and Multiregion Schemes, Normalized to Baseline . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Area versus Runtime - Fine-Grained and BMR . . . . . . . . . . . 4.3 Area versus Runtime - Multiregion and CMR . . . . . . . . . . . 4.4 Area versus Runtime - BMR, Multiregion and CMR . . . . . . . . 4.5 Runtime Comparison with Baseline Un/DoPack - Circuit Stdev004 4.6 Area Comparison with Baseline Un/DoPack - Circuit Stdev004 . . 4.7 Critical Path Comparison - Circuit Stdev004 . . . . . . . . . . . . 4.8 Critical Path Comparison - Circuit Stdev006 . . . . . . . . . . . . 4.9 CMR with Congestion-Aware Placement - Stdev004 . . . . . . . . 4.10 CMR with Congestion-Aware Placement - Stdev006 . . . . . . . .  19  4.1  36 37 39 40 41 42 43 44 45 46  vii  Glossary Field-Programmable Gate Arrays (FPGA)  An integrated circuit device which are capable of being programed to implement any digital circuit  Look-up Table (LUT)  An element of a FPGA device which impliments any logical function of its inputs  Configurable Logic Blocks(CLBs)  An element of a FPGA device which comprises of a group of N BLEs which are interconnected with a fast local interconnect network  Basic Logic Elements(BLEs)  An element of a FPGA device which is comprised of a LUT and Flip-Flop  Aplication Specific Integrated Circuit (ASIC)  An integrated circuit device which is specifically designed and manufactured for a specific purpose  Computer-Aided Design (CAD)  Automated computer software tools used to aid the design and implimentation of systems  Minimum Routable Channel Width (MRCW)  The minimum number of tracks required for a given design to be routeable  Channel Width  The number of wiring tracks in each routing channel in the FPGA  Hardware Description Language (HDL)  A human readable language to express hardware designs  Microelectronics Corporation of North Carolina (MCNC) Circuits  A set of widely used acedemic benchmark circuits for FPGAs  viii  Acknowledgments My sincere thanks goes out to Dr. Guy Lemiuex and all of my fellow students and professors who have given me assistance and guidance throughout the course of my degree. In particular, I would like to thank David Grant for the help he has provided in maintaining our server cluster. Without his help, I would not have been able to easily complete this thesis. In addition, I would also like to thank Dr. Steve Wilton and Usman Ahmed for helping me with my writing. Finally I would like to thank my friends Andrew, Cindy, Chris, Dave, Johnny, Paul, and Patrick for the great times during the course of this degree.  ix  Dedication I dedicate this to my family who have patiently given their support to me throughout the years.  x  Chapter 1 Introduction Field Programmable Gate Arrays are customizable devices capable of implementing a variety of digital logic applications. This is because FPGAs can be customized by configuring re-programmable logic blocks and routing fabric. In contrast, Application Specific Integrated Circuits (ASICs) offer an alternative to FPGAs. ASICs are custom-manufactured integrated circuits which are designed for a specific application. Because of the application-specific nature of ASICs, they can typically offer better speed, density, and power characteristics than FPGAs. However, ASICs also require large up-front manufacturing costs and do not provide the re-programmability offered by FPGAs. FPGA devices are purchased pre-manufactured from the vendor. Time-to-market can be reduced because development and testing can be performed immediately using actual FPGA devices. Thus, FPGAs are favorable in circumstances where field-programmability and time-to-market benefits outweigh its speed, density and power disadvantages. In addition, low volume applications may not justify the economics of the high upfront costs for ASIC manufacture. FPGAs offer designers an alternative option for low to medium volume applications. As FPGAs increase in capacity and capability, it is common for manufacturers to provide separate low-cost and resource-rich families. For a similar logic capacity, low-cost families often have less embedded memory, embedded multipliers, and interconnect in the form of routing tracks in each channel. To target low-cost  1  Chapter 1. Introduction device families, computer-aided design (CAD) tools must configure the FPGA device in a way which meets a hard channel-width constraint imposed by routing capacity limits. Careful allocation of FPGA logic can help meet routing capacity constraints. Device logic in FPGAs are grouped as clusters, also known as configurable logic blocks (CLBs). The distribution of logic among CLBs can impact the number of routing tracks used around each CLB; the interconnect use around CLBs varies spatiality with placement. It is possible to distribute regions of high local interconnect demand by spreading logic over a larger number of CLBs. This increase in area allows the same amount of logic access to an increase in aggregate routing. Therefore, it is possible to spread logic in regions where there is high local interconnect demand, allowing CAD tools to meet hard channel-width constraints by increasing CLB usage. In a traditional single-pass CAD flow such as VPR [3], a solution may not be found which meets the channel-width constraint of a low-cost device. When regions are found which are congested, we can distribute logic over a larger area through whitespace insertion. Whitespace is inserted in the form of empty logic elements; for example, we can impose a limit on how many basic logic elements (BLEs) are allowed to form a CLB. Those CLBs in regions with higher local interconnect requirements should contain less BLEs such that local interconnect requirements are less then or equal to the channel-width provided by the device. The processes of identifying and re-clustering CLBs to reduce logic capacity is termed depopulation. Depopulation can occur in different ways. Single-pass clustering approaches, such as that presented in [26], can perform clustering in a way which results in clusters that are not full. These clustering approaches typically estimate the  2  1.1. Contributions routability of the circuit at the clustering stage and attempt to perform clustering to ease routability. However, without detailed information from placement or routing solutions, it is difficult for a clustering tool to determine the routing requirements of the circuit. This is especially true if our goal is to reduce local interconnect usage to meet channel-width constraints. In order to effectively depopulate only the areas of the device which are unroutable, we need to accurately identify and select which CLBs to depopulate. We also need to determine the amount of depopulation needed, which will require accurate prediction of postrouting interconnect demand after depopulation has occured. In Un/DoPack [29], authors presented methods to iteratively perform whitespace insertion. Post-routing information is used by the CAD flow to determine which regions should be targeted for depopulation. Regions which cannot be routed are reclustered to a smaller cluster size. This cluster size is determined either by the size of the region, or relative to the routability of the region. To our knowledge, to date, only Un/DoPack is capable of meeting a user-specified channel-width constraint through the use of iterative depopulation.  1.1  Contributions  The main contribution of this work is to improve region selection and whitespace insertion of Un/DoPack through the use of congestion information. We improve region selection, and effectively allocate whitespace to reduce routing requirements to meet channel-width constraints, but at the same time reduce runtime of the CAD flow and CLB usage. Un/DoPack showed that congested areas in the placement can be identified and reduced, but lacked an interconnect demand model to determine the amount of whitespace to insert. We will show that the  3  1.2. Thesis Organization use of a model-driven depopulation approach can result in both runtime and area improvements. Primarily, we will be improving area and runtime by selecting multiple congested regions simultaneously, and depopulating each region according to local interconnect demand. The model will be used to select congested areas in the placement for depopulation; this model will also attempt to predict the local interconnect demand after depopulation has occurred. This information will then be used to determine an adequate amount of whitespace to insert. To compare the runtime and area improvements, we will compare our work against the baseline version of Un/DoPack presented in [29] using the same benchmark circuits and channel-width constraints. The benchmark circuits, composed of smaller sub-circuits, are designed to simulate large system-on-chip circuits. These circuits exhibit the property where each sub-circuit may have different local interconnection requirements. Compilation using various channel-width constraints are performed to compare the various schemes presented in this thesis.  1.2  Thesis Organization  The remainder of the thesis is organized as follows: Chapter 2 provides an overview of modern FPGA technology, the state of current FPGA CAD technology, and previous work relating to the subject of this thesis; Chapter 3 describes the algorithms used and the experimental methodology; Chapter 4 compares the results of this work against previous work; Chapter 5 presents the conclusions of this work, contributions, and possible future work.  4  Chapter 2 Background This chapter provides an overview of current FPGA technology, as well as modern methods used to implement circuit designs onto an FPGA. A description of each step of a generalized CAD flow is discussed, along with a survey of common tools used in each step. We then describe, in detail, a CAD flow which performs regional depopulation to alleviate routing congestion.  2.1  FPGA Architecture  CLB  CLB  CLB  CLB  CLB  CLB  CLB  CLB  CLB  Figure 2.1: FPGA Logic and Routing Layout Field-Programmable Gate Arrays (FPGAs) are integrated circuits which are capable of implementing any digital circuit. This is possible because FPGAs contain an array of programmable logic elements, which can communicate with each 5  2.1. FPGA Architecture  LUT  D  Q  MUX  FF  Figure 2.2: Basic Logic Element other via a programmable routing fabric. This is shown in Figure 2.1 and is often referred to as the Island Style Architecture. Logic is arranged in a rectangular array and are surrounded by wires in the vertical and horizontal directions. The number of CLBs spanning the FPGA in the horizontal and vertical position is referred to as the array size. Surrounding the periphery of the FPGA, are input/output pads which allow the FPGA to connect with circuitry outside the FPGA. FPGA logic elements are known as a basic logic element (BLE). The logic capacity of a FPGA device is commonly measured as the number of BLEs contained in the whole FPGA. As shown in Figure 2.2, BLEs consist of a k-input look-up table (LUT) and a flip-flop. A k-input LUT, or k-LUT, can implement Boolean logic up to k inputs. The BLE can be used in combinational mode, where the output of the BLE is taken from the LUT, or sequential mode where the output is taken from the flip-flop. It has been shown that it is advantageous to combine multiple BLEs into clusters. These clusters contain a fast, local interconnect which connects constituent BLEs. Clusters connect to other clusters using an inter-cluster routing fabric. These clusters are known as configurable logic blocks (CLBs), shown in Figure 2.3. Previous work has shown that clustering creates various advantages including: the reduction of delay, reduction of interconnect usage, increase in 6  2.1. FPGA Architecture  k BLE i  k BLE  k BLE  Figure 2.3: Generalized Configurable Logic Block (from [30]) device density, and most importantly a reduction in CAD runtime. The maximum number of BLEs contained within a single CLB is defined as N, and the number of inputs as i. [1] experimentally determined the relationship between the number of inputs required for a cluster as a function of the LUT size and cluster size. The number of inputs per cluster was determined to be i = k2 × (N + 1), where k is the LUT size and N is the number of BLEs per cluster. The authors in [1] determined that this relationship yielded a high utilization (98%) of the logic contained within the CLBs. FPGA routing surrounds the CLBs. The routing fabric consists of three components: wires, connection blocks, and switch blocks. Wires occupy the vertical and horizontal channels adjacent to CLBs in the FPGA. In modern designs, wire segments may span multiple CLBs to allow efficient communication across longer distances. Because the number of wires occupying each channel are fixed at manufacture-time, the routing capacity of a FPGA device is limited by the num-  7  2.1. FPGA Architecture  Connection Block  Logic Block  Switch Block  Figure 2.4: Routing ber of wires contained in each channel across the device. The number of wires in each channel is referred to as the channel width. Channel width provides a major constraint for FPGA computer-aided design (CAD) tools since designs cannot be implemented if channel width constraints cannot be met; insufficient routing resources imply that connections between CLBs cannot be made. Furthermore, the channel width required may not be uniform across the device; some regions will require more interconnect than others. The occurrence of a small portion of the FPGA requiring more wiring than is offered by the FPGA architecture will prevent the entire design from being implemented. In this case, the circuit is said to be unroutable. The minimum channel width required for the circuit to be implemented is referred to as the minimum routable channel width (MRCW). FPGA routing also contains connection blocks and switch blocks. Connection blocks allow CLBs to accept inputs from, or provide outputs to, nearby wires. Switch blocks allow signals to change direction by connecting vertical and horizontal routing wires together. This is shown in Figure 2.4  8  2.2. FPGA CAD Flow  Synthesis  Technology Mapping  Clustering  Placement  Routing  Figure 2.5: Typical FPGA CAD Flow  2.2  FPGA CAD Flow  Through the use of software tools, designers produce a bitstream used to program the FPGA device. Figure 2.5 shows the 5 typical steps, and is referred to as an FPGA CAD flow. These steps are: synthesis, technology mapping, clustering, placement, and routing. These steps are described in detail below.  2.2.1  Synthesis  A designer will typically express the design in a hardware description language (HDL), such as VHDL or Verilog. The task of the synthesis step is to translate the design into a gate-level representation. This representation is a network of Boolean logic gates and flip-flops.  9  2.2. FPGA CAD Flow  2.2.2  Technology Mapping  The technology mapping step takes the output of the synthesis step and maps groups of logic into k-LUTs. At this stage, optimizations can be made to minimize logic usage, delay and power. For example, delay can be reduced by minimizing logic depth, which is the longest path of the circuit. Most notably, FlowMap [8] was able to produce a depth optimal-solution in polynomial complexity time. Other technology mapping algorithms are presented in [9], [10], [11] and [15].  2.2.3  Clustering  The clustering step packs LUTs and flip-flops into BLEs and combines multiple BLEs into CLBs to reduce delay and routing resource usage of the circuit. Each CLB contains a fast local interconnect structure which allows constituent BLEs to communicate with each other. This intra-cluster routing is much faster than inter-cluster routing in general. The simplest clustering approaches are based on greedy selection. Known as the bottom-up approach, individual CLBs are built by first choosing a seed BLE. Subsequent blocks are added to the CLB based on their relationship to the seed. This occurs until CLB capacity constraints are met. Various greedy clustering algorithms exist, each with distinct goals. Examples include: Vpack [2], which attempts to minimize total number of inputs per cluster; T-VPack [3], which tries to reduce the number of intercluster nets on the critical path; RPack [4], which attempts to reduce the routing effort of the circuit by enclosing intercluster nets into clusters; an intrinsic shortest-path length based clustering method [23], which attempts to reduce post-placement wirelength; and iRAC [26] which attempts to minimize routing channel width by completely enclosing low-fanout nets within a  10  2.2. FPGA CAD Flow  a d  b e  f i  l  e  c g j m  a  i  c  f  l  d  m  h  g  n  b  h k  n  k  j  Figure 2.6: Illustration of Placement (from [16]) cluster. Greedy clustering algorithms are fast and area efficient, but due to a lack of backtracking, they can get trapped in local minima. For the experiments performed in this thesis, a replica of the iRAC clustering algorithm was used. iRAC clustering has been shown to produce a low routed channel width solution and good delay performance when compared to other clustering approaches.  2.2.4  Placement  The placement step attempts to find an optimal arrangement for each of the CLBs in the array. Each CLB in the FPGA can be placed in fixed locations in the array. Placement tools will typically try to find a location for each CLB, such that a cost function is minimized. For example, timing-driven placement attempts to minimize the lengths of the critical and near-critical paths; boundingbox placement will attempt to reduce the sum of the half-perimeter bounding boxes for all nets in the circuit; CMap [32] attempts to reduce peak interconnect demand by balancing predicted interconnect demand across the device; iRAP [24] attempts to balance local interconnect demand by matching the Rent parameter  11  2.2. FPGA CAD Flow of the placement with the architectural Rent parameter; the Rent parameter is a measure of how tightly interconnected a circuit is. Placement algorithms generally fall into two categories: simulated annealing and analytical placement. Simulated annealing is a flexible technique that can be applied to any optimization problem. Different optimizations can be implemented just by changing the cost function. For FPGA placement, the simulated annealing approach first starts with a random placement of CLBs. Pairs of CLBs then exchange locations. For each swap, a cost function evaluates the cost of the swap. If the cost decreases, the move is accepted. If the cost increases, the probability of accepting the swap depends on the current temperature. Initially, the temperature is set to accept all swaps, regardless of the cost. Gradually, the temperature decreases, reducing the probability that detrimental swaps are accepted. The acceptance of detrimental swaps allows the placement tool to possibly find a placement with globally minimal cost, instead of being trapped in local minima. However, due to the random nature of simulated annealing, it is a computationally intensive method to arrive at a placement solution. For this thesis, placement will be based on a simulated annealing approach. Analytical placement uses systems of equations to solve the placement problem. This can potentially be much faster than simulated annealing. For example, in force-directed placement, CLBs are modeled as a system of particles being connected by a system of springs. The final location of each CLB is determined by solving the system such that the system is at a state of equilibrium. A drawback of analytical placement is that illegal placements may occur due to CLBs overlapping each other; each CLB can only be placed in a discrete location. Legalization methods must then be introduced to address this issue. A drawback of the traditional CAD flow is that clustering, placement, and  12  2.2. FPGA CAD Flow routing occur in sequence. Changes to clustering are able to significantly affect circuit structure. Yet, accurate information from the placement step regarding wirelength, timing, and routability are not available during clustering. Some techniques combine clustering and placement to overcome this issue. For example, SCPlace [7] extends the simulated annealing based placement method to allow BLEs to be swapped between CLBs while optimizing for wirelength and timing. Work by [30] utilized node duplication and a novel depth-optimal initial clustering solution with combined clustering and placement to reduce critical path delay.  2.2.5  Routing  The routing step determines which wires on the FPGA device will connect the signals between CLBs. In general, routing algorithms are classified into single or two-step algorithms. The two-step approach first performs global routing, then detailed routing. In global routing, a signal is assigned to input-output pins and a routing channel. In a subsequent detailed routing step, the signal is assigned to a specific track in the routing channel. Examples of two-step routing approaches include [6], [17], and [25]. Single-step routers combine global and detailed routing into a single step. Examples include [19], [20], and [22]. The routing algorithm used in this thesis is the PathFinder [20] negotiatedcongestion routing algorithm contained within VPR. The PathFinder algorithm initially maze routes all nets in the circuit. This will lead to some routing resources being overused due to sharing from multiple signals. Those shared resources are assigned a cost and all nets are ripped up and re-routed. The cost for overused resources iteratively accumulates until enough nets avoid the use of these resources such that only one net is routed on each wire, meaning the circuit can be routed.  13  2.3. Un/DoPack CAD Flow  2.3  Un/DoPack CAD Flow  The traditional CAD flow is not typically equipped to address hard channel width constraints of FPGA devices. Once the routing fails, designers need additional flexibility to produce a routable solution. Un/DoPack [29] is a fully automated congestion-aware CAD flow which performs re-clustering at a local level to meet a hard channel-width constraint. Re-clustering of local regions is especially important in the case of circuits with large interconnect variation, such as system-on-chip circuits, which may consist of many different subcircuits, each having significantly different interconnect demands. While clustering tools exist which maximize logic utilization by fully packing CLBs, various authors [12][27] have shown that the best performance may result from a balance between logic utilization and interconnect demand. Work by [4][26][29] showed that overall area could be reduced by packing CLBs to less than 100% capacity. Since FPGA area is dominated by interconnect, reducing the overall interconnect requirements by balancing local routing demand with logic utilization can produce a net decrease in FPGA area. The clustering tool in [27] performs depopulation uniformly across the design, which depopulates uncongested regions along with congested regions. Shown in Figure 2.7, Un/DoPack [29] reduces interconnect only in localized congested regions of the FPGA by iteratively reclustering regional BLEs into an increased number of CLBs. The user provides the following inputs to Un/DoPack: a circuit description, architecture description, target channel-width constraint, and an array size constraint. Iteratively, the MRCW of a circuit is reduced by spreading local areas of high congestion. This happens until the MRCW meets  14  2.3. Un/DoPack CAD Flow Congestion Calculator (UnPack) Circuit Description, Architecture Description, Channel Width Constraint, Array Size Constraint Yes Array Size Limits Reached?  Synthesize and Technology Map (SIS/Flowmap)  No  Cluster (iRAC Replica)  Incremental Cluster (DoPack)  Placement (VPR)  Fast Placement (Incremental or VPR)  Routing (VPR)  Routing (VPR)  Channel Width Constraint Met?  No  Channel Width Constraint Met?  Yes  Failure  No  Yes Success!  Figure 2.7: Un/DoPack CAD Flow (From [29]) the specified channel-width constraint. Initially, a traditional CAD flow using SIS/FlowMap and VPR is run. This is shown inside the dashed outline in Figure 2.7. Any packing, placement, and routing algorithm can be used; we use iRAC and VPR. If the specified target channel-width constraint is met on the first pass, it is done. If the channel-width constraint cannot be met, the iterative portion of the Un/DoPack flow is invoked, which consists of the steps described below. • The first step of the iterative portion, the UnPack step, determines which regions to depopulate. A region should be depopulated if the local interconnect demands of the region exceed the specified channel-width constraint. 15  2.3. Un/DoPack CAD Flow Previous work presented two region selection schemes: single and multiple region schemes. These schemes select regions of a fixed radius, and calculate a new cluster size. The new cluster size is calculated such that enough whitespace is introduced to expand the number of CLBs in each region by a pre-determined amount. • The second step repacks the BLEs in the selected regions. The clusters are packed less than 100% full using new cluster sizes determined by the UnPack step. BLEs from each region are individually reclustered with other BLEs from the same region. Any clustering algorithm can be used; for this work we used iRAC as our clustering algorithm. • The final step is to perform placement and routing. If the final result is a routable solution, the CAD flow will exit. If the circuit remains unroutable after this step, the algorithm will iterate: the congestion information from routing will be used in the UnPack step to determine which regions should be depopulated next. For placement, Un/DoPack uses a incremental placer, RePlace [18], which preserves the placement stability in each iteration. At each iteration, Un/DoPack is creating additional CLBs. These CLBs should be placed in close proximity to other CLBs from the same region. Existing CLBs are shifted to create room for the newly created CLBs. A low temperature anneal is then performed to optimize the placement. Since Un/DoPack is modifying only small localized regions of the FPGA, the incremental placement has the effect of significantly reducing runtime, when compared to a full VPR placement. Incremental placement also preserves the existing locations of CLBs in uncongested regions. While Un/DoPack was shown to be very effective in reducing the channel16  2.3. Un/DoPack CAD Flow width, runtime and area expansion can be further reduced through the use of better congestion-driven region selection and cluster-size calculation techniques. In the following sections, we will discuss the various Un/DoPack schemes against which this work will compare.  2.3.1  Baseline Un/DoPack  The Baseline version of Un/DoPack depopulates a large, single region of the device. In each iteration, a single region is selected by marking all the CLBs in a circular area, centered on the CLB with the highest congestion label, closest to the center of the array. The congestion label is the maximum of the number of signals in the x or y channel immediately adjacent to the CLB. In this work, we will use the original parameters for region size, from [29]. A circular region with a radius of array size / 4 will be used. The number of new CLBs created is equal to the number of CLBs in one row plus one column of the FPGA array. Equation 2.2 illustrates the increase in CLBs using the Baseline Un/DoPack approach. We will be comparing the area and runtime performance of each algorithm against the Baseline version of Un/DoPack.  num new CLBs baseline =  new region CLBs =  (num CLBs in F P GA) ∗ 2 + 1  (2.1)  num region LEs (2.2) num CLBs in region + num new CLBs baseline 17  2.3. Un/DoPack CAD Flow  Figure 2.8: Region Selection and Whitespace Insertion - Baseline Un/DoPack  2.3.2  Fine-Grained Un/DoPack  As one of the observations in [28], authors noted that adding very small amounts of whitespace at each iteration produced superior area results compared to the Baseline approach, at the expense of increased runtime. This is referred to as the Fine-Grained approach. At each iteration, a single small region is selected for depopulation. We will be using the same region size as in [28], which is a circular region of radius array size / 8. The number of new CLBs in each region is determined by Equation (2.4). The Fine-Grained Un/DoPack method offers the least amount of area inflation, for the same channel width targets. Therefore, we will be comparing our work against the area performance of Fine-Grained Un/DoPack.  num new CLBs f inegrained =  new region CLBs =  (num CLBs in region) ∗ 2 + 1  (2.3)  num region LEs num CLBs in region + num new CLBs f inegrained (2.4)  18  2.3. Un/DoPack CAD Flow  Figure 2.9: Region Selection and Whitespace Insertion - Fine-Grained Un/DoPack  2.3.3  Multiregion Un/DoPack  The multiple region approach presented in [29], referred to as Multiregion Un/DoPack, reduced runtime by depopulating multiple regions simultaneously in each iteration. Regions are selected by centering the region on the CLB with the largest congestion value, closest to the center of the array. Once marked, CLBs cannot belong to any other region. Iteratively, all CLBs with a congestion value over the target channel width constraint are marked in this way, until no other unmarked CLBs have a congestion value over the target channel width constraint. We use the region size presented in [29], which is a circular region with a radius of array size / 10. In this approach, there is no restriction on how many CLBs are created in each iteration. Each region in the Multiregion scheme grows proportionally to the peak congestion in each region and an empirically determined scaling factor. In each iteration, the new number of CLBs in congested regions are calculated according to Equation 2.5 and Equation 2.6.  α = 45 · region radius new region CLBs = α ·  highest clb label of region −1 channel width constraint  (2.5) (2.6)  19  2.3. Un/DoPack CAD Flow Of the previous Un/DoPack approaches, Multiregion Un/DoPack requires the lowest runtime. We will be comparing our work against the runtime peformance of Multiregion Un/DoPack.  20  Chapter 3 Multiple Region Depopulation with Congestion-Driven Metrics The work presented in this thesis is based on the Un/DoPack CAD flow. We extend this approach to use congestion information presented by the placement and routing stages to more accurately determine the amount of whitespace to insert at each iteration. We introduce congestion-driven techniques that utilize local congestion metrics to reduce runtime and area inflation. Two new approaches are presented which explore the use of congestion information to determine which regions will be reclustered in each iteration of the Un/DoPack flow. First, we present an approach to help the CAD flow to better select congested regions. Second, we relax the constraint on how many CLBs are created in each iteration, and apply an interconnect demand model to determine the amount of whitespace to insert in each iteration. Finally, we introduce a congestion-aware placement algorithm to show that it is possible to further improve the overall quality of the Un/DoPack flow through the inclusion of other congestion-aware tools.  3.1  Budgeted Multiregion Un/DoPack (BMR)  Our first observation is that while previous work showed that depopulation is effective in reducing local interconnect congestion, it is important to correctly select congestion areas. A CAD flow was built which modified the region selection technique of Un/DoPack. We will refer to this as the Budgeted Multi-Region  21  3.1. Budgeted Multiregion Un/DoPack (BMR) (BMR) approach for the remainder of this work. Multiple small regions are utilized to capture congested areas more accurately than a large single region. In addition, the number of CLBs created in each iteration is limited by a budget which is the same as in the Baseline Un/DoPack Flow; this budget is such that the number of CLBs in each iteration increase the FPGA array size by 1 row and 1 column. Each region is depopulated by a fixed amount equal to the Fine-Grained version of Un/DoPack. This has the effect of inflating area as much as the Baseline method, but the budget is spread to multiple regions. We will show that despite the overall same area growth in each iteration, we can converge more quickly to a routable solution.  3.1.1  Region Selection  Instead of selecting a single large region, such as in Baseline Un/DoPack, our BMR approach creates regions that cover all congested CLBs. The details of each step in region selection is as follows: • The first congestion region is selected by finding the CLB with the highest congestion label, closest to the center of the array. This forms an initial x,y center location for a circular windowed region of size bmr radius which will be marked for depopulation. In this work, bmr radius is array size / 10. This step is shown in the leftmost illustration in Figure 3.1(a). The initial region is centered on the most-congested (darkest in the illustration) CLB, closest to the center of the array. • The window center is adjusted slightly (up to ± bmr radius ) in each direction 2 by using force-directed shifting. Force vectors between each CLB in the win22  3.1. Budgeted Multiregion Un/DoPack (BMR)  1 1 1 1 1 1 1 1  1 1 1 1 1 1 1  1 1 1 1 1 1 1  2 2 2 2 2 2 2  2 2 2 2 2 2 2  2 2 2 2 2 2 2 2  1 1 1 1 1 1 1  1 11 11 11 1  (a) Selection of the First Region  2 2 2 2 2  2 2 2 2 2 2 2  (b) Selection of the First Region  2 2 2 2 2  1 1 1 2 2 2 2 2  1 1 1 1 1 2 2 2 2  1 1 1 1 1 1 1 2 2 2  1 1 1 1 1 1 1 2 2  1 1 1 1 1 1 1 2  1 11 1 1 11 1  (c) CLBs are Marked for Depopulation  Figure 3.1: Example of Region Selection dow and the center of the region are calculated using the congestion values for each CLB. The sum of these vectors produces a direction in which to move the window. A binary search along this direction between the starting point and the farthest possible new starting point is then used to determine the window location that encompasses the largest average congestion. The furthest that the window can shift is a distance of up to ± bmr radius away 2 from the original window center. This produces a force-directed move to shift the region to encompass the most amount of congestion. The center illustration in Figure 3.1(a) shows that location of the region has shifted. 23  3.1. Budgeted Multiregion Un/DoPack (BMR) The congested CLBs near the top left of the array create a net force which cause the region to shift. • The CLBs in this selected region are marked as belonging to this region. This is shown in the rightmost illustration in Figure 3.1(a) • The next congestion region is selected by finding the next CLB with the largest congestion value, closest to the center of the array. The center CLB is only chosen from CLBs not already selected. This is shown in the leftmost illustration in Figure 3.1(b). • Force-directed shifting is applied to the new region. The region center is not allowed to shift into an already selected region. However, once the region location is determined, CLBs overlapping with other regions are allowed to belong to this new region. This is shown in the center illustration in Figure 3.1(b). • This continues until all CLBs exceeding the target channel-width constraint are covered by a region. The force-directed move ensures the depopulation window region is repositioned so the CLB label peak value is still captured, but it will also capture as many other CLBs as possible that need depopulation. A list of all such regions is then sorted such that the regions with the highest average congestion are depopulated first. This ensures that the overall budget will be spent on the mostcongested regions first. At each depopulation step, we determine the number of CLBs which will be added. Once a region is depopulated, subsequent regions will not be able to depopulate the CLBs that are already depopulated in this iteration. The regions are depopulated in sorted order until all congested regions are depop24  3.1. Budgeted Multiregion Un/DoPack (BMR) ulated. This is shown in Figure 3.1(c). Although region 1 and region 2 overlap, region 1 will be depopulated first because it is more congested. Subsequently, region 2 is depopulated but does not overlap with region 1. In addition, those adjacent regions with the same target cluster size are merged together into a single combined “super-region”, which may allow for better flexibility during reclustering.  3.1.2  Whitespace Insertion  Whitespace insertion is achieved by reducing the cluster size of the CLBs, such that some BLEs become unused. For the BMR flow, we limit the growth in each region to an amount determined by Equation 3.1. However, the total number of new CLBs produced in each iteration is limited by a budget, defined in Equation 3.2.  num new CLBs region =  new CLBs bmr =  (num CLBs in region) ∗ 2 + 1  (num CLBs in F P GA) ∗ 2 + 1  (3.1)  (3.2)  For each region marked for depopulation, we subtract the number of new CLBs created in each iteration against the budget until the budget is exhausted. In some cases, the target cluster size can not be attained because the remaining budget is insufficient. In this case, all of the remaining budget will be applied to that region and the cluster size will be calculated accordingly.  25  3.2. Congestion-Model Multiregion Un/DoPack (CMR)  Figure 3.2: Region Selection and Whitespace Insertion - BMR Un/DoPack  3.2  Congestion-Model Multiregion Un/DoPack (CMR)  In addition to improved region selection, we experimented with using a channelwidth demand model to improve the accuracy of the whitespace insertion. We will term this scheme the Congestion-Model Multiregion Un/DoPack (CMR). We utilize the same region selection method as in the above BMR flow, but we extend the whitespace insertion method to utilize congestion information to determine a target cluster size. The overall budget is removed to allow as much depopulation in each iteration to occur as needed. Of the interconnect models available in previous work, the most applicable to this work consist of those models which predict wire length and channel-width demand [13][14]. The work in [13] extends previous models to consider the routing inflexibility inherent in FPGAs. The advantage of this model is that we can use it to predict outcomes in interconnect demand based on decisions made during clustering. Since this model assumes its application to an entire FPGA, we make some simple assumptions to apply it to our local depopulation regions.  26  3.2. Congestion-Model Multiregion Un/DoPack (CMR)  3.2.1  Modeling Regional Interconnect Demand  While most channel-width demand models predict the interconnect demand at a global level, we are interested in determining, on a region-by-region basis, how much whitespace insertion is necessary for each congested region to become routable. Thus we create the following simple model of local interconnect demand to allow the application of a global model, shown in Equation 3.3.  total region demand = region internal demand + region external demand (3.3) Interconnect demand for a region of CLBs can be generally categorized in two types: internal interconnect, which is the interconnect needed to route between CLBs inside a region, and external interconnect, which consists of routing that connects CLBs outside the region, but pass through the region without connecting to any CLBs inside the region. While depopulation directly affects the internal interconnect demand through whitespace insertion, we cannot directly reduce interconnect demand from external routing by whitespace insertion into a region. Instead, we must separately account for its effects by identifying the contribution to the channel-width demand inside a region caused by these external nets, as shown in Equation 3.3. For each region, our goal is to reduce total region demand to meet our channelwidth constraint by reducing region internal demand. We apply a congestion-estimation model, Wirelength-per-Area [31], to the internal and external nets separately. We can compare the amount of average interconnect demand from internal nets to the average interconnect demand of  27  3.2. Congestion-Model Multiregion Un/DoPack (CMR) external nets. Combining this with post-routing information, we can then estimate the actual interconnect demand due to external and internal nets. For simplicity, we assume that subsequent iterations will produce relatively the same amount of external-net congestion in the next iteration. Although this is a simplification, interconnect demand from external nets are typically less than interconnect from internal nets. This is intuitive since local routing should mostly originate from logic inside regions. We leave the influence of inter-region effects on congestion to future work. The channel-width estimation model using Equation 3.5 from [13], is then used to determine the amount of whitespace to insert for the next reclustering step.  3.2.2  Modeling Internal Demand  The interconnect model presented in [13] is shown in Equation 3.5. This model is an extension of El Gamal’s master slice interconnect model [14], which predicts the channel-width for a fully flexible FPGA. This is shown in Equation 3.4. Wabs min is the channel-width required for a fully flexible FPGA, while λ is the average number of used inputs and R is the average point-to-point wirelength. The channel-width determined from the master slice interconnect model does not account for additional channel-width required due to routing inflexibility caused by wire segment length, switch blocks, and connection blocks. Instead, [13] accounts for these by including additional terms, shown in Equation 3.5.  Wabs min = p  ¯ λR 2  (3.4)  28  3.3. Congestion-Aware Placement W = Wabs min 1 Wabs min Wabs min β Fs FCin λ(L − 1) 1 + 1 + αin 4 FCin  +  αin  Wabs min FCout  αout  (3.5)  In Equation 3.5, W is the estimated peak channel-width. Post-placement, we can measure the average number of used inputs per CLB, λ. Fs is switch block flexibility, FCin and FCout denote connection block flexibilities for inputs and outputs, and L is wire segment length; these values can be determined from the FPGA architecture. We use the values 1.4, 0.5, and 0.25 for β, αin , and αout , respectively, as presented in [13]. We assume, as in [13], that R remains constant with cluster size. Therefore, by using the measured peak channel-width for a region and average number of used inputs, we can solve Equation 3.5 for p · R. With these constants set, we substitute W for the target channel-width and solve Equation 3.5 for Wabs min . From Equation 3.4, we can then solve for λ. Therefore, by re-clustering the region at the next iteration to meet the λ constraint on the number of used inputs, this region should become routable. In our implementation, the clustering tool iteratively reclusters a region with a progressively lower cluster-size until the average number of used inputs constraint is met. Since the clustering runtime is very small, the increase in overall runtime is negligible. In addition, we retain the flexibility of the clustering tool to use different clustering methods.  3.3  Congestion-Aware Placement  A second goal of this work is to show that a congestion-aware placement tool improves the overall quality of the Un/DoPack flow, but is insufficient as a 29  3.3. Congestion-Aware Placement replacement to Un/DoPack.  While several congestion-aware placement tools  exist [32][24][5], work presented in [32] follows a philosophy complimentary to Un/DoPack by optimizing placement to reduce local congestion.  Nnets  Cost = coef f ∗  q[i](bbx (i) + bby (i))  (3.6)  i=1  The placement approach in [32], referred to as Bounding Box Overlap placement, uses a congestion estimation map to create a coefficient. This coefficient is multiplied with the bounding box cost function in the VPR placement tool to penalize swaps which lead to congested placements. The modified cost function is shown in Equation 3.6. For each net i, the horizontal and vertical spans, bbx (i) and bby (i), are added and multiplied with the q(i) factor which compensates for the fanout of the net. This is summed over all nets and multiplied with the coefficient calculated from the congestion map. The coefficient is calculated using Equation 3.7.  coef f =  2 2 Σi,j Ui,j Σi,j Ui,j / nx · ny nx · ny  2  (3.7)  In Equation 3.7, Ui,j is a CLB label in the congestion estimation map used. The congestion estimation map in [32] uses an approach referred to as Bounding Box Overlap. The congestion estimation map indicates how many bounding boxes overlap each CLB. Since we are able to use any method to generate a congestion estimation map, we also took the opportunity to explore a slight variation to Bounding Box Overlap heuristic. In [31], authors explained that a related  30  3.3. Congestion-Aware Placement Circuit  stdev0 stdev002 stdev004 stdev006 stdev008 stdev010 stdev012  VPR Default (tracks) 96 96 101 89 119 153 145  VPR BB (tracks) 95 93 97 89 116 150 145  BB Overlap (tracks) 92 94 98 90 115 152 141  Wirelength per Area (tracks) 93 86 92 86 106 139 138  Table 3.1: Maximum MRCW Comparison of Placement Schemes heuristic, Wirelength per Area, produces a congestion map which better indicates relative local amounts of interconnect demand when compared to Bounding Box Overlap. We created congestion maps using the Wirelength per Area method and applied this to the congestion-aware placement tool. Table 3.1 illustrates the effect of congestion-aware placement on the maximum MRCW of our benchmark suite, and Table 3.2 compares the runtime performance for each placement scheme. While much slower in runtime, results show that a congestion-aware placer consistently reduces the number of routing tracks. The runtime and maximum MRCW results obtained using VPR default, VPR bounding box, Bounding Box Overlap, and Wirelength per Area placement approaches, are compared in Table 3.1 and Table 3.2. We note that the Wirelength per Area method indeed performs slightly better in reducing the maximum MRCW, although both produce consistently good results. Our experiments combine our Congestion-Model Multiregion version of Un/DoPack with the congestion-aware placement, using the Wirelength per Area method. This will be called CMR-CAP. The congestion-aware placement is integrated with the incremental placement tool, RePlace.  31  3.3. Congestion-Aware Placement  Circuit  stdev0 stdev002 stdev004 stdev006 stdev008 stdev010 stdev012  VPR Default (seconds) 2406 2207 2170 1704 1810 2279 1875  VPR BB (seconds) 783 882 831 692 776 1037 1063  BB Overlap (seconds) 10063 10226 8792 8794 9602 10394 12126  Wirelength per Area (seconds) 10918 10639 9694 9803 10204 11942 13808  Table 3.2: Runtime Comparison of Placement Schemes  32  Chapter 4 Results We compared the runtime and area performance of Baseline Un/DoPack to the following schemes:  Multiregion Un/DoPack, Fine-Grained Un/DoPack,  Budgeted Multiregion Un/DoPack (BMR) and the Congestion-Model Multiregion Un/DoPack (CMR). We also performed experiments which combined the Congestion-Model Multiregion Un/DoPack approach with a congestion-aware placement tool (CMR-CAP).  4.1  Experimental Methodology  This section discusses the experimental framework to evaluate our algorithmic improvements. The results in this work are normalized to the baseline Un/DoPack flow presented in [29]. Baseline Un/DoPack has the following characteristics:  • Congestion calculator with single region depopulation • Clustering algorithm which is a replica of iRAC • RePlace, incremental placer presented in [18], using default VPR placement (timing and wirelength driven placement) • VPR flags: pres fac mult 1.3, max router iterations 100 • FPGA architecture with LUT size k = 6, cluster-size N = 16, inputs per cluster I = 51, and a wire length of L = 4  33  4.1. Experimental Methodology The experiments were conducted on a single core of a Xeon X5355 2.66 GHz processor with 16GB of RAM. The maximum MRCW of each circuit, was determined from VPR with the binary search option set. The verify binary search option in VPR was used to ensure that the lowest routable channel width was measured. All versions of Un/DoPack were run on our servers including the following Un/DoPack schemes: Baseline, Fine-Grained, BMR, and CMR. All VPR simulations used an overuse penalty factor growth factor, pres fac mult, of 1.3 and the maximum number of router iterations, max router iterations, set at 100. The channel-width constraints are the same as those presented in [29] using the benchmark circuits described below. We used the benchmark circuits from [29]. Each of the benchmark circuits have the following characteristics: 40013 LUTs, 241 inputs, 120 outputs, and approximately 52000 nets. These circuits are designed to help examine the performance of CAD tools for large, system-on-chip designs which are composed of subcircuits with varying amounts of interconnect demand. The benchmark circuits created in [29] were generated by mimicking the properties of MCNC benchmark circuits [21] as subcircuits in one large, synthetic circuit using the generator GNL. GNL allows the user to specify the overall Rent parameter of the circuit, and also the Rent parameter of individual subcircuits. The average Rent parameter is 0.65 for each benchmark circuit, but the standard deviation of the Rent exponent for the subcircuits is varied. The result is a set of benchmark circuits which contain some circuits with uniform local interconnect demand across the circuit, while others have regions of high local interconnect demand. For each of the Un/DoPack multiregion depopulation methods, we use a depopulation radius of array size / 10. This is the same region size of Multiregion Un/DoPack presented in [29]. We use a depopulation radius of array size / 4 for  34  4.2. Previous Un/DoPack Schemes Baseline Un/DoPack, and a depopulation radius of array size /8 for Fine-Grained Un/DoPack, in accordance to [28]. Multiregion Un/DoPack will be our target approach for a low-runtime version of Un/DoPack, while Fine-Grained Un/DoPack will form the target for the low-area version of Un/DoPack. A separate set of experiments were performed to examine the effect of combining a congestion-aware placement tool with the Congestion-Model Un/DoPack approach. All parameters for this set of experiments are identical to those mentioned above, with the exception of the inclusion of the congestion-aware placement cost function within the incremental placer.  4.2  Previous Un/DoPack Schemes  Figure 4.1 shows the area versus runtime results for Multiregion and Fine-Grained Un/DoPack for each of the benchmark circuits used in this thesis, over a range of channel width constraints. The horizontal axis indicates the runtime for each benchmark, normalized to the runtime of Baseline Un/DoPack. The vertical axis indicates the area for each benchmark, normalized to the area of Baseline Un/DoPack. In this work, area is considered the sum of routing area and CLB area. Area is calculated in the same way as VPR [3], where the layout area of an individual transistor is expressed in units of minimum-width transistor areas. The vertical axis indicates the runtime of each scheme, normalized to the runtime of Baseline Un/DoPack. Our results show that of the previous Un/DoPack schemes, the Fine-Grained approach produces the best results in terms of area; the area is reduced by up to 30% when compared to Baseline Un/DoPack. However, this has a large runtime penalty; due to the small number of CLBs inserted at every iteration, many  35  N o r m a liz e d A r e a  4.2. Previous Un/DoPack Schemes  1 .3 1 .3 1 .2 1 .2 1 .1 1 .1 1 .0 1 .0 0 .9 0 .9 0 .8 0 .8 0 .7 0 .7  5  F in e - G r a in e d M u ltir e g io n  0 5 0 5 0 5 0 5 0 5 0 5 0 0  1  2  3  4  N o r m a liz e d R u n tim e  Figure 4.1: Area versus Runtime - Fine-Grained and Multiregion Schemes, Normalized to Baseline iterations are needed, especially for lower target channel-width constraints. The runtime increases accordingly. Multiregion Un/DoPack reduces runtime by depopulating multiple regions simultaneously, while inserting whitespace proportional to the peak congestion value of a region. As expected, Multiregion Un/DoPack outperforms the runtime of Baseline Un/DoPack, improving runtime by up to 6x. Area performance also improves in general by up to 17.5% over Baseline Un/DoPack. However, unlike Fine-Grained Un/DoPack, which consistently reduces area up to 28%, Multiregion Un/DoPack also produces worse area results, inflating area up to 32% more than Baseline Un/DoPack.  36  4.3. Budgeted Multiregion Un/DoPack  Budgeted Multiregion Un/DoPack  N o r m a liz e d A r e a  4.3  1 .3 1 .3 1 .2 1 .2 1 .1 1 .1 1 .0 1 .0 0 .9 0 .9 0 .8 0 .8 0 .7 0 .7  5  F in e - G r a in e d B M R  0 5 0 5 0 5 0 5 0 5 0 5 0 0  1  2  3  4  N o r m a liz e d R u n tim e  Figure 4.2: Area versus Runtime - Fine-Grained and BMR Figure 4.2 shows the area versus runtime results for BMR and Fine-Grained Un/DoPack for each of the benchmark circuits used in this thesis, over the same range of channel width constraints. We observe that BMR Un/DoPack is able to maintain area performance which is comparable to Fine-Grained Un/DoPack, and yet runtime is significantly faster. The general trend for Fine-Grained Un/DoPack is that as the area performance improves against Baseline Un/DoPack, but runtime also increases. In BMR however, the opposite trend is apparent; a decrease in area is accompanied by a decrease in runtime; this is despite BMR creating as many CLBs as Baseline Un/DoPack in each iteration. The speedup can be explained, in some cases, by a reduction in iterations needed to reach a channel 37  4.4. Congestion-Model Multiregion Un/DoPack width constraint. Depopulating multiple congested regions simultaneously allows BMR to converge faster by depopulating multiple locations. The area performance of BMR is better than Baseline Un/DoPack. A smaller CLB increase in each region may prevent regions from being over-depopulated. Accurate region selection reduces the likelihood that uncongested CLBs are depopulated, due to non-circular congestion regions and force-directed shifting. Compared to Baseline Un/DoPack, area is reduced by up to 23%, with the maximum increase over Baseline at 1.1%. This leads to the conclusion that increasing the total CLB budget for depopulation in each iteration is an important factor in reducing runtime. For the congestion in each region to be resolved as soon as possible, we require an adequate amount of depopulation. However, by reducing the amount of depopulation in each region to the minimum necessary amount, unnecessary area inflation can be reduced.  4.4  Congestion-Model Multiregion Un/DoPack  For the CMR scheme, the limit on area growth for each iteration is removed. Instead, a congestion model for whitespace insertion is added to determine how much whitespace to insert in each iteration. Figure 4.3 shows a comparison of area and runtime performance between the Multiregion Un/DoPack and CMR Un/DoPack. Because of the removal of the budget, the number of iterations decreases dramatically compared to Baseline and Fine-Grained Un/DoPack. The reduction of iterations produces a runtime improvement that is comparable to Multiregion Un/DoPack. In our experiments, CMR Un/DoPack is able to achieve a 5.5x speedup over Baseline Un/DoPack, while Multiregion Un/DoPack is able to achieve a runtime improvement of 6x. 38  4.4. Congestion-Model Multiregion Un/DoPack  N o r m a liz e d A r e a  1 .3 1 .3 1 .2 1 .2 1 .1 1 .1 1 .0 1 .0 0 .9 0 .9 0 .8 0 .8 0 .7 0 .7  5  M u ltir e g io n C M R  0 5 0 5 0 5 0 5 0 5 0 5 0 0  1  2  3  4  N o r m a liz e d R u n tim e  Figure 4.3: Area versus Runtime - Multiregion and CMR However, we note that the area performance of CMR Un/DoPack is consistently better than Multiregion Un/DoPack. In some instances where Multiregion Un/DoPack has significant runtime speedup, the area growth is quite high. In contrast, an improvement in runtime is also accompanied by an area improvement for CMR in general. Figure 4.4 shows a comparison of all multiple region depopulation methods. Both CMR and BMR show consistently better area performance than Multiregion Un/DoPack. Figures 4.5 and 4.6 show examples of typical results for individual benchmarks. Shown is the area and runtime of Baseline Un/DoPack compared to other methods. The horizontal axis is minimum routable channel width, normalized to  39  N o r m a liz e d A r e a  4.4. Congestion-Model Multiregion Un/DoPack  1 .3 1 .3 1 .2 1 .2 1 .1 1 .1 1 .0 1 .0 0 .9 0 .9 0 .8 0 .8 0 .7 0 .7  5 0 5  M u ltir e g io n C M R B M R  0 5 0 5 0 5 0 5 0 5 0 0  1  2  N o r m a liz e d R u n tim e  Figure 4.4: Area versus Runtime - BMR, Multiregion and CMR the maximum MRCW. The maximum MRCW was determined by performing the traditional VPR CAD flow with the binary search option enabled for the routing step. This determines the minimum number of tracks needed to route each benchmark circuit before depopulation using Un/DoPack. Area is measured as the total transistor area of the logic and routing of the CLBs used. This captures the effect of successfully reducing the channel width as well as the effect of using more CLBs after depopulating. Runtime figures presented are normalized to the maximum MRCW for the circuit. A value of 0.6 on the x-axis means that the final routed channel width is reduced by 40%. As expected, Fine-Grained Un/DoPack produces the best area with the worst runtime. Our CMR approach maintains or improves the runtime performance of  40  4.4. Congestion-Model Multiregion Un/DoPack  4 .5  F in M u B M C M  4 .0 3 .5  e - G r a in e d ltir e g io n R R  S p e e d u p  3 .0 2 .5 2 .0 1 .5 1 .0 0 .5 0 .0 0 .6  0 .7  0 .8  0 .9  1 .0  M R C W ( N o r m a liz e d to m a x im u m  M R C W )  Figure 4.5: Runtime Comparison with Baseline Un/DoPack - Circuit Stdev004 the previous Multiregion approach, but also improves area. Results for other circuits are similar to circuit Stdev004. However, in situations where the maximum MRCW is quite large, the performance of all approaches will be similar. This is because only a small number of regions need to be depopulated for the circuit to become routable. Therefore at larger channel width constraints, the number of iterations required for each approach is similar, and so the runtime is also similar. As the maximum MRCW is lowered, Baseline and Fine-Grained Un/DoPack perform less depopulation at each iteration than the Multiregion or Congestion-Model Multiregion approaches. This allows our approach to have a significant speedup at lower MRCW settings. In addition, our CMR approach  41  N o r m a liz e d A r e a  4.5. Critical-Path Comparison  1 .2 1 .2 1 .1 1 .1 1 .0 1 .0 0 .9 0 .9 0 .8 0 .8 0 .7 0 .7  5  F in M u B M C M  0 5 0  e - G r a in e d ltir e g io n R R  5 0 5 0 5 0 5 0 0 .6  0 .7  0 .8  0 .9  1 .0  M R C W ( N o r m a liz e d to m a x im u m  M R C W )  Figure 4.6: Area Comparison with Baseline Un/DoPack - Circuit Stdev004 reduces the over-depopulation of regions and area performance is improved even at low target channel widths.  4.5  Critical-Path Comparison  The critical path for each method are relatively similar. Typical critical path results, compared to Baseline Un/DoPack, are shown in Figures 4.7 and 4.8.  42  N o r m a liz e d C r itic a l P a th  4.6. CMR with Congestion-Aware Placer  1 .2 1 .2 1 .1 1 .1 1 .0 1 .0 0 .9 0 .9 0 .8 0 .8 0 .7 0 .7  5  F in M u B M C M  0 5 0  e - G r a in e d ltir e g io n R R  5 0 5 0 5 0 5 0 0 .6  0 .7  0 .8  0 .9  1 .0  M R C W ( N o r m a liz e d to m a x im u m  M R C W )  Figure 4.7: Critical Path Comparison - Circuit Stdev004  4.6  CMR with Congestion-Aware Placer  The inclusion of a congestion-driven placement tool can further improve the Un/DoPack approach.  Figures 4.9 to 4.10 illustrate that combining CMR  Un/DoPack with a congestion-aware placer consistently improves area. As indicated in Table 3.1, this congestion-aware placement tool cannot by itself reduce the channel-width more than is possible with whitespace insertion. However, when combined with CMR, we can further limit area inflation of the Un/DoPack flow. This improvement can be significant, up to 25% better than CMR alone, as shown in Figure 4.10.Although this particular congestion-aware placement tool causes CMR-CAP Un/DoPack to be slower (4x slower than Base-  43  N o r m a liz e d C r itic a l P a th  4.6. CMR with Congestion-Aware Placer  1 .2 1 .2 1 .1 1 .1 1 .0 1 .0 0 .9 0 .9 0 .8 0 .8 0 .7 0 .7  5  F in M u B M C M  0 5 0  e - G r a in e d ltir e g io n R R  5 0 5 0 5 0 5 0 0 .7  0 .8  0 .9  1 .0  M R C W ( N o r m a liz e d to m a x im u m  M R C W )  Figure 4.8: Critical Path Comparison - Circuit Stdev006 line Un/DoPack), our results indicate that the inclusion of other congestion-driven placement techniques can possibly reduce the area of the Un/DoPack flow. This illustrates that the inclusion of local congestion-driven techniques can further improve, but not replace, the Un/DoPack approach.  44  N o r m a liz e d A r e a  4.6. CMR with Congestion-Aware Placer  1 .2 1 .2 1 .1 1 .1 1 .0 1 .0 0 .9 0 .9 0 .8 0 .8 0 .7 0 .7  5  F in M u C M C M  0 5  e - G r a in e d ltir e g io n R R + C A P  0 5 0 5 0 5 0 5 0 0 .6  0 .7  0 .8  0 .9  1 .0  M R C W ( N o r m a liz e d to m a x im u m  M R C W )  Figure 4.9: CMR with Congestion-Aware Placement - Stdev004  45  N o r m a liz e d A r e a  4.6. CMR with Congestion-Aware Placer  1 .3 1 .3 1 .2 1 .2 1 .1 1 .1 1 .0 1 .0 0 .9 0 .9 0 .8 0 .8 0 .7 0 .7  5  F in M u C M C M  0 5 0  e - G r a in e d ltir e g io n R R + C A P  5 0 5 0 5 0 5 0 5 0 0 .7  0 .8  0 .9  1 .0  M R C W ( N o r m a liz e d to m a x im u m  M R C W )  Figure 4.10: CMR with Congestion-Aware Placement - Stdev006  46  Chapter 5 Conclusions In this thesis we show that we are able to effectively improve the performance of the Un/DoPack flow in area and runtime by effectively utilizing congestion information. This work presented methods to better select congested regions on an FPGA and calculate the amount of depopulation required at each iteration. Our Congestion-Model Multiregion approach is shown to improve runtime by a factor of up to 5.5 times and reduce area by up to 20%, compared to Baseline Un/DoPack. This allowed us to reduce channel-width up to 55%. We also showed that this CAD flow is complementary to using a congestion-aware placement method; the inclusion of a congestion-aware placement tool had positive effects on the overall area performance of Un/DoPack.  5.1 5.1.1  Future Work Influence from Neighbouring Regions  To further improve the accuracy of congestion-driven whitespace insertion technique for each region, future work should consider the influence of depopulating neighbouring regions on the congestion. This is especially an important consideration when neighbouring regions are depopulated with different cluster sizes. The congestion model in this work assumes that the CLBs in each region will eventually be placed adjacent to CLBs from the same region. However, in reality, 47  5.1. Future Work it is possible that CLBs from one region to be placed into adjacent regions during the placement step.  5.1.2  Congestion-Driven Placement and Clustering  This work has shown that additional congestion-aware methods can further improve the area performance of Un/DoPack. Through fine-grained logic placement, SCPlace has been shown to produce more routable solutions at a better runtime than VPR simulated annealing. While SCPlace performs fine-grained placement, it does not introduce whitespace insertion into to help a non-routable solution become routable. The combination of a simultaneous clustering and placement method, combined with incremental placement and Un/DoPack whitespace insertion would further improve runtime and area performance. Additionally, a large part of the runtime cost for each iteration is due to the time required for routing. We attempted to extract local interconnect demand information, using an estimator based on the Area-Per-Wirelength model [31]. While it is able to locate congested regions somewhat adequately, it does not provide the very accurate congestion information needed for our congestion-driven whitespace insertion. Additional research into estimation methods which estimate local congestion and routability information will help reduce the need to rely on full routing to determine accurate congestion information.  48  Bibliography [1] Elias Ahmed and Jonathan Rose. The effect of LUT and cluster size on deepsubmicron FPGA performance and density. In IEEE Transactions on Very Large Scale Integration (VLSI) Systems, pages 3–12, Monterey, California, United States, 2000. ISBN 1-58113-193-3. URL http://portal.acm.org/ citation.cfm?id=329166.329171. [2] V. Betz and J. Rose. Cluster-based logic blocks for FPGAs: area-efficiency vs. input sharing and size. In Custom Integrated Circuits Conference, pages 551–554, 1997. doi: {10.1109/CICC.1997.606687}. URL http://www.eecg. toronto.edu/~vaughn/papers/cicc97.pdf. [3] Vaughn Betz, Jonathan Rose, and Alexander Marquardt, editors. Architecture and CAD for Deep-Submicron FPGAs. Kluwer Academic Publishers, 1999. ISBN 0792384601. URL http://portal.acm.org/citation.cfm?id= 553523. [4] Elaheh Bozorgzadeh, Seda Ogrenci-Memik, and Majid Sarrafzadeh. RPack: routability-driven packing for cluster-based FPGAs.  In Asia and South  Pacific Design Automation Conference, pages 629–634, Yokohama, Japan, 2001. ACM.  ISBN 0-7803-6634-4.  doi: 10.1145/370155.370567.  URL  http://portal.acm.org/citation.cfm?id=370567. [5] Ulrich Brenner and Andre Rohe. An effective congestion driven placement 49  Chapter 5. Bibliography framework. In International Symposium on Physical Design, pages 6–11, San Diego, CA, USA, 2002. ACM. ISBN 1-58113-460-6. doi: 10.1145/505388. 505391. URL http://portal.acm.org/citation.cfm?id=505391. [6] Yao-Wen Chang, Shashidhar Thakur, Kai Zhu, and D. F. Wong. A new global routing algorithm for FPGAs. In IEEE/ACM International Conference on Computer-Aided Design, pages 356–361, San Jose, California, United States, 1994. IEEE Computer Society Press. ISBN 0-89791-690-5. URL http:// portal.acm.org/citation.cfm?id=191489. [7] Gang Chen and Jason Cong. Simultaneous timing driven clustering and placement for FPGAs. In Field Programmable Logic and Application, pages 158– 167. Springer Berlin / Heidelberg, 2004. URL http://www.springerlink. com/content/gyahkply7ltv78eu. [8] J. Cong and Yuzheng Ding. FlowMap: an optimal technology mapping algorithm for delay optimization in lookup-table based FPGA designs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 13(1):1–12, 1994. ISSN 0278-0070. doi: 10.1109/43.273754. URL http://citeseerx.ksu.edu.sa/viewdoc/summary?doi=10.1.1.22.9473. [9] Jason Cong and Yuzheng Ding. On area/depth trade-off in LUT-based FPGA technology mapping. In International Design Automation Conference, pages 213–218, Dallas, Texas, United States, 1993. ACM. ISBN 0-89791-577-1. doi: 10.1145/157485.164675. URL http://portal.acm.org/citation.cfm?id= 164675. [10] Jason Cong and Yean-Yow Hwang.  Simultaneous depth and area mini-  mization in LUT-based FPGA mapping. In ACM International Sympo50  Chapter 5. Bibliography sium on Field-Programmable Gate Arrays, pages 68–74, Monterey, California, United States, 1995. ACM. ISBN 0-89791-743-X. doi: 10.1145/ 201310.201322. URL http://portal.acm.org/citation.cfm?id=201310. 201322&coll=GUIDE&dl=GUIDE&CFID=47964849&CFTOKEN=55429101. [11] Jason Cong and Yean-Yow Hwang. tion  for  designs.  depth-optimal  technology  Structural gate decomposi-  mapping  in  LUT-based  FPGA  ACM Transactions on Design Automation of Electronic  Systems, 5(2):193–225, 2000.  doi:  10.1145/335043.335045.  URL  http://portal.acm.org/citation.cfm?id=335043.335045&coll= GUIDE&dl=GUIDE&CFID=47964849&CFTOKEN=55429101. [12] Andre DeHon. Balancing interconnect and computation in a reconfigurable computing array (or, why you don’t really want 100% LUT utilization). In ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 69–78, Monterey, California, United States, 1999. ACM. ISBN 158113-088-0. doi: 10.1145/296399.296431. URL http://portal.acm.org/ citation.cfm?id=296431. [13] Wei Mark Fang and Jonathan Rose. Modeling routing demand for early-stage FPGA architecture development. In ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 139–148, Monterey, California, USA, 2008. ACM. ISBN 978-1-59593-934-0. doi: 10.1145/1344671.1344694. URL http://portal.acm.org/citation.cfm?id=1344671.1344694. [14] A.E. Gamal. Two-dimensional stochastic model for interconnections in master slice integrated circuits. IEEE Transactions on Circuits and Systems, 28 (2):127–138, 1981. ISSN 0098-4094. URL http://ieeexplore.ieee.org/ xpl/freeabs_all.jsp?&arNumber=1084958. 51  Chapter 5. Bibliography [15] Chi-Chou Kao and Yen-Tai Lai.  An efficient algorithm for finding the  minimal-area FPGA technology mapping. ACM Transactions on Design Automation of Electronic Systems, 10(1):168–186, 2005. doi: 10.1145/1044111. 1044121.  URL http://portal.acm.org/citation.cfm?id=1044111.  1044121&coll=GUIDE&dl=GUIDE&CFID=47964849&CFTOKEN=55429101. [16] Julien Lamoureux. On the interaction between Power-Aware ComputerAided design algorithms for Field-Programable gate arrays. Master’s thesis, University of British Columbia, June 2003. URL http://www.ece.ubc.ca/ ~stevew/papers/pdf/lamoureux_masc.pdf. [17] Guy G. F. Lemieux, Stephen D. Brown, and Daniel Vranesic. two-step routing for FPGAS.  On  In International Symposium on Phys-  ical Design,  pages 60–66,  Napa Valley,  1997. ACM.  ISBN 0-89791-927-0.  California,  United States,  doi: 10.1145/267665.267682.  URL  http://portal.acm.org/citation.cfm?id=267665.267682&coll= GUIDE&dl=GUIDE&CFID=48052556&CFTOKEN=40588312. [18] D. Leong and G. Lemieux. RePlace: an incremental placement algorithm for field-programmable gate arrays. In International Conference on Field Programmable Logic and Applications, 2009. [19] Jai-Ming Lin, Song-Ra Pan, and Yao-Wen Chang. Graph matching-based algorithms for array-based FPGA segmentation design and routing.  In  Asia and South Pacific Design Automation Conference, pages 851–854, Kitakyushu, Japan, 2003. ACM. ISBN 0-7803-7660-9. doi: 10.1145/1119772. 1119959.  URL http://portal.acm.org/citation.cfm?id=1119772.  1119959&coll=GUIDE&dl=GUIDE&CFID=48052556&CFTOKEN=40588312.  52  Chapter 5. Bibliography [20] L. McMurchie and C. Ebeling. PathFinder: a negotiation-based performancedriven router for FPGAs.  In ACM International Symposium on Field-  Programmable Gate Arrays, pages 111–117, 1995. URL http://ieeexplore. ieee.org/xpls/abs_all.jsp?arnumber=1377269. [21] MCNC. LGSynth93 benchmark suite. Technical report, Microelectronics Centre of North Carolina, 1993. [22] M. Palczewski. Plane parallel a maze router and its application to FPGAs. In ACM/IEEE Design Automation Conference, pages 691–697, Anaheim, California, United States, 1992. IEEE Computer Society Press. ISBN 089791-516-X. URL http://portal.acm.org/citation.cfm?id=149679. [23] A. Pandit and A. Akoglu. Net length based routability driven packing. In International Conference on Field-Programmable Technology, pages 225–232, 2007. doi: {10.1109/FPT.2007.4439253}. URL http://ieeexplore.ieee. org/xpls/abs_all.jsp?arnumber=4439253. [24] G. Parthasarathy, M. Marek-Sadowska, Arindam Mukherjee, and Amit Singh. Interconnect complexity-aware FPGA placement using Rent’s rule. In International Workshop on System-Level Interconnect Prediction, pages 115–121, Sonoma, California, United States, 2001. ACM. ISBN 1-58113-3154. doi: 10.1145/368640.368806. URL http://portal.acm.org/citation. cfm?id=368806. [25] J. Rose. Parallel global routing for standard cells. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 9(10):1085–1095, 1990. ISSN 0278-0070. doi: 10.1109/43.62733. URL http://ieeexplore. ieee.org/xpls/abs_all.jsp?arnumber=62733. 53  Chapter 5. Bibliography [26] Amit Singh and Malgorzata Marek-Sadowska. Efficient circuit clustering for area and power reduction in FPGAs. In ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 59–66, Monterey, California, USA, 2002. ACM. ISBN 1-58113-452-5. doi: 10.1145/503048.503058. URL http://portal.acm.org/citation.cfm?id=503058. [27] Russell Tessier and Heather Giza. Balancing logic utilization and area efficiency in FPGAs. In International Conference on Field-Programmable Logic and Applications, pages 535–544. Springer-Verlag, 2000. ISBN 3-540-67899-9. URL http://portal.acm.org/citation.cfm?id=739548. [28] M. Tom. Channel width reduction techniques for System-on-Chip circuits in field-programmable gate arrays. Master’s thesis, University of British Columbia, April 2006. [29] Marvin Tom, David Leong, and Guy Lemieux. Un/DoPack: re-clustering of large system-on-chip designs with interconnect variation for low-cost FPGAs. In IEEE/ACM International Conference on Computer-Aided Design, pages 680–687, San Jose, California, 2006. ACM. ISBN 1-59593-389-1. doi: 10. 1145/1233501.1233643. URL http://portal.acm.org/citation.cfm?id= 1233643. [30] Mark Yamashita. for FPGAs. ber 2007.  A combined clustering and placement algorithm  Master’s thesis, University of British Columbia, NovemURL http://www.ece.ubc.ca/~lemieux/publications/  yamashita-masc2007.pdf. [31] David Yeager, Darius Chiu, and Guy Lemieux. Congestion estimation and localization in FPGAs: a visual tool for interconnect prediction. In In54  ternational Workshop on System Level Interconnect Prediction, pages 33– 40, Austin, Texas, USA, 2007. ACM.  ISBN 978-1-59593-622-6.  doi:  10.1145/1231956.1231963. URL http://portal.acm.org/citation.cfm? id=1231963. [32] Yue Zhuo, Hao Li, and S.P. Mohanty. A congestion driven placement algorithm for FPGA synthesis. In International Conference on Field Programmable Logic and Applications, pages 1–4, 2006. doi: {10.1109/FPL. 2006.311290}.  URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?  arnumber=4101052.  55  Appendix A Baseline Un/DoPack CW 100 95 90 85 80 75 70 65  Runtime 2867 6001 6506 9518 13616 19318 29127 52205  CLBs 3148 3596 3911 4449 4863 5445 6392 7290  CP 5.06E-08 5.26E-08 5.53E-08 5.70E-08 6.15E-08 5.90E-08 5.96E-08 6.46E-08  Area 1.82E+08 2.00E+08 2.15E+08 2.37E+08 2.51E+08 2.75E+08 3.14E+08 3.52E+08  Table A.1: Baseline Un/DoPack - Stdev0  CW 105 100 95 90 85 80 75 70 65 60  Runtime 3286 3085 3470 3163 5599 7888 10330 17390 26359 35047  CLBs 3157 3157 3290 3298 3683 4220 4656 5351 6029 6867  CP 5.28E-08 5.28E-08 5.74E-08 5.34E-08 6.03E-08 5.67E-08 5.87E-08 5.75E-08 5.91E-08 6.31E-08  Area 1.87E+08 1.82E+08 1.85E+08 1.82E+08 1.97E+08 2.18E+08 2.37E+08 2.65E+08 2.90E+08 3.21E+08  Table A.2: Baseline Un/DoPack - Stdev002  56  Appendix A. Baseline Un/DoPack  CW 100 95 90 85 80 75 70 65  Runtime 2977 3136 3491 4113 7093 13240 16985 31989  CLBs 3302 3273 3298 3409 3961 5014 5196 6195  CP 5.79E-08 5.98E-08 6.01E-08 6.14E-08 5.87E-08 5.78E-08 5.68E-08 5.89E-08  Area 1.90E+08 1.85E+08 1.82E+08 1.83E+08 2.04E+08 2.53E+08 2.58E+08 2.98E+08  Table A.3: Baseline Un/DoPack - Stdev004  CW 95 90 85 80 75 70 65  Runtime 2772 3045 5943 7908 12817 20791 30157  CLBs 3139 3139 3889 4090 4916 6067 6443  CP 5.29E-08 5.50E-08 5.57E-08 5.65E-08 5.90E-08 6.02E-08 5.98E-08  Area 1.78E+08 1.74E+08 2.09E+08 2.11E+08 2.50E+08 2.98E+08 3.11E+08  Table A.4: Baseline Un/DoPack - Stdev006  CW 125 120 115 110 105 100 95 90 85 80 75 70  Runtime 3548 4178 3600 4399 5174 4645 11595 12227 17837 23261 35152 50869  CLBs 3151 3286 3282 3289 3560 3412 4426 4281 4832 4999 5593 6654  CP 5.21E-08 5.60E-08 5.92E-08 5.27E-08 5.78E-08 5.48E-08 5.72E-08 6.13E-08 5.94E-08 5.78E-08 5.99E-08 6.06E-08  Area 2.02E+08 2.05E+08 2.02E+08 1.97E+08 2.08E+08 1.96E+08 2.47E+08 2.35E+08 2.58E+08 2.58E+08 2.83E+08 3.28E+08  Table A.5: Baseline Un/DoPack - Clone  57  Appendix A. Baseline Un/DoPack  CW 165 155 150 145 140 135 130 125 120 115 110 105 100 95 90 85  Runtime 3544 3105 4100 4154 4309 5236 6200 6618 9696 9348 15258 15797 18072 30961 31815 39060  CLBs 3152 3152 3304 3307 3282 3460 3469 3434 3573 3665 3839 3952 4015 4813 4879 5170  CP 5.81E-08 5.94E-08 5.76-08 5.66E-08 5.92E-08 6.39E-08 6.00E-08 6.10E-08 6.08E-08 6.27E-08 6.21E-08 6.33E-08 6.50E-08 6.49E-08 6.48E-08 6.26E-08  Area 2.35E+08 2.26E+08 2.31E+08 2.27E+08 2.22E+08 2.27E+08 2.24E+08 2.18E+08 2.21E+08 2.24E+08 2.28E+08 2.30E+08 2.31E+08 2.69E+08 2.66E+08 2.75E+08  Table A.6: Baseline Un/DoPack - Stdev010  CW 155 150 145 140 135 130 125 120 115 110 105 100  Runtime 4131 3588 4081 4918 6957 7994 9816 13155 18225 28230 30233 39021  CLBs 3304 3163 3163 3313 3600 3586 3750 3943 4501 4861 4628 5095  CP 5.93E-08 5.74E-08 5.82E-08 5.84E-08 5.96E-08 5.87E-08 6.22E-08 6.09E-08 6.28E-08 6.35E-08 6.41E-08 6.12E-08  Area 2.35E+08 2.22E+08 2.19E+08 2.23E+08 2.36E+08 2.31E+08 2.39E+08 2.44E+08 2.76E+08 2.89E+08 2.73E+08 2.92E+08  Table A.7: Baseline Un/DoPack - Stdev012  58  Appendix B Fine-Grained Un/DoPack CW 100 95 90 85 80 75 70 65  Runtime 3131 5611 10794 19360 47615 62504 106495 203021  CLBs 3148 3223 3404 3618 4236 4410 5135 5817  CP 5.06E-08 5.28E-08 5.42E-08 5.46E-08 5.89E-08 6.17E-08 6.05E-08 6.09E-08  Area 1.82E+08 1.80E+08 1.88E+08 1.95E+08 2.21E+08 2.24E+08 2.53E+08 2.81E+08  Table B.1: Fine-Grained Un/DoPack - Stdev0  CW 105 100 95 90 85 80 75 70 65 60  Runtime 2572 2853 3294 5124 6820 21225 31006 55206 92211 203822  CLBs 3157 3157 3178 3253 3286 3662 3853 4383 4834 5931  CP 5.28E-08 5.28E-08 5.55E-08 5.83E-08 5.65E-08 5.78E-08 5.76E-08 5.60E-08 6.02E-08 6.05E-08  Area 1.87E+08 1.82E+08 1.79E+08 1.80E+08 1.77E+08 1.90E+08 1.97E+08 2.17E+08 2.33E+08 2.80E+08  Table B.2: Fine-Grained Un/DoPack - Stdev002  59  Appendix B. Fine-Grained Un/DoPack  CW 100 95 90 85 80 75 70 65  Runtime 3178 3536 4911 5970 16577 19323 62945 96523  CLBs 3169 3188 3220 3261 3547 3600 4558 5020  CP 5.68E-08 5.83E-08 5.58E-08 5.70E-08 5.52E-08 5.62E-08 5.90E-08 5.63E-08  Area 1.83E+08 1.79E+08 1.76E+08 1.76E+08 1.84E+08 1.82E+08 2.25E+08 2.41E+08  Table B.3: Fine-Grained Un/DoPack - Stdev004  CW 95 90 85 80 75 70 65  Runtime 2749 2639 8399 18600 35314 50198 80474  CLBs 3139 3139 3356 3682 4031 4447 4812  CP 5.29E-08 5.50E-08 5.57E-08 5.61E-08 5.63E-08 5.73E-08 6.10E-08  Area 1.78E+08 1.74E+08 1.79E+08 1.91E+08 2.05E+08 2.19E+08 2.33E+08  Table B.4: Fine-Grained Un/DoPack - Stdev006  CW 125 120 115 110 105 100 95 90 85 80 75  Runtime 3074 3343 3386 5849 6670 10818 15808 24808 44691 87193 112947  CLBs 3151 3172 3172 3229 3262 3348 3441 3616 3870 4359 4758  CP 5.22E-08 5.37E-08 5.48E-08 5.33E-08 5.41E-08 5.42E-08 5.48E-08 5.55E-08 6.07E-08 5.74E-08 5.64E-08  Area 2.02E+08 1.98E+08 1.95E+08 1.92E+08 1.93E+08 1.91E+08 1.92E+08 2.00E+08 2.08E+08 2.28E+08 2.40E+08  Table B.5: Fine-Grained Un/DoPack - Clone  60  Appendix B. Fine-Grained Un/DoPack  CW 165 155 150 145 140 135 130 125 120 115 110 105 100 95 90  Runtime 3202 3237 4026 5523 8736 9679 11838 18518 18326 26796 40918 60074 74675 107440 148997  CLBs 3152 3152 3166 3183 3236 3259 3284 3371 3379 3441 3613 3639 3856 4181 4348  CP 5.81E-08 5.94E-08 6.10E-08 6.01E-08 6.07E-08 5.89E-08 6.43E-08 6.05E-08 6.28E-08 6.18E-08 6.26E-08 6.39E-08 6.38E-08 6.77E-08 6.59E-08  Area 2.35E+08 2.26E+08 2.22E+08 2.19E+08 2.17E+08 2.17E+08 2.14E+08 2.16E+08 2.12E+08 2.10E+08 2.18E+08 2.14E+08 2.22E+08 2.33E+08 2.37E+08  Table B.6: Fine-Grained Un/DoPack - Stdev010  CW 155 150 145 140 135 130 125 120 115 110 105 100  Runtime 4323 3346 3644 6043 8672 15415 23355 37426 56216 66307 103231 149969  CLBs 3184 3163 3163 3196 3251 3356 3418 3558 3781 3969 4233 4412  CP 5.83E-08 5.75E-08 5.82E-08 5.92E-08 6.07E-08 5.83E-08 6.05E-08 6.46E-08 6.11E-08 6.21E-08 6.16E-08 6.54E-08  Area 2.27E+08 2.22E+08 2.19E+08 2.16E+08 2.17E+08 2.16E+08 2.18E+08 2.21E+08 2.31E+08 2.35E+08 2.50E+08 2.53E+08  Table B.7: Fine-Grained Un/DoPack - Stdev012  61  Appendix C Multiregion Un/DoPack CW 100 95 90 85 80 75 70 65  Runtime 2567 4438 6894 9903 12917 10678 8682 10746  CLBs 3148 3434 3667 5148 6095 6096 6102 7174  CP 5.06E-08 5.26E-08 5.87E-08 5.76E-08 5.84E-08 5.98E-08 5.65E-08 5.86E-08  Area 1.82E+08 1.92E+08 2.01E+08 2.74E+08 3.17E+08 3.10E+08 3.02E+08 3.45E+08  Table C.1: Multiregion Un/DoPack - Stdev0  CW 105 100 95 90 85 80 75 70 65 60  Runtime 2517 3143 2719 3630 3461 8710 7164 4620 6528 10319  CLBs 3157 3157 3195 3284 3455 4177 4531 4695 6328 8812  CP 5.28E-08 5.28E-08 5.69E-08 5.56E-08 5.47E-08 5.67E-08 6.17E-08 5.96E-08 6.09E-08 5.99E-08  Area 1.87E+08 1.82E+08 1.79E+08 1.81E+08 1.85E+08 2.16E+08 2.30E+08 2.32E+08 3.05E+08 4.11E+08  Table C.2: Multiregion Un/DoPack - Stdev002  62  Appendix C. Multiregion Un/DoPack  CW 100 95 90 85 80 75 70 65  Runtime 3571 4066 4784 4233 8429 8477 9832 7965  CLBs 3171 3217 3355 3304 4128 4968 6438 7167  CP 5.58E-08 5.72E-08 5.89E-08 5.72E-08 5.83E-08 6.02E-08 5.76E-08 6.23E-08  Area 1.83E+08 1.80E+08 1.83E+08 1.77E+08 2.15E+08 2.52E+08 3.18E+08 3.45E+08  Table C.3: Multiregion Un/DoPack - Stdev004  CW 95 90 85 80 75 70 65  Runtime 3142 2945 8741 11644 9591 9060 5008  CLBs 3139 3139 3713 5385 5407 6561 6824  CP 5.29E-08 5.50E-08 5.86E-08 5.94E-08 6.62E-08 5.74E-08 6.40E-08  Area 1.78E+08 1.74E+08 1.98E+08 2.79E+08 2.74E+08 3.22E+08 3.29E+08  Table C.4: Multiregion Un/DoPack - Stdev006  CW 125 120 115 110 105 100 95 90 85 80 75 70  Runtime 3616 3056 4217 5505 6832 4312 6314 9575 7616 11451 8669 16623  CLBs 3151 3165 3206 3256 3384 3309 3620 4101 4634 5539 6159 7240  CP 5.22E-08 5.39E-08 5.37E-08 5.56E-08 5.70E-08 5.55E-08 5.79E-08 5.63E-08 5.81E-08 6.53E-08 6.26E-08 6.23E-08  Area 2.02E+08 1.98E+08 1.96E+08 1.97E+08 2.00E+08 1.90E+08 2.04E+08 2.26E+08 2.49E+08 2.87E+08 3.12E+08 3.58E+08  Table C.5: Multiregion Un/DoPack - Clone  63  Appendix C. Multiregion Un/DoPack  CW 165 155 150 145 140 135 130 125 120 115 110 105 100 95 90 85  Runtime 2834 3488 3792 6099 11691 5515 6878 8846 8932 7047 9549 9432 10711 10807 14656 13677  CLBs 3152 3152 3182 3222 3227 3257 3251 3427 3537 3576 3771 3852 4284 4492 5300 6350  CP 5.81E-08 5.94E-08 5.85E-08 6.13E-08 5.89E-08 5.98E-08 6.16E-08 6.11E-08 6.36E-08 6.06E-08 6.40E-08 6.32E-08 6.12E-08 6.34E-08 6.42E-08 6.52E-08  Area 2.35E+08 2.26E+08 2.23E+08 2.21E+08 2.16E+08 2.17E+08 2.13E+08 2.18E+08 2.20E+08 2.18E+08 2.26E+08 2.28E+08 2.46E+08 2.53E+08 2.89E+08 3.38E+08  Table C.6: Multiregion Un/DoPack - Stdev010  CW 155 150 145 140 135 130 125 120 115 110 105 100  Runtime 4589 3402 3755 7819 6325 17459 17165 13467 14687 14980 15235 28027  CLBs 3168 3163 3163 3294 3331 3595 3793 4006 4402 4569 4855 5266  CP 5.86E-08 5.75E-08 5.82E-08 6.24E-08 5.93E-08 6.07E-08 6.22E-08 6.42E-08 6.23E-08 6.27E-08 6.96E-08 6.51E-08  Area 2.27E+08 2.22E+08 2.19E+08 2.23E+08 2.19E+08 2.32E+08 2.41E+08 2.50E+08 2.69E+08 2.72E+08 2.83E+08 3.01E+08  Table C.7: Multiregion Un/DoPack - Stdev012  64  Appendix D Budgeted Multiregion Un/DoPack CW 100 95 90 85 80 75 70 65  Runtime 3018 3559 5284 7387 11016 14184 24214 34074  CLBs 3148 3248 3598 3833 4352 4366 5472 5503  CP 5.06E-08 5.29E-08 5.29E-08 5.37E-08 5.43E-08 5.49E-08 5.96E-08 5.70E-08  Area 1.82E+08 1.81E+08 1.96E+08 2.04E+08 2.24E+08 2.23E+08 2.69E+08 2.78E+08  Table D.1: BMR Un/DoPack - Stdev0  CW 105 100 95 90 85 80 75 70 65 60  Runtime 2463 2342 3664 2772 4347 6258 10603 14223 18115 28257  CLBs 3157 3157 3247 3249 3276 3708 4220 4621 5040 5743  CP 5.28E-08 5.28E-08 5.67E-08 5.48E-08 5.61E-08 5.64E-08 5.64E-08 5.57E-08 5.90E-08 6.52E-08  Area 1.87E+08 1.82E+08 1.81E+08 1.77E+08 1.77E+08 1.92E+08 2.13E+08 2.27E+08 2.42E+08 1.69E+08  Table D.2: BMR Un/DoPack - Stdev002  65  Appendix D. Budgeted Multiregion Un/DoPack  CW 100 95 90 85 80 75 70 65  Runtime 2875 3699 3890 3986 6901 10066 15015 19572  CLBs 3246 3171 3363 3478 3618 4115 4898 5023  CP 5.69E-08 5.76E-08 5.76E-08 5.51E-08 5.82E-08 6.13E-08 6.38E-08 5.80E-08  Area 1.85E+08 1.79E+08 1.83E+08 1.85E+08 1.89E+08 2.10E+08 2.41E+08 2.42E+08  Table D.3: BMR Un/DoPack - Stdev004  CW 95 90 85 80 75 70 65  Runtime 2354 2357 6924 8953 8036 13754 22104  CLBs 3139 3139 3621 3967 3765 4619 5141  CP 5.29E-08 5.50E-08 5.56E-08 5.43E-08 5.51E-08 5.67E-08 5.63E-08  Area 1.78E+08 1.74E+08 1.95E+08 2.05E+08 1.92E+08 2.27E+08 2.48E+08  Table D.4: BMR Un/DoPack - Stdev006  CW 125 120 115 110 105 100 95 90 85 80 75 70  Runtime 2864 3349 3756 4786 5155 5267 8672 9089 12550 19763 23908 32674  CLBs 3151 3249 3248 3362 3478 3475 3599 3835 4221 4896 5327 5761  CP 5.22E-08 5.22E-08 5.57E-08 5.53E-08 5.89E-08 5.61E-08 5.60E-08 6.08E-08 5.62E-08 5.94E-08 5.99E-08 5.85E-08  Area 2.02E+08 2.01E+08 1.97E+08 2.00E+08 2.03E+08 1.98E+08 2.00E+08 2.09E+08 2.25E+08 2.52E+08 2.69E+08 2.83E+08  Table D.5: BMR Un/DoPack - Clone  66  Appendix D. Budgeted Multiregion Un/DoPack  CW 165 155 150 145 140 135 130 125 120 115 110 105 100 95 90 85  Runtime 3553 3022 4665 5587 5473 5607 6316 7346 8355 9190 11261 14383 16032 23491 28701 41871  CLBs 3152 3152 3247 3286 3310 3246 3360 3474 3480 3384 3499 3714 3837 4217 4487 4517  CP 5.81E-08 5.94E-08 6.17E-08 5.87E-08 6.31E-08 6.18E-08 6.35E-08 6.17E-08 6.48E-08 6.06E-08 6.05E-08 6.13E-08 6.62E-08 6.72E-08 6.59E-08 6.23E-08  Area 2.35E+08 2.26E+08 2.25E+08 2.27E+08 2.23E+08 2.13E+08 2.16E+08 2.19E+08 2.15E+08 2.08E+08 2.11E+08 2.16E+08 2.19E+08 2.34E+08 2.44E+08 2.43E+08  Table D.6: BMR Un/DoPack - Stdev010  CW 155 150 145 140 135 130 125 120 115 110 105 100  Runtime 4559 3295 3423 4541 5958 8241 8883 13624 17338 20845 31227 41038  CLBs 3230 3163 3163 3248 3358 3478 3475 3840 3957 4093 4486 4896  CP 5.85E-08 5.75E-08 5.82E-08 5.72E-08 5.90E-08 5.94E-08 6.20E-08 6.06E-08 6.19E-08 6.05E-08 6.11E-08 6.61E-08  Area 2.29E+08 2.22E+08 2.19E+08 2.17E+08 2.20E+08 2.24E+08 2.19E+08 2.37E+08 2.40E+08 2.43E+08 2.61E+08 2.78E+08  Table D.7: BMR Un/DoPack - Stdev012  67  Appendix E Congestion-Driven Multiregion Un/DoPack CW 100 95 90 85 80 75 70 65  Runtime 2758 4464 5133 3644 7905 8154 11176 9519  CLBs 3148 3254 3519 3515 4794 4392 6011 6457  CP 5.06E-08 5.31E-08 5.50E-08 5.16E-08 6.12E-08 5.42E-08 5.82E-08 5.63E-08  Area 1.82E+08 1.84E+08 1.94E+08 1.89E+08 2.49E+08 2.24E+08 2.96E+08 3.12E+08  Table E.1: CMR Un/DoPack - Stdev0  CW 105 100 95 90 85 80 75 70 65 60  Runtime 2940 2641 3081 3615 4361 6365 5382 6155 6848 8921  CLBs 3157 3157 3203 3243 3437 4250 4129 4767 5017 6115  CP 5.28E-08 5.28E-08 5.66E-08 5.53E-08 5.53E-08 5.75E-08 5.73E-08 5.48E-08 5.99E-08 6.43E-08  Area 1.87E+08 1.82E+08 1.79E+08 1.77E+08 1.84E+08 2.21E+08 2.10E+08 2.37E+08 2.41E+08 2.88E+08  Table E.2: CMR Un/DoPack - Stdev002  68  Appendix E. Congestion-Driven Multiregion Un/DoPack  CW 100 95 90 85 80 75 70 65  Runtime 3571 2953 4320 4734 6107 6027 6276 7227  CLBs 3171 3171 3390 3610 3815 4492 4567 5163  CP 5.58E-08 5.76E-08 6.24E-08 5.81E-08 6.00E-08 5.85E-08 5.78E-08 6.22E-08  Area 1.83E+08 1.79E+08 1.87E+08 1.95E+08 1.97E+08 2.29E+08 2.26E+08 2.48E+08  Table E.3: CMR Un/DoPack - Stdev004  CW 95 90 85 80 75 70 65  Runtime 2518 2633 5669 4295 7822 6953 10167  CLBs 3139 3139 3771 3718 4368 4737 6615  CP 5.29E-08 5.50E-08 5.58E-08 5.53E-08 5.70E-08 5.60E-08 6.05E-08  Area 1.78E+08 1.74E+08 2.02E+08 1.92E+08 2.23E+08 2.33E+08 3.19E+08  Table E.4: CMR Un/DoPack - Stdev006  CW 125 120 115 110 105 100 95 90 85 80 75 70  Runtime 2770 3493 3375 4636 4953 4968 5499 7454 7279 7611 8315 10724  CLBs 3151 3195 3174 3239 3419 3604 3757 3855 4400 4581 5120 6511  CP 5.22E-08 5.38E-08 5.63E-08 5.41E-08 5.41E-08 5.98E-08 5.77E-08 5.81E-08 5.70E-08 6.03E-08 6.12E-08 6.12E-08  Area 2.02E+08 1.99E+08 1.95E+08 1.93E+08 2.01E+08 2.08E+08 2.11E+08 2.13E+08 2.36E+08 2.37E+08 2.59E+08 3.20E+08  Table E.5: CMR Un/DoPack - Clone  69  Appendix E. Congestion-Driven Multiregion Un/DoPack  CW 165 155 150 145 140 135 130 125 120 115 110 105 100 95 90 85  Runtime 0 3835 4638 4933 6033 6333 5683 6167 8045 9024 9552 9127 12048 11672 11766 10788  CLBs 0 3152 3191 3195 3258 3313 3318 3346 3483 3575 3722 3740 4144 4404 4770 5025  CP 5.94E-08 6.03E-08 6.01E-08 6.11E-08 6.12E-08 6.14E-08 6.00E-08 6.19E-08 6.16E-08 6.13E-08 6.26E-08 6.18E-08 6.54E-08 6.43E-08 6.75E-08  Area 0.00E+00 2.26E+08 2.23E+08 2.20E+08 2.22E+08 2.19E+08 2.15E+08 2.12E+08 2.19E+08 2.18E+08 2.24E+08 2.21E+08 2.38E+08 2.47E+08 2.63E+08 2.67E+08  Table E.6: CMR Un/DoPack - Stdev010  CW 155 150 145 140 135 130 125 120 115 110 105 100  Runtime 4897 3568 3901 5579 6569 8228 7529 8830 12320 10783 15059 21536  CLBs 3186 3163 3163 3202 3285 3401 3515 3729 4014 4308 4619 5223  CP 5.73E-08 5.75E-08 5.82E-08 6.33E-08 6.24E-08 6.04E-08 5.87E-08 5.87E-08 6.06E-08 6.20E-08 6.27E-08 6.44E-08  Area 2.27E+08 2.22E+08 2.19E+08 2.16E+08 2.18E+08 2.22E+08 2.24E+08 2.34E+08 2.46E+08 2.57E+08 2.69E+08 3.00E+08  Table E.7: CMR Un/DoPack - Stdev012  70  Appendix F CMR Un/DoPack with Congestion-Aware Placement CW 95 90 85 80 75 70 65  Runtime 11541 14170 16844 21482 32455 39878 42141  CLBs 3148 3368 3740 4113 4230 4686 5302  CP 5.88E-08 5.65E-08 5.95E-08 6.24E-08 6.38E-08 7.01E-08 6.57E-08  Area 1.78E+08 1.87E+08 2.01E+08 2.14E+08 2.16E+08 2.32E+08 2.55E+08  Table F.1: CMR Un/DoPack (Congestion Aware Placement) - Stdev0  CW 105 100 95 90 85 80 75 70  Runtime 11020 10988 11022 10980 16253 21771 38213 43487  CLBs 3157 3157 3157 3157 3304 3588 3754 4265  CP 5.70E-08 5.68E-08 6.03E-08 5.70E-08 5.62E-08 6.20E-08 6.04E-08 6.79E-08  Area 1.87E+08 1.82E+08 1.78E+08 1.75E+08 1.77E+08 1.85E+08 1.91E+08 2.11E+08  Table F.2: CMR Un/DoPack (Congestion Aware Placement) - Stdev002  71  Appendix F. CMR Un/DoPack with Congestion-Aware Placement  CW 100 95 90 85 80 75 70 65  Runtime 10098 9939 13714 14727 19897 24217 40794 42435  CLBs 3148 3148 3263 3387 3569 3606 4095 4617  CP 5.91E-08 5.91E-08 5.81E-08 5.74E-08 6.04E-08 6.03E-08 5.82E-08 6.03E-08  Area 1.82E+08 1.78E+08 1.81E+08 1.83E+08 1.85E+08 1.85E+08 2.01E+08 2.22E+08  Table F.3: CMR Un/DoPack (Congestion Aware Placement) - Stdev004  CW 95 90 85 80 75 70 65  Runtime 10159 10413 12909 17524 20202 23249 40967  CLBs 3139 3139 3279 3912 3859 4340 4987  CP 6.48E-08 6.48E-08 5.96E-08 6.62E-08 6.90E-08 6.77E-08 6.70E-08  Area 1.78E+08 1.74E+08 1.77E+08 2.03E+08 1.97E+08 2.14E+08 2.41E+08  Table F.4: CMR Un/DoPack (Congestion Aware Placement) - Stdev006  CW 125 120 115 110 105 100 95 90 85 80 75 70  Runtime 10657 10669 10574 10733 12792 13492 15501 20186 25313 32880 42864 44757  CLBs 3151 3151 3151 3151 3252 3316 3384 3825 4204 4244 5226 5429  CP 6.19E-08 6.22E-08 6.24E-08 6.35E-08 6.46E-08 5.98E-08 6.39E-08 6.69E-08 6.70E-08 6.72E-08 6.70E-08 6.93E-08  Area 2.02E+08 1.98E+08 1.94E+08 1.90E+08 1.93E+08 1.90E+08 1.91E+08 2.09E+08 2.24E+08 2.21E+08 2.66E+08 2.67E+08  Table F.5: CMR Un/DoPack (Congestion Aware Placement) - Clone  72  Appendix F. CMR Un/DoPack with Congestion-Aware Placement  CW 165 155 150 145 140 135 130 125 120 115 110 105 100 95 90 85  Runtime 11513 11622 11137 11820 11385 16021 15098 17910 16409 17978 19868 20037 23066 21746 30465 39128  CLBs 3152 3152 3152 3152 3152 3297 3339 3382 3447 3579 3777 3826 4175 4316 5035 5369  CP 6.85E-08 6.85E-08 6.94E-08 6.98E-08 7.00E-08 6.45E-08 6.91E-08 6.77E-08 7.18E-08 7.47E-08 7.14E-08 7.37E-08 8.02E-08 8.01E-08 7.71E-08 7.84E-08  Area 2.35E+08 2.26E+08 2.22E+08 2.18E+08 2.14E+08 2.18E+08 2.16E+08 2.17E+08 2.14E+08 2.18E+08 2.26E+08 2.23E+08 2.39E+08 2.41E+08 2.74E+08 2.88E+08  Table F.6: CMR Un/DoPack (Congestion Aware Placement) - Stdev010  CW 155 150 145 140 135 130 125 120 115 110 105 100  Runtime 13152 13351 13650 13761 19766 22315 20295 27646 28116 31908 35722 38933  CLBs 3163 3163 3163 3163 3278 3355 3393 3661 3761 3995 4391 4720  CP 6.97E-08 6.98E-08 6.98E-08 6.95E-08 7.95E-08 8.05E-08 7.70E-08 7.73E-08 7.79E-08 8.22E-08 8.03E-08 8.47E-08  Area 2.27E+08 2.22E+08 2.19E+08 2.15E+08 2.18E+08 2.16E+08 2.17E+08 2.28E+08 2.31E+08 2.40E+08 2.58E+08 2.69E+08  Table F.7: CMR Un/DoPack (Congestion Aware Placement) - Stdev012  73  

Cite

Citation Scheme:

        

Citations by CSL (citeproc-js)

Usage Statistics

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0067718/manifest

Comment

Related Items