UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Designing network-on-chips for throughput accelerators Bakhoda, Ali 2014

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


24-ubc_2014_spring_bakhoda_ali.pdf [ 4.72MB ]
JSON: 24-1.0103404.json
JSON-LD: 24-1.0103404-ld.json
RDF/XML (Pretty): 24-1.0103404-rdf.xml
RDF/JSON: 24-1.0103404-rdf.json
Turtle: 24-1.0103404-turtle.txt
N-Triples: 24-1.0103404-rdf-ntriples.txt
Original Record: 24-1.0103404-source.json
Full Text

Full Text

Designing Network-on-Chips for Throughput AcceleratorsbyAli BakhodaA THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFDoctor of PhilosophyinTHE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES(Electrical and Computer Engineering)The University Of British Columbia(Vancouver)April 2014c© Ali Bakhoda, 2014AbstractPhysical limits of power usage for integrated circuits have steered the microprocessor industrytowards parallel architectures in the past decade. Modern Graphics Processing Units (GPU) area form of parallel processor that harness chip area more effectively compared to traditional sin-gle threaded architectures by favouring application throughput over latency. Modern GPUs canbe used as throughput accelerators: accelerating massively parallel non-graphics applications.As the number of compute cores in throughput accelerators increases, so does the importanceof efficient memory subsystem design. In this dissertation, we present system-level microarchi-tectural analysis and optimizations with an emphasis on the memory subsystem of throughputaccelerators that employ Bulk-Synchronous-Parallel programming models such as CUDA andOpenCL. We model the whole throughput accelerator as a closed-loop system in order to capturethe effects of complex interactions of microarchitectural components: we simulate componentssuch as compute cores, on-chip network and memory controllers with cycle-level accuracy. Forthis purpose, the first version of GPGPU-Sim simulator that was capable of running unmodifiedapplications by emulating NVIDIA’s virtual instruction set was developed. We use this simula-tor to model and analyze several applications and explore various microarchitectural tradeoffs forthroughput accelerators to better suit these applications. Based on our observations, we identifythe Network-on-Chip (NoC) component of memory subsystem as our main optimization targetand set out to design throughput effective NoCs for future throughput accelerators. We providea new framework for NoC researchers to ensure the optimizations are “throughput effective”,iinamely, parallel application-level performance improves per unit chip area. We then use thisframework to guide the development of several optimizations. Accelerator workloads demandhigh off-chip memory bandwidth resulting in a many-to-few-to-many traffic pattern. Leveragingthis observation, we reduce NoC area by proposing a checkerboard NoC which utilizes routerswith limited connectivity. Additionally, we improve performance by increasing the terminalbandwidth of memory controller nodes to better handle frequent read-reply traffic. Furthermore,we propose a double checkerboard inverted NoC organization which maintains the benefits ofthese optimizations while having a simpler routing mechanism and smaller area and results in a24.3% improvement in average application throughput per unit area.iiiPrefaceThis is a list of my publications at UBC in chronological order:[1] Ali Bakhoda and Tor M. Aamodt, Extending the Scalability of Single Chip Stream Pro-cessors with On-chip Caches, 2nd Workshop on Chip Multiprocessor Memory Systemsand Interconnects (CMP-MSI 2008), (in conjunction with ISCA 2008), 9 pages, Beijing,China, June 22, 2008.[2] Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, Tor M. Aamodt, Ana-lyzing CUDA Workloads Using a Detailed GPU Simulator, In proceedings of the IEEEInternational Symposium on Performance Analysis of Systems and Software (ISPASS),pp. 163-174, Boston, MA, April 26-28, 2009.[3] George L. Yuan, Ali Bakhoda, Tor M. Aamodt, Complexity Effective Memory AccessScheduling for Many-Core Accelerator Architectures, In proceedings of the 42nd IEEE/ACMInternational Symposium on Microarchitecture (MICRO-42), pp. 34-44, New York, NY,December 12-16, 2009.[4] Ali Bakhoda, John Kim, Tor M. Aamodt, On-Chip Network Design Considerations forCompute Accelerators, In Nineteenth International Conference on Parallel Architecturesand Compilation Techniques (PACT), pp. 535-536, Vienna, Austria, September 11-15,2010. (Best poster award, 2nd place)[5] Ali Bakhoda, John Kim, Tor M. Aamodt, Throughput-Effective On-Chip Networks forManycore Accelerators, In proceedings of the 43rd IEEE/ACM International Symposiumon Microarchitecture (MICRO-43), pp. 421-432, Atlanta, Georgia, December 4-8, 2010.[6] Ali Bakhoda, John Kim, Tor M. Aamodt, Designing On-Chip Networks for ThroughputAccelerators, in ACM Transactions on Architecture and Code Optimization (TACO), Vol.10, No. 3, Article 21 (September 2013), 35 pages.In [1], I conducted the research, analyzed the data, and drafted the manuscript under theguidance of Dr. Tor M. Aamodt.ivIn [2], the study was designed collaboratively by Ali Bakhoda, Dr. Tor M. Aamodt, GeorgeYuan, and Wilson Fung. Ali Bakhoda integrated the detailed interconnection network simulatoras well as designing the first versions of configurable clock domains, miss status handling reg-isters, stall-on-use, back-to-back execution of instructions from the same thread in the pipelineand instruction replay, prefetching, and various thread scheduling policies. Dr. Tor M. Aamodtand Ali Bakhoda added the original support for simulating unmodified CUDA applications tothe GPGPU-Sim simulator. Wilson Fung, Ali Bakhoda, and Dr. Tor Aamodt were also responsi-ble for developing much of the simulator’s timing model infrastructure. George Yuan collectedthe benchmarks used in this study and modified the timing simulator to provide the necessarysupport required by the benchmarks.In [3], the study was designed collaboratively by George Yuan, Ali Bakhoda and Dr. Tor M.Aamodt. George Yuan and Ali Bakhoda modified the simulator to support the necessary studies.George Yuan conducted the research, analyzed the data, and drafted the manuscript.In [4], [5] and [6] I have collaborated with Dr. John Kim of Korean Advanced Institute ofScience and Technology (KAIST). I conducted the research, analyzed the data, and drafted themanuscripts with the guidance and manuscript editing from my collaborators Dr. John Kim andDr. Tor M. Aamodt.vTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiList of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xivAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviiDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.3 Dissertation Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.1 GPUs as Throughput Accelerators . . . . . . . . . . . . . . . . . . . . . . . . . 72.1.1 CUDA and OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82.1.2 Throughput Accelerator Model . . . . . . . . . . . . . . . . . . . . . . . 92.2 Interconnection Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112.2.1 Topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132.2.2 Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14vi2.2.3 Flow Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152.2.4 Router Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . 163 GPU Architectural Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183.1 Baseline Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203.2 Architectural Design Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243.2.1 On-chip Interconnection Network . . . . . . . . . . . . . . . . . . . . . 243.2.2 Thread Block Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 253.2.3 Memory Access Coalescing . . . . . . . . . . . . . . . . . . . . . . . . 253.2.4 Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263.2.5 Memory Controller Design . . . . . . . . . . . . . . . . . . . . . . . . . 273.3 Extending GPGPU-Sim to Support CUDA . . . . . . . . . . . . . . . . . . . . . 283.4 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303.5 Analysis of CUDA Workloads . . . . . . . . . . . . . . . . . . . . . . . . . . . 333.5.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333.5.2 Instruction Classification . . . . . . . . . . . . . . . . . . . . . . . . . . 353.5.3 Baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373.5.4 On-Chip Network Topology . . . . . . . . . . . . . . . . . . . . . . . . 383.5.5 Interconnection Latency and Bandwidth . . . . . . . . . . . . . . . . . . 393.5.6 Memory Controller Effects . . . . . . . . . . . . . . . . . . . . . . . . . 413.5.7 Cache Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433.5.8 Thread Block Scheduling Effects . . . . . . . . . . . . . . . . . . . . . . 443.5.9 Memory Request Coalescing . . . . . . . . . . . . . . . . . . . . . . . . 463.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474 Detailed NoC Analysis for Throughput Accelerators . . . . . . . . . . . . . . . . . 484.1 Baseline Architecture for NoC Analysis . . . . . . . . . . . . . . . . . . . . . . 504.1.1 Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514.1.2 Compute Nodes and Memory Controllers . . . . . . . . . . . . . . . . . 524.2 Characterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534.2.1 Balanced Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544.2.2 Network Limit Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564.2.3 Router Latency and Bisection Bandwidth . . . . . . . . . . . . . . . . . 574.2.4 Many-to-Few-to-Many Traffic Pattern . . . . . . . . . . . . . . . . . . . 594.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62vii5 Throughput-Effective Network Design . . . . . . . . . . . . . . . . . . . . . . . . . 635.1 Checkerboard Network Organization . . . . . . . . . . . . . . . . . . . . . . . . 645.2 Checkerboard Routing Algorithm and Flow Control . . . . . . . . . . . . . . . . 675.3 Multi-port Routers for Memory Controller Nodes . . . . . . . . . . . . . . . . . 685.3.1 Smart Injection Port Selection Policy . . . . . . . . . . . . . . . . . . . . 695.4 Double Network—Channel-Sliced Network . . . . . . . . . . . . . . . . . . . . 715.5 Double Checkerboard Inverted Network Organization (DCI) . . . . . . . . . . . 725.6 Augmenting DCI with Advanced Routing Techniques . . . . . . . . . . . . . . . 755.6.1 DCI Enhanced (DCIE) — Increasing Network Utilization for DCI . . . . 765.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 766 Experimental Evaluation of Throughput Effective Network Design Techniques . . 786.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 786.2 Checkerboard Placement (CP) . . . . . . . . . . . . . . . . . . . . . . . . . . . 816.2.1 Checkerboard Routing (CR) . . . . . . . . . . . . . . . . . . . . . . . . 826.3 Multi-port Routers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 836.4 Double Network—Channel-Sliced Network . . . . . . . . . . . . . . . . . . . . 856.5 Double Checkerboard Inverted Network (DCI) . . . . . . . . . . . . . . . . . . . 876.6 MC Placement Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 886.7 Open-Loop NoC Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . 936.8 Area Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 946.9 Throughput-Effectiveness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 976.10 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 986.10.1 Concentration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 996.10.2 Topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1006.10.3 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1026.10.4 Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1036.11 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1047 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1057.1 Accelerator Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1057.2 Interconnection Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1067.3 Performance Evaluation and Simulation Tools . . . . . . . . . . . . . . . . . . . 112viii8 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1148.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1148.2 Moving Forward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1188.2.1 Tree Saturation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1198.2.2 Yield Improvement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123ixList of TablesTable 3.1 Benchmark Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32Table 3.2 Hardware Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33Table 3.3 Interconnection Network Configuration . . . . . . . . . . . . . . . . . . . . . 34Table 4.1 Benchmark Abbreviations and Their Sources . . . . . . . . . . . . . . . . . . 53Table 6.1 Simulation Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79Table 6.2 Area Estimation Configuration . . . . . . . . . . . . . . . . . . . . . . . . . 79Table 6.3 Baseline NoC Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . 80Table 6.4 Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81Table 6.5 Area Estimations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96xList of FiguresFigure 2.1 High level overview of the GPU as an accelerator. . . . . . . . . . . . . . . . 10Figure 2.2 Mesh topology example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13Figure 2.3 Torus topology example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14Figure 2.4 Example of a folded ring topology allowing uniform link lengths. . . . . . . 14Figure 2.5 Virtual channel router. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16Figure 3.1 Overview of modelled system and GPU architecture . . . . . . . . . . . . . . 21Figure 3.2 Details of a Compute Core . . . . . . . . . . . . . . . . . . . . . . . . . . . 22Figure 3.3 Layout of memory controller nodes in the mesh. . . . . . . . . . . . . . . . . 23Figure 3.4 Compilation flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28Figure 3.5 Instruction classification. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36Figure 3.6 Memory instructions breakdown. . . . . . . . . . . . . . . . . . . . . . . . . 36Figure 3.7 Baseline performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37Figure 3.8 Interconnection network topology comparison. . . . . . . . . . . . . . . . . 39Figure 3.9 Interconnection network latency sensitivity. . . . . . . . . . . . . . . . . . . 40Figure 3.10 Interconnection network bandwidth sensitivity. . . . . . . . . . . . . . . . . 40Figure 3.11 Impact of DRAM memory controller optimizations. . . . . . . . . . . . . . . 42Figure 3.12 DRAM utilization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42Figure 3.13 DRAM efficiency. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42Figure 3.14 Effects of adding L1 or L2 caches. . . . . . . . . . . . . . . . . . . . . . . . 43Figure 3.15 Effects of varying number of thread blocks. . . . . . . . . . . . . . . . . . . 45Figure 3.16 Effect of inter-warp memory coalescing. . . . . . . . . . . . . . . . . . . . . 46Figure 4.1 Throughput-effective design space . . . . . . . . . . . . . . . . . . . . . . . 49Figure 4.2 Compute accelerator showing layout of compute node routers and MC noderouters in baseline mesh . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51xiFigure 4.3 Compute node. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52Figure 4.4 Memory controller node. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52Figure 4.5 Limit study to find a balanced bisection bandwidth . . . . . . . . . . . . . . 54Figure 4.6 Speedup of an ideal NoC over baseline . . . . . . . . . . . . . . . . . . . . . 56Figure 4.7 IPC of the ideal NoC and baseline NoC shown as a fraction of the peaktheoretical IPC of the chip. . . . . . . . . . . . . . . . . . . . . . . . . . . . 56Figure 4.8 Impact of scaling network bandwidth versus latency . . . . . . . . . . . . . . 58Figure 4.9 NoC latency reduction of using 1-cycle routers over baseline 4-cycle routers. 58Figure 4.10 Impact of scaling network bandwidth and router latency on overall raw NoClatency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58Figure 4.11 Many-to-few-to-many on-chip traffic . . . . . . . . . . . . . . . . . . . . . . 60Figure 4.12 Ratio of the number of read-request to write-request packets sent from com-pute cores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60Figure 4.13 Average injection rates of compute nodes compared to MC nodes . . . . . . . 61Figure 4.14 Ideal NoC speedup versus memory node injection rate. . . . . . . . . . . . . 61Figure 4.15 Fraction of time injection ports at MCs are blocked . . . . . . . . . . . . . . 61Figure 5.1 Overview of building blocks for throughput-effective designs explored . . . . 63Figure 5.2 Checkerboard mesh on-chip network routing examples . . . . . . . . . . . . 65Figure 5.3 Half-router connectivity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66Figure 5.4 Layout example for checkerboard routers . . . . . . . . . . . . . . . . . . . 68Figure 5.5 Router connections assuming a single network. . . . . . . . . . . . . . . . . 69Figure 5.6 Double network (channel sliced) design layout example . . . . . . . . . . . . 72Figure 5.7 The connections between a node and routers for one tile in DCI networkorganization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73Figure 5.8 Layout example showing four tiles for DCI . . . . . . . . . . . . . . . . . . 73Figure 5.9 DCI routing examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74Figure 6.1 Overall speedup of using checkerboard placement (CP) of routers comparedto baseline top-bottom (TB) placement . . . . . . . . . . . . . . . . . . . . . 82Figure 6.2 Relative performance (IPC) of DOR with 4 VCs (solid bars) and checker-board routing (CR) with 4 VCs (hashed bars) compared to DOR routing with2 VCs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83xiiFigure 6.3 Relative performance (IPC) of routing from half-routers to full-routers thatare an even number of columns away using YX routing compared to routingusing YX to an intermediate node and then XY to destination . . . . . . . . . 84Figure 6.4 IPC speedup of adding multi-port MC routers versus single CP-CR. . . . . . 85Figure 6.5 Relative performance of using two physical subnetworks (combined doublenetwork) with channel width 8B compared to a single network with channelwidth 16B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86Figure 6.6 Relative performance of using “double checkerboard inverted”, its enhancedvariant and a combined double network that does not have half-routers com-pared to a single network with channel width 16B . . . . . . . . . . . . . . . 88Figure 6.7 MC placement examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90Figure 6.8 Sensitivity of checkerboard routing, two injection port MCs and double checker-board inverted to various placements . . . . . . . . . . . . . . . . . . . . . . 90Figure 6.9 Latency versus offered load for different architectures . . . . . . . . . . . . . 93Figure 6.10 IPC speedup of three throughput-effective designs over baseline . . . . . . . 98Figure 8.1 Tree saturation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120Figure 8.2 MC placement dependant yield enhancement . . . . . . . . . . . . . . . . . 121xiiiList of Abbreviations2P 2 injection/ejection Port routers at memory controllersAPI Application Programming InterfaceBLK Black-ScholesBSP Bulk-Synchronous ParallelBW BandWidthCDR Class-based Deterministic RoutingCMesh Concentrated MeshCMP Chip MultiProcessorCP Checkerboard Placement of memory controllersCR Checkerboard RoutingCTA Cooperative Thread Arraycubin CUDA binaryCUDA Compute Unified Device ArchitectureDCI Double Checkerboard InvertedDCIE Double Checkerboard Inverted EnhancedDOR Dimension Order RoutingDRAM Dynamic Random-Access MemoryFIFO First-In First-Outxivflit flow control digitFLOPS FLoating point Operations Per SecondFPGA Field-Programmable Gate ArrayFR-FCFS First-Ready First-Come First-ServeFWT Fast Walsh TransformGDDR Graphics Double Data Rate memoryGPGPU General-Purpose computing on Graphics Processing UnitGPU Graphics Processing UnitHH High speedup High trafficHM Harmonic MeanIPC Instructions Per CycleISA Instruction Set ArchitectureL1 Level 1L2 Level 2LH Low speedup High trafficLL Low speedup Low trafficLRU Least Recently UsedMC Memory ControllerMSHR Miss-Status Holding RegisterMUX MultiplexerNoC Network-on-ChipOoO Out-of-OrderOpenCL Open Computing LanguagePIM Parallel Iterative MatchingPISA Portable Instruction Set ArchitecturexvPTX Parallel Thread ExecutionRDL Re-Distribution LayerRF Register FileSDK Software Development KitSFU Special Function UnitSIMD Single-Instruction Multiple-DataSIMT Single-Instruction Multiple-ThreadSM Streaming MultiprocessorSRAM Static Random-Access MemoryTB Top-Bottom placement of memory controllersVC Virtual ChannelxviAcknowledgementsThanks are due first to my supervisor, Dr. Tor Aamodt, for the guidance, and technical advice heprovided throughout my Ph.D. Tor’s tenacity, dedication, passion and wisdom were invaluableduring my research. He instilled the value of high-impact and high-quality research in my mind.I am also grateful to my research collaborator, Dr. John Kim, who tirelessly shared his vastexpertise in interconnection networks with me. I learned a great deal from him.I would also like to thank my qualifying, departmental and final examination committeemembers: Dr. Steve Wilton, Dr. Sathish Gopalakrishnan, Dr. Karthik Pattabiraman, Dr. PhilippeKruchten, Dr. Konstantin Beznosov, Dr. Shahriar Mirabbasi, Dr. Mike Feeley and Dr. Ami-rali Baniasadi. Their valuable comments and feedback immensely improved the quality of mydissertation.I would also like to thank George Yuan for being such a great colleague. He stayed ener-getic and determined as we worked our way together through numerous obstacles. I am verygrateful to Wilson Fung, without him and his contributions to our research tools my journeythrough graduate school would have been very different. I also thank the other members ofUBC’s computer architecture research group–Xi Chen, Henry Wong, Ivan Sham, Andrew Turner,Hadi Jooybar, Inderpreet Singh, Johnny Kuan, Tim Rogers, Tayler Hetherington, Andrew Bok-tor, Ahmed ElTantawy, Ayub Gubran, Rimon Tadros and Dongdong Li – for the many valuablesuggestions and insightful discussions they provided. I am also thankful to Samer Al-Kiswany,Elizeu Santos-Neto and Sadegh Ahmadyani for their valuable help and advice and for being greatxviifriends. Thanks are also due to my brother, Keivan, who patiently went through numerous draftsof my work.I am very grateful to Aaron Smith and Doug Burger of Microsoft Research as well as AmirHormati. They made my internship experience very rewarding. I really enjoyed working withthem during and after my internship at MSR.Finally, I am very grateful for the support, encouragement and patience of the love of my life,Minoo, throughout the long years I worked toward my Ph.D. She made a lot of sacrifices and Iwill be forever in her dept. I owe my parents a debt of gratitude for their continuous support andencouragement throughout all my years of education and for their unconditional love.xviiiTo my wife MinooxixChapter 1Introduction1.1 MotivationPhysical limits of power usage for integrated circuits has steered the microprocessor industry to-wards parallel architectures in the past decade. While single-thread performance of commercialsuperscalar microprocessors is still increasing, a clear trend today is for computer manufactur-ers to provide multithreaded hardware that strongly encourages software developers to provideexplicit parallelism when possible. One important class of parallel computer hardware is themodern Graphics Processing Unit (GPU) [71, 75]. With contemporary GPUs crossing the ter-aflop level in 2008 [4, 91] and specific efforts to make GPUs easier to program for non-graphicsapplications [3, 85, 92], there is widespread interest in using GPU hardware to accelerate non-graphics applications. Modern GPUs are actively being used as throughput accelerators to runmassively parallel general-purpose applications.Since the Bulk Synchronous Parallel (BSP) programming model [117] provides relativelysimple software scalability as the number of cores increases with Moore’s Law, it is an at-tractive choice for programming throughput accelerators. Languages such as CUDA [85, 92],1OpenCL [52], and recently proposed programming models for future accelerator architectures [48]embody the BSP model.GPUs utilize chip area more effectively compared to traditional single threaded architecturesby favouring application throughput over latency. Although GPUs offer more raw computingpower than contemporary CPUs by orders of magnitude, many important applications do notachieve peak performance. This fact is even true for applications with abundant data level paral-lelism.An important step to improve the performance of these applications is understanding theunderlying cause of this behaviour. We selected several non-trivial CUDA applications demon-strating varying levels of performance improvement on GPU hardware [11] and developed theinfrastructure to easily study their behaviour. Based on the characterization of these applicationsand analysis of our simulation results we focus on understanding and improving the memorysubsystem and in particular the interconnection network of GPU-like accelerators in this disser-tation. We model the GPU as a closed-loop system in order to capture the effects of complexinteractions of its microarchitectural components. That is, we simulate components such as com-pute cores, on-chip network and memory controllers with cycle-level accuracy. A more detaileddescription of our work comes next.In 2008 [10], we showed that using a combination of mesh interconnection network anda novel cache hierarchy design can achieve an 82% speedup over a naively scaled 121-nodethroughput accelerator [10]. This was the first academic work to propose using data caches toimprove performance of GPUs employed for throughput acceleration. NVIDIA later announcedthe Fermi architecture in 2009 [89] which included data caches. This work was also the firstwork to employ a scattered memory controller placement (Figure 3.3) in a mesh architecture.However, we did not quantify the effects of other placement alternatives at that point. Abts etal. [2] later showed that a similar placement is the best configuration in multicore architectures2employing mesh networks. We quantify and analyze various placement alternatives in our laterwork [12, 13] specifically targeted at throughput accelerators.Our evaluations at that point were based on a modified version of the GPGPU-Sim simulatordeveloped by Fung et al. [29]. That early version of GPGPU-Sim was based on SimpleScalar’sPISA ISA [18] and porting applications to run in the simulator was very laborious and timeconsuming. Since then, I helped implement a new simulator front end which is capable of simu-lating GPU specific assembly programs without any modification to the program [11]. The firstversion of the GPGPU-Sim simulator that was capable of running unmodified applications byemulating NVIDIA’s parallel thread execution (PTX) virtual instruction set, was developed inthe summer of 2008. The new simulation front end opened up a unique opportunity for explo-ration and innovation as it enabled running general-purpose compute applications (referred to asGPGPU applications [1]) written in CUDA [85, 92] without any setup time overhead. We usedthis simulator to model and analyze several applications and explore various microarchitecturaltradeoffs for throughput accelerators to better suit these applications [11]. This work formed afoundation for research projects focusing on various components of the memory subsystem asexplained next.A research work related to this dissertation proposes simple arbitration techniques in on-chipnetwork routers that enable a simpler scheduler for memory controllers compared to commonlyused schedulers with comparable performance [127]. The bulk of this dissertation focuses onresearch that explores designing “throughput-effective” network-on-chips for future throughputaccelerators. A hardware optimization is throughput-effective if it improves parallel application-level performance per unit chip area. We provide a new framework for interconnection networkresearchers to ensure the optimizations are throughput-effective. We then use this frameworkto guide the development of several optimizations. A more detailed description of this work isprovided next.3Accelerator applications written in a BSP style [19, 48] tend to organize communication tobe local to a group of threads that execute on hardware units that are located close together.There is limited communication between threads in different groups even when coherence issupported [48, 49]. On the other hand, the number of pins on a chip is growing only 10%per year [44]. The net effect of increases in transistor density on accelerator architectures isan increasingly many-to-few traffic pattern with many compute cores sending traffic to a fewMemory Controller (MC) nodes. Using detailed simulations, we identify how MCs becomehot-spots in the system and have much higher injection rates than the compute nodes. This isdue to the prevalence of read requests in the system which are small packets that result in largeread-reply packets. This results in a many-to-few-to-many traffic pattern.Leveraging this observation we reduce NoC area by proposing a checkerboard NoC whichalternates between conventional routers and half-routers with limited connectivity. Addition-ally, we improve performance by increasing the terminal bandwidth of memory controller nodesto better handle frequent read-reply traffic. Furthermore, we propose a double checkerboardinverted NoC organization which maintains the benefits of these optimizations while having asimpler routing mechanism and smaller area. In general, throughput-effective designs can beachieved by either increasing the application-level performance or decreasing the area footprintof system components. We employ checkerboard network organization and channel slicing to re-duce area while multi-port MC routers and optimized MC placements provide application-levelspeedups.1.2 ContributionsThe contributions of this dissertation are as follows.• The earliest work characterizing the performance of various existing CUDA applicationscollected on a research GPU simulator (GPGPU-Sim). We provided application charac-4teristics including the dynamic instruction mix composition, the locality characteristics ofDRAM, and sensitivity to the NoC’s bisection bandwidth as opposed to the latency. Thisinformation enabled researchers to observe utilization of various resources by computeapplications for the first time in detail. This is presented in Chapter 3.• Demonstration of the benefits of having data caches for non-optimized CUDA applications(up to 30% speedup) and inter-warp memory coalescing (up to 41% speedup). These areexplained in Section 3.2 and evaluated in Section 3.5.• Demonstration of the impact of on-chip communication on overall performance of a widerange of compute accelerator applications: conventional network improvements (such asreducing router latency) do not significantly improve overall performance while simplyincreasing channel width results in significant performance gains but with a large and un-acceptable area increase. This is explained in Chapter 4.• Proposal to employ a throughput-effective framework to guide the design of NoCs. Thebasis of this framework is simultaneously considering the effect of the NoC on overallapplication-level performance and chip area. This is explained in Chapter 4 and employedin Chapter 5.• Identification of traffic imbalance as the result of the many-to-few-to-many traffic patternof throughput accelerators (more compute nodes than MCs) and demonstration of the di-rect correlation of overall system-level performance with the injection rate of the few MCnodes. This is explained in Section 4.2.4.• Proposal of a throughput-effective design based on the above observations. It includes anovel checkerboard network organization using half -routers with limited connectivity thatreduces the on-chip network area with minimal impact on performance. It also includes amulti-port router structure to increase the terminal bandwidth on the few routers connected5to the MCs that improves system performance at minimal area cost. This point is explainedin Chapter 5 with the results presented in the first three sections of Chapter 6.• Proposal of another novel network organization and routing technique called “DoubleCheckerboard Inverted” (DCI). It utilizes a special form of channel slicing where eachclient node is connected to both a half and a full-router. DCI maintains the benefits ofthe earlier techniques with a simpler routing algorithm and no MC placement restric-tions.Furthermore, DCI can be extended to operate with more sophisticated routing mech-anisms: employing two simple injection rules enables Class-based Deterministic Routing(CDR) in conjunction with DCI without incurring any area overhead. This point is ex-plained in Chapter 5 and the results are presented starting from Sections 6.3 of Chapter6.1.3 Dissertation OrganizationThe rest of this dissertation is organized as follows: Chapter 2 summarizes background informa-tion about GPUs as throughput accelerators and on-chip networks. Chapter 3 explores architec-tural options when designing GPUs as throughput accelerators. Chapter 4 identifies importantinsights into NoC behaviour of throughput accelerator architectures, Chapter 5 describes ourproposed throughput-effective NoC design options, Chapter 6 describes experimental results ofthe proposed NoC optimizations, Chapter 7 summarizes related work. We conclude and providedirections for future work in Chapter 8.6Chapter 2BackgroundIn this chapter, we briefly describe several fundamental concepts that are related to this disserta-tion. We start by describing the concept of GPUs as throughput accelerators and then describeconcepts related to on-chip network design.2.1 GPUs as Throughput AcceleratorsThe modern Graphics Processing Unit (GPU) is a highly parallel, multithreaded throughput ac-celerator as a result of demand for high quality real-time 3D graphics. GPU threads run programsthat perform tasks such as shading a pixel or drawing a vertex. These threads are independentsince they work on different fragments. As a result GPUs have evolved to efficiently execute amassive number of independent fine-grained threads [85]. When rendering a scene, the speedthat each individual pixel is rendered is not important as long all the pixels of the scene are readyby a given deadline dictated by the frame-rate. In other words, GPUs favour throughput oversingle thread latency.Researchers started using GPUs for general purpose computing in early 2000s. They weremotivated by the massive raw performance of GPUs as well as their high performance to cost7ratio compared to CPUs. General-purpose computing on GPUs or GPGPU [1] was extremelyhard at that time because researchers had to use graphics APIs [97].The breakthrough that allowed GPUs to be directly programmed for general-purpose com-puting was the announcement of CUDA by NVIDIA [85, 92] and Close To Metal by AMD [3]in late 2006. Programming models such as CUDA embody the BSP model. We first explain theBSP model and then get into more specific details about the GPGPU programming models.BSP The Bulk Synchronous Parallel model (BSP) is an abstract bridge model between hard-ware and programming model [117]. A distinctive feature of BSP is considering the cost ofcommunication and synchronization when analyzing BSP algorithms. A BSP computer consistsof nodes capable of computation and/or memory which are connected via a network. A BSPapplication is executed as a series of super-steps. A super-step has the following components.(1) Computation: Every participating processor performs several computations. Only valuesstored locally can be used by each process and the computations can occur asynchronously. (2)Communication: The processes exchange data in a one-sided fashion e.g., a send does not havea corresponding receive on the other processor. (3) Barrier synchronization: When a processreaches a barrier, it waits for the other processes to reach the barrier. When all processes reachthe barrier the super-step is concluded and a new one starts. Note that the computation and com-munication steps are not necessarily ordered in time. The BSP model provides relatively simplesoftware scalability.2.1.1 CUDA and OpenCLThe Compute Unified Device Architecture (CUDA) is a BSP programming model introducedby NVIDIA. It provides an extension to the C and C++ programming languages and enablesthe programmer to write parallel kernels that execute directly on the GPU. Parallel kernels areexecuted across a set of parallel threads on the GPU. The serial parts of a CUDA applicationare executed on the CPU and the parallel kernels are offloaded to the GPU. Each parallel kernel8invokes a grid of threads on the GPU. Each thread is given a unique identifier which can be usedto help divide up work among the threads. In this model the programmer organizes the parallelthreads of each compute kernel into thread blocks. Each thread block is assigned to a singlecompute core and its threads can synchronize and share data. Thread blocks are also referred toas cooperative thread arrays (CTAs) [71] which is a term describing their hardware mapping inthe GPU. We use CTAs and thread blocks interchangeably in this work. Within a single threadblock, threads have access to a common fast scratchpad memory called the shared memory andcan, if desired, perform barrier synchronizations.Parallel Thread Execution (PTX) [94] is a virtual instruction set that is the lowest user ac-cessible representation of CUDA kernel codes. PTX retains a significant amount of informationfrom the high level language in the instruction set that would make it easier to do compiler op-timizations. Every CUDA program is first compiled into PTX and then translated to the nativeISA of a particular GPU by the driver at run time. More recent versions of CUDA compilershave made it possible to compile the GPU code in the binary form (cubin1) for a specific com-pute capability and to forgo PTX code. Nevertheless, retaining PTX enables future hardwarecompatibility and is recommend [95]. We utilize PTX for our simulations in this work.Open Computing Language (OpenCL) [52] is an open standard for parallel programming. Itis very similar to CUDA when used for GPGPU purposes. OpenCL is maintained by KhronosGroup [52] and is supported by a number of companies including NVIDIA, AMD and Intel.2.1.2 Throughput Accelerator ModelFigure 2.1 shows a high level view of the throughput accelerator we model in this dissertation.The application starts running on CPU when it arrives at a compute kernel, the kernel is offloadedto the accelerator. The accelerator, in our case a GPU, executes the kernel until the kernel is done.1CUDA binary9DRAM DRAM DRAMMemoryController MemoryController MemoryControllerCompute CoresInterconnection NetworkCore Core Core Core Core CoreAccelerator (GPU)CPUResultsCompute Kernel Figure 2.1: High level overview of the GPU as an accelerator.The results of kernel execution are then returned to back to CPU or retained on the GPU to feedanother kernel.As Figure 2.1 shows, the GPU has a number of compute cores, memory controllers, andan interconnection network. The GPU also has its own off-chip DRAM modules which arecontrolled directly by the on-chip memory controllers.Compute Nodes Each compute core is a unit similar in scope to a streaming multiprocessor(SM) in NVIDIA’s terminology [92]. Threads are distributed to compute cores at the granularityof entire thread blocks, while per thread block resources such as registers, shared memory spaceand thread slots are not freed until all threads within a thread block have completed execution. Ifresources permit, multiple thread blocks can be assigned to a single compute core, thus sharinga common pipeline for their execution. Each compute core in this model is a SIMT core whichis a variant of the SIMD organization as explained next.SIMD Single Instruction, Multiple Data (SIMD) is a parallel architecture organization inwhich multiple data-paths are controlled by a single instruction sequencer. In traditional SIMD10organizations multiple data-paths are in the form of a vector. These organizations are effectivefor workloads that involve executing the same set of instructions repeatedly on a large numberof data points. Large matrix multiplication is a typical application that can exploit SIMD. SIMDsimplifies the architecture by utilizing a single instruction sequencer for a group of ALUs.SIMT Single Instruction, Multiple Thread (SIMT) [71] is a special form of SIMD in whicheach data-path has its own set of scalar registers. Modern GPUs employ the SIMT organiza-tion [71]. In the context of SIMT, each ALU lane is executing a thread. The set of all the threadsrunning together in lockstep is referred to as a warp (or wavefront in AMD’s terminology). Warpsize can be larger than SIMT width of the architecture, for example in NVIDIA G80 [88] archi-tecture the warp size is 32 while the SIMT width is 8 (there are eight distinct scalar functionalunits). These 32 threads are issued to the pipeline in four consecutive cycles and all of them runthe same instruction in lockstep. A SIMT organization supports branch instructions that mightcause threads in a warp to diverge by using hardware mechanisms to selectively enable or disableprocessing elements [21, 70].2.2 Interconnection NetworksThe majority of this dissertation is dedicated to analyzing and optimizing on-chip interconnec-tion networks for throughput accelerators. We provide a short background about basic networkconcepts that are relevant to this work in this section. Interconnection networks are generallyclassified based on their topology, routing and flow control [22]. Topology is the fixed patternof routers and the channels connecting them. Once the topology is defined, there might be morethan one path for a packet to traverse the network from its source to its destination router. Therouting mechanism specifies the path that will be taken by any given packet in the network. Fi-nally, the set of policies that define which packet or flit can access a particular resource of thenetwork over time are referred to as flow control. We will provide more details about these topicslater in this chapter.11The unit of routing in a network is the packet. Packets are broken into flits which are thesmallest unit of resource allocation in the interconnection network [22]. A large packet mighthave separate header, body and tail flits while a small packet might be a single flit that acts asheader, body and tail at the same time. The header flit carries control information about the packetsuch as its type and destination which aid the routers along its path to route to its destination.In the accelerators we model in this dissertation, the compute cores do not directly communi-cate and there is no coherence traffic. The packets in the system are limited to read requests, writerequests and their replies. Requests are sent from the compute cores to the memory controllersand the replies are sent from the memory controllers back to the compute cores.In order for the system to remain deadlock free it is important to avoid cyclic dependenciesbetween resources used by request and reply traffic. As an example, assume a memory controllerthat has a shared memory interface buffer for both request and reply traffic. If a surge of readrequests fills up this shared buffer, then the memory controller is unable to send any replies tothe network and stalls. This stall then will propagate back through the memory controller queuesand prevent the shared buffer from being drained resulting in a deadlock. While this examplewas about the shared resources at an interface buffer; a similar deadlock effect can happen ifa cyclic dependency between resources used by request and reply packets is formed anywherein the network. This form of deadlock is referred to as a protocol deadlock. A characteristicfeature of a protocol deadlock is the involvement of resources outside the network (e.g. memorycontroller) in the formation of the cyclic dependency leading to the deadlock.One way of avoiding protocol deadlock is using separate physical networks for request andreply traffic. Another possibility is using dedicated virtual channels for request and reply trafficwhich essentially forms two distinct logical networks.12Figure 2.2: Mesh topology example.2.2.1 TopologyWe compare several topologies in Chapter 3 and Chapter 6. In this section, we explain sometopology basics. Crossbar topology is commonly used for on-chip applications that do not havemany nodes. In a crossbar topology, all nodes are connected to a central router that providesall-to-all connectivity. The cost and area of a crossbar increase quadratically as the number ofits clients scale. A crossbar interconnect can be seen as a one-stage butterfly topology. For agiven router radix butterfly networks offer minimal hop count but they offer no path diversityand require very long wires to implement. We will compare various topologies including acrossbar, a two-stage butterfly as well as mesh, torus and ring topologies which are explainednext in Section 3.5.4.Cost and area of mesh topologies, unlike crossbars, scale linearly as the number of networkclients increases. Figure 2.2 depicts a 2D 3×3 mesh. A torus topology is formed when therouters on the perimeter of a mesh are connected as the example in Figure 2.3 demonstrates.A ring topology is simply a one dimensional torus. Figure 2.4 depicts a folded ring topologywith four nodes. Folding, which can also be applied to a 2D torus, allows having uniform linklengths as opposed to a combination of several short links and one very long link connecting therouters on the perimeter. Link lengths in folded torus are double that of a mesh with the samenumber of nodes. A 2D mesh or folded torus can be implemented on chip with nearly uniform13Figure 2.3: Torus topology example.Figure 2.4: Example of a folded ring topology allowing uniform link lengths.short wires. On the other hand, these topologies have a relatively high latency compared to acrossbar network due to a larger hop count.2.2.2 RoutingIn this section, we explain Dimension Order Routing (DOR) which is employed in some of ourstudies in Chapters 3 and 4. Several variants of DOR routing are also employed by our proposedthroughput-effective network designs in Chapters 5 and 6. DOR is a simple routing techniquefor the torus and mesh topologies. The packet is first routed based on minimum distance alongone dimension and then routing is done along the next dimension until the packet arrives at itsdestination node. For example, DOR can be implemented in a 2D mesh by routing packets firstalong the x-axis and then along the y-axis. This is also called XY-routing.A major benefit of DOR routing in mesh topologies is deadlock avoidance. Deadlocks canhappen when a circular dependency is formed among a series of agents holding resources whilethese agents are waiting for the other agents to release their resources [22]. Circular routing14dependencies do not occur when DOR routing is employed in a 2D-mesh as each packet canonly turn at most once. As a result DOR avoids routing deadlocks. Note that routing deadlock isdifferent from protocol deadlock explained earlier which involves agents external to the network.2.2.3 Flow ControlIn this dissertation, we employ the well known credit based flow control technique which isexplained in this section. This explanation is included for completeness and its details are notrelevant to the contributions of this dissertation.On-chip networks generally do not drop packets. Therefore, they require a mechanism thatensures a flit will be accepted at the next stage of its path through the network. In credit basedflow control, each router has counters that indicate the number of empty slots in the downstreamrouters directly attached to it. When a router sends a flit to another downstream router it decre-ments its counter for that downstream router. At the same time as a flit departs a downstreamrouter it sends a credit back to the upstream router. When the upstream router receives the creditit increments its counter. If the credit counter for a downstream router reaches zero, it means thedownstream router cannot accept any new flits and the upstream router refrains from sending anynew flits until the credit count becomes positive.Since the credits are not sent back until the corresponding flit departs the downstream routerit is important to for routers to have a minimum amount of buffering that is equal or more than theminimum credit round trip delay2. A buffer size smaller than this limit prevents the links frombeing fully utilized. For example, if a router has a pipeline depth of four and flits and credits eachtake one cycle to traverse the links between routers, the minimum credit round trip delay wouldbe six. For more details about the credit-based flow control one can refer to Chapter 13 of [22].2 The credit round-trip delay is the minimum possible time between sending a credit for a buffer and then sendinga credit for the same buffer. It involves the round trip wire delays between the routers, pipeline delays and creditprocessing delays [22].15Routing UnitInput port 0SwitchRouterVC AllocatorSwitch AllocatorVC 0VC vVC 0VC vInput port nOutput port 0Output port nFigure 2.5: Virtual channel router.2.2.4 Router MicroarchitectureIn this section, we explain the basic microarchitecture of a virtual channel router. This is therouter that we employ in our studies throughout this dissertation. Additionally, some of ourproposed throughput-effective network design techniques in Chapters 5 and 6 involve modify-ing various components of the router. Some examples include changing the number of virtualchannels, the number of ports or the implementation of switch.Figure 2.5 depicts a high-level view of a virtual channel router with n ports and v virtualchannels. In a 2D mesh organization the router has four ports for connecting to its adjacentrouters and one local port to its client. Virtual channels can be used for many purposes, we16mainly employ them to create logical networks that separate the request traffic from reply trafficas a way of avoiding protocol deadlock which was explained earlier.The router is pipelined with four pipeline stages: route calculation, virtual channel allocation,switch allocation and switch traversal. The route calculation and the virtual channel allocationstages are performed once per packet while the switch allocation and the switch traversal stagesare performed per flit.As the header flit of a packet arrives at a given input buffer, its exit port is calculated by theroute calculation stage of the pipeline. The next stage is the virtual channel allocation which triesto assign one of the virtual channels of the exit port to the packet. The virtual channel allocationconsiders the constraints associated with different packet types when allocating the VCs. Forexample, when there are two logical network for the read and reply traffic, the read requests areonly allowed to use the VCs that are dedicated to the read traffic. Once a VC is allocated to apacket it cannot be used by any other packet until the tail flit of the packet passes through therouter. That is, the flits of different packets cannot interleave inside a VC buffer and must remainin order.After a packet is assigned to a virtual channel its individual flits need to pass through theswitch to get to their exit ports. Note that there might be multiple input flits from different inputports that are trying to get to the same exit (but different VCs). It is the task of the switch allocatorto ensure maximum utilization of the switch without any conflicts. Once the switch allocation isdone the successful flits traverse the switch towards their exit ports in the next cycle. Our smartport selection policy optimization for multi-port router that will be introduced in Section 5.3.1increases the switch utilization by providing the switch allocator with choices that are less likelyto result in conflicts.17Chapter 3GPU Architectural ExplorationSince its introduction by NVIDIA in February 2007, the CUDA programming model [85, 90] hasbeen used to develop many applications for GPUs. CUDA provides an easy to learn extensionof the ANSI C language. As we mentioned earlier, the programmer specifies parallel threads,each of which runs scalar code. While short vector data types are available, their use by theprogrammer is not required to achieve peak performance, thus making CUDA a more attractiveprogramming model to those less familiar with traditional data parallel architectures.As of February 2009, before the first version of GPGPU-Sim capable of simulating CUDAwas released, NVIDIA had listed 209 third-party applications on their CUDA Zone website [87].Of the 136 applications listed with performance claims, 52 were reported to obtain a speedupof 50× or more, and of these 29 were reported to obtain a speedup of 100× or more. As suchapplications already achieved tremendous benefits, we instead focused on evaluating CUDA ap-plications with reported speedups below 50× for the studies in this chapter. The reason for thisdecision was that this group of applications appeared most in need of software tuning or changesto hardware design. The aforementioned speedup numbers are relative to the CPU-based imple-mentations of the corresponding applications and were reported by the application developers.That is, the speedups were not reported based on a standard method and different application18developers used different GPU and CPU configurations when reporting their speedups. We willuse a broader set of applications for our evaluations in the next chapter and onwards.As we will demonstrate in Section 3.5.3 some of these applications can reach up to 75% ofthe GPU’s peak performance. One might expect a benchmark that can achieve a performanceclose to GPU’s peak to have a speedup higher than the 50×1 mentioned earlier. The differencecan be explained by the non-standard measurement techniques employed by developers as weexplained before. Two major contributing factors that can affect the reported GPU speedupsvs CPUs are the amount of CPU code optimizations [68] and consideration of data transferoverheads in measurements [33]. In 2010, Lee et al. [68] claimed that many of the studiesclaiming speedups over CPUs would get a much smaller speedup on GPUs if they also tune thecode for the CPU and employ a high-end CPU. They also pointed out some deficiencies in GPUsthat are addressed by more recent generations of GPUs. For example, Fermi architecture [89]remedies many of the deficiencies identified by Lee et al. such as the need for caches, fasteratomic operations, and better double precision floating-point support. Another factor that needsto be considered in GPU vs CPU speedup calculations is the communication overhead betweenCPU and GPU as pointed out by Gregg et al. [33]. This overhead can be substantial if the ratioof data transfer time to the kernel execution time is high. We should note that Gregg et al. [33]ignored streams. The concept of streams was introduced along with Fermi architecture [89]which allows performing multiple CUDA operations concurrently and can considerably reducethe impact of communication overhead on application execution time. Nevertheless, streamswere not available at time of the studies in this chapter.In this chapter we presents data characterizing the performance of twelve existing CUDA ap-plications collected on our research GPU simulator (GPGPU-Sim 2.0b). We provide an analysisof application characteristics such as the dynamic instruction mix and DRAM locality char-1 The peak single precision floating point performance of Intel’s Nehalem architecture Xeon 5560 is 11.2 gigaFLOPS for a single core while the peak for NVIDIA’s Geforce GTX 285 GPU is around 1062 giga FLOPS. That is,the peak performance of GPU is 95× more than the CPU in this example.19acteristics. We also demonstrate that the non-graphics applications we study tend to be moresensitive to bisection bandwidth versus latency. Additionally, we show the performance benefitsof caching and inter-warp coalescing. The majority of the content in this chapter was publishedin our ISPASS 2009 paper [11].The rest of this chapter is organized as follows. In Section 3.1 we describe the baseline ar-chitecture we study. In Section 3.2 we describe the microarchitecture design choices that weexplore. A brief overview of our simulation infrastructure that was created for this study is pre-sented in Section 3.3. We then introduce the benchmarks selected for our study in Section 3.4.In Section 3.5 we describe our experimental methodology and then present and analyze experi-mental results. Section 3.6 concludes the chapter.3.1 Baseline ArchitectureFigure 3.1 shows an overview of the system we simulated. We used GPGPU-Sim 2.0b for thesimulations in this chapter which we will explain shortly with more detail. Figure 3.1 also showsour baseline GPU architecture. The GPU consists of a collection of small data-parallel computecores which are connected by an on-chip network to multiple memory modules (each labeledmemory controller).Figure 3.2 shows the detailed implementation of a single compute core. In this chapter,each compute core has a SIMT width of 8 and uses a 24-stage, in-order issue pipeline withoutforwarding. Given the SIMT width of eight, we model a pipeline with six logical pipe stages(fetch, decode, execute, memory1, memory2 and writeback) with superpipelining of degree four.Memory operations are performed in one memory stage in our model; the other memory stage isan empty stage and simply serves to maintain a six stage pipeline. Threads are scheduled to theSIMT pipeline in warps consisting of 32 threads [71].20DRAM DRAM DRAMMemoryController MemoryController MemoryControllerMemoryApplication CustomlibcudaGPGPU-SimCPU Compute CoreskernelPTX codeStatisticsOn-Chip NetworkCore Core Core Core Core CoreL2 L2 L2Figure 3.1: Overview of modelled system and GPU architecture. Dashed portions (L1 andL2 for local/global accesses) are omitted from baseline in this chapter.Branch Handling We employ the immediate post-dominator reconvergence mechanism de-scribed in [29] to handle branch divergence. Branch divergence occurs when some scalar threadswithin a warp evaluate a branch as “taken” and others evaluate it as “not taken”.Memory Spaces Threads running on the GPU in the CUDA programming model have accessto several memory spaces (global, local, constant, texture and shared [90]) and GPGPU-Simmodels accesses to each of these memory spaces. In particular, each compute core has accessto a 16KB low latency, highly banked per-core shared memory. Each core also has access tothe global texture memory with a per-core texture cache and to the global constant memory witha per-core constant cache. Local and global memory accesses always require off-chip memoryaccesses in our baseline configuration for this chapter. For the per-core texture cache, GPGPU-Sim models a 4D blocking address scheme as described by Hakura et al. [35]. This schemeessentially permutes the bits in requested addresses to promote spatial locality in a 2D space21RF RF RF RFFetchDecodeWritebacklocal/global access(or L1 miss); texture or const cache missAll threadshit in L1?To On-Chip NetworkMSHRsThread WarpThread WarpThread WarpThread WarpThread WarpDataSchedulerSIMDPipelineCompute CoreSharedMem.L1 const.L1 local&globalL1 textureFigure 3.2: Details of a Compute Core. Dashed portions (L1 and L2 for local/global ac-cesses) omitted from baseline.rather than in linear space. For the constant cache, as long as all threads in a warp are requestingthe same data we allow single cycle access. If the threads in a warp request different constantmemory locations, a port conflict occurs. The port conflict forces the data to be sent out overmultiple cycles and results in pipeline stalls [90].Coalescing Multiple memory accesses from threads within a single warp to a localized regionare coalesced into fewer wide memory accesses to improve DRAM efficiency. Coalescing isdiscussed in more detail in Section ControllerFigure 3.3: Layout of memory controller nodes in the mesh.To alleviate the DRAM bandwidth bottleneck that many applications face, a common tech-nique used by CUDA programmers is to load frequently accessed data into the fast on-chipshared memory [105]. Alternatively, we explore the effect of adding an L1 data cache to capturelocality in global memory accesses.Scheduling Thread scheduling inside a compute core is performed with zero overhead ona fine-grained basis. Every four cycles, warps ready for execution are selected by the warpscheduler and issued to the SIMD pipelines in a loose round robin fashion. The scheduler skipsnon-ready warps such as those waiting on global memory accesses. In other words, wheneverany thread inside a warp faces a long latency operation, all the threads in the warp are takenout of the scheduling pool until the long latency operation is over. Meanwhile, other warps thatare not waiting are sent to the pipeline for execution in a round robin order. The large numberof threads running on each compute core thus allows a compute core to tolerate long latencyoperations without reducing throughput.Network In order to access global memory, memory requests must be sent via an on-chip net-work to the corresponding memory controllers, which are physically distributed over the chip.To avoid protocol deadlock, we model physically separate send and receive interconnection net-23works. Using separate logical networks to break protocol deadlock is another alternative that wewill explore in the following chapters.Figure 3.3 shows the physical layout of the memory controllers in our 6×6 mesh configu-ration as shaded areas. Note that with area-array, i.e., “flip-chip” designs it is possible to placeI/O buffers anywhere on the die [17]. We will explore different placement alternatives with moredetail in the later chapters specifically in Section 6.6 of Chapter 6.Off-chip Memory Each on-chip memory controller interfaces to two off-chip GDDR3 DRAMchips. GDDR3 stands for Graphics Double Data Rate 3 [42]. Graphics DRAM is typically opti-mized to provide higher peak data bandwidth. At the time of the studies for this chapter, it wasnot common knowledge that NVIDIA GPUs distribute the addresses among memory controllersevery 256 bytes [37], we employ the 256 byte memory address distribution scheme in the stud-ies in the next chapter and onwards. The address decoding scheme for this chapter is designedin a way such that successive 2KB DRAM pages [42] are distributed across different banks anddifferent chips to maximize row locality while spreading the load among the memory controllers.3.2 Architectural Design OptionsThis section describes the GPU architectural design options explored in this chapter. Evaluationsof these design options are presented in Section On-chip Interconnection NetworkThe on-chip interconnection network can be designed in various ways based on its cost and per-formance. Cost is determined by complexity and number of routers as well as density and lengthof wires. Performance depends on latency and bandwidth of the network [22]. A more detailedexplanation of on-chip networks can be found in Section 2.2. As we will show in Section 3.5,our benchmarks are not particularly sensitive to network latency so we chose a mesh network asour baseline while exploring the other choices for on-chip network topology.243.2.2 Thread Block DistributionWe issue the thread blocks in a breadth-first manner across compute cores. Our thread blockdistribution algorithm selects a compute core that has a minimum number of blocks running onit. The goal is to spread the workload as evenly as possible among all cores. This is done untilall the thread blocks are assigned to a compute core or until all compute cores are full and cannotaccept any new blocks.GPUs can use the abundance of parallelism in data-parallel applications to tolerate memoryaccess latency by interleaving the execution of warps. These warps may either be from the samethread block or from different thread blocks running on the same compute core. One advantageof running multiple smaller thread blocks on a compute core rather than using fewer larger threadblocks stems from the presence of barrier synchronization points within a thread block [105].While warps of a thread block are waiting at a barrier, warps from other thread blocks on the samecompute core can make progress. Assuming a specific number of threads per block, allowingmore thread blocks to run on a compute core provides additional memory latency tolerance atthe expense of increasing register and shared memory resource use. However, even if sufficienton-chip resources exist to allow multiple thread blocks per core, completely filling up all threadblock slots can reduce overall performance for memory-intensive compute kernels as we willshow in Section 3.5.8. The performance of memory-intensive kernels drops due to increasedcontention in the interconnection network and DRAM controllers which can be overwhelmed bymany concurrent memory requests.3.2.3 Memory Access CoalescingThe minimum granularity access for GDDR3 memory is 16 bytes and typically scalar threads inCUDA applications access four bytes per scalar thread [42]. To improve memory system effi-ciency, it thus makes sense to group accesses from multiple, concurrently-issued, scalar threads25into a single access to a small, contiguous memory region. The CUDA programming guide in-dicates that parallel memory accesses from every half-warp of 16 threads can be coalesced intofewer wide memory accesses if they all access a contiguous memory region [90]. Our base-line models similar intra-warp memory coalescing behaviour (we attempt to coalesce memoryaccesses from all 32 threads in a warp).A related issue is that since the GPU is heavily multithreaded, a balanced design must sup-port many outstanding memory requests at once. Microprocessors typically employ Miss-StatusHolding Registers (MSHRs) [58] that use associative comparison logic to merge simultaneousrequests for the same cache block. The number of outstanding misses that can be supported istypically small. For example, the original Intel Pentium 4 used four MSHRs [39]. One way tosupport a far greater number of outstanding memory requests is to use a FIFO for outstandingmemory requests [40]. Similarly, our baseline does not attempt to eliminate multiple requestsfor the same block of memory on cache misses or local/global memory accesses. However, wealso explore the possibility of improving performance by coalescing read memory requests fromlater warps that require access to data for which a memory request is already in progress dueto another warp running on the same compute core. We call this inter-warp memory coalesc-ing. We observe that inter-warp memory coalescing can significantly reduce memory traffic forapplications that contain data dependent accesses to memory. The data for inter-warp mergingquantifies the benefit of supporting large capacity MSHRs that can detect a secondary access toan outstanding request [115].3.2.4 CachingWhile coalescing memory requests captures spatial locality among threads, memory bandwidthrequirements may be further reduced with caching if an application contains temporal localityor spatial locality within the access pattern of individual threads. We evaluate the performanceimpact of adding first level, per-core L1 caches for local and global memory access to the design26described in Section 3.1. We also evaluate the effects of adding a shared L2 cache on the memoryside of the on-chip network at the memory controller. While threads can only read from textureand constant memory, they can both read and write to local and global memory. In our evaluationof caches for local and global memory we model non-coherent caches. Note that threads fromdifferent thread blocks in the applications we study do not communicate through global memory.3.2.5 Memory Controller DesignThe design of memory controller scheduler has a large impact on the overall performance2.Two scheduler designs are explored in this chapter: a simple in-order First-In First-Out (FIFO)memory controller and a more complex Out-of-Order First-Ready First-Come First-Serve (FR-FCFS) [104] memory controller. The FIFO controller simply services the memory requests inthe same order that they are received. In the FR-FCFS design, the memory requests that hit anopen row in the DRAM are prioritized over requests that require a precharge and activate to opena new row.Two other metrics in addition to performance are employed to evaluate memory controllerdesign: DRAM efficiency and DRAM utilization. DRAM efficiency is calculated as a percent-age by dividing the time spent sending data across the DRAM pins by the time when there areany memory requests being serviced or pending in the memory controller input buffer. DRAMutilization is calculated as a percentage by dividing the time spent sending data across the DRAMdata pins by the entire kernel execution time. These two measures can differ when an applicationhas computation phases during which it does not access DRAM. During the phases that DRAMis not being used, DRAM utilization goes down while DRAM efficiency does not change. Asan example, applications that are heavily optimized to use the “shared memory” can exhibit thisbehaviour.2 The work presented in this section is the result of close collaboration with my colleague George Yuan [11].27NVIDIAGPU GPGPU-Simcudafe + nvopencccudafe + nvopenccC/C++ compiler C compilerptxasToolFileNvidia ToolkitGPGPU-SimUser-specifictool/fileSource code(.cu)Source code(.cu)Host CcodeHost CcodeExecutablePCI-E Executable.ptx.ptxcubin.binlibcuda.a Customlibcuda.alibcuda CustomlibcudaGPGPU-SimNormal CUDA Flow with GPU HardwareStatisticsFunctioncallptxasper thread registerusageSource code(.cpp)Application(a) CUDA flow with GPU Hardware.NVIDIAGPU GPGPU-Simcudafe + nvopencccudafe + nvopenccC/C++ compiler C/C++ compilerptxasToolFileNvidia ToolkitGPGPU-SimUser-specifictool/fileSource code(.cu)Host CcodeExecutablePCI-E Executable.ptx.ptxcubin.binlibcuda.a Customlibcuda.alibcuda CustomlibcudaGPGPU-SimNormal CUDA Flow with GPU HardwareStatisticsFunctioncallptxasper thread registerusageSource code(.cpp)Application Source code(.cu)Source code(.cpp)ApplicationHost Ccode(b) GPGPU-Sim.Figure 3.4: Compilation flow for GPGPU-Sim from a CUDA application in comparison tothe normal CUDA compilation flow.3.3 Extending GPGPU-Sim to Support CUDAWe extended3 GPGPU-Sim, the cycle-level simulator that was developed by our group for an ear-lier work [29]. GPGPU-Sim 2.0b models various aspects of a massively parallel architecture withhighly programmable pipelines similar to contemporary GPU architectures. A drawback of theprevious version of GPGPU-Sim was the difficult and time-consuming process of converting/par-allelizing existing applications [29] as it utilized the PISA ISA. This was overcome by extendingGPGPU-Sim to support the CUDA Parallel Thread Execution (PTX) [93] instruction set. Thisenables us to simulate the numerous existing, optimized CUDA applications on GPGPU-Sim.GPGPU-Sim 2.0b simulator infrastructure runs CUDA applications without source code mod-ifications on Linux based platforms, but does require access to the application’s source code4.Nevertheless, we explain the version 2.0b of the simulator here since it was used for the experi-3 This was a group effort [11].4 My colleagues have removed this requirement in the later versions of the simulator that can extract the PTXassembly directly from the CUDA binary.28ments in this chapter at the time. Later versions of GPGPU-Sim are also capable of attaching as adynamically linked library to a CUDA binary and do not require modification to the compilationprocess as explained next for version 2.0b of the simulator.To build a CUDA application for GPGPU-Sim 2.0b, the common.mk makefile used in theCUDA SDK is replaced with a version that builds the application to run on the simulator (whileother more complex build scenarios may require more complex makefile changes).Figure 3.4 shows how a CUDA application can be compiled for simulation on GPGPU-Sim and compares this compilation flow to the normal CUDA compilation flow [90]. Bothcompilation flows use cudafe to transform the source code of a CUDA application into hostC code running on the CPU and device C code running on the GPU. The GPU C code is thencompiled into PTX assembly (labeled “.ptx” in Figure 3.4) by nvopencc, an open source compilerprovided by NVIDIA based on Open64 [83, 96]. The PTX assembler (ptxas) then assembles thePTX assembly code into the target GPU’s native ISA (labeled “cubin.bin” in Figure 3.4a). Theassembled code is then combined with the host C code and compiled into a single executablelinked with the CUDA Runtime API library (labeled “libcuda.a” in Figure 3.4) by a standardC compiler. In the normal CUDA compilation flow (used with NVIDIA GPU hardware), theresulting executable calls the CUDA Runtime API to set up and invoke compute kernels onto theGPU via the NVIDIA CUDA driver.When a CUDA application is compiled to use GPGPU-Sim 2.0b, many steps remain thesame. However, rather than linking against the NVIDIA supplied libcuda.a binary, we linkagainst our own libcuda.a binary. Our libcuda.a implements “stub” functions for the interface de-fined by the header files supplied with CUDA. These stub functions set up and invoke simulationsessions of the compute kernels on GPGPU-Sim (as shown in Figure 3.4b). Before the first sim-ulation session, GPGPU-Sim parses the text format PTX assembly code generated by nvopenccto obtain code for the compute kernels. Because the PTX assembly code has no restriction on29register usage (to improve portability between different GPU architectures), nvopencc performsregister allocation using far more registers than typically required to avoid spilling. To improvethe realism of the performance model, the register usage per thread and shared memory usedper thread block are determined using ptxas5. This information is then used to limit the num-ber of thread blocks that can run concurrently on a compute core. The GPU binary (cubin.bin)produced by ptxas is not used by GPGPU-Sim 2.0b. After parsing the PTX assembly code, butbefore beginning simulation, GPGPU-Sim performs an immediate post-dominator analysis oneach kernel to annotate branch instructions with reconvergence points for the stack-based SIMDcontrol flow handling mechanism described by Fung et al. [29]. During a simulation, a PTXfunctional simulator executes instructions from multiple threads according to their schedulingorder as specified by the performance simulator. When the simulation completes, the host CPUcode is then allowed to resume execution. In the version 2.0b implementation of the GPGPU-Sim simulator, host code runs on a physical CPU, as a result our performance measurements arefor the GPU code only. Recently other simulators have been developed that allow simulation ofhost code as well by integrating GPGPU-Sim with CPU simulators. Both gem5-gpu [38] andFusionSim [128] are such simulators.3.4 BenchmarksThe benchmarks used in this chapter are listed in Table 3.1 along with the main applicationproperties, such as the organization of threads into thread blocks and grids as well as the differ-ent memory spaces on the GPU exploited by each application. If the application runs multiplekernels they are shown as multiple entries (one per line) inside the grid and block dimensioncolumns of Table 3.1. These benchmarks were chosen from a limited set of publicly available5 By default, the version of ptxas in CUDA 1.1 appears to attempt to avoid spilling registers provided the numberof registers per thread is less than 128 and none of the applications we studied reached this limit. Directing ptxas tofurther restrict the number of registers leads to an increase in local memory usage above that explicitly used in thePTX assembly, while increasing the register limit does not increase the number of registers used.30CUDA applications at the time of the studies for this chapter as explained on page 18. A moredetailed description of these benchmarks is provided in [11].For comparison purposes we also simulated the following benchmarks from NVIDIA’s CUDASoftware Development Kit (SDK) [109]: Black-Scholes Option Pricing, Fast Walsh Transform,Binomial Option Pricing, Separable Convolution, 64-bin Histogram, Matrix Multiply, ParallelReduction, Scalar Product, Scan of Large Arrays, and Matrix Transpose. Due to space limita-tions, and since most of these benchmarks already perform well on GPUs, we only report detailsfor Black-Scholes (BLK) and Fast Walsh Transform (FWT). Black-Scholes is a financial optionspricing application and FWT is heavily used in signal and image processing applications as wellas compression. We also report the harmonic mean of all the SDK applications simulated whichis denoted as SDK in the data bar charts in Section 3.5. In this chapter, the harmonic mean ofall the non-SDK benchmarks are reported as HM bars in the result figures of Section 3.5. Thatis, in Section 3.5 HM bars represent harmonic means of non-SDK benchmarks and SDK barsrepresent the harmonic mean of CUDA SDK benchmarks.31Table 3.1: Benchmark PropertiesBenchmark Abr. Grid Block Concurrent Total Inst. Shared Const. Texture BarriersDim. Dim. Blocks/core Threads Count Mem. Mem. Mem.AES Cryptography [74] AES (257,1,1) (256,1,1) 2 65792 28M Yes Yes 1D YesGraph Algorithm: BFS (128,1,1) (512,1,1) 4 65536 17M No No No NoBreadth First Search [36]Coulombic Potential [41, 106] CP (8,32,1) (16,8,1) 8 32768 126M No Yes No NogpuDG [121] DG (268,1,1) (84,1,1) 5 22512 596M Yes No 1D Yes(268,1,1) (112,1,1) 6 30016 Yes(603,1,1) (256,1,1) 4 154368 No3D Laplace Solver [31] LPS (4,25,1) (32,4,1) 6 12800 82M Yes No No YesLIBOR Monte Carlo [32] LIB (64,1,1) (64,1,1) 8 4096 907M No Yes No NoMUMmerGPU [108] MUM (782,1,1) (64,1,1) 3 50000 77M No No 2D NoNeural Network NN (6,28,1) (13,13,1) 5 28392 68M No No No NoDigit Recognition [16] (50,28,1) (5,5,1) 8 35000 No(100,28,1) (1,1,1) 8 2800 No(10,28,1) (1,1,1) 8 280 NoN-Queens Solver [98] NQU (223,1,1) (96,1,1) 1 21408 2M Yes No No YesRay Tracing [77] RAY (16,32,1) (16,8,1) 3 65536 71M No Yes No YesStoreGPU [6] STO (384,1,1) (128,1,1) 1 49152 134M Yes No No NoWeather Prediction [79] WP (9,8,1) (8,8,1) 3 4608 215M No No No NoBlack-Scholes BLK (256,1,1) (256,1,1) 3 65536 236M No No No Nooption pricing [109]Fast Walsh Transform [109] FWT (512,1,1) (256,1,1) 4 131072 240M Yes No No Yes(256,1,1) (512,1,1) 2 131072 Yes32Table 3.2: Hardware ConfigurationNumber of Compute Cores 28Warp Size 32SIMD Pipeline Width 8Number of Threads / Core 256 / 512 / 1024 / 1536 / 2048Number of Thread Blocks / Core 2 / 4 / 8 / 12 / 16Number of Registers / Core 4096 / 8192 / 16384 / 24576 / 32768Shared Memory / Core (KB) 4/8/16/24/32 (16 banks, 1 access/cycle/bank)6Constant Cache Size / Core 8KB (2-way set associative 64B lines LRU)Texture Cache Size / Core 64KB (2-way set associative 64B lines LRU)Number of Memory Channels 8L1 Cache None / 16KB / 32KB / 64KB4-way set assoc. 64B lines LRUL2 Cache None / 128KB / 256KB8-way set assoc. 64B lines LRUGDDR3 Memory Timing tCL=9, tRP=13, tRC=34tRAS=21, tRCD=12, tRRD=8Bandwidth per Memory Module 8 (Bytes/Cycle)DRAM request queue capacity 32 / 128Memory Controller out of order (FR-FCFS) /in order (FIFO) [104]Branch Divergence Method Immediate Post Dominator [29]Warp Scheduling Policy Round Robin among ready warps3.5 Analysis of CUDA WorkloadsIn this section, we first present our simulation methodology and configuration. We then providea detailed analysis and evaluation of the designs introduced in Section 3.2 .3.5.1 MethodologyThe simulator configurations for this chapter are shown in Table 3.2. Rows with multiple entriesshow the different configurations used for simulations in this chapter. Baseline configurationparameters are shown in boldface. To simulate the mesh network, we used a detailed inter-connection network model, incorporating the configurable interconnection network simulator33Table 3.3: Interconnection Network ConfigurationTopology Mesh / Torus / Butterfly / Crossbar / RingRouting Mechanism Dimension Order / Destination TagRouting delay 1Virtual channels 2Virtual channel buffers 4Virtual channel allocator iSLIP / PIMAlloc iters 1VC alloc delay 1Input Speedup 2Flit size (Bytes) 8 / 16 / 32 / 64introduced by Dally et al. [22] which is called Booksim 1. Table 3.3 shows the on-chip networkconfiguration used in our simulations for this chapter.Note that it is not sufficient to view the compute core as a mere generator of address streamsfor simulating on-chip network and caching impacts on performance. Our system is a closed-loop feedback system and the on-chip network’s interactions with the compute nodes can directlyimpact the future stream of memory requests sent to DRAM. For example a long latency miss willchange the order of future memory requests since the corresponding warp will not be consideredduring the thread scheduling process until the miss returns.In order to capture the distinct phases of all kernels, we simulate all benchmarks to comple-tion. Kernels can behave differently at their tail end compared to their beginning. The perfor-mance difference can be significant if a benchmark with multiple short kernels is not simulatedto completion.We note that the breadth-first thread block distribution heuristic described in Section 3.2.2can occasionally lead to counter-intuitive performance results due to a phenomena we will referto as thread block load imbalance7. This thread block load imbalance can occur when the numberof thread blocks in a grid exceeds the number that can run concurrently on the hardware. Forexample, consider an application with six thread blocks. Consider this application is running7 The example presented in this paragraph is the result of close collaboration with Tor Aamodt [11].34on a GPU with two compute cores where at most two thread blocks can run concurrently oneach compute core. Assume execution of one thread block on one core takes time T and runningtwo thread blocks on one core takes time 2T. In this case, there is no off-chip accesses and eachthread block has six or more warps which are enough for one thread block to keep the 24 stagepipeline full. If each thread block in isolation takes equal time T, total application time is 3T(2T for the first round of four thread blocks plus T for the last two thread blocks which runon separate compute cores). Now suppose we introduce an enhancement that allows the threadblocks to run in time 0.90T to 0.91T when run alone (note a single thread block execution isnow faster). If both thread blocks on the first core now finish ahead of those on the other core attime 1.80T versus 1.82T, then our thread block distributor will issue both of the remaining twothread blocks onto the first core, causing the load imbalance. With the enhancement, this actuallycauses an overall slowdown since now 4 thread blocks need to be completed by the first core,requiring a total time of at least 3.6T (while the second core is only running 2 thread blocks).We verified that this behaviour occurs by plotting the distribution of thread blocks to computecores versus time for both configurations being compared. This effect would be less significantin a real system with larger data sets and therefore grids with a larger number of thread blocks.Rather than attempt to eliminate the effect by modifying the scheduler (or the benchmarks) wesimply note where it occurs.3.5.2 Instruction ClassificationFigure 3.5 shows the classification of each benchmark’s instruction types in the form of dynamicinstruction frequency. The Fused Multiply-Add and ALU Ops (other) sections of each bar showthe proportion of total ALU operations for each benchmark which varies from 58% for NQUto 87% for BLK. Only DG, CP and NN utilize the fused multiply-add operations extensively.350%20%40%60%80%100%AESBFSCPDGLPSLIBMUMNNNQURAYSTOWPBLKFWTInstructionClassificationSFU Ops Control Flow Fused Multiply-Add ALU Ops (other) Mem Ops Figure 3.5: Instruction classification.0%25%50%75%100%AES BFS CP DG LPS LIB MUM NN NQU RAY STO WP BLK FWTMemory Instruction ClassificationShared Tex Const Param Local GlobalFigure 3.6: Memory instructions breakdown.Special Function Unit (SFU)8 instructions are also only used by a few benchmarks. CP is theonly benchmark that has more than 10% SFU instructions.The memory operations portion of Figure 3.5 is further broken down in terms of type asshown in Figure 3.6. Note that “param” memory refers to parameters passed through the GPUkernel call, which are always treated as cache hits. There is a large variation in the memoryinstruction types used among benchmarks: for CP over 99% of accesses are to constant memorywhile for NN most accesses are to global memory.Note that, it is sufficient to run the simulation in the functional mode to extract the precedingclassification results. However, the rest of studies in this chapter are performed in the timingmode of the simulator.8 The architecture of the NVIDIA GeForce 8 Series GPUs includes a special function unit for transcendental andattribute interpolation operations [71]. We include the following PTX instructions in our SFU Ops classification:cos, ex2, lg2, rcp, rsqrt, sin, sqrt.360 32 64 96128160192224AES BFS CP DG LPS LIB MUM NN NQU RAY STO WP HM BLK FWT SDK IPCBaselinePerfect MemoryMaximum IPC = 224Figure 3.7: Baseline performance (HM=Harmonic Mean).3.5.3 BaselineAs the first step we simulated the baseline GPU configuration. The performance of our baselineconfiguration is shown in Figure 3.7 and Table 3.2 shows its configuration parameters in bold-face. The performance is measured in terms of scalar Instructions Per Cycle (IPC). Figure 3.7also shows the performance of a perfect memory system for comparison. The perfect memoryconfiguration assumes that all memory accesses have zero latency. Note that the baseline con-figuration has 28 compute cores and each compute core has an 8-wide pipeline; as a result themaximum achievable IPC is 224 (28×8).As this figure shows the majority of the SDK benchmarks are optimized for GPU hardwareand can achieve an HM average of 65% of peak IPC. A perfect memory subsystem provides a28% performance boost to the SDK benchmarks. On the hand, the non-SDK benchmarks achievean HM average IPC that is only 5% of peak IPC while the perfect memory subsystem providesthese benchmarks with a 54% performance increase. These observations show that the non-SDK benchmarks have a higher potential for performance improvement compared to the SDKbenchmarks as expected. The HM average IPC of the non-SDK benchmarks is mainly draggeddown by these four benchmarks: NQU, NN, MUM and BFS. In the following sections, we willshow the reasons for the gap between actual and peak performance of the benchmarks we study.However, we explain how the simulation results were validated first.37My colleagues validated the simulator against an NVIDIA Geforce 8600GTS GPU by com-paring the IPC value estimated by simulator with the IPC obtained from hardware. The corre-lation coefficient was calculated to be 0.899 [11]. We observed that the simulator follows theperformance trends of the real hardware. That is, the applications that perform well in the realGPU hardware also perform well in the simulator and vice versa.3.5.4 On-Chip Network TopologyFigure 3.8 shows the speedup of various on-chip network topologies compared to a mesh with16 Byte channel bandwidth. On average our baseline mesh interconnect performs comparable toa crossbar with input speedup of two for the workloads that we consider. We also have evaluatedtwo torus topologies: “Torus - 16 Byte Channel BW”, which has double the bisection bandwidthof the baseline “Mesh” (a determining factor in the implementation cost of a network); and“Torus - 8 Byte Channel BW”, which has the same bisection bandwidth as “Mesh”. The “Ring”topology that we evaluated has a channel bandwidth of 64. The “Crossbar” topology has aparallel iterative matching (PIM) allocator as opposed to an iSLIP allocator for other topologies.The two-stage butterfly and crossbar employ destination tag routing while others use dimension-order routing. The ring and mesh networks are the simplest and least expensive networks to buildin terms of area.As Figure 3.8 suggests, most of the benchmarks are fairly insensitive to the topology used.In most cases, a change in topology results in less than 20% change in performance from thebaseline, with the exception of the ring as well as the torus with 8 byte channel bandwidth. BLKexperiences a performance gain with the ring due to the thread block load imbalance phenomenadescribed in Section 3.5.1. BLK has 256 thread blocks. For the ring configuration, the numberof thread blocks executed per compute core varies from 9 to 10. However, for the baselineconfiguration, one of the compute cores is assigned 11 thread blocks due to small variations intime coupled with our greedy thread block distribution heuristic. When more thread blocks run380.2 0.4 0.6 0.8 1 1.2 AES BFS CP DG LPS LIB MUM NN NQU RAY STO WP HM BLK FWT SDK Speedup Crossbar - 16 Byte Channel BW 2 Stage Butterfly - 16 Byte Channel BW Torus - 16 Byte Channel BW Torus - 8 Byte Channel BW Ring - 64 Byte Channel BW Figure 3.8: Interconnection network topology comparison.on a compute core, all thread blocks on that compute core take longer to complete, resulting in aperformance loss for the baseline configuration for BLK.As we will show in the next section, one of the reasons why different topologies do not changethe performance of most benchmarks dramatically is that the benchmarks are not sensitive tosmall variations in latency, as long as the interconnection network provides sufficient bandwidth.3.5.5 Interconnection Latency and BandwidthFigure 3.9 shows the IPC results for various mesh network router latencies. Without affectingpeak throughput, we add an extra pipelined latency of 4, 8, or 16 cycles to each router on top ofour baseline router’s latency. An extra four cycle latency per router is easily tolerated for mostbenchmarks and causes only a 3.5% performance loss when harmonically averaged across allbenchmarks. BLK and CP experience a performance gain due to the thread block load imbal-ance phenomena described in Section 3.5.1. With eight extra cycles of latency per router, theperformance degradation is noticeable (slowdown by 9% on average) and becomes much worse(slowdown by 25% on average) at 16 cycles of extra latency. Note that these experiments areonly intended to show the latency sensitivity of benchmarks.390.4 0.6 0.8 1 1.2 AES BFS CP DG LPS LIB MUM NN NQU RAY STO WP HM BLK FWT SDK Speedup Extra 4 cycles Extra 8 cycles Extra 16 cycles Figure 3.9: Interconnection network latency sensitivity.0.4 0.6 0.8 1 1.2 1.4 AES BFS CP DG LPS LIB MUM NN NQU RAY STO WP HM BLK FWT SDK Speedup 8B 16B (Baseline) 32B 64B Figure 3.10: Interconnection network bandwidth sensitivity.We also modify the mesh interconnect bandwidth by varying the channel bandwidth from8 bytes to 64 bytes. Figure 3.10 shows that halving the channel bandwidth from 16 bytes to8 bytes has a noticeable negative impact on most benchmarks, but doubling and quadruplingchannel bandwidth only results in a small gain for a few workloads such as BFS and DG.DG is the most bandwidth sensitive workload, getting a 31% speedup and 53% slowdown forflit sizes of 32 and 8 respectively. The reason why DG does not exhibit further speedup with flitsize of 64 is because at this point, the on-chip network has already been over-provisioned. Ouranalysis shows that for the baseline configuration, the input port to the return interconnect frommemory to the compute cores is stalled 16% of the time on average. Increasing the flit size to 32completely eliminates these stalls, which explains the lack of further speedup for a flit size of 64bytes. Note that our memory read request packet sizes are 8 bytes, allowing them to be sent tothe memory controllers in a single flit for all of the configurations shown in Figure 3.10. Overall,the preceding data suggests that performance is more sensitive to interconnect bandwidth than40to latency for the non-graphics workloads that we study. Furthermore, restricting the channelbandwidth can turn the on-chip network into a bottleneck.3.5.6 Memory Controller EffectsIn this section we explore the impact of memory controller scheduler design on the overallperformance9. Our baseline configuration employs a First-Ready First-Come First-Serve (FR-FCFS) [104] memory controller. The baseline FR-FCFS controller is Out-of-Order (OoO) andhas a capacity of 32 memory requests. We compare this baseline to a simple First-In First-Out(FIFO) memory controller as well as a more aggressive FR-FCFS OoO controller with an inputbuffer capacity of 128 (OoO128). We also measure DRAM efficiency and DRAM utilizationmetrics in addition to performance as explained in Section 3.2.5.Figure 3.11 shows a performance comparison of our baseline versus the FIFO and OoO128memory controllers. DRAM efficiency comparisons are shown in Figure 3.13 and DRAM uti-lization comparisons are shown in Figure 3.12. As Figure 3.11 shows several benchmarks suchas AES, CP, and STO exhibit negligible slowdown when using the FIFO scheduler. DRAM ef-ficiency of AES and STO is more than 75% as shown in Figure 3.13. For these benchmarks atany point in time at most two rows in each bank of each DRAM are accessed by all threads. Thismeans that a simple DRAM controller policy such as FIFO suffices to serve these applications.Additionally, AES and STO have a low DRAM utilization as shown in Figure 3.12. While theyprocess large amounts of data, they make extensive use of shared memory (Figure 3.6).CP is not sensitive to memory controller optimizations because it has a very low DRAMutilization. Note that the slows down of CP for OoO128 configuration is due to the thread blockload imbalance described earlier. For many benchmarks like BFS, LIB, MUM, RAY, and WPperformance reduces over 40% when using the FIFO controller. Employing this simple controllerdrastically reduced the DRAM efficiency and utilization for these benchmarks.9 The work presented in this section is the result of close collaboration with my colleague George Yuan [11, 127].410.4 0.6 0.8 1 1.2 1.4 AES BFS CP DG LPS LIB MUM NN NQU RAY STO WP HM BLK FWT SDK Speedup First-In First-Out Out-of-Order 128 Figure 3.11: Impact of DRAM memory controller optimizations.0% 20% 40% 60% AES BFS CP DG LPS LIB MUM NN NQU RAY STO WP BLK FWT DRAM Utilization  First-In First-Out Out-of-Order 32 Out-of-Order 128 Figure 3.12: DRAM utilization.20% 40% 60% 80% 100% AES BFS CP DG LPS LIB MUM NN NQU RAY STO WP BLK FWT DRAM Efficiency  First-In First-Out Out-of-Order 32 Out-of-Order 128 Figure 3.13: DRAM efficiency.In a later follow up work [127], we show that the on-chip networks has a major role inscrambling the memory requests before they get to the memory controller. That is, the memoryrequests have a high DRAM row locality at the time they are injected into the network fromany given compute node. As the request packets traverse the network towards their destination,they lose their DRAM row locality as a result of request streams from different compute nodesmixing with them. The packets from different compute nodes are specifically mixed up becausethe routers try to maintain fairness among their input streams. An out of order memory scheduler420.5 1 1.5 2 2.5 3 3.5 AES BFS CP DG LPS LIB MUM NN NQU RAY STO WP HM BLK FWT SDK Speedup None 16kB L1 32kB L1 64kB L1 64kB L1; 128kB L2 64kB L1; 256kB L2 Figure 3.14: Effects of adding L1 or L2 caches.such as FR-FCFS restores the majority of DRAM row locality that is lost inside the network. Wedemonstrate that [127] it is possible to maintain up to 91% of the performance of out of order FR-FCFS schedulers with simpler FIFO DRAM schedulers by making some simple modificationsin the allocation policies of the routers. The router allocation policies that we propose maintainrow locality of request streams as they traverse towards the memory controllers.3.5.7 Cache EffectsFigure 3.14 shows the effects of adding caches to the system on IPC. The first three bars showthe relative speedup of adding a 16KB, 32KB or 64KB L1 cache to each compute core. The lasttwo bars show the effects of adding a 128KB or 256KB L2 cache to each memory controllerin addition to a 64KB L1 cache in each compute core. As this figure shows, the non-SDKbenchmarks achieve a 21% harmonic mean speedup with addition of a 16KB L1 cache to thecompute cores. The 64KB L1 cache achieves a 26% speedup and adding a 256KB L2 to memorycontrollers brings up the speedup to 30% over baseline with no cache.CP, RAY and FWT exhibit a slowdown with the addition of L1 caches. Close examinationshows that CP experiences a slowdown due to the thread block load imbalance phenomena de-scribed in Section 3.5.1, whereas RAY and FWT experience a slowdown due to the way writemisses and evictions of dirty lines are handled. For the baseline, which has no caches for local/-global accesses, writes to memory only cause the memory controller to read data out of DRAM if43a portion of a 16B chunk is modified due to writes that are not coalesced. When caches are addedfor local and global accesses, for simplicity, a write miss prevents a warp from being scheduleduntil the cache block is read from DRAM. Furthermore, when a dirty line is evicted, the entireline is written to DRAM even if only a single word of that line is modified.Benchmarks that make extensive use of “shared memory”, namely AES, LPS, NQU andSTO, do not respond significantly to caches. The CUDA SDK benchmarks are similarly opti-mized to achieve high performance with the existing hardware at the time of their developmentwhich did not include data caches. As can be expected, addition of caches does not change theirperformance. On the other hand, BFS and NN have the highest ratio of dynamic global memoryinstructions to all instructions (at 19% and 27% respectively) and so they experience the highestspeedup among workloads.3.5.8 Thread Block Scheduling EffectsIncreasing the number of simultaneously running threads can improve performance by increasingGPU’s ability to hide memory access latencies as we mentioned earlier. However, doing so mayresult in a higher contention for the resources shared by all threads such as on-chip networkand memory. This section explores the effects of varying the resources that limit the numberof threads and hence thread blocks that can run concurrently on a compute core10. This is donewithout modifying the source code for the benchmarks. The maximum per compute core amountof registers, shared memory, threads, and thread blocks are varied between 25% to 200% of thoseavailable to the baseline. These four parameters together set the limit for the maximum numberof thread blocks that can be issued to any given compute core. The values we use are shownin Table 3.2.The amount of resources available per compute core is a configurable simulationoption, while the amount of resources required by each kernel is extracted using the ptxas tool.10 The work presented in this section is the result of close collaboration with my colleagues Henry Wong andGeorge Yuan [11].440.5 0.75 1 1.25 1.5 AES BFS CP DG LPS LIB MUM NN NQU RAY STO WP HM BLK FWT SDK Speedup 25% 50% 100% (Baseline) 150% 200% Figure 3.15: Effects of varying number of thread blocks.The results are shown in Figure 3.15. Some benchmarks are already resource-constrained toonly one or two thread blocks per compute core with the baseline configuration. As a resultsthese benchmarks cannot run using a configuration with less resources (25% or 50%). The barsfor configurations that cannot run for this reason are not shown in Figure 3.15.The performance of LPS, NN and STO increases as the number of thread blocks increases.We now explain the unexpected behaviour of the other benchmarks. LPS cannot take advantageof additional resources beyond the baseline (100%) because all of its thread blocks can alreadyrun simultaneously for the baseline configuration. A single thread block of STO uses all of theshared memory in the baseline configuration. Therefore, increasing the shared memory amountby half for the 150% configuration results in no increase in the number of concurrently runningthread blocks for STO. AES and MUM show clear trends in decreasing performance as the num-ber of thread blocks increases. AES and MUM experience increased contention in the memorysystem with more concurrent thread blocks. This results in 8.6× and 5.4× worse average mem-ory latency, respectively, when comparing the 200% configuration to the 50% configuration.BFS, RAY, and WP show distinct optima in performance when the thread block limit is at100%, 100% and 150% of the baseline core, respectively. Above these limits, DRAM efficienciesdecrease and memory latencies increase which suggest an increased contention in the memorysystem. For configuration with limits below the optima, the lack of enough active warps to hidethe memory latencies reduces performance. CP suffers from the thread block load imbalance for450.8 1 1.2 1.4 AES BFS CP DG LPS LIB MUM NN NQU RAY STO WP HM BLK FWT SDK Speedup Figure 3.16: Effect of inter-warp memory coalescing.the 50% and 100% configurations. Similarly, DG suffers from thread block load imbalance inthe 150% configuration.As our observations show, assigning the maximum possible number of thread blocks to acompute core is not always an optimal policy. The optimal thread block scheduling policy de-pends on the workload and varies widely.3.5.9 Memory Request CoalescingFigure 3.16 presents the performance improvements of enabling inter-warp memory coalescingdescribed in Section 3.2.3. The harmonic mean speedup over intra-warp coalescing is 6.1%while DG and MUM improve up to 41%. CP’s slowdown with inter-warp coalescing is due tothe load imbalance in thread block distribution. Accesses in AES, DG and MUM are to datadependent locations which makes it harder to use the explicitly managed shared memory tocapture locality. These applications use the texture cache to capture this locality and inter-warpmerging effectively eliminates additional requests for the same cache block at the expense ofassociative search hardware.It is interesting to observe that the harmonic mean speedup of the CUDA SDK benchmarksis less than 1%, showing that these highly optimized benchmarks do not benefit from inter-warpmemory coalescing. Their careful program optimizations ensure less redundancy in memoryrequests generated by each thread.463.6 SummaryIn this chapter, we studied the performance of twelve contemporary CUDA applications by run-ning them on a detailed performance simulator that simulates NVIDIA’s PTX virtual instructionset architecture. We presented performance data and detailed analysis of performance bottle-necks, which differ in type and scope from application to application. We showed that adding L1and L2 caches for local/global memory can improve the performance of these benchmarks by aharmonic mean average of 30%. We observed that aggressive inter-warp memory coalescing canimprove performance in some applications by up to 41%. We found that generally performanceof these applications is more sensitive to the on-chip network’s bisection bandwidth rather thanzero load latency. Based on the observations in this chapter, we identify on-chip networks forfurther investigation and focus on designing optimized networks for throughput accelerators inthe following chapters.47Chapter 4Detailed NoC Analysis for ThroughputAcceleratorsIn this chapter, we study the impact of on-chip networks across a wide range of compute acceler-ator applications. The applications analyzed in the preceding chapter were chosen from a limitedset of applications available at the time of that study. While they are well suited for architecturalexploration purposes, in this chapter we set out to use a much broader set of applications as wefocus on designing on-chip networks.We present limit studies that identify the impact of on-chip communication on overall per-formance of accelerator applications. Based on our analysis, we show how conventional net-work improvements do not significantly improve overall performance while simply increasingchannel width results in significant performance gains but with a large and unacceptable area in-crease. Consequently, we propose simultaneously considering the effect of the NoC on parallelapplication-level performance and chip area to find NoCs which are throughput-effective.Highly multi-threaded applications running on multi-core microprocessors may have coher-ence traffic and data sharing resulting in significant core-to-core communication. In contrast,BSP-like accelerator applications [19, 48] tend to organize communication to be local to a group48of threads executing on a single compute core and have less communication between threadsin different groups. We identify that the many-to-few-to-many traffic pattern of throughput ac-celerators creates a traffic imbalance and show how the overall system performance is directlycorrelated with the injection rate of the few memory controller nodes.0.0024	  0.0029	  0.0034	  0.0039	  0.0044	  190	   210	   230	   250	   270	   290	   310	  (Chip	  Area)-­‐1	  	  [1/mm2 ]	  Average	  Throughput	  [IPC]	  	  Ideal	  NoC	  Less	  Area	  	  2x	  BW	  	  Balanced	  	  Mesh	  Higher	  Throughput	  Thr.Eff.	  Figure 4.1: Throughput-effective design space. “Balanced Mesh”: bisection bandwidthbalanced to off-chip DRAM bandwidth (Section 4.2); “Thr. Eff.”: mesh networkoptimized for many-to-few-to-many traffic (Section 5); “2x BW”: mesh with doublechannel width.An implication of the aforementioned observations is the following. Starting from a baselinemesh topology with bisection bandwidth balanced to effective off-chip memory bandwidth (la-beled “Balanced Mesh” in Figure 4.1) application-level throughput can be increased while main-taining a regular NoC topology by naively increasing the channel bandwidth. The “2× BW”data point in Figure 4.1 shows the impact this has on throughput-effectiveness (IPC/mm2). Thisfigure decomposes throughput per unit chip area as the product of application-level throughput(measured in IPC) on the x-axis and inverse area (1/mm2) on the y-axis1. Curves in this figurerepresent constant throughput-effectiveness (IPC/mm2) and design points closer to the top right1Average throughputs are for benchmarks in Table 4.1, described in Section 4.1, using configurations describedin Section 6. The area estimates are from Section 6.8 assuming 244mm2 is used for non-NoC components of thechip.49near “Ideal NoC” are more desirable. An ideal NoC has infinite bandwidth, zero latency, andzero NoC area. In contrast, the “Thr. Eff.” point results from modifying the baseline NoC totake advantage of the many-to-few-to-many traffic, resulting in a design closer to the throughput-effectiveness of an ideal NoC than alternative designs and is achieved by designs introduced inChapter 5.4.1 Baseline Architecture for NoC AnalysisIn this section we describe the baseline accelerator architecture and on-chip network used for thestudies in this chapter and the rest of the dissertation. The baseline is updated (compared to theone used in previous chapter) as we will explain shortly. Throughput accelerators can be classi-fied along several dimensions: SIMT versus SIMD, degree of multithreading per core, supportfor caching and coherence, and the granularity at which heterogeneity is introduced. We studya generic architecture with some similarities to NVIDIA’s Fermi [89] and GeForce GTX 280,but our baseline is not meant to be identical to any specific GPU. The Fermi architecture [89]provides improvements over previous generation such as caches, faster atomics, and better dou-ble precision floating-point support. We believe that our conclusions are applicable to otherarchitectures. We employ 31 benchmarks that cover a large spectrum of applications written inCUDA [85, 92]. Many of the benchmarks we use are “dwarves” [7] from Rodinia [19]. Rodiniabenchmarks are collected specifically for the purpose of evaluating accelerators. They includevarious applications that are important for science and engineering applications such as graphtraversal, structured/unstructured grids, dynamic programming and dense linear algebra. Thecomplete set of benchmarks is described in Table 4.1.Our baseline architecture is illustrated in Figures 4.2, 4.3, and 4.4. Figure 4.2 illustrates theoverall chip layout showing the placement of compute nodes and memory controller nodes. Inthis work, we assume a 2D mesh topology with the memory controllers (MCs) placed on the topand the bottom rows, similar to the topology and layout used in Intel’s 80-core design [119] and50Figure 4.2: Compute accelerator showing layout of compute node routers and MC noderouters in baseline mesh. Shaded routers on top and bottom are connected to MCs.Tilera TILE64 [122] processors. We are interested in a general purpose accelerator architecture,therefore, similar to Intel’s Pangaea [124], we do not consider fixed function graphics hardware.4.1.1 NetworkCurrent GPUs often use a crossbar with concentration (to share a single port among severalcores). This results in a few number of ports and makes the crossbar a viable option but as thenumber of cores is bound to increase in the future, the scalability of this approach will be lim-ited. It has been shown that the overheads of high cardinality switches (e.g., a 30×30 crossbar)are unacceptable [103]. A crossbar would connect components that are scattered throughout thechip. Routing very long wires to and from a central crossbar is neither easy nor inexpensive.In addition, prior work [11], which included a crossbar comparison, showed that for the work-loads we consider performance is relatively insensitive to topology. Thus, we chose a 2D meshtopology since it provides a very regular, simple and scalable network [14]. To avoid protocoldeadlock, request and reply traffic have dedicated virtual channels. Comparisons with asymmet-ric crossbars as well as topologies such as CMesh [14] and flattened butterfly [55] are presented51in Sections 6.10.1 and 6.10.2, respectively. The techniques provided in this work are orthogonalto concentration as we will discuss in more detail in Sections 6.10.1 and 6.10.3.DispatchQueue(Warps)OCDDL1 D$   S  .  SIMT stacksL1 I$RouterCompute NodeFigure 4.3: Compute node.L2 bankMemoryControllerGDDR3Off-Chip GDDRMCNodeRouterFigure 4.4: Memory controllernode.4.1.2 Compute Nodes and Memory ControllersFigure 4.3 illustrates a compute node. We assume 8-wide SIMT pipelines that execute over fourclock cycles. Each compute core maintains a dispatch queue holding up to 32 ready warps (rep-resenting up to 1024 scalar threads). Memory operations (loads and stores) to global memory(visible to all threads on all cores) go through a memory divergence detection stage (DD) thatattempts to “coalesce” memory accesses from different scalar threads within a warp that accessa single L1 cache line so that only one request is made per cache block miss. In line with recentmulticore architectures such as Sun Niagara [57] and as suggested by an NVIDIA patent appli-cation [86], we place shared L2 cache banks adjacent to the MCs. Applications also employ asoftware-managed scratchpad memory or “shared memory” (S) in NVIDIA’s terminology. Ad-52Table 4.1: Benchmark Abbreviations and Their Sources (Rodinia [19], SDK [109], andGPGPU-Sim [11])Name Abbr. Name Abbr.Kmeans (Rodinia) KM CFD Solver (Rodinia) CFDLeukocyte (Rodinia) LE Heart Wall Tracking (Rodinia) HTLU Decomposition (Rodinia) LU Similarity Score (Rodinia) SSBack Propagation (Rodinia) BP Streamcluster (Rodinia) STCHotSpot (Rodinia) HSP Needleman-Wunsch (Rodinia) NDLSpeckle Reducing Anisotropic SR Neural Network Digit NEDiffusion (Rodinia) Recognition (GPGPU-Sim)BFS Graph Traversal (Rodinia) BFS Nearest Neighbor (Rodinia) NNCAES Cryptography (GPGPU-Sim) AES MUMmerGPU(Rodinia, GPGPU-Sim) MUMBlack-Scholes Option Pricing (SDK) BLK LIBOR Monte Carlo (GPGPU-Sim) LIBBinomial Option Pricing (SDK) BIN Matrix Multiplication [105] MMSeparable Convolution (SDK) CON Ray Tracing (GPGPU-Sim) RAY3D Laplace Solver (GPGPU-Sim) LPS Matrix Transpose (SDK) TRAFast Walsh Transform (SDK) FWT Parallel Reduction (SDK) RDgpuDG (GPGPU-Sim) DG Scalar Product (SDK) SCP64-bin Histogram (SDK) HIS Scan of Large Arrays (SDK) SLAWeather Prediction (GPGPU-Sim) WPdresses are low-order interleaved among MCs every 256 bytes [37] to reduce the likelihood ofany single MC becoming a hot-spot [101].4.2 CharacterizationIn this section we analyze characteristics of BSP applications written in CUDA on the baselinearchitecture described in Section 4.1 using closed-loop execution driven simulations (see Sec-tion 6.1 for configuration details). We start by identifying the bisection bandwidth required toachieve a balanced NoC design when considering the heavy off-chip demands of acceleratorworkloads. Then, we classify our applications by the intensity of on-chip traffic they generateand their application-level throughput sensitivity to NoC optimizations.530.50 0.75 1.00 0.50 0.75 1.00 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 Normalized Application  Level Throughput/Cost Normalized Application  Level Throughput Bandwidth Limit of Ideal Interconnect [fraction of off-chip DRAM bandwidth] Application Level Throughput Application Level Throughput/Cost Bisection bandwidth of baseline mesh  Figure 4.5: Limit study to find a balanced bisection bandwidth. The bisection bandwidthof a mesh with 16B channel size can achieve 91% application-level throughput (IPC)of a network with infinite bandwidth while maximizing application-level throughputper unit estimated area cost.4.2.1 Balanced DesignWe first size the bisection bandwidth of our network with the aim of finding a balanced design.Bisection bandwidth is a key parameter limiting network throughput. It is defined as the min-imum bandwidth over all cuts that partition the network with equal number of nodes in eachhalf [22]. Starting from an on-chip network with bisection bandwidth that is “too low” may sig-nificantly limit application throughput for memory bound applications (which should instead belimited by off-chip bandwidth) while an on-chip network with bisection bandwidth that is “toohigh” may waste area.Figure 4.5 plots two curves: One curve (square markers) is the harmonic mean through-put (IPC) of our benchmarks assuming realistic timing models for compute nodes and memorynodes, but a zero latency network with limited aggregate bandwidth. This network has zero la-tency once a flit is accepted, but it limits the number of flits accepted per cycle by enforcingthe bandwidth limit specified on the x-axis. Here, bandwidth is total flits transmitted across thenetwork, expressed as a fraction of peak DRAM bandwidth. A packet is accepted provided thebandwidth limit has not been exceeded. Multiple sources can transmit to a destination in onecycle and a source can send multiple flits in one cycle. Application-level throughput is normal-54ized to that obtained with an infinite bandwidth zero latency network. The slight improvementsbeyond the point where bisection bandwidth is equal to DRAM bandwidth (1.0 on x-axis) is dueto the presence of L2 caches.The other curve (diamond markers) shows this throughput divided by an estimated chip area.Chip area here includes compute node area and NoC area. NoC area is estimated to be propor-tional to the square of the channel bandwidth [14]. Although higher network bandwidth contin-ues to increase the performance, when normalized to cost, an optimal design from a performanceper area perspective occurs at around bisection bandwidth ratio of 0.7–0.8. In addition, since per-formance is generally limited by off-chip bandwidth due to a lack of locality in the workloads andconsidering the activate/precharge overheads of switching DRAM pages, a network bandwidthwith 70–80% of the peak off-chip DRAM bandwidth also provides a balanced network design.Based on this bisection bandwidth ratio, we determine that this ratio approximately correspondsto a 2D mesh network with 16-byte channels.2Note, the resulting balanced channel bandwidth depends on the ratio of NoC area to the restof the chip. If the ratio of NoC goes down the maximum point of throughput/cost curve (diamondmarkers) moves to the right and the curve becomes more flat. For example, if we assume thatNoC is 3× smaller than our estimated area the maximum will occur around the 1.2 mark onx-axis which corresponds to a network with 24-byte channels. While starting from a differentbaseline bandwidth would change the raw numbers in the rest of the dissertation, it would notaffect the trends.550% 50% 100% 150% 200% 250% AES BIN HSP NE NDL HW LE HIS LU SLA BP CON NNC BLK MM LPS RAY DG TRA SR WP MUM LIB FWT SCP KM STC RD SS CFD BFS HM Speedup LLLHHHFigure 4.6: Speedup of an ideal NoC over baseline. LL, LH, HH: First character denoteslow or high speedup with ideal NoC; second character denotes low or high memorydemand. HM is harmonic mean of all benchmarks.0% 20% 40% 60% 80% 100% AES BIN HSP NE NDL HW LE HIS LU SLA BP CON NNC BLK MM LPS RAY DG TRA SR WP MUM LIB FWT SCP KM STC RD SS CFD BFS HM Fraction  of peak IPC Ideal NoC Baseline Mesh Figure 4.7: IPC of the ideal NoC and baseline NoC shown as a fraction of the peak theo-retical IPC of the chip.4.2.2 Network Limit StudyNext we perform a limit study to measure the performance benefits of an ideal NoC (zero latencyand infinite bandwidth) versus our baseline mesh with 16B channel size. This gives us an up-per bound for application-level throughput improvements that are possible by optimizing NoC.Figure 4.6 shows the speedup of an ideal network over the mesh with 16B channel bandwidth, a4-stage router pipeline and a 1-cycle channel delay (5-cycle per hop delay) with the parametersshown in Table 6.3.We divide applications into three groups using a two-letter classification scheme. The firstletter (H or L) denotes high or low (greater or less than 30%) speedup with an ideal network. Thesecond letter (H or L) denotes whether the application sends a heavy or light amount of trafficwith an ideal network: accepted traffic, averaged across all nodes, is greater than or less than onebyte per cycle. All of our applications fall into one of these three groups: LL, LH, and HH. There2In Figure 4.5, the network transfers at most N flits/cycle at interconnect clock frequency (iclk). The x-axisin Figure 4.5 is x = N [flits/iclk] ·16 [B/flit] ·602 [MHz (iclk)]1107 [MHz (mclk)] ·8 [# MC] ·16 [B/mclk] where mclk is the DRAM clock frequency. At the markedlocation (x = 0.816), N is 12 flits/iclk. Hence, link size is 12 (N) times flit size (16B) divided by 12 (bisection of a36-node mesh has 12 links) equals 16B per channel. Clock frequencies are from Table 6.1.56is no HL group since applications with low memory access are not likely to get a speedup with abetter network. Despite the mesh having sufficient bisection bandwidth (Figure 4.5) the averagespeedup of an ideal network versus our realistic baseline mesh is 42.3% across all benchmarks,102.7% across HH benchmarks and 44.6% across the Rodinia [19] benchmarks. We explore thereasons for this next.Applications in LL place little demand on the network as a result of low utilization of the off-chip DRAM. Studying the source code of these applications and their detailed simulation statis-tics we find several reasons for this: some benchmarks have been heavily optimized to grouprelated threads together on a compute node and make good use of the software-managed scratch-pad memory and/or achieve high L1 hit rates. On the other hand, another set of benchmarks likeNE and LU spend the majority of their execution time running a few threads on each core whichcannot fully occupy the multithreaded pipeline. The LL and HH applications behave as expected:applications that make low use of memory are expected to have low sensitivity to network per-formance and conversely for those with heavy traffic one would expect to see high speedups.The LH group has a moderate memory usage but its performance does not increase much withan ideal network. Figure 4.7 shows the IPC of the ideal NoC and baseline mesh configuration asa fraction of the peak theoretical IPC of the chip. Peak theoretical IPC is reached when all theSIMD lanes of the chip are executing one instruction per cycle. All but one of LH benchmarksachieve close to peak performance indicating that the NoC is not the bottleneck. The exception,NNC, has an insufficient number of threads to fully occupy the fine-grain multithreaded pipelineor to saturate the memory system.4.2.3 Router Latency and Bisection BandwidthIn this section we show that aggressive router latency optimizations [60, 61, 82, 99] do notprovide significant performance benefits for our workloads. Figure 4.8 shows that replacing the4-cycle baseline routers with aggressive 1-cycle routers results in fairly modest speedups ranging570% 20% 40% 60% 80% 100% AES BIN HSP NE NDL HW LE HIS LU SLA BP CON NNC BLK MM LPS RAY DG TRA SR WP MUM LIB FWT SCP KM STC RD SS CFD BFS HM Speedup 1-Cycle Router 2x Bandwidth LLLHHHFigure 4.8: Impact of scaling network bandwidth versus latency. Solid bars: 1-cycle versus4-cycle router latency, hashed bars: channel size 32 versus 16 bytes.0.5 0.6 0.7 0.8 0.9 1 AES BIN HSP NE NDL HW LE HIS LU SLA BP CON NNC BLK MM LPS RAY DG TRA SR WP MUM LIB FWT SCP KM STC RD SS CFD BFS NoC  Latency Ratio Figure 4.9: NoC latency reduction of using 1-cycle routers over baseline 4-cycle routers.0 50 100 150 200 250 300 AES BIN HSP NE NDL HW LE HIS LU SLA BP CON NNC BLK MM LPS RAY DG TRA SR WP MUM LIB FWT SCP KM STC RD SS CFD BFS Average NoC Latency  4-Cycle Routers (Baseline) 1-Cycle Routers 2x Bandwidth Figure 4.10: Impact of scaling network bandwidth and router latency on overall raw NoClatency.from no speedup to at most 6% (harmonic mean speedup is 1.8% for all benchmarks). Figure 4.9compares the network latency of these two configurations; y-axis is the network latency reduc-tion of using 1-cycle routers over 4-cycle baseline routers. These figures show that an aggressiverouter can decrease network latency but this improvement in network performance is not enoughto translate into noticeable overall performance benefits for these workloads. Based on our mea-surements the average hop count is 5.2 in this configuration and average serialization latencyis 2.5 cycles. Reducing the router delay from four to one cycles should result in a 55% NoClatency reduction, that is, 0.45 in Figure 4.9. Note that the link latency is always one cycle andserialization latency is not changing so the minimum possible latency reduction ratio is 0.4.Contention in the network prevents the aggressive routers from achieving a latency reductioncomparable to a zero-load network. Even LL benchmarks that have a low average NoC usagesuffer from contention because they generate bursty traffic. One example of this behaviour is58loading the scratchpad memory from global memory in the beginning of a kernel and then leavingthe NoC unused for the majority of kernel run time.In contrast, network bandwidth is an important metric as it impacts the overall throughputof the network. By increasing the network channel bandwidth by a factor of 2× (from 16B to32B), a 28.6% speedup is achieved over the baseline with 16B channels as shown in Figure 4.8.However, high-bandwidth NoC designs are very costly in terms of area as we show in Section 6.8.To shed more light on this matter, Figure 4.10 compares the average NoC latencies of fol-lowing router configurations: 4-cycle baseline router, 1-cycle router and 2× bandwidth router.It can be seen that the latency saved by employing the 1-cycle routers is only a small fractionof the overall NoC latency especially for the HH benchmarks. On the other hand, doubling thebandwidth has a large and meaningful impact on the latency (e.g., the NoC latency goes down55% for STC). The data in Figure 4.10 is strongly suggestive of an imbalance in the networkresulting in considerable contention. Next, we show that the traffic pattern is one of the reasonsfor this imbalance.4.2.4 Many-to-Few-to-Many Traffic PatternThe compute accelerator architectures we study present the network with a many-to-few trafficwith many compute nodes communicating with a few MCs. By simulating a closed-loop systemwith all components modelled, we also identify how the many-to-few-to-many traffic patterncauses a bottleneck in addition to the bottleneck caused by the many-to-few pattern. A high-level diagram of this communication pattern is illustrated in Figure 4.11. The MC bottleneck isnot only caused by the ratio of many cores to few MCs (28/8 in our simulations), but also causedby the difference in packet sizes.The traffic sent from compute cores to MCs consists of either read requests (small 8-bytepackets) or, less frequently, write requests (large 64-byte packets) while the traffic from MCs tocompute cores only consists of read-replies (large 64-byte packets). This creates an imbalance59C0	request network	C1	Core output	bandwidth 	C2	MC input	bandwidth	MC output	bandwidth	Core input	bandwidth 	C0	C1	Cn	C2	reply network	Cn	MCm	MC1	MC0	Figure 4.11: Many-to-few-to-many on-chip traffic. C nodes are the compute cores and theMC nodes are the memory controllers/memory.in the injection rates. To better demonstrate the differences in the number of read and writerequests, Figure 4.12 shows the ratio of the number of read request packets to the number ofwrite request packets that are sent from compute cores for all the benchmarks.0.1 1 10 100 1000 10000 AES BIN HSP NE NDL HW LE HIS LU SLA BP CON NNC BLK MM LPS RAY DG TRA SR WP MUM LIB FWT SCP KM STC RD SS CFD BFS Ratio  (log scale) Figure 4.12: Ratio of the number of read-request packets to write-request packets sent fromcompute cores in logarithmic scale.Figure 4.13 compares the average injection rates of compute nodes and MC nodes for allthe benchmarks. For this figure NoC is assumed to be ideal to better demonstrate the trafficdemand that the NoC is expected to handle. This figure demonstrates the bottleneck caused bythe few-to-many component of the many-to-few-to-many traffic pattern, especially for the HHbenchmarks.Figure 4.14 plots ideal network speedup versus average memory controller node injectionrate. Speedups are correlated to the memory controller injection rate (or the MC output band-600 5 10 15 20 25 30 AES BIN HSP NE NDL HW LE HIS LU SLA BP CON NNC BLK MM LPS RAY DG TRA SR WP MUM LIB FWT SCP KM STC RD SS CFD BFS Bytes per NoC  Cycle per node  Compute node avg. injection rate MC node avg. injection rate Figure 4.13: Average injection rates of compute nodes compared to MC nodes in term ofbytes per network cycle per node. NoC is ideal (infinite BW and zero latency).0% 50% 100% 150% 200% 250% 0.01 0.1 1 10 100 Ideal NoC  Speedup Memory Injection Rate of Ideal NoC in Bytes/cycle/node (log scale) Figure 4.14: Ideal NoC speedup versus memory node injection rate.width shown in Figure 4.11) with a correlation coefficient of 0.926. This suggests the presenceof a bottleneck on the read response path.The higher injection rates of memory response data returning from the MCs create bottle-necks in the reply network that can stall the MCs. This issue is depicted in Figure 4.15 whichshows the fraction of the time MCs are stalled (i.e., cannot process requests) because the networkcannot accept packets from MCs—resulting in MCs being stalled up to 73% of the time for someof the HH benchmarks. We address this issue in Section 5.3.0% 25% 50% 75% 100% AES BIN HSP NE NDL HW LE HIS LU SLA BP CON NNC BLK MM LPS RAY DG TRA SR WP MUM LIB FWT SCP KM STC RD SS CFD BFS % stalled Figure 4.15: Fraction of time injection ports at MCs are blocked preventing data read outof DRAM from returning to compute nodes.614.3 SummaryIn this chapter, we analyzed the impact of communication and on-chip network across a widerange of BSP applications in throughput accelerators. First, we identified the bisection band-width required to achieve a balanced NoC design when considering the heavy off-chip demandsof accelerator workloads. Then we classified the applications we studied by the intensity ofon-chip traffic they generate and their application-level throughput sensitivity to NoC optimiza-tions. Next, we demonstrated that conventional NoC optimizations that target router latencydo not significantly improve overall application-level performance while increasing the costlychannel bandwidth results in significant performance gains. We also identified that the many-to-few-to-many traffic pattern of throughput accelerators creates a traffic imbalance in addition tothe bottleneck caused by the many-to-few pattern. Finally, We showed how the overall systemperformance is directly correlated with the injection rate of the few MC nodes.62Chapter 5Throughput-Effective Network DesignIn this chapter we leverage the insights from the analysis in Chapter 4 to design throughput-effective NoCs for compute accelerators. Figure 5.1 shows a visual organization of the buildingblocks for throughput-effective design options that we explore. Throughput-effective designscan be achieved by either increasing the application-level performance or decreasing the areafootprint of system components. We introduce techniques that utilize both of these strategies.Checkerboard network organization and channel slicing reduce area while multi-port MC routersand optimized MC placement provide application-level speedups. We will combine these basictechniques or their enhanced versions to achieve a throughput-effective design.Throughput-­‐Effecⸯe	  Design	  Reduce	  Area	  Checkerboard	  Rou?ng	  Channel	  Slicing	  Increase	  Performance	  Checkerboard	  Placement	  	  	  Mul?-­‐Port	  routers	  at	  MCs	  Mul?-­‐port	  Double	  Checkerboard	  Inverted	  Network	  Figure 5.1: Overview of building blocks for throughput-effective designs explored in thisdissertation.635.1 Checkerboard Network OrganizationAlthough the many-to-few traffic pattern creates challenges, it also provides opportunities foroptimization—for example, there is no all-to-all communication among all nodes in the system.Based on this observation, we propose a checkerboard NoC to exploit this traffic pattern andreduce the area of the NoC. Figure 5.2 shows a 6×6 configuration of the checkerboard networkwhere routers alternate between full-routers shown with solid shaded squares and half -routersdrawn with hatching. A full-router provides full connectivity between all five ports in a 2Dmesh while a half-router (shown in detail in Figure 5.3) limits the connectivity as packets cannotchange dimensions within the router. The router microarchitecture is similar to a dimension-sliced microarchitecture [50] but in a dimension-sliced router, packets can change dimensionswhile we limit this capability to further reduce the complexity of the router. While the injectionport and the ejection port of a half-router are connected to all ports, the East port only has aconnection to the West port and similarly, the North port is connected only to the South port.By taking advantage of half-routers, the router area can be significantly reduced. For exam-ple, in a full-router, the crossbar requires a 5×5 crossbar1 while the half-router only requiresfour 2×1 muxes (two for each dimension) and one 4×1 mux for the ejection port, resulting inapproximately 40% reduction in area (detailed analysis in Section 6.8).The checkerboard layout does present some limitations in terms of communication (and rout-ing) because of the limited connectivity of the half-routers. Regardless of the routing algorithm(minimal, adaptive, or non-minimal), a packet with a full-router source and a full-router des-tination that are an odd number of columns or rows away cannot be routed, as illustrated inFigure 5.2a, since the packet cannot turn at a half-router. However, by exploiting the many-to-few traffic pattern, the communication between full-routers can be removed by placing the MCnodes at half-routers. Thus, all full-routers represent a compute node and this routing limitation1Since a packet arriving on a given port cannot depart through the same port, the crossbar will actually be a 4×5crossbar.64(a) General restrictions (b) Case 1: YX routing(c) Case 2: Checkerboardrouting(d) Case 3: YX routingCompute CoreRouterFull RouterHalf RouterMemory ControllerRouterFigure 5.2: Checkerboard mesh on-chip network routing examples. Dashed lines are ex-amples of XY routes prevented by half-routers (hatched); alternate feasible routes aresolid. Dark shaded nodes are MC routers.of the checkerboard layout does not become a problem for these throughput accelerator archi-tectures. In addition, as the data in Section 4.2.4 suggests, an injection rate imbalance betweenMCs and compute cores creates hot-spots in the baseline network in which the MCs are placed inneighbouring locations on top and bottom of the chip. Thus, the checkerboard network can alsoexploit a staggered MC placement [2, 10, 11]. Similarly, in architectures with large last-levelon-chip caches, if the cache banks are restricted to nodes with half-routers they can be accessedby all compute nodes. Miss traffic at these banks can reach MC nodes from the cache banks,65EjectionInjectionNorthSouthEastWestFigure 5.3: Half-router connectivity.provided both cache banks and MCs are also placed at half-router nodes, since half-routers canalways route to other half-routers (as described shortly).However, if cache banks are placed on the same tiles as the compute cores, the checker-board organization will restrict cache-to-cache communication as full-routers cannot communi-cate with all other full-routers. In this case packets would need to be routed to an intermediatehalf-router (either minimally or nonminimally) and be ejected or removed from the network—before being reinjected into the network and being routed to their destination, thus doublingthe network load2. However, prior work has shown that for accelerator applications written inBSP style languages supporting coherence, cache-to-cache communication is relatively infre-quent [48], and hence we expect the impact of this routing on overall performance to be mini-mal. In addition, in Section 5.5 we will illustrate a novel extension of checkerboard routing thatcompletely eliminates this limitation.2This is different from randomized routing algorithms such as Valiant routing [118] where packets are routed toan intermediate node but packets do not need to be removed from the network at the intermediate node.665.2 Checkerboard Routing Algorithm and Flow ControlWe assume a baseline dimension-ordered routing (DOR) using XY routing in the proposedcheckerboard network. However, because of the limited connections of the half-routers, XYrouting cannot route a packet for the following three traffic patterns.Case 1. Routing from a full-router to a half-router which is an odd number of columns away andnot in the same row.Case 2. Routing from a half-router to a half-router which is an even number of columns awayand not in the same row.Case 3. Routing from a half-router to a full-router which is an even number of columns awayand not in the same row.If YX routing is used as the baseline routing algorithm, similar routing restrictions exist as well.For Case 1, since a packet cannot “turn” or change dimensions at a half-router, YX routing canbe used instead of XY routing and thus, the packet turns at a full-router as shown in Figure 5.2b.For Case 2, neither XY nor YX routing can be used to route packets because of the limitationsof half-routers (Figure 5.2c). As a result, an additional turn is needed to route the packet fromthe source to its destination by first routing to an intermediate, full-router node and then routingto the destination. A random, intermediate full-router is selected within the minimum quadrantcontaining the source and destination that does not share the same row as the source and is not anodd number of columns away from the source. Thus, checkerboard routing (CR) occurs in twophases—in the first phase, YX routing is used to route to the intermediate node and in the secondphase, XY routing is used to route minimally to the destination. The CR routing is similar to a 2-phase ROMM routing [84] discussed in Section 7.2 but differs as the random intermediate nodeis restricted to a full-router and each phase needs to be done with a different DOR routing. Case3 can be handled either like Case 1 or like Case 2. In Section 6 we will show that handling Case67FFHF1.053 0.8470.95HF0.95FF221.91.9Figure 5.4: Layout example for checkerboard routers. Normal (left): F=full-router;Checkerboard (right): H=half-router, F=full-router. Area savings of 10% with twotile layouts assuming (for illustration only) a 75% reduction in half-router area andfull-routers occupy 25% of a normal tile.3 like Case 1 using YX routing yields higher average application-level throughput. Figure 5.2dshows an example of Case 3 being handled like Case 1.To avoid circular dependencies and routing deadlock, two virtual channels are needed in thecheckerboard routing, similar to O1Turn routing algorithm [111]. The YX routing is done usingone VC while XY routing uses another VC. Additional VCs to avoid protocol deadlock are stillneeded. Although the checkerboard network requires additional VCs, the reduction in router areais substantial as shown in Section 6.8.Reducing overall chip area with this design may require layout modifications like those illus-trated in Figure 5.4. This figure assumes for illustration and clarity purposes, a 75% reductionin the area of a half-router and a full-router that is initially 25% of a tile leading to a 10% areareduction in chip area.5.3 Multi-port Routers for Memory Controller NodesTo help reduce the bottleneck at the few nodes with many-to-few-to-many traffic pattern (shownin Figure 4.11), we propose a simple change to the routers attached to the few MC nodes: adding68MemoryControllerWestEast North SouthEjection InjectionMemoryController RouterRouterWestEast North SouthEjection Injection(a) Normal routerMemoryControllerWestEast North SouthEjection InjectionMemoryController RouterRouterWestEast North SouthEjection Injection(b) With 2 injection/ejection portsFigure 5.5: Router connections assuming a single network.additional injection/ejection ports from/to the MC and creating a multi-port router microarchi-tecture. These additional ports do not increase the network bisection bandwidth or any networkchannel bandwidth but instead increase the terminal bandwidth by providing more injection/e-jection bandwidth from/to the MC nodes. Figure 5.5a shows connections of a conventional routerin a 2D mesh network and Figure 5.5b shows the proposed multi-port router microarchitecturewith additional injection/ejection ports. Selection of the ports at multi-port routers can be done ina simple round-robin fashion. The important point is increasing the terminal bandwidth—multi-port routers are one simple way of achieving this effect.Note that only the routers connected to the MC nodes change. When adding extra injectionports, we leverage the fact that an MC is servicing requests from many compute cores; as packetsdestined to different compute cores get in the MC router, they will start travelling in differentdirections towards their destinations. This technique would not improve performance if the MChad to service a single compute core for a long time since we are not increasing the bandwidthof the links between routers.5.3.1 Smart Injection Port Selection PolicyAs mentioned before, selection of injection ports at multi-port routers can be done in a simpleround-robin fashion. Round-robin ensures fair usage of port resources and provides reasonableperformance as we will show in Section 6.3. Nevertheless it is possible to optimize the overall69throughput of the system by designing more intelligent injection mechanisms when there is achoice. The underlying mechanism that improves the performance for multi-port injection isthat as packets reach the router they start bidding for different outgoing ports and therefore getthrough the router at the same time. We designed a smart port selection mechanism that tries toboost this effect and therefore increase the number of packets that start travelling in the network.The smart mechanism can be summarized in the following steps.Step 1: Start with a random injection port.Step 2: If the port under consideration is empty then inject to it otherwise use step 3.Step 3: If the current packet is going to bid for the same output direction as the last packetinjected to this port, then inject to this port.Step 4: Otherwise, try the next port with criteria in steps 2 and 3, if there is no more ports leftto test, pick the last port under consideration.The basic idea is to keep the packets that are going to the same output direction together inthe same injection port so that when a packet to a different direction arrives it goes to anotherport. Therefore, there is a higher chance that the ports are sending packets to different directionsin parallel. The aforesaid mechanism is applicable to any number of ports. Implementation forour 2-port setup would require a 2-bit history for each injection port to remember the outputdirection of last injected packet. Depending on timing requirements of hardware implementationthe steps preceding can be done in parallel for all the ports with simple logic.We also considered other policies such as random port selection or selecting the port withmaximum empty space. Our default round-robin policy and the smart policy introduced aboveperformed better than the other policies we tried. Another policy that performed particularlypoorly was similar to the smart method above except for Step 3 where it would inject to the portunder consideration if the current packet was going to bid for a different output direction as the70last packet injected to that port. This is essentially the opposite of the smart policy above andwould distribute packets going to the same output into different ports.5.4 Double Network—Channel-Sliced NetworkThe area footprint of NoC can be further reduced using channel slicing. For a network with agiven bisection bandwidth with each channel having a bandwidth b, our baseline uses a singlephysical network. However, since router area is proportional to O(b2), it can be reduced bytaking advantage of channel slicing [22]: creating a double network3, each with a channel band-width of b/2. Our channel slicing technique increases the serialization latency of large packets(write requests and read replies) but as we showed earlier these accelerator architectures are notsensitive to a slight increase in latency. Figure 5.6 is an example showing a possible tile layoutfor a double network design.The traffic in the double network can be distributed with a dedicated double network whereeach network is used for a different class of traffic—one network carries request packets andthe other network carries reply packets, or with a combined double network where all trafficclasses can be sent across either network. With a dedicated double network, no extra virtualchannel (VC) is needed to avoid protocol deadlock while with either a combined double networkor a single network, VCs are needed for protocol deadlock avoidance. On the other hand, adedicated double network essentially cuts the terminal bandwidth of all the nodes to half. Acombined double network maintains the terminal bandwidth of all the nodes and can better load-balance the network traffic since a packet can utilize either network. It will also increase thearea footprint of routers by doubling the number of VCs required. A combined double networkwith checkerboard routing requires four VCs in each network while a dedicated double networkwould require only two VCs per network. Later in Section 5.5 we will demonstrate how we3Balfour and Dally [14] proposed MeshX2 topology which doubles the networks and thus, doubles the bisectionbandwidth and increases cost. Our approach differs slightly as we partition the network, thus comparing networkswith same bisection bandwidth.71FHFHFHFHFigure 5.6: Double network (channel sliced) design layout example showing four tiles.H=half-router, F=full-router.can keep the number of VCs the same as a dedicated double network while load-balancing thetraffic like a combined double network. Note that channel slicing is orthogonal to checkerboardrouting.5.5 Double Checkerboard Inverted Network Organization(DCI)The double network organization (Section 5.4) can reduce the NoC area compared with a singlenetwork and enables new optimization opportunities to remove the limitations of checkerboardnetworks. In this section we introduce a double network organization based on the checkerboardnetwork that we refer to as double checkerboard inverted (DCI).4The double checkerboard network can be created by two checkerboard subnetworks. How-ever, the DCI differs from the double checkerboard as we exploit an inverted checkerboard wherethe location of the full and half-routers is inverted, compared with a normal checkerboard net-work. Thus, one of the subnetworks in DCI is a checkerboard network while the other subnet-work is an inverted checkerboard network—resulting in all nodes (both compute or MC nodes)to be connected to both a full-router and a half-router as shown in Figure 5.7. This form of dou-4DCI does not necessarily fit into the description of dedicated or combined terminology that we used earlier inSection 5.4. It has similarities and differences with both of these networks that we will explain in this section.72WE N SMemoryControlleror ComputeNodeFullRouterWE N SHalfRouterFigure 5.7: The connections between a node and routers for one tile in the DCI networkorganization are shown. Sizes are not to scale.22FHFHF HF HFigure 5.8: Layout example showing four tiles for DCI. H=half-router, F=full-router.ble network organization removes all the limitations of a checkerboard design as it allows all thenodes to communicate directly with all other nodes. The routing algorithm is also simplified asDOR routing can be used without having to switch routing at an intermediate node (Section 5.2).Subnetwork selection policy. To leverage a simple XY DOR routing, the following injectionrules must be followed.Rule (1). If the destination node is an even number of columns away from the source node theninject the packet to the full-router attached to the source node.Rule (2). If the destination node is an odd number of columns away from the source node theninject the packet to the half -router attached to the source node.73(a) Subnetwork 0 (b) Subnetwork 1Compute CoreRouterFull RouterHalf RouterMemory ControllerRouterFigure 5.9: DCI routing examples. The two physical subnetwork are shown separately forclarity. Solid lines show examples of selecting full-routers for injection while dashedlines show examples of selecting half-routers. Dark shaded nodes are MC routers.Following this injection policy, all instances of special Cases 1–3 of checkerboard routing, men-tioned in Section 5.2, are removed and there is no need for any packet to turn at a half-router.Figure 5.9 shows some examples of DCI routing. Solid lines in Figure 5.9 are examples of se-lecting full-routers for injection while dashed lines show examples of selecting half-routers forinjection.Each subnetwork in DCI requires a minimum of two VCs for protocol deadlock avoidance—one VC for request packets and one VC for reply packets. DCI utilizes both subnetworks for bothmemory request and reply packets similar to a combined double network while it can keep thenumber of VCs the same as a dedicated double checkerboard network (combined and dedicatednetworks were described in Section 5.4).As Figure 5.7 shows DCI provides two injection and ejection ports at each node while therouters themselves only have one injection and one ejection port. Since the subnetworks areformed using channel slicing the terminal bandwidth of DCI is the same as a single network anda combined double network. Achieving a similar effect with a dedicated double checkerboardnetwork would require multi-port routers.74Another benefit of DCI is that it allows for a more regular tile design since all the nodes nowhave both a full and a half-router as demonstrated in Figure Augmenting DCI with Advanced Routing TechniquesA further advantage of the DCI design is that it can be extended to work with routing techniquesthat are more sophisticated than DOR. For example, O1turn routing [111] can be used with DCIby adding an extra set of virtual channels. DCI can also be used with “class-based deterministicrouting” (CDR) [2]. In CDR routing request packets are routed using XY routing and replypackets are routed using YX routing. As observed by Abts et al. [2] CDR routing is useful inthe cases with a peripheral row based MC placement like our baseline in Figure 4.2. UtilizingYX routing for the reply packets reduces their contention as they leave the MCs when MCs areplaced in neighbouring locations in a row.We now describe an extension to the DCI subnetwork selection policy that enables CDRwithout requiring any extra VCs. To use DCI with CDR the compute cores that send requestpackets and use XY routing employ the injection rules (1) and (2) of Section 5.5 and MC nodesthat send reply packets and use YX routing employ the rules (1’) and (2’) described next.Rule (1’). If the destination node is an even number of rows away from the source node theninject the packet to the full-router attached to the source node.Rule (2’). If the destination node is an odd number of rows away from the source node theninject the packet to the half -router attached to the source node.Using this technique, no packet will need to turn at any half-router and no change to the CDRtechnique is necessary once a packet is injected into the network. Note that each subnetworkrequires only one VC dedicated to XY routing for the request packets and one VC dedicatedto YX routing for the reply packets. Routing deadlock does not happen because each packet75only turns once, and protocol deadlock does not happen since request and reply packets havededicated VCs.5.6.1 DCI Enhanced (DCIE) — Increasing Network Utilization for DCIIf both the source and destination of a packet are on the same row/column then the packet doesnot need to turn. Such a packet can use either of the subnetworks in DCI as it will never turnand therefore it does not have to comply with the subnetwork selection rules of DCI. We cantake advantage of this observation by choosing the injection subnetwork in a manner that im-proves overall network utilization. When the non-turning packets are being injected, we selectthe subnetwork that has been least utilized recently. To keep the mechanism simple we limit ourselection mechanism to local subnetwork selection history. There are numerous implementationpossibilities; we chose to employ an up-down counter that counts up when a packet is injectedto subnetwork 1 and counts down when a packet is injected to subnetwork 0. When the injectionmechanism faces a non-turning packet it decides where to inject it based on the counter value, forexample, if counter value is positive the packet is injected to subnetwork 0. This is just a simpleway of approximating the load of each subnetwork, the goal is to take advantage of non-turningpackets to increase the utilization of the DCI network. We will refer to this technique as DCIenhanced or DCIE.5.7 SummaryIn this chapter, we described the checkerboard network organization which utilizes half -routersto reduce network area cost while exploiting the many-to-few traffic pattern characteristics. Toaddress the many-to-few traffic imbalance, we described a simple router microarchitectural ex-tension that employs multi-port routers at the MC nodes and increases the terminal bandwidth ofthese nodes. We also introduced a smart port selection policy to further improve performance.We described how the checkerboard network can be extended with channel slicing to create two76parallel networks in order to further reduce area. Finally we demonstrated how the benefits ofthese techniques can be realized in a DCI network organization and its enhance variant as wellas how to utilize CDR routing with DCI.77Chapter 6Experimental Evaluation of ThroughputEffective Network Design TechniquesIn this chapter we present the experimental results for the throughput-effective NoC optimizationtechniques proposed in Chapter 5. We start by describing our simulation setup, then explore theimpact of checker MC placement, the impact of checkerboard routing, the impact of multi-portrouters at the MC nodes, and the impact of channel slicing. Next we evaluate the impact ofthe DCI network and then show the effects of various MC placements on the aforementionedtechniques accompanied by a discussion of implementation tradeoffs of MC placements. We alsoevaluate some of our techniques in an open-loop simulation setup and present the area analysisof our techniques. Finally, we discuss the throughput-effectiveness of our techniques.6.1 MethodologyWe use a modified version of GPGPU-Sim 2.1.3b [11], the detailed cycle-level simulator mod-elling a contemporary GPU running compute accelerator workloads. We described an earlierversion of the simulator in Section 3.3.78Table 6.1: Simulation ParametersParameter Value# of Compute Cores 28# of Memory Channels 8MSHRs / Core 64Warp Size 32SIMD Pipeline Width 8Number of Threads / Core 1024Number of Thread Blocks / Core 8Number of Registers / Core 16384Shared Memory / Core 16KBConstant Cache Size / Core 8KBTexture Cache Size / Core 8KBL1 Cache Size / Core 16KBL2 Cache Size / MC 128KBCompute Core Clock 1296 MHzNoC & L2 Clock 602 MHzMemory Clock 1107 MHzGDDR3 Memory Timing tCL=9, tRP=13, tRC=34tRAS=21, tRCD=12, tRRD=8Peak DRAM Bandwidth 141.7 (GB/sec)DRAM request queue size 32Memory Scheduling Policy out of order (FR-FCFS)[104]Branch Divergence Method Immediate PostDominator [29]Warp Scheduling Policy Round Robin among ready warpsTable 6.2: Area Estimation ConfigurationTechnology 65nmCrossbar type MatrixBuffer Type SRAMWire Layer IntermediateWire Spacing SingleThe modifications we made include adding support for a limited number of MSHRs per core,proper modelling of memory coalescing according to the CUDA manual [92], using Booksim2.0 [45] instead of Booksim 1.0, and adding support for some undocumented (by NVIDIA)barrier synchronization behaviour required by LE and SS benchmarks (barriers synchronize at79Table 6.3: Baseline NoC ConfigurationTopology 2D MeshRouting Algorithm DORRouting Latency (number 4of router pipeline stages)Channel Latency 1Flow Control Virtual Channelbased on WormholeVirtual Channels 2Buffers per VC 8Allocator iSLIPInput Speedup 1Channel width (Flit size) 16 bytesRead request packet size 8 bytesRead reply packet size 64 bytesWrite packet size 64+8 bytesthe level of warps rather than scalar threads in NVIDIA GPUs [125]). All these changes exceptthe migration from Booksim 1.0 to Booksim 2.0 are currently publicly available as GPGPU-Simversion 2.1.3b.1Table 6.1 and Table 6.3 show our hardware parameters2. While we are interested in the futuredesigns, we chose parameters similar to GeForce GTX 280 except for the addition of cacheswhich more closely represent per-thread resources on Fermi. We do this to aid in estimating areaoverheads of compute portions of the overall accelerator. We use a modified version of Orion2.0 [47] for network area estimation; Table 6.2 shows the corresponding configuration options.Area estimation is explained in more detail in Section 6.8.1The current version of GPGPU-Sim also includes a more realistic model for texture caches that was not presentin our previous work [12]. This new model slightly changes the simulation results of benchmarks like MUM andDG that use texture caches. We also model a smaller number of entries for the buffers that connect L2 caches to therouters and DRAM controllers (8 entries for the buffers from router towards DRAM controller and 6 entries for thebuffers in the other direction). In our previous work all these entries were set to 32. We believe the new settings aremore realistic. This change results in an HM slowdown of 0.5% across all benchmarks for the baseline configurationand also slightly reduces the speedup of multiple ejection ports at MC routers.2In our previous work [12], we modelled half-routers with a 3-stage pipeline though we found the performanceimpact of one less stage was negligible. In this work half-routers have the same number of pipeline stages asfull-routers (4-stage).80Table 6.4: AbbreviationsBW BandwidthDOR Dimension Order RoutingCDR Class-based Deterministic RoutingCP Checkerboard PlacementCR Checkerboard RoutingTB baseline Top-Bottom placement2P 2 injection/ejection Port routers at MCsDCI Double Checkerboard InvertedDCIE DCI EnhancedWe employed the stack based immediate post dominator mechanism for branch handling asindicated in Table 6.1. There are numerous proposals that aim to improve the performance ofthe GPU by improving the control flow handling mechanism in the core [28, 29, 65, 78].Whilethese proposals are orthogonal to our proposed throughput effective NoC designs, we expectour techniques to outperform the baseline with a wider margin if any of these techniques areemployed. Increasing the utilization of compute nodes can lead to more traffic in the on-chipnetwork which is better handled by our techniques rather than the baseline.The benchmarks used in simulation are listed in Table 4.13. We simulate all benchmarks tocompletion to capture distinct phases of the benchmarks. Configuration abbreviations are listedin Table Checkerboard Placement (CP)Figure 6.1 shows the performance impact of moving the MC nodes from the top-bottom con-figuration in Figure 4.2 to the staggered locations shown in Figure 5.2, while still maintainingfull-routers and DOR routing. This placement of the MCs benefits from less contention [2] andby itself results in an average speedup of 13.2% compared to baseline top-bottom placement.43In our previous work [12] we used the original CUDA SDK versions of RD and BIN benchmarks. We haveupgraded them to their newer 2.3 versions in this work.4Although Abts et al. [2] showed the network performance improvement of better MC placement, they did notshow its impact on overall system performance. Their evaluation was with synthetic traffic patterns and measurednetwork latency.81-20% 0% 20% 40% 60% 80% AES BIN HSP NE NDL HW LE HIS LU SLA BP CON NNC BLK MM LPS RAY DG TRA SR WP MUM LIB FWT SCP KM STC RD SS CFD BFS HM Speedup LLLHHHFigure 6.1: Overall speedup of using checkerboard placement (CP) of routers compared tobaseline top-bottom (TB) placement (both configuration have 2 VCs).We chose this particular placement by picking the best performing placement after simulatingseveral valid checkerboard placements (but did not exhaustively simulate all valid placements).We will revisit the effect of MC placement in more detail in Section 6.6.For applications with low injection rates at the MC nodes (such as LL and LH applications),the MC placement has little or no impact on overall performance since the contention in thereturn network is not high. Note that WP’s loss of performance (5.2%) is due to fairness issuesthat slow down a few of the compute cores. There are studies [66, 67] that tackle the globalfairness in NoCs which are orthogonal to the techniques we introduce in this dissertation.6.2.1 Checkerboard Routing (CR)As described in Section 5.2 Checkerboard Routing (CR) requires an extra set of virtual channelsto avoid deadlock. Therefore, we also simulate a configuration with 4 VCs and DOR routing tomake a fair comparison. Figure 6.2 shows the relative performance of DOR with 4 VCs (solidbars) and checkerboard routing with 4 VCs (hashed bars) compared to the DOR routing with 2VCs. These simulations show that using a checkerboard network, with half of the routers beinghalf -routers, results in no significant change in performance (on average 0.3% improvement),compared to the 2VC DOR network which requires all full-routers. However, checkerboardresults in 4.5% and 17.7% reductions in total router area compared to 2VC and 4VC DOR net-works, respectively, using area estimations of Section 6.8.55Area reductions are calculated based on total router area numbers in Table 6.5 which are 37.52mm2, 43.95mm2and 35.83mm2 for 2VC DOR, 4VC DOR and checker routing (CR) respectively.8270% 80% 90% 100% 110% 120% AES BIN HSP NE NDL HW LE HIS LU SLA BP CON NNC BLK MM LPS RAY DG TRA SR WP MUM LIB FWT SCP KM STC RD SS CFD BFS HM Relative  Performance CP DOR 4VC  CP CR 4VC Figure 6.2: Relative performance (IPC) of DOR with 4 VCs (solid bars) and checkerboardrouting (CR) with 4 VCs (hashed bars) compared to DOR routing with 2 VCs; allwith checkerboard placement (CP). Higher bars mean better performance.Although a different routing algorithm is required in the checkerboard network, it is stillminimal routing (minimal hop count between source and destination) and therefore the checker-board network has minimal impact on average network latency. Averaged across all benchmarks,38.3% of packets in the system will require YX routing—one of the three cases explained in Sec-tion 5.2.As explained in Section 5.2, the special Case 3 of CR is routed by always taking the YX route(like Case 1). Figure 6.3 shows the relative performance of routing Case 3 like Case 1 using YXrouting compared to handling Case 3 like Case 2 which involves YX routing to an intermediatenode and then XY routing to destination.Although both alternatives employ minimal routing, handling Case 3 only using YX routingprovides a noticeably higher performance for many of HH benchmarks (maximum speedup of7.3% for SS). Based on our analysis of channel utilizations, using YX routing provides betterload-balancing. The other alternative puts more pressure on the routers and links in the centre ofthe chip and increases contention as it involves routing to an intermediate router in the minimalquadrant.6.3 Multi-port RoutersFigure 6.4 shows the speedups of increasing terminal bandwidth of MC routers by adding anextra injection port (3.1%), an extra ejection port (2.2%) and the combination of these changes8390% 95% 100% 105% 110% AES BIN HSP NE NDL HW LE HIS LU SLA BP CON NNC BLK MM LPS RAY DG TRA SR WP MUM LIB FWT SCP KM STC RD SS CFD BFS HM Relative  Performance Figure 6.3: Relative performance (IPC) of routing from half-routers to full-routers that arean even number of columns away using YX routing compared to routing using YXto an intermediate node and then XY to destination; both cases are checkerboardrouting. Higher bars mean better performance.(5.2%)—as described in Section 5.3 and Figure 5.5b. It can be seen that the speedups gainedby extra injection and ejection ports are orthogonal and add up when combined. The highestspeedups are gained by HH benchmarks (8.1%). The extra ports at MC routers reduce the averagefraction of execution time that the injection ports at MCs are blocked (shown in Figure 4.15) by58% which provides additional performance benefits. The extra terminal bandwidth providedby the extra injection ports at MCs alleviates the bottleneck at the injection port caused by themany-to-few-to-many pattern. Since MCs are servicing many compute cores, the packets willtravel in different directions towards their respective destinations after they get in the network.A scattered MC placement like Figure 5.2 increases the chances of packets going to differentdirections as they get into the network. On the other hand, in the top-bottom MC placement ofFigure 4.2 a majority of packets that are injected by the MCs will contend for the east/west portsof MC routers. This is caused by the combination of XY routing and presence of MC routersat adjacent locations. As a result, increasing the number of injection ports to two in the top-bottom placement configuration results in a moderate HM speedup of 2.5% (results not shown).It is important to consider the interactions of placement and routing mechanism to better takeadvantage of multi-port MC routers.Adding extra ejection ports to MC routers is beneficial for benchmarks that have relativelyhigh average injection rates at compute nodes (see Figure 4.13 for injection rates). High injec-tion rates at compute nodes are typically accompanied by higher ratios of write to read requests84-10% 0% 10% 20% AES BIN HSP NE NDL HW LE HIS LU SLA BP CON NNC BLK MM LPS RAY DG TRA SR WP MUM LIB FWT SCP KM STC RD SS CFD BFS HM Speedup 2 Injection Ports 2 Ejection Ports 2 Injection and 2 Ejection ports LLLHHHFigure 6.4: IPC speedup of adding multi-port MC routers versus single CP-CR.(Figure 4.12). Although processing write requests faster does not directly translate into higherapplication-level performance, draining the network faster allows the read requests waiting be-hind the write requests to be processed earlier since the network resources are shared by read andwrite requests. Some benchmarks that benefit from extra ejection ports are TRA, LIB and FWT.We explained a smart port selection mechanism for multi-port injection in Section 5.3.1.The goal is to keep the packets that are going to the same output direction together in the sameinjection port so that when a packet headed to a different direction arrives it goes to anotherport. This enables packets in different injection ports to start traversing the router earlier astheir output directions are less likely to conflict with the output of packets in the other injectionport. Enabling this techniques results in modest speedups over the default round-robin networkselection mechanism with an HM average of less than 1% (results not shown). The maximumspeedup is for the RD benchmarks with a 2% improvement. The speedup of smart port selectionis more pronounced in double network designs that we evaluate in the following sections since itwill be applied to more routers.6.4 Double Network—Channel-Sliced NetworkConventionally, channel slicing is beneficial if combined with a reduction of the network diam-eter [22, 54]; however we utilize channel slicing without reducing network diameter to reducenetwork area (Section 6.8). Overall area is reduced because the crossbar area has a quadraticdependency on channel bandwidth.8570% 80% 90% 100% 110% 120% AES BIN HSP NE NDL HW LE HIS LU SLA BP CON NNC BLK MM LPS RAY DG TRA SR WP MUM LIB FWT SCP KM STC RD SS CFD BFS HM Relative Performance Figure 6.5: Relative performance of using two physical subnetworks (combined doublenetwork) with channel width 8B compared to a single network with channel width16B (both have checkerboard routing, checkerboard placement, 4 VCs and 8 buffersper VC).As described earlier in Section 5, the traffic with a double network can be distributed eitherwith a dedicated double network or with a combined double network where traffic can be sentacross either network. Unfortunately, using a dedicated network results in a 32% harmonic meanslowdown of the benchmarks since it cuts the terminal bandwidth of all nodes in half.6A combined double network is a better choice since it has minimal impact on performancewhile reducing the crossbar area. Figure 6.5 compares the performance of the combined doublenetwork and the single network. On average, the combined double network has a 1.2% slowdownwhile total router area is reduced by 31.81% as we show in Section 6.8. The overall throughput-effectiveness improves by an average of 2.7%. Note that the aggregate storage used for VCbuffers remains constant compared to a single network as explained shortly. Each subnetwork ofthe double network has the same number of VCs as the single network. While the number of VCbuffers in the whole network doubles, each buffer’s storage amount is reduced to half since thechannel width is also halved.In addition to maintaining the terminal bandwidth, a combined network provides much betterload-balancing of the traffic compared to a dedicated network. A combined network achievesload-balancing because both request and reply packets use both of the physical subnetworks. Onthe other hand, in the dedicated design the link utilization of the subnetwork dedicated to replypackets is on average 85% more than the link utilization of the subnetwork dedicated to request6In our previous work [12], due to a performance bug in our simulator we did not observe the performance lossof dedicated double networks and chose them as our preferred type of double networks.86packets. This load imbalance is the result of relative abundance of read requests compared towrite requests as well as the larger size of read reply packets compared to read requests.A fundamental drawback of channel slicing is increased serialization latency for large packetswith narrower channels. This increase in latency only impacts read reply and write requestpackets since the small read request packets still fit in a single flit. However, as shown earlier inSection 4.2.3, the additional latency has minimal impact on most workloads and is well toleratedby the compute cores. There are only a few HH benchmarks that experience a slight slowdown.It is possible to further improve the performance of combined double networks by incorpo-rating some techniques such as an intelligent subnetwork selector that improves load-balancing.However we focus our efforts on the DCI network discussed next which is a more elegant versionof double network.6.5 Double Checkerboard Inverted Network (DCI)In this section we evaluate the Double Checkerboard Inverted design and its enhanced version,DCIE, which were introduced in Sections 5.5 and 5.6.1, respectively. DCI is a double networkorganization where each compute or MC node has both a full and a half-router. This designallows all nodes to communicate directly with no extra overhead, and uses a simplified routingalgorithm which requires a maximum of one turn in all cases as described in Section 5.5. DCIEallows the packets that are never going to turn to select their injection subnetwork based on thelocal historical usage of subnetworks.Figure 6.6 shows the relative application-level speedup of using DCI and DCIE over thecombined double network design explained in Section 6.4. For comparison, this figure alsoshows the speedups of a configuration with equivalent resources except that it has no half-routers(labeled All-Full-Routers). The All-Full-Routers design does not benefit from the area savingsprovided by the half-router. It does not require any specific injection rules as opposed to DCIwhich injects to a certain subnetwork based on the distance of the destination node’s column.8780% 90% 100% 110% 120% AES BIN HSP NE NDL HW LE HIS LU SLA BP CON NNC BLK MM LPS RAY DG TRA SR WP MUM LIB FWT SCP KM STC RD SS CFD BFS HM Relative Performance DCI DCIE All-Full-Routers LLLHHHFigure 6.6: Relative performance of using “double checkerboard inverted”, its enhancedvariant and a combined double network that does not have half-routers compared toa single network with channel width 16B (all have checkerboard placement, 4 VCsand 8 buffers per VC).Injection subnetwork is selected randomly for the All-Full-Routers configuration. As Figure 6.6shows the harmonic mean speedup of All-Full-Routers over the combined double network is2.9% across all the benchmarks which is closely followed by the 1.2% speedup of DCIE. DCIhas a modest slowdown of 1.7%. DCIE has a 3.4% higher average link utilization compared toDCI and 2.6% compared to the double combined network. The HH benchmarks experience thehighest speedups with DCIE as expected. As an example, CFD becomes 8.7% faster.While it is possible to have DCI and DCIE networks with only two VCs per network, it isnot as throughput-effective as using four VCs. For example DCIE with two VCs would increasethroughput-effectiveness by only 0.6% compared to a combined double network but at the costof a 1.7% reduction in performance (results not shown). On the other hand, using four VCsincreases both performance and throughput-effectiveness. Therefore, we employ 4 VC DCI andDCIE configurations.In summary, DCIE improves system-level performance compared to the combined doublenetwork of Section 6.4, has the same NoC area footprint, simplifies the routing algorithm, andremoves the restrictions on the placement of MCs.6.6 MC Placement EffectsIn this section we examine the impact of MC placement on the techniques introduced above.Figure 6.7 shows four valid checkerboard placements. As we mentioned in Section 6.2 we picked88the placement shown in Figure 6.7d as the best performing placement. This placement scattersthe MCs throughout the chip and we believe it can be realized given the current advanced state-of-the-art flip-chip technology. Flip-chip technology is already in widespread use. For example,NVIDIA’s K40 employs a flip-chip ball-grid-array packaging technology [46].Flip-chip is a technology to connect face-down (flip) electrical components (die) to a pack-age carrier. Flip-chip utilizes one or more top metal layer(s) called re-distribution layer (RDL)to connect I/O pads on the die to metal bumps on the package carrier. One possible flip-chipstructure is area-I/O where by utilizing RDL the chip designers can put I/O pads on any arbitrarylocation on the area of the die [27]. Alternatively peripheral-I/O flip-chip structure restricts thelocation of I/O pads to the periphery of the die [27].Peripheral MC placements such as Figures 6.7a and 6.7b assume peripheral-I/O while thescattered MC placements such as Figures 6.7c and 6.7d assume area-I/O. Despite many advan-tages of area-I/O structures such as shorter wire length and better signal integrity, they pose amore challenging routing problem in the RDL compared to peripheral-I/O designs [27]. Never-theless, we show that the throughput-effective techniques we introduced can still be applied toperipheral MC placements.Figure 6.8 shows the sensitivity of the throughput-effective techniques discussed in this dis-sertation to the placements shown in Figure 6.7. In this figure, single DOR, single CR andsingle CR 2P refer to optimizations presented in Sections 6.2, 6.2.1, and 6.3 (smart port selec-tion enabled) respectively. DCIE 2P combines the enhanced version of double checker invertedpresented in Section 6.5 with smart multi-port MCs of Section 6.3. Note that these techniquesare orthogonal to each other. That is, each MC is attached to one half-router and one full-routerin a DCIE 2P configuration and these routers have two injection and ejection ports. Subnetworkselection and routing inside each subnetwork is governed by DCIE rules while port selection isgoverned by the smart multi-port selection policy.89(a) Peripheral-Row(b) Peripheral-Col(c) Scattered 1 (d) Scattered 2(CP)Compute CoreRouterFull RouterHalf RouterMemory ControllerRouterFigure 6.7: MC placement examples. First two placements keep the MCs on the periph-ery of the chip while the next two examples allow MCs inside the chip. All theseplacements are valid checkerboard placements.0% 3% 6% 9% 12% 15% 18% 21% Single DOR Single CR Single CR 2P DCIE 2P HM  Speedup  over Baseline PeripheralRow PeripheralCol Scattered 1 Scattered 2 (CP) Figure 6.8: Sensitivity of checkerboard routing, two injection port MCs and doublecheckerboard inverted to various placements shown in Figure 6.7. All speedups aremeasured against harmonic mean IPC of baseline top-bottom with DOR.As shown in Figure 6.8, the scattered placement of Figure 6.7d consistently performs betterthan the other placements and when combined with DCIE 2P results in a 19.5% speedup overbaseline TB. Single CR 2P also achieves a similar speedup but at a higher area cost. (Section 6.9provides a detailed comparison of these two configurations.)Among the peripheral placements, the column-based placement of Figure 6.7b outperformsthe row-based placement of Figure 6.7a. The PeripheralCol placement combined with DCIE 2Presults in a 12% speedup over baseline TB while keeping the MCs on the periphery. (CDR rout-ing provides better performance if a peripheral MC placement is desired as we will describe laterin this section.) The reason that column-based placements outperform row-based placements is90that the large reply packets leaving the MCs have a higher chance of contention in the row-basedversions due to dominance of XY routing.The most important factor for the MC placement is to avoid putting the MCs in neighbouringlocations when XY routing is employed. When MCs are placed beside each other the packetsleaving the MCs start contending immediately as they leave the MC nodes. The same principleapplies in the case of a larger mesh network. In general, the MCs should not be placed in thesame rows and columns as much as possible in order to minimize the contention from dimensionorder routing.Additionally, the MCs should not be placed on the central nodes in the mesh because theasymmetrical nature of the mesh increases the traffic passing through the centre of the mesh andputting a hotspot node in such a location creates extra contention. This would be less of an issuefor symmetric topologies such as torus.Abts et al. [2] proposed a systematic way of finding the best placement of MCs. Their conclu-sion was that a diamond shaped placement of MCs yields the best results. A diamond placementwould adhere to the conditions we listed above. Our scattered placement slightly differ from adiamond shape because we did not place MCs at full-router nodes in order to comply with therestrictions of checkerboard routing.While we believe a scattered MC placement can be realized in hardware as we explained inthe beginning of this section, if a peripheral placement of MCs is required then slightly morecomplex routing techniques such as class-based deterministic routing (CDR) will provide betterperformance compared to DOR routing. For example, employing DCIE 2P in the top-bottomplacement of Figure 4.2 (which is a type of row-based peripheral placement) results in a 5.1%slowdown over the baseline. When we employ the CDR variant of DCIE 2P (introduced inSection 5.6) with the same placement, a 13.3% improvement is achieved over the baseline.91CDR performs better than DOR in row-based placements because it employs YX routingfor reply traffic. In XY routing packets ejected from MCs start to contend in already congestedneighbouring MC routers. Employing YX routing for reply packets reduces the amount of con-tention for the reply packets as they leave the MCs.As mentioned earlier, when we move MCs apart from each other contention is reduced andperformance is increased. That is, if we use DCIE 2P and CDR with the peripheral row-basedplacement of Figure 6.7a we will get a 15.4% performance improvement over the baseline. Thiscombination (DCIE-CDR-2P with PeripheralRow placement) is our best performing peripheralconfiguration.Similar to observations by Abts et al. [2] employing CDR with a scattered placement does notresult in further improvements in performance (compared to DOR routing). In our experimentschanging DOR routing to CDR routing for the scattered placement of Figure 6.7d using DCIE2P results in a slowdown of 1.3%.Note that there might be other possibilities to better take advantage of a scattered memorycontroller placement. At the first glance, it seems beneficial to schedule the thread blocks suchthat they are close to the memory controller that contains their data or to migrate the data tobe on an MC that is close to its consumer. While we did not explore this path we provide abrief discussion of limitations of such techniques below. In the architecture that we model theaddresses are low-order interleaved among MCs every 256 bytes similar to real hardware [37]as mentioned before. In fact, newer generations of GPUs use an undocumented hash function todistribute the address space among MCs. This is specifically designed to reduce the likelihoodof any single MC becoming a hot-spot and also renders software based runtime page migrationimpractical. Data migration techniques such as the one proposed by Awasthi et al. [8] for chip-multiprocessors would require a larger address distribution granularity (e.g. a 4KB page perMC). Such techniques also require complicated run-time and hardware support. Such details920 20 40 60 80 100 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 Latency Injection Rate (bytes/cycle/node) 2x-TB-DOR CP-CR-2P CP-CR CP-DOR TB-DOR (a) Uniform Random Many-to-Few-to-Many0 20 40 60 80 100 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 Latency Injection Rate (bytes/cycle/node) 2x-TB-DOR CP-CR-2P CP-CR CP-DOR TB-DOR (b) Hotspot Many-to-Few-to-ManyFigure 6.9: Latency versus offered load for different architectures. All configurations aresimulated with 4 VCs to better distinguish the effects of the techniques introducedin this dissertation. Traffic consists of 90% read requests and 10% write requests.Read requests and write replies are one flit and read reply and write requests are 4flits except for 2x-TB-DOR where they are 2 flits. Flits are 16 bytes except for 2x-TB-DOR where they are 32 bytes. The x-axis is the average number of bytes injectedfrom each compute node per cycle.should not be exposed to the programmer as it would make the applications tied to a specifichardware and hinder its portability. Nevertheless if a data-aware thread block scheduling policyis employed in a future accelerator it would be complementary to the scattered placement of MCswe proposed.6.7 Open-Loop NoC Simulation StudyFigure 6.9 plots open-loop latency versus offered load for the combinations of checkerboard andmultiple injection ports evaluated earlier using closed-loop simulation for both uniform many-to-few and hotspot traffic. For hot-spot traffic 20% of requests go to one MC as opposed to 12.5%(1/8) for uniform random. These open-loop simulations use a single network with two logicalnetworks for request and reply traffic. Checkerboard routing (CR) and DOR have almost thesame latency in both traffic patterns as checkerboard routing is minimal. These figures show thatcombining checkerboard placement (CP), checkerboard routing (CR) and two injection ports atthe MC (2P) improves performance by increasing saturation throughput versus the top-bottomplacement (TB). The double bandwidth counterpart of baseline (2x-TB) is also shown for refer-93ence. The largest contributors to performance for uniform random traffic are the placement ofMCs and increasing the number of injection ports at the MCs (note read response packets arelarger than read request packets). For the hot-spot traffic the improvements of MC placement aremore moderate. The hotspot node is MC 0 which is located on the top right corner of the chipin our scattered MC placement. Adding the extra injection ports at MCs improves performancesignificantly by alleviating the bottlenecks created by hot-spot traffic. Although addresses arelow-order interleaved among MCs every 256 bytes [37] to reduce hot-spots we have observedthat temporary hot-spots happen in closed-loop simulations.6.8 Area AnalysisWe use a modified version of Orion 2.0 [47] to estimate the area of buffers, allocators and linksof the various NoC designs we explore in this dissertation. Recent studies have highlighted theinaccuracies of Orion 2.0 especially for technology nodes smaller than 65nm [113]. Therefore,we calibrate the crossbar area estimation portion of Orion as described shortly. First we calculatethe number of the crosspoints needed to implement the crossbar and then multiply this number bythe area of a single crosspoint. We use an empirical method to estimate the area of the crosspoint.The number of crosspoints is calculated according to the following formula:nCrosspoints = nX ∗nYnX = nI ∗CWnY = nO∗CWCW denotes channel width in bits and nI and nO denote the number of input and output ports,respectively.94To calculate the area of a single crosspoint we use an empirical model where we first mea-sure the crosspoint area of three router designs for which we have access to their crossbar areanumbers [60, 107, 119]. We extract the crossbar area from the provided die photos. The cross-bar in Salihundam et al. [107] is implemented in 45nm technology. Its area is scaled to 65nmbefore calculating its crosspoint area. We calculate the crosspoint area as the average crosspointarea of the three designs mentioned earlier which turns out to be 2.07um2. Unmodified Orion’scrosspoint area is around 4.24um2. That is, our estimation technique provides crossbar areas thatare over two times smaller than unmodified Orion. Throughout our area estimations we assumethe constant crosspoint area of 2.07um2 mentioned above. In our estimations, we do not havea separate component for the drivers that boost the electrical signals as the flits depart a routertowards the next router; instead, these drivers are modelled as part of the link area. None of ourtechniques increase the size of these drivers. Having half-routers as opposed to full-routers doesnot change the size of these driver since we are not changing the number of router ports7.The area saved by employing our techniques can reduce the overall chip area which meansthe NoC nodes have a shorter distance than the baseline. This directly translates to smaller driversand less area and power consumption in links among nodes. We have not included these savingsin our model. As a result, the area savings are conservative in this sense.Table 6.5 provides the area estimates for the designs we evaluated. The area of non-NoCparts of the chip was estimated by measurements from GTX280’s die photo. We estimatedeach compute core to be 5.25mm2 and each MC to be 10.48mm2. Our baseline of 28 computenodes and 8 MCs adds up to 230.84mm2. We also estimated an area of 13.84mm2 for L2 cachesusing Cacti. As a result the total area of non-NoC components of our baseline is 244.68mm2.Adding this number to the estimated NoC area for each network design results in the total chip7Based on our empirical measurements of the die photo of Intel’s 80-core chip [119], its mesochronous interfaces(MSINT) that forward packets between tiles have an area of around one-tenth of the area of its routers. However,in our estimations the total link area is around 35% of the total router area for the baseline and 50% for our areaoptimized designs (Table 6.5). We believe our area estimations are conservative regarding links and drivers.95Table 6.5: Area Estimations (mm2) A “/” separates different router types for configurationsthat have more than one router type.Area of Crossbar Buffer Alloc. Area of Link Router % of NoC Total1 link Area Area Area 1 Router Area Area Overhead Chip AreaBaseline 0.11 0.87 0.17 0.001 1.04 13.36 37.54 17.2% 295.62x-BW 0.219 3.5 0.34 0.001 3.82 26.68 137.56 40.1% 408.9CP DOR 4VC 0.11 0.87 0.34 0.004 1.22 13.36 43.95 18.8 % 301.99CP-CR 0.11 0.87/ 0.34/ 0.004/ 1.22/ 13.36 35.83 16.7 % 293.870.42 0.34 0.004 0.76CP-CR 2P 0.11 0.87/ 0.34/ 0.004/ 1.22/ 13.39 36.8 17.0 % 294.880.42/ 0.34/ 0.004/ 0.76/0.42 0.69 0.004 1.12Double CP-CR 0.055 0.22/ 0.17/ 0.004/ 0.40/ 13.40 24.43 13.39 % 282.51or DCI w/4VC 0.10 0.17 0.004 0.28DCI 0.055 0.22/ 0.087/ 0.001/ 0.30/ 13.40 17.88 11.34 % 275.96w/2VC 0.10 0.087 0.001 0.19DCI 2P 0.055 0.22/ 0.087/ 0.001/ 0.30/ 13.43 19.20 11.77 % 277.34w/2VC 0.10/ 0.087/ 0.001/ 0.19/0.27/ 0.10/ 0.001/ 0.38/0.17 0.10 0.001 0.28DCI 2P 0.055 0.22/ 0.17/ 0.004/ 0.39/ 13.43 26.02 13.88 % 284.14w/4VC 0.10/ 0.17/ 0.004/ 0.28/0.27/ 0.20/ 0.004/ 0.49/0.17 0.20 0.004 0.38area (last column of table). We assume the injection/ejection links have a length of 0.05mm andinclude their contribution to area under “Link Area” column. Note that this overhead is negligiblecompared to the size of other components as a 64-bit local link would take around 0.001mm2.The first row of Table 6.5 shows the area of the baseline mesh with a channel width of 16bytes and the second row a mesh with a channel width of 32 bytes. As expected, a quadraticincrease in the router area happens by doubling the channel width. The high area overhead of themesh with channel width 32 bytes, which is 40% of chip area, makes it impractical to build. Byexploiting half-routers, which occupy only 63% of the area of a full-router8, the checkerboardnetwork results in a 4% reduction in total router area (comparing sum of router area numberswhich are 37.6mm2 in 65nm for checkerboard and 36.2mm2 for baseline router). By further tak-8The half-router crossbar area is calculated by adding the area of four 2×1 crossbars (two for each dimension)and one 4×1 crossbar for the ejection port. Full-router crossbars are modelled as a single 5×5 crossbar which ispessimistic since a packet arriving on a given port cannot depart through the same port.96ing advantage of the quadratic dependency, the double network reduces the area further by 30%(comparing sum of router area numbers for combined double checkerboard and single checker-board which keeps the buffer sizes constant). The area of DCI with 4 VCs is also the same asdouble CP CR network. Table 6.5’s “DCI 2P w/4VC” row shows the area of the 4VC DCI con-figuration with 2 injection/ejection ports at MC nodes; multi-port MC routers increase the NoCarea overhead by 4.2%. As the chip scales and the number of compute nodes increases in thefuture, the relative overhead of multi-port router decreases because the number of MCs scales ata much slower pace.Note that there are no separate rows for the DCIE configurations in Table 6.5 as we haveassumed that the area difference between DCI and its enhanced version, DCIE, is negligible. DCIneeds to determine its injection subnetwork based on the distance to destination node column.If we assume table-based routing, we need a lookup table with a 1-bit entry per destination, thatis, overhead is a 36-bit lookup table per node. This table can be further optimized to have 8entries for the compute nodes and 28 entries for MCs. We have assumed the area of this table isnegligible in our calculations.6.9 Throughput-EffectivenessCombining the optimizations in Sections 6.2, 6.2.1 and 6.3 (checkerboard placement, checker-board routing, and two injection ports at MC routers, i.e., single CP-CR-2P) results in a 19.6%speedup versus our baseline introduced in Section 4.1 as shown in Figure 6.10. Compared with42.3% speedup of an ideal network, this throughput-effective network achieves 47% of the per-formance possible with an ideal network while reducing area. This design improves throughput-effectiveness (IPC/mm2) by 19.9% (using the area numbers in baseline and CP-CR 2P rows ofTable 6.5). Majority of the gain in throughput-effectiveness is the result of increased performancerather than reduced area.97-40% 0% 40% 80% 120% AES BIN HSP NE NDL HW LE HIS LU SLA BP CON NNC BLK MM LPS RAY DG TRA SR WP MUM LIB FWT SCP KM STC RD SS CFD BFS HM Speedup  Single-CR-2P CP DCIE-CDR-2P PeripherialRow placement DCIE-DOR-2P CP Figure 6.10: IPC speedup of three throughput-effective designs over baseline top-bottomwith DOR.Utilizing channel slicing and the simpler routing of the DCIE technique evaluated in Sec-tion 6.5 along with the preceding techniques results in a 19.5% speedup over the baseline whilereducing the area compared to a single CP-CR-2P configuration. The speedups over baselineare shown as DCIE-DOR-2P in Figure 6.10. This design offers the best tradeoff of performanceand area among the combinations of proposed techniques and results in a 24.3% improvementin throughput-effectiveness (IPC/mm2). It was shown as “Thr. Eff.” point in Figure 4.1. In thisdesign, around one-fifth of the gain in throughput-effectiveness is the result of saving area andthe rest is due to application-level speedup.If a peripheral placement is desired then CDR routing can be employed. Using CDR rout-ing with the PeripheralRow based MC placement of Figure 6.7a results in a 15.4% speedupover baseline without changing the area footprint of DCIE. The speedups of this design can beseen as DCIE-CDR bars in Figure 6.10. Using this technique improves throughput-effectiveness(IPC/mm2) by 20.0%. In this design, a quarter of the gain in throughput-effectiveness is the resultof area savings and the rest can be attributed to application-level speedup.6.10 DiscussionIn this section, we discuss interactions of concentration with our proposed techniques and pro-vide comparisons with other topologies. We also discuss scalability and power issues. In this98section, references to a “throughput-effective” design refer to a DCIE 2P network with scatteredplacement of MCs similar to Figure 6.7d and 2 VCs per subnetwork9.6.10.1 ConcentrationConcentration is a form of “hierarchical” network with the first level of the network hierarchybeing the “concentration” or a simple multiplexer (the actual implementation can vary). Thetechniques provided in this work are orthogonal to concentration. Our techniques are suited forcases with many compute nodes and few MCs. We expect the many-to-few-to-many property tobe true regardless of compute node concentration. Given the slow rate of increase in the numberof pins per chip [44] the number of MCs will remain at their current level well into the future.Concentration has been proposed for NoCs in the form of high radix routers like CMesh andflattened butterfly [14, 55]. We refer to this form of concentration as integrated concentration andwill discuss it in Section 6.10.2. An alternative implementation would be applying concentrationto multiple input sources first (i.e., external concentration [63]) and then feeding the router onlywith a single port. In this approach the router radix is the same as the case without concentration.We refer to this form of concentration as external concentration.Current GPUs seem to be employing a combination of external concentration and crossbars.we now provide a comparison of this approach with mesh-based topologies and the reasons itneeds to change in the future.As mentioned earlier, a crossbar would be connecting components that are scattered through-out the chip. Routing very long wires to and from a central crossbar is neither easy nor inexpen-sive.We simulated an asymmetric crossbar NoC with a 10× 8 router for requests and an 8× 10router for replies and external concentration of 3 for the compute nodes. If we assume a flit9 In Section 6.5 we mentioned having a DCIE with 4 VCs is more throughput-effective than having 2 VCs. Thestudies of this section were performed before we came to that conclusion and assume 2 VCs.99size of 16 bytes, the overall HM IPC is 1.1% better than our baseline while the estimated NoCarea is around a quarter of the baseline. We are not including the links in area comparison forsimplicity. It is clear that an asymmetric crossbar is better in terms of throughput-effectivenesscompared to baseline. On the other hand, in order to achieve a performance similar to our bestthroughput-effective design we need to increase the flit size of the asymmetric crossbar to 24bytes. At this point the crossbar has an HM IPC of 2% more than our DCIE 2P design while itstotal NoC area is 12% smaller.Nevertheless as the number of nodes and bandwidth demand grows in the future the crossbarwill lose its advantage as its complexity goes up with the number of ports and flit size require-ments. Routing long links to connect to a central crossbar becomes problematic and will requiremultiple cycles. A large crossbar also becomes a single point of failure bringing down yield.On the other hand, if a mesh is faulty the chip can still be sold after disabling some rows and/orcolumns. A mesh would provide low area, good performance, and regularity in design and highyields. Another consideration is that our mesh-based DCI technique can be easily extendedto support coherence in a system like Intel’s Knights Corner [110] by adding virtual channels.However asymmetric crossbars cannot support coherence protocols that require compute nodesto communicate directly; coherence would require a fully connected symmetric central cross-bar in such a case. Note that all the nodes connected to a symmetric crossbar can directly sendpackets to each other while an asymmetric crossbar only allows the nodes on opposing sides ofcrossbar to communicate. For example, in the configuration simulated earlier, the 10 concen-trated compute nodes cannot directly communicate with each other.6.10.2 TopologyIn this section we compare our proposed throughput-effective technique (DCIE 2P with scatteredplacement) with two previously proposed topologies: flattened butterfly [55] and CMesh [14].Flattened butterfly is intended to minimize latency which is not the first priority in our system;100our priority is increasing throughput. Since flattened butterfly is a high-radix topology, we useda 64 node system as our baseline to have a meaningful comparison. The 64-node baseline isconfigured as an 8×8 mesh with DOR routing and a top-bottom MC arrangement. For the fol-lowing comparisons we simulated 56 compute cores and 8 MCs and kept the bisection bandwidthconstant.Our throughput-effective technique results in a 28% HM speedup while flattened butterflyresults in a 17% HM slowdown over the 64-node baseline. That is, our throughput-effectivetechnique is 54% HM faster than flattened butterfly. As the bisection bandwidth is held constant,the link width of flattened butterfly is 8 bytes. This means the terminal bandwidth of MC nodesin the flattened butterfly configuration is half of the baseline single mesh and a quarter of DCIE2P mesh configuration. This is the primary reason for the slowdown of flattened butterfly andresults in DCIE 2P being more throughput-effective, even though the flattened butterfly has asmaller area (total router area is 41% smaller than the DCIE 2P). The performance of flattenedbutterfly can be improved with the multi-port MC technique (Section sec:2port) but this wouldalso increase the cost of the flattened butterfly network.We also simulated a CMesh with express channels [14] while keeping the bisection band-width constant. CMesh with express channels results in a 19% HM speedup over baseline 64-node. That is, CMesh is 7% slower than our throughput-effective technique while its area isaround 21% higher. If we assume a CMesh with no express channels, then keeping bisectionbandwidth constant results in a flit size of 32 bytes which translates into a total router area thatis 4.5× larger than our throughput-effective technique. This CMesh topology is also 12% fasterthan the DCIE 2P technique but not as throughput-effective. Note that all of our techniques areapplicable to CMesh and can increase its efficiency, although the benefit of half-routers dimin-ishes as the number of local ports increases.1016.10.3 ScalabilityWe expect the benefits of our techniques to increase as the number of compute cores increases inthe future, the number of MCs will not scale at the same pace as compute nodes which exacer-bates the many-to-few-to-many issue. In fact, the number of MCs has been kept under eight bythe industry. For example, while NVIDIA’s GTX280 has 8 MCs, more recent GTX480 and GTXTITAN have 6 MCs.In order to show the scalability of our proposed techniques we simulated an 11×11 systemwith 112 compute nodes and 8 MCs in addition to the 64-node configuration mentioned in Sec-tion 6.10.2. One node was left empty to prevent load-balancing issues in our benchmarks thatcan skew the results for the 11×11 configuration. Applying DCIE 2P technique results in a 28%speedup and a 50% total router area reduction for the 64-node system. When same techniquesare applied to the 120-node configuration, a 43% speedup and a 51% total router area reduc-tion is achieved. In our baseline 36-node configuration, similar comparisons result in a 11.9%speedup and a 49% total router area reduction. These results indicate that benefits of our tech-niques increase in terms of both speedup and area reduction as the network scales and becomeslarger.We also expect the benefits of reducing NoC area to increase as technology nodes get smaller.As the technology scales other components on the chip also scale down, so the ratio of routersto the total chip area does not go down. In fact, since: (1) the wire pitch is not scaling at alower rate than transistors and (2) crossbars and links are wire dominated, we expect that theratio of interconnect will increase in smaller technology nodes. For example in Intel’s 65nmand 45nm technology nodes, transistors scale at 65/45 = 0.69 rate while wire pitch scales at210nm/160nm = 0.76 rate [9, 43]. This means wire-dominated parts like a crossbar becomelager relative to other components. As a real example, 5-port 144-bit routers occupy 1.17 mm2in 45nm technology [107] which is not particularly small.1026.10.4 PowerWhile power is increasingly becoming an important factor in NoC design, our focus in thisdissertation is to increase performance while reducing area. In general, the power consumed inthe router is dominated by buffer, crossbar, and the channels [113]. Our throughput-effectiveapproach reduces NoC area by reducing complexity of these components. Therefore, we expectthe power consumed to be proportionally reduced as well. Our approach never increases thedistance travelled by packets on the chip, as it has minimal routing. In fact, the scattered MCplacement of checkerboard routing reduces the average packet hop count by an average of 8%across all the applications compared to a peripheral top-bottom placement of MCs. The reducedhop-count directly translates into dynamic power savings in NoC.The crossbar switches in half-routers have fewer connections compared to full-routers andthus save dynamic power as their capacitance is smaller. The use of half-routers also reducesthe complexity of switch allocators. Additionally, our techniques reduce the NoC area and theoverall area of the chip, the saved area results in direct savings in static power. A smaller chipalso means lower dynamic power in NoC as the packet will be travelling for relatively shorterdistances.The channel slicing technique we employed also reduces the power consumption by replacingeach large router with two smaller routers. A smaller router means that packets have to traverse ashorter distance when passing through the router and thus consume less energy. This is the sameconcept that saves power when a large out-of-order core is replaced by several smaller simplercores.Our proposed techniques also increase the overall application-level performance without em-ploying any speculative techniques. In other words, none of our techniques can have uselessoperations which are detrimental to power consumption. A shorter runtime for a given applica-103tion translates to either longer battery life in embedded applications or lower operations costs inhigh-performance server applications.The design process for a power centric approach can be driven utilizing the same frameworkthat we used to arrive at our throughput-effective organizations. For example, a weighted com-bination of power and area can be used as the denominator of the throughput-effective equation(we used IPC/mm2).6.11 SummaryIn this chapter we presented the experimental results for the throughput-effective NoC optimiza-tion techniques proposed in Chapter 5. First we described our simulation setup, we then showedthe performance impact of checker MC placement, checkerboard routing, multi-port routers atthe MC nodes, and channel slicing. Next we evaluated the impact of the DCI network and thenshowed the effects of various MC placements on the aforementioned techniques. We showedthat the most throughput-effective design is DCIE with two-port MCs, checkered MC placementand DOR routing. This design improves the performance by an HM average of 19.5% comparedto the baseline. It offers the best tradeoff of performance and area among the combinations ofproposed techniques and results in a 24.3% improvement in throughput-effectiveness (IPC/mm2).104Chapter 7Related Work7.1 Accelerator ArchitecturesThere have been acceleration architectures proposed besides the GPU model that we analyze inthis dissertation. Rigel [48] is an accelerator that is fundamentally similar to our architecture butprovides a more flexible programming model compared to CUDA and chooses a MIMD modelrather than SIMT. Rigel employs bulk-synchronous parallel style languages supporting coher-ence and demonstrated that cache-to-cache communication is relatively infrequent in acceleratorapplications written in BSP-style languages.Mahesri et al. [73] introduce a class of applications for visualization, interaction, and simu-lation. They propose using an accelerator architecture (xPU) separate from the GPU to improveperformance of their benchmark suite. The Cell processor [20, 102] consists of a controlling pro-cessor and a set of SIMD co-processors each with independent program counters and instructionmemory. Cell’s various processing elements communicate explicitly through DMA. NoC designof the Cell [56] architecture is an example of making tradeoffs between network area and latency.The Cell designers chose a ring over a crossbar to meet their area and power constraints [59].105UltraSPARC T2 [114] microprocessor is a multithreading, multi-core CPU which is a mem-ber of the SPARC family and comes in four, six, and eight core variations, with each core capableof running eight threads concurrently. It has a crossbar between the L2 and the processor cores.Although T2 supports many concurrent threads (64), this number is very small compared toGPUs. For example, GTX 280 supports 30,720 threads per chip.Merrimac [23] and Imagine [5] are both pioneering streaming processor architectures devel-oped at Stanford. Stream processing is a computer programming paradigm, related to SIMD,that allows some applications to more easily exploit a limited form of parallel processing. Suchapplications can use multiple computational units, such as the floating point units on a GPU,without explicitly managing allocation, synchronization, or communication among those units.Stream processing is a broad concept that incorporates various architectures. For example, theCell processor can function like a stream processor with appropriate software support. GPUs area special form of stream processors.Intel’s Pangaea [124], is a heterogeneous CPU-GPU design that integrates an IA32 CPU withseveral X4500 GMA GPU cores without the fixed function graphics units. Pangaea demonstratesthe benefits of removing fixed-function graphics hardware for a general-purpose accelerator.7.2 Interconnection NetworksIncreasing the number of cores on a single chip has increased the importance of on-chip net-works. While there is a large body of work about on-chip networks, we limit the followingdiscussion to the most relevant research works that have similarities to some aspects of this dis-sertation.Intel’s 80-core design [119] and Tilera’s TILE64 [122] processors both employ a 2D meshtopology similar to the topology we employed in this work. In these chips the memory con-trollers are placed on the periphery of the chip similar to the balanced baseline design we usedin Chapter 4.106Much of the research in NoC have focused on reducing network latency by improving differ-ent aspects of NoC such as lower-latency router microarchitectures [60, 82, 99], lower-diametertopologies [14, 34, 55], or better flow control [61, 62]. However, as we showed in Section 4.2,compute accelerator applications are more sensitive to bandwidth and aggressive router latencyoptimizations result in minor improvements in their overall performance. Despite this observa-tion, including a particular optimization for router latency reduction would not hurt throughput-effectiveness if it can be implemented with negligible cost/area overhead and without affectingthe critical path of the underlying circuits.Bufferless routing [81] was proposed to reduce network cost and energy consumption byremoving the input buffers of the routers. Network throughput can be degraded with bufferlessrouting for applications with high traffic which is the case for throughput accelerators. This is adirect result of the deflection routing techniques employed in bufferless routers which increaseas traffic requirements grow. When a destination router cannot accept an incoming packet it ismisrouted to resolve contention. Misrouting is necessary as there is no buffer to save the packetfor routing in later cycles in a bufferless router.We explored on-chip networks for GPUs in Chapter 3 where the impact of different networkparameters was evaluated. We then built upon that work and provided more in-depth analysis andproposed a cost-efficient on-chip network architecture for accelerator architectures in the laterchapters of this work. There are other recent attempts to specialize the NoC for many-to-few-to-many traffic patterns. Lotfi-Kamran et al. [72] segregate core and last-level cache slices intodifferent tiles; as a result, the traffic pattern becomes many-to-few-to-many. Similar to our work,they apply the concept of reducing connectivity to decrease the area of NoC. Unlike throughputaccelerator workloads, their target domain of server workloads is dominated by applications withhigh latency sensitivity.107Abts et al. [2] studied alternative memory controller placements for core-memory traffic;however, they did not show the overall performance benefits on applications but focused onlatency metrics and synthetic traffic patterns. The memory controller placement that we usein this work leverages the same concept discussed by Abts et al. by staggering the memorycontroller placement. This work also shows how the overall performance can be significantlyimproved.One of the cases of our proposed checkerboard routing occurs in two phases—in the firstphase, YX routing is used to route to an intermediate node and in the second phase, XY routingis used to route minimally to the destination. There are randomized routing algorithms such asValiant routing [118] where packets are routed to an intermediate node similar to the special caseof checkerboard routing algorithm mentioned above. In the Valiant routing algorithm the goal ofpicking a random intermediate node is to unify the traffic pattern while we employ this techniqueto enable minimal routing. The Valiant routing algorithm can pick any random intermediate nodeand as a result is not minimal routing.The special case of checkerboard routing is also similar to a two-phase ROMM routing [84].In the two-phase ROMM the routing is done by always routing a packet to an intermediate routerwithin the minimal quadrant and then from that node to the destination. Both phases use the samerouting technique. The special case of checkerboard routing differs as the random intermediatenode is restricted to a full-router and each phase needs to be done with a different DOR routing.O1Turn routing algorithm [111] is an oblivious minimal routing algorithm that is designed toprovide optimal worst-case throughput in a mesh network. Similar to our checkerboard routing,O1Turn requires two virtual channels to avoid circular dependencies and routing deadlocks. YXrouting is done using one virtual channel while XY routing uses another virtual channel. O1Turnrandomly picks the first dimension to route and then routes the second dimension. O1turn iscomplementary to our throughput effective techniques and can be employed as the routing algo-108rithm in our double-checker inverted design (DCI) as mentioned in Section 5.6. Abts et al. [2]also proposed class-based deterministic routing (CDR) which is beneficial in the cases with aperipheral row based memory controller placement. In the CDR routing the request packets arerouted using XY routing and the reply packets are routed using YX routing. As we showed inSection 5.6, CDR can be employed in conjunction with our proposed DCI technique.Concentration has been proposed for on-chip networks in the form of high radix routers likeCMesh [14] and flattened butterfly [55]. A CMesh topology is similar to a mesh, except thateach router is directly connected to more than one core or memory controller. In a 2D flattenedbutterfly topology the routers are also placed like a 2D mesh, but each router is connected to allof the other routers in the same row and column as opposed to only its neighbours in a mesh. Asa result, in a flattened butterfly network the links can have variable latencies due to their differentlengths which in turn either requires variable sized buffers in routers or fixed size buffers largeenough for the longest latency round trip credit delay. A flattened butterfly also has narrowerchannels compared to a mesh with the same bisection bandwidth. Due to the full connectivity ofthe routers it is impractical to have many of them in any given dimension and therefore multipleclient nodes are connected to each router similar to the CMesh. Both of these techniques [14, 55]propose increasing the radix of the routers in on-chip networks to reduce the network diameterand increase network performance, mainly through lower latency. The latency reduction in suchdesigns is a result of reduced hop count and comes at the cost of more complex routers. Themulti-port approach we proposed differs as we only increase radix across a few routers connectedto the memory controllers to minimize the impact on complexity. We provided comparisons ofour techniques with both of these integrated forms of concentration in Section 6.10.2.The router microarchitecture of the proposed half-router has some similarities to a dimension-sliced microarchitecture [50]. In a dimension-sliced router, the packets can still change dimen-sions while we limit this capability in the half-routers. The half-router also shares some similarity109to the low-cost router microarchitecture [53]. However, unlike the low-cost router microarchitec-ture which provides full connectivity for XY routing, the routing is restricted in the half-routerto further reduce complexity. In addition, the low-cost router is limited because of its support foronly DOR routing and fairness is an issue for the applications which have high network traffic.Kim et al. [54] employed channel slicing in combination with the reduction of the networkdiameter in the design of their high-radix router. However, in our double network designs weutilized channel slicing without reducing the network diameter in order to reduce the area costof the network.Volos et al. [120] proposed CCNoC which employs a double network organization whereeach subnetwork is specialized based on the packet types traversing it. The networks have differ-ent bandwidth and router microarchitectures. The resulting asymmetric network enables energysaving in CCNoC’s target domain of cache-coherent server workloads.Das et al. [24] employ a hybrid of bus and mesh networks. They employ a hierarchicalnetwork design which connects a series of physically localized cores with a bus network andthen connects these groups of cores with a higher level mesh network. This configuration isbeneficial when there is locality in the traffic pattern as is the case for the chip multiprocessors(CMP) with coherent traffic. However, unlike CMPs there is no locality in our traffic pattern asthe communication is between the many cores and the few memory controllers.The heterogeneous NoC work by Mishra et al. [80] was done concurrently with our workand utilizes the same concepts as our work which is improving a few routers of the NoC whileincreasing the overall system performance. However, their work is orthogonal to our work. Theyleverage the heterogeneity of the mesh topology to create their heterogeneous network while ourwork leverages the heterogeneity of the traffic pattern between the many nodes and few memorycontrollers.110Tree saturation effects were first identified by Pfister et al. [101] but we also show how theycan occur in NoCs with compute accelerator applications.Lee et al. [66] proposed a technique to provide guaranteed quality of service in terms of min-imum bandwidth and maximum delay for each node of the on-chip network. Kim et al. [67]also proposed using a probabilistic arbitration and weights based on distance in order to pro-vide equality of service for shared on-chip resources. Their goal was to remove biases in usingNoC resources that might exist based on the location of a core in the mesh. Both of the afore-mentioned [66, 67] studies that tackle the global fairness issues in NoCs are orthogonal to thetechniques we introduce in this dissertation.There are also some parallels to the issues we raised in other domains. For example, FPGAdesigner have developed various switch block configurations in island-style FPGA architec-tures [15]. The island-style FPGA has three main components: configurable logic resources,a configurable interconnect fabric, and input/output resources [15]. The configurable intercon-nect fabric closely resembles a 2D mesh network and provides connections among other com-ponents. Two major components of the interconnect fabric are wire segments (similar to linksin a 2D mesh) and programmable switch blocks (similar to routers in a 2D mesh). Note thatthese routing fabrics are configured when the SRAM-based FPGA is powered up and do notdynamically change during the run time of FPGA as opposed to a 2D mesh on-chip networkwhere the routers dynamically route packets. A typical FPGA has hundreds of wire tracks ineach channel; providing all possible permutation of incoming to outgoing wire tracks in a switchblock is not practical. Therefore, switch block designers have to limit the number of possibleconnections for each input track while trying to ensure routability and maximizing route diver-sity. As an example, the “disjoint” switch block [126] and the Wilton switch block [123] alloweach incoming wire to connect to three outgoing wires (called switch flexibility). They use thesame number of routing switches but the Wilton switch increases route diversity between source111and destinations by changing the wire track numbers at turns. Increasing the number of possibleconnections for wire segments beyond three can increase route diversity but the switch blockbecomes more complex, harder to implement and requires more area. There is a delicate tradeoffbetween complexity/area cost of the design and the route diversity of the design. A similar argu-ment can be made for our half-router design which reduces the router complexity by preventingthe packets from turning. Note that while the switch block area of FPGAs tends to be dominatedby transistors, the router crossbars we studied are wire dominated. Also unlike half-routers thatcompletely prevent turns, the switch blocks only limit the choices of wires that an incoming wirecan connect to and still allow turns. A detailed study of various switch block designs is presentedby Lemieux and Lewis [69].7.3 Performance Evaluation and Simulation ToolsRyoo et al. [106] use CUDA to speedup a variety of relatively easily parallelizable scientificapplications. They explore the use of conventional code optimization techniques and take ad-vantage of the different memory types available on NVIDIA’s 8800 GTX to obtain speedup.While their analysis is performed by writing and optimizing applications to run on actual CUDAhardware, we use a novel performance simulator to observe the detailed behaviour of CUDAapplications upon varying architectural parameters. Khailany et al. [51] explore VLSI costs andperformance of a stream processor as the number of streaming clusters and ALUs per clusterscales. They use an analytical cost model.Lashgar et al. [64] performed several limit studies to measure the effects of control flowand memory subsystem design on overall application-level performance in GPUs. They con-firmed that memory has the largest impact on application performance. They also showed thateliminating control flow divergence can sometimes be detrimental to overall application-levelperformance as a result of increased pressure on the memory subsystem.112Existing graphics-oriented GPU simulators include Qsilver [112], which does not model pro-grammable shaders, and ATTILLA [25], which focuses on graphics specific features. GPGPU-Sim [11] on the other hand focuses on general purpose aspect of modern GPUs.Garnet [100] interconnect simulator can be used with GEMS [76] framework to perform fullsystem simulations. That is, they provide the capability to simulate an entire application witha detailed network model. Garnet has been used to evaluate the impact of the on-chip networkon shared versus private caches [100]. However, Garnet incurs long simulation runtime whenperforming full system simulations. The GPGPU-Sim simulator [11] we use in this dissertationsimulates a large number (tens of thousands) of threads in a single thread on the host platform.Thus, it incurs no operating system context switching overhead and is capable of simulationrates of several hundreds of thousands of scalar operations per cycle enabling relatively largeworkloads to be simulated.Recently several other simulators have been developed that are geared towards heterogeneousCPU-GPU system studies. FusionSim [128] and gem5-gpu [38] both integrate GPGPU-Sim withanother CPU simulator and allow simulation of x86 host code in addition to the kernel code.Multi2Sim [116] is also an open-source simulator that supports ISA-level simulation of an x86CPU and an AMD Evergreen GPU. It supports running OpenCL applications.MacSim [30] is another heterogeneous architecture timing model simulator which can simu-late x86 and NVIDIA PTX. Unlike GPGPU-Sim, MacSim is a trace driven cycle-level simulator.GPU Ocelot [26] is a dynamic compilation framework for PTX that, similar to GPGPU-Sim, canreplace the CUDA runtime and as a result can work with existing CUDA application. Ocelot canrecompile CUDA kernels, and execute the resulting code on a functional emulator or send it forexecution to real hardware (GPU or an x86 CPU).113Chapter 8Conclusions and Future Work8.1 ConclusionWhile the single-thread performance of commercial superscalar microprocessors is still increas-ing, computer manufacturers increasingly provide multithreaded hardware. This trend stronglyencourages software developers to provide explicit parallelism. In addition, modern GPUs pro-vide an order of magnitude more throughput performance than contemporary CPUs making theman attractive alternative for highly parallel workloads. They can be programmed for non-graphicsapplications using CUDA and similar BSP-like programming models. Modern GPUs are alsoactively being used as throughput accelerators to run massively parallel general-purpose applica-tions. Many of these applications cannot take full advantage of this computing power despite thefact that GPUs offer orders of magnitude more raw computing power than contemporary CPUs.To improve the performance of these applications we need to first understand how they behavewhen they run on the GPU hardware. We selected a set of non-trivial CUDA applications withvarying levels of performance improvement on GPU hardware and developed the infrastructureto easily study their behaviour. We modelled the GPU as a closed-loop system: the componentssuch as compute cores, on-chip network and memory controllers were modelled with cycle-level114accuracy. Unmodified CUDA applications were executed by simulating NVIDIA’s PTX virtualinstruction set architecture.We showed that the performance bottlenecks of the CUDA applications we studied differ intype and scope from application to application. We provided application characteristics includingthe dynamic instruction mix and DRAM locality characteristics. For example, we showed thatthe dynamic ALU instruction count can range from 58% to 87% of the total instructions in theapplications we studied. We showed that adding L1 and L2 caches for local/global memorycan improve the performance of these benchmarks by a harmonic mean average of 30%. Wealso showed that aggressive inter-warp memory coalescing can improve performance in someapplications by up to 41%.Our initial studies showed that the applications we study are very sensitive to the on-chip net-work’s bisection bandwidth as opposed to the zero-load latency of the network. Based on theseinitial observations, we focused our efforts on designing optimized networks for throughput ac-celerators. We continued by analyzing the impact of communication and on-chip network acrossa wider range of applications in throughput accelerators, a total of 31 application. Our studiesconfirmed that reducing router latency does not significantly improve overall application-levelperformance but increasing the costly bandwidth does. Then we showed that the many-to-few-to-many traffic pattern creates a bottleneck in the on-chip network. This bottleneck is formed asa result of many compute nodes communicating with a few memory controllers in addition to theprevalence of large and frequent read-reply packets. We focused on improving the throughput-effectiveness of on-chip networks where we optimized for higher application throughput perarea. To achieve a throughput-effective on-chip network, we proposed a checkerboard organiza-tion where we exploited half-routers to reduce crossbar area while maintaining a minimal routingalgorithm. Half-routers do not allow incoming packet to turn. This behaviour enables them totake advantage of a smaller crossbar and save area. In order to alleviate the many-to-few-to-many115bottleneck, we proposed to increase the terminal bandwidth of routers attached to the few mem-ory controller nodes. We employed the multi-port routers for this purpose. Another alternativesmethod for increasing the flow of information from the memory controllers to the network canbe providing multiple connection from a memory controller to its nearby routers (in addition tothe local router). While increasing the terminal bandwidth increases the area of routers involved,the overall area impact on the chip as a whole is minimal because this change only affects thefew routers connected to the memory controllers.We then proposed the Double Checkerboard Inverted (DCI) network which utilizes half-routers in a channel-sliced network. In a DCI network each compute or memory controller nodehas one full-router and one half-router. DCI has a simple routing algorithm and has no MCplacement restrictions. The packets are divided among the subnetworks based on a set of rulesupon injection. Once the packets get in the network routing can be done similar to the dimensionorder routing. The zero-load latency of DCI network for a large packet like write requests ishigher than a single network with equal bisection bandwidth as a direct result of channel slicing.This slight increase in latency is well tolerated by the compute applications and meanwhile DCIreduces the NoC area by taking advantage of channel slicing and half-routers.We also proposed to enhance the network utilization of DCI by adding heuristics that improvethe overall network utilization (DCIE). DCIE selects the subnetwork that has been least utilizedrecently when a non-turning packets is being injected to the network. It takes advantage of theobservation that non-turning packets can travel on either subnetwork and are oblivious to theplacement of half-routers.The combination of DCIE with checker-placed multi-port MC routers offers the best tradeoffof performance and area among the various combinations of our proposed techniques. Thiscombination results in a 24.3% improvement in throughput-effectiveness (IPC/mm2) over thebaseline. In this design, around one-fifth of the gain in throughput-effectiveness is the result of116saving area and the rest is due to application-level speedup. We also proposed an extension of thedouble checkerboard inverted that employs class-based deterministic routing and can be utilizedto provide a similar throughput-effectiveness if a peripheral placement of memory controllers isdesired. Finally we demonstrated the scalability of our proposed techniques as the number ofnodes increases in the future and demonstrated its effectiveness in presence of concentration.Another major benefit of reducing the chip area is improvements in manufacturing yield.Manufacturing yield as well as the number of chips per wafer are inversely proportional to thearea of the chip. As a result techniques that reduce chip area directly result in improved manu-facturing yield and lower costs. We also discuss how our proposed memory controller placementtechniques can further improve yield in Section 8.2.2.We should also note that our proposed techniques for throughput-effective network design donot require any change in the high-level programming model of the accelerator. The programmeronly experiences a reduction in the application runtime for applications with high bandwidthrequirements. Our proposed techniques are also oblivious to the various memory regions thatexist in a GPU-like accelerator. For example, two requests for local and global memory regionsare treated the same inside the network, only their payloads differ.We used instructions per cycle (IPC) as a metric to measure application-level throughput inthis dissertation. However, this is under the general assumption of a fixed cycle time. Sincewe aimed for the simplest possible designs, none of our final proposed throughput-effectivetechniques increase the cycles time. In fact, some of our techniques can reduce the cycle time.For example, the half-routers have a simpler switch allocation stage due to the limited numberof choices for output ports. Half-routers also have a faster switch traversal stage as the crossbarswitch is smaller and simpler. We did not change the number of the router pipeline stages whenmoving from the baseline towards the DCI design. DCI and DCIE techniques only require someextra logic compared to the baseline when injecting a packet to the NoC. They have to make a117decision to pick a subnetwork for injecting a packet. However this logic is very simple (e.g.,checking if distance to destination column is odd or even) and also not on the critical path sincethe decision needs to be made only once per packet and not for each flit. Additionally, in ourcomparisons with more complex router architecture such as the flattened butterfly we accountedfor the increased complexity of such designs by adding an extra pipeline stage to their routerpipeline. We believe our performance comparisons that use IPC in this work are fair.In Section 4.2.2 we classified applications that had high memory usage and high sensitivityto on-chip network as HH applications. Most of the performance improvements of our proposedtechniques manifested in these applications. However, we expect the number of applicationsthat fall into this category to grow over time as the accelerators scale in the future. As we havementioned before, the compute density of compute nodes grows at a higher pace than memorybandwidth. In order to feed a more powerful compute node the NoC will have to move moredata. That is, as the compute density of NoC’s nodes grows, even applications that have a modestnetwork usage turn into heavy network users.In this dissertation, we focused on GPU-like throughput accelerators. That is, we tried toimprove compute performance while reducing the cost and we did not study the effects of ourtechniques with graphics applications such as games. Nevertheless, we believe that employingour proposed throughput-effective designs in a GPU would benefit graphics and gaming applica-tions as well. In fact, graphics applications are notorious for have a very large working set thatdoes not fit in an on-chip cache and as a result have massive amounts of on-chip and off-chipmemory traffic. That is, we expect graphics applications to be in the HH group of applicationsand benefit from our techniques even more than the applications we studied.8.2 Moving ForwardThis section discusses some directions to extend the scope of this work.1188.2.1 Tree SaturationMemory access latency is composed of the following three components: NoC traversal to reachthe memory controller (request network), memory access, and another NoC traversal to reachthe compute core (reply network). Because of this dependency, one component of the memoryaccess latency can impact other components and create a coupling effect. For example, poorDRAM locality can severely slow down a memory controller such that its buffers get full andback up into the request network. Therefore, the DRAM performance affects the performance ofthe request network. In addition, if the reply network cannot fully absorb the traffic injected byan MC, the MC will also stop processing requests and cause the MC input buffers to fill up. Asthe MCs stop accepting packets, the request network will be affected. Thus, a bottleneck in thereply network impacts not only the request network but also decreases DRAM efficiency. Treesaturation can occur in the NoC because of these coupling effects. In our studies we observedmomentary tree saturations that resolve with time. An illustration of a momentary tree saturationis shown in Figure 8.1 for SCP benchmark. The instantaneous hotspot traffic to MC1 has backedup buffers to create a vertical tree saturation with some horizontal channels also backing up. Thisprevents other nodes in the network from utilizing the network. For example, two routers on thebottom left corner have packets to be sent to MC3 but because of the packets backed up to MC1,they can not make progress.While our proposed techniques such as the multi-port MC router design and to some degreescattering the MCs around the chip can relieve the effects of tree saturations, we believe studyingthis issue in more detail has its merits. Our proposed techniques are not designed to specificallytarget this issue. We believe, it is possible to improve the overall application-level performanceby understanding the underlying conditions that lead to momentary formation of trees in thenetwork and designing techniques to prevent their formation.119MC1 MC2 MC3MC0Figure 8.1: Tree saturation example as observed in SCP benchmark. Traffic to MC1 isblocking traffic to MC0, MC2 and MC3 nodes.8.2.2 Yield ImprovementManufacturing yield and yield enhancement are of utmost importance to chip manufacturers. Aswe mentioned earlier, improvements in manufacturing yield are a major benefit of the techniquesthat reduce the chip area. In this section, we also discuss how we can take advantage of ourproposed memory controller placement techniques in a mesh topology to further improve yield.Most companies design and manufacture high-end chips and sell the chips that contain de-fective nodes as low-end parts by disabling the faulty nodes. While a permanent fault in the NoCof a chip with a central crossbar would render the chip useless, a chip with a mesh based NoCcan still be sold with faulty routers.Here we describe a simple solution for a mesh NoC with an even distribution of chip re-sources. Our proposed checker placement of MCs distributes MCs evenly around the chip whilea top-bottom placement of MCs does not. The simplest method for disabling chip componentswhen a router is faulty is the following. First, divide the mesh into two rectangular groups of120Figure 8.2: MC placement dependant yield enhancement.nodes where the faulty node is on the perimeter of one of the rectangles, then disable all thenodes in the rectangle that contains the faulty nodes. The rest of the chip can be sold as a smallerlow-end unit. As an example, if a router is faulty on the right column of the mesh shown in Fig-ure 8.2 disabling that column results in a chip with 83.3% functionality ((36−6)/36). If a routeris faulty in the blue zone of Figure 8.2 then the chip will have 66.6% functionality as two rowsor columns need to be disabled. Finally a faulty router in the red central zone would disable halfthe chip and 50% of the functionality is saved. Note that multiple faults in the disabled regioncan be tolerated. The above method does not impose any extra overhead when there are no faultynodes.This would not work as effectively if the MCs are placed adjacent to each other on a row orcolumn because in that case a faulty MC router can render many of the MCs useless at once.A possible direction for future work would be finding throughput-effective NoCs that are moreresilient in the face of router failures. The aforementioned techniques require disabling completerows of columns of the chip when they contain faulty routers. An optimal solution would be121simply disabling the faulty node(s) while the rest of the chip remains functional. We believedesigning throughput-effective networks that are resilient to node failures (either permanent ortransient) is a promising direction for future work.122Bibliography[1] http://www.gpgpu.org. → pages 3, 8[2] D. Abts, N. D. Enright Jerger, J. Kim, D. Gibson, and M. H. Lipasti. Achievingpredictable performance through better memory controller placement in many-coreCMPs. In Proc. IEEE/ACM Symp. on Computer Architecture (ISCA), pages 451–461,2009. ISBN 978-1-60558-526-0. → pages 2, 65, 75, 81, 91, 92, 108, 109[3] ATI CTM Guide. Advanced Micro Devices, Inc., 1.01 edition, 2006. → pages 1, 8[4] Press Release: AMD Delivers Worlds First TeraFLOPS Graphics Chip. Advanced MicroDevices, Inc., 25 June 2008. → pages 1[5] J. H. Ahn, W. J. Dally, B. Khailany, U. J. Kapasi, and A. Das. Evaluating the Imaginestream architecture. In Proc. IEEE/ACM Symp. on Computer Architecture (ISCA), pages14–25, 2004. → pages 106[6] S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, and M. Ripeanu. StoreGPU:exploiting graphics processing units to accelerate distributed storage systems. In Proc.17th Int’l Symp. on High Performance Distributed Computing, pages 165–174, 2008.ISBN 978-1-59593-997-5. doi:http://doi.acm.org/10.1145/1383422.1383443. → pages32[7] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan,D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, and K. Yelick. A view of the parallelcomputing landscape. Commun. ACM, 52(10):56–67, 2009. → pages 50[8] M. Awasthi, D. W. Nellans, K. Sudan, R. Balasubramonian, and A. Davis. Handling theproblems and opportunities posed by multiple on-chip memory controllers. InProceedings of the 19th International Conference on Parallel Architectures andCompilation Techniques, PACT ’10, pages 319–330, New York, NY, USA, 2010. ACM.ISBN 978-1-4503-0178-7. doi:10.1145/1854273.1854314. URLhttp://doi.acm.org/10.1145/1854273.1854314. → pages 92[9] P. Bai, C. Auth, S. Balakrishnan, M. Bost, R. Brain, V. Chikarmane, R. Heussner,M. Hussein, J. Hwang, D. Ingerly, R. James, J. Jeong, C. Kenyon, E. Lee, S.-H. Lee,123N. Lindert, M. Liu, Z. Ma, T. Marieb, A. Murthy, R. Nagisetty, S. Natarajan, J. Neirynck,A. Ott, C. Parker, J. Sebastian, R. Shaheed, S. Sivakumar, J. Steigerwald, S. Tyagi,C. Weber, B. Woolery, A. Yeoh, K. Zhang, and M. Bohr. A 65nm logic technologyfeaturing 35nm gate lengths, enhanced channel strain, 8 cu interconnect layers, low-k ildand 0.57 um2 sram cell. In Electron Devices Meeting, 2004. IEDM Technical Digest.IEEE International, pages 657 – 660, dec. 2004. → pages 102[10] A. Bakhoda and T. M. Aamodt. Extending the Scalability of Single Chip StreamProcessors with On-chip Caches. In 2nd Workshop on Chip Multiprocessor MemorySystems and Interconnects (CMP-MSI 2008), June 2008. → pages 2, 65[11] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt. Analyzing CUDAworkloads using a detailed GPU simulator. In Proc. IEEE Symp. on PerformanceAnalysis of Systems and Software (ISPASS), pages 163–174, April 2009. → pages 2, 3,20, 27, 28, 31, 34, 38, 41, 44, 51, 53, 65, 78, 113[12] A. Bakhoda, J. Kim, and T. M. Aamodt. Throughput-effective on-chip networks formanycore accelerators. In Proc. IEEE/ACM Symp. on Microarchitecture (MICRO), pages421–432, 2010. ISBN 978-0-7695-4299-7. → pages 3, 80, 81, 86[13] A. Bakhoda, J. Kim, and T. Aamodt. Designing on-chip networks for throughputaccelerators. ACM Trans. on Architecture and Code Optimization, 10(3):–, 2013. →pages 3[14] J. D. Balfour and W. J. Dally. Design tradeoffs for tiled CMP on-chip networks. In Proc.ACM Conf. on Supercomputing (ICS), pages 187–198, 2006. → pages 51, 55, 71, 99,100, 101, 107, 109[15] V. Betz, J. Rose, and A. Marquardt. Architecture and CAD for Deep-submicron FPGAs.Kluwer Academic Publishers, 1999. → pages 111[16] Billconan and Kavinguy. A Neural Network on GPU.http://www.codeproject.com/KB/graphics/GPUNN.aspx. → pages 32[17] P. Buffet, J. Natonio, R. Proctor, Y. Sun, and G. Yasar. Methodology for I/O cellplacement and checking in ASIC designs using area-array power grid. In IEEE CustomIntegrated Circuits Conference, 2000. → pages 24[18] D. Burger and T. M. Austin. The SimpleScalar Tool Set, Version 2.0.http://www.simplescalar.com, 1997. → pages 3[19] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron.Rodinia: A benchmark suite for heterogeneous computing. In Proc. IEEE Symp. onWorkload Characterization (IISWC), pages 44–54, 2009. → pages 4, 48, 50, 53, 57124[20] S. Clark, K. Haselhorst, K. Imming, J. Irish, D. Krolak, and T. Ozguner. Cell BroadbandEngine interconnect and memory interface. In Hot Chips 17, Palo Alto, CA, August2005. → pages 105[21] B. W. Coon and E. J. Lindholm. US patent 7,353,369: System and method for managingdivergent threads in a SIMD architecture, 2008. → pages 11[22] W. J. Dally and B. Towles. Principles and Practices of Interconnection Networks.Morgan Kaufmann, 2004. → pages 11, 12, 14, 15, 24, 34, 54, 71, 85[23] W. J. Dally, F. Labonte, A. Das, P. Hanrahan, J.-H. Ahn, J. Gummaraju, M. Erez,N. Jayasena, I. Buck, T. J. Knight, and U. J. Kapasi. Merrimac: Supercomputing withstreams. In ACM/IEEE Conf. on Supercomputing, page 35, 2003. → pages 106[24] R. Das, S. Eachempati, A. Mishra, V. Narayanan, and C. Das. Design and evaluation of ahierarchical on-chip interconnect for next-generation CMPs. In Proc. IEEE Symp. onHigh-Perf. Computer Architecture (HPCA), pages 175 –186, feb. 2009. → pages 110[25] V. del Barrio, C. Gonzalez, J. Roca, A. Fernandez, and E. E. ATTILA: a cycle-levelexecution-driven simulator for modern GPU architectures. In Proc. IEEE Symp. onPerformance Analysis of Systems and Software (ISPASS), pages 231–241, March 2006.→ pages 113[26] G. F. Diamos, A. R. Kerr, S. Yalamanchili, and N. Clark. Ocelot: A dynamicoptimization framework for bulk-synchronous applications in heterogeneous systems. InProc. IEEE/ACM Conf. on Parallel Architectures and Compilation Techniques (PACT),PACT ’10, pages 353–364, New York, NY, USA, 2010. ACM. ISBN978-1-4503-0178-7. doi:10.1145/1854273.1854318. URLhttp://doi.acm.org/10.1145/1854273.1854318. → pages 113[27] J.-W. Fang and Y.-W. Chang. Area-I/O flip-chip routing for chip-package co-designconsidering signal skews. Computer-Aided Design of Integrated Circuits and Systems,IEEE Transactions on, 29(5):711 –721, may 2010. → pages 89[28] W. Fung and T. Aamodt. Thread block compaction for efficient simt control flow. InHigh Performance Computer Architecture (HPCA), 2011 IEEE 17th InternationalSymposium on, pages 25–36, 2011. doi:10.1109/HPCA.2011.5749714. → pages 81[29] W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic warp formation andscheduling for efficient GPU control flow. In Proc. IEEE/ACM Symp. onMicroarchitecture (MICRO), pages 407–420, 2007. → pages 3, 21, 28, 30, 33, 79, 81[30] MacSim. Georgia Institute of Technology. URL https://code.google.com/p/macsim/. →pages 113125[31] M. Giles. Jacobi iteration for a Laplace discretisation on a 3D structured grid.http://people.maths.ox.ac.uk/˜ilesm/hpc/NVIDIA/laplace3d.pdf. → pages 32[32] M. Giles and S. Xiaoke. Notes on using the NVIDIA 8800 GTX graphics card.http://people.maths.ox.ac.uk/˜gilesm/hpc/. → pages 32[33] C. Gregg and K. Hazelwood. Where is the data? why you cannot debate cpu vs. gpuperformance without the answer. In Performance Analysis of Systems and Software(ISPASS), 2011 IEEE International Symposium on, pages 134–144, 2011.doi:10.1109/ISPASS.2011.5762730. → pages 19[34] B. Grot, J. Hestness, S. W. Keckler, and O. Mutlu. Express cube topologies for on-chipinterconnects. In Proc. IEEE Symp. on High-Perf. Computer Architecture (HPCA), pages163–174, 2009. → pages 107[35] Z. S. Hakura and A. Gupta. The design and analysis of a cache architecture for texturemapping. In Proc. 24th Int’l Symp. on Computer Architecture, pages 108–120, 1997. →pages 21[36] P. Harish and P. J. Narayanan. Accelerating Large Graph Algorithms on the GPU UsingCUDA. In HiPC, pages 197–208, 2007. → pages 32[37] M. Harris. UNSW CUDA tutorial part 4 optimizing CUDA, 2009.http://www.cse.unsw.edu.au/˜pls/cuda-workshop09/slides/04 OptimizingCUDA full.pdf. → pages 24, 53, 92, 94[38] J. Hestness, M. Orr, and J. Power. gem5-gpu. http://gem5-gpu.cs.wisc.edu/wiki/. →pages 30, 113[39] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel. TheMicroarchitecture of the Pentium R© 4 Processor. Intel R© Technology Journal, 5(1), 2001.→ pages 26[40] H. Igehy, M. Eldridge, and K. Proudfoot. Prefetching in a texture cache architecture. InProc. SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, 1998. → pages26[41] Illinois Microarchitecture Project utilizing Advanced Compiler Technology ResearchGroup. Parboil benchmark suite. http://www.crhc.uiuc.edu/IMPACT/parboil.php. →pages 32[42] Infineon. 256Mbit GDDR3 DRAM, Revision 1.03 (Part No. HYB18H256321AF).http://www.infineon.com, December 2005. → pages 24, 25126[43] D. Ingerly, S. Agraharam, D. Becher, V. Chikarmane, K. Fischer, R. Grover, M. Goodner,S. Haight, J. He, T. Ibrahim, S. Joshi, H. Kothari, K. Lee, Y. Lin, C. Litteken, H. Liu,E. Mays, P. Moon, T. Mule, S. Nolen, N. Patel, S. Pradhan, J. Robinson,P. Ramanarayanan, S. Sattiraju, T. Schroeder, S. Williams, and P. Yashar. Low-Kinterconnect stack with thick metal 9 redistribution layer and Cu die bump for 45nm highvolume manufacturing. In Interconnect Technology Conference, 2008. IITC 2008.International, pages 216 –218, june 2008. → pages 102[44] ITRS. Int’l Technology Roadmap for Semiconductors 2008 update, 2008.http://www.itrs.net/Links/2008ITRS/Home2008.htm. → pages 4, 99[45] N. Jiang, D. U. Becker, G. Michelogiannakis, J. Balfour, B. Towles, J. Kim, and W. J.Dally. A detailed and flexible cycle-accurate network-on-chip simulator. In Proc. IEEESymp. on Performance Analysis of Systems and Software (ISPASS), pages 86–96, 2013.→ pages 79[46] K40. NVIDIA Tesla K40 GPU active accelerator, 2013.http://www.nvidia.com/content/PDF/kepler/Tesla-K40-Active-Board-Spec-BD-06949-001 v03.pdf. → pages89[47] A. Kahng, B. Li, L.-S. Peh, and K. Samadi. ORION 2.0: A fast and accurate NoC powerand area model for early-stage design space exploration. In Proc. IEEE/ACM Conf. onDesign Automation and Test in Europe (DATE), pages 23 – 428, April 2009. → pages 80,94[48] J. H. Kelm, D. R. Johnson, S. S. Lumetta, M. I. Frank, and S. Patel. A task-centricmemory model for scalable accelerator architectures. IEEE Micro Special Issue: TopPicks 2010, 30(1):29–39, Jan./Feb. 2010. → pages 2, 4, 48, 66, 105[49] J. H. Kelm, D. R. Johnson, W. Touhy, S. S. Lumetta, and S. Patel. Cohesion: A hybridmemory model for accelerator architectures. In Proc. IEEE/ACM Symp. on ComputerArchitecture (ISCA), pages 429–440, 2010. → pages 4[50] R. Kessler and J. Schwarzmeier. Cray T3D: A new dimension for Cray research.Compcon Spring ’93, Digest of Papers., pages 176–182, 22-26 Feb 1993. → pages 64,109[51] B. Khailany, W. J. Dally, S. Rixner, U. J. Kapasi, J. D. Owens, and B. Towles. Exploringthe VLSI scalability of stream processors. In Proc. 9th Int’l Symp. on High PerformanceComputer Architecture, page 153, 2003. → pages 112[52] Khronos Group. OpenCL - the open standard for parallel programming of heterogeneoussystems, 2010. http://www.khronos.org/opencl/. → pages 2, 9127[53] J. Kim. Low-Cost router microarchitecture for on-chip networks. In Proc. IEEE/ACMSymp. on Microarchitecture (MICRO), pages 255–266, 2009. → pages 110[54] J. Kim, W. J. Dally, B. Towles, and A. K. Gupta. Microarchitecture of a high-radixrouter. In Proc. IEEE/ACM Symp. on Computer Architecture (ISCA), pages 420–431,2005. → pages 85, 110[55] J. Kim, J. Balfour, and W. Dally. Flattened Butterfly topology for on-chip networks. InProc. IEEE/ACM Symp. on Microarchitecture (MICRO), pages 172–182, 2007. → pages51, 99, 100, 107, 109[56] M. Kistler, M. Perrone, and F. Petrini. Cell multiprocessor communication network:Built for speed. IEEE Micro, 26:10–23, 2006. ISSN 0272-1732. → pages 105[57] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: A 32-way multithreaded Sparcprocessor. IEEE Micro, 25(2):21–29, 2005. ISSN 0272-1732. → pages 52[58] D. Kroft. Lockup-free Instruction Fetch/Prefetch Cache Organization. In Proc. 8th Int’lSymp. Computer Architecture, pages 81–87, 1981. → pages 26[59] D. Krolak. Cell Broadband Engine EIB bus.http://www.ibm.com/developerworks/power/library/pa-expert9/, Retrieved Sept. 2010.,2005. → pages 105[60] A. Kumar, P. Kundu, A. Singh, L.-S. Peh, and N. Jha. A 4.6Tbits/s 3.6GHz single-cycleNoC router with a novel switch allocator in 65nm CMOS. In Proc. IEEE Conf. onComputer Design (ICCD), pages 63–70, October 2007. → pages 57, 95, 107[61] A. Kumar, L.-S. Peh, P. Kundu, and N. K. Jhay. Express virtual channels: Towards theideal interconnection fabric. In Proc. IEEE/ACM Symp. on Computer Architecture(ISCA), pages 150–161, San Diego, CA, June 2007. → pages 57, 107[62] A. Kumar, L.-S. Peh, and N. K. Jha. Token flow control. In Proc. IEEE/ACM Symp. onMicroarchitecture (MICRO), pages 342–353, Lake Como, Italy, 2008. ISBN978-1-4244-2836-6. → pages 107[63] P. Kumar, Y. Pan, J. Kim, G. Memik, and A. N. Choudhary. Exploring concentration andchannel slicing in on-chip network router. In Proc. IEEE/ACM Symp. onNetworks-on-Chip (NOCS), pages 276–285, 2009. → pages 99[64] A. Lashgar and A. Baniasadi. Performance in GPU architectures: Potentials anddistances. In 9th Annual Workshop on Duplicating, Deconstructing, and Debunking(WDDD 2011), 2011. → pages 112[65] A. Lashgar, A. Khonsari, and A. Baniasadi. HARP: Harnessing inactive threads inmany-core processors. ACM Transactions on Embedded Computing Systems (TECS) -Special issue on Design Challenges for Many-core Processors, 2014. → pages 81128[66] J. W. Lee, M. C. Ng, and K. Asanovic. Globally-synchronized frames for guaranteedquality-of-service in on-chip networks. In Proc. IEEE/ACM Symp. on ComputerArchitecture (ISCA), pages 89–100, 2008. → pages 82, 111[67] M. M. Lee, J. Kim, D. Abts, M. Marty, and J. W. Lee. Probabilistic distance-basedarbitration: Providing equality of service for many-core CMPs. In Proc. IEEE/ACMSymp. on Microarchitecture (MICRO), pages 509–519, 2010. → pages 82, 111[68] V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish,M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey. Debunkingthe 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU andGPU. In Proc. IEEE/ACM Symp. on Computer Architecture (ISCA), Saint-Malo, France,June 2010. → pages 19[69] G. Lemieux and D. Lewis. Design of Interconnection Networks for Programmable Logic.Kluwer Academic Publishers, Norwell, MA, USA, 2004. ISBN 1402077009. → pages112[70] A. Levinthal and T. Porter. Chap - a SIMD graphics processor. In Proc. Conf. onComputer Graphics and Interactive Techniques (SIGGRAPH), pages 77–82, 1984. ISBN0-89791-138-5. → pages 11[71] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A unifiedgraphics and computing architecture. IEEE Micro, 28(2):39–55, 2008. ISSN 0272-1732.→ pages 1, 9, 11, 20, 36[72] P. Lotfi-Kamran, B. Grot, and B. Falsafi. NOC-Out: Microarchitecting a Scale-OutProcessor. In Proc. IEEE/ACM Symp. on Microarchitecture (MICRO), 2012. → pages107[73] A. Mahesri, D. Johnson, N. Crago, and S. J. Patel. Tradeoffs in designing acceleratorarchitectures for visual computing. In Proc. 41st IEEE/ACM Int’l Symp. onMicroarchitecture, 2008. → pages 105[74] S. A. Manavski. CUDA compatible GPU as an efficient hardware accelerator for AEScryptography. In ICSPC 2007: Proc. of IEEE Int’l Conf. on Signal Processing andCommunication, pages 65–68, 2007. → pages 32[75] Marco Chiappetta. ATI Radeon HD 2900 XT - R600 Has Arrived.http://www.hothardware.com/printarticle.aspx?articleid=966. → pages 1[76] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen,K. E. Moore, M. D. Hill, and D. A. Wood. Multifacet’s general execution-drivenmultiprocessor simulator (gems) toolset. SIGARCH Computer Architecture News, 33(4):92–99, 2005. → pages 113129[77] Maxime. Ray tracing. http://www.nvidia.com/cuda. → pages 32[78] J. Meng, D. Tarjan, and K. Skadron. Dynamic warp subdivision for integrated branchand memory divergence tolerance. In Proceedings of the 37th Annual InternationalSymposium on Computer Architecture, ISCA ’10, pages 235–246, New York, NY, USA,2010. ACM. ISBN 978-1-4503-0053-7. doi:10.1145/1815961.1815992. URLhttp://doi.acm.org/10.1145/1815961.1815992. → pages 81[79] J. Michalakes and M. Vachharajani. GPU acceleration of numerical weather prediction.IPDPS 2008: IEEE Int’l Symp. on Parallel and Distributed Processing, pages 1–7, April2008. ISSN 1530-2075. doi:10.1109/IPDPS.2008.4536351. → pages 32[80] A. K. Mishra, N. Vijaykrishnan, and C. R. Das. A case for heterogeneous on-chipinterconnects for CMPs. In Proc. IEEE/ACM Symp. on Computer Architecture (ISCA),pages 389–400, 2011. ISBN 978-1-4503-0472-6. → pages 110[81] T. Moscibroda and O. Mutlu. A case for bufferless routing in on-chip networks. In Proc.IEEE/ACM Symp. on Computer Architecture (ISCA), pages 196–207, 2009. → pages 107[82] R. D. Mullins, A. West, and S. W. Moore. Low-latency virtual-channel routers foron-chip networks. In Proc. IEEE/ACM Symp. on Computer Architecture (ISCA), pages188–197, 2004. → pages 57, 107[83] M. Murphy. NVIDIA’s Experience with Open64. In 1st Annual Workshop on Open64,2008. → pages 29[84] T. Nesson and S. L. Johnsson. ROMM routing on mesh and torus networks. In Proc.ACM Symp. on Parallel Algorithms and Architectures (SPAA), pages 275–287, 1995. →pages 67, 108[85] J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming withCUDA. ACM Queue, 6(2):40–53, Mar.-Apr. 2008. → pages 1, 3, 7, 8, 18, 50[86] J. R. Nickolls, B. W. Coon, and M. C. Shebanow. US patent application 20110072213:Instructions for managing a parallel cache hierarchy (assignee NVIDIA corp.), March2011. → pages 52[87] NVIDIA. CUDA ZONE. http://www.nvidia.com/cuda, . → pages 18[88] NVIDIA. Geforce 8 series. http://www.nvidia.com/page/geforce8.html, . → pages 11[89] NVIDIA. NVIDIA’s next generation CUDA compute architecture: Fermi, September2009. → pages 2, 19, 50[90] NVIDIA CUDA Programming Guide. NVIDIA Corporation, 1.1 edition, 2007. → pages18, 21, 22, 26, 29130[91] GeForce GTX 285. NVIDIA Corporation, 15 January 2009. → pages 1[92] NVIDIA CUDA Programming Guide. NVIDIA Corporation, 3.0 edition, 2010. → pages1, 3, 8, 10, 50, 79[93] PTX: Parallel Thread Execution ISA. NVIDIA Corporation, 2.0 edition, 2010. → pages28[94] PTX: Parallel Thread Execution ISA. NVIDIA Corporation, 3.2 edition, 2013. URLhttp://docs.nvidia.com/cuda/pdf/ptx isa 3.2.pdf. → pages 9[95] CUDA compiler driver NVCC. NVIDIA Corporation, 2013. URLhttp://docs.nvidia.com/cuda/pdf/CUDA Compiler Driver NVCC.pdf. → pages 9[96] Open64. The open research compiler. http://www.open64.net/. → pages 29[97] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krger, A. E. Lefohn, and T. J.Purcell. A survey of general-purpose computation on graphics hardware. InEurographics 2005, State of the Art Reports, pages 21–51, aug 2005. → pages 8[98] Pcchen. N-Queens Solver. http://forums.nvidia.com/index.php?showtopic=76893. →pages 32[99] L.-S. Peh and W. J. Dally. A delay model and speculative architecture for pipelinedrouters. In Proc. IEEE Symp. on High-Perf. Computer Architecture (HPCA), pages255–266, 2001. → pages 57, 107[100] L.-S. Peh, N. Agarwal, N. Jha, and T. Krishna. Garnet: A detailed on-chip networkmodel inside a full-system simulator. In Proc. IEEE Symp. on Performance Analysis ofSystems and Software (ISPASS), April 2009. URLhttp://www.gigascale.org/pubs/1869.html. → pages 113[101] G. F. Pfister and V. A. Norton. Hot-Spot contention and combining in multistageinterconnection networks. IEEE Trans. on Computers, c-34(10):943–948, 1985. →pages 53, 111[102] D. Pham, S. Asano, M. Bolliger, M. D. , H. Hofstee, C. Johns, J. Kahle, A.Kameyama,J. Keaty, Y. Masubuchi, D. S. M. Riley, D. Stasiak, M. Suzuoki, M. Wang, J. Warnock,S. W. D. Wendel, T.Yamazaki, and K. Yazawa. The design and implementation of afirst-generation Cell processor. Digest of Technical Papers, IEEE Int’l Solid-StateCircuits Conference (ISSCC), pages 184–592 Vol. 1, 10-10 Feb. 2005. → pages 105[103] A. Pullini, F. Angiolini, S. Murali, D. Atienza, G. De Micheli, and L. Benini. BringingNoCs to 65 nm. Micro, IEEE, 27(5):75 –85, sept.-oct. 2007. → pages 51131[104] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens. Memory accessscheduling. In Proc. 27th Int’l Symp. on Computer Architecture, pages 128–138, 2000.→ pages 27, 33, 41, 79[105] S. Ryoo, C. Rodrigues, S. Stone, S. Baghsorkhi, S.-Z. Ueng, J. Stratton, and W. W. Hwu.Program optimization space pruning for a multithreaded GPU. In Proc. IEEE/ACMSymp. on Code Generation and Optimization (CGO), pages 195–204, April 2008. →pages 23, 25, 53[106] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W. W. Hwu.Optimization principles and application performance evaluation of a multithreaded GPUusing CUDA. In Proc. 13th ACM SIGPLAN Symp. on Principles and Practice of ParallelProgramming, pages 73–82, 2008. → pages 32, 112[107] P. Salihundam, S. Jain, T. Jacob, S. Kumar, V. Erraguntla, Y. Hoskote, S. Vangal,G. Ruhl, P. Kundu, and N. Borkar. A 2Tb/s 6*4 mesh network with DVFS and2.3Tb/s/W router in 45nm CMOS. In VLSI Circuits (VLSIC), 2010 IEEE Symposium on,pages 79 –80, june 2010. → pages 95, 102[108] M. Schatz, C. Trapnell, A. Delcher, and A. Varshney. High-throughput sequencealignment using Graphics Processing Units. BMC Bioinformatics, 8(1):474, 2007. ISSN1471-2105. doi:10.1186/1471-2105-8-474. URLhttp://www.biomedcentral.com/1471-2105/8/474. → pages 32[109] SDK. NVIDIA CUDA SDK code samples, 2009.http://developer.nvidia.com/object/cuda sdk samples.html. → pages 31, 32, 53[110] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake,J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan. Larrabee: amany-core x86 architecture for visual computing. ACM Trans. Graph., 27(3):1–15, 2008.ISSN 0730-0301. → pages 100[111] D. Seo, A. Ali, W.-T. Lim, N. Rafique, and M. Thottethodi. Near-optimal worst-casethroughput routing for two-dimensional mesh networks. In Proc. IEEE/ACM Symp. onComputer Architecture (ISCA), pages 432–443, 2005. → pages 68, 75, 108[112] J. W. Sheaffer, D. Luebke, and K. Skadron. A flexible simulation framework for graphicsarchitectures. In Proc. ACM SIGGRAPH/EUROGRAPHICS Conference on GraphicsHardware, pages 85–94, 2004. ISBN 3-905673-15-0.doi:http://doi.acm.org/10.1145/1058129.1058142. → pages 113[113] C. Sun, C.-H. O. Chen, G. Kurian, L. Wei, J. Miller, A. Agarwal, L.-S. Peh, andV. Stojanovic. Dsent - a tool connecting emerging photonics with electronics foropto-electronic networks-on-chip modeling. In Proc. IEEE/ACM Symp. onNetworks-on-Chip (NOCS), pages 201–210, 2012. → pages 94, 103132[114] OpenSPARCT M T2 Core Microarchitecture Specification. Sun Microsystems Inc., 2007.→ pages 106[115] J. Tuck, L. Ceze, and J. Torrellas. Scalable Cache Miss Handling for HighMemory-Level Parallelism. In Proc. 39th IEEE/ACM Int’l Symp. on Microarchitecture,pages 409–422, 2006. → pages 26[116] R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli. Multi2Sim: A simulation frameworkfor cpu-gpu computing. In Proc. IEEE/ACM Conf. on Parallel Architectures andCompilation Techniques (PACT), PACT ’12, pages 335–344, New York, NY, USA, 2012.ACM. ISBN 978-1-4503-1182-3. doi:10.1145/2370816.2370865. URLhttp://doi.acm.org/10.1145/2370816.2370865. → pages 113[117] L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103–111, 1990. → pages 1, 8[118] L. G. Valiant and G. J. Brebner. Universal schemes for parallel communication. In Proc.ACM Symp. on Theory of Computing (STOC), pages 263–277, 1981. → pages 66, 108[119] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh,T. Jacob, S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar, and S. Borkar. An80-Tile sub-100-W teraFLOPS processor in 65-nm CMOS. IEEE Journal of Solid-StateCircuits, 43(1):29–41, Jan. 2008. → pages 50, 95, 106[120] S. Volos, C. Seiculescu, B. Grot, N. K. Pour, B. Falsafi, and G. De Micheli. Ccnoc:Specializing on-chip interconnects for energy efficiency in cache-coherent servers. InProc. IEEE/ACM Symp. on Networks-on-Chip (NOCS), pages 67–74, 2012. → pages 110[121] T. C. Warburton. Mini Discontinuous Galerkin Solvers.http://www.caam.rice.edu/˜timwar/RMMC/MIDG.html. → pages 32[122] D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey, M. Mattina, C.-C.Miao, J. F. B. III, and A. Agarwal. On-chip interconnection architecture of the Tileprocessor. IEEE Micro, 27:15–31, 2007. ISSN 0272-1732. → pages 51, 106[123] S. Wilton. Architecture and Algorithms for Field-Programmable Gate Arrays withEmbedded Memory. PhD thesis, University of Toronto, 1997. → pages 111[124] H. Wong, A. Bracy, E. Schuchman, T. M. Aamodt, J. D. Collins, P. H. Wang, G. Chinya,A. K. Groen, H. Jiang, and H. Wang. Pangaea: a tightly-coupled ia32 heterogeneous chipmultiprocessor. In Proc. IEEE/ACM Conf. on Parallel Architectures and CompilationTechniques (PACT), pages 52–61, 2008. ISBN 978-1-60558-282-5. → pages 51, 106[125] H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos. DemystifyingGPU microarchitecture through microbenchmarking. In Proc. IEEE Symp. onPerformance Analysis of Systems and Software (ISPASS), pages 235–246, 2010. → pages80133[126] M. M.-S. Yu-Liang Wu. Orthogonal greedy coupling - a new optimization approach to2-d fpga routing. In Design Automation, 1995. DAC ’95. 32nd Conference on, pages568–573, 1995. doi:10.1109/DAC.1995.250011. → pages 111[127] G. L. Yuan, A. Bakhoda, and T. M. Aamodt. Complexity effective memory accessscheduling for many-core accelerator architectures. In Proc. IEEE/ACM Symp. onMicroarchitecture (MICRO), pages 34–44, Dec. 2009. → pages 3, 41, 42, 43[128] V. Zakharenko. Fusionsim simulator. http://www.fusionsim.ca/. → pages 30, 113134


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items