UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

GPU compute memory systems Yuan, George Lai 2009-11-27

You don't seem to have a PDF reader installed, try download the pdf

Item Metadata


24-ubc_2010_spring_yuan_george_lai.pdf [ 1.03MB ]
JSON: 24-1.0068207.json
JSON-LD: 24-1.0068207-ld.json
RDF/XML (Pretty): 24-1.0068207-rdf.xml
RDF/JSON: 24-1.0068207-rdf.json
Turtle: 24-1.0068207-turtle.txt
N-Triples: 24-1.0068207-rdf-ntriples.txt
Original Record: 24-1.0068207-source.json
Full Text

Full Text

GPU Compute Memory SystemsbyGeorge Lai YuanB.A.Sc., The University of British Columbia, 2006A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFMASTER OF APPLIED SCIENCEinThe Faculty of Graduate Studies(Electrical and Computer Engineering)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)November, 2009c George Lai Yuan 2009AbstractModern Graphic Process Units (GPUs) ofier orders of magnitude more raw computing powerthan contemporary CPUs by using many simpler in-order single-instruction, multiple-data (SIMD)cores optimized for multi-thread performance rather than single-thread performance. As such,GPUs operate much closer to the \Memory Wall", thus requiring much more careful memorymanagement.This thesis proposes changes to the memory system of our detailed GPU performance simulator,GPGPU-Sim, to allow proper simulation of general-purpose applications written using NVIDIA’sCompute Unifled Device Architecture (CUDA) framework. To test these changes, fourteen CUDAapplications with varying degrees of memory intensity were collected. With these changes, weshow that our simulator predicts performance of commodity GPU hardware with 86% correlation.Furthermore, we show that increasing chip resources to allow more threads to run concurrently doesnot necessarily increase performance due to increased contention for the shared memory system.Moreover, this thesis proposes a hybrid analytical DRAM performance model that uses memoryaddress traces to predict the e–ciency of a DRAM system when using a conventional First-ReadyFirst-Come First-Serve (FR-FCFS) memory scheduling policy. To stress the proposed model, amassively multithreaded architecture based upon contemporary high-end GPUs is simulated togenerate the memory address trace needed. The results show that the hybrid analytical modelpredicts DRAM e–ciency to within 11.2% absolute error when arithmetically averaged across amemory-intensive subset of the CUDA applications introduced in the flrst part of this thesis.Finally, this thesis proposes a complexity-efiective solution to memory scheduling that recoversmost of the performance loss incurred by a naive in-order flrst-in flrst-out (FIFO) DRAM schedulercompared to an aggressive out-of-order FR-FCFS scheduler. While FR-FCFS scheduling re-ordersmemory requests to improve row access locality, we instead employ an interconnection network ar-bitration scheme that preserves the inherently high row access locality of memory request streamsfrom individual \shader cores" and, in doing so, achieve DRAM e–ciency and system performanceclose to that of FR-FCFS with a simpler design. We evaluate our interconnection network arbitra-tion scheme using crossbar, ring, and mesh networks and show that, when coupled with a bankedFIFO in-order scheduler, it obtains up to 91.0% of the performance obtainable with an out-of-ordermemory scheduler with eight-entry DRAM controller queues.iiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viGlossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.3 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.3.1 SIMD GPU architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.3.2 Modern DRAM memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.3.3 Interconnection networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111.4 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 Supporting CUDA Applications on GPGPU-Sim . . . . . . . . . . . . . . . . . . 152.1 CUDA memories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152.1.1 Local memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.1.2 Texture memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.1.3 Constant memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192.2 CUDA benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 Hybrid Analytical Performance Modeling of DRAM . . . . . . . . . . . . . . . . 253.1 DRAM e–ciency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253.2 Hybrid analytical model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263.3 Processing address traces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30iiiTable of Contents4 Complexity-Efiective Memory Access Scheduling . . . . . . . . . . . . . . . . . . 364.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364.2 Efiect of interconnect on row locality . . . . . . . . . . . . . . . . . . . . . . . . . . 404.3 Grant holding interconnect arbitration . . . . . . . . . . . . . . . . . . . . . . . . . 414.4 Interleaving due to multiple virtual channels . . . . . . . . . . . . . . . . . . . . . . 444.5 Complexity comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495.1 Supporting CUDA applications on GPGPU-Sim . . . . . . . . . . . . . . . . . . . . 495.2 Hybrid analytical modeling of DRAM . . . . . . . . . . . . . . . . . . . . . . . . . . 535.3 Complexity-efiective memory access scheduling . . . . . . . . . . . . . . . . . . . . . 556 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576.1 GPGPU-Sim simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576.1.1 Baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576.1.2 DRAM utilization and e–ciency . . . . . . . . . . . . . . . . . . . . . . . . . 596.1.3 Are more threads better? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 616.2 Hybrid analytical modeling of DRAM . . . . . . . . . . . . . . . . . . . . . . . . . . 646.2.1 DRAM e–ciency prediction accuracy . . . . . . . . . . . . . . . . . . . . . . 646.2.2 Modeling error analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 676.2.3 Sensitivity analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 686.3 Complexity-efiective memory access scheduling . . . . . . . . . . . . . . . . . . . . . 726.3.1 Complexity-efiective DRAM scheduling performance . . . . . . . . . . . . . . 726.3.2 Row streak breakers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 746.3.3 Static virtual channel assignment . . . . . . . . . . . . . . . . . . . . . . . . 766.3.4 Sensitivity analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 767 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 797.1 GPU architecture simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 797.2 Analytical models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 807.3 DRAM controller designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 818 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 838.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 838.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 858.2.1 Improving the analytical DRAM model . . . . . . . . . . . . . . . . . . . . . 858.2.2 Combining the analytical DRAM model with analytical processor models . . 858.2.3 Design space exploration of intelligent interconnects . . . . . . . . . . . . . . 86Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87ivList of Tables1.1 DRAM parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.1 Benchmark thread organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202.2 Benchmark Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214.1 Calculated complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475.1 Microarchitectural parameters for testing GPGPU-Sim . . . . . . . . . . . . . . . . . 505.2 Example showing difierence between IPC versus IPC weighted by cycle . . . . . . . 535.3 Microarchitectural parameters for analytical DRAM model . . . . . . . . . . . . . . 535.4 Benchmarks for analytical DRAM model . . . . . . . . . . . . . . . . . . . . . . . . . 545.5 Microarchitectural parameters for complexity-efiective memory access scheduling . . 555.6 Benchmarks for complexity-efiective memory access scheduling . . . . . . . . . . . . 55vList of Figures1.1 Average memory latency measured in performance simulation . . . . . . . . . . . . . 21.2 Memory system organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81.3 Illustration of timing constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91.4 Typical interconnect router architecture . . . . . . . . . . . . . . . . . . . . . . . . . 111.5 Block diagram of a Parallel Iterative Matching separable allocator . . . . . . . . . . 132.1 Simulated pipeline architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163.1 Timing diagram example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273.2 Sliding window proflling technique example . . . . . . . . . . . . . . . . . . . . . . . 343.3 Sliding window proflling algorithm for processing memory request address traces . . 354.1 Measured DRAM e–ciency for random uniform memory tra–c with varying rowaccess locality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374.2 Conceptual example showing efiect of request interleaving due to interconnect . . . . 394.3 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404.4 Measured row access locality before the interconnect and after . . . . . . . . . . . . 414.5 Grant-holding interconnect router architecture . . . . . . . . . . . . . . . . . . . . . . . . 435.1 Measured DRAM e–ciency and utilization . . . . . . . . . . . . . . . . . . . . . . . . 546.1 Instruction Classiflcation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 586.2 Memory Instructions Breakdown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 586.3 Baseline performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596.4 Performance comparison with 8600GTS . . . . . . . . . . . . . . . . . . . . . . . . . 596.5 Impact of DRAM memory controller optimizations . . . . . . . . . . . . . . . . . . . 606.6 DRAM E–ciency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606.7 DRAM Utilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606.8 Efiects of varying number of CTAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 626.9 Comparison of measured DRAM e–ciency to predicted DRAM e–ciency . . . . . . 646.10 Arithmetic mean absolute error of predicted DRAM e–ciency . . . . . . . . . . . . . 656.11 Comparison of measured DRAM e–ciency to predicted DRAM e–ciency . . . . . . 66viList of Figures6.12 Average row access locality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 686.13 Average DRAM controller queue occupancy . . . . . . . . . . . . . . . . . . . . . . . 696.14 Arithmetic mean absolute error versus DRAM controller queue size . . . . . . . . . . 696.15 Arithmetic mean absolute error versus number of DRAM chips per DRAM controller 706.16 Error of predicted DRAM e–ciency of \Most Pending" memory scheduling algorithm 716.17 Baseline conflguration results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 726.18 DRAM e–ciency and memory latency . . . . . . . . . . . . . . . . . . . . . . . . . . 736.19 Row streak breakers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 756.20 Harmonic mean IPC for difierent virtual channel conflgurations . . . . . . . . . . . . 766.21 Harmonic mean IPC across all applications for difierent network topologies . . . . . 776.22 Harmonic mean IPC normalized to FR-FCFS for difierent DRAM controller queuesizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 787.1 Average number of unique banks requested per warp memory operation . . . . . . . 81viiGlossaryCTA Cooperative Thread Array, a collection ofthreads that run on a single shader core.Threads in a CTA share a common fastmemory and can perform barrier synchroniza-tion [5], 86CUDA Compute Unifled Device Architecture, aframework for running non-graphic parallelapplications on GPUs, 86DRAM e–ciency Percentage of time that the DRAM chip istransferring data across its bus over only theperiod of time when the DRAM is active, 86DRAM utilization Percentage of time that a DRAM chip istransferring data across its bus over an ar-bitrary length of time, 86FIFO First-In First-Out, 86viiiGlossaryFR-FCFS First-Ready First-Come First-Serve, an out-of-order memory scheduling algorithm thatreorders requests to maximize row access lo-cality, 86GPU Graphic Processing Unit, 86HG Hold Grant, an interconnect allocation policythat prioritizes the allocation decisions madein the previous cycle, 86HM Harmonic Mean, 86HMHG Hash-Matching Hold Grant, an interconnectallocation policy that prioritizes the alloca-tion decisions made in the previous cycle onlyif the hash of the destination row of the pend-ing request matches the hash of the destina-tion row of the previously granted request forthe same input-output combination, 86PIM Parallel Iterative Matching, a separable allo-cation policy based on random prioritizationof inputs and outputs, 86ixGlossaryRMHG Row-Matching Hold Grant, an interconnectallocation policy that prioritizes the alloca-tion decisions made in the previous cycle onlyif the destination row of the pending requestmatches the destination row of the previouslygranted request for the same input-outputcombination, 86Row access locality Average number of back-to-back memory re-quests sent to the same row of the same bankof the same DRAM chip, 86SIMD Single-Instruction Multiple-Data, 86Warp A group of 32 threads that run in lock-step, atany time either running the same instructionor no instruction at all, 86xAcknowledgementsI am very grateful to my supervisor, Professor Tor Aamodt, for his guidance during the last twoyears. Tor introduced me to the rich fleld of computer architecture and demonstrated flrst-hand thelevel of dedication and hard work required to be successful in this fleld. This thesis would not havebeen possible without his help and support. I would also like to thank Professor Guy Lemieux andProfessor Konrad Walus for being the members of my thesis committee and providing insightfulfeedback.I would like to thank my colleagues and friends Ali Bakhoda, Xi Chen, Wilson Fung, JohnnyKuan, Jason Kuo, Ivan Sham, Andrew Turner, and Henry Wong as well as the students in theNetworked Systems Laboratory for the interesting discussions wehad and for the valuable commentsthey provided for the work in this thesis. They made my experience at UBC unforgettable and Iam very lucky to have been able to meet them.I am grateful to my parents for their unconditional love and support throughout my life. Iwould not be where I am today without the sacriflces that they have made. They have done awonderful job raising me and for that I am forever indebted to them. And above all, I wish tosincerely thank my girlfriend Hyunjung for her unyielding support, bearing with me through thetough times. I dedicate this work to them.Finally I would like to express my thanks to Natural Sciences and Engineering Research Council(NSERC) of Canada that provided flnancial support for this work through a CGS-M award.The work described in this thesis is based upon the following three published bodies of work.xiAcknowledgementsThe flrst is the paper titled \Analyzing CUDA Workloads Using a Detailed GPU Simulator",which appeared in the 2009 Annual IEEE International Symposium on Performance Analysis ofSystems and Software (ISPASS 2009) and was co-authored by Ali Bakhoda, George Yuan, WilsonFung, Henry Wong, and Dr. Tor Aamodt. The study was designed collaboratively by Ali Bakhoda,George Yuan, Wilson Fung, Henry Wong, and Dr. Tor Aamodt. George Yuan collected the bench-marks used in this study and modifled the simulator to provide the necessary support required bythe benchmarks. Ali Bakhoda integrated the detailed interconnection network simulator necessaryto perform the network topology study in this study. Wilson Fung, Ali Bakhoda, and Dr. TorAamodt were also responsible for developing much of the simulator’s baseline infrastructure.The second is the paper titled \A Hybrid Analytical DRAM Performance Model", which ap-peared in the Fifth Annual Workshop on Modeling, Benchmarking and Simulation (MoBS 2009)held in conjunction with the 36th Annual IEEE/ACM International Symposium on Computer Ar-chitecture and was co-authored by George Yuan and Dr. Tor Aamodt. The study was designedcollaboratively by George Yuan and Dr. Tor Aamodt. George Yuan conducted the research, ana-lyzed the data, and drafted the manuscript under the guidance of Dr. Tor Aamodt.The third is the paper titled \Complexity Efiective Memory Access Scheduling", which has beenaccepted to the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO2009) and was co-authored by George Yuan, Ali Bakhoda and Dr. Tor Aamodt. The study wasdesigned collaboratively by George Yuan and Ali Bakhoda and Dr. Tor Aamodt. George Yuanconducted the research, analyzed the data, and drafted the manuscript under the guidance of Dr.Tor Aamodt.xiiChapter 1IntroductionIn the past, GPUs, commonly known as video cards, have primarily been used for video games,speciflcally for accelerating 3D graphics rendering. Since then, GPUs have evolved into massivelyparallel computing machines that now have theoretical throughputs tens to hundred times fasterthan any general computer processor (CPU) available in the market. This is due to the graphics pro-cessors’ inherently high amounts of compute power speciflcally designed for data-parallel execution.Furthermore, recent GPUs are programmable through languages that are very similar to traditionallanguages, such as the \C" programming language, taught in most universities. Researchers havealready demonstrated that the compute power of GPUs can be harnessed for general-purpose ap-plications and thereby provide a far more cost-efiective solution compared to using CPUs. Yetat the same time, there are obstacles that prevent other applications from running e–ciently onGPUs. Intensive control  ow, dynamic program behavior, concurrency, and unpredictable memoryaccess patterns all can limit the thread-level parallelism available and impose bottlenecks, therebyreducing performance.This thesis focuses on the memory subsystem of GPUs running general purpose applications.First, it describes the changes to the memory system modeled by the performance simulator thatwe use to re ect what occurs in real GPU hardware. Next, it proposes a hybrid analytical modelto predict performance of modern DRAM when used in conjunction with modern memory accessscheduling policies. Finally, it proposes a complexity efiective solution for memory access scheduling11.1. Motivation0400800120016002000BS FWT DG LPS LIB MMC MUM NEU NN RAY RED SP SLA TRA WPMemory Access LatencyFigure 1.1: Average memory latency measured in performance simulation in cyclesthat relies on intelligent arbitration at the interconnection network that connects the GPU processorcores to the memory controllers. The rest of this chapter describes the motivation for this thesis,lists its contributions, explains some fundamental concepts required to understand the thesis, andsummarizes the organization of the thesis.1.1 MotivationAs the gap between memory speed and microprocessor speed increases, e–cient dynamic randomaccess memory (DRAM) system design is becoming increasingly important as it becomes an in-creasingly limiting bottleneck. This is especially true in General Purpose Applications for GraphicsProcess Units (GPGPU) where applications do not necessarily make use of caches (which are bothlimited in size and capabilities 1 relative to the processing throughput of GPUs).Modern memory systems can no longer be treated as having a flxed long latency. Figure 1.1shows the measured latency for difierent applications when run on our performance simulator,GPGPU-Sim [5], averaged across all memory requests. (We show later in Section 6.1.1 thatGPGPU-Sim models contemporary GPUs fairly accurately, achieving 86% correlation to real hard-1As of NVIDIA’s CUDA Programming Framework version 2, all caches are read-only [39].21.2. Contributionsware.) As shown, average memory latency can vary greatly across difierent applications, dependingon their memory access behavior. In the massively multithreaded GPU Compute architecture thatwe simulate, threads are scheduled to Single-Instruction Multiple-Data pipelines in groups of 32called warps, each thread of which is capable of generating a memory request to any location inGPU main memory. (This can be done as long as the limit on the number of in- ight memoryrequests per GPU shader core has not been reached. In CPUs, this is determined by the number ofMiss Status Holding Registers (MSHR) [25].) In this microarchitecture, a warp cannot proceed aslong as any single thread in it is still blocked by a memory access, making DRAM an even biggerbottleneck. As such, the memory system of GPU architectures must be properly modeled to provideuseful insight for microprocessor and memory designers alike. Two such modeling methods thatdesigners use are analytical modeling and detailed performance simulation. Analytical modelingrelies on developing mathematical expressions to relate performance metrics to microarchitectureand workload parameters. Detailed performance simulation requires cycle-by-cycle simulation ofthe behavior of a proposed hardware design. While performance simulation is more accurate, ana-lytical modeling can be orders of magnitude faster in obtaining meaningful results. In this work, weexplore the GPU memory system using both hybrid analytical modeling2 and detailed performancesimulation.1.2 ContributionsThis thesis makes the following contributions:1. It presents data characterizing the performance of a set of existing CUDA applications col-lected on a research GPU simulator (GPGPU-Sim).2The term \hybrid" was flrst used for modeling by Shanthikumar and Sargent [50] to describe mathematicalmodels that used both simulation and analytic techniques.31.2. Contributions2. It shows that, for certain applications, decreasing the number of threads running concurrentlyon the hardware can improve performance by reducing contention for on-chip resources.3. It provides an analysis of application characteristics including the dynamic instruction mix,SIMD warp branch divergence properties, and DRAM locality characteristics.4. It presents a novel DRAM hybrid analytical model to model the overlapping of DRAM timingconstraint delays of one bank by servicing requests to other banks, given an aggressive First-Ready First-Come First-Serve [43] (FR-FCFS) memory scheduling policy. This policy re-orders a memory request stream to increase DRAM e–ciency by increasing its row accesslocality, the number of memory requests per DRAM row access.5. It presents a method to use this hybrid analytical model to predict DRAM e–ciency over theruntime of an application by proflling a memory request address trace.6. It presents two heuristics that, when used with the previously mentioned proflling method,provide bounds on the available bank-level parallelism achievable by the DRAM. Moreover,it presents a simple method of averaging the predictions of these two heuristics to reduce thearithmetic mean of the absolute error across all benchmarks to 11.2%.37. It shows that GPU architectures do not beneflt from previously proposed schedulers thatemphasize fairness due to the difierent challenges that they present, having at least an orderof magnitude more threads compared to CPU multiprocessors.8. It shows that the row access locality inherent in the memory request stream from individualshader cores can be destroyed by the interconnection networks typically used for connectingshader cores to DRAM controllers. We introduce a novel interconnect arbitration scheme that3In this thesis, we use arithmetic mean of the absolute value of error to validate the accuracy of our analyticalmodel, which was argued by Chen et al. [11] to be the correct measure since it always reports the largest errornumbers and is thus conservative in not understating said errors.41.3. Backgroundpreserves this inherent row access locality, thus allowing for a much simpler DRAM controllerdesign.9. It presents a qualitative and quantitative analysis of the performance of our interconnectarbitration scheme and its various heuristics for both mesh and crossbar networks.10. It also presents a simple solution to deal with the interleaving that can occur due to the inter-connection network router having multiple virtual channels, which can improve interconnectthroughput [12].1.3 BackgroundThis section provides an overview of the fundamental concepts of contemporary SIMD GPU archi-tecture, modern DRAM memory, and interconnection networks for this thesis.1.3.1 SIMD GPU architectureGraphic applications that run on GPUs have plentiful parallelism due to independence of calculatingeach pixel’s color. Since there are millions of pixels for high-resolution displays, there is signiflcantparallelism. NVIDIA’s latest GPU architectures exploit this fact by using multiple multi-processorcores, which we refer to as shader cores, each supporting up to 1024 threads. The shader cores op-erate in Single-Instruction Multiple-Data (SIMD) fashion with eight parallel processors per shadercore where the same pipeline stage for each processor in a shader core will always be executing thesame instruction on its own separate set of data. By issuing the same instruction to all processors,the overhead for instruction scheduling can be minimized, increasing GPU processing capabilitiesat the expense of requiring a minimum amount of SIMD thread-level parallelism to fully use it.With the introduction of Compute Unifled Device Architecture (CUDA), NVIDIA allows pro-51.3. Backgroundgrammers to exploit GPUs to run non-graphic, general-purpose applications as well. CUDA allowsprogrammers to write and run multi-threaded C functions, or kernels, on GPUs. Each kernel iscomprised of an organization of threads. These threads are issued to shader cores in groups calledcooperative thread arrays (CTAs), where each kernel has one or more CTAs. Within a CTA, threadsare grouped into batches of 32 called warps which always execute the same instruction at any time.With an eight-wide shader core, each warp is issued in groups of eight to the pipelines over fourback-to-back cycles. The many supported threads per shader core allows for \barrel processing",where the stalls of one warp due to a long-latency memory load can be hidden by processing otherwarps in the meantime.1.3.2 Modern DRAM memoryBefore explaining our hybrid analytical DRAM performance model, it is necessary to be familiarwith the background of modern DRAM and the mechanics of the memory that we are modeling,GDDR3 [47], including the difierent timing constraints and various other terminology. A summaryis provided in Table 1.1.GDDR3-SDRAM is a modern graphics-speciflc DRAM architecture that improves upon olderDRAM technology used in graphics processors, like DDR-SDRAM [38], by increasing clock speedsand lowering power requirements. Like DDR-SDRAM, it uses a double data rate (DDR) interface,meaning that it transfers data across its pins on both the positive and negative edge of the busclock. More speciflc to the particular GDDR3 technology that we study, a 4n-prefetch architectureis used, meaning that a single read or write access consists of 4 bits transferred across each datapin over 2 cycles, or 4 half-cycles. (This is more commonly known as the burst length, BL.) For a32-bit wide interface, this translates to a 16 byte transfer per access for a single GDDR3 chip.Similar to other modern DRAM technology, GDDR3-SDRAM is composed of a number ofmemory banks sharing a common data bus in order to exploit bank-level parallelism in the memory61.3. BackgroundTable 1.1: DRAM parametersName Description ValueNumber of Banks 4Bus Width (in Bytes) 4BL Burst Length (in Bytes) 16tCCD Delay between successive reads 2or successive writes (in cycles)tRRD Minimum time between successive ACT 8commands to difierent banks (in cycles)tRAS Minimum time between opening (ACT) 21and closing (PRE) a row (in cycles)tRCD Minimum time between issuing ACT 12and issuing a Read or Write (in cycles)tRC Minimum time between successive ACTS to 34difierent rows in same bank (in cycles)tWTR Write to Read command delay (in cycles) 5tRP Minimum time between closing a row 13and opening a new row (in cycles)CL Column Address Strobe Latency (in cycles) 9request stream. We deflne bank-level parallelism as the spreading of memory requests across mul-tiple DRAM banks so that a ready DRAM bank can immediately transfer data across the data buswhen another bank is unready. A DRAM bank may be unready if it is in the process of switchingrows or refreshing its data cells. The bank-level parallelism of a memory request stream is crucialto performance because it maximizes usage of the data bus pins connecting the GPU to the DRAMchips, a limited and thus precious resource.Each bank is composed of a 2D array that is addressed with a row address and a column address,both of which share the same address pins to reduce the pin count. In a typical memory access, arow address is flrst provided to a bank using an activate command that activates, or \opens", allof the memory cells in the row, connecting them using long wires to the sense ampliflers, which aresubsequently connected to the data pins. A key property of the sense ampliflers is that, once theydetect the values of the memory cells in a row, they hold on to the values so that subsequent accessesto the row do not force the sense ampliflers to re-read the row. Intelligent memory controllersexploit this property, scheduling requests out of order and grouping together requests to the same71.3. BackgroundProcessor Core 0Interconnection NetworkMemory Controller 0Processor Core 1Memory Controller 1Bank 3Bank 2Sense Amplifiers& Bus DriversBank 1Bank 0Address and Control LogicDRAM Chip 1Bank 3Bank 2Sense Amplifiers& Bus DriversBank 1Bank 0Address andControl LogicDRAM Chip 0 Bank 3Bank 2Sense Amplifiers& Bus DriversBank 1Bank 0Address and Control LogicDRAM Chip 1ON-CHIPOFF-CHIPBank 3Bank 2Bank 1Bank 0DRAM Chip 0Sense Amplifiers& Bus DriversAddress andControl LogicFigure 1.2: Memory system organizationrow to improve the row access locality of the memory requests. Doing so consequently improvesthe e–ciency of the memory system by reducing the number of times rows are opened and closed4. The organization of such a memory system is shown in Figure 1.2.Typically, the issuing of commands to DRAM is subject to various timing constraints, as out-lined in Table ??. For clariflcation, these timing constraints are illustrated in Figure 1.3.The process of \opening the row," which must be done before a column can be requested, takesa signiflcant amount of time called the row address to column address delay, or tRCD [47]. After4\Row access locality" is temporal in the sense that concurrent accesses to the same row are fast. It is spatial inthe sense that accessing any word in DRAM requires activating the whole row, allowing subsequent accesses to otherdata in the row to be fast.81.3. BackgroundBank 0 Precharge Row A Activate Row B Pre...RB RBRARA Precharge Row B Act..tRP tRCDtRCBank 1 Activate Row XtRRDRXRX RX RX RX RXtRASDataBus RARACASRXRX RX RX RB RB RX RXtWTRWX WXWX WXtCCDR = Read              W = WriteCommandBusP0A0P0CLOCKR1 Destination RowCommand TypeP0Command TypeDestination Bank P = Precharge      A = ActivateA0A1 1W1WR1R1R1R1R0R0R0R0R1R1Figure 1.3: Illustration of timing constraintsthis delay, the column address can then be provided to the bank. The process of reading thecolumn address and choosing the data in the speciflc column to be output across the data pins isnot instantaneous, instead taking an amount of time called the column access strobe latency, orCL, from when the column address is provided to when the flrst bits of data appear on the datapins. However, since all of the memory cells in the opened row are connected to sense ampliflers,multiple column addresses can be provided back-to-back, limited only by the column to columndelay, tCCD, and the operations will be pipelined. Once the initial CL cost has been paid, a bankcan provide a continuous stream of data as long as all accesses are to the opened row.If we want to access data in one row and another row is opened in the same bank, we must closethe opened row by disconnecting it from the long wires connecting it to the sense ampliflers. Beforethese long wires, which represent a signiflcant capacitance, can be connected to the new row, theymust flrst be precharged in order for the sense ampliflers to work properly. This process is deflnedas the row precharge delay, or tRP [13]. To summarize, in order to service a request to a new rowwhen a difierent row has already been opened, we must flrst close the row (by issuing a precharge91.3. Backgroundcommand) and open the new row (by issuing an activate). This process takes an amount of timetRP + tRCD.Furthermore, in a single bank, there is a minimum time required between issuing an activatecommand and then issuing a precharge (in other words, between opening and closing the row).This minimum time is deflned as the row access strobe latency, tRAS, and is greater than tRCD.This is because tRAS also encapsulates the process of restoring the data from the sense ampliflersto the corresponding row cells, or \refreshing." (Unlike static RAM, which retains the values in itsmemory cells as long as it is provided power, dynamic RAM must be periodically refreshed sincethe bit values are stored in capacitors which leak charge over time [13].) Opening a new row tooquickly may not allow adequate time to perform the refresh [53]. Accordingly, the minimum timerequired between issuing two successive activate commands in a single bank is tRAS + tRP, whichis deflned as the row cycle time, tRC.In addition to these timing constraints, there are additional constraints on the minimum timeneeded between successive activate commands to difierent banks, (tRRD), and the minimum timeneeded between a write and a read command or a read and a write command (tWTR and tRTW)since the data bus is bi-directional and requires time to switch modes.In the speciflc GDDR3 technology that we study, each row must be refreshed every 32ms,taking 8192 DRAM cycles each time. With an 800MHz DRAM clock and 4096 rows per bank, thisamounts to each bank having to perform refresh 4.2% of the time on average, assuming DRAM isidle and all rows contain data [47]. As mentioned previously, the process of accessing a row alsorefreshes the data in the row, meaning only rows of unaccessed data needs to be actively refreshed,possibly reducing the refreshing overhead from 4.2%. Due to this relatively small overhead, wemodel the efiects of refresh on performance in neither our performance simulator nor our analyticalmodel.101.3. BackgroundRouting Computation Virtual Channel AllocationSwitch AllocationVirtual Channel(s)Virtual Channel(s)......Input 0Input mCrossbar Switch...Output 0Output n(a) Router architectureRouting ComputationVirtual Channel AllocationSwitch AllocationSwitch TraversalPer-Packet Per-Flit(b) Router pipeline stagesFigure 1.4: Typical interconnect router architecture1.3.3 Interconnection networksOut-of-order memory scheduling requires complex fully-associative comparison logic. In this the-sis, we also explore simplifying the memory controller design by incorporating intelligence at thelevel above the memory controller: the interconnection network. In a multiprocessor architecture,interconnection networks are typically used to transfer memory writes and memory read requestsfrom issuing processors to receiving memory controllers, returning memory reads from memorycontrollers back to the processors, as well as processor-to-processor communication. The networkcan be comprised of either a single crossbar router that connects all inputs to all outputs or severalrouters connected in a variety of conflgurations.In this thesis, we evaluate the efiect of crossbar, mesh, and ring networks on DRAM. No matterthe conflguration, all routers are required to perform the common task of sending packets (requests)from a set of input ports to a set of output ports. This task is called allocation. Since the network111.4. Organizationhas flnite bandwidth, packets that cannot be transferred in one cycle are split up into  its andsent over multiple cycles. Figure ?? shows a typical router architecture along with the steps forperforming pipelined routing of a packet [12]. Routing computation, the process of determiningwhich output the packet should be directed to, and virtual channel allocation, the process ofreserving bufier space at the output port, is performed on a per-packet level. Switch allocation,the process of reserving a time slot for transferring a  it, and switch traversal, the actual transferof the  it across the crossbar switch, is performed on a per- it level.There are many difierent algorithms for performing allocation. In this thesis, we use paralleliterative matching (PIM), a simple algorithm easily realizable in hardware [4], which is importantin high-speed interconnects. In PIM allocation, a request matrix of input rows by output columnsis flrst generated to determine which inputs require which outputs. In the context of a crossbar,the number of inputs is the number of shader cores C and the number of outputs is the numberof DRAM controllers M. PIM is a type of separable allocator, where arbitration is done in twostages, which is advantageous where speed is important [12]. In our study, we perform input-flrstallocation. Figure 1.5 shows a simple example of an input-flrst allocator with three inputs and twooutputs. First, the input arbiters select from a single request per input port. In other words, eachinput arbiter chooses to assert up to one output, depending on whether there are requests from thecorresponding input to the corresponding output. The output of this flrst stage of input arbitersis then fed to the output arbiters, which choose from the difierent requests to each output. Thisguarantees that a single input port will not be sending to multiple output ports and a single outputport will not be receiving from multiple input ports in the same cycle.1.4 OrganizationThe rest of the thesis is organized as follows:121.4. Organization>=10000Inp utArb>=1000000Out putArbr00r01r10r11r20r21g10g00g20g11g01g21>=10000Inp utArb>=10000Inp utArb>=1000000Out putArbFigure 1.5: Block diagram of a Parallel Iterative Matching (PIM) separable allocator for 3inputs and 2 outputs. Each input arbiter (Input Arb) chooses from one of itsinput request signals rIO to assert, then each output arbiter (Output Arb) assertsone grant signal gIO. (Where I is the input and O is the output).† Chapter 2 describes the necessary changes to the memory system of GPGPU-Sim to sup-port applications written using NVIDIA’s CUDA framework. It then introduces the CUDAapplications that were collected to perform the experiments done in this work.† Chapter 3 proposes techniques to predict the e–ciency of a DRAM system when employinga hybrid analytical model that applies a memory request address trace analysis.† Chapter 4 presents two novel metrics to quantify the temporal locality of an application andproposes a novel cache contention model, based upon the above two metrics, to accuratelypredict the number of extra data cache misses due to cache contention when the number ofthreads sharing a cache approaches and/or exceeds the cache’s associativity.† Chapter 5 describes our simulation methodology, highlighting the applications used in ourstudy and how the memory system in our simulation infrastructure was augmented to support131.4. Organizationapplications coded using NVIDIA’s CUDA programming framework.† Chapter 6 presents and analyzes our experimental results.† Chapter 7 reviews related work.† Chapter 8 concludes this thesis and suggests future work.Although GPUs can achieve throughput (in terms of FLOPs) orders of magnitude higher thanCPUs, they require applications with, among other things, high arithmetic intensity to do so.General purpose applications with irregular memory access behavior perform poorly. This work isa step towards research on how GPUs can be used for a broader class of applications.14Chapter 2Supporting CUDA Applications onGPGPU-SimThe version of GPGPU-Sim used by Wilson et al. [16] used the portable instruction set archi-tecture (PISA). For benchmarks, applications from various benchmark suites including NVIDIA’sCUDA Software Development Kit [37], SPLASH2 [55], and SPEC CPU2006 [52] were manuallymodifled from CUDA format to a format recognizable by the simulator. Due to the engineeringefiort required in this process, it was decided for the work of Bakhoda et al. [5] that a massivelymultithreaded performance simulator that could support native CUDA applications without anysource code modiflcations would be the better approach. To this end, GPGPU-Sim was modifledto support CUDA’s parallel thread execution (PTX) instruction set, allowing it to simulate CUDAapplications without requiring any changes to the original CUDA source code. We flrst describethe microarchitectural changes made to the performance simulator to properly model CUDA. Wethen describe the CUDA applications that were collected and used as benchmarks.2.1 CUDA memoriesThe CUDA programming framework supports flve distinct logical memory spaces, each with its ownunique properties. These are the global memory, local memory, shared memory, texture memory,and constant memory. The set of CUDA applications that we wished to support used all of thesememory spaces.Figure 2.1 shows the logical layout of the shader core pipeline that we simulate with the memory152.1. CUDA memoriesShared MemoryFetchDecodeExecuteMemory1Local ConstantTextureShared GlobalTexture Cache Constant CacheGlobal MemoryData flowInstruction FlowPhysical MemoryPipeline Stage New Additions to Simulated ArchitectureWritebackFigure 2.1: Simulated pipeline architecturestage expanded and the newly added components highlighted with purple boxes in dashed outlines.Instructions that pass through the memory stage of the pipeline are checked flrst whether they arememory instructions. If so, then they get routed to the appropriate substage within the memorystage. This thesis describes the simulator modiflcations necessary to support local memory, texturememory, and constant memory.162.1. CUDA memories2.1.1 Local memoryThe use of local memory accesses is determined automatically by the compiler. They are commonlyused for objects large enough that the compiler deems them unsuitable for storage in registers, suchas large structures or arrays. Within GPGPU-Sim, we model local memory accesses in accordanceto the CUDA programming guide [39] in that they are not cached, like global memory accesses,and are thus expensive in terms of the latency that they incur. However, since local memory isdeflned on a per-thread basis, we can make sure that their accesses are always coalesced by storingeach local memory variable of threads in a half-warp in a contiguous block in memory. Consideran example where each thread declares a struct with four int variables in local memory. Instead ofstoring the variables of the struct of each thread in a contiguous space, we instead interleave thevariables of each thread to ensure memory coalescing.2.1.2 Texture memoryTexture memory accesses are backed by an L1 read-only texture cache. The use of texture memorymust be specifled explicitly by the programmer by calling functions included in the CUDA runtimeAPI library. In our simulator, we leverage the CUDA runtime API library header flles while replac-ing their library function implementations with our own which invoke simulation setup functions(such as memory allocation/deallocation and memory copy) and simulation sessions. Proper setupof textures for use in GPGPU-Sim requires the following steps: flrst, the memory needed by thetexture must be allocated within the simulated memory space (e.g. GPU main memory). Second,the memory must be \bound" to a texture reference, which is an object declared by the program-mer. Apart from having a pointer to the allocated memory, the reference also stores the texturename as well as other texture-speciflc properties. This is because texture memory load instructionsdifier from load instructions of other memory spaces in PTX in that it uses a unique addressingmode that relies on the name of the texture and the co-ordinates within the texture. Finally, the172.1. CUDA memoriesdata must then be copied from CPU memory to the simulated memory space.CUDA allows for texture objects to be used in a wide variety of ways depending on how theprogrammer conflgured the texture reference, which, in GPGPU-Sim, is a struct containing theproperties in variables that are set using functions from the CUDA runtime API. We added supportfor the following capabilities:† Difierent data types: Texture data type can vary from single byte chars to four byte ints and oats to even doubles. Moreover, they can be either scalar or up to a vector of four values.As such, texture load implementation must properly account for the many difierent possiblefetch sizes, from a single char (one byte) to a vector of four  oats (sixteen bytes). Duringsimulation, a texture fetch instruction provides the texture name, which we use to look up thetexture reference that holds the pointer to the texture (e.g. address base) in the simulatedmemory space. The texture fetch instruction also provides the texture co-ordinates which weuse with the data size to calculate the address ofiset which, when summed with the addressbase, provides us the address of the requested data in the simulated memory space.† Multi-dimensionality: Textures can be either one-dimensional (indexed with one addressvalue) or two-dimensional (indexed with two address values). In CUDA, texture cache blocksare fetched from GPU main memory in a way that preserves two dimensional spatial localityfor two-dimensional textures. In other words, a texture fetch to a speciflc co-ordinate willalso retrieve data close to the co-ordinate in both dimensions to the same cache block. InGPGPU-Sim, we implement a 4D blocking address scheme [19] which essentially permutesthe bits in requested addresses to promote spatial locality in a 2D space rather than in linearspace.† Difierent addressing modes: The provided index (or indices) into the texture can be eitherthe absolute co-ordinates in integers or also normalized  oats from 0.0 to 1.0 (useful for image182.1. CUDA memoriesprocessing). If normalized address mode is used, CUDA supports a variety of sampling modesif the normalized address does not scale to integer co-ordinates. For instance, the data ofthe nearest co-ordinate to the normalized address may be fetched or the values around thenormalized address can be sampled and averaged (linear flltering). We flnd that none of thebenchmarks we study use normalized addressing so we only implemented pick nearest. Weleave the implementation of other sampling modes needed for imaging processing and othergraphical applications to future work.† Boundary condition handling: In accordance with CUDA, we added two methods to GPGPU-Sim for handling indices that are outside the range of the texture array: \Clamping" clampsindex values below 0.0 and above 1.0 to the range [0.0,1.0) (for normalized addressing mode).\Wrapping" uses only the fractional part of the texture co-ordinate so that, for example, anindex of 1.75 and -0.25 will both wrap to a texture co-ordinate of 0.75 (again for normalizedaddressing mode).2.1.3 Constant memoryThepurposeofconstantmemoryistostoreread-only\constant"variablessharedamongallthreads.As such, constant memory accesses are backed by an L1 read-only constant cache. Unlike thetexture memory cache, the constant cache allows single cycle access latency only if all threads in ahalf-warp request the same data. In other words, the texture cache can be thought of as having oneread port for each shader processor in a shader core while the constant cache has only a single readport shared among all shader processors in a shader core. If there is port con ict (e.g. when shaderprocessors require access to multiple cache blocks in the same cycle), the accesses are serialized andthe pipeline stalls while the constant cache is being accessed. Since a warp cannot proceed as longas any thread is stalled waiting on memory, the order in which the constant cache services requestsdoes not matter. In the benchmarks we study, we flnd that none of the constant memory accesses192.2. CUDA benchmarksTable 2.1: Benchmark thread organizationBenchmark Abr. Grid CTA Concurrent Total InstructionDimensions Dimensions CTAs/core Threads CountAES Cryptography [27] AES (257,1,1) (256,1,1) 2 65792 28MGraph Algorithm: BFS (128,1,1) (512,1,1) 4 65536 17MBreadth First Search [20]Coulombic Potential [23, 45] CP (8,32,1) (16,8,1) 8 32768 126MgpuDG [54] DG (268,1,1); (84,1,1); 5 22512; 596M(268,1,1); (112,1,1); 6 30016;(603,1,1) (256,1,1) 4 1543683D Laplace Solver [17] LPS (4,25,1) (32,4,1) 6 12800 82MLIBOR Monte Carlo [18] LIB (64,1,1) (64,1,1) 8 4096 907MMatrix Multiply [46] MMC (8,32,1) (16,16,1) 3 65536 341MMUMmerGPU [48] MUM (782,1,1) (64,1,1) 3 50000 77MNearest Neighbor [9] NN (938,1,1) (16,1,1) 8 15008 6.4MNeural Network NEU (6,28,1); (13,13,1); 5 28392; 40MDigit Recognition [7] (50,28,1) (5,5,1) 8 35000N-Queens Solver [41] NQ (223,1,1) (96,1,1) 1 21408 2MRay Tracing [28] RAY (16,32,1) (16,8,1) 3 65536 71MStoreGPU [3] STO (384,1,1) (128,1,1) 1 49152 134MWeather Prediction [29] WP (9,8,1) (8,8,1) 3 4608 215MBlack-Scholes BLK (256,1,1) (256,1,1) 3 65536 236Moption pricing [37]Fast Walsh Transform [37] FWT (512,1,1); (256,1,1); 4 131072; 240M(256,1,1) (512,1,1) 2 131072request more than one address per warp.2.2 CUDA benchmarksTo evaluate GPGPU-Sim, we needed benchmarks that exhibited a wide variety of behaviors in termsof arithmetic intensity, warp divergence, and memory access patterns. Preliminary evaluation of theapplications included in NVIDIA’s CUDA software development kit (SDK) showed that they tendto conform well to the CUDA Programming Framework, exhibiting relatively high IPC normalizedto peak, little to no warp divergence, and moderate to low bandwidth utilization. To flnd otherapplications with other behaviors, we collected applications from other research communities andNVIDIA’s CUDA Zone that reported relatively low speedup versus running on CPUs. Below, wedescribe the CUDA applications not in the SDK that we use as benchmarks in our study. Theseapplications were developed by the researchers cited below and run unmodifled on our simulator.202.2. CUDA benchmarksTable 2.2: Benchmark PropertiesAbr. Local Shared Constant Texture Barriers?Memory? Memory? Memory? Memory?AES - X X 1D XBFS - - - - -CP - - X - -DG - X - 1D XLPS - -X - - XLIB X - X - -MMC X - - - XMUM X - - 2D -NN - - - - -NEU - - - - -NQ - X - - XRAY X - X - XSTO - X - - -WP X - - - -BLK - - - - -FWT - X - - XTables 2.1 and 2.2 lists the benchmarks name along with the main application properties, such asthe organization of threads into CTAs and grids as well as the difierent memory spaces on the GPUexploited by each application. Note that the instruction counts listed in Table 2.1 are the dynamicinstruction counts run on the GPU kernels summed across all threads. These kernels are essentiallyfunction calls that have been parallelized to run on the GPU and, as such, some of the instructioncounts may seem small relative to typical instruction counts of entire programs. Multiple entriesseparated by semi-colons in the grid and CTA dimensions indicate the application runs multiplekernels.AES Encryption (AES) [27] This application, developed by Manavski [27], implements theAdvanced Encryption Standard (AES) algorithm in CUDA to encrypt and decrypt flles. Theapplication has been optimized by the developer so that constants are stored in constant memory,the expanded key stored in texture memory, and the input data processed in shared memory. Weencrypt a 256KB picture using 128-bit encryption.Graph Algorithm: Breadth-First Search (BFS) [20] Developed by Harish andNarayanan [20], this application performs breadth-flrst search on a graph. As each node in the212.2. CUDA benchmarksgraph is mapped to a difierent thread, the amount of parallelism in this applications scales withthe size of the input graph. BFS sufiers from performance loss due to heavy global memory tra–cand branch divergence. We perform breadth-flrst search on a random graph with 65,536 nodes andan average of 6 edges per node.Coulombic Potential (CP) [23, 45] CP is part of the Parboil Benchmark suite developedby the IMPACT research group at UIUC [23, 45]. CP is useful in the fleld of molecular dynamics.Loops are manually unrolled to reduce loop overheads and the point charge data is stored inconstant memory to take advantage of caching. CP has been heavily optimized (it has been shownto achieve a 647£ speedup versus a CPU version [46]). We simulate 200 atoms on a grid size of256£256.gpuDG (DG) [54] gpuDG is a discontinuous Galerkin time-domain solver, used in the fleldof electromagnetics to calculate radar scattering from 3D objects and analyze wave guides, particleaccelerators, and EM compatibility [54]. Data is loaded into shared memory from texture memory.The inner loop consists mainly of matrix-vector products. We use the 3D version with polynomialorder of N=6 and reduce time steps to 2 to reduce simulation time.3D Laplace Solver (LPS) [17] Laplace is a highly parallel flnance application [17]. As wellas using shared memory, care was taken by the application developer to ensure coalesced globalmemory accesses. We observe that this benchmark sufiers some performance loss due to branchdivergence. We run one iteration on a 100x100x100 grid.LIBOR Monte Carlo (LIB) [18] LIBOR performs Monte Carlo simulations based on theLondon Interbank Ofiered Rate Market Model [18]. Each thread reads a large number of variablesstored in constant memory. We flnd the working set for constant memory flts inside the 8KBconstant cache per shader core that we model. However, we flnd memory bandwidth is still abottleneck due to a large fraction of local memory accesses. We use the default inputs, simulating4096 paths for 15 options.222.2. CUDA benchmarksMatrix Multiply (MMC) [46] This version of matrix multiply improves upon the optimizedversion of matrix multiply included in the CUDA SDK. In addition to using blocking to minimizethe number of memory accesses, this version also exploits loop unrolling to reduce loop overheads.Furthermore, each CTA is carefully sized to maximize utilization of the 16kB shared memory pershader core.MUMmerGPU (MUM) [48] MUMmerGPU is a parallel pairwise local sequence alignmentprogram that matches query strings consisting of standard DNA nucleotides (A,C,T,G) to a ref-erence string for purposes such as genotyping, genome resequencing, and metagenomics [48]. Thereference string is stored as a su–x tree in texture memory and has been arranged to exploit thetexture cache’s optimization for 2D locality. Nevertheless, the sheer size of the tree means highcache miss rates, causing MUM to be memory bandwidth-bound. Since each thread performs itsown query, the nature of the search algorithm makes performance also susceptible to branch di-vergence. We use the flrst 140,000 characters of the Bacillus anthracis str. Ames genome as thereference string and 50,000 25-character queries generated randomly using the complete genome asthe seed.Nearest Neighbor (NN) [9] Nearest neighbor is a data mining application that uses denselinear algebra to flnd the k-nearest neighbors from an unstructured data set. The application isbandwidth bound and has a very regular memory access pattern, thus achieving close to the peakofi-chip bandwidth. We flnd the 3 nearest neighbors from a set of 42764 records.Neural Network (NEU) [7] Neural network uses a convolutional neural network to recognizehandwritten digits [7]. Pre-determined neuron weights are loaded into global memory along withthe input digits. We modifled the original source code to allow recognition of multiple digits atonce to increase parallelism. Nevertheless, the last two kernels utilize CTAs of only a single threadeach, which results in severe under-utilization of the shader core pipelines. For this reason, we onlyrun the flrst two kernels in our simulations in Chapter 6. We simulate recognition of 28 digits from232.2. CUDA benchmarksthe Modifled National Institute of Standards Technology database of handwritten digits.N-Queens Solver (NQ) [41] The N-Queen solver tackles a classic puzzle of placing N queenson a NxN chess board such that no queen can capture another [41]. It uses a simple backtrackingalgorithm to try to determine all possible solutions. The search space implies that the executiontime grows exponentially with N. Our analysis shows that most of the computation is performedby a single thread, which explains the low IPC. We simulate N=10.Ray Tracing (RAY) [28] Ray-tracing is a method of rendering graphics with near photo-realism. In this implementation, each pixel rendered corresponds to a scalar thread in CUDA [28].Up to 5 levels of re ections and shadows are taken into account, so thread behavior depends onwhat object the ray hits (if it hits any at all), making the kernel susceptible to branch divergence.We simulate rendering of a 256x256 image.StoreGPU (STO) [3] StoreGPU is a library that accelerates hashing-based primitives de-signed for middleware [3]. We chose to use the sliding-window implementation of the MD5 algo-rithm on an input flle of size 192KB. The developers minimize ofi-chip memory tra–c by using thefast shared memory. We flnd STO performs relatively well.Weather Prediction (WP) [29] Numerical weather prediction uses the GPU to acceleratea portion of the Weather Research and Forecast model (WRF), which can model and predictcondensation, fallout of various precipitation and related thermodynamic efiects of latent heatrelease [29]. The kernel has been optimized to reduce redundant memory transfer by storing thetemporary results for each altitude level of a cell in registers. However, this requires a large amountof registers, thus limiting the maximum allowed number of threads per shader core to 192, whichis not enough to cover global and local memory access latencies. We simulate the kernel using thedefault test sample for 10 timesteps.24Chapter 3Hybrid Analytical PerformanceModeling of DRAMIn this chapter, we describe our analytical model of GDDR3 memory. First, we present a moreformal deflnition of DRAM e–ciency in Section 3.1. Then, we present our baseline analytical modelfor predicting DRAM e–ciency in Section 3.2. Finally, we show how to implement our model usinga sliding window proflling technique on a memory request address trace in Section DRAM e–ciencyWe flrst deflne DRAM utilization as the percentage of time that a DRAM chip is transferringdata across its bus over an arbitrary length of time, T. We also deflne DRAM e–ciency as thepercentage of time that the DRAM chip is transferring data across its bus over only the periodof time when the DRAM is active (Tactive). We say that a DRAM chip is active when it is eitheractively transferring data across its bus or when there are memory requests waiting in the memorycontroller queue for this DRAM and it is waiting to service the requests but cannot due to anyof the timing constraints mentioned in Section 1.3.2. If the DRAM is always active, the DRAMutilization and DRAM e–ciency metric will evaluate to the same value for the same period oftime (e.g. T = Tactive). While 100% DRAM e–ciency and utilization is theoretically achievable,less-than-maximum throughput can occur for a variety of difierent reasons: there may be poor rowaccess locality among the requests waiting in the memory controller queue, resulting in DRAMhaving to pay the overhead cost of constantly closing and reopening difierent rows too frequently,253.2. Hybrid analytical modeland therefore the few requests per open row are not enough to hide the aforementioned overheads;the memory controller may also be causing the DRAM to frequently alternate between servicingreads and writes, thus having to pay the overhead cost of tWTR and tRTW each time; and, speciflconly to DRAM utilization, the memory controller queue may simply be empty, forcing the DRAMto sit idle.3.2 Hybrid analytical modelAs described in the previous section, both our DRAM utilization and DRAM e–ciency metrics areessentially fractions. We use these deflnitions as the basis of our analytical model, replacing thenumerators and denominators with an expression of variables that can be derived by analyzing acollected memory request address trace.We develop our analytical model by flrst considering requests to a single bank j for a timeperiod T, which starts from when a request to bank j must open a new row and ends when thelast request to the new row in bank j present in the memory controller queue at the start of theperiod has been serviced. We use i to denote the requests to any bank of B banks, not just thoseto bank j. The DRAM e–ciency of bank j is flrst expressed as:Effj =MINhMAX(tRC;tRP +tRCD +tj);PB¡1i=0 tiiMAX(tRC;tRP +tRCD +tj) (3.1)whereti = number of requests to bank i⁄Tsrvc req (3.2)andTsrvc req = request sizeDRAMs per MC ⁄busW ⁄data rate (3.3)In essence, the numerator of Equation 3.1 represents the time spent transferring data divided bythe period of time represented in the denominator. To aid our description, we provide the example263.2. Hybrid analytical modelCommandBusBank 0Bank 1Bank 2Bank 3P0PrechargeA0ActivateR0 R0Data BusR1 R2R1 R2 R1 R3 R3 R3R1R1 R1R2R2R3R3R3R0R0R1R1 R1R2R2 R3R3R3 R0R0P0Pre...R1 R3R3R1R3R1R1 P0 Command TypeDestination BankDestination BankCommand TypeP = PrechargeA = Activate R = ReadTCLOCKTFigure 3.1: Timing diagram example. Here, Bank 0 has a difierent row opened so prechargeand activate must flrst be issued before read commands can be issued. Banks 1through 3 have already been initialized to the correct rows.shown in Figure 3.1 which will be referred to as we elaborate our model.The principle underlying Equation 3.1 is that a bank must flrst wait at least tRC before switchingto a new row. This overhead can be hidden if there are requests to other banks that are to openedrows. In Figure 3.1, this is the case. In this example, j is 0 and the requests to banks 1, 2, and 3help hide the precharge and activate delays.The numerator in Equation 3.1 is the sum of all ti, where ti is deflned as the product of twoterms: the number of requests that can be serviced by bank i over the set period of time deflnedin the denominator multiplied by the time it takes to service each request (Tsrvc req). Note thatthe sum of all ti also includes tj. We assume that all memory requests are for the same amount ofdata. The formula for calculating how much a single request to bank i contributes to ti is shown in273.2. Hybrid analytical modelEquation 3.3. Here, request size is the size of the memory request in bytes, DRAMs per MC isthe number of DRAM chips controlled by each memory controller (e.g., having two DRAM chipsconnected in parallel doubles the efiective bandwidth per memory controller, meaning each singlechip only needs to transfer half the data per request), busW is the bus width in bytes, and data rateis the number of transfers per cycle, 2 for GDDR3’s double data rate interface [47]. Moreover, theminimum granularity of a memory access, ReqGranularity which determines how many read orwrite commands need to be issued per request, is deflned in Equation 3.4. For a flxed memoryrequest size, the minimum granularity determines the maximum number of parallel DRAM chipsthat can be used e–ciently by a single memory controller. (With our simulation parameters, themaximum number of parallel DRAM chips is four.)ReqGranularity = DRAMs per MC ⁄busW ⁄BL (3.4)Equation 3.4 depends on a new parameter unused in the previous equations called the burst length,or BL. Our example corresponds to the baseline conflguration described in Section 5.2, where eachrequest is comprised of 2 read commands to a DRAM chip (2 read commands * 2 DRAM chips permemory controller * 4B bus width * burst length 4 = 64B request size in our baseline conflguration).Basing our model on our deflned metrics, we set the denominator of Equation 3.1 as the periodof time starting from when a particular bank of a DRAM chip begins servicing a request to a newrow (by flrst precharging, assuming a difierent row is opened, and then issuing activate to openthe new row) and ending at when the bank issues a precharge to begin servicing a new request toa difierent row. To simplify our deflnition of T in this case, our model assumes that a prechargecommand is not issued until its completion can be immediately followed by an activate to thesame bank. Under this assumption, the delay between successive precharge commands is thenconstrained by the same delay as the delay between successive activate commands, tRC. This isshown as T in Figure 3.1. The denominator is controlled by two difierent sets of factors, depending283.2. Hybrid analytical modelon how many requests need to be serviced in this new row. This is because switching to a new rowis primarily constrained by the three timing parameters described in Section 1.3.2, tRC (row cycletime), tRP (row precharge delay), and tRCD (row address to column address delay). In the GDDR3speciflcation that we use, tRC is greater than the sum of tRP and tRCD, so the denominator mustalways be greater than or equal to tRC. This is expressed in our model using the MAX() function.In our example, since there is only one request (2 read commands) to bank 0, the denominator isdominated by tRC so T equals to tRC.Given our microarchitectural parameters outlined in Table 5.3 and Table 1.1, our read andwrite requests take four \DRAM clock" cycles to complete, where each request is comprised of twocommands. To obtain number of requests to bank i, we count the number of requests waiting inthe memory controller queue that are to the row currently open for corresponding bank i. This isequivalent to doing so when the memory controller flrst chooses the request to the new row of bankj (in other words, right before it needs to precharge and activate). Requests to unopened rows arenot counted because they must wait at least tRC before they can begin to be serviced, by whichtime they can no longer be used to hide the timing constraint overhead for bank j. Since requeststo the new row in bank j itself can not be used to hide the latency of precharging and activatingthe new row, it appears in both the denominator and in the numerator as one of the terms of Pi ti(when i = j). In our example, there are 2 read commands to bank 0 and 10 read commands issuedto banks 1 through 3 so here tj=2*2=4 and Pi ti = 2*(2+10) = 24. Therefore, for the time periodT, Eff0 = 24/34 = 70.6%.As previously described, ti is used to determine how much of the row switching overhead can behiddenbyrow-hitrequeststo otherbanks. With goodDRAMrowaccesslocality, theremaybemoresuch requests than necessary, causing Pi ti to be greater than the denominator in Equation 3.1.Using the MIN() function to flnd the minimum between Pi ti and the expression identical to thedenominator is simply done to impose a ceiling on our predicted result at unity, which we argue293.3. Processing address tracesis accurate since Pi ti > MAX(tRC;tRP + tRCD + tj) means that there are more than enoughrequests to completely hide the timing constraint delays imposed by opening the new row in bankj.As stated before, this equation assumes T = Tactive. In other words, it does not take into accountthe amount of time when the DRAM is inactive; therefore, it is more suitable for predicting DRAMe–ciency rather than DRAM utilization. We expect that available memory request address traceswill not necessarily provide any timing information on when DRAM is inactive, in which case weare limited to only being able to predict the DRAM e–ciency. In order to flnd the DRAM e–ciencyover the entire runtime length of a program, we sum the numerators obtained using Equation 3.1of the time periods that make up the runtime length of the program and divide by the sum of thecorresponding denominators.Effj =PNn=1 MINhMAX(tRC;tRP +tRCD +tj;n);PB¡1i=0 ti;niPNn=1 MAX(tRC;tRP +tRCD +tj;n)(3.5)Equation 3.5 shows the DRAM e–ciency calculation of a program whose entire runtime lengthis comprised of N periods, where the DRAM e–ciency of each period can be calculated usingEquation 3.1. How we determine which time periods to use is explained next in Section 3.3, wherewe detail how we use Equation 3.1 with a memory address trace to determine the DRAM e–ciency.3.3 Processing address tracesIn order to use the equations shown in the previous section, the variables must be obtained byprocessing memory request address traces. We assume that the memory request address trace iscaptured at the point where the interconnection network from the processor feeds the requestsinto the memory controller queue, allowing the order in which the requests are inserted into thememory controller to be preserved. We assume that the trace is otherwise devoid of any sort of303.3. Processing address tracestiming information, such as the time when requests enter the queue.Consider a simple case of a memory controller with a queue size of 4 and a DRAM with only twobanks, Bank 0 and Bank 1, where each bank has only two rows (Row A and Row B for Bank 0 andRow X and Row Y for Bank 1). Now assume the memory address trace shown in Figure 3.2(a),where the top request (Bank 0 Row A) is the oldest.To process the trace, we use the algorithm in Figure 3.3 to determine the ti values neededfor Equation 3.1. Figure 3.3 shows two difierent heuristics for which we quantify the accuracy inSection 6.2. The flrst heuristic, which we call proflling + no activate overlap, is guarded by the\if(mode == no activate overlap)" statement and the second heuristic, which we call proflling +full activate overlap, is guarded by the \if(mode == full activate overlap)" statement. While thisalgorithm applies only to First-Ready First-Come First-Serve (FR-FCFS), it can be modifled tohandle other memory scheduling algorithms. In Section 6.3.4, we modify our algorithm to predictthe DRAM e–ciency of another out-of-order scheduling algorithm, \Most-Pending" [43]. Ourapproach is based on a sliding window proflling implementation where the window is the size ofthe memory controller queue.We start with the assumption that initially Row A of Bank 0 and Row X of Bank 1 are open(Figure 3.2(a)). Following our algorithm, the flrst request that we encounter is to Bank 0 Row A,which is an opened row, so we can remove that request from the sliding window and increment t0by Tsrvc req (in Section 3.2), which is calculated to be 4 cycles using the DRAM parameters shownlater in Table 5.3. The next request that we read in is to the same bank but this time to Row B.We leave this request in the sliding window, e.g. our memory controller queue, in the hopes that,by looking ahead to newer requests, we can flnd ones that are to the opened row. Moving on, itcan be seen that the next request is again to Bank 0 Row A, allowing us to service it right away.In this manner, we process all the requests up to Request 8, removing (servicing) requests that313.3. Processing address tracesare to the opened rows (shown as shaded entries in Figure 3.2(b)), at which point our window sizeis 4, meaning the memory controller queue is now full. The ti values can now be used to calculatethe e–ciency for this period using Equation 3.1. The ti values are then reset to signify the startof a new prediction period and repeat our algorithm from the start. For no activate overlap, wethen open the row of only the next oldest request, which is to Bank 0 Row B (Figure 3.2(c)).The DRAM e–ciency over the entire length of the trace can also be computed using the methoddescribed at the end of Section 3.2.While no activate overlap focuses on the prediction of e–ciency by accounting for the requestsin the memory controller queue that can help reduce the overhead of timing constraint delays, itdoes not account for the timing constraints of separate banks that can also be overlapped. Morespeciflcally, the process of precharging and activating rows in difierent banks of the same chip is onlyconstrained by tRRD. With a tRC value of 34 and a tRRD of 8 as per our DRAM parameters 1.1, wecan completely overlap the activation of rows in all 4 banks (34/8 = 4.25 activate commands issuableper row cycle). Performance-wise, this means that even for a memory access pattern with minimalrow access locality, the access pattern that flnely interleaves requests to all 4 banks can achieve 4times the bandwidth of an access pattern that has requests to only a single bank. As such, it iscrucialtoalsotakeintoaccounttheoverlappingofprechargeandactivatesforapplicationswithpoorrow access locality. Otherwise, the analytical model will signiflcantly underestimate the DRAMe–ciency. We model this using our second heuristic, full activate overlap, which instead opens therows of the oldest requests to all banks, assuming that the delays associated with switching rowscan be overlapped in all banks. This is shown in Figure 3.2(d) as Bank1 also switching from RowXto RowY in accordance with the code guarded by \mode = full activate overlap" in Figure 3.3.Leaving the opened row for Bank1 as RowX essentially forces the switching overhead of this bankto be paid separately from Bank0’s switching overhead from RowA to RowB (proflling + noactivate overlap), potentially causing underestimation of DRAM e–ciency. Conversely, switching323.3. Processing address tracesthe rows of all banks at the same time completely overlaps the switching overheads of all banks(full activate overlap), potentially causing overestimation of DRAM e–ciency.In essence, the two heuristics described provide upper and lower bounds on the amount ofbank-level parallelism, either fully overlapping the row switching overheads for full activate overlapor completely serializing the row switching operations of the DRAM banks for no activate overlap.Since the actual amount of bank-level parallelism will be somewhere within the bounds, we alsoshow results for a third heuristic, averaged, which averages the DRAM e–ciency predictions of theflrst two heuristics. In Section 6.2, we will show the accuracy of our three heuristics described inthis section in comparison to the DRAM e–ciency measured in performance simulation. We willshow that our averaged heuristic results in highest accuracy.333.3. Processing address tracesID Bank Row1 0 A2 0 B3 0 A4 1 Y5 1 Y6 0 A7 1 X8 1 Y9 1 YBank Row0 A1 XID Bank Row1 0 AMemory Request Address Trace Opened RowProfiling Windowt0=0 t1=0Youngest(a)ID Bank Row1 0 A2 0 B3 0 A4 1 Y5 1 Y6 0 A7 1 X8 1 Y9 1 YBank Row0 A1 XID Bank Row2 0 B4 1 Y5 1 Y8 1 YMemory Request Address Trace Opened RowProfiling Windowt0=12 t1=4Youngest(b)ID Bank Row2 0 B4 1 Y5 1 Y8 1 Y9 1 YBank Row0 B1 XID Bank Row2 0 BMemory Request Address Trace Opened RowProfiling Windowt0=0 t1=0Youngest(c) no activate overlapID Bank Row2 0 B4 1 Y5 1 Y8 1 Y9 1 YBank Row0 B1 YID Bank Row2 0 BMemory Request Address Trace Opened RowProfiling Windowt0=0 t1=0Youngest(d) full activate overlapFigure 3.2: Sliding window proflling technique example 343.3. Processing address tracesStart : Reset window_size to 0while (windo w_size < memory controller queue size) { //while memory controller queue is not fullRead in newest request, req, to sliding windowif (req.row == opened_row[req.bank]) {//check if request is to opened row, if so, service it //(first-ready first-come first-serve po licy)remove req from window and from tracetreq.bank += Tsrvc_req//upd ate t} else {//request is to different row, so store in queue and //check if later requests can be serviced firstwindow_size++ //add this req into window}}Calculate_efficiency(ti) //uses ti to calculate efficiency Reset(ti)if (mode == no_activate_overlap){//find oldest request, oldest_reqopened_row[oldest_req.bank] = oldest_req.row}if (mode == full_activate_overlap){//find oldest request of all banks, oldest_req[]for (all i in banks) {opened_row[oldest_req[i].bank] = oldest_req[i].row;}}Go back to StartFigure 3.3: Sliding window proflling algorithm for processing memory request address traces35Chapter 4Complexity-Efiective Memory AccessSchedulingIn this chapter, we flrst describe how the interconnection network that passes requests betweenthe shader cores and memory controllers afiects the memory request address streams. Next, weshow how these efiects can drastically reduce the e–ciency of a memory controller that employsin-order request scheduling. Finally, we propose an elegant interconnection arbitration schemethat increases the e–ciency of in-order request scheduling, allowing for a more complexity-efiectivedesign.4.1 MotivationThe scaling of process technology has allowed processor throughput to increase exponentially. How-ever, the properties of packaging prevent the pin count from increasing at the same rate. To alleviatethis, chip designers maximize the percentage of pins on a chip used for transferring data. Increasingthe amount of ofi-chip memory supported by the chip will likely require increasing the number ofmemory channels. Each additional memory controller requires its own set of address pins for com-municating the requested memory addresses and control pins for relaying the commands to controlthe DRAM chips.One approach to reduce the number of memory channels is to have the memory controllerfor each channel control multiple DRAM chips, thereby amortizing the address and control pinoverhead across multiple DRAM chips [15]. However, there is a limit to how much of this \chip-364.1. Motivationlevel parallelism" is available. Consider a case where the typical memory request size is 64 bytes.In the CUDA Programming Model, the requests made by a half-warp of 16 threads accessingcontiguous 32-bit words can be coalesced into a single 64-byte memory request [39]. If we assumethat a single DRAM chip has a burst length of 4 and a bus width of 4 bytes (typical of moderngraphics double-data rate (GDDR3) memory technology), then the maximum number of DRAMchips that a memory controller can potentially control without wasting data pin bandwidth is four(4 DRAM chips per memory controller £ 4B bus width £ 4 burst length = 64B). Furthermore,increasing the number of DRAM chips per memory controller reduces the number of read/writecommands per activated row for each chip. If the memory access pattern of a particular applicationexhibits low row access locality, then DRAM e–ciency can reduce greatly. This occurs when thereis a lack of requests to service to the activated rows of other banks when one bank is constrainedby the DRAM timing overhead needed to switch rows.g1004g1081g1006g1004g1081g1008g1004g1081g1010g1004g1081g1012g1004g1081g1005g1004g1004g1081g1005 g1006 g1008g24g90g4g68g3g18g346g349g393g400g3g393g286g396g3g68g286g373g381g396g455g3g18g381g374g410g396g381g367g367g286g396g396g258g374g282g1005g396g258g374g282g1006g396g258g374g282g1007Figure 4.1: Measured DRAM e–ciency for random uniform memory tra–c with varying rowaccess localityFigure 4.1 shows the DRAM e–ciency achieved using FR-FCFS scheduling of uniform randommemory access patterns with varying degrees of row access locality measured using a stand-aloneDRAM simulator based upon GPGPU-Sim [5]. In flgure 4.1, rand1 is an artiflcially generateduniform random memory access pattern. Since the row designation of each memory access israndom, the high number of rows per chip (4096 for the particular GDDR3 memory that wesimulate [47]) imply that the chance of two back-to-back requests going to the same row in aparticular bank is near-zero, meaning that a new row must be opened after the servicing of each374.1. Motivationmemory request. In rand2, we replicate each random memory access once so that there are alwaystwo requests back-to-back to the same row. Similarly, in rand3, we replicate each random memoryaccess so that there are always three requests back-to-back to the same row. As can be seen, fewerDRAM chips per memory controller efiectively means more DRAM read and write commands to therows of each chip to transfer the same total amount of data, thus increasing the DRAM e–ciency.The example above assumes a memory access pattern that is uniform random. Such a memoryaccess pattern will tend to generate requests to all banks in a chip, thus maximizing \bank-levelparallelism". As such, the DRAM e–ciency values presented in the above example represent thenear-maximum e–ciencies obtainable for the given row access locality parameters. On the otherhand, a non-uniform memory access pattern may generate (a disproportionate amount of) requeststo only a subset of the banks of a DRAM chip, which can cause DRAM e–ciency to be drasticallylower. For example, consider again the previous example where the row access locality is two andthe number of DRAM chips per memory controller is two. If all of the requests go to a single bank,the DRAM e–ciency plummets from 80.7% to 23.6%.In order to maximize DRAM e–ciency, modern DRAM controllers will schedule DRAM requestsout of order in an efiort to maximize the row access locality. Without doing so, the performanceof in-order scheduling can be drastically worse when simple regular request streams (that wouldordinarily schedule very e–ciently in DRAM) from multiple sources become interleaved. Figure 4.2shows a simplifled example that illustrates this. Consider flrst a single shader core interacting with asingle DRAM controller (Figure4.2(a)). The request stream of this single shader core consists of tworequests to the opened row, RowA. Assuming that Row A has already been loaded into the DRAMrow-bufier, both requests can be serviced immediately and no row-switching latency is incurred.Now consider two shader cores interacting with the same DRAM controller (Figure 4.2(b)). Theflrst shader core issues two requests to RowA while the second shader core issues two requests to384.1. MotivationCoreA Req1RowAReq0RowADRAM ControllerRequest IssueReq1 RowAReq0 RowADRAM TimingReq1RowAReq0RowAReceived OrderOldestBus:TimeDRAM(a)CoreA Req0RowAReq2RowACoreB Req1RowBReq3RowBDRAM ControllerRequest IssueReq1 RowBReq0 RowADRAM TimingReq2RowAReq0RowAReceived OrderOldestBus:Time (FR-FCFS)Req2 RowAReq3 RowB Req3RowBReq1RowBRowSwitchReq0RowATime (In-Order)Req1RowBRowSwitchBus: RowSwitch RowSwitchReq2RowA Req3RowBInterconnectDRAM(b)Figure 4.2: Conceptual example showing efiect of request interleaving due to interconnect(all requests to a single bank)RowB. In trying to uphold fairness, a conventional interconnection network router arbiter mayuse round-robin arbitration, resulting in the two input request streams becoming flnely interleavedwhen they pass through the router. In this case, an out-of-order scheduler would group together therequests to Row A and service them flrst before switching to Row B and servicing the remainingrequests, thereby only paying the row-switching latency once. On the other hand, an in-orderDRAM scheduler would service the requests as it receives them, therefore having to pay the row-switching latency three times, resulting in drastically lower DRAM throughput. (Note that theDRAM timing in Figure 4.2(b) is not shown to scale. In our experiments, servicing a single 64Brequest takes 4 cycles while the row switching delay is at least 25 cycles so the efiect of requestinterleaving is actually much worse than illustrated.) If an intelligent network router recognizedthat, in this scenario, it should transfer all the requests of one shader core flrst before transferringthe requests of the second shader core, then the performance obtainable using out-of-order DRAM394.2. Efiect of interconnect on row localityShader CoresCoreCore Core Core...Core...MemoryControllerMemoryControllerMemoryControllerDRAM DRAM DRAMCrosbar Network..(a) CrossbarCore MemoryControlerDRAMCore Core Core CoreCore Core Core CoreMemoryControlerDRAMMemoryControlerDRAMCore CoreMemoryControlerDRAM CoreCore CoreCore Core CoreCore CoreMemoryControlerDRAMCore Core Core CoreMemoryControlerDRAMMemoryControlerDRAM CoreCore CoreCoreMemoryControlerDRAMCoreMesh Network(b) MeshShader CoresCoreCoreCore Core...Core...MemoryControllerMemoryControllerMemoryControllerDRAM DRAM DRAMCrossbar Network..Ring Network(c) RingFigure 4.3: Architecture (DRAM chips are ofi-chip)scheduling could be achieved with a much simpler in-order FIFO DRAM scheduler.In this thesis, we leverage this key observation to design an intelligent interconnection networkcoupled with a simple DRAM controller that can achieve the performance of a much more com-plex DRAM controller. Such a solution saves area and results in a shorter design cycle and lessveriflcation complexity.4.2 Efiect of interconnect on row localityTo determine the efiects of memory request stream interleaving due to multiple shader core inputs,we augmented our simulator, which models the architectures shown in Figure 4.3, to measure thepre-interconnect row access locality, the row access locality of the memory request sent from theshader cores to the interconnect, versus the post-interconnect row access locality, the row accesslocality of the memory requests seen at the DRAM controllers after they traverse the interconnect,for a set of CUDA applications.Figure 4.4 presents the measured post-interconnect row access locality divided by the pre-interconnect row access locality (Figure 4.4(a)) for crossbar, mesh, and ring networks as well as the404.3. Grant holding interconnect arbitration0510152025303540fwt lib mum neu nn ray red sp wp HMRow Access Localitypre-icnt (X-Bar) pre-icnt (Mesh) pre-icnt (Ring)post-icnt (X-Bar) post-icnt (Mesh) post-icnt (Ring)(a) Row access locality00. lib mum neu nn ray red sp wp HMRow Access Locality Crossbar Mesh Ring(b) Ratio of row access locality after interconnect versus before interconnectFigure 4.4: Row access locality measured in number of requests per row activate command before theinterconnect (Pre-Icnt) and after (Post-Icnt). HM = Harmonic Meancalculated ratio (Figure 4.4(b)). As can be seen, the row access locality decreases by almost 47%across all network topologies after the interconnect when arithmetically averaging the ratios of thepre-interconnect locality versus the post-interconnect locality across all applications.Our conflguration for Figure 4.4 has 28 shader cores and 8 memory channels (one DRAMcontroller per memory channel), which corresponds roughly to NVIDIA’s GT200 architecture of 30shader cores and 8 memory channels [6].4.3 Grant holding interconnect arbitrationOne of the fundamental pillars of conventional interconnection networks and their many arbitrationpolicies is the concept of fairness, such that all input-output combinations receive equal service [12].Without some degree of fairness in arbitration, nodes in the network may experience starvation.414.3. Grant holding interconnect arbitrationOne common method of achieving fairness is to perform round-robin arbitration so that the mostrecentnodethatgetsservicedbecomesthelowest-priorityinthenextarbitrationcycletoallowothernodes to get serviced. As seen in the previous section, such a policy can signiflcantly reduce theamount of row access locality in the memory request stream seen by the DRAM controller. With anin-order memory request scheduler, this can lead to lower DRAM throughput. We therefore proposean interconnect arbitration scheme that attempts to preserve the row access locality exhibited bymemory request streams from individual shader cores. To this end, we introduce two simple,alternative modiflcations that can be applied to any arbitration policy:\Hold Grant" (HG): If input I was granted output O in the previous arbitration cycle, theninput I again gets the highest priority to output O in the current cycle.\Row-Matching Hold Grant" (RMHG): If input I was granted output O in the previousarbitration cycle, then input I again gets the highest priority to output O in the current cycle aslong as the requested row matches the previously requested row match.With HG, as long as a shader core has a continuous stream of requests to send to a DRAMcontroller, the interconnect will grant it uninterrupted access. RMHG has more constraints onwhen to preserve grant in order be more fair, but may not perform as well since it will not preserveany inherent bank-level parallelism found in the memory request stream, found to be importantin reducing average thread stall times in chip multiprocessors by Mutlu et al. [33]. To maximizethe beneflt of using our proposed solution, the network topology and routing algorithm should besuch that there is no path diversity. Common examples are a crossbar network or a mesh withnon-adaptive routing. (We leave the development of interconnect arbitration schemes that preserverow access locality in network topologies with path diversity to future work.) This forces requestssent to the interconnect to leave in the same order. If requests were otherwise allowed to arriveat the DRAM controllers out of order, the row access locality in the memory request stream sentfrom shader cores may also be disrupted in this way.424.3. Grant holding interconnect arbitrationRouting Computation Virtual Channel AllocationSwitch AllocationVirtual Channel(s)Virtual Channel(s)......Input 0Input mCrossbar Switch...Output 0Output nHold Grant MemoryFigure 4.5: Grant-holding interconnect router architectureIn the standard interconnection network router pipeline, there are two stages where our hold-grantpoliciescanbeenforced: virtualchannelallocation, whichhappensflrst, andswitchallocation.In our experiments, we enforce our hold-grant policies during virtual channel allocation, as illus-trated in Figure ??. If this is not done, virtual channel allocation will arbitrate packets randomly.In this case, enforcing hold-grant during switch allocation would be inefiective since the packetsarbitrated during virtual channel allocation would already be interleaved. Implementation of ourscheme requires a small amount of additional memory and additional comparators to keep trackof and determine which inputs to hold grant on. This overhead is quantifled in more detail inSection 4.5.434.4. Interleaving due to multiple virtual channels4.4 Interleaving due to multiple virtual channelsIn interconnection networks, having multiple virtual channels can reduce network latency by reduc-ing router pipeline stalls [12]. In certain network topologies and with certain routing algorithms,it can also be necessary to prevent deadlock.In the context of memory request stream interleaving, multiple virtual channels can efiectivelycause a stream of memory requests sent from one source to one destination to arrive out of order.This occurs since a request that enters a network router gets assigned the flrst virtual channelthat becomes available \on-the- y". (We refer to this as dynamic virtual channel assignment,or DVCA.) When back-to-back requests from one source shader core are sent to one destinationDRAM controller, they may get assigned to difierent virtual channels, where each virtual channel isessentially a difierent path. This is akin to having path diversity, which was explained in Section 4.3to be detrimental in preserving row access locality. With requests taking multiple paths to thedestination, order cannot be enforced.To handle the case when multiple virtual channels are used in the interconnect, we propose astatic, destination-based virtual channel assignment (SVCA). While requests from one input nodecan still use multiple virtual channels in this assignment, all requests from one shader core (inputnode) to one speciflc DRAM controller (output destination node) are always assigned the samevirtual channel in each router. In the case where there are virtual channels VC0 to VCv¡1 foreach router and M DRAM controllers, then all requests to DRAM controller M must use VCn andonly VCn, where n = M mod v. Doing so allows for a fair virtual channel allotment across allinput-output combinations. We evaluate the performance impact of using SVCA versus DVCA inSection 6.3.3 and show that applications obtain a speedup of up to 17.7% in a crossbar networkwith four virtual channels per port when using SVCA over DVCA.444.5. Complexity comparison4.5 Complexity comparisonSince our proposed changes are not limited to the DRAM controller circuitry, but also apply to theinterconnection network, we attempt to deflne the complexity with a general-enough method thatcan be applied to any network topology. To this end, we quantify the complexity of our designbased on the difierences in two metrics, the amount of storage required in bits, and the number ofbit-comparisons.With FR-FCFS, whenever a new command is issued, every request in the DRAM controllerqueue must be checked to determine the best candidate for issuing [15, 44]. For a queue size Q, thistranslates to Q times the number of row bits + the number of bank bits. In the GDDR3 technologythat we study, there are 4096 rows (nR) and 4 banks (nB) , which is equal to 12 row bits dlog2(nR)eand 2 bank bits dlog2(nB)e. For a GPU system with M DRAM controllers, this means a maximumnumber of bit comparisons of M ⁄ Q ⁄ (dlog2(nR)e + dlog2(nB)e) every DRAM command clock.This number represents an upper bound and can be much lower in certain scenarios. First, theaverage occupancy of the DRAM controller queue may be very low, especially if the application isnot memory-intensive. Second, the DRAM request search may be performed in a two step processwhere only requests to available banks are selected to check for row bufier hits. Available banksinclude those that have pending requests and are not in the process of precharging (closing) a rowor activating (opening) a row.Since a DRAM controller that implements FR-FCFS needs to perform a fully-associative com-parison, it does not need to be banked based on the number of DRAM banks. This allows thecontroller to maximize its request storage capability if, for instance, all requests go to a singlebank. For our complexity-efiective solution, we use a banked FIFO queue with one bank perDRAM bank. Doing so allows us to consider requests to all banks simultaneously, thus promotingbank-level parallelism. With this conflguration, the DRAM controller considers up to nB requests454.5. Complexity comparisoneach cycle, checking whether the row of the request matches the opened row for each bank todetermine whether the DRAM controller can issue the request immediately or whether it needsto flrst open a new row. Furthermore, when requests are flrst inserted into the DRAM controllerqueue, the DRAM controller must check to which of the banks to insert the request, requiringan additional dlog2(nB)e comparisons. This adds up to M ⁄ (nB ⁄ dlog2(nR)e + dlog2(nB)e) bitcomparisons summed across all DRAM controllers.Our interconnect augmentations to the interconnect, HG and RMHG, also require additionalbit comparisons. We implement our grant-holding policy at the virtual channel allocator. If we areusing HG, then we check whether input-output combinations granted in the previous arbitrationcycle are also pending in this cycle. If so, we force allocation to choose the same input-outputcombinations again. This requires knowing which output was chosen in the last arbitration cyclefor each input arbiter and, if there is a request to that output again this cycle, then it is chosen again.To do this, we store one bit for each input-output combination and set the bits of the combinationsthat were previously granted. We then compare these bits to the ones in the request matrix and, ifthere is a match, we clear the other request bits to the other inputs for the corresponding output andthe other outputs for the corresponding inputs. This essentially forces the allocation algorithm tochoose the same input-output combination for the next cycle because no other options are available.For a crossbar with one router, this totals to C ⁄M bit comparisons and C ⁄M additional bits ofstorage required, where C is the number of shader cores. In the context of a mesh, the numberof routers is equal to C+M and an upper bound of nine total input and output ports for a twodimensional mesh. This is because we do not care about the output port in a shader core node toitself (since the shader core cannot sink memory requests) and the input port in a memory nodefrom itself (since the memory controller cannot generate memory requests). (Some nodes on theedge or corners of the mesh will also have fewer ports.) This sums up to (C+M)*(20) bits comparedand stored since each router will either have 4 inputs and 5 outputs or 5 inputs and 4 outputs for464.5. Complexity comparisonTable 4.1: Calculated complexity(HMHG4 =Hash-MatchingHold Grantusing 4-bit hashes)Bits Stored in ICNT Bits Compared in ICNT Bits Compared in TotalDRAM SchedulerFRFCFS 0 0 M*Q*(dlog2(nR)e+ 3584 bits compareddlog2(nB)e)BFIFO+HG C*M C*M 0 224 bits stored(XBAR) 224 bits comparedBFIFO+HG 20*(C+M) 20*(C+M) 0 720 bits stored(MESH) 720 bits comparedBFIFO+RMHG C*M + C*M + 0 608 bits stored(XBAR) dlog2(nR)e⁄nB ⁄M dlog2(nR)e⁄M 320 bits comparedBFIFO+RMHG 20*(C+M) + 20*(C+M) + 0 14,544 bits stored(MESH) (C +M)⁄dlog2(nR)e⁄nB ⁄M (C +M)⁄dlog2(nR)e⁄5 2880 bits comparedBFIFO+HMHG4 C*M + C*M + 0 320 bits stored(XBAR) 4*M 4*M 320 bits comparedBFIFO+HMHG4 20*(C+M) + 20*(C+M) + 0 1440 bits stored(MESH) (C+M)*4*5 (C+M)*4*5 1440 bits compareda total of 20 input-output combinations.An interconnect allocation policy that uses RMHG requires more bit comparisons since, inaddition to the bit comparison and storage overheads incurred using HG, it must also check thatback-to-back requests are indeed to the same row, thus requiring dlog2(nR)e comparisons for eachentry in the request matrix. Compounding this problem for the crossbar and, to a much greaterextent, the mesh is that requests sent to the same output may be to ultimately difierent destinations,such as to difierent banks or, in the case of mesh, to difierent chips. One solution to account forthis is to have each router keep track of the rows of the most recently granted request to thedifierent possible destinations for each output. Therefore for a single router, which is the caseof the crossbar, this would require a total of dlog2(nR)enB ⁄M additional bits of storage for theinterconnect and dlog2(nR)e⁄M bit comparisons. Accordingly, each router of a mesh would requiredlog2(nR)e⁄nB⁄M additional bits of storage and dlog2(nR)e*5 bit comparisons, dlog2(nR)e for eachoutput. Summed across all routers in a mesh, the total requirements is (C+M)⁄dlog2(nR)e⁄nB⁄Madditional bits of storage and (C +M)⁄dlog2(nR)e⁄5 bit comparisons.Compared to using HG, a mesh network using RMHG can incur signiflcant bit-comparison and474.5. Complexity comparisonarea overheads, even more than FR-FCFS. However, it is possible to reduce both bit comparison andbit storage overheads by matching a hash of the row instead of flnding exact matches and storingfewer row entries than the maximum required of nB*M. We call this heuristic Hash-Matching HoldGrant (HMHG). In our experiments, we match and store only a 4-bit hash of the row bits for thelast granted request for each output for each router (HMHG4) instead of using RMHG. We foundthat using more bits did not improve performance. HMHG4 has an overhead of (C+M)*4*M bitcomparisons and (C+M)*4*5 bits stored for the mesh and 4*M bit comparisons and 4*M bits storedfor the crossbar. Table 4.1 summarizes our flndings. The right-most column shows the calculatedcomplexity for our baseline conflguration with 28 shader cores, 8 memory controllers with 32-entryqueues each, and 4096 rows and 4 banks per DRAM chip.In the next chapter, we will describe our experimental methodology. In Section 6.3, we willquantify the efiects of using our complexity-efiective DRAM controller design in comparison toconventional in-order and out-of-order scheduling memory controllers.48Chapter 5Methodology5.1 Supporting CUDA applications on GPGPU-SimTable 5.1 shows the simulator’s conflguration used to test the applications described in Chapter 2.Rows with multiple entries show the difierent conflgurations that we have simulated. Bold valuesshow our baseline. We simulate all benchmarks to completion to capture all the distinct phasesof each kernel in the benchmarks, especially the behavior at the tail end of the kernels, which canvary drastically compared to the beginning. If the kernels are relatively short and are frequentlylaunched, the difierence in performance when not simulating the benchmark to completion can besigniflcant.This thesis performs three main studies using the benchmarks described in Section 2.2. The flrststudy validates our simulator against real hardware while the other two explore the performanceefiects of varying two key microarchitecture parameters: the DRAM controller queue size and theshader resources that afiect the limit on the maximum number of CTAs that can execute on ashader core concurrently.We issue CTAs in a breadth-flrst manner across shader cores, selecting a shader core that hasa minimum number of CTAs running on it, so as to spread the workload as evenly as possibleamong all cores. We note that this breadth-flrst CTA distribution heuristic can occasionally leadto counter-intuitive performance results due to a phenomena we will refer to as CTA load imbalance.This CTA load imbalance can occur when the number of CTAs in a grid exceeds the number thatcan run concurrently on the GPU. For example, consider six CTAs on a GPU with two shader cores495.1. Supporting CUDA applications on GPGPU-SimTable 5.1: Microarchitectural parameters for testing GPGPU-Sim (bolded values show base-line conflguration)Number of Shader Cores 32Warp Size 32SIMD Pipeline Width 8Number of Threads / Core 256 / 512 / 1024 / 1536 / 2048Number of CTAs / Core 2 / 4 / 8 / 12 / 16Number of Registers / Core 4096 / 8192 / 16384 / 24576 / 32768Shared Memory / Core (KB) 4/8/16/24/32 (16 banks, 1 access/cycle/bank)Constant Cache Size / Core 8KB (2-way set assoc. 64B lines LRU)Texture Cache Size / Core 64KB (2-way set assoc. 64B lines LRU)Interconnection Network Full CrossbarNumber of Memory Channels 8GDDR3 Memory Timing tCL=9, tRP=13, tRC=34tRAS=21, tRCD=12, tRRD=8Bandwidth per Memory Channel 16 (Bytes/Cycle)DRAM request queue capacity 8/16/32 / 64Memory Controller out of order (FR-FCFS) /in order (FIFO) [43]Branch Divergence Method Immediate Post Dominator [16]Warp Scheduling Policy Round Robin among ready warpsMaximum Supported In- ight 64/UnlimitedRequests per Shader Corewhere at most two CTAs can run concurrently on a shader core. Assume running one CTA on onecore takes time T and running two CTAs on one core takes time 2T (e.g., no ofi-chip accesses andsix or more warps per CTA|enough for one CTA to keep our 24 stage pipeline full). If each CTAin isolation takes equal time T, total time is 3T (2T for the flrst round of four CTAs plus T forthe last two CTAs which run on separate shader cores). Suppose we introduce an enhancementthat causes CTAs to run in time 0.90T to 0.91T when run alone (i.e., faster). If both CTAs on theflrst core now flnish ahead of those on the other core at time 1.80T versus 1.82T, then our CTAdistributor will issue the remaining 2 CTAs onto the flrst core, causing the load imbalance. Withthe enhancement, this actually causes an overall slowdown since now 4 CTAs need to be completedby the flrst core, requiring a total time of at least 3.6T. We carefully verifled whether this behavioroccurs by plotting the distribution of CTAs to shader cores versus time for both conflgurationsbeing compared. This efiect would be less signiflcant in a real system with larger data sets and505.1. Supporting CUDA applications on GPGPU-Simtherefore grids with a larger number of CTAs.In previous work, rather than attempting to eliminate the efiect, we simply noted where itoccurred. In this work, we introduce a new metric for measuring processor throughput. Ourconventional calculation of IPC for our GPU architecture is performed by dividing the sum ofall instructions committed in all cores by the number of cycles it takes for the last core to flnishexecution of its CTAs. If the flnish times of the cores difier signiflcantly, then the flnal calculatedIPC may not be an accurate measure of the true performance capabilities of the architecture. Coresthat flnish early may be assigned CTAs from other kernels to maximize utilization.In Section 6.1.3 we measure the impact of running greater or fewer numbers of threads. Wemodel this by varying the number of concurrent CTAs permitted by the shader cores, which ispossible by scaling the amount of on-chip resources available to each shader core. There are foursuch resources: number of concurrent threads, number of registers, amount of shared memory, andnumber of CTAs. The values we use are shown in Table 5.1. The amount of resources availableper shader core is a conflgurable simulation option, while the amount of resources required by eachkernel is extracted using ptxas.Varying the number of concurrent CTAs can cause certain benchmarks to sufier from severeCTA load imbalance. To address this, we use a new performance metric, IPC weighted by cycle(IPCwbc), to present our data in and only in Section 6.1.3. With this metric, we flrst determinethe IPC values of the individual shader cores flrst, using the flnish time of each core as the cycletime instead of using the flnish time of the slowest core. We then take a weighted average ofthe individual IPC values, where the weights are the individual cycle times. Finally, this weightedaverage is multiplied by the number of cores to obtain our performance metric for the aggregate IPCof a multicore architecture. Consider the conceptual example shown in Table 5.2. There are twotypes of CTAs: CTA A symbolizes an arithmetically-intensive workload while CTA B symbolizesa memory-intensive workload. Core 1, which is assigned CTA A, flnishes much faster than Core515.1. Supporting CUDA applications on GPGPU-Sim2 while also executing more instructions. Our conventional IPC metric calculates the system IPCto be only 1.1 whereas, if we use the IPC weighted by cycle method described above, we obtainan IPC that is almost double, 2.0. Which one is more accurate? Assume if, after Core 1 flnishesexecution, we assign it another CTA of type B and, after Core 2 flnishes execution, we assign itanother CTA of type A. Now both cores must run both types CTAs, and should thus flnish at thesame time. In this case, each core executes 110 instructions in 110 cycles, for an aggregate systemIPC of 2.0 regardless of which IPC metric we use. Table 5.2 clearly shows that IPCwbc was ableto \predict" this.Note that our metric is optimistic in assuming that there is no destructive (or constructive)interaction between the two cores. In the scenario shown in Table 5.2, Core 2 may only be able toflnish execution of CTA B in 100 cycles because Core 1 is idle for most of the time and thus doesnot contend for shared resources, such as DRAM. In reality, Core 2 may take longer if Core 1 wasassigned another CTA to run immediately after flnishing its flrst CTA. Nevertheless, this metric issu–cient in quantifying performance when CTA load imbalance occurs. Consider the case wherea microarchitecture modiflcation results in an increase in IPC compared to a baseline. There arethree possible causes: one, the modiflcation actually improves some part of the system; two, theCTA load imbalance was more severe in the baseline than in the modifled conflguration; and three,there was more constructive interference in the modifled conflguration than in the baseline. If wethen compare IPCwbc and see that the modifled architecture achieved a higher IPCwbc than thebaseline, then we can rule out CTA load imbalance as a possible cause. This is because IPCwbcbecomes more optimistic as the severity of the CTA load imbalance increases. If the IPCwbc of themodifled architecture (which has less severe CTA load imbalance) is still higher than the baselinein spite of this, then we can say with confldence that the performance improvement was not due toCTA load imbalance. On the contrary, if the IPCwbc of the baseline is higher, then we know thatthe IPC of the baseline is only lower due to CTA load imbalance.525.2. Hybrid analytical modeling of DRAMTable 5.2: Example showing difierence between IPC versus IPC weighted by cycleWork to Core 1 CTA A: 100 instructions in 10 cyclesWork to Core 2 CTA B: 10 instruction in 100 cyclesIPC Calculation 110100IPC Value 1.1IPCwbc Calculation 2⁄( 10110 ⁄ 10010 + 100110 ⁄ 10100)IPCwbc Value 2.0Table 5.3: Microarchitectural parameters for analytical DRAM model (bolded values showbaseline conflguration)Shader Cores 32Threads per Shader Core 1024Interconnection Network Full CrossbarMaximum Supported In- ight 64Requests per Shader CoreMemory Request Size (Bytes) 64DRAM Chips 8,16,32DRAM Controllers 8DRAM Chips per Controller 1,2,4DRAM Controller 8,16,32,64Queue SizeDRAM Controller First-Ready First-ComeScheduling Policy First-Serve (FR-FCFS),Most Pending [43]5.2 Hybrid analytical modeling of DRAMTo evaluate our hybrid analytical model, we modifled GPGPU-Sim [5] to collect the memoryrequest address streams that are fed to the memory controllers for use in our analytical model.Table 5.1 shows the microarchitectural parameters used in this study. The advantage of studying amassively multi-threaded architecture that GPGPU-Sim is capable of simulating is that it allows usto stress the DRAM memory system due to the sheer number of memory requests to DRAM thatcan be in- ight simultaneously at any given time. In our conflguration, we allow for 64 in- ightrequests per shader core. This amounts to a maximum of 2048 simultaneous in- ight memoryrequests to DRAM. In comparison, Prescott has only eight MSHRs [8] and Willamette has onlyfour MSHRs [21]. With an eight memory controller conflguration, we use our model on 8 separate535.2. Hybrid analytical modeling of DRAMTable 5.4: Benchmarks for analytical DRAM modelBenchmark Label SuiteBlack-Scholes option pricing BS CUDA SDKFast Walsh Transform FWT CUDA SDKgpuDG DG 3rd Party3D Laplace Solver LPS 3rd PartyLIBOR Monte Carlo LIB 3rd PartyMatrix Multiply MMC 3rd PartyMUMmerGPU MUM 3rd PartyNeural Network Digit Recognition NN 3rd PartyRay Tracing RAY 3rd PartyReduction RED CUDA SDKScalar Product SP CUDA SDKScan Large Array SLA CUDA SDKMatrix Transpose TRA CUDA SDKWeather Prediction WP 3rd Party0%20%40%60%80%100%BS FWT DG LPS LIB MMCMUM NEU NN RAY RED SP SLA TRA WPDRAM Efficiency DRAM UtilizationFigure 5.1: Measured DRAM e–ciency and measured DRAM utilization averaged across all DRAM chipsfor each benchmarkmemory request address streams per application. We test our model on the 14 difierent applicationsshown in Table 5.4, some of which are from the set described in Section 2.2 and some of which arefrom Nvidia’s CUDA software development kit [37]. Our only criterium for selecting applicationsfrom the two suites is that they must show greater than 10% DRAM utilization averaged across allDRAM chips for our given baseline microarchitecture conflguration.To illustrate the diversity of our application set, Figure 5.1 shows the DRAM e–ciency andutilization measured by GPGPU-Sim in performance simulation averaged across all DRAM chipsfor each application. We simulate each application to completion in order to collect the full memory545.3. Complexity-efiective memory access schedulingTable 5.5: Microarchitectural parameters for complexity-efiective memory access scheduling(bolded values show baseline conflguration)Shader Cores 28Threads per Shader Core 1024Interconnection Network Ring, Crossbar, MeshMaximum Supported In- ight 64Requests per MultiprocessorMemory Request Size (Bytes) 64Interconnect Flit Size (Bytes) 32DRAM Chips 16DRAM Controllers 8DRAM Chips per Controller 2DRAM Controller 8,16,32,64Queue SizeDRAM Controller First-Ready First-ComeScheduling Policy First-Serve (FR-FCFS),Naive FIFO, Banked FIFO [43]Table 5.6: Benchmarks for complexity-efiective memory access schedulingBenchmark Label SuiteFast Walsh Transform fwt CUDA SDKLIBOR Monte Carlo lib 3rd PartyMUMmerGPU mum 3rd PartyNeural Network Digit Recognition neu 3rd PartyRay Tracing ray 3rd PartyReduction red CUDA SDKScalar Product sp CUDA SDKWeather Prediction wp 3rd PartyNearest Neighbor nn Rodiniarequest address trace.5.3 Complexity-efiective memory access schedulingTo evaluate our design, we again used GPGPU-Sim [5]. We implemented our changes to theinterconnection network simulated using the versatile interconnection network simulator introducedby Dally et al. [12].Table 5.5 shows the microarchitectural parameters used in this study. We evaluate our design onthe 9 difierent applications shown in Table 5.6 across three network topologies: ring, mesh, and fullcrossbar. To occupy all nodes in a square mesh, we keep the number of memory controllers constant555.3. Complexity-efiective memory access schedulingat eight and slightly reduce the number of shader cores to 28, resulting in a 6x6 mesh. We found thatcertain applications that exhibit low ofi-chip tra–c are not sensitive to varying memory controllerconflgurations. In order to have results that are more meaningful, we focused our study on onlythose applications that meet all three of our selection criteria: flrst, the total processor throughputof the application must be less than 75% of the peak IPC for the baseline conflguration; second,the DRAM utilization, percentage time spent of a memory channel transferring data over all time,averaged across all DRAM controllers must be over 20%. Finally, the DRAM e–ciency averagedacross all DRAM controllers must be less than 90%. Following these three criteria ensures thatwe study only applications that are truly memory-sensitive. We use 3 applications from NVIDIA’sCUDA software development kit (SDK) [37], 5 applications from the set used by Bakhoda et. al [5],and one application from the Rodinia benchmark suite introduced by Che et al. [9]. We simulateeach application to completion.56Chapter 6Experimental Results6.1 GPGPU-Sim simulation6.1.1 BaselineWe flrst simulated our baseline GPU conflguration with the bolded parameters shown in Table 5.1.Figures ?? and ?? show the instruction classiflcation and memory instruction breakdown of theCUDA benchmarks introduced in Section 2.2. As shown, these benchmarks exhibit a diverse rangeof computation, control  ow, and memory intensity. The memory access patterns in terms of thetypes of memory spaces accessed also varies greatly from application to application. Figure 6.3shows the performance of our baseline conflguration (for the GPU only) measured in terms ofscalar instructions per cycle (IPC). In the right-most set of bars, SDK represents the harmonicmean for all the applications from Nvidia’s CUDA Software Development Kit that we simulated,while HM represents the harmonic mean of the new applications that we introduced in Section 2.2.Evidently, there is a large gap in performance between the CUDA SDK applications and realapplications developed by other authors. For comparison, we also show the performance assuminga perfect memory system with zero memory latency. Note that the maximum achievable IPC for ourconflguration is 224 (28 shader cores x 8-wide pipelines). We also validated our simulator againstan Nvidia Geforce 8600GTS, which is top of the mid-range lineup for the G80 series, by conflguringour simulator to use four shaders and two memory controllers. The IPC of the GPU hardware, asshown in Figure 6.4(a), was estimated by dividing the dynamic instruction count measured (in PTXinstructions) in simulation by the product of the measured runtime on hardware and the shader576.1. GPGPU-Sim simulation0%20%40%60%80%100%AES BFS CP DG LPS LIB MUMNEU NQ RAY STO WP BLK FWTInstruction Classification Mem Ops ALU Ops Fused Multiply-Add Control Flow SFU OpsFigure 6.1: Instruction Classiflcation0%25%50%75%100%AES BFS CP DG LPS LIB MUMNEU NQ RAY STO WP BLK FWTMemory Instruction ClassificationGlobal Local Param Const Tex SharedFigure 6.2: Memory Instructions Breakdownclock frequency [36]. Figure 6.4(b) shows the scatter plot of IPC obtained with our simulationsmimicking the 8600GTS normalized to the theoretical peak IPC versus the normalized IPC datameasured using the 8600GTS. The correlation coe–cient was calculated to be 0.899. One sourceof difierence, as highlighted by the data for CP which actually achieves a normalized IPC over1, is due to compiler optimizations in ptxas, the assembler that generates the flnal binary runon the GPU, which reduce the instruction count on real hardware5. Overall, the data shows thatapplications that perform well in real GPU hardware perform well in our simulator and applicationsthat perform poorly in real GPU hardware also perform poorly in our simulator. In the followingsections, we explore reasons why some benchmarks do not achieve peak performance.586.1. GPGPU-Sim simulation0326496128160192224AES BFS CP DG LPS LIB MUM NN NQU RAY STO WP HM BLK FWT SDK IPCBaseline Perfect Memory Maximum IPC = 224Figure 6.3: Baseline performance (HM=Harmonic Mean)081624324048AES BFS CP DG LPS LIB MUM NN NQU RAY STO WP IPCEstimated8600GTS (HW) Simulated8600GTSMax IPC = 32 (a) 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 Simulated 8600 GTS Normalized IPC 8600GTS Normalized IPC CPBFS, NN, NQU DGLPSLIBRAYSTOWPAESMUM(b)Figure 6.4: Performance comparison with 8600GTS6.1.2 DRAM utilization and e–ciencyIn this section we explore the impact that memory controller design has on performance. Our base-line conflguration uses an Out-of-Order (OoO) First-Ready First-Come First-Serve (FR-FCFS) [43]memory controller with a capacity of 32 memory requests. Each cycle, the OoO memory con-troller prioritizes memory requests that hit an open row in the DRAM over requests that requirea precharge and activate to open a new row. Against this baseline, we compare a simple First-In First-Out (FIFO) memory controller that services memory requests in the order that they arereceived, as well as a FR-FCFS OoO controller with other varying input bufier capacity sizes of8 (OoO8), 16 (OoO16), and 64(OoO64). We measure two metrics besides performance (based on5We only simulate the input PTX code which, in CUDA, ptxas then assembles into a proprietary binary formatthat we are unable to simulate.596.1. GPGPU-Sim simulation0. BFS CP DG LPS LIB MMC MUM NEU NN NQ RAY STO WP HM BLK FWT SDKSpeedupFIFO OoO-8 OoO-16 OoO-32 OoO-64Figure 6.5: Impact of DRAM memory controller optimizations0%20%40%60%80%100%AES BFS CP DG LPS LIB MMC MUM NEU NN NQ RAY STO WP BLK FWTDRAM EfficiencyFIFO OoO-8 OoO-16 OoO-32 OoO-64Figure 6.6: DRAM E–ciencyIPC): the flrst is DRAM e–ciency, which is the percentage of time spent sending data across thepins of DRAM over the time when there are any memory requests being serviced or pending in thememory controller input bufier; the second is DRAM utilization, which is the percentage of timespent sending data across the DRAM data pins over the entire kernel execution time. These twomeasures can difier if an application contains GPU computation phases during which it does not0%20%40%60%80%100%AES BFS CP DG LPS LIB MMC MUM NEU NN NQ RAY STO WP BLK FWTDRAM UtilizationFIFO OoO-8 OoO-16 OoO-32 OoO-64Figure 6.7: DRAM Utilization606.1. GPGPU-Sim simulationaccess DRAM (e.g., if it has been heavily optimized to use \shared memory").Figure 6.5 compares the performance of our baseline to other memory controller conflgurations.We observe that AES, CP, LPS, MMC, NQ, and STO exhibit almost no slowdown for FIFO.Figure 6.6 shows AES and STO obtain over 75% DRAM e–ciency. Close examination reveals thatat any point in time all threads access at most two rows in each bank of each DRAM, meaningthat a simple DRAM controller policy su–ces. Furthermore, Figure 6.7 shows that AES and STOhave low DRAM utilization despite the fact that they are data-processing applications. Both theseapplications make extensive use of shared memory (see Figure 6.2). NQ and CP have very lowDRAM utilization, making them insensitive to memory controller optimizations. Performance isreduced by over 40% when using FIFO for BFS, LIB, MUM, RAY, and WP. These benchmarks allshow drastically reduced DRAM e–ciency and utilization with this simple controller.6.1.3 Are more threads better?GPUs can use the abundance of parallelism in data-parallel applications to tolerate memory accesslatency by interleaving the execution of warps. These warps may either be from the same CTAor from difierent CTAs running on the same shader core. One advantage of running multiplesmaller CTAs on a shader core rather than using a single larger CTA relates to the use of barriersynchronization points within a CTA [46]. Threads from one CTA can make progress while threadsfrom another CTA are waiting at a barrier. For a given number of threads per CTA, allowing moreCTAs to run on a shader core provides additional memory latency tolerance, though it may implyincreasing register and shared memory resource use. However, even if su–cient on-chip resourcesexist to allow more CTAs per core, if a compute kernel is memory-intensive, completely fllling upall CTA slots may reduce performance by increasing contention in the interconnection network andDRAM controllers.We explore the efiects of varying the resources that limit the number of threads and hence CTAs616.1. GPGPU-Sim simulation0. BFS CP DG LPS LIB MMC MUM NEU NN NQ RAY STO WP BLK FWTSpeedup using IPC25% 50% 100% (Baseline) 150% 200%(a) Speedup measured using IPC0. BFS CP DG LPS LIB MMC MUM NEU NN NQ RAY STO WP BLK FWTSpeedup using IPCwbc25% 50% 100% (Baseline) 150% 200%1.55(b) Speedup measured using IPCwbcFigure 6.8: Efiects of varying number of CTAsthat can run concurrently on a shader core, without modifying the source code for the benchmarks.We vary the amount of registers, shared memory, threads, and CTAs between 25% to 200% of thoseavailable to the baseline. The results are shown in Figure 6.8, where we report speedup using bothIPC and IPCwbc, explained in Section 5.1. For the baseline conflguration, some benchmarks arealready resource-constrainedto only 1 or 2 CTAs per shader core, making them unable to run usinga conflguration with less resources. We do not show bars for conflgurations that for this reason areunable to run. Furthermore, CPS, LPS, LIB, and WP do not have enough CTAs to make use ofincreased shader resources, so performance remains constant as the CTA limit is increased beyond100%. For these benchmarks, all CTAs can run simultaneously for the baseline conflguration. Weflrst note that analyzing the speedup using IPCwbc (Figure 6.8(b)) results in speedups that areeither monotonic or contain a single distinct peak as the CTA limit is scaled, removing the multiple626.1. GPGPU-Sim simulationpeaks in speedup that occur for some benchmarks (BFS, DG, MMC, and MUM) when insteadusing IPC (Figure 6.8(a)). A closer analysis at the output logs showed that it is indeed the CTAload imbalance issue that causes these erratic results in Figure 6.8(a). For this reason, the rest ofthis subsection will focus on using IPCwbc (Figure 6.8(b)) to analyze the results. NQ shows littlechange when varying the number of CTAs since it has very few memory operations. For AES,DG, LPS, MUM, NQ, STO, and WP, performance increases monotonically as more CTAs per coreare used. Each CTA in NQ and STO uses all of the shared memory in the baseline conflguration,therefore increasing shared memory by half for the 150% conflguration results in no increase in thenumber of concurrently running CTAs while doubling the amount of shared memory for the 200%conflguration yields a large performance improvement. While AES and DG are capable of usingthe additional shader core resources by increasing the number of concurrently running CTAs, theincrease in performance beyond the baseline conflguration is minuscule (1.3% increase in IPCwbcarithmetically averaged across the two benchmarks when increasing shader resources from 100% to200%). When the number of concurrently running CTAs increase, there is greater contention inthe shared DRAM system, resulting in increased memory latency of 58.3% (AES) and 72.7% (DG).For MUM, the speedup is higher (6.0%) because the memory latency increase is much smaller atonly 4.9%. BFS, NN, NEU, and RAY show distinct optima in performance when the CTA limitis at 150%, 50%, 150%, and 100% of the baseline conflguration, respectively. Above these limits,we observe DRAM e–ciencies decrease and memory latencies increase, again suggesting increasedcontention in the memory system. For conflguration with limits below the optima, the lack of warpsto hide memory latencies reduces performance.Given the widely-varying workload-dependent behavior, always scheduling the maximal numberof CTAs supported by a shader core is not always the best scheduling policy. We leave for futurework the design of dynamic scheduling algorithms to adapt to the workload behavior.636.2. Hybrid analytical modeling of DRAM0%20%40%60%80%100%BS FWT DG LPS LIB MMC MUM NEU NN RAY RED SP SLA TRA WPDRAM EfficiencyPredicted - No Activate Overlap Predicted - Full Activate OverlapPredicted - Averaged MeasuredFigure 6.9: Comparison of measured DRAM e–ciency to predicted DRAM e–ciency aver-aged across all DRAM chips for each benchmark6.2 Hybrid analytical modeling of DRAMInSection6.2.1, wewillflrstshowourresultsforthethreeheuristicsthatwedescribedinSection3.3.We show that our model obtain a correlation of 72.9% with an arithmetic mean absolute error of11.2%. In Section 6.2.2, we provide an in-depth analysis on the cause of our errors. In Section 6.3.4,we perform a sensitivity analysis of our analytical DRAM model across difierent microarchitectureconflgurations.6.2.1 DRAM e–ciency prediction accuracyFigure 6.9 compares the DRAM e–ciency measured in performance simulation to the modeledDRAM e–ciency when using the proflling technique described in Section 3.3. For clarity, weshow only the arithmetic average of the e–ciency across all DRAM controllers of each application.The arithmetic average of these absolute errors is 18.5% for no activate overlap and 17.0% forfull activate overlap. The corresponding correlation is 72.2% and 65.9% respectively. Of the 15applications studied, nine are predicted more accurately by full activate overlap, indicating thataccounting for bank-level parallelism, even naively, is important in predicting DRAM e–ciency.646.2. Hybrid analytical modeling of DRAM0%10%20%30%40%50%BS FWT DG LPS LIB MMC MUM NEU NN RAY RED SP SLA TRA WP AMErrorNo Activate Overlap Full Activate Overlap AveragedFigure 6.10: Arithmetic meanofabsolute errorofpredicted DRAMe–ciency averagedacrossall DRAM chips for each benchmark (AM = arithmetic mean across all bench-marks)When we average the predictions of these two heuristics for averaged, the error is reduced to 11.2%and the correlation increases to 72.9%. We described in Section 3.3 that, without accountingfor overlap of row activates, we expect our predictions to underestimate the measured e–ciency(proflling + no activate overlap). (We explain why the results of some applications do not matchthese expectations in Section 6.2.2.)In order to conflrm our expectations, we propose a new metric, polarity, which is deflned asthe arithmetic average of the non-absolute errors divided by the arithmetic average of the absoluteerrors. The value of polarity can range between -1 and 1, where -1 means that the test value(predicted e–ciency) is always less than the reference value (measured e–ciency), 1 means that thetest value is always greater than the reference value, and 0 means that on average, the error is notskewed positively or negatively in anyway. We calculated the arithmetic mean of the non-absoluteerrors for proflling + no activate overlap to be -14.1%, meaning a polarity of -0.81. This implies astrong tendency to underestimate the DRAM e–ciency, conflrming our expectations. On the otherhand, the polarity of average arithmetic error of proflling + full activate overlap is 0.98, meaningit almost always overestimates the DRAM e–ciency. We expect this to occur since the frequency656.2. Hybrid analytical modeling of DRAMg1004g1004g856g1006g1004g856g1008g1004g856g1010g1004g856g1012g1005g1004 g1004g856g1006g1004g856g1008g1004g856g1010g1004g856g1012 g1005g68g286g258g400g437g396g286g282g3g28g296g296g349g272g349g286g374g272g455g87g396g286g282g349g272g410g349g381g374(a) No activate overlapg1004g1004g856g1006g1004g856g1008g1004g856g1010g1004g856g1012g1005g1004 g1004g856g1006g1004g856g1008g1004g856g1010g1004g856g1012 g1005g68g286g258g400g437g396g286g282g3g28g296g296g349g272g349g286g374g272g455g87g396g286g282g349g272g410g349g381g374(b) Full activate overlapg1004g1004g856g1006g1004g856g1008g1004g856g1010g1004g856g1012g1005g1004 g1004g856g1006g1004g856g1008g1004g856g1010g1004g856g1012 g1005g68g286g258g400g437g396g286g282g3g28g296g296g349g272g349g286g374g272g455g87g396g286g282g349g272g410g349g381g374(c) AveragedFigure 6.11: Comparison of measured DRAM e–ciency to predicted DRAM e–ciency (1data point per memory controller per benchmark)of issuing activates is constrained by tRRD. Moreover, a row activate to a bank cannot be issued aslong as there are still requests to the current row of that bank, further constraining the frequency ofissuing activates. Our proflling + full activate overlap heuristic captures neither of these, essentiallymeaning that assuming full overlap of the row activates will always be optimistic in predicting theDRAM e–ciency. Figure 6.11 shows the scatter plots of predicted DRAM e–ciency using our threeheuristics to the measured DRAM e–ciency. Each dot represents the measured DRAM e–ciencyversus predicted for a single DRAM controller. Several facts become immediately apparent. First,most of the data points for no activate overlap tend to be under the X=Y line, meaning thatthis modeling heuristic tends to underestimate the DRAM e–ciency, while full activate overlapbehaves the opposite. Second, the data points in Figure 6.11(c) are closest to the X=Y line,further illustrating the improved accuracy of averaging the predictions obtained using our flrst twoheuristics.666.2. Hybrid analytical modeling of DRAM6.2.2 Modeling error analysisApplications where one technique generates signiflcantly poor predictions tend to be predictedmuch more accurately by the other technique (BS, LPS, LIB, MMC, MUM, NN, RED, TRA). Itis encouraging to see this because it implies that a heuristic such as averaged that combines noactivate overlap and full activate overlap can reduce the mean absolute error.We postulate that for applications with poor row access locality, the overlapping of row acti-vates has more signiflcant impact. Consequently full activate overlap should better predict DRAMe–ciency than no activate overlap, which serializes the precharge and activate delays and poten-tially underestimates the DRAM e–ciency. Conversely, we postulate that applications with goodrow access locality will not beneflt as much from row activate overlapping, which will make fullactivate overlap overestimate the e–ciency.To verify this, we obtain the average row access locality by dividing the total number of read andwrite commands by the number of activate commands per memory controller and arithmeticallyaveraging these values across all memory controllers for each application. Figure 6.12 shows thisaverage row access locality. The three applications with the lowest row access locality, MUM, TRA,and WP, follow our hypothesis and are predicted more accurately by full activate overlap than noactivate overlap.A closer look at Figure 6.9 reveals that, for BS, LPS, MMC, and NEU, the measured DRAMe–ciency is actually lower than that predicted by no activate overlap which we expected to bea lower bound on the DRAM e–ciency. This result for NEU can be explained by its extremelyhigh row access locality, essentially causing our model to predict close to 100%, regardless of theheuristic. In this case, the efiects of neglecting other timing constraints, such as the read-to-write orwrite-to-read latency becomes exposed. For BS, LPS, and MMC, we measure the average DRAMcontroller queue occupancy when there is at least one request in the queue, shown in Figure 6.13.676.2. Hybrid analytical modeling of DRAM051015202530BS FWT DG LPS LIB MMC MUM NEU NN RAY RED SP SLA TRA WPRow Access Locality72.7Figure 6.12: Average row access locality averaged across all DRAM chips for each benchmark(AM = arithmetic mean across all benchmarks, HM = harmonic mean acrossall benchmarks)We flnd that the average DRAM controller queue occupancy for these three applications is thelowest among all applications, all below 20%. This essentially means that our proflling technique istoo aggressive in that the sliding window (whose length is equal to the modeled memory controllerqueue size) that it uses is too large relative to the average memory controller queue occupancy insimulation. As such, the proflling technique is flnding row access locality that does not exist insimulations since the majority of the requests have not arrived to the memory controller yet. (Notethat the proflling technique used does not take timing information into account so it assumes thatthe memory controller (sliding window) can always look ahead in the memory request address traceby an amount equal to the memory controller queue size.)6.2.3 Sensitivity analysisIn this section, we present the accuracy of our results while sweeping across difierent key parameters:the DRAM controller queue size, the number of parallel DRAM chips per DRAM controller, andthe DRAM controller scheduling policy.We flrst present Figure 6.14, which shows the arithmetic mean absolute error of our three686.2. Hybrid analytical modeling of DRAM0%20%40%60%80%100%BS FWT DG LPS LIB MMCMUM NEU NN RAY RED SP SLA TRA WP AMDRAM ControllerQueue OccupancyFigure 6.13: Average DRAM controller queue occupancy averaged across all DRAM chipsfor each benchmark (AM = arithmetic mean across all benchmarks)0%5%10%15%20%8 16 32 64Memory Controller Queue SizeErrorNo Activate Overlap Full Activate Overlap AveragedFigure 6.14: Arithmetic mean absolute error versus DRAM controller queue size averagedacross all DRAM controllers and across all benchmarksproflling heuristics, full activate overlap, no activate overlap, and averaged, across four difierentDRAM controller queue sizes of 8, 16, 32 and 64. For clarity, we only present the values averagedacross all DRAM controllers and all benchmarks. In general, the error tends to increase as the queuesize increases. We attribute this to the fact that, since the queue occupancy is not always 100%for some benchmarks for our baseline DRAM controller queue size of 32 anyways (as explained inSection 6.2.2), the occupancy will be even lower when the queue size is increased to 64. This meansthat the size of the proflling window that we use is too large compared to what is occurring insimulation. Moreover, the e–ciency of all DRAM controllers decreases as the queue size is reduced,696.2. Hybrid analytical modeling of DRAM0%5%10%15%20%25%30%35%1 2 4DRAM Chips per Memory ControllerErrorNo Activate Overlap Full Activate Overlap AveragedFigure 6.15: Arithmetic mean absolute error versus number of DRAM chips per DRAMcontrollermeaning the absolute value of the difierence will decrease as well.Figure 6.15 shows the modeling error as the number of DRAM chips connected in parallel (toincrease bandwidth) to the DRAM controller is varied. Increasing the number of DRAM chipsper controller is an efiective way of increasing the total ofi-chip bandwidth to the chip withoutincurring much additional circuit logic on the chip (although it will increase the chip package sizedue to an increased number of pins needed for data transfer). Since our memory request datasizes are flxed at 64B and each of our DRAM chips is capable of transferring 16 bytes of data percommand, the limit in this amount of parallelism is no more than 4 DRAM chips per controller.Furthermore, as this parallelism is increased, the memory access pattern becomes much moreimportant in determining the DRAM e–ciency. This is because more DRAM chips per controllermeans less read and write commands per request, thus reducing Tsrvc req. The number of requestsper row (e.g., the row access locality) thus needs to be higher to maintain the DRAM e–ciency asshown in Equation 3.1. Moreover, reducing Tsrvc req can reduce the value of the denominator ofEquation 3.1, MAX(tRC;tRP +tRCD +tj). This means that there is less time to overlap the rowactivate commands of difierent banks, implying that no activate overlap should be more accuratethan full activate overlap. As we see in Figure 6.15, this is indeed the case.Finally, we also try implementing our hybrid analytical model for another memory scheduling706.2. Hybrid analytical modeling of DRAM0%10%20%30%40%50%BS FWT DG LPS LIB MMC MUM NEU NN RAY RED SP SLA TRA WP AMErrorNo Activate Overlap Full Activate Overlap AveragedFigure 6.16: Arithmetic mean absolute error of predicted DRAM e–ciency of \Most Pend-ing" memory scheduling algorithm averaged across all DRAM chips for eachbenchmark (AM = arithmetic mean across all benchmarks)policy, \Most Pending" [43]. After all requests to a row have been serviced, our default memoryscheduling policy, First-Ready First-Come First-Serve (FR-FCFS) [43], will then open the row ofthe oldest request. \Most Pending," conversely, will open the row that has the most number ofrequests pending to it in the queue (e.g., essentially a greedy bandwidth maximization scheme).The algorithm to implement this in our analytical model is virtually the same as the one shown inFigure 3.3 except, instead of flnding the oldest request whenever we switch rows, we now flnd theoldest request corresponding to the row that has the most number of requests. Figure 6.16 showsthe arithmetic mean absolute error for our baseline conflguration. We see that for this schedulingpolicy, the error is comparable to that of FR-FCFS.In conclusion, we see that our model is quite robust and the error stays comparable when varyingacross key microarchitectural parameters and even for a difierent memory scheduling policy. Withsmarter heuristics, such as the simple one described in Section 6.2.2, this error can be reduced evenfurther.716.3. Complexity-efiective memory access scheduling0%20%40%60%80%100%fwt lib mum neu nn ray red sp wp HMIPC Normalized to FR-FCFSFIFO BFIFO BFIFO+HG BFIFO+HMHG4 FR-FCFS(a) IPC normalized to FR-FCFS0%20%40%60%80%100%fwt lib mum neu nn ray red sp wp HMPost-ICNT to Pre-ICNT Row Access LocalityFIFO BFIFO BFIFO+HG BFIFO+HMHG4 FR-FCFS(b) Row Access Locality PreservationFigure 6.17: Baseline conflguration results for crossbar network, queue size = 32 (HM = Harmonic mean)6.3 Complexity-efiective memory access schedulingIn Section 6.3.1, we will flrst show how the interconnect modiflcations of our proposed solution as de-scribed in Section 4.3 performs in comparison to naive FIFO and FR-FCFS. Where trends across dif-ferent network topologies are similar, we only present results for the crossbar network. Section 6.3.2provides an in-depth analysis on how our interconnect modiflcations efiect the memory request ad-dress streams seen at the DRAM controllers. In Section 6.3.3, we compare static destination-basedvirtual channel assignment to dynamic virtual channel assignment. In Section 6.3.4, we perform asensitivity analysis of our design across difierent microarchitectural conflgurations.6.3.1 Complexity-efiective DRAM scheduling performanceSince RMHG was found in Section 4.5 to be relatively area intensive, we only show results forthe Hold-Grant (HG) modiflcation and Hash-Matching Hold-Grant modiflcation using 4 bit hashes(HMHG4).726.3. Complexity-efiective memory access scheduling0%20%40%60%80%100%fwt lib mum neu nn ray red sp wp HMDRAM EfficiencyFIFO BFIFO BFIFO+HG BFIFO+HMHG4 FR-FCFS(a) DRAM e–ciencyAverage memory latency (in cycles)(b) Average memory latency010002000300040005000fwt lib mum neu nn ray red sp wp HMMemory LatencyFIFO BFIFO BFIFO+HG BFIFO+HMHG4 FR-FCFSFigure 6.18: DRAM e–ciency and memory latency (HM = Harmonic mean, Queue size = 32)We present the per-application IPC normalized to FR-FCFS for the crossbar network in Fig-ure 6.17(a). The banked FIFO controller outperforms the unbanked controller across the board,indicating that the banked FIFO controller’s better proflciency at exploiting bank-level parallelismdrastically outweighs the additional potential queueing capacity of the unbanked controller. Therow access locality preservation in Figure 6.17(b) is calculated by dividing the post-interconnect lo-cality of the various conflgurations by the pre-interconnect locality for FR-FCFS. Our interconnectmodiflcations help preserve over 70% of the row access locality while FR-FCFS scheduling resultsin only 56.2% preservation.We show the efiect of our HG and HMHG4 interconnect modiflcations on the DRAM e–ciencyand memory latency, shown in Figure 6.18 for the crossbar network. Figure ?? shows that naivescheduling of memory requests in a system that supports thousands of in- ight memory requestscan cause the memory latency to increase exponentially. Our HM and HMHG4 is able to reduce thelatency of an in-order, banked DRAM controller (BFIFO) by 33.9% and 35.3% respectively. Fur-736.3. Complexity-efiective memory access schedulingthermore, both of our interconnect modiflcations, HM and HMHG4, improve the DRAM e–ciencyof BFIFO from 55% to 63%.6.3.2 Row streak breakersIn order to gain a better understanding of what causes the performance gap between our complexity-efiective design and FR-FCFS, we perform an in-depth analysis of the scheduled memory requestsin relation to the DRAM controller queue contents for the banked FIFO scheduler. First, we deflnea sequence of row-bufier hits to any single bank as a single row streak. Whenever a DRAM bankswitches from an old Row X to a new Row Y (i.e., when a row streak ends), we search backwardsthrough the FIFO queue corresponding to this DRAM bank starting from the pending request toRow Y, looking for any other requests to Row X. We deflne these requests that we are looking foras stranded requests since, in a FR-FCFS scheduler, these stranded requests would also be servicedbefore the row switches. If we flnd a stranded request, we look at the requests between the strandedrequest and the pending request to Row Y. In an ideal interconnection network, these requests,which we call row streak breakers, would be prioritized to arrive at the DRAM controller after thestranded request, thus maximizing row access locality. Furthermore, a FR-FCFS scheduler wouldprioritize these row streak breakers lower than the stranded request.Figure6.19characterizesthesourceoftherowstreakbreakersforthebankedFIFOconflgurationwith and without our interconnect augmentations. Each time a stranded request is found using ourmethod described in the previous paragraph, we increment the counter for each category once ifany requests in the corresponding category are found in the set of row streak breakers. In this way,for any set of row streak breakers, more than one category may be credited. We categorize rowstreak breakers into three possibilities, those originating from a difierent shader core, those fromthe same core but from a difierent CTA, and those from the same CTA. It can be immediatelyseen that row streak breakers originating from difierent shader cores dominate the distribution.746.3. Complexity-efiective memory access scheduling0%20%40%60%80%100%B H R B H R B H R B H R B H R B H R B H R B H R B H Rfwt lib mum neu nn ray red sp wpRow Streak Breaker Classification  Diff. Core Same Core, Diff. CTA Same CTAFigure 6.19: Row streak breakers normalized to row streaks in Baseline (B = Banked FIFO; H = BankedFIFO + Hold Grant; R = Banked FIFO + Hash-Matching Hold Grant with 4-bit row hashesFurthermore, our interconnect augmentations signiflcantly reduce the number of these particularrow streak breakers. Note that the bars in Figure 6.19 do not add up to 100% because sometimesthere are no stranded requests. In these situations, FR-FCFS would make the same schedulingdecision as our banked FIFO scheduler, oldest-flrst. In other words, shorter bars indicate that theperformance gap between the corresponding conflguration and FR-FCFS is also smaller.In virtually all cases, our interconnect augmentations reduce the number of row streak breakersdue to difierent shader cores to less than 10% of the total row streaks. The one obvious exception isneu, where our interconnect augmentations can reduce the row streak breakers to no less than 30%of the total number of row streaks. Measuring the average number of shader cores that results in arow streak originate showed that this measure was close to 1 for most other applications, explainingwhy an input-based hold-grant mechanism works so well in this architecture. The largest exceptionis neu, where the requests of every row streak when using the FR-FCFS scheduling policy are frommore than four difierent shader cores on average (4.18). In this case, FR-FCFS will be able tocapture row access locality that our interconnection augmentations cannot.756.3. Complexity-efiective memory access scheduling01020304050vc1 dvc2 svc2 dvc4 svc4IPCFIFO BFIFO BFIFO+HG BFIFO+HMHG4 FR-FCFSFigure 6.20: Harmonic mean IPC for difierent virtual channel conflgurations (Crossbar network)6.3.3 Static virtual channel assignmentIn this section, we compare the performance results of using static destination-based virtual channelassignment (SVCA) to more conventional dynamic virtual channel assignment (DVCA) in the casewhen there are multiple virtual channels per router port.Figure 6.20 shows the IPC harmonically averaged across all applications for difierent virtualchannel conflgurations: one virtual channel per port (vc1), twovirtual channels per port with SVCA(svc2), two VCs with DVCA (dvc2), four VCs with SVCA (svc4), and four VCs with DVCA (dvc4).Here, we increase the number of virtual channels while keeping the amount of bufiering spaceper virtual channel constant. When using our complexity-efiective DRAM scheduling solution,BFIFO+HG and BFIFO+HMHG4, performance decreases by 21.3% and 21.7% respectively as thenumber of virtual channels is increased from one to four while using DVCA. Using SVCA (svc4)recovers most of this performance loss, improving upon dvc4 by 18.5% and 17.7%. In general,we flnd that adding virtual channels does not improve IPC, which matches one of the flndings ofBakhoda et al. [5] that CUDA workloads are generally insensitive to interconnect latency.6.3.4 Sensitivity analysisTo show the versatility of our interconnect modiflcations, we test them across two major microar-chitectural parameters: the network topology and the DRAM controller queue size.As Figure 6.21 shows, our interconnect modiflcations perform fairly similarly for two very difier-766.3. Complexity-efiective memory access scheduling01020304050Mesh Crossbar RingIPCFIFO BFIFO BFIFO+HG BFIFO+HMHG4 FR-FCFSFigure 6.21: Harmonic mean IPC across all applications for difierent network topologies (Queue size =32)ent network topologies: a crossbar and a mesh. For these topologies, HMHG4 achieves 86.0% and84.7% of the performance of FR-FCFS for memory-bound applications while requiring signiflcantlyless bit comparisons than FR-FCFS. For comparison, we also present the results for a ring net-work, which requires a minimum of two virtual channels for deadlock avoidance [12]. The deadlockavoidance algorithm for the ring imposes restrictions on which virtual channels can be used byany input-output combination to remove cyclic dependencies. To provide adequate bandwidth, weprovision the ring network with a 512-bit wide datapath similar to the one in Larrabee [49]. Whilea ring network has the same number of nodes as a mesh, each router has only three input andoutput ports, one to the left, one to the right, and one to the shader core or DRAM controller thatit is connected to. In comparison, each mesh router has 5 input and output ports, so the complexityas detailed in Section 4.5 of the ring will be 60% (or 3/5ths) of that of the mesh. As shown, ourinterconnect modiflcations do not work as well for the ring network due to the interleaving of mem-ory request streams from having multiple virtual channels. In-depth analysis showed that the ringnetwork had more than six times as many row streak breakers as the mesh and crossbar networks,thus achieving only 6.3% speedup over a banked FIFO memory controller with no interconnectmodiflcations. We leave the development of an interconnect modiflcation that works better for thiscase to future work.Figure 6.22 shows the performance of our interconnect modiflcations harmonically averaged776.3. Complexity-efiective memory access scheduling0%20%40%60%80%100%8 16 32 64IPC Normalized to FR-FCFSFIFO BFIFO BFIFO+HG BFIFO+HMHG4 FR-FCFSFigure 6.22: Harmonic mean IPC normalized to FR-FCFS for difierent DRAM controller queue sizes(Crossbar network)across all applications in a crossbar network and with difierent DRAM controller queue sizes.For the smallest conflguration of DRAM controller queue size 8, BFIFO+HMHG4 achieves 91.0%of the IPC of FR-FCFS. While the performance relative to FR-FCFS decreases as the DRAMcontroller queue size increases, the complexity of an increasing FR-FCFS DRAM controller queuealso increases since the number of comparisons per cycle scales for FR-FCFS [15, 44] but remainsconstant for a banked FIFO controller with our interconnect modiflcations.78Chapter 7Related WorkThis chapter describes related work. First we summarize prior work in GPU architecture simulationand benchmarking. Then we introduce other analytical models presented in the fleld of computerarchitecture, some speciflc to DRAM. Finally we describe recent work in DRAM controller design.7.1 GPU architecture simulationExisting graphics-oriented GPU simulators include Qsilver [51], which does not model pro-grammable shaders, and ATTILLA [14], which focuses on graphics speciflc features. Ryoo etal. [45] use CUDA to speedup a variety of relatively easily parallelizable scientiflc applications.They explore the use of conventional code optimization techniques and take advantage of the dif-ferent memory types available on NVIDIA’s 8800GTX to obtain speedup. While their analysis isperformed by writing and optimizing applications to run on actual CUDA hardware, we use ournovel performance simulator to observe the detailed behavior of CUDA applications upon varyingarchitectural parameters.We quantifled the efiects of varying the DRAM controller queue conflguration and shader coreresource limits which, to our knowledge, has not been published previously. While the authors of theCUDA applications which we use as benchmarks have published work, the emphasis of their paperswas not on how changes in the GPU architecture can afiect their applications [3, 7, 17, 18, 20, 27{29, 41, 45, 48, 54]. To our knowledge, GPGPU-Sim is the flrst published detailed performancesimulator that supports running CUDA applications.797.2. Analytical modelsAsofJuly2009, thereareseveralapplicationsuitessuitableforGPUarchitecturestudy. Mahesriet al. introduce a class of applications for visualization, interaction, and simulation [26]. Theypropose using an accelerator architecture (xPU) separate from the GPU to improve performance oftheir benchmark suite. Ryoo et al.’s Parboil benchmark suite [45] has been introduced speciflcallyto target Nvidia GPUs, being written in CUDA. Parboil provides a central script that provides acommon interface for execution of all applications on both the GPU and CPU. Introduced mostrecently, the Rodinia benchmark suite [9] targets multicore CPUs and Nvidia GPUs by providingapplication source code written in both OpenMP and CUDA.7.2 Analytical modelsAhn et al. [2] explore the design space of DRAM systems. They study the efiects on throughput ofvarious memory controller policies, memory controller queue sizes, and DRAM timing parameterssuch as the Write to Read delay and the Burst Length. More relevant to our work, they alsopresent an analytical model for expected throughput assuming a \random indexed" access patternand a flxed \record length". This essentially means that they assume a constant flxed numberof requests accessed per row and a continuous stream of requests whereas our model handles anymemory access pattern. Furthermore, they only show the results of their analytical model for twomicro-benchmarks while we test a suite of applications with various memory access patterns. Weleave a quantitative comparison between their model and ours to future work.In his PhD thesis, Wang [53] presents an equation for calculating DRAM e–ciency by accountingfor the idle cycles in which the bus is not used. His model assumes that the request stream fromthe memory controller to DRAM is known while we also account for optimizations to the requeststream made by the memory controller to help hide timing constraint delays. He also considersonly single-threaded workloads, for which he observes high degrees of access locality. Our massivelymulti-threaded workloads have a wide range of access localities.807.3. DRAM controller designsThere exist many analytical models proposed for a variety of microprocessors [10, 11, 22, 24,30, 31, 35, 40], from out-of-order execution superscalar to in-order execution flne-grained multi-threaded processors to even GPU architectures. Agarwal et al. [1] also present an analytical cachemodel estimating cache miss rate when given cache parameters such as cache size, cache line size,associativity, etc., and their model also requires an address trace of the program being analyzed. Ofthese microprocessor analytical models, only Karkhanis and Smith [24], Chen and Aamodt [10, 11],and Hong and Kim [22] model long, albeit flxed, memory latency.7.3 DRAM controller designsThere exist many DRAM scheduler designs proposed for multi-core systems [32{34, 42]. Theprimary focus of these designs revolve around the principles of providing Quality-of-Service (QoS)or fairness for difierent threads and cores competing for shared ofi-chip bandwidth [32, 34, 42].When multiple applications run simultaneously on a system, the memory access tra–c of oneapplication can cause a naive DRAM controller to starve out the requests of another. To thebest of our knowledge, this is the flrst work that addresses the problem of e–cient DRAM accessscheduling in a massively multithreaded GPU architecture with tens of thousands of threads.012345678910fwt lib mum neu nn ray red sp wpUnique Banks Requested per WarpFigure 7.1: Average number of unique banks requested per warp memory operationMutlu et al. [33] present a parallelism-aware batching scheduler (PAR-BS) that coordinatesthe servicing of multiple requests from a single thread, particularly those to difierent banks in aDRAM chip, to reduce the average memory latency experienced across all threads. We anticipate817.3. DRAM controller designsthe design of such a scheduler to support requests from tens of thousands of threads, which ispossible in GPU architectures, to be highly complex and area-intensive. Furthermore, threads stallat the warp-level, such that all threads in a warp must be ready before the warp can be issued tothe shader core pipeline. We measure the average number of unique banks to which a warp sendsrequests, shown in Figure 7.1. Five out of the nine applications always send requests to only asingle bank per warp. In our hold-grant interconnect arbiter, the requests in a warp will be kepttogether as they traverse the interconnect and they will all arrive at the same bank, so batching isalready done. In mum, each warp sends requests to at least three difierent memory controllers perwarp memory operation (since each DRAM chip has only 4 banks). For PAR-BS to perform wellin this case, there must be coordinated scheduling across multiple memory controllers.82Chapter 8Conclusions and Future WorkThis chapter concludes the thesis and discusses some future work.8.1 ConclusionsIn this thesis, we detailed the changes made to the memory system of GPGPU-Sim, a detailed per-formance simulator, to support CUDA applications. Speciflcally, support was added for memoryaccesses to local, constant, and texture memories. We studied the performance of twelve contem-porary CUDA applications by running them on GPGPU-Sim, flrst validating our simulator againstreal hardware and then studying the performance impact of varying the DRAM controller queuesize and varying the shader core resources that limit the maximum number of CTAs that can con-currently run on a shader core. We observed that sometimes running fewer CTAs concurrentlythan the limit imposed by on-chip resources can improve performance by reducing contention inthe memory system.In this thesis we have also proposed a novel hybrid analytical DRAM model which takes intoaccount the efiects of Out-of-Order memory scheduling and hiding of DRAM timing delays. Weshowed that this model can be used with a sliding window proflling technique to predict the DRAMe–ciency over the entire runtime of an application, given we have its full memory request addresstrace. We chose a massively multi-threaded architecture connected to a set of high-bandwidthGDDR3-SDRAM graphics memory as our simulation framework and evaluated the accuracy ofour model on a set of real CUDA applications with diverse dynamic memory access patterns. We838.1. Conclusionsintroduce two difierent heuristics to use in conjunction with the sliding window proflling approachthat serve as upper and lower bounds on the achieved bank-level parallelism. By averaging thepredictions of these two contrasting heuristics, we were able to predict the DRAM e–ciency of aset of real applications to within an arithmetic mean of 11.2% absolute error.Finally, we introduced a novel, complexity-efiective DRAM access scheduling solution thatachieves system throughput within up to 91.0% of that achievable with aggressive out-of-orderscheduling for a set of memory-limited applications. Our solution relies on modiflcations to theinterconnect that relays memory requests from shader cores to DRAM controllers. These modi-flcations leverage the key observation that the DRAM row bufier access locality of the memoryrequest streams seen at the DRAM controller after they pass through the interconnect is muchworse than that access locality of the individual memory request streams from the shader coreinto the interconnect. Three such modiflcations are possible: either holding grant for a routerinput port as long as there are pending requests to the same destination (HG), holding grant for arouter input port as long as there are pending requests to the same destination and the requestedrow matches the requested row of the previous arbitrated request (RMHG), or holding grant fora router input port as long as there are pending requests to the same destination and the hashof the requested row matches the hash of the requested row of the previous arbitrated request(HMHG). These modiflcations work to preserve the inherent DRAM row bufier access locality ofmemory request streams from individual shader cores which would otherwise be destroyed due tothe interleaving of memory request streams from multiple shader cores. In doing so, it allows fora simple in-order memory scheduler at the DRAM controller to achieve much higher performancethan if the interconnect did not have such modiflcations across difierent network topologies andmicroarchitectural conflgurations.848.2. Future work8.2 Future workThis section discusses the new areas of research to where this thesis can be extended.8.2.1 Improving the analytical DRAM modelOur hybrid analytical model relies on running applications a single time in performance simulationto collect memory request address traces. Relying on performance simulation for this processmay be undesirable, reducing the usefulness of our model. We anticipate that this process can besped up by instead running the applications in functional simulation and then performing somesort of simple interleaving to obtain a representative memory request address trace. To addressthe average error of our model compared to performance simulation, we showed that applicationsthat are predicted poorly by one heuristic tend to be predicted much better with the other andthat a simple combination of the two heuristics resulted in signiflcant improvement in predictionaccuracy. This implies the possibility for further improvement by combining the two heuristics inmore intelligent ways.8.2.2 Combining the analytical DRAM model with analytical processormodelsTo our knowledge, all previous work that propose analytical modeling of processors [10, 11, 22, 24,30, 31, 35, 40] model DRAM latency as a constant flxed latency. As this thesis has shown, DRAMlatency varies dramatically from one application to the next, especially if DRAM is heavily utilized.Intuitively, the accuracy of such analytical models for processors can be improved by incorporatinga more comprehensive analytical DRAM model that can predict more accurate memory latencies.The flrst step would be extending the analytical DRAM model described in this thesis to predictmemory latency.858.2. Future work8.2.3 Design space exploration of intelligent interconnectsOur proposed \hold-grant" interconnect arbitration modiflcations has been shown to preserve therow access locality of memory request streams of general purpose applications running on GPUs.However, the majority of GPUs has and will always be targeted to the gaming community. Assuch, the experiments in Section 6.3 should be tried on graphics applications as well. Moreover,our modiflcations should be tried on other interconnect allocation policies, such as wave-front andiSLIP allocation [12]. If the targeted interconnect architecture has multiple virtual channels todeal with deadlock, such as ring and torus architectures, more work must also be done to improvesystem performance when using our interconnect arbitration scheme.86Bibliography[1] A. Agarwal, J. Hennessy, and M. Horowitz. An analytical cache model. ACM Trans. Comput.Syst., 7(2):184{215, 1989.[2] Jung Ho Ahn, Mattan Erez, and William J. Dally. The Design Space of Data-Parallel MemorySystems. In SC 2006: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing,page 80.[3] Samer Al-Kiswany, Abdullah Gharaibeh, Elizeu Santos-Neto, George Yuan, and Matei Ri-peanu. StoreGPU: exploiting graphics processing units to accelerate distributed storage sys-tems. In Proc. 17th Int’l Symp. on High Performance Distributed Computing, pages 165{174,2008.[4] Thomas E. Anderson, Susan S. Owicki, James B. Saxe, and Charles P. Thacker. High-speedswitch scheduling for local-area networks. ACM Trans. Comput. Syst., 11(4):319{352, 1993.[5] Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. Ana-lyzing CUDA Workloads Using a Detailed GPU Simulator. In IEEE International Symposiumon Performance Analysis of Systems and Software (ISPASS 2009), pages 163{174, April 2009.[6] Beyond3D. NVIDIA GT200 GPU and Architecture Analysis.http://www.beyond3d.com/content/reviews/51.[7] Billconan and Kavinguy. A Neural Network on GPU.http://www.codeproject.com/KB/graphics/GPUNN.aspx.87Chapter 8. Bibliography[8] Darrell Boggs, Aravindh Baktha, Jason Hawkins, Deborah T. Marr, J. Alan Miller, PatriceRoussel, Ronak Singhal, Bret Toll, and K.S. Venkatraman. The Microarchitecture of the IntelR PentiumR 4 Processor on 90nm Technology. IntelR Technology Journal, 8(1), 2004.[9] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheafier, and KevinSkadron. A performance study of general-purpose applications on graphics processors usingcuda. Journal of Parallel and Distributed Computing, 68(10):1370 { 1380, 2008. General-Purpose Processing using Graphics Processing Units.[10] X.E. Chen and T.M. Aamodt. A flrst-order flne-grained multithreaded throughput model. InHPCA 2009: Proceedings of the 15th Annual IEEE International Symposium on High Perfor-mance Computing and Applications.[11] Xi E. Chen and Tor M. Aamodt. Hybrid analytical modeling of pending cache hits, dataprefetching, and MSHRs. In MICRO 2008: Proceedings of the 41st Annual IEEE/ACM In-ternational Symposium on Microarchitecture.[12] W. J. Dally and B. Towles. Interconnection Networks. Morgan Kaufmann, 2004.[13] Brian Thomas Davis. Modern DRAM architectures. PhD thesis, University of Michigan, 2001.[14] V.M. del Barrio, C. Gonzalez, J. Roca, A. Fernandez, and Espasa E. ATTILA: a cycle-level execution-driven simulator for modern GPU architectures. Int’l Symp. on PerformanceAnalysis of Systems and Software, pages 231{241, March 2006.[15] Roger E. Eckert. Page Stream Sorter for DRAM Systems, Assignee NVIDIA Corporation,United States Patent 7,376,803, May 2008.[16] Wilson W. L. Fung, Ivan Sham, George Yuan, and Tor M. Aamodt. Dynamic warp formation88Chapter 8. Bibliographyand scheduling for e–cient GPU control  ow. In Proc. 40th IEEE/ACM Int’l Symp. onMicroarchitecture, 2007.[17] Mike Giles. Jacobi iteration for a Laplace discretisation on a 3D structured grid.http://people.maths.ox.ac.uk/~gilesm/hpc/NVIDIA/laplace3d.pdf.[18] Mike Giles and Su Xiaoke. Notes on using the NVIDIA 8800 GTX graphics card.http://people.maths.ox.ac.uk/~gilesm/hpc/.[19] Ziyad S. Hakura and Anoop Gupta. The design and analysis of a cache architecture for texturemapping. In Proc. 24th Int’l Symp. on Computer Architecture, pages 108{120, 1997.[20] Pawan Harish and P. J. Narayanan. Accelerating Large Graph Algorithms on the GPU UsingCUDA. In HiPC, pages 197{208, 2007.[21] Glenn Hinton, Dave Sager, Mike Upton, Darrell Boggs, Doug Carmean, Alan Kyker, andPatrice Roussel. The Microarchitecture of the PentiumR 4 Processor. IntelR TechnologyJournal, 5(1), 2001.[22] Sunpyo Hong and Hyesoon Kim. An Analytical Model for a GPU Architecture with Memory-level and Thread-level Parallelism Awareness. In To appear in ISCA 2009: Proceedings of the36th Annual International Symposium on Computer Architecture.[23] Illinois Microarchitecture Project utilizing Advanced Compiler Technology Research Group.Parboil benchmark suite. http://www.crhc.uiuc.edu/IMPACT/parboil.php.[24] Tejas S. Karkhanis and James E. Smith. A flrst-order superscalar processor model. In ISCA2004: Proceedings of the 31st Annual International Symposium on Computer Architecture,page 338.89Chapter 8. Bibliography[25] David Kroft. Lockup-free instruction fetch/prefetch cache organization. In ISCA 1981: Pro-ceedings of the 8th annual symposium on Computer Architecture, pages 81{87, 1981.[26] Aqeel Mahesri, Daniel Johnson, Neal Crago, and Sanjay J. Patel. Tradeofis in designingaccelerator architectures for visual computing. In Proc. 41st IEEE/ACM Int’l Symp. on Mi-croarchitecture, 2008.[27] S. A. Manavski. CUDA compatible GPU as an e–cient hardware accelerator for AES cryptog-raphy. In ICSPC 2007: Proc. of IEEE Int’l Conf. on Signal Processing and Communication,pages 65{68, 2007.[28] Maxime. Ray tracing. http://www.nvidia.com/cuda.[29] J. Michalakes and M. Vachharajani. GPU acceleration of numerical weather prediction. IPDPS2008: IEEE Int’l Symp. on Parallel and Distributed Processing, pages 1{7, April 2008.[30] Pierre Michaud, Andre Seznec, and Stephan Jourdan. Exploring instruction-fetch bandwidthrequirement in wide-issue superscalar processors. In PACT 1999: Proceedings of the EighthInternational Conference on Parallel Architectures and Compilation Techniques, 1999.[31] Pierre Michaud, Andr¶e Seznec, and St¶ephan Jourdan. An exploration of instruction fetchrequirement in out-of-order superscalar processors. Int. J. Parallel Program., 29(1):35{58,2001.[32] Onur Mutlu and Thomas Moscibroda. Stall-time fair memory access scheduling for chip mul-tiprocessors. In MICRO-40, pages 146{160, 2007.[33] Onur Mutlu and Thomas Moscibroda. Parallelism-aware batch scheduling: Enhancing bothperformance and fairness of shared dram systems. In ISCA ’08, pages 63{74, Washington,DC, USA, 2008. IEEE Computer Society.90Chapter 8. Bibliography[34] Kyle J. Nesbit, Nidhi Aggarwal, James Laudon, and James E. Smith. Fair queuing memorysystems. In MICRO-39, pages 208{222, 2006.[35] Derek B. Noonburg and John P. Shen. Theoretical modeling of superscalar processor perfor-mance. In MICRO 27: Proceedings of the 27th annual international symposium on Microar-chitecture, 1994.[36] NVIDIA. Geforce 8 series. http://www.nvidia.com/page/geforce8.html.[37] NVIDIA Corporation. NVIDIA CUDA SDK code samples.http://developer.download.nvidia.com/compute/cuda/ sdk/website/samples.html.[38] NVIDIACorporation. NVIDIAQuadro4XGL. http://www.nvidia.com/page/quadro4xgl.html.[39] NVIDIA Corporation. NVIDIA CUDA Programming Guide, 1.1 edition, 2007.[40] D. J. Ofelt. E–cient Performance Prediction for Modern Microprocessors. PhD thesis, Stan-ford University, 1999.[41] Pcchen. N-Queens Solver.http://forums.nvidia.com/index.php?showtopic=76893.[42] Nauman Raflque, Won-Taek Lim, and Mithuna Thottethodi. Efiective management of drambandwidth in multicore processors. In PACT ’07, pages 245{258, 2007.[43] Scott Rixner, William J. Dally, Ujval J. Kapasi, Peter Mattson, and John D. Owens. Memoryaccess scheduling. In ISCA 2000: Proceedings of the 27th Annual International Symposium onComputer Architecture, pages 128{138.[44] Hemant G. Rotithor, Randy B. Osborne, and Nagi Aboulenein. Method and Apparatus for91Chapter 8. BibliographyOut of Order Memory Scheduling, Assignee Intel Corporation, United States Patent 7,127,574,October 2006.[45] Shane Ryoo, Christopher I. Rodrigues, Sara S. Baghsorkhi, Sam S. Stone, David B. Kirk,and Wen-mei W. Hwu. Optimization principles and application performance evaluation of amultithreaded GPU using CUDA. In Proc. 13th ACM SIGPLAN Symp. on Principles andPractice of Parallel Programming, pages 73{82, 2008.[46] Shane Ryoo, Christopher I. Rodrigues, Sam S. Stone, Sara S. Baghsorkhi, Sain-Zee Ueng,John A. Stratton, and W. M. W. Hwu. Program Optimization Space Pruning for a Mul-tithreaded GPU. In CGO 2008: Proceedings of the Sixth Annual IEEE/ACM InternationalSymposium on Code Generation and Optimization, pages 195{204.[47] Samsung. 512Mbit GDDR3 SDRAM, Revision 1.0 (Part No. K4J52324QC-B).http://www.alldatasheet.com/datasheet-pdf/pdf/112724/SAMSUNG/K4J52324QC.html,March 2005.[48] Michael Schatz, Cole Trapnell, Arthur Delcher, and Amitabh Varshney. High-throughputsequence alignment using Graphics Processing Units. BMC Bioinformatics, 8(1):474, 2007.[49] Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash, Pradeep Dubey,StephenJunkins, AdamLake, JeremySugerman, Robert Cavin, Roger Espasa, Ed Grochowski,Toni Juan, and Pat Hanrahan. Larrabee: a many-core x86 architecture for visual computing.In SIGGRAPH ’08: ACM SIGGRAPH 2008 papers, pages 1{15, New York, NY, USA, 2008.ACM.[50] J. G. Shanthikumar and R. G. Sargent. A Unifying View of Hybrid Simulation/AnalyticModels and Modeling. Operations Research, pages 1030{1052, 1983.92Chapter 8. Bibliography[51] J. W. Sheafier, D. Luebke, and K. Skadron. A  exible simulation framework for graphics archi-tectures. In Proc. ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware,pages 85{94, 2004.[52] Standard Performance Evaluation Corporation. SPEC CPU2006 benchmarks.http://www.spec.org/cpu2006/.[53] David Tawei Wang. Modern DRAM Memory Systems: Performance Analysis and a High Per-formance Power-Constrained DRAM Scheduling Algorithm. PhD thesis, University of Mary-land at College Park, 2005.[54] Timothy C. Warburton. Mini Discontinuous Galerkin Solvers.http://www.caam.rice.edu/~timwar/RMMC/MIDG.html.[55] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta.The splash-2 programs: characterization and methodological considerations. In ISCA ’95:Proceedings of the 22nd annual international symposium on Computer architecture, pages 24{36, New York, NY, USA, 1995. ACM.93


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items