You may notice some images loading slow across the Open Collections website. Thank you for your patience as we rebuild the cache to make images load faster.

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Inter-core locality aware memory access scheduling Li, Dongdong 2015

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.

Item Metadata


24-ubc_2015_may_li_dongdong.pdf [ 837.53kB ]
JSON: 24-1.0166115.json
JSON-LD: 24-1.0166115-ld.json
RDF/XML (Pretty): 24-1.0166115-rdf.xml
RDF/JSON: 24-1.0166115-rdf.json
Turtle: 24-1.0166115-turtle.txt
N-Triples: 24-1.0166115-rdf-ntriples.txt
Original Record: 24-1.0166115-source.json
Full Text

Full Text

Inter-core Locality Aware Memory Access SchedulingbyDongdong LiB.E., Beihang University, 2012A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of Applied ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES(Electrical and Computer Engineering)The University of British Columbia(Vancouver)April 2015c© Dongdong Li, 2015AbstractGraphics Processing Units (GPUs) run thousands of parallel threads and achieve high Mem-ory Level Parallelism (MLP). To support high MLP, a structure called a Miss-Status Hold-ing Register (MSHR) handles multiple in-flight miss requests. When multiple cores send re-quests to the same cache line, the requests are merged into one last level cache MSHR entryand only one memory request is sent to the Dynamic Random-Access Memory (DRAM).We call this inter-core locality. The main reason for inter-core locality is that multiplecores access shared read-only data within the same cache line. By prioritizing memoryrequests that have high inter-core locality, more threads resume execution. Many memoryaccess scheduling policies have been proposed for general-purpose multi-core processorsand GPUs. However, some of these policies do not consider the characteristic of GPUs andothers do not utilize inter-core locality information.In this thesis, we analyze the reasons that inter-core locality exists and show that re-quests with more inter-core locality have a higher impact performance. To exploit inter-core locality, we enable the GPU DRAM controller to be aware of inter-core locality byusing Level 2 (L2) cache MSHR information. We propose a memory scheduling policy tocoordinate the last level cache MSHR and the DRAM controller. 1) We introduce a struc-ture to enable the DRAM to be aware of L2 cache MSHR information. 2) We propose aiiAbstractmemory scheduling policy to use L2 cache MSHR information. 3) To prevent starvation,we introduce age information to the scheduling policy.Our evaluation shows a 28% memory request latency reduction and an 11% perfor-mance improvement on the average for high inter-core locality benchmarks.iiiPrefaceThis dissertation is based on a project conducted by myself and supervised by Tor M.Aamodt. I was responsible for proposing the solution, creating the timing simulation tovalidate the solution, analyzing experimental result for all parts of this project. None of thetext in the dissertation is directly taken from previously published or collaborative articles.ivTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixList of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.3 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9vTable of Contents2.1 Baseline GPU Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 92.2 Cache and Miss Status Holding Registers . . . . . . . . . . . . . . . . . . 112.3 DRAM Controller and DRAM . . . . . . . . . . . . . . . . . . . . . . . . 123 Inter-core Locality Benefits Performance . . . . . . . . . . . . . . . . . . . . 163.1 A Key Observation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173.2 Critical Path Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204 Inter-core Locality Aware Memory Access Scheduling . . . . . . . . . . . . . 224.1 Overview of Inter-core Locality Aware Memory Access Scheduling . . . . 224.2 Row Score Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254.3 Reduce Latency by Age Information . . . . . . . . . . . . . . . . . . . . . 264.4 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274.5 Hardware Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315.1 Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315.2 Classification of Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . 356 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386.1 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396.2 Detailed Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416.3 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466.3.1 L2 Cache to DRAM Latency . . . . . . . . . . . . . . . . . . . . . 466.3.2 DRAM Queue Size . . . . . . . . . . . . . . . . . . . . . . . . . . 48viTable of Contents7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507.1 Early Memory Access Scheduling Explorations . . . . . . . . . . . . . . . 507.2 Fairness and Throughput Memory Access Scheduling . . . . . . . . . . . . 517.3 Memory Requests Prioritization . . . . . . . . . . . . . . . . . . . . . . . 537.4 Complexity Effective Memory Access Scheduling . . . . . . . . . . . . . . 548 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58viiList of TablesTable 5.1 Baseline Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . 32Table 5.2 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34viiiList of FiguresFigure 1.1 The source of inter-core locality and the effects of delaying inter-corelocality requests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5Figure 2.1 Baseline GPU architecture . . . . . . . . . . . . . . . . . . . . . . . . 10Figure 2.2 MSHR structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11Figure 2.3 Baseline DRAM controller structure . . . . . . . . . . . . . . . . . . . 13Figure 2.4 DRAM timing constrains, we assume row 1 is the current open-rowat the beginning of the clock. (RD = Read, PRE = Precharge, ACT =Activate, WR = Write, R1 = Row 1, R2 = Row 2, D1 = Data for Row1, D2 = Data for Row 2) . . . . . . . . . . . . . . . . . . . . . . . . . 14Figure 3.1 Example of using inter-core locality aware scheduling generates more∆IPC. Wx represents warp x, Rx represents the memory requests thatare sent by warp x. Wx+y indicates that warp x has executed next yinstructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19ixList of FiguresFigure 4.1 Overview of inter-core locality aware memory access scheduling. R0to R8 represents request 0 to request 8. Red numbers beside requestsare request scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23Figure 4.2 Implementation of inter-core locality aware memory access scheduling 28Figure 5.1 Speedup using perfect DRAM . . . . . . . . . . . . . . . . . . . . . . 35Figure 5.2 Cycle distribution of inter-core locality across benchmarks. We use theL2 MSHR merge length to represent inter-core locality. . . . . . . . . 36Figure 6.1 IPC for the memory sensitive, high inter-core locality applications . . . 39Figure 6.2 IPC for the memory sensitive, low inter-core locality applications . . . 39Figure 6.3 IPC for memory insensitive application . . . . . . . . . . . . . . . . . 40Figure 6.4 L2 reservation fails . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41Figure 6.5 Maximum memory request latency . . . . . . . . . . . . . . . . . . . . 42Figure 6.6 Average memory request latency . . . . . . . . . . . . . . . . . . . . . 43Figure 6.7 Data dependency stall, normalized to FR-FCFS . . . . . . . . . . . . . 44Figure 6.8 Row locality, the y-axis indicates average row-hit of all memory requests 45Figure 6.9 DRAM bandwidth utilizaiton . . . . . . . . . . . . . . . . . . . . . . . 46Figure 6.10 IPC normalized to FR-FCFS for different L2 to DRAM latency withMSHR-S+A scheduling policy . . . . . . . . . . . . . . . . . . . . . . 47Figure 6.11 IPC normalized to FR-FCFS for different read request queue size withMSHR-S+A scheduling policy . . . . . . . . . . . . . . . . . . . . . . 48xList of AbbreviationsCPU Central Processing UnitGPU Graphics Processing UnitFIFO First-In First-OutFR-FCFS First-Ready First-Come First-ServeSJF Shortest-Job-FirstSIMT Single-Instruction, Multiple-ThreadCUDA Compute Unified Device ArchitectureOpenCL Open Computing LanguageL1 Level 1L2 Level 2MSHR Miss-Status Holding RegisterSDK Software Development KitIPC Instructions Per CyclexiList of AbbreviationsDRAM Dynamic Random-Access MemoryGPGPU General-Purpose computing on Graphics Processing UnitMIB Merge Information BufferMLP Memory Level ParallelismBFS Breadth-First-SearchHIL High Inter-core LocalityLIL Low Inter-core LocalityRAW Read After WriteWAW Write After WriteROB Reorder-BufferTB Thread BlockGTO Greedy-then-OldestxiiAcknowledgmentsI would like to thank my supervisor Professor Tor M. Aamodt for the support and insighthe has given during all these years. Tor always motivated me to think independently andgave me guidance at the same time. Without him, this work would not have been possible.I would also like to thank everyone in computer architecture group for everything Ilearned from them. They provide useful feedbacks on my work. I am grateful to my friendsfor helping me with real problems either in research or in life.Finally I have a special thanks to my family. They always inspire me and provide solidsupport. Their love, patience and understanding encourage me to overcome every obstaclein my life. I would not be here today without them.I would like to thank Qualcomm for providing financial support for this work.xiiiChapter 1IntroductionGraphics Processing Units (GPUs) enable general-purpose computing via programminginterfaces like Open Computing Language (OpenCL) [21] and Compute Unified DeviceArchitecture (CUDA) [1]. The general-purpose computing ability broadens the range ofapplications on GPUs. For example, machine learning algorithms like the neural network isaccelerated by GPU using CUDA [17, 23, 31, 41]. GPU is also used for big data analysis indata science research. Stanford AI Lab accelerated deep learning algorithm using a clusterof GPU servers [11]. Engineering and mathematic tools such like Matlab and Mathematicause GPU to accelerate numerical computation [3]. Bioinformatics and life science useGPU to improve throughput of the DNA sequencing alignment [38]. The extensive useof GPU described above shows the importance of GPUs. In the light of this, researchersare motivated to do research on GPUs to exploit the potential power of GPUs. There arenumbers of works on different aspects of GPU architectures.Memory system is an important part of GPUs. The applications described above re-quire a large volume of data resulting in many memory accesses. Today’s GPU architec-1Chapter 1: Introductiontures typically employ a Single-Instruction, Multiple-Thread (SIMT) [26] architecture thatruns thousands of parallel threads. Unlike threads in the Central Processing Units (CPUs),each of which runs different instructions, the GPU has a group of parallel threads, knownas a warp in NVIDIA terminology, that run the same instruction in lockstep. On a loadinstruction, a number of memory requests are issued by these warps. A warp will stall inthe pipeline if it reaches an instruction that needs the data that has not yet been returnedfrom a previous load instruction. We show it is important to resolve such pending memoryrequests quickly to reduce the number of stalled warps and maintain high throughput for theGPU and hence high performance. However, the performance of Dynamic Random-AccessMemory (DRAM) is lower than SIMT cores resulting in a memory bandwidth wall. Sincethousands of threads are running in parallel, there are a large number of memory requestswhich aggravate this problem.Researchers proposed a number of methods to alleviate the memory bandwidth wallfrom different aspects for both multicore CPU and GPU-like many-core processors. Anumber of memory controller designs have been proposed to improve DRAM bandwidthutilization based on out-of-order scheduling [27, 32, 34, 36]. To reduce contention of mem-ory requests, Cheng et al. [10] proposed an analytical model to estimate next phase systemperformance based on the computation to memory ratio. From the computation to mem-ory ratio, the system can decide if it needs to suspend a thread to avoid memory requestscongestion. Mutlu and Moscibroda [29, 30] proposed two memory access schedulers toavoid memory requests interference and ensure memory requests fairness between threadsfor multicore CPUs. Chatterjee et al. [7] proposed a memory access scheduler to balancememory access latency between all memory accesses for GPUs. Yuan et al. [42] proposedmodifications to the on-chip network between processors and memory partitions to reserve2Chapter 1: Introductionrow locality of the requests. Using this modified on-chip network, the memory controllercan use simple First-In First-Out (FIFO) policy to reduce hardware complexity. Ghose et al.[14] observed that history requests latency in the Reorder-Buffer (ROB) can be used to pre-dict criticality of the next request. They proposed a predictor to prioritize the critical requestfrom the processor side. We will discuss other related works in Chapter 7.All of the above works tackle the memory bandwidth wall problem. However, there aretwo problems with these works. First, some of them do not apply to GPUs since they donot scale to thousands of threads. Second, some of these works do not consider criticalityof requests. The state-of-art GPUs use a large number of threads to hide memory accesslatency. While some threads are waiting for the data from the memory, other threads canoccupy the computation resources. Thus, memory access latency is not as critical in GPUs.Modern GPUs also exploit several techniques to improve memory bandwidth utilization.Within an SIMT core, a coalescing unit is used to reduce the number of memory requestsgenerated by a warp. On-chip scratch pad and Level 1 (L1) caches are used to reducelong latency memory accesses. Between SIMT cores, a unified last level cache is usedto capture locality that is not captured by the L1 caches and/or scratch pads. Memoryrequests accessing the same cache line issued by a single core are merged into one L1cache Miss-Status Holding Register (MSHR) entry. Such memory requests exhibit whatwe call intra-core locality. When warps from different cores issue requests that access thesame cache line within a short timeframe, they can be merged into one Level 2 (L2) cacheMSHR entry and only one request is sent to the DRAM controller. Our experiments showthat for a single memory request in the DRAM controller, there are often multiple requestsissued from different cores merged together in the L2 cache MSHRs. We call this inter-corelocality.31.1. Motivation1.1 MotivationWe will describe the reason for inter-core locality in this section. Figure 1.1a shows aCUDA code example for which inter-core locality occurs. The code is Breadth-First-Search (BFS) from the Rodinia benchmark [8, 9]. Each graph node in the BFS is repre-sented by a single thread. Multiple nodes can be assigned to a single core. Assume node0 is currently running on Core 0, node 1024 is running on Core 1 and they connect to acommon node X which is nextNodeID in Figure 1.1a. In line 3, they both want to accessg_graph_visited[X]. Because they are running on different cores, neither coalesc-ing unit nor L1 cache can help. The memory request to g_graph_visited[X] will bemerged into one last level MSHR entry and only one memory request is sent to DRAMcontroller. Inter-core locality occurs in this situation. However, this request may not be pri-oritized as the DRAM controller is not aware of how many warps are waiting for the request.This problem cannot be avoided by changing threads scheduling policy since common nodeinformation cannot be predicted.As illustrated by the above example, inter-core locality can occur in General-Purposecomputing on Graphics Processing Unit (GPGPU) applications when they access largeshared data structures. To help quantifying the potential presented by inter-core locality,we analyze five different kinds of applications by delaying 1000 requests either with orwithout inter-core locality. Figure 1.1b shows the performance impact, which shows thatdelaying requests with inter-core locality has more performance impact than delaying re-quests without inter-core locality. We evaluate the impact when the memory schedulingpolicy takes inter-core locality into account. Section 3.2 will show more analysis on thisresult.Modern GPU memory controllers employ an First-Ready First-Come First-Serve (FR-FCFS)41.1. MotivationNode 0Node 1024Node 0: int nodeID = blockIdx.x*MAX_THREADS_PER_BLOCK + threadIdx.x;1: for(i in all edges connected to current nodeID) {2:   int nextNodeID = g_graph_edges[i];3:   if(!g_graph_visited[nextNodeID]) {4:     g_cost[nextNodeID]=g_cost[nodeID]+1;5:     g_updating_graph_mask[nextNodeID]=true;6:   }7: }Running on core 0 Running on core 1(a) BFS code in Rodinia benchmark shows the inter-core locality0%	  5%	  10%	  15%	  20%	  25%	  30%	  35%	  40%	  BFS	   SSSP	   TRANS	   MGST	   SP	  Performance	  Impact	  Delay	  Requests	  without	  Inter-­‐core	  Locality	   Delay	  Requests	  with	  Inter-­‐core	  Locality	  (b) Delay requests with inter-core locality and without inter-core localityFigure 1.1: The source of inter-core locality and the effects of delaying inter-core lo-cality requests51.2. Contributions[13, 35, 36] scheduling policy to reorder memory requests to achieve maximum DRAMbandwidth utilization. However, a FR-FCFS scheduler may delay a request with high inter-core locality in favor of greedily optimizing for bandwidth. By exploiting inter-core lo-cality, the DRAM system can reduce the overall warp stalling time, causing additionalmemory requests to be generated and thereby increasing future opportunities for bandwidthoptimizations.In this thesis, we propose an inter-core locality aware memory access scheduling policyand hardware. Inter-core locality can be represented using L2 MSHR merge length whichis defined as the number of outstanding requests within one MSHR entry in this thesis. Weenable the DRAM to be aware of this information. Our inter-core locality aware scheduleruses L2 MSHR merge information to schedule requests with high inter-core locality first.1.2 ContributionsThis thesis makes following contributions:• We make a key observation with a concrete example to show that performance canbe improved by prioritizing requests with high inter-core locality. We also show thatperformance is impacted if we delay requests with inter-core locality.• We quantify inter-core locality across 24 benchmarks from three benchmark suites.We first classify 24 benchmarks into memory insensitive and memory sensitive bench-marks using a perfect DRAM model. We further classify memory sensitive bench-marks into low inter-core locality and high inter-core locality benchmarks.• We propose an inter-core locality aware scheduler. We show that inter-core localitycan be captured by utilizing L2 MSHR merge information. We explored two policies61.3. Organizationthat consider DRAM row-buffer locality along with age information to reduce star-vation. We propose a hardware structure to enable the DRAM controller to be awareof L2 cache MSHR merge information.In the following sections, all MSHRs are L2 cache MSHRs unless specified otherwise.1.3 Organization• Chapter 2 describes the background on baseline GPU architecture, cache and MSHRstructure and memory system in this thesis.• Chapter 3 gives a key observation on inter-core locality using a concrete example. Itshows that by utilizing inter-core locality information on memory access scheduling,we can improve performance.• Chapter 4 describes details about inter-core locality aware memory access schedulingpolicy. Two types of DRAM row score selection policies are explored in this chapter.A structure is proposed to support proposed policy. Hardware overhead is estimatedin this chapter.• Chapter 5 describes the methodology used in this thesis. A detailed GPU configura-tion is presented. This chapter also classifies benchmarks into different categories forexperiments.• Chapter 6 describes the experiments in this thesis. An overall performance of ourinter-core locality memory access scheduling policy is presented. A detailed analysisof memory requests latency, L2 cache reservation and data dependency stall is givento have a comprehensive understanding of our proposed policy. A sensitivity study71.3. Organizationis presented to show the effects of different configurations that affects our schedulingpolicy.• Chapter 7 describes related work on memory access scheduling.• Chapter 8 describes future work.• Chapter 9 concludes inter-core locality memory access scheduling presented in thisthesis.8Chapter 2BackgroundIn this chapter, we describe our baseline architecture. Section 2.1 describes our baselineGPU architecture. Section 2.2 describes the MSHR structure and how it handles multipleoutstanding requests. Section 2.3 describes the DRAM model and DRAM controller indetail.2.1 Baseline GPU ArchitectureIn this work, we study a GPU-like many-core architecture. Figure 2.1 shows our baselinearchitecture. A GPU is composed of a number of SIMT cores (Streaming Multiprocessor inNVIDIA terminology and Computing Unit in AMD terminology) and a number of memorypartitions. There is an on-chip interconnection network connecting SIMT cores and mem-ory partitions [12, 19]. Each SIMT core consists of a number of Thread Blocks (TBs). Afunction runs on the GPU called a kernel. When a kernel is launched, a number of warpsare assigned to the TBs. Similar to NVIDIA, each warp consists of 32 threads in our model.Threads in a warp run in lockstep. When a warp executes a load instruction, a number of92.1. Baseline GPU ArchitectureInterconnection NetworkMemory PartitionSubpartitionL2 CacheMSHRsOff-chip Memory ControllerSubpartitionL2 CacheMSHRsMemory PartitionSubpartitionL2 CacheMSHRsOff-chip Memory ControllerSubpartitionL2 CacheMSHRsMemory PartitionSubpartitionL2 CacheMSHRsOff-chip Memory ControllerSubpartitionL2 CacheMSHRsMemory PartitionSubpartitionL2 CacheMSHRsOff-chip Memory ControllerSubpartitionL2 CacheMSHRsMemory PartitionSubpartitionL2 CacheMSHRsOff-chip Memory ControllerSubpartitionL2 CacheMSHRsMemory PartitionL2 CacheMSHROff-chip DRAM ControllerStreaming MultiprocessorL1 CacheShared MemoryCUCUCU CU CUCU CU CUCoalescing UnitStreaming MultiprocessorL1 CacheShared MemoryCUCUCU CU CUCU CU CUCoalescing UnitStreaming MultiprocessorL1 CacheShared MemoryCUCUCU CU CUCU CU CUCoalescing UnitStreaming MultiprocessorL1 CacheShared MemoryCUCUCU CU CUCU CU CUCoalescing UnitStreaming MultiprocessorL1 CacheShared MemoryCUCUCU CU CUCU CU CUCoalescing UnitSIMT CoreL1 CacheShared MemoryTBTBTB TB TBTB TB TBCoalescing UnitFigure 2.1: Baseline GPU architecturememory requests are generated from the warp. A coalescing unit is introduced to reducethe number of memory requests sent from a warp. The coalescing unit tries to coalescerequests issued by a warp into as few requests as possible if there are consecutive addressesin these requests. For example, if all threads in a warp access the address within 128 bytes,only a single memory request will be generated. However, if the threads access memoryaddress which has gaps larger than 128 bytes, 32 requests will be generated. This can createsignificant pressure on the memory system.In the following sections, we will use cores to represent SIMT cores for simplicity.102.2. Cache and Miss Status Holding Registers2.2 Cache and Miss Status Holding RegistersThere are two levels of caches in our baseline GPU architecture. Each SIMT core has itsown private L1 cache to capture locality within it. For the locality between SIMT cores,there is a last level cache bank located in the memory partition. When requests miss inthe cache, multiple outstanding memory requests are sent to lower memory levels. To keeptrack of multiple outstanding memory requests, an MSHR is added to the cache [24, 40] .Block Addr. Dest. RegV Dest. RegV=Miss Block AddressDataBlock Addr.…Dest. RegV Dest. RegV=Miss Block AddressData………Figure 2.2: MSHR structureFigure 2.2 shows the structure of the MSHR assumed in this work. When a requestmisses a cache line, the MSHR compares the address of the request to each entry of itsblock address. If there is no MSHR entry that matches the block address of the request,the request is stored into a new MSHR entry and a miss request is sent to the next lowerlevel of the memory hierarchy to acquire data for the whole cache line. The MSHR keepsinformation regarding the waiting requests, such as the destination register to store the data112.3. DRAM Controller and DRAMonce it returns from the lower memory level. If the address of the request matches the blockaddress of an MSHR entry, it indicates that there is already an outstanding request for thiscache line. This request is merged into the matched MSHR entry, and no additional requestneeds to be sent.When multiple memory requests are issued by warps from the same core miss to thesame cache line, they are merged into the L1 cache MSHR. Memory requests for the samecache line from different cores are merged into the last level cache MSHR. Thus, the lastlevel cache MSHR captures inter-core locality. In this thesis, we use this information toimprove memory access scheduling in Chapter 4.2.3 DRAM Controller and DRAMFigure 2.3 shows our baseline DRAM controller and DRAM model. When a requestreaches the DRAM controller, it will be buffered in either the Read Request Queue or theWrite Request Queue depending on its type. A scheduler inside the DRAM controller isresponsible for the memory access scheduling. The scheduler chooses a request either fromthe read request queue or the write request queue based on the read-write scheduling pol-icy. After a request is chosen, the scheduler translates the request into DRAM commands(Precharge, Active, Read and Write) and the request is transferred to the Command Queue.Each DRAM bank has its own command queue. The command queue is scheduled using around-robin policy [16]. When a DRAM command is sent, the priority pointer points to thenext bank. For each bank, the commands in the command queue are scheduled in-order topreserve scheduling decisions by the request scheduler.In each DRAM bank, there is a row-buffer. The reason for the row-buffer is to reduceaccess latency to the same DRAM row due to the timing constrains of the DRAM. When a122.3. DRAM Controller and DRAMRead Request QueueWrite Request QueueMemory RequestsBank 0 Bank 1 … Bank 15Command QueueDRAM Controll LogicDRAMColumnsRowsBank 15Row BufferColumnsRowsBank 1Row BufferColumnsRowsBank 0Row Buffer…SchedulerDRAM ControllerFigure 2.3: Baseline DRAM controller structure132.3. DRAM Controller and DRAMrow in the DRAM is activated, data in a DRAM row is transferred to the row-buffer. Thisrow is known as the open row for the bank. While a row is buffered in a row-buffer, requestsaccess this row are called a row-hit. When a DRAM controller issues a request that accessesa different row, the DRAM needs to close the current open row and activate a new row. Thisis known as a row-miss.Command BusData BusAddress BusPRE RDACTRDCLOCKR1 R2R1 R2tCLD2D2D1D1tRP tRCDtCLRow-hit latency Row-miss latency(a) DRAM timing illustration for row-hit and row-missCommand BusData BusAddress BusWR RDCLOCKR1R1tCLD1 D1D1D1tWTR tCLRead-Write latency(b) DRAM timing illustration for write to read latencyFigure 2.4: DRAM timing constrains, we assume row 1 is the current open-row at thebeginning of the clock. (RD = Read, PRE = Precharge, ACT = Activate, WR =Write, R1 = Row 1, R2 = Row 2, D1 = Data for Row 1, D2 = Data for Row 2)142.3. DRAM Controller and DRAMFigure 2.4 shows the DRAM timing constrains. The command bus sends DRAM com-mands like Read, Write, Precharge and Activate to the DRAM. The address bus indicateswhich bank, row and column to access. For simplicity, we do not show bank and columnaddress in the Figure 2.4. We assume the read and write commands access the same bankbut different columns. After a read or a write command is sent, the data bus output the dataafter the timing constraint is satisfied.Figure 2.4a shows the timing constrains of row-hit and row-miss. When there is arow-hit, data will be available at the output port after Column Address Strobe Latency(tCL). When a row-miss occur, two additional latencies are introduced which are RowPrecharge Time (tRP) and Row Address to Column Address Delay (tRCD). The row-miss la-tency (tRP+tRCD+tCL) is much higher than row-hit latency (tCL). To achieve high bandwidthutilization, our baseline Scheduler uses an FR-FCFS policy. It schedules requests accessingthe current open row first. When there is no memory requests to the current open row, therequest scheduler schedules the oldest request to prevent starvation.Write requests are not critical to performance. Because after a write request is sent, athread does not block for the write request. There is a minimal delay (tWTR) to switch thebus between read and write. Figure 2.4b shows the timing constrain of the tWTR. To reducethe tWTR penalty, write requests are buffered in write request queue. When the DRAMcontroller accumulates a high number of write requests in the write request queue, therequest scheduler drains the write request queue to prevent the write queue from reachingits maximum capacity. This number is the high-watermark. The write request queue willalso be drained when the read request queue is empty. The request scheduler stops drainingthe write request queue and switch to schedule the read requests until it reaches a lownumber which is the low-watermark [39].15Chapter 3Inter-core Locality BenefitsPerformanceAs described above, the coalescing unit is used to increase overall memory bandwidth.However, the coalescing unit can only help in combining requests from one warp. Requestsissued by different warps within a single core are combined in the L1 cache MSHR. Mem-ory requests from different cores cannot be combined by the coalescing units or by the L1cache MSHRs. While memory requests issued by the multiple cores accessing data in thesame cache line, they are merged into a single L2 cache MSHR entry. The L2 cache MSHRskeep track of outstanding memory requests sent to DRAM. We show that the number ofmerges in an L2 cache MSHR entry can be used to represent inter-core locality.In this thesis, row locality is defined as the average number of requests to access anopen DRAM row before it is closed. Row locality is important to DRAM bandwidth uti-lization. High row locality means a low row-miss rate and high bandwidth utilization. Sincerow-miss overhead (tRP+tRCD+tCL) is much larger than row-hit overhead (tCL), it is impor-163.1. A Key Observationtant to improve row locality to maximize DRAM bandwidth utilization. Modern DRAMcontrollers use an FR-FCFS scheduling policy to reorder memory requests to maximize therow access locality. FR-FCFS searches the request queue to find if there is any requestaccessing the open row. If there is no request accessing the open row, a row-miss occur. Inthis case, FR-FCFS schedules the oldest request to prevent starvation. This can lead to highDRAM bandwidth utilization but does not always benefit the overall performance. In thenext section, we will describe an alternative scheduling policy to utilize inter-core localityto get better performance.3.1 A Key ObservationAs mentioned in Chapter 2, memory requests from different cores accessing the same dataare merged into the L2 cache MSHR. Only one memory request is sent the to the DRAMcontroller. In this case, multiple warps from different cores are virtually waiting for onememory request in the DRAM controller. By servicing the request with the maximumnumber of waiting warps, the total waiting time of these warps is reduced. These warps canresume execution earlier and increase overall performance. Since memory requests that areissued by different cores to the same cache line are merged into the same L2 MSHR entry,we can use L2 cache MSHR merge length to measure the number of warps waiting for theirmemory requests. In this thesis, we only focus on memory requests issued by differentcores. In the Chapter 8, we will discuss memory requests issued by the same core in futurework.The MSHR merge information can be used to help with the DRAM controller schedul-ing decision. For a request in the DRAM controller, if the merge length of its correspond-ing MSHR entry is greater than one, this indicates that by serving this request will benefit173.1. A Key Observationmultiple warps from multiple cores. We call this inter-core locality aware memory accessscheduling.Figure 3.1 shows an example to illustrate how inter-core locality aware memory accessscheduling can improve performance. Assume Warp 0 and Warp 1 belong to Core 0 andWarp 2 and Warp 3 belong to Core 2. During Cycle 0, they are all currently stalling onthe pending load instructions. Memory requests from Warp 0 and Warp 2 are merged intoone MSHR entry, which is the first row of the L2 cache MSHR, as illustrated in Figure 3.1.Four memory requests (R0 to R3) are sent to the DRAM. For simplicity, in this example,we assume that after DRAM services a request, the core can receive this request the nextcycle.Figure 3.1a shows the FR-FCFS scheduling policy described in Section 2.3. Becausethe FR-FCFS scheduling policy cannot make use of inter-core locality information, it ispossible that the DRAM controller schedules requests with high inter-core locality last. Weassume the DRAM serves requests in an order R1→R3→R0. After R1 is served, Warp 1can continue to execute the next independent instruction while other warps are still stalling.This is also the same for R3. After R0 is served, two requests (R0, R2) will be receivedby Core 0 and Core 1. Thus in the next cycle, Warp 0 and Warp 2 can execute the nextinstructions concurrently.After 4 cycles, Warp 1 executed 3 instructions because its request is returned from cycle0. Warp 3 executes 2 instructions after cycle 1. Warp 0 and Warp 2 execute 1 instructionbecause their requests returned on the cycle 2. The total number of instructions executedby Warp 0 to Warp 3 is 7 = 1+3+1+2. In this case, the Instructions Per Cycle (IPC) withinthese 4 cycles is:183.1. A Key ObservationCycle 0 Cycle 1 Cycle 2 Cycle 3Core 0Core 1R2R0R1R3L2 MSHRW0 W1W2 W3R0R3R1DRAM QueueW0 W1+1W2 W3R3R2R0R0R3W0 W1+2W2 W3+1R2R0R0W0+1 W1+3W2+1 W3+2TimeEntriesMerges(a) Example of FR-FCFS memory scheduling policyCycle 0 Cycle 1 Cycle 2 Cycle 3TimeR0R3R1DRAM QueueR3R0 R2R1L2 MSHRW0 W1W2 W3R3R1R1R3W0+1 W1W2+1 W3R3R3W0+2 W1+1W2+2 W3W0+3 W1+2W2+3 W3+1Core 0Core 1EntriesMerges(b) Example of inter-core locality aware memory scheduling policyFigure 3.1: Example of using inter-core locality aware scheduling generates more∆IPC. Wx represents warp x, Rx represents the memory requests that are sentby warp x. Wx+y indicates that warp x has executed next y instructions.193.2. Critical Path Analysis∆IPCa =1+3+1+24= 1.75 (3.1)Figure 3.1b shows the same example with inter-core locality aware scheduling. Thememory controller schedules requests with high inter-core locality first. When multiplecores send memory requests to access the same L2 cache line, they will be merged into L2cache MSHR. In this case, the DRAM serves R0 first because R0 has highest MSHR mergelength (length = 2). By serving this request, both Warp 0 and Warp 2 can execute the nextinstructions from cycle 1. After 4 cycles, the IPC is:∆IPCb =3+3+2+14= 2.25 (3.2)This example shows that we can improve performance by scheduling high inter-corelocality memory requests first. In the next chapter, we will describe our inter-core localityaware scheduling policy in detail. In Chapter 5, we will show how L2 cache MSHR mergelength is distributed across benchmarks.3.2 Critical Path AnalysisSince we prioritize memory requests with inter-core locality, the request with no inter-corelocality will be delayed. In Chapter 1, we have already given a brief overview on the ef-fects of delaying memory requests either with or without inter-core locality. We design twoseparate experiments by delaying 1000 requests. One is delaying requests with inter-corelocality. The other one is delaying requests without inter-core locality. The 1000 requestsoccupy different percentages of total requests in these applications. The performance im-pact varies from 1% to 39%. However, delaying requests with inter-core locality always203.2. Critical Path Analysishas 2X to 10X more performance impact than delaying requests without inter-core localityas shown in Figure 1.1b. This data indicates that requests with inter-core locality are morecritical than requests without inter-core locality. Prioritizing inter-core locality requestswill benefit performance because if the memory controller delays them, the impact on theperformance is higher than requests without inter-core locality.21Chapter 4Inter-core Locality Aware MemoryAccess SchedulingThis chapter describes our memory scheduling policy in detail. We first start with basicscheduling rules of an inter-core locality aware scheduler. We use the MSHR informationto generate scores for each request and each DRAM row. Two types of row score selectionpolicies are proposed. The score is used to help our scheduling policy making decisions. Inorder to handle starvation of the requests, we added an age information to our schedulingpolicy.4.1 Overview of Inter-core Locality Aware Memory AccessSchedulingAs describe in Chapter 3, inter-core locality can be represented by the number of mergesof an L2 cache MSHR entry. Figure 4.1 shows an overview of inter-core locality awarememory access scheduling. When a request misses an L2 cache line, if there is no MSHR224.1. Overview of Inter-core Locality Aware Memory Access SchedulingCore 0…Interconnect NetworkCore 1 Core 2 Core NRow0Row1Row2(R0, )(R7, 2) (R4, 3)Read Request Queue45Row ScoreDRAM ControllerR4 R5R0 R3R6R7R1R8R2 MSHRL2 CacheDataBlock Addr.……L2 CacheDRAML2 CacheDRAML2 CacheDRAMFigure 4.1: Overview of inter-core locality aware memory access scheduling. R0 toR8 represents request 0 to request 8. Red numbers beside requests are requestscoresentry for this cache line, this means this is the first request for this cache line. A new MSHRentry will be allocated, and this request will be sent to the DRAM controller to get the datafor the whole cache line. Because only one core sends the request to access the cache line,this request has no inter-core locality right now. On the other hand, if there is already anMSHR entry for the cache line, the inter-core locality occurs. There is no need to send thisrequest to DRAM controller since there is already a request getting the data. This request234.1. Overview of Inter-core Locality Aware Memory Access Schedulingonly needs to merge into the already exist MSHR entry. However, in our inter-core localityaware scheduling policy, we still send a new request called a dummy request that only bringsthe MSHR information. When the dummy request reaches the DRAM controller, we updatethe score with MSHR information and make scheduling decisions based on the new score.Rule 1: Scheduling Rules1. Largest request score for row-hit: When there are requests to access currentopened row, schedule the one with largest request score first among all row-hit requests.2. Largest row score for row-miss: When a row-miss occurs, choose the newrow based on the row score. A row with the largest score is chosen. Thisrow is opened, and policy 1 is used to choose a request.A request score is assigned to each request in the DRAM read request queue. For eachDRAM row, a row score is assigned based on the all request scores for this row. For examplein Figure 4.1, There are three requests in the DRAM controller. R0 has a request score as4 since there are four requests waiting for the same cache line in the L2 cache MSHR. Forthe same reason, R7 and R4 has a request score as 2 and 4, respectively. For each DRAMrow, there is a row score. In the example, we use the summation of all requests score as therow score. For the row 1, the score is 5 which is the summation of request score R7 and R4.The request score is used to select requests and the row score is used to select rows. Thescheduler first searches the request queue to see if there are any row-hit requests. Among allrow-hit requests, the request with the maximum score is chosen first. If there are no row-hit244.2. Row Score Typesrequests, the row with the maximum score is selected. Rule 1 shows our scheduling policy.Write requests do not have a corresponding MSHR entry. The write request queue isdrained when it reaches a high-watermark or when the read request queue is empty. TheFR-FCFS policy is used on the write queue.In the following section, we discuss two row score types.4.2 Row Score TypesWe first use the MSHR merge length as the request score. The row score is calculated basedon the request score for the row. We explore two types of row score types for a DRAM row:1. MSHR-M: Row score is defined as the largest request score in this row.2. MSHR-S: Row score is defined as the summation of all requests’ score in this row.MSHR-M policy always schedules a request with the maximum MSHR merge length.This policy aggressively use make use of inter-core locality. When the DRAM controllerselects a row, only one request that with the maximum L2 cache MSHR length is consid-ered. But it ignores other requests in the same row. It is possible that other requests withselected row have low inter-core locality.We improve MSHR-M policy to consider the inter-core locality of all requests in aDRAM row. We propose MSHR-S policy that augments the row score by using the sum-mation of all requests’ score in the same row as the row score. This policy can benefit themost memory requests in the MSHR by opening a row. The overhead of serving row-hitrequests is low. The interval between two successive row-hit requests in the same bankis tCCDL, which is three DRAM clock cycles with the bank group enabled in our GDDR5configuration [15]. This is relatively small compared to the time to open a new row, which254.3. Reduce Latency by Age Informationneeds tens of cycles. Thus, requests within the same row can be served in a short time. Soeven though every request in a DRAM row is not the one with highest inter-core locality,MSHR-S can still serve the maximum inter-core locality in the short time by opening a row.4.3 Reduce Latency by Age InformationA DRAM row with a low score will starve because the proposed scheduler chooses thelargest row score which is defined by the MSHR merge length. These requests can hurtperformance if they have been waiting in the DRAM controller for a prolonged periodof time. In order to prevent starvation, we propose a MSHR-S+A policy. MSHR-S+Aaugments MSHR-S by including age in addition to merge information from the L2 cacheMSHR. Age in this thesis is defined as the life time of a memory request. When a requestis generated, the age is 0. The age will increase by one each cycle. There is a latency for amemory request reaching the DRAM controller. When the memory request reaches the L2cache, the age is the latency from the time it is generated. When a memory request arrives atthe MSHR, the age is calculated for this request. By employing the MSHR-S+A policy, therequest’s score and row score now is defined using age, instead of solely the merge lengthas described in the previous section. The age of a request is defined as:AgeReq = Current Time−Request Creation Timestamp (4.1)Only the first outstanding memory request within an MSHR entry is sent to the DRAM.To represent an age of all memory requests within an MSHR entry, the age of an MSHRentry is the summation of all request ages in this MSHR entry:AgeMSHR =∑AgeReq (4.2)264.4. Implementation DetailsWhen we pass the MSHR information, we pass AgeMSHR along with merge length.After the DRAM controller receives the MSHR merge information, which contains mergelength and age, the age needs to be updated every cycle to represent the real age. This isbecause while a request is waiting in the DRAM controller, the age of this request is stillincreasing as time passes. Notice that a request in the DRAM controller may represent anumber of requests in the MSHR. Upon each cycle, the age of each request in MSHR isincremented by one. In total, the age of an MSHR entry is incremented by the MSHR mergelength every cycle. So instead of incrementing by one each cycle, the age of a request inthe DRAM controller needs to be incremented by the number of MSHR merge requests.4.4 Implementation DetailsFigure 4.2 shows a detailed implementation of the inter-core locality aware memory accessscheduler. We introduced a buffer called the Merge Information Buffer (MIB) in the L2cache and the DRAM controller. The MIB records MSHR block addresses, merge lengthand age information. There is also a valid field to indicate whether an entry has already beensent to the DRAM controller. When a request misses an L2 cache line, if this request is thefirst one that misses the cache line, a new MSHR entry is allocated and a new request ispushed into the miss queue. If there is already an outstanding request for the cache line, therequest is merged into an MSHR entry. At this time, the MIB is updated with a new mergelength, the new age is calculated, and the valid bit is set to one. A multiplexer is introducedafter the MIB and the miss queue to determine which request to send in each cycle. If theMIB has valid entries, this indicates there are requests that have not been returned fromDRAM yet. It is important to update the DRAM controller with the latest information toprevent it to make a decision based upon the stale information. When the MIB has a valid274.4. Implementation DetailsL2 CacheMiss QueueL2 Cache MSHRMUXMSHR Merge Infomation Buffer111Valid0503 4520# Merge130AgeBlock Addr.5 1First Missing RequestNon First Missing RequestDummy RequestNormal RequestMissing RequestRequest Queue0Request002BankR1R00Row0DEMUXDRAM ControllerRow Age4120075Bank50#Merge0DRAM Row Score+++501# Merge45Block Addr.52030013AgeDRAM MIB+++Normal RequestDummy RequestFigure 4.2: Implementation of inter-core locality aware memory access scheduling284.5. Hardware Overheadentry, the multiplexer will send this entry first to inform the latest information. A dummyrequest with the maximum request score is generated and sent to the DRAM controllerfirst. The dummy request includes the MSHR block address, merge length, age and a bit toindicate that it is a dummy request. After an MIB entry is sent, the valid bit is set to zero.This entry will be set to valid again if there is another outstanding request to the entry. Ifthere is no valid entry in the MIB, requests from the miss queue are sent.A demultiplexer is introduced to determine which structure a memory request shouldgo to when arriving at the DRAM controller. Normal requests are pushed into the requestsqueue and dummy requests are used to update the MIB in the DRAM controller. When adummy request reaches the DRAM controller, the merge length field in the dummy requestis used to overwrite the merge length field in the DRAM controller and the age field inthe dummy request adds to the age filed in the DRAM controller. Each entry in the MIBrequires an adder to update the age each cycle. The adder takes age and merge lengthas the input and outputs the added result back to the age. Information in the MIB in theDRAM controller is used to help scheduling decisions as described above. When there isno corresponding information in the MIB, it means the merge length is one and the age canbe obtained from the current request.4.5 Hardware OverheadIn this section, we will analyze the hardware overhead of our proposed structure. For eachMSHR entry, we need a corresponding MIB entry. In each MIB entry, an additional 20 bitsare required to store the MSHR merge information. These 20 bits break down as follows.The merge length requires 4 bits for an MSHR entry with a maximum merge capacity of 16.We use 15 bits for the age field. When all 16 MSHR merges are occupied, each can record294.5. Hardware Overheada maximum age of 215/16 = 2048 cycles in the memory controller clock domain. There is achance that age exceeds 215. In this case, we saturate the counter at 215. The valid bit onlyrequires a single bit. Each MSHR entry already has a block address field, which we canreuse in the MIB. Given that we have 64 MSHR entries in our baseline configuration (seeTable 5.1), the total hardware cost added to an L2 cache is 160 bytes. This is 0.2% of totalL2 cache.The MIB in the DRAM needs to record an MSHR block address so that memory re-quests in the DRAM request queue can locate its corresponding MSHR entry. For a 32 bitaddress and 128 byte cache line, the block address requires 24 bits. Merge length and agerequire the same amount of bits as MIB in MSHR (4 bits and 15 bits, respectively). Foreach request in the DRAM read request queue, we need an MIB entry for it. The DRAMread request queue contains 64 entries from Table 5.1, requiring a total of 344 bytes. Also,each row in every bank needs to record the row score. In our baseline architecture, we have16 banks and 4096 rows. To record banks and rows, we need 4 bits and 12 bits, respec-tively. Also, age and merge length require 19 bits in total. So for each DRAM row, weneed 4+12+19=35bits. If each request in the DRAM read queue accesses different rows,the number of rows can be as many as the number of requests in the DRAM read queue.We need the number of row score entries to be the same as the DRAM read queue capacitywhich is 64. The row score requires a total of 280 bytes. We also need 128 15-bit addersfor the DRAM row score and MIB. Given that 1-bit full adder needs 34 transistors, 15-bitadder needs 510 transistors. 1-bit SRAM needs 6 transistors, 510 transistors are 85 bits instorage. 128 15-bit full adders needs 1360 bytes. So the total overhead added to the DRAMis 344 + 280 + 1360 = 1984 bytes.30Chapter 5Methodology5.1 ConfigurationTo evaluate our proposed inter-core locality aware memory access scheduling policy, weuse a modified version of GPGPU-Sim 3.2.2 [4]. The DRAM model in GPGPU-Sim 3.2.2does not support separate read and write queues. There is a bus turnaround latency to switchbetween read requests and write requests. In order to minimize this latency, we add separateread and write queues in the DRAM controller as stated in Chapter 2. Table 5.1 shows ourconfiguration. We use a GPU model similar to NVIDIA GTX480. To model the DRAM,we use Hynix 1Gb GDDR5 [15] as our DRAM timing model. The warp scheduling policywe use is Greedy-then-Oldest (GTO). This policy will switch to another warp only if thecurrent warp stalls.We perform the evaluation using a set of benchmarks from Rodinia [8, 9], CUDA Soft-ware Development Kit (SDK) [2], GPGPU-Sim [4] and LonestarGPU [5]. Table 5.2 showsthe benchmarks used in this study. For LonestarGPU benchmarks, they use CUDA 5 butour simulator only supports CUDA 4.2. We tried to fix the problem and only get two of315.1. ConfigurationTable 5.1: Baseline ConfigurationConfiguration ParameterNumber of Cores 15Warp size 32Max Threads / Core 1536Warp Scheduling Policy Greedy-then-oldest (GTO) [37]L1 Data Cache / Core 16KB total size, 128B line, 4-way associativeL2 Unified Cache 64KB / Memory Sub-partition128B line, 16-way associativeL2 MSHR 64 entries / Memory Sub-partition16 merges / EntryNumber of Memory Partitions 6Memory Sub-partitions 12, 2 / Memory PartitionL2 to DRAM Latency 20DRAM Read Queue Capacity 64DRAM Write Queue Capacity 128High/Low Watermarks 96/80Core Frequency 1400 MHzInterconnect Frequency 1400 MHzDRAM Frequency 924 MHzNumber of DRAM Channels 6Number of DRAM Banks 16, 4 / Bank GroupNumber of DRAM Rows 4096GDDR5 Memory Timing Hynix H5GQ1H24AFRtRCD = 12, tRAS = 28, tRP = 12,tRC = 40, tCCDS = 2, tRRD = 6,tCL = 12, tWL = 4, tCDLR = 5,tWR = 12, tCCDL = 3, tRTPL = 2(Unit in DRAM cycle)325.1. Configurationthem running. Not all the benchmarks are memory sensitive. We classify our benchmarksin the following section.335.1. ConfigurationTable 5.2: BenchmarksMemory InsensitiveName Abbr. SuiteBack Propagation BACKP RodiniaHotSpot HOTSP RodiniaHeart Wall HRTWL RodiniaKmeans KMN RodiniaLeukocyte LKYT RodiniaLU Decomposition LUD RodiniaNearest Neighbour NNC RodiniaSpeckle Reducing Anisotropic Diffusion SRAD RodiniaSimilarity Score SS RodiniaconvolutionSeparable CONV CUDA SDKhistogram HIS CUDA SDKreduction REDU CUDA SDKscalarProd SCP CUDA SDKMemory Sensitive, Low Inter-Core LocalityName Abbr. SuiteBlackScholes BS CUDA SDKCFD Solver CFD RodiniaNeedleman-Wunsch NDL RodiniaStreamcluster STMCL RodiniaPoints-to Analysis PTA LonestarGPUVectorAdd VADD CUDA SDKMemory Sensitive, High Inter-Core LocalityName Abbr. SuiteBreadth-First Search BFS RodiniaSingle Source Shortest Path SSSP GPGPU-SimMatrix Transpose TRANS CUDA SDKMerge Sort MGST RodiniaSurvey Propagation SP LonestarGPU345.2. Classification of Benchmarks5.2 Classification of Benchmarks0%	  2%	  4%	  6%	  8%	  10%	  12%	  14%	  BACKP	  HOTSP	  HRTWL	  KMN	  LKYT	  LUD	  NNC	  SRAD	  SS	  CONV	  HIS	  REDU	  SCP	  Speedup	  Perfect	  DRAM	  (a) Memory insensitive applications0%	  50%	  100%	  150%	  200%	  250%	  BS	  CFD	  NDL	  STMCL	  PTA	  VADD	  BFS	  SSSP	  TRANS	  MGST	  SP	  Speedup	  Perfect	  DRAM	  (b) Memory sensitive applicationsFigure 5.1: Speedup using perfect DRAMIn order to know which applications are memory sensitive applications, we implementa perfect DRAM model which has a zero-cycle latency. When a memory request arrivesat the memory controller, it is immediately ready at the DRAM return queue. Figure 5.1355.2. Classification of Benchmarks0%	  10%	  20%	  30%	  40%	  50%	  60%	  70%	  80%	  90%	  100%	  BS	  CFD	  NDL	  STMCL	  PTA	  VADD	  BFS	  SSSP	  TRANS	  MGST	  SP	  Percentage	  of	  Cycles	  Has	  Inter-­‐core	  Locality	   No	  Inter-­‐core	  Locality	   Idle	  Figure 5.2: Cycle distribution of inter-core locality across benchmarks. We use theL2 MSHR merge length to represent inter-core locality.shows the speedup when using perfect DRAM.In Figure 5.1, we classify our benchmarks into two categories: 1) Memory insensitiveapplications. These benchmarks do not speedup much using perfect memory (speedup <20%). The main reasons are 1) The application is compute intensive and 2) L1 or L2 cachemiss rate is too low. While for some applications, we can increase input data size to getrid of low L1 or L2 cache miss rate. However, this can take a very long time to finishin the simulation environment. From our estimation, applications run 1s on hardware willrun 10 days in the simulation environment. Applications in Figure 5.1a fall into this cate-gory. 2) Figure 5.1b shows all memory sensitive applications. These applications benefitsignificantly when using perfect memory. We focus on these applications only.Our scheduler uses the L2 MSHR merge length to prioritize requests. Figure 5.2 showsa cycle distribution of the L2 MSHR merge length. When all MSHR entries are empty,365.2. Classification of Benchmarksthe MSHR is idle. The bar at the top indicates the percentage of cycles that MSHR is idle.When there are requests in the MSHR, if all MSHR entries’ merge length is 1, this cycle hasno inter-core locality. If any of the MSHR entries’ merge length is larger than 1, it meansthere is inter-core locality. The black bar at the bottom indicates the percentages when thishappens.From Figure 5.2, we further classify applications based on the percentage of MSHRmerge length. Memory sensitive applications with low inter-core locality (LIL) and memorysensitive application with high inter-core locality (HIL). Applications with MSHR mergelength less than 10% are classified as Low Inter-core Locality (LIL). Applications withmore than 10% MSHR merge length are classified as High Inter-core Locality (HIL).37Chapter 6Experimental ResultsIn this chapter, we present the result of our inter-core locality memory access schedulingpolicy. Section 6.1 evaluates the performance of the inter-core locality aware DRAM sched-uler for all applications. Section 6.2 gives a detail analysis on our scheduler. The analysison memory request related stalls on L2 cache, memory access latency, data dependencystall, row locality and DRAM bandwidth is given to depth understand the inter-core local-ity scheduling policy. Section 6.3 gives an analysis on sensitivity of the DRAM requestqueue size and L2 cache to DRAM latency. We use the harmonic mean when computingaverage results.386.1. Performance6.1 Performance0.96	  0.98	  1	  1.02	  1.04	  1.06	  1.08	  1.1	  1.12	  1.14	  1.16	  BFS	  SSSP	  TRANS	  MGST	   SP	  HMEAN-­‐MS-­‐HIL	  Normalized	  IPC	  FR-­‐FCFS	   MSHR-­‐M	   MSHR-­‐S	   MSHR-­‐S+A	  Figure 6.1: IPC for the memory sensitive, high inter-core locality applications0.96	  0.98	  1	  1.02	  1.04	  1.06	  1.08	  1.1	  BS	   CFD	  NDL	  STMCL	   PTA	  VADD	  HMEAN-­‐MS-­‐LIL	  Normalized	  IPC	  FR-­‐FCFS	   MSHR-­‐M	   MSHR-­‐S	   MSHR-­‐S+A	  Figure 6.2: IPC for the memory sensitive, low inter-core locality applications396.1. PerformanceFigure 6.1 illustrates that the MSHR-S+A scheduler achieves an average 10.9% perfor-mance improvement over the FR-FCFS scheduler for the memory sensitive benchmarkswith high inter-core locality. For the memory sensitive with low inter-core locality bench-marks, Figure 6.2 shows that MSHR-S+A scheduler has the best performance with an av-erage 2.6% performance improvement and no benchmarks show performance degradation.If the inter-core locality is low, MSHR-S still does improve performance. This is becauseMSHR-S chooses a row with the most pending requests. Rixner et al. [36] use a similartechnique called most pending policy. By serving the most pending row, other rows have achance to wait for more requests thus improve overall row locality. Figure 6.8 shows this indetail.0.97	  0.98	  0.99	  1	  1.01	  1.02	  1.03	  BACKP	  HOTSP	  HRTWL	  KMN	  LKYT	  LUD	  NNC	  SRAD	  SS	  CONV	  HIS	  REDU	  SCP	  HMEAN-­‐MI	  Normalized	  IPC	  FR-­‐FCFS	   MSHR-­‐M	   MSHR-­‐S	   MSHR-­‐S+A	  Figure 6.3: IPC for memory insensitive applicationIn summary, for memory insensitive applications, the average performance of the mem-406.2. Detailed Analysisory insensitive applications is improved by 0.7%. Only KMN shows less than 1% perfor-mance degradation which is negligible.The following sections discuss the performance improvements when using the localityaware DRAM scheduler in more detail. In following section, we will only focus on memorysensitive applications.6.2 Detailed Analysis0.00	  0.20	  0.40	  0.60	  0.80	  1.00	  1.20	  BS	  CFD	  NDL	  STMCL	  PTA	  VADD	  HMEAN-­‐MS-­‐LIL	  BFS	  SSSP	  TRANS	  MGST	  SP	  HMEAN-­‐MS-­‐HIL	  Normalized	  L2	  Cache	  Reserva?on	  Fails	  	  FR-­‐FCFS	   MSHR-­‐M	   MSHR-­‐S	   MSHR-­‐S+A	  Figure 6.4: L2 reservation failsFigure 6.4 shows the L2 cache reservation fails reduction normalized to FR-FCFS. TheL2 cache reservation fails happens when there is a request trying to require the cache re-source, but it cannot. There are three reasons that a missing request fails the L2 cachereservation. 1) The cache line requested by a request has been reserved by another request416.2. Detailed Analysisbut has not been filled. 2) The cache miss queue is full. 3) All MSHR entries are occupied.Our inter-core locality aware scheduler reduces L2 cache reservation fails because ofthe third reason. Our scheduler releases MSHR entries to make room for other requests.For LIL and HIL benchmarks, we show 10.5% and 35% L2 reservation fails reduction,respectively.0	  20000	  40000	  60000	  80000	  100000	  120000	  140000	  BS	  CFD	  NDL	  STMCL	  PTA	  VADD	  HMEAN-­‐LIL	  BFS	  SSSP	  TRANS	  MGST	  SP	  HMEAN-­‐HIL	  Max	  Request	  Latency	  FR-­‐FCFS	   MSHR-­‐M	   MSHR-­‐S	   MSHR-­‐S+A	  Figure 6.5: Maximum memory request latencyMSHR-M and MSHR-S scheduling policies do not concern about request latency. ADRAM row with low MSRH-S score is stalled for a very long time, which increases themaximum request latency. MSHR-S+A is proposed to reduce maximum memory requestlatency to prevent starvation. Figure 6.5 illustrate how our three scheduling policies impactmaximum memory request latency. Our proposed scheduling policies all increase the max-imum memory request latency. But MSHR-S+A has a better maximum memory request426.2. Detailed Analysislatency by combining age information.0	  0.2	  0.4	  0.6	  0.8	  1	  1.2	  1.4	  BS	  CFD	  NDL	  STMCL	  PTA	  VADD	  HMEAN-­‐MS-­‐LIL	  BFS	  SSSP	  TRANS	  MGST	  SP	  HMEAN-­‐MS-­‐HIL	  Normalized	  Memory	  Latency	  FRFCFS	   MSHR-­‐M	   MSHR-­‐S	   MSHR-­‐S+A	  Figure 6.6: Average memory request latencyFigure 6.6 shows the average memory latency reduction of our proposed scheduler nor-malized to FR-FCFS. Memory latency reduction comes from releasing more memory re-quests in the L2 cache MSHR. MSHR-S reduces the total waiting time for all requests inthe L2 cache MSHR. While MSHR-S+A reduces memory latency further by preventing amemory request from waiting for too long in the DRAM controller, it still serves requestswith high L2 cache MSHR merge length. This is because AgeMSHR is the summation ofall memory requests within an MSHR entry. If an MSHR entry has a very high MSHRmerge length, the accumulation feature of AgeMSHR will ensure that this request still hashigh priority.Scoreboard is a commonly used structure to keep track of data dependency between436.2. Detailed Analysis0.00	  0.20	  0.40	  0.60	  0.80	  1.00	  1.20	  BS	  CFD	  NDL	  STMCL	  PTA	  VADD	  HMEAN-­‐MS-­‐LIL	  BFS	  SSSP	  TRANS	  MGST	  SP	  HMEAN-­‐MS-­‐HIL	  Normalized	  Data	  Dependency	  Stall	  	  FR-­‐FCFS	   MSHR-­‐M	   MSHR-­‐S	   MSHR-­‐S+A	  Figure 6.7: Data dependency stall, normalized to FR-FCFSinstructions. Write After Write (WAW) and Read After Write (RAW) hazards are reasonsthat make the scoreboard stall a warp. When a load instruction is issued, it usually takestens to hundreds of cycles to get the data from memory. This is one important reason thatthe scoreboard fails. Our inter-core locality aware scheduler resumes more warps fromstalling to reduce data dependency stall. Figure 6.7 illustrates an average reduction of datadependency stalls by 2% for LIL and 18% for HIL respectively, comparing to FR-FCFS.For HIL applications, MSHR-S has the largest impact on reducing scoreboard stalls. This isbecause MSHR-S reduces memory accesses waiting time in L2 cache MSHR, thus releasingthe most warps that are currently waiting for the scoreboard.Applications with low inter-core locality benefit from the MSHR-S scheduler becauseit improves row locality by choosing a row with most pending requests. When the MSHR446.2. Detailed Analysis0	  2	  4	  6	  8	  10	  12	  14	  BS	  CFD	  NDL	  STMCL	  PTA	  VADD	  HMEAN-­‐MS-­‐LIL	  BFS	  SSSP	  TRANS	  MGST	  SP	  HMEAN-­‐MS-­‐HIL	  Row	  Access	  Locality	  FR-­‐FCFS	   MSHR-­‐M	   MSHR-­‐S	   MSHR-­‐S+A	  Figure 6.8: Row locality, the y-axis indicates average row-hit of all memory requestsmerge length of all entries are one, MSHR-S chooses a row with most-pending requests.While a row with the most-pending requests is being serviced, the DRAM accumulates re-quests for other rows. Other rows will have a high average row access before the currentopen row is switching. Figure 6.8 shows row locality improvement. VADD and PTA havea high row locality improvement which turn into performance benefits as discussed in sec-tion 6.1. Other applications do not show a big row locality difference because the inter-corelocality aware DRAM scheduler still uses row-hit first policy as in baseline FR-FCFS toimprove DRAM bandwidth utilization. Figure 6.9 shows the DRAM bandwidth utilizationimprovement. The DRAM bandwidth is defined as total percentage of DRAM cycles thatserves read or write requests. For all applications, our inter-core locality scheduler achievessimilar DRAM bandwidth utilization. Because inter-core locality actually focus on proces-456.3. Sensitivity Analysis0.00	  0.10	  0.20	  0.30	  0.40	  0.50	  0.60	  0.70	  0.80	  0.90	  BS	  CFD	  NDL	  STMCL	  PTA	  VADD	  HMEAN-­‐MS-­‐LIL	  BFS	  SSSP	  TRANS	  MGST	  SP	  HMEAN-­‐MS-­‐HIL	  DRAM	  Bandwidth	  U?liza?on	  FRFCFS	   MSHR-­‐M	   MSHR-­‐S	   MSHR-­‐S+A	  Figure 6.9: DRAM bandwidth utilizaitonsor side efficiency, even with the small DRAM bandwidth improvement, the performancecan have a big improvement.6.3 Sensitivity Analysis6.3.1 L2 Cache to DRAM LatencyThe inter-core locality aware scheduler needs to send the MSHR information from the on-chip L2 cache MSHR to the DRAM controller. There is a latency between the L2 andthe DRAM controller. Figure 6.10 shows different L2 to DRAM latency that affects theinter-core locality aware scheduler. In the baseline, we assume a 20 cycle L2 to DRAMlatency. We evaluate the performance from 0 to 100 cycle latency. As latency increases, ittakes longer for MSHR information to arrive at DRAM. This causes inaccurate scheduling466.3. Sensitivity Analysis0.90	  0.95	  1.00	  1.05	  1.10	  1.15	  1.20	  BS	  CFD	  NDL	  STMCL	  PTA	  VADD	  HMEAN-­‐LIL	  BFS	  SSSP	  TRANS	  MGST	  SP	  HMEAN-­‐LIL	  Normalized	  IPC	  0	  Cycle	   20	  Cycle	   40	  Cycle	   60	  Cycle	   80	  Cycle	   100	  Cycle	  Figure 6.10: IPC normalized to FR-FCFS for different L2 to DRAM latency withMSHR-S+A scheduling policydecision on inter-core locality.From the Figure 6.10, the performance of VADD fluctuates as the L2 to DRAM latencyincreases. Since there is no inter-core locality in VADD, no L2 MSHR information is sentto DRAM. The L2 to DRAM latency does not affect the accuracy of scheduling. TheMSHR-S+A scheduling policy now using a row with the oldest age which may benefitperformance.The overall performance improvement with our largest L2 to DRAM latency withMSHR-S+A scheduling policy is 6% for HIL applications and 2% for LIL applications(normalizedto FR-FCFS).476.3. Sensitivity Analysis6.3.2 DRAM Queue Size0.95	  1.00	  1.05	  1.10	  1.15	  BS	  CFD	  NDL	  STMCL	  PTA	  VADD	  HMEAN-­‐LIL	  BFS	  SSSP	  TRANS	  MGST	  SP	  HMEAN-­‐HIL	  Normalized	  IPC	  16	   32	   64	   128	  Figure 6.11: IPC normalized to FR-FCFS for different read request queue size withMSHR-S+A scheduling policyFigure 6.11 shows the performance of our MSHR-S+A scheduler across HIL applica-tions with different DRAM controller request queue sizes. The performance is better with alarger queue size. This is because with a larger queue size, inter-core locality aware sched-uler can accumulate more memory requests in DRAM controller while waiting for theirMSHR information. Since the DRAM controller has to wait at least for the L2 to DRAMlatency to receive the MSHR information, if a request in the DRAM controller queue hasnot received its MSHR information, the inter-core locality aware scheduler will assume themerge length as 1 and the age as the request’s age. This is inaccurate if there is anotherrequest in L2 that has been merged into the same MSHR entry but the information of this486.3. Sensitivity AnalysisMSHR entry has not been received by the DRAM controller. With large DRAM controllerqueue size, we can buffer the requests while waiting for its MSHR merge information tocover the long L2 to DRAM latency.With the smallest configuration of the DRAM controller (queue size=16), MSHR-S+Aachieves an average improvement in IPC of 4% with HIL applications and 1% with LILapplications, comparing to FR-FCFS. For HIL applications, a queue size of 128 showsa small performance improvement comparing to queue size 64. The reason is these ap-plications have the largest MSHR merge length. A queue size of 64 entries has alreadyaccumulated enough requests waiting for its MSHR information. With our largest queuesize of 128 entries using the MSHR-S+A scheduling policy, HIL shows 11% performanceimprovement and LIL shows 4% performance improvement with MSHR-S+A schedulingpolicy comparing to FR-FCFS.49Chapter 7Related WorkIn this chapter, we are going to present related works on memory access scheduling. Weclassify all these works into three parts: 1) Early explorations on memory access scheduling,2) Memory access scheduling policies that focus on memory requests fairness and through-put, 3) Using criticality information to prioritize memory requests, 4) Hardware complexityeffective memory access scheduling policies.7.1 Early Memory Access Scheduling ExplorationsResearchers already pay attention to memory access scheduling a long time ago. McKeeet al. [27] observe an important problem for memory access in stream applications. Streamapplications usually use a loop to access vectors. Within a loop, two memory references ontwo vectors are accessed. The addresses of the two references have a large gap so that twoDRAM rows need to be accessed. Since the latency of row-hit access is much smaller thanrow-miss access, it is not efficient to serve these two references in a sequence. Instead, theypropose a combined compiler and hardware technique to solve this problem. The compiler507.2. Fairness and Throughput Memory Access Schedulingis responsible to detecting the streams and unroll the loop so the consecutive references canbe sent together. They propose a hardware called stream memory controller between theprocessor and the DRAM to interleave the references that can use DRAM efficiently.The above work is for single processor. McKee and Wulf [28] further extend the abovetechnique to multiprocessor. To choose memory requests that are sent by multiprocessor,they add two scheduling policies to schedule between multiprocessors. The first is cyclicscheduling to do round-robin like scheduling. The second is block scheduling that dividesthe input vector into blocks. The scheduler switch to another processor when current pro-cessor finishes the block.Rixner et al. [36] exploit several memory access scheduling policies. All these memoryaccess scheduling policies consider different aspect of the DRAM characteristic. FR-FCFSis proposed in the paper and is the most popular memory access scheduling policies today.Almost all of the commercial processors use the FR-FCFS today. The idea is to prioritizerow-hit requests over row-miss requests to achieve maximum DRAM bandwidth utilizationand high performance. We use the FR-FCFS as the baseline scheduler for the comparisonsin this work.7.2 Fairness and Throughput Memory Access SchedulingThere are a number of memory access schedulers proposed for multi-core systems recently[6, 25, 29, 30]. The primary focuses of these works are providing fairness among threadswhile keeping high bandwidth utilization. Mutlu and Moscibroda propose two memoryscheduling policies, Stall-Time Fair Memory-Scheduling (STFM) [29] and Parallelism-Aware Batch Scheduling (PARBS) [30]. STFM provides quality of service to a sharedDRAM memory system among all threads. The paper observes that a memory request from517.2. Fairness and Throughput Memory Access Schedulingone thread can be slowed down by memory requests from other threads due to interferencebetween their memory requests. STFM balances all memory service time by schedulingmemory request with the high slow down first. PARBS maintains quality of service fromSTFM while improving memory system throughput. PARBS groups memory requests intoa batch and schedules them together to reduce memory latency across all threads. Thesetechniques work on the multicore CPU but have not been tested on the GPU with thousandsof threads.Jog et al. [20] proposed a scheme to schedule memory requests between kernels. Theyuse a First-Ready Round-robin First-Come First-Serve (FR-RR-FCFS) policy. The pol-icy considers fairness between kernels using round robin scheduling and overall systemthroughput using FR-FCFS. This techniques ensures memory access fairness between GPUkernels. This is scheduling policy focuses on kernel level granularity while we focus onwarp level granularity.Lakshminarayana et al. [25] consider three aspects of GPU memory scheduling. Theypropose a memory scheduling policy to switch between Shortest-Job-First (SJF) and FR-FCFSto balance tolerance, SIMD-execution and row-buffer locality. This policy prioritizes a warpbased on smallest memory request count to resume a warp as soon as possible. But it doesnot consider multiple warps. If multiple warps wait for a memory request, the memoryrequest with smallest count is not benefit these warps. We consider memory requests thatcan resume largest warps from different cores.Chatterjee et al. [7] exploit memory latency divergence within a warp. From the experi-ment they present, after a warp sends multiple memory requests, the latency among all thesememory requests has a large divergence because of inter-warp interference. They proposeda memory scheduling policy to balance this latency divergence. They added an intercon-527.3. Memory Requests Prioritizationnection network between all memory controllers for exchanging memory access latencydivergency information. By using this information, the memory access latency divergenceis reduced. Their work and our work solve the memory access latency problem from dif-ferent aspects. Their work focuses on one warp that issues divergence memory requests.We focus on reducing the memory accesses latency of all warps form different cores thataccessing the same data. Both techniques can improve performance on the GPU.7.3 Memory Requests PrioritizationGhose et al. [14] propose a scheme that assigns criticality information to memory requests.Memory request criticality in this scheme is defined as number of consumer instructions andhistorical stall time when a memory request reaches the head of ROB. This information issent to the memory controllers to help memory controllers prioritizing threads that have themost criticality. Prieto et al. [33] use a similar technique by using request distance to headof the ROB as criticality information. They reduce a single thread stall time by using ROBwhich cannot reuse on the GPU architecture. Instead, we consider all threads on the GPUarchitecture. To do this, we use the L2 MSHR to reduce stall time of all threads.Jia et al. [18] propose a scheme to reduce memory requests contentions between threads.This scheme focuses on efficiency of the L1 cache. To achieve this goal, they propose amemory prioritization buffer to reorder memory requests and bypass cache when necessary.To reduce cross-warp contention, the scheme reorders memory requests to avoid cachethrashing. To reduce intra-warp contention, the scheme by passes L1 cache and sendsmemory requests directly to lower level memory system. This work focuses on warp-leveland L1 cache optimization to reduce memory access latency. Our work focuses on thelast-level cache and the DRAM controller to improve overall memory system bandwidth.537.4. Complexity Effective Memory Access Scheduling7.4 Complexity Effective Memory Access SchedulingThe FR-FCFS is the most commonly used scheduling policies in GPU as mentioned previ-ously. However, the hardware complexity of the FR-FCFS is high because it needs a largenumber of full-associative comparisons to exploit row-buffer locality. To reduce complex-ity design of memory controller, there are works on scheduling memory requests at earlystages.Yuan et al. [42] propose an interconnection network arbitration scheme to reserve rowlocality to replace complexity circuit design of FR-FCFS DRAM controller. They observethat the interconnection network which is between cores and memory controllers can de-stroy memory access row-buffer locality. To preserve the memory access row-buffer local-ity, they use an interconnection arbitration scheme to prioritize memory requests accessingthe same row first. Using this scheme, they achieve a performance similar to FR-FCFS onlyusing a simple FIFO memory controller.Kim et al. [22] consider interconnection network congestion and row-buffer locality. Toavoid network congestions, they use a local congestion aware function to control injectionrate from each core. To prevent interconnection network interleaving memory requests,they use a technique similar to the scheme in Yuan et al. [42]. They introduce a superpacketthat group packets in the interconnect using two configurations. The superpacket is groupedwhen there are consecutive requests accessing the same DRAM row from the same core orthere are any requests accessing the same DRAM row from the same core.But these scheduling policies do not outperforms FR-FCFS. They concern the hardwarecomplexity rather than the performance. Our memory access scheduling policy outperformsFR-FCFS scheduling policy with little hardware overhead.54Chapter 8Future WorkIn this thesis, we use inter-core locality to schedule memory access. In the future, we planto explore intra-core locality to improve performance further. Intra-core locality is cap-tured by the L1 cache MSHR that is also a potential metric which can benefit performance.By utilizing the intra-core locality, we can make a warp-level fine-grained memory accessscheduling policy. The L1 cache MSHR can be used for intra-core locality. The DRAMcontroller can get information about number of warps a memory request is representing.There are two problems need to be solved by using intra-core locality. The first problem isto handle large amount of on-chip network traffic because we need to send extra requests toinform the DRAM controller about L1 MSHR information. This will put a heavy pressureon the on-chip network. If we do not deal with it carefully, the on-chip network will con-gestion result in a bad performance. The second problem is long latency and low accuracy.Because there are a few hundred cycles to send a normal request from the processor sideto the DRAM controller. When a request that carries intra-locality information reached theDRAM controller, a hundreds of cycles already passed. It is high chance that the intra-55Chapter 8: Future Workcore locality information in L1 cache has been changed. This result in a low accuracy ofintra-core locality in the DRAM controller side. To help with these two problems, a pos-sible solution is to redesign current interconnection networks to handle intra-core localityrequests.56Chapter 9ConclusionIn this thesis, we introduce inter-core locality for GPUs. Inter-core locality can be capturedby L2 cache MSHR merge length. We quantify inter-core locality for GPU applicationswith thousands of threads. To exploit the inter-core locality, we introduce an inter-corelocality aware memory scheduling policy by using L2 cache MSHR merge information.We propose three scheduling policies. MSHR-M schedules a request with largest inter-core locality but does not consider other requests within the same row. MSHR-S uses thesummation of request scores as a row score to choose a row that benefits most inter-core lo-cality. MSHR-S+A further improves performance by preventing starvation of requests withlow scores. We show a harmonic mean performance improvement of 11% with applicationswith high inter-core locality and 3% performance improvement with applications with lowinter-core locality.57Bibliography[1] NVIDIA CUDA C Programming Guide v4.2., April 2015. →pages 1[2] NVIDIA CUDA SDK code samples. ,April 2015. → pages 31[3] Numerical analytics.,February 2015. → pages 1[4] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt. Analyzingcuda workloads using a detailed gpu simulator. Performance Analysis of Systems andSoftware, 2009. ISPASS 2009. IEEE International Symposium on, pages 163–174,2009. → pages 31[5] M. Burtscher, R. Nasre, and K. Pingali. A quantitative study of irregular programs ongpus. In Proceedings of the 2012 IEEE International Symposium on WorkloadCharacterization (IISWC), IISWC ’12, pages 141–151, Washington, DC, USA,2012. IEEE Computer Society. ISBN 978-1-4673-4531-6.doi:10.1109/IISWC.2012.6402918. → pages 31[6] N. Chatterjee, N. Muralimanohar, R. Balasubramonian, A. Davis, and N. P. Jouppi.Staged reads: Mitigating the impact of dram writes on dram reads. In Proceedings ofthe 2012 IEEE 18th International Symposium on High-Performance ComputerArchitecture, HPCA ’12, pages 1–12, Washington, DC, USA, 2012. IEEE ComputerSociety. ISBN 978-1-4673-0827-4. doi:10.1109/HPCA.2012.6168943. → pages 51[7] N. Chatterjee, M. O’Connor, G. H. Loh, N. Jayasena, and R. Balasubramonian.Managing dram latency divergence in irregular gpgpu applications. In Proceedingsof the International Conference for High Performance Computing, Networking,Storage and Analysis, SC ’14, pages 128–139, Piscataway, NJ, USA, 2014. IEEE58BibliographyPress. ISBN 978-1-4799-5500-8. doi:10.1109/SC.2014.16. URL → pages 2, 52[8] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron.Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the2009 IEEE International Symposium on Workload Characterization (IISWC), IISWC’09, pages 44–54, Washington, DC, USA, 2009. IEEE Computer Society. ISBN978-1-4244-5156-2. doi:10.1109/IISWC.2009.5306797. → pages 4, 31[9] S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, L. Wang, and K. Skadron. Acharacterization of the rodinia benchmark suite with comparison to contemporarycmp workloads. In Proceedings of the IEEE International Symposium on WorkloadCharacterization (IISWC’10), IISWC ’10, pages 1–11, Washington, DC, USA, 2010.IEEE Computer Society. ISBN 978-1-4244-9297-8.doi:10.1109/IISWC.2010.5650274. → pages 4, 31[10] H.-Y. Cheng, C.-H. Lin, J. Li, and C.-L. Yang. Memory latency reduction via threadthrottling. In Proceedings of the 2010 43rd Annual IEEE/ACM InternationalSymposium on Microarchitecture, MICRO ’43, pages 53–64, Washington, DC, USA,2010. IEEE Computer Society. ISBN 978-0-7695-4299-7.doi:10.1109/MICRO.2010.39. URL →pages 2[11] A. Coates, B. Huval, T. Wang, D. J. Wu, and A. Y. Ng. Deep learning with cots hpcsystems. In Proceedings of the 30th International Conference on Machine Learning,volume 28, Atlanta, Georgia, USA, 2013. → pages 1[12] W. Dally and B. Towles. Principles and Practices of Interconnection Networks.Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003. ISBN0122007514. → pages 9[13] R. Eckert. Page stream sorter for dram systems, May 20 2008. URL → pages 6[14] S. Ghose, H. Lee, and J. F. Martínez. Improving memory scheduling viaprocessor-side load criticality information. In Proceedings of the 40th AnnualInternational Symposium on Computer Architecture, ISCA ’13, pages 84–95, NewYork, NY, USA, 2013. ACM. ISBN 978-1-4503-2079-5.doi:10.1145/2485922.2485930. → pages 3, 53[15] Hynix. Hynix GDDR5 SGRAM Part H5GQ1H24AFR Revision 1.0., April2015. → pages 25, 3159Bibliography[16] B. Jacob, S. Ng, and D. Wang. Memory Systems: Cache, DRAM, Disk. MorganKaufmann Publishers Inc., San Francisco, CA, USA, 2007. ISBN 0123797519,9780123797513. → pages 12[17] H. Jang, A. Park, and K. Jung. Neural network implementation using cuda andopenmp. In Digital Image Computing: Techniques and Applications (DICTA), 2008,pages 155–161, Dec 2008. doi:10.1109/DICTA.2008.82. → pages 1[18] W. Jia, K. Shaw, and M. Martonosi. Mrpb: Memory request prioritization formassively parallel processors. In High Performance Computer Architecture (HPCA),2014 IEEE 20th International Symposium on, pages 272–283, Feb 2014.doi:10.1109/HPCA.2014.6835938. → pages 53[19] N. Jiang, D. Becker, G. Michelogiannakis, J. Balfour, B. Towles, D. Shaw, J. Kim,and W. Dally. A detailed and flexible cycle-accurate network-on-chip simulator. InPerformance Analysis of Systems and Software (ISPASS), 2013 IEEE InternationalSymposium on, pages 86–96, April 2013. doi:10.1109/ISPASS.2013.6557149. →pages 9[20] A. Jog, E. Bolotin, Z. Guz, M. Parker, S. W. Keckler, M. T. Kandemir, and C. R. Das.Application-aware memory system for fair and efficient execution of concurrentgpgpu applications. In Proceedings of Workshop on General Purpose ProcessingUsing GPUs, GPGPU-7, pages 1:1–1:8, New York, NY, USA, 2014. ACM. ISBN978-1-4503-2766-4. doi:10.1145/2576779.2576780. URL → pages 52[21] Khronos Group. OpenCL., April 2015. → pages 1[22] Y. Kim, H. Lee, and J. Kim. An alternative memory access scheduling in manycoreaccelerators. In Parallel Architectures and Compilation Techniques (PACT), 2011International Conference on, pages 195–196, Oct 2011. doi:10.1109/PACT.2011.37.→ pages 54[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deepconvolutional neural networks. In F. Pereira, C. Burges, L. Bottou, andK. Weinberger, editors, Advances in Neural Information Processing Systems 25,pages 1097–1105. Curran Associates, Inc., 2012. URL →pages 1[24] D. Kroft. Lockup-free instruction fetch/prefetch cache organization. In Proceedingsof the 8th Annual Symposium on Computer Architecture, ISCA ’81, pages 81–87,Los Alamitos, CA, USA, 1981. IEEE Computer Society Press. → pages 1160Bibliography[25] N. B. Lakshminarayana, J. Lee, H. Kim, and J. Shin. Dram scheduling policy forgpgpu architectures based on a potential function. IEEE Comput. Archit. Lett., 11(2):33–36, July 2012. ISSN 1556-6056. doi:10.1109/L-CA.2011.32. URL → pages 51, 52[26] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. Nvidia tesla: A unifiedgraphics and computing architecture. Micro, IEEE, 28(2):39–55, March 2008. ISSN0272-1732. doi:10.1109/MM.2008.31. → pages 2[27] S. McKee, R. Klenke, A. Schwab, W. Wulf, S. Moyer, J. Aylor, and C. Hitchcock.Experimental implementation of dynamic access ordering. In System Sciences, 1994.Proceedings of the Twenty-Seventh Hawaii International Conference on, volume 1,pages 431–440, Jan 1994. doi:10.1109/HICSS.1994.323142. → pages 2, 50[28] S. A. McKee and W. A. Wulf. A memory controller for improved performance ofstreamed computations on symmetric multiprocessors. In Proceedings of the 10thInternational Parallel Processing Symposium, IPPS ’96, pages 159–165,Washington, DC, USA, 1996. IEEE Computer Society. ISBN 0-8186-7255-2. URL → pages 51[29] O. Mutlu and T. Moscibroda. Stall-time fair memory access scheduling for chipmultiprocessors. pages 146–160, 2007. → pages 2, 51[30] O. Mutlu and T. Moscibroda. Parallelism-Aware Batch Scheduling: Enhancing bothPerformance and Fairness of Shared DRAM Systems. Computer Architecture, 2008.ISCA ’08. 35th International Symposium on, pages 63–74, 2008. → pages 2, 51[31] J. Nageswaran, N. Dutt, J. Krichmar, A. Nicolau, and A. Veidenbaum. Efficientsimulation of large-scale spiking neural networks using cuda graphics processors. InNeural Networks, 2009. IJCNN 2009. International Joint Conference on, pages2145–2152, June 2009. doi:10.1109/IJCNN.2009.5179043. → pages 1[32] K. J. Nesbit, N. Aggarwal, J. Laudon, and J. E. Smith. Fair queuing memorysystems. In Proceedings of the 39th Annual IEEE/ACM International Symposium onMicroarchitecture, MICRO 39, pages 208–222, Washington, DC, USA, 2006. IEEEComputer Society. ISBN 0-7695-2732-9. doi:10.1109/MICRO.2006.24. URL → pages 2[33] P. Prieto, V. Puente, and J. A. Gregorio. Cmp off-chip bandwidth scheduling guidedby instruction criticality. In Proceedings of the 27th International ACM Conferenceon International Conference on Supercomputing, ICS ’13, pages 379–388, NewYork, NY, USA, 2013. ACM. ISBN 978-1-4503-2130-3.61Bibliographydoi:10.1145/2464996.2465019. URL→ pages 53[34] N. Rafique, W.-T. Lim, and M. Thottethodi. Effective management of drambandwidth in multicore processors. In Proceedings of the 16th InternationalConference on Parallel Architecture and Compilation Techniques, PACT ’07, pages245–258, Washington, DC, USA, 2007. IEEE Computer Society. ISBN0-7695-2944-5. doi:10.1109/PACT.2007.29. URL → pages 2[35] S. Rixner. Memory controller optimizations for web servers. In Proceedings of the37th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 37,pages 355–366, Washington, DC, USA, 2004. IEEE Computer Society. ISBN0-7695-2126-6. doi:10.1109/MICRO.2004.22. URL → pages 6[36] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens. Memory accessscheduling. In Computer Architecture, 2000. Proceedings of the 27th InternationalSymposium on, pages 128–138. IEEE, 2000. → pages 2, 6, 40, 51[37] T. G. Rogers, M. O’Connor, and T. M. Aamodt. Cache-conscious wavefrontscheduling. In Proceedings of the 2012 45th Annual IEEE/ACM InternationalSymposium on Microarchitecture, MICRO-45, pages 72–83, Washington, DC, USA,2012. IEEE Computer Society. ISBN 978-0-7695-4924-8.doi:10.1109/MICRO.2012.16. → pages 32[38] M. C. Schatz, C. Trapnell, A. L. Delcher, and A. Varshney. High-throughputsequence alignment using graphics processing units. BMC Bioinformatics, 8:474,2007. doi:10.1186/1471-2105-8-474. → pages 1[39] J. Stuecheli, D. Kaseridis, D. Daly, H. C. Hunter, and L. K. John. The virtual writequeue: coordinating DRAM and last-level cache policies. ACM SIGARCH ComputerArchitecture News, 38(3):72–82, June 2010. → pages 15[40] J. Tuck, L. Ceze, and J. Torrellas. Scalable cache miss handling for highmemory-level parallelism. In Proceedings of the 39th Annual IEEE/ACMInternational Symposium on Microarchitecture, MICRO 39, pages 409–422,Washington, DC, USA, 2006. IEEE Computer Society. ISBN 0-7695-2732-9.doi:10.1109/MICRO.2006.44. URL →pages 1162Bibliography[41] R. Uetz and S. Behnke. Large-scale object recognition with cuda-acceleratedhierarchical neural networks. In Intelligent Computing and Intelligent Systems, 2009.ICIS 2009. IEEE International Conference on, volume 1, pages 536–541, Nov 2009.doi:10.1109/ICICISYS.2009.5357786. → pages 1[42] G. L. Yuan, A. Bakhoda, and T. M. Aamodt. Complexity effective memory accessscheduling for many-core accelerator architectures. In MICRO 42: Proceedings ofthe 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages34–44, New York, New York, USA, Dec. 2009. ACM Request Permissions. →pages 2, 5463


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items