Open Collections

UBC Theses and Dissertations

Locality and scheduling in the massively multithreaded era Rogers, Timothy Glenn 2015


Full Text

Locality and Scheduling in the Massively Multithreaded Era

by

Timothy Glenn Rogers

B. Eng, McGill University, 2005

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES

(Electrical and Computer Engineering)

The University of British Columbia (Vancouver)

October 2015

© Timothy Glenn Rogers, 2015

Abstract

Massively parallel processing devices, like Graphics Processing Units (GPUs), have the ability to accelerate highly parallel workloads in an energy-efficient manner. However, executing irregular or less tuned workloads poses performance and energy-efficiency challenges on contemporary GPUs. These inefficiencies come from two primary sources: ineffective management of locality and decreased functional unit utilization. To decrease these effects, GPU programmers are encouraged to restructure their code to fit the underlying hardware architecture, which affects the portability of their code and complicates the GPU programming process. This dissertation proposes three novel GPU microarchitecture enhancements for mitigating both the locality and utilization problems on an important class of irregular GPU applications. The first mechanism, Cache-Conscious Warp Scheduling (CCWS), is an adaptive hardware mechanism that makes use of a novel locality detector to capture memory reference locality that is lost by other schedulers due to excessive contention for cache capacity. On cache-sensitive, irregular GPU workloads, CCWS provides a 63% speedup over previous scheduling techniques. This dissertation uses CCWS to demonstrate that improvements to the hardware thread scheduling policy in massively multithreaded systems offer a promising new design space to explore in locality management. The second mechanism, Divergence-Aware Warp Scheduling (DAWS), introduces a divergence-based cache footprint predictor to estimate how much L1 data cache capacity is needed to capture locality in loops.
We demonstrate that the predictive, pre-emptive nature of DAWS can provide an additional 26% performance improvement over CCWS. This dissertation also demonstrates that DAWS can effectively shift the burden of locality management from software to hardware by increasing the performance of simpler and more portable code on the GPU. Finally, this dissertation details a Variable Warp-Size Architecture (VWS) which improves the performance of irregular applications by 35%. VWS improves irregular code by using a smaller warp size while maintaining the performance and energy-efficiency of regular code by ganging the execution of these smaller warps together in the warp scheduler.

Preface

The following is a list of my publications that have been incorporated into this dissertation, in chronological order:

[C1] Timothy G. Rogers, Mike O'Connor, Tor M. Aamodt. Cache-Conscious Wavefront Scheduling [137]. In Proceedings of the 45th IEEE/ACM International Symposium on Microarchitecture (MICRO-45), pp. 72-83, December 2012.

[J1] Timothy G. Rogers, Mike O'Connor, Tor M. Aamodt. Cache-Conscious Thread Scheduling for Massively Multithreaded Processors [138]. IEEE Micro Special Issue: Micro's Top Picks from 2012 Computer Architecture Conferences, Vol. 33, No. 3, pp. 78-85, May/June 2013.

[C2] Timothy G. Rogers, Mike O'Connor, Tor M. Aamodt. Divergence-Aware Warp Scheduling [139]. In Proceedings of the 46th IEEE/ACM International Symposium on Microarchitecture (MICRO-46), pp. 99-110, December 2013.

[J2] Timothy G. Rogers, Mike O'Connor, Tor M. Aamodt. Learning Your Limit: Managing Massively Multithreaded Caches Through Scheduling [140]. Communications of the ACM, Vol. 57, No. 12, December 2014.

[C3] Timothy G. Rogers, Daniel R. Johnson, Mike O'Connor, Stephen W. Keckler. A Variable Warp Size Architecture [141]. In Proceedings of the International Symposium on Computer Architecture (ISCA), pp. 489-501, June 2015.

The preceding publications have been included in this thesis as follows:

Chapter 1. Some motivational elements of this chapter have been previously published in [J1].

Chapter 2. Elements of this chapter that describe the baseline GPU architecture have been taken from [C1], [C2] and [C3].

Chapter 3. A version of this material has been published as [C1], [J1] and [J2]. In [C1], [J1] and [J2], I performed the research, interpreted the data and wrote the manuscript with guidance and input from Mike O'Connor and Professor Tor M. Aamodt.

Chapter 4. A version of this material has been published as [C2]. In [C2], I performed the research, interpreted the data and wrote the manuscript with guidance and input from Mike O'Connor and Professor Tor M. Aamodt.

Chapter 6. A version of this material has been published as [C3]. In [C3], I performed the research, interpreted the data and wrote the manuscript with guidance and input from Dr. Daniel R. Johnson, Mike O'Connor and Dr. Stephen W. Keckler.

Chapter 7. Text from the related work sections of [C1], [C2] and [C3] has been incorporated into this section.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgments

1 Introduction
1.1 Massively Parallel Computing Trends
1.2 GPU Programmability Challenges
1.3 GPU Memory Locality Challenges
1.4 GPU Control Flow Challenges
1.5 Thesis Statement
1.6 Contributions
1.7 Organization

2 Background
2.1 GPU Programming Model
2.2 Contemporary GPU Architecture
2.3 Memory on GPUs
2.4 Control flow on GPUs

3 Cache-Conscious Warp Scheduling
3.1 Effect of Shaping the Access Pattern
3.2 Warp Scheduling to Preserve Locality
3.2.1 A Code Example
3.2.2 Static Warp Limiting (SWL)
3.2.3 Cache-Conscious Warp Scheduling (CCWS)
3.3 Experimental Methodology
3.3.1 GPU-enabled server workloads
3.4 Experimental Results
3.4.1 Performance
3.4.2 Detailed Breakdown of Inter- and Intra-Warp Locality
3.4.3 Sensitivity to Victim Tag Array Size
3.4.4 Sensitivity to Cache Size
3.4.5 Sensitivity to K_THROTTLE and Tuning for Power
3.4.6 Static Warp Limiting Sensitivity
3.4.7 Area Estimation
3.5 Summary

4 Divergence-Aware Warp Scheduling
4.1 Divergence, Locality and Scheduling
4.1.1 Application Locality
4.1.2 Static Load Classification
4.2 Divergence-Aware Warp Scheduling (DAWS)
4.2.1 Profiled Divergence-Aware Warp Scheduling (Profiled-DAWS)
4.2.2 Detected Divergence-Aware Warp Scheduling (Detected-DAWS)
4.3 Experimental Methodology
4.4 Experimental Results
4.4.1 Performance
4.4.2 Determining the Associativity Factor
4.4.3 Area Estimation
4.4.4 Dynamic Energy Estimation
4.5 Summary

5 A Programmability Case Study
5.1 Case Study Results

6 A Variable Warp-Size Architecture
6.1 Trade-offs of Warp Sizing
6.1.1 Warp Size and Memory Locality
6.1.2 Warp Size and SM Front-end Pressure
6.2 Variable Warp Sizing
6.2.1 High-level Operation
6.2.2 Warp Ganging Unit
6.2.3 Gang Table
6.2.4 Gang Splitting
6.2.5 Gang Reformation
6.2.6 Instruction Supply
6.3 Experimental Methodology
6.4 Experimental Results
6.4.1 Performance
6.4.2 Front-end Pressure
6.4.3 Gang Scheduling Policies
6.4.4 Gang Reformation Policies
6.4.5 Gang Splitting Policies
6.4.6 Gang Size Distribution
6.4.7 Area Overheads
6.5 Comparison to Previous Work
6.5.1 Quantitative Comparison
6.5.2 Qualitative Comparison
6.6 Summary

7 Related Work
7.1 Work Related to Cache-Conscious Warp Scheduling and Divergence-Aware Warp Scheduling
7.1.1 Throttling to Improve Performance
7.1.2 GPU Thread Scheduling Techniques
7.1.3 GPU Caching
7.1.4 CPU Thread Scheduling Techniques
7.1.5 Cache Capacity Management
7.1.6 Locality Detection
7.2 Branch and Memory Divergence Mitigation

8 Conclusions and Future Work
8.1 Conclusions
8.2 Future Directions
8.2.1 Capturing Locality in a Variable Warp-Size Architecture
8.2.2 Exploiting Shared (Inter-Warp) Locality
8.2.3 Adaptive Cache Blocking and Warp Scheduling
8.2.4 A Programmable Warp Scheduler for Debugging and Synchronization
Bibliography

List of Tables

Table 3.1 Cache-conscious warp scheduling GPGPU-Sim configuration
Table 3.2 Cache-conscious warp scheduling benchmarks
Table 3.3 Best-static warp limiting configuration
Table 4.1 Divergence-aware warp scheduling GPGPU-Sim configuration
Table 4.2 Divergence-aware warp scheduling applications
Table 4.3 Previous work configurations for divergence-aware warp scheduling
Table 4.4 Divergence-aware warp scheduling configuration
Table 6.1 Variable warp sizing simulator configuration
Table 6.2 Variable warp sizing area overhead estimates
Table 6.3 Qualitative characterization of divergence mitigation techniques

List of Figures

Figure 2.1 Contemporary GPU architecture
Figure 3.1 Unbounded L1 cache hits and misses
Figure 3.2 Performance versus L1 cache size
Figure 3.3 Performance and cache hit rate versus multithreading level
Figure 3.4 Example access pattern of a cache unaware warp scheduler
Figure 3.5 Example access pattern of a cache-aware warp scheduler
Figure 3.6 Cache-conscious warp scheduling microarchitecture
Figure 3.7 Cache-conscious warp scheduling locality scoring system example
Figure 3.8 Cache-conscious warp scheduling performance on cache-sensitive applications
Figure 3.9 Cache-conscious warp scheduling MPKI on cache-sensitive applications
Figure 3.10 Cache-conscious warp scheduling performance on cache-insensitive applications
Figure 3.11 Cache-conscious warp scheduling MPKI on cache-insensitive applications
Figure 3.12 Detailed breakdown of intra/inter warp locality on cache-sensitive applications
Figure 3.13 Cache-conscious warp scheduling intra/inter warp locality breakdown on highly and moderately cache-sensitive applications
Figure 3.14 Cache-conscious warp scheduling performance versus victim tag array size
Figure 3.15 Cache-conscious warp scheduling performance at different L1 data cache sizes
Figure 3.16 Performance of cache-conscious warp scheduling on BFS with different input sizes
Figure 3.17 Cache-conscious warp scheduling performance and MPKI at different K_THROTTLE values
Figure 3.18 Performance of static warp limiting at different multithreading levels
Figure 3.19 Performance of static warp limiting with different multithreading values on different BFS inputs
Figure 4.1 Divergence-aware scheduling operation example
Figure 4.2 Code location of locality in cache-sensitive applications
Figure 4.3 Characterization of locality in loops
Figure 4.4 Characterization of memory accesses and branch divergence in BFS
Figure 4.5 Divergence-aware warp scheduling cache footprint prediction mechanism
Figure 4.6 Divergence-aware warp scheduling microarchitecture
Figure 4.7 Divergence-aware warp scheduling performance
Figure 4.8 Divergence-aware warp scheduling lane activity breakdown
Figure 4.9 Divergence-aware warp scheduling locality breakdown
Figure 4.10 Divergence-aware warp scheduling victim tag array performance
Figure 4.11 Divergence-aware warp scheduling core activity breakdown
Figure 4.12 Detected divergence-aware warp scheduling performance versus associativity factor
Figure 5.1 Highly divergent SPMV-Scalar kernel
Figure 5.2 GPU-optimized SPMV-Vector kernel
Figure 5.3 SPMV-Scalar execution times with different warp schedulers
Figure 5.4 SPMV-Scalar metrics versus warp scheduler
Figure 6.1 A survey of performance versus warp size
Figure 6.2 Performance and function unit utilization versus warp size
Figure 6.3 L1 cache locality versus warp size
Figure 6.4 Instructions fetched per cycle versus warp size
Figure 6.5 Variable warp sizing microarchitecture
Figure 6.6 Variable warp sizing performance on 15 applications
Figure 6.7 Variable warp sizing performance on 165 applications
Figure 6.8 Variable warp sizing fetches per cycle
Figure 6.9 Variable warp sizing performance with different instruction cache configurations
Figure 6.10 Variable warp sizing performance versus scheduling policy
Figure 6.11 Variable warp sizing fetches per cycle with different scheduling policies
Figure 6.12 Variable warp sizing performance versus number of gangs able to issue each cycle
Figure 6.13 Variable warp sizing performance with elastic gang reformation
Figure 6.14 Variable warp sizing average fetches per cycle with elastic gang reformation
Figure 6.15 Variable warp sizing performance versus gang splitting policies
Figure 6.16 Variable warp sizing average fetches per cycle versus gang splitting policies
Figure 6.17 Variable warp sizing gang size versus time
Figure 6.18 Variable warp sizing performance versus related work

List of Abbreviations

GPU Graphics Processing Unit
GPGPU General Purpose Graphics Processing Unit
SIMD Single-Instruction, Multiple-Data
SIMT Single-Instruction, Multiple-Thread
HCS Highly Cache-Sensitive
MCS Moderately Cache-Sensitive
CI Cache-Insensitive
L0 Level Zero
L1 Level One
L2 Level Two
L1D Level One Data
PKI Per Thousand Instructions
MPKI Misses Per Thousand Instructions
CCWS Cache-Conscious Warp Scheduling
LLD Lost intra-warp Locality Detector
MEMC Memcached
IPC Instructions Per Cycle
WIA Warp Issue Arbiter
LRU Least Recently Used
SWL Static Warp Limiting
VTA Victim Tag Array
WID Warp ID
LLS Lost-Locality Score
LLDS Lost-Locality Detected Score
IC Integrated Circuit
CPU Central Processing Unit
CMP Chip-MultiProcessor
SAGCS Stand Alone GPGPU-Sim Cache Simulator
FR-FCFS First-Ready First-Come First-Serve
CUDA Compute Unified Device Architecture
GC Garbage Collector
GTO Greedy-Then-Oldest
FG Fetch Groups
DLP Data-Level Parallelism
TLP Thread-Level Parallelism
ILP Instruction-Level Parallelism
CMOS Complementary Metal-Oxide-Semiconductor
MIMD Multiple-Instruction, Multiple-Data
DAWS Divergence-Aware Warp Scheduling
VWS Variable Warp Sizing
E-VWS Elastic Variable Warp Sizing
I-VWS Inelastic Variable Warp Sizing
PC Program Counter
I-Buffer Instruction Buffer
BGTO Big-Gang-Then-Oldest
LGTO Little-Gang-Then-Oldest
LRR Loose Round Robin
I-cache Instruction-cache
SM Streaming Multiprocessor
MSHR Miss Status Holding Register
HPC High Performance Computing
CoMD Co-design Molecular Dynamics
FFT Fast Fourier Transform
CRS Call Return Stack
TBC Thread Block Compaction
DWF Dynamic Warp Formation
NVRT NVIDIA Raytracing
FCDT Face Detection
MGST Merge Sort
NNC Nearest Neighbour Computation
DWS Dynamic Warp Subdivision
CF Control Flow
MSMD Multiple-SIMD, Multiple Data
CTA Cooperative Thread Array
WB Warp Barrier
CSR Compressed Sparse Row
BFS Breadth First Search
SPMV Sparse Matrix Vector Multiply
PDOM Post Dominator
ILRP Intra-Loop Repetition Detector
API Application Programming Interface
TDP Thermal Design Point
FLOPS FLoating point Operations Per Second
FGMT Fine-Grained MultiThreading
SMT Simultaneous MultiThreading
APU Accelerated Processing Unit
AMD Advanced Micro Devices
GDDR Graphic Dual Data Rate
DRAM Dynamic Random Access Memory

Acknowledgments

First and foremost, I would like to thank my wife for sticking with me while I quit my job and walked out on several hundred thousand dollars for the off-chance that I would succeed as a PhD student. I could not have done any of this without Jenelle. Second, I would like to thank my children for their energy, excitement and always being there to remind me what is most important. My final personal thanks goes out to my parents and grandparents, who raised me to always aspire to better myself, never give up and fight for every inch of my goals.

Professionally, I would like to thank my advisor Professor Tor Aamodt for always pushing me to pursue high quality research, giving me the freedom to explore my own ideas and helping me develop them. Tor's guidance over the last five years has been invaluable and has shaped everything I know about doing good research. A special thanks to Mike O'Connor for being a great colleague who is also a lot of fun to work with. Mike's depth of industrial knowledge has helped influence all the research I have done.
From UBC, I would also like to thank Tor's previous PhD graduates Wilson Fung and Ali Bakhoda for going through this process before me and helping me with every aspect of finally getting my PhD submitted. Thanks to all of the UBC computer architecture students I have worked with: Tayler Hetherington, Dongdong Li, Ayub Gubran, Dave Evans, Jonny Kuan, Hadi Hooybar, Inderpreet Singh, Jimmy Kwa, Andrew Turner, Arun Ramamurthy, Rimon Tadros, Shadi Assadi, Ahmed ElTantawy and Andrew Boktor. From NVIDIA Research, I would like to thank Daniel Johnson and Steve Keckler for their dedicated collaboration. From AMD Research, I would like to thank Brad Beckmann and Gabe Loh for their mentorship and insights on architecture research. I would also like to thank the members of my PhD qualifying and final PhD defense exams: Professor Matei Ripeanu, Professor William Dunford, Professor Philippe Kruchten, Professor Steve Wilson, Professor Sathish Gopalakrishnan, Professor Mieszko Lis, Professor Michiel van de Panne, Professor Alan Wagner, Professor Albert Dexter and finally my external examiner, Professor Onur Mutlu, for their direction and suggested improvements to my thesis.

Finally, I would like to acknowledge the funding sources that made my research possible: the Alexander Graham Bell Canada Graduate Scholarship (CGS-D) provided by the Natural Sciences and Engineering Research Council of Canada (NSERC), as well as the NVIDIA Graduate Fellowship.

To my wife Jenelle and our children, Erica and Nathan.

Chapter 1

Introduction

The dawn of Turing complete [162], digital, electronic computing in the mid-twentieth century was brought about by the invention of programmable computer hardware [58, 65, 169, 170]. In the 80 years since their creation, digital computers have undergone a series of revolutions in their architecture, implementation and programming interface.
These revolutions have improved the performance, energy-efficiency, cost-effectiveness and programmability of the modern computer to a point where processors are universally integrated into our society.

One of the aforementioned revolutions began in 1971 with the introduction of the Intel 4004 [4]. The 4004 was the first commercially available single-chip central processing unit (CPU) built using integrated-circuit (IC) technology. Since the introduction of the microprocessor, IC fabrication technologies have advanced at a rate which enabled the number of transistors on a cost-effective IC to double roughly once every two years. This trend, first discovered by Gordon Moore, is commonly referred to as Moore's Law [121]. In the early days of Moore's Law, the exponential increase in transistor count, primarily due to transistor gate length shrinking, was coupled with Dennard scaling [45] and enabled the clock frequency (and correspondingly the performance) of single-threaded, single-chip microprocessors to scale exponentially at roughly the same rate as transistor density. However, beginning around 2005, the power density guarantees provided by Dennard scaling broke down, halting the cost-effective increase in processor frequency scaling, which has led to what is commonly referred to as the Power Wall [16, 122].
The end of single thread performance scaling has brought about the mostrecent revolution in computer architecture.A relatively simple way to improve performance and energy-efficiency in theface of diminishing single thread performance is to produce chips that contain mul-tiple processor cores, known as chip multiprocessors (CMP). CMPs improve per-formance by relying on software to explicitly expose parallelism. This can be donein the operating system at the process level (known as multiprocessing) where mul-tiple programs are run on each of the chip’s processors. Individual programs canalso be written in a parallel format (known as multithreading) using a number ofparallel application program interfaces (APIs) [1, 2, 68, 69, 87]. However, recentwork exploring dark silicon [49] (which is the portion of the chip that must be pow-ered off at any given time to stay within a power budget) suggests that multi-corescaling will not continue for future technology generations. In addition to CMP de-velopment, the break-down in single threaded energy-efficiency and performancehas lead computer architects to pursue more drastic alternatives to improve com-puting capabilities.One such alternative is the massively parallel acceleration provided by graphicsprocessing units (GPUs). Originally developed for accelerating graphics applica-tions, APIs like CUDA [6] and OpenCL [87] allow programmers to write generalpurpose software for GPUs. Running general purpose (or compute) workloads ona GPU is commonly referred to as general purpose GPU (GPGPU) computing.The GPU’s transition from an application specific accelerator to a general pur-pose device has progressed gradually over the last several product generations.However, a number of important challenges remain when accelerating general pur-pose software on contemporary GPUs. 
This dissertation proposes hardware solutions to help solve these challenges.

1.1 Massively Parallel Computing Trends

The surge in popularity of GPGPU computing is the result of two factors, one economical and one scientific. Like all engineering fields, good computer engineering research must account for the cost-effectiveness of any solution. Since GPUs are sold in substantial volume for graphics, the cost of GPU hardware as a computing platform is kept relatively low. However, for this dissertation, the scientific reason for the success of GPGPUs is more interesting: energy-efficiency. By running many thousands of threads in parallel at a relatively low clock frequency (as opposed to the tens of threads runnable on contemporary CMPs), GPUs are able to perform a large amount of computation for relatively little energy. In 2015, the peak computational efficiency (peak performance / thermal design point (TDP)) of state-of-the-art GPUs from AMD and NVIDIA was > 18 GFLOPS/W (Floating point Operations Per Second per Watt), whereas the most efficient CMPs from Intel performed < 6 GFLOPS/W, even though the CMPs were fabricated at a more energy-efficient technology node. In addition to the theoretical peak energy-efficiency of GPUs, it has been demonstrated that GPUs are an effective computing platform for a variety of practical workloads [64, 99]. In fact, 9 of the top 10 supercomputers on the 2014 Green 500 List [50] (which ranks global supercomputers based on their measured energy-efficiency on the Linpack Benchmark report [46]) contain a GPU.

The fundamental reason for the GPU's computational efficiency is parallelism. Traditional 3D graphics workloads, sometimes called shaders, perform thousands of independent, parallel computations on the elements that make up a rendered scene. Each of these operations helps to determine the output color of each pixel in the final rendered scene.
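A deliberately tiny illustration of that per-pixel independence (a toy sketch in ordinary Python, not real shader code; `shade_pixel` is a made-up stand-in for a shader program): each output pixel is a pure function of its own coordinates, so every pixel could in principle be computed at the same time.

```python
# Toy illustration of per-pixel data parallelism (hypothetical
# "shader"; not real GPU code): each pixel's color depends only
# on its own (x, y) coordinates.
def shade_pixel(x, y):
    r = x % 256
    g = y % 256
    b = (x + y) % 256
    return (r, g, b)

WIDTH, HEIGHT = 4, 2
# No pixel reads another pixel's result, so the evaluation order is
# irrelevant -- exactly the property parallel graphics hardware exploits.
image = [[shade_pixel(x, y) for x in range(WIDTH)] for y in range(HEIGHT)]
print(image[1][3])  # -> (3, 1, 4)
```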
As a result, the 3D rendering process is very data parallel, and the hardware created to accelerate it takes advantage of this parallelism to increase throughput.

Since the software running on GPUs has exposed a large amount of parallelism, GPUs traditionally rely on aggressive fine-grained multithreading (FGMT) to tolerate long latency operations (such as accessing global memory). FGMT operates by issuing instructions from different threads when a single thread is stalled from executing. Given enough threads and enough memory bandwidth, the GPU's datapath can be kept busy, increasing overall throughput at the expense of single-thread latency. In contrast, traditional CPU (and CMP) designs devote a significant portion of the chip's area and power budget to mechanisms like out-of-order execution and branch prediction that attempt to cover stalled execution by extracting more ILP from a single thread.

Since the exposed parallelism in traditional GPU software is data parallel, GPUs can extract energy-efficiency gains from executing instructions in a wide single instruction multiple data (SIMD) format. GPUs group the threads defined by the software into larger entities, known as warps (or wavefronts) in hardware. The threads in a warp share a program counter (PC), allowing them to execute in lock-step on the SIMD hardware and amortize the overhead of executing an instruction over the width of the warp.

Although massive FGMT and SIMD execution enable GPUs to be highly energy-efficient, they present a number of interesting challenges surrounding programmability, efficient control flow and locality management.

1.2 GPU Programmability Challenges

Since the early days of the ENIAC [58] and the Harvard Mark I [65], the programmability of computer hardware has played a vital role in how useful that hardware ends up being. The same is true in today's CPU versus GPU debate.
GPU programming models, like CUDA and OpenCL, allow software developers to write code for a single scalar thread of execution. This programming model allows developers to write multiple-instruction, multiple-data (MIMD) programs that operate on SIMD hardware. A regular program is one where each scalar thread executes a similar path through the code (control flow) and adjacent threads in the program operate on data with similar memory addresses. Regular programs are well suited for GPU acceleration since they can take advantage of amortizations done in the hardware. In contrast, the GPGPU programming model also supports irregular applications, where control flow and memory accesses among threads are less uniform. These irregular applications cause performance and energy-efficiency challenges on contemporary GPUs. As a result, GPU programmers are encouraged to find ways to restructure the algorithm and data of their applications to remove irregularity and improve performance. This process is not guaranteed to improve performance and can be both difficult and time consuming, even for experienced programmers. Moreover, porting an existing piece of parallel code to an accelerator and having it run efficiently is a challenge [142]. This dissertation seeks to decrease the time and complexity involved in writing efficient GPU software by tackling some of the key challenges that stem from irregularity.

1.3 GPU Memory Locality Challenges

Since massively parallel architectures execute thousands of threads concurrently, a high bandwidth main memory system is necessary to handle the traffic generated. Additionally, GPUs employ several mechanisms to exploit memory reference locality. They use a technique known as memory coalescing [6] (described in more detail in Chapter 2) to aggregate redundant memory accesses among the threads in a warp.
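A rough sketch of that aggregation (a simplified model assuming a 128-byte transaction granularity; the real coalescing rules are generation-specific and more involved): the per-thread addresses of one warp memory instruction collapse into the set of distinct aligned lines they touch.

```python
def coalesce(addresses, line_size=128):
    """Simplified coalescing model: map one warp's per-thread byte
    addresses to the distinct line-aligned transactions they fall in."""
    return sorted({addr - addr % line_size for addr in addresses})

# 32 threads loading consecutive 4-byte words: one 128-byte transaction.
unit_stride = [4 * tid for tid in range(32)]
print(len(coalesce(unit_stride)))  # -> 1

# The same 32 threads striding one full line apart: 32 transactions.
strided = [128 * tid for tid in range(32)]
print(len(coalesce(strided)))      # -> 32
```

The two extremes show why regular, unit-stride access patterns are rewarded: the same warp instruction can cost anywhere from one memory transaction to one per thread.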
If there is spatial locality across the threads in a warp when a memory instruction is issued, the GPU will group the accesses for all the threads in the warp together, generating fewer main memory system requests. Regular applications often take advantage of this feature. Like CPUs, GPUs also employ on-chip data caches in an attempt to capture both spatial locality (where multiple memory references access data in nearby addresses) and temporal locality (where multiple memory references access the same address in a short period of time).

Since the appearance of the processor memory wall [173] (where the speed of the processor vastly outpaced the speed of the main memory system), exploiting memory reference locality in the caching system has been critical to improving CPU performance. As a result, the past 30 years have seen a significant amount of research and development devoted to using on-chip CPU caches more effectively. At a hardware level, cache management has typically been optimized by improvements to the hierarchy [25], replacement/insertion policy [130], coherence protocol [112] or some combination of these. Previous work on hardware caching assumes that the access stream seen by the memory system is fixed. However, massively multithreaded systems introduce another dimension to the problem. Each cycle, a fine-grained, issue-level thread scheduler must choose which of a core's active threads issues next. This decision has a significant impact on the access stream seen by the cache. In a massively multithreaded system, there may be more than one thousand threads ready to be scheduled on each cycle. This dissertation exploits this observation and uses the thread scheduler to explicitly manage the access stream seen by the memory system to maximize throughput. Although this dissertation studies the effect of thread scheduling and locality management on GPUs, it is a concern to any architecture where many hardware threads can share a cache.
Some examples include Intel's Xeon Phi [70], Oracle's SPARC T4 [149] and IBM's Blue Gene/Q [61].

In addition to studying the effect irregular applications have on the caching system, this dissertation also explores the effect reduced memory coalescing can have on performance and energy-efficiency when techniques aimed at mitigating control flow challenges are used.

1.4 GPU Control Flow Challenges

The GPU's scalar programming model, combined with its SIMD pipeline, necessitates a unique execution model, commonly referred to as single-instruction, multiple-thread (SIMT). Since individual threads within a warp may execute different control flow paths through the code, the GPU's control hardware supports a mechanism to disable inactive SIMD lanes when threads within a warp do not execute the same control flow path (this mechanism is discussed in more detail in Chapter 2). As a result, software with irregular control flow executes inefficiently on a GPU, since many of the lanes in the wide SIMD datapath will be disabled on each executed instruction. Writing effective GPU software for irregular or data-dependent applications is challenging, as the programmer should take steps to ensure that threads in the same warp execute similar control-flow paths.

1.5 Thesis Statement

This dissertation explores the massively multithreaded hardware design space surrounding memory locality and application irregularity. The dissertation proposes three novel hardware thread scheduling mechanisms that aim to improve massively parallel performance and energy-efficiency on a varied set of traditional regular and more forward-looking irregular massively parallel applications. The hardware modifications proposed in this dissertation are aimed at reducing the amount of hardware-specific knowledge required to write efficient massively parallel software.

The first mechanism proposed in this dissertation is cache-conscious warp scheduling (CCWS).
CCWS is a novel warp scheduling microarchitecture that uses feedback from the caching system to make issue-level scheduling decisions in the massively parallel core. CCWS proposes a novel lost locality detector which it uses to react to the over-subscription of the processor's data cache caused by excessive multithreading. Upon detecting over-subscription, CCWS reduces the number of warps sharing the cache by preventing some of them from issuing memory instructions, effectively reducing the number of threads actively scheduled.

Divergence-aware warp scheduling (DAWS) is the second mechanism proposed by this dissertation. It expands on the core insights developed in CCWS and proposes a pre-emptive mechanism to curb the issue-level scheduling of warps based on an online characterization of the program's loops. Using this information, in tandem with runtime information about the amount of control flow divergence being experienced by each warp, DAWS makes a pre-emptive prediction about each warp's cache footprint and curbs the scheduling of warps such that the capacity of the system's data cache is not exceeded.

This dissertation then studies the effect issue-level thread scheduling in hardware can have on the programmability of GPUs. A case study is performed on two implementations of sparse matrix-vector multiply: one GPU-optimized, one not. The dissertation then studies the performance and memory system impacts of both implementations using CCWS and DAWS to demonstrate that issue-level thread scheduling in the hardware can come within 4% of the performance of tuned GPU code.

Finally, this dissertation explores a variable warp-size architecture (VWS). VWS seeks to solve the control flow divergence problem in irregular or less optimized GPU programs by proposing a machine that is capable of running with a narrower warp size.
VWS demonstrates that operating at a small warp size is detrimental to the horizontal locality (described in Chapter 2) present in a large set of existing GPU applications. VWS proposes a ganged scheduling mechanism that enforces wide, lock-step execution when appropriate while still allowing narrow execution in the presence of divergence.

1.6 Contributions

This thesis makes the following contributions:

1. It identifies intra- and inter-warp locality and quantifies the trade-off between maximizing intra-warp locality and concurrent multithreading.

2. It proposes a novel cache-conscious warp scheduling (CCWS) mechanism which can be implemented with no changes to the cache replacement policy. CCWS uses a novel lost intra-warp locality detector (LLD) to update an adaptive locality scoring system and improves the performance of highly cache-sensitive workloads by 63% over existing warp schedulers while increasing the total chip area by only 0.17%.

3. It demonstrates that CCWS reduces L1D cache misses more than the Belady-optimal replacement scheme.

4. It demonstrates that CCWS can be tuned to trade off power and performance. A power-tuned configuration of CCWS reduces energy-expensive L1D cache misses an additional 18% above the performance-tuned configuration while still achieving a 49% increase in performance on highly cache-sensitive workloads.

5. It quantifies the relationship between data locality, branch divergence and memory divergence in GPUs on a set of economically important, highly cache-sensitive workloads.

6. It demonstrates that code regions can be classified by both data locality and memory divergence.

7. It demonstrates, with an example, that DAWS enables unoptimized GPU code written in a scalar fashion to attain 96% of the performance of optimized code that has been re-written for GPU acceleration.

8. It proposes a novel divergence-aware warp scheduling (DAWS) mechanism which classifies static load instructions based on their memory usage.
It uses this information, in combination with the control flow mask, to appropriately limit the number of scalar threads executing code regions with data locality. DAWS achieves a harmonic mean 26% speedup over CCWS [137] and 5% improvement over a profile-based warp limiting solution [137] with negligible area increase over CCWS.

9. It characterizes the performance, control flow/memory divergence, and fetch/decode effects of different warp sizes on a large number of graphics and compute GPU workloads.

10. It demonstrates that reducing the warp size of modern GPUs does not provide a universal performance advantage due to interference in the memory system and an increase in detrimental scheduling effects.

11. It explores the design space uncovered by enabling a dynamic, variable warp size. It quantifies the effects of scheduling and gang combination techniques when the machine has the flexibility to issue from multiple control flow paths concurrently.

12. It proposes a novel warp ganging microarchitecture that makes use of a hierarchical warp scheduler, enabling divergent applications to execute multiple control flow paths while forcing convergent ones to operate in lock-step.

1.7 Organization

The rest of this dissertation is organized as follows:

• Chapter 2 details the background GPU architecture used in this dissertation.

• Chapter 3 introduces cache-conscious warp scheduling (CCWS), a novel microarchitecture that detects locality in cache-sensitive GPU applications and reactively curbs the warp scheduling algorithm to improve performance and energy consumption.

• Chapter 4 details divergence-aware warp scheduling (DAWS), a GPU microarchitectural innovation that pre-emptively curbs the warp scheduling algorithm to further increase cache-sensitive application performance and improve the programmability of GPUs.

• Chapter 5 performs a case study on a GPGPU application written in two different ways: one GPU-optimized and the other more general.
It examines the performance effect of running CCWS and DAWS on the less optimized, more general code and demonstrates that DAWS is able to bring the performance of the unoptimized code within 4% of the optimized code.

• Chapter 6 presents a variable warp-size architecture, a GPU microarchitecture and thread scheduling mechanism that improves the performance of irregular applications, while maintaining contemporary GPU performance and energy-efficiency on regular codes.

• Chapter 7 discusses related work.

• Chapter 8 concludes the dissertation and discusses directions for potential future work.

Chapter 2

Background

This chapter first describes the GPU programming model, then details the architecture of a contemporary GPU, which is used as a baseline throughout this dissertation. The chapter then details the memory system of a GPU, how locality manifests itself, and defines what is meant by memory irregularity on a GPU. The chapter then details a contemporary mechanism for dealing with control flow on GPUs and defines what is meant by control flow irregularity on GPUs.

2.1 GPU Programming Model

In CUDA [6] and OpenCL [87], the programmer writes the code for a GPU program (or kernel) from the perspective of a scalar thread. When this kernel is launched for execution on the GPU, it is divided into cooperative thread arrays (CTAs) or workgroups. Each CTA is assigned to a GPU streaming multiprocessor (SM). SMs are known as compute units in AMD terminology. Threads within a CTA are able to share memory through an on-chip scratchpad (local data store in AMD terminology) and can perform fast synchronization amongst themselves using barriers. The number of threads in the CTA is determined by the programmer and is bounded by resource constraints such as the number of registers and amount of scratchpad memory used. Applications begin execution on a host CPU, which spawns coarse-grained kernels to the GPU for execution.
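The kernel/CTA/warp decomposition described above can be sketched with a little arithmetic. This is an illustrative model (the function name is an assumption; the 32-thread warp size matches NVIDIA's, described in Section 2.2):

```python
# Sketch of the launch hierarchy: a kernel is divided into CTAs, each CTA is
# assigned to an SM, and the SM groups the CTA's threads into warps.
WARP_SIZE = 32

def decompose_kernel(total_threads, cta_size):
    """Return (num_ctas, warps_per_cta) for a kernel launch."""
    num_ctas = (total_threads + cta_size - 1) // cta_size
    warps_per_cta = (cta_size + WARP_SIZE - 1) // WARP_SIZE
    return num_ctas, warps_per_cta

# A 1,048,576-thread kernel launched with 256-thread CTAs:
ctas, warps = decompose_kernel(1 << 20, 256)
print(ctas, warps)  # → 4096 8  (4096 CTAs, each holding 8 warps)
```

Note that how many of these CTAs run concurrently on one SM depends on the per-SM resource constraints (registers, scratchpad) mentioned above.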
Traditionally, the GPU and CPU operate on different memory spaces, requiring the transfer of data from the host memory to the device memory. Recent chips, such as AMD's accelerated processing units (APUs), have the ability to share a memory space between the CPU and GPU [14].

2.2 Contemporary GPU Architecture

Figure 2.1 depicts our model of a modern GPU. The GPU consists of several SMs connected to the main memory system via an interconnection network. Each SM in the GPU is responsible for the execution of multiple CTAs, the aggregate of which can consist of more than one thousand scalar threads. To mitigate the instruction fetch, decode and scheduling overhead of so many threads on a single processor, GPUs group scalar threads into scheduling entities known as warps (or wavefronts, in AMD terminology). Threads in a warp execute the same instructions in lockstep. The warp size of contemporary GPUs is fixed at 32 threads for NVIDIA GPUs and 64 threads for AMD.

Our pipeline decouples the fetch/decode stages from the issue logic and execution stage by storing decoded instructions in per-warp instruction buffers. Each warp can store multiple decoded instructions in its instruction buffer. Each instruction entry in the buffer also contains a valid bit, which is set when an instruction is filled into the buffer, and a ready bit, which is set when the in-order scoreboard indicates the instruction is able to execute.

The front-end of each SM includes an L1 instruction cache which is probed once per cycle by the warp fetch scheduler. The warp fetch scheduler determines which empty entry in the instruction buffer is to be filled.
On the execution side, a warp issue scheduler selects one decoded, ready instruction to issue each cycle.

2.3 Memory on GPUs

Contemporary GPUs have five memory spaces visible to the programmer: texture memory (image object in AMD machines), shared scratchpad memory (local data store), constant memory, local memory (private memory) and global memory. Texture memory is a read-only memory space (used primarily by graphics applications), is addressable by special texture fetch instructions and is cached on-chip with a specialized hardware texture cache [60].

Figure 2.1: Contemporary GPU architecture. MemCtrl=memory controller. SIMT Stack=single-instruction, multiple-thread stack. I-cache=instruction cache.

The shared scratchpad memory addresses a high bandwidth, on-chip data store and is used by GPGPU applications for sharing data between the threads in a CTA. This memory space is explicitly managed by the GPU programmer. The constant memory space is used for read-only data such as program constants and kernel input parameters and has a small on-chip cache. Local memory is private to each scalar thread and is cached in the L1 data cache. Finally, global memory is the most general form of GPU memory, used for all other general read/write data. Global memory is cached in the on-chip L1 data cache. Much of this dissertation focuses on capturing locality in the global memory space. Data from all the memory spaces is cached in the shared L2 cache, which is strided across the GPU's memory controllers.

Our memory system is similar to that used in modern GPUs [5, 7].
Each SM has a software-managed scratchpad, an L1 data cache, an L1 constant cache and a texture unit. Access to the memory unit is shared by all lanes. To reduce the number of memory accesses generated from each warp, GPUs coalesce memory requests into cache-line-sized chunks when there is spatial locality across the warp. A single instruction that touches only one cache line will generate one transaction that services all 32 lanes. From a memory access perspective, regular programs are those that generate few memory accesses from each SIMD warp instruction. Memory irregularity occurs when adjacent threads (in space) access different cache lines, generating several memory accesses from one SIMD warp instruction. This memory irregularity across the same warp instruction is called memory divergence. Memory regularity occurs when threads with adjacent thread identifiers in the program access adjacent (or redundant) memory addresses. We call this type of locality horizontal locality since it occurs horizontally in time across threads in the same instruction. In contrast, vertical locality occurs when individual warps re-reference cache lines on different dynamic instructions. Memory-regular programs have horizontal locality, while vertical locality can occur in both regular and irregular applications. In massively multithreaded systems, exploiting locality in memory references is critical to achieving high performance and energy-efficiency. Every global memory instruction from every thread in a GPU program requires a response from the global memory system. The arrangement of these accesses in both space and time is critical to reducing off-chip memory traffic. GPUs have several mechanisms to exploit memory reference locality. This dissertation focuses on exploiting two of them: coalescing and caching.
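The coalescing behavior described above can be sketched by counting the distinct cache lines a warp instruction touches; the number of memory transactions generated equals that count. A minimal simulation sketch (32 lanes and 128-byte lines, as in the baseline; function and variable names are illustrative):

```python
# Sketch of memory coalescing: a warp's lane addresses are grouped into
# cache-line-sized transactions, so memory divergence shows up as a larger
# transaction count for a single SIMD warp instruction.
WARP_SIZE = 32
LINE_BYTES = 128

def coalesced_transactions(addresses):
    """Number of memory transactions generated for one warp instruction."""
    return len({addr // LINE_BYTES for addr in addresses})

# Regular (horizontal locality): adjacent lanes touch adjacent 4-byte words.
unit_stride = [lane * 4 for lane in range(WARP_SIZE)]
# Memory-divergent: each lane touches a different cache line.
divergent = [lane * LINE_BYTES for lane in range(WARP_SIZE)]

print(coalesced_transactions(unit_stride))  # → 1  (one line serves all lanes)
print(coalesced_transactions(divergent))    # → 32 (one request per lane)
```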
Our main memory system includes a variable latency, fixed bandwidth graphics double data rate (GDDR) dynamic random access memory (DRAM) model.

2.4 Control Flow on GPUs

The programming model for GPUs is from the perspective of a scalar thread. Each scalar thread can have conditional statements that affect the control flow of that particular thread. However, the underlying datapath and execution model of a GPU groups the scalar threads into warps, where only one PC (or assembly instruction) is executed for all the threads in the warp. As a result, some of the lanes in the GPU's SIMD pipeline must be disabled when some of the threads in the warp do not execute a particular instruction. A structure to bookkeep the threads active in a warp in each basic block, as well as a mechanism to re-activate lanes in a warp when threads return from conditional code, is required. Contemporary GPUs use what is known as a SIMT (or call-return) stack [55] to keep track of which threads are active in each basic block. The re-activation of lanes in a warp occurs when threads in the warp that were executing different control flow paths converge at a common basic block. These reconvergence points typically occur at basic block post-dominators, and GPU compilers insert special instructions to inform the hardware when they occur. When threads within the same warp take different control flow paths, resulting in the execution of SIMD instructions with inactive lanes, we call this control flow divergence. Applications with little control flow divergence (i.e., most of the SIMD instructions execute with most of the lanes active) are considered regular from a control flow perspective.
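The cost of disabling lanes can be quantified as the fraction of SIMD lanes doing useful work across a warp's executed instructions. A simplified sketch, representing each dynamic instruction's active lanes as a set (real hardware derives these masks from the SIMT stack described above):

```python
# Sketch of computing lane utilization from per-instruction active masks.
WARP_SIZE = 32

def lane_utilization(active_masks):
    """Fraction of SIMD lanes active across the executed warp instructions.
    Each mask is the set of lane IDs active for one dynamic instruction."""
    active = sum(len(mask) for mask in active_masks)
    return active / (len(active_masks) * WARP_SIZE)

# A warp where lanes 0-15 take the if-side of a branch and lanes 16-31 the
# else-side: one instruction before the branch, one on each divergent path,
# and one after lanes re-activate at the reconvergence point.
all_lanes = frozenset(range(32))
taken = frozenset(range(16))
not_taken = frozenset(range(16, 32))
print(lane_utilization([all_lanes, taken, not_taken, all_lanes]))  # → 0.75
```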
In contrast, control flow irregular applications are those that lose significant functional unit utilization (also called SIMD efficiency) because threads within the same warp do not execute the same control flow path.

Chapter 3

Cache-Conscious Warp Scheduling

This chapter studies the effects of hardware thread scheduling on cache management in GPUs. We propose Cache-Conscious Warp Scheduling (CCWS), an adaptive hardware mechanism that makes use of a novel intra-warp locality detector to capture locality that is lost by other schedulers due to excessive contention for cache capacity. In contrast to improvements in the replacement policy that can better tolerate difficult access patterns, CCWS shapes the access pattern to avoid thrashing the shared L1. We show that CCWS can outperform any replacement scheme by evaluating against the Belady-optimal policy. Our evaluation demonstrates that cache efficiency and preservation of intra-warp locality become more important as GPU computing expands beyond use in high performance computing. At an estimated cost of 0.17% total chip area, CCWS reduces the number of threads actively issued on a core when appropriate. This leads to an average of 25% fewer L1 data cache misses, which results in a harmonic mean 24% performance improvement over previously proposed scheduling policies across a diverse selection of cache-sensitive workloads.

Each GPU cycle, a hardware warp scheduler must decide which of the multiple active warps executes next. Our work focuses on this decision. The goal of a warp scheduler is to ensure the execution pipeline is kept active in the presence of long latency operations.
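This per-cycle decision can be sketched with a simple loose round-robin policy over the warps whose next instruction is ready. This is an illustrative model, not the exact baseline hardware:

```python
# Sketch of a loose round-robin warp issue scheduler: each cycle, issue the
# first ready warp after the one issued last cycle, skipping stalled warps
# so the pipeline stays busy during long latency operations.
def round_robin_pick(ready, num_warps, last_issued):
    """Return the warp id to issue this cycle, or None if all are stalled."""
    for offset in range(1, num_warps + 1):
        wid = (last_issued + offset) % num_warps
        if wid in ready:
            return wid
    return None

# Warp 1 is stalled on a memory operation, so warp 2 issues instead:
print(round_robin_pick({0, 2, 3}, num_warps=4, last_issued=0))  # → 2
print(round_robin_pick(set(), num_warps=4, last_issued=0))      # → None
```

As the rest of this chapter shows, a policy like this is oblivious to its effect on the cache; CCWS constrains which warps may issue.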
The inclusion of caches on GPUs [5] can reduce the latency of memory operations and act as a bandwidth filter, provided there is some locality in the access stream.

Figure 3.1: Average hits and misses per thousand instructions (PKI) using an unbounded L1 data cache (with 128B lines) on cache-sensitive benchmarks.

Figure 3.1 presents the average number of hits and misses per thousand instructions (PKI) of highly cache-sensitive (HCS) and moderately cache-sensitive (MCS) benchmark access streams using an unbounded level one data (L1D) cache. The figure separates hits into two classes. We classify locality that occurs when data is initially referenced and re-referenced from the same warp as intra-warp locality. Locality resulting from data that is initially referenced by one warp and re-referenced by another is classified as inter-warp locality. Intra-warp locality is a combination of intra-thread locality [96] (where data is private to a single scalar thread) and inter-thread locality (where data is shared among scalar threads in the same warp). Figure 3.1 illustrates that the majority of data reuse observed in our HCS benchmarks comes from intra-warp locality.

To exploit this type of locality in HCS benchmarks, we introduce Cache-Conscious Warp Scheduling (CCWS).
CCWS uses a novel lost intra-warp locality detector (LLD) that alerts the scheduler if its decisions are destroying intra-warp locality. Based on this feedback, the scheduler assigns intra-warp locality scores to each warp and ensures that those warps losing intra-warp locality are given more exclusive access to the L1D cache.

Figure 3.2: Performance using a loose round-robin scheduler at various L1D cache sizes for highly cache-sensitive (top) and moderately cache-sensitive benchmarks (bottom), normalized to a cache size of 32k. All caches are 8-way set-associative with 128B cache lines.

Simple warp scheduling policies such as round-robin are oblivious to their effect on intra-warp locality, potentially touching data from enough threads to cause thrashing in the L1D. A two-level scheduler such as that proposed by Narasiman et al. [123] exploits inter-warp locality while ensuring warps reach long latency operations at different times by scheduling groups of warps together. However, Figure 3.1 demonstrates that the HCS benchmarks we studied will benefit more from exploiting intra-warp locality than inter-warp locality. Existing schedulers do not take into account the effect issuing more warps has on the intra-warp locality of those warps that were previously scheduled. In the face of L1D thrashing, the round-robin nature of their techniques will cause the destruction of older warps' intra-warp locality.

Figure 3.3: Average misses per thousand instructions (MPKI) and harmonic mean (HMEAN) performance improvement of HCS benchmarks with different levels of multithreading.
Instructions per cycle (IPC) is normalized to 32 warps.

Figure 3.2 illustrates the cache size sensitivity of our benchmarks (described in Section 3.3) when using a round-robin scheduler and the baseline system described in Section 3.3. Although all of these benchmarks are somewhat cache-sensitive, the HCS benchmarks plotted on the left in Figure 3.2 see a 3× or more performance improvement with a much larger L1 data cache.

For GPU-like architectures to effectively address a wider range of workloads, it is critical that their performance on irregular workloads is improved. Recent work on the highly cache-sensitive Memcached (MEMC) [62] and BFS [119] has shown promising results running these commercially relevant irregular parallel workloads on GPUs. However, since current GPUs face many performance challenges running irregular applications, there are relatively few of them written. In this work we evaluate a set of irregular GPU applications and demonstrate that their performance can be highly sensitive to the GPU's warp scheduling policy.

Figure 3.3 highlights the impact warp scheduling can have on preserving intra-warp locality. It shows the effect of statically limiting the number of warps actively scheduled on a core. Peak throughput occurs at a multithreading value less than maximum concurrency, but greater than the peak cache performance point (which limits concurrency to a single warp). Although it may seem counterintuitive to limit the amount of multithreading in a GPU, our data demonstrates a trade-off between hiding long latency operations and creating more of them by destroying intra-warp locality.

Our work draws inspiration from cache replacement and insertion policies in that it attempts to predict when cache lines will be reused. However, cache way-management policy decisions are made among a small set of blocks.
A thread scheduler effectively chooses which blocks get inserted into the cache from a pool of potential memory accesses that can be much larger than the cache's associativity. Similar to how cache replacement policies effectively predict each line's re-reference interval [73], our proposed scheduler effectively changes the re-reference interval to reduce the number of interfering references between repeated accesses to high locality data. Unlike scheduling approaches for managing contention implemented in the operating system [181], our technique exploits fine-grained information available to the low-level hardware scheduler.

3.1 Effect of Shaping the Access Pattern

To illustrate the effect an issue-level thread scheduler can have on the cache system, consider Figures 3.4 and 3.5. They present the access pattern created by two different thread schedulers in terms of cache lines touched. GPUs group threads into warps (or wavefronts) for execution, issuing the same static instruction from multiple threads in a single dynamic instruction. This means one memory instruction can generate up to M data cache accesses, where M is the warp width. In this example, we assume each instruction generates four memory requests and we are using a fully associative, four-entry cache with a least recently used (LRU) replacement policy. The access stream in Figure 3.4 will always miss in the cache. However, the stream in Figure 3.5 will hit 12 times, capturing every redundant access. The access patterns in these two examples are created by the GPU's warp scheduler. In Figure 3.4 the scheduler chose to issue instructions from warps in a round-robin fashion without considering the effect of the resulting access stream on cache performance.
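The two streams can be replayed through the four-entry, fully associative LRU cache assumed in this example to confirm the hit counts. A small simulation sketch (stream contents follow Figures 3.4 and 3.5; the helper name is illustrative):

```python
# Replay the example access streams through a 4-entry fully associative LRU
# cache. An OrderedDict keeps lines in recency order: front = least recent.
from collections import OrderedDict

def lru_hits(stream, capacity=4):
    cache, hits = OrderedDict(), 0
    for line in stream:
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # refresh recency on a hit
        else:
            if len(cache) == capacity:
                cache.popitem(last=False)  # evict least recently used line
            cache[line] = True
    return hits

round_robin = list("ABCDEFGHIJKL") * 2             # Figure 3.4's stream
warp_grouped = [c for group in ("ABCD", "EFGH", "IJKL")
                for c in group * 2]                # Figure 3.5's stream

print(lru_hits(round_robin))   # → 0  (every access misses)
print(lru_hits(warp_grouped))  # → 12 (every re-reference is captured)
```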
The scheduler in Figure 3.5 prioritized accesses from the same warp together (indicated by the red boxes), creating a stream of cache accesses where locality is captured by our example cache.

Figure 3.4: Example access pattern (represented as cache lines touched) resulting from a throughput-oriented round-robin scheduler. The letters (A,B,C,...) represent cache lines accessed. Wi indicates which warp generated this set of accesses. For example, the first four accesses to cache lines A, B, C and D are generated by warp 0.

Figure 3.5: Example access pattern resulting from a scheduler aware of the effect scheduling has on the caching system. The red boxes indicate that the issue-level scheduler has re-arranged the order in which warps' accesses are issued.

3.2 Warp Scheduling to Preserve Locality

This section describes our scheduling techniques. First, Section 3.2.1 analyzes warp scheduling for locality preservation in an example workload with intra-warp locality. Next, Section 3.2.2 introduces static warp limiting (SWL), which gives high-level language programmers an interface to tune the level of multithreading. Finally, Section 3.2.3 describes Cache-Conscious Warp Scheduling (CCWS), an adaptive hardware scheduler that uses fine-grained memory system feedback to capture intra-warp locality.

3.2.1 A Code Example

Consider the inner loop of a graph processing workload presented in Example 1. The problem has been partitioned by having each scalar thread operate on all the edges of a single node. The adjoining edges of each node are stored sequentially in memory. This type of storage is common in many graph data structures, including the highly space-efficient compressed sparse rows [22] representation.
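A minimal sketch of this storage layout, with illustrative node and edge arrays (the `starting_edge` offsets mirror the per-node field used in Example 1):

```python
# Sketch of a compressed-sparse-rows-style edge layout: each node's
# adjoining edges are stored contiguously in one array, and a per-node
# offset records where each node's edge list begins.
edges = ["e01", "e02", "e12", "e20", "e21"]  # edge attributes, node-major
starting_edge = [0, 2, 3, 5]                 # node i owns
                                             # edges[starting_edge[i]:starting_edge[i+1]]

def edges_of(node):
    return edges[starting_edge[node]:starting_edge[node + 1]]

print(edges_of(0))  # → ['e01', 'e02']
print(edges_of(2))  # → ['e20', 'e21']
```

Because each node's edges are contiguous, a thread walking its node's edge list produces the sequential, spatially local access stream discussed below.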
This workload contains intra-warp locality resulting from intra-thread locality (data's initial reference and subsequent re-references come from the same scalar thread).

The inner loop of each scalar thread strides through attributes of its assigned node's edges sequentially. This sequential access stream has significant spatial locality that can be captured by a GPU's large cache line size (e.g. 128B). If the GPU was limited to just a single thread per SM, the memory loads inside the loop would hit in the L1D cache often. In realistic workloads, more than one thousand threads executing this loop will share the same L1D cache.

Example 1 Example graph algorithm kernel run by each scalar thread

int node_degree = nodes[thread_id].degree;
int thread_first_edge = nodes[thread_id].starting_edge;
for ( int i = 0; i < node_degree; i++ ) {
    edge_attributes = edges[thread_first_edge + i];
    int neighbour_node_id = edge_attributes.node;
    int edge_weight = edge_attributes.weight;
    ...
}

We find that if the working set of all the threads is small enough to be captured by the L1D, optimizing both cache efficiency and overall throughput is largely independent of the scheduler choice. In the other extreme, if only one warp's working set fits in the cache, optimizing misses would have each warp run to completion before starting another. Optimizing performance when the L1D is not large enough to capture all of the locality requires the warp scheduler to intelligently trade off preserving intra-warp locality with concurrent multithreading.

If the scheduler had oracle information about the nature of the workload, it could limit the number of warps actively scheduled to maximize performance.
This observation motivates the introduction of static warp limiting (SWL), which allows a high-level programmer to specify a limit on the number of warps actively scheduled per SM at kernel launch.

3.2.2 Static Warp Limiting (SWL)

Figure 3.3 shows the effect limiting the number of warps actively scheduled on an SM has on cache performance and system throughput. Current programming APIs such as CUDA and OpenCL allow the programmer to specify the CTA size. However, they allow as many warps to run on each SM as shared core resources (e.g., registers, shared scratchpad memory) permit. Consequently, even if the programmer specifies small CTAs, multiple CTAs will run on the same SM if resources permit. As a result, the number of warps/wavefronts running at once may still create too large a working set for the L1D. For this reason, we propose static warp limiting (SWL), which is implemented as a minor extension to the warp scheduling logic where a register is used to determine how many warps are actively issued, independent of CTA size.

In SWL, the programmer must specify a limit on the number of warps when launching the kernel. This technique is useful if the user knows the optimal number of warps prior to launching the kernel, which could be determined by profiling.

In benchmarks that make use of work-group level synchronization, SWL limits the number of warps running until a barrier, allows the rest of the work-group to reach the barrier, then continues with the same multithreading constraints.

In Section 3.4 we demonstrate that the optimal number of warps is different for different benchmarks. Moreover, we find this number changes in each benchmark when its input data is changed. This dependence on benchmark and input data makes an adaptive CCWS system desirable.

3.2.3 Cache-Conscious Warp Scheduling (CCWS)

This subsection first defines the goal and high-level implementation of CCWS in Section 3.2.3. Next, Section 3.2.3 details how CCWS is applied to the baseline scheduling logic.
Section 3.2.3 explains the lost intra-warp locality detector (LLD), followed by Section 3.2.3 which explains how our locality scoring system makes use of LLD information to determine which warps can issue. Finally, Section 3.2.3 describes the locality score value assigned to a warp when lost locality is detected.

Figure 3.6: Modeled GPU core microarchitecture. N is the number of warp contexts stored on a core. LSS=locality scoring system, LLD=lost intra-warp locality detector, WID=warp ID, LLS=lost-locality score, VTA=victim tag array, I-Buffer=instruction buffer

Figure 3.7: Locality scoring system operation example. LLS=lost-locality score, LLDS=lost-locality detected score

High-Level Description

The goal of CCWS is to dynamically determine the number of warps allowed to access the memory system and which warps those should be. At a high level, CCWS is a warp scheduler that reacts to access-level feedback ( 4 in Figure 3.6) from the L1D cache and a victim tag array (VTA) at the memory stage. CCWS uses a dynamic locality scoring system to make scheduling decisions.

The intuition behind why our scoring system works can be explained by Figure 3.7. At a high level, each warp is given a score based on how much intra-warp locality it has lost.
These scores change over time. Warps with the largest scores fall to the bottom of a sorted stack (for example, W2 at T1), pushing warps with smaller scores above a cutoff (W3 at T1), which prevents them from accessing the L1D. In effect, the locality scoring system reduces the number of accesses between data re-references from the same warp by removing the accesses of other warps. The following subsections describe CCWS in more detail.

Effect on Baseline Issue Logic

Figure 3.6 shows the modifications to the baseline warp issue arbiter ( 1 ) and memory unit ( 5 ) required for CCWS. CCWS is implemented as an extension to the system's baseline warp prioritization logic ( 6 ). This prioritization could be done in a greedy, round-robin or two-level manner. CCWS operates by preventing loads that are predicted to interfere with intra-warp locality from issuing, through a "Can Issue" bit vector ( 7 ) output by the locality scoring system ( 8 ). The intersection logic block ( 9 ) selects the highest-priority ready warp that has issue permission.

Lost Intra-Warp Locality Detector (LLD)

To evaluate which warps are losing intra-warp locality, we introduce the LLD unit ( 10 ), which uses a victim tag array (VTA) ( 11 ). The VTA is a highly modified variation of a victim cache [82]. The sets of the VTA are sub-divided among all the warp contexts supported on this core. This gives each warp its own small VTA ( 12 ). The VTA only stores cache tags and does not store line data. When a miss occurs and a line is reserved in the L1D cache, the warp ID (WID) of the warp reserving that line is written in addition to the tag ( 13 ). When that line is evicted from the cache, its tag information is written to that warp's portion of the VTA. Whenever there is a miss in the L1D cache, the VTA is probed. If the tag is found in that warp's portion of the VTA, the LLD sends a VTA hit signal to the locality scoring system ( 14 ).
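A behavioural sketch of the LLD's per-warp VTA follows. The tags-only, statically partitioned structure comes from the description above; keeping LRU order inside a partition and consuming an entry on a VTA hit are our simplifying assumptions:

```python
from collections import OrderedDict

class LostLocalityDetector:
    """Sketch of the LLD: a tags-only victim tag array (VTA) whose
    entries are statically partitioned per warp context."""
    def __init__(self, num_warps, entries_per_warp):
        self.entries_per_warp = entries_per_warp
        self.vta = {w: OrderedDict() for w in range(num_warps)}

    def on_l1d_evict(self, wid, tag):
        # The evicted line's tag goes to the evicting warp's partition
        # (the WID was stored alongside the line in the L1D).
        part = self.vta[wid]
        part[tag] = True
        part.move_to_end(tag)
        if len(part) > self.entries_per_warp:
            part.popitem(last=False)   # assumption: oldest entry ages out

    def on_l1d_miss(self, wid, tag):
        # Probe the missing warp's partition; a hit means this warp
        # re-referenced data it recently lost from the L1D, so a VTA
        # hit signal is raised to the locality scoring system.
        if tag in self.vta[wid]:
            del self.vta[wid][tag]   # assumption: a hit consumes the entry
            return True
        return False
```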
These VTA hit signals inform the scoring system that a warp has missed on a cache line that may have been a hit if that warp had more exclusive access to the L1D cache.

Locality Scoring System Operation

Figure 3.7 provides a visual example of the locality scoring system's operation. In this example, there are four warps initially assigned to the SM. Time T0 corresponds to the time these warps are initially assigned to this core. Each segment of the stacked bar represents a score given to each warp to quantify the amount of intra-warp locality it has lost. We call these values lost-locality scores (LLS). At T0 we assign each warp a constant base locality score. LLS values are stored in a max heap ( 15 ) inside the locality scoring system. A warp's LLS can increase when the LLD sends a VTA hit signal for this warp. The scores each decrease by one point every cycle until they reach the base locality score. The locality scoring system gives warps losing the most intra-warp locality more exclusive L1D cache access by preventing the warps with the smallest LLS from issuing load instructions. Warps whose LLS falls above the cumulative LLS cutoff ( a in Figure 3.7) in the sorted heap are prevented from issuing loads. The value of the cumulative LLS cutoff is defined as NumActiveWarps × BaseLocalityScore, where NumActiveWarps is the number of warps currently assigned to this core.

The LLS cutoff test block ( 16 ) takes in a bit vector from the instruction buffer indicating which warps are attempting to issue loads. It also takes in a sorted list of LLSs, performs a prefix sum and clears the "Can Issue" bit for warps attempting to issue loads whose LLS is above the cutoff. The locality scoring system is not on the critical path, can be pipelined and does not have to update the score cutoffs every SM cycle. In our example from Figure 3.7, between T0 and T1, W2 has received a VTA hit and its score has been increased to the lost-locality detected score (LLDS) ( b in Figure 3.7).
Section 3.2.3 explains the LLDS in more detail. W2's higher score has pushed W3 above the cumulative LLS cutoff, clearing W3's "Can Issue" bit if it attempts to issue a load instruction. From a microarchitecture perspective, LLSs are modified by the score update logic ( 17 ). The update logic block receives VTA hit signals (with a WID) from the LLD, which trigger a change to that warp's LLS. We limit the amount one warp can dominate the point system by capping each warp's score at the LLDS, regardless of how many VTA hits it has received. Other methods of capping a warp's LLS were attempted and we found that limiting them to the LLDS simplified the point system and yielded the best results. In the example, between T1 and T2 both W2 and W0 have received VTA hits, pushing both W3 and W1 above the cutoff. Between T2 and T3, no VTA hits have occurred and the scores for W2 and W0 have decreased enough to allow both W1 and W3 to issue loads again. This illustrates how the system naturally backs off thread throttling over time. Between T3 and T4, W2 finishes and W0 receives a VTA hit that increases its score. This illustrates that when a warp is added to or removed from the system, the cumulative LLS cutoff changes. Now that there are three warps active, the LLS cutoff becomes 3× the base score. Having the LLS cutoff be a multiple of the number of active warps ensures the locality scoring system maintains its sensitivity to lost locality. If the LLS cutoff did not decrease when the number of warps assigned to this core decreases, it would take a higher score per warp to push lower scores above the cutoff as the kernel ends.
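The cutoff test over sorted lost-locality scores, and the LLDS assignment it interacts with (Equation (3.1), defined in the next subsection), can be sketched as follows; the prefix-sum formulation mirrors the cutoff test block, while the function and variable names are ours:

```python
def lost_locality_detected_score(vta_hits_total, inst_issued_total,
                                 k_throttle, cum_lls_cutoff):
    # Equation (3.1): throttle harder as VTA hits per issued instruction rise.
    if inst_issued_total == 0:
        return 0.0   # assumption: no throttling before anything has issued
    return (vta_hits_total / inst_issued_total) * k_throttle * cum_lls_cutoff

def can_issue_loads(lls, base_score):
    """Cutoff test sketch: warps with the largest lost-locality scores sit
    at the bottom of the sorted stack; warps whose running (prefix) sum
    rises above NumActiveWarps * BaseLocalityScore may not issue loads."""
    cutoff = len(lls) * base_score           # cumulative LLS cutoff
    order = sorted(lls, key=lls.get, reverse=True)   # largest LLS first
    allowed, prefix = {}, 0
    for w in order:
        prefix += lls[w]
        allowed[w] = prefix <= cutoff        # blocked once above the cutoff
    return allowed
```

At the base score for every warp the stack height exactly equals the cutoff and all warps may issue; one warp's raised score pushes the smallest-scored warp over the top, as in the W2/W3 example above.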
A cutoff that did not scale down with the warp count would result in the system taking more time both to constrain multithreading when locality is lost and to back off thread limiting when there is no lost locality.

Determining the Lost-Locality Detected Score (LLDS)

The value assigned to a warp's score on a VTA hit (the LLDS) is a function of the total number of VTA hits across all of this SM's warps ( 18 ) and all the instructions this SM has issued ( 19 ). This value is defined by Equation (3.1):

    LLDS = (VTAHitsTotal / InstIssuedTotal) × K_THROTTLE × CumLLSCutoff    (3.1)

Using the fraction of total VTA hits divided by the number of instructions issued serves as an indication of how much locality is being lost on this core per instruction issued. A constant (K_THROTTLE) is applied to this fraction to tune how much throttling is applied when locality is lost. A larger constant favors less multithreading by pushing warps above the cutoff value more quickly and for a longer period of time. The optimal value of K_THROTTLE depends on several factors, including the number of threads assigned to a core, the L1D cache size, relative memory latencies and the locality in the workload. We intend for this constant to be set for a given chip configuration and not require any programmer or OS support. In our study, a single value for K_THROTTLE used across all workloads captures 95.4% to 100% of the performance of any workload's optimal K_THROTTLE value. This static value is determined experimentally and explored in more detail in Section 3.4.5. Like the LLS cutoff test, the lost-locality detected score can take several cycles to update and does not impact the critical path.

In algorithms that use synchronization primitives, CCWS does not introduce new deadlock conditions. The LLSs of warps preventing others from issuing will be backed off as time progresses while no lost locality is detected.
This backing off ensures that if no instructions are issuing for a prolonged period of time, every warp in the core will eventually be permitted to issue.

3.3 Experimental Methodology

We model the cache-conscious scheduling mechanisms described in Section 3.2 in GPGPU-Sim [19] (version 3.1.0) using the configuration in Table 3.1. The Belady-optimal replacement policy [26], which chooses for eviction the line that is re-referenced furthest in the future, is evaluated using a custom stand-alone GPGPU-Sim cache simulator (SAGCS). SAGCS is a trace-based cache simulator that takes GPGPU-Sim cache access traces as input. Since SAGCS is not a performance simulator and only provides cache information, we do not present IPC results for the Belady-optimal replacement policy. To validate SAGCS, we verified the miss rate for LRU replacement using SAGCS and found that it was within 0.1% of the LRU miss rate reported by GPGPU-Sim. This small difference is a result of variability in the GPGPU-Sim memory system that SAGCS does not take into account.

We perform our evaluation using the high-performance computing and GPU-enabled server workloads listed in Table 3.2, drawn from Rodinia [36], Hetherington et al. [62] and Bakhoda et al. [19]. While the regularity of the HPC applications makes them particularly well suited for the GPU, they represent only one segment of the overall computing market [66] [67].

In addition to the cache-sensitive benchmarks introduced earlier, we also evaluate against a number of cache-insensitive (CI) benchmarks to ensure CCWS does not have a detrimental effect.

To make use of a larger input, the KMN benchmark was slightly modified to use global memory in place of both texture and constant memory.

All of our benchmarks run from beginning to end, which takes between 14 million and 1 billion instructions.

3.3.1 GPU-enabled server workloads

This work uses two GPU-enabled server workloads. These benchmarks were ported to OpenCL from existing CPU implementations.
They represent highly parallel code with irregular memory access patterns whose performance could be improved by running on the GPU.

Memcached-GPU (MEMC) Memcached is a key-value store and retrieval system. Memcached-GPU is described in detail by Hetherington et al. [62]. The application is stimulated with a representative portion of the Wikipedia access trace collected by Urdaneta et al. [163].

Tracing Garbage Collector (GC) Garbage collection is an important aspect of many server applications. Languages such as Java use system-controlled garbage collection to manage resources [11]. A version of the tracing mark-and-compact garbage collector presented in Barabash et al. [21] is created in OpenCL. The collector is stimulated with benchmarks provided by Spoonhower et al. [154].

Table 3.1: Cache-conscious warp scheduling GPGPU-Sim configuration

    # Compute Units               30
    Warp Size                     32
    SIMD Pipeline Width           8
    Number of Threads / Core      1024
    Number of Registers / Core    16384
    Shared Memory / Core          16KB
    Constant Cache Size / Core    8KB
    Texture Cache Size / Core     32KB, 64B line, 16-way assoc.
    Number of Memory Channels     8
    L1 Data Cache                 32KB, 128B line, 8-way assoc. LRU
    L2 Unified Cache              128KB / Memory Channel, 128B line, 8-way assoc. LRU
    Compute Core Clock            1300 MHz
    Interconnect Clock            650 MHz
    Memory Clock                  800 MHz
    DRAM Request Queue Capacity   32
    Memory Controller             out of order (FR-FCFS)
    Branch Divergence Method      PDOM [55]
    GDDR3 Memory Timing           tCL=10 tRP=10 tRC=35 tRAS=25 tRCD=12 tRRD=8
    Memory Channel BW             8 (Bytes/Cycle)

Table 3.2: GPU Compute Benchmarks (CUDA and OpenCL)

    Highly Cache Sensitive (HCS)
        BFS Graph Traversal [36]             BFS
        Kmeans [36]                          KMN
        Memcached [62]                       MEMC
        Garbage Collection [21, 154]         GC

    Moderately Cache Sensitive (MCS)
        Weather Prediction [36]              WP
        Streamcluster [36]                   STMCL
        Single Source Shortest Path [19]     SSSP
        CFD Solver [36]                      CFD

    Cache Insensitive (CI)
        Needleman-Wunsch [36]                NDL
        Back Propagation [19]                BACKP
        Speckle Red. Anisotropic Diff. [36]  SRAD
        LU Decomposition [36]                LUD

3.4 Experimental Results

This section is structured as follows: Section 3.4.1 presents the performance of SWL, CCWS, other related warp schedulers and the Belady-optimal replacement policy using the system presented in Section 3.3. The results for CCWS presented in Section 3.4.1 represent a design point that maximizes performance increase over area increase. The remainder of this section is devoted to exploring the sensitivity of our design and explaining the behaviour of our benchmarks.

Figure 3.8: Performance of various schedulers and replacement policies for the highly cache-sensitive benchmarks. Normalized to the GTO scheduler.

Figure 3.9: MPKI of various schedulers and replacement policies for the highly cache-sensitive benchmarks.

Figure 3.10: Performance of various schedulers and replacement policies for moderately cache-sensitive (top) and cache-insensitive (bottom) benchmarks. Normalized to the GTO scheduler.

Figure 3.11: MPKI of various schedulers and replacement policies for moderately cache-sensitive (top) and cache-insensitive benchmarks (bottom).

3.4.1 Performance

The data in Figures 3.8, 3.9, 3.10 and 3.11 is collected using GPGPU-Sim for the following mechanisms:

LRR Loose round-robin scheduling. Warps are prioritized for scheduling in round-robin order.
However, if a warp cannot issue during its turn, the next warp in round-robin order is given the chance to issue.

GTO A greedy-then-oldest scheduler. GTO runs a single warp until it stalls, then picks the oldest ready warp. The age of a warp is determined by the time it is assigned to the core. For warps that are assigned to a core at the same time (i.e., they are in the same CTA), warps with the smallest thread IDs are prioritized. Other greedy schemes (such as greedy-then-round-robin and oldest-first) were implemented and GTO scheduling had the best results.

2LVL-GTO A two-level scheduler similar to that described by Narasiman et al. [123]. Their scheme subdivides the warps waiting to be scheduled on a core into fetch groups (FGs) and executes from only one fetch group until all warps in that group are stalled. Narasiman et al. used a fetch group size of 8 and a round-robin scheduling policy to select among warps in a fetch group as well as among fetch groups. To provide a fair comparison against their scheduling technique in our simulator and on our workloads, all fetch group sizes were swept. We also explored alternate scheduling policies for intra-FG and inter-FG selection. We found using GTO for both of these policies was better than the algorithm they employed. A fetch group size of 2 using GTO for both intra-FG and inter-FG selection provides the best performance on our workloads and is what we present in our results. This disparity in optimal configuration can be explained by the nature of our workloads and our baseline architecture. Their core pipeline allows only one instruction from a given warp to be executing at a time. This means that a warp must wait for its previously issued instruction to complete execution before the warp can issue another instruction. This is different from our baseline, which prevents a fetched instruction from issuing if the scoreboard detects a data hazard.

Best-SWL Static Warp Limiting as described in Section 3.2.2.
All possible limiting values (32 to 1) were run and the best-performing case is picked. The GTO policy is used to select between warps. The warp limit used for each benchmark is shown in Table 3.3.

CCWS Cache-Conscious Warp Scheduling as described in Section 3.2.3, with the configuration parameters listed in Table 3.3. GTO warp prioritization logic is used.

The data for Belady-optimal replacement misses per thousand instructions (MPKI) presented in Figures 3.9 and 3.11 is generated with SAGCS:

<scheduler>-BEL The miss rate reported by SAGCS when using the Belady-optimal replacement policy. SAGCS is stimulated with L1D access streams generated by GPGPU-Sim running the specified <scheduler>. Since SAGCS only reports misses, MPKI is calculated from the GPGPU-Sim instruction count.

Figure 3.8 shows that CCWS achieves a harmonic mean 63% performance improvement over a simple greedy warp scheduler and a 72% improvement over the 2LVL-GTO scheduler on the HCS benchmarks. The GTO scheduler performs well because prioritizing older warps allows them to capture intra-warp locality by giving them more exclusive access to the L1 data cache. The 2LVL-GTO scheduler performs slightly worse than the GTO scheduler because the 2LVL-GTO scheduler will not prioritize the oldest warps every cycle. 2LVL-GTO only attempts to schedule the oldest FG intermittently, once the current FG is completely stalled. This allows loads from younger warps, which would not have been prioritized by the GTO scheduler, to be injected into the access stream, causing older warps' data to be evicted.

CCWS and SWL provide further benefit over the GTO scheduler because these programs have a number of uncoalesced loads, touching many cache lines in relatively few memory instructions. Therefore, even restricting scheduling to just the oldest warps still touches too much data to be contained by the L1D.
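As a behavioural sketch, the GTO policy and the oracle Best-SWL sweep compared above might look like this (the `profile_ipc` callback is hypothetical, standing in for a full profiling run of the kernel at a given limit):

```python
def gto_select(ready, last_issued, age_order):
    """Greedy-then-oldest sketch: keep issuing from the current warp
    until it stalls, then fall back to the oldest (earliest-assigned)
    ready warp in age_order."""
    if last_issued is not None and ready.get(last_issued, False):
        return last_issued            # greedy: stay on the current warp
    for warp in age_order:            # oldest warps first
        if ready.get(warp, False):
            return warp
    return None                       # nothing ready this cycle

def best_swl_limit(profile_ipc, max_warps=32):
    """Oracle Best-SWL sketch: profile every static limit from max_warps
    down to 1 and keep the best performer."""
    best_limit, best_ipc = None, float('-inf')
    for limit in range(max_warps, 0, -1):
        ipc = profile_ipc(limit)
        if ipc > best_ipc:
            best_limit, best_ipc = limit, ipc
    return best_limit
```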
The GTO, 2LVL-GTO, Best-SWL and CCWS schedulers see a greater disparity in the completion time of CTAs running on the same core compared to the LRR scheduler. Since all our workloads are homogeneous (at any given time only CTAs from one kernel launch will be assigned to each core) and involve synchronous kernel launches, the relative completion time of CTAs is not an issue. All that matters is when the whole kernel finishes. Moreover, the highly cache-sensitive workloads we study do not use any CTA or global synchronization within a kernel launch, therefore older warps are never stalled waiting for younger ones to complete.

Figure 3.8 also highlights the importance of scheduler choice even among simple schedulers like GTO and LRR. The LRR scheduler suffers a 64% slowdown compared to GTO. Scheduling warps with a lot of intra-warp locality in a round-robin fashion strides through too much data to be contained in the L1D. Best-SWL is able to slightly outperform CCWS on all the benchmarks. The CCWS configuration used here has been optimized to provide the highest performance per unit area. If the VTA cache is doubled in size, CCWS is able to slightly outperform Best-SWL on some workloads. CCWS is not able to consistently outperform Best-SWL because there is a start-up cost associated with detecting the loss of locality and a cool-down cost to back off the warp throttling. Adding to that, the execution time of these kernels is dominated by the code section that benefits from warp limiting. Therefore, providing the static scheme with oracle knowledge (through profiling) gives it an advantage over the adaptive CCWS scheme. Section 3.4.6 examines the shortcomings of SWL under different run-time conditions.

Although not plotted here, it is worth mentioning the performance of the 2LVL-LRR scheduling configuration evaluated by Narasiman et al.
On the HCS benchmarks the 2LVL-LRR scheduler is a harmonic mean 43% faster than the LRR scheduler; however, this is still 47% slower than the GTO scheduler. Performing intra-FG and inter-FG scheduling in a round-robin fashion destroys the intra-warp locality of older warps that is captured by the GTO scheduler. However, in comparison to the LRR scheduler, which cycles through 32 warps in a round-robin fashion, cycling through smaller FG-sized pools (each fetch group has 8 warps in their configuration) will thrash the L1 data cache less.

Figure 3.9 illustrates that the reason for the performance advantage of the warp limiting schemes is a sharp decline in the number of L1D misses. This figure highlights the fact that no cache replacement policy can make up for a poor choice of warp scheduler, as even an oracle Belady-optimal policy on the LRR access stream is outperformed by all the schedulers. The insight here is that even optimal replacement cannot compensate for an access stream that strides through too much data, at least for the relatively low associativity L1 data caches we evaluated. Furthermore, the miss rate of CCWS outperforms both GTO-BEL and 2LVL-GTO-BEL. This data suggests L1D cache hit rates are more sensitive to the warp scheduling policy than to the cache replacement policy.

Figures 3.10 and 3.11 present the performance and MPKI of our MCS and CI benchmarks. The harmonic mean performance improvement of CCWS across both the highly and moderately cache-sensitive (HCS and MCS) benchmarks is 24%. In the majority of the MCS and CI workloads, the choice of warp scheduler makes little difference and CCWS does not degrade performance. There is no degradation because the MPKI for these benchmarks is much lower than for the HCS applications, so there are few VTA hits compared to instructions issued.
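The Belady-optimal replacement used in the -BEL configurations can be sketched on an access trace as follows (a minimal fully-associative model, not SAGCS itself):

```python
def belady_misses(trace, num_lines):
    """Belady-optimal replacement on a line-address trace: on eviction,
    choose the resident line re-referenced furthest in the future."""
    # Precompute, for each access, the index of the next reference
    # to the same line (infinity if it is never referenced again).
    next_use = [float('inf')] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float('inf'))
        last_seen[trace[i]] = i
    resident = {}   # line -> index of its next reference
    misses = 0
    for i, line in enumerate(trace):
        if line in resident:
            resident[line] = next_use[i]           # hit: refresh next use
        else:
            misses += 1
            if len(resident) == num_lines:
                # Evict the line whose next use is furthest away.
                del resident[max(resident, key=resident.get)]
            resident[line] = next_use[i]
    return misses
```

Even with this oracle policy, a cyclic trace like A, B, C, A, B, C in a two-line cache still misses four times (LRU would miss all six), which is why no replacement policy can rescue an access stream that strides through too much data.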
With so few VTA hits, the lost-locality detected score as defined by Equation (3.1) stays low and the thread throttling mechanism does not take effect.

Table 3.3: Configurations for Best-SWL (warps actively scheduled) and CCWS variables used for performance data.

    Best-SWL (warps actively scheduled):
        BFS         5
        KMN         4
        MEMC        7
        GC          4
        All Others  32

    CCWS Config:
        K_THROTTLE       8
        Warp Base Score  100
        VTA Tag Array    8-way, 16 entries per warp (512 total entries)

Figure 3.12: Breakdown of L1D misses, intra-warp locality hits (broken into intra-thread and inter-thread) and inter-warp locality hits per thousand instructions for highly cache-sensitive benchmarks. The configuration from Section 3.4.1 is used.

3.4.2 Detailed Breakdown of Inter- and Intra-Warp Locality

Figure 3.12 breaks down L1D accesses into misses, inter-warp hits and intra-warp hits for all the schedulers evaluated in Section 3.4.1 on our HCS benchmarks. In addition, it quantifies the portion of intra-warp hits that are a result of intra-thread locality. It illustrates that the decrease in cache misses using CCWS and Best-SWL comes chiefly from an increase in intra-warp hits. Moreover, the bulk of these hits are a result of intra-thread locality. The exception to this rule is BFS, where 30% of intra-warp hits come from inter-thread locality and we see a 23% increase in inter-warp hits.

Figure 3.13: Breakdown of L1D misses, intra-warp locality hits and inter-warp locality PKI using an unbounded L1 cache with 128 byte cache lines.
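One plausible way to produce a Figure 3.12-style breakdown from an access trace of (warp, thread, line) tuples is sketched below; attributing each hit to the line's most recent toucher, and the infinite cache, are our simplifications of the locality definitions used in this chapter:

```python
def classify_hits(trace):
    """Classify each access in a (warp, thread, line) trace as a miss,
    an intra-warp hit (intra- or inter-thread), or an inter-warp hit,
    based on which warp/thread last touched the line (infinite cache)."""
    last_toucher = {}   # line -> (warp, thread) of the previous access
    counts = {'miss': 0, 'intra_warp_intra_thread': 0,
              'intra_warp_inter_thread': 0, 'inter_warp': 0}
    for warp, thread, line in trace:
        if line not in last_toucher:
            counts['miss'] += 1   # first (compulsory) reference
        else:
            prev_warp, prev_thread = last_toucher[line]
            if prev_warp != warp:
                counts['inter_warp'] += 1
            elif prev_thread == thread:
                counts['intra_warp_intra_thread'] += 1
            else:
                counts['intra_warp_inter_thread'] += 1
        last_toucher[line] = (warp, thread)
    return counts
```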
An inspection of the BFS code reveals that inter-thread sharing (which manifests itself as both intra-warp and inter-warp locality) occurs when nodes in the graph share neighbours. Limiting the number of warps actively scheduled increases the hit rate of these accesses because it limits the amount of non-shared data in the cache, increasing the chance that these shared accesses hit.

Figure 3.13 explores the access stream of all the cache-sensitive benchmarks using SAGCS and an unbounded L1D. It shows that, with the exception of SSSP, the MCS benchmarks have significantly less locality in the access stream. The larger amount of intra-warp locality in SSSP is consistent with the significant performance improvement we observe for CCWS at smaller cache sizes, when the working set of all the threads does not fit in the L1D cache (see Figure 3.15).

3.4.3 Sensitivity to Victim Tag Array Size

Figure 3.14 shows the effect of varying the VTA size on performance. With a larger victim tag array the system is able to detect lost intra-warp locality occurring at greater access distances. Increasing the size of the VTA keeps data with intra-warp locality in the VTA longer and causes warp limiting to be appropriately applied.

Figure 3.14: Performance of CCWS at various victim tag array sizes. Normalized to the GTO scheduler. EPW=entries per warp. EPW 1-4 are 1-4 set associative respectively. All other victim tag arrays are 8-way set associative.

However, if the VTA size is increased too much, the lost-locality detector's time sensitivity is diminished. The VTA will contain tags from data that was evicted from the L1 data cache so long ago that it would have been difficult to capture with changes to the scheduling policy.
For example, at the 512-entry design point, each warp has a VTA that can track as much data as the entire L1D. In this configuration, a warp would need exclusive access to the L1 data cache to prevent all the detected loss of locality. The increase in detected lost locality results in excessive warp constraining on some workloads. Based on this data, the best-performing configuration with 16 entries per warp is selected.

Figure 3.15: Performance of CCWS and GTO at various cache sizes. Normalized to the GTO scheduler with a 32k L1D. All caches are 8-way set associative. The VTA size is 16 entries per warp for all instances of CCWS.

3.4.4 Sensitivity to Cache Size

Figure 3.15 shows the sensitivity of CCWS to the L1D size. As the cache size decreases, CCWS delivers a greater performance improvement relative to the GTO scheduler. This is because at small cache sizes it is even more desirable to limit multithreading to reduce the cache footprint. In fact SSSP, which showed no performance gain at 32k, shows a 35% speedup when the L1 cache is reduced to 8k. This is because SSSP has significant intra-warp locality, but its footprint is small enough that it is contained by a 32k L1D. As the cache size increases, the effect of CCWS dwindles relative to the GTO scheduler because the working set of most warps fits in a larger cache. At a large enough cache size, the choice of warp scheduler makes little difference.

Figure 3.16: Performance of CCWS on BFS with different graph sizes when varying the L1D cache size and scheduler choice. Normalized to the GTO scheduler with a 32k L1D.
The VTA size is 16 entries per warp for all instances of CCWS.

At 128k per L1D, CCWS shows little benefit over the GTO scheduler. This is because the input to these benchmarks is small enough that 128k captures most of the intra-warp locality. Since we are collecting results on a performance simulator that runs several orders of magnitude slower than a real device, the input to our benchmarks is small enough that they finish in a reasonable amount of time. Figure 3.16 shows the effect of increasing the size of the BFS input graph from the baseline 500k edges to 20M edges. As the input size increases, the performance of CCWS over the GTO scheduler also increases, even at a 128k L1 cache size. We observe that simply increasing the capacity of the L1 cache only diminishes the performance impact of CCWS with small enough input sets. Hence, we believe CCWS will have an even greater impact on the data sizes used in real workloads.

3.4.5 Sensitivity to K_THROTTLE and Tuning for Power

Figure 3.17 shows the effect of varying K_THROTTLE on L1D misses and performance. K_THROTTLE is the constant used in Equation (3.1) to tune the score assigned to warps when lost locality is detected (the LLDS). At smaller K_THROTTLE values, there is less throttling caused by the point system and more multithreading. At the smallest values of K_THROTTLE multithreading is not constrained enough and performance suffers. As K_THROTTLE increases, CCWS has a greater effect and the number of L1D misses falls across all the HCS benchmarks. In every HCS benchmark except GC, performance peaks and then falls as K_THROTTLE increases.

Figure 3.17: Performance of CCWS (normalized to the GTO scheduler) and MPKI of CCWS when varying K_THROTTLE.
However, since a miss in the L1D cache can incur a significant power cost, it may be desirable to use a higher K_THROTTLE value to reduce L1D misses at the cost of some performance. For example, at K_THROTTLE = 32 there is an average 18% reduction in L1D misses over the chosen K_THROTTLE = 8 design point. K_THROTTLE = 32 still achieves a 46% performance improvement over the GTO scheduler.

Figure 3.17 also demonstrates that each benchmark has a different optimal K_THROTTLE value. However, the difference in harmonic mean performance between choosing each benchmark's optimal K_THROTTLE value and using a constant K_THROTTLE = 8 is < 4%. For this reason, we do not pursue an online mechanism for determining the value of K_THROTTLE. If other HCS benchmarks have more variance in their intra-warp locality, then such a system should be considered.

The value of K_THROTTLE makes no difference in the CI benchmarks since there is little locality to lose and few VTA hits are reported. In the MCS benchmarks there are relatively few L1D MPKI, which keeps the product of K_THROTTLE and VTAHitsTotal / InstIssuedTotal low. In the MCS benchmarks, CCWS performance matches GTO scheduler performance until K_THROTTLE = 128. At this point there is a harmonic mean 4% performance degradation due to excessive throttling. Since their performance is largely unchanged by the value of K_THROTTLE, we do not graph the MCS or CI benchmarks in Figure 3.17.

Figure 3.18: Performance of SWL at various multithreading limits. Normalized to 32 warps.

3.4.6 Static Warp Limiting Sensitivity

In Section 3.2 we noted that the optimal SWL limiting number was different for different benchmarks. We also indicated that this value changes when running the same benchmark with different input sets. Figure 3.18 illustrates that peak performance for each of the HCS benchmarks occurs with different multithreading limits.
This happens because each workload has a different working set and access stream characteristics. Furthermore, Figure 3.19 shows that for different input graphs on BFS, the values of the peak performance point are different. This variation happens because the working set size is input data dependent. Finding the optimal warp limiting number in SWL would require profiling of each instance of a particular workload, making the adaptive CCWS more practical.

SWL also suffers in programs that have phased execution. The larger and more diverse the application is, the less likely a single warp limiting value will capture peak performance. This type of phased behaviour is not abundant in the HCS workloads we study, but as the amount and type of code running on the GPU continues to grow, so too will the importance of adaptive multithreading.

SWL is also sub-optimal in a multi-programmed GPU. If warps from more than one type of kernel are assigned to the same SM, a per-kernel limiting number makes little sense. Even if there were no cache thrashing in either workload individually, their combination may cause it to occur. CCWS will adapt to suit the needs of whatever warp combination is running on a SM and preserve their intra-warp locality. Since there will be no inter-warp locality among multi-programmed warps, preservation of intra-warp locality becomes even more important.

Figure 3.19: Performance of SWL with different multithreading values on BFS with different input graphs. Normalized to 32 warps.

3.4.7 Area Estimation

The major source of area overhead to support CCWS comes from the victim tag array. For the configuration used in Table 3.3 and a 48-bit virtual address space, we require 40 bits for each tag entry in our VTA. Using CACTI 5.3 [171], we estimate that this tag array would consume 0.026 mm² per core at 55nm or 0.78 mm² for the entire 30 core system.
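The raw storage behind this CACTI area estimate can be sanity-checked with simple arithmetic. The 16-entries-per-warp, 32-warps-per-core and 30-core figures are taken from the configuration described in this chapter; the byte totals below are derived here rather than quoted from the thesis.

```python
# VTA tag storage per core: 40-bit tags, 16 entries per warp, 32 warp slots.
TAG_BITS = 40
ENTRIES_PER_WARP = 16
WARPS_PER_CORE = 32
CORES = 30

bits_per_core = TAG_BITS * ENTRIES_PER_WARP * WARPS_PER_CORE
bytes_per_core = bits_per_core // 8   # 2560 bytes of tag storage per core
total_bytes = bytes_per_core * CORES  # storage across the 30-core system
```

Only the tag bits are counted here; the area figure itself comes from CACTI's model of the array, not from this raw byte count.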
This represents 0.17% of GeForce GTX 285 area, which our system closely models with the exception that we also model data caches. There are a variety of smaller costs associated with our design that are difficult to quantify and as a result are not included in the above estimation. Adding an additional 5 bits to each L1D cache line for the WID costs 160 bytes per core. There are 32 lost-locality score values, each represented in 10 bits, which are stored in a max heap. Also, there are two counter registers, one for the number of instructions issued and another for the total VTA hit signals. In addition, there is logic associated with the scoring system. Compared to the other logic in a SM, we do not expect this additional logic to be significant.

3.5 Summary

This work introduces a new classification of locality for GPUs. We quantify the caching and performance effects of both intra- and inter-warp locality for workloads in massively multi-threaded environments.

To exploit the observation that intra-warp locality is of greatest importance on highly cache-sensitive workloads, this work introduces Cache-Conscious Warp Scheduling. CCWS is a novel technique to capitalize on the performance benefit of limiting the number of actively-scheduled warps, thereby limiting L1 data cache thrashing and preserving intra-warp locality. Our simulated evaluation shows this technique results in a harmonic mean 63% improvement in throughput on highly cache-sensitive benchmarks, without impacting the performance of cache-insensitive workloads.

We demonstrate that on massively multi-threaded systems, optimizing the low level thread scheduler is of more importance than attempting to improve the cache replacement policy.
Furthermore, any work evaluating cache replacement on massively multi-threaded systems should do so in the presence of an intelligent warp scheduler.

As more diverse applications are created to exploit irregular parallelism and the number of threads sharing a cache continues to increase on both GPUs and CMPs, so too will the importance of intelligent HW thread scheduling policies, like CCWS.

Chapter 4

Divergence-Aware Warp Scheduling

This chapter uses hardware thread scheduling to improve the performance and energy efficiency of divergent applications on GPUs. We propose Divergence-Aware Warp Scheduling (DAWS), which introduces a divergence-based cache footprint predictor to estimate how much L1 data cache capacity is needed to capture intra-warp locality in loops. Predictor estimates are created from an online characterization of memory divergence and runtime information about the level of control flow divergence in warps. Unlike prior work on Cache-Conscious Warp Scheduling, which makes reactive scheduling decisions based on detected cache thrashing, DAWS makes proactive scheduling decisions based on cache usage predictions. DAWS uses these predictions to schedule warps such that data reused by active scalar threads is unlikely to exceed the capacity of the L1 data cache. DAWS attempts to shift the burden of locality management from software to hardware, increasing the performance of simpler and more portable code on the GPU. We show that DAWS achieves a harmonic mean 26% performance improvement over Cache-Conscious Warp Scheduling on a diverse selection of highly cache-sensitive applications, with minimal additional hardware.

Running irregular code on a GPU can cause both memory and control flow divergence.
Memory divergence (or an uncoalesced memory access) occurs when threads in the same warp access different regions of memory in the same SIMT instruction. Control flow (or branch) divergence occurs when threads in the same warp execute different control flow paths. This work focuses on improving the performance of several such irregular applications through warp scheduling.

Figure 4.1: DAWS example. Cache: 4 entries, 128B lines, fully assoc. By Time0, warp 0 has entered loop and loaded 4 lines into cache. By Time1, warp 0 has captured spatial locality, DAWS measures footprint. Warp 1 is prevented from scheduling as DAWS predicts it will oversubscribe cache. By Time2, warp 0 has accessed 4 lines for 32 iterations and loaded 1 new line. 3 lanes have exited loop, decreasing footprint. Warp 1 and warp 0 are allowed to capture spatial locality together. The example Compressed Sparse Row kernel from the figure:

    int C[] = {0,64,96,128,160,160,192,224,256};
    void sum_row_csr(float* A, ...) {
        float sum = 0;
        int i = C[tid];
        while (i < C[tid+1]) {
            sum += A[i];
            ++i;
        }
    }

Figure 4.1 presents a small example of divergent code to illustrate how scheduling can be used to make effective use of on-chip cache capacity. The example code sums each row of a Compressed Sparse Row (CSR) [22] data set. Each thread in the kernel sums one row using a loop. This code is divergent due to the data dependent nature of each sparse row's length and the position of each row's values in memory. This translates into branch divergence when threads within a warp travel through the loop a different number of times and memory divergence when threads access A[i].
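As a concrete check on the example, the number of cache lines an uncoalesced load touches is just the number of distinct lines covered by the active lanes' addresses. The short sketch below, assuming 4-byte floats and the figure's 128-byte lines (the helper name is ours, not from the thesis), counts the lines warp 0's four threads touch when they first load A[i] with i taken from C[0..3]:

```python
LINE_BYTES = 128
FLOAT_BYTES = 4  # assumed sizeof(float)

def lines_touched(indices):
    """Distinct cache lines covered by a warp's load of A[i], one index
    per active lane: each lane touches line (i * 4) // 128."""
    return len({(i * FLOAT_BYTES) // LINE_BYTES for i in indices})

# Warp 0's starting indices from C[] in Figure 4.1: rows begin at 0, 64, 96, 128,
# so its first loop iteration touches 4 distinct lines.
warp0_first_iteration = lines_touched([0, 64, 96, 128])
```

This matches the figure: by Time0, warp 0's four lanes have pulled four distinct lines into the four-entry cache, while a fully coalesced warp (consecutive indices within one line) would touch only one.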
This code has three key characteristics that can be leveraged to make effective use of cache capacity: (1) Each thread has spatial locality across loop iterations, since i is incremented by 1. (2) Each warp's load to A[i] can access multiple cache lines. (3) The number of cache lines accessed when a warp loads A[i] is dependent on the warp's active mask. Figure 4.1 also illustrates how our proposed Divergence-Aware Warp Scheduling (DAWS) technique takes these characteristics into account to maximize on-chip cache utilization. In the example, two warps (each with 4 threads) share a cache with 4 entries. Warp 0 enters the loop first and each of its threads loads its section of A into the cache. During warp 0's execution of the loop, Divergence-Aware Warp Scheduling learns that there is both locality and memory divergence in the code. At Time1, warp 1 is ready to enter the loop body. Divergence-Aware Warp Scheduling uses the information gathered from warp 0 to predict that the data loaded by warp 1's active threads will evict data reused by warp 0, which is still in the loop. To avoid oversubscribing the cache, Divergence-Aware Warp Scheduling prevents warp 1 from entering the loop by de-scheduling it. Now warp 0 captures its spatial locality in isolation until its threads begin to diverge. By Time2, warp 0 has only one thread active and its cache footprint has decreased. Divergence-Aware Warp Scheduling detects this divergence and allows warp 1 to proceed since the aggregate footprint of warp 0 and warp 1 fits in cache.

The code in Figure 4.1 contains intra-warp locality. Intra-warp locality occurs when data is loaded then re-referenced by the same warp [137]. The programmer may be able to re-write the code in Figure 4.1 to remove intra-warp locality. Hong et al. [63] perform such an optimization to Breadth First Search (BFS). However, this can require considerable programmer effort.
Another option is to have the compiler restructure the code independent of the programmer; however, static compiler techniques to re-arrange program behaviour are difficult in the presence of data dependent accesses [152]. One of this chapter's goals is to enable the efficient execution of more workloads on accelerator architectures. We seek to decrease the programmer effort and knowledge required to use the hardware effectively, while adding little to the hardware's cost.

Previously proposed work on Cache-Conscious Warp Scheduling (CCWS) [137], presented in Chapter 3, uses a reactive mechanism to scale back the number of warps sharing the cache when thrashing is detected. However, Figure 4.1 illustrates that cache footprints in loops can be predicted, allowing thread scheduling decisions to be made in a proactive manner. Our technique reacts to changes in thread activity without waiting for cache thrashing to occur. By taking advantage of dynamic thread activity information, Divergence-Aware Warp Scheduling is also able to outperform a scheduler that statically limits the number of warps run based on previous profiling runs of the same workload [137].

This work focuses on a set of GPU accelerated workloads from server computing and high performance computing that are both economically important and whose performance is highly sensitive to level one data (L1D) cache capacity. These workloads encompass a number of applications from server computing such as Memcached [62], a key-value store application used by companies like Facebook and Twitter, and a sparse matrix vector multiply application [44] which is used in Big Data processing.

4.1 Divergence, Locality and Scheduling

A key observation of our work is that a program's memory divergence, control flow divergence and locality can be profiled, predicted and used by the warp scheduler to improve cache utilization. This section is devoted to describing this observation in detail and is divided into two parts.
Section 4.1.1 explores where locality occurs in our highly cache-sensitive benchmarks and Section 4.1.2 classifies the locality in ways that are useful for our warp scheduler.

4.1.1 Application Locality

Figure 4.2 presents the hits and misses for all the static load instruction addresses (PCs) in our highly cache-sensitive benchmarks (described in Section 4.3). Each hit is classified as either an intra-warp hit (when data is loaded then re-referenced by the same warp) or an inter-warp hit (when one warp loads data that is hit on by another). This data was collected using Cache-Conscious Warp Scheduling. The loops in each program are highlighted by dashed boxes. This figure demonstrates that the bulk of the locality in our programs is intra-warp and comes from a few static load instructions. These load instructions are concentrated in the loops of the program.

Figure 4.2: Intra-warp hits, inter-warp hits and misses per thousand instructions (PKI) for all the static load instructions in each of our highly cache-sensitive benchmarks, identified by PC. The PCs contained in loops are highlighted in dashed boxes.

To understand the locality in these loops, Figure 4.3 presents a classification of intra-warp hits from loads within the loops of each application. Loads are classified as Accessed-This-Trip if the cache line was accessed by another load on this loop iteration. If the value in cache was not Accessed-This-Trip, then we test if it was accessed on the previous loop trip. If so, it is classified as Accessed-Last-Trip.
If the line was not accessed on either loop trip, it is classified as Other, indicating that the line was accessed outside the loop or in a loop trip less recent than the last one. This data demonstrates that the majority of data reuse in these applications is Accessed-Last-Trip. If the scheduler can keep the data loaded by a warp on one loop iteration in cache long enough to be hit on in the next loop iteration, most of the locality in these applications can be captured.

Figure 4.3: Classification of intra-warp hits within loops using an 8M L1D cache. Accessed-This-Trip=hit on data already accessed this loop iteration. Accessed-Last-Trip=hit on data accessed in immediately-previous loop iteration.

To illustrate the source of this locality in the code, consider the code for SPMV-Scalar in Example 5.1. Figure 4.3 indicates that all of the intra-warp locality within the loop of this code is Accessed-Last-Trip. This comes from loading cols[j] and val[j]. When inside this loop, each thread walks the arrays in 4-byte strides since j is incremented by one each iteration.

Based on these observations, we design our scheduling system to ensure that when intra-warp locality occurs in a loop, much of the data loaded by a particular warp in one iteration remains in the cache for the next iteration. We attempt to ensure this happens by creating a cache footprint prediction for warps executing in loops. The prediction is created from information about the loads inside the loop and the current level of control flow divergence in a warp on its current loop iteration.

4.1.2 Static Load Classification

To predict the amount of data each warp will access on each iteration of the loop, we start by classifying the static load instructions inside the loop.
We classify each static load instruction based on two criteria: memory divergence and loop trip repetition, each detailed in the subsections below.

Memory Divergence

If the number of memory accesses generated by a load equals the number of lanes active in the warp that issues it, then the load is completely diverged. Loads that generate one or two accesses no matter how many threads are active are completely converged. Anything in between is somewhat diverged. To understand the relationship between memory divergence and static instructions, consider Figure 4.4, which plots the number of threads active and accesses generated for every dynamic load instruction in BFS, grouped by the load instruction's PC. This figure illustrates that memory divergence behaviour can be characterized on a per-PC basis. Some PCs are always converged (328, 400 and 408 in Figure 4.4), some are almost always completely diverged (272) and others are consistently somewhat diverged (240). This result is consistent across all the highly cache-sensitive applications we studied. For simplicity, DAWS classifies each static load instruction that is not consistently converged as diverged.

Figure 4.4: Number of threads active and number of memory accesses generated for each dynamic load in BFS's loop. Accesses are grouped by PC.

This figure also demonstrates that there is a significant amount of control flow divergence in this application. This control flow divergence makes a solution that statically limits the number of warps when the kernel is launched [137] suboptimal, since it does not adapt to thread activity as the program executes.
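The per-PC classification rule above can be sketched as follows; the function names and sample format are illustrative, not from the thesis. Each sample pairs the number of active threads with the number of memory accesses the dynamic load generated:

```python
def classify_dynamic(active, requests):
    """Label one dynamic instance of a load."""
    if requests <= 2:
        return "converged"            # 1-2 accesses regardless of active lanes
    if requests == active:
        return "completely diverged"  # one access per active lane
    return "somewhat diverged"

def classify_static(samples):
    """DAWS's simplification: a PC that is not consistently converged
    is treated as diverged. samples = [(active, requests), ...]"""
    if all(classify_dynamic(a, r) == "converged" for a, r in samples):
        return "converged"
    return "diverged"

# A PC like 272 in Figure 4.4 (one access per active lane) classifies as
# diverged; a PC like 408 (always 1-2 accesses) classifies as converged.
```

Collapsing "somewhat diverged" into "diverged" is the conservative choice: the footprint predictor then over-estimates rather than under-estimates the lines such a load can touch.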
Some of the static loads in Figure 4.4 never have more than 8 threads active (for example, PC 328). These loads occur inside a branch within the loop and are only generated in BFS when a thread is processing a node with an unexplored edge.

Additionally, all 32 threads are never active in this loop due to branch divergence occurring prior to loop execution. The loop is only executed if a node is on the program's exploration frontier, which can be relatively sparsely distributed across threads. This illustrates that there is an opportunity to improve the estimated cache footprint for a loop by taking advantage of branch prediction. However, for the cache footprint prediction generated by DAWS, we assume the worst possible case (i.e., all of the loads in the loop are executed by all threads active on this loop iteration). Exploring branch prediction is beyond the scope of this work.

Loop Trip Repetition

Multiple static loads within one loop-trip may reference the same cache line. The Accessed-This-Trip values in Figure 4.3 demonstrate this can be significant. These loads do not increase the cache footprint because the data they access has already been accounted for by another load. We introduce the concept of a repetition ID to filter them out. All loads predicted to reference the same cache line are assigned the same repetition ID. When predicting the cache footprint of a loop, only one load from each repetition ID is counted. Classification of the repetition ID is done either by the compiler (predicting that small offsets from the same pointer are in the same line) or by hardware (described in Section 4.2.2).

4.2 Divergence-Aware Warp Scheduling (DAWS)

The goal of DAWS is to keep data in cache that is reused by warps executing in loops so that accesses from successive loop iterations will hit. DAWS does this by first creating a cache-footprint prediction for each warp.
Then, DAWS only allows load instructions to be issued from warps whose aggregate cache footprints are predicted to be captured by the L1D cache.

Figure 4.5 illustrates how DAWS works at a high level. A prediction of the cache footprint for each warp is created. These predictions are summed to create a total cache footprint. At time T0, all warps have no predicted footprint. Warps that enter loops with locality are assigned a prediction and consume a portion of the estimated available cache. When a warp exits the loop its predicted footprint is cleared. When the addition of a warp's prediction to the total cache footprint exceeds the effective cache size, that warp is prevented from issuing loads. The value of the effective cache size is discussed later in Section 4.2.1. To illustrate DAWS in operation, consider what happens at each time-step in Figure 4.5. Between time T0 and T1, warp 0 enters a loop. From a previous code characterization, DAWS has predicted that this loop has intra-warp locality and one divergent load. Sections 4.2.1 and 4.2.2 present two variations of DAWS that perform this code characterization in different ways. Warp 0's active mask is used to predict that warp 0 will access 32 cache lines (one for each active lane) in this iteration of the loop. The value of the footprint prediction for more complex loops is discussed in detail in Section 4.2.1. Between time T1 and T2, warp 1 enters the loop with only 16 active threads and receives a smaller predicted footprint of 16. Between T2 and T3, warp 2 reaches the loop. The addition of warp 2's predicted cache footprint to the current total cache footprint exceeds the effective cache size, therefore warp 2 is prevented from issuing any loads.
Between T3 and T4, 16 of warp 0's 32 threads have left the loop (causing control flow divergence), which frees some predicted cache capacity, allowing warp 2 to issue loads again.

The DAWS warp throttling mechanism is somewhat similar to the lost locality scoring system presented in CCWS [137]; however, there are several key differences. In CCWS, scores are assigned based on detected lost locality. Warps losing the most locality are given more exclusive cache access by preventing warps losing the least locality from issuing loads. CCWS is a reactive system that has to lose locality before trying to preserve it. DAWS is a proactive system that tries to prevent lost locality before it happens. DAWS is also proactive in decreasing the level of thread throttling. As threads within warps progress through a loop a different number of times, the data accessed by their divergent loads is reduced, causing DAWS to decrease their predicted cache footprint. DAWS takes this control flow divergence into account immediately and scales up the number of warps allowed to issue load instructions as appropriate. In contrast, CCWS scales back thread throttling by a constant factor each cycle, unless more lost locality is detected. In CCWS, when warps with the most exclusive cache access stop losing locality, their exclusivity is lost and they have to start missing again to get it back. DAWS ensures that all warps in loops with intra-warp locality do not lose their cache exclusivity until they exit the loop.

Figure 4.5: High level view of how DAWS's cache footprint prediction mechanism dynamically throttles the number of threads sharing the cache. CacheFP=Cache Footprint.

Figure 4.6 presents the microarchitecture required to implement our two proposed Divergence-Aware Scheduling techniques. Section 4.2.1 details Profiled Divergence-Aware Warp Scheduling (Profiled-DAWS), which uses off-line profiling to characterize memory divergence and locality. Section 4.2.2 presents Detected Divergence-Aware Warp Scheduling (Detected-DAWS), which detects both locality and memory divergence as the program executes. Both techniques make use of feedback from the branch unit (A in Figure 4.6), which tells the Warp Issue Arbiter the number of active lanes for any given warp. Detected-DAWS is implemented on top of Profiled-DAWS. In Detected-DAWS, locality and memory divergence information is detected as the program runs based on feedback from the memory system (B). This feedback allows Detected-DAWS to classify static load instructions based on dynamic information about how much locality each instruction has and how many memory accesses it generates.

4.2.1 Profiled Divergence-Aware Warp Scheduling (Profiled-DAWS)

Figure 4.6 presents the microarchitecture for both Profiled- and Detected-DAWS. Both versions of DAWS are implemented as an extension to the WIA's baseline warp prioritization logic. The output of the scheduler is a Can Issue bit vector
To Issue (WID/Tag/HasLocality) on load (#access generated) on coalescer resultDivergence Aware Scheduler (Profiled- and Detected-DAWS)Static Load Classification Table...PCLoadIsDiv RepIDIntra-Loop Repetition DetectorCache Footprint Prediction TableW1 FootprintPred...# Active LanesPCLoopBeginSampling Warp Table...Memory Divergence Detector...Shader CoreWIARegisters/ExecutionMemory UnitL1DCache# Active LanesCoalescerFeedback Unit (Detected-DAWS Only)Locality/MemoryDivergenceIntersectionPCLoopBeginPCLoad Tag WID...PCLoad DivCountPCLoopBegin WID HasLocalityBranch UnitWarps Ready [1:N]ABCDEF GABFigure 4.6: Detailed core model used for our DAWS solutions. N is the num-ber of warp issue slots on the core.56that prevents warps from issuing. The task of the scheduler is to determine this bitvector. As described in Section 4.2, this is driven by cache footprint predictions.To create the cache footprint prediction for each warp, DAWS must classify thebehaviour of static load instructions in loops. One method to predict the behaviourof static load instructions is to do a profiling pass of the application. To provide abound on the potential of an online solution, we propose Profiled-DAWS. We clas-sify each static load instruction using the two criteria presented in Section 4.1.2: (1)Is the load converged or diverged? (2) Does the load contribute to the footprint forthis iteration of the loop (i.e., The load’s repetition ID)? To collect this informationfor Profiled-DAWS, we perform an analysis of compiled assembly code and useruntime information gathered from a profiling pass of each application. Determin-ing if a load is converged or diverged is done by profiling all the accesses of eachload, similar to the analysis done on BFS in Section 4.1.2. To determine intra-looprepetition we do not use profile information. 
Instead, we examine the assembly and assume that all loads using the same base address register whose displacement values are within one cache line are repeated in a loop iteration. Profiling similar to the analysis in Section 4.2 is performed to determine which loops in the code have locality.

From a microarchitectural perspective, the classification information for all static load instructions in loops is stored in a static load classification table (C). Each entry in the table contains the PC of the first instruction in the loop where the load is located (PCLoopBegin), a flag indicating if it is a diverged load (IsDiv) and a repetition ID (RepID) that is used to indicate the intra-loop repetition ID of the load. Although only necessary for Detected-DAWS, the PC of the load instruction (PCLoad) is also stored here. Profiled-DAWS populates this table when a kernel is launched. These values are based on profiling information from previous runs of the kernel. The table is finite in size and can be spilled to memory; however, our applications have at most 26 static load instructions within loops.

The cache footprint prediction for each warp is stored in the cache footprint prediction table (D). This table has one entry for each warp issue slot on the core. In our baseline this is 32 entries. Each entry of the table contains the value of the predicted footprint (in cache lines) and the PC identifying the loop (PCLoopBegin). The scheduler checks instructions as they are issued, looking for loop begin/end points. To identify the loop bounds, we require that the compiler adds markers to the first and last instruction of each loop. This can be implemented by using two previously unused bits in the opcode (one bit for loop start, one bit for loop end), or by adding an additional instruction to indicate loop start/end. The current CUDA compiler already outputs the loop bounds in the form of comments. We anticipate that our small addition of loop bound markers would have a minor impact.
NVIDIA GPUs use a virtual ISA, which has made it easier to modify the hardware ISA in each of the last 3 architecture iterations.

When the scheduler detects that a warp has issued the first instruction of a loop, it uses the number of active lanes in the warp (A) to create the warp's prediction for this loop iteration. This value is written to the warp's entry in the cache footprint prediction table. Section 4.2.1 details how the cache footprint prediction is computed. The update logic also writes the PC of the first instruction in the loop to the table (PCLoopBegin). When the warp leaves the loop, the prediction table entry for the warp is cleared. To prevent deadlock, predicted footprints are also cleared while a warp waits at a barrier.

To determine the aggregate cache footprint, a prefix sum of each warp's cache footprint is performed, starting with the oldest warps. All of the warps whose prefix sum is less than our calculated effective cache size (defined in Equation 4.1) are eligible for issuing. Warps whose prefix sum is greater than the effective cache size are prevented from issuing load instructions.

EffCacheSize = kAssocFactor · TotalNumLines    (4.1)

To decide how many cache lines DAWS should assume are available in the L1D cache (i.e., determining our EffCacheSize value), we need to take the associativity of the cache into account. If we had a fully associative cache, we could assume that an LRU replacement policy would allow us to take advantage of every line in the cache. Since the L1D caches we study are not fully associative (our baseline is 8-way), our technique multiplies the number of lines in the cache by the kAssocFactor. The value of kAssocFactor is determined experimentally and explored in more detail in Section 4.4.2.

If the working set of one warp is predicted to exceed the L1D cache capacity,
Doing no de-scheduling inside loops that load more data thanis predicted to fit in cache reverts the system to hiding latency via multithreadingagain. We did not observe these large predictions in our workloads.The prediction update logic is run each time the first instruction in a loop isissued. This way the prediction is reflective of threads leaving the loop because ofdiffering loop trip counts across the warp.Warp-Based Cache Footprint PredictionThis section explains how the number of cache lines accessed for a given warp ina single iteration of a loop with significant intra-warp locality is predicted. In asingle threaded system, predicting the number of cache lines accessed in a loopiteration could be achieved by summing all the static load instructions predictedto be issued in the loop, while accounting for repetition caused by multiple staticloads accessing the same data. However, to create a prediction of the data accessedby a warp in one loop iteration, both memory and control flow divergence mustbe taken into account. We first find which loop the warp in question is executingwithin by looking at the PCLoopBegin for this warp in the prediction table. Next, wequery the static load classification table for all the entries with this PCLoopBegin (i.e.,entries for all of the loads in this loop). It sums all the entries returned as follows.If the entry indicates that the load is diverged (i.e., the IsDiv bit is set), then thisentry contributes as many cache lines as there are active threads. If the entry isconverged (and there is more than one thread active), then this entry contributestwo cache lines to the prediction. All entries with one active thread contributeone cache line. During the summation, each intra-loop repetition group (identifiedby RepID) is only counted once. If there are different divergence characteristicswithin the same repetition ID, then we count it as diverged. 
In our applications, we did not observe a diverged load accessing data loaded by a converged load (or vice-versa) in the same loop iteration. The result of this summation is written to this warp's entry in the cache footprint prediction table.

Predicted Footprint of Warps Outside Loops

In the previous sections, we only considered de-scheduling warps within loops because this is where the bulk of the application's memory accesses are. However, some applications may load a significant amount of data outside of loops. Figure 4.2 shows that PCs 1568 and 1600 from the GC benchmark both occur outside of the program's loop and access a significant amount of data, which can interfere with the accesses of warps within the loop. For this reason, if there are warps executing inside a loop, warps outside of loops can be de-scheduled. If any of the entries in the cache footprint prediction table is non-zero (i.e., at least one warp is in a loop), loads issued by warps outside of loops have their predictions updated as if they are executing their closest loop. Ideally a warp's closest loop is the next loop they will execute. For our purposes, we define a warp's closest loop as the next loop in program order. Since warps may skip loops, this may not always be the case, but in our applications this approximation is usually true.

Dealing with Inner Loops

DAWS also detects when a warp has entered an inner loop. When a warp issuing a new loop begin instruction already has a PCLoopBegin value in the cache footprint prediction table that is less than the PC of the new instruction, then we assume the warp has entered an inner loop. When this happens, the footprint prediction table entry for the warp is updated normally, giving the warp the prediction of the inner loop. However, when the warp leaves the inner loop, it does not clear either the prediction value or the PCLoopBegin.
When the outer loop begins its next iteration, it detects it is an outer loop (because the PC entry in the table is greater than the outer loop's beginning PC) and it recomputes the predicted footprint based on the inner loop's loads. This effectively limits the warps that can enter the outermost loops based on the predicted footprint of the innermost loop. We made this design decision because we observed that the majority of data reuse came from the innermost loop and there is significant data reuse between successive runs of the innermost loop. If we do not limit the number of warps entering the outer loop based on the inner loop, then there is the potential for multiple warps to interleave their inner loop runs, which can evict data repeatedly used by the inner loop. This can be applied to any arbitrary loop depth, but none of our applications had a loop depth greater than two.

4.2.2 Detected Divergence-Aware Warp Scheduling (Detected-DAWS)

Profiled-Divergence-Aware Warp Scheduling (Profiled-DAWS) relies on two key pieces of profile information. First, it requires that loops with intra-warp locality be known in advance of running the kernel. Second, it requires that all the global and local memory loads in those loops are characterized as converged or diverged and that all the intra-loop-trip repetition between those loads is known. Detected-Divergence-Aware Warp Scheduling (Detected-DAWS) requires no profile information. The only requirement for Detected-DAWS is that the compiler mark the beginning and ending of the program's loops. Detected-DAWS detects both memory divergence and intra-loop-trip repetition at runtime and populates the static load classification table (C in Figure 4.6) dynamically using the dynamic load classifier. Detected-DAWS operates by following the execution of a sampling warp through a loop. The first warp with more than two active threads that enters a loop is set as the sampling warp for the loop.
The sampling warp ID (WID) and PCLoopBegin for each loop being sampled are stored in the sampling warp table (E). When the sampling warp leaves the loop, the next warp to enter with two or more active threads becomes the new sampling warp for the loop. At any given time, multiple loops can be sampled but only one warp can sample each loop. The sampling warp table also stores a locality counter (HasLocality) that is used to indicate if loads for this loop should be entered into the static load classification table. Like the static load classification table, the sampling warp table is finite in size. Each of our applications has at most five loops. The dynamic load classifier interprets memory system feedback about loads issued from sampling warps.

It is worth noting that, other than the addition of PCLoad to each static load classification table entry, nothing about the divergence-aware scheduler used in Profiled-DAWS changes. The scheduler just operates with incomplete information about the loops until the dynamic load classifier has filled the static load classification table.

The following subsections describe how the dynamic load classifier uses the memory system feedback to populate the static load classification table.

Finding Loops with Locality

This section describes how Detected-DAWS determines which loops have intra-warp locality. Memory system feedback (B) informs the scheduler when loops have intra-warp locality. The feedback unit sends signals to the dynamic load classifier on each load issued, signifying if the load has intra-warp locality. The feedback unit reports both captured and lost intra-warp locality. To report this locality, cache lines in the L1D cache are appended with the WID of the instruction that initially requested them. Lost intra-warp locality is detected through the warp ID filtered victim tags mechanism described in CCWS [137]. Hits in the L1D cache on data that one warp loads and re-references are reported as captured intra-warp locality.
If a load has neither lost nor captured intra-warp locality, then the feedback unit informs the dynamic load classifier that the load has no intra-warp locality. Whenever the classifier is informed that a load from a sampling warp has taken place, it modifies that loop's locality counter in the sampling warp table. If the load was an instance of intra-warp locality, the counter is incremented; otherwise the counter is decremented. DAWS creates cache footprint predictions for loops with positive locality counters.

Dynamically Classifying Static Loads in Hardware

Once a loop is marked as having intra-warp locality, the dynamic load classifier starts generating static load classification table entries for the loop. To avoid having more than one entry for each static load in the static load classification table, Detected-DAWS requires the PC of the load be stored (PCLoad). Before inserting a new entry into the table, the dynamic load classifier must ensure that this PCLoad does not already exist in the table. If the entry does exist, the dynamic classifier updates the existing entry. The classifier consists of two components, an intra-loop repetition detector (F) and a memory divergence detector (G).

Memory Divergence Detector: The memory divergence detector is used to classify static load instructions as convergent or divergent. It receives information about load coalescing from the memory feedback unit. After an instruction passes through the memory coalescer, the resulting number of memory accesses is sent to the dynamic load classifier. The classifier reads this value in combination with the active thread count of the load instruction. If more than two threads in the instruction were active when the load was issued, the number of accesses generated is tested. If the number of accesses generated is greater than two, the divergence counter for this PC is incremented. If two or fewer accesses are generated, the counter is decremented.
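The per-PC counter update just described can be sketched in a few lines. This is an illustrative software model with an assumed name, not the hardware unit; it returns the updated counter for one load issued by the sampling warp.

```python
def update_divergence_counter(counter, active_threads, num_accesses):
    """Model of the memory divergence detector's counter update for one
    load from the sampling warp (illustrative, not the hardware)."""
    if active_threads <= 2:
        return counter      # two or fewer active threads: no update made
    if num_accesses > 2:
        return counter + 1  # more than two accesses generated: diverged
    return counter - 1      # two or fewer accesses: converged
```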
If the divergence counter is greater than one, this load is considered diverged; otherwise it is considered converged.

Intra-Loop Repetition Detector: The Intra-Loop Repetition Detector (ILRD) dynamically determines which static load instructions access the same cache line in the same loop iteration. It is responsible for populating the RepID field of the static load classification table. Each entry in the detector contains a tag, PCLoad and WID. On each load executed by a sampling warp, the ILRD is probed based on the tag of the load. If the tag is not found, the tag and PC/warp ID for the instruction that issued the load are written to the table. If the tag is found, then both the PC issuing the new load and the PC in the table are marked as intra-loop repeated and assigned the same repetition ID. When the sampling warp branches back to the start of the loop, all the values in the ILRD for this warp are cleared. Without the WID, multiple loops could not be characterized concurrently because the sampling warp for one loop could clear the entries for another. The ILRD is modeled as a set-associative tag array with an LRU replacement policy.

4.3 Experimental Methodology

We model Profiled-DAWS and Detected-DAWS as described in Sections 4.2.1 and 4.2.2 in GPGPU-Sim [19] (version 3.1.0) using the configuration in Table 4.1. Loop begin and end points are inserted manually in the assembly.

The highly cache-sensitive and cache-insensitive workloads we study are listed in Table 4.2, four of which come from the CCWS infrastructure available online [136].
The SPMV-Scalar benchmark comes from the SHOC benchmark suite [44]. Our benchmarks are run to completion, which takes between 14 million and 1 billion instructions.

Table 4.1: Divergence-aware warp scheduling GPGPU-Sim Configuration
# Compute Units: 30
Warp Size: 32
SIMD Pipeline Width: 8
Number of Threads / Core: 1024
Number of Registers / Core: 16384
Shared Memory / Core: 16KB
Constant Cache Size / Core: 8KB
Texture Cache Size / Core: 32KB, 64B line, 16-way assoc.
Number of Memory Channels: 8
L1 Data Cache: 32KB, 128B line, 8-way assoc. LRU
L2 Unified Cache: 128k/Memory Channel, 128B line, 8-way assoc. LRU
Compute Core Clock: 1300 MHz
Interconnect Clock: 650 MHz
Memory Clock: 800 MHz
DRAM request queue capacity: 32
Memory Controller: out of order (FR-FCFS)
Branch Divergence Method: PDOM [55]
GDDR3 Memory Timing: tCL=10 tRP=10 tRC=35 tRAS=25 tRCD=12 tRRD=8
Memory Channel BW: 8 (Bytes/Cycle)

Table 4.2: GPU Compute Benchmarks (CUDA and OpenCL)
Highly Cache Sensitive:
BFS Graph Traversal [36] (BFS)
Memcached [62] (MEMC)
Sparse Matrix Vector Multiply (Scalar) [44] (SPMV-Scalar)
Kmeans [36] (KMN)
Garbage Collection [21, 154] (GC)
Cache Insensitive (CI):
Needleman-Wunsch [36] (NDL)
Hot Spot [36] (HOTSP)
Speckle Reducing Anisotropic Diffusion [36] (SRAD)
Back Propagation [19] (BACKP)
LU Decomposition [36] (LUD)

4.4 Experimental Results

This section is organized as follows: Section 4.4.1 examines the performance of our workloads using Profiled-DAWS, Detected-DAWS and other warp schedulers. The remainder of this section is devoted to analyzing varying aspects of our design and exploring its sensitivity.

4.4.1 Performance

All data was collected using GPGPU-Sim running the following scheduling mechanisms:

GTO: A greedy-then-oldest scheduler [137]. GTO runs a single warp until it stalls, then picks the oldest ready warp. Warp age is determined by the time the warp is assigned to the shader core. For warps that are assigned to a core at the same time (i.e., they are in the same thread block), warps with the smallest scalar thread IDs are prioritized. Other simple schedulers (such as oldest-first and loose-round-robin) were implemented and GTO scheduling performed the best.

Best-SWL: Static Warp Limiting as described in [137]. Warp limitation values from 1 to 32 are attempted and the highest performing case is selected. A GTO policy is used to select between warps. The warp limiting value used for each application is shown in Table 4.3.

Table 4.3: Configurations for Best-SWL and CCWS
Best-SWL (warps actively scheduled): BFS: 5, MEMC: 7, SPMV-Scalar: 2, GC: 2, KMN: 4, All Others: 32
CCWS Config: KTHROTTLE: 8; Victim Tag Array: 8-way (512 total entries, 16 entries per warp); Warp Base Score: 100

Table 4.4: Configuration parameters used for DAWS
ILRD size: 64 entries per core, 8-way set associative
Associativity Factor: 0.6
Victim Tag Array: Same as CCWS in Table 4.3

CCWS: Cache-Conscious Warp Scheduling as described in [137]. The configuration parameters presented in Table 4.3 are used.

Profiled-DAWS: Profiled Divergence-Aware Warp Scheduling as described in Section 4.2.1. Loop profiles were generated manually based on PC statistics collected in sampling application runs. The applications were profiled with input data different from the evaluation data. GTO prioritization logic is used.

Detected-DAWS: Detected Divergence-Aware Warp Scheduling as described in Section 4.2.2, with the configuration in Table 4.4. GTO prioritization logic is used.

Figure 4.7 presents the Instructions Per Cycle (IPC) of our evaluated schedulers, normalized to CCWS. It illustrates that Profiled-DAWS and Detected-DAWS improve performance by a harmonic mean 25% and 26% respectively over CCWS on our highly cache-sensitive applications. In addition, they do not cause any performance degradation in the cache-insensitive applications.
The cache-insensitive applications have no loops with detected intra-warp locality. Profiled-DAWS and Detected-DAWS are able to outperform Best-SWL by a harmonic mean 3% and 5% respectively. The performance of Profiled-DAWS and Detected-DAWS against Best-SWL is highly application dependent. Detected-DAWS is able to outperform Best-SWL on BFS by 20%; however, it sees a 4% slowdown on SPMV-Scalar.

Figure 4.7: Performance of various scheduling techniques, normalized to CCWS.

Figure 4.8 can help explain the skewed performance results against Best-SWL. It presents the control flow divergence in each of our highly cache-sensitive applications. It shows warp lane activity for all issued instructions. Bars at the bottom of each stack indicate less control flow divergence, as more lanes are active on each issued instruction.

Figure 4.8: Breakdown of warp lane activity. Breakdown is presented as a fraction of total instructions executed. W[0:4] means 0 to 4 of an instruction's 32 lanes are active.

The two applications where DAWS improves performance relative to Best-SWL (BFS and MEMC) also have the most control flow divergence. The performance of Best-SWL is hampered most when there is significant control flow divergence. Selecting the same limiting value for every core over the course of the entire kernel is not optimal. This divergence occurs because of both loop-trip count variation across a warp and a discrepancy in the level of control flow divergence on each shader core.
We also evaluated Detected-DAWS without control flow awareness by assuming all lanes were active on every loop iteration. Removing control flow awareness results in a 43% and 91% slowdown for BFS and MEMC respectively versus Detected-DAWS. Other applications showed no significant performance change.

Figure 4.9 presents the L1D cache misses, intra-warp hits and inter-warp hits per thousand instructions (PKI) for our highly cache-sensitive benchmarks. It demonstrates that Profiled-DAWS and Detected-DAWS result in fewer cache misses than CCWS, which can account for a portion of the overall speedup. Since Profiled-DAWS and Detected-DAWS are able to predict the cache footprint of warps before they lose locality based on profile information created by other warps, they can apply thread limiting before CCWS, removing the unnecessary cache misses. In addition, Profiled-DAWS and Detected-DAWS do not de-prioritize warps once they have entered a loop with locality. The scheduling point system in CCWS can potentially de-prioritize warps hitting often in cache when they stop producing accesses that hurt locality. We performed experiments and found that 46% of CCWS's lost locality occurs after a warp has been de-scheduled while in a loop. CCWS prioritizes warps based solely on detected lost locality. Warps may be de-scheduled inside a high-locality loop before they complete the loop, resulting in the eviction of their reused data. Once loops are properly classified, this type of lost locality never occurs using DAWS. DAWS ensures that once a warp enters a high-locality loop, it is not de-scheduled until the warp exits the loop or encounters a barrier. None of our highly cache-sensitive applications have barrier instructions. Figure 4.9 also demonstrates that the cache miss rate in Profiled-DAWS and Detected-DAWS is similar to that of Best-SWL. This suggests that the performance increase seen by Profiled-DAWS and Detected-DAWS over Best-SWL comes from decreasing the level of warp limiting when the aggregate footprint of threads scheduled can still be contained by the cache.

Figure 4.9: L1D intra-warp hits, inter-warp hits and misses per thousand instructions (PKI) of various schedulers.

Figure 4.10 plots victim tag array hits, which indicate a loss of intra-warp locality. There is no victim tag array required to implement Profiled-DAWS, but for the purposes of this data, one is added. This figure illustrates that there is a large reduction in detected instances of lost locality when using the DAWS solutions. In addition, this figure shows a slight increase in detected lost locality in Detected-DAWS versus Profiled-DAWS. This is because Detected-DAWS requires some time to classify static load instructions before appropriate limiting is able to take effect.

Figure 4.10: Victim tag array hits per thousand instructions (PKI) (indicating lost intra-warp locality).

Figure 4.11 breaks down core activity into cycles where an instruction issues, cycles where there are no instructions to issue (i.e., no warps are ready to be issued) and cycles where an instruction could have issued, if its warp had not been de-scheduled by the scheduling system.

Figure 4.11: Breakdown of core activity normalized to GTO's total cycles for each application.
This is aggregate information collected over all the shader cores. This figure demonstrates that both Profiled-DAWS and Detected-DAWS reduce the number of cycles spent de-scheduling warps versus CCWS.

4.4.2 Determining the Associativity Factor

Figure 4.12 plots the performance change of our highly cache-sensitive applications as the kAssocFactor is swept. All the applications consistently peak at 0.6, except BFS, which shows a small performance gain at 0.7 versus 0.6. This is consistent with the assertion that kAssocFactor should be mostly independent of the application. The slight performance improvement for BFS at 0.7 can be explained by the fact that it has branches inside its loop that cause some of the loads to be infrequently uncovered, as discussed in Section 4.1.2. Since DAWS overestimates by assuming all the loads in the loop are uncovered, a larger kAssocFactor makes up for this per-warp overestimation by raising the effective cache size cutoff.

Figure 4.12: Detected-DAWS performance as the cache associativity factor is swept. Normalized to CCWS.

4.4.3 Area Estimation

The tables added for Profiled-DAWS (i.e., the cache footprint prediction table and the static load classification table) are each modeled with only 32 entries and are negligible in size. The additional area added by Detected-DAWS comes from a victim tag array; the other tables are 64 entries or less. A victim tag array is also used in CCWS, so there is negligible area difference between Detected-DAWS and CCWS. However, compared to the Best-SWL or GTO schedulers, both CCWS and Detected-DAWS have a CACTI [171] estimated area overhead of 0.17%, which is discussed in more detail in [137].

4.4.4 Dynamic Energy Estimation

We investigated two energy models for GPUs to evaluate the effect DAWS has on energy: GPUSimPow [108] and GPUWattch [102].
Due to the recent release date of these simulators, we were unable to fully integrate our solution into their frameworks. However, we extracted the nJ-per-operation constants used in GPUWattch for DRAM reads, DRAM pre-charges and L2/L1D cache hits/misses, which are the metrics that dominate the overall energy consumed in the highly cache-sensitive applications and are the key metrics affected by DAWS. This calculation shows that DAWS consumes 2.4× less and 23% less dynamic energy in the memory system than GTO and CCWS respectively. This power reduction is primarily due to an increase in the number of L1D cache hits, reducing power consumed in the memory system. This estimate does not include the dynamic energy required for Detected-DAWS or CCWS tables, victim tag array or logic. We anticipate this energy will be small in comparison to the energy used in the memory system.

4.5 Summary

This work quantifies the relationship between memory divergence, branch divergence and locality on a set of workloads commonly found in server computing. We demonstrate that divergence and locality characteristics of static load instructions can be accurately predicted based on previous behaviour. Divergence-Aware Warp Scheduling uses this predicted code behaviour in combination with live thread activity information to make more locality-aware scheduling decisions. Divergence-Aware Warp Scheduling is a novel technique that proactively uses predictions to prevent cache thrashing before it occurs and aggressively increases cache sharing between warps as their thread activity decreases.

Our simulated evaluations show that our fully dynamic technique (Detected-DAWS) results in a harmonic mean 26% performance improvement over Cache-Conscious Warp Scheduling [137] and a 5% improvement over the profile-based Best-SWL [137].
Performance relative to Best-SWL is improved as much as 20% when workloads have significant control flow divergence.

Our work increases the efficiency of several highly divergent, cache-sensitive workloads on a massively parallel accelerator. Our programmability case study demonstrates that Divergence-Aware Warp Scheduling can allow programmers to write simpler code without suffering a significant performance loss by effectively shifting the burden of locality management from software to hardware.

Chapter 5

A Programmability Case Study

This chapter presents a case study using two implementations of Sparse Matrix Vector Multiply (SPMV) from the SHOC benchmark suite [44].1 This case study is chosen because it is a real example of code that has been ported to the GPU then optimized. Example 5.1 presents SPMV-Scalar, which is written such that each scalar thread processes one row of the sparse matrix. This is similar to how the algorithm might be implemented on a multi-threaded CPU. The bold code in SPMV-Scalar highlights its divergence issues. Example 5.2 shows SPMV-Vector, which has been optimized for performance on the GPU. Both pieces of code generate the same result and employ the same data structure. The bold code in SPMV-Vector highlights the added complexity introduced by GPU-specific optimizations.

One goal of this work is to enable less optimized code such as Example 5.1 to achieve performance similar to the optimized code in Example 5.2. In SPMV-Scalar, the accesses to cols[j] and val[j] will have significant memory divergence and the data-dependent loop bounds will create branch divergence. Like the code in Figure 4.1, SPMV-Scalar has spatial locality within each thread since j is incremented by one each iteration. Divergence-Aware Warp Scheduling seeks to capture this locality.

In the SPMV-Vector version each warp processes one row of the sparse matrix.
Restructuring the code in this way removes much of the memory divergence present in the scalar version since the accesses to cols[j] and val[j] will have spatial locality across each SIMT instruction. However, this version of the code forces the programmer to reason about warp length and the size of on-chip shared memory, and it requires a parallel reduction of partial sums to be performed for each warp. In addition to writing and debugging the additional code required for SPMV-Vector, the programmer must tune thread block sizes based on which machine the code is run on. Even if the programmer performed all these optimizations correctly, there is no guarantee that SPMV-Vector will outperform SPMV-Scalar since the shape and size of the input matrix may render the optimizations ineffective. Previous work has shown that sparse matrices with fewer non-zero elements per row than the GPU's warp width do not take advantage of the potential increase in coalesced accesses offered by SPMV-Vector [27].

1. For brevity, some keywords in the original version of Examples 5.1 and 5.2 were removed. All of our experiments are run without modifying the original kernel code.

Figure 5.1: Highly divergent SPMV-Scalar kernel

__global__ void
spmv_csr_scalar_kernel(const float* val,
                       const int* cols,
                       const int* rowDelimiters,
                       const int dim,
                       float* out)
{
    int myRow = blockIdx.x * blockDim.x + threadIdx.x;
    texReader vecTexReader;
    if (myRow < dim)
    {
        float t = 0.0f;
        int start = rowDelimiters[myRow];
        int end = rowDelimiters[myRow+1];
        // Divergent Branch
        for (int j = start; j < end; j++)
        {
            // Uncoalesced Loads
            int col = cols[j];
            t += val[j] * vecTexReader(col);
        }
        out[myRow] = t;
    }
}

This reliance on per-machine tuning and the unpredictability of manual optimization techniques can make programming GPUs difficult.
In Section 5.1 we demonstrate that Divergence-Aware Warp Scheduling allows the programmer to write the simpler, more portable SPMV-Scalar while still capturing almost all of the performance benefit of SPMV-Vector.

This case study should not be construed to suggest that Divergence-Aware Warp Scheduling can replicate the performance of any hand-tuned optimization or generally solve the performance issues surrounding divergence on GPUs. The study is presented as one real world example of optimized GPU code to demonstrate how intelligent warp scheduling can capture almost as much locality as this particular hand-tuned implementation.

Figure 5.2: GPU-optimized SPMV-Vector kernel

__global__ void
spmv_csr_vector_kernel(const float* val,
                       const int* cols,
                       const int* rowDelimiters,
                       const int dim,
                       float* out)
{
    int t = threadIdx.x;
    int id = t & (warpSize-1);
    int warpsPerBlock = blockDim.x / warpSize;
    int myRow = (blockIdx.x * warpsPerBlock) + (t / warpSize);
    texReader vecTexReader;
    __shared__ volatile float partialSums[BLOCK_SIZE];
    if (myRow < dim)
    {
        int warpStart = rowDelimiters[myRow];
        int warpEnd = rowDelimiters[myRow+1];
        float mySum = 0;
        for (int j = warpStart + id; j < warpEnd; j += warpSize)
        {
            int col = cols[j];
            mySum += val[j] * vecTexReader(col);
        }
        partialSums[t] = mySum;
        // Reduce partial sums
        if (id < 16) partialSums[t] += partialSums[t+16];
        if (id <  8) partialSums[t] += partialSums[t+ 8];
        if (id <  4) partialSums[t] += partialSums[t+ 4];
        if (id <  2) partialSums[t] += partialSums[t+ 2];
        if (id <  1) partialSums[t] += partialSums[t+ 1];
        // Write result
        if (id == 0)
        {
            out[myRow] = partialSums[t];
        }
    }
}

Figure 5.3: Execution time (lower values are faster) of SPMV-Scalar using various warp schedulers normalized to the best performing scheduler from SPMV-Vector.

5.1 Case Study Results

In this section we examine the results of our case study.
To run these experiments, the size of on-chip scratchpad memory was increased to 48k, while leaving the L1D cache size constant. This was done so that shared memory usage would not be a limiting factor for SPMV-Vector and our results would not be biased towards SPMV-Scalar. The input sparse matrix is randomly generated by the SHOC framework. The matrix has 8k rows with an average of 82 non-zero elements per row. Figure 5.3 presents the execution time of SPMV-Scalar from Example 5.1 using our evaluated schedulers normalized to the GPU-optimized SPMV-Vector from Example 5.2 using its best performing scheduler. Like the other cache-insensitive applications we studied, the scheduler choice for SPMV-Vector makes little difference. There is < 1% performance variation between all the schedulers we evaluated. This figure demonstrates that SPMV-Scalar suffers significant performance loss when using previously proposed schedulers like GTO and CCWS. Best-SWL captures almost all the performance of SPMV-Vector, but requires the user to profile the application/input data combination with different limiting values before running. Detected-DAWS does not require any profiling information or additional programmer input, and its execution time is within 4% of SPMV-Vector's.

Figure 5.4: Ratio of various metrics for SPMV-Scalar using Detected-DAWS vs. SPMV-Vector using its best performing scheduler. Interconnect Full = instances where cores cannot access the interconnect due to contention.

Figure 5.4 compares several properties of SPMV-Scalar using Detected-DAWS to SPMV-Vector using its best performing scheduler. This graph shows that SPMV-Scalar has some advantages over SPMV-Vector, if Detected-DAWS is used. SPMV-Scalar executes 2.8× fewer dynamic instructions, decreasing the amount of dynamic power consumed on each core.
SPMV-Scalar also requires 32× fewer warps, decreasing shader initialization overhead (which is not modeled in GPGPU-Sim) and the number of scheduling entities the GPU must deal with.

Since SPMV-Vector and SPMV-Scalar both perform the same computation on the same input, they fundamentally read and write the same data to and from memory. However, cache system performance and memory coalescing result in a discrepancy in the amount of off-chip traffic generated by each workload. Reads in SPMV-Vector are coalesced since lanes in each warp access consecutive values. However, since DAWS captures much of SPMV-Scalar's spatial locality in the L1D cache, there is only a 25% increase in read traffic. As a reference point, SPMV-Scalar using GTO produces > 15× more reads than SPMV-Vector. In addition, off-chip writes using SPMV-Vector are increased 8-fold. This happens because SPMV-Scalar is able to coalesce writes to the output vector since each warp attempts to write multiple output values in one SIMT instruction. SPMV-Vector must generate one write request for each row of the matrix, and since the L1D caches evict global data on writes, all of these writes go to memory. The last metric compared indicates that contention for the interconnect is greatly increased using SPMV-Vector.

Chapter 6

A Variable Warp-Size Architecture

This chapter studies the effect of warp sizing and scheduling on performance and efficiency in GPUs. We propose Variable Warp Sizing (VWS), which improves the performance of divergent applications by using a small base warp size in the presence of control flow and memory divergence. When appropriate, our proposed technique groups sets of these smaller warps together by ganging their execution in the warp scheduler, improving performance and energy efficiency for regular applications.
Warp ganging is necessary to prevent performance degradation on regular workloads due to memory convergence slip, which results from the inability of smaller warps to exploit the same intra-warp memory locality as larger warps. This paper explores the effect of warp sizing on control flow divergence, memory divergence, and locality. For an estimated 5% area cost, our ganged scheduling microarchitecture results in a simulated 35% performance improvement on divergent workloads by allowing smaller groups of threads to proceed independently, and eliminates the performance degradation due to memory convergence slip that is observed when convergent applications are executed with smaller warp sizes.

Figure 6.1 plots the Instructions Per Cycle (IPC) resulting from shrinking warp size from 32 threads to 4 threads while keeping the machine's total thread-issue throughput and memory bandwidth constant. Figure 6.1a shows the effect of shrinking the warp size on a large suite of real world applications, while Figure 6.1b plots the harmonic mean performance of 15 applications which are selected to represent the 3 classes of workloads we study throughout this paper.

Figure 6.1: A survey of performance versus warp size. (a) Performance of 165 real world applications using a warp size of 4, normalized to a warp size of 32. (b) Performance versus warp size using a representative subset of applications presented in 6.1a. These applications are described in more detail in Section 6.3.

We classify a workload as being divergent when performance increases as the warp size decreases, convergent when performance decreases as the warp size decreases, and warp-size insensitive when performance is independent of warp size.
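This three-way taxonomy can be made concrete with a small helper. The sketch below is illustrative only: the function name and the 5% noise tolerance are assumptions (the text does not specify a threshold), introduced to make the classification rule explicit.

```python
def classify_workload(ipc_by_warp_size, tol=0.05):
    """Label an application by how its IPC responds to warp size.

    ipc_by_warp_size: dict mapping warp size -> measured IPC.
    tol is an assumed noise threshold, not a value from the text."""
    smallest = ipc_by_warp_size[min(ipc_by_warp_size)]
    largest = ipc_by_warp_size[max(ipc_by_warp_size)]
    if smallest > largest * (1 + tol):
        return "divergent"            # faster as warps shrink
    if smallest < largest * (1 - tol):
        return "convergent"           # slower as warps shrink
    return "warp-size insensitive"    # performance independent of size
```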
Figure 6.1 demonstrates that application performance is not universally improved when the warp size is decreased. This data indicates that imposing a constant machine-dependent warp size for the varied workloads running on GPUs can degrade performance on divergent applications, convergent applications, or both.

A large set of existing, highly regular GPU applications do not see any performance improvement at a smaller warp size. However, the divergent applications which do see a performance improvement represent a class of workloads that are important for future GPUs. Prior work such as [31, 32, 113, 118] has shown great potential for increasing the performance and energy efficiency of these types of workloads by accelerating them on a GPU. These applications include future rendering algorithms such as raytracing, molecular dynamics simulations, advanced game physics simulations, and graph processing algorithms, among many others. The goal of our proposed architecture is to evolve GPUs into a more approachable target for these parallel, but irregular, applications while maintaining the efficiency advantages of GPUs for existing codes.

Figure 6.2 plots the performance and resulting SIMT lane utilization of different warp sizes for each of the applications we study. Control-divergent applications have a low utilization rate at a warp size of 32 and see utilization increase as the warp size is decreased. These workloads are able to take advantage of executing different control flow paths simultaneously by using a smaller warp size. Convergent applications have a high lane utilization rate at a warp size of 32 and see their utilization decrease as the warp size is reduced. This reduction in utilization occurs because of increased pressure on the memory system caused by destroying horizontal locality across a larger warp. Horizontal locality occurs when threads within a warp or thread block access similar memory locations.
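To make horizontal locality and its loss concrete, the following sketch counts the memory requests generated when 32 consecutive 4-byte accesses are issued as one 32-thread warp versus several smaller warps. It is an illustrative model only: the function name, the 128-byte line size, and perfect same-instruction coalescing are assumptions, not the simulated hardware.

```python
def requests_generated(thread_addresses, warp_size, line_size=128):
    """Count memory requests when the given per-thread addresses are
    issued warp_size threads at a time, assuming accesses from the same
    warp instruction coalesce into one request per cache line touched."""
    requests = 0
    for i in range(0, len(thread_addresses), warp_size):
        warp = thread_addresses[i:i + warp_size]
        requests += len({addr // line_size for addr in warp})
    return requests

# 32 threads streaming through consecutive 4-byte words (one 128B line)
addrs = [4 * t for t in range(32)]
```

With a 32-thread warp, all 32 accesses coalesce into a single request, while eight 4-thread warps issue eight separate requests for the same line over multiple cycles; if the line is evicted between those cycles, the extra requests become additional misses, which is the memory convergence slip effect described above.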
Modern GPUs coalesce memory requests from the same warp instruction that access the same cache line. By allowing smaller groups of threads to proceed at different rates, the locality that existed across the same static instruction is spread over multiple cycles, causing additional contention for memory resources. We call this effect memory convergence slip.

In addition to the performance benefit convergent applications experience with larger warps, convergent and warp-size insensitive applications gain energy efficiency from executing with larger warps. A larger warp size amortizes the energy consumed by fetch, decode, and warp scheduling across more threads. When there is no performance benefit to executing with smaller warps, the most energy-efficient solution is to execute with as large a warp size as possible.

Our paper first examines the effect of providing a variable warp size, which can be statically adjusted to meet the performance and energy efficiency demands of the workload. We then propose Variable Warp Sizing, which gangs groups of small warps together to create a wider warp and dynamically adjusts the size of each gang running in the machine based on the observed divergence characteristics of the workload.

Figure 6.2: Normalized IPC (top) and the average number of active thread lanes on cycles when an instruction is issued (bottom). All configurations can issue 32 thread instructions per cycle.

Prior work such as [30, 48, 53, 55, 115, 123, 132–134] proposes various techniques to improve Single Instruction Multiple Data (SIMD) efficiency or increase thread level parallelism for divergent applications on GPUs.
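The coalescing rule and the memory convergence slip effect described above can be illustrated with a small behavioral model (a hypothetical Python sketch, not the simulator used in this work; the access pattern is chosen only for illustration):

```python
# Hypothetical behavioral sketch (not the simulator used in this work) of
# intra-warp memory coalescing, showing how memory convergence slip inflates
# the request count when a 32-thread warp becomes eight 4-thread warps that
# no longer issue their loads in the same cycle.

LINE_SIZE = 128  # bytes per cache line, matching Table 6.1

def coalesced_requests(addresses):
    """One warp instruction generates one request per unique cache line."""
    return len({addr // LINE_SIZE for addr in addresses})

# 32 threads each load 4 consecutive bytes: a fully coalesced pattern.
addrs = [tid * 4 for tid in range(32)]

# A 32-wide warp issues the load in a single cycle: one request.
wide = coalesced_requests(addrs)

# Eight 4-wide warps issue the same loads over eight different cycles
# (convergence slip): the same line is re-requested once per cycle.
slipped = sum(coalesced_requests(addrs[w * 4:(w + 1) * 4]) for w in range(8))

print(wide, slipped)  # 1 8
```

Under this model, MSHR merging in the real machine can absorb some, but not all, of the duplicated requests, which is consistent with the MSHR merge behavior discussed in Section 6.1.1.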
However, the use of small warps is the only way to improve both SIMD efficiency and thread level parallelism in divergent code. These prior works focus on repacking, splitting, and scheduling warps under the constraint of a fixed-size warp. Our work approaches the problem from the other direction. We simplify the acceleration of divergent workloads by starting with a small warp size and propose a straightforward ganging architecture to regain the efficiencies of a larger warp. Prior work can improve the performance of divergent applications when the total number of unique control paths is limited and the number of threads traversing each respective path is large. Starting with smaller warps allows our microarchitecture to natively execute many more concurrent control flow paths, removing this restriction. Section 6.5 presents a more detailed quantitative and qualitative comparison to prior work.

6.1 Trade-offs of Warp Sizing

This section details the effect of warp size on both the memory system and SM front-end. This data motivates an architecture that is able to dynamically vary warp size.

6.1.1 Warp Size and Memory Locality

Figure 6.3 shows the effect warp size has on L1 data cache locality in terms of hits, misses, and Miss Status Holding Register (MSHR) merges Per Thousand Instructions (PKI) for the applications we study. As the warp size is decreased, some applications see an increase in the number of L1 data cache accesses. This phenomenon occurs when memory accesses that were coalesced using a larger warp size become distributed over multiple cycles when smaller warps are used, an effect we term memory convergence slip.

In the divergent applications, memory convergence slip does not significantly degrade performance for two reasons. First, an application that is control flow diverged has less opportunity for converged accesses because fewer threads are participating in each memory instruction.
Second, even when convergence slip occurs on a divergent application, as it does in CoMD, ObjClassifier, and Raytracing, it also often results in more cache hits, mitigating the effect on performance. While the control-divergent Raytracing application also sees an increase in misses, the performance cost of these misses is offset by the increased lane utilization observed with smaller warps.

In the convergent applications, memory convergence slip has a greater effect on performance. All of these applications see both an increase in the total number of memory accesses and cache misses at smaller warp sizes. Radix Sort and Game 2 also see an increase in MSHR merges. The loss in throughput caused by additional traffic to the L2 data cache and DRAM in these applications is not offset by any increase in lane utilization, as these applications already have high SIMT utilization at larger warp sizes. Perhaps not surprisingly, L1 locality for the warp-size insensitive applications is insensitive to the warp size.

Figure 6.3: L1 data cache hits, misses, and MSHR merges per thousand instructions (PKI) at different warp sizes.

6.1.2 Warp Size and SM Front-end Pressure

Figure 6.4 plots the average number of instructions fetched per cycle at various warp sizes. Decreasing the warp size places increasing pressure on the SM's front-end. Convergent and warp-size insensitive applications see a nearly linear increase in fetch requests as the warp size is reduced.
This data indicates that a fixed 4-wide warp architecture increases front-end energy consumption for non-divergent applications, even if the performance does not suffer. While divergent applications also see increased front-end activity, the ability of the architecture to exploit many more independent control paths is fundamental to increasing the performance of these applications. Our design focuses on creating a flexible machine that is able to expand and contract the size of warps executing in the system. The best warp size for a given application balances the demands for independent control flow with the limitations due to memory convergence slip.

Figure 6.4: Average instructions fetched per cycle. Fetch bandwidth is scaled to match issue bandwidth for each warp size.

6.2 Variable Warp Sizing

This section describes the high level operation of Variable Warp Sizing, discusses the key design decisions, and details the operation of each architectural component. We selected four threads as the minimum warp size for three reasons: (1) graphics workloads commonly process threads in groups of four known as quads, (2) the performance opportunity for the compute workloads we examined reaches diminishing returns at warp sizes smaller than four, and (3) the area overhead rises notably at warp sizes smaller than four. We discuss the area trade-offs of different warp sizes in Section 6.4.

6.2.1 High-level Operation

The goal of VWS is to create a machine that is able to dynamically trade off MIMD-like performance with SIMD-like efficiencies depending on the application.
Our proposed variable warp sized machine shrinks the minimum warp size to four threads by splitting the traditional GPU datapath into eight unique slices. Each slice can fetch, decode, and issue instructions independent of other slices. Figure 6.5 presents the microarchitecture of our proposed design. Each slice is statically assigned threads in a linear fashion: threads 0-3 are assigned to slice 0, 4-7 to slice 1, and so on. Threads cannot migrate between slices.

Figure 6.5: Variable Warp Sizing SM microarchitecture. Shaded units are disabled when operating in ganged mode to save energy.

VWS does not change the number of register file banks in the SM or impose any additional communication between them. As in our baseline, each four-lane slice of the datapath receives its own set of four register file banks (1 in Figure 6.5). VWS requires no changes to the memory unit (2), which includes the shared scratchpad memory, L1 data cache, and texture cache. All memory requests generated by any slices in the same cycle are presented to the memory unit as a single 32-thread access in the same manner as when executing 32-wide warps.
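The static thread-to-slice mapping described above can be summarized in a short sketch (our own naming and simplifications; `slice_of` and `lane_of` are illustrative helpers, not part of the proposed hardware interface):

```python
# Minimal sketch of the static thread-to-slice assignment: threads are
# assigned linearly, four per slice, and one cycle's requests from all
# slices are presented to the memory unit as a single 32-thread access.

THREADS_PER_SLICE = 4
NUM_SLICES = 8

def slice_of(thread_id):
    """Slice that owns a thread; threads never migrate between slices."""
    return (thread_id // THREADS_PER_SLICE) % NUM_SLICES

def lane_of(thread_id):
    """Lane within the owning slice."""
    return thread_id % THREADS_PER_SLICE

# Threads 0-3 map to slice 0, threads 4-7 to slice 1, and so on.
assert [slice_of(t) for t in range(8)] == [0, 0, 0, 0, 1, 1, 1, 1]

# Requests issued by all eight slices in the same cycle are merged into a
# single 32-wide access vector, exactly as a 32-thread warp would present.
per_slice_requests = {s: [0x1000 + s * 16 + l * 4 for l in range(4)]
                      for s in range(NUM_SLICES)}
access_vector = [a for s in range(NUM_SLICES) for a in per_slice_requests[s]]
print(len(access_vector))  # 32
```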
The coalescing unit also operates in the same fashion as the baseline; multiple threads accessing the same cache line in the same cycle generate only one memory request.

To facilitate warp sizes greater than four, we introduce the warp ganging unit (3), which is able to override local per-slice fetch/decode (4) and issue (5) decisions. The gang front-end (6) performs instruction fetch and decode once for all small warps participating in a gang. The gang issue scheduler enforces lock-step execution of all slices participating in a given gang. The warp ganging unit is discussed in more detail in Section 6.2.2.

When VWS is operating in ganged-only mode, the per-slice front-end logic (7) and warp issue scheduler (8) are disabled to save energy. When operating in slice-only mode, each SM slice uses its independent front-end to fetch and decode instructions. When both gangs and independent warps are present in the system at the same time, gangs are given both fetch and issue priority. This policy ensures that gangs remain in lock-step as long as possible. When possible, independent warps are used to fill in holes in the executing gang. Each slice front-end includes an L0 I-cache (9) to reduce pressure on the larger L1 I-cache (10) which is shared by all slices in the SM. Without L0 I-caches, providing peak throughput in slice-only mode would require 8× the L1 I-cache bandwidth. Our microarchitecture allows 9 separate fetch schedulers (one for each of eight slices and one for gangs) to request instructions from the L1 I-cache. We study the effects of scaling L1 I-cache bandwidth in Section 6.4. Arbitration to determine which scheduler is granted L1 access is done by the L1 fetch arbiter (11), described in more detail in Section 6.2.6.

This microarchitecture can be run in gang-only or slice-only mode (effectively locking the warp size at 32 or 4 respectively).
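The issue priority rule described above, gangs first and independent warps filling the remaining slices, can be sketched as follows (a behavioral simplification with our own data shapes, not the actual arbiter logic):

```python
# Hedged sketch of gang-priority issue selection: gangs claim their slices
# first, in lock-step, and per-slice schedulers fill any slices a gang did
# not claim with independent warps.

def select_issue(ready_gangs, ready_warps_per_slice, max_gangs=2):
    """ready_gangs: list of (gang_id, set_of_slices) in priority order.
    ready_warps_per_slice: dict slice -> best ready independent warp.
    Returns (issued_gang_ids, issued_warps_by_slice)."""
    used_slices = set()
    issued_gangs = []
    for gang_id, slices in ready_gangs:
        if len(issued_gangs) == max_gangs:
            break
        # A gang issues only if every slice it needs is still free,
        # so all of its member warps execute in lock-step this cycle.
        if not (slices & used_slices):
            issued_gangs.append(gang_id)
            used_slices |= slices
    # Independent warps fill in the holes left by the issued gangs.
    issued_warps = {s: w for s, w in ready_warps_per_slice.items()
                    if s not in used_slices}
    return issued_gangs, issued_warps

# Gang 0 spans slices 0-3, gang 1 spans slices 2-7: they conflict on
# slices 2-3, so only gang 0 issues and slices 4-7 fall back to
# independent warps.
gangs = [(0, {0, 1, 2, 3}), (1, {2, 3, 4, 5, 6, 7})]
warps = {s: 100 + s for s in range(8)}
print(select_issue(gangs, warps))
```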
However, our proposed solutions evaluated in Section 6.4 and described in the remainder of Section 6.2 operate by beginning execution in ganged mode. Sections 6.2.4 and 6.2.5 describe how gangs can be split and reformed on a per-gang basis.

6.2.2 Warp Ganging Unit

The goal of the warp ganging unit is to force independent slices to fetch, decode, and execute instructions in lock-step gangs when threads across multiple slices are control-convergent. Several factors motivate such ganging. First, issuing memory accesses from convergent applications without lock-step execution places significantly more pressure on the memory system and degrades performance. Second, the system can amortize front-end energy consumption across more threads when some or all small warps across the slices in an SM are executing the same instruction.

When a kernel begins and thread blocks are assigned to an SM, gangs are created from the thread blocks in the same fashion as 32-wide warps are created in our baseline system. Each gang is statically assigned eight 4-wide warps, one from each slice. Information about which warps are participating in which gang is stored in the gang table (12). Each entry in the gang table contains a GangID, an 8-bit GangMask (indicating which slices are participating in the gang), the current PC of the gang, a valid bit (which is cleared when the gang's instruction buffer entries are empty), and a ReadyMask which indicates which warps in the gang can issue. To simplify the design of the gang unit, warps are not allowed to migrate between gangs. We implemented more complex gang forming and reforming schemes, but saw no significant performance or energy advantage for our workloads. All warps not participating in a gang (unganged warps) are managed independently by their respective slice.
Each slice stores an independent warp mask (13) indicating which of its warps are managed independent of the warp ganging unit.

6.2.3 Gang Table

The gang table tracks all information necessary for scheduling gangs as well as for managing gang splitting and reformation. The baseline SM described in Chapter 2 has a capacity of 1024 schedulable threads organized into 32 warps of 32 threads each. The VWS SM has the same total thread capacity, but organized into a total of 256 warps of 4 threads each, or 32 4-thread warps per slice. At kernel launch, the threads are aggregated into maximally-sized gangs of eight 4-wide warps, or 32 threads per gang to match the baseline architecture. The term original gang is used throughout this paper to describe a gang of warps that is created when a thread block is initially assigned to an SM.

When a gang splits, more entries in the gang table become necessary. Because individual warps are not managed by the warp ganging unit, a gang of 8 warps can split into at most 4 gangs, with a minimum of two warps per gang. Further subdivision yields singleton warps which are managed within each slice. Thus the maximum number of entries needed in the gang table to track the smallest gangs is 128 (32 original gangs × 4). These 128 entries can be organized in a set-associative manner with 32 sets, one set per original gang and four entries representing up to 4 different gang splits.

Each entry in the gang table contains a unique GangID identifier and GangMask that indicates which slices are participating in this gang. Since warps can only be ganged with other members of their original gang, all warps from the same original gang access the same set in the gang table and must have GangIDs that are in the same group. For example, warp 0 in each slice can only be a member of gangs 0–3.
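The set-associative organization described above can be expressed as simple index arithmetic (an illustrative sketch with our own naming; the constants follow the text):

```python
# Illustrative sketch of the gang table organization: 32 sets, one per
# original gang, with four ways each, so splits of original gang g can
# only be assigned GangIDs 4g..4g+3.

NUM_ORIGINAL_GANGS = 32
WAYS_PER_SET = 4

def gang_table_set(original_gang):
    """All warps from the same original gang index the same set."""
    return original_gang % NUM_ORIGINAL_GANGS

def legal_gang_ids(original_gang):
    """GangIDs that a split of this original gang may be assigned."""
    base = original_gang * WAYS_PER_SET
    return range(base, base + WAYS_PER_SET)

# Original gang 0 (the warp-0 gang across the slices) owns GangIDs 0-3.
print(list(legal_gang_ids(0)))  # [0, 1, 2, 3]

# Total entries match the text: 32 original gangs x 4 ways = 128.
print(NUM_ORIGINAL_GANGS * WAYS_PER_SET)  # 128
```

Because a warp can only gang with members of its original gang, a warp's bit position in the GangMask can simply be its slice number, as the text notes next.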
With this organization, each warp's index in the GangMask is simply the warp's slice number.

To perform fetch and issue scheduling, the warp ganging unit requires information from the slices. Specifically, the gang front-end must know the next PC for each gang, and the gang issue scheduler must know when all warps in a gang have cleared the scoreboard. Per-warp call return stack (or reconvergence stack) tracking is done locally in each slice. To track per-gang PCs and handle gang splitting when control flow divergence is encountered, each slice signals the warp ganging unit when the PC at the top of a warp's stack changes (14). Instruction dependence tracking is also done in each slice, even when operating in gang-only mode. Keeping the dependence information local to each slice makes transferring warps from ganged to unganged simpler and decreases the distance scoreboard control signals must travel. The warp ganging unit tracks dependencies for an entire gang in a ReadyMask by receiving scoreboard ready signals from each slice (15).

The gang table also contains a per-entry valid bit to track instruction buffer (I-Buffer) status. The warp gang unit is responsible for both fetching and issuing of gangs. The gang unit front-end stores decoded instructions in each member warp's per-slice I-Buffer. The valid bit is set by the gang fetch scheduler when a gang's per-slice I-Buffer entries are filled and is cleared by the gang issue scheduler when the associated instruction has been issued. All member warps in a gang issue their instructions in lockstep from their respective slice-local I-Buffers. This bit is managed internally by the warp ganging unit and does not require any input from the slices.

6.2.4 Gang Splitting

The warp ganging unit decides when gangs are split and reformed based on a set of heuristics evaluated in Section 6.4. To make splitting decisions, the warp gang unit observes when control flow and memory divergence occurs.
Control flow divergence is detected by observing the PCs sent to the ganging unit by each slice. PCs from the slices undergo a coalescing process similar to global memory accesses. If all warps in a gang access the same PC, no splitting is done. If any warp in the gang accesses a different PC, the gang is split. If more than one warp accesses a common PC, a new gang is formed for these warps. If only one warp accesses a given PC, that warp is removed from the control of the ganging unit and a signal is sent to that warp's slice, transferring scheduling to the local slice. All VWS configurations explored in this work split gangs whenever control flow divergence is detected.

In addition to control flow divergence, memory latency divergence is another motivation for gang splitting. Memory latency divergence can occur when some threads in a warp hit in the data cache while other threads must wait for a long-latency memory operation to complete. Prior work such as Dynamic Warp Subdivision [115] has suggested warp subdivision to tolerate memory latency divergence.

Section 6.4 evaluates VWS architecture configurations that can split gangs when memory latency divergence is observed among member warps. Memory latency divergence is detected when scoreboard ready bits for different warps in a gang are set at different times when completing memory instructions. Tracking which warps in a gang are ready is done through the ReadyMask. We evaluate VWS with two different types of gang splitting on memory divergence. Impatient Splitting is the simplest form of gang splitting on memory divergence. If any warp in a gang sets its ready bit before any other member warps, the gang is completely split; all members participating in the gang become independent warps. Impatient splitting simplifies the splitting process and allows highly memory divergent workloads to begin independent execution as quickly as possible.
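The control flow splitting rules, and the Impatient Splitting policy just described, can be sketched behaviorally as follows (our own simplification; gang state is reduced to per-slice PCs and ready bits):

```python
# Hedged sketch of the gang splitting rules. PCs from the slices are
# "coalesced" like memory addresses: groups of >= 2 warps sharing a PC
# become new gangs, while singletons return to per-slice control.

def split_on_control_flow(gang_pcs):
    """gang_pcs: dict slice -> next PC for the gang's warp on that slice.
    Returns (new_gangs, singleton_slices)."""
    by_pc = {}
    for slc, pc in gang_pcs.items():
        by_pc.setdefault(pc, []).append(slc)
    new_gangs = [slices for slices in by_pc.values() if len(slices) >= 2]
    singletons = [slices[0] for slices in by_pc.values() if len(slices) == 1]
    return new_gangs, singletons

# Slices 0-2 branch to 0x40 while slice 3 takes 0x80: one 3-warp gang
# plus one singleton warp handed back to its slice.
new_gangs, single = split_on_control_flow({0: 0x40, 1: 0x40, 2: 0x40, 3: 0x80})
print(new_gangs, single)  # [[0, 1, 2]] [3]

def impatient_split(ready_bits):
    """Impatient Splitting: if any warp becomes ready before the rest,
    the whole gang dissolves into independent warps."""
    if len(set(ready_bits.values())) > 1:
        return list(ready_bits)  # every member becomes independent
    return []                    # no divergence observed, gang survives

print(impatient_split({0: True, 1: False, 2: False, 3: False}))  # [0, 1, 2, 3]
```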
Group Splitting enables warps that depend on the same memory access to proceed together as a new gang. When more than one warp in a gang has its ready bit set in the same cycle, a new gang is created from those warps. Any singleton warps that result from this process are placed under independent, per-slice control.

6.2.5 Gang Reformation

In addition to splitting gangs, VWS supports the reformation of gangs that have been split. The warp ganging unit decides if warps or gangs from the same original gang should be re-ganged. While we explored numerous policies, two simple but effective choices emerged: (1) opportunistic reformation and (2) no reformation. To simplify the re-ganging hardware, only one gang can be reformed each cycle. To perform opportunistic gang reformation, one original gang is selected each cycle, in round-robin order. The hardware compares the PCs from each of the original gang's new gangs or independent warps, with a worst-case 8-way comparison if the gang has completely split apart. If any groups of two or more of these warps or gangs have the same PC, they are merged. Section 6.4 describes policies to promote more gang reformation by forcing gangs and warps to wait at common control flow post dominator points in the code.

6.2.6 Instruction Supply

To avoid building a machine with 8× the global fetch bandwidth when VWS is operating in completely independent slice mode, the fetch bandwidth of the L1 instruction cache is limited. We evaluated several different L1-I cache bandwidths and determined that with modestly sized L0 I-caches, L1 I-cache bandwidth can be scaled back to two fetches per cycle and achieve most of the performance of allowing 8 fetches per cycle. The global fetch arbiter determines which fetch schedulers access the L1 I-cache's 2 ports on any given cycle. The gang fetch scheduler is always given priority to maximize the number of lanes serviced. The remaining fetch bandwidth is divided among the per-slice warp fetch schedulers.
Individual warps are distributed to the slices in round-robin fashion (warp 0 is assigned to slice 0, warp 1 to slice 1, and so on). An arbitration scheme prioritizes slice requests to ensure that each slice gets fair access to the L1 I-cache.

6.3 Experimental Methodology

The results in this paper are collected using a proprietary, cycle-level timing simulator that models a modern GPU streaming multiprocessor (SM) and memory hierarchy similar to that presented in Chapter 2. The simulator is derived from a product development simulator used to architect contemporary GPUs. Table 6.1 describes the key simulation parameters. The simulator processes instruction traces encoded in NVIDIA's native ISA and generated by a modern NVIDIA compiler. Traces were generated using an execution-driven, functional simulator and include dynamic information such as memory addresses and control flow behavior. We simulate a single SM with 32 SIMT execution lanes that execute 32-wide warps as our baseline, similar to that described in [57]. For warps smaller than 32, we use the same memory system but maintain a fixed count of 32 execution lanes sliced into the appropriate number of groups.

Table 6.1: Variable warp sizing simulator configuration.

  # Streaming Multiprocessors   1
  Execution Model               In-order
  Warp Size                     32
  SIMD Pipeline Width           32
  Shared Memory / SM            48KB
  L1 Data Cache                 64KB, 128B line, 8-way LRU
  L2 Unified Cache              128KB, 128B line, 8-way LRU
  DRAM Bandwidth                32 bytes / core cycle
  Branch Divergence Method      ISA Controlled Call Return Stack
  Warp Issue Scheduler          Greedy-Then-Oldest (GTO) [137]
  Warp Fetch Scheduler          Loose Round-Robin (LRR)
  ALU Latency                   10 cycles
We model a cache hierarchy and memory system similar to contemporary GPUs, with capacity and bandwidth scaled to match the portion available to a single SM.

The trace set presented was selected to encompass a wide variety of behaviors. Traces are drawn from a variety of categories, including High Performance Computing (HPC), games, and professional/consumer compute application domains such as computer vision. A third of the selected traces belong to each of the three categories described earlier: (1) divergent codes that prefer narrow warps, (2) convergent codes that prefer wider warps, and (3) codes that are mostly insensitive to warp size.

6.4 Experimental Results

This section details experimental results for the Variable Warp Sizing microarchitecture. First, we quantify performance for several configurations of VWS and then characterize instruction fetch and decode overhead and the effectiveness of mitigation techniques. We perform several sensitivity studies exploring various design decisions for gang scheduling, splitting, and reforming. We demonstrate how gang membership evolves over time for some sample workloads. Finally, we examine area overheads for the proposed design.

6.4.1 Performance

Figure 6.6 plots the performance of multiple warp sizes and VWS, using different warp ganging techniques. All techniques can issue 32 thread instructions per cycle. Fetch and decode rates are scaled with the base warp size; WS4 and WS32 can fetch and decode eight instructions per cycle and one instruction per cycle, respectively. The VWS configurations use a base warp size of 4 and can fetch up to 8 instructions per cycle from the L1 I-cache. Simulating our ganging techniques with 8 times the L1 I-cache fetch throughput allows us to explore the maximum pressure placed on the global fetch unit without artificially constraining it.
Section 6.4.2 demonstrates that VWS using the L0 I-caches described in Section 6.2 and an L1 I-cache with only 2× the bandwidth achieves 95% of the performance of using 8× the L1 I-cache bandwidth on divergent applications. Warp-size insensitive and convergent applications are insensitive to L1 I-cache bandwidth. We chose the following VWS configurations based on an exploration of the design space detailed in the rest of this section.

WS 32: The baseline architecture described in Chapter 2 with a warp size of 32.

WS 4: The baseline architecture described in Chapter 2 with a warp size of 4.

I-VWS: Inelastic Variable Warp Sizing with a base warp size of 4, where gangs are split only on control flow divergence. Warps are initially grouped together into gangs of 8 warps (32 threads total). Upon control flow divergence, gangs are split based on each warp's control flow path. Once split, they are never recombined. The ganging unit selects up to two gangs to issue each cycle. Slices that do not receive a ganged instruction pick the next available warp from their pool of unganged warps. The ganged scheduler uses a Big-Gang-Then-Oldest (BGTO) scheduling algorithm, where gangs with the most warps are selected first. Gangs with the same number of warps are prioritized in a Greedy-Then-Oldest (GTO) fashion. Per-slice schedulers manage independent warps using a GTO scheduling mechanism.

E-VWS: Elastic Variable Warp Sizing. Warps are split on control flow divergence and combined in an opportunistic fashion when multiple gangs or singleton warps arrive at the same PC on the same cycle. Gangs can only be created from members of an original gang.
A maximum of 2 gangs or warps can be combined per cycle.

E-VWS-ImpatientMem: Warp ganging similar to E-VWS, except that gangs are also split when memory divergence occurs across warps in the same gang. Whenever any memory divergence occurs in a gang, the entire gang is split. Gangs are recombined in the same opportunistic fashion as E-VWS.

Figure 6.6 shows that the I-VWS warp ganging microarchitecture is able to achieve a 35% performance improvement on divergent applications over a static warp size of 32. This improvement is within 3% of using a warp size of 4 on divergent applications and it results in no performance loss on convergent applications, where simply using a warp size of 4 results in a 27% slowdown. This data also demonstrates that splitting gangs on control flow divergence without performing any gang recombining, the simplest solution, provides the best overall performance for these workloads. Adding opportunistic gang recombining (E-VWS in Figure 6.6) actually results in a small performance decrease on divergent applications. This decrease is caused by scheduling and packing problems associated with attempting to issue across more slices at once. When gangs are larger, there is a greater likelihood that multiple gangs need to issue to the same slice on the same cycle.

Figure 6.6: Performance (normalized to WS 32) of large warps, small warps, and different warp ganging techniques.

Figure 6.7: Performance (normalized to WS 32) of I-VWS and WS 4 on all the applications from Figure 6.1a.

Elastically splitting and regrouping makes no performance difference on convergent and warp-size insensitive applications because these applications experience little or no control flow divergence.
Recombining gangs for the divergent workloads makes little performance difference when the hardware has the ability to issue many smaller gangs (or single 4-sized warps) because remaining unganged is unlikely to result in a loss of utilization. Having the ability to concurrently issue multiple paths at the slice granularity makes control flow reconvergence less performance critical than when only one path can be executed concurrently.

Figure 6.6 also quantifies the effect of splitting gangs on memory divergence (E-VWS-ImpatientMem). Reducing the effect of memory divergence helps some of the divergent applications like Lighting, ObjClassifier, and Raytracing and provides a modest 2% performance increase over I-VWS on the divergent applications. However, allowing gangs to split based on memory divergence results in significant performance degradation on Game 1, Game 2, and Radix Sort in the convergent application suite, resulting in an average slowdown of 22% on the convergent applications. Like 4-sized warps, this loss in performance can be attributed to memory convergence slip. Formerly coalesced accesses become uncoalesced and create excessive pressure on the memory system, causing unnecessary stalls.

Figure 6.7 shows the performance of all 165 applications. The figure demonstrates that the ganging techniques used in I-VWS are effective for all the applications studied. I-VWS tracks warp size 4 performance on the divergent applications and eliminates warp size 4 slowdown on the convergent applications at the left side of the graph.

6.4.2 Front-end Pressure

Figure 6.8 plots the fetch pressure of several warp sizes and ganging configurations. For the divergent applications, I-VWS results in 57% fewer fetch/decode operations required each cycle versus a warp size of 4. This reduction in fetch/decode represents a significant energy savings while providing almost all of the performance of 4-sized warps.
By opportunistically recombining gangs for divergent applications, E-VWS requires a further 55% less fetch/decode bandwidth than I-VWS, at the cost of some performance. On divergent applications, E-VWS-ImpatientMem increases fetch/decode pressure versus E-VWS but not more than I-VWS.

Figure 6.8: Average fetches per cycle with different warp sizes and ganging techniques.

On the convergent and warp-size insensitive applications, the ganging configurations that do not split on memory divergence show fetch pressure equal to that of warp size 32. Because these applications lack control flow divergence, gangs rarely split and I-VWS operates exclusively in ganged mode. However, when gangs are split on memory divergence, the skewing of memory access returns causes a significant increase in the number of fetch/decodes per cycle.

Figure 6.9 plots the performance of I-VWS at different L1 I-cache bandwidths and L0 I-cache sizes. Because the divergent applications traverse multiple independent control flow paths, restricting L1 I-cache bandwidth results in a significant performance loss. However, the inclusion of per-slice L0 I-caches, which are probed first when independent warps fetch instructions, vastly decreases the performance loss. With only 2× the L1 I-cache bandwidth of the baseline architecture, the addition of small 256B L0s is able to cover most of the bandwidth deficiency at the L1. Since they remain in ganged operation, the warp-size insensitive and convergent applications are insensitive to L1 I-cache fetch bandwidth.

Figure 6.9: Average performance of I-VWS at different L1 I-cache bandwidths and L0 I-cache sizes. Normalized to I-VWS with 8× L1-I cache bandwidth.

6.4.3 Gang Scheduling Policies

We measured the sensitivity of performance and instruction fetch bandwidth to several different gang scheduling policies. All gang schedulers attempt to issue up to two gangs per cycle, and local per-slice schedulers attempt to issue on any remaining idle slices. We examine the following policies:

I-VWS: As described in Section 6.4.1, the gang issue scheduler prioritizes the largest gangs first, Big-Gangs-Then-Oldest (BGTO), and per-slice schedulers are Greedy-Then-Oldest (GTO).

I-VWS-GTO: Similar to I-VWS, except the gang issue scheduler uses a greedy-then-oldest policy.

I-VWS-LRR: Similar to I-VWS, except both the gang issue scheduler and per-slice schedulers use a Loose-Round-Robin (LRR) scheduling policy.

I-VWS-LPC: Similar to I-VWS, except both the gang issue scheduler and per-slice schedulers prioritize gangs/warps with the lowest PC first.

I-VWS-LGTO: Similar to I-VWS, except the gang issue scheduler prioritizes gangs with the fewest warps first, Little-Gangs-Then-Oldest (LGTO). Per-slice schedulers use a GTO policy.

Figure 6.10: Performance (normalized to WS 32) of warp ganging with different schedulers.

Figure 6.10 shows that the performance of the divergent applications is sensitive to the gang scheduler choice. The lowest-PC-first configuration results in a universal performance reduction across all the applications.
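The gang priority orderings compared above can be sketched as sort keys over the set of ready gangs (a hypothetical simplification; the gang records and their fields are our own):

```python
# Sketch of the gang scheduling priorities expressed as sort keys.
# Each gang record carries its warp count ("size"), a creation timestamp
# ("age", smaller = older), and its next PC; smaller keys sort first.

def bgto_key(gang):
    """Big-Gangs-Then-Oldest: most warps first, oldest breaks ties."""
    return (-gang["size"], gang["age"])

def lgto_key(gang):
    """Little-Gangs-Then-Oldest: fewest warps first."""
    return (gang["size"], gang["age"])

def lpc_key(gang):
    """Lowest-PC-first."""
    return gang["pc"]

ready = [
    {"id": "A", "size": 8, "age": 3, "pc": 0x200},
    {"id": "B", "size": 2, "age": 1, "pc": 0x100},
    {"id": "C", "size": 8, "age": 1, "pc": 0x300},
]

# BGTO picks the 8-warp gangs first (oldest among equals), so the small
# gang B cannot occupy one of the two issue slots ahead of them.
print([g["id"] for g in sorted(ready, key=bgto_key)])  # ['C', 'A', 'B']
print([g["id"] for g in sorted(ready, key=lgto_key)])  # ['B', 'C', 'A']
```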
Little-Gangs-Then-Oldest (I-VWS-LGTO) creates a scheduling pathology on the divergent applications. Prioritizing the smallest gangs first is bad for performance because the gang issue scheduler can select a maximum of only 2 gangs for execution each cycle; giving the smallest ones priority can limit utilization by delaying the execution of gangs with many warps. We also observed that prioritizing little gangs was detrimental even when more than two gangs could be scheduled per cycle, because little gangs block the execution of larger gangs. Figure 6.11 shows the resulting fetch and decode requirements for different gang scheduling policies. Although the choice of gang scheduler has a significant effect on performance, it has little effect on fetch/decode bandwidth. This insensitivity occurs because gang scheduling is independent of gang splitting when gangs are split only on control flow divergence and are never recombined. When splitting gangs on memory divergence is enabled, the effect of scheduling on the fetch rate is much greater.

Figure 6.11: Average fetches per cycle with different schedulers.

Figure 6.12 plots performance when the number of gangs selectable per cycle by the gang issue scheduler is set to one, two, or unlimited (up to four). This data shows that limiting the gang scheduler to a single gang per cycle reduces the performance of the divergent applications by 10% versus the baseline of two gangs per cycle. Allowing the gang scheduler to pick unlimited gangs per cycle results in performance that is within 1% of two gangs per cycle.
Any slices not consumed by the gang scheduler may be used whenever possible by any singleton warps managed by local slice schedulers. We choose to limit the gang scheduler to two gangs per cycle to balance performance and scheduler complexity.

Figure 6.12: Performance (normalized to WS 32) when the number of gangs able to issue each cycle is changed.

6.4.4 Gang Reformation Policies

Figures 6.13 and 6.14 plot performance and instruction fetches per cycle when the following policies are used to reform gangs after they have been split:

E-VWS: As described in Section 6.4.1, gangs are reformed on an opportunistic basis only.

E-VWS-Sync<XX>: Similar to E-VWS, except that when warps reach a compiler-inserted call-return stack sync instruction, they wait for recombination. These instructions are already inserted, typically at basic block post-dominators, to enforce the NVIDIA call-return stack architecture.

Figure 6.13: Performance (normalized to WS 32) of elastic gang reformation techniques.

Figure 6.14: Average fetches per cycle with different gang reformation techniques.
<XX> indicates how many cycles a warp will wait at the sync point for potential reganging.

Forcing warps to wait at control flow post-dominator points can potentially improve gang reformation, leading to more or larger gangs and reduced front-end energy, while hopefully resulting in minimal performance degradation. Figures 6.13 and 6.14 demonstrate that waiting at sync points results in a performance loss on our divergent applications. We see minimal decrease in the number of fetches per cycle as waiting time is increased, and any energy efficiency gained from this reduction would be more than offset by the loss in performance. The warp-size insensitive and convergent applications contain fewer compiler-inserted sync points, experience little or no control flow divergence, and may spend much or all of their execution time fully ganged. As a result, their performance is largely unaffected by wait time at infrequent sync points. Thus we conclude that forcing warps to wait at post-dominators provides little to no benefit; most of the reduction in fetch pressure is captured by opportunistic reganging in E-VWS.

6.4.5 Gang Splitting Policies

Figure 6.15 explores the use of the following gang splitting policies without any gang reformation:

I-VWS: As described in Section 6.4.1. Warps are split only on control flow divergence.

I-VWS-GroupMem: Warp ganging similar to I-VWS, except gangs are also split on memory divergence. As memory results return for a gang, all warps in the gang that are able to proceed based on the newly returned value form a new gang. Gangs are never recombined.

I-VWS-ImpatientMem: Warp ganging similar to I-VWS-GroupMem, except gangs that experience any memory divergence are completely subdivided into individual warps.

As in Section 6.4.1, Figure 6.15 demonstrates that splitting on memory latency divergence can have a small performance advantage on some divergent applications, but has a large performance cost on convergent ones.
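The two memory-splitting variants above differ only in how a diverged gang's warps are regrouped when their memory results return. A minimal sketch of the two behaviors (the representation of warps and return times here is an assumption made for illustration, not the hardware mechanism):

```python
from collections import defaultdict

def split_group_mem(gang, ready_cycle):
    """I-VWS-GroupMem: warps whose memory results return together
    proceed together as a new, smaller gang."""
    groups = defaultdict(list)
    for warp in gang:
        groups[ready_cycle[warp]].append(warp)
    return list(groups.values())

def split_impatient_mem(gang, ready_cycle):
    """I-VWS-ImpatientMem: any memory divergence fully subdivides
    the gang into singleton warps."""
    if len({ready_cycle[w] for w in gang}) > 1:
        return [[w] for w in gang]
    return [list(gang)]
```

For a 4-warp gang whose loads return at cycles 100, 100, 140 and 140, GroupMem yields two 2-warp gangs while ImpatientMem yields four singletons; the latter preserves more scheduling freedom but at a higher fetch/decode cost.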
Minimizing the amount of splitting that occurs on memory divergence (I-VWS-GroupMem) gains back some of the performance lost for Game 2 but creates problems in Radix Sort. Overall, splitting on memory divergence is a net performance loss due to its negative effect on convergent applications.

Figure 6.15: Performance (normalized to WS 32) of different gang splitting policies.

Figure 6.16 plots the resulting number of instructions fetched per cycle when different gang splitting policies are used. This data demonstrates that even though splitting on memory divergence may be a small performance win for divergent applications, the number of instructions fetched increases greatly as a result, by 41% and 69% for I-VWS-GroupMem and I-VWS-ImpatientMem, respectively.

Figure 6.16: Average fetches per cycle using different gang splitting policies.

6.4.6 Gang Size Distribution

Figure 6.17: Gang sizes versus time for I-VWS.

Figure 6.17 visualizes how gang sizes change over time for two example divergent workloads, GamePhysics and Lighting. Each warp assigned to the SM on any given cycle is classified according to the size of the gang to which it belongs. For example, in the Lighting application, execution begins with 120 4-wide warps assigned to the SM. The black bar at cycle 0 indicates that all warps start out in their original gangs of size 8.
As time progresses, the original gangs split apart into smaller gangs until eventually most warps in the SM are executing independently. In contrast, GamePhysics exhibits much more structured divergence. The SM begins execution with 300 warps, all in their original gangs. Over time, the warps split in two (the grey color in the GamePhysics graph represents warps participating in a gang of 4). One half of the gang exits, while the other half continues executing in lock step. These two plots illustrate how I-VWS reacts to different kinds of divergence. Most of the divergent workloads studied react similarly to Lighting. Similar plots for E-VWS show gangs splitting and reforming as time progresses. The plots collected for the convergent applications show that warps stay in their original gang throughout execution.

6.4.7 Area Overheads

Table 6.2 presents an estimate of the area required to implement I-VWS in a 40nm process. Column two presents the raw area estimate for each I-VWS component, while columns three and four present a rolled-up incremental SM area increase for 4-wide and 8-wide warps, respectively. We model the L1 I-cache using CACTI [171] at 40nm. The L0 I-cache, decoded I-Buffers, and the gang table are small but dominated by the storage cells required to implement them. We estimate the area of these structures by using the area of a latch cell from the NanGate 45nm Open Cell library and scaling it to 40nm.
Table 6.2: Area overhead estimates.

Component                          Component Area   Additional SM Area
                                                    4-wide warps   8-wide warps
Single-ported L1 I-cache (64KB)    0.087mm2         -              -
Dual-ported L1 I-cache (64KB)      0.194mm2         0.108mm2       0.108mm2
L0 I-cache (256B)                  0.006mm2         0.052mm2       0.026mm2
Decoded I-Buffers (4Kbits)         0.013mm2         0.103mm2       0.052mm2
Gang Table (128 entries)           0.026mm2         0.026mm2       0.026mm2
Scoreboard (1800 bits)             0.019mm2         0.154mm2       0.077mm2
Additional control                 0.160mm2         1.280mm2       0.640mm2
Total SM area increase                              1.7mm2         0.93mm2
Percent SM area increase                            11%            6%
Total GPU area increase                             25.8mm2        13.9mm2
Percent GPU area increase                           5%             2.5%

We multiply the resulting cell area (2.1µm2) by the number of bits and by a factor of 1.5 to account for area overheads including control logic. For the per-slice scoreboards, we use a larger flip-flop cell (3.6µm2 scaled to 40nm) from the NanGate library and a 3× area overhead factor to account for the comparators necessary for an associative lookup. Compared to the scoreboard described in [48], ours has fewer bits and noticeably less area. Finally, to estimate the area cost of the additional control logic required for slicing the SIMD datapath, we examine published literature on the percentage of total core area other processors devote to control [3, 23, 109, 110]. Based on these studies and the high datapath densities found in GPUs, we estimate that 1% of the Fermi SM area (16mm2) is devoted to the datapath control logic that needs to be replicated in each slice. In total, we estimate that I-VWS adds approximately 11% and 6% to the area of an SM for 4-wide and 8-wide warps, respectively.
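The storage-dominated entries in Table 6.2 follow directly from this arithmetic. The sketch below simply reproduces the stated calculation as a sanity check (cell areas and overhead factors are the ones quoted above; this is not the full area model):

```python
LATCH_UM2 = 2.1      # NanGate latch cell area, scaled to 40nm (um^2)
FLIPFLOP_UM2 = 3.6   # larger flip-flop cell used for the scoreboard (um^2)

def storage_area_mm2(bits, cell_um2, overhead):
    """Area = bits * cell area * overhead factor, converted to mm^2."""
    return bits * cell_um2 * overhead / 1e6

# Decoded I-Buffers: 4Kbits of latch cells with a 1.5x overhead factor.
ibuf_mm2 = storage_area_mm2(4 * 1024, LATCH_UM2, 1.5)   # ~0.013 mm^2

# L0 I-cache: 256B = 2048 bits of latch cells.
l0_mm2 = storage_area_mm2(256 * 8, LATCH_UM2, 1.5)      # ~0.006 mm^2

# Scoreboard: 1800 flip-flop bits with a 3x factor for CAM comparators.
sb_mm2 = storage_area_mm2(1800, FLIPFLOP_UM2, 3.0)      # ~0.019 mm^2
```

Each value rounds to the corresponding Table 6.2 entry; the remaining rows (L1 I-cache, additional control) come from CACTI and the published control-area estimates rather than this cell-counting method.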
For a Fermi-sized 15 SM GPU (529mm2), I-VWS adds approximately 5% and 2.5% more area for 4-wide and 8-wide warps, respectively.

6.5 Comparison to Previous Work

This section first presents a quantitative comparison of I-VWS against two previously proposed techniques which address the effect of branch divergence on GPUs. It then presents a qualitative characterization of our work and prior art in the branch and memory divergence mitigation design space.

6.5.1 Quantitative Comparison

We compare I-VWS to two previously proposed divergence mitigation techniques: Thread Block Compaction (TBC) [53] and Dynamic Warp Formation (DWF) [55]. This data was collected using TBC's published simulation infrastructure [54], which is based on GPGPU-Sim 2.x [19]. We run TBC and DWF with the exact configuration specified in [54] and implement I-VWS with a warp size of 4 on top of their infrastructure. Figure 6.18 plots the result of this study on 5 divergent applications taken from the TBC simulation studies: raytracing (NVRT) [13], face detection (FCDT) [110], breadth first search (BFS) [36], merge sort (MGST) [36], and nearest neighbor (NNC) [36]. NVRT does not run when using DWF [53]. We chose these applications because their divergence behavior highlights the advantages of I-VWS. We also ran I-VWS on the rest of the applications in the TBC paper and observed little performance difference. On the five applications, I-VWS achieves an average 34% and 15% performance improvement over the baseline 32-wide warp and TBC, respectively. The increased issue bandwidth and scheduling flexibility of I-VWS enables divergent applications to better utilize the GPU.

6.5.2 Qualitative Comparison

Table 6.3 presents a qualitative characterization of previously proposed divergence mitigation techniques versus small warps and I-VWS.
We classify previous work into three categories: (1) techniques that dynamically form or compact warps to regain SIMD efficiency [53, 55, 123, 132, 134]; (2) techniques that subdivide warps in the presence of memory and control flow divergence to expose more thread level parallelism [115, 156]; and (3) techniques that allow the interleaved execution of multiple branch paths within a warp by scheduling multiple paths from the control flow stack [48, 133].

Figure 6.18: Performance (normalized to the 32-wide warp baseline) using the released TBC infrastructure [54].

The fundamental characteristic that sets I-VWS and the use of small warps apart from prior work is the ability to concurrently issue many more unique PCs by scaling and distributing instruction fetch bandwidth. Additionally, I-VWS can seamlessly adapt to memory latency divergence by breaking gangs. Dynamic Warp Subdivision (DWS) is also able to break lock-step warp execution on memory divergence [115]. However, DWS does this with a loss of SIMD efficiency and requires additional entries to be added to a centralized warp scheduler for each subdivided warp. In contrast, I-VWS allows smaller warps to continue independently, without losing SIMD efficiency; management of these smaller warps is distributed among many smaller scheduling entities.

While formation and compaction techniques increase SIMD efficiency, they pay for this increase with a reduction in available thread level parallelism, since forming and compacting warps decreases the number of schedulable entities. This decrease in TLP degrades the SM's ability to tolerate latencies.
Conversely, subdivision and multipath stack techniques can increase latency tolerance but are limited by the number of scheduling entities in a monolithic warp scheduler and the number of concurrent entries on the call-return stack.

All of the techniques perform well on convergent applications, with the exception of small warps, which suffer from memory convergence slip. We consider two classes of divergent applications: those that have a limited number of unique control flow (CF) paths and those that have many unique control flow paths. All of the proposed techniques perform well on applications that have a limited number of control flow paths. However, only smaller warps and I-VWS can maintain good performance when the number of unique control flow paths is high. Compaction and formation techniques require candidate threads to be executing the same instruction (PC). Subdivision and multipath stack techniques increase the number of schedulable entities in the SM, but do not improve lane utilization and become limited by the number of entries in a large, centralized structure when the number of unique control flow paths is large.

Table 6.3: Qualitative characterization of divergence mitigation techniques.

Characteristic                Form/Compact    Subdivide   Multipath   Small    I-VWS
                              [53, 55, 123]   [115]       [48, 133]   Warps
# PCs per Cycle               1               1           1           Many     Many
Mem. Divergence Adaptive      No              Yes         No          Yes      Yes
Latency Tolerance             Low             Limited     Limited     High     High
Performance
  Convergent apps             High            High        High        Low      High
  Divergent apps, limited CF  High            High        High        High     High
  Divergent apps, many CF     Limited         Limited     Limited     High     High
Energy Efficiency
  Convergent apps             High            High        High        Low      High
  Divergent apps, limited CF  High            High        High        Medium   Medium
  Divergent apps, many CF     Low             Low         Low         High     High
Energy efficiency largely follows performance. On divergent applications with limited control flow paths, small warps and I-VWS lose some energy efficiency by fetching and decoding multiple times from smaller distributed fetch/decode structures, while prior work fetches one instruction from a larger structure. However, on divergent applications with many control flow paths, prior work inefficiently fetches one instruction repeatedly from a larger structure, while small warps and I-VWS distribute this repeated fetching over smaller structures. The smaller warps in I-VWS make it the only technique which improves both SIMD efficiency and thread level parallelism, while still exploiting the performance and energy efficiencies presented by convergent and warp-size insensitive code.

6.6 Summary

This chapter explores the design space of a GPU SM with the capability to natively issue from multiple execution paths in a single cycle. Our exploration concludes that convergent applications require threads to issue in lock-step to avoid detrimental memory system effects. We also find that the ability to execute many control flow paths at once vastly decreases a divergent application's sensitivity to reconvergence techniques.

We propose Variable Warp Sizing (VWS), which takes advantage of the many control flow paths in divergent applications to improve performance by 35% over a 32-wide machine at an estimated 5% area cost when using 4-wide warps. An 8-wide design point provides most of that performance benefit, while increasing area by only 2.5%. VWS evolves GPUs into a more approachable target for irregular applications by providing the TLP and SIMD efficiency benefits of small warps, while exploiting the regularity in many existing GPU applications to improve performance and energy efficiency.

Chapter 7

Related Work

This chapter summarizes and contrasts the work done in this dissertation against related work in thread scheduling, locality management and GPU divergence mitigation.
Section 7.1 discusses work relating to Cache-Conscious Warp Scheduling and Divergence-Aware Warp Scheduling. Section 7.2 details work related to A Variable Warp-Sized Architecture, covering both the exploitation of horizontal locality and the mitigation of branch and memory divergence in GPUs and vector processors.

7.1 Work Related to Cache-Conscious Warp Scheduling and Divergence-Aware Warp Scheduling

This section is subdivided into six subsections that classify work related to CCWS and DAWS: throttling to improve performance, GPU thread scheduling techniques, GPU caching, CPU thread scheduling techniques, cache capacity management and, finally, locality detection.

7.1.1 Throttling to Improve Performance

Bakhoda et al. [19] present data for several GPU configurations, each with a different maximum number of CTAs that can be concurrently assigned to a core. They observe that some workloads perform better when fewer CTAs are scheduled concurrently. The data they present is for a GPU without an L1 data cache, running a round-robin warp scheduling algorithm. They conclude that this increase in performance occurs because scheduling fewer concurrent CTAs on the GPU reduces contention for the interconnection network and DRAM memory system. In contrast, the goal of CCWS and DAWS is to use L1 data cache feedback to preserve locality. Our techniques focus on fine-grained, issue-level warp scheduling, not coarse-grained CTA assignment.

Guz et al. [59] use an analytical model to quantify the "performance valley" that exists when the number of threads sharing a cache is increased. They show that increasing the thread count increases performance until the aggregate working set no longer fits in cache. Increasing threads beyond this point degrades performance until enough threads are present to hide the system's memory latency.
In effect, CCWS and DAWS dynamically detect when a workload has entered the machine's performance valley and scale back the number of threads sharing the cache to compensate.

Cheng et al. [40] introduce a thread throttling mechanism to reduce memory latency in multithreaded CPU systems. They propose an analytical model and memory task throttling mechanism to limit thread interference in the memory stage. Their model relies on a stream programming language which decomposes applications into separate tasks for computation and memory, and their technique schedules tasks at this granularity.

Ebrahimi et al. [47] examine the effect of disjointed resource allocation between the various components of a chip-multiprocessor system, in particular in the cache hierarchy and memory controller. They observed that uncoordinated fairness-based decisions made by disconnected components could result in a loss of both performance and fairness. Their proposed technique seeks to increase performance and improve fairness in the memory system by throttling the memory accesses generated by CMP cores. This throttling is accomplished by capping the number of MSHR entries that can be used and constraining the rate at which requests in the MSHR are issued to the L2.

Prior work in networking, such as that by Thottethodi et al. [158], Baydal et al. [24], Lopez et al. [106, 107] and Scott and Sohi [145], uses feedback to generate estimations of network congestion and dynamically tune network injection rates to improve performance. CCWS and DAWS utilize source throttling at the instruction issue stage and optimize for cache capacity, not network bandwidth. An interesting
An interesting112extension to CCWS and DAWS would be to examine their effect on the GPU’snetwork congestion and if there are network injection control techniques that mightbe a better fit at the instruction issue stage as opposed to the network injection stage.7.1.2 GPU Thread Scheduling TechniquesLakshminarayana and Kim [91] explore numerous warp scheduling policies in thecontext of a GPU without hardware managed caches and show that, for applica-tions that execute symmetric (balanced) dynamic instruction counts per warp, afairness based warp and DRAM access scheduling policy improves performance.In contrast to our work, their study did not explore scheduling policies that improveperformance by improving cache hit rates.Fung et al. [55] explore the impact of warp scheduling policy on the effec-tiveness of their Dynamic Warp Formation (DWF) technique. DWF attempts tomitigate control flow divergence by dynamically creating new warps when scalarthreads in the same warp take different paths on a branch instruction. They pro-pose five schedulers and evaluate their effect on DWF. Fung and Aamodt [52] alsopropose three thread block prioritization mechanisms to compliment their ThreadBlock Compaction (TBC) technique. The prioritization mechanisms attempt toschedule threads within the same CTA together. Their approach is similar to con-current work on two-level scheduling proposed by Narasiman et al. [123], exceptthread blocks are scheduled together instead of fetch groups. In contrast to boththese works, CCWS and DAWS explore the impact of scheduling on cache localityusing existing control flow divergence mitigation techniques.Gebhart and Johnson et al. [57] introduce the use of a two-level scheduler toimprove energy efficiency. Experiments run using their exact specification yieldedmixed results. They note that the performance of their workloads increases lessthan 10% if a perfect cache is used instead of no cache at all. 
For this reason, they run all their simulations with a constant 400 cycle latency to global memory. As a result, their scheme switches warps out of the active pool whenever a compiler-identified global or texture memory dependency is encountered. We find that obeying this constraint causes performance degradation because it does not take cache hits into account. However, if this demotion to the inactive pool is changed to just those operations causing a stall (i.e., those missing in the cache), its operation is similar to concurrent work on two-level scheduling done by Narasiman et al. [123], which is evaluated in Section 3.4.

The work done by Narasiman et al. [123] focuses on improving performance by allowing groups of threads to reach the same long latency operation at different times. This helps ensure cache and row-buffer locality within a fetch group is maintained, and the system is able to hide long latency operations by switching between fetch groups. In contrast, our work focuses on improving performance by adaptively limiting the amount of multithreading the system can maintain based on how much intra-warp locality is being lost.

Meng et al. [116] introduce Dynamic Warp Subdivision (DWS), which splits warps when some lanes hit in cache and some lanes do not. This scheme allows individual scalar threads that hit in cache to make progress even if some of their warp peers miss. DWS improves performance by allowing run-ahead threads to initiate their misses earlier and creates a pre-fetching effect for those left behind. DWS attempts to improve intra-warp locality by increasing the rate at which data is loaded into the cache. In contrast, CCWS and DAWS attempt to load data from fewer threads at the same time to reduce thrashing.

Since the publication of CCWS, Jog et al. [78] and Kayiran et al. [83] have proposed locality-aware thread block schedulers that seek to limit the number of thread blocks sharing the L1D cache. Their techniques apply warp limiting at a coarse grain.
CCWS and DAWS seek to maximize cache usage using runtime cache feedback, fine-grain divergence information and code region characterization. Lee et al. [94] and Jog et al. [79] explore prefetching on the GPU, with the latter focusing on prefetching-aware scheduling. In contrast to prefetching, which cannot improve performance in bandwidth-limited applications, CCWS and DAWS make more effective use of on-chip storage to reduce memory bandwidth.

Sethia et al. [148] introduce Mascar, which attempts to better overlap computation with memory accesses in memory-intensive workloads. Mascar consists of a memory-aware warp scheduler that prioritizes the execution of a single warp when MSHR and L1 miss queue entries on the chip are oversubscribed. This prioritization helps improve performance even when workloads do not contain data locality by enabling warps executing on the in-order core to reach their computation operations faster, enabling overlap of the prioritized warp's computation with other warps' memory accesses. Mascar also introduces a cache access re-execution mechanism to help avoid L1 data cache thrashing by enabling hits-under-misses when warps with data in the cache are blocked from issuing because low-locality accesses are stalling the memory pipeline.

Ausavarungnirun et al. [17] propose a series of improvements at the shared L2 and memory controller that mitigate memory latency divergence in irregular GPU applications. The techniques, collectively named Memory Divergence Correction (MeDiC), exploit the observation that there is heterogeneity in the level of memory latency divergence across warps in the same kernel. Based on how they interact with the shared L2 cache, each warp in a kernel can be characterized as all/mostly hit, all/mostly miss or balanced. The authors demonstrate that there is little benefit in having warps that are not all-hit, since warps that mostly hit must wait for the slowest access to return before they are able to proceed.
They also demonstrate that queueing latency at the L2 cache can have a non-trivial performance impact on all-hit warps and that this effect can be mitigated by bypassing the L2 cache for all the requests made by any warp that is not all-hit. This decreases the access latency for all-hit warps by reducing queueing delay. In addition to the adaptive bypassing technique, they propose modifications to the cache replacement policy and the memory controller scheduler in an attempt to minimize latency for warps detected to be all-hit warps.

7.1.3 GPU Caching

Work by Lee et al. [94] has explored the use of prefetching on GPUs. However, prefetching cannot improve performance when an application is bandwidth limited, whereas CCWS and DAWS can help in such cases by reducing off-chip traffic.

Jia et al. [75] characterize GPU L1 cache locality in a current NVIDIA Tesla GPU and present a compile-time algorithm to determine which loads should be cached by the L1D. In contrast to our work, which focuses on locality between different dynamic load instructions, their algorithm and taxonomy focus on locality across different threads in a single static instruction. Moreover, since their analysis is done at compile time, they are unable to capture any locality with input data dependence.

Published after CCWS and DAWS, Li et al.
[105] make the observation that, in the interest of improving L1 cache efficiency, CCWS and DAWS potentially leave other resources such as L2 capacity and memory bandwidth underutilized. To mitigate these effects, they propose a token-based cache line allocation scheme to determine which warps are permitted to allocate lines in the L1, allowing other warps to bypass the L1 cache and consume the memory bandwidth and L2 cache capacity made available by thread throttling schemes such as CCWS and DAWS.

7.1.4 CPU Thread Scheduling Techniques

Prior simulation and analytical modeling work on CPU multiprocessors has investigated degraded cache and network performance due to multithreading and multiprocessing [10, 28, 143, 168]. GPU work examining the effect of massive multithreading on system contention, like CCWS and DAWS, evaluates multithreading at a scale previously not imagined by CPU efforts. Both the baseline hardware and the workloads examined on these machines are fundamentally different. The sheer number of threads in modern massively multithreaded processors opens up a new design space in the way these threads are scheduled. As a result of this multithreading, the latency tolerance of GPUs adds an additional dimension to the optimization problem. GPU designs also have to account for thread aggregation in the form of warps. Finally, the homogeneity of the exposed parallelism in GPU languages like CUDA and OpenCL presents new opportunities for hardware optimizations.

Thekkath and Eggers [157] examine the effectiveness of multiple hardware contexts on multithreaded CPU and CMP designs. They find that increasing the number of contexts on a single CPU core has a limited benefit, since cache conflict misses and network contention both increase as the number of hardware contexts increases. They also demonstrate that the performance of less optimized code degrades quickly as the number of hardware contexts increases. They determine that this is primarily due to cache contention. This finding that less-optimized
This finding that less-optimizedapplications perform best with relatively few contexts is consistent with our obser-vation that less optimized GPU code benefits from constrained multithreading.116OS level thread scheduling has been studied as a way to increase cache perfor-mance in uniprocessor systems [126] by dynamically launching threads that exe-cute independent iterations of loops in sequential code. There is a body work study-ing the effects of fine grained and simultaneous multithreading (SMT) [160, 161] inCPUs. SMT differs from FGMT in that it allows instructions from multiple threadsto issue on the same cycle (as opposed to on adjacent cycles) on a superscalarCPU. Work on SMT scheduling has focused on thread scheduling at the coarsegrained OS level that use runtime monitors to determine which threads should beco-scheduled on the same core. Work on symbiotic SMT job scheduling [150, 151]attempts to identify the affinity of threads in a SMT machine and ensure that threadsthat perform well together are assigned to the same core.Suleman et al. [155] examine a feedback-driven technique to reduce multi-threading when data synchronization and off-chip memory bandwidth become per-formance limiting factors. Their technique seeks to reduce both the execution timeand energy consumed in CMPs by throttling multithreading in the threading sys-tem software based on an analytical model constructed from available hardwareperformance counters.Concurrent to our work, Jaleel et al. [74] propose the CRUISE scheme whichuses LLC utility information to make high level scheduling decisions in multi-programmed CMPs. Our work focuses on the first level cache in a massively multi-threaded environment and is applied at a much finer grain. 
Scheduling decisions made by CRUISE tie programs to cores, whereas CCWS and DAWS make issue-level decisions on which bundle of threads should enter the execution pipeline next.

7.1.5 Cache Capacity Management

There is a body of work attempting to increase cache hit rate by improving the replacement or insertion policy [20, 35, 73, 77, 114, 130, 172]. All of these attempt to exploit different heuristics of program behavior to predict a block's re-reference interval and mirror the Belady-optimal [26] policy as closely as possible. While CCWS and DAWS also attempt to maximize cache efficiency, they do so by shortening the re-reference interval rather than by predicting it. CCWS and DAWS balance the shortening of the re-reference interval, achieved by limiting the number of eligible warps, against maintaining sufficient multithreading to cover most of the memory and operation latencies. Further multithreaded CPU work attempts to manage interference among heterogeneous workloads [72, 129], whereas each thread in our workload has roughly similar characteristics.

Software-based tiling techniques, accomplished by the programmer or compiler [9, 15, 42, 56, 92], have been shown to efficiently exploit locality in both uniprocessor and multiprocessor systems. The runtime scheduling performed by CCWS and DAWS could be considered a dynamic form of tiling a multithreaded problem by constraining which subset of the problem is operated on at any given time.

There are several works evaluating cache-conscious data placement [33, 41] as a means of increasing cache hit rate. Our work on CCWS and DAWS does not change the placement of data in memory but instead focuses on exploiting locality that already exists in the data structures used by the program.

Agrawal et al. [12] present theoretical cache miss limits when scheduling streaming applications represented as directed graphs on uniprocessors.
Their work shows that scheduling the graph by selecting partitions comes within a constant factor of the optimal scheduler when heuristics such as working set and data usage rates are known in advance.

7.1.6 Locality Detection

Pomerene et al. [128] make use of a “shadow directory” to improve cache prefetching by storing address pairs. They store a parent address along with a descendant address, where an access to the descendant was observed after the last access to the parent. They then fetch data for the descendant once an access to the parent is re-observed. By using the shadow directory, they can track information for more parent blocks than can be stored in the cache.

Johnson and Hwu [81] employ a locality detection mechanism to characterize large regions, or “macroblocks”, of memory based on per-macroblock access counters. The values of these access counters are used to influence the cache management policy and help prevent cache thrashing and pollution by prioritizing the storage of more frequently accessed macroblocks.

A number of previous works in CPU caching have used spatial locality detection to improve the utilization of data fetched from main memory and avoid wasting cache space on data that is never accessed [37, 89, 131]. These techniques to detect spatial locality are orthogonal to our locality detection technique, which focuses on temporal cache line reuse. An interesting extension to CCWS and DAWS would be to examine how these techniques to capture spatial locality interact with our scheduling techniques.

Beckmann et al. [25] use victim tag information to detect locality lost due to excessive replication in the cache hierarchy and adapt the replication level accordingly. The lost locality detector in CCWS and DAWS differs from their technique in that it subdivides the victim tag array by warp ID and makes use of this information to influence thread scheduling.

Seshadri et al.
[146] also make use of cache replacement victim addresses in their Evicted Address Filter (EAF) to detect blocks that have high locality. The EAF uses a periodically cleared Bloom filter to classify the reuse characteristics of individual blocks in a cache. They capitalize on the observation that blocks with high reuse which are evicted from the cache prematurely, due to cache pollution or thrashing, are typically reused very soon after their eviction, and they use this information to inform their cache insertion policy. It would be an interesting extension to CCWS and DAWS to replace the victim tag array with the Bloom filter used in the EAF to cut down on the hardware cost and energy consumption associated with the victim tag array's storage of complete tags and associative lookup.

7.2 Branch and Memory Divergence Mitigation

Traditional vector processors have examined the concept of executing conditional SIMD code with the use of predicated instructions [29]. Predicated instructions are used on contemporary GPUs for simple, short branches. Recent work on compiler-managed SIMT execution has explored a compiler-driven, software-only solution to the branch divergence problem [101]. Work on software-based branch divergence mitigation is orthogonal to our work on variable warp sizing and could be used in combination with this approach.

Previous work has explored architectures that attempt to achieve performance on divergent code that is similar to what would be achieved with narrower warps. Fung et al. [53, 55] and Rhu et al. [132] explore techniques to “repack” threads from different divergent warps (but which share the same control-flow path) into a single warp issue slot. Narasiman et al. [123] describe a large-warp microarchitecture that effectively performs a similar repacking of threads to mitigate control-flow divergence costs.

Other architectures have been proposed which support narrow SIMT execution, but lack the ganging features of I-VWS [85, 88, 100].
In contrast to their work, I-VWS studies the effects of enabling true 8-way execution on GPU workloads and focuses on a hierarchical gang scheduler that attempts to dynamically set the warp size. Rhu and Erez [134] evaluate SIMD Lane Permutation, which attempts to mitigate control flow divergence by statically re-arranging how threads are assigned to warps. Tarjan et al. [156] propose the diverge-on-miss technique, which intentionally decreases SIMD utilization when threads in a warp experience memory latency divergence in order to increase GPU memory latency tolerance. Meng et al. [115] present Dynamic Warp Subdivision (DWS), which focuses on enabling different control-flow paths to be independently scheduled (as opposed to being strictly managed by a reconvergence stack-enforced ordering, as in conventional GPU architectures). DWS does not pack threads from different warps together for simultaneous issue, but rather focuses on increasing memory-level parallelism by allowing memory-related stalls from both paths of a branch to be overlapped in time. This basic idea is extended by Rhu and Erez [133], who propose allowing simultaneous issue of two control flow paths in the same cycle. ElTantawy et al. [48] further generalize this approach to enable simultaneous co-issue from many independent control-flow paths.

DWS also proposed splitting warps based on memory-latency divergence and, like the I-VWS-ImpatientMem and I-VWS-GroupMem gang-splitting policies we studied, found some workloads suffered due to the lack of subsequent reconvergence.

Our variable warp sizing approach shares with these proposals the goal of performing well on divergent code while running on an underlying wide-SIMD architecture. The above proposals take increasingly complex steps to make a wide-SIMD machine perform similarly to a narrow-SIMD (or, in the limit, MIMD) processor when executing divergent code.

Meng et al.
[117] describe an approach called Robust SIMD that determines whether an application would be best served with a given SIMD width. This scheme can slice warps into independently schedulable entities, providing more, narrower warps as needed. Unlike our variable warp sizing scheme, this approach does not exploit the available datapath hardware by allowing simultaneous issue of the narrower warp-slices. This approach tends to provide the most benefit to applications that suffer from significant memory address and latency divergence. Enabling a larger number of narrower warps increases memory-level parallelism and latency hiding for these workloads (and eliminates some of the losses due to a wide warp waiting for the longest-latency load result). Wang et al. [167] describe a “Multiple-SIMD, Multiple Data (MSMD)” architecture that supports flexible-sized warps with multiple simultaneous issue from different control-flow paths. Their approach requires a complex microarchitecture that is quite different from a traditional GPU. Like our variable warp sizing scheme, however, they provide a number of small instruction buffers to mitigate the impact of multiple parallel instruction issues on the front end of the machine. Lashgar et al. [93] perform an investigation of the effects of warp size on performance. They also note that some workloads lose performance with small warps, and deduce that this is due to lost memory coalescing opportunities. They compare a small-warp machine with aggressive memory coalescing (similar to our baseline small-warp architecture) to a large-warp machine with simultaneous multi-path execution to mitigate control-flow divergence costs. They determine that an architecture like our small-warp baseline tends to outperform a wide-warp architecture that supports aggressive multi-path execution.

Jablin et al. [71] revisit trace scheduling [51] on GPUs.
Trace scheduling is a microcoded CPU technique that divides code into traces and attempts to expose ILP across branch boundaries by statically scheduling traces in the compiler. Their warp-aware trace scheduling technique increases statically exposed ILP in GPGPU applications by adapting traditional CPU trace scheduling to attend to SIMT divergence behaviour on GPUs.

Chapter 8

Conclusions and Future Work

This chapter concludes the dissertation and proposes potential future work based on its findings.

8.1 Conclusions

The breakdown in single-thread performance scaling in the last decade has left computer architects looking for more drastic innovations to improve computing capability. Massively parallel architectures, like GPUs, are a very real design alternative that can potentially sustain energy-efficient general-purpose computing performance moving forward. However, as they exist today, GPUs face several key challenges that hinder their general acceptance as a mainstream computing platform. This dissertation proposes hardware innovations that help solve some of these challenges, in particular: memory locality management, control flow irregularity and general programmability. This dissertation focuses on innovations to GPU microarchitecture because the GPU is a concrete example of an implementable, in-use design. However, the questions posed, solutions presented and insights gained in this dissertation could be applied to parallel computer hardware in general and are not specifically tied to GPUs. Even the name GPU is a nod to the machine's legacy (and continued success) in the 3D rendering space. In the coming decades, as the computing landscape continues to shift and evolve, I believe the lessons learned both in this dissertation and others on GPU architecture will have a much wider impact on computing at large.
Indeed, one of the goals of this thesis is to enable this continued expansion of the massively parallel computing platform by increasing the generality of the hardware through the innovations proposed. Our case study on the programmability of GPUs highlights how the microarchitectural improvements we propose can make the massively parallel platform more approachable for programmers. The techniques in this thesis help isolate the application developer from hardware-specific details, making their code more portable and removing the need to obfuscate application code with GPU-specific optimizations.

This dissertation proposed three novel microarchitectural enhancements to contemporary GPUs that improve performance and expand the class of applications that can take advantage of massively parallel acceleration. The first two innovations, cache-conscious warp scheduling (CCWS) and divergence-aware warp scheduling (DAWS), are techniques aimed at exploiting vertical memory reference locality. The third proposes a variable warp-size architecture (VWS), which slices the GPU datapath to improve performance in the presence of control flow divergence and uses a ganged scheduling mechanism to recapture horizontal memory reference locality and instruction fetch locality.

CCWS (Chapter 3) exploits the observation that locality in cache-sensitive GPGPU applications tends to occur vertically within a warp. To capitalize on this observation, CCWS uses a novel lost locality detector to drive a reactive warp-throttling mechanism when over-subscription of the on-chip data caches is detected. Simulated evaluations using CCWS demonstrate a 63% performance improvement on a suite of memory-irregular, highly cache-sensitive workloads. This work goes on to contrast the warp scheduling problem with the more traditional CPU technique of cache management, which involves innovations to the cache replacement policy.
By comparing against an optimal, oracle cache replacement policy, CCWS demonstrates that innovations to the low-level hardware thread scheduler can have a more significant performance impact than any change to the replacement policy made under an inferior thread scheduler. Since the publication of CCWS, a number of other researchers have studied warp scheduling in other contexts, confirmed our observations, built upon CCWS and proposed alternatives to it [8, 18, 34, 38, 39, 76, 78–80, 83, 84, 86, 90, 95, 97, 98, 103–105, 111, 120, 124, 125, 127, 135, 147, 148, 153, 159, 164–166, 174–180], among others. Many of these works also confirm that the limited set of economically important, irregular applications we studied in CCWS was an accurate representation of a class of forward-looking GPU workloads, as the characteristics we observed in those applications have been observed in a number of other workloads.

DAWS (Chapter 4) quantifies the relationship between branch divergence, memory divergence and locality in a suite of highly cache-sensitive, irregular GPU workloads. DAWS builds on the insights of CCWS by using an online characterization of locality in kernel code sections combined with runtime information about the level of control flow divergence experienced by warps. It uses this information to create a cache footprint prediction for each warp in the GPU that evolves as warps experience control flow divergence and move into new code sections. Using this footprint, DAWS is able to create more accurate cache usage estimates than CCWS and further improve performance by 26%. Chapter 5 goes on to demonstrate the effect both CCWS and DAWS have on GPU programmability. A case study of an irregular GPU application demonstrates that hardware thread scheduling enables less optimized code to perform within 4% of GPU-optimized code without any programmer input.

Finally, Chapter 6 presents VWS, which studies the effect GPU warp sizing has on locality, performance and instruction fetch overheads.
VWS proposes a GPU microarchitecture that is able to dynamically adjust its warp size based on the application. VWS enables the efficient execution of highly control-flow-divergent workloads by allowing the GPU to execute with a narrower warp size, while maintaining the horizontal memory locality efficiencies and fetch amortizations of a wide warp when appropriate. VWS demonstrates a 35% performance improvement on a set of control-flow-irregular applications while maintaining performance and energy efficiency on regular workloads.

8.2 Future Directions

This section details some potential directions for future work based upon the work in this dissertation.

8.2.1 Capturing Locality in a Variable Warp-Size Architecture

The two vertical locality capturing techniques proposed in this thesis (CCWS and DAWS) both operate on an SM with a fixed warp size. I anticipate that a number of interesting research questions will arise if we attempt to combine locality preservation with a variable warp-size machine. The following subsections describe some anticipated opportunities and challenges that would come from combining CCWS and/or DAWS with a Variable Warp-Size Architecture.

Increased Scheduling Flexibility

A machine that is capable of executing with a smaller warp size means fewer threads are controlled with each issue-level scheduling decision; therefore, locality can be captured at a finer granularity. In code without any branch divergence, both CCWS and DAWS make decisions for 32 threads at a time. If this number is decreased to 4, there is an opportunity to refine the cache-footprint prediction for each of the 4-wide slices further and potentially increase multithreading when the cache is underutilized in the current design. A simple way to implement a cache-conscious VWS machine might be to simply turn off slices in the presence of high intra-warp locality code and have the slices re-gang after the execution of their high-locality code sections.
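The slice-throttling idea just described could be prototyped as a simple policy. The sketch below is a hypothetical illustration, not part of the proposed hardware: it assumes a per-slice intra-warp locality score (e.g., produced by a CCWS-style lost-locality detector) and keeps only the highest-scoring slice active while high locality is detected, letting all slices re-gang otherwise. The function name, scores and threshold are all assumptions made for the example.

```python
def active_slices(slice_locality, threshold):
    """Decide which narrow warp slices may issue.

    slice_locality: list of detected intra-warp locality scores, one per
    slice (hypothetical inputs). When any slice shows high locality, run
    only the highest-scoring slice so its working set stays in the L1;
    otherwise let every slice issue ganged together.
    """
    if max(slice_locality) >= threshold:
        # Throttle: keep only the slice with the most detected locality.
        best = slice_locality.index(max(slice_locality))
        return [best]
    # No slice is thrashing the cache: re-gang and run them all.
    return list(range(len(slice_locality)))
```

A real implementation would of course be a hardware policy evaluated every scheduling cycle; the sketch only captures the turn-off/re-gang decision itself.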
One challenge imposed by this type of design would be coordinating these individual schedulers and determining when it might be best to form gangs based on predicted cache locality.

Transformation of Intra-Warp Locality to Inter-Warp Locality

Another interesting side effect of having smaller warps might be that a non-trivial portion of the intra-warp locality we observed in CCWS and DAWS could turn into inter-warp locality. This would further motivate a mechanism that can capture inter-warp locality in the scheduler and would create a tradeoff between breaking gangs to improve SIMD utilization and keeping gangs together to maintain the intra-warp locality CCWS and DAWS are designed to exploit.

Increased Opportunities to Exploit Memory Latency Divergence

Chapter 6 showed that there was very limited opportunity to exploit memory latency divergence in VWS as it was presented. However, this lack of improved performance when gangs were split on memory divergence might have come from excessive cache thrashing that could be mitigated by applying CCWS and/or DAWS to the split gangs. Memory latency divergence might be a good indicator of when multithreading should start being constrained, as it indicates that only a subset of the threads in the gang are capturing locality in the data caches.

Combining VWS with CCWS and/or DAWS could expose interesting opportunities in the interplay of branch divergence and memory locality. In code that is both branch divergent and has data locality, when is it best to execute multiple control flow paths simultaneously, and when is it best to limit the number of exposed control flow paths in the interest of loading less data into the caches?

8.2.2 Exploiting Shared (Inter-Warp) Locality

Neither CCWS nor DAWS explicitly exploited sharing patterns among warps, i.e. the inter-warp locality identified in that work.
This work would involve performing an initial examination of global data sharing patterns in benchmarks where inter-warp locality is important. To illustrate the potential of capturing this locality, consider the BFS application, whose inner kernel loop is shown in Example 2.

Example 2 BFS CUDA kernel inner loop where inter-warp sharing occurs.
1: for(int i=0; i<node_degree; i++) {
2:     int id = g_graph_edges[first_edge + i];
3:     if(!g_graph_visited[id]) {
4:         g_cost[id]=g_cost[tid]+1;
5:         g_updating_graph_mask[id]=true;
6:     }
7: }

BFS is partitioned by assigning each thread in the program to a node. Each thread loops through all the edges connecting that node to its neighbors. If the neighbor has not yet been visited, the depth for that node is incremented. Initial investigations indicate that the most divergent load in the program is also the load with the most inter-warp sharing. Line three in the above example is the source line responsible for it. Memory divergence occurs at this line because the index into the array is data dependent. Line two loads the index from memory and effectively acts like a pointer into the array at line three. Inter-warp sharing occurs here because nodes in the same graph can share neighbors. The project could use this information to help make decisions on which warps/threads should be co-scheduled. Since a thread will know which portion of the array it will index prior to issuing the load on line three, it may be beneficial to co-schedule warps with similar indexes to maximize captured locality.

Additionally, there may be benefit in capturing the inter-warp locality that is inherent in the I-cache access stream (since warps in the same kernel share the same code).
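The index-based co-scheduling idea above can be made concrete with a minimal software sketch. This is a hypothetical heuristic, not a mechanism from CCWS or DAWS: it assumes the scheduler could sample the upcoming id values each warp will load (the values produced at line two of Example 2), and it greedily pairs warps whose index sets overlap the most, so that paired warps touch the same g_graph_visited cache lines.

```python
from itertools import combinations

def coschedule_by_overlap(warp_indices):
    """Greedily group warps whose upcoming access indices overlap most.

    warp_indices: dict mapping warp id -> set of node ids the warp will
    soon access (hypothetical input, e.g. sampled from pending loads).
    Returns a list of co-scheduling groups.
    """
    # Score every warp pair by the number of shared indices.
    scores = {(a, b): len(warp_indices[a] & warp_indices[b])
              for a, b in combinations(sorted(warp_indices), 2)}
    groups, used = [], set()
    # Pair off warps starting from the highest-overlap pair.
    for (a, b), shared in sorted(scores.items(), key=lambda kv: -kv[1]):
        if a not in used and b not in used and shared > 0:
            groups.append((a, b))
            used.update((a, b))
    # Any warp with no sharing partner forms its own group.
    groups += [(w,) for w in sorted(warp_indices) if w not in used]
    return groups
```

For instance, if warps 0 and 1 will both touch nodes {2, 3} while warps 2 and 3 share node 11, the heuristic groups (0, 1) and (2, 3), maximizing the data-cache lines each group shares.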
There may be a very interesting trade-off in exploiting both intra-warp locality in the data access stream and inter-warp locality in the instruction access stream.

8.2.3 Adaptive Cache Blocking and Warp Scheduling

CCWS, DAWS and the work proposed in Section 8.2.2 do little for programs that have been optimized to use the architecture's in-core scratchpad memory (or local data store). This project attempts to resolve the dichotomy that exists between user-controlled shared scratchpad memory and L1 data caches. L1D caches in current GPUs are similar in design to a CPU cache, even though a CPU runs significantly fewer threads than a GPU. Programmers are encouraged to make use of shared scratchpad memory to capture locality and data reuse among threads. An observation of how shared memory is used in current GPU programs reveals two important characteristics:

1. Programmers partition their launched CTAs to ensure their shared-memory footprint can fit in the shared on-chip memory.

2. Programmers explicitly schedule their CTA's use of this memory using barrier instructions.

The code in Example 3 attempts to capture a generalized example of this. Lines one through six load data from global memory into the scratchpad. The amount of scratchpad memory used is set at compile time or before the code is launched to the GPU (the SIZE variable on line two). After all data is loaded into the scratchpad, a barrier operation is used on line six to ensure data from all the threads in this CTA is loaded before proceeding. Lines seven through eleven perform an iterative calculation on each thread based on data in shared memory. The data in the scratchpad is accessed repeatedly in both an intra- and inter-warp fashion. Some benchmarks employ synchronization in this step to enable true data sharing between threads in the same CTA. After this computation is done, the program writes the end result (which can be stored in either local variables or the scratchpad) out to global memory.
The kernel may repeat this process multiple times depending on its nature, each time explicitly blocking all the data required for an inner loop or high-locality access stream into the scratchpad.

Example 3 Generalized CUDA code sample that uses shared scratchpad memory
1: // The first code section loads data into the shared scratchpad memory
2: extern __shared__ float scratchpad_mem[SIZE];
3: for(int i=0; i<size; i++)
4:     scratchpad_mem[f(i,tid)] = global_mem[f(i,tid)];
6: __syncthreads();
7: // All the data going to be re-used is in the scratchpad memory
8: // Iterative computation is done on the scratchpad memory
9: float local_variable = 0.0;
10: for(int i=0; i<iterations; i++)
11:     local_variable += scratchpad_mem[f(i,tid)] + ...;
12: g_out[f(tid)] = local_variable;

There are several problems with this type of approach:

1. It forces programmers to explicitly partition and schedule their application, which complicates the programming process.

2. This exercise is more difficult or impossible if the locality in the application is input-data dependent. When writing the code, the programmer does not know how large to make the application's CTAs and how much data each thread will need.

3. The application will require tuning if used on an architecture with different restrictions on CTA size or usable scratchpad memory space.

4. Programmer scheduling is applied at a fairly coarse grain. With a more adaptive scheduling and replacement technique, old data could be evicted sooner to make way for new data without the need for coarse-grained synchronization points.
This could help overlap computation from some warps with memory accesses from others.

The premise of this project is that a more ideal situation would see the programmer write the same kernel accessing global memory directly, as done in Example 4. A hardware/compiler solution would schedule the threads and manage the cache under the hood to capture the locality.

Example 4 Generalized code sample without shared memory
1: float local_variable = 0.0;
2: for(int i=0; i<iterations; i++)
3:     local_variable += global_mem[f(i, tid)] + ...;
4: g_out[f(tid)] = local_variable;

This problem might be solved by unifying the scheduling system and the cache replacement policy to provide a finer-grained solution than the programmer could have achieved through explicit scratchpad use. Solving this problem will require some work to be done either at the compiler stage or through high-level hints given by the programmer. The first step in this process would involve flagging pieces of code where significant locality may exist. The most obvious candidates are inner loops. If the programmer was able to statically arrange the accesses to shared memory as they did in Example 3, then a static analysis algorithm should be able to identify which threads will share data, and an analysis of loop bounds can help quantify how much data will be loaded. Doing this pre-processing in the compiler is preferred, since it eases the burden on the programmer; however, even if some slight annotation is required, it can still help mitigate problems two through four outlined above. If the loop bounds or access patterns are input-data dependent, then the programmer would not have been able to capture this locality in a scratchpad memory implementation. In this case, live working set analysis could be used (similar to the locality detection implemented in CCWS) to predict when data should be evicted or which warps should be prioritized.
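As a back-of-the-envelope illustration of how such a loop-bound analysis could size the active warp set, the following sketch (my own hypothetical formulation; the function name, parameters and policy are assumptions, not a mechanism from this dissertation) computes a per-warp loop footprint from the trip count and the bytes touched per iteration, then admits warps until the L1 capacity would be exceeded:

```python
def warps_to_activate(trip_count, bytes_per_iter, line_size,
                      l1_bytes, total_warps):
    """Estimate how many warps can run without oversubscribing the L1.

    A warp's loop footprint is modeled as its trip count times the cache
    lines it touches per iteration. Warps are admitted until the summed
    footprints would exceed L1 capacity; at least one warp always runs.
    """
    lines_per_iter = -(-bytes_per_iter // line_size)   # ceiling division
    footprint_bytes = trip_count * lines_per_iter * line_size
    return max(1, min(total_warps, l1_bytes // footprint_bytes))
```

With a 32 KB L1 and 128-byte lines, a loop that touches one line per iteration for 100 iterations yields a 12.8 KB footprint per warp, so only two warps would be kept active; a tiny footprint would leave all warps eligible, preserving latency hiding.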
The scheduling and replacement policy can be driven by a prioritization system similar to the point system proposed in CCWS. High-priority warps will receive preference from the warp scheduler and their data will be protected from eviction. The scoring system can bias groups of threads to be co-scheduled if they share data. In addition to the replacement scheme protecting data, after this data is used as many times as static or live working set analysis predicts, the replacement policy can de-prioritize the data and create capacity for other threads.

8.2.4 A Programmable Warp Scheduler for Debugging and Synchronization

In contemporary GPUs, the software stack has very little control over how warps are scheduled on an SM. The application programmer can dictate, at a coarse grain, when threads within a CTA must synchronize using CTA barriers. However, this level of control is not fine-grained enough to enable tools like race detectors [144] to formally verify whether a GPU program is race free. This project would involve exposing a low-level warp scheduler API to software, enabling debuggers, race detectors and other tools to have more fine-grained control of the warp schedule to help ease the GPU debugging process. Additionally, a programmable warp scheduler might enable expert programmers or systems developers to create more efficient synchronization primitives on the GPU by dictating the warp schedule.

Bibliography

[1] Intel Cilk Plus Manual. Accessed July 6, 2015. → pages 2

[2] Intel Thread Building Blocks Manual. Accessed July 6, 2015. → pages 2

[3] Diamond Standard 108Mini Controller: A Small, Low-Power, Cache-less RISC CPU. Accessed July 6, 2015. → pages 106

[4] Intel 4004 Datasheet. datasheet.pdf, 1987. Accessed July 6, 2015. → pages 1

[5] NVIDIA's Next Generation CUDA Compute Architecture: Fermi. white papers/NVIDIA Fermi Compute Architecture Whitepaper.pdf, 2009. Accessed July 6, 2015. → pages 14, 16

[6] NVIDIA CUDA C Programming Guide v4.2, 2012.
→ pages 2, 5, 11

[7] NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK-110., 2012. Accessed July 6, 2015. → pages 14

[8] M. Abdel-Majeed, D. Wong, and M. Annavaram. Warped Gates: Gating Aware Scheduling and Power Gating for GPGPUs. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 111–122, 2013. → pages 123

[9] W. Abu-Sufah, D. Kuck, and D. Lawrie. Automatic Program Transformations for Virtual Memory Computers. In Proceedings of the 1979 National Computer Conference, 1979. → pages 118

[10] A. Agarwal. Performance Tradeoffs in Multithreaded Processors. IEEE Transactions on Parallel and Distributed Systems, 1992. → pages 116

[11] O. Agesen, D. Detlefs, and J. E. Moss. Garbage Collection and Local Variable Type-Precision and Liveness in Java Virtual Machines. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 269–279, 1998. → pages 30

[12] K. Agrawal, J. T. Fineman, J. Krage, C. E. Leiserson, and S. Toledo. Cache-Conscious Scheduling of Streaming Applications. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 236–245, 2012. → pages 118

[13] T. Aila and S. Laine. Understanding the Efficiency of Ray Traversal on GPUs. In Eurographics/ACM SIGGRAPH High Performance Graphics Conference (HPG), 2009. → pages 107

[14] AMD. Compute Cores White Paper. Cores Whitepaper.pdf, 2014. Accessed July 6, 2015. → pages 12

[15] J. M. Anderson and M. S. Lam. Global Optimizations for Parallelism and Locality on Scalable Parallel Machines. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 1993. → pages 118

[16] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.
URL Accessed July 6, 2015. → pages 1

[17] R. Ausavarungnirun, S. Ghose, O. Kayran, G. H. Loh, C. R. Das, M. T. Kandemir, and O. Mutlu. Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), 2015. → pages 115

[18] M. Awatramani, J. Zambreno, and D. Rover. Perf-Sat: Runtime Detection of Performance Saturation for GPGPU Applications. In Proceedings of the 43rd International Conference on Parallel Processing Workshops (ICCPW), pages 1–8, 2014. → pages 123

[19] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 163–174, 2009. → pages 29, 31, 63, 65, 107, 111

[20] S. Bansal and D. S. Modha. CAR: Clock with Adaptive Replacement. In Proceedings of the 3rd USENIX Conference on File and Storage Technologies, pages 187–200. USENIX Association, 2004. → pages 117

[21] K. Barabash and E. Petrank. Tracing Garbage Collection on Highly Parallel Platforms. In International Symposium on Memory Management (ISMM), pages 1–10, 2010. → pages 30, 31, 65

[22] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. V. der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition. SIAM, 1994. → pages 22, 47

[23] J. Baxter. Open Source Hardware Development and the OpenRISC Project. PhD thesis, KTH Computer Science and Communication, 2011. → pages 106

[24] E. Baydal, P. Lopez, and J. Duato. A Simple and Efficient Mechanism to Prevent Saturation in Wormhole Networks. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), pages 617–622, 2000. → pages 112

[25] B. M. Beckmann, M. R. Marty, and D. A. Wood. ASR: Adaptive Selective Replication for CMP Caches.
In Proceedings of the InternationalSymposium on Microarchitecture (MICRO), pages 443–454, 2006. →pages 5, 119[26] L. A. Belady. A Study of Replacement Algorithms for a Virtual-StorageComputer. IBM Systems Journal, 5(2):78 –101, 1966. → pages 29, 117[27] N. Bell and M. Garland. Implementing Sparse Matrix-VectorMultiplication on Throughput-Oriented Processors. In Proceedings of theInternational Conference on High Performance Computing, Networking,Storage and Analysis (SC), 2009. → pages 74133[28] B. Boothe and A. Ranade. Improved Multithreading Techniques for HidingCommunication Latency in Multiprocessors. In Proceedings of theInternational Symposium on Computer Architecture (ISCA), pages214–223, 1992. → pages 116[29] W. Bouknight et al. The Illiac IV System. Proceedings of the IEEE, 60(4):369 – 388, apr. 1972. → pages 119[30] N. Brunie, S. Collange, and G. Diamos. Simultaneous Branch and WarpInterweaving for Sustained GPU Performance. In Proceedings of theInternational Symposium on Computer Architecture (ISCA), pages 49–60,June 2012. → pages 82[31] M. Burtscher and K. Pingali. An Efficient CUDA Implementation of theTree-Based Barnes Hut N-Body Algorithm. In W. Hwu, editor, GPUComputing Gems, Emerald Edition, pages 75–92. Elsevier, 2011. → pages80[32] M. Burtscher, R. Nasre, and K. Pingali. A Quantitative Study of IrregularPrograms on GPUs. In Proceedings of the International Symposium onWorkload Characterization (IISWC), pages 141–151, November 2012. →pages 80[33] B. Calder, C. Krintz, S. John, and T. Austin. Cache-Conscious DataPlacement. In Proceedings of the International Conference onArchitectural Support for Programming Languages and Operation Systems(ASPLOS), 1998. → pages 118[34] N. Chatterjee, M. O’Connor, G. H. Loh, N. Jayasena, andR. Balasubramonian. Managing DRAM Latency Divergence in IrregularGPGPU Applications. In Proceedings of the International Conference forHigh Performance Computing, Networking, Storage and Analysis, pages128–139, 2014. 
→ pages 123[35] M. Chaudhuri. Pseudo-LIFO: the foundation of a new family ofreplacement policies for last-level caches. In Proceedings of theInternational Symposium on Microarchitecture (MICRO), pages 401–412,2009. → pages 117[36] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, andK. Skadron. Rodinia: A Benchmark Suite for Heterogeneous Computing.In Proceedings of the International Symposium on WorkloadCharacterization (IISWC), pages 44–54, 2009. → pages 29, 31, 65, 107134[37] C. F. Chen, S.-H. Yang, B. Falsafi, and A. Moshovos. Accurate andComplexity-Effective Spatial Pattern Prediction. In Proceedings of theInternational Symposium on High-Performance Computer Architecture(HPCA), 2004. → pages 119[38] X. Chen, L.-W. Chang, C. I. Rodrigues, J. Lv, Z. Wang, and W.-M. Hwu.Adaptive Cache Management for Energy-Efficient GPU Computing. InProceedings of the International Symposium on Microarchitecture(MICRO), pages 343–355, 2014. → pages 123[39] X. Chen, S. Wu, L.-W. Chang, W.-S. Huang, C. Pearson, Z. Wang, andW.-M. W. Hwu. Adaptive Cache Bypass and Insertion for Many-coreAccelerators. In Proceedings of International Workshop on ManycoreEmbedded Systems, pages 1:1–1:8, 2014. → pages 123[40] H.-Y. Cheng, C.-H. Lin, J. Li, and C.-L. Yang. Memory Latency Reductionvia Thread Throttling. In Proceedings of the International Symposium onMicroarchitecture (MICRO), pages 53–64, 2010. → pages 112[41] T. M. Chilimbi, M. D. Hill, and J. R. Larus. Cache-Conscious StructureLayout. In Proceedings of the ACM SIGPLAN Conference on ProgrammingLanguage Design and Implementation (PLDI), 1999. → pages 118[42] S. Coleman and K. S. McKinley. Tile Size Selection Using CacheOrganization and Data Layout. In Proceedings of the ACM SIGPLANConference on Programming Language Design and Implementation(PLDI), pages 279–290, 1995. → pages 118[43] W. J. Dally. The Last Classical Computer. Information Science andTechnology (ISAT) Study Group, 2001. → pages 2[44] A. Danalis, G. Marin, C. 
McCurdy, J. S. Meredith, P. C. Roth, K. Spafford,V. Tipparaju, and J. S. Vetter. The Scalable Heterogeneous Computing(SHOC) benchmark suite. In Proceedings of Workshop on GeneralPurpose Processing Using GPUs, 2010. → pages 49, 63, 65, 73[45] R. H. Dennard, F. H. Gaensslen, and K. Mai. Design of Ion-ImplantedMOSFET’s with Very Small Physical Dimensions. In IEEE Journal ofSolid-State Circuits, October 1974. → pages 1[46] J. Dongarra. Performance of Various Computers Using Standard LinearEquations Software. University of Tennessee Computer Science TechnicalReport Number, 2015. → pages 3135[47] E. Ebrahimi, C. J. Lee, O. Mutlu, and Y. N. Patt. Fairness via SourceThrottling: A Configurable and High-performance Fairness Substrate forMulti-core Memory Systems. In Proceedings of the InternationalConference on Architectural Support for Programming Languages andOperation Systems (ASPLOS), pages 335–346, 2010. → pages 112[48] A. ElTantawy, J. W. Ma, M. O’Connor, and T. M. Aamodt. A ScalableMulti-Path Microarchitecture for Efficient GPU Control Flow. InProceedings of the International Symposium on High-PerformanceComputer Architecture (HPCA), pages 248–259, February 2014. → pages82, 106, 107, 109, 120[49] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, andD. Burger. Dark Silicon and the End of Multicore Scaling. In Proceedingsof the International Symposium on Computer Architecture (ISCA), pages365–376, 2011. → pages 2[50] Feng, W. and Cameron K. The Green 500 List.http:// Accessed July 6, 2015. → pages 3[51] J. Fisher. Trace Scheduling: A Technique for Global MicrocodeCompaction. Computers, IEEE Transactions on, (7):478–490, July 1981.→ pages 121[52] W. Fung and T. Aamodt. Thread Block Compaction for Efficient SIMTControl Flow. In Proceedings of the International Symposium onHigh-Performance Computer Architecture (HPCA), pages 25 –36, 2011. →pages 113[53] W. Fung and T. Aamodt. Thread Block Compaction for Efficient SIMTControl Flow. 
In Proceedings of the International Symposium onHigh-Performance Computer Architecture (HPCA), pages 25–36, February2011. → pages 82, 107, 109, 120[54] W. W. L. Fung. Thread Block Compaction Simulation Infrastructure.∼wwlfung/code/tbc-gpgpusim.tgz, 2012. AccessedJuly 6, 2015. → pages 107, 108[55] W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic WarpFormation and Scheduling for Efficient GPU Control Flow. In Proceedingsof the International Symposium on Microarchitecture (MICRO), pages407–420, 2007. → pages 15, 30, 64, 82, 107, 109, 113, 120136[56] D. Gannon, W. Jalby, and K. Gallivan. Strategies for Cache and LocalMemory Management by Global Program Transformation. Journal ofParallel and Distributed Computing, 5(5):587–616, Oct. 1988. → pages118[57] M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally,E. Lindholm, and K. Skadron. Energy-Efficient Mechanisms for ManagingThread Context in Throughput Processors. In Proceedings of theInternational Symposium on Computer Architecture (ISCA), pages235–246, 2011. → pages 92, 113[58] H. Goldstine and A. Goldstine. The Electronic Numerical Integrator andComputer (ENIAC). Annals of the History of Computing, IEEE, (1):10–16,1996. → pages 1, 4[59] Z. Guz, E. Bolotin, I. Keidar, A. Kolodny, A. Mendelson, and U. C. Weiser.Many-Core vs. Many-Thread Machines: Stay Away From the Valley.Computer Architecture Letters, pages 25 –28, jan. 2009. → pages 112[60] Z. S. Hakura and A. Gupta. The Design and Analysis of a CacheArchitecture for Texture Mapping. In Proceedings of the InternationalSymposium on Computer Architecture (ISCA), pages 108–120, 1997. →pages 12[61] R. Haring, M. Ohmacht, T. Fox, M. Gschwind, D. Satterfield,K. Sugavanam, P. Coteus, P. Heidelberger, M. Blumrich, R. Wisniewski,A. Gara, G.-T. Chiu, P. Boyle, N. Chist, and C. Kim. The IBM BlueGene/Q Compute Chip. Micro, IEEE, 32(2):48 –60, march-april 2012. →pages 6[62] T. H. Hetherington, T. G. Rogers, L. Hsu, M. O’Connor, and T. M.Aamodt. 
Characterizing and Evaluating a Key-Value Store Application onHeterogeneous CPU-GPU Systems. In Proceedings of the InternationalSymposium on Performance Analysis of Systems and Software (ISPASS),pages 88 –98, 2012. → pages 19, 29, 31, 49, 65[63] S. Hong, S. K. Kim, T. Oguntebi, and K. Olukotun. Accelerating CUDAGraph Algorithms at Maximum Warp. In Proceedings of the Symposium onPrinciples and Practice of Parallel Programming (PPOPP), pages267–276, 2011. → pages 48137[64] W.-m. W. Hwu. GPU Computing Gems Emerald Edition. MorganKaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 2011. →pages 3[65] IBM Corp. IBM Automatic Sequence Controlled Calculator, 1945. →pages 1, 4[66] IDC. Worldwide Server Market Rebounds Sharply in Fourth Quarter asDemand for Blades and x86 Systems Leads the Way, Feb 2010. → pages29[67] IDC. HPC Server Market Declined 11.6% in 2009, Return to GrowthExpected in 2010, Mar 2010. → pages 29[68] IEEE. The OpenMP API Specification for Parallel Programming. Accessed July 6, 2015. → pages 2[69] IEEE. IEEE Standard for Information Technology–Portable OperatingSystem Interface (POSIX) - System Application Program Interface (API)Amendment 2: Threads Extension (C Language), 1996. → pages 2[70] Intel Xeon Phi Coprocessor Brief. Intel. → pages 6[71] J. A. Jablin, T. B. Jablin, O. Mutlu, and M. Herlihy. Warp-aware TraceScheduling for GPUs. In Proceedings of the International Conference onParallel Architectures and Compilation Techniques (PACT), pages163–174, 2014. → pages 121[72] A. Jaleel, W. Hasenplaugh, M. Qureshi, J. Sebot, S. Steely, Jr., and J. Emer.Adaptive Insertion Policies for Managing Shared Caches. In Proceedingsof the International Conference on Parallel Architectures and CompilationTechniques (PACT), pages 208–219, 2008. → pages 118[73] A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer. High PerformanceCache Replacement Using Re-Reference Interval Prediction (RRIP). 
InProceedings of the International Symposium on Computer Architecture(ISCA), pages 60–71, 2010. → pages 20, 117[74] A. Jaleel, H. H. Najaf-abadi, S. Subramaniam, S. C. Steely, and J. Emer.CRUISE: Cache Replacement and Utility-Aware Scheduling. InProceedings of the International Conference on Architectural Support forProgramming Languages and Operation Systems (ASPLOS), pages249–260, 2012. → pages 117138[75] W. Jia, K. A. Shaw, and M. Martonosi. Characterizing and Improving theuse of Demand-Fetched Caches in GPUs. In Proceedings of the ACMinternational conference on Supercomputing, pages 15–24, 2012. → pages115[76] W. Jia, K. Shaw, and M. Martonosi. MRPB: Memory RequestPrioritization for Massively Parallel Processors. In Proceedings of theInternational Symposium on High-Performance Computer Architecture(HPCA), pages 272–283, 2014. → pages 123[77] S. Jiang and X. Zhang. LIRS: an efficient low inter-reference recency setreplacement policy to improve buffer cache performance. In ACM’sSpecial Interest Group on Measurement and Evaluation (SIGMETRICS),pages 31–42, 2002. → pages 117[78] A. Jog, O. Kayiran, N. Chidambaram Nachiappan, A. K. Mishra, M. T.Kandemir, O. Mutlu, R. Iyer, and C. R. Das. OWL: Cooperative ThreadArray Aware Scheduling Techniques for Improving GPGPU Performance.In Proceedings of the International Conference on Architectural Supportfor Programming Languages and Operation Systems (ASPLOS), 2013. →pages 114, 123[79] A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, andC. R. Das. Orchestrated Scheduling and Prefetching for GPGPUs. InProceedings of the International Symposium on Computer Architecture(ISCA), 2013. → pages 114[80] A. Jog, E. Bolotin, Z. Guz, M. Parker, S. W. Keckler, M. T. Kandemir, andC. R. Das. Application-Aware Memory System for Fair and EfficientExecution of Concurrent GPGPU Applications. In Proceedings ofWorkshop on General Purpose Processing Using GPUs, pages 1:1–1:8,2014. → pages 123[81] T. L. Johnson and W.-m. W. 
Hwu. Run-time Adaptive Cache HierarchyManagement via Reference Analysis. In Proceedings of the InternationalSymposium on Computer Architecture (ISCA), pages 315–326, 1997. →pages 118[82] N. P. Jouppi. Improving Direct-Mapped Cache Performance by theAddition of a Small Fully-Associative Cache and Prefetch Buffers. InProceedings of the International Symposium on Computer Architecture(ISCA), pages 364–373, 1990. → pages 26139[83] O. Kayiran, A. Jog, M. T. Kandemir, and C. R. Das. Neither More NorLess: Optimizing Thread-level Parallelism for GPGPUs. In Proceedings ofthe International Conference on Parallel Architectures and CompilationTechniques (PACT), 2013. → pages 114, 123[84] O. Kayiran, N. C. Nachiappan, A. Jog, R. Ausavarungnirun, M. T.Kandemir, G. H. Loh, O. Mutlu, and C. R. Das. Managing GPUConcurrency in Heterogeneous Architectures. In Proceedings of theInternational Symposium on Microarchitecture (MICRO), pages 114–126,2014. → pages 123[85] S. Keckler, W. Dally, B. Khailany, M. Garland, and D. Glasco. GPUs andthe Future of Parallel Computing. IEEE Micro, 31(5):7–17,September/October 2011. → pages 120[86] M. Khairy, M. Zahran, and A. G. Wassal. Efficient Utilization of GPGPUCache Hierarchy. In Proceedings of the 8th Workshop on General PurposeProcessing Using GPUs (GPGPU), pages 36–47, 2015. → pages 123[87] Khronos Group. OpenCL. Accessed July6, 2015. → pages 2, 11[88] R. Krashinsky, C. Batten, M. Hampton, S. Gerding, B. Pharris, J. Casper,and K. Asanovic´. The Vector-Thread Architecture. In Proceedings of theInternational Symposium on Computer Architecture (ISCA), pages 52–63,2004. → pages 120[89] S. Kumar and C. Wilkerson. Exploiting Spatial Locality in Data Cachesusing Spatial Footprints. In Proceedings of the International Symposium onComputer Architecture (ISCA), pages 357–368, 1998. → pages 119[90] H.-K. Kuo, B.-C. C. Lai, and J.-Y. Jou. Reducing Contention in SharedLast-Level Cache for Throughput Processors. 
ACM Transactions on DesignAutomation of Electronic Systems, 20:12:1–12:28, 2014. → pages 123[91] N. B. Lakshminarayana and H. Kim. Effect of Instruction Fetch andMemory Scheduling on GPU Performance. In Workshop on Language,Compiler, and Architecture Support for GPGPU, 2010. → pages 113[92] M. D. Lam, E. E. Rothberg, and M. E. Wolf. The Cache Performance andOptimizations of Blocked Algorithms. In Proceedings of the InternationalConference on Architectural Support for Programming Languages andOperation Systems (ASPLOS), pages 63–74, 1991. → pages 118140[93] A. Lashgar, A. Baniasadi, and A. Khonsari. Towards Green GPUs: WarpSize Impact Analysis. In International Green Computing Conference(IGCC), pages 1–6, June 2013. → pages 121[94] J. Lee, N. B. Lakshminarayana, H. Kim, and R. Vuduc. Many-ThreadAware Prefetching Mechanisms for GPGPU Applications. In Proceedingsof the International Symposium on Microarchitecture (MICRO), pages213–224, 2010. → pages 114, 115[95] M. Lee, S. Song, J. Moon, J. Kim, W. Seo, Y. Cho, and S. Ryu. ImprovingGPGPU Resource Utilization through Alternative Thread BlockScheduling. In Proceedings of the International Symposium onHigh-Performance Computer Architecture (HPCA), pages 260–271, 2014.→ pages 123[96] S. Lee, S.-J. Min, and R. Eigenmann. OpenMP to GPGPU: A CompilerFramework for Automatic Translation and Optimization. In Proceedings ofthe Symposium on Principles and Practice of Parallel Programming(PPOPP), pages 101–110, 2009. → pages 17[97] S.-Y. Lee and C.-J. Wu. CAWS: Criticality-aware Warp Scheduling forGPGPU Workloads. In Proceedings of the International Conference onParallel Architectures and Compilation Techniques (PACT), pages175–186, 2014. → pages 123[98] S.-Y. Lee, A. Arunkumar, and C.-J. Wu. CAWA: Coordinated WarpScheduling and Cache Prioritization for Critical Warp Acceleration ofGPGPU Workloads. In Proceedings of the International Symposium onComputer Architecture (ISCA), pages 515–527, 2015. → pages 123[99] V. W. Lee, C. 
Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen,N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal,and P. Dubey. Debunking the 100X GPU vs. CPU Myth: An Evaluation ofThroughput Computing on CPU and GPU. In Proceedings of the 37thAnnual International Symposium on Computer Architecture, ISCA ’10,pages 451–460, New York, NY, USA, 2010. ACM. → pages 3[100] Y. Lee, R. Avizienis, A. Bishara, R. Xia, D. Lockhart, C. Batten, andK. Asanovic´. Exploring the Tradeoffs Between Programmability andEfficiency in Data-parallel Accelerators. In Proceedings of theInternational Symposium on Computer Architecture (ISCA), pages129–140, June 2011. → pages 120141[101] Y. Lee, V. Grover, R. Krashinsky, M. Stephenson, S. W. Keckler, andK. Asanovic´. Exploring the Design Space of SPMD DivergenceManagement on Data-Parallel Architectures. In Proceedings of theInternational Symposium on Microarchitecture (MICRO), pages 101–113,2014. → pages 119[102] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M.Aamodt, and V. J. Reddi. GPUWattch: Enabling Energy Optimizations inGPGPUs. In Proceedings of the International Symposium on ComputerArchitecture (ISCA), 2013. → pages 71[103] C. Li, Y. Yang, H. Dai, S. Yan, F. Mueller, and H. Zhou. Understanding theTradeoffs Between Software-Managed vs. Hardware-Managed Caches inGPUs. In Proceedings of the International Symposium on PerformanceAnalysis of Systems and Software (ISPASS), pages 231–242, 2014. →pages 123[104] C. Li, S. L. Song, H. Dai, A. Sidelnik, S. K. S. Hari, and H. Zhou.Locality-Driven Dynamic GPU Cache Bypassing. In Proceedings of the29th ACM on International Conference on Supercomputing, 2015. → pages[105] D. Li, M. Rhu, D. Johnson, M. O’Connor, M. Erez, D. Burger, D. Fussell,and S. Redder. Priority-Based Cache Allocation in Throughput Processors.In Proceedings of the International Symposium on High-PerformanceComputer Architecture (HPCA), pages 89–100, 2015. → pages 116, 124[106] P. Lopez, J. M. Martnez, J. 
Duato, and F. Petrini. On the Reduction ofDeadlock Frequency by Limiting Message Injection in WormholeNetworks. In In Proceedings of Parallel Computer Routing andCommunication Workshop, 1997. → pages 112[107] P. Lopez, J. Martinez, and J. Duato. DRIL: Dynamically Reduced MessageInjection Limitation Mechanism for Wormhole Networks. In Proceedingsof the International Conference on Parallel Processing (ICPP), pages535–542, 1998. → pages 112[108] M. Maas, P. Reames, J. Morlan, K. Asanovic´, A. D. Joseph, andJ. Kubiatowicz. How a Single Chip Causes Massive Power BillsGPUSimPow: A GPGPU Power Simulator. In Proceedings of theInternational Symposium on Performance Analysis of Systems andSoftware (ISPASS), 2013. → pages 71142[109] A. Mahesri. Tradeoffs in Designing Massively Parallel AcceleratorArchitectures. PhD thesis, University of Illinois at Urbana-Champaign,2009. → pages 106[110] A. Mahesri, D. Johnson, N. Crago, and S. J. Patel. Tradeoffs in DesigningAccelerator Architectures for Visual Computing. In Proceedings of theInternational Symposium on Microarchitecture (MICRO), pages 164–175,November 2008. → pages 106, 107[111] M. Mao, J. Hu, Y. Chen, and H. Li. VWS: A Versatile Warp Scheduler forExploring Diverse Cache Localities of GPGPU Applications. InProceedings of the Design Automation Conference (DAC), pages83:1–83:6, 2015. → pages 124[112] M. R. Marty and M. D. Hill. Coherence Ordering for Ring-based ChipMultiprocessors. In Proceedings of the International Symposium onMicroarchitecture (MICRO), pages 309–320, 2006. → pages 5[113] M. Mendez-Lojo, M. Burtscher, and K. Pingali. A GPU Implementation ofInclusion-based Points-to Analysis. In Proceedings of the Symposium onPrinciples and Practice of Parallel Programming (PPOPP), pages107–116, August 2012. → pages 80[114] J. Meng and K. Skadron. Avoiding Cache Thrashing due to Private DataPlacement in Last-Level Cache for Manycore Scaling. 
In Proceedings ofthe International Conference on Computer Design (ICCD), pages 282–288,2009. → pages 117[115] J. Meng, D. Tarjan, and K. Skadron. Dynamic Warp Subdivision forIntegrated Branch and Memory Divergence Tolerance. In Proceedings ofthe International Symposium on Computer Architecture (ISCA), pages235–246, 2010. → pages 82, 90, 107, 108, 109, 120[116] J. Meng, D. Tarjan, and K. Skadron. Dynamic Warp Subdivision forIntegrated Branch and Memory Divergence Tolerance. In Proceedings ofthe International Symposium on Computer Architecture (ISCA), pages235–246, 2010. → pages 114[117] J. Meng, J. Sheaffer, and K. Skadron. Robust SIMD: Dynamically AdaptedSIMD Width and Multi-Threading Depth. In Proceedings of theInternational Parallel and Distributed Processing Symposium (IPDPS),pages 107–118, May 2012. → pages 121143[118] D. Merrill, M. Garland, and A. Grimshaw. Scalable GPU Graph Traversal.In Proceedings of the Symposium on Principles and Practice of ParallelProgramming (PPOPP), pages 117–128, August 2012. → pages 80[119] D. Merrill, M. Garland, and A. Grimshaw. Scalable GPU Graph Traversal.In Proceedings of the Symposium on Principles and Practice of ParallelProgramming (PPOPP), pages 117–128, 2012. → pages 19[120] S. Mittal. A Survey of Techniques for Managing and Leveraging Caches inGPUs. Journal of Circuits, Systems and Computers, 23(08), 2014. →pages 124[121] G. E. Moore. Cramming more components onto integrated circuits.Electronics, 38(8):114–117, 1965. → pages 1[122] S. Naffziger, J. Warnock, and H. Knapp. When Processors Hit the PowerWall (or “When the CPU hits the fan”). In Proceedings of the InternationalSolid State Circuits Conference (ISSCC), 2005. → pages 1[123] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, andY. N. Patt. Improving GPU Performance via Large Warps and Two-LevelWarp Scheduling. In Proceedings of the International Symposium onMicroarchitecture (MICRO), pages 308–317, December 2011. 
→ pages 18,34, 82, 107, 109, 113, 114, 120[124] C. Nugteren, G.-J. van den Braak, and H. Corporaal. A Study of thePotential of Locality-Aware Thread Scheduling for GPUs. In Euro-Par2014: Parallel Processing Workshops, Lecture Notes in Computer Science,pages 146–157. 2014. → pages 124[125] C. Nugteren, G.-J. van den Braak, H. Corporaal, and H. Bal. A DetailedGPU Cache Model Based on Reuse Distance Theory. In Proceedings of theInternational Symposium on High-Performance Computer Architecture(HPCA), pages 37–48, 2014. → pages 124[126] J. Philbin, J. Edler, O. J. Anshus, C. C. Douglas, and K. Li. ThreadScheduling for Cache Locality. In Proceedings of the InternationalConference on Architectural Support for Programming Languages andOperation Systems (ASPLOS), 1996. → pages 117[127] B. Pichai, L. Hsu, and A. Bhattacharjee. Architectural Support for AddressTranslation on GPUs: Designing Memory Management Units forCPU/GPUs with Unified Address Spaces. In Proceedings of the144International Conference on Architectural Support for ProgrammingLanguages and Operation Systems (ASPLOS), pages 743–758, 2014. →pages 124[128] J. Pomerene, T. Puzak, R. Rechtschaffen, and F. Sparacio. US Patent#4,807,110: Prefetching system for a cache having a second directory forsequentially accessed blocks, Feb. 21 1989. → pages 118[129] M. K. Qureshi and Y. N. Patt. Utility-Based Cache Partitioning: ALow-Overhead, High-Performance, Runtime Mechanism to PartitionShared Caches. In Proceedings of the International Symposium onMicroarchitecture (MICRO), pages 423–432, 2006. → pages 118[130] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer. AdaptiveInsertion Policies for High Performance Caching. In Proceedings of theInternational Symposium on Computer Architecture (ISCA), pages381–391, 2007. → pages 5, 117[131] M. K. Qureshi, M. A. Suleman, and Y. N. Patt. Line Distillation:Increasing Cache Capacity by Filtering Unused Words in Cache Lines. 
InProceedings of the International Symposium on High-PerformanceComputer Architecture (HPCA), pages 250–259, 2007. → pages 119[132] M. Rhu and M. Erez. CAPRI: Prediction of Compaction-adequacy forHandling Control-divergence in GPGPU Architectures. In Proceedings ofthe International Symposium on Computer Architecture (ISCA), pages61–71, June 2012. → pages 82, 107, 120[133] M. Rhu and M. Erez. The Dual-Path Execution Model for Efficient GPUControl Flow. In Proceedings of the International Symposium onHigh-Performance Computer Architecture (HPCA), pages 235–246,February 2013. → pages 107, 109, 120[134] M. Rhu and M. Erez. Maximizing SIMD Resource Utilization in GPGPUswith SIMD Lane Permutation. In Proceedings of the InternationalSymposium on Computer Architecture (ISCA), pages 356–367, June 2013.→ pages 82, 107, 120[135] M. Rhu, M. Sullivan, J. Leng, and M. Erez. A Locality-aware MemoryHierarchy for Energy-efficient GPU Architectures. In Proceedings of theInternational Symposium on Microarchitecture (MICRO), pages 86–98,2013. → pages 124145[136] T. G. Rogers. CCWS Simulation Infrastructure.∼tgrogers/ccws.html, 2013. Accessed July 6, 2015.→ pages 63[137] T. G. Rogers, M. O’Connor, and T. M. Aamodt. Cache-ConsciousWavefront Scheduling. In Proceedings of the International Symposium onMicroarchitecture (MICRO), 2012. → pages iv, 9, 48, 49, 52, 54, 62, 64,65, 66, 71, 72, 92[138] T. G. Rogers, M. O’Connor, and T. M. Aamodt. Cache-Conscious ThreadScheduling for Massively Multithreaded Processors. IEEE Micro, SpecialIssue: Micro’s Top Picks from 2012 Computer Architecture Conferences,2013. → pages iv[139] T. G. Rogers, M. O’Connor, and T. M. Aamodt. Divergence-Aware WarpScheduling. In Proceedings of the International Symposium onMicroarchitecture (MICRO), 2013. → pages iv[140] T. G. Rogers, M. O’Connor, and T. M. Aamodt. Learning Your Limit:Managing Massively Multithreaded Caches Through Scheduling.Communications of the ACM, December 2014. → pages iv[141] T. G. Rogers, D. R. 
Johnson, M. O’Connor, and S. W. Keckler. A VariableWarp Size Architecture. In Proceedings of the International Symposium onComputer Architecture (ISCA), 2015. → pages iv[142] S. Rul, H. Vandierendonck, J. D’Haene, and K. De Bosschere. AnExperimental Study on Performance Portability of OpenCL Kernels. InApplication Accelerators in High Performance Computing, 2010. → pages5[143] R. Saavedra-Barrera, D. Culler, and T. von Eicken. Analysis ofMultithreaded Architectures for Parallel Computing. In Proceedings of theSecond Annual ACM Symposium on Parallel Algorithms and Architectures,1990. → pages 116[144] S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T. Anderson. Eraser:A Dynamic Data Race Detector for Multithreaded Programs. ACMTransactions on Computer Systems, pages 391–411, 1997. → pages 130[145] S. L. Scott and G. S. Sohi. The Use of Feedback in Multiprocessors and ItsApplication to Tree Saturation Control. IEEE Transactions on Parallel andDistributed Systems, 1(4):385–398, 1990. → pages 112146[146] V. Seshadri, O. Mutlu, M. A. Kozuch, and T. C. Mowry. TheEvicted-address Filter: A Unified Mechanism to Address Both CachePollution and Thrashing. In Proceedings of the International Conferenceon Parallel Architectures and Compilation Techniques (PACT), pages355–366, 2012. → pages 119[147] A. Sethia and S. Mahlke. Equalizer: Dynamic Tuning of GPU Resourcesfor Efficient Execution. In Proceedings of the International Symposium onMicroarchitecture (MICRO), pages 647–658, 2014. → pages 124[148] A. Sethia, D. Jamshidi, and S. Mahlke. Mascar: Speeding up GPU Warpsby Reducing Memory Pitstops. In Proceedings of the InternationalSymposium on High-Performance Computer Architecture (HPCA), pages174–185, 2015. → pages 114, 124[149] M. Shah, R. Golla, G. Grohoski, P. Jordan, J. Barreh, J. Brooks,M. Greenberg, G. Levinsky, M. Luttrell, C. Olson, Z. Samoail, M. Smittle,and T. Ziaja. Sparc T4: A Dynamically Threaded Server-on-a-Chip. Micro,IEEE, 32(2):8 –19, march-april 2012. 
→ pages 6[150] A. Snavely and D. M. Tullsen. Symbiotic Jobscheduling for aSimultaneous Multithreaded Processor. In Proceedings of the InternationalConference on Architectural Support for Programming Languages andOperation Systems (ASPLOS), pages 234–244, 2000. → pages 117[151] A. Snavely, D. M. Tullsen, and G. Voelker. Symbiotic Jobscheduling withPriorities for a Simultaneous Multithreading Processor. In Proceedings ofthe 2002 ACM SIGMETRICS International Conference on Measurementand Modeling of Computer Systems (SIGMETRICS), pages 66–76, 2002.→ pages 117[152] G. S. Sohi, S. E. Breach, and T. N. Vijaykumar. Multiscalar Processors. InProceedings of the International Symposium on Computer Architecture(ISCA), 1995. → pages 48[153] S. Song, M. Lee, J. Kim, W. Seo, Y. Cho, and S. Ryu. Energy-EfficientScheduling for Memory-Intensive GPGPU workloads. In Design,Automation and Test in Europe Conference and Exhibition (DATE), 2014,pages 1–6, 2014. → pages 124[154] D. Spoonhower, G. Blelloch, and R. Harper. Using Page Residency toBalance Tradeoffs in Tracing Garbage Collection. In Proceedings of147International Conference on Virtual Execution Environments (VEE 2005),pages 57–67. → pages 30, 31, 65[155] Suleman, M. Aater and Qureshi, Moinuddin K. and Patt, Yale N.Feedback-driven Threading: Power-efficient and High-performanceExecution of Multi-threaded Workloads on CMPs. In Proceedings of theInternational Conference on Architectural Support for ProgrammingLanguages and Operation Systems (ASPLOS), pages 277–286, 2008. →pages 117[156] D. Tarjan, J. Meng, and K. Skadron. Increasing Memory Miss Tolerancefor SIMD Cores. In Proceedings of the International Conference on HighPerformance Computing, Networking, Storage and Analysis (SC),November 2009. → pages 107, 120[157] R. Thekkath and S. J. Eggers. The Effectiveness of Multiple HardwareContexts. 

