UBC Theses and Dissertations

Architectural support for inter-thread synchronization in SIMT architectures ElTantawy, Ahmed Mohammed 2018

Architectural Support for Inter-Thread Synchronization in SIMT Architectures

by

Ahmed Mohammed ElTantawy

BSc, Electronics and Electrical Communications, Cairo University, 2009
MSc, Electronics and Electrical Communications, Cairo University, 2011

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Electrical and Computer Engineering)

The University of British Columbia
(Vancouver)

January 2018

© Ahmed Mohammed ElTantawy, 2018

Abstract

Single-Instruction Multiple-Threads (SIMT) architectures have seen widespread interest in accelerating data parallel applications. In the SIMT model, small groups of scalar threads operate in lockstep. Within each group, current SIMT implementations serialize the execution of threads that follow different paths, and to ensure efficiency, revert to lockstep execution as soon as possible. These thread scheduling constraints may cause a deadlock-free program on a multiple-instruction multiple-data architecture to deadlock on a SIMT machine. Further, fine-grained synchronization is often implemented using busy-wait loops. However, busy-wait synchronization incurs significant overheads and existing CPU solutions do not readily translate to SIMT architectures. In this thesis, we tackle these challenges. First, we propose a static analysis technique that detects SIMT deadlocks by inspecting the application control flow graph (CFG). We further propose a CFG transformation that avoids SIMT deadlocks when synchronization is local to a function. The static detection has a false detection rate of 4%–5%. The automated transformation has an average performance overhead of 8.2%–10.9% compared to manual transformation. We also propose an adaptive hardware reconvergence mechanism that supports MIMD synchronization without changing the application CFG.
Our hardware approach performs on par with the compiler transformation but avoids key limitations in the compiler-only solution. We show that this hardware can be further extended to support concurrent multi-path execution to improve the performance of divergent applications. Finally, we propose a hardware warp scheduling policy informed by a novel hardware mechanism for accurately detecting busy-wait synchronization on GPUs. When employed, it deprioritizes spinning warps, achieving a speedup of 42.7% over Greedy Then Oldest scheduling.

Lay Summary

This thesis proposes techniques to ease the programmability of General Purpose Graphics Processing Units, a widely used class of general purpose accelerators, while maintaining their performance and energy efficiency. This enables leveraging the power of these accelerators in wider application domains.

Preface

The following is a list of my publications during the PhD program in chronological order.

[C1] Jingwen Leng, Tayler Hetherington, Ahmed ElTantawy, Syed Gilani, Nam Sung Kim, Tor M. Aamodt, Vijay Janapa Reddi. Proc. IEEE/ACM Symposium on Computer Architecture (ISCA), 2013.

[C2] Ahmed ElTantawy, Jessica Wenjie Ma, Mike O'Connor, Tor M. Aamodt. A Scalable Multi-Path Microarchitecture for Efficient GPU Control Flow. In proceedings of IEEE Symposium on High Performance Computer Architecture (HPCA), 2014.

[C3] Ahmed ElTantawy, Tor M. Aamodt. MIMD Synchronization on SIMT Architectures. In proceedings of IEEE/ACM Symposium on Microarchitecture (MICRO), 2016.

[C4] Ahmed ElTantawy, Tor M. Aamodt. BOWS: A Warp Scheduling Policy for Busy-Wait Synchronization in SIMT Architectures. Accepted for publication in IEEE Symposium on High Performance Computer Architecture (HPCA), 2018.

[TR1] Ahmed ElTantawy, Tor M. Aamodt. Correctness Discussion of a SIMT-induced Deadlock Elimination Algorithm.
Technical Report, University of British Columbia.

The preceding publications have been included or used in this thesis as follows:

• Chapter 1 uses motivational elements from [C3] and [C4].
• Chapter 2 uses background section material from [C2], [C3], and [C4].
• Chapter 3 presents a version of a subset of the material published in [C3] that is related to SIMT-induced deadlock definition and detection. I performed the research, interpreted the data and wrote the manuscript with guidance and input from Professor Tor M. Aamodt.
• Chapter 4 presents a version of a subset of the material published in [C3] that is related to SIMT-induced deadlock elimination using compiler techniques. I performed the research, interpreted the data and wrote the manuscript with guidance and input from Professor Tor M. Aamodt.
• Chapter 5 presents a version of a subset of the material published in both [C2] and [C3] that is related to a hardware mechanism that enables flexible thread scheduling in SIMT architectures. I performed the research, interpreted the data and wrote the manuscript with guidance and input from Professor Tor M. Aamodt.
• Chapter 6 presents a version of a subset of the material submitted in [C4]. I performed the research, interpreted the data and wrote the manuscript with guidance and input from Professor Tor M. Aamodt. The tool proposed in [C1] is used in this chapter for energy evaluation.
• Chapter 7 presents a version of the material submitted in [C2]. I performed the research, interpreted the data and wrote the manuscript with guidance and input from Mike O'Connor and Professor Tor M. Aamodt. Jessica Wenjie Ma's ideas significantly influenced the final scoreboard design presented in the chapter, and she also performed the scoreboard area analysis. The tool proposed in [C1] is used in this chapter for energy evaluation.
• Chapter 8 uses the related work sections in [C2], [C3], and [C4].
• Chapter 9 uses conclusion text from [C2], [C3], and [C4].
• Chapter A presents a version of [TR1].
I developed the proof and drafted the technical report under the guidance of Professor Tor M. Aamodt.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments

1 Introduction
  1.1 SIMT execution Model Potential
  1.2 SIMT Model Interaction with Thread Synchronization
  1.3 Thesis Statement
  1.4 Methodology
  1.5 Contributions
  1.6 Organization
2 Background
  2.1 Baseline SIMT Architectures
  2.2 The SIMT Programming Model
  2.3 Thread Scheduling in SIMT Architectures
    2.3.1 Threads from The same Warp
    2.3.2 Threads from Different Warps
3 SIMT-Induced Deadlocks: Definition and Detection
  3.1 SIMT Scheduling Constraints
    3.1.1 SIMT-Induced Deadlocks
    3.1.2 Causes of SIMT Deadlocks
  3.2 SIMT Deadlocks Impact on Programmability
    3.2.1 Ease of Programmability
    3.2.2 Programming Language Abstraction
    3.2.3 OpenMP Case Study: Synchronization Primitives Library
  3.3 SIMT Deadlock Detection
    3.3.1 SIMT Deadlock Detection Algorithm
    3.3.2 SIMT Deadlock Detection Limitations
  3.4 Methodology
  3.5 Evaluation
  3.6 Related Work
  3.7 Summary, Conclusion and Future Directions
4 SSDE: Static SIMT-induced Deadlock Elimination
  4.1 Safe Reconvergence Points Identification
  4.2 SSDE: Static SIMT Deadlock Elimination
    4.2.1 Elimination Algorithm
    4.2.2 Compatibility with Nvidia GPUs
    4.2.3 SSDE Limitations
  4.3 Methodology
  4.4 Evaluation
    4.4.1 Static Overheads
    4.4.2 Dynamic Overheads
    4.4.3 OpenMP support
  4.5 Related Work
  4.6 Summary, Conclusion and Future Directions
5 AWARE: Adaptive Warp Reconvergence
  5.1 Decoupled SIMT Tables
  5.2 Warp Splits Scheduling
  5.3 Nested Divergence
  5.4 Using AWARE to avoid SIMT Deadlock
    5.4.1 Handling Divergent Barriers
    5.4.2 Delayed Reconvergence
    5.4.3 Timed-out Reconvergence
  5.5 AWARE Implementation
    5.5.1 AWARE Basic Implementation
    5.5.2 AWARE Virtualized Implementation
  5.6 Methodology
  5.7 Evaluation
  5.8 Related Work
  5.9 Summary, Conclusion and Future Directions
6 BOWS: Back-Off Warp Spinning
  6.1 Sensitivity to Warp Scheduling
  6.2 BOWS: Backoff Warp Spinning
    6.2.1 BOWS scheduling policy
  6.3 DDOS: Dynamic Detection of Spinning
    6.3.1 DDOS Operation
    6.3.2 DDOS Design Trade-offs
    6.3.3 DDOS integration with BOWS
  6.4 Methodology
  6.5 Evaluation
    6.5.1 Sensitivity to Back-off Delay Limit Value
    6.5.2 Sensitivity to Detection Errors
    6.5.3 Sensitivity to Contention
    6.5.4 Pascal GTX1080Ti Evaluation
    6.5.5 Implementation Cost
  6.6 Related Work
  6.7 Summary, Conclusion and Future Directions
7 MP: Multi-Path Concurrent Execution
  7.1 Stack-Based Reconvergence Performance Limitations
  7.2 Multi-Path IPDOM (MP IPDOM)
    7.2.1 Warp Split Scheduling
    7.2.2 Scoreboard Logic
  7.3 Opportunistic Early Reconvergence
  7.4 Methodology
  7.5 Experimental Results
    7.5.1 SIMD Unit Utilization
    7.5.2 Thread Level Parallelism
    7.5.3 Idle Cycles
    7.5.4 Impact on Memory System
    7.5.5 Overall Performance
    7.5.6 Implementation Complexity
  7.6 Related Work
  7.7 Summary, Conclusion and Future Directions
8 Related Work
  8.1 Enabling Thread Synchronization in GPGPUs
  8.2 Warp Scheduling Policies in GPGPUs
  8.3 Alternate GPGPU Execution Models
  8.4 Verification Tools for SIMT architectures
9 Conclusions and Future Work
  9.1 Conclusion
  9.2 Potential Areas of Impact
  9.3 Directions of Future Work
    9.3.1 SIMT Synchronization APIs
    9.3.2 Runtime Livelock Detection
    9.3.3 Reconvergence adequacy prediction
Bibliography
A SSDE Correctness Discussion
  A.1 Proof Outline
  A.2 Proof Details

List of Tables

Table 3.1 Evaluated Kernels
Table 3.2 Detection Pass Results on CUDA and OpenCL Kernels
Table 4.1 Code Configuration Encoding
Table 4.2 Static Overheads for the Elimination Algorithm
Table 4.3 OpenMP Kernels (Normalized Execution Times)
Table 5.1 AWARE vs SSDE
Table 5.2 Storage Cost in Bits per Hardware Warp
Table 5.3 GPGPUSim Configuration
Table 6.1 Spin Detection Sensitivity to Design Parameters
Table 6.2 DDOS and BOWS Implementation Costs
Table 7.1 GPGPUSim Configuration
Table 7.2 Studied Benchmarks

List of Figures

Figure 1.1 SIMT-Induced Deadlock
Figure 1.2 Fine-grained Synchronization in current GPGPUs. Both CPU and GPU versions are compiled with NVCC-6.5 -O3. Overheads are measured on Pascal GTX1080 using Nvidia profiler (nvprof) launching 120 blocks each of 256 threads (74.88% occupancy of the Pascal GPU).
Figure 2.1 Baseline architecture
Figure 2.2 CUDA Programming Model. (This diagram reproduces Figure 7 and Figure 8 from The CUDA Programming Guide [111].)
Figure 2.3 Divergent code example
Figure 3.1 Stack-Based Reconvergence
Figure 3.2 Barriers in divergent code
Figure 3.3 Critical Section Implementation
Figure 3.4 HashTable Insertion - Version 1 [129]
Figure 3.5 HashTable Insertion - Version 2
Figure 3.6 Modified Critical Section Implementation
Figure 3.7 HashTable Insertion Versions Comparison (2M Random Insertion in 1024 HashTable Entries). Recent GPUs are to the right.
Figure 3.8 Synchronization patterns used in multi-threaded applications. Examples are from the graph analytics API in the Cloud suite [40].
Figure 3.9 PTX output of compiling the code in Figure 3.5 using NVCC 6.5 and NVCC 8.0 with default optimizations enabled
Figure 3.10 OpenMP Clang 3.8 Frontend Implementation for OpenMP 4.0 Runtime Library Synchronization Calls omp_set_lock and omp_unset_lock
Figure 3.11 OpenMP Microbenchmark
Figure 3.12 addByte function in Histogram256 [120]
Figure 4.1 SIMT-induced deadlock scenarios
Figure 4.2 SIMT-Induced Deadlock Elimination Steps
Figure 4.3 Normalized Accumulative GPU Execution Time
Figure 4.4 Normalized Kernel Execution Time
Figure 4.5 Normalized Dynamic Instruction Count
Figure 4.6 Average SIMD Utilization
Figure 5.1 Execution with AWARE
Figure 5.2 Example of Multi-Path IPDOM execution with nested divergence
Figure 5.3 Handling Barriers
Figure 5.4 Delayed Reconvergence
Figure 5.5 Timed-Out Reconvergence
Figure 5.6 AWARE implementation
Figure 5.7 AWARE Virtualized Implementation
Figure 5.8 Normalized Kernel Execution Time
Figure 5.9 Normalized Accumulated GPU Execution Time
Figure 5.10 Normalized Dynamic Instruction Count
Figure 5.11 Average SIMD Utilization
Figure 5.12 Sensitivity to the TimeOut value (in cycles). "inf+DR" refers to a time-out that is infinity but with delayed reconvergence.
Figure 5.13 Effect of AWARE Virtualization on Performance
Figure 6.1 Synchronization Status Distribution. Bars from left to right: LRR, GTO, and CAWA. GPGPU-Sim with a GTX480 configuration. See Section 7.4 for details.
Figure 6.2 Software Backoff Delay Performance in GPUs. *omp_set_lock GPU implementation for OpenMP 4.0 [15].
Figure 6.3 BOWS scheduling Policy
Figure 6.4 Adaptive Back-off Delay Limit Estimation
Figure 6.5 Examples of Inter-Thread Synchronization Patterns used in GPUs (See Section 7.4 for more details)
Figure 6.6 Warp History Registers and SIB-PT Operation (Figure 6.7 shows the units' locations in the pipeline)
Figure 6.7 Operation of BOWS with DDOS
Figure 6.8 Performance and Energy Savings on GTX480 (Fermi)
Figure 6.9 Normalized Execution Time at Different Back-off Delay Limit Values (using DDOS)
Figure 6.10 Distribution of Warps at the Scheduler. From left to right, GTO without BOWS, GTO with BOWS with delay limit in cycles 0, 500, 1000, 3000, 5000, Adaptive.
Figure 6.11 Distribution of Warps at the Scheduler. From left to right, GTO without BOWS, GTO with BOWS with delay limit in cycles 0, 500, 1000, 3000, 5000, Adaptive.
Figure 6.12 BOWS Impact on Dynamic Overheads
Figure 6.13 Overheads Due to Detection Errors
Figure 6.14 Performance and Energy Savings on Pascal
Figure 6.15 Sensitivity to Contention
Figure 7.1 Divergent code example
Figure 7.2 Execution with the Stack-Based Reconvergence Model. The figure refers to the stack as a Single Path Stack to distinguish it from later proposals that support dual path execution [123].
Figure 7.3 Fraction of running scalar threads while varying maximum warp splits and assuming IPDOM reconvergence
Figure 7.4 Execution with Multi-Path IPDOM
Figure 7.5 Changes required to the scoreboard logic to support Multi-Path IPDOM
Figure 7.6 MP IPDOM scoreboard example
Figure 7.7 Unstructured control flow
Figure 7.8 Operation of the Multi-Path with the Opportunistic Reconvergence (OREC) enabled
Figure 7.9 SIMD units utilization
Figure 7.10 Warp splits to warp ratio
Figure 7.11 Average breakdown of threads' state at the scheduler
Figure 7.12 Idle cycles
Figure 7.13 Inst. cache misses (16KB I$)
Figure 7.14 L1 data cache misses (32KB D$)
Figure 7.15 Overall speedup
Figure 9.1 Illustration of the use of named locks
Figure A.1 Visualization of T
Glossary

AMD: Advanced Micro Devices
AWARE: Adaptive Warp Reconvergence
ALU: Arithmetic Logic Unit
BOWS: Back-off Warp Spinning
CUDA: Compute Unified Device Architecture
CPU: Central Processing Unit
DDOS: Dynamic Detection of Spinning
DRAM: Dynamic Random Access Memory
GPGPU: General Purpose Graphics Processing Unit
GPU: Graphics Processing Unit
GTO: Greedy-Then-Oldest
I-BUFFER: Instruction Buffer
I-CACHE: Instruction Cache
IBM: International Business Machines
IPC: Instructions Per Cycle
IPDOM: Immediate Post Dominator
L1: Level One
L2: Level Two
LRR: Loose Round Robin
MIMD: Multiple Instructions Multiple Data
MP: Multi-Path
OPENCL: Open Computing Language
OPENMP: Open Multi-Processing
PC: Program Counter
PDOM: Post Dominator
RT: Reconvergence Table
SIMT: Single Instruction Multiple Threads
SIMD: Single Instruction Multiple Data
SM: Streaming Multiprocessor
ST: Splits Table
SSDE: Static SIMT Deadlock Elimination
TLP: Thread-Level Parallelism

Acknowledgments

"All praise and thanks are for God, the One who, by His blessing and favor, perfected good works are accomplished.", Prophet Mohammed.

It has been a long journey full of ups and downs. But there are people who made this journey worth taking. First, I shall start by thanking my parents, my mother Elham and my father Mohammed. None of this would have been possible without your unconditional love throughout my life. I hope I can one day be the one you brought me up to be. I also want to thank my sisters Hanan, Samah, and Amal and my brother Hossam. It was hard to be away from such a lovely family all these years, with a single visit home during my PhD, but I wish to make this up to you in the coming years. There is one person, though, with whom I literally shared all the joy and pain during this journey, my wife. I could not have done it without you. You were always a great support. There were times when I put tremendous pressure on you, while you yourself were also doing your PhD. But you were always there for me. I love you.
I cannot forget Mazin, my little monkey; you brought all the joy of life with you in my last couple of years in the PhD journey. I was also blessed by having a great community in Vancouver, especially on the UBC campus. Thanks to every member of this community. Thank you to all my friends who made this journey enjoyable.

I was very blessed to join a world class research group led by Professor Tor. I would like to thank Professor Tor for all the time and support that I always found during my PhD journey. This work would not have been possible without such support. I would also like to thank all my lab colleagues, Wilson, Andrew, Tayler, Tim, Ayoub, Hadi, Shadi, Myrice, Dave, Ali, Inder, Jimmy, Mahmoud, and Amruth. I learned a lot from all of you.

Chapter 1

Introduction

Over the last few decades, semiconductor process technology has been advancing according to Moore's Law, which states that the density of transistors on integrated circuits doubles about once every two years. This increase in the number of transistors has been utilized to improve the single thread performance in general purpose CPUs. However, CPUs have hit a major challenge known as the power wall [98], which limits the increase in single thread performance. Therefore, computer architects have moved towards energy efficient parallel architectures. Massively multithreaded architectures, such as graphics processing units (GPUs), mitigate the power problem by running thousands of threads in parallel at lower frequencies, and amortizing the cost of fetching, decoding and scheduling instructions by executing them in a single instruction multiple data (SIMD) fashion.

These properties have motivated the computer industry to transform GPUs from merely fixed function accelerators for graphics into programmable compute accelerators.
For such a transformation to happen, there was a need to develop adequate programming models for GPUs that allow non-graphics applications to utilize the computing power of GPUs without using graphics-oriented APIs. This led to the development of general purpose programming models for GPUs such as CUDA [111] and OpenCL [8]. The resultant programming model is referred to as the Single Instruction Multiple Thread (SIMT) model. The SIMT model has seen widespread interest and similar models have been adopted in CPU architectures with wide-vector support [64].

Using such programming models, software developers have demonstrated that SIMT architectures have significant potential in cost-effective computing for data-parallel applications with regular control-flow and regular memory access patterns [110]. However, it is quite challenging to obtain similar results on applications that have a significant portion of their instruction streams common across threads yet feature non-uniform control behaviour, irregular memory access patterns, and/or inter-thread synchronization [25]. This motivated the computer architecture community to study modifications to the graphics-based SIMT architectures to allow for efficient acceleration of a wider scope of general purpose applications [11, 41, 91, 99, 124]. This thesis is part of this ongoing work.

1.1 SIMT execution Model Potential

Traditional Single Instruction Multiple Data (SIMD) architectures are hard to program. The underlying hardware provides little support for arbitrary memory access and control flow divergence [63, 128]. Thus, it is essential for the code running on such machines to be explicitly vectorized. The vectorization task is fully or partially performed by programmers [61]. Compilers can perform automatic vectorization but they fail on simple cases due to uncertainty about loop iterations' dependencies or non-uniform memory access stride [62].
Thus, traditional SIMD systems are either hard to program and/or limited in scope.

This has significantly changed with the introduction of the Single Instruction Multiple Threads (SIMT) architectures. In SIMT architectures, the hardware, with minimal help from the compiler, supports arbitrary memory accesses and control flow divergence. This abstracts away the complexity of the underlying SIMD hardware, allowing for much simpler programming models. The single-instruction multiple-thread (SIMT) programming model was originally introduced and popularized for graphics processor units (GPUs) along with the introduction of CUDA [111], but it has seen widespread interest and similar models have been adopted in CPU architectures with wide-vector support [64].

Arguably, a key reason for the success of this model is its abstraction of the underlying SIMD hardware. In SIMT-like execution models, scalar threads are combined into groups that execute in lockstep on single-instruction multiple-data (SIMD) units. These groups are called warps by NVIDIA [111], wavefronts by AMD [4], and gangs by Intel [64]. The SIMT programming model divides the burden of identifying parallelism differently than traditional approaches of vector machines. The programmer, who is armed with application knowledge, identifies far-flung outer-loop parallelism and specifies the required behaviour of a single thread in the parallel region. The hardware implicitly handles control flow and memory divergence within threads of the same warp. Thus, with this abstraction, programmers can leverage the underlying SIMD hardware without having to deal with explicit vectorization.

However, current implementations for this desired abstraction are still far from perfect. In situations that involve inter-thread synchronization, the SIMD nature of the underlying hardware induces special types of deadlocks that would not happen otherwise.
Further, recent SIMT implementations still suffer excessive performance overheads under non-uniform control behaviour, irregular memory access patterns, and/or inter-thread synchronization. This negatively impacts the programmability of SIMT architectures on irregular applications as it forces programmers to be aware of the details of the SIMT implementation to write functionally correct and optimized code.

1.2 SIMT Model Interaction with Thread Synchronization

On current hardware the SIMT model is implemented via predication, or in the general case using stack-based masking of execution units [5, 18, 29, 64, 77]. This mechanism enables threads within the same warp to diverge (i.e., follow different control flow paths). To do this, the hardware forces divergent threads to serialize their execution and then restores SIMD utilization by forcing divergent threads to reconverge as soon as possible (typically at the immediate postdominator point of the divergent branch) [29, 60, 64]. This mechanism creates implicit scheduling constraints for divergent threads within a warp, which leads to programmability implications. For example, when a GPU kernel code is written in such a way that the programmer intends divergent threads to communicate, these scheduling constraints can lead to surprising (from a programmer perspective) deadlock and/or livelock conditions. Thus, a multi-threaded program that is guaranteed to terminate on a MIMD architecture may not terminate on machines with current SIMT implementations [47] (see footnote 1).

[Figure 1.1: SIMT-Induced Deadlock. Code and control flow graph:

    A: *mutex = 0
    B: while(atomicCAS(mutex,0,1) != 0);
    C: // critical section
       atomicExch(mutex,0);

The graph has nodes A, B, C, with a synchronization back edge on B and the reconvergence point at C. The thread that diverged to C is blocked at the reconvergence point while the threads that diverged to B keep spinning.]

Figure 1.1 shows a typical MIMD implementation of a spin lock guarding a critical section. On a SIMT machine, this code deadlocks.
In particular, a thread that acquires the lock is indefinitely blocked at the loop exit waiting to reconverge with lagging threads from the same warp. However, lagging threads never exit the loop because they wait for the lock to be released by the leading thread. Similar scenarios occur with fine-grained synchronization. We refer to a case where the forward progress of a diverged thread is prevented due to the implicit SIMT scheduling constraints as SIMT-induced deadlock or briefly SIMT deadlock.

The possibility of SIMT-induced deadlocks is a challenge, given the increasing interest in using SIMT architectures for irregular applications [25, 54, 58, 78, 90, 92, 96, 152]. Moreover, parallel algorithms developed for MIMD execution can serve as starting points for GPU kernel development provided SIMT deadlock can be avoided. For complex applications, writing functionally correct code can be challenging as programmers need to reason about how synchronization interacts with the SIMT implementation. Further, the code is vulnerable to compiler optimizations that may modify the control flow graph (CFG) assumed by programmers. SIMT deadlocks also present challenges to emerging OpenMP support for SIMT architectures [7, 15, 71, 101, 117] and to the transparent vectorization of multi-threaded code on SIMD CPU architectures [64, 140].

Footnote 1: We use the term "MIMD machine" to refer to any architecture that guarantees loose fairness in thread scheduling so that threads not waiting on a programmer synchronization condition make forward progress.

Note that during the writing of this dissertation, on May 10th 2017, Nvidia revealed some details about their newest GPGPU architecture, Volta [103]. Volta supports "independent thread scheduling" to avoid thread synchronization deadlocks on earlier architectures and to enable interleaving the execution of divergent control flow paths.
To the best of our knowledge, this is the first SIMT architecture on the market that supports such independent thread scheduling. No details were provided about how it is implemented. However, this dissertation proposes one way in which such independent thread scheduling can be implemented in hardware to avoid synchronization deadlocks (Chapter 5) and to interleave divergent control flow paths (Chapter 7) while maintaining SIMT-efficient execution. In this dissertation, we use the term current or recent SIMT architectures to refer to SIMT architectures prior to Volta.

Aside from these functional limitations imposed by current SIMT implementations, there are also performance implications. Let us consider a manual workaround for the SIMT deadlock problem, shown in Figure 1.2a. In Figure 1.2a, the code is structured such that the atomicExch release statement (line 18) is contained within the spin loop to avoid SIMT-induced deadlocks. In this code, threads that successfully acquire the lock are guaranteed to be able to make forward progress to the lock release code. This workaround handles a simple case where there is a single lock acquire statement with a single lock release statement that postdominates the lock acquire. Thus, the required code transformation is relatively simple to reason about using high level semantics, provided the programmer is aware of the details of the reconvergence mechanism. This is, however, not necessarily true for more complex synchronization patterns (more details in Chapter 3). Next we illustrate the performance implications of the SIMT model on this code even after SIMT deadlocks are avoided.

The code in Figure 1.2a implements a critical section in a CUDA hashtable. Hashtables on GPUs are used in key-value store applications [54], text mining [157, 158], state space exploration [146], DNA alignment [159], and others [1].
The implementation we study here is an optimized version of the one in NVIDIA’s CUDA by Example [129] (more details in Chapter 3). Figure 1.2b compares the execution time of 26.2 million insertions of random keys to this hashtable on a GPU versus on a CPU while varying the number of hashtable buckets.

(a) Critical Section in Hashtable Insertion
     5. unsigned int key = keys[tid];
     6. size_t hashValue = hash(key, table.count);
     7. Entry *location = &(table.pool[tid]);
     8. location->key = key;
     9. location->value = values[tid];
    10. bool done = false;
    11. while(!done){
    12.   if(atomicCAS(lock[hashValue].mutex, 0, 1) == 0){
    13.     __threadfence();
    14.     location->next = table.entries[hashValue];
    15.     table.entries[hashValue] = location;
    16.     done = true;
    17.     __threadfence();
    18.     atomicExch(lock[hashValue].mutex, 0);
    19.   }
    20. }
(b) GPU performance vs CPU. [Plot: kernel execution time (log10 of msec) versus number of hashtable buckets (128 to 4096) for Intel i7-4770K 3.50GHz, Fermi Tesla C2050 and Pascal GTX1080.]
(c) Dynamic Instructions Overheads. [Plot: distribution of dynamic instructions versus number of hashtable buckets, split into synchronization overheads and useful instructions.]
(d) Memory Traffic Overheads. [Plot: distribution of memory transactions versus number of hashtable buckets, split into synchronization and other.]
(e) Divergence Overheads. [Plot: SIMD efficiency versus number of hashtable buckets for a single warp and for multiple warps.]
Figure 1.2: Fine-grained Synchronization in current GPGPUs. Both CPU and GPU versions are compiled with NVCC-6.5 -O3. Overheads are measured on Pascal GTX1080 using Nvidia profiler (nvprof) launching 120 blocks each of 256 threads (74.88% occupancy of the Pascal GPU).
The smaller the number of buckets, the larger the contention. The figure shows the execution time on two different generations of NVIDIA GPUs, a Tesla C2050 (Fermi) and a GeForce GTX 1080 (Pascal), and on an Intel Core i7 CPU running a serial CPU version of the same hashtable code [129]. The CPU version outperforms the older Tesla C2050 for all sizes considered, while the GTX 1080 outperforms the CPU starting from 512 hashtable buckets. At 4096 buckets the GTX 1080 is 9.77× faster than the serial CPU version (more details in Chapter 6).

Nevertheless, Figures 1.2c, 1.2d, and 1.2e show significant synchronization overheads that persist in the Pascal architecture. Figure 1.2c shows that the instruction count overhead ranges from 61.0% at low contention to 98.3% at high contention. Similarly, Figure 1.2d shows that 41.5% to 95.6% of memory operations are due to synchronization. A significant portion of both overheads is due to failed lock acquire attempts. Another source of synchronization overhead, unique to GPUs, is control-flow divergence. Figure 1.2e shows that if the code is executed by a single warp, the SIMD utilization (fraction of active lanes) ranges between 87.1% and 98.6%, but drops to between 16.4% and 47.1% when executing multiple warps. This is due to inter-warp lock conflicts, which can be impacted by warp scheduling.

1.3 Thesis Statement

This dissertation primarily explores methods that enable reliable and efficient support of inter-thread synchronization on SIMT architectures. The dissertation first highlights the functional implications of the thread scheduling constraints imposed by current SIMT implementations. It shows that a special type of deadlock (SIMT-induced deadlock) can be produced by programmers and/or compilers that are oblivious to these constraints. The dissertation then proposes a novel static analysis technique to conservatively detect the presence of SIMT deadlocks in parallel kernels.
The analysis builds on the following observation: the forward progress of a thread is prevented indefinitely (i.e., SIMT deadlock occurs) if a thread enters a loop for which the exit condition depends on the value of a shared memory location and that location will only be set by another thread that is blocked due to the SIMT scheduling constraints. This analysis conservatively reports potential SIMT deadlocks. In practice it has a small false positive rate.

The dissertation also proposes two independent novel solutions for SIMT deadlocks: (1) SSDE, a Static SIMT-induced Deadlock Elimination algorithm, which alters the CFG of the application to avoid SIMT deadlocks, and (2) AWARE, an Adaptive Warp REconvergence hardware mechanism, which has the flexibility to delay or time out reconvergence without changing the application’s CFG. In SSDE, a static analysis is done first to identify code locations where reconvergence of loop exits should be moved to allow for inter-thread communication that is otherwise blocked by the SIMT constraints. For functional correctness, we can choose any postdominator point as the reconvergence point, including the kernel exit. However, from a SIMD utilization perspective, it is preferable to reconverge at the earliest postdominator point that would not block the required inter-thread communication. We call these postdominator points safe postdominators. The static analysis is then followed by a CFG transformation algorithm that enforces the recommended safe postdominators as reconvergence points for loops.

The second solution is AWARE, a MIMD-compatible reconvergence mechanism that avoids most limitations inherent in a compiler-only approach. AWARE relaxes the thread scheduling constraints imposed by current stack-based SIMT implementations. To this end, AWARE decouples the tracking of reconvergence points from the divergent control flow paths using two separate SIMT tables.
Instead of the hardware stack structure, AWARE uses a hardware FIFO structure to reorder the execution of divergent control flow paths. This replaces the unfair depth-first execution of the kernel control flow graph with a fair breadth-first execution. Further, the decoupling of reconvergence points from divergent paths gives AWARE the flexibility to delay reconvergence (using the recommended safe postdominator points from the SSDE analysis pass) and/or to time out reconvergence after a certain threshold period.

To address the overheads of busy-wait synchronization on current SIMT architectures, the dissertation proposes Back-Off Warp Spinning (BOWS), a novel hardware mechanism to dynamically detect spinning warps and modify warp scheduling. BOWS’ spin detection mechanism employs a path history register to identify repetitive execution (i.e., loops). To distinguish busy-wait synchronization loops from other loops, BOWS employs a value history register to track the values of registers used in the computation of the loop exit conditions. In loops not associated with busy-waiting, at least one of these registers typically holds the value of a loop induction variable that changes each iteration. In busy-wait loops, these registers typically maintain the same values as long as the warp is spinning. For the workloads we study, this simple mechanism accurately identifies busy-wait loops. The spin detection results then guide a scheduling policy, BOWS, that is designed to discourage spinning warps from competing for scheduler issue slots. BOWS efficiently approximates software back-off techniques used in multi-threaded CPU architectures while overcoming their limitations when applied to GPUs.
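The spin detection idea described above (a repeated path signature together with unchanged loop-exit register values) can be sketched in software as follows. This is a minimal illustrative model, not the hardware design evaluated in Chapter 6: the `SpinDetector` type, its `observe` interface, the choice of one call per loop iteration, and the FNV-1a-style fold used as a signature are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative software model only. All names are hypothetical and the
// signature hash is an arbitrary FNV-1a-style fold, not the hardware function.
struct SpinDetector {
    std::uint64_t last_path = 0;    // models the path history register
    std::uint64_t last_values = 0;  // models the value history register
    bool has_history = false;       // false until one iteration was observed

    static std::uint64_t fold(const std::vector<std::uint32_t>& words) {
        std::uint64_t h = 1469598103934665603ULL;
        for (std::uint32_t w : words) {
            h ^= w;
            h *= 1099511628211ULL;
        }
        return h;
    }

    // Call once per loop iteration with the PCs executed in that iteration and
    // the register values that feed the loop exit condition. Returns true when
    // both signatures match the previous iteration, i.e. the warp appears to
    // be busy-waiting rather than making progress through a counted loop.
    bool observe(const std::vector<std::uint32_t>& pcs,
                 const std::vector<std::uint32_t>& exit_regs) {
        assert(!pcs.empty());  // an iteration executes at least one instruction
        const std::uint64_t path = fold(pcs);
        const std::uint64_t values = fold(exit_regs);
        const bool spinning =
            has_history && path == last_path && values == last_values;
        last_path = path;
        last_values = values;
        has_history = true;
        return spinning;
    }
};
```

In this sketch a lock-acquire loop whose compare-and-swap keeps returning the same value is flagged from its second iteration onward, whereas a loop driven by an induction variable is never flagged because the value signature changes every iteration.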
In BOWS, warps that are about to execute a busy-wait iteration are removed from competition for scheduler issue slots until no other warps are ready to be scheduled and a certain time has passed since the previous iteration.

Finally, the dissertation explores an orthogonal application of the SIMT tables used in the construction of AWARE. In particular, along with additional microarchitectural modifications, the SIMT tables can be used to enable concurrent multi-path execution as a performance optimization for divergent applications. We refer to this mechanism as the Multi-Path (MP) execution model. In the MP execution model, an arbitrary number of parallel warp splits can interleave their execution while still maintaining immediate postdominator reconvergence. To enable this, the MP model proposes modifications to the scoreboard logic used to track data dependencies between registers in SIMT architectures. The modifications ensure that the scoreboard correctly handles dependencies for concurrent multi-path execution and avoids declaring false data dependencies between orthogonal parallel paths. Further, MP extends the operation of the SIMT tables to enable opportunistic early reconvergence at run-time, which improves the SIMD unit utilization of applications with unstructured control flow.

1.4 Methodology

We describe our methodology in detail before the evaluation section of each chapter. In general, however, we rely on direct hardware measurements when possible (e.g., in Chapters 3 and 4 and the motivation part of Chapter 6). If a change in current hardware is proposed, we use cycle-accurate simulation to evaluate the impact of our proposal on a baseline that simulates recent architectures. This is the standard approach for computer architecture research in both industry and academia. We use the GPGPU-Sim [11] simulator for performance simulation and GPUWattch [75] for power simulation.
Both simulators report high correlation with recent hardware and are widely used in the research community (with more than 1000 citations for GPGPU-Sim and 300 citations for GPUWattch at the time of writing this thesis). The evaluated kernels usually run for hundreds of thousands to millions of cycles on these simulators (which corresponds to hours to days of simulation time).

In our evaluation, we focus on the accelerated portion of the applications (i.e., the GPGPU kernels). Our focus is to improve GPU support for synchronization, but we make sure to evaluate the impact of our proposals on kernels that do not include such synchronization. To represent these types of kernels, we used well-known benchmark suites (e.g., [26], [11]). It was a challenging task to find existing GPGPU kernels that utilize synchronization, arguably because of the challenges discussed earlier in porting and/or developing such kernels on current GPU accelerators. However, we secured a number of kernels that show different patterns of synchronization (e.g., spin locks, nested spin locks, wait and signal, data-flow, small and large critical sections).
The kernels used represent different application domains (e.g., graph analysis, key-value store, transactions, physics simulation). We describe our benchmarks in the methodology section of each chapter.

1.5 Contributions

This thesis makes the following key contributions:
• It formalizes a special type of deadlock that is imposed by current SIMT implementations and can be introduced by programmers and/or compilers that are oblivious to the thread scheduling constraints.
• It proposes a static analysis technique that conservatively detects potential SIMT-induced deadlocks in parallel kernels.
• It proposes a static analysis that identifies safe locations to reconverge threads that diverge at loop exits while allowing for inter-thread communication under divergence.
• It proposes a code transformation algorithm that modifies the CFG to eliminate SIMT-induced deadlocks, guided by the safe reconvergence analysis. It also identifies the limitations of relying on a compiler-only approach to eliminate SIMT-induced deadlocks.
• It proposes a MIMD-compatible hardware reconvergence mechanism for SIMT architectures that avoids key limitations of the compiler-only approach.
• It proposes a low cost dynamic spin detection mechanism for SIMT architectures.
• It proposes an inter-thread synchronization aware warp scheduling policy that reduces busy-wait synchronization overheads in SIMT architectures.
• It proposes a multi-path execution model that enables concurrent execution of parallel control flow paths while maintaining reconvergence at immediate postdominators.

1.6 Organization

The rest of this dissertation is organized as follows:
• Chapter 2 details the background GPU architecture used in this dissertation.
• Chapter 3 introduces the definition of SIMT-induced deadlocks and the static analysis used for their detection.
• Chapter 4 introduces SSDE, a Static SIMT-induced Deadlock Elimination algorithm.
• Chapter 5 introduces AWARE, an Adaptive Warp Reconvergence mechanism.
• Chapter 6 introduces BOWS, a warp scheduling policy for busy-wait synchronization in SIMT architectures.
• Chapter 7 introduces MP, a Multi-Path execution model for concurrent execution of divergent paths in SIMT architectures.
• Chapter 8 discusses related work.
• Chapter 9 concludes the dissertation and discusses future work.
• Appendix A presents a semi-formal proof of the correctness of the SSDE transformation.

The chapters are ordered to maintain a coherent flow of ideas. Nevertheless, each chapter is self-contained, as it includes a brief introduction, a presentation of the new ideas, an evaluation section, a related work discussion, a conclusion, and future work directions. Chapters 8 and 9 expand on the related work discussion, conclusion, and future work directions included in each chapter.

Chapter 2
Background

This chapter reviews the necessary background material for the rest of the dissertation. Section 2.1 describes the architecture of contemporary SIMT accelerators, which acts as our baseline throughout the thesis. Section 2.2 provides a brief overview of current SIMT programming models. Section 2.3 provides more details on thread scheduling schemes in current SIMT accelerators, as they are highly relevant to the rest of the thesis.

2.1 Baseline SIMT Architectures

We study modifications to the SIMT accelerator architecture shown in Figure 2.1. This architecture is the one deployed in contemporary General Purpose Graphics Processing Units (GPGPUs) [104, 108]. Current GPGPUs consist of multiple processing cores. Each core consists of a set of parallel lanes (or SIMD units). Initially, an application begins execution on a host CPU, then a kernel is launched on the GPU in the form of a large number of logically independent scalar threads. These threads are split into logical groups operating in lockstep in a SIMD fashion (referred to as warps by Nvidia and wavefronts by AMD; see footnote 1). Each SIMT core interleaves a number of warps on a cycle-by-cycle basis.
The Instruction Buffer unit (I-Buffer) contains storage to hold decoded instructions and register dependency information for each warp.

Footnote 1: In the rest of this thesis, we adopt Nvidia’s naming convention.

[Figure 2.1: Baseline architecture. Components shown: CPU (kernel launch); SIMT core with Fetch Unit, I-Cache, Decode Unit, I-Buffer, Scoreboard, Issue, Branch Unit, Register File, ALUs and L1 Caches; Interconnection Network; Memory Partition with Last Level Caches; Off-Chip DRAM.]

The scoreboard unit is used to detect register dependencies. A branch unit manages control flow divergence. The branch unit abstracts both the storage and the control logic required for divergence and reconvergence.

The issue logic selects a warp with a ready instruction in the instruction buffer to issue for execution. Based on the active mask of the warp, threads that should not execute, due to branch divergence, are disabled. The issued instruction fetches its operands from the register file and is then executed on the corresponding pipeline (ALU or MEM).

The SIMT architecture achieves its energy efficiency by amortizing the front-end costs (i.e., fetching, decoding, and scheduling instructions) across the large number of threads within the same warp that synchronously execute the same instruction. Further, it lowers the operating frequency and relaxes the latency requirements of the memory system and functional units compared to contemporary CPUs. To hide this latency, it relies on efficient warp scheduling policies that allow for a high net instruction throughput per cycle.

2.2 The SIMT Programming Model

The SIMT programming model divides the burden of identifying parallelism differently than traditional approaches to vector parallelism.
The programmer, who is armed with application knowledge, identifies far-flung outer-loop parallelism and specifies the required behaviour of a scalar thread in the parallel region. Figure 2.2 shows a typical flow of a CUDA program. The host side (i.e., the CPU) executes the serial portion of the code, allocates the required memory on the device side (i.e., the GPU), and copies the required data between the host and the device. The programmer decides the number of logical threads required to execute the parallel region. These threads are typically organized as a grid of thread blocks. Threads in the same thread block can communicate through a low latency scratchpad memory referred to as shared memory. Threads in different thread blocks (and even different grids) can still communicate through a slower global memory.

[Figure 2.2: CUDA Programming Model. The host alternates serial code with parallel kernel launches (kernel0<<<>>>, kernel1<<<>>>); each kernel executes on the device as a grid (Grid-0, Grid-1) of thread blocks. The memory hierarchy consists of per-thread local memory, per-block shared memory, and global memory. This diagram reproduces Figure 7 and Figure 8 from The CUDA Programming Guide [111].]

The programming model provides some essential primitives that are required to manage inter-thread communication.
These primitives include thread-block scope barriers to synchronize threads within the same thread block, thread-block scope and device scope memory fences to control the observable ordering of shared and global memory reads and writes, and atomic functions that perform read-modify-write atomic operations (e.g., Compare and Swap, Add, Min, Max), where the operation is guaranteed to be performed without interference from other threads. The programmer writes the kernel code from the perspective of a scalar thread and leverages these primitives to manage inter-thread communication across the logically independent threads.

The SIMT programming model, in its essence, does not expose the SIMD nature of the hardware to programmers. It also does not expose the mapping and the scheduling of the logical threads on the available hardware resources. Thus, from a programming model perspective, individual threads progress independently unless otherwise determined by programmers through the use of explicit synchronization primitives. This makes SIMT architectures both easier to program and suitable for a larger set of applications.

2.3 Thread Scheduling in SIMT Architectures

Thread scheduling in SIMT architectures is done by the hardware. The hardware is responsible for mapping thread blocks to the different processing cores, allocating the required resources for different threads, and mapping individual threads to the SIMD units. Throughout the execution of the kernel, SIMT architectures greedily attempt to synchronize threads within the same warp to maximize SIMD unit utilization, coalesce their memory accesses to reduce memory traffic, and optimize warp scheduling to efficiently hide long latency operations and harness the existing data locality. This section overviews the current policies used in thread scheduling.

2.3.1 Threads from the Same Warp

As discussed in Section 2.2, SIMT architectures, unlike traditional vector machines, allow for arbitrary control flow divergence.
Threads are split into groups (warps) that execute in lockstep on the underlying SIMD units. The warp size in recent GPUs is typically the number of available parallel SIMD lanes. With no divergence, threads within the same warp share the same program counter. However, upon a divergent branch, threads in a warp are allowed to follow different control flow paths. Current implementations achieve this by serializing the execution of different control-flow paths while restoring SIMD utilization by forcing divergent threads to reconverge as soon as possible (typically at an immediate postdominator point) [29, 60, 64].

(a) Code
    // id = thread ID
    // BBA: Basic Block "A"
    if (id % 2 == 0) {
        // BBB
    } else {
        // BBC
    }
    // BBD
(b) CFG [A (mask 1111) ends in branch $BR^{A}_{B-C}$ and leads to B (mask 0101) and C (mask 1010), which both lead to D (mask 1111).]
Figure 2.3: Divergent code example

Figure 2.3 illustrates a simple example of divergent code and its corresponding control flow graph (CFG). The bit mask in each basic block of the CFG denotes which threads of a single warp containing four threads will execute that block. The rightmost bit represents the thread with thread ID 0. All threads execute basic block A. Upon executing the divergent branch $BR^{A}_{B-C}$, warp $A_{1111}$ diverges into two warp splits [91], $B_{0101}$ and $C_{1010}$. In our notation, branches are abbreviated as BR with a superscript representing the basic block containing the branch and a subscript representing the successor basic blocks. Each warp split is represented by a letter denoting the basic block that the split is executing, with a subscript indicating the active threads.

The immediate postdominator (IPDOM) of the branch $BR^{A}_{B-C}$, basic block D, is the earliest point where all threads diverging at the branch are guaranteed to execute. We say an execution mechanism supports IPDOM reconvergence if it guarantees that all threads in the warp that are active at any given branch are again active (executing in lockstep) when the immediate postdominator of that branch is next encountered.
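The immediate postdominator referred to above can be computed with a standard iterative dataflow pass over the CFG. The sketch below is illustrative only and is not the analysis used later in this thesis; it assumes a CFG with at most 32 basic blocks, a single exit block that every block can reach, and a hypothetical function name `immediate_postdominators`. Blocks are numbered from 0 and `succ[b]` lists the successors of block `b`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns ipdom[b] for every block b, or -1 for the exit block.
// Assumptions (for this sketch only): at most 32 blocks, one exit block
// (the only block without successors) that is reachable from every block.
std::vector<int> immediate_postdominators(
        const std::vector<std::vector<int>>& succ) {
    const int n = static_cast<int>(succ.size());
    assert(n >= 1 && n <= 32);
    const std::uint32_t all = (n == 32) ? 0xFFFFFFFFu : ((1u << n) - 1u);

    // pdom[b] is a bit set: bit p is set if block p postdominates block b.
    std::vector<std::uint32_t> pdom(n, all);
    for (int b = 0; b < n; ++b)
        if (succ[b].empty()) pdom[b] = 1u << b;  // exit postdominates itself

    // pdom(b) = {b} united with the intersection of pdom(s) over successors s.
    bool changed = true;
    while (changed) {
        changed = false;
        for (int b = 0; b < n; ++b) {
            if (succ[b].empty()) continue;
            std::uint32_t meet = all;
            for (int s : succ[b]) meet &= pdom[s];
            const std::uint32_t next = meet | (1u << b);
            if (next != pdom[b]) { pdom[b] = next; changed = true; }
        }
    }

    // The strict postdominators of b form a chain; the immediate one is the
    // block whose own postdominator set equals that whole chain.
    std::vector<int> ipdom(n, -1);
    for (int b = 0; b < n; ++b) {
        const std::uint32_t strict = pdom[b] & ~(1u << b);
        for (int p = 0; p < n; ++p)
            if (((strict >> p) & 1u) && pdom[p] == strict) ipdom[b] = p;
    }
    return ipdom;
}
```

For the CFG of Figure 2.3 with A, B, C, D numbered 0 to 3, that is `succ = {{1, 2}, {3}, {3}, {}}`, the function returns 3 (block D) for A, B and C, which matches the reconvergence point discussed above.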
IPDOM reconvergence is favorable because the immediate postdominator is the closest point at which all threads in a warp are guaranteed to reconverge (see footnote 2). A mechanism for supporting IPDOM reconvergence using a stack of active masks [77] was introduced by Fung et al. [41]. However, there are different possible implementations that can support IPDOM reconvergence as defined above. On current hardware the SIMT model is implemented via predication for simple branches, or in the general case using hardware and/or software managed stack-based masking of execution units [5, 18, 29, 64, 77].

Footnote 2: Likely convergence [44] and thread frontiers [33] identify earlier reconvergence points that can occur dynamically in unstructured control flow if a subset of paths between branch and IPDOM are executed by a warp.

In these stack-based execution models, the divergent paths are serialized. Thus, in this example, warp split $C_{1010}$ may execute first until its threads reach basic block D. Then, execution switches to warp split $B_{0101}$. Once the latter threads reach basic block D as well, the four threads reconverge and execute basic block D in lockstep.

2.3.2 Threads from Different Warps

Each cycle, one or more warp schedulers select one of their assigned active warps to be issued for execution. Typically, a scheduling heuristic is used with the objective of efficiently hiding long latency operations and harnessing the existing data locality [99, 124]. A simple scheduling policy is to round robin across the available warps: if the next warp in the round robin order is not pending on a data dependency or a synchronization dependency (e.g., a barrier) and the hardware required to execute its next instruction is available, the warp is issued for execution. This policy is referred to as Loose Round Robin (LRR). LRR guarantees fairness in scheduling different warps. However, it is inefficient at hiding long latency operations, as it encourages warps to progress at similar rates and reach high latency code portions at the same time, which limits their ability to hide each other’s latencies. Further, LRR has a negative impact on intra-warp temporal locality, as it allows other warps to evict the data brought to the cache by a warp before the warp is rescheduled [124]. Greedy then Oldest (GTO) is another warp scheduling policy that typically outperforms LRR. A greedy-then-oldest scheduler consistently selects the same warp for scheduling until it stalls, then moves to the oldest ready warp. Older warps are those that were assigned earlier to the hardware resources. If different warps were assigned in the same cycle (e.g., in the same thread block), warps with the smallest thread IDs are considered older. Compared to LRR, GTO has the advantage of maintaining intra-warp locality as well as getting warps to progress at different rates, which typically allows for better latency hiding. There are, however, numerous research papers on warp scheduling policies that use different heuristics based on different optimization goals (e.g., improving latency hiding [99], improving locality [124], reducing barrier synchronization overheads [83], reducing load imbalance overhead across warps from the same CTA [72]).

Chapter 3
SIMT-Induced Deadlocks: Definition and Detection

This chapter studies the impact of current single-instruction multiple-thread (SIMT) implementations on programmability. We show that the scheduling constraints imposed by current SIMT implementations lead to surprising (from a programmer perspective) deadlocks when executing code that contains inter-thread synchronization. This type of deadlock is unique to SIMT architectures and does not exist in traditional multi-threaded architectures.
We refer to the cases where the forward progress of diverged threads is prevented due to the implicit scheduling constraints imposed by the SIMT implementation as SIMT-induced deadlocks or, briefly, SIMT deadlocks. In this chapter, we precisely define the inter-thread synchronization patterns that lead to SIMT-induced deadlocks. We show that SIMT-induced deadlocks can be produced by (1) programmers and (2) compilers that are oblivious to the subtle details of SIMT implementations.

In this chapter, we also discuss the implications of SIMT deadlocks for SIMT architectures regarding both the ease of programmability and the support of different programming models. We present an algorithm that conservatively detects potential SIMT-induced deadlocks in parallel kernels. We evaluate an implementation of the algorithm that relies solely on static analysis using a large number of CUDA and OpenCL kernels. The results show that the static detection successfully flags true SIMT deadlocks with a false detection rate of 4%–5%. We discuss the limitations of our current SIMT deadlock detection implementation. Finally, we summarize related work and conclude with some pointers to future work.

(a) If-else Example [CFG: A (1111) branches to B (0101) and C (1010), which both lead to D.]
    Reconvergence stack before the branch:
        PC   RPC   Active Mask
        A    -     1111          <- TOS
    Reconvergence stack after the branch:
        PC   RPC   Active Mask
        D    -     1111
        B    D     0101
        C    D     1010          <- TOS
(b) Loop Example [CFG: A leads to B (1111); B loops back to B (0111) or exits to C (1000).]
    Reconvergence stack before the branch:
        PC   RPC   Active Mask
        A    -     1111          <- TOS
    Reconvergence stack after the branch:
        PC   RPC   Active Mask
        C    -     1111
        B    C     0111          <- TOS
Figure 3.1: Stack-Based Reconvergence.

3.1 SIMT Scheduling Constraints

Conceptually, in a stack-based SIMT implementation, a per-warp stack, as shown in Figure 3.1, is used to manage divergent control flow.
Each entry contains three fields that represent a group of scalar threads executing in lock-step: (1) a program counter (PC), which stores the address of the next instruction to execute, (2) a reconvergence program counter (RPC), which stores the instruction address at which these threads should reconverge with other threads from the same warp, and (3) an active mask that indicates which threads have diverged to this path. Initially, the stack has a single entry. Once a divergent branch is encountered, the PC field of the divergent entry is replaced with the RPC of the encountered branch and the entries of the branch outcomes are pushed onto the stack. Only threads in the top-of-stack entry are eligible for scheduling. Once executing threads reach reconvergence, their corresponding entry is popped off the stack. As noted earlier, in some implementations the stack is implemented and/or manipulated in software [5, 18, 29, 77].

Figure 3.1 records the change in the reconvergence stack after executing a divergent branch in two different cases. In Figure 3.1a, an if-else branch is executed at the end of basic block A (BBA) (1). The PC of the first stack entry changes to the RPC of the branch (i.e., BBD), and two new entries are pushed onto the stack representing the branch taken and not-taken paths (2). Only the top of the stack (TOS) entry is eligible for scheduling. Therefore, this warp starts executing threads that diverged to BBC (3). After these threads reach the reconvergence point (i.e., BBD), their entry is popped off the stack and the execution of BBB starts. Eventually, all threads reconverge at BBD (4). This sequence allows the stack to track multiple reconvergence points in case of more complex CFGs that contain nested branches. In Figure 3.1b, a loop branch is executed at the end of BBB (1). One thread exits the loop while the others continue iterating. Similar to the first example, the PC of the first stack entry changes to the RPC of the branch (i.e., BBC).
Only a single entry is pushed onto the stack, representing the threads that diverged to the loop header (2). The warp keeps iterating through the loop until all threads eventually exit the loop and start executing BBC. This mechanism enables diverged threads to reconverge, improving SIMD unit utilization. However, to achieve this, it imposes thread scheduling constraints that we discuss next.

3.1.1 SIMT-Induced Deadlocks

Conventionally, in a simultaneous multi-threaded MIMD environment, programmers do not worry about the specifics of thread schedulers to write functionally correct code. It is assumed that the hardware and/or the operating system guarantees “loose” fairness in thread scheduling [47]. SIMT programming models attempt to provide similar guarantees [111]. However, recent SIMT implementations fail to do so. In current SIMT implementations, thread scheduling is constrained by the CFG of the executed application such that: if a warp $W$ encounters a branch $BR_{T,NT \mapsto R}$ with two possible successor basic blocks $T$ and $NT$ and reconvergence point $R$, it may diverge into two splits [91], $W_{P_{T \mapsto R}}$ and $W_{P_{NT \mapsto R}}$. $W_{P_{T \mapsto R}}$ contains the threads that diverge to the taken path and $W_{P_{NT \mapsto R}}$ contains the threads that diverge to the not-taken path. On current SIMT implementations execution respects the following:

Constraint-1: Serialization. If $W_{P_{T \mapsto R}}$ executes first, then $W_{P_{NT \mapsto R}}$ blocks until $W_{P_{T \mapsto R}}$ reaches $R$ (or vice versa).

Constraint-2: Forced Reconvergence. When $W_{P_{T \mapsto R}}$ reaches $R$, it blocks until $W_{P_{NT \mapsto R}}$ reaches $R$ (or vice versa).

Collectively we refer to these two constraints as the stack-based SIMT reconvergence scheduling constraints.
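The stack operations of Figure 3.1 and the two constraints above can be made concrete with a small software model. The sketch below is an illustration under stated assumptions rather than a description of any real GPU: basic blocks are integers, `next_block(block, thread)` is a hypothetical callback giving each thread's successor, `ipdom[block]` supplies the reconvergence point of the branch ending that block, and `kExit`, `StackEntry`, `Trace` and `run_warp` are names invented for this example. Divergent paths are pushed in ascending block order, so the highest-numbered successor ends up at the top of the stack; that order is an arbitrary choice.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>
#include <vector>

constexpr int kExit = -1;  // hypothetical marker for "kernel exit"

struct StackEntry {
    int pc;              // next basic block to execute
    int rpc;             // basic block where this split reconverges
    std::uint32_t mask;  // active threads (bit t corresponds to thread t)
};

// Sequence of (basic block, active mask) pairs in execution order.
using Trace = std::vector<std::pair<int, std::uint32_t>>;

Trace run_warp(int entry_block, int warp_size,
               const std::function<int(int, int)>& next_block,
               const std::vector<int>& ipdom, int max_steps) {
    assert(warp_size >= 1 && warp_size <= 32);
    const std::uint32_t full =
        (warp_size == 32) ? 0xFFFFFFFFu : ((1u << warp_size) - 1u);
    std::vector<StackEntry> stack;
    stack.push_back({entry_block, kExit, full});
    Trace trace;
    int steps = 0;
    while (!stack.empty() && steps < max_steps) {
        // Constraint-2: a split that reached its reconvergence point is
        // popped and waits under the parent entry for the other split.
        if (stack.back().pc == stack.back().rpc) {
            stack.pop_back();
            continue;
        }
        ++steps;
        // Constraint-1: only the top-of-stack entry is allowed to execute.
        const StackEntry tos = stack.back();
        trace.push_back({tos.pc, tos.mask});

        std::map<int, std::uint32_t> targets;  // successor -> thread mask
        for (int t = 0; t < warp_size; ++t)
            if ((tos.mask >> t) & 1u)
                targets[next_block(tos.pc, t)] |= 1u << t;

        if (targets.size() == 1) {  // no divergence: just advance the PC
            stack.back().pc = targets.begin()->first;
            continue;
        }
        // Divergence: the current entry resumes at the reconvergence point
        // and one entry per divergent path is pushed on top of it.
        const int rpc = ipdom[tos.pc];
        stack.back().pc = rpc;
        for (const auto& target : targets)
            if (target.first != rpc)
                stack.push_back({target.first, rpc, target.second});
    }
    return trace;
}
```

Running this model on the if-else of Figure 2.3 with a four-thread warp (A, B, C, D numbered 0 to 3 and `ipdom = {3, 3, 3, kExit}`) yields the trace A/1111, C/1010, B/0101, D/1111, which is the order described for Figure 3.1a.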
A SIMT-induced deadlock occurs when a thread is indefinitely blocked due to a cyclic dependency between either of these constraints and a synchronization operation in the program.

3.1.2 Causes of SIMT Deadlocks

We categorize SIMT-induced deadlocks into two types according to their cause:

Barrier Induced: Figure 3.2 illustrates code that uses barriers to synchronize communication between threads. Following MIMD execution semantics, the programmer's intention is that every thread should reach either the first or second barrier before any thread continues. Placing barriers in such diverged code is used to implement producer-consumer communication using named barriers and the .sync and .arrive directives as suggested by Nvidia's PTX ISA Manual [105]. Recent work has proposed employing such code to enable warp specialization [12, 13]. However, for this code to run correctly on current SIMT implementations, the condition on line 2 in Figure 3.2 has to evaluate identically across all threads within the same warp every time this line is executed and for all warps executing the code. Otherwise, this code leads either to deadlock (if barrier arrival is counted per scalar thread) or to hard-to-predict and/or implementation-dependent behavior (if barrier arrival is counted per warp [147]). Prior work studied detection of barrier divergence [17, 79, 131]. Therefore, in this chapter, we focus more on the second type of SIMT-induced deadlocks.

     1. // tid = thread-id
     2. if (cond(tid)) {
     3.   // write shared data
     4.   barrier();
     5.   // read shared data
     6. } else {
     7.   // write shared data
     8.   barrier();
     9.   // read shared data
    10. }

Figure 3.2: Barriers in divergent code.

Conditional Loop Induced: Figure 3.3 shows a typical "MIMD implementation" of a spin lock guarding a critical section. To simplify the example, all threads attempt to acquire the same lock (mutex).
The issue illustrated by this example also affects fine-grained locks when more than one thread contends for the same lock. To acquire the lock, each thread repeatedly executes an atomic compare and swap. The loop condition evaluates to false for the single thread that successfully acquires the lock. We call this the leading thread. The remaining threads fail to acquire the lock. We call these lagging threads. The lagging threads continue to iterate through the loop waiting for the leading thread to release the lock by executing atomicExch(...). However, in current SIMT implementations the leading thread never reaches atomicExch(...) as it is blocked by Constraint-2: the leading thread is waiting at the reconvergence point for the lagging threads. On the other hand, the lagging threads cannot make forward progress because the leading thread owns the lock. This issue is known among GPU application developers [43, 47, 114, 115, 121, 151].

    // *mutex = 0;
    while (atomicCAS(mutex, 0, 1) != 0);
    // critical section
    atomicExch(mutex, 0);

Figure 3.3: Critical Section Implementation.

3.2 SIMT Deadlocks Impact on Programmability

In SIMT architectures, the hardware, with minimal help from the compiler, supports arbitrary memory accesses and control flow divergence. This enables abstracting away the complexity of the underlying SIMD hardware, allowing for much simpler programming models. The CUDA Programming Guide [111], for example, states that "For the purposes of correctness, the programmer can essentially ignore the SIMT behavior, ..." and suggests that SIMT behavior is primarily relevant for performance tuning purposes.

However, with current SIMT implementations, dealing with SIMT deadlocks is still a problem.
In the presence of inter-thread synchronization, it is still necessary for programmers and/or compilers to be aware of the interaction between inter-thread communication, the control flow graph (CFG) of the parallel kernel and the SIMT divergence implementation to avoid SIMT deadlocks.

To the best of our knowledge, prior to our publication [38], there were no general algorithms or approaches to either detect or eliminate SIMT deadlocks. This affects both the ease of programmability of SIMT architectures using SIMT-specific languages (e.g., CUDA) and the robust support on SIMT architectures of new programming models with higher levels of abstraction (e.g., OpenMP).

3.2.1 Ease of Programmability

With the presence of SIMT deadlocks, current SIMT implementations lack reliable support for fine-grained inter-thread synchronization, which is essential for efficient implementations of many irregular applications. The inconvenience caused by SIMT-induced deadlocks is evident from the large number of programming discussion boards on the issue. It also shows that the challenge exists for both general purpose programmers (e.g., in CUDA and OpenCL [114, 115]) and graphics programmers (e.g., in GLSL and HLSL [112, 113, 139]).

Therefore, it is currently up to programmers to work around the SIMT hardware scheduling constraints when implementing algorithms that involve inter-thread synchronization. For example, Figure 3.4 shows a GPU implementation of hashtable insertion proposed in Nvidia's "CUDA by Example" [129]. The lines of code of interest, highlighted in bold in the original figure, are lines 19-20, 24 and 27-29. The example uses a spin lock implementation (lines 1-11 in Figure 3.4) similar to the one described in Figure 3.3. However, when the lock function is called, it is guaranteed that only one thread from each warp is executing it at a time, which eliminates the possibility of SIMT deadlocks.

     1. struct Lock {
     2.   int *mutex;
          .....
     3.   __device__ void lock(void) {
     4.     while (atomicCAS(mutex, 0, 1) != 0);
     5.     __threadfence();
     6.   }
     7.   __device__ void unlock(void) {
     8.     __threadfence();
     9.     atomicExch(mutex, 0);
    10.   }
    11. };
    12. __global__ void add_to_table(unsigned int *keys, void **values, Table table, Lock *lock) {
    13.
    14.   int tid = threadIdx.x + blockIdx.x * blockDim.x;
    15.   int stride = blockDim.x * gridDim.x;
    16.   while (tid < ELEMENTS) {
    17.     unsigned int key = keys[tid];
    18.     size_t hashValue = hash(key, table.count);
    19.     for (int i = 0; i < 32; i++) {
    20.       if ((tid % 32) == i) {
    21.         Entry *location = &(table.pool[tid]);
    22.         location->key = key;
    23.         location->value = values[tid];
    24.         lock[hashValue].lock();
    25.         location->next = table.entries[hashValue];
    26.         table.entries[hashValue] = location;
    27.         lock[hashValue].unlock();
    28.       }
    29.     }
    30.     tid += stride;
    31.   }
    32. }

Figure 3.4: HashTable Insertion - Version 1 [129].

However, the serialization of threads within the same warp significantly underutilizes the SIMD hardware regardless of the granularity of the synchronization. For the simple case in Figure 3.3, there is another well-known workaround on Nvidia GPUs, shown in Figure 3.6 [121], which avoids underutilizing the SIMD hardware. In this modified code, the while loop body includes both the lock acquisition and release. Hence, the reconvergence point of the while loop does not prevent the required communication between threads. Threads that fail to acquire the lock in the if statement condition wait at the end of the if statement to reconverge with the threads that acquired the lock. Threads that acquired the lock proceed to execute the critical section and then release the lock. All threads then check whether they are eligible for another iteration of the while loop: any thread that acquired the lock has its local variable done set to true and thus exits the while loop, while the other threads proceed into the loop body. Eventually, all threads within a warp reconverge at the exit of the while loop and continue execution.

    __global__ void add_to_table(unsigned int *keys, void **values, Table table, Lock *lock) {
      int tid = threadIdx.x + blockIdx.x * blockDim.x;
      int stride = blockDim.x * gridDim.x;
      while (tid < ELEMENTS) {
        unsigned int key = keys[tid];
        size_t hashValue = hash(key, table.count);
        Entry *location = &(table.pool[tid]);
        location->key = key;
        location->value = values[tid];
        bool done = false;
        while (!done) {
          if (atomicCAS(lock[hashValue].mutex, 0, 1) == 0) {
            __threadfence();
            location->next = table.entries[hashValue];
            table.entries[hashValue] = location;
            done = true;
            __threadfence();
            atomicExch(lock[hashValue].mutex, 0);
          }
        }
        tid += stride;
      }
    }

Figure 3.5: HashTable Insertion - Version 2.

    done = false;
    // *mutex = 0;
    while (!done) {
      if (atomicCAS(mutex, 0, 1) == 0) {
        // Critical Section
        atomicExch(mutex, 0);
        done = true;
      }
    }

Figure 3.6: Modified Critical Section Implementation.

[Figure 3.7 is a bar chart: execution time (msec, axis from 0 to 700) of version-1 and version-2 on TeslaC2050, TeslaK20C, GTX-TITAN and GTX-1080.]

Figure 3.7: HashTable Insertion Versions Comparison (2M Random Insertion in 1024 HashTable Entries). Recent GPUs are to the right.

We apply this workaround to the hashtable insertion code, as shown in Figure 3.5, to compare the performance of both workarounds.
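The difference between the loop shapes of Figure 3.3 and Figure 3.6 can be reproduced with a small host-side C++ model. It is a sketch under strong assumptions: a single warp, a single lock, one winner per lockstep iteration, and reconvergence at the loop exit. A bounded run that fails to finish is only a proxy for a deadlock, not a proof of one. The function name warpCompletes and its parameters are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Returns true if every thread of a single warp gets through the critical
// section within maxIters lockstep iterations of the spin loop.
// releaseInsideLoop = false models the loop shape of Figure 3.3,
// releaseInsideLoop = true  models the loop shape of Figure 3.6.
bool warpCompletes(int warpSize, bool releaseInsideLoop, int maxIters) {
    assert(warpSize > 0 && maxIters > 0);
    int mutex = 0;                             // 0 = free, 1 = held
    std::vector<bool> inLoop(warpSize, true);  // threads still spinning
    int remaining = warpSize;

    for (int iter = 0; iter < maxIters && remaining > 0; ++iter) {
        // All spinning threads issue atomicCAS(mutex, 0, 1); at most one wins.
        int winner = -1;
        for (int t = 0; t < warpSize; ++t) {
            if (inLoop[t] && mutex == 0) {
                mutex = 1;
                winner = t;
            }
        }
        if (winner < 0) continue;  // lock is held: everyone spins again
        inLoop[winner] = false;
        --remaining;
        if (releaseInsideLoop) {
            mutex = 0;  // critical section and atomicExch are inside the loop body
        }
        // Otherwise the release sits after the loop's reconvergence point,
        // which the winner cannot pass while other threads are still
        // spinning (Constraint-2), so the lock stays held.
    }
    return remaining == 0;
}
```

In this model, warpCompletes(32, false, 1000) returns false because the first winner never releases the lock, while warpCompletes(32, true, 1000) returns true after 32 iterations. A warp with a single thread completes in both shapes, which matches the observation that the problem needs more than one thread of the same warp contending for the lock.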
Figure 3.7 shows the execution time of the add_to_table kernel on different GPU generations with 1024 entries in the hashtable and 2M random insertions. Except for Fermi's TeslaC2050, all recent GPU devices perform better with version-2, with a speedup of 56% on the Pascal GTX-1080. On Fermi, version-1 performs better as it reduces contention on the atomic operations, which used to perform poorly on earlier architectures.

However, even the second workaround is not reliable for the following reasons:

1. Lack of general solution: Both workarounds in Figures 3.4 and 3.5 handle a simple case where there is a single lock acquire statement with a single lock release statement that postdominates the lock acquire. Thus, the required code transformation is relatively simple to reason about using high level semantics, given that the programmer is aware of the details of the reconvergence mechanism. However, both in principle and in practice, more complex synchronization patterns can exist. Figure 3.8 shows a couple of such examples where the code could have multiple interleaved locks and/or multiple lock releases that may not postdominate the lock acquire statement(s). With such patterns, it is difficult even for programmers who are aware of the hardware details to reason about a safe code transformation. Thus, there is a need for a general algorithm that can both detect and eliminate SIMT deadlocks regardless of the synchronization pattern in place.

    m.lock();
    .....
    if (cond) {
      .....
      m.unlock();
      return;
    }
    .....
    m.unlock();

    m1.lock();
    m2.lock();
    .....
    if (cond) {
      .....
      m1.unlock();
      m2.unlock();
    } else {
      .....
      m1.unlock();
      m2.unlock();
    }

Figure 3.8: Synchronization patterns used in multi-threaded applications. Examples are from the graph analytics API in the Cloud suite [40].

2. Unstandardized SIMT Behavior: Different GPU vendors have their own implementation of the SIMT execution model. As a consequence, the order in which divergent paths are executed and the specific locations of reconvergence points are not standardized and are often undocumented. Given that these nuances may impact programs' correctness, programmers need to be aware of the exact implementation on the targeted hardware. Such solutions, when they exist, lack forward compatibility as well as portability. Therefore, a pure compiler and/or hardware solution is preferable.

3. Vulnerability to Compiler Optimizations: Manual workarounds as in Figure 3.5 assume that the compiler maintains the same control flow described by the high level language. However, optimizations that either modify the CFG or move statements across basicblocks can conflict with the intended CFG. We found that this problem exists in default optimization passes in both open source compilers (e.g., LLVM) and proprietary compilers (e.g., Nvidia's NVCC).

For example, the code in Figure 3.5 works fine if compiler optimizations are disabled. However, if default compiler optimizations are enabled, the outcome depends on the compiler and/or its version. Figure 3.9 shows the PTX assembly resulting from compiling the hashtable insertion code in Figure 3.5 with two versions of Nvidia's NVCC compiler. NVCC versions 6.5 and lower maintain the CFG specified by the programmer for this example. However, we found that starting from NVCC 7.0, the CFG is altered in a way that induces SIMT deadlocks on existing hardware [footnote 1]. More of these examples are available in the author's github repository [37].

LLVM showed similar behaviour to NVCC 8.0 (we used LLVM 3.6). In LLVM, we identified two control flow optimizations that induce SIMT deadlocks even when the original code was SIMT deadlock free, namely jump-threading and simplifycfg.
Both optimizations attempt to simplify control flow graphs by removing unnecessary branches [86].

This observation motivates the need for compiler writers for SIMT backends to find methodologies to restrict compiler optimizations that induce SIMT deadlocks, or at least to avoid their SIMT-incompatible side effects. Without the work presented in this thesis, we believe the most likely fix for the compiler would be to disable all optimizations that could potentially cause SIMT deadlocks in the compilation of any SIMT kernel. This would include all optimizations that alter the CFG or that move instructions across basic blocks. A more robust methodology would be to detect whether a certain application is prone to SIMT deadlocks given a certain transformation. This work enables this methodology using variations of Algorithm 1.

Footnote 1: We have tested these codes on a variety of Nvidia's GPUs: TeslaC2090 (Fermi), Tesla K20C (Kepler), and GeForce GTX 1080 (Pascal).

(a) NVCC 6.5

    BB_1:
      ....
      mov.u16 %rs6, 0;
    BB_2:
      ....
      atom.cas.b32 %r10, [%rd25], 0, 1;
      setp.ne.s32 %p2, %r10, 0;
      @%p2 bra BB_4;
    BB_3:
      // critical section
      atom.exch.b32 %r11, [%rd27], 0;
      mov.u16 %rs6, 1;
    BB_4:
      setp.eq.s16 %p3, %rs6, 0;
      @%p3 bra BB_2;
    BB_5:
      ....
      @%p4 bra BB_1;
    BB_6:
      ret;

(b) NVCC 8.0

    BB_1:
      ....
    BB_2:
      ....
      @%p2 bra BB_4;
      bra.uni BB_3;
      ....
    BB_6:
      ....
      atom.cas.b32 %r15, [%rd28], %r14, %r13;
      setp.ne.s32 %p3, %r15, 0;
      @%p3 bra BB_6;
    BB_7:
      // critical section
      atom.exch.b32 %r17, [%rd32], 0;
      ....
      @%p4 bra BB_2;
    BB_8:
      ret;

Figure 3.9: PTX output of compiling the code in Figure 3.5 using NVCC 6.5 and NVCC 8.0 with default optimizations enabled.

3.2.2 Programming Language Abstraction

SIMT deadlocks limit the level of abstraction that can be supported by SIMT architectures. For example, current multi-threaded MIMD architectures support a wide range of general purpose programming languages (e.g., MPI, Java, OpenMP, pThreads, OpenCL, OpenACC, TBB, etc.). Each of these languages meets certain criteria such as development time, simplicity, efficiency, and readability for the applications of interest. For the expanding adoption of SIMT architectures to continue, SIMT architectures need to be flexible enough to robustly support a wider range of programming models. Below, we provide a case study where SIMT deadlock is an obstacle to supporting higher levels of abstraction.

3.2.3 OpenMP Case Study: Synchronization Primitives Library

The OpenMP 4.0 standard and its successors support the offloading of a parallel region to an accelerator (e.g., a GPU). The OpenMP programming model is appealing because of both its abstraction and its portability across architectures. Thus, it helps accelerators reach a broader set of developers [66]. Currently, there is a coordinated effort by many technology companies such as IBM, Intel, AMD, Nvidia and others to add an implementation of OpenMP 4.0 support for accelerators in Clang and LLVM [15, 93, 117].

To support a synchronization API in OpenMP such as omp_set_lock(...), the front-end replaces the API call with spin lock code with a back-off delay, as shown in Figure 3.10. Given the structure of this code, a valid OpenMP program that executes properly on CPUs may not terminate on GPUs due to SIMT deadlocks.

     1. EXTERN void omp_unset_lock_i(omp_lock_t *lock)
     2. {
     3.   int compare = SET;
     4.   int val = UNSET;
     5.   int old = atomicCAS(lock, compare, val);
     6. }
     7. EXTERN void omp_set_lock_i(omp_lock_t *lock)
     8. {
     9.   int compare = UNSET;
    10.   int val = SET;
    11.   while (atomicCAS(lock, compare, val) != UNSET) {
    12.     clock_t start = clock();
    13.     clock_t now;
    14.     for (;;)
    15.     {
    16.       now = clock();
    17.       clock_t cycles = now > start ? now - start : now + (0xffffffff - start);
    18.       if (cycles >= OMP_SPIN * blockIdx.x) {
    19.         break;
    20.       }
    21.     }
    22.   } // wait for 0 to be the read value
    23. }

Figure 3.10: OpenMP Clang 3.8 Frontend Implementation for OpenMP 4.0 Runtime Library Synchronization Calls omp_set_lock and omp_unset_lock.

Consider the example in Figure 3.11. The compiler front-end replaces the omp_set_lock(...) call at line 8 with the code in Figure 3.10. Threads that succeed in acquiring the lock wait at the exit of the while loop at line 22 in Figure 3.10 to reconverge with the lagging ones. Thus, they do not proceed to the omp_unset_lock(...) statement at line 11 in Figure 3.11. Therefore, if other threads are waiting on locks held by these blocked leading threads, a deadlock occurs.

     1. #pragma omp target
     2. #pragma omp parallel for
     3. for (unsigned i = 0; i < SIZE; i++) {
     4.   // ... non-divergent region ...
     5.   /* find max in each vector */
     6.   unsigned vindex = i / VECTOR_SIZE;
     7.   if (a[i] > max[vindex]) {
     8.     omp_set_lock(&my_lock[vindex]);
     9.     if (a[i] > max[vindex])
    10.       max[vindex] = a[i];
    11.     omp_unset_lock(&my_lock[vindex]);
    12.   }
    13. }

Figure 3.11: OpenMP Microbenchmark.

Note that the current OpenMP clang front-end correctly handles "#omp critical", which is used to enclose global critical code sections that should be executed by one thread at a time. When a critical section surrounded by #omp critical is found, an extra loop around the section is added to serialize the execution of threads within the same warp, in addition to an omp_set_lock(..) with a global lock that serializes the execution of different warps (i.e., similar to Figure 3.4 but using a global lock to force different warps to serialize their execution as well).
However, this solution does not generalize because it requires global locking and also requires the lock release to happen in a single known location that postdominates the lock acquisition (as is the case with "#omp critical"). This is not always true when using omp_(un)set_lock(..).

This obstacle is not unique to OpenMP support. In fact, even SIMT-specific languages such as CUDA could make use of an optimized synchronization primitive library for both performance and productivity. For example, recent work [78] has proposed promising efficient fine-grained synchronization primitives on GPUs that can boost performance by up to 4× compared to the current spin lock implementation. However, a main limitation that work describes is SIMT deadlock, which rules out library based implementations of these primitives. These challenges motivate a robust technique that eliminates SIMT deadlocks without intervention from programmers.

3.3 SIMT Deadlock Detection

This section proposes an algorithm to detect the presence of SIMT deadlocks in a parallel kernel. It also presents the limitations of a static implementation of such an algorithm.

Listing 1 Definitions and Prerequisites for Algorithms 1, 2, 4

- BB(I): the basicblock that contains instruction I (i.e., I ∈ BB(I)).
- P_{BB1↦BB2}: union set of basicblocks in execution paths that connect BB1 to BB2.
- IPDom(I): immediate postdominator of an instruction I. For non-branch instructions, IPDom(I) is the instruction that immediately follows I. For branch instructions, IPDom(I) is defined as the immediate common postdominator of the basicblocks at the branch targets BB_T and BB_NT (i.e., the common postdominator that strictly postdominates both BB_T and BB_NT and does not postdominate any of their other common postdominators).
- IPDom(arg1, arg2): the immediate common postdominator of arg1 and arg2; arg1 and arg2 can be either basicblocks or instructions.
- LSet: the set of loops in the kernel, where ∀ L ∈ LSet:
- BBs(L): the set of basicblocks within the body of loop L.
- ExitConds(L): the set of branch instructions at the exits of loop L.
- Exits(L): the set of basicblocks outside the loop that are immediate successors of a basicblock in the loop.
- Latch(L): the backward edge of loop L; Latch(L).src and Latch(L).dst are the edge source and destination basicblocks respectively.
- Basicblock BB is reachable from loop L iff there is a non-null path(s) connecting the reconvergence point of Exits(L) with basicblock BB without going through a barrier. ReachBrSet(L,BB) is a union set of conditional branch instructions in all execution paths that connect the reconvergence point of Exits(L) with BB.
- Basicblock BB is parallel to loop L iff there is one or more conditional branch instructions where BBs(L) ⊂ P_{T↦R} and BB ∈ P_{NT↦R} or vice versa, where T is the first basic block on the taken path of the branch, NT is the first basic block on the not-taken path, and R is the reconvergence point of the branch instruction. ParaBrSet(L,BB) is a union set that includes all branch instructions that satisfy this condition.

3.3.1 SIMT Deadlock Detection Algorithm

This section proposes a compiler analysis algorithm that conservatively detects SIMT-induced deadlocks due to conditional loops. In the remainder of this section, we consider the case of a single kernel function K with no function calls (e.g., either natively or through inlining) and with a single exit.
The single exit assumption is required to guarantee the convergence of the safe reconvergence identification algorithm (Chapter 4), not by the detection algorithm itself. It is achieved by merging the return statement(s) into a single return (e.g., using the -mergereturn pass in LLVM).

To focus on SIMT deadlocks due to conditional loops, we assume that K is guaranteed to terminate (i.e., is deadlock and livelock free) if executed on a MIMD machine. This means that any loosely fair thread scheduler that does not indefinitely block a thread (or a group of threads) from execution always leads to the termination of K.

Algorithm 1 SIMT-Induced Deadlock Detection
     1: Inputs: LSet.
     2: Outputs: RedefWrites(L) for each L ∈ LSet, and whether L induces a potential SIMT deadlock.
     3: for each loop L ∈ LSet do
     4:   ShrdReads(L) = ∅, ShrdWrites(L) = ∅, RedefWrites(L) = ∅
     5:   for each instruction I, where BB(I) ∈ BBs(L) do
     6:     if I is a shared memory read ∧ ExitConds(L) depends on I then
     7:       ShrdReads(L) = ShrdReads(L) ∪ {I}
     8:     end if
     9:   end for
    10:   for each instruction I do
    11:     if BB(I) is parallel to or reachable from L then
    12:       if I is a shared memory write then
    13:         ShrdWrites(L) = ShrdWrites(L) ∪ {I}
    14:       end if
    15:     end if
    16:   end for
    17:   for each pair (IR, IW), where IR ∈ ShrdReads(L) and IW ∈ ShrdWrites(L) do
    18:     if IW does/may alias with IR then
    19:       RedefWrites(L) = RedefWrites(L) ∪ {IW}
    20:     end if
    21:   end for
    22:   if RedefWrites(L) ≠ ∅ then Label L as a potential SIMT-induced deadlock.
    23:   end if
    24: end for

We also assume that K is barrier divergence-free [17] (i.e., for all barriers within the kernel, if a barrier is encountered by a warp, the execution predicate evaluates to true across all threads within this warp). This assumption excludes the possibility of SIMT deadlocks due to placing barriers in divergent code assuming thread level barrier arrival detection. Such barrier divergence detection techniques and the issues arising from placing barriers in divergent code have been extensively studied in prior work [17, 79, 131].
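The core of Algorithm 1 can be sketched in host-side C++ for a single loop. This is a minimal illustration, not the LLVM pass used in this work: it assumes that the backward slice of the exit condition and the parallel/reachable block sets have already been computed, and it replaces alias analysis with an exact match on an abstract location id. The types Instr and Loop and the function redefWrites are hypothetical names.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical, pre-digested view of one instruction.
struct Instr {
    int id;
    int block;           // id of the basicblock that contains it, BB(I)
    bool sharedRead;
    bool sharedWrite;
    bool feedsExitCond;  // in the backward slice of a loop exit condition
    int location;        // abstract memory location (toy alias model)
};

struct Loop {
    std::set<int> body;                 // BBs(L)
    std::set<int> parallelOrReachable;  // blocks parallel to or reachable from L
};

// Returns the ids of the redefining writes of L; a non-empty result labels
// L as a potential SIMT-induced deadlock.
std::vector<int> redefWrites(const Loop& L, const std::vector<Instr>& instrs) {
    assert(!L.body.empty());  // a loop has at least one basicblock
    std::vector<const Instr*> shrdReads, shrdWrites;
    for (const Instr& i : instrs) {
        if (L.body.count(i.block) && i.sharedRead && i.feedsExitCond)
            shrdReads.push_back(&i);
        if (L.parallelOrReachable.count(i.block) && i.sharedWrite)
            shrdWrites.push_back(&i);
    }
    std::vector<int> result;
    for (const Instr* w : shrdWrites) {
        for (const Instr* r : shrdReads) {
            if (w->location == r->location) {  // stand-in for "does/may alias"
                result.push_back(w->id);
                break;
            }
        }
    }
    return result;
}
```

For the loop shape of Figure 3.3, the atomicCAS is a shared read inside the loop that feeds the exit condition, and the atomicExch is a shared write in a block reachable from the loop exit, so the sketch returns the atomicExch. For the loop shape of Figure 3.6, both atomics are inside the loop body and no write is parallel to or reachable from the loop, so the result is empty.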
For brevity, we refer to all memory spaces capable of holding synchronization variables as shared memory (i.e., including both global and shared memory in CUDA terminology). Listing 1 summarizes the definitions used in Algorithms 1, 2, 4. We discuss these definitions as we explain each algorithm.

Under the barrier divergence freedom and the MIMD deadlock and livelock freedom assumptions, the only way to prevent forward progress of a thread is a loop whose exit condition depends upon a synchronization variable. Forward progress is prevented (i.e., SIMT deadlock occurs) if a thread enters a loop for which the exit condition depends on the value of a shared memory location, and that location will only be set by another thread of the same warp, but that thread is blocked due to Constraint-1 or Constraint-2.

To determine whether there is a dependency, we consider the static backward slice of the loop exit condition, considering both data and control dependencies. Algorithm 1 detects potential SIMT-induced deadlocks.

It is applied to each loop in a given kernel. If the loop exit conditions do not depend on a shared memory read operation that occurs inside the loop body, then the loop cannot have a SIMT-induced deadlock. If a loop exit condition does depend on a shared memory read instruction IR, we add IR to the set of shared reads ShrdReads on lines 4-9. A potential SIMT-induced deadlock exists if any of these shared memory reads can be redefined by divergent threads from the same warp. The next steps of the algorithm detect these shared memory redefinitions.

Lines 10-16 record, in set ShrdWrites, all shared memory write instructions IW located in basicblocks that cannot be executed, due to the reconvergence scheduling constraints, by a thread in a given warp so long as some of the threads within that warp are executing in the loop. These basic blocks fall into two categories (Listing 1): The first category we call reachable.
In a structured CFG, the reachable blocks are those blocks that a thread can arrive at following a control flow path starting at the loop exit. The second category we call parallel. The parallel blocks are those that can be reached by a path that starts from a block that dominates the loop header but avoids entering the loop. The detection algorithm requires that reconvergence points can be precisely determined at compile time. In our implementation, reconvergence is at immediate postdominators. We limit reachable basicblocks to those before a barrier due to our assumption of barrier-divergence freedom.

Lines 17-23 check each pair of a shared memory read from ShrdReads and a write from ShrdWrites for aliasing. If they do or "may" alias, then the write instruction might affect the read value and hence affect the exit condition. If such a case occurs, the loop is labeled as a potential SIMT-induced deadlock and we add the write to the redefining writes set (RedefWrites).

    1. uint count;
    2. do {
    3.   count = l_WarpHist[data] & 0x07FFFFFFU;
    4.   count = uniqueThreadTag | (count + 1);
    5.   l_WarpHist[data] = count;
    6. } while (l_WarpHist[data] != count);

Figure 3.12: addByte function in Histogram256 [120].

For example, consider the application of Algorithm 1 to the code in Figures 3.3 and 3.6. In Figure 3.3, the loop exit is data dependent on the atomicCAS instruction. There is one shared memory write that is reachable from the loop exit, the atomicExch instruction. The two instructions alias. Hence, a SIMT-induced deadlock is detected. In Figure 3.6, although the loop exit is control dependent on the atomicCAS instruction, there are no shared memory write instructions that are parallel to or reachable from the loop exit.
Therefore, no SIMT deadlock is detected.

3.3.2 SIMT Deadlock Detection Limitations

Our current static implementation of the detection algorithm, Algorithm 1, has some limitations that we highlight next.

Inter-procedural Dependencies: The algorithm as stated assumes a single kernel function with no function calls to other procedures. Consequently, it can only detect SIMT deadlocks when synchronization is local to a function. It could be possible to extend our detection algorithm to handle function calls with an inter-procedural analysis that tracks dependencies across functions [23, 59]. We speculate this may lead to more conservative aliasing decisions and thus a higher false detection rate.

Conservative Reachability Analysis: Our implementation of Algorithm 1 relies solely on the static CFG layout to identify reachable and parallel basicblocks. However, in reality, not all static paths are executed at runtime.

Conservative Alias Analysis: We rely upon static alias analysis to identify redefining writes. Thus, it is possible to flag a loop as causing a SIMT deadlock where a programmer may reason that such a deadlock cannot occur. Further, relying on alias analysis leads to conservative dependency analysis when pointer chasing is involved in the dependency chain.

Indirect Branches: The static analysis can be extended to deal with indirect branches with known potential targets (as supported in Nvidia's PTX 2.1). However, without clues about potential targets, the analysis would be conservative in labeling potential SIMT-induced deadlocks that include indirect branches, as they might form loops. This leads to significant overheads due to excessive potential false detections. This limitation can be seen as a special case of the conservative reachability analysis limitation.

Warp-Synchronous Behaviour Legacy Codes: With the lack of shared memory atomics in early GPU generations, some GPU applications were written relying on implicit warp synchronization [footnote 2] to implement atomics.
One example is Histogram256 from the OpenCL SDK [120]. Figure 3.12 shows a code snippet for the main body of the core function in Histogram256: addByte. l_WarpHist is a pointer to the current warp's sub-histogram, and data is the value being added to the histogram. The outcome of this piece of code should be that each thread increments the histogram value l_WarpHist[data], where data might be the same or different across different threads. Hence, it is equivalent to an atomic increment operation.

Let us consider the case where two or more threads of the same warp receive the same data value. Each thread increments the old l_WarpHist[data] and stores the outcome into a private variable count, which is then tagged with a value unique to each thread. Finally, all threads attempt to store the count value in the same memory location. This race condition is handled by the hardware through write combining, which effectively results in the rejection of all but one colliding store [120]. The while loop exit condition guarantees that only the winning thread exits the loop, while the others try again.

In the main histogram kernel, the addByte function is called four times to implement addWord. This creates a write-write race between the four different occurrences of the store in line 5. Therefore, in a MIMD execution model, this code is broken and needs explicit barriers after the loop bodies to avoid the write-after-write races between divergent threads executing different occurrences of addByte. However, this code works fine on current GPUs, as it relies on the existing implementation of the SIMT execution model. In particular, it assumes there is an implicit barrier outside the loop body (due to reconvergence) that synchronizes threads of the same warp.

Footnote 2: These applications do not follow the CUDA programming model and are dependent on a specific hardware platform. Hence, they lack forward compatibility.

Table 3.1: Evaluated Kernels

    Kernel                 Language  Description
    HT [43]                CUDA      Chained Hash Table of 80K entries and 40K threads
    ATM [43]               CUDA      ATM 122K transactions, 1M accounts, 24K threads
    CP-DS [20, 43]         OpenCL    Distance Solver in Cloth Physics simulation
    BH-TB [24]             CUDA      Tree Building in BarnesHut (30,000 bodies)
    BH-SM [24]             CUDA      Summarization kernel in BarnesHut
    BH-ST [24]             CUDA      Sort kernel in BarnesHut
    BH-FC [24]             CUDA      Force Calculation in BarnesHut
    Rodinia 1.0 [26]       CUDA      Regular and Irregular Kernels with no fine-grained synchronization
    OpenCL SDK 4.2 [106]   OpenCL    Regular Kernels with no fine-grained synchronization

In the absence of explicit barriers, our detection algorithm (Algorithm 1) infers that the writes to the shared memory array in later loops are redefining writes that may change the loop exit condition of earlier loops, and thus flags the loops as potential SIMT deadlocks. False detections in such a situation are particularly harmful because they may trigger code transformations (described in Chapter 4) that would remove the implicit synchronization that was essential for correct functionality.

However, such codes are discouraged by Nvidia as they use undocumented and non-standardized features of the hardware. CUDA 8.0, the most recent version at the time of the thesis writing, has introduced the ability to perform warp level barriers using sync(this_warp) as a fast and safe way to replace the use of implicit warp synchronization [50].

3.4 Methodology

This section describes the implementation, methodology and evaluation of our detection algorithm. We implemented the detection algorithm as an analysis pass in LLVM 3.6 [86]. For alias analysis, we used the basic alias analysis pass in LLVM [87] [footnote 3]. We ran passes that inline function calls and lower generic memory spaces into non-generic memory spaces before our detection pass.
Note that this helps to limit the scope of detection to loops whose exits depend on shared or global memory and avoids unnecessarily including local memory. We used the approach described in [134] to identify irreducible loops.

Footnote 3: We did not observe lower false detection rates using other LLVM alias analysis passes, as they focus on optimizing inter-procedural alias analysis.

Table 3.2: Detection Pass Results on CUDA and OpenCL Kernels

Opt. Level  Krnls  N. Brs  N. Lps  Detections  T. Det.  F. Det.  F. Rate
O0          159    2751    277     14          0        14       5.05%
O2          159    1832    242     14          4        10       4.13%

We use CUDA and OpenCL applications for evaluation. OpenCL applications are compiled to LLVM Intermediate Representation (LLVM-IR) using the Clang compiler’s OpenCL frontend [84] with the help of the libclc library, which provides an LLVM-IR compatible implementation of OpenCL intrinsics [85]. For CUDA applications, we use the nvcc-llvm-ir tool [35]. This tool allows us to retrieve LLVM-IR from CUDA kernels by instrumenting some of the interfaces of the libNVVM library [102].

3.5 Evaluation

First, we ran our detection on the CUDA and OpenCL applications mentioned in Section 3.4. These applications were written to run on current GPUs, so they are expected to be free of SIMT-induced deadlocks. Table 3.2 summarizes the detection results from 159 kernels. With default compiler optimizations in LLVM 3.6 enabled, four true detections were found in four different kernels (ATM, HT, CP-DS and BH-ST) (see Vulnerability to Compilers’ Optimizations in Section 2.2) (footnote 4). These true detections were confirmed both by manual inspection of the machine code and by verifying that the code does not terminate when executed. False detections include four detections in the histogram256 kernel from the OpenCL SDK and one false detection in the bucketsort kernel in MGST from Rodinia. In both cases, the kernels depend on warp-synchronous behaviour to implement an atomic increment operation.
Other false detections exist in applications that involve inter-thread synchronization [24, 43]. No false negatives were observed (specifically, after using the detection results to apply the static SIMT-induced deadlock elimination algorithm described in Chapter 4, no deadlocks were observed at runtime). As an additional check, we ran the detection pass on the transformed kernels after applying the elimination pass, which yielded no detections.

Footnote 4: Using Nvidia’s NVCC compiler, only Cloth Physics deadlocks after turning on compiler optimizations.

Consider the O0 compilation to see the effect of the different filtration stages in the detection algorithm. Out of the 277 loops in the O0 compilation, only 103 are detected to be dependent on shared (or global) memory. Only 65 of those depend on a shared memory read that happens within the loop body. Only 14 of these are detected to be potentially redefined. Besides alias analysis, the main reasons for false detections include conservative reachability and dependence analysis that rely on the static CFG layout and instruction dependencies without considering the dynamic behaviour. For example, our analysis conservatively considers shared memory writes in basic blocks that are ancestors of a loop whose exit depends on a shared memory read to be potential redefining writes for that loop exit if these writes are reachable through an outer loop, although typically a release or signal statement follows the lock or wait loop.

For example, one of the BH-FC loops is control dependent on a shared array. The array is written in a basic block that is an ancestor of the loop of interest but is still reachable from the loop exit through an outer loop. Our analysis flags the load/store pair as a potential cause of a SIMT-induced deadlock. This particular example is also interesting because the code includes a more explicit hint that excludes the possibility of SIMT deadlock.
In particular, the conditional check on this shared array uses the __all() CUDA intrinsic, which evaluates to the same value for all threads within a warp. This forces all threads within the same warp to exit at the same iteration, preventing divergence at runtime. If our algorithm incorporated such information, it could have eliminated this SIMT deadlock possibility, because it would be able to infer that the loop exit in this case depends only on the code within the loop and not on the code following the loop. This motivates leveraging runtime information and more elaborate static analysis as future work.

3.6 Related Work

The presence of SIMT deadlocks was initially highlighted in various developer forums [114, 115]. The first research work to pay attention to this problem and attempt to find a solution is Ramamurthy’s MASc thesis [121]. However, the thesis neither proposed a way to detect SIMT deadlocks nor proposed a general solution. In [47], the authors provide formal semantics for NVIDIA’s stack-based reconvergence mechanism and a formal definition of the scheduling unfairness problem in the stack-based execution model that may prevent valid programs from terminating. However, they do not attempt to provide ways to detect or prevent the problem.

There is also some recent work on verification of GPU kernels that focuses on detecting data races and/or verifying barrier divergence freedom in GPU kernels [17, 79, 131, 160]. However, none of these verification tools considers the problem of SIMT deadlocks due to conditional loops.

3.7 Summary, Conclusion and Future Directions

In this chapter, we explained the scheduling constraints imposed by current SIMT implementations and how they conflict with inter-thread communication under divergence. We defined SIMT-induced deadlock, a type of deadlock that is unique to SIMT architectures.
We showed that SIMT deadlocks can be introduced by either programmers or compilers that are oblivious to the SIMT scheduling constraints. We explained how the presence of SIMT deadlocks affects the programmability of SIMT architectures. Finally, we presented an algorithm that can be used to statically identify potential SIMT deadlocks and highlighted the limitations of our current implementation of the algorithm. We implemented the algorithm in LLVM 3.6 and evaluated it on a number of CUDA and OpenCL applications, showing a false detection rate of 4%–5%.

The increasing complexity of algorithms mapped to SIMT accelerators mandates the development of robust verification tools. To the best of our knowledge, none of the current verification tools considers the problem of SIMT deadlocks due to conditional loops. The outcome of the detection algorithm can also be used to eliminate SIMT deadlocks, as we explain in the next chapters.

Future directions for this line of research should attempt to deal with the limitations mentioned in Section 3.3.2. For example, the static detection algorithm can be extended to leverage runtime information and/or more elaborate static analysis to perform less conservative reachability, dependence and alias analysis. Also, an inter-procedural analysis is needed to detect SIMT deadlocks across function calls.

This work also makes the observation that SIMT compilers need to be aware of SIMT scheduling constraints to avoid generating SIMT deadlocks. This affects all optimizations that alter the CFG or that move instructions across CFG basic blocks. Our work provides a framework that could be extended to detect whether a certain application is prone to SIMT deadlocks given a certain transformation (using variations of our detection algorithm).

Chapter 4

SSDE: Static SIMT-induced Deadlock Elimination

In Chapter 3, we proposed a static detection algorithm for SIMT-induced deadlocks.
This chapter proposes a static SIMT-induced deadlock elimination algorithm, which we refer to as “SSDE” for short. The purpose of this transformation is to allow code with traditional MIMD-style inter-thread synchronization to terminate correctly on current SIMT implementations. In simple terms, it can be viewed as a generalization and automation of the workaround in Figure 3.6.

The main goal of the elimination algorithm is to work around the SIMT scheduling constraints mentioned in Section 3.1, namely, the forced reconvergence at immediate postdominators and the serialization of divergent paths. Our SIMT-induced deadlock elimination algorithm deals with these constraints as follows: (1) it delays reconvergence to the earliest point where reconvergence would certainly not interfere with inter-thread communication, and (2) it enforces, through a CFG transformation, a breadth-first traversal of basic blocks with inter-thread communication.

The latter step is essential to avoid the impact of the serialization constraint. To understand this better, consider the code example in Figure 3.3. Assume that we delayed the reconvergence point of the lock acquire loop to a point beyond the lock release statement at line 4, atomicExch(mutex,0), instead of reconverging at the loop’s immediate postdominator. We would still suffer from a SIMT deadlock because the stack would still prioritize the execution of the taken path (i.e., the lock acquire loop path) and would never get to execute threads that diverged to the not-taken path (i.e., the lock release path). Note that changing the current stack behaviour to prioritize the not-taken path first would fix the serialization problem in this example. However, this simple change fails to handle the general case. For example, consider the case of a producer-consumer relationship where threads that diverged to the not-taken path wait in a loop for a signal from threads that diverged to the taken path.
If the stack prioritized the not-taken path, a SIMT deadlock would result because threads on the taken path would never get a chance to execute. Furthermore, if the producer-consumer relationship between alternate paths following a branch is mutual, both paths may need to interleave their execution to guarantee forward progress. Thus, no static (i.e., compile-time) prioritization of one path over another is sufficient to avoid SIMT deadlock. Our SSDE CFG transformation attempts to guarantee forward progress by achieving breadth-first execution of the application CFG instead of the current depth-first execution imposed by the stack.

SSDE is composed of two steps. The first step runs a static analysis to identify safe reconvergence points for loops that are detected to induce a SIMT deadlock (described in Chapter 3). The second step runs a CFG transformation that maintains the original MIMD semantics of the program while eliminating SIMT-induced deadlocks when running on SIMT machines.

4.1 Safe Reconvergence Points Identification

This section describes the second stage of our static analysis, which is summarized in Algorithms 2 and 3. These algorithms require as inputs the results of Algorithm 1 from Chapter 3. This stage identifies program points, called safe postdominators (SafePDoms), that can be employed by our SSDE code transformation to eliminate SIMT deadlock (footnote 1). A key idea is that the reconvergence point of a branch can be chosen arbitrarily from among its postdominators. Thus, it is possible to delay reconvergence of loop exits to any of their postdominator points, including the kernel exit. However, from a SIMD utilization perspective, it is preferable to reconverge at the earliest postdominator point that would not block the required inter-thread communication.

Footnote 1: Due to the serialization constraint, delaying reconvergence is necessary but not sufficient to eliminate SIMT deadlocks on current SIMT implementations (more in Section 4.2.1).
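
As a rough illustration of the idea that any postdominator of a branch is a candidate reconvergence point, the following Python sketch enumerates the candidates for the loop-exit branch of a small spin-lock-like CFG, ordered from earliest to latest. The block names (A, B, C, D, Exit) and the helper names are hypothetical and are not taken from the thesis implementation, which operates on LLVM-IR.

```python
def postdominators(succ, exit_node):
    """Iterative data-flow solution:
    pdom(n) = {n} | intersection of pdom(s) over all successors s of n."""
    nodes = set(succ)
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom


def reconvergence_candidates(succ, exit_node, branch):
    """Strict postdominators of `branch`, earliest (immediate) first."""
    pdom = postdominators(succ, exit_node)
    strict = pdom[branch] - {branch}
    # Strict postdominators form a chain; the earliest one is postdominated
    # by all the others, so it has the largest postdominator set.
    return sorted(strict, key=lambda p: -len(pdom[p]))


# Hypothetical CFG: B is a spin loop (B -> B) that exits to C, then D, then Exit.
cfg = {"A": ["B"], "B": ["B", "C"], "C": ["D"], "D": ["Exit"], "Exit": []}
candidates = reconvergence_candidates(cfg, "Exit", "B")
print(candidates)
```

In this toy CFG, the first candidate plays the role of the immediate postdominator used by current hardware, while the later candidates are more conservative choices that trade SIMD utilization for safety.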
SafePDoms are used in both our compiler-based SIMT deadlock elimination algorithm (Section 4.2.1) and our proposed adaptive hardware reconvergence mechanism (Chapter 5).

The potential benefit of delaying reconvergence to overcome SIMT-induced deadlocks is intuitive when one considers the second scheduling constraint (i.e., forced reconvergence). A SIMT deadlock due to the forced reconvergence constraint happens when threads that exit a conditional loop are blocked at the loop reconvergence point, indefinitely waiting for looping threads to exit, while the looping threads are waiting for the blocked threads to proceed beyond the current reconvergence point and release their lock(s). This cyclic dependency can be broken if the loop reconvergence point is moved later, to a point such that even if threads were hypothetically allowed to pass it, they could not affect the loop exit conditions (i.e., the threads could not reach a redefining write, which indicates that forced reconvergence at the delayed reconvergence point does not prevent necessary inter-thread communication). SafePDoms are computed such that they postdominate the loop exit branches and all control flow paths that lead to redefining writes from the loop exits (lines 4-9 in Algorithm 2).

It is perhaps less intuitive how delaying reconvergence helps overcome the first scheduling constraint (i.e., serialization). To understand this, it is necessary to understand how SSDE transforms the code after having identified a SafePDom. Due to the serialization constraint, threads iterating in a loop can be prioritized over the loop exit path and/or other paths that are parallel to the loop. A SIMT-induced deadlock occurs if these blocked threads must execute to enable the exit conditions of the looping threads.
To avoid this, our compiler-based SIMT deadlock elimination algorithm (explained in more detail in Section 4.2.1) replaces the backward edge of a loop identified by Algorithm 1 with two edges: a forward edge towards the loop’s SafePDom, and a backward edge from the SafePDom to the loop header. This control flow modification, combined with the forced reconvergence constraint of the hardware, guarantees that threads iterating in the loop wait at the SafePDom for threads executing alternate paths that are postdominated by this SafePDom before attempting another iteration. Accordingly, a SafePDom should postdominate the original loop exit and points along control flow paths that could lead to redefining writes that are either reachable from the loop (lines 4-9 in Algorithm 2) or parallel to it (lines 10-14 in Algorithm 2).

Algorithm 2 Safe Reconvergence Points
    Inputs: RedefWrites(L) for each L ∈ LSet
    Outputs: SafePDom(L) for each L ∈ LSet
 1: SafePDom(L) = IPDom(Exits(L)) ∀ L ∈ LSet
 2: for each loop L ∈ LSet do
 3:   for each IW ∈ RedefWrites(L) do
 4:     SafePDom(L) = IPDom(SafePDom(L), IW)
 5:     if BB(IW) is reachable from L then
 6:       for each branch instruction IBR ∈ ReachBrSet(L, BB(IW)) do
 7:         SafePDom(L) = IPDom(SafePDom(L), IBR)
 8:       end for
 9:     end if
10:     if BB(IW) is parallel to L then
11:       for each branch instruction IBR ∈ ParaBrSet(L, BB(IW)) do
12:         SafePDom(L) = IPDom(SafePDom(L), IBR)
13:       end for
14:     end if
15:   end for
16: end for
17: resolve SafePDom conflicts (Algorithm 3)

Iterative refinement of the identified SafePDoms may be necessary for nested or consecutive loops. In particular, SafePDom(L) should postdominate the SafePDom of any loop on the path between the exits of loop L and SafePDom(L). Otherwise, it is not a valid reconvergence point. We resolve this by recalculating SafePDom(L) for each loop so that it postdominates the SafePDoms of all loops on the path between the exits of loop L and SafePDom(L), as summarized in Algorithm 3, which executes on the outputs of Algorithm 2. The process is then iterated until it converges.
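
The core update of Algorithm 2 can be sketched in Python as follows. This is a simplified illustration under several assumptions rather than the thesis implementation: each instruction of interest is modelled as its own CFG node, the inspected loop is a single self-looping node, IPDom(p, i) is read as the earliest node that postdominates p (possibly p itself) and strictly postdominates i, only the reachable case (ReachBrSet) is handled, and all node and function names are hypothetical. The second example CFG models a wait loop nested inside an outer loop whose body contains the signalling write.

```python
def postdominators(succ, exit_node):
    """pdom(n) = {n} | intersection of pdom(s) over all successors s of n."""
    nodes = set(succ)
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
            if new != pdom[n]:
                pdom[n], changed = new, True
    return pdom


def reachable(succ, src):
    """All nodes reachable from src (including src itself)."""
    seen, work = {src}, [src]
    while work:
        for s in succ[work.pop()]:
            if s not in seen:
                seen.add(s)
                work.append(s)
    return seen


def safe_pdom(succ, exit_node, loop, redef_writes):
    pdom = postdominators(succ, exit_node)

    def earliest(cands):
        # Postdominators of a node form a chain; the earliest has the largest set.
        return max(cands, key=lambda p: len(pdom[p]))

    def ipdom(point, instr):
        # Assumed reading of IPDom(SafePDom(L), I) in Algorithm 2.
        return earliest(pdom[point] & (pdom[instr] - {instr}))

    safe = earliest(pdom[loop] - {loop})      # initial value: IPDom of the loop exit
    for w in redef_writes:
        safe = ipdom(safe, w)                 # move past the redefining write
        for n in reachable(succ, loop):       # branches on paths from loop to write
            if len(succ[n]) > 1 and w in reachable(succ, n):
                safe = ipdom(safe, n)
    return safe


# Spin lock: B spins, W is the unlocking write, R is the rest of the kernel.
spin = {"B": ["B", "W"], "W": ["R"], "R": ["Exit"], "Exit": []}
# Nested case: outer loop header H, wait loop Wt, branch If, signalling write S.
nested = {"H": ["Wt", "X"], "Wt": ["Wt", "If"], "If": ["T", "S"], "T": ["J"],
          "S": ["J"], "J": ["H"], "X": ["Exit"], "Exit": []}
print(safe_pdom(spin, "Exit", "B", ["W"]), safe_pdom(nested, "Exit", "Wt", ["S"]))
```

For the spin lock the sketch selects the node right after the unlocking write, and for the nested case it selects the first node after the outer loop, since waiting threads may need several outer-loop iterations by other threads before their flag is set.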
Convergence is guaranteed because, in the worst case, the exit node of the kernel is a common postdominator. Note that we force a single exit using a merge-return pass that merges multiple return points (if they exist) into one. During this process, lines 10-17 in Algorithm 3 detect whether transforming a nested loop with the selected reconvergence point would create new SIMT-induced deadlocks in dominating loops and, if so, calculate a more conservative reconvergence point.

Algorithm 3 Resolve SafePDom Conflicts
    Inputs: initial SafePDom(L) for each loop
    Outputs: final SafePDom(L) for each loop
 1: do
 2:   converged = true
 3:   for each loop curL ∈ LSet do
 4:     iSafePDom(curL) = SafePDom(curL)
 5:     if curL causes potential SIMT-induced deadlocks then
 6:       for each loop L, where BBs(L) ⊂ PExits(curL)↦SafePDom(curL) do
 7:         SafePDom(curL) = IPDom(SafePDom(curL), SafePDom(L))
 8:       end for
 9:     else
10:       for each loop L, where BBs(L) ⊂ BBs(curL) do
11:         if SafePDom(L) dominates SafePDom(curL) then
12:           if curL Exits are dependent on shared variable then
13:             Label curL as a potential cause for SIMT-induced deadlocks
14:           end if
15:         end if
16:         SafePDom(curL) = IPDom(SafePDom(curL), SafePDom(L))
17:       end for
18:     end if
19:     if iSafePDom(curL) ≠ SafePDom(curL) then
20:       converged = false
21:     end if
22:   end for
23: while converged ≠ true

To better understand Algorithm 2, we briefly walk through its application to the code in Figure 3.3. The initial reconvergence point of the loop is the instruction that immediately follows its exit edge. However, there is a redefining write (the atomicExch instruction) in a basic block that is reachable from the while loop exit. Thus, line 4 updates the SafePDom of the loop to be the instruction that immediately follows the atomicExch. The rest of the algorithm performs no further updates to SafePDom. Figure 4.1 shows the appropriate choice of SafePDom for more complex scenarios.
For example, the CFG at the bottom right resembles a scenario found in the sort kernel of the BarnesHut application [24] when compiler optimizations are enabled. Threads iterating in the self loop are supposed to wait for a ready flag to be set by other threads executing the outer loop. Threads executing the outer loop may need more than one iteration to set the ready flag of waiting threads. Thus, the filled (gray) basic block is an appropriate choice of SafePDom, as it postdominates all reachable paths to the redefining writes (i.e., leading threads may only wait for lagging ones after they finish all iterations of the outer loop).

    do {
      ......
      while (atomicCAS(mutex, 0, 1) != 0); // Lock
      if (cond1) {
        // critical section
        atomicExch(mutex, 0); // Unlock
        break;
      }
      // critical section
      atomicExch(mutex, 0); // Unlock
    } while (cond2);

(a) Pseudocode from the Radiosity Splash benchmark [121, 148].

    ....
    while (k >= bottom) {
      while (startd[k] < 0); // wait
      ....
      if (ch >= nbodiesd) {
        ....
      } else {
        // child is a body
        sortd[start] = ch; // signal
      }
      k -= dec; // move to next cell
    }

(b) BH-ST [24] pseudocode after O2.

(c) Control flow graphs of (a) on the left and (b) on the right. [Graphs not reproduced; legend: redefining writes, inspected loop, SafePDom location.]

Figure 4.1: SIMT-induced deadlock scenarios.

4.2 SSDE: Static SIMT Deadlock Elimination

This section proposes a compiler transformation pass (Algorithm 4) that, when combined with Algorithms 1, 2, and 3, enables the execution of MIMD code with inter-thread synchronization on stack-based SIMT machines. Note that very recently, Nvidia announced its newest GPU architecture, Volta, which replaces the stack-based execution model with one that allows independent thread scheduling to enable inter-thread synchronization [103].
Volta’s execution model is in fact very similar to our hardware proposal for an adaptive warp reconvergence mechanism in Chapter 5. We discuss Volta’s execution model further in Chapters 5 and 8.

Algorithm 4 SIMT-Induced Deadlock Elimination
    Inputs: SafePDom(L) for each L ∈ LSet
    Output: SIMT deadlock free modified CFG
 1: SwitchBBs = ∅
 2: for each loop L ∈ LSet do
 3:   if L causes potential SIMT-induced deadlock then
 4:     if SafePDom(L) ∉ SwitchBBs then
 5:       SwitchBBs = SwitchBBs ∪ SafePDom(L)
 6:       if SafePDom(L) is the first instruction of a basic block BB then
 7:         Add a new basic block BBS before BB.
 8:         Incoming edges to BBS are from BB predecessors.
 9:         Outgoing edge from BBS is to BB.
10:       else
11:         Split BB into two blocks BBA and BBB.
12:         BBA contains instructions up to but not including SafePDom(L).
13:         BBB contains remaining instructions including SafePDom(L).
14:         BBS inserted in the middle, BBA as predecessor, BBB as successor.
15:       end if
16:       Insert a PHI node to compute a value cond in BBS, where:
17:         for each predecessor Pred to BBS
18:           cond.addIncomingEdge(0, Pred)
19:         end for
20:       Insert Switch branch swInst on the value cond at the end of BBS, where:
21:         swInst.addDefaultTarget(BBS successor)
22:     end if
23:     BBS is the basic block immediately preceding SafePDom(L)
24:     Update PHI node cond in BBS as follows:
25:       cond.addIncomingEdge(UniqueVal, Latch(L).src) (UniqueVal is unique to this edge)
26:     Update Switch branch swInst at the end of BBS as follows:
27:       swInst.addCase(UniqueVal, Latch(L).dst)
28:     Set Latch(L).dst = BBS.
29:   end if
30: end for

4.2.1 Elimination Algorithm

To avoid SIMT-induced deadlocks, loose fairness in the scheduling of the diverged, yet communicating, threads is required. Constrained by the existing architecture, we achieve this by manipulating the CFG. Algorithms 2 and 3 pick the earliest point in the program that postdominates the loop exit and all control flow paths that can potentially affect the loop exit condition.
Algorithm 4 modifies the CFG by replacing the backward edge of a loop identified by Algorithm 1 with two edges: a forward edge towards the SafePDom, and a backward edge from the SafePDom to the loop header. This guarantees that threads iterating in the loop wait at the SafePDom for threads executing other paths postdominated by the SafePDom before attempting another iteration, allowing for inter-thread communication. Algorithm 4 is a generalization and automation of the manual workaround shown in Figure 3.6.

To help explain Algorithm 4, Figure 4.2 shows how it is applied to the spin lock code from Figure 3.3. The CFG on the left is the original and the one on the right is the CFG after the transformation. Applying Algorithms 2 and 3 identifies the SafePDom of the self loop at BBB (1) to be the first instruction after the atomicExch (2). Lines 11-14 in Algorithm 4 then introduce a new basic block BBS (3). Next, lines 16-26 add a PHI node and a switch branch instruction in BBS (4). Finally, line 28 modifies the destination of the loop edge to BBS (5). Thus, the end result is that the backward branch of the loop is replaced by two edges: a forward edge to BBS and a backward edge to the loop header. The newly added basic block acts as a switch that redirects the flow of execution to an output basic block according to the input basic block. The PHI node is translated by the back-end into move instructions at the predecessors of BBS. The switch instruction is lowered into a series of compare instructions followed by direct branches such that it diverges first to the default not-taken path.
This guarantees that BBS remains the IPDom of subsequent backward branches, forcing (potentially) multiple loops to interleave their execution across iterations.

Algorithm 4 preserves the MIMD semantics of the original code because the combination of the PHI node and the switch branch in BBS guarantees the following: (1) each control flow path in the original CFG has a single equivalent path in the modified CFG that maintains the same observable behaviour as the original path, by maintaining the same sequence of static instructions that affect the state of the machine (e.g., updates to data registers or memory). For example, the execution path (BBC1-BBS-BBC2) in the transformed CFG is equivalent to the path (BBC) in the original CFG, and the loop path (BBB-BBB) is equivalent to (BBB-BBS-BBB). (2) Control flow paths in the modified CFG that have no equivalent path in the original CFG would never execute due to the way the switch statement added in BBS works. For example, the path (BBC1-BBS-BBB) in the modified CFG would not execute because the PHI node in BBS controls the switch condition such that a thread coming from BBC1 branches to BBC2.

    High-level code:
      *mutex = 0;                            // A
      while (atomicCAS(mutex, 0, 1) != 0);   // B
      // critical section                    // C
      atomicExch(mutex, 0);                  // C
      // non critical part                   // C

    Before (blocks A, B, C):
      BBB:  cond = atomicCAS(mutex,0,1) != 0
            br cond, BBB, BBC                  (1)
      BBC:  // critical section
            atomicExch(mutex,0);
            // non critical part               (2)

    After (blocks A, B, C1, S, C2):
      BBB:  cond = atomicCAS(mutex,0,1) != 0
            br cond, BBS, BBC1                 (5)
      BBC1: // critical section
            atomicExch(mutex,0);
      BBS:  cond2 = phi [0,BBC1],[1,BBB]       (3), (4)
            switch(cond2) [def,BBC2],[1,BBB]
      BBC2: // non critical part

Figure 4.2: SIMT-Induced Deadlock Elimination Steps.

The elimination algorithm does not reorder memory instructions or memory synchronization operations and therefore it does not impact the consistency model assumed by the input MIMD program.
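
To make the effect of the transformation concrete, the following Python sketch runs a deliberately simplified stack-based reconvergence model on the original and the transformed spin-lock CFGs of Figure 4.2. The simulator, its data structures, the omission of block A, and the step bound used to declare a deadlock are all assumptions made for illustration; real hardware behaviour is more involved.

```python
def try_lock(s):
    """Models atomicCAS(mutex, 0, 1) == 0."""
    acquired = s["mutex"] == 0
    if acquired:
        s["mutex"] = 1
    return acquired


def b_orig(t, s):               # original BBB: spin on the lock
    return "C" if try_lock(s) else "B"


def c_orig(t, s):               # original BBC: critical section + unlock
    s["mutex"] = 0
    return "End"


def b_new(t, s):                # transformed BBB: failed attempts go to BBS
    if try_lock(s):
        return "C1"
    s["cond"][t] = 1            # PHI value for the edge BBB -> BBS
    return "S"


def c1(t, s):                   # BBC1: critical section + unlock
    s["mutex"] = 0
    s["cond"][t] = 0            # PHI value for the edge BBC1 -> BBS
    return "S"


def switch_block(t, s):         # BBS: redirect according to the incoming edge
    return "B" if s["cond"][t] == 1 else "C2"


ORIG = ({"B": b_orig, "C": c_orig}, {"B": ["B", "C"], "C": ["End"]}, {"B": "C"})
NEW = ({"B": b_new, "C1": c1, "S": switch_block, "C2": lambda t, s: "End"},
       {"B": ["S", "C1"], "C1": ["S"], "S": ["C2", "B"], "C2": ["End"]},
       {"B": "S", "S": "C2"})   # reconvergence point of each divergent branch


def run_warp(cfg, n_threads, max_steps=10_000):
    """Return True if the warp reaches 'End' within max_steps, else False."""
    blocks, succ, reconv = cfg
    state = {"mutex": 0, "cond": [0] * n_threads}
    stack = [["B", None, list(range(n_threads))]]   # [pc, reconvergence pc, threads]
    for _ in range(max_steps):
        pc, rpc, mask = stack[-1]
        if pc == "End":
            return True
        if pc == rpc:                   # threads wait here; resume the entry below
            stack.pop()
            continue
        nxt = {t: blocks[pc](t, state) for t in mask}
        targets = [s for s in succ[pc] if s in nxt.values()]
        if len(targets) == 1:           # no divergence
            stack[-1][0] = targets[0]
            continue
        stack[-1][0] = reconv[pc]       # this entry resumes at the reconvergence point
        for tgt in reversed(targets):   # first-listed successor executes first
            stack.append([tgt, reconv[pc], [t for t in mask if nxt[t] == tgt]])
    return False                        # step bound exceeded: treated as a deadlock


print(run_warp(ORIG, 4), run_warp(NEW, 4))
```

In this model, the original CFG never finishes with more than one thread because the spinning threads sit on top of the stack while the lock holder waits at the reconvergence point, whereas in the transformed CFG every thread passes through the switch block between attempts, which lets the lock holder release the lock.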
We provide a more detailed semi-formal correctness discussion in Appendix A.

Algorithm 4 ensures SIMT-induced deadlock free execution. For example, threads that execute the branch at the end of BBB wait at the beginning of BBS for the other threads within the same warp to reach this reconvergence point. This allows the leading thread that acquired the lock to execute the atomicExch instruction and release the lock before the lagging threads attempt another iteration of the loop.

4.2.2 Compatibility with Nvidia GPUs

There is no official documentation for Nvidia’s reconvergence mechanism. However, a close reading of Nvidia’s published patents [18, 29] and examination of disassembly [107] suggest that for a branch instruction to have a reconvergence point at its immediate postdominator, the branch must dominate its immediate postdominator, and for loops, the loop header must dominate the loop body (i.e., the loop must be a single-entry, reducible loop). We accounted for these additional constraints as follows. First, in Algorithm 2, reconvergence points of divergent branches are not necessarily their immediate postdominators. If the branch’s immediate postdominator is not dominated by the branch’s basic block, then the reconvergence point of the branch is the postdominator of the closest dominating basic block of the branch. Note that this affects the IPDom(I) definition in the prerequisites of Algorithm 1 defined in Listing 1 in Chapter 3. Second, in Algorithm 4, we have to guarantee that the newly added basic block is a valid reconvergence point. We guarantee this by forcing the newly created loops to be single-entry loops. In particular, if a loop has entries other than the header, it is converted to a single-entry loop by adding a new loop header that merges all the loop entries and redistributes them to their original destinations.
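
The single-entry conversion described above can be sketched on a plain successor-list CFG as follows. This is one plausible reading, stated as an assumption: every edge that targets a loop entry node (from outside or inside the loop) is redirected to a new header H, which then dispatches to the original entry nodes. The dispatch logic itself (a PHI node and switch, as in Algorithm 4) is not modelled, and all names are hypothetical.

```python
def dominators(succ, entry):
    """dom(n) = {n} | intersection of dom(p) over all predecessors p of n."""
    nodes = set(succ)
    preds = {n: [p for p in succ if n in succ[p]] for n in nodes}
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom


def make_single_entry(succ, loop, header="H"):
    """Route all edges into multi-entry loop `loop` through a new header node."""
    entries = sorted(n for n in loop
                     if any(n in succ[p] for p in succ if p not in loop))
    if len(entries) <= 1:
        return dict(succ), set(loop)
    new = {}
    for n, outs in succ.items():
        redirected = [header if s in entries else s for s in outs]
        new[n] = list(dict.fromkeys(redirected))   # drop duplicate edges, keep order
    new[header] = entries                          # header redistributes to the entries
    return new, set(loop) | {header}


# Hypothetical irreducible loop {a, b}: entered at both a and b from E.
cfg = {"E": ["a", "b"], "a": ["b"], "b": ["a", "X"], "X": []}
new_cfg, new_loop = make_single_entry(cfg, {"a", "b"})
dom = dominators(new_cfg, "E")
print(new_cfg, all("H" in dom[n] for n in new_loop))
```

After the rewrite, the new header dominates every block of the loop, which is the property the reconvergence mechanism is assumed to require.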
Furthermore, if the added switch branch instruction has more than two destinations, all backward edges are merged into one backward edge that jumps to the common dominator of all the backward edge destinations (this becomes the final loop header), which then controls divergence to the original backward edge destinations.

4.2.3 SSDE Limitations

The static elimination approach has some limitations. We highlighted the impact of some of these limitations on the scope and accuracy of the detection algorithm in Section 3.3.2. In this section, we focus on the implications of these limitations for the scope and efficiency of the elimination algorithm.

Inter-procedural Dependencies: Although it is possible to extend our detection algorithm to handle function calls with an inter-procedural analysis [23, 59], extending the elimination algorithm is less straightforward. Algorithm 4 requires the redefining statement to be inlined into the same function as the loop that induces the SIMT deadlock. However, aggressive inlining can significantly increase instruction cache misses [32]. Also, in some cases, it may not be possible to perform inlining (e.g., recursive functions or calls to proprietary library code).

CFG Modifications: To avoid SIMT-induced deadlocks, the elimination algorithm modifies the original application’s CFG. This could make debugging the original code harder [19, 52, 57]. The modified CFG adds instructions that do no useful work other than working around the constraints of the current reconvergence mechanisms. These added instructions increase the static code size and the dynamic instruction count. Finally, the modified CFG could result in an increased liveness scope for some variables. This occurs because the modified CFG introduces new edges that increase the connectivity between basic blocks. For example, in the original CFG in Figure 4.2, a variable that is defined in BBA and used only in BBB is live in basic blocks BBA and BBB only.
However, in the modified CFG on the right, it is live in BBA, BBB, BBC1, and BBS. This increased liveness scope may increase pressure on the register file, leading to register spills to local memory and thus increased memory traffic. Alternatively, by increasing the number of registers used per thread, it could limit the maximum number of threads that can be launched. In either case, performance would be reduced.

False Detections: An increased number of false detections (loops that are erroneously classified as involved in synchronization) directly amplifies the implications of the CFG modifications mentioned earlier by increasing the number of loops affected by the code transformation. It amplifies the unnecessary increase in dynamic instruction count and register spills.

Indirect Branches: Directly applying the elimination algorithm to indirect branches would require some means of determining all potential targets. In the absence of complete information on potential branch targets, the CFG of the application is effectively unknown. One potential solution is to conservatively assume that the safe postdominator point is the kernel exit.

Barriers in Divergent Code: SSDE does not handle divergent barriers, and thus code such as that shown in Figure 3.2 would still lead to undefined behaviour (refer to Section 3.1.2 for more details).

Warp-Synchronous Behaviour: As discussed in detail in Chapter 3, some GPU applications rely on implicit warp-synchronous behaviour. For such applications, it is necessary to disable our transformation, possibly using pragmas. In Chapter 5, we present a hardware reconvergence mechanism for avoiding SIMT deadlock. This mechanism, combined with the static analysis in Section 3.3, tackles most of the challenges associated with the compiler approach, but it is still necessary to mark code relying on implicit warp synchronization.

4.3 Methodology

This section describes our methodology to evaluate SSDE.
We implement the elimination algorithm as a transformation pass in LLVM 3.6 [86] that runs as the last transformation pass after the detection pass. For alias analysis, we use the basic alias analysis pass in LLVM [87].

In all experiments, we disabled optimizations in the back-end compilation stages from LLVM-IR to PTX by the LLVM NVPTX back-end and from PTX to SASS by Nvidia's driver (footnote 2) to guarantee that there are no further alterations to our generated CFG. The detection pass runs as the last compiler pass to detect SIMT deadlocks because we found that SIMT deadlocks can be introduced by other transformations. Our LLVM code for both the detection and elimination passes has been posted online [37]. Note that, ideally, the safe postdominator analysis and SSDE would be implemented in the compiler back-end that generates the final hardware assembly. This back-end is not released by NVIDIA.

We use CUDA, OpenCL and OpenMP applications for evaluation. OpenCL applications are compiled to LLVM Intermediate Representation (LLVM-IR) using the Clang compiler's OpenCL frontend [84] with the help of the libclc library, which provides an LLVM-IR compatible implementation of OpenCL intrinsics [85]. For CUDA applications, we use the nvcc-llvm-ir tool [35]. This tool allows us to retrieve LLVM-IR from CUDA kernels by instrumenting some of the interfaces of the libNVVM library [102]. OpenMP compilation relies on the recent support of OpenMP 4.0 in LLVM [15]. For hardware results, we use a Tesla K20C Kepler architecture GPU. Our benchmarks include OpenCL SDK 4.2 [106], Rodinia 1.0 [26], KILO TM [43], both CUDA and OpenMP versions of BarnesHut [24, 31], and two OpenMP microbenchmarks, Test Lock (TL) [22] and Array Max (AM) [94]. OpenCL SDK and Rodinia applications do not have synchronization between divergent threads; however, kernels from [22, 24, 31, 43, 94] require synchronization between divergent threads. Table 3.1 briefly describes all the kernels that are affected by our elimination algorithm.

Footnote 2: PTX is NVIDIA's virtual assembly and SASS is the machine assembly. SASS generation is currently done only by Nvidia's drivers.

Table 4.1: Code Configuration Encoding (code configuration XYZ)

| X: Original code format | Y: Compiler optimizations | Z: Our Analysis/Transf. |
|-------------------------|---------------------------|--------------------------|
| M: MIMD                 | 0: -O0                    | S: Elimination           |
| S: SIMT                 | 2: -O2                    | D: Delayed Rec. (Ch. 5)  |

Table 4.2: Static Overheads for the Elimination Algorithm

Transformed loops (Tf. Loops, columns T and F per optimization level) and SASS static instructions:

| Kernel | -O0 T | -O0 F | -O2 T | -O2 F | M0-  | S0-  | M0S  | M2- | S2*- | M2S  |
|--------|-------|-------|-------|-------|------|------|------|-----|------|------|
| HT     | 1     | -     | 1     | -     | 408  | 436  | 422  | 184 | 177  | 198  |
| ATM    | 1     | 1     | 1     | -     | 506  | 506  | 527  | 233 | 247  | 436  |
| CP-DS  | 2     | -     | 2     | -     | 624  | 631  | 617  | 631 | 631  | 632  |
| BH-TB  | 1     | 3     | 1     | 3     | 1871 | 1836 | 1899 | 534 | 891  | 1521 |
| BH-SM  | -     | 2     | -     | 1     | 1983 | 1983 | 2011 | 933 | 996  | 1640 |
| BH-ST  | 1     | -     | 1     | 1     | 520  | 485  | 541  | 219 | 226  | 261  |
| BH-FC  | -     | 3     | -     | 1     | 1549 | 1549 | 1591 | 765 | 765  | 891  |

SASS used registers + stack size (bytes):

| Kernel | M0-    | S0-    | M0S    | M2-   | S2*-  | M2S    |
|--------|--------|--------|--------|-------|-------|--------|
| HT     | 12+128 | 12+128 | 12+128 | 16+0  | 16+0  | 18+0   |
| ATM    | 11+168 | 11+168 | 11+168 | 20+0  | 21+0  | 42+0   |
| CP-DS  | 31+0   | 37+0   | 37+0   | 31+0  | 39+0  | 46+0   |
| BH-TB  | 17+344 | 17+336 | 17+344 | 40+0  | 38+24 | 40+528 |
| BH-SM  | 36+176 | 36+176 | 36+176 | 55+0  | 55+0  | 80+64  |
| BH-ST  | 16+96  | 16+88  | 16+96  | 22+0  | 22+0  | 24+0   |
| BH-FC  | 36+272 | 36+272 | 36+272 | 48+40 | 48+32 | 48+56  |

Table 4.1 shows the abbreviation encoding we use in the results discussion. The first character indicates whether the original code accounts for the reconvergence constraints (S for SIMT) or not (M for MIMD). The second indicates the level of optimization performed on the code before we run our passes. The last character indicates the type of analysis or transformation performed: applying the elimination algorithm (S), calculating the delayed reconvergence points without CFG transformations (D), or neither (-).

4.4 Evaluation

4.4.1 Static Overheads

We rewrite CUDA and OpenCL kernels that require inter-thread synchronization assuming MIMD semantics. Table 4.2 compares six different code versions in terms of static instructions, register and stack storage per thread. The M0- and M2- versions deadlock on current GPUs. Enabling default compiler optimizations in S2- leads all four of our applications to deadlock.
To address this, in configuration S2*-, we selectively enable passes that are experimentally found not to conflict with the manual code transformation. This excluded all invocations of the -simplifycfg and -jumpthreading passes.

[Figure 4.3: Normalized Accumulative GPU Execution Time. Bars for S0-, M0S, S2*- and M2S over CP, HT, ATM, BH and AVG.]

For the non-optimized versions (i.e., S0- and M0S), Table 4.2 shows that static instruction overhead is small in both manual and compiler based transformations (S0- and M0S are comparable to M0-). Also, for the non-optimized versions, they have little to no overhead in terms of registers (e.g., CP) compared to the MIMD version. Turning on compiler optimizations generally reduces stack usage and increases the number of registers as a result of register allocation (via the -mem2reg pass in LLVM). This, however, has a negative impact for kernels with false positives when applying our detection Algorithm 1. For example, in the BH-TB kernel in M2S, there is an increase in both register and stack usage. This is due to a significant number of register spills as a result of increased liveness scope for registers after our CFG transformation.

4.4.2 Dynamic Overheads

We evaluate the run-time overheads using performance counters on a Tesla K20C (Kepler architecture). Figure 4.3 shows the accumulated GPU time for all kernel launches averaged over 100 runs. HT and ATM have a single kernel, CP has four kernels and BH has six major kernels. M0S has an 11% overhead on average compared to S0-. M2S leads to a speedup of 45% compared to S0-. M2S is also within 7.5% overhead compared to S2*-.
For some kernels (e.g., HT and ATM), the benefit of enabling all compiler optimizations overcomes the overhead of the automated transformation, leading to improvements compared to S2*-. Figure 4.4 breaks down these results for kernels that are affected by our transformation. As shown in Figure 4.4, for kernels that have no false detections, the M0S and S0- versions have almost identical performance (e.g., HT and BH-ST). CP-DS is an exception. It has two consecutive loops that implement two nested spin locks. Both loops are detected as a potential cause of SIMT-induced deadlocks. However, due to aliasing, the outer lock release is falsely classified as a redefining write for the inner lock. Hence, the SafePDom point for both loops is conservatively estimated to be after the outer lock release. This causes additional overhead versus the manual transformation. Threads try to acquire the outer lock even though the lock has been acquired by another thread within the same warp (one that acquired the outer lock but failed to acquire the inner lock). This leads to the increase in dynamic instruction count seen in Figure 4.5.

[Figure 4.4: Normalized Kernel Execution Time. Bars for S0-, M0S, S2*- and M2S over CP-DS, HT, ATM, BH-TB, BH-SM, BH-ST, BH-FC and AVG.]

For kernels that have false detections, the overhead depends on the run-time behavior. For example, although BH-FC has 3 false detections in its M0S version, they hardly impact its performance. This is mainly because the kernel has very high utilization (Figure 4.6, footnote 3), hence it is barely affected by the transformations in its CFG. In other cases (e.g., ATM, BH-TB and BH-SM), false detections lead to significant performance overheads. We attribute these overheads to both the reduced SIMD utilization (Figure 4.6) and the increased dynamic instruction count (Figure 4.5). In BH-TB, enabling compiler optimizations when SSDE is applied leads to an increase in execution time. We attribute this to the increase in memory traffic due to excessive register spills. We measured an increase of 28.5% in the DRAM requests per cycle for the M2S version with respect to S0- and M0S. The NVIDIA profiler also shows that M2S has 38.3% fewer DRAM requests per cycle compared to S0- for BH-SM. This explains the improved performance M2S exhibits over S0-. Finally, although M2S performs poorly for both BH-TB and BH-SM, the overall impact on the benchmark is not severe because BH-FC is the dominant kernel in the total execution time.

Footnote 3: The NVIDIA profiler does not measure SIMD efficiency for OpenCL applications. Thus, SIMD utilization for CP is not reported.

[Figure 4.5: Normalized Dynamic Instruction Count. Bars for S0-, M0S, S2*- and M2S over CP-DS, HT, ATM, BH-TB, BH-SM, BH-ST, BH-FC and AVG.]

[Figure 4.6: Average SIMD Utilization. Bars for S0-, M0S, S2*- and M2S over HT, ATM, BH-TB, BH-SM, BH-ST, BH-FC and N.-AVG.]

Table 4.3: OpenMP Kernels (Normalized Execution Times)

| Kernel | CPU-4T -O2 | GPU-OMP M2- | GPU-OMP M2S | GPU-CUDA S2*- |
|--------|------------|-------------|-------------|---------------|
| TL-FG  | 1          | D           | 3.41        | N/A           |
| TL-CG  | 1          | D           | 0.01        | N/A           |
| AM-FG  | 1          | D           | 32.58       | N/A           |
| AM-CG  | 1          | D           | 17.87       | N/A           |
| BH-ST  | 1          | D           | 2.97        | 3.17          |
| BH-TB  | 1          | D           | 0.28        | 9.66          |
| BH-SM  | 1          | 3.78        | 1.72        | 3.54          |
| BH-FC  | 1          | 0.70        | 0.73        | 1.25          |
| BH-BB  | 1          | 0.44        | 0.44        | 9.00          |
| BH-IN  | 1          | 1.21        | 1.21        | 7.53          |

4.4.3 OpenMP support

The OpenMP 4.0 standard supports the offloading of a parallel region to an accelerator (e.g., a GPU). The OpenMP programming model is appealing due to both its abstraction and portability across architectures. Thus, it makes accelerators accessible to a broader set of developers [66]. Currently, there is an ongoing coordinated effort by many technology companies to add OpenMP 4.0 support for accelerators in LLVM [93].
Basic support is currently available [15, 117] while there are ongoing efforts to improve its performance [14, 16] compared to native GPU programming models. SIMT-induced deadlock is a fundamental challenge to correctly supporting synchronization on SIMT accelerators (e.g., it is unclear how to compile omp_set_lock(...) for GPUs). Our investigations have found that current OpenMP support for GPUs suffers from the SIMT deadlock problem, as it generates code for a spin lock with back-off delay to translate omp_set_lock(...). Hence, a valid OpenMP program that executes properly on CPUs may not terminate on GPUs due to SIMT deadlocks.

We evaluate the potential for our proposed compiler algorithms to address this by integrating the detection and elimination passes described in this chapter and Chapter 3 into the compilation chain of OpenMP.

At the time of writing, we could not find any applications that make use of OpenMP support for GPUs and which employ fine-grained synchronization. This may partly be due to the recency of OpenMP support for GPUs and perhaps also the lack of support for generating efficient fine-grained synchronization code for GPUs in the current versions of OpenMP. Therefore, we modify 8 existing OpenMP kernels to enable offloading parallel regions to Nvidia GPUs [22, 31, 94]. Six of these kernels are for an OpenMP BarnesHut implementation that uses an algorithm similar to the CUDA version except for some GPU specific optimizations. We also modify the TL and AM kernels to emulate fine-grained and coarse-grained synchronization.

Table 4.3 shows the speedup of four different configurations run on a Tesla K20C GPU versus running the code with 4 threads on an Intel Core i7-4770K CPU. The base compilation to GPUs with current LLVM OpenMP 4.0 support encounters deadlocks in many cases (labeled ‘D’ in the table). However, with our detection and elimination passes, all kernels run to termination, maintaining portability between the CPU and the GPU.
Numbers in bold show instances where the GPU code achieved a speedup compared to the CPU without performance tuning. For other cases, the developer may choose either not to offload the kernel to a GPU or to performance tune the code for GPUs, starting from functionally correct (terminating) code whose execution can be profiled to determine optimization opportunities. Due to false detections in BH-SM and BH-FC, OMP M2S performs differently from OMP M2-, with a 2.2× slowdown for BH-SM. As expected, the GPU performs poorly compared to the CPU with high contention on locks (e.g., TL-CG, where 3K threads are competing for the same lock) while performing much better with fine-grained synchronization (e.g., TL-FG). In AM, a non-blocking check of whether the current element is larger than the current maximum happens before entering the critical section. This significantly reduces the contention over the critical section that updates the maximum value. Thus, the execution is highly parallel and achieves large speedups (17.87× for the coarse-grained version and 32.58× for the fine-grained version, as shown in Table 4.3) compared to the CPU. AM-FG finds multiple maximum values within smaller arrays, reducing contention even further. Table 4.3 also compares the performance of the OpenMP version of BH with the CUDA version. In most cases, the CUDA version significantly outperforms the OpenMP version. Reducing this performance gap (either by improving OpenMP compiler support or by performance tuning of the source code) is outside the scope of this paper.

4.5 Related Work

There are efforts to make GPU programming accessible through different well-established non-GPU programming languages. These efforts include source-to-source translation [71, 101] as well as developing non-GPU front-ends [15] for language and machine independent optimizing compilers such as LLVM [69]. However, these proposals do not handle SIMT deadlocks.
MCUDA [136] is an opposite approach that translates CUDA kernels into conventional multicore CPU code. Auto-vectorization, which converts scalar code into code that runs on vector machines, is also a well studied area [67]. However, these efforts do not deal with inter-thread synchronization and thus could lead to SIMT/SIMD deadlocks.

Recently, Google presented an open-source LLVM-based GPGPU compiler for CUDA [149] that proposes GPGPU specific optimizations to generate high performance code. Similar prior work has also been proposed, but using source-to-source CUDA transformation [153]. There are also numerous research papers that explore specific code optimizations for GPGPUs [48, 49, 97]. Such performance-driven compiler optimizations are complementary to the work presented in this chapter.

In [78], Li et al. propose a fine-grained inter-thread synchronization scheme that uses GPU on-chip scratchpad memory. However, it is left to programmers to use their locking scheme carefully to avoid SIMT-induced deadlocks [78]. Recently, software lock stealing and virtualization techniques were proposed to avoid circular locking among threads in an attempt to solve livelocks in nested locking scenarios, and to reduce the memory cost of fine-grained locks [152]. As acknowledged in [152], one of their limitations is that “locks are not allowed to be acquired in a loop way” due to deadlocks (i.e., SIMT deadlocks using our terminology). Our work assumes that the MIMD code is livelock free and that livelocks due to nested locking scenarios are taken care of by programmers, as they would be on a MIMD machine (e.g., by forcing nested locks to be acquired in a certain order).

4.6 Summary, Conclusion and Future Directions

In this chapter, we presented a static SIMT-induced deadlock elimination algorithm. The algorithm is performed in two stages. The first stage identifies safe postdominator points for problematic loops that are detected to potentially induce a SIMT deadlock.
The second stage alters the CFG using the safe postdominator recommendations in a way that eliminates SIMT deadlocks.

Intellectually, this work is the first to propose MIMD-to-SIMD conversion for arbitrary code with inter-thread synchronization since Flynn's taxonomy. This work is relevant to any architecture that relies on hardware-assisted implicit vectorization (e.g., Nvidia's SASS [107] and AMD's Southern Islands [5]) or that solely relies on explicit vectorization (e.g., Intel's SSE and IBM's AltiVec). The latter would need to integrate our SSDE algorithms with the regular auto-vectorization techniques (e.g., [67]).

This work also shows that SIMT implementations should not and do not have to dictate the level of abstraction used in their programming. In this chapter, we achieve this by using compiler methods that can selectively, through CFG transformations, decouple thread scheduling decisions from the application's original code layout.

This work could be used to enable reliable library-based support for fine-grained synchronization on GPUs. Library-based implementations of fine-grained synchronization primitives [78, 152] are currently not possible because of SIMT-induced deadlocks. Another research direction could be to use runtime JIT compilation techniques to filter out false detections within specific kernels and then fine-tune the code transformation to avoid the large overheads in subsequent kernel launches.

Further, one general insight from this work is that one can represent all loops in a program with a single loop, and use forward branches to mimic the looping behaviour of merged loops. The forward branches can be further replaced by predication.
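As a rough illustration of the single-loop insight, the following Python sketch rewrites two consecutive loops as one loop whose body uses forward branches to select which original loop body runs. It is hand-written for illustration and is not the output of the SSDE pass; the function names, the `state` selector and the loop bodies are all assumptions.

```python
def two_loops(n, m):
    """Original form: two consecutive loops, i.e. two backward edges."""
    a = 0
    i = 0
    while i < n:           # first loop
        a += i
        i += 1
    b = 0
    j = 0
    while j < m:           # second loop
        b += j * j
        j += 1
    return a, b


def single_loop(n, m):
    """Merged form: one loop (one backward edge) plus forward branches.

    `state` is a hypothetical selector: 0 = first loop body,
    1 = second loop body, 2 = done.
    """
    a = b = 0
    i = j = 0
    state = 0
    while state != 2:      # the only backward edge
        if state == 0:     # forward branch into the first loop's body
            if i < n:
                a += i
                i += 1
            else:
                state = 1  # first loop finished; move on to the second
        else:              # forward branch into the second loop's body
            if j < m:
                b += j * j
                j += 1
            else:
                state = 2  # second loop finished; leave the merged loop
    return a, b
```

Both functions compute the same pair of sums for any non-negative `n` and `m`; the inner `if`/`else` tests are the kind of forward branches that could, in principle, be turned into predicated instructions.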
This single-loop representation could be more efficient for architectures that can efficiently support only a limited number of loops (e.g., using hardware loops in digital signal processing chips).

Chapter 5

AWARE: Adaptive Warp Reconvergence

Chapter 4 presented SSDE, a compiler approach to enable MIMD-like execution on current stack-based SIMT implementations. However, as pointed out, the current version has limitations. In this section, we explore the potential for hardware modifications that can enable MIMD compatible execution to help avoid these limitations. The hardware mechanism we propose satisfies the following characteristics: (1) it maintains reconvergence at immediate post dominators when it does not interfere with inter-thread communication, (2) it guarantees that no thread would be indefinitely blocked due to scheduling constraints of divergent control flow paths, and (3) it enables delaying reconvergence when required for synchronization purposes. These characteristics maintain SIMD efficiency for applications without inter-thread synchronization while maintaining SIMT deadlock free execution for applications with inter-thread synchronization. With such hardware, there is no need for the compiler to alter an application's CFG to avoid SIMT deadlocks, which avoids many overheads and limitations.

This section describes AWARE, an Adaptive Warp Reconvergence mechanism that achieves the three characteristics mentioned above. AWARE avoids most of the limitations in our compiler-only approach. It decouples the tracking of diverged splits from their reconvergence points using two tables: a Splits Table (ST) and a Reconvergence Table (RT). The ST and RT hold the same fields as the SIMT stack, with the RT holding an extra field called the Pending Mask. The Pending Mask represents threads that have not yet reached the reconvergence point. AWARE uses these tables to avoid serializing divergent paths.
AWARE replaces the depth-first traversal of divergent control-flow paths imposed by the reconvergence stack used in current implementations with a breadth-first traversal. It does this by selecting warp splits from the ST for greedy FIFO scheduling with respect to their insertion into the ST, either after a branch or after reaching a reconvergence point. In typical implementations, a depth-first traversal may lead to constant prioritization of a loop path over other parallel or following paths. This is avoided with a breadth-first traversal, which ensures fairness in scheduling different control paths. AWARE also enables delayed reconvergence, leveraging the SafePDOM points identified by the compiler analysis. For loops identified as potentially involved in SIMT deadlocks, the reconvergence point can be delayed to a SafePDOM using AWARE's reconvergence table. The case where lock acquire and release occur in different functions can be supported in AWARE by employing timed-out reconvergence. This mechanism works by allowing threads in a warp split that is stalled waiting for other warp splits to reconverge to proceed after a time exceeding a TimeOut value.

AWARE limits the architectural changes to the divergence unit. As we observe that there tends to be a large gap between the maximum theoretical occupancy and the typical occupancy of the split and reconvergence tables, we further propose to virtualize them. We show this can be done by spilling entries that exceed the respective table's physical capacity to a backing location in the memory system and filling them when needed. An advantage of AWARE is that it does not require changing the application's control flow graph. Thus, it does not suffer from the synchronization scope limitations and debuggability challenges associated with SSDE. It also transparently supports barriers in divergent code.
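The contrast drawn above between depth-first (stack) and breadth-first (FIFO) traversal can be illustrated with a toy Python model of one warp executing a spin lock of the form `while (!try_lock()) {}; critical(); unlock();`. This is only a sketch under simplifying assumptions: one scheduling step stands for one basic block, the lowest-numbered thread always wins the atomic, and the names `run_stack_policy`, `run_fifo_policy` and `max_steps` are hypothetical. It does not model AWARE's actual hardware tables or timed-out reconvergence.

```python
from collections import deque


def run_stack_policy(num_threads, max_steps=1000):
    """Depth-first model: reconvergence right after the spin loop.

    A thread that acquires the lock leaves the loop and waits at the loop
    exit for the rest of the warp, while the looping split keeps being
    scheduled. Returns the number of scheduling steps taken, or None if
    max_steps is exceeded (the toy analogue of a SIMT deadlock).
    """
    lock_held = False
    spinning = set(range(num_threads))
    parked = set()                      # lock holders waiting at the loop exit
    steps = 0
    while spinning:
        steps += 1
        if steps > max_steps:
            return None
        if not lock_held:
            winner = min(spinning)      # one thread wins the atomic
            lock_held = True
            spinning.remove(winner)
            parked.add(winner)
        # otherwise every spinning thread fails and retries
    # The warp has reconverged: the critical section runs and unlocks.
    return steps + 1


def run_fifo_policy(num_threads, max_steps=1000):
    """Breadth-first model: reconvergence delayed until after the unlock.

    Warp splits are scheduled in FIFO order, so the split holding the lock
    always gets to run its critical section and release the lock.
    """
    lock_held = False
    finished = set()
    fifo = deque([("LOOP", frozenset(range(num_threads)))])
    steps = 0
    while fifo:
        steps += 1
        if steps > max_steps:
            return None
        block, threads = fifo.popleft()
        if block == "LOOP":
            if lock_held:
                fifo.append(("LOOP", threads))       # all fail, retry later
            else:
                winner = min(threads)
                lock_held = True
                losers = threads - {winner}
                if losers:
                    fifo.append(("LOOP", losers))    # still spinning
                fifo.append(("CS", frozenset({winner})))
        else:  # "CS": critical section followed by the lock release
            lock_held = False
            finished |= threads
    assert len(finished) == num_threads
    return steps
```

With more than one thread, `run_stack_policy` never leaves the loop and hits the step limit, whereas `run_fifo_policy` terminates because the lock holder is eventually scheduled and releases the lock.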
Table 5.1 summarizes the above described advantages of AWARE over SSDE.

Table 5.1: AWARE vs SSDE

| Comparison Point                 | SSDE                       | AWARE                     |
|----------------------------------|----------------------------|---------------------------|
| Inter-procedural synchronization | not supported              | supported                 |
| CFG modifications                | needed                     | not needed                |
| Barrier in divergent code        | not supported              | supported                 |
| SIMT-deadlock detection          | necessary and conservative | complementary and relaxed |

Note that at the time of writing this thesis, Nvidia announced a new GPU architecture called Volta. Volta changed the SIMT execution model to enable inter-thread synchronization. Volta's execution model as described in [103] appears to be very similar to AWARE. The closest related Nvidia patent is a patent published in 2016 that describes a notion of convergence barriers [34]. In the execution scheme described in this patent, convergence barriers are used to join divergent groups of threads back together to maintain high SIMD efficiency while allowing for flexible thread scheduling. In this proposal “the divergence management mechanism that relies on the convergence barriers is decoupled from the thread scheduling mechanism”. The compiler analyzes the program to identify the appropriate locations to insert convergence barriers. In hardware, “a multi-bit register may correspond to each convergence barrier name and a bit is assigned for each thread that may participate in the convergence barrier”. When all the threads reach a convergence barrier (i.e., the multi-bit register of the corresponding barrier is all zeros), the convergence barrier is cleared and all threads that participated in the convergence barrier are released (i.e., unblocked) and resume execution in SIMD fashion. Similar to AWARE, the patent decouples tracking divergence entries from reconvergence points. The multi-bit convergence barrier registers are simply another representation of the pending mask entry in the reconvergence table.
The main difference is that our proposal, as we describe in this chapter, relies on the hardware implicitly adding reconvergence points to the reconvergence table, checking whether threads have reached reconvergence points or not, and deciding when to switch execution to other paths. In contrast, in the Nvidia patent, the compiler explicitly adds instructions to perform these actions when needed.

5.1 Decoupled SIMT Tables

[Figure 5.1: Execution with AWARE]

(a) Code:

    // A
    if (tid % 2 == 0) {
        // B
    } else {
        // C
    }
    // D

(b) CFG: A (1111) diverges at branch BR^A_{B-C} into B (1010) and C (0101), which reconverge at D (1111).

(c) SIMT tables (time advances going down):

| Label  | Splits Table (PC, RPC, Active Mask) | Reconvergence Table (PC, RPC, Reconvergence Mask, Pending Mask) |
|--------|-------------------------------------|-----------------------------------------------------------------|
| 1      | A, ---, 1111                        | (empty)                                                         |
| 2a, 2b | C, D, 1010; B, D, 0101              | D, ---, 1111, 1111                                              |
| 3a, 3b | C, D, 1010                          | D, ---, 1111, 1010                                              |
| 4a, 4b | (empty)                             | D, ---, 1111, 0000                                              |
| 5      | D, ---, 1111                        | (empty)                                                         |

The goal of AWARE is to impose fewer scheduling constraints than stack-based reconvergence while maintaining immediate post-dominator reconvergence when possible. To achieve this, AWARE decouples the tracking of diverged splits from their reconvergence points using the two tables shown in Figure 5.4. The warp Split Table (ST) records the state of warp splits executing in parallel basic blocks. The Reconvergence Table (RT) records reconvergence points for active warp splits. The ST and RT hold the same fields as the SIMT stack, with the RT holding an extra field called the Pending Mask. The Pending Mask represents threads that have not yet reached the reconvergence point.
This decoupling enables AWARE to avoid serializing divergent paths up to the reconvergence point.

Figure 5.1a shows a simple divergent code example. Its corresponding control flow is shown in Figure 5.1b. This example assumes a single warp with four threads traversing through the code. The bit mask in each basic block of the control flow graph denotes the threads that execute that block. Initially, all threads execute basic block A. However, upon executing the divergent branch BR^A_{B-C}, warp A1111 diverges into two warp splits B1010 and C0101. In the notation BR^A_{B-C}, branch is abbreviated as BR, the superscript A represents the basic block containing the branch, and the subscript B-C represents the successors. The state of a warp split is represented by a letter denoting the basic block the warp split is executing, with subscripts denoting the active threads within the warp split. Figure 5.1c shows the operation of AWARE, illustrating changes to the ST and RT. Labels are used to clarify which parts of the figure are being referred to in the text below.

At the top, Figure 5.1c shows the state of the ST and RT when the warp begins executing at block A. Since there is no divergence, there is only a single entry in the ST, and the RT is empty (1). The warp is scheduled repeatedly until it reaches the end of block A. After the warp executes branch BR^A_{B-C}, warp A1111 diverges into two splits B0101 and C1010. Then, the A1111 entry is moved from the ST to the RT (2a) with the PC field set to the RPC of branch BR^A_{B-C} (i.e., D). Similar to the SIMT stack, the RPC can be determined at compile time and either conveyed using an additional instruction before the branch or encoded as part of the branch itself (current GPUs typically include additional instructions to manipulate the stack of active masks). The Reconvergence Mask entry is set to the value of the active mask the warp split had before the branch.
The Pending Mask field of the RT is used to keep track of which threads have not yet reached the reconvergence point. Hence, it is also set equal to the active mask. Concurrently, two entries are inserted into the ST, one for each side of the branch (2b). The active mask in each entry represents threads that execute the corresponding side of the branch.

At this point, both warp splits B0101 and C1010 are in the ST. In principle, AWARE does not restrict the order in which these splits are executed. As we will see in Chapter 7, the execution of both warp splits can be interleaved up to the reconvergence point (D). In Section 5.2, we discuss AWARE's warp split scheduling policy in detail. For now, we assume we schedule warp split B0101 first. Eventually, warp split B0101 reaches the reconvergence point (D). When this happens, its entry in the ST is invalidated (3a), and its active mask is subtracted from the Pending Mask of the corresponding entry in the RT (3b). Since only warp split C remains in the ST, execution switches to warp split C1010, which eventually reaches the reconvergence point (D). When it does, its entry in the ST is also invalidated (4a), and its active mask is subtracted from the Pending Mask of the corresponding entry in the RT (4b). Upon each update to the Pending Mask in the RT, the updated Pending Mask is checked. When the Pending Mask is all zeros, the entry is moved from the RT to the ST (5).

5.2 Warp Splits Scheduling

To limit architectural changes to the divergence unit, AWARE allows only one warp split to be eligible for scheduling at a time. AWARE switches execution to another warp split only at basic block boundaries. We found that AWARE does not require changes to the register scoreboard used in the baseline GPU [30] to perform well. Instead, AWARE can use the existing scoreboard, which results in it conservatively respecting register dependencies at warp granularity.
For example, in Figure 5.1b, warp split C0101 may stall for pending register write dependencies from warp split B1010 even though these two warp splits are guaranteed to access distinct physical register locations, potentially leading to unnecessary stalls. In Chapter 7, we explore the benefit of relaxing this constraint. Specifically, we propose and evaluate an efficient scoreboard mechanism that tracks register dependencies at thread granularity to avoid unnecessary stalls due to false dependencies. In this chapter, however, our goal is to design a MIMD compatible reconvergence mechanism with minimal hardware changes.

The warp splits scheduling policy in AWARE is crucial to guarantee that no thread is indefinitely blocked while scheduling divergent threads under synchronization, and thus to achieve the goal of MIMD compatible execution. For this purpose, AWARE replaces the depth-first traversal imposed by the SIMT stack with a breadth-first traversal using a FIFO queue. In Figure 7.4, the FIFO queue is included in the ST. We show a more detailed microarchitecture in Section 5.5. A warp split is selected for greedy scheduling in FIFO order with respect to the ST. A warp split is pushed into the FIFO when it is first inserted into the ST as an outcome of a branch instruction or as a reconverged entry. It is scheduled when it reaches the output of the FIFO. It is popped out of the FIFO when its entry in the ST is invalidated after encountering a branch instruction or reaching a reconvergence point. This guarantees fairness in scheduling different control paths. AWARE switches from one warp split to another only after encountering divergent branches, reconvergence points, or barriers. Such greedy scheduling of the output entry ensures good memory performance [124].

5.3 Nested Divergence

If no synchronization is detected, AWARE is able to maintain IPDOM reconvergence, as defined in Section 2.3.1, even in nested divergence scenarios. Figure 5.2 illustrates an example.
Figure 5.2c shows the state of the ST and RT after each step of executing the control flow graph in the left part of the figure.

Initially, a single entry in the ST exists for warp split A1111 (1). After branch BR^A_{B-C}, the branch control unit updates the ST and RT as in the prior example from Section 7.2.1 (2). AWARE's FIFO scheduler prioritizes the not-taken path in case of multi-path branches. We assume it prioritizes C1010 in this example. Thus, subsequently, warp split C1010 diverges at BR^C_{D-E} (3). Hence, the entry corresponding to C1010 is invalidated in the ST and inserted into the RT with the PC field set to the reconvergence point of BR^C_{D-E}, which is F. Also, two new entries corresponding to both sides of BR^C_{D-E} are added to the ST. The new entries are added to the first unallocated entries in the ST. At this point, the ST tracks three parallel control flow paths. However, following FIFO order, warp split B0101 executes first. Eventually, warp split B0101 reaches reconvergence point G (4). The branch control unit updates the Pending Mask of the corresponding reconvergence table entry for G1111. Concurrently, the B0101 entry in the ST is invalidated (4). The ST stores indices to the reconvergence entries in the RT, avoiding the need for an associative search through the RT.

The warp split scheduler then switches execution to the front of the FIFO queue. We assume it is warp split E1000 (i.e., it is the not-taken path for BR^C_{D-E}). Later, warp split E1000 reaches its reconvergence point F (5). The Pending Mask of the reconvergence entry F1010 is updated accordingly, and the E1000 entry in the ST is invalidated.
Then, warp split D0010 reaches the same reconvergence point (F).

[Figure 5.2: Example of Multi-Path IPDOM execution with nested divergence.

(a) Code

    // id = thread ID
    if (id % 2 == 0) {
        // BBB
    } else {
        // BBC
        if (id == 1) {
            // BBD
        } else {
            // BBE
        }
        // BBF
    }
    // BBG

(b) CFG: A1111 ends in branch BR^A_{B-C} with successors B0101 and C1010. C1010 ends in branch BR^C_{D-E} with successors D0010 and E1000. D and E join at F1010. B and F join at G1111.

(c) ST and RT tables (only valid entries shown). ST entries are (PC, RPC, Active Mask). RT entries are (PC, RPC, Rec. Mask, Pending Mask).

    ① Initial state.             ST: (A, ---, 1111)                              RT: empty
    ② After BR^A_{B-C}.          ST: (B, G, 0101), (C, G, 1010)                  RT: (G, ---, 1111, 1111)
    ③ After BR^C_{D-E}.          ST: (B, G, 0101), (D, F, 0010), (E, F, 1000)    RT: (G, ---, 1111, 1111), (F, G, 1010, 1010)
    ④ B0101 reconverges at G.    ST: (D, F, 0010), (E, F, 1000)                  RT: (G, ---, 1111, 1010), (F, G, 1010, 1010)
    ⑤ E1000 reconverges at F.    ST: (D, F, 0010)                                RT: (G, ---, 1111, 1010), (F, G, 1010, 0010)
    ⑥ D0010 reconverges at F.    ST: empty                                       RT: (G, ---, 1111, 1010), (F, G, 1010, 0000)
    ⑦ F1010 moves to ST table.   ST: (F, G, 1010)                                RT: (G, ---, 1111, 1010)
    ⑧ F1010 reconverges at G.    ST: empty                                       RT: (G, ---, 1111, 0000)
    ⑨ G1111 moves to ST table.   ST: (G, ---, 1111)                              RT: empty
]
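The table updates walked through around Figure 5.2 can be replayed with a small software model. The following is a minimal Python sketch under stated assumptions, not the hardware design: the names Split, Reconv, Block and run are hypothetical, execution is modeled at basic-block granularity, and the not-taken targets (C and E) follow the assumption made in the text. The special-case optimizations discussed in this section are not modeled.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Split:
    """One Splits Table (ST) entry (hypothetical software model)."""
    pc: str
    rpc: Optional[str]
    mask: int
    r_index: Optional[int]  # RT entry this split reconverges into


@dataclass
class Reconv:
    """One Reconvergence Table (RT) entry (hypothetical software model)."""
    pc: str
    rpc: Optional[str]
    rec_mask: int
    pending: int
    r_index: Optional[int]  # parent RT entry, inherited from the diverged split


@dataclass
class Block:
    """A basic block: either falls through or ends in a two-way branch."""
    fallthrough: Optional[str] = None
    taken: Optional[str] = None
    not_taken: Optional[str] = None
    reconv: Optional[str] = None                  # reconvergence point of the branch
    cond: Optional[Callable[[int], bool]] = None  # thread id -> branch taken?


def run(cfg: Dict[str, Block], entry: str, n_threads: int) -> List[Tuple[str, str]]:
    """Return the (block, active mask) pairs in the order they execute."""
    st: Dict[int, Split] = {}
    rt: Dict[int, Reconv] = {}
    fifo: deque = deque()
    trace: List[Tuple[str, str]] = []
    next_st = next_rt = 0

    def push(split: Split) -> None:
        nonlocal next_st
        st[next_st] = split
        fifo.append(next_st)
        next_st += 1

    push(Split(entry, None, (1 << n_threads) - 1, None))
    while fifo:
        # The scheduled entry is invalidated once it diverges or reconverges.
        s = st.pop(fifo.popleft())
        while True:  # greedy: keep running this split until a switch point
            if s.pc == s.rpc:  # reached its reconvergence point
                r = rt[s.r_index]
                r.pending &= ~s.mask
                if r.pending == 0:  # all threads arrived: move RT entry to ST
                    del rt[s.r_index]
                    push(Split(r.pc, r.rpc, r.rec_mask, r.r_index))
                break
            trace.append((s.pc, format(s.mask, f"0{n_threads}b")))
            blk = cfg[s.pc]
            if blk.cond is None:
                if blk.fallthrough is None:
                    break  # end of the kernel
                s.pc = blk.fallthrough
                continue
            taken = sum(1 << t for t in range(n_threads)
                        if (s.mask >> t) & 1 and blk.cond(t))
            not_taken = s.mask & ~taken
            if taken and not_taken:  # divergence: one RT entry, two ST entries
                rt[next_rt] = Reconv(blk.reconv, s.rpc, s.mask, s.mask, s.r_index)
                push(Split(blk.not_taken, blk.reconv, not_taken, next_rt))
                push(Split(blk.taken, blk.reconv, taken, next_rt))
                next_rt += 1
                break
            s.pc = blk.taken if taken else blk.not_taken  # no divergence at run-time
    return trace


# CFG of Figure 5.2 with four threads; thread 0 is the rightmost mask bit.
CFG = {
    "A": Block(taken="B", not_taken="C", reconv="G", cond=lambda t: t % 2 == 0),
    "B": Block(fallthrough="G"),
    "C": Block(taken="D", not_taken="E", reconv="F", cond=lambda t: t == 1),
    "D": Block(fallthrough="F"),
    "E": Block(fallthrough="F"),
    "F": Block(fallthrough="G"),
    "G": Block(),
}

trace = run(CFG, "A", 4)
print(trace)
```

In this model the blocks execute in the order A1111, C1010, B0101, E1000, D0010, F1010, G1111, which matches the order of steps ① to ⑨ in Figure 5.2c.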
The ST entry is removed and the reconvergence entry F1010 is updated again to mark the arrival of warp split D0010 ⑥.

Upon updating the Pending Mask of the reconvergence entry F1010, the MP control unit detects that there are no more pending threads for this entry (the Pending Mask is all zeros). Hence, the MP control unit moves the reconvergence entry to the ST table, setting the active mask to the Reconvergence Mask ⑦. Finally, warp split F1010 reaches the reconvergence point G and updates the reconvergence entry G1111. The control unit detects that the Pending Mask of entry G1111 is all zeros ⑧ and moves the entry from the RT table to the ST table ⑨.

This example does not cover three special cases that can be optimized. The first is when one side of the branch directly diverges to the reconvergence point of the branch (e.g., an if clause with no else). In this case, the Pending Mask of the corresponding reconvergence entry is updated to mark this side of the branch as converged, and there is no need to register a new entry for it in the ST. The second is when there is no actual divergence at run-time at a specific branch. In such a case, there is no need to add a reconvergence entry in the RT that corresponds to this branch. The third case is when a warp split encounters a branch whose reconvergence point is the same as the reconvergence point of the diverged warp split (e.g., backward branches that create loops or a forward branch that has its reconvergence point the same as a preceding branch). In such a case, there is no need to add a new entry to the RT table, since the corresponding reconvergence entry already exists. If such an entry were added, its PC and RPC fields would be identical and thus it would reconverge automatically to the parent entry once it moves to the ST table. Note that our proposed mechanism does not need to look up the RT to check whether an RPC already exists when inserting reconvergence points.
This may lead to multiple entries in the RT representing the same reconvergence point (but with different reconvergence masks). This is fine as long as each entry in the ST table is associated with a single entry in the RT table (e.g., using an RT index field in the ST table). We explain these implementation details in Section 5.5.

5.4 Using AWARE to avoid SIMT Deadlock

At this point, we have shown that AWARE: (1) replaces the depth first traversal of the SIMT stack with alternatives such as a breadth first FIFO traversal, which relaxes the serialization scheduling constraint described in Section 3.1, and (2) is able to maintain IPDOM reconvergence, which maintains SIMD efficiency. These are two of the three characteristics we mentioned at the introduction of this chapter that a hardware mechanism needs to satisfy to efficiently avoid SIMT deadlocks. The third characteristic is the capability of delaying reconvergence beyond IPDOM when required for synchronization purposes. In this section, we explain refinements to AWARE to achieve this third characteristic. Further, we show how AWARE can handle barriers in divergent code.

5.4.1 Handling Divergent Barriers:

In AWARE, we support barriers in divergent code by adding a field in the FIFO queue to mark warp splits that reach the barrier as blocked (abbreviated BL in Figure 5.3). The PC of the blocked split is updated in both the ST and the FIFO to point to the instruction after the barrier. When a blocked entry is at the FIFO output, it is pushed to the back of the FIFO without being scheduled for fetch. Splits are released when all threads within a thread block or within a warp, depending on the barrier granularity, reach the barrier.

Figure 5.3 shows the operation of AWARE during the execution of the CFG shown in the top left of Figure 5.3 ①. The top right portion of Figure 5.3 ② shows the state of both the ST and RT tables after the warp executes the branch at the end of BBA. It also shows the state of the FIFO queue ③.
The FIFO field shows the FIFO order of warp splits and the BL field indicates if the warp split is blocked at a barrier. Initially, the FIFO queue shows that threads that diverged to BBB should execute first. Threads at BBB continue execution until reaching the barrier instruction. Then, the entry that corresponds to these threads in the FIFO queue is marked as blocked ⑤. Finally, the PC field in the ST table is updated to the instruction following the barrier (B') ⑥ so that execution continues from this PC when the barrier is released. The next available warp split in the FIFO is BBC. It executes until the other barrier instruction is reached as well ⑦. In case of a thread block wide barrier, both warp splits will eventually be blocked until the barrier is released after all threads reach the barrier ⑧.

[Figure 5.3: Handling Barriers. CFG: A1111 branches to B0101 and C1010; each of B and C executes barrier() and continues at B' and C' once the barrier is released; both paths reconverge at D. RT entries are (PC, RPC, Pending Mask, Active Mask).

    After the branch at the end of BBA.   ST: (B, D, 0101), (C, D, 1010)    RT: (D, -, 1111, 1111)   FIFO: B0101 (BL=0), C1010 (BL=0)
    After B0101 reaches its barrier.      ST: (B', D, 0101), (C, D, 1010)   RT: (D, -, 1111, 1111)   FIFO: B0101 (BL=1), C1010 (BL=0)
]

[Figure 5.4: Delayed Reconvergence. CFG: A1111 enters loop block B, which contains lock(); threads 0111 keep iterating in B while thread 1000 exits to C, which contains unlock(), followed by C'. RT entries are (PC, RPC, Pending Mask, Active Mask).

    Initial state.           ST: (A, -, 1111)                    RT: empty                  FIFO: A1111 (BL=0)
    After the loop branch.   ST: (B, C', 0111), (C, C', 1000)    RT: (C', -, 1111, 1111)    FIFO: C1000 (BL=0), B0111 (BL=0)
]

5.4.2 Delayed Reconvergence:

As discussed in Chapter 3, strict IPDOM reconvergence can lead to SIMT deadlocks. To address this, for any loop that Algorithm 1 reports as being the potential cause of a SIMT deadlock, AWARE uses the safe reconvergence points computed using Algorithm 2.
Since AWARE does not have the serialization constraint imposed by stack-based SIMT implementations, Algorithm 2 can be simplified to consider only redefining writes that are reachable from the loop reconvergence points (removing lines 10–14).

It is necessary to recalculate reconvergence points of other branches to guarantee that the reconvergence point of any branch postdominates the reconvergence points of all branches on the path from the branch to its reconvergence point (Algorithm 5). In Algorithm 5, the initial SafePDOM point of a branch is either the SafePDOM identified by Algorithm 2 or the normal IPDOM point of the branch if it does not induce a SIMT deadlock. Note that Algorithm 5, being applied to all branches, eliminates the need for the step of resolving SafePDOM conflicts among SIMT deadlock inducing loops. In SSDE, we needed to consider conflicts among loops only because the code is later transformed and the reconvergence points of the different branches are computed as the immediate post dominator points of the branches in the transformed code. However, in AWARE we apply Algorithm 5 instead of 2.

Figure 5.4a illustrates the operation of AWARE with delayed reconvergence. The reconvergence point of the loop is modified from the IPDOM point (i.e., C) to a SafePDOM point (specifically, C', the closest point following the unlock statement) ①. Once the loop branch instruction is encountered, two warp-split entries are added to the ST and the RPC of each entry is set to C' ②. The FIFO queue has two valid entries with priority given to the not-taken path ③. Hence, the thread that diverged to BBC (i.e., exited the loop) executes first ④. It releases the lock and waits at the reconvergence point (C'). Eventually all threads exit the loop and reconverge at C' ⑤. Note that the choice between prioritizing the not-taken versus the taken path is arbitrary.
In cases where there are no branches between the lock and the unlock statement, choosing the taken path may come with some performance benefit by reducing the amount of spinning in the spin lock loop. In Chapter 6, we discuss more robust techniques to reduce warp spinning overheads.

Algorithm 5 Resolve SafePDOM Conflicts - AWARE version

    Inputs: initial SafePDOM(Br) for each branch
    Outputs: final SafePDOM(Br) for each branch
    do
        converged = true
        for each branch curBr ∈ BrSet do
            iSafePDOM(curBr) = SafePDOM(curBr)
            for each branch Br, where BB(Br) ⊂ P_{Exits(curBr) ↦ SafePDOM(curBr)} do
                SafePDOM(curBr) = IPDom(SafePDOM(curBr), SafePDOM(Br))
            end for
            if iSafePDOM(curBr) ≠ SafePDOM(curBr) then
                converged = false
            end if
        end for
    while converged ≠ true

5.4.3 Timed-out Reconvergence:

The compiler may fail to detect a SIMT deadlock if the synchronization is across function calls. Further, to guarantee SIMT deadlock free execution, it may take very conservative decisions in labeling SIMT deadlock inducing loops. To avoid these limitations, we extend AWARE with a timeout mechanism.

In this mechanism, a delay counter is associated with each entry in the RT. The counter starts counting when at least one thread reaches the associated reconvergence point (and updates the pending mask). The counter resets with the insertion of the entry or any update to its pending mask. Otherwise, it increments every T cycles. T represents the resolution of the counter. In our evaluation we used T=1 cycle. When the delay counter value exceeds a predetermined TIMEOUT value, this indicates a potential SIMT-deadlock due to the forced reconvergence constraint.
Therefore, at this point, we enable threads that have reached the reconvergence point to proceed with their execution by inserting them into a new entry in the ST table and updating the active mask of their RT entry.

Figure 5.5 shows how the timed-out reconvergence mechanism operates on the CFG on the left. In this case, we assume that reconvergence points are set to the IPDOM points (i.e., delayed reconvergence is off). Initially, there is a single entry in the ST ① representing threads that are iterating through the loop attempting to acquire the lock, whereas the thread that exits the loop keeps waiting at the reconvergence point ②. This state continues ③ until the reconvergence timeout logic is triggered. Once the waiting time of threads at the reconvergence point exceeds the TimeOut value, a new entry is added to the ST with the same PC and RPC as the entry in the RT and an active mask equal to the Active Mask of the RT entry minus its Pending Mask ④. We also update the active mask of the RT entry to be the same as the Pending Mask to reflect the fact that the timed-out threads are no longer waiting at this entry.

[Figure 5.5: Timed-Out Reconvergence. CFG: A1111 enters loop block B, which contains lock(); threads 0111 keep iterating in B while thread 1000 exits to C, which contains unlock(). On TimeOut, the waiting thread skips reconvergence. RT entries are (PC, RPC, Pending Mask, Active Mask).

    Before the timeout.   ST: (B, C, 0111)                  RT: (C, -, 0111, 1111)   FIFO: B0111 (BL=0)
    After the timeout.    ST: (B, C, 0111), (C, -, 1000)    RT: (C, -, 0111, 0111)   FIFO: B0111 (BL=0), C1000 (BL=0)
]

The new entry is added to the FIFO queue ⑤. The new entry C1000 is guaranteed to be executed as entry B0111 gets to the tail of the FIFO queue once the loop branch is executed ⑥. In a nested control flow graph, threads that skip reconvergence at a nested reconvergence point can reconverge at the reconvergence point associated with a prior branch for which the immediate postdominator has not yet been encountered.
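The delay counter and the mask updates described above can be sketched as follows. This is a minimal Python model under stated assumptions, not the hardware logic: the class RTEntry and its methods are hypothetical names, each call to tick stands for one counter step (T cycles), and the normal case in which the Pending Mask reaches zero and the entry moves to the ST is not modeled.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class RTEntry:
    """Hypothetical model of one RT entry with its delay counter."""
    pc: str
    rpc: Optional[str]
    pending: int   # threads that have not yet reached the reconvergence point
    active: int    # threads expected to reconverge at this entry
    delay: int = 0

    def waiting(self) -> int:
        """Threads that already arrived and are waiting at this entry."""
        return self.active & ~self.pending

    def update_pending(self, arrived_mask: int) -> None:
        """A split reached the reconvergence point; the counter resets."""
        self.pending &= ~arrived_mask
        self.delay = 0

    def tick(self, timeout: int) -> Optional[Tuple[str, Optional[str], int]]:
        """Advance the counter by one step.

        Returns the (PC, RPC, Active Mask) of the new ST entry when the
        counter exceeds the timeout, otherwise None.
        """
        if self.waiting() == 0:
            return None  # the counter only runs while some thread is waiting
        self.delay += 1
        if self.delay <= timeout:
            return None
        released = self.waiting()   # Active Mask minus Pending Mask
        self.active = self.pending  # timed-out threads no longer wait here
        self.delay = 0
        return (self.pc, self.rpc, released)


# Example of Figure 5.5: thread 1000 exits the loop and waits at C.
entry = RTEntry(pc="C", rpc=None, pending=0b1111, active=0b1111)
entry.update_pending(0b1000)
released = None
while released is None:
    released = entry.tick(timeout=2)
print(released, format(entry.active, "04b"))
```

With the masks of Figure 5.5, the released split carries the mask 1000 and the RT entry is left with Pending Mask and Active Mask both equal to 0111, as in the figure. The timeout of 2 steps is chosen only to keep the example short.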
The TimeOut value could be determined in a number of ways. One way is to determine the value empirically by profiling a large number of GPU kernels. However it is determined, a fixed value of TimeOut should be large enough such that it does not impact the reconvergence behaviour of regular GPU applications (thus avoiding unnecessary performance penalties).

5.5 AWARE Implementation

This section describes a micro-architecture realization of AWARE. We start by describing a straightforward realization and then describe an optimization that reduces AWARE implementation cost.

5.5.1 AWARE Basic Implementation

Figure 5.6 shows a basic implementation of the ST and RT tables and their interaction with the branch resolution unit. Note that in AWARE, no changes are required to the baseline instruction buffer. The reason is that we only switch between warp splits after a branch or after reaching a reconvergence point. In both cases, the instructions that belong to a warp in the instruction buffer become invalid and new instructions are fetched from the new program counter of the split at the front of the FIFO. Note that this is identical to the interface of the stack with the instruction buffer except that, in the case of the stack, the instruction(s) in the instruction buffer that belong to a warp are for the warp split at the top of the stack, whereas in AWARE the instruction(s) are for the warp split at the front of the FIFO.

Each warp has its own ST and RT tables. Upon divergence, the branch control unit invalidates the warp's entry in the I-Buffer. Hence, it is no longer eligible for fetch or issue. The branch control unit transfers the content of the divergent entry from the ST to an unused (invalid) entry in the RT after modifying the PC field to the RPC of the branch. The index of this RT entry, R-Index, is stored, along with the other information required for each new warp split resulting from the divergence, in the ST.
The R-Index is used to access the reconvergence entry in the RT table when the warp split's PC is equal to its RPC. The indices of the new entries in the ST are added to the back of the FIFO. Note that in this basic implementation, the FIFO does not need to be a separate structure from the ST table. However, as we explain in Section 5.5.2, we may need to spill the ST table entries to memory while keeping the FIFO within the SM. Therefore, we refer to them as separate logical structures.

The branch control unit also marks the entry of the divergent split in the ST table as invalid and the next split in the FIFO order becomes eligible for scheduling. The PC entry in ST and RT tables points to the first instruction in a basic block; thus it directs the fetch unit to the next instruction to be fetched from the corresponding warp.

Upon reconvergence, the branch control unit invalidates the reconverged entry in both the I-Buffer and the ST. It uses the R-Index field of the reconverging split to access the RT table and update the pending mask. Finally, the Pending Mask of the updated entry is checked; if it is all zeros, the entry is moved to the first unallocated entry in the ST table and the ST entry index is inserted at the back of the FIFO.

With 32 threads per warp, the ST and RT tables have a maximum theoretical size of 32 entries per warp (max splits is 32 and RT entries are added only when a split diverges, which implies a maximum of 32 RT entries). Thus, AWARE can be realized using RAMs by adding a 5-bit field to the FIFO, ST and RT tables. Upon insertion of an entry into the ST, the 5-bit index of this ST entry is stored in the corresponding FIFO queue entry. Upon insertion of an entry in the RT table, the RT entry index is stored in a field in its ST entries. To avoid moving entries within the FIFO, two registers could be used to indicate the front and back of the FIFO in a circular manner. Look ups into ST and RT tables use these indices with no need to search the contents of the tables. We keep track of invalid entries in a free queue implemented with a small (32 x 5-bits) stack (not shown in the Figure). For the AWARE basic implementation (described in this section), the free queue is only needed for the RT table since the ST table is handled as a FIFO queue. To insert an ST or RT entry the next index from the associated free queue is used.

[Figure 5.6: AWARE implementation. Blocks: Fetch Unit; Instruction Buffer (I-Buffer) with fields Inst-PC, valid, ready, RPC, Active Mask, Instructions, Dependency Mask; Issue Unit (Scheduler); Execute; Branch Resolution Unit; Warp Splits Table (ST) with fields PC, RPC, R-Index, Active Mask; Warp Reconvergence Table (RT) with fields PC, RPC, R-Index, Rec. Mask, Pending Mask; FIFO with fields B, S-Index. Labeled actions: Branch Instruction; Update ST upon Divergence; Update RT upon Reconvergence; Update ST upon Reconvergence; Move Reconverged Entry to ST; Schedule Splits at the FIFO Front.]

5.5.2 AWARE Virtualized Implementation

The basic implementation described above has high area overhead (about 1 KB storage requirement per warp). This is because the implementation sizes all tables for the worst case scenario (i.e., 32 ST and RT entries per warp). However, empirically we find the typical occupancy of ST and RT tables is much lower. Therefore, we study the impact of virtualizing the ST and RT tables by spilling entries that exceed their physical capacity to the memory system and filling them when they are to be scheduled (for ST entries) or updated (for RT entries).

Figure 5.7 illustrates our virtualized AWARE implementation. The virtualized AWARE implementation is composed of 10 parts organized around the ST and RT tables. The Physical ST Table potentially holds only a subset of active warp splits.
Similarly, the RT table potentially tracks a subset of reconvergence points. When the total number of splits exceeds the capacity of the Physical ST table, the ST Spill Buffer is used to move some warp splits to the memory hierarchy as described below. The ST Fill Request and Response buffers are used to bring these warp split entries back into the Physical ST Table. The same applies to the RT buffers. The reconverged entry buffer and pending mask updates buffer store the data exchanged between the ST and RT tables.

A branch instruction of a warp is scheduled only if both the ST and RT Spill Request Buffers of this warp are empty. Also, instructions from a warp are eligible for scheduling only if the warp's Reconverged Entry and Pending Mask Updates Buffers are empty. When a new entry is required to be inserted into a full ST or RT table, an existing entry is spilled to the respective Spill Request Buffer. We use a FIFO replacement policy for the ST and an LRU replacement policy for the RT¹. When an entry is spilled, its corresponding entry in the FIFO is labeled as virtual. When a virtual entry is at the FIFO output, a fill request for this entry is sent and the entry is labeled as transient. This is to avoid sending multiple fill requests for the same entry. Also, the entry is pushed to the back of the FIFO. When a pending mask update is required for a virtual RT entry, a fill request is sent for this entry and the pending mask update buffer remains occupied until a response to the fill request is received and the entry is inserted in the RT table. Further, a fill request is sent when an RT entry reconvergence is timed-out. An ST spill request is 12 bytes and an RT spill request is 16 bytes. For each warp, a global memory address space of 32 × 32 bytes = 1 KB is reserved for virtual ST and RT tables. The FIFO and free queues use virtual entry IDs (between 0 and warp-size-1) that are used along with the warp ID to decide the address of spill and fill requests.
These virtual IDs are stored as new fields in the physical ST and RT entries. Age bits for RT entries are not virtualized. Each buffer in Figure 5.7 is sized to queue only one entry. As we discuss in Section 5.7, we found the physical sizes of ST and RT can be set to 4 and 2 entries respectively with limited impact on performance (see Figure 5.13).

¹ This is essentially to leverage the existing FIFO for the ST and the age bits used for Timeout calculation in the RT.

[Figure 5.7: AWARE Virtualized Implementation. Blocks: Physical ST Table; Physical RT Table; ST Spill Request Buffer; ST Fill Request Buffer; ST Fill Response Buffer; RT Spill Request Buffer; RT Fill Request Buffer; RT Fill Response Buffer; Reconverged Entry Buffer; Pending Mask Updates Buffer. Spill and fill requests go to the LDST unit and fill responses return from it. About 48 KB per SM of address space is reserved for virtual ST and RT tables (48 warps/SM).]

This effectively reduces the storage required per warp by a factor of 5× compared to the basic implementation, which makes the storage requirement comparable to the reconvergence stack. Table 5.2 compares the storage cost of AWARE with that of stack-based reconvergence. Upon a divergent branch, AMD GPUs select the path with fewer active threads first, which limits the stack depth to log2(warp size) [5]². On the other hand, Nvidia's GPUs push the not-taken entry then the taken entry to the stack, so that execution always starts with the taken entry first [147]. Thus, the stack depth can grow up to the warp size. Note that this behavior for Nvidia GPUs was highlighted in prior research [147] on old GPU architectures (GT200), and we confirmed using microbenchmarks that the behaviour is the same on recent architectures (both Kepler and Pascal architectures).

² The warp size in AMD GPUs is 64 threads, which requires 6 entries.
However, for the sake of fair comparison, in Table 5.2 we assumed that the warp size is only 32 threads, similar to Nvidia's GPUs.

Table 5.2: Storage Cost in Bits per Hardware Warp.

    Config                          Cost (Bits)
    Stack (32-entries - Nvidia)     3072
    Stack (5-entries - AMD)         480
    AWARE-basic                     7680
    AWARE-virtual (1 ST, 1 RT)      1088
    AWARE-virtual (6 ST, 2 RT)      1696

5.6 Methodology

This section describes our methodology for AWARE evaluation. We implement AWARE in GPGPU-Sim 3.2.2 [11, 138]. We use the TeslaC2050 configuration released with GPGPU-Sim (Table 5.3). However, we replaced the Greedy Then Oldest (GTO) scheduler with a Greedy then Loose Round Robin (GLRR) scheduler that forces loose fairness in warp scheduling, as we observed that unfairness in GTO leads to livelocks due to inter-warp dependencies on locks³. Modified GPGPU-Sim code used for this evaluation can be found online [37]. We model insertion and lookup latency to non-virtualized ST and RT entries as 1 cycle. A warp split inserted into ST needs to wait for the next cycle to be eligible for fetch. We use the same benchmarks described in Section 3.4. We also use the same encoding in Table 4.1 in our discussion of the results.

³ For kernels under study that do not suffer livelocks with GTO, the impact of GLRR on L1 miss rate is minimal compared to GTO, with the exception of BH-ST (in Table 3.1), which suffers a 7% increase in L1 cache misses [124]. The study of "fair" and efficient warp schedulers is left to future work.

Table 5.3: GPGPUSim Configuration

    # Compute Units               14
    Warp Size                     32
    Warp Scheduler                Greedy Then Loose Round Robin
    Splits Scheduler              FIFO
    Number of Threads / Core      1536
    Number of Registers / Core    32768
    Shared Memory / Core          48KB
    Constant Cache Size / Core    8KB
    Texture Cache Size / Core     12KB, 128B line, 24-way
    Number of Memory Channels     6
    L1 Data Cache                 16KB, 128B line, 4-way LRU
    L2 Unified Cache              128k/Memory Channel, 128B line, 16-way LRU
    Instruction Cache             4k, 128B line, 4-way LRU
    Compute Core Clock            575 MHz
    Interconnect Clock            575 MHz
    Memory Clock                  750 MHz
    Memory Controller             out of order (FR-FCFS)
    GDDR3 Memory Timing           tCL=12 tRP=12 tRC=40 tRAS=28 tRCD=12 tRRD=6
    Memory Channel BW             4 (Bytes/Cycle)

5.7 Evaluation

We limit our evaluation of AWARE to CUDA and OpenCL applications because current OpenMP support relies on linking device code at runtime to a compiled OpenMP runtime library [16]⁴, which is not straightforward to emulate in GPGPU-Sim.

⁴ We only linked OpenMP library calls for synchronization at compile time. Enabling general link time optimizations for the OpenMP runtime library requires an engineering effort that is beyond the scope of this paper.

[Figure 5.8: Normalized Kernel Execution Time. Bar chart over CP-DS, HT, ATM, BH-TB, BH-SM, BH-ST, BH-FC and AVG.; y-axis: normalized kernel execution time (0 to 2); series: S0-(Stack), M2S-(Stack), M2D-(AWARE).]

[Figure 5.9: Normalized Accumulated GPU Execution Time. Bar chart over CP, HT, ATM, BH and AVG.; y-axis: normalized accumulated GPU execution time (0 to 1.4); series: S0-(Stack), M2S-(Stack), M2D-(AWARE).]

We compare executing the M2D version on AWARE against executing S0- and M2S on the stack-based reconvergence baseline. Unless otherwise noted, we configure AWARE to enable delayed reconvergence and with the TimeOut mechanism disabled. Figure 5.8 shows the normalized average execution time for individual kernels. On average, executing the MIMD version on AWARE is on par with executing the M2S version on the stack-based reconvergence baseline (Figure 5.9)⁵. However, for some kernels such as BH-TB, M2D on AWARE has better performance versus the M2S versions of these kernels run on stack-based reconvergence. This is mainly because AWARE does not require executing additional instructions. Figure 5.10 shows that executing the MIMD version of all kernels on AWARE leads to a reduced number of instructions executed versus M2S, with the sole exception of BH-ST.
The reason BH-ST differs is that it behaves similarly to a spin lock; the AWARE FIFO-based scheduling mechanism allows a warp split that did not acquire a lock to attempt to acquire it again even before it is released by the other split. This execution pattern repeats as many times as the number of dynamic branch instructions encountered along the not-taken path before the lock is released. Manual transformation of the code to eliminate SIMT deadlock avoids this behaviour, and our compiler elimination algorithm also reduces this behaviour.

⁵ We simulate the PTX ISA on GPGPU-Sim because SASS is not fully supported. Note that PTX uses virtual registers rather than spilling to local memory, diminishing some of the advantage of AWARE over SSDE.

[Figure 5.10: Normalized Dynamic Instruction Count. Bar chart over CP-DS, HT, ATM, BH-TB, BH-SM, BH-ST, BH-FC and AVG.; y-axis: normalized dynamic instruction count (0 to 2.5); series: S0-(Stack), M2S-(Stack), M2D-(AWARE).]

[Figure 5.11: Average SIMD Utilization. Bar chart over CP-DS, HT, ATM, BH-TB, BH-SM, BH-ST and BH-FC; y-axis: average SIMD utilization (0 to 1); series: S0-(Stack), M2S-(Stack), M2D-(AWARE).]

[Figure 5.12: Sensitivity to the TimeOut value (in cycles). "inf+DR" refers to a time-out that is infinity but with delayed reconvergence. Bar chart over CP-DS, HT, ATM, BH-TB, BH-SM, BH-ST, BH-FC and AVG.; y-axis: normalized execution time (0 to 4); series: inf+DR, 1000, 5000, 10000.]

Figure 5.12 illustrates the sensitivity of AWARE to the value of TimeOut. Kernels that suffer SIMT deadlocks (e.g., BH-ST) favor smaller TimeOut values as this allows blocked threads to more readily make forward progress and release other threads attempting to enter a critical section. For kernels that do not suffer SIMT deadlocks (e.g., BH-SM), smaller TimeOut values reduce SIMD utilization and lower performance. On average, delayed reconvergence with TimeOut disabled (bars labeled "inf+DR") achieves the best results. This suggests applying delayed reconvergence whenever a SIMT deadlock is detected and setting TimeOut to a large value such that it is only triggered when there is a high likelihood of an undetected SIMT deadlock. This leaves room for improving the SIMT-induced deadlock compiler analysis so that it reports the degree of confidence that a loop may cause a SIMT-induced deadlock based on the code pattern.

[Figure 5.13: Effect of AWARE Virtualization on Performance. Bar chart over CP-DS, HT, ATM, BH-TB, BH-SM, BH-ST and BH-FC; y-axis: normalized execution time (0 to 1.6); series: s32r32, s1r1, s4r2, s6r4.]

Figure 5.13 illustrates the impact of AWARE virtualization on overall performance. We use AWARE with delayed reconvergence and TimeOut disabled in this experiment. We can see that in the worst case execution time increases by only 16% when we use 4 and 2 physical entries for ST and RT respectively instead of 32 entries; the average is only 5%. Our analysis suggests that the performance overhead is mainly due to the extra traffic caused by the fill and spill requests. For example, using a single entry for both ST and RT with the CP-DS kernel increases memory requests by 21%, and 15% of this extra traffic misses in the L1 cache. Congestion on the miss queues increases the MSHR (Miss Status Holding Register) reservation failure rate by a factor of 2.5×. This leads to an increase in the stalls due to structural hazards by 51%. A potential solution that can further reduce performance overhead is a victim cache shared among warps and used to cache spilled entries. Using a central storage for all warps in an SM is motivated by our observation of a disparity in the ST and RT occupancy requirement across different warps at a given execution window. In depth study of this modification is left for future work.

5.8 Related Work

In [121], Ramamurthy describes a modification of the SIMT reconvergence stack when executing lock or unlock instructions to avoid possible deadlocks.
However, the proposed solution does not address situations where the locking happens in diverged code [121]. The solution described is limited to mutexes and it applies only in very restricted cases. In [155], hardware support for a blocking synchronization mechanism on GPGPU is described. With the synchronization APIs proposed there, SIMT deadlocks are avoided in restricted cases (similar to [121]). We provide more details on both of these techniques in Chapter 8.

There are many research papers that recognize the performance implications of branch divergence in GPUs [33, 39, 41, 44, 91, 99, 122, 123]. However, less attention has been paid to the implications on functionality. Temporal-SIMT is a hardware proposal that enables more flexible placement of barriers [74], but the use of explicit reconvergence points in Temporal-SIMT can still cause SIMT deadlocks. Dynamic Warp Formation (DWF) [41] enables flexible thread scheduling that could be exploited to avoid SIMT-deadlocks; however, it has higher hardware complexity.

Very recently, Nvidia revealed some information about their Volta architecture [103, 109]. Volta replaces the old stack-based execution model with a new model that enables independent thread scheduling to enable inter-thread synchronization. Figures 12 and 13 in the Nvidia blog [103] suggest that the observable behaviour of Volta is very similar to AWARE with concurrent multi-path execution enabled (Multi-Path execution is discussed in Chapter 7). The closest related Nvidia patent is a 2016 patent that describes a notion of convergence barriers [34]. We provide detailed discussion about this patent in Chapter 8, Section 8.1.

5.9 Summary, Conclusion and Future Directions

In this chapter, we presented AWARE, a MIMD-compatible reconvergence mechanism that avoids most of the limitations inherent in the SSDE compiler-only approach.
The chapter discussed AWARE operation, implementation, and its interaction with synchronization.

Unlike the compiler-only approach discussed in Chapter 4, AWARE does not require changing the application CFG. Thus, it does not have the synchronization scope limitations and debuggability challenges associated with SSDE. It also supports barriers in divergent code. AWARE limits the architectural changes to the divergence unit. AWARE provides a realistic example that lays the foundation to consider MIMD-compatibility as a design goal for SIMT hardware implementations.

Future research could explore static and/or dynamic delayed and timed-out reconvergence mechanisms as these appear to have potential to further optimize performance by varying the selection of reconvergence points on the granularity of individual branches. Specifically, by adapting reconvergence point selection it may be possible to improve SIMD utilization [137] and/or caching behaviour [116] in divergent applications in general.

Chapter 6
BOWS: Back-Off Warp Spinning

In the previous chapters of this thesis, we tackled some of the limitations of the SIMT execution model that make correct implementations of inter-thread synchronization on current SIMT machines challenging and unreliable. In this chapter, we focus on the performance side of the SIMT model implications on inter-thread synchronization.

Overheads of fine-grained synchronization have been well studied in the context of multi-core CPU architectures [36, 80, 143, 161]. However, the scale of multi-threading and the fundamental differences in the architecture in SIMT machines hinder the direct applicability of the previously proposed CPU solutions (more details in Section 7.7). In SIMT machines, barrier synchronization overheads have been recently studied [81, 82]. These studies proposed warp scheduling policies that accelerate warps that have not yet reached a barrier to enable other warps blocked at the barrier to proceed.
However, fine-grained synchronization with busy-wait loops is a fundamentally different problem. In barrier synchronization, warps that reach a barrier are blocked and do not consume issue slots. With busy-wait synchronization, in contrast, threads that fail to acquire a lock spin and compete for issue slots and, in the absence of coherent L1 caches, memory bandwidth.

Yilmazer and Kaeli [155] quantified the overheads of spin-locks on GPUs and proposed a hardware-based blocking synchronization mechanism called hierarchical queue locking (HQL). HQL provides locks at a cache line granularity by adding flags and pointer meta-data to each L1 and L2 block, which can be in one of six states. Negative acknowledgments are used when queues are filled and in certain race conditions. An acquire init primitive is added to the application to set up a queue. While HQL achieves impressive performance gains when an application uses a small number of locks relative to threads, it can experience a slowdown when using a large number of locks concurrently. Moreover, HQL adds significant area to the caches and requires a fairly complex cache protocol. While Yilmazer and Kaeli noted the potential for synchronization-aware warp scheduling to help improve HQL, a detailed investigation was left to future work. By judiciously modifying warp scheduling, this paper shows how to effectively approximate the benefits of queue-based locking without the complexity and overhead of directly implementing queues.

Criticality-Aware Warp Acceleration (CAWA) [73] uses run-time information to predict critical warps. Critical warps are those that are slowest in a kernel, and because they determine execution time CAWA prioritizes them. CAWA estimates warp criticality using a criticality metric that predicts which warp will take longer to finish. CAWA outperforms greedy-then-oldest (GTO) warp scheduling across a range of traditional GPGPU workloads [73].
However, CAWA can reduce performance for busy-wait synchronization code as its criticality predictor tends to prioritize spinning warps.

We propose Back-Off Warp Spinning (BOWS), a scheduling policy that prevents spinning warps from competing for scheduler issue slots. BOWS approximates software back-off techniques used in multi-threaded CPU architectures [6], which incur limitations when directly applied to GPUs (Figure 6.2). Warp prioritization in stack-based SIMT architectures is complicated by the fact that some threads within a warp may hold a lock while others do not. In BOWS, warps that are about to execute a busy-wait iteration are removed from competition for scheduler issue slots until no other warps are ready to be scheduled. On GPU kernels using busy-wait synchronization, BOWS achieves a speedup of 1.5× and energy savings of 1.6× versus CAWA.

[Figure 6.1 plot: Lock Acquire/Wait Distribution (y-axis, 0 to 2.5) per benchmark (x-axis: TB, ST, DS, ATM, HT, TSP, NW1, NW2); stacked categories: Wait Exit Success, Wait Exit Fail, Intra-Warp Lock Fail, Inter-Warp Lock Fail, Lock Success.]
Figure 6.1: Synchronization Status Distribution. Bars from left to right: LRR, GTO, and CAWA. GPGPU-Sim with a GTX480 configuration. See Section 7.4 for details.

6.1 Sensitivity to Warp Scheduling

In Section 1.2, we discussed the overheads of busy-wait synchronization on recent SIMT hardware. In this section, we consider the impact of warp scheduling policies. Greedy then Oldest (GTO) scheduling [124] selects the same warp for scheduling until it stalls, then moves to the oldest ready warp. Older warps are those with lower thread IDs. GTO typically outperforms Loose Round Robin (LRR) [124]. In CAWA, warp criticality is estimated as $nInst \times w.CPI_{avg} + nStall$, where nInst is an estimate of the remaining dynamic instruction count (based on the direction of branch outcomes), w.CPI_avg is the per-warp CPI, and nStall is the number of stall cycles experienced by a warp.
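As a rough illustration of the criticality estimate above, the following Python sketch evaluates nInst × w.CPIavg + nStall for a few warps and orders them the way a criticality-first scheduler would. The class name, the warp names, and the counter values are hypothetical and chosen only for illustration; how CAWA derives nInst from branch outcomes is not modelled.

```python
from dataclasses import dataclass


@dataclass
class WarpStats:
    """Per-warp counters (hypothetical values, for illustration only)."""
    name: str
    n_inst: int     # estimated remaining dynamic instruction count
    cpi_avg: float  # per-warp average cycles per instruction
    n_stall: int    # stall cycles experienced by the warp


def criticality(w: WarpStats) -> float:
    # Criticality estimate as stated in the text: nInst * w.CPIavg + nStall.
    return w.n_inst * w.cpi_avg + w.n_stall


def by_priority(warps):
    # Most critical warp first.
    return sorted(warps, key=criticality, reverse=True)


warps = [
    WarpStats("W0", n_inst=100, cpi_avg=2.0, n_stall=50),
    WarpStats("W1", n_inst=400, cpi_avg=2.0, n_stall=10),
    WarpStats("W2", n_inst=100, cpi_avg=2.0, n_stall=500),
]
order = [w.name for w in by_priority(warps)]
```

With these made-up counters the estimates are 250, 810, and 700 for W0, W1, and W2, so W1 would be favoured, followed by W2 and then W0.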
Critical warps are prioritized.

Figure 6.1 plots the distribution of lock acquire attempts in lock-based synchronization and the wait exit attempts in wait-and-signal based synchronization (benchmarks and methodology described in Section 7.4) using LRR, GTO, and CAWA scheduling policies. The figure also shows whether each lock acquire failure occurs because the lock is held by a thread within the same warp (i.e., intra-warp lock fail) or in a different warp (i.e., inter-warp lock fail). Most lock failures are due to inter-warp synchronization. The figure shows that inter-warp conflicts are significantly influenced by the warp scheduling policy.

Figure 6.2 plots execution time of the hashtable insertion code in Figure 1.2a augmented with the software-only backoff delay code in Figure 6.2a running on GTX 1080 hardware. The results suggest that adding a backoff delay to a spin-lock degrades performance on recent GPUs. The reason is that, except at very high levels of contention, the benefits of reduced memory traffic appear insufficient to make up for wasted issue slots executing the delay code itself.

    clock_t start = clock();
    clock_t now;
    for (;;) {
        now = clock();
        // elapsed cycles, allowing for wrap-around of the clock counter
        clock_t cycles = now > start ? now - start : now + (0xffffffff - start);
        if (cycles >= DELAY_FACTOR * blockIdx.x) {
            break;
        }
    }
(a) Backoff Delay Code*.

[Figure 6.2b plot: Kernel Execution Time (LOG10(msec)) versus HashTable Buckets (128, 256, 512, 1024, 2048, 4096); series: No Delay, Delay Factor=50, Delay Factor=100, Delay Factor=500, Delay Factor=1000.]
(b) Execution Time.

Figure 6.2: Software Backoff Delay Performance in GPUs. *omp_set_lock GPU implementation for OpenMP 4.0 [15].

6.2 BOWS: Backoff Warp Spinning

To avoid wasted issue slots we propose Back-Off Warp Spinning (BOWS), a hardware scheduling mechanism that reduces the priority of spinning warps.
We present BOWS assuming synchronization loops have been identified by the programmer, the compiler, or DDOS (described in Section 6.3).

6.2.1 BOWS scheduling policy

The scheduling policies examined in Section 6.1 suffer from two limitations:

• The scheduler may prioritize spinning warps in the competition for issue slots over other eligible non-spinning ones. This slows down the progress of non-spinning warps. In cases when these non-spinning warps are holding locks, this decision also slows down the forward progress of spinning warps.
• The scheduler may return to the same spinning warp too early, even if it was at the bottom of the scheduling priority, because other warps are stalling on data dependencies.

BOWS avoids these issues by modifying an existing warp scheduling policy as follows:

• It discourages warps from attempting another spin loop iteration by inserting the warp that is about to execute another iteration into the back of the warp scheduling priority queue. Warps in this state are called Backed-off Warps. Once a warp in the backed-off state issues its next instruction, its priority reverts to normal and it leaves the backed-off state.
• It sets a minimum time interval between the start of any two consecutive iterations of a spin loop by the same warp. Warps that are about to start a new spin loop iteration prior to the end of their interval are not eligible for scheduling.

[Figure 6.3 diagram: timeline of the warp scheduling priority order (W0 to W3) and of the instructions executed by W0 (0x030: atom.cas.b32, 0x038: %p2 bra BB3, 0x090: setp.eq.s16, 0x098: %p3 bra BB2), showing the back-off delay intervals, the blocked W0 threads, the executed critical section, and callouts 1 to 10.]
Figure 6.3: BOWS scheduling Policy.

BOWS requires that Spin-Inducing Branches (SIBs) have been identified. A SIB is the backward branch of a spin loop.
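The two rules above can be sketched as a small behavioural model. This is a minimal sketch under stated assumptions, not the hardware design: the class and method names are hypothetical, the base policy is reduced to a fixed priority list, and the minimum interval between spin iterations is enforced with a per-warp countdown that is reloaded whenever a warp leaves the backed-off state.

```python
from dataclasses import dataclass


@dataclass
class Warp:
    wid: int
    ready: bool = True
    backed_off: bool = False
    pending_delay: int = 0  # cycles left before a new spin iteration may start


class BowsScheduler:
    """Behavioural sketch of the two BOWS rules (hypothetical model)."""

    def __init__(self, warps, delay_limit):
        self.order = list(warps)   # base policy order, highest priority first
        self.backed_off_queue = []  # warps that executed a SIB, in arrival order
        self.delay_limit = delay_limit

    def tick(self):
        # One cycle elapses: every pending delay counts down towards zero.
        for w in self.order:
            if w.pending_delay > 0:
                w.pending_delay -= 1

    def on_sib(self, warp):
        # Rule 1: a warp about to start another spin iteration goes to the back.
        if not warp.backed_off:
            warp.backed_off = True
            self.backed_off_queue.append(warp)

    def pick(self):
        # Warps that are not backed off always win.
        for w in self.order:
            if w.ready and not w.backed_off:
                return w
        # Rule 2: a backed-off warp is eligible only once its delay has expired.
        for w in self.backed_off_queue:
            if w.ready and w.pending_delay == 0:
                self.backed_off_queue.remove(w)
                w.backed_off = False                 # priority reverts to normal
                w.pending_delay = self.delay_limit   # start the next interval
                return w
        return None


w0, w1 = Warp(0), Warp(1)
sched = BowsScheduler([w0, w1], delay_limit=3)
```

In this model a warp that spins twice in quick succession is skipped the second time until `tick()` has been called `delay_limit` times, even when no other warp is ready.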
Once a warp executes a SIB, the scheduler control unit triggers BOWS' logic.

BOWS Operation

BOWS works as follows: Once a warp exits its backed-off state, a pending back-off delay register is initialized to the back-off delay limit. The warp then continues execution normally with the pending back-off delay register decremented every cycle. If the warp executes a SIB it cannot issue its next instruction until its back-off delay is zero. The back-off delay value can be determined through profiling or tuned adaptively at run time.

Figure 6.3 shows an example of BOWS operation for warp W0 containing four threads for the code in Figure 6.6a. Backward branch 0x098: %p3 bra BB2; has been identified as a SIB. Scheduling priority is shown at the top of Figure 6.3. Initially, W0 has high priority (1). Once W0 encounters a spin-inducing branch (2), it is pushed to the back of the priority queue and marked as backed-off (shaded in Figure 6.3). W0 is scheduled when other warps are stalling (e.g., on memory accesses for line 6 in Figure 1.2a) and executes the lock-acquire atomic compare-and-swap instruction (Figure 6.6a, PC=0x030). At this point (3), three actions are taken: first, the warp loses its backed-off state; second, the warp priority reverts to normal; and third, a back-off delay value is stored in the warp's pending back-off delay register (4). Two threads of W0 successfully acquire the lock and proceed to the critical section while the other two threads fail (5). Threads reconverge at the setp instruction and execute the spin-inducing branch. The two threads that executed the critical section exit the spin loop while the others proceed to another iteration. Once the spin-inducing branch is executed, the warp enters the backed-off state and is pushed to the end of the priority queue (6). As the duration of the critical section is larger than the back-off delay limit, W0's back-off delay is already zero and so W0 is eligible for scheduling.
After W0 is scheduled it executes the lock acquire and the two remaining threads in the spin loop again fail to acquire a lock (7). The two threads immediately proceed to another iteration of the spin loop (8). However, once W0 enters the backed-off state, they cannot be scheduled until the pending back-off delay is zero (9). Once the pending back-off delay is zero, W0 is eligible for scheduling (10).

    for each Execution Window of T cycles:
        if (SIB Instructions > FRAC1 * Total Instructions)
            Delay Limit += Delay Step
        if ((Total Instructions) / (SIB Instructions) <
                FRAC2 * (Prev. Total Instructions) / (Prev. SIB Instructions))
            Delay Limit -= 2 * Delay Step
        if (Delay Limit > Max Limit) Delay Limit = Max Limit
        if (Delay Limit < Min Limit) Delay Limit = Min Limit

    Values used in Evaluation:
        T = 1000 cycles, FRAC1 = 0.05, FRAC2 = 0.8,
        Delay Step = 250 cycles, Max Limit = 10000 cycles,
        Min Limit = 1000 cycles
Figure 6.4: Adaptive Back-off Delay Limit Estimation.

Adaptive Back-off Delay Limit

A small back-off delay may increase spinning overheads while a large back-off delay may throttle warps more than necessary. We adaptively set the delay by trying to maximize $\frac{\text{Useful Instructions}}{\text{Spinning Overheads}}$ over a window of execution. We use $\frac{\text{Total Inst.}}{\text{SIB Inst.}} = \frac{\text{Useful Inst.} + \text{SIB Inst.} \times \text{avg. Spin Overhead}}{\text{SIB Inst.}}$ as a rough estimate. As the average spin overhead is almost constant across the execution of the same kernel, the ratio $\frac{\text{Total Instructions}}{\text{SIB Instructions}}$ is proportional to $\frac{\text{Useful Instructions}}{\text{Spinning Overheads}}$.

The pseudo code in Figure 6.4 summarizes our adaptive back-off delay limit calculation. This algorithm is applied over successive time windows.
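The per-window update of Figure 6.4 can be written as a short Python function. This is a sketch of the pseudo code rather than the simulator implementation: the function and parameter names are hypothetical, the defaults are the values listed under "Values used in Evaluation", and the guard against windows with zero SIB instructions is an added assumption, since the pseudo code does not say how that case is handled.

```python
def update_delay_limit(limit, total, sib, prev_total, prev_sib, *,
                       frac1=0.05, frac2=0.8, step=250,
                       min_limit=1000, max_limit=10000):
    """Return the back-off delay limit to use in the next execution window.

    total/sib are the instruction and SIB counts of the current window,
    prev_total/prev_sib those of the previous window.
    """
    # Increase the limit while a non-negligible share of instructions are SIBs.
    if sib > frac1 * total:
        limit += step
    # Back off by a double step if Total/SIB dropped noticeably.
    # Assumption: skip the comparison when either window saw no SIBs.
    if sib > 0 and prev_sib > 0:
        if (total / sib) < frac2 * (prev_total / prev_sib):
            limit -= 2 * step
    # Clamp to the configured range.
    return max(min_limit, min(limit, max_limit))
```

For example, starting from a limit of 2000 cycles, a window in which the Total/SIB ratio halves relative to the previous window first gains one step and then loses two, ending at 1750 cycles.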
During the current window the adaptive back-off delay estimation algorithm computes the back-off delay limit to use during the next window. Initially, the scheme attempts to increase the back-off delay limit by a fixed step as long as a non-negligible ratio of dynamic spin-inducing branches is executed. However, if the ratio $\frac{\text{Total Instructions}}{\text{SIB Instructions}}$ in the current execution window is considerably smaller than the ratio in the previous window, the back-off delay limit is decremented by a double step. Finally, lower and upper limits are applied to the back-off delay limit.

6.3 DDOS: Dynamic Detection of Spinning

It is possible to identify spin loops when explicit busy-wait synchronization APIs are used. The compiler can then translate a lock acquire API into a busy-wait loop with the backward branch of the loop flagged as a spin-inducing branch. However, such APIs are not available in current SIMT programming models. Therefore, in this section, we describe a mechanism for dynamically detecting SIBs.

Current GPU programmers write synchronization code tailored to their specific application scenario. For example, Figure 6.5a shows an implementation of two nested locks that avoid SIMT-induced deadlocks from ATM. Figure 6.5b shows an implementation of a global lock from TSP where the execution of the critical section is serialized across threads from the same warp. Figure 6.5c shows busy-wait synchronization from the ST kernel in BH that implements a wait and signal synchronization rather than a lock. A thread waits in a spin loop for a condition set by another thread.

    bool transaction_done = false;
    while (!transaction_done) {
        // try lock 1
        if (atomicCAS(&lock1->lock, 0, 1) == 0) {
            // try lock 2
            if (atomicCAS(&lock2->lock, 0, 1) == 0) {
                // critical section
                atomicExch(&lock2->lock, 0); // release lock 2
                atomicExch(&lock1->lock, 0); // release lock 1
                transaction_done = true;
            } else {
                atomicExch(&lock1->lock, 0); // release lock 1
            }
        }
    }
(a) Two Nested Locks (ATM [43] and CP [20, 43]).

    for (i = 0; i < 32; i++) {
        // serialize threads within the same warp
        if (laneid == i) {
            // try global lock
            while (atomicCAS(mutex, 0, 1) != 0) {
            }
            // critical section
            atomicExch(mutex, 0);
        }
    }
(b) Global Locking (TSP [119, 129]).

    ...
    while (k >= bottom) {
        start = startd[k];
        if (start >= 0) { // if not wait
            ...
            if (ch >= nbodiesd) {
                ...
            } else {
                // child is a body
                sortd[start] = ch; // signal
            }
            k -= dec; // move to next cell
        }
    }
(c) Wait and Signal (BH-ST [24]).

Figure 6.5: Examples of Inter-Thread Synchronization Patterns used in GPUs (See Section 7.4 for more details).

    0x028: mov.s16 %r21, 0;
    BB2:
    0x030: atom.cas.b32 %r15, [%rl29], 0, 1;
    0x038: setp.eq.s32 %p2, %r15, 0;
    0x040: @%p2 bra BB3;
    0x048: bra.uni BB4;
    BB3:
    // critical section
    0x088: mov.s16 %r21, 1;
    BB4:
    0x090: setp.eq.s16 %p3, %r21, 0;
    0x098: @%p3 bra BB2;
(a) Busy-Wait Loop.

    Step    Path History                Value History                                         Match Pointer    Remaining Matches
    1a/1b   0111                        0001 0000                                             0                NULL
    2a/2b   0010 0111                   0000 0000 0001 0000                                   1                NULL
    3       0111 0010 0111              0001 0000 0000 0000 0001 0000                         2                1
    4       0010 0111 0010 0111         0000 0000 0001 0000 0000 0000 0001 0000               2                0
    5a/5b   0111 0010 0111 0010 0111    0000 0000 0000 0000 0001 0000 0000 0000 0001 0000     2 (reset to 0)   NULL

    Spin-inducing Branches Prediction Table (SIB-PT)
    PC       Confidence    Prediction
    0x098    1             Non Spinning
(b) Updates to History Registers and SIB-PT.

    0x020: ld.param.u32 %r15, [_Z14invert_mappingPfS_ii_param_3];
    0x028: mov.u32 %r20, 0;
    BB2:
    0x030: ld.global.f32 %f1, [%rl14];
    0x038: st.global.f32 [%rl15], %f1;
    0x040: add.s64 %rl15, %rl15, %rl4;
    0x048: add.s64 %rl14, %rl14, 4;
    0x050: add.s32 %r20, %r20, 1;
    0x058: setp.lt.s32 %p4, %r20, %r15;
    0x060: @%p4 bra BB2;
(c) Regular Loop.

    Step    Path History      Value History                    Match Pointer    Remaining Matches
    6a/6b   0010              0000 1110                        0                NULL
    7a/7b   0010 0010         0001 1110 0000 1110              1                NULL
    8a/8b   0010 0010 0010    0010 0000 0001 0000 0000 0000    2                NULL

    Spin-inducing Branches Prediction Table (SIB-PT)
    PC    Confidence/Threshold    Prediction
    -     -                       -
(d) Updates to History Registers and SIB-PT.

Figure 6.6: Warp History Registers and SIB-PT Operation (Figure 6.7 shows the units' locations in the pipeline).

The large variety of synchronization patterns makes it challenging to detect busy-wait synchronization statically or to introduce primitives that support all use cases and avoid SIMT-induced deadlocks [38].
Thus, we propose a hardware mechanism, Dynamic Detection of Spinning (DDOS), to detect spinning warps.

DDOS seeks to identify Spin-Inducing Branches (SIBs). We define a SIB as a backward branch that maintains the spinning behaviour. To identify a SIB, DDOS first makes a prediction regarding whether each warp is currently in a spinning state or not.

As noted by Ti et al. [80], a thread is spinning between two dynamic instances of an instruction if it executes the instruction and later executes the same instruction again (e.g., in another loop iteration) without causing an observable change to the net system state (i.e., to its local registers or to memory). Ti et al. [80] proposed a thread spinning detection mechanism for multi-threaded CPUs that tracks changes in all registers. Directly applying such a technique to a GPU would be prohibitive given the large register files required to support thousands of hardware threads. Instead, DDOS employs a speculative approach.

DDOS detects busy-wait loops in two steps. First, it detects the presence of a loop. DDOS does this by tracking the sequence of program counter values of a warp. Second, DDOS speculates whether a loop identified in the first step is a busy-wait loop or a normal loop. To distinguish these cases it leverages the observation that, in normal loops found in GPU code, an induction variable typically changes every iteration. Moreover, this induction variable typically contributes to the computation of the loop exit condition. In NVIDIA GPUs the loop exit condition and the divergence behaviour of a thread are typically determined using a set predicate instruction (available both in PTX and SASS)¹. For each thread in a warp, the set predicate instruction compares two source registers and writes the result to a boolean destination register. The boolean values are typically used to predicate execution of both normal and branch instructions (e.g., the instructions at addresses 0x090 and 0x098 in Figure 6.6a).
In normal (non-busy-wait) loops, the value of at least one source register of the set predicate (setp) instruction(s) that determine the loop exit condition changes each iteration. In a ‘for’ loop, one of these registers would be the loop counter. DDOS approximates the condition tested by Ti et al. [80] by tracking only the values of source registers of the set predicate instructions in determining whether a loop is a normal loop (i.e., setp source register values change) or a busy-wait spin loop (setp source register values do not change).

¹ AMD Southern Islands ISA has an equivalent vector compare instruction (v comp) [5].

6.3.1 DDOS Operation

Conceptually, the spin loop detection step of DDOS works as follows: Each warp has two shift registers, a Path History Register and a Value History Register (Figure 6.6b). These registers track the execution history of the first active thread in the warp. We refer to this thread as the profiled thread. The Path History Register tracks program counter values of setp instructions. The Value History Register tracks the values of the source registers of setp instructions. To reduce storage overhead we hash program counter and source operand values before adding them to the Path History and Value History Registers. As elaborated upon in Section 6.3.3, the Value History Register is implemented in the execution stage. DDOS examines entries in the Path and Value History Registers looking for repetition. If it finds sufficient repetition, DDOS classifies the profiled thread as being in a spinning state.

Figure 6.6 illustrates operation of the Path and Value History Registers on PTX² assembly examples with (Figure 6.6a) or without (Figure 6.6c) busy-wait code. Figure 6.6a is equivalent to Figure 1.2a. In Figure 6.6a assume the first active thread is executing the setp instruction at PC = 0x038.
In the busy-wait example in Figure 6.6b, the program counter is first hashed using ((PC − PC_kernel_start) / Instruction Size) % 2^m,³ that is, by keeping the least significant m bits of the instruction index, where PC_kernel_start = 0x000, m = 4, and Instruction Size = 8. The result (0x7) is inserted into the Path History Register (1a). In parallel, the source operand values of the setp instruction are hashed and added to the Value History Register. We assume the profiled thread fails to acquire the lock so that %r15 is ‘1’. Only the least significant k bits (here k is 4) are used (1b). To detect repetition DDOS keeps track of two other values, Match Pointer and Remaining Matches. The Match Pointer identifies which m-bit (k-bit) portion of the Path (Value) History Register to compare new insertions against. For each insertion into the path (value) history registers, the entry before the match pointer is compared with the new entry. If they are equal, a loop is detected. To enable better selectivity DDOS requires multiple consecutive loop detections before identifying a spin-inducing loop. To facilitate this, the Remaining Matches register tracks the number of remaining matches required.

Continuing the example in Figure 6.6b, eventually the warp executes the setp instruction at PC=0x090 in Figure 6.6a. The entries in both shift registers are (logically) shifted to the right and new values inserted to their left. No match is found between the new entry (0x2) and the entry before the match pointer (0x7) (2a). As the profiled thread fails to acquire the lock, %r21 remains ‘0’. Thus, the value history register is updated with two 4-bit zero values (2b). When the warp reaches PC=0x038 again we assume the profiled thread again fails to acquire the lock, leading to a match in both path and value histories (3). Once a match is detected, the match pointer is fixed and the remaining matches value is initialized to (match pointer − 1).

² PTX is Nvidia GPU virtual assembly [105].
³ We discuss other hashing techniques in Section 6.3.2.
Once the warp reaches the setp instruction at PC=0x090 again, an additional match is found (4). Since the remaining matches value is now zero, the warp is identified as being in a spinning state. After the profiled thread successfully acquires the lock, the execution of the setp instruction at PC=0x038 leads to a mismatch in the value history and the warp loses its spinning state (5b).

Next, we describe how DDOS identifies Spin-inducing Branches (SIBs). The key is that, if a backward branch is executed by a warp in a spinning state, it is likely spin-inducing (i.e., leads to a new iteration in the busy-wait loop).

To detect SIBs DDOS employs a spin-inducing branch prediction table (SIB-PT). The SIB-PT, shown in Figure 6.6b, is shared between warps executing on the same SM. The SIB-PT maintains a confidence value for each branch under consideration. When a warp is in a spinning state and it executes a backward branch, if that branch is not in the SIB-PT then it is added with a confidence value of 1. If the branch is in the SIB-PT, its confidence value is incremented. Once the confidence reaches a threshold the branch is identified as a spin-inducing branch. To guard against accumulated path and value hash aliasing errors that could happen over an extended period of execution, a nonzero confidence value for a branch that is not yet confirmed as spin-inducing is decremented every time the branch is taken by a warp that is currently classified as non-spinning. The threshold is a constant fixed for a given architecture (determined empirically).

Returning to the example in Figure 6.6b, initially, the SIB-PT is empty. Once the warp executes the backward branch at PC=0x098 while in the spinning state (i.e., after (4) and before (5)) the branch is added to the SIB-PT with its confidence set to ‘1’. Assuming a confidence threshold of 4, only three more instances where the backward branch at PC=0x098 is executed by a spinning warp would be required before this branch is confirmed as a Spin-inducing Branch.
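The SIB-PT update rules described above can be summarized in a few lines of Python. This is a behavioural sketch with hypothetical names, not the hardware table: it ignores the finite number of entries, treats every executed backward branch as taken, and assumes the confidence saturates at the threshold once a branch is confirmed.

```python
class SibPredictionTable:
    """Sketch of the SIB-PT confidence counters (hypothetical model)."""

    def __init__(self, threshold=4):
        self.threshold = threshold
        self.confidence = {}  # backward-branch PC -> confidence value

    def is_sib(self, pc):
        return self.confidence.get(pc, 0) >= self.threshold

    def on_backward_branch(self, pc, warp_spinning):
        """Update the table for one backward branch; return the prediction."""
        if warp_spinning:
            if not self.is_sib(pc):
                # New branches enter with confidence 1; known ones increment.
                self.confidence[pc] = self.confidence.get(pc, 0) + 1
        elif self.confidence.get(pc, 0) > 0 and not self.is_sib(pc):
            # Unconfirmed branch seen from a non-spinning warp: lose confidence.
            self.confidence[pc] -= 1
        return self.is_sib(pc)


table = SibPredictionTable(threshold=4)
predictions = [table.on_backward_branch(0x098, True) for _ in range(4)]
```

With a threshold of 4, the branch at PC=0x098 is confirmed on its fourth execution by a spinning warp, which matches the "three more instances" in the example above.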
Larger threshold values reduce false predictions but lead to longer detection time. We study the sensitivity to the threshold value in Section 6.3.2.

Next, we briefly explain DDOS operation with a normal loop example. The PTX code in Figure 6.6c is the assembly of a ‘for’ loop in k-means [26]. The backward branch is at 0x060 and its associated setp is at 0x058. The first source operand %r20 represents the ‘for’ loop induction variable that is incremented by one every iteration (at 0x050), while %r15 is a copy of the kernel input indicating the number of loop iterations. The PC of the setp instruction is hashed to (0x2) and inserted into the Path History Register every time the instruction is executed ((6a), (7a), and (8a)). In contrast to the busy-wait case, the contents of %r20 change each iteration, causing a mismatch with every insertion to the value history register ((7b) and (8b)).

6.3.2 DDOS Design Trade-offs

DDOS as described in Section 6.3.1 has different design parameters to tune. These are the hashing function and width (m and k), the confidence threshold (t), and the number of entries in the history shift register (l). We evaluate the impact of these parameters on the following metrics: (1) Average True Spin Detection Rate (TSDR), which is the percentage of spin-inducing branches accurately identified by DDOS; (2) Average False Spin Detection Rate (FSDR), which is the percentage of non-spin-inducing branches incorrectly classified as spin-inducing; and (3) Average Detection Phase Ratio (DPR), which is the average ratio of the detection phase duration of a branch to the cycles executed from the first encounter to the last encounter of the branch. The detection phase duration of a branch measures how many cycles were required to confirm a branch as a spin-inducing branch after its first encounter. For spin-inducing branches it is preferable to have a short detection phase.
For each branch, these metrics are averaged over the different SMs that execute the branch, and over the different launches of the kernel that include the branch. For ground truth, we consider branches that are used to implement busy-wait synchronization as true spin-inducing branches.

Table 6.1 shows the sensitivity of these metrics to the different design parameters averaged over all our benchmarks (see Section 7.4 for details).

    Sensitivity to the hashing function "h" at t=4 and l=8
    h                Avg. TSDR   Avg. DPR (true)   Avg. FSDR   Avg. DPR (false)
    XOR, m=k=4       1           0.041             0.016       0.006
    XOR, m=k=8       1           0.041             0           -
    MODULO, m=k=4    1           0.041             0.17        0.014
    MODULO, m=k=8    1           0.041             0.104       0.001

    Sensitivity to the Hashed Path/Value Width "m/k" at t=4, l=8, and h=XOR
    m/k              Avg. TSDR   Avg. DPR (true)   Avg. FSDR   Avg. DPR (false)
    2                1           0.042             0.078       0.062
    3                0.983       0.074             0.012       0.008
    4                1           0.041             0.016       0.006
    8                1           0.041             0           0

    Sensitivity to Confidence Threshold "t" at m=k=4, l=8, and h=XOR
    t                Avg. TSDR   Avg. DPR (true)   Avg. FSDR   Avg. DPR (false)
    2                1           0.03              0.027       0.016
    4                1           0.041             0.016       0.006
    8                1           0.075             0.002       0.002
    12               0.992       0.105             0.002       0.003

    Sensitivity to the History Registers Length "l" at t=4, m=k=8, and h=XOR
    l                Avg. TSDR   Avg. DPR (true)   Avg. FSDR   Avg. DPR (false)
    1                0           0                 0           0
    2                0           0                 0           0
    4                0.625       0.032             0           0
    8                1           0.041             0           0

    Sensitivity to Time Sharing of History Registers "sh" at l=8, t=4, h=XOR, and epoch=1000
    sh               Avg. TSDR   Avg. DPR (true)   Avg. FSDR   Avg. DPR (false)
    0, m=k=4         1           0.041             0.016       0.006
    0, m=k=8         1           0.041             0           0
    1, m=k=4         0.642       0.211             0.033       0.023
    1, m=k=8         0.642       0.211             0.026       0.003
Table 6.1: Spin Detection Sensitivity to Design Parameters. The first Avg. DPR column refers to true detections and the second to false detections.

Hashing Function: The top sub-table in Table 6.1 studies the impact of XOR and MODULO hashing. In XOR hashing, the values inserted into the path register are hashed as follows: (PC[m-1:0] xor PC[2m-1:m] xor PC[3m-1:2m] ... xor PC[31:32-m]), where PC is the program counter at the execution of a setp instruction. The value register XOR hashes are computed similarly but using the source registers in the setp instructions.
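To make the difference between the two hashing options compared in Table 6.1 concrete, the sketch below folds a 32-bit value into m bits with XOR and, for comparison, keeps only its low m bits. The function names are hypothetical. The XOR fold takes m-bit chunks starting from the least significant bit, so when m does not divide 32 the top chunk is simply shorter; that detail is an assumption, as the text only gives the formula for the general case.

```python
def xor_hash(value, m, width=32):
    """Fold a width-bit value into m bits by XOR-ing its m-bit chunks."""
    mask = (1 << m) - 1
    value &= (1 << width) - 1  # treat the input as an unsigned width-bit value
    h = 0
    while value:
        h ^= value & mask  # XOR in the next m-bit chunk
        value >>= m
    return h


def modulo_hash(value, m):
    """Keep only the least significant m bits of the value."""
    return value & ((1 << m) - 1)


# Induction-variable values that advance by 32 alias under MODULO hashing
# with k=4, but remain distinguishable after XOR folding.
modulo_values = [modulo_hash(v, 4) for v in (0, 32, 64)]
xor_values = [xor_hash(v, 4) for v in (0, 32, 64)]
```

Here `modulo_values` is `[0, 0, 0]` while `xor_values` is `[0, 2, 4]`: a loop counter that steps by a multiple of 2^k looks unchanged to the low-bits hash, whereas the XOR fold still sees the higher bits.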
In MODULO hashing, values are hashed by considering only the least significant m (k) bits of the value (as in Figure 6.6). XOR hashing considerably reduces false detections compared to MODULO hashing. With an 8-bit hashing width, XOR hashing has a zero false detection rate. False detections occur in Merge Sort and Heart Wall with MODULO hashing due to loops with power-of-two induction variable increments larger than 2^k.

Hashing Width: The impact of the hashing width is quantified in the second sub-table in Table 6.1. A 2-bit path and value width causes aliasing that leads to a 7.8% false detection rate. With three bits the aliasing impact is smaller, and eight bits are enough to eliminate false detections with XOR hashing.

Confidence Threshold: The third sub-table in Table 6.1 shows that as the confidence threshold (t) increases, the false detection rate decreases but the detection phase ratio increases for true detections. With t = 12 some SMs fail to confirm a spin-inducing branch (e.g., TB kernel of BH).

History Registers Length: The fourth sub-table in Table 6.1 shows the sensitivity to the history length (l), which determines the number of setp instructions DDOS can track. A length of two instances fails to capture any repetitiveness in history. A history of four instances fails to capture the spin loops in one benchmark (NW: Needleman-Wunsch). This benchmark has two spin loops in two different kernels, each of which is launched several times. The loop involves four setp instructions, and thus DDOS needs at least five entries in its history registers to detect their spin behaviour.

Time Sharing of History Registers: As the goal of DDOS is to classify static instructions as either SIB or not, tracking path and value histories for all warps seems unnecessary.
The results of time-sharing a single set of path and value history registers among different warps in an SM are shown in the last sub-table in Table 6.1. Here a warp uses the history registers for a certain predetermined and fixed interval (1000 cycles), then another warp uses them. Time sharing reduces detection accuracy as the profiled warp may not complete a full spin twice within its time sharing interval and thus some SIBs may not be detected. With a single warp, time sharing leads to longer detection phase as the SIB-PT. It may be possible to find a “sweet spot” between tracking all warps and only one. We leave this tradeoff to future work.

In our evaluation, we use “h=XOR, t=4, m=k=8, l=8, and time sharing disabled”. The total storage per warp for both the path and value history registers is 192 bits. In our benchmarks, the maximum number of confirmed spin-inducing branches was three. However, the maximum number of concurrent entries in the SIB-PT was nine (the next maximum was only four). A conservative 16-entry SIB-PT requires 560 bits of storage per SM.

[Figure 6.7 diagram: SIMT core pipeline (Fetch, Decode, Scoreboard, Instruction Buffer, Warp Scheduling, Registers & Operand Collector, ALU Execution Stage with Branch Unit and ALUs) annotated with the BOWS and DDOS additions: warp scheduler arbitration logic with a GTO priority queue and a backed-off queue (priorities rotated every TimeOut, last issued warp tracked), per-warp Ready, Backed-off, and Pending Back-off Delay fields in the instruction buffer, per-warp PC, Path History, and Value History registers, and the SIB-PT (PC, Conf., Pred.); callouts 1 to 9.]
Figure 6.7: Operation of BOWS with DDOS.

6.3.3 DDOS integration with BOWS

Figure 6.7 illustrates BOWS combined with DDOS.

Warp Scheduling: BOWS modifies the warp scheduling and execution stages. We found that strict GTO scheduling (without BOWS) can lead to livelocks on two of our benchmarks (HT and ATM).
To avoid this, we modify GTO to rotate the age priority periodically (every 50,000 cycles in our evaluation). Arbitration logic first checks whether the last issued warp is ready to issue its next instruction (1). If the last issued warp is not ready, the oldest ready warp that is not backed-off is selected (2). If no such warp is available, the backed-off queue is checked. A warp is added to the backed-off queue after executing a SIB. A warp in the backed-off queue can be scheduled only if it is both ready and its back-off delay is zero (3). If the arbitration selects such a warp, it is removed from the backed-off queue. The "Backed-off" field for the warp is set to false and the "Pending Back-off Delay" is initialized to the back-off delay limit value when the warp exits the backed-off state (4).

ALU Execution Stage: Path and value history are updated during execution of setp instructions (5), (6). Current GPUs already support instructions such as "shuffle" which allow threads within the same warp to access each other's registers [111]. The underlying hardware can be used to select the source registers of the first active thread. If the warp executes a backward branch, then it looks up the SIB-PT (7). If the branch is predicted to be a spin-inducing branch, the warp enters the backed-off state (9) and is pushed to the end of the backed-off queue.

6.4 Methodology

We implement BOWS in GPGPU-Sim 3.2.2 [11, 138]. We use the GTX480 configuration for both GPGPU-Sim and GPUWattch for performance and energy evaluation. In Section 6.5.4, we report results for a Pascal GTX1080Ti configuration that has a correlation of about 0.85 for Rodinia to estimate the impact of BOWS on the performance of newer generations of GPUs. We evaluate the impact of BOWS on three scheduling policies: GTO, LRR, and CAWA. We use the BOWS and DDOS design parameters detailed in Table 5.3.

For evaluation, we use Rodinia 1.0 [26? ] for synchronization-free benchmarks (see Section 6.5.2).
We use the kernels described below to cover different synchronization patterns.

BH: BarnesHut is an N-body simulation algorithm [24]. Its Tree Building (TB) kernel uses lock-based synchronization [24]. The kernel is optimized to reduce contention by limiting the number of CTAs and using barriers to throttle warps before attempting a lock acquire. Its sort kernel (ST) uses a wait and signal synchronization scheme. We run BarnesHut on 30,000 bodies.

CP: Clothes Physics performs cloth physics simulation for a T-shirt [20]. The lock-based implementation of its Distance Solver (DS) kernel uses two nested locks to control updates to cloth particles.

HT: Chained HashTable uses the critical section shown in Figure 1.2a. We run 3.2M insertions by 40K threads on 1024 hashtable buckets.

ATM: A bank transfer between two accounts [43]. It uses two nested locks. We run 122K transactions with 24K threads on 1000 accounts.

NW: Needleman-Wunsch finds the best alignment between protein or nucleotide sequences following a wavefront propagation computational pattern. We implemented the lock-based algorithm in [78], which uses two kernels, NW1 and NW2, that perform similar computation while traversing a grid in opposite directions.

TSP: Travelling Salesman. We modified the CUDA implementation from [119] to use a global lock when updating the optimal solution. We run TSP on 76 cities with 3000 climbers.

[Figure 6.8: Performance and Energy Savings on GTX480 (Fermi). (a) Normalized Execution Time. (b) Normalized Dynamic Energy. Both panels cover the kernels TB, ST, DS, ATM, HT, TSP, NW1, NW2, and their geometric mean under LRR, LRR+BOWS, GTO, GTO+BOWS, CAWA, and CAWA+BOWS.]

6.5 Evaluation

Figure 6.14 shows normalized execution time and energy consumption on busy-wait synchronization kernels. Results are normalized to LRR. BOWS uses adaptive back-off delay. It uses DDOS for detecting spin loops.
The design parameters are shown in Table 5.3.

Figure 6.14 shows that BOWS consistently improves performance over different baseline scheduling policies, with speedups of 2.2×, 1.4×, and 1.5× and energy savings of 2.3×, 1.7×, and 1.6× compared to LRR, GTO, and CAWA respectively.

BOWS has minimal impact on TB because TB's code uses a barrier instruction to limit the number of concurrently executing warps between lock acquisition iterations. We note this barrier approach is fairly specific to TB. For example, it requires at least one thread from each warp to reach the barrier each iteration. Also, the lack of adaptivity of this software-based barrier approach can be harmful even where it can be applied (it would lead to a 28× slowdown if applied to HT, measured on Pascal GTX1080 hardware). ST shows a 17.8% energy improvement with BOWS (Figure 6.14b) as it reduces dynamic instruction count, but does not exhibit performance improvement because the performance is limited by memory latency. In TSP, the synchronization instructions account for <0.03% of the total number of instructions, thus synchronization code is not the dominant factor in execution time. Large back-off delay values may unnecessarily block execution, leading to performance degradation (see TSP results in Figure 6.9).

[Figure 6.9: Normalized Execution Time at Different Back-off Delay Limit Values (using DDOS). The figure covers the kernels TB, ST, DS, ATM, HT, TSP, NW1, NW2, and their geometric mean under GTO and under GTO+BOWS with delay limits of 0, 500, 1000, 3000, and 5000 cycles and with the adaptive delay limit.]

For the NW kernels, the progress of younger warps is dependent on older warps finishing their execution. Therefore, NW prefers GTO scheduling over LRR as it gives priority to older warps. HT with the GTO scheduler runs into a pathological scheduling pattern where it prioritizes spinning warps, which significantly reduces performance.
BOWS eliminates such problems by deprioritizing spinning warps.

6.5.1 Sensitivity to Back-off Delay Limit Value

The following results use the GTX480 configuration with GTO as the baseline policy for BOWS. Figure 6.10 shows the average distribution of warps at the scheduler in terms of their status (backed-off or not). The first bar is GTO. The remaining bars are for BOWS as the back-off delay limit value increases. The last bar to the right is BOWS with adaptive back-off delay limit. The figure shows how BOWS impacts warp scheduling. The back-off delay is not effective until reaching a threshold unique to each benchmark. The reason is that the back-off delay sets a minimum duration between two successive iterations of a spin loop. If warps already consume a time that is larger than the back-off delay limit before they attempt another iteration, then the back-off delay has no observable effect (recall the discussion of Figure 6.3). The effective back-off delay value depends upon how many instructions are along the failure path in the busy-wait code, how many warps are running, and how much memory contention there is.

[Figure 6.10: Distribution of Warps at the Scheduler. From left to right, GTO without BOWS, GTO with BOWS with delay limit in cycles 0, 500, 1000, 3000, 5000, Adaptive. The bars split the average warp distribution for TB, ST, DS, ATM, HT, TSP, NW1, and NW2 into non-backed-off and backed-off warps.]

Figure 6.11 shows the distribution of lock acquire and wait status. The behavior aligns with the percentage of warps that are backed-off in Figure 6.10. This data elucidates performance gaps in some benchmarks (particularly HT, ATM, and NW) between the different scheduling policies. For example, in HT BOWS reduces the lock failure rate by 10.8× compared to GTO.

Figure 6.12a shows the impact of BOWS on the dynamic instruction count. On average BOWS reduces dynamic instruction count by a factor of 2.1× compared to GTO.
Figure 6.12b shows that BOWS also reduces the number of L1D memory transactions by 19% compared to GTO. One of the side effects of BOWS is that it increases SIMD efficiency for some benchmarks. For example, BOWS improves HT and ATM SIMD efficiency by 3.4× and 1.85× respectively compared to GTO. In ST, the significant reduction of the number of spin iterations (see Figure 6.12a) biases the SIMD calculation results as the benchmark spends more time executing the divergent code rather than spinning, hence the reduction in SIMD efficiency.

[Figure 6.11: Distribution of Warps at the Scheduler. From left to right, GTO without BOWS, GTO with BOWS with delay limit in cycles 0, 500, 1000, 3000, 5000, Adaptive. The bars split the lock acquire/wait distribution for TB, ST, DS, ATM, HT, TSP, NW1, and NW2 into Wait Exit Fail, Wait Exit Success, Intra-Warp Lock Fail, Inter-Warp Lock Fail, and Lock Success.]

[Figure 6.12: BOWS Impact on Dynamic Overheads. (a) Dynamic Instruction Count (normalized). (b) Number of Memory Transactions (normalized). (c) SIMD Efficiency. Panels (a) and (b) compare GTO with GTO+BOWS at delay limits of 0, 500, 1000, 3000, and 5000 cycles and with the adaptive delay limit; panel (c) compares LRR, GTO-TimeOut, BOWS-Ideal-0, BOWS-0, BOWS-500, BOWS-1000, BOWS-3000, BOWS-5000, and BOWS-Adaptive.]

[Figure 6.13: Overheads Due to Detection Errors. Normalized execution time for HL, MS, and their geometric mean under GTO and GTO+BOWS with delay limits of 0, 500, 1000, 3000, and 5000 cycles.]

6.5.2 Sensitivity to Detection Errors

Note that with the XOR hashing configuration we do not have any false detections. Thus, the results of Synchronization-Free benchmarks are identical to the baseline.
Figure 6.13 reports the results of Synchronization-Free benchmarks under MODULO hashing. For Synchronization-Free benchmarks, BOWS is expected to perform identically to the baseline under perfect spin detection. Only two applications from Rodinia have false detections with MODULO hashing: Merge Sort (MS) and Heart Wall (HL). In both of these applications, false detections were due to 'for' loops with a large power-of-two induction variable increment that is not reflected in the least significant 8 bits of setp source registers. In this evaluation, we use an 8-bit hash width for the path and value registers. On average, over Rodinia's 14 benchmarks, BOWS with a 5000-cycle back-off delay and MODULO hashing degrades GTO performance by only 2.1% on these synchronization-free applications. However, for MS, BOWS with MODULO hashing and a large back-off delay degrades performance versus GTO significantly.

[Figure 6.14: Performance and Energy Savings on Pascal. (a) Normalized Execution Time (Pascal). (b) Normalized Dynamic Energy (Pascal). Both panels cover the kernels TB, ST, DS, ATM, HT, TSP, NW1, NW2, and their geometric mean under LRR, LRR+BOWS, GTO, GTO+BOWS, CAWA, and CAWA+BOWS.]

Note that it is possible to come up with a code example that may result in false detection regardless of the hashing decision. For example, consider the case of a loop that exits at the 10th iteration. However, the exit condition does not compare the induction variable with 10 directly, and instead, an intermediate value is computed (e.g., i%10) and compared with zero. This effectively hashes the induction variable of the loop and tricks DDOS into detecting such a loop as a spin loop.
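The MODULO false detections discussed in this section can be illustrated with a short sketch. It assumes 32-bit setp operands and an 8-bit hash, and it assumes that XOR hashing folds the four bytes of the value together; the function names and the folding scheme are illustrative assumptions, not the exact DDOS logic.

```python
def modulo_hash(value, k=8):
    """MODULO hashing: keep only the k least-significant bits."""
    return value & ((1 << k) - 1)


def xor_hash(value, k=8, width=32):
    """XOR hashing (assumed scheme): fold the k-bit chunks of a width-bit value."""
    mask = (1 << k) - 1
    value &= (1 << width) - 1
    folded = 0
    while value:
        folded ^= value & mask
        value >>= k
    return folded


# A 'for' loop whose induction variable advances by 512 (a power of two > 2**8).
induction_values = [i * 512 for i in range(4)]

# Hashed value history seen by the detector under each scheme.
modulo_hashes = [modulo_hash(v) for v in induction_values]
xor_hashes = [xor_hash(v) for v in induction_values]

print("MODULO:", modulo_hashes)
print("XOR:   ", xor_hashes)
```

Under these assumptions the low 8 bits never change, so the MODULO-hashed value history is indistinguishable from that of a spinning loop, whereas the folded XOR hash changes on every iteration.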
We did not find such code examples in our evaluated benchmarks.

6.5.3 Sensitivity to Contention

Figure 6.15 uses the hashtable benchmark to study BOWS sensitivity to contention. A smaller number of hashtable buckets indicates higher contention. The figure shows that BOWS provides a speedup of up to 5× at high contention and down to 1.2× at low contention. Similarly, the dynamic instruction count savings range from 3.7× to 1.3×. Figure 6.15b also includes data, "Ideal Block Inst. Count", that serves as a proxy for how HQL [155] might perform on this workload. This curve shows the instruction count assuming locks do not require multiple iterations to acquire. The difference between the two curves thus represents the overhead introduced by BOWS versus an ideal queuing lock system. As we can see, the benefits of an (idealized version of) HQL appear to diminish as the number of hash buckets increases.

6.5.4 Pascal GTX1080Ti Evaluation

To evaluate the impact of BOWS on recent architectures, we configured GPGPU-Sim and GPUWattch to model GTX1080Ti (the configurations are currently available in the GPGPU-Sim GitHub repository). We evaluate the same benchmarks with the same inputs used for GTX480. BOWS consistently improves performance over different baseline scheduling policies, with a speedup of 1.9× on LRR, 1.7× on GTO, and 1.5× on CAWA.

One observation is that on Pascal, except for DS, the behavior is flat across the different baseline scheduling policies. The reason is that most of the input data sets for the workloads we run are set to fully utilize (without oversubscribing) the Fermi GPU, but they under-utilize Pascal. Pascal has almost double the number of cores compared to Fermi (Table 5.3). Thus, on Fermi, each core has many warps to choose from and the scheduling policy has a great impact. However, in Pascal, each core will have about half the number of warps distributed on four warp schedulers instead of two.
Thus, the number of warps available at each scheduler in Pascal is one fourth that in Fermi, making the baseline scheduling policy less important (e.g., unlike the case with Fermi on the NW and HT benchmarks). DS, on the other hand, is oversubscribed in the Fermi configuration and the number of concurrently running CTAs is limited to four due to the number of available registers per core. This helps to limit contention. However, in Pascal, each core runs up to 8 CTAs and Pascal has more cores. This significantly increases the number of concurrent warps and thus lock contention. Therefore, DS performs worse with the Pascal baseline than with Fermi. BOWS significantly improves performance as it combines deprioritizing spinning warps (which helps when there are many warps to schedule from) and throttling spinning warps by forcing them to wait for the back-off delay limit (which helps when there are few warps to schedule from).

[Figure 6.15: Sensitivity to Contention. (a) BOWS SpeedUp: BOWS speedup over GTO for 128 to 4096 hashtable buckets. (b) BOWS Inst. Count: dynamic instruction count normalized to GTO for 128 to 4096 hashtable buckets, showing "BOWS Inst. Count" and "Ideal Block. Inst. Count".]

6.5.5 Implementation Cost

Table 6.2 identifies the basic components in both DDOS and BOWS and estimates their costs per SM. The main cost of DDOS is the history registers, but using time-sharing (Section 6.3.2) it may be possible to reduce this cost. Comparison and hashing logic can be shared across warps in the same SM. Enabling a back-off delay of up to 10,000 cycles requires 14 bits per Pending Delay counter. Adaptive estimation requires division.
This can be done using reduced precision computation or by using existing arithmetic hardware when not in use.

Table 6.2: DDOS and BOWS Implementation Costs

DDOS
  SIB-PT                          16 entries, 35 bits each (560 bits)
  History Registers               48 warps * 192 bits (9216 bits)
  Comparison                      8-bit comparator + 8:1 8-bit Mux
  Hashing (XOR)                   8 4-bit XORs
  FSM                             48 * 4-state FSM states
BOWS
  Pending Delay Counters          48 * 14 (bits) = 672 bits
  Backed-off Queue                48 * 5 (bits)
  Arbitration Logic Changes
  Delay Limit Estimation Logic    (can use functional units when available)

6.6 Related Work

Numerous research papers have proposed different warp scheduling policies with different goals (e.g., improving latency hiding [99], improving locality [124], reducing barrier synchronization overheads [81, 82], reducing load imbalance overhead across warps from the same CTA [72]). However, none of these scheduling policies have considered the challenge of warp scheduling under inter-thread synchronization.

Overheads of fine-grained synchronization have been well studied in the context of multi-core CPU architectures [36, 80, 142, 143, 161]. Ti et al. [80] proposed a thread spinning detection mechanism for multi-threaded CPUs that tracks changes in all registers. Directly applying such a technique to a GPU would be prohibitive given the large register files required to support thousands of hardware threads. Instead, DDOS employs a speculative approach. In [161], the authors propose a synchronization state buffer that is attached to the memory controller of each memory bank to cache the state of in-flight locks. This reduces the traffic propagated to the main memory and the latency of synchronization operations. However, when the buffer is full the mechanism falls back to software synchronization mechanisms. This work builds on the observation that "at any instance during the parallel execution only a small fraction of memory locations are actively participating in synchronization" to maintain a reasonably sized buffer [161].
Although this observation holds true for modestly multi-threaded CPUs, it does not apply to massively multi-threaded SIMT architectures with tens of thousands of threads running in parallel. A similar technique that requires an entry per hardware thread to track locks acquired by each thread is used in [142].

In [155], the authors propose hierarchical queuing at each block in L1 and L2 data caches with the use of explicit acquire/release primitives. Their goal is to implement a blocking synchronization mechanism on GPGPU. In that work, locks can be acquired only on a cache line granularity. Locked cache lines are not replaceable until released. If a cache set is full with locked lines, the mechanism reverts back to spinning for newer locks mapped to the same line. Thus, the efficiency of this mechanism drops as the number of locks increases and it starts to perform worse than the baseline [155]. For example, in the hashtable benchmark, the proposal in [155] performs worse than the baseline starting from 512 buckets (in contrast to our proposal, which consistently outperforms the baseline; see Section 6.5.3). Further, unlike [155], our work does not assume explicit synchronization primitives, which require non-trivial compiler support and/or significant hardware modifications [38] to run correctly on SIMT architectures.

Transactional memory and lock-free synchronization are other approaches to implement inter-thread synchronization [43, 95, 151]. However, both techniques rely on retries upon failure, which lead to overheads and contention similar to those of busy-wait synchronization. GPU transactional memory proposals to date achieve lower performance than fine-grained synchronization [43, 151]. Similar results have also been reported for lock-free synchronization [96].
Tuning DDOS and BOWS to reduce commit failures in lock-free synchronization is left for future work.

6.7 Summary, Conclusion and Future Directions

This chapter proposes DDOS, a low cost dynamic detection mechanism for busy-wait synchronization loops on SIMT architectures. DDOS is used to drive BOWS, a warp scheduling policy that throttles spinning warps to reduce competition for issue slots, allowing more performance-critical warps to make forward progress. On a set of kernels that involve busy-wait synchronization, BOWS reduces dynamic instruction count by a factor of 2.1× and reduces memory system accesses by 19% compared to GTO. This leads to an average speedup of 1.4× and dynamic energy reduction by a factor of 1.7× on a set of GPU application kernels employing busy-wait synchronization.

The low cost required to implement DDOS and BOWS makes them a viable extension to any baseline scheduling policy. The low cost is essential since, although the targeted application scope (i.e., applications with inter-thread synchronization) for these optimizations is emergent, it is not the common case. Therefore, from a market perspective, only low cost solutions for the inter-thread synchronization problem can make their way into production.

One direction for future work is to explore the possibility of tuning DDOS and BOWS to reduce commit failures in lock-free synchronization. Most lock-free synchronization algorithms are not wait-free and thus exhibit a behaviour that is similar to busy-wait synchronization. One challenge would be to modify DDOS to detect lock-free synchronization loops. Unlike busy-wait synchronization, in lock-free synchronization, the loop induction variables may change over iterations even when the commit fails.

Another research direction is to explore how DDOS and BOWS can be efficiently integrated with AWARE.
The performance of AWARE also suffers from warp split spinning (e.g., see comments about BH-ST performance in Figure 5.10). One challenge would be to scale DDOS to work on warp split granularity. This challenge is probably solvable with a version of the history registers time sharing proposed for DDOS. Another challenge would be to integrate BOWS with the FIFO scheduling policy of warp splits proposed in AWARE while still maintaining the scheduling fairness property of AWARE.

Chapter 7

MP: Multi-Path Concurrent Execution

Chapter 5 presented AWARE, an adaptive warp reconvergence mechanism that enables MIMD compatible execution on SIMT hardware. The main building blocks of AWARE were the Split and Reconvergence SIMT tables, which decouple the tracking of divergent splits from their reconvergence points. These SIMT tables enabled AWARE to work around the SIMT scheduling limitations of the conventional stack-based reconvergence mechanism, avoiding SIMT specific deadlock scenarios for parallel kernels with inter-thread synchronization. In this chapter, we explore an orthogonal application for these SIMT tables. In particular, along with additional microarchitectural modifications, the SIMT tables can be used to enable concurrent multi-path execution on GPUs as a performance optimization for divergent applications. We refer to this mechanism as the Multi-Path (MP) execution model.

7.1 Stack-Based Reconvergence Performance Limitations

Section 3.1 discussed the scheduling constraints imposed by the stack-based reconvergence implementations in SIMT architectures. However, the focus was on the functional implications of such constraints. In this section, we revisit these constraints, focusing on their performance implications.

    // id = thread ID
    // BBA Basic Block "A"
    if (id % 2 == 0) {
        // BBB
    } else {
        // BBC
    }
    // BBD

Figure 7.1: Divergent code example. In the accompanying control flow graph, basic block A (active mask 1111) ends in branch BR^A_{B-C}, which diverges into B (0101) and C (1010) before the threads reconverge at D (1111).

[Figure 7.2: Execution with the Stack-Based Reconvergence Model. The figure refers to the stack as a Single Path Stack to distinguish it from later proposals that support dual path execution [123]. It shows the PC, RPC, and Active Mask fields of the stack entries, with the top of stack (TOS) marked, at steps 1 to 4, together with a lanes-versus-cycles timing diagram.]

The stack-based execution model allows only a single control flow path to execute at a time, which reduces the number of running threads. Active threads on alternate paths that are not on the top of the stack may be either waiting at a reconvergence point or ready to execute a parallel control flow path. Thus, the stack-based execution model captures only a fraction of the actual thread level parallelism (TLP). The remaining TLP is essentially masked by a structural hazard implicit in the use of a stack for implementing IPDOM reconvergence.

For example, consider executing the divergent code in Figure 7.1 on the stack-based execution model (Figure 7.2). Initially, the stack has a single entry during the execution of basic block A (1). Once branch BR^A_{B-C} is executed, the PC of the diverged entry is set to the reconvergence PC (RPC) of the branch (D). Also, the resulting warp splits, C1010 and B0101, are pushed onto the stack (2). The RPC of the new entries is set to the RPC of the executed branch (D). At this point, only warp split B0101 is eligible for scheduling, as it resides at the top of the stack (TOS entry). Warp split C1010 is not at the top of the stack, hence, it cannot be scheduled. As a result, on cycle 4, there are no instructions to hide the latency of the first instruction in basic block B.
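The stack updates in this walk-through can be sketched in a few lines. This is a simplified model under stated assumptions: the StackEntry type and the diverge and advance helpers are hypothetical names, each basic block is treated as a single step, and instruction latency is not modelled. The last two calls anticipate the pops described next.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StackEntry:
    pc: str                 # next basic block to execute
    rpc: Optional[str]      # reconvergence PC (None for the bottom entry)
    active_mask: int


def diverge(stack, rpc, splits):
    """Execute a divergent branch at the top of the stack (TOS)."""
    stack[-1].pc = rpc                  # the diverged entry now waits at the RPC
    for pc, mask in splits:             # the last split pushed becomes the TOS
        stack.append(StackEntry(pc, rpc, mask))


def advance(stack, next_pc):
    """Move the TOS split to next_pc and pop it if it reached its RPC."""
    top = stack[-1]
    top.pc = next_pc
    if top.pc == top.rpc:
        stack.pop()


stack = [StackEntry("A", None, 0b1111)]                  # step 1
diverge(stack, "D", [("C", 0b1010), ("B", 0b0101)])      # step 2
tos_after_branch = (stack[-1].pc, stack[-1].active_mask)
not_schedulable = [e.pc for e in stack[:-1]]             # entries below the TOS

advance(stack, "D")                                      # step 3: B reaches D
tos_after_b = (stack[-1].pc, stack[-1].active_mask)
advance(stack, "D")                                      # step 4: C reaches D
tos_final = (stack[-1].pc, stack[-1].active_mask)
```

Only the TOS entry is ever consulted by the scheduler in this model, which is the structural hazard that hides the C1010 split while B0101 executes.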
Once warp split B0101 reaches its reconvergence point (D), its corresponding entry is popped from the stack (3). Then, warp split C1010 executes until it reaches its reconvergence point (D), after which it is popped from the stack (4). Finally, the diverged threads reconverge at D1111. The above execution results in two idle cycles.

Figure 7.3 quantifies this by showing the amount of TLP available to the scheduler as we increase the maximum number of concurrently executable warp splits supported by hardware while IPDOM reconvergence is maintained. The graph plots the average portion among all the scalar threads that can be scheduled because they are active in the top entry of the stack. The stack-based execution model corresponds to enabling a single warp split. Section 7.4 gives more details about the benchmarks and the methodology. For this set of benchmarks, the stack-based execution model captures from 15% of overall TLP in the Monte Carlo (MC) benchmark up to around 65% in the Memcached (MEMC) benchmark. Figure 7.3 suggests that up to 35% more TLP is available when moving from the stack-based reconvergence to a mechanism allowing an arbitrary number of warp splits to be concurrently scheduled while maintaining IPDOM reconvergence. TLP does not go to 100% with unlimited warp splits because some threads need to wait at reconvergence points.

7.2 Multi-Path IPDOM (MP IPDOM)

During cycles when the pipeline is idle due to long latency events, these alternate control paths could make progress, as observed by Meng et al. [91].
This section builds on this observation and presents a hardware mechanism that allows concurrent scheduling of any number of warp splits while still maintaining IPDOM reconvergence (unlike [91], where concurrent execution trades off SIMD utilization).

[Figure 7.3: Fraction of running scalar threads while varying maximum warp splits and assuming IPDOM reconvergence. The figure plots the fraction (0% to 100%) for 1, 2, 4, 6, 8, and any number of warp splits on the MAND, MC, MUM, and MEMC benchmarks.]

The mechanism is built primarily using the SIMT tables discussed in Sections 5.1 and 5.5.1. Briefly, we replace the SIMT reconvergence stack structure with two tables. The warp Split Table (ST) records the state of warp splits executing in parallel basic blocks (i.e., blocks that do not dominate each other), which can therefore be scheduled concurrently. The Reconvergence Table (RT) records reconvergence points for the splits. The ST and RT tables work cooperatively to ensure splits executing parallel basic blocks will reconverge at IPDOM points. Next, we present the key changes and updates introduced in addition to these SIMT tables to allow concurrent scheduling of warp splits. We refer to the modified architecture as Multi-Path IPDOM (MP IPDOM).

7.2.1 Warp Split Scheduling

Unlike AWARE, where execution switches from one path to the other only after encountering a branch, barriers and/or reconvergence points, MP-IPDOM allows execution to switch to other paths to fill an idle issue slot in the scheduler. To illustrate this, we use the same simple control flow graph in Figure 7.1 to explain the operation of the Multi-Path IPDOM. Figure 7.4 shows the operation of the MP IPDOM, illustrating changes to the ST and RT tables (top) along with the resulting pipeline issue slots (bottom).

The warp begins executing at block A. Since there is no divergence, there is only a single entry in the ST, and the RT is empty (1). The warp is scheduled on the pipeline until it reaches the end of block A.
After the warp executes branch BR^A_{B-C} on cycle 2, warp A1111 diverges into two splits, B0101 and C1010. Then, the A1111 entry is moved from the ST to the RT (2a) with its PC field set to the RPC of branch BR^A_{B-C} (i.e., D). The RPC can be determined at compile time and either conveyed using an additional instruction before the branch or encoded as part of the branch itself (current GPUs typically include additional instructions to manipulate the stack of active masks). The Reconvergence Mask entry is set to the same value as the active mask of the diverged warp split before the branch. The Pending Mask entry is used to represent threads that have not yet reached the reconvergence point. Hence, it is also initially set to the same value as the active mask. At the same time, two new entries are inserted into the ST, one for each side of the branch (2b). The active mask in each entry represents threads that execute the corresponding side of the branch.

[Figure 7.4: Execution with Multi-Path IPDOM. The figure shows the contents of the Splits Table (PC, RPC, Active Mask) and the Reconvergence Table (PC, RPC, Reconvergence Mask, Pending Mask) at steps 1, 2a, 2b, 3a, 3b, 4a, 4b, and 5, with a lanes-versus-cycles timing diagram for cycles 1 to 8 below.]

On clock cycle 3, warp splits B0101 and C1010 are eligible to be scheduled on the pipeline independently. We assume that the scheduler interleaves the available warp splits.
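The table updates in this walk-through can be sketched as follows. The sketch is a software model under stated assumptions: the SplitEntry and ReconvEntry types and the two helper functions are hypothetical names, the RT lookup assumes at most one RT entry per reconvergence PC, and timing is not modelled. The reconvergence half of the sketch anticipates the steps described next.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SplitEntry:                # one row of the Splits Table (ST)
    pc: str
    rpc: Optional[str]
    active_mask: int


@dataclass
class ReconvEntry:               # one row of the Reconvergence Table (RT)
    pc: str
    rpc: Optional[str]
    reconv_mask: int
    pending_mask: int


def diverge(st, rt, split, rpc, sides):
    """Move the diverging split to the RT and add one ST entry per branch side."""
    st.remove(split)
    rt.append(ReconvEntry(rpc, split.rpc, split.active_mask, split.active_mask))
    for pc, mask in sides:
        st.append(SplitEntry(pc, rpc, mask))


def reach_reconvergence(st, rt, split):
    """Invalidate the split and subtract its lanes from the pending mask."""
    st.remove(split)
    # Assumption: a single RT entry per reconvergence PC in this small example.
    entry = next(e for e in rt if e.pc == split.rpc)
    entry.pending_mask &= ~split.active_mask
    if entry.pending_mask == 0:              # every lane has arrived
        rt.remove(entry)
        st.append(SplitEntry(entry.pc, entry.rpc, entry.reconv_mask))


st, rt = [SplitEntry("A", None, 0b1111)], []                  # step 1
diverge(st, rt, st[0], "D", [("C", 0b1010), ("B", 0b0101)])   # steps 2a, 2b
schedulable = sorted(s.pc for s in st)       # every ST entry can be scheduled

reach_reconvergence(st, rt, st[1])           # steps 3a, 3b: B arrives at D
pending_after_b = rt[0].pending_mask
reach_reconvergence(st, rt, st[0])           # steps 4a, 4b, 5: C arrives at D
```

In contrast to a stack, no ordering is imposed among ST entries here, so the scheduler is free to interleave B0101 and C1010.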
Warp splits B0101 and C1010 hide each other's latency, leaving no idle cycles (cycles 3-5). On cycle 6, warp split B0101 reaches the reconvergence point (D) first. Therefore, its entry in the ST table is invalidated (3a), and its active mask is subtracted from the pending active mask of the corresponding entry in the RT table (3b). Later, on cycle 7, warp split C1010 reaches reconvergence point (D). Thus, its entry in the ST table is also invalidated (4a), and its active mask is subtracted from the pending active mask of the corresponding entry in the RT table (4b). Upon each update to the pending active mask in the RT table, the Pending Mask is checked if it is all zeros, which is true in this case. The entry is then moved from the RT table to the ST table (5). Finally, the reconverged warp D1111 executes basic block D on cycles 7 and 8.

7.2.2 Scoreboard Logic

Current GPUs use a per-warp scoreboard to track data dependencies [30]. A form of set-associative look-up table (Figure 7.5a) is employed, where sets are indexed using warp IDs and entries within each set contain a destination register ID of an instruction in flight for a given warp. When a new instruction is decoded, its source and destination register IDs are compared against the scoreboard entries of its warp. A dependency mask that represents registers that are causing data dependency hazards is produced from these comparisons and stored in the I-Buffer with the decoded instruction. The dependency mask is used by the scheduler to decide the eligible instructions at each issue slot. An instruction is eligible only if its dependency mask is all zeros. After writeback, both the scoreboard and the I-Buffer entries are updated to mark the dependency as cleared.
In particular, the entries in the scoreboard look-up table that correspond to the destination registers of the written-back instruction are invalidated, and the bits that correspond to these destination registers in the dependency mask are cleared for all decoded instructions from this warp.

The Multi-Path IPDOM supports multiple concurrent warp splits running through parallel control flow paths. Hence, it is essential for the scoreboard logic to correctly handle dependencies for all warp splits and across divergence and reconvergence points. It is also desirable for the scoreboard to avoid declaring that a dependency exists when a read in one warp split follows a write to the same register from another warp split.

Therefore, we modify the scoreboard design by adding a reserved mask (R-mask) field to each entry in the scoreboard look-up table, as shown in Figure 7.5b.

[Figure 7.5: Changes required to the scoreboard logic to support Multi-Path IPDOM. (a) Single-Path IPDOM: entries are looked up by (Warp-ID, Reg) and hold a reserved bit (R) and a Register-ID. (b) Multi-Path IPDOM: entries are looked up by (Warp-ID, Reg, Active Mask) and hold an R-Mask and a Register-ID.]

When an instruction is scheduled, it bitwise-ORs the R-mask of its destination register with its active mask. When a new instruction is decoded, its source and destination register IDs are compared against the scoreboard entries of its warp. If the R-mask bit of a register operand of the instruction is set for any of the instruction's active threads, then a bit is set in the dependency mask for the I-Buffer entry associated with the instruction. This means that there is a pending write to this register by at least one active thread.

Upon writeback, the destination register's dependencies are cleared from the scoreboard by clearing the bits in the R-mask that correspond to the active threads in the written-back instruction. The dependency bit masks in the I-Buffer are also updated.
To do so, the active mask of each instruction that belongs to the written-back warp is compared with the R-mask of the destination registers of the written-back instruction. The respective destination register bit in the dependency mask is cleared if the instruction active mask and the R-mask do not have any common active bits. An instruction in the I-Buffer is available to be issued if all the bits in its dependency mask are zero.

To illustrate the operation of the modified scoreboard and its interactions with the I-Buffer, we use the example code snippet in Figure 7.6. Initially, the scoreboard is empty and I0 has no pending dependencies with any registers. At (1), I0 is issued and it reserves a scoreboard entry for its destination register R0 with R-mask=1111. Also, the branch instruction splits the warp into two splits that fetch and decode instructions from the two sides of the branch (i.e., I1 and I2). At the decode stage, all the source and destination registers of the decoded instruction are checked against the reserved registers in the scoreboard unit to see if there are any pending dependencies, and, accordingly, the dependency mask of the instruction is generated.

Code Example:

    I0: LOAD R0, 0(R5);
        if (id % 2 == 0)
    I1:     LOAD R1, 0(R0);
        else {
    I2:     LOAD R1, 0(R4);
    I3:     ADD  R4, R0, R1;
        }

    Step      Scoreboard         I-Buffer
              Reg    R-mask      Inst.   Dep-mask   I-mask
    Initial   -      -           I0      00         1111
              -      -           -       -          -
    (1)       R0     1111        I1      01         0101
              -      -           I2      00         1010
    (2)       R0     1111        I1      01         0101
              R1     1010        -       -          -
    (3)       R0     1111        I1      01         0101
              R1     1010        I3      11         1010
    (4)       -      -           I1      00         0101
              R1     1010        I3      10         1010
    (5)       -      -           -       -          -
              R1     1111        I3      10         1010
    (6)       -      -           -       -          -
              R1     0101        I3      00         1010

Figure 7.6: MP IPDOM scoreboard example

In this example, I1 is dependent on R0 while I2 has no pending dependencies. Hence, I2 is eligible to be issued, and once it is issued, at (2), it reserves R1 with an R-mask=1010. At (3), I3 is fetched and decoded. I3 has pending dependencies on both R0 and R1, hence its Dep-mask=11.
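The decode, issue and writeback rules of the modified scoreboard can be sketched in software as follows. This is only an illustrative Python model under stated assumptions, not the synthesized design: the class name, the method names, and the choice of one dependency bit per operand are hypothetical, and masks are written as integers (0b0101 for the mask 0101). The usage lines replay steps (1) to (3) of Figure 7.6.

```python
class RMaskScoreboard:
    """Toy model of one warp's scoreboard set with per-lane reserved masks."""

    def __init__(self):
        # register name -> bitmask of lanes that still have a pending write
        self.r_mask = {}

    def decode(self, operands, active_mask):
        """Build a dependency mask with one bit per operand (bit i <-> operands[i])."""
        dep = 0
        for bit, reg in enumerate(operands):
            # A hazard exists only if a pending write overlaps the active lanes.
            if self.r_mask.get(reg, 0) & active_mask:
                dep |= 1 << bit
        return dep

    def issue(self, dest, active_mask):
        """On issue, OR the instruction's active mask into the R-mask of dest."""
        self.r_mask[dest] = self.r_mask.get(dest, 0) | active_mask

    def writeback(self, dest, active_mask):
        """On writeback, clear the written-back lanes; drop the entry if empty."""
        remaining = self.r_mask.get(dest, 0) & ~active_mask
        if remaining:
            self.r_mask[dest] = remaining
        else:
            self.r_mask.pop(dest, None)


# Steps (1) to (3) of Figure 7.6; operand lists are (sources..., destination).
sb = RMaskScoreboard()
sb.issue("R0", 0b1111)                          # (1) I0 reserves R0 for all lanes
dep_i1 = sb.decode(["R0", "R1"], 0b0101)        # I1 waits on R0
dep_i2 = sb.decode(["R4", "R1"], 0b1010)        # I2 has no pending dependencies
sb.issue("R1", 0b1010)                          # (2) I2 reserves R1 for its lanes
dep_i3 = sb.decode(["R0", "R1", "R4"], 0b1010)  # (3) I3 waits on R0 and R1
```

In this sketch, re-running decode after a writeback stands in for the in-place update of the I-Buffer dependency masks. Because the comparison is masked by the active lanes, a split that reads R1 on lanes 0101 is not delayed by another split's pending write to R1 on lanes 1010.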
At (4), I0 writes the loaded value to R0; hence it releases R0 and clears the R-mask of the R0 entry in the scoreboard unit (the R0 entry becomes invalid since its R-mask is all zeros). It also clears the R0 bit in the dependency masks of both I1 and I3, since the R-mask of R0 is now zero for all active threads of both instructions. Since the dependency mask of I1 is all zeros, it becomes eligible to be issued. Hence, at (5), I1 is issued and ORs its active mask (0101) into the R-mask of its destination register R1, so that R1 becomes reserved by all lanes (R-mask=1111): odd lanes due to the pending writes of I1 and even lanes due to the pending writes of I2. At (6), I2 writes the loaded value to R1 and hence clears the even bits in the R-mask of register R1 in the scoreboard unit. R1 now has an R-mask=0101, which has no active threads in common with I3, whose I-mask=1010. Hence, the dependency bit that corresponds to R1 in the dependency mask of I3 is cleared and I3 becomes eligible for scheduling.

7.3 Opportunistic Early Reconvergence

MP-IPDOM maintains reconvergence at immediate postdominator (IPDOM) reconvergence points. The IPDOM reconvergence point is the earliest guaranteed reconvergence point. However, in certain situations, there are opportunities to reconverge at earlier points than the IPDOM point. Such situations depend on the outcomes of a sequence of branch instructions and the scheduling order of warp splits. Hence, early reconvergence is not guaranteed for all executions, but rather occurs dynamically when certain control paths are followed. Early reconvergence opportunities are common in applications with unstructured control flow [33, 44].

Figure 7.7 shows a code snippet that has unstructured control flow. We will use this code snippet to illustrate the modified operation of MP with support for opportunistic early reconvergence. Figure 7.8 shows instantaneous snapshots for a warp with four threads traversing through the control flow graph corresponding to the code in Figure 7.7.
Active masks within basic blocks represent threads that are executing basic blocks at a specific time. Basic blocks with no active masks are not executed by any threads at that time. Figure 7.8a shows a snapshot of the execution where there are two warp splits, A0101 and B1010, executing basic blocks A and B respectively (1a). Both diverged warp splits have their IPDOM reconvergence point at D (1b). This initial state results if a divergence at BR_A^{B-C} (the branch at the end of block A, with targets B and C) results in two splits B1010 and C0101, and split C0101 reaches BR_C^{A-D} (the branch at the end of block C, with targets A and D) before split B1010 finishes executing basic block B.

    do {
        // BBA
        if (cond1) {
            // BBB
            break;
        } else {
            // BBC
        }
    } while (cond2);
    // BBD

Figure 7.7: Unstructured control flow

Next, warp split A0101 branches to basic block B after executing BR_A^{B-C}. This scenario creates an early reconvergence opportunity, where there are two splits (B0101 and B1010) of the same warp executing the same basic block B (2a). To detect an early reconvergence opportunity, an associative search within the ST using the first PC of the basic block is performed upon the insertion of any new entry. If there is an existing entry that matches the new entry in both the PC and RPC fields, then an early reconvergence opportunity is detected. The early reconvergence point is the program counter of the next instruction of the leading warp split. In this example, B1010 is the leading split, and the early reconvergence point is labeled BR (2b). A new entry is added to the RT to represent the early reconvergence point (3a). The RPC of the new entry (B0101) is set to the early reconvergence point (BR). Finally, warp split B1010 in the ST is invalidated (3b), and accordingly the pending mask of the early reconvergence entry is updated (3c) (warp split B1010 is already at the early reconvergence point).

The execution continues as explained in Section refsec::ddrt. Warp split B0101 reaches the early reconvergence point BR.
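The detection and merge steps of this example can be sketched as follows. This is a minimal Python model under stated assumptions, not the hardware tables themselves: the `Split` and `Reconv` records, the `block_pc` field (the first PC of the current basic block, used as the search key), and both function names are hypothetical, and PCs are represented by block labels. The usage lines follow Figure 7.8 from the moment A0101 branches to B until B0101 arrives at BR.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Split:                 # one ST entry
    block_pc: str            # first PC of the basic block being executed
    pc: str                  # PC of the next instruction of this split
    rpc: Optional[str]       # reconvergence PC
    mask: int                # active mask


@dataclass
class Reconv:                # one RT entry
    pc: str
    rpc: Optional[str]
    reconv_mask: int
    pending_mask: int


def insert_split(st: List[Split], rt: List[Reconv], new: Split) -> None:
    """Insert a split into the ST, merging towards a leading split if possible."""
    for lead in st:
        if lead.block_pc == new.block_pc and lead.rpc == new.rpc:
            early_pc = lead.pc                       # next instruction of the leader
            # The leader already sits at early_pc, so only the new split is pending.
            rt.append(Reconv(early_pc, new.rpc, lead.mask | new.mask, new.mask))
            st.remove(lead)                          # leader waits at early_pc
            new.rpc = early_pc                       # new split reconverges early
            break
    st.append(new)


def arrive(st: List[Split], rt: List[Reconv], split: Split) -> None:
    """A split reached its RPC: retire it and update the matching RT entry."""
    st.remove(split)
    for entry in rt:
        if entry.pc == split.rpc and entry.pending_mask & split.mask:
            entry.pending_mask &= ~split.mask
            if entry.pending_mask == 0:              # everyone arrived: RT -> ST
                rt.remove(entry)
                # Simplification: the merged split is keyed by its reconvergence PC.
                st.append(Split(entry.pc, entry.pc, entry.rpc, entry.reconv_mask))
            break


st = [Split("B", "BR", "D", 0b1010)]                 # leading split B1010, at BR
rt = [Reconv("D", None, 0b1111, 0b1111)]
insert_split(st, rt, Split("B", "B", "D", 0b0101))   # A0101 branches to B
arrive(st, rt, st[0])                                # B0101 reaches BR
```

After these two calls the ST holds a single split at BR with mask 1111 and RPC D, and the RT again holds only the entry for D, which matches the final snapshot in Figure 7.8c.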
Its entry in the ST is invalidated, the pending mask of the reconvergence entry BR1111 is updated, and the reconvergence entry BR1111 moves from the RT (4a) to the ST (4b). Similar cases to this example occur with more complex code in several GPU applications [33]. We evaluate the benefits of opportunistic reconvergence in Section 7.5.

Figure 7.8: Operation of the Multi-Path with the Opportunistic Reconvergence (OREC) enabled. Each panel lists the Wavefront Splits Table (PC, RPC, Active Mask) and the Wavefront Reconvergence Table (PC, RPC, Reconvergence Mask, Pending Mask).
(a) Initial State. ST: (A, D, 0101), (B, D, 1010). RT: (D, ---, 1111, 1111).
(b) A0101 executes BR_A^{B-C}. Before the early reconvergence logic, ST: (B, D, 0101), (B, D, 1010); RT: (D, ---, 1111, 1111). After the early reconvergence logic, ST: (B, BR, 0101); RT: (D, ---, 1111, 1111), (BR, D, 1111, 0101).
(c) B1010 reconverges at an early reconvergence point. ST: (BR, D, 1111). RT: (D, ---, 1111, 1111).

7.4 Methodology

We model MP IPDOM as described in Section 7.2 in GPGPU-Sim (version 3.1.0) [11]. Our modified GPGPU-Sim is configured to model a GeForce GTX 480 (Fermi) GPU with the configuration parameters distributed with GPGPU-Sim 3.1.0 (Table 7.1) but with a 16KB instruction cache per core (see Section 7.5.4 for details). MP IPDOM does not restrict the scheduling order, so we can use any scheduler. For both the baseline (i.e., the SPS model) and our MP variations, we use a greedy-then-oldest (GTO) scheduler. GTO runs a single warp until it stalls, then picks the oldest ready warp [124]. On the baseline, the GTO scheduler outperforms the Two-Level and the Loose Round Robin schedulers.
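The GTO rule stated above can be written down in a few lines. The following Python sketch is only an illustration of that one-sentence rule, not the GPGPU-Sim scheduler code; the function name and its three parameters are hypothetical.

```python
def gto_pick(ready, age, last):
    """Pick the next warp to issue under greedy-then-oldest.

    ready: set of warp ids that can issue this cycle
    age:   dict mapping warp id -> arrival order (smaller means older)
    last:  warp id issued in the previous cycle, or None
    """
    if last in ready:
        return last                           # greedy: keep running the same warp
    if not ready:
        return None                           # nothing can issue: idle cycle
    return min(ready, key=lambda w: age[w])   # otherwise fall back to the oldest


ages = {0: 0, 1: 1, 2: 2}
stay = gto_pick({0, 1, 2}, ages, last=1)      # warp 1 has not stalled
switch = gto_pick({0, 2}, ages, last=1)       # warp 1 stalled, warp 0 is oldest
```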
For splits within the same warp, we use a simple Round Robin scheme to alternate between them. We model the opportunistic reconvergence optimization described in Section 7.3 as a separate configuration.

    # Compute Units                 15
    Warp Size                       32
    Warp Scheduler                  Greedy Then Oldest
    Splits Scheduler                Loose Round Robin (LRR)
    Number of Threads / Core        1536
    Number of Registers / Core      32768
    Shared Memory / Core            16KB
    Constant Cache Size / Core      8KB
    Texture Cache Size / Core       12KB, 128B line, 24-way
    Number of Memory Channels       6
    L1 Data Cache                   32KB, 128B line, 8-way LRU
    L2 Unified Cache                128KB/Memory Channel, 128B line, 16-way LRU
    Instruction Cache               16KB, 128B line, 4-way LRU
    Compute Core Clock              1400 MHz
    Interconnect Clock              700 MHz
    Memory Clock                    924 MHz
    Memory Controller               out of order (FR-FCFS)
    GDDR3 Memory Timing             tCL=12 tRP=12 tRC=40 tRAS=28 tRCD=12 tRRD=6
    Memory Channel BW               4 (Bytes/Cycle)

Table 7.1: GPGPUSim Configuration

In Section 7.5, we present results for MP IPDOM when it limits the number of concurrently supported warp splits. This is modeled by setting a maximum constraint on the number of active entries in the ST. If, upon a branch, a new entry needs to be inserted into the ST and the table is already at its maximum capacity, the new warp split is not considered for scheduling until an ST entry is invalidated. The configuration with two entries models the effect of the Dual-Path Stack (DPS) with the Path-Forwarding optimization [123] (more details are provided in Section 7.6).

We study benchmarks from Rodinia [26] and those distributed with GPGPU-Sim [11]. We also use some benchmarks with multi-path divergence from other sources:

MEMC: Memcached is a key-value store and retrieval system. It is described in detail by Hetherington et al. [53].

REND: Renderer performs 3D real-time raytracing of triangle meshes. For benchmarking, we use pre-recorded frames provided with the benchmark [141].

MC: MC-GPU is a GPU-accelerated X-ray transport simulation code that can generate clinically-realistic radiographic projection images and computed tomography scans of the human anatomy [10].

MAND: Mandelbrot computes a visualization of the Mandelbrot set in a complex Cartesian space. It partitions the space into pixels and assigns several pixels to each thread [106].

    Interleavable Benchmarks
    Name             Abbr.         BlockDim                      #Inst.   Blocks/core
    3-D Renderer     REND [141]    (64,1,1)                      47M      8
    Laplace Solver   LPS [11]      (32,2,1)                      78M      8
    LU Decomp.       LUD [11]      (16,1,1)(32,1,1)(16,16,1)     41M      1,1,6-1
    Mandelbrot       MAND [106]    (16,16,1)                     274M     1
    Memcached        MEMC [53]     (128,1,1)                     108M     8
    MUMMER++         MUMpp [46]    (192,1,1)                     204M     5
    MUMMER           MUM [11]      (256,1,1)                     88M      4
    Monte Carlo      MC [10]       (128,1,1)(64,1,1)             84M      8,2
    Ray Tracing      RAY [11]      (16,8,1)                      64M      5

Table 7.2: Studied Benchmarks

Note that out of 32 benchmarks, we only report the detailed results for the 9 benchmarks shown in Table 7.2, because the other benchmarks execute identically over SPS, DPS and MP variations. They perform identically because they are either non-divergent or they are divergent but all branches are one-sided branch hammocks such that the branch target is the reconvergence point. Under IPDOM reconvergence constraints, these applications do not exhibit parallelism between their basic blocks (i.e., non-interleavable [123]).

7.5 Experimental Results

This section presents our evaluation for the MP model.

7.5.1 SIMD Unit Utilization

Figure 7.9 shows the SIMD unit utilization ratio for benchmarks in Table 7.2. As expected, the SPS, DPS and basic MP have the same SIMD unit utilization because they all reconverge at the IPDOM reconvergence points. However, the opportunistic reconvergence optimization provides an average of 48% and up to 182% improvement in SIMD unit utilization. Benchmarks that exhibit unstructured control flow benefit from the opportunistic reconvergence.

7.5.2 Thread Level Parallelism

Figure 7.10 shows the average ratio of warp splits to warps. A value greater than one means an increase in the schedulable entities at the scheduler.
Hence, it is more likely for the scheduler to find an eligible split to schedule. The SPS exposes only one split at a time to the scheduler. Hence, its average ratio of warp splits to warps is always one.

As shown in the figure, benchmarks such as REND, MAND, MC, MUM and MEMC show a considerable increase in the average ratio of warp splits as the maximum number of supported warp splits increases. The MP IPDOM achieves a ∼50%-690% increase in warp splits compared to SPS, and ∼11%-400% compared to DPS. This is mainly because some of these benchmarks have highly unstructured control flow (e.g., REND, MAND and MC) and they also have switch and else-if statements that increase the number of parallel control flow paths (e.g., MEMC, MUM and MC).

The MUMpp, LPS, LUD and RAY benchmarks have a limited increase in the average number of warp splits (∼-5%). This is mainly for two reasons. The LPS, LUD and RAY benchmarks have high SIMD utilization (>75%); hence, for a large portion of time there is no divergence at all. This biases the average towards a single split per warp. Also, the four benchmarks have a maximum of two parallel control flow paths at a time; otherwise, they are dominated by single-sided branches (i.e., one of their two outcomes is the branch reconvergence point). Therefore, for these applications MP acts identically to DPS.

The data in Figure 7.10 shows that the opportunistic early reconvergence bounds the increase in the number of warp splits. This is expected because it forces the splits to wait for each other at the early reconvergence points, and it tends to combine multiple splits into a single one.

Figure 7.9: SIMD units utilization. (Bar chart of SIMD lanes utilization, 0 to 1, for REND, MAND, MC, MUM, MUMpp, MEMC, LPS, LUD and RAY; series: SPS, DPS, MP model, +OREC.)

Figure 7.10: Warp splits to warp ratio. (Bar chart of the average splits to wavefronts ratio, axis 0 to 4 with one clipped bar labeled 6.9, for the same benchmarks; series: SPS, DPS, 4PS, 6PS, 8PS, MP model, +OREC.)

Figure 7.11 shows the average breakdown of the threads’ state at the scheduler. A thread’s state indicates whether the thread can issue its next instruction (i.e., it is eligible) or not. The figure also breaks down the different possible reasons that stall a thread. Since a single thread can be stalled for more than one reason (e.g., a data hazard and a structural hazard at the same time), the breakdown assumes priorities for the different possible stalling reasons. The priority order is the order in Figure 7.11 from bottom to top.

“Suspended” threads are those stalled due to divergence (i.e., they are either waiting at reconvergence points or they are waiting at a parallel control flow path). There is a gradual decrease in the number of “Suspended” threads for mechanisms that support a larger number of warp splits. For example, on MAND, MP has an average number of suspended threads that is ∼35% less than SPS.

However, the decrease in the average number of suspended threads does not directly translate to an increase in eligible threads. In particular, the non-suspended threads can be stalled due to data or structural hazards. This effect is pronounced in the MUM and MEMC benchmarks, where the gradual decrease in the average suspended threads is followed by a gradual increase in the average threads stalled due to structural hazards.

Figure 7.11: Average breakdown of threads’ state at the scheduler. (Stacked bars, 0% to 100%, per benchmark for SPS, DPS, 4PS, 6PS, 8PS, MP and +OREC; categories: ELIGIBLE, STRUCTURAL, DATA, SYNC+CNTRL, EMPTY I-BUFF, SUSPENDED, FINISHED.)

The REND benchmark suffers from load imbalance, where some warps exit the kernel early while others continue execution under divergence. The splits of the diverged warps are serialized in SPS. This biases the average result to have a large portion of “Finished” threads on the SPS model. As we increase the number of allowed warp splits, the scheduler interleaves the execution of diverged splits, which in turn speeds up their execution. Hence, on average, we have more “Finished” threads.

7.5.3 Idle Cycles

Figure 7.12 shows the idle cycles accumulated for all cores for our benchmarks. The increase in the average number of eligible threads in Figure 7.11 translates into a decrease in the idle cycles. Only the MEMC benchmark shows a ∼7% increase in the idle cycles when we adopt MP with opportunistic reconvergence compared to the SPS model. We discuss this in detail in Section 7.5.4.

Figure 7.12: Idle cycles. (Bar chart of idle cycles normalized to SPS per benchmark; series: SPS, DPS, MP model, +OREC.)

7.5.4 Impact on Memory System

Instruction Locality: MP has a direct impact on instruction cache locality. Unlike the SPS model, MP may require frequent fetching of instructions from distant blocks in the code. While this is not a problem for most GPU kernels, which tend to have small static code size, it can considerably affect the instruction cache misses in a large kernel. We find that the effect of instruction misses on the overall performance is negligible with a 16 KB cache. Figure 7.13 shows the instruction cache misses normalized to the SPS model. There is a considerable increase in instruction cache misses for the REND benchmark but it has limited effect on the overall performance (only an average of up to 2.5% of threads are stalled due to empty instruction buffers; see Figure 7.11). Since instruction fetch is done at warp split granularity, MP with opportunistic reconvergence tends to have fewer instruction cache accesses and misses.

Figure 7.13: Inst. cache misses (16KB I$). (Bar chart of I$ misses normalized to SPS per benchmark; series: SPS, DPS, MP model, +OREC.)

Data Locality: The effect of MP execution on data cache locality depends on the application and whether parallel control flow paths access contiguous memory locations or not. Figure 7.14 shows the L1 data cache misses normalized to the SPS execution model. The MUM benchmark has reduced L1 data cache misses in MP compared to the SPS model, but this does not have a big impact on its performance because it already has low data cache misses (<0.3 MPKI, “misses per thousand instructions”). However, the MEMC benchmark suffers from a significant increase in its misses (>80%). In particular, the total misses jump from 30 MPKI to 82 MPKI. In-depth analysis suggests that the MEMC benchmark loses its intra-warp locality. That is, warp splits evict each other’s data from the data cache before it is accessed again by the same warp split. These observations are consistent with prior work [124].

Figure 7.14: L1 data cache misses (32KB D$). (Bar chart of L1 D$ misses normalized to SPS per benchmark; series: SPS, DPS, MP model, +OREC.)

Figure 7.15: Overall speedup. (Bar chart of speedup per benchmark plus HM, axis 0 to 2 with clipped bars labeled 3.45, 2.9 and 7.8; series: SPS, DPS, MP model, +OREC.)

7.5.5 Overall Performance

Figure 7.15 shows the speedup over SPS.
The speedup comes mainly from the reduced idle cycles (i.e., more warp split instructions per cycle) and the improved SIMD unit utilization (i.e., more throughput per warp split instruction). MP with opportunistic reconvergence has a 32% harmonic mean speedup over the SPS model, compared to 18.6% and 12.5% for the basic MP and DPS models, respectively.

7.5.6 Implementation Complexity

Implementing MP requires modifications to the branch unit and the scoreboard logic. The modified scoreboard logic adds a 1.5KB storage requirement for 48 warps per SM, each with 8 register entries (GTX480 configuration). We synthesized both the basic scoreboard and the modified scoreboard on NCSU FreePDK 45nm [135]. We model the scoreboard as a small set-associative SRAM. The synthesis results estimate an area of 175,432 µm² and a total power of ∼4.4mW at a 50% activity factor, compared to 91,365 µm² and 1mW for the original scoreboard. The SRAM used is based on NCSU’s FabScalar memory compiler, FabMem [130], with 6 read ports and 3 write ports. The cost of the ST and RT tables is similar to that discussed in Section 5.5.

We also use GPU-Wattch [75] to estimate the increase in the dynamic power associated with the overall increased performance. For all our benchmarks, we find that the maximum observed increase in the average dynamic power is 37.5%, for the REND benchmark. However, the 7× speedup justifies such an increase in power.

7.6 Related Work

This section discusses the closest related work to MP. A wider scope of related work is discussed in Chapter 8. Dual Path Stack (DPS) extends the SPS stack to support two concurrent paths of execution [123] while maintaining reconvergence at immediate postdominators. Instead of stacking the taken and not-taken paths one after the other, the two paths are maintained in parallel. DPS maintains separate scoreboard units for each path to avoid false dependencies between independent splits.
However, it is necessary to check both units to make sure there are no pending dependencies across divergence and reconvergence points. As shown in Section 7.5, DPS has limited benefits on benchmarks that have multi-path divergence or benchmarks that have unstructured control flow behavior. Similar to DPS, Simultaneous Branch Interweaving (SBI) allows a maximum of two warp splits to be interleaved [21]. However, SBI targets improving SIMD utilization by spatially interleaving the diverged warp splits on the SIMD lanes. The reconvergence tracking mechanism proposed with SBI requires constraints on both the code layout and the warp splits’ scheduling priorities to adhere to thread-frontier based reconvergence [33].

Dynamic Warp Subdivision (DWS) adds a warp splits table to the conventional stack [91]. Upon a divergent branch, it uses heuristics to decide which branches start subdividing a warp into splits and which do not. If a branch subdivides a warp, DWS ignores IPDOMs nested in that branch. This often degrades DWS performance compared to the SPS model [123]. Unlike DWS, MP IPDOM manages to maximize TLP under the IPDOM reconvergence constraints. Dynamic Warp Formation (DWF) is not restricted to IPDOM reconvergence [41]. Instead, it opportunistically groups threads that arrive at the same PC, even though they belong to different warps. DWF performance is highly dependent on the scheduling to increase the opportunity of forming denser warps, and sometimes leads to starvation eddies.

Thread Block Compaction (TBC) and TBC-like techniques allow a group of warps to share the same SIMT stack [44, 99]. Hence, at a divergent branch, threads from grouped warps are compacted into new, denser warps. Since TBC employs a thread-block-wide stack, it suffers more from the reduced thread level parallelism [122]. This makes MP IPDOM a good candidate to integrate with TBC to mitigate such deficiencies.
For this purpose, the warp-wide divergence and reconvergence tables would need to be replaced with thread-block-wide tables.

7.7 Summary, Conclusion and Future Directions

This chapter presented a novel mechanism which enables efficient multi-path execution in GPUs. This mechanism enables tracking IPDOM reconvergence points of diverged warp splits while interleaving their execution. This is achieved by replacing the stack-based structure that handles both the divergence and reconvergence in current GPUs with two tables. One table tracks the concurrent executable paths upon every branch, while the other tracks the reconvergence points of these branches. Furthermore, we illustrate that our multi-path model can be modified to enable opportunistic early reconvergence at run-time to improve SIMD unit utilization for applications with unstructured control flow behavior. Evaluated on a set of benchmarks with multi-path divergent control flow, our proposal achieves a 32% speedup over conventional single-path SIMT execution.

In the context of this thesis, this chapter shows that the hardware used to support compatible MIMD execution on SIMT architectures can be further leveraged to achieve performance improvements.

Combining MP-IPDOM with proposals such as Simultaneous Branch Interweaving [21] would make for interesting future work, as it would extend the benefit of such work to throughput-bound divergent applications and not just latency-bound ones.

Chapter 8

Related Work

This chapter expands on the related work subsections in previous chapters. It gives more details about related work and surveys related research areas that are complementary to the work done in this thesis. We discuss work related to thread synchronization in GPGPUs in Section 8.1. In Section 8.2, we survey the different warp scheduling policies proposed for GPGPUs. In Section 8.3, we survey the different techniques proposed to handle divergence in GPGPUs.
In Section 8.4, we survey the verification tools proposed to verify the conformance of parallel kernels to correctness criteria on SIMT architectures.

8.1 Enabling Thread Synchronization in GPGPUs

Thread synchronization in GPGPUs faces both reliability and performance challenges that are induced by the limitations of the GPGPU execution and memory models. Numerous hardware and software research efforts have tackled these challenges. The presence of SIMT deadlocks was initially highlighted in different developer forums [114, 115]. The first research work to pay attention to this problem and attempt to find a solution is Ramamurthy’s MSc thesis [121]. The thesis proposed adding two instructions to the GPU ISA for lock and unlock operations. The instructions send out an atomic memory request which performs the same function as atomicCAS, besides manipulating the stack. The lock instruction changes the PC of the executing entry to point to the program counter of the instruction after the unlock statement. Then it pushes two entries on the stack: one for threads that failed to acquire the lock and the other for the ones that succeeded. Once an unlock instruction is encountered, the entry that represents threads that acquired the lock is popped. Thus, threads that have failed to acquire the lock can proceed from where they left off. When threads within this entry succeed in acquiring the lock, their entry is popped off the stack and execution continues with all threads converged. The thesis then adds further complexity to this fairly simple mechanism to handle cases where there are multiple unlock statements associated with every lock and cases of nested locks. However, as highlighted in the thesis, it still fails to address cases where there are multiple locks in divergent code associated with a single unlock, and cases where nested locks occur at different divergence levels.

The problem of SIMT deadlocks has also been highlighted in [47].
In this work, the authors provide formal semantics for NVIDIA’s stack-based reconvergence mechanism. They formally prove that the observable behaviour of executing a program on a SIMT execution model matches at least one possible behaviour from executing the program on a traditional simultaneous multi-threaded machine (MIMD machine). However, they exclude the termination property, which means that a program may not terminate on a SIMT machine even though it would always terminate on a MIMD machine. They provide a formal definition for the scheduling unfairness problem in the stack-based execution model. However, this work does not attempt to provide ways to detect or to prevent this problem.

In [155], the authors propose a blocking synchronization mechanism in GPGPUs. They propose explicit synchronization lock initialization, acquire, and release APIs. Their paper, along with the related thesis [154], provides limited details regarding how the proposed APIs deal with the SIMT deadlock problem. They mention that they “use control flow instructions (i.e., jumps) for looping and enabling/disabling the threads conditionally using the active mask, active mask stack, and execute instructions for managing the stack” and that they “push or pop the active mask stack as required for the re-execution of lock init/acquire instructions”. This suggests that they replace the suggested APIs with a fixed sequence of instructions to manipulate the stack. This can only work for the simple case where there is a single lock and unlock pair and the unlock instruction postdominates the lock. More complex usage of these APIs would need some analysis and code transformations to work correctly, which are not provided in this work. The authors of this paper, however, focus on the performance implications of the SIMT model on synchronization. They propose hierarchical queuing at each block in the L1 and L2 data caches.
For this to be feasible, locks can be acquired only at cache line granularity, and locked cache lines are not replaceable until released. If a cache set is full of locked lines, the mechanism reverts to spinning for newer locks mapped to the same line. The main limitation of this work, as acknowledged by the authors, is that the efficiency of this mechanism drops as the number of locks increases, and it starts to perform worse than the baseline [155].

In [78], the authors make the observation that shared scratch-pad memory atomic instructions are translated in the Fermi SASS ISA into a series of instructions that implement a busy-wait loop. The busy-wait loop checks whether a hardware lock bit associated with a memory location is clear, so that it can be acquired, before reading the value of the memory location. After the lock bit is acquired, the memory location value is updated and the lock bit is cleared. The hardware only supports 1024 lock bits and all shared memory locations are mapped to these locks. Therefore, a typical lock implementation using a busy-wait loop around an atomic compare-and-swap instruction essentially includes two nested loops: the low-level busy-wait loop that acquires the lock bit, and the outer loop that checks whether the mutex has been acquired or not. This introduces some redundancy that the paper explores how to eliminate. They propose a low-level implementation of lock and unlock primitives that uses the lock bit directly. This proposal reduces the number of SASS instructions required to implement these primitives and the storage required to hold the mutex variables. However, the paper acknowledges two main limitations: SIMD (SIMT) deadlocks and alias deadlocks. They do not provide a solution for SIMT deadlocks and only deal with them using workarounds at the programming level.
Alias deadlocks stem from the fact that different shared memory locations can map to the same lock bit. Therefore, a deadlock occurs if the same thread attempts to acquire two locks at different memory locations that map to the same lock bit.

In [152], a software implementation of lock and unlock APIs that enables lock stealing between threads of the same warp is proposed in an attempt to solve livelocks in nested locking scenarios. In their lock implementation, a thread can steal a lock from another thread in the same warp if the latter’s ID is larger. The programmer may need to implement a rollback function to undo updates to the shared data made by a thread before its lock is stolen. The APIs are still prone to SIMT deadlocks and “locks are not allowed to be acquired in a loop way”. In SSDE and AWARE, we assume livelock-free MIMD programs under arbitrary fair scheduling.

Recently (May 2017), NVIDIA revealed that Volta, their newest GPU architecture, incorporates changes specifically to enable easier programming with fine-grained synchronization (see Figures 12 through 14 in [103]). The description in [103] suggests that the observable behaviour of Volta’s execution model is similar to the Multi-Path execution model with delayed reconvergence enabled when it is needed. Volta also allows programmers to use a __syncwarp() primitive to explicitly force threads within a warp to reconverge. This enables programmers to explicitly avoid conservative safe reconvergence point estimates due to false detection. The specific implementation of Volta’s independent thread scheduling execution model has not been released by NVIDIA.

The closest related NVIDIA patent is a 2016 patent that describes a notion of convergence barriers [34].
In the execution scheme described in this patent, convergence barriers are used to join divergent groups of threads back together to maintain high SIMD efficiency while allowing for flexible thread scheduling that used to be restricted by stack-based reconvergence. In this proposal “the divergence management mechanism that relies on the convergence barriers is decoupled from the thread scheduling mechanism”. The compiler analyzes the program to identify the appropriate locations to insert convergence barriers. In hardware, “a multi-bit register may correspond to each convergence barrier name and a bit is assigned for each thread that may participate in the convergence barrier”. A compiler may insert an ADD instruction that specifies a convergence barrier name. Bits corresponding to threads executing the ADD instruction are set in the associated convergence barrier register. The compiler inserts WAIT instructions at the required convergence barrier locations. The WAIT instruction also specifies the name used by the ADD. When a thread reaches the WAIT instruction, it is blocked from execution. It also resets its corresponding bit in the convergence barrier multi-bit register. When all the threads reach a convergence barrier (i.e., the multi-bit register of the corresponding barrier is all zeros), the convergence barrier is cleared and all threads that participated in the convergence barrier are released (i.e., unblocked) and resume execution in SIMD fashion. Finally, the patent also introduces a YIELD instruction that may be inserted by the compiler to guide the thread scheduler to switch execution to other paths.

Interestingly, at a high level, the proposal in the patent is very similar to the Multi-Path and AWARE execution models. Similar to MP and AWARE, the patent decouples the tracking of divergence entries from reconvergence points. The multi-bit convergence barrier registers are simply another representation of the pending mask entry in the reconvergence table.
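To make the correspondence with the pending mask concrete, the ADD and WAIT behaviour summarized above can be sketched as follows. This Python model is only an illustration of the description in this section, not of NVIDIA's hardware; the class, its two fields and its method names are hypothetical, and thread sets are represented as integer bitmasks.

```python
class ConvergenceBarrier:
    """Toy model of one named convergence barrier register."""

    def __init__(self):
        self.participants = 0   # bits set by ADD, reset as threads reach WAIT
        self.blocked = 0        # threads currently parked at WAIT

    def add(self, mask):
        """ADD: the threads in mask register as participants of the barrier."""
        self.participants |= mask

    def wait(self, mask):
        """WAIT: block the threads in mask and reset their participant bits.

        Returns the mask of threads released by this arrival, or 0 if some
        participants have not reached the barrier yet.
        """
        self.blocked |= mask
        self.participants &= ~mask
        if self.participants == 0:          # register is all zeros: release
            released, self.blocked = self.blocked, 0
            return released
        return 0


barrier = ConvergenceBarrier()
barrier.add(0b1111)                  # all four threads join before diverging
first = barrier.wait(0b0101)         # one split arrives and is blocked
second = barrier.wait(0b1010)        # the other split arrives: all are released
```

In this sketch `participants` plays the role of the pending mask of an RT entry: it starts as the full set of threads expected to reconverge and reaches zero when the last split arrives.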
The main difference is that our proposal relies on the hardware to implicitly add reconvergence points to the reconvergence table, check whether threads have reached reconvergence points, and decide when to switch execution to other paths. In contrast, in the Nvidia patent, the compiler explicitly inserts an ADD instruction to add a new reconvergence point, a WAIT instruction to indicate that threads have reached a reconvergence point, and a YIELD instruction that can be used to instruct the hardware scheduler to switch execution to other paths.

Transactional memory is another approach to deal with synchronization. Hardware transactional memory for GPGPUs was initially explored in [43]. The paper proposes a word-level, value-based conflict detection mechanism where each transaction compares the saved values of its read-set against the values in memory upon completion; a change in value indicates a conflict and the transaction is retried. The paper proposes a SIMT stack extension to deal with divergence after the commit stage where some threads pass the conflict detection stage and others fail. The extension is similar to the simple version of the stack extension proposed in [121]. This is enabled by the fact that in [43], nested transactions are flattened into a single transaction. Therefore, synchronization always maps to the simple case where there is a single transaction begin and commit pair and the commit postdominates the transaction begin. Support of transactional memory also requires significant hardware changes to log write history and manage conflict detection. The evaluation shows that this hardware transactional memory proposal can capture only 59% of fine-grained locking performance. Follow-up hardware transactional memory proposals have been explored to reduce this performance gap [27, 42].
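
The value-based conflict detection idea described above can be sketched in a few lines. The sketch below is a sequential software model written for illustration only, not the hardware design of [43]: the names `run_transaction`, `read`, and `write` are hypothetical, memory is a plain dictionary, and validation plus commit are assumed to happen atomically.

```python
def run_transaction(memory, body, max_retries=10):
    """Run body(read, write) as a transaction over a dict-like memory.

    Returns the number of attempts needed to commit. Conflicts are detected
    by comparing the values saved in the read log against memory at commit.
    """
    for attempt in range(1, max_retries + 1):
        read_log = {}   # address -> value observed on first read
        write_log = {}  # address -> buffered new value

        def read(addr):
            if addr in write_log:          # read-your-own-write
                return write_log[addr]
            if addr not in read_log:
                read_log[addr] = memory[addr]
            return read_log[addr]

        def write(addr, value):
            write_log[addr] = value        # buffered until commit

        body(read, write)

        # Value-based validation: any changed value indicates a conflict.
        if all(memory[addr] == seen for addr, seen in read_log.items()):
            memory.update(write_log)       # commit the buffered writes
            return attempt
        # Otherwise discard both logs and retry.
    raise RuntimeError("transaction did not commit")
```

Because only values are compared, a location that is modified and then restored to its original value before validation does not count as a conflict in this model.
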
According to their evaluation, these cumulative efforts still leave a performance gap of 7% compared to fine-grained synchronization on baseline GPUs.

Software transactional memory support in GPUs has also been proposed. In [56, 151], the authors propose a software implementation of transaction begin, commit, read, and write APIs. The programmer should surround the transaction begin and commit APIs with a loop to enable retries at commit failures [151] or label the transaction region so that the compiler can insert the retry loop [56]. This eliminates the need for changing the stack behaviour and puts the burden on the programmer to manage retries. The different APIs maintain the write logs and perform conflict detection. Software transactional memory has a large storage overhead to store transactions’ meta-data. Therefore, software transactional memory proposals to date achieve significantly lower performance versus fine-grained synchronization [56].

There is also a large body of research that deals with the interaction of inter-thread synchronization and the memory model in GPGPUs. In [2], the authors reveal some of the GPU concurrency bugs that result from invalid programming assumptions about the GPU memory model and its interaction with related instructions (e.g., atomics and memory fences). Different memory consistency models and cache coherence protocols have been proposed to provide a more intuitive interface to the GPU memory model while minimizing synchronization overheads [3, 45, 51, 118, 132, 133, 150].

8.2 Warp Scheduling Policies in GPGPUs

There are a large number of research papers that explore warp scheduling policies in GPGPUs. The first paper to propose an alternate warp scheduling policy to loose round-robin was the two-level scheduling paper [100]. The paper proposes a two-level warp scheduling where warps are divided into fixed size groups.
Warps within each group are scheduled in a round-robin fashion, while different groups are scheduled in a greedy-then-oldest fashion. This scheduling policy aims to get the benefits of the round-robin policy in catching inter-warp locality and of the greedy scheduling in forcing different groups to progress at different rates such that not all warps arrive at long latency operations at the same time.

Later, the cache conscious warp scheduling paper [124] opened the door for a series of adaptive warp scheduling policies. In [124], the number of actively schedulable warps is adjusted according to the intra-warp lost data locality. A small victim cache is used to estimate the lost data locality metric. A follow-up work [125] makes proactive warp scheduling decisions based on predicted cache usage. The paper makes the observation that intra-warp data locality is between instructions in consecutive iterations of a loop. Thus, it is possible to predict a warp cache footprint from the number of load instructions executed in a loop iteration and the divergence pattern of threads within the loop. The scheduling policy uses these predictions to schedule warps with an aggregate predicted cache footprint that is less than the effective cache size.

In [72], the authors observe a large execution time disparity between warps within the same thread block. This leads to the underutilization of the GPGPU resources since the allocation granularity of resources inside a GPU shader is a thread block. The paper then proposes a set of heuristics to prioritize the scheduling of critical warps that prevent a thread block from terminating. In [81], the authors tackle a similar problem. The main observation of this paper is that warps in the same thread block may arrive at a thread-block wide barrier at different times, leading to excessive stall cycles. They show that the distribution of warps in the same thread block over different physical warp schedulers complicates the problem.
Thus, they propose a dynamic warp scheduling policy where different warp schedulers coordinate to prioritize warps in thread blocks where some warps are already waiting at a barrier. The same problem is also addressed in a concurrent work [83]. The main distinction is that in [83], the thread block with the largest number of warps waiting at the barrier is prioritized while in [81], the thread block that first hit the barrier is prioritized.

Various other warp scheduling policies have been proposed with different heuristics. For example, in [156], the authors propose a two-level warp scheduling policy that dynamically adjusts the warp group size and moves warps from the active group to the pending group according to their pipeline stall pattern. In [127], the authors combine two techniques that attempt to balance the preservation of inter- and intra-thread locality. In [9], a compiler analysis is used to detect which of a two-level warp scheduler or a GTO warp scheduler should be used for each phase of a kernel execution. A similar approach is used in [70] except that the switching between the two scheduling policies is determined at runtime according to the instruction-issue pattern. In [144], the MSHR consumption is used as a heuristic to adjust the amount of thread level parallelism allowed by a warp scheduler.

8.3 Alternate GPGPU Execution Models

Dynamic Warp Formation (DWF) [41] has been proposed as an alternative to stack-based reconvergence. Instead of restricting reconvergence to the IPDOM point, DWF opportunistically groups threads that arrive at the same PC, even though they initially belong to different warps. DWF proposes different warp scheduling policies that attempt to increase the chance of forming denser warps. The performance of DWF is highly dependent on the scheduling policy. Although none of the scheduling policies proposed in [41] was a fair scheduling policy, DWF can avoid SIMT deadlocks if it adopted a fair scheduling policy across threads of the same warp.
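
The core regrouping step of DWF, merging threads that sit at the same PC into denser warps, can be sketched as follows. This is a deliberately simplified illustration, not the mechanism of [41]: the function name `form_warps` is hypothetical, threads are regrouped all at once rather than opportunistically as they arrive, and hardware constraints such as which SIMD lane a thread may occupy are ignored.

```python
from collections import defaultdict


def form_warps(thread_pcs, warp_size=4):
    """Group threads by their current PC into warps of at most warp_size.

    thread_pcs maps a thread id to the PC that thread will execute next.
    Returns a list of (pc, [thread ids]) tuples, one per formed warp.
    """
    by_pc = defaultdict(list)
    for tid in sorted(thread_pcs):
        by_pc[thread_pcs[tid]].append(tid)

    warps = []
    for pc in sorted(by_pc):
        tids = by_pc[pc]
        # Pack threads at the same PC densely, regardless of their home warp.
        for start in range(0, len(tids), warp_size):
            warps.append((pc, tids[start:start + warp_size]))
    return warps


if __name__ == "__main__":
    # Eight threads from two diverged warps, sitting at two different PCs.
    pcs = {0: 100, 1: 200, 2: 100, 3: 200, 4: 100, 5: 100, 6: 200, 7: 100}
    for pc, tids in form_warps(pcs):
        print(pc, tids)
```

Note that nothing in this grouping step decides which PC is issued next; that choice belongs to the scheduling policy, which is where fairness (and hence SIMT deadlock avoidance) would have to be enforced.
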
However, this would undermine its ability to condense warps. Further, the lack of guaranteed reconvergence points in DWF compromises programmers’ ability to predict the performance of their applications.

Dynamic Warp Subdivision (DWS) [91] adopted an alternative approach. In this work, the authors attempt to maximize thread level parallelism to be able to better hide long latency operations. DWS adds a warp splits table to the conventional stack [91]. Upon a divergent branch, it uses static heuristics to decide which branches start subdividing a warp into splits and which do not. If a branch subdivides a warp, DWS ignores IPDOMs nested in that branch. This often degrades DWS performance compared to the SPS model [123]. Unlike DWS, MP IPDOM manages to maximize TLP under the IPDOM reconvergence constraints. DWS is also prone to SIMT deadlocks as it still enforces IPDOM reconvergence on some branches.

Follow-up proposals can generally be categorized into one or a combination of two categories: maximizing SIMD efficiency or maximizing thread level parallelism. Dual Path Stack (DPS) extends the single path stack to support two concurrent paths of execution [123] while maintaining reconvergence at immediate postdominators. Instead of stacking the taken and not-taken paths one after the other, the two paths are maintained in parallel. DPS maintains separate left and right scoreboards to allow the two independent paths to operate in parallel with no false dependencies. However, it adds an extra bit (for each register) to store pending dependencies before divergence and reconvergence points. For each instruction, it is necessary to check both scoreboard units (which is done in parallel) to make sure there are no pending dependencies. Hence, DPS is designed to work for two paths only.

Simultaneous Branch Interweaving (SBI) allows a maximum of two warp splits to be interleaved [21].
However, SBI targets improving SIMD utilization by spatially interleaving the diverged warp splits on the SIMD lanes. The reconvergence tracking mechanism in SBI uses thread frontiers [33], which sets constraints on both the code layout and the warp splits’ scheduling priorities.

Temporal SIMT (T-SIMT) maps each warp to a single lane, and the threads within a warp dispatch an instruction one after the other over successive cycles [74]. Upon divergence, threads progress independently; hence, divergence does not reduce the SIMD units utilization. However, reconvergence is still favourable to perform memory address coalescing and scalar operations [74]. The T-SIMT microarchitecture lacks a hardware mechanism to track reconvergence of diverged warp splits; therefore, the authors insert (syncwarp) instructions at the immediate postdominator of the top-level divergent branches [74]. Our MP microarchitecture provides a hardware mechanism to track nested reconvergence points. Hence, it can be integrated with T-SIMT.

Multiple SIMD Multiple Data (MSMD) [145] proposes quite large changes to the baseline architecture to support flexible SIMD data paths that can be repartitioned among multiple control flow paths. Similar to T-SIMT, MSMD proposes to use a special synchronization instruction to reconverge at postdominators; however, the paper does not specify an algorithm that determines where to place these synchronization instructions and how to determine which specific threads to synchronize at each instruction.

Thread Block Compaction (TBC) and TBC-like techniques allow a group of warps to share the same SIMT stack [44, 99]. Hence, at a divergent branch, threads from grouped warps are compacted into new, denser warps. Since TBC employs a thread block wide stack, it suffers more from the reduced thread level parallelism [122]. This makes MP IPDOM a good candidate to integrate with TBC to mitigate such deficiencies.
For this purpose, the warp-wide divergence and reconvergence tables would need to be replaced with thread block wide tables.

Variable Warp Sizing (VWS) [126] makes the observation that smaller warp sizes are convenient for control flow and memory divergent applications while larger warps are convenient for convergent applications. Therefore, VWS starts with a small warp size and then groups these smaller warps into larger ones in the absence of control and memory divergence.

8.4 Verification Tools for SIMT Architectures

There is a body of research on the verification of GPU kernels that focuses on detecting data-races and/or barrier divergence freedom in GPU kernels. In [17], a tool, GPUVerify, is proposed to verify the freedom of GPU kernels from data races in shared memory and divergent barriers. GPUVerify transforms the kernel into a two-threaded predicated form that is suitable for verification. To detect data races, the kernel is instrumented to log accesses to shared arrays. To declare race-freedom, for each log, GPUVerify verifies that prior read and write sets for the same array from different threads do not conflict. Barriers reset the read and write logs. Barrier divergence-freedom is verified if the aggregate predicate of the two threads is always the same at barrier points.

In GRace [160], a static analysis is combined with a run-time checker to detect data-races in GPUs. If the memory address of an instruction can be statically determined, a linear constraint solver is used to determine whether two statically determinable write and write or read and write pairs are aliasing and thus induce a data race. Otherwise the pair is marked to be monitored at run-time. This reduces the amount of information that needs to be logged at runtime. GKLEE [79] proposes a framework that analyzes GPU kernels to detect both functional and performance bugs such as non-coalesced memory accesses, bank conflicts and divergent warps. GKLEE performs the analysis symbolically.
GKLEE can also automatically generate test cases.

Although none of this work has addressed SIMT deadlocks, the analysis techniques used for data-race detection can be leveraged to improve the accuracy of our SIMT deadlock static analysis.

Chapter 9
Conclusions and Future Work

This chapter concludes the thesis, reflects on potential areas of impact of the work done in this thesis, and proposes a few examples of future work that expand and improve on this work.

9.1 Conclusion

The convenience of the SIMT programming model has encouraged programmers to use it in accelerating irregular data parallel computations, achieving in many cases significant speedups and energy savings over CPU multi-threaded implementations. Compared to other energy efficient alternatives such as ASICs and FPGAs, SIMT architectures have a programmability advantage that enables workload consolidation. However, current SIMT implementations lack reliable and efficient support for inter-thread synchronization that is essential for efficient implementations of many irregular applications.

This challenge has been facing both general purpose programmers (e.g., in CUDA and OpenCL [65]) and graphics programmers (e.g., in GLSL and HLSL [113]). That being said, existing GPU applications [25] that worked around current SIMT limitations to implement algorithms with fine-grained synchronization have achieved significant improvement over CPU implementations. However, as we showed in this thesis, they are vulnerable to portability and compatibility issues across compilers and/or GPU architectures. Further, such individual workarounds do not provide general rules that could ease the adoption of other algorithms with different inter-thread synchronization patterns.
Their positive performance results, however, encouraged us to explore reliable and more efficient support of fine-grained synchronization in the SIMT execution model.

Another motivation is the wide interest in high level programming languages for accelerators such as OpenMP 4.0. The abstraction and portability of the OpenMP programming model will help SIMT accelerators reach a broader range of developers. However, support of fine-grained synchronization in OpenMP relies on runtime library calls that are challenging to properly implement on current SIMT implementations. This would equally apply to any future CUDA or OpenCL API extensions that could be proposed to abstract fine-grained synchronization.

In this thesis, we try to answer these questions:

• Can a compiler work around the current limitations of the SIMT hardware to enable a true MIMD abstraction for synchronization on the SIMT hardware? What are the limitations?
• How could we enable the MIMD abstraction with minimal SIMT hardware changes without losing the efficiency of the SIMT model?
• How can we improve the efficiency of fine-grain synchronization on SIMT architectures with a low cost mechanism?

We found that it is possible for the compiler to enable MIMD abstractions through control flow graph transformations of the input code that contains inter-thread synchronization. However, this technique is limited to synchronization within a local function scope. Further, these control flow graph changes impose limitations on the code debuggability. False detection of synchronization can also lead to performance degradation.

Therefore, we proposed an adaptive warp reconvergence mechanism that avoids the thread scheduling constraints imposed by the current SIMT implementations. This mechanism requires limited hardware changes. It is capable of maintaining traditional SIMT execution in the absence of synchronization. The hardware mechanism mitigates the key compiler limitations as it does not require code transformations to operate.
We also showed that with further hardware extensions, this divergence mechanism can be used to improve thread level parallelism in heavily diverged kernels.

We found that the main source of inefficiency in inter-thread synchronization on SIMT hardware is the warp scheduling policy that is oblivious to such synchronization. Therefore, we developed a low cost mechanism to dynamically detect the presence of synchronization and accordingly tune the warp scheduling policy. We showed that this mechanism, though simple, provides significant performance and energy improvements for applications with inter-thread synchronization.

9.2 Potential Areas of Impact

This section provides a brief description of areas on which we foresee a non-trivial impact from this work:

SIMT Verification Tools: The increasing complexity of algorithms mapped to SIMT accelerators mandates the development of verification tools. None of the current verification tools considers the problem of SIMT deadlocks due to conditional loops. Our work shows that a simple static analysis can detect SIMT deadlocks with a low false detection rate. Future work can improve this rate by leveraging runtime information and elaborate static analysis to perform less conservative reachability and dependence analysis.

Reliable SIMT Synchronization Primitives: Recent work [78] has proposed promising efficient fine-grained synchronization primitives on GPUs. However, a main limitation this work cites is SIMT deadlocks. This limitation rules out library based implementations for such primitives that could boost efficiency and ease programmability. Our proposed techniques enable practical adoption of such proposals. This is discussed further in Section 9.3.

SIMT Compilers: This work makes the observation that SIMT compilers need to be aware of the SIMT scheduling constraints to avoid generating SIMT deadlocks. This affects all optimizations that alter the CFG or that move instructions across CFG basic blocks.
Our work provides a way to detect whether a certain application is prone to SIMT deadlocks given a certain transformation (using variations of our detection algorithm). Alternatively, our SSDE algorithms can be used to resolve any SIMT deadlocks caused by other transformations.

SIMT Programming Model: This thesis shows that the choice of conventional high level MIMD semantics versus low level SIMD semantics can be made by the programmer, rather than dictated by the underlying implementation. Our SSDE and AWARE techniques maintain the default SIMT behaviour if disabled, for cases where a programmer is interested in low level optimizations that are dependent on predefined reconvergence locations.

MIMD-compatible SIMT Architectures: AWARE, as a realistic example, lays the foundation to consider MIMD-compatibility as a design goal for SIMT hardware implementations. Delayed and timed-out reconvergence mechanisms have the potential to serve as static and/or dynamic mechanisms that optimize the selection of reconvergence points at the granularity of individual branches. This could help improve SIMD utilization [137] and/or caching behaviour [116].

Warp Scheduling Policy: BOWS shows that a low cost and simple extension to the warp scheduling policy can significantly improve the performance of applications with inter-thread synchronization. Thus, it could be integrated with current warp scheduling policies to accelerate the ongoing improvement in thread synchronization in GPGPUs.

9.3 Directions of Future Work

We have already hinted at some of the ideas that can be explored as an extension or improvement of this work at the end of each chapter. In this section, we discuss some of these ideas in more detail.

9.3.1 SIMT Synchronization APIs

The support of high level synchronization APIs on SIMT architectures seems inevitable with the current interest in high level programming models. There have been different proposals that assume such APIs [121, 155].
However, as detailed in Chapter 8, these proposals overlooked how such APIs can be used arbitrarily without introducing SIMT deadlocks. In this thesis, we showed how the compiler or the hardware can eliminate SIMT deadlocks in general scenarios. However, one of the main challenges we faced is false detections. Both SSDE (Chapter 3) and AWARE (Chapter 5) suffered performance degradation due to excessive false detections. The introduction of explicit synchronization APIs can alleviate some of the burden required to detect synchronization. Further, BOWS (Chapter 6) currently reacts to the run-time detection mechanism of spinning loops. With explicit synchronization APIs, the compiler can add annotations that help BOWS to proactively adjust the warp scheduling policy.

    // some code
    lock(&mutex[..], "region-1");
    // critical section
    unlock(&mutex[..], "region-1");
    // some code
    lock(&mutex[..], "region-2");
    // critical section
    if (some_condition)
        unlock(&mutex[..], "region-2");
    else
        unlock(&mutex[..], "region-2");

Figure 9.1: Illustration of the use of named locks.

However, false detection can still be present with explicit synchronization APIs. For example, for lock-based synchronization, avoiding SIMT induced deadlocks while still maintaining reconvergence as soon as possible requires the detection of lock and unlock pairs. If the code contains multiple critical sections then conservative decisions should be made regarding this pairing. Note that this is necessary regardless of how the SIMT architecture implements the lock and unlock APIs. If the lock API is implemented by translating it to a single ISA instruction, then the divergence mechanism should set the reconvergence point of the implicit divergence at the end of the lock statement to beyond its unlock pair(s).
Similarly, if the lock API is translated to a busy-wait loop then the code should be transformed to include the release statements from the unlock APIs inside the loop.

Therefore, we propose to use a notion of named locks that is analogous to the notion of named barriers used in CUDA. In named locks, a programmer adds a name parameter to the lock and its corresponding unlock statement(s). For example, Figure 9.1 illustrates how named locks can be used. In this contrived example, the compiler, with the help of the names, can clearly identify the bounds of a critical section. Note that the names do not need to have any representation in the hardware. The compiler uses the name hint to set the synchronization primitives correctly. If names are not provided, then it falls back to the conservative decisions.

Named locks can be part of programming language extensions that provide a solution for the dilemma of supporting robust and efficient synchronization on the SIMT programming model. Explicit warp level barriers introduced in CUDA 8.0 can also be used by programmers to limit false detections by the compiler.

9.3.2 Runtime Livelock Detection

Our proposal for Dynamic Detection of Spinning (DDOS) can be further extended to detect the presence of livelocks in parallel kernels. It can be used as an initial lightweight mechanism that monitors the overall progress of warps in the system. If our mechanism indicates that all warps have been spinning for a long period of time, a heavyweight mechanism that checks the full system state can be triggered to confirm.

One potential challenge is that DDOS is designed to detect spinning in short loops as is the case with busy-wait loops. This helped to reduce the length of its value and path history registers. However, livelock can appear due to complex scenarios that involve larger, consecutive and/or nested loops. To address longer loops, longer history registers can be used with time sharing enabled to reduce costs.
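
The two-stage idea described at the start of this section (a cheap monitor that escalates to an expensive check only when every warp appears stuck) could look roughly like the sketch below. It is only an illustration under stated assumptions: the class `LivelockWatchdog`, its `threshold` parameter, and the per-cycle `spinning_flags` input (standing in for the output of a spin detector such as DDOS) are all hypothetical, and the heavyweight full-state check itself is not modelled.

```python
class LivelockWatchdog:
    """Lightweight first-stage monitor for potential livelocks (hypothetical)."""

    def __init__(self, num_warps, threshold):
        self.threshold = threshold            # cycles of continuous spinning
        self.spin_cycles = [0] * num_warps    # per-warp consecutive spin count

    def tick(self, spinning_flags):
        """Advance one cycle.

        spinning_flags[w] is True if the (assumed) spin detector reports that
        warp w is spinning this cycle. Returns True when every warp has been
        spinning for at least `threshold` consecutive cycles, i.e. when the
        heavyweight full-state check should be triggered.
        """
        for warp, spinning in enumerate(spinning_flags):
            if spinning:
                self.spin_cycles[warp] += 1
            else:
                self.spin_cycles[warp] = 0    # progress observed: reset
        return bool(self.spin_cycles) and all(
            count >= self.threshold for count in self.spin_cycles
        )


if __name__ == "__main__":
    watchdog = LivelockWatchdog(num_warps=2, threshold=3)
    for cycle in range(4):
        print(cycle, watchdog.tick([True, True]))
```

A single warp making progress resets its counter and therefore suppresses the trigger, which matches the intent that the expensive check runs only when no warp in the system is advancing.
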
To address consecutive and nested loops, different history registers need to be allocated for each loop.

9.3.3 Reconvergence Adequacy Prediction

The flexibility of AWARE enables an opportunity to optimize the location of reconvergence points. The immediate postdominator point of a branch could be a safe choice to ensure adequate SIMD utilization but it is not necessarily the optimal choice. In some cases, it may be beneficial for some threads to run ahead beyond the reconvergence point to bring data into the cache for lagging ones. Further, in nested loops when the opportunistic early reconvergence is enabled, it may be beneficial for threads waiting at the reconvergence point of an inner loop to skip the reconvergence and start another iteration of the outer loop to have a chance to merge with threads executing the inner loop.

The optimal reconvergence points can be specified by the programmer, or determined by a profiling-based optimization or a compiler optimization. In such a case, the Instruction Set Architecture should support a special instruction to set reconvergence points (such an instruction already exists in Nvidia GPUs). The reconvergence adequacy can also be dynamically predicted. For example, metrics such as the number of available warps, the frequency of long latency operations, the inter-thread locality within the same warp and the loop nesting depth can be considered to tune reconvergence decisions.

Bibliography

[1] D. A. Alcantara, A. Sharf, F. Abbasinejad, S. Sengupta, M. Mitzenmacher, J. D. Owens, and N. Amenta. Real-time parallel hashing on the GPU. ACM Transactions on Graphics (TOG), 28(5):154, 2009. → pages 5
[2] J. Alglave, M. Batty, A. F. Donaldson, G. Gopalakrishnan, J. Ketema, D. Poetzl, T. Sorensen, and J. Wickerson. GPU Concurrency: Weak Behaviours and Programming Assumptions. In Proc. ACM Conf. on Arch. Support for Prog. Lang. and Op. Sys. (ASPLOS), 2015. → pages 142
[3] J. Alsop, M. S. Orr, B. M. Beckmann, and D. A. Wood. Lazy release consistency for GPUs.
In Proc. IEEE/ACM Symp. on Microarch. (MICRO), 2016. → pages 142
[4] AMD. Accelerated Parallel Processing: OpenCL Programming Guide. 2013. → pages 3
[5] AMD Corporation. Southern Islands Series Instruction Set Architecture, 2012. → pages 3, 18, 22, 63, 81, 98
[6] T. E. Anderson. The performance of spin lock alternatives for shared-memory multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 1990. → pages 90
[7] S. Antao, C. Bertolli, A. Bokhanko, A. Eichenberger, H. Finkel, S. Ostanevich, E. Stotzer, and G. Zhang. OpenMP Offload Infrastructure in LLVM. Technical report, 2015. → pages 4
[8] ATI Stream Computing. OpenCL Programming Guide. AMD Corporation, 2010. → pages 1
[9] M. Awatramani, X. Zhu, J. Zambreno, and D. Rover. Phase aware warp scheduling: Mitigating effects of phase behavior in GPGPU applications. In Proc. IEEE/ACM Conf. on Par. Arch. and Comp. Tech. (PACT), 2015. → pages 143
[10] A. Badal and A. Badano. Accelerating Monte Carlo Simulations of Photon Transport in a Voxelized Geometry Using a Massively Parallel Graphics Processing Unit. Medical physics, 36:4878, 2009. → pages 129
[11] A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In Proc. IEEE Symp. on Perf. Analysis of Systems and Software (ISPASS), pages 163–174, 2009. → pages 2, 10, 82, 105, 126, 128, 129
[12] M. Bauer, H. Cook, and B. Khailany. CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, page 12. ACM, 2011. → pages 23
[13] M. Bauer, S. Treichler, and A. Aiken. Singe: Leveraging Warp Specialization for High Performance on GPUs. ACM SIGPLAN Notices, 49(8):119–130, 2014. → pages 23
[14] G.-T. Bercea, C. Bertolli, S. F. Antao, A. C. Jacob, A. E. Eichenberger, T. Chen, Z. Sura, H. Sung, G. Rokos, D. Appelhans, et al. Performance Analysis of OpenMP on a GPU Using a Coral Proxy Application.
In Proc. ACM Int’l Workshop on Perf. Modeling, Benchmarking, and Simulation of High Perf. Computing Sys., 2015. → pages 60
[15] C. Bertolli, S. F. Antao, A. E. Eichenberger, K. O’Brien, Z. Sura, A. C. Jacob, T. Chen, and O. Sallenave. Coordinating GPU Threads for OpenMP 4.0 in LLVM. In Proc. LLVM Compiler Infrastructure in HPC, 2014. → pages xiii, 4, 31, 55, 60, 62, 92
[16] C. Bertolli, S. F. Antao, G.-T. Bercea, A. C. Jacob, A. E. Eichenberger, T. Chen, Z. Sura, H. Sung, G. Rokos, D. Appelhans, et al. Integrating GPU support for OpenMP offloading directives into Clang. In Proc. ACM Int’l Workshop on the LLVM Compiler Infrastructure in HPC, 2015. → pages 60, 82
[17] A. Betts, N. Chong, A. Donaldson, S. Qadeer, and P. Thomson. GPUVerify: a Verifier for GPU Kernels. In Proc. ACM Int’l Conf. on Object oriented programming systems languages and applications, 2012. → pages 24, 35, 42, 146
[18] B. Beylin and R. S. Glanville. Insertion of multithreaded execution synchronization points in a software program, 2013. US Patent 8,381,203. → pages 3, 18, 22, 52
[19] G. Brooks, G. J. Hansen, and S. Simmons. A New Approach to Debugging Optimized Code. In Proc. ACM Conf. on Programming Language Design and Implementation (PLDI), 1992. → pages 54
[20] A. Brownsword. Cloth in OpenCL. Technical report, Khronos Group, 2009. → pages 39, 96, 105
[21] N. Brunie, S. Collange, and G. Diamos. Simultaneous Branch and Warp Interweaving for Sustained GPU Performance. In Proc. IEEE/ACM Symp. on Computer Architecture (ISCA), pages 49–60, 2012. → pages 135, 136, 145
[22] J. M. Bull. Measuring synchronisation and scheduling overheads in OpenMP. In Proc. European Workshop on OpenMP, volume 8, page 49, 1999. → pages 55, 61
[23] M. Burke and R. Cytron. Interprocedural Dependence Analysis and Parallelization. In Proc. ACM SIGPLAN Symp. on Compiler Construction, 1986. → pages 37, 53
[24] M. Burtscher and K. Pingali. An Efficient CUDA Implementation of the Tree-based Barnes Hut n-Body Algorithm. GPU computing Gems Emerald edition, 2011.
→ pages 39, 40, 48, 49, 55, 96, 105
[25] M. Burtscher, R. Nasre, and K. Pingali. A Quantitative Study of Irregular Programs on GPUs. In Proc. IEEE Symp. on Workload Characterization (IISWC), 2012. → pages 2, 4, 147
[26] S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A Benchmark Suite for Heterogeneous Computing. In Proc. IEEE Symp. on Workload Characterization (IISWC), pages 44–54, 2009. → pages 10, 39, 55, 101, 105, 128
[27] S. Chen and L. Peng. Efficient GPU hardware transactional memory through early conflict resolution. In Proc. IEEE Symp. on High-Perf. Computer Architecture (HPCA), 2016. → pages 141
[28] P. Collingbourne, A. F. Donaldson, J. Ketema, and S. Qadeer. Interleaving and lock-step semantics for analysis and verification of GPU kernels. In Programming Languages and Systems, pages 270–289. Springer, 2013. → pages 170, 177
[29] B. Coon and J. Lindholm. System and method for managing divergent threads in a SIMD architecture, 2008. US Patent 7,353,369. → pages 3, 17, 18, 22, 52
[30] B. W. Coon, P. C. Mills, S. F. Oberman, and M. Y. Siu. Tracking Register Usage during Multithreaded Processing Using a Scoreboard having Separate Memory Regions and Storing Sequential Register Size Indicators, 2008. US Patent 7,434,032. → pages 69, 122
[31] J. Coplin and M. Burtscher. Effects of Source-Code Optimizations on GPU Performance and Energy Consumption. In Proc. ACM Workshop on General Purpose Processing on Graphics Processing Units, 2015. → pages 55, 61
[32] J. Dean and C. Chambers. Towards Better Inlining Decisions Using Inlining Trials. In Proceedings of the 1994 ACM Conference on LISP and Functional Programming, LFP ’94, pages 273–282. ACM, 1994. ISBN 0-89791-643-3. doi:10.1145/182409.182489. URL http://doi.acm.org/10.1145/182409.182489. → pages 53
[33] G. Diamos, B. Ashbaugh, S. Maiyuran, A. Kerr, H. Wu, and S. Yalamanchili. SIMD Re-convergence at Thread Frontiers. In Proc. IEEE/ACM Symp. on Microarch. (MICRO), pages 477–488, 2011.
→ pages 17, 87, 125, 126, 135, 145
[34] J.-R. C. G. V. G. O. C. J. H. F. M. A. T. A. S. N. P. K. R. M. Diamos, Gregory Frederick. Execution of divergent threads using a convergence barrier, January 2016. US Patent 20160019066. → pages 65, 87, 140
[35] Dmitry Mikushin. CUDA to LLVM-IR. URL https://github.com/apc-llc/nvcc-llvm-ir. Accessed on 2017-09-06. → pages 40, 55
[36] K. Du Bois, S. Eyerman, J. Sartor, and L. Eeckhout. Criticality stacks: Identifying critical threads in parallel programs using synchronization behavior. In Proc. IEEE/ACM Symp. on Computer Architecture (ISCA), 2013. → pages 89, 114
[37] A. ElTantawy. SSDE and AWARE codes. URL https://github.com/ElTantawy/mimd to simt/. Accessed on 2017-09-06. → pages 30, 55, 82
[38] A. ElTantawy and T. M. Aamodt. MIMD Synchronization on SIMT Architectures. In Proc. IEEE/ACM Symp. on Microarch. (MICRO), 2016. → pages 25, 97, 115
[39] A. ElTantawy, J. W. Ma, M. O’Connor, and T. M. Aamodt. A Scalable Multi-Path Microarchitecture for Efficient GPU Control Flow. In Proc. IEEE Symp. on High-Perf. Computer Architecture (HPCA), 2014. → pages 87
[40] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi. Clearing the Clouds: a Study of Emerging Scale-Out Workloads on Modern Hardware. In Proc. ACM Conf. on Arch. Support for Prog. Lang. and Op. Sys. (ASPLOS), pages 37–48, 2012. ISBN 978-1-4503-0759-8. → pages xii, 29
[41] W. Fung, I. Sham, G. Yuan, and T. Aamodt. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. In Proc. IEEE/ACM Symp. on Microarch. (MICRO), pages 407–420, 2007. → pages 2, 18, 87, 136, 144
[42] W. W. Fung and T. M. Aamodt. Energy efficient GPU transactional memory via space-time optimizations. In Proc. IEEE/ACM Symp. on Microarch. (MICRO), 2013. → pages 141
[43] W. W. Fung, I. Singh, A. Brownsword, and T. M. Aamodt. Hardware Transactional Memory for GPU Architectures. In Proc. IEEE/ACM Symp. on Microarch. (MICRO), pages 296–307, 2011.
→ pages 24, 39, 40, 55, 96, 105, 115, 141
[44] W. W. L. Fung and T. M. Aamodt. Thread Block Compaction for Efficient SIMT Control Flow. In Proc. IEEE Symp. on High-Perf. Computer Architecture (HPCA), pages 25–36, 2011. → pages 17, 87, 125, 136, 145
[45] B. R. Gaster, D. Hower, and L. Howes. HRF-relaxed: Adapting HRF to the complexities of industrial heterogeneous memory models. ACM Transactions on Architecture and Code Optimization (TACO), 2015. → pages 142
[46] A. Gharaibeh and M. Ripeanu. Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance. In Proc. ACM Int’l Conf. for High Performance Computing, Networking, Storage and Analysis (SC), 2010. → pages 129
[47] A. Habermaier and A. Knapp. On the Correctness of the SIMT Execution Model of GPUs. In Programming Languages and Systems, pages 316–335. Springer, 2012. → pages 4, 23, 24, 42, 138, 170, 177
[48] T. D. Han and T. S. Abdelrahman. Reducing Branch Divergence in GPU programs. In Proc. ACM Workshop on General Purpose Processing on Graphics Processing Units, 2011. → pages 62
[49] T. D. Han and T. S. Abdelrahman. Reducing Divergence in GPGPU Programs with Loop Merging. In Proc. ACM Workshop on General Purpose Processing on Graphics Processing Units, 2013. → pages 62
[50] M. Harris. CUDA 8 AND BEYOND. URL http://on-demand.gputechconf.com/gtc/2016/presentation/s6224-mark-harris.pdf. Accessed on 2017-09-06. → pages 39
[51] B. A. Hechtman, S. Che, D. R. Hower, Y. Tian, B. M. Beckmann, M. D. Hill, S. K. Reinhardt, and D. A. Wood. QuickRelease: A throughput-oriented approach to release consistency on GPUs. In Proc. IEEE Symp. on High-Perf. Computer Architecture (HPCA), 2014. → pages 142
[52] J. Hennessy. Symbolic Debugging of Optimized Code. ACM Trans. on Prog. Lang. and Sys. (TOPLAS), 4(3):323–344, 1982. → pages 54
[53] T. H. Hetherington, T. G. Rogers, L. Hsu, M. O’Connor, and T. M. Aamodt. Characterizing and Evaluating a Key-Value Store Application on Heterogeneous CPU-GPU Systems. In Proc. IEEE Symp. on Perf.
Analysis of Systems and Software (ISPASS), pages 88–98, 2012. → pages 128, 129
[54] T. H. Hetherington, T. G. Rogers, L. Hsu, M. O’Connor, and T. M. Aamodt. MemcachedGPU: Scaling-up Scale-out Key-value Store. To appear in Proceedings of the ACM Symposium on Cloud Computing (SoCC’15), pages 88–98, 2015. → pages 4, 5
[55] C. A. R. Hoare. Communicating sequential processes. Springer, 1978. → pages 170
[56] A. Holey and A. Zhai. Lightweight software transactions on GPUs. In International Conference on Parallel Processing, 2014. → pages 142
[57] U. Hölzle, C. Chambers, and D. Ungar. Debugging Optimized Code with Dynamic Deoptimization. In Proc. ACM Conf. on Programming Language Design and Implementation (PLDI), 1992. → pages 54
[58] S. Hong, S. K. Kim, T. Oguntebi, and K. Olukotun. Accelerating CUDA Graph Algorithms at Maximum Warp. In Proc. ACM Symp. on Prin. and Prac. of Par. Prog. (PPoPP), pages 267–276, 2011. → pages 4
[59] S. Horwitz, T. Reps, and D. Binkley. Interprocedural Slicing Using Dependence Graphs. In PLDI, 1988. → pages 37, 53
[60] M. Houston, B. Gaster, L. Howes, M. Mantor, and D. Behr. Method and System for Synchronization of Workitems with Divergent Control Flow, 2013. WO Patent App. PCT/US2013/043,394. → pages 3, 17
[61] IBM. Vector technology. URL http://www.ibm.com/support/knowledgecenter/en/SSAT4T 15.1.4/com.ibm.xlf1514.lelinux.doc/proguide/vec tech.html. Accessed on 2017-09-06. → pages 2
[62] IBM. Auto-vectorization limitations. URL http://www.ibm.com/support/knowledgecenter/en/SSAT4T 15.1.4/com.ibm.xlf1514.lelinux.doc/proguide/auto vec limit.html. Accessed on 2017-09-06. → pages 2
[63] IBM. Cell Broadband Engine Architecture and its first implementation. URL https://www.ibm.com/developerworks/library/pa-cellperf/. Accessed on 2017-09-06. → pages 2
[64] Intel Corporation. The ISPC Parallel Execution Model. URL https://ispc.github.io/ispc.html#the-ispc-parallel-execution-model. Accessed on 2017-09-06. → pages 1, 2, 3, 5, 17, 18
[65] Intel Developer Zone.
Weird behaviour of atomic functions. URL 2012. Accessed on 2017-09-06. → pages 147
[66] Jeff Larkin, NVIDIA. OpenMP and NVIDIA. URL http://openmp.org/sc13/SC13 OpenMP and NVIDIA.pdf. Accessed on 2017-09-06. → pages 31, 60
[67] R. Karrenberg and S. Hack. Whole-function Vectorization. In CGO’11. → pages 62, 63
[68] D. Lacey, N. D. Jones, E. Van Wyk, and C. C. Frederiksen. Proving Correctness of Compiler Optimizations by Temporal Logic. ACM SIGPLAN Notices, 37(1):283–294, 2002. → pages 171
[69] C. Lattner and V. Adve. LLVM: A Compilation Framework for Lifelong Program Analysis and Transformation. In Proc. IEEE/ACM Symp. on Code Generation and Optimization (CGO), 2004. → pages 62
[70] M. Lee, G. Kim, J. Kim, W. Seo, Y. Cho, and S. Ryu. iPAWS: Instruction-issue pattern-based adaptive warp scheduling for GPGPUs. In Proc. IEEE Symp. on High-Perf. Computer Architecture (HPCA), 2016. → pages 144
[71] S. Lee, S.-J. Min, and R. Eigenmann. OpenMP to GPGPU: a Compiler Framework for Automatic Translation and Optimization. In Proc. ACM Symp. on Prin. and Prac. of Par. Prog. (PPoPP), pages 101–110, 2009. → pages 4, 62
[72] S.-Y. Lee and C.-J. Wu. CAWS: criticality-aware warp scheduling for GPGPU workloads. In Proc. IEEE/ACM Conf. on Par. Arch. and Comp. Tech. (PACT), 2014. → pages 19, 113, 143
[73] S.-Y. Lee, A. Arunkumar, and C.-J. Wu. CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration of GPGPU workloads. In ISCA, 2015. → pages 90
[74] Y. Lee, R. Krashinsky, V. Grover, S. Keckler, and K. Asanovic. Convergence and Scalarization for Data-Parallel Architectures. In Proc. IEEE/ACM Symp. on Code Generation and Optimization (CGO), pages 1–11, 2013. → pages 87, 145
[75] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi. GPUWattch: Enabling Energy Optimizations in GPGPUs. In Proc. IEEE/ACM Symp. on Computer Architecture (ISCA), 2013. → pages 10, 135
[76] X. Leroy. A formally verified compiler back-end.
Journal of Automated Reasoning, 43(4):363–446, 2009. → pages 170
[77] A. Levinthal and T. Porter. Chap - A SIMD Graphics Processor. Proc. ACM Conf. on Comp. Grap. and Interactive Tech. (SIGGRAPH), 1984. → pages 3, 18, 22
[78] A. Li, G.-J. van den Braak, H. Corporaal, and A. Kumar. Fine-grained Synchronizations and Dataflow Programming on GPUs. In Proc. ACM Conf. on Supercomputing (ICS), 2015. → pages 4, 33, 62, 63, 105, 139, 149
[79] G. Li, P. Li, G. Sawaya, G. Gopalakrishnan, I. Ghosh, and S. P. Rajan. GKLEE: Concolic Verification and Test Generation for GPUs. In PPoPP, 2012. → pages 24, 35, 42, 146
[80] T. Li, A. R. Lebeck, and D. J. Sorin. Spin Detection Hardware for Improved Management of Multithreaded Systems. IEEE Transactions on Parallel and Distributed Systems, 2006. → pages 89, 97, 98, 114
[81] J. Liu, J. Yang, and R. Melhem. SAWS: Synchronization aware GPGPU warp scheduling for multiple independent warp schedulers. In Proceedings of the 48th International Symposium on Microarchitecture, pages 383–394. ACM, 2015. → pages 89, 113, 143
[82] Y. Liu, Z. Yu, L. Eeckhout, V. J. Reddi, Y. Luo, X. Wang, Z. Wang, and C. Xu. Barrier-aware warp scheduling for throughput processors. In Proceedings of the 2016 International Conference on Supercomputing, page 42. ACM, 2016. → pages 89, 113
[83] Y. Liu, Z. Yu, L. Eeckhout, V. J. Reddi, Y. Luo, X. Wang, Z. Wang, and C. Xu. Barrier-Aware Warp Scheduling for Throughput Processors. In Proc. ACM Conf. on Supercomputing (ICS), 2016. → pages 19, 143
[84] LLVM Compiler. Clang Front End. URL http://clang.llvm.org/. Accessed on 2017-09-06. → pages 40, 55
[85] LLVM Compiler. LIBCLC Library. URL http://libclc.llvm.org/. Accessed on 2017-09-06. → pages 40, 55
[86] LLVM Compiler. LLVM 3.6 Release Information. URL http://llvm.org/releases/3.6.0/. Accessed on 2017-09-06. → pages 30, 39, 55
[87] LLVM Compiler. LLVM Alias Analysis Infrastructure, 2015. URL http://llvm.org/docs/AliasAnalysis.html. Accessed on 2017-09-06. → pages 39, 55
[88] W. Mansky.
Specifying and verifying program transformations with PTRANS. PhD thesis, University of Illinois at Urbana-Champaign, 2014. → pages 170, 176
[89] W. Mansky and E. L. Gunter. Verifying optimizations for concurrent programs. In OASIcs-OpenAccess Series in Informatics, volume 40. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2014. → pages 170, 171, 176
[90] M. Mendez-Lojo, M. Burtscher, and K. Pingali. A GPU Implementation of Inclusion-based Points-to Analysis. In Proc. ACM Symp. on Prin. and Prac. of Par. Prog. (PPoPP), 2012. → pages 4
[91] J. Meng, D. Tarjan, and K. Skadron. Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance. In Proc. IEEE/ACM Symp. on Computer Architecture (ISCA), pages 235–246, 2010. → pages 2, 17, 23, 87, 119, 135, 144
[92] D. Merrill, M. Garland, and A. Grimshaw. Scalable GPU Graph Traversal. In Proc. ACM Symp. on Prin. and Prac. of Par. Prog. (PPoPP), pages 117–128, 2012. → pages 4
[93] Michael Wong, Alexey Bataev. OpenMP GPU/Accelerators Coming of Age in Clang. Accessed on 2017-09-06. → pages 31, 60
[94] Microsoft. URL https://msdn.microsoft.com/en-us/library/b38674ky.aspx. Accessed on 2017-09-06. → pages 55, 61
[95] P. Misra and M. Chaudhuri. Performance Evaluation of Concurrent Lock-free Data Structures on GPUs. In Proc. IEEE Int’l Parallel and Distributed Processing Symp. (IPDPS), 2012. → pages 115
[96] N. Moscovici, N. Cohen, and E. Petrank. POSTER: A GPU-Friendly Skiplist Algorithm. In Proc. ACM Symp. on Prin. and Prac. of Par. Prog. (PPoPP), 2017. → pages 4, 115
[97] G. S. Murthy, M. Ravishankar, M. M. Baskaran, and P. Sadayappan. Optimal Loop Unrolling for GPGPU Programs. In Proc. IEEE Int’l Parallel and Distributed Processing Symp. (IPDPS), 2010. → pages 62
[98] S. Naffziger, J. Warnock, and H. Knapp. When processors hit the power wall (or “when the CPU hits the fan”). In Proc. IEEE Int’l Conf. on Solid-State Circuits Conference (ISSCC), pages 16–17, 2005. → pages 1
[99] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O.
Mutlu, and Y. N. Patt. Improving GPU Performance via Large Warps and Two-Level Warp Scheduling. In Proc. IEEE/ACM Symp. on Microarch. (MICRO), pages 308–317, 2011. → pages 2, 18, 19, 87, 113, 136, 145
[100] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt. Improving GPU performance via large warps and two-level warp scheduling. In Proc. IEEE/ACM Symp. on Microarch. (MICRO), 2011. → pages 142
[101] G. Noaje, C. Jaillet, and M. Krajecki. Source-to-source Code Translator: OpenMP C to CUDA. In IEEE Int’l Conf. on High Performance Computing and Communications (HPCC), 2011. → pages 4, 62
[102] NVIDIA. LibNVVM Library. URL http://docs.nvidia.com/cuda/libnvvm-api/. Accessed on 2017-09-06. → pages 40, 55
[103] NVIDIA. Inside Volta: The World’s Most Advanced Data Center GPU. URL https://devblogs.nvidia.com/parallelforall/inside-volta/. Accessed on 2017-09-06. → pages 5, 49, 65, 87, 140
[104] NVIDIA. NVIDIA’s Next Generation CUDA Compute Architecture: Fermi. 2009. → pages 13
[105] NVIDIA. PTX: Parallel Thread Execution ISA Version 3.1. http://developer. download. nvidia. com/compute/cuda/3, 1, 2013. → pages 23, 99
[106] NVIDIA. CUDA SDK 3.2, September 2013. → pages 39, 55, 129
[107] NVIDIA. CUDA Binary Utilities. http://docs.nvidia.com/cuda/cuda-binary-utilities/, 2015. → pages 52, 63
[108] NVIDIA. NVIDIA Tesla P100. 2016. → pages 13
[109] NVIDIA. NVIDIA TESLA V100 GPU ARCHITECTURE. 2017. → pages 87
[110] NVIDIA Compute. GPU Applications Accelerated. URL http://www.nvidia.ca/object/gpu-applications.html. Accessed on 2017-09-06. → pages 2
[111] NVIDIA, CUDA. NVIDIA CUDA Programming Guide, 2011. → pages xii, 1, 2, 3, 15, 23, 25, 105
[112] NVIDIA Forums. GLSL Spinlock. URL https://devtalk.nvidia.com/default/topic/768115/opengl/glsl-spinlock/. Accessed on 2017-09-06. → pages 25
[113] NVIDIA Forums. GLSL Spinlock Never Terminates. URL http://stackoverflow.com/questions/11809421/glsl-spinlock-never-terminates. Accessed on 2017-09-06. → pages 25, 147
[114] NVIDIA Forums.
atomicCAS does NOT seem to work. URL http://forums.nvidia.com/index.php?showtopic=98444. Accessed on 2017-09-06. → pages 24, 25, 41, 137
[115] NVIDIA Forums. atomic locks. URL https://devtalk.nvidia.com/default/topic/512038/atomic-locks/. Accessed on 2017-09-06. → pages 24, 25, 41, 137
[116] O. Mutlu et al. Runahead execution: An effective alternative to large instruction windows. In HPCA’03. → pages 88, 150
[117] OpenMP Clang Frontend. OpenMP Clang Frontend Documentation. URL https://github.com/clang-omp. Accessed on 2017-09-06. → pages 4, 31, 60
[118] M. S. Orr, S. Che, A. Yilmazer, B. M. Beckmann, M. D. Hill, and D. A. Wood. Synchronization using remote-scope promotion. In Proc. ACM Conf. on Arch. Support for Prog. Lang. and Op. Sys. (ASPLOS), 2015. → pages 142
[119] M. A. O’Neil, D. Tamir, and M. Burtscher. A parallel GPU version of the traveling salesman problem. 2011. → pages 96, 106
[120] V. Podlozhnyuk. Histogram Calculation in OpenCL. Technical report, NVIDIA, 2009. → pages xiii, 37, 38
[121] A. Ramamurthy. Towards Scalar Synchronization in SIMT Architectures. Master’s thesis, The University of British Columbia, 2011. → pages 24, 27, 42, 49, 86, 87, 137, 141, 150
[122] M. Rhu and M. Erez. CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures. In Proc. IEEE/ACM Symp. on Computer Architecture (ISCA), pages 61–71, 2012. → pages 87, 136, 145
[123] M. Rhu and M. Erez. The Dual-Path Execution Model for Efficient GPU Control Flow. In Proc. IEEE Symp. on High-Perf. Computer Architecture (HPCA), pages 235–246, 2013. → pages xiv, 87, 118, 128, 129, 135, 144
[124] T. G. Rogers, M. O’Connor, and T. M. Aamodt. Cache-Conscious Wavefront Scheduling. In Proc. IEEE/ACM Symp. on Microarch. (MICRO), pages 72–83, 2012. → pages 2, 18, 19, 70, 82, 91, 113, 127, 134, 143
[125] T. G. Rogers, M. O’Connor, and T. M. Aamodt. Divergence-aware warp scheduling. In Proc. IEEE/ACM Symp. on Microarch. (MICRO), 2013. → pages 143
[126] T. G. Rogers, D. R. Johnson, M.
O’Connor, and S. W. Keckler. A variable warp size architecture. In Proc. IEEE/ACM Symp. on Computer Architecture (ISCA), 2015. → pages 146
[127] T. G. Rogers, D. R. Johnson, M. O’Connor, and S. W. Keckler. A variable warp size architecture. In Proc. IEEE/ACM Symp. on Computer Architecture (ISCA), pages 489–501, 2015. → pages 143
[128] R. M. Russell. The CRAY-1 computer system. Communications of the ACM, 21(1):63–72, 1978. → pages 2
[129] J. Sanders and E. Kandrot. CUDA by example: an introduction to general-purpose GPU programming. Addison-Wesley Professional, 2010. → pages xii, 5, 7, 26, 96
[130] T. Shah. FabMem: A Multiported RAM and CAM Compiler for Superscalar Design Space Exploration. Master’s thesis, North Carolina State University, 2010. → pages 135
[131] R. Sharma, M. Bauer, and A. Aiken. Verification of Producer-Consumer Synchronization in GPU Programs. In Proc. ACM Conf. on Programming Language Design and Implementation (PLDI), pages 88–98, 2015. → pages 24, 35, 42
[132] M. D. Sinclair, J. Alsop, and S. V. Adve. Efficient GPU synchronization without scopes: Saying no to complex consistency models. In Proc. IEEE/ACM Symp. on Microarch. (MICRO), 2015. → pages 142
[133] I. Singh, A. Shriraman, W. W. Fung, M. O’Connor, and T. M. Aamodt. Cache coherence for GPU architectures. In Proc. IEEE Symp. on High-Perf. Computer Architecture (HPCA), 2013. → pages 142
[134] V. C. Sreedhar, G. R. Gao, and Y.-F. Lee. Identifying loops using DJ graphs. ACM Trans. on Prog. Lang. and Sys. (TOPLAS), 1996. → pages 40
[135] J. Stine, I. Castellanos, M. Wood, J. Henson, F. Love, W. Davis, P. Franzon, M. Bucher, S. Basavarajaiah, J. Oh, and R. Jenkal. FreePDK: An Open-Source Variation-Aware Design Kit. In Proc. IEEE of Microelectronic Systems Education (MSE), pages 173–174, 2007. doi:10.1109/MSE.2007.44. → pages 134
[136] J. A. Stratton, S. S. Stone, and W. H. Wen-mei. MCUDA: An Efficient Implementation of CUDA Kernels for Multi-Core CPUs. In Languages and Compilers for Parallel Computing. Springer, 2008.
→ pages 62
[137] T. D. Han and T. S. Abdelrahman. Reducing divergence in GPGPU programs with loop merging. In GPGPU Workshop’13. → pages 88, 150
[138] T. M. Aamodt et al. GPGPU-Sim 3.x Manual. University of British Columbia. URL http://gpgpu-sim.org/manual/index.php5/GPGPU-Sim 3.x Manual. Accessed on 2017-09-06. → pages 82, 105
[139] Tangent Vector. A Digression on Divergence. URL https://tangentvector.wordpress.com/2013/04/12/a-digression-on-divergence/. Accessed on 2017-09-06. → pages 25
[140] X. Tian and B. R. de Supinski. Explicit Vector Programming with OpenMP 4.0 SIMD Extension. Primeur Magazine 2014, 2014. → pages 5
[141] T. Tsiodras. Real-time raytracing: Renderer. URL https://www.thanassis.space/renderer.html. Accessed on 2014-01-01. → pages 129
[142] D. M. Tullsen, J. L. Lo, S. J. Eggers, and H. M. Levy. Supporting fine-grained synchronization on a simultaneous multithreading processor. In HPCA, 1999. → pages 114
[143] E. Vallejo, R. Beivide, A. Cristal, T. Harris, F. Vallejo, O. Unsal, and M. Valero. Architectural support for fair reader-writer locking. In Proc. IEEE/ACM Symp. on Microarch. (MICRO), 2010. → pages 89, 114
[144] B. Wang, Y. Zhu, and W. Yu. OAWS: Memory occlusion aware warp scheduling. In Proc. IEEE/ACM Conf. on Par. Arch. and Comp. Tech. (PACT), 2016. → pages 144
[145] Y. Wang, S. Chen, J. Wan, J. Meng, K. Zhang, W. Liu, and X. Ning. A Multiple SIMD, Multiple Data (MSMD) Architecture: Parallel Execution of Dynamic and Static SIMD fragments. In Proc. IEEE Symp. on High-Perf. Computer Architecture (HPCA), pages 603–614, 2013. doi:10.1109/HPCA.2013.6522353. → pages 145
[146] A. Wijs and D. Bošnački. GPUexplore: many-core on-the-fly state space exploration using GPUs. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 233–247. Springer, 2014. → pages 5
[147] H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos. Demystifying GPU Microarchitecture Through Microbenchmarking. In Proc. IEEE Symp. on Perf.
Analysis of Systems and Software (ISPASS), pages 235–246, 2010. → pages 24, 81
[148] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In ACM SIGARCH Computer Architecture News. → pages 49
[149] J. Wu, A. Belevich, E. Bendersky, M. Heffernan, C. Leary, J. Pienaar, B. Roune, R. Springer, X. Weng, and R. Hundt. gpucc: An Open-source GPGPU Compiler. In Proc. IEEE/ACM Symp. on Code Generation and Optimization (CGO), 2016. → pages 62
[150] M. L. Xiaowei Ren. Efficient sequential consistency in GPUs via relativistic cache coherence. In Proc. IEEE Symp. on High-Perf. Computer Architecture (HPCA), 2016. → pages 142
[151] Y. Xu, R. Wang, N. Goswami, T. Li, L. Gao, and D. Qian. Software Transactional Memory for GPU Architectures. In Proc. IEEE/ACM Symp. on Code Generation and Optimization (CGO), page 1, 2014. → pages 24, 115, 142
[152] Y. Xu, L. Gao, R. Wang, Z. Luan, W. Wu, and D. Qian. Lock-based Synchronization for GPU Architectures. In Proc. Int’l Conf. on Computing Frontiers, 2016. → pages 4, 62, 63, 139
[153] Y. Yang, P. Xiang, J. Kong, and H. Zhou. A GPGPU Compiler for Memory Optimization and Parallelism Management. In Proc. ACM Conf. on Programming Language Design and Implementation (PLDI), 2010. → pages 62
[154] A. Yilmazer. Micro-architectural Support for Improving Synchronization and efficiency of SIMD execution on GPUs. PhD thesis, Northeastern University, 2014. → pages 138
[155] A. Yilmazer and D. Kaeli. HQL: A Scalable Synchronization Mechanism for GPUs. In Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, 2013. → pages 87, 89, 111, 114, 115, 138, 139, 150
[156] Y. Yu, W. Xiao, X. He, H. Guo, Y. Wang, and X. Chen. A stall-aware warp scheduling for dynamically optimizing thread-level parallelism in GPGPUs. 2015. → pages 143
[157] Y. Zhang, F. Mueller, X. Cui, and T. Potok.
GPU-accelerated Text Mining. In Workshop on Exploiting Parallelism using GPUs and other Hardware-Assisted Methods, 2009. → pages 5
[158] Y. Zhang, F. Mueller, X. Cui, and T. Potok. Data-intensive Document Clustering on Graphics Processing Unit (GPU) Clusters. Journal of Parallel and Distributed Computing, 71(2):211–224, 2011. → pages 5
[159] K. Zhao and X. Chu. G-BLASTN: accelerating nucleotide alignment by graphics processors. Bioinformatics, 30(10):1384, 2014. → pages 5
[160] M. Zheng, V. T. Ravi, F. Qin, and G. Agrawal. GRace: a Low-overhead Mechanism for Detecting Data Races in GPU Programs. In ACM SIGPLAN Notices, volume 46, pages 135–146. ACM, 2011. → pages 42, 146
[161] W. Zhu. Synchronization State Buffer: Supporting Efficient Fine-Grain Synchronization on Many-Core Architectures. In Proc. IEEE/ACM Symp. on Computer Architecture (ISCA), 2007. → pages 89, 114

Appendix A
SSDE Correctness Discussion

This appendix presents a semi-formal proof of the correctness of the Static SIMT Deadlock Elimination (SSDE) algorithm.

A.1 Proof Outline

This section outlines a proof that the transformation $T$ defined by Algorithm 4 satisfies the following theorem.

Theorem 1 Let $P_i$ be a program that is an input to transformation $T$, and let $P_o = T(P_i)$ be the program that results from applying $T$ to $P_i$. Then any observable behaviour of $P_o$ on a SIMT machine is also an observable behaviour of $P_i$ on a MIMD machine (i.e., a machine that guarantees loose fairness in thread scheduling)^1.

This definition of correctness has been used as a correctness criterion for both program transformations [55, 76, 88, 89] and execution models [28, 47]. We define observable behaviour as the shared memory state, the return value of the program (if any), and the termination properties. We divide the proof of Theorem 1 into two steps that need to be proven:

A. Any observable behaviour of $P_o$ is also an observable behaviour of $P_i$ when both are executed on a MIMD machine.
B. Any observable behaviour of $P_o$ on a SIMT machine is also an observable behaviour of $P_o$ on a MIMD machine.

^1 We assume $P_i$ satisfies the assumptions outlined at the beginning of Section 3.3.1. We also assume that, on both SIMT and MIMD machines, the number of launched threads does not exceed the number of threads that can reside simultaneously on the hardware.

[Figure A.1: Visualization of $T$. Before (in $P_i$): $p: I_p$, $h: I_h$, $b: I_b$, $t: I_t$, $d: I_d$, $r: I_r$. After (in $P_o$): $p: I_p$, $h: I_h$, $b: I_b$, $t: c=1$, $t_n: I_t$, $d: c=0$, $d_n: I_d$, $r: \texttt{if(c) goto } h$, $r_n: I_r$.]

A.2 Proof Details

A. Preservation of semantics across transformations: To prove this part we use the same methodology used in TRANS [68, 89]. Let a program $P$ have the form:

\[
P : \text{Entry};\ I_1;\ I_2;\ \cdots;\ I_{m-1};\ \text{Exit}
\]
\[
\begin{array}{llcl}
\text{Inst} & I & ::= & \texttt{nop} \mid X := E \mid M \mid \texttt{if } E \texttt{ goto } n \\
\text{M. Inst} & M & ::= & X := m(E) \mid m(E) := X \\
\text{Mem.} & m & ::= & \text{shared memory} \\
\text{Expr} & E & ::= & X \mid O(E) \mid C \\
\text{Op} & O & ::= & \text{various unspecified operators} \\
\text{Var} & X & ::= & x \mid y \mid z \mid \ldots \\
\text{Const} & C & ::= & \text{bool} \mid \text{integer} \mid \text{float} \mid \ldots \text{ value} \\
\text{Label} & n & ::= & 0 \mid 1 \mid \ldots \mid m
\end{array}
\]

Program $P$ can be represented as a labeled directed flow graph $G_P$ formed from the program nodes $N = \{0, 1, \ldots, m\}$. Each node $n$ has at least one sequential edge to another node $seq(n)$, and possibly another branch edge to a node $brn(n)$. Thus, an edge in the flow graph is defined by two nodes and an edge type. $I_n$ refers to the instruction labeled by node $n$. Hence, $G_P$ can be represented by the tuple $\langle N,\ E \subseteq N \times N \times EdgeType,\ I : N \mapsto Instr \rangle$. Finally, a valuation function $\sigma$ is used to map a pattern of meta variables to a (sub)object in $G$ (i.e., node, edge, instr, expr, ...).

In concurrent MIMD execution, a set of threads traverses $G_P$ such that at any time the execution state of $P$ can be represented by a tuple $(nodes, states, m)$, where $nodes$ is a vector of the labels representing the current location of each thread in $G_P$, $states$ is a vector of the local state of each thread, and $m$ is the current state of the externally observable shared memory.
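The flow-graph tuple and the MIMD execution state described above can be sketched as plain data structures. This is a minimal illustration only: the class names `FlowGraph` and `ExecState`, the three-node example program, and the choice to store instructions as text are assumptions made here, not part of the thesis.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FlowGraph:
    """Labeled flow graph <N, E, I>: nodes, typed edges, and instructions."""
    instr: Dict[str, str]                                # I : N -> Instr (kept as text)
    seq: Dict[str, str]                                  # sequential edge seq(n)
    brn: Dict[str, str] = field(default_factory=dict)    # optional branch edge brn(n)

    def edges(self) -> List[Tuple[str, str, str]]:
        """Return E as sorted (source, target, edge_type) triples."""
        out = [(n, m, "seq") for n, m in self.seq.items()]
        out += [(n, m, "brn") for n, m in self.brn.items()]
        return sorted(out)


@dataclass
class ExecState:
    """MIMD execution state (nodes, states, m)."""
    nodes: List[str]                 # current node label of each thread
    states: List[Dict[str, int]]     # local variables of each thread
    m: Dict[int, int]                # shared memory, address -> value


# Hypothetical program: Entry -> n1 -> Exit, where n1 may branch back to itself.
g = FlowGraph(
    instr={"Entry": "nop", "n1": "if x goto n1", "Exit": "nop"},
    seq={"Entry": "n1", "n1": "Exit"},
    brn={"n1": "n1"},
)
# Two threads, both at Entry, with identical local state and empty shared memory.
st0 = ExecState(nodes=["Entry", "Entry"], states=[{"x": 0}, {"x": 0}], m={})
print(g.edges())
```

Keeping `seq` and `brn` as separate maps mirrors the two edge types of the formal model, so an edge is identified by its two endpoints plus its type.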
Thus, the execution trace of a program can be modeled by state transitions $st_i \rightarrow st_{i+1}$, where $st_i = (nodes, states, m)_i$. The state of an individual thread $t$ is represented as $st(t) = (n, s, m) = (nodes_t, states_t, m)$. A state transition happens when a thread executes the instruction labeled by its current node. Next, we present the state transition relations for an instruction $I_n$ at node $n$:

\[
\begin{array}{lll}
(n, s, m) \rightarrow (seq(n),\ s,\ m); & I_n = \texttt{nop} \\
(n, s, m) \rightarrow (seq(n),\ s[x \mapsto eval(e, s)],\ m); & I_n = x := e \\
(n, s, m) \rightarrow (seq(n),\ s[x \mapsto m(eval(e, s))],\ m); & I_n = x := m(e) \\
(n, s, m) \rightarrow (seq(n),\ s,\ m[eval(e, s) \mapsto s(x)]); & I_n = m(e) := x \\
(n, s, m) \rightarrow (seq(n),\ s,\ m); & I_n = \texttt{if } e \texttt{ goto } brn(n), & eval(e, s) = F \\
(n, s, m) \rightarrow (brn(n),\ s,\ m); & I_n = \texttt{if } e \texttt{ goto } brn(n), & eval(e, s) = T
\end{array}
\]

Next, we describe the transformation in Algorithm 4 using the TRANS transformation language. For clarity, the transformation is simplified to handle a single structured loop ($L_0$). However, both the algorithm description and the proof can be extended in a straightforward manner to handle the transformation of multiple (un)structured loops. In TRANS, a transformation is specified as a set of actions performed on a flow graph $G$ under certain conditions. We describe our transformation $T$ as follows:

\[
\begin{array}{ll}
\lceil \text{replace } r \text{ with } r \mapsto \sigma(\texttt{nop});\ r_n \mapsto I_r \rceil(\sigma, G) & (a1) \\
\lceil \text{move edge}(t, h, r) \rceil(\sigma, G) & (a2) \\
\lceil \text{add cond edge}(r, h, t) \rceil(\sigma, G) = & (a3) \\
\quad \lceil \text{replace } d \text{ with } d \mapsto \sigma(c := 0);\ d_n \mapsto I_d \rceil(\sigma, G) & (a3.1) \\
\quad\quad \text{if } d \models EX(node(r)) \land d \neq t \land A\neg E(true\ U\ use(c)) \\
\quad \lceil \text{replace } t \text{ with } t \mapsto \sigma(c := 1);\ t_n \mapsto I_t \rceil(\sigma, G) & (a3.2) \\
\quad\quad \text{if } A\neg E(true\ U\ use(c)) \\
\quad \lceil \text{replace } r \text{ with } r \mapsto \sigma(\texttt{if } (c) \texttt{ goto } h) \rceil(\sigma, G) & (a3.3) \\
\text{if} \\
\quad loop(p, h, b, t) \equiv L_0 & (c1) \\
\quad r \models SafePDOM(L_0) & (c2)
\end{array}
\]

Next, we prove the correctness of transformation $T$ when applied to $L_0$, assuming a MIMD execution model.

Single-Thread Equivalence: We prove that there exists a relation $R$ that relates the execution states of the original program to those of the transformed program for a thread $t$, under the assumption that all other threads $u \neq t$ remain at the same initial execution state.
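The state transition relations listed above can be read as a small single-thread interpreter. The sketch below is an illustration under stated assumptions: instructions are encoded as tuples, expressions as Python callables over the local state, unwritten memory cells read as 0, and the four-node example program is hypothetical.

```python
from typing import Dict

State = Dict[str, int]     # local variables s of one thread
Memory = Dict[int, int]    # shared memory m, address -> value


def step(n, s, m, instr, seq, brn):
    """Apply one transition (n, s, m) -> (n', s', m') for the instruction at node n."""
    op = instr[n]
    s, m = dict(s), dict(m)                 # copy: the caller's state is left untouched
    if op[0] == "nop":
        pass
    elif op[0] == "assign":                 # x := e
        _, x, e = op
        s[x] = e(s)
    elif op[0] == "load":                   # x := m(e); unwritten cells read as 0 (assumption)
        _, x, e = op
        s[x] = m.get(e(s), 0)
    elif op[0] == "store":                  # m(e) := x
        _, e, x = op
        m[e(s)] = s[x]
    elif op[0] == "if":                     # if e goto brn(n)
        _, e = op
        if e(s):
            return brn[n], s, m             # taken: follow the branch edge
    else:
        raise ValueError(f"unknown instruction {op[0]!r}")
    return seq[n], s, m                     # otherwise follow the sequential edge


# Hypothetical program: i := 0; do { i := i + 1 } while (i < 3); m(0) := i
instr = {
    0: ("assign", "i", lambda s: 0),
    1: ("assign", "i", lambda s: s["i"] + 1),
    2: ("if", lambda s: s["i"] < 3),
    3: ("store", lambda s: 0, "i"),
}
seq = {0: 1, 1: 2, 2: 3, 3: 4}   # node 4 plays the role of Exit
brn = {2: 1}

n, s, m, steps = 0, {}, {}, 0
while n != 4:
    n, s, m = step(n, s, m, instr, seq, brn)
    steps += 1
print(s, m, steps)
```

Each branch of `step` corresponds to one line of the transition table; only the `if` case can leave through `brn(n)`, and every other instruction advances along `seq(n)`.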
We then prove that if this relation $R$ holds, then the two programs have the same observable behaviour as defined earlier.

The set of actions is performed on a loop $L_0$ with preheader $p$, header $h$, break (exit) $b$, and tail $t$. Node $r$ is $SafePDOM(L_0)$. Action a1 adds a nop instruction at node $r$. The backward edge from the loop tail $t$ to the loop header $h$ is moved from $h$ to $r$ by action a2. Finally, action a3 adds an edge from $r$ to $h$ that is taken only if node $r$ is entered from $t$. This replaces the nop instruction in $r$ with a conditional branch instruction that falls through to $r_n$ if a variable $c$ has a value of 0 (i.e., if the thread reaches $r$ from any predecessor $d$ other than $t$) and branches to $h$ if $c$ has a value of 1 (i.e., if the thread reaches $r$ from $t$). Thus, node $r$ acts as a switch that redirects the flow of execution to an output node according to the input node. The conditions on actions a3.1 and a3.2 ensure that $c$ is not used by any node that follows $t$ or $d$ other than the newly added branch instruction at $r$. According to these actions, the following properties are satisfied:

\[
\begin{array}{ll}
(I_n)_{P_i} = (I_n)_{P_o},\quad seq(n)_{P_i} = seq(n)_{P_o} \quad \text{for } n \notin \{d, r, t\} & (p1) \\
(I_d)_{P_o} = \sigma(c := 0),\quad seq(d)_{P_o} = d_n & (p2) \\
(I_{d_n})_{P_o} = (I_d)_{P_i},\quad seq(d_n)_{P_o} = seq(d)_{P_i} & (p3) \\
(I_r)_{P_o} = \sigma(\texttt{if } (c) \texttt{ goto } h),\quad seq(r)_{P_o} = r_n & (p4) \\
(I_{r_n})_{P_o} = (I_r)_{P_i},\quad seq(r_n)_{P_o} = seq(r)_{P_i} & (p5) \\
(I_t)_{P_o} = \sigma(c := 1),\quad seq(t)_{P_o} = t_n,\quad (I_{t_n})_{P_o} = (I_t)_{P_i} & (p6) \\
seq(t)_{P_i} = h,\quad seq(t_n)_{P_o} = r & (p7)
\end{array}
\]

Assume the following state transitions for a thread $t$:

\[
st_0(t) \rightarrow st_1(t) \rightarrow \ldots\ st_i(t) = (n, s, m)\ \ldots \rightarrow st_k(t) \quad \text{from } P_i
\]
\[
st'_0(t) \rightarrow st'_1(t) \rightarrow \ldots\ st'_i(t) = (n', s', m')\ \ldots \rightarrow st'_l(t) \quad \text{from } P_o
\]

Also assume that for all other threads $u \neq t$, $st_0(u) = st'_0(u)$. Then, the following relation $R$ holds:

R1. $s'_l$ except $\sigma(c)$ $= s_k$
R2. $m'_l = m_k$
R3. $I'_l = I_k$
R4.
\[
node(n'_l) =
\begin{cases}
node(n_k) & node(n_k) \notin \{d, r, t\} \\
d_n & node(n_k) = d \\
r_n & node(n_k) = r \\
t_n & node(n_k) = t
\end{cases}
\]
for $l = k + u + v + w + y$, where:
$u$ = number of nodes $n_i$ with $i \leq k$ such that $node(n_i) = d$,
$v$ = number of nodes $n_i$ with $i \leq k$ such that $node(n_i) = r$,
$w$ = number of nodes $n_i$ with $i \leq k$ such that $node(n_i) = t$,
$y$ = number of transitions $node(n_i) \rightarrow node(n_{i+1}) = t \rightarrow h$ with $i < k$.

$R$ simply states that $P_o$ simulates the behaviour of $P_i$, but in potentially more execution state transitions, according to the specific execution path that was taken^2. These extra transitions account for the execution of the nodes added by the transformation $T$. The local state of a thread executing $P_o$ at $l$ may differ only in the newly added variable $c$ that controls the branch at $r$. We prove $R$ by induction over $k$.

^2 $R$ in this case is called a simulation relation.

Proof Logic: Assuming that the relation holds for $k$, we consider one transition from $k$ to $k+1$. We find that, according to $R$, $P_o$ should simulate the behaviour of $P_i$ in either one or two transitions from $l$. This is determined by $node(n_{k_n})$ and $node(n_{k_n - 1})$. Then, we proceed by proving that this indeed holds for all possible transitions from $k$ to $k+1$. Given that relation $R$ holds, it is trivial to prove program equivalence. First, if an execution terminates for $P_i$ at $k$, then it terminates for $P_o$ at $l = k + u + v + w + y$. Further, since $I_k = I'_l = ret(e)$, $s'_l$ except $\sigma(c)$ $= s_k$, and $e$ does not use $\sigma(c)$, the return values are the same.

Base Case: The base case is trivially true. For the same input and the same initial state, R1, R2 and R3.2 hold.

Step Case: Assume that $R$ is true for $k$. Prove that $R$ is true for $k_n = k + 1$.

Case-1: $\sigma(n_{k_n}) \notin \{d, r, t\}$ and $\sigma(n_k) \rightarrow \sigma(n_{k+1}) \neq t \rightarrow h$. Given the condition of Case-1, $u_n = u$, $v_n = v$, $w_n = w$, $y_n = y$, and $l_n = l + 1$ (i.e., execution of $P_o$ should simulate the execution of $P_i$ at $k_n$ in a single step from $l$). From R1 and R3 at $k$ and the state transition relations, it is trivial to prove that $s'_{l_n}/\sigma(c) = (s'_l/\sigma(c)) \triangleright \sigma(I'_l) = s_k \triangleright \sigma(I_k) = s_{k_n}$, where $\triangleright$ denotes applying an instruction to a state; i.e., R1 holds for $k_n$. Similarly, $m'_{l_n} = m'_l \triangleright \sigma(I'_l) = m_{k_n}$; i.e., R2 holds for $k_n$. From the conditions of Case-1, $n_k \notin \{d, t\}$. In case $\sigma(n_k) \neq r$, then $\sigma(n'_l) = \sigma(n_k)$. Thus, $\sigma(n'_l \triangleright I'_l) = \sigma(n_k \triangleright I_k)$, that is, $\sigma(n'_{l_n}) = \sigma(n_{k_n})$. Also, for $\sigma(n_k) = r$, $\sigma(n'_l) = r_n$ and $seq(r)_{P_i} = seq(r_n)_{P_o}$. Thus, $\sigma(n'_{l_n}) = \sigma(n_{k_n})$ for $\sigma(n_k) = r$. Therefore, R4 applies for $k_n$ in all possible cases. Finally, since R4 holds and both executions are at the same label, we can conclude from property p1 that R3 holds at $k_n$.

Case-2: $\sigma(n_{k_n}) = d$. Then, $u_n = u + 1$ and $l_n = l + 2$. This means that execution of $P_o$ should simulate the execution of $P_i$ at $k_n$ in two steps from $l$. Since $R$ holds for $k$, we can prove that after a single step from $l$, $P_o$ is at the same node as $P_i$ at $k_n$, namely $d$. However, according to property p2, $(I_d)_{P_o} \neq (I_d)_{P_i}$ (i.e., R3 does not hold after a single step from $l$). Instead, using properties p2 and p3, we find that R3 is satisfied after two steps. Similar to Case-1, we can prove all other properties at $l_n = l + 2$. We can similarly prove Case-3: $\sigma(n_{k_n}) = t$.

Case-4: $\sigma(n_k) \rightarrow \sigma(n_{k+1}) = t \rightarrow h$. Then, $y_n = y + 1$ and $l_n = l + 2$. Since R4 holds at $k$, we know that $\sigma(n'_l) = t_n$. According to property p7, the next node in $P_o$ is $r$ and in $P_i$ is $h$ (i.e., R3 and R4 do not hold after a single step from $l$). We also know that, coming from $t_n$, $\sigma(c) = 1$ and that $(I_r)_{P_o}$ evaluates to a taken branch to $h$. Thus, we find that $R$ holds for $k_n$ at $l_n = l + 2$. We can similarly prove Case-5: $\sigma(n_{k_n}) = r$.

Lifting the simulation relation to parallel execution: We need to prove that steps by threads other than $t$ preserve the simulation relation [88, 89]. $R$ implies that $m' = m$, so both the original and the transformed program are guaranteed to maintain exactly the same view of shared memory for a thread $u \neq t$. Thus, a memory read from an arbitrary location $loc$ by a thread $u$ yields the same value from $m'$ and from $m$. This conclusion is intuitive, as the transformation does not reorder memory operations (reads, writes, or memory barriers). Further, we need to prove that memory updates by a thread $u$ cannot change the memory such that the simulation relation $R$ no longer holds. This is evident from the proof of $R$, which is independent of the shared memory state.
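The single-thread simulation argument can be checked on a concrete loop. The sketch below is an illustration under stated assumptions, not the thesis implementation: the tuple instruction encoding, the helper names `run` and `transform`, the node names, and the example program are hypothetical, and `transform` handles only one loop whose $I_t$, $I_d$, and $I_r$ are not branches. It applies actions a1 to a3 to a small $P_i$, runs both programs with one thread, and compares shared memory, local state without $c$, and the step count $l = k + u + v + w + y$ at termination.

```python
def run(instr, seq, brn, entry="p", exit_="exit"):
    """Run one thread to completion; return (local state, memory, visited nodes)."""
    s, m, n, trace = {}, {}, entry, []
    while n != exit_:
        trace.append(n)
        op, nxt = instr[n], seq[n]
        if op[0] == "assign":                 # x := e
            s[op[1]] = op[2](s)
        elif op[0] == "load":                 # x := m(e); unwritten cells read as 0
            s[op[1]] = m.get(op[2](s), 0)
        elif op[0] == "store":                # m(e) := x
            m[op[1](s)] = s[op[2]]
        elif op[0] == "if" and op[1](s):      # taken branch follows brn(n)
            nxt = brn[n]
        n = nxt
    return s, m, trace


def transform(instr, seq, brn, t, h, r, d):
    """Sketch of T for a single loop; assumes I_t, I_d and I_r are not branches."""
    assert seq[t] == h                        # the back edge t -> h must exist in P_i
    instr, seq, brn = dict(instr), dict(seq), dict(brn)
    tn, dn, rn = t + "n", d + "n", r + "n"
    instr[rn], seq[rn] = instr[r], seq[r]                   # a1: rn takes I_r
    instr[tn], seq[tn] = instr[t], r                        # a2: back edge now enters r
    instr[dn], seq[dn] = instr[d], seq[d]                   # dn takes I_d
    instr[t], seq[t] = ("assign", "c", lambda s: 1), tn     # a3.2: t sets c := 1
    instr[d], seq[d] = ("assign", "c", lambda s: 0), dn     # a3.1: d sets c := 0
    instr[r], seq[r], brn[r] = ("if", lambda s: s["c"] == 1), rn, h  # a3.3: switch at r
    return instr, seq, brn


# Hypothetical P_i: p -> h -> b; b exits to d when i >= 3, else t -> h; d -> r -> exit.
pi_instr = {
    "p": ("assign", "i", lambda s: 0),
    "h": ("assign", "i", lambda s: s["i"] + 1),
    "b": ("if", lambda s: s["i"] >= 3),
    "t": ("store", lambda s: s["i"], "i"),    # m(i) := i
    "d": ("load", "x", lambda s: 1),          # x := m(1)
    "r": ("store", lambda s: 0, "x"),         # m(0) := x
}
pi_seq = {"p": "h", "h": "b", "b": "t", "t": "h", "d": "r", "r": "exit"}
pi_brn = {"b": "d"}

s_i, m_i, tr_i = run(pi_instr, pi_seq, pi_brn)
s_o, m_o, tr_o = run(*transform(pi_instr, pi_seq, pi_brn, t="t", h="h", r="r", d="d"))

k = len(tr_i)
u, v, w = tr_i.count("d"), tr_i.count("r"), tr_i.count("t")
y = sum(1 for a, b in zip(tr_i, tr_i[1:]) if (a, b) == ("t", "h"))
s_o_without_c = {name: val for name, val in s_o.items() if name != "c"}
print(m_o == m_i, s_o_without_c == s_i, len(tr_o) == k + u + v + w + y)
```

On this example $P_i$ takes 11 steps and $P_o$ takes 17, the difference being one extra step per visit to $d$, $r$, and $t$, plus one per $t \rightarrow h$ transition.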
Relation $R$ is built on subtle changes in the CFG that depend only on the newly added local variable $c$. Note that we did not depend on a specific memory model to prove that the simulation relation holds for the case of parallel execution.

B. Preservation of Semantics across execution models: In this part we rely on prior work [28, 47] that proves that the execution of an arbitrary program $P$ on a SIMT machine can be simulated by some schedule of the traditional interleaved thread execution (i.e., MIMD execution) of the same program. Thus, terminating kernels on a SIMT machine produce a valid observable behaviour compared with MIMD execution. However, it is still possible for a program that always terminates on MIMD (i.e., under any loosely fair scheduling) to not terminate on SIMT. Therefore, it is sufficient for us to prove that $P_o$, an output of $T$, always terminates on a SIMT machine if $P_o$ (or equivalently $P_i$) always terminates on a MIMD machine (i.e., with any arbitrary loosely fair schedule).

Proof Logic: Termination is trivially proven if we can prove that all threads executing any arbitrary branch in $P_o$ eventually reach the branch reconvergence point. To construct such a proof, we rely on two main claims:

Claim-1: $P_o$ terminates on a MIMD machine under any arbitrary loosely fair scheduling.

Proof: We assume that $P_i$ terminates on a MIMD machine under any arbitrary loosely fair scheduling. According to the proof presented earlier, the transformation $T$ preserves the program semantics on a MIMD machine, and $P_o$ simulates the behaviour of $P_i$ on a MIMD machine, including the termination properties. Thus, we conclude that $P_o$ terminates on a MIMD machine under any arbitrary fair scheduling.

Claim-2: The valuation of the exit condition in any loop in $P_o$ is independent of the valuation of paths parallel to or reachable from the loop.
The definitions of parallel and reachable paths are listed in Listing 1.

Proof: This property is forced by the transformation $T$ presented in Algorithm 4. As shown by Algorithm 4 and as explained in Section 4.2.1, any loop whose exit depends on the valuation of paths parallel to or reachable from the loop is transformed such that the backward edge of the loop is converted into a forward edge to SafePDom and a backward edge to the original loop header. SafePDom postdominates the original loop exits, the redefining writes, and all control flow paths that could lead to redefining writes that are either reachable from or parallel to the loop.

Now we proceed to prove that all threads executing any arbitrary branch in $P_o$ eventually reach the branch reconvergence point. We prove this by induction over the nesting depth of the control flow graph. For this purpose, consider an arbitrary branch $I^{k}_{BR_{T,NT \mapsto R_k}}$, which is a branch with a nesting depth of $k$. We define the nesting depth as the maximum number of static branch instructions encountered in the control flow path connecting the branch instruction with its reconvergence point $R$.

Base Case-1: $I^{0}_{BR}$. There are no static branches between the branch instruction and its reconvergence point. Since no barriers are placed in divergent code, threads diverged to either side of the branch are guaranteed to reach its reconvergence point. This follows from two facts: 1) in the absence of barriers and loops (i.e., branches), nothing prevents the forward progress of threads; 2) according to constraint-2, once threads executing one side of the branch reach its reconvergence point, execution switches to the threads diverged to the other side.

Base Case-2a: $I^{1}_{BR}$ and $I^{1}_{BR} \in P_{T \mapsto R} \lor I^{1}_{BR} \in P_{NT \mapsto R}$. This means that the branch itself is encountered again before reaching $R$ (i.e., it is a loop exit). Threads may never reach $R$ if the valuation of $I^{1}_{BR}$ does not lead to exiting the loop.
This could happen under only two hypotheses.

(1) The valuation of $I^{1}_{BR}$ is independent of thread scheduling (i.e., it is independent of the execution of other paths parallel to or reachable from the loop whose exit is $I^{1}_{BR}$). However, it never evaluates to a decision that leads to exiting the loop. This contradicts Claim 1, as it implies that $P_o$ does not terminate under fair scheduling. Thus, we exclude this hypothesis.

(2) The valuation of $I^{1}_{BR}$ is dependent on scheduling threads at the bottom of the stack (i.e., it is dependent on the execution of other paths parallel to or reachable from the loop whose exit is $I^{1}_{BR}$). However, for these threads to get scheduled, the looping threads need to exit and reach their reconvergence point to be popped off the stack, allowing the threads at the bottom stack entries to be scheduled. This hypothesis contradicts Claim 2, since the operation of $T$ forces the valuation of the loop exit condition to be independent of paths parallel to or reachable from the loop. Thus, we reject this hypothesis.

From (1) and (2), we conclude that threads divergent at $I^{1}_{BR}$ reach their reconvergence point.

Base Case-2b: $I^{1}_{BR}$ and $I^{1}_{BR} \notin P_{T \mapsto R} \land I^{1}_{BR} \notin P_{NT \mapsto R}$. This means that the branch is not encountered again before reaching $R$; however, another branch is encountered. This other branch could follow the pattern of Base Case 1 or Base Case 2a (i.e., it is a loop exit). Threads may never reach $R$ if the valuation of $I^{1}_{BR}$ does not lead to exiting the loop. This could happen under only two hypotheses.

(1) The valuation of $I^{1}_{BR}$ is independent of thread scheduling (i.e., it is independent of the execution of other paths parallel to or reachable from the loop whose exit is $I^{1}_{BR}$). However, it never evaluates to a decision that leads to exiting the loop. This contradicts Claim 1, as it implies that $P_o$ does not terminate under fair scheduling.
Thus, we exclude this hypothesis.

(2) The valuation of $I^{1}_{BR}$ is dependent on scheduling threads at the bottom of the stack (i.e., it is dependent on the execution of other paths parallel to or reachable from the loop whose exit is $I^{1}_{BR}$). However, for these threads to get scheduled, the looping threads need to exit and reach their reconvergence point to be popped off the stack, allowing the threads at the bottom stack entries to be scheduled. This hypothesis contradicts Claim 2, since the operation of $T$ forces the valuation of the loop exit condition to be independent of paths parallel to or reachable from the loop. Thus, we reject this hypothesis.

From (1) and (2), we conclude that threads divergent at $I^{1}_{BR}$ reach their reconvergence point.

Finally, from A. and B., we conclude that Theorem 1 is correct.
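
The role of Claim 2 in the termination argument can be illustrated with a toy model. The following is a minimal sketch, not the thesis's simulator or Algorithm 4: it assumes a simplified stack-based reconvergence scheme for a single warp, and the names `run_warp`, `acquire_or_goto`, `release` and `goto_if_retry` are hypothetical, introduced only for this illustration. The critical section and the lock release are collapsed into one instruction, and a step budget stands in for non-termination. `ORIGINAL` is a spin loop whose reconvergence point is the loop exit; `TRANSFORMED` has the shape described in the proof of Claim 2, where the back edge is replaced by a forward edge to SafePDom (placed after the release) and a backward edge from there to the loop header, guarded by a per-thread flag that plays the role of the added variable c.

```python
def run_warp(program, num_threads, max_steps=200):
    """Return True if every thread of one warp exits within max_steps."""
    lock = 0                           # shared lock word, 0 means free
    retry = [0] * num_threads          # per-thread flag (role of variable c)
    # Reconvergence stack; each entry is [pc, reconvergence pc, thread mask].
    stack = [[0, None, set(range(num_threads))]]
    for _ in range(max_steps):
        # Pop entries whose threads arrived at their reconvergence point.
        while stack and stack[-1][0] == stack[-1][1]:
            stack.pop()
        if not stack:
            return True
        pc, _rpc, mask = stack[-1]
        op = program[pc]
        if op[0] == "exit":
            stack.pop()
            continue
        next_pc = {}
        for t in sorted(mask):         # atomics within a warp are serialized
            if op[0] == "acquire_or_goto":
                if lock == 0:
                    lock, retry[t], next_pc[t] = 1, 0, pc + 1
                else:
                    retry[t], next_pc[t] = 1, op[1]
            elif op[0] == "release":
                lock, next_pc[t] = 0, pc + 1
            elif op[0] == "goto_if_retry":
                next_pc[t] = op[1] if retry[t] else pc + 1
        targets = set(next_pc.values())
        if len(targets) == 1:          # no divergence: advance this entry
            stack[-1][0] = targets.pop()
        else:                          # divergence: wait at op[2], run paths first
            branch_rpc = op[2]
            stack[-1][0] = branch_rpc
            for target in sorted(targets):
                if target != branch_rpc:
                    stack.append([target, branch_rpc,
                                  {t for t in mask if next_pc[t] == target}])
    return not stack


# Spin loop: on failure retry at pc 0; reconvergence point is the loop exit (pc 1).
ORIGINAL = [
    ("acquire_or_goto", 0, 1),
    ("release",),                      # critical section and release, collapsed
    ("exit",),
]

# Transformed shape: failure takes a forward edge to SafePDom (pc 2), which
# holds the backward edge to the loop header and reconverges at pc 3.
TRANSFORMED = [
    ("acquire_or_goto", 2, 2),
    ("release",),
    ("goto_if_retry", 0, 3),
    ("exit",),
]

print(run_warp(ORIGINAL, 2), run_warp(TRANSFORMED, 2))
```

Under this model, the two-thread run of `ORIGINAL` exhausts the step budget: the thread holding the lock waits at the reconvergence point while the spinning thread stays on top of the stack. `TRANSFORMED` completes because the lock holder reaches the release before any thread takes the backward edge.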
