UBC Theses and Dissertations

GPU computing architecture for irregular parallelism Fung, Wilson Wai Lun 2015

GPU Computing Architecture for Irregular Parallelism

by

Wilson Wai Lun Fung

B. Applied Science, University of British Columbia, 2006
M. Applied Science, University of British Columbia, 2008

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Electrical and Computer Engineering)

The University of British Columbia
(Vancouver)

January 2015

© Wilson Wai Lun Fung, 2015

Abstract

Many applications with regular parallelism have been shown to benefit from using Graphics Processing Units (GPUs). However, employing GPUs for applications with irregular parallelism tends to be a risky process, involving significant effort from the programmer and an uncertain amount of performance/efficiency benefit. One known challenge in developing GPU applications with irregular parallelism is the underutilization of SIMD hardware in GPUs due to the application’s irregular control flow behavior, known as branch divergence. Another major development effort is to expose the available parallelism in the application as thousands of concurrent threads without introducing data races or deadlocks. GPU software developers may need to spend significant effort verifying the data synchronization mechanisms used in their applications. Despite various research studies indicating the potential benefits, the risks involved may discourage software developers from employing GPUs for this class of applications.

This dissertation aims to reduce the burden on GPU software developers with two major enhancements to GPU architectures. First, thread block compaction (TBC) is a microarchitecture innovation that reduces the performance penalty caused by branch divergence in GPU applications. Our evaluations show that TBC provides an average speedup of 22% over a baseline per-warp, stack-based reconvergence mechanism on a set of GPU applications that suffer significantly from branch divergence. Second, Kilo TM is a cost-effective, energy-efficient solution for supporting transactional memory (TM) on GPUs. With TM, programmers can use transactions instead of fine-grained locks to create deadlock-free, maintainable, yet aggressively parallelized code. In our evaluations, Kilo TM achieves 192× speedup over coarse-grained locking and captures 66% of the performance of fine-grained locking with 34% energy overhead.

Preface

This is a list of my publications at the University of British Columbia that have been incorporated into this dissertation, in chronological order:

[C1] Wilson W. L. Fung and Tor M. Aamodt. Thread Block Compaction for Efficient SIMT Control Flow. In Proceedings of the 17th IEEE International Symposium on High-Performance Computer Architecture (HPCA-17), pp. 25-36, February 2011.

[C2] Wilson W. L. Fung, Inderpreet Singh, Andrew Brownsword, Tor M. Aamodt. Hardware Transactional Memory for GPU Architectures. In Proceedings of the 44th IEEE/ACM International Symposium on Microarchitecture (MICRO-44), pp. 296-307, December 2011.

[T1] Wilson W. L. Fung, Inderpreet Singh, and Tor M. Aamodt. Kilo TM Correctness: ABA Tolerance and Validation-Commit Indivisibility. Technical Report, University of British Columbia, 24 May 2012.

[J1] Wilson W. L. Fung, Inderpreet Singh, Andrew Brownsword, Tor M. Aamodt. Kilo TM: Hardware Transactional Memory for GPU Architectures. IEEE Micro, Special Issue: Micro’s Top Picks from 2011 Computer Architecture Conferences, Vol. 32, No. 3, pp. 7-16, May/June 2012.

[C3] Wilson W. L. Fung, Tor M. Aamodt. Energy Efficient GPU Transactional Memory via Space-Time Optimizations. In Proceedings of the 46th IEEE/ACM International Symposium on Microarchitecture (MICRO-46), pp. 408-420, December 2013.

These publications are incorporated into this dissertation as follows.

Chapter 2.
Portions of the text explaining the fundamental concepts of parallel computing are modified from previously written background material from my master’s thesis “Dynamic Warp Formation: Exploiting Thread Scheduling for Efficient MIMD Control Flow on SIMD Graphics Hardware” (2008) completed at the University of British Columbia. The description of the baseline GPU architecture incorporates text from [C1] and [C3].

Chapter 3. A version of this material has been published as [C1]. In [C1], I conducted the research, analyzed the data and drafted the manuscript under the guidance of Dr. Tor M. Aamodt.

Chapter 4. A version of this material has been published as [C2] and later as [J1]. In [C2], I was the lead investigator, responsible for conceptualizing the major contributions, analyzing the data, and drafting the manuscript. Inderpreet Singh helped with data collection. He also analyzed the data for thread cycle distribution and drafted the text for this analysis in [C2]. Andrew Brownsword provided the original source code for the cloth simulation application, which Inderpreet Singh later modified to use transactions and fine-grained locks. Dr. Tor M. Aamodt was the supervisory author on this project and was involved in concept formation and manuscript edits. I wrote the manuscript for [J1] with help from Inderpreet Singh and Dr. Tor M. Aamodt.

Chapter 5. A version of this material has been published as [T1]. In [T1], I conceptualized the proof and drafted the manuscript under the guidance of Dr. Tor M. Aamodt. Inderpreet Singh helped validate the proof and edit the manuscript.

Chapter 6. A version of this material has been published as [C3]. In [C3], I conducted the research, analyzed the data and drafted the manuscript under the guidance of Dr. Tor M. Aamodt.

Chapter 7. This chapter contains text from the related work sections from [C2] and [C3].

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgments
1 Introduction
  1.1 GPU Computing Potential
  1.2 GPU Programming Challenges with Irregular Parallelism
  1.3 Thesis Statement
  1.4 Contributions
  1.5 Organization
2 Background
  2.1 Fundamental Concepts
    2.1.1 Data-Level Parallelism
    2.1.2 Regular Parallelism vs. Irregular Parallelism
    2.1.3 Thread-Level Parallelism
    2.1.4 Data Synchronization
    2.1.5 Single-Instruction, Multiple-Data (SIMD)
    2.1.6 Fine-Grained Multithreading
  2.2 Transactional Memory
    2.2.1 Correctness Criteria
    2.2.2 Design Space
  2.3 GPU Architectures
    2.3.1 Programming Model
    2.3.2 Single-Instruction, Multiple-Thread (SIMT) Execution Model
    2.3.3 Microarchitecture
  2.4 Dynamic Warp Formation
  2.5 Summary
3 Thread Block Compaction for Efficient SIMT Control Flow
  3.1 Workload Classification
  3.2 Dynamic Warp Formation Pathologies
    3.2.1 Warp Barrier
  3.3 Thread Block Compaction
    3.3.1 High-Level Operation
    3.3.2 Implementation
    3.3.3 Example Operation
  3.4 Likely-Convergence Points (LCP)
  3.5 Methodology
  3.6 Experimental Results
    3.6.1 In-Depth Analysis
    3.6.2 Thread Block Prioritization
    3.6.3 Impact on Memory Subsystem
    3.6.4 Sensitivity to Memory Subsystem
  3.7 Implementation Complexity
  3.8 Summary
4 Kilo TM: Hardware Transactional Memory for GPU Architectures
  4.1 Transactional Memory on GPU: Opportunities and Challenges
    4.1.1 Challenges with Prior HTMs on GPUs
  4.2 Kilo Transactional Memory
    4.2.1 SIMT Stack Extension
    4.2.2 Scalable Conflict Detection
    4.2.3 Version Management
    4.2.4 Transaction Log Storage and Transfer
    4.2.5 Distributed Validation/Commit Pipeline
    4.2.6 Concurrency Control
  4.3 Methodology
  4.4 Experimental Results
    4.4.1 Performance
    4.4.2 Execution Time Breakdown
    4.4.3 Sensitivity Analysis
    4.4.4 Implementation Complexity of Kilo TM
  4.5 Summary
5 Kilo TM Correctness Discussion
  5.1 Memory Value-Location Framework for the ABA Problem
    5.1.1 ABA Problem
    5.1.2 Potential ABA Problem in Transactional Memory
    5.1.3 Tolerance to the ABA Problem
    5.1.4 Inconsistent Read-Set
  5.2 Transaction Components in Kilo TM
    5.2.1 Commit ID and Commit Order
  5.3 Partial Orderings Provided by Kilo TM
  5.4 Per-Word Access Ordering
  5.5 Validation Against a Consistent View of Memory
  5.6 Logical Indivisibility of Validation and Commit
  5.7 Tolerance to ABA Problem (Kilo TM)
  5.8 Summary
6 Energy Efficiency Optimizations for Kilo TM
  6.1 Performance and Energy Overhead of Kilo TM
  6.2 Warp-Level Transaction Management (WarpTM)
    6.2.1 Optimizations Enabled by WarpTM
    6.2.2 Hardware Modification to Kilo TM
  6.3 Intra-Warp Conflict Resolution
    6.3.1 Multiplexing Shared Memory for Resolution Metadata
    6.3.2 Sequential Conflict Resolution with Bloom Filter (SCR)
    6.3.3 2-Phase Parallel Conflict Resolution with Ownership Table (2PCR)
  6.4 Temporal Conflict Detection (TCD)
    6.4.1 Globally Synchronized Timer
    6.4.2 Implementation
    6.4.3 Example
    6.4.4 Integration with Kilo TM
  6.5 Putting It All Together
  6.6 Methodology
    6.6.1 Power Model
  6.7 Experimental Results
    6.7.1 Performance and Energy Efficiency
    6.7.2 Energy Usage Breakdown
    6.7.3 WarpTM Optimizations Breakdown
    6.7.4 Intra-Warp Conflict Resolution Overhead
    6.7.5 Temporal Conflict Detection Resource Sensitivity
    6.7.6 Sensitivity to L2 Cache Port Width
    6.7.7 Sensitivity to Core Scaling
  6.8 Summary
7 Related Work
  7.1 Related Work for Branch Divergence Handling on GPUs
    7.1.1 Software Compaction
    7.1.2 Hardware Compaction
    7.1.3 Intra-Warp Divergent Path Management
    7.1.4 Adding MIMD Capability
  7.2 Related Work for Kilo TM
    7.2.1 GPU Software Transactional Memory
    7.2.2 Hardware Transaction Memory with Cache Coherence Protocol
    7.2.3 Signature-Based Conflict Detection
    7.2.4 Value-Based Conflict Detection
    7.2.5 Ring-Based Commit Ordering
    7.2.6 Recency Bloom Filter
    7.2.7 Transaction Scheduling
    7.2.8 Energy Analysis for Transaction Memory
    7.2.9 Intra-Warp Conflict Resolution
    7.2.10 Globally Synchronized Timers in Memory Systems
    7.2.11 Timestamp/Counter-based Conflict Detection
    7.2.12 Transactional Memory Verification
8 Conclusions and Future Work
  8.1 Conclusions
  8.2 Directions for Future Work
    8.2.1 Cost-Benefit-Aware Multi-Scope Compaction
    8.2.2 Extend Kilo TM to Support Strong Isolation
    8.2.3 Application-Driven Transaction Scheduling
    8.2.4 Multithreaded Transactions
Bibliography

List of Tables

Table 1.1 Computational efficiency of state-of-the-art CMPs and GPUs in 2014.
Table 2.1 Memory space mapping between CUDA and OpenCL
Table 3.1 Benchmarks in thread block compaction evaluation.
Table 3.2 GPGPU-Sim configuration for thread block compaction evaluation
Table 3.3 Maximum stack usage for TBC-LCP
Table 4.1 Raw performance (IPC) of GPU-TM applications
Table 4.2 General characteristics of evaluated GPU-TM applications
Table 4.3 TM-specific characteristics of evaluated GPU-TM applications
Table 4.4 GPGPU-Sim configuration for Kilo TM evaluation
Table 6.1 GPGPU-Sim configuration for enhanced Kilo TM evaluation
Table 6.2 GPU TM workloads for performance and energy evaluations.
Table 6.3 Power component breakdown for the added hardware specific to Kilo TM, warp-level transaction management, and temporal conflict detection.
Table 6.4 Performance-optimal concurrent transaction limit and abort-commit ratio.

List of Figures

Figure 2.1 High-level GPU architecture as seen by the programmer.
Figure 2.2 SIMT core microarchitecture of a contemporary GPU
Figure 2.3 Example operation of SIMT stack.
Figure 3.1 Overall performance of TBC
Figure 3.2 SIMD efficiency of GPU applications in evaluation
Figure 3.3 Dynamic warp formation pathologies.
Figure 3.4 DWF with and without warp barrier compared against baseline per-warp SIMT stack reconvergence (PDOM).
Figure 3.5 High-level operation of thread block compaction.
Figure 3.6 Modifications to the SIMT core microarchitecture to implement thread block compaction.
Figure 3.7 Thread block compaction example operation
Figure 3.8 Likely-convergence points
Figure 3.9 Performance and resource impact of likely-convergence points
Figure 3.10 Performance of thread block compaction (TBC) and dynamic warp formation (DWF) relative to baseline per-warp post-dominator SIMT stack (PDOM) for the DIVG and COHE benchmark sets.
Figure 3.11 Detail performance data of TBC and DWF relative to baseline (PDOM) for the DIVG benchmark set.
Figure 3.12 Detail performance data of TBC and DWF relative to baseline (PDOM) for the COHE benchmark set.
Figure 3.13 Performance of TBC with various thread block prioritization policies
Figure 3.14 Average memory traffic of a SIMT core for TBC relative to baseline (PDOM).
Figure 3.15 Average speedup of TBC and DWF over baseline (PDOM) with a smaller L2 cache.
Figure 4.1 Performance comparison between applications running on an ideal TM system and their respective fine-grained (FG) locking version
Figure 4.2 Kilo TM Implementation Overview
Figure 4.3 SIMT stack extension to handle divergence due to transaction aborts
Figure 4.4 Commit unit communication and design overview
Figure 4.5 Logical stage organization and communication traffic of the bounded ring buffer stored inside commit unit.
Figure 4.6 Execution time of GPU-TM applications with Kilo TM
Figure 4.7 Abort/commit ratio of GPU-TM applications with Kilo TM
Figure 4.8 Performance scaling with increasing number of concurrent transactions with Kilo TM
Figure 4.9 Breakdown of thread execution cycles for Kilo TM
Figure 4.10 Breakdown of core execution cycles for Kilo TM
Figure 4.11 Sensitivity to hazard detection mechanism
Figure 4.12 Buffer usage in active commit unit entries
Figure 4.13 Number of in-flight, allocated read and write buffers for different buffer allocation schemes
Figure 6.1 Enhanced Kilo TM implementation overview
Figure 6.2 Transaction conflicts within a warp.
Figure 6.3 Kilo TM Protocol Messages.
Figure 6.4 Interconnection Traffic Breakdown.
Figure 6.5 Reduction in L2 cache accesses from the commit units via validation and commit coalescing.
Figure 6.6 Two-phase parallel intra-warp conflict resolution
Figure 6.7 Hardware extensions for temporal conflict detection.
Figure 6.8 Temporal conflict detection example.
Figure 6.9 Kilo TM enhanced with warp-level transaction management and temporal conflict detection.
Figure 6.10 Execution time of GPU-TM applications with enhanced Kilo TM
Figure 6.11 Energy consumption breakdown of GPU-TM applications
Figure 6.12 Performance comparison with coarse-grained locking
Figure 6.13 Performance impact from different optimizations enabled by WarpTM.
Figure 6.14 Comparison between different intra-warp conflict resolution mechanisms
Figure 6.15 Performance of temporal conflict detection with different last written timetable organizations
Figure 6.16 Performance impact with different L2 cache port widths
Figure 6.17 Performance impact from doubling the number of SIMT cores

List of Abbreviations

2PCR     Two-Phase Parallel Conflict Resolution
API      Application Programming Interface
BW       BandWidth
CID      Commit IDentifier
CMP      Chip-MultiProcessor
COHE     Coherent
CPU      Central Processing Unit
CU       Commit Unit
CUDA     Compute Unified Device Architecture
DIVG     Divergent
DLP      Data-Level Parallelism
DRAM     Dynamic Random-Access Memory
DWF      Dynamic Warp Formation
FG       Fine-Grained
FGL      Fine-Grained Locking
FGMT     Fine-Grained Multi-Threading
FLOPS    FLoating point Operations Per Second
FP       Floating Point
FR-FCFS  First-Ready First-Come First-Serve
GDDR     Graphics Double Data Rate memory
GPGPU    General-Purpose computing on Graphics Processing Unit
GPU      Graphics Processing Unit
HM       Harmonic Mean
HTM      Hardware Transactional Memory
IC       Integrated Circuit
ILP      Instruction-Level Parallelism
IPC      Instructions Per Cycle
IPDOM    Immediate Post-Dominator
ISA      Instruction Set Architecture
Kilo TM  Kilo Transactional Memory
L1       Level 1
L2       Level 2
LCP      Likely-Convergence Point
LPC      Likely-Convergence Program Counter
LRU      Least Recently Used
LWH      Last Writer History
MC       Memory Controller
MIMD     Multiple-Instruction, Multiple-Data
MSHR     Miss-Status Holding Register
NoC      Network-on-Chip
OpenCL   Open Computing Language
PC       Program Counter
PDOM     Post-Dominator
PTX      Parallel Thread Execution
RF       Register File
RPC      Reconvergence Program Counter
RRB      Round-Robin
SCR      Serial Conflict Resolution
SDK      Software Development Kit
SIMD     Single-Instruction, Multiple-Data
SIMT     Single-Instruction, Multiple-Thread
SM       Streaming Multiprocessor
SMT      Simultaneous MultiThreading
SRAM     Static Random-Access Memory
SRR      Sticky Round-Robin
SSE      Streaming SIMD Extensions
STM      Software Transactional Memory
TBC      Thread Block Compaction
TCD      Temporal Conflict Detection
TDRF     Transactional Data-Race-Free
TLP      Thread-Level Parallelism
TM       Transactional Memory
TOS      Top Of Stack
VC       Virtual Channel
WarpTM   Warp-Level Transaction Management
YCID     Youngest Commit IDentifier
flit     flow control digit

Acknowledgments

The work presented in this dissertation
would have only been possible with the help from many individuals and organizations. First and foremost, I would like to thank my supervisor, Dr. Tor Aamodt, for his guidance and support throughout my M.A.Sc. and Ph.D. programs. I was given the valuable opportunity to work alongside Tor since he first joined the University of British Columbia (UBC) as an assistant professor, building a fully functional research group from the ground up. Tor’s dedication, passion, wisdom and relentless pursuit of high-impact, high-quality research have been truly inspirational to me both professionally and personally.

I would also like to thank my qualifying, departmental and final examination committee members: Dr. Matei Ripeanu, Dr. Sathish Gopalakrishnan, Dr. Mark Greenstreet, Dr. Wen-mei Hwu, Dr. Thomas Froese, Dr. Guy Lemieux, Dr. Steve Wilton, Dr. David Michelson, and Dr. Karthik Pattabiraman. Their valuable comments and feedback immensely improved the quality of this dissertation.

I would also like to thank Doug Carmean and Mark Hill for their enthusiastic support of my work on Kilo TM. I am also grateful to my research collaborators, Dr. Arrvindh Shriraman and Dr. Joseph Devietti, who both have expanded my knowledge horizon with their expertise.

I would also like to thank Inderpreet Singh for being a great colleague. The insightful midnight discussions between us as we worked through obstacles in our research have led to many great ideas. I also thank the other members of UBC’s computer architecture research group (Ali Bakhoda, Henry Wong, Xi Chen, Ivan Sham, George Yuan, Andrew Turner, Johnny Kuan, Tim Rogers, Rimon Tadros, Arun Ramamurthy, Jimmy Kwa, Hadi Jooybar, Ayub Gubran, Tayler Hetherington, Ahmed ElTantawy, Andrew Boktor, Myrice Li and Mahmoud Kazemi) for their valuable contributions to our research infrastructure, insightful discussions and timely support.

My peers in the Department of Electrical and Computer Engineering at UBC have been helpful and entertaining to work with, and they helped make my experience at UBC an unforgettable one. In particular, I would like to thank Samer Al-Kiswany, Abdullah Gharaibeh, Elizeu Santos-Neto, Lauro Beltrão Costa, Emalayan Vairavanathan, Mohammad Afrasiabi, Bo Fang and Hao Yang. I also cherish the time I spent with many of my non-academic friends: David Mak, Derek Ho, Jennifer Li, Kenny Au, Fernando Har, Leo Wong, Amber Ting, Erica Ho, Sanford Lam, Tim Chan, Richard Leung and Karen Wong.

I am very grateful for the support and encouragement from my family. My parents have been fully supportive of my decision to pursue my Ph.D. The encouragement from my mother has been heartwarming, and I am regretful for not being with her when she needed my support to fight her sickness (thankfully she has recovered since then). Despite being a physician himself, my father taught me a great deal about computers and instilled my interest in computers from my childhood.

Financial support for this research was provided through the Postgraduate Scholarships – Doctoral (PGS D) award from the Natural Sciences and Engineering Research Council (NSERC) of Canada, as well as the NVIDIA Graduate Fellowship.

Chapter 1

Introduction

For the last few decades, advances in integrated-circuit (IC) fabrication technologies have followed the trend known as “Moore’s Law”, which predicts an exponential growth of the number of transistors on a chip over time [112].
This growth continues with the area density of transistors on integrated circuits doubling roughly every 24 months. Using these additional transistors to improve single-threaded performance on general purpose central processing units (CPUs) has been challenging for computer architects due to the “power wall”, the physical limit of power that an integrated circuit may dissipate before cooling and power delivery become impractically expensive [9, 116]. This limits clock frequency scaling and has motivated the microprocessor industry to migrate towards parallel computing architectures in the form of chip-multiprocessors (CMPs).

A CMP system consists of multiple single-threaded CPU cores on a single chip sharing a common memory subsystem. While these CMP systems can use the extra transistors to provide more computation performance by scaling to more cores, this form of straightforward scaling may also hit the power wall due to the significant slowdown of classical Dennard Scaling in the last few process generations [42, 50]. Dennard Scaling refers to the reduction of transistor operating voltage between successive IC fabrication process generations. In combination with the reduction of switching current from smaller transistor geometry, this voltage reduction has allowed the smaller transistors in the newer process to consume proportionally less dynamic power. As a result, a processor fabricated in a newer process can traditionally be extended to use more transistors without increasing either the power budget or the chip area. However, Dennard Scaling has significantly slowed recently, because lowering the transistor operating voltage any further can lead to transistors operating too close to their threshold voltages, introducing significant leakage current and noise margin reduction. Therefore, simply scaling the current CMP architectures with more CPU cores to consume the extra transistors made available by a newer process will result in a design that consumes more power.
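The Dennard Scaling argument above can be made concrete with the standard first-order model of CMOS dynamic power. This is a textbook sketch, not a derivation taken from this dissertation; the symbols $\alpha$ (activity factor), $C$ (switched capacitance), $V$ (operating voltage), $f$ (clock frequency), $N$ (transistor count), $A$ (chip area) and the per-generation scaling factor $\kappa > 1$ are assumptions introduced here for illustration.

```latex
% First-order dynamic power of one transistor
P_{\mathrm{dyn}} = \alpha \, C \, V^{2} f

% Ideal Dennard Scaling by \kappa:  C \to C/\kappa,  V \to V/\kappa,  f \to \kappa f
P'_{\mathrm{dyn}}
  = \alpha \cdot \frac{C}{\kappa} \cdot \frac{V^{2}}{\kappa^{2}} \cdot \kappa f
  = \frac{P_{\mathrm{dyn}}}{\kappa^{2}}

% The same area A now holds \kappa^{2} N transistors, so power density is unchanged
\frac{\kappa^{2} N \, P'_{\mathrm{dyn}}}{A} = \frac{N \, P_{\mathrm{dyn}}}{A}

% If V can no longer be lowered while C and f scale as before
P''_{\mathrm{dyn}}
  = \alpha \cdot \frac{C}{\kappa} \cdot V^{2} \cdot \kappa f
  = P_{\mathrm{dyn}}
\quad\Longrightarrow\quad
\frac{\kappa^{2} N \, P''_{\mathrm{dyn}}}{A} = \kappa^{2} \, \frac{N \, P_{\mathrm{dyn}}}{A}
```

Under these assumptions, filling the same area with more transistors keeps total power constant only while the voltage keeps dropping; once the voltage is held fixed, the power of a fully populated chip grows with the transistor count.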
In response to this observation, much attention from the computerarchitecture research community has been focusing on more energy-efficient alter-natives. One of the promising alternatives is off-loading computation to graphicsprocessing units (GPUs), a computing paradigm known as GPU Computing.1.1 GPU Computing PotentialGPUs started as fixed function accelerators for real-time 3D graphics rendering incomputer games. Over time, GPU vendors have advanced GPU architectures tosupport various rendering effects as a way to differentiate from their competitors.This endeavor lead to the incorporation of programmable shading in GPUs. Withprogrammable shading, an application can implement a new rendering effect viacustomizing the 3D rendering pipeline with “shaders”, which are snippets of codethat calculate the lighting and transformation of polygon vertices and pixels ina 3D digital scene. Since then, newer generations of GPUs have offered moreprogramming flexibility in the shaders. This has eventually evolved modern GPUsinto an emerging class of massively parallel compute accelerators that featurestheoretical throughput tens to hundreds of times higher than any general purposeCPUs available. Recent advances have also made it possible to program GPUsvia simple extensions to C programming languages, such as CUDA [120, 122]and OpenCL [89]. These new application programming interfaces (APIs) allownon-graphics applications to harness the computing power of GPUs without usinggraphics-oriented APIs such as OpenGL and Direct3D.A key feature that distinguishes GPU from traditional CMP architectures is theamount of exposed parallelism expected from the application. 3D graphics render-ing contains plenty of data-level parallelism – the lighting and color of each pixel2displayed on screen can usually be computed independently. 
Graphics applicationsuse 3D graphics APIs such as OpenGL and Direct3D to specify the shader for eachdisplayed pixel in the 3D scene, with the expectation that the GPU will executethe shaders for all pixels concurrently. Similarly, a GPU compute application typ-ically partitions its workload into thousands, if not millions, of GPU threads, eachworking on a smallest indivisible task in the workload. This is in stark contrast tomultithreaded applications for traditional CMP architectures, where the program-mer would group the small tasks into a number of threads matching the number ofCPU cores in the CMP system.The abundance of application-exposed thread-level parallelism (TLP) in GPUapplications has allowed GPU designers to focus hardware resource on achievinghigher computation throughput, without having to maintain single thread perfor-mance. Specifically, GPU architectures can use the TLP to keep execution unitsbusy without the expensive microarchitecture mechanisms used by traditional out-of-order cores to aggressively exploit instruction-level-parallelism (ILP) from asingle thread. For example, instead of relying on branch prediction to specula-tively fetch and execute instructions beyond a unresolved branch operation froma single thread, a GPU can schedule other threads for execution while a subset ofits threads are resolving the outcome of their branch operations (hence, no wastedwork from branch outcome misprediction). This technique, known as fine-grainedmulti-threading (FGMT), is discussed in more detail in Chapter 2. GPU archi-tectures also use FGMT to tolerate the long latency from memory accesses. Thisreduces the need for low latency memory subsystems, and permits optimizationsthat trade latency for significantly higher memory bandwidth.Furthermore, GPUs use wide single-instruction, multiple-data (SIMD) hard-ware to exploit the regularities in computation among different threads in GPUapplications. 
Since scalar GPU threads tend to perform almost identical computation on different data, they are organized into SIMD execution groups called warps (or wavefronts in AMD terminology) in hardware. Threads in each warp share a common program counter, and they are executed in lockstep on wide SIMD hardware to amortize the control-logic overhead of each thread. Modern GPUs include special hardware to automatically serialize the execution of different subsets of threads that diverge to different control flow paths [56].

Table 1.1: Computational efficiency of state-of-the-art CMPs and GPUs in 2014. This comparison uses single-precision floating point performance. The peak floating point performance for the various CMP systems assumes the application uses SIMD instruction set extensions. TDP = Thermal Design Point, BW = Bandwidth.

Processor             Type   IC Fab. Process   TDP     Peak FP Perf.   Memory BW   Computation Efficiency
NVIDIA Tesla K40      GPU    TSMC 28nm         235 W   4300 GFLOPS     288 GB/s    18.3 GFLOPS/W
AMD R290X             GPU    TSMC 28nm         290 W   5632 GFLOPS     320 GB/s    19.4 GFLOPS/W
Intel Xeon E7 v2      CPU    Intel 22nm        155 W   672 GFLOPS      85 GB/s     4.3 GFLOPS/W
Intel Core i7-4770K   CPU    Intel 22nm        84 W    499 GFLOPS      26 GB/s     5.9 GFLOPS/W
IBM Power 7+          CPU    IBM 32nm          200 W   795 GFLOPS      100 GB/s    4.0 GFLOPS/W

In combination with simple compiler analysis, this special hardware abstracts this SIMD organization from the GPU programmer: each scalar GPU thread is free to follow a unique execution path, but at a performance penalty. This abstraction is known as the single-instruction, multiple-thread (SIMT) execution model. Chapter 2 describes a baseline mechanism similar to the ones used in current GPUs to support the SIMT execution model.
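The performance penalty of the SIMT abstraction can be illustrated with a small model. The following Python sketch is not the hardware mechanism described in Chapter 2; it only counts how many SIMD passes a single if/else costs when the threads of a warp disagree on the branch outcome. The function name `simt_if_else`, the warp size of 4, and the assumption that each branch path costs exactly one pass are all hypothetical simplifications.

```python
def simt_if_else(values, warp_size, predicate):
    """Model one if/else executed by warps of `warp_size` threads.

    Returns (passes_issued, simd_utilization). Assumption: each branch path
    costs exactly one SIMD pass, and a path with no active threads is skipped.
    """
    passes_issued = 0
    active_lanes = 0
    for start in range(0, len(values), warp_size):
        warp = values[start:start + warp_size]
        taken = sum(1 for v in warp if predicate(v))
        not_taken = len(warp) - taken
        # The two paths are serialized; each non-empty path is one pass.
        for lanes_on_path in (taken, not_taken):
            if lanes_on_path:
                passes_issued += 1
                active_lanes += lanes_on_path
    if passes_issued == 0:
        return 0, 0.0
    return passes_issued, active_lanes / (passes_issued * warp_size)


data = list(range(8))
uniform = simt_if_else(data, 4, lambda v: v < 4)         # each warp takes one path
divergent = simt_if_else(data, 4, lambda v: v % 2 == 0)  # every warp splits
print(uniform, divergent)
```

In this toy setting, the branch that splits every warp needs twice as many passes as the branch on which each warp agrees, and half of the SIMD lanes sit idle during each pass.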
By combining FGMT with wide SIMD hardware, GPUs can deliver higher computation throughput while consuming less energy per operation than traditional CMP systems.

Table 1.1 illustrates this efficiency boost by comparing the peak floating point (FP) operation throughput and power consumption of different state-of-the-art GPUs and CMPs in 2014. Even with an inferior IC fabrication process (28nm vs. 22nm), the two GPUs manage to deliver 3-4× more floating point operations per Watt (GFLOPS/W, a metric for computation efficiency) than the CMPs in this comparison. Table 1.1 also shows that GPU memory subsystems provide substantially higher bandwidth (3-10×) than traditional CMPs. Differences in peak performance aside, researchers have demonstrated that harnessing the computing power of GPUs in real-world non-graphics applications can lead to far more cost-effective computing solutions than solutions using traditional CMP systems comprised of only CPU cores [77, 95].

1.2 GPU Programming Challenges with Irregular Parallelism

Writing functionally correct software for GPUs is now relatively easy for applications with regular parallelism. These applications, such as dense matrix multiplication and simple image filtering, feature regular memory access patterns and control flow behavior among threads. While optimizing these applications to fully harness the computing power offered by GPUs is challenging, the regular parallelism in these applications maps naturally to the SIMD hardware in GPUs. Consequently, the initial GPU implementations of these applications tend to offer a reasonable boost in performance/efficiency over the original CPU version. This gives software developers an incentive to refactor their software to incorporate GPU-accelerated modules.

In contrast, software developers have far more difficulty employing GPUs for applications with irregular parallelism.
Software developers who attempt to use GPUs to exploit the irregular parallelism in these applications typically face the following risks:

Uncertain Performance/Efficiency Benefit: Despite having plenty of data-level parallelism (DLP), these applications have threads with data-dependent control flow behavior and irregular memory access patterns. A group of threads with data-dependent control flow may diverge into different execution paths. This behavior can cause underutilization of the SIMD hardware in GPUs, a performance issue in GPU computing known as branch divergence (see Chapter 2). Also, the GPU memory system expects adjacent threads to access data from the same memory block, so that it can service all the accesses in parallel by fetching the entire block at once via a wide data bus (this mechanism is known as coalesced memory access [122]). However, adjacent threads with irregular memory access patterns tend to access data in different memory blocks. Current GPUs fetch the entire memory block even though only a single word within the block is requested, so irregular memory access patterns tend to waste a significant portion of the GPU memory bandwidth. The performance penalty from these issues may counteract the throughput advantage of the GPU, causing the GPU-accelerated version of the application to be slower than the initial CPU-only version.

Unpredictable Development Effort: Besides introducing a performance penalty, a more important problem with these irregular memory accesses is the possibility of data-races. To guard against data-races without excessively serializing computation, the software developers have to use fine-grained locks, which are prone to deadlocks. GPUs are designed to execute thousands of threads concurrently. More threads, and correspondingly larger problem sizes, exacerbate the challenge of lock-based programming by increasing the amount of debug data analysis required to understand how a deadlock can manifest in the application [8].
Furthermore, implicit synchronization imposed by hardware that handles branch divergence in GPUs can cause some implementations of spin locks to deadlock [129] (see Chapter 4 for a more detailed discussion). The GPU software developer may need to implement ad hoc, error-prone data synchronization mechanisms to circumvent the use of locks [108]. Regardless of the solution employed, the developer will need to invest significant effort to verify the design and debug the implementation, just to eradicate every potential data-race and deadlock for every possible input. It is difficult for software developers to estimate the amount of effort required for this process, which makes it hard to fit into a time-driven development cycle.

Potentially Hard-to-Maintain Code: Even if the software developer is able to overcome the challenges above and successfully produces a GPU-accelerated version of the application, the code would likely be hard to comprehend for programmers who do not understand the intricacies of the GPU microarchitecture. These intricacies may have driven many design decisions in the software. Maintaining this GPU-accelerated version of the application requires a good understanding of these design rationales to avoid introducing the bugs and performance bottlenecks that the highly optimized code was designed to avoid in the first place. This adds a significant burden on the software maintainers, who need to revisit these design decisions every time they update the code for new features and bug fixes.

In fact, these risks exist for each attempt to further optimize the application, because each optimization may introduce new bugs and does not necessarily produce the expected benefit.

Many research studies have demonstrated that with sufficient programming effort, using current GPUs to accelerate applications with irregular parallelism can provide substantial performance/energy boosts [4, 23, 66, 108, 118].
Nevertheless, the risks of accelerating these applications with GPUs can appear too high to software developers for production software development. The high risks consequently limit the range of applications that may harness the computing power of GPUs.

1.3 Thesis Statement

This dissertation enhances GPU architectures to boost their ability to accelerate applications with irregular parallelism. Each of the two major enhancements introduced by this dissertation addresses a specific risk in the software development of these applications. By reducing the risks in developing GPU applications with irregular parallelism, these enhancements aim to make GPU acceleration practical for these applications in the future.

Thread block compaction (TBC), the first enhancement, is a microarchitecture innovation that robustly boosts GPU performance when the GPU executes control-flow-intensive applications that suffer from branch divergence. At each divergent branch, TBC temporarily rearranges scalar GPU threads from multiple warps (SIMD execution groups, see Section 2.3.2) into new warps. It compacts threads with the same branch outcome into the same set of new warps to boost their utilization of the SIMD hardware. Threads from these temporary warps are restored into their original warps once they have reached the point in the program where both divergent execution paths converge. This restoration ensures full SIMD efficiency in code regions that do not suffer from branch divergence, and preserves the highly regular memory access patterns in these regions. This property significantly reduces the chance of TBC slowing down an application that does not suffer from branch divergence, a pathological behavior that plagues dynamic warp formation (DWF), the original work that introduced the concept of rearranging threads into new warps to boost SIMD efficiency [56]. The robustly boosted performance from TBC increases the likelihood that GPU acceleration can speed up the application.
This in turn allows the software developers to tolerate more uncertainty in the benefit from GPU acceleration during development.

Kilo TM, the second major enhancement, is a novel, scalable, and energy-efficient hardware proposal for supporting transactional memory (TM) on GPUs. With TM, programmers can replace lock-protected critical sections with atomic code regions, called transactions [75]. The underlying TM system optimistically executes transactions in parallel for performance and automatically resolves data-races (conflicts) between concurrently executed transactions in a deadlock-free manner. GPU software developers can use transactions instead of fine-grained locks to aggressively exploit the irregular parallelism in their applications. Freed from concerns about data-races and deadlocks, the developers can verify the aggressively parallelized implementations with significantly less effort. This makes the development effort using transactions more predictable than using locks or other ad hoc data synchronization solutions. More importantly, transactions decouple the functional behavior of the code (operations performed by each transaction) from its performance (how the transactions are executed). This decoupling makes code written with transactions easier to maintain than critical sections protected by fine-grained locks.

Kilo TM aims to support thousands of small, concurrent transactions. This design goal differs from the goal of most existing TM system proposals, which is to support tens of large, unbounded transactions on traditional CMP systems. The unique properties of the GPU memory system and the need to scale to thousands of concurrent transactions have driven Kilo TM to deviate from the traditional cache-based TM system proposals.
In particular, Kilo TM cannot rely on invalidation messages in cache coherence protocols to detect conflicts between transactions. Instead, it uses innovative alternatives to avoid exhaustive pair-wise conflict detection between all running transactions. The insights gained from exploring these alternatives in this work should transfer to other data synchronization mechanisms beyond transactional memory.

1.4 Contributions

This dissertation makes the following contributions.

1. It identifies challenges faced by dynamic warp formation running CUDA applications, and devises an improved scheduling policy to address these challenges (Section 3.2).

2. It presents thread block compaction (TBC) [54], a robust version of dynamic warp formation (DWF) [56]. TBC provides better performance than DWF with lower design complexity (Section 3.3).

3. It extends immediate post-dominator based reconvergence, a state-of-the-art mechanism for handling branch divergence, with likely-convergence points [54] (Section 3.4).

4. It evaluates the performance benefit of TBC. Together with likely-convergence points, TBC provides an average speedup of 22% over a baseline per-warp, stack-based reconvergence mechanism, and 17% versus DWF on a set of GPU applications that suffer significantly from branch divergence (Section 3.6).

5. It proposes the use of hardware transactional memory (HTM) for GPU computing (Chapter 4). In particular, it estimates the performance potential of transactional memory on a set of GPU computing workloads that employ transactions via a limit study with an ideal TM system (Section 4.1). This ideal TM assumes zero overhead for detecting conflicts among concurrent transactions and maintaining the atomicity of each transaction. On average, the GPU workloads running on this ideal TM achieve 279× speedup over serializing all transactions via a single global lock, and they perform comparably to fine-grained locking.

6. It highlights a set of challenges with employing existing HTM system proposals on GPUs (Section 4.1.1).

7. It proposes Kilo TM, a novel, scalable TM system that is designed specifically for GPUs [57] (Section 4.2). The design combines aspects of value-based conflict detection [37, 124], RingSTM [149], and Scalable TCC [27] (Transactional Coherence and Consistency) to support 1000s of concurrent transactions without requiring a cache coherency protocol. Kilo TM detects conflicts at word-level granularity and employs various mechanisms to increase transaction commit parallelism.

8. It extends the SIMT [101] hardware to handle control flow divergence due to transaction aborts (Section 4.2.1).

9. It introduces the recency bloom filter, which incorporates a notion of time and supports implicit, multi-item removal (Section 4.2.5). Kilo TM deploys one small (5kB) recency bloom filter in each GPU memory partition to boost transaction commit parallelism significantly. GPU architectures evaluated in this work have six to twelve memory partitions.

10. It shows that a simple extension of the GPU hardware thread scheduler to control transaction concurrency benefits high-contention workloads (Section 4.2.6).

11. It devises a theoretical framework to prove that Kilo TM satisfies weak isolation [16, 73] (Chapter 5).

12. It estimates the energy overhead of Kilo TM on a contemporary GPU design (Section 6.1 and Section 6.7).

13. It enhances the performance and energy efficiency of Kilo TM with Warp-Level Transaction Management (WarpTM) (Section 6.2). WarpTM enhances Kilo TM to leverage the thread hierarchy in GPU programming models to amortize the control overhead of Kilo TM and boost the utilization of the GPU memory system.

14. It proposes and evaluates two intra-warp conflict resolution schemes to resolve conflicts within a warp (Section 6.3). A low-overhead intra-warp conflict resolution mechanism is crucial in maintaining the benefit from WarpTM.

15. It accelerates the execution of read-only transactions in Kilo TM with Temporal Conflict Detection (TCD) (Section 6.4). TCD is a low-overhead mechanism that uses a set of globally synchronized on-chip timers to detect conflicts for read-only transactions. Once initialized, each of these on-chip timers runs locally in its microarchitecture module and does not communicate with other timers. TCD uses timestamps captured from these timers to infer the order of the memory reads of a transaction with respect to updates from other transactions.

16. It evaluates the combined benefits of WarpTM and TCD and shows that they complement each other (Section 6.7). The two enhancements together improve the overall performance of Kilo TM by 65% while reducing the energy consumption by 34%. Kilo TM with the two enhancements achieves 192× speedup over coarse-grained locking, and captures 66% of the performance of fine-grained locking with 34% energy overhead. More importantly, the enhancements allow applications with small, rarely-conflicting transactions to perform as well as or better than their fine-grained lock versions. This suggests that GPU applications using transactions can be incrementally optimized to reduce memory footprint and transaction conflicts to take advantage of this. Meanwhile, the transaction semantics can maintain correctness at every step, providing a low-risk environment for exploring optimizations.

1.5 Organization

The rest of this dissertation is organized as follows:

• Chapter 2 discusses background on contemporary GPU architectures and defines the baseline GPU architecture used throughout this work.
It also provides background information on transactional memory (TM).

• Chapter 3 presents thread block compaction (TBC), a microarchitecture innovation that robustly boosts GPU performance for applications that suffer from branch divergence.

• Chapter 4 highlights the difficulties in fine-grained locking and proposes Kilo TM, the first hardware proposal to support TM on GPU-like accelerators.

• Chapter 5 discusses the correctness of Kilo TM in providing the proper transaction semantic guarantees.

• Chapter 6 compares the performance and energy efficiency of Kilo TM with fine-grained locking and proposes two mechanisms, warp-level transaction management and temporal conflict detection, to boost the efficiency of Kilo TM.

• Chapter 7 discusses related work.

• Finally, Chapter 8 concludes this dissertation and discusses directions for future work for both TBC and Kilo TM.

Chapter 2
Background

This chapter provides the background for the rest of this dissertation. Section 2.1 briefly explains a set of fundamental concepts in parallel computing that are used throughout this dissertation, such as data-level parallelism, irregular parallelism, single-instruction, multiple-data, and data synchronization. Section 2.2 summarizes the background on transactional memory that is relevant to this work. Section 2.3 describes a modern GPU architecture model that serves as the baseline GPU throughout this work. Section 2.4 describes dynamic warp formation, a prior work aiming to reduce the GPU performance penalty due to branch divergence. Thread block compaction (introduced in Chapter 3) revises dynamic warp formation, fixing its pathological behaviors with a simple and robust mechanism.

2.1 Fundamental Concepts

This section explains a set of fundamental concepts in parallel computing that are used throughout this dissertation.
While many of these concepts are common knowledge in computer architecture, the explanations in this section aim to clarify them in the context of GPU computing.

2.1.1 Data-Level Parallelism

An application contains data-level parallelism (DLP) if it contains computations that can operate on many data items at the same time [74]. One simple example of such an application is vector addition, in which each element from one of the source vectors is added to its corresponding element in the other source vector to produce an element in the destination vector. Since the sum for each element in the destination vector does not depend on the sum for another element in the same vector, the sums for multiple elements can be computed in parallel. DLP is commonly found in many scientific computations, multimedia applications and signal processing applications (usually in the form of matrix/vector operations).

Data-level parallelism is commonly discussed in contrast to task-level parallelism, which arises from multiple tasks in an application that can usually operate in parallel [74]. For example, a word processor application may render the user input onto the screen while simultaneously checking the spelling and grammar of the input text. Data-level and task-level parallelism in an application can be distinguished by how the parallelism scales. The amount of data-level parallelism in an application scales with the amount of data to be processed, whereas task-level parallelism increases as more functionality is added to the software.

This work further classifies data-level parallelism into regular parallelism and irregular parallelism.

2.1.2 Regular Parallelism vs. Irregular Parallelism

An application with regular parallelism processes a large pool of data in a regular fashion. In this type of application, the task for processing each data item performs almost identical computation and features a regular, predictable memory access pattern.
Classic examples of such an application include dense matrix multiplication and simple image filtering. In both examples, the control flow behavior and the memory access pattern of the computation for each data item are largely independent of the value of the processed item. This decoupling between the processed data and the nature of the computation leads to regular, predictable behavior that can be exploited for optimizations. For example, the programmer can orchestrate the computations so that their memory accesses exhibit spatial and temporal locality, with high certainty that the optimization will yield a speedup for most common inputs. In particular, the programmer may arrange the dense matrix multiplication in tiles, so that computations that read from the same set of input data elements run concurrently to maximize cache hits. Also, the regularity of the tasks reduces the possibility of load imbalance at barriers. This permits an application with regular parallelism to use a single barrier, instead of fine-grained locks, to efficiently synchronize between sets of dependent tasks for communication.

Modern GPUs are designed to exploit regular parallelism with single-instruction, multiple-data (SIMD) hardware to amortize the instruction management overhead and to boost memory system utilization. Consequently, the initial GPU implementations of applications with regular parallelism tend to offer a reasonable boost in performance/efficiency over the original CPU version. The stability of these applications' performance across different inputs encourages the software developers to further optimize the code to fully harness the computing power offered by GPUs.

Despite having plenty of data-level parallelism (DLP), an application with irregular parallelism processes a large pool of data concurrently in an irregular fashion.
Depending on the data being processed, each task may need to perform a variable amount of computation and may feature an irregular memory access pattern that is dictated by the input data. One example of such an application is ray tracing, a 3D graphics rendering algorithm commonly used in film production. In ray tracing, each task traces the light path from a pixel on the image plane (a ray) to one or more light sources in the rendered scene. The ray usually reaches a light source via a series of reflections and refractions with objects in the rendered scene, and each ray traverses the scene differently, requiring a different amount of computation [4]. The irregularity among tasks makes it difficult to apply optimizations that work well with regular parallelism. For instance, the data-dependent control flow behavior in each task can cause load imbalance (leading to a performance penalty for using barriers) and branch divergence (leading to underutilization of the SIMD hardware in GPUs). The irregular memory access pattern also makes it difficult for a programmer to group computations that exhibit temporal and/or spatial locality. Moreover, some applications require the programmer to use locks to protect shared data structures that can be modified by one or more concurrent tasks. The programmer has to verify that for every possible input to the application, the irregular memory access patterns among these concurrent tasks would never result in a deadlock. Overall, the irregularity among tasks poses significant challenges for software developers who parallelize and fine-tune these applications.

In many applications, the irregularity of the concurrent tasks comes from the use of more work-efficient algorithms [72]. Parallel algorithms often perform more operations than their sequential counterparts, and algorithms that reduce this overhead often do so at a cost of increased irregularity.
By removing wasted work, these algorithms scale significantly better in complexity than their more regular, but less efficient counterparts for the large working sets found in realistic workloads. One example is sparse matrix operations, which operate directly on sparse matrices stored in a compressed format recording only the location and value of the non-zero entries in the matrices. Sparse matrix operations can leverage the sparsity of the matrices to remove wasted computations for zero entries, making these operations significantly more efficient than their dense matrix analogs [13, 164]. Even though algorithms with irregular parallelism seem ill-suited for GPU acceleration, the efficiency of these algorithms may compensate for their underutilization of the GPU hardware. Many research studies have demonstrated that with sufficient programming effort, using current GPUs to accelerate applications with irregular parallelism can provide a substantial performance/energy boost versus CPU-only solutions [4, 13, 23, 66, 108, 118].

2.1.3 Thread-Level Parallelism

Thread-level parallelism (TLP) refers to the parallelism that is expressed explicitly by the software developers as threads in an application. These threads can run concurrently in the system, and each thread usually progresses independently through its own execution path through the program. Traditional chip-multiprocessors (CMPs) can harness TLP directly by executing each thread on a different processing core. This distinguishes TLP from instruction-level parallelism (ILP), which is the parallelism from independent operations within a single thread [74].

In this work, each thread has its own architectural state, which usually consists of a program counter (PC) and a set of registers. Threads in the same application share the same global memory space, such that when one thread updates the value at a memory location, the updated value will be visible to other threads.
On some occasions, one thread may update a shared data structure in memory that is simultaneously used by other threads. These occasions are known as data-races. Unexpected data-races in a multithreaded application can cause the application to behave erratically. Programmers may guard against unexpected data-races through the use of data synchronization mechanisms, such as locks, barriers and transactional memory. Section 2.1.4 gives a brief overview of the different data synchronization mechanisms used in existing CMPs and GPUs, and Section 2.2 explains the basics of transactional memory in more detail.

GPU compute applications use TLP to express their inherent task-level and data-level parallelism to the GPU hardware. Section 2.3 explains how GPU hardware uses a combination of SIMD hardware (Section 2.1.5) and fine-grained multi-threading (Section 2.1.6) to harness the TLP in an efficient way.

2.1.4 Data Synchronization

In multithreaded programming, a data-race occurs when one thread updates a shared data structure in memory that is currently used by another thread. An unexpected data-race can cause a multithreaded application to behave erroneously in two ways. First, two threads may clobber each other's update to the shared data structure, corrupting values in the structure. Second, one thread may read in only part of the update from the other thread, corrupting its view of the shared data structure and the thread's behavior. Programmers can use data synchronization to avoid data-races in their applications. This section gives an overview of the data synchronization mechanisms available on modern GPUs. These mechanisms are mentioned throughout this dissertation.

Atomic Operations

Atomic operations are mechanisms provided by modern processor architectures that are capable of updating a memory location atomically with a new value that is computed based on the old value [141].
They are also referred to as read-modify-write operations, because each atomic operation involves reading the old value from the memory location, modifying it into a new value, and writing the new value back to memory. These operations differ from normal memory load/store operations: special microarchitecture mechanisms are added to ensure that all three steps are performed atomically with respect to every thread in the system. Most existing implementations only support atomic operations for a single 32-bit/64-bit word in memory.

Atomic operations serve as the basic primitives for constructing more complex data synchronization mechanisms such as locks, barriers, nonblocking data structures and software transactional memory. Among the different types of atomic operations, this work focuses on atomic compare-and-swap (CAS) [82], as it is used in the implementations of many synchronization mechanisms. There are three inputs to an atomic CAS: the memory location to be updated, the expected old value at the location, and a new value. Atomic CAS will only write the new value into the memory location if the current value at the location matches the expected old value. A successful swap returns the old value to the thread, whereas a failed swap returns the value currently at the location, which differs from the expected old value and thereby indicates the failure. Software can use atomic CAS to emulate an arbitrary single-word read-modify-write operation [141]:

1. Read the original value from memory.

2. Compute the new value from this original value.

3. Use atomic CAS to write the new value back to the memory location atomically if the value in memory has not been modified by another thread during the computation.

4. If atomic CAS detects a value different from the original value, repeat steps 1 to 3.

Notice that this emulation requires the computation of the new value from the original value (Step 2) to be idempotent [40], so that it can be repeated upon a CAS failure without cumulative effects.

Modern CPUs usually implement atomic operations via a cache coherence protocol. A CPU can service an atomic operation by instructing its private cache to maintain exclusive access to the cache block containing the input memory location for the entire duration of the atomic operation [34]. Modern GPUs do not have thread-private caches, nor are the per-core caches coherent. Instead, modern GPUs implement atomic operations with a set of execution units located in the access path to the shared last-level cache. Adopted from the raster operation units used for depth testing and pixel blending in 3D graphics [22], these units operate directly on the data stored in the last-level cache, and intercept any conflicting load/store accesses to these locations during the computation. Similar implementations are also employed in Rigel [87] and IBM's Blue Gene/Q [71].

Barriers

A barrier provides global synchronization among a set of threads: none get past the barrier until every thread in the set has arrived [34]. The programmer can use barriers to arrange computations into multiple parallel phases, a programming paradigm known as bulk-synchronous programming [160]. In each phase, threads execute a set of independent tasks concurrently to avoid any potential data-race. The barrier between each phase ensures that all tasks in the previous phase are finished before proceeding to the next phase.
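The four-step compare-and-swap retry loop described under Atomic Operations can be sketched in Python as follows. This is only a model under stated assumptions: `AtomicWord` is a hypothetical stand-in for one memory word, and its internal lock merely imitates the atomicity that real hardware provides in a single instruction. The `compare_and_swap` method returns the value it observed, so a return value equal to the expected one signals success.

```python
import threading


class AtomicWord:
    """Hypothetical stand-in for one memory word that supports CAS."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()  # imitates hardware atomicity only

    def load(self):
        with self._guard:
            return self._value

    def compare_and_swap(self, expected, new):
        """Write `new` only if the word still holds `expected`; return the value seen."""
        with self._guard:
            seen = self._value
            if seen == expected:
                self._value = new
            return seen


def atomic_update(word, compute):
    """Emulate an arbitrary single-word read-modify-write with a CAS retry loop."""
    while True:
        old = word.load()                            # step 1: read the original value
        new = compute(old)                           # step 2: must be safe to repeat
        if word.compare_and_swap(old, new) == old:   # step 3: publish if unchanged
            return old
        # step 4: another thread changed the word; retry from step 1


def run_counter_demo(num_threads=4, increments=500):
    counter = AtomicWord(0)

    def worker():
        for _ in range(increments):
            atomic_update(counter, lambda v: v + 1)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.load()


print(run_counter_demo())
```

Because a failed CAS discards the computed value and recomputes it from a fresh read, no increment is lost even when several threads race on the same word.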
It also waits until all memory stores from the previous phase have been committed to the global memory, so that tasks in the next phase can then use the updated data from the prior phases.

Despite the simplicity of bulk-synchronous programming, load imbalance among threads can cause some threads to have long idle time at the barrier, waiting for other threads to catch up. This makes barriers more effective for applications with regular parallelism, in which threads are likely to complete their tasks at around the same time.

CMP systems usually implement barriers via the use of atomic operations. The simplest implementation involves each thread atomically incrementing a single counter, and then spinning until the counter equals the number of threads participating in the barrier.

Modern GPUs provide hardware barrier instructions that synchronize threads within a single thread block (a unit in the GPU thread hierarchy explained in Section 2.3). Instead of using a shared counter in memory, the baseline GPU architecture in this work implements this barrier instruction by extending the per-core register scoreboard and the warp scheduler. Without going through the memory system, the hardware barrier can release its threads within a few cycles after the final thread has arrived. This per-block, low-overhead barrier is commonly used in GPU programming to synchronize accesses to a per-core on-chip scratchpad memory. GPU programmers can implement a global barrier by combining this hardware per-block barrier with atomic operations [166].

Locks

A lock, or mutex, is a software data object that allows multiple threads to negotiate mutual exclusion to one or more shared data objects in memory. A thread must acquire the lock associated with the shared data objects before accessing the shared data. Each lock may only be acquired by at most one thread at any time, so the thread holding the lock has exclusive access to the shared data.
The thread should release the lock as soon as it has finished accessing the shared data to allow other threads to access the shared data. Since the shared data is only accessed by one thread at a time, no data-race can occur. If a thread tries to acquire a lock that has been acquired by another thread, the acquisition fails. A thread that has failed to acquire a lock may keep retrying, implementing a blocking lock, or it may fall back to an alternative routine to ensure the rest of the system can make forward progress.

If there are multiple locks in the system, a deadlock can occur when there is a cyclical dependency among a set of threads. For example, both thread A and thread B need to acquire both lock X and Y to progress, but each thread is holding one of the two locks and cannot release its own lock unless it has acquired the lock held by the other thread. In this case, both threads are blocked indefinitely, forming a deadlock. One way to avoid deadlock is to ensure that every thread in the system acquires locks in a globally defined order. For example, a thread may only acquire lock X after acquiring lock Y, but not vice versa. This global lock order eliminates any potential cyclical dependency among threads, so that it is impossible for deadlock to form.

The simplest way to use locks in a software system is to associate all shared data in the entire system with one (or a few) lock(s). The approach, known as coarse-grained locking, trades multithreading performance for a simple design. With only a few locks in the system, the programmer can more easily verify that the system is free of deadlocks through exhaustively checking that the locks are always acquired in a globally defined order. However, coarse-grained locking can significantly reduce the amount of TLP in the system. By protecting a large amount of shared data with a single lock, coarse-grained locking can unnecessarily serialize threads that access disjoint parts of the protected shared data.
Also, multiple threads that only read from the shared data may execute concurrently without causing any data-race, but are serialized at the lock acquisition.

One way to boost the TLP in the system is to employ more locks in the system, with each lock guarding a smaller portion of the shared data. We use this form of fine-grained locking to develop the lock version of the GPU-TM applications used in Chapter 4 and Chapter 6. It is also possible to use readers-writer locks to allow multiple threads to read the shared data concurrently while providing mutual exclusion to threads that modify the data [33, 141]. Despite their performance benefits, fine-grained locking and readers-writer locks make a system significantly harder to verify. Fine-grained locking is more prone to deadlocks than coarse-grained locking, and readers-writer locks are prone to starvation of the writer thread. The difficulty in using these more advanced locking mechanisms limits them to expert programmers.

GPU programmers may use a combination of atomic operations and memory fences to implement locks. However, implicit synchronization imposed by the SIMT execution model can introduce deadlocks to unsuspecting programmers. Chapter 4 summarizes a way to circumvent this issue that has been presented by Arun Ramamurthy [129].

2.1.5 Single-Instruction, Multiple-Data (SIMD)

Flynn’s taxonomy first introduced SIMD as a class of parallel computing systems akin to vector processors specialized in scientific computing [53]. These computing systems are designed for applications that repeat sets of identical operations across a large array of data, such as vector multiplication. They harness the DLP in each of these operations efficiently via an instruction set architecture (ISA) that operates on vectors of data.
Since each instruction performs the identical operation across every element in a vector, the hardware can perform the computation for multiple elements in parallel with multiple identical execution units, all sharing common control hardware. This sharing significantly reduces the hardware cost required to scale up the parallelism in a SIMD computing system. It saves the instruction bandwidth and control logic otherwise required for the extra processors to harness the same amount of DLP. This efficiency boost comes with the restriction that the execution units can only run in lockstep – they all execute the same instruction and advance to the next instruction simultaneously. This restriction distinguishes SIMD computing systems from multiple-instruction, multiple-data (MIMD) computing systems, another class of parallel computing system in Flynn’s taxonomy that poses no lockstep restriction on the execution units [53]. Under this definition, most multiprocessor architectures are MIMD computing systems.

Most contemporary CPU architectures contain SIMD execution units that are accessed via short-vector SIMD ISA extensions: Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX) for x86 architectures [80, 154], AltiVec for POWER architectures [78], and NEON for ARM architectures [7]. These SIMD ISA extensions operate on short vector registers (up to 256-bit). Each register may hold a varying number of data elements depending on the size of each element. For example, a 256-bit short vector register can hold 32 8-bit elements or 8 32-bit elements. The length of the short vector register is fully exposed to the software, and the software is responsible for splitting operations on long vectors into multiple loop iterations. Until recently, these SIMD ISA extensions only featured vector load/store operations to contiguous memory blocks. This limits the use of these SIMD extensions to expert programmers.
Nevertheless, with the proper software support, these SIMD execution units can boost the computation throughput of a CPU core by 3 to 4× [138].

Modern GPUs employ wide SIMD hardware (1024-bit to 2048-bit data paths) to exploit the DLP in 3D graphics rendering. Unlike SIMD ISA extensions in CPU architectures, the width of the data path is not directly exposed to the software. Instead, non-graphics GPU APIs, such as CUDA and OpenCL, feature a MIMD-like programming model that allows the programmer to launch a large array of scalar threads onto the GPU. At runtime, the GPU hardware groups scalar threads into SIMD execution groups and runs these SIMD execution groups on SIMD hardware. This execution model, called single-instruction, multiple-thread (SIMT), is explained in more detail in Section 2.3.

2.1.6 Fine-Grained Multithreading

Fine-grained multithreading (FGMT) is a microarchitecture mechanism that exploits thread-level parallelism (TLP) to increase hardware utilization. In FGMT, multiple threads interleave their execution on a FGMT processor. Unlike the traditional multitasking provided by an operating system to time-share a uniprocessor CPU core, switching from one concurrent thread to another on the FGMT processor requires neither flushing the pipeline nor offloading architectural states to the main memory. Instead, the FGMT processor has a shared register file that stores the architectural states (program counters and registers) of a pool of concurrent threads sharing the processor. Every cycle, a hardware scheduler selects and issues an instruction from the pool of threads that are not stalled by any data-dependency or resource hazards. Instructions from different threads may coexist in different pipeline stages and execution units inside the FGMT processor.

While FGMT can boost hardware utilization in a processor, it also requires a significantly larger register file than the ones in a single-threaded processor.
Having a larger register file can increase the cycle-time and can introduce extra operand access energy. This overhead penalizes single-threaded applications, so FGMT has been limited to architectures that optimize for applications with plenty of TLP: CDC 6600 [156], Heterogeneous Element Processor [83], the Horizon architecture [155] and Sun Microsystems’ Niagara processor [93, 152].

Similar to FGMT, simultaneous multithreading (SMT) is a microarchitecture mechanism that exploits TLP to boost hardware utilization of an out-of-order, superscalar processor [158]. While FGMT features hardware thread schedulers near the front-end of the processor, SMT leverages the existing superscalar instruction scheduler in the out-of-order processor to execute multiple threads concurrently. The superscalar instruction scheduler is designed to harness instruction-level parallelism (ILP) within a single thread and has the capability to simultaneously issue multiple instructions to multiple functional units. SMT leverages this multi-issue capability to allow instructions from different threads to be issued to different execution units in a single cycle.

Finally, GPU applications are designed to expose an abundant amount of TLP to hardware. Modern GPUs use the plentiful TLP (thousands of concurrent threads) via FGMT to tolerate pipeline latency within a core as well as memory access latency outside the core. By tolerating the latency of individual threads, GPU designs can focus their hardware resources on improving the overall throughput across all the threads. Section 2.3.3 describes how FGMT is implemented in our baseline GPU architecture.

2.2 Transactional Memory

Transactional memory (TM) is a parallel programming model proposed by Herlihy and Moss as a deadlock-free alternative to lock-based parallel programming [73, 75].
TM has recently seen a large growth in interest, and as a result, various commercial CMP systems have incorporated hardware support for TM [71, 80, 81]. Inspired by transactions in database systems, TM simplifies software development for parallel architectures by providing the programmer with the illusion that a block of code, called a transaction, executes atomically. With TM, the programmer does not need to write code with locks to ensure mutual exclusion. The underlying TM system optimistically executes transactions in parallel for performance and automatically resolves data-races (conflicts) between concurrently executed transactions in a deadlock-free manner.

Despite having a different programming syntax, TM has been shown to be less error-prone than locks [133], and TM code is easier to understand than lock code [126]. With a well-implemented TM system, transactions can approach, or even surpass, the performance and parallelism only possible with fine-grained locking. Software developers can use transactions instead of fine-grained locks to aggressively parallelize their applications. Freed from the concerns for data-races and deadlocks, the developers can verify the aggressively parallelized software with significantly less effort. TM also allows the development of composable concurrent data structures [75], which further boosts programmer productivity. Overall, TM holds the promise of enabling programmers to deliver working, maintainable parallel software with reasonable development effort.

Example 1 Example of two conflicting transactions. r is a private variable in both transactions.

    Transaction A              Transaction B
    1: atomic {                1: atomic {
    2:   int r = H * 10;       2:   int r = K * 10;
    3:   K = K + r;            3:   H = H + r;
    4: }                       4: }

Example 1 shows two conflicting transactions. Transaction A reads from shared variable H, multiplies the value by 10, and accumulates the product into shared variable K. The set of memory locations read by transaction A, or its read-set, consists of both H and K, and the set of memory locations updated by transaction A, or its write-set, consists of only K. Transaction B performs the same operation, but with the roles of H and K reversed. When both transaction A and B execute concurrently, a conflict can occur if transaction A updates K after transaction B has read from it, but before transaction B can update H with the product computed from the original value of K. A TM system detects this conflict dynamically by monitoring memory accesses from both transactions and tracking their read-sets and write-sets (assuming eager conflict detection and resolution, see Section 2.2.2). If transaction A updates K, the TM system detects that K has been read by transaction B earlier, making K part of transaction B’s read-set. Since transaction B has not committed yet, this is a conflict. The TM system can then resolve this conflict by aborting transaction B, forcing it to re-execute from the beginning with the updated value of K from transaction A. This effectively serializes the two transactions to resolve the potential data-race between them.

Example 2 Example of a transaction (left) and its lock-based analog (right).

    1: atomic {                        1: Lock(X[hash(tid)]);
    2:   if (X[hash(tid)] > 100) {     2: Lock(Y);
    3:     Y += X[hash(tid)];          3: if (X[hash(tid)] > 100) {
    4:     X[hash(tid)] = 0;           4:   Y += X[hash(tid)];
    5:   }                             5:   X[hash(tid)] = 0;
    6: }                               6: }
                                       7: Unlock(X[hash(tid)]);
                                       8: Unlock(Y);

Example 2 compares a simple transaction to its lock-based analog. The transaction first reads an element from a shared array X according to a hash generated from its thread ID (hash(tid)). If the element is greater than 100, it adds the value to another shared variable Y, and resets the element to 0. The code within the transaction is enclosed in an atomic block.
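The lock-based side of Example 2 can be made concrete in C++ roughly as follows. This is only a sketch under stated assumptions: the array size `kSize`, the function names `hash_tid` and `drain_if_large`, and the particular hash are all hypothetical, since the example leaves them unspecified.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <mutex>

constexpr std::size_t kSize = 8;          // assumed size of the shared array
std::array<int, kSize> X{};               // shared array X from Example 2
int Y = 0;                                // shared variable Y from Example 2
std::array<std::mutex, kSize> x_locks;    // one lock per element of X
std::mutex y_lock;                        // lock guarding Y

// Placeholder hash; the original example does not define hash().
std::size_t hash_tid(std::size_t tid) { return (tid * 31u) % kSize; }

void drain_if_large(std::size_t tid) {
    const std::size_t i = hash_tid(tid);
    // Both locks are taken up front, element lock first and Y lock second,
    // so every thread follows the same acquisition order.
    std::lock_guard<std::mutex> guard_x(x_locks[i]);
    std::lock_guard<std::mutex> guard_y(y_lock);
    if (X[i] > 100) {
        Y += X[i];
        X[i] = 0;
    }
}   // both locks are released here, when the guards go out of scope
```

Every caller takes `y_lock`, whether or not it ends up modifying `Y`.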
The lock-based analog of this transaction acquires two locks, one for the element in the shared array (X[hash(tid)]) and one for the shared variable Y.

The lock-based code proactively serializes all threads at the acquisition of the lock for Y. Assume that, in the common case, the element from X is smaller than 100, so that the element and shared variable Y are rarely modified. This serialization is then usually unnecessary. The programmer can attempt to use this assumption to enable more TLP by only conditionally acquiring the lock for Y (i.e., moving line 2 to after line 3). However, the program still needs to acquire the lock for the element from X in advance, serializing all the threads that just want to read from the element. A well-implemented TM system can exploit this common case automatically to boost TLP. Also, in the absence of a writer, this TM system would permit all threads that are just reading from the same element in X to execute concurrently. This illustrates how TM can expose more TLP in an application with little effort from the software developer.

Many processor vendors have started incorporating hardware support for TM in their processors, implementing hardware transactional memory (HTM). To date, Intel Haswell [80], IBM’s Blue Gene/Q [71] and System Z [81] have hardware support for transactional memory. Legacy CMP systems without hardware support for TM can use TM systems that are implemented as software runtime systems and compilers [73]. These software transactional memory (STM) systems rely on code transformation and/or binary translation to insert software routines for transaction management and conflict detection into the TM applications. STM can add some performance overhead to the application. In systems with only a few cores, applications parallelized with STM may run slower than their non-parallelized version.

The rest of this section is divided into two parts. The first part introduces several key correctness criteria for a TM system implementation.
These correctness criteria impact the design of Kilo TM presented in Chapter 4, and the discussion in Chapter 5 uses the definition of these concepts to show that Kilo TM has correctly implemented a TM system. The second part discusses some of the design issues involved in implementing a TM system.

2.2.1 Correctness Criteria

This subsection discusses several key criteria that a TM system should meet to execute transactions correctly. The ACID properties define the key attributes that the programming model expects from a transaction [73, 163]. Serializability defines how a set of committed transactions may update the memory system [73, 75, 163]. Opacity ensures that doomed transactions, whose memory contents have been clobbered by conflicting transactions, will not corrupt the rest of the system [68].

ACID Properties

A transaction is a set of indivisible operations. When committed, these operations all appear to complete instantaneously in the system. Traditional database systems guarantee four properties for each transaction – atomicity, consistency, isolation, and durability [73, 163]. Together, they are known as the ACID properties.

Atomicity: From the application’s point of view, a transaction is either executed completely, or it has never been invoked at all. A transaction that has failed to commit, due to a conflict or other error, should be aborted. The TM system should ensure that the memory modifications from this transaction do not propagate to the rest of the system. It may either restore the modified locations to their original data, or confine the modifications by buffering them. For a successfully committed transaction, the TM system should guarantee that all of its buffered modifications appear to propagate atomically to the rest of the system.

Consistency: A transaction should preserve the consistency constraints that are defined by the application. Usually, this refers to a set of invariants among multiple data elements (e.g., different fields in a data structure entry).
The application developers should ensure that each transaction is consistent on its own – leading from one consistent memory state to another. A transaction that fails to satisfy this requirement should be aborted to avoid clobbering data in memory.

Isolation: Each transaction should execute as if it were executed alone in the system. Concurrently running transactions should not observe the presence of each other. During execution, a transaction should see a memory state with all transactions either committed completely, or not invoked at all. In particular, it should not observe any partial commit from another transaction. Isolation abstracts the concurrency of the system away from the application developer. One way to ensure isolation is to show that the concurrent executions permitted by the TM system are serializable.

Durability: The effects of the operations performed by a committed transaction are persistent in the system and available to all subsequent transactions. This property is important for databases, which store data on persistent storage like hard disks, but is less relevant to transactional memory.

The ACID properties form a programming interface, a contract between the programmer and the underlying TM/database system. A correctly implemented TM system should satisfy all of the ACID properties.

Serializability

Even though a TM system may execute transactions concurrently, it must ensure that the concurrent execution is serializable – equivalent to executing the same set of transactions in a serial order. In a serializable order, every transaction behaves as if it is executed serially one after another. By showing that all transactions committed by a TM system satisfy serializability, one can prove that the TM system enforces atomicity and isolation.

Notice that serializability only requires the concurrent execution of transactions to have at least one equivalent serial order. It does not restrict which serial order should be matched by the concurrent execution.
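One way to search for an equivalent serial order is to record ordering constraints between committed transactions as a directed graph and check that the graph has no cycle. The sketch below is an illustration under that assumption, not the method used by any particular TM system: `serial_order` is a hypothetical helper, transactions are numbered 0 to `num_txns - 1`, and each pair `(before, after)` states that `after` accessed a location previously written by `before`.

```cpp
#include <cassert>
#include <queue>
#include <utility>
#include <vector>

// Returns one serial order that respects every (before, after) constraint,
// or an empty vector if the constraints form a cycle, in which case no
// equivalent serial order exists.
std::vector<int> serial_order(int num_txns,
                              const std::vector<std::pair<int, int>>& edges) {
    std::vector<std::vector<int>> successors(num_txns);
    std::vector<int> pending_preds(num_txns, 0);
    for (const auto& e : edges) {
        successors[e.first].push_back(e.second);
        ++pending_preds[e.second];
    }

    // Start from transactions that depend on no other transaction.
    std::queue<int> ready;
    for (int t = 0; t < num_txns; ++t)
        if (pending_preds[t] == 0) ready.push(t);

    std::vector<int> order;
    while (!ready.empty()) {
        int t = ready.front();
        ready.pop();
        order.push_back(t);
        // A successor becomes ready once all of its predecessors are placed.
        for (int s : successors[t])
            if (--pending_preds[s] == 0) ready.push(s);
    }

    // Transactions on a cycle never become ready, so they are missing here.
    if (static_cast<int>(order.size()) != num_txns) order.clear();
    return order;
}
```

When several transactions are ready at once, any of them may be placed next, which reflects the point above that more than one serial order may match the same concurrent execution.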
The TM system is free to schedule the execution order of transactions for performance, as long as the policy does not lead to starvation [18].

There are many different variants of the definition of serializability. This work focuses on conflict serializability [163]. This definition of serializability is based on the conflict relations between all committed transactions. A conflict relation between two transactions (A ← B) occurs when transaction A reads from or writes to a memory location that has been written by the other transaction B. One can construct a conflict graph for a set of committed transactions from their conflict relations, with each conflict relation forming a directed edge between the two transactions defining this relation. If this graph is acyclic, the committed transactions are conflict serializable. This is because one can construct an equivalent serial order by traversing the conflict graph in breadth-first order. The correctness discussion in Chapter 5 shows that Kilo TM satisfies conflict serializability.

Opacity

While serializability is concerned with transactions that have been successfully committed, opacity defines the behavior of transactions whose memory contents have already been clobbered by conflicting transactions [68, 73]. These transactions, called doomed transactions, are destined to be aborted.

A doomed transaction can occur if a TM system, to reduce its conflict detection overhead, allows a conflicting transaction to commit without resolving all of its conflicts. Memory read by a doomed transaction may be inconsistent with the values it has read in before – i.e., the read values may violate invariants set up by the application for correct behavior. Since the doomed transaction will not be committed, values computed using the inconsistent memory inputs will only be visible to the transaction itself. However, the inconsistent inputs may clobber the address calculation in the transaction.
This can cause the transaction to access invalid memory regions, potentially triggering page faults and exceptions in a way that is not immediately comprehensible by the programmer. The inconsistent inputs can also cause the transaction to run in an infinite loop (e.g., using -1 instead of 2 for a loop bound).

Many proposed TM systems avoid doomed transactions by aborting a transaction immediately before committing a conflicting transaction. Doing so ensures that the transaction is always observing a consistent view of the memory throughout its execution, i.e., transactions can never proceed in a doomed state. However, implementing this behavior can add significant overheads. It either requires a transaction to check for a conflict whenever it reads from memory, or to resolve all of its conflicts prior to committing to memory.

Recently, several TM proposals opt for sandboxing, which uses mechanisms to confine the corruption caused by doomed transactions away from the rest of the system [36, 124]. To do so, these TM systems buffer all memory updates made by a transaction as it executes, and handle any access to faulty pages caused by the transaction reading in an incorrect memory address from a conflicting memory location. The sandboxing mechanism also needs to detect the infinite loops caused by inconsistent memory values and abort the transaction to avoid deadlock. Sandboxing transactions allows transactions to execute without having to detect conflicts at every memory access. This can reduce the complexity of these TM systems. Kilo TM, presented in Chapter 4, uses sandboxing to support opacity.

2.2.2 Design Space

This subsection briefly summarizes a common design space for a TM system. It discusses the trade-offs for each of the design choices.
This set of design choices forms a taxonomy for comparing different TM system designs. This allows us to compare and relate the design of Kilo TM (introduced in Chapter 4) to other existing TM system proposals.

Weak and Strong Isolation

While the ACID properties specify that execution of a transaction is isolated from other concurrent transactions, they do not specify how a transaction should interact with non-transactional operations. For example, they do not clearly define the interaction between a transaction and load/store operations that are not part of any transaction. A TM system may choose to support either strong isolation or weak isolation [16, 73, 106]. TM systems that support strong isolation guarantee transactional semantics between transactions and non-transactional operations, whereas those supporting weak isolation only guarantee transactional semantics between transactions.

TM systems supporting weak isolation permit non-transactional operations to inject data-races into transactions. If the TM application needs to share data with non-transactional code, such as legacy libraries, weak isolation would be insufficient to protect transactions against data-races. Also, updates from weakly isolated transactions may not appear atomic to non-transactional operations. These non-transactional operations may behave incorrectly due to the observation of partial updates from a transaction. This can occur in applications under development, when the programmers are still deciding what operations should be in a transaction. Proponents of strong isolation argue that weak isolation makes this process much harder, because the errors from weak isolation appear less intuitive to the programmer [16]. These two issues with weak isolation are some of the main motivations towards supporting strong isolation.

On the other hand, proponents of weak isolation argue for a transactional data-race-free (TDRF) programming model [35].
TDRF can be supported by weak isolation, and is analogous to data-race-free memory consistency models adopted by high-level programming languages such as C++ and Java [147].

Since strong isolation requires monitoring non-transactional memory accesses, implementing it in an STM is difficult (but not impossible [2, 140]). On the other hand, HTM implementations that extend the cache coherence protocol for conflict detection are already monitoring non-transactional memory accesses for cache coherence. These HTM systems can support strong isolation with little overhead. For systems without cache coherence protocols, such as GPUs, it remains unclear whether supporting strong isolation is worth its overhead. Kilo TM, as presented in this dissertation, supports weak isolation. Section 8.2.2 discusses potential ways to extend Kilo TM to support strong isolation.

Conflict Detection and Resolution

Since the memory locations accessed by each transaction are unknown prior to its execution, conflicts can occur when multiple transactions execute concurrently. A conflict between two transactions occurs when both transactions have transactionally accessed the same data, and one of the accesses is a write. A TM system is responsible for detecting these conflicts, and resolving these conflicts before they generate errors in the rest of the system. Both conflict detection and conflict resolution of a TM system can be classified according to when the TM system detects a conflict and when it resolves the detected conflict [113].

With eager conflict detection, the TM system detects the conflict as soon as it has occurred. This is done in lock-based STMs and most HTMs that extend cache coherence protocols [75, 111, 113, 145, 157, 170]. In these HTMs, the detection mechanism piggybacks on the invalidation messages or exclusive access requests generated by a write to a cache line.
These messages are relayed to the sharers (or the current exclusive writer) of the same cache line, informing the recipients of the existence of a conflicting transaction.

With lazy conflict detection, the TM system defers the detection, possibly until one of the transactions attempts to commit. By deferring the detection, the TM system may aggregate and compress the communication required for the detection (e.g., broadcasting the read-set and write-set of a transaction using Bloom filters [26]), or it may reduce the frequency of conflict detection by only doing it once after the transaction has finished execution [67, 124, 149].

A TM system with eager conflict detection may choose to resolve the conflict immediately, doing eager conflict resolution. Alternatively, it may defer the resolution until one of the conflicting transactions attempts to commit, doing lazy conflict resolution [145, 157]. A TM system with lazy conflict detection can employ lazy conflict resolution. With either policy, the TM system can eliminate the conflict by aborting one of the conflicting transactions.

With eager conflict detection, the TM system may alternatively stall one of the conflicting transactions, and resume its execution after the other transaction has finished (committed or aborted). This avoids aborting a transaction and wasting the work it has already done. Also, direct update (discussed in Section 2.2.2) requires eager conflict detection and resolution to avoid doomed transactions. However, resolving conflicts eagerly may lead to dueling upgrades, where two transactions keep aborting each other so that neither can ever commit [18]. Lazy conflict resolution avoids this forward progress concern by having the TM system prioritize the first transaction that attempts to commit.
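Whatever the detection and resolution policy, the underlying test is the one stated at the start of this subsection: two transactions conflict when they access the same location and at least one of the accesses is a write. A minimal sketch of that test over explicit read-sets and write-sets follows; `AccessSets`, `intersects` and `conflicts` are hypothetical names, and locations are identified by name purely for readability.

```cpp
#include <cassert>
#include <set>
#include <string>

struct AccessSets {
    std::set<std::string> reads;   // read-set: locations the transaction read
    std::set<std::string> writes;  // write-set: locations it updated
};

// True if the two sets share at least one location.
static bool intersects(const std::set<std::string>& a,
                       const std::set<std::string>& b) {
    for (const auto& loc : a)
        if (b.count(loc)) return true;
    return false;
}

// Two transactions conflict if one writes a location that the other
// reads or writes. Read-read sharing is never a conflict.
bool conflicts(const AccessSets& a, const AccessSets& b) {
    return intersects(a.writes, b.reads) ||
           intersects(b.writes, a.reads) ||
           intersects(a.writes, b.writes);
}
```

Applied to Example 1, transaction A has read-set {H, K} and write-set {K}, while transaction B has read-set {H, K} and write-set {H}, so `conflicts` reports a conflict. Two transactions that only read H and K would not conflict.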
A TM system may dynamically switch between the two policies to capture the benefits from both policies [145, 157].

The TM system may also validate a transaction – checking whether the transaction has experienced conflicts. In this work, we distinguish validation from other forms of conflict detection in that a validation only detects the existence of conflicts, but not the exact transactions involved in the conflicts. A transaction may validate eagerly, at every transactional memory access, or lazily, only before it attempts to commit. With either policy, the transaction can only resolve the detected conflict by aborting itself. We call this self-abort. Kilo TM performs value-based validation lazily for each transaction and resolves conflicts via self-abort (explained in Chapter 4). Chapter 6 augments Kilo TM with a novel eager conflict detection mechanism, called temporal conflict detection, to accelerate read-only transactions.

Version Management

When a transaction executes, it may make updates to memory. These updates should remain invisible to the rest of the system until the transaction commits. This creates two versions of data for each updated memory location: one visible to the transaction itself, and the other visible to the rest of the system. Version management of a TM system refers to how it manages the versions created by these updates.

TM systems with direct update, or eager version management, allow the transaction to update the memory location in global memory during its execution. The original version in global memory is stored in an undo log [113, 170]. If a transaction is aborted, the TM system rolls back memory updated by this transaction by restoring those memory locations with values from its undo log. Since the updated data are already in global memory, the TM system needs to ensure that they are not visible to other transactions via eager conflict detection and resolution.
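The undo-log mechanism can be sketched as follows. This is a single-threaded illustration only: `Memory` is a toy stand-in for global memory, `DirectUpdateTxn` and its methods are hypothetical names, and the eager conflict detection that a real direct-update system needs is left out.

```cpp
#include <cassert>
#include <unordered_map>
#include <utility>
#include <vector>

using Memory = std::unordered_map<int, int>;  // address -> value (toy model)

class DirectUpdateTxn {
public:
    explicit DirectUpdateTxn(Memory& mem) : mem_(mem) {}

    // Save the current value in the undo log, then update memory in place.
    void write(int addr, int value) {
        undo_log_.emplace_back(addr, mem_[addr]);
        mem_[addr] = value;
    }

    int read(int addr) { return mem_[addr]; }

    // Commit: the new values are already in memory, so just drop the log.
    void commit() { undo_log_.clear(); }

    // Abort: restore the saved values, newest entry first, so that the
    // oldest saved value of each location is the one that survives.
    void rollback() {
        for (auto it = undo_log_.rbegin(); it != undo_log_.rend(); ++it)
            mem_[it->first] = it->second;
        undo_log_.clear();
    }

private:
    Memory& mem_;
    std::vector<std::pair<int, int>> undo_log_;  // (address, old value)
};
```

Commit is cheap in this scheme because it touches no data. All of the cost lands on the abort path, which must walk the log.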
Another transaction that attempts to access locations updated by this transaction immediately invokes a transaction resolution manager, which either aborts or stalls one of the transactions. Committing a transaction with direct update simply involves discarding the undo log. Since the commit does not involve updating any global metadata, multiple conflict-free transactions may commit in parallel.

TM systems with deferred update, or lazy version management, provide a buffer to contain the memory updates from each running transaction. Each transactional read needs to check this write buffer to see if it should read from the buffer for a transactional version of the data. In many HTM implementations, the write buffer is implemented using a private L1 data cache, and therefore does not add much overhead to transaction execution [26, 27, 75, 111, 145]. The L1 data cache is extended to withhold coherence visibility of transactional data, making them invisible to the rest of the system. Since the content of the write buffer is invisible to the rest of the system, multiple conflicting transactions may execute concurrently. The TM system can use lazy conflict detection and resolution to resolve conflicts among these transactions at a later time to ensure forward progress. At commit, contents in the write buffer are made visible to the global memory atomically. A TM system with deferred update can enforce this atomic propagation by serializing the commit of each transaction [26], or it may employ a more complex protocol to allow multiple non-conflicting transactions to commit in parallel [27]. Moreover, since a transaction can exceed the capacity of the L1 cache, a TM system that uses the L1 cache as a write buffer needs to handle the write buffer overflow.

Kilo TM employs deferred update so that it can perform value-based validation lazily for each transaction.
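To illustrate how deferred update and lazy value-based validation fit together, the following single-threaded sketch buffers writes, logs the values observed by reads, and re-checks those values at commit. It is not Kilo TM's implementation: `Memory`, `DeferredUpdateTxn` and `try_commit` are hypothetical names, and a real system must additionally make the validate-then-write-back step atomic with respect to other committing transactions.

```cpp
#include <cassert>
#include <unordered_map>
#include <utility>
#include <vector>

using Memory = std::unordered_map<int, int>;  // address -> value (toy model)

class DeferredUpdateTxn {
public:
    explicit DeferredUpdateTxn(Memory& mem) : mem_(mem) {}

    int read(int addr) {
        auto hit = write_buffer_.find(addr);
        if (hit != write_buffer_.end()) return hit->second;  // own pending write
        int value = mem_[addr];
        read_log_.emplace_back(addr, value);  // remember what was observed
        return value;
    }

    // Writes stay in the private buffer until commit.
    void write(int addr, int value) { write_buffer_[addr] = value; }

    // Lazy validation: every value read earlier must still be in memory.
    // On success the buffered writes are published; on failure the
    // transaction self-aborts by discarding its logs.
    bool try_commit() {
        bool valid = true;
        for (const auto& entry : read_log_) {
            if (mem_[entry.first] != entry.second) { valid = false; break; }
        }
        if (valid) {
            for (const auto& entry : write_buffer_)
                mem_[entry.first] = entry.second;
        }
        read_log_.clear();
        write_buffer_.clear();
        return valid;
    }

private:
    Memory& mem_;
    std::vector<std::pair<int, int>> read_log_;    // (address, value read)
    std::unordered_map<int, int> write_buffer_;    // address -> pending value
};
```

Validation here detects only that some read value has changed, not which transaction changed it, which is why the only available resolution is self-abort.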
Chapter 4 explains how Kilo TM stores the write buffer of each transaction in a linear log.

Bound on Transaction Footprint and Irrevocable Operations

A TM system supports bounded transactions if it restricts the size of the memory footprint that can be accessed by a transaction. These restrictions usually originate from using the L1 cache as a buffer for transactional data in HTM systems. When the memory footprint no longer fits in the buffer, the TM system aborts the transaction. It then informs the application of the overflow, and lets the application handle it via alternative means. All existing HTMs implemented on real CMP systems have this restriction [71, 80, 81]. As a result, the runtime TM system usually complements the HTM with an STM system to handle the rare, large-footprint transactions, an approach known as hybrid TM [39, 45, 111].

An HTM system can employ specific mechanisms to support transactions that overflow the write buffer. The simplest mechanism is to serialize the execution of all overflowing transactions [17]. These overflowing transactions may execute concurrently with other non-overflowing transactions, but always have the top priority during conflict resolution.

A TM system can also use this implicit serialization to elegantly support irrevocable operations. These operations, such as I/O operations and page faults, cannot be rolled back when the transaction is aborted.

Kilo TM supports unbounded transactions by storing the read-set and write-set of a transaction in linear logs that can be spilled to DRAM automatically. The GPU TM applications evaluated in this dissertation do not contain irrevocable operations. Nevertheless, Kilo TM may adopt the implicit serialization approach described above to support irrevocable operations.

Figure 2.1: High-level GPU architecture as seen by the programmer. (Components shown: the CPU and kernel launch unit; SIMT cores holding thread blocks, register file, SIMT stacks, shared memory, L1 data cache, constant cache, texture cache and memory port; the interconnection network; and memory partitions, each with a last-level cache bank, atomic operation unit, DRAM controller and off-chip GDDR memory channel.)

Nesting

Nesting occurs when an outer transaction contains one or more inner transactions. This can happen when a transaction calls another transactional subroutine. The support for nesting with TM simplifies the development of composable concurrent software.

The exact semantics for nested transactions has yet to be settled [73]. In this work, Kilo TM supports nested transactions by flattening them: a conflict detected in an inner transaction aborts the outer transaction.

2.3 GPU Architectures

This section describes the baseline GPU architecture used throughout this dissertation. The description covers a generic architecture with a number of design parameters that can be configured to model different GPU hardware designs. The exact parameters used in the evaluations of the subsequent chapters are presented in the methodology section in each of the chapters.

2.3.1 Programming Model

Figure 2.1 shows the high-level overview of our baseline GPU architecture. A GPU application starts on the CPU and uses a compute acceleration API such as CUDA or OpenCL to launch work onto the GPU [89, 120, 122]. Each launch consists of a hierarchy of scalar threads, called a grid in CUDA. All scalar threads in a grid execute the same compute kernel. The thread hierarchy organizes threads as thread blocks. Each block is dispatched to one of the heavily multi-threaded SIMT cores as a single unit of work. It stays on the SIMT core until all of its threads have completed execution.
A SIMT core is similar to a Streaming Multiprocessor (SM) in NVIDIA GPUs, a Compute Unit in AMD GPUs, or an Execution Unit (EU) in Intel GPUs. Threads within a block can communicate via an on-chip scratchpad memory called shared memory (local memory in OpenCL), and can synchronize quickly via hardware barriers. The SIMT cores access a distributed, shared, read/writeable last-level (L2) cache and off-chip DRAM via an on-chip interconnection network.

The application may launch a thread hierarchy that far exceeds the GPU on-chip capacity. The GPU launch unit automatically dispatches as many thread blocks as the GPU on-chip resources can sustain, and dispatches the rest of the thread hierarchy as resources are released by completed thread blocks. This hardware-accelerated thread spawning mechanism distinguishes GPUs from more traditional vector processors and multi-core processors. It allows GPU applications to decompose their workloads into as many threads as possible without introducing significant overhead.

The memory model in CUDA and OpenCL features multiple memory spaces. These memory spaces share storage in the on-board DRAM on the GPU hardware, except for the shared and host memory spaces. Each memory space has its own functional semantics and performance characteristics that reflect its microarchitecture implementations in GPU hardware.

Table 2.1: Memory space mapping between CUDA and OpenCL

CUDA               OpenCL
Global memory      Global memory
Local memory       Private memory
Shared memory      Local memory
Constant memory    Constant memory
Texture memory     Image object
Host memory        Host pointer

In CUDA, each thread can access the following memory spaces [122]:

Global memory space contains data that can be accessed by all threads running on the GPU. Its capacity is limited by the amount of on-board DRAM on the GPU.

Local memory space contains data that is private to a single GPU thread.
The GPU application typically uses this memory space for register spilling and storage for thread-private memory structures/arrays.

Shared memory space contains data that can be shared among threads in the same thread block. It represents the scratchpad memory in each SIMT core, and has limited capacity.

Constant memory space contains read-only data that remains unchanged throughout a kernel launch. These data are cached in a special constant cache in the GPU memory system. In some GPU architectures, it is used for storing kernel parameters that are passed from the host CPU at the kernel launch.

Texture memory space contains texture data that are accessed via the texture unit with special texture fetch instructions. The texture data are cached in read-only texture caches. Before GPUs featured caches for data stored in global and local memory spaces, many early GPU compute applications stored read-only data as textures to make use of these read-only texture caches.

Host memory space represents data in the system main memory that belongs to the host CPU. GPU threads can use this memory space to communicate directly with the host CPU thread.

Table 2.1 shows the mapping of memory spaces between CUDA and OpenCL [89].

2.3.2 Single-Instruction, Multiple-Thread (SIMT) Execution Model

Modern GPUs employ wide SIMD hardware to exploit the DLP in GPU applications. Instead of exposing this SIMD hardware directly to the programmer, GPU computing APIs, such as CUDA and OpenCL, feature a MIMD-like programming model that allows the programmer to launch a large array of scalar threads onto the GPU. Each of these scalar threads can follow its unique execution path and may access arbitrary memory locations. At runtime, the GPU hardware executes groups of scalar threads in lockstep on SIMD hardware to exploit their regularities and spatial localities.
This execution model is called single-instruction, multiple-thread (SIMT) [101, 119].

With the SIMT execution model, scalar threads are managed in SIMD execution groups called warps (or wavefronts in AMD terminology). In the CUDA programming model, each warp contains a fixed group of scalar threads throughout a kernel launch. This arrangement of scalar threads into warps is exposed to the CUDA programmer/compiler for various control flow and memory access optimizations [122]. In Chapter 3, we refer to warps with this arrangement as static warps to distinguish them from the dynamic warps that are dynamically created via dynamic warp formation or thread block compaction.

Ideally, threads within the same warp execute through the same control flow path, so that the GPU can execute them in lockstep on SIMD hardware. Given the autonomy of the threads, a warp may encounter a branch divergence when its threads diverge to different targets at a data-dependent branch. Modern GPUs contain special hardware to handle branch divergence in a warp. Section 2.3.3 describes the SIMT stack, which handles branch divergence in our baseline GPU architecture by serializing the execution of the different targets. This serialization correctly handles the branch divergence, but with a performance penalty. Thread block compaction, introduced in Chapter 3, is a novel alternative that reduces this performance penalty.

Notice that thread blocks and warps are two orthogonal organizations of scalar threads. Conceptually it is possible for two scalar threads from different thread blocks to be grouped into the same warp. However, existing GPU architectures restrict each warp to only contain threads from the same thread block. With this restriction, one may treat warps as yet another level of thread hierarchy: each thread block consists of a set of warps running on the same SIMT core. This restriction is imposed to simplify the implementations of various features.
For example, it allows the thread spawning mechanism in GPU architectures to manage hardware resources allocated to a thread block, such as slots in the warp scheduler and register spaces, at warp-level granularity. It also simplifies the implementation of per-block barriers by ensuring that a warp can only update the barrier arrival counter for at most one thread block.

Figure 2.2: SIMT core microarchitecture of a contemporary GPU. N = #warps/core, W = #threads in a warp.

2.3.3 Microarchitecture

Figure 2.2 illustrates the multi-threaded microarchitecture within a SIMT core. We defined this microarchitecture, described below, by considering details found in recent NVIDIA patents [30, 31, 102]. In our evaluation of thread block compaction in Chapter 3, we approximate some details to simplify our simulation model: we model the fetch unit as always hitting in the instruction cache, and allow an instruction to be fetched, decoded and issued in the same cycle. We also employ a simplified scoreboard that forbids concurrent execution of subsequent instructions from the same warp and a unified pipeline that services both ALU and MEM instructions.

Fine-Grained Multithreading Hardware

Each SIMT core interleaves up to N warps on a cycle-by-cycle basis.
The number of warps that run concurrently on the SIMT core (N) depends primarily on the register capacity of the SIMT core and the number of registers required by each thread, as well as the availability of several other hardware resources. Each warp has a program counter (PC) in the fetch unit (see label 1 in Figure 2.2), a dedicated slot (2) in the instruction buffer, and its own stack (3) to manage branch divergence within that warp.

Each slot (2) in the instruction buffer contains a v-bit indicating when an instruction is present, and an r-bit indicating that it is ready for execution. Every cycle the fetch unit selects the PC for a warp with an empty instruction slot (2), and fetches the corresponding instruction from the instruction cache (5). The instruction is decoded (6) and placed in an empty slot in the instruction buffer (2). This instruction waits in the instruction buffer until its ready bit (2) is set by the scoreboard (7), indicating the completion of dependent prior instructions from this warp. Instructions within a warp execute in-order. Our evaluation of thread block compaction in Chapter 3 employs a simple scoreboard that only tracks prior instruction completion for each warp. On the other hand, our evaluation of Kilo TM in the rest of this dissertation uses a more advanced scoreboard that can track per-warp register dependencies [31], potentially allowing multiple instructions per warp in the pipeline.

The issue logic (9) selects a warp with a ready instruction in the instruction buffer to issue for execution. As an instruction issues it acquires the active mask (10) from the top entry on the corresponding warp’s reconvergence stack in the branch unit. The active mask disables threads in the warp that should not execute due to branch divergence.
Once issued (11), the slot in the instruction buffer (2) that contained the instruction is marked invalid, signaling the fetch unit that it may fetch the next instruction for this warp.

After a branch from a warp is decoded, none of its instructions can be fetched, effectively stalling the warp, until the branch outcome is known. In the meantime, the fetch unit fetches instructions for other warps in the SIMT core. Stalling the warp until its branch target is known removes the need for branch prediction hardware, and eliminates any unnecessary instruction fetch.

The issued instruction fetches its operands from the register file. It is then executed in the corresponding pipeline (ALU (12) or MEM (13)). Upon completion, the instruction writes its results to the register file and notifies the scoreboard (8). When the scoreboard detects that the next fetched instruction for the corresponding warp no longer has any pending register dependency hazard, it updates the r-bit (2) of this instruction in the instruction buffer. This allows the warp to be selected by the issue logic in subsequent cycles.

Handling Branch Divergence with SIMT Stack

Each warp has a SIMT stack (3) to handle its branch divergence. The SIMT stack serializes the execution of different subsets of threads that diverge to different control flow paths. We summarize the SIMT stack mechanism in our baseline below [56, 59].

If some threads in a warp have different outcomes when a branch executes, i.e., the branch diverges, new entries are pushed onto the warp’s SIMT stack. Each entry contains a reconvergence PC (RPC) which is set to the immediate post-dominator of the branch, i.e., the closest point in the program that all paths leaving the branch must go through before exiting the function [114]. The immediate post-dominator of each branch is obtained via an offline analysis during compilation, and it is encoded in the branch instruction (or encoded via a special instruction).
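To make the notion of an immediate post-dominator concrete, the sketch below computes it with a simple iterative set-based analysis. This is a textbook-style illustration and not the analysis used by any particular GPU compiler. The function name, the dictionary encoding of the control flow graph, and the requirement of a single exit block reachable from every block are assumptions made for this sketch. The graph is the one traversed in the Figure 2.3 example later in this section, with its edges inferred from that walkthrough.

```python
def immediate_post_dominators(successors, exit_block):
    """Return {block: immediate post-dominator} for every non-exit block.

    successors maps each non-exit block to the list of blocks it can branch to.
    Assumes a single exit block that every block can reach.
    """
    blocks = set(successors) | {exit_block}

    # pdom[b] = blocks found on every path from b to the exit (b included).
    pdom = {b: set(blocks) for b in blocks}
    pdom[exit_block] = {exit_block}

    changed = True
    while changed:
        changed = False
        for b in blocks - {exit_block}:
            common = set.intersection(*(pdom[s] for s in successors[b]))
            updated = {b} | common
            if updated != pdom[b]:
                pdom[b] = updated
                changed = True

    ipdom = {}
    for b in blocks - {exit_block}:
        strict = pdom[b] - {b}
        for candidate in strict:
            # The closest strict post-dominator is itself post-dominated
            # by all of the other strict post-dominators of b.
            if strict <= pdom[candidate]:
                ipdom[b] = candidate
                break
    return ipdom


# Control flow graph of the Figure 2.3 example (edges inferred from the text).
cfg = {
    "A": ["B", "F"],
    "B": ["C", "D"],
    "C": ["E"],
    "D": ["E"],
    "E": ["G"],
    "F": ["G"],
}
reconvergence = immediate_post_dominators(cfg, exit_block="G")
# reconvergence["A"] == "G" and reconvergence["B"] == "E"
```

For the two divergent branches in this graph, the analysis yields G for the branch at the end of A and E for the branch at the end of B, which are the RPC values used in the Figure 2.3 example.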
Each bit in the active mask indicates whether the corresponding thread follows the control flow path corresponding to the stack entry. The PC of the top-of-stack (TOS) entry indicates the target path of the branch. Reaching the reconvergence point is detected when the next PC equals the RPC at the TOS entry. When this occurs, the top of the stack is popped (current GPUs use special instructions to manage the stack [5, 6, 30, 100]). This switches execution to the next branch target that is to be executed by the other subset of threads. After all threads have reached the reconvergence point, the TOS entry will reveal a full active mask with the reconvergence PC of the divergent branch, indicating that the threads have reconverged.

Figure 2.3 shows an example of how a SIMT stack handles two levels of branch divergence encountered by a warp with four threads (1, 2, 3, 4).

Figure 2.3: Example operation of a SIMT stack. The number label of each SIMT stack state on the right shows the warp’s execution state at the same number label on the control flow graph.

Control flow graph (block: active threads): A: 1234, B: 123x, C: 1xxx, D: x23x, E: 123x, F: xxx4, G: 1234.

SIMT stack states (PC, RPC, active mask), listed from the bottom of the stack to the TOS:

State 1: (A, --, 1111)
State 2: (G, --, 1111), (F, G, 0001), (B, G, 1110)
State 3: (G, --, 1111), (F, G, 0001), (E, G, 1110), (D, E, 0110), (C, E, 1000)
State 4: (G, --, 1111), (F, G, 0001), (E, G, 1110)
State 5: (G, --, 1111), (F, G, 0001)
State 6: (G, --, 1111)

1. When the warp starts to execute basic block A, its SIMT stack consists of a single entry with a fully populated active mask (1111), indicating that every thread in the warp is active.

2. The warp has encountered a divergent branch at the end of basic block A, with threads (1, 2, 3) diverging to block B, and thread (4) diverging to block F.
The warp pushes two new entries onto the SIMT stack, with the RPC in both entries set to basic block G, the immediate post-dominator of block A. Each entry has a partially disabled active mask to indicate the subset of threads that are active for the target (1110 for block B and 0001 for block F). The warp also modifies the PC at the bottom of the stack to basic block G.

3. The warp has encountered another divergent branch at the end of basic block B. The warp pushes two more new entries onto the SIMT stack, each with the RPC set to basic block E. One entry corresponds to the execution of basic block C with only thread (1) active (active mask = 1000); another entry corresponds to the execution of basic block D with threads (2, 3) active (active mask = 0110).

4. When the warp has finished executing block C, it detects that the next basic block equals the one stored in the RPC of the TOS entry in its SIMT stack. It pops the stack to switch execution to block D. It also pops the stack again after executing block D, revealing the reconvergence entry at basic block E. This allows threads (1, 2, 3) to execute block E together.

5. After executing block E, the warp detects that the next PC equals the RPC of the TOS entry. It pops the SIMT stack to switch execution to block F, with only thread (4) active.

6. After executing block F, the warp pops the stack again to reveal the top-level entry with a full active mask for block G. The diverged threads in the warp now reconverge to execute block G with full SIMD efficiency.

Through the above interactions with the SIMT stack, the warp executes the example program with the execution flow shown in Figure 2.3.

While the SIMT stack allows divergent threads to reconverge at the earliest statically known convergence point, it may result in low SIMD efficiency in applications with deeply nested data dependent control flow or loop bounds that vary across threads in a warp.
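The six steps above can be replayed with a small software model of a per-warp SIMT stack. The Python sketch below is only an illustration of the push, pop and reconvergence rules described in this section, not a model of any specific hardware. The function names, the encoding of branch outcomes as per-block functions, the hand-written table of reconvergence points, and the choice to execute first the target taken by the lowest-numbered active thread are all assumptions made so that the trace matches the Figure 2.3 example.

```python
def mask_string(active, warp_size):
    """Render a set of active thread IDs (1-based) as a mask such as '1110'."""
    return "".join("1" if t in active else "0" for t in range(1, warp_size + 1))


def run_warp(next_block, ipdom, entry, warp_size):
    """Execute one warp and return the list of (block, active mask) it issues.

    next_block[b](t) gives the block thread t jumps to after block b,
    or None when the thread exits. ipdom[b] is the reconvergence point of b.
    """
    all_threads = frozenset(range(1, warp_size + 1))
    stack = [[entry, None, all_threads]]   # entries: [PC, RPC, active threads]; last = TOS
    trace = []

    while stack:
        pc, rpc, active = stack[-1]
        if pc == rpc:
            # Next PC equals the RPC of the TOS entry: pop to reconverge.
            # The bottom entry has RPC None and is popped when the warp exits.
            stack.pop()
            continue

        trace.append((pc, mask_string(active, warp_size)))

        # Group the active threads by branch target, lowest thread ID first.
        targets = {}
        for t in sorted(active):
            targets.setdefault(next_block[pc](t), set()).add(t)

        if len(targets) == 1:
            stack[-1][0] = next(iter(targets))          # no divergence
        else:
            reconv = ipdom[pc]
            stack[-1][0] = reconv                       # current entry waits at the RPC
            for target, threads in reversed(list(targets.items())):
                stack.append([target, reconv, frozenset(threads)])
    return trace


# Branch outcomes and reconvergence points of the Figure 2.3 example.
next_block = {
    "A": lambda t: "B" if t <= 3 else "F",
    "B": lambda t: "C" if t == 1 else "D",
    "C": lambda t: "E",
    "D": lambda t: "E",
    "E": lambda t: "G",
    "F": lambda t: "G",
    "G": lambda t: None,
}
ipdom = {"A": "G", "B": "E", "C": "E", "D": "E", "E": "G", "F": "G"}

trace = run_warp(next_block, ipdom, entry="A", warp_size=4)
for block, mask in trace:
    print(block, mask)
```

Running the sketch issues the blocks in the order A, B, C, D, E, F, G with active masks 1111, 1110, 1000, 0110, 1110, 0001 and 1111, which is the serialized execution flow of the example: only the first and last blocks use all four lanes.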
In these situations different threads within a warp may follow different execution paths.

Memory Subsystem

When a memory instruction issues (label 13 in Figure 2.2), the address generation unit (AGU) generates addresses for each thread in the warp. For each memory instruction, each scalar thread in the warp can generate a scalar memory access. These accesses are served in parallel by the memory subsystem in the SIMT core. Shared memory accesses (accesses to on-chip per-core scratchpad memory) are served by 32 shared memory banks. Accesses contending for the same bank are serialized. For global and local memory spaces [122], accesses from different threads in the same warp to the same 128-Byte memory chunk are merged (coalesced) into a single wide access. The L1 data cache services one wide access per cycle. The constant cache operates similarly, except accesses from a warp to different addresses are serialized without coalescing. Texture accesses are serviced by the texture unit, which accesses a texture cache [69].

The L1 data caches in the SIMT cores are not coherent [121]. They can be used to cache read-only shared data as well as thread-private data. However, GPU applications (such as the TM applications evaluated in Chapters 4 and 6) can store shared data in the global memory space so that it can be updated by threads from different SIMT cores. To avoid access to stale (non-coherent) data, these applications can configure the GPU architecture (via a compiler flag [122]) so that all global memory accesses skip the L1 cache. In this configuration, they are serviced directly by the L2 cache bank at the corresponding memory partition. The accesses from each thread in the same warp are still coalesced into 128-Byte wide accesses. Each SIMT core can inject one wide memory access into the on-chip interconnection network per cycle.

Each thread can store thread-private data and spilled registers in a private local memory space [122].
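The coalescing and bank-serialization rules described earlier in this subsection can be illustrated with a short Python sketch. It is a simplified counting model rather than a description of the simulated hardware. The function names are hypothetical. The 128-Byte chunk size and the 32 banks come from the text, while the word-interleaved bank mapping (consecutive 32-bit words fall in consecutive banks) and the treatment of accesses to the same word as non-conflicting are assumptions of this sketch.

```python
from collections import defaultdict


def coalesced_accesses(addresses, chunk_bytes=128):
    """Return the base addresses of the wide accesses issued for one warp.

    Per-thread byte addresses falling in the same 128-Byte chunk are merged
    into a single wide access.
    """
    return sorted({(addr // chunk_bytes) * chunk_bytes for addr in addresses})


def shared_memory_passes(addresses, num_banks=32, word_bytes=4):
    """Return how many serialized passes one warp-wide shared memory access needs.

    Assumption: bank = (word index) mod num_banks, and threads reading the
    same word do not conflict with each other.
    """
    words_per_bank = defaultdict(set)
    for addr in addresses:
        word = addr // word_bytes
        words_per_bank[word % num_banks].add(word)
    return max((len(words) for words in words_per_bank.values()), default=0)


warp = range(32)

# Consecutive 4-byte words starting at a chunk boundary: one wide access.
unit_stride = [0x100 + 4 * t for t in warp]
# Each thread touches a different 128-Byte chunk: 32 wide accesses.
chunk_stride = [128 * t for t in warp]

print(len(coalesced_accesses(unit_stride)), len(coalesced_accesses(chunk_stride)))
print(shared_memory_passes([4 * t for t in warp]), shared_memory_passes(chunk_stride))
```

Under these assumptions the unit-stride pattern needs a single wide access and a single shared memory pass, while the 128-Byte stride needs 32 of each, since every thread lands in a different chunk but in the same bank.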
The local memory is stored in off-chip DRAM and cached in the per-core L1 data cache and the shared L2 cache. It is organized such that consecutive 32-bit words are accessed by consecutive scalar threads in a warp. When all the threads in a warp are accessing the same address in their own local memory space, their accesses fall into the same cache line in the L1 cache and are serviced in parallel in a single cycle.

Current GPUs provide hardware atomic operations for simple single-word read-modify-write operations [89, 122]. The SIMT cores send atomic operation requests to a set of raster operation units in the memory partitions (Atomic Op. Unit in Figure 2.1) to perform these read-modify-write operations to individual locations atomically within the memory partitions [22]. Programmers can use these atomic operations to implement locks.

Memory Partition

The GPU features a distributed shared memory architecture that is implemented with a set of memory partitions. The physical linear memory address space is interleaved among these partitions in chunks of 256 Bytes. This fine-grained division of the address space reduces the likelihood of load imbalance at a particular memory partition. A memory access that cannot be serviced within the SIMT core is sent, through the on-chip network, to the memory partition that corresponds to the accessed memory location. Each memory partition contains an off-chip DRAM channel, and one or more L2 cache banks that cache data from the off-chip DRAM. The L2 cache is coherent: each bank is responsible for part of the physically linear memory address space, so that no two banks store the same data.

Each memory partition also contains a set of atomic operation units (Atomic Op. Unit in Figure 2.1).
These atomic operation units work with the L2 cache logic to perform single-word read-modify-write operations atomically.

2.4 Dynamic Warp Formation

Dynamic warp formation (DWF) [56, 59] improves the performance of GPU applications that suffer from branch divergence by rearranging threads into new dynamic warps in hardware. The arrangement of scalar threads into static warps is an arbitrary grouping imposed by the GPU hardware that is largely invisible to the programming model. Also, with GPUs implementing fine-grained multi-threading to tolerate long memory access latency, there are many warps in a core, hundreds of threads in total. Since these warps are all running the same compute kernel, they are likely to follow the same execution path, and encounter branch divergence at the same set of data-dependent branches. Consequently, each target of a divergent branch is probably executed by a large number of threads, but these threads are scattered among multiple static warps, with each warp handling the divergence individually. DWF exploits this observation by rearranging these scattered threads that execute the same instruction into new dynamic warps. At a divergent branch, DWF can maintain high SIMD efficiency by effectively compacting threads scattered among multiple diverged static warps into several non-divergent dynamic warps. In this way, DWF can capture a significant fraction of the benefits of MIMD hardware on SIMD hardware.

However, DWF requires warps to encounter the same divergent branch within a short time window. As a result, the warp scheduling policy can have a significant impact on DWF [56, 59]. Chapter 3 shows how the best performing scheduling policy for DWF from prior work suffers from two major performance pathologies: (1) A greedy scheduling policy can starve some threads, leading to a SIMD efficiency reduction; (2) Thread regrouping in DWF increases non-coalesced memory accesses and shared memory bank conflicts.
These pathologies cause DWF to slow down many existing GPU applications. Chapter 3 also shows how applications relying on implicit synchronization in a static warp execute incorrectly with DWF.

2.5 Summary

In this chapter, we have explained a set of fundamental concepts in parallel computing that are relevant to this work. We have also summarized important aspects of transactional memory that impact the design of Kilo TM in Chapters 4, 5 and 6. We have also presented a modern GPU architecture that serves as our baseline in subsequent chapters. Finally, we have briefly discussed the insight behind dynamic warp formation, which forms the basis of our work on thread block compaction in Chapter 3.

Chapter 3

Thread Block Compaction for Efficient SIMT Control Flow

In this chapter, we propose thread block compaction (TBC), a novel hardware mechanism for improving the performance of applications that suffer from control flow divergence on GPUs. A version of this chapter has been published earlier [54].

The development of TBC is motivated by our investigation of performance pathologies that cause dynamic warp formation (DWF), our prior work tackling the same goal (see Chapter 2), to slow down some GPU applications. These pathologies, presented in Section 3.2, can be partially addressed with an improved scheduling policy that effectively separates the compute kernel into two sets of regions, divergent and non-divergent (coherent) regions. The divergent regions benefit significantly from DWF, whereas the coherent regions are free of branch divergence but are prone to the DWF pathologies. We found that the impact of the DWF pathologies can be significantly reduced by forcing DWF to rearrange scalar threads back to their static warps in the coherent regions.

TBC builds upon this insight with the observation that continually rearranging threads into new dynamic warps does not yield additional benefit.
Instead, the rearrangement, or compaction, only needs to happen right after a divergent branch, the start of a divergent region, and before its reconvergence point, the start of a coherent region. We note the existing per-warp SIMT stack (described in Chapter 2) implicitly synchronizes threads diverged to different execution paths at the reconvergence point of the divergent branch, merging these diverged threads back into a static warp before executing a coherent region. One can extend the SIMT stack to encompass all warps executing in the same core, forcing them to synchronize and compact at divergent branches and reconvergence points to achieve robust DWF performance benefits. However, synchronizing all the warps within a core at each divergent branch for compaction can greatly reduce the available thread-level parallelism (TLP). GPU architectures rely on the abundance of TLP to tolerate pipeline and memory latency.

TBC settles for a compromise between SIMD efficiency and TLP availability by restricting compaction to only occur within a thread block. GPU applications usually execute multiple thread blocks concurrently on a single core to overlap the synchronization and memory latency. TBC leverages this software optimization to overlap the compaction overhead at divergent branches: when warps in one thread block synchronize for compaction at a divergent branch, warps in other thread blocks can keep the hardware busy. Section 3.3 describes how the per-warp SIMT stack is extended to encompass warps in a thread block, and how the implicit synchronization can simplify DWF hardware.

This chapter also describes an extension to immediate post-dominator (IPDOM) based reconvergence called likely-convergence points (LCP). Section 3.4 describes how using IPDOM as the reconvergence point in applications with unstructured control flow can miss some opportunities to reconverge the warp before the immediate post-dominator.
We show how a SIMT stack (per-warp and per-thread block) can be extended with these likely-convergence points to achieve higher SIMD efficiency.

Our simulation evaluation in Section 3.6 shows that TBC with LCP achieves an overall 22% speedup over a per-warp SIMT stack baseline (PDOM) for a set of divergent GPU applications, while introducing no performance penalty for a set of coherent GPU applications. Our analysis shows that TBC achieves a significantly higher SIMD efficiency versus PDOM or DWF (Figure 3.1(a)) and fewer memory pipeline stalls compared to DWF (Figure 3.1(b)).

Figure 3.1: Overall performance of TBC (details in Section 3.5). Panels: (a) SIMD Efficiency and (b) Normalized Memory Stalls, each comparing TBC, DWF and PDOM.

Figure 3.2: SIMD efficiency of GPU applications used in our evaluation. See Section 3.5 for detail. Applications shown, grouped into DIVG and COHE: BFS2, FCDT, HOTSP, LPS, MUM, MUMpp, NAMD, NVRT, AES, BACKP, CP, DG, HRTWL, LIB, LKYT, MGST, NNC, RAY, STMCL, STO, WP.

The rest of this chapter is organized as follows: Section 3.1 classifies the workload used in this chapter according to their default SIMD efficiency. Section 3.2 discusses our findings on the various DWF pathologies. Section 3.3 describes TBC. Section 3.4 describes likely-convergence points and how they may be applied to an existing SIMT stack.
Section 3.5 describes our evaluation methodology. Section 3.6 presents results, Section 3.7 estimates the implementation complexities of TBC, and Section 3.8 summarizes this chapter.

Figure 3.3: Dynamic warp formation pathologies.
(a) Starvation eddy scheduling problem. While not reconverging as soon as possible may benefit latency tolerance [107], it is a major cause of reduced SIMD efficiency in dynamic warp formation [59]. In the execution flow, shaded = utilized SIMD processing element, white = idle SIMD processing element. The control flow graph carries the active masks of two warps (W0, W1): A: 1111, 1111; B: 1101, 0110; C: 0010, 1001; D: 1111, 1111. The execution flows compared are the per-warp reconvergence stack (PDOM), thread block compaction (DWF ideal), and DWF with a starvation eddy.
(b) Extra memory accesses introduced by random thread grouping. Static warp arrangement: W0 = threads (1, 2, 3, 4) and W1 = threads (33, 34, 35, 36), accessing memory chunks 0x100 - 0x17F and 0x180 - 0x1FF, #Access = 2. Potential DWF arrangement: WX = threads (1, 34, 3, 36) and WY = threads (33, 2, 35, 4), #Access = 4.

3.1 Workload Classification

In this work, we use SIMD efficiency to identify CUDA applications that diverge heavily (DIVG) but also study a representative set of coherent (COHE) applications in which threads in a warp follow the same execution paths (Figure 3.2). (Footnote 1: The graphics community has long used the term coherent to mean that different threads access adjacent memory locations, which is important for DRAM performance. This usage is different from that in the computer architecture community, but equally well entrenched.) Here SIMD efficiency is the average fraction of SIMD processing elements that perform useful work on cycles where the SIMD processing unit has a ready instruction to execute. We classify a benchmark as DIVG if it has SIMD efficiency below 76% and COHE otherwise (NNC contains 16 threads per block and no branch divergence). This classification simplifies our performance analysis throughout this chapter.
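The SIMD efficiency metric and the DIVG/COHE threshold defined above can be written down in a few lines. The Python sketch below is an illustration only and not the measurement code used in the evaluation. The function names are hypothetical, and the sample input reuses the four-thread warp of the Figure 2.3 example from Chapter 2, under the simplifying assumption that each basic block issues exactly one instruction.

```python
def simd_efficiency(active_lanes_per_issue, simd_width):
    """Average fraction of SIMD lanes doing useful work per issued instruction.

    active_lanes_per_issue holds one entry per cycle in which the SIMD unit
    had a ready instruction: the number of active (unmasked) lanes.
    """
    if not active_lanes_per_issue:
        raise ValueError("need at least one issued instruction")
    return sum(active_lanes_per_issue) / (simd_width * len(active_lanes_per_issue))


def classify(efficiency, threshold=0.76):
    """Label a benchmark DIVG below the 76% threshold and COHE otherwise."""
    return "DIVG" if efficiency < threshold else "COHE"


# Active lanes for blocks A, B, C, D, E, F, G of the Figure 2.3 example.
lanes = [4, 3, 1, 2, 3, 1, 4]
efficiency = simd_efficiency(lanes, simd_width=4)   # 18 active lanes out of 28 slots
label = classify(efficiency)
print(f"{efficiency:.3f} {label}")
```

With these inputs, 18 of the 28 available lane slots do useful work, an efficiency of about 64%, so the toy example would fall in the DIVG class.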
TBC and DWF should aim to speedup the DIVG applications while pre-serving the performance of the COHE applications.3.2 Dynamic Warp Formation PathologiesDynamic warp formation (DWF) [56] regroups threads executing the same instruc-tion into new warps to improve SIMD efficiency. It promises to capture a signif-icant fraction of the benefits of MIMD hardware on multithreaded SIMD hard-ware employing large multi-banked caches. However, the benefits of DWF canbe affected by the warp scheduling policy [56, 59] and memory systems that limit50bandwidth to first level memory structures. Figure 3.3(a) shows an example thathelps to illustrate a form of pathological scheduling behaviour that can occur inDWF. This figure compares various SIMT control flow handling mechanisms fora branch hammock2 diverging at block A. Each basic block contains the activemask for two warps where a “0” means the corresponding lane is masked off. Theper-warp stack-based reconvergence mechanism (PDOM) executes block B and Cwith decreased SIMD efficiency, but reconverges at block D. The bottom right ofFigure 3.3(a) illustrates a case where the threads at block C fall behind those atB. While this scheduling can increase latency tolerance [107], it can also lead to areduction in performance [59] since ideally the warps at block B and C could formfewer warps at block D. We call this fall-behind behaviour a starvation eddy.This behavior originates from a greedy scheduling policy, majority, that hasbeen shown to work best with DWF. It incurs poor performance when a small num-ber of threads “falls behind” the “majority” of threads [59]. The starvation eddyphenomenon reduces opportunities for such threads to regroup with the “major-ity” leading over time to lower SIMD efficiency. 
We observed this issue lowering the SIMD efficiency of many CUDA applications run with DWF as shown in Figure 3.1(a) (applications and configuration described in Section 3.5).

Furthermore, CUDA applications tend to be written assuming threads in a warp will execute together and should therefore access nearby memory locations. DWF tries to optimize control flow behaviour at the potential expense of increasing memory accesses (demonstrated in Figure 3.3(b)). Across the workloads we study, this leads to 2.7× extra stalls at the memory pipeline (Figure 3.1(b)).

Footnote 2: Note that the mechanisms studied in this work support CUDA and OpenCL programs with arbitrary control flow within a kernel.

Footnote 3: In our previous evaluation [56, 59], each SIMT core had large multi-bank L1 caches to buffer the memory system impact of DWF, whereas each SIMT core in this work only has a much smaller, single banked L1 cache, which may be desirable in practice to reduce the complexity and area of the memory system. The applications evaluated in our prior study [56, 59] also lacked the memory coalescing optimizations found in most CUDA applications (including those evaluated here), masking the impact of thread regrouping on the memory system.

Figure 3.4 compares DWF with majority scheduling against the baseline per-warp SIMT stack (PDOM) on the DIVG applications. While DWF improves performance significantly on BFS2, FCDT, and MUMpp, other applications suffer a slowdown (footnote 3). One application, NVRT, executes incorrectly with DWF, because it uses a single manager thread in each static warp to continually acquire tasks from a global queue (atomically acquire a range of task IDs) for other worker threads in the warp. The per-warp SIMT stack enforces an implicit synchronization as it forces the worker threads in the warp to wait while this manager thread is acquiring tasks.
DWF executes NVRT incorrectly because it does not enforce this behaviour. With DWF the worker threads incorrectly execute ahead with obsolete task IDs when the manager thread is acquiring new tasks. In general, we observed three problems when running CUDA applications on a DWF-enabled execution model: (1) Applications relying on implicit synchronization in a static warp (e.g. NVRT) execute incorrectly; (2) Starvation eddies may reduce SIMD efficiency; (3) Thread regrouping in DWF increases non-coalesced memory accesses and shared memory bank conflicts.

3.2.1 Warp Barrier

In this section we propose an extension to DWF, called a warp barrier. This mechanism keeps threads in their original static warps until they encounter a divergent branch. After a top-level divergence, threads can freely regroup between diverged warps but a “warp barrier” is created per static warp at the immediate post-dominator of the top-level divergent branch. A top-level divergence is a divergent branch that is not control dependent upon an earlier divergent branch. A dynamic warp may contain threads with different warp barriers. When a dynamic warp reaches a warp barrier, those threads associated with the barrier are restored to their original static warp and wait until all threads from this static warp have arrived at the barrier. The remaining threads in the dynamic warp continue execution using the original DWF mechanisms [56]. The warp barrier mechanism confines starvation eddies between top-level divergence and reconvergence points, while preserving the static warp arrangements reduces memory divergence [107] and shared memory bank conflicts. Warp barriers are distinct from __syncthreads() in CUDA: they are created dynamically only at divergent branches and there is one per static warp rather than one per thread block. Note that the warp barrier only partially addresses the incompatibility of DWF with applications that rely on synchronous behavior in a static warp: DWF with warp barrier is compatible with applications that rely on warp synchronous behavior only in the coherent code region without any branch divergence (e.g., NVRT), and remains incompatible with applications that expect warp synchronous behavior in the presence of branch divergence (e.g., parallel reduction).

[Figure 3.4: DWF with and without warp barrier compared against baseline per-warp SIMT stack reconvergence (PDOM). Bar chart of normalized IPC for DWF, DWF-WB and PDOM on BFS2, FCDT, HOTSP, LPS, MUM, MUMpp, NAMD, NVRT, RAY and WP.]

Figure 3.4 compares the performance of DWF with this warp barrier mechanism (DWF-WB) against the original DWF and the baseline per-warp SIMT stack (PDOM). With the warp barrier, NVRT executes properly and achieves a 60% speedup over PDOM. Three other applications that suffer slowdowns with the original DWF (HOTSP, LPS, NAMD) now achieve speedups. However, MUMpp loses performance with the warp barrier and shows a slight slowdown versus PDOM. In addition, RAY and WP continue to suffer from starvation eddies while using warp barriers. A deeper investigation suggests these applications require additional barriers between the top-level divergence and reconvergence of a static warp. Such barriers are a natural property of SIMT stack-based reconvergence and this led us to propose thread block compaction.

3.3 Thread Block Compaction

In CUDA (OpenCL) threads (work items) are issued to the SIMT cores in a unit of work called a “thread block” (work group). Warps within a thread block can communicate through shared memory and quickly synchronize via barriers. Thread block compaction extends this sharing to exploit control flow locality among threads within a thread block. Warps within a thread block share a block-wide SIMT stack for divergence handling instead of having separate per-warp stacks. At a divergent branch, the warps synchronize and their threads are “compacted” into new warps according to the branch outcome of each thread.
The compacted warps then execute until the next branch or reconvergence point, where they synchronize again for further compaction. Compaction of all the divergent threads after they have reached the reconvergence point will restore their original warp grouping before the divergent branch was encountered.

As threads with a different program counter (PC) value cannot be merged into the same warp, DWF is sensitive to scheduling. When branch divergence occurs, a sufficient number of threads need to be present at the divergent branch to be merged into full warps. Ideally, threads should be encouraged to be clustered at a local region in the kernel program, but variable memory access latency and complex control flow make this hard to achieve. Moreover, even if this scheduling could be achieved, it would cluster memory accesses, discouraging overlap between memory access and computation, and might increase memory latency via increased contention in the memory system.

Thread block compaction simplifies this scheduling problem with block-wide synchronization at divergent branches. This ensures that the maximum number of threads wait at a branch or reconvergence point, while other threads can be scheduled focusing on improving pipeline resource utilization. With the use of a SIMT stack for reconvergence, we can keep track of the warps that will eventually arrive at the reconvergence point and eliminate the starvation eddy problem described in Section 3.2. The synchronization overhead at branches can be covered by switching the execution to a different thread block running on the same SIMT core. In Section 3.6.2, we explore the performance impact of several thread block prioritization policies.

While sharing a single SIMT stack among all warps in a block can in theory reduce performance when threads in different warps follow diverging control flow paths, we find this type of code to be rare.
It tends to occur where CUDA programmers work around the limitation of one concurrent kernel launch at a time on pre-Fermi NVIDIA GPUs by having different warps in a block execute different code following a top-level “if” or “switch” statement. There are also emerging GPU programming models that use warp specialization, using each warp to perform a different subtask in a thread block, to manage memory transfer [11] or to support domain specific languages [12]. If better support were desired for such code, we could employ prior proposals for enabling stack-based reconvergence mechanisms to overlap execution of portions of the same warp that follow different control flow paths [107]. We leave evaluation of such extensions to future work and instead focus on what we observe to be the common case for existing applications.

Example 3 Code that exhibits branch divergence.

    t = threadIdx.x;                 // block A
    flag = (t==1)||(t==6)||(t==7);
    if( flag )
        result = Y;                  // block B
    else
        result = Z;                  // block C
    return result;                   // block D

3.3.1 High-Level Operation

Figure 3.5 illustrates the high-level operation of thread block compaction. The code in Example 3 translates into the control flow graph in Figure 3.5. In this example, each warp contains four threads and each thread block contains eight threads. The numbers in each basic block of the control flow graph (left portion of figure) denote the threads that execute that block. All threads execute block A and D, while only threads (1,6,7) execute block C and only threads (2,3,4,5,8) execute block B.

Two warps composed of threads 1-4, and 5-8 begin executing at the start of block A. Since there is no divergence, there is only a single entry in the block-wide SIMT stack (1 in Figure 3.5). Two warps (2) are created from the active threads (1). The two warps are scheduled on the pipeline independently until they reach the end of block A (3), where they “synchronize” at the potentially divergent branch.
This synchronization allows the hardware to determine if any of the warps has diverged at the branch. As an optimization, the programmer/compiler can statically annotate non-divergent branches (such as bra.uni in the PTX-ISA [123]) in the kernel, allowing the warps to skip synchronization at these branches. After both warps have executed the branch, two new entries (4) will have been pushed onto the stack, each containing the active threads that will execute the “taken” or “not taken” side of the branch (block C or B, respectively). The active threads on the top entry are compacted into a single warp that executes basic block C (5). As this warp reaches the reconvergence point D (6), its entry is popped from the SIMT stack (7), and the active threads that execute basic block B are compacted into two warps (8). After these two warps have reached the reconvergence point D (9), their corresponding entry is popped from the stack, and threads resume execution in two full warps with their original arrangements before the divergent branch (10).

[Figure 3.5: High-level operation of thread block compaction. Panels: the control flow graph (A: threads 1-8; B: threads 2,3,4,5,8; C: threads 1,6,7; D: threads 1-8); the block-wide SIMT stack (columns PC, RPC, Active Threads) at each step, with entries (D, --, 1-8), (B, D, 2 3 4 5 8) and (C, D, 1 6 7) after the divergent branch; the warps after compaction (C: 1 6 7 x; B: 5 2 3 4 and x x x 8); and the execution flow over time, A A C C B B D D with the per-warp SIMT stack versus A A C B B D D with thread block compaction.]

The lower part of Figure 3.5 compares the execution flow of thread block compaction with the baseline per-warp reconvergence mechanism. In this example, thread block compaction compacts threads (1,6,7) into a single warp at basic block C (11).
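The warp counts in this example can be tallied with a short Python sketch. It assumes, as Figure 3.5 suggests, that threads are numbered from 1, that thread t occupies lane (t - 1) mod 4 of static warp (t - 1) div 4, and that every warp issue takes the same time; the helper names are hypothetical.

```python
WARP_SIZE = 4
BLOCKS = {            # threads that execute each basic block in Figure 3.5
    "A": {1, 2, 3, 4, 5, 6, 7, 8},
    "C": {1, 6, 7},
    "B": {2, 3, 4, 5, 8},
    "D": {1, 2, 3, 4, 5, 6, 7, 8},
}


def per_warp_issues(threads):
    """Per-warp stack: every static warp with an active thread issues once."""
    return len({(t - 1) // WARP_SIZE for t in threads})


def compacted_issues(threads):
    """Block-wide compaction with threads kept in their home lanes: the
    number of warps equals the occupancy of the most crowded lane."""
    per_lane = [0] * WARP_SIZE
    for t in threads:
        per_lane[(t - 1) % WARP_SIZE] += 1
    return max(per_lane)


baseline = sum(per_warp_issues(t) for t in BLOCKS.values())
tbc = sum(compacted_issues(t) for t in BLOCKS.values())
print(baseline, tbc, 1 - tbc / baseline)
```

Under these assumptions the baseline issues eight warps and thread block compaction issues seven; the only saving comes from block C, where threads 1, 6 and 7 sit in three different lanes and fit into one warp.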
Compacting these threads reduces the overall execution time by 12.5% over the baseline in this example (seven warp issues instead of eight).

The pushing and popping of the entries on and off the block-wide SIMT stack, as branches and reconvergence points are encountered, uses the same reconvergence points as the per-warp SIMT stack in the baseline SIMT core.

3.3.2 Implementation

[Figure 3.6: Modifications to the SIMT core microarchitecture to implement thread block compaction. N = #warps, B = maximum #threads in a block. S = B ÷ W where W = #threads in a warp. Block diagram of the SIMT core pipeline (fetch, I-cache, decode, warp buffer, scoreboard, issue, register file, ALUs, MEM) with the branch unit holding the block-wide SIMT stacks (fields PC, RPC, WCnt, ActiveMask[1:B]) and the thread compactor built from priority encoders.]

Figure 3.6 illustrates the modifications to the SIMT core microarchitecture to implement thread block compaction. The modifications consist of three major parts: a modified branch unit (1), a new hardware unit called the thread compactor (2), and a modified instruction buffer called the warp buffer (3). The branch unit (1) has a block-wide SIMT stack for each block. Each entry in the stack consists of the starting PC (PC) of the basic block that corresponds to the entry, the reconvergence PC (RPC) that indicates when this entry will be popped from the stack, a warp counter (WCnt) that stores the number of compacted warps this entry contains, and a block-wide active mask that records which threads are executing the current basic block.
The thread compactor (2) consists of a set of priority encoders that compact the block-wide active mask into compacted warps with thread IDs. The warp buffer (3) is an instruction buffer that augments each entry with the thread IDs associated with compacted warps.

To retain compatibility with applications that rely on static warp synchronous behaviour (e.g., reduction in the CUDA SDK), the thread compactor can optionally be disabled to allow warps to retain their static/compile-time arrangement (e.g., when launching a kernel via an extension to the programming API).

In comparison to DWF [56], thread block compaction accomplishes the lookup-and-merge operation of the “warp LUT” and the “warp pool” [56] with simpler hardware. In DWF, an incoming warp is broken down every cycle and the warp LUT has to locate an entry in the warp pool that can merge with the individual threads. In thread block compaction, warps are only broken down at potentially divergent branches and partial warps are accumulated into block-wide active masks. The compaction only occurs once after the active masks have been fully populated and the compacted warps are stored at the warp buffer until the next branch or reconvergence point.

3.3.3 Example Operation

Figure 3.7 presents an example of how the hardware in Figure 3.6 implements thread block compaction. The active mask in the block-wide SIMT stack is divided into groups, each corresponding to one vector lane in all static warps in the thread block (footnote 4). Threads are constrained to stay in their vector lane during warp compaction to avoid the need to migrate register state and to simplify the thread compactor. Each thread can locate its corresponding bit inside its vector lane group via its associated static warp. For example, thread 5 in the first vector lane of static warp W2 corresponds to the second bit of the first group (4).

The active mask in the block-wide SIMT stack is incrementally populated as warps arrive at a branch.
Footnote 4: Example shows 3 warps, but each thread block can have up to 32 warps/1024 threads [122].

[Figure 3.7: Thread block compaction example operation showing how the active mask in the block-wide SIMT stack can be incrementally populated for a divergent branch. The example uses static warps W1 (threads 1-4), W2 (threads 5-8) and W3 (threads 9-12) diverging at the end of block A. Active masks (one 3-bit group per lane) of the two new entries after each arrival: after W2, B = 000 010 010 000 and C = 010 000 000 010; after W3, B = 001 011 010 001 and C = 010 000 001 010; after W1, B = 101 011 010 001 and C = 010 100 101 110. The thread compactor panel shows the priority encoders emitting threads 5, 2, 3, 4 in one cycle and threads 11, 8 in the next.]

Since warps are allowed to execute independently as long as they do not encounter branches or reconvergence points, it is possible for warps to arrive at the branch at the end of block A (1) in an arbitrary order. In the example, warp (W2) arrives at the branch first and creates two target entries on the stack (2). Each thread in the warp updates one of these entries based upon its branch outcome. For instance, since thread 5 goes to block C (3), it updates the first new entry on the stack (5). On the other hand, since thread 6 goes to block B (6), it updates the second new entry on the stack (8).
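The population step described above can be sketched in Python as follows. The sketch assumes the 3-warp, 4-lane configuration of Figure 3.7 with threads numbered from 1, and takes the branch outcomes from the thread lists in that figure; the class and function names are hypothetical and do not correspond to simulator code.

```python
NUM_WARPS, WARP_SIZE = 3, 4


def bit_position(tid):
    """Map a 1-based thread ID to (lane group, bit within the group).
    The group is the thread's home lane; the bit is its static warp."""
    return (tid - 1) % WARP_SIZE, (tid - 1) // WARP_SIZE


class StackEntry:
    """One block-wide SIMT stack entry with a lane-grouped active mask."""

    def __init__(self, pc, rpc):
        self.pc, self.rpc = pc, rpc
        # One group per lane, one bit per static warp.
        self.mask = [[0] * NUM_WARPS for _ in range(WARP_SIZE)]

    def set_thread(self, tid):
        group, bit = bit_position(tid)
        self.mask[group][bit] = 1

    def mask_str(self):
        return " ".join("".join(map(str, g)) for g in self.mask)


def warp_arrives(warp_id, goes_to_c, entry_c, entry_b):
    """Each thread of the arriving static warp sets its bit in the entry
    that matches its branch outcome."""
    first = (warp_id - 1) * WARP_SIZE + 1
    for tid in range(first, first + WARP_SIZE):
        (entry_c if tid in goes_to_c else entry_b).set_thread(tid)


goes_to_c = {2, 3, 4, 5, 8, 11}          # outcomes read from Figure 3.7
entry_b, entry_c = StackEntry("B", "D"), StackEntry("C", "D")
for warp_id in (2, 3, 1):                # arrival order in the example
    warp_arrives(warp_id, goes_to_c, entry_c, entry_b)
    print(f"after W{warp_id}: B={entry_b.mask_str()}  C={entry_c.mask_str()}")
```

Because each thread owns a fixed bit, the order in which warps arrive does not affect the final masks, which is what lets the hardware accumulate partial warps without any lookup.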
In subsequent cycles, the other warps (W3 and W1) arrive at the branch and update the active masks of the two new stack entries (9 and 10, bits updated at each warp’s arrival are shown in bold).

As each warp arrives at the potentially divergent branch, WCnt of the original TOS entry is decremented to keep track of pending warps. When WCnt reaches zero (11), the TOS pointer increments to point at the new top target entry. The active mask of this new TOS entry is sent to the thread compactor for warp generation (12). WCnt of this entry is also updated to record the number of compacted warps to be generated (calculated as the maximum, over all lanes, of the number of set bits in each lane’s group of the active mask). A single block-wide active mask can generate at most as many compacted warps as the static warps in the executing thread block.

For branches that NVIDIA’s CUDA compiler marks as potentially divergent, we always create two entries for taken and not-taken outcomes when the first warp reaches a branch even if that warp is not divergent. If all threads in a block branch to the same target, one of the entries will have all bits set to zero and will be immediately popped when it becomes the top of stack.

Figure 3.7 also shows the operation of the thread compactor (13). The block-wide active mask sent to the thread compactor is stored in multiple buffers, each responsible for threads in a single home vector lane [56]. Each cycle, each priority encoder selects at most one thread from its corresponding inputs and sends the ID of this thread to the warp buffer (4 in Figure 3.6). The bits corresponding to the selected threads will be reset, allowing the encoders to select from the remaining threads in subsequent cycles.

When the compacted warps encounter another divergent branch, the process described above repeats, pushing new entries onto the block-wide SIMT stack. Eventually, as each of these compacted warps reaches the reconvergence point (block D in this example), WCnt of the block C entry decrements.
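The thread compactor described in this subsection can be modelled behaviourally with the Python sketch below. It assumes that each priority encoder favours the lowest-numbered static warp (the priority order is not stated in the text) and that threads are numbered from 1. The input is the fully populated block C mask from Figure 3.7, with one group per lane and one bit per static warp; the function names are hypothetical.

```python
WARP_SIZE = 4


def thread_id(lane, warp):
    """1-based thread ID of the thread in `lane` of static warp `warp`
    (both 0-based)."""
    return warp * WARP_SIZE + lane + 1


def compacted_warp_count(mask):
    """WCnt of the entry: the largest number of set bits in any lane group."""
    return max(sum(group) for group in mask)


def compact(mask):
    """Emit one compacted warp per cycle. Each lane's priority encoder picks
    the first set bit of its group, outputs that thread ID and resets the bit."""
    pending = [list(group) for group in mask]    # per-lane buffers (copied)
    warps = []
    while any(any(group) for group in pending):
        warp = []
        for lane, group in enumerate(pending):
            if 1 in group:
                w = group.index(1)               # priority encoder
                group[w] = 0                     # selective reset
                warp.append(thread_id(lane, w))
            else:
                warp.append(None)                # lane idle this cycle
        warps.append(warp)
    return warps


block_c_mask = [[0, 1, 0], [1, 0, 0], [1, 0, 1], [1, 1, 0]]
print(compacted_warp_count(block_c_mask))
for warp in compact(block_c_mask):
    print(warp)
```

With this mask the sketch produces two warps, threads (5, 2, 3, 4) followed by threads (11, 8) in lanes three and four, which agrees with the encoder outputs shown in Figure 3.7. Keeping every thread in its home lane is what allows one independent encoder per lane.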
When this WCnt reaches zero, the entry is popped (TOS pointer decrements). The active mask of the new TOS entry is sent to the thread compactor to generate warps that execute block B, and WCnt of this entry is updated accordingly. After these compacted warps have reached the reconvergence point as well, the block-wide stack is popped again, shifting execution to the top-level entry. Similarly, the full active mask of this entry is sent to the thread compactor, with its WCnt updated to the corresponding warp count in preparation for the next divergent branch.

3.4 Likely-Convergence Points (LCP)

The post-dominator (PDOM) stack-based reconvergence mechanism [56, 59] uses reconvergence points identified using a unified algorithm rather than by translating control flow idioms in the source code into instructions [5, 30, 100]. The immediate post-dominator of a divergent branch selected as the reconvergence point is the earliest point in a program where the divergent threads are guaranteed to reconverge. In certain situations, threads can reconverge at an earlier point, and if hardware can exploit this, it would improve SIMD efficiency. We believe this observation motivates the inclusion of the break instruction in recent NVIDIA GPUs [30].

Example 4 Example for likely-convergence.

    while (i < K) {
        X = data[i];                 // block A
        if( X == 0 )
            result[i] = Y;           // block B
        else if ( X == 1 )           // block C
            break;                   // block D
        i++;                         // block E
    }
    return result[i];                // block F

The code in Example 4 exhibits this earlier reconvergence. It results in the control flow graph in Figure 3.8 where edges are marked with the probability with which individual scalar threads follow that path. Block F is the immediate post-dominator of A and C since F is the first location where all paths starting at A (or C) coincide. In the baseline mechanism, when a warp diverges at A, the reconvergence point is set to F.
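To make the post-dominator claim concrete, the Python sketch below computes post-dominator sets for a control flow graph by iterating to a fixed point and then picks the immediate post-dominator. The edge list is inferred from Example 4 and is an assumption about how the loop is lowered (in particular the back-edge from E to A and the loop exit from E to F); the function names are hypothetical.

```python
def post_dominators(succ, exit_node):
    """Iterative data-flow solution: a node's post-dominators are itself plus
    the nodes that post-dominate all of its successors."""
    nodes = set(succ)
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in sorted(nodes - {exit_node}):
            new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
            if new != pdom[n]:
                pdom[n], changed = new, True
    return pdom


def immediate_post_dominator(pdom, node):
    """The closest strict post-dominator: the one that is post-dominated by
    every other strict post-dominator of `node`."""
    strict = pdom[node] - {node}
    for candidate in strict:
        if strict <= pdom[candidate]:
            return candidate
    return None


# Assumed control flow graph for Example 4 (successor lists).
cfg = {
    "A": ["B", "C"],
    "B": ["E"],
    "C": ["D", "E"],
    "D": ["F"],        # break leaves the loop
    "E": ["A", "F"],   # loop back-edge or loop exit
    "F": [],
}
pdom = post_dominators(cfg, "F")
for branch in ("A", "C"):
    print(branch, immediate_post_dominator(pdom, branch))
```

For this graph both divergent branches report F, because the path through D bypasses E. Block E post-dominates only B, which is why E cannot serve as a guaranteed reconvergence point for A or C.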
However, the path from C to D is rarely followed and hence in most cases threads can reconverge at E.

We extend the PDOM SIMT stack with likely-convergence points to capture the potential performance benefits from such “probabilistic” reconvergence. We add two new fields to each stack entry: one for the PC of the likely-convergence point (LPC) and the other (LPos), a pointer that records the stack position of a special likely-convergence entry created when a branch has a likely-convergence point that differs from the immediate post-dominator. The likely-convergence point of each branch can be identified with either control flow analysis or profile information (potentially collected at runtime). Figure 3.8 shows a warp that diverges at A (1). When the divergence is detected, three entries are pushed onto the stack. The first entry (2) is created for the likely-convergence point E (footnote 5). Two other entries for the taken and fall through of the branch are created as in the baseline mechanism. The warp diverges again at C (3), and two (footnote 6) new entries are created (4). Execution continues with the top entry until it reaches E, and the likely-convergence point is detected since PC == LPC.
When this occurs, the top entry is popped and merged with the likely-convergence entry (5) as the LPos field indicates. When thread 3 and 4 reach F (6), since PC == RPC, the stack is popped (7). Thread 1 then executes B (8) and its entry is popped at E (9), when PC == LPC. Finally, the likely-convergence entry executes until it reaches the immediate post-dominator, where it is popped (10).

Footnote 5: In our experimental evaluation, we restricted likely-convergence points to the closest enclosing backwards taken branch to capture the impact of “break” statements within loops [30].

Footnote 6: Since likely-convergence and immediate post-dominator are the same.

[Figure 3.8: Likely-convergence points improve individual warp SIMD efficiency by reconverging before the immediate post-dominator. Panels: the control flow graph of Example 4 (blocks A to F, edges annotated with branch probabilities) and snapshots of the PDOM SIMT stack with likely-convergence (columns PC, RPC, LPC, LPos, active threads) at steps 1 to 10.]

[Figure 3.9: Performance and resource impact of likely-convergence points on both baseline per-warp SIMT stack (PDOM) and thread block compaction (TBC). Panels: speedup vs. no LCP (PDOM, TBC) and maximum stack usage (PDOM, PDOM-LCP, TBC, TBC-LCP) for NVRT and MUM.]

Both the baseline per-warp SIMT stack (PDOM) and the block-wide stack used in thread block compaction (TBC) can be extended with LCP. Figure 3.9 shows the performance impact of extending PDOM and TBC with likely-convergence points (-LCP). Only data for MUM and NVRT are shown because we have only identified likely-convergence points that differ from the immediate post-dominators in these two applications. The impact of LCP for MUM is minimal: 2% speedup for TBC and 5% for PDOM.
In contrast, it greatly benefits NVRT: 30% speedup for TBC, 14% for PDOM. Although LCP pushes an extra entry onto the stack for each divergent branch applicable, it can reduce the stack capacity requirement if multiple divergent entries are merged into fewer likely-convergence entries. This happens in most cases, except for TBC running MUM, where the unused likely-convergence entries increase the maximum stack usage.

3.5 Methodology

We model our proposed hardware changes using a modified version of GPGPU-Sim (version 2.1.1b) [10]. The version of GPGPU-Sim modeling our proposed hardware changes is available online [63]. We evaluate the performance of various hardware configurations on the CUDA benchmarks listed in Table 3.1. Most of these benchmarks are selected from Rodinia [28] and the benchmarks used by Bakhoda et al. [10]. We did not exclude any benchmarks due to poor performance on thread block compaction but excluded some Rodinia benchmarks that do not run on our infrastructure due to their reliance on undocumented behaviour of barrier synchronization in CUDA. We also use some benchmarks from other sources:

Face Detection is a part of Visbench [104]. We applied the optimizations [103] recommended by Aqeel Mahesri to improve its SIMD efficiency.

MUMMER-GPU++ improved MUMMER-GPU [139], reducing data transfers with a novel data structure [65].

NAMD is a popular molecular dynamics simulator [127].

Ray Tracing (Persistent Threads) dynamically distributes workload in software to mitigate hardware inefficiencies [4]. We render the “Conference” scene.

Table 3.1: Benchmarks in thread block compaction evaluation.

Name | Abbr. | BlockDim | #Instr. | Blocks/core
Divergent Set
BFS Graph Traversal [28] | BFS2 | (512x1x1), (256x1x1) | 28M | 2
Face Detection [104] | FCDT | 2x(32x6x1) | 1.7B | 5,4
HotSpot [28] | HSP | (16x16x1) | 157M | 2
3D Laplace Solver [10] | LPS | (32x4x1) | 81M | 6
MUMMER-GPU [10] | MUM | (256x1x1) | 69M | 4
MUMMER-GPU++ [65] | MUMpp | (192x1x1) | 140M | 3
NAMD [127] | NAMD | 2x(64x1x1) | 3.8B | 7
Ray Tracing (Persistent Threads) [4] | NVRT | (32x6x1) | 700M | 3
Coherent Set
AES Cryptography [10] | AES | (256x1x1) | 30M | 2
Back Propagation [28] | BACKP | 2x(16x16x1) | 193M | 4,4
Coulumb Potential [10] | CP | (16x8x1) | 126M | 8
gpuDG [10] | DG | (84x1x1), (112x1x1), (256x1x1) | 569M | 4,5,6
Heart Wall Detection [28] | HRTWL | (512x1x1) | 8.9B | 1
LIBOR [10] | LIB | 2x(64x1x1) | 162M | 8,8
Leukocyte [28] | LKYT | (175x1x1) | 6.8B | 5,5,1
Merge Sort [28] | MGST | (96x1x1), 2x(32x1x1), 2x(128x1x1), 2x(256x1x1), (208x1x1) | 2.3B | 1,3,8,3,4,2,4
NN cuda [28] | NNC | (16x1x1) | 6M | 5,8,8,8
Ray Tracing [10] | RAY | (16x8x1) | 65M | 3
Stream Cluster [28] | STMCL | (512x1x1) | 941M | 2
StoreGPU [10] | STO | (128x1x1) | 131M | 1
Weather Prediction [10] | WP | (8x8x1) | 216M | 4

Our modified GPGPU-Sim is configured to model a GPU similar to NVIDIA’s Quadro FX5800, with the addition of L1 data caches and an L2 unified cache similar to NVIDIA’s Fermi GPU architecture [121]. We configure the L2 unified cache to be significantly larger than that on Fermi (1 MB vs. 128 kB per memory channel) to make dynamic warp formation more competitive against thread block compaction. In Section 3.6.4, we explore the sensitivity of both techniques to changes in memory system by reducing the L2 cache to 128 kB per memory channel.
Table 3.2 shows the major configuration parameters.

Table 3.2: GPGPU-Sim configuration for thread block compaction evaluation

# Streaming Multiprocessors | 30
Warp Size | 32
SIMD Pipeline Width | 8
Number of Threads / Core | 1024
Number of Registers / Core | 16384
Shared Memory / Core | 16KB
Constant Cache Size / Core | 8KB
Texture Cache Size / Core | 32KB, 64B line, 16-way assoc.
Number of Memory Channels | 8
L1 Data Cache | 32KB, 64B line, 8-way assoc.
L2 Unified Cache | 1MB/Memory Channel, 64B line, 64-way assoc.
Compute Core Clock | 1300 MHz
Interconnect Clock | 650 MHz
Memory Clock | 800 MHz
DRAM request queue capacity | 32
Memory Controller | out of order (FR-FCFS)
Branch Divergence Method | PDOM [56]
Warp Scheduling Policy | Loose Round Robin
GDDR3 Memory Timing | tCL=10 tRP=10 tRC=35 tRAS=25 tRCD=12 tRRD=8
Memory Channel BW | 8 (Bytes/Cycle)

3.6 Experimental Results

Figure 3.10 shows the performance of TBC and DWF relative to that of PDOM. TBC uses likely-convergence points whereas PDOM does not, and DWF and DWF-WB use majority scheduling [56]. For the divergent (DIVG) benchmark set, TBC has an overall 22% speedup over PDOM. Much of its performance benefit is attributed to speedups on the applications that have very low baseline SIMD efficiency (BFS2, FCDT, MUM, MUMpp, NVRT). While DWF can achieve speedups on these benchmarks as well (except NVRT, which fails to execute), it also exhibits slowdowns for the other benchmarks that have higher baseline SIMD efficiency (HOTSP, LPS, NAMD), lowering the overall speedup to 4%. DWF with warp barrier (DWF-WB) recovers from most of this slowdown and executes NVRT properly, but loses much of the speedup on MUMpp. Overall, TBC is 17% faster than the original DWF and 6% faster than DWF-WB.

Applications in the coherent (COHE) benchmark set are not significantly affected by branch divergence, hence we do not anticipate significant benefits from DWF or TBC.
DWF suffers significant slowdowns on some applications in this benchmark set (HRTWL, LKYT, RAY, STO and WP), due to starvation eddies and extra memory stalls from thread regrouping (see Section 3.6.1). DWF-WB recovers much of this slowdown; however, RAY and WP still suffer from the starvation eddy problem.

[Figure 3.10: Performance of thread block compaction (TBC) and dynamic warp formation (DWF) relative to baseline per-warp post-dominator SIMT stack (PDOM) for the DIVG and COHE benchmark sets. Bar charts of normalized IPC for PDOM, DWF, DWF-WB and TBC on each benchmark, with summary bars HM DIVG and HM COHE.]

Across all benchmarks, TBC obtains an overall 10% speedup over PDOM. The performance benefits of DWF-WB are mostly offset by slowdowns in other applications, making it perform evenly with PDOM.

3.6.1 In-Depth Analysis

Figure 3.11(a) and 3.12(a) show the breakdown of the SIMT core cycle for both DIVG and COHE benchmark sets. At each cycle, the SIMT core can either issue a warp containing a number of active threads (Wn-m means between n and m threads are enabled when a warp issues), be stalled by the downstream pipeline stages (Stall), or not issue any warp because none are ready in the I-Buffer/Warp-Buffer (W0_Mem if the warps are held back by pending memory accesses, and W0_Idle otherwise). This data shows that in some applications (HOTSP, NAMD, HRTWL, MGST, RAY and WP), starvation eddies cause DWF to introduce extra divergence, turning some of the warps with more active threads (W29-32) into warps with fewer active threads. In other applications (e.g. LPS, MUM and NAMD), the extra stalls from DWF undermine the benefit of merging divergent threads into warps. DWF-WB reduces stalls and divergence versus DWF in these applications, but the starvation eddy problem persists with DWF-WB for RAY and WP.

[Figure 3.11: Detailed performance data of TBC and DWF relative to baseline (PDOM) for the DIVG benchmark set. (a) Core cycle breakdown: normalized core cycles for PDOM, DWF-WB, DWF and TBC on each benchmark, split into W29-32, W25-28, W21-24, W17-20, W13-16, W9-12, W5-8, W1-4, W0_Mem, W0_Idle and Stall. (b) Memory pipe stall breakdown.]

TBC can usually improve SIMD efficiency as well as DWF-WB, and it does not introduce significant extra stalls or divergence. However, the synchronization overhead at branches can introduce extra W0_Idle and W0_Mem cycles. This is the main reason why DWF-WB performs better than TBC for HOTSP, LPS, NAMD and NVRT. For NAMD and NVRT, TBC achieves a lower SIMD efficiency than DWF-WB because DWF can form warps from any threads within a SIMT core, while TBC can only do so from threads within a thread block.
BFS2, MUM, and672.1Normalized Core Cycles0%25%50%75%100%125%150%PDOMDWF-WBDWFTBCPDOMDWF-WBDWFTBCPDOMDWF-WBDWFTBCPDOMDWF-WBDWFTBCPDOMDWF-WBDWFTBCPDOMDWF-WBDWFTBCPDOMDWF-WBDWFTBCPDOMDWF-WBDWFTBCPDOMDWF-WBDWFTBCPDOMDWF-WBDWFTBCPDOMDWF-WBDWFTBCPDOMDWF-WBDWFTBCPDOMDWF-WBDWFTBCAESBACKPCPDGHRTWLLIBLKYTMGSTNNCRAYSTMCLSTOWP12X6X13XMemory Pipe Stall0% 100% 200% 300%4.0W29-32 W25-28 W21-24 W17-20W13-16 W9-12 W5-8 W1-4W0_Mem W0_Idle StallConstTextureSharedMemCoalesceStallGlobalLocalMemRC(a) Core Cycle Breakdown (b) Memory Pipe Stall BreakdownFigure 3.12: Detail performance data of TBC and DWF relative to baseline(PDOM) for the COHE benchmark set.68MGST transition from compute-bound to memory-bound with TBC (indicated bythe extra W0 Mem for them), limiting its benefit. TBC is more prone to beingmemory-bound than DWF because it requires warps to synchronize at every diver-gent branch for compaction. These extra block-wide synchronizations reduce thenumber of warps available to cover the long memory access latency in GPUs. Rhuand Erez [130] tackle this issue by extending TBC with a compaction-adequacypredictor (CAPRI) to avoid unnecessary synchronizations for branches that do notbenefit from compaction (see Chapter 7.1).In Figure 3.11(a), TBC has the total number of core cycles as DWF for BFS2while Figure 3.10 shows that TBC has a higher IPC than DWF. This is becauseIPC in Figure 3.10 is calculated using the total execution time, whereas core cyclesonly account for the time when a SIMT core is running at least one thread. BFS2features a series of relatively short kernel launches, and the load imbalance betweendifferent SIMT cores causes the total execution time of DWF to be longer thanTBC.Figure 3.11(b) and Figure 3.12(b) show the breakdown of memory pipelinestalls modeled in our simulator normalized to the baseline and highlights that DWFintroduces extra memory stalls. 
TBC introduces far fewer extra memory stalls. These extra stalls do not outweigh the benefits of TBC to control flow efficiency on several applications (e.g. FCDT, MUMpp, NVRT and HRTWL). Section 3.6.3 shows how the L1 data cache in each SIMT core absorbs the extra memory accesses generated by TBC, leaving the memory subsystem undisturbed.

3.6.2 Thread Block Prioritization

Figure 3.13 compares the performance of TBC among different thread block prioritization policies. A thread block prioritization policy sets the scheduling priority among warps from different thread blocks, while warps within a thread block are always scheduled with a loose round-robin policy [56]:

Age-based (AGE) The warps from the oldest thread block (in the order that thread blocks are dispatched to a SIMT core) have the highest priority. This tries to stagger different thread blocks, encouraging them to overlap each other’s synchronization overhead at branches.

Round-robin (RRB) Thread block priority rotates every cycle, encouraging warps from different thread blocks to interleave execution.

Sticky round-robin (SRR) Warps in a thread block that is currently issuing warps retain highest priority until none of the thread block’s warps are ready for issue. Then, the next thread block gets highest priority.

Figure 3.13: Performance of TBC with various thread block prioritization policies relative to PDOM for the DIVG and COHE benchmark sets. [Bar charts of normalized IPC for PDOM, AGE, RRB and SRR.]

Overall, AGE (default policy for TBC in this work) achieves the highest performance, but it can leave the SIMT cores with a lone-running thread block near the end of a kernel launch. The lack of interleaving reduces overall performance for LPS, NAMD and LIB. RRB encourages even progress among thread blocks, but increases memory system contention in STMCL.
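The three policies listed above differ mainly in where the search for a ready thread block starts. The following is a minimal sketch of that idea, not the scheduler used in this work: `pick_block`, its parameters, and the assumption that thread blocks are indexed in dispatch order (index 0 is the oldest) are all hypothetical, and the loose round-robin selection among warps inside the chosen block is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Policy { AGE, RRB, SRR };

// Returns the index of the thread block whose warps get priority this cycle,
// or -1 if no block has a ready warp.
// ready[b]    : block b has at least one warp ready to issue (assumed input)
// cycle       : current cycle count, used only by RRB
// last_issued : block that issued most recently, used only by SRR
int pick_block(Policy policy, const std::vector<bool>& ready,
               std::size_t cycle, int last_issued) {
    const int n = static_cast<int>(ready.size());
    assert(n > 0 && last_issued >= 0 && last_issued < n);

    int start = 0;  // AGE: always search from the oldest block
    if (policy == Policy::RRB) {
        start = static_cast<int>(cycle % ready.size());  // priority rotates every cycle
    } else if (policy == Policy::SRR) {
        start = last_issued;  // stay with the issuing block while it has ready warps
    }

    for (int i = 0; i < n; ++i) {
        const int b = (start + i) % n;
        if (ready[b]) return b;
    }
    return -1;
}
```

With four blocks of which only block 0 is stalled, AGE selects block 1, RRB selects whichever block the rotation currently points at, and SRR keeps selecting the last issuing block until that block runs out of ready warps.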
SRR provides the most robust performance among the policies, with its performance consistently staying above 95% of the PDOM performance for all evaluated benchmarks.

Figure 3.14: Average memory traffic of a SIMT core for TBC relative to baseline (PDOM). [Bar chart of memory traffic normalized to PDOM for TBC-AGE, TBC-RRB and TBC-SRR; two off-scale bars are labeled 2.67x and 2.12x.]

3.6.3 Impact on Memory Subsystem

Figure 3.14 shows the average memory traffic (in bytes for both reads and writes) between a SIMT core and the memory subsystem outside the core with TBC normalized to PDOM. Traffic does not increase significantly with TBC (within 7% of PDOM), indicating that the L1 data cache has absorbed the extra memory pressure due to TBC. If memory accesses from a static warp would have coalesced into a 128-Byte chunk, and this static warp is compacted into multiple dynamic warps by TBC upon a branch divergence, then they will access the same blocks in the L1 data cache. The L1 data cache can generally capture this data locality, and no extra memory accesses will be sent out from the SIMT core.

Memory traffic of DG increases by 2.67× due to increased texture cache misses when using AGE-based prioritization (which tends to reduce interleaving from different thread blocks). TBC combined with the RRB policy reduces traffic of DG by increasing the texture cache hit rate. Similarly, memory traffic of LKYT increases by 2.12× with SRR due to increased texture cache misses. In both cases, the extra memory traffic hits in the L2 cache and has little impact on the overall performance.

3.6.4 Sensitivity to Memory Subsystem

Figure 3.15 shows the speedup of TBC and DWF over PDOM, but with smaller L2 caches (128 kB, 8-way per memory channel). The relative performance between the different mechanisms remains unchanged for the COHE benchmarks. Two of
the DIVG benchmarks (MUM and MUMpp) become more memory-bound with a less powerful memory system, lowering the speedup of TBC (15% with RRB and 13.5% with AGE for the DIVG benchmarks). Smaller L2 caches reduce DWF’s performance on MUMpp from 28% speedup to 4% slowdown. In comparison, the speedups with DWF-WB and TBC remain robust to the change in the memory system.

Figure 3.15: Average speedup of TBC and DWF over baseline (PDOM) with a smaller (128 kB/Memory Channel) L2 cache. The speedups are normalized to PDOM with the same L2 cache capacity. [Bar chart of normalized IPC (HM DIVG and HM COHE) for PDOM, DWF, DWF-WB, TBC-AGE, TBC-SRR and TBC-RRB, with 128 kB and 1 MB L2 banks.]

3.7 Implementation Complexity

Most of the implementation complexity for thread block compaction is the extra storage for thread IDs in the warp buffer, and the area overhead in relaying these IDs down the pipeline for register accesses. The register file in each SIMT core also needs to be banked per lane as in dynamic warp formation [56] to support simultaneous accesses from different vector lanes to different parts of the register file.

The scheduler complexity that DWF imposes is mostly eliminated via the block-wide SIMT stacks in thread block compaction. The bookkeeping for thread grouping is done via active masks in the stack entries. The active masks can be stored in a common memory array. Each entry in this memory array has T bits (T = max #threads supported on a SIMT core). The T bits in each entry are divided among the multiple thread blocks running on a SIMT core. In this way, the total #bits for active mask payload does not increase over the baseline per-warp stacks.

Table 3.3: Maximum stack usage for TBC-LCP

DIVG               COHE
App     #Entries   App     #Entries   App     #Entries
BFS2    3          AES     1          NNC     3
FCDT    5          BACKP   2          RAY     9
HOTSP   2          CP      1          STMCL   3
LPS     5          DG      2          STO     1
MUM     13         HRTWL   5          WP      8
MUMpp   14         LIB     1
NAMD    5          LKYT    3
NVRT    5          MGST    38

One potential challenge to TBC is that the block-wide stack can in the theoretical worst case be deeper than with per-warp stacks.
Table 3.3 shows the stack usage for TBC with likely-convergence points across all the applications we study. Most applications use fewer than 16 entries throughout their runtime. For exceptions such as MGST, the bottom of the stack could potentially be spilled to memory. We synthesized the 32-bit priority encoders used in thread compactors (sufficient for the maximum thread block size of 1024 threads [122]) in 65 nm technology and found their aggregate area to be negligible (<< 1 mm²).

3.8 Summary

In this chapter, we proposed thread block compaction, a novel mechanism that uses a block-wide SIMT stack shared by all threads in a thread block to exploit their control flow locality. Warps run freely until they encounter a divergent branch, where the warps synchronize, and their threads are compacted into new warps. At the reconvergence point, the compacted warps synchronize again to resume in their original arrangements before the divergence. We found that our proposal addresses some key challenges of dynamic warp formation [56]. Our simulation evaluation quantifies that it achieves an overall 22% speedup over a per-warp SIMT stack baseline for a set of divergent applications, while introducing no performance penalty for a set of control flow coherent applications.

Chapter 4
Kilo TM: Hardware Transactional Memory for GPU Architectures

In this chapter, we explore how to support a transactional memory (TM) [73, 75] programming model on GPU architectures. This exploration was motivated by the challenge of managing the irregular, fine-grained communications between threads in GPU applications with ample irregular parallelism. We believe TM can ease this challenge, and perform a limit study in Section 4.1, showing that a set of GPU-TM applications can perform nearly as well as their fine-grained locking analogs with an idealized TM system. The rest of this chapter (and Chapter 6) focuses on enabling TM on GPU efficiently.
A version of this chapter has been published earlier [57].

Our proposed solution, Kilo TM, is the first hardware TM proposal for GPU architectures. The heavily multithreaded nature of GPU introduces a new set of challenges to TM system designs. Instead of running tens of concurrent transactions with relatively large footprint – the focus of much recent research on TM for multicore processors – Kilo TM aims to scale to tens of thousands of small concurrent transactions. This reflects the heavily multithreaded nature of GPU, with tens of thousands of threads working in collaboration, each performing a small task towards a common goal. These small transactions are tracked at word-level granularity, enabling finer resolution of conflict detection than cache blocks. Moreover, each per-core private cache in a GPU is shared by hundreds of GPU threads. This drastically reduces the benefit of leveraging a cache coherence protocol to detect conflicts, a technique employed on most HTMs designed for traditional CMPs with large CPU cores.

Kilo TM employs value-based conflict detection [37, 124] to eliminate the need for global metadata for conflict detection. Each transaction simply reads the existing data in global memory for validation – to determine if it has a conflict with another committed transaction. This validation leverages the highly parallel nature of the GPU memory subsystem, avoids any direct interaction between conflicting transactions, and detects conflicts at the finest granularity. However, a naive implementation of value-based conflict detection requires transactions to commit serially. To boost commit parallelism, we have taken inspiration from existing TM systems [27, 149] and extended them with innovative solutions. In particular, we have introduced the recency bloom filter, a novel data structure that uses the notion of time and order to compress a large number of small item sets.
We use this structure to compress the write-sets of all committing transactions in Section 4.2.5. Each transaction queries the recency bloom filter for an approximate set of conflicting transactions. Kilo TM uses this approximate information to schedule hundreds of non-conflicting transactions for validation and commit in parallel. Using the recency bloom filter to boost transaction commit parallelism is an integral part of Kilo TM.

This chapter also proposes a simple extension to the SIMT hardware to handle control flow divergence due to transaction aborts. This extension, described in Section 4.2.1, is independent of other design aspects of Kilo TM. Despite its simplicity, it is a necessary piece for supporting TM on GPUs.

Our evaluation with a set of GPU-TM applications shows that Kilo TM captures 59% of the performance of fine-grained locking. We find that Kilo TM outperforms fine-grained locking for low contention applications that require acquiring multiple locks to enter a critical section. On the other hand, we find that TM applications ported from CPU-optimized versions can perform poorly on GPUs regardless of the data synchronization mechanism used (fine-grained locking or TM). Optimizing these applications for GPUs would involve redesigning the algorithm and data structures to expose more thread-level parallelism. Our estimation with CACTI [144] indicates that implementing Kilo TM on an NVIDIA Fermi GPU [121] would increase area by only 0.5%, a small overhead for the large increase in programmability.

The rest of this chapter is organized as follows: Section 4.1 motivates TM on GPUs, and outlines the challenges in adopting prior HTMs on GPUs. Section 4.2 describes Kilo TM, our TM design for a GPU that supports 1000s of concurrent transactions. Section 4.3 describes our methodology and benchmark applications. Section 4.4 presents our evaluation results.
Section 4.5 summarizes this chapter.

4.1 Transactional Memory on GPU: Opportunities and Challenges

Atomic operations on current GPUs enable implementation of locks, allowing complex irregular algorithms [23]. Fine-grained locking enables higher concurrency in applications, but requires the application developer to consider all possible interactions between locks to ensure deadlock-free code – a challenging task [94]. With tens of thousands of threads running concurrently on a GPU, the number of possible interactions among these fine-grained locks can be overwhelming in practical applications. This problem is well known to the supercomputing community and has inspired special debugging tools to summarize thread behaviours for deadlock/data-race analysis [8].

In this chapter, we propose to increase support for irregular algorithms on GPUs by extending GPU architecture to support TM [75]. While originally proposed for CPUs, we find TM to be a natural extension to the existing GPU/CUDA programming model. From a programmer’s perspective, a transaction is executed as an atomic block of code in isolation. A thread in a transaction is never blocked waiting to synchronize with another thread. This is important because a CUDA/OpenCL application can launch many more threads than the GPU hardware can concurrently execute. Like transactions, thread execution sequencing is abstracted away in the CUDA programming model. The hardware thread schedulers on current GPUs can execute transactions with simple extensions.

In addition to the traditional deadlock problem, GPU application developers have to deal with interactions between the SIMT stack and atomic operations. Example 5 shows how a critical section may be guarded by the acquisition and release of a fine-grained lock (lines A and C) on the CPU. On a GPU, this code may deadlock [1]. This can happen if the threads in the same warp attempt to acquire the same lock at line A. For example, consider a warp with two threads, T0 and T1, both trying to acquire the same lock. T0 succeeds and exits the loop, but waits at the start of the critical section (line B) for reconvergence, while T1 still spins in the loop (see inset at right in Example 5). T1 will continue spinning and waiting for the lock held by T0 and never exit, forming a deadlock. To remove the deadlock, the program needs to be modified; Example 6 shows a typical solution. This issue is known among GPU application developers [1] and explored in more detail by Ramamurthy [129]. With TM, the GPU hardware can be designed to handle such interactions between transactions and the SIMT stack (see Section 4.2.1).

Example 5 CPU spin-lock code. CAS = compare-and-swap.

    A: while(CAS(lock,0,1)==1);
    B: // Critical Section ...
    C: lock = 0;

[Inset: SIMT stack (PC, RPC and active mask for T0 and T1) in the GPU deadlock example.]

Example 6 Spin-lock implementation on GPU to avoid deadlock due to implicit synchronization in warps [1].

    A: done = 0;
    B: while(!done){
    C:   if(CAS(lock,0,1)==0){
    D:     // Critical Section ...
    E:     lock = 0;
    F:     done = 1;
    G:   }
    H: }

Figure 4.1 compares the performance of a set of GPU TM applications (described in Section 4.3) running on an ideal GPU TM system against fine-grained lock versions of the applications. In this ideal TM system, TM overheads related to detecting conflicting transactions are removed. Each committing transaction can instantaneously abort all of its conflicting transactions to resolve its conflicts, and it can update the memory with its write-set in a single cycle. This ideal TM retains the overheads related to re-executing aborted transactions. The performance shown is normalized to that obtained by serializing all transaction executions via a single global lock.
On average, the applications running on ideal TM achieve 279× speedup over serializing all transactions, which is 24% faster than fine-grained locking. In BH, ideal TM is slightly slower than fine-grained locking because the thread that failed to acquire a lock can just wait until the lock is freed, whereas an aborted transaction requires full re-execution from the start, producing wasted work. In CC, ideal TM is slower because each pixel in the fine-grained lock version uses atomic add operations to update its neighbouring pixels. These atomic operations are implemented with special hardware that performs computation directly with data in the L2 cache (as explained in Section 2.3.3). The TM version, on the other hand, has to fetch the data into the core for the same computation.

Table 4.1 shows the IPC of these applications with the ideal TM system and fine-grained locking. For many of our applications, Ideal TM has lower IPC than fine-grained locking even though it has the lower execution time. This is because the fine-grained lock versions of these applications execute extra instructions to acquire/release locks and to spin when the acquisition fails. For other applications, re-execution of aborted transactions causes Ideal TM to have a higher IPC than fine-grained locking even though it has the higher execution time. Some of our applications (CL, BH and CC) achieve reasonable performance, while others suffer from GPU performance bottlenecks such as control flow and memory divergence (see Section 4.4.2 for further discussion). Notice that even though HT-L has lower IPC than HT-H for both Ideal TM and fine-grained locking, HT-L has lower execution time (i.e., runs faster) than HT-H. With fewer conflicts among transactions, HT-L has fewer aborted transactions than HT-H. This allows HT-L to complete with fewer instructions, causing it to have a lower IPC.
We believe these applications can be optimized via performance tuning – identifying bottlenecks and redesigning the applications incrementally to address these bottlenecks one by one. Application performance tuning is beyond the scope of this work. Transactional memory arguably provides an easier programming model for performance tuning because it allows GPU application developers to rework algorithms and data structures without concern for deadlock. This work focuses on enabling TM on GPUs efficiently, and with minimum overhead.

Table 4.1: Raw performance (instructions per cycle, or IPC) of GPU-TM applications described in Section 4.3 (Peak IPC = 240).

Applications→   HT-H   HT-L   ATM   CL    BH     CC     AP
Ideal TM        6.6    5.9    4.2   9.4   10.5   33.4   0.5
FG Lock         8.1    6.5    4.2   8.8   9.5    51.0

Figure 4.1: Performance comparison between applications running on an ideal TM system and their respective fine-grained (FG) locking version (applications described in Section 4.3). [Bar chart of speedup over serializing transactions for Ideal TM and FG Lock.]

4.1.1 Challenges with Prior HTMs on GPUs

Hardware transactional memory (HTM) has been researched extensively (see Section 7.2). Many proposed HTMs leverage cache coherence for conflict detection among concurrent transactions, while assuming that each transaction owns a private L1 cache. Even though recent GPUs have caches [121], the caches local to a SIMT core are not coherent and they are shared among 100s of threads that execute concurrently on the core. GPUs are designed to exploit fine-grained data parallelism; adjacent memory words are often accessed by different threads. These differences raise many challenges in adopting existing HTMs for GPUs.

An emerging class of manycore accelerators, such as Intel’s Larrabee [142], feature fewer concurrent threads per core and coherent caches that can be partitioned per thread.
The following challenges may be less severe for this class of manycore accelerators.

Access Granularity and Write Buffering

Each line in the L1 data caches could be extended to identify and isolate speculative data written by individual transactions. However, each transaction might obtain only a few cache lines before the cache overflows because there are fewer L1 cache lines than scalar threads on a SIMT core. Transactions typically lack the spatial locality required to fully use a cache line and make poor use of the few lines they can access. Furthermore, the fine-grained, interleaved accesses among different threads can introduce significant false-sharing and reduce the accuracy of conflict detection.

Transaction Rollback

Many proposed HTMs checkpoint the architectural state of the hardware thread at the start of a transaction for restoration upon rollback. Maintaining copies for 10s of registers at transaction boundaries in a CPU core is relatively cheap. GPUs, however, are designed to execute 1000s of concurrent threads, and spend significant hardware resources on register file storage. NVIDIA Fermi has 2MB of register file storage, which exceeds its aggregate cache capacity [121]. Naively checkpointing this many registers would introduce significant overheads.

Scaling Conflict Detection

A key challenge for scaling TM beyond 1000s of concurrent transactions is designing a conflict detection mechanism that works effectively at this scale. Naive broadcast-based conflict detection scales poorly; T concurrent transactions will broadcast to T-1 other transactions, generating O(T²) traffic.

Many proposed TMs use global metadata, such as a cache coherence directory, to eliminate unnecessary traffic. Recently, directory based cache coherence protocols supporting up to 1000 cores have been proposed [51, 86, 173].
However, GPUs such as Fermi [121] do not have a private cache for each thread.

Using bloom filters to represent read- and write-sets of a transaction [26, 111, 170] allows each thread to quickly react to incoming requests and enables fine-grained conflict detection. We experimented with an ideal version of a signature-based HTM (lazy conflict detection and lazy version management) with each transaction maintaining both its read- and write-sets in a bloom filter. We used the parallel bloom filters described by Sanchez et al. [137]. Each filter contained 4 separate sub-signatures and each sub-signature was indexed by a unique H3 hash function. We had to use a 1024-bit filter per thread (3.8MB of total storage for 30720 threads) to keep the false conflict rate below 20% for the benchmarks CL, BH and AP. Using 512-bit filters increased the rate to 60%.

Commit Bottleneck

Even if we reduce the bloom filter storage by limiting the number of concurrent transactions (Section 4.2.6), the bloom filters of all transactions and the directory cannot be modified when one of the transactions is committing. Otherwise, a conflicting access may go undetected when a transaction that has resolved all of its conflicts is updating memory [27]. Scalable TCC [27] solves this issue by locking entries in the directory, but its commit protocol serializes transaction commit at each directory bank. LogTM-SE [170] uses eager version management, writing speculative data directly to global memory, to allow transactions to commit in parallel. However, this requires eager conflict detection and resolution to isolate the speculative data between concurrent transactions. It is not clear how eager conflict detection can be implemented on current GPUs, which do not feature invalidation-based cache coherence protocols.
The potential commit bottleneck and the signature storage explosion (Section 4.1.1) issues persuaded us to explore alternatives.

4.2 Kilo Transactional Memory

In this section, we present Kilo Transactional Memory (Kilo TM), a TM system scalable to 1000s of concurrent transactions. Kilo TM does not leverage a cache coherence protocol for conflict detection among running transactions. Instead, each transaction performs word-level, value-based conflict detection against committed transactions by comparing the saved value of its read-set against the value in memory upon its completion [37, 124]. A changed value indicates a conflict. This mechanism offers weak isolation [73].

Figure 4.2: Kilo TM Implementation Overview. Kilo TM adds a transaction (TX) log unit to each SIMT core and a commit unit to each memory partition. The SIMT stacks are extended to support transactions. [Block diagram of the CPU, the GPU's SIMT cores, the interconnection network, and the memory partitions with their last-level cache banks, atomic operation units, commit units and DRAM controllers.]

Each transaction buffers its saved read-set values and memory writes in a read-log and a write-log (as address-value pairs) in local memory (lazy version management). When a transaction finishes executing, it sends its read- and write-log to a set of commit units for conflict detection (validation), each of which replies with the outcome (pass/fail) back to the transaction at the core. Each commit unit validates a subset of the transaction’s read-set. If all commit units report no conflict detected, the transaction permits the commit units to publish the write-log to memory.
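The read-log/write-log scheme described above can be illustrated with a minimal software sketch. It assumes a single commit unit, a flat word-addressed memory in which unwritten words read as zero, and strictly serial commits; the names `Transaction`, `tx_read`, `tx_write`, `validate` and `commit` are hypothetical and do not correspond to Kilo TM hardware interfaces.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using Addr = std::uint32_t;
using Word = std::uint32_t;
using Memory = std::unordered_map<Addr, Word>;  // hypothetical model of global memory

struct Transaction {
    std::vector<std::pair<Addr, Word>> read_log;   // (address, value observed)
    std::vector<std::pair<Addr, Word>> write_log;  // (address, value to publish)
};

// Load one word from the memory model; unwritten words read as zero (assumption).
Word load(const Memory& mem, Addr a) {
    auto it = mem.find(a);
    return it == mem.end() ? 0 : it->second;
}

// Transactional read: forward from the write-log if this transaction already
// wrote the word, otherwise read memory and record the observed value.
Word tx_read(Transaction& tx, const Memory& mem, Addr a) {
    assert(a % sizeof(Word) == 0);  // word-level granularity
    for (auto it = tx.write_log.rbegin(); it != tx.write_log.rend(); ++it)
        if (it->first == a) return it->second;
    Word v = load(mem, a);
    tx.read_log.emplace_back(a, v);
    return v;
}

// Transactional write: buffered only, invisible to others until commit.
void tx_write(Transaction& tx, Addr a, Word v) {
    assert(a % sizeof(Word) == 0);
    tx.write_log.emplace_back(a, v);
}

// Value-based validation: any changed word in the read-set signals a conflict.
bool validate(const Transaction& tx, const Memory& mem) {
    for (const auto& entry : tx.read_log)
        if (load(mem, entry.first) != entry.second) return false;
    return true;
}

// Serial commit: publish the write-log only if validation passes.
bool commit(const Transaction& tx, Memory& mem) {
    if (!validate(tx, mem)) return false;
    for (const auto& entry : tx.write_log) mem[entry.first] = entry.second;
    return true;
}
```

If two transactions both read the same word and each writes an incremented value back, the first `commit` succeeds and the second fails validation because the word no longer holds the value it logged, so the second transaction must re-execute.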
To improve commit parallelism for non-conflicting transactions, transactions speculatively validate against committed transactions in parallel, leveraging the deeply pipelined memory subsystem of the GPU. The commit units use an address-based conflict detection mechanism to detect conflicts among these transactions (we call these hazards to distinguish them from the conflicts detected via value comparison). A hazard is resolved by revalidating one of the conflicting transactions at a later time. Section 4.2.5 describes the protocol in detail.

In Kilo TM, transaction-specific communication (conflict detection and memory updates) occurs only between the commit units and the committing thread (shown in Figure 4.4a). This restriction permits the communication packets from different threads to be pipelined and interleaved, as long as the end-to-end message order between the SIMT cores and the commit units is maintained. Kilo TM restricts each transaction to have a single entry and a single exit, matching ‘atomic{}’ semantics in common TM language extensions [73]. Transaction boundaries are conveyed to hardware with tx_begin and tx_commit instructions in the compute kernel. Nested transactions are flattened [73] into a single transaction.

    A:  t = tid.x;
        if (…) {
    B:    tx_begin;
    C:    x[t%10] = y[t] + 1;
    D:    if (s[t])
    E:      y[t] = 0;
    F:    tx_commit;
    G:    z = y[t];
        }
    H:  w = y[t+1];

Figure 4.3: SIMT stack extension to handle divergence due to transaction aborts (validation fail). Thread 6 and 7 have failed validation and are restarted. Stack entry type: Normal (N), Transaction Retry (R), Transaction Top (T). For each scenario, added entries or modified fields are shaded. [Five SIMT stack snapshots (Type, PC, RPC, Active Mask) for the code above: (1) at tx_begin; (2) at tx_commit, threads 6 and 7 failed validation; (3) at tx_commit, restart Tx for threads 6 and 7; (4) branch divergence within Tx; (5) at tx_commit, all threads with Tx committed.]

Figure 4.2 highlights the changes required to implement Kilo TM on our baseline GPU architecture. These include an extension to the SIMT stack, a Transaction Log Unit, and a Commit Unit.

4.2.1 SIMT Stack Extension

When a warp finishes a transaction, each of its active threads will try to commit. Some of the threads may abort and need to reexecute their transactions, while other threads may pass the validation and commit their transactions. Since this outcome may not be unanimous across the entire warp, a warp may diverge after validation.

Figure 4.3 shows how the SIMT stack can be extended to handle control flow divergence due to transaction aborts. When a warp enters the transaction (at line B, tx_begin), it pushes two special entries onto the SIMT stack (1). The first entry of type R stores information to restart the transaction. Its active mask is initially empty, and its PC field points to the instruction after tx_begin. The second entry of type T tracks the current transaction attempt. At tx_commit (line F), any thread that fails validation sets its mask bit in the R entry. The T entry is popped when the warp finishes the commit process (i.e., its active threads have either committed or aborted) (2). A new T entry will then be pushed onto the stack using the active mask and PC from the R entry to restart the threads that have been aborted. Then, the active mask in the R entry is cleared (3). If the active mask in the R entry is empty, both T and R entries are popped, revealing the original N entry (5). Its PC is then modified to point to the instruction right after tx_commit, and the warp resumes normal execution.
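The push and pop sequence described above can be written down as a small software model. This is only a sketch under stated assumptions: PCs are plain integers with `pc + 1` standing for "the next instruction", reconvergence PCs are omitted, a warp has at most 32 threads, and the names `SimtStack`, `tx_begin` and `tx_commit` as C++ methods are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class EntryType { Normal, Retry, Top };  // N, R and T entries

struct StackEntry {
    EntryType type;
    int pc;              // where the threads in this entry execute next
    std::uint32_t mask;  // active mask, one bit per thread in the warp
};

struct SimtStack {
    std::vector<StackEntry> entries;  // entries.back() is the top of stack

    // tx_begin located at `pc`: push an R entry with an empty mask and a
    // T entry holding the current active mask, both pointing past tx_begin.
    void tx_begin(int pc) {
        assert(!entries.empty());
        const std::uint32_t active = entries.back().mask;
        entries.push_back({EntryType::Retry, pc + 1, 0});
        entries.push_back({EntryType::Top, pc + 1, active});
    }

    // tx_commit located at `pc`; `failed` marks threads that failed validation.
    // Returns true once every thread of the warp has committed.
    bool tx_commit(int pc, std::uint32_t failed) {
        assert(entries.size() >= 3 && entries.back().type == EntryType::Top);
        const std::uint32_t aborted = failed & entries.back().mask;
        entries.pop_back();  // the current attempt is finished: pop T

        StackEntry& retry = entries.back();
        assert(retry.type == EntryType::Retry);
        retry.mask |= aborted;  // aborted threads register in the R entry

        if (retry.mask != 0) {
            // Restart aborted threads: new T entry from R's mask and PC.
            const StackEntry again{EntryType::Top, retry.pc, retry.mask};
            retry.mask = 0;
            entries.push_back(again);
            return false;
        }
        entries.pop_back();          // nobody aborted: pop R as well
        entries.back().pc = pc + 1;  // N entry resumes after tx_commit
        return true;
    }
};
```

For an eight-thread warp in which threads 6 and 7 fail validation once, the first `tx_commit` leaves a T entry containing only those two threads on top of an emptied R entry, and the second `tx_commit` pops both entries and advances the N entry past the transaction.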
Branch divergence of a warp within a transaction is handled in the same way as non-transactional divergence (4).

4.2.2 Scalable Conflict Detection

Section 4.1.1 discussed how signature-based conflict detection is prone to the commit bottleneck and storage explosion when scaled to 1000s of threads. Typical conflict detection used in HTMs checks the existence of conflicts and identifies the specific conflicting transactions. One insight Spear et al. [149] present with RingSTM is that a committing transaction only needs to detect the existence of conflicts with transactions that have committed. Transactions with detected conflicts can self-abort without interfering with the execution of other running transactions. This reduces storage and traffic requirements because the TM system does not need to maintain a set of in-flight sharers/modifiers for each memory location, and each transaction only performs the detection once before it commits. However, in our experiment with RingSTM, we had to use 512-bit write-signatures in the commit record ring to keep the false conflict rate below 40% (1.9MB of total storage for a ring with 30720 records to support 30720 concurrent transactions). Value-based conflict detection [37, 124] exhibits traffic requirements similar to those of RingSTM. Transactions detect conflicts with other committed transactions, but without using any global metadata – only values from global memory are used. Kilo TM combines aspects of RingSTM and value-based conflict detection in hardware, and extends them to permit concurrent validations (Section 4.2.5).

A transaction is doomed if it has observed an inconsistent view of memory (e.g., in between memory reads to two different locations, another transaction has committed and updated both locations). These doomed transactions may enter an infinite loop. To ensure that doomed transactions are eventually aborted, we use a watchdog timer to trigger a validation pass.
This satisfies opacity [68] with minimum overhead for GPUs.

4.2.3 Version Management

Kilo TM manages global memory accesses in hardware and uses software for version management of registers and local memory space. Section 4.1.1 discussed how blindly checkpointing each transaction is too expensive on GPUs. We observed that the original values in many registers are rarely used when a transaction restarts, and do not need to be restored. A compiler could determine which registers are both read and written within a transaction and insert code to checkpoint and restore them before/after a transaction, similar to existing code generation techniques for creating idempotent code regions [40, 41]. We observed that transactions in the BH benchmark require restoring two registers on average. Other benchmarks do not require any register restoration upon transaction aborts. Hence, we do not model the register checkpoint overhead in our evaluation as we believe it to be minor compared to validation and commit overheads in our workloads.

Accesses to global memory are buffered in the read/write-log in local memory. A small bloom filter can be used to detect whether a transaction is reading a value in its write-set. A hit in the filter will trigger the transaction log unit to walk the write-log. Since the member set of the filter is constrained to only the memory accesses of a single transaction, a small filter should produce reasonably few false positives. In our evaluations, this detection is perfect.

4.2.4 Transaction Log Storage and Transfer

The read- and write-logs of transactions in Kilo TM are stored as linear buffers in local memory located in off-chip DRAM, cached in the per-core L1 data cache, and mapped to physical addresses such that consecutive 32-bit words are accessed by consecutive scalar threads in a warp.
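One way to picture this interleaved mapping is the address calculation below. It is a hedged sketch: the formula, the constant names and the function names are assumptions made for illustration, and the text does not specify the exact address arithmetic used by the hardware.

```python
WARP_SIZE = 32   # threads per warp
WORD_BYTES = 4   # 32-bit log words


def log_word_address(base, word_index, lane):
    """Hypothetical byte address of log word `word_index` belonging to thread `lane`.

    Words at the same offset in the logs of neighbouring threads are placed
    next to each other, so a warp-wide access touches one contiguous span.
    """
    return base + (word_index * WARP_SIZE + lane) * WORD_BYTES


def addresses_for_warp(base, word_index):
    """Addresses touched when every thread of a warp accesses the same log offset."""
    return [log_word_address(base, word_index, lane) for lane in range(WARP_SIZE)]


addrs = addresses_for_warp(base=0x1000, word_index=3)
span_bytes = addrs[-1] - addrs[0] + WORD_BYTES
```

Under these assumptions one warp-wide log access covers 32 x 4 = 128 contiguous bytes, which is why such accesses can be coalesced.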
GPU applications can specify the maximum size of local memory to avoid overflow.

When a warp accesses global memory in a transaction, a new entry is appended to the read/write-log for all threads in the warp. Entries for the inactive threads are marked with a special address to void the entry. This organization allows the log accesses to be coalesced. If only part of a warp needs to walk the write-log for data, the entire warp will wait for the walk to finish before proceeding to the next instruction. When threads in a warp are ready to validate their transactions (before commit), the transaction log unit walks the read- and write-logs of each thread and sends the address-value pairs to the commit unit in the corresponding memory partition. Entries from different threads at the same log offset are accessed in parallel with coalesced memory reads. The individual entries sent to the same memory partition are grouped into a single larger packet to reduce interconnection traffic.

This transaction log design addresses the fact that per-core caches in contemporary GPUs are shared by 100s of threads. GPUs employ a flexible register allocation scheme that balances the number of registers per warp against the number of concurrent warps to avoid spilling registers. Hence, memory reads rarely access data written by the same transaction, reducing the penalty of storing the write-log as a linear buffer.

4.2.5 Distributed Validation/Commit Pipeline

A naive implementation of value-based conflict detection serializes transaction commits. Memory updates from a transaction (its write-set) are invisible to others until the transaction commits. Two conflicting transactions validating concurrently will observe no changes to their read-sets, and will subsequently update global memory with their contradicting write-sets.
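The race just described can be reproduced with a toy model of value-based validation. The sketch below is illustrative only: the dictionary-based memory, the increment transaction and all function names are assumptions, and it models neither the hardware pipeline nor any particular STM.

```python
def validate(memory, read_set):
    """Value-based validation: every logged read must still match memory."""
    return all(memory[addr] == value for addr, value in read_set.items())


def commit(memory, write_set):
    """Publish the buffered writes."""
    memory.update(write_set)


def run_increment(memory, addr):
    """Speculatively execute `memory[addr] += 1`; returns (read_set, write_set)."""
    seen = memory[addr]
    return {addr: seen}, {addr: seen + 1}


def concurrent_commit(memory, txs):
    """Unsafe: every transaction validates before any of them commits."""
    outcomes = [validate(memory, read_set) for read_set, _ in txs]
    for passed, (_, write_set) in zip(outcomes, txs):
        if passed:
            commit(memory, write_set)
    return outcomes


def serialized_commit(memory, txs):
    """Safe but slow: validate and commit one transaction at a time."""
    outcomes = []
    for read_set, write_set in txs:
        passed = validate(memory, read_set)
        if passed:
            commit(memory, write_set)
        outcomes.append(passed)
    return outcomes


racy = {"counter": 0}
racy_txs = [run_increment(racy, "counter"), run_increment(racy, "counter")]
racy_outcomes = concurrent_commit(racy, racy_txs)      # both pass, one update is lost

safe = {"counter": 0}
safe_txs = [run_increment(safe, "counter"), run_increment(safe, "counter")]
safe_outcomes = serialized_commit(safe, safe_txs)      # the second one fails validation
```

In the concurrent case both increments pass validation and the counter ends at 1 even though two transactions committed. In the serialized case the second transaction fails validation and would have to re-execute.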
While serializing all transaction commits prevents this potential data race, it also prevents non-conflicting transactions from committing in parallel [18].

Figure 4.4: Commit unit communication and design overview. (a) Communication flow with a SIMT core. (b) Overview of a commit unit.

To enable parallel commits, prior STMs with value-based conflict detection [37, 124] use a set of versioned locks, each serializing commit to a memory region. Each transaction checks/acquires the locks of all the memory regions that require protection during validation and commit. Acquiring locks imposes significant overhead.

Kilo TM increases commit parallelism by using a set of commit units that quickly detect conflicts among a limited set of transactions. GPU memory subsystems are deeply pipelined to support a large number of in-flight accesses to maximize throughput. The commit units leverage this capability. In each commit unit, a subset of transactions is speculatively validated in parallel. This validation only detects conflicts with the already committed transactions. Later, a hazard detection mechanism is applied to conservatively detect all potential conflicts. Any hazard is resolved by deferring one of the conflicting transactions and revalidating its read-set after the other transaction has updated global memory. Revalidation serializes the validation/commit process among transactions when necessary.
This mechanism guarantees forward progress by giving the deferred transaction a second chance to validate and commit in case the earlier transaction failed.

Figure 4.5: Logical stage organization and communication traffic of the bounded ring buffer stored inside commit unit.

Each memory partition has a commit unit (shown in Figure 4.4b) that handles validations and commits of TM accesses to that memory partition. Before a transaction starts the validation/commit process, it acquires a commit ID (CID) from a centralized ID vendor (similar to Scalable TCC [27]). This commit ID is associated with a logical entry in the commit unit at every memory partition, and dictates the relative commit order of this transaction (so that the conflict/hazard resolution is unanimous among all commit units). Each commit unit has a ring buffer of commit entries [149] organized in the logical stages shown in Figure 4.5. Each entry tracks the state of a committing transaction in this memory partition. The Status field in each commit unit ring buffer entry in Figure 4.4b indicates the current status of the transaction. The YCID (youngest commit ID) and RCID (retired commit ID) fields are used for hazard detection. The RCID of a transaction is a pointer to the oldest committing transaction when this transaction started speculative validation. Transactions that committed before this transaction started speculative validation do not trigger a hazard with this transaction. The YCID of a transaction points to the youngest conflicting transaction detected to have a hazard with this transaction.
The transaction needs to wait for the conflicting transaction to retire before it can start revalidation. The Read-Set and Write-Set Buffers consist of bounded linear buffers that store, for value comparison, the exact address-value pairs of each transactional access to this memory partition.

The following is an overview of the validation/commit process of a transaction at different logical stages (Figure 4.5):

Log Transfer + Speculative Validation. The transaction transfers its read- and write-logs to an allocated entry in the commit unit (1). The incoming read-set is speculatively validated against the current values in global memory (accessing L2 cache/DRAM, 2).

Hazard Detection. Once the read- and write-logs have been transferred to the commit unit, the read-set of the transaction is checked against the Last Writer History unit (LWH) for hazard detection (3), detecting conflicts between the transaction and all committing transactions in the later stages. Existence of a hazard indicates that the speculative validation may have accessed stale data in global memory that will be updated before the transaction commits. The hazard is resolved in the Validation Wait stage.

Validation Wait. Each transaction waits for the speculative validation to complete before advancing to later stages (4). Transactions with hazards will wait until all conflicting transactions have retired to revalidate their read-set with the updated global memory (5).

Finalizing Outcome. This stage finalizes the outcome of each transaction by replying with the local outcome (pass/fail) of the transaction to the core (6). After a transaction has received replies from all commit units, it will broadcast the final outcome (pass/fail) to all commit units (7). Each commit unit entry waits for its final outcome before proceeding to the next stage.

Commit. Each passed transaction updates the global memory at this stage (8). Failed transactions are skipped.

Retire.
The commit unit entry associated with each transaction is deallocated, releasing storage for future transactions. The core is informed so that the thread running the transaction can proceed to the next instruction, or restart the failed transaction (9).

Commit Unit Resource Allocation

When a warp executes tx_commit, the transaction log unit acquires credits from a per-core credit pool of commit unit entries before acquiring contiguous commit IDs and proceeding with the commit. Insufficient credits prevent the warp from acquiring the commit IDs until the credits are returned when the validation/commit operation of another warp completes. In this work, we assume that the commit units always have enough entries to support all in-flight transactions. Section 4.4.4 measures the resources required.

Hazard Detection, Last Writer History Update

At the Hazard Detection stage, each transaction checks the integrity of its speculative validation via the Last Writer History unit (LWH) in Figure 4.4b. This unit has an approximate but conservative representation of the write-sets of all older transactions at the later stages that have not yet retired. The LWH unit identifies the youngest conflicting transaction (returns its commit ID) in the later stages that may modify the read-set of the transaction at the hazard detection stage. If this conflicting transaction retired before the current transaction started validating (its CID < RCID of the current transaction), no hazard remains. Otherwise, a hazard is detected. The hazard is resolved in the Validation Wait stage by waiting for this conflicting transaction (now tracked by YCID) to retire, and then revalidating the transaction with the updated memory. After detection, the current transaction updates the LWH unit with its write-set. This mechanism leverages the same intuition described in Section 4.2.2.
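The hazard rule above (compare the youngest pending writer against the RCID) can be sketched as follows. This is a simplified model under explicit assumptions: the LWH unit is replaced by an exact dictionary from address to last-writer CID, commit IDs are plain increasing integers with no wraparound, and every function name is hypothetical.

```python
def youngest_writer(last_writer, read_addrs):
    """Largest (youngest) CID recorded as writing any address in the read-set, or None."""
    cids = [last_writer[addr] for addr in read_addrs if addr in last_writer]
    return max(cids) if cids else None


def detect_hazard(last_writer, read_addrs, rcid):
    """Return the YCID to wait on, or None when the speculative validation stands.

    A recorded writer with CID < RCID had already retired when this transaction
    started speculative validation, so its updates were visible to that validation.
    """
    writer = youngest_writer(last_writer, read_addrs)
    if writer is None or writer < rcid:
        return None
    return writer


def record_writes(last_writer, cid, write_addrs):
    """After detection, the transaction deposits its own write-set."""
    for addr in write_addrs:
        last_writer[addr] = cid


history = {}
record_writes(history, cid=5, write_addrs=[0x10, 0x20])
record_writes(history, cid=7, write_addrs=[0x20])

ycid = detect_hazard(history, read_addrs=[0x20, 0x30], rcid=6)   # writer 7 is still pending
no_hazard = detect_hazard(history, read_addrs=[0x10], rcid=6)    # writer 5 retired before RCID
```

With an exact table there are no false writers. The approximate structures described in the text trade that precision for bounded storage, at the cost of occasionally waiting on a transaction that did not actually write the address.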
The LWH unit can approximately maintain the latest pending writer to each memory location, as a slightly younger false writer only slightly lengthens the wait at the Validation Wait stage.

The LWH unit has an address-indexed set-associative lookup table and a recency bloom filter. The two structures, in combination, conservatively track the CID of the youngest transaction in later stages that may write to a given memory location. The lookup table stores the exact write-sets from recent transactions, whereas the bloom filter stores the approximate write-sets from distant transactions. As write-sets from newer transactions are deposited into the lookup table, entries are updated (replacing the CID if addresses match), or evicted into the recency bloom filter to free up storage for different addresses. A large transaction with its write-set exceeding the capacity of the lookup table automatically overflows part of the write-set into the recency bloom filter. The recency bloom filter has multiple sub-arrays of buckets (each bucket storing a CID) indexed by a hash of the given memory address. Each evicted entry updates a CID bucket in each sub-array according to the hashed written memory address. Due to address aliasing in each sub-array, an older CID writing to an address may be replaced by a younger CID writing to a different address. When the bloom filter is queried with an address, one CID is retrieved from each sub-array. The oldest retrieved CID is returned as it is least likely to have been aliased by a younger writer. This oldest CID can also be aliased, causing the LWH unit to report a false writer. The write-set of a retiring transaction is implicitly removed from the LWH unit as its CID can no longer trigger a hazard.

Unbounded Transactions

If the commit entry’s read-set buffer overflows, the commit unit will continue to speculatively validate the address-value pair of the incoming read-set, but will stop populating the read-set buffer.
The commit unit will ask the transaction to resend its read-set from the SIMT core during hazard detection and revalidation. Similarly, if the commit entry’s write-set buffer overflows, the commit unit will ask the transaction to resend the write-set during LWH update after hazard detection and memory update at the Commit stage.

4.2.6 Concurrency Control

While Kilo TM can support thousands of concurrent transactions, limiting the number of concurrent transactions can improve the performance of high-contention applications (with transactions that are likely to abort), and lowers the resource requirement for Kilo TM. To limit the number of concurrent transactions within a SIMT core, we use a counter to track the number of warps currently in transactions. We leave to future work exploration of adaptive mechanisms (e.g., [15, 171]) that react to the dynamic contention in applications.

Table 4.2: General characteristics of evaluated GPU-TM applications. Instruction count (#Inst) obtained from Ideal TM version.

Name                          Abbr.  #Inst  Blk Size  Grid Size  #Blk/SIMT Core
Hash Table (CUDA)             HT-H   632k   192       120        4
                              HT-L   501k   192       120        4
Bank Account (CUDA)           ATM    4.1M   192       120        3
Cloth Physics [20] (OpenCL)   CL     6.8M   512       118        1
Barnes Hut [23] (CUDA)        BH     15M    288       60         1
CudaCuts [162] (CUDA)         CC     104M   256       133        1
Data Mining [3, 88] (CUDA)    AP     39M    64        112        4

4.3 Methodology

We model our proposed hardware changes by extending GPGPU-Sim 3.0 [10]. We evaluate performance of various hardware configurations on a set of GPU-TM applications listed in Table 4.2 and Table 4.3. Since we do not have access to the CUDA compiler front end source code, we need an alternative mechanism to communicate transaction boundaries to the simulated hardware. Our approach was to mark transactions with calls to the empty functions tx_begin() and tx_commit(). Calls to these functions are recognized by the simulator as transaction boundaries and reinterpreted by the simulator as hardware instructions.
Since these functions are empty, we need to ensure that CUDA and OpenCL compilers do not optimize them out. We do this by using the __noinline__ keyword in CUDA and by making the functions self-recursive in OpenCL (as the OpenCL driver we used does not provide a no-inline option). Note these dummy functions are never actually called during simulation. The following CUDA/OpenCL applications are used in our evaluations.

Table 4.3: TM-specific characteristics of evaluated GPU-TM applications. Metrics collected with Kilo TM with unlimited concurrency.

Name  #Committed TX  #Inst per TX (Avg)  #Aborts per TX (Avg / Max)  Read-Set #Words (Avg / Max)  Write-Set #Words (Avg / Max)  Max # Concurrent TX (Kilo TM)
HT-H  23040          26                  1.39 / 2                    1.0 / 1                      4.0 / 4                       23040
HT-L  23040          26                  0.14 / 2                    1.0 / 1                      4.0 / 4                       23040
ATM   122880         8                   0.03 / 3                    3.0 / 3                      2.0 / 2                       16131
CL    60200          53                  1.06 / 8                    11.2 / 12                    4.8 / 8                       22816
BH    264106         48                  0.15 / 14                   4.3 / 40                     0.82 / 14                     8640
CC    114677         21                  0.004 / 3                   1.4 / 4                      1.4 / 4                       735
AP    4550           89                  0.32 / 6                    15.7 / 174                   6.2 / 109                     192

Hash Table (HT) is a microbenchmark in which each thread inserts an element into a chained hash table. Each slot in the hash table is a linked list of key-value pairs. We use two table sizes to create high contention (HT-H with 8k entries) and low contention (HT-L with 80k entries) workloads.

Bank Account (ATM) is a microbenchmark with ~16k concurrent threads accessing an array of structs that represents 1M bank accounts. Each transaction transfers money between two accounts.

Cloth Physics (CL) is based on “RopaDemo”, which simulates the cloth physics for a T-shirt [20]. Performance is limited by the Distance Solver kernel, which implements a spring-mass system using a set of constraints between cloth particles. To forbid two constraints concurrently modifying the same particle, the original demo processes these constraints sequentially in octets (i.e., 8 at a time).
We modified this kernel to process all ~4k distance constraints of each T-shirt in parallel transactions.

Barnes Hut (BH) is based on the tree-based n-Body algorithm implemented by Burtscher et al. [23] with 30000 bodies. We focus on the iterative tree-building kernel using lightweight locks, which we modified to use transactions. Each thread in this kernel appends a body into the octree, and inserts any branch node required to isolate its body in a unique leaf node. Each level of traversal down the tree and the node insertions are protected by separate transactions.

CudaCuts (CC) applies a maxflow/mincut algorithm to segmentation of a 200×150 pixel image [162]. It consists of Push kernels that use atomic operations to push excessive flow from a node to its neighbours, and Relabel kernels that change the height of a node when excessive flow cannot be pushed. We grouped consecutive atomic operations in the Push kernels into transactions.

Table 4.4: GPGPU-Sim configuration for Kilo TM evaluation

# SIMT Cores                    30 (10 clusters of 3)
Warp Size                       32
SIMD Pipeline Width             8
Number of Threads / Core        1024
Number of Registers / Core      16384
Branch Divergence Method        PDOM [56]
Warp Scheduling Policy          Loose Round Robin
Shared Memory / Core            16KB
Constant Cache Size / Core      8KB
Texture Cache Size / Core       5KB, 32B line, 20-way assoc.
L1 Data Cache / Core            48KB, 128B line, 6-way assoc. (transactional and local memory access only)
L2 Unified Cache                64KB/Memory Partition, 128B line, 8-way assoc.
Interconnect Topology           1 Crossbar/Direction (SIMT Core Concentration=3)
Interconnect BW                 32 (Bytes/Cycle) (160GB/dir.)
Interconnect Latency            5 Cycle (Interconnect Clock)
Compute Core Clock              1300 MHz
Interconnect Clock              650 MHz
Memory Clock                    800 MHz
# Memory Partitions             8
DRAM Req. Queue Capacity        32
Memory Controller               Out-of-Order (FR-FCFS)
GDDR3 Memory Timing             tCL=10 tRP=10 tRC=35 tRAS=25 tRCD=12 tRRD=8 tCDLR=6 tWR=11
Memory Channel BW               8 (Bytes/Cycle)
Min. L2/DRAM Latency            460 Cycle (Compute Core Clock)
Kilo TM
Commit Unit Clock               650 MHz
Validation/Commit BW            1 Word/Cycle/Memory Partition
#Concurrent TX                  2 Warps/Core (1920 threads)
Last Writer History Unit Size   5kB (See Section 4.4.3)

Data Mining (AP) is based on Apriori in the RMS-TM benchmark suite [3, 88]. We evaluate the apriori_gen() function, which was modified [129] to use CUDA, with each thread processing a unique record. As in the CPU TM version, transactions are used to protect candidate insertion and support value counting.

Our modified GPGPU-Sim is configured to model a GPU similar to NVIDIA Quadro FX5800, extended with L1 data caches and an L2 unified cache similar to NVIDIA Fermi [121]. We validated GPGPU-Sim 3.0 with the NVIDIA Quadro FX5800 configuration (no cache extensions and using PTX instead of SASS) against the hardware GPU and observed an IPC correlation of ~0.93 for a subset of the CUDA SDK benchmarks. GPGPU-Sim incorporates a configurable interconnection network simulator [38]. Traffic between the SIMT cores and the memory partitions is serviced by two separate crossbars, one per direction. The crossbars can transfer a 32-byte flit per interconnect cycle to/from each memory partition (~160GB/s per direction). Each flit takes 5 cycles to traverse the crossbar. The 30 SIMT cores are grouped in 10 clusters. Cores in a cluster share a common port to each crossbar (concentration of three). The memory partition has an out-of-order memory access scheduler. We model detailed GDDR3 timing. Every memory access sent to L2 cache/DRAM has a minimum pipeline latency of 460 cycles (in compute core clocks) to match that observed by microbenchmarks of NVIDIA Quadro FX5800 [165]. The actual latencies of individual accesses can be higher due to delays from memory access scheduling and queuing as DRAM bandwidth saturates. We have an optimistic performance model for atomic operations (used in fine-grained locking).
Atomic compare-and-swap operations on GPGPU-Sim have ~4× higher throughput than on NVIDIA Fermi GPU, while other types of atomic operations on GPGPU-Sim perform roughly the same as Fermi. Table 4.4 lists the other major configuration parameters.

We model all interconnection network traffic between the SIMT cores and the commit units. Packets from the transaction log unit are sized according to the payload within the packet, and they contend for the same interconnection port with packets for normal memory accesses. Each short commit protocol message occupies a single flit. Packets containing multiple read/write-log entries (see Section 4.2.4) may occupy multiple flits, taking multiple cycles to transfer. In our evaluations, Kilo TM validates and commits each transaction as directed by the timing simulation. In our simulation, timing of committing transactions affects functional behaviour of the application, and hence any undetected data-race would likely lead to an application error, which we verify does not occur.

This modified version of GPGPU-Sim and the evaluated GPU-TM applications are available online [61, 62].

Figure 4.6: Execution time of GPU-TM applications with Kilo TM. Lower is better.

Figure 4.7: Abort/commit ratio of GPU-TM applications with Kilo TM. Lower is better. Ratio = 1 if on average transactions abort once.

4.4 Experimental Results

4.4.1 Performance

In this section, we compare the performance of Kilo TM against the ideal TM system (Ideal TM) and fine-grained locking (FG Lock) described in Section 4.1. Figure 4.6 shows the execution time of each application with Kilo TM and fine-grained locking normalized to the execution time of Ideal TM.
In our evaluations, Kilo TM uses commit units with unlimited capacity.

With unlimited transaction concurrency (Inf. TransWarp), Kilo TM is on average 4.1× slower than Ideal TM. HT, CL, and BH are affected the most. These applications have many concurrent transactions with high contention (Figure 4.7). Although BH’s overall abort-commit ratio is relatively low, it starts with a high-contention period when all transactions are competing to insert nodes near the root of the octree. When conflicting transactions attempt to commit concurrently, the commit unit defers revalidating one transaction. This reduces overall performance. Notice that AP has relatively few concurrent transactions, so its high abort-commit ratio has little impact on performance (Table 4.3).

Figure 4.8: Performance scaling with increasing number of concurrent transactions with Kilo TM. Higher is better.

Limiting each core to two transaction warps (2 TransWarp/Core in Figure 4.6, 1920 threads globally) reduces contention in HT-H, CL, and BH and improves their performance with Kilo TM by 2-3×. ATM speeds up by 2.3× from improved hazard detection accuracy. The performance of HT-L improves by 66%, while AP is unaffected. CC’s performance drops by 34% because of this limit. In CC, warps are typically diverged before entering transactions. CC would not be penalized with thread-level concurrency control. Overall, Kilo TM performs significantly better with this concurrency limit, capturing 52% of Ideal TM and 59% of fine-grained locking performance.

Concurrency Control

Figure 4.8 compares the performance of Kilo TM under different concurrency limits versus serializing execution of all transactions.
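The concurrency limits compared here correspond to the per-core counter of Section 4.2.6. The sketch below shows one way such a counter could gate warps, together with the arithmetic behind the 1920-thread figure. The class and function names are hypothetical, and the policy of releasing a slot only once the whole warp has committed is an assumption rather than a documented detail.

```python
WARP_SIZE = 32   # threads per warp (Table 4.4)
NUM_CORES = 30   # SIMT cores (Table 4.4)


class TxWarpLimiter:
    """Per-core counter of warps currently executing transactions."""

    def __init__(self, max_tx_warps):
        self.max_tx_warps = max_tx_warps  # None models unlimited concurrency
        self.in_tx = 0

    def try_enter(self):
        """Called at tx_begin; returns False if the warp must stall."""
        if self.max_tx_warps is not None and self.in_tx >= self.max_tx_warps:
            return False
        self.in_tx += 1
        return True

    def leave(self):
        """Called once the warp has finished its transaction (assumed policy)."""
        assert self.in_tx > 0
        self.in_tx -= 1


def global_tx_threads(max_tx_warps_per_core):
    """Upper bound on threads inside transactions across the whole GPU."""
    return NUM_CORES * max_tx_warps_per_core * WARP_SIZE


limiter = TxWarpLimiter(max_tx_warps=2)
admitted = [limiter.try_enter() for _ in range(3)]   # the third warp has to wait
limiter.leave()
admitted_after_leave = limiter.try_enter()
threads_at_limit_two = global_tx_threads(2)          # 30 * 2 * 32
```

With a limit of two warps per core, the bound evaluates to 30 x 2 x 32 = 1920 threads, matching the figure quoted in Section 4.4.1.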
HT-L, ATM, CL and BH achieve the best performance with transaction execution limited to two warps per core (2 TransWarp/Core), while HT-H performs best with transaction execution limited to one warp per core (1 TransWarp/Core). CC prefers unlimited transaction concurrency. AP is insensitive to the limit. Overall, Kilo TM performs best with transaction concurrency limited to two warps per core, achieving on average 128× speedup over serially executing each transaction.

Effects on Abort-Commit Ratio

Figure 4.7 compares the abort-commit ratios between Kilo TM and the ideal TM system. Kilo TM and Ideal TM show similar abort-commit ratios with transaction concurrency limited to two warps per core. With unlimited transaction concurrency, contention at the commit unit defers memory updates from older transactions that would have been made visible much earlier with Ideal TM. Younger transactions that were originally reading the updated values in Ideal TM now conflict with the older uncommitted transactions.

4.4.2 Execution Time Breakdown

To provide further insight, Figure 4.9 shows a breakdown of the cumulative per-hardware thread cycles, scaled by the overall execution time of each application. At each cycle, a thread can be in a warp stalled by Concurrency control (TC), be in a warp committing its transactions (TO), have passed commit and be Waiting for other threads in its warp to pass (TW), be executing an eventually Aborted (TA) or committed/Useful (TU) transaction, be acquiring a lock or performing an Atomic operation (AT), be waiting at a Barrier (BA), or be performing non-atomic non-transactional work (NL).
We compare the thread-state distributions between the fine-grained locking versions of the benchmarks (FGL), and the transactional versions running on Ideal TM (IDEAL), Kilo TM with transaction concurrency limited to two warps per core (KL), and Kilo TM with unlimited transaction concurrency (KL-UC).

Figure 4.9: Breakdown of thread execution cycles for Kilo TM. Scaled by the overall execution time, normalized to IDEAL TM.

Figure 4.10: Breakdown of core execution cycles for Kilo TM.

We observe the overheads of lock acquisition (AT) in the lock-based versions to be proportional to the inherent contention in their transactional versions. Transactional HT-H and CL have the largest abort-commit ratios in Figure 4.7 and their lock-based counterparts have the greatest locking overheads in Figure 4.9. HT-L, ATM and CC have the lowest abort-commit ratios and the smallest locking overheads. Lock-based BH has a significant locking overhead because of the initial high-contention period, as explained in Section 4.4.1. Lock-based AP shows insignificant locking overhead, despite a high abort-commit ratio in its transactional version, due to limited parallelism in its implementation. For the lock-based benchmarks, the NL cycles include the execution of the critical sections and are therefore greater than in the transactional versions.
Detailed analysis (not shown) indicates that lock-based benchmarks suffer from increased branch divergence, further increasing their NL cycles.

Threads running on Kilo TM with unlimited transaction concurrency spend much of their time waiting to be committed (TO). This overhead is significantly reduced by limiting transaction execution to two warps per core in exchange for long waits in concurrency control (TC). For most benchmarks this provides an overall gain in performance. CC’s performance on Kilo TM, however, degrades with concurrency control. This is because CC’s originally low commit overhead remains unchanged with reduced concurrency, and because CC benefits from increased transaction concurrency as indicated by its scaling performance in Figure 4.8.

Figure 4.10 shows the cumulative execution cycle breakdown of each core. At each cycle, a SIMT core may issue a warp (EXEC), be stalled by downstream pipeline stages (STALL), have all warps blocked by the scoreboard due to data hazards, concurrency control, pending commits or any combination thereof (SCRB), or not have any warps ready to issue in the instruction buffer (IDLE). This figure shows that limiting concurrency in Kilo TM reduces stalling and waiting at the scoreboard. Stalling is reduced as a result of fewer concurrent transactional memory accesses, while shorter and fewer commits reduce the amount of time spent waiting at the scoreboard.

The amount of time spent on transactional work, indicated by TU and TA in Figure 4.9, is lower on KL than on KL-UC and IDEAL. This is also due to the reduction in STALL cycles in Figure 4.10 for KL. Reduced stalling leads to faster transaction completion. BH saw only a small decrease in transaction time when concurrency was reduced.
This is because BH contains inherent and limiting memory dependencies that are visible in Figure 4.10 as SCRB cycles on IDEAL TM. Similar to TU and TA, TW also decreases with reduced concurrency as passed transactions spend less time waiting for failed transactions in their warp to re-execute and commit. The amount of time spent on non-transactional work (NL) varies among KL, KL-UC and IDEAL. This is because threads doing transactional work and non-transactional work may execute in parallel, allowing the differences in transaction behavior in the different TM systems to have an impact on the performance of the non-transactional work.

In Figure 4.9, HT-H, HT-L and ATM spend less time doing useful transactional work (TU) on KL-UC than on IDEAL, even though both have unlimited concurrency. This is because Kilo TM caches global memory writes in write-logs stored in the L1 data cache. HT-H benefits most from this buffering during transaction execution as its transactions are dominated by writes (see Table 4.3). HT-L and ATM’s lower data locality negates some of the benefit of write buffering. CL and BH are dominated by reads and gain little benefit from write buffering. The memory write overhead of write buffering is eventually incurred during transaction commit (TO).

CC and AP both suffer from load imbalance as indicated in Figure 4.10 by the significant portion of IDLE cycles, the portion of the time when the cores run out of warps to execute. The inter-thread load imbalance suffered by CC is exacerbated by transactional overheads. AP suffers from inter-core load imbalance. AP spends most of its execution in non-transactional work, but the overhead of Kilo TM still impacts performance because of the time involved in transferring logs for the large transactions. AP spends 90% of its core cycles in IDLE. This behaviour contributes to the low absolute performance of AP.
We created the CUDA version of AP from its CPU TM version without changing much of the algorithm and data structures. An improved version may redesign the algorithm to spread the workload across more threads.

Overall, even with a significant portion of time spent on executing aborted transactions, the Ideal TM system performs comparably to fine-grained locking. This indicates that the performance penalties of Kilo TM may be reduced with future refinements.

Figure 4.11: Sensitivity to hazard detection mechanism. LWH = Last Writer History (Section 4.2.5)

4.4.3 Sensitivity Analysis

L2 Cache Miss from Validation Access

We observe that >90% of validation accesses for Kilo TM hit in the L2 cache for all benchmarks with transactional execution limited to two warps per core. This also applies to most benchmarks with unlimited concurrent transactions, but for HT-L, ATM and CL, the cache hit rate for validation access is lower (70% for HT-L, 46% for ATM and 62% for CL). These extra accesses are easily handled by the GPU memory subsystem. In a sensitivity study with idealized validation accesses that always hit in the L2 cache, performance of ATM and CL improves only by 11% and 17%, respectively. Other benchmarks (including HT-L) are insensitive to this change. In this study, Kilo TM employs LWH units that detect hazards perfectly. About 50% of the validation-induced L2 cache accesses in CL are pending hits. In ATM, the extra L2 cache misses improve the row-hit rate in the open-row, out-of-order DRAM controller, increasing the bandwidth efficiency by 5%. The improved efficiency partly compensates for the penalty from validation-induced DRAM accesses. In HT-L, these L2 cache misses increase the DRAM bandwidth utilization by 5% and do not impact performance.
This ability to handle extra memory accesses in GPUs shows why value-based conflict detection is a viable solution for supporting TM on GPUs.

Hazard Detection Sensitivity

We explored the performance of Kilo TM with different hazard detection mechanisms. In Figure 4.11, we compared two versions of the LWH mechanism described in Section 4.2.5 and an additional mechanism based on a bloom filter array. The first, a 5kB LWH, consists of a 512-entry, 4-way set-associative lookup table (3kB) and a 1024-bucket bloom filter (2kB) split into 4 separate sub-arrays, each array indexed by a unique H3 hash function (similar to the parallel bloom signature described by Sanchez et al. [137] and Ceze et al. [26, 137]). The second, a 512B LWH configuration, consists of a 64-entry lookup table and a 64-bucket bloom filter. The other detection mechanism, bloom filter array (BF Array), encodes the read-set and write-set of each transaction into two 512-bit signatures in the commit unit. Each signature consists of 4 sub-signatures, each indexed by a unique H3 hash function. Incoming read/write accesses check for conflicts against all signatures in the commit unit in parallel.

Kilo TM with a 5kB LWH unit performs almost identically to perfect hazard detection. The 512B LWH reduces the storage by 10× but increases execution time by 36% on average. Despite taking 24× more storage than the LWH unit (120kB vs. 5kB per commit unit), BF Array slows down Kilo TM by up to 7.9× (4.3× on average). The performance gap between the LWH unit and BF Array demonstrates how leveraging a pre-defined commit order to prioritize storage can lead to a significantly more effective design than one that dedicates the same amount of resource to represent the write-set of each transaction.
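As an illustration of the BF Array signatures described above, the following Python sketch builds one 512-bit signature from 4 sub-arrays, each indexed by its own H3-style hash. It is a software model under stated assumptions, not the evaluated hardware: the equal 128-bit split, the 32-bit address width, the seeded random masks and the names H3Hash and ParallelBloomSignature are illustrative choices that do not come from the thesis.

```python
import random


class H3Hash:
    """H3-style hash: each output bit is the parity of (key AND a random mask)."""

    def __init__(self, out_bits, key_bits=32, seed=0):
        rng = random.Random(seed)  # seeded only to make the sketch reproducible
        self.masks = [rng.getrandbits(key_bits) for _ in range(out_bits)]

    def __call__(self, key):
        index = 0
        for bit, mask in enumerate(self.masks):
            parity = bin(key & mask).count("1") & 1
            index |= parity << bit
        return index


class ParallelBloomSignature:
    """One signature split into equal sub-arrays, each with its own hash."""

    def __init__(self, total_bits=512, num_subarrays=4, seed=0):
        self.sub_bits = total_bits // num_subarrays      # assumed equal split
        index_bits = self.sub_bits.bit_length() - 1      # log2 of sub-array size
        self.hashes = [H3Hash(index_bits, seed=seed + i)
                       for i in range(num_subarrays)]
        self.subarrays = [0] * num_subarrays             # one bit-vector per sub-array

    def insert(self, addr):
        # Set exactly one bit in every sub-array.
        for i, h in enumerate(self.hashes):
            self.subarrays[i] |= 1 << h(addr)

    def may_contain(self, addr):
        # A hit requires the selected bit to be set in every sub-array.
        return all((self.subarrays[i] >> h(addr)) & 1
                   for i, h in enumerate(self.hashes))


write_set_signature = ParallelBloomSignature()
for address in (0x1000, 0x1004, 0x2F00):
    write_set_signature.insert(address)
```

A signature of this kind can report false positives but never false negatives, so every inserted address is always reported as possibly present.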
The lookup table plus recency bloom filter design in a LWH unit dedicates extra resources to ensure that an unnecessarily revalidating transaction and its false writers are far apart in the commit unit pipeline, minimizing the stalling at the Validation Wait stage.

4.4.4 Implementation Complexity of Kilo TM

In each SIMT core, the Kilo TM implementation involves extending the SIMT stack to support transactions, employing concurrency control, and adding a transaction log unit. Even though each transaction log unit manages 1000s of transactions, most of the bookkeeping is amortized across the warp. For example, threads in the same warp have the same read-log and write-log sizes, and they always have consecutive commit IDs. The L1 data cache stores the read-/write-logs. Evicted entries are written back to L2/DRAM. We believe the area overhead of a transaction log unit is negligible.

The area overhead of a commit unit consists of the storage required for the LWH unit, the entries in the ring buffer, and the read- and write-set storage buffers for each entry. Section 4.4.3 showed that a 5kB LWH unit is sufficient. Each commit unit ring buffer entry occupies 10 bytes for the status, RCID and YCID fields, and pointers to a shared pool of the read- and write-set buffers. The area required for the read- and write-set buffers is a product of the size of each buffer and the number of buffers present. The rest of Section 4.4.4 examines how large each fixed-size buffer should be to limit buffer overflow, and how many of these fixed-size buffers are required concurrently.

[Figure 4.12: Buffer usage in active commit unit (CU) entries. Panels: Write Buffer Usage (#Words) and Read Buffer Usage (#Words). Y-axis: % Active CU Entries. Series: HT-H, HT-L, ATM, CL, BH, CC, AP.]

Read-Set/Write-Set Buffer Capacity

Figure 4.12 shows the cumulative distribution of the read- and write-set buffer usage for the active ring buffer entries in the commit units.
The distributions show that an 8-word (64 bytes) read-set buffer and an 8-word write-set buffer can serve >90% of the commit unit ring buffer entries. If the read-set or the write-set buffer overflows (a rare event), the penalty involves resending the read- and write-log. We leave performance evaluation with finite-sized read- and write-set buffers as future work.

[Figure 4.13: Number of in-flight, allocated read and write buffers for different buffer allocation schemes. All ⊇ Accessed ⊇ Used ⊇ Needed. Y-axis on right shows number of commit unit ring buffer entries with buffers allocated for All and Accessed. Left y-axis: # Allocated R/W Buffers. Average (Avg) and maximum (Max) are shown for HT-H, HT-L, ATM, CL, BH, CC and AP. Plot annotation: All = 5116, Accessed = 4754.]

Commit Unit Capacity

Conservatively assigning each commit unit ring buffer entry a dedicated 8-word read-set buffer and a dedicated 8-word write-set buffer can introduce significant storage overhead for the commit units (as explained later in this section). We observed that not all commit unit ring buffer entries need read- and write-set buffers; in many cases, the buffers are not used at all, or they are only needed at certain phases of the validation and commit of a transaction. Dynamically allocating these buffers from a shared pool reduces the area overhead of buffers. Figure 4.13 shows the average and maximum number of read- and write-set buffers required in-flight throughout the execution of each TM application for 4 different buffer allocation schemes. The All and Accessed allocation schemes allocate fixed-size read- and write-set buffers for a commit unit ring buffer entry when it is created, and deallocate the buffers when the entry retires. All allocates the two buffers for all active entries in the ring buffer, while Accessed only allocates buffers for the entries whose transaction has accessed this memory partition.
The number of ring buffer entries in the All and Accessed schemes is given by the right Y-axis in Figure 4.13. The Used allocation scheme improves upon Accessed by allowing a single fixed-size read- or write-set buffer to be allocated if the transaction has an empty write-set or read-set buffer for this memory partition, respectively. Needed further improves upon Used by allowing buffers to be deallocated before the ring buffer entry retires. Instead of deallocating the buffers after all earlier transactions have retired, the buffers are deallocated as soon as a transaction fails. Also, the read buffers of a transaction are deallocated as soon as it has passed validation. In Figure 4.13, the number of concurrent transactions is limited to two warps per core (1920 in total) via concurrency control, and there is no capacity limit on the number of ring buffer entries.

HT-H, HT-L, ATM, CL and BH use ∼700 ring buffer entries on average (All). BH’s maximum number of entries exceeds the concurrent transaction limit because it has many read-only transactions. The SIMT core considers a read-only transaction to be done when it receives the local outcome reply from commit units, allowing the next waiting transaction to proceed before the corresponding commit unit entries are retired. We observed that the number of in-flight entries exceeded the concurrent transaction limit of 1920 for only <1% of the execution time for BH. CC and AP have significantly fewer concurrent transactions and therefore require fewer entries. A commit unit with 1920 entries with two 64B buffers for all entries (All) would require 240kB for read- and write-set storage, and 19kB for the ring buffer storage (10B per ring buffer entry). The Accessed, Used and Needed optimizations reduce the storage requirement for buffers. To serve most of the Needed buffer usage, ∼500 buffers per commit unit (32kB per unit, 256kB for the whole GPU) are enough.
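The storage figures quoted above can be reproduced with a few lines of arithmetic. The Python sketch below only reuses numbers stated in the text; treating "kB" as 1024 bytes and reading "∼500" as exactly 500 are assumptions, and the commit unit count is inferred from the quoted 256kB and 32kB totals rather than stated here.

```python
KB = 1024                 # assumption: the text's "kB" means 1024 bytes

ENTRIES = 1920            # ring buffer entries (concurrent transaction limit)
BUFFER_BYTES = 64         # one 8-word read-set or write-set buffer
RING_ENTRY_BYTES = 10     # status, RCID, YCID and buffer pointers
NEEDED_BUFFERS = 500      # assumption: "~500" taken as exactly 500

# "All" scheme: every entry owns one read-set and one write-set buffer.
all_scheme_kb = ENTRIES * 2 * BUFFER_BYTES / KB        # 240.0
ring_buffer_kb = ENTRIES * RING_ENTRY_BYTES / KB       # 18.75, quoted as 19kB
needed_pool_kb = NEEDED_BUFFERS * BUFFER_BYTES / KB    # 31.25, quoted as 32kB

# Commit unit count implied by the quoted per-unit and whole-GPU totals.
implied_commit_units = 256 / 32                        # 8.0

print(all_scheme_kb, ring_buffer_kb, needed_pool_kb, implied_commit_units)
```

Under these assumptions the shared Needed pool is roughly 7.5× smaller than the per-entry allocation of the All scheme.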
The rare worst case can be handled by deferring validation/commit via the credit-based allocation mechanism described in Section 4.2.5. We leave the performance evaluation of this allocation mechanism as future work. Deferring validation/commit of a transaction when its allocation fails reduces the commit parallelism in Kilo TM; however, we expect the performance impact from these deferrals to be minimal, given how rarely they occur.

Area Estimation

We assume that both the 32kB read- and write-set buffer pool and the 19kB ring buffer have 4 banks of SRAM arrays. The 5kB last writer history unit (Section 4.4.3) consists of a 3kB lookup table and a 2kB bloom filter, each an SRAM array. Using CACTI 5.3 [144], we estimate the area of each commit unit (the aggregate area of these arrays) to be 0.40 mm² in a 40nm technology. The NVIDIA Fermi GPU features 6 memory partitions [121], so implementing the commit units on the Fermi architecture requires an area of 2.41 mm². This is just 0.5% of Fermi’s 520 mm² die area.

4.5 Summary

In this chapter, we proposed the use of transactional memory for GPU computing. Transactions can simplify parallel programming by making it easier to reason about parallelism. This becomes more important as the number of threads increases and as more software is ported to take advantage of GPUs’ better peak performance and power efficiency. Compared to lock-based programming, TM simplifies the porting/creation of applications that require data synchronization on GPUs. Specifically, TM is a better fit to the current GPU programming models. The isolation property of TM is similar to how GPU threads are exposed in the programming model. The application specifies as many transactions as it can, and the TM system attempts to execute them in parallel, but transactions can run in isolation.
The GPU hardware can be designed to automatically handle interactions between data synchronization and the SIMT stack, solving an obstacle that prevented fine-grained data synchronization from being widely used in GPU applications. Furthermore, TM frees the programmer from deadlock concerns as they rework the algorithms and data structures to optimize the performance of their application.

Kilo TM is a novel HTM system scalable to 1000s of concurrent transactions. It uses value-based conflict detection to offer weak isolation, avoid the need for coherence, reduce metadata overheads, support unbounded transactions, and detect conflicts at the granularity of individual words. We describe a scalable parallel commit protocol and the changes to a SIMT hardware organization required to support transactions. Kilo TM uses a novel speculative validation mechanism to improve the validation and commit parallelism for non-conflicting transactions. By design, it favors applications with low-contention transactions. We have evaluated Kilo TM with a set of TM-enhanced GPU applications with various degrees of exposed parallelism (the granularity of the decomposition of work into threads) and contention. We find that applications with low exposed parallelism (e.g., AP) perform poorly on the GPU regardless of the data synchronization mechanism used. We argue that these applications can be further parallelized more easily with TM. Our evaluation suggests that Kilo TM performs well (relative to fine-grained locking) on applications with low contention and high exposed parallelism (HT-L, ATM, CC). Kilo TM performs poorly (relative to fine-grained locking) on applications with high contention and high exposed parallelism (HT-H, CL, BH). For these applications, limiting transaction concurrency lowers contention and improves their performance with Kilo TM.
The programmer can lower contention in their application via performance tuning, identifying transactions with high contention and reworking the code to reduce contention [174]. Applications with contention varying during execution (e.g., BH) may benefit from more dynamic mechanisms that control the transaction concurrency according to the current level of contention [15, 171].

Overall, our evaluation shows Kilo TM captures 59% of fine-grained locking performance and is 128× faster than executing transactions serially on the GPU. Our evaluation with an idealized TM system indicates that TM on GPU can perform as well as fine-grained locking. These results motivate the need for TM on GPUs and the need for novel TM systems, like Kilo TM, that better address the challenges in this new domain.

The next two chapters in this dissertation address two pending issues with Kilo TM: correctness and energy-efficiency. The novel design of Kilo TM and its use of value-based conflict detection have led to concerns regarding its correctness. In Chapter 5, we will present a semi-formal proof showing that value-based conflict detection can tolerate the ABA problem and, more generally, that the implementation of Kilo TM presented in this chapter satisfies conflict serializability. The use of value-based conflict detection in Kilo TM has also led to concern regarding its energy efficiency. Chapter 6 measures and characterizes the energy overhead of Kilo TM over fine-grained locking, and proposes two optimizations to significantly reduce the overhead.

Chapter 5

Kilo TM Correctness Discussion

In Chapter 4, we proposed Kilo TM, a hardware transactional memory (TM) system designed for GPU architectures. In Kilo TM, each transaction detects the existence of conflicts with other transactions via value-based conflict detection [37, 124].
With value-based conflict detection, each transaction buffers its writes to memory in a write-log and saves the values of its reads from memory in a read-log during execution. Upon its completion, the transaction compares the saved values of its read-set with the latest values in memory before it commits. We refer to this comparison as validation. Any difference between the saved value and the latest value in memory indicates the existence of a conflict.

A general concern for the correctness of value-based conflict detection is the possibility of subtle bugs due to the ABA problem [109]. Examples of ABA problems have been found for published non-blocking algorithms [110, 151, 161]. These generally result from an implicit assumption that atomicity of a high-level operation on a concurrent data structure can be inferred as long as the value of a guard variable is the same after a sequence of low-level instructions (that implements the operation). This fallacy results in subtle bugs [109].

In this chapter, we address this concern with a semi-formal proof, showing that value-based conflict detection can tolerate the ABA problem. A version of this proof is available online [58]. Like NOrec [37], we can create a logical order for Kilo TM in which validation and commit of each transaction are indivisible: successfully committed transactions in Kilo TM appear atomic to each other. We summarize the insights behind the proof below.

Consider a TM system with address-based conflict detection (full knowledge of which locations have been modified by other transactions). If any of the locations read by a transaction have been changed and subsequently restored to their original value since the transaction originally read them, aborting the transaction and rerunning it instantaneously with the updated memory would yield the same result (same addresses and values in its write-set).
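The observation above can be illustrated with a small Python model of a read-log, a write-log and value-based validation. This is a sketch under stated assumptions rather than Kilo TM’s implementation: memory is a plain dictionary, the helper names (run, validate, commit) and the transfer transaction are hypothetical, and rerunning is modelled as instantaneous.

```python
def run(memory, body):
    """Execute `body` against `memory`; return (read_log, write_log)."""
    read_log, write_log = [], {}

    def load(addr):
        if addr in write_log:             # reads of own writes skip the read-log
            return write_log[addr]
        value = memory[addr]
        read_log.append((addr, value))    # save the observed value
        return value

    def store(addr, value):
        write_log[addr] = value           # buffered until commit

    body(load, store)
    return read_log, write_log


def validate(memory, read_log):
    """Value-based conflict detection: compare saved values with memory."""
    return all(memory[addr] == value for addr, value in read_log)


def commit(memory, write_log):
    """Return the memory state after applying the buffered writes."""
    after = dict(memory)
    after.update(write_log)
    return after


def transfer(load, store):
    # Hypothetical transaction: move 10 units from word 0 to word 1.
    store(0, load(0) - 10)
    store(1, load(1) + 10)


original = {0: 100, 1: 50, 2: 7}          # memory when the transaction reads
read_log, write_log = run(original, transfer)

# Other transactions changed word 0 and restored it, and also updated word 2.
latest = {0: 100, 1: 50, 2: 8}

passed = validate(latest, read_log)               # read-set values still match
direct = commit(latest, write_log)                # commit without rerunning
_, rerun_write_log = run(latest, transfer)        # abort and rerun on latest
rerun = commit(latest, rerun_write_log)
print(passed, direct == rerun)
```

In this model the direct commit and the rerun produce the same write-set and the same final memory state, which is the behaviour the argument relies on.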
This situation summarizes the behavior of a transaction in Kilo TM that has passed value-based conflict detection, where other conflicting transactions have changed and then restored values at memory locations that belong to the transaction’s read-set during its execution. Committing the transaction directly without rerunning it effectively serializes it behind the last committed transaction. This effective serialization allows Kilo TM to tolerate the ABA problem. This requires the transaction to validate its read-set against a consistent view of memory (i.e. the conflict detection is not comparing values with a partially committed transaction). The transaction should also commit its write-set immediately after validation, before other transactions update locations in the transaction’s write-set.

Kilo TM achieves this logical indivisibility between validation and commit by ordering transactions with their commit IDs. In Kilo TM, each transaction $T_X$ is given a unique commit ID prior to validation and commit, and this ID defines the commit order of $T_X$. $T_X$ will obtain a new commit ID for each execution attempt (i.e. a new ID is assigned every time $T_X$ is aborted). Given two transactions $T_X$ and $T_Y$ with commit IDs $X$ and $Y$, where $X < Y$, Kilo TM’s implementation guarantees the following partial ordering:

• Validation of each word $w$ by $T_Y$ always happens after any write to $w$ by $T_X$.
• Any write to $w$ by $T_Y$ always happens after any write to $w$ by $T_X$.
• Validation of $w$ by $T_X$ always happens before any write to $w$ by $T_Y$.

These guarantees order validations and writes at each memory location (word) in ascending commit order.

With this per-word order, it can be shown that transactions committed by Kilo TM satisfy conflict serializability [163]: all conflict relations produced from the accesses at each word will obey the commit order.
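The three per-word guarantees listed above mean that every conflict relation points from a lower commit ID to a higher one. The Python sketch below checks this on a toy access history; the history format, the function names and the example histories are assumptions made for illustration and are not taken from the thesis.

```python
from itertools import combinations


def conflict_edges(history):
    """history maps each word to its accesses in real-time order.

    Each access is (commit_id, op) with op either "Rv" (validation) or "W".
    Two accesses conflict when they come from different transactions and at
    least one of them is a write; the edge points from the earlier access.
    """
    edges = set()
    for accesses in history.values():
        for (tx_a, op_a), (tx_b, op_b) in combinations(accesses, 2):
            if tx_a != tx_b and "W" in (op_a, op_b):
                edges.add((tx_a, tx_b))
    return edges


def is_acyclic(edges):
    """Kahn's algorithm: the graph is acyclic if every node can be removed."""
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    for _, dst in edges:
        indegree[dst] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for src, dst in edges:
            if src == node:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
    return removed == len(nodes)


# Accesses ordered by ascending commit ID at each word; two validations of
# the same word (word "b") may appear in either order.
ordered = {
    "a": [(1, "Rv"), (1, "W"), (2, "Rv"), (2, "W")],
    "b": [(2, "Rv"), (1, "Rv"), (3, "W")],
}
# A history that breaks the per-word order across two words.
unordered = {
    "a": [(1, "W"), (2, "Rv")],
    "b": [(2, "W"), (1, "Rv")],
}

ordered_edges = conflict_edges(ordered)
print(sorted(ordered_edges), is_acyclic(ordered_edges))
print(is_acyclic(conflict_edges(unordered)))
```

In the ordered history every edge runs from a smaller commit ID to a larger one, so no cycle can form; the second history shows how a cycle appears once that order is violated.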
Hence, a conflict graph created from these conflict relations is always acyclic.

Conflict serializability implies a logical timeline in which validation and commit of each transaction are indivisible (performed without being interleaved by other transactions). The logical timeline also implies that validation of each transaction always observes a consistent view of memory. Therefore, by proving conflict serializability for Kilo TM, we can show that it can tolerate the ABA problem.

The remainder of this chapter contains a proof of Kilo TM correctness. The proof begins in Section 5.1 by defining a memory value-location framework that is used to represent the values observed and produced by each transaction. The framework is then used to model a general ABA problem scenario specific to TM systems that employ value-based conflict detection. The proof for Theorem 1 uses this framework to show that such TM systems can tolerate the ABA problem, under the assumptions that validation and commit of each transaction are indivisible, and validation is done on a consistent view of memory. To show that Kilo TM satisfies these assumptions, Section 5.3 shows how the implementation of Kilo TM provides a set of partial orderings, Claims 1 to 6. In Section 5.4, these claims are used to prove Lemma 1 and Lemma 2, which together show that accesses from transactions to each word are ordered by ascending commit IDs. In Section 5.5, from this per-word ordering, Lemma 3 shows that validation performed by each transaction accesses a consistent view of memory. In Section 5.6, Lemma 4 then shows that accesses from transactions satisfy conflict serializability [163], which means there exists a logical timeline in which validation and commit of each transaction are indivisible. Finally, in Section 5.7, Theorem 2 combines Lemma 3 and Lemma 4 to show that Kilo TM satisfies the assumptions for Theorem 1.
Therefore, Kilo TM can tolerate the ABA problem.

5.1 Memory Value-Location Framework for the ABA Problem

Transactional memory (TM) provides the programmer with the abstraction that transactions are executed in some serialization order. In this serialization order, transactions are executed serially one after another. Each transaction $T$ advances the memory state from the original state observed by $T$, $M^{O}$, to a new state $M^{N}$. We denote this transition from $M^{O}$ to $M^{N}$ with $M^{O} \rightarrow M^{N}$.

Each memory state is defined by the value at every location in the entire memory space $M$ (i.e. a mapping from each address in $M$ to its value). A memory state $M^{X}$ is equivalent to another memory state $M^{Y}$ if the value at each location in $M^{X}$ is equal to the value at the corresponding location in $M^{Y}$. For the rest of this discussion, we use subscripts to denote subsets of locations within a memory space and superscripts to denote different memory states (i.e. values at a set of memory locations). For example, $M_{R}$ is a subset of memory locations (not values) of the memory space $M$, $M^{L}$ denotes the values in every location in the entire memory space $M$, and $M^{L}_{R}$ is the set of address-value pairs for the memory locations (addresses) in $M_{R}$ with values from the memory state $M^{L}$.
We call $M^{L}_{R}$ a partial memory state.

For each transaction $T_X$, the entire memory space $M$ is divided into the following subsets:

$M_{R,X}$ = The memory locations read by $T_X$ as it executes (Read-Set).
$M_{W,X}$ = The memory locations written by $T_X$ when it commits (Write-Set).
$M_{A,X} = M_{R,X} \cup M_{W,X}$ = The memory locations that are accessed (either read or written) by $T_X$.
$M_{I,X} = M - M_{A,X}$ = Memory locations that are ignored (neither read nor written) by $T_X$.
$M_{RW,X} = M_{R,X} \cap M_{W,X}$ = Memory locations in $T_X$’s Read-Set that are also part of its Write-Set.
$M_{RO,X} = M_{R,X} - M_{RW,X}$ = Memory locations that are only read by $T_X$.
$M_{WO,X} = M_{W,X} - M_{RW,X}$ = Memory locations that are only written by $T_X$.

Notice $M_{I,X}$, $M_{RW,X}$, $M_{RO,X}$ and $M_{WO,X}$ are all disjoint sets, and $M = M_{I,X} \cup M_{RW,X} \cup M_{RO,X} \cup M_{WO,X}$.

During execution, each transaction $T_X$ observes the partial memory state $M^{O}_{R,X} = M^{O}_{RO,X} \cup M^{O}_{RW,X}$ (see footnote 7), and writes to addresses in $M_{W,X}$, producing the new partial memory state $M^{N}_{W,X} = M^{N}_{WO,X} \cup M^{N}_{RW,X}$. [Footnote 7: $T_X$ observes $M^{O}_{R,X}$ as long as no other transaction modifies the locations in $M_{R,X}$ while $T_X$ executes. This holds for all transactions that successfully commit. See Section 5.1.4 for the discussion on how Kilo TM detects and handles transactions with an inconsistent view of memory. We assume that transactions are weakly isolated [16] and ignore non-transactional writes in this work.] In the meantime, other transactions may have committed, advancing the latest global memory state to $M^{L} = M^{L}_{R,X} \cup M^{L}_{W,X} \cup M^{L}_{I,X}$. With value-based conflict detection, the transaction checks to see if $M^{L}_{R,X} = M^{O}_{R,X}$ before it commits.
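The subsets defined above map directly onto set operations. The Python sketch below uses a hypothetical 8-word memory space and hand-picked read and write sets (none of these values come from the thesis) to check the partition and to form the two partial memory states compared by value-based conflict detection.

```python
# Hypothetical memory space of 8 word addresses.
M = set(range(8))
read_set = {0, 1, 2}                  # M_{R,X}
write_set = {2, 3}                    # M_{W,X}

accessed = read_set | write_set       # M_{A,X}
ignored = M - accessed                # M_{I,X}
read_write = read_set & write_set     # M_{RW,X}
read_only = read_set - read_write     # M_{RO,X}
write_only = write_set - read_write   # M_{WO,X}

# The four subsets cover M and are pairwise disjoint (sizes add up to |M|).
parts = [ignored, read_write, read_only, write_only]
covers_memory = set().union(*parts) == M
pairwise_disjoint = sum(len(p) for p in parts) == len(M)


def partial_state(state, locations):
    """Address-value pairs of `state` restricted to `locations`."""
    return {addr: state[addr] for addr in locations}


M_O = {addr: 0 for addr in M}         # state observed by the transaction
M_L = dict(M_O)
M_L[5] = 42                           # another transaction wrote an ignored word

# Value-based check: does the read-set part of M^L equal that of M^O?
passes_validation = partial_state(M_L, read_set) == partial_state(M_O, read_set)
print(covers_memory, pairwise_disjoint, passes_validation)
```

Here the write by another transaction falls in the ignored subset, so the two partial memory states over the read-set are equal even though the full memory states differ.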
If the two partial memory states are indeed equivalent, $T_X$ commits by advancing the partial memory state in its write-set from $M^{L}_{W,X}$ to $M^{N}_{W,X}$. This appends a new transition ($M^{L} \rightarrow M^{N}$) to the serialization order.

5.1.1 ABA Problem

The ABA problem manifests in non-blocking algorithms, where multiple threads may operate on a data structure simultaneously, and the atomicity of each operation is presumed to be guaranteed via success of one or more atomicCAS operations. Many non-blocking algorithms rely on the following assumption: if the value of a guarding variable has not been modified since it was last read, then no other threads have modified the data structure, and thus this thread has performed the current operation in isolation. This assumption ignores the possibility that several other operations may have occurred in between, first modifying the guard variable to other values, then restoring the original value before the current thread uses atomicCAS to check the variable’s value. The fallacy in this assumption is how the ABA problem manifests in various non-blocking algorithms, resulting in subtle bugs that are hard to detect [46, 109].

5.1.2 Potential ABA Problem in Transactional Memory

In the context of value-based conflict detection employed in Kilo TM, we consider the potential for ABA problems in the following form. A set of transactions $T_{ABA} = \{T_1, \ldots, T_L\}$ have committed in between time $t_1$, when transaction $T_X$ first started to read its read-set $M_{R,X}$ (observing state $M^{O}_{R,X}$ from $M^{O}$, the memory state before any transaction in $T_{ABA}$ commits), and the time $t_2$, when $T_X$ is validating its read-set against the latest global memory state. We assume that transactions are weakly isolated [16] and ignore non-transactional writes in this discussion. Transactions in $T_{ABA}$ advance the global memory state from $M^{O}$ through a series of memory states and eventually to $M^{L}$. $M^{L}$ is not necessarily equivalent to $M^{O}$, but the part that belongs to $T_X$’s read-set is equivalent: $M^{O}_{R,X} = M^{L}_{R,X}$.
Value-based conflict detection performed by $T_X$ will observe that its read-set has not been changed, and $T_X$ “assumes” no conflicting transaction has committed between $t_1$ and $t_2$ (i.e. the values in $M_{R,X}$ appear to have never been modified in this window). Subsequently, $T_X$ commits by advancing $M^{L}_{W,X}$ to $M^{N}_{W,X}$, whereas the intended transition (one that would have occurred if $T_X$ had executed in isolation without the presence of transactions in $T_{ABA}$) is from $M^{O}_{W,X}$ to $M^{N}_{W,X}$. This intended transition ($M^{O} \rightarrow M^{N}$) violates the existing serialization order because the latest memory state is $M^{L}$. A TM system with address-based conflict detection will regard this as a conflict. In such a system, $T_X$ will be restarted to resolve this conflict. However, we will show that this restart is not needed.

5.1.3 Tolerance to the ABA Problem

The following proof shows that committing $T_X$ directly in the situation described in Section 5.1.2 will result in the same serialization order as one in which $T_X$ detects the conflict and reruns itself starting at time $t_2$. In other words, TM systems employing value-based conflict detection can tolerate the ABA problem by yielding the same memory state transition as TM systems employing address-based conflict detection.

Theorem 1. Directly committing $T_X$ in the situation described in Section 5.1.2 will result in the same serialization order as one in which $T_X$ detects the conflict and reruns itself instantly starting at time $t_2$.

Proof. Assume that the TM system employs a separate mechanism other than value-based conflict detection that is not prone to the ABA problem. $T_X$, upon detecting the conflict at time $t_2$, aborts itself and restarts immediately. Let $T^{1}_{X}$ be this new instance of $T_X$. $T^{1}_{X}$ will observe its read-set from $M^{L}$. Since $M^{L}_{R,X} = M^{O}_{R,X}$, $T^{1}_{X}$ will produce the identical write-set partial memory state $M^{N}_{W,X}$ as $T_X$.
If $T^{1}_{X}$ finishes executing instantaneously with no other transactions committing in between, its commit will advance the partial memory state in $M_{W,X}$ from $M^{L}_{W,X}$ to $M^{N}_{W,X}$ and $T^{1}_{X}$ will transition the global memory state from $M^{L}$ to $M^{N} = (M^{N}_{I,X} \cup M^{N}_{RO,X} \cup M^{N}_{W,X})$. Values in $(M_{I,X} \cup M_{RO,X})$ remain unchanged between $M^{L}$ and $M^{N}$ (i.e. $M^{N}_{I,X} = M^{L}_{I,X}$ and $M^{N}_{RO,X} = M^{L}_{RO,X}$). Committing $T_X$ at time $t_2$ would have produced the same transition ($M^{L} \rightarrow M^{N}$): $M^{L}_{W,X}$ is advanced to $M^{N}_{W,X}$, while values in $(M_{I,X} \cup M_{RO,X})$ remain unchanged. Therefore, as committing either $T_X$ or $T^{1}_{X}$ results in the same transition, the programmer cannot discern between the two instances of execution.

The above proof made two assumptions:

Assumption 1. $T_X$ commits immediately after value-based conflict detection, such that no other transactions can commit in between to advance the memory state away from $M^{L}$.

Assumption 2. The value-based conflict detection performed by $T_X$ is comparing the original read-set state $M^{O}_{R,X}$ against a consistent view of the global memory state $M^{L}_{R,X}$. Here consistent view means that during conflict detection, $M^{L}_{R,X}$ is not advanced to another memory state by the commit of another transaction (i.e. $M^{L}_{R,X}$ is the part of the memory state that exists in between the commits of two transactions).

We proceed to demonstrate that Kilo TM, despite its distributed design, satisfies both assumptions.

5.1.4 Inconsistent Read-Set

While $T_X$ can possibly have observed an inconsistent view of memory (e.g. partially committed states from transactions in $T_{ABA}$) during its execution, Theorem 1 holds as long as $T_X$’s observed read-set equals $M^{L}_{R,X}$. $T_X$ may also observe inconsistent values from a single memory location. In Kilo TM, each transactional load appends the value read from global memory into a linear read-log (if it is not accessing the transaction’s write-set) [57]. The inconsistent values observed from a single memory location by $T_X$ will create multiple read-log entries that contain different values for a single location.
During value-based conflict detection, at most one of the values will match the one in $M^{L}_{R,X}$. The mismatched entry will cause $T_X$’s validation to fail (subsequently aborting $T_X$). $T_X$ may enter an infinite loop due to the inconsistent view of memory. To ensure that $T_X$ is eventually aborted, Kilo TM employs a watchdog timer to trigger a validation for $T_X$ [57].

5.2 Transaction Components in Kilo TM

In Kilo TM, each transaction, $T_X$, is comprised of the following sequence of operations:

$T_X = R(r_1) \ldots R(r_m) \; R_v(r_1) \ldots R_v(r_m) \; W(w_1) \ldots W(w_n)$

• $R(r_1)$ is a read operation from word $r_1$, and $M_{R,X} = \{r_1 \ldots r_m\}$ is the read-set of $T_X$.
• $R_v(r_1)$ is a validation operation on word $r_1$, ensuring that the value obtained by $R(r_1)$ equals the value in global memory. (This is the operation that performs value-based conflict detection.)
• $W(w_1)$ is a write operation to word $w_1$, and $M_{W,X} = \{w_1 \ldots w_n\}$ is the write-set of $T_X$. The write operation is performed only while $T_X$ commits, and it updates the value of $w_1$ in global memory.

When there are multiple transactions involved, the notation $R_v(T_X, w)$ denotes the validation operation on word $w$ for transaction $T_X$, whereas the notation $R_v(T_X)$ denotes all of the validation operations required by $T_X$.

$T_X$ is executed on a single SIMT core. The core interacts with a set of commit units (one in each memory partition) to validate and commit $T_X$. The following messages are sent between the commit units and the core that executes $T_X$:

• $V_k(T_X)$ = (pass/fail) is the validation outcome for $T_X$ at commit unit $k$. $V_k(T_X)$ = pass if validation operations performed on each word $w$ contained in commit unit $k$ for $T_X$, where $w \in M_{R,X}$, all succeed; otherwise, $V_k(T_X)$ = fail. This message is sent from each commit unit $k$ to the core running $T_X$.
• $F(T_X)$ = (pass/fail) is the final outcome for $T_X$. $F(T_X)$ = pass if all validation outcomes received by the core $V_k(T_X)$ = pass; otherwise, $F(T_X)$ = fail. This message is sent from the core to each commit unit.
Write operations $W(T_X)$ are only performed if $F(T_X)$ = pass.

5.2.1 Commit ID and Commit Order

Prior to validation and commit, a transaction $T$ is given a unique commit ID. This ID defines the commit order of $T$. A transaction with a lower commit ID has an earlier commit order than those with a higher commit ID. Namely, given transactions $T_X$ and $T_Y$ with commit IDs $X$ and $Y$ respectively, $X < Y \iff T_X <_t T_Y$. Here $<_t$ denotes the commit order. Each transaction will obtain a new commit ID for each execution attempt (i.e. a new ID is assigned every time the transaction is aborted).

The commit order limits how a transaction may appear in the serialization order. Let $M^{L}$ be the latest memory state, and $T_X$ and $T_Y$ be two transactions, with $T_X <_t T_Y$, that are ready to commit. Only one of the following can happen:

• Both $T_X$ and $T_Y$ commit, resulting in transitions $M^{L} \rightarrow M^{X1} \rightarrow M^{Y2}$, where $M^{X1}$ is the memory state from $M^{L}$ after $T_X$ has committed and $M^{Y2}$ is the memory state from $M^{X1}$ after $T_Y$ has committed.
• Only $T_X$ commits, resulting in transition $M^{L} \rightarrow M^{X1}$, where $M^{X1}$ is the memory state from $M^{L}$ after $T_X$ has committed.
• Only $T_Y$ commits, resulting in transition $M^{L} \rightarrow M^{Y1}$, where $M^{Y1}$ is the memory state from $M^{L}$ after $T_Y$ has committed.
• Neither $T_X$ nor $T_Y$ commits, resulting in no transition.

Notice that the transition sequence $M^{L} \rightarrow M^{Y1} \rightarrow M^{X2}$ ($M^{X2}$ is the memory state from $M^{Y1}$ after $T_X$ has committed) is not allowed. This restriction is enforced by Claim 2 below. This allows Kilo TM to handle transactions with write-after-write conflicts by ordering their commits without synchronizing among different commit units (instead of aborting one of them). RingSTM [149] also uses this policy to handle write-after-write conflicts.

5.3 Partial Orderings Provided by Kilo TM

Kilo TM’s implementation provides a set of partial orderings that we present in the following claims. The following discussions on the validity of these claims assume that the reader is familiar with the implementation of Kilo TM.
An in-depth description of Kilo TM’s implementation can be found in Section 4.2.5 of Chapter 4. These claims will be used to show that Kilo TM satisfies both Assumption 1 and Assumption 2 required for ABA problem tolerance (Theorem 1). We denote these partial orderings with $<_P$. Given two events/operations $A$ and $B$, $A <_P B$ means $A$ happens before $B$ in real time.

Claim 1. At each commit unit, given transactions $T_X$ and $T_Y$, where $T_X <_t T_Y$, a write to a memory location $w$ performed by a transaction $T_X$ always happens before validation of the same location $w$ performed by a transaction $T_Y$. In our notation, $W(T_X, w) <_P R_v(T_Y, w)$.

Proof. In Kilo TM’s implementation, transactions always perform hazard detection in commit order. Each commit unit can speculatively validate each memory location in the read-set of $T_Y$ ($M_{R,Y}$) as the corresponding read-log entry arrives at the unit. Later in hazard detection, if the unit detects (via address-based conflict detection) that $T_X$ is writing to any part of $M_{R,Y}$, the unit will revalidate the entire read-set of $T_Y$ after $T_X$ has finished committing.

Claim 2. Let $T_X$ and $T_Y$ be transactions with $T_X <_t T_Y$. If a memory location $w$ is in the write-sets of both $T_X$ and $T_Y$, writes to $w$ by transaction $T_X$ always happen before writes to $w$ by transaction $T_Y$. Namely, $W(T_X, w) <_P W(T_Y, w)$.

Proof. This is enforced at the commit stage in each commit unit. This ordering is guaranteed by issuing the write operations of each passed transaction in ascending commit order. The GPU memory subsystem in our architecture can reorder accesses to different locations to optimize for bandwidth, but it maintains the ordering of accesses to the same location.

Claim 3. At each commit unit, the write operations for a transaction $T_X$ are only commenced after the commit unit has received the final outcome $F(T_X)$ of $T_X$ from the core. Namely, $F(T_X) <_P W(T_X)$.

Proof. This is enforced at the finalizing outcome stage in each commit unit.

Claim 4.
The transaction will not send out the final outcome F(TX) to commit units until it has received validation outcomes Vk(TX) from all commit units for the transaction. For each commit unit k that contains any location in the read-set of TX, Vk(TX) <P F(TX).

Proof. This is enforced by the Kilo TM implementation at each SIMT core.

Claim 5. At each commit unit k that contains any location in the read-set of TX, the validation outcome Vk(TX) is sent after all validation operations are done. Namely, Rv(TX) <P Vk(TX).

Proof. This is enforced at the finalizing outcome stage in each commit unit.

Claim 6. At each commit unit, the write operations for a transaction TX are only commenced after the commit unit has received the final outcomes F(TY) from all transactions TY with earlier commit order, if the commit unit contains any location in either the read-set or the write-set of TY (MA,Y). That is, for every transaction TY with TY <t TX, at each commit unit that contains any location in MA,Y, F(TY) <P W(TX).

Proof. This is enforced at the finalizing outcome stage in each commit unit. At each commit unit, the commit unit entry that corresponds to a transaction TY waits for the final outcome F(TY) before proceeding to the commit stage. TY stalling at the finalizing outcome stage will forbid any transaction with a younger commit ID (e.g. TX) to proceed to the next stage, even after F(TX) has been received by the unit. Since write operations are only issued in the commit stage, this stalling behavior enforces Claim 6.

5.4 Per-Word Access Ordering

Lemmas 1 and 2 illustrate how Kilo TM orders accesses (validations and writes) to each memory location (word) in ascending commit order. This ordering is used in Lemma 3 to prove that all validation operations for a given transaction TX are comparing against a consistent view of MLR,X (Assumption 2).

Lemma 1. Let TY be a transaction. For each memory location w in MRW,Y, validation operation(s) to w by TY always happen before write operation(s) to w by TY.

Proof.
Let tC be the time when TY starts sending the final outcome F(TY) to each commit unit from the core. By Claim 5, all the validation operations of w by TY (Rv(TY, w)) at each commit unit k have to happen before the unit replies with the validation outcome Vk(TY) back to the core. By Claim 4, all of these outcomes have to arrive at the core before TY sends out F(TY), i.e. before tC. The same commit unit k will receive F(TY) at a time tK > tC. By Claim 3, any write operation to w in MRW,Y by TY (W(TY, w)) has to happen after tK.

Putting it all together, all validation operations to w in MRW,Y by TY have to occur before tC, which is before all write operations to w by TY. That is, Rv(TY, w) <P tC <P tK <P W(TY, w) ⇒ Rv(TY, w) <P W(TY, w).

Lemma 2. Let TX and TY be transactions with TX <t TY. For each memory location w, operations (validation/write) to w by TX always happen before operations to w by TY, except when both operations are validations.

Proof. This can be broken down into three separate orderings:

O1. W(TX, w) <P Rv(TY, w)
O2. W(TX, w) <P W(TY, w)
O3. Rv(TX, w) <P W(TY, w)

The first two orderings follow directly from Claim 1 and Claim 2 respectively. The final ordering follows from Claims 4 to 6: At each commit unit k, the validation outcome of TX, Vk(TX), is only sent after all its validation operations are done (Claim 5), and the final outcome F(TX) will only arrive after all validation outcomes have been received by the core (Claim 4). By Claim 6, the write operations of TY will not be issued until the commit unit has received F(TX). Putting it all together, let commit unit k be the unit containing location w; then Rv(TX, w) <P Vk(TX) <P F(TX) <P W(TY, w) ⇒ Rv(TX, w) <P W(TY, w).

5.5 Validation Against a Consistent View of Memory

Lemma 3. The value-based conflict detections (all validation operations) performed by TY compare the original read-set state MOR,Y against a consistent view of a global memory state MLR,Y.

Proof.
Claim 2 (O2 in Lemma 2) specifies that each location in memory is written in ascending commit order. By O1 and O3 in Lemma 2, every validation by TY to a location w in MR,Y is performed after all transactions with earlier commit orders have written to w and before transactions with later commit orders write to w. The validation is also done before TY writes to w itself (by Lemma 1). Hence, each validation Rv(TY, w) by TY will observe the value of w that is written by the transaction with the latest commit order before TY, which is the same value of w in the memory state right after TY−1 commits (MLR,Y). Since this applies to the validation of every w in MR,Y, the validation operations Rv(TY, w) for all w in MR,Y are comparing against the same consistent view of global memory.

Comment: This can be explained in a simpler way via the commit IDs. Each commit unit ensures that all validation operations of a transaction TX with CID = X are reading from a memory state equivalent to the one right after TX−1 commits. Hence, TX is validating against a consistent view of memory.

5.6 Logical Indivisibility of Validation and Commit

Kilo TM is designed to permit non-conflicting transactions to commit in parallel. This means that memory state ML would likely be advanced to another memory state ML′ by the commits of other transactions during the validation of TX and before TX can commit. This seems to violate Assumption 1. However, with the per-word operation ordering illustrated by Lemmas 1 and 2, we can construct a logical timeline in which validation and commit of each transaction are indivisible.

To construct such a logical timeline, we first define the validation and write operations of a committed transaction TX as a mini-transaction:

CX = Rv(r1) ... Rv(rm) W(w1) ... W(wn)

Each transaction corresponds to a single mini-transaction. The commit order for each mini-transaction is the same as its transaction counterpart. We show that operations performed by these mini-transactions satisfy conflict serializability [163].

Lemma 4. The sequence of operations performed by any arbitrary mini-transactions with Kilo TM satisfies conflict serializability.

Proof. Let CX and CY be two mini-transactions with CX <t CY.

CX has read-set MR,X and write-set MW,X, and MA,X = MR,X ∪ MW,X.
CY has read-set MR,Y and write-set MW,Y, and MA,Y = MR,Y ∪ MW,Y.

Each memory location w that is accessed by both transactions (i.e. w is in (MA,X ∩ MA,Y)) will have its operations ordered according to the ordering defined by Lemma 1 and Lemma 2. In all cases, the operations will produce a conflict relation (denoted by the ordered pair (Op(CA, w), Op(CB, w)) below, see definition 3.12 in Weikum and Vossen [163]) that aligns with the commit order. Each conflict relation in turn creates a directed edge between the corresponding mini-transactions in a conflict graph (denoted by the ordered pair (CA, CB) below, see definition 3.15 in Weikum and Vossen [163]):

• Rv(CX, w) and W(CY, w) are always ordered as Rv(CX, w) <P W(CY, w), producing conflict relation (Rv(CX, w), W(CY, w)) and conflict graph directed edge (CX, CY).
• W(CX, w) and W(CY, w) are always ordered as W(CX, w) <P W(CY, w), producing conflict relation (W(CX, w), W(CY, w)) and conflict graph directed edge (CX, CY).
• W(CX, w) and Rv(CY, w) are always ordered as W(CX, w) <P Rv(CY, w), producing conflict relation (W(CX, w), Rv(CY, w)) and conflict graph directed edge (CX, CY).
• Rv(CX, w) and Rv(CY, w) are freely ordered, but they produce no conflict relation.

Since every pair of mini-transactions either has no conflict or produces a conflict relation that aligns with the commit order, the conflict graph created from the conflict relations among all mini-transactions will not contain any cycle. Specifically, two mini-transactions CA and CB with CA <t CB can never have a directed path from CB to CA in the conflict graph.
Therefore, by Theorem 3.10 in Weikum and Vossen [163], any sequence of operations performed by the mini-transactions satisfies conflict serializability.

By Lemma 4 and the definition of conflict serializability (definition 3.14 in Weikum and Vossen [163]), we can infer that any sequence of operations performed by the mini-transactions with Kilo TM has a logically equivalent serial sequence. In this serial sequence, operations performed by each mini-transaction are not interleaved with those from other mini-transactions. This serial sequence forms a logical timeline in which validation and commit of each transaction are indivisible.

5.7 Tolerance to ABA Problem (Kilo TM)

With Lemma 3 and Lemma 4 proving how Kilo TM satisfies Assumption 2 and Assumption 1 for Theorem 1, Theorem 2 follows:

Theorem 2. Kilo TM can tolerate the ABA problem.

Proof. Lemma 3 and Lemma 4 show that Kilo TM satisfies both assumptions for Theorem 1. Hence, Kilo TM can tolerate the ABA problem.

5.8 Summary

In this chapter, we presented a semi-formal proof to show that Kilo TM can tolerate the ABA problem. We first formalized the ABA problem in the context of a TM system with a memory value-location framework. We used this framework to show that in the presence of an ABA problem, committing the transaction directly produces the same memory state transition as aborting the transaction and rerunning it to completion instantaneously. This behavior holds as long as each transaction can commit immediately after validating its read-set. Hence, the rest of the proof shows that all transactions committed by Kilo TM satisfy conflict serializability: these transactions form a logical order in which the validation and commit of each transaction are logically indivisible. In other words, Kilo TM properly enforces atomicity and isolation among transactions.

Chapter 6

Energy Efficiency Optimizations for Kilo TM

In this chapter, we address concerns over the energy-efficiency of Kilo TM, our GPU TM system proposed in Chapter 4.
In particular, we evaluate and analyze the performance and energy overhead of Kilo TM. The insights from this analysis lead to two distinct enhancements: warp-level transaction management (WarpTM) and temporal conflict detection (TCD). A version of this chapter has been published earlier [55].

Warp-level transaction management leverages the thread hierarchy in GPU programming models (the spatial locality among threads within a warp) to improve the efficiency of Kilo TM. In particular, WarpTM amortizes the control overhead of Kilo TM and boosts the utility of the GPU memory subsystem. These optimizations are only possible if conflicts within a warp can be resolved efficiently, and thus a low overhead intra-warp conflict resolution mechanism is crucial in maintaining the benefit from WarpTM. To this end, we propose a two-phase parallel intra-warp conflict resolution that resolves conflicts within a warp efficiently in parallel.

Temporal conflict detection is a low overhead mechanism that uses a set of globally synchronized on-chip timers to detect conflicts for read-only transactions. Once initialized, each of these on-chip timers runs locally in its microarchitecture module and does not communicate with other timers. This implicit synchronization without communication distinguishes TCD from existing timestamp-based conflict detections used in various software TM systems [37, 148, 167]. TCD uses timestamps captured from these timers to infer the order of the memory reads of a transaction with respect to updates from other transactions. Kilo TM incorporates TCD to detect conflict-free read-only transactions that can commit directly without value-based conflict detection. In doing so, it significantly reduces the memory bandwidth overhead for these transactions, which can account for 40% and 85% of the transactions in two of our GPU-TM applications.

Figure 6.1: Enhanced Kilo TM implementation overview. TX Log Unit, Commit Unit added for Kilo TM. SIMT stack modified to support transactions (see Chapter 4). TX Log Unit extended to use shared memory for intra-warp conflict resolution. First Read Time Table, Last Written Time, and timers in SIMT Cores and Memory Partitions added for temporal conflict detection.

Figure 6.1 shows the overall implementation of an enhanced Kilo TM that incorporates both WarpTM and TCD. The two enhancements complement each other to improve the overall performance of Kilo TM by 65% while reducing the energy consumption by 34%. This enhanced Kilo TM outperforms coarse-grained locking by 192× and achieves 66% of the performance of fine-grained locking with 34% energy overhead. More importantly, the enhancements allow applications with small, rarely-conflicting transactions to perform as well as or better than their fine-grained lock versions. We believe that GPU applications using transactions can be incrementally optimized to reduce memory footprint and transaction conflicts to take advantage of this. Meanwhile the transaction semantics can maintain correctness at every step, providing a low-risk environment for exploring optimizations.

6.1 Performance and Energy Overhead of Kilo TM

Our evaluation shows that GPU TM applications running on a simulated NVIDIA Fermi GPU extended with our baseline Kilo TM only capture 40% of the performance of fine-grained locking (i.e., Kilo TM execution time is 1/0.4 = 2.5× that of fine-grained locking), and consume 2× the energy. We note this is lower than the 59% relative performance between Kilo TM and fine-grained locking in our previous evaluation in Chapter 4.
The discrepancy is mainly due to the different core-to-memory ratio between the NVIDIA Fermi architecture and the cache-extended Quadro FX5800 architecture modeled in Chapter 4.

Our analysis has identified multiple sources of inefficiency in Kilo TM:

• While concurrency control can reduce the number of aborted transactions, the GPU TM application may contain phases of extremely high contention among transactions. The performance overhead of value-based conflict detection for these aborted transactions creates a bottleneck in Kilo TM.
• While transactions in Kilo TM are executed in the SIMT execution model, they validate and commit to global memory at scalar granularity. We opted for the scalar granularity based on our observation that threads in GPU applications usually access adjacent words in memory. A significant overhead stems from the inherent mismatch between this scalar transaction management and the wide memory subsystem in GPUs designed to capture spatial locality. Although Kilo TM may validate and commit transactions in a warp as a group to exploit the spatial locality among threads (just as current GPUs already coalesce non-transactional memory accesses into wider accesses), it may only do so for warps without intra-warp conflicts, i.e. conflicts among transactions in the same warp. Extending the commit units in Kilo TM to detect and explicitly handle intra-warp conflicts in a distributed way can add complexity to the design.
• The protocol used to maintain the consistency between commit units in different memory partitions can introduce significant extra traffic in the on-chip interconnection network.
• Kilo TM has adopted value-based conflict detection because it does not require any essential global metadata.
While it eliminates direct communication between transactions (a trait for scalability), we find that it incurs a significant energy overhead, even if most of the validation lookups hit at the L2 (last-level) cache.

In the following sections, we will present how warp-level transaction management and temporal conflict detection attempt to eliminate these sources of inefficiency.

6.2 Warp-Level Transaction Management (WarpTM)

The GPU memory subsystem is designed to handle accesses with high spatial locality. The L2 cache bank in each memory partition can access a quarter of the cache block (32 Bytes) in a single cycle, and the accessed data is delivered through an interconnection network that can inject 32 Bytes per cycle at each port. The use of wide cache ports and wide flit size matches well with the off-chip DRAM architecture, and delivers high bandwidth with relatively low control hardware overhead. GPUs use special coalescing logic to capture the spatial locality among scalar memory accesses from threads in the same warp.

Although Kilo TM executes transactions in the same warp in parallel, each transaction validates and commits individually. This scalar management simplifies the design of the commit units: conflicts between transactions within the same warp are handled just as conflicts between any two transactions in the system. However, this design simplification results in an inefficient utilization of the memory subsystem. Every cycle, each commit unit can only send one scalar, 4-Byte request (validation or memory writeback of a single 4-Byte word for a single transaction) to the L2 cache bank. Since the L2 cache port can supply 32 Bytes per cycle, this wastes at least 7/8 of the L2 cache bandwidth, creating a major energy overhead for Kilo TM.

The scalar management also introduces many extra protocol messages between the SIMT cores and the commit unit.
Each transaction has to generate at least three messages: (1) a done-fill message to indicate that the entire read-set and write-set have arrived at the commit unit, (2) a response from the commit unit to relay the outcome of value-based conflict detection local to the unit, (3) a message broadcasting the overall transaction outcome to each commit unit. These extra protocol messages can significantly increase interconnection traffic.

Even though it is possible to improve the L2 cache bandwidth utility by extending each commit unit with an access combine buffer that opportunistically accumulates multiple scalar accesses and coalesces them into wider accesses, we believe a simpler alternative is to exploit the spatial locality that already exists in a warp. We call this Kilo TM extension warp-level transaction management (WarpTM). WarpTM uses a low-overhead intra-warp conflict resolution mechanism to detect and resolve all conflicts within a warp before validating and committing the warp via the commit units. A warp that is free of intra-warp conflicts can then be managed as a single entity, allowing various optimizations that boost the performance and efficiency of Kilo TM without introducing complex control logic.

The rest of this section describes the optimizations enabled by warp-level transaction management and the hardware modifications required to realize these optimizations. The implementation of intra-warp conflict resolution will be discussed in Section 6.3.

6.2.1 Optimizations Enabled by WarpTM

Without any potential conflicts within the warp, the commit unit can coalesce the validation and commit requests from all transactions within the warp into wider accesses to the L2 cache. It can also aggregate the protocol messages so that it is relaying the validation outcomes of the entire warp. The benefits from WarpTM can be categorized as follows.

Eliminate Futile Validation.
By resolving conflicts within a warp prior to global commit, transactions that would have failed abort before generating any commit-related traffic out of the SIMT core. This can reduce congestion at the commit units for workloads with high contention, improving their performance and energy usage. Figure 6.2 shows that conflicts between transactions within the same warp (intra-warp conflicts) rarely occur in most of our workloads. The exceptions, BH and AP, both feature a high contention period when many transactions are trying to append leaf nodes to a small tree.

Figure 6.2: Transaction conflicts within a warp. (Y-axis: % of aborted transactions. X-axis: workloads HT-H, HT-M, HT-L, ATM, CL, CLto, BH-H, BH-L, CC, AP.)

Figure 6.3: Kilo TM Protocol Messages. (Message exchanges over time between the SIMT core and commit units CU1 to CU3: Read/Write-Log Transfer, Done/Skip, CU Pass/Fail, TX Pass/Fail, Done; each labeled as scalar, 1 message per transaction, or coalesced, 1 message per warp or per warp per log entry, for unmodified Kilo TM and for warp-level transaction management.)

Aggregate Control Messages.
Figure 6.3 illustrates the protocol messages that are sent between the SIMT core and the commit units to commit a transaction. Notice that the original proposal of Kilo TM already aggregates the read-set and write-set messages and the done-fill messages from multiple transactions in a warp. However, it does not aggregate the remaining protocol messages.

With WarpTM, in the absence of potential intra-warp conflicts, the commit unit can wait until all transactions have finished validation and combine their outcomes (pass/fail) into a single message. After receiving replies from all the commit units, the SIMT core can also combine the final outcomes of the entire warp into a single message that is broadcast to the commit units.

Figure 6.4: Interconnection Traffic Breakdown. (Y-axis: normalized number of transferred flits, broken down into Other, GMemRead, GMemWrite, AtomicOp, TxReadLog, TxWriteLog, LMem, and TxMsg. X-axis: FGLock, IdealTM, and KiloTM-Base for each of the workloads HT-H, HT-M, HT-L, ATM, CL, CLto, BH-H, BH-L, CC, AP.)

Figure 6.4 shows the interconnection traffic breakdown for our workloads with fine-grained locks, Ideal TM, and Kilo TM. The protocol messages for Kilo TM (TxMsg) on average account for 36% of the interconnection traffic.

Validation and Commit Coalescing. While applications with irregular parallelism tend to exhibit less spatial locality among threads within a warp, coalescing memory accesses from the same warp can still significantly reduce the number of accesses. With the original Kilo TM, the memory accesses performed by threads during transaction execution are already coalesced just as the non-transactional memory accesses. The commit units, however, generate scalar memory accesses for validation and commit of transactions to avoid explicitly handling intra-warp conflicts.
This simplifies the design of the commit units. With WarpTM, each commit unit knows a priori that all transactions from the same warp are free of intra-warp conflicts. Consequently, the value-comparison outcome for the validation of one transaction will not be changed after another transaction in the same warp has committed. As a result, the commit unit can always merge the scalar memory accesses for the validation of multiple transactions in the same warp into wider accesses. We call this validation coalescing. Similar reasoning permits the commit unit to merge scalar memory writeback accesses for the commit as well. We call this commit coalescing.

Figure 6.5: Reduction in L2 cache accesses from the commit units via validation and commit coalescing. (Series: ReadAccess, WriteAccess. X-axis: workloads HT-H, HT-M, HT-L, ATM, CL, CLto, BH-H, BH-L, CC, AP, and AVG.)

Figure 6.5 shows the amount of L2 cache accesses from the commit units that can be reduced through validation and commit coalescing. On average, coalescing can reduce the number of validation requests and memory writeback requests by 40% and 39% respectively. Without coalescing, scalar accesses that exhibit spatial locality tend to hit the L2 cache. However, they still waste L2 cache bandwidth supplied by the wide cache ports, and they will consume more L2 cache miss-status holding registers (MSHR) that track accesses waiting for in-flight requests from DRAM.

Validation and commit coalescing can benefit any GPU TM system as long as the GPU still employs the SIMT execution model and accesses memory in large contiguous chunks. Even with wide-channel 3D DRAMs, there are tangible benefits in amortizing SRAM and DRAM control logic by accessing data in large chunks, as long as the applications contain sufficient memory access spatial locality.
Most existing GPU applications do.

6.2.2 Hardware Modification to Kilo TM

In addition to implementing intra-warp conflict resolution (described in Section 6.3), WarpTM requires modification to the commit units. Each commit unit contains a small read/write buffer that caches the read-logs and write-logs of committing transactions. In the original Kilo TM, the commit unit only accesses one word from the read-log or write-log of one transaction in each cycle. The read/write buffer can supply this bandwidth with a narrow (4-Byte) port. WarpTM requires this read/write buffer to have a wide (64-Byte) port. The wide port allows the read-sets and write-sets from multiple transactions in the same warp to be retrieved in a single cycle. Each commit unit also needs to be extended with a memory coalesce logic unit to merge multiple scalar accesses that head to the same cache block into a wider access.

6.3 Intra-Warp Conflict Resolution

Developing a low overhead mechanism to detect conflicts among transactions within a warp is the key challenge in enabling WarpTM. Each transaction in Kilo TM stores its read-set and write-set as linear logs in the local memory space. The logs are organized physically such that each transaction may only access one word in its logs per cycle. Detecting conflicts between two transactions naively requires traversing the linear logs repeatedly, once for each word in the read- and write-log, for a full comparison of the logs.
Even if the logs are fully cached in the L1 data cache, a full pair-wise comparison among T transactions still requires O(T^2 × N) traversals, where N is the combined size of the read- and write-logs of a transaction. The overhead might negate any performance and energy benefit from WarpTM.

While it is possible to detect and resolve intra-warp conflicts with the last writer history units employed in the commit units to boost commit parallelism of value-based conflict detection, each last writer history unit can only detect conflicts for one transaction at a time. It is only as effective as the sequential conflict resolution (SCR) introduced in Section 6.3.2. The 2-phase parallel conflict resolution introduced in Section 6.3.3 allows multiple transactions in the warp to resolve their intra-warp conflicts in parallel.

6.3.1 Multiplexing Shared Memory for Resolution Metadata

We noticed that many applications that require irregular communication between threads in different SIMT cores make little use of shared memory (the on-chip scratchpad memory). This observation is exploited in NVIDIA Fermi GPUs by allowing part of the shared memory storage to be configured as the L1 data cache [121]. The non-configurable part of the shared memory remains unused in most of these applications. Given that intra-warp conflict resolution only involves communication within a warp, we propose to use this underused storage as temporary buffers for intra-warp conflict resolution. When a warp has finished executing its transactions, it allocates a buffer in the shared memory to perform the intra-warp conflict resolution. The warp then uses this buffer to store metadata for its intra-warp conflict resolution, and releases the buffer after the resolution is done. This allows the buffer to be time-shared by multiple warps, a technique known as shared memory multiplexing [168].
The shared memory storage can also be partitioned into multiple buffers to allow multiple warps to interleave their intra-warp conflict resolution to hide access latency for read/write-log accesses that miss the L1 data cache.

To support applications that use shared memory for other computations, the GPU command unit can be extended to launch fewer thread blocks on each SIMT core according to the amount of metadata storage reserved by the programmer. A similar strategy can be used to reserve metadata buffers for intra-warp conflict resolution on future GPU architectures that feature unified storage for shared memory, data cache and registers [64]. We leave to future work the exploration of trade-offs between intra-warp conflict detection metadata capacity and the ability to run more concurrent threads with more register storage.

6.3.2 Sequential Conflict Resolution with Bloom Filter (SCR)

We propose to store a bloom filter [26, 111, 170] in the shared memory. Using this bloom filter, we have developed a sequential conflict resolution (SCR) scheme that always prioritizes the transactions executed by the lower lanes in the warp. Threads with lower thread IDs are assigned to the lower lanes in the warp. In SCR, the transaction with the lowest lane in the warp first populates the bloom filter with its write-set. Each subsequent transaction in the warp first checks to see if its read-set or write-set hits in the bloom filter. If so, this transaction conflicts with one of the transactions in the prior lanes, and it is aborted. Otherwise, the transaction adds its write-set to the bloom filter to make its write-set visible to the subsequent transactions in the warp. The accumulative nature of the bloom filter allows each transaction to compare its read-set and write-set against the write-sets of all transactions in the prior lanes.
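The SCR procedure described above can be sketched in software as follows. This is only an illustrative model, not the hardware design: the filter size (`FILTER_BITS`), the two hash functions, and all function names are assumptions made for the example, and lanes are numbered from 0. The input logs are the four-transaction read-logs and write-logs of Figure 6.6 (X1 to X4 mapped to lanes 0 to 3).

```python
from typing import List, Tuple

FILTER_BITS = 64  # assumed filter size; the text does not specify one


def _bit_positions(addr: int) -> Tuple[int, int]:
    """Two illustrative hash functions over the word address (assumptions)."""
    word = addr >> 2
    return word % FILTER_BITS, (word * 7 + 3) % FILTER_BITS


def _add(bloom: int, addr: int) -> int:
    """Set the filter bits for one address and return the updated filter."""
    for pos in _bit_positions(addr):
        bloom |= 1 << pos
    return bloom


def _hit(bloom: int, addr: int) -> bool:
    """True if every bit for this address is set (may be a false positive)."""
    return all((bloom >> pos) & 1 for pos in _bit_positions(addr))


def sequential_conflict_resolution(
    read_logs: List[List[int]], write_logs: List[List[int]]
) -> Tuple[List[int], List[int]]:
    """Visit lanes in ascending order; return (surviving lanes, aborted lanes)."""
    bloom = 0
    survivors, aborted = [], []
    for lane, (reads, writes) in enumerate(zip(read_logs, write_logs)):
        # A hit means a possible conflict with a write from a prior lane.
        if any(_hit(bloom, a) for a in reads + writes):
            aborted.append(lane)
            continue
        # No conflict: publish this lane's write-set to later lanes.
        for a in writes:
            bloom = _add(bloom, a)
        survivors.append(lane)
    return survivors, aborted


# Logs of X1..X4 from Figure 6.6 (lane 0 = X1).
FIG_READ_LOGS = [[0x04, 0x0C], [0x04, 0x08], [0x04, 0x10], [0x04]]
FIG_WRITE_LOGS = [[0x0C, 0x10], [0x08, 0x14], [0x10, 0x14], [0x04, 0x08]]

if __name__ == "__main__":
    print(sequential_conflict_resolution(FIG_READ_LOGS, FIG_WRITE_LOGS))
```

With these particular hash functions the filter produces no false positives on the Figure 6.6 logs, so lanes 0 and 1 (X1 and X2) survive while lanes 2 and 3 (X3 and X4) abort. Each lane is processed one after another, which is the sequential bottleneck the text attributes to SCR.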
Prior TM proposals that use bloom filters for conflict detection [26, 111, 170] did not exploit this accumulative effect of the bloom filter to allow a transaction to detect conflicts with multiple transactions via a single query. While SCR does reduce the number of transaction log traversals from O(T^2 × N) to O(T × N), its sequential nature makes poor use of the bandwidth provided by the L1 cache and shared memory.

6.3.3 2-Phase Parallel Conflict Resolution with Ownership Table (2PCR)

In SCR, each transaction in the warp is essentially matching its read-set and write-set with the aggregated write-set of all the transactions in the prior lanes. This matching is inherently parallel if each lane has a pre-constructed record of the aggregated write-set from its prior lanes. Also, the priority among lanes is known in advance, so that multiple conflicting lanes can resolve the conflicts unanimously in parallel without extra communication. From these two insights, we have developed two-phase parallel intra-warp conflict resolution (2PCR). First, the transactions in the warp collaboratively construct an ownership table [118] in parallel from the write-logs of every transaction in the warp. Each transaction then checks this ownership table for conflicts with another transaction in a prior lane. If such a conflict exists, the transaction aborts itself.

Each entry in the ownership table represents a region in global memory. Its value contains the lane ID (5-bit) of the lowest lane that intends to write to the region and an extra null-bit to indicate if none of the lanes intends to write to the region. Each entry, padded with two unused bits, occupies a byte in shared memory. When a warp starts performing 2PCR, it first initializes the ownership table by setting the null-bit in every entry through a burst of shared memory writes (each write can initialize 128 Bytes, so a 2K-entry table only takes 16 writes).
To construct the ownership table, each transaction traverses through its write-log to read out the locations in its write-set. For each location, the transaction calculates the index of the corresponding entry in the ownership table by hashing the location’s address. It then updates the corresponding entry with its own lane ID if the entry is null or the existing value in the entry is a higher lane ID. The lockstep nature of a warp and the memory pipeline allows this to occur in parallel: At each step, every transaction reads one entry from its write-log, reads the existing value from the corresponding ownership table entry, compares the value against its own lane ID and updates the entry if its lane ID is lower. A hardware mechanism that implements atomic operations for shared memory [32] is used to prevent two transactions from racing to update the same entry at the same step.

After constructing the ownership table, every transaction traverses through its read-log and write-log. For each location in the read-log, the transaction retrieves the lane ID from the corresponding ownership table entry. If the retrieved lane ID is not null and it is lower than the transaction’s own lane ID, a conflict exists between this transaction and an earlier transaction. For each location in the write-log, the transaction also retrieves the lane ID from the corresponding ownership table entry. However, a conflict exists only if the retrieved lane ID value does not equal the transaction’s own lane ID. The different lane ID indicates that another transaction may overwrite the same location as this transaction. Notice that in this case, the retrieved lane ID will always be smaller, because the ownership table construction mandates that every entry contains the lane ID of the lowest lane intending to write to the corresponding region. Every transaction with any detected conflict aborts itself.
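The two phases described above can be modeled in a few lines of software. The sketch below is a sequential illustration under stated assumptions, not the hardware mechanism: the hash (`table_index`, word address modulo table size) is hypothetical since the text does not give one, `None` stands in for the null-bit, lanes are numbered from 0, and the lockstep atomic updates are replaced by an ordinary loop that yields the same final table. The inputs are the read-logs and write-logs of the four transactions in Figure 6.6.

```python
from typing import List, Optional

TABLE_ENTRIES = 4096  # assumed table size for the sketch


def table_index(addr: int) -> int:
    """Hypothetical hash: word address modulo the table size."""
    return (addr >> 2) % TABLE_ENTRIES


def two_phase_conflict_resolution(
    read_logs: List[List[int]], write_logs: List[List[int]]
) -> List[int]:
    """Return the lanes that survive intra-warp conflict resolution."""
    # Initialization: every entry starts as null (no lane writes the region).
    owner: List[Optional[int]] = [None] * TABLE_ENTRIES

    # Phase 1: each entry ends up holding the lowest lane that writes to it.
    for lane, writes in enumerate(write_logs):
        for addr in writes:
            i = table_index(addr)
            if owner[i] is None or lane < owner[i]:
                owner[i] = lane

    # Phase 2: every lane checks its own logs against the finished table.
    survivors = []
    for lane, (reads, writes) in enumerate(zip(read_logs, write_logs)):
        # A read conflicts only with a writer in a strictly lower lane.
        read_conflict = any(
            owner[table_index(a)] is not None and owner[table_index(a)] < lane
            for a in reads
        )
        # A write conflicts whenever some other lane owns the entry.
        write_conflict = any(owner[table_index(a)] != lane for a in writes)
        if not (read_conflict or write_conflict):
            survivors.append(lane)
    return survivors


# Logs of X1..X4 from Figure 6.6 (lane 0 = X1).
FIG_READ_LOGS = [[0x04, 0x0C], [0x04, 0x08], [0x04, 0x10], [0x04]]
FIG_WRITE_LOGS = [[0x0C, 0x10], [0x08, 0x14], [0x10, 0x14], [0x04, 0x08]]

if __name__ == "__main__":
    print(two_phase_conflict_resolution(FIG_READ_LOGS, FIG_WRITE_LOGS))
```

On the Figure 6.6 logs the sketch keeps lanes 0 and 1 (X1 and X2) and aborts lanes 2 and 3 (X3 and X4). Because phase 2 only reads the table, every lane's check is independent of the others, which is what lets the hardware run the checks in parallel.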
The remaining transactions in the warp can then proceed and benefit from the optimizations enabled by WarpTM.

Figure 6.6 contains an example of the two-phase parallel intra-warp conflict resolution. The example consists of four transactions (X1, X2, X3, X4), each reading and writing to two locations in memory (except X4, which only reads from one location). These memory accesses are stored in the read-log and write-log shown in the upper left corner of Figure 6.6. Each column in both the read-log and the write-log shows the read-set and write-set of a particular transaction respectively. Each entry in the read-log is labeled according to its position in the log (RL1, RL2); entries in the write-log of each transaction are labeled similarly (WL1, WL2). Entries at the same position in read-logs and write-logs of the four transactions are grouped into a wide entry, shown as a row in the logs, that can be read out in parallel.

[Figure 6.6 image omitted. Recoverable contents:]

Read-Log    X1     X2     X3     X4
RL1         0x04   0x04   0x04   0x04
RL2         0x0c   0x08   0x10   --

Write-Log   X1     X2     X3     X4
WL1         0x0c   0x08   0x10   0x04
WL2         0x10   0x14   0x14   0x08

Ownership table   0x04   0x08   0x0c   0x10   0x14
After step 1      RO     RO     RO     RO     RO
After step 2      4      2      1      3      RO
After step 3      4      2      1      1      2

Figure 6.6: Two-phase parallel intra-warp conflict resolution. Each step shows the content of the ownership table and accesses from the transactions in the warp. RO = Read-Only

Here are the steps in this example indicated by the numbers in Figure 6.6:

1. Every entry in the ownership table is initialized to read-only (RO).
2. Every transaction updates the ownership table according to entry 1 in its own write-log (WL1).
3. The transactions proceed to entry 2 in their write-log (WL2). The ownership entry for 0x08 is not updated to X4 because it is already owned by X2, which has a higher priority. Meanwhile, the ownership of 0x10 is updated from X3 to X1.
4. The transactions proceed to entry 1 in their read-log (RL1) for parallel matching. Since all transactions read a single location 0x04, they check the ownership table in parallel for the first lane that writes to 0x04, which is X4. None of the transactions aborts at this point since their lane IDs are smaller than or equal to that of X4.
5. Every transaction proceeds to checking entry 2 in its own read-log (RL2). X3 is aborted since 0x10 is already owned by X1.
6. Each remaining transaction has ownership of the address in entry 1 of its write-log.
7. After checking entry 2 in its write-log, X4 is aborted due to a WAW conflict with X2 at 0x08.

Finally, X1 and X2 are conflict free and can be validated and committed together via the commit units. Notice that this relatively narrow warp will take 15 steps with SCR, compared to the 7 steps above. With 32-wide warps and larger transaction footprints in real workloads, the difference is even greater.

The accuracy of 2PCR depends on the size of the ownership table. Our evaluation shows that a 4K entry ownership table (requiring 4kB of storage in shared memory) performs comparably to an ownership table with infinite capacity. With a fixed-size ownership table, the accuracy of the intra-warp conflict resolution decreases for transactions with larger read/write-sets. The average per-transaction footprint in our workloads ranges from 3 to 36 words.
Since GPU applications usually decompose larger input data into more threads, we believe that the per-transaction footprints in future GPU TM applications should not grow significantly beyond the footprints found in our workloads.

Notice that 2PCR tends to be less accurate than SCR. Since the ownership table is constructed in parallel assuming that every transaction in the warp will be committed, a transaction may unnecessarily abort due to a conflict with another aborted transaction. Nevertheless, our evaluation shows that the benefit from 2PCR outweighs the overhead from its additional false conflicts.

6.4 Temporal Conflict Detection (TCD)

While warp-level transaction management can help reduce the extra interconnect traffic introduced by the Kilo TM protocol and improve L2 cache bandwidth utility, a fundamental overhead for Kilo TM still exists: Even without conflicts among transactions, each transaction has to reread its entire read-set for value comparisons prior to updating memory. To reduce this overhead, we propose temporal conflict detection, a low overhead mechanism that uses a set of globally synchronized timers to detect conflicts for read-only transactions.

A read-only transaction can occur dynamically when the transaction only conditionally writes to memory, or it can be explicitly introduced by programmers to ensure that code within the transaction can safely read from a shared data structure which may be updated occasionally by other transactions. The latter use of read-only transactions is required even for TM systems with strong isolation if the application expects to read multiple pieces of data from the shared data structure. For example, a thread that periodically computes the ratio between two measured quantities should read both quantities simultaneously for every computed ratio.
If a transaction commits, updating both quantities atomically, after the thread has read one quantity but before it reads the other quantity, the computed ratio may be erroneous. Strong isolation does not protect the application from this error; the only way to prevent this error is to include both reads in a read-only transaction.

A read-only transaction differs from a read-write transaction in that it can commit silently and locally as long as it has observed a consistent memory state, that is, a memory state that does not contain partial memory updates from other committing transactions. Although the original design of Kilo TM can dynamically detect a read-only transaction at commit by observing an empty write-log, it does not exploit this information. Among the GPU TM workloads we created for evaluation, read-only transactions account for ∼40% of the transactions in CL/CLto and ∼85% of the transactions in BH-L/BH-H. In CL/CLto, the read-only transaction occurs dynamically because the transaction only conditionally applies forces to two vertices in a mesh if they are sufficiently far away. In BH-L/BH-H, we have added the read-only transaction to the octree traversal to ensure that a freshly inserted branch node has been properly initialized by the inserting thread before other threads may traverse through the branch node. Being able to commit these read-only transactions silently without rereading their read-set can significantly reduce their energy and performance overhead.

TCD is a form of eager conflict detection that complements Kilo TM. Using a set of globally synchronized timers, it checks the accessed location as a transaction reads from global memory to ensure that the loaded data has not been modified since the transaction first reads from global memory. To do so, the system records when each word in memory was last written.
Each transaction maintains the time of its first load, and each subsequent load in the transaction retrieves the time when the loaded word was last written. A retrieved last written time that occurs later than the time of the transaction’s first load indicates a potential conflict, because the value at the loaded location has been modified since the first load. If none of the words loaded by the transaction has been written since the first load, the value of every word read by the transaction coexists in an instantaneous snapshot of global memory that existed at the time of the first load. A read-only transaction satisfying this condition has effectively obtained all of its input values from this snapshot, and appears to have executed instantly with respect to other transactions. Therefore, the read-only transaction can commit directly without further validation.

Notice that the instantaneous snapshot observed by the transaction via TCD may occur in the midst of a memory writeback from a committing transaction. This causes the snapshot to contain partial updates from a transaction, which is not a consistent view of memory. We have not observed this issue in the workloads we evaluated. Nevertheless, it is possible to extend TCD to detect if a transaction is loading from a location in the write-set of another committing transaction. The overhead for such a detection mechanism involves a hardware buffer that conservatively records the last transaction that has written to each memory location, and extra protocol messages and hardware for maintaining a conservative set of committing transactions. We leave the exploration of this and other potential solutions for future work.

6.4.1 Globally Synchronized Timer

While timestamp-based conflict detection has been used in existing software TM systems [37, 148, 167], each of these systems uses a global version number (a software counter in memory) that is explicitly updated by software at transaction commit.
The globally synchronous timers used by temporal conflict detection are different from these software maintained counters. Once initialized, each of these on-chip timers runs locally in its microarchitecture module, increments every cycle, and does not communicate with other timers. Since the timers run at the same frequency and are synchronized initially, one can compare a timestamp captured from one of the timers against another timestamp captured from another timer to determine their respective order in real time. This hardware mechanism eliminates the bottleneck of updating and accessing a centralized global version number. As pointed out by Singh et al. [146], existing hardware already implements timers synchronized across components [79, Section 17.12.1] to provide efficient timer services. Ruan et al. [135] also proposed extending Orec-based STM systems with synchronous hardware timers. Their approach embeds timestamps in the ownership record of each transaction variable in memory, whereas we use a small on-chip storage to conservatively record when each word is last written.

[Figure 6.7 image omitted. Recoverable labels: SIMT Core (Sync. Timer; First Read Time Table, indexed by #Tx); Memory Partition (Sync. Timer; Last Written Time Table, indexed by hashed Block Address, with a Min stage); Last Written Time (for read); Local Time (for write).]

Figure 6.7: Hardware extensions for temporal conflict detection.

6.4.2 Implementation

Figure 6.7 shows the hardware modification to implement TCD: A 64-bit globally synchronized timer in each SIMT core and memory partition, a first-read time table in each SIMT core recording when each transaction sends its first load, and a last written time table in each memory partition that conservatively records the last written time of each 128-Byte block in the partition. This time table is conservative in that aliasing in the time table may cause it to return a timestamp for a memory location that is more recent than the actual time when the location was last written.
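To make the conservative behaviour concrete, the following Python sketch models a small last written time table and the TCD check. It is a simplified illustration under assumptions, not the hardware design: the class and function names, the tiny table size, and the hash functions are hypothetical, and timestamps are passed in by the caller instead of being read from a synchronized timer. The table can be split into several sub-arrays that are indexed by different hashes and combined by taking the minimum, which is the organisation the rest of this section adopts.

```python
class LastWrittenTimeTable:
    """Conservative record of when each 128-Byte block was last written."""

    def __init__(self, entries_per_subarray=4, num_subarrays=1, block_bytes=128):
        self.entries = entries_per_subarray
        self.block_bytes = block_bytes
        self.subarrays = [[0] * entries_per_subarray for _ in range(num_subarrays)]

    def _index(self, sub, addr):
        # Hypothetical hash family: one cheap mix per sub-array.
        block = addr // self.block_bytes
        mixed = block if sub == 0 else block ^ (block >> (2 * sub))
        return mixed % self.entries

    def record_write(self, addr, now):
        # A committing write stamps the block's entry in every sub-array.
        for sub, table in enumerate(self.subarrays):
            table[self._index(sub, addr)] = now

    def last_written(self, addr):
        # Each entry is >= the true last written time, so the minimum is too.
        return min(table[self._index(sub, addr)]
                   for sub, table in enumerate(self.subarrays))


def tcd_conflict(first_read_time, last_written_time):
    """A load conflicts if its block was written after the first load."""
    return last_written_time > first_read_time


A, C = 0x0000, 0x0200  # blocks 0 and 4: they alias in a 4-entry modulo table
one = LastWrittenTimeTable(num_subarrays=1)
two = LastWrittenTimeTable(num_subarrays=2)
for table in (one, two):
    table.record_write(A, now=7)  # another transaction commits a write to A

first_read = 5  # our transaction issued its first load at time 5
print(tcd_conflict(first_read, one.last_written(C)))  # C was never written
print(tcd_conflict(first_read, two.last_written(C)))
print(tcd_conflict(first_read, two.last_written(A)))
```

In this run the single-array table reports a (false) conflict for the untouched block C because it aliases with A, the two-sub-array table does not, and both organisations still report the real conflict on A. A false positive costs performance but never correctness, since the returned timestamp is never older than the true one.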
We implement the last written time table with a recency bloom filter. Kilo TM also uses it for hazard detection in Chapter 4. This variant of the recency bloom filter consists of multiple sub-arrays of timestamps. Each 128-Byte block in the memory partition maps to an entry in each sub-array of the recency bloom filter via a different hash function. Whenever a word in the 128-Byte block is updated by a committing transaction in the L2 cache, the corresponding entries in every sub-array of the recency bloom filter are updated with the value from the synchronized timer. Each transactional load served by the L2 cache retrieves a timestamp from each sub-array in the recency bloom filter and returns the minimum of those timestamps along with the data to the SIMT core. This timestamp is compared against the time of the transaction’s first load to detect conflicts.

At 700MHz, 64-bit timers only roll over every few hundred years. In the event that a rollover happens, the TM system can handle it by validating the read-set of all running transactions through value-based conflict detection. For the transactions that remain valid, the TM system resets their first read time to zero. Although not needed for correctness, it should also reset the last written time table so that the table will not report overly conservative last written times.

6.4.3 Example

Figure 6.8 walks through how TCD detects an inconsistent view of memory during transaction execution. The example consists of two committing transactions, XA and XB, and one read-only transaction XC. At time T=1, XC starts execution and issues its first load to location 0x10. XA then commits and updates the value at 0x10 at time T=7. At the same time, XA also updates the timestamps corresponding to location 0x10 in the recency bloom filter (consisting of 2 sub-arrays with 2 timestamps each). XC issues its second load to location 0x30 at T=10, and the recency bloom filter returns with TW[0x30]=0.
Notice that even though the original value at 0x10 is overwritten by XA, the updated value is not visible to XC and does not constitute a conflict. XB commits and updates the value at 0x20 at time T=15, and updates the recency bloom filter. XC issues its third load to location 0x20 at T=18, and the recency bloom filter returns with TW[0x20]=15, which is later than the first load from XC. This is a conflict for XC because the memory value loaded from 0x20 at T=18 is not the same value at 0x20 at T=1; it has been updated by XB at T=15. Hence, the memory state observed by XC does not correspond to an actual global memory state at any time: it is an invalid snapshot.

[Figure 6.8 image omitted. Recoverable contents: the last written time table (Sub-Array #1 and Sub-Array #2, indexed by Hash Func. #1 and Hash Func. #2) shown along a time axis as XC issues Read(0x10) at T=1 (TW=0), XA issues Write(0x10) at T=7, XC issues Read(0x30) at T=10 (TW=0), XB issues Write(0x20) at T=15, and XC issues Read(0x20) at T=18 (TW=15).]

Figure 6.8: Temporal conflict detection example.

6.4.4 Integration with Kilo TM

In this work, Kilo TM uses TCD to allow read-only transactions to commit silently in the absence of detected conflicts. The recency bloom filter does not perfectly record the time when each word is last written. Aliasing of timestamps in the filter can lead to false positives in TCD. To reduce the penalty of falsely detected conflicts, read-only transactions with a detected conflict are given a second chance to commit through the commit units as read-write transactions in Kilo TM. In this way, we can use a relatively small filter with coarse granularity (every word in a 128-Byte chunk maps to the same entry) to allow most conflict-free read-only transactions to commit silently, and use value-based conflict detection for the situations that require finer granularity detection. Similar hierarchical validation schemes are used in NOrec [37] and the software GPU TM system by Xu et al. [167].

[Figure 6.9 image omitted. Recoverable flow: Transaction Execution with Temporal Conflict Detection; "TCD Conflict Detected?" (a read-only transaction with No takes the Silent Commit path; a read-write transaction, or Yes, continues); Warp-Level Transaction Management with Intra-Warp Conflict Resolution (Ownership Table Construction, Parallel Match), where a conflict with a transaction in a lower lane leads to Self-Abort and intra-warp conflict-free transactions continue; Coalesced Value-Based Conflict Detection (Pass/Fail); Coalesced Commit (Update Global Mem.).]

Figure 6.9: Kilo TM enhanced with warp-level transaction management and temporal conflict detection.

6.5 Putting It All Together

Our two proposed enhancements to Kilo TM, warp-level transaction management and temporal conflict detection, can work together to further improve performance. Figure 6.9 shows the overall design of Kilo TM with both enhancements enabled. In this enhanced Kilo TM, each transaction uses TCD to eagerly detect conflicts for each global memory read during its execution. Writes to global memory are buffered in the write-log as in the original Kilo TM. After the transaction has completed execution, if it is a read-only transaction (i.e. containing an empty write-log) and TCD has not detected any conflict, it can commit silently. The remaining transactions take part in the intra-warp conflict resolution to resolve all conflicts within the same warp. WarpTM then processes the still-active transactions in the warp with the assurance that they do not have conflicts among each other. Our evaluation in Section 6.7 compares the performance and energy consumption of this combined TM system to the original Kilo TM.

6.6 Methodology

For our evaluation, we started with the version of GPGPU-Sim [10] from Chapter 4.
It extends GPGPU-Sim version 3.1.2 with support for transactional memory, and includes the performance model for Kilo TM. We incorporated GPUWattch [97] into this version of GPGPU-Sim, and extended it to model the timing and power of our proposed enhancements. This version of GPGPU-Sim with all the modifications is available online [60]. We configured the modified GPGPU-Sim to simulate a GPU similar to GeForce GTX 480 (Fermi), with 16kB of shared memory storage per SIMT core.

Table 6.1: GPGPU-Sim configuration for enhanced Kilo TM evaluation

# SIMT Cores                 15
Warp Size                    32
SIMD Pipeline Width          16×2
# Threads / Core             1536
# Registers / Core           32768
Branch Divergence Method     PDOM [56]
Warp Scheduling Policy       Greedy-then-oldest [132]
Shared Memory / Core         16KB
L1 Data Cache / Core         48KB, 128B line, 6-way assoc. (transactional+local mem. access only)
L2 Unified Cache             128KB/Memory Partition, 128B line, 8-way assoc.
Interconnect Topology        1 Crossbar/Direction
Interconnect BW              32 (Bytes/Cycle) (288GB/s/Dir.)
Interconnect Latency         5 Cycle (Interconnect Clock)
Compute Core Clock           1400 MHz
Interconnect Clock           1400 MHz
Memory Clock                 924 MHz
# Memory Partitions          6
DRAM Req. Queue              32 Requests
Memory Controller            Out-of-Order (FR-FCFS)
GDDR5 Memory Timing          Hynix H5GQ1H24AFR
Total DRAM BW                177GB/s
Min. L2 Latency              330 Cycle (Compute Core Clock)
DRAM Scheduler Latency       200 Cycle (Compute Core Clock)

Kilo TM
Commit Unit Clock            700 MHz
Validation/Commit BW         1 Word/Cycle/Memory Partition
# Concurrent TX              1, 2, 4, 8 Warps/Core or No Limit (480, 960, 1920, 3840 or Unlimited # TX Globally)
Last Writer History Unit     5kB

Intra-Warp Conflict Resolution
Shared Memory Metadata       4kB/Warp (3 Concur. Resolution/Core)
Default Mechanism            2-Phase Parallel Conflict Resolution

Temporal Conflict Detection
Last Written Time Table      16kB (2048 Entries in 4 Sub-Arrays)
Detection Granularity        128-Byte
Table 6.1 lists the major microarchitecture configurations.

We used the GPU TM workloads from Chapter 4 to evaluate the proposed improvements to Kilo TM. In addition to the original input, we also added new inputs that vary the amount of contention for BH and HT. We also created an optimized version of CL (CLto). In this version, each thread loads read-only data into its register file before entering the transaction to reduce the read-set of the transaction. Table 6.2 summarizes each of our workloads.

Table 6.2: GPU TM workloads for performance and energy evaluations.

Name                           Abbr.   Description
Hash Table (CUDA)              HT-H    Populate an 8000-entry hash table.
                               HT-M    Populate an 80000-entry hash table.
                               HT-L    Populate an 800000-entry hash table.
Bank Account (CUDA)            ATM     Parallel transfer between 1M accounts.
Cloth Physics [20] (OpenCL)    CL      Cloth physics simulation of 60K edges.
                               CLto    Optimized version of CL.
Barnes Hut [23] (CUDA)         BH-H    Build an octree with 30K bodies.
                               BH-L    Build an octree with 300K bodies.
CudaCuts [162] (CUDA)          CC      Segmentation of a 200×150 pixel image.
Data Mining [3, 88] (CUDA)     AP      Data mining 4000 records.

6.6.1 Power Model

We modeled the power overhead of Kilo TM by estimating the access energy of the various major structures in the commit units implemented in the 40nm process with CACTI 6.5 [115]. We multiplied the access energies with the operating frequency, conservatively assuming that the structures are accessed every cycle, to estimate their power overhead. As observed by Shah [143], CACTI provides conservative area and energy estimates for small memory arrays as it automatically partitions the array into sub-arrays when a single array is sufficient. Similarly, we modeled the power overhead of TCD with the full activity power to the last written time buffer in each memory partition and the first read time table in each SIMT core. Table 6.3 shows the estimated power for each component in Kilo TM and temporal conflict detection.
The Kilo TM specific hardware consumes 0.9W in total, and extending it to support WarpTM increases the consumption to 2.5W. The power increase is introduced by having a wider port to the read-write buffer (See Section 6.2.2). The hardware that implements TCD consumes 0.7W. Kilo TM with WarpTM and TCD consumes a total of 3.2W.

Table 6.3: Power component breakdown for the added hardware specific to Kilo TM, warp-level transaction management, and temporal conflict detection.

Commit Unit                                      Size    Area (mm²)   Power (mW)
Last Writer History - Look Up Table              3kB     0.010        6.3
Last Writer History - Recency Bloom Filter       2kB     0.010        6.6
Commit Entry Array                               19kB    0.094        57.5
Read-Write Buffer                                32kB    0.128        82.5
Per-Unit Total                                           0.242        153
All Units Total                                          1.454        918

Commit Unit (Warp-Level Transaction Management)  Size    Area (mm²)   Power (mW)
Read-Write Buffer (Warp-Level TM)                32kB    0.731        260
Per-Unit Total                                           0.846        419
All Units Total                                          5.074        2512

Temporal Conflict Detection                      Size    Area (mm²)   Power (mW)
First Read Timetable (One per SIMT core)         12kB    0.034        25.5
Last Written Time Buffer (One per Mem. Part.)    16kB    0.078        52.3
All Units Total                                          0.979        696

For parts of the GPU microarchitecture not specific to Kilo TM, we used GPUWattch [97] to estimate the average dynamic power consumed by each workload with the different synchronization mechanisms. This captures the difference in microarchitecture activity between fine-grained locks and Kilo TM (with and without the proposed enhancements). This includes extra L1 cache accesses for the transaction logs, extra L2 accesses for value-based conflict detection, extra interconnection traffic for Kilo TM protocol messages, and accesses to shared memory for intra-warp conflict resolution. To compute the total power, we added 59W for leakage power (obtained by Leng et al. [97] via measuring the idle power of the hardware GPU), 9.8W of constant clock power (reported by GPUWattch) and the dynamic power of the Kilo TM specific hardware to the average dynamic power reported by GPUWattch.
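As a cross-check of the figures quoted so far, the short Python sketch below recomputes the "All Units Total" rows of Table 6.3 from the per-unit powers and then assembles the total power as described above. It rests on a few assumptions that should be read as such: one commit unit per memory partition (which is what the 153 mW to 918 mW ratio implies), the unit counts of Table 6.1, and a purely hypothetical value for the workload's average dynamic power, which in the evaluation comes from GPUWattch.

```python
NUM_MEM_PARTITIONS = 6   # Table 6.1
NUM_SIMT_CORES = 15      # Table 6.1

# Per-unit powers from Table 6.3, in mW.
COMMIT_UNIT_MW = 153.0
COMMIT_UNIT_WARPTM_MW = 419.0
FIRST_READ_TABLE_MW = 25.5       # one per SIMT core
LAST_WRITTEN_BUFFER_MW = 52.3    # one per memory partition

# Assumption: one commit unit per memory partition.
kilo_tm_w = NUM_MEM_PARTITIONS * COMMIT_UNIT_MW / 1000
warptm_w = NUM_MEM_PARTITIONS * COMMIT_UNIT_WARPTM_MW / 1000
tcd_w = (NUM_SIMT_CORES * FIRST_READ_TABLE_MW
         + NUM_MEM_PARTITIONS * LAST_WRITTEN_BUFFER_MW) / 1000


def total_power_w(avg_dynamic_w, tm_hardware_w, leakage_w=59.0, clock_w=9.8):
    """Total power: GPUWattch dynamic + TM hardware + leakage + clock."""
    return avg_dynamic_w + tm_hardware_w + leakage_w + clock_w


print(f"Kilo TM hardware:      {kilo_tm_w:.3f} W")
print(f"with WarpTM:           {warptm_w:.3f} W")
print(f"TCD hardware:          {tcd_w:.3f} W")
print(f"WarpTM + TCD hardware: {warptm_w + tcd_w:.3f} W")

# Hypothetical workload: 40 W of average dynamic power (illustrative only).
print(f"Example total power:   {total_power_w(40.0, warptm_w + tcd_w):.1f} W")
```

The recomputed totals agree with the table to within rounding of the per-unit entries (for example, 6 × 419 mW gives 2514 mW against the tabulated 2512 mW).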
Finally, we multiplied this total power by the execution time to obtain the total energy required to execute each workload.

We assumed that the GPU extended with Kilo TM, WarpTM and TCD runs at the same frequency as the unmodified GPU. We did not evaluate the impact of Kilo TM and the two enhancements on the GPU cycle time. We leave this evaluation as future work.

[Figure 6.10 image omitted. Bar chart of execution time normalized to FGLock for HT-H, HT-M, HT-L, ATM, CL, CLto, BH-H, BH-L, CC, AP and AVG; series: KiloTM-Base, TCD, WarpTM, WarpTM+TCD.]

Figure 6.10: Execution time of GPU-TM applications with enhanced Kilo TM. Lower is better.

6.7 Experimental Results

In this section, we evaluate the performance and energy efficiency of our proposed enhancements to Kilo TM: warp-level transaction management (WarpTM) and temporal conflict detection (TCD). We also analyze the benefit of each optimization that is enabled by WarpTM, compare the two intra-warp conflict resolution approaches, and investigate the sensitivity of TCD to available hardware resources. Finally, we study the performance impact of L2 cache port width and number of SIMT cores on both fine-grained locks and Kilo TM.

6.7.1 Performance and Energy Efficiency

Figure 6.10 compares the execution time of the original Kilo TM (KiloTM-Base), Kilo TM with TCD enabled (TCD), Kilo TM with WarpTM (WarpTM) and a configuration with both enhancements enabled (WarpTM+TCD). We evaluate the performance of each configuration with different limits on the number of concurrent transactions, and select a limit for each workload that yields the optimal performance (See Table 6.4).
The performance of each configuration with this optimal limit is normalized to the execution time of an alternative version of the application using fine-grained locking (FGLock) to illustrate their overhead with respect to a pure software effort. Figure 6.11 breaks down the energy consumption of the same set of Kilo TM configurations. Each breakdown is normalized to the total energy used by the fine-grained locking version of the same workload.

[Figure 6.11 image omitted. Stacked bars of energy usage normalized to FGLock for FGLock, KiloTM-Base and WarpTM+TCD on HT-H, HT-M, HT-L, ATM, CL, CLto, BH-H, BH-L, CC and AP, broken down into Core, L1 Cache, SMem, NoC, L2 Cache, DRAM, KiloTM, Idle and Leakage (off-scale bars labeled 4.3X, 2.9X, 2.6X and 2.5X); a second panel shows AVG and AVG (Dyn. Only) for KiloTM-Base, TCD, WarpTM and WarpTM+TCD.]

Figure 6.11: Energy consumption breakdown of GPU-TM applications. Lower is better.

Without WarpTM and TCD, Kilo TM performs 2.5× slower (see footnote 8) than FGLock on average. The performance of BH-H is particularly poor. Our detailed investigation shows that the major slowdown occurs near the start of the octree-building kernel, where every thread tries to append its node to only a few branches in the octree. This behavior also stresses one memory partition in the GPU, working against the distributed design of Kilo TM. This effect slowly disappears as the octree grows to a point that conflicts become rare. As a result, BH-L, which scales the number of nodes in the octree by 10×, exhibits less slowdown. A similar attempt to reduce transaction contention does not work as well between HT-H and HT-M.
The energy-per-operation penalty of Kilo TM is relatively lower (2× energy used vs. 2.5× performance slowdown) due to its lower activity from the poor performance.

Enabling temporal conflict detection for Kilo TM improves the performance of workloads that contain read-only transactions (CL, CLto, BH-H and BH-L). By allowing non-conflicting read-only transactions to commit silently, TCD reduces contention at the commit units and the memory subsystem. This performance improvement translates directly to energy savings as well. Across workloads with read-only transactions, TCD improves performance of Kilo TM by 37% while reducing energy per operation by 30%.

The optimizations enabled by warp-level transaction management (WarpTM) apply to a broader class of applications. In particular, coalescing of memory accesses from the commit units alleviates the bottlenecks at the L2 cache banks. Without this bottleneck, the reduction in transaction contention from HT-H to HT-M now leads to performance improvement for Kilo TM + WarpTM. Section 6.7.3 analyzes how each optimization from WarpTM contributes to speeding up Kilo TM in different applications. WarpTM does slow down CC, due to overhead from intra-warp conflict resolution (See Section 6.7.4). Overall, WarpTM speeds up Kilo TM by 42% and reduces energy per operation by 27%.

Footnote 8: Notice that this is slower than the 59% slowdown reported in Chapter 4. This change is introduced by a change in the baseline GPU architecture as explained in Section 6.1.

Table 6.4: Performance-optimal concurrent transaction limit and abort-commit ratio. Base = KiloTM-Base. NL = No Limit. – = Identical to no TCD.

        Concurrent Transaction Limit             Aborts per 1000 Committed Trans.
        (#Trans. Warps/SIMT Core)
        Base   TCD   WarpTM   TCD+WarpTM         Base   TCD   WarpTM   TCD+WarpTM
HT-H    2      –     2        –                  50     –     107      –
HT-M    2      –     8        –                  7      –     84       –
HT-L    4      –     8        –                  2      –     63       –
ATM     1      –     4        –                  1      –     27       –
CL      2      2     2        2                  84     55    149      97
CLto    2      2     2        4                  85     49    102      99
BH-H    2      2     4        4                  23     23    53       56
BH-L    8      4     8        8                  7      5     17       20
CC      NL     –     NL       –                  6      –     6        –
AP      1      –     1        –                  264    –     318      –

Kilo TM with both enhancements enabled shows greater benefit than with either enhancement alone. The combined benefits, together with software optimizations, allow CLto to perform within 90% of the FGLock version while using about the same amount of energy. The original version of this cloth simulation workload performs 2.2× worse than the FGLock version on Kilo TM. This kind of performance improvement indicates that a well designed TM system can complement optimization efforts from the software developer to produce efficient TM applications comparable to FGLock. Overall, Kilo TM with both TCD and WarpTM enabled achieves 66% of the performance of FGLock with only 34% energy overhead.

Concurrency Limit. Table 6.4 shows the limit on number of concurrent transactions that yields the best performance for each workload with the different Kilo TM configurations. Enabling WarpTM generally increases the optimal limit for each workload, as a result of reduced congestion at the memory subsystem and interconnection network compared to KiloTM-Base. The exceptions include high-contention workloads (HT-H), workloads that do not benefit from WarpTM (CC, AP), and workloads that may overflow the L1 cache with higher limits (CL). We find that enabling TCD has little effect on the optimal concurrency limit for our workloads. Enabling TCD increases the limit for CLto, but the actual speedup from the increased limit is < 5%.

Impact on Abort-Commit Ratio. Table 6.4 also shows the number of aborts per 1000 committed transactions for the different Kilo TM configurations. Enabling TCD reduces the number of aborted transactions for CL and CLto by shortening the execution time span for each read-only transaction. This lowers the probability of another transaction overwriting a location in the read-set of the read-only transaction. Enabling WarpTM introduces significantly more aborted transactions due to false conflicts from intra-warp conflict resolution. Nevertheless, our evaluation has shown that WarpTM leads to an overall speedup and a net energy saving.

Comparison with Coarse-Grained Locking. Figure 6.12 compares the performance of our GPU TM applications running on Kilo TM against coarse-grained lock versions of the applications (CGLock). The performance of coarse-grained lock is estimated by serializing all transaction executions via a single global lock. On average, the applications running on Kilo TM outperform coarse-grained locking by 104×. Enabling WarpTM and TCD for Kilo TM increases this speedup to 192×. Finally, fine-grained locking versions of the applications (FGLock) run 315× faster than their coarse-grained locking analogs.

[Figure 6.12 image omitted. Bar chart of speedup over CGLock (axis 0 to 1000) for HT-H, HT-M, HT-L, ATM, CL, CLto, BH-H, BH-L, CC, AP and AVG; series: KiloTM-Base, KiloTM + WarpTM + TCD, FGLock.]

Figure 6.12: Performance comparison with coarse-grained locking. Higher is better.

6.7.2 Energy Usage Breakdown

The energy usage breakdown in Figure 6.11 illustrates the relative contributions from different overheads of Kilo TM to its overall energy usage. Across our workloads, leakage and idle power contribute to > 50% of the total energy consumption. Both leakage power and idle power (consisting mostly of clock distribution power) persist throughout the program execution, so their contributions increase as execution time lengthens. Removing the contribution from leakage reduces the overall energy overhead of KiloTM-Base from 103% to 59%.
Similarly, the pure dynamic energy overhead of Kilo TM with WarpTM and TCD enabled is only 18% (versus 34% with leakage).

Aside from leakage and idle power, the memory subsystem (L2 Cache and DRAM) and the interconnection network (NoC) dominate the remaining portion of the dynamic energy usage. On average, the two combined contribute to ∼70% of the dynamic energy. For some workloads, the L2 cache energy with KiloTM-Base is > 2× that of FGLock. WarpTM and TCD have essentially eliminated this overhead, and on average, reduce the combined energy of L2 Cache, DRAM and NoC by 29%. Energy consumed by the SIMT cores only contributes to ∼25% of the dynamic energy usage. With Kilo TM, energy consumption by the core is lower than FGLock. This illustrates how an effective transaction concurrency control mechanism can substantially cut down the energy overhead for transaction re-execution. L1 accesses for transaction logs contribute to 5% of the dynamic energy usage. Adding intra-warp conflict resolution to support WarpTM increases this overhead by 23% (i.e. < 2% increase to the overall energy usage), because the intra-warp conflict resolution generates extra accesses to the transaction logs. Finally, Kilo TM-specific hardware only contributes to 3% of the dynamic energy usage. Inclusion of hardware to support TCD and WarpTM only increases this to 8%.

6.7.3 WarpTM Optimizations Breakdown

Figure 6.13 breaks down the performance impact of each optimization introduced by WarpTM by enabling the optimizations one by one. In this analysis, we have used an ideal version of intra-warp conflict resolution that has perfect accuracy and no overhead.

[Figure 6.13 image omitted. Bar chart of execution time normalized to FGLock (axis 0 to 6) for HT-H, HT-M, HT-L, ATM, CL, CLto, BH-H, BH-L, CC, AP and AVG; series: KiloTM-Base, + Intra-Warp Conflict Resolution, + Aggregate Protocol Messages, + Validation and Commit Coalescing, + Overclocked Hazard Detection.]

Figure 6.13: Performance impact from different optimizations enabled by WarpTM.
This isolates our analysis from performance issues that may arise from the particular conflict resolution scheme.

With only intra-warp conflict resolution enabled, WarpTM only impacts performance for applications that exhibit intra-warp transaction conflicts. While detecting such conflicts and resolving them within the warp benefits BH, it slows down HT-H. This is because intra-warp conflict resolution prematurely aborts transactions that could have been committed in HT-H. In resolving a conflict within a warp, it is possible that a transaction that could eventually commit is aborted, while the conflicting transaction is in turn aborted in the global phase of the commit.

Aggregating protocol messages speeds up applications by varying amounts. Without further optimizations, the reduction in interconnection traffic from aggregation simply exposes the memory subsystem bottleneck. Nevertheless, the average performance of Kilo TM improves by 30% and the energy consumption is reduced by 17%.

Figure 6.14: Comparison between different intra-warp conflict resolution mechanisms. SCR = Sequential Conflict Resolution. 2PCR = 2-Phase Parallel Conflict Resolution. (Y-axis: execution time normalized to FGLock. Series: KiloTM-Base, WarpTM+2PCR, WarpTM+2PCR (NoOverhead), WarpTM+SCR, WarpTM+SCR (NoOverhead). Workloads: HT-H, HT-M, HT-L, ATM, CL, CLto, BH-H, BH-L, CC, AP.)

Coalescing the memory accesses from the commit units allows the L2 cache bandwidth to be better utilized. This reduces the stress on the memory subsystem and improves performance for most of the applications, speeding up Kilo TM by another 14% on average over aggregating protocol messages. BH-H and BH-L do not benefit from this optimization. This is surprising given that our measurements have shown that validation and commit coalescing can reduce the number of L2 cache accesses from the commit units by more than 50% for both workloads (see Figure 6.5).
The reason is that the serial hazard detection in each commit unit becomes a bottleneck. If the hazard detection hardware were running at twice the clock frequency, BH-H and BH-L would benefit from validation and commit coalescing. With all the optimizations enabled (including the overclocked hazard detection), WarpTM with ideal intra-warp conflict resolution can speed up Kilo TM by 49%.

While we could parallelize the hazard detection in the commit units with additional hardware, we find that temporal conflict detection removes this bottleneck in BH-H and BH-L by allowing most of their read-only transactions to commit silently.

Figure 6.15: Performance of temporal conflict detection with different last written time table organizations. Lower is better. (Y-axis: execution time normalized to FGLock. Series: NoTCD, TCD-128, TCD-2K-1SubArray, TCD-2K, TCD-2K-WordAddr, TCD-4K. Workloads: CL, CLto, BH-H, BH-L.)

6.7.4 Intra-Warp Conflict Resolution Overhead

Figure 6.14 compares the performance of the two intra-warp conflict resolution mechanisms proposed in Section 6.3: sequential conflict resolution (SCR) in Section 6.3.2 and 2-phase parallel conflict resolution (2PCR) in Section 6.3.3. We evaluate the performance of each mechanism with and without modeling the overhead of conflict resolution. In this study, we enable both WarpTM and TCD in all configurations other than the baseline Kilo TM (KiloTM-Base). Without modeling the overhead (NoOverhead), each warp finishes intra-warp conflict resolution instantaneously, and does not generate traffic to the memory pipeline in the SIMT core when it traverses its logs or when it accesses the metadata in shared memory. This allows us to distinguish between two sources of performance overhead: transaction aborts due to inaccurate conflict resolution and the extra operations that implement the resolution itself.
The no-overhead configurations of both SCR and 2PCR perform almost identically across all of our workloads, indicating that the accuracy of the two mechanisms is roughly equivalent. However, the serial nature of SCR introduces significant overhead (an average 60% slowdown) to WarpTM, to the extent that most of its performance benefits are negated. 2PCR, on the other hand, delivers accuracy similar to SCR with a much lower overhead (∼2% on average). The overhead in most cases is minor compared to the benefits from WarpTM, except for CC, where it causes an 11% slowdown.

Figure 6.16: Performance impact with different L2 cache port widths. 6CBK = 6 L2 cache banks with 64-Byte ports. 12CBK = 12 L2 cache banks with 32-Byte ports. Lower is better. (Y-axis: execution time normalized to FGLock with 6CBK. Series: FGLock, KiloTM-Base and WarpTM+TCD, each with 6CBK and 12CBK. Workloads: HT-H, HT-M, HT-L, ATM, CL, CLto, BH-H, BH-L, CC, AP, AVG.)

6.7.5 Temporal Conflict Detection Resource Sensitivity

Figure 6.15 shows the performance sensitivity of TCD to different last written time table organizations. While our default configuration uses 2048 entries in each memory partition (TCD-2K), a lower cost organization using 128 entries (TCD-128) can capture a significant portion of the performance benefit from TCD-2K. Doubling the size of the last written time table (TCD-4K) shows no further improvement over TCD-2K, indicating that the 2048-entry table is sufficient. Even though the data shows that a single 2048-entry sub-array (TCD-2K-1SubArray) performs comparably to our default organization, we do notice that having multiple sub-arrays can reduce the effect of aliasing as transaction concurrency increases. Finally, contrary to our intuition, we notice that reducing the detection granularity of TCD from 128-Byte blocks to 4-Byte words (TCD-2K-WordAddr), while keeping the same last written time table capacity, decreases performance.
We believe that reducing the granularity causes more entries in the recency bloom filter to be populated, so that the aliasing effect dominates.

6.7.6 Sensitivity to L2 Cache Port Width

In this section, we study the performance impact of L2 cache port width on FGLock, the original KiloTM (KiloTM-Base), and the enhanced KiloTM with both WarpTM and TCD (WarpTM+TCD). The L2 cache in our baseline GPU architecture is partitioned into 6 banks, each with a 64-Byte port (6CBK). In this study, we further divide each L2 cache bank into two sub-banks, each with a 32-Byte port, resulting in 12 L2 cache banks across the system (12CBK). Other parts of the GPU architecture, including the total L2 cache bandwidth, remain identical between the two configurations. Figure 6.16 compares the performance of these two configurations.

Overall, the FGLock workloads with 32-Byte ports run 12% slower than with 64-Byte ports. HT-H and BH-H suffer a higher degree of load imbalance among the L2 cache banks. The more congested L2 cache bank forms a tighter bottleneck with 32-Byte ports than with 64-Byte ports. While we did not observe this imbalance for CL and CLto, we did notice more atomic accesses from increased lock acquisition failures. The extra failures are caused by higher memory latency due to lower DRAM efficiency.

In this study, we also increased the number of commit units along with the number of L2 cache banks due to their tightly-coupled design. The increased number of commit units can increase interconnection network traffic for Kilo TM, because each transaction needs to communicate with more commit units for validation and commit. We notice this extra traffic impacting the performance of Kilo TM for BH-H, BH-L, CC, and AP.

In other workloads, switching to 32-Byte ports improves performance for the original Kilo TM, because the commit units can use the port bandwidth more effectively with just scalar (4-Byte wide) accesses.
In turn, the narrower L2 cache ports reduce the benefit provided by validation and commit coalescing. This lowers the average speedup of WarpTM over the original KiloTM from 43% to 40%. In particular, WarpTM does not speed up CLto at all. Nevertheless, in a GPU architecture with narrower L2 cache ports, KiloTM-Base obtains 49% of the FGLock performance. Kilo TM with WarpTM and TCD captures 76% of the FGLock performance, up from 66% with the wider L2 cache ports.

6.7.7 Sensitivity to Core Scaling

Figure 6.17: Performance impact from doubling the number of SIMT cores. Lower is better. (Y-axis: execution time normalized to FGLock with 15 cores. Series: FGLock - 30Cores, KiloTM-Base - 15Cores, KiloTM-Base - 30Cores, WarpTM+TCD - 15Cores, WarpTM+TCD - 30Cores. Workloads: HT-H, HT-M, HT-L, ATM, CL, CLto, BH-H, BH-L, CC, AP. One off-scale bar is annotated 6.17X.)

In this section, we explore the impact of increasing the number of SIMT cores on the overhead of Kilo TM. Specifically, we evaluate the performance of our workloads on a scaled-up GPU architecture with 30 SIMT cores, doubled from our baseline configuration. Figure 6.17 compares the performance overhead of Kilo TM (and WarpTM+TCD) over FGLock with 15 and 30 SIMT cores. While doubling the SIMT cores increases concurrency in the GPU architecture, it slows down our FGLock applications by 9% on average. For this study, we have not scaled the memory subsystem with the core count. The increased concurrency generates extra memory-level parallelism, but these extra concurrent memory accesses introduce more L2 cache misses, resulting in a net performance loss [132]. We also noticed more atomic accesses from increased lock acquisition failures in HT-M, CL, and BH-H. Only CC can take advantage of the extra cores, for a 13% speedup. Moreover, the FGLock version of HT-H runs into a livelock, so its performance is not shown in Figure 6.17.
In comparison, the Kilo TM version is only slowed down by 6-9% with 30 cores.

Aside from HT-H, Kilo TM and WarpTM+TCD mostly follow the performance trends of FGLock with the scaled-up GPU architecture. For HT-M, HT-L and ATM, the overhead of Kilo TM and WarpTM over FGLock remains approximately the same with more cores. WarpTM works more effectively for CL with 30 cores because the extra cores provide extra L1 cache capacity for transaction logs, allowing extra transaction concurrency without the L1 cache overflow penalty. WarpTM+TCD works less effectively for CLto and BH-L with 30 cores, whereas enabling each enhancement alone with 30 cores is just as effective as with 15 cores. A detailed investigation reveals that enabling both enhancements with Kilo TM boosts the transaction concurrency in these workloads, generating significantly more transaction conflicts. CC and AP are not affected by the increased concurrency limit, and hence the extra cores do not impact the overhead of Kilo TM and WarpTM+TCD for these two workloads.

6.8 Summary

In this chapter, we proposed two enhancements to Kilo TM, an existing hardware TM proposal for GPU architectures. Warp-level transaction management exploits the spatial locality among transactions within a warp to enable a set of optimizations. These optimizations allow Kilo TM to exploit the wide memory subsystem in GPU architectures. Temporal conflict detection complements WarpTM by allowing read-only transactions to commit silently in the absence of a conflict. Our evaluation shows that these two enhancements can improve Kilo TM performance by 65% while reducing its energy per operation by 34%. Kilo TM with the two enhancements outperforms coarse-grained locking by 192× and achieves 66% of the performance of fine-grained locking, while only requiring 34% more energy per operation.
Moreover, software optimizations that reduce transaction footprints and contention can further close this gap.

While this chapter presents WarpTM and TCD as enhancements to Kilo TM, the insights behind these mechanisms extend well beyond GPU TM systems. WarpTM demonstrates the effectiveness of aggregating multiple transactions to amortize their management overheads in a TM system. This principle applies to other novel data synchronization/inter-thread communication mechanisms [66, 169]. The 2-phase parallel conflict resolution that enables WarpTM illustrates how transactions with a predetermined order can resolve conflicts in parallel with low overhead. This insight may be readily applied to thread-level speculation on multi-core systems. We believe that TCD’s ability to cheaply verify that a thread has observed an instantaneous global memory snapshot has wider uses beyond TM. For example, one may use TCD to accelerate runtime data-race detection on parallel computing systems without relying on any cache coherence protocol.

Finally, as newer commodity CMP systems start to add hardware support for TM [70], more software developers will start using transactions in their applications. GPUs that support TM will have higher interoperability with these future software applications. This will be an important design consideration for future heterogeneous processors with tightly integrated CMP and GPU.

Chapter 7

Related Work

This chapter discusses related work for this dissertation. Section 7.1 discusses various proposals to improve the handling of branch divergence on GPUs.
Section 7.2 discusses prior and concurrent transactional memory system proposals that are related to Kilo TM, as well as various microarchitecture/algorithmic techniques that are related to the implementation of Kilo TM.

7.1 Related Work for Branch Divergence Handling on GPUs

We classify the various proposals to improve the handling of GPU branch divergence into four categories: software compaction, hardware compaction, intra-warp divergent path management and adding MIMD capability. Some works contain improvements that capture aspects from multiple categories, and thus are mentioned multiple times.

7.1.1 Software Compaction

On existing GPUs, one way to improve the SIMD efficiency of an application is through software compaction: using software to group threads/work items according to their control flow behavior. The regrouping involves moving the thread and its private data in memory, potentially introducing a significant memory bandwidth overhead. Below we highlight several works on software compaction that were published before thread block compaction.

Conditional streams [84] applies this concept to stream computing. It splits a compute kernel for stream processors with potentially divergent control flow into multiple kernels. At a divergent branch, a kernel splits its data stream into multiple streams according to the branch outcome of each data element. Each stream is then processed by a separate kernel, and the streams merge back at the end of the control flow divergence.

Billeter et al. [14] proposed using a parallel prefix sum to implement SIMD stream compaction. The stream compaction reorganizes streams of elements with assorted tasks into compact substreams of identical tasks. This implementation leverages the access flexibility of the GPU on-chip scratchpad to achieve high efficiency. Hoberock et al. [76] propose a deferred shading technique for ray tracing that uses stream compaction to improve the SIMD efficiency of pixel shading in a complex scene with many material classes.
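As an illustration of the prefix-sum formulation of stream compaction mentioned above, the following is a minimal sequential sketch in Python. It is not the implementation of Billeter et al. [14]; the function names (`exclusive_scan`, `compact`, `split_by_task`) and the ray/material data are hypothetical, and the loops that a GPU would run as one thread per element are written serially here.

```python
from itertools import accumulate

def exclusive_scan(flags):
    """Exclusive prefix sum: result[i] is the sum of flags[0..i-1]."""
    if not flags:
        return []
    return [0] + list(accumulate(flags))[:-1]

def compact(elements, keep):
    """Return the elements for which keep(e) is true, preserving order."""
    flags = [1 if keep(e) else 0 for e in elements]
    offsets = exclusive_scan(flags)      # output slot for each kept element
    out = [None] * sum(flags)
    # On a GPU, each iteration would be an independent thread: the scan
    # already gives every kept element a unique output slot.
    for i, e in enumerate(elements):
        if flags[i]:
            out[offsets[i]] = e
    return out

def split_by_task(elements, task_of):
    """Reorganize a mixed stream into one compact substream per task."""
    tasks = sorted({task_of(e) for e in elements})
    return {t: compact(elements, lambda e, t=t: task_of(e) == t) for t in tasks}

# Hypothetical example: rays tagged with the material class they hit.
rays = [("r0", "glass"), ("r1", "metal"), ("r2", "glass"),
        ("r3", "wood"), ("r4", "metal")]
substreams = split_by_task(rays, task_of=lambda ray: ray[1])
```

Because the scan assigns each surviving element a distinct destination, the scatter step needs no synchronization between elements, which is what makes this formulation attractive for SIMD hardware.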
Each material class requires its own unique computation. A pixel shader combining the computation for every material class runs inefficiently on GPUs. Stream compaction groups the rays hitting objects with similar material classes, allowing the GPU SIMD hardware to execute the shader for these pixels efficiently.

Zhang et al. [172] propose a runtime system that remaps threads into different warps on the fly to improve SIMD efficiency as well as memory access spatial locality. The runtime system is pipelined, with the CPU performing the on-the-fly remapping and the GPU performing computations on the remapped data/threads.

7.1.2 Hardware Compaction

Similar to thread block compaction (proposed in Chapter 3) and its precursor, dynamic warp formation [56], many proposals use hardware to compact loosely populated warps to improve the performance of GPU applications that suffer from branch divergence. Unlike software compaction, which takes place in global memory, the hardware compaction proposals usually take place locally within a SIMT core. Since the compacted threads are all located on the same core and share the same register file, it is possible to perform compaction without moving their architectural states by using a more flexible register file design [56]. Among the hardware compaction proposals discussed below, only dynamic micro-kernels, proposed by Steffen and Zambreno [150], was published before TBC; the other proposals were published after TBC.

Steffen and Zambreno [150] improve the SIMD efficiency of ray tracing on GPUs with dynamic micro-kernels. The programmer is given primitives to break iterations in a data-dependent loop into successive micro-kernel launches. This decomposition by itself does not improve parallelism, because each iteration depends on data from the previous iteration. Instead, the launch mechanism reduces the load imbalance between different threads in the same core by compacting the remaining active threads into fewer warps.
It also differs from the rest of the hardware compaction techniques (including TBC) in that the compaction migrates the threads with their architectural states, using the per-core scratchpad memory as a staging area.

Published after TBC, the large warp microarchitecture (LWM) [117] extends the SIMT stack, similar to TBC, to manage the reconvergence of a group of warps. However, instead of restricting compaction to branches and reconvergence points, LWM requires warps within the group to execute in complete lockstep, so that it can compact the group at every instruction. This reduces the available TLP even more than TBC does, but allows LWM to perform compaction with predicated instructions as well as unconditional jumps. Similar to TBC, LWM splits the warps running on the same core into multiple groups, and restricts compaction to occur only within a group. It also opts for a more complex scoreboard microarchitecture that tracks register dependencies at thread granularity. This allows some warps in the group to execute slightly ahead of others to compensate for the TLP lost due to lockstep execution.

Rhu and Erez [130] extend TBC with a compaction-adequacy predictor (CAPRI). The predictor identifies the effectiveness of compacting threads into fewer warps at each branch, and only synchronizes the threads at branches where the compaction is predicted to yield a benefit. This reclaims the TLP lost due to non-beneficial stalls and compaction with TBC. Rhu and Erez [130] also show that a simple history-based predictor similar to a single-level branch predictor is sufficient to achieve high accuracy.

Vaidya et al. [159] propose a low-complexity compaction technique that benefits wide SIMD execution groups that execute over multiple cycles on narrower hardware units. Their basic technique divides a single execution group into multiple subgroups that match the hardware width. A SIMD execution group that suffers from divergence can run faster on the narrow hardware by skipping subgroups that are completely idle.
To create more completely idle subgroups, they propose a swizzle mechanism that compacts elements into fewer subgroups at divergence.

Brunie et al. [21] propose simultaneous branch and warp interweaving (SBI and SWI). They extend the GPU SIMT front-end to support issuing two different instructions per cycle. They compensate for this increased complexity by widening the warp to twice its original size. SWI co-issues an instruction from a warp suffering from divergence with instructions from another diverged warp to fill the gaps left by branch divergence.

7.1.3 Intra-Warp Divergent Path Management

While a SIMT stack with immediate post-dominator reconvergence points can handle branch divergence with arbitrary control flow, it can be further improved in various aspects, such as with the likely-convergence points proposed in Chapter 3. Below we highlight several works that attempt to improve the SIMT stack. Among these works, only dynamic warp subdivision, proposed by Meng et al. [107], was published before likely-convergence points; the other proposals were published afterwards.

Meng et al. [107] propose dynamic warp subdivision (DWS), which extends the SIMT stack with a warp-split table to subdivide a diverged warp into concurrent warp-splits. The warp-splits, each executing a divergent branch target, can execute in parallel to reclaim hardware idleness due to memory accesses. Warp-splits are also created at memory divergences, when only a portion of the threads in a warp hit in the L1 data cache. Instead of waiting for all threads to obtain their data, DWS splits the warp and allows the warp-split that hits in the cache to execute ahead, potentially prefetching data for those that have missed the cache. DWS is orthogonal to TBC. The block-wide SIMT stack in TBC can be extended with DWS to boost the available TLP.

Diamos et al. [43] depart from the SIMT stack altogether and instead propose to reconverge threads after divergence via thread frontiers.
A compiler supporting thread frontiers sorts the basic blocks in a kernel according to their topological order. In this way, threads executing an instruction at a higher PC can never jump to an instruction at a lower PC. Loops are handled by placing the loop exit at the end of the loop body. With this sorted code layout, a diverged warp will eventually reconverge by prioritizing threads with lower PCs (allowing them to catch up). Compared to SIMT stacks with immediate post-dominator reconvergence, reconvergence via thread frontiers yields higher SIMD efficiency for applications with unstructured control flow. The evaluation semantics of multi-expression conditional statements and the use of exceptions can both generate code with unstructured control flow. SIMT stacks extended with likely-convergence points can yield similar SIMD efficiency improvements on applications with unstructured control flow; however, each entry in the SIMT stack may only have a finite number of likely-convergence points, whereas the thread frontier approach has no such restriction.

Rhu and Erez [131] propose the dual-path execution model, which addresses some of the implementation shortcomings of DWS by restricting each warp to execute only two concurrent warp-splits. They also extend the scoreboard to track the register dependencies of each warp-split independently. This allows the dual-path execution model to achieve greater TLP than DWS.

ElTantawy et al. [49] remove the dual-path limitation with a multi-path execution model.
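The lowest-PC-first policy that underlies reconvergence via thread frontiers can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not the mechanism of Diamos et al. [43]: the program is a hypothetical four-instruction if/else with forward-only branches (no loops), each instruction is modeled as a function from thread ID to next PC, and the name `run_min_pc_first` is invented for this example.

```python
def run_min_pc_first(program, num_threads):
    """Simulate one warp; always issue the threads waiting at the lowest PC.

    program: list of instructions in topologically sorted layout. Each
    instruction is a function tid -> next PC, or None when the thread exits.
    Returns the issue trace as (pc, [thread ids issued together]) tuples.
    """
    pcs = [0] * num_threads
    trace = []
    while any(pc is not None for pc in pcs):
        pc = min(p for p in pcs if p is not None)   # lowest PC has priority
        group = [t for t, p in enumerate(pcs) if p == pc]
        trace.append((pc, group))
        for t in group:
            pcs[t] = program[pc](t)
    return trace

# Hypothetical kernel: if (tid is even) { A } else { B }; C
program = [
    lambda tid: 1 if tid % 2 == 0 else 2,  # pc 0: divergent branch
    lambda tid: 3,                          # pc 1: block A, jumps over B
    lambda tid: 3,                          # pc 2: block B, falls through
    lambda tid: None,                       # pc 3: block C, then exit
]
trace = run_min_pc_first(program, num_threads=4)
```

In the resulting trace, the even and odd threads issue blocks A and B separately, and all four threads issue block C together: because no thread can branch backwards in this layout, the threads at the higher PC simply wait until the others catch up.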
They further extend the multi-path execution hardware with opportunistic early reconvergence, boosting its SIMD efficiency for unstructured control flow without using a sorted code layout as in thread frontiers.

7.1.4 Adding MIMD Capability

The following proposals improve GPUs’ compatibility with divergent control flow by incorporating a limited amount of MIMD capability.

Published before both dynamic warp formation and thread block compaction, the vector-thread (VT) architecture [91] combines aspects of both SIMD and MIMD architectures, with the goal of capturing the best of both approaches. A VT architecture features a set of lanes that are connected to a common L1 instruction cache. In SIMD mode, all lanes receive instructions directly from the L1 instruction cache for lockstep execution, but each lane may switch to a MIMD mode, running at its own pace with instructions from its L0 cache. A recent comparison with traditional SIMT architectures (e.g., GPUs) by Lee et al. [96] shows that VT architectures have comparable efficiency with regular parallel applications, while performing much more efficiently with irregular parallel applications.

Published after thread block compaction, Temporal SIMT [85, 92] permits each lane to execute in MIMD fashion, similar to the VT architecture. However, instead of running a warp across all lanes in lockstep, it time-multiplexes the execution of a warp through a single lane, and each lane runs a separate set of warps. Temporal SIMT achieves the efficiency of SIMD hardware by fetching each instruction only once for the whole warp. This amortizes the control flow overhead across time, while the traditional SIMD architecture amortizes the same overhead across multiple lanes in space.

Brunie et al. [21] proposed simultaneous branch and warp interweaving (SBI and SWI) after the publication of thread block compaction. They extend the GPU SIMT front-end to support issuing two different instructions per cycle.
SBI co-issues instructions from the same warp when it encounters a branch divergence. Executing both targets of a divergent branch at the same time significantly reduces its performance penalty.

7.2 Related Work for Kilo TM

In this section, we discuss the related work for Kilo TM. We classify the work based on the microarchitecture mechanisms of interest to Kilo TM.

7.2.1 GPU Software Transactional Memory

Other than Kilo TM, there are other proposals to support transactional memory on GPU architectures. These proposals implement software TM systems designed to work with existing GPUs.

Cederman et al. [25] proposed a GPU software TM system that uses per-object version locks to detect conflicts. As each transaction executes, it records the version of each object it accesses. These version numbers are later checked during commit to detect conflicts with committed transactions. In their evaluations, while STM-based data structures scale well, they perform about 10× slower than lock-free data structures. Kilo TM uses value-based conflict detection to remove the storage overhead of version locks. With dedicated hardware to increase commit parallelism and limit concurrency, Kilo TM performs much closer to fine-grained locking (which should perform on par with lock-free data structures).

Xu et al. [167] proposed a GPU software TM system after the publication of Kilo TM. Similar to Kilo TM, it also uses value-based conflict detection. For each transaction, it has a hash table to store the set of memory locations that need to be locked during commit. The memory locations stored in this table are sorted, so that every transaction acquires its locks in the same order to prevent deadlocks. Without the extended SIMT stack in Kilo TM that handles the control flow divergence due to transaction aborts (see Section 4.2.1), programmers using this STM system need to handle the divergence explicitly, which is a significant programming burden.
In their evaluations, this STM system performs up to 20× faster than coarse-grained locks. Kilo TM outperforms coarse-grained locks by 192× on average.

7.2.2 Hardware Transaction Memory with Cache Coherence Protocol

Many existing hardware and hybrid TM systems focus on leveraging the sharer/modifier information maintained by cache coherence hardware for conflict detection and using thread-private caches for version management. Some of the HTMs propose to extend the cache coherence protocol with new coherence states for conflict detection and version management [27, 29, 44, 157]. The additional states extend the thread-private cache into a transaction cache, buffering the read-set and write-set accessed by a transaction. A conflict is detected when a cache line with transactional data receives an invalidation/read sharing request in the cache coherence protocol. Other HTMs just monitor the cache coherence traffic (of an existing protocol) for conflict detection, using per-transaction metadata stored in signatures [24, 111, 170], or extra bits added to the cache line [19, 113, 136].

7.2.3 Signature-Based Conflict Detection

Many hardware transactional memory (HTM) systems use signatures to create compressed representations of transaction footprints, with the aim of supporting unbounded transactions. In BulkTM [26], a committing transaction broadcasts its write-set in a bloom filter based signature, which is compared for conflicts against the signature and the L1 cache tags at each recipient transaction. SigTM [111] and LogTM-SE [170] eagerly detect conflicts with signatures by monitoring the address of each incoming cache coherence request. A software conflict resolution handler is invoked when the address hits the signatures. Other HTMs (TMACC [24], FlexTM [145]) use signatures to handle unbounded transactions that have exceeded L1 cache capacity.

Software transactional memory (STM) systems also use signatures for conflict detection.
RingSTM [149] uses a ring of commit records that hold the write-signatures of recently committed transactions. Before each transactional load, the transaction compares its read-signature against the write-signatures of newly committed transactions added to the commit records since the transaction’s last load. A match indicates a new conflict and the transaction is aborted. In InvalSTM [67], a committing transaction compares its write-signature against the read- and write-signatures of other running transactions. A contention manager is invoked upon a match.

In this work, we have investigated the viability of signature-based conflict detection for GPU TM workloads and find that signatures are ill-suited for representing thousands of small transactions. We do propose having each transactional read check a bloom filter for the presence of buffered transactional data in the write-log.

7.2.4 Value-Based Conflict Detection

Other than Kilo TM, there are a number of TM systems using value-based conflict detection. JudoSTM [124] allows parallel commits with a set of versioned locks, each guarding a memory region. Each transaction checks/acquires the locks of all the regions that require protection during validation and commit. NORec [37] uses a single global versioned lock to offer fast checking with a small number of concurrent threads. To ensure opacity [68], an incremented global lock version informs a transaction to revalidate its read-set before reading from memory again. DPTM [153] is a cache coherence protocol based HTM. It uses value-based conflict detection to mitigate the false conflicts that are caused by false sharing of cache lines between transactions. Kilo TM uses value-based conflict detection, but the motivation is to eliminate global metadata for conflict detection with thousands of concurrent transactions.
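The idea common to the value-based schemes above can be shown with a short, single-threaded sketch. It is not the design of any of the cited systems: the class name `ValueBasedTx` is hypothetical, memory is modeled as a Python dict, and validation plus write-back is assumed to happen atomically (the part that real systems protect with locks or dedicated commit hardware).

```python
class ValueBasedTx:
    """A transaction that logs the values it read and re-checks them at commit."""

    def __init__(self, memory):
        self.memory = memory     # shared memory, modeled as addr -> value
        self.read_log = {}       # addr -> value first observed
        self.write_log = {}      # addr -> buffered (not yet visible) value

    def read(self, addr):
        if addr in self.write_log:           # read own buffered write
            return self.write_log[addr]
        value = self.memory[addr]
        self.read_log.setdefault(addr, value)
        return value

    def write(self, addr, value):
        self.write_log[addr] = value

    def commit(self):
        # Validation: the transaction is conflict-free if every location it
        # read still holds the value it observed. No per-location version
        # number or lock word is stored in memory.
        if any(self.memory[a] != v for a, v in self.read_log.items()):
            return False                     # abort; the caller may retry
        self.memory.update(self.write_log)   # assumed atomic with validation
        return True

memory = {0: 10, 1: 20}
t1, t2 = ValueBasedTx(memory), ValueBasedTx(memory)
t1.write(1, t1.read(0) + 1)       # t1 reads address 0, buffers a write to 1
t2.write(0, t2.read(0) + 5)       # t2 reads and updates address 0
t2_committed = t2.commit()        # succeeds: address 0 becomes 15
t1_committed = t1.commit()        # fails: address 0 no longer holds 10
```

One consequence of comparing values rather than versions is that a location overwritten with the same value it already held does not register as a conflict.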
It uses special hardware to allow non-conflicting transactions to validate and commit in parallel.

7.2.5 Ring-Based Commit Ordering

Kilo TM and RingSTM [149] are conceptually similar in that both use a ring to order transaction commits and detect conflicts between the read-set of a transaction and the write-sets of committed transactions. Kilo TM uses value-based conflict detection to eliminate storage for committed write-sets, and uses an LWH unit to detect hazards (conflicts) among committing transactions. Each transaction in Kilo TM stores its read-set as exact address-value pairs. It also features multiple commit units, each with a separate ring, and maintains a consistent commit order via a protocol similar to Scalable TCC [27]. Kilo TM enforces opacity via a watchdog timer, removing the need to validate before every transactional read.

7.2.6 Recency Bloom Filter

Recency bloom filters contribute greatly to boosting the commit parallelism in Kilo TM and allow hazard detection in each commit unit to directly support unbounded transactions. Recency bloom filters were proposed in Chapter 4 and used in both Chapter 4 and Chapter 6.

While the construction of a recency bloom filter resembles a time-out bloom filter [90], the recency bloom filter differs in how it uses the data stored inside the filter. The time-out bloom filter was proposed for use in network packet sampling, where the key result is the existence of a similar packet occurring within a given window. In contrast, the recency bloom filter provides more information. In the presence of a false conflict due to aliasing, it can still return the approximate identity of the conflict rather than simply a pass/fail result.
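To make the construction concrete, the following is a small software sketch of a filter in this style: several hash tables of commit IDs (order numbers) whose lookups are combined by taking the minimum. The table size, the two hash functions and the method names are assumptions chosen for illustration, not the hardware organization used by Kilo TM.

```python
class RecencyBloomFilter:
    """Approximate 'which commit last wrote this address?' lookup."""

    TABLE_SIZE = 8
    # Hypothetical hash functions, one per table.
    HASHES = (
        lambda addr: addr % 8,
        lambda addr: (addr ^ (addr >> 3)) % 8,
    )

    def __init__(self):
        # 0 means "no writer recorded"; real commit IDs start at 1.
        self.tables = [[0] * self.TABLE_SIZE for _ in self.HASHES]

    def record_write(self, addr, commit_id):
        for table, h in zip(self.tables, self.HASHES):
            table[h(addr)] = max(table[h(addr)], commit_id)

    def last_writer(self, addr):
        # Aliasing can only replace an entry with a later commit ID, so every
        # table returns an ID at or after the true last writer. The minimum
        # across tables is therefore the tightest available approximation.
        return min(table[h(addr)] for table, h in zip(self.tables, self.HASHES))

f = RecencyBloomFilter()
f.record_write(addr=1, commit_id=5)
f.record_write(addr=9, commit_id=7)   # aliases with address 1 in table 0 only
```

With these hash functions, addresses 1 and 9 collide in the first table but not in the second, so `f.last_writer(1)` still resolves to commit 5, whereas a single-table filter would have reported commit 7.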
Kilo TM can still use this approximate identity to determine the set of transactions that may commit in parallel with the transaction in query.

The store sequence bloom filter (SSBF) in the store vulnerability window (SVW) [134] also uses a hash table of order numbers to approximately detect conflicts between concurrent memory operations. The recency bloom filter extends this construction to use multiple hash tables of order numbers and uses a minimum logic block to combine the order numbers returned from the tables to provide the best approximation. This extension improves the recency bloom filter’s tolerance of the effects of aliasing in hash functions and allows the filter to scale to much larger item sets. Also, a recency bloom filter assigns the same commit ID (order number) to represent multiple memory locations in a transaction, whereas an SSBF assigns a unique order number to each memory operation.

7.2.7 Transaction Scheduling

In Chapter 4, we describe an extension to the GPU hardware thread scheduler to control the number of concurrent transactions. It is effective for high-contention workloads. Others have proposed more adaptive transaction schedulers that dynamically adjust concurrency according to predicted contention or to support unbounded transactions. Yoo and Lee [171] use a system-wide queue to control the number of concurrent transactions based on a system-wide conflict rate threshold. CAR-STM [47] maintains per-core scheduling queues to serially execute transactions that are likely to conflict. Maldonado et al. [105] extend the OS thread scheduler to implement a similar capability. Both Blake et al. [15] and Dragojević et al. [48] propose mechanisms to predict transactions that are likely to conflict via runtime profiling. OneTM [17] simplifies the handling of unbounded transactions by serializing the execution of these transactions.
We leave the exploration of such adaptive transaction schedulers on GPUs as future work.

7.2.8 Energy Analysis for Transactional Memory

Ferri et al. [52] analyzed the energy and performance of SoC-TM, their TM system proposal for embedded multi-core SoCs. Their analysis shows that for workloads that scale to multiple threads, SoC-TM performs better than locking while consuming less energy. We perform a similar analysis for Kilo TM, a TM proposal for GPU architectures.

7.2.9 Intra-Warp Conflict Resolution

Similar to the intra-warp conflict resolution mechanisms explored in Chapter 6, Qian et al. [128] described a method to detect and resolve conflicts among threads running on SMT CPU cores, which usually have far fewer threads per core than GPU cores. The relatively small number of threads allows their design to dedicate explicit storage to record the dependency between transactions and extend each cache line to record read-sharer information. Such storage is impractical for GPU cores, which have hundreds of threads on each core sharing the L1 cache. This work focuses on detecting and resolving conflicts among transactions within a warp on GPUs.

Yang et al. [168] proposed multiplexing shared memory storage among multiple concurrently running thread blocks by dynamically allocating the storage to each thread block for temporary use and freeing it immediately after. Our proposed intra-warp conflict resolution employs this strategy to allow each warp to use an ownership table larger than the capacity possible with static allocation.

Nasre et al. [118] proposed a probabilistic 3-phase conflict resolution that uses parallel passes to resolve conflicts among multiple threads. Similar to our proposed intra-warp 2-phase parallel conflict resolution in Chapter 6, their approach uses thread ID to prioritize among different threads.
However, their approach focuses on obtaining exclusive access to modify shared data and does not permit read-sharing, which is key to TM system performance.

7.2.10 Globally Synchronized Timers in Memory Systems

Temporal Coherence [146], proposed by Singh et al., is a cache coherence framework for GPU architectures that uses a set of globally synchronized timers to eliminate invalidation messages. It uses timestamps to determine when cache blocks in local data caches will expire. Temporal conflict detection uses timestamps to detect if all the values read by a transaction can exist as a global memory snapshot. Ruan et al. [135] also proposed extending Orec-based STM systems with synchronous hardware timers found on existing CMP systems. Their approach embeds timestamps in the ownership record of each transaction variable in the main memory, whereas TCD uses a set of small on-chip buffers to conservatively record when each word in the global memory space was last written.

7.2.11 Timestamp/Counter-based Conflict Detection

Many existing STMs use timestamps for conflict detection. We highlight two common uses of timestamps as follows.

Spear et al. [148] propose using a global commit counter to represent the global memory version. The counter is incremented by every transaction commit. Transactions can check this counter to see if any transactions have committed between now and the last round of validation. When a transaction observes an incremented counter, it performs another round of validation and buffers the updated counter value. If the counter remained unchanged, the transaction proceeds directly, skipping an expensive round of validation.

Lev and Moir [99] propose using a counter at each software object to represent the number of read-sharers of the object. Each transaction increments the counter as it reads the object and decrements it as the transaction commits.
A transaction writing to the object can use this counter to detect conflicts: a conflict exists if a transaction is writing to an object with one or more read-sharers.

Notice that in both uses, the counters are updated explicitly by events. Temporal conflict detection (TCD) uses free-running timers that increment every cycle. Instead of using the timestamp to represent memory versions, TCD uses timestamps representing the time when a set of memory locations were last written to determine the consistency of a read-only transaction.

7.2.12 Transactional Memory Verification

Concurrent with the development of the proof presented in Chapter 5, Lesani et al. [98] completed a machine proof framework to verify the correctness of a TM system design, and used it to verify the design of NOrec [37]. The framework uses I/O automata to specify the behavior of a naive TM system known to behave correctly, and to model the behavior of the TM system under test. They then use the PVS prover [125] to check the equivalence of the naive TM system and the tested TM system. The equivalence indicates that the TM system design correctly supports serializability.

Chapter 8
Conclusions and Future Work

This chapter concludes this dissertation and provides directions for future work.

8.1 Conclusions

The slowdown of Dennard Scaling has motivated the computing industry to look for more efficient alternatives in the form of GPU computing. GPUs feature a highly parallel architecture that is designed to exploit the fine-grained data-level parallelism present in graphics rendering. While advances in GPU computing APIs such as CUDA and OpenCL have eased many system-level challenges in using the GPU as a computing device, GPU computing remains challenging.

To fully harness the computing power of GPUs, software developers need to decompose the implicit DLP in their applications into thousands of threads.
This decomposition is relatively easy for applications with regular parallelism; decomposing the parallel computations in applications with irregular parallelism is much more challenging:

1. The microarchitectural behavior of an application with irregular parallelism is often sensitive to the application’s input. This sensitivity can introduce uncertainty to the performance/energy benefit from GPU acceleration.

2. Applications with irregular parallelism often feature irregular, fine-grained communications between threads. Managing this communication with fine-grained locking (or other ad hoc synchronization solutions) is challenging even for parallel programming experts. The development effort required to verify the use of fine-grained locking (or these ad hoc solutions) is highly unpredictable.

3. The resulting software has many hidden constraints that only expert programmers can understand. This makes maintenance and long-term development very expensive.

These challenges plague the development of GPU applications with irregular parallelism. Although many research studies have indicated the performance potential of GPU acceleration for these applications, software developers appear to show limited interest, likely due to the risk involved in the development.

In this dissertation, we proposed two enhancements to existing GPU architectures for reducing the risk in developing GPU applications with irregular parallelism: thread block compaction (TBC) and Kilo TM.

Thread block compaction (Chapter 3) is a novel microarchitecture mechanism that improves the performance of GPU applications that suffer from branch divergence. TBC addresses the pathologies found in its precursor, dynamic warp formation [56]. It draws from our observation that most applications alternate between code regions that suffer from branch divergence and regions that do not. This observation drove us to expand the per-warp SIMT stack in current GPU architectures to encompass the whole thread block.
This block-wide stack tracks the entry and exit of the divergence regions, and ensures that dynamic, compacted warps are restored to their aligned static warps at the exit. With simplified DWF hardware that restricts compaction at the entry to the divergence region, TBC can robustly speed up GPU applications that suffer from branch divergence. Our evaluation showed that it speeds up a set of divergent GPU applications by 22% on average, and maintains the performance of GPU applications free of branch divergence.

Kilo TM (Chapter 4) is our proposal to support transactional memory (TM) on GPUs. The TM programming model simplifies parallel programming by replacing lock-protected critical regions with atomic code regions called transactions. The programmer is only responsible for specifying the operations, the functionality, inside each transaction; an underlying TM system runs multiple transactions concurrently for performance, and automatically handles any data races arising from the concurrent execution. This clean separation of functionality and performance makes TM software more maintainable than lock-based code.

Unlike existing TM proposals that focused on supporting tens of large-footprint transactions on existing CMP systems, Kilo TM aims to support thousands of concurrent small transactions. This focus, as well as the throughput-focused design of GPUs, drove us towards a very different design space, with a strong emphasis on scalability. The initial design of Kilo TM used value-based conflict detection, which rereads data values from global memory to detect conflicts. It does not require any global metadata to work and eliminates all direct communication between transactions. However, a naive implementation requires serializing all transaction commits. In response, we extended Kilo TM with specialized hardware to boost the commit parallelism.
Later, we further augmented Kilo TM with a warp-level transaction management system to capture spatial locality between transactions; we proposed a low-overhead intra-warp conflict resolution mechanism to make this possible. Finally, we accelerated the execution of read-only transactions on Kilo TM with temporal conflict detection. In our evaluations via simulations with GPGPU-Sim [10] and GPUWattch [97], GPU TM applications with the enhanced version of Kilo TM overall run 192× faster than their coarse-grained locking analogs. This enhanced Kilo TM also captures 66% of fine-grained locking performance, while consuming only 34% more energy. If we only count the contributions from low-contention GPU TM workloads, Kilo TM performs on par with fine-grained locking in both execution speed and energy efficiency. This type of workload is what we envision to be found in future GPU TM applications.

We opted not to evaluate a combination of TBC and Kilo TM in this work because we believe both solutions are still in their infancy. This is especially true for Kilo TM, which is our first attempt to design a TM system for GPUs. Many assumptions of the design came from a set of primitive GPU TM workloads we created ourselves. Many of these assumptions could be refined through the development of more realistic GPU TM workloads in the future.

8.2 Directions for Future Work

This section provides some directions for future research projects that extend the scope of the work presented in this dissertation.

8.2.1 Cost-Benefit-Aware Multi-Scope Compaction

One observation we have about the current techniques for compacting sparsely populated warps (due to divergence) into more densely populated warps is that each technique employs only one way to compact the warps.
In general, compaction that involves a larger scope of threads (within a thread block → within a core → the entire GPU) produces more benefit, but incurs a higher cost as well due to the communication and data migration overhead involved.

Through a model that estimates the cost and benefit trade-offs behind each scope of compaction, a hybrid system can dynamically decide which scope of compaction should be applied. This hybrid may have the potential of capturing the combined benefits of all forms of compaction.

8.2.2 Extend Kilo TM to Support Strong Isolation

The current implementation of Kilo TM only supports weak isolation [16]. With weak isolation, transactions only appear atomic to other transactional memory accesses. Non-transactional memory accesses can observe the intermediate memory states in between transaction commits. Supporting only weak isolation simplifies the design of Kilo TM (or at least reduces its performance overhead), but it generally makes programming TM harder for reasons explained in Section 2.2.2.

To support strong isolation, Kilo TM will need to be extended to handle the interactions of non-transactional loads and stores that may conflict with a concurrent transaction. The goal is to avoid adding significant performance and energy overhead for non-transactional memory accesses. As clarified by Dalessandro and Scott [35], a TM system may handle non-transactional loads and stores separately while satisfying the requirement for strong isolation.

A naive approach to order non-transactional loads with respect to other transactions is to treat each load as a read-only transaction in Kilo TM, and to commit this read-only transaction via the commit unit in the corresponding memory partition to obtain its value. In this way, transactions that appear to be committed to an earlier non-transactional load also appear committed to a later non-transactional load from the same thread.
This property can be maintained by simply returning the retired commit ID (RCID) from the memory partition that serviced the non-transactional load. If a later load from the same thread obtains an older retired commit ID than the earlier load, the later load is replayed to reread the memory until the RCID is updated. Notice that temporal conflict detection cannot be used in this case because it does not restrict order between read-only transactions (fine for weak isolation, but breaks strong isolation).

While Kilo TM should detect most conflicts caused by non-transactional stores with value-based conflict detection, extra effort is required to cover all cases. One corner case occurs when the store modifies a memory location that is read, modified and written by a committing transaction. The non-transactional store may occur right in between the validation and commit of a transaction, violating the atomicity of the transaction.

One approach to handle this corner case is to use a separate recency bloom filter to keep track of the read-set of all committing transactions. Every non-transactional store checks this filter at the memory partition, and if it hits in the filter, it performs the store through the commit unit as a write-only transaction.

8.2.3 Application-Driven Transaction Scheduling

In some GPU TM applications, we have observed that the level of contention between transactions changes dynamically at runtime. For example, BH starts out as a high-contention application where every transaction is trying to insert a leaf node near the root, and gradually relaxes into low contention as the octree grows. A more intelligent concurrency control should be able to speed up this application significantly by restricting the concurrency only during the high-contention phase.

Instead of approaching this by just adopting one of the existing transaction schedulers, a limit study should be performed to estimate the potential gain.
One can try to construct an oracle transaction schedule for each application from memory traces obtained via profiling. This idealized transaction schedule will serve as the gold standard. One way to approach this ideal schedule is to analyze the runtime input data prior to the transaction executions. We envision that some form of approximate graph coloring may be able to recreate a transaction schedule that approximates the idealized schedule. The key is that since the computation is still done via transactions, the schedule just needs to remove a majority of the transaction aborts, and need not be so conservative that it culls away available thread-level parallelism. The generated schedule can be implemented by using a software thread spawner similar to the ones used for dynamic load-balancing in ray tracing on GPUs [4].

8.2.4 Multithreaded Transactions

During the development of the various GPU TM workloads for this work, we have observed that some transactions, especially the larger ones, contain a good amount of parallelism (e.g., initializing entries in a node or updating all of its neighbours). With the current design of Kilo TM, this parallelism remains buried inside a single thread, because each transaction may only be executed by a single thread.

This observation made us realize that the current TM semantics need to be extended to support multithreaded transactions. This new TM would not bind the transaction context to a single thread. Instead, it would allow multiple threads to execute a transaction collaboratively. These threads should observe the intermediate work by each other within a transaction.

The implementation of this new TM model will need to address the following two challenges:

• How to enable multiple threads working on the same transaction to see each other’s intermediate work? Should this be allowed at all?

• How does the new TM system detect conflicts between transactions of different scale?
Can a single conflict detection system efficiently handle both single-thread transactions and thousand-thread transactions?

We believe that ideas proposed in this dissertation, such as warp-level transaction management and temporal conflict detection, can be further developed to overcome these challenges.

Bibliography

[1] NVIDIA Forums - atomicCAS does NOT seem to work. Accessed: March 15, 2011. → pages 77

[2] M. Abadi, T. Harris, and M. Mehrara. Transactional memory with strong atomicity using off-the-shelf memory protection hardware. In Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’09, pages 185–196, New York, NY, USA, 2009. ACM. → pages 31

[3] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Advances in Knowledge Discovery and Data Mining. chapter Fast Discovery of Association Rules. American Association for Artificial Intelligence, 1996. → pages 92, 94, 145

[4] T. Aila and S. Laine. Understanding the Efficiency of Ray Traversal on GPUs. In HPG ’09, 2009. → pages 7, 15, 16, 63, 64, 179

[5] R700-Family Instruction Set Architecture. AMD, March 2009. → pages 41, 60

[6] AMD Southern Islands Series Instruction Set Architecture. AMD, 1.1 edition, December 2012. → pages 41

[7] ARM Holdings. Cortex-A9 NEON Media Processing Engine Technical Reference Manual (Revision r2p2), 2008. → pages 22

[8] D. Arnold, D. Ahn, B. de Supinski, G. Lee, B. Miller, and M. Schulz. Stack Trace Analysis for Large Scale Debugging. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium, IPDPS 2007, pages 1–10. IEEE, 2007. → pages 6, 76

[9] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006. URL → pages 1

[10] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H.
Wong, and T. M. Aamodt. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In Proceedings of the IEEE Symposium of Performance and Analysis of Systems and Software, ISPASS’09, pages 163–174, 2009. → pages 63, 64, 92, 143, 176

[11] M. Bauer, H. Cook, and B. Khailany. CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’11, pages 12:1–12:11, New York, NY, USA, 2011. ACM. → pages 55

[12] M. Bauer, S. Treichler, and A. Aiken. Singe: Leveraging Warp Specialization for High Performance on GPUs. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’14, pages 119–130, New York, NY, USA, 2014. ACM. → pages 55

[13] N. Bell and M. Garland. Implementing Sparse Matrix-vector Multiplication on Throughput-oriented Processors. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC ’09, pages 18:1–18:11, New York, NY, USA, 2009. ACM. → pages 16

[14] M. Billeter, O. Olsson, and U. Assarsson. Efficient Stream Compaction on Wide SIMD Many-core Architectures. In Proceedings of the Conference on High Performance Graphics 2009, HPG ’09, pages 159–166, New York, NY, USA, 2009. ACM. → pages 162

[15] G. Blake, R. G. Dreslinski, and T. Mudge. Bloom Filter Guided Transaction Scheduling. In Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture, HPCA ’11, pages 75–86, Washington, DC, USA, 2011. IEEE Computer Society. ISBN 978-1-4244-9432-3. → pages 92, 108, 170

[16] C. Blundell, E. C. Lewis, and M. M. K. Martin. Deconstructing Transactional Semantics: The Subtleties of Atomicity. In the Annual Workshop on Duplicating, Deconstructing, and Debunking, WDDD, 2005. → pages 10, 30, 31, 112, 113, 177

[17] C. Blundell, J. Devietti, E. C. Lewis, and M. M. K. Martin. Making the Fast Case Common and the Uncommon Case Simple in Unbounded Transactional Memory.
In Proceedings of the 34th Annual International Symposium on Computer Architecture, ISCA ’07, pages 24–34, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-706-3. doi:10.1145/1250662.1250667. → pages 34, 170

[18] J. Bobba, K. E. Moore, H. Volos, L. Yen, M. D. Hill, M. M. Swift, and D. A. Wood. Performance Pathologies in Hardware Transactional Memory. In Proceedings of the 34th Annual International Symposium on Computer Architecture, ISCA ’07, 2007. → pages 28, 32, 87

[19] J. Bobba, N. Goyal, M. D. Hill, M. M. Swift, and D. A. Wood. TokenTM: Efficient Execution of Large Transactions with Hardware Transactional Memory. In Proceedings of the 35th Annual International Symposium on Computer Architecture, ISCA ’08, pages 127–138, Washington, DC, USA, 2008. IEEE Computer Society. → pages 167

[20] A. Brownsword. Cloth in OpenCL, 2009. → pages 92, 93, 145

[21] N. Brunie, S. Collange, and G. Diamos. Simultaneous Branch and Warp Interweaving for Sustained GPU Performance. In Proceedings of the 39th Annual International Symposium on Computer Architecture, ISCA ’12, pages 49–60, Washington, DC, USA, 2012. IEEE Computer Society. → pages 164, 166

[22] I. A. Buck, J. R. Nickolls, M. C. Shebanow, and L. S. Nyland. United States Patent #7,627,723: Atomic Memory Operators in a Parallel Processor (Assignee NVIDIA Corp.), December 2009. → pages 19, 44

[23] M. Burtscher and K. Pingali. An Efficient CUDA Implementation of the Tree-based Barnes Hut n-Body Algorithm. Chapter 6 in GPU Computing Gems Emerald Edition, 2011. → pages 7, 16, 76, 92, 93, 145

[24] J. Casper, T. Oguntebi, S. Hong, N. G. Bronson, C. Kozyrakis, and K. Olukotun. Hardware Acceleration of Transactional Memory on Commodity Systems. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 27–38, New York, NY, USA, 2011. ACM. → pages 167, 168

[25] D. Cederman, P. Tsigas, and M. T. Chaudhry. Towards a Software Transactional Memory for Graphics Processors.
In EGPGV, 2010. → pages 166

[26] L. Ceze, J. Tuck, J. Torrellas, and C. Cascaval. Bulk Disambiguation of Speculative Threads in Multiprocessors. In Proceedings of the 33rd Annual International Symposium on Computer Architecture, ISCA ’06, pages 227–238, Washington, DC, USA, 2006. IEEE Computer Society. → pages 32, 33, 34, 80, 103, 133, 134, 168

[27] H. Chafi, J. Casper, B. D. Carlstrom, A. Mcdonald, C. Cao, M. W. Baek, C. Kozyrakis, and K. Olukotun. A Scalable, Non-blocking Approach to Transactional Memory. In Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture, HPCA ’07, pages 97–108, Washington, DC, USA, 2007. IEEE Computer Society. → pages 9, 33, 34, 75, 81, 88, 167, 169

[28] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A Benchmark Suite for Heterogeneous Computing. In Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC), IISWC ’09, pages 44–54, Washington, DC, USA, 2009. IEEE Computer Society. → pages 63, 64

[29] J. Chung, L. Yen, S. Diestelhorst, M. Pohlack, M. Hohmuth, D. Christie, and D. Grossman. ASF: AMD64 Extension for Lock-Free Data Structures and Transactional Memory. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’43, pages 39–50, Washington, DC, USA, 2010. IEEE Computer Society. ISBN 978-0-7695-4299-7. doi:10.1109/MICRO.2010.40. → pages 167

[30] B. W. Coon and J. E. Lindholm. United States Patent #7,353,369: System and Method for Managing Divergent Threads in a SIMD Architecture (Assignee NVIDIA Corp.), April 2008. → pages 39, 41, 60, 61

[31] B. W. Coon, P. C. Mills, S. F. Oberman, and M. Y. Siu. United States Patent #7,434,032: Tracking Register Usage During Multithreaded Processing Using a Scoreboard having Separate Memory Regions and Storing Sequential Register Size Indicators (Assignee NVIDIA Corp.), October 2008. → pages 39, 40

[32] B. W. Coon, J. R. Nickolls, L. S. Nyland, and P. C. Mills.
United States Patent #8,375,176 B2: Lock Mechanism to Enable Atomic Updates to Shared Memory (Assignee NVIDIA Corp.), December 2009. → pages 135

[33] P. J. Courtois, F. Heymans, and D. L. Parnas. Concurrent control with “readers” and “writers”. Commun. ACM, 14(10):667–668, October 1971. → pages 21

[34] D. E. Culler, A. Gupta, and J. P. Singh. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 1997. → pages 18, 19

[35] L. Dalessandro and M. L. Scott. Strong Isolation is a Weak Idea. In Workshop on Transactional Memory, TRANSACT’09, 2009. → pages 31, 177

[36] L. Dalessandro and M. L. Scott. Sandboxing transactional memory. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT ’12, pages 171–180, New York, NY, USA, 2012. ACM. → pages 30

[37] L. Dalessandro, M. F. Spear, and M. L. Scott. NOrec: Streamlining STM by Abolishing Ownership Records. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’10, pages 67–78, New York, NY, USA, 2010. ACM. → pages 9, 75, 81, 84, 87, 109, 125, 139, 142, 168, 172

[38] W. J. Dally and B. Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann, 2004. → pages 94

[39] P. Damron, A. Fedorova, Y. Lev, V. Luchangco, M. Moir, and D. Nussbaum. Hybrid transactional memory. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XII, pages 336–346, New York, NY, USA, 2006. ACM. → pages 34

[40] M. de Kruijf and K. Sankaralingam. Idempotent Processor Architecture. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44, pages 140–151, New York, NY, USA, 2011. ACM. → pages 18, 85

[41] M. de Kruijf and K. Sankaralingam. Idempotent Code Generation: Implementation, Analysis, and Evaluation.
In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), CGO ’13, pages 1–12, Washington, DC, USA, Feb 2013. IEEE Computer Society. → pages 85

[42] R. H. Dennard, F. H. Gaensslen, and K. Mai. Design of Ion-Implanted MOSFET’s with Very Small Physical Dimensions. In IEEE Journal of Solid-State Circuits, October 1974. → pages 1

[43] G. Diamos, B. Ashbaugh, S. Maiyuran, A. Kerr, H. Wu, and S. Yalamanchili. SIMD Re-convergence at Thread Frontiers. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44, pages 477–488, New York, NY, USA, 2011. ACM. → pages 165

[44] D. Dice, Y. Lev, M. Moir, and D. Nussbaum. Early Experience With a Commercial Hardware Transactional Memory Implementation. In Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XIV, pages 157–168, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-406-5. doi:10.1145/1508244.1508263. → pages 167

[45] D. Dice, Y. Lev, M. Moir, D. Nussbaum, and M. Olszewski. Early experience with a commercial hardware transactional memory implementation. Technical report, Mountain View, CA, USA, 2009. → pages 34

[46] S. Doherty et al. DCAS is not a Silver Bullet for Nonblocking Algorithm Design. In Proceedings of the Sixteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’04, pages 216–224, New York, NY, USA, 2004. ACM. ISBN 1-58113-840-7. doi:10.1145/1007912.1007945. → pages 113

[47] S. Dolev, D. Hendler, and A. Suissa. CAR-STM: scheduling-based collision avoidance and resolution for software transactional memory. In Proceedings of the Twenty-seventh ACM Symposium on Principles of Distributed Computing, PODC ’08, pages 125–134, New York, NY, USA, 2008. ACM. ISBN 978-1-59593-989-0. doi:10.1145/1400751.1400769. → pages 170

[48] A. Dragojević, R. Guerraoui, A. V. Singh, and V. Singh. Preventing Versus Curing: Avoiding Conflicts in Transactional Memories.
In Proceedings of the 28th ACM Symposium on Principles of Distributed Computing, PODC ’09, pages 7–16, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-396-9. doi:10.1145/1582716.1582725. → pages 170

[49] A. ElTantawy, J. W. Ma, M. O’Connor, and T. M. Aamodt. A Scalable Multi-Path Microarchitecture for Efficient GPU Control Flow. In Proceedings of the 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), HPCA ’14, Washington, DC, USA, 2013. IEEE Computer Society. → pages 165

[50] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA ’11, pages 365–376, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0472-6. doi:10.1145/2000064.2000108. → pages 1

[51] M. Ferdman, P. Lotfi-Kamran, K. Balet, and B. Falsafi. Cuckoo Directory: A Scalable Directory for Many-Core Systems. In Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture, HPCA ’11, pages 169–180, Washington, DC, USA, 2011. IEEE Computer Society. ISBN 978-1-4244-9432-3. → pages 80

[52] C. Ferri, A. Marongiu, B. Lipton, R. I. Bahar, T. Moreshet, L. Benini, and M. Herlihy. SoC-TM: Integrated HW/SW Support for Transactional Memory Programming on Embedded MPSoCs. In Proceedings of the Seventh IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, CODES+ISSS ’11, pages 39–48, New York, NY, USA, 2011. ACM. → pages 170

[53] M. Flynn. Very high-speed computing systems. Proceedings of the IEEE, 54(12):1901–1909, Dec. 1966. → pages 21, 22

[54] W. W. L. Fung and T. M. Aamodt. Thread Block Compaction for Efficient SIMT Control Flow. In Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture, HPCA ’11, pages 25–36, Washington, DC, USA, 2011. IEEE Computer Society. → pages 9, 47

[55] W. W. L. Fung and T. M. Aamodt.
Energy Efficient GPU Transactional Memory via Space-time Optimizations. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, pages 408–420, New York, NY, USA, 2013. ACM. → pages 124

[56] W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-40, pages 407–420, Washington, DC, USA, 2007. IEEE Computer Society. → pages 3, 7, 9, 41, 45, 46, 50, 51, 52, 58, 60, 65, 69, 72, 73, 94, 144, 162, 163, 175

[57] W. W. L. Fung, I. Singh, A. Brownsword, and T. M. Aamodt. Hardware Transactional Memory for GPU Architectures. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44, pages 296–307, New York, NY, USA, 2011. ACM. → pages 9, 74, 115

[58] W. W. L. Fung, I. Singh, and T. M. Aamodt. KILO TM Correctness: ABA Tolerance and Validation-Commit Indivisibility. Technical report, University of British Columbia, 2012. ∼aamodt/papers/wwlfung.tr2012.pdf. → pages 109

[59] W. Fung et al. Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware. ACM Transactions on Architecture and Code Optimization (TACO), 6(2):7:1–7:37, 2009. ISSN 1544-3566. → pages 41, 45, 46, 50, 51, 60

[60] W. W. L. Fung et al. ∼wwlfung/code/kilotm-gpgpusim-micro2013.tgz. Accessed: November 5, 2013. → pages 144

[61] W. W. L. Fung et al. ∼wwlfung/code/gpu-tm-tests.tgz. Accessed: April 18, 2013. → pages 95

[62] W. W. L. Fung et al. ∼wwlfung/code/kilotm-gpgpu sim.tgz. Accessed: April 18, 2013. → pages 95

[63] W. W. L. Fung et al. ∼wwlfung/code/tbc-gpgpusim.tgz. Accessed: February 19, 2013. → pages 63

[64] M. Gebhart, S. W. Keckler, B. Khailany, R. Krashinsky, and W. J. Dally. Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor.
In Proceedings of the 2012 45th Annual IEEE/ACM187International Symposium on Microarchitecture, MICRO-45, pages 96–106,Washington, DC, USA, 2012. IEEE Computer Society. → pages 133[65] A. Gharaibeh and M. Ripeanu. Size Matters: Space/Time Tradeoffs toImprove GPGPU Applications Performance. In IEEE/ACMSupercomputing (SC 2010), 2010. → pages 63, 64[66] A. Gharaibeh, L. Beltra˜o Costa, E. Santos-Neto, and M. Ripeanu. A Yokeof Oxen and a Thousand Chickens for Heavy Lifting Graph Processing. InProceedings of the 21st International Conference on Parallel Architecturesand Compilation Techniques, PACT ’12, pages 345–354, New York, NY,USA, 2012. ACM. → pages 7, 16, 159[67] J. E. Gottschlich, M. Vachharajani, and J. G. Siek. An Efficient SoftwareTransactional Memory Using Commit-Time Invalidation. In Proceedingsof the 8th Annual IEEE/ACM International Symposium on CodeGeneration and Optimization, CGO ’10, pages 101–110, New York, NY,USA, 2010. ACM. → pages 32, 168[68] R. Guerraoui and M. Kapalka. On the Correctness of TransactionalMemory. In Proceedings of the 13th ACM SIGPLAN Symposium onPrinciples and Practice of Parallel Programming, PPoPP ’08, pages175–184, New York, NY, USA, 2008. ACM. ISBN 978-1-59593-795-7.doi:10.1145/1345206.1345233. → pages 27, 29, 85, 168[69] Z. S. Hakura and A. Gupta. The Design and Analysis of a CacheArchitecture for Texture Mapping. In Proceedings of the 24th AnnualInternational Symposium on Computer Architecture, ISCA ’97, pages108–120, New York, NY, USA, 1997. ACM. → pages 44[70] P. Hammarlund, A. J. Martinez, A. A. Bajwa, D. L. Hill, E. Hallnor,H. Jiang, M. Dixon, M. Derr, M. Hunsaker, R. Kumar, R. B. Osborne,R. Rajwar, R. Singhal, R. D’Sa, R. Chappell, S. Kaushik, S. Chennupaty,S. Jourdan, S. Gunther, T. Piazza, and T. Burton. Haswell: TheFourth-Generation Intel Core Processor. Micro, IEEE, 34(2):6–20, Mar2014. → pages 160[71] R. Haring, M. Ohmacht, T. Fox, M. Gschwind, D. Satterfield,K. Sugavanam, P. Coteus, P. Heidelberger, M. Blumrich, R. 
Wisniewski,A. Gara, G.-T. Chiu, P. Boyle, N. Chist, and C. Kim. The IBM BlueGene/Q Compute Chip. Micro, IEEE, 32(2):48–60, March 2012. ISSN0272-1732. → pages 19, 24, 26, 34188[72] M. Harris, S. Sengupta, and J. D. Owens. Parallel Prefix Sum (Scan) withCUDA. Chapter 39 in GPU Gems 3, 2007. → pages 16[73] T. Harris, J. Larus, and R. Rajwar. Transactional Memory. Morgan andClaypool, second edition, 2010. → pages 10, 24, 26, 27, 29, 30, 35, 74, 81,83[74] J. Hennessy and D. Patterson. Computer Architecture - A QuantitativeApproach. Morgan Kaufmann, 4 edition, 2008. → pages 13, 14, 16[75] M. Herlihy and J. E. B. Moss. Transactional Memory: ArchitecturalSupport for Lock-Free Data Structures. In Proceedings of the 20th AnnualInternational Symposium on Computer Architecture, ISCA ’93, pages289–300, New York, NY, USA, 1993. ACM. → pages 8, 24, 27, 31, 33, 74,76[76] J. Hoberock, V. Lu, Y. Jia, and J. C. Hart. Stream Compaction for DeferredShading. In Proceedings of the Conference on High Performance Graphics2009, HPG ’09, pages 173–180, New York, NY, USA, 2009. ACM. →pages 162[77] W.-m. W. Hwu. GPU Computing Gems Emerald Edition. MorganKaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 2011. →pages 4[78] IBM Corp. Power ISA Version 2.05, 2007. → pages 22[79] Intel 64 and IA-32 Architectures Software Developer’s Manual. Intel Corp.,May 2012. → pages 140[80] Intel Corp. Intel Architecture Instruction Set Extensions ProgrammingReference, March 2014. → pages 22, 24, 26, 34[81] C. Jacobi, T. Slegel, and D. Greiner. Transactional Memory Architectureand Implementation for IBM System Z. In Proceedings of the 45th AnnualIEEE/ACM International Symposium on Microarchitecture, MICRO-45,pages 25–36, Washington, DC, USA, 2012. IEEE Computer Society. →pages 24, 26, 34[82] E. H. Jensen, G. W. Hagensen, and J. M. Broughton. A New Approach toExclusive Data Access in Shared Memory Multiprocessors. TechnicalReport UCRL-97663, Lawrence Livermore National Laboratory,November 1987. 
→ pages 18189[83] H. F. Jordan. Performance measurements on HEP - a pipelined MIMDcomputer. In ISCA ’83: Proceedings of the 10th Annual InternationalSymposium on Computer Architecture, pages 207–212, 1983. → pages 23[84] U. J. Kapasi et al. Efficient Conditional Operations for Data-ParallelArchitectures. In Proceedings of the 33rd IEEE/ACM InternationalSymposium on Microarchitecture, MICRO-33, pages 159–170,Washington, DC, USA, 2000. IEEE Computer Society. → pages 162[85] S. Keckler, W. Dally, B. Khailany, M. Garland, and D. Glasco. GPUs andthe Future of Parallel Computing. Micro, IEEE, 31(5):7–17, Sept 2011. →pages 166[86] J. H. Kelm, M. R. Johnson, S. S. Lumettta, and S. J. Patel. WAYPOINT:Scaling Coherence to Thousand-Core Architectures. In Proceedings of the19th International Conference on Parallel Architectures and CompilationTechniques, PACT ’10, pages 99–110, New York, NY, USA, 2010. ACM.ISBN 978-1-4503-0178-7. doi:10.1145/1854273.1854291. → pages 80[87] J. H. Kelm et al. Rigel: An Architecture and Scalable ProgrammingInterface for a 1000-core Accelerator. In Proceedings of the 36th AnnualInternational Symposium on Computer Architecture, ISCA ’09, pages140–151, 2009. → pages 19[88] G. Kestor, V. Karakostas, O. S. Unsal, A. Cristal, I. Hur, and M. Valero.RMS-TM: A Comprehensive Benchmark Suite for Transactional MemorySystems. In Proceeding of the second joint WOSP/SIPEW internationalconference on Performance engineering, ICPE ’11, 2011. → pages 92, 94,145[89] Khronos Group. OpenCL. Accessed:November 14, 2012. → pages 2, 36, 37, 44[90] S. Kong et al. Time-Out Bloom Filter: A New Sampling Method forRecording More Flows. In ICOIN, 2006. → pages 169[91] R. Krashinsky, C. Batten, M. Hampton, S. Gerding, B. Pharris, J. Casper,and K. Asanovic. The vector-Thread Architecture. In Proceedings. 31stAnnual International Symposium on Computer Architecture, ISCA ’04,pages 52–63, June 2004. → pages 165[92] R. M. Krashinsky. 
United States Patent Application #20130042090 A1:Temporal SIMT Execution Optimization, August 2011. → pages 166190[93] J. Laudon. Performance/watt: the new server focus. SIGARCH ComputerArchitecture News, 33(4):5–13, 2005. → pages 23[94] E. A. Lee. The Problem with Threads. Computer, 39, May 2006. → pages76[95] V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen,N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal,and P. Dubey. Debunking the 100X GPU vs. CPU Myth: An Evaluation ofThroughput Computing on CPU and GPU. In Proceedings of the 37thAnnual International Symposium on Computer Architecture, ISCA ’10,pages 451–460, New York, NY, USA, 2010. ACM. → pages 4[96] Y. Lee, R. Avizienis, A. Bishara, R. Xia, D. Lockhart, C. Batten, andK. Asanovic´. Exploring the tradeoffs between programmability andefficiency in data-parallel accelerators. In Proceedings of the 38th AnnualInternational Symposium on Computer Architecture, ISCA ’11, pages129–140, New York, NY, USA, 2011. ACM. → pages 166[97] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M.Aamodt, and V. J. Reddi. GPUWattch: Enabling Energy Optimizations inGPGPUs. In Proceedings of the 40th Annual International Symposium onComputer Architecture, ISCA ’13, pages 487–498, New York, NY, USA,2013. ACM. → pages 144, 146, 176[98] M. Lesani, V. Luchangco, and M. Moir. A framework for formallyverifying software transactional memory algorithms. In Proceedings of the23rd International Conference on Concurrency Theory, CONCUR’12,pages 516–530, Berlin, Heidelberg, 2012. Springer-Verlag. → pages 172[99] Y. Lev and M. Moir. Fast Read Sharing Mechanism for SoftwareTransactional Memory (POSTER). In Proceedings of the 24th AnnualACM Symposium on Principles of Distributed Computing, PODC’04, 2004.→ pages 172[100] A. Levinthal and T. Porter. Chap - A SIMD Graphics Processor. 
InProceedings of the 11th Annual Conference on Computer Graphics andInteractive Techniques, SIGGRAPH ’84, pages 77–82, New York, NY,USA, 1984. ACM. → pages 41, 60[101] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: AUnified Graphics and Computing Architecture. Micro, IEEE, 28(2):39–55,March-April 2008. → pages 10, 38191[102] E. Lindholm et al. United States Patent Application #2010/0122067:Across-Thread Out-of-Order Instruction Dispatch in a MultithreadedMicroprocessor (Assignee NVIDIA Corp.), May 2010. → pages 39[103] A. Mahesri. Tradeoffs in Designing Massively Parallel AcceleratorArchitectures. PhD thesis, University of Illinois at Urbana-Champaign,2009. → pages 63[104] A. Mahesri et al. Tradeoffs in designing accelerator architectures for visualcomputing. In Proceedings of the 41st Annual IEEE/ACM InternationalSymposium on Microarchitecture, pages 164–175, 2008. → pages 63, 64[105] W. Maldonado, P. Marlier, P. Felber, A. Suissa, D. Hendler, A. Fedorova,J. L. Lawall, and G. Muller. Scheduling Support for Transactional MemoryContention Management. In Proceedings of the 15th ACM SIGPLANSymposium on Principles and Practice of Parallel Programming, PPoPP’10, pages 79–90, New York, NY, USA, 2010. ACM. ISBN978-1-60558-877-3. doi:10.1145/1693453.1693465. → pages 170[106] M. Martin, C. Blundell, and E. Lewis. Subtleties of transactional memoryatomicity semantics. IEEE Comput. Archit. Lett., 5(2):17–17, July 2006.→ pages 30[107] J. Meng, D. Tarjan, and K. Skadron. Dynamic warp subdivision forintegrated branch and memory divergence tolerance. In Proceedings of the37th Annual International Symposium on Computer Architecture, ISCA’10, pages 235–246, New York, NY, USA, 2010. ACM. → pages 50, 51,52, 55, 164[108] D. Merrill, M. Garland, and A. Grimshaw. Scalable GPU Graph Traversal.In Proceedings of the 17th ACM SIGPLAN Symposium on Principles andPractice of Parallel Programming, PPoPP ’12, pages 117–128, New York,NY, USA, 2012. ACM. 
→ pages 6, 7, 16[109] M. M. Michael. Practical Lock-Free and Wait-Free LL/SC/VLImplementations Using 64-Bit CAS. In Proceedings of the 18thInternational Conference on Distributed Computing, DISC 2004, pages144–158. Springer, 2004. → pages 109, 113[110] M. M. Michael and M. L. Scott. Correction of a memory managementmethod for lock-free data structures. Technical report, Rochester, NY,USA, 1995. → pages 109192[111] C. C. Minh, M. Trautmann, J. Chung, A. McDonald, N. Bronson, J. Casper,C. Kozyrakis, and K. Olukotun. An Effective Hybrid TransactionalMemory System with Strong Isolation Guarantees. In Proceedings of the34th Annual International Symposium on Computer Architecture, ISCA’07, 2007. → pages 31, 33, 34, 80, 133, 134, 167, 168[112] G. E. Moore. Cramming more components onto integrated circuits.Electronics, 38(8):114–117, 1965. → pages 1[113] K. Moore, J. Bobba, M. Moravan, M. Hill, and D. Wood. LogTM:Log-Based Transactional Memory. In Proceedings of the 2006 IEEE 12thInternational Symposium on High Performance Computer Architecture,HPCA ’06, Washington, DC, USA, 2006. IEEE Computer Society. →pages 31, 33, 167[114] S. S. Muchnick. Advanced Compiler Design and Implementation. MorganKaufmanns, 1997. → pages 41[115] N. Muralimanohar, R. Balasubramonian, and N. Jouppi. OptimizingNUCA Organizations and Wiring Alternatives for Large Caches withCACTI 6.0. In Proceedings of the 40th Annual IEEE/ACM InternationalSymposium on Microarchitecture, MICRO 40, pages 3–14, Washington,DC, USA, 2007. IEEE Computer Society. → pages 145[116] S. Naffziger, J. Warnock, and H. Knapp. When Processors Hit the PowerWall (or “When the CPU hits the fan”). In Proceedings of the IEEEInternational Solid-State Circuits Conference, ISSCC 2005, pages 16–17,2005. → pages 1[117] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, andY. N. Patt. Improving GPU Performance via Large Warps and Two-levelWarp Scheduling. 
In Proceedings of the 44th Annual IEEE/ACMInternational Symposium on Microarchitecture, MICRO-44, pages308–317, New York, NY, USA, 2011. ACM. → pages 163[118] R. Nasre, M. Burtscher, and K. Pingali. Morph Algorithms on GPUs. InProceedings of the 18th ACM SIGPLAN Symposium on Principles andPractice of Parallel Programming, PPoPP ’13, pages 147–156, New York,NY, USA, 2013. ACM. → pages 7, 16, 134, 171[119] J. R. Nickolls and J. Reusch. Autonomous SIMD flexibility in the MP-1and MP-2. In SPAA ’93: Proceedings of the 5th Annual ACM Symposiumon Parallel Algorithms and Architectures, pages 98–99, 1993. → pages 38193[120] J. Nickolls et al. Scalable Parallel Programming with CUDA. ACM Queue,6(2):40–53, Mar.-Apr. 2008. → pages 2, 36[121] NVIDIA’s Next Generation CUDA Compute Architecture: Fermi. NVIDIA,2009. → pages 44, 63, 76, 79, 80, 94, 106, 133[122] NVIDIA CUDA Programming Guide v3.1. NVIDIA Corp., 2010. → pages2, 5, 36, 38, 43, 44, 58, 73[123] NVIDIA Compute PTX: Parallel Thread Execution ISA Version 1.4.NVIDIA Corporation, CUDA Toolkit 2.3 edition, 2009. → pages 55[124] M. Olszewski, J. Cutler, and J. G. Steffan. JudoSTM: A DynamicBinary-Rewriting Approach to Software Transactional Memory. InProceedings of the 16th International Conference on Parallel Architectureand Compilation Techniques, PACT ’07, pages 365–375, Washington, DC,USA, 2007. IEEE Computer Society. → pages 9, 30, 32, 75, 81, 84, 87,109, 168[125] S. Owre, J. M. Rushby, and N. Shankar. Pvs: A prototype verificationsystem. In Proceedings of the 11th International Conference on AutomatedDeduction: Automated Deduction, CADE-11, pages 748–752, London,UK, UK, 1992. Springer-Verlag. ISBN 3-540-55602-8. → pages 172[126] V. Pankratius and A.-R. Adl-Tabatabai. A study of transactional memoryvs. locks in practice. In Proceedings of the Twenty-third Annual ACMSymposium on Parallelism in Algorithms and Architectures, SPAA ’11,pages 43–52, New York, NY, USA, 2011. ACM. → pages 24[127] J. C. Phillips et al. 
Scalable molecular dynamics with NAMD. Journal ofComputational Chemistry, 2005. → pages 63, 64[128] X. Qian, B. Sahelices, and J. Torrellas. BulkSMT: Designing SMTProcessors for Atomic-Block Execution. In Proceedings of the 2012 IEEE18th International Symposium on High-Performance ComputerArchitecture, HPCA ’12, pages 1–12, Washington, DC, USA, 2012. IEEEComputer Society. → pages 171[129] A. Ramamurthy. Towards Scalar Synchronization in SIMT Architectures.Master’s thesis, University of British Columbia, 2011. → pages 6, 21, 77,94194[130] M. Rhu and M. Erez. CAPRI: Prediction of Compaction-adequacy forHandling Control-divergence in GPGPU Architectures. In Proceedings ofthe 39th Annual International Symposium on Computer Architecture, ISCA’12, pages 61–71, Washington, DC, USA, 2012. IEEE Computer Society.→ pages 69, 163[131] M. Rhu and M. Erez. The Dual-path Execution Model for Efficient GPUControl Flow. In Proceedings of the 2013 IEEE 19th InternationalSymposium on High Performance Computer Architecture (HPCA), HPCA’13, pages 591–602, Washington, DC, USA, 2013. IEEE ComputerSociety. → pages 165[132] T. G. Rogers, M. O’Connor, and T. M. Aamodt. Cache-ConsciousWavefront Scheduling. In Proceedings of the 2012 45th Annual IEEE/ACMInternational Symposium on Microarchitecture, MICRO-45, pages 72–83,Washington, DC, USA, 2012. IEEE Computer Society. → pages 144, 158[133] C. J. Rossbach, O. S. Hofmann, and E. Witchel. Is transactionalprogramming actually easier? In Proceedings of the 15th ACM SIGPLANSymposium on Principles and Practice of Parallel Programming, PPoPP’10, pages 47–56, New York, NY, USA, 2010. ACM. → pages 24[134] A. Roth. Store Vulnerability Window (SVW): Re-Execution Filtering forEnhanced Load Optimization. In Proceedings of the 32Nd AnnualInternational Symposium on Computer Architecture, ISCA ’05, pages458–468, Washington, DC, USA, 2005. IEEE Computer Society. → pages170[135] W. Ruan et al. 
Boosting Timestamp-based Tranasctional Memory byExploiting Hardware Cycle Counters. In TRANSACT, 2013. → pages 140,171[136] B. Saha, A.-R. Adl-Tabatabai, and Q. Jacobson. Architectural Support forSoftware Transactional Memory. In Proceedings of the 39th AnnualIEEE/ACM International Symposium on Microarchitecture, MICRO-39,pages 185–196, Washington, DC, USA, 2006. IEEE Computer Society.ISBN 0-7695-2732-9. doi:10.1109/MICRO.2006.9. → pages 167[137] D. Sanchez, L. Yen, M. D. Hill, and K. Sankaralingam. ImplementingSignatures for Transactional Memory. In Proceedings of the 40th AnnualIEEE/ACM International Symposium on Microarchitecture, MICRO-40,pages 123–133, Washington, DC, USA, 2007. IEEE Computer Society. →pages 81, 103195[138] N. Satish, C. Kim, J. Chhugani, H. Saito, R. Krishnaiyer, M. Smelyanskiy,M. Girkar, and P. Dubey. Can traditional programming bridge the ninjaperformance gap for parallel computing applications? In Proceedings ofthe 39th Annual International Symposium on Computer Architecture, ISCA’12, pages 440–451, Washington, DC, USA, 2012. IEEE ComputerSociety. → pages 22[139] M. Schatz et al. High-Throughput Sequence Alignment Using GraphicsProcessing Units. BMC Bioinformatics, 8(1):474, 2007. → pages 63[140] F. T. Schneider, V. Menon, T. Shpeisman, and A.-R. Adl-Tabatabai.Dynamic optimization for efficient strong atomicity. In Proceedings of the23rd ACM SIGPLAN Conference on Object-oriented ProgrammingSystems Languages and Applications, OOPSLA ’08, pages 181–194, NewYork, NY, USA, 2008. ACM. → pages 31[141] M. L. Scott. Shared-Memory Sycnhronization. Morgan and Claypool, firstedition, 2013. → pages 17, 18, 21[142] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey,S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski,T. Juan, and P. Hanrahan. Larrabee: A Many-Core x86 Architecture forVisual Computing. In ACM SIGGRAPH 2008 Papers, SIGGRAPH ’08,pages 18:1–18:15, New York, NY, USA, 2008. ACM. ISBN978-1-4503-0112-1. 
doi:10.1145/1399504.1360617. → pages 79[143] T. A. Shah. FabMem: A Multiported RAM and CAM Compiler forSuperscalar Design Space Exploration. Master’s thesis, North CarolinaState University, 2010. → pages 145[144] P. Shivakumar and N. Jouppi. CACTI 5.0. Technical ReportHPL-2007-167. HP Laboratories, 2007. → pages 76, 106[145] A. Shriraman, S. Dwarkadas, and M. L. Scott. Flexible DecoupledTransactional Memory Support. In Proceedings of the 35th AnnualInternational Symposium on Computer Architecture, ISCA ’08, pages139–150, Washington, DC, USA, 2008. IEEE Computer Society. → pages31, 32, 33, 168[146] I. Singh, A. Shriraman, W. W. L. Fung, M. O’Connor, and T. M. Aamodt.Cache Coherence for GPU Architectures. In Proceedings of the 2013 IEEE19th International Symposium on High Performance Computer196Architecture (HPCA), HPCA ’13, pages 578–590, Washington, DC, USA,2013. IEEE Computer Society. → pages 140, 171[147] D. J. Sorin, M. D. Hill, and D. A. Wood. A Primer on Memory Consistencyand Cache Coherence. Morgan & Claypool Publishers, 1st edition, 2011.→ pages 31[148] M. F. Spear, V. J. Marathe, W. N. Scherer, and M. L. Scott. ConflictDetection and Validation Strategies for Software Transactional Memory. InProceedings of the 20th International Conference on DistributedComputing, DISC’06, pages 179–193, Berlin, Heidelberg, 2006.Springer-Verlag. → pages 125, 139, 172[149] M. F. Spear, M. M. Michael, and C. von Praun. RingSTM: ScalableTransactions with a Single Atomic Instruction. In Proceedings of theTwentieth Annual Symposium on Parallelism in Algorithms andArchitectures, SPAA ’08, pages 275–284, New York, NY, USA, 2008.ACM. → pages 9, 32, 75, 84, 88, 117, 168, 169[150] M. Steffen and J. Zambreno. Improving SIMT Efficiency of GlobalRendering Algorithms with Architectural Support for DynamicMicro-Kernels. In Proceedings of the 2010 43rd Annual IEEE/ACMInternational Symposium on Microarchitecture, MICRO ’43, pages237–248, Washington, DC, USA, 2010. IEEE Computer Society. 
→ pages163[151] J. M. Stone. A Simple and Correct Shared-queue Algorithm UsingCompare-and-swap. In Proceedings of the 1990 ACM/IEEE Conference onSupercomputing, Supercomputing ’90, pages 495–504, Los Alamitos, CA,USA, 1990. IEEE Computer Society Press. → pages 109[152] OpenSPARCT M T2 Core Microarchitecture Specification. SunMicrosystems, Inc., 2007. → pages 23[153] F. Tabba, A. W. Hay, and J. R. Goodman. Transactional ConflictDecoupling and Value Prediction. In Proceedings of the InternationalConference on Supercomputing, ICS ’11, pages 33–42, New York, NY,USA, 2011. ACM. ISBN 978-1-4503-0102-2.doi:10.1145/1995896.1995904. → pages 169[154] S. T. Thakkar and T. Huff. Internet Streaming SIMD Extensions.Computer, 32(12):26–34, 1999. → pages 22197[155] M. R. Thistle and B. J. Smith. A processor architecture for Horizon. InProceedings of Supercomputing, pages 35–41, 1988. → pages 23[156] J. E. Thornton. Parallel Operation in the Control Data 6600. In AFIPSProc. FJCC, volume 26, pages 33–40, 1964. → pages 23[157] S. Tomic´, C. Perfumo, C. Kulkarni, A. Armejach, A. Cristal, O. Unsal,T. Harris, and M. Valero. EazyHTM: Eager-Lazy Hardware TransactionalMemory. In Proceedings of the 42Nd Annual IEEE/ACM InternationalSymposium on Microarchitecture, MICRO 42, pages 145–155, New York,NY, USA, 2009. ACM. → pages 31, 32, 167[158] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous multithreading:maximizing on-chip parallelism. In ISCA ’95: Proceedings of the 22ndAnnual International Symposium on Computer Architecture, pages392–403, 1995. → pages 23[159] A. S. Vaidya, A. Shayesteh, D. H. Woo, R. Saharoy, and M. Azimi. SIMDDivergence Optimization Through Intra-warp Compaction. In Proceedingsof the 40th Annual International Symposium on Computer Architecture,ISCA ’13, pages 368–379, New York, NY, USA, 2013. ACM. → pages 163[160] L. G. Valiant. A bridging model for parallel computation. Commun. ACM,33(8):103–111, 1990. → pages 19[161] J. D. Valois. 
Lock-free Linked Lists Using Compare-and-swap. InProceedings of the Fourteenth Annual ACM Symposium on Principles ofDistributed Computing, PODC ’95, pages 214–222, New York, NY, USA,1995. ACM. → pages 109[162] V. Vineet and P. Narayanan. CudaCuts: Fast Graph Cuts on the GPU. InCVPRW ’08, 2008. → pages 92, 93, 145[163] G. WeiKum and G. Vossen. Transactional Information Systems: TheoryAlgorithms, and the Practice of Concurrency Control and Recovery.Morgan Kaufmann, 2002. → pages 27, 28, 110, 111, 121, 122[164] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel.Optimization of Sparse Matrix-vector Multiplication on EmergingMulticore Platforms. In Proceedings of the 2007 ACM/IEEE Conferenceon Supercomputing, SC ’07, pages 38:1–38:12, New York, NY, USA, 2007.ACM. → pages 16198[165] H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos.Demystifying GPU microarchitecture through microbenchmarking. InProceedings of the IEEE Symposium of Performance and Analysis ofSystems and Software, ISPASS’10, 2010. → pages 95[166] S. Xiao and W. chun Feng. Inter-block GPU Communication via FastBarrier Synchronization. In Parallel Distributed Processing (IPDPS), 2010IEEE International Symposium on, IPDPS’10, pages 1–12, April 2010. →pages 20[167] Y. Xu, R. Wang, N. Goswami, T. Li, L. Gao, and D. Qian. SoftwareTransactional Memory for GPU Architectures. In Proceedings of AnnualIEEE/ACM International Symposium on Code Generation andOptimization, CGO ’14, pages 1:1–1:10, New York, NY, USA, 2014.ACM. → pages 125, 139, 142, 167[168] Y. Yang, P. Xiang, M. Mantor, N. Rubin, and H. Zhou. Shared MemoryMultiplexing: A Novel Way to Improve GPGPU Throughput. InProceedings of the 21st International Conference on Parallel Architecturesand Compilation Techniques, PACT ’12, pages 283–292, New York, NY,USA, 2012. ACM. → pages 133, 171[169] K. Yelick. Antisocial Parallelism: Avoiding, Hiding and ManagingCommunication. 2013. Keynote at HPCA-2013. → pages 159[170] L. Yen, J. 
Bobba, M. Marty, K. Moore, H. Volos, M. Hill, M. Swift, andD. Wood. LogTM-SE: Decoupling Hardware Transactional Memory fromCaches. In Proceedings of the 2007 IEEE 13th International Symposium onHigh Performance Computer Architecture, HPCA ’07, pages 261–272,Washington, DC, USA, 2007. IEEE Computer Society. → pages 31, 33,80, 81, 133, 134, 167, 168[171] R. M. Yoo and H.-H. S. Lee. Adaptive Transaction Scheduling forTransactional Memory Systems. In Proceedings of the Twentieth AnnualSymposium on Parallelism in Algorithms and Architectures, SPAA ’08,pages 169–178, New York, NY, USA, 2008. ACM. ISBN978-1-59593-973-9. doi:10.1145/1378533.1378564. → pages 92, 108, 170[172] E. Z. Zhang, Y. Jiang, Z. Guo, and X. Shen. Streamlining GPUApplications on the Fly: Thread Divergence Elimination Through RuntimeThread-data Remapping. In Proceedings of the 24th ACM InternationalConference on Supercomputing, ICS ’10, pages 115–126, New York, NY,USA, 2010. ACM. → pages 162199[173] H. Zhao, A. Shriraman, and S. Dwarkadas. SPACE: Sharing Pattern-basedDirectory Coherence for Multicore Scalability. In Proceedings of the 19thInternational Conference on Parallel Architectures and CompilationTechniques, PACT ’10, pages 135–146, New York, NY, USA, 2010. ACM.ISBN 978-1-4503-0178-7. doi:10.1145/1854273.1854294. → pages 80[174] F. Zyulkyarov, S. Stipic, T. Harris, O. S. Unsal, A. Cristal, I. Hur, andM. Valero. Discovering and understanding performance bottlenecks intransactional applications. In Proceedings of the 19th InternationalConference on Parallel Architectures and Compilation Techniques, PACT’10, pages 285–294, New York, NY, USA, 2010. ACM. ISBN978-1-4503-0178-7. doi:10.1145/1854273.1854311. → pages 108200

