UBC Theses and Dissertations
Efficient synchronization mechanisms for scalable GPU architectures Ren, Xiaowei 2020

Full Text
Efficient Synchronization Mechanisms for Scalable GPU Architectures

by

Xiaowei Ren

M.Sc., Xi'an Jiaotong University, 2015
B.Sc., Xi'an Jiaotong University, 2012

a thesis submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in

the faculty of graduate and postdoctoral studies
(Electrical and Computer Engineering)

The University of British Columbia
(Vancouver)

October 2020

© Xiaowei Ren, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the dissertation entitled:

Efficient Synchronization Mechanisms for Scalable GPU Architectures

submitted by Xiaowei Ren in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering.

Examining Committee:

Mieszko Lis, Electrical and Computer Engineering (Supervisor)
Steve Wilton, Electrical and Computer Engineering (Supervisory Committee Member)
Konrad Walus, Electrical and Computer Engineering (University Examiner)
Ivan Beschastnikh, Computer Science (University Examiner)
Vijay Nagarajan, School of Informatics, University of Edinburgh (External Examiner)

Additional Supervisory Committee Members:

Tor Aamodt, Electrical and Computer Engineering (Supervisory Committee Member)

Abstract

The Graphics Processing Unit (GPU) has become a mainstream computing platform for a wide range of applications. Unlike latency-critical Central Processing Units (CPUs), throughput-oriented GPUs provide high performance by exploiting massive application parallelism.

In parallel programming, synchronization is necessary to exchange information for inter-thread dependencies. However, inefficient synchronization support can serialize thread execution and restrict parallelism significantly. Since parallelism is key to GPU performance, we aim to provide efficient and reliable synchronization support for both single-GPU and multi-GPU systems.
To achieve this target, this dissertation explores multiple abstraction layers of computer systems, including programming models, memory consistency models, cache coherence protocols, and application-specific knowledge of graphics rendering.

First, to reduce the programming burden without introducing data races, we propose Relativistic Cache Coherence (RCC) to enforce Sequential Consistency (SC). By using logical timestamps to avoid stalls for write-permission acquisition, RCC is 30% faster than the best prior SC proposal, and only 7% slower than the best non-SC design. Second, we introduce GETM, the first GPU Hardware Transactional Memory (HTM) with eager conflict detection, to help programmers implement deadlock-free, yet aggressively parallel code. Compared to the best prior GPU HTM, GETM is up to 2.1× (1.2× gmean) faster, with 3.6× lower area overheads and 2.2× lower power overheads. Third, we design HMG, a hierarchical cache coherence protocol for multi-GPU systems. By leveraging the latest scoped memory models, HMG not only avoids the full cache invalidations of software coherence protocols, but also filters out write-invalidation acknowledgments and transient coherence states. Despite minimal hardware overhead, HMG achieves 97% of the performance of an idealized caching system. Finally, we propose CHOPIN, a novel Split Frame Rendering (SFR) scheme that takes advantage of the parallelism of image composition. CHOPIN eliminates the performance overheads of primitive duplication and sequential primitive distribution that exist in previous work. CHOPIN outperforms the best prior SFR implementation by up to 56% (25% gmean) in an 8-GPU system.

Lay Summary

This dissertation proposes architectural support for efficient synchronization in both single-GPU and multi-GPU systems.
The innovations span multiple abstraction layers of the computing system, including the programming model, memory consistency model, cache coherence protocol, and application-specific knowledge of graphics processing. This can simplify GPU programming, increase performance, and extend hardware scalability to large-scale systems, thereby attracting more programmers and extending GPUs to a wider range of application domains.

Preface

The following is a list of my publications during the PhD program, in chronological order:

[C1] Xiaowei Ren and Mieszko Lis. "Efficient Sequential Consistency in GPUs via Relativistic Cache Coherence". In Proceedings of the 23rd International Symposium on High Performance Computer Architecture (HPCA), pages 625–636. IEEE, 2017.

[C2] Xiaowei Ren and Mieszko Lis. "High-Performance GPU Transactional Memory via Eager Conflict Detection". In Proceedings of the 24th International Symposium on High Performance Computer Architecture (HPCA), pages 235–246. IEEE, 2018.

[C3] Xiaowei Ren, Daniel Lustig, Evgeny Bolotin, Aamer Jaleel, Oreste Villa, and David Nellans. "HMG: Extending Cache Coherence Protocols Across Modern Hierarchical Multi-GPU Systems". In Proceedings of the 26th International Symposium on High Performance Computer Architecture (HPCA), pages 582–595. IEEE, 2020.

[C4] Xiaowei Ren and Mieszko Lis. "CHOPIN: Scalable Graphics Rendering in Multi-GPU Systems via Parallel Image Composition". (Under Submission)

The publications are incorporated into this dissertation as follows:

• Chapter 2 uses background section materials from [C1], [C2], [C3], and [C4].

• Chapter 3 presents a version of the material published in [C1]. In this work, Xiaowei Ren was the leading researcher: he designed Relativistic Cache Coherence (RCC), implemented and evaluated RCC in a simulation framework, analyzed simulation results, and contributed to the article writing.
This work was done under the supervision of Professor Mieszko Lis, who wrote most of the article and provided lots of helpful guidance.

• Chapter 4 presents a version of the material published in [C2]. In this work, Xiaowei Ren was the leading researcher: he proposed the eager conflict detection mechanism and the metadata tracking hardware structure for GETM, implemented and evaluated GETM in a simulation framework, analyzed simulation results, and contributed to the article writing. This work was done under the supervision of Professor Mieszko Lis, who wrote most of the article and provided lots of helpful guidance.

• Chapter 5 presents a version of the material published in [C3]. In this work, Xiaowei Ren was the leading researcher: he extended the cache coherence protocol to hierarchical multi-GPU systems, deeply optimized the protocol by leveraging the characteristics of the scoped memory consistency model, implemented and evaluated the proposal in a simulation framework, analyzed simulation results, and wrote the article. This work was done under the mentoring of Daniel Lustig at NVIDIA, who provided lots of helpful guidance. The other coauthors also offered many helpful comments on this work.

• Chapter 6 presents a version of the material described in [C4]. In this work, Xiaowei Ren was the leading researcher: he proposed the scalable Split Frame Rendering (SFR) scheme leveraging parallel image composition, optimized the proposal by designing a draw command scheduler and an image composition scheduler, implemented and evaluated the proposal in a simulation framework, analyzed simulation results, and wrote the article. This work was done under the supervision of Professor Mieszko Lis, who provided lots of helpful guidance.

• Chapter 7 uses the related work sections from [C1], [C2], [C3], and [C4].

• Chapter 8 uses the conclusion text from [C1], [C2], [C3], and [C4].

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgments
1 Introduction
  1.1 The Extensive Usage of the GPU Platform
  1.2 Challenges of GPU Synchronization
  1.3 Thesis Statement
  1.4 Contributions
  1.5 Organization
2 Background
  2.1 GPU Architectures
    2.1.1 High-level Architecture and Programming Model
    2.1.2 Hierarchical Multi-Module and Multi-GPU Systems
  2.2 Synchronization
    2.2.1 Locks
    2.2.2 Transactional Memory
  2.3 Memory Consistency Model
  2.4 Cache Coherence Protocol
  2.5 Graphics Processing
    2.5.1 The 3D Rendering Pipeline
    2.5.2 The Graphics GPU Architecture
3 Efficient Sequential Consistency via Relativistic Cache Coherence
  3.1 GPUs vs. CPUs: A Consistency and Coherence Perspective
  3.2 Bottlenecks of Enforcing Sequential Consistency
  3.3 Enforcing Sequential Consistency in Logical Time
  3.4 Relativistic Cache Coherence (RCC)
    3.4.1 Logical Clocks, Versions, and Leases
    3.4.2 Example Walkthrough
    3.4.3 Coherence Protocol: States and Transitions
    3.4.4 L2 Evictions and Timestamp Rollover
    3.4.5 Lease Time Extension and Prediction
    3.4.6 RCC-WO: A Weakly Ordered Variant
  3.5 Methodology
  3.6 Evaluation Results
    3.6.1 Performance Analysis
    3.6.2 Energy Cost and Traffic Load
    3.6.3 Coherence Protocol Complexity
    3.6.4 Area Cost
  3.7 Summary
4 Hardware Transactional Memory with Eager Conflict Detection
  4.1 GPU Transactional Memory
  4.2 Eager Conflict Detection and GPUs
  4.3 GPUs Favour Eager Conflict Detection
  4.4 GETM Transactional Memory
    4.4.1 Atomicity, Consistency, and Isolation
    4.4.2 Walkthrough Example
  4.5 GETM Implementation Details
    4.5.1 SIMT Core Extensions
    4.5.2 Validation Unit
    4.5.3 Commit-Time Coalescing
  4.6 Methodology
  4.7 Evaluation Results
    4.7.1 Performance Analysis
    4.7.2 Sensitivity Analysis
    4.7.3 Transaction Abort Rates
    4.7.4 Scalability
    4.7.5 Area and Power Cost
  4.8 Summary
5 Cache Coherence Protocol for Hierarchical Multi-GPU Systems
  5.1 Emerging Programs Need Fine-Grained Communication
  5.2 GPU Weak Memory Model
  5.3 Existing GPU Cache Coherence
  5.4 The Novel Coherence Needs of Modern Multi-GPU Systems
    5.4.1 Extending Coherence to Multiple GPUs
    5.4.2 Leveraging GPU Weak Memory Models
  5.5 Baseline Non-Hierarchical Cache Coherence
    5.5.1 Architectural Overview
    5.5.2 Coherence Protocol Flows in Detail
  5.6 Hierarchical Multi-GPU Cache Coherence
    5.6.1 Architectural Overview
    5.6.2 Coherence Protocol Flows in Detail
  5.7 Methodology
  5.8 Evaluation Results
    5.8.1 Performance Analysis
    5.8.2 Sensitivity Analysis
    5.8.3 Hardware Costs
    5.8.4 Discussion
  5.9 Summary
6 Scalable Multi-GPU Rendering via Parallel Image Composition
  6.1 Parallel Image Composition
  6.2 Limits of Existing Solutions
  6.3 CHOPIN: Leveraging Parallel Image Composition
  6.4 The CHOPIN Architecture
    6.4.1 Software Extensions
    6.4.2 Hardware Extensions
    6.4.3 Composition Workflow
    6.4.4 Draw Command Scheduler
    6.4.5 Image Composition Scheduler
  6.5 Methodology
  6.6 Evaluation Results
    6.6.1 Performance Analysis
    6.6.2 Composition Traffic Load
    6.6.3 Sensitivity Analysis
    6.6.4 Hardware Costs
    6.6.5 Discussion
  6.7 Summary
7 Related Work
  7.1 Work Related to Memory Consistency Enforcement
  7.2 Work Related to Cache Coherence Protocol
  7.3 Work Related to Transactional Memory
  7.4 Work Related to Graphics Processing
8 Conclusions and Future Work
  8.1 Conclusions
  8.2 Directions of Future Work
    8.2.1 Logical-Time Cache Coherence in Heterogeneous Systems
    8.2.2 Reducing Transaction Abort Rates of GETM
    8.2.3 Scoped Memory Model vs. Easy Programming
    8.2.4 Scaling CHOPIN to Larger Systems
Bibliography

List of Tables

Table 3.1  SC and coherence protocol proposals for GPUs.
Table 3.2  Timestamps used in RCC.
Table 3.3  Simulated GPU and memory hierarchy for RCC.
Table 3.4  Benchmarks used for RCC evaluation.
Table 3.5  The number of states (stable + transient) and transitions for different coherence protocols.
Table 4.1  Metadata tracked by GETM.
Table 4.2  Simulated GPU and memory hierarchy for GETM.
Table 4.3  Benchmarks used for GETM evaluation.
Table 4.4  Optimal concurrency (# warp transactions per core) settings and abort rates for different workloads.
Table 4.5  Area and power overheads of different GPU TM designs.
Table 5.1  NHCC and HMG coherence directory transition table.
Table 5.2  Simulated GPU and memory hierarchy for HMG.
Table 5.3  Benchmarks used for HMG evaluation.
Table 6.1  Fields tracked by the image composition scheduler.
Table 6.2  Simulated GPU and memory hierarchy for CHOPIN.
Table 6.3  Benchmarks used for CHOPIN evaluation.

List of Figures

Figure 2.1  High-level GPU architecture as seen by programmers [31].
Figure 2.2  GPU programming model.
Figure 2.3  3D graphics pipeline and corresponding architectural support.
Figure 3.1  The characterization of SC stalls.
Figure 3.2  High-level view of enforcing SC in logical time.
Figure 3.3  A walkthrough example of RCC.
Figure 3.4  Full L1 and L2 coherence FSMs of RCC.
Figure 3.5  State transition tables for RCC.
Figure 3.6  The characterization of loads on expired data.
Figure 3.7  The benefits afforded by lease renewal and lease prediction.
Figure 3.8  Speedup of RCC on inter- and intra-workgroup workloads.
Figure 3.9  The improvement of SC stalls by RCC.
Figure 3.10 Speedup of weak ordering implementations vs. RCC-SC on inter- and intra-workgroup workloads.
Figure 3.11 Energy cost of RCC on inter- and intra-workgroup workloads.
Figure 3.12 Traffic load of RCC on inter- and intra-workgroup workloads.
Figure 4.1  CUDA ATM benchmark fragment using either locks or TM.
Figure 4.2  Messages required for transactional memory accesses and commits in WarpTM (top) and GETM (bottom).
Figure 4.3  The potential performance improvement created by eager conflict detection.
Figure 4.4  Benefits of eager conflict detection compared with a lazy mechanism and hand-optimized fine-grained lock implementations.
Figure 4.5  Overall architecture of a SIMT core with GETM.
Figure 4.6  The flowchart for load, store, and commit/abort logic in GETM.
Figure 4.7  A walkthrough example of eager conflict resolution in GETM.
Figure 4.8  Transaction metadata table microarchitecture.
Figure 4.9  Stall buffer microarchitecture.
Figure 4.10 Transaction-only execution and wait time, normalized to WarpTM.
Figure 4.11 Program execution time normalized to the fine-grained lock baseline, including transactional and non-transactional parts.
Figure 4.12 Crossbar traffic load normalized to WarpTM.
Figure 4.13 Mean latency of the cuckoo table in the metadata storage structure.
Figure 4.14 Performance sensitivity of GETM to metadata table size and tracking granularity, normalized to a WarpTM baseline.
Figure 4.15 The maximum number of addresses queued at any given time.
Figure 4.16 The average number of requests per address that concurrently reside in the stall buffer.
Figure 4.17 Program execution time in 15-core and 56-core GPUs, normalized to 15-core WarpTM.
Figure 5.1  Forward-looking multi-GPU system. Each GPU has multiple GPU Modules (GPMs).
Figure 5.2  Benefits of caching remote GPU data under three different protocols on a 4-GPU system with 4 GPMs per GPU, all normalized to a baseline which has no such caching.
Figure 5.3  Percentage of inter-GPU loads destined to addresses accessed by another GPM in the same GPU.
Figure 5.4  Future GPUs will consist of multiple GPU Modules (GPMs), and each GPM might be a chiplet in a single package.
Figure 5.5  NHCC coherence architecture.
Figure 5.6  Hierarchical coherence in multi-GPU systems.
Figure 5.7  Simulator correlation vs. an NVIDIA Quadro GV100, and simulation runtime for our simulator and GPGPU-Sim.
Figure 5.8  Performance of various inter-GPM coherence schemes in a single GPU with 4 GPMs.
Figure 5.9  Performance of various coherence protocols in a 4-GPU system, where each GPU is composed of 4 GPMs.
Figure 5.10 Average number of cache lines invalidated by each store request on shared data.
Figure 5.11 Average number of cache lines invalidated by each coherence directory eviction.
Figure 5.12 Total bandwidth cost of invalidation messages.
Figure 5.13 Performance sensitivity to inter-GPU bandwidth.
Figure 5.14 Performance sensitivity to L2 cache size.
Figure 5.15 Performance sensitivity to the coherence directory size.
Figure 5.16 Performance sensitivity to the coherence directory tracking granularity.
Figure 6.1  Percentage of geometry processing cycles in the graphics pipeline of a conventional SFR implementation.
Figure 6.2  Graphics pipelines of GPUpd and CHOPIN.
Figure 6.3  Percentage of execution cycles of the extra pipeline stages in GPUpd.
Figure 6.4  Potential performance improvement afforded by leveraging parallel image composition.
Figure 6.5  High-level system overview of CHOPIN.
Figure 6.6  The workflow of each composition group.
Figure 6.7  Performance overhead of round-robin draw command scheduling.
Figure 6.8  Triangle rate of the geometry processing stage (top) and the whole graphics pipeline (bottom).
Figure 6.9  Draw command scheduler microarchitecture.
Figure 6.10 Image composition scheduler microarchitecture.
Figure 6.11 Image composition scheduler workflow.
Figure 6.12 Performance of an 8-GPU system; the baseline is primitive duplication with the configurations of Table 6.2.
Figure 6.13 Execution cycle breakdown of graphics pipeline stages; all results normalized to the cycles of primitive duplication.
Figure 6.14 Traffic load of parallel image composition.
Figure 6.15 Performance sensitivity to the frequency of updates sent to the draw command scheduler.
Figure 6.16 Performance sensitivity to the number of GPUs.
Figure 6.17 Performance sensitivity to inter-GPU link bandwidth.
Figure 6.18 Performance sensitivity to inter-GPU link latency.
Figure 6.19 Performance sensitivity to the threshold of composition group size.

List of Abbreviations

AFR    Alternate Frame Rendering
AI     Artificial Intelligence
API    Application Programming Interface
AR     Augmented Reality
CMP    Chip Multiprocessor
CPU    Central Processing Unit
CTA    Thread Block Array
DLP    Data Level Parallelism
DLSS   Deep Learning Super Sampling
DRF    Data-Race-Free
FB     Framebuffer
GPC    Graphics Processing Cluster
GPM    GPU Module
GPU    Graphics Processing Unit
HRF    Heterogeneous-Race-Free
HTM    Hardware Transactional Memory
ILP    Instruction Level Parallelism
IMR    Immediate Mode Rendering
LLC    Last Level Cache
LRU    Least Recently Used
MCM    Multi-Chip-Module
MIMD   Multiple-Instruction, Multiple-Data
MLP    Memory Level Parallelism
NUMA   Non-Uniform Memory Access
PME    PolyMorph Engine
PSO    Partial Store Ordering
RC     Release Consistency
RCC    Relativistic Cache Coherence
ROP    Rendering Output Unit
RT     Render Target
SC     Sequential Consistency
SIMD   Single-Instruction, Multiple-Data
SIMT   Single-Instruction, Multiple-Thread
SM     Streaming Multiprocessor
SFR    Split Frame Rendering
SWMR   Single Writer Multiple Reader
TLP    Thread Level Parallelism
TM     Transactional Memory
TPC    Texture Processing Cluster
TSO    Total Store Ordering
VR     Virtual Reality
VU     Validation Unit
WO     Weak Ordering

Acknowledgments

It has been a long journey since I started my PhD study. Many great people have made this journey unforgettable and worth taking. Without their help, the work in this dissertation would not have been possible.

First and foremost, I would like to thank my dear supervisor, Professor Mieszko Lis.
It has been my great honour to work with him throughout my PhD program. His dedication to research and his kindness to students have truly inspired me, both professionally and personally. He was always there for me whenever I needed advice, whether technical, professional, personal, or otherwise. He also encouraged me to explore various research topics, which has enormously enriched my knowledge beyond this dissertation. He accepted nothing less than my best, and I will be forever grateful for his mentorship.

I also would like to thank my qualifying, department, and university examination committee members: Professor Tor Aamodt, Professor Steve Wilton, Professor Sudip Shekhar, Professor Guy Lemieux, Professor Shahriar Mirabbasi, Professor Konrad Walus, Professor Ivan Beschastnikh, Professor Ryozo Nagamune, and Professor Vijay Nagarajan. I am grateful for their insightful feedback, which has immensely improved my research work.

I also would like to thank all the people who helped me during my internships at NVIDIA and MPI-SWS, especially Daniel Lustig, David Nellans, Evgeny Bolotin, Aamer Jaleel, Oreste Villa, Viktor Vafeiadis, and Michalis Kokologiannakis. Daniel Lustig was a great mentor during my two internships at NVIDIA; I really appreciate his technical and personal support during and after my time there. Viktor Vafeiadis was an excellent mentor while I was at MPI-SWS; I am extremely thankful that he offered me a precious opportunity to explore a totally new research topic.

I also would like to thank all my lab colleagues: Amin Ghasemazar, Ellis Su, Mohammad Ewais, Maximilian Golub, Khaled E. Ahmed, Mohamed Omran Matar, Dingqing Yang, John Deppe, Peter Deutsch, Mohammad Olyaiy, Christopher Ng, Muchen He, Winnie Gong, and Avilash Mukherjee. I have learned a lot from all of them.
Special thanks to Ellis Su, Dingqing Yang, Amin Ghasemazar, and John Deppe; it was my great pleasure to cooperate with them.

Finally, I would like to extend a huge thank you to my parents and grandparents. They have fully supported my decision to pursue my PhD degree. Even though they did not understand my research, they firmly believed in me and offered me their heartwarming encouragement. I regret that I could not go home to visit them very often during the past several years. I wish them happiness and health forever.

To my parents.

Chapter 1

Introduction

During the past few decades, semiconductor technology has largely benefited from Moore's Law scaling, which has enabled an exponentially increasing number of transistors on a single chip. With abundant transistors, processor designers have greatly improved single-thread performance by maximizing Instruction Level Parallelism (ILP). Multi-core architectures, such as the Chip Multiprocessor (CMP), have also been designed to exploit Data Level Parallelism (DLP) and Thread Level Parallelism (TLP). However, the failure of Dennard Scaling has made these architectures hit the power wall, and the problem of dark silicon may significantly limit their scalability [73]. What makes the situation even worse is that Moore's Law is also approaching its end [210, 232]. In response to these observations, much attention has been focused on more cost-efficient alternatives. The massively parallel GPU architecture has proven to be a promising candidate.

The CMP is a widely adopted multi-core CPU architecture. Individual cores are optimized to reduce single-thread latency, but computing throughput is usually limited by the relatively small number of processor cores. In contrast, the GPU trades off single-thread latency for system throughput: a massive number of threads run on simple shader cores in parallel.
The cost of instruction fetching, decoding, and scheduling is amortized by executing threads in Single-Instruction, Multiple-Data (SIMD) fashion.

Thanks to this abundant parallelism, the GPU architecture is much more cost-efficient for applications that perform identical operations on enormous amounts of data (i.e., regular parallelism). A typical application is graphics processing, the application the GPU was originally built for. In recent years, with enhancements in both software and hardware, academic researchers and industry vendors have successfully extended GPUs into a much wider range of application domains, such as graph algorithms [91], scientific computing [101], machine learning [226], and so on. The GPU has become one of the major platforms in the computing community.

In parallel architectures, efficient synchronization mechanisms are critical to guaranteeing high performance. Otherwise, synchronization can significantly restrict available parallelism and reduce performance. The massive number of concurrent threads in GPUs makes this problem far more prominent than in CPUs. Given the importance of GPU computing, this dissertation explores architectural support for robust and efficient synchronization in GPUs.

1.1 The Extensive Usage of the GPU Platform

Unlike the CMP, the GPU expects applications to expose a large amount of parallelism. For example, 3D graphics rendering contains plenty of data-level parallelism; all attributes of primitives and fragments are computed independently. The code that specifies the detailed operations is called a shader, which is programmed with a graphics Application Programming Interface (API), such as DirectX [7] or Vulkan [10]. Conventional GPUs built fixed pipelines for different shaders. Although this reduces hardware design complexity, it sacrifices the flexibility and utilization of hardware components.
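The data-level parallelism just described can be sketched in a few lines: a shader is a single function applied independently to every fragment, which is what lets the hardware spread fragments across parallel SIMD lanes. This is purely an illustrative software model; the function name and shading formula below are invented for the example, not taken from any real graphics API.

```python
# Illustrative only: a "shader" is one function applied independently to
# every fragment. Because each fragment depends only on its own
# attributes, the map below could execute in any order, or all at once
# on SIMD lanes. The shading formula is made up for this sketch.

def shade_fragment(frag):
    # Per-fragment work: scale this fragment's color by its light level.
    r, g, b = frag["color"]
    intensity = frag["light"]
    return (r * intensity, g * intensity, b * intensity)

# A tiny "framebuffer" of independent fragments.
fragments = [{"color": (1.0, 0.5, 0.0), "light": 0.5},
             {"color": (0.2, 0.4, 0.8), "light": 1.0}]

shaded = [shade_fragment(f) for f in fragments]
```

Because no fragment reads another fragment's result, the loop carries no dependencies: exactly the property that makes fixed-function (and later programmable) shading hardware so efficient.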
To address this problem, NVIDIA's Tesla [132] and AMD's TeraScale [142] architectures replaced the fixed pipeline with the unified shader model, which has greatly enriched the programmability of GPU hardware. The increased programmability also created opportunities for non-graphics applications to take advantage of the computing capacity of GPUs. Therefore, in addition to graphics APIs, industry has also developed CUDA [177] and OpenCL [112] for general-purpose applications. The programming model introduced with CUDA and OpenCL is called Single-Instruction, Multiple-Thread (SIMT). SIMT efficiently hides the complexity of the SIMD hardware, because it allows programmers to reason about their code in the same way as single-thread execution. SIMT has seen widespread interest and has greatly extended the usage of the GPU platform. So far, high-level domain-specific libraries built on CUDA exist in a broad range of applications, including ray tracing, medical imaging, machine learning, autonomous driving, robotics, smart cities, and so on [100].

In addition to graphics, machine learning has become another critical GPU application in recent years. Hence, NVIDIA integrated Tensor Cores [163] into their GPUs to accelerate matrix convolution, a common operation in neural networks. They also designed the RTX platform [161] for ray tracing acceleration. Deep Learning Super Sampling (DLSS) is one of the latest graphics technologies enabled by Tensor Cores [178]. With DLSS, NVIDIA set out to redefine real-time rendering through AI-based super resolution: rendering fewer pixels and then using AI to construct sharp, higher-resolution images. Together with the RTX platform, DLSS gives gamers the performance headroom to maximize ray tracing settings and increase output resolutions.
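The SIMT abstraction described above, scalar per-thread code that the runtime maps onto many hardware lanes, can be sketched with a toy launch function. The names (`launch`, `tid`, `saxpy_kernel`) are hypothetical and only loosely modeled on CUDA's style; a real GPU runs these logical threads in parallel, grouped into warps, whereas this sketch iterates serially to show the semantics alone.

```python
# Minimal sketch of the SIMT programming model (names are invented,
# loosely CUDA-flavored; not a real API).

def launch(kernel, n_threads, *args):
    # A real GPU executes these logical threads in parallel on SIMD
    # hardware; serial iteration here only models the semantics.
    for tid in range(n_threads):
        kernel(tid, *args)

def saxpy_kernel(tid, a, x, y, out):
    # Written exactly like single-thread code, as the text describes:
    # each thread handles the element selected by its own thread id.
    out[tid] = a * x[tid] + y[tid]

n = 8
x = list(range(n))
y = [1.0] * n
out = [0.0] * n
launch(saxpy_kernel, n, 2.0, x, y, out)
```

The key point is that `saxpy_kernel` contains no loop over the data: the parallel structure lives entirely in the launch, which is what lets programmers reason about SIMT code as if it were single-threaded.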
In multi-GPU systems, individual GPUs are connected with advanced networking technologies such as NVLink [160] and NVSwitch [162]. GPUs have also been deployed in datacenters and cloud systems to accelerate supercomputing applications such as scientific computing [101], genome sequencing [159], and weather forecasting [175]. Cloud gaming systems, like GeForce Now, have also been built to provide game players a high-quality experience without a substantial hardware investment [158].

1.2 Challenges of GPU Synchronization

Along with the diversification of GPU applications, data sharing and synchronization patterns have become more and more complex. Therefore, GPUs have shifted away from a simple bulk-synchronous model to a more traditional shared memory programming model. Industry vendors have exposed an abstraction of Unified Memory (i.e., a unified virtual address space) to software programmers [112, 168]. The NVIDIA Volta GPU pushed this abstraction even further by enabling independent thread scheduling – each thread can execute independently with an explicit forward-progress guarantee [69]. This modification allows general-purpose Multiple-Instruction, Multiple-Data (MIMD) execution on the SIMD architecture of GPUs [72]. Recently, scoped memory models have also been formalized to support flexible fine-grained synchronization [98, 135]. Despite these innovations, GPU synchronization still faces the following challenges.

• GPU memory models allow many weak behaviours, so programmers need to insert fences to enforce the necessary memory order. However, correctly inserting fences is difficult and bug-prone; the authors of [16] found missing fences in a variety of peer-reviewed publications, and even in vendor guides [204]. In contrast, Sequential Consistency (SC) is the most intuitive memory model for programmers.
However, constrained by strong ordering requirements, the performance of SC enforcement with MESI-like and timestamp-based cache coherence protocols is limited by the stalls for acquiring write permissions [94, 221]. Therefore, it is desirable to design an efficient cache coherence protocol that enforces SC without write stalls.

• Although lock-based synchronization has been widely used in CPU systems, the massive number of threads in GPUs makes it much more difficult to obtain optimal performance and guarantee deadlock-free execution. Transactional Memory (TM) is an alternative solution, which can potentially avoid these problems by relying on an underlying mechanism to detect conflicts (data-races) automatically in a deadlock-free manner [96]. Unfortunately, the excessive conflict detection latency leaves the performance of prior GPU TM designs far behind that of fine-grained locks [76, 77]. This has significantly discouraged the use of TM in GPU systems.

• More recently, GPU systems have undergone two prominent changes: hierarchical architectures and scoped memory models. To keep scaling performance, GPU vendors have built ever-larger GPU systems by connecting multiple chip modules (MCM-GPU) [29] and GPUs (multi-GPUs) [143, 253]. Due to physical constraints, the bandwidth of the latest inter-GPU links [1, 160, 162] is still one order of magnitude lower than that of intra-GPU connections [187], resulting in a severe Non-Uniform Memory Access (NUMA) effect that often bottlenecks performance. In addition, GPUs have also moved from conventional bulk-synchronous models towards scoped memory models for flexible synchronization support [98, 135]. Caching has been widely implemented to mitigate the NUMA effect.
However, none of the existing cache coherence protocols can support efficient caching in multi-GPU systems, since they do not consider these changes in both the architecture and the memory model.

• Apart from the above challenges, which relate to synchronization support for general-purpose applications, it is also critical to improve synchronization for conventional graphics applications. In principle, the latest multi-GPU systems promise to substantially improve performance and offer a much higher-quality visual experience. Unfortunately, it is not clear how to adequately parallelize the rendering pipeline to take advantage of these resources while maintaining low rendering latencies. Current implementations of Split Frame Rendering (SFR) are bottlenecked by redundant computation and sequential inter-GPU synchronization [20, 114, 166], so their performance does not scale. To fully realize the performance potential of multi-GPU systems, a mechanism that eliminates these bottlenecks is urgently needed for graphics applications.

1.3 Thesis Statement

This dissertation aims to address the above challenges. It enhances support for high-performance synchronization in both single-GPU and multi-GPU systems. The major proposed enhancements include efficient cache coherence protocols, GPU Hardware Transactional Memory (HTM), and scalable Split Frame Rendering (SFR). By offering robust and efficient synchronization mechanisms, the proposals in this dissertation can potentially attract more programmers and extend GPUs to a wider range of application domains.

First, this dissertation proposes Relativistic Cache Coherence (RCC) to enforce SC in logical time. In RCC, all Streaming Multiprocessor (SM) cores and cachelines have separate timestamps. Therefore, SM cores can see different cachelines based on their own timestamps, and SM core timestamps advance independently according to the cachelines they access.
Instead of stalling write requests until all sharers become invalid, RCC advances the timestamp of the writing SM core instantly. Although instant timestamp advancement can potentially invalidate other cachelines in the same SM core, GPU architectures are well known to tolerate the resulting memory load latency. By enforcing SC without write stalls, RCC avoids the complexity of fence insertion while providing high performance to programmers.

Second, this dissertation proposes GETM, a novel GPU Hardware Transactional Memory (HTM) system, which reduces the excessive latency of prior value-based lazy conflict detection [76, 77] with a logical-timestamp-based eager mechanism. GETM detects conflicts by comparing the timestamps of transactions and the data they access. If a transaction's timestamp is smaller than a data timestamp set by another transaction, the current transaction conflicts with an already-executed one. When a conflict is detected, GETM eagerly aborts the current transaction, advances its timestamp, and restarts it later. With eager conflict detection, transactions that have reached the commit point are guaranteed to be conflict-free, so their results can be committed to memory without additional validation. Even though eager conflict detection can slightly increase the transaction abort rate, the dramatically faster aborts and commits of individual transactions translate into substantial performance improvement.

Third, this dissertation proposes HMG, a hierarchical cache coherence protocol for multi-GPU systems.
Similar to GPU-VI [221], HMG is a simple two-state hardware cache coherence protocol, but it adds a coherence hierarchy to exploit intra-GPU data locality and reduce the bandwidth overhead on inter-GPU links. HMG has also been deeply optimized by leveraging the non-multi-copy-atomicity of scoped memory models – non-synchronization stores are processed instantly without waiting for invalidation acknowledgments, and only synchronization stores are stalled to enforce correct data visibility. Since most store requests incur no stalls and the GPU architecture is latency-tolerant, HMG can eschew the transient coherence states and extra hardware complexity that are necessary in latency-critical CPUs. With efficient cache coherence protocol support, applications with fine-grained synchronization [91, 118, 258] can be accelerated substantially in multi-GPU systems.

Finally, this dissertation proposes CHOPIN, a scalable SFR technique. Considering the various properties of the draw commands in a single frame, CHOPIN first divides them into multiple groups. Each draw command is distributed to a specific GPU, so no redundant computation exists in CHOPIN. At group boundaries, the sub-images generated in each GPU are composed in parallel. Sub-images of opaque objects are composed out-of-order by retaining the pixels that are closer to the camera. Although sub-image composition of transparent objects needs to respect the depth order, CHOPIN leverages the associativity of pixel blending [33] to maximize parallelism – neighbouring sub-images are composed with each other as soon as they are ready. Therefore, CHOPIN eliminates the sequential inter-GPU communication overhead. A draw command scheduler and an image composition scheduler are designed to address the problems of load imbalance and network congestion. By leveraging parallel image composition, CHOPIN is more scalable than prior SFR solutions [20, 114, 166].

1.4 Contributions

This dissertation makes the following contributions:

1.
It traces the cost of Sequential Consistency (SC) enforcement in realistic GPUs to the need to acquire write permissions, proposes Relativistic Cache Coherence (RCC), which improves store performance by enforcing SC in logical time, and demonstrates that RCC is faster than the best prior GPU SC proposal by 29% and within 7% of the performance of the best non-SC design.

2. It traces the inefficiency of prior GPU Hardware Transactional Memory (HTM) proposals to the unamortized latencies of value-based lazy conflict detection, proposes GETM, the first GPU HTM to detect conflicts with a logical-timestamp-based eager mechanism, and demonstrates that GETM is up to 2.1× (1.2× gmean) faster than the best prior proposal.

3. It identifies the necessity of a coherence hierarchy for performance scaling in multi-GPU systems, proposes a hierarchical cache coherence protocol called HMG for multi-GPUs, eliminates the complexity of transient coherence states and invalidation acknowledgments by leveraging the non-multi-copy-atomicity of scoped memory models, and demonstrates that HMG can achieve 97% of the performance of an idealized caching system.

4.
It traces the main performance cost of existing Split Frame Rendering (SFR) mechanisms to redundant computation and sequential inter-GPU communication requirements, proposes a novel SFR technique called CHOPIN that takes advantage of parallel image composition to remove the overheads of prior solutions, develops a draw command scheduler and an image composition scheduler to address the problems of load imbalance and network congestion, and demonstrates that CHOPIN outperforms the best prior SFR proposal by up to 56% (25% gmean) in an 8-GPU system.

1.5 Organization

The rest of this dissertation is organized as follows:

• Chapter 2 gives the background on GPU architecture, synchronization, memory consistency models, cache coherence protocols, and graphics processing.

• Chapter 3 proposes Relativistic Cache Coherence (RCC), a logical-timestamp-based cache coherence protocol which can efficiently enforce Sequential Consistency (SC) in GPUs by reducing stalls for write-permission acquisition.

• Chapter 4 presents GETM, the first GPU Hardware Transactional Memory (HTM) which reduces the excessive latency of conflict detection by eagerly checking conflicts when the initial memory request is made.

• Chapter 5 proposes HMG, a hardware-managed cache coherence protocol designed for forward-looking hierarchical multi-GPU systems that enforce scoped memory consistency models.

• Chapter 6 describes CHOPIN, a novel Split Frame Rendering (SFR) scheme which eliminates the performance overheads of redundant computation and sequential primitive exchange that exist in prior solutions by leveraging parallel image composition.

• Chapter 7 discusses related work.

• Chapter 8 concludes the dissertation and discusses directions for potential future work.

Chapter 2

Background

This chapter reviews the necessary background material for the rest of this dissertation. Chapter 2.1 presents a high-level view of contemporary GPU architecture; it also introduces recently explored multi-GPU systems.
Chapter 2.2 summarizes the background on program synchronization, including lock mechanisms and transactional memory. Chapters 2.3 and 2.4 briefly explain a set of concepts in memory consistency models and cache coherence protocols. Finally, Chapter 2.5 describes the 3D graphics pipeline and the corresponding architectural support.

2.1 GPU Architectures

2.1.1 High-level Architecture and Programming Model

[Figure 2.1: High-level GPU Architecture as seen by programmers [31].]

[Figure 2.2: GPU programming model.]

The high-level architecture of a GPU is shown in Fig. 2.1. A GPU application starts on the CPU; the operations to be executed by the GPU are packaged in kernel functions. Every time a kernel function is called, the CPU launches it onto the GPU through a compute acceleration API, such as CUDA [177] or OpenCL [112]. Each kernel function is composed of many threads which perform the same operation on different data in parallel; this programming model is called Single-Instruction, Multiple-Thread (SIMT). The thread hierarchy (Fig. 2.2) organizes threads into thread blocks (cooperative thread arrays, CTAs) in NVIDIA GPUs or workgroups in AMD GPUs, each of which is dispatched to one of the Streaming Multiprocessor (SM) cores in the GPU architecture. Threads are scheduled in batches of 32–64 threads (called warps in NVIDIA GPUs and wavefronts in AMD GPUs) in each SM core. A SIMT stack is used to handle branch divergence among threads in the same batch [127]. Threads within a thread block can communicate via an on-chip scratchpad memory called the shared memory in CUDA (or local memory in OpenCL), and can synchronize via hardware barriers.
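The thread-block abstraction described above can be mimicked in plain Python (a conceptual sketch only, not GPU code: `threading.Thread` stands in for a GPU thread, a list for the shared scratchpad memory, and `threading.Barrier` for the hardware barrier behind CUDA's `__syncthreads()`; all names here are illustrative):

```python
import threading

def run_block(num_threads=8):
    """One 'thread block': each thread fills its own slot of a shared
    (scratchpad-like) array, synchronizes at a barrier, then safely
    reads a neighbour's slot."""
    shared = [0] * num_threads          # stands in for shared memory
    result = [0] * num_threads
    barrier = threading.Barrier(num_threads)

    def thread_fn(tid):
        shared[tid] = tid * tid                        # produce
        barrier.wait()                                 # __syncthreads() analogue
        result[tid] = shared[(tid + 1) % num_threads]  # consume neighbour's value

    threads = [threading.Thread(target=thread_fn, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

Without the barrier, a thread could read its neighbour's slot before the neighbour has written it; the hardware barrier gives the threads of a block the same all-arrived guarantee.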
The SM cores access a distributed, shared last-level cache and off-chip DRAM via an on-chip interconnection network. From a programming perspective, GPU memory is divided into several spaces, each with its own semantics and performance characteristics: for example, data which is shared by the threads of a thread block is stored in the shared memory, while data which is shared by all thread blocks is stored in the global memory.

The GPU launch unit automatically dispatches as many thread blocks as the GPU on-chip resources can handle in parallel. If the number of thread blocks is larger than what the GPU can support, the remaining threads are launched when previous thread blocks have finished and released sufficient resources. This allows GPU applications to generate as many threads as necessary without introducing significant overhead.

The GPU was originally built as an accelerator for graphics processing, so it contains many architectural components specific to rendering. We introduce them in Chapter 2.5.

2.1.2 Hierarchical Multi-Module and Multi-GPU Systems

Because the end of Moore's Law is limiting continued transistor density improvement, scaling GPU performance by simply integrating more resources into a monolithic chip has gradually reached its limit. Future GPU architectures will consist of a hierarchy in which each GPU is split into multiple GPU Modules (GPMs) [62]. Recent work has demonstrated the benefits of creating GPMs in the form of single-package multi-chip modules (MCM-GPU) [29]. Researchers have also explored the possibility of presenting a large hierarchical multi-GPU system to users as if it were a single larger GPU [143], but mainstream GPU platforms today largely just expose the hierarchy directly to users so that they can manually optimize data and thread placement and migration decisions.

The constrained bandwidth of the inter-GPM/GPU links is the main performance bottleneck in hierarchical GPU systems.
To mitigate this, both the MCM-GPU and multi-GPU proposals schedule adjacent CTAs to the same GPM/GPU to exploit inter-CTA data locality [29, 143]. These proposals map each memory page to the first GPM/GPU that touches it to increase the likelihood of capturing data locality. They also extended a conventional GPU software coherence protocol to the L2 caches, and they showed that it worked well for traditional bulk-synchronous programs.

More recently, CARVE [253] proposed allocating a fraction of local GPU DRAM as a cache for remote GPU data, and enforced coherence using a simple protocol filtered by tracking whether data was private, read-only, or read-write shared. However, as CARVE does not track sharers, it broadcasts invalidation messages to all caches for read-write shared data.

Apart from the above work, which focused on general-purpose applications, researchers have also explored multi-GPU systems for graphics processing. GPUpd [114] reduced redundant geometry processing in conventional Split Frame Rendering (SFR) by exchanging primitives through high-speed inter-GPU links, such as NVIDIA's NVLink [160] and AMD's XGMI [1]. Xie et al. proposed the OO-VR framework to accelerate VR by improving data locality [246].

2.2 Synchronization

The GPU is a computing platform for highly parallel applications. In multithreaded programs, synchronization is necessary to protect shared data accesses from data-races, thereby guaranteeing correct data communication among threads. Generally, synchronization orders are specified by programmers with different primitives, such as barriers and locks. This section first introduces the widely used locks. Then, an alternative, Transactional Memory (TM), is described to address the potential problems of locks in GPU programming.

2.2.1 Locks

A lock is a software object that creates mutual exclusion (i.e., a critical section) for shared data in memory. If a thread wants to access the shared data, it has to acquire the associated lock first.
At any time instant, a lock can be acquired by at most one thread, so the thread holding the lock has exclusive access to the shared data. The lock should be released after the operations on the shared data are finished, so that other threads can acquire the lock if needed. Multiple locks with cyclic dependences can potentially create deadlocks. Assume that threads T1 and T2 each need to acquire both locks L1 and L2 to make progress, but each thread holds one of the two locks and will not release it until it has acquired both; in this case, threads T1 and T2 will be blocked indefinitely. One way to avoid this deadlock is to maintain a global order of lock acquisition.

To simplify programming, we can associate all shared data with one or very few locks – known as coarse-grained locking. Even though this approach can reduce programming complexity and largely avoid deadlocks, it can serialize most of the execution and significantly hurt performance. Dividing shared data into multiple smaller critical sections – fine-grained locking – can maximize parallelism, but a larger number of locks complicates programming and increases the risk of deadlocks. With locks, programmers usually need to make the conservative assumption that different threads will interfere during execution, even though they may not actually interfere at runtime. This can throttle available parallelism and reduce performance. Hence, the difficulty of using locks for both correct synchronization and high performance restricts them to experienced programmers.

2.2.2 Transactional Memory

The massive number of threads in GPU programs can make the above problems with locks even more challenging. Transactional Memory (TM) can mitigate these problems by decoupling the functional behavior of code (i.e., the operation performed by each transaction) from its performance (i.e., how the transactions are executed).
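As a concrete illustration of the global lock-acquisition order discussed in Chapter 2.2.1 (an illustrative Python sketch only, with `threading.Lock` standing in for a GPU lock; the names are hypothetical):

```python
import threading

lock_L1 = threading.Lock()
lock_L2 = threading.Lock()
counter = {"a": 0, "b": 0}

def worker(iters):
    """Both threads acquire the locks in the same global order
    (L1 before L2), so neither can ever hold L2 while waiting for L1."""
    for _ in range(iters):
        with lock_L1:        # always acquired first
            with lock_L2:    # always acquired second
                counter["a"] += 1
                counter["b"] += 1

def run(iters=1000):
    t1 = threading.Thread(target=worker, args=(iters,))
    t2 = threading.Thread(target=worker, args=(iters,))
    t1.start(); t2.start()
    t1.join(); t2.join()
    return counter["a"], counter["b"]
```

Because both threads respect the same order, the cyclic dependence that causes deadlock cannot form – at the cost of exactly the conservative serialization that TM aims to avoid.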
In TM systems, programmers replace each lock-based critical section with a transaction, a code fragment which is performed atomically (i.e., either all operations in a transaction complete successfully or none of them appear to start executing). Conflicts (data-races) are detected automatically in a deadlock-free manner by underlying software or hardware mechanisms. TM also keeps restarting aborted transactions until they finish without any conflicts.

TM designs can be categorized along two axes: conflict detection and version management. In eager conflict detection (e.g., LogTM [149, 252]), an inconsistent read or update attempt by a transaction is detected when the access is made, and one of the conflicting transactions is aborted. Lazy conflict detection (e.g., TCC [89]) defers this until later: often, the entire transaction log is validated during the commit process, and conflicts are discovered only then. In principle, the lazy technique can make better conflict resolution decisions because the entire transaction is known, but it has longer commit/abort latencies because the entire transaction must be verified atomically. Typically, eager conflict detection leverages an existing CPU coherence protocol. However, GPUs lack an efficient coherence protocol that can be leveraged for eager conflict detection [196].

Version management can also be eager or lazy. Lazily-versioned TMs (e.g., TCC [89]) add transactional accesses to a redo log, which is only written to memory when the transaction has been validated and commits; if the transaction aborts, the redo log is discarded. In eager version management (e.g., LogTM [149, 252]), the transaction writes the new value directly to the memory hierarchy, but keeps the old value in an undo log; if a transaction aborts, the undo log is written back to memory.

2.3 Memory Consistency Model

A memory consistency model defines which sequences of values may be legally returned by the sequence of load operations in each program thread.
For example, the following code snippet from [155, 224] represents a common synchronization pattern found in many workloads:

    core C0                        core C1
    data = new                     while (!done) { }  // wait for new data value
    // weakly ordered models       ... use new data ...
    // need a memory fence here
    done = true

The question is, should core C1 be allowed to see done=true even if data=old? This is clearly not the intended behaviour, since C1 could see a stale copy of data; nevertheless, it is allowed by many commercial CPUs and all extant GPUs [16].

Sequential Consistency (SC) [123] most closely corresponds to most programmers' intuition: it requires that (a) memory operations appear to execute and complete in program order, and (b) all threads observe stores in the same global sequence. In SC, an execution where done=true while data=old is illegal, because it could only arise if (a) the writes to data and done were executed out of order by core C0, or (b) they were executed in one order by C0 but observed in a different order by C1.

Weak consistency models, on the other hand, allow near-unrestricted reordering of loads and stores in the program, provided that data dependencies are respected; such reordering typically occurs during compilation and during execution in the processor. Special memory fence instructions must be used to restrict reordering and restore sequentially consistent behaviour: in the example above, a fence is needed to ensure that the store to data completes before the store to done. However, missing fences can be very difficult to find in a massively multithreaded setting like a GPU; conversely, adding too many fences compromises performance.

Since compilers can reorder or elide memory references (e.g., via register allocation), a programming language must also define a memory model. Due to the range of consistency models present in extant CPUs, languages like Java [137] or C++ [38] guarantee SC semantics only for programs that are Data-Race-Free (DRF) (i.e., properly synchronized and fenced); this is known as DRF-0 [11].
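The message-passing pattern shown earlier in this section can be written data-race-free as follows (a conceptual Python sketch only; `threading.Event` stands in for a properly fenced `done` flag, since its set/wait pair provides the release/acquire ordering that the fence would otherwise have to enforce):

```python
import threading

def message_passing():
    """C0 produces data and then signals done; C1 waits on done before
    reading data. Because the Event's set/wait pair orders the accesses,
    C1 can never observe done=true together with a stale data value."""
    box = {"data": None}
    done = threading.Event()

    def core_c0():
        box["data"] = "new"    # data = new
        done.set()             # done = true (release)

    def core_c1(out):
        done.wait()            # while (!done) { }  (acquire)
        out.append(box["data"])

    observed = []
    t1 = threading.Thread(target=core_c1, args=(observed,))
    t0 = threading.Thread(target=core_c0)
    t1.start(); t0.start()
    t0.join(); t1.join()
    return observed[0]
```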
The Heterogeneous-Race-Free (HRF) model recently proposed for hybrid CPU/GPU architectures further constrains DRF-0 by requiring proper scoping [98]. NVIDIA has also formalized their PTX memory model with scope annotations [135].

2.4 Cache Coherence Protocol

In systems with private caches, a cache coherence protocol ensures that writes to a single location are ordered and become visible in the same order to all cores [45]; the aim is to make caches logically transparent. Since caches are ubiquitous, providing coherence is a fundamental part of implementing any memory consistency model.

There are two critical invariants in the definition of a coherence protocol: the Single Writer, Multiple Reader (SWMR) invariant and the Data-Value invariant [155, 224]. For any given memory location, at any given moment in time¹, SWMR requires that there is either only a single core that may write it (and that may also read it) or some number of cores that may read it. In addition to SWMR, the Data-Value invariant requires that the value of a given memory location is propagated correctly: the value of a memory location at the start of an epoch must be the same as the value of the memory location at the end of its last read-write epoch.

¹ The SWMR invariant only needs to be enforced in logical time, not physical time. This subtle issue enables many optimizations that appear to – but do not – violate this invariant. The proposed Relativistic Cache Coherence (RCC) in Chapter 3 leverages this insight and enforces SC in logical time.

The vast majority of coherence protocols, called "invalidation protocols", are designed explicitly to maintain these invariants. A core cannot overwrite a location until all other sharers have been invalidated. If a core wants to read a location, it has to guarantee that no other core has cached that location in a read-write state. Typical examples are the MESI-like coherence protocols, which have been widely used in CPUs.

2.5 Graphics Processing

2.5.1 The 3D Rendering Pipeline

[Figure 2.3: (a) A simple illustration of the graphics pipeline. (b) General operations of a graphics pipeline. (c) GPU microarchitecture (shaded components are specific to graphics rendering).]

The function of the 3D graphics pipeline, illustrated in Figure 2.3(a), is to project a 3D scene, full of objects that often consist of thousands of primitives (usually triangles), onto a 2D screen space. On the screen, primitives end up as thousands of pixels; the pixels are accumulated in a Framebuffer (FB) and sent to the display unit once rendering is complete. Producing a single frame typically involves hundreds or thousands of draw commands to render all objects in the scene, all of which must go through the graphics pipeline.

Figure 2.3(b) shows the graphics pipeline defined by DirectX [39]; other pipelines (e.g., OpenGL [207]) are similar. The key pipeline stages are geometry processing, rasterization, and fragment processing. Geometry processing ❶ first reads vertex attributes (e.g., 3D coordinates) from memory and projects them onto the 2D screen space using a vertex shader. Projected vertices are grouped into primitives (typically triangles); some primitives may then be split into multiple triangles through tessellation, which creates smoother object surfaces for a higher level of visual detail.
Generated primitives that are outside of the current viewport are then culled by geometry-related shaders. The next stage, called rasterization ❷, converts primitives into fragments, which will in turn be composed to produce pixels on the screen. Fragment processing ❸ has two main tasks: performing the depth test (Z test) and computing fragment attributes (e.g., colour). The depth test discards fragments whose depth value (i.e., the distance from the camera) is greater than that of previously processed fragments. Fragments that pass the depth test have their attributes computed using a pixel shader. Finally, the shaded fragments (which may be opaque or semi-transparent) are composed to generate the pixels, which are written to the FB and eventually sent to the display unit.

2.5.2 The Graphics GPU Architecture

Figure 2.3(c) illustrates the overall microarchitecture of NVIDIA's Turing GPU [164]. It consists of multiple Graphics Processing Clusters (GPCs) connected to multiple Render Output Units (ROPs) and L2 cache slices via a high-throughput crossbar. A Giga Thread Engine distributes the rendering workload to the GPCs based on resource availability.

GPCs perform rasterization in dedicated Raster Engines, and organize resources into multiple Texture Processing Clusters (TPCs). Within each TPC, the PolyMorph Engine (PME) performs most non-programmable operations of the graphics pipeline except rasterization (e.g., vertex fetching, tessellation, etc.). The Streaming Multiprocessor (SM), consisting of hundreds of shader cores, executes the programmable parts of the rendering pipeline, such as the vertex shaders, pixel shaders, etc.; to reduce hardware complexity, SMs schedule and execute threads in SIMD fashion (i.e., warps of 32 or 64 threads). The Texture Unit (TEX) is a hardware component that performs sampling.
Sampling is the process of computing a color from an image texture and texture coordinates.

On the other side of the interconnect, ROPs perform fragment-granularity tasks such as the depth test, anti-aliasing, pixel compression, pixel blending, and pixel output to the FB. A shared L2 cache, accessed through the crossbar, buffers the data read from off-chip DRAM.

Chapter 3

Efficient Sequential Consistency via Relativistic Cache Coherence

This chapter proposes Relativistic Cache Coherence (RCC), which can efficiently enforce Sequential Consistency (SC) in GPUs. In contrast with prior GPU SC work [220], RCC does not explicitly classify read-only/private data: instead, a predictor naturally learns to assign short cache lifetimes to frequently written shared data. Unlike prior GPU coherence work [221], RCC operates in logical time; as a result, stores acquire write permissions instantly but still maintain SC. RCC underpins a sequentially coherent GPU memory system that outperforms all previous proposals and closes the gap between SC and weak consistency in GPUs. RCC is 29% faster than the best prior SC proposal for GPUs, and within 7% of the best non-SC design.

Sequential consistency — the most intuitive model — requires that (a) all memory accesses appear to execute in program order and (b) all threads observe writes in the same sequence [123]. To ensure in-order load/store execution, a thread must delay issuing some memory operations until preceding writes complete; we refer to these delays as SC stalls. Moreover, since all cores must observe writes in the same order, stores cannot complete until they are guaranteed to be visible to all other threads and cores. Because of these restrictions, few modern commercial CPUs have supported SC [251]; typically SC is relaxed to permit limited [181, 218] or near-arbitrary reordering [26, 104, 205, 230]; programmers must then insert memory fences for specific memory operations, in essence manually reintroducing SC stalls.
GPU manufacturers have followed suit: both NVIDIA and AMD GPUs exhibit weak consistency [16] similar to the Weak Ordering (WO) [68] and Release Consistency (RC) [78] models.

Correctly inserting fences is difficult, however, especially in GPUs, where all practical programs are concurrent and performance-sensitive. The authors of [16] found missing fences in a variety of peer-reviewed publications, and even in vendor guides [204]. Such bugs are very difficult to detect: some occurred in as few as 4 out of 100,000 executions on real hardware, and most occurred in fewer than 1% of executions [16]. Code fenced properly for a specific GPU may not even work correctly on other GPUs from the same vendor: some of these bugs were observable in Fermi and Kepler but not in older or newer microarchitectures [16].

SC hardware is desirable, then, if it can be implemented without significant performance loss. Recent work [94, 220] has argued that this is possible in GPUs: unlike CPUs, which lack enough Instruction Level Parallelism (ILP) to cover the additional latency of SC stalls, GPUs can leverage abundant Thread Level Parallelism (TLP) to cover most SC stalls. The authors of [220] propose reducing the frequency of the remaining SC stalls by relaxing SC for read-only and private data; classifying these at runtime, however, requires complex changes to the GPU core microarchitecture and carries an area overhead in devices where silicon is already at a premium. Moreover, both studies focused on SC built using CPU coherence protocols (MOESI and MESI) with write-back L1 caches. In GPUs, however, write-through L1s perform better [221]: GPU L1 caches have very little space per thread, so a write-back policy brings infrequently written data into the L1 only to write it back soon afterwards.
Commercial GPUs have write-through L1s and require bypassing/flushing L1 caches to ensure intra-GPU coherence [21, 165, 167].1 Compared to the best GPU relaxed consistency design, the performance cost of implementing SC appears to be closer to 30% [221].

In the rest of this chapter, we describe RCC and demonstrate how it achieves both easy programming and high performance by efficiently enforcing SC with logical timestamps.

1 GPU vendor literature and some prior work use "coherence" to describe automatic page-granularity data transfer between the host CPU and the GPU's shared L2; some academic proposals use "system coherence" for the same concept [188]. To the best of our knowledge, no existing GPU product implements hardware-level intra-GPU coherence.

3.1 GPUs vs. CPUs: A Consistency and Coherence Perspective

Consistency. Modern multi-core CPUs have largely settled on weak memory models to enable reordering in-flight memory operations [104, 181, 205, 218, 230]: because CPUs support at most a few hardware threads, the Memory Level Parallelism (MLP) obtained from reordering memory operations is key to performance. GPUs, on the other hand, buffer many tens of warps (e.g., 48–64 [21, 165, 167]) of 32–64 threads in each GPU Streaming Multiprocessor (SM) core, and when one warp is stalled (because of an L1 cache miss, for example), the core simply executes another.

With fine-grained multithreading, GPUs can amortize hundreds of cycles of latency without reordering memory operations; recent work [94, 220] has suggested that the same mechanism can cover the ordering stalls required by SC. Indeed, hardware techniques that reorder accesses — such as store buffers — are either too expensive or ineffective in GPUs, so leaving them out does not hurt performance [220].

Coherence. CPU caches are generally kept coherent by tracking each block's sharers and invalidating all copies before writing the block.
Most protocols in commercial products are quite similar: they have slightly different states (MESI, MESIF, MOESI, etc.) or sharer tracking methods, but the basic operation relies on request-reply communication between cores and an ordering point such as a directory.

All commercial GPUs we are aware of lack automatic coherence among private L1 caches: in GPU vendor literature, "coherence" refers only to the boundary between the host CPU and the GPU. NVIDIA Pascal allows the GPU to initiate page faults and synchronize GPU and CPU memory spaces [92], but intra-GPU coherence requires bypassing the L1 caches [16]. AMD Kaveri APUs bypass and flush the L1 cache for intra-GPU coherence, and bypass the L2 for CPU-GPU sharing [222]. Details for ARM Mali GPUs are scant, but it appears that the coherence boundary terminates at the GPU shared L2 cache and does not include the L1s [198].

Efficient intra-GPU coherence implementations are subject to different constraints than CPUs. GPUs have 15, 32, or even 56 SM cores [21, 92, 165, 167], simultaneously executing around 100,000 threads. While some prior studies [94, 220] (and our motivation study) have assumed CPU-like MESI coherence, a realistic implementation could face simultaneous coherence requests from tens of thousands of threads; just the buffering requirements would be prohibitive [221].

The other coherence protocol work for GPUs leveraged two observations: (a) that write-through caches provide a natural ordering point at the L2, and (b) that inter-core synchronization can be implicit via a shared on-chip clock [221]. A cache that requests read permissions receives a read-only copy with a limited-time lease; this copy may be read until the shared clock has ticked past the lease time.
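This lease mechanism can be sketched in a few lines (a toy model with hypothetical names; the shared on-chip clock is modeled as an integer passed to each operation):

```python
class LeasedL1:
    """Toy model of time-based (library) coherence: an L1 copy is
    valid only until the shared clock passes its lease expiration,
    after which it self-invalidates with no coherence traffic."""
    def __init__(self, lease=10):
        self.lease = lease
        self.copies = {}          # addr -> lease expiration time

    def fill(self, addr, clock):
        # the L2 grants a fixed-duration read lease with the data reply
        self.copies[addr] = clock + self.lease

    def read_hit(self, addr, clock):
        # a copy is usable only if present and its lease has not expired
        return addr in self.copies and clock <= self.copies[addr]

l1 = LeasedL1(lease=10)
l1.fill(0x40, clock=100)          # lease valid through shared time 110
print(l1.read_hit(0x40, clock=105))  # True: still within the lease
print(l1.read_hit(0x40, clock=111))  # False: expired, self-invalidated
```

The point of the scheme is visible in `read_hit`: no invalidation message ever reaches the L1; expiration is decided locally by comparing against the shared clock.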
Two protocols are proposed in [221]: TC-strong (TCS) can support SC if the core does not reorder accesses, but stalls stores at the L2 to ensure that all leases for the address have expired; TC-weak (TCW) allows stores to proceed without stalling, but compromises write atomicity and cannot support SC.

3.2 Bottlenecks of Enforcing Sequential Consistency

To trace the roots of performance loss created by enforcing SC, we evaluated an SC implementation similar to prior work [94, 220] but with GPU-style write-through L1 caches (see Chapter 3.5 for simulation setup). We examined memory-intensive workloads with and without inter-workgroup sharing previously used to evaluate GPU cache coherence [221]; the inter-workgroup benchmarks rely on inter-core coherence traffic, while the intra-workgroup benchmarks communicate only within each GPU core. We found SC stalls to be relatively infrequent (Figure 3.1a): in only one case were more than 20% of memory operations ever stalled because of SC; this supports prior arguments [94] that the massive parallelism available in GPUs can cover most ordering stalls introduced by SC.

We next examined the cause of each stall — i.e., the type of the preceding memory operation from the same thread. Figure 3.1b shows that most SC stall cycles are spent waiting for a previous store (or atomic) instruction to complete; indeed, in most cases, nearly all stall delays are due to waiting for prior writes. This is because average store latencies are very long: for workloads with inter-workgroup
communication, store latencies are often much longer than load latencies (2.4× gmean), and up to 3.7× longer (Figure 3.1c).

Figure 3.1: SC stalls are (a) infrequent, but (b) mostly due to preceding stores; (c) average store latencies are much longer than load latencies; (d) zero invalidate latency leads to substantial speedup for inter-workgroup sharing workloads.

This makes sense: to maintain SC, each store must receive an ack before completing to ensure that the new value has become visible to all cores. There are two parts to this latency: one — the round-trip to L2 — is unavoidable with the write-through L1 caches found in GPUs. The other part is ensuring exclusive coherence permissions: in our MESI-based experiment the write waits until other sharers have invalidated their copies, while in timestamp-based GPU coherence protocols like TC-strong [221] the store waits for all read leases to expire. Long-latency stores can affect performance not only by delaying SC stall resolution, but also by occupying buffer space or stalling same-cacheline stores from other threads in MSHRs until the ack is received.

To find out whether coherence delays are significant, we implemented an idealized variant of SC where acquiring read and write permissions is instant (SC-ideal). Figure 3.1d shows the speedup of SC-ideal over realistic SC: for workloads with inter-workgroup sharing, idealizing coherence yields a substantial performance improvement (1.6× gmean); workloads with only intra-workgroup sharing see no benefit. In the next section, we address the store latency and SC stall problems by maintaining SC in logical time.

Figure 3.2: High-level view of enforcing SC in logical time.
Logical time increases left to right; all cores that observe the new value of A must advance their logical times past that of the store.

3.3 Enforcing Sequential Consistency in Logical Time

To address the problems identified above, we leverage Lamport's observation that ordering constraints need to be maintained only in logical time [122], prior observations that SC can be maintained logically [84, 131], and the recent insight that logical timestamps can be used directly to implement a coherence protocol [255]. We propose Relativistic Cache Coherence (RCC), a simple, two-state GPU coherence protocol where each core maintains — and independently advances — its own logical time. The L2 keeps track of the last logical write time for each cache block; whenever a core accesses the L2, it must ensure that its own logical time exceeds the last write time of the relevant block. Data may be cached in L1s for a limited (logical) time, after which the block self-invalidates.

Figure 3.2 shows how RCC maintains SC in logical time. First, core 0 loads address A, and receives a fixed-time lease for A from the L2, which records the lease duration; core 0 may then read its L1 copy until its logical time exceeds the lease expiration time. Core 1 writes to A, but to do this it must advance its own logical time past the lease given out for A; this step (dashed line) is equivalent to establishing write permissions in other protocols, but occurs instantly in RCC. Core 2 loads A from L2 and advances its logical time past the time of core 1's write. Finally, core 3 also reads A. The load is logically before the store to A (because core 3's logical clock is earlier than A's), but physically the write to A has already happened, and only the new value of A is available at the L2. Core 3 thus receives
the new value of A, but must also advance its logical time to that of A's write.

Naturally, the cost of synchronization does not entirely disappear: advancing a core's logical time may cause other L1 cache blocks to expire. In essence, we are exchanging a reduction in store latency for A for potentially some additional L1 misses on other addresses. While this would be problematic for latency-sensitive CPUs, throughput-focused GPUs were explicitly designed to amortize this kind of cost; we will show that in GPUs this tradeoff is worth making.

Lamport's logical time has recently been proposed as a coherence mechanism for CPUs [255, 257]. Performance, however, was subpar even compared to the much simpler MSI protocol, even though the proposed protocol was more complex than RCC and relied on a complex speculation-and-rollback mechanism. RCC is not only much simpler, but actually outperforms the best existing GPU protocols.

Next, we describe Relativistic Cache Coherence, a new GPU coherence protocol that supports SC (like TCS) but allows stores to execute without waiting for write permissions (like TCW). Table 3.1 compares RCC with prior protocols proposed for GPUs in the context of SC.

                                  MESI            TCS               TCW              RCC
SC support?                       yes             yes               no               yes
stall-free store permissions?     no (invalidate  no (wait until    yes (but stall   yes
                                  sharers)        lease expires)    for fences)

Table 3.1: SC and coherence protocol proposals for GPUs.

3.4 Relativistic Cache Coherence (RCC)

Relativistic Cache Coherence leverages the observation by Lamport [122] that consistency need only be maintained in logical time.
Two threads may see the memory as it was at two different logical times, as long as each only observes all writes logically before — and never sees any writes logically after — its own logical "now." In RCC, cores maintain separate logical times, which become synchronized only when read-write data is shared.

Like all library coherence protocols [119, 211, 221, 255, 257], RCC allows L1 caches to keep private copies of data only for limited-time "leases" granted for each requested block; when a lease expires, the block self-invalidates in L1 without the need for any coherence traffic. Writes to a block must ensure that no valid copies are present in any L1s by ensuring that the write time exceeds the expiration time of all outstanding leases. In RCC, leases are granted and maintained in logical time, so writes can complete instantly by advancing the writing core's logical clock.

3.4.1 Logical Clocks, Versions, and Leases

In relativistic coherence, each core maintains, and independently advances, its own logical clock (now). Similarly, each shared cache (L2) block maintains its own logical version (ver), equal to the logical time of the last write to this block.

Since the L2 grants per-block read leases to private L1 caches, it keeps track of when the last lease for a given block will expire (exp). Each L1 cache also keeps track of the exp it was given by the L2. Different L1s may have different exps for the same block, but none will exceed the latest exp in L2. Because L1s are write-through, they do not need to record ver for each block.

A unique, global SC ordering of memory accesses is maintained in logical time by applying three rules:

1. Core C reading cache block B must advance its logical time now to match B's current version ver if B.ver > C.now. This ensures that C cannot use B to compute new data values with logical times < B.ver, i.e., that C does not observe a value of B "from the future."

2.
Core C writing cache block B must advance B's ver to C's now if B.ver < C.now, and advance its own now to B's ver if B.ver > C.now. This ensures that the new value of B cannot be used for computation in cores whose now is earlier, i.e., that B is not "sent back in time."

3. Core C writing cache block B must advance its now as well as the new B.ver beyond the expiration time exp of the last outstanding lease for B. This ensures that the new value of B does not "leak:" i.e., that any values computed from the new value of B by other cores cannot coexist in their L1s with the old value of B.

Figure 3.3: RCC executing accesses to two addresses (A and B) from two cores (C0 and C1). The table (left) tracks each core's logical time (now), and each cache block's version (ver) and read lease expiration (exp) after each instruction has executed; the rows represent the order of instructions as executed in physical time. The diagram (right) illustrates the lease durations in each cache (top) and how the logical time now advances in each core as the corresponding operations from the table execute (bottom); logical time flows left to right while physical time flows top to bottom. Bold values denote changes since the last step; crossed-out leases have expired.

The logical now times of memory operations provide a sequentially consistent ordering. Provided the core scheduler is modified to ensure that only one global memory access per warp is issued at any given time, RCC supports SC.2

3.4.2 Example Walkthrough

Figure 3.3 shows how RCC operates on a sequence of instructions from two different cores. Initially, C0's cache has neither A nor B (since now > exp) and core C1 has both.
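Before stepping through the example, the three rules above can be condensed into a short executable sketch (a simplified model, assuming a fixed lease of 10 logical ticks; the Core/Block names and load/store helpers are illustrative, not the thesis's implementation):

```python
LEASE = 10  # assumed fixed logical lease duration

class Core:
    def __init__(self):
        self.now = 0   # per-core logical clock

class Block:
    def __init__(self):
        self.ver = 0   # logical time of the last write (kept at L2)
        self.exp = 0   # expiration of the last outstanding read lease

def load(core, block):
    # Rule 1: a reader may not observe a value "from the future"
    core.now = max(core.now, block.ver)
    # the L2 grants a read lease alongside the data reply
    block.exp = max(block.exp, core.now + LEASE)

def store(core, block):
    # Rules 2 and 3: the write is ordered after the block's last
    # version AND strictly after every outstanding lease; both the
    # core's now and the block's ver advance instantly, with no
    # invalidation round-trips
    t = max(core.now, block.ver, block.exp + 1)
    core.now = block.ver = t

c0, c1 = Core(), Core()
a = Block()
store(c0, a)            # c0 writes A; write permission is instant
load(c1, a)             # c1 reading A must advance past the write time
print(c1.now >= a.ver)  # True: the SC ordering holds in logical time
```

Note that `store` never waits: establishing "write permission" is just the `max` computation over timestamps, which is the property that distinguishes RCC from TCS in Table 3.1.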
In the shared L2 cache, B has since been written by a third core and has ver = 30; because C1's now has not advanced past 10, however, it may still read its cached copy of B.

First, core C0 writes A, which updates A.ver in the L2 (rule 2); C1 still has now = 0 and can read its old copy of A. C0 then reads B, which receives a new lease (until logical time 40) but must advance its now past B.ver (rule 1).

Next, C1 writes B, which updates B.ver and C1.now to 41, past the last outstanding lease for B (rule 3). This step enforces SC ordering between the two cores: C1 next reads A, and is forced to pick up the value written by C0.

Finally, C0 writes B, advancing its now past the previous write to B (rule 2), and then A, advancing past the last lease for A (rule 3). Because C1.now is earlier, however, C1's next load will happen logically before C0's write to A, and will not observe the new value. Note that SC has been maintained, as the overall behaviour is explained by the following sequential interleaving: C0: ST A, LD B; C1: ST B, LD A, LD A; C0: ST B, ST A.

3.4.3 Coherence Protocol: States and Transitions

The full state transition diagram for RCC, including both stable and transient states, is shown in Figure 3.4.

Stable states. RCC has two stable states: V (valid) and I (invalid). Blocks loaded into the L1 transition to the V state, and may be read until they are evicted, written, or until their leases expire, at which point they self-invalidate and transition to the I state. Stores (and atomic read-modify-write operations) may occur in both

2 The proof that RCC supports SC is essentially the same as for Tardis [256]; we refer the interested reader there.
The main difference is that RCC permits a sequence of unobserved stores to share the same logical version; the SC ordering in that case is provided by the physical arrival times at the L2.

Figure 3.4: Full L1 and L2 coherence FSMs of RCC (stable and transient states).

V and I states; the request is forwarded to the L2 (GPU L1s are write-through, write-no-allocate), and the block eventually transitions to I after the store ack is received. Expired blocks in V state (exp < now) are treated exactly the same way as blocks in I state for memory operations and cache replacement purposes.

The L2 also only has V and I states. L2 misses retrieve the value from memory and transition to V. Because the L2 is write-back (like in commercial GPUs), the V state allows reads, writes, and atomic operations; a block transitions to I only when evicted by the L2 cache replacement algorithm.

Transient states. L1 blocks also have three transient states: IV, II, and VI; the first two are required for correctness, while the third is a GPU-specific optimization.

IV indicates that a load request missed in the L1 and a gets request has been sent; further load requests for the same cache block will be stored in the MSHR without more gets requests, and the block will transition to V once the data response has been received. Stores received while in IV state cause a transition to II.

II indicates that a store (or atomic) request has been sent to the L2, and the cache is waiting for an ack message with the logical time at which the write was executed (i.e., the new ver); this is necessary to maintain SC.
While in II state, any data response from the L2 will be forwarded to the core, but the block will stay in II.

VI is an optimization of the II state when the block was valid before the write; in VI, the block can still be read by other warps until the ack message with the new ver is received from the L2 cache; this is important in GPUs because round-trip access latencies to L2 can be hundreds of cycles [241].

To permit non-blocking misses, the L2 coherence controller has two transient states:

IV buffers new gets and write requests in the relevant MSHR, keeping track of the maximum now times from the reading and writing processors. Once the data arrives from DRAM, the block's version is updated to reflect any writes in the MSHR and a new lease is generated to satisfy any readers.

IAV indicates an atomic operation received in an invalid state; this stalls any further L1 requests until the block has been retrieved from DRAM, its version has been established, and the atomic operation has completed.

Figure 3.5 shows the complete state transition table, including the generated messages and MSHR management details.

RCC has fewer states and transitions than prior art. Earlier logical timestamp coherence work [255] requires three stable states each for L1 and L2 (transient states are not described), as well as MESI-like recall and downgrade mechanisms to implement a private writable state; such inter-core communication is precisely the source of the SC store latencies we wish to avoid. Prior GPU coherence work also has more states (13 total) and transitions than RCC. In the SC-capable variant, a private state is used to avoid store stalls for private data; in the weakly ordered version, non-fenced stores do not stall but SC support is not possible.
RCC employs logical timestamps to acquire store permissions instantly, and does not require private or exclusive states.

(a) L1 state transition table for RCC.
(b) L2 state transition table for RCC.

Figure 3.5: State transition tables for RCC. D is the cache block (e.g., D.exp is the expiration time for the block), M represents a received message (e.g., M.ver in an ACK indicates the time when a write will become visible). Arrows signify state transitions. V and I are stable states; IV, VI, II (L1 only) and IAV (L2 only) are transient states. Braces denote coherence message contents; cache block data are included as appropriate. Shaded areas highlight protocol changes required for lease extensions.

name     granularity       semantics
now      GPU core          logical time seen by this core
exp      cache block       lease expiration time
ver      cache block       data version (last write time)
mnow     mem. partition    max(exp, ver) evicted to DRAM
lastrd   L2 MSHR           latest now of any reading core
lastwr   L2 MSHR           latest now of any writing core

Table 3.2: Timestamps used in RCC.

3.4.4 L2 Evictions and Timestamp Rollover

Table 3.2 lists all timestamps maintained in RCC and their semantics. Core logical clock now, data write version ver, and lease expiration time exp were described in Chapter 3.4.1.

L2 evictions. Because data copies in L1 automatically expire, RCC allows caches to be non-inclusive without requiring the usual recall messages, as in prior GPU coherence work [221]. Care must be taken, however, to maintain logical ordering when evicting blocks from L2: if a block were naïvely evicted and then re-fetched without preserving its ver and exp, it could then be read logically before it was written, or could be written before all leases expire.
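This hazard, and the "memory time" (mnow) fix developed in the following paragraphs, can be sketched as follows (a single-partition toy model; the helper names are hypothetical):

```python
def naive_refill():
    # naïve eviction drops the block's timestamps; a refill from DRAM
    # would come back with ver = exp = 0
    return {"ver": 0, "exp": 0}

def rcc_refill(mnow):
    # RCC instead stamps the reloaded block with the partition's
    # "memory time": the max ver/exp of any block ever evicted
    return {"ver": mnow, "exp": mnow}

evicted = {"ver": 30, "exp": 45}            # last write at 30, leases out to 45
mnow = max(evicted["ver"], evicted["exp"])  # folded into one register on evict

core_now = 10
naive = naive_refill()
print(core_now > naive["ver"])  # True: the core could read the block
                                # logically *before* its last write at 30

safe = rcc_refill(mnow)
print(core_now > safe["ver"])   # False: rule 1 forces now past 45 first
```

The key saving is that only one mnow register per memory partition is needed, rather than per-block timestamp storage in DRAM, at the cost of conservatively aging every block that round-trips through memory.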
Singh et al. [221] handle this by using an MSHR entry to store the evicted block until the timestamp expires, which limits the number of MSHR entries available for L2 misses.

RCC instead allows the eviction but ensures that, if the block is reloaded from DRAM, reading or writing it will cause any outstanding leases for it to expire. To enforce this, we could keep track of ver and exp for each block in DRAM, but this would require additional storage provisions in main memory. Instead, we store the maximum ver or exp of any evicted block as the "memory time" mnow, one in each memory partition. To maintain logical ordering, a block loaded from DRAM will have its ver and exp set to mnow: any cores that read or write this block will have to advance their logical time to prevent the issue described above.

Since the L2 is write-back (like in extant GPUs [21, 165, 167]), a write request that misses in L2 will be stored in the MSHR while the block is set to IV state and retrieved from DRAM, and any additional write requests are merged into the MSHR. To maintain correct logical write ordering, each MSHR keeps track of lastwr, the highest write time (originating core now value) of any write requests received in IV state. Write requests with now ≥ lastwr update the MSHR data and lastwr; write requests with now < lastwr do not change lastwr but must be tracked until the final write time is known. The larger of lastwr and mnow will become the block's ver; since this is the logical write time, the store can be acknowledged without waiting for the DRAM response. The store data will remain in the MSHR until the DRAM response arrives.

A similar case arises for read requests that miss in L2. MSHRs keep track of lastrd, the latest now of any reading cores; this is used to calculate the lease expiration (exp) once the block is available, and can be elided to save space (lastwr would be used instead).

Timestamp rollover.
Because timestamps have finite representations and keep increasing, they are subject to arithmetic rollover. In our experiments, 32-bit logical timestamps advanced on average once every 1073 core clock cycles; this corresponds to approximately one rollover per hour at clock speeds found in high-performance GPUs.

In principle, this can be handled simply by setting core now clocks to 0, flushing all L1s, setting all L2 ver and exp entries to 0, and setting all mnow values to 0; SRAMs that support flash-clearing [206] make this easy. However, rollover must be processed atomically in the presence of in-flight messages, transient cache states, and independent L2 banks. To implement this correctly, we observe that the L2 is the only coherence actor that actually increases timestamps (L1s only copy timestamps received from L2); therefore, the L2 will be the first component to know that rollover is required.

When an L2 partition needs to roll over a timestamp, it first ensures that all other L2 partitions have stalled and set their timestamps to 0. This can be done in many ways, perhaps using a narrow unidirectional ring with the rollover L2 partition sending a stall flit and all other cores stalling before allowing the flit to continue; when stall returns to the originating core, all cores will have stalled (in case of concurrent stall requests, lowest L2 partition ID wins). All stalling partitions must set all of their timestamps (including lastwr and lastrd) to 0; queued requests and MSHR entries are retained, with all timestamps reset to 0.
The rollover partition then sends a flush request to all L1s, and waits for responses from all; once these have been received, a resume flit is sent on the inter-partition ring, and all L2 partitions resume processing requests. An L1 that receives a flush request sets its now to 0 and invalidates all entries before replying to L2; addresses with MSHR entries enter the II state, while the remaining addresses transition to I.

Figure 3.6: Left: fraction of loads that find data in V state but expired (either for coherence reasons or prematurely); expiration rate is negligible for intra-workgroup benchmarks. Right: fraction of expired loads whose blocks have not changed in L2 (and can be renewed).

3.4.5 Lease Time Extension, and Prediction

When the L2 receives a gets request, it generates a read lease for the block and sends the logical expiration time exp back to the requesting L1. So far, we have assumed all leases have the same duration (of 10 in Chapter 3.4.2); intuitively, however, read-only data should receive very long leases to avoid expiration, whereas data shared frequently should receive short leases to avoid advancing the logical time too much when they are written (and thus causing other cache blocks to expire).

When a lease is too short, a load request finds the L1 block in V state but with an expired lease (now > exp). Figure 3.6 (left) shows how many L1 cache blocks are in V state but expired when accessed. Sometimes, this is the coherence protocol working as intended and indicates a transitive logically-before relation; at other times, the expiration reflects imperfect lease assignment. Figure 3.6 (right) shows that most such expirations are premature (i.e., the block's L2 entry has not changed).

Lease extension. Every such block generates a gets request and a data response from the L2.
While the gets is small, a data response includes the full cache block, which poses an unnecessary traffic overhead.

Figure 3.7: Left: interconnect traffic with (+R) and without (–R) the renew mechanism. Right: reduction in loads that find expired data in L1, with (+P) and without (–P) the lease prediction.

Since the L2 knows when the block was last written (ver), it could potentially renew the lease by sending the new lease expiration time but no data (which the L1 already has). Before deciding whether to send renew or the full data, the L2 needs to know whether the L1's previous lease is older than ver; if it is, the L1 may have incorrect data. To provide this information, we modify gets requests to carry the exp time of the expired lease (tracked by the L1): if this is newer than the data version ver in the L2, a renew grant can be sent. The required protocol changes are shaded in Figure 3.5; note that the complexity cost is minimal, with no additional states and only two new transitions. Prior work [255] also features a lease extension mechanism, but the renew mechanism there relies on keeping track of data versions ver in the L1 caches.

Figure 3.7 (left) shows that the renewal mechanism is effective in reducing interconnect traffic for inter-workgroup sharing workloads by 15% (traffic is also reduced for the intra-workgroup benchmarks, but their expiration rates are negligible to begin with).

Lease prediction. Although lease extension reduces interconnect traffic, many expirations would not occur to begin with if each block received an optimal lease. We attempted to sweep a range of fixed leases, but found that the performance spread among them was negligible.
This is because RCC operates in logical time and most operations advance time in lease-sized amounts; therefore choosing a single fixed lease merely changes the rate at which logical clocks run for everyone. Optimally choosing leases, however, is a non-trivial problem for read-write shared data, partly because the "correct" lease depends on the precise scheduling and interleaving of threads; while the correct lease is obvious for read-only data (= ∞), detecting read-only data at runtime requires microarchitectural changes [220].

Instead, we observe that GPU applications tend to work in synchronized phases, with most data being read at the beginning of a phase and written at the end. These (and read-only) data should receive fairly long leases, while data that is shared often (e.g., locks) should receive short leases.

To find the best lease, the L2 initially predicts the maximum lease (2048) for every block. When the block is written, the prediction drops to the minimum (8), and grows (2×) every time a read lease is successfully renewed. This way the L2 quickly learns to predict short leases for frequently shared read-write blocks (such as those containing locks), but long leases for data that is mostly read and blocks that miss in the L2 (e.g., streaming reads). A similar per-block lease prediction mechanism has been proposed [257] for logical-time CPU coherence protocols; unlike our predictor, however, short leases are preferred there, and the consistency model is relaxed (to TSO) to maintain performance. Figure 3.7 (right) shows that the predictor reduces expired reads by 31% for inter-workgroup workloads (again, intra-workgroup benchmarks benefit but start with negligible expiration rates).

Potential livelock. Because RCC allows cores to read cached data without advancing their logical clocks, a spinlock that only reads a synchronization variable may livelock unless other warps advance the logical time.
Such read-only spinning is a common optimization in multi-core CPUs with invalidate-based coherence, but it relies on implicit store-to-load synchronization that is not guaranteed by coherence or consistency requirements. To the best of our knowledge, these kinds of spinlocks are not used in GPUs, as most workloads have enough available parallelism to cover synchronization delays; spinning merely prevents other (potentially more productive) warps from executing (in general, synchronization in GPUs requires different optimizations than in CPUs [243]). Nevertheless, this potential livelock can be avoided by periodically incrementing the logical time now (say, by 1 every 10,000 cycles).

GPU cores: 16 streaming multiprocessors (SMs)
core config: 1.4GHz, 48 warps × 32 threads, 32 lanes
warp scheduler: loose round-robin
register file: 32,768 registers (32-bit)
scratchpad: 48KB
per-core L1: 32KB, 4-way set-associative, 128-byte lines, 128 MSHRs
total L2: 1024KB = 8 partitions × 128KB
L2 partition: 128KB, 8-way set-associative, 128-byte lines, 128 MSHRs; 340-cycle minimum latency [241]
interconnect: one xbar/direction, one 32-bit flit/cycle/dir. @ 700MHz (175GB/s/dir.); 8-flit VCs (5 for MESI, 2 otherwise)
DRAM: 1400MHz, GDDR, 8 bytes/cycle (175GB/s peak), 460-cycle minimum latency, FR-FCFS queues, tCL=12, tRP=12, tRC=40, tRAS=28, tCCD=2, tWL=4, tRCD=12, tRRD=6, tCDLR=5, tWR=12, tCCDL=3, tWR=2
lease times: 32 bits, predicted from 8–16–···–1024–2048

Table 3.3: Simulated GPU and memory hierarchy for RCC.

3.4.6 RCC-WO: A Weakly Ordered Variant

Relative load and store ordering is effected through the per-core logical time now. Keeping track of two separate logical now times — the read view, consulted and updated by load operations, and the write view, consulted and updated by store operations — allows loads and stores to be reordered with respect to each other.
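A minimal sketch of this dual-view bookkeeping, with invented names (the actual RCC-WO hardware tracks these as per-core logical clocks):

```cpp
#include <cassert>
#include <cstdint>
#include <algorithm>

// Sketch of RCC-WO's per-core logical time, split into a read view and a
// write view so that loads and stores can reorder with respect to each
// other; a full fence merges the two views to the larger value.
struct CoreViews {
    uint64_t read_now = 0;   // consulted and updated by load operations
    uint64_t write_now = 0;  // consulted and updated by store operations

    void full_fence() {
        uint64_t merged = std::max(read_now, write_now);
        read_now = write_now = merged;
    }
};
```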
In this weakly ordered scheme, full fence operations require only that the read view and write view now values be set to whichever is larger; performance can potentially improve because stores no longer expire cached data that do not have the same block address. The consistency model is WO [68]; work concurrent with ours [257] proposes a similar adaptation that supports RCsc [78].

3.5 Methodology

Simulation Setup. We follow the methodology used in previous GPU coherence work [220, 221]. GPGPU-Sim 3.x [31] is used to simulate the core, and combined with the Ruby memory hierarchy simulator from gem5 [34] to execute coherence transactions. For the sequentially consistent implementations (MESI, TCS, RCC), we altered the shader core model to execute global memory instructions sequentially, and stall local memory operations if there are outstanding global accesses; this matches the "naïve SC" baseline of [220]. We use Garnet [13] to simulate the NoC and ORION 2.0 [109] to estimate interconnect energy.

The simulated configuration is similar to NVIDIA's GTX480 (Fermi [165]), with latencies derived from microbenchmark studies [241]; this matches the configurations used in prior work [220, 221]. Table 3.3 describes the details.

inter-workgroup communication
BFS (breadth-first-search): graph traversal [31]
BH (Barnes-Hut): n-body simulation kernel [42]
CL (RopaDemo): cloth physics kernel [40]
DLB (dynamic load balancing): work stealing algorithm for octree partitioning [47]
STN (stencil): finite difference solver synchronized using fast barriers [243]
VPR (place & route): FPGA synthesis tool [199]

intra-workgroup communication
HSP (hotspot): 2D thermal simulation kernel [53]
KMN (k-means): iterative clustering algorithm [53]
LPS (Laplace solver): 3D Laplace solver [31]
NDL (Needleman-Wunsch): DNA sequence alignment [53]
SR (anisotropic diffusion): speckle reduction for ultrasound images [53]
LUD (matrix LU): matrix LU decomposition [53]

Table 3.4: Benchmarks used for RCC evaluation.

Benchmarks.
We use benchmarks identified and classified into inter- and intra-workgroup communication categories in prior work on GPU coherence [221]. The intra-workgroup benchmarks execute correctly without coherence, but are used to quantify the impact of always-on cache coherence on traditional GPU workloads. For non-SC simulations, the inter-workgroup communication benchmarks rely on fences; for SC simulations fences act as no-ops in hardware, but were left in the sources to prevent the compiler from reordering operations.

Benchmark details and sources are listed in Table 3.4. Most were used in prior work on GPU coherence [221]; we dropped two because our sensitivity studies found them to be highly nondeterministic and unpredictably sensitive to small changes in architectural parameters (e.g., a few cycles' change in L2 latency). We added missing fences to dlb following [16], and altered tile dimensions in hsp to match GPU cache block sizes and avoid severe false sharing problems.

Figure 3.8: Speedup of RCC on inter- and intra-workgroup workloads.

3.6 Evaluation Results

3.6.1 Performance Analysis

SC on top of RCC performs substantially better than prior SC proposals for GPUs. Figure 3.8 shows that RCC is 76% faster than MESI and 29% faster than TCS on workloads with inter-workgroup sharing; in fact, performance is within 7% of TCW, the best prior non-SC proposal. On benchmarks with intra-workgroup communication patterns, RCC is 10% better than MESI and within 3% of both TCS and TCW.

RCC significantly reduces SC overheads compared to prior SC implementations for GPUs.
Figure 3.9 (top) shows issue stall rates caused by enforcing SC: either direct SC memory ordering stalls or LSU pipeline stalls caused by waiting on store acknowledgements. RCC reduces these by 52% relative to MESI (largely because there are no invalidate delays) and by 25% relative to TCS (largely because stores in RCC acquire write permissions without stalling). Figure 3.9 (bottom) shows that SC ordering stalls in RCC are resolved 35% faster than in MESI and 11% faster relative to TCS. Both of these metrics directly correlate to performance.

Figure 3.9: The reductions of SC stalls (Top) and SC stall resolving latency (Bottom) in RCC (results are normalized to MESI).

TCW performs better than RCC for bfs because it benefits both from its weak memory model and from relaxing write atomicity. All threads share a "mask" vector, which identifies nodes to be visited in the next iteration (next level of the bfs tree); TCW allows different cores to modify parts of this vector without other cores observing the result, while RCC strictly enforces SC at cache block granularity and sees more L1 misses (73% vs. 52%).

Conversely, RCC outperforms TCW on dlb. In dlb, a per-workgroup work scheduler that completes its task steals tasks from a random other workgroup's scheduler. Since work could be stolen at any time, all per-workgroup queue accesses must be protected with fences; fences stall in TCW until a physical time when all stores have become globally visible. In actuality, however, work stealing events are rare, so most of these stalls are unnecessary. RCC allows cores to progress independently in their own epochs until actual sharing occurs. In addition, stores do not stall even when sharing does occur because SC is enforced in logical time.

Figure 3.10: Speedup of weak ordering implementations vs. RCC-SC on inter- and intra-workgroup workloads.

We also developed RCC-WO, a weakly ordered variant of RCC (Chapter 3.4.6), and compared it with both TCW (our implementation supports WO) and the default SC implementation of RCC. RCC-WO performs neck-and-neck with TCW, and both perform 7% better than RCC-SC (Figure 3.10).

One RCC implementation can support strong and weak consistency. The microarchitectural differences between weak and strong variants of RCC in GPUs consist of one additional scheduler signal per warp to order memops from one thread, and a small change in how stores update L2 metadata. This opens the possibility that the hardware memory model in GPUs could be chosen at boot time (as in, e.g., SPARCv9 [225]) or even at runtime.

Figure 3.11: Energy cost of RCC on inter- and intra-workgroup workloads.

3.6.2 Energy Cost and Traffic Load

Interconnect energy is 45% lower than MESI, 25% lower than TCS, and only 7% below TCW on inter-workgroup workloads (Figure 3.11); on intra-workgroup programs, it is 25% better than MESI and on par with TCS/TCW.
This is partly due to reductions in traffic (Figure 3.12) and partly due to RCC needing only two virtual networks to maintain deadlock-free operations vs. five for MESI. Interconnect energy expenditure is becoming more important as GPU core counts grow.

3.6.3 Coherence Protocol Complexity

RCC has fewer states than TCW and TCS (Table 3.5). This is important because coherence is notoriously difficult to verify: usually, validation involves very simplified formal models and extensive simulations [32, 242], but bugs survive despite extensive validation efforts [41, 61, 67, 185].

Figure 3.12: Traffic load of RCC on inter- and intra-workgroup workloads.

                MESI       TCS      TCW      RCC
L1 states       16 (5+11)  5 (2+3)  5 (2+3)  5 (2+3)
L1 transitions  81         27       42       33
L2 states       15 (4+11)  8 (4+4)  8 (4+4)  4 (2+2)
L2 transitions  50         23       34       14

Table 3.5: The number of states (stable+transient) and transitions for different coherence protocols.

3.6.4 Area Cost

RCC has reasonable silicon area overheads. For every L1 block, RCC only stores exp, and, for every L2 block, exp and ver. GPU cache blocks are 128 bytes, with perhaps 3-byte tags; with 32-bit timestamps this is 3% overhead for L1 and 6% area overhead for L2.

3.7 Summary

In this chapter we track the source of SC inefficiency in GPUs to long store latencies caused by coherence traffic; these severely exacerbate SC ordering and structural bottlenecks that GPUs could otherwise easily amortize. We address these by proposing RCC, a coherence protocol that uses logical timestamps to reduce store latency.
When used as part of an SC implementation, RCC reduces SC-related stalls by 25%, and stall resolve latency by 11%, compared to the best coherence proposal for GPUs capable of supporting SC; as a result, performance is 29% better. When used in RC mode, RCC matches the best prior RC proposal; because the hardware needed for RCC is similar for SC and RC, a single implementation can potentially allow runtime selection of the desired memory consistency model.

Chapter 4

Hardware Transactional Memory with Eager Conflict Detection

This chapter explores a simple and reliable programming model to implement efficient synchronization in GPUs. Instead of the lock mechanism, we choose Transactional Memory (TM) for simple and deadlock-free programming. We identify the excessive latency of the value-based lazy conflict detection mechanism as a critical performance bottleneck of prior GPU TM designs, so we propose GETM, the first GPU hardware TM with eager conflict detection. GETM relies on a logical-timestamp-based conflict detection mechanism: by comparing the timestamp of a transaction with the timestamp of the accessed data in memory, conflicts are detected eagerly when the initial memory access is made. Performance of GETM is up to 2.1× (1.2× gmean) better than the best prior work. Area overheads are 3.6× lower and power overheads are 2.2× lower.

While GPUs have traditionally focused on streaming applications with regular parallelism, irregular GPU applications with fine-grained synchronization are becoming increasingly important. Graph transformation [140, 249], dynamic programming [129], parallel data structures [150], and distributed hashtables [97] have all been accelerated on GPUs using fine-grained locks.
Fine-grained parallel algorithms have recently become a hardware optimization focus for commercial GPUs [69].

Unfortunately, high-performance parallel applications with fine-grained locks are challenging to program and debug. Indeed, reasoning about thread-based synchronized programs is difficult in general [28, 125], and even simple formal analyses that account for inter-thread synchronization are NP-hard [231] or undecidable [193]. In practice, the problem is exacerbated in accelerators like GPUs, because optimizing for performance is paramount — after all, if it weren't, the code would be running on a CPU. In GPUs, this problem is even worse, as the combination of lockstep warp execution and stack-based branch reconvergence can result in unexpected deadlocks in code that would be deadlock-free in CPUs [72].

// lock-based version:
if (src > dst) {              // acquire in-order to avoid deadlock
    outer = src; inner = dst;
} else {
    inner = src; outer = dst;
}
done = false;
while (!done) {               // loop on flag to avoid SIMT deadlock
    if (atomicCAS(&locks[outer], 0, 1) == 0) {
        if (atomicCAS(&locks[inner], 0, 1) == 0) {
            accounts[src] -= amount;
            accounts[dst] += amount;
            locks[inner] = 0;   // release
            locks[outer] = 0;   // both locks
            done = true;
        } else {                // acquired outer but not inner lock
            locks[outer] = 0;   // release outer lock
        }
    }
}

// TM version:
txbegin
accounts[src] -= amount;
accounts[dst] += amount;
txcommit

Figure 4.1: CUDA ATM benchmark fragment using either locks or TM.

Transactional memory (TM) [96, 228] offers an attractive solution. In contrast to the imperative style and global dependencies induced by locks, transactions enable a declarative programming style: the programmer specifies that a given code block constitutes an atomic transaction and leaves execution details to the runtime (see Figure 4.1).
Typically, the runtime (hardware or software) attempts to execute transactions optimistically, only aborting and retrying them when conflicts are detected; writes performed by aborted transactions are not visible to transactions that commit successfully. Because they maintain atomicity and isolation, transactions are composable [93], and substantially simplify code in complex codebases [200, 259], leading to many times lower error rates [201]. Recently, hardware-level transactional memory has appeared in production CPUs from major vendors [44, 90, 106, 107], as well as in designs and proposals from other significant industry players [52, 60].

Early proposals for hardware-level transactional memory for GPUs solved key problems of interacting with the SIMT stack [77] and coalescing transactions at warp level [76]. Both rely on value-based validation, which requires one core↔LLC round trip to validate each transaction and another round trip to finalize the commit. Combined with the massive concurrency present in GPU workloads, these long latencies create bottlenecks in the commit phase: even if transactional concurrency is restricted, 700 or more transactions may be queued in the commit phase on average [77].

Prior proposals have therefore limited transactional concurrency to very few warps per SIMT core [76, 77]. With few warps, however, the GPU can no longer effectively amortize commit latencies, so some performance is lost. Another proposal has been to proactively abort transactions by broadcasting conflict sets from the LLC back to the SIMT cores [55]; the bandwidth and latency of these broadcasts, however, limit this approach to extremely long transactions.

In this chapter, we instead propose to directly reduce commit costs by detecting conflicts eagerly.
If conflict detection is performed separately for each memory access — a latency well within a GPU's capacity to amortize even with concurrency throttling — a transaction that arrives at the commit point is guaranteed to commit successfully. Because there is no need for time-consuming value-based conflict detection at commit time, the commit itself can be taken off the critical path while the warp continues execution. To the best of our knowledge, this is the first full GPU hardware TM proposal with eager conflict detection, and the first to leave transaction commits out of the critical path.

Figure 4.2: Messages required for transactional memory accesses and commits in WarpTM (top) and GETM (bottom).

4.1 GPU Transactional Memory

The state-of-the-art GPU TM, WarpTM [76, 77], combines lazy version management with lazy, value-based conflict detection.¹ Figure 4.2 (top) shows the access and commit timing.

Firstly, WarpTM modifies the SIMT stacks to allow aborting and restarting transactions at thread granularity. GPUs execute many (32–64) threads in lockstep as a single warp; transactions are a thread-level abstraction, however, so it is possible that some of the threads in the warp commit while other threads abort. WarpTM adds special Transaction and Retry stack entry types that track which threads aborted and should run again when the transaction is restarted.

As transactions execute, their memory accesses are sent to a redo log, stored in the SIMT core's local memory.² For each address, loads record the value that was observed (for later validation), and stores record the newly written value.
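The redo log just described can be sketched as follows (the structure and names are invented for illustration; the actual log lives in the SIMT core's local memory and is traversed by the tx log unit):

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <vector>

// Sketch of a per-warp transactional redo log: loads record the observed
// value (for later value-based validation), stores record the newly
// written value.
struct LogEntry {
    uint64_t addr;
    uint64_t value;
    bool     is_store;
};

struct RedoLog {
    std::vector<LogEntry> entries;

    void log_load(uint64_t a, uint64_t observed)  { entries.push_back({a, observed, false}); }
    void log_store(uint64_t a, uint64_t written)  { entries.push_back({a, written,  true}); }

    // Under eager conflict detection (GETM) only the write entries need to
    // be transmitted at commit; value-based validation (WarpTM) would also
    // replay the load entries against the LLC.
    size_t num_writes() const {
        size_t n = 0;
        for (const auto& e : entries) n += e.is_store ? 1 : 0;
        return n;
    }
};
```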
When the warp reaches txcommit, a tx log unit traverses the redo log to record all threads wishing to access each address; this allows the SIMT core to resolve all intra-warp conflicts and coalesce the warp's surviving transactions.

At commit time, the read and write logs of the coalesced transaction are sent to validation/commit units (VUs/CUs) colocated with each LLC bank. Each validation unit verifies that the value observed by each read in the log corresponds to the current value in the LLC, and sends a success/failure message to the SIMT core. The core collects these to check whether any addresses failed validation, and sends a commit/abort confirmation back to the CUs. Each CU then sends the write log values to the LLC, and acks to the core. Once the core has collected acks from all CUs, the warp continues execution. Transactional consistency requires each transaction to be validated and committed atomically, so while one transaction goes through the two-round-trip validation/commit sequence, other transactions must wait.

¹ We discuss other GPU proposals [55, 56, 234] in Chapter 7.3.
² In NVIDIA terminology, a GPU core's local memory is an address range of the global address space reserved for that core. As with the rest of the address space, local memory is cached in the GPU cache hierarchy.

WarpTM also includes a temporal conflict check mechanism (TCD) that allows read-only transactions to commit silently. A TCD table at the LLC records the physical clock cycle number of the last store to each address; the cycle numbers are updated as transactions commit.
Each transactional load is immediately sent from the SIMT core to this TCD table; if a read-only transaction has only read locations modified in the past, it is allowed to bypass value-based validation and commit silently.

Because GETM uses eager conflict detection, transactions that have reached txcommit are guaranteed to be free of conflicts, and commit without additional validation or acks.

GETM retains the SIMT stack modifications and warp-level transaction coalescing of WarpTM. However, it replaces the value-based validation and TCD read-only silent commits with an eager conflict detection scheme (see Chapter 4.4), which greatly simplifies the validation/commit unit and substantially reduces the hardware overhead (see Chapter 4.5 and Chapter 4.7).

4.2 Eager Conflict Detection and GPUs

Although eager conflict detection is more suitable for high-thread-count architectures (see Chapter 4.3), the lack of a natural conflict detection mechanism poses a challenge to implementing eager conflict detection in GPUs. Prior TMs with eager conflict detection (e.g., LogTM [149, 252]) have targeted CPUs, in which conflicts are naturally flagged when cache lines are invalidated by the coherence protocol. Unfortunately, extant GPUs lack hardware cache coherence, so another mechanism must be designed. Another challenge is scalability, since GPUs have large core counts and many concurrent warps in each core. This precludes, for example, mechanisms that collect and broadcast read/write signatures.

To provide a scalable eager conflict detection mechanism, we take inspiration from the software transactional memory system TL2 [65]. TL2 uses a global version-clock that is incremented by every transaction which writes to memory, and maintains last-written version-clock values for every memory location.
As the transaction accesses memory, it collects version-clocks for all referenced locations. At commit time, these clocks are checked to ensure that the transaction observed a consistent state of memory; if there are no violations, TL2 acquires locks for all locations it intends to modify and finally writes the memory.

In TL2, logical clocks are used to ensure consistency, but conflict detection is still performed lazily at commit time. In addition, the global version clock must be shared among multiple cores, which relies on the underlying cache coherence protocol. We leverage the idea of providing consistency via logical clocks, but use them to implement early conflict detection, and design a distributed logical clock protocol that does not need cache coherence.

We propose GPU Eager Transactional Memory (GETM), a novel GPU hardware TM design. Unlike prior eager TMs, GETM does not rely on coherence or signature broadcast. Instead, GETM combines encounter-time write reservations with a logical timestamp mechanism to detect conflicts as soon as they occur, and to allow off-critical-path commits.

4.3 GPUs Favour Eager Conflict Detection

In this section, we argue that eager conflict detection is particularly suited to the large number of threads concurrently executing in a GPU, because the long commit latencies inherent in lazy detection form a key bottleneck as concurrency grows. This is not the case for CPUs, where TMs with eager conflict detection, such as LogTM [149], are outperformed by lazy [51] or partially lazy [233] variants.

To test this intuition, we modified the state-of-the-art GPU TM design WarpTM [76] to emulate eager conflict detection (cf. Figure 4.2) and examined how it performs as the number of warps per SIMT core grows. WarpTM uses lazy conflict detection and lazy versioning (see Chapter 2.2.2 for details), and commits transactions via two core↔LLC round trips: (i) the transaction log is sent to be value-validated at the LLC banks; (ii) the LLC sends back validation success/failure status; (iii) the core collects the responses and (if all banks reported success) instructs the LLC to start commit; (iv) the LLC banks acknowledge commit completion; (v) the core can resume executing the relevant warp. Eager conflict detection needs to check only the currently accessed memory location, but must be repeated for every access; therefore, to emulate an eager-lazy design, we hacked WarpTM to run validation (i)–(ii) for every transactional access, with no latency.

Figure 4.3: Time per transaction spent executing transactional code (top), waiting for aborting transactions in the same warp and concurrency limits (centre), and total time spent in transactions (bottom), as the number of warps allowed to concurrently run transactions grows. Measurements from the HT-H hashtable benchmark, normalized to the highest data point.

Figure 4.3 (top) shows how the original WarpTM (-LL) and idealized eager-lazy variant (-EL) perform as permitted concurrency grows on the hashtable insertion workload HT-H. With an increasing number of transactions, the number of cycles spent executing each transaction (including retries) grows much faster for the variant with lazy conflict detection than for the eager version.
The gap widens because increasing concurrency increases conflicts and causes transactions to be retried more times. For each retry, WarpTM-LL incurs the two round-trip latency of lazy value-based validation, making each attempt far more expensive than in WarpTM-EL.

Figure 4.4: Benefits of eager conflict detection compared with the lazy mechanism and hand-optimized fine-grained lock implementations. (Optimal concurrency is used for all configurations.)

Figure 4.3 (centre) shows how long transactions wait to commit, either because of concurrency throttling or because of waiting for diverged threads in the same warp to abort the transaction. Because the value-based validations in WarpTM-LL are expensive, subsequent transactions wait longer than in WarpTM-EL. For WarpTM-EL, wait time decreases as more warps can execute and cover commit latency; for WarpTM-LL, however, increasing concurrency increases the commit queue backup and therefore the total wait cost.

The overall runtime spent in transactions is shown in Figure 4.3 (bottom). This explains why the optimal concurrency for WarpTM-LL is 2 transactional warps per SIMT core [76], and demonstrates that eager conflict detection can support substantially more concurrency with much lower overheads.

Figure 4.5: Overall architecture of a SIMT core with GETM. Shaded blocks are added for transactional memory support.

Note that this effect is peculiar to architectures with high thread-level concurrency, such as GPUs.
Most CPUs run 1–2 threads per core, and have few cores per die. This places them on the left of Figure 4.3 (top), where the lazy and eager versions execute a similar number of transactional cycles.

To quantify the overall performance potential of eager conflict detection, we simulated a range of TM benchmarks using the lazy and eager variants of WarpTM, as well as the equivalent non-TM versions using hand-optimized fine-grained locks. Figure 4.4 (top) shows that execution and wait cycles spent in transactions are substantially reduced in the eager variant, and Figure 4.4 (bottom) shows that this translates to faster overall execution time.

4.4 GETM Transactional Memory

In this section, we sketch an overview of how GETM provides transactional atomicity, consistency, and isolation, and describe how it tracks the necessary metadata. The description here focuses on the GETM protocol, how transactions execute, and how metadata evolves. The high-level architecture is shown in Figure 4.5; implementation details, including the metadata and queueing data structures present at the LLC, are described in Chapter 4.5.

4.4.1 Atomicity, Consistency, and Isolation

We first describe the transaction logs that provide atomicity, and then the logical timestamp and access-time locking mechanisms used to ensure consistency and isolation.

Tracked per warp
warpts: the logical time at which transactions from this warp atomically execute

Tracked per LLC cache line
wts: one higher than the logical time when this location was last written
rts: the logical time when this location was last read
#writes: number of writes to this location (if non-zero, the location is locked by a transaction)
owner: the owner of the write reservation (if #writes is non-zero)

Table 4.1: Metadata tracked by GETM.

Transaction logs.
As in prior work [76, 77], transactions are managed at warp level, and each warp keeps a redo log in the SIMT core's existing local memory.

In contrast to GETM, prior work required sending the entire log (reads and writes) to the commit units for validation at commit time. Because GETM uses eager conflict detection, transactions that have reached txcommit are guaranteed to succeed, and commit-time validation is not necessary. Instead, a committing transaction transmits only the transactional writes from the redo log (typically a fraction of the entire log), so that the write data can be stored in the LLC.

In addition to being logged, all transactional accesses are sent to the LLC for eager conflict detection, using the timestamp and lock mechanisms described below.

Logical timestamps. GETM uses distributed logical timestamps to provide transactional consistency, and each transaction executes at a specific logical timestamp. To guarantee consistency, GETM must ensure that a running transaction (a) does not observe stale values of locations changed by logically earlier transactions, (b) does not observe values written by logically later transactions, and (c) does not alter values already seen by logically later transactions.

The logical timestamps tracked by GETM are shown in Table 4.1. Firstly, each warp keeps a logical timestamp warpts, corresponding to the memory state observed by the last transaction. This timestamp starts at 0, and is advanced when transactions abort (as discussed below). All new transactions started by this warp execute at logical time warpts.

Each cache line in the shared LLC has a write timestamp wts, equal to one more than the logical time of the last write, i.e., 1 + warpts of the logically latest
If a transaction T attempts to access a cache line Lwhere L.wts> T.warpts, it means that L was written by a transaction logically laterthan T , and T must abort.Every cache line also contains a read timestamp rts, equal to the logical timeof the last read, i.e., warpts of the last transaction to read it. A transaction T mayread lines with any rts, but writing a cache line L where L.rts > T.warpts wouldoverwrite a value which has already been observed by a later transaction, and is notpermitted.The rts and wts timestamps are maintained eagerly: that is, transactional loadsupdate rts and transactional writes update wts at the time of the request, regardlessof whether the transaction will eventually commit. The updated timestamps are notreverted if a transaction aborts; while this might unnecessarily abort some futuretransactions, those will be restarted, and consistency is not compromised.Encounter-time locks. Unlike timestamps, transactional write data is not storedin the LLC until the transaction reaches its commit point. This creates an isolationproblem if a transaction T1 modifies a location and a logically later but physicallyconcurrent transaction T2 accesses this location: the value that should be seen by T2depends on whether T1 will commit successfully, but T1 is still in progress.To avoid this issue, GETM uses locks to prevent T2 from reading the locationuntil T1 has committed. Each cache line has two additional fields to support this:#writes and owner (see Table 4.1). When a transaction T first encounters a previouslyuntouched cache line L, it reserves L by setting L.#writes to 1 and L.owner to thetransaction’s global warp ID (because transactions are coalesced per warp, thisuniquely identifies a running transaction; see Chapter 4.1).Now when T2 accesses L (either for reading or writing), it must check whether Lhas been reserved. 
If L.#writes ≠ 0 and L.owner ≠ T2, transaction T2 proceeds with the rts/wts checks described above; if the checks fail then T2 is aborted, otherwise T2 stalls until T1 commits. (We discuss the stall buffer where stalled transactions are queued in Chapter 4.5.)

The owner/#writes mechanism also allows a transaction to repeatedly write the same location. If T is already the owner of a cache line, it bypasses the rts and wts timestamp checks, and writes the line. This is safe because T must have previously satisfied the rts and wts timestamp constraints, and updated wts. As any other transaction attempting to update the line since that time would have been stalled, neither rts nor wts could have been altered since T's reservation.

Aborts and advancing logical time. The logical time observed by each warp (warpts) advances when transactions are aborted. When a transaction aborts, it reports to the core the latest logical timestamp t it attempted to read or write (the abort cause). Since the transaction will fail again unless it restarts at a time later than t, warpts is set to t + 1.

For example, if a transaction T has aborted because of reading a cache line L, it must be because the cache line is logically newer than the transaction, i.e., L.wts > warpts. In this case, the SIMT core sets warpts to L.wts + 1, and T is restarted. Similarly, if T aborts because of a write, warpts is set to max(L.rts, L.wts) + 1, and the transaction restarts.

Commit and cleanup. When all threads in a warp reach the end of the transaction (commit or abort), the SIMT core serializes the write logs of all threads and sends them to the LLC. For all threads that have successfully reached the commit point, the core sends the address, write data, and write count (since multiple writes may have been coalesced).

Once this commit/abort log is received, each entry is written to the LLC and the relevant #writes entry is decremented.
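The per-line checks and commit-time cleanup described above can be condensed into a small executable model. The following is a hypothetical Python sketch of the protocol rules, not the hardware implementation; the names `Line`, `tx_load`, `tx_store`, and `commit_write` are ours:

```python
# Hypothetical model of GETM's per-cache-line metadata and eager conflict checks.
# A transaction is identified by its warp ID (wid) and runs at logical time warpts.

class Line:
    def __init__(self):
        self.wts = 0          # write timestamp: 1 + warpts of logically latest writer
        self.rts = 0          # read timestamp: warpts of logically latest reader
        self.num_writes = 0   # outstanding (uncommitted) write reservations (#writes)
        self.owner = None     # warp ID holding the reservation

def tx_load(line, wid, warpts):
    """Returns ('ok' | 'abort' | 'queue', suggested restart warpts or None)."""
    if line.num_writes > 0 and line.owner != wid:
        if warpts < line.wts:              # WAR: a logically later tx already wrote
            return ('abort', line.wts + 1)
        return ('queue', None)             # passed the check: wait for the owner
    if line.num_writes == 0 and warpts < line.wts:
        return ('abort', line.wts + 1)
    line.rts = max(line.rts, warpts)       # eager rts update
    return ('ok', None)

def tx_store(line, wid, warpts):
    if line.num_writes > 0 and line.owner == wid:
        line.num_writes += 1               # repeated write by the owner: no checks
        return ('ok', None)
    if warpts < max(line.wts, line.rts):   # WAW or RAW conflict
        return ('abort', max(line.wts, line.rts) + 1)
    if line.num_writes > 0:                # reserved by a logically earlier tx
        return ('queue', None)
    line.wts = warpts + 1                  # eager wts update and lock acquisition
    line.owner = wid
    line.num_writes += 1
    return ('ok', None)

def commit_write(line):
    line.num_writes -= 1                   # data written to LLC; released at 0
```

For instance, a store by warp 1 at warpts = 20 sets wts to 21; a subsequent load by warp 2 at warpts = 10 fails the version check and is told to restart no earlier than 22, while the restarted load at warpts = 22 passes the check but queues until the lock is released.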
Once #writes in a cache line has reached 0, the cache line fully reflects the atomic transaction update, and can now be accessed by other transactions.

Aborted transactions instead send the address and write count for each modified cache block to facilitate cleanup. The #writes in each cache line is updated as above; after #writes has reached 0, the cache line reflects its pre-transaction state, and may be accessed by other transactions.

The life of a transactional access. Figure 4.6 shows how a transactional read or write is processed in GETM.

Figure 4.6: The flowchart for load, store, and commit/abort logic in GETM.

Owner check Ê. If #writes is non-zero but the owner field matches the current transaction, the line must be locked and the access succeeds Ë. Stores only increment #writes (since wts was already set by the previous write), while loads potentially update rts if it is less than warpts.

Timestamp check Ì. A transaction that attempts to load an address and finds its wts younger than the transaction's own warpts has detected a WAR conflict – i.e., another transaction with a younger warpts has already written to the location – and must abort Í. Similarly, a transaction that writes a location but finds either wts or rts to be younger than warpts must also abort, since a logically younger transaction has either written the location or observed its value Í.

Abort notification Í. If the version check discovers a conflict, the transaction must be aborted. To minimize the chances of the transaction aborting again, the SIMT core is sent the highest timestamp seen so far at the LLC; this will be used to update warpts and restart the transaction. Meanwhile, the core notes that the thread has aborted, and will clean up any reservations made when the entire warp reaches txcommit or when all threads have aborted.

Write lock check Î. Next, the transactional memory operation checks whether the accessed location has been reserved by another warp (i.e., whether #writes is non-zero). If not, the operation succeeds without conflict: a load will update rts (if < warpts) while a store will set #writes to 1 and update the location's wts with the transaction's warpts Ï.

Queue Ð and retry Ñ. Accesses that passed the timestamp check but do not own the active lock must be logically younger than the lock owner.
To avoid unnecessary aborts, these requests are queued until the owner transaction commits. After the lock is released, the queued transactions will retry.

4.4.2 Walkthrough Example

Figure 4.7 illustrates how the GETM protocol operates on two conflicting transactions from the bank transfer example (Figure 4.1); in this benchmark, accounts are modelled as unique memory locations. The first transaction (tx1) transfers some amount from account A to account B, while the second (tx2) transfers another amount from B to A. Transaction tx1 starts at warpts = 20, and transaction tx2 starts at warpts = 10. The central grey line represents the LLC, and the thinner black arrows represent messages between the cores and the LLC. The interleaving of the accesses from each transaction has been chosen to illustrate how the eager conflict detection and queueing mechanisms work; in reality, any interleaving of the two transactions could occur.

Figure 4.7: A walkthrough example of eager conflict resolution in GETM.

First, tx1 loads and stores location A: the load updates A's rts to match the transaction's warpts (i.e., to 20), and the store updates the wts of A to exceed that of tx1 (i.e., to 21). Then tx2 does the same with B, updating its wts to 11 and rts to 10. At this point, tx1 and tx2 have accessed disjoint locations and so far do not conflict. The transaction metadata for addresses A and B at this point are shown in table Ê.

Next, tx2 attempts to read location A, previously altered by tx1. Because tx2.warpts < A.wts, the load fails the version check and will abort tx2 (cf. Figure 4.6). The LLC will notify the requesting core that the transaction has been aborted, and that the next warpts should be later than 21.
The core will then send the write/abort log for tx2 to the LLC, which will release the reservation for B by setting the #writes field to 0. When tx1 now sends load and store requests for B, both requests succeed since tx2 had an older version and its write lock was cleared as tx2 aborted. At this point, the metadata for A and B correspond to table Ë.

Transaction tx2 now restarts at the core, with a higher warpts of 22. When its first load request (for B) arrives at the validation unit, it passes the version check but finds B reserved; the load is therefore queued in the VU's stall buffer and will be retried as the conflicting transaction commits.

Meanwhile, tx1 gets to its commit instruction. Because all of its memory accesses have passed eager conflict detection, the transaction is guaranteed to succeed. The core therefore sends the write log to the LLC and moves on. As the write log is processed, write reservations (#writes) for both A and B are reset. Table Ì shows the metadata at this point.

Once the commit of tx1 has finished and released the reservations on A and B, any stalled transaction accesses are retried; in this case, this is the load of B from tx2, which now succeeds. Transaction tx2 can then continue with its remaining memory accesses, and will succeed.

4.5 GETM Implementation Details

Adding transactional memory support requires modifications to both the SIMT core and the memory partition that houses the LLC slice and a memory controller: we need to modify the core to retry aborted transactions and record redo logs, and to add validation and commit hardware to each memory partition. Figure 4.5 shows the overall architectural components of a GPU core extended with GETM.

4.5.1 SIMT Core Extensions

SIMT Stack. Adding transactional memory support to a GPU's cores requires changing the SIMT stack to track which threads in the warp are executing transactions and which must be retried. To implement this, we leverage the modified SIMT stack proposed by Fung et al. [77].
This mechanism is similar to branch divergence hardware [127]: for each warp, the top of the SIMT stack tracks the threads that are currently executing, while the stack entry immediately below tracks threads that have aborted and must be retried.

Transaction management. While individual threads can run separate transactions, commits occur at warp granularity when all threads in the warp have arrived at the commit point [76]. Nevertheless, transactions remain logically at thread granularity: when some of the warp's threads abort, they are automatically retried via the extended SIMT stack after the entire warp reaches the commit point [77].

Transaction logs. The GETM versioning mechanism is the same as in prior GPU transactional memory work [77]. Logs are stored in each SIMT core's local address space, and cached by the L1/LLC caches. Although GETM only requires a write log, we also record a read log to permit intra-warp conflict detection [76]; in this technique, each transactional access is first checked against the local per-warp read and write logs and aborted if it conflicts with other threads in the same warp. At commit time, however, the read log is discarded and only the write log is sent to the commit units.

Forward progress. Aborted transactions ensure progress by restarting with a probabilistically increasing backoff [121].

4.5.2 Validation Unit

GETM protocol actions on the LLC side are carried out by Validation Units (VUs), one of which is colocated with each LLC bank. Each VU consists of (a) metadata storage structures to track the last-written and last-read versions for each address, and (b) a structure to buffer requests that found a location locked but were younger than the current owner.

Figure 4.8: Transaction metadata table microarchitecture.

Transaction metadata storage

Because GETM explicitly tracks versions to enable eager conflict detection, it must keep all metadata (wts, rts, #writes, and owner; see Table 4.1) for all locations that are part of any in-flight transaction, and some metadata (wts and rts) for all locations that have been (or could be) accessed transactionally.

These requirements pose some challenges: firstly, transactions could be very long (and, in general, unbounded), so fast access to a potentially large lookup structure is necessary; secondly, potentially all addresses could be accessed transactionally, and tracking metadata for them all is impractical.

Our solution relies on two observations. The first is that very long transactions are likely to be rare in well-tuned code; therefore the metadata table can be sized for the common case and provide a spillover mechanism (like in Unbounded TM [23]). The second is that metadata for addresses that are not being written by in-flight transactions can be maintained approximately provided that the only errors are overestimates: if the lookup mechanism reports a higher rts or wts, additional transactions may abort, but correctness will be preserved.

Figure 4.8 shows the microarchitecture of the metadata storage structure. Our implementation has one such structure at every LLC partition, responsible for the same address range. It consists of two tables, accessed simultaneously during lookups: the first tracks precise metadata for addresses accessed by in-flight transactions, while the second tracks approximate rts and wts for all other addresses.

Precise metadata for in-flight accesses. The precise metadata table is similar to a cuckoo hash table [182], extended with a small stash [115] (conceptually similar to a victim cache); even a small stash allows the cuckoo table to maintain higher occupancy with limited resources [115].
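The following is a toy Python sketch of a cuckoo hash table backed by a small stash, in the spirit of the precise metadata table. The hardware uses four H3 hash functions; here simple multiplicative hashes stand in for them, and the class name `CuckooWithStash` is ours:

```python
# Hypothetical sketch: a multi-way cuckoo hash table with a small victim stash.
# When an insertion displaces entries too many times, the last displaced entry
# is parked in the stash, which is searched alongside the table on lookups.

class CuckooWithStash:
    def __init__(self, ways=4, slots_per_way=64, stash_size=4, max_swaps=16):
        self.tables = [[None] * slots_per_way for _ in range(ways)]
        self.slots = slots_per_way
        self.stash = []                      # small fully associative victim store
        self.stash_size = stash_size
        self.max_swaps = max_swaps
        self.seeds = [0x9E3779B1, 0x85EBCA77, 0xC2B2AE3D, 0x27D4EB2F]

    def _index(self, way, key):
        return ((key * self.seeds[way]) & 0xFFFFFFFF) % self.slots

    def lookup(self, key):
        for w, table in enumerate(self.tables):
            entry = table[self._index(w, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        for k, v in self.stash:              # searched in parallel in hardware
            if k == key:
                return v
        return None

    def insert(self, key, value):
        item = (key, value)
        for attempt in range(self.max_swaps):
            for w in range(len(self.tables)):
                idx = self._index(w, item[0])
                entry = self.tables[w][idx]
                if entry is None or entry[0] == item[0]:
                    self.tables[w][idx] = item
                    return True
            w = attempt % len(self.tables)   # displace from a rotating way
            idx = self._index(w, item[0])
            item, self.tables[w][idx] = self.tables[w][idx], item
        if len(self.stash) < self.stash_size:
            self.stash.append(item)          # last displaced entry -> stash
            return True
        return False                         # would spill to the overflow area
```

Even a tiny stash lets insertions succeed when a handful of keys collide in every way, which is the pathological case the stash is meant to absorb.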
When inserting a 〈key, value〉 pair causes too many swaps in the cuckoo table, the last 〈key, value〉 pair swapped out during the insertion process is placed in the stash, and during lookups the stash is searched in parallel with the cuckoo table itself. We use a four-way cuckoo table with four randomly generated H3 hashes [203] and a 4-entry fully associative stash. To permit very long transactions, the precise table and stash can spill to an unbounded overflow space located in main memory and cached in the LLC. In our experiments the overflow space was never used, so we organized the overflow as a linked list; a commercial implementation would likely use an asymptotically faster design such as a balanced tree or another hash table layer in main memory.

Unlike the original cuckoo table, our design allows the insertion process to terminate by evicting an entry that has not been reserved by any transaction (i.e., #writes is zero). Since the remaining metadata — wts and rts — can be safely approximated, the evicted entry is inserted into the approximate metadata structure described below.

Approximate metadata for inactive locations. The simplest design for approximate version tracking is a pair of registers tracking the maximum wts and rts that have been evicted from the precise table. When a lookup misses in the precise table, it is reinserted using the approximate wts and rts values from the two registers. When we conducted experiments with this configuration, however, we found that the version numbers increased very quickly and caused many aborts.

To combine efficient storage of large numbers of evicted addresses with the ability to discriminate among many of them, we use a recency Bloom filter [77]. This structure consists of several (in our case, four) ways indexed by different hashes of the lookup address (we again use H3 hashes). Each address maps to one entry in each way, and each entry stores the maximum wts and rts of all inserted addresses that map to it.
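A structure of this shape can be sketched as follows (a hypothetical Python model; simple multiplicative hashes stand in for the H3 hashes, and the class name `RecencyBloomFilter` is ours). Because each entry keeps running maxima and lookups take the per-component minimum across ways, hash collisions can only overestimate timestamps, which is safe for GETM:

```python
# Hypothetical sketch of a recency Bloom filter: four ways, each storing the
# maximum (wts, rts) of all addresses hashing into an entry; lookups return the
# componentwise minimum across ways, so collisions never underestimate.

class RecencyBloomFilter:
    def __init__(self, ways=4, entries_per_way=256):
        self.tables = [[(0, 0)] * entries_per_way for _ in range(ways)]
        self.entries = entries_per_way
        self.seeds = [0x9E3779B1, 0x85EBCA77, 0xC2B2AE3D, 0x27D4EB2F]

    def _index(self, way, addr):
        return ((addr * self.seeds[way]) >> 8) % self.entries

    def insert(self, addr, wts, rts):
        for w, table in enumerate(self.tables):
            i = self._index(w, addr)
            old_wts, old_rts = table[i]
            table[i] = (max(old_wts, wts), max(old_rts, rts))

    def lookup(self, addr):
        entries = [t[self._index(w, addr)] for w, t in enumerate(self.tables)]
        return (min(e[0] for e in entries), min(e[1] for e in entries))
```

A lookup of an address that was never inserted returns at most the global maxima seen so far, and typically much less, which is exactly the overestimate-only behavior the protocol requires.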
On insertion, the wts and rts in each way are only updated if they exceed the stored values (which may have come from a hash collision), and on lookup the minimum wts and rts among the four ways are returned.

Timestamp rollover. Unlike physical timestamps [76], logical timestamps advance very slowly. In our experiments, the increment rates ranged from one increment in 1,265 cycles to one in 15,836 cycles, depending on the benchmark. At this rate and with a 1 GHz clock, 32-bit timestamps will roll over less than once every 1.5 hours, and 48-bit timestamps will roll over less than once every 11 years.

Figure 4.9: Stall buffer microarchitecture.

When a validation unit detects a rollover, it must ensure that (a) all validation units roll over atomically, and (b) all SIMT cores have rolled over. The first task can be accomplished via two messages (containing the VU ID to break ties) sent via a single-wire ring connecting all validation units. The first message indicates that the recipient should stall and forward the message to its neighbour; all VUs will be known to have stalled when the message reaches back to the originating VU. The second message indicates that the recipient should roll over and continue execution. (Alternatively, the existing interconnect can be used for this purpose with an ack–reply protocol.) Cores roll over on a request from the VUs sent over the interconnect.
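As an aside, the rollover intervals quoted above follow from simple arithmetic, using the fastest observed advance rate (one increment per 1,265 cycles) and the 1 GHz clock assumed in the text:

```python
# Back-of-the-envelope check of the timestamp rollover intervals quoted above.
CYCLES_PER_INCREMENT = 1265   # fastest observed advance rate
CLOCK_HZ = 1e9                # 1 GHz clock assumed in the estimate

def rollover_interval_seconds(timestamp_bits):
    """Seconds until a timestamp of the given width wraps around."""
    return (2 ** timestamp_bits) * CYCLES_PER_INCREMENT / CLOCK_HZ

hours_32bit = rollover_interval_seconds(32) / 3600               # about 1.5 hours
years_48bit = rollover_interval_seconds(48) / (3600 * 24 * 365)  # about 11 years
```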
Once the cores have acked the request, the VU knows that no requests are in flight; it flushes the stall buffer and metadata tables and resumes.

Stall buffer

Requests that passed the version check but found the address locked are queued in a stall buffer until the relevant transaction commits or aborts (see Chapter 4.4).

The organization of this structure, shown in Figure 4.9, is similar to a store buffer or an MSHR, but tracks several requests for each address (from different warps contending for the same location). When a committing transaction decrements the #writes count to 0, it checks whether any stall buffer entries are waiting on the relevant address; if so, the oldest request (i.e., with the minimum warpts) re-enters the validation unit. If the buffer is full, the transaction aborts.

Baseline GPU
  SIMT core config:           15 cores, 48 × 32-wide warps / core, 2 × 16-wide SIMD
  warp scheduler:             greedy then oldest (GTO)
  in-core storage:            32,768 registers / core, 16KB shared memory / core
  L1 data cache:              48KB per core, 128-byte lines, 6-way assoc.
  L2 cache (LLC):             128KB / partition, 128-byte lines, 8-way assoc.
  interconnect:               2 xbars (1 up, 1 down), 288GB/s each, 5-cycle latency
  operating frequency:        SIMT core: 1400 MHz, interconnect: 1400 MHz, memory: 924 × 4 (quad-pumped)
  GDDR5:                      6 partitions, 32 queued requests each, FR-FCFS, Hynix H5GQ1H24AFR timing, total BW 177GB/s
  memory scheduling latency:  L1: 1 cycle; LLC: 330 cycles; DRAM: 200 cycles

Transactional memory support
  concurrency (tx warps/core):   1, 2, 4, 8, 16, unlimited (optimal for each benchmark)
  operating frequency:           validation unit: 1400 MHz, commit unit: 700 MHz
  metadata storage:              precise: 4K entries (total) in 4-bank cuckoo HTs, 4-entry stashes; approx.: 1K entries (total) in 4-bank recency Bloom filters
  stall buffer:                  4 lines with 4 entries each, per partition
  validation BW:                 1 request/cycle per partition
  commit BW:                     32B/cycle per partition
  intra-warp conflict detection: two-phase parallel, 4KB ownership table / tx warp

Table 4.2: Simulated GPU and memory hierarchy for GETM.
4.5.3 Commit-Time Coalescing

The commit unit receives write logs from SIMT cores, coalesces multiple accesses to the same 32-byte regions, writes the data to the LLC, and decrements the relevant #writes entries. While coalescing is not needed for correctness, it efficiently uses the GPU's wide LLC port.

To coalesce writes, we use a simplified variant of the ring buffer used in KiloTM [77] and WarpTM [76]. In contrast to these proposals, in GETM the commit unit receives only the write log, so the buffer can be substantially reduced; we conservatively size it to half of that in prior work.

4.6 Methodology

Simulation setup. We follow the methodology established in previous GPU hardware transactional memory proposals [55, 76, 77]. GPGPU-Sim 3.x [31] is used to simulate the GPU and modified to implement GETM and prior proposals. We estimated area and power overheads of the structures required to implement TM by modelling them in CACTI 6.5 [154], conservatively assuming that all structures are accessed every cycle and accounting for the higher validation unit clock. We assumed a 32nm node (the smallest supported by CACTI 6.5).

Table 4.2 describes the simulation setup. For fair comparison of the eager conflict detection mechanism with the value-based detection from prior proposals, we keep the same baseline: a GPGPU similar to NVIDIA's GTX480 (Fermi [165]) with 15 cores, 6 memory partitions, and latencies derived from microbenchmark studies [241].

  name                          abbreviation  description
  Hash Table (CUDA)             HT-H          populate an 8000-entry hash table
                                HT-M          populate an 80000-entry hash table
                                HT-L          populate an 800000-entry hash table
  Bank Account (CUDA)           ATM           parallel funds transfer (1M accounts)
  Cloth Physics [40] (OpenCL)   CL            cloth physics (60K edges)
                                CLto          tx-optimized version of CL
  Barnes Hut [42] (CUDA)        BH            build an octree (30K bodies)
  CudaCuts [235] (CUDA)         CC            image segmentation (200×150 pixels)
  Data Mining [110] (CUDA)      AP            data mining (4000 records)

Table 4.3: Benchmarks used for GETM.
To investigate scalability to higher core counts, we also simulated a configuration with 56 cores in 28 clusters, and a 4MB L2 cache in eight 8-way banks; for WarpTM, we doubled the recency filter size, and for GETM we doubled only the precise metadata table.

Baselines. We compare GETM against WarpTM [76], and an idealized implementation of the EarlyAbort/Pause-n-Go (EAPG) proposal [55].3 We use TM benchmarks from prior work [76, 77]; they are summarized in Table 4.3.

3 Specifically, write signatures broadcast to cores were idealized as 64-bit messages, refcount table updates on the LLC side were idealized to one cycle for the entire tx log, and the early conflict check was made instant.

4.7 Evaluation Results

4.7.1 Performance Analysis

Figure 4.10: Transaction-only execution and wait time, normalized to the WarpTM baseline (lower is better). Note that EAPG is idealized.

Figure 4.11: Program execution time normalized to the fine-grained lock baseline, including transactional and non-transactional parts (lower is better).

Figure 4.10 shows the total number of cycles spent executing transactions and waiting for other transactions to finish, normalized to the WarpTM baseline. For most workloads, GETM reduces both transaction execution time and wait time. CC and AP have contention over few memory locations, and GETM sees many aborts; because commits and aborts are cheap in GETM, however, this is still faster than WarpTM and EAPG. In CC and AP, transactions spend little time waiting because they account for a small portion of the total runtime. We find that, for these benchmarks, even idealized EAPG is ineffective, as only 5.2% of aborts come from
the early-abort mechanism and 1.3% of transactions are ever paused. Essentially, by the time a broadcast update reaches the cores, most conflicting transactions have already been sent for validation/commit. In fact, EAPG underperforms WarpTM because the additional early-abort broadcasts congest the core↔LLC interconnect (even though these are idealized as single header-only flits). We expect that EAPG can be effective only with extremely long transactions.

Figure 4.12: Crossbar traffic load normalized to WarpTM (lower is better).

Figure 4.13: Mean latency of the cuckoo table in the metadata storage structure (≥ 1.0, lower is better).

Overall performance is shown in Figure 4.11: on average, GETM outperforms WarpTM by 1.2× (gmean) and is within 7% of the fine-grained lock baseline. The trend mirrors that of the transactional execution and wait time above. Benchmarks with high contention benefit more, because GETM aborts doomed transactions without the need to queue at the LLC for value-based validation, and show substantial improvements (up to 2.1× for HT-H). Low-contention workloads perform comparably to WarpTM.

Figure 4.14: Performance sensitivity of GETM to metadata table size and tracking granularity, normalized to a WarpTM baseline (lower is better). (a) sensitivity to metadata table size; (b) sensitivity to metadata tracking granularity.

The improved performance comes at a minor cost in interconnect traffic compared to WarpTM (Figure 4.12). Although GETM does not need to transmit the transaction read log at commit time, it needs to acquire locks for every write at encounter time, whereas WarpTM only contacts the TCD for loads.
In addition, despite better performance, GETM has a higher abort rate, which adds to the interconnect traffic load.

4.7.2 Sensitivity Analysis

Because the validation unit contains a cuckoo-like structure where worst-case insertions can take many cycles, we measured the average number of validation unit cycles spent on accessing the metadata tables for each request (Figure 4.13). The combination of allowing evictions to the approximate table and the small stash makes insertions very efficient. Even under very high load factors (> 99%), long insert chains where all entries have #writes > 0 are very unlikely; when they do occur, the stash is effective as predicted theoretically [115].

Figure 4.15: The maximum number of addresses queued at any given time (total of all stall buffers in the GPU).

Figure 4.16: The average number of requests per address that concurrently reside in the stall buffer.

We also investigated the effect of changing metadata table sizes and granularity (Figure 4.14); we tested 2K, 4K, and 8K entries GPU-wide, and 16, 32, 64, and 128-byte granularity assuming 4K table entries GPU-wide. A 2K metadata footprint is too small (and, indeed, requires a larger stash), especially when parallelism is abundant (e.g., HT-H); because 8K entries do not significantly outperform 4K entries, we settled on 4K entries for other parts of the evaluation. Decreasing granularity generally improves performance because false sharing is reduced; however, it also reduces effective table size when parallelism is high and the total number of addresses accessed is higher. We chose 32-byte granularity for all other tests.

Since requests that pass the timestamp check but find their target location reserved are queued in the stall buffer, we measured stall buffer performance.
Figure 4.15 shows the maximum total occupancy of all stall buffers; this never rises above 12 requests across the entire GPU. Figure 4.16 shows that very few requests are queued up on average for any given address. In the rest of the evaluation, we conservatively sized the stall buffers to 4 addresses with space for 4 requests each.

Figure 4.17: Program execution time in 15-core and 56-core GPUs, normalized to 15-core WarpTM (lower is better).

4.7.3 Transaction Abort Rates

Both WarpTM and GETM limit transactional concurrency to optimize performance. Table 4.4 lists the best concurrency settings for each benchmark — i.e., the number of warps in each core allowed to run transactions concurrently — and the resulting number of aborted transactions.

               best concurrency              aborts / 1K commits
           WTM  EAPG  WTM-EL  GETM       WTM  EAPG  WTM-EL  GETM
  HT-H        2     2       8     8       119   113     122    460
  HT-M        8     4       8     8        98    84      83    172
  HT-L        8     4       8     8        80    78      78    207
  ATM         4     4       4     4        27    26      25    114
  CL          2     2       4     4        93    91     119    205
  CLto        4     2       4     4       110    61      72    176
  BH          2     2       8     ∞        93    86     145    865
  CC          ∞     ∞       ∞     ∞         6     5       1     38
  AP          1     1       1     1       231   237     204   9188

Table 4.4: Optimal concurrency (# warp transactions per core) settings and abort rates for different workloads. (WTM = WarpTM.)
With abundant parallelism (e.g., HT-H), GETM is efficient at higher concurrency than WarpTM. The eager conflict detection in GETM also translates to dramatically faster commits and aborts than the value-based conflict detection in WarpTM, so GETM can handle higher abort rates and still perform substantially better.

4.7.4 Scalability

To investigate scalability at higher core counts, we also simulated WarpTM and GETM in a configuration with 56 SIMT cores and a 4MB LLC; Figure 4.17 shows the results. While performance differences vary slightly per benchmark, the overall trends match the 15-core setup.

4.7.5 Area and Power Cost

  element                                    area [mm²]   power [mW]
  WarpTM
    CU: LWHR tables (3KB×6)                     0.108        21.84
    CU: LWHR filters (2KB×6)                    0.03         12.00
    CU: entry arrays (19KB×6)                   0.402       100.62
    CU: read-write buffers (32KB×6)             1.734       132.48
    TCD: first-read tables (12KB×15)            0.375       113.25
    TCD: last-write buffer (16KB total)         0.031         9.86
    total WarpTM                                2.68        390.05
  EAPG (in addition to WarpTM)
    CAT: Conflict Address Table (12KB×15)       0.6         153.3
    RCT: Reference Count Table (15KB×6)         0.294        75.6
    total EAPG                                  3.574       618.95
  GETM (independent of WarpTM)
    CU: write buffers (16KB×6)                  0.522        85.56
    VU: precise tables (64KB total)             0.181        69.59
    VU: approximate tables (8KB total)          0.018         8.51
    warpts tables (192B×15)                     0.015        10.65
    stall buffer (30B×4×6)                      0.0004        2.67
    total GETM                                  0.736       176.98

Table 4.5: CACTI area and power (dynamic + static) estimates for WarpTM [76], EAPG [55], and GETM overheads (32nm node). CU: commit unit; TCD: temporal conflict detection; VU: validation unit.

Table 4.5 shows the area and power overheads introduced by adding TM support. Because GETM removes most of the structures needed by WarpTM, it has 3.6× lower area overheads and 2.2× lower power overheads (4.9× and 3.6× lower than EAPG). Overall, GETM adds ∼0.2% area to a GTX 480 die scaled down to 32nm.

4.8 Summary

In this chapter, we present GETM, the first full GPU transactional memory mechanism with eager conflict resolution.
By combining explicit version tracking with encounter-time write reservations, GETM enables efficient conflict detection and off-the-critical-path commits. GETM is up to 2.1× faster than the state-of-the-art GPU TM (1.2× gmean), while incurring 3.6× lower area overheads and 2.2× lower power overheads.

Chapter 5

Cache Coherence Protocol for Hierarchical Multi-GPU Systems

This chapter studies cache coherence protocols across multi-GPU systems for inter-GPU peer caching. We propose HMG, a hardware-managed cache coherence protocol designed to extend coherence guarantees across forward-looking hierarchical multi-GPU systems with scoped memory consistency models. Unlike prior CPU and GPU protocols that enforce multi-copy-atomicity and/or track ownership within the protocol, HMG eliminates transient states and/or extra hardware structures that would otherwise be needed to cover the latency of acquiring write permissions [216, 221]. HMG also filters out unnecessary invalidation acknowledgment messages, since a write can be processed instantly if multi-copy-atomicity is not required. Similarly, unlike prior work that filters coherence traffic by tracking the read-only state of data regions [216, 253], HMG relies on precise but hierarchical tracking of sharers to mitigate the performance impact of bandwidth-limited inter-GPU links without adding unnecessary coherence traffic.
In a 4-GPU system, HMG improves performance over a software-controlled, bulk invalidation-based coherence mechanism by 26% and over a non-hierarchical hardware cache coherence protocol by 18%, thereby achieving 97% of the performance of an idealized caching system.

As the demand for GPU compute continues to grow beyond what a single die can deliver [57, 91, 105, 215], GPU vendors are turning to new packaging technologies such as Multi-Chip Modules (MCMs) [29] and new networking technologies such as NVIDIA's NVLink [160] and NVSwitch [162] and AMD's xGMI [1] in order to build ever-larger GPU systems [170, 173, 174]. Consequently, as Figure 5.1 depicts, modern GPU systems are becoming increasingly hierarchical.

Figure 5.1: Forward-looking multi-GPU system. Each GPU has multiple GPU Modules (GPMs).

However, due to physical limitations, the large bandwidth discrepancy between existing inter-GPU links [1, 160] and on-package integration technologies [187] can contribute to Non-Uniform Memory Access (NUMA) behavior that often bottlenecks performance. Following established principles, GPUs use aggressive caching to recover some of the performance loss created by the NUMA effect [29, 143, 253], and these caches are kept coherent with lightweight coherence protocols that are implemented in software [29, 143], hardware [221, 253], or a mix of both [216].

GPUs originally assumed that inter-thread synchronization would be coarse-grained and infrequent, and hence they adopted a bulk-synchronous programming (BSP) model for simplicity. This paradigm disallowed any data communication among Cooperative Thread Arrays (CTAs) of active kernels. However, in emerging applications, less-restrictive data sharing patterns and fine-grained synchronization are expected to be more frequent [43, 54, 113, 217]. BSP is too rigid and inefficient to support these new sharing patterns.
To extend GPUs into more general-purpose domains, GPU vendors have released precisely-defined scoped memory models [2, 98, 112, 135]. These models allow flexible communication and synchronization among threads in the same CTA, the same GPU, or anywhere in the system, usually by requiring programmers to provide scope annotations for synchronization operations.

Figure 5.2: Benefits of caching remote GPU data under three different protocols on a 4-GPU system with 4 GPMs per GPU, all normalized to a baseline which has no such caching.

Scopes allow synchronization and coherence to be maintained entirely within a narrow subset of the full-system cache hierarchy, thereby delivering improved performance over system-wide synchronization enforcement. Furthermore, unlike most CPU memory models, these GPU models are now non-multi-copy-atomic: they do not require that memory accesses become visible to all observers at the same logical point in time. As a result, there is room for forward-looking GPU cache coherence protocols to be made even more relaxed, and hence capable of delivering even higher throughput, than protocols proposed in prior work (as outlined in Chapter 5.4).

Previously explored GPU coherence schemes [18, 29, 95, 143, 195, 216, 221, 253] were well tuned for GPUs and much simpler than CPU protocols, but few have studied how to scale these protocols to larger multi-GPU systems with deeper cache hierarchies. To test their efficiency, we simply apply existing software and hardware (GPU-VI [221]) coherence protocols to a 4-GPU system, in which each GPU consists of 4 separate GPU Modules (GPMs).
These protocols do not account for architectural hierarchy; we simply extend them as if the system were a flat platform of 16 GPMs. As Figure 5.2 shows, in hierarchical multi-GPU systems, existing non-hierarchical software and hardware VI coherence protocols indeed leave plenty of room for improvement; see Chapter 5.4 for details. We therefore ask the following question: how do we extend existing GPU coherence protocols across multiple GPUs while simultaneously providing high-performance support for flexible fine-grained synchronization within emerging GPU applications, and without a dramatic increase in protocol complexity? In this chapter, we answer this question with HMG, a hierarchical cache coherence protocol that is able to scale up to large multi-GPU systems, while nevertheless maintaining the simplicity and implementability that made prior GPU cache coherence protocols popular.

5.1 Emerging Programs Need Fine-Grained Communication

Nowadays, many applications contain fine-grained communication between CTAs of the same kernel and/or of dependent kernels [53, 58, 83, 118, 130, 184, 258]. For example, in RNNs, abundant inter-CTA communication exists in the neuron connections between consecutive timesteps [64]. In the simulation of molecule or neutron dynamics [184, 258], inter-CTA communication is necessary for the movement dependency between different particles and different simulation timesteps. Graph algorithms usually dispatch vertices among multiple CTAs or kernels that need to exchange their individual updates to the graph for the next round of computing until they reach convergence [91, 118]. We provide more details on the workloads we study in Chapter 5.7.
All these applications can benefit from a hierarchical GPU system for higher performance and from the scoped memory model for efficient inter-CTA synchronization enforcement.

5.2 GPU Weak Memory Model

In this section, to avoid confusion around the term "shared memory", which is used to describe scratchpad memory on NVIDIA GPUs, we use "global memory" for the virtual address space shared by all CPUs and GPUs in a system.

Both CUDA and OpenCL originally supported a coarse-grained bulk-synchronous programming model. Under this paradigm, data sharing between threads of the same CTA could be performed locally in the shared memory scratchpad and synchronized using CTA execution barriers; but inter-CTA synchronization was permitted only between dependent kernel calls (i.e., where data produced by one kernel is consumed by the following kernels). Threads could not, with guaranteed correctness, perform arbitrary communication using global memory. While many GPU applications work very well under a bulk-synchronous model with rare inter-CTA synchronization, it quickly becomes a limiting factor for the types of emerging applications described in Chapter 5.1.

To support data sharing more flexibly and more efficiently, both CUDA and OpenCL have shifted from bulk-synchronous models to more general-purpose scoped memory models [2, 98, 112, 135, 168]. By incorporating the notion of scope, these new models allow each thread to communicate with any other thread in the same CTA (.cta), the same GPU (.gpu), and anywhere in the system (.sys)¹. Scope indicates the set of threads with which a particular memory access wishes to synchronize.
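These scope semantics can be illustrated with a small predicate. The (gpu, cta) thread coordinates and the function below are our own illustrative encoding, not any real GPU API:

```python
# Sketch: which other threads a memory access of a given scope
# synchronizes with. Threads are identified by illustrative
# (gpu, cta) coordinate pairs; this is not a real GPU interface.
def same_scope(scope, a, b):
    """True if threads a and b fall within `scope` of each other."""
    if scope == ".cta":
        return a == b               # same GPU and same CTA
    if scope == ".gpu":
        return a[0] == b[0]         # same GPU, any CTA
    if scope == ".sys":
        return True                 # anywhere in the system
    raise ValueError(scope)
```

For example, two threads in different CTAs of the same GPU are within .gpu scope of each other, but not within .cta scope.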
Synchronization of scope .cta is performed in the L1 cache of each GPU Streaming Multiprocessor (SM); synchronizations of scope .gpu and .sys are processed through the GPU L2 cache and via the memory hierarchy of the whole system, respectively.

5.3 Existing GPU Cache Coherence

Some GPU proposals advocate for strong memory models, and hence they propose sophisticated cache coherence protocols capable of delivering good performance [195]. Most other GPU protocols enforce variants of release consistency by invalidating possibly stale values in caches when performing acquire operations (implicitly including the start of a kernel), and by flushing dirty data during release operations (implicitly including the end of a kernel). Much of the research in the area today proposes optimizations on top of these basic principles. We broadly classify this work by whether reads or writes are responsible for invalidating stale data.

Among read-initiated protocols, hLRC [18] elided unnecessary cache invalidations and flushes by lazily performing coherence actions when synchronization variables change registration. Furthermore, the recent proposals of DeNovo [216] and VIPS [117] can protect read-only or private data from invalidation. However, they incur additional overheads and/or require software changes to convey region information for read-only data, ownership tracking at word granularity, or coarse-grained (memory page level)² private/shared data classification.

¹ We use NVIDIA terminology in this chapter. Equivalent scopes in HRF are work-group, device, and system [98].

As for write-initiated cache invalidations, previous work has observed that MESI-like coherence protocols are a poor fit for GPUs [95, 221]. QuickRelease [95] reduced the overhead of cache flushes by enforcing a partial order of writes with a FIFO. However, QuickRelease needs to partition the resources required by reads and writes; it also broadcasts invalidations to all remote caches.
GPU-VI [221] is a simple yet effective hardware cache coherence protocol, but it predated scoped memory models and introduced extra overheads to enforce multi-copy-atomicity, which is no longer necessary. Also, GPU-VI was proposed for use within a single GPU, and did not consider the added complexity of having various bandwidth tiers.

5.4 The Novel Coherence Needs of Modern Multi-GPU Systems

To scale coherence across multiple GPUs, the design of HMG not only considers the architectural hierarchy of modern GPU systems (Figure 5.1), but also aggressively takes advantage of the latest scoped memory models (Chapter 5.2). Before diving into the details of HMG, we first describe our main insights below.

5.4.1 Extending Coherence to Multiple GPUs

As described in Chapter 5.3, prior GPU coherence protocols mainly focused on mechanisms that mitigate the impact of bulk cache invalidations. However, as Figure 5.2 shows, even fine-grained hardware VI cannot close the gap between what non-hierarchical protocols achieve and an idealized caching scenario. In future multi-GPUs, larger shared L2 caches will only amplify the cost of coarse-grained cache invalidations and of reloading invalidated data from remote GPUs via bandwidth-limited links. Indeed, Figure 5.3 shows that it is common for multiple GPMs on the same GPU to redundantly access a common range of addresses stored on remote GPUs. We therefore build HMG as a hierarchical protocol capable of being extended across multiple GPUs.

² GPUs need large pages (e.g., 2MB) to ensure high TLB coverage.
Smaller pages can cause severe TLB bottlenecks [30].

Figure 5.3: Percentage of inter-GPU loads destined to addresses accessed by another GPM in the same GPU.

There has been much research into hierarchical cache coherence for CPUs. However, unlike GPUs, CPUs usually enforce a stronger memory model (e.g., TSO) and have much stricter latency requirements. As such, CPU coherence protocols such as MESI track ownership to exploit write data locality [86, 126, 153]. Many transient states are added to reduce coherence stalls, resulting in prohibitive verification complexity [155, 224]. Industrial products implemented more aggressive optimizations. For example, Sun's WildFire had special OS support for memory page replication and migration [88]. Intel's Skylake introduced an IO directory cache and a HitMe cache to reduce memory access latency [153]. These complexities are appropriate for latency-bound CPUs, but GPUs permit far more relaxed memory behavior, and hence HMG shows that the costs of such CPU-like protocols remain unnecessary for multi-GPUs.

5.4.2 Leveraging GPU Weak Memory Models

Besides the change of hardware architecture, scoped GPU memory models also inform the design of a good GPU coherence hierarchy. While non-scoped CPU memory models require all memory accesses to be kept coherent, GPU memory models that do explicitly expose scopes as part of the programming model require coherence to be enforced only at synchronization boundaries, and only with respect to other threads in the scope in question. The NVIDIA GPU memory model makes this relaxed nature of coherence very explicit [135].
A common pattern in multi-GPU applications will be for CTAs or kernels running on a single GPU to synchronize with each other first, and with kernels on other GPUs less frequently. Such patterns rely heavily on the comparative efficiency of .gpu scope over .sys scope; while some prior work has concluded that scopes are unnecessary within a single GPU [216], the latency/bandwidth gap between the broadest and narrowest scope is an order of magnitude larger in multi-GPU environments.

Furthermore, although some prior work has proposed multi-copy-atomic memory models for GPUs [16], recent GPU scoped memory models have since formalized the lack of such a requirement [98, 135]. Loosely speaking, multi-copy-atomicity requires memory to behave as if it were a single atomic unit, with only thread-private buffering allowed between cores and memory. As GPUs share an L1 cache across an SM, GPUs today are not multi-copy-atomic. Multi-copy-atomicity can also create apparent delays for subsequent memory accesses. Most CPUs enforce multi-copy-atomicity by using sophisticated coherence protocols with many transient states and by using out-of-order execution and speculation to hide the latency overheads. Some prior studies have found ways for single-GPU coherence protocols to tolerate multi-copy-atomicity as well. For example, to reduce stalls, GPU-VI [221] added 3 and 12 transient states and 24 and 41 coherence state transitions in the L1 and L2 caches, respectively. In multi-GPU environments, however, the round trip time to remote GPUs is an order of magnitude larger and would put significantly increased pressure on the coherence protocol's ability to hide the latency. Instead, by leveraging non-multi-copy-atomicity, HMG eliminates transient states and invalidation acknowledgments altogether.

5.5 Baseline Non-Hierarchical Cache Coherence

We now describe how a non-hierarchical cache coherence (NHCC) protocol can be optimized for modern weak GPU memory models.
Like most scoped protocols, NHCC propagates synchronizing memory accesses to different caches according to the user-provided scope annotations. Compared to GPU-VI [221], NHCC eliminates all transient states and most invalidation acknowledgments. However, it does not consider the architectural hierarchy. As such, it will serve as our baseline during our later evaluations. In the next section, we will extend NHCC with a notion of hierarchy so that it scales better on larger multi-GPU systems like Figure 5.1.

Figure 5.4: Future GPUs will consist of multiple GPU Modules (GPMs), and each GPM might be a chiplet in a single package.

5.5.1 Architectural Overview

A high-level diagram of our baseline single-GPU architecture for NHCC is shown in Figure 5.4. We assume L1 caches remain software-managed and write-through, as in GPUs today. Each GPU Module (GPM) has an L2 cache that holds both local and remote-GPM DRAM accesses contending for cache capacity with a typical replacement policy such as Least Recently Used (LRU). To support hardware inter-GPM coherence, one GPM in the system is chosen by some hash function as the home node for any given physical address. The home node always contains the most up-to-date value at each memory location.

Like many protocols, NHCC attaches an individual directory to every L2 cache within each GPM. The coherence directory is organized as a traditional set-associative structure. Each directory entry tracks the identity of all GPM sharers, along with coherence state. Like GPU-VI [221], each line can be tracked in one of two stable states: Valid and Invalid.
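As a concrete sketch of this organization, a directory entry in the State:Addr:[Sharers] format of Figure 5.5 might look as follows; the dict keyed by address is our simplification of the real set-associative structure:

```python
# Sketch of an NHCC-style coherence directory entry and lookup.
# Entries use only the two stable states V (Valid) and I (Invalid);
# no transient states exist. The dict-backed organization below stands
# in for real set-associative hardware and is illustrative only.
class DirEntry:
    def __init__(self, addr):
        self.state = "I"        # "V" or "I"
        self.addr = addr        # tracked cache-line address
        self.sharers = set()    # GPM ids currently caching the line

    def __repr__(self):
        # Matches the State:Addr:[Sharers] format of Figure 5.5.
        return f"{self.state}:{self.addr}:{sorted(self.sharers)}"

class CoherenceDirectory:
    def __init__(self):
        self.entries = {}       # addr -> DirEntry

    def lookup(self, addr):
        """Return the entry tracking `addr`, allocating one if absent."""
        return self.entries.setdefault(addr, DirEntry(addr))
```

For instance, after GPM2 and GPM3 load an address homed at GPM0, the home node's entry would read V:addr:[2, 3], mirroring the V:A:[GPM2, GPM3] entry in Figure 5.5.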
However, unlike GPU-VI, NHCC does not have transient states, and it requires acknowledgments only for release operations. Non-synchronizing stores (i.e., the vast majority) do not require acknowledgments in NHCC.

Figure 5.5: NHCC coherence architecture. The dotted yellow boxes are the L2 caches from Figure 5.4. The shaded gray cache lines and directory entries indicate lines for which the GPM in question is the home node. The coherence directory entry format is State:Addr:[Sharers].

We assume a non-inclusive inter-GPM L2 cache architecture to enable data to be cached freely across the different GPMs. Figure 5.5 shows an example in which GPM0 serves as the home node for address A. Other GPMs may cache the value at A locally, but GPM0 maintains the authoritative copy. In the same figure, address B is cached in GPM1, even though GPM3 (the home node for B) is not caching B. Similarly, data can be cached in the home node only, as with address C in our example in Figure 5.5.

In NHCC, explicit coherence maintenance messages (i.e., cache invalidations) are sent only in two cases: when there is read-write sharing between CTAs on different GPMs, and when there is a directory capacity eviction. The fact that most memory accesses incur no coherence overhead ensures that the GPU does not deviate from peak throughput in the GPU common case where data is either read-only or CTA-private. We measure the impact of coherence messages in Chapter 5.8.

To explain the basics of NHCC, we track the life of a memory reference as an example. First, a memory access from the SM queries the L1 cache. Upon an L1 miss or write, the request is routed to the local GPM L2 cache.
If the request misses in the L2 (or writes to the L2, again assuming a write-through policy), the address is checked to determine if the local GPM is the home node for this reference. If so, the request is routed to local DRAM. Otherwise, the request is routed to the L2 cache of the home node via the inter-GPM links. The request may then hit in the home node L2 cache, or it may get routed through to that particular GPM's off-chip memory. We provide full details below.

5.5.2 Coherence Protocol Flows in Detail

Table 5.1 details the full operation of NHCC. In this table, "local" refers to operations issued by the same GPM as the L2 cache which is handling the request. "Remote" requests are those originally issued by other GPMs. We walk through the entries in the table below.

Table 5.1: NHCC and HMG coherence directory transition table. s refers to the sender of the message.

State | Local Ld | Local St/Atom | Remote Ld | Remote St/Atom | Replace Dir Entry | Invalidation
I | - | - | add s to sharers, →V | add s to sharers, →V | N/A | -
V | - | inv all sharers, →I | add s to sharers | add s to sharers, inv all other sharers | inv sharers, →I | forward inv to all sharers (HMG only), →I

Local Loads: When a local load request reaches the local L2 cache partition, if it hits, a reply is sent back to the requester directly. If the request misses, the next destination depends on where the data is mapped by the address hash function. If the local L2 cache partition happens to be the home node for the address being accessed, the request will be sent to DRAM. Otherwise, the load request will be forwarded to the home node. Loads with .gpu or .sys scope must always miss in the L1 cache and in the non-home L2 caches to guarantee forward progress.

Local Stores: Depending on the L2 design, local stores may be stored as dirty data in the L2 cache, or alternatively they may be written through and stored as clean data in the L2 cache. All stores with scope greater than .cta (i.e., .gpu and .sys) must be written through in order to ensure forward progress.
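The directory-side behavior summarized in Table 5.1 can be sketched as a pure function. The event names and the encoding below are our own, and the HMG-only forwarding is modeled behind an `hmg` flag; this is an illustrative sketch, not the hardware implementation:

```python
# Sketch of the directory transitions in Table 5.1. Given the current
# entry state ("V" or "I"), the sharer set, an event, and its sender
# `s`, return the new state, the new sharer set, and the invalidation
# messages generated. Note: no acknowledgments and no transient states.
def dir_transition(state, sharers, event, s=None, hmg=False):
    sharers, inv = set(sharers), []
    if event == "local_ld":                 # "-": no directory action
        pass
    elif event == "local_st":               # V: inv all sharers, -> I
        if state == "V":
            inv, state, sharers = sorted(sharers), "I", set()
    elif event == "remote_ld":              # add s to sharers, -> V
        sharers.add(s)
        state = "V"
    elif event == "remote_st":              # add s, inv all other sharers
        inv, sharers, state = sorted(sharers - {s}), {s}, "V"
    elif event == "replace":                # entry eviction: inv sharers
        inv, state, sharers = sorted(sharers), "I", set()
    elif event == "invalidation":           # inv from a home node
        if hmg and state == "V":            # HMG only: forward to sharers
            inv = sorted(sharers)
        state, sharers = "I", set()
    return state, sharers, inv
```

For example, a remote store from GPM1 to a line shared by GPM1 and GPM2 keeps the entry Valid, records GPM1 as the only sharer, and emits a background invalidation to GPM2.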
Data which is written back or written through the L2 is sent directly to DRAM if the local L2 cache is the home node for the address in question, or it is relayed to the home node otherwise. If the local GPM is the home node and the coherence directory has recorded any sharers for the address in question, then these sharers must be notified that the data has been changed. As such, a local store triggers an invalidation message being sent to each sharer. These invalidations propagate in the background, off the critical path of any subsequent reads. There are no invalidation acknowledgments.

Remote Loads: When a remote load arrives at the local home L2 cache, it either hits in the cache and returns data to the requester, or it misses and forwards the request to DRAM. The coherence directory also records the ID of the requesting node. If the line is already being tracked, the requesting ID is simply added as an additional sharer. If the line is not being tracked yet, a new entry is allocated in the directory, possibly by evicting another valid entry (discussed further below).

Remote Stores: Remote stores that arrive at a home L2 are cached and written through or written back to DRAM, depending on the configuration of the L2. Since the requesting GPM may also be caching the stored data, the requester is recorded as a sharer. Since the data has been changed, all other sharers should be invalidated.

Atomics and Reductions: Atomic operations must always be performed at the home node L2. From a coherence transition perspective, these operations are treated as stores.

Invalidations: Upon receiving an invalidation request, any local clean copy of the address in question is invalidated. No acknowledgment needs to be sent.

Directory Entry Eviction/Replacement: Because the coherence directory is implemented as a set-associative cache, there may be entry evictions due to capacity and conflict misses.
To ensure correctness, invalidation messages must be sent to all sharers of the entry that is being evicted. As with invalidations triggered by stores, these invalidations propagate in the background and do not require acknowledgments to be sent in return.

Acquire: Acquire operations greater than .cta scope (i.e., .gpu and .sys) invalidate the entire local L1 cache, following software coherence practice. However, they do not propagate past the L1 cache, as L2 coherence among GPMs is now maintained using NHCC.

Release: Release operations trigger a writeback of all dirty data to the respective home nodes, if writeback caches are being used. Releases also ensure completion of any write-through operations and invalidation messages that are still in flight. Release operations greater than .cta scope are propagated through the local L2 to all remote L2s to ensure that all invalidation messages have arrived at their destinations. Once this step is complete, each remote L2 sends back an acknowledgment for the release operation itself. The local L2 then collects these acknowledgments and returns a single response to the original requester.

Cache Eviction: Two design options are possible upon cache line eviction. First, a clean cache line being evicted from an L2 cache in a non-home GPM could send a downgrade message to the home node. This allows the home node to delete the remote node as a sharer and will potentially save an invalidation message from being sent later. However, this is not required for correctness. The second option is to have valid clean cache lines be silently evicted. This eliminates the overhead of the downgrade message, but it triggers an unneeded invalidation message upon eventual eviction of the coherence directory entry. Optionally, dirty cache lines being evicted and written back can use a new message type indicating that the data must be updated but that the issuing GPM need not be tracked as a sharer going forward.
Again, this optimization is not strictly required for correctness, but may be useful in implementations using writeback caches.

5.6 Hierarchical Multi-GPU Cache Coherence

Like most prior work, NHCC is designed for single-GPU scenarios and does not take the hierarchy between intra- and inter-GPU connections into account. This becomes a problem as we try to extend protocols like NHCC to multiple GPUs, as inter-GPU bandwidth limitations become a bottleneck.

To better exploit intra-GPU data locality, we propose a hierarchical multi-GPU (HMG) cache coherence protocol that extends NHCC to be able to take advantage of the type of locality that Figure 5.3 highlights. The HMG protocol fundamentally enables multiple cache requests from individual GPMs to be coalesced and/or cached within a single GPU before traversing the lower-bandwidth inter-GPU links, thereby saving bandwidth and energy.

Figure 5.6: Hierarchical coherence in multi-GPU systems. Loads are routed from the requesting GPM to the GPU home node, and then to the system home node, and responses are returned and cached accordingly. (a) Before: GPU0:GPM0 is about to load address B. (b) After: B is cached in the L2 of the GPU0 home node for B as well as in the L2 of the original requester.

5.6.1 Architectural Overview

HMG is composed of two layers.
The first layer is designed for intra-GPU caching, while the second layer is targeted at optimizing memory request routing in inter-GPU settings. For the intra-GPU layer, we define a GPU home node for each given address within each individual MCM-GPU. An MCM-GPU home node manages inter-GPM coherence using the NHCC protocol described in Chapter 5.5. Using the intra-GPU coherence layer, data that is cached within an MCM-GPU can be consumed by multiple GPMs on that GPU without consulting a remote GPU.

We define one of the GPU home nodes for each address to be the system home node. The choice of system home node can be made using any NUMA page allocation policy, such as first-touch page placement, NVIDIA Unified Memory [168], static distribution, or any other reasonable heuristic. Among multiple GPUs, sharers are tracked by the directory using a hierarchy-aware variant of the NHCC directory design. Specifically, each GPU home node will track any sharers among other GPMs in the same GPU. Each system home node will track any sharers among other GPUs, but not individual GPMs within these other GPUs. For an M-GPM, N-GPU system, each directory entry will therefore need to track as many as M + N − 2 sharers.

The hierarchical caching mechanism of an example two-GPU system is shown in Figure 5.6. Each GPU is shown with only two GPMs for brevity, but the protocol itself can extend to an arbitrary number of GPUs, with an arbitrary number of GPMs per GPU. In Figure 5.6(a), the system home node of address A is the L2 cache residing in GPU0:GPM0. This particular L2 cache also serves as the GPU home node for the same address within GPU0. The L2 cache in GPU0:GPM1 is kept coherent with the L2 cache in GPU0:GPM0 using the intra-GPU protocol layer. The L2 cache in GPU1:GPM0 serves as the GPU1 home node for address A, and it is kept coherent with the L2 cache in GPU1:GPM1 using the intra-GPU layer.
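The two-level home mapping just described, and the resulting M + N − 2 sharer bound (up to M − 1 GPM sharers at a GPU home node plus N − 1 GPU sharers at the system home node), can be sketched as follows. The modulo hash over cache-line indices is an illustrative assumption, since the text leaves the hash and NUMA placement policy open:

```python
# Sketch: two-level home-node mapping for an M-GPM, N-GPU system.
# Nodes are (gpu, gpm) pairs. Modulo hashing of the cache-line index
# is a stand-in for the real hash / NUMA page-placement policy.
M_GPMS, N_GPUS, LINE_BYTES = 2, 2, 128   # two-GPU example, as in Fig. 5.6

def gpu_home(addr, gpu):
    """GPU home node of `addr` within `gpu` (manages inter-GPM coherence)."""
    return (gpu, (addr // LINE_BYTES) % M_GPMS)

def sys_home(addr):
    """System home node of `addr` (one of the per-GPU home nodes)."""
    line = addr // LINE_BYTES
    return (line % N_GPUS, line % M_GPMS)

def max_sharers_per_entry(m=M_GPMS, n=N_GPUS):
    # A GPU home tracks up to m-1 GPM sharers; a system home tracks up
    # to n-1 GPU sharers: m + n - 2 sharers for one directory entry.
    return m + n - 2
```

For the 4-GPM, 4-GPU system evaluated later, this bound is 4 + 4 − 2 = 6 sharers per entry.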
Both GPU home nodes are kept coherent using the inter-GPU protocol layer.

Furthermore, suppose that from the state shown in Figure 5.6(a), GPU0:GPM0 wants to load address B, and the system home node for address B is mapped to GPU1:GPM1. GPU0:GPM1 is the GPU0 home node for B, so the load request propagates from GPU0:GPM0 to GPU0:GPM1 (the GPU home node), and then to GPU1:GPM1 (the system home node). When the response is sent back to the requester, GPU0 (but not GPU0:GPM0 or GPU0:GPM1) is recorded as a sharer by the directory of the system home node GPU1:GPM1, and GPU0:GPM0 is recorded as a sharer by the directory of the GPU0 home node GPU0:GPM1, as shown in Figure 5.6(b).

5.6.2 Coherence Protocol Flows in Detail

HMG behaves similarly to NHCC but adds the single extra transition marked "HMG only" in Table 5.1. No extra coherence states are added. We highlight the important differences between NHCC and HMG as follows.

Loads: Loads progress through the cache hierarchy from the local L2 cache, to the GPU home node, to the system home node. Specifically, loads that miss in the GPM-local L2 cache are routed to the GPU home node, unless the GPM-local L2 cache is already the GPU home node. From there, loads that miss in the GPU home node are routed to the system home node, unless the GPU home node is also the system home node. Loads that miss in the system home node are routed to DRAM.

Non-synchronizing loads (i.e., the vast majority) and loads with .cta scope can hit in all caches. However, loads with .gpu scope must miss in all caches prior to the GPU home node. Loads with .sys scope must also miss in the GPU home node; they may only hit in the system home node.

Loads propagating from the GPU home node to the system home node do not carry information about the GPM that originally requested the data. Because this information is already stored by the GPU home node, it would be redundant to store it again in the directory of the system home node.
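The hierarchical load path just described (requesting GPM to GPU home node to system home node to DRAM, skipping any levels that coincide) can be sketched as:

```python
# Sketch: the route of a load under HMG. Nodes are (gpu, gpm) pairs;
# a hop is skipped when the previous node already is that home node.
# "DRAM" is only reached if the load misses at the system home node.
def load_route(requester, gpu_home_node, sys_home_node):
    route = [requester]
    if gpu_home_node != route[-1]:
        route.append(gpu_home_node)
    if sys_home_node != route[-1]:
        route.append(sys_home_node)
    route.append("DRAM")
    return route
```

For the Figure 5.6 example (requester GPU0:GPM0, GPU home GPU0:GPM1, system home GPU1:GPM1), this yields [(0, 0), (0, 1), (1, 1), "DRAM"], matching the routing described in the text.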
Instead, invalidations are propagated to sharers hierarchically, as described below.

Stores: Stores are routed through a similar hierarchy as they write through and/or write back. Specifically, stores propagating past the GPM-local L2 cache are routed to the GPU home node (unless the GPM-local L2 is already the GPU home node), and stores propagating past the GPU home node are routed to the system home node (unless the GPU home node is already the system home node). Stores propagating past the system home node are written to DRAM. Similar to loads, stores or write-back/write-through operations propagating from the GPU home node to the system home node carry only the GPU identifier, not the identifier of the GPM within that GPU.

Stores must be written through at least to the home node for the scope in question: the L1 cache for non-synchronizing and .cta-scoped stores, the GPU home node for .gpu-scoped stores, and the system home node for .sys-scoped stores. This ensures that synchronization operations will make forward progress.

Atomics and Reductions: Atomics are always performed in the home node for the scope in question, and they continue to be treated as stores for the purposes of coherence protocol transitions, just as in NHCC. Once performed at the home node, responses are propagated back to the requester just as load responses are handled, and the result is stored as a dirty line or written through to subsequent levels of the cache hierarchy, just as a store would be. For example, the result of a .gpu-scoped atomic read-modify-write operation performed in the GPU home node will be written through to the system home node, in systems which configure the GPU home node to be write-through for stores.

Invalidations: Because sharers are tracked hierarchically, invalidations sent due to stores and directory evictions must also propagate hierarchically. Invalidations sent from the system or GPU home node to other GPMs in the same GPU are processed and dropped without acknowledgment, just as in NHCC.
However, in HMG any invalidations received by a GPU home node from the system home node must also be propagated to any and all GPM sharers within the same GPU. This is the special transition shown in Table 5.1 for HMG.

Acquire: As before, .cta-scoped acquire operations invalidate the local L1 cache, but nothing more, as all levels of L2 cache are being kept hardware-coherent.

Release: Release operations trigger writeback of all dirty data, at least to the home node for the scope being released. They also still ensure completion of any write-through operations and invalidation messages still in flight to the home node for the scope in question. A .gpu-scoped release operation, however, need not flush all write-back operations across the inter-GPU network before returning a completion acknowledgment to the original requester.

Table 5.2: Simulated GPU and memory hierarchy for HMG.

Structure | Configuration
Number of GPUs | 4
Number of SMs | 128 per GPU, 512 in total
Number of GPMs | 4 per GPU
GPU frequency | 1.3GHz
Max number of warps | 64 per SM
OS page size | 2MB
L1 data cache | 128KB per SM, 128B lines
L2 data cache | 12MB per GPU, 128B lines, 16 ways
L2 coherence directory | 12K entries per GPU module, each entry covers 4 cache lines
Inter-GPM bandwidth | 2TB/s per GPU, bi-directional
Inter-GPU bandwidth | 200GB/s per link, bi-directional
Total DRAM bandwidth | 1TB/s per GPU
Total DRAM capacity | 32GB per GPU

5.7 Methodology

To evaluate HMG, we use a proprietary industrial simulator to model the multi-GPU system described in Table 5.2. The simulator is driven by program traces that record instructions, registers, memory addresses, and CUDA events. All micro-architectural scheduling, and thus time for execution, is dynamic within the simulator and respects functional dependencies such as work scheduling, barrier synchronization, and memory access latencies. However, it cannot accurately model spin-lock synchronization in memory.
While this type of communication is legal on current NVIDIA hardware, it is not yet widely adopted due to performance overheads, and it is not present in our suite of workloads. Simulating the system-level effects of fine-grained synchronization, in reasonable time, without sacrificing fidelity [217, 229] remains an open problem for GPU researchers.

Figure 5.7 shows our simulator correlation versus an NVIDIA Quadro GV100 GPU across a range of targeted microbenchmarks, public, and proprietary workloads. Figure 5.7 also shows the corresponding data for GPGPU-Sim, a widely-used academic GPU architecture simulator [31, 108, 111, 128, 190], with simulations capped at running for about one week.

Figure 5.7: Simulator correlation vs. an NVIDIA Quadro GV100 and simulation runtime for our simulator and GPGPU-Sim.

Our simulator has a correlation coefficient of 0.99 and an average absolute error of 0.13. This compares favorably to GPGPU-Sim (at 0.99 and 0.045, respectively), as well as other recently reported simulator results [87], while being significantly faster, which allows us to run forward-looking GPU configurations more easily. Our simulator inherits the contiguous CTA scheduling and first-touch page placement policies from prior work [29, 143] to maximize data locality in memory.

To perform our evaluation, we choose a public subset of workloads (shown in Table 5.3) [53, 58, 83, 118, 130, 184, 258] that have sufficient parallelism to fill a 4-GPU system. These benchmarks utilize scoped and/or inter-kernel synchronization patterns. This ensures that performance does not regress on traditional workloads even as we accelerate workloads with more fine-grained sharing.
Specifically, cuSolver, namd2.10, and mst use .gpu-scoped synchronization explicitly; others utilize inter-kernel communication by launching frequent dependent kernels; and a few are traditional bulk-synchronous workloads providing a historical comparative baseline.

Benchmark                Abbrev.       Footprint
cuSolver                 cuSolver      1.60 GB
HPC_CoMD-xyz49           CoMD          313 MB
HPC_HPGMG                HPGMG         1.32 GB
HPC_MiniAMR-test2        MiniAMR       1.80 GB
HPC_MiniContact          MiniContact   246 MB
HPC_namd2.10             namd2.10      72 MB
HPC_Nekbone-10           Nekbone       178 MB
HPC_snap                 snap          3.44 GB
Lonestar_bfs-road-fla    bfs           26 MB
Lonestar_mst-road-fla    mst           83 MB
ML_AlexNet_conv2         AlexNet       812 MB
ML_GoogLeNet_conv2       GoogLeNet     1.15 GB
ML_lstm_layer2           lstm          710 MB
ML_overfeat_layer1       overfeat      618 MB
ML_resnet                resnet        3.20 GB
ML_RNN_layer4_DGRAD      RNN_DGRAD     29 MB
ML_RNN_layer4_FW         RNN_FW        40 MB
ML_RNN_layer4_WGRAD      RNN_WGRAD     38 MB
Rodinia_nw-16K-10        nw-16K        2.00 GB
Rodinia_pathfinder       pathfinder    1.49 GB

Table 5.3: Benchmarks used for HMG evaluation.

Coherence Protocol Implementations: This work implements and compares four coherence possibilities: a non-hierarchical software protocol (conventional software coherence with scopes and bulk-invalidation of caches), a non-hierarchical hardware protocol (NHCC), a hierarchical software protocol (conventional software coherence with a hierarchical extension to leverage scopes), and our proposed hierarchical hardware protocol (HMG). We also compare them to idealized caching that does not enforce coherence; this serves as a loose upper bound for the performance that can be achieved via hardware caching. For the non-hierarchical protocols, a multi-GPU system like that of Figure 5.1 behaves as a single flat GPU with more GPMs.

NHCC and HMG behave according to Chapters 5.5 and 5.6, respectively. Load-acquire operations in our software coherence protocols trigger bulk cache invalidations in any caches between the issuing SM and the home node for the scope in question. For example, .gpu-scoped loads will invalidate both the L1 cache of the issuing SM and the GPM-local L2 cache.
In the hierarchical protocol, .sys-scoped loads invalidate the L1 cache of the issuing SM and all L2 caches of the issuing GPU. In the non-hierarchical protocol, however, .sys-scoped loads need not invalidate the L2 caches in other GPMs of the same GPU, as subsequent loads will not fetch stale data from those caches. Store-release operations stall subsequent operations until the home node for the scope in question clears all pending writes.

Figure 5.8: Performance of various inter-GPM coherence schemes in a single GPU with 4 GPMs. Performance is normalized to a scheme that does not perform inter-GPM caching.

In our evaluation, all caches are write-through. We do not implement the optional sharer downgrade messages. We model one directory optimization: each entry tracks the state of four cache lines together. This enables 12K x 4 x 128 B = 6 MB of the data assigned to each GPM to be actively shared by other GPMs and/or GPUs. Chapter 5.8.2 later shows performance sensitivity to the choices of these parameters.

5.8 Evaluation Results

We first compare the performance of HMG to NHCC, software coherence protocols, and idealized caching without any coherence overhead. Then we conduct a sensitivity analysis to explore the design space of HMG.

5.8.1 Performance Analysis

Single-GPU System: As Figure 5.8 shows, for most benchmarks both software and hardware coherence perform similarly and close to an idealized non-coherent caching scheme. The relatively small L2 caches and relatively large inter-GPM bandwidth minimize the performance penalty of cache invalidations in single-GPU systems, and hence we do not elaborate on them further here.
Figure 5.9: Performance of a 4-GPU system, where each GPU is composed of 4 GPMs. Performance is normalized to a 4-GPU system that disallows caching of remote GPU data. Five configurations are evaluated: software protocols with non-hierarchical and hierarchical implementations, NHCC, HMG, and ideal caching without coherence overhead.

Multi-GPU System: While software coherence may be sufficient within individual GPUs, even for benchmarks with fine-grained thread-to-thread communication, Figure 5.9 shows that the benefits of HMG are much more pronounced in deeply hierarchical multi-GPU systems, especially for the applications with more fine-grained data sharing (i.e., the right half of the figure). In a 4-GPU system, HMG generally outperforms both software coherence protocols and NHCC. Both the software and hardware hierarchical protocols benefit significantly from the additional intra-GPU data locality, while the non-hierarchical protocols suffer from larger inter-GPU latency and bandwidth penalties.

Figures 5.10 and 5.11 show that cache line invalidations due to store instructions or coherence directory evictions do not have a significant impact on the performance of HMG. This is because stores only trigger invalidations if there is a sharer for the same address, and typically only a small percentage of the memory footprint of each workload contains read-write shared data. Even among stores or directory evictions that do trigger sharer invalidations, there are generally no more than two sharers in our workloads.
These observations highlight the benefit of tracking sharers dynamically, rather than, e.g., classifying data sharing by type alone [253].

Figure 5.10: Average number of cache lines invalidated by each store request on shared data.

Figure 5.11: Average number of cache lines invalidated by each coherence directory eviction.

Graph workloads' fine-grained, often conflicting access patterns can lead to false sharing. Store operations in software coherence protocols will simply write this data through, but HMG might trigger frequent invalidations (in these experiments, at the granularity of four cache lines per directory entry), depending on the input sets. In such cases, the hardware protocol HMG will have higher overhead. This explains the performance of mst, for example. For most other applications, the benefits of HMG outweigh the costs.

We also profile the bandwidth overhead of invalidation messages. Figure 5.12 shows that the total bandwidth cost of invalidation messages is generally as low as just a few gigabytes per second. This is consistent with the preceding data, since there is little read-write sharing and a low number of sharers when invalidations must be sent out. The size of each invalidation message is also relatively small compared to a GPU cache line.
Combined with the fact that GPU workloads are generally latency-tolerant, it becomes clear that HMG for hierarchical multi-GPUs can deliver high performance, at high efficiency, with a relatively simple hardware implementation.

Figure 5.12: Total bandwidth cost of invalidation messages.

Overall, our results confirm prior suggestions that complicated CPU-like coherence protocols are unnecessary, even in hierarchical multi-GPU contexts. By providing a lightweight coherence enforcement mechanism specifically tuned to the scoped memory model, HMG is able to deliver 97% of the ideal speedup that inter-GPU caching can possibly enable.

5.8.2 Sensitivity Analysis

To understand the relationship between our architectural parameters and the performance of HMG, we performed sensitivity studies across a range of design space parameters.

• Bandwidth-limited inter-GPU links are the main cause of the NUMA effects that often bottleneck multi-GPU performance. Figure 5.13 shows that when sweeping across inter-GPU bandwidths, HMG is always the best-performing coherence option, even when absolute performance begins to saturate due to sufficient inter-GPU bandwidth.

• The impact of L2 cache size on performance is shown in Figure 5.14. Because of the overhead of cache invalidation, the benefits of increased L2 capacity are restricted under software coherence protocols. Conversely, the performance of HMG increases as capacity grows, indicating that the advantage of HMG will only become more favorable in systems with larger caches.

Figure 5.13: Performance sensitivity to inter-GPU bandwidth (baseline is no caching with the configurations of Table 5.2).
Figure 5.14: Performance sensitivity to L2 cache size (baseline is no caching with the configurations of Table 5.2).

• Coherence directory sizing presents a trade-off between power/area and coverage/performance. As Figure 5.15 shows, the performance of our proposed HMG is somewhat sensitive to directory size. The benefit of hardware-managed coherence over software coherence shrinks if the directory is not able to track enough sharers and is forced to perform additional cache invalidations across GPUs. However, our modestly-sized directories are large enough to successfully capture the locality needed to deliver near-ideal caching performance.

Figure 5.15: Performance sensitivity to the coherence directory size (baseline is no caching with the configurations of Table 5.2).

• Coarse-grained directory entry tracking granularity (e.g., where each entry tracks four cache lines at a time) allows directories to be made smaller, but it also introduces a risk of false sharing. In order to quantify this impact, we varied the granularity tracked by each directory entry while simultaneously adjusting the total number of entries in order to keep the total coverage constant. The results (Figure 5.16) showed minimal sensitivity,
and we therefore conclude that coarse-grained directory tracking is a useful optimization for HMG.

Figure 5.16: Performance sensitivity to the coherence directory tracking granularity (baseline is no caching with the configurations of Table 5.2).

5.8.3 Hardware Costs

In our HMG implementation, each directory entry needs to track as many as six sharers: three GPMs in the same GPU and three other GPUs. Therefore, a 6-bit vector is required for the sharer list. Because our protocol uses just two states, Valid and Invalid, only one bit is needed to track directory entry state. We assume 48 bits for tag addresses, so each entry in the coherence directory requires 55 bits of storage. Every GPM has 12K directory entries, so the total storage cost of the coherence directories is 84 KB, which is only 2.7% of each GPM's L2 cache data capacity, a small price to pay for large performance improvements in future multi-GPUs.

5.8.4 Discussion

On-package integration [22, 29, 187] along with off-package integration technologies [1, 160, 162] enable more and more GPU modules to be integrated into a single system. However, NUMA effects are exacerbated as the number of GPMs, GPUs, and non-uniform topologies within the system increases. In these situations, HMG's coherence directory would need to record more sharers and cover a larger footprint of shared data, and system performance will likely become more sensitive to link speeds and the actual network topology. As shown in Figure 5.15, HMG performs very well even after we reduce the coherence directory size by 50%, showing that there is still room to scale HMG to larger systems. We envision our proposed coherence protocol being applicable to systems that can be connected by a single NVSwitch-based network within a single operating system node.
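The directory storage arithmetic of Chapter 5.8.3 extends naturally to the larger systems discussed here. The sketch below (a simplification that assumes the same 48-bit tags, a single valid bit, a full sharer bit-vector, and "12K" meaning 12 x 1024 entries) reproduces the 55-bit entries and roughly 84 KB per-GPM cost of our 4-GPU configuration, and shows how slowly the sharer vector widens as GPUs are added:

```python
# Back-of-the-envelope HMG directory cost, following Chapter 5.8.3.
# Assumptions (not from the text): full bit-vector sharer list,
# 48-bit tags, 1 state bit, and "12K" meaning 12 * 1024 entries.

def entry_bits(n_gpus, gpms_per_gpu, tag_bits=48, state_bits=1):
    # Each GPM directory can track the other GPMs of its own GPU
    # plus the other GPUs in the system.
    sharers = (gpms_per_gpu - 1) + (n_gpus - 1)
    return sharers + state_bits + tag_bits

def directory_bytes(entries_per_gpm, bits):
    return entries_per_gpm * bits // 8

bits = entry_bits(n_gpus=4, gpms_per_gpu=4)       # 6 sharers + 1 + 48 = 55 bits
cost = directory_bytes(12 * 1024, bits)           # 84,480 bytes, i.e. ~84 KB
covered = 12 * 1024 * 4 * 128                     # 12K entries x 4 lines x 128 B = 6 MiB

# Doubling the GPU count only widens each entry by 4 bits:
bits_8gpu = entry_bits(n_gpus=8, gpms_per_gpu=4)  # 59 bits per entry
```

The dominant term is the tag, which is why the per-entry cost grows so gently with system size.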
Systems significantly larger than this (e.g., 1024-GPU systems) may be decomposed into a hierarchy consisting of hardware-coherent GPU clusters which in turn share data using software mechanisms such as MPI or SHMEM [141, 180].

The rise of MCM-GPU-like architectures might seem to motivate adding scopes in between .cta and .gpu to minimize the negative effects of coherence. However, our single-GPU performance results indicate that our workloads are minimally sensitive to the inter-GPM coherence mechanism due to high inter-GPM bandwidth. As a result, the performance benefits of introducing a new .gpm scope may not outweigh the added programmer burden of using numerous scopes. We expect further exploration of other software-hardware coherence interactions to remain an active area of research as GPU systems continue to grow in size.

5.9 Summary

In this chapter, we introduce HMG, a novel cache coherence protocol specifically tailored to scale well to hierarchical multi-GPU systems. HMG provides efficient support for the fine-grained synchronization now permitted under recently-formalized scoped GPU memory models. We find that, without much complexity, simple hierarchical extensions and optimizations to existing coherence protocols can take advantage of relaxations now permitted in scoped memory models to achieve 97% of the performance of an ideal caching scheme that has no coherence overhead. Thanks to its cheap hardware implementation and high performance, HMG demonstrates the most practical solution available for extending cache coherence to future hierarchical multi-GPU systems, and thereby for enabling continued performance scaling of applications onto larger and larger GPU-based systems.

Chapter 6

Scalable Multi-GPU Rendering via Parallel Image Composition

In this chapter, we propose CHOPIN, a Split Frame Rendering (SFR) technique that eliminates the performance overheads of prior solutions by leveraging parallel image composition.
Unlike prior work, draw commands are distributed across different GPUs to remove redundant computation, and image composition is performed in parallel to obviate the need for sequential primitive exchange. CHOPIN includes a novel draw command scheduler that predicts the proper GPU for each draw command to avoid inter-GPU load imbalance, and a novel image composition scheduler to reduce the network congestion that can easily result from naive inter-GPU sub-image exchange. Through an in-depth analysis using cycle-level simulations on a range of real-world game traces, we demonstrate that CHOPIN outperforms the prior state-of-the-art SFR implementation by up to 1.56x (1.25x gmean) in an 8-GPU system.

GPUs were originally developed to accelerate graphics processing, the process of generating 2D-view images from 3D models [134]. Although much recent computer architecture research has focused on using GPUs for general-purpose computing, high-performance graphics processing has historically accounted for the lion's share of demand for GPUs. This continues to be the case, with graphics remaining the dominant source of revenue for GPU vendors: for example, NVIDIA's 2019 revenues from the gaming (GeForce) and professional visualization (Quadro) markets combined are 2.5x and 11.5x higher than those from the datacenter and automotive markets, respectively [176]. This demand is driven by many applications, including gaming, scientific data visualization, computer-aided design, Virtual Reality (VR), Augmented Reality (AR), and so on.
Gaming itself continues to evolve: 4K/UHD high-resolution gaming requires 4x as many pixels to be rendered as 1080p HD gaming [4], while VR gaming is 7x more performance-demanding than 1080p [169]. These requirements have imposed unprecedented challenges on vendors seeking to provide a high-quality experience to end consumers.

This need for substantial performance improvements has, however, been increasingly difficult to satisfy with conventional single-chip GPU systems. To continue scaling GPU performance, GPU vendors have recently built larger systems [170, 173, 174] that rely on distributed architectures such as Multi-Chip-Module GPU (MCM-GPU) [29] and multi-GPUs [143, 197, 253]. MCM-GPU and multi-GPU systems promise to push the frontiers of performance scaling much farther by connecting multiple GPU chip modules (GPMs) or GPUs with advanced packaging [187] and networking technologies, such as NVLink [160], NVSwitch [162], and XGMI [1]. In principle, these platforms offer substantial opportunities for performance improvement; in practice, however, their performance tradeoffs for graphics processing differ from those of single-chip GPUs, and fully realizing the benefits requires the use of distributed rendering algorithms.

Distributed rendering is, of course, not new: GPU vendors have long combined two to four GPUs using techniques like SLI [166] and Crossfire [20]. They distribute the rendering workload using either Alternate Frame Rendering (AFR), where different GPUs process consecutive frames, or Split Frame Rendering (SFR), which assigns disjoint regions of a single frame to different GPUs. By processing alternate frames independently, AFR improves the overall rendering throughput, but does nothing to improve single-frame latencies. While the average frame rate improves, the instantaneous frame rate can be significantly lower than the average frame rate.
This problem, called micro-stuttering, is inherent to AFR, and can result in a dramatically degraded gameplay experience [3, 5, 8]. In contrast, SFR can improve both the frame rate and single-frame latencies [70, 103, 152]. SFR is therefore more widely used in practice, and we focus on SFR in this chapter. The tradeoff, however, is that SFR requires GPUs to exchange data for both inter- and intra-frame data dependencies, which creates significant bandwidth and latency challenges for the inter-GPU interconnect.

While the recent introduction of high-performance interconnects like NVLink and XGMI promises to conquer the inter-GPU communication bandwidth constraints, key challenges still remain. SFR assigns split screen regions to separate GPUs, but the mapping of primitive (typically triangle) coordinates to screen regions is not known ahead of time, and must be computed before distributing work among GPUs. CPU pre-processing techniques [70, 102, 103] are limited by the CPU's low computing and data throughput. GPU methods rely on redundant computation: every GPU projects all 3D primitives to the 2D screen space and retains only the primitives within its assigned screen region. This incurs significant overheads on modern high-triangle-count workloads in multi-GPU systems, and does not take advantage of the new high-speed interconnects. Recent work, GPUpd [114], has attempted to reduce the redundant computation through additional interconnect traffic, but is bottlenecked by the sequential inter-GPU primitive redistribution needed to preserve the input order of primitives (see Chapter 6.2 for a detailed analysis of prior SFR solutions). Therefore, there is an urgent need for parallel rendering schemes that can leverage today's high-speed interconnects and reliably scale to multi-GPU systems.

6.1 Parallel Image Composition

Image composition is the reduction of several images into one, and is performed at pixel granularity.
The reduction process is a sequence of operations, each of which has two inputs: the current pixel value p_old and the incoming value p_new. The two are combined using an application-dependent function f to produce the updated pixel p = f(p_old, p_new). The exact definition of f depends on the task: for example, f can select the pixel which is closer to the camera, or blend the colour values of the two pixels. A common blending operation is the over operator [186], p = p_new + (1 - a_new) * p_old, where p represents the pixel colour and opacity components, and a is the pixel opacity only. Other blending operators include addition, multiplication, and so on.

For opaque pixels, f is commonly defined to compare the depth values and keep the pixel which is closer to the camera. Picking the smallest depth value from multiple pixels can be done out of order. However, composition of transparent or semi-transparent objects needs to blend multiple pixels, which in general must follow the depth order, either front-to-back or back-to-front; for example, the visual effect of putting a drop of light-pink water above a piece of glass is different from that of the reversed order. For a series of pixels, therefore, the final value of f is derived from an ordered reduction of individual operations, f = f1 o f2 o ... o fn. The ordering of f1 through fn matters, and in general the sequence cannot be permuted without altering the semantics of f. Fortunately, although blending operators are not commutative, they are associative: i.e., (f1 o f2) o (f3 o f4) = f1 o (f2 o (f3 o f4)) [33]. As we detail in Chapter 6.3, CHOPIN leverages this associativity to compose transparent sub-images asynchronously.

Apart from the reduction function, how pixels are sent to the GPU where the reduction occurs also matters for performance.
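Before turning to pixel communication, the associativity argument above can be made concrete. The sketch below (single-channel pixels with premultiplied alpha; the numeric values are arbitrary) checks numerically that the over operator is associative but not commutative:

```python
# "new over old" with premultiplied alpha: the colour c is already scaled
# by the opacity a, so compositing reduces to p_new + (1 - a_new) * p_old.

def over(new, old):
    c_new, a_new = new
    c_old, a_old = old
    return (c_new + (1 - a_new) * c_old,
            a_new + (1 - a_new) * a_old)

a, b, c, d = (0.5, 0.5), (0.2, 0.4), (0.3, 0.6), (0.1, 0.9)

# Associative: different groupings of the same front-to-back order agree,
# which is what lets adjacent sub-images be composed as soon as they are ready.
grouped_pairs = over(over(a, b), over(c, d))
right_fold    = over(a, over(b, over(c, d)))
assert all(abs(x - y) < 1e-12 for x, y in zip(grouped_pairs, right_fold))

# Not commutative: swapping the depth order changes the result.
assert over(a, b) != over(b, a)
```

The first assertion is exactly the freedom CHOPIN exploits: a chain of blends can be evaluated as pairwise partial reductions in any grouping, so long as the front-to-back sequence itself is preserved.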
The simplest communication method is direct-send [99, 157]: once a GPU has finished processing its workload, it begins to distribute the image regions that belong to other GPUs, regardless of the readiness of the destination GPUs. With a large number of GPUs, this can easily congest the network with many simultaneous messages. To address this issue, binary-swap [136, 254] and Radix-k [183] first divide composition processes into multiple groups, and then compose sub-images with direct-send inside each group; to compose all sub-images, several rounds of this procedure are required. Alternately, Sepia [145] and Lightning-2 [227] designed special hardware to accelerate image composition, but this incurs expensive hardware cost.

In contrast, the approach we take in this chapter maintains the simplicity of direct-send, and mitigates network congestion via a novel image composition scheduler: within CHOPIN, any two GPUs start composition-related transfers only when they are both ready and available.

6.2 Limits of Existing Solutions

SFR splits the workload of a single frame into multiple partitions and distributes them among different GPUs. However, individual GPUs must synchronize and exchange information somewhere along the rendering pipeline in order to produce the correct final image.

Figure 6.1: Percentage of geometry processing cycles in the graphics pipeline of a conventional SFR implementation.

Based on where this synchronization happens, SFR implementations can be classified into three categories: sort-first, sort-middle, and sort-last [147].
Sort-first rendering identifies the destination GPUs of each primitive by conducting preliminary transformations at the very beginning of the graphics pipeline to compute the screen coordinates of all primitives, and distributes each primitive to the GPUs that correspond to the primitive's screen coordinates; after primitive distribution, each GPU can run the normal graphics pipeline independently. In contrast, both sort-middle and sort-last distribute primitives without knowing where they will fall in the screen space, and exchange partial information later: sort-middle rendering exchanges geometry processing results before the rasterization stage, while sort-last rendering exchanges fragments at the end of the pipeline for final image composition.

Among these three implementations, sort-middle is rarely adopted because the size of the geometry processing outcome is very large (hundreds of kilobytes per primitive) [120, 208]. Both CPUs and GPUs have been used for the preliminary transformation in sort-first rendering [70, 102, 103, 148, 166]. Thanks to their higher computing and data throughput, traditional GPU-assisted implementations tend to perform better than CPUs, but they duplicate all primitives in every GPU to amortize the low bandwidth and long latency of traditional inter-GPU links [148, 166]. In these schemes, each GPU maps all primitives to screen coordinates, and eventually drops the primitives that fall outside of its assigned screen region. Unfortunately, this duplicated pre-processing stage is not scalable: as shown in Figure 6.1, redundant

Figure 6.2: Graphics pipelines of GPUpd and CHOPIN.
(P: Primitive Projection, D: Primitive Distribution, G: Geometry Processing, R: Rasterization, F: Fragment Processing, Comp: Parallel Image Composition.)

Figure 6.3: Percentage of execution cycles of the extra pipeline stages in GPUpd.

geometry processing will dominate the execution cycles of the graphics pipeline and severely impact performance as the number of GPUs grows.

To address the problem of redundant computing and take advantage of recent high-performance interconnects, Kim et al. proposed the GPUpd [114] pipeline, illustrated in Figure 6.2. GPUpd is a sort-first technique which evenly distributes all primitives of each draw command across GPUs. All GPUs project the received primitives to screen space. The GPUs then exchange primitive IDs based on the projection results through high-performance inter-GPU connections, so that each GPU owns only the primitive IDs that fall into its assigned region of screen space. Finally, each GPU executes the full graphics pipeline on its received primitive IDs.

Figure 6.4: Potential performance improvement afforded by leveraging parallel image composition.

However, although GPUpd can reduce the overhead of redundant primitive projections, it requires GPUs to distribute primitive IDs sequentially to preserve the input primitive order; otherwise, each GPU would need a large memory to buffer exchanged primitive IDs and a complex sorting structure to reorder them. During inter-GPU exchange, GPU0 first distributes its primitive IDs to the other GPUs, then GPU1 distributes its primitive IDs, and this procedure continues until all GPUs have completed primitive distribution.
As shown in Figure 6.3, with more GPUs in the system (2-8 GPUs), this sequential inter-GPU primitive distribution becomes a critical performance bottleneck.

6.3 CHOPIN: Leveraging Parallel Image Composition

To eliminate the performance overheads of redundant computing and sequential inter-GPU synchronization, in this chapter we propose CHOPIN, a sort-last rendering scheme with the pipeline shown in Figure 6.2.

CHOPIN first divides the consecutive draw commands of each frame into multiple groups based on draw command properties. Within each group, draw commands are distributed across different GPUs. Since each draw command is executed on only a single GPU, CHOPIN is free of the redundant primitive projections that arise in traditional SFR implementations.

At group boundaries, the sub-images generated by all GPUs are composed in parallel. For groups of opaque objects, sub-images can be composed out-of-order, because the pixels which are closer to the camera always win. For groups of transparent objects, we take advantage of the associativity of image composition described in Chapter 6.1: adjacent sub-images are composed asynchronously as soon as they are available.

However, naive distribution of draw commands, such as round-robin, can result in severe load imbalance among the GPUs. CHOPIN therefore relies on a novel draw command scheduler (Chapter 6.4.4) which dispatches each draw command to a proper GPU based on the dynamic execution state.
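To illustrate why load-aware dispatch matters, the toy model below contrasts round-robin with a simple least-loaded heuristic. This is only a sketch in the spirit of such a scheduler: the primitive-count load metric and the heuristic itself are assumptions for illustration, not the actual design of Chapter 6.4.4.

```python
# Toy comparison: round-robin vs. least-loaded draw-command dispatch.
# Load is approximated by primitives per draw; a real GPU sees richer state.

def least_loaded(draw_prims, n_gpus):
    load = [0] * n_gpus
    for prims in draw_prims:
        gpu = min(range(n_gpus), key=load.__getitem__)  # pick least-loaded GPU
        load[gpu] += prims
    return load

def round_robin(draw_prims, n_gpus):
    load = [0] * n_gpus
    for i, prims in enumerate(draw_prims):
        load[i % n_gpus] += prims
    return load

# A skewed sequence of draw commands (primitive counts):
draws = [1000, 10, 10, 10, 900, 20, 30, 950]

rr = round_robin(draws, 4)   # [1900, 30, 40, 960]: GPU0 becomes the straggler
ll = least_loaded(draws, 4)  # [1000, 910, 980, 40]: a much tighter maximum
assert max(ll) < max(rr)
```

Since the slowest GPU gates each composition group, reducing the maximum per-GPU load directly shortens the group's completion time.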
To mitigate network congestion and avoid unnecessary stalls, we also propose a scheduler for sub-image composition (Chapter 6.4.5), which ensures that any two GPUs start composing only when their sub-images are ready and neither of them is composing with another GPU.

Figure 6.4 illustrates the potential of CHOPIN in an ideal system where all intermediate results are buffered on-chip and the inter-GPU links have zero latency and unlimited bandwidth: parallel image composition offers up to 1.68x speedup (1.31x gmean) over the best prior SFR solution. We defer the detailed evaluation methodology to later in this chapter.

6.4 The CHOPIN Architecture

The high-level system architecture of CHOPIN is shown in Figure 6.5, and consists of extensions in both the software and hardware layers.

In the software layer (1), we divide draw commands into multiple groups. At the beginning and the end of each group, we insert two new Application Programming Interface (API) functions, CompGroupStart() and CompGroupEnd(), to start and finish the image composition. We also extend the driver by implementing a separate command list for each GPU.

In the hardware layer, we connect multiple GPUs with high-speed inter-GPU links (3), similar to the NVIDIA DGX system [170, 173], and present them to the OS as a single larger GPU. Draw commands issued by the driver are distributed among the different GPUs by a hardware scheduler (2). After all draw commands of a single composition group have finished, CompGroupEnd() is called to compose the resulting sub-images. An image composition scheduler (4) orchestrates which pairs of GPUs can communicate with one another at any given time.

Figure 6.5: High-level system overview of CHOPIN (each "D" stands for a separate draw command).

6.4.1 Software Extensions

We first explain the semantics of the extended graphics API functions.
CompGroupStart() is called before each composition group starts. This function performs the necessary preparations for image composition: it passes the number of primitives and the transparency information to the GPU driver, which then sends these data to the GPU hardware. If there are transparent objects in the composition group, the GPU driver allocates extra memory for sub-images in all GPUs, because transparent sub-images cannot be blended with the background before composition. When CompGroupEnd() is called, the GPU driver sends a COMPOSE command to each GPU for image composition. The composition workflow is described in detail in Chapter 6.4.3.

The necessity of grouping draw commands derives from the varying properties of individual draw commands. CHOPIN assumes Immediate Mode Rendering (IMR), so we only group consecutive draw commands in a greedy fashion; however, more sophisticated mechanisms could potentially reorder draw commands to create larger composition groups at the cost of additional complexity. When processing a sequence of draw commands, a group boundary is inserted between two adjacent draw commands on any of the following events:

1. swapping to the next frame,
2. switching to a new render target or depth buffer,
3. enabling or disabling updates to the depth buffer,
4. changing the fragment occlusion test function, or
5. changing the pixel composition operator.

Event 1 is straightforward, because we have to finish the current frame before we move to the next one. Render Targets (RTs) are a feature that allows 3D objects to be rendered into an intermediate memory buffer, instead of the framebuffer (FB); they can be manipulated by pixel shaders in order to apply additional effects to the final image, such as light bloom, motion blur, etc. A depth buffer (or Z buffer) is a memory structure that records the depth value of screen pixels, and is used to compute the occlusion status of newly incoming fragments.
For both, Event 2 is necessary to maintain inter-RT and inter-depth-buffer dependencies, where the computation of future RTs and depth buffers depends on the content recorded in the current one.

In graphics applications, some draw commands check the depth buffer for occlusion verification without updating it. Not inserting a boundary here could allow some fragments to pass the depth test and update the frame buffer by mistake, leading to an incorrect final image. We use Event 3 to create a clean depth buffer before these draw commands begin.

Boundaries at Event 4 are needed because draw commands use depth comparison operators to retain or discard incoming fragments. Since CHOPIN distributes draw commands among multiple GPUs, having multiple comparison functions (e.g., less-than and greater-than) in a single group can scramble the depth comparison order and lead to incorrect depth verification. A group boundary at Event 4 guarantees that every time a new comparison function is applied, the depth test will start from a correct value.

As described in Chapter 6.1, pixel blending of consecutive draw commands is associative as long as a single blending operator (e.g., over) is used. However, the associativity is not transitive across different operators (e.g., mixed over and additive operators are not associative), and the composition of opaque and transparent objects also cannot be interleaved. Hence, whenever any draw command changes to a new operator (i.e., Event 5), we create a group boundary.

6.4.2 Hardware Extensions

Besides inter-GPU communications, the main operations of image composition are (a) reading the local sub-image before sending it out and (b) composing pixels in destination GPUs.
As both of these functions are carried out by the ROP, they do not require new functional components in CHOPIN.

However, since SFR (Split Frame Rendering) splits the 2D screen space into multiple regions and assigns each region to a specific GPU, pixels must eventually be exchanged among GPUs after sub-images are generated; we therefore need a hardware component that computes the destination GPUs of individual pixels. We slightly extend the ROPs with a simple structure that distributes pixels to different GPUs according to their screen positions.

We also require a draw command scheduler and an image composition scheduler to address the problems of load imbalance and network congestion, which are the two main performance bottlenecks of a naïve implementation of CHOPIN. We describe them in Chapters 6.4.4 and 6.4.5, respectively.

6.4.3 Composition Workflow

Figure 6.6 shows the workflow of each composition group. When a composition group begins, we first check how many primitives (e.g., triangles) are included in this group ❶. If the number of primitives is smaller than a certain threshold, we revert to traditional SFR and duplicate all primitives in each GPU ❷. This is a tradeoff between redundant geometry processing and image composition overhead. For example, some draw commands are executed to set up the background before real objects are rendered; because these draw commands simply cut a rectangular screen into two triangles, their geometry processing overhead is much smaller than that of other graphics pipeline stages, and the overhead of redundant geometry processing is also much smaller than the cost of image composition.

Figure 6.6: The workflow of each composition group.
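The per-group dispatch decision just described can be summarized in a few lines. This is a sketch of the control flow in Figure 6.6 only; the function name and return labels are assumptions for illustration.

```python
# Sketch of the per-group dispatch decision (Figure 6.6).
THRESHOLD = 4096  # composition-group primitive threshold (see Table 6.2)

def dispatch_group(num_primitives: int, transparent: bool) -> str:
    """Decide how a composition group is executed across GPUs."""
    if num_primitives < THRESHOLD:
        # Too few primitives: revert to traditional SFR and duplicate
        # all primitives on every GPU (no image composition needed).
        return "duplicate"
    if transparent:
        # Preserve primitive order: statically split contiguous draws
        # evenly across GPUs, then compose adjacent sub-images.
        return "static-even-split"
    # Opaque: dynamically schedule draws and compose out-of-order.
    return "dynamic-schedule"
```

Small groups (e.g., background setup) thus fall back to duplication, while large opaque groups get the full dynamic scheduling path.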
Although this threshold is an additional parameter that must be set, our sensitivity analysis (see Figure 6.19) shows that the threshold value does not substantially impact performance, so this is not a significant concern.

For each composition group that warrants parallel image composition, we first check if the group contains transparent objects. If so, the GPU driver needs to create extra render targets for sub-images in each GPU ❸. This is necessary because transparent objects cannot be merged with the background until all sub-images have been composed—otherwise, the background pixels would be composed multiple times, creating an incorrect result. To preserve the input order of transparent primitives and achieve reasonable load balance at the same time, we evenly divide the draw commands and simply distribute the same number of contiguous primitives to each GPU ❹. This simple workload distribution is acceptable because, in current applications, only a small fraction of draw commands are transparent. After a GPU has finished its workload, we can begin to compose adjacent sub-images asynchronously by leveraging associativity ❺.

Figure 6.7: Performance overhead of round-robin draw command scheduling (normalized to the system which duplicates all primitives across GPUs).

If the group has no transparent objects, CHOPIN dynamically distributes draw commands with our proposed scheduler ❻; in this case, it is not necessary to create extra render targets because the generated sub-images will overwrite the background anyway.
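The even, order-preserving division of transparent draw commands described above amounts to splitting the draw list into contiguous, near-equal chunks. A minimal sketch (the helper name is an assumption):

```python
# Sketch of the contiguous even split used for transparent groups:
# order is preserved because each GPU gets one contiguous slice.
def split_draws(draws: list, num_gpus: int) -> list:
    """Split a draw list into num_gpus contiguous, near-equal chunks."""
    n = len(draws)
    base, extra = divmod(n, num_gpus)
    chunks, start = [], 0
    for g in range(num_gpus):
        size = base + (1 if g < extra else 0)  # spread the remainder
        chunks.append(draws[start:start + size])
        start += size
    return chunks
```

Because the chunks are contiguous, adjacent GPUs hold adjacent slices of the input order, which is what allows adjacent sub-images to be composed pairwise.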
Finally, opaque sub-images are composed out-of-order ❼ by simply comparing their distances to the camera (depth values); sub-image pixels which are closer to the camera are retained for the final image composition.

6.4.4 Draw Command Scheduler

Although the parallel image composition technique in CHOPIN can avoid sequential inter-GPU synchronizations, the correct final image can only be generated after all sub-images have been composed; therefore, the slowest GPU determines the overall system performance. As Figure 6.7 shows, simple draw command scheduling, such as round-robin, can lead to severe load imbalance and substantially impact performance.

Figure 6.8: Triangle rate of the geometry processing stage (top) and the whole graphics pipeline (bottom). The data is from cod2; other applications show the same trend.

To achieve optimal load balance, we would ideally like to know the exact execution time of each draw command; however, this is unrealistic before the draw command is actually executed. Therefore, we need to approximately predict the draw command running time. A complete heuristic for rendering time estimation has been proposed in [240]: t = c1 × #tv + c2 × #pix, where t is the estimated rendering time, #tv is the number of transformed vertices, #pix is the number of rendered pixels, and c1 and c2 are the vertex rate and pixel rate.
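As a worked example of the heuristic from [240], the estimate is a simple linear combination of the vertex and pixel counts. The rate values below are made up purely for illustration:

```python
# Rendering-time heuristic from [240]: t = c1 * #tv + c2 * #pix
def estimate_render_time(num_transformed_vertices: int, num_pixels: int,
                         c1: float, c2: float) -> float:
    """Estimate draw-command rendering time (in cycles) from vertex
    and pixel counts, given a vertex rate c1 and pixel rate c2."""
    return c1 * num_transformed_vertices + c2 * num_pixels

# e.g., 3000 vertices at 2 cycles/vertex plus 100,000 pixels
# at 0.5 cycles/pixel (illustrative rates only):
t = estimate_render_time(3000, 100_000, c1=2.0, c2=0.5)  # 6000 + 50000
```

The difficulty, as discussed next, is that c1 and c2 are not constants in practice.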
Although this heuristic considers both the geometry and fragment processing stages, the values of c1 and c2 can change dynamically across draw commands, so we cannot use this approach directly. OO-VR [246] samples these parameters on the first several draw commands and uses them for the remainder of the rendering computation; however, we have found that these parameters vary substantially, and such samples form a poor estimate of the dynamic execution state of the whole system. Other prior work [17] instead uses the triangle count of each draw command (which can be acquired from applications) as a heuristic to estimate rendering time. However, dynamically keeping track of all triangles throughout the graphics pipeline is complicated, especially after a triangle is rasterized into multiple fragments.

Figure 6.9: Draw command scheduler microarchitecture.

Fortunately, as Figure 6.8 shows, the triangle rate (i.e., cycles/triangle) of the geometry processing stage is similar to that of the whole graphics pipeline — this is similar to how the instruction processing rate in a CPU frontend limits the performance of the CPU backend. We therefore propose to use the number of remaining triangles in the geometry processing stage as an estimate of each GPU's remaining workload. Every time a draw command is issued by the GPU driver, we simply distribute it to the GPU which has the fewest remaining triangles in the geometry processing stage.

The microarchitecture of our draw command scheduler is shown in Figure 6.9. The main structure is a table in which each GPU has an entry that records the number of scheduled and processed triangles in that GPU; the remaining triangle count is the difference.
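The scheduling policy just described can be sketched in software form. This is an illustrative model of the table in Figure 6.9, not the hardware itself:

```python
# Sketch of the draw command scheduler (Figure 6.9): one table entry
# per GPU tracks scheduled and processed triangle counts; a new draw
# command goes to the GPU with the fewest remaining triangles.
class DrawCommandScheduler:
    def __init__(self, num_gpus: int):
        self.scheduled = [0] * num_gpus  # triangles scheduled per GPU
        self.processed = [0] * num_gpus  # triangles past geometry stage

    def remaining(self, gpu: int) -> int:
        return self.scheduled[gpu] - self.processed[gpu]

    def schedule(self, num_triangles: int) -> int:
        """Pick the least-loaded GPU and account for the new triangles."""
        gpu = min(range(len(self.scheduled)), key=self.remaining)
        self.scheduled[gpu] += num_triangles
        return gpu

    def on_geometry_done(self, gpu: int, num_triangles: int) -> None:
        """Called as triangles finish the geometry processing stage."""
        self.processed[gpu] += num_triangles
```

Replaying the running example of Figure 6.9 (counts 200/150, 180/150, 210/190, 200/160), a 100-triangle draw command lands on GPU2, whose remaining count (20) is the smallest.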
The scheduled triangle count increments when a draw command is scheduled to a GPU, while the processed count increments as triangles finish geometry processing.

Figure 6.9 also shows a running example of how the scheduler operates: first, the GPU driver issues a draw command with 100 triangles ❶. Next, the draw command scheduler finds that GPU2 currently has the fewest remaining triangles ❷. The triangle count of this draw command is therefore added to the number of triangles scheduled to GPU2 ❸, while the scheduler distributes this draw command to GPU2 ❹. Once the triangles pass through the geometry processing stage of the graphics pipeline, the number of processed triangles for GPU2 is increased accordingly ❺.

Figure 6.10: Image composition scheduler microarchitecture.

Field          Meaning
CGID           Composition Group ID
Ready          Ready to compose with others?
Receiving      Receiving pixels from another GPU?
Sending        Sending pixels to another GPU?
SentGPUs       GPUs the sub-image has been sent to
ReceivedGPUs   GPUs we have composed with

Table 6.1: Fields tracked by the image composition scheduler.

6.4.5 Image Composition Scheduler

Once each GPU has finished its workload, it can begin to communicate with other GPUs for sub-image composition. However, blind inter-GPU communication can result in congestion and under-utilization of interconnect resources (see Chapter 6.1). The most straightforward scheme, direct-send, sends the screen regions to any other GPU without knowing if the destination GPU can accept them; if the target GPU is still computing, the waiting inter-GPU messages will block the interconnect. For example, assume a situation where all GPUs except GPU0 have finished their draw commands, so the other GPUs begin to send their sub-images to GPU0.
Because GPU0 is still running, inter-GPU messages will be blocked in the network. Even though these GPUs could have communicated with another GPU rather than GPU0, they now have to wait until GPU0 is able to drain the network. Therefore, an intelligent scheduling mechanism for image composition is necessary.

Our proposed composition scheduler, shown in Figure 6.10, aims to avoid stalls due to GPUs that are still running their scheduled draw commands or are busy composing with other GPUs. It records the composition status (Table 6.1) of each GPU in a table: the composition group ID (CGID) is used to distinguish different groups, the Ready flag is set when a GPU has generated its sub-image and become ready to compose with others, and the Receiving and Sending flags indicate that a GPU is busy exchanging pixels with another GPU. Finally, SentGPUs and ReceivedGPUs record the GPUs with which a GPU has already communicated in a bit vector whose size equals the number of GPUs in the system.

Figure 6.11 shows the image composition scheduler workflow. Once a GPU has finished all draw commands and generated a sub-image, we set its Ready flag and increment the CGID by one to start a new composition phase ❶. We then check the status of other GPUs to see if any available GPUs can compose with each other. For groups of transparent objects ❷, only adjacent GPUs are checked because transparent sub-images cannot be composed entirely out-of-order (Chapter 6.1); for opaque groups, all GPUs are checked ❸. Composition starts only if the remote GPU (1) is ready to compose and running in the same composition group (i.e., the CGIDs are the same), (2) has not yet been composed with (i.e., is not set in ReceivedGPUs), and (3) is not sending pixels to another GPU.

As an example, consider the status in Figure 6.10. We can see that GPU0 and GPU2 have composed with each other, GPU3 is still running, and GPU1 has just finished its workload and set its Ready flag.
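Before continuing the example, the eligibility check above (conditions 1–3) can be written down directly. This is an illustrative sketch mirroring the Table 6.1 fields; the Python names are assumptions:

```python
# Sketch of the composition scheduler's eligibility test: a local GPU
# may start composing with a remote GPU only under conditions (1)-(3).
from dataclasses import dataclass, field

@dataclass
class GPUStatus:
    cgid: int                        # composition group ID
    ready: bool                      # sub-image generated?
    receiving: bool                  # busy receiving pixels?
    sending: bool                    # busy sending pixels?
    sent_gpus: set = field(default_factory=set)      # already sent to
    received_gpus: set = field(default_factory=set)  # already composed with

def can_compose(local: GPUStatus, remote_id: int, remote: GPUStatus) -> bool:
    """Can `local` start composing with GPU `remote_id`?"""
    return (
        remote.ready and remote.cgid == local.cgid  # (1) same group, ready
        and remote_id not in local.received_gpus    # (2) not yet composed
        and not remote.sending                      # (3) not busy sending
    )
```

For transparent groups, the scheduler would additionally restrict `remote_id` to adjacent GPUs, as described above.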
At this moment, GPU1 can compose with GPU0, so we set the Receiving flag of GPU1 and the Sending flag of GPU0 to indicate that these two GPUs are busy ❹. When image composition starts, GPU0 will read its sub-image and send out the region corresponding to GPU1. After these two GPUs have finished composition, we will reset the Receiving flag of GPU1 and the Sending flag of GPU0. Meanwhile, we will also add GPU0 to the ReceivedGPUs field of GPU1 and add GPU1 to the SentGPUs field of GPU0 ❺. This procedure is repeated until all sub-images are composed. Finally, we reset the table entry after a GPU has sent its sub-image to all other GPUs and the sub-images of all other GPUs have also been received ❻. The composition is finished once each GPU has composed with all other GPUs and, for transparent sub-images, with the background.

Figure 6.11: Image composition scheduler workflow.

6.5 Methodology

We evaluate CHOPIN by extending ATTILA [63, 151], a cycle-level GPU simulator which implements a wide spectrum of graphics features present in modern GPUs. Unfortunately, the latest ATTILA is designed to model an AMD TeraScale2 architecture [142], and is hard to configure as the latest NVIDIA Volta [172] or Turing [164] systems; therefore, to fairly simulate the performance of different SFR implementations, we scale down system parameters, such as the number of SMs and ROPs, accordingly (Table 6.2). Similar simulation strategies have been widely used in related prior work [244, 245, 246, 247]. We extend the GPU driver to issue draw commands and hardware register values to different GPUs. Similar to the existing NVIDIA DGX system [170, 173], we model the inter-GPU links with point-to-point connections between GPU pairs, with a default bandwidth and latency of 64GB/s and 200 cycles.

Structure                                 Configuration
GPU frequency                             1GHz
Number of GPUs                            8
Number of SMs                             64 (8 per GPU)
Number of ROPs                            64 (8 per GPU)
SM configurations                         32 shader cores, 4 texture units
L2 Cache                                  6MB in total
DRAM                                      2TB/s, 8 channels, 8 banks per channel
Composition group # primitives threshold  4096
Inter-GPU bandwidth                       64GB/s (uni-directional)
Inter-GPU latency                         200 cycles

Table 6.2: Simulated GPU and memory hierarchy for CHOPIN.

Benchmark                         Abbr.    Resolution   # Draws   # Triangles
Call of Duty 2                    cod2     640×480      1005      219,950
Crysis                            cry      800×600      1427      800,948
GRID                              grid     1280×1024    2623      466,806
Mirror's Edge                     mirror   1280×1024    1257      381,422
Need for Speed: Undercover        nfs      1280×1024    1858      534,121
S.T.A.L.K.E.R.: Call of Pripyat   stal     1280×1024    1086      546,733
Unreal Tournament 3               ut3      1280×1024    1944      630,302
Wolfenstein                       wolf     640×480      1697      243,052

Table 6.3: Benchmarks used for CHOPIN evaluation.

As benchmarks, we use the eight single-frame traces shown in Table 6.3, which we manually annotate to insert the new API functions CompGroupStart() and CompGroupEnd() at composition group boundaries. All benchmarks come from real-world games; the number of draw commands, the number of primitives (triangles), and the target resolutions vary across the set, and are shown in Table 6.3.

Our SFR implementation splits each frame by interleaving 64×64 pixel tiles across the different GPUs. Unlike AFR, SFR needs to handle the read-after-write dependencies on render targets and depth buffers. To ensure memory consistency, every time the application switches to a new render target or depth buffer, our simulation invokes an inter-GPU synchronization which requires each GPU to broadcast the latest content of its current render targets and depth buffers to the other GPUs.

Apart from our CHOPIN system, we also implement primitive duplication, which we use as the baseline.
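The 64×64-pixel tile interleaving described above can be sketched as a simple mapping from screen position to owning GPU. The row-major interleave order below is an assumption for illustration; the thesis only specifies that 64×64 tiles are interleaved across GPUs.

```python
# Sketch of 64x64-pixel tile interleaving for SFR: screen tiles are
# assigned to GPUs round-robin in (assumed) row-major tile order.
TILE = 64

def tile_owner(x: int, y: int, screen_width: int, num_gpus: int) -> int:
    """Map a pixel coordinate to the GPU owning its 64x64 screen tile."""
    tiles_per_row = (screen_width + TILE - 1) // TILE  # ceil division
    tile_id = (y // TILE) * tiles_per_row + (x // TILE)
    return tile_id % num_gpus
```

All pixels inside one tile map to the same GPU, so the ROP extension only needs the tile coordinates to route a pixel to its destination.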
We also implement the best prior work, GPUpd [114], modelling both of its optimizations: batching and runahead execution.¹ To explore the upper bound on the performance of each technique, we also idealize GPUpd and CHOPIN in the same way: unlimited on-chip memory for buffering intermediate results, zero inter-GPU latency, and infinite inter-GPU bandwidth.

¹We contacted the authors to request the GPUpd sources, but were denied because of IP issues; we therefore created a best-effort realistic implementation of GPUpd as well as an idealized variant.

6.6 Evaluation Results

In this section, we first compare the performance of CHOPIN, primitive duplication, and GPUpd. We then conduct sensitivity analyses to explore the design space, and finally evaluate the hardware costs.

6.6.1 Performance Analysis

The overall performance of multiple SFR implementations is shown in Figure 6.12. The performance of GPUpd is comparable to conventional primitive duplication. Idealizing GPUpd (i.e., our best implementation of GPUpd) slightly improves performance, but it is still substantially worse than CHOPIN. With the image composition scheduler enabled, CHOPIN performs 25% (up to 56%) better than primitive duplication, and only 4.8% slower than IdealCHOPIN.

Figure 6.12: Performance of an 8-GPU system; baseline is primitive duplication with the configuration of Table 6.2. (CompSched: composition scheduler)

Figure 6.13 shows that the performance improvement of CHOPIN comes mainly from the reduced synchronization overheads: for GPUpd, this is the extra primitive projection and distribution stages, while for CHOPIN this is the image composition stage (e.g., the composition overhead of grid is large because it has a much bigger inter-GPU traffic load, see Figure 6.14). Conventional primitive duplication suffers because of redundant geometry processing, which CHOPIN entirely avoids. Even though GPUpd still performs some redundant computation in the primitive projection stage, sequential inter-GPU primitive distribution is its critical performance bottleneck.

Figure 6.13: Execution cycle breakdown of graphics pipeline stages; all results are normalized to the cycles of primitive duplication. (CHOPIN+: CHOPIN + composition scheduler, CHOPIN++: IdealCHOPIN)

CHOPIN avoids redundant geometry processing by distributing each draw command to a specific GPU, and substantially reduces the overhead of inter-GPU synchronization through parallel composition. With the image composition scheduler, the composition cost is reduced even more by avoiding unnecessary network congestion.

Figure 6.14: Traffic load of parallel image composition.
Figure 6.15: Performance sensitivity to the frequency of updates sent to the draw command scheduler (baseline is primitive duplication with the configuration from Table 6.2).

Distributing draw commands to different GPUs can potentially reduce the effectiveness of occlusion testing, because sub-images might not have the smallest depth value for each pixel before image composition. However, we find that the impact is minor: the number of processed fragments in the ROPs only increases by 3.6%, 6.5%, and 8.4% in systems of 2, 4, and 8 GPUs, which still permits speedups of up to 1.56×.

6.6.2 Composition Traffic Load

To reduce network traffic, CHOPIN only exchanges the screen regions assigned to the GPUs that are communicating at any given moment. We also filter out the screen tiles that are not rendered by any draw commands, as they do not need to be composed. As Figure 6.14 shows, the average traffic load of image composition is only 51.66MB. Figure 6.13 shows that this does not create a substantial execution overhead, especially with the image composition scheduler enabled. The large traffic load in grid is due to many large triangles that cover big screen regions; we leave optimizing this to future work.

In our experiments, we allow the GPUs to update the draw command scheduler statistics for every triangle processed, an average of 1.7MB of traffic with a 4B message size. To account for scaling to much larger systems and much larger triangle counts, however, we also investigated how a larger update interval would affect the performance of CHOPIN. Figure 6.15 sweeps this update frequency from every triangle to every 1024 triangles on an otherwise identical system; the average performance improvement of CHOPIN drops very slightly, from 1.25× to 1.22×. With updates every 1024 triangles and 4B messages, the total update traffic load would be 4KB for 1 million triangles and 4MB for 1 billion triangles.
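The update-traffic figures above follow from a one-line calculation: one 4-byte message per update interval. A quick arithmetic check (the helper name is illustrative):

```python
# Scheduler-update traffic: one msg_bytes message per `interval` triangles.
def update_traffic_bytes(num_triangles: int, interval: int,
                         msg_bytes: int = 4) -> int:
    """Total draw-command-scheduler update traffic in bytes."""
    return (num_triangles // interval) * msg_bytes

# With 4B messages every 1024 triangles:
kb_per_million = update_traffic_bytes(1_000_000, 1024) / 1024      # ~3.8 KB
mb_per_billion = update_traffic_bytes(10**9, 1024) / 1024**2       # ~3.7 MB
```

Both values round to the ~4KB and ~4MB figures quoted above, and are negligible next to sub-image pixel traffic.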
The image composition scheduler receives notifications from GPUs at composition boundaries that they are ready to accept work, and sends notifications back to the GPUs — 7 requests and 7 responses for each GPU in an 8-GPU system, plus an 8th pair to compose with the background — which results in (8 + 8) × 8 × 4 = 512B with 4B messages. Both are negligible compared to the sub-image frame content.

6.6.3 Sensitivity Analysis

To understand the relationship between our architectural parameters and the performance of CHOPIN, we performed sensitivity studies across a range of design space parameters.

GPU count. Even though integrating more GPUs in a system can provide abundant resources to meet constantly growing computing requirements, it also imposes a bigger challenge on inter-GPU synchronization. As Figure 6.16 shows, GPUpd is constrained by the sequential primitive distribution, and its performance does not scale with GPU count. In contrast, because CHOPIN parallelizes image composition, inter-GPU communication is also accelerated with more GPUs. Therefore, the performance of CHOPIN is scalable, and the improvement versus prior SFR solutions grows as the number of GPUs increases. Meanwhile, the image composition scheduler becomes more effective as the GPU count grows: this is because naïve inter-GPU communication for image composition congests the network more frequently with more GPUs, which is a bigger bottleneck for a larger system.

Figure 6.16: Performance sensitivity to the number of GPUs (for each GPU count configuration, the baseline is primitive duplication with the same GPU count and other settings as in Table 6.2).

Figure 6.17: Performance sensitivity to inter-GPU link bandwidth (baseline is primitive duplication with the configuration of Table 6.2).

Inter-GPU link bandwidth and latency.
Since inter-GPU synchronization relies on the inter-GPU interconnect, we investigated sensitivity to link bandwidth and latency. CHOPIN performance scales with bandwidth (Figure 6.17), unlike GPUpd. Similarly, CHOPIN is not significantly affected by link latency (Figure 6.18), unlike GPUpd, where latency quickly bottlenecks the sequential primitive exchange.

Composition group size threshold. This parameter trades off redundant geometry processing against image composition overhead: if the number of primitives inside a composition group is smaller than a specific threshold, CHOPIN reverts to primitive duplication (see Figure 6.6). In theory, this threshold could be important: if set too small, it might not filter out most composition groups with few primitives; and if set too big, we can lose the potential performance improvement of parallel image composition. However, as Figure 6.19 shows, it turns out that the performance of CHOPIN is not very sensitive to the threshold value, and the threshold should be of little concern to programmers.

Figure 6.18: Performance sensitivity to inter-GPU link latency (baseline is primitive duplication with the configuration of Table 6.2).

Figure 6.19: Performance sensitivity to the threshold of composition group size (baseline is primitive duplication with the configuration of Table 6.2).

The main reason for this lack of sensitivity is that the distribution of composition group sizes is bipolar: most composition groups have either a large number of triangles (e.g., consecutive draw commands for object rendering) or very few triangles (e.g., background draw commands), and many threshold settings will separate them.
For example, if the threshold is set to 4,096, CHOPIN will accelerate 6.5 composition groups on average, but those groups cover 92.44% of the triangles in the entire application. Enlarging the threshold to 16,384 will accelerate 5.25 composition groups, covering 89.83% of the triangles on average.

6.6.4 Hardware Costs

The draw command scheduler and the image composition scheduler are the main hardware costs of the CHOPIN system. In an 8-GPU system, both schedulers have 8 entries. Each entry of the draw command scheduler has two fields: the number of scheduled triangles and the number of processed triangles. To cover the requirements of most existing and future applications, we conservatively allocate 64 bits for each field. Therefore, the total size of the draw command scheduler is 128 bytes.

As discussed in Chapter 6.6.3, with a group size threshold of 4,096, up to 13 (6.5 on average) draw command groups will trigger image composition, so we assume one byte is enough to represent a CGID for the image composition scheduler. The Ready, Receiving and Sending flags are all single bits. SentGPUs and ReceivedGPUs are bit vectors with as many bits as the number of GPUs in the system (for us, one byte). Therefore, the total size of the image composition scheduler in our implementation is 27 bytes.

6.6.5 Discussion

Rendering workloads scale in two ways: with resolution and with triangles per pixel. As resolution increases to 4K and beyond, triangle counts at iso-quality increase as well. As visual quality advances, however, triangle counts per pixel also increase — each frame of the latest games tends to have millions or billions of triangles whose sizes are often on the order of a single pixel [9]. This means that the performance overhead of redundant geometry processing and sequential inter-GPU primitive exchange in prior SFR solutions will increase, and the corresponding benefits of CHOPIN will grow.

As is, CHOPIN is applicable to NVIDIA DGX-scale systems.
Systems which are significantly larger than this (e.g., 1024 GPUs [85]) may need more complicated rendering mechanisms, such as a combination of AFR and SFR.

6.7 Summary

In this chapter, we introduced CHOPIN, a novel architecture for split frame rendering in multi-GPU systems. CHOPIN is a sort-last rendering scheme which distributes each draw command to a specific GPU and avoids redundant geometry processing. By leveraging the parallelism of image composition and modern high-speed inter-GPU links, CHOPIN also obviates the need for sequential inter-GPU communication.

CHOPIN includes two novel schedulers: a draw command scheduler to address load imbalance, and an image composition scheduler to reduce network congestion. All in all, CHOPIN outperforms the best prior work by up to 1.56× (1.25× gmean); and in contrast to existing solutions, it scales as the number of GPUs grows.

Chapter 7

Related Work

This chapter discusses work related to this dissertation. We introduce the commonly enforced memory consistency models of CPUs and GPUs in Chapter 7.1. We survey different cache coherence protocols in Chapter 7.2. Chapter 7.3 describes related work on transactional memory. In Chapter 7.4, we discuss previous proposals for graphics processing.

7.1 Work Related to Memory Consistency Enforcement

Strong Consistency in GPUs. Hechtman and Sorin first made the case that the performance impact of Sequential Consistency (SC) is likely small in GPUs [94]. Singh et al. [220] observed that, while this was true for most workloads, some suffered severe penalties with SC because of read-only and private data; they proposed to classify these accesses at runtime and permit reordering while maintaining SC for read-write shared data. Our Relativistic Cache Coherence (RCC) approach is orthogonal: we focus on SC stall latency, and improve performance for both read-write and read-only data. Both [94] and [220] used a CPU-like setup with MESI and write-back L1 caches.
In GPUs, however, write-through L1s perform better [221]: GPU L1 caches have very little space per thread, so a write-back policy brings infrequently written data into the L1 only to write it back soon afterwards. Commercial GPUs have write-through L1s [21, 165, 167]. RCC studies GPU-style write-through L1 caches, and compares against the best prior GPU implementation of weak consistency [221].

Weak Consistency in GPUs. Although the work above argued for enforcing SC in GPUs and found that relaxed memory models could not significantly outperform SC, modern GPU products still enforce relaxed memory models, as weak models allow for more microarchitectural flexibility and arguably better performance/power/area tradeoffs. Much previous work has aimed to enforce relaxed consistency in GPUs with different coherence optimizations [18, 95, 117, 216]. In a push towards generality, GPU vendors have moved from conventional bulk-synchronous models towards scoped memory models [2, 112, 171]. Both NVIDIA [135] and AMD [98] have published formalized scoped memory models. We optimize HMG by leveraging these recent formal definitions of scoped memory models [98, 135] and provide efficient coherence support for multi-GPU systems. Sinclair et al. [216] adapted DeNovo [59] to GPUs with DRF-0 and HRF variants, and argued that the benefits of HRF over DRF-0 do not warrant the additional complexity of scopes. However, their evaluation was conducted within a single GPU; in multi-GPU environments, the latency gap between the broadest and narrowest scope is an order of magnitude larger. Meanwhile, DeNovo requires software to expose additional details to the coherence hardware, while HMG requires no software changes.

Strong Consistency in CPUs. Many quills have been sacrificed arguing that sequential consistency is desirable in CPUs and proposing how it could be efficiently implemented [12, 36, 50, 79, 81, 82, 84, 89, 131, 194, 219, 238].
Generally, speculation support or other hardware modifications are required to overcome the overheads of SC. Lin et al. [131] and Gope et al. [84] also used logical order to enforce SC in a CPU setting. RCC shares the conviction that sequential consistency is preferable, but focuses on GPUs, which have different architectural constraints (e.g., no speculation support).

Weak Consistency in CPUs. Even though much academic work has proposed ways to support SC in CPUs, almost all industrial vendors choose to relax memory order constraints for performance. The memory consistency models widely employed in industry include Total Store Ordering (TSO) [181], Partial Store Ordering (PSO) [225], Release Consistency (RC) [78], ARMv8 [26], IBM Power [205], and so on; all of them make different tradeoffs between performance, complexity, and programmability. We discuss memory consistency models in more detail in Chapter 2.3.

7.2 Work Related to Cache Coherence Protocol

GPU Coherence. We have discussed most existing GPU coherence protocols in Section 5.3. Beyond those, Singh et al. [221] proposed a GPU coherence protocol based on physical timestamps, and showed that MESI and write-back caches suffered NoC traffic and performance penalties in GPUs. While the consistency model is weak throughout, the base version (TC-strong) can support SC if the core does not permit multiple outstanding memory operations from one warp; we use this SC variant as a baseline for RCC. The improved version (TC-weak) cannot support SC, but offers 30% better performance; we use this as a comparison for RCC. RCC uses logical rather than physical timestamps, has lower complexity, and closes the performance gap between SC and relaxed memory models. However, none of this work considered architecture hierarchy or scoped memory models. In contrast, HMG explores coherence support for future deeply hierarchical GPU systems with scoped memory model enforcement.

Timestamp-Based Cache Coherence.
Nandy and Narayan [156] first observed that timestamps can reduce interconnect traffic due to invalidate messages in MSI-like protocols, but their protocol did not support SC. Shim et al. [211] proposed LCC, a sequentially consistent library coherence protocol, for multi-cores; LCC is equivalent to our TC-strong baseline. Singh et al. [221] adapted LCC to GPUs and proposed a higher-performance weakly ordered variant with a novel fence completion mechanism; Kumar et al. [119] used TC-weak for FPGA accelerators. Recently, Yao et al. [250] adapted TC-weak to multi-cores by tracking writes with a Bloom filter. All of these protocols use physical timestamps, and SC variants must stall stores (and weak variants must stall fences) until completion; RCC uses logical time and stalls neither stores nor fences.

Lamport [122] first observed that consistency need only be maintained in logical time. This fact has been used to implement coherence on a logically ordered bus (e.g., [124, 223]) and to extend snooping coherence protocols to non-bus interconnects [14, 138]. Meixner and Sorin used logical timestamps to dynamically verify consistency models [139]. Yu et al. [255] proposed using logical timestamps to directly implement coherence in CPU-style multi-cores, but their protocol maintains exclusive write states and the recall/downgrade messages that RCC avoids in order to reduce store latencies. Moreover, the timestamp speculation scheme of [257] requires architectural features (e.g., speculative execution) that are not present on GPUs. RCC shares the notion of keeping coherence with logical timestamps, but eschews exclusive states to focus on reducing store latencies. RCC is a simpler protocol that offers best-in-class performance in GPUs.

Hierarchical Cache Coherence. Coherence hierarchy has been commonly employed in CPUs [237, 239]. Most hierarchical CPU designs [80, 86, 88, 126, 153] have adopted MESI-like coherence, which has been proven to be a poor fit for GPUs [95, 221].
HMG shows that the complexity of extra states is also unnecessary for hierarchical GPUs. Both DASH [126] and WildFire [88] increased the complexity even further by employing a mixed coherence policy: intra-cluster snoopy coherence and inter-cluster directory-based coherence. To implement its consistency model efficiently, the Alpha GS320 [80] separated commit events to allow time-critical replies to bypass inbound requests without violating memory ordering constraints. HMG can achieve almost optimal performance without such overheads.

Heterogeneous Cache Coherence. Shared data synchronization in the unified memory space of heterogeneous systems also requires efficient coherence protocols. Lowe-Power et al. proposed a heterogeneous system coherence protocol for integrated CPU-GPU systems [188], replacing the standard directory with a region directory to reduce the bandwidth bottleneck of GPU memory requests. Projects such as Crossing Guard [179] and Spandex [19] proposed flexible coherence interfaces to integrate heterogeneous computing devices. We expect that HMG would integrate nicely with such schemes due to its simple states and clear coherence hierarchy.

7.3 Work Related to Transactional Memory

GPU Transactional Memory (TM). To date, all hardware TM proposals for GPUs have been based on KiloTM [77], which combines lazy version management with lazy, value-based conflict detection. Follow-up work [76] extended KiloTM with an intra-warp conflict detection mechanism and a silent-commit filter for read-only transactions based on physical timestamps. A later proposal [55] added global broadcast updates about currently committing transactions, and leveraged these to pause or abort doomed transactions; we use an idealized version of this as one of our baselines. GPU-LocalTM [234] is a limited form of transactional memory that guarantees atomicity only within a core's scratchpad; Bloom filters [35] are used for conflict detection.
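As an aside, signature-based conflict detection of this flavor can be sketched in a few lines. The sketch below is hypothetical Python with illustrative names and hash choices, not GPU-LocalTM's actual hardware design: each transaction summarizes its read and write sets in small Bloom-filter signatures, and two transactions are treated as conflicting if their signatures may overlap.

```python
# Illustrative sketch of Bloom-filter (signature-based) conflict detection;
# names and hash choices here are hypothetical, not GPU-LocalTM's design.

def signature(addresses, bits=64):
    """Summarize a set of addresses as a bit vector using two hashes."""
    sig = 0
    for addr in addresses:
        sig |= 1 << (hash((addr, 0)) % bits)
        sig |= 1 << (hash((addr, 1)) % bits)
    return sig

def may_conflict(tx_a, tx_b):
    """Conservative check: one transaction's writes vs. the other's accesses.
    False positives (spurious aborts) are possible; false negatives are not,
    as long as every transactional access was inserted into a signature."""
    return bool(tx_a["wr"] & (tx_b["rd"] | tx_b["wr"]) or
                tx_b["wr"] & (tx_a["rd"] | tx_a["wr"]))

tx1 = {"rd": signature([0x100, 0x104]), "wr": signature([0x200])}
tx2 = {"rd": signature([0x300]),        "wr": signature([0x100])}  # writes tx1's read

assert may_conflict(tx1, tx2)  # a true conflict is always detected
```

Because the filters only approximate the access sets, a hit may be a false positive; hardware designs size the filters so that spurious aborts stay rare.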
Software TM proposals for GPUs have used either per-object write locks [48] or value-based detection combined with a TL2-like timestamp approach [248]. Given special DRAM subarrays [209], and at the cost of substantial memory overheads and extensive OS/software changes, GPU snapshot isolation [56] can reduce abort rates in long transactions by buffering many concurrent memory states; it retains two-round-trip lazy validation and must update snapshot versions in DRAM, resulting in even longer commit latencies.

CPU Hardware TM. Since hardware TM was first proposed [96, 228], many CPU implementations have followed. Many leverage the existing inter-core coherence mechanism to identify conflicts, either by modifying the coherence protocol [51, 60, 66, 233], adding extra bits to the coherence state [37, 149, 202], or leveraging coherence to update read/write signatures [46, 144, 252]. Existing GPU coherence proposals, however, cannot support eager TM: they either rely on special language-level properties [216], eschew write atomicity [221], or cannot detect conflict times [195]. Other TM proposals [49, 51, 89, 116, 189, 236] rely on signature or update broadcasts, or on software-assisted detection [212, 213, 214]. In contrast to TM, speculative lock elision can run parallel code in a lock-free manner without requiring instruction set changes, coherence protocol extensions, or programmer support [191].

Timestamp-Based TM. Transactional memory schemes based on logical clocks share commonalities with our timestamp-based approach. Logical clocks have been used mainly in software TMs to maintain consistency [74]; hardware TMs have leveraged them to maintain fairness and forward progress [23, 149, 192] and snapshot isolation [133], and prior GPU work has used them to avoid validation of read-only transactions [76].

7.4 Work Related to Graphics Processing

Graphics Processing in Multi-GPU Systems. GPUpd [114] and OO-VR [246] are two multi-GPU proposals that attempt to leverage modern, high-speed inter-GPU connections.
However, as discussed in Section 6.2, GPUpd is bottlenecked by a sequential inter-GPU primitive exchange step, while CHOPIN composes sub-images in parallel. OO-VR is a rendering framework that improves data locality in VR applications; it is orthogonal to our problem of efficient image composition for Split Frame Rendering (SFR). Unlike OO-VR, draw command distribution in CHOPIN does not rely on statically computed parameters; CHOPIN also includes an image composition scheduler to make full use of network resources.

NVIDIA's SLI [166] attempts to balance the workload by dynamically adjusting how the screen is divided among GPUs. However, it still duplicates all primitives in every GPU, and incurs the attendant overheads. Both DirectX 12 [7] and Vulkan [6] expose multi-GPU hardware to programmers via their Application Programming Interfaces (APIs), but relying only on this would require programmers to have exact static knowledge of the workload (e.g., its distribution). CHOPIN can simplify programming and deliver reliable performance through dynamic scheduling in hardware.

Parallel Rendering Frameworks. Most SFR mechanisms were originally implemented for PC clusters. Among these implementations, WireGL [102], Chromium [103], and Equalizer [70] are high-level APIs which can allocate work among machines under different configurations. However, when the system is configured as sort-first, they use CPUs to compute the destination of each primitive, and performance is limited by the poor computational throughput of CPUs. When the system is configured as sort-last, they assign one specific machine to collect all sub-images from the others for composition, which again constitutes a bottleneck.
In contrast, CHOPIN distributes draw commands to different GPUs based on dynamic execution state, and all GPUs in the system contribute to image composition in parallel.

To accelerate image composition, some implementations, like PixelFlow [146] and Pixel-Planes 5 [75], even implemented application-specific hardware, with significant overheads. CHOPIN simply takes advantage of existing multi-GPU systems and high-performance inter-GPU links, and incurs very small hardware costs. RealityEngine [15] and Pomegranate [71] aim to improve load balancing by frequently exchanging data before geometry processing, before rasterization, and after fragment processing; however, these complicated synchronization patterns are hard to coordinate, and they can impose a huge traffic load on inter-GPU links.

Graphics Processing in Single-GPU and Mobile GPU Systems. Besides parallel rendering, much work has also been done on graphics processing in single-GPU and mobile GPU systems. By leveraging the similarity between consecutive frames, Arnau et al. proposed fragment memoization to filter out redundant fragment computation [27]. Rendering Elimination [25] builds on the same observation of similarity, but eliminates redundant computation at the coarser granularity of screen tiles. To verify fragment occlusion as early as possible, Anglada et al. proposed early visibility resolution, a mechanism that leverages the visibility information obtained in one frame to predict the visibility of the next [24]. Texture data is a dominant consumer of off-chip memory bandwidth, so Xie et al. explored the use of processing-in-memory to reduce texture memory traffic [244].
In contrast to all these efforts, CHOPIN focuses on efficient inter-GPU synchronization for parallel rendering in multi-GPU systems.

Chapter 8

Conclusions and Future Work

This chapter concludes the dissertation and provides directions for future work.

8.1 Conclusions

After numerous researchers have contributed innovations to both hardware architecture and software Application Programming Interfaces (APIs), the GPU has successfully built its own ecosystem, which provides high-performance and cost-efficient computing to a wide range of applications. True to its original purpose of accelerating graphics processing, the GPU is a highly parallel architecture designed to exploit fine-grained Data Level Parallelism (DLP) and Thread Level Parallelism (TLP); it trades off single-thread latency for system-level throughput. Therefore, ensuring that ample parallelism remains available during execution is critical to maximizing GPU performance.

In parallel programming, individual threads are not totally independent: operations on shared data and hardware structures need to be synchronized under specific ordering constraints. Inefficient synchronization support can serialize threads and reduce available parallelism, significantly hurting GPU performance. In this dissertation, we propose four enhancements to help GPU architectures provide efficient synchronization support to various applications.

First, we propose Relativistic Cache Coherence (RCC) in Chapter 3, a simple cache coherence protocol which can enforce Sequential Consistency (SC) efficiently with distributed logical timestamps. Thanks to the timestamp independence of different SM cores, RCC can process store requests instantly by advancing the timestamp of the writing core, rather than waiting for all sharer copies to be invalidated. Hence, RCC outperforms the best prior SC design by 29%, and closes the performance gap between SC and weak memory models to only 7%.
Additionally, RCC allows for switching between strong (RCC-SC) and weak (RCC-WO) consistency models at runtime with best-in-class performance and no hardware overhead.

Second, we propose GETM in Chapter 4, a novel logical-timestamp-based eager conflict detection mechanism for GPU Hardware Transactional Memory (HTM) that reduces the excessive latency of prior lazy solutions. GETM eagerly detects conflicts by checking the timestamps of transactions and of the data they access. Transactions are aborted immediately once a conflict is detected, so transactions that have reached the commit point can commit without additional validation. Benefiting from the dramatically faster conflict detection and transaction commits, GETM is up to 2.1× (1.2× gmean) faster than the best prior GPU TM proposal. Area overheads are 3.6× lower, and power overheads are 2.2× lower.

Third, we propose HMG in Chapter 5, a two-state hierarchical cache coherence protocol for efficient peer caching in multi-GPU systems that enforces scoped memory models. A coherence hierarchy is implemented to fully exploit intra-GPU data locality and reduce the bandwidth cost of inter-GPU accesses. As scoped memory models [98, 135] have been formalized as non-multi-copy-atomic, HMG processes non-synchronization stores instantly, without invalidation acknowledgments; only synchronization stores are stalled to guarantee correct data visibility. Because such stalls are very rare, it is unnecessary to add transient coherence states or other hardware structures to reduce them. In a 4-GPU system, HMG achieves 97% of the performance of an idealized caching system.

Finally, we propose CHOPIN in Chapter 6, a scalable Split Frame Rendering (SFR) scheme which fully exploits the parallelism available in image composition. CHOPIN eliminates the performance overheads of redundant computation and sequential primitive exchange that exist in prior solutions.
CHOPIN composes opaque sub-images out of order; adjacent transparent sub-images are composed asynchronously by leveraging the associativity of pixel blending. We also design a draw command scheduler and an image composition scheduler to address the problems of load imbalance and network congestion. In an 8-GPU system, CHOPIN outperforms the best prior work by up to 1.56× (1.25× gmean).

In summary, this dissertation shows that SC can be enforced efficiently in single-GPU systems with simple RCC. However, considering the huge bandwidth gap between inter- and intra-GPU links, future hierarchical multi-GPU architectures could change this insight. To alleviate the performance impact of bandwidth-limited inter-GPU links, we might need to relax the memory model to some extent and add extra structures to optimize remote GPU accesses. Although the latest scoped memory models can potentially maximize application parallelism and simplify hardware implementation by relaxing store atomicity and adding scope annotations, they may impose significant complexity on software programmers, thereby increasing the occurrence of synchronization bugs such as data races. Therefore, we believe that in future GPU systems the tradeoff between memory model, performance, and programmability needs to be explored more deeply.

This dissertation also shows that, with eager conflict detection, GPU HTM and lock-based synchronization can have comparable performance. However, to encourage the adoption of HTM in real GPU systems, we probably need to extend current GPU memory models with transaction-related rules to guarantee correctness. The combination of a relaxed memory model and HTM could be a good tradeoff: memory requests outside transactions can be reordered, while memory requests inside transactions execute sequentially.
Additionally, it’salso necessary to further reduce implementation cost of GPU HTM.8.2 Directions of Future Work8.2.1 Logical-Time Cache Coherence in Heterogeneous SystemsIn Chapter 3, we proposed Relativistic Cache Coherence (RCC), a logical-timestamp-based cache coherence protocol for efficient Sequential Consistency (SC) in GPUs.Previously, a logical-timestamp coherence protocol, TARDIS, was proposed forCPUs [255, 257]. Considering the fact that industry vendors have exposed a UnifiedMemory (i.e, unified virtual address space) abstraction to programmers [168],a logical-timestamp-based cache coherence protocol for systems with integrated138CPUs and GPUs might be efficient to provide both high performance and easyprogramming. Meanwhile, the logical-timestamp coherence protocol also couldbe extended to hardware accelerators, even though only physical-timestamp cachecoherence was exploited for FPGA accelerators [119]. Maintaining a consistentlogical-timestamp cache coherence across heterogeneous systems also can reducethe notorious complexity of hardware verification, because it’s unnecessary to verifymultiple different cache coherence protocols and the interface between them.In heterogeneous systems, applications running on different processors usuallydemonstrate super diverse execution characteristics, which will create differentperformance requirements for coherence protocol design. For example, at whichgranularity should the cache coherence be maintained between CPUs and GPUsneed to be well explored. 
A granularity that is too fine will result in frequent data communication and possibly under-utilize the connection bandwidth, while a coarse granularity might create substantial false sharing and potentially reduce data locality. Meanwhile, inefficient lease extension could also disrupt data exchange between CPUs and GPUs, significantly hurting performance.

8.2.2 Reducing Transaction Abort Rates of GETM

In Chapter 4, we propose GETM, the first GPU Hardware Transactional Memory (HTM) with an eager conflict detection mechanism. Even though GETM is faster than all prior GPU TM proposals, it results in higher abort rates (Table 4.4). This is a consequence of how much information is available to judge which transaction should be aborted when a conflict happens. In lazy mechanisms, transactions have finished executing, so all information acquired during execution can be taken advantage of; in contrast, eager mechanisms must make instant decisions during transaction execution. A novel warp scheduling algorithm would be helpful if it could predict the transactions that are likely to conflict. One way to approach such a scheduler is to profile the aborts and classify them by abort reason; a new scheduler could then be designed to avoid or reduce transaction aborts.

Compared to prior designs, GETM enables higher concurrency, allowing more transactions to execute in parallel (Table 4.4). However, GETM assumes a fixed concurrency level and is not aware of dynamic execution behavior. This can be optimized, because we observed that the level of contention between transactions changes dynamically at runtime. For example, Barnes Hut starts out as a high-contention application where every transaction tries to insert a leaf node near the root, and gradually relaxes into low contention as the octree grows. Therefore, a control mechanism could mitigate contention and reduce transaction abort rates if it dynamically adjusted the concurrency level.

8.2.3 Scoped Memory Model vs.
Easy Programming

In Chapter 5, we optimized HMG by leveraging the non-multi-copy-atomicity of the latest scoped GPU memory models [98, 135]. Although HMG successfully eliminates the complexity of invalidation acknowledgments and transient coherence states, the scope annotations inherent to the latest memory models can actually complicate programming. Programmers need to be aware of the scopes at which the latest value of shared data is visible; unreasonable scopes can result in incorrect synchronization. If the correct synchronization scope cannot be determined from static knowledge, programmers must conservatively use a larger scope, even when it is actually unnecessary. Therefore, an abstraction layer that hides the complexity of scopes would simplify programming and attract more users.

Considering that the GPU architecture is latency-tolerant and GPU applications do not share data as much as CPU applications, GPU scoped memory models might have some room to be enforced more strongly to reduce the effort required for correct synchronization. Although prior work has concluded that SC can achieve performance similar to weak memory models [195] and that the scope complexity of HRF [98] is not necessary for high performance [216], this research was conducted in conventional single-GPU systems. In modern or forward-looking multi-GPU systems, the latency/bandwidth gap between the broadest and narrowest scope is an order of magnitude larger. Therefore, SC or Data-Race-Free (DRF) might be too strong, or insufficient, to guarantee high performance.
The tradeoff between memory model, performance, and programmability needs to be explored more deeply.

8.2.4 Scaling CHOPIN to Larger Systems

CHOPIN (Chapter 6) as-is is applicable to NVIDIA DGX-scale systems [170, 173]; systems that are significantly larger (e.g., 1024 GPUs) may need more innovations. Meanwhile, the insatiable appetite for better visual quality has led to higher and higher resolutions, which can potentially increase the inter-GPU traffic load of image composition. These problems suggest the following future research directions.

First, adopting a pure Split Frame Rendering (SFR) mode in a larger system will leave each GPU with a small number of draw commands that is hard to load-balance, and will also make the scheduling of image composition much more challenging. Hence, to make full use of the available GPUs, a combination of Alternate Frame Rendering (AFR) and SFR will probably be a better choice. For example, in the combined mode, GPUs are divided into multiple groups, with AFR and SFR adopted for inter- and intra-group rendering, respectively. The schedulers in Sections 6.4.4 and 6.4.5 are proposed for SFR, so they remain applicable to intra-group rendering. Considering the workload variance between frames, a mechanism that dynamically adjusts the GPU group size could help achieve a better and smoother user experience.

Second, we found that a larger image-composition traffic load can reduce the performance benefit of CHOPIN (e.g., grid in Figures 6.12 and 6.14). Intuitively, we expect that an effective frame-content compression mechanism would help with this issue. Moreover, reducing the composition frequency could be an alternative, but this might require a change in the programming model to help programmers reorganize draw commands into fewer but larger composition groups.

Finally, the schedulers proposed in Chapter 6 have fixed functions, which are not adaptive to various programs.
However, modern graphics applications are becoming more and more diverse, so designing configurable schedulers and exposing their configuration to software would be desirable for programmers. This might require extending graphics APIs such as DirectX [7] and Vulkan [10].

Bibliography

[1] AMD's answer to Nvidia's NVLink is xGMI, and it's coming to the new 7nm Vega GPU. Accessed on 2020-06-25. → pages 4, 12, 75, 100, 103

[2] HSA Platform System Architecture Specification Version 1.2. Accessed on 2020-06-25. → pages 75, 78, 130

[3] How to fix micro stutter in videos+games?, 2016. Accessed on 2020-06-25. → page 103

[4] Nvidia GeForce GTX 1080i review: The best 4K graphics card right now., 2018. Accessed on 2020-06-25. → page 103

[5] What is microstutter and how do I fix it?, 2018. Accessed on 2020-06-25. → page 103

[6] Vulkan 1.1 out today with multi-GPU support, better DirectX compatibility., 2018. Accessed on 2020-06-25. → page 134

[7] Direct3D 12 Programming Guide., 2019. Accessed on 2020-07-08. → pages 2, 134, 141

[8] How to fix stuttering of CrossFire?, 2019. Accessed on 2020-06-25. → page 103

[9] A first look at Unreal Engine 5., 2020. Accessed on 2020-06-25. → page 127

[10] Vulkan® 1.2.146 - A Specification (with all registered Vulkan extensions)., 2020. Accessed on 2020-07-08. → pages 2, 141

[11] S. V. Adve and M. D. Hill. Weak Ordering-A New Definition. In Proceedings of the 17th International Symposium on Computer Architecture (ISCA), pages 2–14. IEEE, 1990. → page 15

[12] S. Aga, A. Singh, and S. Narayanasamy. zFENCE: Data-less Coherence for Efficient Fences. In Proceedings of the 29th International Conference on Supercomputing (ICS), pages 295–305. ACM, 2015. → page 130

[13] N. Agarwal, T. Krishna, L. S. Peh, and N. K. Jha. GARNET: A Detailed On-Chip Network Model inside a Full-System Simulator. In International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 33–42. IEEE, 2009. → page 38

[14] N. Agarwal, L.-S. Peh, and N. K. Jha.
In-Network Snoop Ordering: SnoopyCoherence on Unordered Interconnects. In Proceedings of the 15th Inter-national Symposium on High Performance Computer Architecture (HPCA),pages 67–78. IEEE, 2009. → page 131[15] K. Akeley. Reality Engine Graphics. In Proceedings of the 20th AnnualConference on Computer Graphics and Interactive Techniques (SIGGRAPH),pages 109–116. ACM, 1993. → page 134[16] J. Alglave, M. Batty, A. F. Donaldson, G. Gopalakrishnan, J. Ketema, D. Po-etzl, T. Sorensen, and J.Wickerson. GPUConcurrency: Weak Behaviours andProgramming Assumptions. In Proceedings of the 20th International Confer-ence on Architectural Support for Programming Languages and OperatingSystems (ASPLOS), pages 577–591. ACM, 2015.→ pages 4, 14, 20, 21, 38, 81[17] D. G. Aliaga and A. Lastra. Automatic Image Placement to Provide AGuaranteed Frame Rate. In Proceedings of the 26th Annual Conference onComputer Graphics and Interactive Techniques (SIGGRAPH), pages 307–316,1999. → page 115[18] J. Alsop, M. S. Orr, B. M. Beckmann, and D. A. Wood. Lazy ReleaseConsistency for GPUs. In Proceedings of the 49th International Symposiumon Microarchitecture (MICRO), page 26. IEEE, 2016. → pages 76, 78, 130143[19] J. Alsop, M. D. Sinclair, and S. V. Adve. Spandex: A Flexible Interface forEfficient Heterogeneous Coherence. In Proceedings of the 45th InternationalSymposium on Computer Architecture (ISCA), pages 261–274. IEEE, 2018.→ page 132[20] AMD. AMD CrossFire guide for Direct3D® 11 applica-tions. Accessed on 2020-06-25.→ pages 5, 7, 103[21] AMD. AMD Graphics Cores Next (GCN) Architecture., 2012. Ac-cessed on 2020-06-25. → pages 20, 21, 22, 32, 129[22] AMD. Multi-Chip Module Architecture: The Right Approach for EvolvingWorkloads., August 2017. Accessed on 2020-06-25. →page 100[23] C. S. Ananian, K. Asanović, B. C. Kuszmaul, C. E. Leiserson, and S. Lie.Unbounded Transactional Memory. In Proceedings of the 11th InternationalSymposium on High Performance Computer Architecture (HPCA), pages316–327. 
IEEE, 2005. → pages 62, 133[24] M. Anglada, E. de Lucas, J.-M. Parcerisa, J. L. Aragón, and A. González.Early Visibility Resolution for Removing Ineffectual Computations in theGraphics Pipeline. In Proceedings of the 25th International Symposium onHigh Performance Computer Architecture (HPCA), pages 635–646. IEEE,2019. → page 135[25] M. Anglada, E. de Lucas, J.-M. Parcerisa, J. L. Aragón, P. Marcuello, andA. González. Rendering Elimination: Early Discard of Redundant Tiles inthe Graphics Pipeline. In Proceedings of the 25th International Symposiumon High Performance Computer Architecture (HPCA), pages 623–634. IEEE,2019. → page 135[26] ARM Ltd. ARM Architecture Reference Manual: ARMv8, for ARMv8-Aarchitecture profile., 2013. Accessed on 2020-06-25. → pages 19, 130[27] J.-M. Arnau, J.-M. Parcerisa, and P. Xekalakis. Eliminating RedundantFragment Shader Executions on aMobile GPU via HardwareMemoization. InProceeedings of the 41th International Symposium on Computer Architecture(ISCA), pages 529–540. ACM, 2014. → page 135144[28] D. C. Arnold, D. H. Ahn, B. R. De Supinski, G. L. Lee, B. P. Miller, andM. Schulz. Stack Trace Analysis for Large Scale Debugging. In InternationalParallel and Distributed Processing Symposium (IPDPS), pages 1–10. IEEE,2007. → page 46[29] A. Arunkumar, E. Bolotin, B. Cho, U. Milic, E. Ebrahimi, O. Villa, A. Jaleel,C.-J. Wu, and D. Nellans. MCM-GPU: Multi-Chip-Module GPUs forContinued Performance Scalability. In Proceedings of the 44th InternationalSymposium on Computer Architecture (ISCA), pages 320–332. ACM, 2017.→ pages 4, 11, 74, 75, 76, 92, 100, 103[30] R. Ausavarungnirun, J. Landgraf, V. Miller, S. Ghose, J. Gandhi, C. J.Rossbach, and O.Mutlu. Mosaic: AGPUMemoryManager with Application-Transparent Support for Multiple Page Sizes. In Proceedings of the 50thInternational Symposium on Microarchitecture (MICRO), pages 136–150.ACM, 2017. → page 79[31] A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt. 
Ana-lyzing CUDA Workloads Using a Detailed GPU Simulator. In InternationalSymposium on Performance Analysis of Systems and Software (ISPASS),pages 163–174. IEEE, 2009. → pages xiv, 10, 37, 38, 65, 91[32] B. Bentley. Validating the Intel® Pentium® 4Microprocessor. In Proceedingsof the 38th Annual Design Automation Conference (DAC), pages 244–248,2001. → page 42[33] E. W. Bethel, H. Childs, and C. Hansen. High Performance Visualization:Enabling Extreme-Scale Scientific Insight (Chapter 5). CRC Press, 2012. →pages 7, 105[34] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu,J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell,M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood. The Gem5 Simulator.SIGARCH Computer Architecture News, 39:1–7, 2011. → page 37[35] B. H. Bloom. Space/Time Trade-offs in Hash Coding with Allowable Errors.Communications of the ACM, 13(7):422–426, 1970. → page 133[36] C. Blundell, M. M. Martin, and T. F. Wenisch. InvisiFence: Performance-Transparent Memory Ordering in Conventional Multiprocessors. In Proceed-ings of the 36th International Symposium on Computer Architecture (ISCA),pages 233–244. ACM, 2009. → page 130145[37] J. Bobba, N. Goyal, M. D. Hill, M. M. Swift, and D. A. Wood. TokenTM:Efficient Execution of Large Transactions with Hardware TransactionalMemory. In Proceedings of the 35th International Symposium on ComputerArchitecture (ISCA), pages 127–138. IEEE, 2008. → page 133[38] H.-J. Boehm and S. V. Adve. Foundations of the C++ Concurrency MemoryModel. In Proceedings of the 29th International Conference on ProgrammingLanguage Design and Implementation (PLDI), pages 68–78. ACM, 2008. →page 15[39] C. Boyd. DirectX 11 Compute Shader., 2008. Accessed on 2020-06-25. →page 17[40] A. Brownsword. Cloth in OpenCL. In GDC, 2009. → pages 38, 66[41] S. Burckhardt, R. Alur, and M. M. K. Martin. Verifying Safety of a TokenCoherence Implementation by Parametric Compositional Refinement. 
InInternational Workshop on Verification, Model Checking, and AbstractInterpretation (VMCAI), pages 130–145, 2005. → page 42[42] M. Burtscher and K. Pingali. An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-Body Algorithm. In GPU Computing Gems EmeraldEdition, pages 75–92. Elsevier, 2011. → pages 38, 66[43] M. Burtscher, R. Nasre, and K. Pingali. A Quantitative Study of Irregular Pro-grams on GPUs. In International Symposium on Workload Characterization(IISWC), pages 141–151. IEEE, 2012. → page 75[44] H. W. Cain, M. M. Michael, B. Frey, C. May, D. Williams, and H. Le. RobustArchitectural Support for Transactional Memory in the Power Architecture. InProceedings of the 40th International Symposium on Computer Architecture(ISCA), pages 225–236. ACM, 2013. → page 47[45] J. F. Cantin, M. H. Lipasti, and J. E. Smith. The Complexity of VerifyingMemory Coherence. In Proceedings of the 15th Annual Symposium onParallel Algorithms and Architectures (SPAA), pages 254–255. ACM, 2003.→ page 15[46] J. Casper, T.Oguntebi, S.Hong, N.G.Bronson, C.Kozyrakis, andK.Olukotun.Hardware Acceleration of Transactional Memory on Commodity Systems. InProceedings of the 16th International Conference on Architectural Support146for Programming Language and Operating Systems (ASPLOS), pages 27–38.ACM, 2011. → page 133[47] D. Cederman and P. Tsigas. On Dynamic Load Balancing on GraphicsProcessors. In Proceedings of the 23rd SIGGRAPH/EUROGRAPHICSSymposium on Graphics Hardware (GH), pages 57–64. ACM, 2008. → page38[48] D. Cederman, P. Tsigas, and M. T. Chaudhry. Towards a Software Trans-actional Memory for Graphics Processors. In Eurographics Symposium onParallel Graphics and Visualization (EGPGV), pages 121–129, 2010. →page 133[49] L. Ceze, J. Tuck, J. Torrellas, and C. Cascaval. Bulk Disambiguationof Speculative Threads in Multiprocessors. In Proceedings of the 33rdInternational Symposium on Computer Architecture (ISCA), pages 227–238.ACM, 2006. → page 133[50] L. Ceze, J. 
Tuck, P.Montesinos, and J. Torrellas. BulkSC:Bulk Enforcement ofSequential Consistency. In Proceedings of the 34th International Symposiumon Computer Architecture (ISCA), pages 278–289. ACM, 2007. → page 130[51] H. Chafi, J. Casper, B. D. Carlstrom, A. McDonald, C. C. Minh, W. Baek,C. Kozyrakis, and K. Olukotun. A Scalable, Non-blocking Approach toTransactional Memory. In Proceedings of the 13th International Symposiumon High Performance Computer Architecture (HPCA), pages 97–108. IEEE,2007. → pages 50, 133[52] S. Chaudhry, R. Cypher, M. Ekman, M. Karlsson, A. Landin, S. Yip, H. Zeffer,and M. Tremblay. Rock: A High-Performance Sparc CMT Processor. IEEEMicro, 29(2):6–16, 2009. → page 47[53] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, andK. Skadron. Rodinia: A Benchmark Suite For Heterogeneous Computing.In International Symposium on Workload Characterization (IISWC), pages44–54. IEEE, 2009. → pages 38, 77, 92[54] S. Che, B. M. Beckmann, S. K. Reinhardt, and K. Skadron. Pannotia:Understanding Irregular GPGPU Graph Applications. In InternationalSymposium on Workload Characterization (IISWC), pages 185–195. IEEE,2013. → page 75147[55] S. Chen and L. Peng. Efficient GPU Hardware Transactional Memorythrough Early Conflict Resolution. In Proceedings of the 22nd InternationalSymposium on High Performance Computer Architecture (HPCA), pages274–284. IEEE, 2016. → pages 47, 48, 65, 66, 72, 132[56] S. Chen, L. Peng, and S. Irving. Accelerating GPU Hardware TransactionalMemory with Snapshot Isolation. In Proceedings of the 44th InternationalSymposium on Computer Architecture (ISCA), pages 282–294. IEEE, 2017.→ pages 48, 133[57] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro,and E. Shelhamer. cuDNN: Efficient Primitives for Deep Learning. CoRR,abs/1410.0759, October 2014. URL → page74[58] L. Chien. How to Avoid Global Synchronization by Domino Scheme. NVIDIAGPU Technology Conference (GTC), 2014. → pages 77, 92[59] B. Choi, R. 
Komuravelli, H. Sung, R. Smolinski, N. Honarmand, S. V. Adve, V. S. Adve, N. P. Carter, and C.-T. Chou. DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism. In Proceedings of the 20th International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 155–166. IEEE, 2011. → page 130
[60] J. Chung, L. Yen, S. Diestelhorst, M. Pohlack, M. Hohmuth, D. Christie, and D. Grossman. ASF: AMD64 Extension for Lock-Free Data Structures and Transactional Memory. In Proceedings of the 43rd International Symposium on Microarchitecture (MICRO), pages 39–50. IEEE, 2010. → pages 47, 133
[61] E. M. Clarke, O. Grumberg, H. Hiraishi, S. Jha, D. E. Long, K. L. McMillan, and L. A. Ness. Verification of the Futurebus+ Cache Coherence Protocol. Formal Methods in System Design, 6(2):217–232, 1995. → page 42
[62] W. J. Dally, C. T. Gray, J. Poulton, B. Khailany, J. Wilson, and L. Dennison. Hardware-Enabled Artificial Intelligence. In Symposium on VLSI Circuits, pages 3–6. IEEE, 2018. → page 11
[63] V. M. Del Barrio, C. González, J. Roca, A. Fernández, and E. Espasa. ATTILA: A Cycle-Level Execution-Driven Simulator for Modern GPU Architectures. In Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 231–241. IEEE, 2006. → page 119
[64] G. Diamos, S. Sengupta, B. Catanzaro, M. Chrzanowski, A. Coates, E. Elsen, J. Engel, A. Hannun, and S. Satheesh. Persistent RNNs: Stashing Recurrent Weights On-Chip. In International Conference on Machine Learning (ICML), pages 2024–2033, 2016. → page 77
[65] D. Dice, O. Shalev, and N. Shavit. Transactional Locking II. In International Symposium on Distributed Computing, pages 194–208. Springer, 2006. → page 50
[66] D. Dice, Y. Lev, M. Moir, and D. Nussbaum. Early Experience with a Commercial Hardware Transactional Memory Implementation. In Proceedings of the 14th International Conference on Architectural Support for Programming Language and Operating Systems (ASPLOS), pages 157–168.
ACM, 2009. → page 133
[67] D. L. Dill, A. J. Drexler, A. J. Hu, and C. H. Yang. Protocol Verification as a Hardware Design Aid. In ICCD, volume 92, pages 522–525. Citeseer, 1992. → page 42
[68] M. Dubois, C. Scheurich, and F. Briggs. Memory Access Buffering in Multiprocessors. In Proceedings of the 13th International Symposium on Computer Architecture (ISCA), pages 434–442. ACM, 1986. → pages 20, 37
[69] L. Durant, O. Giroux, M. Harris, and N. Stam. Inside Volta: The World's Most Advanced Data Center GPU, 2017. Accessed on 2020-06-25. → pages 3, 45
[70] S. Eilemann, M. Makhinya, and R. Pajarola. Equalizer: A Scalable Parallel Rendering Framework. IEEE Transactions on Visualization and Computer Graphics (TVCG), 15(3):436–452, 2009. → pages 103, 104, 106, 134
[71] M. Eldridge, H. Igehy, and P. Hanrahan. Pomegranate: A Fully Scalable Graphics Architecture. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), pages 443–454. ACM, 2000. → page 134
[72] A. ElTantawy and T. M. Aamodt. MIMD Synchronization on SIMT Architectures. In Proceedings of the 49th International Symposium on Microarchitecture (MICRO), pages 1–14. IEEE, 2016. → pages 4, 46
[73] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger. Dark Silicon and the End of Multicore Scaling. In Proceedings of the 38th International Symposium on Computer Architecture (ISCA), pages 365–376. IEEE, 2011. → page 1
[74] P. Felber, C. Fetzer, P. Marlier, and T. Riegel. Time-Based Software Transactional Memory. Transactions on Parallel and Distributed Systems, 21(12):1793–1807, 2010. → page 133
[75] H. Fuchs, J. Poulton, J. Eyles, T. Greer, J. Goldfeather, D. Ellsworth, S. Molnar, G. Turk, B. Tebbs, and L. Israel. Pixel-Planes 5: A Heterogeneous Multiprocessor Graphics System Using Processor-Enhanced Memories. Siggraph Computer Graphics, 23(3):79–88, 1989. → page 134
[76] W. W. L. Fung and T. M. Aamodt. Energy Efficient GPU Transactional Memory via Space-time Optimizations.
In Proceedings of the 46th International Symposium on Microarchitecture (MICRO), pages 408–420. IEEE, 2013. → pages 4, 6, 47, 48, 50, 53, 54, 61, 63, 65, 66, 72, 132, 133
[77] W. W. L. Fung, I. Singh, A. Brownsword, and T. M. Aamodt. Hardware Transactional Memory for GPU Architectures. In Proceedings of the 44th International Symposium on Microarchitecture (MICRO), pages 296–307. IEEE, 2011. → pages 4, 6, 47, 48, 54, 61, 63, 65, 66, 132
[78] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy. Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors. In Proceedings of the 17th International Symposium on Computer Architecture (ISCA), pages 15–26. IEEE, 1990. → pages 20, 37, 130
[79] K. Gharachorloo, A. Gupta, and J. L. Hennessy. Two Techniques to Enhance the Performance of Memory Consistency Models. In Proceedings of the 20th International Conference on Parallel Processing (ICPP), 1991. → page 130
[80] K. Gharachorloo, M. Sharma, S. Steely, and S. Van Doren. Architecture and Design of AlphaServer GS320. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 13–24. ACM, 2000. → page 132
[81] C. Gniady and B. Falsafi. Speculative Sequential Consistency with Little Custom Storage. In Proceedings of the 11th International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 179–188. IEEE, 2002. → page 130
[82] C. Gniady, B. Falsafi, and T. N. Vijaykumar. Is SC + ILP = RC? In Proceedings of the 26th International Symposium on Computer Architecture (ISCA), pages 162–171. ACM, 1999. → page 130
[83] J. Gong, S. Markidis, E. Laure, M. Otten, P. Fischer, and M. Min. Nekbone Performance on GPUs with OpenACC and CUDA Fortran Implementations. The Journal of Supercomputing, 72(11):4160–4180, 2016. → pages 77, 92
[84] D. Gope and M. H. Lipasti.
Atomic SC for Simple In-order Processors. In Proceedings of the 20th International Symposium on High Performance Computer Architecture (HPCA), pages 404–415. IEEE, 2014. → pages 24, 130
[85] A. P. Grosset, M. Prasad, C. Christensen, A. Knoll, and C. Hansen. TOD-Tree: Task-Overlapped Direct Send Tree Image Compositing for Hybrid MPI Parallelism and GPUs. IEEE Transactions on Visualization and Computer Graphics (TVCG), 23(6):1677–1690, 2016. → page 127
[86] S.-L. Guo, H.-X. Wang, Y.-B. Xue, C.-M. Li, and D.-S. Wang. Hierarchical Cache Directory for CMP. Journal of Computer Science and Technology, 25(2):246–256, 2010. → pages 80, 132
[87] A. Gutierrez, B. Beckmann, A. Dutu, J. Gross, J. Kalamatianos, O. Kayiran, M. Lebeane, M. Poremba, B. Potter, S. Puthoor, M. D. Sinclair, M. Wyse, J. Yin, X. Zhang, A. Jain, and T. G. Rogers. Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level. In Proceedings of the 24th International Symposium on High Performance Computer Architecture (HPCA), pages 141–155. IEEE, 2018. → page 92
[88] E. Hagersten and M. Koster. WildFire: A Scalable Path for SMPs. In Proceedings of the 5th International Symposium on High Performance Computer Architecture (HPCA), pages 172–181. IEEE, 1999. → pages 80, 132
[89] L. Hammond, V. Wong, M. Chen, B. D. Carlstrom, J. D. Davis, B. Hertzberg, M. K. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun. Transactional Memory Coherence and Consistency. In Proceedings of the 31st International Symposium on Computer Architecture (ISCA). ACM, 2004. → pages 13, 14, 130, 133
[90] R. Haring, M. Ohmacht, T. Fox, M. Gschwind, D. Satterfield, K. Sugavanam, P. Coteus, P. Heidelberger, M. Blumrich, R. Wisniewski, A. Gara, G. Chiu, P. Boyle, N. Christ, and C. Kim. The IBM Blue Gene/Q Compute Chip. IEEE Micro, 32(2):48–60, 2012. → page 47
[91] P. Harish and P. Narayanan. Accelerating Large Graph Algorithms on the GPU Using CUDA. In International Conference on High-Performance Computing (HiPC), pages 197–208. Springer, 2007.
→ pages 2, 6, 74, 77
[92] M. Harris and L. Nyland. Inside Pascal: NVIDIA's Newest Computing Platform. NVIDIA GPU Technology Conference (GTC), 2016. → pages 21, 22
[93] T. Harris, S. Marlow, S. Peyton-Jones, and M. Herlihy. Composable Memory Transactions. In Proceedings of the 10th Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 48–60. ACM, 2005. → page 47
[94] B. A. Hechtman and D. J. Sorin. Exploring Memory Consistency for Massively-Threaded Throughput-Oriented Processors. In Proceedings of the 40th International Symposium on Computer Architecture (ISCA), pages 201–212. ACM, 2013. → pages 4, 20, 21, 22, 129
[95] B. A. Hechtman, S. Che, D. R. Hower, Y. Tian, B. M. Beckmann, M. D. Hill, S. K. Reinhardt, and D. A. Wood. QuickRelease: A Throughput-oriented Approach to Release Consistency on GPUs. In Proceedings of the 20th International Symposium on High Performance Computer Architecture (HPCA), pages 189–200. IEEE, 2014. → pages 76, 79, 130, 132
[96] M. Herlihy and J. E. B. Moss. Transactional Memory: Architectural Support for Lock-Free Data Structures. In Proceedings of the 20th International Symposium on Computer Architecture (ISCA), pages 289–300. ACM, 1993. → pages 4, 46, 133
[97] T. H. Hetherington, M. O'Connor, and T. M. Aamodt. MemcachedGPU: Scaling-up Scale-out Key-value Stores. In Proceedings of the 6th ACM Symposium on Cloud Computing (SoCC), pages 43–57. ACM, 2015. → page 45
[98] D. R. Hower, B. A. Hechtman, B. M. Beckmann, B. R. Gaster, M. D. Hill, S. K. Reinhardt, and D. A. Wood. Heterogeneous-Race-Free Memory Models. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 427–440. ACM, 2014. → pages 4, 5, 15, 75, 78, 81, 130, 137, 140
[99] W. M. Hsu. Segmented Ray Casting for Data Parallel Volume Rendering. In Proceedings of the 1993 Symposium on Parallel Rendering, pages 7–14. IEEE, 1993. → page 105
[100] J. Huang. GTC Keynote Speech, 2019. Accessed on 2020-07-08.
→ page 3
[101] S. Huang, S. Xiao, and W.-c. Feng. On the Energy Efficiency of Graphics Processing Units for Scientific Computing. In International Symposium on Parallel & Distributed Processing (IPDPS), pages 1–8. IEEE, 2009. → pages 2, 3
[102] G. Humphreys, M. Eldridge, I. Buck, G. Stoll, M. Everett, and P. Hanrahan. WireGL: A Scalable Graphics System for Clusters. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), pages 129–140. ACM, 2001. → pages 104, 106, 134
[103] G. Humphreys, M. Houston, R. Ng, R. Frank, S. Ahern, P. D. Kirchner, and J. T. Klosowski. Chromium: A Stream-Processing Framework for Interactive Rendering on Clusters. ACM Transactions on Graphics (TOG), 21(3):693–702, 2002. → pages 103, 104, 106, 134
[104] IBM. Power ISA, Version 2.07B, 2015. Accessed on 2020-06-25. → pages 19, 21
[105] Inside HPC. TOP500 Shows Growing Momentum for Accelerators, 2015. Accessed on 2020-06-25. → page 74
[106] Intel. Intel Architecture Instruction Set Extensions Programming Reference: Chapter 8: Intel Transactional Synchronization Extensions. Technical report, 2012. → page 47
[107] C. Jacobi, T. Slegel, and D. Greiner. Transactional Memory Architecture and Implementation for IBM System Z. In Proceedings of the 45th International Symposium on Microarchitecture (MICRO), pages 25–36. IEEE, 2012. → page 47
[108] A. Jain, M. Khairy, and T. G. Rogers. A Quantitative Evaluation of Contemporary GPU Simulation Methodology. Proceedings of the ACM on Measurement and Analysis of Computing Systems (SIGMETRICS), page 35, 2018. → page 91
[109] A. B. Kahng, B. Li, L.-S. Peh, and K. Samadi. ORION 2.0: A Fast and Accurate NoC Power and Area Model for Early-stage Design Space Exploration. In Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 423–428. IEEE, 2009. → page 38
[110] G. Kestor, V. Karakostas, O. S. Unsal, A. Cristal, I. Hur, and M. Valero. RMS-TM: A Comprehensive Benchmark Suite for Transactional Memory Systems.
In Proceedings of the 2nd International Conference on Performance Engineering (ICPE), pages 335–346. ACM, 2011. → page 66
[111] M. Khairy, A. Jain, T. M. Aamodt, and T. G. Rogers. Exploring Modern GPU Memory System Design Challenges through Accurate Modeling. CoRR, abs/1810.07269, October 2018. → page 91
[112] Khronos. The OpenCL Specification Version 2.2, 2019. Accessed on 2020-07-03. → pages 2, 3, 9, 75, 78, 130
[113] J. Kim and C. Batten. Accelerating Irregular Algorithms on GPGPUs Using Fine-Grain Hardware Worklists. In Proceedings of the 47th International Symposium on Microarchitecture (MICRO), pages 75–87. IEEE, 2014. → page 75
[114] Y. Kim, J.-E. Jo, H. Jang, M. Rhu, H. Kim, and J. Kim. GPUpd: A Fast and Scalable Multi-GPU Architecture Using Cooperative Projection and Distribution. In Proceedings of the 50th International Symposium on Microarchitecture (MICRO), pages 574–586. ACM, 2017. → pages 5, 7, 12, 104, 107, 121, 133
[115] A. Kirsch, M. Mitzenmacher, and U. Wieder. More Robust Hashing: Cuckoo Hashing with a Stash. SIAM Journal on Computing, 39(4):1543–1561, 2009. → pages 63, 70
[116] T. Knight. An Architecture for Mostly Functional Languages. In Proceedings of the Conference on LISP and Functional Programming, pages 105–112. ACM, 1986. → page 133
[117] K. Koukos, A. Ros, E. Hagersten, and S. Kaxiras. Building Heterogeneous Unified Virtual Memories (UVMs) without the Overhead. ACM Transactions on Architecture and Code Optimization (TACO), 13(1):1, 2016. → pages 78, 130
[118] M. Kulkarni, M. Burtscher, C. Casçaval, and K. Pingali. LoneStar: A Suite of Parallel Irregular Programs. In International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 65–76. IEEE, 2009. → pages 6, 77, 92
[119] S. Kumar, A. Shriraman, and N. Vedula. Fusion: Design Tradeoffs in Coherent Cache Hierarchies for Accelerators. In Proceedings of the 42nd International Symposium on Computer Architecture (ISCA), pages 733–745. ACM, 2015. → pages 25, 131, 139
[120] S. Laine and T. Karras.
High-Performance Software Rasterization on GPUs. In Proceedings of the SIGGRAPH Symposium on High Performance Graphics (HPG), pages 79–88. ACM, 2011. → page 106
[121] S. Lam and L. Kleinrock. Packet Switching in a Multiaccess Broadcast Channel: Dynamic Control Procedures. Transactions on Communications, 23(9):891–904, 1975. → page 61
[122] L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Commun. ACM, 21:558, 1978. → pages 24, 25, 131
[123] L. Lamport. How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs. Transactions on Computers, (9):690–691, 1979. → pages 14, 19
[124] H. Q. Le, W. J. Starke, J. S. Fields, F. P. O'Connell, D. Q. Nguyen, B. J. Ronchetti, W. M. Sauer, E. M. Schwarz, and M. T. Vaden. IBM POWER6 Microarchitecture. IBM Journal of Research and Development, 51(6):639–662, 2007. → page 131
[125] E. A. Lee. The Problem with Threads. IEEE Computer, 39(5):33–42, 2006. → page 46
[126] D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy. The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor. In Proceedings of the 17th International Symposium on Computer Architecture (ISCA), pages 148–159. IEEE, 1990. → pages 80, 132
[127] A. Levinthal and T. Porter. Chap – A SIMD Graphics Processor. In Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), pages 77–82. ACM, 1984. → pages 10, 61
[128] J. Lew, D. A. Shah, S. Pati, S. Cattell, M. Zhang, A. Sandhupatla, C. Ng, N. Goli, M. D. Sinclair, T. G. Rogers, et al. Analyzing Machine Learning Workloads Using a Detailed GPU Simulator. In International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 151–152. IEEE, 2019. → page 91
[129] A. Li, G.-J. van den Braak, H. Corporaal, and A. Kumar. Fine-Grained Synchronizations and Dataflow Programming on GPUs. In Proceedings of the 29th International Conference on Supercomputing (ICS), pages 109–118. ACM, 2015. → page 45
[130] D. Li and M. Becchi.
Multiple Pairwise Sequence Alignments with the Needleman-Wunsch Algorithm on GPU. In SC Companion: High Performance Computing, Networking, Storage and Analysis (SCC), pages 1471–1472. IEEE, 2012. → pages 77, 92
[131] C. Lin, V. Nagarajan, R. Gupta, and B. Rajaram. Efficient Sequential Consistency via Conflict Ordering. In Proceedings of the 17th International Conference on Architectural Support for Programming Language and Operating Systems (ASPLOS), pages 273–286. ACM, 2012. → pages 24, 130
[132] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, 28(2):39–55, 2008. → page 2
[133] H. Litz, D. Cheriton, A. Firoozshahian, O. Azizi, and J. P. Stevenson. SI-TM: Reducing Transactional Memory Abort Rates Through Snapshot Isolation. In Proceedings of the 19th International Conference on Architectural Support for Programming Language and Operating Systems (ASPLOS), pages 383–398. ACM, 2014. → page 133
[134] D. Luebke and G. Humphreys. How GPUs Work. Computer, 40(2):96–100, 2007. → page 102
[135] D. Lustig, S. Sahasrabuddhe, and O. Giroux. A Formal Analysis of the NVIDIA PTX Memory Consistency Model. In Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 257–270. ACM, 2019. → pages 4, 5, 15, 75, 78, 80, 81, 130, 137, 140
[136] K.-L. Ma, J. S. Painter, C. D. Hansen, and M. F. Krogh. A Data Distributed, Parallel Algorithm for Ray-Traced Volume Rendering. In Proceedings of Parallel Rendering Symposium, pages 15–22. IEEE, 1993. → page 105
[137] J. Manson, W. Pugh, and S. V. Adve. The Java Memory Model. In Proceedings of the Annual Symposium on Principles of Programming Languages (POPL), pages 378–391. ACM, 2005. → page 15
[138] M. M. K. Martin, D. J. Sorin, A. Ailamaki, A. R. Alameldeen, R. M. Dickson, C. J. Mauer, K. E. Moore, M. Plakal, M. D. Hill, and D. A. Wood. Timestamp Snooping: An Approach for Extending SMPs.
In Proceedings of the 9th International Conference on Architectural Support for Programming Language and Operating Systems (ASPLOS), pages 25–36. ACM, 2000. → page 131
[139] A. Meixner and D. J. Sorin. Dynamic Verification of Memory Consistency in Cache-Coherent Multithreaded Computer Architectures. volume 6, pages 18–31. IEEE, 2008. → page 131
[140] M. Méndez-Lojo, M. Burtscher, and K. Pingali. A GPU Implementation of Inclusion-based Points-to Analysis. In Proceedings of the 17th Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 107–116. ACM, 2012. → page 45
[141] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, Version 3.1, June 2015. Accessed on 2020-06-25. → page 100
[142] Mike Houston. Anatomy of AMD's TeraScale Graphics Engine, 2008. Accessed on 2020-06-25. → pages 2, 119
[143] U. Milic, O. Villa, E. Bolotin, A. Arunkumar, E. Ebrahimi, A. Jaleel, A. Ramirez, and D. Nellans. Beyond the Socket: NUMA-aware GPUs. In Proceedings of the 50th International Symposium on Microarchitecture (MICRO), pages 123–135. ACM, 2017. → pages 4, 11, 75, 76, 92, 103
[144] C. C. Minh, M. Trautmann, J. Chung, A. McDonald, N. Bronson, J. Casper, C. Kozyrakis, and K. Olukotun. An Effective Hybrid Transactional Memory System with Strong Isolation Guarantees. In Proceedings of the 34th International Symposium on Computer Architecture (ISCA), pages 69–80. ACM, 2007. → page 133
[145] L. Moll, A. Heirich, and M. Shand. Sepia: Scalable 3D Compositing Using PCI Pamette. In Proceedings of the 7th International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 146–155. IEEE, 1999. → page 105
[146] S. Molnar, J. Eyles, and J. Poulton. PixelFlow: High-Speed Rendering Using Image Composition. In Proceedings of the 19th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), pages 231–240. ACM, 1992. → page 134
[147] S. Molnar, M. Cox, D. Ellsworth, and H. Fuchs. A Sorting Classification of Parallel Rendering.
IEEE Computer Graphics and Applications (CG&A), 14(4):23–32, 1994. → page 106
[148] J. R. Monfort and M. Grossman. Scaling of 3D Game Engine Workloads on Modern Multi-GPU Systems. In Proceedings of the Conference on High Performance Graphics (HPG), pages 37–46. ACM, 2009. → page 106
[149] K. E. Moore, J. Bobba, M. J. Moravan, M. D. Hill, and D. A. Wood. LogTM: Log-based Transactional Memory. In Proceedings of the 12th International Symposium on High Performance Computer Architecture (HPCA), pages 254–265. IEEE, 2006. → pages 13, 14, 49, 50, 133
[150] N. Moscovici, N. Cohen, and E. Petrank. POSTER: A GPU-Friendly Skiplist Algorithm. In Proceedings of the 22nd Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 449–450. ACM, 2017. → page 45
[151] V. Moya, C. Gonzalez, J. Roca, A. Fernandez, and R. Espasa. Shader Performance Analysis on a Modern GPU Architecture. In Proceedings of the 38th International Symposium on Microarchitecture (MICRO), pages 355–364. IEEE, 2005. → page 119
[152] C. Mueller. The Sort-First Rendering Architecture for High-Performance Graphics. In Proceedings of the 1995 Symposium on Interactive 3D Graphics (I3D), pages 75–84, 1995. → page 103
[153] D. Mulnix. Intel Xeon Processor Scalable Family Technical Overview, 2017. Accessed on 2020-06-25. → pages 80, 132
[154] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi. CACTI 6.0: A Tool to Model Large Caches. Technical report, HP Laboratories, 2009. → page 66
[155] V. Nagarajan, D. J. Sorin, M. D. Hill, and D. A. Wood. A Primer on Memory Consistency and Cache Coherence, Second Edition. Synthesis Lectures on Computer Architecture, 15(1):1–294, 2020. → pages 14, 15, 80
[156] S. K. Nandy and R. Narayan. An Incessantly Coherent Cache Scheme for Shared Memory Multithreaded Systems. MIT LCS CSG Memo 356. → page 131
[157] U. Neumann. Communication Costs for Parallel Volume-Rendering Algorithms. IEEE Computer Graphics and Applications (CG&A), 14(4):49–58, 1994. → page 105
[158] NVIDIA.
GeForce Now: The Power to Play. Accessed on 2020-07-08. → page 3
[159] NVIDIA. NVIDIA Clara Parabricks. Accessed on 2020-07-08. → page 3
[160] NVIDIA. NVIDIA NVLink: High Speed GPU Interconnect. Accessed on 2020-06-25. → pages 3, 4, 12, 75, 100, 103
[161] NVIDIA. NVIDIA RTX™ platform. Accessed on 2020-07-08. → page 3
[162] NVIDIA. NVIDIA NVSwitch: The World's Highest-Bandwidth On-Node Switch. Accessed on 2020-06-25. → pages 3, 4, 75, 100, 103
[163] NVIDIA. NVIDIA TENSOR CORES: Unprecedented Acceleration for HPC and AI. Accessed on 2020-07-08. → page 3
[164] NVIDIA. NVIDIA Turing GPU Architecture Whitepaper. Accessed on 2020-06-25. → pages 17, 120
[165] NVIDIA. NVIDIA's Next Generation CUDA Compute Architecture: Fermi, 2009. Accessed on 2020-06-25. → pages 20, 21, 22, 32, 38, 66, 129
[166] NVIDIA. SLI Best Practices, 2011. Accessed on 2020-06-25. → pages 5, 7, 103, 106, 134
[167] NVIDIA. NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110, 2012. Accessed on 2020-06-25. → pages 20, 21, 22, 32, 129
[168] NVIDIA. Unified Memory in CUDA 6, Nov 2013. Accessed on 2020-06-25. → pages 3, 78, 88, 138
[169] NVIDIA. NVIDIA GeForce GTX 1080, 2016. Accessed on 2020-06-25. → page 103
[170] NVIDIA. NVIDIA DGX-1: Essential Instrument for AI Research, 2017. Accessed on 2020-06-25. → pages 3, 75, 103, 109, 120, 141
[171] NVIDIA. NVIDIA Tesla V100 Architecture, August 2017. Accessed on 2020-06-25. → page 130
[172] NVIDIA. NVIDIA Tesla V100 GPU Architecture Whitepaper, 2017. Accessed on 2020-06-25. → page 119
[173] NVIDIA. NVIDIA DGX-2: The world's most powerful AI system for the most complex AI challenges, 2018. Accessed on 2020-06-25. → pages 3, 75, 103, 109, 120, 141
[174] NVIDIA. NVIDIA HGX-2: Powered by NVIDIA Tesla V100 GPUs and NVSwitch, 2018. Accessed on 2020-06-25. → pages 3, 75, 103
[175] NVIDIA. New GPU-Accelerated Weather Forecasting System Dramatically Improves Accuracy, 2019. Accessed on 2020-07-08. → page 3
[176] NVIDIA. NVIDIA 2019 Annual Review, 2019.
Accessed on 2020-06-25. → page 103
[177] NVIDIA. CUDA C++ Programming Guide Version 11.0, 2020. Accessed on 2020-07-03. → pages 2, 9
[178] NVIDIA. NVIDIA DLSS 2.0: A Big Leap In AI Rendering, 2020. Accessed on 2020-07-08. → page 3
[179] L. E. Olson, M. D. Hill, and D. A. Wood. Crossing Guard: Mediating Host-Accelerator Coherence Interactions. In Proceedings of the 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 163–176. ACM, 2017. → page 132
[180] OpenSHMEM Project. OpenSHMEM Application Programming Interface, December 2017. Accessed on 2020-06-25. → page 100
[181] S. Owens, S. Sarkar, and P. Sewell. A Better x86 Memory Model: x86-TSO. In International Conference on Theorem Proving in Higher Order Logics (TPHOL), pages 391–407. Springer, 2009. → pages 19, 21, 130
[182] R. Pagh and F. F. Rodler. Cuckoo Hashing. In European Symposium on Algorithms, pages 121–133. Springer, 2001. → page 63
[183] T. Peterka, D. Goodell, R. Ross, H.-W. Shen, and R. Thakur. A Configurable Algorithm for Parallel Image-Compositing Applications. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pages 1–10. IEEE, 2009. → page 105
[184] J. C. Phillips, R. Braun, W. Wang, J. Gumbart, E. Tajkhorshid, E. Villa, C. Chipot, R. D. Skeel, L. Kale, and K. Schulten. Scalable Molecular Dynamics with NAMD. Journal of Computational Chemistry, 26(16):1781–1802, 2005. → pages 77, 92
[185] F. Pong, A. Nowatzyk, G. Aybay, and M. Dubois. Verifying Distributed Directory-Based Cache Coherence Protocols: S3.mp, a Case Study. In European Conference on Parallel Processing, pages 287–300. Springer, 1995. → page 42
[186] T. Porter and T. Duff. Compositing Digital Images. In Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), pages 253–259. ACM, 1984. → page 104
[187] J. W. Poulton, W. J. Dally, X. Chen, J. G. Eyles, T. H. Greer, S. G. Tell, J. M. Wilson, and C. T. Gray.
A 0.54 pJ/b 20 Gb/s Ground-Referenced Single-Ended Short-Reach Serial Link in 28 nm CMOS for Advanced Packaging Applications. IEEE Journal of Solid-State Circuits (JSSC), 48(12):3206–3218, 2013. → pages 4, 75, 100, 103
[188] J. Power, A. Basu, J. Gu, S. Puthoor, B. M. Beckmann, M. D. Hill, S. K. Reinhardt, and D. A. Wood. Heterogeneous System Coherence for Integrated CPU-GPU Systems. In Proceedings of the 46th International Symposium on Microarchitecture (MICRO), pages 457–467. ACM, 2013. → pages 20, 132
[189] S. H. Pugsley, M. Awasthi, N. Madan, N. Muralimanohar, and R. Balasubramonian. Scalable and Reliable Communication for Hardware Transactional Memory. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 144–154. IEEE, 2008. → page 133
[190] M. A. Raihan, N. Goli, and T. M. Aamodt. Modeling Deep Learning Accelerator Enabled GPUs. In International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 79–92. IEEE, 2019. → page 91
[191] R. Rajwar and J. R. Goodman. Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution. In Proceedings of the 34th International Symposium on Microarchitecture (MICRO), pages 294–305. IEEE, 2001. → page 133
[192] R. Rajwar and J. R. Goodman. Transactional Lock-free Execution of Lock-Based Programs. In Proceedings of the 10th International Conference on Architectural Support for Programming Language and Operating Systems (ASPLOS), pages 5–17. ACM, 2002. → page 133
[193] G. Ramalingam. Context-Sensitive Synchronization-Sensitive Analysis is Undecidable. Transactions on Programming Languages and Systems (TOPLAS), 22(2):416–430, 2000. → page 46
[194] P. Ranganathan, V. S. Pai, and S. V. Adve. Using Speculative Retirement and Larger Instruction Windows to Narrow the Performance Gap Between Memory Consistency Models. In Proceedings of the 9th Annual Symposium on Parallel Algorithms and Architectures (SPAA), pages 199–210. ACM, 1997. → page 130
[195] X.
Ren and M. Lis. Efficient Sequential Consistency in GPUs via Relativistic Cache Coherence. In Proceedings of the 23rd International Symposium on High Performance Computer Architecture (HPCA), pages 625–636. IEEE, 2017. → pages 76, 78, 133, 140
[196] X. Ren and M. Lis. High-Performance GPU Transactional Memory via Eager Conflict Detection. In Proceedings of the 24th International Symposium on High Performance Computer Architecture (HPCA), pages 235–246. IEEE, 2018. → page 14
[197] X. Ren, D. Lustig, E. Bolotin, A. Jaleel, O. Villa, and D. Nellans. HMG: Extending Cache Coherence Protocols Across Modern Hierarchical Multi-GPU Systems. In Proceedings of the 26th International Symposium on High Performance Computer Architecture (HPCA), pages 582–595. IEEE, 2020. → page 103
[198] I. Rickards and E. Sørgård. Integrating CPU & GPU: the ARM Methodology. In Game Developers Conference (GDC), 2013. → page 21
[199] J. Rose, J. Luu, C. W. Yu, O. Densmore, J. Goeders, A. Somerville, K. B. Kent, P. Jamieson, and J. Anderson. The VTR Project: Architecture and CAD for FPGAs from Verilog to Routing. In Proceedings of the International Symposium on Field Programmable Gate Arrays (FPGA), pages 77–86. ACM, 2012. → page 38
[200] C. J. Rossbach, O. S. Hofmann, D. E. Porter, H. E. Ramadan, B. Aditya, and E. Witchel. TxLinux: Using and Managing Hardware Transactional Memory in an Operating System. In Proceedings of the 21st Symposium on Operating Systems Principles (SOSP), pages 87–102. ACM, 2007. → page 47
[201] C. J. Rossbach, O. S. Hofmann, and E. Witchel. Is Transactional Programming Actually Easier? In Proceedings of the 15th Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 47–56. ACM, 2010. → page 47
[202] B. Saha, A.-R. Adl-Tabatabai, and Q. Jacobson. Architectural Support for Software Transactional Memory. In Proceedings of the 39th International Symposium on Microarchitecture (MICRO), pages 185–196. IEEE, 2006. → page 133
[203] D. Sanchez, L. Yen, M. D. Hill, and K. Sankaralingam.
Implementing Signatures for Transactional Memory. In Proceedings of the 40th International Symposium on Microarchitecture (MICRO), pages 123–133. IEEE, 2007. → page 63
[204] J. Sanders and E. Kandrot. CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional, 2010. → pages 4, 20
[205] S. Sarkar, P. Sewell, J. Alglave, L. Maranget, and D. Williams. Understanding POWER Multiprocessors. In Proceedings of the 32nd International Conference on Programming Language Design and Implementation (PLDI), pages 175–186. ACM, 2011. → pages 19, 21, 130
[206] J.-P. Schoellkopf. SRAM Memory Device with Flash Clear and Corresponding Flash Clear Method, Feb. 19 2008. US Patent 7,333,380. → page 33
[207] M. Segal and K. Akeley. The OpenGL® Graphics System: A Specification (Version 4.6 (Core Profile)), 2019. Accessed on 2020-06-25. → page 17
[208] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, et al. Larrabee: A Many-Core x86 Architecture for Visual Computing. Transactions on Graphics (TOG), 27(3):1–15, 2008. → page 106
[209] V. Seshadri, Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu, P. B. Gibbons, M. A. Kozuch, et al. RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization. In Proceedings of the 46th International Symposium on Microarchitecture (MICRO), pages 185–197. ACM, 2013. → page 133
[210] Shara Tibken. CES 2019: Moore's Law is dead, says Nvidia's CEO, 2019. Accessed on 2020-07-08. → page 1
[211] K. S. Shim, M. H. Cho, M. Lis, O. Khan, and S. Devadas. Library Cache Coherence. Technical Report MIT-CSAIL-TR-2011-027, MIT, 2011. → pages 25, 131
[212] A. Shriraman and S. Dwarkadas. Refereeing Conflicts in Hardware Transactional Memory. In Proceedings of the 23rd International Conference on Supercomputing (ICS), pages 136–146. ACM, 2009. → page 133
[213] A. Shriraman, M. F. Spear, H. Hossain, V. J. Marathe, S. Dwarkadas, and M. L. Scott.
An Integrated Hardware-Software Approach to Flexible Transactional Memory. In Proceedings of the 34th International Symposium on Computer Architecture (ISCA), pages 104–115. ACM, 2007. → page 133
[214] A. Shriraman, S. Dwarkadas, and M. L. Scott. Flexible Decoupled Transactional Memory Support. In Proceedings of the 35th International Symposium on Computer Architecture (ISCA), pages 139–150. IEEE, 2008. → page 133
[215] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR, abs/1409.1556, September 2014. → page 74
[216] M. D. Sinclair, J. Alsop, and S. V. Adve. Efficient GPU Synchronization Without Scopes: Saying No to Complex Consistency Models. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO), pages 647–659. ACM, 2015. → pages 74, 75, 76, 78, 81, 130, 133, 140
[217] M. D. Sinclair, J. Alsop, and S. V. Adve. HeteroSync: A Benchmark Suite for Fine-Grained Synchronization on Tightly Coupled GPUs. In International Symposium on Workload Characterization (IISWC), pages 239–249. IEEE, 2017. → pages 75, 91
[218] P. S. Sindhu, J.-M. Frailong, and M. Cekleov. Scalable Shared Memory Multiprocessors, chapter Formal Specification of Memory Models, page 25. Springer, 1992. → pages 19, 21
[219] A. Singh, S. Narayanasamy, D. Marino, T. Millstein, and M. Musuvathi. End-to-end Sequential Consistency. In Proceedings of the 39th International Symposium on Computer Architecture (ISCA), pages 524–535. IEEE, 2012. → page 130
[220] A. Singh, S. Aga, and S. Narayanasamy. Efficiently Enforcing Strong Memory Ordering in GPUs. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO), pages 699–712. ACM, 2015. → pages 19, 20, 21, 22, 36, 37, 38, 129
[221] I. Singh, A. Shriraman, W. W. Fung, M. O'Connor, and T. M. Aamodt. Cache Coherence for GPU Architectures. In Proceedings of the 19th International Symposium on High Performance Computer Architecture (HPCA), pages 578–590. IEEE, 2013.
→ pages4, 6, 19, 20, 22, 23, 25, 32, 37, 38, 74, 75, 76, 79, 81, 82, 129, 131, 132, 133165[222] P. Singh, C.-R. M, P. Raghavendra, A. Nandi, D. Das, and T. Tye. AMDPlatform Coherency and SoC Verification Challenges. In Proceedings of the4th ACM Symposium on Cloud Computing (SoCC). ACM, 2013. → page 21[223] B. Sinharoy, R. N. Kalla, J. M. Tendler, R. J. Eickemeyer, and J. B. Joyner.POWER5 System Microarchitecture. IBM journal of research and develop-ment, 49(4.5):505–521, 2005. → page 131[224] D. J. Sorin, M. D. Hill, and D. A. Wood. A Primer on Memory Consistencyand Cache Coherence. Synthesis Lectures on Computer Architecture, 6(3):1–212, 2011. → pages 14, 15, 80[225] SPARC International. The SPARC Architecture Manual, Version 9., 1994. Accessed on 2020-06-25. → pages41, 130[226] D. Steinkraus, I. Buck, and P. Simard. Using GPUs for Machine LearningAlgorithms. In 8th International Conference on Document Analysis andRecognition (ICDAR), pages 1115–1120. IEEE, 2005. → page 2[227] G. Stoll, M. Eldridge, D. Patterson, A.Webb, S. Berman, R. Levy, C. Caywood,M. Taveira, S. Hunt, and P. Hanrahan. Lightning-2: A High-PerformanceDisplay Subsystem for PC Clusters. In Proceedings of the 28th AnnualConference on Computer Graphics and Interactive Techniques (SIGGRAPH),pages 141–148. ACM, 2001. → page 105[228] J. M. Stone, H. S. Stone, P. Heidelberger, and J. Turek. Multiple reservationsand the Oklahoma update. Parallel & Distributed Technology: Systems &Applications, 1(4):58–71, 1993. → pages 46, 133[229] Y. Sun, T. Baruah, S. A. Mojumder, S. Dong, X. Gong, S. Treadway, Y. Bao,S. Hance, C. McCardwell, V. Zhao, H. Barclay, A. K. Ziabari, Z. Chen,R. Ubal, J. L. Abellán, J. Kim, A. Joshi, and D. Kaeli. MGPUSim: EnablingMulti-GPU Performance Modeling and Optimization. In Proceedings ofthe 46th International Symposium on Computer Architecture (ISCA), pages197–209. ACM, 2019. → page 91[230] SUN Microsystems. SPARC Architecture Manual V8., 1990. Accessed on 2020-06-25. 
→ pages 19, 21[231] R. N. Taylor. Complexity of Analyzing the Synchronization Structure ofConcurrent Programs. Acta Informatica, 19(1):57–84, 1983. → page 46166[232] T. N. Theis and H.-S. P. Wong. The End of Moore’s Law: A New Beginningfor Information Technology. Computing in Science & Engineering, 19(2):41–50, 2017. → page 1[233] S. Tomić, C. Perfumo, C. Kulkarni, A. Armejach, A. Cristal, O. Unsal, T. Har-ris, and M. Valero. EazyHTM: EAger-LaZY hardware Transactional Memory.In Proceedings of the 42nd International Symposium on Microarchitecture(MICRO), pages 145–155. ACM, 2009. → pages 50, 133[234] A. Villegas, Á. Navarro, R. Asenjo Plaza, O. Plata, R. Ubal, and D. Kaeli.Hardware Support for Local Memory Transactions on GPU Architectures. InTRANSACT, 2015. → pages 48, 133[235] V. Vineet and P. Narayanan. CUDA Cuts: Fast Graph Cuts on the GPU. InProceedings of the Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW), pages 1–8. IEEE, 2008. → page 66[236] M. M. Waliullah and P. Stenstrom. Starvation-Free Transactional Memory-System Protocols. In European Conference on Parallel Processing (ECPP),pages 280–291. Springer, 2007. → page 133[237] D. A. Wallach. PHD: A Hierarchical Cache Coherent Protocol. PhD thesis,Massachusetts Institute of Technology, 1992. → page 132[238] T. F. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. Mechanisms forStore-Wait-Free Multiprocessors. In Proceedings of the 34th InternationalSymposium on Computer Architecture (ISCA), pages 266–277. ACM, 2007.→ page 130[239] A. W. Wilson Jr. Hierarchical Cache/Bus Architecture for Shared MemoryMultiprocessors. In Proceedings of the 14th International Symposium onComputer Architecture (ISCA), pages 244–252. ACM, 1987. → page 132[240] M. Wimmer and P. Wonka. Rendering Time Estimation for Real-TimeRendering. In Eurographics Symposium on Rendering, pages 118–129, 2003.→ page 115[241] H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and A. 
Moshovos.Demystifying GPU Microarchitecture through Microbenchmarking. In Pro-ceedings of the International Symposium on Performance Analysis of Systems& Software (ISPASS), pages 235–246. IEEE, 2010. → pages 30, 37, 38, 66167[242] D. A. Wood, G. A. Gibson, and R. H. Katz. Verifying a MultiprocessorCache Controller Using Random Test Generation. IEEE Design & Test ofComputers, 7(4):13–25, 1990. → page 42[243] S. Xiao and W.-c. Feng. Inter-Block GPU Communication via Fast BarrierSynchronization. In International Symposium on Parallel & DistributedProcessing (IPDPS), pages 1–12. IEEE, 2010. → pages 36, 38[244] C. Xie, S. L. Song, J. Wang, W. Zhang, and X. Fu. Processing-in-MemoryEnabled Graphics Processors for 3D Rendering. In Proceedings of the23rd International Symposium on High Performance Computer Architecture(HPCA), pages 637–648. IEEE, 2017. → pages 120, 135[245] C. Xie, X. Fu, and S. Song. Perception-Oriented 3DRenderingApproximationfor Modern Graphics Processors. In Proceedings of the 24th InternationalSymposium on High Performance Computer Architecture (HPCA), pages362–374. IEEE, 2018. → page 120[246] C. Xie, F. Xin, M. Chen, and S. L. Song. OO-VR: NUMA Friendly Object-Oriented VR Rendering Framework for Future NUMA-Based Multi-GPUSystems. In Proceedings of the 46th International Symposium on ComputerArchitecture (ISCA), pages 53–65. IEEE, 2019. → pages 12, 115, 120, 133[247] C. Xie, X. Zhang, A. Li, X. Fu, and S. Song. PIM-VR: Erasing MotionAnomalies In Highly-Interactive Virtual Reality World With CustomizedMemory Cube. In Proceedings of the 25th International Symposium on HighPerformance Computer Architecture (HPCA), pages 609–622. IEEE, 2019.→ page 120[248] Y. Xu, R. Wang, N. Goswami, T. Li, L. Gao, and D. Qian. SoftwareTransactional Memory for GPU Architectures. In Proceedings of AnnualInternational Symposium on Code Generation and Optimization (CGO),pages 1–10. ACM, 2014. → page 133[249] Y. Xu, L. Gao, R. Wang, Z. Luan, W. Wu, and D. Qian. 
Lock-basedSynchronization for GPU Architectures. In Proceedings of the InternationalConference on Computing Frontiers (CF), pages 205–213. ACM, 2016. →page 45[250] Y. Yao, G. Wang, Z. Ge, T. Mitra, W. Chen, and N. Zhang. EfficientTimestamp-Based Cache Coherence Protocol for Many-Core Architectures.In Proceedings of the 30th International Conference on Supercomputing(ICS), pages 1–13. ACM, 2016. → page 131168[251] K. C. Yeager. The MIPS R10000 Superscalar Microprocessor. IEEE Micro,16:28, Apr 1996. → page 19[252] L. Yen, J. Bobba, M. R. Marty, K. E. Moore, H. Volos, M. D. Hill, M. M.Swift, and D. A. Wood. LogTM-SE: Decoupling Hardware TransactionalMemory from Caches. In Proceedings of the 13th International Symposiumon High Performance Computer Architecture (HPCA), pages 261–272. IEEE,2007. → pages 13, 14, 49, 133[253] V. Young, A. Jaleel, E. Bolotin, E. Ebrahimi, D. Nellans, andO.Villa. Combin-ing HW/SWMechanisms to Improve NUMA Performance of Multi-GPU Sys-tems. In Proceedings of the 51th International Symposium on Microarchitec-ture (MICRO), pages 339–351. IEEE, 2018.→ pages 4, 12, 74, 75, 76, 95, 103[254] H. Yu, C. Wang, and K.-L. Ma. Massively Parallel Volume RenderingUsing 2–3 Swap Image Compositing. In Proceedings of the conference onSupercomputing (SC), pages 1–11. IEEE, 2008. → page 105[255] X. Yu and S. Devadas. TARDIS: Time Travelling Coherence Algorithmfor Distributed Shared Memory. In Proceedings of the 24th InternationalConference on Parallel Architectures and Compilation Techniques (PACT),pages 227–240. IEEE, 2015. → pages 24, 25, 30, 35, 131, 138[256] X. Yu, M. Vijayaraghavan, and S. Devadas. A proof of correctness forthe tardis cache coherence protocol. CoRR, abs/1505.06459, 2015. URL → page 28[257] X. Yu, H. Liu, E. Zou, and S. Devadas. Tardis 2.0: Optimized TimeTraveling Coherence for Relaxed Consistency Models. In Proceedings ofthe 25th International Conference on Parallel Architecture and CompilationTechniques (PACT), pages 261–274. 
IEEE, 2016.→ pages 25, 36, 37, 132, 138[258] R. J. Zerr and R. S. Baker. SNAP: SN (discrete ordinates) Application ProxyDescription. Los Alamos National Laboratories, Tech. Rep. LAUR-13-21070,2013. → pages 6, 77, 92[259] F. Zyulkyarov, V. Gajinov, O. S. Unsal, A. Cristal, E. Ayguadé, T. Harris, andM. Valero. Atomic Quake: Using Transactional Memory in an InteractiveMultiplayerGame Server. InProceedings of the 14th Symposium onPrinciplesand Practice of Parallel Programming (PPoPP), pages 25–34. ACM, 2009.→ page 47169

