UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Deterministic execution on GPU architectures Jooybar, Mohammad Hadi 2013

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


24-ubc_2013_fall_jooybar_mohammad hadi.pdf [ 1.13MB ]
JSON: 24-1.0074006.json
JSON-LD: 24-1.0074006-ld.json
RDF/XML (Pretty): 24-1.0074006-rdf.xml
RDF/JSON: 24-1.0074006-rdf.json
Turtle: 24-1.0074006-turtle.txt
N-Triples: 24-1.0074006-rdf-ntriples.txt
Original Record: 24-1.0074006-source.json
Full Text

Full Text

Deterministic Execution on GPU Architectures by Mohammad Hadi Jooybar B.A.Sc., University of Tehran, 2011 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in The Faculty of Graduate Studies (Electrical and Computer Engineering) THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver) July 2013 c© Mohammad Hadi Jooybar 2013 Abstract Nondeterminism is a key challenge in developing multithreaded applications. Even with the same input, each execution of a multithreaded program may produce a different output. This behavior complicates debugging and limits one’s ability to test for correctness. This non-reproducibility situation is aggravated on massively parallel architectures like graphics processing units (GPUs) with thousands of concurrent threads. We believe providing a de- terministic environment to ease debugging and testing of GPU applications is essential to enable a broader class of software to use GPUs. Many hardware and software techniques have been proposed for provid- ing determinism on general-purpose multi-core processors. However, these techniques are designed for small numbers of threads. Scaling them to thou- sands of threads on a GPU is a major challenge. Here we propose a scalable hardware mechanism, GPUDet, to provide determinism in GPU architec- tures. In this thesis we characterize the existing deterministic and non- deterministic aspects of current GPU execution models, and we use these observations to inform GPUDet’s design. For example, GPUDet leverages the inherent determinism of the SIMD hardware in GPUs to provide deter- minism within a wavefront at no cost. GPUDet also exploits the Z-Buffer Unit, an existing GPU hardware unit for graphics rendering, to allow par- allel out-of-order memory writes to produce a deterministic output. Other optimizations in GPUDet include deterministic parallel execution of atomic operations and a workgroup-aware algorithm that eliminates unnecessary global synchronizations. Our simulation results indicate that GPUDet incurs only 2× slowdown on average over a baseline nondeterministic architecture, with runtime over- heads as low as 4% for compute-bound applications, despite running GPU kernels with thousands of threads. We also characterize the sources of over- head for deterministic execution on GPUs to provide insights for further optimizations. ii Preface Parts of Chapters 1, 2, 3, 4, 5,6, 7, and 9 have been published [27]. Hadi Jooybar, Wilson W.L. Fung, Mike O’Connor, Joseph Devietti, and Tor M. Aamodt. GPUDet: a Deterministic GPU Architecture. In proceedings of the Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems (ASPLOS’13, pages 1-12, March 2013. Dr. Tor Aamodt, Dr. Joseph Devietti, and Mike O’Connor supervised during this project. I created the timing and func- tional model for all parts of this project except Store buffers (Developed by Wilson W.L. Fung). I also analyzed the experimental results. iii Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . xii 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Motivation: Debugging with a Deterministic GPU . . . . . . 2 1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . 5 2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1 GPU Architecture . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.1 SIMT Execution Model . . . . . . . . . . . . . . . . . 8 2.1.2 Memory Subsystem . . . . . . . . . . . . . . . . . . . 10 2.2 Sources of Nondeterminism in GPU Architectures . . . . . . 11 2.3 Deterministic Execution . . . . . . . . . . . . . . . . . . . . . 12 2.3.1 Forms of Determinism . . . . . . . . . . . . . . . . . . 12 2.3.2 Benefits of Deterministic Execution . . . . . . . . . . 14 3 GPU Deterministic Execution: Background and Challenges . . . . . . . . . . . . . . . . . . . 16 3.1 Background: CoreDet and Calvin . . . . . . . . . . . . . . . 16 3.2 Deterministic GPU Execution Challenges . . . . . . . . . . . 17 iv Table of Contents 4 GPUDet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 4.1 Deterministic Wavefront . . . . . . . . . . . . . . . . . . . . 20 4.2 Quanta Formation . . . . . . . . . . . . . . . . . . . . . . . . 23 4.2.1 Workgroup-Aware Quantum Formation . . . . . . . . 25 4.2.2 Deterministic Workgroup Distribution . . . . . . . . . 26 4.3 Per-Wavefront Store Buffer . . . . . . . . . . . . . . . . . . . 29 4.4 Parallel Commit of Store Buffers . . . . . . . . . . . . . . . . 29 4.4.1 Z-Buffer Unit . . . . . . . . . . . . . . . . . . . . . . 30 4.4.2 Deterministic Parallel Commit using Z-Buffer . . . . 31 4.4.3 Implementation Details of Z-Buffer Unit Architecture 31 4.5 Compute Unit Level Serialization . . . . . . . . . . . . . . . 35 4.6 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 5 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 5.1 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 5.2 Simulation Configuration . . . . . . . . . . . . . . . . . . . . 39 5.3 Overall Methodology: Finding Performance Bottleneck . . . 40 6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 45 6.1 Overall Performance . . . . . . . . . . . . . . . . . . . . . . . 45 6.2 Impact of GPUDet Optimizations . . . . . . . . . . . . . . . 46 6.2.1 Quantum Formation . . . . . . . . . . . . . . . . . . . 46 6.2.2 Parallel Commit using Z-Buffer Unit . . . . . . . . . 47 6.2.3 Serial Mode Optimization . . . . . . . . . . . . . . . 48 6.3 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . 49 6.3.1 Quantum Size and Store Buffer Overhead . . . . . . . 49 6.3.2 Bloom Filter . . . . . . . . . . . . . . . . . . . . . . . 51 6.3.3 Z-Buffer Unit . . . . . . . . . . . . . . . . . . . . . . 52 6.3.4 Global Synchronization Overhead . . . . . . . . . . . 52 7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 8 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 8.1 Improving Performance of GPUDet . . . . . . . . . . . . . . 56 8.1.1 Improving Performance of Atomic Operations by Tak- ing Advantage of Relaxed Memory Ordering . . . . . 56 8.1.2 Deterministic Local Memory . . . . . . . . . . . . . . 57 8.1.3 Overlapping the Parallel and Commit Mode of Differ- ent Wavefronts . . . . . . . . . . . . . . . . . . . . . . 58 8.2 Exploring Other Types of Determinism . . . . . . . . . . . . 59 v Table of Contents 8.3 Compiler-Run Time Support for determinism . . . . . . . . . 59 9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 vi List of Tables 2.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 5.1 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 5.2 GPGPU-Sim Configuration . . . . . . . . . . . . . . . . . . . 40 vii List of Figures 1.1 Randomness of output for BFSr by varying the number of edges. Each data point is the average of number of different outputs for 10 different random graphs. The application is launched 500 times for each graph. . . . . . . . . . . . . . . 3 2.1 Baseline GPU Architecture . . . . . . . . . . . . . . . . . . . 7 2.2 Hadling the Control Flow Divergnce using SIMT Stack . . . . 9 3.1 Deterministic Execution in the CPU architecture using Calvin- CoreDet mechanism[26] . . . . . . . . . . . . . . . . . . . . . 19 4.1 GPUDet-base architecture. Vertical lines show the global synchronization before the start of each mode. Wavefront 1 (W1) of compute unit 1 has been skipped in the serial mode since it has not reached an atomic operation in the quantumn. 21 4.2 Breakdown of events that causes a wavefront to end its quan- tum in GPUDet with baseline quantum formation logic. . . 24 4.3 GPUDet quantum formation. (a) GPUDet-base behavior toward quantum termination events. (b) Workgroup-aware quantum formation of GPUDet (GPUDet-WA) allows wave- fronts of one workgroup to continue the execution in parallel mode after all reaching workgroup barrier. . . . . . . . . . . 26 4.4 Workgroup distribution . . . . . . . . . . . . . . . . . . . . . 27 4.5 Intraction of Z-Buffer Unit with Memory Partition . . . . . . 32 4.6 Architecture of Z-Buffer Unit . . . . . . . . . . . . . . . . . . 33 4.7 A Color Update Request . . . . . . . . . . . . . . . . . . . . . 34 4.8 Serial mode in GPUDet. (a) Serializing execution of all atomic operations in GPUDet-base. (b) Overlapping execution time of atomic operations from each compute unit by GPUDet. . . 36 5.1 IPC of different compute units in GPUDet-Base for AES. . . 42 5.2 IPC of different compute units in GPUDet for AES. . . . . . 42 viii List of Figures 5.3 IPC of different compute units in GPUDet-Base for BFS. . . 44 5.4 IPC of different compute units in GPUDet for BFS. . . . . . 44 6.1 Breakdown of execution cycles. Normalized to NONDET ex- ecution time. . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 6.2 Performance impact of Barrier Termination Avoidance (BTA) and Deterministic Workgroup Partitioning (DWP) techniques. Bars Normalized to NONDET execution time. . . . . . . . . 47 6.3 Execution time comparison of committing the store buffer between the Z-Buffer Unit parallel commit and the lock-based algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 6.4 Execution time comparison between wavefront level (W-Ser) and compute unit level (CU-Ser) serialization of atomic op- erations (Section 4.5). Normalized to NONDET execution time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 6.5 Sensitivity of GPUDet to quantum size. Execution time is normalized to NONDET. The unoptimized version of CL is not included in the average. . . . . . . . . . . . . . . . . . . 50 6.6 Execution cycles in parallel mode for various sizes of Bloom filter, normalized to NONDET execution. The AVG bar shows the average among all our benchmarks. . . . . . . . . . 51 6.7 Execution time for various number of cycles, dedicated for each global synchronization, normalized to NONDET execu- tion. The bar shows the average among all our benchmarks. . 53 ix Glossary BF Bloom Filter BFSr Buggy version of BFS contains race condition BTA Barrier Termination Avoidance CPU Central Processing Unit CU Compute Unit CU-Ser Compute Unit Serialization DWP Deterministic Workgroup Partitioning FIFO First In First Out GPU Graphics Processing Unit GPUDet-base basic GPUDet without optimization techniques IPC Instruction Per Cycle NONDET non-deterministic baseline GPU architecture PC Program Counter RPC Reconvergence Program Counter SIMD Single Instruction Multiple Data SIMT Single Instruction Multiple Threads SM Streaming Multi-processor W-Ser Wavefront Serialization WA Workgroup Aware WG Workgroup x Glossary WGS Subset of wavefront lanes in hardware with size of workgroup WL Write Log Z-Buffer A buffer in global memory keeps the depth of foremost fragment for each pixel Z-Buffer Unit The unit controlling the Z-Buffer operations Z-Cache A Specialized cache used in Z-Buffer unit xi Acknowledgements I would like to thanks my supervisor, Professor Tor Aamodt, for all the help he has given me over the years. Without him, this work would not have been possible. I would also like to thank the other members of the computer architecture group for their supportive ideas. I would like to specially thank Wilson W.L. Fung for his advice as a senior grad student involving my project. Finally, I’d like to thank my family for the innumerable things they have helped me out with throughout my life. xii Chapter 1 Introduction Massively parallel processors are designed to run thousands of threads. The energy efficiency of these processors has driven their widespread popular- ity for many inherently-parallel applications, such as computer graphics. We anticipate that increasing demands for energy efficiency will motivate a broader range of applications with more complex data structures and inter- task communication to use these processors. Unfortunately, developing parallel applications that can utilize such ar- chitectures is hampered by the challenge of coordinating parallel tasks, es- pecially in ensuring that a parallel program behaves correctly for every pos- sible combination of interactions among these tasks. With the thousands of threads required to utilize current massively parallel architectures like GPUs, the number of interactions to consider is substantial. Nondeter- minism makes this challenge even harder, because it thwarts attempts to reproduce anomalous behavior or to provide guarantees about future execu- tions. Providing deterministic execution for GPU architectures is a crucial aid for simplifying parallel programming at scale. Determinism simplifies the debugging process by making program behavior reproducible for a given input. Determinism is also useful once code is deployed, to reproduce bugs that manifest in production settings. Finally, determinism amplifies the power of testing, by providing a guarantee that once a particular input has passed its tests, those tests will continue to pass in the future. Current nondeterministic parallel architectures provide no such guarantee, which allows testing to provide only probabilistic guarantees. The key challenge of determinism is providing these properties with ac- ceptable performance overhead. To this end, different levels of determinism have been proposed in the literature [40]. We are proponents of strong de- terminism, which provides determinism even in the presence of data races. Some deterministic schemes require data-race freedom [40], offering deter- minism in a more limited context but with lower performance overheads. In contrast, a strongly deterministic system makes no assumptions about the program it is running, which is well-suited to dealing with the buggy 1 1.1. Motivation: Debugging with a Deterministic GPU programs that need determinism the most. The programmability of GPUs can benefit greatly from strong determinism. 1.1 Motivation: Debugging with a Deterministic GPU Many existing GPU workloads are embarrassingly parallel applications free of inherent data-races. It may be tempting to dismiss the benefit of deter- ministic execution on GPUs for these applications. However, these applica- tions usually feature complex data tiling to improve data locality. Subtle mistakes during development can result in nondeterministic bugs. These bugs are hard to detect since they may appear occasionally and alter their behavior. A deterministic GPU allows programmers to analyze such bugs more effectively consistent execution path and output. Indeed, race conditions have appeared in GPGPU algorithms from the published literature. To illustrate how a deterministic GPU can simplify de- bugging, we use a buggy version of breadth-first-search (BFS) graph traver- sal distributed by Harish and Narayanan [24] (BFSr) . Below is a simplified version of the kernel function in BFSr, which contains a race condition. 0: __global__ void BFS_step_kernel(...) { 1: if( active[tid] ) { 2: active[tid] = false; 3: visited[tid] = true; 4: foreach (int id = neighbour_nodes) { 5: if( visited[id] == false ) { 6: cost[id] = cost[tid] + 1; 7: active[id] = true; 8: *over=true; 9: } } } } Presumably, this kernel is designed to traverse one level of the breadth- first-search nodes during each kernel launch. Each thread (tid = thread ID) is assigned a node in the graph. In each kernel launch (at each step), each thread updates the cost of its neighbouring nodes (cost[id]) that have not been visited, and sets their active flag (active[id]), which indicates these nodes should update their own neighbours during the next kernel launch. A race condition can occur since a thread can be reading its active[tid] and cost[tid] (at line 1 and 6) while its neighbour is updating these locations (at line 6 and 7). This causes cost[id] to be updated nondeterministically, which can lead to incorrect output. 2 1.1. Motivation: Debugging with a Deterministic GPU 150 200 250 300 350 400 450 500 #  o f D if fe re n t R e su lt s in  5 0 0  R u n s 0 50 100 24000 26000 28000 30000 32000 34000 36000 38000 40000 42000 #  o f D if fe re n t R e su lt s in  5 0 0  R u n s # of Edges Figure 1.1: Randomness of output for BFSr by varying the number of edges. Each data point is the average of number of different outputs for 10 different random graphs. The application is launched 500 times for each graph. To demonstrate the benefits of a deterministic GPU, first we measure the randomness of the final results of this simple program in presence of designated race condition. By increasing the complexity of the input graph the number of different outputs exponentially increases. Figure 5.3 shows the number of different outputs in 500 runs with a single input for various number of the edges. The experiment runs on a graph with 65k nodes. By increasing the application complexity, even the incorrect outputs are rarely the same. Different incorrect answers stem from different execution paths, make debugging extremely challenging or even impossible. Program- mers have to consider all different execution paths to ensure the program behaves correctly for any possible scheduling. Also, lack of reproducibility prevents the programmer from tracing the bugs. With a deterministic GPU, the programmer can observe the same incor- rect execution across different runs. This allows one to set a watch point in a GPU debugger at the location in cost[] that has an incorrect output 3 1.2. Contributions value. The programmer can then pause the execution at every modifica- tion to this location and identify the threads responsible for the incorrect output value. Notice that this approach will not work with a nondetermin- istic GPU. The programmer cannot be sure that the race condition would occur at the same memory location in the second run via the debugger. This difficulty with nondeterministic parallel architecture has motivated the supercomputing community to create extensive logging tools that capture activities across the entire system, and analyze the (enormous) logs for er- rors [4]. We argue that providing a deterministic GPU to programmers is a more effective solution. 1.2 Contributions This document introduces GPUDet, which is the first deterministic mas- sively parallel architecture. GPUDet provides strong determinism for current- generation GPU architectures. GPUDet leverages the approach suggested by Bergan et al. [7] for multi-core CPUs. In this approach, threads are executed in an isolated memory space, communicating with other threads only at deterministic, fixed intervals. Thread isolation is realized by ap- pending all stores to a private buffer instead of directly updating global memory. Thread updates are made globally visible via a commit process which leverages existing GPU Z-Buffer hardware to execute in a determin- istic but highly parallel fashion. Read-modify-write operations that need to be made globally visible have their execution deferred until this communi- cation phase. GPUDet executes these operations in parallel and maintains a deterministic order among them by leveraging the ordering property in existing GPU memory systems. The key contributions of this thesis are: 1. We propose GPUDet, the first hardware model for a fully deterministic GPU architecture. 2. We characterize the inherently deterministic and nondeterministic as- pects of the GPU execution model and propose optimizations to GPUDet that leverage these inherent deterministic properties. 3. Our design exploits existing GPU hardware for Z-buffering to acceler- ate deterministic parallel committing of store buffers. 4. We introduce a workgroup-based quantum formation algorithm to en- able larger quanta, which better amortizes the overheads of quantum 4 1.3. Thesis Organization formation. 5. We exploit the point-to-point ordering in GPU memory subsystem to regain parallelism within each compute unit for read-modify-write operations that are executed in serial mode. 1.3 Thesis Organization The rest of the document is organized is follows. Chapter 2 summarizes our baseline GPU architecture, discusses sources of nondeterminism in GPUs, briefly introduces the different types of determinism and discusses the bene- fits of determinism. Chapter 3 summarizes existing techniques for providing deterministic execution and highlights the challenges in applying them on GPUs. Chapter 4 describes GPUDet and our techniques to improve perfor- mance without sacrificing determinism. Chapter 5 shows our configuration parameters and describes the benchmarks. Chapter 6 presents a compre- hensive evaluation of GPUDet. Chapter 7 discusses the related work. We discuss future work in Chapter 8. Chapter 9 concludes the thesis. 5 Chapter 2 Background The focus of this thesis is deterministic execution on GPU architectures. To achieve full understanding of this work the reader should be familiar with basic concepts of deterministic execution and GPU architecture. Our personal experience shows people that are expert in one of these two subjects usually have limited knowledge about the other one. This section provides an overview of fundamental concepts of deterministic execution information about the general GPU architecture. Also, we categorize the sources of non-determinism in current GPU architectures. 2.1 GPU Architecture This section summarizes the details of our baseline GPU architecture rel- evant to deterministic execution on GPUs. Figure 2.1 shows an overview of our baseline GPU architecture modeled after the NVIDIA Fermi GPU architecture [37]. In this document, we use OpenCL [28] terminology. How- ever, to make the document easier to read we use thread instead of work item. Table 2.1 shows the terminology of this document. A GPU program (CUDA [38] or OpenCL [28]) starts running on the CPU and launches parallel compute kernels onto the GPU for processing. Each kernel launch consists of a hierarchy of threads organized in equal- sized workgroups. All threads in a kernel launch execute the same kernel function, while each thread may follow a unique execution path (with some performance penalty). Threads within a workgroup can synchronize quickly via workgroup barriers and communicate via an on-chip scratch-pad mem- ory called the local memory. Communication among threads in different workgroups occurs through global memory located in off-chip DRAM. Each thread has exclusive access to its own private memory for non-shared data. Our GPU microarchitecture consists of a set of compute units that access multiple partitions (i.e., channels) of off-chip DRAM via an interconnection network. Each compute unit can sustain up to 1536 threads in its register file. It has a scratch-pad memory that is partitioned among the different workgroups that run concurrently on the unit. The large register file allows 6 2.1. GPU Architecture Interconnect Workgroup Distributor C P U Grid x = input[threadID] y= func(x); output[threadID] = y Kernel Func. Kernel Launch Workgroup R e g is te r F ile L o c a l M e m o ry L1 cache Workgroup Workgroup Workgroup... Compute Unit Workgroup R e g is te r F ile L o c a l M e m o ry L1 cache Workgroup Workgroup Workgroup... Compute Unit L2 cache DRAM Channel Atomic Unit Z-Buffer Unit Memory Partition L2 cache DRAM Channel Atomic Unit Z-Buffer Unit Memory Partition L2 cache DRAM Channel Atomic Unit Z-Buffer Unit Memory Partition ... ... ... ... ... Workgroup Figure 2.1: Baseline GPU Architecture 7 2.1. GPU Architecture Table 2.1: Terminology CUDA Terminology OpenCL Terminology Our terminology Thread Work item Thread Thread Block Work group Work group Grid NDRange Grid Streaming Multi-processor (SM) Compute Unit Compute Unit Shared Memory Local Memory Local Memory Global Memory Global Memory Global Memory Local Memory Private Memory Private Memory Giga-Thread scheduler Workgroup distributor a compute unit to freely context switch between different threads to toler- ate long-latency accesses to global memory. The workgroup distributor on the GPU ensures that a compute unit has sufficient hardware resources to sustain all threads in a workgroup before dispatching the entire workgroup onto the unit. A GPU program may launch a kernel with more threads than the GPU hardware may execute concurrently. In this case, the GPU hard- ware starts executing as many workgroups as on-chip hardware resources permit, and dispatches the remaining ones as resources are released by the workgroups that have finished execution. 2.1.1 SIMT Execution Model GPU architectures use Single-Instruction, Multiple-Data (SIMD) hardware for enhancing computational efficiency. Rather than exposing the SIMD hardware directly to the programmer, GPUs employ a Single-Instruction, Multiple-Threads (SIMT) execution model [33]. It groups scalar threads specified by the programmer into SIMD execution groups called wavefronts [2] (warps [38] in NVIDIA nomenclature). Threads in a wavefront execute in lockstep on the SIMD hardware. Since each thread can follow a unique exe- cution path, a wavefront may diverge after a branch. The GPU hardware au- tomatically serializes the execution of subsets of threads that have diverged to different control flow paths. Each wavefront has a SIMT stack [16, 23] maintaining the different control flow paths that still need to be executed. This allows for nested control flow divergence. Figure 2.2 demonstrates how GPU handles the control flow divergence using a stack-based mechanism. Each entry on the stack denotes a different path. It denotes the subset of threads active on the path with an active mask and contains a reconvergence PC marking the point in the program 8 2.1. GPU Architecture if (…) { statements } else { statements } statements A B C D code A 11111111A - D C B 11111111D - 11111111D - 11001100B D B 00110011C D 11111111D - 00110011C D time PUSH POP POP PC RPC Active Mask Figure 2.2: Hadling the Control Flow Divergnce using SIMT Stack 9 2.1. GPU Architecture when these threads will reconverge with another diverged execution path. Each wavefront starts with a top-level entry with a full active mask. After a wavefront has executed a divergent branch, it pushes two entries onto the SIMT stack, one for the if and else targets of the branch. It then executes one path, say, the if branch, until it reaches the reconvergence PC. It then advances to the else target by popping the stack. The wavefront pops the stack again when it reaches the reconvergence PC once again, revealing the top-level entry from before the branch. After that, it resumes execution at the reconvergence PC with threads from both diverged paths active. In this document, we use the immediate post-dominator of each divergence branch as its reconvergence PC [23]. 2.1.2 Memory Subsystem Individual scalar accesses to global memory from threads in a wavefront are coalesced into wider accesses to 32, 64, or 128-byte chunks. The L1 data cache services one coalesced access per cycle. It caches data from both global memory and private memory, but with different policies. It acts as a writeback cache for accesses to private memory. Writes to global memory evict any line that hits in the L1 cache to make the memory updates visible to all compute units. However, the GPU hardware does not maintain coherence among the L1 data caches in different compute units. Each unit is responsible for purging stale data from its L1 cache. Accesses that miss at the L1 cache are sent to the corresponding mem- ory partition that contains the requested data. Each memory partition is responsible for its portion of the memory space. It contains a slice of the shared L2 cache. Accesses that miss at the L2 cache slices are in turn ser- viced by the off-chip DRAM controlled by the partition. To support a rich set of synchronization primitives, the GPU program- ming model provides atomic operations, which are read-modify-write op- erations that update a single 32/64-bit word in the global memory. The wavefront sends each operation to its corresponding memory partition as with global memory accesses. Atomic operations are executed at the mem- ory partitions with a set of specialized atomic operation units that operate directly on the L2 cache slices. 10 2.2. Sources of Nondeterminism in GPU Architectures 2.2 Sources of Nondeterminism in GPU Architectures The standalone-accelerator nature of current GPU architectures helps iso- late them from some sources of nondeterminism, like interrupt handlers and context switches, present in more general-purpose architectures. However, these sources will likely be increasingly relevant to future GPU architectures as GPUs become more general-purpose. To illustrate that nondetermin- ism exists on GPUs, we have developed GPU-Racey, a CUDA determinis- tic stress test, based on CPU deterministic stress test Racey [25]. Racey computes a signature that is extremely sensitive to the relative ordering of memory accesses among threads. Across multiple runs, it should generate different signatures on a nondeterministic system, and the same signature on a deterministic system. Running GPU-Racey on real GPU hardware (a Quadro FX 5800 and a Tesla C2050) with two or more concurrent wavefronts (warps) produces nondeterministic outputs on different runs. While the ex- act sources of nondeterminism in commercial GPUs are undocumented, we have postulated several potential sources. First, each GPU usually consists of multiple clock domains [39], with each domain running at its optimal frequency. The synchronizer circuits interfacing between these domains can introduce nondeterministic delays to cross-domain messages due to phase drifting among the different clocks [43]. This source of nondeterminism may be exacerbated in the future as more ag- gressive power management features such as dynamic voltage and frequency scaling (DVFS) are introduced to GPU architectures. Second, the access timing to off-chip memory on a GPU depends on the physical location of the data. Accesses to different memory partitions have an observable delay difference [51]. With the GPU shared by many different processes in the system, it is improbable for the GPU driver to starting in the same memory partition for every run of an application. The variance in DRAM cells may also encourages more adaptive refreshing techniques that change the refreshing interval according to the dynamic status of different cells [34]. This introduces nondeterministic delays for DRAM accesses. Third, arbitration/scheduling units with uninitialized states can intro- duce nondeterminism by ordering thread execution or memory requests in an application differently between different runs. This includes the hard- ware wavefront schedulers in each compute unit, the workgroup distributor and the arbiters in the interconnection network. Although these units are reset to a known initial state at power up, the operating system is unlikely 11 2.3. Deterministic Execution to reset them between kernel launches. This makes the states of these units dependent on the previously executed workload, which is usually not pre- dictable. Finally, as circuit process technology scales, transient faults in memory cells have become increasingly common. Transient failure in either on-chip or off-chip memory can trigger recover routines randomly [37], thus introducing nondeterministic latencies to memory accesses. 2.3 Deterministic Execution 2.3.1 Forms of Determinism We have observed that the determinism studied in prior works can be cate- gorized into different forms of determinism with varying overheads. A goal of this research is to determine the best form of determinism to enable the energy saving benefits of accelerators on a wider class of applications. Some previously studied forms of determinism and their tradeoffs are as follows: • Timing Determinism Timing determinism [43] gives an identical sequence of hardware states but requires extensive use of resettable flip-flops, conservative timing of DRAM refresh, and special clock domain crossing hardware in addition to restrictions at the system software level. Such costs are undesirable but may be beneficial for silicon debug. • Architecture determinism Architecture determinism produces the same result on a given hard- ware platform regardless of the existence of unsynchronized accesses to shared variables. This form of determinism is helpful when debug- ging a program that includes race conditions and has been extensively explored in recent work in the architecture community [7, 17, 18, 26]. • Synchronization determinism Synchronization determinism ensures that lock acquisition order is re- peatable [40]. In some programs different (correct) results can be produced depending upon the order locks are acquired by compet- ing threads. This non-determinism impedes software developers who must tune heuristics for complex optimization problems (e.g., CAD). Supporting this type of determinism is relatively inexpensive but does 12 2.3. Deterministic Execution not provide the benefit of deterministic execution for programs with unsynchronized accesses to shared data. • Serial determinism Serial determinism starts from a serial specification of an algorithm and produces a parallel execution with the same results [44]. A limi- tation of serial determinism is that it totally orders all operations in a program, even when more flexibility may be allowable or desirable in a particular application (e.g. graph traversal). However, an ap- pealing aspect of serial determinism is that it yields identical results independent of the number of active cores. • Algorithmic determinism Finally, and orthogonal to the other forms of determinism, is the no- tion of algorithmic determinism. Algorithmic determinism allows the order of synchronization to be relaxed in the presence of commuta- tive and associative operations while still preserving a semantically deterministic outcome. This corresponds to the notion of external determinism described by Emrath and Padua [20]. The additional scheduling flexibility admitted by algorithmic determinism can enable higher performance, but requires additional programmer effort to ver- ify commutativity and associativity. Algorithmic determinism can be incorporated into any of the other notions of determinism, though this has currently been done only for serial determinism [21]. A few interesting trends are apparent from this determinism taxonomy. First, each of the first four forms of determinism applies to a larger class of programs than the subsequent form. For example, timing determinism is the strongest notion of determinism because it removes not just the non- determinism in synchronization and unsynchronized accesses but also the non-determinism in the latencies of various operations. Similarly, serial determinism is the weakest notion of determinism because it exists only for single-threaded programs, which are a subset of well-synchronized parallel programs and of arbitrary parallel programs. Second, there is empirical evidence from existing systems that stronger notions of determinism are more expensive in terms of implementation com- plexity and runtime overhead. This makes intuitive sense: ensuring that unsynchronized memory accesses execute deterministically requires inter- posing on every potentially unsynchronized memory access, which are far more frequent than explicit synchronization operations. 13 2.3. Deterministic Execution 2.3.2 Benefits of Deterministic Execution This section discusses number of benefits provided by a deterministic system. These benefits apply to all kind of systems that ensure strong determinism as well as GPUDet. Repeatability Using a deterministic system, users can expect a repeatability guarantee from software. That means the execution of a program with single inputs will always results in the same outputs. As an example, FPGA users always expect the CAD tool to generate the same hardware for their design in differ- ent compilations, so they can reliably test, verify and tune the performance of their designs. Debugging Application developers are highly dependent on the deterministic execution during the debugging of sequential applications. Based on the determinism guarantee, programmers are able to iteratively run the program and ob- tain information about particular bug and trace the bug using a systematic methodology. Lack of the repeatability guaranteed in a non-deterministic parallel system, causes a bug able to appear in different places of the pro- gram in different executions. A deterministic system can precisely reproduce all non-concurrency bugs, deadlocks, atomicity violations, order violation and shared memory races in a buggy program. These kind of bugs form a large fraction of concurrency bugs found in real parallel applications [36] Additionally, a deterministic system like GPUDet can be combined with a dynamic race detector to help identify races that are caused by incorrect usage of synchronization primitives. The combination of GPUDet and dy- namic race detectors ensures the programmer to detect the first race occurs on a given inputs, because the program is running deterministically up until the race point. In this case when a race condition occurs, a programmer can systematically identify and eliminate it using the race detector and will subsequently be able to reproduce all remaining bugs. Testing Some algorithms may have several correct solutions. For example for a given tree, it is possible to have several spanning tree with the minimum weight. 14 2.3. Deterministic Execution Even the correct parallel implementation of these algorithms may result in different answers. Although all of these answers are correct, testing and verifying of these applications are difficult due toF non-determinism. By providing deterministic results, a deterministic system like GPUDet relives the process of testing. In this way, GPUDet can make a parallel GPU application more like sequential application when it comes to maintaining current testing infrastructures. 15 Chapter 3 GPU Deterministic Execution: Background and Challenges Many hardware and software techniques have been proposed for providing determinism on general-purpose multi-core processors [5, 7, 8, 17, 18, 26, 35, 40]. However, these techniques are designed for systems supporting tens, not thousands, of threads. Issues of serialization, global synchronization and per-thread fixed costs have much greater impact at GPU-scale than at CPU-scale. In this section, we first summarize the deterministic multiprocessor schemes CoreDet and Calvin [7, 26], which serve as a basis for GPUDet. We then dis- cuss the major challenges these prior approaches face in scaling up to work on GPU architectures. We have chosen to start with Calvin and CoreDet over more recent proposals because their simplicity – particularly Calvin’s in-order pipeline design that avoids hardware speculation – is a good match for current GPU architectures. Furthermore, subsequent improvements in deterministic execution hardware provide lighter-weight synchronization op- erations [18], but do not address the scaling challenges inherent in making GPU architectures deterministic. Incorporating these subsequent improve- ments with GPUDet would improve performance for the applications that have frequent communications among threads. 3.1 Background: CoreDet and Calvin CoreDet [7] is a compiler and accompanying runtime system that provides strong determinism for multithreaded programs. A multithreaded program compiled with CoreDet always produces the same result, even if the program contains data races. Calvin [26] is a processor design that provides deter- ministic execution, using an algorithm very similar to CoreDet’s but incor- porating the specifics of multi-core hardware. We provide a Calvin-centric 16 3.2. Deterministic GPU Execution Challenges overview as its hardware implementation is most similar to the GPUDet approach. The basic mechanism both schemes use to enforce determinism is de- scribed in Figure 3.1. The execution of the program is divided into quanta, deterministically-sized sequences of instructions, e.g. every 1000th instruc- tion executed by a wavefront marks the start of a new quantum. Each quantum is in turn composed of three phases. Theses phases are separated by global synchronizations. The first phase is parallel mode, wherein each processor executes in isolation. CoreDet provides isolation by appending all the stores on per-thread store buffers, so the store operations will not be visible for other threads. To prevent loading stale data CoreDet first checks whether the data is present in the store buffer before sending the request to common cache hierarchy. Calvin’s special coherence protocol ensures that a processor sees only its own writes but not those of remote processors. In ef- fect, each processor executes a single-threaded program which is inherently deterministic. At the end of parallel mode, all processors enter a global barrier before transitioning into commit mode. In commit mode the writes from each processor are made globally visible, using a deterministic parallel commit algorithm that maps well to GPU hardware (Section 4.4.1). Another global barrier separates commit mode from serial mode, during which atomic operations (which cannot execute correctly under the relaxed-consistency coherence protocol of parallel mode) execute in a deterministic, serial order. A final global barrier ends the current quantum and allows a new quantum’s parallel mode to begin. 3.2 Deterministic GPU Execution Challenges The first challenge for deterministic execution on the GPU is the lack of private caches and cache coherence. A GPU’s multiplexing of hundreds of threads onto a single first-level cache means that Calvin’s mechanism of using private caches to provide isolation between threads is not readily employable. Physically or logically partitioning each cache for use amongst hundreds of threads would dramatically reduce the cache’s effectiveness. Even if per-thread private caches were available, the lack of cache coherence in a GPU’s per-core caches rules out Calvin’s modified coherence protocol as a way of providing low-overhead thread isolation. Implementing isolation in software (as in CoreDet) allows physical resources to be shared but has high runtime overhead. Another major concern in building deterministic GPUs is dealing with 17 3.2. Deterministic GPU Execution Challenges very large numbers of threads. This in turn leads to a number of related problems. Large numbers of threads make the global barriers inherent in the CoreDet/Calvin deterministic execution algorithms much more expensive. Relatedly, atomic operations require serialization in the Calvin and CoreDet models, so their presence in a GPU kernel can quickly erode performance. By serializing thousands of threads, an atomic operation has effectively many orders of magnitude higher cost when run deterministically than when run nondeterministically. The GPU hardware features various hardware mechanisms to efficiently manage thousands of threads. These mechanisms need to be extended ac- cordingly to support deterministic execution. Specifically, with the SIMT execution model, arbitrarily pausing a scalar thread that has exhausted its instruction quota while permitting the others to proceed causes the wave- front to diverge. Handling this divergence requires substantial modification to the SIMT hardware and the extra determinism-induced divergence lowers the SIMD hardware utilization. Finally, GPU kernels have different program properties than multithreaded CPU programs. Executing larger quanta is a natural solution to amortize expensive global barriers and atomic operations, and works well in the CPU space where threads are long-lived. Unfortunately, GPU kernels often con- tain a large number of short-running threads, making global barriers both expensive and frequent. GPU threads tends to synchronize frequently within a workgroup to communicate via the on-chip scratch-pad memory. This form of hierarchical, localized communication fits poorly with the infrequent global communication model in the CoreDet/Calvin determinis- tic execution algorithms. GPU threads also typically exhibit less locality than CPU threads, particularly in terms of memory accesses: frequently- used values are instead cached in a GPU’s large register files or scratch-pad memory. This reduced locality makes Calvin’s cache-based isolation mech- anism a poor fit for GPU kernels. 18 3.2. Deterministic GPU Execution Challenges Quantum n-1 Quantum n Processor 0 Cache Hierarchy Processor 1 Barrier Barrier Barrier Store Buffer Store Buffer Quantum n+1 LOADS STORES LOADS STORES Commit Commit Atomic Ops Atomic Ops P a ra ll e l M o d e C o m m it M o d e S e ri a l M o d e Figure 3.1: Deterministic Execution in the CPU architecture using Calvin- CoreDet mechanism[26] 19 Chapter 4 GPUDet In this section, we present GPUDet, the first hardware proposal that pro- vides strong determinism for deterministic massively parallel architecture with thousands of concurrent threads. We design GPUDet starting with a naive adaptation of the deterministic multiprocessor schemes from Core- Det/Calvin onto the GPU. This scheme divides GPU kernel execution into quanta. Each quantum has three phases: parallel mode, commit mode and serial mode as illustrated in Figure 4.1. GPUDet features optimizations for each mode, leveraging the inherent determinism in the GPU architecture as well as common-case GPU program behavior to recover more parallelism while reducing complexity required to support deterministic execution. For parallel mode, GPUDet exploits the inherent determinism of cur- rent SIMT architectures to provide deterministic interaction among threads within a wavefront at no cost (Section 4.1). GPUDet also introduces a novel wavefront-aware quantum formation algorithm to eliminate unneces- sary global synchronizations and replace them with local, deterministic syn- chronizations (Section 4.2). The store buffers that provide isolated execution among deterministic wavefronts are described in Section 4.3. For commit mode, GPUDet extends the Z-Buffer hardware designed for graphics render- ing to accelerate its deterministic parallel commit algorithm (Section 4.4). For serial mode, it exploits the point-to-point ordering in the GPU’s mem- ory subsystem to reduce the amount of serialization required in this mode (Section 4.5). Section 4.6 discusses limitations in GPUDet’s current design. 4.1 Deterministic Wavefront The deterministic execution algorithm used in CoreDet and Calvin imposes determinism by pausing the execution of threads at deterministic intervals for communication [7, 26]. A naive GPU adaptation of this scheme would involve executing each scalar thread in isolation during the parallel mode and pausing the thread when it has exhausted its instruction count quota for the quantum. As mentioned in Section 3.2, this behavior interacts poorly with the implicit synchronizations of threads within a wavefront imposed by 20 4.1. Deterministic Wavefront Parallel Mode Commit Mode Serial Mode W0 W1 W2 W0 W1 W2 time Quantumn Quantumn+1 ... A A A A A Atomic Operations Quantum Boundary Deterministic Commit Wavefront Serialization A C o m p u te  U n it 0 C o m p u te  U n it 1 Figure 4.1: GPUDet-base architecture. Vertical lines show the global synchro- nization before the start of each mode. Wavefront 1 (W1) of compute unit 1 has been skipped in the serial mode since it has not reached an atomic operation in the quantumn. the SIMT execution model. Algorithm 1 shows the execution of a wavefront with 4 threads (T0 to T3) through a branch hammock. The wavefront diverges at line B. In response, the SIMT stack deactivates T2 and T3 to execute line C with only T0 and T1 active. After executing line C, T0 and T1 have exhausted their instruction count quota for the quantum, and enter the global barrier for commit mode. A deadlock has occurred, because this global barrier is also waiting for T2 and T3, which are in turn paused by the SIMT stack to reconverge with T0 and T1 for full SIMD execution at line E. The SIMT stack can be modified to resolve this deadlock (e.g. by allowing the stack to reorder the entries to resume other execution paths when the top entry path has exhausted its instruction count), but doing so introduces extra complexity and lowers the SIMD hardware utilization. GPUDet eliminates these complexities and deadlock concerns altogether by exploiting the fact that the execution of the entire wavefront is inher- ently deterministic. Deterministic execution within a wavefront also elimi- nates the need to have thread-level isolation and allows threads in the same 21 4.1. Deterministic Wavefront Algorithm 1 Thread-level isolation introduces divergence in SIMT archi- tecture. In this example, each thread may execute up to 3 operations per quantum. The number of operations executed by each thread is shown be- side each line of the code. T0 T1 T2 T3 A: v = 1; \\ 1 1 1 1 B: if( threadIdx.x < 2 ){ \\ 2 2 2 2 C: v = input[threadIdx.x]; \\ 3 3 2 2 D: } \\ E: output[threadIdx.x] += v; \\ 4 4 3 3 wavefront to share a common store buffer. The inherent determinism of wavefront execution arises from two properties: 1) Pausing the execution of a wavefront causes each thread to execute a deterministic (but not necessar- ily equal) number of instructions, and 2) existing GPU architecture already handles data-races between threads within a wavefront. As described in Section 2.1.1, the control flow of each wavefront, and the activity of its threads at every dynamic instruction, are controlled by its SIMT stack. The SIMT stack of every wavefront has deterministic initial state, since all wavefronts execute at the start of the kernel program. As the wavefront executes in parallel mode, its SIMT stack is updated to handle any branch divergence that occurs. This update is deterministic, because every thread in the wavefront is executing with input data from a determin- istic global memory state produced by the previous quantum round. Since the SIMT stack has a deterministic initial state and it is updated determin- istically, we can infer that the SIMT stack always maintains a deterministic state. The deterministic SIMT stack provides deterministic control flow for the wavefront. More importantly, it ensures that the activity of each thread in the wavefront is deterministic for every dynamic instruction. This means that a wavefront can pause after any dynamic instruction to end its parallel mode for communication. Each thread in the wavefront may have executed a different amount of work in the parallel mode due to divergence, but the amount executed by each thread is always deterministic. Data-races can occur between threads within a wavefront, but these data-races are reproduced deterministically on current SIMT hardware [29], a property we exploit in GPUDet. Data-races that occur between threads executing different dynamic instructions are always ordered deterministi- cally, because instructions from the same wavefront are executed in-order 22 4.2. Quanta Formation and the control flow of each wavefront is deterministic as per the above discussion. Data-races that occur between threads at the same dynamic memory instruction are ordered by the coalescing unit, which combines the scalar memory accesses from threads within a wavefront into accesses to wider memory chunks. When multiple threads in a wavefront write to the same location the coalescing unit generates only one memory access for the location and chooses one of the threads’ store values. To ensure determin- ism, GPUDet relies upon the fact that the coalescing unit selects the same thread’s store value for a given access pattern. A NVIDIA patent [30] de- scribes a way to deterministically handle the write collision based on thread IDs (e.g. sending data of the thread with largest thread ID). Samuli and Tero [29] have exploited this deterministic behavior to optimize their soft- ware rasterization engine for the Fermi GPU architecture [37]. We have verified this observation on current GPU hardware as well as our simulation infrastructure with the GPU-Racey stress test described in Section 2.2. In conclusion, the deterministic interaction of threads within a wavefront enables GPUDet to: 1. Use a per-wavefront store buffer (described in Section 4.3) for execu- tion isolation. 2. Pause the execution of a whole wavefront and ascertain that each thread stops at a deterministic point. 3. Count the number of executed instructions for each wavefront instead of individual threads. 4.2 Quanta Formation At certain points of execution GPUDet pauses thread execution to commit store buffers. In order to have deterministic results, these points should be selected deterministically. In other words, execution of threads should be paused at the exact same instructions in different runs of a program. These termination points are called quantum boundaries in this document. This section describes the quantum formation algorithm and how GPUDet determines quanta boundaries for each wavefront. GPUDet divides the program execution into quanta similar to previous work [7, 26, 40]. Each wavefront is allowed to execute a deterministic num- ber of instructions during the parallel mode in each quantum. To avoid deadlocks and to handle atomic operations, a wavefront may end its parallel 23 4.2. Quanta Formation 0% 20% 40% 60% 80% 100% A E S B F S r B F S f  C F D  C P  H O T S P  L IB  L P S  S R A D H T A T M C Lo p t % o f Te rm in a ti o n  R e a so n s Atomic Operations Instruction Count Execution Complete Workgroup Barriers Figure 4.2: Breakdown of events that causes a wavefront to end its quantum in GPUDet with baseline quantum formation logic. mode before reaching its instruction limit. The following are the events that can cause a wavefront to end its parallel mode: Instruction Count. A wavefront ends its parallel mode once it finishes executing a fixed number of instructions. This number is the quantum size. Atomic Operations. Atomic operations are handled in GPUDet like they are in CoreDet. A wavefront ends its parallel mode whenever it reaches an atomic operation to execute the operation in serial mode. Memory Fences. GPUs provide memory fence instructions for pro- grammers to impose ordering on memory operations. Similar to Core- Det/Calvin, a fence instruction causes a wavefront in GPUDet to end its parallel mode to commit its store buffer in commit mode. Workgroup Barriers. The GPU architecture provides workgroup-level synchronization operations called workgroup barriers. As discussed in Sec- tion 2, threads inside a workgroup cannot exit a workgroup barrier before all other threads in the same workgroup have reached the barrier. Some of these other wavefronts may have exhausted their instruction count limit before reaching the workgroup barrier and are waiting at the global bar- rier for transition into commit mode. To prevent a deadlock in this case, any wavefronts at the workgroup barrier should end their parallel mode to unblock the other wavefronts. Execution Complete. When a wavefront finishes kernel execution before exhausting its instruction count limit, it ends its parallel mode. Figure 4.2 shows the breakdown of these events for our GPU applications 24 4.2. Quanta Formation (more details in Section 6) running on GPUDet with this baseline quantum formation logic. In this figure, GPUDet is configured to have quantum size = 200 instructions. In ATM, CL and HT, wavefronts usually end their quanta by reaching an atomic operation. In AES, HOTSP, LPS and SRAD, most wavefronts end their quanta at workgroup barriers. In BFSr, BFSf and CFD, wavefronts end more then 50% of their quanta at the end of a kernel program, illustrating the short-running thread challenge discussed in Section 3.2. Collectively, these events constrain the number of instructions a wavefront may execute in each quantum, and thus limit the effectiveness of increasing quantum size to amortize the synchronization overhead at each quantum. With the above observations, we have designed workgroup-aware logic to determine the end of parallel mode of each quantum. In Section 4.2.1 and Section 4.2.2, we describe this workgroup-aware quantum formation logic with two optimizations. One optimization allows a wavefront to execute beyond a workgroup barrier without ending its parallel mode; the other permits the GPU to issue a new workgroup within a quantum. 4.2.1 Workgroup-Aware Quantum Formation Our workgroup-aware quantum formation logic (GPUDet-WA) extends the parallel mode illustrated in Figure 4.1 with an intermediate wait-for-workgroup mode. A wavefront advances to this wait-for-workgroup mode after encoun- tering one of the termination events listed in Section 4.2 and waits for other wavefronts in the same workgroup to arrive. In this intermediate mode, each wavefront can deterministically observe the states of other wavefronts in the same workgroup. This allows all wavefronts in a workgroup to collectively decide the next mode. Figure 4.3 illustrates how this mechanism reduces synchronization over- head by eliminating an unnecessary quantum boundary introduced by a workgroup barrier. This mechanism, called Barrier Termination Avoid- ance (BTA), allows wavefronts to continue execution past certain quantum boundary formation conditions improving load balance when the quantum size is increased. In this figure W0, W1 and W2 belong to one workgroup. In Figure 4.3 all wavefronts of the workgroup have been terminated by reaching a workgroup barrier. In the baseline GPUDet (Figure 4.3(a)), wavefronts will end their quantum and wait for the global synchronization before start of commit mode. However, GPUDet wavefront aware quantum formation (GPUDet-WA) allows the wavefronts to exit the barrier and resume the parallel mode without ending the quantum (BTA). Since reaching all the 25 4.2. Quanta Formation W0 W1 W2 C o m m it  M o d e S e ri a l M o d e C o m m it  M o d e S e ri a l M o d e C o m m it  M o d e S e ri a l M o d eW0 W1 W2 time Reaches Workgroup Barrier Instruction Count C o m p u te  U n it 0 (a) GPUDet-base (b) GPUDet-WA Quantum Boundary Resume Parallel Mode without Ending Quantum parallel modeparallel mode parallel mode C o m p u te  U n it 0 Figure 4.3: GPUDet quantum formation. (a) GPUDet-base behavior toward quantum termination events. (b) Workgroup-aware quantum formation of GPUDet (GPUDet-WA) allows wavefronts of one workgroup to continue the execution in parallel mode after all reaching workgroup barrier. wavefronts of workgroup is compulsory to exit the barrier, the wavefronts would not be able to exit the barrier if a wavefront in that workgroup stopped execution before the barrier. To avoid the resulting deadlock condition, all of the wavefronts of the workgroup finish the parallel phase if any of them have stopped their parallel mode before the barrier (e.g., reaching instruc- tion limit). A similar scenario arises when all wavefronts of the workgroup are termi- nated by reaching the end of their kernels. In this case, GPUDet launches a new workgroup and the execution continues in the parallel phase helping to ameliorate the impact of load imbalance between workgroups and gaining larger quanta. 4.2.2 Deterministic Workgroup Distribution In Section 2, we described how a hardware workgroup distributor assigns a new workgroup to a compute unit whenever it has a free “slot” (subset 26 4.2. Quanta Formation (a) Nondeterministic Baseline Idle Cycles (b) GPUDet Baseline Workgroup Distribution WG = Workgroup WGS = Workgroup Slot Deterministic Partitioning WG Issue Queue time (c)  Deterministic Workgroup Partitioning Issued Workgroups Spawn New Workgroup Quantum Boundary WG5 .. WG4 WG3 WG2 WG1 WG0 WG5 .. WG4 WG3 WG2 WG1 WG0 .. WG6 WG4 WG2 WG0 .. WG7 WG5 WG3 WG1 WGS0 WG2WG0 ... WGS1 WG1 WG3... WGS0 WG0 WG3 ... WGS1 WG1 WG2 ... WGS0 WG2WG0 ... WGS1 WG1 WG3 ... Figure 4.4: Workgroup distribution 27 4.2. Quanta Formation of wavefronts with size of workgroup). This distribution of workgroups to compute units occurs both at the start of a kernel launch and when a running workgroup has finished execution. Since the commit order of wavefronts is defined by how they are assigned to hardware wavefront slots, any change in this assignment will affect the commit order priority in the commit mode and hence the execution re- sults. To achieve deterministic results, workgroups should be distributed among compute units deterministically. GPUDet has two different schemes to achieve deterministic workgroup distribution. In the default scheme illustrated in Figure 4.4(b), GPUDet spawns work- groups only at the start of each quantum. The state of the whole system is deterministic when a quantum starts, so when employing this policy the workgroup distributor has deterministic information about free workgroup slots. To ensure determinism, workgroups are assigned to free slots deter- ministically based on hardware slot ID. To achieve this, GPUDet enters a special work distribution mode right before entering the parallel mode. Workgroups are spawned only in this work distribution mode. All wave- fronts stay in this mode until the workgroup distributor finishes assigning all free workgroup slots on all compute units with new workgroups. A de- terministic set of free slots and of remaining workgroups results in spawning a deterministic set of workgroups. Although implementation of the default workgroup distribution scheme in GPUDet is simple, it performs poorly when the kernel launch consists of short-running threads. If new workgroups are only issued in work distribu- tion mode, then the number of instructions executed by a wavefront in one quantum will be limited by the amount of work per workgroup. This leads to load imbalance when some workgroups have more work than others. To address this problem, we have proposed a Deterministic Workgroup Partitioning technique (DWP) that allows workgroups to be issued deter- ministically in the middle of parallel mode. To prevent nondeterministic workgroup distribution inside parallel mode, GPUDet partitions the issue- pending workgroups among hardware workgroup slots before starting paral- lel mode. Each workgroup slot can only be replenished with the workgroups in its partition. As workgroup partitioning is done in a deterministic state (at the start of each quantum round), the final workgroup partitioning is deterministic. 28 4.3. Per-Wavefront Store Buffer 4.3 Per-Wavefront Store Buffer In GPUDet, each wavefront has a private store buffer that contains all of its global memory writes from the parallel mode in the current quantum. The store buffer is located in the private memory of the wavefront, cached by the L1 data cache and written back to off-chip DRAM. It has been observed that in GPU kernel programs, data written to global memory is rarely accessed by the writer thread/wavefront again [22]. This insight suggests organizing the store buffer as a linear write log. Each entry in this write log represents a coalesced global memory write to a coalesced memory access size -128- Byte- chunk. It has an address field indicating the destination location of the chunk, a data field, and a 128-bit byte mask to indicate the valid portion in the data field. The store buffer has a Bloom filter that summarizes the addresses updated in the write log. Each coalesced global memory read in parallel mode first queries the Bloom filter with its chunk address. A hit in the Bloom filter triggers a linear search through the write log; a miss redirects the global memory read to the normal access sequence. The cost of using a large Bloom filter to reduce false positives is amor- tized with a wavefront-shared store buffer. A 1024-bit Bloom filter only takes 6kB of storage per compute unit (assuming 48 wavefronts per unit). The Bloom filter of each wavefront can be stored in the register file space allocated to the wavefront. This eliminates any need for permanent storage at the expense of extra register file bandwidth and capacity. 4.4 Parallel Commit of Store Buffers GPUDet commits the store buffers from all wavefronts into the global mem- ory in the commit mode to allow wavefronts to communicate determinis- tically. We use the deterministic parallel commit algorithm used in Core- Det [7]. This algorithm tags the entries from the store buffer from a wave- front with a deterministic ID. This ID defines the commit order of this wavefront with respect to the other wavefronts. The wavefronts can at- tempt to commit the entries in their store buffers in parallel. The algorithm uses the deterministic ID to determine the final writer, in commit order, to each memory location, and guarantees that the location contains the value written by this wavefront after the commit mode. Current GPUs have non- coherent private caches. To avoid reading stale data GPUDet flushes the L1 data caches after the commit mode. Supporting cache coherency on GPUs [45] would eliminate this overhead. 29 4.4. Parallel Commit of Store Buffers While CoreDet implemented this algorithm in software using fine-grained locks, we recognized that this algorithm is analogous to the Z-Buffer algo- rithm that controls the visibility of overlapped polygons in graphics render- ing. GPUDet adopts the Z-Buffer Unit for graphics rendering to implement a hardware accelerated version of the deterministic parallel commit algo- rithm. Currently, as far as we are aware no GPU vendors have exposed any instruction for using the Z-Buffer Unit directly in general purpose program- ming models like CUDA and OpenCL. We believe exposing this unit in general purpose programming model will not be expensive in terms of area or complexity. 4.4.1 Z-Buffer Unit Z-Buffer algorithm is designed to control the visibility of overlapped 3D ob- jects displayed on screen. In graphics rendering, 3D objects are represented by triangles that are transformed according to a given camera view to be displayed. Each pixel rendered from a triangle is assigned a depth value rep- resenting its logical order with respect to pixels from other triangles. These depth values are stored in a set of memory locations called the Z-Buffer. A specialized hardware unit in the GPU, which we call the Z-Buffer Unit, manages the Z-Buffer. Using the depth values in the Z-Buffer, the Z-Buffer Unit prevents overlapped triangles from updating the colors of the pixels that have been updated by a foreground triangle. The Z-Buffer Unit allows out-of-order writes to produce a deterministic result. Namely, each pixel on the screen displays the color of the foremost triangle covering that pixel regardless of the order of the triangle updates. There is little publicly-available information regarding the internal ar- chitecture of Z-Buffer Unit in current GPUs. The Z-Buffer Unit evaluated in this document is inspired by the details disclosed in a patent by ATI Technologies [49]. As shown in Figure 2.1, there is a Z-Buffer Unit in each memory partition, each responsible for servicing the color update requests to the memory locations managed by the partition. Each Z-Buffer Unit contains a request buffer for keeping the status of the incoming requests. Each request contains a 128-byte aligned address to the chunk of pixels it attempts to update, together with 32 color values and 32 depth values. It also has a 128-bit byte mask indicating the valid values in the color update request. A color update request first tries to retrieve the latest depth val- ues for its pixels from a cache for the depth values, called the Z-cache. The depth values from the request are then compared against the retrieved depth 30 4.4. Parallel Commit of Store Buffers values. The Z-Buffer Unit then updates the color of the pixels that passes the comparison by sending write requests to the L2 cache, and updating the depth values in the Z-cache. A request that misses at the Z-cache allocates a block in the cache, locks the block and defers its comparison until the depth values are fetched. Requests hitting at a locked block are deferred similarly. The Z-Buffer Unit overlaps multiple color update requests to tolerate the Z-cache miss latency. 4.4.2 Deterministic Parallel Commit using Z-Buffer To adopt the Z-Buffer Units for deterministic parallel commit of store buffers, GPUDet allocates a corresponding Z-Buffer for each writable global memory buffer. The allocation routine co-locates both depth and data of a memory location at the same memory partition. In commit mode, each wavefront publishes the entries in its store buffer by traversing through the linear write log. From each entry, it generates a color update request containing the buffered data, with the depth equal to the logical quantum ID concatenated with its hardware wavefront ID: Depth value = wavefront id− (quantum id× 2m) Here, m is number of bits needed to represent the maximum number of wavefronts that can run concurrently on the hardware. We include the log- ical quantum ID in the depth to ensure that data from an earlier quantum is always overwritten by memory updates from a later quantum. With- out the logical quantum ID, the depth values for all writeable locations in global memory must be reset to the largest positive integer at every quan- tum boundaries. With the quantum ID, GPUDet amortizes this overhead across a large number of quanta by resetting the Z-Buffer only when the ID overflows. 4.4.3 Implementation Details of Z-Buffer Unit Architecture In this section, we describe the architectural details of our Z-Buffer Unit design. We improved the accuracy of our simulations by incorporating the details of the Z-Buffer Unit to our GPU timing model. These details are necessary to understand for someone wants to re-implement or improve our work in future. However, reading this section would not be necessary to understand other parts of this document. In our model, Z-Buffer Unit is located on memory partition side. In- terconnect network transfers compute units’ memory requests into the cor- responded memory partitions. The Z-Buffer Unit is responsible for color 31 4.4. Parallel Commit of Store Buffers Z-Unit O ff -C h ip  D R A M  C h a n n e l L a s t L e v e l C a c h e  b a n k  ( L 2 ) Memory Partition ... ... Z-Unit O ff -C h ip  D R A M  C h a n n e l L a s t L e v e l C a c h e  b a n k  ( L 2 ) Memory Partition ... ... In te rc o n n e c t Z-Unit O ff -C h ip  D R A M  C h a n n e l L a s t L e v e l C a c h e  b a n k  ( L 2 ) Memory Partition ... ... Figure 4.5: Intraction of Z-Buffer Unit with Memory Partition update requests arriving at the memory partition. The Z-Buffer Unit re- ceives color update requests from different compute units concurrently. Z- Buffer Unit first loads current depth value of the address corresponded to the memory request and then compares it with the request’s depth value. If the request depth value is less than or equal the current depth, it updates the depth and color value of the memory location. Z-Buffer Unit overlaps the memory requests to enhance performance. However, the process of load and update has to be atomic. In other words, the Z-Buffer Unit needs to ensure that as long as a color update request is not committed, no other requests are accessing the same memory location. So the Z-Buffer Unit needs a locking mechanism to prevent accessing of multiple requests to same location concurrently. On the other hand, because there is no consistency model limit for writing in different locations, the Z-Buffer Unit is able to overlap or even reorder requests of different addresses. Figure 4.5 demonstrates how the Z-Buffer Unit interfaces with inter- connect and other parts of a memory partition. The memory partition controller forwards the Color update requests into Z-Buffer Unit. This unit loads the depth value, performs the comparison and updates the color and depth values if necessary. The depth and color values are stored in the mem- ory, however the Z-Buffer Unit interacts with L2 cache and GPUDet does not introduce any modification to the interactions between the L2 cache 32 4.4. Parallel Commit of Store Buffers R e q u e s t B u ff e r Z Cache Depth Comparator Comparator Buffer Z-Unit 8 In te rc o n n e c t L 2  C a c h e 1 2 ... 3 7 6 5 4 Figure 4.6: Architecture of Z-Buffer Unit and the DRAM. Z-Buffer Unit is responsible for handling the atomicity of read-modify-write operations for each color update request. For simplicity, we assume that there is a static binding between the depth and the color addresses. Also color and depth values are located in the same memory partition. Figure 4.6 shows the internal architecture of Z-Buffer Unit. This unit consists of a buffer for tracking the status of incoming requests, a cache with lockable cache blocks for handling atomicity, a comparator to perform the depth comparison and a L2 interface. Figure 4.7 shows a color update request. This request contains a 128- byte aligned address indicates the starting address, 32 color values, 32 depth values and a byte mask. The mask indicates valid addresses that are sup- posed to be modified by this memory request. In this request (Figure 4.7) only the first two depth and color values are valid (addresses 0x10000000 and 0x10000004). Coalescing unit generates these aligned accesses based on the memory accesses of each thread within a wavefront. Each thread generates a color update request and sends it to the coalescing unit. Based on the addresses of these requests, coalescing unit may generate one or several coalesced memory 33 4.4. Parallel Commit of Store Buffers C0 C2C1 ... C31C30 D0 D2D1 ... D31D30 1 01 ... 00 Color Update Request 0x10000000Address Depth Color Mask Figure 4.7: A Color Update Request access requests each with a different mask and sends the requests to the memory partitions through the interconnects. Z-Buffer Unit accepts one color update request in each cycle if the request buffer is not full. Each request buffer entry has two status bits that indicate the status of the request. Status can be invalid, ready, or blocked. Ready specifies whether a request is ready to schedule and blocked specifies if it is waiting for previous requests to finish their operations to the same memory location. GPUDet Z-Buffer Unit model uses a set associative cache for keeping the depth values called Z-Cache. Cache blocks of this cache can be locked. A locked cache block cannot be read by other requests and also cannot be evicted from the cache. The block size of the Z-Cache is 128 byte equal to the size of depth values of a request. Z-Buffer Unit stores the Color update request in a free request buffer entry( 1 ) and changes the status of the entry to ready. A scheduler selects one ready request and probes the request address in the Z-Cache ( 2 ) to get the current depth value. If the cache block has been locked by other requests, the request buffer entry status is changed to blocked and the entry will de-schedule until Z-Cache entry is released. If the Z-Cache entry status is valid, the depth value is read and stored into the comparator buffer ( 3 ). Also the status of cache block is changed to locked to prevent other requests trying to access the same location. Comparator unit selects one entry of comparator buffer in each cycle and compares the depth values with the request depth values ( 4 ). Notice that the comparator, compares up to 32 (based on the mask) pairs of depth per each color update request. The result of comparison is stored in a separate mask called the result mask. If the depth value of the request is less than the 34 4.5. Compute Unit Level Serialization current depth value, the color value is updated and the depth value stores the smaller number. In case of requiring the update, Z-Buffer Unit sends a memory request with the result mask to to L2 cache( 5 ). Also, the depth value is being up- dated in Z-Cache( 6 ) and the status of the cache block is changed back into valid. On the other hand, all the blocked requests in the request buffer will be released and become available for scheduling again. To do so, whenever an entry in the comparison buffer is being released a controller checks the addresses of entries in the request buffer and update their status into ready if it is blocked. This scenario happens when the depth value exists in Z-Cache. In case of miss in the Z-Cache, a cache entry is allocated for that address and a miss request is sent to L2 cache( 7 ). Also the status of the new allocated block in the Z-Cache is set into locked to prevent other requests accessing the same cache block. Furthermore, Z-Buffer Unit allocates an entry in the comparator buffer. This allocation eliminates circular dependency and prevents deadlock conditions. L2 cache fills the depth value in the allocated comparator buffer entry. In this case the depth values are stored in the Z- Cache after comparison( 6 ). Miss accesses in the Z-Cache may cause a write back because of the eviction. However, a cache block cannot be evicted if it is in locked status. Finally, when both depth and color values are being updated, the request buffer releases the allocated entry and sends an acknowledgement to the corresponded compute unit through the interconnects( 8 ). 4.5 Compute Unit Level Serialization In serial mode, the wavefronts execute atomic operation in a determinis- tic serial order, with the operations updating the global memory directly. Each wavefront executes only one atomic operation in serial mode. Each atomic operation can take 100s of cycles to execute in the GPU’s throughput- optimized memory subsystem. Our evaluation in Section 6.2.3 shows that this naive, wavefront-level serialization (W-Ser) significantly slows down ap- plications that use atomic operations. To recapture parallelism, GPUDet exploits the point-to-point ordering guaranteed by our GPU memory model and common in network-on-chip designs [50] to overlap the execution of atomic operations within a compute unit (CU-Ser). As long as the interconnection network guarantees point-to- point ordering, atomic operations from the same compute unit to the same 35 4.5. Compute Unit Level Serialization W0 W1 W2 W0 W1 W2 C o m m it  M o d e W0 W1 W2 W0 W1 W2 C o m m it  M o d e time Serialization Atomic OperationP a ra lle l M o d e P a ra lle l M o d e A A A A A A serial mode serial mode (a) (b) A A A A A C o m p u te  U n it 0 C o m p u te  U n it 0 C o m p u te  U n it 1 C o m p u te  U n it 1 Figure 4.8: Serial mode in GPUDet. (a) Serializing execution of all atomic operations in GPUDet-base. (b) Overlapping execution time of atomic op- erations from each compute unit by GPUDet. memory partition will arrive in the original order. With little if any potential performance gain at stake, the memory partitions do not reorder accesses to the same memory location. By resetting the wavefront scheduler in each compute unit at the start of serial mode, GPUDet forces the unit to issue its atomic operations in a deterministic order. This order is preserved by the point-to-point ordering in the GPU memory subsystem. On the other hand, since there is no guaranteed ordering between requests from different cores, GPUDet has to serialize memory operations from different compute units. Figure 4.8 demonstrates how this optimization reduces serialization over- head in GPUDet by eliminating serialization within each compute unit. In Figure 4.8(a), the atomic operations from all workgroup are executed seri- ally. In Figure 4.8(b), each compute unit overlaps execution of their wave- fronts in serial phase, so the serialization overhead is reduced by a factor of 36 4.6. Limitations number of wavefronts in the compute unit. 4.6 Limitations The current design of GPUDet does not enforce determinism in the use local memory (or shared memory in NVIDIA terminology). Local mem- ory accesses update the on-chip scratch-pad memory directly and rely on the application to remove data-races via workgroup barriers. The wavefront scheduler at each compute unit can be modified to issue wavefronts in deter- ministic order. This modified scheduler will provide deterministic execution for local memory accesses even in the presence of data-race. We leave proper support for deterministic local memory accesses as future work. Since the current Z-buffer design binds each 32-bit color pixel in mem- ory with a depth value, the Z-buffer-based parallel commit algorithm is not directly applicable to applications with byte-granularity writes. An inter- mediate solution, without modifying the Z-Buffer Unit, is to commit (only) the store buffer entries with byte-granularity writes in the serial mode. We leave an evaluation of this as future work. Note that the per-wavefront store buffer is already designed to support byte-granularity accesses. Each store buffer entry contains a 128-bit mask indicating the modified bytes, and the Bloom filter is only responsible for identifying accesses to the same 128-byte chunk. 37 Chapter 5 Methodology We extended GPGPU-Sim 3.0.2 [6] to model a nondeterministic GPU ar- chitecture by randomizing the interconnect injection order of requests from different compute units. In this chapter we introduce our benchmarks, de- scribe our simulator configuration and discuss our methodology to analyze which of our proposed hardware are performance bottlenecks and needed to be optimized. 5.1 Benchmarks We evaluate the performance impact of GPUDet on a set of CUDA/OpenCL benchmarks (listed in Table 5.1) from Rodinia[15], Bakhoda et al.[6] and Fung et al.[22]. We ran each benchmark to completion. We do not exclude any benchmarks because of poor performance. We do exclude one bench- mark that assumes workgroups are spawned in ascending ID order, and five benchmarks that contain hard-to-eliminate byte-granularity writes. We include a version of cloth simulation (CLopt) with a GPU-optimized work distribution scheme that transforms the non-coalesced memory accesses into coalesced accesses. This optimized version performs 30% faster than the original one on the nondeterministic baseline architecture. On GPUDet, this optimization significantly reduces the number of entries in the per- wavefront store buffers, lowering the overhead of each write-log search. It also generates fewer write-log searches by eliminating the aliased read-write accesses within a wavefront accessing different bytes in the same 128-bytes. These two effects cause CLopt to perform significantly better than CL on GPUDet (Section 6.3.1). We also include the version of BFS graph traver- sal with data-races (BFSr) from Section 1.1 as well as a corrected version (BFSf) from Rodinia [15]. Both BFSr and BFSf are modified to use 32-bit boolean flags to eliminate byte-granularity writes. We used GPU-Racey (described in Section 2.2) to verify our nondeter- ministic extension to GPGPU-Sim, and to verify that our model of GPUDet can provide deterministic execution under this nondeterministic simulation framework. 38 5.2. Simulation Configuration Table 5.1: Benchmarks Name Abbr. Without Atomic Operations AES Cryptography [6] AES BFS Graph Traversal (with Data-Race) [6] BFSr BFS Graph Traversal (Race-Free) [15] BFSf Computational Fluid Dynamics Solver [15] CFD Coulumb Potential [6] CP HotSpot [15] HOTSP LIBOR [6] LIB 3D Laplace Solver [6] LPS Speckle Reducing Anisotropic Diffusion [15] SRAD With Atomic Operations Cloth Simulation [22] CL Cloth Simulation (Optimized) CLopt Hash Table [22] HT Bank Account [22] ATM 5.2 Simulation Configuration Our modified GPGPU-Sim is configured to model a Geforce GTX 480 (Fermi) GPU [37] with the configuration parameters distributed with GPGPU-Sim 3.0.2. In this configuration, the GPU has 15 compute units and 6 mem- ory partitions. Each compute unit can sustain up to 1536 threads shar- ing a 128kB register file, and it has a 48kB L1 data cache and a 16kB scratch-pad memory buffer. Each wavefronts has 32 threads. The com- pute units access the 768kB L2 unified cache at the memory partitions via two unidirectional crossbars, providing a peak bandwidth of 268GB/s per direction. The off-chip GDDR5 DRAM channels can deliver an aggregate bandwidth of 177GB/s. Table 5.2 summarize major configuration parame- ters in our simulator. In our default GPUDet configuration, each Z-Buffer Unit runs at 650MHz, has a 16kB Z-cache, and a 16 entry request buffer. The per-wavefront store buffer uses a 1024-bit bloom filter, implemented with a Parallel Bloom Filter [42] with 4 sub-arrays each indexed with a dif- ferent hash function. This default configuration assumes that each global barrier between parallel, commit, and serial modes in GPUDet takes zero cycles. Section 6.3.4 investigates the sensitivity of GPUDet’s performance to various Z-cache sizes, smaller store buffer bloom filters, and higher global barrier latencies. Our modified version of GPGPU-Sim and the benchmarks are available online [1]. 39 5.3. Overall Methodology: Finding Performance Bottleneck Table 5.2: GPGPU-Sim Configuration # Compute Units 15 Wavefront Size 32 SIMD Pipeline Width 16 Number of Threads / Unit 1536 Number of Registers / Unit 32768 Branch Divergence Method PDOM [23] Warp Scheduling Policy Loose Round Robin Shared Memory / Unit 16KB Constant Cache Size / Unit 8KB Texture Cache Size / Unit 5KB, 32B line, 20-way assoc. L1 Data Cache / Unit 48KB, 128B line, 6-way assoc. L2 Unified Cache 128KB/Mem. Part., 128B line, 8-way assoc. Interconnect Topology 1 Crossbar/Direction Interconnect BW 32 (Bytes/Cycle) (160GB/dir.) Interconnect Latency 5 Cycle (Interconnect Clock) Compute Unit Clock 1400 MHz Interconnect Clock 700 MHz Memory Clock 900 MHz # Memory Partitions 6 DRAM Req. Queue Size 32 Memory Controller Out-of-Order (FR-FCFS) GDDR3 Memory Timing tCL=12 tWL=4 tRP=12 tRC=40 tRAS=28 tRCD=12 tRRD=6 tCDLR=5 tWR=12 Memory Channel BW 32 (Bytes/Cycle) Min. L2/DRAM Latency 220 Cycle (Compute Core Clock) Z-Unit Z-Unit Clock 650 MHz Z-Cache 16kB, 128B line, 4-way assoc. Request Buffer Size 16 Store Buffer Bloom Filter Implementation Parallel Bloom Filter [42] Capacity 1024 bits # Hash Functions 4 5.3 Overall Methodology: Finding Performance Bottleneck In this section we describe our methodology to find performance bottlenecks in our design. We believe this methodology is applicable in any performance tuning application and provides a valuable insight for designer about the performance bottlenecks of a system. The optimization techniques described in Section 4.2 are inspired by this method. We have used a GPU performance visualization tool, AerialVision [3], interfaced with the GPGPU-Sim simulator to capture and visualize the dy- namic behavior of a GPU architecture throughout an application run. Aeri- 40 5.3. Overall Methodology: Finding Performance Bottleneck alVision provides a time-lapse view for visualizing the global runtime behav- ior of CUDA applications on a many-core architecture. This feature helps hardware and software developers identify sources of dynamic and intermit- tent inefficiency. We use AerialVision to identify the sources of performance overhead in the parallel mode. We provide the graphical output of different versions of GPUDet (Sec- tion 6.2.1), to illustrate the efficiency of each optimization technique (Sec- tion 4.2). Figure 5.1, Figure 5.2, Figure 5.3, and Figure 5.4 show the graph- ical output of AerialVision for the AES and BFSr benchmarks for different configurations. In these figures, the Y-axis represents 15 different compute units and the X-axis shows the time. Darker color indicates larger IPC for a compute unit at each time period (500 cycles). In other words, AerialVision counts the number of executed wavefront instructions by each compute unit in a time period, then divides this number to the length of the period to find average IPC. The white color indicates whether the compute unit is idle or is in the commit or serial mode. Vertical lines illustrate the start of each quantum. Figure 5.1 shows the IPC for AES benchmark for the GPUDet-base configuration when the quantumsize = 1000. As it can be observed, the average length of quanta is much shorter than 1000 (around 35). We believe the reason for the short quantum size is the frequent workgroup synchroniza- tions exist in this benchmark. This large number of short running quanta introduces a large number of global synchronizations and performance over- head as well. Workgroup Aware Quantum Formation (Section 4.2) allows GPUDet to continue execution in parallel mode when all wavefronts of a workgroup reach a workgroup barrier. As Figure 5.2 confirms, this tech- nique substantially reduces the number of quanta in the applications with large number of workgroup barriers. Figure 5.3 and Figure 5.4 are other examples that show how we used ArialVision to find the performance bottleneck. By visualizing the IPC for BFSr benchmark with GPUDet-base configuration (Figure 5.3) One can di- rectly observe that the main performance overhead in parallel mode is caused by load imbalance among different workgroups. We investigated the reason for this load imbalance and found that it is caused by finishing short run- ning kernels during parallel mode. In other words, since the kernel length is variant among different threads, some workgroups finish the execution of a kernel much earlier than the other ones. Since GPUDet-base does not allow hardware to issue a new workgroup in the middle of parallel mode, the compute unit will remain underutilized until the end of the quantum. By issuing a new workgroup using the Deterministic Workgroup Partitioning 41 5.3. Overall Methodology: Finding Performance Bottleneck Figure 5.1: IPC of different compute units in GPUDet-Base for AES. Figure 5.2: IPC of different compute units in GPUDet for AES. 42 5.3. Overall Methodology: Finding Performance Bottleneck technique (Section 4.2.2) compute units are allowed to spawn a new work- group without waiting for a new quantum. This technique helps GPUDet to balance the workload in the parallel mode and reduce the performance overhead. To summarize, is our methodology to improve the performance of GPUDet often consisted of the following steps: 1. Find an application with substantial performance overhead with GPUDet. 2. Visualize many performance metrics using ArialVision. 3. Find out the main reason for the performance overhead using visual- ized metrics and the benchmark code itself. 4. Propose a technique to overcome the recognized problem. 5. Implement the technique and explore the result. 6. Verify the technique by testing and visualization. The workgroup-aware quantum formation and deterministic workgroup partitioning have developed using above-mentioned method. 43 5.3. Overall Methodology: Finding Performance Bottleneck Figure 5.3: IPC of different compute units in GPUDet-Base for BFS. Figure 5.4: IPC of different compute units in GPUDet for BFS. 44 Chapter 6 Experimental Results This chapter provides a comprehensive discussion about our experimental re- sults. First, we evaluate the overall performance of GPUDet over all bench- marks. Then, we evaluate each optimization technique in parallel mode (Section 4.2) individually. We compare our Z-Buffer algorithm to commit the store buffers with previous work lock-based mechanism and also assess our proposed Compute Unit Serialization technique. Finally we provide widespread sensitivity analysis for quantum size, Z-Buffer Unit configura- tion, Bloom filter, and global synchronization overhead. 6.1 Overall Performance To evaluate our system, we compare the execution time of the benchmarks on a baseline nondeterministic architecture (NONDET) and the optimized version of GPUDet. We configure the quantum size to 200 instructions. We increment the instruction count whenever a wavefront executes an in- struction, regardless of how many threads in the wavefront execute that instruction. Figure 6.1 shows the total execution time of each application with GPUDet, normalized to execution time on a nondeterministic architec- ture. Our deterministic model causes about 105% performance penalty on average. The execution time of each application is broken down into time spent in parallel, commit and serial modes. Wavefronts spend the most time in parallel mode. We discuss the sources of performance overhead in par- allel mode below. As discussed in Section 4.5, GPUDet only serializes the execution of atomic operations: applications without atomic operations (all except CLopt, HT, ATM) skip serial mode entirely. We have found that some applications (CP, AES) perform slightly better with deterministic execution. A deeper inspection reveals that our work- group distribution algorithm (Section 4.2.2) results in a more even distri- bution of workgroups compared to the baseline architecture. The baseline architecture tries to find the first free hardware slot to spawn a workgroup. This mechanism can cause uneven distribution at the very end of a ker- nel launch when all workgroups of one compute unit finish their executions 45 6.2. Impact of GPUDet Optimizations 0 1 2 3 4 5 A E S B F S r B F S f  C F D  C P  H O T S P  L IB  L P S  S R A D H T A T M C Lo p t N o rm a li ze d  E x e cu ti o n  T im e Serial Mode Commit Mode Parallel Mode Figure 6.1: Breakdown of execution cycles. Normalized to NONDET execution time. within a short period. In this case, all remaining workgroups are assigned to the sole available compute unit. When other compute units become avail- able, no workgroups are left to be assigned, and the compute units are un- derutilized (a similar observation was noted by Bakhoda et al [6]). For these applications, GPUDet’s deterministic workgroup distribution will distribute workgroups more evenly, resulting in a small speedup. 6.2 Impact of GPUDet Optimizations This section provides data to evaluate our optimization techniques discussed in Section 4. 6.2.1 Quantum Formation To assess our proposed optimization techniques for quantum formation (Sec- tion 4.2) we implemented three versions of GPUDet shown in Figure 6.2. In the GPUDet-base configuration all optimizations are disabled. In the GPUDet-WA(BTA) the workgroup barrier optimization technique (Section 4.2.1) is enabled, allowing workgroups to synchronize without a quantum bound- ary. Finally, the GPUDet-WA(BTA+DWP) version additionally allows new workgroups to be spawned in parallel mode by deterministically partitioning of the workgroups (DWP) among workgroup slots (Section 4.2.2). Figure 6.2 shows that the barrier termination avoidance technique de- creases the execution time by 4% on average over GPUDet-base. As ex- 46 6.2. Impact of GPUDet Optimizations 0 1 2 3 4 5 A E S B F S r B F S f  C F D  C P  H O T S P  L IB  L P S  S R A D H T A T M C Lo p t A V G N o rm a li ze d  E x e cu ti o n  T im e GPUDet-base GPUDet-WA(BTA) GPUDet-WA(BTA+DWP) Figure 6.2: Performance impact of Barrier Termination Avoidance (BTA) and Deterministic Workgroup Partitioning (DWP) techniques. Bars Normalized to NONDET execution time. pected, this improvement lies mostly with benchmarks that have frequent synchronization barriers (AES, HOTSP, LPS, SRAD). Figure 4.2 shows that encountering workgroup barriers is the dominant cause of quantum termi- nation for these applications. Our experiments confirm that activating the barrier termination avoidance technique (GPUDet-WA(BTA)) forms 3.8× fewer quanta in these four benchmarks, improving their performance by 20% on average. Figure 6.2 shows that applications with small kernel functions (BFSr, BFSf, CFD) benefit from spawning workgroups in parallel mode. Figure 4.2 confirms that most of the quanta in BFSr, BFSf and CFD are terminated by reaching the end of the kernel. The ability to start new workgroups deterministically in parallel mode (GPUDet-WA(BTA+DWP)) speeds up these applications by 19% on average. 6.2.2 Parallel Commit using Z-Buffer Unit To evaluate the Z-Buffer Unit parallel commit algorithm, we implemented a lock-based version of committing the store buffer. The lock-based ver- sion simulates the software-based deterministic parallel commit algorithm proposed in CoreDet [7]. The algorithm locks the chunk of memory that corresponds to the store buffer entry address using atomic operations and 47 6.2. Impact of GPUDet Optimizations 0 2 4 6 8 10 B u ff e r Lo ck in g B u ff e r Lo ck in g B u ff e r Lo ck in g B u ff e r Lo ck in g B u ff e r Lo ck in g B u ff e r Lo ck in g B u ff e r Lo ck in g B u ff e r Lo ck in g B u ff e r Lo ck in g B u ff e r Lo ck in g B u ff e r Lo ck in g B u ff e r Lo ck in g N o rm a li ze d  E x e cu ti o n  T im e Commit Mode Parallel & Serial Modes Z -B u ff e r Lo ck in g Z -B u ff e r Lo ck in g Z -B u ff e r Lo ck in g Z -B u ff e r Lo ck in g Z -B u ff e r Lo ck in g Z -B u ff e r Lo ck in g Z -B u ff e r Lo ck in g Z -B u ff e r Lo ck in g Z -B u ff e r Lo ck in g Z -B u ff e r Lo ck in g Z -B u ff e r Lo ck in g Z -B u ff e r Lo ck in g AES BFSr BFSf CFD CP HOTSP LIB LPS SRAD HT ATM CloptN o rm a li ze d  E x e cu ti o n  T im e Figure 6.3: Execution time comparison of committing the store buffer between the Z-Buffer Unit parallel commit and the lock-based algorithms performs the logical order priority comparisons and global memory updates in a mutually exclusive section. Although it has been reported that atomic operations contending for the same memory location can be up to 8.4× slower than non-atomic store operations in the Fermi architecture [46], in our evaluation we do notmodel this extra slowdown for the additional atomic operations required by the lock-based version. Figure 6.3 shows the normalized execution time for both of the store buffer committing algorithms. The Z-Buffer Unit commit algorithm im- proves performance by 60% on average. Since spinning for a lock in compute units generates significantly more global memory accesses, it is expected that applications with more accesses (e.g. HT) achieve more performance improvement by exploiting the Z-Buffer Unit commit algorithm. Figure 6.3 shows that the execution time of commit mode is decreased by 2.3× using the Z-Buffer Unit for the HT benchmark. 6.2.3 Serial Mode Optimization Figure 6.4 evaluates the performance overhead of serial mode in GPUDet. Since only atomic operations are serialized, we omit benchmarks without atomic operations. W-Ser (wavefront level serialization) is the GPUDet- base configuration which serializes execution of all atomic operations. In CU-Ser (compute unit serialization), serialization is only performed among compute units by executing the atomic operations of a single compute unit in parallel. CU-Ser decreases overhead of serial mode by 6.1× for these applications. 48 6.3. Sensitivity Analysis 0 2 4 6 8 10 12 14 W -S e r C U -S e r W -S e r C U -S e r W -S e r C U -S e r CLopt HT ATM N o rm a li ze  E x e cu ti o n  T im e Serial Mode Parallel&Commit Modes Figure 6.4: Execution time comparison between wavefront level (W-Ser) and compute unit level (CU-Ser) serialization of atomic operations (Section 4.5). Normalized to NONDET execution time 6.3 Sensitivity Analysis This section explores the performance impact of varying different GPUDet design parameters. 6.3.1 Quantum Size and Store Buffer Overhead In this section we evaluate the effect of quantum size on performance. For better insight, we separate out the time spent in store buffer operations. Store buffer operations entail appending entries to store buffers for stores, and traversing store buffers for replying to loads. Figure 6.5 shows the execution time of GPUDet with quantum sizes of 100, 200, 600 and 1000 instructions. Increasing quantum size has three kinds of effects on applications. For applications that have few stores to global memory (AES, CP, HOTSP and LIB) larger quanta improves performance. By increasing quantum size, we reduce the number of quantum boundaries, each of which involves expensive global synchronization. For applications with frequent accesses to global memory (BFSr and BFSf), performance is degraded with increasing quantum size, because larger quanta lead to larger store buffers. Because GPUDet uses a linear write log traversal, load instructions hitting in store buffer must perform longer 49 6.3. Sensitivity Analysis 2 4 6 8 10 N o rm a li ze d  E x e cu ti o n  T im e Serial Mode Commit Mode Write Log Overhead Parallel Mode 0 1 0 0 2 0 0 6 0 0 1 0 0 0 1 0 0 2 0 0 6 0 0 1 0 0 0 1 0 0 2 0 0 6 0 0 1 0 0 0 1 0 0 2 0 0 6 0 0 1 0 0 0 1 0 0 2 0 0 6 0 0 1 0 0 0 1 0 0 2 0 0 6 0 0 1 0 0 0 1 0 0 2 0 0 6 0 0 1 0 0 0 1 0 0 2 0 0 6 0 0 1 0 0 0 1 0 0 2 0 0 6 0 0 1 0 0 0 1 0 0 2 0 0 6 0 0 1 0 0 0 1 0 0 2 0 0 6 0 0 1 0 0 0 1 0 0 2 0 0 6 0 0 1 0 0 0 1 0 0 2 0 0 6 0 0 1 0 0 0 1 0 0 2 0 0 6 0 0 1 0 0 0 AES BFSrc BFSrf CFD CP HOTSP LIB LPS SRAD HT ATM CLopt AVG CL Figure 6.5: Sensitivity of GPUDet to quantum size. Execution time is normal- ized to NONDET. The unoptimized version of CL is not included in the average. searches. Increasing the quantum size also has a side effect on the BFS benchmark. Since this benchmark is not work optimal [31] (it may do each task several times), the number of executed instructions varies in different configurations. In this application the number of redundant tasks depends on how frequently the execution results of each thread become visible to the other threads. Enlarging quantum size causes results to be visible less frequently, BFS runs more redundant tasks which decreases performance. Our experiments reveal that the number of executed instructions increases by 26% when the quantum size increases from 200 to 1000. For the CL benchmark, the performance overhead of the linear write log is highly affected by the size of the quantum. We realized that in the In- tegrator, WriteBack and Driver Solver kernels [13] of this benchmark work is distributed block-wise among the threads, so wavefronts generate uncoa- lesced memory accesses. As described in Section 4.3, uncoalesced accesses generate many extra entries in the store buffer. Furthermore, due to Bloom filter aliasing, uncoalesced accesses result in unnecessary log traversals for load operations (Section 6.3.2). By using an interleaved distribution of work among the threads, we eliminated uncoalesced memory accesses in the CLopt application. Since eliminating uncoalesced accesses to the global memory is profitable for general GPU applications and it does not need substantial changes to CL source code, we have only included the optimized version of 50 6.3. Sensitivity Analysis CL in our overall average. As discussed in Section 4.2 quanta may be terminated because of ter- mination reasons other than instruction count. Adjusting quanta size does not affect the performance of applications (ATM, HT) in which quanta are usually terminated for other reasons like atomic operations (figure 4.2). 6.3.2 Bloom Filter Figure 6.6 evaluates the effect of Bloom filter size on performance. Since the Bloom filter configuration does not affect commit and serial modes, we present only the execution time of parallel mode. Figure 6.6 shows the ex- ecution time of the GPUDet architecture with 128, 256, 512 and 1024-bit Bloom filters, and an ideal Bloom filter which has no false positives. We only present the execution time of individual applications that use the store buffer frequently, but the average is for all our workloads. Figure 6.6 shows 6 8 N o rm a li ze d  E x e cu ti o n  T im e  i n  P a ra ll e l M o d e 128 bits 256 bits 512 bits 1024 bits 0 2 4 B F S r B F S f C F D LP S S R A D H T A T M C Lo p t A V G C L N o rm a li ze d  E x e cu ti o n  T im e  i n  P a ra ll e l M o d e Ideal BF Figure 6.6: Execution cycles in parallel mode for various sizes of Bloom filter, normalized to NONDET execution. The AVG bar shows the average among all our benchmarks. that for most of the applications (except BFSr and CL) a 256-bit Bloom filter performs as well as ideal. Increasing Bloom filter size for BFSr reduces execution time because of a high number of store buffer entries in each quan- tum. For CL, because of the large number of uncoalesced accesses, the store buffer contains many 128B entries with sparsely-populated data portions. The large number of entries causes many loads to perform unnecessary log traversals. 51 6.3. Sensitivity Analysis 6.3.3 Z-Buffer Unit We assess the sensitivity of GPUDet to the size of the Z-cache (Section 4.4.1) by adjusting the size of cache for each memory partition from 8KB to 128KB. As expected, execution time decreases with increasing Z-cache size. How- ever, since wavefronts spend most of their execution time in parallel and serial modes, the execution time decreases by less than 3% when the Z- cache is enlarged from 8KB to 128KB. 6.3.4 Global Synchronization Overhead GPUDet needs global synchronization between each execution mode. We analysed the sensitivity of GPUDet to the cost of global synchronization. Our data illustrates that the overall performance degradation with a 100 cy- cles synchronization overhead is negligible. However increasing the overhead to 1000 cycles reduces overall performance by 22% over the instantaneous synchronization model. 52 6.3. Sensitivity Analysis 1.5 2 2.5 3 3.5 4 N o rm a li ze d  E x e cu ti o n  T im e 0 0.5 1 1 2 4 8 1 6 3 2 6 4 1 2 8 2 5 6 5 1 2 1 0 2 4 2 0 4 8 N o rm a li ze d  E x e cu ti o n  T im e # of Cycles Figure 6.7: Execution time for various number of cycles, dedicated for each global synchronization, normalized to NONDET execution. The bar shows the average among all our benchmarks. 53 Chapter 7 Related Work The most closed-related work to GPUDet are proposals for providing deter- ministic execution of general CPU programs [5, 7, 8, 17, 26, 35, 40]. These schemes provide determinism for general multithreaded code on commod- ity [5, 7, 8, 35, 40] or modified general-purpose multicore CPUs [17, 26]. A naive translation of these schemes to a GPU architecture would be highly inefficient due to the massive numbers of threads involved. For exam- ple, the global barriers inherent in many deterministic execution schemes ([5, 8, 17, 26, 35]) would not scale well with 1000s of threads. Kendo’s [40] more scalable approach provides determinism for data-race-free programs only, which would scale better in a GPU environment but also reneges on many of the debugging benefits of other determinism techniques that can handle data races. There has also been extensive work on deterministic programming lan- guages ([10, 11, 14, 19, 41, 47]). Programs written in these languages are deterministic by construction, and incur little runtime overhead during ex- ecution at the expense of a restricted programming model. SHIM [19] is a deterministic version of MPI that provides determinism for message- passing programs. NESL [10] and Data Parallel Haskell [14] are pure func- tional languages that support data-parallel operations which are determinis- tic and implicitly parallel. Jade [41] is an imperative language that relies on programmer-supplied annotations to extract parallelism automatically while preserving sequential (and deterministic) semantics. StreamIt [47] is a lan- guage for streaming computations that enforces determinism by restricting communication between parallel stream kernels to occur only via FIFOs, though a later version of StreamIt allowed for out-of-band control messages that could flow both upstream and downstream without breaking determin- ism [48]. Deterministic Parallel Java [11] is a version of Java augmented with a type-and-effect system that proves non-interference between parallel fork- join tasks. None of these languages are directly applicable to current GPU programming models, which support a rich set of synchronization primitives via atomic operations on global memory. A number of projects have looked at improving GPU programmability 54 Chapter 7. Related Work through hardware or software mechanisms. Kilo TM [22] showed that trans- actional memory can be incorporated into the GPU programming model at low cost. GRace [52] proposed a race detector for GPU programs, achieving low overhead via a combination of static analysis and runtime instrumen- tation. Boyer and Skadron [12] describe a GPU emulator that detects race conditions and bank conflicts for CUDA programs, though the emulation overheads are quite high compared to native execution. The PUG sys- tem [32] and GPUVerify [9] leverage the compact nature of GPU kernels to perform static race detection, occasionally requiring annotations from programmers to avoid false positives. 55 Chapter 8 Future Work We believe GPUDet is just a starting point for exploration of opportunities to provide an easier to use environment for GPU architectures. Our ultimate goal is to prove that deterministic execution is worth being supported on massively parallel architectures. By reducing the performance overhead, exploring other types of determinism, and demonstrating the necessity of deterministic execution, we believe hardware companies will provide the support for our proposal in their future hardware. This section classifies the potential future work into two major categories. First, it discusses the opportunities to improve the performance (GPUDet). Then, we provide a brief discussion about other types of determinism. 8.1 Improving Performance of GPUDet Performance overhead of GPUDet might discourage implementation. Even for CPU architectures, it might not be acceptable to tolerate the proposed amount of performance overhead in favor of deterministic execution. Since the high performance computing has been the main motivation for GPGPU applications, performance overhead is less tolerable in GPU world. Although we believe providing an easier programming environment via deterministic hardware will enable increased programmer productivity sufficient to justify the cost of GPUDet, it would be best if one could reduce the overhead of GPUDet further. Here, we discuss some potential ways to reduce performance overhead of different modes in GPUDet and to eliminate the limitations discussed in Section 4.6. 8.1.1 Improving Performance of Atomic Operations by Taking Advantage of Relaxed Memory Ordering Atomic operations cause performance overhead in two different ways. The first and more obvious effect of atomic operations is in serial mode (Sec- tion 4.5) in which all atomic operations are enforced to be executed serially 56 8.1. Improving Performance of GPUDet among different compute units. In a massively parallel processor such as a GPU, serialization impairs performance. The second effect of these opera- tions on the performance stems from global synchronization. As discussed in Section 4.2, GPUDet finishes execution of a wavefront when reaching atomic operations. This keeps the size of a quantum limited by the number of instructions between two consecutive atomic operations regardless of the maximum quantum size. In a program with frequent atomic operations, deterministic execution enforces a large number of global synchronizations due to the large number of short running quanta. These additional global synchronizations reduce the performance due to the load imbalance among threads and the synchronization overhead itself (Section 4.2). To reduce unnecessary quantum termination by reaching atomic oper- ations, Devietti et al. proposed RCDC [18] for CPU architecture. RCDC leverages data-race-free memory models (e.g., the models for Java and C++) and relaxes memory ordering to improve performance without sacrificing de- terministic execution. Exploring a GPU friendly algorithm to implement a similar approach on GPU architectures would be an appropriate way to increase the performance of applications with frequent atomic operations (Section 6.2.3) 8.1.2 Deterministic Local Memory GPUDet does not support deterministic execution in the presence of data races in local memory (Section 4.6). Local memory is a scratch-pad mem- ory that is visible to all threads of a workgroup. In current GPU appli- cations local memory is used to capture locality of data among threads of a wavefront. Also, it helps to improve memory bandwidth by eliminating unnecessary memory accesses and enforcing coalesced memory accesses to the global memory. A workgroup usually loads a portion of data in its local memory, then computes the kernel output. Finally it writes back the results to the global memory ideally, through coalesced stores. In this scenario a race condition through local memory is possible. By extending determinism to the local memory scope, GPUDet can provide determinism for a broader range of applications. The naive solution to provide deterministic execution for the local mem- ory is to apply the same store buffer mechanism (Section 4.3) to this mem- ory. Considering the extra store operations required by this mechanism, the performance penalty would not be acceptable for local memory operations (this memory supposed to be very fast). Also the naive solution needs ex- tra scratch-pad memory for store buffers which is very expensive in term of 57 8.1. Improving Performance of GPUDet area. The other challenge is the lack of the Z-Buffer Unit in the compute units in current hardware. So the commit would need to be done through a very slow lock-based mechanism (Section 6.2.2) The better way to support determinism for local memory is to enforce de- terministic scheduling among the wavefronts of a workgroup. A determinis- tic scheduler generated a deterministic order of local memory access requests and since these requests will arrive in-order to the scratch-pad memory, it generates deterministic results. The downside of deterministic scheduling is the performance overhead caused by a reduction of ready to schedule wave- fronts. We believe a well designed scheduler which considers the contents of the cache and exploits the non-deterministic scheduling opportunity among the workgroups can provide deterministic local memory without sacrificing the performance. 8.1.3 Overlapping the Parallel and Commit Mode of Different Wavefronts To enforce determinism GPUDet needs to synchronize all wavefronts after parallel and commit mode (in absence of atomic operation and serial mode). These synchronizations not only cause performance overhead directly (Sec- tion 4.2) but also indirectly reduce performance by creating an imbalance in memory system bandwidth usage. In the parallel mode all wavefronts append their stores onto store buffers that are cached by the L1 cache and they use global memory mostly for load operations, so the memory band- width may not be fully exploited in parallel mode. On the other hand, The rest of commit mode consists of several memory write operations (Z-Buffer unit requests) which are memory bounded and limit the performance by the bandwidth. The key insight here is by overlapping the commit mode and parallel mode of different wavefronts, GPUDet can balance the memory load and maximize the performance. This might be enabled by some major changes to the hardware. One idea is to let the wavefront start the next quantum without waiting for all other wavefronts’ commit mode. This would not violate determinism if the wavefront ensures its input will not be updated by any wavefront in the previous quantum. We believe implementing this idea would greatly affect performance by reducing number of synchronizations and balancing the bandwidth. However, correct implementation of this idea needs a very careful analysis and it is hard to verify. Notice that this idea can also improve the performance of deterministic CPU hardware proposals [7, 26]. 58 8.2. Exploring Other Types of Determinism 8.2 Exploring Other Types of Determinism Different types of determinism are discussed in Section 2.3.1. GPUDet tries to exploit the benefits of architectural determinism without sacrificing per- formance. However, different types of determinism have trade-offs and each of them would be useful for different purposes (Section 2.3.2). Exploring different forms of determinism on many-core architectures is necessary to provide insight about the most suitable form for each application. 8.3 Compiler-Run Time Support for determinism GPUDet provides a hardware proposal to enforce determinism for GPU ar- chitectures. Since, none of the hardware companies has not yet provided this support for GPUs, programmers are not able to exploit determinism for real applications. Another interesting direction to continue this work is to provide a software-compiler-run time solution to achieve determinism on current generation of GPU hardware. Similar mechanisms have been explored for CPUs [7, 40]. We anticipate providing this environment in- volves a lot of elaborate implementation details due to the SIMD structure of GPUs. 59 Chapter 9 Conclusion Nondeterminism in parallel architectures hampers programmers’ productiv- ity significantly as bugs can no longer be reproduced easily. We believe this non-reproducibility problem presents a key challenge to GPU software development, discouraging use of GPUs in broader range of applications. In this document we presented GPUDet, a proposal for supporting deter- ministic execution on GPUs. GPUDet exploits deterministic aspects of the GPU architecture to regain performance. Specifically, it uses the inherent determinism of the SIMD hardware in GPUs to provide deterministic inter- action among threads within a wavefront. This amortizes the complexity of store buffers required to isolate the execution of each wavefront and works seamlessly with the existing SIMT execution model in GPUs. GPUDet uses a workgroup-aware quantum formation scheme that allows wavefronts in parallel mode to coordinate via workgroup barriers and to accept work from a deterministic workgroup distributor. GPUDet also extends the Z-Buffer Unit, an existing GPU hardware unit for graphics rendering, to determin- istically commit store buffers in parallel. Finally, GPUDet eliminates the serialization required among atomic operations from the same compute unit by exploiting the implicit point-to-point ordering in the GPU memory sub- system. Our simulation results indicate that these optimizations allow GPUDet to perform comparably against a nondeterministic baseline, despite running GPU kernels with thousands of threads. Our characterization of sources of overhead for deterministic execution on GPUs provides insights for further optimizations. 60 Bibliography [1] http://www.ece.ubc.ca/˜aamodt/GPUDet. [2] AMD. White Paper — AMD Graphics Cores Next (GCN) Architecture, June 2012. [3] Aaron Ariel, Wilson WL Fung, Andrew E Turner, and Tor M Aamodt. Visualizing complex dynamics in many-core accelerator architectures. In Performance Analysis of Systems & Software (ISPASS), 2010 IEEE International Symposium on, pages 164–174. IEEE, 2010. [4] D.C. Arnold et al. Stack Trace Analysis for Large Scale Debugging. In IPDPS, 2007. [5] Amittai Aviram, Shu-Chun Weng, Sen Hu, and Bryan Ford. Efficient system-enforced deterministic parallelism. In OSDI, 2010. [6] Ali Bakhoda et al. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In ISPASS, 2009. [7] Tom Bergan, Owen Anderson, Joseph Devietti, Luis Ceze, and Dan Grossman. CoreDet: A Compiler and Runtime System for Determinis- tic Multithreaded Execution. In ASPLOS, 2010. [8] Tom Bergan, Nicholas Hunt, Luis Ceze, and Steven D. Gribble. Deter- ministic Process Groups in dOS. In OSDI, 2010. [9] Adam Betts, Nathan Chong, Alastair F. Donaldson, Shaz Qadeer, and Paul Thomson. GPUVerify: a verifier for GPU kernels. In Proceedings of the 27th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA’12). ACM, 2012. [10] Guy Blelloch. NESL: A Nested Data-Parallel Language (Version 3.1). Technical report, Carnegie Mellon University, Pittsburgh, PA, 2007. 61 Bibliography [11] Robert L. Bocchino, Jr., Vikram S. Adve, Danny Dig, Sarita V. Adve, Stephen Heumann, Rakesh Komuravelli, Jeffrey Overbey, Patrick Sim- mons, Hyojin Sung, and Mohsen Vakilian. A Type and Effect System for Deterministic Parallel Java. In OOPSLA, 2009. [12] M. Boyer, K. Skadron, and W. Weimer. Automated Dynamic Anal- ysis of CUDA Programs. In Third Workshop on Software Tools for MultiCore Systems, 2008. [13] Andrew Brownsword. Cloth in OpenCL, 2009. [14] Manuel M. T. Chakravarty, Roman Leshchinskiy, Simon Peyton Jones, Gabriele Keller, and Simon Marlow. Data Parallel Haskell: A Status Report. In DAMP, 2007. [15] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A Benchmark Suite for Heterogeneous Comput- ing. In IISWC, 2009. [16] Brett W. Coon et al. United States Patent #7,353,369: System and Method for Managing Divergent Threads in a SIMD Architecture (As- signee NVIDIA Corp.), April 2008. [17] Joseph Devietti, Brandon Lucia, Luis Ceze, and Mark Oskin. DMP: Deterministic Shared Memory Multiprocessing. In ASPLOS, 2009. [18] Joseph Devietti, Jacob Nelson, Tom Bergan, Luis Ceze, and Dan Gross- man. RCDC: A Relaxed Consistency Deterministic Computer. In AS- PLOS, 2011. [19] Stephen A. Edwards and Olivier Tardieu. SHIM: A Deterministic Model for Heterogeneous Embedded Systems. In EMSOFT, 2005. [20] Perry A. Emrath and David A. Padua. Automatic detection of non- determinacy in parallel programs. In Proceedings of the 1988 ACM SIGPLAN and SIGOPS workshop on Parallel and distributed debug- ging, PADD ’88, pages 89–99, New York, NY, USA, 1988. ACM. [21] Matteo Frigo, Pablo Halpern, Charles E. Leiserson, and Stephen Lewin- Berlin. Reducers and other cilk++ hyperobjects. In Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, SPAA ’09, pages 79–90, New York, NY, USA, 2009. ACM. 62 Bibliography [22] Wilson W. L. Fung, Inderpreet Singh, Andrew Brownsword, and Tor M. Aamodt. Hardware Transactional Memory for GPU Architectures. In MICRO-44, 2011. [23] Wilson Fung et al. Dynamic Warp Formation and Scheduling for Effi- cient GPU Control Flow. In MICRO, 2007. [24] Pawan Harish and P. J. Narayanan. Accelerating Large Graph Algo- rithms on the GPU Using CUDA. In HiPC, 2007. [25] Mark Hill and Min Xu. http://www.cs.wisc.edu/ markhill/racey.html, 2009. [26] D.R. Hower, P. Dudnik, M.D. Hill, and D.A. Wood. Calvin: Determin- istic or Not? Free Will to Choose. In HPCA, 2011. [27] Hadi Jooybar, Wilson W.L. Fung, Mike O’Connor, Joseph Devietti, and Tor M. Aamodt. Gpudet: a deterministic gpu architecture. In Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems, ASPLOS ’13, pages 1–12, New York, NY, USA, 2013. ACM. [28] Khronos Group. OpenCL. http://www.khronos.org/opencl/. [29] Samuli Laine and Tero Karras. High-Performance Software Rasteriza- tion on GPUs. In HPG, 2011. [30] Gentaro Hirota Lars Nyland, John R. Nickolls and Tanmoy Mandal. United States Patent #8,086,806: Systems and methods for coalescing memory accesses of parallel threads (Assignee NVIDIA Corp.), April 2011. [31] Charles E. Leiserson and Tao B. Schardl. A Work-Efficient Parallel Breadth-First Search Algorithm (or How to Cope with the Nondeter- minism of Reducers). In SPAA, 2010. [32] Guodong Li and Ganesh Gopalakrishnan. Scalable SMT-Based Verifi- cation of GPU Kernel Functions. In FSE, 2010. [33] E. Lindholm et al. NVIDIA Tesla: A Unified Graphics and Computing Architecture. Micro, IEEE, 2008. [34] Jamie Liu, Benjamin Jaiyen, Richard Veras, and Onur Multu. RAIDR: Retention-Aware Intelligent DRAM Refresh. In ISCA, 2012. 63 Bibliography [35] Tongping Liu, Charlie Curtsinger, and Emery D. Berger. DTHREADS: Efficient Deterministic Multithreading. In SOSP, 2011. [36] Shan Lu, Soyeon Park, Eunsoo Seo, and Yuanyuan Zhou. Learning from mistakes: a comprehensive study on real world concurrency bug characteristics. SIGPLAN Not., 43(3):329–339, March 2008. [37] NVIDIA. NVIDIA’s Next Generation CUDA Compute Architecture: Fermi, October 2009. [38] NVIDIA Corp. NVIDIA CUDA Programming Guide v3.1, 2010. [39] NVIDIA Corp. NVML API Reference Manual v3.295.45, 2012. [40] Marek Olszewski, Jason Ansel, and Saman Amarasinghe. Kendo: Effi- cient deterministic multithreading in software. In ASPLOS, 2009. [41] Martin C. Rinard and Monica S. Lam. The design, implementation, and evaluation of Jade. ACM Trans. Program. Lang. Syst., 20(3), May 1998. [42] Daniel Sanchez, Luke Yen, Mark D. Hill, and Karthikeyan Sankar- alingam. Implementing Signatures for Transactional Memory. In MI- CRO, 2007. [43] Smruti R. Sarangi, Brian Greskamp, and Josep Torrellas. CADRE: Cycle-Accurate Deterministic Replay for Hardware Debugging. In DSN, 2006. [44] Cedomir Segulja and Tarek S Abdelrahman. Architectural support for synchronization-free deterministic parallel programming. In High Performance Computer Architecture (HPCA), 2012 IEEE 18th Inter- national Symposium on, pages 1–12. IEEE, 2012. [45] Inderpreet Singh, Arrvindh Shriraman, Wilson W. L. Fung, Mike O’Connor, and Tor M. Aamodt. Cache Coherence for GPU Archi- tectures. In HPCA, 2013. [46] Jeff A. Stuart and John D. Owens. Efficient Synchronization Primitives for GPUs. CoRR, abs/1110.4623, 2011. [47] William Thies, Michal Karczmarek, and Saman P. Amarasinghe. StreamIt: A Language for Streaming Applications. In CC ’02, 2002. 64 Bibliography [48] William Thies, Michal Karczmarek, Janis Sermulins, Rodric Rabbah, and Saman P. Amarasinghe. Teleport Messaging for Distributed Stream Programs. In PPoPP, 2005. [49] Timothy J. Van Hook. United States Patent #6,630,933: Method and Apparatus for Compression and Decompression of Z Data (Assignee ATI Technologies Inc.), October 2003. [50] S.R. Vangal et al. An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS. IEEE Journal of Solid-State Circuits, 43(1):29–41, Jan. 2008. [51] H. Wong et al. Demystifying GPU microarchitecture through mi- crobenchmarking. In ISPASS, 2010. [52] Mai Zheng, Vignesh T. Ravi, Feng Qin, and Gagan Agrawal. GRace: A Low-Overhead Mechanism for Detecting Data Races in GPU Programs. In PPoPP, 2011. 65


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items