Deterministic Execution on GPU Architectures

by

Mohammad Hadi Jooybar

B.A.Sc., University of Tehran, 2011

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in The Faculty of Graduate Studies (Electrical and Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

July 2013

© Mohammad Hadi Jooybar 2013

Abstract

Nondeterminism is a key challenge in developing multithreaded applications. Even with the same input, each execution of a multithreaded program may produce a different output. This behavior complicates debugging and limits one's ability to test for correctness. This lack of reproducibility is aggravated on massively parallel architectures like graphics processing units (GPUs) with thousands of concurrent threads. We believe providing a deterministic environment to ease debugging and testing of GPU applications is essential to enable a broader class of software to use GPUs.

Many hardware and software techniques have been proposed for providing determinism on general-purpose multi-core processors. However, these techniques are designed for small numbers of threads. Scaling them to thousands of threads on a GPU is a major challenge. Here we propose a scalable hardware mechanism, GPUDet, to provide determinism in GPU architectures. In this thesis we characterize the existing deterministic and nondeterministic aspects of current GPU execution models, and we use these observations to inform GPUDet's design. For example, GPUDet leverages the inherent determinism of the SIMD hardware in GPUs to provide determinism within a wavefront at no cost. GPUDet also exploits the Z-Buffer Unit, an existing GPU hardware unit for graphics rendering, to allow parallel out-of-order memory writes to produce a deterministic output. Other optimizations in GPUDet include deterministic parallel execution of atomic operations and a workgroup-aware algorithm that eliminates unnecessary global synchronizations.

Our simulation results indicate that GPUDet incurs only 2× slowdown on average over a baseline nondeterministic architecture, with runtime overheads as low as 4% for compute-bound applications, despite running GPU kernels with thousands of threads. We also characterize the sources of overhead for deterministic execution on GPUs to provide insights for further optimizations.

Preface

Parts of Chapters 1, 2, 3, 4, 5, 6, 7, and 9 have been published [27]: Hadi Jooybar, Wilson W.L. Fung, Mike O'Connor, Joseph Devietti, and Tor M. Aamodt. GPUDet: A Deterministic GPU Architecture. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '13), pages 1-12, March 2013. Dr. Tor Aamodt, Dr. Joseph Devietti, and Mike O'Connor supervised this project. I created the timing and functional model for all parts of this project except the store buffers (developed by Wilson W.L. Fung). I also analyzed the experimental results.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgements
1 Introduction
  1.1 Motivation: Debugging with a Deterministic GPU
  1.2 Contributions
  1.3 Thesis Organization
2 Background
  2.1 GPU Architecture
    2.1.1 SIMT Execution Model
    2.1.2 Memory Subsystem
  2.2 Sources of Nondeterminism in GPU Architectures
  2.3 Deterministic Execution
    2.3.1 Forms of Determinism
    2.3.2 Benefits of Deterministic Execution
3 GPU Deterministic Execution: Background and Challenges
  3.1 Background: CoreDet and Calvin
  3.2 Deterministic GPU Execution Challenges
4 GPUDet
  4.1 Deterministic Wavefront
  4.2 Quanta Formation
    4.2.1 Workgroup-Aware Quantum Formation
    4.2.2 Deterministic Workgroup Distribution
  4.3 Per-Wavefront Store Buffer
  4.4 Parallel Commit of Store Buffers
    4.4.1 Z-Buffer Unit
    4.4.2 Deterministic Parallel Commit using Z-Buffer
    4.4.3 Implementation Details of Z-Buffer Unit Architecture
  4.5 Compute Unit Level Serialization
  4.6 Limitations
5 Methodology
  5.1 Benchmarks
  5.2 Simulation Configuration
  5.3 Overall Methodology: Finding Performance Bottleneck
6 Experimental Results
  6.1 Overall Performance
  6.2 Impact of GPUDet Optimizations
    6.2.1 Quantum Formation
    6.2.2 Parallel Commit using Z-Buffer Unit
    6.2.3 Serial Mode Optimization
  6.3 Sensitivity Analysis
    6.3.1 Quantum Size and Store Buffer Overhead
    6.3.2 Bloom Filter
    6.3.3 Z-Buffer Unit
    6.3.4 Global Synchronization Overhead
7 Related Work
8 Future Work
  8.1 Improving Performance of GPUDet
    8.1.1 Improving Performance of Atomic Operations by Taking Advantage of Relaxed Memory Ordering
    8.1.2 Deterministic Local Memory
    8.1.3 Overlapping the Parallel and Commit Mode of Different Wavefronts
  8.2 Exploring Other Types of Determinism
  8.3 Compiler-Run Time Support for Determinism
9 Conclusion
Bibliography

List of Tables

2.1 Terminology
5.1 Benchmarks
5.2 GPGPU-Sim Configuration

List of Figures

1.1 Randomness of output for BFSr by varying the number of edges. Each data point is the average number of different outputs for 10 different random graphs. The application is launched 500 times for each graph.
2.1 Baseline GPU Architecture
2.2 Handling the Control Flow Divergence using the SIMT Stack
3.1 Deterministic Execution in the CPU architecture using the Calvin/CoreDet mechanism [26]
4.1 GPUDet-base architecture. Vertical lines show the global synchronization before the start of each mode. Wavefront 1 (W1) of compute unit 1 has been skipped in the serial mode since it has not reached an atomic operation in the quantum.
4.2 Breakdown of events that cause a wavefront to end its quantum in GPUDet with baseline quantum formation logic.
4.3 GPUDet quantum formation. (a) GPUDet-base behavior toward quantum termination events. (b) Workgroup-aware quantum formation of GPUDet (GPUDet-WA) allows wavefronts of one workgroup to continue execution in parallel mode after all of them reach the workgroup barrier.
4.4 Workgroup distribution
4.5 Interaction of the Z-Buffer Unit with a Memory Partition
4.6 Architecture of the Z-Buffer Unit
4.7 A Color Update Request
4.8 Serial mode in GPUDet. (a) Serializing execution of all atomic operations in GPUDet-base. (b) Overlapping execution time of atomic operations from each compute unit by GPUDet.
5.1 IPC of different compute units in GPUDet-Base for AES.
5.2 IPC of different compute units in GPUDet for AES.
5.3 IPC of different compute units in GPUDet-Base for BFS.
5.4 IPC of different compute units in GPUDet for BFS.
6.1 Breakdown of execution cycles. Normalized to NONDET execution time.
6.2 Performance impact of Barrier Termination Avoidance (BTA) and Deterministic Workgroup Partitioning (DWP) techniques. Bars normalized to NONDET execution time.
6.3 Execution time comparison of committing the store buffer between the Z-Buffer Unit parallel commit and the lock-based algorithms.
6.4 Execution time comparison between wavefront level (W-Ser) and compute unit level (CU-Ser) serialization of atomic operations (Section 4.5). Normalized to NONDET execution time.
6.5 Sensitivity of GPUDet to quantum size. Execution time is normalized to NONDET. The unoptimized version of CL is not included in the average.
6.6 Execution cycles in parallel mode for various sizes of Bloom filter, normalized to NONDET execution.
The AVG bar shows the average among all our benchmarks. . . . . . . . . . Execution time for various number of cycles, dedicated for each global synchronization, normalized to NONDET execution. The bar shows the average among all our benchmarks. .  6.2  6.3  6.4  6.5  6.6  6.7  44 44 46  47  48  49  50  51  53  ix  Glossary BF Bloom Filter BFSr Buggy version of BFS contains race condition BTA Barrier Termination Avoidance CPU Central Processing Unit CU Compute Unit CU-Ser Compute Unit Serialization DWP Deterministic Workgroup Partitioning FIFO First In First Out GPU Graphics Processing Unit GPUDet-base basic GPUDet without optimization techniques IPC Instruction Per Cycle NONDET non-deterministic baseline GPU architecture PC Program Counter RPC Reconvergence Program Counter SIMD Single Instruction Multiple Data SIMT Single Instruction Multiple Threads SM Streaming Multi-processor W-Ser Wavefront Serialization WA Workgroup Aware WG Workgroup x  Glossary WGS Subset of wavefront lanes in hardware with size of workgroup WL Write Log Z-Buffer A buffer in global memory keeps the depth of foremost fragment for each pixel Z-Buffer Unit The unit controlling the Z-Buffer operations Z-Cache A Specialized cache used in Z-Buffer unit  xi  Acknowledgements I would like to thanks my supervisor, Professor Tor Aamodt, for all the help he has given me over the years. Without him, this work would not have been possible. I would also like to thank the other members of the computer architecture group for their supportive ideas. I would like to specially thank Wilson W.L. Fung for his advice as a senior grad student involving my project. Finally, I’d like to thank my family for the innumerable things they have helped me out with throughout my life.  xii  Chapter 1  Introduction Massively parallel processors are designed to run thousands of threads. The energy efficiency of these processors has driven their widespread popularity for many inherently-parallel applications, such as computer graphics. We anticipate that increasing demands for energy efficiency will motivate a broader range of applications with more complex data structures and intertask communication to use these processors. Unfortunately, developing parallel applications that can utilize such architectures is hampered by the challenge of coordinating parallel tasks, especially in ensuring that a parallel program behaves correctly for every possible combination of interactions among these tasks. With the thousands of threads required to utilize current massively parallel architectures like GPUs, the number of interactions to consider is substantial. Nondeterminism makes this challenge even harder, because it thwarts attempts to reproduce anomalous behavior or to provide guarantees about future executions. Providing deterministic execution for GPU architectures is a crucial aid for simplifying parallel programming at scale. Determinism simplifies the debugging process by making program behavior reproducible for a given input. Determinism is also useful once code is deployed, to reproduce bugs that manifest in production settings. Finally, determinism amplifies the power of testing, by providing a guarantee that once a particular input has passed its tests, those tests will continue to pass in the future. Current nondeterministic parallel architectures provide no such guarantee, which allows testing to provide only probabilistic guarantees. The key challenge of determinism is providing these properties with acceptable performance overhead. 
To this end, different levels of determinism have been proposed in the literature [40]. We are proponents of strong determinism, which provides determinism even in the presence of data races. Some deterministic schemes require data-race freedom [40], offering determinism in a more limited context but with lower performance overheads. In contrast, a strongly deterministic system makes no assumptions about the program it is running, which is well-suited to dealing with the buggy programs that need determinism the most. The programmability of GPUs can benefit greatly from strong determinism.

1.1 Motivation: Debugging with a Deterministic GPU

Many existing GPU workloads are embarrassingly parallel applications free of inherent data-races. It may be tempting to dismiss the benefit of deterministic execution on GPUs for these applications. However, these applications usually feature complex data tiling to improve data locality. Subtle mistakes during development can result in nondeterministic bugs. These bugs are hard to detect since they may appear occasionally and alter their behavior. A deterministic GPU allows programmers to analyze such bugs more effectively, with a consistent execution path and output. Indeed, race conditions have appeared in GPGPU algorithms from the published literature.

To illustrate how a deterministic GPU can simplify debugging, we use a buggy version of breadth-first-search (BFS) graph traversal distributed by Harish and Narayanan [24] (BFSr). Below is a simplified version of the kernel function in BFSr, which contains a race condition.

0: __global__ void BFS_step_kernel(...) {
1:   if( active[tid] ) {
2:     active[tid] = false;
3:     visited[tid] = true;
4:     foreach (int id = neighbour_nodes) {
5:       if( visited[id] == false ) {
6:         cost[id] = cost[tid] + 1;
7:         active[id] = true;
8:         *over = true;
9: } } } }

Presumably, this kernel is designed to traverse one level of the breadth-first-search nodes during each kernel launch. Each thread (tid = thread ID) is assigned a node in the graph. In each kernel launch (at each step), each thread updates the cost of its neighbouring nodes (cost[id]) that have not been visited, and sets their active flag (active[id]), which indicates these nodes should update their own neighbours during the next kernel launch. A race condition can occur since a thread can be reading its active[tid] and cost[tid] (at lines 1 and 6) while its neighbour is updating these locations (at lines 6 and 7). This causes cost[id] to be updated nondeterministically, which can lead to incorrect output.

Figure 1.1: Randomness of output for BFSr by varying the number of edges. Each data point is the average number of different outputs for 10 different random graphs. The application is launched 500 times for each graph. [Plot: number of different results in 500 runs (y-axis, 0 to 500) versus number of edges (x-axis, 24,000 to 42,000).]

To demonstrate the benefits of a deterministic GPU, first we measure the randomness of the final results of this simple program in the presence of the race condition described above. By increasing the complexity of the input graph, the number of different outputs increases exponentially. Figure 1.1 shows the number of different outputs in 500 runs with a single input for various numbers of edges. The experiment runs on a graph with 65k nodes.
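As a concrete illustration of this measurement (our sketch, not code from the thesis), the experiment can be driven by a host-side harness along the following lines. The helper run_once() is a hypothetical placeholder for copying one graph to the GPU, running the BFSr kernel launches to completion, and copying cost[] back; counting distinct outputs this way yields the quantity plotted in Figure 1.1.

    #include <cstdio>
    #include <set>
    #include <vector>

    // Placeholder for: copy one graph to the GPU, run the BFS_step_kernel
    // launches to completion, and copy the resulting cost[] array back.
    static std::vector<int> run_once(int num_nodes) {
        std::vector<int> cost(num_nodes, 0);
        // ... kernel launches and device-to-host copies would go here ...
        return cost;
    }

    int main() {
        const int kRuns  = 500;    // matches the 500 launches per graph above
        const int kNodes = 65536;  // roughly the 65k-node graph in the text
        std::set<std::vector<int>> distinct;
        for (int i = 0; i < kRuns; ++i)
            distinct.insert(run_once(kNodes));
        // On a nondeterministic GPU this count grows with graph complexity;
        // on a deterministic GPU it should always be exactly 1.
        std::printf("distinct outputs: %zu\n", distinct.size());
        return 0;
    }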
By increasing the application complexity, even the incorrect outputs are rarely the same. Different incorrect answers stem from different execution paths, make debugging extremely challenging or even impossible. Programmers have to consider all different execution paths to ensure the program behaves correctly for any possible scheduling. Also, lack of reproducibility prevents the programmer from tracing the bugs. With a deterministic GPU, the programmer can observe the same incorrect execution across different runs. This allows one to set a watch point in a GPU debugger at the location in cost[] that has an incorrect output 3  1.2. Contributions value. The programmer can then pause the execution at every modification to this location and identify the threads responsible for the incorrect output value. Notice that this approach will not work with a nondeterministic GPU. The programmer cannot be sure that the race condition would occur at the same memory location in the second run via the debugger. This difficulty with nondeterministic parallel architecture has motivated the supercomputing community to create extensive logging tools that capture activities across the entire system, and analyze the (enormous) logs for errors [4]. We argue that providing a deterministic GPU to programmers is a more effective solution.  1.2  Contributions  This document introduces GPUDet, which is the first deterministic massively parallel architecture. GPUDet provides strong determinism for currentgeneration GPU architectures. GPUDet leverages the approach suggested by Bergan et al. [7] for multi-core CPUs. In this approach, threads are executed in an isolated memory space, communicating with other threads only at deterministic, fixed intervals. Thread isolation is realized by appending all stores to a private buffer instead of directly updating global memory. Thread updates are made globally visible via a commit process which leverages existing GPU Z-Buffer hardware to execute in a deterministic but highly parallel fashion. Read-modify-write operations that need to be made globally visible have their execution deferred until this communication phase. GPUDet executes these operations in parallel and maintains a deterministic order among them by leveraging the ordering property in existing GPU memory systems. The key contributions of this thesis are: 1. We propose GPUDet, the first hardware model for a fully deterministic GPU architecture. 2. We characterize the inherently deterministic and nondeterministic aspects of the GPU execution model and propose optimizations to GPUDet that leverage these inherent deterministic properties. 3. Our design exploits existing GPU hardware for Z-buffering to accelerate deterministic parallel committing of store buffers. 4. We introduce a workgroup-based quantum formation algorithm to enable larger quanta, which better amortizes the overheads of quantum 4  1.3. Thesis Organization formation. 5. We exploit the point-to-point ordering in GPU memory subsystem to regain parallelism within each compute unit for read-modify-write operations that are executed in serial mode.  1.3  Thesis Organization  The rest of the document is organized is follows. Chapter 2 summarizes our baseline GPU architecture, discusses sources of nondeterminism in GPUs, briefly introduces the different types of determinism and discusses the benefits of determinism. Chapter 3 summarizes existing techniques for providing deterministic execution and highlights the challenges in applying them on GPUs. 
Chapter 4 describes GPUDet and our techniques to improve performance without sacrificing determinism. Chapter 5 shows our configuration parameters and describes the benchmarks. Chapter 6 presents a comprehensive evaluation of GPUDet. Chapter 7 discusses the related work. We discuss future work in Chapter 8. Chapter 9 concludes the thesis.

Chapter 2

Background

The focus of this thesis is deterministic execution on GPU architectures. To achieve a full understanding of this work the reader should be familiar with the basic concepts of deterministic execution and GPU architecture. Our personal experience shows that people who are expert in one of these two subjects usually have limited knowledge about the other. This section provides an overview of the fundamental concepts of deterministic execution and information about general GPU architecture. Also, we categorize the sources of non-determinism in current GPU architectures.

2.1 GPU Architecture

This section summarizes the details of our baseline GPU architecture relevant to deterministic execution on GPUs. Figure 2.1 shows an overview of our baseline GPU architecture modeled after the NVIDIA Fermi GPU architecture [37]. In this document, we use OpenCL [28] terminology. However, to make the document easier to read we use thread instead of work item. Table 2.1 shows the terminology of this document.

A GPU program (CUDA [38] or OpenCL [28]) starts running on the CPU and launches parallel compute kernels onto the GPU for processing. Each kernel launch consists of a hierarchy of threads organized in equal-sized workgroups. All threads in a kernel launch execute the same kernel function, while each thread may follow a unique execution path (with some performance penalty). Threads within a workgroup can synchronize quickly via workgroup barriers and communicate via an on-chip scratch-pad memory called the local memory. Communication among threads in different workgroups occurs through global memory located in off-chip DRAM. Each thread has exclusive access to its own private memory for non-shared data.

Our GPU microarchitecture consists of a set of compute units that access multiple partitions (i.e., channels) of off-chip DRAM via an interconnection network. Each compute unit can sustain up to 1536 threads in its register file. It has a scratch-pad memory that is partitioned among the different workgroups that run concurrently on the unit. The large register file allows a compute unit to freely context switch between different threads to tolerate long-latency accesses to global memory.

Figure 2.1: Baseline GPU Architecture. [Diagram: the CPU launches a kernel (x = input[threadID]; y = func(x); output[threadID] = y) as a grid of workgroups; a workgroup distributor assigns workgroups to compute units, each with a register file, L1 cache and local memory; an interconnect links the compute units to memory partitions, each containing a Z-Buffer Unit, an atomic unit, an L2 cache slice and a DRAM channel.]
Table 2.1: Terminology

CUDA Terminology                 OpenCL Terminology    Our Terminology
Thread                           Work item             Thread
Thread Block                     Work group            Work group
Grid                             NDRange               Grid
Streaming Multi-processor (SM)   Compute Unit          Compute Unit
Shared Memory                    Local Memory          Local Memory
Global Memory                    Global Memory         Global Memory
Local Memory                     Private Memory        Private Memory
Giga-Thread scheduler            -                     Workgroup distributor

The workgroup distributor on the GPU ensures that a compute unit has sufficient hardware resources to sustain all threads in a workgroup before dispatching the entire workgroup onto the unit. A GPU program may launch a kernel with more threads than the GPU hardware may execute concurrently. In this case, the GPU hardware starts executing as many workgroups as on-chip hardware resources permit, and dispatches the remaining ones as resources are released by the workgroups that have finished execution.

2.1.1 SIMT Execution Model

GPU architectures use Single-Instruction, Multiple-Data (SIMD) hardware for enhancing computational efficiency. Rather than exposing the SIMD hardware directly to the programmer, GPUs employ a Single-Instruction, Multiple-Threads (SIMT) execution model [33]. It groups scalar threads specified by the programmer into SIMD execution groups called wavefronts [2] (warps [38] in NVIDIA nomenclature). Threads in a wavefront execute in lockstep on the SIMD hardware. Since each thread can follow a unique execution path, a wavefront may diverge after a branch. The GPU hardware automatically serializes the execution of subsets of threads that have diverged to different control flow paths. Each wavefront has a SIMT stack [16, 23] maintaining the different control flow paths that still need to be executed. This allows for nested control flow divergence.

Figure 2.2 demonstrates how the GPU handles control flow divergence using a stack-based mechanism. Each entry on the stack denotes a different path. It denotes the subset of threads active on the path with an active mask and contains a reconvergence PC marking the point in the program when these threads will reconverge with another diverged execution path. Each wavefront starts with a top-level entry with a full active mask. After a wavefront has executed a divergent branch, it pushes two entries onto the SIMT stack, one for the if and else targets of the branch. It then executes one path, say, the if branch, until it reaches the reconvergence PC. It then advances to the else target by popping the stack. The wavefront pops the stack again when it reaches the reconvergence PC once again, revealing the top-level entry from before the branch. After that, it resumes execution at the reconvergence PC with threads from both diverged paths active. In this document, we use the immediate post-dominator of each divergence branch as its reconvergence PC [23].

Figure 2.2: Handling the Control Flow Divergence using the SIMT Stack. [Diagram: the stack contents (PC, RPC, active mask) over time for an if/else hammock; after the branch diverges, entries with active masks 11001100 and 00110011 and reconvergence PC D are pushed, each is popped when its path reaches D, and execution resumes at D with the full mask 11111111.]

2.1.2 Memory Subsystem

Individual scalar accesses to global memory from threads in a wavefront are coalesced into wider accesses to 32-, 64-, or 128-byte chunks. The L1 data cache services one coalesced access per cycle.
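To make the coalescing step concrete, the sketch below (our illustration, not the hardware's implementation) collapses the per-thread addresses of one wavefront into 128-byte chunk requests; the wavefront width of 32 threads and the 128-byte chunk size are assumptions taken from the surrounding text.

    #include <cstdint>
    #include <set>
    #include <vector>

    constexpr unsigned kWavefrontSize = 32;   // threads per wavefront (assumed)
    constexpr uint64_t kChunkBytes    = 128;  // widest coalesced access size

    // Returns the distinct 128-byte-aligned chunks touched by one wavefront's
    // scalar addresses; each distinct chunk becomes one memory request.
    std::set<uint64_t> coalesce(const std::vector<uint64_t>& addr,
                                const std::vector<bool>& active) {
        std::set<uint64_t> chunks;
        for (unsigned t = 0; t < kWavefrontSize && t < addr.size(); ++t)
            if (active[t])
                chunks.insert(addr[t] / kChunkBytes * kChunkBytes);
        return chunks;
    }

A unit-stride access, where thread t reads the 4-byte word at base + 4*t, falls into a single chunk, while a scattered pattern can expand to one request per active thread.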
The L1 data cache holds data from both global memory and private memory, but with different policies. It acts as a writeback cache for accesses to private memory. Writes to global memory evict any line that hits in the L1 cache to make the memory updates visible to all compute units. However, the GPU hardware does not maintain coherence among the L1 data caches in different compute units. Each unit is responsible for purging stale data from its L1 cache. Accesses that miss at the L1 cache are sent to the corresponding memory partition that contains the requested data. Each memory partition is responsible for its portion of the memory space. It contains a slice of the shared L2 cache. Accesses that miss at the L2 cache slices are in turn serviced by the off-chip DRAM controlled by the partition.

To support a rich set of synchronization primitives, the GPU programming model provides atomic operations, which are read-modify-write operations that update a single 32/64-bit word in the global memory. The wavefront sends each operation to its corresponding memory partition as with global memory accesses. Atomic operations are executed at the memory partitions with a set of specialized atomic operation units that operate directly on the L2 cache slices.

2.2 Sources of Nondeterminism in GPU Architectures

The standalone-accelerator nature of current GPU architectures helps isolate them from some sources of nondeterminism, like interrupt handlers and context switches, present in more general-purpose architectures. However, these sources will likely be increasingly relevant to future GPU architectures as GPUs become more general-purpose. To illustrate that nondeterminism exists on GPUs, we have developed GPU-Racey, a CUDA deterministic stress test, based on the CPU deterministic stress test Racey [25]. Racey computes a signature that is extremely sensitive to the relative ordering of memory accesses among threads. Across multiple runs, it should generate different signatures on a nondeterministic system, and the same signature on a deterministic system. Running GPU-Racey on real GPU hardware (a Quadro FX 5800 and a Tesla C2050) with two or more concurrent wavefronts (warps) produces nondeterministic outputs on different runs. While the exact sources of nondeterminism in commercial GPUs are undocumented, we have postulated several potential sources.

First, each GPU usually consists of multiple clock domains [39], with each domain running at its optimal frequency. The synchronizer circuits interfacing between these domains can introduce nondeterministic delays to cross-domain messages due to phase drifting among the different clocks [43]. This source of nondeterminism may be exacerbated in the future as more aggressive power management features such as dynamic voltage and frequency scaling (DVFS) are introduced to GPU architectures.

Second, the access timing to off-chip memory on a GPU depends on the physical location of the data. Accesses to different memory partitions have an observable delay difference [51]. With the GPU shared by many different processes in the system, it is improbable for the GPU driver to start in the same memory partition for every run of an application. The variance in DRAM cells may also encourage more adaptive refreshing techniques that change the refreshing interval according to the dynamic status of different cells [34]. This introduces nondeterministic delays for DRAM accesses.
Third, arbitration/scheduling units with uninitialized states can introduce nondeterminism by ordering thread execution or memory requests in an application differently between different runs. This includes the hardware wavefront schedulers in each compute unit, the workgroup distributor and the arbiters in the interconnection network. Although these units are reset to a known initial state at power up, the operating system is unlikely  11  2.3. Deterministic Execution to reset them between kernel launches. This makes the states of these units dependent on the previously executed workload, which is usually not predictable. Finally, as circuit process technology scales, transient faults in memory cells have become increasingly common. Transient failure in either on-chip or off-chip memory can trigger recover routines randomly [37], thus introducing nondeterministic latencies to memory accesses.  2.3  Deterministic Execution  2.3.1  Forms of Determinism  We have observed that the determinism studied in prior works can be categorized into different forms of determinism with varying overheads. A goal of this research is to determine the best form of determinism to enable the energy saving benefits of accelerators on a wider class of applications. Some previously studied forms of determinism and their tradeoffs are as follows: • Timing Determinism Timing determinism [43] gives an identical sequence of hardware states but requires extensive use of resettable flip-flops, conservative timing of DRAM refresh, and special clock domain crossing hardware in addition to restrictions at the system software level. Such costs are undesirable but may be beneficial for silicon debug. • Architecture determinism Architecture determinism produces the same result on a given hardware platform regardless of the existence of unsynchronized accesses to shared variables. This form of determinism is helpful when debugging a program that includes race conditions and has been extensively explored in recent work in the architecture community [7, 17, 18, 26]. • Synchronization determinism Synchronization determinism ensures that lock acquisition order is repeatable [40]. In some programs different (correct) results can be produced depending upon the order locks are acquired by competing threads. This non-determinism impedes software developers who must tune heuristics for complex optimization problems (e.g., CAD). Supporting this type of determinism is relatively inexpensive but does  12  2.3. Deterministic Execution not provide the benefit of deterministic execution for programs with unsynchronized accesses to shared data. • Serial determinism Serial determinism starts from a serial specification of an algorithm and produces a parallel execution with the same results [44]. A limitation of serial determinism is that it totally orders all operations in a program, even when more flexibility may be allowable or desirable in a particular application (e.g. graph traversal). However, an appealing aspect of serial determinism is that it yields identical results independent of the number of active cores. • Algorithmic determinism Finally, and orthogonal to the other forms of determinism, is the notion of algorithmic determinism. Algorithmic determinism allows the order of synchronization to be relaxed in the presence of commutative and associative operations while still preserving a semantically deterministic outcome. This corresponds to the notion of external determinism described by Emrath and Padua [20]. 
The additional scheduling flexibility admitted by algorithmic determinism can enable higher performance, but requires additional programmer effort to verify commutativity and associativity. Algorithmic determinism can be incorporated into any of the other notions of determinism, though this has currently been done only for serial determinism [21]. A few interesting trends are apparent from this determinism taxonomy. First, each of the first four forms of determinism applies to a larger class of programs than the subsequent form. For example, timing determinism is the strongest notion of determinism because it removes not just the nondeterminism in synchronization and unsynchronized accesses but also the non-determinism in the latencies of various operations. Similarly, serial determinism is the weakest notion of determinism because it exists only for single-threaded programs, which are a subset of well-synchronized parallel programs and of arbitrary parallel programs. Second, there is empirical evidence from existing systems that stronger notions of determinism are more expensive in terms of implementation complexity and runtime overhead. This makes intuitive sense: ensuring that unsynchronized memory accesses execute deterministically requires interposing on every potentially unsynchronized memory access, which are far more frequent than explicit synchronization operations. 13  2.3. Deterministic Execution  2.3.2  Benefits of Deterministic Execution  This section discusses number of benefits provided by a deterministic system. These benefits apply to all kind of systems that ensure strong determinism as well as GPUDet. Repeatability Using a deterministic system, users can expect a repeatability guarantee from software. That means the execution of a program with single inputs will always results in the same outputs. As an example, FPGA users always expect the CAD tool to generate the same hardware for their design in different compilations, so they can reliably test, verify and tune the performance of their designs. Debugging Application developers are highly dependent on the deterministic execution during the debugging of sequential applications. Based on the determinism guarantee, programmers are able to iteratively run the program and obtain information about particular bug and trace the bug using a systematic methodology. Lack of the repeatability guaranteed in a non-deterministic parallel system, causes a bug able to appear in different places of the program in different executions. A deterministic system can precisely reproduce all non-concurrency bugs, deadlocks, atomicity violations, order violation and shared memory races in a buggy program. These kind of bugs form a large fraction of concurrency bugs found in real parallel applications [36] Additionally, a deterministic system like GPUDet can be combined with a dynamic race detector to help identify races that are caused by incorrect usage of synchronization primitives. The combination of GPUDet and dynamic race detectors ensures the programmer to detect the first race occurs on a given inputs, because the program is running deterministically up until the race point. In this case when a race condition occurs, a programmer can systematically identify and eliminate it using the race detector and will subsequently be able to reproduce all remaining bugs. Testing Some algorithms may have several correct solutions. For example for a given tree, it is possible to have several spanning tree with the minimum weight. 14  2.3. 
Deterministic Execution Even the correct parallel implementation of these algorithms may result in different answers. Although all of these answers are correct, testing and verifying of these applications are difficult due toF non-determinism. By providing deterministic results, a deterministic system like GPUDet relives the process of testing. In this way, GPUDet can make a parallel GPU application more like sequential application when it comes to maintaining current testing infrastructures.  15  Chapter 3  GPU Deterministic Execution: Background and Challenges Many hardware and software techniques have been proposed for providing determinism on general-purpose multi-core processors [5, 7, 8, 17, 18, 26, 35, 40]. However, these techniques are designed for systems supporting tens, not thousands, of threads. Issues of serialization, global synchronization and per-thread fixed costs have much greater impact at GPU-scale than at CPU-scale. In this section, we first summarize the deterministic multiprocessor schemes CoreDet and Calvin [7, 26], which serve as a basis for GPUDet. We then discuss the major challenges these prior approaches face in scaling up to work on GPU architectures. We have chosen to start with Calvin and CoreDet over more recent proposals because their simplicity – particularly Calvin’s in-order pipeline design that avoids hardware speculation – is a good match for current GPU architectures. Furthermore, subsequent improvements in deterministic execution hardware provide lighter-weight synchronization operations [18], but do not address the scaling challenges inherent in making GPU architectures deterministic. Incorporating these subsequent improvements with GPUDet would improve performance for the applications that have frequent communications among threads.  3.1  Background: CoreDet and Calvin  CoreDet [7] is a compiler and accompanying runtime system that provides strong determinism for multithreaded programs. A multithreaded program compiled with CoreDet always produces the same result, even if the program contains data races. Calvin [26] is a processor design that provides deterministic execution, using an algorithm very similar to CoreDet’s but incorporating the specifics of multi-core hardware. We provide a Calvin-centric 16  3.2. Deterministic GPU Execution Challenges overview as its hardware implementation is most similar to the GPUDet approach. The basic mechanism both schemes use to enforce determinism is described in Figure 3.1. The execution of the program is divided into quanta, deterministically-sized sequences of instructions, e.g. every 1000th instruction executed by a wavefront marks the start of a new quantum. Each quantum is in turn composed of three phases. Theses phases are separated by global synchronizations. The first phase is parallel mode, wherein each processor executes in isolation. CoreDet provides isolation by appending all the stores on per-thread store buffers, so the store operations will not be visible for other threads. To prevent loading stale data CoreDet first checks whether the data is present in the store buffer before sending the request to common cache hierarchy. Calvin’s special coherence protocol ensures that a processor sees only its own writes but not those of remote processors. In effect, each processor executes a single-threaded program which is inherently deterministic. At the end of parallel mode, all processors enter a global barrier before transitioning into commit mode. 
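A minimal sketch of this CoreDet-style isolation, assuming word-granularity accesses and invented names (StoreBuffer, det_store, det_load), is shown below; it models the behavior described above rather than CoreDet's actual implementation.

    #include <cstdint>
    #include <unordered_map>

    // Per-thread store buffer: address -> privately buffered value.
    struct StoreBuffer {
        std::unordered_map<uint64_t, uint32_t> log;
    };

    // Stores go only to the private buffer; they stay invisible to other
    // threads until commit mode.
    void det_store(StoreBuffer& sb, uint64_t addr, uint32_t value) {
        sb.log[addr] = value;
    }

    // Loads first consult the private buffer so a thread sees its own pending
    // writes, and otherwise read the last globally committed value.
    uint32_t det_load(const StoreBuffer& sb, uint64_t addr,
                      const uint32_t* committed_word) {
        auto it = sb.log.find(addr);
        return (it != sb.log.end()) ? it->second : *committed_word;
    }
    // At the quantum boundary, commit mode publishes every logged (addr, value)
    // pair in a deterministic order and the buffer is cleared.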
In commit mode the writes from each processor are made globally visible, using a deterministic parallel commit algorithm that maps well to GPU hardware (Section 4.4.1). Another global barrier separates commit mode from serial mode, during which atomic operations (which cannot execute correctly under the relaxed-consistency coherence protocol of parallel mode) execute in a deterministic, serial order. A final global barrier ends the current quantum and allows a new quantum’s parallel mode to begin.  3.2  Deterministic GPU Execution Challenges  The first challenge for deterministic execution on the GPU is the lack of private caches and cache coherence. A GPU’s multiplexing of hundreds of threads onto a single first-level cache means that Calvin’s mechanism of using private caches to provide isolation between threads is not readily employable. Physically or logically partitioning each cache for use amongst hundreds of threads would dramatically reduce the cache’s effectiveness. Even if per-thread private caches were available, the lack of cache coherence in a GPU’s per-core caches rules out Calvin’s modified coherence protocol as a way of providing low-overhead thread isolation. Implementing isolation in software (as in CoreDet) allows physical resources to be shared but has high runtime overhead. Another major concern in building deterministic GPUs is dealing with 17  3.2. Deterministic GPU Execution Challenges very large numbers of threads. This in turn leads to a number of related problems. Large numbers of threads make the global barriers inherent in the CoreDet/Calvin deterministic execution algorithms much more expensive. Relatedly, atomic operations require serialization in the Calvin and CoreDet models, so their presence in a GPU kernel can quickly erode performance. By serializing thousands of threads, an atomic operation has effectively many orders of magnitude higher cost when run deterministically than when run nondeterministically. The GPU hardware features various hardware mechanisms to efficiently manage thousands of threads. These mechanisms need to be extended accordingly to support deterministic execution. Specifically, with the SIMT execution model, arbitrarily pausing a scalar thread that has exhausted its instruction quota while permitting the others to proceed causes the wavefront to diverge. Handling this divergence requires substantial modification to the SIMT hardware and the extra determinism-induced divergence lowers the SIMD hardware utilization. Finally, GPU kernels have different program properties than multithreaded CPU programs. Executing larger quanta is a natural solution to amortize expensive global barriers and atomic operations, and works well in the CPU space where threads are long-lived. Unfortunately, GPU kernels often contain a large number of short-running threads, making global barriers both expensive and frequent. GPU threads tends to synchronize frequently within a workgroup to communicate via the on-chip scratch-pad memory. This form of hierarchical, localized communication fits poorly with the infrequent global communication model in the CoreDet/Calvin deterministic execution algorithms. GPU threads also typically exhibit less locality than CPU threads, particularly in terms of memory accesses: frequentlyused values are instead cached in a GPU’s large register files or scratch-pad memory. This reduced locality makes Calvin’s cache-based isolation mechanism a poor fit for GPU kernels.  18  3.2. 
Figure 3.1: Deterministic Execution in the CPU architecture using the Calvin/CoreDet mechanism [26]. [Diagram: within quantum n, processors 0 and 1 execute in parallel mode, with their loads and stores going through private store buffers in front of the cache hierarchy; global barriers then separate commit mode, where the store buffers are committed, and serial mode, where atomic operations execute, before quantum n+1 begins.]

Chapter 4

GPUDet

In this section, we present GPUDet, the first hardware proposal that provides strong determinism for a massively parallel architecture with thousands of concurrent threads. We design GPUDet starting with a naive adaptation of the deterministic multiprocessor schemes from CoreDet/Calvin onto the GPU. This scheme divides GPU kernel execution into quanta. Each quantum has three phases: parallel mode, commit mode and serial mode, as illustrated in Figure 4.1. GPUDet features optimizations for each mode, leveraging the inherent determinism in the GPU architecture as well as common-case GPU program behavior to recover more parallelism while reducing the complexity required to support deterministic execution. For parallel mode, GPUDet exploits the inherent determinism of current SIMT architectures to provide deterministic interaction among threads within a wavefront at no cost (Section 4.1). GPUDet also introduces a novel workgroup-aware quantum formation algorithm to eliminate unnecessary global synchronizations and replace them with local, deterministic synchronizations (Section 4.2). The store buffers that provide isolated execution among deterministic wavefronts are described in Section 4.3. For commit mode, GPUDet extends the Z-Buffer hardware designed for graphics rendering to accelerate its deterministic parallel commit algorithm (Section 4.4). For serial mode, it exploits the point-to-point ordering in the GPU's memory subsystem to reduce the amount of serialization required in this mode (Section 4.5). Section 4.6 discusses limitations in GPUDet's current design.

4.1 Deterministic Wavefront

The deterministic execution algorithm used in CoreDet and Calvin imposes determinism by pausing the execution of threads at deterministic intervals for communication [7, 26]. A naive GPU adaptation of this scheme would involve executing each scalar thread in isolation during the parallel mode and pausing the thread when it has exhausted its instruction count quota for the quantum. As mentioned in Section 3.2, this behavior interacts poorly with the implicit synchronizations of threads within a wavefront imposed by the SIMT execution model.

Figure 4.1: GPUDet-base architecture. Vertical lines show the global synchronization before the start of each mode. Wavefront 1 (W1) of compute unit 1 has been skipped in the serial mode since it has not reached an atomic operation in the quantum. [Diagram: wavefronts W0-W2 of compute units 0 and 1 proceed through parallel mode, a deterministic commit, and serialized execution of atomic operations within quantum n, separated by quantum boundaries from quantum n+1.]

Algorithm 1 shows the execution of a wavefront with 4 threads (T0 to T3) through a branch hammock. The wavefront diverges at line B. In response, the SIMT stack deactivates T2 and T3 to execute line C with only T0 and T1 active. After executing line C, T0 and T1 have exhausted their instruction count quota for the quantum, and enter the global barrier for commit mode.
A deadlock has occurred, because this global barrier is also waiting for T2 and T3, which are in turn paused by the SIMT stack to reconverge with T0 and T1 for full SIMD execution at line E. The SIMT stack can be modified to resolve this deadlock (e.g. by allowing the stack to reorder the entries to resume other execution paths when the top entry path has exhausted its instruction count), but doing so introduces extra complexity and lowers the SIMD hardware utilization. GPUDet eliminates these complexities and deadlock concerns altogether by exploiting the fact that the execution of the entire wavefront is inherently deterministic. Deterministic execution within a wavefront also eliminates the need to have thread-level isolation and allows threads in the same wavefront to share a common store buffer.

Algorithm 1: Thread-level isolation introduces divergence in the SIMT architecture. In this example, each thread may execute up to 3 operations per quantum. The number of operations executed by each thread is shown beside each line of the code.

A: v = 1;                       // T0:1 T1:1 T2:1 T3:1
B: if( threadIdx.x < 2 ){       // T0:2 T1:2 T2:2 T3:2
C:   v = input[threadIdx.x];    // T0:3 T1:3 T2:2 T3:2
D: }
E: output[threadIdx.x] += v;    // T0:4 T1:4 T2:3 T3:3

The inherent determinism of wavefront execution arises from two properties: 1) pausing the execution of a wavefront causes each thread to execute a deterministic (but not necessarily equal) number of instructions, and 2) existing GPU architecture already handles data-races between threads within a wavefront.

As described in Section 2.1.1, the control flow of each wavefront, and the activity of its threads at every dynamic instruction, are controlled by its SIMT stack. The SIMT stack of every wavefront has a deterministic initial state, since all wavefronts begin execution at the start of the kernel program. As the wavefront executes in parallel mode, its SIMT stack is updated to handle any branch divergence that occurs. This update is deterministic, because every thread in the wavefront is executing with input data from a deterministic global memory state produced by the previous quantum round. Since the SIMT stack has a deterministic initial state and it is updated deterministically, we can infer that the SIMT stack always maintains a deterministic state. The deterministic SIMT stack provides deterministic control flow for the wavefront. More importantly, it ensures that the activity of each thread in the wavefront is deterministic for every dynamic instruction. This means that a wavefront can pause after any dynamic instruction to end its parallel mode for communication. Each thread in the wavefront may have executed a different amount of work in the parallel mode due to divergence, but the amount executed by each thread is always deterministic.

Data-races can occur between threads within a wavefront, but these data-races are reproduced deterministically on current SIMT hardware [29], a property we exploit in GPUDet. Data-races that occur between threads executing different dynamic instructions are always ordered deterministically, because instructions from the same wavefront are executed in-order and the control flow of each wavefront is deterministic as per the above discussion. Data-races that occur between threads at the same dynamic memory instruction are ordered by the coalescing unit, which combines the scalar memory accesses from threads within a wavefront into accesses to wider memory chunks.
When multiple threads in a wavefront write to the same location, the coalescing unit generates only one memory access for the location and chooses one of the threads' store values. To ensure determinism, GPUDet relies upon the fact that the coalescing unit selects the same thread's store value for a given access pattern. An NVIDIA patent [30] describes a way to deterministically handle the write collision based on thread IDs (e.g. sending data of the thread with the largest thread ID). Laine and Karras [29] have exploited this deterministic behavior to optimize their software rasterization engine for the Fermi GPU architecture [37]. We have verified this observation on current GPU hardware as well as our simulation infrastructure with the GPU-Racey stress test described in Section 2.2.

In conclusion, the deterministic interaction of threads within a wavefront enables GPUDet to:

1. Use a per-wavefront store buffer (described in Section 4.3) for execution isolation.
2. Pause the execution of a whole wavefront and ascertain that each thread stops at a deterministic point.
3. Count the number of executed instructions for each wavefront instead of individual threads.

4.2 Quanta Formation

At certain points of execution GPUDet pauses thread execution to commit store buffers. In order to have deterministic results, these points should be selected deterministically. In other words, execution of threads should be paused at the exact same instructions in different runs of a program. These termination points are called quantum boundaries in this document. This section describes the quantum formation algorithm and how GPUDet determines quanta boundaries for each wavefront.

GPUDet divides the program execution into quanta similar to previous work [7, 26, 40]. Each wavefront is allowed to execute a deterministic number of instructions during the parallel mode in each quantum. To avoid deadlocks and to handle atomic operations, a wavefront may end its parallel mode before reaching its instruction limit.

Figure 4.2: Breakdown of events that cause a wavefront to end its quantum in GPUDet with baseline quantum formation logic. [Bar chart: for each benchmark (AES, BFSr, BFSf, CFD, CP, HOTSP, LIB, LPS, SRAD, HT, ATM, CLopt), the percentage of quanta terminated by atomic operations, instruction count, execution completion, and workgroup barriers.]

The following are the events that can cause a wavefront to end its parallel mode:

Instruction Count. A wavefront ends its parallel mode once it finishes executing a fixed number of instructions. This number is the quantum size.

Atomic Operations. Atomic operations are handled in GPUDet like they are in CoreDet. A wavefront ends its parallel mode whenever it reaches an atomic operation to execute the operation in serial mode.

Memory Fences. GPUs provide memory fence instructions for programmers to impose ordering on memory operations. Similar to CoreDet/Calvin, a fence instruction causes a wavefront in GPUDet to end its parallel mode to commit its store buffer in commit mode.

Workgroup Barriers. The GPU architecture provides workgroup-level synchronization operations called workgroup barriers. As discussed in Section 2, threads inside a workgroup cannot exit a workgroup barrier before all other threads in the same workgroup have reached the barrier. Some of these other wavefronts may have exhausted their instruction count limit before reaching the workgroup barrier and are waiting at the global barrier for transition into commit mode.
To prevent a deadlock in this case, any wavefronts at the workgroup barrier should end their parallel mode to unblock the other wavefronts.

Execution Complete. When a wavefront finishes kernel execution before exhausting its instruction count limit, it ends its parallel mode.

Figure 4.2 shows the breakdown of these events for our GPU applications (more details in Section 6) running on GPUDet with this baseline quantum formation logic. In this figure, GPUDet is configured to have quantum size = 200 instructions. In ATM, CL and HT, wavefronts usually end their quanta by reaching an atomic operation. In AES, HOTSP, LPS and SRAD, most wavefronts end their quanta at workgroup barriers. In BFSr, BFSf and CFD, wavefronts end more than 50% of their quanta at the end of a kernel program, illustrating the short-running thread challenge discussed in Section 3.2. Collectively, these events constrain the number of instructions a wavefront may execute in each quantum, and thus limit the effectiveness of increasing quantum size to amortize the synchronization overhead at each quantum. With the above observations, we have designed workgroup-aware logic to determine the end of parallel mode of each quantum. In Section 4.2.1 and Section 4.2.2, we describe this workgroup-aware quantum formation logic with two optimizations. One optimization allows a wavefront to execute beyond a workgroup barrier without ending its parallel mode; the other permits the GPU to issue a new workgroup within a quantum.

4.2.1 Workgroup-Aware Quantum Formation

Our workgroup-aware quantum formation logic (GPUDet-WA) extends the parallel mode illustrated in Figure 4.1 with an intermediate wait-for-workgroup mode. A wavefront advances to this wait-for-workgroup mode after encountering one of the termination events listed in Section 4.2 and waits for other wavefronts in the same workgroup to arrive. In this intermediate mode, each wavefront can deterministically observe the states of other wavefronts in the same workgroup. This allows all wavefronts in a workgroup to collectively decide the next mode. Figure 4.3 illustrates how this mechanism reduces synchronization overhead by eliminating an unnecessary quantum boundary introduced by a workgroup barrier. This mechanism, called Barrier Termination Avoidance (BTA), allows wavefronts to continue execution past certain quantum boundary formation conditions, improving load balance when the quantum size is increased. In this figure W0, W1 and W2 belong to one workgroup. In Figure 4.3 all wavefronts of the workgroup have been terminated by reaching a workgroup barrier. In the baseline GPUDet (Figure 4.3(a)), wavefronts will end their quantum and wait for the global synchronization before the start of commit mode. However, the workgroup-aware quantum formation of GPUDet (GPUDet-WA) allows the wavefronts to exit the barrier and resume the parallel mode without ending the quantum (BTA).

Figure 4.3: GPUDet quantum formation. (a) GPUDet-base behavior toward quantum termination events.
(b) Workgroup-aware quantum formation of GPUDet (GPUDet-WA) allows wavefronts of one workgroup to continue the execution in parallel mode after all reaching workgroup barrier. wavefronts of workgroup is compulsory to exit the barrier, the wavefronts would not be able to exit the barrier if a wavefront in that workgroup stopped execution before the barrier. To avoid the resulting deadlock condition, all of the wavefronts of the workgroup finish the parallel phase if any of them have stopped their parallel mode before the barrier (e.g., reaching instruction limit). A similar scenario arises when all wavefronts of the workgroup are terminated by reaching the end of their kernels. In this case, GPUDet launches a new workgroup and the execution continues in the parallel phase helping to ameliorate the impact of load imbalance between workgroups and gaining larger quanta.  4.2.2  Deterministic Workgroup Distribution  In Section 2, we described how a hardware workgroup distributor assigns a new workgroup to a compute unit whenever it has a free “slot” (subset 26  4.2. Quanta Formation  WG = Workgroup WGS = Workgroup Slot  ..  ... ...  WGS0 WG0 WGS1 WG1  WG Issue Queue  WG3 WG2  WG5 WG4 WG3 WG2 WG1 WG0  (a) Nondeterministic Baseline  ..  ... WG3...  WGS0 WG0  WG2  WGS1 WG1  WG5 WG4 WG3 WG2 WG1 WG0  Idle Cycles (b) GPUDet Baseline Workgroup Distribution WGS0 WG0  WG3  .. ..  WGS1 WG1  WG2  ...Deterministic Partitioning ... WG6 WG4 WG2 WG0  (c) Deterministic Workgroup Partitioning time Spawn New Quantum Issued Workgroup Boundary Workgroups  WG7 WG5 WG3 WG1  Figure 4.4: Workgroup distribution  27  4.2. Quanta Formation of wavefronts with size of workgroup). This distribution of workgroups to compute units occurs both at the start of a kernel launch and when a running workgroup has finished execution. Since the commit order of wavefronts is defined by how they are assigned to hardware wavefront slots, any change in this assignment will affect the commit order priority in the commit mode and hence the execution results. To achieve deterministic results, workgroups should be distributed among compute units deterministically. GPUDet has two different schemes to achieve deterministic workgroup distribution. In the default scheme illustrated in Figure 4.4(b), GPUDet spawns workgroups only at the start of each quantum. The state of the whole system is deterministic when a quantum starts, so when employing this policy the workgroup distributor has deterministic information about free workgroup slots. To ensure determinism, workgroups are assigned to free slots deterministically based on hardware slot ID. To achieve this, GPUDet enters a special work distribution mode right before entering the parallel mode. Workgroups are spawned only in this work distribution mode. All wavefronts stay in this mode until the workgroup distributor finishes assigning all free workgroup slots on all compute units with new workgroups. A deterministic set of free slots and of remaining workgroups results in spawning a deterministic set of workgroups. Although implementation of the default workgroup distribution scheme in GPUDet is simple, it performs poorly when the kernel launch consists of short-running threads. If new workgroups are only issued in work distribution mode, then the number of instructions executed by a wavefront in one quantum will be limited by the amount of work per workgroup. This leads to load imbalance when some workgroups have more work than others. 
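To make the default scheme concrete, the sketch below shows workgroups being handed out by hardware slot ID only during work distribution mode. The structure and function names are illustrative and do not correspond to GPGPU-Sim code; the point of the sketch is that a slot freed in the middle of a quantum is not refilled until the next work distribution mode, so short workgroups leave compute units idle for the remainder of the quantum.

// A sketch of the default deterministic workgroup distribution (Section 4.2.2).
// All structure and function names here are illustrative; this is not GPGPU-Sim code.
#include <cstddef>
#include <queue>
#include <vector>

struct WorkgroupSlot {
  bool free = true;
  int  workgroup_id = -1;
};

struct WorkgroupDistributor {
  std::queue<int> pending_workgroups;   // workgroups not yet issued, in launch order

  // Called once per quantum, in work distribution mode. At this point the set of
  // free slots is deterministic, so filling them in slot-ID order is deterministic too.
  void distribute(std::vector<WorkgroupSlot>& slots) {
    for (std::size_t s = 0; s < slots.size(); ++s) {
      if (slots[s].free && !pending_workgroups.empty()) {
        slots[s].workgroup_id = pending_workgroups.front();
        pending_workgroups.pop();
        slots[s].free = false;
      }
    }
    // A slot freed in the middle of the quantum (by a short-running workgroup)
    // is not refilled until the next work distribution mode, which is the source
    // of the load imbalance described above.
  }
};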
To address this problem, we have proposed a Deterministic Workgroup Partitioning technique (DWP) that allows workgroups to be issued deterministically in the middle of parallel mode. To prevent nondeterministic workgroup distribution inside parallel mode, GPUDet partitions the issuepending workgroups among hardware workgroup slots before starting parallel mode. Each workgroup slot can only be replenished with the workgroups in its partition. As workgroup partitioning is done in a deterministic state (at the start of each quantum round), the final workgroup partitioning is deterministic.  28  4.3. Per-Wavefront Store Buffer  4.3  Per-Wavefront Store Buffer  In GPUDet, each wavefront has a private store buffer that contains all of its global memory writes from the parallel mode in the current quantum. The store buffer is located in the private memory of the wavefront, cached by the L1 data cache and written back to off-chip DRAM. It has been observed that in GPU kernel programs, data written to global memory is rarely accessed by the writer thread/wavefront again [22]. This insight suggests organizing the store buffer as a linear write log. Each entry in this write log represents a coalesced global memory write to a coalesced memory access size -128Byte- chunk. It has an address field indicating the destination location of the chunk, a data field, and a 128-bit byte mask to indicate the valid portion in the data field. The store buffer has a Bloom filter that summarizes the addresses updated in the write log. Each coalesced global memory read in parallel mode first queries the Bloom filter with its chunk address. A hit in the Bloom filter triggers a linear search through the write log; a miss redirects the global memory read to the normal access sequence. The cost of using a large Bloom filter to reduce false positives is amortized with a wavefront-shared store buffer. A 1024-bit Bloom filter only takes 6kB of storage per compute unit (assuming 48 wavefronts per unit). The Bloom filter of each wavefront can be stored in the register file space allocated to the wavefront. This eliminates any need for permanent storage at the expense of extra register file bandwidth and capacity.  4.4  Parallel Commit of Store Buffers  GPUDet commits the store buffers from all wavefronts into the global memory in the commit mode to allow wavefronts to communicate deterministically. We use the deterministic parallel commit algorithm used in CoreDet [7]. This algorithm tags the entries from the store buffer from a wavefront with a deterministic ID. This ID defines the commit order of this wavefront with respect to the other wavefronts. The wavefronts can attempt to commit the entries in their store buffers in parallel. The algorithm uses the deterministic ID to determine the final writer, in commit order, to each memory location, and guarantees that the location contains the value written by this wavefront after the commit mode. Current GPUs have noncoherent private caches. To avoid reading stale data GPUDet flushes the L1 data caches after the commit mode. Supporting cache coherency on GPUs [45] would eliminate this overhead.  29  4.4. Parallel Commit of Store Buffers While CoreDet implemented this algorithm in software using fine-grained locks, we recognized that this algorithm is analogous to the Z-Buffer algorithm that controls the visibility of overlapped polygons in graphics rendering. 
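As a concrete illustration of the per-wavefront store buffer that these commits operate on (Section 4.3), the following functional sketch shows the write log and the Bloom-filter-guarded load path. The entry layout (128-byte chunks, 128-bit byte mask) and the 1024-bit, four-hash Bloom filter follow the text; the hash functions and all identifiers are illustrative rather than the actual hardware design.

// A functional sketch of the per-wavefront store buffer of Section 4.3.
#include <bitset>
#include <cstdint>
#include <cstring>
#include <vector>

struct WriteLogEntry {
  uint64_t chunk_addr;             // 128-byte aligned destination address
  uint8_t  data[128];              // buffered (coalesced) store data
  std::bitset<128> byte_mask;      // which bytes of the chunk are valid
};

struct WavefrontStoreBuffer {
  std::vector<WriteLogEntry> write_log;   // linear log, appended in execution order
  std::bitset<1024> bloom;                // summarizes chunk addresses in the log

  static unsigned hash(uint64_t chunk, unsigned i) {        // illustrative hash function
    return static_cast<unsigned>(((chunk >> 7) * (2 * i + 17)) % 1024);
  }

  void buffer_store(uint64_t addr, const uint8_t (&bytes)[128],
                    const std::bitset<128>& mask) {
    WriteLogEntry e;
    e.chunk_addr = addr & ~uint64_t(127);
    std::memcpy(e.data, bytes, 128);
    e.byte_mask = mask;
    write_log.push_back(e);                                  // one entry per coalesced store
    for (unsigned i = 0; i < 4; ++i) bloom.set(hash(e.chunk_addr, i));
  }

  // A coalesced load first queries the Bloom filter; only a hit triggers the
  // linear search through the write log, a miss goes to the normal memory path.
  bool may_contain(uint64_t addr) const {
    uint64_t chunk = addr & ~uint64_t(127);
    for (unsigned i = 0; i < 4; ++i)
      if (!bloom.test(hash(chunk, i))) return false;
    return true;                                             // possible false positive
  }
};

Because the log is searched linearly, uncoalesced stores that inflate the number of entries directly slow down loads that hit in the Bloom filter, which is the effect observed for CL in Section 6.3.1.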
GPUDet adopts the Z-Buffer Unit for graphics rendering to implement a hardware accelerated version of the deterministic parallel commit algorithm. Currently, as far as we are aware no GPU vendors have exposed any instruction for using the Z-Buffer Unit directly in general purpose programming models like CUDA and OpenCL. We believe exposing this unit in general purpose programming model will not be expensive in terms of area or complexity.  4.4.1  Z-Buffer Unit  Z-Buffer algorithm is designed to control the visibility of overlapped 3D objects displayed on screen. In graphics rendering, 3D objects are represented by triangles that are transformed according to a given camera view to be displayed. Each pixel rendered from a triangle is assigned a depth value representing its logical order with respect to pixels from other triangles. These depth values are stored in a set of memory locations called the Z-Buffer. A specialized hardware unit in the GPU, which we call the Z-Buffer Unit, manages the Z-Buffer. Using the depth values in the Z-Buffer, the Z-Buffer Unit prevents overlapped triangles from updating the colors of the pixels that have been updated by a foreground triangle. The Z-Buffer Unit allows out-of-order writes to produce a deterministic result. Namely, each pixel on the screen displays the color of the foremost triangle covering that pixel regardless of the order of the triangle updates. There is little publicly-available information regarding the internal architecture of Z-Buffer Unit in current GPUs. The Z-Buffer Unit evaluated in this document is inspired by the details disclosed in a patent by ATI Technologies [49]. As shown in Figure 2.1, there is a Z-Buffer Unit in each memory partition, each responsible for servicing the color update requests to the memory locations managed by the partition. Each Z-Buffer Unit contains a request buffer for keeping the status of the incoming requests. Each request contains a 128-byte aligned address to the chunk of pixels it attempts to update, together with 32 color values and 32 depth values. It also has a 128-bit byte mask indicating the valid values in the color update request. A color update request first tries to retrieve the latest depth values for its pixels from a cache for the depth values, called the Z-cache. The depth values from the request are then compared against the retrieved depth 30  4.4. Parallel Commit of Store Buffers values. The Z-Buffer Unit then updates the color of the pixels that passes the comparison by sending write requests to the L2 cache, and updating the depth values in the Z-cache. A request that misses at the Z-cache allocates a block in the cache, locks the block and defers its comparison until the depth values are fetched. Requests hitting at a locked block are deferred similarly. The Z-Buffer Unit overlaps multiple color update requests to tolerate the Z-cache miss latency.  4.4.2  Deterministic Parallel Commit using Z-Buffer  To adopt the Z-Buffer Units for deterministic parallel commit of store buffers, GPUDet allocates a corresponding Z-Buffer for each writable global memory buffer. The allocation routine co-locates both depth and data of a memory location at the same memory partition. In commit mode, each wavefront publishes the entries in its store buffer by traversing through the linear write log. 
From each entry, it generates a color update request containing the buffered data, with the depth equal to the logical quantum ID concatenated with its hardware wavefront ID:

Depth value = wavefront_id − (quantum_id × 2^m)

Here, m is the number of bits needed to represent the maximum number of wavefronts that can run concurrently on the hardware. We include the logical quantum ID in the depth to ensure that data from an earlier quantum is always overwritten by memory updates from a later quantum. Without the logical quantum ID, the depth values for all writeable locations in global memory would have to be reset to the largest positive integer at every quantum boundary. With the quantum ID, GPUDet amortizes this overhead across a large number of quanta by resetting the Z-Buffer only when the ID overflows.
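A functional sketch of this depth encoding and of the resulting commit rule is shown below. It uses signed arithmetic so that a later quantum always yields a strictly smaller depth; the constant m = 10 is only an example value (15 compute units × 48 wavefronts per unit is below 2^10), and the code models the behaviour rather than the hardware bit layout.

// A functional sketch of the commit-order depth encoding (Section 4.4.2).
#include <cstdint>
#include <unordered_map>

constexpr int kWavefrontIdBits = 10;   // m: bits for the max number of concurrent wavefronts

int64_t depth_value(int64_t wavefront_id, int64_t quantum_id) {
  // Depth value = wavefront_id - (quantum_id * 2^m)
  return wavefront_id - (quantum_id << kWavefrontIdBits);
}

struct ZBufferCell {
  uint32_t color;   // the committed data word
  int64_t  depth;   // commit-order depth of the last accepted write
};

// The Z-test: a write is accepted if its depth is less than or equal to the stored
// depth, so a write from a later quantum always replaces data left by an earlier
// quantum, and the final value within a quantum does not depend on arrival order.
void z_commit(std::unordered_map<uint64_t, ZBufferCell>& zbuffer,
              uint64_t addr, uint32_t color, int64_t depth) {
  auto it = zbuffer.find(addr);
  if (it == zbuffer.end()) {
    zbuffer.emplace(addr, ZBufferCell{color, depth});   // unwritten: acts as the largest depth
  } else if (depth <= it->second.depth) {
    it->second = ZBufferCell{color, depth};
  }                                                     // otherwise the write is discarded
}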
4.4.3 Implementation Details of Z-Buffer Unit Architecture

In this section, we describe the architectural details of our Z-Buffer Unit design. We improved the accuracy of our simulations by incorporating these details into our GPU timing model. They are necessary for someone who wants to re-implement or improve our work in the future; however, reading this section is not required to understand the other parts of this document.

In our model, the Z-Buffer Unit is located on the memory partition side. The interconnection network transfers the compute units' memory requests to the corresponding memory partitions, and the Z-Buffer Unit is responsible for the color update requests arriving at its memory partition.

Figure 4.5: Interaction of the Z-Buffer Unit with a memory partition. Each memory partition contains a Z-Unit and a last-level (L2) cache bank, is attached to an off-chip DRAM channel, and is connected to the compute units through the interconnect.

The Z-Buffer Unit receives color update requests from different compute units concurrently. It first loads the current depth value of the address corresponding to the request and then compares it with the request's depth value. If the request's depth value is less than or equal to the current depth, it updates the depth and color values of the memory location. The Z-Buffer Unit overlaps memory requests to enhance performance; however, the load-and-update sequence has to be atomic. In other words, the Z-Buffer Unit needs to ensure that, as long as a color update request has not committed, no other request accesses the same memory location, so it needs a locking mechanism that prevents multiple requests from accessing the same location concurrently. On the other hand, because there is no consistency requirement between writes to different locations, the Z-Buffer Unit is able to overlap or even reorder requests to different addresses.

Figure 4.5 demonstrates how the Z-Buffer Unit interfaces with the interconnect and the other parts of a memory partition. The memory partition controller forwards the color update requests to the Z-Buffer Unit. This unit loads the depth value, performs the comparison, and updates the color and depth values if necessary. The depth and color values are stored in memory; the Z-Buffer Unit interacts with the L2 cache, and GPUDet does not introduce any modification to the interactions between the L2 cache and the DRAM. The Z-Buffer Unit is responsible for handling the atomicity of the read-modify-write operation for each color update request. For simplicity, we assume that there is a static binding between the depth and color addresses, and that the color and depth values are located in the same memory partition.

Figure 4.6: Architecture of the Z-Buffer Unit: a request buffer, a Z-Cache, a comparator buffer, a depth comparator, and an L2 cache interface. The circled numbers correspond to the steps described below.

Figure 4.6 shows the internal architecture of the Z-Buffer Unit. This unit consists of a buffer for tracking the status of incoming requests, a cache with lockable cache blocks for handling atomicity, a comparator to perform the depth comparison, and an L2 interface. Figure 4.7 shows a color update request. The request contains a 128-byte aligned address indicating the starting address, 32 color values, 32 depth values, and a byte mask. The mask indicates the valid addresses that are to be modified by this memory request; in the example of Figure 4.7, only the first two depth and color values are valid (addresses 0x10000000 and 0x10000004).

Figure 4.7: A color update request, consisting of a 128-byte aligned address (0x10000000), color values C0–C31, depth values D0–D31, and a byte mask in which only the first two elements are valid.

The coalescing unit generates these aligned accesses based on the memory accesses of each thread within a wavefront. Each thread generates a color update request and sends it to the coalescing unit. Based on the addresses of these requests, the coalescing unit may generate one or several coalesced memory access requests, each with a different mask, and sends them to the memory partitions through the interconnect.

The Z-Buffer Unit accepts one color update request per cycle if the request buffer is not full. Each request buffer entry has two status bits that indicate the status of the request: invalid, ready, or blocked. Ready specifies that a request is ready to be scheduled, and blocked specifies that it is waiting for previous requests to the same memory location to finish. The GPUDet Z-Buffer Unit model uses a set-associative cache, called the Z-Cache, for keeping the depth values. Cache blocks of this cache can be locked; a locked cache block cannot be read by other requests and cannot be evicted from the cache. The block size of the Z-Cache is 128 bytes, equal to the size of the depth values of one request.

The Z-Buffer Unit stores a color update request in a free request buffer entry (1) and changes the status of the entry to ready. A scheduler selects one ready request and probes the request address in the Z-Cache (2) to obtain the current depth value. If the cache block has been locked by another request, the request buffer entry status is changed to blocked and the entry is de-scheduled until the Z-Cache entry is released. If the Z-Cache entry status is valid, the depth value is read and stored into the comparator buffer (3), and the status of the cache block is changed to locked to prevent other requests from accessing the same location. The comparator unit selects one entry of the comparator buffer in each cycle and compares its depth values with the request's depth values (4); it compares up to 32 pairs of depths (based on the mask) per color update request. The result of the comparison is stored in a separate mask called the result mask. If the depth value of the request is less than the current depth value, the color value is updated and the depth value keeps the smaller number. When an update is required, the Z-Buffer Unit sends a memory request with the result mask to the L2 cache (5), the depth value is updated in the Z-Cache (6), and the status of the cache block is changed back to valid. All blocked requests to that block in the request buffer are then released and become available for scheduling again: whenever an entry in the comparator buffer is released, a controller checks the addresses of the entries in the request buffer and changes any blocked entry with a matching address back to ready. This is the scenario when the depth value already exists in the Z-Cache. On a miss in the Z-Cache, a cache entry is allocated for that address and a miss request is sent to the L2 cache (7). The status of the newly allocated block in the Z-Cache is set to locked to prevent other requests from accessing the same cache block. Furthermore, the Z-Buffer Unit allocates an entry in the comparator buffer; this allocation eliminates a circular dependency and prevents deadlock. The L2 cache fills the depth values into the allocated comparator buffer entry, and in this case the depth values are stored into the Z-Cache after the comparison (6). Miss accesses in the Z-Cache may cause a write-back due to eviction; however, a cache block cannot be evicted while it is in the locked state. Finally, when both the depth and color values have been updated, the request buffer releases the allocated entry and sends an acknowledgement to the corresponding compute unit through the interconnect (8).
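The numbered steps above can be summarized by the following simplified, non-cycle-accurate model. Request buffer scheduling, L2 and DRAM traffic, and write-backs are omitted, and all identifiers are ours; the numbers in the comments refer to the steps in the text.

// A simplified functional model of Z-Buffer Unit request handling (Section 4.4.3).
#include <array>
#include <cstdint>
#include <limits>
#include <unordered_map>

struct ColorUpdateRequest {                 // one coalesced request (Figure 4.7)
  uint64_t addr;                            // 128-byte aligned chunk address
  std::array<uint32_t, 32> color;
  std::array<int64_t, 32>  depth;
  uint32_t mask;                            // which of the 32 lanes are valid
};

struct ZCacheLine {
  std::array<int64_t, 32> depth;
  bool locked = false;
};

bool process_request(ColorUpdateRequest& req,
                     std::unordered_map<uint64_t, ZCacheLine>& zcache,
                     std::unordered_map<uint64_t, std::array<uint32_t, 32>>& color_mem) {
  auto it = zcache.find(req.addr);                    // (2) probe the Z-Cache
  if (it == zcache.end()) {                           // (7) model a miss as a freshly reset line
    ZCacheLine fresh;
    fresh.depth.fill(std::numeric_limits<int64_t>::max());
    it = zcache.emplace(req.addr, fresh).first;
  }
  ZCacheLine& line = it->second;
  if (line.locked) return false;                      // request becomes blocked, retried later
  line.locked = true;                                 // (3) lock the block for read-modify-write

  uint32_t result_mask = 0;
  for (int lane = 0; lane < 32; ++lane) {             // (4) up to 32 depth comparisons
    if (!(req.mask & (1u << lane))) continue;
    if (req.depth[lane] <= line.depth[lane]) {
      line.depth[lane] = req.depth[lane];             // (6) keep the smaller depth
      result_mask |= (1u << lane);
    }
  }
  auto& dst = color_mem[req.addr];
  for (int lane = 0; lane < 32; ++lane)               // (5) color write, filtered by result mask
    if (result_mask & (1u << lane)) dst[lane] = req.color[lane];

  line.locked = false;                                // unblocks deferred requests
  return true;                                        // (8) request buffer entry is released
}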
4.5 Compute Unit Level Serialization

In serial mode, the wavefronts execute atomic operations in a deterministic serial order, with the operations updating the global memory directly. Each wavefront executes only one atomic operation in serial mode. Each atomic operation can take hundreds of cycles to execute in the GPU's throughput-optimized memory subsystem. Our evaluation in Section 6.2.3 shows that this naive, wavefront-level serialization (W-Ser) significantly slows down applications that use atomic operations.

Figure 4.8: Serial mode in GPUDet. (a) Serializing execution of all atomic operations in GPUDet-base. (b) Overlapping the execution of the atomic operations from each compute unit in GPUDet.

To recapture parallelism, GPUDet exploits the point-to-point ordering guaranteed by our GPU memory model, and common in network-on-chip designs [50], to overlap the execution of atomic operations within a compute unit (CU-Ser). As long as the interconnection network guarantees point-to-point ordering, atomic operations from the same compute unit to the same memory partition will arrive in the original order. With little if any potential performance gain at stake, the memory partitions do not reorder accesses to the same memory location. By resetting the wavefront scheduler in each compute unit at the start of serial mode, GPUDet forces the unit to issue its atomic operations in a deterministic order. This order is preserved by the point-to-point ordering in the GPU memory subsystem. On the other hand, since there is no guaranteed ordering between requests from different compute units, GPUDet has to serialize memory operations from different compute units.
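The following sketch summarizes this serial-mode ordering. The two helper functions merely stand in for the memory subsystem, and all identifiers are illustrative.

// A sketch of serial-mode ordering under CU-Ser (Section 4.5); names are ours.
#include <cstddef>
#include <vector>

struct WavefrontState {
  int  slot_id;
  bool has_pending_atomic;
};

void issue_atomic(std::size_t cu, int slot_id) { /* model: inject request into interconnect */ }
void drain_acknowledgements(std::size_t cu)    { /* model: wait for this unit's replies */ }

// Compute units take turns in a fixed order (serialization across units), but the
// wavefronts inside the active unit issue their atomics back to back without waiting
// for one another: point-to-point ordering in the interconnect keeps same-partition
// requests from one unit in issue order, so the outcome is still deterministic.
void serial_mode(std::vector<std::vector<WavefrontState>>& compute_units) {
  for (std::size_t cu = 0; cu < compute_units.size(); ++cu) {
    // the wavefront scheduler is reset, so issue order always starts from slot 0
    for (const WavefrontState& w : compute_units[cu]) {
      if (w.has_pending_atomic) issue_atomic(cu, w.slot_id);   // overlapped within the unit
    }
    drain_acknowledgements(cu);   // the next compute unit starts only after this one finishes
  }
}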
Figure 4.8 demonstrates how this optimization reduces serialization overhead in GPUDet by eliminating serialization within each compute unit. In Figure 4.8(a), the atomic operations from all workgroup are executed serially. In Figure 4.8(b), each compute unit overlaps execution of their wavefronts in serial phase, so the serialization overhead is reduced by a factor of  36  4.6. Limitations number of wavefronts in the compute unit.  4.6  Limitations  The current design of GPUDet does not enforce determinism in the use local memory (or shared memory in NVIDIA terminology). Local memory accesses update the on-chip scratch-pad memory directly and rely on the application to remove data-races via workgroup barriers. The wavefront scheduler at each compute unit can be modified to issue wavefronts in deterministic order. This modified scheduler will provide deterministic execution for local memory accesses even in the presence of data-race. We leave proper support for deterministic local memory accesses as future work. Since the current Z-buffer design binds each 32-bit color pixel in memory with a depth value, the Z-buffer-based parallel commit algorithm is not directly applicable to applications with byte-granularity writes. An intermediate solution, without modifying the Z-Buffer Unit, is to commit (only) the store buffer entries with byte-granularity writes in the serial mode. We leave an evaluation of this as future work. Note that the per-wavefront store buffer is already designed to support byte-granularity accesses. Each store buffer entry contains a 128-bit mask indicating the modified bytes, and the Bloom filter is only responsible for identifying accesses to the same 128-byte chunk.  37  Chapter 5  Methodology We extended GPGPU-Sim 3.0.2 [6] to model a nondeterministic GPU architecture by randomizing the interconnect injection order of requests from different compute units. In this chapter we introduce our benchmarks, describe our simulator configuration and discuss our methodology to analyze which of our proposed hardware are performance bottlenecks and needed to be optimized.  5.1  Benchmarks  We evaluate the performance impact of GPUDet on a set of CUDA/OpenCL benchmarks (listed in Table 5.1) from Rodinia[15], Bakhoda et al.[6] and Fung et al.[22]. We ran each benchmark to completion. We do not exclude any benchmarks because of poor performance. We do exclude one benchmark that assumes workgroups are spawned in ascending ID order, and five benchmarks that contain hard-to-eliminate byte-granularity writes. We include a version of cloth simulation (CLopt) with a GPU-optimized work distribution scheme that transforms the non-coalesced memory accesses into coalesced accesses. This optimized version performs 30% faster than the original one on the nondeterministic baseline architecture. On GPUDet, this optimization significantly reduces the number of entries in the perwavefront store buffers, lowering the overhead of each write-log search. It also generates fewer write-log searches by eliminating the aliased read-write accesses within a wavefront accessing different bytes in the same 128-bytes. These two effects cause CLopt to perform significantly better than CL on GPUDet (Section 6.3.1). We also include the version of BFS graph traversal with data-races (BFSr) from Section 1.1 as well as a corrected version (BFSf) from Rodinia [15]. Both BFSr and BFSf are modified to use 32-bit boolean flags to eliminate byte-granularity writes. 
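As an illustration of this modification (the actual benchmark code differs; the kernel below and its CSR graph layout are hypothetical), widening the flags turns every flag update into a 4-byte store, avoiding the byte-granularity writes discussed in Section 4.6.

// Hypothetical CUDA fragment illustrating the flag-widening change applied to BFSr/BFSf.
__global__ void bfs_expand_frontier(const int* row_offsets, const int* edges,
                                    const int* frontier, int frontier_size,
                                    const int* visited,          /* was: const bool* */
                                    int* next_frontier_flags) {  /* was: bool* */
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= frontier_size) return;
  int node = frontier[tid];
  for (int e = row_offsets[node]; e < row_offsets[node + 1]; ++e) {
    int neighbor = edges[e];
    if (!visited[neighbor]) {
      next_frontier_flags[neighbor] = 1;   // 32-bit store instead of a one-byte store
    }
  }
}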
We used GPU-Racey (described in Section 2.2) to verify our nondeterministic extension to GPGPU-Sim, and to verify that our model of GPUDet can provide deterministic execution under this nondeterministic simulation framework. 38  5.2. Simulation Configuration  Table 5.1: Benchmarks Name Without Atomic Operations AES Cryptography [6] BFS Graph Traversal (with Data-Race) [6] BFS Graph Traversal (Race-Free) [15] Computational Fluid Dynamics Solver [15] Coulumb Potential [6] HotSpot [15] LIBOR [6] 3D Laplace Solver [6] Speckle Reducing Anisotropic Diffusion [15] With Atomic Operations Cloth Simulation [22] Cloth Simulation (Optimized) Hash Table [22] Bank Account [22]  5.2  Abbr. AES BFSr BFSf CFD CP HOTSP LIB LPS SRAD CL CLopt HT ATM  Simulation Configuration  Our modified GPGPU-Sim is configured to model a Geforce GTX 480 (Fermi) GPU [37] with the configuration parameters distributed with GPGPU-Sim 3.0.2. In this configuration, the GPU has 15 compute units and 6 memory partitions. Each compute unit can sustain up to 1536 threads sharing a 128kB register file, and it has a 48kB L1 data cache and a 16kB scratch-pad memory buffer. Each wavefronts has 32 threads. The compute units access the 768kB L2 unified cache at the memory partitions via two unidirectional crossbars, providing a peak bandwidth of 268GB/s per direction. The off-chip GDDR5 DRAM channels can deliver an aggregate bandwidth of 177GB/s. Table 5.2 summarize major configuration parameters in our simulator. In our default GPUDet configuration, each Z-Buffer Unit runs at 650MHz, has a 16kB Z-cache, and a 16 entry request buffer. The per-wavefront store buffer uses a 1024-bit bloom filter, implemented with a Parallel Bloom Filter [42] with 4 sub-arrays each indexed with a different hash function. This default configuration assumes that each global barrier between parallel, commit, and serial modes in GPUDet takes zero cycles. Section 6.3.4 investigates the sensitivity of GPUDet’s performance to various Z-cache sizes, smaller store buffer bloom filters, and higher global barrier latencies. Our modified version of GPGPU-Sim and the benchmarks are available online [1].  39  5.3. Overall Methodology: Finding Performance Bottleneck  Table 5.2: GPGPU-Sim Configuration # Compute Units 15 Wavefront Size 32 SIMD Pipeline Width 16 Number of Threads / Unit 1536 Number of Registers / Unit 32768 Branch Divergence Method PDOM [23] Warp Scheduling Policy Loose Round Robin Shared Memory / Unit 16KB Constant Cache Size / Unit 8KB Texture Cache Size / Unit 5KB, 32B line, 20-way assoc. L1 Data Cache / Unit 48KB, 128B line, 6-way assoc. L2 Unified Cache 128KB/Mem. Part., 128B line, 8-way assoc. Interconnect Topology 1 Crossbar/Direction Interconnect BW 32 (Bytes/Cycle) (160GB/dir.) Interconnect Latency 5 Cycle (Interconnect Clock) Compute Unit Clock 1400 MHz Interconnect Clock 700 MHz Memory Clock 900 MHz # Memory Partitions 6 DRAM Req. Queue Size 32 Memory Controller Out-of-Order (FR-FCFS) GDDR3 Memory Timing tCL =12 tW L =4 tRP =12 tRC =40 tRAS =28 tRCD =12 tRRD =6 tCDLR =5 tW R =12 Memory Channel BW 32 (Bytes/Cycle) Min. L2/DRAM Latency 220 Cycle (Compute Core Clock) Z-Unit Z-Unit Clock Z-Cache Request Buffer Size  650 MHz 16kB, 128B line, 4-way assoc. 16  Store Buffer Bloom Filter Implementation Parallel Bloom Filter [42] Capacity 1024 bits # Hash Functions 4  5.3  Overall Methodology: Finding Performance Bottleneck  In this section we describe our methodology to find performance bottlenecks in our design. 
We believe this methodology is applicable in any performance tuning application and provides a valuable insight for designer about the performance bottlenecks of a system. The optimization techniques described in Section 4.2 are inspired by this method. We have used a GPU performance visualization tool, AerialVision [3], interfaced with the GPGPU-Sim simulator to capture and visualize the dynamic behavior of a GPU architecture throughout an application run. Aeri40  5.3. Overall Methodology: Finding Performance Bottleneck alVision provides a time-lapse view for visualizing the global runtime behavior of CUDA applications on a many-core architecture. This feature helps hardware and software developers identify sources of dynamic and intermittent inefficiency. We use AerialVision to identify the sources of performance overhead in the parallel mode. We provide the graphical output of different versions of GPUDet (Section 6.2.1), to illustrate the efficiency of each optimization technique (Section 4.2). Figure 5.1, Figure 5.2, Figure 5.3, and Figure 5.4 show the graphical output of AerialVision for the AES and BFSr benchmarks for different configurations. In these figures, the Y-axis represents 15 different compute units and the X-axis shows the time. Darker color indicates larger IPC for a compute unit at each time period (500 cycles). In other words, AerialVision counts the number of executed wavefront instructions by each compute unit in a time period, then divides this number to the length of the period to find average IPC. The white color indicates whether the compute unit is idle or is in the commit or serial mode. Vertical lines illustrate the start of each quantum. Figure 5.1 shows the IPC for AES benchmark for the GPUDet-base configuration when the quantumsize = 1000. As it can be observed, the average length of quanta is much shorter than 1000 (around 35). We believe the reason for the short quantum size is the frequent workgroup synchronizations exist in this benchmark. This large number of short running quanta introduces a large number of global synchronizations and performance overhead as well. Workgroup Aware Quantum Formation (Section 4.2) allows GPUDet to continue execution in parallel mode when all wavefronts of a workgroup reach a workgroup barrier. As Figure 5.2 confirms, this technique substantially reduces the number of quanta in the applications with large number of workgroup barriers. Figure 5.3 and Figure 5.4 are other examples that show how we used ArialVision to find the performance bottleneck. By visualizing the IPC for BFSr benchmark with GPUDet-base configuration (Figure 5.3) One can directly observe that the main performance overhead in parallel mode is caused by load imbalance among different workgroups. We investigated the reason for this load imbalance and found that it is caused by finishing short running kernels during parallel mode. In other words, since the kernel length is variant among different threads, some workgroups finish the execution of a kernel much earlier than the other ones. Since GPUDet-base does not allow hardware to issue a new workgroup in the middle of parallel mode, the compute unit will remain underutilized until the end of the quantum. By issuing a new workgroup using the Deterministic Workgroup Partitioning 41  5.3. Overall Methodology: Finding Performance Bottleneck  Figure 5.1: IPC of different compute units in GPUDet-Base for AES.  Figure 5.2: IPC of different compute units in GPUDet for AES.  42  5.3. 
Overall Methodology: Finding Performance Bottleneck technique (Section 4.2.2), compute units are allowed to spawn a new workgroup without waiting for a new quantum. This technique helps GPUDet balance the workload in parallel mode and reduces the performance overhead.

To summarize, our methodology for improving the performance of GPUDet often consisted of the following steps:
1. Find an application with substantial performance overhead under GPUDet.
2. Visualize a range of performance metrics using AerialVision.
3. Identify the main reason for the performance overhead using the visualized metrics and the benchmark code itself.
4. Propose a technique to overcome the recognized problem.
5. Implement the technique and explore the result.
6. Verify the technique by testing and visualization.

The workgroup-aware quantum formation and deterministic workgroup partitioning optimizations were developed using the above-mentioned method.

Figure 5.3: IPC of different compute units in GPUDet-Base for BFS.

Figure 5.4: IPC of different compute units in GPUDet for BFS.

Chapter 6

Experimental Results

This chapter provides a comprehensive discussion of our experimental results. First, we evaluate the overall performance of GPUDet over all benchmarks. Then, we evaluate each optimization technique in parallel mode (Section 4.2) individually. We compare our Z-Buffer algorithm for committing the store buffers with the lock-based mechanism of previous work, and we also assess our proposed compute unit serialization technique. Finally, we provide a broad sensitivity analysis for quantum size, Z-Buffer Unit configuration, Bloom filter size, and global synchronization overhead.

6.1 Overall Performance

To evaluate our system, we compare the execution time of the benchmarks on a baseline nondeterministic architecture (NONDET) and the optimized version of GPUDet. We configure the quantum size to 200 instructions. We increment the instruction count whenever a wavefront executes an instruction, regardless of how many threads in the wavefront execute that instruction. Figure 6.1 shows the total execution time of each application with GPUDet, normalized to the execution time on the nondeterministic architecture. Our deterministic model causes about a 105% performance penalty on average. The execution time of each application is broken down into time spent in parallel, commit and serial modes. Wavefronts spend the most time in parallel mode. We discuss the sources of performance overhead in parallel mode below. As discussed in Section 4.5, GPUDet only serializes the execution of atomic operations: applications without atomic operations (all except CLopt, HT, ATM) skip serial mode entirely.

Figure 6.1: Breakdown of execution cycles. Normalized to NONDET execution time.

We have found that some applications (CP, AES) perform slightly better with deterministic execution. A deeper inspection reveals that our workgroup distribution algorithm (Section 4.2.2) results in a more even distribution of workgroups compared to the baseline architecture. The baseline architecture tries to find the first free hardware slot to spawn a workgroup. This mechanism can cause uneven distribution at the very end of a kernel launch when all workgroups of one compute unit finish their executions within a short period.
In this case, all remaining workgroups are assigned to the sole available compute unit. When other compute units become available, no workgroups are left to be assigned, and the compute units are underutilized (a similar observation was noted by Bakhoda et al [6]). For these applications, GPUDet’s deterministic workgroup distribution will distribute workgroups more evenly, resulting in a small speedup.  6.2  Impact of GPUDet Optimizations  This section provides data to evaluate our optimization techniques discussed in Section 4.  6.2.1  Quantum Formation  To assess our proposed optimization techniques for quantum formation (Section 4.2) we implemented three versions of GPUDet shown in Figure 6.2. In the GPUDet-base configuration all optimizations are disabled. In the GPUDet-WA(BTA) the workgroup barrier optimization technique (Section 4.2.1) is enabled, allowing workgroups to synchronize without a quantum boundary. Finally, the GPUDet-WA(BTA+DWP) version additionally allows new workgroups to be spawned in parallel mode by deterministically partitioning of the workgroups (DWP) among workgroup slots (Section 4.2.2). Figure 6.2 shows that the barrier termination avoidance technique decreases the execution time by 4% on average over GPUDet-base. As ex46  5 GPUDet-base GPUDet-WA(BTA) GPUDet-WA(BTA+DWP)  4 3 2  AVG  CLopt  ATM  HT  SRAD  LPS  LIB  HOTSP  CP  CFD  BFSf  0  BFSr  1 AES  Normalized Execution Time  6.2. Impact of GPUDet Optimizations  Figure 6.2: Performance impact of Barrier Termination Avoidance (BTA) and Deterministic Workgroup Partitioning (DWP) techniques. Bars Normalized to NONDET execution time. pected, this improvement lies mostly with benchmarks that have frequent synchronization barriers (AES, HOTSP, LPS, SRAD). Figure 4.2 shows that encountering workgroup barriers is the dominant cause of quantum termination for these applications. Our experiments confirm that activating the barrier termination avoidance technique (GPUDet-WA(BTA)) forms 3.8× fewer quanta in these four benchmarks, improving their performance by 20% on average. Figure 6.2 shows that applications with small kernel functions (BFSr, BFSf, CFD) benefit from spawning workgroups in parallel mode. Figure 4.2 confirms that most of the quanta in BFSr, BFSf and CFD are terminated by reaching the end of the kernel. The ability to start new workgroups deterministically in parallel mode (GPUDet-WA(BTA+DWP)) speeds up these applications by 19% on average.  6.2.2  Parallel Commit using Z-Buffer Unit  To evaluate the Z-Buffer Unit parallel commit algorithm, we implemented a lock-based version of committing the store buffer. The lock-based version simulates the software-based deterministic parallel commit algorithm proposed in CoreDet [7]. The algorithm locks the chunk of memory that corresponds to the store buffer entry address using atomic operations and 47  Commit Mode  AES  CFD  LIB  HT  Locking  Locking  Z-Buffer Buffer  Z-Buffer Buffer  Locking  Locking  SRAD  Z-Buffer Buffer  Locking  LPS  Z-Buffer Buffer  Z-Buffer Buffer  Locking  Locking  HOTSP  Z-Buffer Buffer  Locking  CP  Z-Buffer Buffer  Z-Buffer Buffer  Locking  Locking  BFSf  Z-Buffer Buffer  Locking  BFSr  Z-Buffer Buffer  Locking  Parallel & Serial Modes  Z-Buffer Buffer  10 8 6 4 2 0  Z-Buffer Buffer  Normalized Execution Time  6.2. 
Impact of GPUDet Optimizations  ATM Clopt  Figure 6.3: Execution time comparison of committing the store buffer between the Z-Buffer Unit parallel commit and the lock-based algorithms performs the logical order priority comparisons and global memory updates in a mutually exclusive section. Although it has been reported that atomic operations contending for the same memory location can be up to 8.4× slower than non-atomic store operations in the Fermi architecture [46], in our evaluation we do not model this extra slowdown for the additional atomic operations required by the lock-based version. Figure 6.3 shows the normalized execution time for both of the store buffer committing algorithms. The Z-Buffer Unit commit algorithm improves performance by 60% on average. Since spinning for a lock in compute units generates significantly more global memory accesses, it is expected that applications with more accesses (e.g. HT) achieve more performance improvement by exploiting the Z-Buffer Unit commit algorithm. Figure 6.3 shows that the execution time of commit mode is decreased by 2.3× using the Z-Buffer Unit for the HT benchmark.  6.2.3  Serial Mode Optimization  Figure 6.4 evaluates the performance overhead of serial mode in GPUDet. Since only atomic operations are serialized, we omit benchmarks without atomic operations. W-Ser (wavefront level serialization) is the GPUDetbase configuration which serializes execution of all atomic operations. In CU-Ser (compute unit serialization), serialization is only performed among compute units by executing the atomic operations of a single compute unit in parallel. CU-Ser decreases overhead of serial mode by 6.1× for these applications. 48  CLopt  HT  CU-Ser  W-Ser  W-Ser  CU-Ser  Serial Mode Parallel&Commit Modes  CU-Ser  14 12 10 8 6 4 2 0  W-Ser  Normalize Execution Time  6.3. Sensitivity Analysis  ATM  Figure 6.4: Execution time comparison between wavefront level (W-Ser) and compute unit level (CU-Ser) serialization of atomic operations (Section 4.5). Normalized to NONDET execution time  6.3  Sensitivity Analysis  This section explores the performance impact of varying different GPUDet design parameters.  6.3.1  Quantum Size and Store Buffer Overhead  In this section we evaluate the effect of quantum size on performance. For better insight, we separate out the time spent in store buffer operations. Store buffer operations entail appending entries to store buffers for stores, and traversing store buffers for replying to loads. Figure 6.5 shows the execution time of GPUDet with quantum sizes of 100, 200, 600 and 1000 instructions. Increasing quantum size has three kinds of effects on applications. For applications that have few stores to global memory (AES, CP, HOTSP and LIB) larger quanta improves performance. By increasing quantum size, we reduce the number of quantum boundaries, each of which involves expensive global synchronization. For applications with frequent accesses to global memory (BFSr and BFSf), performance is degraded with increasing quantum size, because larger quanta lead to larger store buffers. Because GPUDet uses a linear write log traversal, load instructions hitting in store buffer must perform longer 49  6.3. 
Sensitivity Analysis  Serial Mode Commit Mode Write Log Overhead Parallel Mode  8 6 4  0  AES  BFSrc  BFSrf  CFD  CP  HOTSP  LIB  LPS  SRAD  HT  ATM  CLopt  AVG  100 200 600 1000  2  100 200 600 1000 100 200 600 1000 100 200 600 1000 100 200 600 1000 100 200 600 1000 100 200 600 1000 100 200 600 1000 100 200 600 1000 100 200 600 1000 100 200 600 1000 100 200 600 1000 100 200 600 1000 100 200 600 1000  Normalized Execution Time  10  CL  Figure 6.5: Sensitivity of GPUDet to quantum size. Execution time is normalized to NONDET. The unoptimized version of CL is not included in the average.  searches. Increasing the quantum size also has a side effect on the BFS benchmark. Since this benchmark is not work optimal [31] (it may do each task several times), the number of executed instructions varies in different configurations. In this application the number of redundant tasks depends on how frequently the execution results of each thread become visible to the other threads. Enlarging quantum size causes results to be visible less frequently, BFS runs more redundant tasks which decreases performance. Our experiments reveal that the number of executed instructions increases by 26% when the quantum size increases from 200 to 1000. For the CL benchmark, the performance overhead of the linear write log is highly affected by the size of the quantum. We realized that in the Integrator, WriteBack and Driver Solver kernels [13] of this benchmark work is distributed block-wise among the threads, so wavefronts generate uncoalesced memory accesses. As described in Section 4.3, uncoalesced accesses generate many extra entries in the store buffer. Furthermore, due to Bloom filter aliasing, uncoalesced accesses result in unnecessary log traversals for load operations (Section 6.3.2). By using an interleaved distribution of work among the threads, we eliminated uncoalesced memory accesses in the CLopt application. Since eliminating uncoalesced accesses to the global memory is profitable for general GPU applications and it does not need substantial changes to CL source code, we have only included the optimized version of 50  6.3. Sensitivity Analysis CL in our overall average. As discussed in Section 4.2 quanta may be terminated because of termination reasons other than instruction count. Adjusting quanta size does not affect the performance of applications (ATM, HT) in which quanta are usually terminated for other reasons like atomic operations (figure 4.2).  6.3.2  Bloom Filter  8 128 bits 512 bits Ideal BF  6 4  256 bits 1024 bits  CL  AVG  CLopt  ATM  HT  SRAD  LPS  CFD  0  BFSf  2 BFSr  Normalized Execution Time in Parallel Mode  Figure 6.6 evaluates the effect of Bloom filter size on performance. Since the Bloom filter configuration does not affect commit and serial modes, we present only the execution time of parallel mode. Figure 6.6 shows the execution time of the GPUDet architecture with 128, 256, 512 and 1024-bit Bloom filters, and an ideal Bloom filter which has no false positives. We only present the execution time of individual applications that use the store buffer frequently, but the average is for all our workloads. Figure 6.6 shows  Figure 6.6: Execution cycles in parallel mode for various sizes of Bloom filter, normalized to NONDET execution. The AVG bar shows the average among all our benchmarks. that for most of the applications (except BFSr and CL) a 256-bit Bloom filter performs as well as ideal. 
Increasing Bloom filter size for BFSr reduces execution time because of a high number of store buffer entries in each quantum. For CL, because of the large number of uncoalesced accesses, the store buffer contains many 128B entries with sparsely-populated data portions. The large number of entries causes many loads to perform unnecessary log traversals. 51  6.3. Sensitivity Analysis  6.3.3  Z-Buffer Unit  We assess the sensitivity of GPUDet to the size of the Z-cache (Section 4.4.1) by adjusting the size of cache for each memory partition from 8KB to 128KB. As expected, execution time decreases with increasing Z-cache size. However, since wavefronts spend most of their execution time in parallel and serial modes, the execution time decreases by less than 3% when the Zcache is enlarged from 8KB to 128KB.  6.3.4  Global Synchronization Overhead  GPUDet needs global synchronization between each execution mode. We analysed the sensitivity of GPUDet to the cost of global synchronization. Our data illustrates that the overall performance degradation with a 100 cycles synchronization overhead is negligible. However increasing the overhead to 1000 cycles reduces overall performance by 22% over the instantaneous synchronization model.  52  6.3. Sensitivity Analysis  Normalized Execution Time  4 3.5 3 2.5 2 1.5 1 0.5  2048  1024  512  256  128  64  32  16  8  4  2  1  0  # of Cycles Figure 6.7: Execution time for various number of cycles, dedicated for each global synchronization, normalized to NONDET execution. The bar shows the average among all our benchmarks.  53  Chapter 7  Related Work The most closed-related work to GPUDet are proposals for providing deterministic execution of general CPU programs [5, 7, 8, 17, 26, 35, 40]. These schemes provide determinism for general multithreaded code on commodity [5, 7, 8, 35, 40] or modified general-purpose multicore CPUs [17, 26]. A naive translation of these schemes to a GPU architecture would be highly inefficient due to the massive numbers of threads involved. For example, the global barriers inherent in many deterministic execution schemes ([5, 8, 17, 26, 35]) would not scale well with 1000s of threads. Kendo’s [40] more scalable approach provides determinism for data-race-free programs only, which would scale better in a GPU environment but also reneges on many of the debugging benefits of other determinism techniques that can handle data races. There has also been extensive work on deterministic programming languages ([10, 11, 14, 19, 41, 47]). Programs written in these languages are deterministic by construction, and incur little runtime overhead during execution at the expense of a restricted programming model. SHIM [19] is a deterministic version of MPI that provides determinism for messagepassing programs. NESL [10] and Data Parallel Haskell [14] are pure functional languages that support data-parallel operations which are deterministic and implicitly parallel. Jade [41] is an imperative language that relies on programmer-supplied annotations to extract parallelism automatically while preserving sequential (and deterministic) semantics. StreamIt [47] is a language for streaming computations that enforces determinism by restricting communication between parallel stream kernels to occur only via FIFOs, though a later version of StreamIt allowed for out-of-band control messages that could flow both upstream and downstream without breaking determinism [48]. 
Deterministic Parallel Java [11] is a version of Java augmented with a type-and-effect system that proves non-interference between parallel forkjoin tasks. None of these languages are directly applicable to current GPU programming models, which support a rich set of synchronization primitives via atomic operations on global memory. A number of projects have looked at improving GPU programmability 54  Chapter 7. Related Work through hardware or software mechanisms. Kilo TM [22] showed that transactional memory can be incorporated into the GPU programming model at low cost. GRace [52] proposed a race detector for GPU programs, achieving low overhead via a combination of static analysis and runtime instrumentation. Boyer and Skadron [12] describe a GPU emulator that detects race conditions and bank conflicts for CUDA programs, though the emulation overheads are quite high compared to native execution. The PUG system [32] and GPUVerify [9] leverage the compact nature of GPU kernels to perform static race detection, occasionally requiring annotations from programmers to avoid false positives.  55  Chapter 8  Future Work We believe GPUDet is just a starting point for exploration of opportunities to provide an easier to use environment for GPU architectures. Our ultimate goal is to prove that deterministic execution is worth being supported on massively parallel architectures. By reducing the performance overhead, exploring other types of determinism, and demonstrating the necessity of deterministic execution, we believe hardware companies will provide the support for our proposal in their future hardware. This section classifies the potential future work into two major categories. First, it discusses the opportunities to improve the performance (GPUDet). Then, we provide a brief discussion about other types of determinism.  8.1  Improving Performance of GPUDet  Performance overhead of GPUDet might discourage implementation. Even for CPU architectures, it might not be acceptable to tolerate the proposed amount of performance overhead in favor of deterministic execution. Since the high performance computing has been the main motivation for GPGPU applications, performance overhead is less tolerable in GPU world. Although we believe providing an easier programming environment via deterministic hardware will enable increased programmer productivity sufficient to justify the cost of GPUDet, it would be best if one could reduce the overhead of GPUDet further. Here, we discuss some potential ways to reduce performance overhead of different modes in GPUDet and to eliminate the limitations discussed in Section 4.6.  8.1.1  Improving Performance of Atomic Operations by Taking Advantage of Relaxed Memory Ordering  Atomic operations cause performance overhead in two different ways. The first and more obvious effect of atomic operations is in serial mode (Section 4.5) in which all atomic operations are enforced to be executed serially 56  8.1. Improving Performance of GPUDet among different compute units. In a massively parallel processor such as a GPU, serialization impairs performance. The second effect of these operations on the performance stems from global synchronization. As discussed in Section 4.2, GPUDet finishes execution of a wavefront when reaching atomic operations. This keeps the size of a quantum limited by the number of instructions between two consecutive atomic operations regardless of the maximum quantum size. 
In a program with frequent atomic operations, deterministic execution enforces a large number of global synchronizations due to the large number of short running quanta. These additional global synchronizations reduce the performance due to the load imbalance among threads and the synchronization overhead itself (Section 4.2). To reduce unnecessary quantum termination by reaching atomic operations, Devietti et al. proposed RCDC [18] for CPU architecture. RCDC leverages data-race-free memory models (e.g., the models for Java and C++) and relaxes memory ordering to improve performance without sacrificing deterministic execution. Exploring a GPU friendly algorithm to implement a similar approach on GPU architectures would be an appropriate way to increase the performance of applications with frequent atomic operations (Section 6.2.3)  8.1.2  Deterministic Local Memory  GPUDet does not support deterministic execution in the presence of data races in local memory (Section 4.6). Local memory is a scratch-pad memory that is visible to all threads of a workgroup. In current GPU applications local memory is used to capture locality of data among threads of a wavefront. Also, it helps to improve memory bandwidth by eliminating unnecessary memory accesses and enforcing coalesced memory accesses to the global memory. A workgroup usually loads a portion of data in its local memory, then computes the kernel output. Finally it writes back the results to the global memory ideally, through coalesced stores. In this scenario a race condition through local memory is possible. By extending determinism to the local memory scope, GPUDet can provide determinism for a broader range of applications. The naive solution to provide deterministic execution for the local memory is to apply the same store buffer mechanism (Section 4.3) to this memory. Considering the extra store operations required by this mechanism, the performance penalty would not be acceptable for local memory operations (this memory supposed to be very fast). Also the naive solution needs extra scratch-pad memory for store buffers which is very expensive in term of 57  8.1. Improving Performance of GPUDet area. The other challenge is the lack of the Z-Buffer Unit in the compute units in current hardware. So the commit would need to be done through a very slow lock-based mechanism (Section 6.2.2) The better way to support determinism for local memory is to enforce deterministic scheduling among the wavefronts of a workgroup. A deterministic scheduler generated a deterministic order of local memory access requests and since these requests will arrive in-order to the scratch-pad memory, it generates deterministic results. The downside of deterministic scheduling is the performance overhead caused by a reduction of ready to schedule wavefronts. We believe a well designed scheduler which considers the contents of the cache and exploits the non-deterministic scheduling opportunity among the workgroups can provide deterministic local memory without sacrificing the performance.  8.1.3  Overlapping the Parallel and Commit Mode of Different Wavefronts  To enforce determinism GPUDet needs to synchronize all wavefronts after parallel and commit mode (in absence of atomic operation and serial mode). These synchronizations not only cause performance overhead directly (Section 4.2) but also indirectly reduce performance by creating an imbalance in memory system bandwidth usage. 
In parallel mode, all wavefronts append their stores onto store buffers that are cached by the L1 cache, and they use global memory mostly for load operations, so the memory bandwidth may not be fully exploited in parallel mode. The commit mode, on the other hand, consists mostly of memory write operations (Z-Buffer Unit requests), which are memory bound and limited by the available bandwidth. The key insight here is that by overlapping the commit mode and the parallel mode of different wavefronts, GPUDet could balance the memory load and improve performance. This might be enabled by some major changes to the hardware. One idea is to let a wavefront start the next quantum without waiting for the commit mode of all other wavefronts; this would not violate determinism if the wavefront ensures that its input will not be updated by any wavefront in the previous quantum. We believe implementing this idea would improve performance considerably by reducing the number of synchronizations and balancing the bandwidth usage. However, a correct implementation requires very careful analysis and is hard to verify. Notice that this idea could also improve the performance of deterministic CPU hardware proposals [7, 26].

8.2 Exploring Other Types of Determinism

Different types of determinism are discussed in Section 2.3.1. GPUDet tries to exploit the benefits of architectural determinism without sacrificing performance. However, different types of determinism have trade-offs, and each of them is useful for different purposes (Section 2.3.2). Exploring different forms of determinism on many-core architectures is necessary to provide insight into the most suitable form for each application.

8.3 Compiler and Runtime Support for Determinism

GPUDet is a hardware proposal for enforcing determinism on GPU architectures. Since no hardware vendor has yet provided this support in a GPU, programmers are not able to exploit determinism for real applications. Another interesting direction for continuing this work is to provide a software (compiler and runtime) solution that achieves determinism on the current generation of GPU hardware. Similar mechanisms have been explored for CPUs [7, 40]. We anticipate that providing this environment will involve many elaborate implementation details due to the SIMD structure of GPUs.

Chapter 9

Conclusion

Nondeterminism in parallel architectures hampers programmers' productivity significantly, as bugs can no longer be reproduced easily. We believe this non-reproducibility problem presents a key challenge to GPU software development, discouraging the use of GPUs in a broader range of applications. In this document we presented GPUDet, a proposal for supporting deterministic execution on GPUs. GPUDet exploits deterministic aspects of the GPU architecture to regain performance. Specifically, it uses the inherent determinism of the SIMD hardware in GPUs to provide deterministic interaction among threads within a wavefront. This amortizes the complexity of the store buffers required to isolate the execution of each wavefront and works seamlessly with the existing SIMT execution model in GPUs. GPUDet uses a workgroup-aware quantum formation scheme that allows wavefronts in parallel mode to coordinate via workgroup barriers and to accept work from a deterministic workgroup distributor. GPUDet also extends the Z-Buffer Unit, an existing GPU hardware unit for graphics rendering, to deterministically commit store buffers in parallel.
Finally, GPUDet eliminates the serialization required among atomic operations from the same compute unit by exploiting the implicit point-to-point ordering in the GPU memory subsystem. Our simulation results indicate that these optimizations allow GPUDet to perform comparably to a nondeterministic baseline, despite running GPU kernels with thousands of threads. Our characterization of the sources of overhead for deterministic execution on GPUs provides insights for further optimizations.

Bibliography

[1] http://www.ece.ubc.ca/~aamodt/GPUDet.
[2] AMD. White Paper — AMD Graphics Cores Next (GCN) Architecture, June 2012.
[3] Aaron Ariel, Wilson W. L. Fung, Andrew E. Turner, and Tor M. Aamodt. Visualizing complex dynamics in many-core accelerator architectures. In IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), pages 164–174, 2010.
[4] D.C. Arnold et al. Stack Trace Analysis for Large Scale Debugging. In IPDPS, 2007.
[5] Amittai Aviram, Shu-Chun Weng, Sen Hu, and Bryan Ford. Efficient system-enforced deterministic parallelism. In OSDI, 2010.
[6] Ali Bakhoda et al. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In ISPASS, 2009.
[7] Tom Bergan, Owen Anderson, Joseph Devietti, Luis Ceze, and Dan Grossman. CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution. In ASPLOS, 2010.
[8] Tom Bergan, Nicholas Hunt, Luis Ceze, and Steven D. Gribble. Deterministic Process Groups in dOS. In OSDI, 2010.
[9] Adam Betts, Nathan Chong, Alastair F. Donaldson, Shaz Qadeer, and Paul Thomson. GPUVerify: a verifier for GPU kernels. In Proceedings of the 27th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA '12). ACM, 2012.
[10] Guy Blelloch. NESL: A Nested Data-Parallel Language (Version 3.1). Technical report, Carnegie Mellon University, Pittsburgh, PA, 2007.
[11] Robert L. Bocchino, Jr., Vikram S. Adve, Danny Dig, Sarita V. Adve, Stephen Heumann, Rakesh Komuravelli, Jeffrey Overbey, Patrick Simmons, Hyojin Sung, and Mohsen Vakilian. A Type and Effect System for Deterministic Parallel Java. In OOPSLA, 2009.
[12] M. Boyer, K. Skadron, and W. Weimer. Automated Dynamic Analysis of CUDA Programs. In Third Workshop on Software Tools for MultiCore Systems, 2008.
[13] Andrew Brownsword. Cloth in OpenCL, 2009.
[14] Manuel M. T. Chakravarty, Roman Leshchinskiy, Simon Peyton Jones, Gabriele Keller, and Simon Marlow. Data Parallel Haskell: A Status Report. In DAMP, 2007.
[15] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A Benchmark Suite for Heterogeneous Computing. In IISWC, 2009.
[16] Brett W. Coon et al. United States Patent #7,353,369: System and Method for Managing Divergent Threads in a SIMD Architecture (Assignee NVIDIA Corp.), April 2008.
[17] Joseph Devietti, Brandon Lucia, Luis Ceze, and Mark Oskin. DMP: Deterministic Shared Memory Multiprocessing. In ASPLOS, 2009.
[18] Joseph Devietti, Jacob Nelson, Tom Bergan, Luis Ceze, and Dan Grossman. RCDC: A Relaxed Consistency Deterministic Computer. In ASPLOS, 2011.
[19] Stephen A. Edwards and Olivier Tardieu. SHIM: A Deterministic Model for Heterogeneous Embedded Systems. In EMSOFT, 2005.
[20] Perry A. Emrath and David A. Padua. Automatic detection of nondeterminacy in parallel programs. In Proceedings of the 1988 ACM SIGPLAN and SIGOPS Workshop on Parallel and Distributed Debugging, PADD '88, pages 89–99, New York, NY, USA, 1988. ACM.
[21] Matteo Frigo, Pablo Halpern, Charles E. Leiserson, and Stephen Lewin-Berlin. Reducers and other Cilk++ hyperobjects. In Proceedings of the Twenty-First Annual Symposium on Parallelism in Algorithms and Architectures, SPAA '09, pages 79–90, New York, NY, USA, 2009. ACM.
[22] Wilson W. L. Fung, Inderpreet Singh, Andrew Brownsword, and Tor M. Aamodt. Hardware Transactional Memory for GPU Architectures. In MICRO-44, 2011.
[23] Wilson Fung et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. In MICRO, 2007.
[24] Pawan Harish and P. J. Narayanan. Accelerating Large Graph Algorithms on the GPU Using CUDA. In HiPC, 2007.
[25] Mark Hill and Min Xu. http://www.cs.wisc.edu/~markhill/racey.html, 2009.
[26] D.R. Hower, P. Dudnik, M.D. Hill, and D.A. Wood. Calvin: Deterministic or Not? Free Will to Choose. In HPCA, 2011.
[27] Hadi Jooybar, Wilson W.L. Fung, Mike O'Connor, Joseph Devietti, and Tor M. Aamodt. GPUDet: A Deterministic GPU Architecture. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '13, pages 1–12, New York, NY, USA, 2013. ACM.
[28] Khronos Group. OpenCL. http://www.khronos.org/opencl/.
[29] Samuli Laine and Tero Karras. High-Performance Software Rasterization on GPUs. In HPG, 2011.
[30] Lars Nyland, John R. Nickolls, Gentaro Hirota, and Tanmoy Mandal. United States Patent #8,086,806: Systems and methods for coalescing memory accesses of parallel threads (Assignee NVIDIA Corp.), April 2011.
[31] Charles E. Leiserson and Tao B. Schardl. A Work-Efficient Parallel Breadth-First Search Algorithm (or How to Cope with the Nondeterminism of Reducers). In SPAA, 2010.
[32] Guodong Li and Ganesh Gopalakrishnan. Scalable SMT-Based Verification of GPU Kernel Functions. In FSE, 2010.
[33] E. Lindholm et al. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, 2008.
[34] Jamie Liu, Benjamin Jaiyen, Richard Veras, and Onur Mutlu. RAIDR: Retention-Aware Intelligent DRAM Refresh. In ISCA, 2012.
[35] Tongping Liu, Charlie Curtsinger, and Emery D. Berger. DTHREADS: Efficient Deterministic Multithreading. In SOSP, 2011.
[36] Shan Lu, Soyeon Park, Eunsoo Seo, and Yuanyuan Zhou. Learning from mistakes: a comprehensive study on real world concurrency bug characteristics. SIGPLAN Not., 43(3):329–339, March 2008.
[37] NVIDIA. NVIDIA's Next Generation CUDA Compute Architecture: Fermi, October 2009.
[38] NVIDIA Corp. NVIDIA CUDA Programming Guide v3.1, 2010.
[39] NVIDIA Corp. NVML API Reference Manual v3.295.45, 2012.
[40] Marek Olszewski, Jason Ansel, and Saman Amarasinghe. Kendo: Efficient deterministic multithreading in software. In ASPLOS, 2009.
[41] Martin C. Rinard and Monica S. Lam. The design, implementation, and evaluation of Jade. ACM Trans. Program. Lang. Syst., 20(3), May 1998.
[42] Daniel Sanchez, Luke Yen, Mark D. Hill, and Karthikeyan Sankaralingam. Implementing Signatures for Transactional Memory. In MICRO, 2007.
[43] Smruti R. Sarangi, Brian Greskamp, and Josep Torrellas. CADRE: Cycle-Accurate Deterministic Replay for Hardware Debugging. In DSN, 2006.
[44] Cedomir Segulja and Tarek S. Abdelrahman. Architectural support for synchronization-free deterministic parallel programming. In IEEE 18th International Symposium on High Performance Computer Architecture (HPCA), pages 1–12, 2012.
[45] Inderpreet Singh, Arrvindh Shriraman, Wilson W. L. Fung, Mike O'Connor, and Tor M. Aamodt. Cache Coherence for GPU Architectures. In HPCA, 2013.
[46] Jeff A. Stuart and John D. Owens. Efficient Synchronization Primitives for GPUs. CoRR, abs/1110.4623, 2011.
[47] William Thies, Michal Karczmarek, and Saman P. Amarasinghe. StreamIt: A Language for Streaming Applications. In CC '02, 2002.
[48] William Thies, Michal Karczmarek, Janis Sermulins, Rodric Rabbah, and Saman P. Amarasinghe. Teleport Messaging for Distributed Stream Programs. In PPoPP, 2005.
[49] Timothy J. Van Hook. United States Patent #6,630,933: Method and Apparatus for Compression and Decompression of Z Data (Assignee ATI Technologies Inc.), October 2003.
[50] S.R. Vangal et al. An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS. IEEE Journal of Solid-State Circuits, 43(1):29–41, Jan. 2008.
[51] H. Wong et al. Demystifying GPU microarchitecture through microbenchmarking. In ISPASS, 2010.
[52] Mai Zheng, Vignesh T. Ravi, Feng Qin, and Gagan Agrawal. GRace: A Low-Overhead Mechanism for Detecting Data Races in GPU Programs. In PPoPP, 2011.
