
UBC Theses and Dissertations

Dynamic warp formation: exploiting thread scheduling for efficient MIMD control flow on SIMD graphics hardware. Fung, Wilson Wai Lun. 2008-09-18


Full Text


Dynamic Warp Formation: Exploiting Thread Scheduling for Efficient MIMD Control Flow on SIMD Graphics Hardware

by

Wilson Wai Lun Fung
B.A.Sc., The University of British Columbia, 2006

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Applied Science in The Faculty of Graduate Studies (Electrical and Computer Engineering)

The University of British Columbia (Vancouver)
September, 2008

© Wilson Wai Lun Fung 2008

Abstract

Recent advances in graphics processing units (GPUs) have resulted in massively parallel hardware that is easily programmable and widely available in commodity desktop computer systems. GPUs typically use single-instruction, multiple-data (SIMD) pipelines to achieve high performance with minimal overhead for control hardware. Scalar threads running the same computing kernel are grouped together into SIMD batches, sometimes referred to as warps. While SIMD is ideally suited for simple programs, recent GPUs include control flow instructions in the GPU instruction set architecture, and programs using these instructions may experience reduced performance due to the way branch execution is supported by hardware. One solution is to add a stack to allow different SIMD processing elements to execute distinct program paths after a branch instruction. The occurrence of diverging branch outcomes for different processing elements significantly degrades performance using this approach. In this thesis, we propose dynamic warp formation and scheduling, a mechanism for more efficient SIMD branch execution on GPUs. It dynamically regroups threads into new warps on the fly following the occurrence of diverging branch outcomes. We show that a realistic hardware implementation of this mechanism improves performance by an average of 47% for an estimated area increase of 8%.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Organization
2 Background
  2.1 Fundamental Concepts
    2.1.1 Thread-Level Parallelism
    2.1.2 Data-Level Parallelism
    2.1.3 Single-Instruction, Multiple-Data (SIMD)
    2.1.4 Multiple-Instruction, Multiple-Data
    2.1.5 Fine-Grained Multithreading
  2.2 Compute Model
  2.3 SIMD GPU Microarchitecture
  2.4 Latency Hiding
  2.5 SIMD Execution of Scalar Threads
  2.6 Summary
3 SIMD Control Flow Support
  3.1 SIMD Serialization
  3.2 SIMD Reconvergence
  3.3 Reconvergence Point Limit Study
  3.4 Summary
4 Dynamic Warp Formation and Scheduling
  4.1 Register File Access
  4.2 Hardware Implementation
    4.2.1 Warp Pool
  4.3 Scheduling Policies
  4.4 A Majority Scheduling Policy Implementation
    4.4.1 Warp Insertion
    4.4.2 Warp Issue
    4.4.3 Complexity
  4.5 Summary
5 Methodology
  5.1 Software Design of GPGPU-Sim, a Cycle Accurate GPGPU Simulator
    5.1.1 Shader Core
    5.1.2 Interconnection Network
    5.1.3 DRAM Access Model
    5.1.4 Interfacing with sim-outorder
  5.2 Baseline Configuration
  5.3 Benchmarks
6 Experimental Results
  6.1 Effects of Scheduling Policies
  6.2 Detail Analysis of Majority Scheduling Policy Performance
  6.3 Effect of Limited Resources in Max-Heap
  6.4 Effect of Lane Aware Scheduling
  6.5 Effect of Cache Bank Conflict
  6.6 Sensitivity to SIMD Warp Size
  6.7 Warp Pool Occupancy and Max Heap Size
7 Area Estimation
8 Related Work
  8.1 SIMD Control Flow Handling
    8.1.1 Guarded Instruction/Predication
    8.1.2 Control Flow Reconvergence Mechanisms
    8.1.3 Conditional Streams
  8.2 Dynamic Grouping SIMD Mechanisms
    8.2.1 Dynamic Regrouping of SPMD Threads for SMT Processors
    8.2.2 Liquid SIMD
  8.3 Eliminating the Existence of Branch Divergence
    8.3.1 Complex SIMD Branch Instruction
    8.3.2 Vector-Thread Architecture
9 Conclusions and Future Work
  9.1 Summary and Conclusions
  9.2 Contributions
  9.3 Future Work
    9.3.1 Better Warp Scheduling Policies
    9.3.2 Area Efficient Implementation of the Warp Pool
    9.3.3 Bank Conflict Elimination
    9.3.4 Improvements to GPGPU-Sim
References

List of Tables

3.1 Operational rules of the stack for reconvergence mechanism.
5.1 Relationships between constraint counters and DRAM commands.
5.2 Hardware Configuration
5.3 Benchmark Description
6.1 Memory bandwidth utilization.
6.2 Cache miss rates (pending hits classified as a miss).
6.3 Cache miss rates without pending hits.
6.4 Maximum warp pool occupancy and max-heap size
7.1 Area estimation for dynamic warp formation and scheduling.
7.2 CACTI parameters for estimating structure sizes in Table 7.1

List of Figures

1.1 Floating-Point Operations per Second for the CPU and GPU
1.2 Performance loss due to branching when executing scalar SPMD threads using SIMD hardware (idealized memory system).
2.1 Latency hiding with fine-grained multithreading.
2.2 Baseline GPU microarchitecture.
2.3 Detail of a shader core.
3.1 Implementation of immediate post-dominator based reconvergence.
3.2 Performance loss for PDOM versus SIMD warp size.
3.3 Contrived example for reconvergence point limit study.
3.4 Impact of predicting optimal SIMD branch reconvergence points.
4.1 Dynamic warp formation example.
4.2 Register file configurations.
4.3 Implementation of dynamic warp formation and scheduling.
4.4 Implementation of the warp pool.
4.5 Implementation of majority scheduling policy.
5.1 Overview of GPGPU-Sim
5.2 Software design of pipeline stages in the shader core model.
5.3 DRAM access model.
5.4 DRAM organization overview and simplified DRAM bank state diagram.
6.1 Performance comparison of NREC, PDOM, and DWF versus MIMD.
6.2 Comparison of warp scheduling policies.
6.3 Warp size distribution.
6.4 Dynamic behaviour of Black-Scholes using DWF with Majority scheduling policy.
6.5 Dynamic behaviour of Black-Scholes using DWF with PDOM Priority scheduling policy.
6.6 Impact of resource limit on Max-Heap.
6.7 Impact of lane aware scheduling.
6.8 Performance of PDOM, DWF and MIMD with cache bank conflict.
6.9 Effect of SIMD warp size.
6.10 Cumulative distribution of warp pool occupancy

Acknowledgements

The work presented in this thesis would not have been possible without the help of several individuals and organizations. First, I would like to thank my supervisor, Professor Tor Aamodt, for his guidance and support for the last two years. I would also like to thank the members of my M.A.Sc. thesis committee, Professor Guy Lemieux and Professor Steve Wilton, for their valuable feedback.

I am indebted to my colleagues in the computer architecture group, Ivan Sham, George Yuan, Henry Wong, Xi Chen, and Ali Bakhoda, for the fun and insightful discussions we had, and for their precious time spent reviewing my thesis (and the conference paper that it is based on). I owe Professor Konrad Walus and the Microsystems and Nanotechnology (MiNa) Research Group (especially Nicholas Geraedts) a special thanks for sharing their computing cluster with us for data collection.

My peers in the Department of Electrical and Computer Engineering at the University of British Columbia have been helpful and entertaining to work with, and they made my experience in graduate school an unforgettable one. In particular, I would like to thank Derek Ho, David Mak, Larix Lee, Samer Al-Kiswany, Abdullah H. Gharaibeh, Elizeu Santos-Neto, David Boen and Sunny Yuen.

I am very grateful to my family. My parents have been fully supportive of my decision to pursue a graduate degree. The timely encouragement from my mother has been heart-warming. My father's interest in electronics and computers inspired me to pursue a career in this field.
Despite being a physician himself, he taught me a great deal about computers and essentially shaped my very first impression of computers in my childhood.

Financial support for this research was provided through a Natural Sciences and Engineering Research Council (NSERC) of Canada "CGS M" award.

Chapter 1

Introduction

The computation potential of a single chip improves as semiconductor process technology continues to scale and transistor density increases exponentially following "Moore's Law" [48]. Leveraging process technology scaling to improve the performance of real world applications has always been a goal in computer architecture, but has become increasingly challenging as power limitations restrict clock frequency scaling [52]. Now, perhaps more than before, hardware must exploit parallelism efficiently to improve performance.

Single-instruction, multiple-data (SIMD) instruction scheduling is a technique for exploiting data level parallelism efficiently by amortizing data-independent control hardware across multiple processing elements. Multiple data elements are grouped and processed in "lock-step" in parallel with a single instruction. This hardware simplification, however, restricts these processing elements to uniform dynamic control flow behavior. A data dependent branch, in which each processing element resolves to a different outcome, generates a hazard, known as branch divergence [67], on the SIMD hardware. Handling this hazard usually involves serializing the execution of these grouped processing elements, lowering the utilization of the SIMD hardware. For some control-flow intensive applications, this performance penalty outweighs the benefit of using SIMD hardware.

Nevertheless, SIMD has gained popularity in many applications that require little control flow flexibility. Most of these applications, such as 3D rendering and digital signal processing [36, 38, 42, 53, 71], deal with large data sets and require frequent access to off-chip memory with long access latency. Specialized processors for these applications, such as graphics processing units (GPUs), often use fine-grained multithreading to proactively hide these long access latencies. However, as these specialized processors offer more and more programmability to meet the demands of their users, branch divergence on SIMD hardware becomes a major performance bottleneck.

This thesis proposes and evaluates a novel hardware mechanism, dynamic warp formation, for improving the performance of control-flow intensive applications on a SIMD architecture with fine-grained multithreading. While the proposed mechanism is described in this thesis in the context of GPU microarchitecture, it is equally applicable to any microarchitecture that uses SIMD and fine-grained multithreading [13, 23, 25].

The rest of this chapter describes the motivation and background for this thesis, the methodology employed, and the contributions of this thesis, and finally summarizes the thesis's organization.

1.1 Motivation

Until recently, the dominant approach for exploiting parallelism has been to extract more instruction level parallelism (ILP) from a single thread through increasingly complex scheduling logic and larger caches. As diminishing returns to ILP now limit the performance of single thread applications [1], attention has shifted towards using additional resources to increase throughput by exploiting explicit thread level parallelism in software.
In contrast to instruction level parallelism, which mainly relies on hardware instruction scheduling logic for improving performance, thread level parallelism is explicitly defined in the software as threads (sections of code that can be executed in parallel), and the hardware simply provides support for executing these threads in parallel to improve performance. The simplest way to do so is to have multiple copies of the same processor on a chip, an approach known as a chip multiprocessor (CMP). This forces software developers to share the responsibility for improving performance, but saves significant effort in hardware design verification while potentially yielding a greater performance gain in comparison to providing additional cache, for example.

The modern graphics processing unit (GPU), a hardware accelerator for 3D rendering widely available on commodity computer systems, can be viewed as an example of this throughput oriented approach [9, 40, 61]. Earlier generations of GPUs consisted of fixed function 3D rendering pipelines. This required new hardware to enable new real-time rendering techniques, which impeded the adoption of new graphics algorithms and thus motivated the introduction of programmability, long available in traditional offline computer animation [77], into GPU hardware for real-time computer graphics. In modern GPUs, much of the formerly hardwired pipeline is replaced with programmable hardware processors that run a relatively small shader program on each input vertex or pixel [40]. Shader programs are either written by the application developer or substituted by the graphics driver to implement traditional fixed-function graphics pipeline operations.

[Figure 1.1: Floating-Point Operations per Second for the CPU and GPU]

The compute model provided by modern graphics processors for running non-graphics workloads is closely related to that of stream processors [18, 63].

The programmability of shader hardware has greatly improved over the past decade, and the shader processors of the latest generation GPUs are Turing-complete. Together with the impressive theoretical computation power of GPUs when compared to conventional CMPs (see Figure 1.1), this opens up exciting new opportunities to speed up "general purpose" (i.e., non-graphics) applications. Based upon experience gained from pioneering efforts to generalize the usage of GPU hardware [9, 61], GPU vendors have introduced new programming models and associated hardware support to further broaden the class of applications that may efficiently use GPU hardware [3, 58].

Even with a general-purpose programming interface, mapping existing applications to the parallel architecture of a GPU is a non-trivial task. Although some applications can achieve speedups of up to 431 times over a modern CPU [26, 27], other applications, while successfully parallelized on different hardware platforms, show little improvement when mapped to a GPU [7]. One major challenge for contemporary GPU architectures is efficiently handling control flow in shader programs [67]. The reason is that, in an effort to improve computation density, modern GPUs typically batch together groups of individual threads running the same shader program, and execute them together in lock step on a SIMD pipeline [43, 45, 47, 49, 58]. Such thread batches are referred to as warps¹ by NVIDIA [58, 67]. This approach has worked well [37] for graphics operations such as texturing [6, 11], which historically have not required branch instructions.

[Figure 1.2: Performance loss due to branching when executing scalar SPMD threads using SIMD hardware (idealized memory system).²]
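To make the cost of non-uniform branch outcomes concrete, here is a toy cost model (illustrative only; this is not the thesis's simulator, and the warp width and path lengths are invented). Under serialization, a SIMD warp that diverges at a branch must step through both sides of the branch, masking off inactive lanes, while an equally wide MIMD machine lets each thread follow only its own path:

```python
# Toy model of SIMD branch divergence (illustrative; not from the thesis).
# A warp of WIDTH threads executes in lockstep. On a divergent branch the
# hardware serializes: it runs the taken path with one active mask, then
# the not-taken path with the complementary mask.

WIDTH = 4

def simd_cycles(conditions, len_taken, len_not_taken):
    """Cycles for one warp to execute a branch under serialization,
    given each thread's branch outcome."""
    cycles = 0
    if any(conditions):          # some lane takes the branch
        cycles += len_taken      # whole warp steps through it, others masked
    if not all(conditions):      # some lane falls through
        cycles += len_not_taken
    return cycles

def mimd_cycles(conditions, len_taken, len_not_taken):
    """A MIMD machine with WIDTH PEs runs each thread's own path only."""
    per_thread = [len_taken if c else len_not_taken for c in conditions]
    return max(per_thread)       # all paths proceed concurrently

# Uniform outcome: no divergence, SIMD matches MIMD.
print(simd_cycles([True] * WIDTH, 10, 10), mimd_cycles([True] * WIDTH, 10, 10))
# -> 10 10
# Divergent outcome: SIMD pays for both paths.
print(simd_cycles([True, False, True, False], 10, 10),
      mimd_cycles([True, False, True, False], 10, 10))
# -> 20 10
```

With nested divergent branches the serialized cost compounds, which is why control-flow intensive kernels can fall well short of the MIMD bound shown in Figure 1.2.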
However, when shader programs do include branches, the execution of different threads grouped into a warp to run on the SIMD pipeline may no longer be uniform across SIMD elements. This causes a hazard in the SIMD pipeline [49, 80] known as branch divergence [43, 67]. We found that naïve handling of branch divergence incurs a significant performance penalty on the GPU for control-flow intensive applications relative to an ideal multiple-instruction, multiple-data (MIMD) architecture with the same peak IPC capability (see Figure 1.2).

¹ In the textile industry, the term "warp" refers to "the threads stretched lengthwise in a loom to be crossed by the weft" [21].
² NREC SIMD and PDOM SIMD are described in Chapter 3, while DWF SIMD is described in Chapter 4. The microarchitecture and benchmarks are described in Chapter 2 and Chapter 5 respectively.

1.2 Contributions

This thesis makes the following contributions:

1. It quantifies the performance gap between the immediate post-dominator branch reconvergence mechanism and the performance that would be obtained on a MIMD architecture with support for the same peak number of operations per cycle, highlighting the importance of finding better branch handling mechanisms.

2. It proposes and evaluates a novel hardware mechanism, dynamic warp formation, for regrouping processing elements of individual SIMD warps on a cycle-by-cycle basis to greatly improve the efficiency of branch handling.

3. It highlights quantitatively that the warp scheduling policy (the order in which warps are issued from the scheduler) is integral to both the performance and area overhead of dynamic warp formation, and proposes an area efficient implementation of a well-performing scheduling policy.

4. It proposes and evaluates a detailed hardware implementation of dynamic warp formation and scheduling.

5. It provides an extensive simulation infrastructure for enabling future research on GPU architectures optimized to support non-graphics applications.

In particular, for a set of data parallel, non-graphics applications ported to our modern GPU-like SIMD streaming processor architecture, we find that reconverging diverging threads within a SIMD warp at the immediate post-dominator of the diverging branch obtains a speedup of 45% over not reconverging. Furthermore, dynamically regrouping scalar threads into SIMD warps on a cycle by cycle basis increases the speedup further to 114% (a 47% speedup versus reconverging at the immediate post-dominator). We estimate that the hardware required by this regrouping mechanism adds 8% to the total chip area.

1.3 Organization

The rest of the thesis is organized as follows:

- Chapter 2 provides an overview of the fundamental concepts and the baseline GPU microarchitecture that this thesis builds on.
- Chapter 3 describes the immediate post-dominator control-flow reconvergence mechanism, which represents a baseline equivalent to the performance of contemporary SIMD control-flow handling mechanisms.
- Chapter 4 describes our proposed dynamic warp formation mechanism, an improvement to the baseline. This is the main contribution of the work.
- Chapter 5 describes the simulation methodology of the proposed GPU microarchitecture, including a detailed description of GPGPU-Sim, a novel GPU microarchitecture simulator.
- Chapter 6 describes our experimental results.
- Chapter 7 gives an estimation of the area overhead for implementing dynamic warp formation.
- Chapter 8 describes related work.
- Chapter 9 summarizes this thesis and suggests future work.

Chapter 2

Background

This chapter provides an overview of the fundamental concepts, the compute model, and the baseline SIMD GPU microarchitecture used for the rest of this thesis.
This baseline is representative of contemporary GPUs used for accelerating non-graphics applications.

2.1 Fundamental Concepts

This section discusses the fundamental concepts and techniques that this thesis builds on. In particular, we describe several techniques fundamental to the contributions of this thesis: thread level parallelism, data level parallelism, single-instruction multiple-data, multiple-instruction multiple-data, and fine-grained multithreading.

2.1.1 Thread-Level Parallelism

Thread-level parallelism (TLP) refers to the parallelism that is specified explicitly by the software developer as threads in an application [76]. These threads run concurrently in the system, and each thread is expected to progress independently with a register set of its own. In some systems, these threads can be explicitly synchronized through message passing or locks³ and barriers⁴. TLP can be exploited by adding extra processors to a system so that more threads execute in parallel (provided that there are enough threads). However, as mentioned later in Section 2.1.5, TLP can also be exploited as a way to increase efficiency in a system with high memory access latency.

³ Mutually exclusive access to data shared among threads [16].
⁴ Global synchronization among a set of threads: none get past the barrier until all threads within the set have arrived [16].

2.1.2 Data-Level Parallelism

Data-level parallelism (DLP) exists in an application when a large pool of data is processed in a regular fashion, where each element of the output data is only dependent on a few elements from the input data pool [56]. A classic example of such an application is matrix multiplication, in which each element in the destination matrix is calculated by a sum of products of elements from the source matrices. As the outcome of each element does not depend on the values of other output elements, multiple elements in the destination matrix can be calculated in parallel.
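The matrix-multiplication example can be sketched as a set of independent element tasks (an illustrative sketch, not from the thesis; the function names are invented):

```python
# Illustrative DLP sketch (not from the thesis): in C = A x B, each output
# element depends only on one row of A and one column of B, so every
# (i, j) task below is independent and could run on a separate PE.

def matmul_element(A, B, i, j):
    """One output element: the dot product of row i of A and column j of B."""
    return sum(A[i][k] * B[k][j] for k in range(len(B)))

def matmul(A, B):
    n, m = len(A), len(B[0])
    # A DLP-capable machine could evaluate all n*m element tasks at once;
    # here they simply run one after another.
    return [[matmul_element(A, B, i, j) for j in range(m)] for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # -> [[19, 22], [43, 50]]
```

Because every element task runs the identical sequence of operations on different data, this workload also maps naturally onto the SIMD hardware discussed next.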
Data-level parallelism can be exploited by adding extra processors and distributing the available work to these processors. This allows data-level parallelism to scale easily as the data set increases: the larger the data set, the more work is readily distributable to more processors [56]. In contrast to thread-level parallelism, which assigns each processing unit a unique thread, and instruction-level parallelism, which exploits independent operations within a single instruction stream [24], the regularity of DLP allows it to be exploited in a much more efficient manner, as discussed next.

2.1.3 Single-Instruction, Multiple-Data (SIMD)

SIMD has its roots in vector processors used for scientific computation. Vector processors are effective for applications which involve repeating a set of identical operations across a large amount of data (e.g., vector multiplication). As these repeated operations are independent of each other, they may be executed in parallel on different data. Vector processors exploit this DLP efficiently with an instruction set architecture (ISA) that operates on vectors of data. This allows control hardware to be shared by multiple processing elements, greatly reducing the hardware cost required to scale up the width of parallel operations. All that is required is to attach extra ALUs to the control signal bus driven by a common control logic and provide them with data. At a minimum, this saves the instruction bandwidth and control logic otherwise required for adding extra processors to exploit the same amount of DLP. However, by sharing the control hardware, the processing elements are restricted to execute in lockstep, i.e., they all execute the same instruction and advance to the next instruction at the same time.

Today, SIMD hardware exists in commodity desktop computers in two main forms: special
purpose accelerators and short-vector instruction set architecture (ISA) extensions.

Short-vector ISA extensions refer to the SIMD instruction set extensions for general purpose CPUs, such as the Streaming SIMD Extensions (SSE) for the x86 architecture [73] and AltiVec for the POWER architecture [44]. These SIMD instruction set extensions operate on short vector registers (64-bit or 128-bit) that hold multiple data elements that will be operated on with a single SIMD instruction. Multiple SIMD instructions are needed to process vectors that exceed the length of a single vector register. While these ISA extensions allow general purpose CPUs to exploit DLP in a limited way, the main focus of general purpose CPUs remains the fast execution of serial programs.

Historically, special purpose accelerators are hardware that is less tightly integrated with a CPU and is designed to assist the CPU in specific applications. For example, graphics processing units (GPUs) are dedicated processors specialized for 3D rendering. In GPUs, SIMD is used to exploit the DLP nature of 3D rendering: each pixel or vertex can be processed independently (see Chapter 2 for more detail). Other special purpose accelerators, such as digital signal processors (DSPs), also use SIMD to exploit DLP in their application domains [71].

2.1.4 Multiple-Instruction, Multiple-Data

In this thesis, the performance of various SIMD control flow techniques is often compared to that of a multiple-instruction, multiple-data (MIMD) architecture. According to Flynn's taxonomy [20], any architecture that is not constrained by the lockstep execution of processing elements as in SIMD is classified as a MIMD architecture. With this definition, any multiprocessor architecture is a MIMD architecture. In the context of this thesis, a MIMD architecture refers to an architecture with the same configuration as our baseline SIMD architecture in Chapter 2, except that the processing elements inside a shader core are free to execute any instruction. These processing elements still share a common scheduler, which can schedule any available thread to any free processing element for execution.

2.1.5 Fine-Grained Multithreading

Fine-grained multithreading (FGMT) is a technique allowing multiple threads to interleave their execution on one or more hardware processors.

This technique differs from the traditional multitasking (time sharing) provided by an operating system on a uniprocessor CPU. In a FGMT processor, switching to another thread requires neither flushing the pipeline nor offloading architectural registers to the main memory. Instructions from multiple different processes co-exist in the pipeline of each single hardware processor at the same time, and the context (program counter and registers) of each thread is kept in the register file and stays there until the thread terminates. Every cycle, the hardware scheduler selects and issues an instruction from one of the threads sharing the processor. The details of thread selection are hidden from the operating system, so a single processor with FGMT appears as multiple single threaded processors.

FGMT was first proposed on the CDC 6600 as "barrel processing", a flexible interface to allow a single central processor to interface with multiple slower peripheral processors [75]. The Heterogeneous Element Processor (HEP) computer [31] later employed FGMT as a technique to provide a scalable interface for multi-threaded programs. In a HEP computer, each Process Execution Module (PEM) can hold up to 128 threads. These threads communicate with each other through the main memory, so that multi-threaded programs written for a single PEM can also be executed by multiple PEMs sharing the same memory space.
The threads are expected to operate on different data, so that their instructions can exist in different stages of the PEM's pipeline at the same time without causing dependency hazards. In this way, instructions from different threads can be aggressively scheduled into the pipeline without incurring the performance penalties that a single-threaded processor would normally suffer (branch mispredictions and data hazard stalls).

The Horizon [74] architecture also uses FGMT to provide a scalable interface, but focuses on using this to hide the latency of memory accesses, which can take hundreds of cycles when hundreds of processors and memory modules are connected via a message-passing interconnection network.

Figure 2.1 shows a simplified version of the operation of FGMT. At any time, each pipeline stage (Fetch, Decode, Execute) can be occupied by a different thread; provided there are more threads than pipeline stages, the pipeline can be fully utilized. Threads that require access to memory are stored in the memory unit (MSHR in Figure 2.1). Their requests are sent to the memory system while the threads wait inside the memory unit. In the meantime, other threads that are not blocked by memory are executed in the pipeline to keep it utilized. After threads inside the memory unit obtain their data from the memory system, they write the data to the register file and resume execution in the pipeline.

Figure 2.1: A simplified illustration of how fine-grained multithreading can hide memory latency. Notice it can also tolerate various long latency operations such as floating point division (represented by the long execution stages). MSHR = miss status hold register [34].
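The latency-hiding effect can be seen in a toy cycle-level simulation. The following C sketch uses made-up parameters of our own (four instructions per thread, two of them loads that block the thread for eight cycles) and is not a model of any particular machine; each cycle it issues one instruction from the lowest-numbered thread that is not waiting on memory:

```c
#define NUM_INST    4   /* instructions per thread; the 1st and 3rd are loads */
#define MEM_LAT     8   /* cycles a thread is blocked after issuing a load    */
#define MAX_THREADS 64

/* Single-issue pipeline: each cycle, issue one instruction from the
   lowest-numbered thread that is not blocked waiting on memory.
   Returns the total number of cycles until every thread finishes. */
int run_fgmt(int nthreads) {
    int ready_at[MAX_THREADS], left[MAX_THREADS];
    for (int t = 0; t < nthreads; ++t) { ready_at[t] = 0; left[t] = NUM_INST; }
    int cycle = 0, finished = 0;
    while (finished < nthreads) {
        for (int t = 0; t < nthreads; ++t) {
            if (left[t] > 0 && ready_at[t] <= cycle) {
                left[t]--;                     /* issue one instruction */
                if (left[t] == 0)
                    finished++;
                else if (left[t] % 2 == 1)     /* that instruction was a load */
                    ready_at[t] = cycle + MEM_LAT;
                break;                         /* at most one issue per cycle */
            }
        }
        cycle++;
    }
    return cycle;
}
```

With a single thread the pipeline idles during every memory access (18 cycles for 4 instructions in this sketch), while interleaving eight threads keeps it almost fully occupied (32 instructions in 37 cycles), mirroring the behaviour illustrated in Figure 2.1.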
With this organization, longer memory latency can be hidden using a larger number of threads sharing the pipeline.

While fine-grained multithreading allows efficient use of the pipeline, it requires multi-threaded software. Also, to be able to switch quickly between threads, the hardware processor stores the architectural state (registers and program counter) of all its threads. This requires a significantly larger register file than the ones in single-threaded processors. For instance, the Horizon architecture [74] allows up to 128 threads to share a single processor, and each thread contains 32 general purpose registers, requiring a register file of 4096 registers per processor. For these reasons, FGMT's popularity remains limited to architectures that optimize for applications with plentiful thread-level parallelism.

Sun's Niagara processor [35] features a FGMT architecture. The target applications for Niagara are server and database applications, which contain significant thread-level parallelism. FGMT enables Niagara to deliver higher power efficiency (performance per Watt ratio) compared to single-threaded superscalar processors, although cache misses remain a predominant factor in Niagara's performance [35].

Similar to FGMT, simultaneous multithreading (SMT) is a hardware technique that has been used in general purpose CPUs to improve throughput by exploiting thread-level parallelism [76]. To understand the difference between FGMT and SMT it is first necessary to point out that all high performance CPUs today employ an approach to instruction processing called superscalar execution. In contrast to FGMT, instead of allowing only one thread to issue an instruction to the pipeline each cycle, SMT allows multiple threads to compete for multiple superscalar issue slots in the same cycle, meaning that the superscalar pipeline's execution stage can accept instructions from different threads in a single cycle.
This thesis, however, uses processor cores with a single-issue, SIMD pipeline, so FGMT is the only option.

Finally, graphics processing units (GPUs) use FGMT to proactively hide memory access latency just like the Horizon architecture, but each core has a SIMD pipeline to increase the computation density [3, 39]. The SIMD pipeline shares data-independent control logic across multiple stream processors.

2.2 Compute Model

In this thesis, we have adopted a compute model that is similar to NVIDIA's CUDA programming model [58]. In this compute model, the application starts off as a single program running on the CPU. At some point during execution, the CPU reaches a kernel call and spawns a parallel section to the GPU to exploit data-level parallelism. At this point, the CPU stops its execution and waits for the GPU to finish the parallel section. This sequence can repeat multiple times until the program completes. Each parallel section consists of a collection of threads executing the same code, which we call a shader program.

5 A cache is a small and fast storage area close to the processor that reduces average memory access latency by buffering data frequently accessed by the processor.
6 A superscalar pipeline is designed to harness instruction-level parallelism within a single thread [24]. It has the capability to issue multiple instructions to multiple functional units in a single cycle.
7 A more recent version of CUDA allows the CPU to continue execution in parallel with the GPU [59].

Figure 2.2: Baseline GPU microarchitecture. Blocks labeled "scalar pipeline" include register read, execute, memory and writeback stages.
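A minimal sketch of this compute model in plain C follows (the names shader_thread and kernel_call are our own, and a sequential loop stands in for the GPU running all threads of the parallel section in parallel):

```c
#define NUM_THREADS 8

/* Shader program: every thread executes the same code, using its
   thread ID to select its own data element (SPMD style). */
static void shader_thread(int tid, const int *in, int *out) {
    out[tid] = in[tid] * in[tid];
}

/* Stand-in for a kernel call: the "CPU" spawns one logical thread per
   element and waits until the whole parallel section completes. */
void kernel_call(const int *in, int *out) {
    for (int tid = 0; tid < NUM_THREADS; ++tid)
        shader_thread(tid, in, out);
}
```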
Similar to many thread programming APIs, a shader program is encapsulated as a function call. In our implementation, at least one of the arguments is dedicated to passing in the thread ID, which each thread uses to determine its behaviour during the parallel section. For example, each thread may use its ID to index a different element in a vector. In this sense, the programming model employed in this thesis is essentially the Single Program, Multiple Data (SPMD) model commonly used to program shared memory multiprocessors.

All threads within a parallel section are expected to execute in parallel and share the same memory space. Unlike most shared memory multiprocessors, cache coherence and memory consistency are not enforced in our model, as the threads are expected to be largely independent of each other. In the present study, to ensure that threads of the next parallel section will have access to the correct data, data caches are flushed and a memory fence operation is performed at the end of each parallel section.

2.3 SIMD GPU Microarchitecture

Figure 2.2 illustrates the baseline microarchitecture used in the rest of this thesis. In this figure, each shader core executes multiple parallel threads from the same parallel section, with each thread's instructions executed in-order by the hardware. The multiple threads on a given core are grouped into SIMD warps by the hardware scheduler. Each warp of threads executes the same instruction simultaneously on different data values in parallel scalar pipelines. Instructions read their operands in parallel from a highly banked register file. Memory requests access a highly banked data cache, and cache misses are forwarded to the memory controller and/or cache levels closer to memory via an interconnection network.

8 Our shader core is similar to CUDA's notion of a Streaming Multiprocessor (SM) [39].
Each memory controller processes memory requests by accessing its associated DRAM, possibly in a different order than the requests are received, to reduce row activate and precharge overheads [64]. The interconnection network we simulated is a crossbar with a parallel iterative matching allocator [17].

Since our focus in this thesis is non-graphics applications, graphics-centric details are omitted from Figure 2.2. However, traditional graphics processing still heavily influences this design: the use of SIMD hardware to execute SPMD software is heavily motivated by the need to balance "general purpose" compute kernel execution with a large quantity of existing graphics shaders that have simple control flow [67]. However, it is important to recognize that shader programs for graphics may make increasing use of control flow operations in the future, for example to achieve more realistic lighting effects.

2.4 Latency Hiding

Since cache hit rates tend to be low for streaming applications, performance would be severely penalized if the pipeline had to stall for every memory request that misses the cache. This is especially true when the latency of memory requests can reach several hundred cycles due to the combined effects of contention in the interconnection network and row-activate and precharge overheads at the DRAM. While traditional microprocessors can mitigate the effects of cache misses using out-of-order execution, a more compelling approach when software provides the parallelism is to interleave instruction execution from different threads.

With a large number of shader threads multiplexed on the same execution resources, our architecture employs fine-grained multi-threading, where individual threads are interleaved by the fetch unit [74, 75] to proactively hide the potential latency of stalls before they occur (as described in Section 2.1.5). As illustrated by Figure 2.3(a), instructions from multiple shader threads are issued fairly in a round-robin fashion.
When a shader thread is blocked by a memory request, the corresponding shader core simply removes that thread's warp from the pool of "ready" warps, thereby allowing other shader threads to proceed while the memory system processes its request. With a large number of threads (768 per shader core, 12,288 in total in this thesis, similar to the GeForce 8800GTX) interleaved on the same pipeline, barrel processing effectively hides the latency of most memory operations, since the pipeline is occupied with instructions from other threads while memory operations complete. Barrel processing also hides the pipeline latency, so that data bypassing logic can be omitted to save area with minimal impact on performance.

Figure 2.3: Detail of a shader core. (a) Using barrel processing to hide data memory access latency. N is the SIMD width of the pipeline. (b) Grouping 4N scalar threads into a SIMD warp executed over 4 cycles.
In this thesis, we also simplify the dependency check logic design by restricting each thread to have at most one instruction running in the pipeline at any time.

An alternative to barrel processing on a large number of threads is to interleave fewer threads, but provide a large number of registers to each thread [28]. Each thread executes on the pipeline until it encounters a dependency hazard, at which point the pipeline switches its execution to another thread. Latency hiding is essentially achieved via software loop unrolling, which generates independent instructions to be inserted between a memory access and the instructions depending on that access.

2.5 SIMD Execution of Scalar Threads

While barrel processing can hide memory latency with relatively simple hardware, a modern GPU must also exploit the explicit parallelism provided by the stream programming model [9, 58] associated with programmable shader hardware to achieve maximum performance at minimum cost. SIMD hardware [8] can efficiently support SPMD program execution provided that individual threads follow similar control flow. Figure 2.3(b) illustrates how instructions from multiple (M = 4N) shader threads are grouped into a single SIMD warp and scheduled together into multiple (N) scalar pipelines over several (M/N = 4N/N = 4) cycles. The multiple scalar pipelines execute in "lock-step", and all data-independent logic may be shared to greatly reduce area relative to a MIMD architecture. A significant source of area savings for such a SIMD pipeline is the simpler instruction cache support required for a given number of scalar threads.

SIMD instruction processing can also be used to relax the latency requirement of the scheduler and simplify the scheduler's hardware. With a SIMD warp size wider than the actual SIMD hardware pipeline, the scheduler only needs to issue a single warp every M/N cycles (M = warp size, N = pipeline width) [39].
The scheduler's hardware is also simplified as it has fewer warps to manage. This technique is used in NVIDIA's GeForce 8 series [58], and the performance evaluations presented in this thesis assume the use of this technique (see Section 4.4 of Chapter 4).

2.6 Summary

In this chapter, we have given an overview of the fundamental concepts and the baseline architecture to be used for the rest of this thesis. Details of the interconnection network and memory subsystem are described in Chapter 5. In the following chapter, we describe a reconvergence mechanism to handle branch divergence in SIMD hardware, which represents a reasonable proxy for the performance of various contemporary control-flow handling mechanisms for SIMD. The performance of this reconvergence mechanism is compared against our proposed dynamic warp formation and scheduling mechanism in Chapter 6.

Chapter 3

SIMD Control Flow Support

To ensure the hardware can be easily programmed for a wide variety of applications, some recent GPU architectures allow individual threads to follow distinct program paths [3, 46, 58, 60]. We note that where it applies, predication [2] is a natural way to support such fine-grained control flow on a SIMD pipeline. However, predication does not eliminate branches due to loops and introduces overhead due to instructions with false predicates.

To support distinct control flow operation outcomes on distinct processing elements with loops and function calls, several approaches have been proposed: Lorie and Strong describe a mechanism using mask bits along with special compiler-generated priority encoding "else" and "join" instructions [43]. Lindholm and Moy [49] describe a mechanism for supporting branching using a serialization mode. Woop et al. [80] describe the use of a hardware stack and masked execution. Kapasi et al. [32] propose conditional streams, a technique for transforming a single kernel with conditional code into multiple kernels connected with inter-kernel buffers.

The effectiveness of a SIMD pipeline rests on the assumption that all threads running the same shader program exhibit identical control-flow behaviour. While this assumption is true for most existing graphics rendering routines [67], many existing parallel non-graphics applications (and potentially, future graphics rendering routines) tend to have more diverse control-flow behaviour. When a data-dependent branch leads to different control flow paths for different threads in a warp, branch divergence occurs, because a SIMD pipeline cannot execute different instructions in the same cycle. The following sections describe two techniques for handling branch divergence, both of which were implemented in the simulator developed as part of this thesis.

The preliminary version of the SIMD serialization mechanism presented in Section 3.1 was contributed by Henry Tran. The initial implementation of the reconvergence mechanism presented in Section 3.2 and the limit study presented in Section 3.3 were contributed by Ivan Sham.

3.1 SIMD Serialization

A naïve solution to handle branch divergence in a SIMD pipeline is to serialize the threads within a warp as soon as the program counters diverge. A single warp with branch divergence (threads taking different execution paths) is separated into multiple warps, each containing threads taking the same execution path. These warps are then scheduled and executed independently of each other, and they never reconverge back into a single warp. While this method is easy to understand and implement, it achieves poor performance.
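In hardware-independent terms, the serialization scheme just described amounts to splitting a warp on the branch outcome mask. The following C sketch is our own simplified illustration (PCs are abstract integers, and the Warp type and function name are invented for this example):

```c
typedef struct { int pc; unsigned mask; } Warp;

/* Split a diverged warp into independent per-path warps that are never
   merged again. `taken` marks the threads whose branch outcome was
   taken; only bits set in w.mask correspond to live threads. */
int serialize_split(Warp w, unsigned taken, int taken_pc, int fall_pc,
                    Warp out[2]) {
    int n = 0;
    unsigned taken_mask = w.mask & taken;
    unsigned fall_mask  = w.mask & ~taken;
    if (taken_mask) out[n++] = (Warp){ taken_pc, taken_mask };
    if (fall_mask)  out[n++] = (Warp){ fall_pc,  fall_mask  };
    return n;
}
```

Applied repeatedly without any reconvergence, each split leaves fewer active threads per warp, which is precisely why this scheme underutilizes the parallel functional units.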
Without branch reconvergence, threads within a warp will continue diverging until each thread is executed in isolation from the other threads in its original warp, leading to very low utilization of the parallel functional units, as shown by the NREC SIMD results in Chapter 6.

3.2 SIMD Reconvergence

Given the drawback of serialization, it is desirable to use a mechanism for reconverging control flow. The opportunity for such reconvergence is illustrated in Figure 3.1(a). In this example, threads in a warp diverge after reaching the branch at A. The first three threads encounter a taken branch and go to basic block B (indicated by the bit mask 1110 inside basic block B in Figure 3.1(a)), while the last thread goes to basic block F (indicated by the bit mask 0001 inside basic block F in Figure 3.1(a)). The three threads executing basic block B further diverge to basic blocks C and D. However, at basic block E the control flow paths reach a join point [50]. If the threads that diverged from basic block B to C wait before executing E for the threads that go from basic block B to basic block D, then all three threads can continue execution simultaneously at block E. Similarly, if these three threads wait after executing E for the thread that diverged from A to F, then all four threads can execute basic block G simultaneously. Figure 3.1(b) illustrates how this sequence of events would be executed by the SIMD function units. In this part of the figure, solid arrows indicate SIMD units that are active.

9 A basic block is a piece of code with a single entry and exit point.

Figure 3.1: Implementation of immediate post-dominator based reconvergence. (a) Example program control flow graph; each basic block is labeled with "block ID/thread active mask", and edges are labeled T for taken and NT for not-taken branches. (b) Reconvergence at E, the immediate post-dominator of B. (c) Initial state of the stack. (d) After divergent branch. (e) After reconvergence.

The behaviour described above can be achieved using a stack-based reconvergence mechanism [80]. In this thesis, each warp has a private stack tracking its control flow status. Each entry in the stack consists of three fields: next PC, active mask and reconvergence PC. The next PC field of the top of stack (TOS) entry indexes the instruction that the warp will execute when it next issues to the pipeline. The active mask field of the TOS entry indicates which threads in the warp are currently active. The reconvergence PC field specifies the point in the shader program where these active threads will wait for the other threads in the same warp before proceeding further down the program. Table 3.1 summarizes the operational rules of the stack.

Table 3.1: Operational rules of the stack for the reconvergence mechanism.
- No divergence (single next PC): update the next PC field of the top of stack (TOS) entry to the next PC of all active threads in this warp.
- Divergence (multiple next PCs): modify the next PC field of the TOS entry to the reconvergence point. For each unique next PC of the warp, push a new entry onto the stack with its next PC field being the unique next PC and its reconv. PC being the reconvergence point. The active mask of each entry denotes the threads branching to the next PC value of this entry.
- Reconvergence (next PC = reconv. PC of TOS): pop the TOS entry from the stack.
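The stack rules of Table 3.1 can be captured in a few lines of C. This is a behavioural sketch only: basic-block start addresses stand in for PC values, and the type, field, and function names are our own.

```c
enum { PC_A, PC_B, PC_C, PC_D, PC_E, PC_F, PC_G, PC_NONE = -1 };

typedef struct { int next_pc, reconv_pc; unsigned mask; } Entry;
typedef struct { Entry e[16]; int top; } Stack;   /* e[top] is the TOS */

/* No-divergence and reconvergence rules: on reaching the TOS entry's
   reconvergence PC the entry is popped; otherwise the TOS next PC is
   simply updated. */
void warp_advance(Stack *s, int next_pc) {
    if (s->top > 0 && next_pc == s->e[s->top].reconv_pc)
        s->top--;                        /* reconvergence: pop TOS */
    else
        s->e[s->top].next_pc = next_pc;  /* no divergence: update TOS */
}

/* Divergence rule: redirect the TOS to the reconvergence point, then
   push one entry per unique branch target with its thread mask. */
void warp_diverge(Stack *s, int reconv_pc,
                  int fall_pc, unsigned fall_mask,
                  int taken_pc, unsigned taken_mask) {
    s->e[s->top].next_pc = reconv_pc;
    s->top++; s->e[s->top] = (Entry){ fall_pc,  reconv_pc, fall_mask  };
    s->top++; s->e[s->top] = (Entry){ taken_pc, reconv_pc, taken_mask };
}
```

Replaying Figure 3.1 with this sketch: starting from the state of Figure 3.1(c), a divergence at B to C (mask 1000) and D (mask 0110) with reconvergence point E produces the stack of Figure 3.1(d); once each path reaches E its entry is popped, leaving the state of Figure 3.1(e).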
An example illustrating how these rules implement the reconvergence mechanism is shown in Figure 3.1(c,d,e). Here we show how the stack is updated as the group of three threads in Figure 3.1(a) that execute B diverge and then reconverge at E. Before the threads execute the diverging branch at B, the state of the stack is as shown in Figure 3.1(c). When the branch divergence is detected, the stack is modified to the state shown in Figure 3.1(d). The changes that occur are the following:

1. The original top of stack (TOS) in Figure 3.1(c), also at (i) in Figure 3.1(d), has its next PC field modified to the instruction address of the reconvergence point E (the address could be specified through an extra field in the branch instruction).

2. A new entry (ii) is allocated onto the stack in Figure 3.1(d) and initialized with the reconvergence point address (E) along with the next PC value of the fall-through of the branch (D), and a mask (0110) encoding which processing elements evaluated the branch as "not-taken".

3. A new entry (iii) is allocated onto the stack with the same reconvergence point address (E), the target address (C) of the branch, and a mask (1000) encoding which processing elements evaluated the branch as "taken".

Note that the mechanism described above supports "nested" branch hammocks as well as data-dependent loops. For a SIMD warp size of N, the size of this stack is bounded by 2N entries per warp, which are all consumed when each thread in a warp has diverged onto its own execution path. At this point, no new entry will be pushed onto the stack, because a single thread never diverges, even if it is running in a loop.

In this thesis we use the immediate post-dominator [50] of the diverging branch instruction as the reconvergence point. A post-dominator is defined as follows: a basic block X post-dominates basic block Y (written as "X pdom Y") if and only if all paths from Y to the exit node (of a function) go through X.
A basic block X, distinct from Y, immediately post-dominates basic block Y if and only if X pdom Y and there is no basic block Z such that X pdom Z and Z pdom Y. Immediate post-dominators are typically found at compile time as part of the control flow analysis necessary for code optimization.

10 While Rotenberg et al. [65] also identified immediate post-dominators as control reconvergence points, to our knowledge we are the first to propose this scheme for SIMD control flow.

The performance impact of the immediate post-dominator reconvergence technique (which we abbreviate in the rest of this thesis as PDOM) depends upon the SIMD warp size. Figure 3.2 shows the harmonic mean IPC of the benchmarks described in Chapter 5 compared to the performance of MIMD hardware for 8, 16, and 32 wide SIMD execution, assuming 16 shader cores. Hardware utilization decreases from 26% for MIMD to 21% for 8-wide, to 19% for 16-wide, and down to 16% for 32-wide. This growing performance loss is due to the greater impact of branch divergence as SIMD warps widen and each warp contains more threads. Although PDOM works well, the MIMD performance indicates that a performance gain of 66% is still possible for a better branch handling mechanism if it can perfectly eliminate the performance loss due to branch divergence.

Figure 3.2: Performance loss for PDOM versus SIMD warp size (realistic memory system).

11 The peak throughput remains the same for all three configurations of different SIMD widths. The configuration with a wider SIMD width is implemented as a scheduler that issues a wider warp at a slower rate. See Section 2.5 of Chapter 2 for more detail.

In the following section we explore whether immediate post-dominators are the "best" reconvergence points, or whether there might be a benefit to dynamically predicting a reconvergence point past the immediate post-dominator.

   void shader_thread(int tid, int *data) {
1:     for (int i = tid % 2; i < 128; ++i) {
2:         if (i % 2)
3:             data[tid]++;
       }
   }

   Odd:  12312 12312 12312 12312 1...
   Even: 12 12312 12312 12312 1231...

Figure 3.3: (a) Contrived example for which reconvergence at points beyond the immediate post-dominator yields the significant improvement in performance shown in Figure 3.4. The parameter tid is the thread ID. (b) Execution sequence (of the numbered lines above) for threads with odd or even tid. Note the example in Figure 3.4 used 2D arrays.

3.3 Reconvergence Point Limit Study

While reconverging at the immediate post-dominator recovers much of the performance lost due to branch divergence compared to not reconverging at all, Figure 3.3(a) shows an example where this reconvergence mechanism is sub-optimal. In this example, threads with an even thread ID diverge from those with an odd thread ID in each iteration of the loop. If the even threads allow the odd threads to "get ahead" by one iteration, all threads can execute in lock-step until individual threads reach the end of the loop. This suggests that reconverging at points beyond the immediate post-dominator may yield better performance. To explore this possibility we conducted a limit study assessing the impact of always predicting the best reconvergence point, assuming oracle knowledge of each thread's future control flow.

For this limit study, dynamic instruction traces are captured from only the first 128 threads. SIMD warps are formed by grouping threads by increasing thread ID, and an optimal alignment for the instruction traces of each thread in a warp is found via repeated applications of the Needleman-Wunsch algorithm [54]. With four threads per warp, the optimal alignment is found by exhaustively searching all possible pair-wise alignments between threads within a warp.
The best reconvergence points are then identified from the optimal alignment.

Figure 3.4 compares the performance of immediate post-dominator reconvergence versus the performance when reconverging at the predicted reconvergence points derived using this method, assuming a warp size of 4. In this figure we assume an idealized memory system (all cache accesses hit) and examine both a contrived program with the behaviour abstracted in Figure 3.3 and the benchmarks described in Chapter 5 (depicted by the bars labeled "Real Programs"). While the contrived example experiences a 92% speedup with oracle reconvergence point prediction, the improvement on the real programs we studied is much smaller (2.6%). Interestingly, one of the benchmarks (bitonic sort) does have a similar even/odd thread dependence as our contrived example. However, it also contains frequent barrier synchronizations that ensure loop iterations execute in lock-step. It is important to recognize the limitations of the preceding limit study: we explored a very limited set of applications and used a short SIMD width and a relatively small number of threads.

Figure 3.4: Impact of predicting optimal SIMD branch reconvergence points. NREC = no reconvergence. PDOM = reconvergence at immediate post-dominators. ORACLE = reconvergence at optimal post-dominators.

3.4 Summary

This chapter described a branch handling mechanism which reconverges the control flow of threads at the immediate post-dominators of the diverging branches and compared it against MIMD hardware. It also performed a limit study to evaluate the potential performance gain of using more sophisticated reconvergence mechanisms.
As this limit study seems to suggest that more sophisticated reconvergence mechanisms may not improve performance significantly, this thesis focuses on a mechanism which combines threads from distinct warps following branch divergence. This mechanism is discussed in the next chapter.

Chapter 4

Dynamic Warp Formation and Scheduling

While the post-dominator reconvergence mechanism described in the previous chapter is able to mitigate the performance loss resulting from diverging branches, it does not fully utilize the SIMD pipeline relative to a MIMD architecture with the same peak IPC capability (losing 40% of performance for a warp size of 32 relative to MIMD). In this section, we describe our proposed hardware mechanism for recovering the lost performance potential of the hardware.

The performance penalty due to branch divergence is hard to avoid with only one thread warp, since the diverged parts of the warp cannot execute simultaneously on the SIMD hardware in a single cycle. Dynamic warp formation attempts to improve upon this by exploiting the fine-grained multithreading aspect of the GPU microarchitecture: with fine-grained multithreading employed to hide memory access latency, there is usually more than one thread warp ready for scheduling in a shader core. Every cycle, the thread scheduler tries to form new warps from a pool of ready threads by combining scalar threads whose next program counter (PC) values are the same. As the shader program executes, diverged warps are broken up into scalar threads to be regrouped into new warps according to their branch targets (indicated by the next PC value of each scalar thread). In this way, the SIMD pipeline is fully utilized even when a shader program executes diverging branches.

Figure 4.1 illustrates this idea. In this figure, two warps, Warp x and Warp y, are executing the example program shown in Figure 4.1(a) on the same shader core and both suffer from branch divergence. Figure 4.1(b) shows the interleaved execution of both warps using the
Figure 4.1(b) shows the interleaved execution of both warps using thereconvergence technique discussed in Chapter 3.2, which results in the SIMD pipeline utilizationbelow 50% when basic blocks C, D and F are executed. As shown in Figure 4.1(c), using dynamic24Chapter 4. Dynamic Warp Formation and Schedulinga0a1a2a1a3a4a5a6a7a8a9a10a11a12a14a13a15a16a17a18a20a17a11a21a17a22a12a21a23a9a24a25a26a13a11a28a6a28a29a8a30a31a14a19a17a18a26a9a30a32a17a14a8a33a33a17a18a9a18a17a11a28a15a18a34a22a9a17a23a18a11a35a18a17a28a22a13a15a36a13a11a35a16a17a18a7a17a14a28a20a8a7a8a9a10a11a12a29a17a11a21a17a22a12a21a23a9a24a27a26a21a13a23a6a28a29a8a30a37a25a9a11a12a8a22a9a17a23a18a11a35a18a17a28a8a7a8a9a10a11a28a12a39a31a27a19a12a8a23a8a41a26a39a13a23a25a18a33a30a31a14a17a9a11a12a8a22a9a17a23a18a11a35a18a17a28a42a8a34a19a11a22a23a11a12a33a17a18a41a26a6a34a19a11a25a18a18a33a30a5a5a43a43a44a44a5a5a48a49a1a51a53a46a54a1a52a55a56a49a3a47a58a59a60a61a37a62a3a63a50a49a64a63a61a65a47a57a61a50a63a56a49a3a5a5a43a43a44a44a5a5a48a49a1a51a66a53a46a54a1a52a55a56a49a3a47a58a59a60a61a67a1a52a57a3a68a1a61a1a3a52a1a63a56a69a50a1a4a49a56a70a57a71a56a37a57a50a49a63a56a61a51a53a46a54a63a50a65a58a70a61a2a61a50a43a7a42a73a73a73a74a20a42a74a73a73a5a7a42a73a73a73a73a20a42a73a73a73a73a47a7a42a74a74a73a20a42a73a73a74a74a7a42a73a73a74a20a42a74a74a73a7a42a73a74a74a74a20a42a74a73a74a46a7a42a73a73a73a74a20a42a74a73a73a44a7a42a73a73a73a73a20a42a73a73a73a73a5a6a7a8a9a10a11a12a14a13a15a16a17a18a7a17a11a21a17a22a12a21a23a9a24a25a26a17a22a35a8a28a6a28a29a8a30Figure 4.1: Dynamic warp formation example.warp formation to regroup scalar threads from both warps in Figure 4.1(b) into a single warpin Figure 4.1(c) for these blocks can significantly increase the pipeline utilization (from 65% inFigure 4.1(b) to 77% in Figure 4.1(c)).Implementing dynamic warp formation requires careful attention to the details of the registerfile, a consideration we explore in Section 4.1. 
In addition to forming warps, the thread scheduler also selects one warp to issue to the SIMD pipeline every scheduler cycle depending upon a scheduling policy. We explore the design space of this scheduling policy in detail in Section 4.3. We show that the thread scheduler policy is critical to the performance impact of dynamic warp formation in Chapter 6.

4.1 Register File Access

To reduce area and support a large number of ports in a SIMD pipeline, a well-known approach is to implement the register file in multiple banks, each accessible from a single scalar pipeline (or lane) of the SIMD pipeline as shown in Figure 4.2(a). This hardware is a natural fit when threads are grouped into warps "statically" before they begin executing instructions and each thread stays in the same lane until it is completed. For our baseline architecture, each register file bank contains a separate set of registers per warp.

[Figure 4.2: Register file configuration for (a) static warp formation, (b) ideal dynamic warp formation and MIMD, (c) unconstrained dynamic warp formation, (d) lane aware dynamic warp formation. The solid line running across each register file bank in (a), (c) and (d) represents whether individual banks are addressed by a common decoder (continuous line in part (a)) or each bank has its own decoder for independent addressing ((c) and (d)).]
Each scalar thread from the same warp is statically assigned to a unique lane and always accesses the corresponding register file bank for that lane. The registers used by each scalar thread within a given lane are then assigned statically at a given offset based upon the warp ID.

However, so far we have not explicitly considered the impact of such static register assignment on dynamic warp formation. As described so far, dynamic warp formation would require each register to be equally accessible from all lanes, as illustrated in Figure 4.2(b). While grouping threads into warps dynamically, it is preferable to avoid the need to migrate register values with threads as they are regrouped into different warps. To accomplish this, the registers used by each scalar thread are assigned statically to the register banks in the same way as described above. However, if we dynamically form new warps without consideration of the "home" lane of a scalar thread's registers, we must design the register file with a crossbar as in Figure 4.2(c). Furthermore, warps formed dynamically may then have two or more threads with the same "home" lane, resulting in bank conflicts. These bank conflicts introduce stalls into all lanes of the pipeline and significantly reduce performance as shown in Chapter 6.

A better solution, which we call lane aware dynamic warp formation, ensures that each thread remains within its "home" lane. In particular, lane aware dynamic warp formation assigns a thread to a warp only if that warp does not already contain another thread in the same lane. While the crossbar in Figure 4.2(c) is unnecessary for lane aware dynamic warp formation, the traditional hardware in Figure 4.2(a) is insufficient. When threads are grouped into warps "statically", each thread's registers are at the same "offset" within the lane, thus requiring only a single decoder.
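The lane-aware grouping rule just introduced (a thread joins a warp only if its home lane is free) can be sketched as follows; the home-lane mapping `tid % WARP_SIZE` is an assumption of this sketch, not taken from the thesis:

```python
# Illustrative sketch of lane-aware dynamic warp formation: a thread may
# only join the warp being formed for its PC if its "home" lane is free,
# so register values never migrate between lanes.
WARP_SIZE = 4

def lane_aware_form(threads):
    """threads: list of (tid, next_pc). Returns (pc, {lane: tid}) warps."""
    forming = {}   # pc -> warp currently being filled
    warps = []
    for tid, pc in threads:
        lane = tid % WARP_SIZE          # fixed by static register placement
        warp = forming.get(pc)
        if warp is None or lane in warp:
            # home lane already taken (or no warp yet): start a fresh warp
            if warp is not None:
                warps.append((pc, warp))
            warp = {}
            forming[pc] = warp
        warp[lane] = tid                # thread stays in its home lane
    warps.extend(forming.items())
    return warps

# Threads 0 and 4 share home lane 0, so they cannot join the same warp.
warps = lane_aware_form([(0, 0xC), (1, 0xC), (4, 0xC), (5, 0xD)])
```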
With lane aware dynamic warp formation, the offsets to access a register in a warp will not be the same in each lane. Instead, the offset in each lane is calculated according to the thread assigned to the lane in the dynamically formed warp. Note that each lane is still executing the same instruction in any given cycle; the varying offsets are a byproduct of supporting fine-grained multithreading to hide memory access latency combined with dynamic warp formation. This yields the register file configuration shown in Figure 4.2(d), which is used for our area and performance estimation in Chapter 6 and Chapter 7.

One subtle performance issue affecting the impact of lane aware scheduling for one of our benchmarks (Bitonic Sort) is related to the type of pathological even/odd thread identifier control dependence described in Chapter 3. For example, if threads in all even lanes see a branch as taken, while threads in all odd lanes see the same branch as not-taken, then it is impossible for dynamic warp formation to create larger warps. A simple solution we employ is to alternately swap the position of even and odd thread home lanes every other warp when threads are first created (an approach we call thread swizzling).

4.2 Hardware Implementation

Figure 4.3 shows a high level block diagram illustrating how dynamic warp formation and scheduling can be implemented in hardware. A detailed implementation for the Majority scheduler is described in Section 4.4. Referring to Figure 4.3, the two warp update registers, labeled (a), store information for the different target PCs of an incoming, possibly diverged warp; the PC-warp LUT, labeled (b), provides a level of indirection to guide diverged threads to an existing or newly created warp in the warp pool, labeled (c).
The warp pool is a staging area holding all the warps to be issued to the SIMD pipeline.

When a warp arrives at the last stage of the SIMD pipeline, its thread identifiers (TIDs) and next PC(s) are passed to the thread scheduler (Figure 4.3(a)). For conditional branches, there are at most two different next PC values (footnote 12). For each unique next PC sent to the scheduler from writeback, the scheduler looks for an existing entry in the PC-warp LUT already mapped to the PC and allocates a new entry if none exists (footnote 13; Figure 4.3(b)).

[Figure 4.3: Implementation of dynamic warp formation and scheduling. In this figure, H represents a hash operation. N is the width of the SIMD pipeline. See text for a detailed description.]

The PC-warp LUT (Figure 4.3(b)) provides a level of indirection to reduce the complexity of locating warps in the warp pool (Figure 4.3(c)). It does this by using the IDX field to point to a warp being formed in the warp pool. This warp is updated with the thread identifiers of committing threads having this next PC value. Each entry in the warp pool contains the PC value of the warp, N TID entries for the N lanes of an N-wide SIMD pipeline, and some policy-specific data (labeled "Prio") for the scheduling logic. A design to handle the worst case, where each thread diverges to a different execution path, would require the warp pool to have enough entries for each thread in a shader core to have its own entry.
However, we observe that for the benchmarks we simulated only a small portion of the warp pool is used, and we can shrink the warp pool significantly to reduce area overhead without causing a performance penalty (see Chapter 6).

[Footnote 12: Indirect branches that diverge to more than two PCs can be handled by stalling the pipeline and sending up to two PCs to the thread scheduler every cycle.]

[Footnote 13: In our detailed model we assume the PC-warp LUT is organized as a small dual-ported 4-way set associative structure.]

To implement the lane aware scheduler mentioned in Section 4.1, each entry in the PC-warp LUT has an occupancy vector (OCC) tracking which lanes of the current warp are free. This is compared against the request vector (REQ) of the warp update register, which indicates which lanes are required by the threads assigned to this warp. If a required lane is already occupied by a thread, a new warp will be allocated and the TIDs of the threads causing the conflict will be assigned to this new warp. The TIDs of the threads that do not cause any conflict will be assigned to the original warp. In this case, the PC-warp LUT IDX field is also updated to point to the new warp in the warp pool. The warp with the older PC still resides in the warp pool, but will no longer be updated. A more aggressive approach would be to continually try to merge threads into the earlier warp, but this is beyond the scope of this thesis.

Each cycle, a single warp in the warp pool may be issued to the SIMD pipeline according to one of the scheduling policies described in the next section. Once issued, the warp pool entry used by the warp is returned to the warp allocator.

4.2.1 Warp Pool

As mentioned above, the warp pool holds all the warps in the scheduler. Every cycle, the incoming threads with the same PC value may be assigned to either an existing warp or a newly allocated warp in the warp pool.
To handle the case of a branch that diverges into two groups of threads each with a distinct PC value, the warp pool must allow parallel access to four different warps in each cycle (threads in the same group may be assigned to an existing warp or a new warp; both warps are updated in the worst case). A naïve implementation of the warp pool with a single memory array would require four write ports, which significantly increases the area requirement of dynamic warp formation. However, with the observation that each incoming warp contains at most one thread executing in each SIMD lane (assuming lane aware scheduling is used here), a more efficient implementation of the warp pool, shown in Figure 4.4(a), is possible. This implementation eliminates the need for four write ports by separating the warp pool into banks, with each bank storing the TIDs of threads executing in the same SIMD lane. Each bank requires only one write port, because every incoming warp, no matter how diverged, never contains more than one thread with the same home SIMD lane. The PC value and any scheduler-specific data of each warp are stored in a separate bank from the TIDs, with two write ports.

[Figure 4.4: Implementation of the warp pool. (a) Implementation with lane aware scheduling. (b) Generation of the conflict vector (CNF) and the new warp signal. N is the width of the SIMD pipeline.]

In Figure 4.4(a), the write port to each TID bank is indexed by either the IDX from the PC-warp LUT or a new IDX from the warp allocator. The choice is determined by a bit from the conflict vector (CNF), which is the logical AND of the request vector (REQ) and the occupancy vector (OCC).
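A bit-level sketch of this conflict logic (illustrative; here OCC is treated as a bitmask of lanes already occupied in the warp being formed, so that CNF = REQ AND OCC as in Figure 4.4(b)):

```python
# Sketch of the conflict-vector logic: lanes that are both requested by
# incoming threads (REQ) and already occupied (OCC) spill into a newly
# allocated warp; the remaining threads merge into the existing warp.
N = 8  # SIMD width

def update_warp(req, occ):
    """req, occ: N-bit lane masks. Returns (cnf, new_warp, merged_occ)."""
    cnf = req & occ                   # requested AND occupied -> conflict
    new_warp = cnf != 0               # request a fresh warp pool entry
    merged_occ = occ | (req & ~cnf)   # conflict-free threads join old warp
    return cnf, new_warp, merged_occ

# Existing warp occupies lanes 0-3; an incoming diverged group needs
# lanes 2-5. Lanes 2 and 3 conflict and go to the new warp.
cnf, new_warp, occ2 = update_warp(req=0b00111100, occ=0b00001111)
```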
The input address to the bank containing the PC value and scheduler-specific data is determined by a new warp signal, which is set whenever CNF is not all zeros. This new warp signal also requests an index to a free warp from the warp allocator. Figure 4.4(b) summarizes how CNF and new warp are generated.

A further refinement, which exploits the fact that a single warp may be executed over multiple cycles (four cycles in our baseline configuration; see Chapter 5), is to use a single memory array with one read port and one write port and perform the four warp updates over four cycles (one every cycle). However, due to time constraints we do not explore this further in this thesis.

4.3 Scheduling Policies

Even though dynamic warp formation has the potential to fully utilize the SIMD pipeline, this will only happen when the set of PC values currently being executed is small relative to the number of scalar threads. If each scalar thread progresses at a substantially different rate, then all threads will eventually map to entirely different PCs. To avoid this, all threads should have a similar rate of progress. We have found that the warp scheduling policy, namely, the order in which warps are issued from the warp pool, has a critical effect on performance (as shown in Chapter 6). We explored the following policies:

Time Stamp (DTime): Warps are issued in the order they arrive at the scheduler. Note that when a warp misses the cache, it is taken out of the scheduler until its requested data arrives from memory, and this may change the order in which warps are issued.

Program Counter (DPC): In a program laid out sequentially in instruction memory, the program counter value itself may be a good indicator of a thread's progress.
By giving higher issue priority to warps with smaller PCs, threads lagging behind are given the opportunity to catch up.

Majority (DMaj): As long as a majority of the threads are progressing at the same rate, the scheduling logic will have a large pool of threads from which to create a new warp every cycle. The majority policy attempts to encourage this behaviour by choosing the most common PC among all the existing warps and issuing all warps at this PC before choosing a new PC.

Minority (DMin): If a small minority of threads diverges away from the rest, the Majority policy tends to leave these threads behind. In the minority policy, warps with the least frequent PCs are given priority with the hope that, by doing so, these warps may eventually catch up and converge with other threads.

Post-Dominator Priority (DPdPri): Threads falling behind after a divergence need to catch up with other threads after the immediate post-dominator. If the issue priority is set lower for warps that have gone beyond more post-dominators, then the threads that have yet to go past the post-dominator tend to catch up.

4.4 A Majority Scheduling Policy Implementation

Majority scheduling logic, the best performing policy among those presented (as shown in Chapter 6), can be implemented in hardware with a max-heap and a lookup table as shown in Figure 4.5. This hardware implementation may also be applied, with some modifications, to other scheduling policies that require priority management, such as DPC and DPdPri, but this thesis will focus on the implementation for the Majority scheduling policy.

[Figure 4.5: Implementation of the majority scheduling policy with a max-heap. See text for details.]
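The scheduling policies above can be viewed as different priority keys over the warp pool. A sketch of the first four (an illustrative software model, not the thesis hardware; DPdPri is omitted because it needs per-warp post-dominator counts):

```python
# Sketch comparing the issue-priority rules of DTime, DPC, DMaj and DMin.
from collections import Counter

def pick_warp(warp_pool, policy):
    """warp_pool: list of dicts {'pc':..., 'arrival':..., 'tids':[...]}."""
    pc_count = Counter()  # threads per PC, used by DMaj/DMin
    for w in warp_pool:
        pc_count[w['pc']] += len(w['tids'])
    key = {
        'DTime': lambda w: w['arrival'],          # oldest warp first
        'DPC':   lambda w: w['pc'],               # smallest PC first
        'DMaj':  lambda w: -pc_count[w['pc']],    # most common PC first
        'DMin':  lambda w: pc_count[w['pc']],     # rarest PC first
    }[policy]
    return min(warp_pool, key=key)

pool = [{'pc': 0x20, 'arrival': 5, 'tids': [0, 1, 2]},
        {'pc': 0x20, 'arrival': 7, 'tids': [3, 4]},
        {'pc': 0x48, 'arrival': 1, 'tids': [5]}]
# DMaj picks a PC-0x20 warp (5 threads at that PC); DMin picks the
# lone 0x48 warp; DTime picks the oldest arrival.
```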
The area of this hardware implementation is included as part of the area overhead estimation for dynamic warp formation and scheduling in Chapter 7. In Figure 4.5, the Majority Entry, labeled (a), keeps track of the current majority PC value and a group of warps with this PC value (this entry is kept separate because we finish executing all warps at a given PC value before selecting a new PC value); the Max-Heap, labeled (b), is a full binary tree of PC values (each with its group of warps) sorted by the number of threads at the given PC value, using the algorithm described in Chapter 6 of Cormen et al. [15]; and the MHeap LUT, labeled (c), provides a level of indirection for incoming warps to update their corresponding entries in the Max-Heap (i.e., the function of the PC-warp LUT in Figure 4.3). While we find a simple max-heap performing one swap per cycle per warp to be sufficient for our usage, using a pipelined max-heap hardware implementation [4, 29] may further reduce the bandwidth requirement of the max-heap.

Each entry in the Max-Heap represents a group of warps with the same PC value, and this group of warps has CNT threads. This group of warps forms a linked list in the warp pool, with a "Next" entry referring to the next warp in the list. WPH keeps the head of this list and WPT keeps the tail (so that the list can grow as warps arrive). LUT# is a back pointer to the entry in the MHeap LUT, and is used to update the LUT for correctness after a swap (see details below).

4.4.1 Warp Insertion

When a warp arrives at the thread scheduler, it is placed in the warp update register. The warp's next PC value (PC), the number of threads, or thread count, of this warp (NTHD), and the ID of the warp (WID) in the warp pool (acquired via the PC-warp LUT or warp allocator as in Section 4.2) where these threads will be placed, are sent to the Majority scheduler.
Inside the Majority scheduler, the PC value is checked against the one stored in the Majority Entry, and if they match, the Majority Entry will be updated directly. Otherwise, the MHeap LUT will provide the index (MH#) into the Max-Heap entry with this PC value, or return the index of a newly inserted entry in the Max-Heap if an existing entry with this PC is not found in the MHeap LUT. This Max-Heap entry, denoted by MH#, is then updated: CNT is incremented by NTHD, and if WID is different from WPT, WPT will be updated with WID, and WID will be assigned to the "Next" entry of the warp referred to by the original WPT (essentially inserting the warp referred to by WID into the linked list of this entry). If this is a newly inserted entry, the value of WID will be assigned to both WPH and WPT.

After the entry in the Max-Heap is updated or inserted, it is compared with its parent entry and swapped up the binary tree until it encounters a parent with a larger thread count (CNT) or it becomes the root of the binary tree. Whenever a swap occurs, the corresponding entry in the MHeap LUT (denoted by LUT# in the swapping Max-Heap entries) is updated. During this operation, no entries in the Max-Heap can be updated.

4.4.2 Warp Issue

In every scheduler cycle (see footnote 16), a warp from the warp pool will be issued to the SIMD pipeline. If the Majority Entry has any warps remaining (CNT>0), the warp denoted by WPH will be issued to the pipeline. WPH will then be updated with the value in the "Next" entry of the issued warp, and CNT will be decremented by the number of threads in the issued warp. If the Majority Entry runs out of warps (CNT=0), the root entry of the Max-Heap will be popped and will become the Majority Entry. If the Max-Heap is in the process of rebalancing itself after a warp insertion, no warp will be issued until the Max-Heap is balanced. The Max-Heap then balances itself according to the algorithm described by Cormen et al. [15].
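The insertion and issue flow of Sections 4.4.1 and 4.4.2 can be modeled in software as the following sketch (illustrative only; it replaces the hardware's in-place sift-up and linked-list pointers with a lazily updated software heap and Python lists):

```python
# Sketch of the Majority policy: PCs are ordered by total thread count,
# and the scheduler drains all warps at the current majority PC before
# popping the next PC from the heap.
import heapq

class MajorityScheduler:
    def __init__(self):
        self.count = {}      # pc -> live thread count (CNT)
        self.warps = {}      # pc -> FIFO of warps (the WPH/WPT list)
        self.heap = []       # (-CNT, pc) lazy max-heap entries
        self.current = None  # the Majority Entry's PC

    def insert(self, pc, tids):
        self.count[pc] = self.count.get(pc, 0) + len(tids)
        self.warps.setdefault(pc, []).append(tids)
        heapq.heappush(self.heap, (-self.count[pc], pc))  # lazy update

    def issue(self):
        # refill the Majority Entry from the heap root when it runs dry
        while self.current is None or not self.warps.get(self.current):
            if not self.heap:
                return None
            cnt, pc = heapq.heappop(self.heap)
            if -cnt == self.count.get(pc, 0) and self.warps.get(pc):
                self.current = pc  # stale heap entries are skipped
        warp = self.warps[self.current].pop(0)
        self.count[self.current] -= len(warp)
        return (self.current, warp)
```

The lazy-stale-entry trick here stands in for the hardware's LUT-tracked swaps; the drain-then-pop behaviour matches the separated Majority Entry described above.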
Similar to the case of warp insertion, the MHeap LUT will be updated in parallel with any swap operation, and during this operation, no entries in the Max-Heap can be updated.

4.4.3 Complexity

As the max-heap is a tree structure, every warp insertion requires at most log(N) swaps for an N-entry max-heap to rebalance, and experimentally we found that usually fewer swaps are required. We find in Chapter 6 that a design with both the Max-Heap and the MHeap LUT each having 2 read ports and 2 write ports (to perform 2 swaps in parallel to handle the 2 warp insertions from each part of the incoming warp every scheduler cycle) is sufficient for keeping the Max-Heap balanced in the common case. This assumes that both structures are clocked as fast as the SIMD pipeline, so that a total of 8 reads and 8 writes can be performed in every scheduler cycle. If the number of PCs in flight exceeds the max-heap capacity, a mechanism such as spilling entries from the max-heap could be implemented. However, we observed in Chapter 7 that with a 64-entry max-heap, the need for this never arises for our benchmarks. We leave the exploration of the performance impact of spilling the max-heap with a smaller max-heap capacity as future work.

[Footnote 16: A scheduler cycle is equivalent to several pipeline cycles, because each warp can take several cycles (e.g. 4 for a warp size of 32) to execute in the pipeline (see Section 2.5 of Chapter 2), so the scheduler may operate at a slower frequency.]

4.5 Summary

This chapter described our proposed branch handling mechanism: dynamic warp formation and scheduling. It discussed various implementation details, including a way to avoid register file bank conflicts (lane aware scheduling) and a hardware implementation of the Majority scheduling policy. The next chapter describes the simulation methodology we use in this thesis to evaluate this mechanism.

Chapter 5

Methodology

While simulators for contemporary GPU architectures exist [19, 66], none of them model the general-purpose GPU architecture described in this thesis. Therefore, we developed a novel simulator, GPGPU-Sim, to model various aspects of the massively parallel architecture used in modern GPUs with highly programmable pipelines.

The benchmark applications used for this study were selected from SPEC CPU2006 [69], SPLASH2 [79], and CUDA [57]. Each benchmark was manually modified to extract and annotate the compute kernels, which is a time-consuming task limiting the number of benchmarks we could consider. The programming model we assume is similar to that of CUDA [58], which uses a system call to a special GPU device driver in the operating system to launch parallel sections on the GPU. In our simulator, this system call mechanism is emulated by a spawn instruction, which signals the out-of-order core to launch a predetermined number of threads for parallel execution of a compute kernel on the GPU simulator. If the number of threads to be executed exceeds the capacity of the hardware configuration, the software layer is responsible for organizing threads into blocks. Threads within a block are assigned to a single shader core, and all of them have to finish before the shader core can begin a new block.

The rest of this chapter describes our new simulator, GPGPU-Sim, in detail and gives an overview of the benchmarks used in this thesis.

5.1 Software Design of GPGPU-Sim: A Cycle Accurate GPGPU Simulator

This section describes the software design of GPGPU-Sim, the cycle accurate GPU simulator used in this thesis.

GPGPU-Sim consists of two top-level components: functional simulation and performance (timing) simulation.
Functional simulation provides the behaviour that the program expects from the architecture, whereas performance simulation provides an estimate of how fast the program would execute if it were run on the modeled microarchitecture. The two parts of simulation are usually decoupled in a simulator to allow developers to introduce approximations to the performance model for fast prototyping of novel ideas. For example, our DRAM performance model does not model DRAM refreshes. While DRAM refresh is an important feature to maintain functional correctness of a DRAM module, omitting it has minimal impact on the accuracy of a DRAM performance model due to the rare occurrence of DRAM refresh (once every several million cycles). For this reason, like most modern architecture simulators in widespread use (such as SimpleScalar [10], SMTSIM [76], and PTLsim [81]), GPGPU-Sim separates functional simulation from performance simulation in its software design.

GPGPU-Sim was constructed "from the ground up" starting from the instruction set simulator (sim-safe) of SimpleScalar version 3.0d [10]. We developed our cycle-accurate GPU performance simulator, modeling the microarchitecture described in Chapter 2, around the SimpleScalar PISA instruction set architecture (a RISC instruction set similar to MIPS) and then interfaced it with sim-outorder, which only provides timing for code run on the CPU in Figure 2.2.

Figure 5.1 provides an overview of GPGPU-Sim, with each source code file classified into one of three main modules: Shader Core, Interconnection Network, and DRAM Model. Files that do not fit into any of these three modules either contain code that provides miscellaneous tools to all the modules (gpu-misc and delayqueue), or glue code that connects all modules together (gpu-sim).
The following sections give an overview of the three main modules.

5.1.1 Shader Core

Each shader core contains a simple pipeline, much like the classic MIPS 5-stage in-order pipeline described in [24]. This simple pipeline is extended with fine-grained multithreading, SIMD, miss status hold registers (MSHRs), the branch handling mechanisms discussed in this thesis so far (Chapter 3 and Chapter 4), and various other microarchitecture details.

[Figure 5.1: Overview of GPGPU-Sim's software design. An arrow from A to B means "A uses B".]

Figure 5.2(a) lists the pipeline stages in a shader core, and briefly illustrates the functionality of each stage. Note that in the simulator, these stages are simulated in reverse order of the pipeline flow to eliminate the need for two copies of each pipeline register. This is a well-known technique, also used in SimpleScalar's out-of-order simulator (sim-outorder).

Each shader core connects to the simulated memory subsystem (the memory system and the interconnection network) via an abstract software interface that treats the memory subsystem as an out-of-order queue with requests coming back in any order:

fq_has_buffer: Query the memory subsystem for buffer space to hold a list of memory requests, each with its requested memory address specified.
fq_push: Push a memory request (memory address, number of bytes to access, and whether the access is a read or a write) into the memory subsystem.
fq_pop: Pop a serviced memory request from the memory subsystem and push it into the return queue.
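A minimal software model of this out-of-order interface, assuming a fixed request capacity and a randomized per-request latency (both assumptions of this sketch, not GPGPU-Sim's actual behaviour):

```python
# Illustrative model of the fq_has_buffer / fq_push / fq_pop interface:
# requests complete out of order because each is given its own latency.
import random

class MemorySubsystem:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.in_flight = []  # (ready_time, request) pairs

    def fq_has_buffer(self, requests):
        return len(self.in_flight) + len(requests) <= self.capacity

    def fq_push(self, now, addr, nbytes, is_write):
        latency = random.randint(50, 200)  # assumed latency range
        self.in_flight.append((now + latency, (addr, nbytes, is_write)))

    def fq_pop(self, now):
        # return any one serviced request, regardless of push order
        for entry in self.in_flight:
            if entry[0] <= now:
                self.in_flight.remove(entry)
                return entry[1]
        return None
```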
[Figure 5.2: Software design of the pipeline stages in the shader core model.]

This abstract interface decouples the internal design of the shader core module from the rest of the simulator.

The behaviour of every stage in a shader core is modeled by a function call in the simulator with the general structure shown in Figure 5.2(b), and each stage has a set of pipeline registers acting as the input to that stage. Each stage starts with a check to see whether the pipeline register between the current stage and the next stage is empty. If this pipeline register is not empty, the next stage must be stalled, and therefore the current stage should stall as well. Since the activity of the pipeline stages is evaluated in reverse order each cycle, a hazard at the end of the pipeline can stall the entire pipeline in a single cycle. If the pipeline registers are empty, the simulator goes on to simulate the behaviour of the stage. Whenever a stall is generated in the simulator, the function call modeling the stage simply returns, leaving the content of its pipeline register unchanged.
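This evaluate-in-reverse structure can be sketched as follows (an illustrative model, not GPGPU-Sim source; the stage names and single-register-per-stage representation are assumptions of the sketch):

```python
# Sketch of reverse-order pipeline evaluation: simulating stages from
# writeback back to fetch lets one set of pipeline registers stand in
# for both "current" and "next" copies, and lets a stall at the end of
# the pipeline propagate through every stage in a single cycle.
STAGES = ["fetch", "decode", "execute", "memory", "writeback"]

def step(pipe, stalled_stage=None):
    """pipe: dict stage -> instruction or None (the stage's input reg)."""
    retired = None
    for i in reversed(range(len(STAGES))):      # writeback first
        stage = STAGES[i]
        if stage == stalled_stage:
            continue                            # stage stalls: reg kept
        if stage == "writeback":
            retired, pipe[stage] = pipe[stage], None
        else:
            nxt = STAGES[i + 1]
            if pipe[nxt] is None:               # next stage's reg free?
                pipe[nxt], pipe[stage] = pipe[stage], None
            # else: implicit stall, register left unchanged
    return retired

pipe = {s: None for s in STAGES}
pipe["fetch"] = "i0"
step(pipe)   # i0 advances from fetch into the decode register
```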
When the stage proceeds without generating a stall, the content of the pipeline register of the current stage is copied to the next stage's pipeline register (imitating the control signals transferring from one stage to the next in hardware). The pipeline register of the current stage is then cleared, indicating to the earlier stage that no stall has been generated.

The following is a description of each stage:

Fetch

The fetch stage starts by selecting the warp that will be issued to the pipeline. The selection is performed by one of the schedulers implemented in the simulator, as specified in the simulator's configuration (either on the command line or via a configuration file). Each scheduler implements one of the branch handling mechanisms discussed in Chapter 3 and Chapter 4. The MIMD architecture is modeled with a scheduler which can freely schedule threads into the lanes of the pipeline regardless of their PC values.

After selecting which threads to issue, the scheduler passes the thread IDs and their PC values to the fetch block, which fetches the instructions from memory and puts them into the pipeline register of the decode stage. [Footnote 17: We currently assume that all accesses to the instruction cache hit because of the small kernel code size.] The fetch block also locks the issued threads by setting a flag, indicating to the scheduler that these threads will not be available for scheduling until they are unlocked in a later stage (this is done at the write-back stage). The warp size distribution (a key statistic reported in Chapter 6) is collected in the fetch stage.

Decode

The decode stage decodes the instructions in the pipeline register between fetch and decode, determining the instruction type (arithmetic/branching/memory) and the registers that will be used. It also checks for any violation of a dependency hazard (checking the scoreboard for any use of the marked registers).
If a dependency hazard is detected, it nullifies the instruction (i.e., clears its input pipeline register). If no hazard is detected, a scoreboard entry will be set to indicate that the output registers of these instructions are in use.

If the simulator is configured to perform "parallel functional simulation mode", the decode stage will also functionally simulate the instructions at its input pipeline registers using SimpleScalar's instruction set simulation kernel. Parallel functional simulation mode is described in Section 5.1.4.

Register Read

The register read stage is only used when modeling the effects of register file bank conflicts. The register file bank to be accessed by each thread is determined by the thread's thread ID. For each warp, the register read stage counts the number of threads accessing each bank, and stalls when the count for any bank exceeds one. The stage then stalls for a number of cycles equal to the maximum count of threads accessing any given bank. This models the serialization of warp execution to resolve bank conflicts. This stage is only activated when using the dynamic warp formation scheduler.

Execution

The execution stage is an empty stage, as functional simulation is done at the decode stage. It may be retrofitted in the future to model the latency of the transcendental functions (sine, cosine, square root, reciprocal, etc.) and interpolators featured in modern GPU architectures [39].

Pre-Memory

The pre-memory stage is an optional stage modeling the cache access latency, which can possibly be longer than one shader cycle with a large cache. It is a series of pipeline registers in a queue, buffering the instructions going from the execution stage to the memory stage.
Each instruction has to traverse the length of the queue (equivalent to the number of cycles it takes to access the cache, minus one) to reach the memory stage.

Memory

The memory stage handles accesses from the incoming threads to the data cache or to the shared memory (a small, fast scratch pad memory featured in CUDA [58]).

To simulate this software-controlled fast scratch pad memory, the stage first classifies each data access as an access to the shared memory or to the data cache by its memory address. If the data access falls into the region belonging to the shared memory, it is classified as an access to shared memory. [Footnote 18: This region is specified by the user and is unique to each benchmark. See Section 5.1.4 for details.]

For instructions accessing the shared memory, the memory stage checks for bank conflicts according to the addresses of all the accesses, and stalls for a number of cycles equal to the maximum number of threads accessing the same bank (similar to the register read stage). [Footnote 19: The banks in shared memory and the data cache are determined by the effective memory address rather than by thread ID (as done for the register file).]

For instructions accessing the data cache, the memory stage also checks for bank conflicts at the data cache and stalls accordingly. Once the stalls due to bank conflicts are over, the data cache is accessed. If all accesses hit, the threads will be sent to the write-back stage. If some of the accesses miss, the memory stage will arbitrate for the resources (for example, miss status hold registers [34] (MSHRs) and input buffers at the interconnect to store the requests) required to generate all the required memory requests. If the allocation fails, the memory stage stalls and tries again next cycle. Note that some of the misses can be merged with other misses and serviced by in-flight (already sent) memory requests; in that case, only MSHRs are allocated for these misses.

Once the resources are allocated, memory requests will be generated and sent to the interconnection network and the memory system modules via an abstract interface. In this case, the contents of the input pipeline registers will be stored in the MSHRs, and will not be sent to the write-back stage.

Write-Back

The write-back stage has two inputs, one directly from the memory stage, and another from the return queue, which holds threads that missed the data cache and were serviced by a memory access. The first step in the write-back stage is to arbitrate between the two inputs. In the current version of GPGPU-Sim, priority is always given to the threads from the return queue, which may generate sub-optimal results, as the pipeline can be stalled very frequently in yielding to the return queue.

The threads that won the arbitration proceed to the rest of the write-back stage. Scoreboard entries corresponding to the output registers of the threads will be cleared, indicating that the output values have already been written into the register file. Any subsequent instructions from the same thread accessing the same registers will not cause a dependency hazard. After that, the threads are unlocked and their IDs are sent to the commit queue, indicating to the scheduler that these threads are ready to be scheduled again.

5.1.2 Interconnection Network

The interconnection network module is responsible for relaying messages between the shader cores and the memory controllers (containing the DRAM model). It does not have any knowledge about the content of the messages, other than the size of each message in bytes. This module also models the timing and congestion for each message in the interconnect. It provides the following abstract software interface to other modules (such as the shader core and DRAM timing model):
icnt_has_buffer   Query the interconnect for buffer space at a given input to hold all the given messages, each with its size and output specified.
icnt_push         Push a message (whose content is passed as a void*) into the interconnection network from a specific source to a specific destination.
icnt_pop          Pop a message from the interconnection network at a given output.
icnt_transfer     Advance the interconnection network by one cycle.
icnt_busy         Check whether the interconnect still has undelivered messages.
icnt_drain        Transfer all remaining messages to their destination buffers.

Currently there are two implementations of the interconnection network interface. One of them is Intersim, a modified version of the general interconnection simulator created by Dally and Towles to accompany their textbook [17]. We have modified it to allow other modules to inject packets into and extract packets from the modeled interconnect, and to enable the packets to carry actual payload (so that it actually relays messages from one module to another). The other, used in this thesis, is a simplified implementation of the crossbar described in the same work [17]. The general structure of this interconnection model features two crossbars: one relaying messages from the shader cores to the memory controllers, and the other relaying messages in the opposite direction. This design simplifies the allocators, because it enables the use of two simple allocators each handling one-way traffic instead of one complex allocator handling bidirectional traffic. Sun's Niagara [70] also employs two crossbars to enable bidirectional traffic between processor cores and the memory subsystem (Chapter 6 of [70]).

Crossbar Allocator

The crossbar allocator orchestrates the cross-points to satisfy multiple requests from different inputs asking for cross-points to some desired outputs. When these requests conflict, only one can be fulfilled, and the others are delayed to the next cycle.
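The icnt_* interface listed earlier can be expressed as an abstract base class; the Python names and signatures below are illustrative, not GPGPU-Sim's actual C prototypes, and the zero-latency implementation exists only to exercise callers:

```python
from abc import ABC, abstractmethod

class InterconnectInterface(ABC):
    """Sketch of the icnt_* abstract interface described in the text."""

    @abstractmethod
    def has_buffer(self, input_port, sizes):
        """True if the input has buffer space for all the given messages."""

    @abstractmethod
    def push(self, src, dst, message, size):
        """Inject a message travelling from source `src` to destination `dst`."""

    @abstractmethod
    def pop(self, output_port):
        """Extract the next delivered message at `output_port`, if any."""

    @abstractmethod
    def transfer(self):
        """Advance the interconnect by one cycle."""

    @abstractmethod
    def busy(self):
        """True while undelivered messages remain."""

class IdealInterconnect(InterconnectInterface):
    """Zero-latency, infinite-buffer stand-in, useful for testing callers."""

    def __init__(self):
        self.delivered = {}

    def has_buffer(self, input_port, sizes):
        return True

    def push(self, src, dst, message, size):
        self.delivered.setdefault(dst, []).append(message)

    def pop(self, output_port):
        queue = self.delivered.get(output_port, [])
        return queue.pop(0) if queue else None

    def transfer(self):
        pass  # zero latency: nothing to advance

    def busy(self):
        return any(self.delivered.values())
```

Keeping callers coded against the abstract interface is what allows Intersim and the simplified crossbar to be swapped freely.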
The requests from all N inputs to M outputs can be summarized in an N × M request matrix R (a request from input i to output j is indicated by a 1 at row i, column j), and the allocator generates an N × M grant matrix G indicating which of the requests have been granted:

R = \begin{pmatrix} 1 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 \end{pmatrix}
\;\rightarrow\;
G = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
\text{ or }
\begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}
\text{ or } \ldots

The above example shows a request matrix with 4 inputs and 4 outputs. Notice how, for the same request matrix, the allocator can often generate many different grant matrices (two of them shown above), many of which are suboptimal. This shows that the crossbar allocator is crucial to the crossbar's bandwidth utilization.

In this thesis, we have chosen to use the Parallel Iterative Matching (PIM) allocator, a simple separable allocator described in [17, Chapter 19]. First, the allocator randomly selects one of the requested outputs for each input buffer (note: not input, but input buffer) and neglects all other requests from that input buffer. Each unmatched output then randomly chooses one of the remaining requests asking for that output.
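A single pass of PIM, as just described, can be sketched as two random selection steps over the request matrix (a sketch under the text's matrix conventions, not the simulator's code):

```python
import random

def pim_pass(R, rng):
    """One pass of Parallel Iterative Matching over request matrix R
    (R[i][j] == 1 iff input buffer i requests output j); returns a
    grant matrix G."""
    n, m = len(R), len(R[0])
    # Step 1: each input buffer randomly keeps one of its requests,
    # neglecting the rest.
    Y = [[0] * m for _ in range(n)]
    for i in range(n):
        requested = [j for j in range(m) if R[i][j]]
        if requested:
            Y[i][rng.choice(requested)] = 1
    # Step 2: each output randomly grants one of the surviving requests.
    G = [[0] * m for _ in range(n)]
    for j in range(m):
        bidders = [i for i in range(n) if Y[i][j]]
        if bidders:
            G[rng.choice(bidders)][j] = 1
    return G
```

Whatever the random choices, the result is a legal matching: at most one grant per input, at most one per output, and every grant corresponds to a real request.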
Using the example above, the intermediate matrix Y randomly chosen in the first step might be as follows:

Y = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
\;\rightarrow\;
G = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
\text{ or }
\begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}

The right-most output in Y has more than one remaining request, and no matter which of these requests the output chooses, the overall bandwidth utilization is below optimal. This is illustrated by the only two grant matrices G that can result from Y: both have no request routed to the second output, and thus fail to fully utilize the crossbar bandwidth.

Input Speedup

The non-optimal nature of the PIM allocator can be alleviated by using multiple passes, allowing idle outputs to choose from the requests neglected in previous passes, at a cost of longer allocation latency. Another way to improve utilization is to use input speedup [17, Chapter 17], which splits each input buffer into multiple parts, each holding the messages destined for a subset of the outputs. In this thesis, we use a crossbar configuration with an input speedup of two: each input has two input buffers, one connecting exclusively to the odd outputs and the other to the even outputs.
The doubled input buffers reduce the chance of a request being neglected during the first step of PIM allocation, increasing the probability that an output will have a bidding request each cycle. Dally and Towles [17, Chapter 17] have shown that, with a random traffic pattern, an input speedup of two increases the bandwidth utilization of a PIM allocator from 64% to 87% for a 16 by 16 crossbar.

Input Buffers and Output Buffers

To minimize the effect of contention within the interconnect on the host system (either shader core or memory controller), incoming messages are first stored in the input buffer when they are pushed into the interconnection network (using the icnt_push() abstract interface). Inside the input buffer, messages are broken down into flits. A flit is a small data packet that can be transferred through the crossbar fabric in a single cycle. This allows the allocator to schedule the transfers of different messages in a fine-grained manner, achieving higher efficiency. At the output buffer, these flits are reassembled into their original messages and wait to be popped by the destination module.

In our implementation of the crossbar, we model the finite sizes of the input and output buffers using credits [17, Chapter 13]. Each buffer starts with an amount of credits equal to its maximum capacity in flits. When an incoming message is pushed into an input buffer, a number of credits equal to the number of flits the message occupies is deducted. As flits are transferred out of the input buffer, the credits are returned to the buffer. The same applies to the output buffers, where credits are deducted as flits arrive from the crossbar and returned when messages are popped. In both cases, running out of credits means the buffer is full.
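The credit bookkeeping just described can be sketched per buffer as follows (a sketch; GPGPU-Sim keeps this accounting inside its crossbar model):

```python
class CreditCounter:
    """Credit accounting for one finite flit buffer."""

    def __init__(self, capacity_flits):
        self.credits = capacity_flits  # free space remaining, in flits

    def can_accept(self, flits):
        # The check an icnt_has_buffer()-style query would perform.
        return self.credits >= flits

    def accept(self, flits):
        # Pushing a message deducts one credit per flit it occupies.
        assert self.can_accept(flits), "caller must check can_accept() first"
        self.credits -= flits

    def release(self, flits):
        # Flits leaving the buffer (or popped messages) return their credits.
        self.credits += flits
```

Zero credits means a full buffer, which is what lets congestion at one end of the crossbar propagate back to the other.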
Our current interconnection interface forces its user to check for the availability of input buffer space with the icnt_has_buffer() query, which checks the amount of credits available for the queried input buffer, before pushing messages into the interconnection network. A full buffer at an output, on the other hand, is handled internally by disabling the transfer of flits to that particular output, even if the allocator granted the request, until output buffer space is available again. This behaviour allows congestion at the destination to propagate back to the input. This back-propagation property is vital in modeling an interconnection system.

Bandwidth Modeling

The theoretical peak bandwidth of the modeled interconnection network is defined by the clock rate of the crossbar and the flit size. The clock rate of the crossbar is essentially the calling rate of icnt_transfer() in the simulator, which is currently identical to the shader core clock rate. Future work includes implementing a mechanism to model separate clock domains for the different modules.

5.1.3 DRAM Access Model

Due to the high computational throughput of a GPU architecture, memory bandwidth and latency often become the performance bottleneck. Capturing this behaviour is vital to the precision of a GPU simulator. ATTILA, a related cycle-accurate GPU simulator, models a simplified GDDR memory, accounting for the read-write transition penalty and bandwidth, while treating each memory module as a single bank [19]. This fails to capture the memory latency hiding enabled by multiple banks, which plays a significant role in bandwidth utilization [64]. Therefore, a detailed DRAM access model was created to capture this behaviour.
While an existing simulator, DRAMsim, provides an access model of slightly better accuracy than ours (modeling DRAM refreshes as well) [78], we did not know of its existence until our access model had been created.

Figure 5.3 presents an overview of the DRAM access model. It is divided into the access timing model, the request scheduler, and the address decoder. An incoming memory request is first processed by the address decoder, which breaks the request address down into chip ID, row, column and bank. This information, together with the request itself, is passed through the interconnection network to the appropriate memory controller (determined by the chip ID). Inside the memory controller, the request scheduler determines when the request will be sent to the access timing model. The access timing model then fulfills each incoming request by issuing the appropriate commands in accordance with the DRAM timing specification.

[Figure 5.3: DRAM access model — the address decoder splits each request address into chip ID, row, column and bank; the request scheduler dispatches requests inside the memory controller; the access timing model tracks per-bank state (active/idle, opened row), per-bank and per-chip constraint counters, a command issue unit, and a CL/WL delay queue for read/write data.]

Address Decoder

The address decoder translates a given memory address into multiple parts: chip ID, row, column and bank. The chip ID determines which of the DRAM chips the memory address is mapped to, while row, column and bank refer to structures inside each DRAM chip.

We have observed that address decoding has a significant impact on the performance of a system. For example, an application would be bottlenecked by memory bandwidth, no matter how many DRAM chips are available in the system, if all of its memory requests were mapped to a single DRAM chip.
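The chip-ID/row/column/bank split performed by such a decoder amounts to gathering the address bits selected by each field's mask into a compact value. A sketch (the masks in the test are illustrative only, not the mapping used in this thesis):

```python
def extract_field(address, mask):
    """Gather the address bits selected by `mask` (LSB first) into one
    compact field value, as a mask-driven address decoder would."""
    value, bit = 0, 0
    for pos in range(mask.bit_length()):
        if (mask >> pos) & 1:
            value |= ((address >> pos) & 1) << bit
            bit += 1
    return value

def decode(address, masks):
    """`masks` maps a field name (e.g. 'chip', 'row', 'col', 'bank')
    to its bit mask; returns the decoded fields."""
    return {name: extract_field(address, mask) for name, mask in masks.items()}
```

Because each field is defined only by a mask, nothing forces the selected bits to be contiguous, which is the flexibility the decoder interface is after.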
The importance of address mapping to the miss rate of a direct-mapped cache has been evaluated in [82].

To this end, the address decoding module is designed to be as flexible as possible for future research. The address mapping is defined by the user as a series of bit masks, one each for chip ID, row, column and bank. Each mask defines which of the address bits are mapped to which entry in the decoded address. For example, given the following set of bit masks:

chip ID  00000000 00000000 00011010 00000000
Row      00001111 11111111 00000000 00000000
Column   00000000 00000000 11100000 11111111
Bank     00000000 00000000 00000101 00000000

the address decoder creates the following mapping:

Bits 31-24:  X   X   X   X   R11 R10 R9 R8
Bits 23-16:  R7  R6  R5  R4  R3  R2  R1 R0
Bits 15-8:   C10 C9  C8  K2  K1  B1  K0 B0
Bits 7-0:    C7  C6  C5  C4  C3  C2  C1 C0

where RX refers to the Xth bit of the row address, CX to the Xth bit of the column address, BX to the Xth bit of the bank address, and KX to the Xth bit of the chip ID. The flexibility of this interface allows the user to specify address mappings that are not restricted to contiguous bits.

DRAM Request Scheduler

The request scheduler controls the order in which memory requests are dispatched to the individual DRAM banks. It detects free DRAM banks (those not servicing any memory request) and tries to dispatch a new request to each free bank. Currently, two scheduling policies are implemented in the request scheduler:

FIFO (First-Come-First-Serve): All requests are dispatched in the order they arrive at the memory controller. When the oldest request is pending on a busy bank, no other requests are dispatched, even if the banks they bid for are free.

FR-FCFS (First-Ready First-Come-First-Serve): Reorders the requests to prioritize those that access an already opened row. This allows multiple requests with row locality (accessing the same row) to be scheduled together, amortizing the overhead of a single row activation across these requests.
To take advantage of this, the access timing model is configured to leave the row open after serving a request. This is an implementation of the open-row scheduling policy proposed by Rixner et al. [64].

While we have chosen to use the FR-FCFS scheduling policy in this thesis, we acknowledge that there exist more advanced policies that take fairness and overall performance into account, such as the ones proposed by Nesbit et al. [55] and by Mutlu and Moscibroda [51]. We may implement these or other more sophisticated policies in the request scheduler in the future.

Access Timing Model

[Figure 5.4: (a) DRAM organization overview — N banks, each a memory array with its own sense amplifiers (row buffer), sharing a row decoder and column decoder; (b) simplified DRAM bank state diagram with Active and Idle states, with Row Activate, Bank Precharge and Column Access (Read/Write) transitions depending on whether the opened row matches the requested row.]

Figure 5.4(a) shows the organization of a modern DRAM that the access timing model captures. It consists of multiple memory arrays, called banks, each with its own set of sense amplifiers, which double as a row buffer storing the data of one of the rows in the memory array. All banks share a single row decoder and column decoder. To access data in one of the banks, the memory controller first issues a Row Activation command to retrieve the row containing the requested data from the memory array into the row buffer. It then issues a Column Access command to access the requested data in the row buffer. To access data in another row, the memory controller must issue a Bank Precharge command to write the contents of the row buffer back to the memory array, and then issue a Row Activation command for the new row.

Figure 5.4(b) captures the actions required to satisfy a single memory request in a DRAM bank as a state diagram with two states: Active and Idle.
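Ignoring timing, the two-state behaviour of Figure 5.4(b) reduces to choosing a command sequence from the bank state and the opened row; a minimal sketch:

```python
def commands_for(state, opened_row, requested_row):
    """Command sequence a bank needs to service one request, following
    the Active/Idle state behaviour described in the text."""
    if state == "idle":
        # No row open: activate the requested row, then access it.
        return ["activate", "column_access"]
    if opened_row != requested_row:
        # Row buffer conflict: close the old row first.
        return ["precharge", "activate", "column_access"]
    # Row hit: the open row can be accessed immediately.
    return ["column_access"]
```

The row-hit case is the one the FR-FCFS scheduler and the open-row policy try to make common, since it skips both the precharge and the activation.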
Essentially, whenever the opened row in a bank differs from the one required by the request the bank is servicing, the command issue unit precharges the bank, activates the requested row, and then services the request with column accesses. After the request is completed, the row is left open so that the bank can take advantage of row locality: subsequent requests to the same row can perform column accesses immediately, without reopening the row.

Figure 5.3 shows the details of the access timing model. The model keeps track of the current request that each bank is trying to service, and of the hardware state of each bank: active or idle, and, if active, which row is currently opened. This information is used by the command issue unit, in conjunction with the behaviour stated in Figure 5.4(b), to decide which command to issue for each bank. The timing constraints specified by the DRAM specification are modeled using two sets of constraint counters: one set representing timing constraints local to a bank (per-bank counters), and another representing timing constraints that apply to all banks within a single DRAM chip (per-chip counters).

Counter Name | Scope | Set by Command        | Constrained Command
RRD          | Chip  | Activate              | Activate
CCD          | Chip  | Read/Write            | Column Access (Read/Write)
WTR          | Chip  | Column Access (Write) | Column Access (Read)
RTW          | Chip  | Column Access (Read)  | Column Access (Write)
RCD          | Bank  | Activate              | Column Access (Read/Write)
RAS          | Bank  | Activate              | Precharge
RP           | Bank  | Precharge             | Activate
RC           | Bank  | Activate              | Activate

Table 5.1: Relationships between the constraint counters and DRAM commands.

Every cycle, the command issue unit traverses all the banks in a round-robin manner to look for a command that is ready to be issued (i.e., all of its constraint counters are zero). There is no priority among command types; the first command found to be ready is issued.
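The round-robin, counter-gated issue just described can be sketched as follows; the dict-based command encoding is illustrative, not GPGPU-Sim's internal representation:

```python
def try_issue(banks, chip_counters, start):
    """One cycle of the command issue unit: starting from bank `start`,
    issue the first pending command whose gating constraint counters
    (per-bank and per-chip) have all reached zero."""
    n = len(banks)
    for k in range(n):
        bank = banks[(start + k) % n]
        cmd = bank["cmd"]
        if cmd is None:
            continue
        gating = cmd["needs"]  # names of counters that must be zero
        if all(bank["counters"].get(c, 0) == 0 and
               chip_counters.get(c, 0) == 0 for c in gating):
            bank["cmd"] = None
            # Issuing a command sets new constraints for later commands,
            # in the spirit of Table 5.1.
            bank["counters"].update(cmd["sets_bank"])
            chip_counters.update(cmd["sets_chip"])
            return cmd["name"]
    return None

def tick(banks, chip_counters):
    """End of cycle: decrement every non-zero constraint counter;
    counters stay at zero until set again by another command."""
    for counters in [b["counters"] for b in banks] + [chip_counters]:
        for c in counters:
            if counters[c] > 0:
                counters[c] -= 1
```

For example, an activate in one bank that sets a chip-level RRD constraint blocks activates in other banks until the counter drains back to zero.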
After a command is issued, the per-chip counters and the per-bank counters of the banks serviced by the issued command are set to reflect the timing constraints created by that command. All counters are decremented every cycle until they reach zero, and stay at zero until set again by another command. Table 5.1 shows the relationships between the constraint counters and the commands.

When a column access command is issued, the data associated with the command does not have access to the bus until the CAS latency (CL) has elapsed. Within this period, another read/write command to the same row, or to another activated bank, may be issued, and these commands can be pipelined for the data bus access. This pipelining behaviour is modeled using a delay queue: a fixed-length queue shifting its contents towards its output every cycle. In this way, while a single command takes a number of cycles to go through the delay queue, a group of commands issued consecutively appears in the same order at the output of the queue. The length of the delay queue is changed dynamically from the CAS latency (CL) to the write latency (WL) when the bus switches from read to write, and vice versa. The burst length of a column access command is modeled by pushing two commands into the delay queue when the command is issued. All timing constraints and latencies in this timing model can be configured by the user via command line options.

5.1.4 Interfacing with sim-outorder

As described in Section 2.2 of Chapter 2, the SimpleScalar out-of-order core (modeling the CPU in our simulator) waits for the GPU when it reaches a parallel section. After GPU simulation of the compute kernel is completed, program control is returned to the out-of-order core. This repeats until the benchmark finishes.

The above behaviour is modeled in GPGPU-Sim by first providing the simulator with a list of the PC values at which the spawn calls to parallel sections occur in the benchmark.
Inside sim-outorder, the dispatch stage (where functional simulation is done in sim-outorder) is modified to check the target PC against this list of PC values whenever a function call is dispatched. If the target PC of the function call matches one of the PC values on the list, a parallel section is detected and launched on the GPU performance simulator. To prevent erroneously launching a parallel section during speculative execution after a branch misprediction, a parallel section can only be launched when a spawn instruction commits. Since the number of cycles an instruction takes to go through the superscalar pipeline is small relative to the time a parallel section takes to complete, in the simulator we launch the parallel section in the dispatch stage when the functional simulation indicates that the spawn instruction is on the correct path.

The steps involved in launching a parallel section on the GPU differ depending on the mode of simulation. Currently, we have implemented two modes of simulation in the GPU performance simulator: trace mode, and parallel functional simulation mode.

Trace Mode Simulation

In trace mode simulation, the parallel section to be launched on the GPU is first functionally simulated serially, as if each thread in the parallel section were just an iteration of a loop with the loop counter as the thread ID. Instruction traces (the PC value, instruction type and, if applicable, the address of the memory accessed, for each dynamic instruction executed) are collected for each thread during this functional simulation. These traces then drive the behaviour of the GPU performance simulator.

This simulation mode guarantees the functional correctness of the simulation, so that any imprecision in the GPU performance simulator will not have a drastic effect on the simulation results.
Also, the traces only need to be collected once and can then be reused for multiple performance simulations, potentially saving a significant amount of simulation time.

However, due to the sheer number of threads in a parallel section, the instruction traces become substantial in size (e.g., 16GB for HMMer) and cannot be completely loaded into memory during performance simulation. This problem can be addressed by streaming the instruction traces from disk on demand, with a large buffer to minimize the number of disk accesses required (which we implemented), but we found the performance simulation seriously bottlenecked on the network file system when executing simulations on a cluster in this mode. Moreover, simulating each thread serially fails to capture some important parallel aspects of the compute model. For example, it is impossible to simulate applications that make extensive use of the software-controlled "shared memory" with a simple serial functional simulation: one common usage case has a block of threads cooperatively loading data into the on-chip shared memory, then waiting at a synchronization barrier before operating on the data there. These two problems motivated the development of parallel functional simulation mode20.

20 Other alternatives include using trace compression, such as the scheme proposed by Liu and Asanovic [41], or a functional simulator that switches threads at synchronization barriers.

Parallel Functional Simulation Mode

Parallel functional simulation mode was implemented in the GPU simulator to address the problems encountered with trace mode simulation. Instead of collecting instruction traces before the performance simulation, as in trace mode, threads in a parallel section are functionally simulated at the decode stage of the GPU performance simulator.
In this way, the threads are scheduled and functionally simulated as if they were executed on the true parallel hardware, so that features such as barrier synchronization can be exploited by our benchmarks.

On the other hand, the GPU performance simulator now has to handle the functional correctness of the simulation as well. For example, consider a mechanism where instructions are nullified on a cache miss and reissued later. With trace mode simulation, the instruction reissue can be implemented by rolling back in the instruction trace. With parallel functional simulation mode, the performance simulator has to ensure that the same instruction is not executed twice, for functional correctness.

When launching a parallel section in parallel functional simulation mode, the contents of the out-of-order core's architectural register file are replicated into the register file of every thread. A different offset is added to the stack pointer of each thread to ensure that a register spill from one thread does not corrupt the stack of another. Also, because the PISA instruction set uses part of the out-of-order core's stack to pass parameters into function calls, this part of the stack is copied to the private stack of each thread so that the parameters can be accessed by every thread. Many of these issues relate to the properties of PISA itself, but they are nevertheless described here to illustrate the sort of functional correctness issues that need to be handled in parallel functional simulation mode.

Shared Memory

Shared memory is a scratchpad memory local to a block of threads in a multiprocessor in the CUDA programming model. While this feature is not used by the benchmarks evaluated in this thesis, a contribution of this thesis is its implementation, which allows CUDA applications written with this feature to be easily ported to GPGPU-Sim in the future. In the GPU performance simulator, the shared memory is accessed through a special address range: the address range of a statically allocated array in the benchmark. This address range is unique to each benchmark and needs to be specified explicitly by the user in the code information file.

5.2 Baseline Configuration

Table 5.2 shows the baseline configuration we simulated. The configuration is chosen to approximate the GeForce 8800GTX [39] as closely as possible, for a valid area overhead estimation presented in Chapter 7. Because GPGPU-Sim does not support a non-power-of-two number of memory channels, we have changed the shader core to DRAM clock ratio, lowered the width per channel, and increased the number of memory controllers, to ensure that the amount of bandwidth each shader core receives per cycle is equivalent to that of the GeForce 8800GTX (4 bytes/shader core/cycle). We use the GDDR3 timing parameters provided by Qimonda for their 512-Mbit GDDR3 Graphics RAM clocked at 650MHz [62]. While the GeForce 8800GTX does not have an L1 data cache, we use a long-latency L1 data cache to approximate a memory-side L2 cache. The sizes of the dynamic warp formation scheduler's internal structures listed in Table 5.2 are optimized for the best possible performance-to-area ratio.

5.3 Benchmarks

Table 5.3 gives a brief description of the benchmarks simulated for evaluating dynamic warp formation and scheduling in this thesis. The benchmarks are compiled with the SimpleScalar PISA GCC cross compiler. The compiled binaries are then analyzed by a set of scripts to extract the list of PC values of all the spawn calls to parallel sections, as well as the starting PC value of the shader program of each parallel section. This information is stored in a code information file unique to each benchmark, and is loaded into GPGPU-Sim at the start of the simulation. The scripts also analyze the compiled binaries to extract the immediate post-dominator of each branch instruction inside the shader program.
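Immediate post-dominators can be computed on the shader program's control flow graph by the standard iterative set-intersection algorithm over the reversed graph; the sketch below illustrates the idea and is not the thesis's actual analysis scripts:

```python
def immediate_postdominators(succ, exit_node):
    """Immediate post-dominators of a CFG given as `succ` (node -> list
    of successors) with a single exit node."""
    nodes = list(succ)
    everything = set(nodes)
    # pdom[n] = set of nodes that post-dominate n (including n itself).
    pdom = {n: ({exit_node} if n == exit_node else set(everything))
            for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n == exit_node:
                continue
            new = set(everything)
            for s in succ[n]:
                new &= pdom[s]  # must post-dominate every successor
            new.add(n)
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    ipdom = {}
    for n in nodes:
        strict = pdom[n] - {n}
        for c in strict:
            # The immediate post-dominator is the strict post-dominator
            # that every other strict post-dominator also post-dominates.
            if all(d == c or d in pdom[c] for d in strict):
                ipdom[n] = c
                break
    return ipdom
```

For a simple if/else, the immediate post-dominator of the branch is the join block, which is exactly the reconvergence point the PDOM mechanism uses.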
The PC values of these branch instructions and their immediate post-dominators are also stored in the code information file, and are loaded into GPGPU-Sim at the start of the simulation to emulate the effect of adding an extra field to the branch instruction specifying the reconvergence point.

Shader Core
  Shader Core ALU Frequency: 650 MHz
  # Shader Cores: 16
  SIMD Warp Size: 32 (issued over 4 clocks)
  SIMD Pipeline Width: 8
  # Threads per Shader Core: 768
Memory System
  # Memory Modules: 8
  DRAM Controller Frequency: 650 MHz
  GDDR3 Memory Timing: tCL=9, tRP=13, tRC=34, tRAS=21, tRCD=12, tRRD=8
  Bandwidth per Memory Module: 8 bytes/cycle (5.2 GB/s)
  Memory Controller: out of order (FR-FCFS)
  Data Cache Size (per core): 512KB, 8-way set assoc., 64B lines
  # Data Cache Banks (per core): 16
  Data Cache Hit Latency: 10 cycles (pipelined, 1 access/thread/cycle)
Dynamic Warp Formation Hardware
  Default Warp Scheduling Policy: Majority
  PC-Warp LUT: 64 entries, 4-way set assoc.
  MHeap LUT: 128 entries, 8-way set assoc.
  Max-Heap: 64 entries

Table 5.2: Hardware configuration.

Benchmark | From Suite | Description | Branch Divergence
HMMer | SPEC CPU2006 | A modified version of 456.hmmer: protein sequence analysis using profile hidden Markov models. | High
LBM | SPEC CPU2006 | A modified version of 470.lbm: implements the "Lattice-Boltzmann Method" to simulate incompressible fluids in 3D. | Medium
Black | CUDA | Black-Scholes option pricing: evaluates fair call and put prices for a given set of European options by the Black-Scholes formula. | Low
Bitonic | CUDA | Bitonic sort [5]: a simple parallel sorting algorithm. We have extended the version from the CUDA website to work with large arrays (by not using the shared memory). | High
FFT | SPLASH | Complex 1D FFT: we have modified the benchmark so that multiple 1M-point arrays are processed by 12288 threads. | Medium
LU | SPLASH | Blocked LU decomposition: each thread is responsible for the decomposition inside one block. | High
Matrix | — | A simple version of matrix multiply: each thread independently computes the result for one data element of the destination matrix. | None

Table 5.3: Benchmark description.

Chapter 6

Experimental Results

First we consider dynamic warp formation and scheduling with the detailed implementation described in Section 4.2 of Chapter 4, using the Majority scheduling policy with the detailed max-heap implementation described in Section 4.4 of Chapter 4. The hardware is sized as in Table 7.1 (on page 69) and employs the thread swizzling mechanism described in Section 4.1 of Chapter 4. This implementation uses the lane-aware scheduling mechanism discussed in Sections 4.1 and 4.2 of Chapter 4. We also model bank conflicts at the data cache.

Figure 6.1 shows the performance of the different branch handling mechanisms discussed in this thesis and compares them to a MIMD pipeline with the same peak IPC capability. Here we use the detailed simulation model described in Chapter 5, including simulation of memory access latencies. On average (harmonic mean, HM), PDOM (reconvergence at the immediate post-dominator, described in Section 3.2) achieves a speedup of 44.9% versus not reconverging (NREC). Dynamic warp formation (DWF) achieves a further speedup of 47.4% using the Majority scheduling policy. The average DWF and MIMD performance differ by only 9.5%. The 47.4% speedup of DWF versus PDOM justifies the 8% area cost (see Chapter 7) of adding dynamic warp formation and scheduling to existing GPUs. We note that a speedup of this magnitude could not be obtained by simply spending the additional area on extra shader cores.

For benchmarks with little diverging control flow, such as FFT and Matrix, DWF performs as well as PDOM, while MIMD outperforms both of them owing to its "free running" nature. The significant slowdown of Black-Scholes (Black) under DWF is a phenomenon exposing a weakness of our default Majority scheduling policy.
This will be examined in detail in Section 6.2. For most of the benchmarks with significant diverging control flow (HMMer, LBM, Bitonic, LU), MIMD performs the best, and DWF achieves a significant speedup over PDOM. Among them, DWF achieves a speedup for Bitonic and LU purely due to better branch divergence handling, while for HMMer, DWF achieves a speedup in part due to better cache locality as well, as observed from Table 6.2.

Figure 6.1: Performance comparison of NREC, PDOM, and DWF versus MIMD.

While LBM also has significant diverging control flow, it is memory bandwidth limited, as shown in Table 6.1, and therefore sees little gain from DWF. Although the cache miss rate of Bitonic is higher than that of LBM, its much higher pending hit rate significantly lowers the memory bandwidth requirement of this benchmark (see Table 6.3).

6.1 Effects of Scheduling Policies

Figure 6.2 compares all the warp scheduling policies described in Section 4.3 of Chapter 4, with a realistic memory sub-system, but ignoring the impact of lane conflicts (e.g., the hardware in Figure 4.2(b) with unlimited register file ports) to show the potential of each policy. Overall, the default Majority (DMaj) policy performs the best, achieving an average speedup of 66.0%, but in some cases its performance is not as good as the PC policy (DPC) or the PDOM Priority policy (DPdPri) described in Section 4.3 of Chapter 4.

To provide additional insight into the differences between the scheduling policies, Figure 6.3 shows the distribution of warp sizes issued each cycle for each policy. Each bar is divided into segments labeled W0, W1-4, ..., W29-32, which indicate whether the SIMD hardware executed operations for 0, (1 to 4), ..., (29 to 32) scalar threads on a given cycle. "Stall"
indicates a stall due to writeback contention with the memory system (see Figure 2.3(b) on page 15).

Table 6.1: Memory bandwidth utilization.
          HMMer    LBM      Black   Bitonic  FFT      LU      Matrix
  PDOM    57.47%   94.71%   8.18%   50.94%   75.43%   0.84%   84.02%
  DWF     63.61%   95.04%   6.47%   71.02%   74.89%   1.47%   63.71%
  MIMD    64.37%   94.99%   8.06%   80.48%   80.99%   1.52%   98.26%

Table 6.2: Cache miss rates (pending hits classified as misses).
          HMMer    LBM      Black   Bitonic  FFT      LU      Matrix
  PDOM    13.37%   14.15%   1.52%   21.25%   15.52%   0.05%   7.88%
  DWF      5.05%   14.56%   1.50%   30.67%   14.68%   0.06%   9.71%
  MIMD     3.75%   15.30%   1.20%   30.63%   14.18%   0.05%   7.86%

Table 6.3: Cache miss rates without pending hits.
          HMMer    LBM      Black   Bitonic  FFT      LU      Matrix
  PDOM    13.21%   13.55%   0.21%   3.86%    11.77%   0.03%   5.72%
  DWF      5.03%   13.57%   0.21%   3.84%    12.53%   0.04%   4.23%
  MIMD     3.74%   13.35%   0.21%   3.83%    12.92%   0.04%   5.75%

(A pending hit occurs when a memory access misses the cache but can be merged with one of the in-flight memory requests already sent to the memory subsystem.)

For policies that do well (DMaj, DPdPri, DPC), we see a decrease in the number of low occupancy warps relative to those policies which do poorly (DMin, DTime). Cycles with no scalar thread executed (W0) are classified into "Mem" (W0_Mem) and "Idle" (W0_Idle). W0_Mem denotes that the scheduler cannot issue any scalar thread because all threads are waiting for data from global memory. W0_Idle denotes that all warps within a shader core have already been issued to the pipeline and are not yet ready for issue. These "idle" cycles occur because threads are assigned to shader cores in blocks. In our current simulation, the shader cores have to wait for all threads in a block to finish before beginning with a new block. The warp size distribution for Black-Scholes reveals that one reason for the Majority (DMaj) policy's poor performance on this benchmark is a significant number of "Idle" cycles, which can be mostly eliminated by the PDOM Priority (DPdPri) policy.
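The per-cycle occupancy classification used in these distributions can be sketched in a few lines of code. This is our own illustrative Python, not part of GPGPU-Sim; the bucket names come from the figure, while the function names and argument conventions are ours:

```python
def classify_cycle(n_issued, stalled=False, mem_pending=False):
    """Bucket one SIMD issue cycle the way Figure 6.3 does.  This helper
    is our own illustration; only the bucket names come from the figure."""
    if stalled:
        return "Stall"                  # writeback contention
    if n_issued == 0:
        return "W0_Mem" if mem_pending else "W0_Idle"
    hi = ((n_issued + 3) // 4) * 4      # buckets of width 4: W1-4 .. W29-32
    return f"W{hi - 3}-{hi}"

def occupancy_histogram(cycles):
    """cycles: iterable of (n_issued, stalled, mem_pending) tuples."""
    hist = {}
    for n, stalled, mem in cycles:
        label = classify_cycle(n, stalled, mem)
        hist[label] = hist.get(label, 0) + 1
    return hist
```

For example, a cycle that issues 5 scalar threads falls in the W5-8 bucket, and an issue-free cycle with outstanding memory requests is counted as W0_Mem.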
The data also suggest that it may be possible to further improve dynamic warp formation by exploring scheduling policies beyond those proposed in this thesis.

Figure 6.2: Comparison of warp scheduling policies. The impact of lane conflicts is ignored.

Figure 6.3: Warp size distribution.

6.2 Detailed Analysis of Majority Scheduling Policy Performance

The significant slowdown of Black-Scholes (Black) results from several phenomena. First, in this work we restrict a shader core to execute a single block. Second, we use software subroutines for transcendentals, and these subroutines contain branches that diverge. Third, a quirk in our default Majority scheduling policy can starve some of the threads in a shader core. We explore the latter phenomenon in detail in this section. Figure 6.4 shows the runtime behaviour (IPC and incomplete thread count) of a single shader core using dynamic warp formation with the Majority scheduling policy. (A thread is incomplete if it has not reached the end of the kernel call.)

Under the Majority scheduling policy, threads which have different control flow behaviour
from the rest of the threads can be starved during execution. Black-Scholes has several rarely executed, short basic blocks that suffer from this effect, leaving behind several groups of minority threads. (We say a thread is a "minority" thread if its PC value is consistently given a low priority by the Majority scheduling policy throughout its execution.) When these minority threads finally execute after the majority of threads have finished, they form incomplete warps, and the number of warps formed is insufficient to fill up the pipeline (in our simulations, each thread is only allowed to have one instruction executing in the pipeline, as described in Section 2.4 of Chapter 2). This behaviour is illustrated in Figure 6.4(d), which shows the dynamic instruction count of each thread.

Figure 6.4: Dynamic behaviour of Black-Scholes using DWF with the Majority scheduling policy in a single shader core (max IPC of 8). The warp size distribution time series in (c) uses the same classification as in Figure 6.3.

After several
branches diverged at A and B, groups of threads are starved (indicated by the stagnant dynamic instruction counts in Figure 6.4(d)), and only resume after C, when the majority groups of threads have finished their execution (indicated by the lower thread count after cycle 120000 in Figure 6.4(b)). This is responsible for the low IPC of DWF after cycle 120000 in Figure 6.4(a).

Meanwhile, although the majority of threads proceed ahead after the branch divergences at A and B, the pipeline is not completely utilized (indicated by the IPC < 8 from cycle 80000 to 120000 in Figure 6.4(a)) due to the existence of incomplete warps (see the warp size distribution time series in Figure 6.4(c)). These incomplete warps are formed because the number of threads taking the same execution path is not a multiple of the SIMD width. They could have been combined with the minority threads after the diverging branch to minimize this performance penalty, but this does not happen when the minority threads are starved by the Majority policy.

Figure 6.5: Dynamic behaviour of Black-Scholes using DWF with the PDOM Priority scheduling policy.

Figure 6.5 shows how both of these problems can be mitigated by a different scheduling policy: PDOM Priority. The dynamic instruction count in Figure 6.5(d) shows that any thread starvation due to divergence is quickly corrected by the policy and all the threads have a similar rate of progress.
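The starvation effect can be reproduced with a deliberately simplified model. The sketch below is our own illustration, not the simulator: every thread sits on one fixed execution path (standing in for its PC), the Majority policy always issues the most populous path, and a round-robin over the distinct paths stands in for a fairness-restoring policy such as PDOM Priority:

```python
from collections import Counter

def run(policy, paths, steps_needed, warp_size=8):
    """Toy starvation model: each thread needs `steps_needed` issue slots
    to finish.  Each scheduler cycle, `policy` picks one path and a warp
    of up to `warp_size` threads on that path is issued.  Returns the
    per-cycle progress of every thread."""
    progress = [0] * len(paths)
    trace = []
    cycle = 0
    while any(p < steps_needed for p in progress):
        # Count unfinished threads per path (the scheduler's view).
        pending = Counter(paths[i] for i, p in enumerate(progress)
                          if p < steps_needed)
        pc = policy(pending, cycle)
        issued = 0
        for i, path in enumerate(paths):
            if path == pc and progress[i] < steps_needed and issued < warp_size:
                progress[i] += 1
                issued += 1
        trace.append(tuple(progress))
        cycle += 1
    return trace

def majority(pending, cycle):
    # Always issue the most populous path: minority paths can starve.
    return pending.most_common(1)[0][0]

def round_robin(pending, cycle):
    # Fairness stand-in for PDOM Priority: rotate over distinct paths.
    return sorted(pending)[cycle % len(pending)]
```

With seven threads on path A and one on path B (run(majority, ['A']*7 + ['B'], 3)), the B thread's progress stays at 0 until every A thread retires, mirroring the stagnant dynamic instruction counts in Figure 6.4(d); round_robin interleaves the two paths instead.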
The IPC stays at 8 for most of the execution (see Figure 6.5(a)), except when the policy is giving higher priority to the incomplete warps inside the diverged execution paths (shown in Figure 6.5(b)) so that they can pass through the diverged execution paths quickly and form complete warps at the end of the diverged paths. Overall, threads in a block finish uniformly, as shown in Figure 6.5(b), indicating that starvation does not happen with this scheduling policy.

6.3 Effect of Limited Resources in Max-Heap

Figure 6.6: Performance comparison of a resource-limited implementation of the Majority scheduler (DWF-MaxHeap) with an ideal, infinite-resource implementation (DWF-Inf.Rsrc).

Figure 6.6 shows the performance increase achievable by dynamic warp formation with an infinite amount of hardware resources (infinite ports and unlimited entries for the MHeap LUT and Max-Heap) given to the Majority scheduler. This unbounded version of the Majority scheduler has a speedup of 6.1% over its resource-limited counterpart, and is 56.3% faster than PDOM on average. This 6.1% speedup comes entirely from HMMer, which is both a control flow and data flow intensive benchmark.

Figure 6.7: Performance of dynamic warp formation evaluating the impact of lane aware scheduling and accounting for lane conflicts and scheduler implementation details. (a) Using the Majority scheduling policy. (b) Using the PDOM Priority scheduling policy. LAS = Lane Aware Scheduling; LC = Lane Conflict.
These properties of HMMer result in a large number of in-flight PC values among threads, as memory accesses from a warp following a branch divergence can introduce variable slowdown among the threads in a warp, ranging from tens to hundreds of cycles. The large number of in-flight PC values in turn requires a large Max-Heap, which takes more swaps to re-balance every scheduler cycle and introduces stalls when the limited bandwidth resources are exhausted. With the other benchmarks, however, the resource-limited version of the Majority scheduling policy logic is sufficient to achieve similar performance.

6.4 Effect of Lane Aware Scheduling

To reduce the register file design complexity of DWF we have chosen to use the organization in Figure 4.2(d) on page 26, which necessitates the use of lane aware scheduling, discussed in Sections 4.1 and 4.2 of Chapter 4. The data in Figure 6.7(a) compares the detailed scheduler hardware model we have considered so far with an idealized version of dynamic warp formation and scheduling that ignores the impact of lane conflicts and hardware resources (DWF-DMaj no LC). This figure shows the performance impact of lane aware scheduling and, for comparison, also shows the impact when not using lane aware scheduling but assuming the register file organization in Figure 4.2(b) and modeling register bank conflicts when multiple threads from the same "home" lane are grouped into a single warp (DWF-DMaj w/o LAS). While the idealized version of DWF is on average 66.0% faster than PDOM, the realistic implementation of DWF we have considered so far is able to achieve 89% of the idealized DWF's performance. The average performance of DWF improves without lane aware scheduling because we did not impose a hardware resource limit on the scheduler in that configuration.

We also evaluated the performance of the PDOM Priority policy (DPdPri) with lane aware scheduling (see Figure 6.7(b)), and found that performance for Black-Scholes improves (IPC of 98.9 vs.
IPC of 89.3 for DMaj) while that of HMMer is reduced (IPC of 11.9 vs. IPC of 23.7 for DMaj), with an overall average speedup of 30.0% over PDOM. Thus, DMaj is the best scheduling policy overall.

6.5 Effect of Cache Bank Conflict

Our baseline architecture model assumes a multi-ported data cache that is implemented using multiple banks of smaller caches. Parallel accesses from a warp to different lines in the same bank create cache bank conflicts, which can only be resolved by serializing the accesses and stalling the SIMD pipeline.

With a multi-banked data cache, a potential performance penalty for dynamic warp formation is that the process of grouping threads into new warps dynamically may introduce extra cache bank conflicts. We have already taken this performance penalty into account by modeling cache bank conflicts in our earlier simulations. However, to evaluate the effect of cache bank conflicts on the different branch handling mechanisms, we rerun the simulations with an idealized cache allowing threads within a warp to access arbitrary lines within the cache in parallel.

Figure 6.8: Performance of PDOM, DWF and MIMD with cache bank conflicts. "no$Bk" = ignoring cache bank conflicts.

The data in Figure 6.8 compares the performance of PDOM, DWF and MIMD ignoring cache bank conflicts. It indicates that cache bank conflicts have more effect on benchmarks that are less constrained by the memory subsystem (Black-Scholes, LU and Matrix). Overall, the performance of these benchmarks increases by a similar amount with an ideal multi-ported cache regardless of the branch handling mechanism in place.
Hence, we find DWF still has a speedup of 47% over PDOM.

6.6 Sensitivity to SIMD Warp Size

While DWF provides a significant speedup over PDOM with our default hardware configuration, we can gain further intuition into its effectiveness in handling branch divergence as SIMD warp size increases. Figure 6.9 shows the performance of each mechanism for three configurations with warp size increasing from 8 to 32. Notice that the width of the SIMD pipeline is not changed, so that a decrease in warp size translates to a shorter execution latency for each warp. (As described in Section 2.5 of Chapter 2, a warp with 32 threads takes 4 cycles to execute on an 8-wide SIMD pipeline, whereas a warp with 8 threads can be executed in a single cycle on the same pipeline.)

Figure 6.9: Performance comparison of PDOM, DWF and MIMD with a realistic memory subsystem as SIMD warp size increases. The theoretical peak IPC remains constant for all configurations.

None of the benchmarks benefit from SIMD warp size increases. This is expected, as increasing warp size in an area constrained fashion, as described above, does not increase the peak throughput of the architecture while it constrains control flow and thread scheduling flexibility. The benefit of increasing SIMD warp size is to improve computational density by relaxing the scheduler's latency requirement, reducing the number of warps to schedule (simplifying scheduler hardware) and lowering instruction cache bandwidth requirements.

The slowdown of MIMD in various benchmarks as SIMD warp size increases is the result of a longer scheduler cycle, which places some of the constraints of SIMD scheduling onto the MIMD scheduler.
For example, with a warp size of 32, after the MIMD scheduler has issued fewer than 32 threads in a scheduler cycle (the other threads being unissuable because they are waiting for data from memory), it has to wait until the next scheduler cycle (4 pipeline cycles later) to issue more threads to the pipeline, even if more threads become available for execution in the meantime. For benchmarks that are constrained by the memory subsystem (HMMer, LBM and Bitonic), this effect does not cause any significant slowdown for MIMD, as the scheduler frequently runs out of threads that are ready to be issued due to memory access latency.

Overall, as SIMD warp size increases from 8 to 32, the average performance of PDOM decreases by 24.9%, while the overall performance of DWF and MIMD decreases by 11.2% and 2.0% respectively. Most of the slowdown experienced by PDOM as SIMD warp size is increased is attributed to the control flow intensive benchmarks (HMMer, Bitonic, and LU), while these slowdowns are alleviated with DWF. This trend shows that branch divergence becomes a more serious performance bottleneck in control flow intensive applications as SIMD warp size is increased to improve computational density, but a significant portion of this performance loss can be regained using dynamic warp formation and scheduling.

6.7 Warp Pool Occupancy and Max-Heap Size

Table 6.4: Maximum warp pool occupancy and max-heap size (in #entries) for each benchmark.
                             HMMer  LBM  Black  Bitonic  FFT  LU  Matrix
  Max(Warp Pool Occupancy)   90     25   42     39       62   40  23
  Max(Max-Heap Size)         44     16   10     8        24   11  7

Figure 6.10: Cumulative distribution of warp pool occupancy for each benchmark.

Table 6.4 shows the maximum warp pool occupancy (the number of warps inside the warp pool) and the maximum dynamic size of the max-heap for each benchmark.
As shown in the table, all of the benchmarks, including the control flow and memory intensive HMMer, use less than 1/6 of a warp pool sized to the worst case requirement. This indicates that it should be possible to use a warp pool with 128 entries (1/6 of the 768-entry warp pool designed for the absolute worst case) without causing any performance penalty. The same argument can be used to justify that 64 entries are sufficient for the max-heap used in implementing the Majority scheduling policy.

Furthermore, Figure 6.10 shows the cumulative distribution of warp pool occupancy for each benchmark used in this thesis. This distribution indicates that it may be feasible to reduce the size of the warp pool to 64 entries, and handle the <0.15% warp pool overflow (or 10% if we reduce the warp pool to 32 entries) by spilling existing warp pool entries to global memory. As our simulator does not currently model this, we do not use this assumption in the area estimation in Chapter 7 and leave it to future work.

Chapter 7
Area Estimation

The total area cost for dynamic warp formation and scheduling is the amount of hardware added for the logic in Figure 4.3 (page 28) plus the overhead of an independent decoder for each register file bank. We have estimated the area of the five major parts of the hardware implementation of dynamic warp formation and scheduling with CACTI 4.2 [72]: warp update registers, PC-warp LUT, warp pool, warp allocator, and scheduling logic. Table 7.1 lists the implementation of these structures and their area and storage estimates. We use our baseline configuration (32-wide SIMD with 768 threads) listed in Table 5.2 to estimate the size of the PC-Warp LUT and MHeap LUT. Both are set-associative cache-like structures with two read ports and two write ports, capable of two warp lookups in parallel to handle requests from the two diverged parts of a single incoming warp.
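The bookkeeping these structures perform for the Majority policy — tracking, for each in-flight PC, how many threads currently sit at it, and surfacing the most populous PC first — can be sketched in software. This is our own toy model, not the hardware design: a dictionary stands in for the MHeap LUT, and a binary heap with lazy deletion of stale entries stands in for the hardware Max-Heap's re-balancing:

```python
import heapq

class MajorityPicker:
    """Toy software model (ours) of the Majority scheduler's bookkeeping:
    `count` plays the role of the MHeap LUT (PC -> thread count) and a
    binary heap with lazy deletion of stale entries stands in for the
    hardware Max-Heap's re-balancing."""

    def __init__(self):
        self.count = {}   # PC -> number of threads currently at that PC
        self.heap = []    # (-count, PC) entries; some may be stale

    def add_threads(self, pc, n=1):
        self.count[pc] = self.count.get(pc, 0) + n
        heapq.heappush(self.heap, (-self.count[pc], pc))

    def remove_threads(self, pc, n=1):
        self.count[pc] -= n
        if self.count[pc] > 0:
            heapq.heappush(self.heap, (-self.count[pc], pc))
        else:
            del self.count[pc]

    def pick(self):
        """Return the most populous PC, or None if no threads remain."""
        while self.heap:
            neg, pc = self.heap[0]
            if self.count.get(pc, 0) != -neg:   # stale: count has changed
                heapq.heappop(self.heap)
                continue
            return pc
        return None
```

A hardware max-heap re-balances with swaps on every update instead of lazy deletion, but the scheduling decision it exposes is the same: the PC with the largest thread count comes out first.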
Based on the maximum occupancy data in Section 6.7 of Chapter 6, we use a 128-entry Warp Pool and a 64-entry Max-Heap. The Max-Heap is implemented with a memory array with two read ports and two write ports, as discussed in Section 4.4 of Chapter 4, and the Warp Pool is implemented using the banked structure described in Section 4.2.1 of Chapter 4.

For this area estimation, we have chosen to use a 10-bit thread ID, which should be sufficient to identify the 768 threads in each shader core. We acknowledge that more bits may be needed when shader cores can execute multiple blocks simultaneously, as in the GeForce 8800GTX, but their impact on the overall area is not significant. We assumed that a 24-bit integer is sufficient to store the PC value for each thread, as the maximum number of instructions per kernel call in CUDA is limited to 2 million instructions [58]. While this can be represented with a 21-bit number, we have allocated a few extra bits to account for the possibility of the GPU executing multiple parallel sections concurrently (for example, the GeForce 8 Series can execute vertex, geometry, and pixel shaders in parallel).

Table 7.1: Area estimation for dynamic warp formation and scheduling. RP = Read Port, WP = Write Port.

  Warp Update Registers: 2 entries of TID (10-bit) × 32, PC (24-bit), REQ (32-bit); 752 bits; registers (no decoder); 0.0080 mm2
  PC-Warp LUT: 64 entries of PC (24-bit), OCC (32-bit), IDX (7-bit); 4032 bits; 4-way set-assoc. memory (2 RP, 2 WP); 0.2633 mm2
  Warp Pool: 128 entries of TID (10-bit) × 32, PC (24-bit), Scheduler Data (10-bit); 44928 bits; memory array, 33 banks (1 RP, 1 WP); 0.6194 mm2
  Warp Allocator: 128 entries of IDX (7-bit); 896 bits; memory array; 0.0342 mm2
  MHeap LUT: 128 entries of PC (24-bit), MH# (7-bit); 4992 bits; 8-way set-assoc. memory (2 RP, 2 WP); 0.4945 mm2
  Max-Heap: 64 entries of PC (24-bit), CNT (10-bit), WPH (8-bit), WPT (8-bit), LUT# (8-bit); 3712 bits; memory array (2 RP, 2 WP); 0.2159 mm2
  Total: 59312 bits; 1.6353 mm2

Referring to Figure 4.3, 32 bits are required by both
REQ and OCC to represent the lanes in a 32-wide SIMD warp. Both the IDX field in the PC-Warp LUT and the one in the Warp Allocator use only 7 bits, because this is sufficient to address all 128 entries in the Warp Pool. Referring to Figure 4.5, while WPH and WPT in the Max-Heap also index into the Warp Pool, they both require an extra valid bit. The same argument applies to the Scheduler Data in the Warp Pool, which translates to the "Next" pointer in the Majority scheduling policy. LUT# and MH# in Figure 4.5 both also use an extra bit to represent an invalid index. Finally, the CNT field in the Max-Heap has to be 10 bits to account for the case when all 768 threads in a shader core have the same PC.

Overall, we have estimated the area of the dynamic warp scheduler in a 90nm process technology to be 1.635 mm2 per core. The exact CACTI parameters we used for each structure are listed in Table 7.2. Notice that we have skipped listing the CACTI parameters for the Warp Update Registers because they are too small to be evaluated in CACTI. The parameter entries for the Warp Pool are separated into those for the TID banks and those for the PC bank. As discussed in Section 4.2.1 of Chapter 4, each Warp Pool is comprised of 32 TID banks and 1 PC bank.

Table 7.2: CACTI parameters for estimating structure sizes in Table 7.1. RWP = Read/Write Port, RP = Read Port, WP = Write Port, SERP = Single Ended Read Port.

  Structure             Model  Size(B)  Line(B)  Assoc.  #Banks  Tech.  #RWP  #RP  #WP  #SERP  #BitsOut  #TagBits  Type
  PC-Warp LUT           Cache  512      8        4       1       90nm   0     2    1    0      48        24        Fast
  Warp Pool (TID Bank)  SRAM   160      10       N/A     N/A     90nm   0     1    1    0      10        N/A       Fast
  Warp Pool (PC Bank)   SRAM   704      11       N/A     N/A     90nm   0     1    2    0      42        N/A       Fast
  Warp Allocator        SRAM   128      8        N/A     N/A     90nm   0     1    1    0      14        N/A       Fast
  MHeap LUT             Cache  1024     8        8       1       90nm   0     2    2    0      10        24        Fast
  Max-Heap              SRAM   512      8        N/A     N/A     90nm   0     2    2    0      64        N/A       Fast
To evaluate the overhead of having the individual decoders for dynamic warp formation and scheduling as described in Chapter 4, we first need to estimate the size of the register file. The SRAM model of CACTI 4.2 [72] estimates a register file with 8192 32-bit registers and a single decoder reading a row of eight 32-bit registers to be 3.863 mm2. After trying various line sizes, we found CACTI 4.2 predicts that a line size of 512 bytes results in the minimum area. On the other hand, since the SRAM model of CACTI 4.2 does not support banking directly, we estimate the area of the banked organization by combining eight identical copies of a register file of 1024 32-bit registers. The register file area is then estimated to be 0.5731 mm2 × 8 = 4.585 mm2. For this area estimation, we have divided the 512-byte line in the single decoder estimation into eight 64-byte lines, one in each bank. Notice that both register file configurations have 2 read ports and 1 write port, and each port accesses 32 bits. The area difference between the two register files is 0.722 mm2. This is our estimate of the area requirement for adding a decoder to every register file bank to support DWFS. We acknowledge that this method may under-estimate the area requirement for adding a decoder to every register file bank, since it may not entirely capture all wiring complexity, as the decoders are driven by different sources in our proposal. However, this is already proportionally larger than the 11% overhead estimated by Jayasena et al. [30] with CACTI and a custom floorplan for implementing a stream register file with indexed access, which also requires multiple decoders per register file (11% of 3.863 mm2 is only 0.425 mm2, which is smaller than our 0.722 mm2). Ultimately, without an actual layout of a GPU, the 0.722 mm2 overhead is our best estimate.
Combining the two estimates above, the overall area consumption of dynamic warp formation and scheduling for each core is 1.6353 + 0.722 = 2.357 mm2. With 16 cores per chip as per our baseline configuration, this becomes 37.72 mm2, which is 8.02% of the total area of the GeForce 8800GTX (470 mm2) [39].

Chapter 8
Related Work

This chapter discusses research related to this thesis. Section 8.1 discusses various mechanisms for supporting control flow on SIMD hardware. Section 8.2 explores two prior proposals that involve dynamically regrouping scalar threads into SIMD instructions. Section 8.3 discusses prior proposals for eliminating branch divergence in SIMD hardware.

On average, dynamic warp formation and scheduling performs significantly better than most existing mechanisms for supporting control flow on SIMD hardware. It is applicable to a wide range of SIMD architectures with fine-grained multithreading and does not require any software modifications.

8.1 SIMD Control Flow Handling

This section discusses existing mechanisms for supporting control flow on SIMD hardware. We classify these mechanisms into three groups: guarded instructions, control flow reconvergence, and conditional streams.

8.1.1 Guarded Instructions/Predication

While supporting branches is a relatively new problem for GPU architectures, it has long been a consideration in the context of traditional vector computing. Most of the approaches to supporting branches in a traditional SIMD machine have centered around the notion of guarded instructions [8].

A guarded instruction, also known as a predicated or vector masked instruction, is an instruction whose execution is dependent on a conditional mask controlled by another instruction [8]. If the conditional mask is set appropriately, the predicated instruction will not update architectural state. In a masked SIMD instruction, a vector of conditional masks, each controlled by an element in a stream, is functionally equivalent to a data dependent branch. This approach has been employed by existing GPU architectures to eliminate short branches and potential branch divergence [3, 58]. However, for longer branches and loops, guarded instructions are inefficient because the SIMD hardware has to execute every execution path, regardless of whether it is taken by any of the SIMD elements.

This inefficiency is mitigated by a proposed technique called branches on superword condition codes (BOSCCs) [68]. BOSCC permits a sequence of consecutive vector instructions guarded by the same vector predicate to be bypassed if all fields in the guarding vector predicate are false. Shin et al. [68] examine compiler generation of BOSCCs for handling control flow for SIMD instruction set extensions and show that it is effective for some kernels, but do not explore the scalability of the approach to long SIMD vector widths.

8.1.2 Control Flow Reconvergence Mechanisms

Guarded instructions and their variants put constraints on input dependent loops. Branch divergence may be inevitable, but the period of divergence can be kept short with reconvergence to minimize the performance lost due to unfilled SIMD pipelines. A patent filed by Lindholm et al. describes in detail how threads executing in a SIMD pipeline are serialized to avoid hazards [49], but does not indicate the use of reconvergence points to recover from such divergence. The notion of reconvergence based on control flow analysis in SIMD branch handling was described in a patent by Lorie and Strong [43]. However, this patent proposes to insert the reconvergence point at the beginning of a branch and not at the immediate post-dominator as proposed in this thesis.

8.1.3 Conditional Streams

Kapasi et al.
[32] introduce conditional streams, a code transformation that creates multiple kernels from a single kernel with conditional code and connects these kernels via inter-kernel communication to increase the utilization of a SIMD pipeline. While more efficient than predication, it may only be practical to implement on architectures with a software managed on-chip memory designed for inter-kernel communication, such as the stream register file on Imagine [63] and Merrimac [18]. With conditional streams, each kernel ends at a conditional branch, which splits a data stream into two, each to be processed by one of the two subsequent kernels (each representing an execution path from the conditional branch). The data streams are then stored to the stream register file, and a filter restricts each data stream to only be loaded when its destined kernel is executing. When the two kernels representing the execution paths after the conditional branch have finished processing all the data elements, a "merger kernel" merges the two diverged data streams back into one and proceeds to process them uniformly.

For this scheme to be effective, an interprocessor switch is added to route data streams from the stream register file to the appropriate processing elements for load balancing. Also, the overhead of creating a filter and merger for each diverging conditional branch can be a performance bottleneck in a control flow intensive kernel.
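The split/filter/merge pipeline described above can be sketched functionally. This is our own Python illustration of the idea, not the Imagine implementation: lists stand in for streams in the stream register file, and the recorded routing plays the role of the merger's ordering information:

```python
def conditional_stream(data, pred, then_kernel, else_kernel):
    """Toy functional sketch of conditional streams: a diverging branch
    splits one stream in two, each side is processed densely by its own
    kernel (no idle SIMD lanes), and a merge step restores the original
    element order."""
    taken, fallthrough, route = [], [], []
    for x in data:                     # the per-branch "filter"
        if pred(x):
            taken.append(x)
            route.append(True)
        else:
            fallthrough.append(x)
            route.append(False)
    t = iter([then_kernel(x) for x in taken])        # taken-path kernel
    f = iter([else_kernel(x) for x in fallthrough])  # fallthrough kernel
    # The "merger kernel": re-interleave using the recorded routing.
    return [next(t) if r else next(f) for r in route]
```

Because each kernel sees a dense stream, no SIMD lane idles while the other path executes; the cost is the extra filter and merger stages, which is exactly the overhead noted above for control flow intensive kernels.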
Dynamic warp formation differs from this approach in that it is a hardware mechanism exploiting the dynamic conditional behaviour of each scalar thread, and its implementation requires neither a stream register file nor data movement between register lanes.

8.2 Dynamic Grouping SIMD Mechanisms

This section discusses two proposals that are similar to our proposed dynamic warp formation in so far as they describe how scalar threads or instructions are grouped into SIMD instructions at runtime.

8.2.1 Dynamic Regrouping of SPMD Threads for SMT Processors

The notion of dynamically regrouping the scalar SPMD threads comprising a single SIMD "task" after control flow divergence of the SPMD threads was described by Cervini [12] in the context of simultaneous multithreading (SMT) on a general purpose microprocessor that provides SIMD function units for exploiting subword parallelism. However, the mechanism he proposes does not specify how the diverged SPMD threads are regrouped, whereas such an implementation is one of the main contributions of this thesis.

Also, the mechanism Cervini proposes requires that tasks have their register values reloaded each time threads are regrouped. To avoid performance penalties, Cervini proposes that the register file contain additional ports to enable the register values to be loaded concurrently with ongoing execution. In addition, Cervini's mechanism uses special "code stops" and tags the control flow state of an SPMD thread with a loop counter list (in addition to the program counter). We point out that in the context of a modern GPU, the constant movement of data in this proposal could increase power requirements per processing element, perhaps mitigating the improvements in processing efficiency given the growing importance of the so-called "power wall" [52].
In contrast, our proposal uses a large, highly banked register file and maintains a thread's registers in a single location to eliminate the need for movement of register values.

8.2.2 Liquid SIMD

Clark et al. [14] introduce Liquid SIMD to improve SIMD binary compatibility on general-purpose CPUs by forming SIMD instructions at runtime, translating annotated scalar instructions with specialized hardware. In their proposal, when the annotated scalar instructions reach the commit stage of a superscalar pipeline, a special hardware unit translates these scalar instructions into microarchitecture-specific SIMD instructions (i.e., SIMD instructions with a fixed width matching the hardware) and runs the translated instructions on the SIMD functional units in the processor. In this manner, the grouping of scalar instructions into SIMD instructions is done only once in hardware before their execution. In contrast, this thesis focuses on improving the control flow efficiency of throughput-oriented architectures with fine-grained multithreading. As dynamic control flow behaviour is unknown prior to a program's execution, our proposed mechanism is capable of regrouping threads into new SIMD warps even after the threads start executing.

8.3 Eliminating the Existence of Branch Divergence

This section discusses two schemes that eliminate branch divergence in SIMD hardware. One scheme is to use a complex SIMD branch instruction which always produces a single branch outcome for a SIMD instruction. Another is to allow processing elements in MIMD hardware to share instruction bandwidth in the common case and permit them to operate independently when control flow diverges.

8.3.1 Complex SIMD Branch Instruction

Besides using a stack-based reconvergence mechanism to handle diverging control flow in their Ray Processing Unit, Woop et al.
[80] proposed a complex SIMD branch instruction which combines the branch outcomes of a set of masked elements in a SIMD batch into a single final branch outcome with a reduction function. For example, the programmer may specify that the branch is taken only when all the elements inside a SIMD batch evaluate to true. In this case, the reduction function would be a logical AND of the branch outcomes of all elements in that SIMD batch. In this manner, branch divergence never exists, as the final branch outcome is always consistent for all elements in a SIMD warp (there is only one branch outcome). However, this proposal may not be suitable for general-purpose applications requiring the flexibility of allowing each thread to traverse a unique execution path.

8.3.2 Vector-Thread Architecture

Krashinsky et al. [33] propose the vector-thread architecture, which exposes instruction fetch in the ISA. It provides two fetch mechanisms: vector-fetch and thread-fetch. A vector-fetch is issued by a control processor to command all virtual processors (VPs) (similar to our notion of scalar threads) to fetch the same atomic instruction block (AIB) for execution, whereas a thread-fetch can be issued by each VP to fetch an AIB for itself alone. Instruction bandwidth is efficiently shared by multiple VPs, as in normal SIMD hardware, when they are executing instructions fetched by a vector-fetch. Branch divergence does not exist in this architecture, as each VP executes on its own after a vector-fetch. However, instruction bandwidth can become a bottleneck when each VP is executing a different execution path, and thread-fetches may increase bandwidth pressure on the L1 cache.
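This bandwidth trade-off can be illustrated with a toy fetch-counting model (our own simplification, not Krashinsky et al.'s actual ISA; all names here are hypothetical): one shared fetch serves every VP on a vector-fetch, while each diverging VP pays for its own thread-fetches.

```python
# Toy model of vector-fetch vs. thread-fetch instruction bandwidth.
# An AIB is modelled as a function of the VP id that returns the PC of a
# thread-fetch target, or None when the VP stops.
def count_fetches(vps, aibs, vector_fetch_pc):
    """Count instruction fetches: the vector-fetch is shared by all VPs
    (one fetch total); each subsequent thread-fetch costs one fetch per VP."""
    fetches = 1  # the single shared vector-fetch
    for vp in vps:
        pc = vector_fetch_pc
        while pc is not None:
            pc = aibs[pc](vp)   # run the AIB; maybe thread-fetch another
            if pc is not None:
                fetches += 1    # divergent VPs each fetch independently
    return fetches

# AIB 0: even VPs thread-fetch AIB 1, odd VPs stop; AIB 1: all stop.
aibs = {0: lambda vp: 1 if vp % 2 == 0 else None,
        1: lambda vp: None}
n_fetches = count_fetches([0, 1, 2, 3], aibs, 0)
# n_fetches == 3: one shared vector-fetch plus one thread-fetch per even VP
```

When every VP diverges, the fetch count grows with the number of VPs rather than staying constant, which is the bandwidth bottleneck noted above.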
Also, this architecture requires each processing element to have its own control logic, which eliminates a key merit of SIMD hardware. On the other hand, our proposed dynamic warp formation retains this merit by grouping scalar threads dynamically into SIMD warps in the scheduler and issuing them to a SIMD pipeline for execution.

Chapter 9

Conclusions and Future Work

This chapter concludes the thesis. First, Section 9.1 provides a brief summary and conclusions. Second, Section 9.2 lists the contributions made by this thesis. Finally, Section 9.3 discusses several important areas for future work.

9.1 Summary and Conclusions

In this thesis, we explore the impact of branch divergence on GPU performance for non-graphics applications. Without any mechanism to handle branch divergence, the performance of a GPU's SIMD pipeline degrades significantly. While existing approaches that reconverge control flow at join points such as the immediate post-dominator improve performance, we found that significant performance improvements can be achieved with our proposed dynamic warp formation and scheduling mechanism. We described and evaluated an implementation of the hardware required for dynamic warp formation, tackled the challenge of enabling correct access to register data as thread warps are dynamically regrouped, and found that performance improved by 47.4% on average over a mechanism comparable to existing approaches, which reconverge threads at the immediate post-dominator. Furthermore, we estimated the area of our proposed hardware changes to be around 8% if implemented on a modern GPU such as the GeForce 8800 GTX [39].

Our experimental results also highlight the importance of careful prioritization of threads for scheduling in such massively parallel hardware, even when individual scalar threads are executing the same code in the same program phase.

9.2 Contributions

This thesis makes the following contributions:

1.
It quantifies a performance gap of 66% between the immediate post-dominator branch reconvergence mechanism and a MIMD architecture with the same peak operation throughput, highlighting the importance of finding better branch handling mechanisms.

2. It proposes and evaluates a novel hardware mechanism, dynamic warp formation, for regrouping threads of individual SIMD warps on a cycle-by-cycle basis to greatly improve the efficiency of branch handling. For the data-parallel, non-graphics applications we studied in this thesis, dynamic warp formation and scheduling achieves a speedup of 47% over the immediate post-dominator branch reconvergence mechanism (with 768 threads per shader core).

3. It shows quantitatively that the warp scheduling policy (the order in which warps are issued from the scheduler) affects both the performance gains and the area overhead of dynamic warp formation, and proposes an area-efficient implementation of the well-performing Majority scheduling policy using a max-heap.

4. It proposes and evaluates a detailed hardware implementation of dynamic warp formation and scheduling. We estimate that the hardware required by this implementation adds 8% to the total chip area if implemented on a modern GPU such as the GeForce 8800 GTX (470 mm2) [39]. This does not include any area exclusive to the immediate post-dominator branch reconvergence mechanism that might be subtracted from the baseline hardware.

5. It provides an extensive simulation infrastructure to enable future research on GPU architectures optimized to support non-graphics applications.

9.3 Future Work

The work in this thesis leads to several new areas of research. This section briefly lists and
This section briefly lists anddiscusses some of them.9.3.1 Better Warp Scheduling PoliciesIn this thesis, we have observed that warp scheduling policies have significant impact on botharea (0.71 mm2, or 43%, of the 1.64 mm2 dynamic warp formation scheduler area is spenton implementing the Majority policy?See Chapter 7) and performance of dynamic warp for-mation. Performance varied from 37% to 66% across the scheduling policies we evaluated inSection 6.1 of Chapter 6. While we have explored several different scheduling policies and ob-tained significant speedup, the analysis in Chapter 6 indicates that our best performing policy(Majority) may only provide a fraction of the performance potential of dynamic warp forma-tion. A more effective policy, designed to better encourage warp recombination may furtherimprove performance and be more robust (i.e., it should not cause starvation of threads seenin Section 6.2 of Chapter 6).In addition, our best performing Majority policy requires sorting hardware which signifi-cantly increases the area overhead (by about 77%) of dynamic warp formation. In this respect,policies that need less area should be explored in the future as well.9.3.2 Area Efficient Implementation of the Warp PoolAs mentioned in Chapter 4, it is possible to implement the warp pool with a single memoryarray and thus reduce the area of the warp pool significantly if it is clocked at the SIMD pipelineALU frequency and serializes the 4 updates required per scheduler cycle. This brings up aninteresting trade-off between power and area, as clocking the warp pool faster may consumemore energy but require a smaller structure.Another option, we have discussed in Section 6.7 of Chapter 6, is to reduce the area of thewarp pool by spilling existing warp pool entries to global memory. 
While this can reduce the area overhead of dynamic warp formation, such a spilling mechanism may also introduce a performance penalty, as warps spilled to global memory may no longer be merged with threads from incoming warps at the scheduler. It is also unclear how the scheduling policy should be changed to accommodate such a mechanism. In particular, it is uncertain what the scheduler should do when the policy selects a warp that is spilled to global memory. Or, taking this one step further, the scheduling policy may have to account for the memory access latency required to load the spilled warps from global memory. The exact warp spilling mechanism, as well as how the scheduling policy should accommodate this change in general, should be explored in the future.

Note that these two proposals are independent of each other, and it may be possible to combine them for a highly area-efficient implementation of the warp pool.

9.3.3 Bank Conflict Elimination

Bank conflicts in shared memory access, as noted by Hwu et al. [27], have been a major bottleneck in GPU performance. While this thesis focuses on using dynamic warp formation to improve control flow efficiency in a SIMD pipeline, we have also observed that it is possible to regroup threads, in a fashion similar to how we propose to do this for lane-aware scheduling, to eliminate bank conflicts in cache or shared memory access. The exact implementation and hardware cost of this idea should be explored in the future.

9.3.4 Improvements to GPGPU-Sim

While GPGPU-Sim was initially created solely to evaluate the performance of dynamic warp formation, we recognize that this simulator can be further improved in both accuracy and speed to prototype other architectural ideas as well. Improvements that we currently envision include:

•
Exploration into more sophisticated DRAM scheduling policies: As mentioned in Chapter 5, DRAM request scheduling has a significant impact on overall system performance. Modeling various aspects of a massively parallel architecture, GPGPU-Sim is a suitable tool for exploring this design space.

• Direct interface to CUDA binaries: Currently, GPGPU-Sim runs benchmarks created with the SimpleScalar PISA GCC, which limits our choice of benchmarks. Having a direct interface to CUDA binaries will greatly broaden our choice of benchmarks, likely resulting in additional insights.

References

[1] Vikas Agarwal, M. S. Hrishikesh, Stephen W. Keckler, and Doug Burger. Clock rate versus IPC: the end of the road for conventional microarchitectures. In ISCA '00: Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 248–259, 2000.

[2] J. R. Allen, Ken Kennedy, Carrie Porterfield, and Joe Warren. Conversion of control dependence to data dependence. In Proceedings of the 10th Symposium on Principles of Programming Languages (POPL '83), pages 177–189, 1983.

[3] ATI CTM Guide. AMD, Inc., 1.01 edition, 2006.

[4] Arkaprava Basu, Nevin Kirman, Meyrem Kirman, Mainak Chaudhuri, and Jose Martinez. Scavenger: A new last level cache architecture with global block priority. In MICRO '07: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, pages 421–432, 2007.

[5] G. E. Blelloch, C. E. Leiserson, B. M. Maggs, C. G. Plaxton, S. J. Smith, and M. Zagha. An experimental analysis of parallel sorting algorithms. Theory of Computing Systems, 31(2):135–167, 1998.

[6] James F. Blinn and Martin E. Newell. Texture and reflection in computer generated images. Communications of the ACM, 19(10):542–547, 1976.

[7] Jeffrey Bolz, Ian Farmer, Eitan Grinspun, and Peter Schröder. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. ACM Transactions on Graphics, 22(3):917–924, 2003.

[8] W. J. Bouknight, S. A.
Denenberg, D. E. McIntyre, J. M. Randall, A. H. Sameh, and D. L. Slotnick. The Illiac IV System. Proceedings of the IEEE, 60(4):369–388, 1972.

[9] Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. Brook for GPUs: stream computing on graphics hardware. In SIGGRAPH, pages 777–786, 2004.

[10] Doug Burger and Todd M. Austin. The SimpleScalar Tool Set, Version 2.0, 1997.

[11] E. A. Catmull. Computer display of curved surfaces. In Proceedings of the Conference on Computer Graphics, Pattern Recognition, and Data Structure, pages 11–17, May 1975.

[12] Stefano Cervini. European Patent EP 1531391 A2: System and method for efficiently executing single program multiple data (SPMD) programs, May 2005.

[13] Tzi-cker Chiueh. Multi-threaded vectorization. In ISCA '91: Proceedings of the 18th Annual International Symposium on Computer Architecture, pages 352–361, 1991.

[14] Nathan Clark, Amir Hormati, Sami Yehia, Scott Mahlke, and Krisztián Flautner. Liquid SIMD: Abstracting SIMD hardware using lightweight dynamic mapping. In Proceedings of the International Symposium on High Performance Computer Architecture, pages 216–227, 2007.

[15] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. The MIT Press, September 2001.

[16] David E. Culler, Anoop Gupta, and Jaswinder Pal Singh. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997.

[17] W. J. Dally and B. Towles. Interconnection Networks. Morgan Kaufmann, 2004.

[18] William J. Dally, Francois Labonte, Abhishek Das, Patrick Hanrahan, Jung-Ho Ahn, Jayanth Gummaraju, Mattan Erez, Nuwan Jayasena, Ian Buck, Timothy J. Knight, and Ujval J. Kapasi. Merrimac: Supercomputing with streams. In Proceedings of Supercomputing, 2003.

[19] V. M. del Barrio, C. Gonzalez, J. Roca, A. Fernandez, and E. Espasa.
ATTILA: a cycle-level execution-driven simulator for modern GPU architectures. In 2006 IEEE International Symposium on Performance Analysis of Systems and Software, pages 231–241, 19–21 March 2006.

[20] M. J. Flynn. Very high-speed computing systems. Proceedings of the IEEE, 54(12):1901–1909, Dec. 1966.

[21] H. W. Fowler, F. G. Fowler, and Della Thompson, editors. The Concise Oxford Dictionary. Oxford University Press, 9th edition, 1995.

[22] Wilson W. L. Fung, Ivan Sham, George Yuan, and Tor M. Aamodt. Dynamic warp formation and scheduling for efficient GPU control flow. In MICRO '07: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, pages 407–420, 2007.

[23] Jukka Helin. Performance analysis of the CM-2, a massively parallel SIMD computer. In ICS '92: Proceedings of the 6th International Conference on Supercomputing, pages 45–52, 1992.

[24] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach (The Morgan Kaufmann Series in Computer Architecture and Design). Morgan Kaufmann, May 2002.

[25] Tetsuya Higuchi, Tatsumi Furuya, Kenichi Handa, Naoto Takahashi, Hiroyasu Nishiyama, and Akio Kokubu. IXM2: a parallel associative processor. In ISCA '91: Proceedings of the 18th Annual International Symposium on Computer Architecture, pages 22–31, 1991.

[26] Daniel Reiter Horn, Mike Houston, and Pat Hanrahan. ClawHMMer: A Streaming HMMer-Search Implementation. In Proceedings of Supercomputing, page 11, 2005.

[27] Wen-Mei Hwu, David Kirk, Shane Ryoo, Christopher Rodrigues, John Stratton, and Kuangwei Huang. Model Performance Insights on Executing Non-Graphics Applications on CUDA on the NVIDIA GeForce 8800 GTX. Mon/HC19.02/HC19.02.03.pdf, 2007.

[28] Intel 965 Express Chipset Family and Intel G35 Express Chipset Graphics Controller Programmer's Reference Manual. Intel Corporation, January 2008.

[29] Aggelos Ioannou and Manolis G. H. Katevenis.
Pipelined heap (priority queue) management for advanced scheduling in high-speed networks. IEEE/ACM Transactions on Networking, 15(2):450–461, 2007.

[30] Nuwan Jayasena, Mattan Erez, Jung Ho Ahn, and William J. Dally. Stream register files with indexed access. In HPCA '04: Proceedings of the 10th International Symposium on High Performance Computer Architecture, page 60, 2004.

[31] Harry F. Jordan. Performance measurements on HEP - a pipelined MIMD computer. In ISCA '83: Proceedings of the 10th Annual International Symposium on Computer Architecture, pages 207–212, 1983.

[32] Ujval J. Kapasi, William J. Dally, Scott Rixner, Peter R. Mattson, John D. Owens, and Brucek Khailany. Efficient conditional operations for data-parallel architectures. In MICRO 33: Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, pages 159–170, 2000.

[33] Ronny Krashinsky, Christopher Batten, Mark Hampton, Steve Gerding, Brian Pharris, Jared Casper, and Krste Asanovic. The vector-thread architecture. In ISCA '04: Proceedings of the 31st Annual International Symposium on Computer Architecture, pages 52–63, 2004.

[34] David Kroft. Lockup-free instruction fetch/prefetch cache organization. In ISCA '81: Proceedings of the 8th Annual International Symposium on Computer Architecture, pages 81–87, 1981.

[35] James Laudon. Performance/watt: the new server focus. SIGARCH Computer Architecture News, 33(4):5–13, 2005.

[36] Hyunseok Lee, Trevor Mudge, and Chaitali Chakrabarti. Reducing idle mode power in software defined radio terminals. In ISLPED '06: Proceedings of the 2006 International Symposium on Low Power Electronics and Design, pages 101–106, 2006.

[37] Adam Levinthal and Thomas Porter. Chap - a SIMD graphics processor. In SIGGRAPH, pages 77–82, 1984.

[38] Yuan Lin, Hyunseok Lee, Mark Woh, Yoav Harel, Scott Mahlke, Trevor Mudge, Chaitali Chakrabarti, and Krisztián Flautner. SODA: A low-power architecture for software radio.
In ISCA '06: Proceedings of the 33rd Annual International Symposium on Computer Architecture, pages 89–101, 2006.

[39] Erik Lindholm and Stuart Oberman. The NVIDIA GeForce 8800 GPU. Mon/HC19.02/HC19.02.01.pdf, 2007.

[40] Erik Lindholm, Mark J. Kilgard, and Henry P. Moreton. A user-programmable vertex engine. In SIGGRAPH, pages 149–158, 2001.

[41] R. F. Liu and K. Asanovic. Accelerating architectural exploration using canonical instruction segments. In 2006 IEEE International Symposium on Performance Analysis of Systems and Software, pages 13–24, 19–21 March 2006.

[42] Markus Lorenz, Peter Marwedel, Thorsten Dräger, Gerhard Fettweis, and Rainer Leupers. Compiler based exploration of DSP energy savings by SIMD operations. In ASP-DAC '04: Proceedings of the 2004 Conference on Asia South Pacific Design Automation, pages 838–841, 2004.

[43] Raymond A. Lorie and Hovey R. Strong. US Patent 4,435,758: Method for conditional branch execution in SIMD vector processors, 1984.

[44] IBM Ltd. Power ISA Version 2.05, 2007.

[45] David Luebke and Greg Humphreys. How GPUs work. Computer, 40(2):96–100, 2007.

[46] Michael Mantor. Radeon R600, a 2nd generation unified shader architecture. Mon/HC19.03/HC19.03.01.pdf, 2007.

[47] John Montrym and Henry Moreton. The GeForce 6800. IEEE Micro, 25(2):41–51, 2005.

[48] Gordon E. Moore. Cramming more components onto integrated circuits. Electronics, 38(8):114–117, 1965.

[49] Simon Moy and Erik Lindholm. US Patent 6,947,047: Method and system for programmable pipelined graphics processing with branching instructions, 2005.

[50] Steven S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.

[51] Onur Mutlu and Thomas Moscibroda. Stall-time fair memory access scheduling for chip multiprocessors. In MICRO '07: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, pages 146–160, 2007.

[52] Sam Naffziger, James Warnock, and Herbert Knapp. When Processors Hit the Power Wall (or "When the CPU hits the fan").
In Digest of Technical Papers, IEEE International Solid-State Circuits Conference (ISSCC), pages 16–17, February 2005.

[53] Dorit Naishlos, Marina Biberstein, Shay Ben-David, and Ayal Zaks. Vectorizing for a SIMdD DSP architecture. In CASES '03: Proceedings of the 2003 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, pages 2–11, 2003.

[54] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequences of two proteins. Journal of Molecular Biology, 48:443–453, 1970.

[55] Kyle J. Nesbit, Nidhi Aggarwal, James Laudon, and James E. Smith. Fair queuing memory systems. In MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pages 208–222, 2006.

[56] John R. Nickolls and Jochen Reusch. Autonomous SIMD flexibility in the MP-1 and MP-2. In SPAA '93: Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 98–99, 1993.

[57] NVIDIA Corp. NVIDIA CUDA SDK code samples. get samples.html.

[58] NVIDIA CUDA Programming Guide. NVIDIA Corporation, 1.0 edition, 2007.

[59] NVIDIA CUDA Programming Guide. NVIDIA Corporation, 1.0 edition, 2007.

[60] Technical Brief: NVIDIA GeForce 8800 GPU Architecture Overview. NVIDIA Corporation, November 2006.

[61] Timothy J. Purcell, Ian Buck, William R. Mark, and Pat Hanrahan. Ray tracing on programmable graphics hardware. In SIGGRAPH, pages 703–712, 2002.

[62] Qimonda. 512-Mbit GDDR3 Graphics RAM, Revision 1.3 (Part No. HYB18H512321BF-08), December 2007.

[63] Scott Rixner, William J. Dally, Ujval J. Kapasi, Brucek Khailany, Abelardo López-Lagunas, Peter R. Mattson, and John D. Owens. A bandwidth-efficient architecture for media processing. In Proceedings of the 31st International Symposium on Microarchitecture, pages 3–13, 1998.

[64] Scott Rixner, William J. Dally, Ujval J. Kapasi, Peter Mattson, and John D. Owens. Memory access scheduling.
In ISCA '00: Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 128–138, 2000.

[65] Eric Rotenberg, Quinn Jacobson, and James E. Smith. A study of control independence in superscalar processors. In HPCA '99: Proceedings of the 5th International Symposium on High Performance Computer Architecture, pages 115–124, 1999.

[66] J. W. Sheaffer, D. Luebke, and K. Skadron. A flexible simulation framework for graphics architectures. In HWWS '04: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pages 85–94, 2004.

[67] Mike Shebanow. ECE 498 AL: Programming massively parallel processors, lecture 12, February 2007.

[68] Jaewook Shin, Mary Hall, and Jacqueline Chame. Introducing control flow into vectorized code. In (to appear in) Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2007.

[69] Standard Performance Evaluation Corporation. SPEC CPU2006 benchmarks.

[70] OpenSPARC T2 Core Microarchitecture Specification. Sun Microsystems, Inc., 2007.

[71] Edwin J. Tan and Wendi B. Heinzelman. DSP architectures: past, present and futures. SIGARCH Computer Architecture News, 31(3):6–19, 2003.

[72] David Tarjan, Shyamkumar Thoziyor, and Norman P. Jouppi. CACTI 4.0. Technical Report HPL-2006-86, Hewlett Packard Laboratories Palo Alto, June 2006.

[73] Shreekant (Ticky) Thakkar and Tom Huff. Internet streaming SIMD extensions. Computer, 32(12):26–34, 1999.

[74] Mark R. Thistle and Burton J. Smith. A processor architecture for Horizon. In Proceedings of Supercomputing, pages 35–41, 1988.

[75] James E. Thornton. Parallel operation in the Control Data 6600. In AFIPS Proceedings of FJCC, volume 26, pages 33–40, 1964.

[76] Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy. Simultaneous multithreading: maximizing on-chip parallelism. In ISCA '95: Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 392–403, 1995.

[77] Steve Upstill.
The RenderMan Companion: A Programmer's Guide to Realistic Computer Graphics. Reading, Mass.: Addison-Wesley, 1990.

[78] David Wang, Brinda Ganesh, Nuengwong Tuaycharoen, Kathleen Baynes, Aamer Jaleel, and Bruce Jacob. DRAMsim: a memory system simulator. SIGARCH Computer Architecture News, 33(4):100–107, 2005.

[79] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 24–36, 1995.

[80] Sven Woop, Jörg Schmittler, and Philipp Slusallek. RPU: A Programmable Ray Processing Unit for Realtime Ray Tracing. ACM Transactions on Graphics, 24(3):434–444, 2005.

[81] M. T. Yourst. PTLsim: A cycle accurate full system x86-64 microarchitectural simulator. In ISPASS 2007: IEEE International Symposium on Performance Analysis of Systems and Software, pages 23–34, 25–27 April 2007.

[82] Chuanjun Zhang. Reducing cache misses through programmable decoders. ACM Transactions on Architecture and Code Optimization, 4(4):1–31, 2008.

