Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Software-hardware co-design for energy efficient datacenter computing Hetherington, Tayler Hicklin 2019

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.

Item Metadata

Download

Media
24-ubc_2019_november_hetherington_tayler.pdf [ 3.33MB ]
Metadata
JSON: 24-1.0384819.json
JSON-LD: 24-1.0384819-ld.json
RDF/XML (Pretty): 24-1.0384819-rdf.xml
RDF/JSON: 24-1.0384819-rdf.json
Turtle: 24-1.0384819-turtle.txt
N-Triples: 24-1.0384819-rdf-ntriples.txt
Original Record: 24-1.0384819-source.json
Full Text
24-1.0384819-fulltext.txt
Citation
24-1.0384819.ris

Full Text

Software-Hardware Co-design for Energy EfficientDatacenter ComputingbyTayler Hicklin HetheringtonB.A.Sc., The University of British Columbia, 2011A DISSERTATION SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFDoctor of PhilosophyinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Electrical and Computer Engineering)The University of British Columbia(Vancouver)October 2019c© Tayler Hicklin Hetherington, 2019The following individuals certify that they have read, and recommend to the Fac-ulty of Graduate and Postdoctoral Studies for acceptance, the dissertation entitled:Software-Hardware Co-design for Energy Efficient Datacenter Computingsubmitted by Tayler Hicklin Hetherington in partial fulfillment of the requirementsfor the degree of Doctor of Philosophyin Electrical and Computer EngineeringExamining Committee:Tor M. Aamodt, Electrical and Computer EngineeringSupervisorMieszko Lis, Electrical and Computer EngineeringSupervisory Committee MemberMargo Seltzer, Computer ScienceUniversity ExaminerSudip Shekhar, Electrical and Computer EngineeringUniversity ExaminerEmmett Witchel, The University of Texas at Austin, Computer ScienceExternal ExaminerAdditional Supervisory Committee Members:Steve Wilton, Electrical and Computer EngineeringSupervisory Committee MemberiiAbstractDatacenters have become commonplace computing environments used to offloadapplications from distributed local machines to centralized environments. Datacen-ters offer increased performance and efficiency, reliability and security guarantees,and reduced costs relative to independently operating the computing equipment.The growing trend over the last decade towards server-side (cloud) computing inthe datacenter has resulted in increasingly higher demands for performance and ef-ficiency. Graphics processing units (GPUs) are massively parallel, highly efficientaccelerators, which can provide significant improvements to applications with am-ple parallelism and structured behavior. While server-based applications containvarying degrees of parallelism and are economically appealing for GPU accelera-tion, they often do not adhere to the specific properties expected of an applicationto obtain the benefits offered by the GPU.This dissertation explores the potential for using GPUs as energy-efficientaccelerators for traditional server-based applications in the datacenter through asoftware-hardware co-design. It first evaluates a popular key-value store serverapplication, Memcached, demonstrating that the GPU can outperform the CPUby 7.5× for the core Memcached processing. However, the core processing of anetworking application is only part of the end-to-end computation required at theserver. This dissertation then proposes a GPU-accelerated software networkingframework, GNoM, which offloads all of the network and application processingto the GPU. GNoM facilitates the design of MemcachedGPU, an end-to-endMemcached implementation on contemporary Ethernet and GPU hardware.MemcachedGPU achieves 10 Gbit line-rate processing at the smallest requestsize with 95-percentile latencies under 1.1 milliseconds and efficiencies under 12iiimicrojoules per request. GNoM highlights limitations in the traditional GPU pro-gramming model, which relies on a CPU for managing GPU tasks. Consequently,the CPU may be unnecessarily involved on the critical path, affecting overallperformance, efficiency, and the potential for CPU workload consolidation. 
Toaddress these limitations, this dissertation proposes an event-driven GPU program-ming model and set of hardware modifications, EDGE, which enables any devicein a heterogeneous system to directly manage the execution of pre-registeredGPU tasks through interrupts. EDGE employs a fine-grained GPU preemptionmechanism that reuses existing GPU compute resources to begin processinginterrupts in under 50 GPU cycles.ivLay SummaryThis dissertation explores the potential to improve the performance, efficiency, andcost of datacenter computing through the use of highly parallel and efficient hard-ware accelerators, specifically graphics processing units. However, the types ofapplications that typically reside in a datacenter are often not considered to bewell suited for graphics processing units. This dissertation explores how a pop-ular datacenter application performs on contemporary graphics processing units,highlighting sizable improvements over traditional datacenter hardware. This dis-sertation then proposes a general software framework for accelerating datacenterapplications on graphics processing units, recognizing that all of the computationfrom receiving a request to sending the reply must be accounted for to achievethe full benefits of the efficient parallel computing hardware. Finally, this dis-sertation identifies limitations with contemporary graphics processing units andproposes hardware and software enhancements to further improve the usability,performance, and efficiency of graphics processing units in the datacenter.vPrefaceThis section lists my publications that were completed at The University of BritishColumbia, discusses how they are incorporated into this dissertation, and highlightsmy contributions to this dissertation.The publications are listed in chronological order as follows:[C1] Tayler H. Hetherington, Timothy G. Rogers, Lisa Hsu, Mike O’Connor, TorM. Aamodt. Characterizing and Evaluating a Key-Value Store Applicationon Heterogeneous CPU-GPU Systems [69]. In Proceedings of the IEEEInternational Symposium on Performance Analysis of Systems and Software(ISPASS), pp. 88-98, April 2012.[C2] Tayler H. Hetherington, Mike O’Connor, Tor M. Aamodt. MemcachedGPU:Scaling-up Scale-out Key-value Stores [70]. In Proceedings of the SixthACM Symposium on Cloud Computing (SoCC), pp. 43-57, August, 2015.Chapter 2. This chapter combines and expands on the material that was pre-sented in [C1] and [C2] to describe the necessary background information for thisdissertation.Chapter 3. A version of this material has been published as [C1]. In [C1], I wasthe lead investigator responsible for conducting the research and writing the major-ity of the manuscript under the guidance of Dr. Tor M. Aamodt and with input fromMike O’Connor and Lisa Hsu. Timothy G. Rogers implemented an initial versionof the Memcached GPU code, which was targeted towards the GPGPU-Sim soft-ware simulator, conducted the corresponding simulator experiments, and wrote therelated results sections (Section 3.4.2 and Section 3.4.2). I was responsible forviimplementing and evaluating the control-flow simulator, implementing a new ver-sion of Memcached targeted towards GPU hardware, conducting the MemcachedGPU hardware experiments, analyzing the results, and writing the correspondingportions of the manuscript.Chapter 4. A version of this material has been published as [C2]. 
In [C2],I was the lead investigator responsible for conducting the research, implementingthe software frameworks, performing the experiments, collecting and analyzing theresults, and writing the manuscript under the guidance of Dr. Tor M. Aamodt andwith input from Mike O’Connor.Chapter 5. In this chapter, I was the lead investigator responsible for conduct-ing the research, implementing the proposed event-driven GPU execution program-ming model and corresponding GPU architectural enhancements in the evaluatedsoftware simulation frameworks, conducting the majority of experiments, analyz-ing the results, and writing the chapter under the guidance of Dr. Tor M. Aamodt.Maria Lubeznov evaluated the performance overheads of concurrent CPU applica-tions and GPU networking workloads and collected the corresponding data. Fol-lowing this dissertation, a modified version of the material presented in this chapterwas published in the International Conference on Parallel Architectures and Com-pilation Techniques (PACT) 2019 [71].Chapter 6. This chapter combines and expands on the related work sectionsthat were presented in [C1] and [C2] with the related work in Chapter 5.viiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiiList of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . xixAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxiiiDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxv1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Computing Trends . . . . . . . . . . . . . . . . . . . . . . . . . 21.2 GPUs and the Datacenter . . . . . . . . . . . . . . . . . . . . . . 61.3 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . 91.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111.5 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152.1 Graphics Processing Units (GPUs) . . . . . . . . . . . . . . . . . 15viii2.1.1 GPU Programming Model . . . . . . . . . . . . . . . . . 152.1.2 GPU Architecture . . . . . . . . . . . . . . . . . . . . . . 172.1.3 GPU Memory Transfers . . . . . . . . . . . . . . . . . . 212.1.4 CUDA Dynamic Parallelism . . . . . . . . . . . . . . . . 222.1.5 Kernel Priority and Kernel Preemption . . . . . . . . . . 222.1.6 Current GPU Interrupt Support . . . . . . . . . . . . . . . 232.1.7 GPU Persistent Threads . . . . . . . . . . . . . . . . . . 242.1.8 GPU Architectural Irregularities . . . . . . . . . . . . . . 252.2 Memcached . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292.3 Network Interfaces and Linux Networking . . . . . . . . . . . . . 312.4 Event and Interrupt-Driven Programming . . . . . . . . . . . . . 323 Evaluating a Key-Value Store Application on GPUs . . . . . . . . . 343.1 Porting Memcached . . . . . . . . . . . . . . . . . . . . . . . . . 393.1.1 Offloading GET Requests . . . . . . . . . . . . . . . . . 393.1.2 Memory Management . . . . . . . . . . . . . . . . 
. . . 423.1.3 Separate CPU-GPU Address Space . . . . . . . . . . . . 433.1.4 Read-only Data . . . . . . . . . . . . . . . . . . . . . . . 443.1.5 Memory Layout . . . . . . . . . . . . . . . . . . . . . . 443.1.6 SETs and GETs . . . . . . . . . . . . . . . . . . . . . . . 453.2 Control-Flow Simulator (CFG-Sim) . . . . . . . . . . . . . . . . 463.3 Experimental Methodology . . . . . . . . . . . . . . . . . . . . . 483.3.1 Hardware and Simulation Frameworks . . . . . . . . . . . 483.3.2 Assumptions and Known Limitations . . . . . . . . . . . 513.3.3 Validation and Metrics . . . . . . . . . . . . . . . . . . . 533.3.4 WikiData Workload . . . . . . . . . . . . . . . . . . . . 533.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 543.4.1 Hardware Evaluation . . . . . . . . . . . . . . . . . . . . 543.4.2 Simulation Evaluation . . . . . . . . . . . . . . . . . . . 613.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 704 Memcached GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724.1 GPU Network Offload Manager (GNoM) . . . . . . . . . . . . . 76ix4.1.1 Request Batching . . . . . . . . . . . . . . . . . . . . . . 764.1.2 Software Architecture . . . . . . . . . . . . . . . . . . . 784.2 MemcachedGPU . . . . . . . . . . . . . . . . . . . . . . . . . . 854.2.1 Memcached and Data Structures . . . . . . . . . . . . . . 854.2.2 Hash Table . . . . . . . . . . . . . . . . . . . . . . . . . 884.2.3 Post GPU Race Conditions on Eviction . . . . . . . . . . 924.3 Experimental Methodology . . . . . . . . . . . . . . . . . . . . . 944.3.1 GNoM and MemcachedGPU . . . . . . . . . . . . . . . . 954.3.2 Hash-Sim (Hash Table Simulator) . . . . . . . . . . . . . 974.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 984.4.1 Hash Table Evaluation . . . . . . . . . . . . . . . . . . . 994.4.2 Impact of Linux Kernel Bypass . . . . . . . . . . . . . . 1074.4.3 MemcachedGPU Evaluation . . . . . . . . . . . . . . . . 1084.4.4 Workload Consolidation on GPUs . . . . . . . . . . . . . 1154.4.5 MemcachedGPU Offline Limit Study . . . . . . . . . . . 1184.4.6 Comparison with Previous Work . . . . . . . . . . . . . . 1214.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1245 EDGE: Event-Driven GPU Execution . . . . . . . . . . . . . . . . . 1265.1 Motivation for Increased GPU Independence . . . . . . . . . . . . 1275.2 Supporting Event-Driven GPU Execution . . . . . . . . . . . . . 1325.3 GPU Interrupts and Privileged GPU Warps . . . . . . . . . . . . . 1355.3.1 Interrupt Partitioning and Granularity . . . . . . . . . . . 1375.3.2 Privileged GPU Warp Selection . . . . . . . . . . . . . . 1385.3.3 Privileged GPU Warp Preemption . . . . . . . . . . . . . 1405.3.4 Privileged GPU Warp Priority . . . . . . . . . . . . . . . 1415.3.5 Interrupt Flow . . . . . . . . . . . . . . . . . . . . . . . 1435.3.6 Interrupt Architecture . . . . . . . . . . . . . . . . . . . . 1435.4 Event-Driven GPU Execution . . . . . . . . . . . . . . . . . . . . 1465.4.1 Event Kernels . . . . . . . . . . . . . . . . . . . . . . . . 1465.4.2 EDGE Architecture . . . . . . . . . . . . . . . . . . . . . 1515.4.3 Wait-Release Barrier . . . . . . . . . . . . . . . . . . . . 1525.5 Experimental Methodology . . . . . . . . . . . . . . . . . . . . . 156x5.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 1575.6.1 GPU Interrupt Support . . . . . . . . . . . . . . . . . . . 1585.6.2 Event Kernels . . . . . . . . . . . . . . . . . . . . . . . . 
1655.6.3 Wait-Release Barrier . . . . . . . . . . . . . . . . . . . . 1695.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1726 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1746.1 Related Work for Accelerating Server Applications and NetworkProcessing on GPUs . . . . . . . . . . . . . . . . . . . . . . . . 1746.1.1 Memcached and Server-based Applications . . . . . . . . 1746.1.2 GPU Networking . . . . . . . . . . . . . . . . . . . . . . 1806.2 Related Work on Event-Driven GPU Execution and ImprovingGPU System Support . . . . . . . . . . . . . . . . . . . . . . . . 1837 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . 1907.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1907.2 Future Research Directions . . . . . . . . . . . . . . . . . . . . . 1957.2.1 Control-Flow Simulator . . . . . . . . . . . . . . . . . . 1957.2.2 Evaluating GNoM on Additional Applications and Inte-grated GPUs . . . . . . . . . . . . . . . . . . . . . . . . 1977.2.3 Larger GPU Networking Kernels . . . . . . . . . . . . . . 1987.2.4 Stateful GPU Network Processing . . . . . . . . . . . . . 1987.2.5 Accelerating Operating System Services on GPUs . . . . 1997.2.6 Networking Hardware Directly on a GPU . . . . . . . . . 1997.2.7 Scalar Processors for GPU Interrupt Handling . . . . . . . 2007.2.8 GPU Wait-Release Barriers . . . . . . . . . . . . . . . . 2007.2.9 Rack-Scale Computing . . . . . . . . . . . . . . . . . . . 201Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202xiList of TablesTable 1.1 Comparing GPU and CPU theoretical performance and effi-ciency over previous architecture generations. The GPU valueswere obtained from an NVIDIA whitepaper [131]. The CPUvalue were obtained from Intel processor specifications [83] andcalculated using: FLOPS = number of cores × peak clock fre-quency × flops per cycle. . . . . . . . . . . . . . . . . . . . . 5Table 3.1 GPU hardware specifications. . . . . . . . . . . . . . . . . . . 49Table 3.2 CPU hardware specifications. . . . . . . . . . . . . . . . . . . 49Table 3.3 GPGPU-Sim configuration. . . . . . . . . . . . . . . . . . . . 51Table 4.1 Server and client configurations. . . . . . . . . . . . . . . . . 94Table 4.2 Server NVIDIA GPUs. . . . . . . . . . . . . . . . . . . . . . 95Table 4.3 GET request throughput and drop rate at 10 GbE. . . . . . . . 108Table 4.4 Concurrent GETs and SETs on the Tesla K20c. . . . . . . . . . 114Table 4.5 Comparing MemcachedGPU with previous work. . . . . . . . 122Table 5.1 EDGE API extensions. . . . . . . . . . . . . . . . . . . . . . 150Table 5.2 Gem5-GPU configuration. . . . . . . . . . . . . . . . . . . . . 156xiiList of FiguresFigure 2.1 High-level view of an AMD-like GPU architecture assumed inthis dissertation. . . . . . . . . . . . . . . . . . . . . . . . . 18Figure 2.2 High-level view of an NVIDIA-like GPU architecture assumedin this dissertation. . . . . . . . . . . . . . . . . . . . . . . . 19Figure 2.3 SIMT execution example. . . . . . . . . . . . . . . . . . . . 26Figure 2.4 GPU memory request coalescing. . . . . . . . . . . . . . . . 28Figure 2.5 Memcached. . . . . . . . . . . . . . . . . . . . . . . . . . . 29Figure 3.1 Control Flow Graph (CFG) from Memcached’s Jenkins hashfunction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37Figure 3.2 Memcached SIMD efficiency: Expected vs. Actual. . . . . . . 39Figure 3.3 GPU GET Payload object. 
The original Memcached Connec-tion object contains a large amount of information about thecurrent Memcached connect, the requesting client network in-formation, and the current state of the Memcached request.The GPU Payload object contains a much smaller subset ofthe relevant information required to process the GET requeston the GPU. . . . . . . . . . . . . . . . . . . . . . . . . . . 41Figure 3.4 Contiguous memory layout. . . . . . . . . . . . . . . . . . . 45Figure 3.5 Memcached speed-up vs. a single core CPU on the discreteand integrated GPU architectures. Each batch of GET requestscontained 38,400 requests/batch. . . . . . . . . . . . . . . . . 56Figure 3.6 Throughput and latency while varying the request batch size onthe AMD Radeon HD 5870 (normalized to 1,024 requests/batch). 58xiiiFigure 3.7 Speed-up of AMD Radeon HD 5870 and Llano A8-3850 vs.the Llano A8-3850 CPU at different request batch sizes. . . . 59Figure 3.8 Memcached overall execution breakdown (23,040 requests/-batch). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60Figure 3.9 Example control-flow graph with an error handling branchfrom B to K. Each basic block contains the basic blockidentifier and the number of instructions in that basic block(Basic Block ID-# Instructions). All branches have a non-zerobranch outcome probability except for B to K, which is nevertaken. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62Figure 3.10 SIMD efficiency. . . . . . . . . . . . . . . . . . . . . . . . . 64Figure 3.11 L1 data cache misses per 1,000 instructions at various config-urations. FA = Fully Associative. . . . . . . . . . . . . . . . 66Figure 3.12 Performance as a percentage of peak IPC with various realisticL1 data cache configurations and two idealized memory systems. 66Figure 3.13 Memory requests generated per instruction for each static PTXinstruction. . . . . . . . . . . . . . . . . . . . . . . . . . . . 68Figure 3.14 Performance of Memcached at various wavefront sizes (nor-malized to a warp size of 8). . . . . . . . . . . . . . . . . . . 69Figure 4.1 Breakdown of the baseline Memcached request processingtime for a single GET request on the CPU. . . . . . . . . . . 73Figure 4.2 End-to-end breakdown of user-space and Linux Kernel pro-cessing for a single GET request on the CPU. . . . . . . . . . 74Figure 4.3 GNoM packet flow and main CUDA kernel. The figurecontains the three main components, the NIC, CPU, and GPU,and the corresponding GNoM software frameworks that runon each device. The solid black arrows represent data flow, thedashed black errors represent metadata flow (e.g., interrupts,packet pointers), the double black arrows represent packetdata and packet metadata, the solid grey arrows represent GPUthread control flow, and the dashed grey lines represent GPUsynchronization instructions. . . . . . . . . . . . . . . . . . 77xivFigure 4.4 Software architecture for GNoM-host (CPU). . . . . . . . . . 79Figure 4.5 Partitioning the Memcached hash table and value storage be-tween the CPU and GPU. . . . . . . . . . . . . . . . . . . . 86Figure 4.6 Race condition between dependent SET and GET requests inMemcachedGPU. . . . . . . . . . . . . . . . . . . . . . . . 93Figure 4.7a Zipfian: Comparing the hit rate for different hash table tech-niques and sizes under the Zipfian request distribution. Therequest trace working size is 10 million entries. . . . . . . . . 103Figure 4.7b Latest: Comparing the hit rate for different hash table tech-niques and sizes under the Latest request distribution. 
The re-quest trace working size is 10 million entries. . . . . . . . . . 104Figure 4.7c Uniform Random: Comparing the hit rate for different hashtable techniques and sizes under the Uniform Random requestdistribution. The request trace working size is 10 million entries.105Figure 4.8 Miss-rate versus hash table associativity and size compared tohash chaining for a request trace with a working set of 10 mil-lion requests following the Zipf distribution. . . . . . . . . . 106Figure 4.9 Impact of Linux Kernel bypass for GET requests vs. the base-line Memcached v1.5.20. . . . . . . . . . . . . . . . . . . . 107Figure 4.10 Mean and 95-percentile round trip time (RTT) latency versusthroughput for Tesla GPU with GNoM and NGD, and Maxwellwith NGD. . . . . . . . . . . . . . . . . . . . . . . . . . . . 110Figure 4.11 Total system and GPU power (left axis) and total systemenergy-efficiency (right axis) versus throughput for Mem-cachedGPU and GNoM on the NVIDIA Tesla K20c. Thetotal system energy-efficiency for the Maxwell system is alsoshown at the peak throughput with two GNoM-post threads.The number of GNoM-post threads are shown above thegraph, with 1 thread for 1.1 to 7.6 MRPS, 2 threads for 10.1 to12.8 MRPS, and 4 threads for 12.9 MRPS. . . . . . . . . . . 111xvFigure 4.12 Impact of varying the key length mixture and hit-rate on roundtrip time (RTT) latency (left axis) and throughput (right axis)for MemcachedGPU and GNoM on the NVIDIA Tesla K20c.RTT latency is measured at 4 MRPS. Throughput is shown asthe average fraction of peak throughput (at 10 Gbps) obtainedfor a given key distribution. The key distributions are brokendown into four sizes (16B, 32B, 64B, and 128B) and the labelsindicate the percentage of keys with the corresponding length. 113Figure 4.13 Client RTT (avg. 256 request window) during BGT executionfor an increasing number of fine-grained kernel launches. . . 116Figure 4.14 Impact on BGT execution time with an increasing number ofkernel launches and max client RTT during BGT execution. . 117Figure 4.15 Offline GNoM throughput - 16B keys, 96B packets. . . . . . 118Figure 4.16 Offline GNoM processing latency - 16B keys, 96B packets. . 119Figure 4.17 Offline GNoM energy-efficiency - 16B keys, 96B packets. . . 120Figure 5.1 Example of the data and control flow when an external devicelaunches tasks on the GPU for the baseline CUDA streams,Persistent Threads (PT), and EDGE. . . . . . . . . . . . . . . 128Figure 5.2 Evaluating the loss in throughput for CPU compute and mem-ory bound applications (Spec2006) when running concurrentlywith and a GPU networking applications. The GPU’s relianceon the CPU to launch kernels leads to inefficiencies for bothdevices. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129Figure 5.3 Measuring the performance and power consumption of persis-tent threads (PT) versus the baseline CUDA stream model fora continuous stream of large and small matrix multiplicationkernels. Active Idle measures the power consumption of thepolling PT threads when there are no pending tasks. . . . . . 130Figure 5.4 EDGE interrupt partitioning. . . . . . . . . . . . . . . . . . 137Figure 5.5 Interrupt controller logic (reference Figure 5.7). . . . . . . . 142Figure 5.6 Interrupt service routine (reference Figure 5.7). . . . . . . . 144xviFigure 5.7 EDGE GPU Microarchitecture. Baseline GPU diagram in-spired by [87, 111, 155, 175]. EDGE components are shownin green. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145Figure 5.8 Event submission and completion queues. 
. . . . . . . . . . 147Figure 5.9 Percentage of cycles a free warp context is available fora PGW. This limits the amount of time a warp must bepreempted to schedule the PGW. . . . . . . . . . . . . . . . 158Figure 5.10 Average register utilization of Rodinia benchmarks on Gem5-GPU. Register utilization is measured for each cycle and aver-aged across all cycles of the benchmark’s execution. . . . . . 159Figure 5.11 Interrupt warp preemption stall cycles. . . . . . . . . . . . . . 161Figure 5.12 Preemption stall cycles with the victim warp flushingoptimizations applied averaged across the Rodinia and Con-volution Benchmarks. Base is the baseline preemption latencywithout any optimizations applied (Figure 5.11a). BarrierSkip immediately removes a victim warp if waiting at a barrier.Victim High Priority sets the victim warp’s instruction fetchand scheduling priority to the highest. Flush I-Buffer flushesany pending instructions from the victim warp’s instructionbuffer. Replay Loads drops any in-flight loads from the victimwarp and replays them when the victim warp is rescheduledafter the ISR completes. . . . . . . . . . . . . . . . . . . . . 163Figure 5.13 Average impact of the PGW selection, interrupt rate, and ISRduration on concurrent tasks’ IPC (x-axis labels are <interruptrate>-<interrupt duration>). . . . . . . . . . . . . . . . . . 164Figure 5.14 Impact on the Convolution kernel’s IPC when reserving re-sources for the PGW and MemcachedGPU GET kernel. . . . 165Figure 5.15 Average ISR runtime with CONV1 and MEMC kernels atvarying MEMC launch rate relative to a standalone ISR. Theruntime is measured with and without reserving the instructioncache entries for the ISR (i$-Reserve) and for both the P1 andP2 event kernel priorities. The ISR requires three entries inthe instruction cache. . . . . . . . . . . . . . . . . . . . . . 166xviiFigure 5.16 Runtime of the Convolution and Memcached kernels at dif-ferent request rates and resource reservation techniques, rel-ative to the isolated kernel runtimes, and average normalizedturnaround time. . . . . . . . . . . . . . . . . . . . . . . . . 167Figure 5.17 GPU global load instructions issued relative to PersistentThreads (PT). . . . . . . . . . . . . . . . . . . . . . . . . . 170Figure 5.18 Dynamic warp instructions issued relative to PersistentThreads (PT). . . . . . . . . . . . . . . . . . . . . . . . . . 
171xviiiList of AbbreviationsALM Adaptive Logic ModulesAPI Application Programming InterfaceASIC Application-Specific Integrated CircuitCDP CUDA Dynamic ParallelismCFG Control Flow GraphCLB Configurable Logic BlocksCMP Chip MultiprocessorCPC Current Program CounterCPU Central Processing UnitCTA Cooperative Thread ArrayCU Compute UnitsCUDA Compute Unified Device ArchitectureDLP Data-Level ParallelismDMA Direct Memory AccessDNA Direct NIC AccessEDGE Event-Driven GPU ExecutionxixFLOPS Floating-Point Operations Per SecondFPGA Field-Programmable Gate ArrayGNOM GPU Network Offload ManagerGPGPU General-Purpose Graphics Processing UnitGPU Graphics Processing UnitHLS High-Level SynthesisHPC High Performance ComputingIAAS Infrastructure-as-a-ServiceILP Instruction-Level ParallelismIPDOM Immediate Post DominatorKDU Kernel Dispatch UnitKMD Kernel Metadata StructureKMU Kernel Management UnitLRU Least Recently UsedMSI-X Message Signalled InterruptsNAPI New APINIC Network Interface CardPAAS Platform-as-a-ServicePC Program CounterPCIE Peripheral Control Interface ExpressPGW Privileged GPU WarpPKQ Pending Kernel QueuexxPT Persistent ThreadsPTX Parallel Thread ExecutionRDMA Remote Direct Memory AccessRLP Request-Level ParallelismRPC Reconvergence Program CounterRPS Requests per SecondRSS Receive Side ScalingRTT Round-Trip TimeRX ReceiveSAAS Software-as-a-ServiceSIMD Single-Instruction Multiple-DataSIMT Single-Instruction Multiple-ThreadSKB Socket BuffersSM Streaming MultiprocessorSMX Streaming MultiprocessorTCO Total Cost of OwnershipTDP Thermal Design PowerTLP Thread-Level ParallelismTMD Task Meta DataTOS Top of StackTPU Tensor Processing UnitTX TransmitxxiUTDP Ultra Thread Dispatch ProcessorVM Virtual MachinexxiiAcknowledgmentsNone of this work would have been possible without the support of countless peo-ple and organizations. First and foremost, I would like to thank my advisor, Profes-sor Tor Aamodt, for the many years of guidance, support, wisdom, and motivation.Thank you for providing the opportunity to pursue my ideas and helping me toimprove them, for your invaluable insights into our field, research, and teaching,for sending me around the world, and for always pushing me to do high-qualityresearch. Your passion and dedication will always be an inspiration to me.I would especially like to thank my family and friends for helping me through-out this journey. To my father – thank you for everything. You ignited my lovefor computers; you inspired me to become an Engineer; you taught me that a goodscotch with friends is worth the extra weight to carry up a mountain; but most im-portantly, you are the reason I am the man I am today. To my mother – thank youfor all of your love, support, patience, forced breaks, meals, hospital trips, ...the listgoes on. You walked with me through the door of my first undergraduate classroom– I am excited to walk off campus with you in the end. To my brother – thank youfor your levelheadedness, for always being someone that I can talk to, and for allof the great experiences we have shared. I love that we had the opportunity to workside-by-side throughout our time at UBC and look forward to our many adventuresin the future. To my peanut – getting to know you will always be my best memoryfrom UBC. Thank you for your love, for keeping me sane, and for all of your helpalong the way. I could not have done this without you. Merci. To my grandparents,aunts, uncles, and cousins – thank you for always being there to support me andfor helping me to enjoy my other passions. 
It means so much to me. To my friends– thank you for all of the happiness that you bring to my life and for sticking withxxiiime through the countless late nights and ”sorry, I can’t make it”s.Thank you to all of the amazing colleagues and co-authors I had the privi-lege of working with and learning from. Specifically, I would like to thank MikeO’Connor for his invaluable industry knowledge and support, which helped shapemy research. To all of the UBC friends I had the pleasure of working with –Wilson Fung, Ali Bakhoda, Tim Rogers, Shadi Assadikhomami, Andrew Bok-tor, Ahmed ElTantawy, Hadi Jooybar, Ayub Gubran, Dave Evans, Rimon Tadros,Inderpreet Singh, Arun Ramamurthy, Dongdong Li, Amruth Sandhupatla, MariaLubeznov, Deval Shah, Amin Ghasemazar, Bo Fang, Rohit Singla, Salma Kashani,Hassan Halawa, Jimmy Kwa, Johnny Kuan, Xiaowei Ren, Samer Al-Kiswany, De-rick Hsieh, Ryan Jung, Alvin Lam, Dillon Yang, Jimmy Chao, Sheldon Sequeira,Oscar Hou, John Kang, and Chris Eng (and many others) – thank you for makingthis such a memorable experience. Thank you to all of my colleagues and friends atOracle Labs who supported me while I finished my PhD. I would also like to thankthe members of my qualifying, department, and final university PhD exams, Pro-fessor Steve Wilton, Professor Mieszko Lis, Professor Matei Ripeanu, ProfessorSathish Gopalakrishnan, Professor Margo Seltzer, Professor Sudip Shekhar, andmy external examiner, Professor Emmett Witchel, for their insightful feedback andcontributions to improving the work in this dissertation. Additionally, I would liketo thank all of the professors I had the opportunity to learn from over the years.Lastly, I would like to acknowledge the funding sources that made it possible topursue graduate school – the Natural Sciences and Engineering Research Councilof Canada (NSERC) for providing the CGS-D3 and CGS-M, The University ofBritish Columbia for providing multiple awards, and all others who contributedthroughout my graduate and undergraduate studies.xxivTo my mother, Roberta Hicklin, and father,Darrin Hetherington.xxvChapter 1IntroductionEnter the age of computing. Over the past few decades, modern computing sys-tems have rapidly integrated themselves deeply into many aspects of life, rang-ing from driving research, progressing modern science and medicine, exploringthe universe, and managing the world’s economy, to playing games, watching ourfavourite shows, and staying connected with family and friends. Regardless of thetask, there is an ever increasing need for higher performance and efficiency.Improvements in performance enable new classes of applications not previ-ously before possible. Take machine learning and deep learning, for example. In1957, Cornell Aeronautical Laboratory introduced the concepts of the perceptronand neural networks used for pattern recognition [151]. However, it was nearly 50years later before deep learning for pattern recognition began its rise [72]. Fast-forward to today where large tech companies such as Google [90], Amazon [8],Oracle [138], Microsoft [120], and Facebook [51] employ machine learning anddeep learning in many of their computing centers and products. A large contrib-utor to this accelerated growth in deep learning is the improvements in micropro-cessor performance. 
However, these performance improvements must be met withincreases in energy efficiency to remain feasible as computing systems continue toscale up and scale out [20].11.1 Computing TrendsTraditionally, the rapid performance improvements of integrated-circuits (IC) inmicroprocessors has been driven by two main factors: Moore’s law [121] and Den-nard Scaling [43]. Moore’s law states that the number of transistors on an IC willdouble roughly every 18-24 months, whereas Dennard states that voltage and cur-rent, and hence dynamic power, is proportional to the dimensions of the transistor.The combination of these factors enables the transistor switching frequency (clockfrequency) on ICs to increase without increasing dynamic power and enables moretransistors to fit on an IC within similar size and power constraints. As a result,general-purpose central processing units (CPUs) have enjoyed consistent improve-ments in single-threaded performance from generation to generation.However, Dennard scaling has begun to break down over the last decade aswe approach the physical limits of transistor sizes and other factors, such as theincreasing contribution of transistor leakage current, reduce the ability to continuedecreasing power proportionally with smaller transistor sizes. Consequently, pow-ering and cooling the larger number and higher density of transistors on ICs arebecoming prohibitively more expensive (referred to as the “power wall” [15, 122])and has limited the opportunity to continue increasing the clock frequency as ameans to improve performance. This has driven many architectural enhancementsto utilize the additional transistors, such as exploiting instruction-level parallelism(ILP) through out-of-order processing and branch prediction, to further improvesingle-threaded performance. While effective, there is only so much parallelismthat can be extracted from single-threaded programs, which, along with high mem-ory latencies, limits the potential gains from ILP. This has motivated the micro-processor industry to transition towards parallel computing architectures, such aschip-multiprocessors (CMPs), to combat the diminishing returns in single-threadedoptimizations [98].CMPs consist of multiple independent CPU cores integrated on a single chip,typically sharing portions of the memory system and input/output (I/O) interfacesto communicate with each other and the outside world. CMPs enabled manynew opportunities to improve performance over single core systems. Multipleprograms can operate concurrently on different cores in a form of spatial mul-2titasking, instead of temporal multitasking on a single core, or individual pro-grams can explicitly define parallel sections of the code to be handled by mul-tiple threads across different cores using parallel application program interfaces(APIs) [19, 84, 94, 109, 133, 137]. Operating systems – privileged system softwareresponsible for the control, management, and security of microprocessor systems– efficiently schedule programs and threads to multiple cores to improve overallsystem throughput relative to single core systems. Modern CMPs commonly haveon the order of 10s of cores and may consist of multiple homogeneous cores, suchas Intel’s x86 Core-i architectures [167], or heterogeneous cores, such as ARM’sbig.LITTLE architecture [14]. The trend for adding more cores has continued topush forward as transistor sizes decrease. 
For example, Intel’s Xeon E7 processorscontain 24 cores and can be combined in multi-socket server systems for up to 192cores [81].However, recent concerns about “dark silicon”, in which only a fraction of achip can be actively utilized within a given power envelope [48, 66, 118], have in-troduced challenges with multicore scaling to improve performance. This has leadto an increasing focus on specialized accelerators and massively multi-core sys-tems, such as graphics processing units (GPU), field-programmable gate-arrays(FPGA), and application-specific integrated-circuits (ASIC). Such architecturescan provide very high levels of performance and efficiency for specific classesof applications, but may give up the flexibility and programmability inherent ingeneral-purpose processors. ASICs are at the extreme end of this scale, providingdedicated hardware solutions to specific operations and applications at the cost ofgenerality and programmability. In contrast to general-purpose processors, whichoften require multiple steps to perform a single operation to maintain a level ofgenerality, ASICs can directly implement the operation efficiently in hardware.However, while ASICs may contain a level of programmability, they are tied to aspecific class of applications and cannot easily evolve with the application. Addi-tionally, ASICs require hardware design and implementation, which increase thecomplexity and cost relative to software-only solutions.FPGAs, on the other hand, fall in the middle of the scale as reprogrammablehardware devices. Internally, FPGAs contain many reconfigurable hardwareblocks capable of implementing any logic function, dedicated hardware blocks,3I/O blocks, and a reconfigurable interconnection fabric for connecting thesecomponents [7, 113, 114, 180]. The FPGA architecture enables high levels ofperformance and efficiency for certain applications, but the reprogrammabilityreduces the benefits relative to ASICS [100]. FPGAs are programmed usinghardware design languages (HDL), such as Verilog or VHDL, and can be repro-grammed to evolve with changing applications. However, programming in anHDL is still considerably more difficult than programming in software [154] andreprogramming times can be on the order of milliseconds to seconds [142, 147],which imposes challenges when implementing multitasking on FPGAs. While theease of programming has improved on recent FPGAs with support for higher-levelsoftware languages through high-level synthesis (HLS), such as OpenCL [36] andCUDA [142], current HLS solutions tend to achieve improvements in developerproductivity by trading off the quality of results [12].GPUs, the focus of this dissertation, are massively multi-threaded, many-core,throughput-oriented architectures traditionally designed to accelerate graphicsapplications. Graphics processing often involves performing multiple thousandsof similar and independent computations on different pixels, resulting in largeamounts of data-level parallelism (DLP). GPUs exploit this parallelism byconcurrently executing multiple independent operations on a single-instruction,multiple-data (SIMD) architecture to provide significant gains in performance andefficiency. Fortunately, this property of high DLP is not exclusive to graphicsapplications. Over the past decade, GPUs have evolved into general-purposeGPUs (GPGPU), increasing the scope of applications that can benefit from theGPU’s high-efficiency architecture to non-graphics applications with sufficientDLP. 
Contemporary GPGPUs are programmed in high-level, parallel softwarelanguages, such as CUDA or OpenCL.At their core, GPGPUs consist of hundreds to thousands of small, low-frequency, in-order cores grouped together into SIMD processing engines,commonly referred to as streaming multiprocessors (SMs) or compute units (CU),and high-bandwidth memory (the GPGPU architecture and programming modelare described in detail in Section 2.1). Unlike CPUs, which aim to improveperformance through high clock frequencies and aggressive ILP optimizations,GPUs focus on exploiting fine-grained multi-threading (FGMT). Assuming4Table 1.1: Comparing GPU and CPU theoretical performance and efficiencyover previous architecture generations. The GPU values were obtainedfrom an NVIDIA whitepaper [131]. The CPU value were obtained fromIntel processor specifications [83] and calculated using: FLOPS = num-ber of cores × peak clock frequency × flops per cycle.Architecture (Year) IC fab. TDP GFLOPS GFLOPS/WGPU NVIDIA Volta (’17)12nmFFN300 15,700 52.3GPU NVIDIA Pascal (’16)16nmFin-FET+300 10,600 35.3GPU NVIDIA Maxwell (’15) 28nm 250 6,800 27.2GPU NVIDIA Kepler (’13) 28nm 235 5,000 21.3CPU Intel 8th gen core i9 (’18) 14nm 45 921.6 20.5CPU Intel 8th gen core i7 (’18) 14nm 45 825.6 18.3CPU Intel 7th gen core i7 (’17) 14nm 45 524.8 11.7CPU Intel 6th gen core i7 (’16) 14nm 45 473.6 10.5ample amounts of structured parallelism in the application, FGMT can hide theeffects of long latency operations by seamlessly switching between thousandsof concurrently operating GPU thread contexts. Coupled with the lower clockfrequency, FGMT trades off single-threaded performance with high throughputprocessing to improve overall performance and energy-efficiency.Consider the comparison in Table 1.1, which presents the theoretical peak per-formance and energy-efficiency of different NVIDIA GPUs and Intel CPUs acrossmultiple generations. The table also presents the IC fabrication process and yearthe processor was released. Performance is measured in billions of single-precisionfloating-point operations per second (GFLOPS) and energy-efficiency is measuredin peak GFLOPS versus the thermal design power (TDP) (GFLOPS/W). As canbe seen, the latest NVIDIA Volta GPU (GV100) provides over 17× the computethroughput and 2.6× higher energy-efficiency than the latest Intel CPU (Core i9-8950HK). While the Volta has a superior technology fabrication process, even theMaxwell and Kepler architectures with twice the transistor size are able to providehigher performance and energy-efficiency than the latest Intel CPU. However, there5are many limitations in the properties of applications that can actually benefit fromGPU acceleration, which introduces challenges with exploiting the available paral-lelism offered by the GPU architecture. This dissertation argues that the perceivedbar for the types of applications that can obtain benefits from the GPU is often toohigh and that GPUs should be considered as efficient accelerators for a broaderclass of applications. Specifically, this dissertation explores the potential for us-ing GPUs to improve the performance and efficiency of datacenter applicationscontaining ample request-level (packet-level) parallelism, using Memcached [115](Section 2.2) as an example.1.2 GPUs and the DatacenterThe initial applications to pioneer the road for GPGPU computing belonged to thedomain of scientific and high-performance computing (HPC) and were able to at-tain large performance improvements using GPUs [59]. 
These types of applicationsare highly structured and well suited for the GPU’s SIMD architecture. Further-more, the GPU is able to match these high levels of performance with efficiency. Infact, as of June 2017, GPUs were used as accelerators in all ten of the top ten mostefficient supercomputers (GFLOPS/W), as indicated by the Green500 list [165],while also appearing in two out of the top five supercomputers (TFLOPS) [166].However, HPC represents a relatively small segment of the overall computing mar-ket. According to the IDC, in 2015 the overall server market, such as those foundin a datacenter (described below), had revenues of $55.1 billion [78] compared to$11.4 billion for HPC servers [25, 53], which has been a consistent trend over theprevious six years ($43.2 billion [76] for the overall server marked compared to$8.6 billion [77] for HPC in 2009). Consequently, improving the performance andefficiency of datacenter applications can have significant economic benefits.Modern datacenters are massive buildings containing thousands of servers,memory, non-volatile storage, network hardware, and power and cooling systems.Server-side (“cloud”) computing in datacenters has become an increasingly popu-lar computing environment with the growth in Internet services and provides manybenefits relative to independently managing custom computing resources [20].Datacenters offer large amounts of computing potential, services, scalability,6security, and reliability guarantees, which enable vendors and customers to easilydeploy, manage, and tailor the computing environment to their applications’needs. Services such as Software-as-a-Service (SaaS), Platform-as-a-Service(PaaS), and Infrastructure-as-a-Service (IaaS) provide varying levels of controlfor, and support of, the software and hardware resources within the datacenter [6].Additionally, computing resources can be shared across multiple applications,known as workload consolidation, which improves efficiency through higherutilization and reduces costs for both vendors and customers [20, 105].Datacenters also simplify the development and maintenance of software. Soft-ware vendors can frequently and transparently distribute updates to applicationsrunning in the datacenter on known software and hardware configurations, insteadof distributing updates to a variety of different types of client hardware and soft-ware systems [20]. Furthermore, existing applications tend to be frequently up-dated and new datacenter applications with varying processing requirements arerapidly deployed, referred to as workload churn, which places generality and flexi-bility requirements on the hardware resources to be able to support the continuouslyevolving applications.Together, the hardware resources in the datacenter can consume tens ofmegawatts [20]. Reducing energy consumption is therefore a key concern fordatacenter operators. At the same time, typical datacenter workloads oftenhave strict performance requirements, which makes obtaining higher energyefficiency through “wimpy” nodes [11] that give up single-threaded performancenontrivial [73, 147]. Additionally, high workload churn introduces challengeswith deploying high-efficiency ASICs to accelerate datacenter applications, asthe hardware has been specifically designed for a certain class of applications.Consequently, datacenters have traditionally relied on general-purpose processorsas the main computing resources. 
Recently, however, the use of specializedprocessors in the datacenter to address performance and efficiency limitations hasbeen growing. For example, Facebook’s Big-Basin [103] utilizes racks of tightlycoupled discrete GPUs for improving machine learning performance; Google usesGPUs [39] and custom Tensor Processing Units (TPU) [90] for improving machinelearning performance; and Microsoft uses FPGAs for Bing web search [147] andhardware microservices [27], such as encryption [29]. Along with an increase7in specialized accelerators, datacenters have begun to shift towards rack-scalecomputing [82], where communicating components in a disaggregated computing[107] system reside in separate racks to increase utilization and efficiency. Insuch an environment, applications reserve only the resources they require (e.g.,computing resources, accelerators, memory, storage), as opposed to underutilizingover-provisioned servers. In order to be effective, the communication overheadbetween components in separate racks must be minimized.As highlighted above, GPUs are becoming commonplace accelerators in data-centers. One of the key reasons for this is the GPU’s ability to provide very highlevels of performance and energy-efficiency. However, the classes of applicationstaking advantage of the GPUs are limited and, for example, frequently marketedtowards machine learning and scientific computing [61]. While these applicationshave been the driving force for the inclusion of GPUs in the datacenter, there isa large fraction of more traditional datacenter applications, such as web servicesand databases, that are not often considered for GPU acceleration. This disserta-tion asks the question, can these types of network-centric applications also benefitfrom the high performance and efficiency provided by contemporary GPUs? Thesetypes of server applications typically contain large amounts of thread-level (TLP)or request-level (RLP) parallelism [98]. For example, Facebook uses an in-memoryweb caching service, Memcached [115], to alleviate network traffic to expensivebacking databases, which is responsible for handling billions of network requestsper second (RPS) across multiple servers [124]. This results in significant amountsof parallelism across network requests. However, there are many challenges withexploiting this available parallelism on contemporary GPUs due to the irregularbehavior associated with the network-centric server applications.First, while multiple network requests may perform the same high-level opera-tion in parallel, such as retrieving a piece of data from the server, there may be dras-tically different types or amounts of operations performed within the request. Forexample, varying network packet sizes can result in a different number of iterationsto process, or different packet contents may trigger additional levels of processing.This leads to decreased utilization and efficiency on the GPU’s SIMD architecture(Chapter 2 and Chapter 3). Second, the data access patterns across multiple similarnetwork requests can not be known a priori and may be highly distributed across8the GPU’s memory. This decreases the memory bandwidth utilization, reducingboth performance and efficiency (Chapter 2 and Chapter 3). Third, network-based server applications tend to have strict latency constraints [41], which is com-plicated by the GPU’s high-throughput oriented architecture with relatively lowsingle-threaded performance. 
Additionally, GPUs require large amounts of parallelcomputation to provide the high levels of performance and efficiency. This trans-lates to network request batching, and hence increased latency, when acceleratingnetwork requests on the GPU (Chapter 4). Fourth, because network requests arecoming off of the network, an efficient framework is required to manage the dataand computation movement across communicating components within the hetero-geneous environment (Chapter 4). Fifth, network packet processing in currentoperating systems can contribute to a large fraction of the total end-to-end networkapplication latency. Without also considering the network processing, the benefitsof using a GPU for the remaining application processing are limited by Amdahl’slaw (a measure of the total potential performance gains given the fraction of totalcomputation able to be parallelized) (Chapter 4). Finally, contemporary GPUs areconsidered as offload accelerators, which traditionally rely on the CPU for manag-ing the launching and completion of GPU tasks. As a result, even if the GPU isresponsible for all of the network and application processing, the CPU must stillbe involved on the critical path using the standard GPU programming interfaces.This increases programming complexity, reduces performance and efficiency, andunnecessarily reduces the ability for the CPU to concurrently work on other tasks(Chapter 5). Each of these challenges must be addressed to obtain any benefitsfrom GPU acceleration of datacenter applications.1.3 Thesis StatementThis dissertation explores the potential to utilize GPUs as energy-efficient acceler-ators for server-based applications in the datacenter through a software-hardwareco-design. Datacenters are important and ubiquitous computing environments withstrict requirements for high performance, high efficiency, and generality. Whilegeneral-purpose GPUs are capable of providing significant gains in both perfor-mance and efficiency for certain applications, traditional server-based applications9do not often adhere to the specific properties expected of an application to obtainthe benefits offered by the GPU. This dissertation highlights that the GPU can beused to accelerate such applications through a top-down approach – evaluating thebehavior of a popular datacenter application (Memcached) on contemporary GPUhardware and proposing a full end-to-end software stack for accelerating networkservices on contemporary GPUs – and a bottom-up approach – proposing a novelhardware mechanism and modifications to the GPGPU programming model to im-prove the independence, efficiency, and programmability of GPUs in a heteroge-neous system, such as the datacenter.This dissertation first performs a detailed characterization and evaluation ofMemcached, a high-performance distributed key-value store application, on con-temporary GPUs. Compared to traditional GPGPU applications, Memcached ishighly irregular in terms of control flow and data access patterns. From an ini-tial evaluation, it might reasonably appear that such an application would performpoorly on a GPU. This dissertation highlights that even in light of the irregularbehavior, an application such as Memcached can be redesigned to take advantageof the GPU’s high computational capacity and memory bandwidth to achieve im-provements in request throughput. 
Additionally, this dissertation assists with un-derstanding the potential SIMD utilization of an irregular application on a GPU,prior to actually spending the time to implement the application on a GPU, throughthe use of a custom control-flow simulator.However, the actual server application processing is only part of the full end-to-end processing required to service a network request. For example, the networkpacket processing is performed in the operating system on the CPU prior to theserver application. As a result, the gains achieved by offloading only the applica-tion processing to the GPU are limited by Amdahl’s law. This dissertation pro-poses a complete end-to-end software framework, GPU Network offload Manager(GNoM), for offloading both the network and server application processing to theGPU. GNoM addresses many of the challenges with achieving high-throughput,low-latency, and energy-efficient processing on the GPU’s throughput-oriented ar-chitecture, facilitating the development of server-based applications on contempo-rary GPU and Ethernet hardware.Using GNoM, this dissertation proposes MemcachedGPU, an end-to-end im-10plementation of Memcached on a GPU. Multiple components in Memcached areredesigned to better fit the GPU’s architecture and communicating componentsin a heterogeneous environment. MemcachedGPU is evaluated on both high-performance and lower power GPUs and is capable of reaching 10 Gbps line-rateprocessing with the smallest Memcached request size (over 13 million requestsper second (MRPS)) at efficiencies under 12 uJ per request. Furthermore, Mem-cachedGPU provides a 95-percentile round-trip time (RTT) latency under 1.1msat peak throughputs. Together, GNoM and MemcachedGPU highlight the GPU’spotential for accelerating such server-based applications.GNoM aims to offload all of the network and application processing to theGPU. However, contemporary GPUs are often considered as second-class comput-ing resources, which require interactions with the host CPU to manage the launch-ing and completion of tasks on the GPU. As a result, even if all of the required end-to-end processing can be performed on the GPU, the CPU is still required to handleI/O and control between GPU and other third-party devices, such as the networkinterface. This dissertation proposes an event-driven GPU programming model andcorresponding hardware modifications, EDGE, to enable any device in a heteroge-neous system to manage the execution of GPU tasks. EDGE pre-registers tasks onthe GPU and utilizes fine-grained preemption to execute privileged threads capableof triggering the execution of these tasks. EDGE exposes the GPU’s interrupt inter-face to completely bypass the CPU, which improves performance and efficiency,reduces system complexity, and frees up the CPU to work on other tasks. Thisdissertation also proposes a new GPU barrier instruction, the wait-release barrier,which blocks GPU threads indefinitely until being released by the privileged GPUthreads in response to an event. The wait-release barriers can help to reduce theoverheads of persistently running GPU software frameworks, which continuouslypoll in-memory work queues for new tasks.1.4 ContributionsThis dissertation makes the following contributions:1. 
1. It argues that GPUs should be considered as accelerators for datacenter network services with ample request-level parallelism by contrasting a programmer’s intuition of an application’s potential execution behavior on a GPU with the actual behavior, highlighting that the appearance of irregular control-flow and data-access patterns does not necessarily result in negative performance on a GPU.

2. It describes the methodology used to port Memcached to run on integrated CPU-GPU and discrete GPU architectures, focussing on GPU-only performance with minimal modifications to Memcached’s internal implementation and data structures.

3. It characterizes and evaluates Memcached on both integrated CPU-GPU and discrete GPU architectures. To provide deeper insights, this dissertation evaluates the behavior of Memcached on a cycle-accurate GPGPU simulator [1].

4. It presents the initial design of a control flow simulator, CFG-Sim, which can assist GPGPU developers in understanding the potential GPU SIMD utilization of an application prior to actually porting the application to a GPU.

5. It presents GNoM (GPU Network Offload Manager), a software system for efficient UDP network and application processing on GPUs, and evaluates the feasibility of achieving low-latency, high-throughput (10 GbE line-rate), and energy-efficient processing at any request size on commodity Ethernet and GPU hardware.

6. It describes the design of MemcachedGPU, an accelerated key-value store that leverages GNoM to run efficiently on a GPU, and addresses the challenges associated with partitioning a key-value store across heterogeneous processors. Compared to the initial GPU version of Memcached, MemcachedGPU optimizes for both throughput and latency in a full end-to-end design. Additionally, this dissertation compares MemcachedGPU against prior accelerated Memcached implementations.

7. It explores the potential for workload consolidation on GPUs during varying client demands while maintaining a level of QoS for a higher priority GPU network-based application.

8. It highlights the limitations with contemporary GPUs being considered as second-class computing resources and discusses the need for increased independence of such accelerators to improve performance and efficiency in the datacenter. To this end, this dissertation proposes EDGE, an event-driven programming model, API, and corresponding GPU hardware modifications to enable increased GPU independence for applications that primarily use the GPU.

9. It proposes and evaluates a fine-grained, warp-level (Section 2.1.1) GPU interrupt and preemption mechanism, which triggers a set of privileged GPU warps (PGWs) from any device in a heterogeneous system through EDGE for initiating and managing tasks internally on the GPU.

10. It proposes a new GPU barrier instruction, the wait-release barrier, which halts the execution of specific GPU threads indefinitely until being released by an event, and highlights the benefits of the wait-release barrier to reduce the polling overheads of a persistent GPU thread style of programming.
11. It evaluates EDGE in a multiprogrammed environment and highlights the ability to achieve the performance and simplicity of the baseline CUDA programming model with the flexibility of software-only workarounds aimed at increasing the independence of GPUs.

1.5 Organization

The rest of this dissertation is organized as follows:

• Chapter 2 discusses the relevant background information for this dissertation, such as the GPU architecture and programming models evaluated in this study, the Memcached key-value store application, networking, and the event-driven programming model.

• Chapter 3 presents the initial evaluation into porting Memcached, an irregular key-value store datacenter application, to both integrated and discrete GPU hardware, and provides deeper insights into the behavior of Memcached on a GPU via a GPGPU simulator.

• Chapter 4 tackles the challenges with implementing Memcached on a GPU in a complete end-to-end system. This chapter proposes GNoM, a software framework for accelerating both network and application processing for network-based applications on contemporary GPU hardware. This chapter then presents the end-to-end design and implementation of MemcachedGPU, which utilizes GNoM to achieve 10 GbE line-rate processing for any Memcached packet size on both high-performance and low-power discrete GPUs. This chapter also highlights the potential for workload consolidation on GPUs in the datacenter while maintaining a level of QoS for high-priority network applications.

• Chapter 5 identifies limitations with the current system and architectural support for considering GPUs as first-class computing resources in a heterogeneous environment, which rely on the CPU to act as the middleman for control and task management. This chapter proposes EDGE, an event-driven programming model and corresponding modifications to the GPU architecture, to enable third-party devices in a heterogeneous environment to directly manage tasks on the GPU.

• Chapter 6 discusses the related work for this dissertation.

• Chapter 7 concludes this dissertation and discusses directions for future work.

Chapter 2
Background

This chapter presents the relevant background information for this dissertation. It first describes the GPU programming model, details two GPU architectures evaluated throughout this research, and discusses irregularities that arise in the GPU thread control-flow and memory systems. This chapter then discusses GPU system-level frameworks and current support for interrupts on GPUs. Next, this chapter details a key-value store application, Memcached, which is evaluated throughout this dissertation. Finally, this chapter provides an overview of the relevant networking and event-driven programming background.

2.1 Graphics Processing Units (GPUs)

This section details the contemporary GPU programming model and GPU architectures assumed in this dissertation.

2.1.1 GPU Programming Model

GPUs are high throughput-oriented offload accelerators traditionally designed for graphics. GPUs have since evolved into general-purpose processors, namely general-purpose graphics processing units (GPGPUs), capable of providing high-throughput, energy-efficient processing for data parallel software, such as high-performance computing (HPC). In this dissertation, the terms GPUs and GPGPUs are used interchangeably.

Non-graphics applications are written in a C-like language, such as CUDA [133] or OpenCL [94].
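To make the host/device structure concrete, the following minimal CUDA sketch shows the two components of a typical GPU program: a device kernel and the host code that launches it. The kernel, grid, and CTA concepts used in the launch configuration are defined in the remainder of this section; all names in the sketch are illustrative.

    // Minimal CUDA sketch (illustrative names): a device kernel plus host code
    // that launches it. The <<<grid, cta>>> launch configuration specifies the
    // number of CTAs (thread blocks) and the number of threads per CTA.
    #include <cuda_runtime.h>

    __global__ void vector_add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread index
        if (i < n)
            c[i] = a[i] + b[i];
    }

    // Host component: launch one GPU thread per element, 256 threads per CTA.
    void launch_vector_add(const float *d_a, const float *d_b, float *d_c, int n) {
        int cta_size = 256;
        int grid_size = (n + cta_size - 1) / cta_size;
        vector_add<<<grid_size, cta_size>>>(d_a, d_b, d_c, n);
        cudaDeviceSynchronize();  // wait for the kernel to complete
    }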
In this dissertation, both NVIDIA and AMD GPUs are evaluated.The terminology is defined for both vendors in the form NVIDIA term [AMD term].GPU applications consist of two main components, a host (CPU) component anda device (GPU) component. In this dissertation, both host and CPU, and deviceand GPU, are used interchangeably. The CPU is responsible for communicatingdata and control with the GPU (through a GPU driver running on the CPU) viathe application programming interfaces (APIs) provided by CUDA and OpenCL.Asynchronous operations enable overlapping data communication and (parallel)task execution through multiple streams [command queues], which are mapped tophysical hardware queues on the GPU. Data communication, which is discussedin more detail below, can occur explicitly through memory copies or implicitlythrough direct memory access (DMA) from either the CPU or GPU. The GPUis responsible for executing user-defined parallel sections of code, called kernels,which perform the actual application processing on the GPU. In this dissertation,kernel and compute kernel are used interchangeably.GPUs support a single-instruction, multiple-thread (SIMT) execution model,in which groups of scalar threads execute instructions in lock-step. SIMT ar-chitectures improve energy efficiency by amortizing the instruction fetch and de-code logic, and memory operations, across multiple threads. Similar to single-instruction, multiple-data (SIMD), SIMT processes the same instructions concur-rently over multiple different data elements. Unlike SIMD, SIMT enables threadsto take different paths through the application’s control-flow graph (CFG), whichis described more below. In this dissertation, SIMD and SIMT are both used torefer to the GPU’s SIMT architecture, which contains a SIMD pipeline capable ofexecuting subsets of SIMD lanes.GPU kernels contain a hierarchy of threads, which define boundaries for com-munication and levels of task scheduling. GPU threads [work items] are groupedinto warps [wavefronts], which execute instructions in a lock-step SIMT fashion.Typical warp sizes are 32 or 64 threads. Warps are further grouped into cooper-ative thread arrays (CTAs) [work groups]. CTAs are also referred to as threadblocks. CTAs are the main schedulable unit of work on the GPU. CTAs are thengrouped into grids [NDRanges], which form the main work for the kernel. CTAs16are dispatched as a unit to a streaming multiprocessor (SM or SMX) [Stream core],whereas individual warps within the CTA are scheduled independently on the cor-responding SM. When a kernel is launched, the size and dimension of the kernel isdefined. Threads within a CTA can communicate via fast on-chip scratch pad mem-ory, shared [local] memory, whereas threads in different CTAs must communicatethrough slower off-chip global memory. Additionally, warps within a CTA canperform efficient synchronization through GPU hardware barriers, whereas warpsin different CTAs must implement their own form of synchronization via globalmemory 1.2.1.2 GPU ArchitectureThis section presents an overview of the AMD (Chapter 3) and NVIDIA (Chapter4 and Chapter 5) GPU architectures evaluated in this dissertation.Figure 2.1 presents a high-level view of an AMD-like GPU architecture as-sumed in this research [9] 2. The GPU consists of multiple compute units, eachcontaining one or more stream cores. 
The stream cores contain multiple processing elements, which perform the actual SIMD computations, branch units to manage thread control flow, general purpose registers, and a memory interface to the caches and the global memory controller. The number of processing elements is typically equal to, or a factor of, the warp/wavefront size (32 or 64). An Ultra-Threaded Dispatch Processor (UTDP) manages the scheduling of kernels and work groups to the compute units. Each compute unit is also connected to on-chip, read-only L1 instruction and data caches, on-chip local data stores (LDS) for scratchpad memory and intra-work-item communication, and on-chip, read-only L2 caches for images and constant data.

1 The latest NVIDIA GPUs enable synchronization across multiple CTAs [131].
2 The architecture presented here corresponds to the time at which the study in Chapter 3 was performed (2012).

Figure 2.1: High-level view of an AMD-like GPU architecture assumed in this dissertation.

The compute units are connected to a memory controller to service requests to off-chip constant and global memory. AMD provides both discrete GPUs and integrated GPUs. Discrete GPUs are physically separate units connected to the host machine via a connection bus, such as the peripheral component interconnect express (PCIe) bus. Integrated GPUs are collocated on the same physical die as the CPU, sharing a common global memory. Each type of GPU, discrete and integrated, has benefits and drawbacks. Discrete GPUs tend to have significantly higher processing capabilities, have their own high-bandwidth physical GPU global memory, and have their own power and cooling systems. However, the separate memory requires data to be explicitly copied between the CPU and GPU or directly accessed across the PCIe bus, which increases complexity and reduces potential performance benefits. On the other hand, integrated GPUs tend to trade off lower processing capabilities with lower power consumption, and remove the need for data to be copied between physical memories over the PCIe bus, since both the CPU and GPU share the same physical memory.
Chapter 3 evaluates both discrete and integrated GPUs.

Figure 2.2: High-level view of an NVIDIA-like GPU architecture assumed in this dissertation.

Figure 2.2 presents a high-level view of an NVIDIA-like GPU architecture influenced by two NVIDIA patents [87, 155] and academic research [17, 174]. The illustrated GPU contains multiple streaming multiprocessors (SMXs), each with their own control logic, warp schedulers, SIMD processing cores, on-chip L1 instruction and data caches, on-chip shared scratchpad memory for intra-CTA communication, large register files, and thread/warp contexts. Each SMX is connected to an off-chip shared L2 cache via an interconnect. The L2 cache is connected to a memory controller for managing requests to high-bandwidth, off-chip global GPU memory.

The host communicates data and kernels with the GPU over PCIe through a set of hardware managed CUDA streams, allowing for concurrently operating asynchronous tasks. NVIDIA’s Hyper-Q [134] enables up to 32 independent hardware streams, which, depending on the level of resource contention, can perform up to 32 independent concurrent operations. These hardware queues may be stored in a front-end I/O unit, which receives PCIe packets from the host.

The functionality of the Kernel Metadata Structure (KMD), Kernel Management Unit (KMU), and Kernel Distributor Unit (KDU) (Figure 2.2) is best described through an example of launching a kernel on the GPU. The CPU first passes the kernel parameters, kernel metadata (kernel grid and CTA dimensions, shared memory requirements, and stream identifier), and a function pointer to the actual kernel code to the GPU driver running on the CPU. The GPU driver then configures the kernel parameter memory and KMD (structure storing the kernel metadata), and launches the task into the hardware queue corresponding to the specified stream in the GPU’s front-end. Internally, the KMU stores pointers to the KMDs waiting to execute in a pending kernel queue (PKQ). The KDU stores the KMDs for the actively running kernels. KMDs are loaded into the KDU from the KMU when a free spot is available. Finally, the SMX scheduler configures and distributes CTAs from the KDU to the SMXs based on available resources for the CTA contexts. The CTA resources consist of thread/warp/CTA contexts, registers, and shared memory. The first resource to be depleted dictates the maximum number of CTAs able to run concurrently on an SMX. Note that a CTA from another kernel, which has different resource requirements, may be able to concurrently run on the same SMX if it does not require more resources than are available. Current NVIDIA GPUs support up to 32 concurrently running kernels and 1024 pending kernels.

2.1.3 GPU Memory Transfers

Discrete GPUs have physically separate, high-bandwidth GPU memory. Traditional GPU programs perform explicit data transfers between the CPU and GPU memory through the CUDA API and stream interface.
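As an illustration of this explicit-copy pattern, the sketch below stages a batch of data through a single CUDA stream: an asynchronous host-to-device copy, a kernel launch, and an asynchronous device-to-host copy, all of which execute in order on that stream. The buffer names and the process_batch kernel are illustrative and not part of any real API.

    #include <cuda_runtime.h>

    // Illustrative kernel standing in for the application's processing.
    __global__ void process_batch(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;
    }

    // Explicit CPU-GPU data movement over one CUDA stream: a pinned (page-locked)
    // host buffer enables asynchronous copies, and the copy-kernel-copy sequence
    // executes in order on the stream while other streams can proceed concurrently.
    void run_batch(int n) {
        float *h_buf, *d_in, *d_out;
        cudaStream_t stream;
        cudaMallocHost((void **)&h_buf, n * sizeof(float));  // pinned host memory
        cudaMalloc((void **)&d_in, n * sizeof(float));       // separate device (GPU) memory
        cudaMalloc((void **)&d_out, n * sizeof(float));
        cudaStreamCreate(&stream);

        int cta = 256, grid = (n + cta - 1) / cta;
        cudaMemcpyAsync(d_in, h_buf, n * sizeof(float), cudaMemcpyHostToDevice, stream);
        process_batch<<<grid, cta, 0, stream>>>(d_in, d_out, n);
        cudaMemcpyAsync(h_buf, d_out, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);  // block until all three operations finish

        cudaStreamDestroy(stream);
        cudaFree(d_in);
        cudaFree(d_out);
        cudaFreeHost(h_buf);
    }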
More recent GPUs, such asNVIDIA’s Pascal and Volta GPUs, support unified memory, which enables implicitmemory copies via virtual memory demand paging [126, 131]. However, explic-itly managing memory is still beneficial from a performance standpoint. Signifi-cant losses in end-to-end performance can occur with large data transfers or a largenumber of small data transfers between the host and device [62], because the timerequired to transfer data increases linearly with the amount of data needing to betransferred [32]. PCIe 2.x and 3.x with an x16 connection can transfer data at 8GB/s and 16 GB/s respectively [145], which can be one to two orders of magni-tude lower than current GPU DRAM bandwidths (e.g., 900 GB/s on the NVIDIAVolta architecture [131]). On the other hand, integrated GPUs avoid the necessityfor explicit memory copies to GPU memory since they share the physical DRAMwith the CPU 3. However, this comes at the cost of lower GPU bandwidth due tosharing resources with the CPU and a lower bandwidth memory architecture.Furthermore, the separate GPU memory traditionally requires that an externaldevice, such as a network interface controller (NIC) or field-programmable gatearray (FPGA), first copy the data to CPU memory and then from CPU memoryto GPU memory. GPUDirect [128] removes this requirement by enabling remotedirect memory access (RDMA) from a third-party device directly to GPU memory,completely bypassing CPU memory. This is achieved by exposing the GPU’s pagetable and mapping an RDMA-able portion of the GPU’s memory into the externaldevice’s memory space. However, while GPUDirect solves the issue of the data-path bypassing the CPU, the control-path is still required to go through the GPUdriver running on the CPU.Chapter 4 and Chapter 5 discuss the challenges with control-path dependenceon the CPU in more detail.3Mapping and un-mapping operations may be required to ensure the data the GPU is accessing iscoherent with any CPU updates.212.1.4 CUDA Dynamic ParallelismCUDA Dynamic Parallelism (CDP) [127] enables GPU threads to launch sub-kernels directly on the GPU. CDP exploits nested irregular parallelism in GPUapplications, where there may be a varying degree of parallelism throughout thekernel. For example, graph search algorithms may have a varying and unknownnumber of nodes to visit at each stage, requiring a different number of threadsat each stage. CDP can also be used to avoid round trip kernel launches to/fromthe CPU by allowing GPU threads to internally launch and synchronize kernels.CDP exposes an API to manage the dynamic resource allocation, launching, andsynchronization of children kernels in a parent kernel. Similar to host-launchedkernels, GPU-launched children kernels are inserted into hardware queues in theKMU for managing pending kernels. However, the flexibility in CDP for any GPUthread to configure and manage sub-kernels has been shown to have high over-heads [174, 175], which can significantly limit the performance benefits comparedto the baseline where the CPU launches tasks on the GPU.2.1.5 Kernel Priority and Kernel PreemptionBased on an NVIDIA patent [155], the GPU maintains multiple differentqueues/lists for storing pending kernels (KMD pointers) to be executed on theGPU. Both host-launched and device-launched kernels can specify a priorityassociated with the kernel, which may be implemented by assigning differentqueues different priorities in the KMU. Children kernels in CDP inherit theirparent kernel’s priority. 
The KDU can then select a KMD to launch based on agiven priority selection algorithm.Kernel priority can aid the choice of deciding which kernel to schedule nextwhen enough free resources are available. However, while GPUs have long sup-ported preemption for graphics applications, GPUs have traditionally only sup-ported spatial multitasking for GPGPU applications, not temporal multitasking viapreemption and context switching. The lack of temporal multitasking results inlarge, long running GPU applications that consume all GPU resources to blockother kernels from running. This occurs even if a pending kernel has a higher prior-ity than the large long running kernel. NVIDIA has recently proposed fine-grained,22instruction-level preemption for GPGPU applications in their newest GPU archi-tectures [126, 131]. However, unlike CPUs, context switching a full GPU kernelcan be very expensive, in terms of computing cycles, due to the large amounts ofstate to save (e.g., register files and shared memory). Additionally, recent researchhas proposed multiple optimizations to reduce preemption overheads to better sup-port GPU preemption and multitasking [33, 92, 143, 157, 162, 176].GPU preemption is discussed in more detail in Chapter 5.2.1.6 Current GPU Interrupt SupportAn NVIDIA patent [155] presents an exception handling mechanism that enablesthe GPU to handle internal exceptions (e.g., arithmetic errors), traps (e.g., threadshitting a breakpoint when debugging) or external interrupts (e.g., cuda-gdb inter-acting with the GPU or a kill signal from the host CPU). The patent describesa global interrupt mechanism, which minimizes the design and verification com-plexity – when an interrupt or exception occurs, all warps running on an SM areinterrupted and transition from their current code to an interrupt handler. Onlythe warp(s) responsible for the exception/interrupt actually processes the interrupt,while all other warps immediately return to their original execution. However, allwarps from all active CTAs on an SM must temporarily pause their execution to testthe exception, which can reduce performance if only a subset of warps are requiredto service the interrupt/exception. An AMD Graphics Cores Next (GCN) Archi-tecture white paper [10] also describes scalar cores within the Compute Units thatare responsible for handling GPU interrupts, which is useful for supporting GPUdebugging.As previously described, discrete GPUs are typically connected to the hostCPU via a PCIe bus and contain an I/O and front-end unit to communicate over thePCIe connection (Figure 2.2). As of PCI 3.0, the CPU and external devices caninteract with PCIe devices through Message Signaled Interrupts (MSI-X), whichsupport up to 2048 different interrupts. Unlike traditional interrupts that rely onspecific interrupt wires, MSI-X treats interrupts as special PCIe packets. As such,any device that can communicate over PCIe is (in theory) able to send and triggerGPU interrupts. To the best of our knowledge, no GPU manufacturer has published23an API for enabling programmers to take advantage of the ability to send interruptsto the GPU directly from a user-space application.GPU interrupts are discussed more in Chapter 5.2.1.7 GPU Persistent ThreadsPersistent threads (PT) are an alternative technique for programming and launchingtasks on the GPU. 
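A minimal CUDA sketch of this style, using one of the work-queue organizations discussed in the remainder of this section (a dedicated queue slot per persistent CTA), is shown below. All structure and variable names are illustrative; a production design must handle task arguments, result delivery, and memory-consistency details more carefully.

    // Persistent-threads sketch (illustrative): each persistent CTA owns one slot
    // of a work queue in GPU global memory. A producer (the CPU, or any device
    // that can write GPU memory) fills a slot's fields and then sets its state to
    // READY; the CTA polls the state, processes the task, and releases the slot.
    struct WorkSlot {
        volatile int state;   // 0 = empty, 1 = ready
        int op;               // task type
        float *data;          // task input/output buffer
        int size;
    };
    #define SLOT_EMPTY 0
    #define SLOT_READY 1

    __global__ void persistent_kernel(WorkSlot *slots, const volatile int *exit_flag) {
        WorkSlot *my_slot = &slots[blockIdx.x];   // one queue slot per persistent CTA
        __shared__ int have_work;
        while (!*exit_flag) {                     // run until the host asks us to stop
            if (threadIdx.x == 0)
                have_work = (my_slot->state == SLOT_READY);
            __syncthreads();                      // broadcast the poll result to the CTA
            if (have_work) {
                // ... all threads in the CTA cooperate to process my_slot->data ...
                __syncthreads();                  // ensure processing has finished
                if (threadIdx.x == 0) {
                    __threadfence();              // make any results visible first
                    my_slot->state = SLOT_EMPTY;  // hand the slot back to the producer
                }
            }
        }
    }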
PT pre-configures and launches a large number of continuouslyrunning persistent CTAs (pCTAs) on the GPU, which poll for new tasks from in-memory work queues to perform the application processing [63]. This increasesthe application’s control of the task and thread scheduling, effectively replacing theCUDA driver and the GPU’s hardware kernel and CTA schedulers with a user-levelsoftware GPU task scheduler that is independent from the CPU. As such, PT alsoenables any device in the heterogeneous system to initiate a task on the GPU bysimply writing into a work queue in the GPU’s memory. The polling pCTAs canthen identify the arrival of a new task and begin the actual kernel processing forthe corresponding task. For this to work, each pCTA must be sized appropriatelyto accommodate the maximum task size that may be scheduled, which results inunderutilized resources for smaller tasks.There are multiple different techniques for implementing PT on GPUs [96,182]. For example, there can be a single global work queue, which requires syn-chronization between all pCTAs, or local per-pCTA work queues, which limitscontention at the cost of load imbalances. Additionally, all pCTAs may be respon-sible for both polling the work queue and performing the actual kernel processing,or separate pCTAs may be responsible for polling the work queues and passingthe tasks to a separate set of pCTAs for the actual kernel processing (a form ofprefetching to overlap reading the work queues and processing the tasks). Eachdesign provides a trade-off in the number of threads synchronizing on the sharedwork queues and the amount of resources available for performing the kernel pro-cessing.Persistent threads are evaluated further in Chapter 5.242.1.8 GPU Architectural IrregularitiesThis section describes two key performance irregularities that are inherent to theGPU architecture: branch divergence and memory divergence.Branch DivergenceAs previously described, scalar GPU threads are grouped together into warps. Inan application without any conditional control flow operations, such as conditionalbranches, the warp is able to make forward progress through the code while exe-cuting instructions for each thread in parallel. This follows the normal SIMD exe-cution model. However, if conditional branches are introduced into the code, it ispossible for a subset of the threads in a warp to take the branch while the remainingthreads do not. While all threads in a warp previously executed the same instruc-tions, now only a subset of threads in the warp execute the same instructions. Thisis known as branch divergence [110, 119, 172]. Branch divergence can occur anynumber of times and, in the worst case, each thread in a warp executes a differentinstruction. Consequently, threads that are not executing a given instruction resultin idle lanes in the SIMD hardware, which lowers the SIMD utilization. To handlebranch divergence, GPUs contain a hardware component similar to the SIMT stackdescribed by Fung et al. [172, 173]. An example highlighting the SIMT stack isshown in Figure 2.3 and is described below. The SIMT stack can be used to trackthe active threads at various points throughout the program’s execution. 
An active thread refers to a GPU thread within a warp that is currently executing instructions. A thread becomes inactive if it takes a branch that diverges away from the other threads in the warp.

Each entry on the SIMT stack contains a bit-mask representing the active threads (work items) in a warp (wavefront), with the top element in the stack (TOS) signifying the subset of threads to execute. The SIMT stack also records the current program counter (CPC) and a re-convergence program counter (RPC). The CPC specifies the instruction that the active threads on a corresponding stack entry will execute once it becomes the TOS. As the threads in the TOS entry execute the instructions, the CPC is incremented accordingly. The other counter, the RPC, specifies the immediate post-dominator (IPDOM) instruction. The IPDOM is defined as the closest instruction in the program that all paths leaving the branch must go through before exiting the function. As such, the IPDOM is the earliest instruction at which all threads in the warp are guaranteed to be executing the same instruction. The SIMT stack uses the RPC to specify where the active and inactive threads can rejoin. Once the active subset of threads reaches the RPC, the TOS is popped from the stack and the GPU starts executing the new TOS.

Figure 2.3: SIMT execution example.

Figure 2.3 shows an example of the SIMT execution flow when executing a piece of code taken from a hash function in a popular key-value store application, Memcached. Memcached is discussed in detail in Section 2.2. The corresponding section of Memcached’s CFG is shown on the left. The CFG is also annotated with the actual branch probabilities extracted from multiple Memcached runs. For simplicity, in this example there are six threads (work items) per warp (wavefront). Snapshots of the SIMT stack are shown at different points throughout execution on the right.

A GPU scalar thread is represented by a vertical column in the middle of Figure 2.3 (labeled 0 through 5). Active threads are indicated by black arrows and inactive threads are indicated by white arrows. The ovals represent basic blocks in the CFG, which signify a portion of code with single entry and exit points that ensure no further branch divergence. Hence, once execution starts at the beginning of a basic block, it will always continue until the end of the basic block. Without branch divergence, the minimum number of basic blocks required to reach H from A is four (A → C → G → H) and the maximum is five (A → B → D/E → F/G → H).

At the beginning, all of the threads in the warp are set to active and execute the instructions at block A. At the first branch, threads 0-3 go to block B while threads 4-5 go to block C; the corresponding entries are pushed onto the SIMT stack. The re-convergence point for each thread split is set to the IPDOM block H, which is the first point through which all threads must pass, regardless of the branches taken or not taken in this CFG. Additionally, the previous TOS’s CPC is set to H, which indicates that all threads will resume concurrent execution at H. The next step is to execute a subset of the original warp until reaching a re-convergence point.
This is achieved by executing the threads at the new TOS. Threads 4-5 execute basic block C (T=1) and G (T=2) before reaching the re-convergence point H. The TOS is popped off and execution switches to threads 0-3 at B (T=3). Here, the threads split again, removing the current entry from the SIMT stack and pushing two new entries onto the stack for basic blocks D and E (T=4). Threads 1-3 execute basic blocks E (T=4) and G (T=5) until reaching the re-convergence point H (T=6), which pops the TOS and switches execution to thread 0 (T=7) to execute basic block D prior to reaching H (T=8). At this point (T=8) all of the threads in the warp are at the re-convergence point H and resume concurrent execution.

Assuming all the basic blocks have the same number of instructions, the SIMD efficiency in this example is approximately 46%, and nine blocks are executed in total (four more than the maximum number of basic blocks under no branch divergence). Also, block G is executed twice, resulting in more instructions being issued than necessary.

Thus, to achieve the highest performance from the GPU, it is desirable to have as little branch divergence as possible.

Memory Divergence

Memory accesses are another property of GPUs directly affected by the grouping of threads in a warp. If the instruction being executed by all threads is a memory-access operation, such as a load or a store, each thread will generate a memory request to be handled by the memory system. Assuming each memory request is for a data object in a different region of memory, all requests will be handled separately. This phenomenon is referred to as memory divergence. However, if the requested data from threads in a warp lies within a given range, such as the size of a cache line or the size of data returned from a memory request (i.e., high data locality between threads in a warp), the memory requests falling into one of these common regions can be coalesced into a single request. This can significantly reduce the amount of traffic on the memory system.

Figure 2.4: GPU memory request coalescing.

Consider the example in Figure 2.4. In this example, there are two warps, Warp 0 and Warp 1, each containing two threads, Thread 0-1 and Thread 2-3 respectively. Each thread in Warp 0 executes a load instruction LD-0 and each thread in Warp 1 executes a load instruction LD-1. On the first load instruction, LD-0, Thread 0 loads in A and Thread 1 loads in B. Because both data objects lie in a single cache line, the two memory requests can be coalesced into a single memory request. As a result, if the request misses in the cache, only a single memory request will be sent to the lower level cache or memory system. However, on the second load instruction, LD-1, Thread 3 sends a memory request for D, which falls into a separate cache line from Thread 2’s memory request to C. Thus, two physical memory requests will be generated. If one of the memory requests hits in the cache while the other memory request misses in the cache, all threads in the warp will be stalled until the last memory request is serviced.
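The contrast between coalesced and divergent accesses can be illustrated with two small CUDA kernels (illustrative names). In the first, consecutive threads in a warp access consecutive elements, so the hardware can merge the warp's requests into a few wide memory transactions; in the second, the address depends on per-thread data, so the requests may fall in many different cache lines and cannot be merged.

    // Coalesced: thread i accesses element i, so a 32-thread warp touches one
    // contiguous 128-byte region that can be serviced with a few transactions.
    __global__ void copy_coalesced(const int *in, int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Uncoalesced (memory divergence): the load address is data-dependent, so the
    // warp's requests may scatter across many cache lines, generating up to one
    // memory transaction per thread.
    __global__ void gather_uncoalesced(const int *in, const int *idx, int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[idx[i]];
    }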
Extending this to real applications, typically containing 32 threads per warp (or 64 work-items per wavefront) and hundreds of warps per kernel, un-coalesced memory requests can generate large amounts of memory traffic to the memory system and significantly affect performance.

Thus, to achieve the highest performance from the GPU, it is desirable to limit the amount of memory divergence.

2.2 Memcached

This section details the main application evaluated in this dissertation, Memcached [115]. Memcached is a general-purpose, scale-out, high-performance, in-memory, key-value caching system used to improve the performance of distributed databases in server applications by caching recently queried data in main memory. This alleviates the amount of traffic required to be serviced by the back-end databases, which access non-volatile storage or other external sources. These expensive I/O operations can significantly reduce overall performance and increase power consumption. Memcached is used by many popular network services such as Facebook, YouTube, Twitter, Wikipedia, Flickr, and others [115].

Figure 2.5: Memcached (web tier, a Memcached tier aggregating per-server memory into a virtual memory pool, and a storage tier).

A high-level view of how Memcached fits into the existing web and storage tiers in a datacenter is shown in Figure 2.5. Memcached acts as a look-aside cache. Requests are first sent to the Memcached servers and either (1) the data is available and is returned to the requesting server or (2) the requesting server is notified of the cache miss and is responsible for querying the back-end database for the missing data, which is then stored into the Memcached system.

Internally, Memcached implements the key-value store as a hash table. Memcached uses an asynchronous networking event notification library, Libevent [112], which removes the loop-based events in event-driven networks to increase performance. All of the Memcached data resides in volatile system memory, which results in fast accesses compared to storing on disk. Memcached implements a scale-out, distributed architecture by combining main memory from individual servers into a large pool of virtual memory. This aggregation of memory effectively provides a much larger memory space that scales linearly with the number of servers in the system. Each server is fully disconnected from the other servers in the pool, meaning that no communication takes place between the servers. All communication is done between the client and one or more servers, which greatly simplifies the overall system design. Furthermore, Memcached has no replication of data, logging, or protection from failure. As the data is stored in memory, a system failure results in the loss of data. Thus, it is the responsibility of the application to implement any failure recovery mechanisms.

Memcached provides a simple key-value store interface to store (SET), modify (DELETE, UPDATE), and retrieve (GET) data from the hash table. The key-value pair and corresponding metadata (e.g., size of key/value, last access time, expiry time, flags) is referred to as an item. All of the Memcached operations require two hashes of the keys. The first hash selects which server the corresponding request should be directed to (based on a pre-configured mapping of keys to servers) and the second hash selects the appropriate entry in the hash table.
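The two hash computations can be sketched as follows. A simple FNV-1a hash and a modulo mapping are used here purely as stand-ins for Memcached's actual hash functions and its pre-configured key-to-server mapping; all names are illustrative.

    #include <stdint.h>
    #include <stddef.h>

    // Stand-in 32-bit FNV-1a hash; Memcached's real hash functions differ.
    static uint32_t hash32(const char *key, size_t len) {
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < len; i++) {
            h ^= (uint8_t)key[i];
            h *= 16777619u;
        }
        return h;
    }

    // First hash (performed by the client): choose which Memcached server
    // should receive the request for this key.
    size_t select_server(const char *key, size_t len, size_t num_servers) {
        return hash32(key, len) % num_servers;
    }

    // Second hash (performed at the selected server): choose the bucket within
    // that server's hash table.
    size_t select_bucket(const char *key, size_t len, size_t num_buckets) {
        return hash32(key, len) % num_buckets;
    }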
Hash-chaining30is used in the event of collisions on write operations and a linear traversal of thelinked-list chain is used on read operations when necessary. To avoid expensivememory allocations for every write request, Memcached uses a custom memoryallocator from pre-allocated memory slabs to allocate memory for items. To reducememory fragmentation, Memcached allocates multiple memory slabs with differ-ent fixed-size entries. The hash table stores pointers to items stored in the memoryslabs. Memcached uses a global Least Recently Used (LRU) eviction protocol tokeep the most up-to-date data cached in main memory. Items are evicted on writeoperations if the write exceeds the available memory limit. A time-out period canbe added to an item to specify when this item’s lifetime is over, regardless of itscurrent usage. This time-out period has precedence over the LRU eviction proto-col; items whose timers have expired are first evicted, followed by the LRU items.Facebook Memcached deployments typically perform modify operations overTCP connections, where it is a requirement that the data be successfully storedin the cache, whereas retrieve operations can use the UDP protocol [124]. SinceMemcached acts as a look-aside cache, dropped GET requests can be classified ascache misses (requiring queries to the back-end database) or the client applicationcan replay the Memcached request. However, excessive packet drops mitigate thebenefits of using the caching layer, requiring a certain level of reliability of theunderlying network for UDP to be effective.2.3 Network Interfaces and Linux NetworkingA network interface controller (NIC) is hardware component in a computer systemresponsible for receiving and sending network packets with other external com-puter systems. The NIC is connected to the host (CPU) either through an inte-grated bus on the motherboard or via a connection bus, such as PCIe. Networkpackets enter hardware RX (receive) queues at the NIC, which are copied to pre-allocated, DMA-able RX ring buffers (driver packet queues), typically residing inCPU memory. The NIC then sends interrupts to notify the CPU of one or morepending RX packets, or the CPU can poll the NIC to check for pending packets.To mitigate high interrupt rates when receiving packets, the Linux kernel uses ahybrid interrupt and polling approach, NAPI (New API). In NAPI, the NIC notifies31the Linux kernel when a packet arrives via an interrupt. The Linux kernel registersthe notification, disables future interrupts from the NIC, and schedules a pollingroutine to run at a later time. The polling routine then services multiple receivedpackets from the NIC up to some pre-defined threshold. The NIC driver copies thepackets from the RX ring buffers to Linux Socket Buffers (SKBs) to be processedby the host Operating System (OS), returns the corresponding RX buffers back tothe NIC to receive future packets, and re-enables interrupts. When transmitting(TX) packets, the Linux kernel copies packets into DMA-able TX ring buffers andnotifies the NIC of the packet to send. The NIC copies the packet into internal TXqueues and then transmits the packet on an outgoing link.Optimizations such as direct NIC access (DNA) can reduce memory copies byallowing user-space applications to directly access the RX and TX ring buffers.This removes the requirement for the Linux kernel to perform additional memorycopies to SKBs to process the packet. 
The user-space application is presentedwith the raw packet and is responsible for performing any packet processing. InChapter 4, we expand on this technique by using NVIDIA GPUDirect [135] todirectly copy the packet data from the NIC to RX buffers stored in GPU memory,which removes the requirement for the CPU to explicitly copy the packet fromCPU memory to GPU memory. Another common optimization in modern NICs isreceive side scaling (RSS), which enables the NIC to install packet filters to directspecific packets to specific CPU cores. Packet filters can be installed through theNIC’s driver API. We also expand on this technique in Chapter 4 to filter certainpackets to the GPU, while all other packets still go through the standard Linuxkernel network flow on the CPU.2.4 Event and Interrupt-Driven ProgrammingEvent or interrupt-driven programming is an alternative style of programming com-pared to threads. Threads execute a program and can perform synchronous, asyn-chronous, blocking, and non-blocking operations. Synchronous blocking opera-tions stall the thread until some condition is satisfied and the thread can resumeexecution. Asynchronous non-blocking operations immediately return, regard-less of whether the operation completed successfully or not, which requires ad-32ditional logic to retry an operation if required. Threads require careful partitioningof resources or synchronization on shared resources. In event-driven program-ming, programs are broken down into fine-grained pieces of code, callbacks, orevent handlers, responsible for performing specific operations in response to I/Oevents [37, 54, 140]. The event handlers can be called either in a continuouslyrunning event loop or triggered via interrupts. Event-driven programming can re-duce the challenges with concurrency and synchronization of thread-based pro-gramming, while also achieving high performance. In Chapter 5, we exploreinterrupt-driven GPU events, which enables any device in a heterogeneous systemto directly manage the execution of GPU tasks independently from the CPU.33Chapter 3Evaluating a Key-Value StoreApplication on GPUsThis chapter performs an initial evaluation of a widely used key-value store dat-acenter application, Memcached (Section 2.2), on discrete and integrated AMDGPUs. GPUs have consistently proven to deliver positive results in scientific andhigh-performance computing (HPC), as demonstrated by their use in several topsupercomputers [165, 166]. This chapter argues that the GPU’s highly-paralleland efficient architecture also makes them strong candidates for non-HPC applica-tions with ample parallelism as well. The family of applications considered in thisdissertation are network-based datacenter workloads. Many datacenter workloadscontain large amounts of thread-level or request-level parallelism [98]. This paral-lelism stems from the fact that the server is required to process multiple differentnetwork requests, potentially at high request rates. If the requests are independent,they can be processed concurrently. However, at first glance these workloads maynot seem suitable to run on GPUs due to the existence of irregular control-flow andmemory access patterns, since the application does not dictate the type, ordering,or rate of requests that arrive at the server. The main goal of this chapter is toquantify and evaluate the potential for GPUs to accelerate such irregular applica-tions. 
To this end, this chapter evaluates the discrepancies between a programmer’sreasonable intuition on how an irregular application may perform on a GPU andthe actual achievable performance. This chapter then explores the challenges in34porting a popular datacenter application, Memcached, to OpenCL and provides adetailed analysis into Memcached’s behaviour on discrete and integrated GPUs.To gain greater insight, this chapter also evaluates Memcached’s performance on acycle-accurate GPU simulator. On the integrated CPU+GPU system, we observeup to 7.5× increase in throughput relative to a single CPU core when executing thekey-value look-up handler on the GPU. Section 3.4.1 discusses a multi-core CPUcomparison.HPC applications are known to efficiently utilize the underlying GPU hardwareto achieve high performance [59]. As discussed in Section 2.1, high GPU perfor-mance is possible with ample data-level parallelism to enable a large number ofthreads to execute in parallel, structured control-flow to maximize the SIMD effi-ciency of the GPU pipeline, and structured memory access patterns to maximizethe utilization of the GPU’s high memory bandwidth. Additionally, the time totransfer application data to the GPU should either be short relative to the applica-tion’s GPU processing time, or the application should be well structured to overlapthe communication of data with the computation for another GPU kernel. Whenperforming an initial analysis of an application to accelerate on a GPU, many pro-grammers may look for similar characteristics to existing applications with provenhigh performance on GPUs. This can lead a programmer to disregard applicationsthat appear to deviate from these characteristics, eliminating some applicationsfrom consideration that actually have the potential to perform well on a GPU.Datacenter (server) applications are a family of highly parallel and econom-ically appealing applications that fall under this category of not appearing to bestrong candidates for GPU acceleration. Server applications represent a larger classof applications than HPC, but one that is unstructured. For example, a web server isresponsible for receiving incoming network requests, processing the request, andsending the response back out to the network. Assuming each network requestperforms a similar task on different data, there may be large amounts of avail-able request-level parallelism. We define request-level parallelism as the ability tooperate on multiple independent requests in parallel. However, the specific oper-ations to perform may differ slightly between network request types or based onthe data within the network request. Additionally, the data access patterns may bedependent on the data within the network request, which limits the ability to ex-35ploit memory coalescing on GPUs. For these reasons, it is challenging to quicklydetermine how such an application would perform on a GPU.We focus on Memcached (Section 2.2) as a representative example of a dat-acenter application with high performance requirements. Memcached is a highlymemory-intensive application, with the main purpose of storing and retrieving dataobjects in memory to service a network request. A typical Memcached server maybe responsible for handling hundreds of thousands to tens of millions of requestsper second (RPS). As a result, there may be a large number of Memcached requestsneeding to be processed at a given time. 
Assuming Memcached’s requests are independent and that each request can be handled concurrently by a separate thread, a reasonable expectation might be that each thread exhibits independent and irregular behaviour. Specifically, depending on the type of request (GET, SET, UPDATE, DELETE), each request may perform a completely different set of operations, or within the same type of request, different operations may be performed depending on the data which is being operated on. Additionally, since the access patterns of each request can not be known a priori, the data accessed in the request packet, hash table, or corresponding memory slab can certainly not be expected to exhibit high locality, leading to memory divergence. Furthermore, on a discrete GPU with isolated memory spaces, special care must be taken to ensure that the CPU and GPU maintain a coherent view of memory. For example, either the Memcached data structures must be partitioned between the devices, such that only a single device accesses each data structure, or potentially large data transfers are required between the CPU and GPU memory on every kernel launch. As such, from an initial evaluation of Memcached, a programmer may conclude that there is little potential for any benefits from GPU acceleration.

Figure 3.1: Control Flow Graph (CFG) from Memcached’s Jenkins hash function.

Performing a deeper analysis into Memcached’s code and runtime behaviour, we find that its control flow and memory access patterns are indeed dependent on the input data in the network requests. For example, Memcached’s hash function, key comparison, and hash collision resolution technique may result in a different number of iterations or different operations between requests depending on the request data, potentially leading to branch divergence. Recall that SIMD efficiency refers to the fraction of scalar GPU threads that actually execute together in lockstep relative to the maximum number of threads capable of executing together (Section 2.1.8). Consider the control flow graph (CFG) corresponding to Memcached’s key hash function (Jenkins Hash [86]), as shown in Figure 3.1. This hash function contains multiple fine-grained branches and loops, which are completely dependent on the format and length of the key to be hashed. If each of these branches is equally likely based on the input data, and each GPU thread is responsible for hashing a different key, the threads may take different paths through the CFG, resulting in low SIMD efficiency.

However, in most cases the probability of taking or not taking each branch is not equal. For example, branches may be responsible for handling errors, unaligned memory accesses, or handling special cases, which occur relatively infrequently. Without evaluating the runtime behaviour of the application, especially after tuning the code for the GPU’s SIMD architecture, it is challenging to know the actual achievable SIMD efficiency. To this end, we perform an initial evaluation, which compares an estimation of the SIMD efficiency of Memcached to the actual achieved SIMD efficiency of Memcached on a GPU.
The results are presented in Figure 3.2, where Actual is the actual SIMD efficiency and Expected is the estimated SIMD efficiency if all code paths through the CFG are equally likely. The actual SIMD efficiency was measured on the GPGPU-Sim GPU simulator after implementing Memcached in OpenCL (Section 3.4.2) and the estimated SIMD efficiency is measured using a custom control-flow simulator (Section 3.2). On average, Memcached’s actual SIMD efficiency is approximately 2.7× higher than a naive assumption of equal branch probabilities in the code-path may suggest. These results are explained in greater detail in Section 3.4.

Figure 3.2: Memcached SIMD efficiency: Expected vs. Actual.

The rest of this chapter is organized as follows: Section 3.1 describes how Memcached was ported to the GPU, Section 3.2 presents the GPU control-flow simulator, CFG-Sim, Section 3.3 describes the methodology and environment used to perform this study, Section 3.4 presents a characterization and evaluation of Memcached on hardware and GPGPU-Sim, and Section 3.5 summarizes this chapter.

3.1 Porting Memcached

This section describes the relevant implementation details of Memcached, design decisions, and modifications made to Memcached to offload the read (GET) request handler to the GPU. This section also describes the corresponding changes made to the host (CPU) code to efficiently interact and communicate with the GPU, as well as update entries in the hash table (SETs).

3.1.1 Offloading GET Requests

In our GPU implementation of Memcached, we focused on accelerating the read requests on the GPU while leaving the write requests to be handled by the CPU. Berezecki et al. [24] observe that read requests far outnumber write requests in real-world scenarios running Memcached in Facebook. The large composition of read requests in the total network traffic indicates high temporal locality, which is a key component in achieving the most out of the caching tier. Since Memcached acts as a look-aside cache, write requests occur when a read request misses in the cache. Berezecki et al. also conducted experiments showing that write requests have negligible effects on read performance. From this, it is reasonable to assume that the read requests will have the most significant impact on total performance, and thus have the greatest benefit – in terms of overall system performance – from being accelerated on the GPU. Note that by partitioning the application between the CPU and GPU, special care must be taken to ensure that common data structures accessed by both CPU and GPU are kept coherent.

There are multiple levels of computation required for the end-to-end (receive-to-send) processing of a Memcached request. Before the user-level Memcached application receives the packet, the network request is first received by the network driver and processed by the operating system (OS). The OS processing consists of buffer management, network protocol handling (e.g., TCP or UDP), and the delivery of the packet data to the corresponding user-level application. The user-level Memcached application then decodes and performs the appropriate Memcached operation. Finally, a new network packet is constructed to send the response back to the initiator of the Memcached request through the OS and network driver.
In this chapter, we focus only on the specific user-level application processing required to handle the Memcached request.

To take advantage of the massive amounts of available parallelism provided by the GPU, we exploit request-level parallelism in Memcached requests by assigning each GPU work item to process a single GET request. GET requests are batched into groups of requests on the host (CPU), transferred to the device on a kernel launch, and processed in parallel on the device (GPU). Batching requests results in a trade-off between throughput and latency. Requests are queued until a configurable number of requests has been batched together, which increases the latency per request. However, the GPU’s throughput-oriented architecture provides the potential to process more requests in parallel within a given amount of time than the CPU can process sequentially, increasing the request throughput. However, this could have a significant impact on request latency under low traffic loads to the servers. A simple solution is to add a timeout, such that no request remains queued in a pending batch over a certain amount of time. This timeout value could be statically configured or dynamically determined based on the current traffic rates.

For each GET request, the work item performs common key-value look-up operations. These operations consist of computing the hash of the request’s key, accessing the appropriate entry in the hash table using the resulting hash value, and comparing the request’s key with the key – or multiple keys in the event of hash collisions – residing at that hash table location. If the keys match, the value corresponding to that key is returned to the requesting client. In this work, we assume that the requests have already been directed to the correct Memcached server, and thus the hash performed on the GPU corresponds to the second hash mentioned in Section 2.2.

When Memcached receives a request from a client via the network, it first creates a connection object that contains all information required to process any requests during the lifetime of this connection. The client is then able to send requests through this connection to be handled by the Memcached server. These connections contain significant overhead when considering the amount of information required to process the actual Memcached request on the GPU, such as the network/Memcached protocol used and client information. To reduce the amount of data being sent to the GPU, we created a Memcached request payload data object that contains a subset of the connection information required to process a GET request, such as information about the search key and a pointer to the corresponding item if found. The Memcached GET payload object is shown in Figure 3.3. The original Memcached connection object is 528 Bytes, whereas the GET payload object is only 20 Bytes, which significantly reduces the amount of data to transfer on each GET request batch.

Figure 3.3: GPU GET Payload object. The original Memcached Connection object contains a large amount of information about the current Memcached connection, the requesting client network information, and the current state of the Memcached request. The GPU Payload object contains a much smaller subset of the relevant information required to process the GET request on the GPU.
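A sketch of this payload object as a C structure is shown below. The field names and ordering are illustrative, and the 20-byte size assumes a 64-bit build with no structure padding (12 bytes on a 32-bit build).

    #include <stdint.h>

    struct item;   // Memcached's internal item structure (key, value, and metadata)

    // Sketch of the GPU GET payload object of Figure 3.3: two pointers plus a
    // 32-bit key length, roughly 20 bytes on a 64-bit architecture (assuming a
    // packed layout), versus the 528-byte connection object it replaces.
    typedef struct __attribute__((packed)) gpu_get_payload {
        const char  *key;      /* pointer to the request's key              (8 bytes) */
        struct item *item;     /* set by the GPU if a matching key is found (8 bytes) */
        uint32_t     key_len;  /* length of the key in bytes                (4 bytes) */
    } gpu_get_payload_t;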
On each GET request, we allocate and assigna payload object to the requesting connection and batch these payload objects to betransferred to the GPU.3.1.2 Memory ManagementTo manage the data allocated and accessed in Memcached, we implemented a dy-namic memory manager on the host CPU. This memory manager is used to storeall of the data that needs to be visible to both the host and the device; it replacesthe malloc and free system calls originally used in Memcached for any shared datastructures with custom memory allocation calls. Depending on the system beingused (discrete or integrated GPU), the allocated buffers reside in different memoryregions on the host or device. On the discrete system, the buffers are allocated inthe regular host memory space and transferred to the device when necessary. Onthe integrated AMD Fusion systems, however, these buffers are allocated in pinnedmemory to take advantage of the zero-copy memory regions, where data can beallocated on either the CPU or the GPU and accessed directly by both with varyingbandwidth and latencies, as described below.There are two types of zero-copy memory spaces available: the host-visibledevice memory and the device-visible host memory. As the names suggest, eachmemory space is targeted towards a specific device, but accessible from both. Ineither case, both memory spaces are allocated from pinned host memory, a subsetof the host’s memory space, at system boot time. Pinned memory locks pages42in physical memory, preventing the pages from being swapped out. The device-visible host memory is optimized for access by the host, whereas the host-visibledevice memory is optimized for the device [9]. The device-visible host memoryis cacheable on the host, which enables the host to access the memory at its fullbandwidth. However, this limits the device’s memory bandwidth as all memoryrequests must go through the host’s cache coherency protocol. The host-visibledevice memory, on the other hand, is a subset of the pinned host memory that isun-cacheable on the host. As this memory region is not bound by any coherencyprotocols, it can be directly accessed by the device at its full bandwidth. However,this limits the host’s bandwidth to access the shared memory. As our main goal isto accelerate GET requests on the GPU, we used the host-visible device memoryto minimize the data access time on the GPU.3.1.3 Separate CPU-GPU Address SpaceAt the time of this study, while current AMD fusion hardware shared a physicalmemory region between the host and device, it did not share a common unifiedaddress space. The implication of this is that the virtual addresses returned byour custom mem alloc function on the host, corresponding to the physical locationin host-visible device memory, is not the same as the virtual address seen on thedevice, even though it corresponds to the same physical location. Thus, complexdata structures consisting of many multiple-level pointers can not simply be de-referenced on the device.What is common between the host and device when allocating memory, how-ever, is the offset of each memory object from the start of the allocated memoryregion. For example, a memory object that is 32KB from the start of the memoryregion seen on the host is also 32KB from the start of the memory region seen onthe device. 
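A minimal sketch of how this shared-offset property can be exploited is shown below. The two macros correspond to the translate and inverse_translate operations described in the next paragraph; the offset computation and the example pointer names are illustrative.

    // Offset-based translation between the host's and the device's virtual views
    // of the same physical zero-copy region. The offset is computed once per
    // kernel launch as (host base address - device base address).
    typedef unsigned long long addr_t;

    // Convert a pointer value written by the host into a device-usable pointer.
    #define translate(host_ptr, offset) \
        ((void *)((addr_t)(host_ptr) - (addr_t)(offset)))

    // Convert a device pointer into the host's view before storing it for the host.
    #define inverse_translate(device_ptr, offset) \
        ((void *)((addr_t)(device_ptr) + (addr_t)(offset)))

    // Example use inside a kernel, given the host's base address as an argument:
    //   addr_t offset = (addr_t)host_base - (addr_t)device_base;
    //   struct item *it = (struct item *)translate(payload->item, offset);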
Using this property, we pass the virtual address pointing to the start of the memory region seen by the host as an argument to the Memcached kernel and calculate the offset between this host memory buffer and the start of the memory region seen by the device. A macro is used to subtract this offset from every memory de-reference on the device: translate(address, offset). The inverse operation is applied to all pointers set on the device, inverse_translate(gpu_address, offset), such as the return value for the corresponding item pointer on a GET request. This ensures that both the host and device access the same physical memory locations. Gelado et al. [60] implement a similar technique by either ensuring that the virtual pointers returned to the shared memory region are the same or by maintaining address mappings between the host and device. Using this offset-based address translation technique reduces the complexity for programmers by eliminating the need to traverse and reconstruct data structures that contain multiple nested pointers on the device, such as linked lists or tree structures.
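A minimal sketch of such offset-based translation is shown below as C-style macros; the exact macro definitions are not given in the text, so the names, casts, and offset convention here are illustrative assumptions. The device computes the offset once per kernel launch as the difference between the host's base address (passed as a kernel argument) and the device's view of the same buffer.

    #include <stdint.h>

    /* Illustrative sketch (assumed names). offset = host_base - device_base. */

    /* Convert a host virtual address stored in a shared structure into the
     * device's view of the same physical location. */
    #define translate(host_ptr, offset) \
        ((void *)((uintptr_t)(host_ptr) - (uintptr_t)(offset)))

    /* Convert a device pointer back into the host's view before storing it,
     * e.g., the item pointer result written by the GPU for a GET request. */
    #define inverse_translate(gpu_ptr, offset) \
        ((void *)((uintptr_t)(gpu_ptr) + (uintptr_t)(offset)))

A usage example under these assumptions: following a host-set pointer inside a payload on the device becomes item *it = (item *)translate(payload->item, offset), and writing the result back becomes payload->item = (struct item *)inverse_translate(it, offset).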
3.1.4 Read-only Data

With the exception of the Memcached data structures written to with the results of the GET requests, such as the GET request payload objects, the majority of data structures between successive kernel launches are read-only when processing GET requests in Memcached. The AMD hardware evaluated in this chapter provides various hardware components, such as read-only caches, that can significantly decrease data access time. Where possible, we allocate data in read-only buffers to take advantage of the buffer's high-bandwidth, low-latency memory accesses.

3.1.5 Memory Layout

Using our custom dynamic memory manager, we can allocate data in specific layouts to take advantage of the GPU's memory-coalescing property discussed in Section 2.1.8. Two data structures that are guaranteed to be accessed by all work items with known access patterns on the GPU are the payloads and the keys corresponding to each payload. As introduced in Section 3.1.1, the payload contains a pointer to the request key, the length of the key, and a pointer to the item being requested. Each work item is assigned a single payload corresponding to a single GET request. Depending on the architecture, 32-bit or 64-bit, the size of each payload is only 12 bytes or 20 bytes respectively. With a wavefront size of 64 work items and a cache line size of 128 bytes, these 64 memory requests could be reduced to six or ten memory requests respectively to retrieve the same amount of data. Therefore, we ensure that the GET request payload objects are allocated contiguously in memory by allocating them in a separate dedicated buffer.

Figure 3.4: Contiguous memory layout.

To access the payloads, each work item requires only a pointer to the start of the payload buffer and uses its global work item ID to access the appropriate index in the payload array. This same technique is applied to the keys corresponding to the payloads, such that when each work item dereferences the pointer to the key, the keys will reside nearby in memory. Figure 3.4 shows how the connections, payloads, and keys are allocated and laid out in the different memory buffers.

3.1.6 SETs and GETs

As described in Section 3.1.1, this work focusses on accelerating GET requests on the GPU, while leaving SET requests to be processed on the CPU. This results in a requirement for synchronization between the CPU and GPU to ensure coherent accesses to the shared resources between the two devices. The purpose of this chapter is to understand the potential for the GPU to accelerate workloads that exhibit irregular behaviour. As such, evaluating the impacts of fine-grained synchronization between accesses and updates to the shared data structures is beyond the scope of this chapter and is considered in Chapter 4. Instead, we implement a coarse-grained synchronization in which SETs and GETs are batched separately and processed sequentially and independently. For example, as requests arrive at the GPU Memcached server, GET requests are inserted into a pending batch and SET requests are processed immediately. As long as no GET request batch is currently being processed on the GPU, SET requests are able to proceed without requiring synchronization with the GPU. However, once a batch of GET requests is launched on the GPU, all SET requests are blocked until the GET request batch has completed processing on the GPU. This ensures that the data is not being updated while a GET batch is being concurrently processed on the GPU.

However, this coarse-grained synchronization has two limitations: (1) unnecessary blocking of SET requests that may be independent of all concurrent GET requests and (2) out-of-order execution of SET and GET requests. For (1), SET requests may be able to tolerate longer latencies than GET requests, since the requesting application does not depend on the result of the SET to continue making progress. For (2), Memcached has weak request ordering requirements. In the baseline Memcached, requests may be handled by multiple different CPU threads. Each thread acquires locks on shared resources, such as the hash table, prior to accessing or updating the resource. As such, it is possible for SET and GET requests to be reordered in the baseline Memcached between the time when the request was received at the server and when the request was processed, depending on which thread acquires the lock first. Consequently, the client application cannot rely on a strict ordering of Memcached requests. While our implementation does not break a requirement on request ordering, it does increase the chance of reordering requests.

3.2 Control-Flow Simulator (CFG-Sim)

Along with evaluating the behavior of Memcached on GPUs, we are also interested in understanding (1) how a programmer's intuition about how an application might perform on a GPU compares with how the application actually performs on a GPU, and (2) what are the effects of branch probabilities relative to correlated branch outcomes (described below) within GPU work items in a wavefront. To address these questions, we look to analyzing an application's control flow behavior.

We designed a stand-alone control-flow simulator, CFG-Sim, that simulates the behavior of a wavefront through an application's control-flow graph (CFG). The CFG can be generated either from a CPU version or an existing GPU version of the application. Each branch in the application's CFG is annotated with an outcome probability. The branch probabilities can be set randomly, set based on intuition about the program's behavior, or set from actual branch probabilities extracted from profiling the application. At each branch, the simulated active work items generate a random number and compare it with the threshold outcome probability at that branch to determine if the branch will be taken or not.
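A minimal C sketch of this per-branch decision is shown below, under the simplifying assumptions stated later in this chapter (independent outcomes per work item, no branch correlation); the function and variable names are illustrative and not taken from CFG-Sim's actual source.

    #include <stdlib.h>

    #define WF_SIZE 32  /* simulated work items per wavefront */

    /* For every active work item, decide whether it takes the branch.
     * taken_prob is the annotated outcome probability for this branch;
     * active[] and taken[] are per-work-item flags. Outcomes are drawn
     * independently per work item, so no correlation is modelled. */
    static void simulate_branch(double taken_prob,
                                const int active[WF_SIZE],
                                int taken[WF_SIZE])
    {
        for (int i = 0; i < WF_SIZE; ++i) {
            if (!active[i]) {
                taken[i] = 0;
                continue;
            }
            double r = (double)rand() / (double)RAND_MAX;
            taken[i] = (r < taken_prob);
        }
    }

Work items whose outcomes disagree are then split into taken and not-taken groups by the SIMT stack described next.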
A SIMT stack (Section 2.1.8) handles tracking branch divergence, re-convergence, and deciding which work items to execute at a given time. The overall SIMD efficiency is measured for each iteration through the application's CFG and averaged across multiple iterations, as shown in Equation 3.1. In this equation, E(SIMD_Eff) is the expected SIMD efficiency returned by CFG-Sim; N is the total number of iterations to perform through the application's CFG; CF_j is the set of basic blocks that were traversed by CFG-Sim on the jth iteration through the application's CFG; #BB_i is the number of instructions in basic block i; am_i is the number of work items active at basic block i calculated by CFG-Sim; and finally WF_size is the total number of work items in a wavefront.

E(SIMD_{Eff}) = \frac{1}{N} \sum_{j=1}^{N} \frac{\sum_{i \in CF_j} \left( \#BB_i \times \frac{am_i}{WF_{size}} \right)}{\sum_{i \in CF_j} \#BB_i}    (3.1)

An estimate of the expected SIMD efficiency from CFG-Sim can be useful when performing an initial analysis to decide whether an application may benefit from GPU acceleration. Prior to writing any code for the GPU, a programmer can gain better insight into the average SIMD efficiency estimated to result on the hardware. CFG-Sim takes three input files: a DOT file [58] containing the information required to generate the control-flow graph, a file containing the number of instructions per basic block (specified in the DOT file format), and a file containing the estimated outcome probabilities for each branch in the application.

However, the branch probabilities themselves may not be enough to accurately simulate the SIMD efficiency of an application. Consider an application that is already ported to run on a GPU. After profiling the execution, it is possible to have a SIMD efficiency of 100% with a branch outcome probability of 50% for every branch. This occurs if every work item in a wavefront takes the same path through the CFG, but alternates between different paths every execution. We refer to this property where work items in a wavefront move through the CFG together as correlated branches. Correlated branches may occur because the application was programmed for the GPU to maximize SIMD efficiency or because of different phases during the application's execution. This correlation among branches is currently not accounted for in CFG-Sim and is left to future work (Section 7.2.1). Despite this, by extracting Memcached's actual branch probabilities from a cycle-accurate GPGPU simulator, GPGPU-Sim (described in Section 3.3), and using them as input to CFG-Sim, we found that the estimated SIMD efficiency is within 1.3% of the actual SIMD efficiency of Memcached measured by running the application through GPGPU-Sim. One such application that requires modelling branch correlation to accurately estimate the SIMD efficiency is Ray Tracer, which is discussed in Section 3.4.2.

3.3 Experimental Methodology

This section describes the experimental methodology followed in this chapter.

3.3.1 Hardware and Simulation Frameworks

We performed experiments on three configurations of GPUs and accelerated processing units (APUs): a high-performance discrete graphics card (AMD Radeon HD 5870), a low-power AMD Fusion APU (Zacate E-350 [AMD Radeon HD 6310]), and a mid-to-high-end AMD Fusion APU (Llano A8-3850 [AMD Radeon HD 6550D]). The terms APU and integrated GPU are used interchangeably. The discrete GPU is connected to the host CPU via PCIe and has a physically separate graphics memory from the CPU's main memory. The integrated GPUs are integrated on the same die as the host CPU and share a common memory.
The discrete GPU was chosen to show potential upper bounds on compute performance, while the low-power AMD Fusion APU provides insight into the performance capabilities of such integrated systems with shared memory. The mid-to-high-end AMD Fusion APU falls in the middle of these two systems, combining the higher compute performance with the benefits of the APU's shared memory space. The hardware specifications for these GPUs are outlined in Table 3.1. Table 3.2 provides additional information about the CPUs used in this study.

Table 3.1: GPU hardware specifications.

    Name                              AMD Radeon    Llano A8-3850           Zacate E-350
                                      HD 5870       (AMD Radeon HD 6550D)   (AMD Radeon HD 6310)
                                                    AMD Fusion              AMD Fusion
    Engine Speed (MHz)                850           600                     492
    # Compute Units (CU)              20            5                       2
    # Stream Cores                    320           80                      16
    # Processing Elements             1,600         400                     80
    Peak Gflops (single-precision)    2,720         480                     78.72
    # of Vector Registers/CU          16,384        16,384                  16,384
    LDS Size/CU (kB)                  32            32                      32
    Constant Cache / GPU (kB)         48            16                      4
    L1 Cache / CU (kB)                8             8                       8
    L2 Cache / GPU (kB)               512           128                     64
    DRAM Bandwidth (GB/sec)           153.6         29.9                    17.1
    DDR3 Memory Speed (MHz)           1,000         933                     533

Table 3.2: CPU hardware specifications.

    Name            Llano A8-3850    Zacate E-350
    # x86 Cores     4                2
    CPU Clock       2.9 GHz          1.6 GHz
    TDP             100W             18W
    L2 $ / core     1MB              512 KB

At the time of this study, the Linux drivers for the AMD Fusion system did not support zero-copy buffers. To access the Windows AMD SDK and the required Linux libraries for Memcached, we used Cygwin [28] to run Memcached on the AMD Fusion systems. One issue with Cygwin is its inability to access all provided GPU hardware counters. This significantly limited the amount and variety of data we were able to collect from Memcached on the hardware. To gain additional information about Memcached's behavior on a GPU, we profiled Memcached on GPGPU-Sim [17], a cycle-accurate GPGPU simulator.

Although the GPU architecture modeled by GPGPU-Sim differs from the physical hardware analyzed at the time this study was performed (Table 3.1)1, such as in its use of a VLIW unit, we do not believe that this is an issue because the architecture resembles AMD's future GPU architecture [74]. Furthermore, many of the properties evaluated in this study, such as SIMD efficiency, memory divergence, and cache sensitivity, are relevant to any GPU architecture with a SIMD pipeline, data caches, and the ability to coalesce memory accesses. GPGPU-Sim simulates Parallel Thread Execution (PTX) code, a pseudo-assembly intermediate language used in NVIDIA GPUs. Table 3.3 presents the configurations used in GPGPU-Sim.

Control-Flow Simulator

In the current version of the control-flow simulator (CFG-Sim), we assume that the outcome of each work item's branches are independent of any other work items within the same wavefront. As such, we do not consider correlated branch outcomes between work items in a wavefront. Additionally, loops are treated as regular branches, which can impact the estimated SIMD efficiency. For example, a loop may be known to be taken 100 times before work items begin exiting the loop. However, setting the branch probability to p(1/100) does not result in the same branch outcome distribution as having both a fixed and probabilistic portion to loop branches.
While CFG-Sim could be modified to include optimizations such as loop branches, it would require additional information about the behaviour of the application obtained through profiling or static analysis.

For a set of given branch probabilities, we ensure that the SIMD efficiency results converge by averaging 100K iterations through the application's control-flow graph.

1 GPGPU-Sim models NVIDIA-like GPUs whereas this study evaluates AMD GPUs.

Table 3.3: GPGPU-Sim configuration.

    Config Name                      Config Value
    # Streaming Multiprocessors      30
    Warp Size                        32
    SIMD Pipeline Width              8
    Number of Threads / Core         1024
    Number of Registers / Core       16384
    Shared Memory / Core             16KB
    Constant Cache Size / Core       8KB
    Texture Cache Size / Core        32KB, 64B line, 16-way assoc.
    Number of Memory Channels        8
    L1 Data Cache                    32KB, 128B line, 8-way assoc.
    L2 Unified Cache                 512k, 128B line, 8-way assoc.
    Compute Core Clock               1300 MHz
    Interconnect Clock               650 MHz
    Memory Clock                     800 MHz
    DRAM request queue capacity      32
    Memory Controller                Out of Order (FR-FCFS)
    Branch Divergence Method         PDOM [172]
    Warp Scheduling Policy           Loose Round Robin
    GDDR3 Memory Timing              tCL=10 tRP=10 tRC=35 tRAS=25 tRCD=12 tRRD=8
    Memory Channel BW                8 (Bytes/Cycle)

3.3.2 Assumptions and Known Limitations

Throughout this study, we assume requests are independent of each other. Thus, all GET (read) operations will view the most up-to-date data in Memcached. As discussed in Section 3.1.6, SET and GET requests may complete out-of-order with higher probability than the baseline Memcached, due to the batching of GET requests.

The size of memory accessible by the AMD GPUs and APUs evaluated in this study is limited. On the APU, each zero-copy buffer can be a maximum of 64 MB, with a system total of 128 MB [9]. This poses various problems for memory-intensive applications, such as Memcached, that require large amounts of memory to be effective. This problem would be eliminated with a larger region of pinned memory available to the GPU and an appropriate interface to allocate and access the additional memory. Indeed, the industry has addressed the limited memory capacities available in previous graphics cards. For example, at the time this study was performed, AMD had already announced at the 2011 AMD Fusion Developer Summit (AFDS) that future AMD GPUs and APUs will support accessing CPU virtual memory [42]. The current HSA specification [55] requires that compliant HSA systems support access to shared system memory across devices through a unified virtual address space. Current NVIDIA GPUs [131] also support unified virtual memory, which enables transparent access for the same virtual address on both the CPU and GPU, as well as virtual memory paging from CPU memory to GPU memory.

Batching requests inherently increases the latency to process the requests. Offloading requests to the GPU in batches can help to reduce the total system-queuing latency if the CPU throughput becomes the bottleneck when experiencing high incoming request rates. Assuming the GPU can process requests with higher throughput than the CPU, the GPU would be able to drain the pending request queues faster than the CPU. While some applications may not be able to tolerate the increased latency impact of batching large numbers of requests (as is done in this study), we expect we could achieve many of the benefits while using smaller batch sizes.
Chapter 4 evaluates batching fewer requests at a time to reduce the queuing latency, while launching multiple batches concurrently to maintain high throughput.

We initially profiled Memcached to locate sections of code that contribute to the majority of Memcached GET request processing and would benefit from running in parallel on the GPU. This revealed that a majority of execution time for GET requests is actually spent in I/O and network stack processing. The key-value lookup is the next-highest contributor to the overall execution time of Memcached. Thus, in this chapter we focused our efforts on porting the key-value lookup handler to the GPU. Chapter 4 evaluates offloading portions of the network stack to the GPU as well.

3.3.3 Validation and Metrics

To verify that the GPU version of Memcached returns the correct results, we first process a batch of GET requests on the CPU using the baseline version of Memcached, and then process the same batch of GET requests on the GPU. On completion of the GPU kernel, we compare the results from the CPU and GPU to ensure the correct items were found.

The execution times on the CPU were recorded using a fine-grained time stamp counter (TSC) that records the sequential look-up times for the batch of requests. When timing the CPU, all data was allocated in a cacheable memory region. For each GPU execution, we recorded the kernel execution times using the AMD APP Profiler tool (v. 2.3). We also verified that the comparison with the TSC timer was valid by timing the kernel execution time on the Llano A8-3850 system immediately before the kernel launch and immediately after the clFinish synchronization function.

3.3.4 WikiData Workload

We simulated request traffic to our GPU Memcached server using a large input file consisting of HTTP read and write requests. Specifically, we used portions of Wikipedia workload traces [169], referred to as WikiData, to stimulate our application. These workload traces were recorded by Wikipedia's front-end proxy caches and, in total, contain billions of HTTP requests.

Memcached's front-end host code was modified to process the requests from the WikiData trace files instead of processing incoming requests from the network. The HTTP read and write requests are converted into Memcached GET and SET requests respectively. A configurable number of requests are processed on the CPU prior to offloading work to the GPU to set up the environment and to warm up Memcached's hash table and memory slabs. Once the setup phase completes, SET requests are handled immediately on the host and GET requests are placed into a buffer until the configurable number of GET requests have been received. Once a GET request batch has been launched on the GPU, SET requests are stalled until the GET request batch completes.

3.4 Experimental Results

This section presents our experimental results evaluating Memcached on both integrated and discrete GPUs, and a software GPU simulator. We first present the hardware results, then discuss the SIMD efficiency analysis using CFG-Sim, and finally present an evaluation of Memcached on GPGPU-Sim.

3.4.1 Hardware Evaluation

Memcached was run on the three GPU configurations introduced in the previous section, with the performance measured via hardware counters through AMD App Profiler.
Both the AMD Radeon HD 5870 and the Llano A8-3850 (AMD Radeon 6550D) are compared against a single Llano x86 CPU core, while the Zacate E-350 system was compared against a single Zacate x86 CPU core.

We also evaluated Memcached's key-value look-up performance on a multi-threaded CPU implementation (using Linux pthreads). As shown in Table 3.2, the Llano and Zacate systems contain four and two cores respectively. In this experiment, we evenly partitioned the GET request batch between one, two, and four threads. Each thread performed the same key-value look-up operations on a different sub-batch of GET requests in parallel, and the time was measured for the last thread to complete its portion of the requests. From this experiment, we found that two threads had approximately the same throughput as a single thread, and four threads reduced the performance relative to a single thread, even on the 4-core Llano. As described in Section 3.3, this work was evaluated on Windows using Cygwin. Consequently, the lack of multi-threaded performance improvements may be attributed to Cygwin's pthread support. Additionally, Memcached's CPU implementation contains global locks, for example, to protect access to the hash table entries and the global LRU queue for managing item evictions, which serialize portions of the key-value look-up operation between GET requests, reducing the potential for linear speed-ups with additional cores. For these reasons, our results in this chapter compare the GPU performance against a single CPU thread, which was the highest measured throughput on the CPU. An ideal multi-threaded CPU performance can be estimated by multiplying the measured throughputs by the number of cores on the corresponding system. However, as shown below, the GPU is still able to outperform the estimated ideal CPU throughputs. Furthermore, the next chapter (Section 4.4.2) evaluates end-to-end multi-threaded Memcached CPU performance, which shows approximately a 2× increase in throughput when increasing from one to four threads and a steep drop-off in throughput after adding more threads than the number of cores. Hence, we believe that the GPU results presented in this section are representative of the GPU's ability to outperform the CPU version of Memcached.

Figure 3.5a presents the average speed-up, in terms of key-value look-ups per second (LPS), for each GPU configuration normalized to the CPU's execution time. Because these results do not include any data transfer times, they explicitly highlight the computational performance benefits when performing Memcached's key-value look-up on the GPU relative to the CPU. Even with the potential for irregular control-flow and memory-access patterns present in Memcached, the AMD Radeon HD 5870 is able to perform the key-value look-up on a batch of 38,400 GET requests approximately 33× faster, the Llano A8-3850 7.5× faster, and the Zacate E-350 4.5× faster than their CPU counterparts. We hypothesize that the performance improvements are largely a result of the increased computational and memory-level parallelism. As will be discussed further in Section 3.4.2, the SIMD efficiency of our Memcached implementation is approximately 40%, which indicates that a non-negligible number of the GET requests' key-value lookup instructions are executing in parallel. Furthermore, because Memcached is a highly memory-intensive application, the GPU's high memory-level parallelism enables multiple in-flight memory requests from different work items to be processed concurrently.
While CPU threads must wait for a memory request to return from the cache or memory before continuing to operate on the requested data, the GPU can simply switch to another wavefront to perform any computations or initiate additional memory requests.

However, the results in Figure 3.5a ignore any data transfer times, which need to be considered when measuring the full end-to-end performance of a GPU application, especially for a streaming application such as Memcached. Including the data transfer times results in a large overall performance decrease on the discrete system, as can be seen in Figure 3.5b. The execution time is measured from immediately before the data transfer to the GPU is initiated to immediately after the data transfer from the GPU is completed.

Figure 3.5: Memcached speed-up vs. a single core CPU on the discrete and integrated GPU architectures. Each batch of GET requests contained 38,400 requests/batch. (a) Excluding data transfers. (b) Including data transfers.

Overlapping the data transfers with computation from another batch of GET requests can reduce the overall data transfer overheads, however, the latency for the request batch is still negatively impacted. Additionally, given the large request batch sizes (38,400 requests/batch), the incoming request rate would need to be high enough to have another batch of GET requests available once the previous request batches have completed transferring data and have started processing. The APUs have close to zero transfer time due to the shared memory space. These data transfer times are small, but non-zero, due to the mapping and un-mapping operations, which are required to ensure that the data the GPU is accessing is coherent with any CPU updates. Although the total compute power of the APUs is far less than the high-performance discrete AMD Radeon GPU, the ability to fully eliminate the transfer of data allows these devices to outperform the AMD Radeon HD 5870. Further optimizations to reduce the amount of data transferred to and from the GPU for each GET request are required to reclaim some of the computational performance benefits of the discrete GPUs. This is discussed further in Chapter 4.

Request Batching

Whenever considering batch processing, there is always a trade-off between throughput and latency. Specifically, as the number of queued requests increases, the time taken to process these requests also increases. The reason is two-fold. First, the requests sit in the batch queue for a longer period of time until processing begins. Second, the GPU takes longer to process a larger batch. With more requests to process concurrently, more active wavefronts are executing concurrently. This places higher contention on shared resources, such as the compute units and the memory system. While this increases the total system throughput, it also increases the per-work item processing latency. We measured the impact on request throughput and latency on the AMD Radeon HD 5870 by varying the batch request size and recording the average time taken to process that batch of requests. Figure 3.6 presents these results normalized to an initial batch size of 1,024 requests, excluding data transfer times. Also shown in Figure 3.6a is a 0.5ms latency reference line. Berezecki et al.
[24] indicate that a 1ms delay for a complete Memcached request processing, including network transfer and network processing time, is a reasonable maximum tolerable latency. The data shown in Figure 3.6 measures only the Memcached key-value lookup time.

Figure 3.6: Throughput and latency while varying the request batch size on the AMD Radeon HD 5870 (normalized to 1,024 requests/batch). (a) Normalized latency. (b) Normalized throughput.

Figure 3.7: Speed-up of AMD Radeon HD 5870 and Llano A8-3850 vs. the Llano A8-3850 CPU at different request batch sizes.

The large spike in throughput from 1-5× in Figure 3.6b is caused by the minimal increase in latency, approximately 1.3×, while increasing the number of requests/batch by 7.5× (1,024 → 7,680). This results in the behavior shown by the throughput, corresponding to the initial increase in latency, which begins to level off and fluctuates between 6× and 7× the throughput at 1,024 requests per batch.

This behaviour suggests that the GPU is underutilized when the request batch size is less than approximately 20,000. Assuming that a theoretical incoming request rate can be set to match any level of throughput, selecting a batch size of approximately 8,000 requests per batch qualitatively provides the maximum ratio of throughput to latency, whereas a batch size around 30,000 requests achieves most of the peak measured throughput. As will be discussed in Chapter 4, multiple smaller request batches can also be executed concurrently to improve the balance of throughput and latency.

Another property of batch processing to consider is how the performance between the GPU and CPU varies when the request batch size is increased. These results are presented in Figure 3.7 for both the AMD Radeon HD 5870 and the Llano A8-3850 when compared to a single CPU core on the Llano A8-3850 system. Both architectures show a large initial increase in performance when the batch size is increased from small values. Similar to the throughput behavior seen in Figure 3.6b, the speed-up compared to the CPU begins to level out on both architectures around 40,000 requests/batch. Again, these results are excluding data transfer times. Including the data transfers results in the integrated Llano GPU achieving higher overall performance relative to the CPU than the discrete Radeon GPU.

Data Transfer

In applications with large amounts of data needing to be transferred to and from the device, such as Memcached, transfer time can dominate the overall execution time of the kernel. Figure 3.8 shows the contribution of the execution time and data transfer times as a percentage of the overall execution time for each GPU when operating on a GET request batch size of 23,040 requests/batch.

Figure 3.8: Memcached overall execution breakdown (23,040 requests/batch).
We optimistically selected the minimum amount of data that must be transferred to and from the device: the requests to be processed and the results of the requests respectively. Assuming cyclic2 transfer of data, more than 98% of the overall execution time is spent transferring data for the discrete AMD Radeon HD 5870. These values were recorded assuming that none of the data could have been modified on the host between successive kernel launches, thus ensuring all data in the device memory is valid. Therefore, on kernel launch, the only data that must be transferred are the requests themselves; upon completion of the kernel, all of the results must be transferred back to the host. A more realistic assumption is that an unknown amount of data could have been modified between kernel launches, thus invalidating a portion of the data on the device and requiring explicit tracking and transfers of the modified data on every kernel launch. Tracking which data was modified could be avoided by pessimistically transferring all of the data on every kernel launch, however, this cyclic memory transfer model that transfers data regardless of whether it has been modified is sub-optimal.

2 Cyclic refers to transferring data before and after successive kernel launches, whereas acyclic data transfer overlaps data transfer with the kernel execution [161].

Others [21, 104, 161, 168] have proposed solutions to this challenge (e.g., implementing frameworks to automatically and acyclically transfer modified data to the device or requiring programmer annotation of the code to specify memory regions to be explicitly managed) to reduce the impact data transfers have on performance. With the introduction of CPU-GPU architectures that share a physical memory space, such as the AMD Fusion systems, this transfer time can be virtually eliminated. As can be seen in Figure 3.8, the majority of the overall execution time for the Llano A8-3850 and Zacate E-350 systems is spent performing useful work, rather than waiting for the data transfer to complete. Being able to reduce the time required to transfer data, either by using an architecture with a unified memory space or using a method to reduce the transfer overhead, is crucial when porting an application requiring large data transfers to the GPU.

3.4.2 Simulation Evaluation

This section attempts to gain additional insight into the performance of Memcached on a GPU using simulation frameworks. Unless otherwise stated, the data presented in this section is either collected from CFG-Sim discussed in Section 3.2 or the baseline GPGPU-Sim configuration presented in Section 3.3.

Figure 3.9: Example control-flow graph with an error handling branch from B to K. Each basic block contains the basic block identifier and the number of instructions in that basic block (Basic Block ID-# Instructions). All branches have a non-zero branch outcome probability except for B to K, which is never taken. (Basic blocks shown: A-6, B-2, C-2, D-2, E-9, F-11, G-11, H-13, I-1, J-35, K-1; the branch from B to K is labeled 0%.)

SIMD Efficiency of Memcached

As previously discussed, in single-instruction, multiple-data (SIMD) architectures, such as GPUs, groups of work items in a wavefront must execute instructions together in lock-step. If a work item branches away from the other work items in its wavefront, the GPU executes the two sub-groups separately, requiring more cycles than if they were executed together [9].
This reduces overall SIMD efficiency. Combining Memcached's complex control-flow graph, which contains multiple nested conditional branches, with the level of uncertainty in branch outcomes, one can reasonably expect Memcached to have poor SIMD efficiency, directly resulting in poor performance on the GPU. Although pessimistic, an initial view of the system might be that each branch outcome has an equal probability (50% not-taken and 50% taken). In many applications, this might be an unreasonable assumption; however, many of the branches in Memcached depend on input data, such as the length of the request key, that can vary greatly between requests. Without additional analysis, it is unclear to what extent these data dependent branches will impact branch divergence and, hence, SIMD efficiency. Assuming equal branch probability is marginally better than the worst case, where the thread groupings deterministically split in half at each branch.

After performing further analysis of the application, such as manually inspecting or profiling the application, certain branches may be reasoned to occur rarely or never (e.g., error handling or dead code). These branches can be removed from the analysis by forcing the work items to take a certain path (the path known to be always taken), since their inclusion would negatively bias the expected SIMD efficiency of the system. For example, consider the CFG in Figure 3.9 with a similar structure identified within Memcached. In this example, each ellipse contains the basic block identifier and the number of instructions in that basic block (Basic Block-# Instructions). Also note that the branch from B to K may be some error checking code, which should never, or rarely, occur during normal execution. There are two main issues here. First, if we assume equal branch probabilities for all basic blocks, the estimated SIMD efficiency will be unnecessarily lower since work items will never actually diverge at B. Second, when work items diverge at A, their re-convergence point is set to the immediate post dominator (IPDOM) K, since this is the first basic block which all work items must execute. As a result, work items that diverge at A or C will all execute J, a large basic block, separately even though all work items will actually go through J, which further lowers the SIMD efficiency. The first issue can be solved by forcing all branches known to never be taken to have zero probability (see Augmented below), while the second issue requires either dynamically identifying early re-convergence points or removing the zero-branch from the CFG (left to future work).

Figure 3.10: SIMD efficiency. (Bars for MC, RAY, and MUM under the Pes, Aug, and Act assumptions; y-axis: percentage of cycles executed; bins of 1-4 through 29-32 active work items.)

Simulating the control-flow behavior of a single thread grouping with the control flow simulator (CFG-Sim) (Section 3.3), we can compare Memcached's SIMD efficiency with these initial views of how the application might behave on a GPU, resulting in the data in Figure 3.10. This figure compares the overall SIMD efficiency of Memcached's actual execution (Act) with the pessimistic view that all branches have equal outcome probabilities (Pes). In this experiment, there are 32 work items per wavefront. Each bin in the graph represents the fraction of total program execution in which the specified number of scalar threads were concurrently executing.
For Memcached (MC), the SIMD efficiency with Pes is much lower than Act, where significantly more of the execution time contains only 1-4 work items in a wavefront.

We then improve on this pessimistic view by optimizing away all branch paths that are never taken during normal execution (Augmented - Aug) by setting the branch probability to 0%, and compare the recorded SIMD efficiency with the actual execution. All other branches remain at equal branch probabilities. Although there is an improvement, the SIMD efficiency of the actual execution of Memcached still outperforms the estimated behavior.

Using the actual branch probabilities measured from GPGPU-Sim for Memcached in CFG-Sim, instead of the equal branch probabilities, the estimated SIMD efficiency from CFG-Sim is within 1.3% of the actual SIMD efficiency, which highlights the potential for CFG-Sim to accurately estimate the SIMD efficiency.

We extend this analysis to applications known to perform well on the GPU, such as Mummer (MUM) and Raytracer (RAY), and measure how these results compare when similar assumptions are applied, which is also shown in Figure 3.10. Although MUM exhibits a relatively low SIMD efficiency, GPUs tend to have more memory bandwidth than CPUs, which can result in higher throughput on memory-limited applications even in the presence of significant control-flow divergence. This is a similar behaviour as measured in Memcached. However, the theoretical results in RAY perform significantly worse than the actual results. This is caused by high correlations between work items' branch outcomes within a wavefront. Although each branch outcome may have a relatively random probability, each work item is biased by the results of the other work items within the group.

The main takeaway from this experiment is that while an application may contain many data-dependent branches, which can lead to irregular control flow on a GPU, further analysis of the application is required to understand the actual attainable SIMD efficiency on a GPU, since branch outcomes can be far from random and branch outcomes may be correlated between work items.

Effect of Memcached on the Memory System

Memcached's key-value retrieval algorithm places a significant amount of stress on the memory system. Figure 3.11 shows the misses per 1,000 instructions (MPKI) for Memcached with a variety of L1 data cache configurations ranging from a small 8KB, 8-way set associative cache to a large 1MB fully associative cache. This data shows that Memcached has some exploitable locality and that the working set of our simulated configuration fits in a 256KB cache. The remaining 10 MPKI are caused by cold start misses.

Figure 3.11: L1 data cache misses per 1,000 instructions at various configurations. FA = Fully Associative.

Figure 3.12: Performance as a percentage of peak IPC with various realistic L1 data cache configurations and two idealized memory systems.

Figure 3.12 shows the performance of Memcached on GPGPU-Sim with a number of L1 global data cache configurations and two variations of an idealized memory system. Performance is presented as a percentage of peak IPC (when every SIMD lane is active for every GPU cycle). Increasing the cache size results in a continuous performance improvement up to 256KB, beyond which it levels off.
This result indicates that Memcached is a cache-sensitive workload. Further investigation of the source code reveals that two instructions receive a significant reduction in latency when cache size increases. These instructions are the loads performed inside the key comparison loop which compares the input key and a key found in the hashtable. This loop accesses memory sequentially, resulting in a high cache hit rate when the cache is large enough to capture the working set.

The 1MB fully associative (FA) configuration suffers only from cold-start misses. The No Memory Latency (No Mem Latency) data point models a system in which requests can be processed in a single cycle, but each compute core can send only one request per cycle to the global memory system. These two data points enable us to measure the amount of touched-once data loaded by the Memcached kernel. Since the No Mem Latency model places a very low penalty on loading data into the cache, one cycle, the difference between these two data points accounts for the speedup relative to a cache large enough to hold the entire working set after incurring the cold-start load penalty. Increasing from no cache to a cache that captures all the kernel's locality takes the IPC from less than 1% of the peak to 12%, and removing the cold-start misses achieves 21% of the peak IPC. This suggests that Memcached contains a high fraction of touched-once data. This was verified by measuring the number of accesses to each L1 cache line prior to eviction in the 1M cache configuration. While the No Mem Latency configuration removes the penalty for cache misses, a large number of memory requests may cause contention in the memory system, resulting in performance loss. The No Memory Stalls configuration sends memory requests through the pipeline as fast as they are generated without any blocking. No Memory Stalls results in an additional 12% increase in performance over the No Memory Latency system. This result tells us that Memcached spends a large fraction of its execution time with a backed-up queue of memory requests waiting to access the memory system. If wavefronts do not stall on memory, then the overall performance is largely limited by SIMD efficiency. The performance of the non-stalling memory system is 33% of peak, while the measured SIMD efficiency of Memcached is ∼40%. This 7% discrepancy can be attributed to idle cycles when some cores take longer than others to complete the kernel.

Figure 3.13: Memory requests generated per instruction for each static PTX instruction.

Figure 3.13 illustrates the amount of memory divergence in Memcached (using AerialVision [13]). It presents the average number of global memory requests generated for each static PTX assembly instruction. In our GPGPU-Sim baseline configuration the maximum Y-value for each bar is 32 (all 32 lanes of the wavefront generate a request) and the minimum is two (because requests are coalesced per half-wavefront in GPGPU-Sim). A well-behaved GPU application will attempt to minimize this number and limit the stress on the memory system. From this graph we can see that many memory instructions do not fully coalesce their accesses into two requests. The bulk of the program's execution time is spent between PTX lines 157 and 253, where the instructions request between seven and 23 cache lines each on average.
Further analysis of the Memcached code revealed that the main reason these instructions do not request closer to 32 lines is that Memcached's SIMD efficiency also drops during this phase, resulting in fewer active lanes, and hence fewer possible memory requests to be coalesced on each memory instruction. As a result, a relatively small amount of code repeatedly generates a large number of memory accesses, which backs up the memory-request queue, leading to decreases in performance.

The preceding data indicates that the inclusion of an L1 data cache is critical to the performance of Memcached. Processing more than one memory request per cycle (e.g., through a multi-banked L1 data cache) would also improve performance because it allows the backed-up memory-request queue to empty sooner.

Effect of Wavefront Size on Performance

Figure 3.14: Performance of Memcached at various wavefront sizes (normalized to a warp size of 8).

Figure 3.14 shows the performance of our modified Memcached on the baseline GPGPU-Sim simulator when varying wavefront lengths. The performance is normalized to a wavefront length of eight. A smaller wavefront length limits the amount of underutilized hardware resources when work items in the wavefront diverge, however, a larger wavefront length improves the maximum achievable performance as more instructions can execute per cycle and increases the number of memory requests potentially able to be coalesced. This data shows that there is an 18% decrease in performance between a wavefront size of eight and 64. This indicates that Memcached's SIMD efficiency is a limiting factor even in the presence of excessive memory stalls.

3.5 Summary

This chapter presents an initial characterization and evaluation of Memcached on both GPU hardware and GPU simulation frameworks. We identified many challenges with porting an application with irregular control-flow and memory access behaviour, as well as large data requirements, to a GPU system. We presented our solutions to address these challenges and mitigate their impact on performance, such as request batching, specific data structure layouts in memory to maximize memory coalescing, and reducing the CPU-GPU data transfer sizes. While this chapter focusses on Memcached, we believe that the presented methodology of batching user network requests for processing on a throughput-efficient device can be generalized such that other server-type applications with ample request-level parallelism could take advantage of this framework. We then presented an analysis using a GPU control-flow simulator, CFG-Sim, and a cycle-accurate GPGPU simulator, GPGPU-Sim, to gain additional insight into the behavior of Memcached on a GPU. From this analysis, we conclude that irregular applications, such as Memcached, should not be immediately disregarded when considering porting them to a GPU. Even though the SIMD efficiency may be lower than traditional GPU applications, we find that Memcached's SIMD efficiency is approximately 2.7× higher than a naive assumption of equal branch probabilities in the code-path may suggest. This, coupled with the GPU's high memory-level parallelism, enables Memcached to achieve a sizeable speedup over the baseline CPU implementation. We observed that the AMD Llano A8-3850 and Zacate E-350 integrated Fusion GPUs outperformed their respective CPUs by factors of ∼7.5× and ∼4.5× respectively.
We also showed that the discrete GPU system was able to significantly outperform the CPU when data transfers are ignored. However, when including data transfer times, results are hindered by the data transfer overheads of the large Memcached request batches.

In this chapter, we focussed solely on accelerating the key-value lookup portion of Memcached on a GPU. While this is the largest contributor to the Memcached user-level request processing, other portions of the full Memcached request processing, such as the network processing overheads and data movement, contribute to the majority of the end-to-end processing time. In the next chapter, we address these issues by enabling the direct communication of Memcached data between the network interface and GPU, and offloading the network processing to the GPU.

Chapter 4

MemcachedGPU

The previous chapter (Chapter 3) evaluated the potential for accelerating irregular datacenter-based applications, such as Memcached, on contemporary integrated and discrete GPUs. However, Chapter 3 focussed specifically on accelerating Memcached's key-value look-up on the GPU, with all of the other required networking and Memcached request processing remaining on the CPU. While the key-value look-up accounts for the core operations and the majority of user-space processing time for a Memcached request, when considering all of the operations required for the end-to-end network request processing, the key-value look-up results in a relatively small fraction of the overall processing. This is highlighted in Figure 4.1 and Figure 4.2, which present a fine-grained breakdown of the total Memcached processing time for a single GET request, including Memcached operations and Linux system calls, and a coarse-grained breakdown of Memcached versus Linux kernel processing time, including Linux network processing, respectively. The execution breakdown was measured for a single request using Intel VTune [80]. As a result, some of the Linux kernel overheads that may be amortized over multiple network requests are not captured in this analysis. Figure 4.1 shows the portions of the GET request processing that were accelerated in Chapter 3, indicated by the red boxes, which contribute to the majority of the user-space processing. However, as can be seen in Figure 4.2, the overall majority of processing time for a single GET request is spent in the Linux kernel (e.g., Linux network stack) compared to the actual user-space Memcached processing (∼10%).

Figure 4.1: Breakdown of the baseline Memcached request processing time for a single GET request on the CPU.

Furthermore, a large portion of the Memcached user-space processing is also spent parsing the Memcached request and building the response packets. This chapter explores offloading the complete end-to-end processing for Memcached requests to the GPU, including the network packet processing.

To this end, we implement and evaluate an end-to-end version of Memcached on commodity GPU and Ethernet hardware, MemcachedGPU.
Memcached is a scale-out workload, typically partitioned across multiple server nodes. In this chapter, we focus on using the GPU to scale-up the throughput of an individual server node. We exploit request-level parallelism through batching to process multiple concurrent requests on the massively parallel GPU architecture, and task-level parallelism within a single request to improve request latency. While the previous chapter and other works have evaluated batch processing of network requests on GPUs, such as HTTP [3], or database queries [18, 179], they focus solely on the application processing, which depending on the application, can be a small subset of the total end-to-end request processing. In contrast, this chapter describes the design and implementation of a complete software GPU network offload management framework, GNoM1, that incorporates UDP network processing on the GPU in-line with the application processing (Figure 4.3). GNoM provides a software layer for efficient management of GPU tasks and network traffic communication directly between the network interface (NIC) and GPU.

Figure 4.2: End-to-end breakdown of user-space and Linux Kernel processing for a single GET request on the CPU.

This is the first work to perform all of the Memcached read request processing and UDP network processing on the GPU. We address many of the challenges associated with a full system network service implementation on heterogeneous systems, such as efficient data partitioning, data communication, and synchronization. Many of the core Memcached data structures are modified to improve scalability and efficiency, and are partitioned between the CPU and GPU to maximize performance and data storage. This requires synchronization mechanisms to maintain a consistent view of the application's data. The techniques presented in this chapter can be generalized to other network services that require both CPU and GPU processing on shared data structures.

This chapter also tackles the challenges with achieving low-latency network processing on throughput-oriented accelerators. GPUs provide high throughput by running thousands of scalar threads in parallel on many small cores. GNoM achieves low latency by constructing fine-grained batches, such as 512 requests, and launching multiple batches concurrently on the GPU through multiple parallel hardware communication channels. Compared to the previous chapter, which evaluated batch sizes of tens of thousands of requests, the small batch sizes here minimize batching delay and processing latency.

1 Code for GNoM and MemcachedGPU is available at https://github.com/tayler-hetherington/MemcachedGPU.
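A rough back-of-the-envelope estimate, assuming back-to-back minimum-size (96-byte) requests and ignoring Ethernet framing overhead, illustrates why even small batches fill quickly at line rate:

\[
\frac{10 \times 10^{9}\ \text{bits/s}}{96\ \text{bytes} \times 8\ \text{bits/byte}} \approx 13 \times 10^{6}\ \text{requests/s},
\qquad
\frac{512\ \text{requests}}{13 \times 10^{6}\ \text{requests/s}} \approx 39\ \mu\text{s per batch}.
\]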
At 10 Gbps with the smallest Memcached request size (96 bytes), the smaller batches result in requests being launched on the GPU every ∼40µs, keeping the GPU resources occupied to improve throughput while reducing the average request batching delay to under 20µs.

This chapter is organized as follows: Section 4.1 presents GNoM, a software framework for UDP network and application processing on GPUs; Section 4.2 describes the design of MemcachedGPU, an accelerated end-to-end key-value store, which leverages GNoM to efficiently run on a GPU; Section 4.3 presents the experimental methodology in this chapter; Section 4.4 evaluates the feasibility of achieving low-latency, 10 Gbit line-rate processing at all request sizes on commodity Ethernet and throughput-oriented GPU hardware; Section 4.4.4 explores the potential for workload consolidation on GPUs running on servers in a datacenter during varying client demands, while maintaining a level of quality of service (QoS) for a GPU networking application; and finally Section 4.4.6 compares MemcachedGPU against prior Memcached implementations.

4.1 GPU Network Offload Manager (GNoM)

GNoM is a CPU-GPU software framework that enables high-throughput and low-latency processing of (UDP) network packets on the GPU. This section addresses some of the challenges with achieving this goal and presents the software architecture of GNoM that facilitated the design and implementation of MemcachedGPU. While we focus on accelerating Memcached using the GNoM framework, we believe that many parts of the design, such as the efficient packet data movement and GPU task management, can be generalized to support other applications in the domain of high-throughput, UDP request/response-type applications.

4.1.1 Request Batching

Request batching in GNoM is done per request type (GET, SET, UPDATE, DELETE) to reduce control flow divergence among GPU threads. While intra-request control flow divergence still exists (as discussed in Chapter 3), the high-level operations within each request type are the same (e.g., hashing the request key or accessing the hashtable). The previous chapter evaluated large batch sizes (e.g., 30K+ requests/batch) to maximize the GPU's utilization and request throughput. However, large batch sizes negatively impact request latency, since requests will be queued longer and per-request processing will be longer. Smaller batch sizes minimize batching latency, however, increase the GPU kernel launch overhead and lower GPU resource utilization, which reduces overall throughput. In this chapter, we evaluate the use of smaller batch sizes, but overlap multiple such smaller batches to reduce the impact on throughput. We empirically find that 512 requests per batch provides a good balance of throughput and latency (Section 4.4.3). As discussed in Section 2.1.2, NVIDIA GPUs support up to 32 concurrently running kernels via Hyper-Q. Assuming that the GPU has enough resources to support all 32 GNoM kernels, a batch size of 512 requests enables 16K requests to be processed concurrently.

While MemcachedGPU's main focus is in accelerating a single request type for batching, GET requests, workloads with many different request types could batch requests at finer granularities.
For example, multiple smaller batches could be constructed at the warp-level of 32 requests and launched together on the GPU to perform different tasks [22] in parallel. MemcachedGPU does support other Memcached request types, such as SET, UPDATE, and DELETE; however, the current version implements these requests in individual batches of a single warp. Batching and accelerating these request types is left to future work.

Figure 4.3: GNoM packet flow and main CUDA kernel. The figure contains the three main components, the NIC, CPU, and GPU, and the corresponding GNoM software frameworks that run on each device. The solid black arrows represent data flow, the dashed black arrows represent metadata flow (e.g., interrupts, packet pointers), the double black arrows represent packet data and packet metadata, the solid grey arrows represent GPU thread control flow, and the dashed grey lines represent GPU synchronization instructions.

4.1.2 Software Architecture

GNoM is composed of two main software frameworks that cooperate to balance throughput and latency for network services on GPUs: GNoM-host (CPU) and GNoM-dev (GPU). At a high level, GNoM-host is responsible for interacting with the NIC, performing any pre/post-GPU data processing, and for GPU task management. GNoM-dev (GPU) is responsible for the main UDP packet and application-level processing. Figure 4.3 presents the GNoM software architecture, along with the interactions between the NIC, CPU, and GPU in GNoM, as well as the main CUDA kernel for GNoM-dev. RX (receive) packets are DMA-ed directly to the GPU's global memory using GPUDirect (Section 2.1.2). Only metadata describing the request batch is sent to the CPU. This metadata is described further below. We focus on accelerating the RX path in GNoM. While GNoM can implement a similar TX (transmit) path using GPUDirect, MemcachedGPU uses a third party CPU Linux network bypass service, PF_RING [125], to accelerate the TX path on the CPU. This decision was driven by the design of MemcachedGPU in which the main data structures used to populate the response packets are stored in CPU memory (Section 4.2). This design decision increases the total amount of storage available for Memcached, which is important for the effectiveness of a Memcached server (Section 4.2.2). However, this also requires CPU post processing, which decreases the total system energy-efficiency (Section 4.4).

The rest of this section describes GNoM-host and GNoM-dev in detail.

GNoM-host

GNoM-host provides task management and I/O support for the GPU. GNoM-host is required because current GPUs cannot directly receive control information or interrupts from other third party devices in a heterogeneous system through the standard CUDA interface. We evaluate this limitation further in Chapter 5.
Figure 4.4: Software architecture for GNoM-host (CPU). [The figure shows GRXB metadata flowing between GNoM-ND, GNoM-KM, and GNoM-user (GNoM-pre and GNoM-post), including the pending RX batch queue, the available GRXB queue, the current GRXB batch buffer, and the GRXB recycling/verification path; the numbered steps (1)-(8) are referenced in the text.]

Other work has explored the reverse direction of GPUs communicating to third party devices, for example, by having the GPU write to the doorbell register on a NIC [183] to indicate when data is available to send. However, on current systems, third party devices must go through the GPU driver to initiate work on the GPU, which requires assistance from the CPU. A workaround for this is to use persistent GPU threads, which are described in Section 2.1.7. For GNoM, this limitation restricts the NIC from being able to directly initiate tasks on the GPU, even though the packet data is already in GPU memory. As a result, the CPU acts as a middleman between the NIC and GPU. GNoM-host is responsible for efficiently managing metadata movement between the NIC and GPU, managing tasks on the GPU, and performing any post-processing tasks required prior to sending response packets. As described above, all packet data is directly copied to the GPU memory from the NIC. To accomplish these tasks, GNoM-host is composed of three software components: a modified Intel IXGBE network driver (v3.18.7), GNoM-ND; a custom Linux kernel module, GNoM-KM; and a user-level software framework, GNoM-user, which are described below. Figure 4.4 highlights the interaction between these three components. In the following sections, any numbered reference (1)-(8) refers to Figure 4.4.

GNoM-KM and GNoM-ND

GNoM-KM is a custom Linux kernel module, which provides an interface between the NIC driver (GNoM-ND) and the user-space application (GNoM-user). GNoM-KM includes hooks for GNoM-user to communicate indirectly with the NIC via GNoM-ND, such as configuring the NIC for use with GNoM, retrieving new request batches, and recycling completed batches of request buffers to the NIC. Multiple steps are required to initialize and configure GNoM.

GNoM-KM first allocates pinned, un-pageable GPU memory using GPUDirect to store incoming RX packets, referred to in this work as GPU RX Buffers (GRXBs). A total of 220MB is allocated for the GRXBs, partitioned into 32 2KB buffers per 64KB GPU page. 220MB is the maximum amount of pinnable GPU memory one can allocate for GPUDirect on the NVIDIA Tesla K20c at the time this study was performed. However, future NVIDIA GPUs (starting from the NVIDIA K40) increased the amount of pinnable GPU memory that can be accessed across the PCIe bus to 16GB [183]. The limited pinnable memory on the GPUs evaluated in this study reduces the peak obtainable throughput, which is discussed further below. Evaluating GNoM on newer GPUs with increased support for RDMA-accessible memory is left to future work. GNoM-KM allocates the GRXBs and registers them with GNoM-ND. The GRXBs are maintained in a circular queue in GNoM-ND (8) and can be in one of three states: free, registered, or busy. Free buffers are not allocated to either device (NIC or GPU) and are waiting to be registered with the NIC. Registered buffers are registered with the NIC, waiting for a new RX packet to fill the buffer. Busy buffers contain a valid RX packet and are either waiting or actively being processed on the GPU.
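To make the buffer lifecycle concrete, the following C-style sketch illustrates the three GRXB states and the recycling path just described. The names (grxb_state, grxb_ring, and the helper functions) are illustrative assumptions, not GNoM's actual identifiers, and the NIC descriptor posting is elided.

    #include <stdint.h>

    enum grxb_state { GRXB_FREE, GRXB_REGISTERED, GRXB_BUSY };

    struct grxb {
        uint64_t gpu_addr;          /* GPUDirect-pinned GPU address of this 2KB buffer */
        enum grxb_state state;
    };

    struct grxb_ring {
        struct grxb *bufs;          /* circular queue of GRXBs maintained by GNoM-ND */
        unsigned num_bufs;
    };

    /* free -> registered: post the buffer's GPU address to the GPU RX queue. */
    static void grxb_register(struct grxb_ring *r, unsigned i) {
        r->bufs[i].state = GRXB_REGISTERED;
        /* ... write r->bufs[i].gpu_addr into the NIC's RX descriptor ring ... */
    }

    /* registered -> busy: the NIC has DMA-ed a packet into this buffer. */
    static void grxb_rx_complete(struct grxb_ring *r, unsigned i) {
        r->bufs[i].state = GRXB_BUSY;
    }

    /* busy -> free -> registered: GNoM-post has finished with the batch. */
    static void grxb_recycle(struct grxb_ring *r, unsigned i) {
        r->bufs[i].state = GRXB_FREE;
        grxb_register(r, i);
    }

Each buffer therefore cycles through the NIC, GPU, and CPU exactly once per received packet, which is why the recycling rate directly bounds the sustainable request rate.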
GNoM-KM also allocates a secondary circular buffer (2) for storing batches of GRXBs which have been populated by GNoM-ND, but not yet consumed by GNoM-user (described in more detail below).

The NIC is then configured in GNoM-ND. The NIC evaluated in this study (Intel 82599 10GbE) contains multiple hardware RX queues and supports hardware packet filtering and receive side scaling (RSS) [67]. RSS enables RX hardware queues to be mapped to different CPU processors such that incoming packets can be distributed across processors to increase performance. The NIC can apply a hardware packet filter to steer packets to the different RX queues. We make use of this feature by assigning one of the RX hardware queues to be a GPU queue. In a multi-GPU system, multiple GPU packet filters could be installed to distribute packets across GPUs. GNoM-ND installs a hardware packet filter for a range of UDP ports to be processed by the GPU, such that any packet that matches the filter is directed to the GPU RX queue. GNoM-ND ensures that the GRXBs allocated by GNoM-KM are only registered with the GPU RX queue, which enables packet data to flow directly from the NIC to the GPU's memory. Another benefit of this organization is that all other network traffic not destined for the GPU is steered towards the non-GPU RX queues, which flows through the baseline Linux networking stack on the CPU.

Similar to the baseline Linux networking stack, GNoM uses Linux NAPI to mitigate the impact of high interrupt rates by scheduling a polling routine to service the received packets. As the packet rate increases, and consequently the rate of interrupts from the NIC increases, GNoM-ND disables any further interrupts for received packets and schedules a polling routine. At a later time, the polling routine is run, which queries the NIC for a specified number of packets to service. If there are fewer packets than this threshold, the polling routine is completed. Otherwise, the polling routine is rescheduled to service the remaining packets.

To reduce memory copies for packet data, the GRXBs hold the packets throughout their lifetime on the server, recycling them to the NIC for new packets only when the full processing is complete. This differs from the baseline Linux network driver flow, which recycles the RX buffers immediately to the NIC after copying the packets into Linux SKBs for further processing in the Linux kernel. In this case, the GRXBs act as both the RX buffers and the Linux SKBs. As such, GNoM requires significantly more pinned memory than the baseline network flow to ensure that a GRXB is available to store incoming packets. In contrast, the baseline Linux network driver flow can have a small amount of pinned memory for the RX buffers and a large amount of pageable memory for the Linux SKBs.

A packet drop occurs in GNoM when GRXBs are not recycled quickly enough to accommodate newly received packets. The time to recycle packets is directly impacted by the time to process the packet on the CPU and GPU. As a result, along with increasing the request throughput in GNoM, we must also aim to minimize the request processing latency such that we always have a free GRXB for a new packet. If more GRXBs cannot be allocated, then the client's packet send rate must be reduced accordingly to avoid dropping packets on the GPU server.
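As a rough illustration of why recycle latency matters, the following back-of-envelope calculation (an estimate, not a measured result) combines the 220MB pool of 2KB GRXBs described above with the roughly 13 MRPS peak GET rate reported later in this chapter:

    #include <stdio.h>

    int main(void) {
        double pool_bytes = 220.0 * 1024 * 1024;   /* pinned GPUDirect GRXB pool        */
        double grxb_bytes = 2048.0;                /* one 2KB GRXB per in-flight packet */
        double peak_rps   = 13.0e6;                /* ~peak GET rate at 16B keys        */

        double num_grxbs  = pool_bytes / grxb_bytes;   /* ~112,640 buffers              */
        double budget_s   = num_grxbs / peak_rps;      /* ~8.7 ms                       */

        printf("GRXBs: %.0f, average occupancy budget per buffer: %.2f ms\n",
               num_grxbs, budget_s * 1e3);
        return 0;
    }

In other words, at peak rate each GRXB must, on average, complete GPU processing, CPU post-processing, and recycling in under roughly 8.7 ms before the pool is exhausted, which is consistent with the low-millisecond end-to-end latencies required in Section 4.4.3.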
We evaluate the improvement in system throughput when the number of GRXBs can be arbitrarily increased through an offline study, where packets are read directly from system memory instead of from the network (Section 4.4.5).

GNoM-ND has two main responsibilities when receiving new packets from the network. First, GNoM-ND DMAs the incoming packets to the GRXBs directly in GPU memory, indicated by the solid black RX arrow from the NIC to the GPU in Figure 4.3 (labelled RX GPUDirect). Second, GNoM-ND constructs metadata describing the batch of GPU packets (1) and passes the batch metadata to GNoM-KM once a batch is fully populated and ready for GPU processing (2). The GRXB batch is stored in a circular queue in GNoM-KM. If a GRXB batch is not yet ready in GNoM-KM when a GNoM-user thread issues a read request for a new batch of GRXBs, the thread is blocked. When GNoM-ND transfers the GRXB batch to GNoM-KM, any waiting threads are notified, which resumes blocked threads to consume the new batch of packets. The GRXB batch metadata is then copied from kernel-space to user-space (3) and a batch ID is stored in GNoM-KM to perform a sanity check when the corresponding batch is recycled.

We found that the NIC populates the GRXBs in the order that they were registered to the NIC. As such, the batch metadata only requires a pointer to the first packet (GRXB) and the total number of packets in the batch to identify all packets in the batch. Note that this requires all GRXBs to have the same size. With this optimization, the amount of data needing to be transferred to GNoM-user and across the PCIe bus to the GPU is reduced from 4KB (512 packets, 8 bytes per GRXB pointer) to 12B (8 bytes for the first GRXB pointer and 4 bytes for the packet counter).2
2 While a 4-byte counter is unnecessary given that the batch size is 512 packets, there is little performance benefit when transferring small data sizes across the PCIe bus.

Once the packets have completed processing on the GPU and CPU, the GRXBs must be recycled back to the NIC to be able to receive new packets. GNoM-KM provides an interface for the GNoM-user threads to recycle GRXBs to GNoM-ND (6). GNoM-KM performs a sanity check (7) to ensure that the GRXB batch is valid and then GNoM-ND marks the GRXBs as free prior to registering them back with the NIC (8).

GNoM-User

GNoM-user consists of pre-processing (GNoM-pre) and post-processing (GNoM-post) user-level threads (see Figure 4.3 and Figure 4.4). GNoM-pre retrieves request batch metadata from GNoM-KM (3), performs application specific memory copies to the GPU, launches CUDA kernels that perform the network service processing on the batch of requests, and constructs CUDA events to detect when the GPU processing completes (4). For MemcachedGPU, GNoM-pre transfers the GRXB metadata describing the current batch (pointer to the first GRXB and the number of packets in this batch), a timestamp for this batch, and a pointer to the GPU memory to store the response packets. GNoM uses CUDA streams to overlap processing of multiple small batches to provide a better trade-off between packet latency and throughput.

GNoM-post (5) polls CUDA events waiting for the GPU network service processing to complete, populates the response packets with application specific data, and transmits the response packets using PF_RING. For MemcachedGPU, this consists of copying the item's value, corresponding to the Memcached key in the original GET request, from the Memcached memory slabs in CPU memory to the TXBs (PF_RING transmit buffers) in CPU memory for each packet in the batch. The PF_RING buffers are then sent out the NIC. Finally, GNoM-post recycles the now free GRXBs back to GNoM-ND for future RX packets (6).
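The GNoM-pre/GNoM-post hand-off can be summarized by the simplified CUDA sketch below. The kernel name, batch structure, and launch geometry are illustrative placeholders rather than GNoM's actual code, and error handling is omitted.

    #include <cuda_runtime.h>

    // Placeholder for the GNoM-dev kernel of Figure 4.3 (UDP + GET processing).
    __global__ void gnom_batch_kernel(char *grxb_first, int num_pkts, char *d_resp) {
        /* ... main/helper warp processing of one 512-request batch ... */
    }

    struct Batch {
        char        *grxb_first;   // device pointer to the first GRXB in the batch
        int          num_pkts;     // 512 requests per batch in this chapter
        char        *d_resp;       // device buffer for response headers/item pointers
        cudaStream_t stream;       // one of several concurrent streams (Hyper-Q)
        cudaEvent_t  done;         // signals completion of this batch's kernel
    };

    // GNoM-pre: launch one small batch asynchronously in its own stream.
    void gnom_pre_launch(Batch &b) {
        int threads = 512;                                  // 2 threads per packet
        int blocks  = (2 * b.num_pkts + threads - 1) / threads;
        gnom_batch_kernel<<<blocks, threads, 0, b.stream>>>(b.grxb_first,
                                                            b.num_pkts, b.d_resp);
        cudaEventRecord(b.done, b.stream);
    }

    // GNoM-post: poll the event; when done, fill TXBs and recycle the GRXBs.
    bool gnom_post_poll(Batch &b) {
        if (cudaEventQuery(b.done) == cudaErrorNotReady)
            return false;                                   // batch still on the GPU
        // ... copy item values from CPU slabs into PF_RING TXBs, transmit,
        //     then return this batch's GRXBs to GNoM-ND (step 6) ...
        return true;
    }

Because each batch lives in its own stream with its own event, several 512-request batches overlap on the GPU while GNoM-post drains completed ones in order of completion.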
As will be discussed in Section 4.4.3, the performance of GNoM is highly dependent on the rate at which GNoM-post threads are able to complete the post-processing tasks and recycle the GRXBs. We empirically find that four GNoM-post threads are required to achieve 10 GbE line-rate throughput with the smallest Memcached packet sizes. However, the polling nature of GNoM-post threads negatively impacts energy-efficiency.

Non-GPUDirect (NGD)

GPUDirect is currently only supported on the high-performance NVIDIA Tesla and Quadro GPUs. As previously discussed, GPUDirect minimizes the amount of data which first needs to be copied to CPU memory prior to being copied to the GPU memory by directly transferring packet data to GPU memory from the NIC. To evaluate MemcachedGPU on lower power, lower cost GPUs, we also implemented a non-GPUDirect (NGD) framework. NGD uses PF_RING [125] to receive and batch Memcached packets in host memory before copying the request batches to the GPU. NGD uses the same GNoM-user and GNoM-dev framework; however, GNoM-KM and GNoM-ND are replaced by PF_RING. Section 4.4.3 evaluates NGD on the NVIDIA Tesla K20c and GTX 750Ti GPUs.

GNoM-dev

The lower portion of Figure 4.3 illustrates the GNoM CUDA kernel for UDP packet and network service processing (e.g., MemcachedGPU GET request processing). Once a network packet has been parsed (UDP processing stage), the network service can operate in parallel with the response packet generation since they are partially independent tasks. Similar to Singe [22], we make use of warp specialization via conditional operators on warp IDs to perform multiple different independent, but related tasks on different warps within the same thread block. In GNoM, the number of GPU threads launched per packet is configurable (MemcachedGPU uses two threads). GNoM-dev leverages additional helper threads to perform parallel tasks related to a single network service request, exploiting both packet level and task level parallelism to improve response latency and throughput.

GNoM-dev groups warps into main and helper warps. Main warps perform the network service processing (e.g., Memcached GET request processing) while helper warps perform the UDP processing and response packet header construction. The main and helper warps also cooperatively load RX data and store TX data (e.g., response packet headers and any application specific data, such as pointers to Memcached items in CPU memory) between shared and global memory efficiently through coalesced memory accesses. This requires CUDA synchronization barriers (__syncthreads) to ensure that the main and helper warps maintain a consistent view of the packet data in shared memory. The UDP processing stage verifies that the packet is for the network service and verifies the IP checksum. While most of the response packet header can be constructed in parallel with the network service processing, the packet lengths and IP checksum are updated afterwards to include any application-dependent values. For example, the length of the Memcached item value is not known until after the Memcached hash table has been accessed and the corresponding entry is retrieved (network service processing stage). After synchronizing, all warps cooperatively update the response packets with any application dependent data and then proceed to copy the response packet data and any application specific data (e.g., pointers to Memcached items in CPU memory) from shared memory to global memory to be processed by the GNoM-post threads on the CPU.
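A minimal sketch of this warp specialization is shown below. The kernel name, even/odd warp assignment, and the processing stubs are illustrative assumptions and do not reproduce the actual GNoM-dev kernel.

    __global__ void gnom_dev_kernel(const char *pkts, int num_pkts, char *resp) {
        extern __shared__ char s_pkt[];            // packet data staged in shared memory
        int  warp_id = threadIdx.x / 32;
        bool is_main = (warp_id % 2) == 0;         // even warps: main, odd warps: helper

        // Main and helper warps cooperatively load RX data with coalesced accesses.
        /* ... copy this block's packets from global memory into s_pkt ... */
        __syncthreads();

        if (is_main) {
            /* Network service processing: parse the GET request, hash the key,
               and probe the GPU hash table for the CPU item pointer. */
        } else {
            /* UDP/IP checks and partial response packet header construction. */
        }
        __syncthreads();                           // both tasks complete before fix-up

        /* All warps fill in application-dependent fields (lengths, IP checksum)
           and cooperatively store response data to global memory for GNoM-post. */
    }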
4.2 MemcachedGPU

This section presents the design of MemcachedGPU and discusses the modifications required to achieve low latency and high throughput processing on the GPU.

4.2.1 Memcached and Data Structures

As previously mentioned, in typical Memcached deployments [124], GET requests comprise a large fraction of traffic when hit-rates are high (e.g., 99.8% for Facebook's USR pool [16]). Hence in MemcachedGPU, we focus on accelerating Memcached GET requests and leave the majority of SET request processing on the CPU.

Memcached data structures accessed by both GET and SET requests include the hash table to store pointers to Memcached items, memory slabs to store the Memcached items and values, and a least-recently-used (LRU) queue for selecting key-value pairs to evict from the hash table when Memcached runs out of memory on a SET. Memcached keys can be an order of magnitude smaller than value sizes (e.g., 31B versus 270B for Facebook's APP pool [16]), placing larger storage requirements on the Memcached item values.

Figure 4.5: Partitioning the Memcached hash table and value storage between the CPU and GPU. [The hash table, organized as sets of entries (valid flag, timestamp, key length, value length, CPU item pointer, 8-bit key hash, and key), resides in GPU memory, while the Memcached value storage resides in CPU memory.]

These data structures need to be efficiently partitioned between the CPU and GPU due to the smaller GPU DRAM capacity versus the CPUs found on typical Memcached deployments and to ensure high performance and scalability. In MemcachedGPU, we place the hash table containing keys and item pointers in GPU memory, while the Memcached item values stored in the memory slabs remain in CPU memory. This partitioning ensures that the majority of data structures accessed for GET request processing are stored in GPU memory, which helps to minimize request processing latency. This partitioning is highlighted in Figure 4.5.

GET Requests

At a high level, GET requests perform a look-up in a hash table and return the corresponding Memcached item if found. MemcachedGPU performs the main GET request operations on the GPU. This includes parsing the GET request, extracting the key from the GET request packet, hashing the key, and accessing the hash table to retrieve the corresponding item pointer. Aside from the item values, all of the Memcached data structures accessed by GET requests are stored in GPU global memory. MemcachedGPU uses the same Bob Jenkins' lookup3 hash function [86] included in the baseline Memcached to hash the key. GET requests completely bypass the CPU and access the hash table on the GPU as described in Section 4.2.2. Each GET request is handled by a separate GPU thread, resulting in memory divergence on almost every hash table access. This is because each request should, based on the efficacy of the hashing function, be hashed to different parts of the hash table. However, the small number of active GPU threads and the GPU's high degree of memory-level parallelism mitigate the impact of memory divergence on performance.
As such, multiple memory requests can be in-flight and processed concurrently. Additionally, when possible, GNoM stores packet data in GPU shared memory, which improves performance. After the GPU processing is complete, GNoM-post receives a list of Memcached response packet headers and corresponding item value pointers for each GET request (assuming the GET request hit in the hash table). Finally, GNoM-post copies the item value for each packet in the batch from CPU memory into a response packet (TXB) to be sent across the network.

SET Requests

While the main focus of MemcachedGPU is on accelerating GET requests, SET requests must also interact with the GPU to update the hash table. SET requests require special attention to ensure consistency between CPU and GPU Memcached data structures. In MemcachedGPU, SET requests follow the standard Memcached flow over TCP through the Linux network stack and Memcached code on the CPU. They update the hash table with the new or updated entry by launching a simple SET request handler on the GPU.

SET requests first allocate the item data in the Memcached memory slabs in CPU memory and then update the GPU hash table. This ordering ensures that subsequent GET requests are guaranteed to find valid CPU item pointers in the GPU hash table – a GET request will only encounter an entry in the hash table which holds a valid item stored in CPU memory. Another consequence of this ordering is that both SET and UPDATE requests are treated the same since the hash table has not been probed for a hit before allocating the new Memcached item. An UPDATE request simply acts as a SET that evicts and replaces the previous item with the new item. As such, the previous item is freed from CPU memory if the SET request returns that an existing entry corresponding to this SET request's key was found in the hash table.

SET requests update the GPU hash table entries, introducing a race condition between GET requests and other SET requests. Section 4.2.2 describes a GPU locking mechanism to ensure exclusive access for SET requests, while maintaining shared access for GET requests. As GET requests typically dominate request distributions, we have used a simple implementation in which each SET request is launched as a separate kernel and processed by a single warp. Aside from the usage of the TCP protocol, there is no fundamental reason why SET requests could not follow a similar flow to the GET requests in the GNoM framework. At a minimum, SET requests can be batched on the CPU (after flowing through the standard Linux protocol) and processed concurrently on the GPU. Accelerating SET requests through batch processing, potentially requiring a reliable network protocol, is left to future work.

Additionally, because SET requests and GET requests are handled by separate CPU threads, another race condition exists when a GET request attempts to access a Memcached item's value in CPU memory concurrently with a SET request. Section 4.2.3 addresses this race condition between dependent SET and GET requests.

4.2.2 Hash Table

This section presents the modifications to the baseline Memcached hash table to enable low-latency and high-throughput processing on the GPU while minimizing the impact on the hash table hit rate.

Hash Table Design

The baseline Memcached implements a dynamically sized hash table with hash chaining on collisions. A collision refers to the case where different keys are hashed to the same entry and is caused by having a finite number of entries in the hash table.
A hash table with hash chaining resolves collisions by dynamically allocating new entries and linking them into existing linked lists (chains) at conflicting hash table entries. This ensures that all items will be stored as long as the system has enough memory. However, depending on the rate of collisions, a long chain of entries may need to be traversed to find the correct entry.

This hash table design is a poor fit for GPUs for two main reasons. First, dynamic memory allocation on current GPUs can significantly degrade performance [75]. Second, hash chaining creates a non-deterministic number of elements to be searched between requests when collisions are high. This can degrade SIMD efficiency when chain lengths vary, since each GPU thread handles a separate request. In this case, some threads may find their request in the first entry in the chain but are blocked waiting for any other threads which have to traverse a long chain to find their entry.

The above observations drive the hash table design in MemcachedGPU, which implements a fixed-size set-associative hash table, similar to [26, 106]. We select a set size of 16 ways (see Section 4.4.1). We also evaluated a modified version of hopscotch hashing [68] that evicts an entry if the hopscotch bucket is full instead of rearranging entries. This improves the hit-rate over a set-associative hash table by 1-2%; however, the peak GET request throughput is lower due to additional synchronization overheads. Specifically, the hopscotch hash table requires locking on every entry since no hopscotch group is unique, whereas in MemcachedGPU, the set-associative hash table requires locking only on each set.

Each hash table entry contains a header and the physical key (Figure 4.5). The header contains a valid flag, a last accessed timestamp, the length of the key, the length of the corresponding item value, and a pointer to the item in CPU memory. MemcachedGPU also adopts an optimization from [52, 106] that includes a small 8-bit hash of the key in every header. When traversing the hash table set, the 8-bit hashes are first compared to identify potential matches. The full key is compared only if the key hash matches, reducing both control flow and memory divergence.
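The entry layout and GET-side set probe described above can be sketched as follows. The field widths, names, and helper function are illustrative approximations of the structures in Figure 4.5, not MemcachedGPU's exact definitions.

    #define SET_WAYS    16
    #define MAX_KEY_LEN 250

    struct HashEntry {                       // header + physical key (Figure 4.5)
        unsigned           valid;
        unsigned           last_access;      // per-set LRU timestamp
        unsigned short     key_len;
        unsigned           val_len;
        unsigned long long cpu_item_ptr;     // pointer to the item value in CPU memory
        unsigned char      key_hash8;        // small 8-bit hash to filter comparisons
        char               key[MAX_KEY_LEN];
    };

    // GET probe: each GPU thread walks the 16-way set for its own request.
    __device__ unsigned long long probe_set(const HashEntry *table, unsigned set_idx,
                                            const char *key, int key_len,
                                            unsigned char key_hash8) {
        const HashEntry *set = &table[set_idx * SET_WAYS];
        for (int w = 0; w < SET_WAYS; ++w) {
            if (!set[w].valid || set[w].key_hash8 != key_hash8) continue;  // cheap filter
            if (set[w].key_len != key_len) continue;
            bool match = true;
            for (int i = 0; i < key_len && match; ++i)
                match = (set[w].key[i] == key[i]);                         // full compare
            if (match) return set[w].cpu_item_ptr;                         // hit
        }
        return 0;                                                          // miss
    }

The 8-bit filter means that, for most non-matching ways, only the header is read, keeping the loop short and bounded by the set size.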
Hash Table Collisions and Evictions

The baseline Memcached uses a global lock to protect access to a global LRU queue for managing item evictions. On the GPU, global locks would require serializing all GPU threads for every Memcached request, resulting in low SIMD efficiency and poor performance. Other works [124, 177] also addressed the bottlenecks associated with global locking in CPU implementations of Memcached.

Instead of a global LRU, we manage a local LRU per hash table set, such that GET and SET requests only need to update the timestamp of the hash table entry. The intuition is that the miss rate of a set-associative cache is similar to a fully associative cache for high enough associativity [144]. Whereas hash chaining allocates a new entry on a collision, collisions in MemcachedGPU are resolved by finding a free entry or evicting an existing entry within the hash table set. This introduces an additional eviction condition to Memcached, which previously only occurred when the maximum amount of item storage has been exceeded. While a set-associative hash table was also proposed in [26, 106], we expand on these works by evaluating the impact of the additional evictions on hit-rates compared to the baseline Memcached hash table with hash chaining in Section 4.4.1. We find that the additional evictions result in a decrease in hit-rate of approximately 0.01% to 3.6% for different key access distributions. This indicates that a global LRU replacement policy is not necessary to effectively capture the locality.

SET requests search a hash table set for a matching, invalid, or expired entry. If the SET misses and no free entries are available, the LRU entry in this hash table set is evicted. GET requests traverse the entire set until a match is found or the end of the set is reached. This places an upper bound, the set size, on the worst case number of entries each GPU thread traverses, which mitigates the impact on SIMD efficiency relative to hash chaining. If the key is found, the CPU value pointer is recorded to later populate the response packet.

Storage Limitations

As previously described, the hash table is partitioned from the value storage due to the relatively smaller GPU DRAM capacity versus CPUs. This static hash table places an upper bound on the maximum amount of key-value storage. Consider a high-end NVIDIA Tesla K40c with 12GB of GPU memory. GNoM and MemcachedGPU consume ∼240MB of GPU memory for data structures such as the GRXBs, response buffers, and SET request buffers. This leaves ∼11.75GB for the GPU hash table and the lock table (described in Section 4.2.2). The hash table entry headers are a constant size; however, the key storage can be varied depending on the maximum size. For example, the hash table storage increases from 45 million to 208.5 million entries when decreasing from a maximum key size of 250B to 32B. From [16], typical key sizes are much smaller than the maximum size, leading to fragmentation in the static hash table if each entry is allocated for the worst case maximum size.

If a typical key size distribution is known, however, multiple different hash tables with fixed-size keys can be allocated to reduce fragmentation. For example, [16] provides key and value size distributions for the Facebook ETC workload trace. If we create five hash tables with static key entries of 16, 32, 64, 128, and 250B with each size determined by the provided key distribution (0.14%, 44.17%, 52.88%, 2.79%, and 0.02% respectively), this enables a maximum of 157 million entries for a 10GB hash table. Using the average value size of 124B for ETC, this static partitioning on the GPU would enable indexing a maximum of 19.2GB of value storage in CPU memory compared to only 5.5GB when allocating for the worst case key size.

While GPU DRAM sizes continue to grow, integrated GPUs may remove this limitation entirely with access to far more DRAM than discrete GPUs. Our results on a low-power GPU (Section 4.4.3) and integrated GPUs (Chapter 3) suggest that integrated GPUs may be able to achieve high throughputs in MemcachedGPU and are an important alternative to explore for GNoM and MemcachedGPU.

GPU Concurrency Control

Aside from updating the timestamps, GET requests do not modify the hash table entries. Thus, multiple GET requests can access the same hash table entry concurrently as they are guaranteed to have similar timestamp values.3
3 GPU request processing latency is on the order of hundreds of microseconds. Hence, requests being concurrently processed on the GPU either belong to the same request batch or to a request batch launched at a similar point in time.

Furthermore, Memcached has a relaxed requirement on request ordering. As such, the race condition between concurrent updates to the timestamp from GET requests can be safely ignored. However, SET requests require exclusive access since they actually modify the hash table entries.
To handle this, we employ a multiple reader (shared), single writer (exclusive) spin lock for each hash table set using CUDA atomic compare and exchange (CAS), increment, and decrement instructions. The GPU locks are also implemented as test-and-test-and-set to reduce the number of atomic instructions. The shared lock ensures that threads performing a GET request in a warp will never block each other, whereas the exclusive lock ensures exclusive access for SET requests to modify a hash table entry.

For SET requests, a single thread per warp acquires an exclusive lock for the hash table set. The warp holds on to the lock until the SET request hits in one of the hash table entries, locates an empty or expired entry, or evicts the LRU item for this set. The remaining threads in the warp are used to perform a coalesced store of the key into the hash table.

The hash table locks are maintained in a lock table residing in GPU global memory. The size of the lock table is dictated by the size of the hash table and the number of sets, with one lock entry per hash table set. Due to the high ratio of GET requests to SET requests, it would also be possible to have a smaller lock table, since GET requests obtain shared locks which do not block other GET requests.
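One possible encoding of this shared/exclusive per-set lock, built from CUDA atomics, is sketched below. The encoding (0 = free, positive = reader count, -1 = writer), the busy-wait loops, and the fencing are illustrative rather than MemcachedGPU's exact implementation; as described above, only one thread per warp acquires the exclusive lock.

    // Per-set lock word: 0 = free, >0 = number of readers (GETs), -1 = writer (SET).
    __device__ void lock_shared(int *lock) {                 // GET requests
        while (true) {
            int observed = *((volatile int *)lock);          // test before test-and-set
            if (observed >= 0 &&
                atomicCAS(lock, observed, observed + 1) == observed)
                return;                                      // reader registered
        }
    }
    __device__ void unlock_shared(int *lock) { atomicSub(lock, 1); }

    __device__ void lock_exclusive(int *lock) {              // SET requests (one thread per warp)
        while (true) {
            while (*((volatile int *)lock) != 0) ;           // spin until the lock looks free
            if (atomicCAS(lock, 0, -1) == 0) {
                __threadfence();                             // order the critical section
                return;
            }
        }
    }
    __device__ void unlock_exclusive(int *lock) {
        __threadfence();
        atomicExch(lock, 0);
    }

Because GET requests only increment and decrement the reader count, threads of a warp performing GETs never serialize on the same set, while a SET waits until all in-flight readers of that set have drained.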
4.2.3 Post GPU Race Conditions on Eviction

While the CPU item allocation order (Section 4.2.1) and GPU locking mechanism (Section 4.2.2) ensure correct access to valid items in CPU memory, a race condition still exists in GNoM-post for SET requests that evict items conflicting with concurrent GET requests. This race condition exists because separate CPU threads handle post GPU processing for GET and SET requests. Consider the example in Figure 4.6. Here, a GET request is accessing an item, X, which is concurrently being evicted by a SET request occurring later in time on the GPU. In this example, the GET request obtains the shared lock for the hash table set corresponding to item X and correctly retrieves the CPU item pointer from the GPU hash table. At a later time, the SET request obtains the exclusive lock for X and evicts X from the hash table. Any subsequent GET requests will correctly miss on X. However, the requests are then returned to the CPU for post processing on separate threads. As such, it is possible that the SET request thread responsible for evicting the item deletes the item value storage prior to when the GET request reads X's value to populate the response packet, as shown on the right side of Figure 4.6. This may result in the GET request accessing stale or garbage data.

Figure 4.6: Race condition between dependent SET and GET requests in MemcachedGPU.

Removing the race condition requires preserving the order seen by the GPU on the CPU. To accomplish this, each GNoM-post thread maintains a global completion timestamp (GCT), which records the timestamp of the most recent GET request batch to complete sending its response packets. This is the same timestamp used to update the hash table entry's last accessed time. If a SET request needs to evict an item, it records the last accessed timestamp of the to-be-evicted item from the GPU hash table. After updating the GPU hash table, the SET request polls all GNoM-post GCTs on the CPU and stalls until they are all greater than its eviction timestamp before evicting the item. This ensures that all GET requests prior to the SET request have completed sending the response packets before the SET completes, preserving the order seen by the GPU. This stalling does not impact future GET requests since the SET allocates a new item prior to updating the hash table. Thus all GET requests occurring after the SET will correctly access the updated item. We have verified this mechanism by arbitrarily stalling GET request batches in GNoM-post while concurrently sending SET requests to update items conflicting with the GET requests, and ensuring the SET stalls until all conflicting GET requests complete.

An alternative method is to lazily delete items from the CPU Memcached value storage. Upon evicting an item from the GPU hash table, the SET request can add the evict operation to a queue on the CPU. At a later time, a separate CPU thread or SET request thread can clean up the items in this queue. The delay before garbage collecting the queue does not need to be long, since the only potential race condition candidates are those that accessed the GPU hash table prior to the corresponding SET request updating it, which should complete shortly following the SET request. As such, this delay could be set to a multiple of the worst case end-to-end GET request processing latency.

4.3 Experimental Methodology

This section presents our experimental methodology for evaluating GNoM, MemcachedGPU, and the modifications to the hash table design.

4.3.1 GNoM and MemcachedGPU

Unless stated otherwise, all of the hardware experiments in this chapter are run between a single Memcached server and client directly connected via two CAT6A Ethernet cables; one used for transmit and the other for receive. The main system configurations for the server and client systems are presented in Table 4.1. The GPUs evaluated in this chapter are shown in Table 4.2, all using CUDA 5.5.

Table 4.1: Server and client configurations.
                      Server                           Client
Linux kernel          3.11.10                          3.13.0
CPU                   Intel Core i7-4770K              AMD A10-5800K
Memory                16 GB                            16 GB
Network interface     Intel X520-T2 10Gbps 82599       Intel X520-T2 10Gbps 82599
                      (modified driver v3.18.7)        (driver v3.18.20)

Table 4.2: Server NVIDIA GPUs.
GPU                      Tesla K20c      Titan           GTX 750Ti
Architecture (28 nm)     Kepler          Kepler          Maxwell
TDP                      225 W           250 W           60 W
Cost (2015)              $2700           $1000           $150
# CUDA cores             2496            2688            640
Memory size (GDDR5)      5 GB            6 GB            2 GB
Peak SP throughput       3.52 TFLOPS     4.5 TFLOPS      1.3 TFLOPS
Core frequency           706 MHz         837 MHz         1020 MHz
Memory bandwidth         208 GB/s        288 GB/s        86.4 GB/s

The high-performance NVIDIA Tesla K20c is evaluated using GNoM, which supports GPUDirect, and the low-power NVIDIA GTX 750Ti is evaluated using NGD. All three GPUs are evaluated in the offline limit study (Section 4.4.5). While Chapter 3 evaluated integrated AMD GPUs, this chapter focuses on discrete NVIDIA GPUs. Aside from the usage of GPUDirect on the Tesla K20c, there is no fundamental reason why GNoM would not also be beneficial on lower-power integrated GPUs. Additionally, our experiments in Section 5.6 highlight the potential for low-power discrete GPUs to achieve high performance with GNoM, and hence an important direction for future work is to evaluate integrated GPUs.

MemcachedGPU was implemented on top of Memcached v1.4.15.
The hash table is configured as a 16-way set-associative hash table with 8.3 million entries assuming the maximum Memcached key-size. Note that this is only ∼46% of the maximum possible hash table size on the Tesla K20c given the 5GB global memory space and storage for the other GNoM and MemcachedGPU data structures. The hash table associativity was selected based on an offline hash table analysis (Section 4.4.1) and an empirical performance analysis on the GPU.

The client system generates Memcached requests through the Memaslap microbenchmark included in libMemcached v1.0.18 [4]. Memaslap is used to stress MemcachedGPU with varying key-value sizes at different request rates. As described in more detail below (Section 4.3.2), we evaluate the effectiveness of the hash table modifications in MemcachedGPU on more realistic workload traces with different types of request distributions in Section 4.4.1 (zipfian, latest, and uniform random distributions). The Memcached ASCII protocol is used with a minimum key size of 16 bytes (packet size of 96 bytes). The ASCII protocol places more stress on MemcachedGPU than the binary protocol, since it requires string processing and impacts the packet sizes. Larger Memcached value sizes impact both the CPU response packet fill rate and network response rate. However, we find that GNoM becomes network bound, not CPU bound, as value sizes increase. In our experiments, the larger value sizes result in larger packet sizes, which reduces the number of packets required to saturate the network bandwidth and hence lowers the processing requirements on MemcachedGPU. Thus, the smallest value size of 2 bytes is chosen to stress per-packet overheads. Similarly, the smallest Memcached key size of 16 bytes is chosen in any experiments aimed at stressing GNoM and MemcachedGPU.

The single client system is unable to send, receive, and process the minimum sized Memcached packets at 10 Gbps with the Memaslap microbenchmark. As such, we created a custom client Memcached stressmark using PF_RING zero-copy [125] for sending and receiving network requests at 10 Gbps, and replay traces of SET and GET requests generated from Memaslap. With this method, the request ordering and key distributions are maintained with those generated by Memaslap, but are replayed at a much higher throughput. The MemcachedGPU server is initially warmed up with 1.3 million key-value pairs through TCP SET requests. Then, we send GET requests over UDP. However, processing all of the response packets on the client system still limits our send rate to ∼6 Gbps. To overcome this, we used a technique similar to [106], which forcefully drops response packets at the client using hardware packet filters at the NIC to sample a subset of packets. Using this technique, the client sends all of the packets to the server at the specified send rate and the server still performs all of the required per-packet operations. The client NIC filter allows a subset of packets to flow back to the client for response processing, while the rest are dropped. The total number of received response packets is the sum of those received by the client and those dropped. Thus, our end-to-end latency experiments at high request rates measure a subset of the total response packets. However, because packets are processed in batches on the server, if we ensure that the subset of measured packets is distributed throughout a batch, they will be a good representation of all packets.
If we instead create a static set of Memcached requests to repeatedly send to the server (e.g., 512 different requests), we are able to send and receive at the full rate at the client, since this frees up processing cycles on the client that were previously used to prepare each send packet. We use this static request technique to measure packet drop rates at the client more accurately. All other experiments use the Memaslap generated traces, as described above.

Power is measured using the Watts Up? Pro ES plug load meter [164] and measures the total system wall power for all configurations. An issue with PCIe BAR memory allocation between the BIOS and NVIDIA driver on the server system at the time this study was conducted restricted the NVIDIA Tesla K20c and NVIDIA GTX 750Ti GPUs from being installed in isolation. We measured the idle power of the Tesla K20c (18W) and GTX 750Ti (6.5W) using the wall power meter and the nvidia-smi tool. This inactive GPU idle power was subtracted from the total system power when running experiments on the other GPU. For example, when running an experiment on the Tesla K20c, the GTX 750Ti's idle power of 6.5W was subtracted from the measured system power to calculate the total power. The GTX Titan did not have this issue and could be installed in isolation.

4.3.2 Hash-Sim (Hash Table Simulator)

To evaluate the impact of modifying the hash table structure, collision mechanism, and hash table eviction management in MemcachedGPU compared to the baseline Memcached hash table, we designed an offline hash table simulator, hash-sim. Hash-sim measures the hit rate for a trace of key-value GET and SET requests, which provides a platform for directly comparing different hash table and hash collision techniques. As in the baseline Memcached, a GET request miss triggers a corresponding SET request for that item in hash-sim. Hash-sim uses the same Bob Jenkins' lookup3 hash function [86] included in the baseline Memcached.

We use a modified version of the Yahoo! Cloud Serving Benchmark (YCSB) [34] provided by MemC3 [52] to generate three Memcached request traces with different item access distributions: Zipfian, Latest, and Uniform. In the Zipfian (zipf) distribution, specific items are accessed much more frequently than other items. The Latest distribution is similar to the zipf distribution, except that the most recently inserted items are more popular. With Zipfian, the popularity of items is not affected by the insertion of new items. Finally, the Uniform distribution selects items at uniform random.

Hash-sim is single threaded, so no concurrency control is required. As such, the purpose of hash-sim is not to measure the performance or scalability of the different hash table techniques in terms of lookups per unit time, but instead to measure the ability of the hash table to maximize the hit rate under different request distributions, hash table sizes, and eviction techniques. The hash table performance is considered independently from hash-sim.

Lastly, MemcachedGPU implements a statically sized hash table to avoid dynamic memory allocation or resizing the hash table on the GPU. This differs from the baseline Memcached hash table with hash chaining, as well as other hashing techniques, which expand the hash table when hash chains become too long or a free hash table entry cannot be located when inserting an item. As a result, evictions occur in the hash table once the maximum number of items has been stored or a free entry is unavailable.
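A simplified version of the measurement loop in hash-sim might look as follows. The HashTable interface and Request type shown here are assumptions for illustration, not the simulator's actual code.

    #include <string>
    #include <vector>

    struct Request { bool is_get; std::string key; };

    // Warm up with the load trace (all SETs), then replay the transaction trace
    // and report the GET hit rate. As in Memcached, a GET miss triggers a SET.
    template <typename HashTable>
    double measure_hit_rate(HashTable &ht, const std::vector<Request> &load,
                            const std::vector<Request> &txns) {
        for (const Request &r : load) ht.set(r.key);
        long gets = 0, hits = 0;
        for (const Request &r : txns) {
            if (r.is_get) {
                ++gets;
                if (ht.get(r.key)) ++hits;   // hit
                else ht.set(r.key);          // miss: insert (may evict an LRU item)
            } else {
                ht.set(r.key);
            }
        }
        return gets ? (double)hits / gets : 0.0;
    }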
As will be described further in Section 4.4.1, we evaluate global and local least recently used (LRU) techniques for managing item evictions. Global LRU considers all items in the hash table for eviction. Local LRU considers a subset of items in the hash table for eviction depending on the hash table structure and collision resolution technique.

4.4 Experimental Results

This section presents our evaluation of the Memcached hash table modifications, MemcachedGPU, GNoM, the non-GPUDirect (NGD) version of GNoM, and an offline limit study of GNoM and MemcachedGPU. Different experiments are performed in simulation environments or on real hardware, as described in Section 4.3. If not explicitly mentioned, the experiment is evaluated on hardware.

4.4.1 Hash Table Evaluation

We first evaluate the impact of our modifications to the Memcached hash table on the GET request hit rate. The hash table modifications are required to improve the performance and scalability on GPU SIMD architectures. Every miss in the Memcached hash table results in an expensive access to the backing database and a potential SET request to update the hash table with the missing entry.4 Thus, it is important to minimize the miss rate relative to the baseline Memcached hash table with hash chaining. Using hash-sim (Section 4.3.2), we evaluate and compare the hit rate for multiple different hash table structures and techniques for handling collisions, which are more GPU-friendly than the baseline hash chaining technique. In addition to the hit-rate, each technique presents multiple trade-offs, for example, in the impact of the load factor on performance, the requirement for dynamic memory allocation, the worst-case number of accesses for GET requests, the worst-case number of operations required to store an item, and concurrency control, which are important aspects to consider. Furthermore, some techniques may be better suited for the GPU's SIMD architecture than others.
4 This is dependent on the application's implementation, as the application is responsible for managing when to store items in the hash table.

Specifically, we evaluate four different hash table techniques and compare them with hash chaining:

• Hash chaining (HC): The default hash table technique used in the baseline Memcached. Keys are hashed and the item is inserted directly into the corresponding hash table entry. On a hash table collision, a new entry is dynamically allocated and linked into the hash table entry via a linked list. When searching for an item, the entire (potentially long) chain must be searched for the corresponding entry. A global least-recently used (LRU) queue is maintained to handle hash table evictions, such that all items in the hash table are considered for eviction based on their global ordering of usage. In our implementation, hash chaining is always able to store the maximum number of items in the hash table, which is equal to the size of the hash table, regardless of the number of collisions on a given entry.

• Set-associative (SA): The fixed-size hash table technique used in MemcachedGPU (similar to that proposed in [106] for CPUs and [26] for FPGAs). This technique is equivalent to a regular set-associative cache, which results in fast look-ups and insertions. The hash table is broken down into multiple hash table sets, which consist of a fixed number of hash table entries (the set size). Each hash value now corresponds to multiple hash table entries, which need to be searched for a matching or free entry.
Only the fixed set size needs to be traversed when searching for an item. Items are inserted into an empty entry in the corresponding hash set. If no empty entries are available, an item must be evicted. A local LRU is maintained per hash table set, such that an eviction only selects from the items belonging to the hash table set corresponding to the hash value for a given key.

• Hopscotch (HS) [68]: Each entry in the hopscotch hash table has a corresponding hopscotch bucket, which consists of H consecutive entries. As such, each hash table entry/group overlaps with H-1 other hopscotch groups. When searching for an item, only the fixed hopscotch group size must be searched. When trying to insert an element into the hash table at entry X, similar to linear probing, consecutive entries are searched from X until a free entry is found at index Y. If the distance between the free entry at Y and X is less than H-1, the item is directly inserted into that entry in the hash table; otherwise the hash table must be reorganized to make room for the new value within X and X+(H-1) (the hopscotch group size). This is achieved by performing a linear search to find the first free entry and then iteratively moving hash table entries to the free location in their hopscotch group in a reverse direction from X+Y to X until we can insert the new item in its hopscotch group. If no free entries are found or the entries cannot be reorganized, the hash table must be resized or an item must be evicted. In our implementation, we limit the search size to 512 entries such that for any given insertion, the search is limited between X and X+512 for a free entry. If no free entries are found, then the LRU item in the hopscotch group for item X (X to X+H) is evicted and the item is directly inserted into this entry.

• Strided linear probing (SLP): Linear probing is similar to hash chaining, except that the chain consists of the consecutive hash table entries instead of dynamically allocated entries. In the strided implementation, each hash table entry contains N additional entries with some stride S. When inserting an item at entry X, we search from X to X+(N×S) to find a free entry. For example, if an item maps to the entry at X, the next entry is at X+S, and the last entry in the group is at X+(N×S). Similar to hopscotch hashing, each group overlaps with multiple different groups. Similar to the set-associative hash table, the hash table is not reorganized when no free entries are found. Instead, the LRU item belonging to the group at entry X is evicted and the new item is inserted. When searching for an item, only the fixed-size group needs to be searched.

• Cuckoo hashing (CU) [141]: In Cuckoo hashing, each item can map to two different hash table entries. Two hash functions are used to select these entries. We follow the same methodology as MemC3 [52], which generates the first hash value using the baseline Memcached's Bob Jenkins hash function, and performs a set of operations on the first hash index, using the hash constant from MurmurHash2, to generate the second hash index. When searching for an item, only the two entries need to be accessed, which limits the number of hash table accesses for GET requests. However, when inserting an item into the hash table, if the two entries are not empty, the hash table needs to be reorganized.
This is achieved by recursively traversing each item's alternative hash table entry until a free entry is found, and then moving the items to the alternative hash entry in a reverse direction until one of the two entries for the current item is free. Similar to our hopscotch hashing implementation, we limit this search to 512 entries. If no free entries are found within 512 searches, we evict the LRU item within this chain of items and reorganize the hash table accordingly. MemC3 [52] includes an additional optimization, which combines Cuckoo hashing with the set-associative hash table. Here, each hash table entry consists of four entries. We did not evaluate this optimization, but it should be noted that the performance measured in this section will be lower than that achievable in MemC3's implementation.

For each hash table implementation, we fix the maximum number of key-value items to store, independent of the size of the corresponding values for each key. Hash chaining is always able to store the maximum number of elements as a new entry is dynamically allocated on a hash collision. Evictions only occur when the maximum number of items have been stored in the hash table. For all other hash table techniques, evictions may occur even with fewer elements than the maximum due to the conflict resolution techniques. This results in additional evictions compared to hash chaining, which need to be minimized to reduce any extra expensive accesses to the back-end databases. MICA [106] also proposed an enhancement to avoid hash table evictions by including overflow bins, which are used to store items when the main sets overflow. An error is returned if no such overflow bins are available. MICA refers to hash tables with evictions as lossy and hash tables without evictions as lossless. In this work, we only consider lossy versions of hash tables due to the GPU's limited DRAM capacity and the expectation that the hash table can be sized large enough to capture the application's temporal locality.

For each workload distribution, we generate a runtime trace of 10 million key-value pairs with 95% GET requests and 5% SET requests. The hash tables are first warmed up using a load trace of all SET requests, and then the hit-rate is measured on a separate transaction trace. The hash tables are configured as follows: the hash chaining (HC) hash table can store any number of items in each chain, up to the maximum size of the hash table; the hopscotch (HS) hash table has a hopscotch group of 16 entries and a maximum search size of 512 entries for the linear probe; the strided linear probe (SLP) hash table has a group of 16 entries with a stride of 4 entries; the set-associative (SA) hash table has a set size of 16 entries; finally, the cuckoo (CU) hash table has two entries per item (corresponding to the two hash functions) and a maximum depth of 512 for the recursive reorganization.

Figure 4.7a, Figure 4.7b, and Figure 4.7c measure the hit-rate for the five hash tables under the Zipfian (Zipf), Latest (Lat), and Uniform (Uni) distributions respectively. The x-axis shows different hash table capacities from 2 million entries to 16 million entries.

Figure 4.7a: Zipfian: Comparing the hit rate for different hash table techniques and sizes under the Zipfian request distribution. The request trace working size is 10 million entries.
With a request trace working size of 10 million items, HC is always able to achieve a 100% hit-rate when the hash table is larger than 10 million entries, equivalent to a fully-associative cache. However, this may result in long hash chains based on the collision rate. There are three notable points from this experiment. First, the hash tables perform much better on the Zipf and Lat distributions than Uni. Since the hash tables implement an LRU eviction policy, the most frequently accessed items rarely become the LRU item. Similarly, Lat performs better than Zipf, as the most recently added item becomes the most frequently accessed item. Uni, however, suffers from low hit rates with smaller hash table sizes, as there is no temporal locality in the trace.

Figure 4.7b: Latest: Comparing the hit rate for different hash table techniques and sizes under the Latest request distribution. The request trace working size is 10 million entries.

Second, all hash tables achieve similar hit rates to HC. For all hash table sizes less than the working set, HS, SLP, and SA achieve over 99% of the hit rate of HC on average, while CU averages over 97% of the HC hit rate. The difference between the techniques is most noticeable at a hash table size close to the working set size (e.g., 8 million entries compared to 10 million entries). As noted above, CU can benefit from adding a small set (e.g., four entries) to each hash table entry; however, we did not evaluate this enhancement. In our experiments, HS was the best performing alternative. Similar to SLP and SA, HS contains multiple possible entries per hash value (16). However, unlike these techniques, HS is also able to reorganize the hash table when the local group becomes full, hence avoiding an eviction on the collision. CU is also able to reorganize the hash table, but we found that limiting the potential candidate entries for each item to two, instead of 16, was a limiting factor. While there is some loss in hit rate relative to HC, these results highlight that for the dataset evaluated, there is little benefit to maintaining global LRU and global evictions over the local counterparts. As real workloads tend to follow a Zipfian-like distribution [34], where some items are accessed very frequently, while most others are accessed infrequently, we believe that this property will hold for other workloads.

Figure 4.7c: Uniform Random: Comparing the hit rate for different hash table techniques and sizes under the Uniform Random request distribution. The request trace working size is 10 million entries.

Lastly, while SA is not the best performing alternative, it is able to achieve a hit rate within 96.6% - 99.99% of HC across the different distributions and hash table sizes. Furthermore, SA contains two main benefits over the alternatives. First, SA simplifies the locking mechanism required under concurrent accesses (Section 4.2.2) compared to the other hash table techniques. In SA, each set contains a single lock. As such, only a single lock needs to be acquired when accessing all elements in a set.
This is possible since there is a one-to-many mapping between each hash table set and its entries - each entry belongs to only a single set, while a single lock can lock all entries within that set. However, hash table entries in HS, SLP, and CU belong to multiple different overlapping groups, requiring acquiring multiple locks when accessing the group of entries corresponding to a given item. This increases the chance of threads in a warp blocking other threads, leading to reduced SIMD efficiency and performance. Additionally, the storage requirements for the locks are reduced in SA, since there are fewer sets than individual entries or overlapping groups. Second, SA achieves fast insertions even under high collision rates. HS and CU require searching and reorganizing multiple entries in an attempt to avoid evicting an item from the hash table. While this can improve the storage efficiency under high collision rates, it comes at the cost of increased complexity and insertion times. From Figure 4.7a, Figure 4.7b, and Figure 4.7c, we see that the improvements in the hit rate of these techniques are small compared to SA. As a result of the above experiments, we selected the set-associative hash table for MemcachedGPU.

Next, we evaluate the miss rate of SA with different set sizes and hash table sizes under the Zipf distribution. The same request trace used in the previous experiment, with 10 million requests containing 95% GET and 5% SET requests, is used to evaluate the miss rate. The results are shown in Figure 4.8 and are compared to HC, which is equivalent to a fully associative hash table.

Figure 4.8: Miss-rate versus hash table associativity and size compared to hash chaining for a request trace with a working set of 10 million requests following the Zipf distribution.

For example, with a maximum hash table size of 8 million entries, SA has a 21.2% miss-rate with 1-way (direct mapped) and 10.4% miss-rate with 16-ways. HC achieves a 0% miss-rate when the hash table size is larger than the request trace (dotted line at 10 million entries) since it is able to allocate a new entry for all new key-value pairs. At smaller hash table sizes, none of the configurations are able to effectively capture the locality in the request trace, resulting in comparable miss-rates. As the hash table size increases, increasing the associativity decreases the miss-rate. At 16-ways for all sizes, SA achieves a minimum of 95.4% of the HC hit-rate for the Zipf distribution. The other distributions follow similar trends with different absolute miss-rates. However, increasing the associativity also increases the worst-case number of entries to search on an access to the hash table. From experimentation, we empirically find that an associativity of 16-ways provides a good balance of storage efficiency and performance.
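To illustrate the simplified locking that motivates this choice, the following is a hypothetical device-code sketch (not the actual MemcachedGPU code; the entry layout, lock encoding, and names are assumptions) of a GET lookup in a 16-way set guarded by a single per-set reader/writer lock word, where a SET handler would take the writer side exclusively:

// Hypothetical set-associative GET lookup with one reader/writer lock per set.
#define WAYS 16
#define WRITER_BIT 0x80000000u   // high bit reserved for an exclusive (SET) holder

struct SetEntry {                 // assumed entry layout
    unsigned int key_hash;        // hash of the full key (full key stored elsewhere)
    unsigned int value_idx;       // index into a separate value store
    unsigned int valid;
};

__device__ int sa_get(SetEntry *table, unsigned int *set_locks,
                      unsigned int num_sets, unsigned int hash) {
    unsigned int set = hash % num_sets;
    SetEntry *entries = &table[set * WAYS];

    // Acquire the single reader lock for the whole 16-entry set. Readers do not
    // block each other; they only back off while a SET handler (in a different
    // warp) holds the writer bit.
    while (true) {
        unsigned int prev = atomicAdd(&set_locks[set], 1u);
        if ((prev & WRITER_BIT) == 0u)
            break;                          // no writer: read access granted
        atomicSub(&set_locks[set], 1u);     // writer active: back off and retry
    }

    int value_idx = -1;                     // -1 indicates a miss
    for (int way = 0; way < WAYS; way++) {
        if (entries[way].valid && entries[way].key_hash == hash) {
            // The full key comparison performed by MemcachedGPU is omitted here.
            value_idx = entries[way].value_idx;
            break;
        }
    }

    atomicSub(&set_locks[set], 1u);         // release the reader lock
    return value_idx;
}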
4.4.2 Impact of Linux Kernel Bypass

MemcachedGPU uses GNoM on receive and PF_RING on transmit to bypass both the Linux kernel and libevent (used by default for Memcached). To understand the impact of this optimization in isolation, we evaluate PF_RING for both receive and transmit in the baseline Memcached v1.5.20 running only on the CPU. Others have attained higher throughputs for Memcached on CPU-only systems using many other optimizations [52, 105, 106, 177]. We compare MemcachedGPU against published results in Section 4.4.6.

Figure 4.9: Impact of Linux kernel bypass for GET requests vs. the baseline Memcached v1.5.20.

Figure 4.9 measures the improvement in throughput and energy-efficiency of bypassing the Linux kernel over the baseline Memcached using PF_RING. No other optimizations were applied. Removing the Linux network stack results in a minimum improvement in both throughput and energy-efficiency over the baseline of 1.5X at 4 threads, and a maximum of 1.5 MRPS and 12.3 KRPS/W (1.9X) at 2 threads. Adding threads to the kernel bypass configuration has a larger increase in energy consumption relative to the increase in throughput, resulting in the continuously decreasing energy-efficiency. Increasing the number of threads beyond the number of cores (4) results in a drop in throughput and energy-efficiency for both configurations. These results highlight that while bypassing the Linux kernel can provide a sizeable increase in performance and energy-efficiency, additional optimizations to the core Memcached implementation are required to continue improving efficiency and scalability.

4.4.3 MemcachedGPU Evaluation

Next, we evaluate the full end-to-end MemcachedGPU on the high-performance NVIDIA Tesla K20c (Tesla) using GNoM with GPUDirect, and on the low-power NVIDIA GTX 750Ti (Maxwell) using the non-GPUDirect (NGD) framework (Section 4.1.2). Throughput is measured in millions of requests per second (MRPS) and energy-efficiency is measured in thousands of requests per second per Watt (KRPS/W). For latency experiments, the 8 byte Memcached header is modified to contain the client's send timestamp and is used to measure the request's complete round trip time (RTT).

Throughput: Memcached typically uses UDP for GET requests and tolerates dropped packets by treating them as misses in the caching layer. However, excessive packet dropping mitigates the benefits of using Memcached. We measure the packet drop rate at the server and client for packet traces of 500 million requests with equal length keys at peak throughputs, averaging over three runs. The impact of having different key lengths on SIMD efficiency is discussed later in this section.

Table 4.3: GET request throughput and drop rate at 10 GbE.
Key Size                     | 16 B       | 32 B       | 64 B      | 128 B
Tesla drop rate server       | 0.002%     | 0.004%     | 0.003%    | 0.006%
Tesla drop rate client       | 0.428%     | 0.033%     | 0.043%    | 0.053%
Tesla MRPS/Gbps              | 12.92/9.92 | 11.14/9.98 | 8.66/9.98 | 6.01/10
Maxwell-NGD drop rate server | 0.47%      | 0.13%      | 0.05%     | 0.02%
Maxwell-NGD MRPS/Gbps        | 12.86/9.87 | 11.06/9.91 | 8.68/10   | 6.01/10

As shown in Table 4.3, MemcachedGPU is able to process packets near 10 GbE line-rate for any Memcached request size with server packet drop rates < 0.006% (GNoM) and < 0.5% (NGD). The client drop rates are measured using the static trace described in Section 4.3 with no packet loss at the server. Increasing to the full 13 MRPS at 16B keys increases the server drop rate due to the latency to process the packets and recycle the limited number of GRXBs to the NIC for new RX packets. As a result, there are no available GRXBs when a new packet arrives and the NIC drops the packet accordingly.
In Section 4.4.5 we evaluate the peak throughputs of MemcachedGPU assuming an unlimited number of GRXBs through an offline analysis.

RTT Latency: For many scale-out workloads, such as Memcached, the longest latency tail request dictates the total latency of the task [41]. While the GPU is a throughput-oriented accelerator, we find that it can provide reasonably low latencies under heavy throughput. Figure 4.10 measures the mean and 95-percentile (p95) client-visible RTT versus request throughput for 512 requests per batch on the Tesla using GNoM and NGD, and the Maxwell using NGD. Recall that NGD must copy the packet from the NIC to CPU memory and then from CPU memory to GPU memory. As expected, GNoM has lower latencies than NGD (55-94% at p95 on the Tesla) by reducing the number of memory copies on packet RX with GPUDirect. The latency increases as the throughput approaches the 10 GbE line-rate, with the p95 latencies approaching 1.1ms with GNoM and 1.8ms with NGD. We also evaluated a smaller batch size of 256 requests on GNoM and found that it provided mean latencies between 83-92% of 512 requests per batch when less than 10 MRPS, while limiting peak throughput and slightly increasing the mean latency by ∼2% at 12 MRPS. Although the smaller request batch size reduces the batching and potentially the kernel processing latency, it also increases the kernel launching and post processing overhead as these tasks are amortized over fewer packets. At lower throughputs (< 4 MRPS), we can see the effects of the batching delay on the p95 RTT (Figure 4.10b). For example, at 2 MRPS with a batch size of 512 requests, the average batching delay per request is already 128µs, compared to 32µs at 8 MRPS. While not shown here, a simple timeout mechanism can be used to reduce the impact of batching at low request rates by launching partially full batches of requests. The GNoM GPU kernel can either reduce the size of the kernel or use conditional operations to mask off unused threads accordingly.

Figure 4.10: Mean and 95-percentile round trip time (RTT) latency versus throughput for the Tesla GPU with GNoM and NGD, and Maxwell with NGD. (a) Mean RTT; (b) 95-percentile RTT.
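The batching delay values quoted above follow directly from the batch size and request rate: assuming requests arrive at a steady rate R and a kernel is launched only once a batch of B requests has filled, a request waits on average for roughly half of a batch to accumulate,

\bar{t}_{\text{batch}} \approx \frac{B}{2R}, \qquad \frac{512}{2 \times 2\,\text{MRPS}} = 128\,\mu\text{s}, \qquad \frac{512}{2 \times 8\,\text{MRPS}} = 32\,\mu\text{s}.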
Energy-Efficiency: Figure 4.11 plots the average power consumption, both the full system power and GPU power, and the full system energy-efficiency of MemcachedGPU and GNoM on the Tesla K20c at increasing request rates. The total system energy-efficiency for the Maxwell system is also shown at the peak throughput with two GNoM-post threads.

Figure 4.11: Total system and GPU power (left axis) and total system energy-efficiency (right axis) versus throughput for MemcachedGPU and GNoM on the NVIDIA Tesla K20c. The total system energy-efficiency for the Maxwell system is also shown at the peak throughput with two GNoM-post threads. The number of GNoM-post threads are shown above the graph, with 1 thread for 1.1 to 7.6 MRPS, 2 threads for 10.1 to 12.8 MRPS, and 4 threads for 12.9 MRPS.

The Tesla K20c power increases by less than 35% when increasing the throughput by ∼13X (1 to 13 MRPS), leading to the steady increase in energy-efficiency. The jumps in system power at 10.1 and 12.9 MRPS are caused by adding GNoM-post threads for post processing and recycling the GRXBs fast enough to maintain low packet drop rates at the higher throughputs. However, increasing the number of GNoM-post threads from two to four decreases energy-efficiency as the system power is increased by 9% while the throughput has a much smaller improvement. At peak throughputs, the low-power GTX 750Ti using NGD consumes 73% of the total system power consumed by the Tesla K20c using GNoM (151.4 W at 84.8 KRPS/W). These results highlight the importance of the CPU-side framework in contemporary systems for handling higher request rates, as well as the negative impact that including additional CPU resources has on energy-efficiency. Furthermore, placing increased load on the CPU reduces the potential for CPU workload consolidation and reduces the peak GPU performance, since the CPU may not be able to keep up with the requirements from multiple tasks (Chapter 5).

The Tesla K20c consumes roughly one third of the total system power. Note that the peak GPU power of 71W is less than 32% of the K20c's TDP (225W), suggesting low utilization of the total GPU resources. This also contributes to why the lower-power, lower-performance GTX 750Ti is able to handle the high request rates. The Tesla K20c system has an idle power of 84W without any GPUs. Thus, GNoM-host consumes roughly 15%, 25%, and 33% of the total system power when using one, two, or four GNoM-post threads respectively. Much of this is an artifact of GPUs being offload accelerators, which rely on the CPU to communicate with the outside world. This leaves large opportunities to further improve the energy-efficiency of GNoM through additional hardware I/O and system software support for the GPU.

Branch Divergence: Next, we evaluate the impact of branch divergence on performance in MemcachedGPU, which stems from each GPU thread handling a separate GET request. For example, differences in key lengths, potential hits on different indices in the hash table set, or misses in the hash table can all cause GPU threads to perform a different number of iterations or execute different blocks of code at a given time. Each of these scenarios reduces the SIMD efficiency and consequently performance. Figure 4.12 plots the average peak throughput as a fraction of the theoretical peak throughput for a given key length distribution. We find that the throughput performs within 99% of the theoretical peak, regardless of the key distribution.

Figure 4.12: Impact of varying the key length mixture and hit-rate on round trip time (RTT) latency (left axis) and throughput (right axis) for MemcachedGPU and GNoM on the NVIDIA Tesla K20c. RTT latency is measured at 4 MRPS. Throughput is shown as the average fraction of peak throughput (at 10 Gbps) obtained for a given key distribution. The key distributions are broken down into four sizes (16B, 32B, 64B, and 128B) and the labels indicate the percentage of keys with the corresponding length.
That is, even when introducing branch divergence, MemcachedGPU becomes network bound before compute bound.

Figure 4.12 also plots the average RTT for GET requests at 4 MRPS under different distributions of key lengths and at 100% and 85% hit-rates. Note that this experiment still consists of 100% GET requests. The results match the intuition that because there is no sorting of key lengths to batches, the latency should fall somewhere between the largest and smallest key lengths in the distribution. For example, 50% 16 byte and 50% 32 byte keys have an average RTT between 100% 16 byte and 100% 32 byte key distributions. If there are large variations in key lengths and tight limits on RTT, the system may benefit from a pre-sorting phase by key length such that each request batch contains similar length keys. This could help reduce RTT for smaller requests; however, the maximum throughput is still limited by the network.

Typical Facebook Memcached deployments have hit-rates between 80-99% [16]. Figure 4.12 also measures the impact on RTT under 85% and 100% hit-rates. As can be seen, there is little variation between average RTT with different hit-rates. While reducing the hit-rate forces more threads to traverse the entire hash table set (16 entries), the traversal requires a similar amount of work compared to performing the key comparison on a potential hit.

GETs and SETs: While the main focus of MemcachedGPU is on accelerating GET requests, we also evaluate the throughput of the current naive SET request handler and its impact on concurrent GET requests. SET requests are sent over TCP for the same packet trace as the GETs to stress conflicting locks and update evictions. The maximum SET request throughput is currently limited to 32 KRPS in MemcachedGPU, ∼32% of the baseline. This is a result of the naive SET handler described in Section 4.2.1, which serializes SET requests. However, this is not a fundamental limitation of MemcachedGPU as, similar to GET requests, SET requests could also be batched together on the CPU prior to updating the GPU hash table. Unlike GET requests, however, each SET request would need to be serialized or processed per GPU warp instead of per thread to avoid potential deadlocks on the exclusive locks (Section 4.2.2). Improving SET support is left to future work.

Table 4.4: Concurrent GETs and SETs on the Tesla K20c.
GET MRPS (% peak) | 7 (54)    | 8.8 (68)  | 9.7 (74) | 10.6 (82) | 11.7 (90)
SET KRPS (% peak) | 21.1 (66) | 18.3 (57) | 18 (56)  | 16.7 (52) | 15.7 (49)
SET:GET Ratio     | 0.3%      | 0.21%     | 0.19%    | 0.16%     | 0.13%
Server Drop Rate  | 0%        | 0.26%     | 3.1%     | 7.5%      | 8.8%

Table 4.4 presents the GET and SET request throughputs, resulting SET:GET ratio, and server packet drop rate of MemcachedGPU on the Tesla K20c. As the GET request rate increases, the SET rate drops due to contention for GPU resources. The low peak SET request throughput limits the SET:GET ratio to <0.5% for higher GET request rates. The average server packet drop rate approaches 10% when increasing GET throughput to 90% of the peak, which limits the effectiveness of MemcachedGPU. GET requests at ∼9 MRPS maintain comparable drop rates to the peak throughput, while also handling 0.26% SET requests.
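A rough sketch of the per-warp SET processing suggested above (hypothetical code, not the MemcachedGPU implementation; it reuses the assumed SetEntry layout and WRITER_BIT lock encoding from the earlier GET sketch) is shown below. A single lane of each warp acquires the exclusive per-set lock and applies the update on behalf of the whole warp, so no two lanes in the same warp ever contend for locks:

// Hypothetical per-warp SET handler: one SET request per warp, applied by
// lane 0 while the remaining lanes wait.
__device__ void sa_set_per_warp(SetEntry *table, unsigned int *set_locks,
                                unsigned int num_sets, unsigned int hash,
                                unsigned int new_value_idx) {
    unsigned int set = hash % num_sets;
    int lane = threadIdx.x & 31;        // assumes a warp size of 32

    if (lane == 0) {
        // Exclusive lock: wait until there are no readers or writers. Any
        // current holder is in another warp, so it can make progress.
        while (atomicCAS(&set_locks[set], 0u, WRITER_BIT) != 0u)
            ;

        SetEntry *entries = &table[set * WAYS];
        int victim = 0;
        for (int way = 0; way < WAYS; way++) {
            if (!entries[way].valid || entries[way].key_hash == hash) {
                victim = way;            // free entry or existing key
                break;
            }
            // A real implementation would track per-set LRU to pick a victim.
        }
        entries[victim].key_hash  = hash;
        entries[victim].value_idx = new_value_idx;
        entries[victim].valid     = 1;

        __threadfence();                      // publish the update
        atomicExch(&set_locks[set], 0u);      // release the exclusive lock
    }
    __syncwarp();   // lanes wait for lane 0 (CUDA 9+ warp synchronization)
}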
4.4.4 Workload Consolidation on GPUs

Workload consolidation, running multiple workloads concurrently on the same hardware, improves datacenter utilization and efficiency [20]. While specialized hardware accelerators, such as ASICs or FPGAs, can provide high efficiency for single applications, they may reduce the flexibility gained by general-purpose accelerators, such as GPUs. For example, long reconfiguration times of re-programmable hardware, milliseconds to seconds [142, 147], may offset the benefits gained by the accelerator when switching between applications. In this section, we evaluate the potential for workload consolidation on GPUs, which may provide advantages over other hardware accelerators in the datacenter.

However, at the time of this study, the evaluated GPUs do not support preemptive [163] or spatial [2] multitasking for GPU computing, although they do support preemptive multitasking for graphics [132]. When multiple CUDA applications run concurrently, their individual CUDA kernel launches contend for access to the GPU and, depending on resource constraints, are granted access on a first-come, first-served basis by the NVIDIA driver. Large CUDA kernels with many CTAs may consume all of the GPU resources, blocking other CUDA kernels from running until completed. However, we can potentially exploit this property through a simple approach to enable finer grained multitasking by splitting a single CUDA kernel into multiple kernels with fewer CTAs.

We study a hypothetical low-priority background task (BGT) that performs a simple vector multiplication in global memory requiring a total of 256 CTAs with 1024 threads each to complete. The low-priority BGT is divided into many smaller short running kernel launches, which can be interleaved with MemcachedGPU processing. This creates a two-level, software/hardware CTA scheduler. For example, if we reduce the background task to 16 CTAs per kernel, we require 16 separate kernel launches to complete all 256 CTAs in the task (16 CTAs / kernel launch × 16 kernel launches = 256 CTAs).

Figure 4.13: Client RTT (avg. 256 request window) during BGT execution for an increasing number of fine-grained kernel launches.

We run MemcachedGPU using GNoM on the Tesla K20c at a lower GET request throughput of 4 MRPS using 16 byte keys. After some time, the BGT is launched concurrently on the same GPU, varying the number of CTAs per kernel launch. Figure 4.13 measures the average client RTT during the BGT execution. The average RTT is computed on a window of 256 GET request responses at the client. Without the BGT, MemcachedGPU has an average RTT < 300µs. With 256 CTAs (1 background task kernel launch), the BGT consumes the GPU resources causing a large disruption in the Memcached RTT. Even after the BGT completes with 256 CTAs (around 32ms), MemcachedGPU takes over 20ms to return back to the original average RTT. As the number of CTAs per kernel is reduced, the impact of the BGT on MemcachedGPU reduces significantly. For example, at 16 CTAs per kernel, the RTT experiences a short disruption for ∼2.4ms during the initial BGT kernel launch with a maximum average RTT of ∼1.5ms during this time, and then returns back to under 300µs while the BGT is executing. The other BGT configurations show multiple spikes in the RTT corresponding to the launches of the subsequent BGT kernels.
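The software half of this two-level CTA scheduler amounts to a simple host-side loop. The sketch below is purely illustrative (hypothetical names; it assumes the vector-multiply kernel takes a CTA offset so each sub-launch works on its own slice, and that a separate, possibly low-priority, CUDA stream is used for the BGT):

// Hypothetical host-side "software CTA scheduler" for the background task.
#include <cuda_runtime.h>

__global__ void vec_mult(const float *a, const float *b, float *c,
                         int cta_offset) {
    int i = (blockIdx.x + cta_offset) * blockDim.x + threadIdx.x;
    c[i] = a[i] * b[i];
}

void run_background_task(const float *a, const float *b, float *c,
                         cudaStream_t bgt_stream) {
    const int total_ctas   = 256;    // full background task
    const int ctas_per_krn = 16;     // smaller value => finer interleaving
    const int cta_size     = 1024;

    for (int offset = 0; offset < total_ctas; offset += ctas_per_krn) {
        // Each sub-launch covers ctas_per_krn CTAs of the original grid; the
        // hardware CTA scheduler still schedules CTAs within each sub-launch.
        vec_mult<<<ctas_per_krn, cta_size, 0, bgt_stream>>>(a, b, c, offset);
    }
    cudaStreamSynchronize(bgt_stream);
}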
It is interesting to note that 4 BGT launches (64 CTAs per kernel) results in a sawtooth RTT instead of the sharp spikes in RTT seen in the other configurations; however, we did not investigate the cause of this further.

Figure 4.14: Impact on BGT execution time with an increasing number of kernel launches and max client RTT during BGT execution.

While decreasing the size of BGT kernel launches reduces the impact on MemcachedGPU's RTT, increasing the number of BGT kernel launches also increases the BGT execution time. Figure 4.14 measures the BGT execution time with and without MemcachedGPU, as well as the maximum average RTT seen by MemcachedGPU during the BGT execution. At 4 MRPS, MemcachedGPU has very little impact on the BGT execution time due to its low resource utilization and small kernel launches. As the number of BGT kernels increases (more, smaller BGT kernel launches), the execution time also increases due to the introduction of the software CTA scheduler and contention with competing kernel launches. However, the impact on MemcachedGPU RTT decreases much faster. At 16 CTAs per kernel, the BGT execution time is increased by ∼50% versus 256 CTAs, while the MemcachedGPU RTT is reduced by over 18×. Allowing for an increase in the lower-priority BGT completion time, GNoM is able to provide reasonable QoS to MemcachedGPU while running other applications on the same GPU.

4.4.5 MemcachedGPU Offline Limit Study

In this section, we evaluate an offline, in-memory framework that reads network request traces directly from CPU memory to evaluate the peak performance and efficiency of GNoM and MemcachedGPU, independent of the network. The same GNoM framework described in Section 4.1.2 is used to launch the GPU kernels (GNoM-pre), perform GPU UDP and GET request processing (GNoM-dev), and populate dummy response packets upon kernel completion (GNoM-post). However, unlike the actual GNoM framework, which uses GPUDirect to transfer packets from the NIC to the GPU one at a time as they arrive, the offline framework reads packets from CPU memory and bulk transfers request batches to the GPU across the PCIe bus. The same packet trace used in Section 4.4.3, with the minimum key size of 16 bytes to stress GNoM, is used in the offline evaluation. We also evaluate a simple Network Only, ping-like GPU kernel, which only performs the UDP network processing operations (no Memcached processing). The network only kernel also uses the same 96B packets. This experiment highlights the peak performance of GNoM.

Figure 4.15: Offline GNoM throughput - 16B keys, 96B packets.
Figure 4.16: Offline GNoM processing latency - 16B keys, 96B packets.
Figure 4.17: Offline GNoM energy-efficiency - 16B keys, 96B packets.

Figure 4.15, Figure 4.16, and Figure 4.17 present the offline throughput, latency, and energy-efficiency of MemcachedGPU for the three GPUs in Table 4.2 respectively. As shown in Figure 4.15, each GPU achieves over 27 MRPS (∼21.5 Gbps), suggesting that the GPUs are capable of handling over 2× the request throughput measured in the online evaluations. The network only kernel is able to further increase this up to 39 MRPS on the GTX 750Ti. Assuming the PCIe bus is not a bottleneck, achieving this high throughput would require additional 10 GbE NICs or higher throughput NICs, and removing the limitation on the amount of pin-able GPU memory to allocate more GRXBs for the increased request rate.
The latency in Figure 4.16 measures the time prior to copying the packets from the CPU to GPU and after populating the dummy response packets at peak throughputs. The network only measurements highlight the latency contributed by the GNoM framework and the network packet parsing and response packet generation code, which accounts for roughly 50% of the total processing latency. An interesting result of this study was that the low-power GTX 750Ti reduced the average MemcachedGPU batch latency compared to the Tesla K20c by ∼25%, while also slightly improving peak throughput. This improvement can be attributed to many of the different architectural optimizations in Maxwell over Kepler [136] and to the properties of the MemcachedGPU application in a GPU environment. MemcachedGPU is a memory bound application as there are many memory operations with little processing per packet. While the Tesla K20c has higher memory bandwidth than the GTX 750Ti, the available GPU memory bandwidth is not optimally used since each GPU thread handles a separate packet, which reduces the opportunities for memory coalescing within a warp and thus provides little benefit from the Tesla's higher memory bandwidth. Although the Tesla K20c has 2.6× more CUDA cores capable of processing more request batches concurrently, the GTX 750Ti has a ∼61% higher core clock frequency, which can process warps faster. Additionally, the low processing requirements per packet further reduce the benefits of the higher computational throughput GPUs.

Finally, the energy-efficiency in Figure 4.17 measures the system wall power at the peak throughputs. As the packets are being read from memory instead of from the NIC, the wall power does not account for the power consumed by the NICs. As such, we also add the TDP for the additional NICs required to support the increased throughput. For MemcachedGPU, the GTX 750Ti is able to process over 27% and 43% more GET requests per watt than the Tesla K20c and GTX Titan respectively.

4.4.6 Comparison with Previous Work

This section compares MemcachedGPU against reported results in prior work. Table 4.5 highlights the main points of comparison for MemcachedGPU against multi-core CPUs [52, 106], an FPGA [26, 85], and an implementation solely using a GPU for the key hashing [44]. Results not provided in the published work or not applicable are indicated by "–". The cost-efficiency (KRPS/$) only considers the purchase cost of the CPU and the corresponding accelerator at the time of this study (2015), if applicable. All other costs are assumed to be the same between systems. The last column in Table 4.5 presents the year the processor was released and the process technology (nm).

Table 4.5 also presents our results for the vanilla CPU Memcached and vanilla CPU Memcached using PF_RING to bypass the Linux network stack. No other optimizations were applied to the baseline Memcached. These results highlight that while bypassing the Linux network stack can increase performance and energy-efficiency, additional optimizations to the core Memcached implementation are required to continue improving efficiency and scalability.
Table 4.5: Comparing MemcachedGPU with previous work.
Platform                                          | MRPS                       | Lat. (µs)             | KRPS/W | KRPS/$ | Year / nm
MemcGPU Tesla (online)                            | 12.9-13                    | m <800, p95 <1100     | 62     | 4.3    | '14/28
MemcGPU Tesla (online)                            | 9                          | p95 < 500             | 45     | 3      | '14/28
MemcGPU GTX 750Ti (NGD online)                    | 12.85                      | m <830, p95 <1800     | 84.8   | 25.7   | '14/28
MemcGPU GTX 750Ti (offline)                       | 28.3                       | –                     | 127.3  | 56.6   | '14/28
Vanilla Memc - 4 threads                          | 0.93                       | p95 <677 (0.5 MRPS)   | 6.6    | 2.67   | '13/22
Vanilla Memc + PF_RING - 2 threads                | 1.82                       | p95 <607 (1 MRPS)     | 15.89  | 5.2    | '13/22
Flying Memc [44]                                  | 1.67                       | m < 600               | 8.9    | 2.6    | '13/28
MICA - 2x Intel Xeon E5-2680 (online, 4 NICs) [106]| 76.9                      | p95 < 80              | –      | 22     | '12/32
MICA - 2x Intel Xeon E5-2680 (offline) [106]      | 156 (avg. of uni. & skew.) | –                     | –      | 44.5   | '12/32
MemC3 - 2x Intel Xeon L5640 [106]                 | 4.4                        | –                     | –      | 12.9   | '10/32
FPGA [26, 85]                                     | 13.02                      | 3.5-4.5               | 106.7  | 1.75   | '13/40

Aside from MICA [106], MemcachedGPU improves or matches the throughput compared to all other systems. However, an expected result of batch processing on a throughput-oriented accelerator is an increase in request latency. The CPU and FPGA process requests serially, requiring low latency per request to achieve high throughput. The GPU instead processes many requests in parallel to increase throughput. As such, applications with very low latency requirements may not be a good fit for the GPU. However, even near 10 GbE line-rate, MemcachedGPU achieves a 95-percentile RTT under 1.1ms and 1.8ms on the Tesla K20c (GNoM) and GTX 750Ti (NGD) respectively.

MemcachedGPU is able to closely match the throughput of an optimized FPGA implementation [26, 85] at all key and value sizes, while achieving 79% of the energy-efficiency on the GTX 750Ti. Additionally, the high cost of the Xilinx Virtex 6 SX475T FPGA (e.g., $7100+ on digikey.com) may enable MemcachedGPU to improve cost-efficiency by up to 14.7× on the GTX 750Ti ($150). While an equivalent offline study to Section 4.4.5 is not available for the FPGA, the work suggests that the memory bandwidth is tuned to match the 10 GbE line-rate, potentially limiting additional scaling on the current architecture. This provides promise for the low-power GTX 750Ti GPU in the offline analysis, which may be able to further increase throughput and energy-efficiency up to 2.2× and 1.2× respectively. Furthermore, the GPU can provide other benefits over the FPGA, such as ease of programming and a higher potential for workload consolidation (Section 4.4.4).

Flying Memcache [44] uses the GPU to perform the Memcached key hash computation, while all other network and Memcached processing remains on the CPU. GNoM and MemcachedGPU work to remove additional serial CPU processing bottlenecks in the GET request path, enabling 10 GbE line-rate processing at all key/value sizes. Flying Memcache provides peak results for a minimum value size of 250B. On the Tesla K20c with 250B values, MemcachedGPU improves throughput and energy-efficiency by 3× and 2.6× respectively, with the throughput scaling up to 7.8× when using 2B values.

The state-of-the-art CPU Memcached implementation, MICA [106], achieves the highest throughput of all systems on a dual 8-core Intel Xeon system with four dual-port 10 GbE NICs. Similar to MemcachedGPU, MICA makes heavy modifications to Memcached and bypasses the Linux network stack to improve performance, some of which were adopted in MemcachedGPU (Section 4.2). Additional modifications, such as the log-based value storage, could also be implemented in MemcachedGPU. MICA's results include GETs and SETs (95:5 ratio) whereas the MemcachedGPU results consider 100% GET requests. However, MICA also modified SETs to run over UDP, which may limit the effectiveness in practice.
Additionally, MICA requires modifications to the Memcached client to achieve peak throughputs, reducing to ∼44% of peak throughput without this optimization. In the online NGD framework, the GTX 750Ti may improve cost-efficiency over MICA by up to 17%. MICA presents an offline limit study of their data structures without any network transfers or network processing, reaching high throughputs over 150 MRPS. In contrast, all of the UDP packet data movement and processing is still included in the offline MemcachedGPU study (Section 4.4.5); however, UDP packets are read from CPU memory instead of over the network. In the offline analysis, the GTX 750Ti may improve cost-efficiency over MICA up to 27%. We were not able to compare the energy-efficiency of MemcachedGPU with MICA as no power results were presented.

A state-of-the-art GPU key-value store, Mega-KV [184], was published after the work in this study, which achieves very high performance on a multi-CPU, multi-GPU, multi-NIC server (e.g., 120+ MRPS). Mega-KV only performs the key-value look-up on the GPU; all other processing is on the CPU. Additionally, Mega-KV uses the AES SSE instruction for the hash function (instead of in software), smaller minimum sized keys, and a compact key-value protocol independent from Memcached's ASCII protocol. Mega-KV is discussed further in Section 6.1.1.

4.5 Summary

This chapter presented GNoM, a GPU-accelerated networking framework, which enables high-throughput, network-centric applications to exploit massively parallel GPUs to execute both network packet processing and application code. This framework allows a single GPU-equipped datacenter node to service network requests at ∼10 GbE line-rates, while maintaining acceptable latency even while processing lower-priority, background batch jobs. Using GNoM, this chapter described an implementation of Memcached, MemcachedGPU. MemcachedGPU is able to achieve ∼10 GbE line-rate processing at all request sizes, using only 16.1 µJ and 11.8 µJ of energy per request, while maintaining a client visible p95 RTT latency under 1.1 ms and 1.8 ms on a high-performance NVIDIA Tesla GPU and low-power NVIDIA Maxwell GPU respectively. We also performed an offline limit study and highlight that MemcachedGPU may be able to scale up to 2× the throughput and 1.5× the energy-efficiency on the low-power NVIDIA Maxwell GPU. We believe that future GPU-enabled systems, which are more tightly integrated with the network interface and less reliant on the CPU for I/O, will enable higher performance and lower energy per request.

Overall, this chapter demonstrates the potential to exploit the efficient parallelism of contemporary GPUs for network-oriented datacenter services. However, a large portion of the GNoM framework presented here, specifically GNoM-host (GNoM-KM, GNoM-ND, and GNoM-user), is only required because current GPUs cannot directly receive control information or interrupts from other third party devices in a heterogeneous system through any standard interfaces. As a result, the CPU acts as a middleman responsible for handling the task management and control information between a third party device and the GPU. This increases the latency for launching and handling the completion of GPU tasks, decreases the energy-efficiency, increases complexity, and reduces the potential for the CPU to work on other useful tasks.
In the next chapter (Chapter 5), we propose modifications to the existing GPU architecture and programming model to enable third party devices to launch tasks directly on the GPU.

Chapter 5

EDGE: Event-Driven GPU Execution

The previous chapter explored the potential for using GPUs as efficient accelerators for network applications in the datacenter. It highlighted that by offloading both the UDP network processing and Memcached processing to the GPU in micro batches, MemcachedGPU could obtain high throughput, low latency, and high energy efficiency on commodity Ethernet and GPU hardware. However, while the GPU is responsible for the main processing (UDP packet parsing, Memcached processing, and UDP response packet generation), and the network data is transferred directly to the GPU from the network interface (GPUDirect), a complex host framework, GNoM-host, is required to manage the interactions between the network interface (NIC) and the GPU. GNoM-host acts as the middleman between communicating devices, and is a result of a centralized CPU + Operating System (OS) design, where the CPU is typically responsible for handling IO and managing control between devices in a heterogeneous system.

This is not a unique property of GNoM or MemcachedGPU. Many other recent works utilize GPUs to accelerate tasks spanning multiple heterogeneous devices. For example, GPUNet [96], GPURdma [38], GASSP [170], and PacketShader [65] accelerate GPU network processing applications interacting with the CPU and NICs. Other works implement FPGA-GPU-CPU pipelined tasks for high-throughput processing tasks such as cardiac optical mapping [116] and pedestrian detection [23]. Each of these works requires a CPU and/or GPU software runtime framework to orchestrate the communication of control between an external device (NIC/FPGA) and the GPU. This chapter explores how to increase the independence and control of the GPU to enable third-party devices in a heterogeneous environment to manage the execution of GPU tasks without interacting with the CPU on the critical path or requiring CPU/GPU polling software frameworks. (A modified version of the material presented in this chapter was later published in the International Conference on Parallel Architectures and Compilation Techniques (PACT) 2019 [71].)

5.1 Motivation for Increased GPU Independence

Figure 5.1 shows a comparison of three systems, in which the same streaming GPU task is launched repeatedly to process data provided by an external device. Examples of such applications are listed above. The CPU invokes the same GPU kernel with different parameters describing the new data to process; however, the CPU has little to no involvement in the application processing.

In the baseline CUDA programming model, Figure 5.1(a), the CPU is responsible for transferring data to the GPU, configuring the GPU task, and launching the task on the GPU through one or more CUDA streams (referred to as the Baseline in this chapter). Under the baseline, if an external device wants to initiate work on the GPU, it can directly transfer data to the GPU using GPUDirect; however, control is communicated with the CPU on both ends of the task. This is equivalent to the GNoM software framework described in Chapter 4. On the front end, the CPU must wait for work to arrive, either through interrupts or polling, configure the task for the GPU, and launch the task on the GPU. On the back end, the CPU must wait for the GPU task to complete, again either through interrupts or polling, handle the response, and communicate the response back to the device initiating the work on the GPU.
The inclusion of such a CPU software framework for handling IO and managing control of the GPU when the CPU is not the initiator of the work has many drawbacks. First, the latency to launch tasks and handle the responses is increased. Second, including the CPU in the critical path increases the total system energy consumption. Third, the ability for the CPU to work on other useful tasks is reduced. This is especially important in the datacenter environment, where workload consolidation is used to increase utilization and efficiency, and to reduce total costs. Lastly, running any other work concurrently on the CPU can impact the end-to-end performance of the GPU task by increasing the task launch and completion latency.

Figure 5.1: Example of the data and control flow when an external device launches tasks on the GPU for the baseline CUDA streams, Persistent Threads (PT), and EDGE.

Figure 5.2: Evaluating the loss in throughput for CPU compute and memory bound applications (Spec2006) when running concurrently with a GPU networking application. The GPU's reliance on the CPU to launch kernels leads to inefficiencies for both devices.

Consider the experiment in Figure 5.2, which evaluates the performance for a set of memory and compute bound CPU tasks (Spec2006) concurrently running with a GPU UDP network ping (GPU-ping) benchmark under the baseline CUDA system. GPU-ping receives packets directly from an Ethernet NIC in GPU memory via GPUDirect. It employs a CPU framework, GNoM-host, to handle NIC interrupts, manage the launch and completion of GPU kernels, and send the response packets to the NIC. Relative to running either application in isolation, the CPU memory and compute bound applications run 1.17× and 1.19× slower with GPU-ping, while the peak GPU-ping packet rate is reduced by 4.83× and 2.54×, respectively. This is a direct result of requiring the CPU to manage control of the GPU on the critical path of the streaming GPU application. As such, there is a need for enabling external devices to efficiently communicate both data and control directly with the GPU.

Figure 5.3: Measuring the performance and power consumption of persistent threads (PT) versus the baseline CUDA stream model for a continuous stream of large and small matrix multiplication kernels. Active Idle measures the power consumption of the polling PT threads when there are no pending tasks.

Persistent threads (PT) are an alternative technique for programming and launching tasks on the GPU and are discussed in more detail in Section 2.1.7. With PT and RDMA, data and control are able to flow directly into the GPU's memory and software schedulers, relaxing the requirement for the CPU to be involved, as shown in Figure 5.1(b).
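For reference, the core of a persistent-thread software scheduler is a polling loop like the hypothetical sketch below (illustrative only; the task descriptor, queue layout, and names are assumptions): each persistent CTA repeatedly claims task descriptors from an in-memory work queue written by the host (or, with RDMA, by another device) and executes the corresponding work until an exit flag is set.

// Hypothetical persistent-thread (PT) worker kernel: each persistent CTA
// (pCTA) polls an in-memory work queue for task descriptors.
struct PTTask {
    int   task_id;    // which operation to run
    void *params;     // pointer to the task's arguments in GPU memory
};

__device__ void run_task(const PTTask &t) { /* application-specific work */ }

__global__ void pt_worker(PTTask *queue, unsigned int queue_capacity,
                          unsigned int *claim,             // next slot to claim
                          volatile unsigned int *produced, // written by producer
                          volatile int *exit_flag) {
    __shared__ unsigned int task_idx;

    while (true) {
        if (threadIdx.x == 0) {
            task_idx = 0xFFFFFFFFu;
            // Software scheduler: spin until work arrives or exit is requested.
            while (!(*exit_flag)) {
                unsigned int next = atomicAdd(claim, 0u);   // coherent read
                if (next < *produced &&
                    atomicCAS(claim, next, next + 1u) == next) {
                    task_idx = next % queue_capacity;       // claimed a task
                    break;
                }
                // No pending work: keep polling (this is the "Active Idle"
                // power overhead measured in Figure 5.3).
            }
        }
        __syncthreads();
        if (task_idx == 0xFFFFFFFFu)
            return;                         // exit_flag was set by the host
        run_task(queue[task_idx]);          // all threads in the pCTA participate
        __syncthreads();
    }
}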
However, the increased flexibility of PT, enabled by the software thread schedulers, can impact performance and energy efficiency depending on the size of kernels and the rate at which tasks are launched. PT also increases the code complexity, as the programmer is now responsible for implementing the task launching, task scheduling, and synchronization logic. Additionally, by definition PT indefinitely consumes GPU resources for the persistent CTAs (pCTAs), which limits the potential for other kernels to concurrently use the GPU. As described in Section 2.1.5, recent proposals for GPU preemption [33, 92, 126, 131, 143, 157, 162, 176] can enable the pCTAs to be preempted. However, preempting and resuming the pCTAs reduces the benefits of having GPU threads continuously polling for work.

Furthermore, with PT the GPU's software scheduler threads are constantly polling for new work (red execution bars in Figure 5.1(b)), which can impact energy efficiency under varying task rates. Consider the example in Figure 5.3, which evaluates the power and performance of a CTA-level PT framework relative to the baseline GPU system performing a continuous stream of multiple matrix multiplications on two matrix sizes; small (2 CTAs / kernel) and large (120 CTAs / kernel). With the large kernel, PT is not able to take advantage of the GPU's hardware CTA schedulers, since this is now handled in software. As a result, less time is spent performing the matrix multiplication and more time is spent on the scheduling tasks, which lowers power by 45% but increases execution time by 5.58×, consequently increasing energy. PT performs very well with small kernels, since the overheads for kernel launching and scheduling in the baseline system are higher relative to the amount of work to perform. This enables PT to spend more time performing the matrix multiplication, which increases power by 20% but significantly decreases execution time by 2.9×, hence lowering energy. Finally, the polling nature of PT increases power consumption by 13% when no tasks are pending, relative to a GPU without PT in a high power state (p2), indicated by Active Idle.

Each technique described above poses a trade-off between the amount of control a GPU has and the efficiency it can provide. As such, there is a need for a technique that achieves the performance and complexity of the baseline CUDA model with the flexibility of the persistent thread model. This chapter proposes an event-driven GPU execution technique, EDGE, which is a form of GPU active messaging [46]. Event-driven programming is an alternative style of programming compared to threads, where programs are broken down into fine-grained pieces of code, callbacks, or event handlers, responsible for performing specific operations in response to IO events [37, 54, 140]. The event handlers can be triggered via interrupts or called in a continuously running event loop. The motivation behind EDGE is that by enabling GPU kernels to be triggered by external events, any device in a heterogeneous system can launch work on the GPU without requiring CPU interaction or GPU polling software frameworks. This can improve performance, efficiency, and server utilization, while reducing complexity.

The resulting control and data flow for EDGE is shown in Figure 5.1(c).
Here, the CPU is required only to initialize the kernel once and then both the data and control are transferred directly from the external device to the GPU, removing the CPU from the critical path and freeing it up to work on other tasks.

Additionally, this chapter proposes a new form of CTA barrier, the wait-release barrier, which enables running CTAs to halt execution indefinitely until another CTA releases the barrier in response to some external event. This is useful to help reduce the polling overheads of the persistent GPU thread style of programming, while retaining the benefits of having persistent CTAs ready to immediately begin processing the kernel.

The following section discusses the various design alternatives and requirements for supporting event-driven execution on a GPU.

5.2 Supporting Event-Driven GPU Execution

EDGE has three main requirements. First, the GPU needs to know which code to execute and which data to operate on for a given event. In the baseline CUDA model, a user-level CPU process passes the task and parameter information to the GPU driver via the CUDA API, which configures the GPU kernel and transfers it to the GPU to be executed. However, the GPU driver runs in the operating system (kernel) space, which introduces a dependence on the CPU for configuring and launching GPU tasks. Alternatively, in-memory user-level work submission queues provide a mechanism for configuring the GPU kernel without operating in the privileged kernel space or requiring the use of a device-specific API. With in-memory work queues, the initiator of the GPU task writes the task to run and the input parameters directly to pre-defined memory locations using regular store instructions. Once notified of the pending task (described below), the GPU can read this information, configure the task, and then execute the task. Additionally, user-level work queues support independently configuring and triggering GPU kernels concurrently from different processes or devices. There have been multiple different proposals for using in-memory, user-level work queues for communicating tasks with the GPU to improve multi-process support, reduce GPU kernel launch latency, or interact directly with GPU threads, such as NVIDIA's Multi-Process Service (MPS) [129], the Heterogeneous Systems Architecture (HSA) [55], and the persistent GPU thread frameworks described above. In-memory work submission/completion queues are also used to enable a CPU to communicate with external devices, such as NVMe [49]. As such, EDGE utilizes in-memory work queues to communicate tasks and data from a third-party device with the GPU, which is described in Section 5.4.1.

Second, there must be a method to notify the GPU of a pending task and schedule that task on the GPU. Assuming in-memory, user-level work queues are used to submit tasks to the GPU, there are multiple different techniques to achieve this. In MPS, a CPU daemon process identifies when tasks have been submitted to the work queues and launches the task on the GPU from the CPU. While this reduces the requirements of the user-level process for launching the task (i.e., the user-level process does not need to communicate with the GPU driver through the CUDA API), it still requires the CPU to communicate the task with the GPU via the daemon MPS process. In PT, persistent GPU threads poll the in-memory work queues to identify when new work is available. As discussed above, this polling can have negative impacts on performance, efficiency, and complexity.
In HSA, a "Packet Processor" handles packets inserted into the work queues and initiates the task on the GPU. The HSA Programmer's Reference Manual [55] states that the Packet Processor is generally a hardware unit and may reside on either the device initiating the task (e.g., the CPU) or the device where the task will run (e.g., the GPU). The device initiating the task writes into a "doorbell register" associated with the work queue to notify the Packet Processor that a new task is available. In the context of a GPU, for example, the Packet Processor could be a dedicated hardware unit or a small scalar processor residing on the GPU. A dedicated hardware unit would provide a low-latency and high-efficiency path for identifying when a new task is available and for scheduling the task to be executed on the GPU. However, a dedicated hardware unit would also limit the types of operations that can be performed on an event. On the other hand, a small GPU-resident, scalar processor would provide the flexibility to implement any type of operation with additional hardware overheads. Depending on the GPU architecture, such scalar cores may already exist on the GPU [10].

In EDGE, we explore an alternative technique, which exploits the abundance of available computing resources on the GPU, the streaming multiprocessors (SMs), to act as the Packet Processor. Similar to PT, GPU threads are used to read the in-memory work queues to configure and launch the GPU task, referred to as privileged GPU warps (PGW). However, unlike PT, PGWs are launched in response to an external event instead of continuously polling the in-memory work queues. To support the scheduling of PGWs, EDGE exposes a light-weight, warp-level preemption mechanism on the GPU that can be triggered by any device in the system. PGWs can implement a set of OS-like abstractions to increase the independence of the GPU, such as launching internal GPU kernels or ensuring fairness between competing tasks, and can be initiated via interrupts or writes to doorbell registers. The PGWs can also be used to release any persistent CTAs blocked on the proposed wait-release barriers to reduce the overheads of PT polling. Using the software PGWs to launch event kernels, which may require preemption to begin executing, trades off generality for performance (e.g., when compared to a dedicated hardware unit). In EDGE, we also consider the possibility of using a dedicated hardware unit to trigger the execution of event kernels, which would significantly reduce the event kernel scheduling latency. PGWs and the warp-level preemption mechanism are described in Section 5.3.

Finally, the latency to trigger events on the GPU should be minimized. In EDGE, we make the observation that GPU kernels in a streaming application typically have similar, if not identical, configurations. For example, the MemcachedGPU kernel performs the same task (key-value lookup), on the same number of packets, with the same set of parameters pointing to different buffers in memory populated by the NIC. In a cardiac optical mapping application [116], the FPGA triggers the same GPU image processing kernel on different images with the same dimensions. However, all of the kernel configuration parameters must be specified for every GPU kernel launch. EDGE exploits this opportunity by providing a platform to register pre-configured event kernels with the GPU. Event kernels are associated with an ID, which can be specified by the device initiating the task through an interrupt or doorbell register. If the format of the next kernel and the location of the next kernel's parameters are known a priori, event kernels can be efficiently scheduled from the PGWs with little to no additional information aside from the event kernel ID. Event kernels are discussed in Section 5.4.
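As a purely illustrative picture of what such a pre-registered event kernel and its in-memory work queue might look like, the sketch below shows a hypothetical descriptor that the CPU registers once, and the per-event entry that an external device fills in before ringing the doorbell. The field names and layout are assumptions for illustration and are not EDGE's actual interface.

// Hypothetical EDGE-style event kernel registration and submission structures.
// None of these types are part of CUDA or the actual EDGE implementation.

struct EventKernelConfig {        // registered once by the CPU
    int      event_id;            // ID used by the external device
    void    *kernel_func;         // pre-configured kernel entry point
    unsigned grid_dim;            // fixed grid dimensions
    unsigned cta_dim;             // fixed CTA dimensions
    void    *param_buffers[16];   // pre-allocated parameter/data buffers
};

struct EventQueueEntry {          // written per event by the external device
    int      event_id;            // which registered kernel to run
    unsigned buffer_idx;          // which pre-allocated buffer holds the data
    unsigned num_items;           // e.g., number of packets in this batch
    unsigned valid;               // set last, after the fields above
};

// External-device view of submitting an event (e.g., writes performed by a
// NIC over PCIe): fill the next queue entry, then write the doorbell register.
void submit_event(EventQueueEntry *queue, volatile unsigned *doorbell,
                  unsigned slot, unsigned buffer_idx, unsigned num_items) {
    queue[slot].event_id   = 0;            // e.g., the MemcachedGPU GET kernel
    queue[slot].buffer_idx = buffer_idx;   // batch of packets already DMAed
    queue[slot].num_items  = num_items;
    queue[slot].valid      = 1;
    *doorbell = slot;                      // notify the GPU-side packet processor
}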
Previous work, XTQ [101], has also evaluated active messaging for GPUs to reduce GPU kernel launch overheads by modifying InfiniBand NICs to support the HSA user-level in-memory work queues via remote direct memory access (RDMA). In XTQ, the complexity is pushed to the InfiniBand NIC, such that any remote agent can directly write to the HSA user-level work queues via the InfiniBand NIC to launch a task on an HSA-enabled GPU without requiring interaction with the CPU. In contrast, EDGE pushes this complexity to the GPU, such that any external device capable of interacting with the GPU can launch a task directly on the GPU. EDGE also explores using PGWs as the Packet Processor for managing the launching of tasks submitted to user-level in-memory work queues. XTQ is described further in Chapter 6.

The next sections describe the GPU interrupt mechanism, GPU privileged warps, fine-grained warp-level preemption, event kernels, and the wait-release barriers in more detail.

5.3 GPU Interrupts and Privileged GPU Warps

As described in the previous section, EDGE explores utilizing privileged GPU warps (PGW) to perform higher level operations, such as internally scheduling tasks on the GPU or releasing CTA barriers, corresponding to an external event. However, a mechanism is required to initiate the execution of the PGWs to avoid continuous polling, as is done in PT. In this section, we present the design of a GPU interrupt architecture and propose modifications to the current GPU architecture to efficiently support fine-grained, warp-level GPU preemption to trigger the PGWs.

Interrupts provide a simple path for any device in a heterogeneous system to signal an event to the GPU. While traditional interrupts require dedicated interrupt lines, message signaled interrupts (MSI) are an alternative in-band method of sending interrupts over the same bus used to communicate with a device, supporting significantly more interrupt handlers. For example, PCIe, the communication bus typically used with discrete GPUs, supports up to 2048 different interrupts through MSI-X (MSI-X is supported in PCIe 3.0 and higher). As such, any device that can send a message to the GPU over PCIe can also signal a variety of interrupts on the GPU.

While interrupts provide a mechanism for signalling devices of an event, interrupts can negatively impact tail latencies. For example, consider a distributed networking or map-reduce task, which spawns off multiple different operations in parallel to solve a given problem. A delay in one of the parallel operations will delay the entire task [40]. As such, it is desirable to maintain low and predictable delays. A single interrupt on a traditional CPU running an OS results in multiple operations: an interrupt is received by an interrupt controller (IC), a CPU core is selected and notified of the interrupt, any application running on the interrupt core must be temporarily suspended, the core jumps to an interrupt descriptor table in the OS, which queries the IC to perform the corresponding task for the interrupt, and finally the interrupt is either processed in place or scheduled to be processed at a later time. Each of these operations introduces uncertainty into the latency for a task dependent on an interrupt.
As a result, applications with low latency requirements typically resort to some form of polling, which can negatively impact energy under low task rates. However, there are fundamental differences in the GPU's hardware and software compared to a CPU that can mitigate these challenges. First, GPUs contain hundreds of cores and thousands of thread contexts, each potentially capable of handling an interrupt. At any given time, there may be a free thread context available to process an interrupt without requiring context switching (necessary on a CPU to handle an interrupt when currently processing another task), even if another application is currently running on the GPU. Second, current GPUs are offload accelerators that rely on hardware task schedulers, instead of an OS, to improve performance. As such, the latency from receiving an interrupt to running the interrupt service routine (ISR) may be reduced. Finally, GPUs are throughput oriented architectures, which are designed to tolerate long latency instructions by running multiple warps concurrently. Assuming that a warp selected to handle an interrupt is not blocking any other warps (e.g., at a synchronization barrier), the GPU application can still make progress through other concurrently running warps, whereas on a CPU, the interrupted task may be blocked from making progress during the interrupt processing.

In EDGE, we propose a fine-grained warp-level preemption mechanism, initiated by a GPU interrupt (or write to a doorbell register), that reduces the impact on concurrently running tasks and enables any warp to be a candidate for handling an external event. To use this mechanism, EDGE reserves certain interrupt vectors for user-space processing, such that any device can directly signal the GPU to begin work corresponding to a specific interrupt vector. EDGE also supports timer interrupts, which are useful for scheduling tasks that need to run periodically. Timer interrupts could be used to, for example, support time multiplexing of GPU resources as a method to coalesce interrupts, or to reduce the external device complexity by removing the need for the device to send an interrupt to initiate work by scheduling regular queries to the in-memory work queues.

Figure 5.4: EDGE interrupt partitioning. A single privileged warp (top-half) identifies the cause of the interrupt and schedules the event kernel (bottom-half), which runs as a traditional application-specific kernel with any number of warps/CTAs.

5.3.1 Interrupt Partitioning and Granularity

Similar to the Linux interrupt handling mechanism, EDGE is designed with a notion of a top-half and bottom-half [35], as shown in Figure 5.4. The top-half is a privileged piece of code capable of performing a programmable set of operations. It is responsible for determining the cause of the interrupt and for configuring and scheduling the user-defined operation, the bottom-half, accordingly by triggering the launch of a GPU kernel, referred to as an event kernel (Section 5.4.1).
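As a software analogy only (EDGE's top-half runs on a privileged warp and hands the bottom-half to the hardware kernel scheduler, rather than using CUDA dynamic parallelism), the split can be pictured as a tiny single-warp kernel that decodes the event and launches the registered, application-specific bottom-half kernel. The sketch below reuses the hypothetical descriptor types from the earlier work-queue sketch; it requires sm_35+ and relocatable device code, and all names are assumptions.

// Software analogy of the EDGE top-half / bottom-half split using CUDA
// dynamic parallelism. This is not EDGE's mechanism; it only illustrates the
// division of work between the two halves.
__global__ void memcached_get_event(void *packet_buffer, unsigned num_packets);

__global__ void edge_top_half(EventQueueEntry *entry, EventKernelConfig *cfgs) {
    if (threadIdx.x == 0 && entry->valid) {      // single "privileged" thread
        EventKernelConfig *cfg = &cfgs[entry->event_id];
        void *data = cfg->param_buffers[entry->buffer_idx];
        // Bottom-half: a normal, application-specific kernel scheduled with a
        // configurable priority and executed by the hardware schedulers.
        memcached_get_event<<<cfg->grid_dim, cfg->cta_dim>>>(data, entry->num_items);
        entry->valid = 0;                        // mark the event as consumed
    }
}

__global__ void memcached_get_event(void *packet_buffer, unsigned num_packets) {
    // UDP parsing + GET lookup for one packet per thread (as in Chapter 4).
}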
This partitioning keeps latency low for the immediate processing stages of the interrupt, while enabling the GPU to defer the scheduling of the interrupt handler's bottom-half, which may require considerably more processing, with a configurable priority. Additionally, once scheduled, the bottom-half event kernel can utilize the efficient hardware kernel and thread schedulers like any other kernel, which minimizes the modifications required to the GPU architecture to support EDGE.

EDGE reduces the impact of processing control messages within the GPU through fine-grained, warp-level interrupt handling. GPUs contain multiple SMXs and warp contexts, which can all be candidates to handle the interrupt. For example, the NVIDIA GeForce GTX 1080 Ti supports up to 64 warps per SMX, with 28 SMXs, for a total of 1792 warp contexts. If all warps are occupied, a single warp must be preempted; however, if a free warp context is available, no preemption is required. The ability for the interrupt to be processed by an idle or underutilized SMX can reduce the impact on concurrently running GPU kernels, as well as reduce the latency to schedule and process the interrupt (Section 5.6.1).

5.3.2 Privileged GPU Warp Selection

The PGW is similar to any other warp in the GPU, with the main difference being that it does not belong to a user-level kernel or CTA – it is a system-level warp. The anticipated operations to be performed by the PGW (the ISR) are inherently sequential, such as launching an event kernel or releasing CTA barriers. Consequently, the SIMD aspects of the warp may go underutilized, indicating that a dedicated hardware unit or scalar processor may be more appropriate. However, using an existing GPU warp to handle these operations removes the need for additional hardware and enables future operations that may benefit from the SIMD architecture. Additionally, the currently proposed ISR operations for the PGW require minimal processing, which mitigates the inefficiency of using a warp for sequential processing.

When an external or internal event is triggered, EDGE must select a PGW to handle the event. We consider four possibilities for selecting a PGW: (1) utilizing dedicated hardware for the interrupt warp context (Dedicated); (2) reserving existing warp contexts (Reserved); (3) selecting a free warp context (Free); and (4) preempting a running warp context with some selection policy (e.g., the Oldest or Newest warp).

Qualitatively, the trade-offs between these approaches are as follows: (1) Adding dedicated hardware for PGWs or adding small scalar cores on the GPU guarantees that the interrupt can run immediately (i.e., no preemption is required, as dedicated computing resources are always available to process an interrupt). However, this requires additional hardware resources for managing the PGW and CTA contexts (e.g., program counter, special registers, SIMT stack, local memory) or for any specialized processing units. For example, AMD's Graphics Core Next (GCN) architecture describes scalar cores integrated within the Compute Units, which could be used to process an event. Furthermore, as described above, the sequential nature of the PGW ISR operations could limit the additional hardware required, while still reusing the existing SIMD execution pipeline and warp schedulers. For example, the PGW may not require a SIMT stack to track divergent warp threads, a wide register file to support warp-wide register accesses, CTA barrier management, or special thread/CTA dimension registers.
While the hardware overheads may not be high, this chapter focuses on exploring the potential for reusing the existing GPU computing resources.

(2) Reserving specific warp contexts from the existing set of warps for PGWs achieves the performance benefits of dedicated PGW resources without any of the hardware overheads. However, reserving PGW resources reduces the amount of parallelism that an application can exploit by the number of warps reserved for the PGWs. In such a case, applications that fully utilize the GPU resources will be negatively affected, even when the PGW is not executing.

(3) If there are unused warp contexts available, for example, due to a GPU application underutilizing the GPU resources, selecting a free warp does not require additional hardware for the PGW context, does not permanently reduce the parallelism available to an application, nor does it delay the time to schedule the PGW. However, a free warp context may not be available if one or more active GPU applications fully utilize the GPU's resources, which would require blocking the PGW until a free warp becomes available.

(4) Lastly, preempting a running warp enables the interrupt handler to run once the running warp has been properly halted and preempted, does not require additional hardware for the PGW context, and only temporarily reduces the thread contexts available to an application. However, preempting a running warp increases the latency to schedule the PGW and requires additional hardware to support the warp-level preemption. The policy for selecting which warp to preempt, referred to as the victim warp, is important to minimize both the latency to preempt the victim warp and the impact on the runtime of the victim warp's CTA. In this dissertation, we evaluate selecting the oldest and newest warp within a CTA as the victim warp. The evaluation of more complex victim warp selection policies, which also increase the hardware complexity of EDGE, is left to future work. Section 5.6.1 evaluates and compares the different PGW selection policies and highlights that EDGE can achieve similar performance by preempting running warps instead of adding dedicated PGW hardware or reserving existing warps for the PGWs.

The main challenge with interrupting a running warp is minimizing the latency required to preempt the warp, which is discussed further in the following section.

5.3.3 Privileged GPU Warp Preemption

EDGE preempts warps at an instruction granularity, meaning that it does not wait for a warp in a CTA to complete all of its instructions before preempting. Instead, the warp is preempted as soon as possible. This minimizes the latency to preempt a running warp at the cost of increased preemption complexity. To correctly save a warp's context and minimize complexity, any pending (un-executed) instructions are first flushed from the pipeline prior to starting the ISR. Depending on the state of the victim warp to be preempted for a PGW, or the instructions currently in the pipeline, the victim warp preemption latency can be quite large.
The terms "victim warp preemption latency" and "ISR scheduling latency" are used interchangeably in this dissertation.

We identify four main causes of large victim warp preemption latencies: (1) low scheduling priority for the victim warp, (2) pending instructions in the instruction buffer (i-buffer), (3) victim warps waiting at barriers, and (4) in-flight loads.

(1) To ensure that the victim warp completes its current instructions promptly, the victim warp's priority is temporarily increased. This overrides the GPU's current warp scheduling policy, such as greedy-then-oldest (GTO) or round-robin, and schedules the victim warp as soon as it is ready until all of its in-flight instructions have completed.

(2) Flushing any non-issued instructions from the i-buffer limits how many instructions the PGW needs to wait for before being able to preempt the victim warp. This can further increase the execution time of the victim warp if the flushed instructions are evicted from the instruction cache before being restored. However, assuming the victim warp is not the final warp in the CTA, the GPU's FGMT mitigates the impact by hiding the instruction fetch latency through executing other warps.

(3) A victim warp waiting at a barrier is a perfect candidate for interrupting, since the warp is sitting idle waiting for other warps in the CTA to reach the barrier. As such, preempting this warp may not impact the CTA's progress. However, special care needs to be taken to ensure that the victim warp is conditionally re-inserted into the barrier when the ISR completes, depending on whether the barrier has been released or not.

(4) Finally, in-flight load instructions can be dropped and replayed for victim warps, which significantly decreases the variability in preempting a running warp, since load instructions potentially have a much higher latency than ALU operations. Dropping loads involves releasing the miss-status holding register (MSHR) entry for the pending load, releasing the registers reserved in the scoreboard, and rolling back the program counter to the dropped load instruction such that the victim warp re-executes the load once it is rescheduled after the ISR. Additionally, a register is required to store the address of the dropped load to identify and discard the load when it returns from the memory subsystem. The load can still be safely inserted into the cache, which decreases the latency to replay the load.

Section 5.6.1 evaluates the impact of applying the above victim warp flushing techniques on the ISR scheduling latency.

Preemption requires saving the victim warp's state prior to switching to the PGW and restoring the victim warp's state after the PGW completes. When preempting a running victim warp context, the PGW can make use of the victim warp's SIMT stack to save the program counter and warp divergence state, since the SIMT stack already contains the full history of the victim warp's execution state. A new entry for the PGW's ISR can then be pushed onto the victim warp's SIMT stack to configure the PGW's execution state. When the ISR completes, an interrupt return function can pop the PGW's entry off of the SIMT stack to resume execution of the interrupted victim warp. Additionally, any registers used by the PGW must be saved and restored. We allocate a small region of global memory per SM to save the victim warp's registers.
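Taken together, the preemption sequence can be summarized by the following pseudocode sketch, written in the style of the pseudo examples later in this chapter. All structure and helper names (SMX, flush_ibuffer(), drop_and_replay_loads(), and so on) are hypothetical and only indicate the order of operations; they do not correspond to an actual hardware interface.

    // Pseudocode sketch of warp-level preemption for a PGW (hypothetical names).
    void preempt_for_pgw(SMX *smx, int victim) {
        if (warp_at_barrier(smx, victim)) {
            // (3) The warp is idle at a CTA barrier: preempt it immediately and mark it
            //     to be conditionally re-inserted into the barrier after the ISR.
            mark_barrier_reentry(smx, victim);
        } else {
            set_warp_priority(smx, victim, HIGHEST);   // (1) drain in-flight instructions quickly
            flush_ibuffer(smx, victim);                // (2) discard un-issued instructions
            drop_and_replay_loads(smx, victim);        // (4) release MSHR/scoreboard entries and
                                                       //     roll back the PC to replay the loads
            wait_for_pipeline_drain(smx, victim);
        }
        push_simt_stack_entry(smx, victim, EDGE_ISR_PC); // configure the PGW's execution state
    }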
This per-SM register save region could also be implemented as a small on-chip buffer to reduce the PGW preemption latency.

5.3.4 Privileged GPU Warp Priority

Along with minimizing the time to schedule a PGW, the ISR runtime should be minimized. This reduces the latency to begin performing the actual task for the corresponding event and reduces the amount of time the victim warp is blocked from performing its original task. In Section 5.6.1, we evaluate two techniques for minimizing the ISR runtime: prioritizing the PGW's instruction fetch and scheduling, and reserving its entries in the instruction cache. The ISR code can be very small, requiring only a few entries in the instruction cache. For example, the current ISR implementation in EDGE occupies only three cache lines. We find that the ISR execution time is significantly reduced when the PGW never misses in the instruction cache.

5.3.5 Interrupt Flow

Figure 5.5 presents the high-level EDGE interrupt controller logic, which is implemented in hardware. The numbers in the figure indicate which parts of the GPU hardware shown in Figure 5.7 are responsible for performing each operation.

Figure 5.5: Interrupt controller logic (reference Figure 5.7).

The interrupt controller is split into a global interrupt controller (G-IC) and per-SMX local interrupt controllers (L-ICs). When an interrupt is received by the G-IC, it first checks whether this is an EDGE interrupt or a GPU-wide interrupt. If it is an EDGE interrupt, an SMX is selected to service the interrupt based on a selection policy, such as simple round-robin or a more complex policy that selects the SMX with the most free resources. The G-IC then pushes the interrupt metadata, such as the event KMD described in Section 5.4.2, into a queue in the selected SMX's L-IC. At this point the SMX and L-IC are responsible for processing the interrupt and the G-IC can safely clear it. Additionally, because each SMX has its own L-IC, there is no contention between interrupts running concurrently on different SMXs. The selected SMX then selects a victim warp to interrupt based on the available hardware and the PGW selection policy (Section 5.3.2). If the victim warp is currently active, it is flushed through the pipeline using the techniques described in Section 5.3.3. Next, a PGW entry is pushed onto the victim warp's SIMT stack with the interrupt vector PC. Finally, the PGW's priority for instruction fetch and scheduling is increased to minimize the total ISR runtime.

Figure 5.6 presents the software ISR flow performed by the PGW.

Figure 5.6: Interrupt service routine (reference Figure 5.7).

The ISR first saves any registers that will be overwritten, if necessary. Depending on the number of registers required for the PGW, additional hardware for dedicated PGW registers could also be added to avoid saving the victim warp's registers. The PGW then communicates with the L-IC to determine the cause of the interrupt and proceeds accordingly.
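As a rough illustration of this software flow, the ISR body might be sketched as follows. The helper names stand in for the memory-mapped L-IC accesses and the hardware event kernel queue described in this chapter; they are assumptions for illustration and not an existing API.

    // Pseudocode sketch of the ISR executed by a PGW (all helper names hypothetical).
    __device__ void edge_isr() {
        save_overwritten_registers();                   // spill to the per-SM save region, if needed
        EventKMD kmd = read_pending_interrupt_queue();  // memory-mapped read from the L-IC
        if (kmd.modifyable)
            update_kmd_from_esq_entry(&kmd);            // read optional metadata from the ESQ entry
                                                        // in global memory (Section 5.4.1)
        schedule_event_kernel(&kmd);                    // push the event kernel into the event kernel queue
        pop_pending_interrupt_queue();
        restore_overwritten_registers();
        interrupt_return();                             // pop the PGW entry off the SIMT stack
    }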
Section 5.4 discusses the operations that the ISR may perform to orchestrate event-driven GPU execution.

5.3.6 Interrupt Architecture

Figure 5.7 presents the modifications to the current GPU architecture required to support fine-grained, warp-level preemption initiated by interrupts.

Figure 5.7: EDGE GPU microarchitecture. Baseline GPU diagram inspired by [87, 111, 155, 175]. EDGE components are shown in green.

The G-IC (7), which is part of the GPU front-end (3), requires logic to identify the EDGE interrupt and select an SMX L-IC (8) to process the interrupt. The SMX selection policy could take many factors into account, such as identifying the SMX with the lightest load or with free warps available to avoid warp preemption, or use a simpler policy, such as round-robin. The interrupt is then forwarded to an SMX L-IC. Enabling this division between the G-IC and L-ICs requires new connections (9), or can make use of the existing connections between the SMX scheduler and the SMXs. The L-IC requires control registers to store the PC corresponding to the GPU's interrupt vector (10), which may be able to use the existing IC metadata registers. The L-IC also maintains a pending interrupt queue (13) for interrupts assigned to it, which in EDGE contains metadata describing the event kernel to launch (Section 5.4.2). In the steady state, with 28 SMXs and a single pending interrupt queue entry, the ISRs can take at most 28× as long as the interrupt period. The correct depth of the pending interrupt queue depends on the distribution of the arrival rate of interrupts. The L-IC requires logic for identifying and selecting a victim warp for the PGW (11) based on some policy (free, newest, oldest, etc.), which can potentially make use of the existing warp scheduler logic. Once selected, the L-IC needs to flush the victim warp (12). The logic for the victim warp flushing optimizations (Section 5.3.3) includes modifications to the warp scheduler to prioritize individual warp scheduling, logic to manipulate valid bits to flush entries from the instruction buffer and to invalidate MSHR entries (or the equivalent structure for in-flight loads), a victim warp ID register to identify and drop in-flight loads, and a register to indicate whether a victim warp should return to a warp barrier after completing the ISR. EDGE also requires logic to lock instruction cache entries for the ISR.
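Putting these structures together, the hardware-side dispatch through the G-IC and per-SMX L-ICs (Figure 5.5) can be summarized by the following pseudocode sketch; as with the earlier sketches, all names are hypothetical and only illustrate the flow.

    // Pseudocode sketch of EDGE interrupt dispatch in the G-IC and per-SMX L-ICs
    // (hypothetical names; see also the preemption sketch in Section 5.3.3).
    void gic_receive_interrupt(int eventId) {
        if (!is_edge_event(eventId)) { default_interrupt_path(eventId); return; }
        EventKMD kmd = ekt_read_and_advance(eventId);  // read the pre-configured event KMD and
                                                       // advance the ESQ head pointer (Section 5.4.2)
        SMX *smx = select_smx();                       // e.g., round-robin or least-loaded SMX
        lic_enqueue_pending(smx, kmd);                 // push into that SMX's pending interrupt queue
    }

    void lic_service(SMX *smx) {
        int wid = find_free_or_dedicated_warp(smx);
        if (wid < 0) {                                     // no free context: preempt a running warp
            wid = select_victim_warp(smx);                 // oldest/newest selection policy
            preempt_for_pgw(smx, wid);                     // flush + SIMT stack push (Section 5.3.3)
        } else {
            push_simt_stack_entry(smx, wid, EDGE_ISR_PC);  // free/dedicated context: no preemption
        }
        prioritize_pgw(smx, wid);                          // highest fetch/scheduling priority for the ISR
    }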
In terms of storage, the pending interrupt queue in each SMX requires 168 B to store four pending events.

The PGW communicates information about the interrupt with the L-IC as a memory-mapped component via load and store instructions to special addresses. In our implementation, EDGE reserves certain memory addresses to access the components described above, such as the pending interrupt queue. This API could also be implemented using specialized instructions.

5.4 Event-Driven GPU Execution

GPU interrupts and the PGW warp-level preemption mechanism provide a path for initiating and managing tasks on the GPU without requiring CPU intervention. However, supporting this form of GPU task execution has several requirements. Specifically, external devices need to be able to specify which work to perform, what data to perform the work on, and determine when work has completed execution on the GPU. To address these requirements, this section proposes a new type of GPU kernel, the event kernel; a host API for pre-configuring event kernels and corresponding data; GPU hardware structures for storing the event kernel metadata; and user-level work queues to communicate task initiation and completion. This section also proposes a new form of CTA barrier instruction, the wait-release barrier, which enables CTAs to halt execution until an external event releases the barrier through a PGW. Wait-release barriers are useful for persistent-thread style programming models to avoid continuously polling external memory for new tasks.

5.4.1 Event Kernels

An event kernel is a user-defined GPU kernel, which is launched internally by a PGW running on the GPU in response to an internal or external event. Once launched, an event kernel is identical to any normal GPU kernel. In a traditional GPU environment, the CPU is required to configure and launch every GPU kernel. EDGE relaxes this requirement if the same kernel is repeatedly launched on different input data (e.g., a network processing kernel when a group of packets arrives). In such cases, the kernel needs to be configured only once and an external device can directly trigger the execution of this kernel without unnecessarily involving the CPU. The following section describes how EDGE enables an event kernel to concurrently support multiple kernel parameter memories to communicate with an external device.

Event Submission and Completion Queues

In currently deployed systems, the host communicates data to the GPU through global and kernel parameter memory (which may be separate or part of the same physical memory). Large input and output buffers are stored in global memory, while the kernel parameter memory stores pointers to these buffers. The host CPU configures and communicates these buffers with the GPU through the GPU driver when launching a kernel. EDGE exploits the fact that the same event kernel will have the same parameter memory structure (i.e., the same type, number, and order of parameters) but works on different data. As such, the parameter memory can be pre-configured, removing the need to repeatedly configure it on the CPU on the critical path.

To communicate data and task status with the GPU, EDGE implements a pair of in-memory circular queues, called the event submission queue (ESQ) and event completion queue (ECQ), shown in Figure 5.8.

Figure 5.8: Event submission and completion queues.
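As a concrete illustration, the per-entry layouts might be sketched with the following C-style structures. The field names, the MAX_EVENT_PARAMS bound, and the metadata struct are assumptions for illustration; the real ESQ entry layout is derived from the parameter types supplied when the event kernel is registered.

    // Hypothetical C-style sketch of the ESQ/ECQ entry layouts described below.
    #define MAX_EVENT_PARAMS 8

    enum EcqStatus { ECQ_FREE = 0, ECQ_DONE = 1 };

    struct EventCompletionQueueEntry {         // small (e.g., 4-byte) entry in external device memory
        volatile int status;                   // ECQ_FREE or ECQ_DONE (set by the GPU)
    };

    struct EventKernelMetadata {               // optional per-invocation metadata used by
        dim3 grid, block;                      // "modifyable" event kernels
    };

    struct EventSubmissionQueueEntry {         // entry in GPU memory
        void *params[MAX_EVENT_PARAMS];        // pre-allocated kernel parameters (buffer pointers / constants)
        EventKernelMetadata opt_kmd;           // optional event kernel metadata
        EventCompletionQueueEntry *completion; // pointer to the matching ECQ entry
    };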
These queues are similar to those used in persistent threads, NVIDIA MPS [129], and HSA [55] as a mechanism for launching tasks and managing task completion. However, unlike NVIDIA MPS and HSA, the ESQ and ECQ are tailored to a specific event kernel structure, as described below. The ESQ is stored in GPU memory and the ECQ is stored in external memory (e.g., CPU memory or a third-party device's memory). Each ESQ entry (ESQe) contains a list of kernel parameters corresponding to a specific event kernel, optional event kernel metadata, and a pointer to the corresponding ECQ entry (ECQe) to signify the completion of an event. When launching an event kernel, the PGW simply sets the event kernel's parameter memory to the current ESQe in the ISR (Figure 5.6). Since the event kernel parameter memory is pre-initialized, subsequent event kernels only need to update the kernel parameter memory to the next ESQe. Other required information, such as the event kernel dimensions, is already known from the event kernel pre-registration process. The ESQe can also be configured to store optional event kernel metadata (e.g., kernel dimensions), such that each event kernel's structure can change between invocations. We refer to this as a modifyable event kernel. The PGW is responsible for updating the event kernel metadata (KMD) structure with the optional metadata specified in the ESQe. However, dynamically configuring the event kernels increases the ISR runtime, as the ESQe are stored in global GPU memory. There is a one-to-one mapping between an ESQe and an ECQe, which the GPU uses to signal event kernel completion. The size of an ESQe is dictated by the number and size of the kernel parameters. The size of an ECQe is 4 bytes. The number of entries in the circular queues dictates the maximum number of in-flight events.

EDGE provides an API for registering an event kernel and allocating the ESQ (Table 5.1), which is described in more detail in Section 5.4.1. EDGE is responsible for allocating the ESQ in GPU memory based on the structure of the event kernel parameters and the maximum number of in-flight events, which are specified by the CPU during event kernel registration. A pointer to the ESQ is returned to the CPU. The CPU is then responsible for allocating the ECQ in external device memory, pre-allocating the GPU input/output buffers, assigning the corresponding ECQe pointers, and setting the kernel parameters (GPU buffer pointers and any constant parameters) in each ESQe. Note that this requires pre-allocating kernel parameters for all possible in-flight event kernels (the size of the ESQ). The CPU then communicates this information (ESQ and ECQ) with the external device that will launch the tasks on the GPU.

The external device maintains a single head and tail pointer for both the ESQ and ECQ to manage event kernel launching and completion (Figure 5.8). To launch an event kernel, the external device checks the status of the head entry in the ECQ. If free, the kernel parameters in the ESQe at the head can be filled with the data for the next event kernel. The external device then sends an interrupt to the GPU, which triggers the execution of the next event kernel, and increments the head. The queue is full if incrementing the head reaches the tail. The external device uses the tail to identify when an event kernel has completed and the output buffers contain valid data. Completion is indicated by Done in an ECQe, which is updated by the GPU through the ECQe pointer specified in the ESQe.
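From the external device's perspective, this submission and completion protocol might look like the following sketch, building on the hypothetical entry layouts above. The helpers send_msix_interrupt() and consume_output_buffers() are placeholders for device-specific operations and are not part of an existing API.

    // Hypothetical sketch of the external device's ESQ/ECQ handling.
    struct EventQueues {
        EventSubmissionQueueEntry *esq;   // ESQ entries in GPU memory
        int size;                         // number of entries (max in-flight events)
        int head, tail;                   // maintained by the external device
        int edge_event_id;                // event ID obtained at registration
    };

    void send_msix_interrupt(int eventId);                      // device-specific (assumed)
    void consume_output_buffers(EventSubmissionQueueEntry *e);  // application-specific (assumed)

    static int next_idx(const EventQueues *q, int i) { return (i + 1) % q->size; }

    bool submit_event(EventQueues *q, void **bufs, int nparams) {
        // Full if advancing the head would reach the tail, or if the head's
        // ECQ entry has not yet been reaped (set back to Free).
        if (next_idx(q, q->head) == q->tail ||
            q->esq[q->head].completion->status != ECQ_FREE)
            return false;
        for (int i = 0; i < nparams; i++)
            q->esq[q->head].params[i] = bufs[i];   // fill the pre-allocated parameter slots
        send_msix_interrupt(q->edge_event_id);     // trigger the next event kernel on the GPU
        q->head = next_idx(q, q->head);
        return true;
    }

    void reap_completions(EventQueues *q) {
        // The GPU marks the ECQ entry Done through the pointer stored in the ESQ entry.
        while (q->esq[q->tail].completion->status == ECQ_DONE) {
            consume_output_buffers(&q->esq[q->tail]);
            q->esq[q->tail].completion->status = ECQ_FREE;
            q->tail = next_idx(q, q->tail);
        }
    }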
After observing Done, the external device is responsible for consuming the output buffers, setting the ECQe back to Free, and incrementing the tail to the next in-flight event.

As described, the proposed structure of the ESQ and ECQ results in in-order task submission and completion; the external device cannot push a new request into the head of the ESQ until the corresponding tail entry in the ECQ is free. This limitation could be removed by adding complexity to the external device and GPU, for example, by having the GPU send interrupts to the external device to signal event completion. Note that the GPU is invoked only once a valid event has been configured at the head and an interrupt has been issued. As such, the GPU requires only a head pointer to keep track of the next ESQe to process, while the ECQe used to signal completion is specified in the ESQe.

Event Kernel Priority and Preemption

As described in the previous section, the PGWs begin execution via warp-level preemption when no free warp contexts are available to process the interrupt. However, event kernels launched by the PGWs are regular GPU kernels, which are usually much larger than a single warp and make use of the baseline GPU hardware task/thread schedulers. Consequently, if other kernels are currently occupying the GPU, the event kernels will be blocked until sufficient resources are available. Similar to existing host- and GPU-launched kernels, the priority of an event kernel is configurable. This can enable high-priority tasks to maintain a level of QoS in an environment where other GPU tasks run concurrently. Furthermore, EDGE may use many recently proposed preemption and context switching mechanisms, either existing [126] or research-based [33, 92, 143, 157, 162, 176], to enable a higher-priority event kernel to preempt a lower-priority host-launched GPU kernel or another lower-priority event kernel.

We evaluate two variations of CTA-draining preemption for event kernels, P1 and P2, which are modified versions of the mechanism proposed in [162]. These preemption mechanisms partially preempt a running kernel by blocking any non-event kernel CTA launches until enough resources are available to begin executing the event kernel. In P1, once the event kernel has scheduled all of its CTAs, any non-event kernel CTAs may be scheduled when enough resources are available. In P2, however, scheduling any non-event kernel CTAs is completely blocked until the event kernel completes. P2 therefore places a higher priority on the event kernel than P1.

Event Kernel API

Table 5.1 presents extensions to the GPGPU API (e.g., CUDA or OpenCL) required to support EDGE. These functions enable the CPU to register an event kernel with the GPU, configure the parameter memory for the ESQ, and trigger or schedule event kernel launches.

Table 5.1: EDGE API extensions.
    edgeEventId = edgeRegisterEvent<<<paramType0, ..., paramTypeN>>>(kernelPointer, &eventQueuePointer, gridDimensions, blockDimensions, sharedMemorySize, maxNumOfEvents, priority, isModifyable)
    edgeUnregisterEvent(eventId)
    edgeScheduleEvent(eventId)
    EDGE GPU MSI-X interrupt (eventId)
    edgeReleaseBarrier()

Registering an event kernel is similar to launching a kernel through the baseline CUDA API.
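As a concrete, hypothetical illustration of the host-side flow, the sketch below registers a simple packet-processing event kernel using the Table 5.1 extensions. The kernel, its parameter types, the dimensions, and the priority value are all invented for this example, and the buffer/ECQ setup is omitted.

    // Hypothetical host-side sketch using the EDGE API extensions from Table 5.1.
    __global__ void processPackets(char *pktIn, char *pktOut, int numPackets);  // example event kernel

    void registerPacketEvent() {
        void *esq = NULL;                  // event submission queue, allocated by EDGE
        int maxInFlightEvents = 32;

        // Delayed "launch": stores the kernel metadata (pointer, dimensions, shared
        // memory) and the parameter memory structure on the GPU.
        int eventId = edgeRegisterEvent<<<char*, char*, int>>>(
            (void *)processPackets, &esq,
            dim3(16, 1, 1), dim3(256, 1, 1),   // gridDimensions, blockDimensions
            0,                                 // sharedMemorySize
            maxInFlightEvents,
            /*priority=*/1, /*isModifyable=*/false);

        // The CPU then allocates the GPU input/output buffers and the ECQ, and fills
        // each ESQ entry with buffer pointers and constant parameters (omitted here).

        // Any MSI-X-capable device can now trigger the event kernel with eventId;
        // the CPU can also trigger or tear it down directly:
        edgeScheduleEvent(eventId);
        edgeUnregisterEvent(eventId);
    }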
The main differences are that the kernel launch is delayed by storing the kernel metadata structure on the GPU, which consists of the kernelPointer to specify the GPU kernel function to execute for this event, the kernel dimensions (gridDimensions and blockDimensions) to specify the shape of the event kernel, and the sharedMemorySize to specify the amount of shared memory required by the event kernel. Additionally, the parameter memory structure is provided instead of passing actual parameters, indicated by paramType# in the angle brackets. This enables EDGE to allocate the correct size for each ESQe. The priority of the event kernel is set through priority. The event kernel can also be marked as modifyable via the isModifyable flag, as described previously. The event registration allocates the ESQ as described in Section 5.4.1, with the total size of the queue being specified by the maximum number of in-flight event kernels (maxNumOfEvents) and the size of the event kernel parameters, and a pointer to it is returned in eventQueuePointer. The CPU is required to configure each ESQe accordingly, which consists of allocating the necessary CUDA buffers in GPU-accessible memory and updating the buffer pointers in the ESQ. All interactions with the ESQ are performed through generic memory accesses. Finally, the corresponding edgeEventId is returned, which is required to specify which event kernel to trigger on an interrupt, as described below. An event kernel and the corresponding ESQ can be freed through edgeUnregisterEvent(edgeEventId).

Once registered and configured, any device capable of sending MSI-X interrupts can trigger the event kernel by sending an interrupt message with the edgeEventId as the interrupt identifier. EDGE also provides an alternative path to trigger an event kernel directly from the CPU through edgeScheduleEvent(edgeEventId), which launches the event kernel corresponding to the edgeEventId using the ESQ interface. The function edgeReleaseBarrier() is described in Section 5.4.3.

5.4.2 EDGE Architecture

Returning to Figure 5.7, we present the modifications to the baseline GPU architecture required to support EDGE, indicated by the components in green. The main additional hardware components for supporting GPU-launched event kernels are the storage buffers for the Pending Interrupt Queue (13) in the L-IC, the Event Kernel Table (EKT) (15), and the Event Kernel Queue (EKQ) (19) in the KMU (4). The ESQ and ECQ (not shown) are stored in global GPU memory and external device memory, respectively, as previously described.

Each entry in the EKT (15) stores a pre-allocated event KMD (2), a single bit, M, to indicate if the event kernel is modifyable, and metadata describing the ESQ. The metadata consists of the ESQ-base pointer, the ESQ-head pointer, the ESQ-size, and the last ESQ entry, ESQ-end. ESQ-head is initialized to ESQ-base. When the G-IC transfers an interrupt to an L-IC (9), the corresponding event KMD is read from the EKT (15) and transferred (17) into the Pending Interrupt Queue in the L-IC (13). The ESQ-head is then automatically incremented by ESQ-size via a hardware adder (16) and stored back into ESQ-head in the EKT entry. ESQ-head is reset to ESQ-base when exceeding the allocated region, ESQ-end, which implements the circular ESQ logic on the GPU in hardware. The EKT simplifies the process of launching a kernel from the GPU, since the next event kernel is already configured at the next ESQ-head.
Storing the event's KMD locally in the Pending Interrupt Queue (13) enables multiple SMXs to process different instances of the same event kernel concurrently, since each SMX works on an event with unique parameter memory. Additionally, this enables the SMX to modify a local copy of the event KMD, if the event kernel is marked as modifyable, without requiring synchronization.

The EKQ (19) is a hardware queue in the KMU (4) responsible for queuing pending event KMDs until there is a free entry in the KD (5) to begin executing the event kernels. This queue is similar to the existing host- and device-launched queues [87]. There can be any number of EKQs, each with different priorities relative to other event, host, or device-launched queues.

In our proposed implementation, the EKT is modeled as a small on-chip buffer with a single read/write port. Each row contains five 8-byte values for the kernel pointer and parameter buffer pointers, six 4-byte values for the kernel dimensions, a 2-byte value for the shared memory size, and a single modifyable bit, for a total width of 529 bits. Assuming a size of 32 entries (enabling a maximum of 32 different event kernels), the total storage overhead of the single GPU EKT is 2.1 KB, less than 1% of a single SMX's register file. The EKT also requires a single 8 B adder and comparator to implement the circular buffer logic in hardware.

5.4.3 Wait-Release Barrier

Aside from indefinitely consuming the majority of GPU resources, an inefficiency with the persistent GPU threads (PT) style of programming is that GPU threads
continuously poll global memory to query for new tasks to perform or for responses to remote procedure calls (RPCs). If the incoming task rate is high or the RPC processing latency is low, polling can have little effect on efficiency. However, long waiting times for persistent CTAs (pCTAs) result in repeated unsuccessful reads from global memory, which can lower efficiency below that of traditional kernel launching. An attractive, and obvious, alternative to polling is to instead have the GPU threads block until work is available. This maintains the benefit of not requiring a full new GPU kernel launch when work is available, while removing the unnecessary reads to global memory, at the cost of increased latencies for identifying when work is available. Current GPUs support warp-level barriers, which block the execution of warps until all warps in a CTA have reached the barrier. However, there are currently no methods for blocking all warps in a CTA, or blocking GPU threads, until some external condition is met.

To address this, we propose the wait-release barrier, a special set of CTA barrier instructions that block all warps in a CTA at the wait barrier instruction (__wait_threads()) until a subsequent release barrier instruction (__release_threads()) is performed. While the wait-release barrier itself could be easily implemented in current GPUs, there is still the challenge of how to notify the CTA that it should be released if all of its warps are currently waiting at the barrier. This can be solved with the PGWs described in Section 5.3. A special interrupt vector is reserved for the wait-release barrier. Instead of continuously polling global work queues for new tasks, each persistent CTA (pCTA) can check the global work queue once and, if no tasks are available, block at a wait-release barrier.

Pseudo examples of a baseline persistent GPU thread implementation and a persistent GPU thread implementation using the wait-release barrier are shown in Example 1 and Example 2, respectively. With the wait-release barrier, the continuous polling of the in-memory work queue is guarded with the __wait_threads() barrier instruction. At some later time when work is available, the external device can configure the task in the global work queue and trigger a GPU interrupt to release any pending wait-release barriers to check for the new work. While not shown here, a PGW executes the corresponding __release_threads() instruction in response to the external event. The CPU can also release the wait-release barriers through the edgeReleaseBarrier() API shown in Table 5.1.

Example 1: Pseudo example of a baseline persistent GPU thread implementation.

    while (true) {
        if (tid == CTA_SCHEDULER_TID) {
            // Spin for work to become available
            while (no_work_available()) { }
            // Configure task for the pCTA
            ...
        }
        __syncthreads();

        // Specific kernel processing
        ...

        if (tid == CTA_SCHEDULER_TID) {
            signal_task_complete();
        }
        __syncthreads();
    }

Example 2: Pseudo example of a persistent GPU thread implementation using the wait-release barrier.

    __shared__ bool work_available = false;
    while (true) {
        // Wait for work to become available
        while (!work_available) {
            __wait_threads();  // Wait-release barrier
            if (tid == CTA_SCHEDULER_TID &&
                !no_work_available()) {
                work_available = true;
                // Configure task for the pCTA
                ...
            }
            __syncthreads();
        }

        // Specific kernel processing
        ...

        if (tid == CTA_SCHEDULER_TID) {
            signal_task_complete();
            work_available = false;
        }
        __syncthreads();
    }

Depending on the application, there may be multiple pCTAs waiting at different wait-release barriers at any given time. While there are opportunities to optimize the selection of which barriers are released and how many barriers to release per interrupt, we opt for simplicity by releasing all pending wait-release barriers (even across concurrent kernels). The application logic does not need to change beyond adding the __wait_threads() prior to querying the global work queue and minor restructuring to avoid SIMD deadlocks. When an interrupt releases all waiting pCTAs from a wait-release barrier, each pCTA must recheck the global work queue to see if a new task is available. Additionally, the external device logic does not need to increase in complexity by specifying a specific pCTA to release. However, a drawback of releasing all waiting pCTAs is additional accesses to the global work queue, since only a subset of the pCTAs will get to process the new task.

To avoid potential race conditions, in which an interrupt releases a wait-release barrier before a pCTA has hit the barrier, we implement the wait-release barrier as a level-sensitive barrier – the wait-release barrier is only relocked after all warps in a pCTA have passed through it. Thus, if a release operation occurs and no pCTA is waiting at a wait-release barrier, the barrier remains unlocked until a full pCTA passes through it, meaning that the pCTA will then check the global memory queue to find whether a pending task is available.

The wait-release barrier could be extended to support multiple different barriers. For example, similar to event kernels, wait-release barriers could be registered with the GPU and an ID could be returned.
One or more barrier IDs could be passed to the GPU kernel such that, within or between kernels, different CTAs could block on different wait-release barriers. The interrupt or edgeReleaseBarrier() API could be modified to include the wait-release barrier ID accordingly. However, this would require additional hardware to maintain the mappings between a given wait-release barrier ID and the CTAs waiting at that barrier, and to notify a subset of wait-release barriers. Improving the granularity of the wait-release barrier is left to future work.

5.5 Experimental Methodology

EDGE is implemented in Gem5-GPU v2.0 [146] with GPGPU-Sim v3.2.2 [17]. Gem5-GPU was modified to include support for CUDA streams, concurrent kernel execution, concurrent CTA execution from different kernels per SM from the CUDA Dynamic Parallelism (CDP) changes in GPGPU-Sim [174], and kernel argument memory using Gem5's Ruby memory system. The Gem5-GPU configuration used in this work is listed in Table 5.2.

Table 5.2: Gem5-GPU configuration.
    GPU SMX frequency: 700 MHz
    SMXs: 16
    Max threads per SMX: 2048
    Max CTAs per SMX: 64
    Register file size per SMX: 65536
    Shared memory size per SMX: 48 KB
    L1 cache size per SMX: 64 KB
    L2 cache size: 1 MB
    Memory configuration: Gem5 fused
    GPU warp scheduler (base): Greedy-then-Oldest
    Gem5 CPU model: O3CPU

We modified the baseline architecture to include a timing model for the G-IC and L-IC, PGW selection, victim warp flushing, the interrupt vector logic, and the interrupt service routine (ISR). The event kernel table is modeled as an on-chip buffer, which is read by the G-IC using the event kernel ID as the address to select an event KMD to send to an L-IC. Similar to CDP, event kernels are launched from the GPU into a separate event kernel hardware queue, which can specify different priorities relative to the host-launched kernel queues. However, unlike CDP, there is no need for the GPU to dynamically allocate argument buffers or configure the kernel to launch, since the event kernels are pre-configured by the host CPU and stored in the event kernel table. Additionally, there is no requirement to set up parent/child mappings for the event kernels as in CDP, which reduces the overheads of launching event kernels relative to child kernels in CDP. Gem5-GPU's CUDA runtime library was extended with the EDGE API in Table 5.1 to enable configuration and management of events on the GPU. The PGW communicates with the L-IC via load/store instructions to special reserved addresses.

The proposed wait-release barrier instructions are implemented as inline PTX assembly using the previously unsupported bar.sync # in GPGPU-Sim to model the wait barrier (bar.sync 1) and the release barrier (bar.sync 2). In a full system, we expect that these instructions would instead be implemented by modifying the CUDA compiler to support the proposed wait and release barrier instructions.

The benchmarks evaluated in this chapter are taken from Rodinia [31], two convolution kernels from Cuda-convnet [99] using a layer configuration similar to LeNet [102], and a GPU networking application, MemcachedGPU, as described in Chapter 4. The Rodinia benchmarks are Back Propagation (BACKP), Breadth-First Search (BFS), Heart Wall (HRTWL), Hot Spot (HOTSP), K-Means (KMN), LU Decomposition (LUD), Speckle Reducing Anisotropic Diffusion (SRAD), and Streamcluster (SC).
The size of the input data used in these benchmarks was selected to ensure that the GPU was fully utilized given each benchmark's implementation, such that an interrupt warp is not optimistically biased towards having free resources available due to undersized inputs. The convolution kernels from Cuda-convnet are filterActs_YxX_color and filterActs_YxX_sparse. Finally, the MemcachedGPU kernel evaluated is the GET kernel using 16 B keys, 2 B values, and 512 requests per batch. The hash table size is set to 16k entries and is warmed up with 8k SETs.

Any hardware measurements presented in this work are run on an Intel Core i7-2600K CPU with an NVIDIA GeForce GTX 1080 Ti, using CUDA 8.0 and driver v375.66. CPU timing measurements were recorded using the TSC registers and GPU power was measured using the NVIDIA Management Library (NVML) [130].

5.6 Experimental Results

This section first evaluates the interrupt warp architecture and then evaluates how the PGWs can be used to initiate the execution of event kernels and support the wait-release barriers.

5.6.1 GPU Interrupt Support

PGW Resource Requirements

Like any other warp, a PGW requires a hardware context to execute, such as registers and a program counter. If the GPU is underutilized by any currently running kernels, there may be enough free hardware resources (warp contexts) available to execute the PGW without requiring that a running warp be preempted. GPUs have a range of resources (registers, thread contexts, shared memory) that are used in varying amounts by warps and CTAs. The amount of parallelism achievable on the GPU depends on which resources have the highest demand relative to the amount available. Previous work has shown that applications may underutilize GPU resources, which can benefit GPU multiprogramming through resource partitioning/sharing [176, 181]. We also measure the same behavior for many GPU applications. As a result, a GPU application may not be able to schedule an additional CTA due to insufficient resources; however, the PGW's limited resource requirements may still allow it to make use of the remaining fragmented resources. For example, assume we have a GPU kernel X that uses a large amount of shared memory but does not require many threads. The GPU CTA schedulers dispatch CTAs to SMXs until launching an additional CTA from X would exceed the shared memory size. The remaining CTAs from X must block until previous CTAs complete. Even though X is blocked, there are still unused warp contexts and registers available, which can be used by other GPU applications with low shared memory requirements. To measure this opportunity, we first profiled the Rodinia and Convolution benchmarks to identify the fraction of cycles during which there are enough resources to support a PGW (a free PGW context that does not require preempting a running warp), shown in Figure 5.9. As shown, the fraction of cycles a free PGW context is available varies between applications, with the average around 50%.

Figure 5.9: Percentage of cycles a free warp context is available for a PGW. This limits the amount of time a warp must be preempted to schedule the PGW.

Figure 5.10: Average register utilization of Rodinia benchmarks on Gem5-GPU. Register utilization is measured for each cycle and averaged across all cycles of the benchmark's execution.
While not shown here, a cycle-by-cycle analysis highlights that a kernel goes through multiple phases with varying resource utilization, for example, as warps from a CTA begin to complete and a subsequent CTA has not yet been launched. These results highlight that there are sufficient opportunities to exploit underutilized resources for executing PGWs instead of preempting a running warp or requiring additional dedicated hardware to service the interrupt.

Furthermore, while an application may use all of the available thread contexts, it may still underutilize the large per-SMX register files, which limits the amount of state that must be saved on preemption. Figure 5.10 shows the average register utilization (averaged over all cycles of the benchmark's execution) for the Rodinia and Convolution benchmarks on the Gem5-GPU simulator. We find that the evaluated benchmarks use ∼60% of the available registers. A PGW only requires 0.4% of an SMX register file (8 registers × 32 threads/warp = 256 of 65536 registers, assuming a full warp's worth of registers needs to be reserved for the single thread in the PGW), which provides an opportunity to utilize the free registers instead of requiring register saving/restoring in the software ISR.

PGW Preemption Latency and Runtime

A key design goal for supporting interrupts on any system is low and predictable latency. However, this can be challenging to achieve when preemption is required and the to-be-preempted victim warp can be in any given state of execution in the pipeline. As such, the latency to preempt a running victim warp must be minimized. For any of the dedicated interrupt warp policies (Section 5.3.2), the preemption latency is zero, since a free interrupt warp context or hardware unit is always available to begin processing the interrupt immediately. Similarly, if EDGE includes a small hardware unit to launch the event kernel in response to an external event/interrupt, instead of using a PGW, the event kernel can be immediately transferred to the Event Kernel Table ((15) in Figure 5.7). However, preempting a running warp requires flushing the victim warp's instructions from the pipeline. For ALU operations, as modeled in GPGPU-Sim, this can range from 4 to 330 cycles depending on the operation being performed and the precision. Long-latency memory operations, for example, load instructions that miss in the cache and must access DRAM, can take on the order of thousands of cycles. Naively waiting for the victim warp to flush the pipeline results in unacceptably high scheduling latencies. Figure 5.11a measures the average victim warp preemption latency for interrupts that require preemption. We evaluate two victim warp selection policies, Newest and Oldest, where the most or least recently launched warp is selected for preemption, respectively. As our baseline GPU warp scheduler uses a greedy-then-oldest (GTO) policy, the oldest warp selection technique tends to have a much lower preemption latency (∼4× lower) because the oldest warps are prioritized over the newest warps and are the first to complete.
As such, the PGW selection technique alone can have a large impact on scheduling latency, which ranges from 2k-8k cycles on average (or 2.9-11.4 µs at 700 MHz) for the two warp selection policies. These are relatively high latencies before the PGW is even able to begin executing the ISR.

Figure 5.11: Interrupt warp preemption stall cycles. (a) Average preemption stall cycles without the victim warp flushing optimizations. (b) Breakdown of preemption stalls for the two victim warp selection policies (left: Newest, right: Oldest).

Figure 5.11b measures the breakdown of causes for scheduling stalls. Pipeline means that a warp instruction is in the pipeline, Scoreboard means that the instruction has registers reserved in the scoreboard, Load indicates the warp is waiting for in-flight loads, I$-Miss an instruction cache miss, and Barrier and Mem Barrier indicate that warps are waiting at thread or memory barriers, respectively. These scheduling stall conditions are not mutually exclusive. For example, an in-flight load instruction also has a register reserved in the scoreboard. The trend for the Oldest policy is similar across benchmarks – the selected victim warp is typically waiting for load instructions to complete. The Newest policy leads to more diverse outcomes, such as waiting at thread barriers. This is a result of some evaluated benchmarks having thread barriers near the start of the kernel. For example, the initial portion of the CONV1/2 benchmarks loads data from global memory into shared memory and then waits at a barrier. Selecting the newest warp tends to fall into this phase of the kernel, whereas the oldest warps are performing the convolution operation on the data in shared memory. Evaluating methods to optimize the warp scheduling to reduce preemption latency is an area for future work.

Figure 5.12 measures the reduction in preemption latency when applying each of the optimization techniques for victim warp flushing discussed in Section 5.3.3. Note the log-scale y-axis. The average preemption latency is reduced by 35.9× and 33.7× for the Oldest and Newest policies, respectively. At 700 MHz, the preemption latency is ∼70 ns (50 cycles) and ∼314 ns (220 cycles), respectively. For Oldest, the largest benefit comes from dropping and replaying load instructions, which is explained by the large fraction of stalls on pending loads in Figure 5.11b. For Newest, many benchmarks are also stalled on barrier instructions. Only applying the barrier skipping, however, does not result in a large reduction in Figure 5.12 because BFS and SC, which are not blocked on thread barriers and have large preemption latencies, dominate the average calculation. After reducing the preemption latency with the other optimizations, such as replaying loads, thread barriers contribute a larger share of the average preemption latency, which is highlighted by the large improvement with all optimizations applied.

Along with the PGW scheduling latency, the ISR runtime should be minimized to quickly process the interrupt and reduce the impact on the performance of any concurrently running tasks.
We find that by prioritizing the PGW's instruction fetch and warp scheduling over other concurrently running kernels, the ISR runtime can be reduced by 55% on average. However, depending on the rate and duration of interrupts, prioritizing a PGW can negatively impact the performance of other concurrently running kernels.

Figure 5.12: Preemption stall cycles with the victim warp flushing optimizations applied, averaged across the Rodinia and Convolution benchmarks. Base is the baseline preemption latency without any optimizations applied (Figure 5.11a). Barrier Skip immediately removes a victim warp if it is waiting at a barrier. Victim High Priority sets the victim warp's instruction fetch and scheduling priority to the highest. Flush I-Buffer flushes any pending instructions from the victim warp's instruction buffer. Replay Loads drops any in-flight loads from the victim warp and replays them when the victim warp is rescheduled after the ISR completes.

Impact of PGW on Concurrently Running Kernels

Figure 5.13 measures the impact of running a PGW on the IPC of concurrently executing kernels under the different PGW selection policies (Section 5.3.2), different interrupt rates (Slow/Fast), and different ISR durations (Short/Long). Slow and Fast interrupts are generated every 10k and 1.5k GPU cycles, respectively. In these experiments, the ISR runtime is varied by iterating a delay loop; additionally, the ISR does not actually launch an event kernel. With the delay loop, the average measured runtimes for Short and Long are ∼15× and ∼212× longer than the actual average ISR runtime measured in Section 5.6.2, respectively. All preemption and runtime optimizations described earlier are applied to the victim warp and PGW.

Figure 5.13: Average impact of the PGW selection, interrupt rate, and ISR duration on concurrent tasks' IPC (x-axis labels are <interrupt rate>-<interrupt duration>).

When the ISR is short, the PGW has very little impact on the background task, regardless of the selection policy. However, reserving a warp for the PGW in the Reserved policy, and hence reducing the parallelism available to other tasks, does have a relatively higher impact on running tasks. When the ISR runs for a much longer time, the impact on the background task becomes much larger. The Dedicated and Reserved policies have slightly less of an impact on the background task than the preemption policies (Oldest/Newest), since they do not stall a running warp for the duration of the ISR. Even though the Reserved policy permanently reserves a warp context, all of the background task's warps are able to make progress while the PGW is running. These results highlight that the impact on a background task is due more to sharing execution cycles with the prioritized PGW than to stalling the execution of a given warp. Furthermore, since the ISR runtime for Short is 15× longer than the measured actual ISR, the PGW and ISR have little impact on the performance of concurrently running kernels.
As such, from the perspective of concurrently running applications, there is little benefit to having dedicated resources for the PGW.

5.6.2 Event Kernels

In this section, we evaluate event-driven kernels in EDGE by highlighting a multiprogrammed use-case for EDGE in which the GPU executes long-running kernels for training a deep neural network (DNN), while also servicing a higher-priority network kernel, such as MemcachedGPU. Specifically, we evaluate launching multiple Memcached kernels, MEMC, through the EDGE flow at varying request rates, while performing a single Convolution kernel, CONV1, through the standard CUDA flow. We evaluate different prioritization techniques, P1 and P2 (Section 5.4.1), for sharing the GPU between the low- and high-priority kernels.

In addition to the different event kernel priorities, P1 and P2, we also evaluate different hardware reservation techniques for the event kernels to spatially share the GPU resources between concurrent kernels, such as reserving CTAs or entire SMXs. We first measure the impact on the IPC of the CONV1 kernel when reserving hardware resources only for event kernels, but without running any event kernels. SM-N reserves N SMXs for an event kernel, which cannot be used by the CONV1 kernel. CTA-N reserves N CTAs per SMX for the event kernel. Finally, I$-Reserve reserves entries for the ISR in the instruction cache (Section 5.3.4). Figure 5.14 highlights that there is negligible impact on performance when reserving the ISR i$ lines, reserving 1-2 CTAs for the event kernel, or even reserving up to 6 SMXs. Rogers et al. [149] have identified similar behavior, where performance can actually increase when reducing a kernel's parallelism (by limiting the number of active warps per SMX) due to improved cache access patterns. However, as we continue to decrease the number of SMXs available to the CONV1 kernel, the performance starts to drop. With 50% of the SMXs, CONV1's performance drops to 85%. Note that the performance of CONV1 does not drop linearly with the amount of compute resources removed, which suggests that CONV1 is memory bound.

Figure 5.14: Impact on the Convolution kernel's IPC when reserving resources for the PGW and MemcachedGPU GET kernel.

Figure 5.15: Average ISR runtime with CONV1 and MEMC kernels at varying MEMC launch rates relative to a standalone ISR. The runtime is measured with and without reserving the instruction cache entries for the ISR (i$-Reserve) and for both the P1 and P2 event kernel priorities. The ISR requires three entries in the instruction cache.

In the following experiments, we launch multiple MEMC kernels at different packet rates, 2.5 MRPS and 4 MRPS, while running a single background Convolution kernel, CONV1. Figure 5.15 measures the ISR runtime slowdown for the different launch rates and priority levels, with and without reserving the ISR's three instruction cache lines, relative to the ISR running in isolation. Without reserving the instruction cache entries for the ISR, the PGW ISR runtime is significantly increased (∼100-200×) relative to running in isolation.
A large contributor to this slowdown is that the multiple concurrently running kernels occupy the instruction cache, continuously evicting the ISR's cache lines. However, when reserving entries in the instruction cache for the ISR, the ISR runtime for P1 is reduced by 20-45x depending on the packet rate. With P2 at 4 MRPS, the higher packet rate causes MEMC kernels to be launched more quickly (and to overlap), which limits the amount of time the CONV1 kernel can execute, hence reducing contention on the instruction cache. However, at 2.5 MRPS, CONV1 is able to resume execution, which results in an 80x reduction in ISR runtime when reserving instruction cache entries. This data highlights that reserving instruction cache entries for the ISR is necessary to improve ISR performance when there is contention for shared resources, such as the memory system, between concurrent tasks. Since the ISR does not have a large code footprint, a small on-chip buffer could also be included to store the ISR's code to avoid reducing the instruction cache capacity available to other GPU kernels.

Figure 5.16: Runtime of the Convolution and Memcached kernels at different request rates and resource reservation techniques, relative to the isolated kernel runtimes, and average normalized turnaround time (ANTT).

Figure 5.16 measures the performance of MEMC and CONV1 when launching multiple MEMC tasks under the different launch rates (2.5 and 4 MRPS), priority levels (P1 and P2), and event kernel resource reservation techniques, relative to a single MEMC and a single CONV kernel run in isolation. For example, CONV 2.5 MRPS measures the performance of the CONV1 kernel (relative to CONV1 in isolation) while MEMC is running at 2.5 MRPS (relative to MEMC in isolation). We also measure the average normalized turnaround time (ANTT) [50] of the CONV1 and MEMC kernels. There is a large variation in the throughput of MEMC kernels per Convolution kernel, ranging from 2-10 for 2.5 MRPS and 2-70 for 4 MRPS across the different priority policies. The performance of CONV1 also varies widely. For example, SM-1 at P2 runs 70 MEMC kernels before CONV1 completes, but at a 13.1x slowdown for CONV1. The large slowdowns for CONV1 with P2 are caused by the higher priority placed on the MEMC kernel. Whenever a MEMC kernel is pending or running, the CONV1 kernel is blocked. At 4 MRPS, the MEMC kernels arrive at the GPU more frequently, which significantly reduces the CONV1 kernel's ability to run.

The ANTT takes into account the number of kernels that execute on the GPU concurrently and the slowdown of each kernel relative to being run in isolation. An ANTT of 1 is ideal. With P1, the CONV1 kernel is able to continue scheduling CTAs alongside MEMC, which impacts the MEMC throughput drastically, resulting in a high ANTT between 2.9 and 6. However, as the reserved resources are increased and higher priority is given to MEMC (P2), we measure a large reduction in ANTT to 1.44 at BASE P2 and 1.15 at SM-8 P2. The significant improvement in ANTT between the P1 and P2 priorities indicates that spatial multitasking alone may not be enough to maximize the overall system throughput due to contention on other shared resources, such as the memory system.
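For reference, the ANTT values reported above follow the standard definition [50]: the average of each kernel's normalized turnaround time, i.e., its runtime when sharing the GPU divided by its runtime in isolation (the notation below is ours):

\[ \mathrm{ANTT} = \frac{1}{n} \sum_{i=1}^{n} \frac{T_i^{\mathrm{shared}}}{T_i^{\mathrm{isolated}}} \]

where n is the number of co-scheduled kernels; an ANTT of 1 indicates that no kernel experiences any slowdown.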
This contention is highlighted in Figure 5.16, where both CONV1 and MEMC are memory-bound applications. For example, even with SM-8 in P1, the ANTT is ~2x higher than BASE in P2. While SM-8 reserves half of the GPU's SMXs for MEMC, the P1 priority allows the CONV1 CTAs to continue running on the remaining 8 SMXs while MEMC is running, which slows down both the CONV1 and MEMC kernels. In P2, however, all of the CONV1 CTAs are blocked until the MEMC kernel completes, which, even with no resources reserved for the MEMC kernel, enables the MEMC kernel to complete faster.

These results highlight the potential to maintain a high quality of service (QoS) for high-priority event kernels while concurrently processing a lower-priority background application to improve GPU utilization. Additionally, with EDGE, the CPU is not required to manage the launching or completion of the event kernels, freeing it up to work on other tasks or enter a lower power state.

5.6.3 Wait-Release Barrier

Finally, we evaluate the second use-case of the PGWs for event-driven GPU execution: the wait-release barrier. Specifically, we evaluate the benefits provided by the wait-release barrier compared to a persistent thread (PT) environment for controlling GPU execution. For these experiments, we set up a PT framework with local CTA-level work queues, where each persistent CTA (pCTA) has its own entries in the work queue to avoid any contention with other pCTAs. There are 16 pCTAs in total, one per SMX. Each pCTA is responsible for checking the work queue for available work and then performing the corresponding kernel. For this experiment, we use an empty kernel with no actual processing, such that only the PT and PGW overheads are measured. For PT, each pCTA continuously polls the work queue in global memory for new work. We then modified the PT framework to include a wait-release barrier prior to checking the memory queue for work in the polling loop. Each pCTA blocks on a wait-release barrier until an interrupt is received, which releases all blocked pCTAs. Once released, each pCTA checks the work queue for work and continues to process the empty kernel if work is available, or returns to the wait-release barrier if not. Note that this naive implementation releases all pending wait-release barriers (16 pCTAs), which results in a larger number of queries to the work queue than necessary. A sketch contrasting the two polling strategies is shown below.

Due to the polling nature of PT, when no work is available, we expect an increase in the total global load memory traffic and in the number of dynamic warp instructions used to query the work queue. Figure 5.17 measures the reduction in global load traffic when guarding the work queue with a wait-release barrier for three different interrupt coalescing sizes, where an interrupt is generated for every 1 (WRB-1), 8 (WRB-8), and 32 (WRB-32) pCTA tasks inserted into the global work queue. We also vary the launch frequency of each pCTA task from as fast as possible (Launch delay == 0) to a larger delay (Launch delay == 100). Here, the delay corresponds to the number of iterations of an empty delay loop performed on the CPU before launching the next GPU task.

Figure 5.17: GPU global load instructions issued relative to Persistent Threads (PT).
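To make the two configurations concrete, the following simplified sketch contrasts a pCTA's work loop under the baseline PT framework and under the wait-release barrier. The WorkQueue layout, empty_task(), and wait_release_barrier() names are illustrative placeholders rather than the code used in our evaluation; in particular, wait_release_barrier() stands in for the proposed EDGE primitive and is not an existing CUDA intrinsic.

// Hypothetical sketch of the two pCTA work loops (not the evaluated framework).
__device__ void wait_release_barrier();  // proposed EDGE primitive; released by a PGW

struct WorkQueue {
  volatile int head;  // next entry to be produced (written by the CPU)
  int tail;           // next entry to be consumed by this pCTA
};

__device__ void empty_task() { /* empty kernel body: only overheads are measured */ }

// Baseline persistent-thread (PT) loop: a single warp per pCTA spins on its
// private work queue in global memory, issuing a global load every iteration.
__device__ void pt_loop(WorkQueue* q, volatile int* exit_flag) {
  while (!*exit_flag) {
    if (q->tail < q->head) {
      empty_task();
      q->tail++;
    }
  }
}

// Wait-release-barrier version: the pCTA blocks at the barrier until a PGW
// releases it in response to an interrupt, and only then queries the queue.
__device__ void wrb_loop(WorkQueue* q, volatile int* exit_flag) {
  while (!*exit_flag) {
    wait_release_barrier();       // all warps in the pCTA block here
    while (q->tail < q->head) {   // drain any tasks made visible by the release
      empty_task();
      q->tail++;
    }
  }
}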
When the launch frequency is high, triggering an interrupt to release the wait-release barrier for every pCTA task results in additional global loads, due to the PGW running on all 16 SMXs to release the blocked pCTAs when only a single task is available to process. When the launch frequency is decreased or the wait-release barrier request coalescing is increased, the wait-release barrier results in a significant decrease in global load traffic. For example, even when the CPU is pushing tasks to the GPU as quickly as possible in a loop, triggering an interrupt every eight tasks results in a 20% reduction in global load traffic. Furthermore, when a delay is introduced between task launches, having an interrupt per task results in a large reduction in global load traffic. As the launch delay is increased, the impact of coalescing interrupts on the total task latency is reduced further. The main takeaway from this experiment is not the absolute value of the relative improvements, but instead the ability of the event-driven programming model and wait-release barriers to reduce the GPU global load traffic under variable task rates.

Figure 5.18: Dynamic warp instructions issued relative to Persistent Threads (PT).

In addition to global load traffic, blocking the pCTAs from continuously polling will impact the number of dynamically executed instructions. Figure 5.18 measures the effect on dynamically issued warp instructions on the GPU for the same experiment as above. In the baseline PT, each pCTA has a single warp spin-looping on the work queue. However, the wait-release barrier stalls all warps in the pCTA when waiting at the barrier, which can reduce the number of unnecessary warp instructions used for polling. This can be seen in the large reductions in the number of dynamic instructions when the launch delay and interrupt coalescing are increased. Similar to the number of global loads, there can be an increase in the number of dynamic warp instructions with the wait-release barrier when the launch frequency is high or the interrupt coalescing is low, because more PGWs run the ISR on the GPU than there are tasks being launched. This is very apparent with WRB-1 at 0 launch delay, which increases the total number of dynamic warp instructions by 13.7x compared to PT, since there is no kernel processing and 16x the number of required PGWs are launched for every new task.

An obvious enhancement to the current wait-release barrier implementation would be to selectively release only the number of pCTAs required to service the new tasks, or to release the specific pCTA waiting for the result of an RPC. This would require modifying the PT framework to have a shared global work queue with synchronization locks, as well as a global structure for recording which pCTAs are currently blocked at a wait-release barrier as candidates to release on an interrupt. Evaluating these enhancements is left to future work.

Lastly, the use of the PGW to release the wait-release barriers increases the latency for the pCTA to identify when a task is available to process.
This increase in latency is equal to the PGW ISR runtime, which, as shown in Section 5.6.2, depends on the other kernels running concurrently with the ISR.

5.7 Summary

This chapter proposes EDGE, a novel event-driven programming model consisting of a set of hardware mechanisms and software APIs that can increase the independence of specialized processors, such as GPUs, from the CPU. EDGE makes use of a fine-grained warp-level preemption mechanism, which initiates the execution of privileged GPU warps (PGWs) that can launch work internally on the GPU from any device in a heterogeneous system without involving the CPU. This can help to reduce the task launching latency from an external device and free up the CPU to either work on other tasks or enter a lower power state to improve efficiency. The PGWs can also be used to support a new form of CTA barrier, the wait-release barrier, which enables running CTAs to block indefinitely until a PGW releases the barrier in response to some external event. This is useful, for example, to reduce the polling overheads of the persistent GPU thread style of programming.

The PGW requires GPU resources to operate. We evaluated several techniques for efficiently selecting, flushing, and preempting a warp from concurrently running applications, such that the time to schedule and process an interrupt with a PGW is minimized. On a set of GPU benchmarks, we found that there is a free warp context available for a PGW 50% of the time, on average. When no free warp contexts are available, we showed that the evaluated flushing optimizations can reduce the average PGW preemption latency by up to ~36x (to <50 cycles) on the Gem5-GPU simulator.

While traditional GPU kernels require a host CPU to configure and launch each kernel, with EDGE, the host CPU pre-allocates and configures GPU event kernels a single time and utilizes the PGWs as a mechanism to internally launch any subsequent GPU tasks via standard memory operations and interrupts. By removing interference between the CPU and GPU, we estimate that EDGE can increase CPU throughput by 1.17x and GPU networking throughput by 3.7x when running SPEC2006 on the CPU in parallel with a networking application, MemcachedGPU, on the GPU.

Chapter 6
Related Work

This chapter discusses and contrasts the key related works with the research presented in this dissertation. This chapter is partitioned into two high-level sections, which describe related work on using the GPU to accelerate server-based applications and network processing in Section 6.1, and related work on event-driven GPU execution and improving GPU system support in Section 6.2.

6.1 Related Work for Accelerating Server Applications and Network Processing on GPUs

The work performed in Chapter 3 and Chapter 4 focuses on improving the performance and efficiency of Memcached, a server-based application, using GPUs, and on using GPUs to accelerate network processing. This section is further partitioned between these two categories.

6.1.1 Memcached and Server-based Applications

Many prior works have looked at improving the throughput and scalability of Memcached, or other general key-value store applications, through software or hardware modifications. While much of the focus in this dissertation is on Memcached, one of the long-term goals of our research is to provide a general framework for accelerating high-throughput network services on GPUs.
Many of these works proposeoptimizations that are complementary to our work and can be evaluated to pro-174vide additional improvements to the performance and efficiency of Memcached onGPUs using GNoM. This section summarizes and contrasts a set of key relatedworks with this dissertation.Concurrent with our work in Chapter 3, Berezecki et al. [24] presents a many-core architecture, the Tilera TILEPro64 64-core CPU, used to accelerate Mem-cached. In their work, different parts of Memcached are modified to run on indi-vidual processors, such as network workers, hash-table processes, TCP and UDPcores, and the operating system itself. Although the focus of their work and ours issimilar (i.e., accelerating Memcached on a many-core architecture), our work dif-fers in the method of achieving this goal, focusing on the feasibility of running suchan application on a GPU and providing a detailed characterization of Memcachedon hardware and on a simulator.Andersen et al. [11] propose a log-structured data store system that utilizeslower power/performance (wimpy) CPUs and flash memory to maintain perfor-mance and reduce power consumption for key-value store applications. This iseffective for key-value store applications, such as Memcached, in which largeamounts of computation are replaced with long I/O operations and network la-tencies that are not significantly affected by low clock frequencies. The use ofmultiple wimpy cores with an optimized memory system to improve the efficiencyof memory-bound applications is also similar to the use of GPUs in our work,which contain large numbers of small in-order cores and high-bandwidth mem-ory. However, the GPU architecture is able to provide high levels of computationalthroughput for structured compute-bound applications, whereas the wimpy coresmay not be able to efficiently scale-up performance for other types of compute-intensive applications.Wiggins et al. [177] improve the scalability of Memcached on CPUs through aset of software enhancements. This work removes the global cache lock overheadsin Memcached. Specifically, this work evaluates striped locking on the baselinehash-chaining hash table and a bagging LRU mechanism to replace the global lockwith atomic CAS instructions. Our work also proposes removing the global lockson the hash table and LRU management; however, our work evaluates an alternativeset-associative hash table and per-set shared/exclusive locking mechanism, whichreduces complexity and significantly improves performance and energy-efficiency175on the GPU architecture.Fan et al. [52] evaluate a set of software enhancements to improve the perfor-mance and memory efficiency of Memcached on a CPU. Memc3 proposes a noveloptimistic cuckoo hashing mechanism with set associative cuckoo entries, whichimproves memory efficiency over the baseline Memcached and efficiently supportsconcurrent accesses to the hash table. This is achieved by separating the cuckoopath discovery phase from the update phase and only updating a single key moveat a time, removing the opportunity for false misses when reorganizing the hashtable. In Memc3, writes to the hash table are serialized, but reads can operateconcurrently with writes. This enables a set of optimizations to further improvethe performance of Memcached, such as replacing the global LRU queue with anapproximate LRU based on the CLOCK replacement algorithm, and the use ofoptimistic locking to avoid expensive synchronization overheads on shared datastructures. 
In our work, we also evaluate modifications to the hash table, proposinga more GPU-friendly set-associative hash table, which maintains similar read per-formance but improves write performance. Furthermore, our set-associative cacheuses per-set exclusive locks on writes, which supports concurrent write requests.Our work also evaluates approximate LRU policies, which consider local LRU perhash table set instead of per memory slab, reducing the complexity compared toMemc3. MemcachedGPU borrows an optimization from Memc3, which comparessmall hashes of the key to avoid expensive key comparisons to reduce branch diver-gence among GPU threads. Overall, MemcachedGPU improves the read requestthroughput over Memc3.Lim et al. [106] propose a state of the art in-memory key-value store applica-tion, MICA, on CPUs. MICA rethinks the design of a key-value store application,such as Memcached, to improve scalability under parallel access. MICA achievesvery high throughputs on the order of 75 MRPS for 8B keys (half the minimum sizeevaluated in Chapter 4), using two 8-core CPUs with four 10 Gbps NICs (eightports). MICA partitions keys across cores to improve locality and reduce interfer-ence between parallel cores. MICA can operate in exclusive-read, exclusive-writemode, which removes the need for synchronization as only a single core operateson a partition, or concurrent-read, exclusive write mode, which uses optimisticlocking to reduce the synchronization overheads. The value storage is completely176redesigned from Memcached, which includes an efficient append only log struc-ture, similar to log structures used in flash memory (e.g., FAWN [11]), for theirlossy key-value store design and efficient memory allocators with segregated fitsfor their lossless key-value store design. Similar to MemcachedGPU, the hashtable is implemented as a set-associative cache. Also similar to MemcachedGPU,MICA reduces the UDP networking overheads through direct NIC access and user-level network processing (using Intel DPDK [79]). Furthermore, to take advantageof the sharded key-space across cores, MICA uses RSS support in modern NICsto direct requests to the specific core responsible for that key. However, this re-quires the client application to be modified to include knowledge of the server-sidekey-space partitioning; dropping throughput by 55-60% without this optimization.MICA represents the high-end for CPU performance on key-value store applica-tions, taking a holistic approach to redesign all aspects of the application to takeadvantage of multiple different hardware and software features available on con-temporary CPUs and NICs. In contrast, our work explores how the GPU, a strangerto this type of streaming network application, can be used to achieve high levelsof performance and efficiency, while minimizing the modifications to Memcached(e.g., only the hash table and eviction mechanisms are modified – the value storage,hashing mechanism, Memcached protocol, and client application remain the same).We believe that by taking a similar approach of fully redesigning a key-value storespecific to a GPU, as well as including additional NIC and GPU hardware, the per-formance and efficiency of MemcachedGPU could be significantly improved. Thisis left to future work.Blott et al. [26] evaluate the potential for accelerating Memcached on FPGAs.Other works by Lim et al. [108] and Chalamalasetti et al. [30] also evaluate usingFPGAs for Memcached; however, Blott achieves the highest performance. 
In [26],the FPGA is able to achieve 10Gbps processing for all Memcached key sizes,extremely low latencies under 4.5 uS per request, and high efficiencies of 106.7KRPS/W. This is achieved by pipelining requests through a dedicated hardwarepath on the FPGA architecture. Given the low request latency and high efficiency,the FPGA is a strong candidate for accelerating such workloads. Similar to Mem-cachedGPU, this work proposes a set-associative based hash table with local LRUmanagement on the FPGA. Istvan et al. [85] expands on the description of the177FPGA hash table. However, in Chapter 4 we show that we can achieve simi-lar throughputs at significantly lower costs on a lower power GPU. Additionally,while the FPGA architecture enables high performance and energy-efficiency, theflexibility of the general-purpose GPU architecture (e.g., ease of programming,multitasking) may outweigh some of the efficiency gains in a datacenter environ-ment, where workload consolidation is necessary to improve server utilization andreduce costs.Nishtala et al. [124] describe enhancements to improve Memcached at Face-book. As presented, their Memcached deployment is the largest in the world,responsible for handling billions of requests per second. They propose multipleenhancements to improve both the scalability of Memcached across servers, aswell as improve the performance of an individual server. Many of the scalabil-ity improvements are complementary to our work, such as reducing the requestlatency through request batching (multi-GET), request parallelization (multipleMemcached servers), client sliding request windows (similar to TCP), replicationacross pools of Memcached servers (within and across regions) to reduce latencyand improve efficiency and consistency, or adaptive slab allocators for item valuestorage. Other enhancements, such as using UDP over TCP for GET requests andusing fine-grained locking mechanisms, are explored in our work.Other works, such as Jose et al. [88, 89] and Dragojevic et al. [45], identifylimitations with the networking overheads in Memcached and evaluate the use ofRDMA over Infiniband and Ethernet network hardware, respectively, to improveMemcached performance. Jose et al. [88] propose an efficient unified commu-nication runtime (UCR), containing enhancements to the RDMA connection es-tablishment and support for active messaging over RDMA, to enable middlewareapplications to efficiently utilize RDMA capable interconnects. They then proposemultiple modifications to Memcached to support UCR on Infiniband NICs. Joseet al. [89] propose a hybrid reliable connection and unreliable datagram protocolon top of UCR, and a set of enhancements to support the hybrid communicationmechanism. Similar to the use of UDP over TCP, they find that the performanceand scalability of Memcached with Infiniband can be improved with such hybridcommunication mechanisms. FaRM [45] propose multiple modifications to Mem-cached to support RDMA primitives and distributed memory management across178a cluster of RDMA enabled Memcached servers on Ethernet hardware. FaRM alsoaddresses locking issues in Memcached. Similar to our work, FaRM evaluates anew type of hash table (associative hopscotch hashing) to better support the un-derlying hardware changes. 
In contrast to our work, which reduces the networkoverheads by accelerating UDP network processing on the GPU, these works or-thogonally reduce the network overheads through the use of different networkingmechanisms/hardware.Following our initial analysis of Memcached on GPUs in Chapter 3, otherworks have evaluated using GPUs to accelerate Memcached. For example, Dim-itris et al. [44] propose Flying Memcached, which uses the GPU for performing thekey hash, while all other processing remains on the CPU. Flying Memcache alsoaddresses the networking bottlenecks in Memcached, achieving sizeable improve-ments in throughput by performing the UDP network processing in user-space onthe CPU. In contrast, our work performs all Memcached GET request and UDPnetwork processing on the GPU, addressing many of the challenges associatedwith a full system implementation of a GPU networking application. Furthermore,our work improves throughput and efficiency over Flying Memcache.Concurrent with the work in Chapter 4, Zhang et al. [184] also propose usingGPUs to accelerate in-memory key-value stores. They use two 8-core CPUs, twodual-port 10 GbE NICs (max. 40 Gbps), and two NVIDIA GTX 780 GPUs toachieve throughputs over 120 MRPS. In contrast to MemcachedGPU, these resultsuse a smaller minimum key size (8B vs. 16B), use a compact key-value protocolindependent from the Memcached ASCII protocol, perform the hashing functionusing AES SSE instructions, and batch multiple requests and responses into sin-gle network requests through multi-GETs to reduce per-packet network overheads.However, for GET requests, the GPU is only used to perform a parallel lookupfor the corresponding CPU key-value pointer in a GPU Cuckoo hash table. Allother processing, including UDP network processing, request parsing, key hash-ing, and key comparisons, are done on the CPU. In contrast, the goal of our workwas to achieve 10 GbE line-rate performance for all key-value sizes while perform-ing all of the UDP network processing and GET request processing on the GPU.We believe that MemcachedGPU could achieve sizable improvements in perfor-mance and efficiency when modifying the Memcached packet format and hashing179mechanism to be more GPU-friendly. We have not yet evaluated the potential foradditional scaling of GNoM and MemcachedGPU using a more powerful CPU withmultiple NICs and GPUs.Prior and concurrent work have evaluated accelerating other types of traditionalserver workloads on GPUs, such as HTTP workloads by agrawal et al. [3], ordatabased queries by Bakkum et al. [18] and Wu et al. [179]. Similar to our work,these works highlight the benefits of exploiting request level parallelism on theGPU through batching requests with similar behavior. However, these works focussolely on the workload specific processing, which can limit the potential end-to-endperformance improvements gained by the GPU. In contrast, our work considers acomplete end-to-end system implementation, which performs both the network andapplication processing on the GPU. Evaluating how GNoM could be generalizedto support accelerating the network path for these GPU server applications is aninteresting direction for future work.A core operation when processing a GET request is the hash. Because ourfocus was parallelizing independent requests, the hash algorithm is computed byeach GPU thread (work item) individually. Massively parallel hashing algorithms,such as the one implemented by Al-Kiswany et al. 
[5] in StoreGPU, provide sig-nificant performance increases when the data being hashed is large. However, thekeys hashed in this study were all less than 128 bytes and would not benefit fromthe efficient divide-and-conquer techniques proposed in their work.6.1.2 GPU NetworkingThis section discusses some of the key prior works related to networking andpacket-based processing on GPUs.Related work by Kim et al. [96] present GPUnet, a networking layer and socketlevel API for GPU applications. Similar to GNoM, GPUnet improves the supportfor applications requiring networking I/O on GPU accelerators. GPUnet providesa mechanism for GPU threads (specifically CTAs) to send and receive networkrequests directly from the GPU, which is achieved through persistently runningCPU and GPU software frameworks. The GPU initiates network requests (send orreceive) by sending RPCs to the CPU threads, which carry out the requested oper-180ation and return the results to the request CTAs. In contrast, our work maintainsthe GPU offload accelerator abstraction by launching GPU network kernels in re-sponse to network events, removing the need for persistent GPU kernels. Further-more, GPUnet is designed for Infiniband hardware and is directed towards moretraditional GPU applications, such as image processing and matrix multiplication.The current cost of Infiniband hardware is a large contributor to the restricted usageoutside of HPC. Our work instead targets commodity Ethernet hardware typicallyused in datacenters, which requires processing network packets in software, andevaluates a non-traditional GPU application, Memcached.Following the work in Chapter 4, Daoud et al. [38] (GPUrdma) expand onGPUnet to completely bypass the CPU for sending and receiving network packetsto/from the Infiniband NIC. This is achieved through the use of persistently runningGPU kernels that directly write to the Infiniband NIC’s doorbell register for send-ing packets and directly receive notifications of incoming packets through memorywrites by the Infiniband NIC. GPUrdma is evaluated on network ping-pong andmulti-matrix-vector product applications. In contrast, our work focuses on acceler-ating the end-to-end processing for a more complex server-based application usingEthernet NICs, requiring UDP packet processing on the GPU, and does not rely onthe use of persistent kernels, which enables more opportunities for GPU workloadconsolidation during varying traffic demands. Furthermore, EDGE can improveefficiency by removing the CPU from any GPU task management and removingthe requirement for persistent GPU kernels.Other works have also evaluated packet acceleration on GPUs such as Packet-Shader by Han et al. [65], GASPP by Basiliadis et al. [170], and Network Balanc-ing Act (NBA) by Kim et al. [95]. Similar to GNoM, these works exploit packet-level parallelism through batching on the massively parallel GPU architecture, re-quiring a host framework to efficiently manage tasks and data movement with theGPU. PacketShader [65] implements a high-performance software router frame-work on GPUs to accelerate the lower-level network layer, such as packet forward-ing and encryption, whereas GNoM and MemcachedGPU presented in this workfocus on transport and application levels.GASSP [170] is the first work to evaluate stateful packet processing (TCP) onGPUs and highlights the ability for the GPU to achieve significant performance im-181provements over a GPU-based implementation where the packet processing occurson the CPU. 
In contrast to our work, GASSP requires larger batch sizes (increasespacket latency), uses main CPU memory for storing packets (does not make use ofGPUDirect), and evaluates GASSP on packet processing based applications, suchas firewalls, intrusion detection, and encryption. Using the insights from GASPPto support stateful packet processing in GNoM is an interesting direction for futurework (Section 7.2).NBA [95] proposes a software-based packet processing framework that em-ploys a modular approach, following the Click [97] programming model. NBAabstracts the low-level architecture-specific optimizations in modern NICs, CPUs,and GPUs, and load balances network traffic between heterogeneous devices toreduce programming complexity and maximize performance without manual opti-mizations. Similar to our work, NBA bypasses the Linux kernel network process-ing; however, NBA transfers packets to CPU memory using Intel’s DPDK insteadof directly transferring to the GPU (as is done in GNoM for direct GPU processing).NBA proposes abstractions to offloadable accelerators (e.g., GPU), which specifya CPU-side function, GPU-side function, and the required input/output data. TheNBA framework handles data transfers and kernel invocations internally (if of-floading to the accelerator is expected to improve performance), and provides opti-mizations to support data reuse between offloadable elements. Similar to GASPP,NBA is a packet processing framework and is evaluated on individual packet pro-cessing applications, such as packet filtering, encryption, and pattern matching.In contrast, our work takes a holistic approach starting from the application layerand optimizing the end-to-end network flow for such applications on the GPU -performing both the network processing and application processing concurrentlyon the GPU. Furthermore, our work proposes enhancements to the GPU hardwareand programming model to improve support for future networking applications.Combining insights from our work and NBA could be useful to improve the GPUoffload performance in a modular packet processing framework on heterogeneoussystems when the choice of offload accelerator is known (or can be predicted) inadvance.1826.2 Related Work on Event-Driven GPU Execution andImproving GPU System SupportThis section presents the related work with Chapter 4 and Chapter 5. Some of theworks discussed in the previous section are included in the context of this sectionas well.The mechanisms employed by EDGE to communicate tasks between a deviceand the GPU to support event-driven execution are similar to in-memory submis-sion/completion queues that enable a CPU to communicate with external devices,such as NVME [49]. NVIDIA’s Multi-process service [129], HSA [55], and per-sistent GPU threads [63] also propose using user-level, in-memory work queuesfor communicating tasks with the GPU to improve multi-process support, reducekernel launch latency, or improve GPU independence from the CPU and GPU hard-ware task schedulers. MPS and HSA require a separate runtime environment (e.g.,CPU daemon process, hardware unit, or scalar processor) to recognize when workhas been submitted into the work queues, for example, via writes to global mem-ory or doorbell registers. EDGE proposes using privileged GPU warps (PGW) onthe existing GPU vector compute resources for processing new tasks inserted intothe in-memory work queues. The PGWs can be initiated by interrupts or writesto doorbell registers. 
Additionally, EDGE avoids the re-configuration of kernelsthrough event kernels.LeBeane et al. [101] present the closest related work to EDGE, Extended TaskQueuing (XTQ), which proposes active messages for heterogeneous systems overInfiniband networks using remote direct memory access (RDMA). This work hasthe same goal of removing the CPU from the critical path for launching tasks on theGPU from remote machines. XTQ focuses on integrated CPU/GPU systems, suchas AMD’s Accelerated Processing Units (APU), which utilize HSA and HSA’suser-level command queues to launch GPU tasks from the user space, instead ofinteracting with the GPU driver and OS on the CPU. As described in Section 5.4.1,the EDGE event submission/completion queues are similar to those used in HSA,and hence similar to XTQ. XTQ proposes enhancements to the Infiniband NICto construct XTQ active messages that can trigger GPU tasks via the HSA workqueues. XTQ registers GPU kernels with the Infiniband NIC, such that the RDMA183write operations only need to specify a kernel ID to launch the corresponding taskon the GPU. This is similar to EDGE, which registers event kernels with the GPUand uses a kernel ID to specify which event kernel to launch on an event. Themain difference between XTQ and EDGE is that XTQ pushes the extensions tothe NIC and requires HSA-enabled GPUs, whereas EDGE pushes the extensionsdirectly to the GPU. As such, EDGE requires only that the GPU and external deviceare able to communicate with each other via shared memory and that the externaldevice can be configured to send specific interrupt IDs to the GPU or write tospecific addresses in GPU memory (e.g., doorbell registers). This enables devicesother than a NIC, such as an FPGA, to utilize the same EDGE API to launchevent kernels directly on the GPU instead of requiring a similar XTQ frameworkper device on top of HSA. Lastly, the HSA user-level work queues, as describedin XTQ, use an HSA Command Processor (CP) to recognize when the doorbellregisters have been written to to begin processing the new task in the user-levelwork queue. However, it is not clear exactly what the CP is in their evaluation, forexample, a hardware unit, a small scalar processor residing on the GPU, a CPUdaemon software framework as in NVIDIA’s MPS, or some other mechanism. InEDGE, we explore a specific implementation of the CP, the PGW, which utilizesexisting GPU vector compute resources and fine-grained warp-level preemptionmechanisms to efficiently process tasks in the user-level work queues.Recent work by Suzuki et al. [160] also propose an event-driven GPU program-ming model, GLoop, consisting of GPU callback functions invoked on events, suchas a file read completion. Similar to GPUnet [96] and GPUfs [158], GLoop em-ploys persistent CPU and GPU runtime event loops, which continuously poll forRPC events from either device, or require re-launching kernels from the host uponRPC completion. Unlike these works, GLoop enables multiple different GPU ap-plications to concurrently execute on the same persistent thread-like framework byefficiently scheduling callbacks from different applications. While event kernelscan be thought of as callback functions, EDGE’s privileged GPU warp architectureand fine-grained warp-level preemption mechanism can remove the requirementfor continuously running GPU event loops, while minimizing the impact on per-formance required to re-launch full kernels. 
For example, the GPU-side event loopsdescribed in these works could be implemented directly by a PGW, or as an event184kernel launched by a PGW.NVIDIA announced enhancements to GPUDirect RDMA with GPUDirectAsync [153], which expose parts of the networking flow to CUDA streams toremove the CPU from the critical path when scheduling kernels in responseto network events on Infiniband hardware. With GPUDirect Async, the CPUpre-launches a GPU networking kernel in a CUDA stream and configures thecorresponding network receive buffers in GPU memory. The GPU kernel isasynchronously executed when receiving the network packets directly from theInfiniband NIC. This removes the GPU kernel launch from the CPU on the net-work critical path and enables the CPU to enter an idle state earlier while waitingfor the packet receive, not just during the GPU kernel processing. GPUDirectAsync has a similar goal as EDGE for removing the CPU from the critical path bypre-registering the GPU kernel; however, GPUDirect Async requires configuringevery kernel launch, whereas EDGE reuses the same event kernel registrationfor subsequent network event-driven kernels. Additionally, the GPU kernel inGPUDirect Async returns control back to a waiting CPU thread on completion,whereas EDGE enables completely removing the CPU from any pre or post kernelprocessing by supporting direct communication to and from external devices inthe system.Multiple works have evaluated preemption/context switching, multiprogram-ming, and priority mechanisms on GPUs [33, 92, 117, 143, 157, 162, 176, 178,183]. Zeno et al. [183] propose an I/O-Driven preemption mechanism for GPUs,GPUpIO, which removes the spin locking on persistent CTAs waiting for long I/Ooperations. Instead, persistent CTAs initiate self-preemption after executing ex-pensive I/O or RPC operations (e.g., network operations in GPUnet [96] or file op-erations in GPUfs [158]) through a set of software checkpoint-restore operations.When the I/O or RPC operation completes, the CTA context is restored in softwareto resume execution following the operation that initiated preemption. GPUpIOalso makes the case for adding hardware support to yield running CTAs, similarto existing CPU mechanisms. Tanasic et al. [162] propose two kernel preemptionmechanisms, explicit context switching (requires saving and restoring thread con-texts) and SM draining (blocks new CTAs and waits for currently executing CTAsto complete), support for multiple kernel execution, and corresponding architec-185tural enhancements. Park et al. [143] propose a collaborative and configurableapproach to minimize GPU kernel preemption overheads depending on the kerneland active CTAs’ states; consisting of the context switching and SM draining tech-niques proposed in [162], and an additional SM flushing technique, which dropsand replays CTAs from idempotent kernels. Kato et al. [92] propose a CPU runtimeengine for managing kernel priorities by breaking up large memory transfers andadding a priority-aware software kernel scheduling layer prior to launching kernelson the GPU. Wu et al. [178] propose FLEP, a compilation and runtime softwareframework for transforming GPU kernels into preemptable kernels to enable tem-poral and spatial multitasking. This is achieved by breaking the kernel down intoCTA-level tasks (similar to persistent threads) and having the CPU runtime sched-ule CTAs from different kernels depending on given priorities and policies. Chenet al. 
[33] propose Effisha, a similar framework to FLEP, which applies compilertransformations to the CPU and GPU application and includes CPU/CPU runtimesto work on CTA-level tasks in a persistent thread style fashion. Wang et al. [176]propose simultaneous multikernel (SMK), a fine-grained dynamic kernel sharingmechanism on GPUs. SMK includes a partial preemption mechanism, which pre-empts CTAs one at a time to enable fine-grained sharing of GPU resources whilepreempting a running kernel. SMK also exploits heterogeneity between kernel re-source requirements to enable fair resource allocation and sharing of SM resourcesfor concurrent kernel execution, and proposes a fair warp scheduler mechanism toallocate a fair number of cycles between competing concurrent kernels. Shieh etal. [157] propose a kernel preemption mechanism, Dual-Kernel, which co-executesa preempting kernel’s CTAs with the preempted kernel’s CTAs by selectively pre-empting CTAs one at a time until enough resources are available to run a CTAfrom the new kernel. Dual-Kernel aims to minimize preemption latency, through-put overhead, and resource fragmentation while preemption is occurring. Lastly,Menon et al. [117] propose iGPU, a set of compiler, ISA, and hardware extensionsthat enable preemption support and speculative execution on GPUs. iGPU identi-fies and exploits sparse idempotent regions to minimize the impact on re-executingGPU threads and minimize the amount of state required to be saved/restored on acontext switch.Throughout this dissertation, we propose and evaluate different forms of186preemption and GPU resource sharing techniques related to the works describedabove. In Chapter 4, we evaluated a technique to enable fine-grained GPUmulti-tasking between a high-priority Memcached kernel and lower-prioritybackground task by splitting the background task into groups of CTA-levelsub-tasks and adding a software scheduler to manage the execution of thesesub-tasks. This reduces the ability for larger low-priority tasks to monopolizethe GPU resources, enabling other, potentially higher-priority, tasks to interleavetheir execution of CTAs on the GPU and reduce the latency to wait for availableGPU resources. FLEP [178] and Effisha [33] expand on this technique with ageneral CPU and GPU software framework for transforming GPU applicationsto support this CTA-level programming paradigm across multiple concurrentapplications, which is used to enable low overhead preemption. In Chapter 5,we evaluate two forms of preemption to support the PGW and high-priority eventkernel execution. Unlike previous work, which focus on kernel and CTA-levelpreemption, we propose fine-grained warp-level preemption to support executingthe PGWs based on an I/O or external event. This has a unique set of challenges,since the running application is only partially preempted to execute the smallPGW task, which requires very low preemption latencies. EDGE also evaluatesa form of partial CTA draining for event kernels (similar to [162]), where CTAsare blocked only until enough resources are available to begin executing the eventkernel. For the application evaluated, Memcached, the individual event kernelsrequire significantly fewer resources than are available on the GPU. As such, weevaluate a partial preemption mechanism, which preempts only as many resourcesas required to execute the event kernels. This reduces the latency to execute thehigh-priority event kernel and reduces the amount of state to be saved/restoredfrom the preempted kernel. 
Additionally, by suspending the execution of the back-ground (preempted) kernel, instead of fully saving/restoring it, EDGE improvesperformance for the higher-priority event kernel, while minimizing the preemptionlatency and impact on the background kernel. If the event kernels are expectedto be much larger, requiring full preemption of a running kernel, the techniquesdescribed in prior work could be employed. Furthermore, the interrupt and PGWtechniques in EDGE can be used to initiate preemption in a multiprogrammedenvironment instead of support from the CPU driver, self-preempting CTAs, GPU187persistent thread runtime, or compiler specified preemption points.Multiple works have evaluated using CPU/GPU persistent thread runtimes forfor fine-grained kernels and GPU I/O [33, 38, 96, 158, 178, 182, 183]. Silberstein etal. [158] propose a POSIX-like API and software framework, GPUfs, for GPU pro-grams to access the host’s file system. GPUfs belongs to the same family of workas GPUnet, GPUrdma, and GPUpIO described above. GPUfs issues CTA-levelfile operations to a CPU polling framework, which performs the I/O operation andreturns the results back to the polling CTA. EDGE was initially designed for sup-porting external task launches, not necessarily for the CPU RPC operations in theseworks. However, the mechanisms proposed in EDGE, such as the wait-releasebarrier and PGW, could be used to avoid polling while waiting for long-latencyI/O operations. Yeh et al. [182] propose Pagoda, a CPU-GPU persistent runtimeframework that manages GPU tasks and resources at the warp-level, instead of atthe CTA-level as most other persistent thread frameworks, to improve performanceand efficiency for very small GPU tasks. Persistent thread frameworks can also beused to reduce the GPU’s dependence on a CPU for GPU task management, as anydevice can communicate control information via the in-memory persistent threadtask queues. However, if task launch rates are low, the GPU polling threads reduceenergy-efficiency. Furthermore, the GPU polling threads monopolize the GPU re-sources, removing the ability for other GPU tasks to use the standard CUDA in-terface and GPU hardware task scheduling mechanisms. In contrast, EDGE aimsto remove the requirement for persistent GPU-side runtimes through our proposedPGW framework, which use the fine-grained warp-level preemption mechanismsand pre-registered event kernels to reduce kernel launch overheads and dependenceon a CPU for initiating GPU tasks.There are many other works that have pushed towards evolving GPUs as first-class computing resources to support efficient GPU multiprogramming [64, 91, 93,152, 156, 159, 171]. These works are related to the overarching goals of this dis-sertation for improving system-software support on GPUs. Rossbach et al. [152]propose PTask, a set of OS abstractions to support a dataflow programming modelfor accelerators, which enables the OS to provide fairness and isolation guaranteesfrom a single management point. PTask also abstracts away many of the low-levelGPU programming complexities. Kato et al. [93] propose Gdev, an OS GPU re-188source management runtime, which unifies interactions with the GPU from bothuser-space and OS applications. Gdev virtualizes GPU resources for efficient andisolated multi-tasking, and supports sharing memory across GPU contexts. Suzukiet al. 
[159] propose GPUvm, which virtualizes the GPU at the Hypervisor forefficient multi-tasking support in enterprise and cloud computing environments.GPUvm aggregates GPU management in the Hypervisor to remove direct virtualmachine access, requiring several techniques and optimizations to virtualize GPUmemory management, page mappings, communication channels, and task sched-ulers. Shi et al. [156] propose virtualizing the GPU across virtual machines at thehost OS using the standard CUDA runtime, which intercepts and redirects CUDAcalls from guest OSes to the GPU. Vijaykumar et al. [171] propose Zorua, a compil-er/software/hardware virtualization framework, which virtualizes GPU resourcesinternally on the GPU (e.g., threads, registers, and scratchpad memory) to improveprogrammability, portability, and performance. Zorua lends itself to supportingmultitasking and preemption by efficiently managing shared GPU resources at afine-grained level. Kato et al. [91] present an OS framework for efficiently man-aging GPU resources, such as GPU communication channels, kernel contexts, andmemory. This work also discusses multiple different challenges to support GPUresource management in the OS, providing insights for future research directions.The works described above are complimentary or orthogonal to the work inthis dissertation. Many of the virtualization or resource management techniquescould be integrated with GNoM to enable efficient sharing of the network inter-face and GPU resources for high-priority event kernels. While the event kernelsproposed in EDGE enable direct task management from external devices using theGPU, the OS could be involved in configuring fair resource allocation for differentevent kernels. The PGW provides support for executing privileged software rou-tines directly on the GPU and can be used to provide higher-level scheduling andresource sharing decisions in software running directly on the GPU SMs. Further-more, the host OS (with greater visibility into the multitasking environment) andPGWs (with tighter integration with the GPU hardware) can cooperatively worktogether to guide scheduling decisions between concurrently executing kernels.189Chapter 7Conclusions and Future WorkThis chapter concludes the dissertation and discusses potential directions for futureresearch.7.1 ConclusionsDatacenters are important and ubiquitous computing environments with strict re-quirements for high performance, efficiency, and generality. With the declinein single-threaded performance scaling and power concerns surrounding multi-core scaling, the computing industry has looked towards alternative, heteroge-neous systems to continue improving performance and efficiency; for example,heterogeneous CPU cores, specialized application-specific hardware designs, re-programmable hardware systems, or parallel accelerators. Graphics processingunits (GPUs) are an example of a massively parallel architecture capable of provid-ing high levels of performance and efficiency for applications with large amountsof structured parallelism, such as high-performance computing and scientific ap-plications. GPUs have recently made their way into the datacenter, commonlyacting as accelerators for machine learning algorithms. However, there are otherclasses of applications with ample parallelism running on servers in the datacenter,such as server-based network applications, that are not typically considered as be-ing strong candidates to offload to the GPU. 
Server applications belong to a classof highly economical and irregular applications. Accelerating server applications190on energy-efficient architectures can significantly lower power consumption, andhence cost, in the datacenter. Actually obtaining the potential benefits providedby the GPU in a heterogeneous environment, however, is challenging for multipledifferent reasons related to both the GPU architecture and system-level integration.The evolution of the GPGPU has largely been driven by the requirements ofapplications using these accelerators. For example, there has been a large bodyof research highlighting the need for reducing the impact of branch divergence onGPUs based on a growing number of GPU applications with irregular control-flowbehavior [47, 56, 57, 123, 148, 150]. As a result, the latest NVIDIA Volta GPUhas introduced thread-level execution states [131] to address limitations with han-dling irregular control-flow behavior in previous GPU architectures. While theseoptimizations achieve lower performance or efficiency relative to highly regularapplications, they help to improve the support for more irregular applications thatmay benefit from some, but not all, parts of the GPU’s parallel architecture. Thisdissertation follows a similar approach to explore the potential for using GPUsas energy-efficient accelerators for more traditional server-based applications inthe datacenter through a software-hardware co-design – first understanding the be-havior of network services on contemporary GPU hardware, and then evaluatingenhancements to the GPU’s architecture and system-level integration to better sup-port these classes of applications.From the software side, this dissertation evaluates a popular key-value networkservice, Memcached, on contemporary GPU hardware. This analysis highlightsthat an application with irregular control and memory access patterns can stillachieve sizeable improvements in performance and efficiency on a GPU. How-ever, the actual application processing is only a portion of the required end-to-endprocessing for a network request. To this end, this dissertation proposes a com-plete end-to-end software framework for accelerating both the network (GNoM)and application (MemcachedGPU) processing on contemporary GPU and Ethernethardware. This process exposed multiple challenges and limitations with imple-menting such a framework in existing systems, which can not be easily solvedthrough software alone. From the hardware side, this dissertation proposes modi-fications to the GPU architecture and programming model (EDGE) to reduce theGPU’s dependence on a centralized component (e.g., the CPU) in a heterogeneous191environment. This is a step towards considering the GPU as a first-class computingresource. We expect that as the number of GPU applications with similar computeand communication requirements grows, targeted optimizations to the GPU’s ar-chitecture and system-support can further improve the GPU’s ability to providesignificant improvements in performance and efficiency.While much of this dissertation focuses on a single application, Memcached,the main goals of this dissertation are not specifically to highlight whether the GPUis the best architecture to accelerate Memcached. Rather, this dissertation exposesthe potential of the GPU to provide benefits to applications outside of the traditionalhigh-performance scientific computing space, such as those found in a datacenterenvironment. 
These types of applications can have varying levels of complexityand irregular behavior, which as presented in this dissertation, may be able to ben-efit from different parts of the GPU’s parallel architecture (e.g., the large number ofparallel computing resources, the SIMD execution engines, or the high-bandwidthmemory systems). However, each application is different. Consequently, whetheror not the GPU will be able to provide any improvements in performance or effi-ciency still needs to be evaluated on a case-by-case basis, or common propertiesneed to be identified with applications already known to perform well on a GPU.This dissertation argues that an application is not guaranteed to perform poorlyon a GPU just because the application does not adhere to the strict characteristicsapparent in traditional GPGPU applications, using a highly irregular application,Memcached, as an example. We hope that the insights gained and the systems de-veloped in this dissertation (such as the methodology employed to effectively portMemcached to the GPU, the NIC/CPU/GPU GNoM software framework, or theEDGE GPU programming model) will help to enable future server-based applica-tions, or applications requiring communication between heterogeneous devices, tobenefit from GPU acceleration.Chapter 3 [69] presents an initial analysis into the behavior of Memcached onboth integrated and discrete GPU hardware, as well as on a software GPU simu-lation framework. This chapter identifies multiple challenges with porting such anapplication with irregular control flow and data access patterns to contemporaryGPUs, and proposes a set of optimizations to mitigate the impact of this irregularbehavior on the potential performance and efficiency gains from the GPU. On an192integrated GPU system, which does not require data transfers to a separate physicalmemory space, we show that the GPU can outperform the CPU by a factor of 7.5×for the core Memcached key-value lookup operation. Additionally, this chapterdemonstrates how an application performs on the GPU relative to a programmer’sintuition of the potential performance. To this end, this chapter describes a GPUcontrol-flow simulator, CFG-Sim, which can help to estimate the SIMD efficiencyof an application prior to actually porting the application to a GPU. For Mem-cached, we found that the actual SIMD efficiency is approximately 2.7× higherthan a naive assumption of equal branch probabilities in the code-path may sug-gest.Chapter 4 identifies multiple limitations with considering only the applicationprocessing when offloading a networking application to the GPU and tackles manyof the challenges with performing both the network and application processing onthe GPU. Furthermore, this chapter explores how to efficiently orchestrate the com-munication and computation between the GPU, NIC, and CPU when implement-ing an end-to-end GPU networking application. To this end, this chapter proposesGNoM (GPU Networking Offload Manager), a CPU-GPU software framework foraccelerating UDP network and application processing on contemporary GPU andEthernet hardware. GNoM constructs small batches of network requests (e.g., 512requests) and pipelines multiple concurrent request batches to the GPU to over-lap the communication and computation of network batches on the GPU. UsingGPUDirect, the network request data is directly transferred from the NIC to theGPU; however, a CPU software framework is required to handle the network IOand GPU task management. 
This chapter also describes the design and implementation of MemcachedGPU, an accelerated key-value store that leverages GNoM to run efficiently on a GPU. Many of the internal data structures in Memcached are redesigned to better fit the GPU's architecture and the communicating components in a heterogeneous environment. GNoM and MemcachedGPU are evaluated on high-performance and low-power GPUs and are capable of reaching 10 Gbps line-rate processing with the smallest Memcached request size (over 13 million requests per second (MRPS)) at efficiencies under 12 µJ per request. Furthermore, MemcachedGPU provides a 95-percentile round-trip time (RTT) latency under 1.1 ms at the peak throughputs. Together, GNoM and MemcachedGPU highlight the GPU's potential for accelerating such network/server-based applications.

Chapter 5 makes the observation that the CPU software framework portion of GNoM is required on contemporary GPU systems, even if no part of the application processing requires the CPU. This is a result of GPUs being second-class computing resources, which rely on a host CPU to manage IO and the launching/completion of tasks on the GPU. This chapter explores how to reduce the GPU's dependency on a host CPU for control management, without the use of continuously running GPU software frameworks, such as persistent threads, to enable any device in a heterogeneous system to efficiently execute tasks on a GPU. Specifically, this chapter proposes a novel event-driven GPU programming model, EDGE, which pre-registers GPU kernels (event kernels) to be launched by an external device. EDGE implements extensions to the CUDA API and requires minimal modifications to the GPU architecture to support storing and initiating the execution of GPU event kernels. We evaluate the opportunity to use existing GPU warps as a mechanism for launching the GPU event kernels internally on the GPU. EDGE exposes the GPU's interrupt interface to external devices in a heterogeneous system to trigger the execution of privileged GPU warps (PGWs), which are capable of launching event kernels on the GPU, similar to CUDA dynamic parallelism (CDP). Unlike CDP, event kernels do not require dynamic configuration, which can significantly reduce the kernel launching latency. This chapter also proposes a fine-grained, warp-level preemption mechanism to reduce the PGW scheduling latency when the GPU does not have enough free resources available to immediately execute the PGW. We show that for a set of GPU benchmarks, a free warp context is available for a PGW 50% of the time on average, and that the preemption latency for a PGW, when no free warp contexts are available, can be reduced by up to ∼36× (to ∼50 GPU cycles) with the proposed warp flushing optimizations over simply waiting for a warp to complete. Furthermore, by reducing the interference between the CPU and GPU, we estimate that EDGE could increase CPU throughput by 1.17× and GPU networking throughput by 3.7× when running workloads on the CPU (SPEC2006) in parallel with a networking application, such as MemcachedGPU, on the GPU.

7.2 Future Research Directions

This section discusses directions for potential future research based on the work performed in this dissertation.

7.2.1 Control-Flow Simulator

Section 3.2 presented a control-flow simulator, CFG-Sim, that simulates the behavior of a warp (wavefront) through an application's control-flow graph (CFG) to estimate the SIMD efficiency of the application when executed on a GPU.
As described, this simulator considers the branch probabilities (taken or not taken) and applies this probability independently to each thread in a warp when flowing through the CFG. This is not a problem if each thread's execution is independent of all other threads in a warp. However, if the branch outcome for a thread is dependent on the other threads in a warp, referred to here as correlated branches, then CFG-Sim may estimate a much lower SIMD efficiency than will actually result when the code is executed on a GPU. The impact of correlated branches can be seen in the Rodinia application, Ray Tracer, presented in Section 3.4.2. Consider the code in Example 3. In this example, the warp size is 32 threads (as in NVIDIA GPUs). The even warps will execute the code at block A, whereas the odd warps will execute the code at block B. If there is an even number of warps, the SIMD efficiency will be 100% and the branch probability will appear to be 50% (half of the threads execute A, the other half execute B). However, in CFG-Sim a branch probability of 50% will be applied independently to each thread, resulting in an estimated SIMD efficiency of roughly 50%.

Example 3 Example of correlated branches on branch probabilities.

int tid = threadIdx.x;
if (tid % 64 < 32) {  // Warp size is 32 threads
    // A
} else {
    // B
}

Loops pose a similar challenge. Consider the code in Example 4. For each thread in this example, the branch is taken nine times and not taken one time, so the branch probability appears to be 90%. CFG-Sim will apply a branch probability of 90% to each thread on each iteration, which results in an estimated SIMD efficiency lower than the actual SIMD efficiency of 100%.

Example 4 Example of loops on branch probabilities.

for (unsigned i = 0; i < 9; ++i) {
    // C
}

Another challenge with estimating SIMD efficiency occurs when branch outcomes are data dependent. Consider the code in Example 5. Here, all threads in a thread block will iterate through the loop together; however, the number of iterations is dependent on external data. Consequently, the number of times that the threads will execute the code in block D is not known a priori, which complicates the calculation of the final SIMD efficiency estimate. For example, the overall SIMD efficiency can be quite different if block D is only 10% of the static code in the GPU kernel, but 90% of the dynamic code.

Example 5 Example of data dependency on branch probabilities.

unsigned N = global_buffer[blockIdx.x];
for (unsigned i = 0; i < N; ++i) {
    // D
}

To generalize across multiple types of applications, CFG-Sim must consider these types of correlated branches when estimating the SIMD efficiency. However, identifying correlated branches in CPU code is challenging, since the code has likely not been written for highly parallel execution or may be highly dependent on input data. An interesting research direction would be to understand how CFG-Sim could be improved to support branches related to parallel execution, the underlying GPU architecture, and data dependence.

7.2.2 Evaluating GNoM on Additional Applications and Integrated GPUs

The GNoM framework presented in Chapter 4 is evaluated on a single server application, MemcachedGPU, and a network-only GPU ping benchmark. GNoM should also be able to efficiently accelerate other existing GPU network packet processing applications in which different GPU threads are responsible for processing different network packets, such as IPv4/6 forwarding or packet encryption/authentication (e.g., IPSec); a minimal sketch of this one-thread-per-packet pattern is shown below.
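The following is an assumption-laden sketch of that pattern, not code from GNoM or from any of the cited forwarding systems: the fixed-size packet slots, the header offsets, and the kernel name are invented for the illustration. It simply decrements the IPv4 TTL of each packet in a GNoM-style batch, with one GPU thread owning one packet.

// Illustrative one-thread-per-packet kernel over a GNoM-style batch. This is
// not actual GNoM or IPv4-forwarding code; the buffer layout and offsets are
// assumptions: each packet occupies a fixed-size slot and begins with a
// 14-byte Ethernet header followed by an IPv4 header (TTL at byte 8).
#include <stddef.h>
#include <stdint.h>

const int SLOT_BYTES = 2048;   // assumed fixed slot size per packet
const int ETH_HDR    = 14;     // Ethernet header length
const int TTL_OFF    = 8;      // TTL offset within the IPv4 header

__global__ void ipv4_ttl_decrement(uint8_t *pkts, int *drop, int num_pkts) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // thread i owns packet i
    if (i >= num_pkts) return;

    uint8_t *ip  = pkts + (size_t)i * SLOT_BYTES + ETH_HDR;
    uint8_t  ttl = ip[TTL_OFF];
    if (ttl <= 1) {
        drop[i] = 1;            // expired: mark the packet to be dropped
    } else {
        ip[TTL_OFF] = ttl - 1;  // forward: decrement the TTL
        drop[i] = 0;
        // A real forwarder would also update the IPv4 header checksum and
        // perform a longest-prefix-match lookup to pick the output port.
    }
}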
An obvious direction for future research is to evaluate additional applications that can take advantage of GNoM to accelerate the network flow from the network card directly to the GPU. MemcachedGPU was designed and evaluated for very small network packet sizes to stress the GNoM framework; applications with larger packets should also benefit from GNoM and are worth evaluating. Additionally, the runtime of each Memcached kernel is relatively short, which results in larger end-to-end improvements when GNoM is used to reduce the surrounding overheads. It would be interesting to evaluate how GNoM compares to the baseline network flow when the GPU kernel runtime is much longer. Furthermore, MemcachedGPU highlighted a specific requirement for partitioning the computation between the CPU and GPU, for example, to fill the response packets with the corresponding Memcached item values on a hit in the hash table prior to sending the response packets over the network. Other applications may place drastically different requirements on the amount and type of CPU and GPU processing, which could require rethinking parts of the GNoM design to efficiently support a wide range of applications.

Chapter 3 highlighted that while discrete GPUs have higher computational capacity, the ability to remove data transfers between the CPU and GPU resulted in higher overall performance from integrated GPUs. GNoM was developed and evaluated on discrete GPUs mainly due to the existence of GPUDirect, which enabled the NIC to transfer RX packets directly to the physical memory on high-performance discrete GPUs. The integrated GPU's memory could also be mapped into the NIC for direct packet transfers; however, discrete GPUs were selected for an initial evaluation due to the potential for higher performance when memory transfer overheads are reduced (as shown in Chapter 3). In GNoM, the GPU still transfers response packets on the TX path to the CPU, which forwards them to the NIC. This is not an inherent limitation of GNoM (an equivalent direct GPU-to-NIC TX path was also implemented and evaluated), but was instead a design decision to improve performance on contemporary hardware. Specifically, it enabled GNoM to make use of the CPU's larger memory space for storing the Memcached values without incurring additional transfers across PCIe. Based on the results for a lower-power/performance discrete GPU in Chapter 4, evaluating GNoM on integrated GPUs could further improve the efficiency of applications, such as MemcachedGPU, which currently require partitioning parts of the application across the CPU and GPU due to GPU memory capacity limitations. Additionally, unlike Chapter 3, which evaluated large batches of packets (e.g., 20k+ packets), GNoM works on much smaller batches of packets (512 packets), which places lower requirements on the GPU's computational capacity.

7.2.3 Larger GPU Networking Kernels

GNoM and MemcachedGPU were designed to support a single GPU thread per network packet (as described in Chapter 4, two GPU threads are actually launched per packet to overlap parallel network and application processing tasks within the GPU kernel). Many potential GPU networking applications could require significantly more GPU threads to execute the kernel. For example, consider a GPU networking application that performs facial recognition on network packets containing image data. Such an application requires multiple GPU threads to efficiently complete the task for a single network packet. Additionally, the larger kernel size requires fewer (potentially zero) network packets to be batched together to improve the GPU's throughput.
Supporting these types of applications would require modifications to GNoM to handle multiple GPU threads per network packet and to partition the threads between application and network processing.

7.2.4 Stateful GPU Network Processing

GNoM currently supports stateless UDP network processing. Stateless packet processing simplifies the processing requirements, as there is no handshaking between client and server and no additional protocol mechanisms, such as flow control. Thus, the GPU server simply receives the packet, processes the packet, and sends the response. If the packet is dropped for any reason along the end-to-end path from/to the client, it is the responsibility of the client to re-transmit the packet. GASPP [170] is the first work to evaluate stateful packet processing (TCP) on GPUs and shows that the GPU can achieve significant performance improvements over a GPU-based implementation in which the packet processing occurs on the CPU. GASPP requires larger batch sizes (which increases packet latency), stores packets in main CPU memory (it does not make use of GPUDirect), and is evaluated on packet-processing applications, such as firewalls, intrusion detection, and encryption. An interesting research direction would be to understand how the insights gained from both GNoM and GASPP could improve the performance of stateful GPU packet processing and accelerate higher levels of the application layer on a GPU.

7.2.5 Accelerating Operating System Services on GPUs

This dissertation evaluates the potential for offloading a portion of the network stack to the GPU to improve the performance and efficiency of GPU networking applications. Since the network application itself is also running on the GPU, accelerating an Operating System (OS) service, such as the network stack, has a direct impact on reducing the latency of the critical path between the NIC and GPU. A much more open-ended research direction would be to explore additional OS services, potentially those which have no existing interactions with the GPU, that could benefit from GPU acceleration.

7.2.6 Networking Hardware Directly on a GPU

The GNoM framework proposed in this dissertation considers a heterogeneous system in which the NIC is a physically separate component from the CPU and GPU, connected via an interconnect bus, such as PCIe. As highlighted in Chapter 6, many works have shown the potential for the GPU to improve the performance and efficiency of networking applications. From an architectural viewpoint, it would be interesting to evaluate how network interfaces (such as Ethernet ports) could be directly integrated into the GPU architecture, and how the network interface could be modified to better suit the GPU's vector architecture. For example, one of the inefficiencies of current packet processing on GPUs is due to the packet layout in memory. If each GPU thread processes a separate packet, as in GNoM, every memory access will lead to memory divergence (GNoM tries to reduce the effects by loading packets into shared memory). However, a dedicated NIC on a GPU could rearrange (swizzle) the packet header contents in hardware such that the GPU threads can better take advantage of memory coalescing, as sketched below.
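The sketch below illustrates the layout difference that the swizzling targets. It is an assumption-based example rather than a proposed hardware design or actual GNoM code; the header structure and kernel names are simplified and invented for illustration.

// Illustrative sketch of the header-swizzling idea (an assumption-based
// example, not a hardware design or GNoM code). With one thread per packet,
// the layout of the headers in memory determines whether a warp's loads can
// coalesce.
#include <stdint.h>

struct IPv4Header {              // simplified 20-byte IPv4 header (assumption)
    uint8_t  ver_ihl, tos;
    uint16_t total_len, id, frag_off;
    uint8_t  ttl, proto;
    uint16_t checksum;
    uint32_t src_ip, dst_ip;
};

// Baseline (array-of-structures): headers stored back-to-back, one per packet.
// Lane i's load of hdrs[i].dst_ip is 20 bytes away from lane i+1's, so the 32
// loads of a warp cannot coalesce into a small number of memory transactions.
__global__ void lookup_aos(const IPv4Header *hdrs, uint32_t *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = hdrs[i].dst_ip;
}

// Swizzled (structure-of-arrays): the NIC, or a software pre-pass, writes each
// header field of a batch contiguously. Lane i's load of dst_ip[i] is adjacent
// to lane i+1's, so the warp's accesses coalesce into full-width transactions.
struct SwizzledHeaders {
    const uint32_t *dst_ip;      // dst_ip[0..n-1], one entry per packet
    const uint32_t *src_ip;      // remaining header fields laid out the same way
};

__global__ void lookup_soa(SwizzledHeaders hdrs, uint32_t *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = hdrs.dst_ip[i];
}

The choice between the two layouts does not change what each thread computes; it only changes how many memory transactions a warp needs to fetch the same header fields.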
An initial evaluation of the packet header swizzling technique shows up to a 2.4× improvement in throughput for a GPU IPv4/6 forwarding application. Furthermore, an integrated GPU-NIC architecture could significantly reduce the latency for ingesting and processing packets in GPU memory with separate GPU threads. Including network hardware directly on the GPU would also increase the need for a mechanism to independently and efficiently launch GPU kernels internally on the GPU, such as EDGE and the PGWs.

7.2.7 Scalar Processors for GPU Interrupt Handling

Chapter 5 explores how existing warps running on the GPU's SIMD pipeline can be used to handle interrupts to internally launch GPU kernels in response to an external event. However, the interrupt processing, as described in Chapter 5, is inherently sequential and could also be efficiently processed by a scalar processor. A direction for future research would be to evaluate how simple scalar cores integrated on the GPU die, such as scalar control processors [139] or the scalar units in AMD Compute Units [10], could be used to improve the performance and efficiency of interrupt handling for launching internal event kernels on the GPU.

7.2.8 GPU Wait-Release Barriers

Chapter 5 introduced a new type of CTA barrier, the wait-release barrier, which blocks all warps in a CTA indefinitely until a release barrier instruction is performed. The current implementation constructs wait-release barrier sets using IDs and releases all wait-release barriers belonging to the same set simultaneously. If multiple CTAs are concurrently blocked on the same wait-release barrier set, they will all be released regardless of the number of tasks pushed to the GPU. This reduces performance and efficiency, as only a subset of the released CTAs may find a valid task to perform, while the rest are unnecessarily released from the wait-release barrier and will immediately return to blocking. An area for future work would be to explore how to efficiently support targeted releasing of wait-release barriers. This would likely require data structures to track the status of running and blocked CTAs, and a dedicated mechanism for communicating information to these CTAs from the PGW responsible for releasing a blocked CTA. Additionally, given that a single PGW releases the wait-release barrier, the PGW could be used to remove synchronization on a single global work queue by distributing the pending tasks to local per-CTA work queues. The PGW could also be used for load-balancing the tasks across CTAs.

7.2.9 Rack-Scale Computing

Datacenters have recently started to shift towards rack-scale computing architectures [82], where components in a disaggregated computing [107] system reside in physically separate racks. In such a computing environment, only the resources that are actually required are allocated to an application. This increases the utilization of hardware resources in the datacenter by minimizing the amount of unused hardware resources that are indirectly reserved by an application. An example of this would be an application that is allocated a full server based on its compute requirements, but only requires 50% of the storage resources installed in this server. In a rack-scale architecture, this application could be allocated the required compute resources from one rack and the required storage resources from another, leaving the unused storage resources to be allocated to another application.
However, if we consider a GPU networking application, which only requires the network interface and GPU, a CPU still needs to be allocated to handle the control path from the network interface to the GPU. The EDGE framework proposed in Chapter 5 removes this dependency on the CPU for GPU task management in a single system; however, an interesting research direction would be to understand how such a framework may also be used in a disaggregated rack-scale environment. For example, NVIDIA's Pascal and Volta GPUs support virtual memory demand paging through unified memory [126, 131], which places a requirement on the CPU (and CPU memory) for tasks other than strictly control management.

Bibliography

[1] A. Bakhoda et al. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In Int'l Symp. on Perf. Analysis of Systems and Software (ISPASS), pages 163–174, April 2009. → pages 12
[2] J. T. Adriaens, K. Compton, N. S. Kim, and M. J. Schulte. The Case for GPGPU Spatial Multitasking. In Proceedings of the 18th International Symposium on High Performance Computer Architecture (HPCA), 2012. → pages 115
[3] S. R. Agrawal, V. Pistol, J. Pang, J. Tran, D. Tarjan, and A. R. Lebeck. Rhythm: Harnessing Data Parallel Hardware for Server Workloads. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014. → pages 74, 180
[4] B. Aker. libMemcached. http://libmemcached.org/libMemcached.html. → pages 96
[5] S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, and M. Ripeanu. StoreGPU: Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems. In Proceedings of the 17th Int'l Symp. on High Performance Distributed Computing, HPDC '08, pages 165–174, New York, NY, USA, 2008. ACM. ISBN 978-1-59593-997-5. → pages 180
[6] M. Alhamad, T. Dillon, and E. Chang. Conceptual SLA framework for cloud computing. In Digital Ecosystems and Technologies (DEST), 2010 4th IEEE International Conference on, pages 606–610. IEEE, 2010. → pages 7
[7] Altera. FPGA architecture. https://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-8800-4800-v4-product-families-brief.html, 2006. → pages 4
[8] Amazon. Why machine learning on AWS? https://aws.amazon.com/machine-learning, 2018. → pages 1
[9] AMD Accelerated Parallel Processing, OpenCL - Programming Guide. AMD, 2.5 edition, 2011. → pages 17, 43, 51, 63
[10] AMD. AMD Graphics Cores Next (GCN) Architecture. https://www.amd.com/Documents/GCN Architecture whitepaper.pdf, 2012. → pages 23, 133, 200
[11] D. G. Andersen, J. Franklin, M. Kaminsky, A. Phanishayee, L. Tan, and V. Vasudevan. FAWN: A Fast Array of Wimpy Nodes. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP '09, pages 1–14. ACM, 2009. → pages 7, 175, 177
[12] O. Arcas-Abella, G. Ndu, N. Sonmez, M. Ghasempour, A. Armejach, J. Navaridas, W. Song, J. Mawer, A. Cristal, and M. Lujan. An empirical evaluation of high-level synthesis languages and tools for database acceleration. In Field Programmable Logic and Applications (FPL), 2014 24th International Conference on, 2014. → pages 4
[13] A. Ariel, W. W. Fung, A. E. Turner, and T. M. Aamodt. Visualizing complex dynamics in many-core accelerator architectures. In 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), pages 164–174. IEEE, 2010. → pages 68
[14] ARM. big.LITTLE technology: The future of mobile. http://img.hexus.net/v2/press releases/arm/big.LITTLE.Whitepaper.pdf, 2013. → pages 3
[15] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P.
Husbands,K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, et al.The landscape of parallel computing research: A view from berkeley.Technical report, Technical Report UCB/EECS-2006-183, EECSDepartment, University of California, Berkeley, 2006. → pages 2[16] B. Atikoglu, Y. Xu, E. Frachtenberg, S. Jiang, and M. Paleczny. WorkloadAnalysis of a Large-scale Key-value Store. In Proceedings of the 12thACM SIGMETRICS/PERFORMANCE Joint International Conference onMeasurement and Modeling of Computer Systems (SIGMETRICS), 2012.→ pages 85, 91, 114203[17] A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt.Analyzing cuda workloads using a detailed gpu simulator. In PerformanceAnalysis of Systems and Software, 2009. ISPASS 2009. IEEE InternationalSymposium on, pages 163–174. IEEE, 2009. → pages 20, 50, 156[18] P. Bakkum and K. Skadron. Accelerating SQL Database Operations on aGPU with CUDA. In Proceedings of the 3rd Workshop onGeneral-Purpose Computation on Graphics Processing Units (GPGPU),March 2010. → pages 74, 180[19] B. Barney. Message passing interface (mpi).https://computing.llnl.gov/tutorials/mpi/, 2018. → pages 3[20] L. A. Barroso, J. Clidaras, and U. Ho¨lzle. The datacenter as a computer:An introduction to the design of warehouse-scale machines. Synthesislectures on computer architecture, 8(3):1–154, 2013. → pages 1, 6, 7, 115[21] M. Baskaran, J. Ramanujam, and P. Sadayappan. Automatic C-to-CUDACode Generation for Affine Programs. In Gupta, Rajiv, editor, CompilerConstruction, volume 6011 of Lecture Notes in Computer Science, pages244–263. Springer Berlin / Heidelberg, 2010. → pages 61[22] M. Bauer, S. Treichler, and A. Aiken. Singe: Leveraging warpspecialization for high performance on gpus. In Proceedings of the 19thACM SIGPLAN Symposium on Principles and Practice of ParallelProgramming, PPoPP ’14. ACM, 2014. → pages 78, 84[23] S. Bauer, S. Ko¨hler, K. Doll, and U. Brunsmann. Fpga-gpu architecture forkernel svm pedestrian detection. In Computer Vision and PatternRecognition Workshops (CVPRW), 2010 IEEE Computer SocietyConference on, pages 61–68. IEEE, 2010. → pages 127[24] M. Berezecki, E. Frachtenberg, M. Paleczny, and K. Steele. Many-coreKey-value Store. In Green Computing Conference and Workshops (IGCC),2011 International, pages 1 –8, July 2011. → pages 39, 57, 175[25] D. Black. Idc: Ai, hpda driving hpc into high growth markets.https://www.businesswire.com/news/home/20160309006515/en/Worldwide-Server-Market-Revenues-Increase-5.2-Fourth, November 2016. → pages6[26] M. Blott, K. Karras, L. Liu, K. Vissers, J. Bar, and Z. Istvan. Achieving10Gbps Line-rate Key-value Stores with FPGAs. In Proceedings of the 5th204USENIX Workshop on Hot Topics in Cloud Computing, 2013. → pages 89,90, 100, 121, 122, 123, 177[27] D. Burger. Keynote: A New Era of Hardware Microservices in the Cloud.In DATE, 2017. → pages 7[28] C. Faylor, et al. Cygwin. http://www.cygwin.com/. → pages 49[29] A. M. Caulfield, E. S. Chung, A. Putnam, H. Angepat, J. Fowers,M. Haselman, S. Heil, M. Humphrey, P. Kaur, J.-Y. Kim, et al. Acloud-scale acceleration architecture. In Microarchitecture (MICRO), 201649th Annual IEEE/ACM International Symposium on, pages 1–13. IEEE,2016. → pages 7[30] S. R. Chalamalasetti, K. Lim, M. Wright, A. AuYoung, P. Ranganathan,and M. Margala. An FPGA Memcached Appliance. In Proceedings of theACM/SIGDA International Symposium on Field Programmable GateArrays (FPGA), 2013. → pages 177[31] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, andK. Skadron. 
Rodinia: A benchmark suite for heterogeneous computing. InWorkload Characterization, 2009. IISWC 2009. IEEE InternationalSymposium on, pages 44–54. IEEE, 2009. → pages 157[32] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron. APerformance Study of General-Purpose Applications on GraphicsProcessors Using CUDA. Journal of Parallel and Distributed Computing,68(10):1370 – 1380, 2008. → pages 21[33] G. Chen, Y. Zhao, X. Shen, and H. Zhou. Effisha: A software frameworkfor enabling effficient preemptive scheduling of gpu. In Proceedings of the22nd ACM SIGPLAN Symposium on Principles and Practice of ParallelProgramming, pages 3–16. ACM, 2017. → pages 23, 130, 149, 185, 186,187, 188[34] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears.Benchmarking cloud serving systems with ycsb. In Proceedings of the 1stACM Symposium on Cloud Computing, SoCC ’10, New York, NY, USA,2010. ACM. URL http://doi.acm.org/10.1145/1807128.1807152. → pages98, 104[35] J. Corbet, A. Rubini, and G. Kroah-Hartman. Linux device drivers thirdedition. https://static.lwn.net/images/pdf/LDD3/, 2005. → pages 137205[36] A. Corporation. Implementing fpga design with the opencl standard.https://www.altera.com/en US/pdfs/literature/wp/wp-01173-opencl.pdf, 112013. → pages 4[37] F. Dabek, N. Zeldovich, F. Kaashoek, D. Mazie`res, and R. Morris.Event-driven programming for robust software. In Proceedings of the 10thWorkshop on ACM SIGOPS European Workshop, EW 10, pages 186–189,New York, NY, USA, 2002. ACM. doi:10.1145/1133373.1133410. URLhttp://doi.acm.org/10.1145/1133373.1133410. → pages 33, 131[38] F. Daoud, A. Watad, and M. Silberstein. Gpurdma: Gpu-side library forhigh performance networking from gpu kernels. In Proceedings of the 6thInternational Workshop on Runtime and Operating Systems forSupercomputers, page 6. ACM, 2016. → pages 126, 181, 188[39] J. Dean. Large scale deep learning. Keynote GPU Technical Conference2015, 03 2015. URL http://www.ustream.tv/recorded/60071572. → pages7[40] J. Dean and L. A. Barroso. The tail at scale. Communications of the ACM,56(2):74–80, 2013. → pages 136[41] J. Dean and L. A. Barroso. The Tail at Scale. Communications of the ACM,February 2013. → pages 9, 109[42] E. Demers. Evolution of AMD Graphics. AMD Fusion Developer Summit,June 2011. → pages 52[43] R. H. Dennard, F. H. Gaensslen, V. L. Rideout, E. Bassous, and A. R.LeBlanc. Design of ion-implanted mosfet’s with very small physicaldimensions. IEEE Journal of Solid-State Circuits, 9(5):256–268, 1974. →pages 2[44] Deyannis, Dimitris and Koromilas, Lazaros and Vasiliadis, Giorgos andAthanasopoulos, Elias and Ioannidis, Sotiris. Flying Memcache: LessonsLearned from Different Acceleration Strategies. In Computer Architectureand High Performance Computing (SBAC-PAD), 2014 IEEE 26thInternational Symposium on. IEEE, 2014. → pages 121, 122, 123, 179[45] A. Dragojevic´, D. Narayanan, O. Hodson, and M. Castro. FaRM: FastRemote Memory. In Proceedings of the 11th USENIX Conference onNetworked Systems Design and Implementation (NSDI), 2014. → pages178206[46] T. Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. Activemessages: a mechanism for integrated communication and computation. InComputer Architecture, 1992. Proceedings., The 19th Annual InternationalSymposium on, pages 256–266. IEEE, 1992. → pages 131[47] A. ElTantawy, J. W. Ma, M. O’Connor, and T. M. Aamodt. A scaleltablemulti-path microarchitecture for efficient gpu control flow. 
In HighPerformance Computer Architecture (HPCA), 2014 IEEE 20thInternational Symposium on, pages 248–259. IEEE, 2014. → pages 191[48] H. Esmaeilzadeh, E. Blem, R. St Amant, K. Sankaralingam, and D. Burger.Dark Silicon and the End of Multicore Scaling. In Proceedings of the 38thAnnual International Symposium on Computer Architecture (ISCA), 2011.→ pages 3[49] N. Express. Nvm express explained.http://nvmexpress.org/wp-content/uploads/2013/04/NVM whitepaper.pdf,2013. → pages 133, 183[50] S. Eyerman and L. Eeckhout. System-level performance metrics formultiprogram workloads. IEEE micro, 28(3), 2008. → pages 168[51] Facebook. Applying machine learning science to facebook products.https://research.fb.com/category/machine-learning, 2018. → pages 1[52] B. Fan, D. G. Andersen, and M. Kaminsky. Memc3: Compact andconcurrent memcache with dumber caching and smarter hashing. In 10thUsenix Symposium on Networked Systems Design and Implementation(NSDI ’13), 2013. → pages 89, 98, 101, 107, 121, 176[53] M. Feldman. Idc’s latest forecast for the hpc market: ”2016 is lookinggood”. https://www.top500.org/news/idcs-latest-forecast-for-the-hpc-market-2016-is-looking-good, November 2016. → pages6[54] J. Fischer, R. Majumdar, and T. Millstein. Tasks: Language support forevent-driven programming. In Proceedings of the 2007 ACM SIGPLANSymposium on Partial Evaluation and Semantics-based ProgramManipulation, PEPM ’07, pages 134–143, New York, NY, USA, 2007.ACM. ISBN 978-1-59593-620-2. doi:10.1145/1244381.1244403. URLhttp://doi.acm.org/10.1145/1244381.1244403. → pages 33, 131207[55] H. Foundation. Hsa runtime programmer’s reference manual v1.1.1.http://www.hsafoundation.com/standards/, 2016. → pages 52, 132, 133,147, 183[56] W. W. Fung and T. M. Aamodt. Thread block compaction for efficient simtcontrol flow. 2011. → pages 191[57] W. W. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic warpformation and scheduling for efficient gpu control flow. In Proceedings ofthe 40th Annual IEEE/ACM International Symposium onMicroarchitecture, pages 407–420. IEEE Computer Society, 2007. →pages 191[58] E. Gansner, E. Koutsofios, and S. North. Drawing graphs with dot, 2006.→ pages 47[59] Garland, Michael and Le Grand, Scott and Nickolls, John and Anderson,Joshua and Hardwick, Jim and Morton, Scott and Phillips, Everett andZhang, Yao and Volkov, Vasily. Parallel Computing Experiences withCUDA. IEEE Micro, 28:13–27, July 2008. → pages 6, 35[60] I. Gelado, J. E. Stone, J. Cabezas, S. Patel, N. Navarro, and W. Hwu. AnAsymmetric Distributed Shared Memory Model for Heterogeneous ParallelSystems. In Proceedings of the fifteenth edition of ASPLOS onArchitectural support for programming languages and operating systems,ASPLOS ’10, pages 347–358, New York, NY, USA, 2010. ACM. → pages44[61] Google. Graphics processing unit (gpu). leverage gpus on google cloud formachine learning and scientific computing. https://cloud.google.com/gpu/,2018. → pages 8[62] C. Gregg and K. Hazelwood. Where is the Data? Why You Cannot DebateCPU vs. GPU Performance Without the Answer. In ISPASS ’11, 2011. →pages 21[63] K. Gupta, J. A. Stuart, and J. D. Owens. A study of persistent threads stylegpu programming for gpgpu workloads. In Innovative Parallel Computing(InPar), 2012, pages 1–14. IEEE, 2012. → pages 24, 183[64] V. Gupta, A. Gavrilovska, K. Schwan, H. Kharche, N. Tolia, V. Talwar, andP. Ranganathan. GViM: GPU-accelerated Virtual Machines. InProceedings of the 3rd ACM Workshop on System-level Virtualization forHigh Performance Computing (HPCVirt), 2009. 
→ pages 188208[65] S. Han, K. Jang, K. Park, and S. Moon. Packetshader: a gpu-acceleratedsoftware router. In ACM SIGCOMM Computer Communication Review,volume 40, pages 195–206. ACM, 2010. → pages 126, 181[66] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Toward darksilicon in servers. IEEE Micro, 31(EPFL-ARTICLE-168285):6–15, 2011.→ pages 3[67] T. Herbert and W. d. Bruijn. Scaling in the linux networking stack.https://www.kernel.org/doc/Documentation/networking/scaling.txt, 2017.→ pages 81[68] M. Herlihy, N. Shavit, and M. Tzafrir. Hopscotch Hashing. In DistributedComputing. Springer, 2008. → pages 89, 100[69] T. Hetherington, T. Rogers, L. Hsu, M. O’Connor, and T. Aamodt.Characterizing and Evaluating a Key-value Store Application onHeterogeneous CPU-GPU Systems. In Proceeding of the 2012 IEEEInternational Symposium on Performance Analysis of Systems andSoftware (ISPASS), 2012. → pages vi, 192[70] T. H. Hetherington, M. O’Connor, and T. M. Aamodt. Memcachedgpu:Scaling-up scale-out key-value stores. In Proceedings of the Sixth ACMSymposium on Cloud Computing, SoCC ’15, pages 43–57, New York, NY,USA, 2015. ACM. ISBN 978-1-4503-3651-2.doi:10.1145/2806777.2806836. URLhttp://doi.acm.org/10.1145/2806777.2806836. → pages vi[71] T. H. Hetherington, M. Lubeznov, D. Shah, and T. M. Aamodt. Edge:Event-driven gpu execution. In 2019 International Conference on ParallelArchitectures and Compilation Techniques (PACT). IEEE, 2019. → pagesvii, 127[72] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm fordeep belief nets. Neural computation, 18(7):1527–1554, 2006. → pages 1[73] U. Ho¨lzle. Brawny cores still beat wimpy cores, most of the time. IEEEMicro, July/August 2010. → pages 7[74] M. Houston and M. Mantor. AMD Graphic Core Next: Low Power HighPerformance Graphics & Parallel Compute. AMD Fusion DeveloperSummit, June 2011. → pages 50209[75] X. Huang, C. Rodrigues, S. Jones, I. Buck, and W.-M. Hwu. XMalloc: AScalable Lock-free Dynamic Memory Allocator for Many-core Machines.In Proceedings of the 2010 IEEE 10th International Conference onComputer and Information Technology (CIT), 2010. → pages 89[76] IDC. Worldwide server market rebounds sharply in fourth quarter asdemand for blades and x86 systems leads the way, February 2010. →pages 6[77] IDC. Hpc server market declined 11.6% in 2009, return to growth expectedin 2010, March 2010. → pages 6[78] IDC. Worldwide server market revenues increase 5.2% in the fourthquarter as demand in china once again drives the market forward.https://www.businesswire.com/news/home/20160309006515/en/Worldwide-Server-Market-Revenues-Increase-5.2-Fourth, March 2016. → pages6[79] Intel. Intel DPDK (Data Plane Development Kit). https://https://dpdk.org, .→ pages 177[80] Intel. Intel VTune Aplifier.https://software.intel.com/en-us/intel-vtune-amplifier-xe, . → pages 72[81] Intel. Accelerate big data insights.https://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-8800-4800-v4-product-families-brief.html, 2016. → pages3[82] Intel. Intel rack scale design (intel rsd).https://www.intel.com/content/www/us/en/architecture-and-technology/rack-scale-design-overview.html, 2017. → pages 8,201[83] Intel. Intel product specifications. https://ark.intel.com/, 2018. → pagesxii, 5[84] Intel. Intel threading building blocks (intel tbb).https://software.intel.com/en-us/tbb-documentation, 2018. → pages 3[85] Z. Istvan, G. Alonso, M. Blott, and K. Vissers. A flexible hash table designfor 10gbps key-value stores on fpgas. 
In Field Programmable Logic andApplications (FPL), 2013 23rd International Conference on, Sept 2013. →pages 121, 122, 123, 177210[86] B. Jenkins. Function for Producing 32bit Hashes for Hash Table Lookup.http://burtleburtle.net/bob/c/lookup3.c, 2006. → pages 38, 87, 98[87] S. Jones, P. A. Cuadra, D. E. Wexler, I. Llamas, L. V. Shah, J. F. Duluk Jr,and C. Lamb. Technique for computational nested parallelism, Dec. 62016. US Patent 9,513,975. → pages xvii, 20, 145, 152[88] J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. Wasi-ur Rahman,N. Islam, X. Ouyang, H. Wang, S. Sur, and D. Panda. Memcached Designon High Performance RDMA Capable Interconnects. In Proceedings of the2011 International Conference on Parallel Processing (ICPP), 2011. →pages 178[89] J. Jose, H. Subramoni, K. Kandalla, M. Wasi-ur Rahman, H. Wang,S. Narravula, and D. Panda. Scalable Memcached Design for InfiniBandClusters Using Hybrid Transports. In Proceedings of the 2012 12thIEEE/ACM International Symposium on Cluster, Cloud and GridComputing (CCGrid), 2012. → pages 178[90] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa,S. Bates, S. Bhatia, N. Boden, A. Borchers, et al. In-datacenterperformance analysis of a tensor processing unit. arXiv preprintarXiv:1704.04760, 2017. → pages 1, 7[91] S. Kato, S. Brandt, Y. Ishikawa, and R. Rajkumar. Operating systemschallenges for gpu resource management. In Proc. of the InternationalWorkshop on Operating Systems Platforms for Embedded Real-TimeApplications, pages 23–32, 2011. → pages 188, 189[92] S. Kato, K. Lakshmanan, A. Kumar, M. Kelkar, Y. Ishikawa, andR. Rajkumar. Rgem: A responsive gpgpu execution model for runtimeengines. In Real-Time Systems Symposium (RTSS), 2011 IEEE 32nd, pages57–66. IEEE, 2011. → pages 23, 130, 149, 185, 186[93] S. Kato, M. McThrow, C. Maltzahn, and S. A. Brandt. Gdev: First-classgpu resource management in the operating system. In USENIX AnnualTechnical Conference, pages 401–412. Boston, MA;, 2012. → pages 188[94] The OpenCL Specification. Khronos OpenCL Working Group, 1.1 edition,2011. → pages 3, 16[95] J. Kim, K. Jang, K. Lee, S. Ma, J. Shim, and S. Moon. Nba (networkbalancing act): A high-performance packet processing framework for211heterogeneous processors. In Proceedings of the Tenth EuropeanConference on Computer Systems, EuroSys ’15. ACM, 2015. → pages181, 182[96] S. Kim, S. Huh, X. Zhang, Y. Hu, A. Wated, E. Witchel, and M. Silberstein.Gpunet: Networking abstractions for gpu programs. In OSDI, volume 14,pages 6–8, 2014. → pages 24, 126, 180, 184, 185, 188[97] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F. Kaashoek. The clickmodular router. ACM Transactions on Computer Systems (TOCS), 18(3):263–297, 2000. → pages 182[98] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: A 32-waymultithreaded sparc processor. IEEE micro, 25(2):21–29, 2005. → pages2, 8, 34[99] A. Krizhevsky. Cuda-convnet. https://github.com/dnouri/cuda-convnet,2015. → pages 157[100] I. Kuon and J. Rose. Measuring the gap between fpgas and asics.Computer-Aided Design of Integrated Circuits and Systems, IEEETransactions on, 26, 2007. → pages 4[101] M. LeBeane, B. Potter, A. Pan, A. Dutu, V. Agarwala, W. Lee, D. Majeti,B. Ghimire, E. Van Tassell, S. Wasmundt, et al. Extended task queuing:active messages for heterogeneous systems. In High PerformanceComputing, Networking, Storage and Analysis, SC16: InternationalConference for, pages 933–944. IEEE, 2016. → pages 134, 183[102] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. 
Gradient-based learningapplied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. → pages 157[103] K. Lee and Facebook. Introducing big basin: Our next-generation aihardware.https://code.facebook.com/posts/1835166200089399/introducing-big-basin-our-next-generation-ai-hardware/, 2017. → pages7[104] A. Leung, N. Vasilache, B. Meister, M. Baskaran, D. Wohlford, C. Bastoul,and R. Lethin. A Mapping Path for Multi-GPGPU Accelerated Computersfrom a Portable High Level Programming Abstraction. In Proceedings ofthe 3rd Workshop on General-Purpose Computation on Graphics212Processing Units, GPGPU ’10, pages 51–61, New York, NY, USA, 2010.ACM. → pages 61[105] J. Leverich and C. Kozyrakis. Reconciling High Server Utilization andSub-millisecond Quality-of-Service. In Proceedings of the Ninth EuropeanConference on Computer Systems (EuroSys), 2014. → pages 7, 107[106] H. Lim, D. Han, D. G. Andersen, and M. Kaminsky. Mica: A holisticapproach to fast in-memory key-value storage. In 11th Usenix Symposiumon Networked Systems Design and Implementation (NSDI ’14), 2014. →pages 89, 90, 96, 100, 102, 107, 121, 122, 123, 176[107] K. Lim, J. Chang, T. Mudge, P. Ranganathan, S. K. Reinhardt, and T. F.Wenisch. Disaggregated memory for expansion and sharing in bladeservers. In Proceedings of the 36th Annual International Symposium onComputer Architecture, pages 267–278, 2009. → pages 8, 201[108] K. Lim, D. Meisner, A. G. Saidi, P. Ranganathan, and T. F. Wenisch. ThinServers with Smart Pipes: Designing SoC Accelerators for Memcached.SIGARCH Computer Architecture News, June 2013. → pages 177[109] Linux. Linux programmer’s manual, pthreads - posix threads.http://man7.org/linux/man-pages/man7/pthreads.7.html, 2018. → pages 3[110] R. A. Lorie and H. R. Strong Jr. Method for conditional branch executionin simd vector processors, Mar. 6 1984. US Patent 4,435,758. → pages 25[111] T. P. Lottes, D. Wexler, C. Duttweiler, S. Treichler, L. Durant, andP. Cuadra. System and method for runtime scheduling of gpu tasks, Mar. 62013. US Patent App. 13/787,660. → pages xvii, 145[112] N. Mathewson and N. Provos. Libevent. http://libevent.org/. → pages 30[113] N. Mehta. Xilinx 7 series fpgas embedded memory advantages.https://www.xilinx.com/support/documentation/white papers/wp377 7Series Embed Mem Advantages.pdf, 2012. → pages 4[114] N. Mehta. Xilinx 7 series fpgas: The logical advantage.https://www.xilinx.com/support/documentation/white papers/wp405-7Series-Logical-Advantage.pdf, 2012. → pages4[115] Memcached. A distributed memory object caching system.http://www.memcached.org. → pages 6, 8, 29, 30213[116] P. Meng, M. Jacobsen, and R. Kastner. Fpga-gpu-cpu heterogenousarchitecture for real-time cardiac physiological optical mapping. InField-Programmable Technology (FPT), 2012 International Conference on,pages 37–42. IEEE, 2012. → pages 126, 134[117] J. Menon, M. De Kruijf, and K. Sankaralingam. igpu: exception supportand speculative execution on gpus. In ACM SIGARCH ComputerArchitecture News, volume 40, pages 72–83. IEEE Computer Society,2012. → pages 185, 186[118] R. Merritt. ARM CTO: Power Surge Could Create ‘Dark Silicon’.EETimes, 22 October 2009. → pages 3[119] Michael Shebanow. ECE 498 AL : Programming Massively ParallelProcessors, Lecture 12.http://courses.ece.uiuc.edu/ece498/al1/Archive/Spring2007, February2007. → pages 25[120] Microsoft. Azure machine learning.https://azure.microsoft.com/en-ca/overview/machine-learning, 2018. →pages 1[121] G. E. Moore. 
Cramming more components onto integrated circuits,electronics,(38) 8, 1965. → pages 2[122] S. Naffziger, J. Warnock, and H. Knapp. Se2 when processors hit thepower wall (or” when the cpu hits the fan”). In Solid-State CircuitsConference, 2005. Digest of Technical Papers. ISSCC. 2005 IEEEInternational, pages 16–17. IEEE, 2005. → pages 2[123] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, andY. N. Patt. Improving gpu performance via large warps and two-level warpscheduling. In Proceedings of the 44th Annual IEEE/ACM InternationalSymposium on Microarchitecture, pages 308–317. ACM, 2011. → pages191[124] R. Nishtala, H. Fugal, S. Grimm, M. Kwiatkowski, H. Lee, H. C. Li,R. Mcelroy, M. Paleczny, D. Peek, P. Saab, D. Stafford, T. Tung,V. Venkataramani, and F. Inc. Scaling Memcache at Facebook. InProceedings of the 10th USENIX Symposium on Networked Systems Designand Implementation (NSDI), 2013. → pages 8, 31, 85, 90, 178[125] ntop. PF RING. http://www.ntop.org/products/pf ring/. → pages 78, 84,96214[126] NVIDIA. Nvidia tesla p100.https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf, 2016. → pages 21, 23, 130, 149,201[127] NVIDIA. Cuda c programming guide.https://docs.nvidia.com/cuda/pdf/CUDA Dynamic Parallelism ProgrammingGuide.pdf, 2017. → pages 22[128] NVIDIA. Gpudirect. https://developer.nvidia.com/gpudirect, 2017. →pages 21[129] NVIDIA. Multi-process service.https://docs.nvidia.com/deploy/pdf/CUDA Multi Process ServiceOverview.pdf, 2017. → pages 132, 147, 183[130] NVIDIA. Nvidia management library (nvml).https://developer.nvidia.com/nvidia-management-library-nvml, 2017. →pages 157[131] NVIDIA. Nvidia tesla v100 gpu architecture.http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf, 2017. → pages xii, 5, 17, 21, 23, 52, 130, 191,201[132] NVIDIA Corporation. NVIDIA’s Next Generation CUDA ComputeArchitecture: Fermi.http://www.nvidia.com/content/PDF/fermi white papers/NVIDIA Fermi Compute Architecture Whitepaper.pdf, 2009. → pages115[133] NVIDIA Corporation. NVIDIA CUDA C Programming Guide v4.2.http://developer.nvidia.com/nvidia-gpu-computing-documentation/, 2012.→ pages 3, 16[134] NVIDIA Corporation. NVIDIA’s Next Generation CUDA ComputeArchitecture: Kepler GK110.http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf, 2012. → pages20[135] NVIDIA Corporation. Developing a Linux Kernel Module usingGPUDirect RDMA.http://docs.nvidia.com/cuda/gpudirect-rdma/index.html, 2014. → pages 32215[136] NVIDIA Corporation. NVIDIA GeForce GTX 750 Ti: FeaturingFirst-Generation Maxwell GPU Technology, Designed for ExtremePerformance per Watt. http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce-GTX-750-Ti-Whitepaper.pdf, 2014. →pages 120[137] OpenMP. Openmp application programming interface.https://www.openmp.org/wp-content/uploads/openmp-4.5.pdf, 2015. →pages 3[138] Oracle. Oracle machine learning.http://www.oracle.com/technetwork/database/options/oml/overview/index.html, 2018. → pages 1[139] M. S. Orr, B. M. Beckmann, S. K. Reinhardt, and D. A. Wood. Fine-graintask aggregation and coordination on gpus. ACM SIGARCH ComputerArchitecture News, 42(3):181–192, 2014. → pages 200[140] J. Ousterhout. Why threads are a bad idea (for most purposes). InPresentation given at the 1996 Usenix Annual Technical Conference,volume 5. San Diego, CA, USA, 1996. → pages 33, 131[141] R. Pagh and F. F. Rodler. Cuckoo hashing. Journal of Algorithms, 51(2):122–144, 2004. 
→ pages 101[142] A. Papakonstantinou, K. Gururaj, J. Stratton, D. Chen, J. Cong, and W.-M.Hwu. Fcuda: Enabling efficient compilation of cuda kernels onto fpgas. InApplication Specific Processors, 2009. SASP ’09. IEEE 7th Symposium on,July 2009. → pages 4, 115[143] J. J. K. Park, Y. Park, and S. Mahlke. Chimera: Collaborative preemptionfor multitasking on a shared gpu. ACM SIGARCH Computer ArchitectureNews, 43(1):593–606, 2015. → pages 23, 130, 149, 185, 186[144] D. A. Patterson and J. L. Hennessy. Computer Architecture: A QuantitativeApproach. Morgan Kaufmann Publishers Inc., 1990. ISBN 1-55880-069-8.→ pages 90[145] PCI Express. PCI Express 3.0 Frequently Asked Questions.https://web.archive.org/web/20140201172536/http://www.pcisig.com/news room/faqs/pcie3.0 faq/#EQ2. → pages 21216[146] J. Power, J. Hestness, M. Orr, M. Hill, and D. Wood. gem5-gpu: Aheterogeneous cpu-gpu simulator. Computer Architecture Letters, 13(1),Jan 2014. ISSN 1556-6056. doi:10.1109/LCA.2014.2299539. URLhttp://gem5-gpu.cs.wisc.edu. → pages 156[147] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides,J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray, M. Haselman,S. Hauck, S. Heil, A. Hormati, J.-Y. Kim, S. Lanka, J. Larus, E. Peterson,S. Pope, A. Smith, J. Thong, P. Y. Xiao, and D. Burger. A ReconfigurableFabric for Accelerating Large-Scale Datacenter Services. In Proceedings ofthe 41st Annual International Symposium on Computer Architecture(ISCA), 2014. → pages 4, 7, 115[148] M. Rhu and M. Erez. The dual-path execution model for efficient gpucontrol flow. In High Performance Computer Architecture (HPCA2013),2013 IEEE 19th International Symposium on, pages 591–602. IEEE, 2013.→ pages 191[149] T. G. Rogers, M. O’Connor, and T. M. Aamodt. Cache-consciouswavefront scheduling. In Proceedings of the 2012 45th Annual IEEE/ACMInternational Symposium on Microarchitecture, pages 72–83. IEEEComputer Society, 2012. → pages 166[150] T. G. Rogers, M. O’Connor, and T. M. Aamodt. Divergence-aware warpscheduling. In Proceedings of the 46th Annual IEEE/ACM InternationalSymposium on Microarchitecture, pages 99–110. ACM, 2013. → pages 191[151] F. Rosenblatt. The perceptron, a perceiving and recognizing automaton(Project Para). Cornell Aeronautical Laboratory, 1957. → pages 1[152] C. J. Rossbach, J. Currey, M. Silberstein, B. Ray, and E. Witchel. Ptask:operating system abstractions to manage gpus as compute devices. InProceedings of the Twenty-Third ACM Symposium on Operating SystemsPrinciples, pages 233–248. ACM, 2011. → pages 188[153] D. Rossetti. Gpudirect: integrating the gpu with a network interface. InGPU Technology Conference, 2015. → pages 185[154] Y. S. Shao and D. Brooks. Research infrastructures for hardwareaccelerators. Synthesis Lectures on Computer Architecture, 10(4):1–99,2015. → pages 4217[155] M. C. Shebanow, J. Choquette, B. W. Coon, S. J. Heinrich, A. Kalaiah,J. R. Nickolls, D. Salinas, M. Y. Siu, T. Thorn, N. Wang, and other. Traphandler architecture for a parallel processing unit, Aug. 27 2013. US Patent8,522,000. → pages xvii, 20, 22, 23, 145[156] L. Shi, H. Chen, J. Sun, and K. Li. vCUDA: GPU-AcceleratedHigh-Performance Computing in Virtual Machines. IEEE Transactions onComputers, June 2012. → pages 188, 189[157] L.-W. Shieh, K.-C. Chen, H.-C. Fu, P.-H. Wang, and C.-L. Yang. Enablingfast preemption via dual-kernel support on gpus. In Design AutomationConference (ASP-DAC), 2017 22nd Asia and South Pacific, pages 121–126.IEEE, 2017. → pages 23, 130, 149, 185, 186[158] M. Silberstein, B. 
Ford, I. Keidar, and E. Witchel. Gpufs: integrating a filesystem with gpus. In ACM SIGPLAN Notices, volume 48, pages 485–498.ACM, 2013. → pages 184, 185, 188[159] Y. Suzuki, S. Kato, H. Yamada, and K. Kono. Gpuvm: Gpu virtualizationat the hypervisor. IEEE Transactions on Computers, 65(9):2752–2766,2016. → pages 188, 189[160] Y. Suzuki, H. Yamada, S. Kato, and K. Kono. Towards multi-tenant gpgpu:Event-driven programming model for system-wide scheduling on sharedgpus. In Proceedings of the Workshop on Multicore and Rack-scaleSystems, 2016. → pages 184[161] T. Jablin et al. Automatic CPU-GPU Communication Management andOptimization. In PLDI, June 2011. → pages 61[162] I. Tanasic, I. Gelado, J. Cabezas, A. Ramirez, N. Navarro, and M. Valero.Enabling preemptive multiprogramming on gpus. In ACM SIGARCHComputer Architecture News, volume 42, pages 193–204. IEEE Press,2014. → pages 23, 130, 149, 150, 185, 186, 187[163] I. Tanasic, I. Gelado, J. Cabezas, A. Ramirez, N. Navarro, and M. Valero.Enabling Preemptive Multiprogramming on GPUs. In Proceedings of the41st International Symposium on Computer Architecture (ISCA), 2014. →pages 115[164] ThinkTank Energy Products. Watts up? Plug Load Meters.https://www.wattsupmeters.com/secure/index.php. → pages 97218[165] top500. Green500 list for june 2017.https://www.top500.org/green500/lists/2017/06/, 2017. → pages 6, 34[166] top500. Top 10 sites for november 2017.https://www.top500.org/lists/2017/11/, 2017. → pages 6, 34[167] J. Turley. Introduction to intel architecture.https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-introduction-basics-paper.pdf, 2014. → pages3[168] S. Ueng, M. Lathara, S. Baghsorkhi, and W. Hwu. CUDA-Lite: ReducingGPU Programming Complexity. In Amaral, Jos, editor, Languages andCompilers for Parallel Computing, volume 5335 of Lecture Notes inComputer Science, pages 1–15. Springer Berlin / Heidelberg, 2008. →pages 61[169] G. Urdaneta, G. Pierre, and M. van Steen. Wikipedia Workload Analysisfor Decentralized Hosting. Elsevier Computer Networks, 53(11):1830–1845, July 2009.http://www.globule.org/publi/WWADH comnet2009.html. → pages 53[170] G. Vasiliadis, L. Koromilas, M. Polychronakis, and S. Ioannidis. Gaspp: Agpu-accelerated stateful packet processing framework. In USENIX AnnualTechnical Conference, pages 321–332, 2014. → pages 126, 181, 199[171] N. Vijaykumar, K. Hsieh, G. Pekhimenko, S. Khan, A. Shrestha, S. Ghose,A. Jog, P. B. Gibbons, and O. Mutlu. Zorua: A holistic approach toresource virtualization in gpus. In Microarchitecture (MICRO), 2016 49thAnnual IEEE/ACM International Symposium on, pages 1–14. IEEE, 2016.→ pages 188, 189[172] W. Fung, et al. . Dynamic Warp Formation and Scheduling for EfficientGPU Control Flow. In Proc. 40th IEEE/ACM Int’l Symp. onMicroarchitecture, 2007. → pages 25, 51[173] W. Fung, et al. Dynamic Warp Formation: Efficient MIMD Control Flowon SIMD Graphics Hardware. ACM Trans. Archit. Code Optim., 6(2):1–37,2009. ISSN 1544-3566.doi:{http://doi.acm.org/10.1145/1543753.1543756}. → pages 25[174] J. Wang and S. Yalamanchili. Characterization and analysis of dynamicparallelism in unstructured gpu applications. In Workload Characterization219(IISWC), 2014 IEEE International Symposium on, pages 51–60. IEEE,2014. → pages 20, 22, 156[175] J. Wang, N. Rubin, A. Sidelnik, and S. Yalamanchili. Dynamic threadblock launch: A lightweight execution mechanism to support irregularapplications on gpus. ACM SIGARCH Computer Architecture News, 43(3):528–540, 2016. → pages xvii, 22, 145[176] Z. Wang, J. 
Yang, R. Melhem, B. Childers, Y. Zhang, and M. Guo.Simultaneous multikernel gpu: Multi-tasking throughput processors viafine-grained sharing. In High Performance Computer Architecture(HPCA), 2016 IEEE International Symposium on, pages 358–369. IEEE,2016. → pages 23, 130, 149, 158, 185, 186[177] A. Wiggins and J. Langston. Enhancing the Scalability of Memcached.https://software.intel.com/sites/default/files/m/0/b/6/1/d/45675-memcached 05172012.pdf. → pages 90, 107,175[178] B. Wu, X. Liu, X. Zhou, and C. Jiang. Flep: Enabling flexible and efficientpreemption on gpus. In Proceedings of the Twenty-Second InternationalConference on Architectural Support for Programming Languages andOperating Systems, pages 483–496. ACM, 2017. → pages 185, 186, 187,188[179] H. Wu, G. Diamos, J. Wang, S. Cadambi, S. Yalamanchili, andS. Chakradhar. Optimizing Data Warehousing Applications for GPUs usingKernel Fusion/Fission. In Proceedings of the 26th International Paralleland Distributed Processing Symposium Workshops & PhD Forum(IPDPSW), 2012. → pages 74, 180[180] Xilinx. Using embedded multipliers in spartan-3 fpgas.https://www.xilinx.com/support/documentation/application notes/xapp467.pdf, 2003. → pages 4[181] Q. Xu, H. Jeon, K. Kim, W. W. Ro, and M. Annavaram. Warped-slicer:efficient intra-sm slicing through dynamic resource partitioning for gpumultiprogramming. In Proceedings of the 43rd International Symposiumon Computer Architecture, pages 230–242. IEEE Press, 2016. → pages 158[182] T. T. Yeh, A. Sabne, P. Sakdhnagool, R. Eigenmann, and T. G. Rogers.Pagoda: Fine-grained gpu resource virtualization for narrow tasks. InProceedings of the 22nd ACM SIGPLAN Symposium on Principles and220Practice of Parallel Programming, pages 221–234. ACM, 2017. → pages24, 188[183] L. Zeno, A. Mendelson, and M. Silberstein. Gpupio: The case fori/o-driven preemption on gpus. In Proceedings of the 9th Annual Workshopon General Purpose Processing using Graphics Processing Unit, pages63–71. ACM, 2016. → pages 79, 80, 185, 188[184] K. Zhang, K. Wang, Y. Yuan, L. Guo, R. Lee, and X. Zhang. Mega-kv: Acase for gpus to maximize the throughput of in-memory key-value stores.Proceedings of the VLDB Endowment, 8(11), 2015. → pages 124, 179221
