UBC Theses and Dissertations


Error resilience evaluation on GPGPU applications Fang, Bo 2014


Full Text

Error Resilience Evaluation on GPGPU Applications

by

Bo Fang

Master of Software Systems, The University of British Columbia, 2011

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in The Faculty of Graduate and Postdoctoral Studies (Electrical and Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

August 2014

© Bo Fang 2014

Abstract

While graphics processing units (GPUs) have gained wide adoption as accelerators for general-purpose applications (GPGPU), the end-to-end reliability implications of their use have not been quantified. Fault injection is a widely used method for evaluating the reliability of applications. However, building a fault injector for GPGPU applications is challenging due to their massive parallelism, which makes it difficult to achieve representativeness while being time-efficient.

This thesis makes three key contributions. First, it presents the design of a fault-injection methodology to evaluate the end-to-end reliability properties of application kernels running on GPUs. Second, it introduces a fault-injection tool that uses real GPU hardware and offers a good balance between the representativeness and the efficiency of the fault-injection experiments. Third, it characterizes the error resilience of twelve GPGPU applications. In addition, this thesis provides preliminary insights into correlations between algorithm properties and the measured silent data corruption rates of the applications.

Preface

This thesis is based on work conducted by myself in collaboration with Dr. Karthik Pattabiraman, Dr. Matei Ripeanu, and Dr. Sudhanva Gurumurthi. The work was published as a conference paper at the 2014 IEEE International Symposium on Performance Analysis of Systems and Software [9]. I was responsible for devising and validating the solution, evaluating it and analyzing the results, and writing the paper.
Karthik, Matei, and Sudhanva were responsible for guiding me through the solution reasoning, experiment design, and results analysis, as well as editing and writing portions of the paper.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
List of Acronyms
Acknowledgements
Dedication
1 Introduction
2 Background and Fault Model
  2.1 Dependability Metrics: Error Resilience and Vulnerability
  2.2 Characterizing Error Resilience
  2.3 The Fault Model
  2.4 GPU Architecture and Programming Model
3 Methodology
  3.1 Phase I: Grouping
  3.2 Phase II: Profiling
  3.3 Phase III: Fault Injection
4 Characterization Study
  4.1 Benchmarks
  4.2 Heuristic Validation
    4.2.1 Validation of Design Decisions
    4.2.2 Validation of Grouping
  4.3 Characterization of Error Resilience
  4.4 Statistical Significance of the Fault Injection
  4.5 Crash Causes and Latency
  4.6 Use Cases
    4.6.1 Scenario I: SDC Proneness of Different Code Sections
    4.6.2 Scenario II: Comparing Different Algorithms
    4.6.3 Scenario III: Guiding Configurations
  4.7 Limitations
  4.8 Summary
5 Discussion
  5.1 Search-based Application
  5.2 Bit-wise Operation
  5.3 Averaged Out
  5.4 Graph Processing
  5.5 Linear Algebra and Grid Operation
  5.6 Summary
6 Related Work
7 Conclusion and Future Work
Bibliography
List of Tables

3.1 The group-identification process leads to classifying the benchmarks into three categories. See Table 4.3 for details about the benchmarks.
4.1 Fault-injection experiment information.
4.2 Description of CUDA hardware exceptions.
4.3 Benchmark properties. LOC: lines of code. Scale: number of blocks in a grid and number of threads in a block (generally a 3D*3D space). Launch times: the number of iterations for which the kernel is launched.
5.1 Benchmark categories and the mapping to the dwarfs of parallelism.

List of Figures

3.1 Overview of our fault-injection methodology, including grouping, profiling, fault injection, and results aggregation.
3.2 Percentage of threads in each group relative to the total number of threads. Left: LBM. Right: Monte Carlo. See Table 4.3 for details about the benchmarks.
3.3 Cumulative distribution function (CDF) of groups of BFS.
3.4 Phase III: the fault-injection process.
4.1 The highest number of loop iterations executed by each benchmark kernel.
4.2 Comparison of SDC and crash rates for different iteration thresholds. Left: SDC rate. Right: crash rate.
4.3 The instruction classification of two random threads from different groups.
4.4 SDC (top) and crash (bottom) rates with error bars representing the 95% confidence interval for each kernel.
4.5 Instruction-level error-resilience study.
4.6 Accumulated SDC rates for benchmarks.
4.7 Number of dynamic instructions executed per thread by benchmarks.
4.8 Number of fault-injection runs at which convergence occurs for each benchmark.
4.9 Root-cause breakdown of crashes for AES and MAT. Left: AES. Right: MAT.
4.10 Crash latency analysis for AES and MAT. Top: AES. Bottom: MAT.

List of Acronyms

CUDA Compute Unified Device Architecture
GPU Graphics Processing Unit
GPGPU General-Purpose GPU
RTL Register Transfer Language
MPI Message Passing Interface
SDC Silent Data Corruption
SIMT Single Instruction, Multiple Thread
SM Streaming Multiprocessor
ALU Arithmetic and Logic Unit
LSU Load-Store Unit
ECC Error-Correcting Code
PC Program Counter
PTX Parallel Thread Execution
ISA Instruction Set Architecture
SASS Source and Assembly
FP Floating Point

Acknowledgements

First of all, I would like to thank my advisors, Dr. Matei Ripeanu and Dr. Karthik Pattabiraman, for their support during the past three years. Matei and Karthik generously spent their time and energy to help me improve in many ways. Their wisdom, experience, and personalities make them role models for me.

I would also like to thank my labmates for their advice and suggestions on my work. I really enjoyed my time in the lab and working with such great people.

Finally, special thanks to my family. Without them, it would not have been possible for me to be here and accomplish what I have done.

Dedication

To my parents, for their love and support all these years. To my wife and my two daughters, who bring a lot of fun and love into my life.

Chapter 1

Introduction

GPUs were originally designed for applications that are intrinsically fault-tolerant (e.g., image rendering, in which a few wrong pixels might not be noticeable to human eyes). Today, however, GPUs are widely used to accelerate general-purpose applications such as DNA sequencing and linear algebra. It therefore becomes critical to understand the behavior of these applications in the presence of hardware faults.
This is especially important as the rate of hardware faults increases due to the effects of technology scaling and manufacturing variations [7]. With shrinking process technology, the primary cause of transient faults is random noise. Smaller transistor features require a lower critical charge to hold and change bits, which leads to faster microprocessors but also to higher transient fault rates. Sheaffer et al. [26] stated in 2006 that current trends, expected to continue, show soft error rates increasing exponentially at a rate of 8% per technology generation; hence, soft errors will soon become a driving concern for graphics architectures.

GPU manufacturers have invested significant effort in improving GPU reliability. For instance, starting with the Fermi models, NVIDIA GPUs support error-correcting codes (ECC) to protect register files, DRAM, caches, and on-chip memory from transient faults. However, transient hardware faults can also occur in the computational or control data paths, and can propagate to registers and/or memory. Such faults would not be detected by ECC, because they would cause a correct ECC to be calculated on faulty data in registers and/or memory. As a result, in spite of these mechanisms, GPU applications can still be affected by transient hardware faults. Further, hardware-protection techniques such as ECC can incur performance and energy overheads, and hence may not be enabled by users.

The long-term goal of our work is to develop application-specific, software-based fault-tolerance mechanisms for GPGPU applications. As a first step towards this goal, in this thesis we investigate the error-resilience characteristics of these applications by performing fault-injection experiments. Fault injection is the act of perturbing an application to emulate faults, then studying the effects of those faults on the application outcome [17].
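To make the notion of "perturbing an application to emulate faults" concrete, the single-bit-flip perturbation that such injectors commonly apply to a value can be sketched as follows. This is a minimal, illustrative Python sketch and not part of GPU-Qin, which performs the equivalent operation on GPU register state through cuda-gdb (Chapter 3).

```python
import random

def flip_one_bit(value: int, width: int = 32) -> int:
    """Return `value` with one uniformly chosen bit inverted."""
    bit = random.randrange(width)   # fault site: a random bit position
    return value ^ (1 << bit)       # XOR inverts exactly that bit

original = 0x000000FF
faulty = flip_one_bit(original)
# The corrupted value differs from the original in exactly one bit.
assert bin(original ^ faulty).count("1") == 1
```

Whether such a flipped value leads to a benign outcome, a crash, or a silent data corruption is precisely what the fault-injection experiments in this thesis measure.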
While there has been substantial work in the realm of fault injection for CPU applications [1, 28], relatively few studies have explored the reliability properties of GPGPU applications or proposed methodologies and tools to support this exploration.

Prior work [35] has performed fault injection at the source-code level (i.e., mutating the source code of a program). Unfortunately, injecting faults at this level is coarse-grained and does not accurately represent hardware faults, which occur at the granularity of microarchitectural units and instructions. To inject hardware faults, the standard approaches are to inject faults into a register transfer language (RTL) model or a microarchitectural simulator [3]. However, these approaches are often considerably slower than execution on real hardware, and can be a significant bottleneck when performing the thousands of fault-injection experiments needed for adequate coverage. One way to alleviate the performance bottleneck is to execute only a small section of the application; however, this approach would not yield insights into the end-to-end behavior of the application under faults. In addition, architectural simulators often do not capture the error-detection and error-handling features of the processor, and hence injections performed using such simulators may not be representative of the behaviour on real hardware (we
present the results of an error-resilience characterization of GPGPU applications using a state-of-the-art GPU simulator in Chapter 4).

To avoid the above issues, we choose to perform fault injection at the assembly-language level of GPGPU applications using a GPU-based debugger. While not as detailed as fault injection at the microarchitectural level, this approach allows us to model faults at the granularity of individual instructions, and is thus more precise than injecting at the level of the high-level language. Compared to microarchitectural-level injectors, this approach is much more efficient and scalable, all the more so because we natively execute the application on the GPU hardware. To the best of our knowledge, we are the first to propose an efficient instruction-level fault-injection tool, GPU-Qin, for GPGPU applications executing on actual GPU hardware.

This thesis makes the following contributions:

1. It proposes a methodology to evaluate the resilience of GPGPU applications, and describes the design decisions, and the corresponding trade-offs between injection coverage and efficiency, made to handle the massive parallelism of GPU applications (Chapter 3).
2. It builds a fault-injection tool, GPU-Qin, that is able to inject faults into applications running on actual GPU hardware (Chapter 3). (The tool is available via https://github.com/DependableSystemsLab/GPU-Injector.)
3. It demonstrates the use of the fault injector by performing an end-to-end error-resilience characterization of twelve different GPGPU applications (Chapter 4).
4. It provides initial insights that explain the error resilience of these applications (Chapter 5).

We find that there are significant variations in the error-resilience characteristics of GPU applications. For example, the SDC rates range from 1% to 38% and the crash rates range from 5% to 70% across all benchmarks. We also observe that the main source of crashes in GPU systems is memory-related exceptions. To understand the variations in SDC rates, we find that it is beneficial to consider the algorithmic characteristics of the applications.

Chapter 2

Background and Fault Model

This chapter offers background information on the dependability metrics associated with this work, the fault model used, and the NVIDIA GPU architecture and programming model.

2.1 Dependability Metrics: Error Resilience and Vulnerability

Laprie et al.
[19] defined the "fault-error-failure" chain as follows:

A system failure occurs when the delivered service deviates from the specified service. The failure occurred because the system was erroneous: an error is that part of the system state which is liable to lead to failure. The cause, in its phenomenological sense, of an error is a fault.

The error resilience of a system is defined as its ability to withstand errors should they occur. An error in the program may or may not result in a failure. Errors that do not cause failures are known as benign outcomes. Program failures can be further classified into crashes (i.e., hardware exceptions), hangs, and silent data corruptions (SDCs) (i.e., incorrect outputs). In the context of our work, we define error resilience as the probability that the application does not have a failure outcome (i.e., crash, hang, or SDC) after a hardware fault occurs. Different hardware platforms usually feature distinct fault-tolerance mechanisms, which manifest as exceptions that crash applications or even the whole system. Different applications correspond to different instruction executions, which determine how faults propagate. Thus, error resilience is a property of both the platform and the application. Since our evaluation is performed on a single hardware platform, i.e., NVIDIA GPGPUs, error resilience in our context becomes a property of the application alone.

Vulnerability is the probability that the system experiences a fault that causes a failure (e.g., an SDC). Note that vulnerability is different from error resilience: error resilience is the conditional probability of the program not experiencing a failure given that a fault has occurred. We focus on error resilience in this thesis because, as long as hardware faults do not propagate to the software level and cause a failure, the system and the applications running on it are unlikely to be problematic.
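The distinction between these metrics can be made concrete with a small calculation over fault-injection outcomes. The tallies below are invented purely for illustration; the measured rates for our benchmarks appear in Chapter 4.

```python
# Hypothetical tallies of outcomes over activated faults in one campaign.
outcomes = {"benign": 612, "sdc": 95, "crash": 280, "hang": 13}
activated = sum(outcomes.values())

failures = outcomes["sdc"] + outcomes["crash"] + outcomes["hang"]
resilience = 1 - failures / activated   # P(no failure | fault activated)
sdc_rate = outcomes["sdc"] / activated  # P(SDC | fault activated)

print(f"resilience = {resilience:.1%}, SDC rate = {sdc_rate:.1%}")
```

Note that these are conditional probabilities given an activated fault; vulnerability would additionally fold in the probability that a fault occurs and is activated in the first place.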
We are interested in developing and evaluating fault-tolerance mechanisms that add minimal overhead for GPGPU applications.

2.2 Characterizing Error Resilience

There are two commonly used methods to evaluate error resilience:

Beam testing: This method refers to the use of neutron-source devices (i.e., neutron beams) to shower neutrons on targets (e.g., systems, boards, or components) [12] to trigger radiation-induced faults. Targets exposed to the neutron beam experience higher fault rates than in normal operation, thus enabling accelerated testing. The main advantage of this method is that it represents realistic faults. However, the associated costs are high because it requires a neutron source, and it offers low controllability. Further, neutron-beam time is often limited, which means that the experiment can be run only for a limited time.

Fault injection: This is a procedure to introduce faults in a systematic, controlled manner and study the system's behavior. Fault-injection techniques can generally be categorized into hardware-based and software-based fault injection. In this thesis, we only consider software-based techniques. Software-based fault-injection techniques typically emulate the effects of hardware faults on the software by perturbing the values of selected data/instructions in the program. Fault injection's main limitation is that it can be difficult to obtain sufficient coverage and representativeness. However, the method is relatively low-cost because it requires no special equipment. It also offers a high level of controllability and can be repeated as many times as desired. Therefore, we choose fault injection in this work.

As mentioned before, fault injection can be performed at the RTL or microarchitectural levels. However, these methods are not scalable because they require detailed RTL or microarchitectural simulators.
For this reason, we perform fault injection at a higher level, namely at the level of assembly-code instructions. Our goal is to obtain sufficient coverage in terms of the number of instructions executed, rather than in terms of the proportion of hardware state covered by the injections, as is typical of RTL/microarchitectural fault injection.

An analogy for such characterization is political polling. Technical details aside, the idea of opinion polling, i.e., surveying public opinion from a particular sample, is quite similar to our fault-injection study. The whole population is analogous to the total number of instructions executed by an application, and surveying a small number of people (usually 1,000 to 10,000) about a topic is analogous to performing fault injections on randomly selected instructions. Polls, however, because they are based on samples of populations, are subject to sampling error, which reflects the effects of chance and uncertainty in the sampling process. A margin of error, which represents the uncertainty of the sampling, is therefore necessary for a survey. In most cases, a 3% margin of error at 95% confidence is obtained with a group of 1,000 people, and a 1% margin of error when the group is increased to 10,000.

2.3 The Fault Model

Hardware faults can be broadly classified as transient or permanent. Transient faults are usually "one-off" events and occur non-deterministically, while permanent faults persist at a given location. Further, transient faults are caused by external events such as cosmic rays and overheated components, while permanent hardware faults are usually caused by manufacturing or design faults. Transient fault rates have been increasing due to diminishing noise margins, smaller voltages, and shrinking microprocessor feature sizes [5]. We focus on transient faults in this study.

We consider transient faults in the functional units of the GPU processor.
Examples are faults in the arithmetic and logic unit (ALU) and the load-store unit (LSU). We do not consider faults in caches, memory, or register files, because we assume that they are protected by ECC. This is the case for recent GPUs such as the NVIDIA Fermi GPU. We use the single-bit-flip model in this study because it is the de facto fault model adopted in studies of transient faults [13, 34, 35]. However, our fault injector can support both single- and multiple-bit flips by choosing the corresponding fault-generation functions, at no extra cost.

2.4 GPU Architecture and Programming Model

We focus on GPGPU applications implemented on top of the NVIDIA Compute Unified Device Architecture (CUDA), a widely adopted programming model and toolset for GPUs. The CUDA programming model defines a GPU application as a control program that runs on the host and a computation program (i.e., the kernel) that runs on GPU devices without interfering with the CPU. The kernel is implemented as a collection of functions in a language that is similar to C, but has annotations for identifying GPU code and for delineating different types of memory spaces on the GPU.

CUDA kernels use a single-instruction/multiple-thread (SIMT) model that exploits the massive parallelism of GPU devices. From a software perspective, CUDA abstracts the SIMT model in the following hierarchy: kernels, blocks, and threads. A CUDA kernel consists of blocks, and a block consists of threads. Fine-grained data parallelism, thread parallelism, coarse-grained data parallelism, and task parallelism can all be expressed through this hierarchy. From a hardware perspective, blocks of threads run on hardware units named streaming multiprocessors (SMs), which feature a shared memory space for the threads inside the same block. Inside a block, threads are scheduled in fixed groups of 32 threads called warps.
All the threads in a warp execute the same instructions, but with different data values.

In the CUDA programming model, there are four kinds of memory: (1) global, (2) constant, (3) texture, and (4) shared. Global, constant, and texture memory accesses are served from the slower, large device memory. The shared memory space is a much smaller and faster on-chip, software-managed cache. CUDA applications need to be aware of the memory hierarchy to access GPU memory efficiently.

Chapter 3

Methodology

This chapter outlines our methodology for characterizing the error resilience of GPGPU applications and the trade-offs we make to balance coverage and efficiency. To support our methodology, we develop GPU-Qin, a profiler and fault injector.

Any fault-injection methodology should satisfy the following three requirements:

1. Representativeness: The faults injected should be representative of the actual hardware faults that occur at runtime. In particular, the faults should be injected uniformly over the set of all instructions executed by the application. We assume that each dynamic instruction carries the same probability of fault occurrence. This is a different criterion than that used by RTL-level and microarchitectural fault injection, as discussed in Chapter 2.

2. Efficiency: Fault-injection experiments should be fast enough to allow the application to be executed to completion in reasonable time. The reason is that thousands of fault-injection experiments need to be performed to obtain statistically significant estimates of error resilience.

3. Minimum interference: The tools supporting the fault-injection experiments should interfere minimally with the original application, so that they do not modify its resilience characteristics. In particular, the fault injector should not change either the code or the data of the application other than for the objective of injecting the
faults themselves, and should not impose unreasonable performance overheads on the applications.

We implement our methodology based on the CUDA GPU debugging tool cuda-gdb (https://developer.nvidia.com/cuda-gdb). The cuda-gdb interface provides an external method to control the application and to trace/modify it without making any changes to the application code or data. This makes it possible to satisfy the minimum-interference goal. cuda-gdb introduces timing delays in the application; however, we have not seen any cases in which there is considerable deviation in the behavior of the application due to such delays, because our focus is not graphics applications but general-purpose applications.

Figure 3.1 shows an overview of our methodology. The process consists of four main phases, which we briefly describe here and detail in the rest of this chapter. In the first phase, we group threads based on the similarity of their behaviors (we use the number of instructions executed as a proxy, because threads executing a different number of instructions likely execute different control-flow paths, and hence have divergent behaviors) by running the applications with GPGPU-Sim [3]. We do this because GPU applications usually launch thousands or tens of thousands of threads, and it is extremely time-consuming to evaluate the error resilience of each GPU thread. Instead, we study only representative groups. We then choose one thread from each group to profile in the next phase. To balance coverage and efficiency, in some cases we use only the most popular groups, as we detail in Chapter 3.1.

In the second phase, GPU-Qin profiles the threads selected in the first phase and obtains the execution trace of the GPU portion of the application. This information is used to map the source-code lines to the executed assembly instructions, and is needed in the next phase to locate at runtime the instruction at which to stop execution and inject
the fault.

[Figure 3.1: Overview of our fault-injection methodology, including grouping, profiling, fault injection, and results aggregation. The flowchart shows: grouping threads (detecting groups of threads with similar behaviors, via GPGPU-Sim); profiling (one thread in each of the most popular groups, via GPU-Qin); fault-injection runs (select an instruction within a profile; add a breakpoint, run, and single-step to the target instruction; inject a fault; monitor the outcome, via GPU-Qin), repeated until 95% confidence is reached; and results aggregation.]

In the third phase, for each injection run, GPU-Qin randomly chooses one executed instruction from one of the traces obtained in the second phase. The choice of the trace is biased proportionally to the popularity of the group it represents. The choice of the instruction is made uniformly over the space of the instructions in the profile; thus, GPU-Qin simulates the occurrence of a transient error that is distributed uniformly over instructions (in other words, we assume that all instructions take approximately the same time to execute). GPU-Qin also randomly picks a thread from the entire set of the application's run-time threads for each injection run. This satisfies the representativeness requirement.

Finally, the last phase aggregates the results. The rest of this chapter presents each phase in detail.

3.1 Phase I: Grouping

GPU applications often have a massive number of threads (a GPU thread is identified by a thread coordinate (blockIdx.x, blockIdx.y, blockIdx.z), (threadIdx.x, threadIdx.y, threadIdx.z)), and it would be infeasible to obtain the execution traces for all threads in an application kernel for the purpose of fault injection. Therefore, the main challenge is to identify a fraction of threads that are representative of the workload behavior for tracing. To this end, we separate threads into groups and select representative threads from the groups.
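The grouping step can be sketched as a tally of per-thread dynamic instruction counts, keeping the most popular groups until a coverage target is met. The sketch below is illustrative Python with made-up counts patterned after a two-group kernel such as LBM; in practice the counts come from GPGPU-Sim, and the 95% coverage target used here is one assumed threshold, not a fixed rule.

```python
from collections import Counter

# Made-up per-thread dynamic instruction counts for a hypothetical kernel
# with two behaviors (84% of threads execute 120 instructions, 16% execute 96).
instr_counts = [120] * 84 + [96] * 16

groups = Counter(instr_counts)        # instruction count -> number of threads
total = sum(groups.values())

# Keep the most popular groups until they cover at least 95% of the threads.
picked, covered = [], 0.0
for count, threads in groups.most_common():
    picked.append(count)
    covered += threads / total
    if covered >= 0.95:
        break

print(picked)  # one representative thread is then profiled per picked group
```

For a single-group application this loop trivially keeps the one group; for a many-group application such as BFS, the same popularity ordering identifies the dominant group, from which a representative thread is drawn.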
We identify the groups based on the behaviour of the threads, and use instruction counts as a proxy for thread behaviour. That is to say, threads in the same group execute the same number of instructions. Because GPUs do not have built-in instruction counters, we gather the instruction counts of all threads in a benchmark by executing the program in an instruction-level GPU simulator, GPGPU-Sim (version 3.2.0) [3]. GPGPU-Sim simulates the execution of GPGPU programs from both functional and performance perspectives, and hence the number of instructions it executes matches the number of instructions executed on the real hardware. We perform the group-identification operation only once per application, so it is acceptable for this phase to be slower than the fault-injection phase, which is performed thousands of times. We then group the threads executing the same number of dynamic instructions.

We find that our benchmarks (presented in detail in Chapter 4 and Table 4.3) can be categorized into three categories based on the results of the group-identification process (Table 3.1). In the first category, all threads execute the same number of instructions, and hence there is only one group. In the second category, there is a limited amount of divergence among the threads, which leads to only a few groups (2 to 10).

[Figure 3.2: Percentage of threads in each group relative to the total number of threads. Left: LBM (group 1: 84%, group 2: 16%). Right: Monte Carlo (group 1: 12.50%, group 2: 50.00%, group 3: 0.39%, group 4: 25.00%, group 5: 12.11%). See Table 4.3 for details about the benchmarks.]
Finally, in the third category, there is significant divergence, leading to tens of groups or more.

Because profiling a thread is time-consuming, to balance coverage and efficiency we propose the following method. For applications in which there is only one group, we randomly choose a single thread in that group to profile. For applications with a small number of groups, we select the groups that constitute the majority of the threads and randomly pick one thread from each selected group to profile. Figure 3.2 shows two examples of how we pick such major groups. For example, LBM has two groups: one has 84%, and the other 16%, of the total number of threads. To satisfy the representativeness requirement, we need to pick both groups. In other cases, however, we ignore some less popular groups. For example, Monte Carlo has five groups, but one of the groups accounts for only 0.4% of the total number of threads, and hence we ignore that group. For applications that have a large number of groups (in our benchmark set, only BFS (Table 3.1)), we again use group popularity to make informed choices. For BFS, around 60% of threads fall into the same group (shown as a vertical line in Figure 3.3), while all the other 78 groups are equally popular; therefore, we pick a random thread from the large group and another random thread from the other groups. Given enough resources, more groups can be sampled to increase coverage. In Chapter 4.2.2 we report the results of a random fault-injection experiment that validates the grouping.

[Figure 3.3: Cumulative distribution function (CDF) of groups of BFS. X-axis: instructions executed by the different groups; y-axis: CDF of the number of groups.]
Table 3.1: The group identification process leads to classifying the benchmarks in three categories. See Table 4.3 for details about the benchmarks.

Category                     Benchmarks                                  Groups   Groups to profile   % threads in picked groups
Category I (one group)       AES, MRI-Q, MAT, MergeSort-k0, Transpose    1        1                   100%
Category II (2-10 groups)    SCAN, Stencil, Monte Carlo, SAD, LBM,       2-10     1-4                 95%-100%
                             HashGPU
Category III (>10 groups)    BFS                                         79       2                   >60%

3.2 Phase II: Profiling

The goal of the profiling phase is to map the assembly-level instructions (i.e., SASS) executed by a thread (chosen during the grouping phase) to their corresponding CUDA source-code lines. This enables GPU-Qin to use conditional breakpoints to inject faults: cuda-gdb, on which GPU-Qin is built, requires a source line number to set a conditional breakpoint. The mapping from a source line to assembly instructions is one-to-many (i.e., a single source line may correspond to multiple instructions); we explain later how GPU-Qin locates the specific assembly instruction to inject into.

The profiling phase consists of single-stepping the program under cuda-gdb for the thread(s) selected in the first phase. At each step, the program counter value of the instruction is recorded, along with the source line corresponding to the instruction. The output of the profiling step is an instruction trace consisting of the program counter values and the source line associated with each instruction.

3.3 Phase III: Fault Injection

The third phase of the process is to inject faults into the application at runtime and monitor the outcomes.
Figure 3.4 briefly illustrates this process. GPU-Qin takes the instruction traces from the second phase and obtains the associated source-code line for each instruction in each trace. In each injection campaign, GPU-Qin chooses a profile from the profiling phase and uniformly chooses an instruction; to inject a fault, it uses cuda-gdb to set a conditional breakpoint in the program at the source-code line that corresponds to that instruction. The conditional breakpoint is triggered only when the chosen thread reaches the chosen source line. When the breakpoint is triggered and the chosen instruction is reached, a fault is injected into the application. The application is then monitored to determine whether the fault is activated (i.e., the modified state is read by the application). To ensure representativeness, the thread coordinate is chosen randomly from the set of all threads used by the program, rather than only from the ones chosen during the grouping phase. The application runs natively on the hardware until the breakpoint is triggered and after the fault is injected (except for a short window of time when it is single-stepped to monitor fault activation); this satisfies the efficiency requirement. Fault injection is repeated until the results reach a 95% confidence interval, with an error bar between 1% and 2%. The rest of this chapter presents the details of this process.

[Figure 3.4: Phase III, the fault-injection process. Under cuda-gdb, the GPU program executes natively until the breakpoint is hit, is single-stepped through the fault injection and the activation window (until the target PC is hit), and then resumes native execution until the end.]

Reaching the target instruction: After the breakpoint is set, the program is launched under cuda-gdb and runs natively until the conditional breakpoint is hit. Because multiple dynamic instructions can map to the same source line, the breakpoint being hit does not mean that the target instruction has been reached.
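The thread-specific conditional breakpoint can be expressed directly in cuda-gdb, which accepts conditions over the CUDA thread built-ins. A small helper that formats such a command; the file name, line number, and coordinates below are purely illustrative:

```python
def breakpoint_command(source_file, line, block, thread):
    """Format a cuda-gdb breakpoint that fires only for one (block, thread) coordinate.

    block and thread are (x, y, z) tuples; the file and line come from the profiling phase.
    """
    cond = " && ".join(
        f"{reg}.{axis} == {val}"
        for reg, coord in (("blockIdx", block), ("threadIdx", thread))
        for axis, val in zip("xyz", coord)
    )
    return f"break {source_file}:{line} if {cond}"

print(breakpoint_command("mat.cu", 42, (1, 0, 0), (5, 2, 0)))
# → break mat.cu:42 if blockIdx.x == 1 && blockIdx.y == 0 && blockIdx.z == 0 && threadIdx.x == 5 && threadIdx.y == 2 && threadIdx.z == 0
```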
To reach the target instruction, GPU-Qin performs two steps:

1. GPU-Qin estimates in which iteration of a loop the target instruction occurs (if it belongs to a loop), based on the information gathered in the profiling phase. If the current loop iteration is less than the estimated iteration, GPU-Qin increments the iteration count and continues the program natively until the next time the conditional breakpoint is reached. To optimize the injection process, GPU-Qin bounds the loop-iteration estimate at 64: if the iteration that needs to be injected exceeds 64, GPU-Qin generates a random number between 0 and 64 and injects a fault at the corresponding loop iteration. We examine the implications of this heuristic in the next chapter.

2. Once the current loop iteration matches the target iteration, GPU-Qin single-steps the program from the breakpoint until the program counter matches that of the instruction we want to inject into. For performance reasons, GPU-Qin uses a fixed window to limit the number of times single-stepping is invoked. If this window is exceeded and the target instruction has not been reached, GPU-Qin abandons the run. Currently, GPU-Qin uses 300 instructions as the window size, because we find that most source lines correspond to at most a few tens of instructions. The window size can be configured by the user.

We now explain the details of the fault-injection experiment.

The locations to inject: The locations to inject depend on the instruction executed. GPU-Qin considers three types of instructions:

1. Arithmetic instructions: GPU-Qin injects faults into the destination register of instructions to simulate an error in the ALU or floating-point (FP) unit. For vector instructions that have multiple destination registers, GPU-Qin randomly chooses a destination register to inject into.

2.
Memory instructions: GPU-Qin simulates faults in the LSU by injecting faults into either the destination register or the address register of LD/ST instructions.

3. Control-flow instructions: The NVIDIA ISA uses predicate registers to control the branches of the program. Instructions such as ISETP set the values of the predicate registers, and an optional predicate guard controls conditional execution. Unfortunately, cuda-gdb does not let us modify the predicate registers, so GPU-Qin injects faults into the source operands of the control-flow instructions instead of directly manipulating the predicate registers.

The fault: A fault is injected by flipping a randomly chosen single bit in the result of the instruction's destination register. Only one fault is injected in each run, because hardware faults are relatively rare events compared to the execution time of a typical application.

Successful fault injections: A fault might not be injected in a run even when the instruction is reached. This can occur either because cuda-gdb does not allow us to modify the instruction, or because the thread GPU-Qin randomly picks does not execute the corresponding instruction (choosing the thread for injection is based on all threads, but the profile comes from a particular group of threads). GPU-Qin discards executions that do not lead to fault injections. For example, GPU-Qin is not allowed to change the address involved in BRA (a branch instruction that jumps to a relative address).

Activated fault: Once a fault is injected, GPU-Qin checks whether the faulty location is read by the program (and not overwritten). Such faults are said to be activated. Only activated faults are considered in the evaluation, because our goal is to measure the application's resilience (the conditional probability that, given a fault, the program is able to work correctly).
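The single-bit flip described under "The fault" above can be sketched as follows; a minimal illustration on a 32-bit register value (GPU-Qin performs the equivalent operation on the live register through cuda-gdb):

```python
import random

def flip_random_bit(value, width=32):
    """Flip one randomly chosen bit of a `width`-bit register value."""
    bit = random.randrange(width)
    return value ^ (1 << bit)

original = 0x000000FF
faulty = flip_random_bit(original)
# Exactly one bit differs between the original and the faulty value.
print(bin(original ^ faulty).count("1"))  # → 1
```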
To track the activation of a fault, GPU-Qin single-steps the program after injection to check whether another instruction reads the registers modified by the fault. To ensure that this process terminates in a reasonable amount of time, GPU-Qin picks a threshold: the activation window. If the fault is not activated within this number of instructions after injection, GPU-Qin lets the program continue and considers the fault unactivated. We set the window to 1,600 instructions for our experiments and explore the implications of this choice in the next chapter.

Execution Outcome: If the fault is activated, the application execution has one of the following outcomes: (1) it throws an exception (crash), (2) it times out by going into an infinite loop (hang), (3) it completes with incorrect output (SDC),⁴ or (4) it completes with correct output (benign). These four outcomes are mutually exclusive and exhaustive.

Error Bars: The formula we use for calculating the error bars within a confidence interval, when we do not know the mean and standard deviation, is:

    p ± z * sqrt(p * (1 - p) / n)    (3.1)

where p is the measured rate of the target outcome (e.g., the SDC rate or crash rate) and n is the number of activated fault-injection runs. The z value is the normal score: for a 95% confidence level, 95% of the area under a normal curve lies within roughly 1.96 standard deviations of the mean, and due to the central limit theorem this value (z = 1.96) is therefore used in the construction of approximate 95% confidence intervals [25]. We can also tell from Equation 3.1 that when n is equal to 10,000, the error bar is less than 1%.

⁴ We define an SDC as an outcome that fails the correctness check of the benchmark (if one is provided), or an output mismatch between the fault-free and fault-injected runs if a correctness check is not provided. Thus, we take application-specific characteristics into account in our definition of an SDC.
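Equation 3.1 can be computed directly; a minimal sketch that verifies the claim about 10,000 activated runs:

```python
import math

def margin_of_error(p, n, z=1.96):
    """Half-width of the confidence interval of Equation 3.1 for rate p over n activated runs."""
    return z * math.sqrt(p * (1.0 - p) / n)

# Worst case (p = 0.5): with n = 10,000 activated runs the error bar stays below 1%.
print(round(margin_of_error(0.5, 10_000), 4))  # → 0.0098
```

For the 2,000 to 3,000 activated runs used per benchmark, the same formula gives the 1% to 2% error bars reported in the experiments.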
As long as we use random selection of the instructions to inject into, estimates drawn from a small number of fault-injection runs represent the characteristics of the whole program.

In summary, this chapter introduced the design principles of our fault-injection methodology and described its three phases (grouping, profiling, and fault injection) used to characterize GPGPU applications. We also explained the types of faults, the types of instructions, and important details of the experiments.

Chapter 4

Characterization Study

This chapter uses a wide variety of applications (presented in Chapter 4.1) to validate the design choices that we made (Chapter 4.2) and to demonstrate the use of our methodology to characterize the applications' error resilience (Chapter 4.3). All of our experiments are conducted on NVIDIA Tesla C-series GPUs.

4.1 Benchmarks

We use a variety of benchmarks from the Parboil benchmark suite [29], the NVIDIA CUDA SDK package, the Rodinia benchmark suite [6], and other well-known GPGPU applications. A short description of each benchmark is given below, along with the inputs used in our evaluation. Table 4.3 summarizes the characteristics of each benchmark and its kernels.

AES encryption (AES): AES supports both encryption and decryption. We encrypt a 256-KB file with a 256-bit key.

Matrix Multiplication (MAT): Matrix multiplication is a common building block widely used in many linear algebra algorithms. We modify the code so that MAT launches the CUDA kernel code only once, to ensure that subsequent runs do not overwrite the results. We multiply two 192*128 floating-point matrices.

Matrix Transpose: Matrix transpose is a common building block for many linear algebra algorithms. We use the diagonal kernel optimized for the highest memory bandwidth. We transpose a 512*512 floating-point matrix.

Monte Carlo (MONTE): MONTE simulates the price of an underlying asset using the Monte Carlo method.
We let it simulate 262,144 paths for 256 options.

GPUs as Storage System Accelerators (HashGPU): HashGPU [18] is a library that accelerates a number of hash-based primitives. We use both SHA1 and MD5.

Breadth-First Search (BFS): BFS applies a breadth-first search to a graph. We perform BFS on a random graph with 4,096 nodes.

Magnetic Resonance Imaging - Q (MRI-Q): MRI-Q computes a matrix, representing the scanner configuration for calibration, used in a 3D MRI reconstruction algorithm in non-Cartesian space. We use 32*32*32 as the size of the 3D matrix.

3-D Stencil Operation (Stencil): Stencil performs an iterative Jacobi stencil operation on a regular 3-D grid. We use a 128*128*32 3D floating-point matrix and iterate the operation five times to make it converge.

Sum of Absolute Differences (SAD): SAD computes the sum of absolute differences, used in MPEG video encoders. It is based on a full-pixel motion-estimation algorithm found in the JM reference H.264 video encoder. There are three kernels in this benchmark, and each kernel uses the previous kernel's output. We use the default data frame as the initial input.

CUDA Parallel Prefix Sum (SCAN): SCAN [16] demonstrates an efficient CUDA implementation of a parallel prefix sum. Given an array of numbers, SCAN computes a new array in which each element is the sum of all the elements before it in the input array. We include SCAN-block, which works with arrays of any length.

Merge Sort (MS): MergeSort [24] implements a merge sort, representing a use case of GPUs for sorting batches of short- to mid-sized (key, value) array pairs.

Lattice-Boltzmann Method Simulation (LBM): LBM implements a solution of the system of partial differential equations for fluid simulation, which can be derived from the propagation and collision of fictitious particles.
The input file is a discrete representation of immobile flow obstructions (120, 120, 150) in the simulated volume.

4.2 Heuristic Validation

This chapter offers empirical support for the heuristics introduced in Chapter 3. All these heuristics (including the grouping strategy and the design decisions we made) represent choices in the trade-off space between coverage (either in terms of distinct code paths profiled or used for fault injection) and efficiency (the run time needed to characterize an application).

4.2.1 Validation of Design Decisions

We made two design decisions to ensure the efficiency of the fault injector:

1. To control runtime, we limit the number of loop iterations explored. That is, if the instruction to be injected belongs to a loop iteration that exceeds a threshold of 64, we generate a random number between 0 and 64 and inject a fault at the corresponding loop iteration.

2. If the injected fault is not activated within an activation window of 1,600 dynamic instructions, we consider it unactivated.

To validate the first heuristic, we first count the total number of iterations executed by each loop of each kernel, and then consider the loop with the largest number of iterations. The results are shown in Figure 4.1. We disregard applications that execute fewer than 64 iterations (in all loops) because they already fall within the chosen threshold. Among the four applications that have loops exceeding the threshold, we pick MRI-Q, which has the largest number of iterations, and MAT, which has the smallest number of iterations still greater than the threshold, vary the threshold from 64 to 32 and 128, and repeat the characterization experiments.

Figure 4.2 presents the SDC rates and crash rates for MAT and MRI-Q for max-iteration threshold values of 32, 64, and 128. We find that varying this threshold does not affect the resulting SDC rate and crash rate for these benchmarks.
This indicates that limiting the number of iterations does not affect the overall error-resilience estimation. Although the iteration limit we use is representative for our benchmark suite, applications with different loop-iteration characteristics may need further investigation and an adjusted limit.

[Figure 4.1: The highest number of loop iterations executed by each benchmark kernel (log scale), with the iteration threshold (64) marked.]

To validate the second decision, we count the number of instances in which the activation-window threshold is exceeded. We find that only three benchmarks (HashGPU-sha1, MAT, and MRI-Q) have fault-injection runs in which the activation window is exceeded: two cases in HashGPU-sha1, 36 in MAT, and 29 in MRI-Q. The proportion of these runs is negligible compared to the thousands of fault-injected runs executed for each benchmark. Thus, our choice of the activation-window size leads to only minimal inaccuracy (about 1% for three benchmarks) in evaluating error resilience.

[Figure 4.2: Comparison of SDC rates (left) and crash rates (right) for different iteration thresholds.]

4.2.2 Validation of Grouping

The purpose of grouping is to identify representative threads from the large number of threads of a GPU application. In Chapter 3.1, we stated that we use the number of dynamic instructions executed by a thread as a representation of its behaviour, and that we group a program's threads based on their behaviours. This heuristic can be validated through an instruction-level classification analysis.
Therefore, we analyze two example applications, one from category I and one from category II, namely MAT and SAD, to check whether threads in the same group execute different sets of instructions. We do not consider category III because, in this context, there is no fundamental difference between applications in categories II and III: both contain multiple groups. The details of the validation are described below.

MAT: Since we found that in MAT all threads execute the same number of instructions, we want to understand whether the same number of instructions implies identical instructions executed by each thread. Using GPU-Qin, we randomly choose 100 threads in MAT and record the instructions of each thread to obtain 100 profiles. We then compare every profile against each other and discover that all of them are identical. This shows that our grouping strategy is effective for applications like MAT.

SAD: As shown in Table 3.1, SAD-k0 contains five groups, whose threads execute 1989, 1856, 1948, 1995, and 950 PTX instructions, respectively. To see how much the threads in different groups deviate, we pick a random thread from each of the two most popular groups (1856 and 1989) and classify their instructions. Figure 4.3 shows the classification of the instructions executed by the two threads we pick. In total, there are 11 categories of instructions in the Fermi SASS ISA, and we present the breakdown of the number of executed instructions in each category. The number of instructions that differs between the two threads is 786, which constitutes about 2% of the total instructions. Most of the differences come from the integer and move instructions (INT and MOV). To confirm that the grouping works for applications that have multiple groups, we also randomly profile two threads from each group of SAD-k0, compare the dynamic instructions they execute, and find that within each group they are identical as well.
In the future, we can use such low-level instruction classification to reason about the behaviours of threads in different groups, as well as their error-resilience characteristics.

In summary, we show that using the number of instructions as the representation of thread behaviour is valid, because the threads in the same group execute the same stream of instructions. It can also be inferred that threads in different groups show different error-resilience characteristics. To validate this point, we compare the fault-injection results of applications in categories II and III (see Table 3.1). The crash rates vary considerably across the different groups of threads in Stencil, LBM, SCAN, and BFS (by 5%, 10%, and 25%, respectively). This demonstrates the value of grouping, which is a key strategy in our fault-injection chain.

[Figure 4.3: The instruction classification of two random threads from different groups (groups 1856 and 1989), broken down into the FP, INT, Conversion, Move, Predicate, Texture, Load/Store, Surface, Control, and Miscellaneous categories.]

4.3 Characterization of Error Resilience

We characterize the error resilience of the 15 kernels mentioned above. We run enough experiments to obtain 95% confidence, with a 1% to 2% (depending on the benchmark) confidence interval for the SDC rate and crash rate.

Table 4.1 presents, for each benchmark, the total number of injected runs, the overall activation rate, and the average time for a fault-injection run. The total number of injected runs includes runs in which the fault was injected successfully and was either activated, overwritten, or ignored by exceeding the activation window.

The average time of each fault-injection run varies across benchmarks from 11 seconds to 710 seconds, and is directly proportional to the block size of the benchmark (shown in Table 4.3).
We observe that our worst-case benchmark, SCAN, which takes 710 seconds per run on average, is still 10x faster with GPU-Qin than with GPGPU-Sim. Other benchmarks show speedups as high as 100x. Moreover, the simulator needs days to finish for some applications, and hence the speedups for those applications are certainly greater than 100x; however, we did not measure them. The average speedup across the benchmarks that the simulator is able to finish within a couple of hours is 22x. This demonstrates the efficiency of GPU-Qin.

Table 4.1: Fault-injection experiment information

Kernels         Injected runs   Activated runs   Activation rate   Average time per run (seconds)
AES             2,351           2,042            87%               84
HashGPU-md5     2,699           2,683            99%               13
HashGPU-sha1    2,400           2,305            96%               27
MRI-Q           2,830           2,475            87%               123
MAT             2,575           2,186            85%               82
Transpose       2,395           2,160            90%               44
SAD-k0          2,671           2,435            91%               76
SAD-k1          2,208           2,195            99%               26
SAD-k2          2,627           2,618            100%              12
Stencil         2,426           2,148            89%               31
SCAN-block      1,083           1,080            99%               710
MonteCarlo      3,744           2,723            73%               66
MergeSort-k0    1,930           1,884            98%               359
BFS             2,334           2,330            100%              22
LBM             1,895           1,845            97%               165

Figure 4.4 presents the SDC rate and crash rate of the benchmark kernels. We do not show the hang rates because they are uniformly lower than 1%. Fault injections on CPUs exhibit similar hang rates [13], because hangs occur only when the number of loop iterations increases so significantly that the benchmark times out, which is relatively uncommon in practice.

At first glance, both the SDC rate and the crash rate vary widely across benchmarks. In particular, the SDC rate ranges from 0.5% to nearly 38%. This observation suggests that it is important to take into account the inherent error-resilience characteristics of an
application when protecting it from SDC-causing errors. For example, the SDC rate for MONTE is less than 1%, likely because the results of simulating each path are eventually aggregated, which potentially mitigates the effect of faults. We note that applications with similar behaviour (e.g., HashGPU-sha1 and HashGPU-md5, as well as SAD-k1 and SAD-k2) exhibit similar SDC rates. On the other hand, crash rates vary even more than the SDC rates, from 6% to 71%. We discuss the possible reasons behind these variations in the next chapter. In total, across all benchmarks, failure rates (crash + SDC + hang) range from 24% (MONTE) to 93% (SCAN), and the average failure rate is 67%.

[Figure 4.4: SDC (top) and crash (bottom) rates, with error bars representing the 95% confidence interval for each kernel.]

Understanding the error-proneness of instructions is an important aspect of error-resilience characterization, because it can help determine which parts of the program need to be protected. NVIDIA Fermi SASS instructions can be broadly classified into the following classes: FP, ALU, memory, and control.⁵ For all activated fault injections, we investigate which class the fault-activated instruction belongs to. Figure 4.5 presents the SDC rates for each class of instruction for all benchmarks; the overall SDC rates are displayed for reference. We find that, across all benchmarks, no category correlates significantly with the overall SDC rate.
Our results suggest that the SDC rate of a GPGPU program is not determined by any single factor (e.g., the type of instructions) but by a combination of factors.

[Figure 4.5: Instruction-level error-resilience study: SDC rates broken down by instruction class (FP, INT, Memory, Control) for each benchmark, together with the overall rates.]

⁵ Our categorization is based on a general classification of the purposes of the instructions.

4.4 Statistical Significance of the Fault Injection

Figure 4.4 shows that we confine the error bars of our fault-injection experiments to within 2%. However, the measured SDC rate is only an estimate based on a sample of an application's instructions. In our fault-injection experiments, we inject about 2,000 to 3,000 faults per application by randomly selecting an instruction from all dynamic instructions. Therefore, the number of fault-injection runs does not depend on the number of dynamic instructions an application usually executes. This raises a question: does the number of activated fault-injection runs needed depend on the size of the program?

To answer this question, we use the following empirical method. We calculate the SDC rate for samples of different sizes: for example, we calculate the SDC rate for the first 200 injection runs, then repeat this for the first 400 injection runs, and so on. Figure 4.6 illustrates the trend of the SDC rates of each benchmark for different sample sizes. We find that the SDC rates fluctuate a little for most benchmarks (the maximum variation is only 6%, in HASHGPU-sha1), but all of them stabilize on constant values near the end, where the fluctuations are within the error bars.

Figure 4.7 shows how many dynamic instructions are executed per thread in each benchmark.
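The convergence check behind Figure 4.6 is a running estimate over growing prefixes of the injection outcomes; a minimal sketch with hypothetical outcome data:

```python
def accumulated_sdc_rates(outcomes, step=200):
    """SDC-rate estimates over growing prefixes (first `step` runs, first 2*`step`, ...).

    `outcomes` is a sequence of booleans, True meaning the run ended in an SDC.
    """
    return [sum(outcomes[:n]) / n for n in range(step, len(outcomes) + 1, step)]

# Hypothetical stream in which every fifth activated run produces an SDC.
outcomes = [i % 5 == 0 for i in range(1000)]
print(accumulated_sdc_rates(outcomes, step=200))  # → [0.2, 0.2, 0.2, 0.2, 0.2]
```

When successive estimates stay within the Equation 3.1 error bars, the measured rate is considered converged.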
[Figure 4.6: Accumulated SDC rates for the benchmarks over increasing numbers of injection runs (200 to 2,800).]

For benchmarks in category I (described in Chapter 3), we simply report the number of instructions executed by a random thread, because all the threads execute the same number of instructions. For benchmarks in category II, we report the number of instructions of a thread that belongs to the most popular group; for example, for LBM we pick a thread in group 2 (Figure 3.2), as it is the most popular group. For the benchmark in category III (i.e., BFS), we randomly pick one thread to report, as we did in the grouping phase. The wide spectrum of dynamic instruction counts across these benchmarks shown in Figure 4.7, together with the number of runs at which convergence occurs for each benchmark (shown in Figure 4.8), suggests that the number of fault injections needed for a statistically significant estimate of an application's error resilience (i.e., its SDC rate) does not depend on the number of instructions the program executes. For instance, MRI-Q and SAD-k0 execute the largest numbers of instructions, but their SDC rates converge at around 2,200 runs. In contrast, LBM executes only a small number
of instructions, but its SDC rate still varies by 1% between 2,600 and 2,800 runs.

[Figure 4.7: Number of dynamic instructions executed per thread by each benchmark.]

[Figure 4.8: Number of fault-injection runs at which convergence occurs for each benchmark.]

4.5 Crash Causes and Latency

GPU-Qin can be used to gain a deeper understanding of the error-resilience characteristics of GPGPU applications. In this chapter, we attempt to understand the reasons for the crashes observed in the characterization study, and we characterize the crash latency. This investigation is important for two reasons. First, crashes are a form of error detection performed by the GPU hardware and the CUDA runtime, and understanding the reasons for crashes can help assess the effectiveness of the existing error-detection mechanisms. Second, it is important to detect crashes early in order to contain the errors. We report results for only two benchmarks, AES and MAT; however, the observations generalize to all the benchmarks.

When a hardware exception occurs, the application crashes and the crash cause is reported to cuda-gdb. GPU-Qin traps these exceptions and logs them.

[Figure 4.9: Root-cause breakdown of crashes for AES (left) and MAT (right), across lane user stack overflow, warp out-of-range address, warp misaligned address, and device illegal address.]
Overall, we observe four types of hardware exceptions: lane user stack overflow, warp out-of-range address, warp misaligned address, and device illegal address. The exceptions and their causes are presented in Table 4.2.

Figure 4.9 shows the root causes of the crashes in the applications. The two most common causes are warp out-of-range address and device illegal address. We find that warp misaligned address also plays an important role in the crashes of the MAT benchmark.

Table 4.2: Description of CUDA hardware exceptions

Exception type              Description
Lane user stack overflow    Occurs when a thread exceeds its stack memory limit
Warp out-of-range address   Occurs when a thread within a warp accesses an out-of-bounds local or shared memory address
Warp misaligned address     Occurs when a thread within a warp accesses an incorrectly aligned local or shared memory address
Device illegal address      Occurs when a thread accesses an out-of-bounds global memory address

Crash latency measures the time interval between the moment a fault is activated and the moment a crash occurs. We measure the crash latency for each exception type above to understand how quickly the crash is detected. Figure 4.10 shows the crash latency of each exception type for AES and MAT. In AES, 90% of the warp out-of-range address exceptions occur within around 500 milliseconds, compared to 70% of the warp misaligned address exceptions and 60% of the device illegal address exceptions. In MAT, warp out-of-range address exceptions occur faster than warp misaligned address exceptions. Only in the Stencil benchmark does the device illegal address exception occur, and there it occurs faster than the other three exception types; in all other benchmarks, the warp out-of-range address exceptions have lower crash latency than the other exception types. Comparing the crash latency on CPUs and GPUs, Gu et al.
[13] reported that on CPUs, crashes usually happen within thousands of cycles of the fault injection, whereas on GPUs, crashes happen within milliseconds. This difference could result from both the hardware checks and the OS checking mechanisms. Systems with longer crash latencies may allow faults to propagate to more states, and also have a higher chance of affecting states beyond the current context via shared memory, disk or network. Recovery in this scenario could become complicated, because system-wide state rebuilding may be required. Application-specific checkpointing/recovery techniques can be designed and configured based on the level of crash latency.

4.6 Use Cases

GPU-Qin can be used to evaluate the error-resilience characteristics of GPGPU applications for various purposes. In this section, we provide three scenarios that show how our tool can be used.

4.6.1 Scenario I: SDC Proneness of Different Code Sections

The key problem that selective fault-tolerance mechanisms need to solve is to identify which parts of a program are more "important" than others, so as to minimize overhead. In our context, selective mechanisms need to pinpoint the code sections of a program that have a high probability of causing SDCs. These code sections, as discussed in the prior chapters, are not necessarily the same across different applications. GPU-Qin can be used to backtrack the fault-injection results that lead to SDCs, and to find the source-code statements/instructions at which the fault was activated. For example, our preliminary observation is that some code patterns associated with the CUDA programming model (e.g., computations involving thread IDs) have a higher probability of leading to SDCs. A detailed use case can be found in our previous paper [10].

4.6.2 Scenario II: Comparing Different Algorithms

As reliability becomes more and more critical to computing systems, applications now need to choose algorithms by also taking error resilience into account.
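To make the kind of comparison involved concrete, the following host-side C++ sketch (a toy harness of our own, not GPU-Qin itself, and with hypothetical names such as `causesSdc`) injects a single-bit flip into one input value and classifies the outcome as benign or SDC against a fault-free golden run. It shows why a comparison-based computation can mask the same fault that an arithmetic one propagates.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Sum reduction: every bit of every input feeds the output, so a flipped
// bit propagates to the result.
static uint32_t sumReduce(std::vector<uint32_t> v) {
    return std::accumulate(v.begin(), v.end(), 0u);
}

// Maximum via comparisons: a flip in a low-order bit of a non-maximal
// element does not change any comparison outcome, so it is masked,
// analogous to the resilience of search-based kernels.
static uint32_t maxReduce(std::vector<uint32_t> v) {
    return *std::max_element(v.begin(), v.end());
}

// Returns true if flipping `bit` of element `idx` changes the output of `f`
// relative to the golden (fault-free) run, i.e., the fault caused an SDC.
template <typename F>
bool causesSdc(F f, std::vector<uint32_t> input, std::size_t idx, int bit) {
    const uint32_t golden = f(input);
    input[idx] ^= (1u << bit);  // the injected single-bit fault
    return f(input) != golden;
}
```

With the same fault (bit 0 of element 1 flipped), the sum reduction yields an SDC while the maximum does not, which is the kind of per-algorithm contrast a GPU-Qin characterization quantifies at scale on real hardware.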
In this scenario, GPU-Qin can be used to evaluate the error resilience of different algorithms that solve the same problem. For example, sorting is a very popular operation in many application domains. Using GPU-Qin to characterize different sorting algorithms, such as quicksort, merge sort or heapsort, can suggest which algorithm meets the reliability requirement while maintaining acceptable performance and power consumption. Further, even for the same algorithm, GPU-Qin can be used to decide which version/implementation of the algorithm to choose.

4.6.3 Scenario III: Guiding Configurations

HPC applications usually have complex system-wide and application-wide configurations. GPU-Qin can provide an understanding of the error-resilience characteristics to system users and help them configure their systems better, e.g., by setting appropriate checkpointing intervals. It also allows users to test the error resilience of an application under different parameter combinations.

4.7 Limitations

Our evaluation study and analysis are subject to three limitations:

1. Experimental configurations, such as the limit on the number of iterations of a loop and the size of the activation window, may be specific to our benchmark suite. For example, based on our evaluation of the two benchmarks that contain the largest and smallest numbers of loop iterations, there is no difference in SDC rate for different upper-bound values. We simply use 64 as the upper bound on the number of iterations to explore; however, it can be adjusted for different sets of applications. Similarly, we use 1600 instructions as the activation window to limit the number of single-steps. For most benchmarks, this size is sufficient to achieve a high activation rate. These configurations are based on empirical experience; thus, they may need to be re-validated for new sets of applications.

2. Another limitation is that CUDA applications need to be compiled in debug mode to use GPU-Qin.
This is required because cuda-gdb cannot link source code to instructions when debugging information is absent; however, applications cannot be optimized by the compiler when debugging information is present.

3. In terms of input selection, it is not possible for us to exercise all inputs of an application. We use inputs that are representative of the applications.

4.8 Summary

In summary, the error-resilience characterization of GPGPU applications shows that the measured SDC and crash rates vary significantly across benchmarks, and that the average failure rate over all benchmarks is 67%. In contrast, the difference in SDC rates across CPU applications is bounded by 5% to 15% [13]. GPGPU applications hence need application-specific fault-tolerance techniques, because generic techniques that do not consider the implicit resilience of applications are likely to incur unnecessary performance and power overheads.

Figure 4.10: Crash latency analysis for AES and MAT. Top: AES. Bottom: MAT.

Table 4.3: Benchmark properties. LOC: lines of code.
Scale: number of blocks in a grid and number of threads in a block (generally a 3D*3D space). Launch times: the number of times the kernel is launched.

Benchmark | Suite | Kernel name | Approx. LOC | Scale | Number of threads | Launch times
SAD | Parboil | mb_sad_calc | 220 | (44,36,1)*(61,1,1) | 96624 | 1
SAD | Parboil | larger_sad_calc_8 | 60 | (44,36,1)*(61,1,1) | 96624 | 1
SAD | Parboil | larger_sad_calc_16 | 50 | (11,9,1)*(32,4,1) | 13464 | 1
Stencil | Parboil | block2D_hybrid_coarsen_x | 100 | (2,32,1)*(32,4,1) | 8192 | 5
MRI-Q | Parboil | ComputeQ_GPU | 50 | (128,1,1)*(256,1,1) | 32768 | 3
LBM (a) | Parboil | performStreamCollide | 150 | (120,150,1)*(120,1,1) | 2160000 | 100
MAT | CUDA SDK | matrixMul | 110 | (4,6,1)*(32,32,1) | 98304 | 1
SCAN-block | CUDA SDK | scanExclusiveShared | 70 | (6656,1,1)*(256,32,1) | 54525952 | 1
MONTE | CUDA SDK | MonteCarloOneBlockPerOption | 40 | (32,1,1)*(256,1,1) | 8192 | 1
Transpose (a) | CUDA SDK | transposeDiagonal | 40 | (64,64,1)*(16,16,1) | 1048576 | 1
MergeSort | CUDA SDK | mergeSortSharedKernel | 50 | (4096,1,1)*(512,1,1) | 2097152 | 1
BFS | Rodinia | Kernel | 20 | (8,1,1)*(512,1,1) | 4096 | 8
BFS | Rodinia | Kernel2 | 15 | (8,1,1)*(512,1,1) | 4096 | 8
AES | Other [21] | aesEncrypt256 | 400 | (257,1,1)*(256,1,1) | 65792 | 1
HashGPU | Other [18] | sha1_kernel_overlap | 1000 | (64,1,1)*(64,1,1) | 4096 | 1
HashGPU | Other [18] | md5_kernel_overlap | 1000 | (64,1,1)*(64,1,1) | 4096 | 1

(a) Randomly picking blocks to inject faults takes too long for LBM and Transpose, because cuda-gdb launches the application block by block; thus, in practice, we inject only into the first 256 blocks of these benchmarks.

Chapter 5
Discussion

The fault-injection study presented in the previous chapter finds that the SDC rate varies widely across benchmarks. For example, Monte Carlo has nearly no SDCs, while HashGPU-sha1 and HashGPU-md5 have SDC rates of about 40%.
In this chapter, we ask whether there are fundamental reasons why some applications experience fewer SDCs than others. We focus on SDCs as these are considered the most severe failures: when an SDC occurs, there is no indication that something went wrong, yet the application produces incorrect output.

We first attempted to find correlations between the instruction-level characteristics of programs and their SDC rates, by building a linear regression model that predicts the SDC rate from an instruction classification. We were not able to create such a model because the sample size is small (for each application, there is only one observation), and with such a small dataset the correlation is low. This inspired us to explore the problem from a different perspective.

We believe that the reason for the variability in the SDC rate is related to the applications' characteristics. For instance, applications based on search algorithms are likely to have lower SDC rates than applications that perform computations such as linear algebra. This is because a fault affecting the search in a part of the space that will not lead to a match is unlikely to produce an incorrect result: the result will still be a mismatch. MergeSort in the CUDA SDK implements parallel sorting based on binary search [24] (see footnote 6), and we observe a relatively low SDC rate (6%).

Another type of application that has a low SDC rate is what we call an "average-out" algorithm, such as Stencil (SDC rate: 5%) and MONTE (SDC rate: 1%). These include computations in which the final state is a product of multiple temporary states, either in space or in time. The core pattern here is that the product of all states is likely to be obtained via operations that average those states.
If a fault happens in one of the temporary states, it is likely to be averaged out in the final state.

We explain each category below, together with a high-level description of each benchmark's algorithm, emphasizing the operations that could influence the reliability of the programs.

5.1 Search-based Application

Search is a subset of the class of computations in the "Branch and bound algorithms" dwarf. The core computation pattern is that the search space is divided into segments, and queries are searched in parallel in each segment. Depending on the actual search criteria, the search returns solutions that are either exact or optimal.

Merge Sort: Merge sort consists of two major steps. The first step breaks the original input array into small blocks and sorts each block in parallel. The second step is the parallel merge. This step consists of three procedures (kernels) that are executed iteratively to merge the small blocks into the final output array. The first kernel picks samples from the sorted blocks: it chooses multiple elements with which to split the blocks, and computes the ranks of all sample elements in the even/odd pair of blocks. The second kernel takes the rank list of each block as input and merges the rank lists, which essentially reorders the sub-blocks of each block. The third kernel performs the merge in parallel: each element is assigned its own thread, which performs a binary search over all elements of the other array to find the rank of its assigned element. During our characterization we focus only on the first kernel, but it is straightforward to include the rest of the kernels.

Footnote 6: MergeSort-k0 divides the input into equal-sized tiles and sorts all tiles in parallel. Sorting in this kernel uses an implementation of Batcher's odd-even merge sort, which is based on comparisons [4].

There are two factors affecting the end-to-end correctness of merge sort.
The first factor is that binary search dominates the execution time, whether for locating, ranking and merging the internal blocks or for the actual merge into the final output. This explains the low SDC rates of all the kernels, in the sense that the decision to pick the "left" or "right" element is unlikely to change due to a fault, as it is the result of a comparison operation, which is fairly error resilient. Second, the first kernel only produces internal data that may not be critical to the final sorted output, while the last kernel directly changes the output. Therefore, a fault in the first kernel is not likely to have as much impact as a fault in the third kernel. These two factors explain the relatively low SDC rate of merge sort.

The following code segment, taken from the CUDA SDK, shows the core computation of MergeSort-k0, which represents the behaviour of searching for the correct positions to sort. The binary search is the major operation performed. We omit the definition of binarySearchInclusive, as it closely resembles the exclusive one.

    for (uint stride = 1; stride < arrayLength; stride <<= 1) {
        // determine positions
        uint posA =
            binarySearchExclusive<sortDir>(keyA, baseKey + stride, stride, stride) + lPos;
        uint posB =
            binarySearchInclusive<sortDir>(keyB, baseKey + 0, stride, stride) + lPos;
        __syncthreads();
    }

    uint binarySearchExclusive(uint val, uint *data, uint L, uint stride) {
        if (L == 0)
            return 0;

        uint pos = 0;
        for (; stride > 0; stride >>= 1) {
            uint newPos = umin(pos + stride, L);
            if ((sortDir && (data[newPos - 1] < val)) ||
                (!sortDir && (data[newPos - 1] > val))) {
                pos = newPos;
            }
        }
        return pos;
    }

5.2 Bit-wise Operation

SHA-1 and MD5 are popular industrial hash standards, and AES is a popular industrial encryption standard, all used in cryptography. These benchmarks (HashGPU-sha1, HashGPU-md5 and AES) have a similar behaviour in terms of data manipulation.
The input data gets segmented into blocks, and iterative operations are performed on each block to produce the final output. The SDC rates of the three benchmarks are also close: 28%, 31% and 28%, respectively. Based on the correctness checks provided by these applications, even a single-bit error would be considered a failure, and hence one would expect these applications to have very high SDC rates. However, they contain some operations that mask faults, for example: (1) ANDing a flipped bit with a 0, or ORing it with a 1, and (2) shifting a flipped bit out of a word. These operations provide a mitigating influence on the SDC rate.

5.3 Averaged Out

This category includes computations in which the final state is a converging product of multiple temporary states, either in space or in time. The core pattern here is that the product of all states is likely to be obtained via operations that average those states.

3D stencil: This application represents a Jacobi solver of a heat equation on a 3D stencil. Blocks of threads are assigned to handle the elements of the XY plane; each thread gathers the six neighbours of its element, feeds them into the equation, averages the result, and replaces the original value with the new value. This process is repeated until all the values converge and there are no more changes. This makes the application capable of masking faults that affect a single element (the final values may be slightly different from the correct values, but the correctness checks of the application will accept them).

Monte Carlo: This application performs a Monte Carlo simulation for American option pricing, in which multiple paths are explored at different time steps using random numbers. The final result is obtained by averaging the results of the individual paths. Therefore, even if a fault affects a single path and causes it to compute a wrong value, it would make only a small difference in the result.
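The average-out effect can be made concrete with a small host-side sketch (our own illustration, not code from the benchmarks): when the final result is the mean of N temporary states, corrupting a single state perturbs the result by only its error divided by N, so a tolerance-based correctness check typically still passes.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Mean of the temporary states; the final result of an "average-out"
// computation is obtained this way.
double meanOf(const std::vector<double>& samples) {
    return std::accumulate(samples.begin(), samples.end(), 0.0) /
           static_cast<double>(samples.size());
}

// Relative change in the mean when one sample is corrupted by `delta`;
// the corruption of a single state shifts the mean by only delta/N.
double relativeErrorAfterFault(std::vector<double> samples,
                               std::size_t idx, double delta) {
    const double golden = meanOf(samples);
    samples[idx] += delta;  // the simulated fault in one temporary state
    return std::fabs(meanOf(samples) - golden) / std::fabs(golden);
}
```

For example, with 1000 equal samples, a fault that doubles a single sample changes the mean by only 0.1%, well within the tolerance of typical application-level correctness checks.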
In reality, though, this application would explore thousands of paths, so the effect of a fault is likely to be even less pronounced, and hence the SDC rates are likely to be even lower.

5.4 Graph Processing

BFS: This BFS implementation calculates the cost of each vertex from the source vertex on a per-level basis. The first kernel of BFS traverses the vertices from the source vertex and assigns one thread to take one vertex in the current level for the cost calculation. The cost of each vertex on the same level is updated by this kernel. In the case of a vertex that has multiple parents, different threads could reach the vertex from different paths and end up overwriting each other's cost. The correctness is not affected, since those threads are from the same level, so the cost is overwritten with the same value. The second kernel deals with the race condition: it makes sure that every internal state of the vertex is synchronized before the launch of the first kernel. So if a fault affects the global states in the first kernel (not the updating mask array or the real cost), the second kernel resets those states to what they are supposed to be.

5.5 Linear Algebra and Grid Operation

This category contains applications that involve regular data structures (e.g., a matrix or a grid). The computing task of these applications is usually distributed evenly across all computing units (threads, on a GPU), and the data accesses are also uniformly distributed. There are two major types of computation performed by applications in this category: (1) index calculation, and (2) value calculation. Index calculations compute the memory addresses of array elements, and value calculations produce the values that are written into the array. Of these two types of computations, errors in index calculations are likely to result in crashes, as they can cause a pointer to point outside the bounds of the array, resulting in a hardware exception.
Errors in value calculations are likely to result in SDCs. We find that the SDC rates of the applications in this category can be explained by the relative proportions of index calculations and value calculations in them.

Transpose and MAT: Matrix transpose and matrix multiplication are both very common operations in linear algebra. On GPUs, the matrices are tiled so that multiple threads can operate on them in parallel, with synchronization code inserted where needed. For matrix transpose, most of the computations are index calculations, while for matrix multiplication, most of the computations are value calculations.

MRI-Q: MRI-Q computes the Q matrix, which in MRI image reconstruction is a precomputable value based on the sampling trajectory, i.e., the plan of how points in k-space will be sampled. The algorithm examines a large input set representing the intended MRI scanning trajectory and the points that will be sampled. Each element of the Q matrix is computed as a summation of contributions from all trajectory sample points. MRI-Q simply collects the corresponding elements from the 3D data set and computes an element of Q. As a result, MRI-Q has only value computations, and no index computations.

SCAN: SCAN is a simple implementation of the prefix sum for GPUs. It assigns threads to fetch elements and add them to previous elements over log N rounds, and it double-buffers the shared memory that stores the temporary sums of the array. It consists of a combination of index and value calculations.

SAD: SAD stands for "Sum of Absolute Differences", and consists of two major steps. The first step computes the SAD for 4*4 blocks of the reference pixels and the frame pixels. Two levels of index calculation and a SAD computation are involved in this step; the SAD computation for each 4*4 block pair is a simple 2D stencil-like operation. The second step merges the basic 4*4 pixel blocks into larger blocks.
It takes the output vector of the first step as its input and performs additions on the vector to form results for different block shapes such as 4*8, 8*4, 16*8, etc. The output of the second step is the SADs for all 7 block sizes. This step also comprises a large number of index calculations and value calculations.

LBM: LBM is a method of solving the systems of partial differential equations governing fluid dynamics. In a timestep, each cell uses the input flow to compute the resulting output flow from that cell and an updated local fluid density. The major difference between LBM and a stencil computation is that no input data is shared between cells, and the fluid flowing into a cell is not read by any other cell.

5.6 Summary

In general, the linear algebra and grid operation applications have low error resilience, as every single element and operation involved in the computation is likely to contribute evenly to the final output. Within this category, however, we observe a wide range of SDC rates across benchmarks. Our hypothesis is that the mix of index calculations and value calculations makes a difference in the SDC rate, as index calculations are likely to lead to crashes. For example, the benchmarks that contain many index calculations, namely Transpose, SCAN and SAD, have crash rates above 40%, as opposed to those that intuitively have fewer address calculations, e.g., MAT and MRI-Q. Interestingly, applications that combine index and value calculations have higher SDC rates than those that have only one predominant type of calculation. We will validate our hypothesis in future work.

These observations suggest that it might be useful to cluster the benchmarks based on both the SDC rate and the high-level operations they perform. We categorize the benchmarks into five resilience categories, shown in Table 5.1. Asanovic et al.
[2] define "thirteen dwarfs of parallelism" for designing and evaluating parallel computing applications. Each of these dwarfs captures a pattern of computation common to a class of parallel applications. We find that the resilience categories we consider map well to one or more of the dwarfs, as Table 5.1 shows. We did not start out trying to find such a mapping, and hence may not cover all dwarfs in our application categories. We will explore this mapping systematically in future work.

Table 5.1: Benchmark categories and the mapping to the dwarfs of parallelism

Resilience category | Benchmarks | Measured SDC | Dwarfs
Search-based | MergeSort | 6% | Backtrack and Branch+Bound
Bit-wise Operation | HashGPU, AES | 25%~37% | Combinational Logic
Average-out Effect | Stencil, MONTE | 1%~5% | Structured Grids, Monte Carlo
Graph Processing | BFS | 10% | Graph Traversal
Linear Algebra | Transpose, MAT, MRI-Q, SCAN-block, LBM, SAD | 15%~25% | Dense Linear Algebra, Sparse Linear Algebra, Structured Grids

Chapter 6
Related Work

This chapter provides an overview of related work in the areas of software-based error-resilience techniques and GPU vulnerability studies, and explains how our work differs.

Fault injectors: Fault injection has been well explored on CPUs using run-time debuggers; examples are GOOFI [1] and NFTAPE [28]. However, neither of these injectors works on GPUs. Further, they do not consider multi-threaded programs, nor do they concern themselves with choosing representative parts of the program for injection. Other work [20] attempted to inject faults into scientific applications using PIN, a dynamic binary instrumentation framework from Intel. However, to the best of our knowledge, this work has not been applied to GPUs.

AVF and PVF: A common way to estimate vulnerability is through the architectural vulnerability factor (AVF) [22], which analyzes the vulnerability of specific micro-architectural structures to soft errors.
The main idea is to track the execution of a program through the processor, typically by executing it in a simulator, and to identify certain bits as ACE (Architecturally Correct Execution) bits. The total number of ACE bits in a micro-architectural structure is an estimate of its vulnerability. AVF-based methods have two disadvantages compared to fault-injection-based methods. First, they rarely consider the end-to-end behaviour of the application under faults, and hence end up approximating the set of bits that need to be marked ACE. For example, Wang et al. [33] found that AVF estimates are significantly less accurate than fault-injection-based results. Although the original formulation of the architectural vulnerability factor considers the outcome of the program in deciding whether a specific bit is an ACE bit, practical implementations of the technique count every bit that can potentially affect the program (within a certain window) as an ACE bit. This conservative estimation can grossly overestimate the vulnerability observed through fault injection [33]. Second, as AVF studies consider specific micro-architectural structures, microprocessor simulators are usually required to run the experiments, which means they are significantly slower than executing the program on native hardware.

Several studies [23, 30] have attempted to characterize the vulnerability of different micro-architectural structures in GPUs. For example, Tan et al. [30] characterized GPU instructions (CUDA PTX) based on whether the execution of an instruction affects the final output of the application, and hence determined the AVF from the number of ACE instructions per cycle and their residency time within the hardware structures. The Program Vulnerability Factor (PVF) is a metric proposed by Sridharan et al. [27] to apply AVF analysis at the application layer. While this takes application properties into account, it does not consider the end-to-end impact of faults on the application.
These approaches do not consider the end-to-end impact of faults on applications, nor do they attempt to understand the behaviour of the application under errors. Moreover, AVF analysis has been shown to have significant inaccuracies compared to fault-injection-based approaches [33]. In contrast, our work takes the application's perspective and focuses on understanding the behaviour of GPGPU applications under errors.

Generic fault-tolerance techniques: Dimitrov et al. [8] proposed three approaches for GPGPU reliability that leverage both instruction-level parallelism and thread-level parallelism to replicate the application code. Their approach incurs performance overheads of 85 to 100%, and they conclude that understanding both the application characteristics and the hardware platform is necessary for efficient protection. They do not, however, characterize the reliability of GPGPU applications.

Application-specific fault tolerance: Some studies have attempted to establish correlations between SDCs and program characteristics. Thaker et al. [31] observe that errors in control data are more likely to lead to SDCs and catastrophic failures in multimedia applications. Thomas et al. [32] find that errors in data affecting a large amount of computation are likely to lead to egregious outcomes (what they call EDCs). Shoestring [11] finds that errors in high-value instructions that write to global memory or produce function-call arguments are likely to lead to undetected SDCs. These findings are incomplete in the sense that they apply only to some categories of applications and cannot fully explain the SDC rates.

Hari et al. [14] present a low-cost, program-level fault-detection mechanism for reducing SDCs in CPU applications. They use their prior work, Relyzer [15], to profile applications and select a small portion of the program's fault sites to identify static instructions that are responsible for SDCs.
Then, by placing program-level error detectors on those SDC-causing sections, they can achieve high SDC coverage at low cost. It is noteworthy that application-specific behaviours are major contributors to SDCs for half of their benchmarks, which makes it difficult to extend their technique to other applications, especially GPU applications, which behave differently from CPU applications.

Finally, Yim et al. [35] proposed a technique to detect errors through data duplication at the programming-language level (loop code and non-loop code) for GPGPU applications. This differs from our focus, which is to understand the inherent error-resilience characteristics of an application in order to find the most efficient protection. They perform fault injections at the source-code level, while we do so at the executable-code level. Because many hardware faults cannot be modelled at the source-code level, our injections are more representative of hardware faults.

Chapter 7
Conclusion and Future Work

This thesis presents a methodology to investigate the end-to-end error-resilience characteristics of GPGPU applications through fault injection. One of the main challenges in building a fault injector for GPGPU applications is balancing representativeness with time efficiency, due to their massive parallelism. We first build a fault-injection tool, GPU-Qin, to efficiently inject faults on real GPU hardware while maintaining the representativeness of the faults injected. Using GPU-Qin, we study the error-resilience characteristics of twelve GPGPU applications comprising fifteen kernels. The investigation shows that 0.3% to 38% of the faults result in SDCs and 6% to 71% result in crashes, which suggests that application-specific fault-tolerance mechanisms are needed to deal with such a wide range of error-resilience levels.
Our fault injector enables the study of various reliability characteristics of applications, such as instruction-level error resilience and crash latency. It also exposes the impact of faults at different bit positions on the error resilience of different applications. All of these can be used to guide the design of application-specific fault-tolerance techniques. Finally, we find that the algorithmic characteristics of an application can help us understand the variation in SDC rates among different applications.

In the future, we plan to proceed in the following directions:

1. To understand the variance of SDC rates within the same resilience category. For example, SDC rates vary from 15% to 25% in the category of linear algebra and grid computations, and from 25% to 40% in the category of bit-wise operations.

2. To perform error-resilience characterization on CPU applications, and see whether a similar categorization can be observed.

3. To compare CPU and GPU applications in terms of error resilience, and to devise specific fault-tolerance mechanisms for each platform.

Bibliography

[1] J. Aidemark, J. Vinter, P. Folkesson, and J. Karlsson. GOOFI: Generic object-oriented fault injection tool. In Dependable Systems and Networks, 2001 International Conference on, pages 83–88, 2001.

[2] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.

[3] A. Bakhoda, G.L. Yuan, W.W.L. Fung, H. Wong, and T.M. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on, pages 163–174, April 2009.

[4] K. E. Batcher.
Sorting networks and their applications. In Proceedings of the 1968 Spring Joint Computer Conference, AFIPS '68 (Spring), 1968.

[5] Shekhar Borkar. Designing reliable systems from unreliable components: The challenges of transistor variability and degradation. IEEE Micro, 25(6):10–16, November 2005.

[6] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC), IISWC '09, pages 44–54, 2009.

[7] C. Constantinescu. Trends and challenges in VLSI circuit reliability. In IEEE MICRO, 2003.

[8] Martin Dimitrov, Mike Mantor, and Huiyang Zhou. Understanding software approaches for GPGPU reliability. In Proceedings of the 2nd Workshop on General Purpose Processing on Graphics Processing Units, pages 94–104, 2009.

[9] Bo Fang, Karthik Pattabiraman, Matei Ripeanu, and Sudhanva Gurumurthi. GPU-Qin: A methodology for evaluating the error resilience of GPGPU applications. In IEEE International Symposium on Performance Analysis of Systems and Software, 2014.

[10] Bo Fang, Jiesheng Wei, Karthik Pattabiraman, and Matei Ripeanu. Towards building error resilient GPGPU applications. In 3rd Workshop on Resilient Architecture (WRA), in conjunction with MICRO, 2012.

[11] Shuguang Feng, Shantanu Gupta, Amin Ansari, and Scott Mahlke. Shoestring: Probabilistic soft error reliability on the cheap. In Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 2010.

[12] T. Gaitonde, Shi-Jie Wen, R. Wong, and M. Warriner. Component failure analysis using neutron beam test. In Physical and Failure Analysis of Integrated Circuits (IPFA), 2010 17th IEEE International Symposium on the, pages 1–5, 2010.

[13] Weining Gu, Z. Kalbarczyk, and R.K. Iyer. Error sensitivity of the Linux kernel executing on PowerPC G4 and Pentium 4 processors.
In Dependable Systems and Networks, 2004 International Conference on, pages 887–896, 2004.

[14] S. K. S. Hari, S. V. Adve, and H. Naeimi. Low-cost program-level detectors for reducing silent data corruptions. In IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2012.

[15] Siva Kumar Sastry Hari, Sarita V. Adve, Helia Naeimi, and Pradeep Ramachandran. Relyzer: Exploiting application-level fault equivalence to analyze application resiliency to transient faults. In ACM ASPLOS, 2012.

[16] Mark Harris, Shubhabrata Sengupta, and John D. Owens. Parallel prefix sum (scan) with CUDA. In Hubert Nguyen, editor, GPU Gems 3, chapter 39, pages 851–876. Addison Wesley, August 2007.

[17] Mei-Chen Hsueh, T.K. Tsai, and R.K. Iyer. Fault injection techniques and tools. Computer, 30(4):75–82, 1997.

[18] Samer Al-Kiswany, Abdullah Gharaibeh, Elizeu S. Neto, George Yuan, and Matei Ripeanu. StoreGPU: Exploiting graphics processing units to accelerate distributed storage systems. In HPDC '08: Proceedings of the 17th International Symposium on High Performance Distributed Computing, pages 165–174, New York, NY, USA, 2008. ACM.

[19] J.-C. Laprie. Dependable computing and fault tolerance: Concepts and terminology. In Fault-Tolerant Computing, 1995, Highlights from Twenty-Five Years, Twenty-Fifth International Symposium on, pages 2–, Jun 1995.

[20] Dong Li, J.S. Vetter, and Weikuan Yu. Classifying soft error vulnerabilities in extreme-scale scientific applications using a binary instrumentation tool. In High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for, pages 1–11, 2012.

[21] S. A. Manavski. CUDA compatible GPU as an efficient hardware accelerator for AES cryptography. In IEEE Intl Conf. on Signal Processing and Communication, pages 65–68, 2007.

[22] S.S. Mukherjee, C.T. Weaver, J. Emer, S.K. Reinhardt, and T. Austin. Measuring architectural vulnerability factors. In IEEE MICRO, 2003.

[23] N. Farazmand, R. Ubal, and D. Kaeli. Statistical fault injection-based AVF analysis of a GPU architecture. In IEEE Workshop on Silicon Errors in Logic, 2012.

[24] Nadathur Satish, Mark Harris, and Michael Garland. Designing efficient sorting algorithms for manycore GPUs. NVIDIA Technical Report NVR-2008-001, NVIDIA Corporation, September 2008.

[25] L.J. Savage. The Foundations of Statistics. Dover Books on Mathematics Series. Dover Publications, 1972.

[26] Jeremy W. Sheaffer, David P. Luebke, and Kevin Skadron. The visual vulnerability spectrum: Characterizing architectural vulnerability for graphics hardware. In Proceedings of the 21st ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, GH '06, New York, NY, USA, 2006. ACM.

[27] V. Sridharan and D.R. Kaeli. Eliminating microarchitectural dependency from architectural vulnerability. In High Performance Computer Architecture, 2009. HPCA 2009. IEEE 15th International Symposium on, pages 117–128, 2009.

[28] D.T. Stott, B. Floering, D. Burke, Z. Kalbarczyk, and R.K. Iyer. NFTAPE: A framework for assessing dependability in distributed systems with lightweight fault injectors. In IPDPS 2000, pages 91–100, 2000.

[29] John A. Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, and Wen-mei W. Hwu. Parboil: A revised benchmark suite for scientific and commercial throughput computing. In IMPACT Technical Report, 2012.

[30] Jingweijia Tan, N. Goswami, Tao Li, and Xin Fu. Analyzing soft-error vulnerability on GPGPU microarchitecture. In IEEE International Symposium on Workload Characterization (IISWC), pages 226–235, 2011.

[31] D.D. Thaker, D. Franklin, J. Oliver, S. Biswas, D. Lockhart, T.S. Metodi, and F.T. Chong. Characterization of error-tolerant applications when protecting control data. In Proc. IISWC, pages 142–149, 2006.

[32] A. Thomas and K. Pattabiraman.
Error detector placement for soft computation. In Dependable Systems and Networks (DSN), 2013 43rd Annual IEEE/IFIP International Conference on, pages 1–12, June 2013.

[33] Nicholas J. Wang, Aqeel Mahesri, and Sanjay J. Patel. Examining ACE analysis reliability estimates using fault-injection. In Proceedings of the 34th Annual International Symposium on Computer Architecture, ISCA '07, 2007.

[34] Jiesheng Wei and Karthik Pattabiraman. BLOCKWATCH: Leveraging similarity in parallel programs for error detection. In IEEE International Conference on Dependable Systems and Networks (DSN), 2012.

[35] Keun Soo Yim. Hauberk: Lightweight silent data corruption error detector for GPGPU. In IEEE International Parallel and Distributed Processing Symposium, 2011.