
UBC Theses and Dissertations

Efficient modeling of error propagation in GPU programs Anwer, Abdul Rehman 2020



Full Text

Efficient Modeling of Error Propagation in GPU Programs

by

Abdul Rehman Anwer

BEE, National University of Sciences and Technology (Pakistan), 2015

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Electrical and Computer Engineering)

The University of British Columbia (Vancouver)

June 2020

© Abdul Rehman Anwer, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled: "Efficient Modeling of Error Propagation in GPU Programs" submitted by Abdul Rehman Anwer in partial fulfillment of the requirements for the degree of Master of Applied Science in Electrical and Computer Engineering.

Examining Committee:

Karthik Pattabiraman, Electrical and Computer Engineering
Supervisor

Tor Aamodt, Electrical and Computer Engineering
Supervisory Committee Member

Abstract

Graphics Processing Units (GPUs) are popular for reliability-conscious uses in High Performance Computing (HPC), machine learning algorithms, and safety-critical applications. Fault Injection (FI) techniques are generally used to determine the reliability profiles of programs in the presence of soft errors. However, these techniques are highly resource- and time-intensive. GPU applications are highly multi-threaded and typically execute hundreds of thousands of threads, which makes it challenging to apply FI techniques. Prior research developed a model called TRIDENT to analytically predict Silent Data Corruption (SDC) (i.e., incorrect output without any indication) probabilities of single-threaded CPU applications, without requiring any FIs. Unfortunately, TRIDENT is incompatible with GPU programs, due to their high degree of parallelism and different memory architectures compared to CPU programs. The main challenge is that modeling error propagation across thousands of threads in a Graphics Processing Unit (GPU) kernel requires enormous amounts of data to be profiled and analyzed, posing a major scalability bottleneck for HPC applications. Further, there are GPU-specific behaviors that must be modeled for accuracy.

In this thesis, we propose GPU-TRIDENT, an accurate and scalable technique for modeling error propagation in GPU programs. Our key insight is that error propagation across threads can be modeled based on program execution patterns. These can be characterized by control-flow, loop iteration, data, and thread block patterns of the GPU program. We also identify two major sources of inaccuracy in building analytical models of error propagation and mitigate them to improve accuracy. We find that GPU-TRIDENT can predict the SDC probabilities of both the overall GPU programs and individual instructions accurately, and is two orders of magnitude faster than FI-based approaches. We also demonstrate that GPU-TRIDENT can guide selective instruction duplication to protect GPU programs similar to FI. We also deploy GPU-TRIDENT to assess the input-dependence of reliability of GPU kernels and find that the SDC probability of kernels is generally insensitive to variation in inputs.

Lay Summary

Transient hardware faults are increasing due to growing system scales and progressive technology scaling. Graphics Processing Units, which efficiently process a large amount of data using their parallel structure, are susceptible to these faults, which affect their reliability.
In this thesis, we analyze GPU applications and identify challenges in modeling error propagation in a scalable and accurate manner. These challenges relate to error propagation across threads and the large amounts of data to be profiled due to the highly parallel nature of GPUs. We then propose heuristics to address these challenges, based on data patterns and control-flow similarity across threads, thread blocks, and loop iterations in GPU applications. Next, we use the proposed techniques to guide selective instruction duplication to increase the reliability of GPU applications with low overhead. Finally, we leverage the low resource usage of our techniques to study the reliability characteristics of GPU applications with multiple inputs.

Preface

This thesis is the result of work carried out by myself, in collaboration with my advisor, Dr. Karthik Pattabiraman, Dr. Guanpeng Li (University of Illinois at Urbana-Champaign), Dr. Timothy Tsai, Dr. Siva Kumar Sastry Hari, and Dr. Michael Sullivan (Nvidia Research). All chapters, with the exception of Section 7.2, are based on work published in the 2019 IEEE Workshop on Silicon Errors in Logic - System Effects (SELSE) and another work in submission. I was responsible for conceiving the ideas, designing and conducting the experiments, compiling the results, and writing the paper. Dr. Pattabiraman was responsible for overseeing the project, providing guidance and feedback, and editing and writing parts of the paper. Guanpeng contributed his knowledge and expertise from his closely related prior work; he provided insights into the problems, helped in designing experiments, and assisted with the tools used for these experiments. Timothy, Siva, and Michael provided analysis and insights throughout the course of the project.

Abdul Rehman Anwer, Guanpeng Li, Karthik Pattabiraman, Siva Kumar Sastry Hari, Michael Sullivan, and Timothy Tsai. "Towards analytically evaluating the error resilience of GPU Programs". In IEEE Workshop on Silicon Errors in Logic - System Effects (SELSE), 2019.
Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgments
1 Introduction
    1.1 Soft Errors in GPUs
    1.2 Motivation
    1.3 Approach and Contributions
2 Background
    2.1 GPU Architecture and Programming Model
    2.2 Fault Model
    2.3 Terms and Definitions
    2.4 LLVM Compiler
    2.5 TRIDENT Framework
        2.5.1 Static-Instruction Sub-Model (fs)
        2.5.2 Control-Flow Sub-Model (fc)
        2.5.3 Memory Sub-Model (fm)
3 Related Work
    3.1 GPU Error Resilience
    3.2 Modeling Error Propagation
    3.3 Pruning of FI Space
    3.4 Resilience Analysis Across Multiple Inputs
    3.5 Summary
4 Challenges
    4.1 Overview
    4.2 Large Amount of Profiling Data
    4.3 Large Number of Threads
    4.4 Accumulated Inaccuracy From Individual Threads
    4.5 Summary
5 Approach
    5.1 Profiling for Intra-Thread Memory Dependency (H-Intra)
        5.1.1 Selection Based on Thread Execution Path (H-Intra-Path)
        5.1.2 Selection Based on Loop Patterns (H-Intra-Loop)
    5.2 Profiling for Inter-Thread Memory Dependencies (H-Inter)
    5.3 Value-Based Masking (H-Value)
        5.3.1 Multiplication by Zero
        5.3.2 Lucky Stores
    5.4 Analysis Workflow
    5.5 Summary
6 Evaluation
    6.1 Experimental Setup
        6.1.1 Workflow and Implementation of GPU-TRIDENT
        6.1.2 FI Method
        6.1.3 Benchmarks
    6.2 Accuracy
        6.2.1 Overall SDC Probability
        6.2.2 Instruction SDC Probability
    6.3 Scalability
        6.3.1 Kernel SDC Probability
        6.3.2 Instruction SDC Probability
    6.4 Sources of Inaccuracies in GPU-TRIDENT
        6.4.1 Validity of Faulty Store Addresses
        6.4.2 Conservativeness in Determining Memory Corruption
        6.4.3 Heuristics About Error Propagation
    6.5 Limitations
        6.5.1 FI Methodology
        6.5.2 Evaluation Kernels
    6.6 Summary
7 Case Studies
    7.1 Selective Instruction Duplication
    7.2 Characterizing SDCs Across Multiple Inputs
8 Conclusion and Future Work
    8.1 Summary
    8.2 Future Work
        8.2.1 Extending GPU-TRIDENT
        8.2.2 End to End Reliability Analysis of GPU Applications
        8.2.3 Error Propagation Modeling in Accelerator Platforms
Bibliography
List of Tables

Table 5.1  Thread group details for the studied kernels
Table 5.2  Average control-flow (CF) similarity percentage of threads in a thread block
Table 6.1  Benchmarks used: 8 are from the Rodinia suite, and the other 4 are open source HPC applications (OS HPC).
Table 7.1  Number of threads invoked by different kernels, for multiple inputs.

List of Figures

Figure 2.1  Workflow of TRIDENT [35]
Figure 2.2  Working of TRIDENT [35].
Figure 4.1  A slightly modified code segment from the Circuit benchmark [1].
Figure 4.2  Prediction of SDC probability by a modified GPU-TRIDENT that does not model error propagation across threads, for benchmarks that use shared memory; NW K1 means the first kernel of the NW benchmark.
Figure 5.1  Conditional branches inside a loop in Pathfinder.
Figure 5.2  Profiling inter-thread memory dependencies. (Values in squares represent memory addresses.)
Figure 5.3  Example of error masking due to multiply by 0.
Figure 5.4  Example of lucky stores
Figure 6.1  The SDC probability of GPU kernels predicted by FI and GPU-TRIDENT. Error bars are shown for the FI estimates at the 99% confidence level - they range from ±0.53% to ±1.82%.
Figure 6.2  Computation spent to predict SDC probability
Figure 6.3  Performance comparison of FI and GPU-TRIDENT
Figure 7.1  SDC probabilities of selective instruction duplication.
Figure 7.2  Average protection curve obtained by FI and GPU-TRIDENT.
Figure 7.3  Protection curves obtained by FI and GPU-TRIDENT for individual kernels, as protection overhead is varied from 0% to 100%. Blue lines indicate FI, and orange lines represent GPU-TRIDENT.
Figure 7.4  SDC probabilities predicted by GPU-TRIDENT with multiple inputs.

List of Abbreviations

FI   Fault Injection
GPU  Graphics Processing Unit
HPC  High Performance Computing
SDC  Silent Data Corruption

Acknowledgments

I would like to express my sincere gratitude to my advisor, Dr. Karthik Pattabiraman, for his consistent support during my Masters. His patience, invaluable guidance, and in-depth knowledge made my stay at UBC extremely rewarding and helped me grow personally as well as professionally. His enthusiasm and optimistic outlook are quite inspiring for young researchers like me.

In addition to my advisor, I would like to thank my thesis examining committee, Dr. Tor Aamodt and Dr. Matei Ripeanu, for their insightful feedback.

I would like to thank Dr. Guanpeng Li and my other colleagues in the Dependable Systems Lab at UBC for their support and stimulating discussions throughout my research endeavors.

Finally, I would like to thank my family, especially my sister, for providing me with unfailing encouragement and backing throughout my years of study.
My academic achievements would not have been accomplished without this support.

Chapter 1

Introduction

1.1 Soft Errors in GPUs

GPUs have been widely adopted as accelerators for scientific applications and machine learning algorithms due to their high performance and power efficiency. Their highly parallel architecture makes them ideal for efficiently parallelizing the processing of large chunks of data [2]. Recently, they are being used in safety-critical systems (e.g., self-driving cars [4]), which makes their reliability a major concern.

The performance of processors has seen immense improvements due to faster transistors and increasing circuit density. This is an ongoing trend that will result in smaller and denser components in future processors. Transient hardware faults (i.e., soft errors) are thus predicted to increase in future processors due to these growing system scales, progressive technology scaling, and lowering of operating voltages [50]. Transient errors manifest in the form of bit flips in the system, and can be caused by high-energy particle strikes (e.g., alpha particles), transistor variability, and thermal cycling. As indicated by the name, transient errors, in contrast to manufacturing faults, do not result in permanent defects in the chips; thus, they are not reproducible. In the past, such faults were handled in high-reliability systems through hardware-only solutions such as dual modular redundancy (DMR) and circuit hardening [40]. However, these techniques are challenging to deploy in commodity systems as they incur significant performance and energy overheads. Protecting the system against hardware faults at the application level is a low-overhead alternative to hardware-based approaches.

One consequence of transient errors is incorrect program output, without any indication to the programmer or the rest of the system, which is known as silent data corruption (SDC). SDCs are challenging to detect and can have severe consequences [50] for the reliability and safety of Graphics Processing Unit (GPU) applications. Studies have shown that only a small fraction of the program state is responsible for almost all of the error propagation resulting in SDCs. One can therefore selectively protect the vulnerable program state to meet a target SDC probability while incurring lower energy and performance overheads than full duplication techniques [18, 35]. Hence, it is vital to estimate the SDC probability of a program to decide whether protection is needed and, if so, to reduce the cost of protection by selectively protecting the most SDC-prone program state (e.g., through selective instruction duplication).

1.2 Motivation

To deploy any application-level technique that protects the system from SDCs arising from soft errors, it is necessary to have a comprehensive reliability profile of the application. FI is commonly employed for this purpose. FI systematically perturbs the program state during execution and checks the program output to detect failures (if any). As the injection space for FI is too large to cover exhaustively, thousands of random FI trials are required to obtain statistically significant results. Hence, FI is often very slow and challenging to deploy in practice [22, 23, 35]. Further, selective protection techniques need the SDC probability on a per-instruction basis; this requires hundreds of FIs per instruction, which is very resource-intensive. Researchers have attempted to model error propagation to evaluate SDC probabilities quickly without any FIs [18, 27, 51].
Unfortunately, these techniques (1) have significant inaccuracies [18, 51], (2) require large and difficult-to-obtain training corpuses of FI data that are representative of typical faults [27], (3) do not provide reliability profiles for individual instructions, which is critical for the selective protection of programs [27, 36], or (4) are only applicable to CPU applications [32, 35, 51]. Therefore, there is a need to develop a scalable and accurate error resilience analysis framework for GPU applications.

1.3 Approach and Contributions

This thesis focuses on developing an accurate and scalable model for error propagation in GPU applications, which does not require resource-intensive FI.

In recent work, Li et al. [35] proposed an automated technique called TRIDENT that analytically models error propagation in programs to predict their SDC probabilities. While TRIDENT is accurate and fast in evaluating the vulnerability of single-threaded CPU programs to transient errors, we find it is neither accurate nor scalable for GPU programs, which are highly parallel and have a very different programming model.

In particular, inter-thread data dependencies among the threads in a typical GPU program complicate error propagation [33], and not modeling them leads to large inaccuracies in predicting SDC probabilities; e.g., the mean absolute error in SDC prediction without modeling inter-thread dependencies is 17.2% with respect to FI (Section 4.3). Furthermore, when modeling GPU programs, TRIDENT takes a significant amount of time, as a large amount of run-time information has to be profiled to bootstrap the model, due to the large number of threads in GPU programs. This profiling process is extremely time-consuming and resource-intensive for real-world GPU programs. Finally, to estimate per-instruction reliability accurately, we need to consider data characteristics specific to GPU applications, which TRIDENT unfortunately does not support. This introduces inherent inaccuracies in the model, which, when aggregated, can result in a model that significantly deviates from the actual behavior of the program.

In this work, we propose an accurate and scalable technique to analytically model error propagation in GPU programs. Our key insight is that error propagation in GPU programs can be abstracted using program execution patterns based on memory accesses, loop iterations, and control-flow. We find that analyzing the different levels of memory dependencies in a GPU kernel is often the biggest scalability bottleneck in the model. Carefully selecting only a small subset of threads for the analysis of memory dependencies in a GPU kernel, based on the characteristics mentioned earlier, can therefore significantly reduce the time required for creating the model.

Our work builds on top of the TRIDENT framework [35] and significantly enhances it to solve the above challenges of GPU programs - therefore, we call our technique GPU-TRIDENT.

There are three factors that make GPU-TRIDENT practical for assessing the reliability of real-world GPU programs and protecting them. First, GPU-TRIDENT is nearly as accurate as FI but works without requiring any FIs, and is hence significantly faster than FI. Second, it does not require any training corpuses of FI data for creating the model.
Finally, GPU-TRIDENT can efficiently guide the selective protection of a GPU program given a reliability target and a performance overhead budget.

To the best of our knowledge, we are the first to efficiently model error propagation for GPU programs, using neither FI nor training data based on FI, to estimate both the overall SDC probability of a program and the SDC probability of its individual instructions.

Contributions: The main contributions of this thesis are:

• Identify challenges in efficiently and accurately modeling error propagation for GPU applications (Chapter 4).

• Propose heuristics for pruning the state space for modeling error propagation in GPU programs. The heuristics are based on similarity among the control flow and memory access patterns of the program's threads (Chapter 5).

• Build GPU-TRIDENT, an efficient and accurate model of error propagation for GPU programs that implements the above heuristics using the LLVM compiler [30] and is completely automated.

• Compare the accuracy and scalability of GPU-TRIDENT to FI when predicting the SDC probabilities of individual instructions and of entire GPU kernels (Chapter 6), for 17 GPU kernels belonging to 12 applications.

• Demonstrate the use of GPU-TRIDENT to guide selective instruction duplication under a given overhead budget (Chapter 7).

• Study the input-dependence of the resilience characteristics of GPU kernels, leveraging the low-overhead methodology of GPU-TRIDENT (Chapter 7).

Our main results are as follows:

• SDC probability predictions from GPU-TRIDENT have a high agreement (Pearson correlation coefficient of 0.88) with the FI results. Individual instructions of most kernels also have a high agreement with the FI results (average Pearson correlation coefficient of 0.83).

• GPU-TRIDENT incurs a fixed initial overhead and a small incremental overhead for each sampled instruction, while FI incurs an overhead proportional to the number of sampled instructions (i.e., injected faults). This makes GPU-TRIDENT much more efficient for large programs than FI. For example, for 5000 faults, GPU-TRIDENT is 55.6 times faster than FI. FI takes on average around 4 CPU hours for 5000 sampled instructions, while GPU-TRIDENT takes approximately 6 minutes across benchmarks. This difference is even larger for more complex applications. For example, FI takes 22.7 hours for the Circuit benchmark, while GPU-TRIDENT takes just 8 minutes. On average, GPU-TRIDENT is about two orders of magnitude faster than FI across the benchmarks.

• Using GPU-TRIDENT to guide selective instruction duplication reduces the SDC probability of kernels by approximately 58% and 85% (at 1/3rd and 2/3rd the performance overhead of full duplication, respectively), which is comparable to the results obtained using FI.

• Using GPU-TRIDENT to study the input-dependence of the resilience characteristics of a GPU kernel reveals that the SDC probability of kernels is mostly insensitive to changes in the input provided to the applications.

Our work thus provides an accurate, efficient, and scalable tool for characterizing the resilience profiles of GPU kernels. In addition, we demonstrate the practical use of the developed tool to guide selective instruction duplication with a very small performance overhead. Finally, we exploit the low resource usage of our proposed tool to study the variation in resilience properties of GPU applications, which would have been highly resource-intensive if traditional FI methodologies were used.

The rest of the thesis is organized as follows.
We first give a brief background on the problem and the tools used in Chapter 2, and then provide a brief overview of related work in Chapter 3. We identify the main challenges in Chapter 4 and present our approach for handling them in Chapter 5. We evaluate our proposed techniques in Chapter 6 and perform two case studies on the resilience of GPU kernels using GPU-TRIDENT in Chapter 7. Finally, we conclude the thesis and discuss avenues of future research in Chapter 8.

Chapter 2

Background

In this chapter, we first present a brief primer on GPU architecture, then define the terms we use, followed by our fault model and the infrastructure we use for FI experiments and analysis. Finally, we provide a brief overview of the TRIDENT technique [35], as it forms the basis of GPU-TRIDENT.

2.1 GPU Architecture and Programming Model

We focus on GPU applications that are implemented on the NVIDIA Compute Unified Device Architecture (CUDA), a widely adopted programming model and toolset for GPUs. In the CUDA programming model, a GPU application consists of a control program running on the CPU and a computation program called the kernel that runs in parallel on the GPU(s), in the form of multiple threads. CUDA kernels adopt the single instruction multiple thread (SIMT) execution model to exploit the massive parallelism of GPUs. From a software perspective, the CUDA programming model abstracts the SIMT model into a hierarchy of kernels, blocks, and threads: CUDA kernels consist of blocks, which consist of threads. This hierarchy allows various levels of parallelism, such as fine-grained data parallelism, coarse-grained data parallelism, and task parallelism. From a hardware perspective, blocks of threads run on streaming multiprocessors (SMs) with on-chip shared memory for threads in the same block. Within a block, threads are launched in fixed groups of 32 threads called warps. Threads in a warp execute the same sequence of instructions but with different data values.

Each GPU has its own memory space that is distinct from the host CPU's memory. In the CUDA programming model, there are various kinds of memory allocations and accesses: (1) global, (2) constant, (3) texture, (4) shared, and (5) thread-local. Global, constant, and texture memory accesses that miss in the on-chip caches are loaded from the large and comparatively slow device memory. The shared memory space is software-managed. It is much smaller and built on chip, and is hence much faster to access. Thread-local memory is typically stored in the fast register file, though compiler-inserted spill and fill operations occasionally place thread-local state in slower areas of the storage hierarchy.

The CUDA programming model allows (1) sharing data among a subset of threads in a thread block (e.g., warp-shuffle), (2) sharing data across all the threads in a thread block using the shared-memory scratchpad, (3) sharing data across the threads in a compute kernel using device global memory, and (4) sharing across devices using unified virtual memory (UVM) [26]. This gives rise to two types of memory dependencies between instructions:

1. Intra-thread memory dependency: static loads and stores of the same thread are dependent on each other because they access the same memory location.

2. Inter-thread memory dependency: static loads and stores in different threads are dependent on each other because the same memory location is accessed in different threads (illustrated by the sketch below).
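To make the two dependency types concrete, the following sketch matches load records against the most recent store to the same address in a hypothetical per-access trace; a match within the same thread is an intra-thread dependency, and a match across threads is an inter-thread dependency. The trace format and field names (thread_id, instr_id, address) are illustrative assumptions for this sketch, not the actual data layout used by GPU-TRIDENT.

    from typing import NamedTuple

    class Access(NamedTuple):
        thread_id: int   # global thread index
        instr_id: int    # static load/store instruction index
        is_store: bool
        address: int

    def classify_dependencies(trace):
        """Match each load to the most recent store to the same address,
        and label the (store, load) pair intra- or inter-thread."""
        last_store = {}            # address -> most recent store Access
        intra, inter = set(), set()
        for acc in trace:          # trace is ordered by execution time
            if acc.is_store:
                last_store[acc.address] = acc
            elif acc.address in last_store:
                st = last_store[acc.address]
                dep = (st.instr_id, acc.instr_id)
                (intra if st.thread_id == acc.thread_id else inter).add(dep)
        return intra, inter

    # Example: thread 0 stores to 0x10; thread 1 later loads 0x10 (inter-thread).
    trace = [Access(0, 5, True, 0x10), Access(1, 9, False, 0x10)]
    print(classify_dependencies(trace))   # (set(), {(5, 9)})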
2.2 Fault Model

In this thesis, we consider transient hardware faults that occur in the computational elements of the GPU, including architectural registers and functional units, and affect the program's execution. We assume these faults manifest as a single bit flip. Many studies [9, 10, 25, 49] have shown that there is little difference between the SDC probability of single and multiple bit flips. Moreover, previous work in this area [27, 33, 41, 55] also uses the single-bit-flip model. We do not consider faults in the GPU's control logic, nor do we consider faults in the instructions' encoding. We also do not consider faults in the memory or caches, as we assume that these are protected with error correction codes (ECC) - this is the case for most modern GPUs used in HPC applications [6]. However, an error can propagate to memory if an erroneous value is written by a store instruction, causing subsequent loads of that location to be faulty (such faults are considered).

Finally, we assume that the program does not jump to arbitrary, illegal addresses due to faults during the execution, as this can be detected by control-flow checking techniques [43]. However, the program may take a faulty legal branch (the execution path is legal, but the branch direction is wrong due to faults propagating to the branch condition). Our fault model is in line with other work in the area [13, 18, 20-22, 29].

2.3 Terms and Definitions

Fault Occurrence: The event corresponding to the occurrence of a hardware fault in the processor. The fault may or may not result in an error.

Fault Activation: The event corresponding to the software manifestation of the fault, i.e., the fault becomes an error and corrupts some software state (e.g., a register or memory location). The error may or may not result in a failure.

Crash: The raising of a hardware trap or exception due to an error (e.g., a read outside the program's memory segments). The program is terminated by the operating system as a result.

Silent Data Corruption (SDC): A mismatch between the output of a faulty program run and that of an error-free execution of the program, without any exception being thrown.

Benign Faults: The program output matches that of the error-free execution even though a fault occurred, because the fault was masked or was overwritten by the program.

Error Propagation: The fault was activated and has affected some other portion of the program state, say X; we say the fault has propagated to state X. We focus on faults that affect the program state, and therefore consider error propagation at the application level.

SDC Probability: The probability that the program had an SDC given that the fault was activated; other work uses a similar definition [18, 23, 45].

2.4 LLVM Compiler

We use the LLVM compiler [30] to perform FI experiments and to carry out the program analysis needed to implement our model. LLVM uses a typed intermediate representation (IR) that represents source-level constructs. Moreover, it maintains variable and function names in the IR, which makes source mapping feasible, enabling us to map IR to source code and perform a fine-grained analysis of error propagation in the program. However, our methodology is not tied to LLVM: any platform that allows us to statically analyze programs and correlate the analysis with FI results will suffice.
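As a minimal illustration of the single-bit-flip fault model used in these FI experiments, the sketch below flips one uniformly chosen bit in a 32-bit value, the way an IR-level injector conceptually corrupts the result of one dynamic instruction. This is a simplified stand-in for illustration only, not the actual LLFI-GPU implementation.

    import random

    def flip_one_bit(value: int, width: int = 32) -> int:
        """Return `value` with one uniformly chosen bit inverted,
        emulating a transient single-bit flip in a register."""
        bit = random.randrange(width)
        return (value ^ (1 << bit)) & ((1 << width) - 1)

    # Example: corrupt the result of one dynamic instruction.
    golden = 0x000000FF
    faulty = flip_one_bit(golden)
    print(hex(golden), "->", hex(faulty))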
2.5 TRIDENT Framework

TRIDENT is an open-source framework that analytically models error propagation at the instruction, control-flow, and memory levels to derive an estimate of the overall SDC probability, and the SDC probability of each instruction, for a given CPU program [35]. However, TRIDENT does not model error propagation in multi-threaded programs, nor other GPU-specific structures. As we show in Chapter 4, modeling these is critical for GPU programs. Our technique is built on top of the TRIDENT framework but extends it significantly.

Workflow: TRIDENT takes three inputs: (1) the LLVM Intermediate Representation (IR) of the program, (2) the program output instructions (i.e., instructions whose results determine whether an SDC occurs), and (3) an input for the program [35]. It decomposes the error propagation in the dynamic execution of an application into a combination of probabilistic events. These probabilities can be calculated by plugging data profiled from a dynamic execution (i.e., instruction operands, branch directions, etc.) into certain mathematical equations. Using this information, TRIDENT calculates the SDC probabilities of static instructions and aggregates them to derive the overall SDC probability of the program.

TRIDENT consists of two phases: (1) Profiling, in which TRIDENT performs static and dynamic analysis of the program, collecting information such as instruction counts and data dependencies; and (2) Inferencing, in which TRIDENT uses the profiled information to track error propagation and calculate SDC probabilities.

Figure 2.1: Workflow of TRIDENT [35]. (Inputs: the program source code in LLVM IR, the program input, and the instructions considered as program output. Outputs: the overall SDC probability of the program and the SDC probabilities of individual instructions.)

Figure 2.2: Working of TRIDENT [35]. (a) Code example for TRIDENT, with propagation probabilities. (b) A memory access trace and the corresponding memory dependency graph.

Example: Figure 2.2a shows how TRIDENT works, using an example from the CPU version of the Pathfinder benchmark. The instructions in the example are indexed in each basic block. As TRIDENT works at the LLVM Intermediate Representation (IR) level, these instructions correspond to LLVM instructions [30]. TRIDENT consists of three sub-models that work together to calculate the error propagation probability from a given location of error occurrence to the program output instruction.

2.5.1 Static-Instruction Sub-Model (fs)

This sub-model computes the propagation probability of an error in a static data-dependent instruction sequence (which often appears in the same basic block). For example, if an error occurs at Index 1 in Figure 2.2a, fs calculates the overall propagation probability for the error from Index 1 to Index 4. Each instruction is assigned an individual error propagation probability from that instruction to the next data-dependent instruction in the same basic block. This probability is derived by analyzing the instruction and profiling the values of its operands. fs computes the overall propagation probability by aggregating the individual probabilities of the data-dependent instructions in the basic block.
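As a rough sketch of this aggregation, under the simplifying assumption that per-instruction propagation events along the chain are independent, the probability along a data-dependent sequence can be computed as the product of the per-instruction probabilities. The numbers below are made up for illustration and are not profiled values; the exact aggregation used by TRIDENT may be more elaborate.

    from functools import reduce

    def chain_propagation_probability(per_instr_probs):
        """Aggregate the propagation probability along a static
        data-dependent instruction sequence by multiplying the
        per-instruction probabilities (simplified reading of f_s)."""
        return reduce(lambda acc, p: acc * p, per_instr_probs, 1.0)

    # Hypothetical per-instruction probabilities for a three-step chain.
    print(chain_propagation_probability([0.8, 1.0, 0.7]))  # ~0.56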
2.5.2 Control-Flow Sub-Model (fc)

This sub-model computes the probabilities of store instructions getting corrupted due to the corruption of branch instructions, which results in control-flow divergence from a fault-free run. For example, if the error propagates to Index 4, which is a branch instruction, it may modify the direction of the branch. Say that branch direction F of Index 4 should be taken in a fault-free execution, so the store instruction (Index 5) is supposed to be executed. Due to the error in Index 4, the branch direction can be modified to T. Consequently, the store instruction (Index 5) is not executed correctly, and hence we say that Index 5 is corrupted. In this case, the error propagates into memory via the store instruction. TRIDENT identifies all stores that are dominated by the corrupted branch (Index 4 in the running example) and computes the probabilities of them getting corrupted.

2.5.3 Memory Sub-Model (fm)

The memory sub-model analyzes error propagation in the kernel via memory dependencies. Continuing with our running example, if Index 5 is corrupted, the erroneous value stored by the instruction may be loaded later by the load instruction (Index 6) in bb18. Hence, the error propagates via a memory dependency. To figure out the memory dependencies between all loads and stores, TRIDENT profiles a memory execution trace of the program. The trace (Figure 2.2b) contains all the executed store and load instructions at runtime, ordered by execution time. Each record in the trace contains the type of operation (load/store), the index of the static instruction, and the address accessed by the instruction. TRIDENT leverages the memory execution trace to track error propagation in fm by constructing a memory dependency graph. The graph contains the memory dependencies between the static load and store instructions, and it is weighted based on the dynamic instruction counts of these instructions. An example of the graph is shown in Figure 2.2b. A node represents an instruction (load or store), and the edges indicate possible dependencies. The weights (derived from the dynamic instruction counts) are marked beside each edge. This graph is used during execution to track error propagation due to memory dependencies.

Chapter 3

Related Work

We classify related work into four broad categories. The first category includes work related to characterizing the resilience of GPU applications; it presents an overview of FI tools as well as analytical frameworks developed for GPU applications. The second category summarizes the body of work related to modeling error propagation at the application level. Next, we review techniques suggested to overcome bottlenecks in studying the resilience characteristics of GPU applications, as this is one of the core contributions of this thesis. Finally, we give an overview of work related to the input-dependence of error resilience.

3.1 GPU Error Resilience

A significant amount of research has gone towards finding the resilience of GPU programs through FI. Yim et al. [56] developed one of the first FI tools for GPU applications to explore efficient error detectors in GPU programs. Fang et al. [16] developed GPU-Qin, a FI tool that operates at the CUDA assembly code level using CUDA-gdb. Subsequently, Hari et al. [24] developed SASSIFI, which injects faults at the SASS assembly code level using SASSI, NVIDIA's compiler-based instrumentation tool.
Li et al. [33] design LLFI-GPU, which operates at the LLVM IR level, and use it to investigate error propagation in GPU kernels. The main advantage of LLFI-GPU is that it is independent of the specific GPU platform or architecture, as long as the compiler provides support for the platform. LLFI-GPU is used as the FI baseline in this thesis. Unfortunately, FI requires significant time and resources to evaluate GPU applications. Hence, it is difficult to integrate FI into the development cycles of GPU programs.

Tan et al. [53] propose an analytical framework to estimate the vulnerability of GPU micro-architectures. However, their technique does not consider the characteristics of the GPU program, as it is micro-architecture dependent. Further, they do not distinguish between crashes and SDCs in their analysis; hence their technique cannot be used to mitigate SDCs. Wei et al. [55] study the approximation properties of soft errors and use them to guide approximate instruction duplication. However, this requires thousands of FIs and intimate knowledge of the application's domain and functionality.

3.2 Modeling Error Propagation

Error propagation models are widely employed to estimate the resilience of CPU programs. Compared to FI, the faster execution of modeling techniques makes them ideal for incorporation into the software development cycle. SymPLFIED [46] uses model checking to systematically enumerate all error propagation paths in programs. Unfortunately, this technique does not scale due to state-space explosion. Shoestring [18] uses compile-time analysis and symptom-based error detection techniques to model error propagation and identify vulnerable instructions. Sridharan et al. [51] introduced an analytical model, the program vulnerability factor (PVF), to capture the masking properties of the program with respect to architecture-level faults. However, they do not distinguish between SDCs and crashes, leading to a loss in accuracy and limited visibility into program resilience. Fang et al. [17] introduce ePVF, which adds the ability to differentiate between crashes and SDCs in PVF analysis. However, none of these techniques model control-flow divergence or memory dependencies, and their accuracy suffers as a result. Further, none of them are targeted towards GPUs. TRIDENT [32, 35] analytically models error propagation using static and dynamic analysis. While this technique is accurate and relatively fast, its methodology only applies to single-threaded CPU programs and does not extend to GPU programs. Guo et al. [21] identify naturally resilient code patterns in HPC applications but do not quantify the program's resilience.

3.3 Pruning of FI Space

Pruning the FI space based on similar execution patterns has been proposed to speed up FI. Hari et al. [22, 23] propose techniques to prune the FI space of CPU applications based on similar error propagation patterns in programs, reducing FI time.

There are two techniques for pruning the FI space of GPU programs that share the same high-level goal as this thesis, namely to estimate the resilience of GPU programs with either limited or no FI. Nie et al. [41] propose heuristics to prune the FI space by identifying representative states in GPU programs. While useful, their method still requires thousands of FI trials to obtain accurate SDC estimates. Running FI experiments on GPUs is very resource-intensive, as a GPU program has a huge FI space to explore among millions of threads.
Further, unlike in CPUs, the FI process itself is challenging to parallelize on GPUs, as only one GPU process is allowed per GPU card. Finally, the heuristics they propose are specific to the pruning of FI sites, and so they cannot be used in modeling error propagation, which is our goal. Kalra et al. [27] use statistical models to characterize program resilience using micro-architecture-agnostic features of GPU programs. The main advantage is that the features are independent of the GPU micro-architecture, and hence can be used on different architectures without requiring any fault injections in the prediction phase. However, their technique relies on a large, representative FI corpus in the training phase, which is used to learn the characteristics of SDCs; it is challenging to obtain sufficient, representative training data, which is critical for achieving high accuracy. Moreover, both techniques only predict the vulnerability of the overall GPU kernel, and not that of individual instructions, which is required for selective protection approaches [18, 29, 48].

3.4 Resilience Analysis Across Multiple Inputs

Czek et al. [14] performed one of the first studies to model the variance in failure rates across multiple inputs for CPU applications. Leo et al. [15] study the variance of failure modes for different inputs. Mahmoud et al. [37] deploy software testing techniques like test case minimization to reduce the inputs required to obtain a representative reliability profile of CPU applications. Li et al. [32] study the variation of SDCs across multiple inputs and present vTrident, which bounds the SDC probability of applications across multiple inputs while requiring FI for only one input. However, all of these techniques are designed for CPU applications.

There is limited work on studying the reliability of GPU applications across multiple inputs. Previlon et al. [47] study the impact of execution parameters (specifically input data and thread-block size) on the reliability of GPU applications. They perform extensive FI for this analysis and find that the vulnerability of the studied applications remains unchanged across standard inputs. However, this study only looks at 6 benchmarks, and the number of threads launched is minimal (a maximum of 1024 threads), which is not representative, as real-world applications can launch millions of threads.

3.5 Summary

The increasing prevalence of transient hardware faults in GPUs has resulted in a large body of research on understanding the error resilience of GPUs. Random fault injection [16, 24, 33, 56] has traditionally been used for this purpose, but it is quite resource- and time-intensive, and is hence not suitable for exhaustive studies. Prior efforts have proposed reducing the overhead of FI techniques by either pruning the FI space [27, 41] or modeling error propagation [17, 18, 51], but these techniques are either limited in accuracy or only give a coarse-grained view of the reliability of GPU applications. This thesis analyzes the behavior of GPU applications and presents an efficient and scalable solution for analyzing the instruction-level reliability of GPU applications. Work analyzing the reliability of GPU applications across multiple inputs is quite sparse, and the proposed techniques deploy FI [47], which has a high performance overhead.
The techniques proposed in this thesis can be used to characterize the reliability of GPU applications across multiple inputs efficiently.

Chapter 4

Challenges

The goal of this thesis is to create a fast and accurate model for tracking error propagation in GPU applications, and hence predict the SDC probabilities of the overall kernel as well as of individual instructions. In this chapter, we discuss the challenges in modeling error propagation in GPU programs. These challenges broadly relate to the large amount of data to be profiled due to the massive parallelism of GPU applications, error propagation across threads, and inaccuracies introduced by data patterns specific to GPUs.

4.1 Overview

As described in Section 2.1, GPU kernels are massively parallel, and many kernels share data across threads. Therefore, errors may propagate not only within each thread but also between threads. Because GPU programs have a large number of threads (e.g., the Circuit application [1] has 4.25 million threads), tracking error propagation via memory dependencies incurs high overheads due to the large amount of memory access information that must be profiled. Moreover, even small inaccuracies in modeling the error propagation of individual threads are exacerbated when aggregating the kernel's overall SDC probability, due to the large number of threads.

Figure 4.1 shows a code example of the cudaSolve kernel from the Circuit benchmark. This program solves a 2D circuit grid in parallel using the Jacobi method. For ease of explanation, we have slightly modified the code and removed the unnecessary parts.

Figure 4.1: A slightly modified code segment from the Circuit benchmark [1].

In the example, a subset of shared memory (line 4) is initialized in each thread (line 6). After all the threads have completed the initialization (line 8), the data in the shared memory is read by adjacent threads (line 10) for computation. This way, data can be efficiently transferred between threads via fast on-chip shared memory. However, this can also result in error propagation from one thread to another, and eventually to the entire thread block and the output of the kernel. Li et al. [33] have previously shown that a single fault can lead to high contamination (up to ~60%) of memory states in GPU applications, due to error propagation across threads and data flow between global and shared memory.

We identify the following three challenges in modeling error propagation for GPU programs.

4.2 Large Amount of Profiling Data

To track error propagation via memory dependencies, one needs to profile all memory accesses in the kernel and identify the data dependencies between each load and store. This can be done using dynamic analysis, matching the runtime memory addresses of each load and store. However, a typical GPU kernel consists of hundreds or thousands of threads, each of which may interleave dependencies with other threads in its thread block. Consequently, a massive amount of memory access information needs to be collected during the profiling phase. For example, Lulesh, a typical HPC application [28], generates a memory execution trace larger than 1 TB even for medium-sized inputs. Processing the trace, i.e., loading it into memory and traversing it (Section 2.5.3), incurs significant overheads, which makes it impractical to use TRIDENT for modeling GPU HPC benchmarks. It is not possible to bypass this memory bottleneck by applying TRIDENT to individual threads, as the time for modeling thousands of threads runs into days or even months - see Table 6.1. (We obtained these numbers by multiplying the time taken by TRIDENT for a single thread of the program by the number of threads in the kernel.)
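A back-of-envelope estimate shows why such traces grow to this scale. The record size and access counts below are assumptions chosen only to illustrate the scaling, not measurements from this thesis.

    def trace_size_bytes(num_threads, mem_accesses_per_thread, bytes_per_record):
        """Rough size of a full memory-access trace: one record per
        dynamic load/store in every thread (illustrative estimate only)."""
        return num_threads * mem_accesses_per_thread * bytes_per_record

    # Assumed values: 500K threads, 100K dynamic loads/stores per thread,
    # ~20 bytes per record (operation type, static index, 64-bit address, thread id).
    size = trace_size_bytes(500_000, 100_000, 20)
    print(f"{size / 1e12:.1f} TB")  # ~1.0 TB for this hypothetical kernel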
4.3 Large Number of Threads

Tracing error propagation between threads in a GPU kernel requires a fine-grained error propagation model for every thread in the kernel, which potentially incurs a large modeling overhead. For example, if an error occurs in a thread, it first propagates within that thread. Later, the error may propagate to another thread in its thread block via a shared memory dependency. Hence, the error continues to propagate in both threads. Finally, the error may propagate to some other threads or back to the original thread, depending on the program's execution. To track the error, one must model all the threads in the program and trace the propagation in a lock-step fashion.

A natural question that arises is: can we ignore error propagation between threads in GPU programs? To answer this question, we modify GPU-TRIDENT to stop tracing the error propagation when any shared memory access is encountered. We show the results in Figure 4.2 for the benchmarks that use shared memory in our evaluation (the experimental setup and benchmarks are described in Section 6.1). As can be seen, the SDC probabilities predicted by the modified GPU-TRIDENT are significantly lower than the FI ground truth (mean absolute difference of 17.2%). The only exception is LUD K3, which uses shared memory sparingly (only two stores per thread go to shared memory). This result indicates that we cannot ignore inter-thread error propagation in GPU programs.

Figure 4.2: Prediction of SDC probability by a modified GPU-TRIDENT that does not model error propagation across threads, for benchmarks that use shared memory; NW K1 means the first kernel of the NW benchmark.

4.4 Accumulated Inaccuracy From Individual Threads

The overall SDC probability of a GPU kernel is the aggregation of the SDC probabilities of the individual threads. Because GPU programs consist of hundreds or thousands of threads, if the model is even slightly inaccurate for each thread, the overall SDC probability of the kernel will be inaccurate.

4.5 Summary

In this chapter, we analyzed GPU applications in the context of applying the TRIDENT methodology to them. We find that TRIDENT cannot be applied as-is to GPU kernels, as the large amount of data associated with multi-threaded GPU applications poses a huge bottleneck. Moreover, inter-thread error propagation due to interleaving memory dependencies among threads renders the fm sub-model of TRIDENT inapplicable to GPUs, as it was designed for single-threaded CPU applications. Finally, we observe that minor inaccuracies introduced while modeling error propagation in GPU kernels can result in a highly inaccurate final model.

Chapter 5

Approach

In Chapter 4, we identified challenges that complicate the efficient modeling of error propagation in GPU applications. We observed that the challenges relate to the large amount of data to be profiled, inter-thread error propagation, and inaccuracies due to biased data patterns.

In this chapter, we propose heuristics to overcome these challenges and explain the rationale behind them.
We first propose two heuristics to select a subset of threads in a GPU kernel from which to extract its intra-thread memory dependencies (Section 5.1). We then propose a third heuristic to select a new subset of threads to construct the inter-thread memory dependencies (Section 5.2). Finally, we identify two types of value-based masking as major sources of inaccuracy when applying TRIDENT to GPU applications, and propose heuristics to mitigate them (Section 5.3).

5.1 Profiling for Intra-Thread Memory Dependency (H-Intra)

The goal of this step is to construct the memory dependency graph for a given GPU kernel without profiling all the memory accesses in all the threads. Recall that the memory dependency graph in TRIDENT is required to establish the data dependencies between static store and load instructions (Section 2.5.3). This problem has been studied in the context of performance improvements in CPUs at the architecture level [12], but our goal is to extract the memory dependencies between instructions at the application level, as GPU-TRIDENT models error propagation at that level. It is possible to profile all memory addresses used by load and store instructions, and then match the addresses to build the graph. However, this can be extremely time-consuming in massively parallel GPU programs. Instead, we profile only a small subset of threads, chosen based on control-flow similarity, and match the loads and stores in those carefully selected threads to build the memory dependency graph. We call these intra-dep threads, as they contain all the possible intra-thread dependencies between static load and store instructions (for that execution). The main algorithm for choosing intra-dep threads is shown in Algorithm 1. Note that in this section, we group threads using the entire control-flow profile of each thread, rather than the number of instructions executed per thread used in prior work [41], as we found the former to be more accurate.

Algorithm 1: Selecting intra-dep threads.
Input: CF_T: control flow of all threads
Output: TP: set of threads to profile
    TP = {}
    for all threads T do
        if conditional branches in loop then
            // Remove iterations with non-unique memory accesses
            CF_sim = Remove_redundant(CF_T)
        else
            CF_sim = CF_T
        // Profile thread if simplified control-flow is unique
        if CF_sim not in TP then
            TP.insert(T)
    end

5.1.1 Selection Based on Thread Execution Path (H-Intra-Path)

We first profile the execution path of each thread and group the threads that have identical execution paths. We then choose a single thread from each group to be part of the intra-dep threads. The intuition is that threads with identical execution paths have the same static dependencies between loads and stores, because only control-flow divergence can cause divergence in the memory dependencies. Table 5.1 shows the number of threads invoked in each kernel of the benchmark applications (Section 6.1 has more details), and the number of intra-dep threads resulting from the grouping. As can be seen, most of the kernels have only a small number of intra-dep threads. The two exceptions are Pathfinder and NW, as they have conditional branches inside loops; hence their loops execute a different number of times in different threads, which results in different control-flow paths.

Kernel           | Total threads | After H-Intra-Path | After H-Intra-Loop | After H-Inter
NW K1            | 48            | 46                 | 25                 | 48
NW K2            | 16            | 16                 | 13                 | 16
Pathfinder       | 592,640       | 563,606            | 15                 | 3,840
BFS K1           | 32,768        | 781                | 8                  | -
BFS K2           | 32,768        | 2                  | 2                  | -
Gaussian         | 3,840         | 4                  | 4                  | -
HotSpot          | 473,344       | 250                | 234                | 35,328
Particlefilter   | 9,216         | 38                 | 7                  | -
LUD K1           | 64            | 16                 | 16                 | 64
LUD K2           | 192           | 2                  | 2                  | 64
LUD K3           | 3,584         | 1                  | 1                  | 256
CudaBenchMarking | 8,192,000     | 1                  | 1                  | -
SRAD K1          | 32,768        | 144                | 144                | 16,384
SRAD K2          | 32,768        | 16                 | 16                 | 4,096
Circuit          | 4,156,416     | 18                 | 18                 | 2,592
Lulesh           | 66,816        | 2                  | 2                  | -
Blur             | 236,544       | 7                  | 7                  | -

Table 5.1: Thread group details for the studied kernels
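A minimal sketch of the path-based grouping step (H-Intra-Path) is shown below: threads are keyed by their full control-flow signature, and one representative per group is kept. The signature representation (a tuple of taken-branch records) is an assumption made for illustration; Algorithm 1 above additionally canonicalizes loop iterations, which this sketch omits.

    def select_intra_dep_threads(control_flow_by_thread):
        """Group threads by identical execution paths and keep one
        representative per group (H-Intra-Path, simplified sketch).

        control_flow_by_thread: dict mapping thread id -> sequence of
        (branch instruction index, taken direction) records.
        """
        representatives = {}                       # signature -> thread id
        for tid, path in control_flow_by_thread.items():
            representatives.setdefault(tuple(path), tid)
        return sorted(representatives.values())

    # Hypothetical example: threads 0 and 2 share a path, thread 1 diverges.
    paths = {
        0: [(4, True), (9, False)],
        1: [(4, False)],
        2: [(4, True), (9, False)],
    }
    print(select_intra_dep_threads(paths))  # [0, 1]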
5.1.2 Selection Based on Loop Patterns (H-Intra-Loop)

To further reduce the number of intra-dep threads, we pick threads that have unique loop patterns. We use a code example from the Pathfinder benchmark to explain this heuristic. Figure 5.1a shows a kernel that has three conditional branches inside a loop. Pathfinder has a high number of intra-dep threads, as different threads take different branches in each iteration, and hence execute different numbers of iterations, resulting in diverging execution paths. We profile the execution paths inside each loop iteration and list a subset of them in Figure 5.1b. Numbers in the figure represent the comparison instruction, while T and F represent the branch direction taken. We find that many iterations contain repeated branch patterns, which yield the same memory dependencies during profiling. As seen from the figure, the first three iterations in Thread 1 and Thread 3 are identical; therefore, we only need to profile one of them to obtain the corresponding memory dependencies. Similarly, the order of the repeated patterns within a thread does not matter, as it yields the same dependencies. For example, Thread 3 and Thread 4 follow the same branch patterns but in a different order of iterations, yet yield the same memory dependencies. Therefore, we only need to profile one of the two threads.

Figure 5.1: Conditional branches inside a loop in Pathfinder. (a) Code snippet of Pathfinder. (b) Control flow traces of threads.

Table 5.1 also shows the number of intra-dep threads after applying the above loop selection heuristic. As seen, Pathfinder and BFS K1 see a significant reduction in the number of threads, but the other kernels do not, as their threads do not have conditional branches with divergent control-flow within loops.

5.2 Profiling for Inter-Thread Memory Dependencies (H-Inter)

Intra-thread memory dependencies can be established based on the thread selection method in the previous section. However, inter-thread dependencies may still exist between threads with differing control-flows, and we need to model these.

As before, we propose a heuristic to reduce the number of loads and stores that must be profiled to capture the inter-thread dependencies. Our heuristic is to first select the thread blocks that each of the intra-dep threads belongs to, and then profile the shared memory accesses in each of those thread blocks. This is based on the observation that there is often a high ratio of threads to thread blocks, as well-designed GPU kernels attempt to avoid control-flow divergence within each thread block to avoid stalls. Hence, we find that selecting thread blocks based on unique control-flow patterns (i.e., thread blocks containing intra-dep threads) captures most of the unique shared memory access patterns (see Table 5.2).
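The strength of this heuristic hinges on how uniform control flow is within a thread block. One way to quantify this, consistent with the worked percentages discussed below but an illustrative formulation of our own, is to score each block by how many of its threads share the block's most common execution path (so a fully uniform block scores 100% and a fully divergent pair scores 0%), and then average across blocks.

    from collections import Counter

    def block_cf_similarity(paths_in_block):
        """Control-flow similarity of one thread block, normalized so that
        a fully divergent pair scores 0 and a fully uniform block scores 1
        (one possible reading of the metric used in the text)."""
        n = len(paths_in_block)
        if n < 2:
            return 1.0
        most_common = Counter(map(tuple, paths_in_block)).most_common(1)[0][1]
        return (most_common - 1) / (n - 1)

    def average_cf_similarity(blocks):
        """Average the per-block similarity across all thread blocks."""
        return sum(block_cf_similarity(b) for b in blocks) / len(blocks)

    # Hypothetical kernel: three 2-thread blocks; the middle block diverges.
    blocks = [
        [["A", "B"], ["A", "B"]],   # identical paths -> 1.0
        [["A", "B"], ["A", "C"]],   # divergent paths -> 0.0
        [["A", "B"], ["A", "B"]],   # identical paths -> 1.0
    ]
    print(f"{average_cf_similarity(blocks):.2%}")  # 66.67%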
This is based on the observationthere is often a high ratio of threads to thread blocks, as well-designed GPU kernelsattempt to avoid control-flow divergence within each thread block to avoid stalls.Hence, we find that selecting thread blocks based on unique control-flow patterns(i.e., thread blocks containing intra-dep threads) will capture most of the uniqueshared memory access patterns (see Table 5.2).Figure 5.2 shows the executions of two kernels as examples. Each row repre-sents a possible execution of an instruction, while each column indicates a thread.For simplicity, we only show instructions accessing shared memory. The value in-side each square is the memory address that the instruction uses. Having the sameaddress means that the load and the store have an inter-thread dependency, whichwe need to identify.In Figure 5.2a, there are 3 thread blocks (TBs) as shown. Each TB containstwo threads. T1, T2, T5, and T6 have the same execution path, whereas T3 andT4 follow a different execution path. Note that the control-flow paths of threadsare identical within each thread block in this example. T1 and T3 are selected asintra-dep threads based on their execution paths (Section 5.1). According to ourheuristic, we profile all the shared memory accesses in the thread blocks that T1and T3 belong to. In other words, we profile the shared memory in T1, T2, T3, and26T4. This way, we establish 2 inter-thread memory dependencies, namely (Index 5and Index 9), and (Index 7 and Index 9), based on the addresses. Since the ratio ofcontrol-flow similarity across thread blocks is high (100%), our heuristic identifiesall the inter-thread dependencies.The example in Figure 5.2b also has 6 threads in 3 thread blocks. Threadsin TB1 follow the same execution path, as do the threads in TB3. However,threads in TB2 (i.e., T3 and T4) follow different paths. In this case, the ratio ofthe similarity of the control-flow of threads inside thread blocks is only 66.67%((100%+0%+100%)/3). Since threads T1 and T4 are selected as intra-dep threads,all the threads in TB1 and TB2 are considered for profiling, and the inter-threadmemory dependencies (Index 5 and Index 9), and (Index 5 and Index 11) are iden-tified. However, since the control-flow of T4 is different from T3 in TB2, the se-lection of T4 does not include the profiling of the threads in TB3, thereby missinganother inter-thread dependency (Index 7 and Index 11) in T5 and T6.Therefore, our heuristic achieves high coverage only when the control-flowsimilarity ratio is high in a thread block. So we ask the question what are theratios of control-flow similarity in thread blocks in GPU programs? Table 5.2shows the results, for the kernels that have inter-thread memory dependencies. Onaverage, about 75% of the threads in a thread block exhibit control-flow similarity,with the highest being 100% in LUD K3, and the lowest being 0% in LUD K1.Therefore, our proposed heuristic (H-inter) can identify most of the inter-threadmemory dependencies in practice.Kernel CF Similarity Kernel CF SimilarityNW K1 54.16% NW K2 25%HotSpot 98.31% LUD K1 0%LUD K2 70.32% LUD K3 100%Circuit 99.35% Pathfinder 96.61%SRAD K1 98.28% SRAD K2 99.6%Table 5.2: Average Control-flow(CF) similarity percentage of threads in athread blockTable 5.1 shows the number of threads chosen for profiling inter-thread mem-ory dependency (inter-dep threads). On average, we only select ~40% of the totalthreads as inter-dep threads. 
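The control-flow similarity reported in Table 5.2 can be approximated with a short sketch. The per-block metric used below (the share of the remaining threads that follow the block's most common path) is an assumption chosen so that it reproduces the (100% + 0% + 100%)/3 example above; the exact definition used by GPU-TRIDENT may differ.

def block_cf_similarity(block_traces):
    """Control-flow similarity of one thread block.

    block_traces is the list of per-thread path signatures for the block.
    Assumed metric: share of the remaining threads that follow the block's
    most common path (1.0 when all paths match, 0.0 when the two threads
    of a block diverge).
    """
    n = len(block_traces)
    if n <= 1:
        return 1.0
    most_common = max(set(block_traces), key=block_traces.count)
    matching = block_traces.count(most_common)
    return (matching - 1) / (n - 1)

# The Figure 5.2b scenario: TB1 and TB3 are uniform, TB2 diverges.
blocks = [[("p1",), ("p1",)], [("p1",), ("p2",)], [("p2",), ("p2",)]]
ratios = [block_cf_similarity(b) for b in blocks]
print(sum(ratios) / len(ratios))   # -> 0.666... (i.e., 66.67%)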
From Table 5.1, it can be seen that kernels with a lower number of total threads do not show much reduction (0% for NW K1 and NW K2), while kernels with a large number of threads show a much higher reduction in the number of threads chosen (~99.3% for Circuit). Note that only load and store instructions to shared memory are profiled for these threads. The algorithm for choosing inter-dep threads is in Algorithm 2.

Algorithm 2: Selecting inter-dep threads.
Input : TP: Threads from H-Intra
Output: TPS: Threads to profile
1 TPS = {}
2 TBP = {}
3 for all threads T in TP do
4     tb = find_threadblock(T)
5     if tb not in TBP then
6         TBP.insert(tb)
7         // Profile all threads of this thread block
8         TPS.insert(all_threads(tb))
9 end

We construct the memory dependency graph for a given GPU kernel by applying the above heuristic. We then follow the method of TRIDENT to weight the edges of the graph based on the dynamic instruction count of each static load and store, and then trace error propagation using the graph. Note that fm no longer needs the memory addresses (i.e., the memory execution trace) for extracting the dependency. Instead, it reads the dependency directly from the memory dependency graph that we created for this purpose, and aggregates the propagation probabilities based on the edge weights (Section 2.5.3).

5.3 Value-Based Masking (H-Value)

As mentioned, it is imperative to have high accuracy in modeling error propagation in GPU threads, due to the potential of aggregating small errors into large ones. We identify two sources of inaccuracy in TRIDENT and propose the H-Value heuristic to mitigate them; this leads to an average accuracy improvement of approximately 1.9%. The sources of inaccuracy and the heuristics proposed to overcome them are as follows.

5.3.1 Multiplication by Zero

If a target operand of an instruction is multiplied by zero, the result will always be zero, regardless of errors present in the target operand. If this is frequently seen in a program and not taken into account while developing the model, the model will overestimate error propagation, since it does not consider this source of error masking. We find that there is a non-negligible amount of multiplication-by-zero in GPU kernels (i.e., as high as 56.6% in Lulesh, and on average 11.2%), hence this masking must be considered in GPU-TRIDENT. To account for this, we profile the operands of multiplication instructions and calculate the frequency of zero operands of mul instructions. We then weight the propagation and masking probabilities in the relevant instruction tuple (Section 2.5.1) accordingly.

Figure 5.3: Example of error masking due to multiply by 0.
  bb2:
    INDEX 1: $1 = load 0x4     (zero probability: 0%)
    INDEX 2: $2 = load 0x8     (zero probability: 40%)
    INDEX 3: $3 = mul $1, $2   (new propagation probability = 80%)
             br ...

Consider the code snippet shown in Figure 5.3, in which two values are loaded from memory and multiplied. Now suppose a fault occurs at INDEX 1, due to which its result ($1) is corrupted. This fault will not propagate beyond INDEX 3 ($3) if the other operand of the multiply instruction, i.e., $2 (the result of INDEX 2), is 0, resulting in error masking that TRIDENT does not take into account. In the example, we get the probability of $1 and $2 being zero, which is 0% and 40% respectively, and use it to modify the default propagation probability (100%) of INDEX 3. As each operand contributes 50% to the overall propagation probability, the new propagation probability would be (1.0 - 0.5*0.4 - 0.5*0)*100% = 80%.
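This adjustment can be written as a small helper. The sketch below mirrors the equal-per-operand weighting used in the Figure 5.3 example; the inputs are hypothetical profiled zero-probabilities, not GPU-TRIDENT's internal data structures.

def mul_propagation_probability(p_operands_zero):
    """Probability that a fault in one operand of a multiply propagates.

    p_operands_zero lists, for each operand of the multiply, the profiled
    probability that it equals zero. Each operand is assumed to contribute
    equally to the default 100% propagation probability, as in Figure 5.3.
    """
    n = len(p_operands_zero)
    masked = sum(p_zero / n for p_zero in p_operands_zero)
    return 1.0 - masked

# Figure 5.3: $1 is zero with probability 0.0, $2 with probability 0.4.
print(mul_propagation_probability([0.0, 0.4]))   # -> 0.8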
5.3.2 Lucky Stores

Recall that if an error modifies a branch direction, the store instructions dominated by the corrupted branch will not be executed correctly (Section 2.5.2). In this case, TRIDENT assumes that the error always propagates to the store instructions. However, if a store instruction was supposed to write a value that is already in memory, errors will not propagate even though the instruction is skipped due to a corrupted branch. This is similar to the phenomenon of silent stores (dynamic store instructions that do not change the state of a system) [31], which has been studied in the context of performance improvement in computer architecture. We call such store instructions lucky stores, similar to the lucky loads found in prior studies related to the reliability of CPUs [13, 35].

To understand this, consider Figure 5.4. Suppose that the store in bb2 was to be executed in a fault-free run, but due to a fault, the result of the INDEX 1 comparison instruction was flipped and bb2 was not executed. The model will conservatively assume that the store is corrupted and propagates the error. However, if zero was to be stored (and that store is missed due to the fault), and the memory location already contained a zero, the fault will have no effect on memory contents. In such a scenario, the outcome will be benign, and we will end up overestimating the SDC probability of the kernel. Identifying all the lucky stores requires recording every operand of store instructions in the memory execution trace, which can be extremely time-consuming. We observe that zero values are dominant in the operands of output stores of GPU kernels, as most initializations of kernel memory use zeros. Therefore, we record only the frequencies of output stores that have a zero operand at run-time, and weight the propagation probabilities in fc accordingly. This way, we keep the profiling overhead reasonably low. Our experiments show that an average of 7.6% and a maximum of 40% of output stores are lucky stores for the kernels used in this thesis.

Figure 5.4: Example of lucky stores.
  bb1:
    INDEX 1: cmp ...
  bb2:
    store ...
  bb3:
    ...

5.4 Analysis Workflow

Based on the above heuristics, the analysis workflow of GPU-TRIDENT consists of the following steps:

1. H-Value is applied to get the fs and fc sub-models based on the dynamic data profiled from the threads.
2. H-Intra is used to select a subset of threads to get the intra-thread memory dependencies, whose memory access instructions are then profiled.
3. H-Inter is used to select a subset of threads to get the inter-thread memory dependencies, whose memory access instructions are then profiled.
4. The fm model is created from the memory dependency graph, based on the profiled memory access instructions.
5. The sub-models (fc, fs, and fm) are used to get the SDC probability of individual instructions and of the kernel (see the sketch after this list).
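As a rough illustration of step 5, the sketch below combines per-instruction SDC estimates into a kernel-level figure by weighting each static instruction by its dynamic count, which mirrors how random FI samples dynamic instructions uniformly. The instruction names, probabilities, and the exact weighting are assumptions for illustration; the real combination is carried out by the fc, fs, and fm sub-models described earlier.

def kernel_sdc_probability(per_instruction_sdc, dynamic_counts):
    """Aggregate per-instruction SDC probabilities into a kernel-level one.

    per_instruction_sdc: static instruction id -> predicted SDC probability.
    dynamic_counts: static instruction id -> number of dynamic instances,
    used as sampling weights (random FI picks dynamic instructions
    uniformly at random).
    """
    total = sum(dynamic_counts.values())
    return sum(per_instruction_sdc[i] * dynamic_counts[i] / total
               for i in per_instruction_sdc)

# Hypothetical three-instruction kernel.
sdc = {"i1": 0.40, "i2": 0.10, "i3": 0.00}
counts = {"i1": 1000, "i2": 3000, "i3": 6000}
print(kernel_sdc_probability(sdc, counts))   # -> ~0.07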
5.5 Summary

In this chapter, we present heuristics for profiling different levels of memory dependencies in a GPU kernel. We find that a subset of threads selected based on the similarity of execution patterns of threads, loop iterations, and shared memory accesses within a thread block can be used to recover all the inter-thread as well as intra-thread memory dependencies of a kernel. This information is then used to construct the memory dependency graph of a kernel, which is used for tracking error propagation within as well as between threads. Finally, we identify two common data patterns in GPU kernel executions and use them to improve the accuracy of our modeling methodology.

Chapter 6
Evaluation

In Chapter 5, we introduced heuristics to overcome challenges in modeling error propagation in GPU applications. In this chapter, we evaluate the effectiveness of our proposed techniques. We first describe the experimental setup used (Section 6.1), giving an overview of the workflow of GPU-TRIDENT, the FI method used as the baseline, and the benchmarks we use for this evaluation. We then present our results for the accuracy and scalability of GPU-TRIDENT at the kernel and instruction level. Finally, we discuss the sources of inaccuracies in our proposed techniques.

6.1 Experimental Setup

6.1.1 Workflow and Implementation of GPU-TRIDENT

We implement GPU-TRIDENT as a set of LLVM compiler passes that are integrated into NVIDIA's NVCC compiler [42]. Although NVCC is based on LLVM, it does not expose its IR representation to external tools. Therefore, we attach a dynamic library [39] to insert the LLVM passes of GPU-TRIDENT into the toolchain. GPU-TRIDENT needs the application code (with the kernel under test annotated) and an input required to execute it. We also need to identify the static store instructions used by the kernel for transferring data to the host and designate them as output instructions. Using these inputs, GPU-TRIDENT profiles the program, and then estimates the SDC probability of individual instructions as well as that of the complete kernel without performing any FIs.

6.1.2 FI Method

Recall that GPU-TRIDENT aims to predict SDC probabilities, which are usually measured using FI. Therefore, we use FI as the baseline to establish our ground truth when evaluating GPU-TRIDENT. We use an open-source injector, LLFI-GPU [33], to perform FIs. LLFI-GPU aids comparison with GPU-TRIDENT, as both are implemented using LLVM. Therefore, we can map GPU-TRIDENT predictions to the FI results of individual instructions to examine the accuracy of GPU-TRIDENT on a per-instruction basis.
As described in the fault model, we consider faults in the computational elements of the GPU (Section 2.2), and we inject faults into the destination registers of executed instructions. Only one fault is injected per program execution. In each FI trial, the application is executed from the beginning. We uniformly choose an instruction at random from the set of executed instructions in the kernel. We then flip a single bit in its destination register and execute the program to completion. This method has been shown to be accurate for estimating SDC probabilities [45].
To determine if an SDC occurred, the entire array in device memory that is written by the kernel (to transfer data to the host code) is compared to its contents in a fault-free execution (i.e., golden run). This is because we only study error propagation within the GPU kernel, and hence cannot use the application output as is usually done for CPU programs [19].

6.1.3 Benchmarks

We choose a total of 17 kernels from 12 applications belonging to different domains, 8 of which are from the Rodinia benchmark suite [11], and 4 are open source HPC applications [1], [28], [3], [5]. These are listed in Table 6.1. They range in size from 16 to 511 static LLVM IR instructions, and from 16 to 8,192,000 in terms of total threads launched.
There is no standard set of benchmarks or applicationclassification for reliability analysis. Parallel applications are usually classifiedinto 13 categories (called dwarfs) based on their computation and communication33Benchmark Suite LLVMInsts.Kernel ID Uses sharedmemoryTotalthreadsNeedleman-Wunsch Rodinia 248 NW K1 Yes 48249 NW K2 Yes 16Pathfinder Rodinia 132 Pathfinder Yes 592,640BFS Rodinia 47 BFS K1 No 32,76820 BFS K2 No 32,768Gaussian Rodinia 59 Gaussian No 3,840HotSpot Rodinia 259 HotSpot Yes 473,344Particlefilter Rodinia 39 Particlefilter Yes 9,216LU Decomposition Rodinia 142 LUD K1 Yes 64238 LUD K2 Yes 19278 LUD K3 Yes 3,584SRAD Rodinia 511 SRAD K1 Yes 32,768193 SRAD K2 Yes 32,768Lulesh OS HPC [28] 29 Lulesh No 66,816Circuit OS HPC [1] 167 Circuit Yes 4,156,416Perf Benchmark OS HPC [5] 16 Perf BM No 8,192,000HPCCUDA OS HPC [3] 397 HPCCUDA No 236,544Table 6.1: Benchmarks used: 8 are from the Rodinia suite, and the other 4are open source HPC applications (OS HPC).methods [7]. The benchmarks we use in this thesis capture 6 of these 13 dwarfs.The choice of benchmarks was governed by whether they can be compiled withthe LLVM-based infrastructure of LLFI-GPU and GPU-TRIDENT, and whetherFI can be completed in a reasonable amount of time. We removed all sources ofrandomness in the benchmarks (e.g., changed random numbers to constant values),to get reproducible results for the experiments.We use the smaller inputs that come with each benchmark. We have also testedGPU-TRIDENT with bigger inputs for a subset of the same kernels (Section 7.2),and find that the prediction results closely match up with FI (though they take muchlonger). However, to keep the time for the detailed FI experiments manageable, wechoose to use the smaller inputs in this evaluation. Prior work [27, 33, 41, 53, 55]has also used similar input sizes.Table 6.1 also shows the times TRIDENT and GPU-TRIDENT take for eachof the kernels (as mentioned earlier, TRIDENT times are estimates). We considerthe time taken for both systems to obtain the overall SDC probability and per-instruction SDC probability. As can be seen, GPU-TRIDENT takes less than about10 minutes for all the kernels. In contrast, TRIDENT takes days or even months to34analyze the kernels.It should be noted that a subset of these benchmarks was used to identify (5benchmarks) the challenges and for devising heuristics (7 benchmarks) to avoidover-fitting. The rest of the benchmarks were just used to test the proposed tech-niques.6.2 AccuracyIn this section, we evaluate the accuracy of GPU-TRIDENT in predicting SDCprobabilities of the kernel and individual instructions.6.2.1 Overall SDC ProbabilityWe use GPU-TRIDENT to predict the SDC probability of a given kernel (and in-put), and then measure its SDC probability (under the same input) using randomFI. We perform 5000 FI experiments per kernel - the error bars for the SDC mea-surements range from ±0.53% to ±1.82% at the 99% confidence level.Figure 6.1 shows the result of our experiments. As can be seen, GPU-TRIDENTprovides reasonably accurate predictions. The average difference (mean absoluteerror) between FI and GPU-TRIDENT is 5.7% (in comparison, TRIDENT had anaverage difference of 4.75% [35] for CPU programs). 
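The accuracy metrics used in this comparison (mean absolute error and, below, the Pearson correlation coefficient) can be computed as in the following sketch; the per-kernel values shown are hypothetical placeholders, not the measured results.

import numpy as np

def accuracy_metrics(fi_sdc, predicted_sdc):
    """Mean absolute error and Pearson correlation between FI-measured
    and predicted SDC probabilities (one value per kernel)."""
    fi = np.asarray(fi_sdc, dtype=float)
    pred = np.asarray(predicted_sdc, dtype=float)
    mae = np.mean(np.abs(fi - pred))
    pearson = np.corrcoef(fi, pred)[0, 1]
    return mae, pearson

# Hypothetical SDC probabilities (%) for four kernels.
fi_vals = [30.0, 12.0, 45.0, 5.0]
pred_vals = [27.5, 14.0, 47.0, 6.0]
print(accuracy_metrics(fi_vals, pred_vals))

For the 17 kernels evaluated here, the corresponding mean absolute error is the 5.7% reported above.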
This number is inflated bythe results of Lulesh, BFS K1 and Pathfinder, which have a difference of ~24%,~15% and ~16% respectively (the difference is 2.97% if we remove these kernels).We explain the reasons in Section 6.4.Pearson Correlation CoefficientWe also calculate the Pearson correlation coefficient between the FI results andGPU-TRIDENT predictions for all the kernels. The correlation coefficient is 0.88,showing a high agreement between them. The correlation coefficient increases to0.99, if we ignore the three outliers described earlier.Paired T-testWe also use a paired t-test to examine if the predictions are statistically differ-ent from the FI measurements, in line with TRIDENT’s method [35]. We first35Figure 6.1: The SDC probability of GPU kernels predicted by FI and GPU-TRIDENT. Error bars are shown for the FI estimates at the 99% confi-dence level - they range from ±0.53% to ±1.82%.checked that the differences between the predictions and FI measurements are ap-proximately normally distributed, as required by the t-test. In the t-test, our null-hypothesis is that there is no statistically significant difference between the resultsfrom FIs and the predicted SDC probabilities by GPU-TRIDENT for the 17 ker-nels. We find that the t-test yields a p-value of 0.36, which is much higher than thecustomary threshold of 0.05 [52], and hence we fail to reject the null hypothesis.6.2.2 Instruction SDC ProbabilityWe also evaluate the prediction accuracy of per-instruction SDC probabilities foreach kernel. We use GPU-TRIDENT to predict the SDC probabilities of all staticinstructions, and then compare the predicted values with the FI results. In FI, weinject 100 random faults in each static instruction of the kernel (a random dynamicinstance of the instruction is chosen for FI in each run).36Pearson Correlation CoefficientAs before, we calculate the Pearson correlation coefficient between the SDC con-tribution by each static instruction found using FI and by GPU-TRIDENT for allkernels. The average correlation coefficient was 0.83, excluding the outliers men-tioned in Section 6.2.1 and Gaussian. Gaussian has a low coefficient which isreflected in the predicted SDC probability being more than double the measuredSDC probability, although the percentage difference is ~5%. The correlation anal-ysis shows that the per-instruction SDC probability obtained by GPU-TRIDENThas a high agreement with FI results.Paired T-testWe also perform paired T-test for SDC probability of individual instructions. Wefirst check if the distribution of the differences between the prediction and the FImeasurement is approximately normally distributed. The normality does not holdfor SRAD K1, SRAD K2, Circuit and Particlefilter, and so we exclude them fromthis experiment (however, the predicted SDC probabilities for these kernels areclose to the FI results (Figure 6.1)). We perform paired t-test experiments in theremaining 13 kernels. The number of paired data in the t-test for each kernel is thenumber of its static instructions. As before, our null hypothesis is that there is nodifference between the FI measurement and the predicted SDC probability of each(static) instruction by GPU-TRIDENT. The p-values are higher than 0.05 in 11 outof the 13 kernels (all except Lulesh and NW K1), and hence we cannot reject thenull hypothesis.6.3 ScalabilityIn this section, we assess the scalability of GPU-TRIDENT with respect to FI. 
Weevaluate the scalability both in terms of the overall SDC probabilities of kernelsas well as those of individual instructions. As mentioned earlier, our accuracyexperiment considered 5,000 trials and obtained error bars at the 99% confidenceinterval. However, if one is prepared to accept lower confidence intervals or loosererror bars, it is possible to decrease the number of FI experiments (i.e., samples). Toevaluate the scalability of GPU-TRIDENT against FI, we compare the time taken37by GPU-TRIDENT and FI when using 500 to 5,000 samples. This method is inline with the number of samples used in other studies [18, 27, 33].As mentioned earlier (Section 2.5), executing GPU-TRIDENT can be dividedinto two phases. (1) Profiling phase, which requires multiple executions of theprogram to collect data, and the (2) inference phase, which uses the informationobtained from the profiling phase to calculate the SDC probabilities for static in-structions, and requires no program executions. We implement the inference phaseof GPU-TRIDENT in (1) single-threaded mode and (2) multi-threaded mode. How-ever, we do not parallelize the profiling phase, to keep it consistent with FI, whichis not parallelized either. Note that parallelizing the profiling phase is non-trivial asit requires either multiple GPUs or the ability to run multiple applications simulta-neously on a single GPU.6.3.1 Kernel SDC ProbabilityFigure 6.2a shows the average time taken by FI and both implementations of GPU-TRIDENT to predict the SDC probability of CUDA kernels. Due to space con-straints, we show the average time across all the kernels. It can be seen that thedifferences between the times taken by GPU-TRIDENT and FI increase sharply asthe number of samples is increased. For example, for 500 samples, GPU-TRIDENTis 2.2 and 5.7 times faster than FI (for single and multi-threaded implementationsrespectively). This increases to 12.7 and 33.4 times for 3,000 samples, and 21.1and 55.6 times for 5,000 samples. This is because FI has negligible fixed costat startup, but every FI trial requires a complete program execution. So the timerequired for FI experiments increases linearly with the number of FI samples. Incontrast, GPU-TRIDENT has an initial fixed cost for building the model. Once themodel is built, calculating the SDC probability of individual instructions has a neg-ligible cost, as it only involves a table lookup. This results in predominantly flatlines for GPU-TRIDENT in Figure 6.2a. It can be seen that multi-threaded GPU-TRIDENT is on average around 2.5X faster (on an 8-core CPU) than its single-threaded counterpart.38(a) Overall SDC Probability (b) Instruction SDC ProbabilityFigure 6.2: Computation Spent to Predict SDC Probability6.3.2 Instruction SDC ProbabilityFigure 6.2b compares the average time taken by GPU-TRIDENT and FI, for gettinginstruction wise SDC probability, for different numbers of static instructions in thekernel, with different numbers of faults injected (100, 500 and 1,000) per instruc-tion. The time shown in the figure is projected based on the per instruction timesrecorded for kernels that we use in our evaluation. The number of samples perinstruction is appended as suffixes in Figure 6.2b. For example, FI-100 means that100 faults are injected into each static instruction. Because GPU-TRIDENT doesnot need to sample individual instructions, the time taken by it remains constant;therefore, it has only one curve for each implementation. 
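The contrast between the two cost curves can be captured by a back-of-the-envelope model. All numbers below are hypothetical and only illustrate that FI cost grows linearly with the number of samples, while GPU-TRIDENT's cost is dominated by a one-time profiling phase.

def fi_time(num_samples, seconds_per_run):
    """FI cost grows linearly: every injection re-executes the kernel."""
    return num_samples * seconds_per_run

def gpu_trident_time(profile_seconds, inference_seconds):
    """GPU-TRIDENT pays a fixed profiling cost; per-instruction lookups
    afterwards are effectively free."""
    return profile_seconds + inference_seconds

# Hypothetical numbers: a 10-second kernel with 5,000 FI samples, versus
# a 15-minute profiling phase plus a 2-minute inference phase.
print(fi_time(5000, 10) / 3600.0)                   # -> ~13.9 hours
print(gpu_trident_time(15 * 60, 2 * 60) / 3600.0)   # -> ~0.28 hours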
It can be seen that as thenumber of instructions in a kernel increases, the difference between the time takenby FI and GPU-TRIDENT also increases. For example, at 50 instructions, FI-100takes around 4 hours more than GPU-TRIDENT (averaged for both implementa-tions). This difference increases to around 84 hours for 1,000 instructions (morethan 20X increase).Figure 6.3 shows the wall-clock time taken by both implementations of GPU-TRIDENT and FI (100 faults injected per instructions), for obtaining instructionwise SDC probability, for each kernel. There is a wide variation in times takenby GPU-TRIDENT and FI for different kernels. For example, BFS K2 takes 0.16hours, while Circuit takes 63.75 hours for FI, so we use a logarithmic scale in thefigure on the y-axis. From the figure, we can see that on average, single-threaded39GPU-TRIDENT is faster than FI by more than one order of magnitude (~38X),while multi-threaded GPU-TRIDENT is faster than FI by about two orders of mag-nitude on average. Further, higher the complexity of the application, greater theimprovement, e.g., Circuit and Lulesh in Figure 6.3.Figure 6.3: Performance comparison of FI and GPU-TRIDENT6.4 Sources of Inaccuracies in GPU-TRIDENTIn this section, we discuss the three sources of inaccuracies in GPU-TRIDENT.6.4.1 Validity of Faulty Store AddressesGPU-TRIDENT assumes that if a wrong memory location is accessed, it will eitherresult in a crash if the memory location is invalid, or an SDC if the memory locationis valid. However, in some cases, the accessed wrong memory location is valid, butis out of the scope of the kernel. This results in a benign outcome.On the other hand, because FI records the kernel’s output in the host code,it has the knowledge about the memory that should be written by GPU-Kernelin order to cause an SDC - therefore, it can obtain the exact SDC probability.This is one of the major reasons for SDC overestimation by GPU-TRIDENT forboth Luelsh and Pathfinder. This inaccuracy can be mitigated by incorporatingthe information about the ratio of the memory used by the kernel that causes anSDC into GPU-TRIDENT. For example, in the case of Luelsh, if we integrate theinformation about the ratio of total memory used by FI to detect SDC (0.5 in thiscase, obtained through analysis of the host code) into GPU-TRIDENT, the absoluteerror in prediction is reduced from 24% to just 0.31%.406.4.2 Conservativeness in Determining Memory CorruptionIn Section 5.3.2, we discuss the phenomenon of lucky stores and introduce a heuris-tic to identify lucky output store instructions. However, lucky stores can also occurin the intermediate stores of the kernel, e.g., when a store to local or shared mem-ory dominated by a faulty branch is missed or incorrectly executed, but the valuewritten happens to be the same as the correct value. GPU-TRIDENT assumes thatany fault propagating to a store instruction results in an erroneous memory value,and hence over-estimates the SDC probability. This is another reason for the inac-curacy in Pathfinder (in addition to the first).Similarly, GPU-TRIDENT assumes that a fault in the address operand of a loadinstruction propagates to the instruction’s output. However, if the incorrect mem-ory location has the same contents as the fault-free memory (i.e., lucky load [13]),it will result in a benign outcome. This is the major reason for inaccuracy for BFSK1 (especially in the earlier kernel invocations), where most of the device memoryallocated initially contains zeroes. 
Therefore, when a load instruction reads froman incorrect memory address, it is likely to load a zero, which is the same value itwould obtain in a fault-free run. Using FI, we find that around 15% of load instruc-tions in BFS K1 are affected by this phenomenon, resulting in over-estimation ofSDC probability by GPU-TRIDENT.6.4.3 Heuristics About Error PropagationThe original TRIDENT considers only three instruction types for logical masking,namely comparisons, logical, and cast operations. GPU-TRIDENT also includesmultiply instructions based on dynamic profiling (Section 5.3.1). However, otherinstructions can also mask errors. For example, the fdiv instruction can mask errorswhile averaging the corrupted bits in the mantissa. This results in GPU-TRIDENTover-estimating SDCs.H-Inter Heuristic: A kernel which has control-flow divergence between threadsin a thread block (Figure 5.2b) can result in GPU-TRIDENT missing some inter-thread memory dependencies in the memory dependency graph, which can resultin inaccuracies (both under- and over-estimates are possible).416.5 LimitationsIn this section, we describe some of the main limitations of the evaluation describedin this thesis.6.5.1 FI MethodologyWe use LLFI-GPU, which injects single bit flips faults at the LLVM IR level. Thismethod has been shown to be accurate for estimating SDC probabilities [15, 33,45], but its effectiveness for other types of failures is an open question. That said,we focus primarily on SDCs in this thesis, which makes LLFI-GPU an appropriatemethod for our purposes.6.5.2 Evaluation KernelsWe choose 17 kernels from 12 benchmarks applications belonging to different do-mains and suites. However, some of the benchmark kernels are quite small. Forexample BFS K2 has only 20 static instructions. We have included benchmarks(Rodinia) and real applications (Circuit, Lulesh, HPCCUDA) that are much larger.Other work in this domain has also used a similar approach [27, 33, 41].6.6 SummaryIn this chapter, we evaluate the accuracy and performance of GPU-TRIDENT againsttraditional FI methodologies. We observe that the predicted SDC probability closelyfollow the values obtained from FI. The mean absolute error is 5.7%, with a Pear-son correlation coefficient of 0.88. If we do not consider the outliers (3 out of17 kernels), then this error is reduced to 2.97%, while the new Pearson correla-tion coefficient is 0.99. GPU-TRIDENT is also found to be accurate in predictingSDC probabilities for individual instructions of kernels. We also find that GPU-TRIDENT has minimal performance overheads compared to FI. It is, on average,55 times faster than FI in evaluating kernel SDC probability and two orders ofmagnitude faster than FI for finding the SDC probability of individual instructions.42Chapter 7Case StudiesIn Chapter 6 we evaluated the accuracy and scalability of GPU-TRIDENT againstFI. We found that GPU-TRIDENT is highly accurate for most applications, andhas a very low performance overhead. This opens up many avenues in reliabilityanalysis, which can not be explored in detail due to the resource-intensiveness ofFI.In this chapter, we use GPU-TRIDENT to study two such applications. Thefirst application is selective instruction duplication guided by GPU-TRIDENT. Se-lective instruction duplication requires SDC probability of every instruction in akernel, obtaining them can be very time consuming using FI, as it requires a lot oftrials for statistically significant results. 
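To see why the required trial counts add up quickly, the usual normal-approximation sample-size formula, assuming the worst-case SDC probability p = 0.5, gives roughly the 5,000 trials per kernel that produced the ~±1.82% error bars at the 99% confidence level in Chapter 6. The sketch below is an illustration of that formula, not part of GPU-TRIDENT.

import math

def fi_trials_needed(margin, confidence_z=2.576, p=0.5):
    """Number of random FI trials needed so that the SDC estimate has the
    given margin of error (normal approximation, z = 2.576 for 99%)."""
    return math.ceil(confidence_z ** 2 * p * (1 - p) / margin ** 2)

print(fi_trials_needed(0.0182))   # -> 5009, i.e., about 5,000 trials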
As GPU-TRIDENT intrinsically providesper instruction SDC probability, it is a low overhead alternative to FI. Second,we use GPU-TRIDENT to study the variation in SDC probability of GPU kernelsacross multiple inputs. GPU-TRIDENT provides an efficient alternative to FI forthis study, as for larger inputs, FI quickly becomes intractable.7.1 Selective Instruction DuplicationIn this section, we use GPU-TRIDENT to guide selective instruction duplication,which is a standard protection technique to detect SDCs [18, 29, 48]. Selective in-struction duplication can provide high error coverage with low-performance over-head. It duplicates most vulnerable instructions and compares their outputs at run-43time to detect any differences as errors [18, 29, 35]. Typically, it is assumed thatthe maximum performance overhead for the protection is fixed to a budget value.The main question while applying selective instruction duplication is: Which staticinstructions should be duplicated to reduce the overall program SDC, given a per-formance overhead budget?. The predominant way to answer this question forGPU applications today is through FI, which is very time consuming, as it requiresper instruction SDC probabilities. In contrast, we study the usefulness of GPU-TRIDENT in replacing FI to answer this question, by accurately predicting the SDCpercentages of individual instructions.GPU-TRIDENT gives us the SDC probability of instructions, while the dy-namic count for each instruction can be used as a proxy for its overhead. Usingthis information, we can frame the original question as a classical 0-1 knapsackproblem [38]. In this case, maximum allowed overhead is the sack capacity, in-structions are the objects while their SDC probability is their associated profit. Forsimplicity, we consider instruction SDC probabilities to be independent of eachother, which is a conservative assumption that has also been made by the previ-ous work [35]. We use GPU-TRIDENT to estimate the SDC probability of eachinstruction, and use the dynamic instruction count as a proxy of the instruction’sperformance overhead (measuring the exact overhead is tedious).We consider two protection overhead levels, which correspond to one-thirdand two-thirds performance overhead of duplicating all the instructions in a givenGPU kernel. After identifying the instructions to duplicate, for each protectionlevel, we select the instructions to be duplicated, and then we project the results ofduplicating these instructions, from actual FI experiments.Figure 7.1 shows the SDC probabilities of kernels after applying selective du-plication. Without protection, the average SDC probability across benchmarks is33.73%. With one-third protection overhead, the average SDC probability is re-duced to 14.2%, which is a reduction of about 58%. With a two-thirds protectionoverhead, the average SDC probability is further reduced to 5.27%, which is a re-duction of about 85%. Note that FI is only used to measure the SDC probabilityafter the protection to validate the predictions made by GPU-TRIDENT.Figure 7.2 shows the average SDC coverage, for all benchmarks, provided byinstruction duplication when FI and GPU-TRIDENT guide it at different overhead44Figure 7.1: SDC Probabilities of selective instruction duplication.Figure 7.2: Average Protection curve obtained by FI and GPU-TRIDENT.levels. We can see that the protection curve of GPU-TRIDENT closely follows thatof FI throughout the range. 
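The selection behind these protection curves can be approximated with a simple greedy pass over the knapsack formulation described above, picking the best SDC-reduction per unit of duplication overhead. This is only a sketch: GPU-TRIDENT's actual selection may solve the 0-1 knapsack differently, and the instruction names, SDC contributions, and dynamic counts below are hypothetical.

def select_instructions(sdc_contrib, dyn_count, overhead_budget):
    """Greedy approximation of the 0-1 knapsack selection: duplicate the
    instructions with the best SDC-reduction per unit of overhead until
    the budget (in dynamic instructions) is spent."""
    ranked = sorted(sdc_contrib,
                    key=lambda i: sdc_contrib[i] / dyn_count[i],
                    reverse=True)
    chosen, spent = [], 0
    for inst in ranked:
        if spent + dyn_count[inst] <= overhead_budget:
            chosen.append(inst)
            spent += dyn_count[inst]
    return chosen

# Hypothetical per-instruction SDC contributions and dynamic counts;
# the budget is one third of the total dynamic instruction count.
sdc = {"i1": 8.0, "i2": 5.0, "i3": 1.0, "i4": 0.5}
cost = {"i1": 4000, "i2": 1000, "i3": 2000, "i4": 500}
budget = sum(cost.values()) // 3
print(select_instructions(sdc, cost, budget))   # -> ['i2', 'i4']

A full dynamic-programming knapsack would give the optimal selection, but it is only practical when the overhead budget is coarsened; the greedy ratio heuristic is a common low-cost stand-in.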
The maximum difference between these two curves isabout 13.34%, which translates to an absolute difference of only 4.4% in terms ofSDC percentages. Figure 7.3 gives an overview of the protection curve of selectiveinstruction duplication obtained using FI and GPU-Trident, for individual kernels.We see that protection curves of individual kernels also show a trend similar toFigure 7.2 as the protection provided by GPU-TRIDENT closely follows protectionprovided by FI for most of the kernels. Thus, GPU-TRIDENT is an efficient andaccurate replacement for FI in guiding selective instruction duplication.7.2 Characterizing SDCs Across Multiple InputsAs stated earlier, FI is traditionally used to analyze the reliability of GPU applica-tions and is time-consuming as GPU applications can consist of millions of threads.This problem is exacerbated in the case of multiple inputs, as the whole FI process45(a) NW K1 (b) NW K2 (c) Pathfinder(d) BFS K1 (e) BFS K2 (f) Gaussian(g) Hotspot (h) Particlefilter (i) LUD K1(j) LUD K2 (k) LUD K3 (l) SRAD K1(m) SRAD K2 (n) Lulesh (o) Circuit(p) HPC CUDA (q) Perf BMFigure 7.3: Protection curves obtained by FI and GPU-TRIDENT for individ-ual kernels, as protection overhead is varied from 0% and 100%. Bluelines indicate FI, and orange lines represent GPU-TRIDENT.46Kernel ID Input 1 Input 2 Input 3BFS K2 32,768 655,360 12,005,376Hotspot 473,344 9,216 1,893,376Particlefilter 9,216 18,432 27,648LUD K1 64 256 512LUD K2 192 3,840 15,872LUD K3 3,584 317,440 2,666,496SRAD K1 32,768 524,288 2,097,152SRAD K2 32,768 524,288 2,097,152NW K1 48 576 2,176NW K2 16 448 1,920HPCCUDA 236,554 1,182,720 3,548,160Perf BM 8,192,000 16,384,000 24,576,000Table 7.1: Number of threads invoked by different kernels, for multiple in-puts.has to be repeated for each input. Further, the fault space of FI can become huge forbigger inputs. In this thesis, we have presented GPU-TRIDENT, which is an effi-cient alternative to FI. This performance and resource efficiency of GPU-TRIDENTcan be used to characterize the reliability of a GPU kernel across multiple inputs.In this section, we use GPU-TRIDENT to analyze the input-dependence ofthe error resilience of GPU kernels. We select a subset of benchmarks from Sec-tion 6.1.3 for this case study, that give accurate results with GPU-TRIDENT (abso-lute error <5% from Section 6.2.1). We also ensure that the selected benchmarkshave standard input available from the source of the benchmark, to avoid any bi-ased inputs (e.g., all zeros), which do not represent the real-world resilience char-acteristics of an application [47]. Table 7.1 shows the selected benchmarks andthe number of threads launched by different inputs. Input 1, is the input used inSection 6.1.3, while Input 2 and Input 3 are new inputs.For this study, we purposefully select new inputs that invoke a relatively largenumber of threads (up to 24.5 million), as the main benefit of GPU-TRIDENT isthat it has very small overhead even for applications with large numbers of threads,in contrast to FI, whose practicality is limited in case of bigger inputs due to highoverheads associated with it. In this way, we can study the reliability of GPU47applications under bigger inputs.Figure 7.4 shows the SDC probability predicted by GPU-TRIDENT for all thebenchmarks. We observe from our experiments that the SDC probability of a ker-nel exhibits little variation across different inputs of different sizes. The averagevariation across inputs of all benchmarks is 0.56%. 
The maximum variation inpredicted SDC probability is 5.66% between Input 1 and Input 3 for NW K2. LUDK1, LUD K3, HPCCUDA and Perf BM show no variation across all the inputs.This result is in line with previous studies of the variation in the reliability of GPUkernels across multiple inputs that use FI [47].Figure 7.4: SDC Probabilities predicted by GPU-TRIDENT with multiple in-puts.We also perform FI experiments to test that accuracy results from the earlierexperiments (Section 6.2.1) hold for newer inputs as well. FI has a high executiontime for kernels that launch a large number of threads. So we only select ker-nels that launch a limited number of threads for this experiment and perform FIexperiments with Input 2 and Input 3, as we already have the results for Input 1from Section 6.2.1. The four chosen kernels are LUD K1, LUD K2, LUD K3 andParticlefilter.We perform 3,000 random FI experiments with the four kernels. We observe48that the average variation of SDC probabilities predicted by GPU-TRIDENT withthat of FI is 1.07%, while the maximum variation is 2.24% (for Particlefilter be-tween Input 1 and Input 2). The error bars for these SDC predictions from FI rangefrom 1.76% to 1.78% at 95% confident interval. This shows that predictions byGPU-TRIDENT are in line with the FI results.The results from this case study thus show that the reliability of GPU kernelsdoes not vary widely across different standard inputs of varying sizes. Therefore,we can use a small set of valid inputs to evaluate the reliability of GPU applicationsinstead of exploring a large input space.49Chapter 8Conclusion and Future Work8.1 SummaryGPUs are commonly used to accelerate scientific and safety-critical applicationstoday. Transient hardware faults are increasing in frequency due to progressivetechnology scaling, which affects the reliability of GPUs.FI techniques have been traditionally used to assess the reliability of GPU ap-plications, but due to the increasing scale of applications, this has become impracti-cal. This thesis analyzes the reliability characteristics of GPU kernels in a scalableand efficient manner without performing FI.To get the reliability profile of the GPU applications, we introduce GPU-TRIDENT, which models soft error propagation in GPU kernels without performingany FI. Due to the typically large number of threads in a GPU kernel, it is chal-lenging to model error propagation accurately and scalably. We observe that errorpropagation in GPU programs can be modeled by carefully analyzing the execu-tion patterns of GPU threads pertaining to data repetition, memory accesses, loopiterations, and control-flow. We find that insufficient details about different levelsof memory dependencies in a GPU kernel can cause a significant detrimental ef-fect on accurately modeling the error propagation. On the other hand, accuratelygetting this information is often the biggest scalability bottleneck in the model.To address these challenges, we propose several heuristics to prune the er-ror propagation space by selecting only a small subset of threads for analysis of50memory dependencies, based on the characteristics mentioned above, which sig-nificantly improves the time required for creating the model. We implementedGPU-TRIDENT as LLVM compiler passes, and evaluated it on 17 GPU Kernels.Our results show that the accuracy of GPU-TRIDENT is comparable to FI both forthe kernel as a whole (Pearson correlation coefficient is 0.88) and for individualinstructions. 
Further, GPU-TRIDENT is two orders of magnitude faster, and muchmore scalable than FI for finding the SDC probabilities of kernels as a whole, aswell as individual instructions.We evaluate the use of GPU-TRIDENT in two applications. First, we foundthat GPU-TRIDENT can be used to guide selective instruction duplication withcomparable accuracy as FI, reducing the SDC probability by approximately 58%and 85% for 1/3rd and 2/3rd overhead, respectively. Second, we deploy GPU-TRIDENT to study the variance of reliability of GPU applications across multipleinputs and find that their SDC probability is generally insensitive to the change ininput data.In conclusion, this thesis presents an accurate and scalable method for model-ing error propagation GPU kernels, which can be integrated into the developmentcycle of real-world applications.8.2 Future WorkThere are three potential areas in which this work can be extended.8.2.1 Extending GPU-TRIDENTThere are certain sources of inaccuracies in GPU-TRIDENT identified in Sec-tion 6.4, these can be analyzed in detail, and their mitigation can be integrated intothe proposed techniques to improve their accuracy. Moreover, the GPU-TRIDENTcan be extended to analyze the behavior of applications during crashes to predictthe crash probabilities of individual instructions. This updated analysis method canthen guide Checkpoint/restart (C/R), which is one of the most popular methods, torecover from faults that cause a crash in programs [8, 54]. This method involvesidentifying the most crash-prone sites in the program (which can be done usingGPU-TRIDENT) and saving the program state to resume execution from there if51the application crashes.8.2.2 End to End Reliability Analysis of GPU ApplicationsGPU-TRIDENT provides an accurate framework for modeling error propagationwithin a GPU kernel. A typical GPU application can consist of two parts: the hostcode that runs on CPU, and device code consisting of all the GPU kernels present inthe application. During the execution of the application, data is transferred betweenthese two parts, which can result in error propagating from device code to hostcode and vice versa. Due to this, the fault activated in one kernel can propagateto other kernels a well. So, to get a complete overview of the reliability of a GPUapplication, error propagation due to all these interactions needs to be modeled.The techniques proposed in this thesis can be combined with previous ones likeTRIDENT [35] to get an end-to-end resilience estimate.8.2.3 Error Propagation Modeling in Accelerator PlatformsThe recent rise of resource-intensive machine learning applications has boostedthe interest of academia and industry in domain-specific accelerators (DSA) formachine learning as they provide enhanced performance with significantly lessenergy budget compared to general-purpose computers. As these accelerators areusually deployed in safety-critical tasks such as autonomous vehicles and businessanalytics, their reliability warrants an investigation. Although some studies haveused fault Injectors for studying the resilience characteristics of accelerators [34,44], these efforts are bound to run into the same scalability issues of FI, that wehave discussed in this thesis. Extending the framework and heuristics proposed inthis thesis to these accelerator platforms is a potential direction of future work.52Bibliography[1] Parallel circuit solver. https://github.com/glaswep/hpc. [Online; accessedApr. 2016]. 
→ pages xi, 18, 19, 33, 34[2] Benchmarks for CPU, GPU, Memory, Disk Performance.https://github.com/sswarnak77/Performance-Benchmarking. [Online;accessed March 2020]. → page 1[3] A program of blurring picture by multi-threads in CUDA.https://github.com/caofengnian/HPCCUDA. [Online; accessed March 2020].→ pages 33, 34[4] Training AI for Self-Driving Vehicles: the Challenge of Scale.https://devblogs.nvidia.com/training-self-driving-vehicles-challenge-scale/.[Online; accessed April 2020]. → page 1[5] CUDA Refresher: Reviewing the Origins of GPU Computing.https://devblogs.nvidia.com/cuda-refresher-reviewing-the-origins-of-gpu-computing/. [Online; accessedApril 2020]. → pages 33, 34[6] Nvidia tesla v100 gpu architecture. https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf. [Accessed May2020]. → page 9[7] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz,N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, and K. Yelick.A view of the parallel computing landscape. Commun. ACM, 52(10):56–67,Oct. 2009. ISSN 0001-0782. doi:10.1145/1562764.1562783. URLhttps://doi.org/10.1145/1562764.1562783. → page 34[8] G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain,T. Herault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri, and53A. Selikhov. Mpich-v: Toward a scalable fault tolerant mpi for volatilenodes. In SC ’02: Proceedings of the 2002 ACM/IEEE Conference onSupercomputing, pages 29–29, 2002. → page 51[9] C. Chang, S. Lym, N. Kelly, M. B. Sullivan, and M. Erez. Evaluating andaccelerating high-fidelity error injection for hpc. In SC18: InternationalConference for High Performance Computing, Networking, Storage andAnalysis, pages 577–589, 2018. → page 8[10] A. Chatzidimitriou, P. Bodmann, G. Papadimitriou, D. Gizopoulos, andP. Rech. Demystifying soft error assessment strategies on arm cpus:Microarchitectural fault injection vs. neutron beam experiments. In 201949th Annual IEEE/IFIP International Conference on Dependable Systemsand Networks (DSN), pages 26–38, 2019. → page 8[11] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, andK. Skadron. Rodinia: A benchmark suite for heterogeneous computing. InInternational Symposium on Workload Characterization (IISWC 2009),pages 44–54. IEEE, 2009. → page 33[12] G. Z. Chrysos and J. S. Emer. Memory dependence prediction using storesets. In Proceedings. 25th Annual International Symposium on ComputerArchitecture (Cat. No.98CB36235), pages 142–153, 1998. → page 23[13] J. J. Cook and C. Zilles. A characterization of instruction-level errorderating and its implications for error detection. In International Conferenceon Dependable Systems and Networks(DSN), pages 482–491. IEEE, 2008.→ pages 9, 30, 41[14] E. W. Czeck and D. P. Siewiorek. Observations on the effects of faultmanifestation as a function of workload. IEEE Transactions on Computers,41(5):559–566, 1992. → page 16[15] D. Di Leo, F. Ayatolahi, B. Sangchoolie, J. Karlsson, and R. Johansson. Onthe impact of hardware faults–an investigation of the relationship betweenworkload inputs and failure mode distributions. Computer Safety,Reliability, and Security, pages 198–209, 2012. → pages 16, 42[16] B. Fang, K. Pattabiraman, M. Ripeanu, and S. Gurumurthi. Gpu-qin: Amethodology for evaluating the error resilience of gpgpu applications. InInternational Symposium on Performance Analysis of Systems and Software(ISPASS), 2014, pages 221–230. IEEE, 2014. → pages 14, 1754[17] B. Fang, Q. Lu, K. Pattabiraman, M. 
Ripeanu, and S. Gurumurthi. ePVF: Anenhanced program vulnerability factor methodology for cross-layerresilience analysis. In Proceedings of the International Conference onDependable Systems and Networks (DSN), pages 168–179. IEEE, 2016. →pages 15, 17[18] S. Feng, S. Gupta, A. Ansari, and S. Mahlke. Shoestring: Probabilistic softerror reliability on the cheap. In Architectural Support for ProgrammingLanguages and Operating Systems, pages 385–396, 2010. → pages2, 9, 15, 16, 17, 38, 43, 44[19] G. Georgakoudis, I. Laguna, H. Vandierendonck, D. S. Nikolopoulos, andM. Schulz. Safire: Scalable and accurate fault injection for parallelmultithreaded applications. In 2019 IEEE International Parallel andDistributed Processing Symposium (IPDPS), pages 890–899, 2019. → page33[20] L. Guo and D. Li. Moard: Modeling application resilience to transient faultson data objects. In 2019 IEEE International Parallel and DistributedProcessing Symposium (IPDPS), pages 878–889, 2019. → page 9[21] L. Guo, D. Li, I. Laguna, and M. Schulz. Fliptracker: Understanding naturalerror resilience in hpc applications. In SC18: International Conference forHigh Performance Computing, Networking, Storage and Analysis, pages94–107, 2018. → page 15[22] S. K. S. Hari, S. V. Adve, H. Naeimi, and P. Ramachandran. Relyzer:Exploiting application-level fault equivalence to analyze applicationresiliency to transient faults. In Architectural Support for ProgrammingLanguages and Operating Systems, pages 123–134, 2012. → pages 2, 9, 16[23] S. K. S. Hari, R. Venkatagiri, S. V. Adve, and H. Naeimi. Ganges: Gangerror simulation for hardware resiliency evaluation. In ACM/IEEE 41stInternational Symposium on Computer Architecture (ISCA), pages 61–72.IEEE, 2014. → pages 2, 9, 16[24] S. K. S. Hari, T. Tsai, M. Stephenson, S. W. Keckler, and J. Emer.Sassifi:evaluating resilience of gpu applications. In IEEE Workshop ofSilicon Errors in Logic (SELSE), 2014. IEEE, 2015. → pages 14, 17[25] S. K. S. Hari, T. Tsai, M. Stephenson, S. W. Keckler, and J. Emer. Sassifi:An architecture-level fault injection tool for gpu application resilience55evaluation. In 2017 IEEE International Symposium on PerformanceAnalysis of Systems and Software (ISPASS), pages 249–258, 2017. → page 8[26] M. Harris. Unified memory for CUDA beginners. NVIDIA Blog, 2016.[Online; accessed 18-Jan-2018]. → page 8[27] C. Kalra, F. Previlon, X. Li, N. Rubin, and D. Kaeli. Prism: Predictingresilience of gpu applications using statistical methods. In InternationalConference for High Performance Computing, Networking, Storage, andAnalysis (SC), 2018. ACM, 2018. → pages 2, 8, 16, 17, 34, 38, 42[28] I. Karlin. Lulesh programming model and performance ports overview.https://computing.llnl.gov/projects/co-design/lulesh ports1.pdf. [AccessedApr. 2016]. → pages 19, 33, 34[29] I. Laguna, M. Schulz, D. F. Richards, J. Calhoun, and L. Olson. Ipas:Intelligent protection against silent output corruption in scientificapplications. In 2016 IEEE/ACM International Symposium on CodeGeneration and Optimization (CGO), pages 227–238, 2016. → pages9, 16, 43, 44[30] C. Lattner and V. Adve. LLVM: A compilation framework for lifelongprogram analysis & transformation. In International Symposium on CodeGeneration and Optimization, page 75. IEEE, 2004. → pages 4, 10, 11[31] K. M. Lepak and M. H. Lipasti. On the value locality of store instructions.In Proceedings of the 27th Annual International Symposium on ComputerArchitecture, ISCA ’00, page 182–191, New York, NY, USA, 2000.Association for Computing Machinery. 
ISBN 1581132328.doi:10.1145/339647.339678. URL https://doi.org/10.1145/339647.339678.→ page 30[32] G. Li and K. Pattabiraman. Modeling input-dependent error propagation inprograms. In Proceedings of the International Conference on DependableSystems and Networks (DSN), 2018. → pages 2, 15, 16[33] G. Li, K. Pattabiraman, C.-Y. Cher, and P. Bose. Understanding errorpropagation in GPGPU applications. In International Conference for HighPerformance Computing, Networking, Storage and Analysis, pages 240–251.IEEE, 2016. → pages 3, 8, 14, 17, 19, 33, 34, 38, 42[34] G. Li, S. K. S. Hari, M. Sullivan, T. Tsai, K. Pattabiraman, J. Emer, and S. W.Keckler. Understanding error propagation in deep learning neural network56(dnn) accelerators and applications. In Proceedings of the InternationalConference for High Performance Computing, Networking, Storage andAnalysis, SC ’17, New York, NY, USA, 2017. Association for ComputingMachinery. ISBN 9781450351140. doi:10.1145/3126908.3126964. URLhttps://doi.org/10.1145/3126908.3126964. → page 52[35] G. Li, K. Pattabiraman, S. K. S. Hari, M. Sullivan, and T. Tsai. Modelingsoft-error propagation in programs. In Proceedings of the InternationalConference on Dependable Systems and Networks (DSN), 2018. → pagesxi, 2, 3, 4, 7, 10, 11, 15, 30, 35, 44, 52[36] A. Mahmoud, S. K. S. Hari, M. B. Sullivan, T. Tsai, and S. W. Keckler.Optimizing software-directed instruction replication for gpu error detection.In Proceedings of the International Conference for High PerformanceComputing, Networking, Storage, and Analysis, SC ’18. IEEE Press, 2018.→ page 2[37] A. Mahmoud, R. Venkatagiri, K. Ahmed, S. Misailovic, D. Marinov, C. W.Fletcher, and S. V. Adve. Minotaur: Adapting software testing techniquesfor hardware errors. In Proceedings of the Twenty-Fourth InternationalConference on Architectural Support for Programming Languages andOperating Systems, ASPLOS ’19, page 1087–1103, New York, NY, USA,2019. Association for Computing Machinery. ISBN 9781450362405.doi:10.1145/3297858.3304050. URLhttps://doi.org/10.1145/3297858.3304050. → page 16[38] G. B. Mathews. On the partition of numbers. Proceedings of the LondonMathematical Society, 1(1):486–490, 1896. → page 44[39] D. Mikushin. Enabling on-the-fly manipulations with llvm ir code of cudasources. https://github.com/apc-llc/nvcc-llvm-ir. [Accessed Nov. 2018]. →page 32[40] Ming Zhang and N. R. Shanbhag. A CMOS design style for logic circuithardening. In 2005 IEEE International Reliability Physics Symposium,2005. Proceedings. 43rd Annual., pages 223–229, April 2005.doi:10.1109/RELPHY.2005.1493088. → page 1[41] B. Nie, L. Yang, A. Jog, and E. Smirni. Fault site pruning for practicalreliability analysis of gpgpu applications. In International Symposium onMicroarchitecture (MICRO), 2018, pages 749–761. IEEE, 2018. → pages8, 16, 17, 23, 34, 4257[42] NVIDIA. Nvcc. https://developer.nvidia.com/cuda-llvm-compiler. → page32[43] N. Oh, P. P. Shirvani, and E. J. McCluskey. Control-flow checking bysoftware signatures. Transactions on Reliability, 51(1):111–122, 2002. →page 9[44] D. Oliveira, V. Frattin, P. Navaux, I. Koren, and P. Rech. Carol-fi: anefficient fault-injection tool for vulnerability evaluation of modern hpcparallel accelerators. pages 295–298, 05 2017.doi:10.1145/3075564.3075598. → page 52[45] L. Palazzi, G. Li, B. Fang, and K. Pattabiraman. A tale of two injectors:End-to-end comparison of ir-level and assembly-level fault injection. pages151–162, 10 2019. doi:10.1109/ISSRE.2019.00024. → pages 9, 33, 42[46] K. Pattabiraman, N. 
Nakka, Z. Kalbarczyk, and R. Iyer. SymPLFIED:Symbolic program-level fault injection and error detection framework. InProceedings of the International Conference on Dependable Systems andNetworks (DSN), pages 472–481. IEEE, 2008. → page 15[47] F. G. Previlon, C. Kalra, D. R. Kaeli, and P. Rech. Evaluating the impact ofexecution parameters on program vulnerability in gpu applications. In 2018Design, Automation & Test in Europe Conference & Exhibition (DATE),pages 809–814. IEEE, 2018. → pages 17, 47, 48[48] V. J. Reddi, M. S. Gupta, M. D. Smith, G.-y. Wei, D. Brooks, andS. Campanoni. Software-assisted hardware reliability: abstractingcircuit-level challenges to the software stack. In Design AutomationConference, pages 788–793. IEEE, 2009. → pages 16, 43[49] B. Sangchoolie, K. Pattabiraman, and J. Karlsson. One bit is (not) enough:An empirical study of the impact of single and multiple bit-flip errors. In2017 47th Annual IEEE/IFIP International Conference on DependableSystems and Networks (DSN), pages 97–108, 2017. → page 8[50] M. Snir, R. W. Wisniewski, J. A. Abraham, S. V. Adve, S. Bagchi, P. Balaji,J. Belak, P. Bose, F. Cappello, B. Carlson, et al. Addressing failures inexascale computing. Institute for Computing in Science (ICiS). More infor,4:11, 2012. → pages 1, 2[51] V. Sridharan and D. R. Kaeli. Eliminating microarchitectural dependencyfrom architectural vulnerability. In 15th International Symposium on HighPerformance Computer Architecture. → pages 2, 15, 1758[52] Student. The probable error of a mean. Biometrika, pages 1–25, 1908. →page 36[53] J. Tan, N. Goswami, T. Li, and X. Fu. Analyzing soft-error vulnerability ongpgpu micro architecture. In IEEE International Symposium on WorkloadCharacterization (IISWC), 2011 IEEE International Symposium on, pages226–235. IEEE, 2011. → pages 15, 34[54] C. Wang, F. Mueller, C. Engelmann, and S. L. Scott. Hybrid checkpointingfor mpi jobs in hpc environments. In 2010 IEEE 16th InternationalConference on Parallel and Distributed Systems, pages 524–533, 2010. →page 51[55] S. G. L. L. R. Z. J. T. Xiaohui Wei, Hengshan Yue. G-seap: Analyzing andcharacterizing soft-error aware approximation in gpgpus. In FutureGeneration Computer Systems (2020), 2020. → pages 8, 15, 34[56] K. S. Yim, C. Pham, M. Saleheen, Z. Kalbarczyk, and R. Iyer.Hauberk:lightweight silent data corruption error detector for gpgpu. InInternational Parallel & Distributed Processing Symposium (IPDPS), 2011,page 287. IEEE, 2011. → pages 14, 1759
