Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Understanding and modeling error propagation in programs Li, Guanpeng 2019

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.

Item Metadata


24-ubc_2019_may_li_guanpeng.pdf [ 3.53MB ]
JSON: 24-1.0376650.json
JSON-LD: 24-1.0376650-ld.json
RDF/XML (Pretty): 24-1.0376650-rdf.xml
RDF/JSON: 24-1.0376650-rdf.json
Turtle: 24-1.0376650-turtle.txt
N-Triples: 24-1.0376650-rdf-ntriples.txt
Original Record: 24-1.0376650-source.json
Full Text

Full Text

Understanding and Modeling Error Propagation inProgramsbyGuanpeng LiBachelor of Applied Science, University of British Columbia, 2014A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFDoctor of PhilosophyinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Electrical and Computer Engineering)The University of British Columbia(Vancouver)February 2019c© Guanpeng Li, 2019The following individuals certify that they have read, and recommend to the Fac-ulty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:Understanding and Modeling Error Propagation in Programssubmitted by Guanpeng Li in partial fulfillment of the requirements for the degreeof Doctor of Philosophy in Electrical and Computer Engineering.Examining Committee:Karthik Pattabiraman, Electrical and Computer EngineeringSupervisorMatei Ripeanu, Electrical and Computer EngineeringSupervisory Committee MemberAli Mesbah, Electrical and Computer EngineeringSupervisory Committee MemberRonald Garcia, Computer ScienceUniversity ExaminerSudip Shekhar, Electrical and Computer EngineeringUniversity ExamineriiAbstractHardware errors are projected to increase in modern computer systems due toshrinking feature sizes and increasing manufacturing variations. The impact ofhardware faults on programs can be catastrophic, and can lead to substantial fi-nancial and societal consequences. Error propagation is often the leading cause ofcatastrophic system failures, and hence must be mitigated. Traditional hardware-only techniques to avoid error propagation are energy hungry, and hence not suit-able for modern computer systems (i.e., commodity systems). Researchers haveproposed selective software-based protection techniques to prevent error propa-gation at lower costs. However, these techniques use expensive fault injectionsimulations to determine which parts of a program must be protected. Fault in-jection simulation artificially introduces a fault to program execution and observefailures (if any) upon the completion of the program execution. Thousands of suchsimulations need to be performed in order to achieve statistical significance. It istime-consuming as even a single program execution of a common application maytake a long time. In this dissertation, I first characterize error propagation in pro-grams that lead to different types of failures, proposed both empirical and analyticalapproaches to identify and mitigate error propagation without expensive fault in-jections. The key observation is that only a small fraction of states are responsiblefor almost all error propagation in programs, and the propagation falls into identifi-able patterns which can be modeled efficiently. The proposed techniques are nearlyas close as fault injection approaches in measuring failure rates of programs, andorders of magnitude faster than fault injections. This allows developers to buildlow-cost fault-tolerant applications in an extremely efficient manner.iiiLay SummaryTransient hardware faults become more and more prevalent nowadays. They oftenlead to error propagation in programs, which may cause serious social and finan-cial impact. Protection techniques such as hardware duplications were used in thepast but they incur huge overheads in performance and energy consumption. Re-searchers have expected modern software to tolerate hardware faults in a low-costand flexible manner. Studies have found there is only a small amount of programstates that are responsible for almost all the propagations. If those states can beidentified and protected, we can improve the reliability of computer systems at lowcost.In this thesis, we characterize error propagation in programs, and propose mod-els to identify the vulnerability of program states using static and dynamic analysistechniques.ivPrefaceThis thesis is the result of work carried out by me, in collaboration with my ad-visor (Dr. Karthik Pattabiraman), my colleague (Qining Lu), and other researchscientists from IBM and NVIDIA. Chapters 4, 5, 6, 7, and 8 are based onwork published in the conferences of DSN and SC. In each work, I was respon-sible for writing the paper, designing approaches (where applicable), conductingexperiments, analyzing data, and evaluating the results. Other collaborators wereresponsible for editing and writing portion of the manuscripts, providing guidanceand feedback.Below are publication details for each chapter:• Chapter 4— Guanpeng Li, Qining Lu and Karthik Pattabiraman, ”Fined-grained Char-acterization of Long Latency Causing Crashes in Programs”, IEEE/IFIP In-ternational Conference on Dependable Systems and Networks (DSN), 2015,250-261. [88]• Chapter 5— Guanpeng Li, Karthik Pattabiraman, Chen-Yong Cher and Pradip Bose,”Understanding Error Propagation in GPGPU Applications”, IEEE Interna-tional Conference for High-Performance Computing, Storage and Network-ing (SC), 2016, 240-251. [90]• Chapter 6— Guanpeng Li, Karthik Pattabiraman, Siva Kumar Sastry Hari, MichaelSullivan and Timothy Tsai, ”Modeling Soft-Error Propagation in Programs”,vIEEE/IFIP International Conference on Dependable Systems and Networks(DSN), 2018, 27-38, [59]• Chapter 7— Guanpeng Li and Karthik Pattabiraman, ”Modeling Input-Dependent Er-ror Propagation in Programs”, IEEE/IFIP International Conference on De-pendable Systems and Networks (DSN), 2018, 279-290. [87]• Chapter 8— Guanpeng Li, Siva Kumar Sastry Hari, Michael Sullivan, Timothy Tsai,Karthik Pattabiraman, Joel Emer, and Stephen W. Keckler, ”UnderstandingError Propagation in Deep-Learning Neural Networks (DNN) Acceleratorsand Applications”, ACM International Conference for High-PerformanceComputing, Storage and Networking (SC), 2017, 8:1-8:12. [91]viTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xivAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.1 Error Propagation in CPU Programs . . . . . . . . . . . . . . . . 82.1.1 Long-Latency Crash (LLC) . . . . . . . . . . . . . . . . 82.1.2 Silent Data Corruption (SDC) . . . . . . . . . . . . . . . 92.2 Error Propagation in Hardware Accelerator Applications . . . . . 122.2.1 GPU Applications . . . . . . . . . . . . . . . . . . . . . 122.2.2 DNN Accelerators and Applications . . . . . . . . . . . . 133 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15vii3.1 Fault Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153.2 Terms and Definitions . . . . . . . . . . . . . . . . . . . . . . . . 163.3 LLVM Compiler . . . . . . . . . . . . . . . . . . . . . . . . . . 173.4 GPU Architecture and Programming Model . . . . . . . . . . . . 183.5 DNN Accelerators and Applications . . . . . . . . . . . . . . . . 183.5.1 Deep Learning Neural Networks . . . . . . . . . . . . . . 183.5.2 DNN Accelerator . . . . . . . . . . . . . . . . . . . . . . 203.5.3 Consequences of Soft Errors . . . . . . . . . . . . . . . . 214 Fine-Grained Characterization of Faults Causing Long Latency Crashesin Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234.2 Why bound the crash latency? . . . . . . . . . . . . . . . . . . . 264.3 Initial Fault Injection Study . . . . . . . . . . . . . . . . . . . . . 274.3.1 Fault Injection Experiment . . . . . . . . . . . . . . . . . 274.3.2 Fault Injection Results . . . . . . . . . . . . . . . . . . . 284.3.3 Code Patterns that Lead to LLCs . . . . . . . . . . . . . . 294.4 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324.4.1 Phase 1: Static Analysis (CRASHFINDER STATIC) . . . . 324.4.2 Phase 2: Dynamic Analysis (CRASHFINDER DYNAMIC) . 344.4.3 Phase 3: Selective Fault Injections . . . . . . . . . . . . . 364.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 374.6 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 374.6.1 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . 384.6.2 Research Questions . . . . . . . . . . . . . . . . . . . . . 384.6.3 Experimental Methodology . . . . . . . . . . . . . . . . 394.7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404.7.1 Performance (RQ1) . . . . . . . . . . . . . . . . . . . . . 404.7.2 Precision (RQ2) . . . . . . . . . . . . . . . . . . . . . . 424.7.3 Recall (RQ3) . . . . . . . . . . . . . . . . . . . . . . . . 434.7.4 Efficacy of Heuristics (RQ4) . . . . . . . . . . . . . . . . 454.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454.8.1 Implication for Selective Protection . . . . . . . . . . . . 46viii4.8.2 Implication for Checkpointing Techniques . . . . . . . . . 464.8.3 Limitations and Improvements . . . . . . . . . . . . . . . 474.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485 Modeling Soft-Error Propagation in Programs . . . . . . . . . . . . 495.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495.2 The Challenge . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525.3 TRIDENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545.3.1 Inputs and Outputs . . . . . . . . . . . . . . . . . . . . . 545.3.2 Overview and Insights . . . . . . . . . . . . . . . . . . . 555.3.3 Details: Static-Instruction Sub-Model ( fs ) . . . . . . . . 575.3.4 Details: Control-Flow Sub-Model ( fc ) . . . . . . . . . . 595.3.5 Details: Memory Sub-Model ( fm ) . . . . . . . . . . . . . 615.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . 645.4.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . 655.4.3 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . 695.5 Use Case: Selective Instruction Duplication . . . . . . . . . . . . 725.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 745.6.1 Sources of Inaccuracy . . . . . . . . . . . . . . . . . . . 745.6.2 Threats to Validity . . . . . . . . . . . . . . . . . . . . . 755.6.3 Comparison with ePVF and PVF . . . . . . . . . . . . . . 765.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 776 Modeling Input-Dependent Error Propagation in Programs . . . . . 786.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 786.2 Volatilities and SDC . . . . . . . . . . . . . . . . . . . . . . . . 816.3 Initial FI Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 826.3.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . 836.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 856.4 Modeling INSTRUCTION-SDC-VOLATILITY . . . . . . . . . . . 886.4.1 Drawbacks of TRIDENT . . . . . . . . . . . . . . . . . 886.4.2 VTRIDENT . . . . . . . . . . . . . . . . . . . . . . . . 89ix6.5 Evaluation of VTRIDENT . . . . . . . . . . . . . . . . . . . . . 936.5.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . 936.5.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . 966.6 Bounding Overall SDC Probabilities with VTRIDENT . . . . . . 976.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1006.7.1 Sources of Inaccuracy . . . . . . . . . . . . . . . . . . . 1006.7.2 Implication for Mitigation Techniques . . . . . . . . . . . 1016.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1027 Understanding Error Propagation in GPGPU Applications . . . . . 1037.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1037.2 GPU Fault Injector . . . . . . . . . . . . . . . . . . . . . . . . . 1067.2.1 Design Overview . . . . . . . . . . . . . . . . . . . . . . 1077.2.2 Error Propagation Analysis (EPA) . . . . . . . . . . . . . 1097.2.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . 1097.3 Metrics for Error Propagation . . . . . . . . . . . . . . . . . . . . 1107.3.1 Execution time . . . . . . . . . . . . . . . . . . . . . . . 1117.3.2 Memory States . . . . . . . . . . . . . . . . . . . . . . . 1127.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 1147.4.1 Benchmarks Used . . . . . . . . . . . . . . . . . . . . . 1147.4.2 Fault Injection Method . . . . . . . . . . . . . . . . . . . 1177.4.3 Hardware and Software Platform . . . . . . . . . . . . . . 1177.4.4 Research questions (RQs) . . . . . . . . . . . . . . . . . 1187.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1187.5.1 Aggregate Fault Injections . . . . . . . . . . . . . . . . . 1187.5.2 Error Propagation . . . . . . . . . . . . . . . . . . . . . . 1197.5.3 Error Spreading . . . . . . . . . . . . . . . . . . . . . . . 1217.5.4 Fault Masking . . . . . . . . . . . . . . . . . . . . . . . 1237.5.5 Crash-causing Faults . . . . . . . . . . . . . . . . . . . . 1247.5.6 Platform Differences . . . . . . . . . . . . . . . . . . . . 1257.6 Implications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1267.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128x8 Understanding Error Propagation in Deep Learning Neural Net-work (DNN) Accelerators and Applications . . . . . . . . . . . . . . 1298.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1298.2 Exploration of Design Space . . . . . . . . . . . . . . . . . . . . 1338.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 1348.3.1 Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 1348.3.2 DNN Accelerators . . . . . . . . . . . . . . . . . . . . . 1368.3.3 Fault Model . . . . . . . . . . . . . . . . . . . . . . . . . 1368.3.4 Fault Injection Simulation . . . . . . . . . . . . . . . . . 1368.3.5 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . 1378.3.6 Silent Data Corruption (SDC) . . . . . . . . . . . . . . . 1388.3.7 FIT Rate Calculation . . . . . . . . . . . . . . . . . . . . 1388.4 Characterization Results . . . . . . . . . . . . . . . . . . . . . . 1398.4.1 Datapath Faults . . . . . . . . . . . . . . . . . . . . . . . 1398.4.2 Buffer Faults: A Case Study on Eyeriss . . . . . . . . . . 1478.5 Mitigation of Error Propagation . . . . . . . . . . . . . . . . . . 1498.5.1 Implications to Resilient DNN Systems . . . . . . . . . . 1498.5.2 Symptom-based Error Detectors (SED) . . . . . . . . . . 1518.5.3 Selective Latch Hardening (SLH) . . . . . . . . . . . . . 1538.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1549 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1569.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1569.2 Expected Impact . . . . . . . . . . . . . . . . . . . . . . . . . . 1579.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162xiList of TablesTable 3.1 Data Reuses in DNN Accelerators . . . . . . . . . . . . . . . 22Table 4.1 Characteristics of Benchmark Programs . . . . . . . . . . . . 38Table 4.2 Comparison of Instructions Given by CRASHFINDER and CRASHFINDERSTATIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41Table 5.1 Characteristics of Benchmarks . . . . . . . . . . . . . . . . . 65Table 5.2 p-values of T-test Experiments in the Prediction of IndividualInstruction SDC Probability Values (p > 0.05 indicates that weare not able to reject our null hypothesis – the counter-cases areshown in bold) . . . . . . . . . . . . . . . . . . . . . . . . . . 68Table 6.1 Characteristics of Benchmarks . . . . . . . . . . . . . . . . . 83Table 7.1 Characteristics of GPGPU Programs in our Study . . . . . . . 116Table 7.2 SDCs that occur in the different memory types . . . . . . . . . 119Table 7.3 Percentage of Benign Faults Measured at the First Kernel Invo-cation after Fault Injection . . . . . . . . . . . . . . . . . . . . 124Table 7.4 Size of OM, RM and TM . . . . . . . . . . . . . . . . . . . . 126Table 8.1 Networks Used . . . . . . . . . . . . . . . . . . . . . . . . . . 134Table 8.2 Data Reuses in DNN Accelerators . . . . . . . . . . . . . . . 135Table 8.3 Data types used . . . . . . . . . . . . . . . . . . . . . . . . . 137Table 8.4 Value range for each layer in different networks in the error-freeexecution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144xiiTable 8.5 Percentage of bit-wise SDC across layers in AlexNet using FLOAT16(Error bar is from 0.2% to 0.63%) . . . . . . . . . . . . . . . . 146Table 8.6 Datapath FIT rate in each data type and network . . . . . . . . 147Table 8.7 Parameters of microarchitectures in Eyeriss (Assuming 16-bitdata width, and a scaling factor of 2 for each technology gener-ation) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148Table 8.8 SDC probability and FIT rate for each buffer component in Ey-eriss (SDC probability / FIT Rate) . . . . . . . . . . . . . . . . 148Table 8.9 Hardened latches used in design space exploration . . . . . . . 153xiiiList of FiguresFigure 1.1 Types of Failures . . . . . . . . . . . . . . . . . . . . . . . . 3Figure 2.1 High-Level Organization of Chapter 2 . . . . . . . . . . . . . 7Figure 3.1 Architecture of general DNN accelerator . . . . . . . . . . . 20Figure 3.2 Example of SDC that could lead to collision in self-drivingcars due to soft errors . . . . . . . . . . . . . . . . . . . . . . 22Figure 4.1 Long Latency Crash and Checkpointing . . . . . . . . . . . . 26Figure 4.2 Aggregate Fault Injection Results across Benchmarks . . . . . 28Figure 4.3 Latency Distribution of Crash-Causing Errors in Programs: Thepurple bars represent the LLCs as they have a crash latencyof more than 1000 instructions. The number shown at thetop of each bar shows the percentage of crashes that resultedin LLCs. The error bars for LLCs range from 0%(cutcp) to1.85%(sjeng). . . . . . . . . . . . . . . . . . . . . . . . . . 29Figure 4.4 Distribution of LLC Categories across 5 Benchmarks (sjeng,libquantum, hmmer, h264ref and mcf). Three dominant cate-gories account for 95% of the LLCs. . . . . . . . . . . . . . 30Figure 4.5 Code examples showing the three kinds of LLCs that occurredin our experiments. . . . . . . . . . . . . . . . . . . . . . . . 31Figure 4.6 Workflow of CRASHFINDER . . . . . . . . . . . . . . . . . 33Figure 4.7 Dynamic sampling heuristic. (a) Example source code (oceanprogram), (b) Execution trace and sample candidates. . . . . . 35xivFigure 4.8 Orders of Magnitude of Time Reduction by CRASHFINDERSTATIC and CRASHFINDER compared to exhaustive fault in-jections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42Figure 4.9 Precision of CRASHFINDER STATIC and CRASHFINDER forfinding LLCs in the program . . . . . . . . . . . . . . . . . . 43Figure 4.10 Recall of CRASHFINDER STATIC and CRASHFINDER . . . . 44Figure 5.1 Development of Fault-Tolerant Applications . . . . . . . . . . 51Figure 5.2 Running Example . . . . . . . . . . . . . . . . . . . . . . . . 53Figure 5.3 NLT and LT Examples of the CFG . . . . . . . . . . . . . . . 59Figure 5.4 Examples for Memory Sub-model . . . . . . . . . . . . . . . 62Figure 5.5 Overall SDC Probabilities Measured by FI and Predicted bythe Three Models (Margin of Error for FI:±0.07% to±1.76%at 95% Confidence) . . . . . . . . . . . . . . . . . . . . . . . 66Figure 5.6 Computation Spent to Predict SDC Probability . . . . . . . . 70Figure 5.7 Time Taken to Derive the SDC Probabilities of Individual In-structions in Each Benchmark . . . . . . . . . . . . . . . . . 71Figure 5.8 SDC Probability Reduction with Selective Instruction Dupli-cation at 11.78% and 23.31% Overhead Bounds (Margin ofError: ±0.07% to ±1.76% at 95% Confidence) . . . . . . . . 72Figure 5.9 Overall SDC Probabilities Measured by FI and Predicted byTRIDENT, ePVF and PVF (Margin of Error: ±0.07% to±1.76%at 95% Confidence) . . . . . . . . . . . . . . . . . . . . . . . 74Figure 6.1 OVERALL-SDC-VOLATILITY Calculated by INSTRUCTION-EXECUTION-VOLATILITY Alone (Y-axis: OVERALL-SDC-VOLATILITY, Error Bar: 0.03% to 0.55% at 95% Confidence) 86Figure 6.2 Patterns Leading to INSTRUCTION-SDC-VOLATILITY . . . 87Figure 6.3 Workflow of VTRIDENT . . . . . . . . . . . . . . . . . . . 90Figure 6.4 Example of Memory Pruning . . . . . . . . . . . . . . . . . . 91Figure 6.5 Memory Dependency Pruning in TRIDENT . . . . . . . . . 92xvFigure 6.6 Distribution of INSTRUCTION-SDC-VOLATILITY predictionsby vTrident Versus Fault Injection Results (Y-axis: Percentageof instructions, Error Bar: 0.03% to 0.55% at 95% Confidence) 94Figure 6.7 Accuracy of VTRIDENT in Predicting INSTRUCTION-SDC-VOLATILITY Versus FI (Y-axis: Accuracy) . . . . . . . . . . 95Figure 6.8 OVERALL-SDC-VOLATILITY Measured by FI and Predictedby VTRIDENT, and INSTRUCTION-EXECUTION-VOLATILITYalone (Y-axis: OVERALL-SDC-VOLATILITY, Error Bar: 0.03%to 0.55% at 95% Confidence) . . . . . . . . . . . . . . . . . 95Figure 6.9 Speedup Achieved by VTRIDENT over TRIDENT. Highernumbers are better. . . . . . . . . . . . . . . . . . . . . . . . 97Figure 6.10 Bounds of the Overall SDC Probabilities of Programs (Y-axis:SDC Probability; X-axis: Program Input; Solid Lines: Boundsderived by VTRIDENT; Dashed Lines: Bounds derived byINSTRUCTION-EXECUTION-VOLATILITY alone, Error Bars:0.03% to 0.55% at the 95% Confidence). Triangles representFI results. . . . . . . . . . . . . . . . . . . . . . . . . . . . 99Figure 7.1 Example of the error propagation code inserted by LLFI-GPU 110Figure 7.2 Code Example of a Kernel Cycle from Benchmark bfs . . . . 112Figure 7.3 Memory State Layout for CUDA Programming Model (K2 andK3 are kernel invocations) . . . . . . . . . . . . . . . . . . . 114Figure 7.4 Aggregate Fault Injection Results across the 12 Programs . . . 119Figure 7.5 Detection Latency of faults that result in SDCRM . . . . . . . 120Figure 7.6 Percentage of RM Updated by Each Kernel Invocation. Y-axisis the percentage of RM locations that are updated during eachkernel invocation. X-axis represents timeline in terms of kernelinvocations. . . . . . . . . . . . . . . . . . . . . . . . . . . . 121Figure 7.7 Percentage of TM and RM Contaminated at Each Kernel Invo-cation. Y-axis is the percentage of contaminated memory lo-cations, X-axis is timeline in terms of kernel invocations. Bluelines indicate TM, and red lines represent RM. . . . . . . . . 121Figure 7.8 Code Structure Leading to Extensive Error Spread in lud . . . 123xviFigure 7.9 Examples of Fault Masking. (a) Comparison, (b) Truncation . 124Figure 8.1 Architecture of general DNN accelerator . . . . . . . . . . . 135Figure 8.2 SDC probability for different data types in different networks(for faults in PE latches). . . . . . . . . . . . . . . . . . . . . 139Figure 8.3 SDC probability variation based on bit position corrupted, bitpositions not shown have zero SDC probability (Y-axis is SDCprobability and X-axis is bit position) . . . . . . . . . . . . . 142Figure 8.4 Values before and after error occurrence in AlexNet using FLOAT16143Figure 8.5 SDC probability per layer using FLOAT16, Y-axis is SDC prob-ability and X-axis is layer position . . . . . . . . . . . . . . . 144Figure 8.6 Euclidean distance between the erroneous values and correctvalues of all ACTS at each layer of networks using DOU-BLE, Y-axis is Euclidean distance and X-axis is layer position(Faults are injected at layer 1) . . . . . . . . . . . . . . . . . 146Figure 8.7 Precision and recall of the symptom-based detectors acrossnetworks (Error bar is from 0.03% to 0.04%) . . . . . . . . . 152Figure 8.8 Selective Latch Hardening for the Eyeriss Accelerator runningAlexNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154xviiAcknowledgmentsFirst of all, I would like to express my sincere gratitude to my doctoral advisor Dr.Karthik Pattabiraman for his support in the past years. If it was not for his continu-ous investment in me, I would not have been able to complete this thesis. Karthik’sadvice on my career has been invaluable and helped me grow as a researcher.Besides my advisor, I would like to thank my thesis committee members andexaminers, who gave me valuable suggestions and helped me improve my thesis. Iwould also like to give my special thanks to Prof. Matei Ripeanu, who has providedme feedback and advice over the past years, and sat through many practice talks Ihave given.In the end, thanks to all my friends and lab mates for making these years asjoyful as possible. I also want to thank my families, who have supported me un-conditionally in my study and career.xviiiChapter 1IntroductionTransient hardware faults (i.e., soft errors) are predicted to increase in future com-puter systems [20, 40]. These faults are caused by high-energy particles passingthrough transistors, causing them to accumulate charge or discharge, eventuallyleading to single bit flips in logic values. As a result, such faults are also knownas Single Event Upset (SEU). Due to the growing system scale, progressive tech-nology scaling, and lowering operating voltages, even a small amount of chargeaccumulated or lost in the circuit can cause a SEU in modern processors [20, 40].Consequently, computer systems have become more and more vulnerable to hard-ware faults. In the past, only highly reliable systems (e.g., aerospace applications)required protections from SEUs, but nowadays even commodity systems (e.g., con-sumer products) need to be protected due to the increasing error rates [21, 95, 120].SEUs in hardware often result in error propagation in programs, which maylead to catastrophic outcomes. For example, Amazon’s S3 server once went downfor a few hours in 2008, causing a financial loss of millions of dollar for the com-pany [10]. As reported by Amazon, it was very likely to be a SEU in their hardware,causing error propagation in the software, and finally leading to the incident [10].Until a few years ago, error propagation was mostly masked through hardware-only solutions such as dual or triple modular redundancy and circuit hardening.However, these techniques are becoming increasingly challenging to deploy asthey consume significant amounts of energy, and as energy is becoming a first-class constraint in processor design, especially in commodity systems [21]. On1the other hand, software-implemented protection techniques are more flexible andcost-effective as they can selectively protect the program states of interest and donot require any expensive hardware modifications [95, 102, 120]. Therefore, re-searchers have postulated that future processors will expose hardware faults to thesoftware and expect the software to tolerate the faults at low costs [95, 102, 120].Studies have shown that only a small fraction of program states are responsi-ble for almost all the error propagations leading to catastrophic failures [52, 124],and so one can selectively protect these states to meet the reliability target of theapplication while incurring lower energy and performance overheads than full du-plication techniques [64, 97]. Therefore, in the development of fault-tolerant ap-plications, it is important to understand what states in the program are vulnerableto what kinds of failures caused by SEUs, and identify those vulnerable states sothat they can be protected with low costs.Fault Injection (FI) has been commonly employed to characterize error prop-agation, and identify which program states are vulnerable to failures. FI involvesperturbing the program state to emulate the effect of a hardware fault and exe-cuting the program to completion to determine if the fault caused a certain fail-ure [64, 147]. However, real-world programs may consist of billions of dynamicinstructions, and even a single execution of the program may take a long time. Per-forming thousands of FIs to get statistically meaningful results for each instructiontakes too much time to be practical [65, 124]. As a result, FI is mostly performedoffline for the characterizations of application resilience. It is too slow to be de-ployed as an online evaluation method when developing fault-tolerant applicationsin a fast software development cycle. In this dissertation, we leverage FI as a ba-sic approach to characterize error propagation in programs, and use the results tounderstand how different errors propagate in the program and why. Based on theempirical observations, we propose models and heuristics that analyze error prop-agation and guide the protection of programs with few to no FIs.As Figure 1.1 shows, once a fault occurs and is read by the program that is ex-ecuting, the fault is activated and becomes an error. Consequently, the error startspropagating in the program, leading to one of the three outcomes: (1) Benign: Theerror can be masked during the propagation in the program, hence the program fin-ishes its execution without any anomaly. (2) Crash: The error triggers a hardware2Figure 1.1: Types of Failuresexception (i.e., illegal memory access) and causes the program to terminate beforeit finishes the execution. (3) Silent Data Corruption (SDC): The error propagatesto the program output, modifying it from the correct output. Among the threeoutcomes, benign, as its name suggests, is not considered to be harmful as the pro-gram successfully completes the execution without any anomaly. Crash and SDCare both regarded as failures, depending on the reliability target and applications.Therefore, they are the failure types we will focus on in this dissertation.Crashes are usually considered detectable failures because of the early termi-nation of the program execution and the warnings issued by hardware exceptions.The user can then simply restart the program for a correct execution. As a result,researchers do not really pay much attention to crashes. However, their underly-ing assumption is that failures are reported as soon as faults occur, which is alsoknown as the fail-stop assumption [150]. While this assumption holds most of thetime, we find that there is a small amount of errors that can actually propagate for along time before causing crashes. Hence, such errors violate the fail-stop assump-tion. We call them long latency crashes, or LLCs. Consequently, the error maypropagate to the program’s state elements such as checkpoints and corrupt them3before causing any crash, leading to failures in recovery. Studies show that LLCsmay drastically reduce the availability of systems, and hence, must be identifiedand mitigated in systems that require high availability [150]. The main challengeof identifying LLCs is that LLC are typically confined to a small fraction of theprogram’s data, and finding the LLC-causing data through FI often causes searchspace explosion. The key observation in this study is that we find that most LLCsare caused by faults occurring in a few code structures that can be identified bystatic and dynamic program analysis. Based on this observation, we proposedheuristics to identify those code structures in order to protect the program fromLLCs.SDCs, on the other hand, are considered as the most insidious type of failurebecause there are no indications of the incorrect program outputs. Studies showthat the program SDC probabilities are essentially application-dependent, and varyfrom less than 1% to more than 50% depending on the program [97, 147]. There-fore, given different reliability targets and applications, SDCs must be evaluatedand mitigated on a per-program and per-target basis. As a result, it is critical toaccurately estimate the SDC probability of a program as well as its individualinstructions for evaluation and mitigation. Unfortunately, current approaches foridentifying SDCs such as FIs are too slow to integrate into fast software develop-ment cycles. Other existing approaches for resilience estimation such as analyticalmodels, suffer from serious inaccuracies because error propagation is complicatedand has not been comprehensively understood [51, 100]. The dynamic nature ofprogram execution makes building an accurate model extremely challenging. Inthis dissertation, we find that almost all the error propagations that lead to SDCscan be abstracted into three levels. We use this insight to build TRIDENT, whichis a compiler-based model that is able to predict the SDC probabilities of bothprogram and individual instructions without any FIs.While the first part of this dissertation aims to investigate error propagation ingeneral applications (e.g., CPU programs), the second part discusses the error re-silience of applications that run on hardware accelerators. Hardware acceleratorssuch as graphic processing units (GPUs) and deep learning neural network (DNN)accelerators have found wide adoptions nowadays due to their massive parallel ca-pability and efficient data-flow. Recently, they have been deployed for running4reliability- and safety-critical applications (e.g., scientific applications and self-driving cars etc), which require strict correctness. Therefore, it is important tounderstand the resilience of applications running on these accelerators. Unfortu-nately, most studies focus on the performance aspects of the accelerators, and hencetheir resilience is not well investigated. While we explored the error propagationin general applications, we are interested in error propagation in the applicationsrunning on those accelerators, which have a very different structure from generalapplications. One of the challenges in such applications is the lack of capabletools to perform fault injection experiments and analysis. Therefore, we first builttools that are able to perform fault injections for the applications, and used them tocharacterize their error propagations. We find that the characteristics of their errorpropagation are very different from what we have seen in CPU programs. This isbecause of the different architectures and program models used in accelerators. Fi-nally, we quantify the resilience of the applications and propose efficient protectiontechniques based on their unique error propagation characteristics.1.1 ContributionsOur research contributions are as follows:• Characterized LLCs in programs, and identified the code patterns that lead toLLCs. Based on the characterization, we developed an automated compiler-based technique, CRASHFINDER, to identify the code patterns in programs.We showed CrashFinder is able to identify more than 90% of the locationsthat cause LLCs without extensive FIs, and by protecting those identifiedlocations, most of the LLCs can be eliminated in programs.• Built an automated model, TRIDENT, that analyzes the SDC probabilitiesof both whole programs and individual instructions in the program. Givena program, the model requires no fault injections to estimate the SDC prob-abilities. We showed that most of the error propagations that lead to SDCscan be described by a combination of three probabilistic modules, whichsynergistically build on top of each other to model error propagation. Weevaluated TRIDENT experimentally, and found that it is able to predict the5SDC probabilities almost as accurately as FI, but takes only a fraction ofthe time that FI does, thereby making it possible to integrate it into the soft-ware development cycle. We also extended the model to support multipleprogram inputs. We showed that the extended model is able to bound SDCprobabilities of programs given multiple program inputs.• Investigated the characteristics of error propagation in the applications run-ning on GPUs and DNN accelerators. We first built FI tools and simulatorsfor the applications, and evaluated their error resilience. Based on the ob-servations, we proposed mitigation techniques that are cost-effective for theapplications.The investigation of error propagation in this dissertation allows developers todesign more cost-effective error detectors and protection techniques based on thecharacteristics of different errors, programming models, and hardware platforms.The studies also explain why certain vulnerabilities are observed in certain programstructures. Our proposed analytical models allow developers to estimate programs’resilience and selectively protect programs in a fast and accurate way, which rep-resents a fundamental advance in the way fault-tolerant applications are designed.Moreover, the analytical nature of our proposed models explains the relations be-tween program and resilience, providing insights to researchers for designing betterfault-tolerant applications in the future.6Chapter 2Related WorkThere have been many papers that investigate error propagation in programs andtolerate hardware faults through software techniques. We organize this chapterbased on the structure shown in Figure 2.1. First, we discuss the related workthat studied error propagation in CPU programs. More specifically, we are inter-ested in the ones that investigate LLCs (Chapter 2.1.1), and SDCs (Chapter 2.1.2).Secondly, we talk about the studies that characterized the resilience of GPU appli-cations (Chapter 2.2.1), and DNN accelerator applications (Chapter 2.2.2).Error PropagationCPU Programs Accelerator ApplicationsLong-Latency Crashes (LLCs)Silent Data Corruptions (SDCs)Resilience of GPU Applications Resilience of DNN ApplicationsFigure 2.1: High-Level Organization of Chapter 272.1 Error Propagation in CPU Programs2.1.1 Long-Latency Crash (LLC)In Chapter 4, we develop a technique to automatically find program locationswhere LLC causing faults originate so that the locations can be protected to boundthe program’s crash latency. In the past, there have been a few studies that charac-terized the latency of error propagation in programs. We consider the related workbelow.The first one by Gu et. al. [9] injected faults into the Linux kernel and foundthat less than 1% of the errors resulted in LLCs. They further found that many ofthe severe failures that result in extended downtimes in the system were caused bythese LLCs, due to error propagation. The authors give examples of faults that re-sulted in the LLCs, but they do not attempt to categorize the code patterns that wereresponsible for the LLCs. The second study is by Yim et al. [150], who studiedthe correlation between LLC-causing errors and the fault location in the program.However, they perform a coarse-grained categorization of the fault locations basedon where the data resides (e.g., stack, heap etc.). Such a coarse-grained categoriza-tion is unfortunately not very useful when one wants to protect specific variables orprogram locations, as protecting the entire stack/heap segment is too expensive. Al-though they provide some insights on the characteristics of possible LLC-causingerrors, they do not develop an automated way to predict which faults would lead toan LLC and which would not. It is also worth noting that neither of the above pa-pers achieves exhaustive characterization of the LLC-causing faults. Rashid et. al.[20] have built an analytical trace-based model to predict the propagation of inter-mittent hardware errors in a program. The model can be used to predict the latencyof crash causing faults in the program, and hence find the LLC locations. Theyfind that less than 0.5% of faults cause LLCs using the model. While useful, theirmodel requires building the program’s Dynamic Dependence Graph (DDG), whichcan be prohibitively expensive for large programs as it is directly proportional tothe number of instructions executed by it. Further, they make many simplifyingassumptions in their model which may not hold in the real world. Similarly, Lan-zaro et. al. [84] built an automated tool which is able to analyze arbitrary memory8corruptions based on execution trace when faults present in system. While theirtechnique is useful in terms of analyzing error propagation, it incurs prohibitiveoverheads as it requires the entire trace to be captured at runtime. Further, they fo-cus on software faults as opposed to hardware faults. Finally, they do not make anyattempt to identify LLC-causing faults. Chandra et. al. [29] study program errorsthat violate the fail-stop model and result in corrupting the data written to perma-nent storage, or communicated to other processes. They find that between 2% to7% of faults cause such violations, and propose using a transaction-based mecha-nism to prevent the propagation of these faults. While transaction-based techniquesare useful, they require significant software engineering effort, as the applicationneeds to be rewritten to use transactions. This is very difficult for most commoditysystems. In contrast to the above techniques, our technique identifies specific pro-gram locations that result in LLCs, and can hence support fine-grained protection.Further, it uses predominantly static analysis coupled with dynamic analysis anda selective fault injection experiment, making it highly scalable and efficient com-pared to the above approaches. Finally, our technique, CRASHFINDER, does notrequire any programmer intervention or application rewriting and is hence practicalto deploy on existing software.2.1.2 Silent Data Corruption (SDC)In Chapter 5, we construct a three-level model, TRIDENT, that captures errorpropagation at the static data dependency, control-flow and memory levels, basedon the observations of error propagations in programs. TRIDENT is implementedas part of a compiler, and can predict both the overall SDC probability of a givenprogram, and the SDC probabilities of individual instructions of programs, withoutfault injections.There is a significant body of work on identifying error propagation that leadsto SDC either through FI [42, 58, 65, 66, 90, 147], or through modeling error prop-agation in programs [51, 52, 130]. The main advantage of FI is that it is simple,but it has limited predictive power. Further, its long running time often limits theFI approach from deriving program vulnerabilities at finer granularity (e.g., SDCprobabilities of individual instructions). The main advantage of modeling tech-9niques is that they have predictive power and are significantly faster, but existingtechniques suffer from poor accuracy due to important gaps in the models. Themain question we answer in the TRIDENT project is that whether we can combinethe advantages of the two approaches by constructing a model that is both accu-rate and scalable. Shoestring [52] was one of the first papers to attempt to modelthe resilience of instructions without using fault injection. Shoestring stops tracingerror propagations after control-flow divergence, and assumes that any fault thatpropagates to a store instruction leads to an SDC. Hence, it is similar to removingfc and fm in TRIDENT and considering only fs , which we show is not very ac-curate. Further, Shoestring does not quantify SDC probabilities of programs andinstructions, unlike TRIDENT. Gupta et al. [61] investigate the resilience charac-teristics of different failures in large-scale systems. However, they do not proposeautomated techniques to predict failure rates. Lu et al. [97], Li et al. [88] identifyvulnerable instructions by characterizing different features of instructions in pro-grams. While they develop efficient heuristics in finding vulnerable instructionsin programs, their techniques do not quantify error propagation, and hence can-not accurately pinpoint SDC probabilities of individual instructions. Sridharan etal. [100] introduce PVF, an analytical model which eliminates microarchitecturaldependency from architectural vulnerability to approximate SDC probabilities ofprograms. While the model requires no FIs and is hence fast, it has poor accuracyin determining SDC probabilities as it does not distinguish between crashes andSDCs. Fang et al. [51] introduce ePVF, which derives tighter bounds on SDC prob-abilities than PVF, by omitting crash-causing faults from the prediction of SDCs.However, both techniques focus on modeling the static data dependency of instruc-tions, and do not consider error propagation beyond control-flow divergence, whichleads to large gaps in the predictions of SDCs (as we showed in Chapter 5).Furthermore, we study the variation of SDC probabilities across different in-puts of a program, and identify the reasons for the variations in Chapter 6 . Basedon the observations, we propose a model, VTRIDENT, which predicts the varia-tions in programs’ SDC probabilities without any FIs, for a given set of inputs.There has been little work investigating error propagation behaviours acrossdifferent inputs of a program. Czek et al. [43] were among the first to modelthe variability of failure rates across program inputs. They decompose program10executions into smaller unit blocks (i.e., instruction mixes), and use the volatilityof their dynamic footprints to predict the variation of failure rates, treating theerror propagation probabilities as constants in their unit blocks across differentinputs. Their assumption is that similar executions (of the unit blocks) result insimilar error propagations, so the propagation probabilities within the unit blocksdo not change across inputs. Thus, their model is equivalent to considering just theexecution volatility of the program (Section 6.2), which is not very accurate as weshow in Section 6.3.Folkesson et al. [54] investigate the variability of the failure rates of a singleprogram (Quicksort) with its different inputs. They decompose the variability intothe execution profile, and its data usage profile. The latter requires the identifi-cation of critical data and its usage within the program - it is not clear how thisis done. They consider limited propagation of errors across basic blocks, but notwithin a single block. This results in their model significantly underpredicting thevariation of error propagation. Finally, it is difficult to generalize their results asthey consider only one (small) program.Di Leo et al. [46] investigate the distribution of failure types under hardwarefaults when the program is executed with different inputs. However, their studyfocuses on the measurement of the volatility in SDC probabilities, rather than onpredicting it. They also attempt to cluster the variations and correlate the clusterswith the program’s execution profile. However, they do not propose a model topredict the variations, nor do they consider sources of variation beyond the execu-tion profile - again, this is similar to using only the execution volatility to explainthe variation of SDC probabilities. Tao et al. [139] propose efficient detectionand recovery mechanisms for iterative methods across different inputs. Mahmoudet al. [98] leverage software testing techniques to explore input dependence forapproximate computing. However, neither of them focus on hardware faults ingeneric programs. Gupta et al. [61] measure the failure rate in large-scale systemswith multiple program inputs during a long period, but they do not propose tech-niques to bound the failure rates. In contrast, our work investigates the root causesbehind the SDC volatility under hardware faults, and proposes a model to bound itin an accurate and scalable fashion.Other papers that investigate error propagation confine their studies to a single11input of each program. For example, Hari et al. [65, 66] group similar execu-tions and choose the representative ones for FI to predict SDC probabilities givena single input of each program. Li et al. [88] find patterns of executions to prunethe FI space when computing the probability of long-latency propagating crashes.Lu et al. [97] characterize error resilience of different code patterns in applica-tions, and provide configurable protection based on the evaluation of instructionSDC probabilities. Feng et al. [52] propose a modeling technique to identify likelySDC-causing instructions. Our prior work. TRIDENT [59], which VTRIDENTis based on, also restricts itself to single inputs. These papers all investigate pro-gram error resilience characteristics based on static and dynamic analysis, withoutlarge-scale FI. However, their characterizations are based on the observations de-rived from a single input of each program, and hence their results may be inaccuratefor other inputs.2.2 Error Propagation in Hardware AcceleratorApplications2.2.1 GPU ApplicationsIn Chapter 7, we perform an empirical study to understand and characterize errorpropagation in GPU applications. We build a compiler-based fault-injection toolfor GPU applications to track error propagation, and define metrics to characterizepropagation in GPU applications. We consider the following related work in thearea of GPU fault injection and fault-tolerance techniques.Yim et al. [151] built one of the first fault injectors for GPGPU applications.However, their injector operates at the source code level, and only considers faultsthat are visible at the source level. Fang et. al. [50] designed a GPGPU faultinjector, GPU-Qin, that operates on the CUDA assembly code (SASS). They usethe CUDA debugger (CUDA-gdb) to inject faults, which takes significantly longerthan performing fault injections by code instrumentation (Section III). In follow upwork, Hari et. al. [63] built SASSIFI, a GPGPU fault injector that transforms theSASS code of the program to inject faults similar to LLFI-GPU. Both GPU-Qinand SASSIFI operate on the SASS assembly code level, which makes it difficult12to map back the faults to the source code. In contrast, LLFI-GPU operates at theLLVM IR level, which is much closer to the program’s source code. This makes itpossible to perform program analysis on the program and map the fault injectionresults back to the source code, thereby helping programmers make their code errorresilient.There have been a few papers on building fault tolerance techniques on GPGPUplatforms. Jeon et. al. [76] duplicated kernel threads that have the same input todetect errors. Dimitrov et. al. [48] leverage both instruction level and thread-levelparallelism to duplicate application code. Tan et. al. [137] proposed an analyticalmethod to evaluate the error-resilience of GPU platforms. Pena et. at. [110] ex-plored low-cost data protection and recovery mechanisms for GPGPU platformsbased on API interception. Finally, Yim et al. [151] proposed a technique to detecterrors by duplicating code at within the loop bodies of GPGPU programs. Whilethese are useful, none of the above papers study error propagation in GPGPU pro-grams, which is our focus.2.2.2 DNN Accelerators and ApplicationsIn Chapter 8, we experimentally evaluate the resilience characteristics of DNNsystems (i.e., DNN software running on specialized accelerators). We find that theerror resilience of a DNN system depends on the data types, values, data reuses, andtypes of layers in the design. Based on our observations, we propose two efficientprotection techniques for DNN systems. We consider the following related work.There were a few studies years ago on the fault tolerance of neural networks [9,17, 111]. While these studies performed fault injection experiments on neural net-works and analyzed the results, the networks they considered had very differenttopologies and many fewer operations than today’s DNNs. Further, these studieswere neither in the context of the safety critical platforms such as self-driving cars,nor did they consider DNN accelerators for executing the applications. Hence,they do not provide much insight about error propagation in modern DNN sys-tems. Reis et al. [115] and Oh et al. [106] proposed software techniques that du-plicate all instructions to detect soft errors. Due to the high overheads of theseapproaches, researchers have investigated selectively targeting the errors that are13important to an application. For example, Hari et al. analyzed popular benchmarksand designed application-specific error detectors. Feng et al. [52] proposed a staticanalysis method to identify vulnerable parts of the applications for protection. Luet al. [97] identified SDC-prone instructions to protect using a decision tree classi-fier, while Laguna et al. [32] used a machine learning approach to identify sensitiveinstructions in scientific applications. While these techniques are useful to mitigatesoft errors in general purpose applications, they cannot be easily applied to DNNsystems which implement a different instruction set and operate on a specializedmicroarchitecture. Many studies [34, 35, 62] proposed novel microarchitecturesfor accelerating DNN applications. These studies only consider the performanceand energy overheads of their designs and do not consider reliability. In recentwork, Fernades et. al. [53] evaluated the resilience of histogram of oriented gra-dient applications for self-driving cars, but they did not consider DNNs. [113] and Chatterjee [32] explored energy-reliability limits of DNN sys-tems, but they focused on different networks and fault models. The closest relatedwork is by Temam et. al. [140], which investigated the error resilience of neuralnetwork accelerators at the transistor-level. Our work differs in three aspects: (1)Their work assumed a simple neural network prototype, whereas we investigatemodern DNN topologies. (2) Their work does not include any sensitivity studyand is limited to the hardware datapath of a single neural network accelerator. Ourwork explores the resilience sensitivity to different hardware and software param-eters. (3) Their work focused on permanent faults, rather than soft errors. Softerrors are separately regulated by ISO 26262, and hence we focus on them.14Chapter 3BackgroundIn this chapter, we first introduce our fault model for studying error propagation inCPU programs. The fault model applies to Chapter 4, 5, 6 and 7. For Chapter 8,we present the fault model in the chapter. We then define the terms we have inthis thesis and introduce LLVM compiler that we use to study error propagationin both CPU and GPU programs. Finally, we summarize GPU architecture andprogramming model before discussing DNN accelerators and applications.3.1 Fault ModelWe consider transient hardware faults that occur in the computational elements ofthe processor, including pipeline registers and functional units. We do not con-sider faults in the memory or caches, as we assume that these are protected witherror correction code (ECC). Likewise, we do not consider faults in the processor’scontrol logic as we assume that it is protected. Neither do we consider faults inthe instructions’ encodings. Finally, we assume that the program does not jumpto arbitrary illegal addresses due to faults during the execution, as this can be de-tected by control-flow checking techniques [105]. However, the program may takea faulty legal branch (the execution path is legal but the branch direction can bewrong due to faults propagating to it). Our fault model is in line with other workin the area [42, 52, 65, 97].153.2 Terms and DefinitionsWe use the following terms in this thesis.• Fault occurrence: The event corresponding to the occurrence of the hard-ware fault. The fault may or may not result in an error.• Fault activation: The event corresponding to the manifestation of the faultto the software, i.e., the fault becomes an error and corrupts some portion ofthe software state (e.g., register, memory location). The error may or maynot result in a crash.• Crash: The raising of a hardware trap or exception due to the error, becausethe program attempted to perform an action it should not have (e.g., readoutside its memory segments).• Crash latency: The number of dynamic instructions executed by the pro-gram from fault activation to crash. This definition is slightly different fromprior work which has used CPU cycles to measure the crash latency. Themain reason we use dynamic instructions rather than CPU cycles is thatwe wish to obtain a platform independent characterization of long latencycrashes.• Long latency crashes (LLCs): Crashes that have crash latency of greaterthan 1,000 dynamic instructions. Prior work has used a wide range of valuesfor long latency crashes, ranging from 10,000 CPU cycles [112] to as manyas 10 million CPU cycles [150]. We use 1,000 instructions as our thresholdas (1) each instruction corresponds to multiple CPU cycles in our system,and (2) we found that in our benchmarks, the length of the static data depen-dency sequences are far smaller, and hence setting 1,000 instructions as thethreshold already filters out 99% of the crash-causing faults (Section 4.3),showing that 1000 instructions is a reasonable threshold.• Silent Data Corruption (SDC): A mismatch between the output of a faultyprogram run and that of an error-free execution of the program.16• Benign Faults: Program output matches that of the error-free execution eventhough a fault occurred during its execution. This means either the fault wasmasked or overwritten by the program.• Error propagation: Error propagation means that the fault was activated,and has affected some other portion of the program state, say ’X’. In thiscase, we say the fault has propagated to state X. We focus on the faultsthat affect the program state and therefore consider error propagation at theapplication level.• SDC Probability: We define the SDC probability as the probability of anSDC given that the fault was activated – other work uses a similar defini-tion [52, 66, 88, 147].3.3 LLVM CompilerThere are many FI frameworks in the literature [7, 26, 78, 132], which focus ondifferent platforms, components, and faults. We use the LLVM compiler [85] toperform our program analysis and FI experiments and to implement our model.Our choice of LLVM is motivated by three reasons. First, LLVM uses a typed in-termediate representation (IR) that can easily represent source-level constructs. Inparticular, it preserves the names of variables and functions, which makes sourcemapping feasible. This allows us to perform a fine-grained analysis of which pro-gram locations cause certain failures and map them to the source code. Second,LLVM IR is a platform-neutral representation that abstracts out many low-leveldetails of the hardware and assembly language. This greatly aids in portabilityof our analysis to different architectures and simplifies the handling of the specialcases in different assembly language formats. Finally, LLVM IR has been shownto be accurate for doing FI studies [147], and there are many fault injectors devel-oped for LLVM [12, 90, 121, 147]. Many of the papers we compare our techniquewith in this chapter also use LLVM infrastructure [51, 52]. Therefore, when wesay instruction in CPU or GPU programs, we mean an instruction at the LLVM IRlevel.173.4 GPU Architecture and Programming ModelWe focus on CUDA (Compute Unified Device Architecture) GPU programs inthis thesis as CUDA the most popular programming model used by GPU develop-ers [2]. CUDA is a parallel computing platform and application programming inter-face model created by Nvidia [2]. CUDA kernels, which are the parts of the codethat run on GPU hardware, adopt the single instruction multiple threads (SIMT)model to exploit the massive parallelism of GPU devices. From a software perspec-tive, the CUDA programming model abstracts the SIMT model into a hierarchy ofkernels, blocks and threads. The CUDA kernels consist of blocks, which consistof threads. This hierarchy allows various levels of parallelism such as fine-graineddata parallelism, coarse-grained data parallelism and task parallelism. From a hard-ware perspective, blocks of threads run on streaming mutiprocessors (SMs) whichhave on-chip shared memory for threads inside the same block. Within a block,threads are launched in fixed groups of 32 threads, which are called warps. Threadsin a warp execute the same sequence of instructions but with different data values.GPU has its own memory space that is distinct from and not synchronized withthe host CPU’s memory. In the CUDA programming model, there are four kinds ofmemory: (1) global, (2) constant, (3) texture, and (4) shared. Global, constant, andtexture memory accesses are generally slower from large device memory. Sharedmemory space, which can be software managed, is much smaller and built on chip,hence it is much faster to access. CUDA applications need to be aware of thememory hierarchy to access GPU memory efficiently.3.5 DNN Accelerators and Applications3.5.1 Deep Learning Neural NetworksA deep learning neural network is a directed acyclic graph consisting of multiplecomputation layers [86]. A higher level abstraction of the input data or a featuremap (fmap) is extracted to preserve the information that are unique and importantin each layer. There is a very deep hierarchy of layers in modern DNNs, and hencetheir name.We consider convolutional neural networks of DNNs, as they are used in a18broad range of DNN applications and deployed in self-driving cars, which are ourfocus. In such DNNs, the primary computation occurs in the convolutional layers(CONV) that perform multi-dimensional convolution calculations. The number ofconvolutional layers can range from three to a few tens of layers [67, 81]. Eachconvolutional layer applies a kernel (or filter) on the input fmaps (ifmaps) to ex-tract underlying visual characteristics and generate the corresponding output fmaps(ofmaps).Each computation result is saved in an activation (ACT) after being processedby an activation function (e.g., ReLU), which is in turn, the input of the next layer.An activation is also known as a neuron or a synapse - in this work, we use ACT torepresent an activation. ACTS are connected based on the topology of the network.For example, if two ACTS, A and B, are both connected to an ACT C in the nextlayer, then ACT C is calculated using ACT A and B as the inputs. In convolutionallayers, ACTS in each fmap are usually fully connected to each other, whereas theconnection of ACTS between each fmap in other layers are usually sparse. Insome DNNs, a small number (usually less than 3) of fully-connected layers (FC)are typically stacked behind the convolutional layers for classification purposes.In between the convolutional and fully-connected layers, additional layers can beadded, such as the pooling (POOL) and normalization (NORM) layers. POOLselects the ACT of the local maximum in an area to be forwarded to the next layerand discards the rest in the area, so the size of fmaps will become smaller aftereach POOL. NORM averages ACT values based on the surrounding ACTS. ThusACT values will also be modified after the NORM layer.Once a DNN topology is constructed, the network can be fed with training in-put data, and the associated weights, abstracted as connections between ACTS,will be learned through a back-propagation process. This is referred to as thetraining phase of the network. The training is usually done once, as it is verytime-consuming, and then the DNN is ready for image classification with testinginput data. This is referred to as the inferencing phase of the network and is carriedout many times for each input data set. The input of the inferencing phase is oftena digitized image, and the output is a list of output candidates of possible matchessuch as car, pedestrian, animal, each with a confidence score. Self-driving carsdeploy DNN applications for inferencing, and hence, we focus on the inferencing19phase of DNNs.3.5.2 DNN AcceleratorMany specialized accelerators [34, 35, 62] have been proposed for DNN inferenc-ing, each with different features to cater to DNN algorithms. However, there aretwo properties common to all DNN algorithms that are used in the design of allDNN accelerators: (1) MAC operations in each feature map have very sparse de-pendencies, which can be computed in parallel, and (2) there are strong temporaland spatial localities in data within and across each feature map, which allow thedata to be strategically cached and reused. To leverage the first property, DNNaccelerators adopt spatial architectures [143], which consist of massively parallelprocessing engines (PEs), each of which computes MACs. Figure 3.1 shows thearchitecture of a general DNN accelerator. A DNN accelerator consists of a globalbuffer and an array of PEs. The accelerator is connected to DRAM where data istransferred from. A CPU is usually used to off-load tasks to the accelerator. Theoverall architecture is shown in Figure 3.1A. The ALU of each PE consists of amultiplier and an adder as execution units to perform MACs — this is where themajority of computations happen in DNNs. A general structure of the ALU in eachPE is shown in Figure 3.1B.Figure 3.1: Architecture of general DNN acceleratorTo leverage the second property of DNN algorithms, special buffers are addedon each PE as local scratchpad to cache data for reuse. Each DNN accelerator mayimplement its own dataflow to explore data localities. We classify the localities in20DNNs into three major categories:• Weight Reuse: Weight data of each kernel can be reused in each fmap asthe convolutions involving the kernal data are used many times on the sameifmap.• Image Reuse: Image data of each fmap can be reused in all convolutionswhere the ifmap is involved, because different kernels operate on the samesets of ifmap in each layer.• Output Reuse: Computation results of MACs can be buffered and con-sumed on-PE without transferring off the PEs.Table 3.1 illustrates nine different DNN accelerators that have been proposedin prior work and the corresponding data localities they exploit in their dataflows.As can be seen, each accelerator exploits one of more localities in its dataflow.Eyeriss [35] considers all of the three data localities in its dataflow.We separate faults that originate in the datapath (i.e., latches in execution units)from those that originate in buffers (both on- and off-PEs), because they propagatedifferently: faults in the datapath will be only read once, whereas faults in buffersmay be read multiple times due to reuse and hence the same fault can be spread tomultiple locations within short time windows.3.5.3 Consequences of Soft ErrorsThe consequences of soft errors that occur in DNN systems can be catastrophicas many of them are safety-critical, and error mitigation is required to meet cer-tain reliability targets. For example, in self-driving cars, a soft error can lead tomisclassification of objects, resulting in a wrong action taken by the car. In ourfault injection experiments, we found many cases where a truck can be misclas-sified under a soft error. We illustrate this in Figure 3.2. The DNN in the carshould classify the coming object as a transporting truck in a fault-free executionand apply the brakes in time to avoid a collision (Figure 3.2A). However, due toa soft error in the DNN, the truck is misclassified as a bird (Figure 3.2B), and thebraking action may not be applied in time to avoid the collision, especially when21Table 3.1: Data Reuses in DNN AcceleratorsWeightReuseImageReuseOutputReuseZhang et al. [153], Diannao [34],Dadiannao [36]N N NChakradhar et al. [28], Sri-ram et al. [131], Sankaradas etal. [122], nn-X [57], K-Brain [107],Origami [27]Y N NGupta et al. [60], Shidiannao [49],Peemen et al. [109]N N YEyeriss [35] Y Y Ythe car is operating at high speed. This is an important concern as it often resultsin the violation of standards such as ISO 26262 dealing with the functional safetyof road vehicles [119], which requires the System on Chip (SoC) carrying DNNinferencing hardware in self-driving cars to have a soft error FIT rate less than 10FIT [13], regardless of the underlying DNN algorithm and accuracy. Since a DNNaccelerator is only a fraction of the total area of the SoC, the FIT allowance of aDNN accelerator should be much lower than 10 in self-driving cars. However, wefind that a DNN accelerator alone may exceed the total required FIT rate of theSoC without protection (Section 8.4.2).Figure 3.2: Example of SDC that could lead to collision in self-driving carsdue to soft errors22Chapter 4Fine-Grained Characterization ofFaults Causing Long LatencyCrashes in ProgramsIn this chapter, we investigate into an important but neglected problem in the de-sign of dependable software systems, namely identifying faults that propagate fora long time before causing crashes, or long-latency crashes (LLCs). We first de-fine the problem, then characterize the code patterns that lead to LLCs through anempirical fault injection study. Based on the observations, we propose efficientheuristics to identify these code patterns for protection in order to eliminate LLCsin programs. The identification is through static and dynamic analyses of a givenprogram without requiring to perform extensive fault injections. Hence, the pro-posed technique is much faster than traditional fault injection methods. Finally, weevaluate the proposed technique and present the results.4.1 IntroductionA hardware fault can have many possible effects on a program. First, it may bemasked or be benign. In other words, the fault may have no effect on the program’sfinal output. Second, it may cause a crash (i.e., hardware exception) or hang (i.e.,program time out). Finally, it may cause Silent Data Corruptions (SDCs), or the23program producing incorrect outputs. Of the above outcomes, SDCs are consid-ered the most severe, as there is no visible indication that the application has donesomething wrong. Therefore, a number of prior studies have focused on detectingSDC-causing program errors, by selectively identifying and protecting elements ofprogram state that are likely to cause SDCs [52, 64, 97, 141].Compared to SDCs, crashes have received relatively less attention from theperspective of error detection. This is because crashes are considered to be thedetection themselves, as the program can be recovered from a checkpoint (if oneexists) or restarted after a crash. However, all of these studies make an importantassumption, namely that the crash occurs soon after the fault is manifested in theprogram. This is important to ensure that the program is prevented from writingcorrupted state to the file system (e.g., checkpoint), or sending wrong messages toother processes [15]. While this assumption is true for a large number of faults,studies have shown that a small but non-negligible fraction of faults persist for along time in the program before causing a crash, and that these faults can causesignificant reliability problems such as extended downtimes [58, 146, 150]. Wecall these long-latency crashes (LLCs). Therefore, there is a compelling need todevelop techniques for protecting programs from LLC causing faults.Prior work has experimentally assessed LLCs through fault injection experi-ments [58]. However, they do not provide much insight into why some faults causeLLCs. This is important because (1) fault injection experiments require a lot ofcomputation time, especially to identify relatively rare events such as LLCs, and(2) fault injection cannot guarantee completeness in identifying all or even mostLLC causing locations. The latter is important in order to ensure that crash latencyis bounded in the program by protecting LLC causing program locations. Yim etal. [150] analyze error propagation latency in the program, and develop a coarse-grained categorization of program locations based on whether a fault in the locationcan cause LLCs. The categorization is based on where the program data resides,such as text segment, stack segment or heap segment. While this is useful, it doesnot help programmers decide which parts of the program need to be protected, asprotecting all parts of the program that manipulate the heap data or stack data canlead to prohibitive performance overheads.In contrast to the above work, we present a technique to perform fine grained24classification of program’s data at the level of individual variables and programstatements, based on whether a fault in the data item causes an LLC. The maininsight underlying our work is that very few program locations are responsible forLLCs, and that these locations conform to a few dominant code patterns. Our tech-nique performs static analysis of the program to identify the LLC causing codepatterns. However, not every instance of the LLC-causing code pattern leads to anLLC. Our technique further uses dynamic analysis coupled with a very selectivefault injection experiment, to filter the false positives and isolate the few instancesof the patterns that lead to LLCs. We have implemented our technique in a com-pletely automated tool called CRASHFINDER, which is integrated with the LLVMcompiler infrastructure [85]. To the best of our knowledge, we are the first to pro-pose an automated and efficient method to systematically identify LLC causingprogram locations for protection in a fine-grained fashion.We make the following contributions in this work.• Identify the dominant code patterns that can cause LLCs in programs througha large-scale fault injection experiment we conducted on a total of ten bench-mark applications,• Develop an automated static analysis technique to identify the LLC-causingcode patterns in programs, based on the fault injection study,• Propose a dynamic analysis and selective fault injection-based approach tofilter out the false-positives identified by the static analysis technique, andidentify LLCs.• Implement the static and dynamic analysis techniques in an automated tool.We call this CRASHFINDER.• Evaluate CRASHFINDER on benchmark applications from the SPEC [68],PARBOIL [133], PARSEC [18] and SPLASH-2 [148] benchmark suites. Wefind that CRASHFINDER can accurately identify over 90% of the LLC caus-ing locations in the program, with no false-positives, and is about nine ordersof magnitude faster than performing exhaustive fault injections to identify allLLCs in a program.25(3)Crash10,000 crash latency(2)Fault Activation(1) Fault OccurrenceSnapshot 1 Snapshot 2 Snapshot 38000 instructions 8000 instructionsTimelineFigure 4.1: Long Latency Crash and Checkpointing4.2 Why bound the crash latency?We now explain our rationale for studying LLCs and why it is important to boundthe crash latency in programs. We note that similar observations have been madein prior work [58, 150], and that studies have shown that having unbounded crashlatency can result in severe failures. We consider one example.Assume that the program is being checkpointed every 8,000 instructions sothat it can be recovered in the case of a failure (we set aside the practicality ofperforming such fine grained checkpointing for now). We assume that the check-points are gathered in an application independent manner, i.e., the entire state ofthe program is captured in the checkpoint. If the program encounters an LLCof more than 10,000 instructions, it is highly likely that one or more checkpointswill be corrupted (by the fault causing the LLC). This situation is shown in Fig-ure 4.1. However, if the crash latency is bounded to 1,000 instructions (say), thenit is highly unlikely for the fault to corrupt more than one checkpoint. Note thatthe latency between the fault activation and the fault occurrence does not matterin this case, as the checkpoint is corrupted only when the fault actually gets acti-vated. Therefore, we focus on the crash latency in this chapter, i.e., the number ofdynamic instructions from the fault activation to the crash.Identifying program locations that are prone to LLC is critical to improve sys-tem reliability so that one can bound crash latency by selectively protecting LLC-prone locations with minimal performance overheads. For example, one can du-plicate the backward slices of the LLC-prone locations, or use low-cost detectors26for these locations like what prior work has done [123]. In this chapter, we fo-cus on identifying such LLC-causing program locations, and defer the problem ofprotecting the locations to future work.4.3 Initial Fault Injection StudyIn this section, we perform an initial fault injection study for characterizing theLLC causing locations in a program. The goal of this study is to measure thefrequency of LLCs, and understand the reasons for them in terms of the program’scode. In turn, this will allow us to formulate heuristics for identifying the LLC-causing code patterns in Section 4.4. We first explain our experimental setup forthis study, and then discuss the results.4.3.1 Fault Injection ExperimentTo perform the fault injection study, we use LLFI [147], an open-source fault in-jector that operates at the LLVM compiler’s IR level. We inject faults into thedestination registers of LLVM IR instructions, as per our fault model in Section 3.We first profile each program to get the total number of dynamic instructions. Wethen inject a single bit flip in the destination register of a single dynamic instructionchosen at random from the set of all dynamic instructions executed by the program.Recent study [31] has shown that the fault injection method is representative. Ourbenchmarks are chosen from the SPEC [68], PARBOIL [133], PARSEC [18] andSPLASH-2 suites [148]. We choose ten programs at random from these suites, andinject a total of 1,000 faults in each, for a total of 10,000 fault injection experi-ments. The details of the benchmarks are explained in Section 4.6.1.Note that our way of injecting faults using LLFI ensures that the fault is ac-tivated right away as it directly corrupts the program’s state during the injection.Therefore, we do not measure activation as the set of activated faults is the sameas the set of injected faults. We categorize the results into Crashes, SDCs, Hangsand Benign faults in our experiment. Because our focus in this chapter is on LLCs,we record the crash latency for crash-causing faults in terms of the number of dy-namic LLVM IR instructions between the fault injection and the crash. However,when the program crashes, its state will be lost, and hence we periodically write27to permanent storage the number of dynamic instructions executed by the programafter the fault injection. The counting of the dynamic instructions is done using thetracing feature of LLFI, which we have enabled in our experiments.4.3.2 Fault Injection ResultsWe classify the results of the fault injection experiments into SDC, crash and be-nign. Hangs were negligible in our experiment and are not reported. Figure 4.2shows the aggregated fault injection results across the benchmarks. We find thaton average, crashes constitute about 35% of the faults, SDC constitute 4.2%, andthe remaining are benign faults (about 60%). We focus on crashes in the rest ofthis chapter, as our focus is on LLCs.Figure 4.2: Aggregate Fault Injection Results across BenchmarksFigure 4.3 shows the distribution of crash latencies for all the faults that ledto crashes in the injections. On average, the percentage of LLCs is about 0.38%across the ten benchmarks. Recall that we set 1,000 dynamic instructions as thethreshold for determining whether a crash is an LLC. Therefore, LLCs constitute arelatively small fraction of the total crashes in programs. This is why it is importantto devise fine-grained techniques to identify them, as even a relatively large faultinjection experiment such as ours exposes very few LLCs in the program (38 inabsolute numbers). The percentages of LLCs among all the crash causing faults,vary from 0% to 3.6% across programs due to benchmark specific characteristics.The reasons for these variations are discussed in Section 4.7.28Figure 4.3: Latency Distribution of Crash-Causing Errors in Programs: Thepurple bars represent the LLCs as they have a crash latency of more than1000 instructions. The number shown at the top of each bar shows thepercentage of crashes that resulted in LLCs. The error bars for LLCsrange from 0%(cutcp) to 1.85%(sjeng).We also categorized the LLCs based on the code patterns in which the LLC lo-cations occurred. In other words, we study the kinds of program constructs whichwhen fault injected, are likely to cause LLCs. We choose the largest five applica-tions from the ten benchmarks for studying the code characteristics since the largerthe programs, the more code patterns they may reveal. Thus we choose sjeng,hmmer, href, libquantum and mcf for our detailed investigation.Figure 4.4 shows the distribution of the LLC-causing code patterns we foundin our experiments. The patterns themselves are explained in Section 4.3.3. Wefind that about 95% of the LLC causing code falls into three dominant patterns,namely (1) Pointer Corruption (20%), (2) Loop Corruption (56%), and (3) StateCorruption (19%). Therefore we focus on these three patterns in the rest of thischapter.4.3.3 Code Patterns that Lead to LLCsAs mentioned in the previous section, we find that LLCs fall into three dominantpatterns namely, pointer related LLC, loop related LLC and state related LLC. We29Figure 4.4: Distribution of LLC Categories across 5 Benchmarks (sjeng,libquantum, hmmer, h264ref and mcf). Three dominant categories ac-count for 95% of the LLCs.explain each category with code examples in the following subsections. Althoughthese observations were made at the LLVM IR level, we use C code for clarity toexplain them.Pointer Corruption LLC occurs when a fault is injected into pointers that arewritten to memory. An erroneous pointer value is stored in the memory, and thisvalue can be used as a memory operation later on to cause crash. Because thepointer may not be read for a long time, this pattern has the potential to cause anLLC. Figure 4.5A shows the case we observed in sjeng from our fault injectionexperiment. In the function reloadMT, *p0 and next are assigned to a global staticvariable, state, at line 7 and line 8 respectively. The fault is injected on the pointer,*p0, at line 10. As a result, an erroneous pointer value is saved in the memory andit is used as a memory operation in the function randomMT at line 18 after a longtime. This leads to an LLC.Loop Corruption LLC When faults are injected into loop conditions or arrayindices inside the loop, the array manipulated by the loop (if any) may aggressivelycorrupt the stack, and cause LLC. We categorize this as Loop Corruption LLC.There are two cases in which this LLC can occur.The first case is when a fault occurs in the array index of an array written withinthe loop. This fault can corrupt a large area of stack since an erroneous array index30Figure 4.5: Code examples showing the three kinds of LLCs that occurred inour used for array address offset calculations in every iteration of the loop. Thislarge-scale corruption to the stack significantly increases the chance of corruptingaddress values (i.e., pointers, return address etc) on the stack, which in turn canresult in a crash much later. For example, in Figure 4.5B, when a fault is injectedinto next making a corrupted value saved back to it at the line 5, the struct arrayperm[] at line 9 corrupts values on the stack. When the corrupted value is used formemory operations later in the program, an LLC is observed.The second case occurs when faults are injected into termination conditions ofthe loop, causing a stack overflow to occur. This is shown in Figure 4.5C. Assumethat a fault is injected into piece count at line 3, and makes it a large value. Thiswill cause the for loop at line 5 to execute for a much larger number of iterations,thereby corrupting the stack and eventually leading to a LLC.31State Corruption LLC occurs when faults are injected into state variables orlock (synchronization) variables in state machine structures. These variables aredeclared as static or global variables and are used to allocate or deallocate partic-ular pieces of memory. If these states are corrupted, crashes may happen betweenstates, thus causing LLC. For the code shown in Figure 4.5D, when we inject afault in opstatus at line 7, the variable opstatus becomes a nonzero value (fromzero) when the state goes to quantum objcode stop. Later in the function quan-tum objcode put when the state is updated to quantum objcode stop, the opstatusvariable is examined to decide whether a particular memory area should be ac-cessed (line 23). Due to the fault injected, we observed that objcode is accessed atline 28 while in the state quantum objcode stop. This leads to a LLC as it accessesthe unallocated memory area objcode, which is illegal.4.4 ApproachIn this section, we describe our proposed technique CRASHFINDER, to find all theLLCs in a program. CRASHFINDER consists of three phases, as Figure 4.6 showsIn the first phase, it performs a static analysis of the program’s source code todetermine the potential locations that can cause LLCs. The analysis is done basedon the code patterns in Section 4.3.3. We refer to this phase of CRASHFINDERas CRASHFINDER STATIC. In the second phase, it performs dynamic analysis ofthe program (under a given set of inputs) to determine which dynamic instances ofthe static locations are likely to result in LLCs. We call this phase CRASHFINDERDYNAMIC. In the last phase, it injects a selected few faults to the dynamic instanceschosen by CRASHFINDER DYNAMIC. We refer to this phase of CRASHFINDER asselective fault injection. We describe the three phases in the three subsections.4.4.1 Phase 1: Static Analysis (CRASHFINDER STATIC)CRASHFINDER STATIC is the static analysis portion of our technique that staticallysearches the program’s code for the three patterns corresponding to those identifiedin Section 4.3.3. We found that these three patterns are responsible for almost allthe LLCs in the program, and hence it suffices to look for these patterns to cover32CrashFinder         StaticCrashFinder DynamicSelective Fault InjectionCompiled LLVM IR Source Code Test CasesList of LocationsList of LLC LocationsList of LocationsFigure 4.6: Workflow of CRASHFINDERthe LLCs. However, not every instance of these patterns may lead to an LLC, andhence we may get false-positives in this phase. False-positives are those locationsthat conform to the LLC causing patterns but do not lead to an LLC, and will beaddressed in the next phase.The algorithm of CRASHFINDER STATIC takes the program’s source codecompiled to the LLVM IR as an input and outputs the list of potential LLC causinglocations. Specifically, CRASHFINDER STATIC looks for the following patterns inthe program:Pointer Corruption LLCCRASHFINDER STATIC finds pointers that are written to memory in the program.More specifically, it examines static data dependency sequences of all pointers, andonly consider the ones that end with store instruction.Loop Corruption LLCIn this category, CRASHFINDER STATIC finds loop termination variables in loopheaders and array index assignment operations. For loop termination variable(s),it looks for the variable(s) that is used for comparison with the loop index variablein loop headers. For array index assignment, CRASHFINDER STATIC first locatesbinary operations with a variable and a constant as operands, then checks if theresult being stored is used as offset in array address calculation. If yes, then we caninfer that the variable being updated will be used as the address offset of an array.33In LLVM, offset calculations are done through a special instruction and are henceeasy to identify statically.State Corruption LLCCRASHFINDER STATIC finds static and global variables used to store state or locks.Because these may depend on the application’s semantics, we devise a heuristic tofind such variables. If a static variable is loaded and directly used in comparisonand branches, we assume that it is likely to be a state variable or a lock variable.We find that this heuristic allow us to cover most of these cases without semanticknowledge of the application.4.4.2 Phase 2: Dynamic Analysis (CRASHFINDER DYNAMIC)In this phase, our technique attempts to eliminate the false positives from the staticlocations identified in phase 1. One straw man approach for doing so is to injectfaults in every dynamic instance of the static locations to determine if it leads toan LLC. However, a single static instruction may correspond to hundreds of thou-sands of dynamic instances in a typical program, especially if it is within a loop.Further, each of these dynamic instances needs to be fault injected multiple timesto determine if it will lead to an LLC, and hence a large number of fault injectionswill need to be performed. All this adds up to considerable performance overheads,and hence the above straw man approach does not scale.We propose an alternate approach to cut down the number of fault injectionlocations to filter out the false positives. Our approach uses dynamic analysis toidentify a few dynamic instances to consider for injection among the set of all theidentified static locations. The main insight we leverage is that there are repeatedcontrol-flow sequences in which the dynamic instances occur, and it is sufficientto sample dynamic instances in each unique control-flow sequence to obtain a rep-resentative set of dynamic instances for fault injection. This is because the crashlatency predominantly depends on the control-flow sequence executed by the pro-gram after the injection at a given program location. Therefore, it suffices to obtainone sample from each unique control flow pattern in which the dynamic instanceoccurs. We determine the control flow sequences at the level of function calls. That34Figure 4.7: Dynamic sampling heuristic. (a) Example source code (oceanprogram), (b) Execution trace and sample candidates.35is we sample the dynamic instances with different function call sequences, and ig-nore the ones that have the same function call sequences. We show in Section 4.7that this sampling heuristic works well in practice.We consider an example to illustrate the sampling heuristic to determine whichdynamic instances to choose. Figure 4.7(b) shows the dynamic execution trace gen-erated by the code in Figure 4.7(a). For example, we want to sample the dynamicinstances corresponding to the variable t1a at line 17 in Figure 4.7(a). Firstly, be-cause t1a is within a loop in function relax, it corresponds to multiple dynamicinstances in the trace. We only consider one of them as candidate for choosingsamples (we call it a sample candidate), since they have same function call se-quences (no function calls) in between. Secondly, function relax is called withina loop in function multig at lines 5 and 7. As can be seen in the Figure 4.7, thereare two recurring function call sequences circumscribing the execution of the staticlocation corresponding to the sample candidates, namely relax() copy red() andrelax() copy black(). We collect one sample of each sequence regardless of howmany times they occur. In this case, only sample candidate 1 and 2 are selectedfor later fault injections. We find that this dramatically reduces the fault injectionspace thereby saving considerable time.4.4.3 Phase 3: Selective Fault InjectionsThe goal of this phase is to filter out all the false-positives identified in the previousphase through fault injections. Once we have isolated a set of dynamic instancesfrom CRASHFINDER DYNAMIC to inject for the static location, we configure ourfault injector to inject two faults into each dynamic instance, one fault at a time.We choose one high-order bit and and one low-order bit at random to inject into, aswe found experimentally that LLCs predominantly occur either in the high-orderbits or the low-order bits, and hence one needs to sample both.We then classify the location as an LLC location (i.e., not a false positive) ifany one of the injected faults results in an LLC. Otherwise, we consider it a false-positive, and remove it from the list of LLC locations. Note that this approach isconservative as performing more injections can potentially increase the likelihoodof finding an LLC, and hence it is possible that we miss some LLCs. However, as36we show in Section 4.7, our approach finds most LLCs even with only two faultinjections per each dynamic instance. We also show that increasing the numberof fault injections beyond 2 for each dynamic instance does not yield substantialbenefits, and hence we stick to 2 injections per instance.4.5 ImplementationWe implemented CRASHFINDER STATIC as a pass in the LLVM compiler [85] toanalyze the IR code and extract the code patterns. We implemented the CRASHFINDERDYNAMIC also as an LLVM pass that instruments the program to obtain its control-flow. CRASHFINDER DYNAMIC then analyzes the control-flow patterns and deter-mines what instances to choose for selective fault injection. We use the LLFI faultinjection framework [147] to perform the fault injections. Finally, we used ourcrash latency measurement library to determine the crash latencies after injection.To use CRASHFINDER 1, all the user needs to do is to compile the applicationcode with the LLVM compiler using our module. No annotations are needed. Theuser also needs to provide us with representative inputs so that CRASHFINDER canexecute the application, collect the control-flow patterns and choose the dynamicinstances to inject faults.4.6 Experimental SetupWe empirically evaluate CRASHFINDER in terms of accuracy and performance.We use a fault injection experiment to measure the accuracy, and use execution timeof the technique to measure its performance. We evaluate both CRASHFINDERand CRASHFINDER STATIC separately to understand the effect of different parts ofthe technique (CRASHFINDER includes CRASHFINDER STATIC, CRASHFINDERDYNAMIC and the selective fault injection). We compare both the accuracy andthe performance of both techniques to those of exhaustive fault injections that areneeded to find all the LLCs in a program 2. Our experiments are all carried out onan Intel Xeon E5 machine, with 32 GB RAM running Ubuntu Linux 12.04.1CRASHFINDER and its source code can be freely downloaded from goal is to find all LLC causing locations in the program so that we can selectively protectthem and bound the crash latency.37We first present the benchmarks used (Section 4.6.1, followed by the researchquestions (Section 4.6.2). We then present an overview of the methodology weused to answer each of the research questions (Section 4.6.3).4.6.1 BenchmarksWe choose a total of ten benchmarks from various domains for evaluating CRASHFINDER.The benchmark applications are from SPEC [68], PARBOIL [133], PARSEC [18]and SPLASH-2 [148]. All the benchmark application are compiled and linked intonative executables using LLVM, with standard optimizations enabled. We showthe detailed information of the benchmarks in Table 7.1.Table 4.1: Characteristics of Benchmark ProgramsBenchmark BenchmarkSuiteDescriptionlibquantum SPEC A library for the simulation of a quantumcomputerh264ref SPEC A reference implementation of H.264/AVC(Advanced Video Coding)blackscholes PARSEC Option pricing with Black-Scholes PartialDifferential Equation (PDE)hmmer SPEC Uses statistical description of a sequencefamily’s consensus to do sensitive databasesearchingmcf SPEC Solves single-depot vehicle scheduling prob-lems planning transportationocean SPLASH-2 Large-scale ocean movements simulationbased on eddy and boundary currentssad PARBOIL Sum of absolute differences kernel, used inMPEG video encoderssjeng SPEC A program that plays chess and several chessvariantscutcp PARBOIL Computes the short-range component ofCoulombic potential at each grid pointstencil PARBOIL An iterative Jacobi stencil operation on a reg-ular 3-D grid4.6.2 Research QuestionsWe answer the following research questions(RQs) in our experiments.RQ1: How much speedup do CRASHFINDER STATIC and CRASHFINDERachieve over exhaustive injection ?RQ2: What is the precision of CRASHFINDER STATIC and CRASHFINDER?38RQ3: What is the recall of CRASHFINDER STATIC and CRASHFINDER?RQ4: How well do the sampling heuristics used in CRASHFINDER work inpractice ?4.6.3 Experimental MethodologyWe describe our methodology for answering each of the RQs below. We performfault injections using the LLFI fault injector [147] as described earlier.PerformanceIn order to answer RQ1, we measure the total time taken for executing CRASHFINDERSTATIC, CRASHFINDER and the exhaustive fault injections. More specifically, foreach benchmark, we measure the total time used for (1) CRASHFINDER STATIC,(2) CRASHFINDER, which includes CRASHFINDER STATIC, CRASHFINDER DY-NAMIC and selective fault injections to identify LLCs and, (3) exhaustive faultinjections to find LLCs.PrecisionThe precision is an indication of the false-positives produced by CRASHFINDERSTATIC and CRASHFINDER. To measure the precision, we inject 200 faults ran-domly at each static program location identified by CRASHFINDER STATIC orCRASHFINDER, and measure the latency. If none of the injections at the loca-tion result in an LLC, we declare it to be a false positive. Note that we choose 200fault injections per location to balance time and comprehensiveness. If we increasethe number of faults, we may find more LLC causing locations, thus decreasingthe false positives. Thus, this method gives us a conservative upper bound on thefalse-positives of the technique.RecallThe recall is an indication of the false-negatives produced by CRASHFINDER STATICand CRASHFINDER. To measure the recall of CRASHFINDER STATIC and CRASHFINDER,we randomly inject 3,000 faults for each benchmark and calculate the fraction ofthe observed LLCs that were covered by CRASHFINDER STATIC and CRASHFINDER39respectively. Thus 30,000 faults in total are injected over ten benchmark applica-tions for this experiment. Note that this is in addition to the 1,000 fault injectionexperiments performed in the initial study, which were used to develop the twotechniques. We do not include the initial injections in the recall measurement toavoid biasing the results.Heuristics for SamplingAs mentioned in Section 4.4, there are two heuristics used by CRASHFINDER toreduce the space of fault injections it has to perform. The first is to limit the choseninstances to unique dynamic instances of control-flow patterns in which the staticinstructions appear, and the second is to limit the number of faults injected in thedynamic instances to two faults per instance. These heuristics may lead to loss incoverage. We investigate the efficacy of these heuristics by varying the parametersused in them and measure the resulting recall.4.7 ResultsThis section presents the results of our experiments for evaluating CRASHFINDERSTATIC and CRASHFINDER. Each subsection corresponds to a research question(RQ).4.7.1 Performance (RQ1)We first present the results of running CRASHFINDER and CRASHFINDER STATICin terms of the number of instructions in each benchmark, and then examine howmuch speedup can CRASHFINDER achieve over exhaustive fault injections.Table 4.2 shows the numbers of instructions for each benchmark. In the ta-ble, columns Total S.I and Total D.I show the total numbers of static instructionsand dynamic instructions of each benchmark. Columns CRASHFINDER STATICS.I and CRASHFINDER STATIC D.I indicate the numbers of static instructions anddynamic instructions corresponding to the static instructions that were found byCRASHFINDER STATIC as LLC causing locations. Columns CRASHFINDER S.Iand CRASHFINDER D.I show the numbers of static instructions and dynamic in-stances of the static locations that CRASHFINDER identified as LLC causing lo-40Table 4.2: Comparison of Instructions Given by CRASHFINDER andCRASHFINDER STATICTotal S.I. TotalD.I.(in mil-lion)CFStaticS.I (%)CFStaticD.I (%)CF S.I(%)CF D.I(%)libquantum 15319 870 1.85% 9.27% 0.18% 0.011%h264ref 189157 116 0.85% 3.92% 0.14% 0.150%blackscholes 758 0.13 3.29% 1.81% 0.66% 0.004%hmmer 92287 4774 0.51% 3.53% 0.13% 0.437%mcf 4086 6737 6.29% 8.75% 2.62% 1.383%ocean 21300 1061 3.46% 3.11% 0.53% 0.003%sad 3176 1982 4.47% 5.56% 0.69% 0.473%sjeng 33931 137 1.70% 15.55% 0.16% 0.567%cutcp 3868 11389 3.13% 6.35% 0.39% 0.001%stencil 2193 7168 4.38% 0.84% 0.41% 0.819%Average 36608 3423 2.99% 5.87% 0.89% 0.385%cations. As can be seen from the table, on average, CRASHFINDER STATIC iden-tified 2.99% of static instructions as LLC causing, which corresponds to about5.87% of dynamic instructions. In comparison, CRASHFINDER further winnowedthe number of static LLC-causing locations to 0.89%, and the number of dynamicinstructions to just 0.385%, thereby achieving a significant reduction in the dy-namic instructions. The implications of this reduction are further investigated inSection 4.8.Figure 4.8 shows the orders of magnitude of time reduction achieved by usingCRASHFINDER STATIC and CRASHFINDER to find LLCs, compared to exhaustivefault injections, for each benchmark. In the figure, CRASHFINDER STATIC refersto the time taken to run CRASHFINDER STATIC. CRASHFINDER refers to thetime taken to run all three components of CRASHFINDER, namely CRASHFINDERSTATIC, CRASHFINDER DYNAMIC and the selective fault injection phase. Notethat the exhaustive fault injection times are an estimate based on the number ofdynamic instructions that need to be injected, and the time taken to perform asingle injection. We emphasize that the numbers shown represent the orders ofmagnitude in terms of speedup. For example, a value of 12 in the graph, means41that the corresponding technique was 1012 times faster than performing exhaustivefault injections. In summary, on average CRASHFINDER STATIC achieves a totalof 13.47 orders of magnitude of time reduction whereas CRASHFINDER achieves9.29 orders of magnitude of time reduction over exhaustive fault injecton to findLLCs.We also measured the wall clock time of the different phases of CRASHFINDER.The geometric means of time taken for CRASHFINDER STATIC are 23 seconds, forCRASHFINDER DYNAMIC the time taken is 3.1 hours, while it takes about 3.9 daysfor the selective fault injection phase. Overall, it takes about 4 days on average forCRASHFINDER to complete the entire process. While this may seem large, notethat both the CRASHFINDER DYNAMIC and selective fault injection phases can beparallelized to reduce the time. We did not however do this in our experiments.Figure 4.8: Orders of Magnitude of Time Reduction by CRASHFINDERSTATIC and CRASHFINDER compared to exhaustive fault injections4.7.2 Precision (RQ2)Figure 4.9 shows the precision of CRASHFINDER STATIC and CRASHFINDER foreach benchmark. The average precision of CRASHFINDER STATIC and CRASHFINDERare 25.42% and 100% respectively. The reason CRASHFINDER has a precision of100% is that all the false-positives produced by the static analysis phase (CRASHFINDER42Figure 4.9: Precision of CRASHFINDER STATIC and CRASHFINDER forfinding LLCs in the programSTATIC) are filtered out by the latter two phases, namely CRASHFINDER DY-NAMIC, and selective fault injections. The main reason why CRASHFINDER STATIChas low precision is because it cannot statically determine the exact runtime behav-ior of variables. For example, a pointer can be saved and loaded to/from memoryin very short intervals, and would not result in an LLC. This behavior is determinedby its runtime control flow, and cannot be determined at compile time, thus result-ing in false positives by CRASHFINDER STATIC. However, CRASHFINDER doesnot have this problem as it uses dynamic analysis and selective fault injection.4.7.3 Recall (RQ3)Figure 4.10 shows the recall of CRASHFINDER STATIC and CRASHFINDER. Theaverage recall of CRASHFINDER STATIC and CRASHFINDER are 92.47% and90.14% respectively. Based on the results, we can conclude that (1) CRASHFINDERSTATIC is able to find most of the LLC causing locations showing that the code pat-terns we identified are comprehensive and, (2) our heuristics used in CRASHFINDERDYNAMIC and selective fault injections do not filter out many legitimate LLC loca-tions since there is only a 2.33% difference between the recalls of CRASHFINDER43Figure 4.10: Recall of CRASHFINDER STATIC and CRASHFINDERSTATIC and CRASHFINDER (however, they filter out most of the false positives asevidenced by the high precision of CRASHFINDER compared to CRASHFINDERSTATIC). We will discuss this further in the next subsection.There are two reasons why CRASHFINDER STATIC does not achieve 100%recall: (1)There are a few cases as mentioned in Section 4.3 that do not fall intothe three dominant patterns. (2) While CRASHFINDER STATIC is able to find mostof the common cases of LLCs, it does not find some cases where the dependencychain spans multiple function calls. For example, the return value of an arrayindex calculation can be propagated through complex function calls and finallyused in the address offset operations in a loop. This makes the pointer analysisin LLVM return too many candidates for the pointer target, and so we truncate thedependence chain. However, there is no fundamental reason why we cannot handlethese cases. Even without handling the cases, CRASHFINDER finds 92.47% of thecases leading to LLCs in the program.Note that we did not observe any LLCs in the two benchmark programs stenciland cutcp. This may be because they have fewer LLC causing locations, and/orthey have a small range of bits which may result in LLCs. This was also the casein the initial study 4.3.444.7.4 Efficacy of Heuristics (RQ4)As mentioned earlier, there are two heuristics used by CRASHFINDER DYNAMICto speed up the injections. First, in the dynamic analysis phase (CRASHFINDERDYNAMIC), only a few instruction instances are chosen for injection. Second, inthe selective fault injection phase, only a few bits in each of the chosen locationsare injected. We examine the effectiveness of these heuristics in practice.In order to understand the LLC-causing errors that are covered by CRASHFINDERSTATIC but not CRASHFINDER, we manually inspected these injections. We foundthat all of the missed errors are due to the second heuristic used by the selectivefault injection phase. None of the missed errors were due to the first heuristicemployed by CRASHFINDER DYNAMIC.The heuristic for choosing bit positions for selective injections picks two ran-dom positions in the word to inject faults into, one from the high-level bits andone from the low-level bits. Unfortunately, this may miss other positions that leadto LLCs. We evaluated the effect of increasing the number of sampled bits to 3and 5, but even this did not considerably increase the number of LLCs found byCRASHFINDER. This is because most of the missed errors can only be reproducedby injecting into very specific bit positions, and finding these positions will re-quire near exhaustive injections on the words found by CRASHFINDER DYNAMIC,which will prohibitively increase the time taken to complete the selective fault in-jection phase. Therefore, we choose to retain the heuristic as it is, especially be-cause the difference between CRASHFINDER STATIC and CRASHFINDER is only2.33%.With the above being said, the heuristic-based approach used here is an approx-imation. Hence, there may be multiple sources of inaccuracy in these heuristics.We will further quantify the limits of the heuristic based approach in future work.4.8 DiscussionIn this section, we discuss some of the implications of CRASHFINDER on se-lective protection and checkpointing. We also discuss some of the limitations ofCRASHFINDER and improvements.454.8.1 Implication for Selective ProtectionOne of the main results from evaluating CRASHFINDER is that we find that avery small number of instructions are responsible for most of the LLCs in theprogram. As per Table 4.2, only 0.89% of static instructions are responsible formore than 90% of the LLC causing errors in the program (based on the recallof CRASHFINDER). Further, CRASHFINDER is able to precisely pinpoint theseinstructions, thereby allowing these to be selectively protected.An example of a selective protection technique is value range checking in soft-ware [64]. A range check is typically inserted after the instruction that produces thedata item to be checked. For example, the assertion(ptr address<0x001b, true) in-serted after the static instruction producing ptr address will check the value of thevariable whenever the instruction is executed. Since the total number of executionsof all such LLC causing instructions is only 0.385% (Table 4.2), the overhead ofthese checks is likely to be extremely low. We will explore this direction in futurework.4.8.2 Implication for Checkpointing TechniquesOur study also establishes the feasibility of fine-grained checkpointing techniquesfor programs, as such checkpointing techniques would incur frequent state cor-ruptions in the presence of LLCs. For example, Chandra et al. [30] found thatthe frequency of checkpoint corruption when using a fine-grained checkpointingtechnique ranges between 25 and 40% due to LLCs. They therefore concludethat one should not use such fine-grained checkpointing techniques, and insteaduse application-specific coarse-grained checkpointing in which the correspondingprobability of checkpoint corruption is 1% to 19%. However, by deploying ourtechnique and selectively protecting the LLC causing locations in the program,one could restrict the crash latency, thus minimizing the chances of checkpointcorruption. Based on the 90% recall of CRASHFINDER, we can achieve a 10-foldreduction in the number of LLC causing locations, thus bringing the checkpointcorruption probability of fine-grained checkpointing down. This would make fine-grained checkpointing feasible, thus allowing faster recovery from errors. This isalso a direction we plan to explore in the future.464.8.3 Limitations and ImprovementsOne of the main limitations of CRASHFINDER is that it takes a long time (onaverage 4 days) to find the LLC causing errors in the program. The bulk of thistime is taken by the selective fault injection phase, which has to inject faults intothousands of dynamic instances found by CRASHFINDER DYNAMIC to determineif they are LLCs. While this is still orders of magnitude faster than performingexhaustive fault injections, it is still a relatively high one-time cost to protect theprogram. One way to speed this up would be to parallelize it, but that comes at thecost of increased computation resources.An alternate way to speed up the technique is to improve the precision ofCRASHFINDER STATIC. As it stands, CRASHFINDER STATIC takes only a fewseconds to analyze even large programs and find LLC causing locations in them.The main problem however is that CRASHFINDER STATIC has a very low preci-sion (of 25.4%). However, this may be acceptable in some cases, where we canprotect a few more locations and incur higher overheads in doing so. Even withthis overprotection, we still only protect less than 6% of the program’s dynamic in-structions (Table 4.2). However, one can improve the precision further by findingall possible aliases and control flow paths at compile time [117], and filtering outthe patterns that are unlikely to cause LLCs.Another limitation is that the recall of CRASHFINDER is only about 90%. Al-though this is still a significant recall, one can improve it further by (1) building amore comprehensive static analyzer to cover the uncovered cases that do not be-long to the dominant LLC-causing patterns, and (2) improving the heuristic usedin the selective fault injection phase, by increasing the number of fault injections inthe selective fault injection phase, albeit at the cost of increased performance over-heads (as we found in RQ4, this heuristic was responsible for most of the differencein recall between CRASHFINDER and CRASHFINDER STATIC).Finally, though the benchmark applications are chosen from a variety of do-mains such as scientific computing, multimedia, statistics and games, there areother domains that are not covered such as database programs, or system softwareapplications. Further, they are all single-node applications. We defer the extensionof CRASHFINDER for distributed applications to our future work.474.9 SummaryIn this chapter, we identify an important but neglected problem in the design of de-pendable software systems, namely identifying faults that propagate for a long timebefore causing crashes, or LLCs. Unlike prior work which has only performed acoarse grained analysis of such faults, we perform a fine grained characterization ofLLCs. Interestingly, we find that there are only three code patterns in the programthat are responsible for almost all LLCs, and that these patterns can be identifiedefficiently through static analysis. We build a static analysis technique to find thesepatterns, and augment it with a dynamic analysis and selective fault-injection basedtechnique to filter out the false positives. We implement our technique in a com-pletely automated tool called CRASHFINDER. We find that CRASHFINDER is ableto achieve 9 orders of magnitude speedup over exhaustive fault injections to iden-tify LLCs, has no false-positives, and successfully identifies over 90% of the LLCcausing locations in ten benchmark programs.48Chapter 5Modeling Soft-Error Propagationin ProgramsThis chapter describes a fast and accurate modeling technique that quantitivelyestimates the Silent Data Corruptions (SDCs) probabilities of a given programand its individual instructions without any fault injections. We name our modelTRIDENT. Different from the heuristic-based technique we have discussed inChapter 4, the technique proposed in this chapter is an analytical model whichsystematically tracks error propagation in the entire propagation space of programexecutions. We first describe the challenges in identifying SDCs in programs be-fore presenting the details of the model. We then design experiments to evaluatethe accuracy and performance of the model. Finally, we discuss an use-case wheredevelopers can use the model to guide the selective protection in programs.5.1 IntroductionOne consequence of such hardware errors is incorrect program output, or silentdata corruptions (SDCs), which are very difficult to detect and can hence havesevere consequences [129]. Studies have shown that a small fraction of the programstates are responsible for almost all the error propagations resulting in SDCs, andso one can selectively protect these states to meet the target SDC probability whileincurring lower energy and performance costs than full duplication techniques [52,49130]. Therefore, in the development of fault-tolerant applications (Figure 5.1A), itis important to estimate the SDC probability of a program – both in the aggregate,and on an individual instruction basis - to decide whether protection is required,and if so, to selectively protect the SDC-causing states of the program. This is thegoal of our work.Fault Injection (FI) has been commonly employed to estimate the SDC proba-bilities of programs. FI involves perturbing the program state to emulate the effectof a hardware fault and executing the program to completion to determine if thefault caused an SDC. However, real-world programs may consist of billions of dy-namic instructions, and even a single execution of the program may take a longtime. Performing thousands of FIs to get statistically meaningful results for eachinstruction takes too much time to be practical [65, 66]. As a result, researchershave attempted to analytically model error propagation to identify vulnerable in-structions [52, 97, 130]. The main advantage of these analytical models is scalabil-ity, as the models usually do not require FIs, and they are fast to execute. However,most existing models suffer from a lack of accuracy, as they are limited to mod-eling faults in the normal (i.e., fault-free) control-flow path of the program. Sinceprogram execution is dynamic in nature, a fault can propagate to not only the data-dependencies of an instruction, but also to the subsequent branches (i.e., controlflow) and memory locations that are dependent on it. This causes deviation fromthe predicted propagation, leading to inaccuracies. Unfortunately, tracking the de-viation in control-flow and memory locations due to a fault often leads to statespace explosion.This chapter proposes a model, TRIDENT, for tracking error propagation inprograms that addresses the above two challenges. The key insight in TRIDENTis that error propagation in dynamic execution can be decomposed into a combi-nation of individual modules, each of which can be abstracted into probabilisticevents. TRIDENT can predict both the overall SDC probability of a program andthe SDC probability of individual instructions based on dynamic and static analy-sis of the program without performing FI. We implement TRIDENT in the LLVMcompiler [85] and evaluate its accuracy and scalability vis-a-vis FI. To the best ofour knowledge, we are the first to propose a model to estimate the SDC probabilityof individual instructions and the entire program without performing any FIs.50Figure 5.1: Development of Fault-Tolerant ApplicationsOur main contributions in this chapter are as follows:• Propose TRIDENT, a three-level model for tracking error propagation inprograms. The levels are static-instruction, control-flow and memory levels,and they build on each other. The three-level model abstracts the data-flowof programs in the presence of faults.• Compare the accuracy and scalability of TRIDENT with FI, to predict theSDC probability of individual instructions and that of the entire program.• Demonstrate the use of TRIDENT to guide selective instruction duplicationfor configurable protection of programs from SDCs under a performanceoverhead.The results of our experimental evaluation are as follows:• The predictions of SDC probabilities using TRIDENT are statistically in-distinguishable from those obtained through FI, both for the overall programand for individual instructions. On average, the overall SDC probabilitypredicted by TRIDENT is 14.83% while the FI measured value is 13.59%across 11 programs.• We also create two simpler models to show the importance of modelingcontrol-flow divergence and memory dependencies - the first model con-siders neither, while the second considers control-flow divergence but not51memory dependencies. The two simpler models predict the average SDCprobabilities across programs as 33.85% and 23.76% respectively, which ismuch higher than the FI results.• Compared to FI, whose cost is proportional to the number of injections, TRI-DENT incurs a fixed cost, and a small incremental cost for each instructionsampled in the program. For example, TRIDENT takes about 16 minutes tocalculate the individual SDC probabilities of about 1,000 static instructions,which is significantly faster than the corresponding FI experiments (whichoften take hours or even days).• Using TRIDENT to guide selective instruction duplication reduces the over-all SDC probability by 65% and 90% at 11.78% and 23.31% performanceoverheads, respectively (these represent 1/3rd and 2/3rd of the full-duplicationoverhead for the programs respectively). These reductions are higher thanthe corresponding ones obtained using the simpler models.5.2 The ChallengeWe use the code example in Figure 5.2A to explain the main challenge of modelingerror propagation in programs. The code is from Pathfinder [33], though we makeminor modifications for clarity and remove some irrelevant parts. The figure showsthe control-flow graphs (CFGs) of two functions: init() and run(). There is a loopin each function: the one in init() updates an array, and the one in run() readsthe array for processing. The two functions init() and run() are called in order atruntime. In the CFGs, each box is a basic block and each arrow indicates a possibleexecution path. In each basic block, there is a sequence of statically data-dependentinstructions, or a static data-dependent instruction sequence.Assume that a fault occurs at the instruction writing to $1 in the first basic blockin init(). The fault propagates along its static data-dependent instruction sequence(from load to cmp). At the end of the sequence, if the fault propagates to theresult of the comparison instruction, it will go beyond the static data dependencyand cause the control-flow of the program to deviate from the fault-free execution.For example, in the fault-free execution, the T branch is supposed to be taken, but52Figure 5.2: Running Exampledue to the fault, the F branch is taken. Consequently, the basic blocks under theT branch including the store instruction will not be executed, whereas subsequentbasic blocks dominated by the F branch will be executed. This will load the wrongvalue in run(), and hence the fault will continue to propagate and it may reach theprogram’s output resulting in an SDC.We identify the following three challenges in modeling error propagation: (1)Statically modeling error propagation in dynamic program execution requires amodel that abstracts the program data-flow in the presence of faults. (2) Due to therandom nature of soft errors, a fault may be activated at any dynamic branch andcause control-flow divergence in execution from the fault-free execution. In any di-vergence, there are numerous possible execution paths the program may take, andtracking all of these paths is challenging. One can emulate all possible paths amongthe dynamic executions at every dynamic branch and figure out which fault prop-agates where in each case. However, this rapidly leads to state space explosion.(3) Faults may corrupt memory locations and hence continue to propagate throughmemory operations. Faulty memory values can be read by (multiple) load instruc-tions at runtime and written to other memory locations as execution progresses.There are enormous numbers of store and load instructions in a typical programexecution, and tracing error propagations among these memory dependencies re-quires constructing a huge data dependency graph, which is very expensive.As we can see in the above example, if we do not track error propagations be-53yond the static data dependencies and instead stop at the comparison instruction,we may not identify all the cases that could lead to SDCs. Moreover, if control-flowdivergence is ignored when modeling, tracking errors in memory is almost impos-sible, as memory corruptions often hide behind control-flow divergence, as shownin the above example. Existing modeling techniques capture neither of these im-portant cases, and their SDC prediction accuracies suffer accordingly. In contrast,TRIDENT captures both the control-flow divergences and the memory corruptionsthat potentially arise as a result of the divergence.5.3 TRIDENTIn this section, we first introduce the inputs and outputs of our proposed model,TRIDENT, and then present the overall structure of the model and the key in-sights it leverages. Finally we present the details of TRIDENT using the runningexample.5.3.1 Inputs and OutputsThe workflow of TRIDENT is shown in Figure 5.1B. We require the user to supplythree inputs: (1) The program code compiled to the LLVM IR, (2) a program inputto execute the program and obtain its execution profile (similar to FI methods,we also require a single input to obtain runtime information), and (3) the outputinstruction(s) in the program that are used for determining if a fault resulted inan SDC. For example, the user can specify printf instructions that are responsiblefor the program’s output and used to determine SDCs. On the other hand, printfsthat log debugging information or statistics about the program execution can beexcluded as they do not typically determine SDCs. Without this information, allthe output instructions are assumed to determine SDCs by default.TRIDENT consists of two phases: (1) Profiling and (2) inferencing. In theprofiling phase, TRIDENT executes the program, performing dynamic analysisof the program to gather information such as the count and data dependency ofinstructions. After collecting all the information, TRIDENT starts the inferencingphase which is based on static analysis of the program. In this phase, TRIDENTautomatically computes (1) the SDC probabilities of individual instructions, and54(2) the overall SDC probability of the program. In the latter case, the user needsto specify the number of sampled instructions when calculating the overall SDCprobability of the program, in order to balance the time for analysis with accuracy.5.3.2 Overview and InsightsBecause error propagation follows program data-flow at runtime, we need to modelprogram data-flow in the presence of faults at three levels: (1) Static-instructionlevel, which corresponds to the execution of a static data-dependent instruction se-quence and the transfer of results between registers. (2) Control-flow level, whenexecution jumps to another program location. (3) Memory level, when the re-sults need to be transferred back to memory. TRIDENT is divided into threesub-models to abstract the three levels, respectively, and we use fs , fc and fm torepresent them. The main algorithm of TRIDENT tracking error propagation froma given location to the program output is summarized in Algorithm 1.Static-Instruction Sub-Model ( fs ): First, fs is used to trace error propagation ofan arbitrary fault activated on a static data-dependent instruction sequence. It de-termines the propagation probability of the fault from where it was activated to theend of the sequence. For example, in Figure 5.2B, the model computes the proba-bility of the fault propagating to the result of the comparison instruction given thatthe fault is activated at the load instruction (Line 4 in Algorithm 1). Previous mod-els trace error propagation in data dependant instructions based on the dynamicdata dependency graph (DDG) which records the output and operand values ofeach dynamic instruction in the sequence [51, 130]. However, such detailed DDGsare very expensive to generate and process, and hence the models do not scale. fsavoids generating detailed dynamic traces and instead computes the propagationprobability of each static instruction based on its average case at runtime to deter-mine the error propagation in a static data-dependent instruction sequence. Sinceeach static instruction is designed to manipulate target bits in a pre-defined way,the propagation probability of each static instruction can be derived. We can thenaggregate the probabilities to calculate the probability of a fault propagating froma given instruction to another instruction within the same static data-dependentinstruction sequence.55Control-Flow Sub-Model ( fc ): As explained, a fault may propagate to branchesand cause the execution path of the program to diverge from its fault-free execu-tion. We divide the propagation into two phases after divergence: The first phase,modeled by fc , attempts to figure out which dynamic store instructions will becorrupted at what probabilities if a conditional branch is corrupted (Lines 3-5 inAlgorithm 1). The second phase traces what happens if the fault propagates tomemory, and is modeled by fm . The key observation is that error propagation tomemory through a conditional branch that leads to control-flow divergence can beabstracted into a few probabilistic events based on branch directions. This is be-cause the probabilities of the incorrect executions of store instructions are decidedby their execution paths and the corresponding branch probabilities. For exam-ple, in the function init() in Figure 5.2A, if the comparison instruction takes theF branch, the store instruction is not supposed to be executed, but if a fault mod-ifies the direction of the branch to the T branch, then it will be executed and leadto memory corruption. A similar case occurs where the comparison instruction issupposed to take the T branch. Thus, the store instruction is corrupted in eithercase.Memory Sub-Model ( fm ): fm tracks the propagation from corrupted store in-structions to the program output, by tracking memory dependencies of erroneousvalues until the output of the program is reached. During the tracking, other sub-models are recursively invoked where appropriate. fm then computes the propaga-tion probability from the corrupted store instruction to the program output (Lines7-9 in Algorithm 1). A memory data-dependency graph needs to be generated fortracing propagations at the memory level because we have to know which dynamicload instruction reloads the faulty data previously written by an erroneous store in-struction (if any). This graph can be expensive to construct and traverse due to thehuge number of the dynamic store and load instructions in the program. However,we find that the graph can be pruned by removing redundant dependencies betweensymmetric loops, if there are any. Consider as an example the two loops in init()and run() in Figure 5.2A. The first loop updates an array, and the second one readsfrom the same array. Thus, there is a memory dependence between every pairof iterations of the two loops. In this case, instead of tracking every dependencybetween dynamic instructions, we only track the aggregate dependencies between56the two loops. As a result, the memory dependence graph needs only two nodes toproject the dependencies between the stores and loads in their iterations.Algorithm 1: The Core Algorithm in TRIDENT1 sub-models fs , fc , and fm ;Input : I: Instruction where the fault occursOutput: PSDC: SDC probability2 ps = fs (I);3 if inst. sequence containing I ends with branch Ib then4 // Get the list of stores corrupted and their prob.5 [¡Ic, pc¿, ...] = fc (Ib);6 // Maximum propagation prob. is 17 Foreach(¡Ic, pc¿): PSDC += ps * pc * fm (Ic);8 else if inst. sequence containing I ends with store Is then9 PSDC = ps* fm (Is);5.3.3 Details: Static-Instruction Sub-Model ( fs )Once a fault is activated at an executed instruction, it starts propagating on itsstatic data-dependent instruction sequence. Each sequence ends with a store, acomparison or an instruction of program output. In these sequences, the probabilitythat each instruction masks the fault during the propagation can be determined byanalyzing the mechanism and operand values of the instruction. This is becauseinstructions often manipulate target bits in predefined ways.Given a fault that occurs and is activated on an instruction, fs computes theprobability of error propagation when the execution reaches the end of the staticcomputation sequence of the instruction. We use a code example in Figure 5.2Bto explain the idea. The code is from Pathfinder [33], and shows a counter beingincremented until a positive value is reached. In Figure 5.2B, INDEX 1-3 form astatic data-dependent instruction sequence, which an error may propagate along.Assuming a fault is activated at INDEX 1 and affects $1, the goal of fs is to tellthe probabilities of propagation, masking and crash after the execution of INDEX3, which is the last instruction on the sequence. fs traces the error propagationfrom INDEX 1 to INDEX 3 by aggregating the propagation probability of eachinstruction on the sequence. We use a tuple for each instruction to represent itsprobabilities which are shown in the brackets on the right of each instruction in57Figure 5.2B. There are three numbers in each tuple, which are the probabilitiesof propagation, masking and crash respectively, given that an operand of the in-struction is erroneous (we explain how to compute these later). For example, forINDEX 3, (0.03, 0.97, 0) means that the probability of the error continuing to prop-agate when INDEX 3 is corrupted is 0.03, whereas 0.97 is the probability that theerror will be masked and not propagate beyond INDEX 3. Finally, the probabilityof a crash at INDEX 3, in this case, is 0. Note that the probabilities in each tupleshould sum to 1.After calculating the individual probabilities, fs aggregates the propagationprobability in each tuple of INDEX 1, 2 and 3 to calculate the propagation probabil-ity from INDEX 1 to INDEX 3. That is given by 1*1*0.03=3% for the probabilityof propagation, and the probabilities of masking and crash are 97% and 0% respec-tively. Thus, if a fault is activated at INDEX 1, there is a 3% of probability that thebranch controlled by INDEX 3 will be flipped, causing a control-flow divergence.We now explain how to obtain the tuple for each instruction. Each tuple is ap-proximated based on the mechanism of the instruction and/or the profiled values ofthe instruction’s operands. We observe that there are only a few types of instruc-tions that have non-negligible masking probabilities: they are comparisons (e.g.,CMP), logic operators (e.g., XOR) and casts (e.g., TRUNC). We assume the rest ofinstructions neither move nor discard corrupted bits - this is a heuristic we use forsimplicity (we discuss its pros and cons in Section 6.7.1).In the example in Figure 5.2B, the branch direction will be modified basedon whether INDEX 3 computes a positive or negative value. In either case, onlya flip of the sign bit of $1 will modify the branch direction. Hence, the errorpropagation probability in the tuple of INDEX 3 is 1/32 = 0.03, assuming a 32-bitdata width. We derive crash probabilities in the tuples for instructions accessingmemory (i.e., load and store instructions). We consider crashes that are caused byprogram reading or writing out-of-bound memory addresses. Their probabilitiescan be approximated by profiling memory size allocated for the program (this isfound in the /proc/ filesystem in Linux). Prior work [51] has shown that these arethe dominant causes of crashes in programs due to soft errors.58Figure 5.3: NLT and LT Examples of the CFG5.3.4 Details: Control-Flow Sub-Model ( fc )Recall that the goal of fc is to figure out which dynamic store instructions willbe corrupted and at what probabilities, if a conditional branch is corrupted. Weclassify all comparison instructions that are used in branch conditions into twotypes based on whether they terminate a loop. The two types are (1) Non-Loop-Terminating cmp (NLT), and (2) Loop-Terminating cmp (LT). Figure 5.3 showstwo Control Flow Graphs (CFGs), one for each case. We also profile the branchprobability of each branch and mark it beside each corresponding branch for ouranalysis purpose. For example, if a branch probability is 0.2, it means duringthe execution there is 20% probability the branch is taken. We will use the twoexamples in Figure 5.3 to explain fc in each case.Non-Loop-Terminating CMP (NLT)If a comparison instruction does not control the termination of a loop, it is NLT.In Figure 5.3A, INDEX 1 is a NLT, dominating a store instruction in bb4. Thereare two cases for the store considered as being corrupted in fc : (1) The store isnot executed while it should be executed in a fault-free execution. (2) The store isexecuted while it should not be executed in a fault-free execution. Combining thesecases, the probability of the store instruction being corrupted can be represented by59Equation 5.1.Pc = Pe /Pd (5.1)In the equation, Pc is the probability of the store being corrupted, Pe is theexecution probability of the store instruction in fault-free execution, and Pd is thebranch probability of which direction dominates the store.We illustrate how to derive the above equation using the example in Figure 5.3A.There are two legal directions a branch can take. In the first case, the branch ofINDEX 1 is supposed to take the T branch at the fault-free execution (20% proba-bility), but the F branch is taken instead due to the corrupted INDEX 1. The storeinstruction in bb4 will be executed when it is not supposed to be executed and willhence be corrupted. The probability that the store instruction is executed in thiscase is calculated as 0.2 ∗ 0.9 ∗ 0.7 = 0.126 based on the probabilities on its exe-cution path (bb0-bb1-bb3-bb4). In the second case, if the F branch is supposedto be taken in a fault-free execution (80% probability), but the T branch is takeninstead due to the fault, the store instruction in bb4 will not be executed, while itis supposed to have been executed in some execution path in the fault-free execu-tion under the F branch. For example, in the fault-free execution, path bb0-bb1-bb3-bb4 will trigger the execution of the store. Therefore, the probability of thestore instruction being corrupted in this case is 0.8∗0.9∗0.7 = 0.504. Therefore,adding the two cases together, we get fc in this example as 0.126+0.504 = 0.63.The Equation 5.1 is simplified by integrating the terms in the calculations. In thisexample, in Equation 5.1, Pe is 0.8∗0.9∗0.7 (bb0-bb1-bb3-bb4), Pd is 0.8 (bb0-bb1), thus Pc is 0.8 ∗ 0.9 ∗ 0.7/0.8 = 0.63. Note that if the branch immediatelydominates the store instruction, then the probability of the store being corrupted is1, as shown by the example in Figure 5.2.Loop-Terminating CMP (LT)If a comparison instruction controls the termination of a loop, it is LT. For example,in Figure 5.3B, the back-edge of bb0 forms a loop, which can be terminated by thecondition computed by INDEX 2. Hence, INDEX 2 is a LT. We find that the proba-bility of the store instruction being corrupted can be represented by Equation. 5.2.Pc = Pb ∗Pe (5.2)60Pc is the probability that a dynamic store instruction is corrupted if the branchis modified, Pb is the execution probability of the back-edge of the branch, and Peis the execution probability of the store instruction dominated by the back-edge.We show the derivation of the above equation using the example in Figure 5.3B.In the first case, if the T branch (the loop back-edge) is supposed to be taken in afault-free execution (99% probability), the store instruction in bb4 may or maynot execute, depending on the branch in bb2. But if a fault modifies the branchof INDEX 2, the store will certainly not execute. So we need to omit the prob-abilities that the store is not executed in the fault-free execution to calculate thecorruption probability of the store. They are 0.99 ∗ 0.9 ∗ 0.3 = 0.27 for the pathbb0-bb1-bb2-bb3 and 0.99 ∗ 0.1 = 0.099 for bb0-bb1-bb0. Hence, the probabil-ity of a corrupted store in this case is 0.99− 0.27− 0.099 = 0.62. In the secondcase where the F branch should be taken in a fault-free execution (1% probability),if the fault modifies the branch, the probability of a corrupted store instruction is0.01∗0.9∗0.7 = 0.0063. Note that this is usually a very small value which can beignored. This is because the branch probabilities of a loop-terminating branch areusually highly biased due to the multiple iterations of the loop. So the total prob-ability in this example is approximated to be 0.62, which is what we calculatedabove. Equation 5.2 is simplified by integrating and cancelling out the terms in thecalculations. In this example, Pb is 0.99 (bb0-bb1), Pe is 0.7*0.9 (bb1-bb2-bb4),and thus Pc is 0.99∗0.7∗0.9 = Details: Memory Sub-Model ( fm )Recap that fm reports the probability for the error to propagate from the corruptedmemory locations to the program output. The idea is to represent memory datadependencies between the load and store instructions in an execution, so that themodel can trace the error propagation in the memory.We use the code example in Figure 5.4A to show how we prune the size of thememory dependency graph in fm by removing redundant dependencies (if any).There are two inner loops in the program. The first one executes first, storing datato an array in memory (INDEX 1). The second loop executes later, loading the datafrom the memory (INDEX 2). Then the program makes some decision (INDEX 3)and decides whether the data should be printed (INDEX 4) to the program output.61Figure 5.4: Examples for Memory Sub-modelNote that the iterations between loops are symmetric in the example, as both ma-nipulate the same array (one updates, and the other one reloads). This is often seenin programs because they tend to manipulate data in blocks due to spatial local-ity. In this example, if one of the dynamic instructions of INDEX 1 is corrupted,one of the dynamic instructions of INDEX 2 must be corrupted too. Therefore,instead of having one node for every dynamic load and store in the iterations of theloop executions, we need only two nodes in the graph to represent the dependen-cies. The rest of the dependencies in the iterations are redundant, and hence canbe removed from the graph as they share the same propagation. The dependenciesbetween dynamic loads and stores are tracked at runtime with their static indicesand operand memory addresses recorded. The redundant dependencies are prunedwhen repeated static load and store pairs are detected.We show the memory data dependency graph of fm for the code example inFigure 5.4B. Assume each loop is invoked once with many iterations. We cre-ate a node for the store (INDEX 1), load (INDEX 2) and printf (INDEX 3, asprogram output) in the graph. We draw an edge between nodes to present theirdependencies. Because INDEX 3 may cause divergence of the dependencies andhence error propagation, we weight the propagation probability based on its ex-62ecution probability. We place a NULL node as a placeholder indicating maskingif F branch is taken in INDEX 3. Note that an edge between nodes may alsorepresent a static data-dependent instruction sequence, e.g., the edge between IN-DEX 2 and INDEX 4. Therefore, fs is recursively called every time a static data-dependent instruction sequence is encountered. We then aggregate the propaga-tion probabilities starting from the node of INDEX 1 to each leaf node in thegraph. Each edge may have different propagation probabilities to aggregate –it depends on what fs outputs if a static data-dependent instruction sequence ispresent on the edge. In this example, assume that fs always outputs 1 as the prop-agation probability for each edge. Then, the propagation probability to the pro-gram output (INDEX 4), if one of the store (INDEX 1) in the loop is corrupted, is1∗1∗1∗0.6/(0.4+0.6)+1∗1∗0∗0.4/(0.4+0.6) = 0.6. The zero in the secondterm represents the masking of the NULL node. As an optimization, we memoizethe propagation results calculated for store instructions to speed up the algorithm.For example, if later the algorithm encounters INDEX 1, we can use the memo-ized results, instead of recomputing them. We will evaluate the effectiveness of thepruning in Section 5.4.3.Floating Point: When we encounter any floating point data type, we apply anadditional masking probability based on the output format of the floating point data.For example, in benchmarks such as Hotspot, the float data type is used. By default,Float carries 7-digit precision, but in (many) programs’ output, a “%g” parameteris specified in printf which prints numbers with only 2-digit precision. Based onthe specification of IEEE-754 [6], we assume that only the mantissa bits (23 bitsin Float) may affect the 5 digits that are cut off in the precision. This is becausebit-flips in exponential bits likely cause large deviations in values, and so cutting-off the 5 digits in the precision is unlikely to mask the errors in the exponent. Wealso assume that each mantissa bit has equal probability to affect the missing 5digits of precision. In that way, we approximate the propagation probability to be((32-23)+23*(2/7))/32 = 48.66%. We apply this masking probability on top thepropagation probabilities, for Float data types used with the non-default format ofprintf.635.4 EvaluationIn this section, we evaluate TRIDENT in terms of its accuracy and scalability.To evaluate accuracy, we use TRIDENT to predict overall SDC probabilities ofprograms as well as the SDC probabilities for individual instructions, and comparethem with those obtained using FI and the simpler models. To evaluate scalability,we measure the time for executing TRIDENT, and compare it with the time takenby FI. We first present the experimental setup and then the results. We also makeTRIDENT and the experimental data publicly available1.5.4.1 Experimental SetupBenchmarksWe choose eleven benchmarks from common benchmark suites [18, 33, 68], andpublicly available scientific programs [8, 79, 136] — they are listed in Table 7.1.Our benchmark selection is based on three criteria: (1) Diversity of domains andbenchmark suites, (2) whether we can compile with our LLVM infrastructure, and(3) whether fault injection experiments of the programs can finish within a rea-sonable amount of time. We compiled each benchmark with LLVM with standardoptimizations (-O2).FI MethodWe use LLFI [147] which is a publicly available open-source fault injector to per-form FIs at the LLVM IR level on these benchmarks. LLFI has been shown to beaccurate in evaluating SDC probabilities of programs compared to assembly codelevel injections [147]. We inject faults into the destination registers of the executedinstructions to simulate faults in the computational elements of the processor as perour fault model. Further, we inject single bit flips as these are the de-facto modelfor emulating soft errors at the program level, and have been found to be accuratefor SDCs [121]. There is only one fault injected in each run, as soft errors are rareevents with respect to the time of execution of a program. Our FI method ensuresthat all faults are activated, i.e., read by an instruction of the program, as we define1 5.1: Characteristics of BenchmarksBenchmark Suite/Author Area Program InputLibquantum SPEC Quantum comput-ing33 5Blackscholes Parsec Finance in 4.txtSad Parboil Video encoding. reference.binframe.binBfs Parboil Graph traversal graph input.datHercules Carnegie Mel-lon UniversityEarthquake simula-tionscan sim-ple case.eLulesh Lawrence Liv-ermore NationalLaboratoryHydrodynamicsmodeling-s 1 -pPuReMD Purdue Univer-sityReactive molec-ular dynamicssimulationgeo ffield con-trolNw Rodinia DNA sequence op-timization2048 10 1Pathfinder Rodinia Dynamic program-ming1000 10Hotspot Rodinia Temperature andpower simulation64 64 1 1temp 64power 64Bfs Rodinia Graph traversal graph4096.txtSDC probabilities based on the activated instructions (Section 3.2). The FI methodis in line with other papers in the area [12, 12, 51, 64, 82].5.4.2 AccuracyWe design two experiments to evaluate the accuracy of TRIDENT. The first ex-periment examines the prediction of overall SDC probabilities of programs, andthe second examines predicted SDC probabilities of individual instructions. In theexperiments, we compare the results derived from TRIDENT with those from thetwo simpler models and FI. As described earlier, TRIDENT consists of three sub-models in order: fs , fc and fm . We create two simpler models to (1) understandthe accuracy gained by enabling each sub-model and (2) as a proxy to investigate65other models, which often lack modeling beyond static data dependencies (Sec-tion 5.6.3 performs a more detailed comparison with prior work). We first disablefm in TRIDENT, leaving the two sub-models fs and fc enabled, to create a model:fs + fc . We then further remove fc to create the second simplified model whichonly has fs enabled, which we represent as fs .Overall SDC probabilityFigure 5.5: Overall SDC Probabilities Measured by FI and Predicted by theThree Models (Margin of Error for FI:±0.07% to±1.76% at 95% Con-fidence)To evaluate the overall SDC probability of a given program, we use statisticalFI. We measure error bars for statistical significance at the 95% confidence level.We randomly sample 3,000 dynamic instructions for FIs (one fault per run) as theseyield tight error bars at the 95% confidence level (±0.07% to ±1.76%) - this is inline with other work that uses FI. We calculate SDC probability of each programbased on how many injected faults result in SDC. We then use TRIDENT, aswell as the two simpler models, to predict the SDC probability of each program,and compare the results with those from FI. To ensure fair comparison, we sample3,000 instructions in our models as well (Section 5.3.1).The results are shown in Figure 5.5. We use FI to represent the FI method,TRIDENT for our three-level model, and fs+fc and fs for the two simpler mod-els. We find TRIDENT prediction matches the overall SDC probabilities obtainedthrough FI, with a maximum difference of 14.26% in Sad, and a minimum dif-ference of 0.11% in Blackscholes, both in percentage points. This gives a meanabsolute error of 4.75% in overall SDC prediction. On the other hand, fs + fc andfs have a mean absolute error of 19.56% and 15.13% respectively compared to FI –more than 4 and 3 times higher than those obtained using the complete three-levelmodel. On average, fs + fc and fs predict the overall SDC probability as 33.85%66and 23.76% across the different programs, whereas TRIDENT predicts it to be14.83%. The SDC probability obtained from FI is 13.59%, which is much more inline with the predictions of TRIDENT.We observe that in Sad, Lulesh and Pathfinder, TRIDENT encounters rela-tively larger differences between the prediction and the FI results (14.26%, 7.48%and 8.87% respectively). The inaccuracies are due to a combination of gaps in theimplementation, assumptions, and heuristics we used in TRIDENT. We discussthem in Section 5.6.1.To compare the results more rigorously, we use a paired T-test experiment [134]to determine how similar the predictions of the overall SDC probabilities by TRI-DENT are to the FI results.2 Since we have 11 benchmarks, we have 11 sets ofpaired data with one side being FI results and the other side being the predictionvalues of TRIDENT. The null hypothesis is that there is no statistically signifi-cant difference between the results from FIs and the predicted SDC probabilitiesby TRIDENT in the 11 benchmarks. We calculate the p-value in the T-test as0.764. By the conventional criteria (p-value>0.05), we fail to reject the null hy-pothesis, indicating that the predicted overall SDC probabilities by TRIDENT arenot statistically different from those obtained by FI.We find that the model fs + fc always over-predicts SDCs compared with TRI-DENT. This is because an SDC is assumed once an error propagates to store in-structions, which is not always the case, as it may not propagate to the programoutput. On the other hand, fs may either over-predict SDCs (e.g., Libquantum,Hercules) because an SDC is assumed once an error directly hits any static data-dependent instruction sequence ending with a store, or under-predict them (e.g.,Bfs, Blackscholes) because error propagation is not tracked after control-flow di-vergence.SDC Probability of Individual InstructionsWe now examine the SDC probabilities of individual instructions predicted byTRIDENT and compare them to the FI results. The number of static instruc-2We have verified visually that the differences between the two sides of every pair are approxi-mately normally distributed in all the T-test experiments we conduct, which is the requirement forvalidity of the T-test.67tions per benchmark varies from 76 to 4,704, with an average of 944 instructions.Because performing FIs into each individual instruction is very time-consuming,we choose to inject 100 random faults per instruction to bound our experimentaltime. We then input each static instruction to TRIDENT, as well as the two sim-pler models ( fs + fc and fs ), to compare their predictions with the FI results. Asbefore, we conduct paired T-test experiments [134] to measure the similarity (ornot) of the predictions to the FI results. The null hypothesis for each of the threemodels in each benchmark is that there is no difference between the FI results andthe predicted SDC probability values in each instruction.Table 5.2: p-values of T-test Experiments in the Prediction of Individual In-struction SDC Probability Values (p > 0.05 indicates that we are not ableto reject our null hypothesis – the counter-cases are shown in bold)Benchmark TRIDENT fs+fc fsLibquantum 0.602 0.000 0.000Blackscholes 0.392 0.173 0.832Sad 0.000 0.003 0.000Bfs (Parboil) 0.893 0.000 0.261Hercules 0.163 0.000 0.003Lulesh 0.000 0.000 0.000PureMD 0.277 0.000 0.000Nw 0.059 0.000 0.000Pathfinder 0.033 0.130 0.178Hotspot 0.166 0.000 0.000Bfs (Rodinia) 0.497 0.001 0.126No. of rejections 3/11 9/11 7/11The p-values of the experiments are listed in the Table 5.2. At the 95% confi-dence level, using the standard criteria (p > 0.05), we are not able to reject the nullhypothesis in 8 out of the 11 benchmarks using TRIDENT in the predictions. Thisindicates that the predictions of TRIDENT are shown to be statistically indistin-guishable from the FI results in most of the benchmarks we used. The three outliersfor TRIDENT again are Sad, Lulesh and Pathfinder. Again, even though the in-dividual instructions’ SDC probabilities predicted are statistically distinguishablefrom the FI results, these predicted values are still reasonably close to the FI results.68In contrast, when using fs + fc and fs to predict SDC probabilities for each individ-ual instruction, there are only 2 and 4 out of the 11 benchmarks having p-valuesgreater than 0.05, indicating that the null hypotheses cannot be rejected for mostof the benchmarks. In other words, the predictions from the simpler models forindividual instructions are (statistically) significantly different from the FI results.5.4.3 ScalabilityIn this section, we evaluate the scalability of TRIDENT to predict the overallSDC probabilities of programs and the SDC probabilities of individual instructions,and compare it to FI. By scalability, we mean the ability of the model to handlelarge numbers of instruction samples in order to obtain tighter bounds on the SDCprobabilities. In general, the higher the number of sampled instructions, the higherthe accuracy and hence the tighter are the bounds on SDC probabilities for a givenconfidence level (e.g., 95% confidence). This is true for both TRIDENT and forFI. The number of instructions sampled for FI in prior work varies from 1,000 [147]to a few thousands [51, 52, 90]. We vary the number of samples from 500 to 7,000.The number of samples is equal to the number of FI trials as one fault is injectedper trial.Note that the total computation is proportional to both the time and powerrequired to run each approach. Parallelization will reduce the time spent, but notthe power consumed. We assume there is no parallelization for the purpose ofcomparison in the case of TRIDENT and FI, though both TRIDENT and FI canbe parallelized. Therefore, the computation can be measured by the wall-clocktime.Overall SDC ProbabilityThe results of the time spent to predict the overall SDC probability of programare shown in Figure 5.6A. The time taken in the figure is projected based on themeasurement of one FI trial (averaged over 30 FI runs). As seen, the curve of FItime versus number of samples is much steeper than that of TRIDENT, which isalmost flat. TRIDENT is 2.37 times faster than the FI method at 1,000 samples,it is 6.7 times faster at 3,000 samples and 15.13 times faster at 7,000 samples.69Figure 5.6: Computation Spent to Predict SDC ProbabilityFrom 500 to 7,000 samples, the time taken by TRIDENT increases only 1.06times (0.2453 to 0.2588), whereas it increases 14 times (0.2453 to 3.9164) for FI- an exact linear increase. The profiling phase of TRIDENT takes 0.24 hours(or about 15 minutes) on average. This is a fixed cost incurred by TRIDENTregardless of the number of sampled instructions. However, once the model isbuilt, the incremental cost of calculating the SDC probability of a new instructionis minimal (we only calculate the SDC probabilities on demand to save time). FIdoes not incur a noticeable fixed cost, but its time rapidly increases as the numberof sampled instructions increase. This is because FI has to run the application fromscratch on each trial, and hence ends up being much slower than TRIDENT as thenumber of samples increase.Individual InstructionsFigure 5.6B compares the average time taken by TRIDENT to predict SDC prob-abilities of individual instructions with FI, for different numbers of static instruc-tions. We consider different numbers of samples for each static instruction chosenfor FI: 100, 500 and 1,000 (as mentioned in Section 5.3.1, TRIDENT does notneed samples for individual instructions’ SDC probabilities). We denote the num-ber of samples as a suffix for the FI technique. For example, FI-100 indicates 100samples are chosen for performing FI on individual instructions. We also vary the70number of static instructions from 50 to 7,000 (this is the X-axis). As seen fromthe curves, the time taken by TRIDENT as the number of static instructions varyremains almost flat. On average, it takes 0.2416 hours at 50 static instructions, and0.5009 hours at 7,000 static instructions, which is only about a 2X increase. Incomparison, the corresponding increases for FI-100 is 140X, which is linear withthe number of instructions. Other FI curves experience even steeper increases asthey gather more samples per instruction.Figure 5.7: Time Taken to Derive the SDC Probabilities of Individual Instruc-tions in Each BenchmarkFigure 5.7 shows the time taken by TRIDENT and FI-100 to derive the SDCprobabilities of individual instructions in each benchmark (due to space constraints,we do not show the other FI values, but the trends were similar). As can be seen,there is wide variation in the times taken by TRIDENT depending on the bench-mark program. For example, the time taken in PureMD is 2.893 hours, whereas itis 2.8 seconds in Pathfinder. This is because the time taken by TRIDENT dependson factors such as (1) the total number of static instructions, (2) the length of staticdata-dependent instruction sequence, (3) the number of dynamic branches that re-quire profiling, and (4) the number of redundant dependencies that can be pruned.The main reason for the drastic difference between PureMD and Pathfinder is thatwe can prune only 0.08% of the redundant dependencies in the former, while wecan prune 99.83% of the dependencies in the latter. On average, 61.87% of dy-namic load and store instructions are redundant and hence removed from the mem-ory dependency graph.71Figure 5.8: SDC Probability Reduction with Selective Instruction Dupli-cation at 11.78% and 23.31% Overhead Bounds (Margin of Error:±0.07% to ±1.76% at 95% Confidence)5.5 Use Case: Selective Instruction DuplicationIn this section, we demonstrate the utility of TRIDENT by considering a use-case of selectively protecting a program from SDC causing errors. The idea is toprotect only the most SDC-prone instructions in a program so as to achieve highcoverage while bounding performance costs. We consider instruction duplicationas the protection technique, as it has been used in prior work [51, 52, 97]. Theproblem setting is as follows: given a certain performance overhead P, what staticinstructions should be duplicated in order to maximize the coverage for SDCs whilekeeping the overhead below P.Solving the above problem involves finding the SDC probability of each in-struction in the program in order to decide which set of instructions should beduplicated. It also involves calculating the performance overhead of duplicatingthe instructions. We use TRIDENT for the former, namely, to estimate the SDCprobability of each instruction, without using FI. For the latter, we use the dy-namic execution count of each instruction as a proxy for the performance over-head incurred by it. We then formulate the problem as a classical 0-1 knapsackproblem [99], where the objects are the instructions and the knapsack capacity isrepresented by P, the maximum allowable performance overhead. Further, objectprofits are represented by the estimated SDC probability (and hence selecting theinstruction means obtaining the coverage), and object costs are represented by thedynamic execution count of the instruction. Note that we assume that the SDCprobability estimates of the instructions are independent of each other – while thisis not necessarily true in practice, it keeps the model tractable, and in the worst72case leads to conservative protection (i.e., over-protection). We use the dynamicprogramming algorithm for the 0-1 knapsack problem - this is similar to what priorwork did [97].For the maximum performance overhead P, we first measure the overhead ofduplicating all the instructions in the program (i.e., full duplication) and set this asthe baseline as it represents the worst-case overhead. The overheads are measuredbased on the wall-clock time of the actual execution of the duplicated programs(averaged on 3 executions each). We find that full duplication incurs an overheadof 36.18% across benchmarks. We consider 2 overhead bound levels, namely the1/3rd and 2/3rd of the full duplication overheads, which are (1) 11.78% and (2)23.31% respectively.For each overhead level, our algorithm chooses the instructions to protect us-ing the knapsack algorithm. The chosen instructions are then duplicated using aspecial pass in LLVM we wrote, and the duplication occurs at the LLVM IR level.Our pass also places a comparison instruction after each instruction protected todetect any deviations of the original computations and duplicated computations.If protected instructions are data dependent on the same static data-dependent in-struction sequence, we only place one comparison instruction at the latter protectedinstruction to reduce performance overhead. This is similar to what other relatedwork did [51, 97]. For comparison purposes, we repeat the above process using thetwo simpler models ( fs + fc and fs ). We then use FI to obtain the SDC probabilitiesof the programs protected using the different models at different overhead levels.Note that FI is used only for the evaluation and not for any of the models.Figure 5.8 shows the results of the SDC probability reduction at different pro-tection levels. Without protection, the average SDC probability of the programs is13.59%. At the 11.78% overhead level, after protection based on TRIDENT, fs+ fc and fs the corresponding SDC probabilities are 5.50%, 5.53%, 9.29% respec-tively. On average, the protections provided by the three models reduce the SDCprobabilities by 64%, 64% and 40% respectively. At the 23.31% overhead level,after the protections based on TRIDENT, fs + fc and fs respectively, the averageSDC probabilities are 1.55%, 2.00% and 4.04%. This corresponds to a reductionof 90%, 87% and 74% of the SDC probability in the baseline respectively. Thus,on average, TRIDENT provides a higher SDC probability reduction for the same73Figure 5.9: Overall SDC Probabilities Measured by FI and Predicted by TRI-DENT, ePVF and PVF (Margin of Error: ±0.07% to ±1.76% at 95%Confidence)overhead level compared with the two simpler models.Taking a closer look, the protection based on fs + fc achieves comparable SDCprobability reductions with TRIDENT. This is because the relative ranking ofSDC probabilities between instructions plays a more dominant role in the selectiveprotection than the absolute SDC probabilities. The ranking of the SDC proba-bilities of individual instructions derived by fs + fc is similar to that derived byTRIDENT. Adding fm boosts the overall accuracy of the model in predicting theabsolute SDC probabilities (Figure 5.5), but not the relative SDC probabilities –the only exception is Libquantum. This shows the importance of modeling control-flow divergence, which is missing in other existing techniques [51, 52, 130].5.6 DiscussionWe first investigate the sources of inaccuracy in TRIDENT based on the experi-mental results (Section 5.4). We then examine some of the threats to the validityof our evaluation. Finally, we compare TRIDENT with two closely related priortechniques, namely PVF and ePVF.5.6.1 Sources of InaccuracyErrors in Store Address: If a fault modifies the address of a store instruction,in most cases, an immediate crash would occur because the instruction accessesmemory that is out of bounds. However, if the fault does not cause a crash, itcan corrupt an arbitrary memory location, and may eventually lead to SDC. It isdifficult to analyze which memory locations may be corrupted as a result of suchfaults, leading to inaccuracy in the case. In our fault injection experiments, weobserve that on average about 5.05% of faults affect addresses in store instructions74and survive from crashes.Memory Copy: Another source of inaccuracy in TRIDENT is that we donot handle bulk memory operations such as memmove and memcpy, which arerepresented by special instructions in the LLVM IR. We find such operations inbenchmark such as Sad, Lulesh, Hercules and PureMD, which makes our techniquesomewhat inaccurate for these programs.Manipulation of Corrupted Bits: As mentioned in Section 5.3.3, we assumeonly instructions such as comparisons, logical operators and casts have maskingeffects to simplify our calculations, and that none of the other instructions maskthe corrupted bits. However, this is not always the case as other instructions mayalso cause masking. For example, division operations such as fdiv may also averageout corrupted bits in the mantissa of floating point numbers, and hence mask errors.We find that 1% of the faults affect fdiv in program such as Lulesh, thereby leadingto inaccuracies.Conservatism in Determining Memory Corruption: Recall that when control-flow divergence happens, we assume all the store instructions that are dominatedby the faulty branch are corrupted (Section 5.3). This is a conservative assumption,as some stores may end up being coincidentally correct. For example, if a store in-struction is supposed to write a zero to its memory location, but is not executed dueto the faulty branch, the location will still be correct if there was a zero already inthat location. These are called lucky loads [42, 51].5.6.2 Threats to ValidityBenchmarks: As mentioned in Section 5.4.1, we choose 11 programs to encom-pass a wide variety of domains rather than sticking to just one benchmark suite(unlike performance evaluation, there is no standard benchmark suite for reliabil-ity evaluation). Our results may be specific to our choice of benchmarks, thoughwe have not observed this to be the case. Other work in this domain makes similardecisions [51, 97].Platforms: In this work, we focus on CPU programs for TRIDENT. GraphicProcessing Units (GPU) are another important platform for reliability studies. Wehave attempted to run TRIDENT on GPU programs, but were crippled by the lack75of automated tools for code analysis and fault injection on GPUs. Our preliminaryresults in this domain using small CUDA kernels (instrumented manually) confirmthe accuracy of TRIDENT. However, more rigorous evaluation is needed.Program Input: As the high-fidelity fault injection experiments take a longtime (Section 5.4.3), we run each program only under 1 input. This is also thecase for almost all other studies we are aware of in this space [51, 52]. Di Leoet at. [46] have found SDC probabilities of programs may change under differentprogram inputs. We plan to consider multiple inputs in our future work.Fault Injection Methodology: We use LLFI, a fault injector that works atthe LLVM IR level, to inject single bit flips. While this method is accurate forestimating SDC probabilities of programs [121, 147], it remains an open questionas to how accurate it is for other failure types. That said, our focus in this chapteris SDCs, and so this is an appropriate choice for us.5.6.3 Comparison with ePVF and PVFePVF (enhanced PVF) is a recent modeling technique for error propagation in pro-grams [51]. It shares the same goal with TRIDENT in predicting the SDC proba-bility of a program, both at the aggregate level and instruction level. ePVF is basedon PVF [130], which stands for Program Vulnerability Factor. The main differ-ence is that PVF does not distinguish between crash-causing faults and SDCs, andhence its accuracy of SDC prediction is poor [51]. ePVF improves the accuracyof PVF by removing most crashes from the SDC prediction. Unfortunately, ePVFcannot distinguish between benign faults and SDCs, and hence its accuracy suffersaccordingly [51]. This is because ePVF only models error propagation in staticdata-dependent instruction sequence and in memory if the static data-dependentinstruction sequence ends with a store instruction, ignoring error propagation tocontrol-flow and other parts of memory. Both ePVF and PVF, like TRIDENT,require no FI in their prediction of SDC, and can be implemented at the LLVMIR level3. We implement both techniques using LLVM, and compare their resultswith TRIDENT’s results.Since crashes and SDCs are mutually exclusive, by removing the crash-causing3ePVF was originally implemented using LLVM, but not PVF.76faults, ePVF computes a relatively closer result to SDC probability than PVF [51].However, the crash propagation model proposed by ePVF in identifying crashesrequires a detailed DDG of the entire program’s execution, which is extremelytime-consuming and resource hungry. As a result, ePVF can be only executed inprograms with a maximum of a million dynamic instructions in practice [51]. Toaddress this issue and reproduce ePVF on our benchmarks and workloads (aver-age 109 million dynamic instructions), we modify ePVF by replacing its crashpropagation model with the measured results from FI. In other words, we assumeePVF identifies 100% of the crashes accurately, which is higher than the accuracyof the ePVF model. Hence, this comparison is conservative as it overestimates theaccuracy of ePVF.We use TRIDENT, ePVF and PVF to compute the SDC probabilities of thesame benchmarks and workloads, and then compare them with FI which serves asour ground truth. The number of randomly sampled faults are 3,000. The resultsare shown in Figure 5.9. As shown, ePVF consistently overestimates the SDCprobabilities of the programs with a mean absolute error of 36.78% whereas it is4.75% in TRIDENT. PVF results in an even larger mean absolute error of 75.19%as it does not identify crashes. The observations are consistent with those reportedby Fang et al. [51]. The average SDC probability measured by FI is 13.59%. ePVFand PVF predict it as 52.55% and 90.62% respectively, while TRIDENT predictsit as 14.83% and is significantly more accurate as a result.5.7 SummaryIn this chapter, we proposed TRIDENT, a three-level model for soft error propaga-tion in programs. TRIDENT abstracts error propagation at static instruction level,control-flow level and memory level, and does not need any fault injection (FI). Weimplemented TRIDENT in the LLVM compiler, and evaluated it on 11 programs.We found that TRIDENT achieves comparable accuracy as FI, but is much fasterand scalable both for predicting the overall SDC probabilities of programs, and theSDC probabilities of individual instructions in a program. We also demonstratedthat TRIDENT can be used to guide selective instruction duplication techniques,and is significantly more accurate than simpler models.77Chapter 6Modeling Input-Dependent ErrorPropagation in ProgramsIn this chapter, we discuss how program inputs could affect error propagation inprograms as real-world applications are executed with arbitrarily different inputsin production environment. We first explain the common misunderstandings foundin the literature about error propagation versus program inputs. Then we identifythe predominant components that need to be considered when modeling input-dependent error propagation. We find that it is possible to extend the analyticalmodel discussed in Chapter 5 to support multiple inputs while achieving a rea-sonably high accuracy and speed. Finally, we show the evaluation results of theextended model in bounding the SDC probabilities of a given program with multi-ple inputs.6.1 IntroductionFault injections (FIs) are commonly used for evaluating and characterizing pro-grams’ resilience, and to obtain the overall SDC probability of a program. In eachFI campaign, a single fault is injected into a randomly sampled instruction, and theprogram is executed till it crashes or finishes. FI therefore requires that the programis executed with a specific input. In practice, a large number of FI campaigns areusually required to achieve statistical significance, which can be extremely time-78consuming. As a result, most prior work limits itself to a single program input orat most a small number of inputs. Unfortunately, the number of possible inputscan be large, and there is often significant variance in SDC probabilities acrossprogram inputs. For example, in our experiments, we find that the overall SDCprobabilities of the same program (Lulesh) can vary by more than 42 times underdifferent inputs. This seriously compromises the correctness of the results from FI.Therefore, there is a need to characterize the variation in SDC probabilities acrossmultiple inputs, without expensive FIs.We find that there are two factors determining the variation of the SDC prob-abilities of the program across its inputs (we call this the SDC volatility): (1) Dy-namic execution footprint of each instruction, and (2) SDC probability of eachinstruction (i.e., error propagation behaviour of instructions). Almost all existingtechniques [43, 46, 54] on quantifying programs’ failure variability across inputsconsider only the execution footprint of instructions. However, we find that the er-ror propagation behavior of individual instructions often plays as important a rolein influencing the SDC volatility (Section 6.2). Therefore, all existing techniquesexperience significant inaccuracy in determining a program’s SDC volatility.In this chapter, we propose an automated technique to determine the SDCvolatility of a program across different inputs, that takes into account both theexecution footprint of individual instructions, and their error propagation probabil-ities. Our approach consists of three steps. First, we perform experimental stud-ies using FI to analyze the properties of SDC volatility, and identify the sourcesof the volatility. We then build a model, VTRIDENT, which predicts the SDCvolatility of programs automatically without any FIs. VTRIDENT is built on ourprior model, TRIDENT (Chapter 5) for predicting error propagation, but sacrificessome accuracy for speed of execution. Because we need to run VTRIDENT formultiple inputs, execution speed is much more important than in the case of TRI-DENT. The intuition is that for identifying the SDC volatility, it is more importantto predict the relative SDC probabilities among inputs than the absolute probabil-ities. Finally, we use VTRIDENT to bound the SDC probabilities of a programacross multiple inputs, while performing FI on only a single input. To the best ofour knowledge, we are the first to systematically study and model the variation ofSDC probabilities in programs across inputs.79The main contributions are as follows:• We identify two sources of SDC volatility in programs, namely INSTRUCTION-EXECUTION-VOLATILITY that captures the variation of dynamic executionfootprint of instructions, and INSTRUCTION-SDC-VOLATILITY that cap-tures the variability of error propagation in instructions, and mathematicallyderive their relationship (Section 6.2).• To understand how SDC probabilities vary across inputs, we conduct a FIstudy using nine benchmarks with ten different program inputs for eachbenchmark, and quantify the relative contribution of INSTRUCTION-EXECUTION-VOLATILITY and INSTRUCTION-SDC-VOLATILITY (Section 6.3) to theoverall SDC volatility.• Based on the understanding, we build a model, VTRIDENT1, on top of ourprior framework for modeling error propagation in programs TRIDENT(Section 6.4.2). VTRIDENT predicts the SDC volatility of instructionswithout any FIs, and also bounds the SDC probabilities across a given setof inputs.• Finally, we evaluate the accuracy and scalability of VTRIDENT in identi-fying the SDC volatility of instructions (Section 6.5), and in bounding SDCprobabilities of program across inputs (Section 6.6).Our main results are as follows:• Volatility of overall SDC probabilities is due to both the INSTRUCTION-EXECUTION-VOLATILITY and INSTRUCTION-SDC-VOLATILITY. Usingonly INSTRUCTION-EXECUTION-VOLATILITY to predict the overall SDCvolatility of the program results in significant inaccuracies, i.e., an averageof 7.65x difference with FI results (up to 24x in the worst case).• We find that the accuracy of VTRIDENT is 87.81% when predicting theSDC volatility of individual instructions in the program. The average differ-ence between the variability predicted by VTRIDENT and that by FI is only1.26x (worst case is 1.29x).1VTRIDENT stands for “Volatility Prediction for TRIDENT”.80• With VTRIDENT 78.89% of the given program inputs’ overall SDC proba-bilities fall within the predicted bounds. With INSTRUCTION-EXECUTION-VOLATILITY alone, only 32.22% of the probabilities fall within the pre-dicted bounds.• Finally, the average execution time for VTRIDENT is about 15 minuteson an input of nearly 500 million dynamic instructions. This constitutes aspeedup of more than 8x compared with the TRIDENT model to bound theSDC probabilities, which is itself an order of magnitude faster than FI [59].6.2 Volatilities and SDCIn this section, we explain how we calculate the overall SDC probability of a pro-gram under multiple inputs. Statistical FI is the most common way to evaluate theoverall SDC probability of a program and has been used in other related work in thearea [43, 54, 65, 66, 88]. It randomly injects a large number (usually thousands) offaults under a given program input, one fault per program execution, by uniformlychoosing program instruction for injection from the set of all executed instructions.Equation 6.1 shows the calculation of the overall SDC probability of the pro-gram, Poverall , from statistical FI. NSDC is the number of FI campaigns that resultin SDCs among all the FI campaigns. Ntotal is the total number of FI campaigns.Equation 6.1 can be expanded to the equivalent equations shown in Equation 6.2.Pi is the SDC probability of each (static) instruction that is chosen for FI, Ni isthe amount of times that the static instruction is chosen for injection over all FIcampaigns. i to n indicates all the distinct static instructions that are chosen forinjection.Poverall = NSDC/Ntotal (6.1)= (n∑i=1Pi ∗Ni)/Ntotal =n∑i=1Pi ∗ (Ni/Ntotal) (6.2)In Equation 6.2, we can see that Ni/Ntotal and Pi are the two relevant factorsin the calculation of the overall SDC probability of the program. Ni/Ntotal can81be interpreted as the probability of the static instruction being sampled during theprogram execution. Because the faults are uniformly sampled during the programexecution, Ni/Ntotal is statistically equivalent to the ratio between the number of dy-namic executions of the chosen static instruction, and the total number of dynamicinstructions in the program execution. We call this ratio the dynamic executionfootprint of the static instruction. The larger the dynamic execution footprint of astatic instruction, the higher the chance that it is chosen for FI.Therefore, we identify two kinds of volatilities that affect the variation ofPoverall when program inputs are changed from Equation 6.2: (1) INSTRUCTION-SDC-VOLATILITY, and (2) INSTRUCTION-EXECUTION-VOLATILITY. INSTRUCTION-SDC-VOLATILITY represents the variation of Pi across the program inputs, INSTRUCTION-EXECUTION-VOLATILITY is equal to the variation of dynamic execution foot-prints, Ni/Ntotal , across the program inputs. We also define the variation of Poverallas OVERALL-SDC-VOLATILITY. As explained above, INSTRUCTION-EXECUTION-VOLATILITY can be calculated by profiling the number of dynamic instructionswhen inputs are changed, which is straight-forward to derive. However, INSTRUCTION-SDC-VOLATILITY is difficult to identify as Pi requires a large number of FI cam-paigns on every such instruction i with different inputs, which becomes imprac-tical when the program size and the number of inputs become large. As men-tioned earlier, prior work investigating OVERALL-SDC-VOLATILITY considersonly the INSTRUCTION-EXECUTION-VOLATILITY, and ignores INSTRUCTION-SDC-VOLATILITY [43, 54]. However, as we show in the next section, this canlead to significant inaccuracy in the estimates. Therefore, we focus on derivingINSTRUCTION-SDC-VOLATILITY efficiently in this paper.6.3 Initial FI StudyIn this section, we design experiments to show how INSTRUCTION-SDC-VOLATILITYand INSTRUCTION-EXECUTION-VOLATILITY contribute to OVERALL-SDC-VOLATILITY,then explain the variation of INSTRUCTION-SDC-VOLATILITY across programs.82Table 6.1: Characteristics of BenchmarksBenchmark Suite/Author Description Total Dy-namicInstructions(Millions)Libquantum SPEC Simulation of quan-tum computing6238.55Nw Rodinia A nonlinear globaloptimization methodfor DNA sequencealignments564.63Pathfinder Rodinia Use dynamic pro-gramming to find apath on a 2-D grid6.71Streamcluster Rodinia Dense Linear Algebra 3907.70Lulesh Lawrence Liv-ermore NationalLaboratoryScience and engi-neering problemsthat use modelinghydrodynamics3382.79Clomp Lawrence Liv-ermore NationalLaboratoryMeasurement of HPCperformance impacts11324.17CoMD Lawrence Liv-ermore NationalLaboratoryMolecular dynam-ics algorithms andworkloads17136.62FFT Open Source 2D fast Fourier trans-form6.37Graph Open Source Graph traversal in op-erational research0.156.3.1 Experiment SetupBenchmarksWe choose nine applications in total for our experiments. These are drawn fromstandard benchmark suites, as well as from real world applications. Note that thereare very few inputs provided with the benchmark applications, and hence we had togenerate them ourselves. We search the entire benchmark suites of Rodinia [33],SPLASH-2 [148], PARSEC [18] and SPEC [68], and choose applications basedon two criteria: (1) Compatibility with our toolset (i.e., we could compile themto LLVM IR and work with LLFI), and (2) Ability to generate diverse inputs forour experiments. For the latter criteria, we choose applications that take numericvalues as their program inputs, rather than binary files or files of unknown formats,since we cannot easily generate different inputs in these applications. As a result,83there are only three applications in Rodinia and one application in SPEC meetingthe criteria. To include more benchmarks, we pick three HPC applications (Lulesh,Clomp, and CoMD) from Lawrence Livermore National Laboratory [70], and twoopen-source projects (FFT [72] and Graph [71]) from online repositories. Thenine benchmarks span a wide range of application domains from simulation tomeasurement, and are listed in Table 7.1.Input GenerationSince all the benchmarks we choose take numerical values as their inputs, we ran-domly generate numbers for their inputs. The inputs generated are chosen based ontwo criteria: (1) The input should not lead to any reported errors or exceptions thathalt the execution of the program, as such inputs may not be representative of theapplication’s behavior in production, And (2) The number of dynamic executed in-structions for the inputs should not exceed 50 billion to keep our experimental timereasonable. We report the total number of dynamic instructions generated from the10 inputs of each benchmark in Table 7.1. The average number of dynamic instruc-tions per input is 472.95 million, which is significantly larger than what have beenused in most other prior work [52, 88, 97, 150, 151]. We consider large inputs tostress VTRIDENT and evaluate its scalability.FI methodologyAs mentioned before, we use LLFI [147] to perform the FI experiments. For eachapplication, we inject 100 random faults for each static instruction of the applica-tion – this yields error bars ranging from 0.03% to 0.55% depending on the appli-cation for the 95% confidence intervals. Because we need to derive SDC proba-bilities of every static instruction, we have to perform multiple FIs on every staticinstruction in each benchmark. Therefore, to balance the experimental time withaccuracy, we choose to inject 100 faults on each static instruction. This adds up toa total number of injections ranging from 26,000 to 2,251,800 in each benchmark,depending on the number of static instructions in the program.846.3.2 ResultsINSTRUCTION-EXECUTION-VOLATILITY andOVERALL-SDC-VOLATILITYWe first investigate the relationship between INSTRUCTION-EXECUTION-VOLATILITYand OVERALL-SDC-VOLATILITY. As mentioned in Section 6.2, INSTRUCTION-EXECUTION-VOLATILITY is straight-forward to derive based on the executionprofile alone, and does not require performing any FIs. If it is indeed possible to es-timate OVERALL-SDC-VOLATILITY on the basis of INSTRUCTION-EXECUTION-VOLATILITY alone, we can directly plug in INSTRUCTION-EXECUTION-VOLATILITYto Ni and Ntotal in Equation 6.2 when different inputs are used and treat Pi as a con-stant (derived based on a single input) to calculate the overall SDC probabilities ofthe program with the inputs.We profiled INSTRUCTION-EXECUTION-VOLATILITY in each benchmark anduse it to calculate the overall SDC probabilities of each benchmark across all itsinputs. To show OVERALL-SDC-VOLATILITY, we calculate the differences be-tween the highest and the lowest overall SDC probabilities of each benchmark, andplot them in Figure 6.1. In the figure, Exec. Vol. represents the calculation withthe variation of INSTRUCTION-EXECUTION-VOLATILITY alone in Equation 6.2,treating Pi as a constant, which are derived by performing FI on only one input.FI indicates the results derived from FI experiment with the set of all inputs ofeach benchmark. As can be observed, the results for individual benchmark withOVERALL-SDC-VOLATILITY estimated from Exec. Vol. alone are significantlylower than the FI results (up to 24x in Pathfinder). The average difference is 7.65x.This shows that INSTRUCTION-EXECUTION-VOLATILITY alone is not sufficientto capture OVERALL-SDC-VOLATILITY, motivating the need for accurate esti-mation of INSTRUCTION-SDC-VOLATILITY. This is the focus of our work.Code Patterns Leading to INSTRUCTION-SDC-VOLATILITYTo figure out the root causes of INSTRUCTION-SDC-VOLATILITY, we analyzethe FI results and their error propagation based on the methodology proposedin our prior work [59]. We identify three cases leading to INSTRUCTION-SDC-85Figure 6.1: OVERALL-SDC-VOLATILITY Calculated by INSTRUCTION-EXECUTION-VOLATILITY Alone (Y-axis: OVERALL-SDC-VOLATILITY, Error Bar: 0.03% to 0.55% at 95% Confidence)VOLATILITY.Case 1: Value Ranges of Operands of InstructionsDifferent program inputs change the values that individual instructions operatewith. For example, in Figure 6.2A, there are three instructions (LOAD, CMP andBR) on a straight-line code sequence. Assume that under some INPUT A, R1 is16 and R0 is 512, leading the result of the CMP (R3) to be FALSE. Since thehighest bit of 512 is the 9th bit, any bit-flip at the bit positions that are higher than9 in R1 will modify R1 to a value that is greater than R0. This may in turn causethe result of the CMP instruction (R3) to be TRUE. In this case, the probabilityfor the fault that occurred at R1 of the LOAD instruction to propagate to R3 is(32-9)/32=71.88% (assuming a 32-bit data width of R1). In another INPUT B,assume R1 is still 16, but R0 becomes 64 of which the highest bit is the 6th bit.In this case, the probability for the same fault to propagate to R3 becomes (32-6)/32=81.25%. In this example, the propagation probability increases by almost10% for the same fault for a different input. In other words, the SDC volatility ofthe LOAD instruction in the example is changed by about 10%. We find that inthe nine benchmarks, the proportion of instructions that fall into this pattern variesfrom 3.07% (FFT) to 15.23% (Nw) - the average is 6.98%. The instructions exhibit86different error propagation even if the control flow does not change.Figure 6.2: Patterns Leading to INSTRUCTION-SDC-VOLATILITYCase 2: Execution Paths and BranchesDifferent program inputs may exercise different execution paths of programs.For example, in Figure 6.2B, there are three branch directions labeled with T1,F1 and T2. Each direction may lead to a different execution path. Assume thatthe execution probabilities of T1, F1 and T2 are 60%, 70% and 80% for someINPUT A. If a fault occurs at the BR instruction and modifies the direction ofthe branch from F1 to T1, the probability of this event is 70% as the executionprobability of F1 is 70%. In this case, the probability for the fault to propagateto the STORE instruction under T2 is 70%*80%=56%. Assuming there is anotherINPUT B which makes the execution probabilities of T1, F1 and T2, 10%, 90% and30% respectively. The probability for the same fault to propagate to the STOREinstruction becomes 90%*30%=27%. Thus, the propagation probability of thefault decreases by 29% from INPUT A to INPUT B, and thus the SDC volatilityof the BR instruction is 29%. In the nine benchmarks, we find that 43.28% ofthe branches on average exhibit variations of branch probabilities across inputs,leading to variation of SDC probability in instructions.Case 3: Number of Iterations of LoopsThe number of loop iterations can change when program inputs are changed,causing volatility of error propagation. For example, in Figure 6.2C, there is a loopwhose termination is controlled by the value of R2. The CMP instruction comparesR1 against R0 and stores it in R2. If the F branch is taken, the loop will continue,whereas if T branch is taken, the loop will terminate. Assume that under some87INPUT A the value of R0 is 4, and that in the second iteration of the loop, a faultoccurs at the CMP instruction and modifies R2 to TRUE from FALSE, causing theloop to terminate early. In this case, the STORE instruction is only executed twicewhereas it should be executed 4 times in a correct execution. Because of the earlytermination of the loop, there are 2 STORE executions missing. Assume there isanother INPUT B that makes R0 8, indicating there are 8 iterations of the loopin a correct execution. Now for the same fault in the second iteration, the loopterminates resulting in only 2 executions of the STORE whereas it should execute8 times. 6 STORE executions are missing with INPUT B (8-2=6). If the SDCprobability of the STORE instruction stays the same with the two inputs, INPUTB triples (6/2=3) the probability for the fault to propagate through the missingSTORE instruction, causing the SDC volatility. In the nine benchmarks, we findthat 90.21% of the loops execute different numbers of iterations when the input ischanged.6.4 Modeling INSTRUCTION-SDC-VOLATILITYWe first discuss the drawback of TRIDENT which is proposed in Chapter 5. Wethen describe VTRIDENT, an extension of TRIDENT to predict INSTRUCTION-SDC-VOLATILITY. The main difference between the two models is that VTRI-DENT simplifies the modeling in TRIDENT to improve running time, which isessential for processing multiple inputs.6.4.1 Drawbacks of TRIDENTEven though TRIDENT is orders of magnitude faster than FI and other modelsin measuring SDC probabilities, it can sometimes take a long time to execute de-pending on the program input. Further, when we want to calculate the variationin SDC probabilities across inputs, we need to execute TRIDENT once for eachinput, which can be very time-consuming. For example, if TRIDENT takes 30minutes on average per input for a given application (which is still considerablyfaster than FI), it would take more than 2 days (50 hours) to process 100 inputs.This is often unacceptable in practice. Further, because TRIDENT tracks mem-ory error propagation in a fine-grained manner, it needs to collect detailed memory88traces. In a few cases, these traces are too big to fit into memory, and hence wecannot run TRIDENT at all. This motivates VTRIDENT, which does not needdetailed memory traces, and is hence much faster.6.4.2 VTRIDENTAs mentioned above, the majority of time spent in executing TRIDENT is in pro-filing and traversing memory dependencies of the program, which is the bottleneckin scalability. VTRIDENT extends TRIDENT by pruning any repeating memorydependencies from the profiling, and keeping only distinct memory dependenciesfor tracing error propagation. The intuition is that if we equally apply the samepruning to all inputs in each program, similar scales of losses in accuracy will beexperienced across the inputs. Therefore, the relative SDC probabilities across in-puts are preserved. Since volatility depends only on the relative SDC probabilitiesacross inputs, the volatilities will also be preserved under pruning.WorkflowFigure 6.3 shows the workflow of VTRIDENT. It is implemented as a set of LLVMcompiler passes which take the code of the program (compiled into LLVM IR) anda set of inputs of the program. The output of VTRIDENT is the INSTRUCTION-SDC-VOLATILITY and INSTRUCTION-EXECUTION-VOLATILITY of the programacross all the inputs provided, both at the aggregate level and per-instruction level.Based on Equation 6.2, OVERALL-SDC-VOLATILITY can be computed usingINSTRUCTION-SDC-VOLATILITY and INSTRUCTION-EXECUTION-VOLATILITY.VTRIDENT executes the program with each input provided, and records thedifferences of SDC probabilities predicted between inputs to generate INSTRUCTION-SDC-VOLATILITY. During each execution, the program’s dynamic footprint isalso recorded for the calculation of INSTRUCTION-EXECUTION-VOLATILITY. Theentire process is fully automated and requires no intervention of the user. Further,no FIs are needed in any part of the process.89vTrident● Program code (LLVM IR)● Program inputs● Instruction-SDC-Volatility● Instruction-Execution-VolatilityFigure 6.3: Workflow of VTRIDENTExampleWe use an example from Graph in Figure 6.4A to illustrate the idea of VTRIDENTand its differences from TRIDENT. We make minor modifications for clarity andremove some irrelevant parts in the example. Although VTRIDENT works atthe level of LLVM IR, we show the corresponding C code for clarity. We firstexplain how TRIDENT works for the example, and then explain the differenceswith VTRIDENT.In Figure 6.4A, the C code consists of three functions, each of which containsa loop. In each loop, the same array is manipulated symmetrically in iterations ofthe loops and transferred between memory back and forth. So the load and storeinstructions in the loops (LOOP 1, 2 and 3) are all memory data-dependent. There-fore, if a fault contaminates any of them, it may propagate through the memorydependencies of the program. init() is called once at the beginning, then Parcour()and Recher() are invoked respectively in LOOP 4 and 5. printf (INDEX 6) at theend is the program’s output. In the example, we assume LOOP 4 and 5 executetwo iterations each for simplicity. Therefore, the fault leads to an SDC if the faultpropagates to the instruction.To model error propagation via memory dependencies of the program, a simi-lar memory dependency graph is created in Figure 6.5. Each node represents eithera dynamic load or store instruction of which indices and loop positions of theirstatic instructions are marked on their right. In the figure, each column of nodesindicates data-dependent executions of the instructions - there is no data flowingbetween columns as the array of data are manipulated by LOOP 1, 2 and 3 sym-metrically. In this case, TRIDENT finds the opportunity to prune the repeatedcolumns of nodes to speed up its modeling time as error propagations are similarin the columns. The pruned columns are drawn with dashed border in the fig-90Figure 6.4: Example of Memory Pruningure, and they indicate the pruning of the inner-most loops. TRIDENT applies thisoptimization for memory-level modeling, resulting in significant acceleration com-pared with previous modeling techniques [59]. However, as mentioned, the graphcan still take significant time to construct and process.To address this issue, VTRIDENT further prunes memory dependency bytracking error propagations only in distinct dependencies to speed up the mod-eling. Figure 6.4B shows the idea: The graph shown in the figure is pruned tothe one by TRIDENT in Figure 6.5. Arrows between nodes indicate propagationprobabilities in the straight-line code. Because there could be instructions leadingto crashes and error masking in straight-line code, the propagation probabilitiesare not 1. The propagation probabilities marked beside the arrows are aggregated91… … INDEX 1    LOOP 1TridentPruningS S S SL L L LS S S SL L L LS S S SL L L LS S S SL L L LS S S SP… … INDEX 2… … INDEX 3… … INDEX 2… … INDEX 3… … INDEX 4… … INDEX 5… … INDEX 4… … INDEX 5… … INDEX 6LOOP 2LOOP 2LOOP 3LOOP 3LOOP 4LOOP 5Figure 6.5: Memory Dependency Pruning in TRIDENTto compute SDC probabilities for INPUT A and INPUT B respectively. For ex-ample, if a fault occurs at INDEX 1, the SDC probability for the fault to reachprogram output (INDEX 6) is calculated as 1 ∗ 1 ∗ 0.5 ∗ 0.5 = 25% for INPUT A,and 1∗1∗0.8∗0.8= 64% for INPUT B. Thus, the variation of the SDC probabilityis 39% for these two inputs. VTRIDENT prunes the propagation by removing re-peated dependencies (their nodes are drawn in dashed border in Figure 6.4B). Thecalculation of SDC probability for the fault that occurred at INDEX 1 to INDEX6 becomes 1*0.5 = 50% with INPUT A, and 1*0.8 = 80% with INPUT B. Thevariation between the two inputs thus becomes 30%, which is 9% lower than thatcomputed by TRIDENT (i.e., without any pruning).We make two observations from the above discussion: (1) If the propagationprobabilities are 1 or 0, the pruning does not result in loss of accuracy (e.g., LOOP4 in Figure 6.4B). (2) The difference with and without pruning will be higher ifthe numbers of iterations become very large in the loops that contain non-1 or92non-0 propagation probabilities (i.e., LOOP 5 in Figure 6.4B). This is becausemore terms will be removed from the calculation by VTRIDENT. We find thatabout half (55.39%) of all faults propagating in the straight-line code have eitherall 1s or at least one 0 as the propagation probabilities, and thus there is no loss inaccuracy for these faults. Further, the second case is rare because large iterationsof aggregation on non-1 or non-0 numbers will result in an extremely small valueof the overall SDC probability. This is not the case as the average SDC probabilityis 10.74% across benchmarks. Therefore, the pruning does not result in significantaccuracy loss in VTRIDENT.6.5 Evaluation of VTRIDENTIn this section, we evaluate the accuracy and performance of VTRIDENT in pre-dicting INSTRUCTION-SDC-VOLATILITY across multiple inputs. We use the samebenchmarks and experimental procedure as before in Section 6.3. The code ofVTRIDENT can be found in our GitHub repository.26.5.1 AccuracyTo evaluate the ability of VTRIDENT in identifying INSTRUCTION-SDC-VOLATILITY,we first classify all the instructions based on their INSTRUCTION-SDC-VOLATILITYderived by FI and show their distributions – this serves as the ground truth. We clas-sify the differences of the SDC probabilities of each measured instruction betweeninputs into three categories based on their ranges of variance (<10%, 10%−20%and >20%), and calculate their distribution based on their dynamic footprints. Theresults are shown in Figure 6.6. As can be seen in the figure, on average, only3.53% of instructions across benchmarks exhibit variance of more than 20% in theSDC probabilities. Another 3.51% exhibit a variance between 10% and 20%. Theremaining 92.93% of the instructions exhibit within 10% variance across inputs.We then use VTRIDENT to predict the INSTRUCTION-SDC-VOLATILITYfor each instruction, and then compare the predictions with ground truth. Theseresults are also shown in Figure 6.6. As can be seen, for instructions that haveINSTRUCTION-SDC-VOLATILITY less than 10%, VTRIDENT gives relatively2 6.6: Distribution of INSTRUCTION-SDC-VOLATILITY predictionsby vTrident Versus Fault Injection Results (Y-axis: Percentage of in-structions, Error Bar: 0.03% to 0.55% at 95% Confidence)accurate predictions across benchmarks. On average, 97.11% of the instructionsare predicted to fall into this category by VTRIDENT, whereas FI measures it as92.93%. Since these constitute the vast majority of instructions, VTRIDENT hashigh accuracy overall.On the other hand, instructions that have INSTRUCTION-SDC-VOLATILITYof more than 20% are significantly underestimated by VTRIDENT, as VTRI-DENT predicts the proportion of such instructions as 1.84% whereas FI mea-sures it as 3.53% (which is almost 2x more). With that said, for individual bench-marks, VTRIDENT is able to distinguish the sensitivities of INSTRUCTION-SDC-VOLATILITY in most of them. For example, in Pathfinder which has the largestproportion of instructions that have INSTRUCTION-SDC-VOLATILITY greater than20%, VTRIDENT is able to accurately identify that this benchmark has the high-est proportion of such instructions relative to the other programs. However, we findVTRIDENT is not able to well identify the variations that are greater than 20%as mentioned above. This case can be found in Nw, Lulesh, Clomp and FFT. Wediscuss the sources of inaccuracy in Section 6.7.1. Since these instructions are rel-atively few in terms of dynamic instructions in the programs, this underpredictiondoes not significantly affect the accuracy of VTRIDENT.We then measure the overall accuracy of VTRIDENT in identifying INSTRUCTION-SDC-VOLATILITY. The accuracy is defined as the number of correctly predictedvariation categories of instructions over the total number of instructions being pre-dicted. We show the accuracy of VTRIDENT in Figure 6.7. As can be seen, thehighest accuracy is achieved in Streamcluster (99.17%), while the lowest accuracyis achieved in Clomp (67.55%). The average accuracy across nine benchmarks is9487.81%, indicating that VTRIDENT is able to identify most of the INSTRUCTION-SDC-VOLATILITY.Figure 6.7: Accuracy of VTRIDENT in Predicting INSTRUCTION-SDC-VOLATILITY Versus FI (Y-axis: Accuracy)Figure 6.8: OVERALL-SDC-VOLATILITY Measured by FI and Predicted byVTRIDENT, and INSTRUCTION-EXECUTION-VOLATILITY alone (Y-axis: OVERALL-SDC-VOLATILITY, Error Bar: 0.03% to 0.55% at95% Confidence)Finally, we show the accuracy of predicting OVERALL-SDC-VOLATILITY us-ing VTRIDENT, and using INSTRUCTION-EXECUTION-VOLATILITY alone (asbefore) in Figure 6.8. As can be seen, the average difference between VTRIDENTand FI is only 1.26x. Recall that the prediction using INSTRUCTION-EXECUTION-VOLATILITY alone (Exec. Vol.) gives an average difference of 7.65x (Section 6.3).95The worst case difference when considering only Exec. Vol. was 24.54x, while itis 1.29x (in Pathfinder) when INSTRUCTION-SDC-VOLATILITY is taken into ac-count. Similar trends are observed in all other benchmarks. This indicates that theaccuracy of OVERALL-SDC-VOLATILITY prediction is significantly higher whenconsidering both INSTRUCTION-SDC-VOLATILITY and INSTRUCTION-EXECUTION-VOLATILITY rather than just using INSTRUCTION-EXECUTION-VOLATILITY.6.5.2 PerformanceWe evaluate the performance of VTRIDENT based on its execution time, andcompare it with that of TRIDENT. We do not consider FI in this comparison asFI is orders of magnitude slower than TRIDENT [59]. We measure the time takenby executing VTRIDENT and TRIDENT in each benchmark, and compare thespeedup achieved by VTRIDENT over TRIDENT. The total computation is pro-portional to both the time and power required to run each approach. Parallelizationwill reduce the time spent, but not the power consumed. We assume that there is noparallelization for the purpose of comparison in the case of TRIDENT and VTRI-DENT, though both TRIDENT and VTRIDENT can be parallelized. Therefore,the speedup can be computed by measuring their wall-clock time.We also measure the time per input as both TRIDENT and VTRIDENT ex-perience similar slowdowns as the number of inputs increase (we confirmed thisexperimentally). The average execution time of VTRIDENT is 944 seconds perbenchmark per input (a little more than 15 minutes). Again, we emphasize thatthis is due to the considerably large input sizes we have considered in this study(Section 6.3).The results of the speedup by VTRIDENT over TRIDENT are shown in Fig-ure 6.9. We find that on average VTRIDENT is 8.05x faster than TRIDENT.The speedup in individual cases varies from 1.09x in Graph (85.16 seconds ver-sus. 78.38 seconds) to 33.56x in Streamcluster (3960 seconds versus. 118 sec-onds). The variation in speedup is because applications have different degrees ofmemory-boundedness: the more memory bounded an application is, the slower it iswith TRIDENT, and hence the larger the speedup obtained by VTRIDENT (as itdoes not need detailed memory dependency traces). For example, Streamcluster is96more memory-bound than computation-bound than Graph, and hence experiencesmuch higher speedups.Figure 6.9: Speedup Achieved by VTRIDENT over TRIDENT. Highernumbers are better.Note that we omit Clomp from the comparison since Clomp consumes morethan 32GB memory in TRIDENT, and hence crashes on our machine. This isbecause Clomp generates a huge memory-dependency trace in TRIDENT, whichexceeds the memory of our 32GB-memory machine (in reality, it experiences sig-nificant slowdown due to thrashing, and is terminated by the OS after a long time).On the other hand, VTRIDENT prunes the memory dependency and incurs only21.29MB memory overhead when processing Clomp.6.6 Bounding Overall SDC Probabilities withVTRIDENTIn this section, we describe how to use VTRIDENT to bound the overall SDCprobabilities of programs across given inputs by performing FI with only one se-lected input. We need FI because the goal of VTRIDENT is to predict the varia-tion in SDC probabilities, rather than the absolute SDC probability which is muchmore time-consuming to predict (Section 6.4.2). Therefore, FI gives us the abso-lute SDC probability for a given input. However, we only need to perform FI on97a single input to bound the SDC probabilities of any number of given inputs usingVTRIDENT, which is a significant savings as FI tends to be very time-consumingto get statistically significant results.For a given benchmark, we first use VTRIDENT to predict the OVERALL-SDC-VOLATILITY across all given inputs. Recall that OVERALL-SDC-VOLATILITYis the difference between the highest and the lowest overall SDC probabilities of theprogram across its inputs. We denote this range by R. We then use VTRIDENT tofind the input that results in the median of the overall SDC probabilities predictedamong all the given inputs. This is because we need to locate the center of therange in order to know the absolute values of the bounds. Using inputs other thanthe median will result in a shifting of the reference position, but will not changethe boundaries being identified, which are more important. Although VTRIDENTloses some accuracy in predicting SDC probabilities as we mentioned earlier, mostof the rankings of the predictions are preserved by VTRIDENT. Finally, we per-form FI on the selected input to measure the true SDC probability of the program,denoted by S. Note that it is possible to use other methods for this estimation (e.g.,TRIDENT [59]). The estimated lower and upper bounds of the overall SDC prob-ability of the program across all its given inputs is derived based on the medianSDC probability measured by FI, as shown below.[(S−R/2),(S+R/2)] (6.3)We bound the SDC probability of each program across its inputs using theabove method. We also use INSTRUCTION-EXECUTION-VOLATILITY alone forthe bounding as a point of comparison. The results are shown in Figure 6.10. Inthe figure, the triangles indicate the overall SDC probabilities with the ten inputsof each benchmark measured by FI. The overall SDC probability variations rangefrom 1.54x (Graph) to 42.01x (Lulesh) across different inputs. The solid linesin the figure bound the overall SDC probabilities predicted by VTRIDENT. Thedashed lines bound the overall SDC probabilities projected by considering only theINSTRUCTION-EXECUTION-VOLATILITY.On average, 78.89% of the overall SDC probabilities of the inputs measured byFI are within the bounds predicted by VTRIDENT. For the inputs that are outside98the bounds, almost all of them are very close to the bounds. The worst case isFFT, where the overall SDC probabilities of two inputs are far above the upperbounds predicted by VTRIDENT. The best cases are Streamcluster and CoMDwhere almost every input’s SDC probability falls within the bounds predicted byVTRIDENT (Section 6.7.1 explains why).Figure 6.10: Bounds of the Overall SDC Probabilities of Programs (Y-axis:SDC Probability; X-axis: Program Input; Solid Lines: Bounds derivedby VTRIDENT; Dashed Lines: Bounds derived by INSTRUCTION-EXECUTION-VOLATILITY alone, Error Bars: 0.03% to 0.55% at the95% Confidence). Triangles represent FI results.On the other hand, INSTRUCTION-EXECUTION-VOLATILITY alone boundsonly 32.22% SDC probabilities on average. This is a sharp decrease in the cov-erage of the bounds compared with VTRIDENT, indicating the importance ofconsidering INSTRUCTION-SDC-VOLATILITY when bounding overall SDC prob-abilities. The only exception is Streamcluster where considering INSTRUCTION-EXECUTION-VOLATILITY alone is sufficient in bounding SDC probabilities. Thisis because Streamcluster exhibits very little SDC volatility across inputs (Fig-ure 6.6).In addition to coverage, tight bounds are an important requirement, as a loosebounding (i.e., a large R in Equation 6.3) trivially increases the coverage of thebounding. To investigate the tightness of the bounding, we examine the resultsshown in Figure 6.8. Recall that OVERALL-SDC-VOLATILITY is represented byR, so the figure shows the accuracy of R. As we can see, VTRIDENT computesbounds that are comparable to the ones derived by FI (ground truth), indicating99that the bounds obtained are tight.6.7 DiscussionIn this section, we first summarize the sources of inaccuracy in VTRIDENT, andthen we discuss the implications of VTRIDENT for error mitigation techniques.6.7.1 Sources of InaccuracyOther than the loss of accuracy from the coarse-grain tracking in memory depen-dency (Section 6.4.2), we identify three potential sources of inaccuracy in identify-ing INSTRUCTION-SDC-VOLATILITY by VTRIDENT. They are also the sourcesof inaccuracy in TRIDENT, which VTRIDENT is based on. We explain howthey affect identifying INSTRUCTION-SDC-VOLATILITY here.Source 1: Manipulation of Corrupted BitsWe assume only instructions such as comparisons, logical operators and castshave masking effects, and that none of the other instructions mask the corruptedbits. However, this is not always the case as other instructions may also causemasking. For example, repeated division operations such as fdiv may also averageout corrupted bits in the mantissa of floating point numbers, and hence mask errors.The dynamic footprints of such instructions may be different across inputs hencecausing them to have different masking probabilities, so VTRIDENT does notcapture the volatility from such cases. For instance, in Lulesh, we observe that thenumber of fdiv may differ by as much as 9.5x between inputs.Source 2: Memory CopyVTRIDENT does not handle bulk memory operations such as memmove andmemcpy. Hence, we may lose track of error propagation in the memory dependen-cies built via such operations. Since different inputs may diversify memory depen-dencies, the diversified dependencies via the bulk memory operations may not beidentified either. Therefore, VTRIDENT may not be able to identify INSTRUCTION-SDC-VOLATILITY in these cases.Source 3: Conservatism in Determining Memory CorruptionWe assume all the store instructions that are dominated by the faulty branch arecorrupted when control-flow is corrupted, similar to the examples in Figure 6.2B100and Figure 6.2C. This is a conservative assumption, as some stores may end upbeing coincidentally correct. For example, if a store instruction is supposed towrite a zero to its memory location, but is not executed due to the faulty branch,the location will still be correct if there was a zero already in that location. Theseare called lucky loads in prior work [42]. When inputs change, the number of luckyloads may also change due to the changes of the distributions of such zeros inmemory, possibly causing volatility in SDC. VTRIDENT does not identify luckyloads, so it may not capture the volatility from such occasions.6.7.2 Implication for Mitigation TechniquesSelective instruction duplication is an emerging mitigation technique that providesconfigurable fault coverage based on performance overhead budget [52, 88, 89,97]. The idea is to protect only the most SDC-prone instructions in a programso as to achieve high fault coverage while bounding performance overheads. Theproblem setting is as follows: Given a certain performance overhead C, what staticinstructions should be duplicated in order to maximize the coverage for SDCs,F , while keeping the performance overhead below C. Solving the above probleminvolves finding two factors: (1) Pi: The SDC probability of each instruction in theprogram, to decide which set of instructions should be duplicated, and (2) Oi: Theperformance overhead incurred by duplicating the instructions. Then the problemcan be formulated as a classical 0-1 knapsack problem [99], where the objectsare the instructions and the knapsack capacity is represented by C, the maximumallowable performance overhead. Further, object profits are represented by theestimated SDC probability (and hence selecting the instruction means obtainingthe coverage F), and object costs are represented by the performance overhead ofduplicating the instructions.Almost all prior work investigating selective duplication confines their studyto a single input of each program in evaluating Pi and Oi [52, 88, 89, 97]. Hence,the protection is only optimal with respect to the input used in the evaluation. Be-cause of the INSTRUCTION-SDC-VOLATILITY and INSTRUCTION-EXECUTION-VOLATILITY incurred when the protected program executes with different inputs,there is no guarantee on the fault coverage F the protection aims to provide, com-101promising the effectiveness of the selective duplication. To address this issue,we argue that the selective duplication should take both INSTRUCTION-SDC-VOLATILITY and INSTRUCTION-EXECUTION-VOLATILITY into consideration. Oneway to do this is solving the knapsack problem based on the average cases of eachPi and Oi across inputs, so that the protection outcomes, C and F , are optimal withrespect to the average case of the executions with the inputs. This is a subject offuture work.6.8 SummaryPrograms can experience Silent Data Corruptions (SDCs) due to soft errors, andhence we need fault injection (FI) to evaluate the resilience of programs to SDCs.Unfortunately, most FI studies only evaluate a program’s resilience under a singleinput or a small set of inputs as FI is very time consuming. In practice however,programs can exhibit significant variations in SDC probabilities under differentinputs, which can make the FI results inaccurate.In this chapter, we investigate the root causes of variations in SDCs under dif-ferent inputs, and we find that they can occur due to differences in the execu-tion of instructions as well as differences in error propagation. Most prior workhas only considered the former factor, which leads to significant inaccuracies intheir estimations. We propose a model VTRIDENT to incorporate differences inboth execution and error propagation across inputs. We find that VTRIDENT isable to obtain achieve higher accuracy and closer bounds on the variation of SDCprobabilities of programs across inputs compared to prior work that only considerthe differences in execution of instructions. We also find VTRIDENT is signifi-cantly faster than other state of the art approaches for modeling error propagationin programs, and is able to obtain relatively tight bounds on SDC probabilities ofprograms across multiple inputs, while performing FI with only a single programinput.102Chapter 7Understanding ErrorPropagation in GPGPUApplicationsPrevious chapters have discussed error propagation in CPU programs. In this chap-ter, we aim to develop techniques for analyzing error propagation in GPU programsin order to improve their error resilience. Unlike in CPU programs where a vari-ety of tools are available for fault injections, few options can be found in studyingprogram-level error propagation in GPU programs. To overcome the difficulties,we first design a LLVM-based fault injector, LLFI-GPU, which is the first fault in-jector operating on the LLVM Intermediate Representation (IR) of GPU programs.By injecting faults at the IR level, LLFI-GPU allows the user to gain program-level insights of error propagation. We desmonstrate that LLFI-GPU can be usedto conduct fault injection experiments and study error propagation in GPU pro-grams. In the experiments, we observe the error propagation patterns that specificto GPU programs and discuss their implications of improving error resilience.7.1 IntroductionGraphic Processing Units (GPUs) have found wide adoption as accelerators forscientific and high-performance computing (HPC) applications due to their mass103availability and low cost. For example, two of the largest supercomputing clustersin use today, namely Blue-Waters [47] and Titan [144] both use GPUs. GPUs wereoriginally designed for graphics and gaming. However, their use in HPC appli-cations has necessitated the systematic study of their reliability. This is becauseunlike graphics or gaming applications which are error-tolerant, HPC applicationshave strict correctness requirements and even a single error can lead to significantdeviations in their outcomes. The problem is exacerbated by the lack of standarderror detection and correction mechanisms for GPUs, compared to CPUs (CentralProcessing Units, or the main processor). Recent studies of GPU reliability in theHPC context have found that GPUs can experience significantly higher fault ratescompared to CPUs [47, 144], and that GPU applications often experience higherrates of Silent Data Corruption (SDCs), i.e., incorrect outcomes, compared to CPUapplications [50, 63].HPC applications typically run for long periods of time, and hence need tobe resilient to faults [114]. Further, in supercomputers, hardware faults have be-come more and more prevalent due to the shrinking feature sizes and power con-straints [55]. One of the most common hardware fault types are transient faults [20,40], which arise due to cosmic rays or electro-magnetic radiation striking compu-tational and/or memory elements, causing the values computed or stored to beincorrect. To mitigate the effect of transient hardware faults, HPC applications usetechniques such as checkpointing and recovery. However, these techniques makean important assumption, namely that faults do not propagate for long periods oftime and corrupt the checkpointed state as this would make the checkpoint unre-coverable [58, 88, 150]. Unfortunately, this assumption does not always hold aserrors often propagate in real applications [89]. More importantly, unmitigatederror propagation can also lead to SDCs, which seriously compromise the applica-tions’ correctness.In this chapter, we investigate the error propagation characteristics of general-purpose GPU (GPGPU) applications with the goal of trying to mitigate error propa-gation. Prior work has investigated error propagation in CPU applications [11, 88],and has developed techniques to mitigate error propagation based on their re-sults [89]. However, it is not clear how applicable are these results to GPGPUapplications, which have a very different programming model. Other work has in-104vestigated the aggregate error resilience of GPGPU applications [50, 63]. Whilethese are valuable, they do not provide insights into how errors propagate withinthe GPGPU application. Such insights are necessary for driving the design of low-overhead error detection mechanisms for these applications, which is our long-termgoal. Such mechanisms have been demonstrated in the CPU space [11, 89]. To thebest of our knowledge, this is the first study of error propagation in GPGPU appli-cations.There are two main challenges with performing studies of error propagationon GPGPU applications. First, because GPGPU applications execute on both theCPU and the GPU (as current GPUs do not provide many of the capabilities neededby applications), it is important to track error propagation across the two entities.Further, GPUs are often invoked multiple times in an application (each such in-vocation is known as a kernel call), so one needs to track error propagation acrossthese invocations. Second, unlike in the CPU space where there are freely availablefault injection tools and frameworks to study error propagation [25, 69, 127, 147],there is a paucity of such tools in the GPU space.We address the first challenge by defining the kernel call as the unit of errorpropagation, and study error propagation both within kernel calls and across mul-tiple calls. We address the second challenge by building a robust LLVM-basedfault injection tool for GPGPU applications. LLVM is a widely-used optimizingcompiler [85], and our fault injection tool is written as a module in the LLVMframework. As a result, we are able to leverage the program analysis capabilitiesprovided by LLVM to track error propagation in programs, and correlate it withthe program’s code.We make the following contributions in this paper.• Develop LLFI-GPU, a GPGPU fault injection tool that can operate on theLLVM intermediate representation (IR) of a program and track error propa-gation in GPGPU programs.• Define the metrics for tracking and measuring error propagation in GPGPUprograms,• Conduct a comprehensive fault-injection study on how errors propagate in105twelve GPGPU applications (including both benchmarks and real-world ap-plications), and how long and how fast such errors propagate and spread inthe application,• Discuss how the results may be leveraged by dependability techniques toprovide targeted mitigation of error propagation for GPGPU applications atlow cost.Our main results from the fault-injection study are:• Only a small fraction of the crash-causing faults that occur in GPUs propa-gate to the CPU, and only a minuscule fraction of crash-causing faults prop-agate to other GPU kernels. Thus, it is sufficient to consider checkpointingand recovery techniques at the GPU-CPU boundary.• Errors do propagate to multiple memory locations, but this behavior is highlyapplication specific. For example, a single fault can contaminate anywherebetween 0.0006% locations to more than 60% of total memory locations,depending on the application.• Unlike CPU programs, most of the memory corruptions in GPU programslead to data corruptions in program output. Faults in memory that propagateto output data likely do so within the kernel where faults occurred. Thisallows error detection techniques to operate at the granularity of the GPUkernel call.• More than 50% of the faults that occur in the GPU are masked within a singlekernel execution. Thus, it may be counterproductive to deploy techniquessuch as Dual Modular Redundancy (DMR) or Error Detection by DuplicatedInstructions (EDDI) [106] within the GPU kernel program as they will endup detecting many faults that are eventually masked.7.2 GPU Fault InjectorWe build a fault injector for GPUs based on the open-source LLFI fault injec-tor [147] , which has been extensively used for error propagation studies on the106CPU [11, 88]. However, LLFI does not inherently support GPUs. Furthermore,performing error propagation analysis (EPA) for GPUs is much more intricate thanon CPUs. We therefore extended LLFI to perform both fault injection and EPA onGPUs. We refer to the extended version of LLFI as LLFI-GPU1 to distinguish itfrom the existing LLFI infrastructure.Fault injection can be done at different levels of the system such as at the gate-level, circuit-level, architecture level and application level. Prior work [38] hasfound that there may be significant differences in the raw rates of faults exposed tothe software layer when fault injections are performed in the hardware. However,we are interested in faults that are not masked by the hardware and make their wayto the application. Therefore, we inject faults directly at the application level.7.2.1 Design OverviewLLFI is a compiler-based fault injection framework, and uses the LLVM compilerto instrument the program to inject faults. The CUDA Nvidia compiler NVCC isalso based on LLVM, and compiles LLVM IR to a PTX representation, which thengets compiled to the SAS machine code by Nvidia’s backend compiler. So at firstglance, it seems trivial to integrate LLFI and NVCC to build a GPU-based faultinjector. However, there are two challenges that arise in practice. First, NVCCdoes not expose the LLVM IR code and directly transforms it to the PTX code.LLFI relies on the IR code to perform instrumentation for fault injection, and hencecannot inject faults into the IR used by NVCC. Second, GPU programs are multi-threaded, often consisting of hundreds of threads, and hence we need to inject faultsinto a random thread at runtime. However, LLFI does not support injecting faultsinto multi-threaded programs.We address the first challenge by attaching a dynamic library to NVCC whichcan intercept its call to the LLVM compilation module [5]. At that point, we invokethe instrumentation passes of LLFI to perform the instrumentation of the program.We then return the instrumented LLVM IR to NVCC, which proceeds with the restof the compilation process to transform it to PTX code. We address the secondchallenge by adding a threadID field to the profiling data collected by LLFI to1Available at: each thread uniquely. We then choose a thread at random to inject intoat runtime from the set of all threads in the program. We also add information onthe kernel call executed and the total number of kernel calls to the profiling data.These are used to choose kernel calls to inject faults into.LLFI-GPU works as follows. First, LLFI-GPU profiles the program and ob-tains the total number of kernel calls, the number of threads per kernel call, andthe total number of instructions executed by each kernel thread. It then creates aninstrumented version of the program with the fault injection functions inserted intothe CUDA portion of the program’s code (this is similar to what LLFI does, exceptthat we restrict the instrumentation to the CUDA portion of the program). LLFI-GPU then chooses a random thread in a random kernel call, and a random dynamicinstruction executed by it, based on the profiling data gathered (the instruction ischosen uniformly from the set of all instructions executed). For the chosen instruc-tion, LLFI-GPU overwrites the result value of the instruction with a faulty versionof the result (e.g., by flipping a single bit in it), and continues the application.Thus, LLFI-GPU directly executes the program on the GPU hardware after in-strumenting it, unlike prior approaches such as GPU-Qin [50] which use debuggersfor fault injection. Debugger-based fault injection has the advantage that it offersmore control over the program, but is often significantly slower. As a point of com-parison, we ran both GPU-Qin2 and LLFI-GPU on a simple matrix multiplicationbenchmark, MAT from NVIDIA SDK Sample [4]. Similar to LLFI-GPU, GPU-Qin operates in two phases: profiling and fault injection. We measured the averagetime taken by GPU-Qin for these two phases to be 2 hours (=7200 seconds), and82 seconds per run respectively. In contrast, LLFI-GPU takes only 6 seconds forprofiling this benchmark and 2 seconds per run for the fault injection phase for thesame set of inputs. The significant speedup of over 1000x obtained by LLFI-GPUis because GPUs execute significantly slower in debug mode [2], and GPU-Qin sin-gle steps through every instruction in the program in the profiling phase. We haveconfirmed that the above execution times are fairly typical depending on the num-ber of dynamic instructions of the programs executed using GPU-Qin3, and hence2The only other GPU fault injector that we know of, SASSIFI [63], was not publicly available atthe time the paper was written, and hence could not be used for comparison.3Based on personal communication with the developers of GPU-Qin.108we did not run any of the other benchmarks with GPU-Qin. Therefore, LLFI-GPUis significantly faster and more scalable than prior techniques, making it feasiblefor studying realistic HPC workloads.7.2.2 Error Propagation Analysis (EPA)After injecting a fault, LLFI-GPU tracks memory data at every kernel boundaryfor the analysis of error propagation. This is because we are interested in errorpropagation at the GPU kernel boundary, rather than within a kernel. This is dif-ferent from what LLFI does as it tracks the error propagation using memory andregisters after each LLVM instruction. Although we can leverage the existing EPAmechanism in LLFI for tracking error propagation at the kernel boundaries, wefound that this incurs very high overheads, and often results in the kernel runningsubstantially slower. Therefore, we decided to build our own error tracking mech-anism in LLFI-GPU that is optimized for our use case.Figure 7.1 illustrates how the EPA mechanism works for a simple GPU kernel.The code fragment is from bfs. It allocates memory on device through cudaMal-loc() before launching kernels, and it deallocates the memory on device throughcudaFree() at the end of the program. After each kernel invocation, LLFI-GPUsaves all memory data allocated on the GPU to disk. This step corresponds to line6-13. Later, we compare the saved data after each kernel call with that from agolden run and mark any differences as a result of the error propagation. Becausewe perform this comparison at the kernel boundaries, we do not need to worryabout non-determinism introduced by thread interleaving within the GPU.7.2.3 LimitationsOur fault injections are performed at the LLVM IR level rather than at the SASSor PTX code levels. One potential drawback of this approach is that downstreamcompiler optimizations may change both the number and order of instructions, oreven remove the fault injection code we inserted. To mitigate this effect, we madesure that our fault injection pass is applied after various optimization passes in theLLVM IR code. Further, LLFI-GPU gathers all executed instructions in the profil-ing phase as described above, and we made sure that all the target instructions as109Figure 7.1: Example of the error propagation code inserted by LLFI-GPUfault injection candidates are gathered and injects faults only into these instructions- this ensures that faults are not injected into dead instructions that are optimizedout by the backend compiler. Due to backend optimizations after the IR is gener-ated, the mapping of instructions may be changed at the machine assembly levels(e.g., SASS level). This may result in different absolute values of the SDC rate forfault injections performed at different levels. However, as we said before, we areinterested in obtaining insights into error propagation intrinsic to applications in-stead of deriving derated SDC rates. Finally, a previous study on CPU applicationsshowed that there is negligible difference in SDC rates between fault injectionsperformed at the LLVM IR level and the assembly code level [147].7.3 Metrics for Error PropagationIn this section, we define the metrics for measuring error propagation in our experi-ments. We measure error propagations along two axes, namely (1) execution time,which captures the temporal nature of the propagation, and (2) memory states,which captures the spatial nature of the propagation. We examine these in furtherdetail below.1107.3.1 Execution timeA fault can propagate in the program corrupting data values until it either causesprogram termination (e.g., by a crash), or it is masked. The former happens if thefault crashes the program, or the program finishes execution successfully (programhangs are handled through a watchdog timer). The latter happens if the faultydata is overwritten, or if the values to which the fault propagates are discardedby the program. The execution time metric measures the time between the fault’soccurrence and the masking or termination events.We use kernel invocations to measure the propagation time of an error. For ex-ample, if an error occurs in a certain kernel invocation K1, and the program crashesafter two more invocations of the kernel, say K2 and K3, we label the executiontime of this fault to be 2. There are three reasons for using kernel invocations asthe unit of propagation. First, we are often interested in knowing if an error prop-agates across the CPU-GPU boundary or across multiple kernel invocations on theGPU. The number of kernel invocation captures this value. The second reason isthat unlike other metrics such as wall-clock time (e.g., seconds), or the number ofexecuted instructions which are platform dependent, the number of kernel invoca-tions depends only on the GPU application. Thus, it captures the application-levelsemantics of error propagation without being affected by platform-specific details.This is important as our goal is to design application level error-resilience mech-anisms. Finally, unlike CPU applications which have a few long-living threads,GPU applications typically have a large number of short-lived threads executingon the GPU as they focus on throughput. When these threads terminate, the con-trol is passed back to the CPU, thereby resulting in frequent CPU-GPU boundarycrossings. We found that the average kernel invocation time in our benchmarks isusually less than one minute - this is in line with prior work [138].In addition to kernel invocations, GPU applications also need to transfer datafrom CPU memory to GPU memory and back. Typical GPU applications performmultiple kernel invocations in between transfers to amortize the latency of trans-fers. We define a kernel cycle as a sequence of kernel invocations by the applica-tion that is prefixed by memory transfer from the CPU to the GPU, and suffixedby memory transfer from the GPU back to the CPU. All applications in our study111Figure 7.2: Code Example of a Kernel Cycle from Benchmark bfsexcept LULESH and NMF had only a single kernel cycle. However, all of themhave multiple kernel invocations within a single kernel cycle.Figure 7.2 shows an example of a kernel cycle in the bfs application. The firstphase of the kernel cycle (lines 1-10) consists of memory allocations, and datamovement from the host (CPU) to the device (GPU) memory. The second phaseconsists of kernel invocations (line 13), which perform computations on the datacopied to the GPU memory. Finally, in the third phase, after the kernels finish theirwork, the CPU collects data from the GPU and processes it (lines 15-20).7.3.2 Memory StatesTo better understand error propagation, we examine which parts of a GPU pro-gram’s memory have been affected by an error after fault injection. We dividememory into three categories, namely Total Memory(TM), Result Memory(RM)and Output Memory(OM). TM is a superset of RM, which in turn is a superset ofOM. This is shown in Figure 7.3. TM refers to the entire memory space allocatedfor the program on device. The allocations are usually done through cudaMalloccalls, and through global variables declared by kernels. RM refers to the memory112locations containing the computation results that the CPU transfers from the GPUat the end of a kernel cycle. OM refers to the memory locations containing the datathat the CPU actually processes for computing the program output.In the example in Figure 7.2, the transfer occurs at line 16. So gpu result cost isthe pointer to the RM in this example. Further, the processing phase occurs in lines18-19. So the OM consists of the results of the dumpCostForResult function. Notethat applications may choose only certain parts of the RM to copy into their output.For example, a floating point application may use only the two most significantdigits from the result in RM to compute the output, in which case the OM consistsof only these two digits.We use SDC to refer to corruption of the above three categories of memory, asall of these pertain to data corruptions. We use the memory type as a subscript todenote corruptions of different memory categories, e.g., SDCRM. Note that SDCOMis what is typically defined as an SDC in prior work [50, 63, 151] on GPUs, asthey only study the effect of faults on the final output of the application. However,our aim is to study error propagation and hence we study data corruption in thememory states of the application. We also refer to corruption of the memory thatis in TM but not RM as (TM-RM), and that in RM but not OM as (RM-OM).For example, in Figure 7.3, assume that an error occurs during a kernel invo-cation (K1) and affects the memory location (L1) in the TM. However, it does notpropagate to the RM. In the next kernel invocation (K2), the faulty value in L1 isread and affects a value at another location L2 in RM. Hence we say the error prop-agates from the TM to RM during K2. In this case, the error causes an SDCT M afterkernel K1, and an SDCRM after kernel K2. However, the error does not propagateto the OM, and hence does not result in an SDCOM.We also measure what fraction of the memory is contaminated by error propa-gation. We define this as the spread of an error. For example, in Figure 7.3, at K3,the faulty value in location L2 is assigned to different memory locations (L3, L4and L5), and hence propagates to these locations. In this case, the fault value haspropagated to a total of 5 locations at the end of K3. Assuming TM consists of 100memory locations, the error spread after K3 in this case is 5/100=5%.113Total Memory (TM)Result Memory (RM)Output Memory (OM)L1L3L2L4L5K2K3K3K3Figure 7.3: Memory State Layout for CUDA Programming Model (K2 andK3 are kernel invocations)7.4 Experimental SetupWe describe the benchmarks used, and the fault injection procedure. We also pro-vide details of the hardware and software platform used for the measurements,followed by the research questions.7.4.1 Benchmarks UsedWe choose twelve GPGPU applications in total for our experiments. These aredrawn from standard GPU benchmark suites such as Rodinia [33] and Parboil [133],as well as real world applications. We choose five programs from Rodinia, andtwo programs from Parboil. The applications from Rodinia were chosen based ontwo criteria: (1) compatibility with our toolset (i.e., we could compile them withNVCC), and (2) suitability for our experiments. For the latter criteria, we dis-card small applications that had too few kernel invocations (as it is uninteresting tomeasure error propagation in such applications), and applications in which the out-puts were non-deterministic (as it is difficult to classify the results of an injectionor error propagation in such applications). For Parboil, we randomly choose twoapplications from the suite to balance time with representativeness.114In addition to the standard benchmarks, we pick five real-world HPC GPGPUapplications, Lulesh [80], Barnes-Hut [116], Fiber [152], Circuit [1] and NMF [16].These applications perform n-body simulation, hydrodynamics modeling, fiberscattering simulation, circuit solving and audio source processing respectively.Table 7.1 shows the details of the applications used in our study and the inputused. The number of kernel invocations ranges from 4 to 8567 in these applica-tions. The lines of C code of these applications ranges from 222 to 5684. Weconfigured our benchmarks to run on a single GPU as our goal is to study errorpropagation between kernels. Multi-GPU programs also transfer data and syn-chronize at kernel boundaries [3], and we hypothesize that our results generalize tosuch programs - validating this hypothesis is a subject of future work.Similar to prior work in the area [11, 50, 63, 65, 150], we run each benchmarkapplication with a single input. However, questions remain on whether multipleinputs may affect error propagation behaviors. We hypothesize that different inputshave limited effect on error propagation as the propagation is primarily dominatedby the application’s algorithm, rather than problem size. This is because differentinputs likely only scale the execution times of certain code sections, rather thanchange the underlying program structure. We will further validate this hypothesisin future work.115Table 7.1: Characteristics of GPGPU Programs in our StudyBenchmarkBenchmark Suit-e/AuthorDescription Kernel Invo-cationsLOC InputBFS Rodinia (v2.1) An algorithm fortraversing or search-ing tree or graph datastructures15 342 4096LUD Rodinia (v2.1) An algorithm to cal-culate the solutions ofa set of linear equa-tions9 564 64PathFinder Rodinia (v2.1) Use dynamic pro-gramming to find apath on a 2-D grid4 236 100000 100 20Gaussian Rodinia (v2.1) Compute result rowby row, solving for allof the variables in alinear system29 394 16HotSpot Rodinia (v2.1) Estimate processortemperature based onan architectural floor-plan and simulatedpower measurements15 328 512 2 32cutcp PARBOIL (v0.2) Computes the short-range component ofCoulombic potentialat each grid point10 1540 watbox. sl40.pqrstencil PARBOIL (v0.2) An iterative Jacobistencil operation on aregular 3-D grid99 1584 128 128 32 100Barnes-HutTexas State Univ. SanMarcos (v2.1)An approximation al-gorithm for perform-ing an n-body sim-ulation developed byTexas State Univ. SanMarcos20 965 4 4Lulesh Lawrence LivermoreNational Laboratory(v1.0)Science and engi-neering problemsthat use modellinghydrodynamics8567 5684 edgeNodes =2Fiber Northeastern Univer-sity (v1.5)High PerformanceComputing of FiberScattering Simulationapplication2881 1437 480 4 20Circuit Rice University Parallel circuit solverfor solving 2D cir-cuit grid using Jacobimethod450 222 0.00001 1NMF UC Berkley Audio analysis andsource separation.409 2398 default1167.4.2 Fault Injection MethodAs mentioned before, we use CRASHFINDER to perform the fault injection exper-iments. We consider only one fault per run as hardware transient faults are rareevents relative to the program execution times. For each application, we inject10,000 faults in total - this yields error bars ranging from 0.22% to 1.11% depend-ing on the application for the 98% confidence intervals. Further, we use the singlebit-flip model for injecting faults as it is the de-facto fault model used in studiesof transient faults. Although recent work [38] has found that hardware faults maymanifest as both single and multiple bit flips at the software level, other studieshave shown that there is very little difference in failure rates due to single and mul-tiple bit flips [14, 37, 96]. Therefore, we stick to the single bit flip model in thisstudy.To obtain a golden run for error propagation analysis, we first run the programwithout any fault injections. We then gather the output of the program and memorydata stored after each kernel invocation as described in Section 7.2.2. We measureerror propagation and error spreading by comparing data from fault injection runswith the golden run. This comparison is done on a bit-wise basis, except for float-ing point numbers, which are compared using 40 digits of precision. As we omitbenchmarks that have random values in program output, the golden runs of thechosen benchmarks are deterministic. We manually verified that this was the casefor our benchmarks.There are three kinds of failures that can occur due to an injected fault: Crashes,SDCs and Hangs. Crashes are found by using the CUDA API call cudaGetLastEr-ror() after every kernel invocation. SDCs are found by comparing the program’soutput with the golden run for each memory type (T M, RM, and OM). Hangs arefound by setting a watchdog timer for 5000 seconds when the program starts - thisis much larger than the time taken by each application run.7.4.3 Hardware and Software PlatformFault injection experiments are performed on host PCs with an Intel Xeon CPUand 32GB DDR3 memory. We use two GPU platforms both from Nvidia, namelyTesla K20 the GTX960, for running our experiments. The results were similar on117both platforms - this is not surprising as our experiments were at the applicationlevel. We therefore report only the results on the K20 platform in this paper. Wewill further detail our comparison in RQ7. The operating system running on thehost is Ubuntu Linux 12.04 64bit, and the CUDA driver and Toolkit used is V6. Research questions (RQs)We answer the following questions in our study.RQ1: What is the percentage of SDCs in different memory states?RQ2: How long do errors take to propagate to the RM?RQ3: Do errors spread into different memory states and why?RQ4: How many faults are masked within the GPU kernel and not allowed topropagate?RQ5: Do crash-causing faults propagate to the host CPU before they causecrashes?RQ6: Do crash-causing faults propagate across kernels before they cause crashes?RQ7: Are resilience characteristics of applications different on different GPUplatforms?7.5 ResultsThe results are organized by the research questions asked in Section 7.4. We firstpresent the aggregate results of the fault injections across all the benchmarks.7.5.1 Aggregate Fault InjectionsFigure 7.4 shows the aggregate results for our fault injection experiments. SDCsand Benign are measured by comparing the programs’ final outputs with the goldenrun. This corresponds to data recored in OM after the programs finish their execu-tions. So SDCs here correspond to SDCOMs, and benign to BenignOM. On average,crashes constitute 17.52%, SDCs constitute 18.98% and Benign faults constitute63.35% of the injections. In our experiments, hangs are negligible, and are hencenot reported.118Figure 7.4: Aggregate Fault Injection Results across the 12 ProgramsTable 7.2: SDCs that occur in the different memory typesbfs lud pathfinder stencil cutcp gaussian hotspot barneshut lulesh circuit fiber nmf averageSDCOM 17.01% 45.58% 20.60% 37.00% 16.50% 10.90% 29.40% 1.30% 0.97% 7.50% 7.01% 43.20% 19.67%SDC(RM−OM) 0.00% 0.00% 1.30% 0.00% 28.90% 0.00% 6.00% 2.30% 0.10% 14.80% 12.69% 10.10% 6.35%SDCRM 17.01% 45.58% 21.90% 37.00% 45.40% 10.90% 31.50% 3.60% 1.07% 22.30% 19.70% 53.30% 25.70%SDC(T M−RM) 0.00% 0.00% 0.00% 0.00% 0.00% 0.80% 0.00% 0.20% 0.00% 0.10% 0.00% 0.00% 0.09%SDCT M 17.01% 45.58% 21.90% 37.00% 45.40% 11.70% 31.50% 3.80% 1.07% 22.40% 19.70% 53.30% 25.79%7.5.2 Error PropagationRQ1: What’s the percentage of SDCs in different memory states?We analyze the memory data in each memory type after the last kernel invo-cation in kernel cycle, and compare with the golden run. Table 7.2 shows SDCsmeasured at different memory locations at the end of the kernel cycle. On average,SDCOM, SDCRM and SDCT M are 19.67%, 25.70% and 25.79%. As can be seen,the values of SDCRM and SDCT M are very similar across applications. In otherwords, most faults in the TM propagate to the RM. This is surprising as it sug-gests that there is little to no masking of errors in the TM. On the other hand, CPUapplications are known to exhibit significant error masking in memory [11]. Onepossible reason is that unlike CPUs, GPUs perform highly specialized computa-tions, and hence all the results produced are important to the application and hencethey propagate to the output.However, there is a difference of about 6% on average between SDCOM andSDCRM (see in SDC(RM−OM)). This is due to the application masking the error ei-ther through type-casting or selective truncation of floating point data. An example119of this is cutcp, which exhibits a difference of nearly 30% between the SDCOM andthe SDCRM. Overall, about 76% of the faults propagate from the RM to the OM,which is again much higher than observed in CPU applications [11].We note that the lulesh application comes equipped with application-level al-gorithm correctness checks (i.e., residual check). For example, the documentationfor lulesh states that the correct output consists of the correct number of iterations,and six most-significant digits in the final origin energy variable [79]. Therefore,SDCOM here is measured based on these correctness checks. None of the other ap-plications however come with such checks, and hence we consider the entire outputin these cases for comparison with the fault-injected run.RQ2: How long do errors take to propagate to the RM?Figure 7.5: Detection Latency of faults that result in SDCRMOnce a fault is injected in a kernel, we measure the number of kernel calls afterwhich it propagates to RM (if it does). Figure 7.5 shows the results. As shown inthe figure, most faults that affect the RM do so within a single kernel invocationafter injection. This means that errors propagate relatively soon to the RM aftertheir occurrence. The exceptions are bfs, gaussian and barneshut.Figure 7.6 shows the percentage of RM updated after each kernel invocation.For the sake of space, we only show the results for two applications, namely lud andgaussian. As we can see, lud updates the RM in every kernel invocation, whereasgaussian does not. This explains why the propagation occurred within one kernelexecution for lud, but not gaussian.120Figure 7.6: Percentage of RM Updated by Each Kernel Invocation. Y-axisis the percentage of RM locations that are updated during each kernelinvocation. X-axis represents timeline in terms of kernel invocations.7.5.3 Error SpreadingWe defined error spreading as the percentage of memory locations contaminateddue to an error (Section 7.3). Unlike the previous section where we consideredany difference from the golden run as an SDC for that memory type, here we onlyconsider the amount of memory that is different.RQ3: Do errors spread into different memory states and why?Figure 7.7: Percentage of TM and RM Contaminated at Each Kernel Invoca-tion. Y-axis is the percentage of contaminated memory locations, X-axisis timeline in terms of kernel invocations. Blue lines indicate TM, andred lines represent RM.From our fault injection experiments, we examine how errors spread as a func-tion of time (i.e., kernel invocations). Because we performed 10,000 injectionsper application, we cannot show all the data. Therefore, we only show represen-121tative injections for each application. Figure 7.7 shows the percentage of memorylocations in TM and RM that are contaminated by the injected faults at the firstdynamic kernel invocation. The spread is calculated as (Contaminated TM or RMLocations / Total TM Locations) * 100.Our main findings are: (1) error-spreading is very application-specific. For ex-ample, a single fault can contaminate nearly 60% of TM memory locations in lud,whereas the number is as low as 0.0006% in cutcp. (2) only a very small amountof memory locations are affected in the same kernel where the fault was injected.Rather, most faults propagate into memory locations in later kernel invocations. (3)trends of error spreading between RM and TM of the same application are rathersimilar, though the absolute values may be different. In other words, applicationsthat have extensive error spread in the TM have extensive error spread in the RMas well.Finally, in almost all applications the error spreading either increases or re-mains constant as the number of kernel calls increase. The exception to this isthe stencil application in which the error spreading decreases significantly as thenumber of kernel calls increase. This is because the algorithm of stencil takesneighbours’ values and keeps averaging them. The errors may be finally maskedas the averaging process progresses. We also observed that there is a small de-crease in error spreading in the bfs application after it reaches its peak in TM. Thisis because some of locations in TM are reassigned during program execution, andfaulty data in these locations can be overwritten with correct data, thereby maskingthe errors.Because lud exhibits the highest error spread, we study its code structure tounderstand the reasons. Figure 7.8 shows the code structure leading to extensiveerror spread in lud. The code exhibits a cyclic data flow from global memory toshared memory, and then back to global memory. Note that shared memory is usedto transfer data between threads only in the same block. In the example, dia is apointer to shared memory, and m points to global memory. At line 5, a portion ofshared memory in dia is initialized by global memory m. This portion of data india is shared by other threads in its block. After the shared data is consumed, itwrites data back to global memory m at line 9 again. If a fault occurs in m at line 5,dia will be first compromised and the data processed by other threads in the same122Figure 7.8: Code Structure Leading to Extensive Error Spread in ludblock may be affected after reading the corrupted data from dia. And then at theend of the kernel invocation, a different part of m may be corrupted by reading datafrom dia(line 9). In the next invocation of this kernel, faulty values in m may beused in the initialization of dia again at line 5. But this time, it may be initializedto a different portion of dia that are used by threads in a different block, becausethere are different parts of m that were corrupted in the previous kernel invocation.This leads to extensive error spread for this application.7.5.4 Fault MaskingIn the previous two sections, we examined how faults propagate across differentmemory locations and kernel executions. We now ask the complementary question:how many faults are masked within a single kernel invocation of their occurrence.RQ4: How many faults are masked within the GPU kernel and not allowedto propagate?Table 7.3 shows the percentage of benign faults measured at the first kernel in-vocation after the fault injection (BenignT M). We measure this value by comparingall memory locations in TM with the golden run right after the first kernel invoca-tion where the fault is injected. If there is no difference between the two TMs, wecount it as a benign fault. We consider TM here as it is a superset of both RM andOM. Therefore if a fault is masked in the TM, it is also masked in the other twotypes of memory (even in future kernel executions).123Table 7.3: Percentage of Benign Faults Measured at the First Kernel Invoca-tion after Fault Injectionbfs lud pathfinder stencil cutcp gaussian hotspot barneshut lulesh circuit fiber nmf average63.5% 28.5% 69.9% 25.0% 42.7% 81.1% 57.2% 76.4% 92.5% 25.6% 62.9% 20.0% 53.4%As we can see from the table, more than 50% of the faults injected are maskedwithin the same kernel they are injected in, and do not propagate. The maximummasking is achieved in the case of lulesh in which nearly 92.5% of the faults aremasked. The minimum masking is achieved for nmf in which only 20% of thefaults are masked. Such variations across applications are because programs con-tain different amount of code structures leading to masking effects.We found there are two prevalent patterns leading to error masking in ourbenchmark programs, namely (1) Comparison, and (2) Truncation. An exampleof Truncation is shown in Figure 7.9, on the left. R0 and R1 are initialized at lines2 and 3, and R2 holds the result of comparing R0 and R1 at line 3. Consider a faultthat flips the first bit of R1 - R1 erroneously becomes 1110 from 1111. However,the result of R2 will not be affected since R1 is still greater than R0. An exampleof Truncation is shown in the right part of Figure 7.9. At line 2, R0 is initialized.At line 3, value of R1 is truncated from 0001 to 01. Consider a fault that occurs atline 2 and flips either of the left-most 2 bits of R0 - it will not affect the value ofR1 at line 3 due to the truncation. Hence the fault will be masked.Figure 7.9: Examples of Fault Masking. (a) Comparison, (b) Truncation7.5.5 Crash-causing FaultsThe next two research questions have to do with crash-causing faults and theirpropagation to the CPU (host) or other GPU kernels. We focus on crash-causingfaults as prior work has found that long-latency crashes can lead to checkpoint124corruptions, and cause unrecoverable failures [89, 150].RQ5: Do faults propagate to the host CPU and cause crashes?From our experiments, we can observe that there is a very small chance (0.02%on average) for a fault that occurs in a kernel to contaminate the CPU states andlead to a crash eventually. This is because memory address spaces are separate inthe CPU and GPU, and hence faulty pointers produced by the GPU are unlikely tobe used in the CPU to access memory. Because faulty pointers are responsible forthe majority of crash causing errors on the CPU [89], these errors do not lead tocrashes.RQ6: Do crash-causing faults propagate across kernels before they causecrashes?In our experiments, we find that there is no fault that propagates across ker-nels and causes a crash. In other words, cash-causing faults typically cause crasheswithin the kernel in which they occur. This is because pointers or memory addressoffset variables are usually passed between kernels in the constant memory space,which is read-only. Note however that it is possible for faults to propagate acrosskernels if address offsets of pointers are passed through global variables (we haveempirically verified this observation through carefully constructed code samples -we do not present these due to space constraints). However, typical GPGPU pro-grams do not exhibit this behavior, as each thread is responsible for its own memorylocations, and hence multiple threads do not read the same offset to calculate thesame memory address.7.5.6 Platform DifferencesRQ7: Are resilience characteristics of applications different on different GPUplatforms?We use the two platforms, K20 and GTX960 for this experiment. To answerthis question, we performed 1,000 fault injections for each benchmark applicationon the two platforms. We focus on SDCs for this experiment as these are oftenthe most important concern in practice. Note that we have omitted two programs,fiber and nmf, as we encountered errors when compiling them on GTX960, proba-bly because they use features that are not supported on that platform. To compare125the distributions of the SDC values, we ran a t-test between the values obtained onthe two platforms. We found that the p-value was 0.991. Thus, our results showthat we fail to reject the null hypothesis, indicating that the values are statisticallyindistinguishable from each other. Therefore, we can conclude that the resiliencecharacteristics of the applications do not vary significantly between the two plat-forms.7.6 ImplicationsIn this section, we consider the implications of the results on error detection andrecovery techniques. These are organized by the RQs.Table 7.4: Size of OM, RM and TMbfs lud pathfinder stencil cutcp gaussian hotspot barneshut lulesh circuit fiber nmf geo meanOM 14.29% 7.50% 0.99% 7.50% 7.50% 1.67% 5.00% 18.75% 12.50% 15.00% 49.98% 0.03% 5.02%RM 14.29% 50.00% 1.98% 50.00% 50.00% 3.21% 66.67% 37.50% 18.75% 50.00% 49.98% 0.03% 13.56%TM 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100%RQ1: Percentages of SDCs in different memory states In RQ1, we foundthat most SDCs in the TM propagate to the RM, and more than 76% of the SDCsin the RM propagate to the OM. Table 7.4 shows the size of data in OM, RM andTM in each application. As we can see, the RM is only a small fraction of the TM.This suggests that checking the RM for consistency (e.g., using detectors [108])may be much more efficient than checking the entire TM. Another possibility ischecking the OM which is even smaller than the RM. However, the OM may notbe updated until the end of the program and hence checking the OM may incurhigh detection latency.RQ2: Detection Latency of Errors in RM Error detection latency is criticalwhen designing checkpoint intervals. If the detection latency is too long, errorsmay propagate to checkpoints before they are detected, thereby corrupting check-points [89]. Our results show that errors propagate to the RM relatively soon aftertheir occurrence (i.e., within one kernel call in most cases). Therefore, placingdetectors on RM will ensure low-latency error detection.RQ3: Error spreading to different memory states Error spreading is highlyapplication specific in GPGPU applications. Further, only certain code structures inGPGPU programs may lead to extensive error spread. Therefore, one can statically126analyze the program to identify such structures to protect. Further, the code can berestructured to avoid error spreading in some circumstances. In some applications(e.g., Stencil), there may be natural mechanisms in the code to dilute the effect oferror spreading over time.RQ4: Effect of error masking We found that there is substantial error mask-ing within GPU kernels, and that many errors do not even affect the TM after theyoccur. This means that there may not be a need to deploy expensive error detec-tion mechanisms such as Dual Modular Redundancy (DMR) or Error Detectionby Duplicated Instructions (EDDI) [106] within the kernel, unless it is a safety-critical application. Instead one can check the results after the kernel’s invocation.For example, ABFT-based detection algorithms [45, 73] can be used at the kernelboundary to detect errors, to determine whether GPU kernel re-execution shouldbe initiated.RQ5 & RQ6: Crash-causing Faults and Checkpoint Scheme Studies on er-ror propagation on CPUs find that crash-causing faults can propagate for a longtime before they cause crashes. Hence, checkpoints may be corrupted by thesefaults if the crash-latency in the program is not bounded [88, 89]. However, onGPGPU programs, we find that crash-causing faults do not propagate outside thekernel where faults occur. In other words, crash-latency of GPGPU programs isnaturally bounded within one kernel invocation. Therefore, one can place check-points at kernel boundaries for crash recovery. As we find that many kernels do notpropagate errors to other kernels, individual kernels could also recover from fail-ures through re-execution at the kernel boundaries of copying. The application canbe restarted locally on the same GPU or on a spare GPU. Further, as most kernelshave short execution times in the range of milliseconds, the cost of re-executing akernel would be insignificant.RQ7: Differences across platforms From our findings, it appears that error re-silience of GPGPU applications does not depend on the specific hardware platform(we have only validated it on platforms from the same manufacturer, which wasNvidia in our case). This suggests that one can perform resilience characterizationon one platform and generalize the results to a different platform.1277.7 SummaryIn this chapter, we study error propagation in GPGPU application with the goal ofbuilding targeted error detection and recovery mechanisms for them. We built afault injection tool LLFI-GPU, and defined metrics for quantifying propagation inGPGPU applications. We empirically studied error propagation across ten GPGPUapplications using LLFI-GPU. The main findings are: (1) Crash-causing faults inGPGPU are naturally kernel-bounded, (2) Error spreading in memory is highlyapplication dependent, (3) Most memory data corruptions lead to output corrup-tion unlike what is observed in CPU programs, and (4) The majority of faults aremasked within a single kernel execution, and do not propagate across kernels.128Chapter 8Understanding ErrorPropagation in Deep LearningNeural Network (DNN)Accelerators and ApplicationsThis chapter investigates error propagation in DNN accelerators and applications,which are recently deployed in safety-critical environment such as self-drivingcars. Many studies have focused on the performance aspects of the accelerators, buttheir reliability is not well understood. In this chapter, we first build the fault injec-tion infrastructure for DNN accelerators and applications, then characterize errorpropagation through an empirical study. We find that DNN accelerators and appli-cations have very unique error propagation characteristics compared with generalpurpose applications. Based on our investigation, we propose cost-effective errormitigation techniques for DNN accelerators and applications.8.1 IntroductionDeep learning neural network (DNN) applications are widely used in high-performancecomputing systems and datacenters [19, 44, 118]. Researchers have proposed theuse of specialized hardware accelerators to accelerate the inferencing process of129DNNs, consisting of thousands of parallel processing engines [34, 36, 62]. Forexample, Google recently announced its DNN accelerator, the Tensor ProcessingUnit (TPU), which they deploy in their datacenters for DNN applications [145].While the performance of DNN accelerators and applications have been ex-tensively studied, the reliability implications of using them is not well understood.One of the major sources of unreliability in modern systems are soft errors, typi-cally caused by high-energy particles striking electronic devices and causing themto malfunction (e.g., flip a single bit) [20, 40]. Such soft errors can cause appli-cation failures, and they can result in violations of safety and reliability specifica-tions. For example, the IEC 61508 standard [74] provides reliability specificationsfor a wide range of industrial applications ranging from the oil and gas industry tonuclear power plants. Deep neural networks have promising uses for data analyt-ics in industrial applications [149], but they must respect the safety and reliabilitystandards of the industries where they are employed.A specific and emerging example of HPC DNN systems is for self-driving cars(i.e., autonomous vehicles), which deploy high performance DNNs for real-timeimage processing and object identification. The high performance, low power, highreliability, and real-time requirements of these applications require hardware thatrivals that of the fastest and most reliable supercomputer (albeit with lower preci-sion and memory requirements). For instance, NVIDIA Xavier—a next-generationSoC with a focus on self-driving applications—is expected to deliver 20 Tops/s at20W in 16nm technology [126]. We focus our investigation into DNN systems1in this emerging HPC market due to its importance, stringent reliability require-ments, and its heavy use of DNNs for image analysis. The ISO 26262 standardfor functional safety of road vehicles mandates the overall FIT rate2 of the Systemon Chip (SoC) carrying the DNN inferencing hardware under soft errors to be lessthan 10 FIT [119]. This requires us to measure and understand the error resiliencecharacteristics of these high-performance DNN systems.This chapter takes a first step towards this goal by (1) characterizing the prop-agation of soft errors from the hardware to the application software of DNN sys-1We use the term DNN systems to refer to both the software and the hardware accelerator thatimplements the DNN.2Failure-in-Time rate: 1 FIT = 1 failure per 1 billion hours130tems, and (2) devising cost-effective mitigation mechanisms in both software andhardware, based on the characterization results .Traditional methods to protect computer systems from soft errors typicallyreplicate the hardware components (e.g., Triple Modular Redundancy or TMR).While these methods are useful, they often incur large overheads in energy, per-formance and hardware cost. This makes them very challenging to deploy in self-driving cars, which need to detect objects such as pedestrians in real time [83] andhave strict cost constraints. To reduce overheads, researchers have investigatedsoftware techniques to protect programs from soft errors, e.g., identifying vulnera-ble static instructions and selectively duplicating the instructions [52, 64, 88]. Themain advantage of these techniques is that they can be tuned based on the applica-tion being protected. However, DNN software typically has a very different struc-ture compared to general-purpose software. For example, the total number of staticinstruction types running on the DNN hardware is usually very limited (less thanfive) as they are repeatedly executing the multiply-accumulate (MAC) operations.This makes these proposed techniques very difficult to deploy as duplicating evena single static instruction will result in huge overheads. Thus, current protectiontechniques are DNN-agnostic in that they consider neither the characteristics ofDNN algorithms, nor the architecture of hardware accelerators. To the best of ourknowledge, we are the first to study the propagation of soft errors in DNN systemsand devise cost-effective solutions to mitigate their impact.We make the following major contributions in this paper:• We modify a DNN simulator to inject faults in four widely used neural net-works (AlexNet, CaffeNet, NiN and ConvNet) for image recognition, usinga canonical model of the DNN accelerator hardware.• We perform a large-scale fault injection study using the simulator for faultsthat occur in the data-path of accelerators. We classify the error propagationbehaviors based on the structure of the neural networks, data types, positionsof layers, and the types of layers.• We use a recently proposed DNN accelerator, Eyeriss [35], to study the effectof soft errors in different buffers and calculate its projected FIT rates.131• Based on our observations, we discuss the reliability implications of design-ing DNN accelerators and applications and also propose two cost-effectiveerror protection techniques to mitigate Silent Data Corruptions (SDCs) i.e.,incorrect outcomes. The first technique, symptom-based detectors, is imple-mented in software and the second technique, selective latch hardening, inhardware. We evaluate these techniques with respect to their fault coverageand overhead.Our main results and implications are as follows:• We find that different DNNs have different sensitivities to SDCs dependingon the topology of the network, the data type used, and the bit position ofthe fault. In particular, we find that only high-order bits are vulnerable toSDCs, and the vulnerability is proportional to the dynamic value range thedata type can represent. The implications of this result are twofold: (1)When designing DNN systems, one should choose a data type providingjust-enough dynamic value range and precision. The overall FIT rate ofthe DNN system can be reduced by more than an order of magnitude if wedo so - this is in line with recent work that has proposed the use of suchdata types for energy efficiency. (2) Leveraging the asymmetry of the SDCsensitivity of bits, we can selectively protect the vulnerable bits using ourproposed selective latch hardening technique. Our evaluation shows that thecorresponding FIT rate of the datapath can be reduced by 100x with about20% area overhead.• We observe that faults causing a large deviation in magnitude of values likelylead to SDCs. Normalization layers can reduce the impact of such faults byaveraging the faulty values with adjacent correct values, thereby mitigatingSDCs. While the normalization layers are typically used to improve perfor-mance of a DNN, they also boost its resilience. Based on the characteristics,we propose a symptom-based detector that provides 97.84% precision and92.08% recall in error detection, for selected DNNs and data types.• In our case study of Eyeriss, the sensitivity study of each hardware com-ponent shows that some buffers implemented to leverage data locality for132performance may dramatically increase the overall FIT rate of a DNN ac-celerator by more than 100x. This indicates that novel dataflows proposedin various studies should also add protection to these buffers as they maysignificantly degrade the reliability otherwise.• Finally, we find that for the Eyeriss accelerator platform, the FIT rates canexceed the safety standards (i.e., ISO 26262) by orders of magnitude withoutany protection. However, applying the proposed protection techniques canreduce the FIT rate considerably and restore it within the safety standards.8.2 Exploration of Design SpaceWe seek to understand how soft errors that occur in DNN accelerators propagatein DNN applications and cause SDCs (we define SDCs later in Section 8.3.6). Wefocus on SDCs as these are the most insidious of failures and cannot be detectedeasily. There are four parameters that impact of soft errors on DNNs:(1) Topology and Data Type: Each DNN has its own distinct topology whichaffects error propagation. Further, DNNs can also use different data types in theirimplementation. We want to explore the effect of the topology and data type on theoverall SDC probability.(2) Bit Position and Value: To further investigate the impact of data type onerror propagation, we examine the sensitivity of each bit position in the networksusing different data types. This is because the values represented by a data typedepend on the bit positions affected, as different data types interpret each bit dif-ferently (explained in Section 8.3.5). Hence, we want to understand how SDCprobabilities vary based on the bit corrupted in each data type and how the errorsresult in SDCs affect program values.(3) Layers: Different DNNs have different layers - this includes the differ-ences in type, position, and the total number of layers. We investigate how errorspropagate in different layers and whether the propagation is influenced by the char-acteristics of each layer.(4) Data Reuse: We want to understand how different data reuses implementedin the dataflows of DNN accelerators affects the SDC probability. Note that unlike133other parameters, data reuse is not a property of the DNN itself but of its hardwareimplementation.8.3 Experimental Setup8.3.1 NetworksTable 8.1: Networks UsedNetwork Dataset No. of Output Candidates TopologyConvNet [41] CIFAR-10 10 3 CONV + 2 FCAlexNet [81] ImageNet 1,000 5 CONV(with LRN) + 3FCCaffeNet [24] ImageNet 1,000 5 CONV(with LRN) + 3FCNiN [94] ImageNet 1,000 12 CONVWe focus on convolutional neural networks in DNNs, as they have shown greatpotential in solving many complex and emerging problems and are often executedin self-driving cars. There are four neural networks that we consider in Table 8.1.They range from the relatively simple 5-layer ConvNet to the 12-layer NiN. Thereasons we chose these networks are: (1) They have different topologies and meth-ods implemented to cover a variety of common features used in today’s DNNs, andthe details are publicly accessible, (2) they are often used as benchmarks in devel-oping and testing DNN accelerators [35, 77, 128], and (3) they are well knownto solve challenging problems, (4) and the official pre-trained models are freelyavailable. This allows us to fully reproduce the networks for benchmarking pu-poses. All of the networks perform the same task, namely image classification. Weuse the ImageNet dataset [75] for AlexNet, CaffeNet and NiN, and the CIFAR-10dataset [39] for ConvNet, as they were trained and tested with these datasets. Weuse these reference datasets and the pre-trained weights together with the corre-sponding networks from the Berkeley Vision and Learning Center (BVLC) [22].We list the details of each network in Table 8.1. As shown, all networks exceptNiN have fully-connected layers behind the convolutional layers. All four networksimplement ReLU as the activation function and use the max-pooling method intheir sub-sampling layers. Both AlexNet and CaffeNet use a Local Response Nor-134malization (LRN) layer following each of the first two convolutional layers - theonly difference is the order of the ReLU and the sub-sampling layer in each con-volution layer. In AlexNet, CaffeNet and ConvNet, there is a soft-max layer at thevery end of each network to derive the confidence score of each ranking, whichis also part of the network’s output. However, in NiN, there is no such soft-maxlayer. Hence, the output of the NiN network has only the ranking of each candidatewithout their confidence scores.Table 8.2: Data Reuses in DNN AcceleratorsWeightReuseImageReuseOutputReuseZhang et al. [153], Diannao [34],Dadiannao [36]N N NChakradhar et al. [28], Sri-ram et al. [131], Sankaradas etal. [122], nn-X [57], K-Brain [107],Origami [27]Y N NGupta et al. [60], Shidiannao [49],Peemen et al. [109]N N YEyeriss [35] Y Y YFigure 8.1: Architecture of general DNN accelerator1358.3.2 DNN AcceleratorsWe consider nine of DNN accelerators mentioned in Table 8.2. We separate thefaults in the datapaths of the networks from those in the buffers. We study datapathfaults based on the common abstraction of their execution units in Figure 8.1B.Thus, the results for datapath faults apply to all nine accelerators.For buffer faults, since the dataflow (and buffer structure) is different in eachaccelerator, we have to choose a specific design. We chose the Eyeriss acceler-ator for studying buffer faults because: (1) The dataflow of Eyeriss includes allthree data localities in DNNs listed in Table 8.2, which allows us to study the datareuse seen in other DNN accelerators, and (2) the design parameters of Eyerissare publicly available, which allows us to conduct a comprehensive analysis on itsdataflow and overall resilience.8.3.3 Fault ModelWe consider transient, single-event upsets that occur in the data path and buffers,both inside and outside the processing engines of DNN accelerators. We do notconsider faults that occur in combinational logic elements as they are much lesssensitive to soft errors than storage elements shown in recent studies [56, 125]. Wealso do not consider errors in control logic units. This is due to the nature of DNNaccelerators which are designed for offloaded data acceleration - the scheduling ismainly done by the host (i.e., CPU). Finally, because our focus is on DNN accel-erators, we do not consider faults in the CPU, main memory, or the memory/databuses.8.3.4 Fault Injection SimulationSince we do not have access to the RTL implementations of the accelerators, we usea DNN simulator for fault injection. We modified an open-source DNN simulatorframework, Tiny-CNN [142], which accepts Caffe pre-trained weights [23] of anetwork for inferencing and is written in C++. We map each line of code in thesimulator to the corresponding hardware component, so that we can pinpoint theimpact of the fault injection location in terms of the underlying microarchitecturalcomponents. We randomly inject faults in the hardware components we consider136by corrupting the values in the corresponding executions in the simulator. Thisfault injection method is in line with other related work [64, 82, 88, 93, 147].8.3.5 Data TypesDifferent data types offer different tradeoffs between energy consumption and per-formance in DNNs. Our goal is to investigate the sensitivity of different design pa-rameters in data types to error propagation. Therefore, we selected a wide range ofdata types that have different design parameters as listed in Table 8.3. We classifythem into two types: floating-point data type (FP) and fixed-point data type (FxP).For FP, we choose 64-bit double, 32-bit float, and 16-bit half-float, all of whichfollow the IEEE 745 floating-point arithmetic standard. We use the terms DOU-BLE, FLOAT, and FLOAT16 respectively for these FP data types in this study. ForFxPs, unfortunately there is no public information about how binary points (radixpoints) are chosen for specific implementations. Therefore, we choose differentbinary points for each FxP. We use the following notations to represent FxPs inthis work: 16b rb10 means a 16-bit integer with 1 bit for the sign, 5 bits for theinteger part, and 10 bits for the mantissa, from the leftmost bit to the rightmost bit.We consider three FxP types, namely 16b rb10, 32b rb10 and 32b rb26. They allimplement 2’s complement for their negative arithmetic. Any value that exceedsthe maximum or minimum dynamic value range will be saturated to the maximumor minimum value respectively.Table 8.3: Data types usedData Type FP or FxP Data Width Bits (From left to right)DOUBLE FP 64-bit 1 sign bit, 11 bits for exponent, 52 bits formantissaFLOAT FP 32-bit 1 sign bit, 8 bits for exponent, 23 bits for man-tissaFLOAT16 FP 16-bit 1 sign bit, 5 bits for exponent, 10 bits for man-tissa32b rb26 FxP 32-bit 1 sign bit, 5 bits for integer, 26 bits for man-tissa32b rb10 FxP 32-bit 1 sign bit, 21 bits for integer, 10 bits for man-tissa16b rb10 FxP 16-bit 1 sign bit, 5 bits for integer, 10 bits for man-tissa1378.3.6 Silent Data Corruption (SDC)We define the SDC probability as the probability of an SDC given that the faultaffects an architecturally visible state of the program (i.e., the fault was activated).This is in line with the definition used in other work [52, 64, 93, 147].In a typical program, an SDC would be a failure outcome in which the ap-plication’s output deviates from the correct (golden) output. This comparison istypically made on a bit-by-bit basis. However, for DNNs, there is often not a sin-gle correct output, but a list of ranked outputs each with a confidence score asdescribed in Section 3.5.1, and hence a bit-by-bit comparison would be mislead-ing. Consequently, we need to define new criteria to determine what constitutes anSDC for a DNN application. We define four kinds of SDCs as follows:• SDC-1: The top ranked element predicted by the DNN is different from thatpredicted by its fault-free execution. This is the most critical SDC becausethe top-ranked element is what is typically used for downstream processing.• SDC-5: The top ranked element is not one of the top five predicted elementsof the fault-free execution of the DNN.• SDC-10%: The confidence score of the top ranked element varies by morethan +/-10% of its fault-free execution.• SDC-20%: The confidence score of the top ranked element varies by morethan +/-20% of its fault-free execution.8.3.7 FIT Rate CalculationThe formula of calculating the FIT rate of a hardware structure is shown in Equa-tion 8.1, where Rraw is the raw FIT rate (estimated as 20.49 FIT/Mb by extrapolat-ing the results of Neale et al. [103]. The original measurement for a 28nm processis 157.62 FIT/MB in the paper. We project this for a 16nm process by applyingthe trend shown in Figure 1 of the Neale paper3). Scomponent is the size of the com-ponent, and SDCcomponent is the SDC probability of each component. We use this3We also adjusted the original measurement by a factor of 0.65 as there is a mistake we found inthe paper. The authors of the paper have acknowledged the mistake in private email communicationswith us.138formula to calculate the FIT rate of datapath components and buffer structures ofDNN accelerators, as well as the overall FIT rate of Eyeriss in Section 8.4.1 andSection 8.4.2.FIT = ∑componentRraw ∗Scomponent ∗SDCcomponent (8.1)Figure 8.2: SDC probability for different data types in different networks (forfaults in PE latches).8.4 Characterization ResultsWe organize the results based on the origins of faults (i.e., datapath faults and bufferfaults) for each parameter. We randomly injected 3,000 faults per latch, one faultfor each execution of the DNN application. The error bars for all the experimentalresults are calculated based on 95% confidence intervals.8.4.1 Datapath FaultsData Types and NetworksFigure 8.2 shows the results of the fault injection experiments on different networksand different data types. We make three observations based on the results.First, SDC probabilities vary across the networks for the same data type. For139example, using the FLOAT data type, the SDC probabilities for NiN are higher thanfor other networks using the same data type (except ConvNet - see reason below).This is because of the different structures of networks in terms of sub-sampling andnormalization layers which provide different levels of error masking - we furtherinvestigate this in Section 8.4.1. Further, ConvNet has the highest SDC propagationprobabilities among all the networks considered (we show the graph for ConvNetseparately as its SDC probabilities are significantly higher than the other networks).This is because the structure of ConvNet is much less deep than for other networks,and consequently there is higher error propagation in ConvNet. For example, thereare only 3 convolutional layers in ConvNet, whereas there are 12 in NiN. Further,ConvNet does not have normalization layers to provide additional error masking,unlike AlexNet and CaffeNet.Second, SDC probabilities vary considerably across data types. For instance,in AlexNet, SDC-1 can be as high as 7.19% using 32b rb10, and as low as 0.38%using FLOAT - the maximum difference is 240x (between SDC-1 in 32b rb10 and32b rb26). The reasons are further explored in Section 8.4.1.Finally, for networks using the ImageNet dataset (all except ConvNet), thereis little difference in the SDC probability for the four different kinds of SDCs fora particular network and data type. Recall that there are 1,000 output dimensionsin the ImageNet DNNs. If the top ranked output is changed by the error, the newranking is likely to be outside of the top five elements, and its confidence score willlikely change by more than 20%. However, in ConvNet, which uses the CIFAR-10dataset, the four SDC probabilities are quite different. This is because there areonly 10 output dimensions in ConvNet, and if the top ranked output is affected byerrors, it is still highly likely to stay within the top five elements. As a result, theSDC-5 probability is quite low for this network. On the other hand, the SDC-10% and SDC-20% probabilities are quite high in ConvNet compared to the othernetworks, as the confidence scores are more sensitive due to the small output di-mensions compared to the other networks. Note that since NiN does not provideconfidence scores in its output, we do not show SDC-10% and SDC-20% forNiN.Because there is little difference between the SDC types for three of the fournetworks, we focus on SDC-1 in the rest of the chapter and refer to them as SDCs140unless otherwise specified.Bit PositionWe show the results by plotting the SDC probabilities for each bit in each datatype. Due to space constraints, we only show the results for NiN using FLOAT andFLOAT16 data types for FP, and for CaffeNet using 32b rb26 and 32b rb10 forFxP. We however confirmed that similar observations apply to the rest of networksand data types.The results of NiN using FLOAT and FLOAT16 are shown in Figure 8.3A andFigure 8.3B respectively. For the FP data types, only the high-order exponent bitsare likely to cause SDCs (if corrupted), and not the mantissa and sign bits. Wealso observed that bit-flips that go from 0 to 1 in the high-order exponent bits aremore likely to cause SDCs than those that go from 1 to 0. This is because thecorrect values in each network are typically clustered around 0 (see Section 8.4.1)and hence, small deviations in the magnitude or sign bits do not matter as much.This is also why the per-bit SDC probability for FLOAT16 is lower than that forFLOAT. A corrupted bit in the exponent of the latter is likely to cause a largerdeviation from 0, which in turn is likely to result in an SDC.For FxP data types, we plot the results for CaffeNet using 32b rb26 and 32b rb10in Figure 8.3C and Figure 8.3D respectively. As can be seen, only bits in the in-teger parts of the fixed point data types are vulnerable. Both FxP data types have32-bit data widths but different binary point positions. We observed that the per-bitSDC probability for the data type 32b rb10 is much higher than that of 32b rb26.For example, the 30th bit in 32b rb10 has an SDC probability of 26.65% whereasthe same bit position only has an SDC probability of 0.22% in 32b rb26. Thisis because 32b rb26 has a smaller dynamic range of values compared to 32b rb10,and hence the corrupted value in the former is likely to be closer to 0 than the latter.This is similar to the FP case above.ValueWe chose AlexNet using FLOAT16 to explain how errors that result in SDCs affectprogram values. We randomly sampled the set of ACTS in the network that were141Figure 8.3: SDC probability variation based on bit position corrupted, bit po-sitions not shown have zero SDC probability (Y-axis is SDC probabilityand X-axis is bit position)affected by errors, and compared their values before (in green) and after (in red)error occurrence. We classified the results based on whether the errors led to SDCs(Figure 8.4A) or were benign (Figure 8.4B). There are two key observations: (1)If an error causes a large deviation in numeric values, it likely causes an SDC. Forexample, in Figure 8.4A, more than 80% of errors that lead to a large deviation leadto an SDC. (2) In Figure 8.4B, on the other hand, only 2% of errors that cause largedeviations result in benign faults. This is likely because large deviations make itharder for values in the network to converge back to their correct values which aretypically clustered around 0.We now ask the following questions. How close together are the correct (error-free) values in each network, and how much do the erroneous values deviate fromthe correct ones? Answering these questions will enable us to formulate efficienterror detectors. We list boundary values of ACTS profiled in each layer and net-work in Table 8.4. As can be seen, in each network and layer, the values arebounded within a relatively small range in each layer. Further, in the example of142Figure 8.4: Values before and after error occurrence in AlexNet usingFLOAT16AlexNet using FLOAT16, 80% of the erroneous values that lead to SDCs lie outsidethis range, while only 9.67% of the erroneous values that lead to benign outcomesdo so. Similar trends are found in AlexNet, CaffeNet, and NiN using DOUBLE,FLOAT, FLOAT16, and 32b rb10. This is because the data types provide moredynamic value range than the networks need. The redundant value ranges lead tolarger value deviation under faults and are more vulnerable to SDCs. This indi-cates that we can leverage symptom-based detectors to detect SDCs when thesedata types are used (Section 8.5.2). On the other hand, 16b rb10 and 32b rb26suppress the maximum dynamic value ranges. ConvNet is an exception: ConvNethas a limited number of outputs and a small stack of layers, as even a small pertur-bation in values may significantly affect the output rankings.143Table 8.4: Value range for each layer in different networks in the error-freeexecutionNetwork Layer1Layer2Layer3Layer4Layer5Layer6Layer7Layer8Layer9Layer10Layer11Layer12AlexNet -691.8136˜62.505-228.2962˜24.248-89.0519˜8.62-69.2451˜45.674-36.47471˜33.413-78.9784˜3.471-15.0431˜1.881-5.5421˜5.775N/A N/A N/A N/ACaffeNet -869.3496˜08.659-406.8591˜56.569-73.46528˜8.5085-46.32158˜5.3181-43.98781˜55.383-81.11673˜8.9238-14.65361˜0.4386-5.811581˜5.0622N/A N/A N/A N/ANiN -738.1997˜14.962-401.861˜267.8-397.6511˜388.88-1041.768˜75.372-684.9571˜082.81-249.481˜244.37-737.8459˜40.277-459.2925˜84.412-162.3144˜37.883-258.2732˜83.789-124.0011˜40.006-26.48358˜8.1108ConvNet -1.452161˜.38183-2.160611˜.71745-1.618431˜.37389-3.089034˜.94451-9.247911˜1.8078N/A N/A N/A N/A N/A N/A N/ALayer Position and TypeTo investigate how errors in different layers propagate through the network, westudy the error sensitivity of both the positions and types of layers. We show theSDC probability per layer, ordered by the position for AlexNet, CaffeNet, and NiNin Figure 8.5A and for ConvNet in Figure 8.5B.Figure 8.5: SDC probability per layer using FLOAT16, Y-axis is SDC prob-ability and X-axis is layer positionIn Figure 8.5A, in AlexNet and CaffeNet, we observed very low SDC probabil-ities in the first and second layers, compared to the other layers. The reason is theLocal Response Normalization (LRN) layers implemented at the end of the first144and second layers in these networks normalize the faulty values, thus mitigatingthe effect of large deviations in the values. However, there is no LRN or similarnormalization layer in the other layers. NiN and ConvNet do not have such a nor-malization layer, and hence, NiN (in Figure 8.5A) and ConvNet (in Figure 8.5B)have a relatively flat SDC probability across all convolutional layers (layers 1 to 12in NiN and layer 1 to layer 3 in ConvNet). We also observe there is an increase inthe SDC probabilities in layers in AlexNet and CaffeNet after the LRNs. This isbecause the later layers require narrower value ranges (Table 8.4), and hence widervalue ranges and bits are likely to result in SDCs. Note that the fully-connectedlayers in AlexNet (layer 6 to layer 8), CaffeNet (layer 6 to layer 8), and ConvNet(layers 4 and 5) have higher SDC probabilities. This is because (1) they are able todirectly manipulate the ranking of the output candidates, and (2) ACTS are fully-connected, and hence faults spread to all ACTS right away and have much higherimpact on the final outputs. Recall that there is no fully-connected layer in NiN,however, and hence this effect does not occur.To further illustrate the effect of LRN on mitigating error propagation, we mea-sured the average Euclidean distance between the ACT values in the fault injectionruns and the golden runs at the end of each layer after faults are injected at the firstlayer in different networks using the DOUBLE data type. The results are shown inFigure 8.6. We choose the DOUBLE data type for this experiment as it accentuatesany differences due to its wide value range. As we can see, the Euclidean distancedecreases sharply from the first layer to the second layer after LRN in AlexNet andCaffeNet. However, neither NiN nor ConvNet implement LRN or similar layers,and hence the Euclidean distances at each layer are relatively flat.Recall that the POOL and ReLU layers are implemented in all four networkswe consider, and are placed at the end of each convolutional and fully-connectedlayer after MACs. Recall also that POOL picks the local maximum ACT valueand discards the rest of the ACTS before forwarding them to the next layer, whileReLU resets values to zero if the value is negative. Since POOL and ReLU caneither discard or completely overwrite the values, they can mask errors in differentbits. Therefore, we study the bit-wise SDC probability (or error propagation rate)per layer after each POOL or ReLU structure in convolutional layers in Table 8.5.We measured this rate by comparing the ACT values bit by bit at the end of the145Figure 8.6: Euclidean distance between the erroneous values and correct val-ues of all ACTS at each layer of networks using DOUBLE, Y-axis isEuclidean distance and X-axis is layer position (Faults are injected atlayer 1)last layer. Due to space constraints, we only show the result of AlexNet usingFLOAT16, though similar results were observed in the other cases.Table 8.5: Percentage of bit-wise SDC across layers in AlexNet usingFLOAT16 (Error bar is from 0.2% to 0.63%)Layer 1 Layer 2 Layer 3 Layer 4 Layer 519.38% 6.20% 8.28% 6.08% 1.63%There are three main observations in Table 8.5: (1) In general, there is a de-creasing propagation probability across layers, as faults that occur in earlier layershave a higher probability of propagating to other layers and spreading. (2) Eventhrough many faults spread into multiple locations and reach the last layer, only asmall fraction of them (5.5% on average, compared with AlexNet in Figure 8.5A)will affect the final ranking of the output candidates. The reason, as explained inSection 8.4.1, is that the numerical value of ACTS affected by faults is a more in-fluential factor affecting the SDC probability than the number of erroneous ACTS.(3) A majority of the faults (84.36% on average) are masked by either POOL orReLU during the propagation and cannot even reach the last layer. Therefore, errordetection techniques that are designed to detect bit-wise mismatches (i.e., DMR)may detect many errors that ultimately get masked.Datapath FIT rateWe calculate the datapath FIT rates for different networks using each datatypebased on the canonical model of datapath in Figure 3.1B, and the formula in146Eq. 8.1. Note that the latches assumed in between execution units are the mini-mum sets of latches to implement the units, so our calculations of datapath FITrate are conservative. The results are listed in Table 8.6. As seen, the FIT ratevaries a lot depending on the network and data type used. For example, it rangesfrom 0.004 to 0.84 in NiN and ConvNet using 16b rb10, and from 0.002 to 0.42in AlexNet using 16b rb10 and 32b rb10. Depending on the design of the DNNaccelerator and the application, the datapath’s FIT rate may exceed the FIT budgetallowed for the DNN accelerator and will hence need to be protected. We willfurther discuss this in Section 8.5.1.Table 8.6: Datapath FIT rate in each data type and networkConvNet AlexNet CaffeNet NiNFLOAT 1.76 0.02 0.03 0.10FLOAT16 0.91 0.009 0.009 0.00832b rb26 1.73 0.002 0.005 0.00232b rb10 2.45 0.42 0.41 0.5416b rb10 0.84 0.002 0.007 0.0048.4.2 Buffer Faults: A Case Study on EyerissEyeriss [35] is a recently proposed DNN accelerator whose component parametersare publicly available. We use Eyeriss in this case study - the reason is articulated inSection 8.3.2. Other than the components shown in Figure 3.1, Eyeriss implementsa Filter SRAM, Img REG and PSum REG on each PE for data reuses listed inTable 3.1. The details of the dataflow are described in Chen et al. [35]. We adoptedthe original parameters of the Eyeriss microarchitecture at 65 nm and projected itto 16nm technology. For the purpose of this study, we simply scale the componentsof Eyeriss in proportion to process technology generation improvements, ignoringother architectural issues. In Table 8.7, we list the microarchitectural parametersof Eyeriss at the original 65nm and the corresponding projections at 16nm. Weassume a scaling factor of 2 for each technology generation, and as there are 4generations between 65nm and 16nm technology (based on published values bythe TSMC foundry), we scaled up the number of PEs and the sizes of buffers by afactor of 8. In the rest of this work, we use the parameters of Eyeriss at 16nm.147Table 8.7: Parameters of microarchitectures in Eyeriss (Assuming 16-bit datawidth, and a scaling factor of 2 for each technology generation)FeatureSizeNo. of PE Size ofGlobalBufferSize ofOne FilterSRAMSize ofOne ImgREGSize ofOne PSumREG65nm 168 98KB 0.344KB 0.02KB 0.05KB16nm 1,344 784KB 3.52KB 0.19KB 0.38KBData Reuse and BufferHere we measure and compare SDC probabilities of different buffers in Eyerissby randomly injecting 3,000 faults in each buffer component for different networkparameters. By analyzing the results, we found that faults in buffers exhibit simi-lar trends as datapath faults for each parameter of data type, network, value, layer,though the absolute SDC probabilities and FIT rates are different (usually higherthan for datapath faults due to the amount of reuse and size of the components).Hence, we do not repeat these sensitivity results in this section. Rather, we presentSDC probabilities and FIT rates for each buffer of the different networks in Ta-ble 8.8. The calculation of the FIT rate is based on Eq. 8.1. We show the resultsusing the 16b rb10 data type as a 16-bit FxP data type is implemented in Eyeriss.Table 8.8: SDC probability and FIT rate for each buffer component in Eyeriss(SDC probability / FIT Rate)Network GlobalBufferFilterSRAMImg REG PSum REGConvNet 69.70%/87.47 66.37%/62.74 70.90%/3.57 27.98%/2.82AlexNet 0.16%/0.20 3.17%/3.00 0.00%/0.00 0.06%/0.006CaffeNet 0.07%/0.09 2.87%/2.71 0.00%/0.00 0.17%/0.02NiN 0.03%/0.04 4.13%/3.90 0.00%/0.00 0.00%/0.00As seen, as the network becomes deeper, its buffers are much more immuneto faults. For example, ConvNet is less deep than the other three networks, andthe FIT rate of Global Buffer and Filter SRAM are respectively 87.47 and 62.74,compared to 0.2 and 3.0 for AlexNet. This sensitivity is consistent with datapathfaults. Another observation is that Img REG and PSum REG have relatively low148FIT rates as they both have smaller component sizes and a short time window forreuse: a faulty value in Img REG will only affect a single row of fmap and onlythe next accumulation operation if in PSum REG. Finally, we find that buffer FITrates are usually a few orders of magnitude higher than datapath FIT rates, andadding these buffers for reuse dramatically increases the overall FIT rate of DNNaccelerators. The reasons are twofold: (1) Buffers by nature have larger sizes thanthe total number of latches in the datapath, and (2) due to reuse, the same fault canbe read multiple times and lead to the spreading of errors to multiple locations in ashort time, resulting in more SDCs. Both lead to a higher FIT rate (See in Eq. 8.1).We will further discuss its implication in Section Mitigation of Error PropagationWe explore three directions to mitigate error propagation in DNN systems basedon the results in the previous section. First, we discuss the reliability implica-tions in designing DNN systems. Second, we adapt a previously proposed soft-ware based technique, Symptom-based Error Detectors (SED), for detecting errorsin DNN-based systems. Finally, we use a recently proposed hardware technique,Selective Latch Hardening (SLH) to detect and correct datapath faults. Both tech-niques leverage the observations made in the previous section and are optimizedfor DNNs.8.5.1 Implications to Resilient DNN Systems(1) Data Type: Based on our analysis in Section 8.4.1, we can conclude that DNNsshould use data types that provide just-enough numeric value range and precisionrequired to operate the target DNNs. For example, in Table 8.6, we found that theFIT rate of datapath can be reduced by more than two orders of magnitude if wereplace type 32b rb10 with type 32b rb26 (from 0.42 to 0.002).However, existing DNN accelerators tend to follow a one-size-fits-all approachby deploying a data type representation which is long enough to work for all com-putations in different layers and DNNs [34, 35, 49]. Therefore, it is not alwayspossible to eliminate redundant value ranges that the data type provides across lay-ers and DNNs. The redundant value ranges are particularly vulnerable to SDCs149as they tend to cause much larger value deviations (Section 8.4.1). We proposea low-cost symptom-based error detector that detects errors caused by the redun-dant value ranges regardless of whether conservative data types are used. A recentstudy has proposed a reduced precision protocol which stores data in shorter repre-sentations in memory and unfolds them when in the datapath to save energy [77].The approach requires hardware modifications and may not be always supportedin accelerators. We defer the reliability evaluation of the proposed protocol to ourfuture work.(2) Sensitive Bits: Once a restricted data type is used for a network, the dy-namic value range is suppressed, mitigating SDCs caused by out-of-range values.However, the remaining SDCs can be harder to detect as erroneous values hide innormal value ranges of the network. Fortunately, we observe that the remainingSDCs are also caused by bit-flips at certain bit positions (i.e., high bit positions inFxP) (Section 8.4.1). Hence, we can selectively harden these bits to mitigate theSDCs (Section 8.5.3).(3) Normalization Layers: The purpose of the normalization layers such asthe LRN layer is to increase the accuracy of the network [81]. In Section 8.4.1, wefound that LRN also increases the resilience of the network as it normalizes a faultyvalue with its adjacent fault-free values across different fmaps to mitigate SDCs.Therefore, one should use such layers if possible. Further, one should place errordetectors after such layers to leverage the masking opportunities, thus avoidingdetecting benign faults.(4) Data Reuse: Recently, multiple dataflows and architectures have beenproposed and demonstrated to provide both energy and performance improve-ments [34, 35]. However, adding these local buffers implementing more sophis-ticated dataflow dramatically increases the FIT rate of a DNN accelerator. Forexample, the FIT rate of Filter SRAMs (3.9 in NiN, Table 8.8) can be nearly 1000xhigher than the FIT rate of the entire datapath (0.004 in NiN, Table 8.8). Un-fortunately, however, protecting small buffers through ECC may incur very highoverheads due to smaller read granularities. We propose an alternative approach,SED, to protect these buffers at low cost (Section 8.5.2).(5) Datapath Protection: From Table 8.6, we found the datapath FIT ratealone can go up to 2.45 without careful design, or 0.84 even with a resilient data150type (16b rb10, in ConvNet). The safety standard, for example in ISO 26262,mandates the overall FIT rate of the SoC carrying DNN accelerator to be less than10 FIT. Since a DNN accelerator is often a very small area fraction of the total onthe SoC. The FIT budget allocated to the DNN accelerator should be only a tinyfraction of 10. Hence, datapath faults cannot be ignored as they stretch the FITbudget allocated to the DNN accelerator.8.5.2 Symptom-based Error Detectors (SED)A symptom-based detector is an error detector that leverages application-specificsymptoms under faults to detect anomalies. Examples of symptoms are unusualvalues of variables [64, 108], numbers of loop iterations [64, 88], or address spacesaccessed by the program [88, 108]. For the proposed detector, we use the valueranges of ACTS as the symptom to detect SDC-causing faults. This is based onthe observation from Section 8.4.1: If an error makes the magnitude of ACTS verylarge, it likely leads to an SDC, and if it does not, it is likely to be benign. In the de-sign of the error detector, there are two questions that need to be answered: Where(which program locations) and What (which reference value ranges) to check? Theproposed detector consists of two phases - we describe each phase along with theanswers to the two questions below:Learning: Before deploying the detector, a DNN application needs to beinstrumented and executed with its representative test inputs to derive the valueranges in each layer during the fault-free execution. We can use these value ranges,say -X to Y, as the bounds for the detector. However, to be safe, we apply anadditional 10% cushion on top of the value ranges of each layer, that is (-1.1*X) to(1.1*Y) as the reference values for each detector to reduce false alarms. Note thatthe learning phase is only required to be performed once before the deployment.Deployment: Once the detectors are derived, they are checked by the hostwhich off-loads tasks to the DNN accelerator. At the end of each layer, the dataof fmaps of the current layer are calculated and transferred to the global bufferfrom the PE array as the input data of ifmaps for the next layer. These data willstay in the global buffer during the entire execution of the next layer for reuse.This gives us an opportunity to execute the detector asynchronously from the host,and check the values in global buffer to detect errors. We perform the detection151asynchronously to keep the runtime overheads as low as possible.Figure 8.7: Precision and recall of the symptom-based detectors across net-works (Error bar is from 0.03% to 0.04%)Evaluation: After deploying the detector, we measure its coverage by ran-domly injecting 3,000 single bit-flip faults in each hardware component, usingall 3 FP data types and 32b rb10 in AlexNet, CaffeNet and NiN, one fault perfault injection run. As we explained earlier, we do not include the 16b rb10 and32b rb26 data types or the ConvNet network as they do not exhibit strong symp-toms, and hence symptom-based detectors are unlikely to provide high coverage inthese cases. So there are a total of 3,000*5*4*3=180,000 faults injected for thisexperiment.We define two metrics in our evaluation: (1) Precision: 1 - (The number ofbenign faults that are detected by the detector as SDC) / (The number of faultsinjected), and (2) Recall: (The number of SDC-causing faults that are detected bythe detector) / (The number of total SDC-causing faults). The results are shown inFigure 8.7A and Figure 8.7B for the precision and the recall respectively averagedacross the data types and components (due to space constraints, we only show theaverage values). As can be seen, the average precision is 90.21% and the averagerecall is 92.5%. Thus, we can reduce the FIT rates of Eyeriss using FLOAT andFLOAT16 by 96% (from 8.55 to 0.35) and 70% (from 2.63 to 0.79) respectivelyusing the symptom-based detector technique (based on Equation 8.1).1528.5.3 Selective Latch Hardening (SLH)Latch hardening is a hardware error mitigation technique that adds redundant cir-cuitry to sequential storage elements (i.e., latches) to make them less sensitive toerrors. Protecting the latches in the datapath can be vital for highly dependable sys-tems as they become the reliability bottleneck once all buffers are protected (e.g.,by ECCs). Given that the technology keeps scaling to smaller feature sizes [20, 40],it will be more important to mitigate datapath faults in the future. There have been anumber of different hardened latch designs that differ in their overheads and levelsof protection, and latch hardening need not be applied in an all-or-nothing manner.For example, Sullivan et al. [135] developed an analytical model for hardened latchdesign space exploration and demonstrated cost-effective protection by hardeningonly the most sensitive latches and by combining hardening techniques offeringdiffering strengths in an error-sensitivity proportional manner. Since we observedand characterized asymmetric SDC sensitivity in different bits in Section 8.4.1, wecan leverage this model to selectively harden each latch using the most efficienthardening technique to achieve sufficient error coverage at a low cost.Design Space Exploration: There are a wide variety of latch hardening tech-niques that vary in their level of protection and overheads. Table 8.9 shows thethree hardening techniques used by [135]; these same techniques are consideredin this chapter, though the methodology should apply with any available hardenedlatches. The baseline design in the table refers to an unprotected latch.Table 8.9: Hardened latches used in design space explorationLatch Type Area Overhead FIT Rate Reduc-tionBaseline 1x 1xStrike Suppression (RCC) 1.15x 6.3xRedundant Node (SEUT) 2x 37xTriplicated (TMR) 3.5x 1,000,000xEvaluation: Figure 8.8A shows the AlexNet FIT rate reduction versus thefraction of protected latches assuming a perfect hardening technique that com-pletely eliminates errors. We plot this curve to illustrate the maximum benefit onecan achieve by protecting bits that are more sensitive to SDC with priority. β153characterizes the asymmetry of SDC FIT rate in different bits (latches)—a high βindicates that a small number of latches dictate the overall SDC probability. Ascan be seen, there are negative exponential curves in the figure, such that we canselectively protect only the most sensitive latches for area-efficient SDC reduction.Figure 8.8: Selective Latch Hardening for the Eyeriss Accelerator runningAlexNetFigure 8.8B and Figure 8.8C show the AlexNet FIT rate reduction versus thelatch area overhead when using each hardened latch, and the optimal combinationof the hardened designs (Multi) generated by the model. Due to space constraints,we only show the AlexNet result for the FLOAT16 and 16b rb10 data types (theother networks and data types exhibit similar trends). By exploiting the asymme-try of error sensitivity in data type bits and combining complementary hardeningtechniques, one can achieve significant protection while paying only modest areacosts. For example, applying the three hardening techniques together can reducethe latch FIT rate by 100x, while incurring about 20% and 25% latch area over-heads in FLOAT16 and 16b rb10, respectively. This translates to a chip-level areaoverhead roughly akin to that required for ECC protection of the larger (and moreeasily protected) SRAM structures.8.6 SummaryDNNs (Deep Neural Networks) have gained prominence in recent years and theyare being deployed on hardware accelerators in self-driving cars for real-time im-154age classification. In this chapter, we characterize the impact of soft errors on DNNsystems through a large-scale fault injection experiment with 4 popular DNNs run-ning on a recently proposed DNN hardware accelerator. We find that the resilienceof a DNN system depends on the data types, values, data reuse, and the types oflayers in the design. Based on these insights, we formulate guidelines for design-ing resilient DNN systems and propose two efficient DNN protection techniquesto mitigate soft errors. We find that the techniques significantly reduce the rate ofSilent Data Corruption (SDC) in DNN systems with acceptable performance andarea overheads.155Chapter 9ConclusionIn this chapter, we first summarize the dissertation and describe its expected im-pact. We then briefly delineate possible directions for future work.9.1 SummaryWe had two main goals in this dissertation. First, we wanted to understand howerrors propagate in programs and lead to different types of failures. The under-standing helped us come up with techniques that can evaluate programs’ resilienceand guide the protection of programs, which was our second goal. We consideredboth general programs (i.e, CPU programs) and the ones executing on hardwareaccelerators (i.e., GPUs and DNN accelerators). We applied both empirical andanalytical approaches to achieve our goals. The dissertation had three parts as fol-low.We first targeted an important but often neglected type of failure in CPU pro-grams — LLCs. We conducted an empirical study in Chapter 4 to investigate errorpropagation that lead to LLCs, and then characterized the code patterns propagat-ing the faults. We found that it was possible to identify these code patterns throughprogram analyses, and to protect the code and eliminate LLCs in programs. Basedon the above observation, we proposed a heuristic-based technique that was ableto identify program locations that were responsible for more than 90% of LLCs inprograms. The proposed technique pruned the fault injection space by more than 9156orders of magnitude compared with an exhaustive fault injection approach.Secondly, we targeted SDCs, which are the most insidious type of failure, andare challenging to identify. We started our investigation in CPU programs as theyare the most common applications. In Chapter 5, we explored an analytical ap-proach to model error propagation that lead to SDCs in programs. Our proposedmodel is able to accurately estimate the SDC probabilities of both programs andindividual instructions without any fault injections. The model can be also usedto guide selective protection in a given program in a fast and accurate manner. InChapter 6, we discussed how the error propagation can be affected by multipleprogram inputs, and extended the analytical model to support multiple inputs. Weshowed that the extended analytical model can be used to bound the SDC prob-ability of a give program with multiple inputs without performing extensive faultinjections.Finally, we investigated error propagation in the applications that run on hard-ware accelerators such as GPUs and DNN accelerators (Chapter 7 and 8). Becausethese accelerators and applications have different architectures and programmingmodels, we observed different error propagation patterns compared with CPU pro-grams. We first built the tools that can be used to inject faults on the applications,then characterized their unique error propagation patterns. Based on the observa-tions, we proposed error mitigation techniques that were specifically targeted to theaccelerators and applications running on them, and hence more cost-effective thangeneric techniques.9.2 Expected ImpactThe first impact of this dissertation is to provide insights regarding identifyingvulnerable program locations that lead to different types of failures, for applica-tion developers to evaluate programs’ resilience and mitigate errors. Traditionally,vulnerable program locations were identified through extensive fault injection sim-ulations which are extremely time-consuming. Because of this, developers wereloathe to integrate resilience techniques into the software development process.We demonstrated (Chapter 4) that the code that lead to certain type of faults suchas LLCs mostly fall into certain code patterns, which can be identified by program157analysis techniques. This insight became the driving force for the rest of the thesis.It inspired us to investigate program-level characterization of error propagation thatlead to different types of faults on different platforms, which in turn allowed us tobuild automated techniques to identify the vulnerable parts of the program for theprotection based on different reliability targets.Furthermore, our research in Chapter 5, and 6 demonstrated that a systematiccharacterization of error propagation also enables us to build an analytical modelto track error propagation in programs. The analytical model not only identifies thevulnerable parts that propagate errors, but also quantitively analyzes their propa-gation probabilities. Using TRIDENT, we showed that it is even possible to com-pletely get rid of fault injection and hence significantly shorten the time taken in thewhole evaluation process from a few days to a few minutes. The high-speed per-formance of the model and its capability of quantification imply that the techniquemay be integrated into compiler toolchains for developers to fine tune programs’resilience online.Other than the practicability, the analytical model, TRIDENT and VTRI-DENT, also revealed the detailed steps of error propagation. Traditionally, re-searchers reply on FI approaches to study error propagation. In FI, faults are re-peatedly injected during the executions of programs, and the users wait util the endof the executions in order to observe failures, if any. This is a black-box technique,because it does not provide much insight into what happens during the propagationof the faults that are injected. Hence, for decades, researchers do not really havea solid understanding of error propagation, not to mention how to design highlycost-effective error detectors. In contrast, the analytical aspects of the models al-low researchers to understand the details of error propagation and obtain detailedinsights. Therefore, we believe that in the future, more efficient and cost-effectiveerror detection and mitigation techniques can be developed based on the knowledgeof error propagation characteristics.Last but not least, this dissertation places the reliability problem of acceleratorapplications in the limelight of system research. In the past, researchers focusedon the performance aspects of the accelerators because they were initially designedfor performance. However, with the rising deployment of accelerators in safety-critical applications, the reliability of accelerators has started playing a much more158important role. As discussed in Chapter 7 and Chapter 8, hardware acceleratorsand applications can be vulnerable to hardware errors without protections. Ourwork has focused on building handy tools for studying error propagation in theaccelerators and applications, and demonstrated that their error propagation hasvery different propagation patterns. We show it is possible to mitigate the errorpropagation in accelerators based on their unique characteristics in a cost-effectivemanner. We expect other researchers will leverage our tools to conduct reliabilitystudies, and design more reliable accelerators and applications in the future.9.3 Future WorkWe propose four potential future work directions as follows.Direction 1: Modeling Error Propagation in GPU ProgramsIn this thesis, CRASHFINDER, TRIDENT and VTRIDENT in Chapter 4, 5and Chapter 6 focused on error propagation that lead to LLCs and SDCs in CPUprograms. As mentioned in Chapter 7, the increasing error rate and rising popu-larity of GPUs accelerate the demand of developing fault-tolerant GPU programs.Since GPU programs typically contain hundreds of thousands of threads, they havea much larger space of fault injection sites compared with the CPU programs withsimilar lines of code. Hence, the resilience evaluation of GPU programs can bemuch more time-consuming in the development of fault-tolerant GPU applica-tions [50, 104]. One way to speed up the evaluation process is to extend our modelsin this thesis to support GPU programs. LLFI-GPU introduced in Chapter 7 canbe used to obtain further program-level insights of error propagation in GPU pro-grams in order to achieve the goal. Those future techniques can be used to designcost-effective error detectors for GPU programs and integrated into GPU compilertoolchains to enable online tuning of GPU programs’ resilience in the softwaredevelopment process.Direction 2: Pruning Profiling SpaceAnother direction to extend our analytical models, TRIDENT and VTRI-DENT, are through pruning profiling space in programs. As discussed in Chap-ter 5, the performance bottleneck in TRIDENT is the profiling phase which recordsa large amount of information of different program states etc. Studies [65, 104,159124] have shown that it is possible to select only a small subset of representativestates to project the overall error resilience of a program. Hence, one direction topursue is to find the subset of the representative states to profile when constructingthe model, in order to speed up the model.Direction 3: Other Fault ModelsThis dissertation mainly focused on the faults that occur in the data path of pro-cessors, which is one of the most challenging types of faults to detect and mitigate.Faults originating in memory are another major source. Current protection tech-niques leverage Error Correction Code (ECC) to mitigate memory faults. SinceECC memory ncurs non-negligible overheads in area, performance and energyconsumption, they are mainly deployed in high-end systems such as supercomput-ers and aerospace applications. However, with increasing error rates, it becomesnecessart to protect systems from memory faults in a tuneable and cost-effectiveway in future commodity systems. Therefore, selective protections at the levels ofinstruction, process, and applications need to be developed, which require one tounderstand error propagation caused by memory faults - this is an interesting futureresearch direction.Direction 4: Secure-Enough Software SystemsOne of the main advantages of software error mitigation techniques is to pro-vide flexible and selective protection in programs. Through selective protection,one can provide ”reliable-enough” computation for commodity systems with higherror coverage and low overheads, just like what demonstrated in this thesis. Incontrast, today’s security techniques are mostly designed to be either on and off,which may incur huge runtime overhead. We believe the idea of the reliable-enough computation can also be expanded to the security domain. For example, inone of the prevalent security attacks, row-hammer attack, malicious users can lever-age hardware deficiencies of memory modules to flip bits in the logic values thatare stored in memory, and manipulate the normal computation of programs [101].These bit-flip errors, while similar to the soft error that we discussed in this disser-tation, however, may have different impact on program executions and their conse-quences. Characterizing and understanding error propagation of the bit-flip errorsintroduced by row-hammer attacks in programs can be an interesting direction.Long-Term Direction: Formal Methods160In this thesis, our proposed techniques guiding selective protection are basedon empirical observations. As a result, they do not provide any guarantees on theerror coverage and performance overhead of the protections. One way to providesuch guarantees is to build a mathematically rigorous model of fault-tolerant ap-plications, so that developers can not only verify the property of error resilience ina more formal way, but also use mathematical proof as a complement to test theefficiency of the protections and ensure their correct behaviors.161Bibliography[1] Parallel circuit solver. [Online; accessed Apr. 2016]. → page 115[2] NVIDIA CUDA-GDB Documentation. [Online; accessed Apr. 2016]. →pages 18, 108[3] NVIDIA Multi-GPU Programming. [Online; accessed Apr. 2016]. → page115[4] NVIDIA SDK Samples. [Online; accessed Apr. 2016]. → page 108[5] Enabling on-the-fly manipulations with LLVM IR code of CUDA sources. [Online; accessed Apr. 2016]. →page 107[6] IEEE standard for floating-point arithmetic., 2008. IEEEStd 754-2008. → page 63[7] J. Aidemark, J. Vinter, P. Folkesson, and J. Karlsson. Goofi: Genericobject-oriented fault injection tool. In Dependable Systems and Networks,2001. DSN 2001. International Conference on, pages 83–88. IEEE, 2001.→ page 17[8] H. M. Aktulga, J. C. Fogarty, S. A. Pandit, and A. Y. Grama. Parallelreactive molecular dynamics: Numerical methods and algorithmictechniques. Parallel Computing, 38(4):245–259, 2012. → page 64[9] Alippi, Cesare, Vincenzo Piuri, and Mariagiovanna Sami. Sensitivity toerrors in artificial neural networks: A behavioral approach. IEEETransactions on Circuits and Systems I: Fundamental Theory andApplications, 1995. → page 13[10] Amazon. Amazon s3 availability event: July 20, 2008. 2008. URL → page 1162[11] R. Ashraf, R. Gioiosa, G. Kestor, R. DeMara, C.-Y. Cher, and P. Bose.Understanding the propagation of transient errors in HPC applications. InInternational Conference for High Performance Computing, Networking,Storage and Analysis(SC). IEEE, 2015. → pages104, 105, 107, 115, 119, 120[12] R. A. Ashraf, R. Gioiosa, G. Kestor, R. F. DeMara, C.-Y. Cher, and P. Bose.Understanding the propagation of transient errors in hpc applications. InProceedings of the International Conference for High PerformanceComputing, Networking, Storage and Analysis, page 72. ACM, 2015. →pages 17, 65[13] Autonomous car facts. Keynote: Autonomous Car A New Driver forResilient Computing and Design-for-Test, 2016. URL → page 22[14] F. Ayatolahi, B. Sangchoolie, R. Johansson, and J. Karlsson. A study of theimpact of single bit-flip and double bit-flip errors on program execution. InComputer Safety, Reliability, and Security, pages 265–276. Springer, 2013.→ page 117[15] C. Basile, L. Wang, Z. Kalbarczyk, and R. Iyer. Group communicationprotocols under errors. In 22nd International Symposium on ReliableDistributed Systems., pages 35–44, Oct 2003. → page 24[16] E. Battenberg and D. Wessel. Accelerating nonnegative matrixfactorization for audio source separation on multi-core and many-corearchitectures. In 10th International Society for Music Information RetrievalConference (ISMIR 2009), 2009. → page 115[17] Bettola, Simone, and Vincenzo Piuri. High performance fault-tolerantdigital neural networks. IEEE transactions on computers, 1998. → page 13[18] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The parsec benchmark suite:Characterization and architectural implications. In Proceedings of the 17thinternational conference on Parallel architectures and compilationtechniques, pages 72–81. ACM, 2008. → pages 25, 27, 38, 64, 83[19] Bojarski, Mariusz, Davide Del Testa, Daniel Dworakowski, BernhardFirner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel and others. End toend learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.→ page 129163[20] S. Borkar. Designing reliable systems from unreliable components: thechallenges of transistor variability and degradation. Micro, IEEE, 25(6):10–16, 2005. → pages 1, 104, 130, 153[21] S. Borkar. Electronics beyond nano-scale cmos. In Proceedings of the 43rdannual Design Automation Conference, pages 807–808. ACM, 2006. →page 1[22] BVCL. BERKELEY VISION AND LEARNING CENTER, 2014. URL → page 134[23] Caffe Model. Caffe Model Zoo, 2014. URL zoo.html. → page 136[24] CaffeNet. CaffeNet, 2014. URL zoo.html. → page 134[25] J. Calhoun, L. Olson, and M. Snir. Flipit: An LLVM based fault injectorfor HPC. In Euro-Par: Parallel Processing Workshops, pages 547–558.Springer, 2014. → page 105[26] J. Carreira, H. Madeira, and J. G. Silva. Xception: A technique for theexperimental evaluation of dependability in modern computers. IEEETransactions on Software Engineering, 24(2):125–136, 1998. → page 17[27] Cavigelli, Lukas, David Gschwend, Christoph Mayer, Samuel Willi, BeatMuheim, and Luca Benini. Origami: A convolutional network accelerator.In In Proceedings of the 25th edition on Great Lakes Symposium on VLSI,2015. → pages 22, 135[28] Chakradhar, Srimat, Murugan Sankaradas, Venkata Jakkula, and SrihariCadambi. A dynamically configurable coprocessor for convolutional neuralnetworks. In ACM SIGARCH Computer Architecture News, 2010. → pages22, 135[29] S. Chandra and P. M. Chen. How fail-stop are faulty programs? InFault-Tolerant Computing. Digest of Papers. Twenty-Eighth AnnualInternational Symposium on, pages 240–249. IEEE, 1998. → page 9[30] S. Chandra and P. M. Chen. The impact of recovery mechanisms on thelikelihood of saving corrupted state. In Proceedings of the 13thInternational Symposium on Software Reliability Engineering, ISSRE.,pages 91–101. IEEE, 2002. → page 46164[31] C.-K. Chang, S. Lym, N. Kelly, M. B. Sullivan, and M. Erez. Evaluatingand accelerating high-fidelity error injection for hpc. In Evaluating andAccelerating High-Fidelity Error Injection for HPC, page 0. IEEE, 2018.→ page 27[32] Chatterjee, Avhishek, and Lav R. Varshney. Energy-reliability limits innanoscale neural networks. In The 51st Annual Conference on InformationSciences and Systems (CISS), pages 1–6, 2017. → page 14[33] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, andK. Skadron. Rodinia: A benchmark suite for heterogeneous computing. InInternational Symposium on Workload Characterization (IISWC 2009),pages 44–54. IEEE, 2009. → pages 52, 57, 64, 83, 114[34] Chen, Tianshi, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, YunjiChen, and Olivier Temam. Diannao: A small-footprint high-throughputaccelerator for ubiquitous machine-learning. In ACM Sigplan Notices,2014. → pages 14, 20, 22, 130, 135, 149, 150[35] Chen, Yu-Hsin, Joel Emer, and Vivienne Sze. Eyeriss: A spatialarchitecture for energy-efficient dataflow for convolutional neuralnetworks. In Proceedings of the International Symposium on ComputerArchitecture (ISCA), pages 367–379, 2016. → pages14, 20, 21, 22, 131, 134, 135, 147, 149, 150[36] Chen, Yunji, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang,Ling Li and others. Dadiannao: A machine-learning supercomputer. InProceedings of the International Symposium on Microarchitecture(MICRO), 2014. → pages 22, 130, 135[37] C.-Y. Cher, M. S. Gupta, P. Bose, and K. P. Muller. Understanding softerror resiliency of Blue Gene/Q compute chip through hardware protonirradiation and software fault injection. In Proceedings of the 2014International Conference for High Performance Computing, Networking,Storage and Analysis (SC). IEEE, November 2014.doi:10.1109/SC.2014.53. → page 117[38] H. Cho, S. Mirkhani, C.-Y. Cher, J. A. Abraham, and S. Mitra. Quantitativeevaluation of soft error injection techniques for robust system design. InACM/EDAC/IEEE Design Automation Conference (DAC), pages 1–10.IEEE, 2013. → pages 107, 117165[39] CIFAR dataset. CIFAR-10, 2014. URL∼kriz/cifar.html. → page 134[40] C. Constantinescu. Intermittent faults and effects on reliability ofintegrated circuits. In Reliability and Maintainability Symposium, pages370–374. IEEE, 2008. → pages 1, 104, 130, 153[41] ConvNet. High-performance C++/CUDA implementation of convolutionalneural networks, 2014. URL →page 134[42] J. J. Cook and C. Zilles. A characterization of instruction-level errorderating and its implications for error detection. In InternationalConference on Dependable Systems and Networks(DSN), pages 482–491.IEEE, 2008. → pages 9, 15, 75, 101[43] E. W. Czeck and D. P. Siewiorek. Observations on the effects of faultmanifestation as a function of workload. IEEE Transactions on Computers,41(5):559–566, 1992. → pages 10, 79, 81, 82[44] Dahl, George E., Dong Yu, Li Deng, and Alex Acero. Context-dependentpre-trained deep neural networks for large-vocabulary speech recognition.IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30–42, 2012. → page 129[45] S. Di and F. Cappello. Adaptive impact-driven detection of silent datacorruption for hpc applications. IEEE Transactions on Parallel andDistributed Systems, 27(10):2809–2823, 2016. → page 127[46] D. Di Leo, F. Ayatolahi, B. Sangchoolie, J. Karlsson, and R. Johansson. Onthe impact of hardware faults–an investigation of the relationship betweenworkload inputs and failure mode distributions. Computer Safety,Reliability, and Security, pages 198–209, 2012. → pages 11, 76, 79[47] C. Di Martino, W. Kramer, Z. Kalbarczyk, and R. Iyer. Measuring andunderstanding extreme-scale application resilience: A field study of5,000,000 HPC application runs. In IEEE/IFIP International Conferenceon Dependable Systems and Networks (DSN), pages 25–36. IEEE, 2015.→ page 104[48] M. Dimitrov, M. Mantor, and H. Zhou. Understanding software approachesfor GPGPU reliability. In Workshop on General Purpose Processing onGraphics Processing Units, pages 94–104. ACM, 2009. → page 13166[49] Du, Zidong, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, TaoLuo, Xiaobing Feng, Yunji Chen, and Olivier Temam. Shidiannao: shiftingvision processing closer to the sensor. In ACM SIGARCH ComputerArchitecture News, 2015. → pages 22, 135, 149[50] B. Fang, K. Pattabiraman, M. Ripeanu, and S. Gurumurthi. Gpu-qin: Amethodology for evaluating the error resilience of GPGPU applications. InInternational Symposium on Performance Analysis of Systems andSoftware (ISPASS), pages 221–230. IEEE, 2014. → pages12, 104, 105, 108, 113, 115, 159[51] B. Fang, Q. Lu, K. Pattabiraman, M. Ripeanu, and S. Gurumurthi. ePVF:An enhanced program vulnerability factor methodology for cross-layerresilience analysis. In Proceedings of the International Conference onDependable Systems and Networks (DSN), pages 168–179. IEEE, 2016. →pages 4, 9, 10, 17, 55, 58, 65, 69, 72, 73, 74, 75, 76, 77[52] S. Feng, S. Gupta, A. Ansari, and S. Mahlke. Shoestring: probabilistic softerror reliability on the cheap. In ACM SIGARCH Computer ArchitectureNews, volume 38, pages 385–396. ACM, 2010. → pages2, 9, 10, 12, 14, 15, 17, 24, 49, 50, 69, 72, 74, 76, 84, 101, 131, 138[53] Fernandes, Fernando and Weigel, Lucas and Jung, Claudio and Navaux,Philippe and Carro, Luigi and Rech, Paolo. Evaluation of histogram oforiented gradients soft errors criticality for automotive applications. ACMTransactions on Architecture and Code Optimization (TACO), 13(4):38,2016. → page 14[54] P. Folkesson and J. Karlsson. The effects of workload input domain onfault injection results. In European Dependable Computing Conference,pages 171–190, 1999. → pages 11, 79, 81, 82[55] A. Geist. How to kill a supercomputer: Dirty power, cosmic rays, and badsolder. ACM, 2016. → page 104[56] Gill, B., N. Seifert, and V. Zia. Comparison of alpha-particle andneutron-induced combinational and sequential logic error rates at the 32nmtechnology node. In Proceedings of the International Reliability PhysicsSymposium (IRPS), 2009. → page 136[57] Gokhale, Vinayak, Jonghoon Jin, Aysegul Dundar, Berin Martini, andEugenio Culurciello. A 240 g-ops/s mobile coprocessor for deep neural167networks. In In Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition Workshops, 2014. → pages 22, 135[58] W. Gu, Z. Kalbarczyk, R. K. Iyer, and Z. Yang. Characterization of linuxkernel behavior under errors. In International Conference on DependableSystems and Networks. IEEE Computer Society, 2003. → pages9, 24, 26, 104[59] Guanpeng Li, Karthik Pattabiraman, Siva Kumar Sastry Hari, MichaelSullivan and Timothy Tsai. Modeling soft-error propagation in programs.In IEEE/IFIP International Conference on Dependable Systems andNetworks (DSN). IEEE, 2018. → pages vi, 12, 81, 85, 91, 96, 98[60] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deeplearning with limited numerical precision. In International Conference onMachine Learning, pages 1737–1746, 2015. → pages 22, 135[61] S. Gupta, T. Patel, C. Engelmann, and D. Tiwari. Failures in large scalesystems: long-term measurement, analysis, and implications. InProceedings of the International Conference for High PerformanceComputing, Networking, Storage and Analysis, page 44. ACM, 2017. →pages 10, 11[62] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally.Eie: efficient inference engine on compressed deep neural network. InComputer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual InternationalSymposium on, pages 243–254. IEEE, 2016. → pages 14, 20, 130[63] S. Hari, T. Tsai, M. Stephenson, S. Keckler, and J. Emer. Sassifi:Evaluating resilience of GPU applications. In SELSE: IEEE Workshop ofSilicon Errors in Logic. IEEE, 2015. → pages 12, 104, 105, 108, 113, 115[64] S. K. S. Hari, S. V. Adve, and H. Naeimi. Low-cost program-leveldetectors for reducing silent data corruptions. In Dependable Systems andNetworks (DSN), 42nd Annual IEEE/IFIP International Conference on,pages 1–12. IEEE, 2012. → pages 2, 24, 46, 65, 131, 137, 138, 151[65] S. K. S. Hari, S. V. Adve, H. Naeimi, and P. Ramachandran. Relyzer:exploiting application-level fault equivalence to analyze applicationresiliency to transient faults. In ACM SIGARCH Computer ArchitectureNews, volume 40, pages 123–134. ACM, 2012. → pages2, 9, 12, 15, 50, 81, 115, 159168[66] S. K. S. Hari, R. Venkatagiri, S. V. Adve, and H. Naeimi. Ganges: Gangerror simulation for hardware resiliency evaluation. In ACM/IEEE 41stInternational Symposium on Computer Architecture (ISCA), pages 61–72.IEEE, 2014. → pages 9, 12, 17, 50, 81[67] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residuallearning for image recognition. Proceedings of the IEEE conference oncomputer vision and pattern recognition, 2015. → page 19[68] J. L. Henning. Spec cpu2000: Measuring cpu performance in the newmillennium. Computer, 33(7):28–35, 2000. → pages 25, 27, 38, 64, 83[69] M. Hiller, A. Jhumka, and N. Suri. Propane: an environment for examiningthe propagation of errors in software. In ACM SIGSOFT SoftwareEngineering Notes, volume 27, pages 81–85. ACM, 2002. → page 105[70] Coral benchmarks. → page 84[71] Github. → page 84[72] Github. → page 84[73] K.-H. Huang and J. A. Abraham. Algorithm-based fault tolerance formatrix operations. volume 100, pages 518–528. IEEE, 1984. → page 127[74] IEC 61508. Functional Safety and IEC 61508, 2016. URL → page 130[75] ImageNet. ImageNet, 2014. URL → page 134[76] H. Jeon and M. Annavaram. Warped-DMR: Light-weight error detectionfor GPUGPU. In IEEE/ACM International Symposium onMicroarchitecture, pages 37–47. IEEE Computer Society, 2012. → page 13[77] Judd, Patrick, Jorge Albericio, Tayler Hetherington, Tor M. Aamodt,Natalie Enright Jerger, and Andreas Moshovos. Proteus: Exploitingnumerical precision variability in deep neural networks. 2016. → pages134, 150[78] G. A. Kanawati, N. A. Kanawati, and J. A. Abraham. Ferrari: A flexiblesoftware-based fault and error injection system. IEEE Transactions oncomputers, (2):248–260, 1995. → page 17169[79] I. Karlin. Lulesh programming model and performance ports overview. Ports.pdf. [Accessed Apr. 2016]. →pages 64, 120[80] I. Karlin, A. Bhatele, B. L. Chamberlain, J. Cohen, Z. Devito, M. Gokhale,R. Haque, R. Hornung, J. Keasler, D. Laney, et al. Lulesh programmingmodel and performance ports overview. 2012. → page 115[81] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. Imagenetclassification with deep convolutional neural networks. In Advances inneural information processing systems, 2012. → pages 19, 134, 150[82] I. Laguna, M. Schulz, D. F. Richards, J. Calhoun, and L. Olson. Ipas:Intelligent protection against silent output corruption in scientificapplications. In Proceedings of the 2016 International Symposium on CodeGeneration and Optimization, pages 227–238. ACM, 2016. → pages65, 137[83] Lane, Nicholas D., and Petko Georgiev. Can deep learning revolutionizemobile sensing? In In Proceedings of the 16th International Workshop onMobile Computing Systems and Applications, 2015. → page 131[84] A. Lanzaro, R. Natella, S. Winter, D. Cotroneo, and N. Suri. An empiricalstudy of injected versus actual interface errors. In Proceedings of theInternational Symposium on Software Testing and Analysis, pages397–408. ACM, 2014. → page 8[85] C. Lattner and V. Adve. Llvm: A compilation framework for lifelongprogram analysis & transformation. In Code Generation and Optimization.International Symposium on, pages 75–86. IEEE, 2004. → pages17, 25, 37, 50, 105[86] LeCun, Yann, Koray Kavukcuoglu, and Clment Farabet. Convolutionalnetworks and applications in vision. In Proceedings of IEEE InternationalSymposium on Circuits and Systems, 2010. → page 18[87] G. Li and K. Pattabiraman. Modeling input-dependent error propagation inprograms. In Proceedings of the International Conference on DependableSystems and Networks (DSN), 2018. → page vi[88] G. Li, Q. Lu, and K. Pattabiraman. Fine-grained characterization of faultscausing long latency crashes in programs. In IEEE/IFIP InternationalConference on Dependable Systems and Networks (DSN), pages 450–461.170IEEE, 2015. → pagesv, 10, 12, 17, 81, 84, 101, 104, 107, 127, 131, 137, 151[89] G. Li, K. Pattabiraman, C.-Y. Cher, and P. Bose. Experience report: Anapplication-specific checkpointing technique for minimizing checkpointcorruption. In 26th International Symposium on Software ReliabilityEngineering (ISSRE), pages 141–152. IEEE, 2015. → pages101, 104, 105, 125, 126, 127[90] G. Li, K. Pattabiraman, C.-Y. Cher, and P. Bose. Understanding errorpropagation in GPGPU applications. In International Conference for HighPerformance Computing, Networking, Storage and Analysis, pages240–251. IEEE, 2016. → pages v, 9, 17, 69[91] G. Li, S. K. S. Hari, M. Sullivan, T. Tsai, K. Pattabiraman, J. Emer, andS. W. Keckler. Understanding error propagation in deep learning neuralnetwork (dnn) accelerators and applications. In Proceedings of theInternational Conference for High Performance Computing, Networking,Storage and Analysis, page 8. ACM, 2017. → page vi[92] G. Li, K. Pattabiraman, and N. DeBardeleben. Tensorfi: A configurablefault injector for tensorflow applications. In IEEE International Workshopon Software Certification (WoSoCer), 2018.[93] Li, Guanpeng and Pattabiraman, Karthik and Cher, Chen-Yong and Bose,Pradip. Understanding error propagation in gpgpu applications. InProceedings of the International Conference on High PerformanceComputing, Networking, Storage and Analysis (SC), 2016. → pages137, 138[94] Lin, Min and Chen, Qiang and Yan, Shuicheng. Network in network. arXivpreprint arXiv:1312.4400, 2013. → page 134[95] S. Liu, K. Pattabiraman, T. Moscibroda, and B. G. Zorn. Flikker: savingdram refresh-power through critical data partitioning. In Proceedings of thesixteenth international conference on Architectural support forprogramming languages and operating systems, pages 213–224. ACM,2011. → pages 1, 2[96] Q. Lu, M. Farahani, J. Wei, A. Thomas, and K. Pattabiraman. LLFI: Anintermediate code level fault injector for hardware faults. In InternationalConference on Quality, Reliability and Security (QRS). IEEE, 2015. →page 117171[97] Q. Lu, G. Li, K. Pattabiraman, M. S. Gupta, and J. A. Rivers. Configurabledetection of sdc-causing errors in programs. ACM Transactions onEmbedded Computing Systems (TECS), 16(3):88, 2017. → pages2, 4, 10, 12, 14, 15, 24, 50, 72, 73, 75, 84, 101[98] A. Mahmoud, R. Venkatagiri, K. Ahmed, S. V. Adve, D. Marinov, andS. Misailovic. Leveraging software testing to explore input dependence forapproximate computing. Workshop on Approximate Computing Across theStack (WAX), 2017. → page 11[99] G. B. Mathews. On the partition of numbers. Proceedings of the LondonMathematical Society, 1(1):486–490, 1896. → pages 72, 101[100] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin. Asystematic methodology to compute the architectural vulnerability factorsfor a high-performance microprocessor. In Proceedings of theInternational Symposium on Microarchitecture (MICRO), pages 29–40.IEEE, 2003. → pages 4, 10[101] O. Mutlu. The rowhammer problem and other issues we may face asmemory becomes denser. In Proceedings of the Conference on Design,Automation & Test in Europe, pages 1116–1121. European Design andAutomation Association, 2017. → page 160[102] S. Narayanan, J. Sartori, R. Kumar, and D. L. Jones. Scalable stochasticprocessors. In Proceedings of the Conference on Design, Automation andTest in Europe, pages 335–338. European Design and AutomationAssociation, 2010. → page 2[103] Neale, Adam, and Manoj Sachdev. Neutron radiation induced soft errorrates for an adjacent-ECC protected SRAM in 28 nm CMOS. 2016. →page 138[104] B. Nie, L. Yang, A. Jog, and E. Smirni. Fault site pruning for practicalreliability analysis of gpgpu applications. 2018. → page 159[105] N. Oh, P. P. Shirvani, and E. J. McCluskey. Control-flow checking bysoftware signatures. Transactions on Reliability, 51(1):111–122, 2002. →page 15[106] N. Oh, P. P. Shirvani, and E. J. McCluskey. Error detection by duplicatedinstructions in super-scalar processors. Reliability, IEEE Transactions on,51(1):63–75, 2002. → pages 13, 106, 127172[107] Park, Seongwook, Kyeongryeol Bong, Dongjoo Shin, Jinmook Lee,Sungpill Choi, and Hoi-Jun Yoo. 4.6 a1. 93tops/w scalable deeplearning/inference processor with tetra-parallel mimd architecture forbig-data applications. In International Solid-State Circuits Conference. →pages 22, 135[108] K. Pattabiraman, G. P. Saggese, D. Chen, Z. Kalbarczyk, and R. K. Iyer.Automated derivation of application-specific error detectors using dynamicanalysis. volume 8, pages 640–655. IEEE, 2011. → pages 126, 151[109] Peemen, Maurice, Arnaud AA Setio, Bart Mesman, and Henk Corporaal.Memory-centric accelerator design for convolutional neural networks. InIEEE 31st International Conference on Computer Design (ICCD), 2013.→ pages 22, 135[110] A. J. Pen˜a, W. Bland, and P. Balaji. Vocl-ft introducing techniques forefficient soft error coprocessor recovery. In International Conference forHigh Performance Computing, Networking, Storage and Analysis(SC),page 71. ACM, 2015. → page 13[111] Piuri, Vincenzo. Analysis of fault tolerance in artificial neural networks.Journal of Parallel and Distributed Computing, 2001. → page 13[112] L. Rashid, K. Pattabiraman, and S. Gopalakrishnan. Modeling thepropagation of intermittent hardware faults in programs. In DependableComputing (PRDC), IEEE 16th Pacific Rim International Symposium on,pages 19–26. IEEE, 2010. → page 16[113] Reagen, Brandon and Whatmough, Paul and Adolf, Robert and Rama,Saketh and Lee, Hyunkwang and Lee, Sae Kyu and Herna´ndez-Lobato,Jose´ Miguel and Wei, Gu-Yeon and Brooks, David. Minerva: Enablinglow-power, highly-accurate deep neural network accelerators. InProceedings of the International Symposium on Computer Architecture(ISCA), pages 267–278, 2016. → page 14[114] D. A. Reed and J. Dongarra. Exascale computing and big data. volume 58,pages 56–68. ACM, 2015. → page 104[115] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August. Swift:Software implemented fault tolerance. In Proceedings of the internationalsymposium on Code generation and optimization, pages 243–254. IEEEComputer Society, 2005. → page 13173[116] V. Reva. An efficient cuda implementation of the tree-based barnes hutn-body algorithm. Elsevier, 2014. → page 115[117] V. Robert and X. Leroy. A formally-verified alias analysis. In CertifiedPrograms and Proofs, pages 11–26. Springer, 2012. → page 47[118] Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh,Sean Ma, Zhiheng Huang and others. Imagenet large scale visualrecognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015. → page 129[119] Safety Standard. ISO-26262: Road Vehicles Functional safety, 2016. URL 26262. → pages 22, 130[120] A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, andD. Grossman. Enerj: Approximate data types for safe and generallow-power computation. In ACM SIGPLAN Notices, volume 46, pages164–174. ACM, 2011. → pages 1, 2[121] B. Sangchoolie, K. Pattabiraman, and J. Karlsson. One bit is (not) enough:An empirical study of the impact of single and multiple bit-flip errors. InInternational Conference on Dependable Systems and Networks (DSN),pages 97–108. IEEE, 2017. → pages 17, 64, 76[122] Sankaradas, Murugan, Venkata Jakkula, Srihari Cadambi, SrimatChakradhar, Igor Durdanovic, Eric Cosatto, and Hans Peter Graf. Amassively parallel coprocessor for convolutional neural networks. In IEEEInternational Conference on Application-specific Systems, Architecturesand Processors, 2009. → pages 22, 135[123] S. K. Sastry Hari, M.-L. Li, P. Ramachandran, B. Choi, and S. V. Adve.mswat: low-cost hardware fault detection and diagnosis for multicoresystems. In Proceedings of the 42nd Annual IEEE/ACM InternationalSymposium on Microarchitecture, pages 122–132. ACM, 2009. → page 27[124] S. K. Sastry Hari, R. Venkatagiri, S. V. Adve, and H. Naeimi. Ganges:Gang error simulation for hardware resiliency evaluation. ACM SIGARCHComputer Architecture News, 42(3):61–72, 2014. → pages 2, 160[125] Seifert, Norbert, Balkaran Gill, Shah Jahinuzzaman, Joseph Basile, VinodAmbrose, Quan Shi, Randy Allmon, and Arkady Bramnik. Soft errorsusceptibilities of 22 nm tri-gate devices. 2012. → page 136174[126] D. Shapiro. Introducing Xavier, the NVIDIA AI supercomputer for thefuture of autonomous transportation, 2016. URL → page 130[127] V. C. Sharma, A. Haran, Z. Rakamaric, and G. Gopalakrishnan. Towardsformal approaches to system resilience. In Pacific Rim InternationalSymposium on Dependable Computing (PRDC), pages 41–50. IEEE, 2013.→ page 105[128] Simonyan, Karen, and Andrew Zisserman. Very deep convolutionalnetworks for large-scale image recognition. arXiv preprintarXiv:1409.1556, 2014. → page 134[129] M. Snir, R. W. Wisniewski, J. A. Abraham, S. V. Adve, S. Bagchi, P. Balaji,J. Belak, P. Bose, F. Cappello, B. Carlson, et al. Addressing failures inexascale computing. Institute for Computing in Science (ICiS). More infor,4:11, 2012. → page 49[130] V. Sridharan and D. R. Kaeli. Eliminating microarchitectural dependencyfrom architectural vulnerability. In 15th International Symposium on HighPerformance Computer Architecture. → pages 9, 50, 55, 74, 76[131] Sriram, Vinay, David Cox, Kuen Hung Tsoi, and Wayne Luk. Towards anembedded biologically-inspired machine vision processor. In InField-Programmable Technology, 2010. → pages 22, 135[132] D. T. Stott, B. Floering, D. Burke, Z. Kalbarczpk, and R. K. Iyer. Nftape: aframework for assessing dependability in distributed systems withlightweight fault injectors. In Computer Performance and DependabilitySymposium, 2000. IPDS 2000. Proceedings. IEEE International, pages91–100. IEEE, 2000. → page 17[133] J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari,G. D. Liu, and W.-m. W. Hwu. Parboil: A revised benchmark suite forscientific and commercial throughput computing. Center for Reliable andHigh-Performance Computing, 2012. → pages 25, 27, 38, 114[134] Student. The probable error of a mean. Biometrika, pages 1–25, 1908. →pages 67, 68[135] Sullivan, Michael and Zimmer, Brian and Hari, Siva and Tsai, Timothy andKeckler, Stephen W. An analytical model for hardened latch selection andexploration. 2016. → page 153175[136] R. Taborda and J. Bielak. Large-scale earthquake simulation:computational seismology and complex engineering systems. Computingin Science & Engineering, 13(4):14–27, 2011. → page 64[137] J. Tan, N. Goswami, T. Li, and X. Fu. Analyzing soft-error vulnerability onGPGPU microarchitecture. In IEEE International Symposium on WorkloadCharacterization (IISWC), pages 226–235. IEEE, 2011. → page 13[138] I. Tanasic, I. Gelado, J. Cabezas, A. Ramirez, N. Navarro, and M. Valero.Enabling preemptive multiprogramming on GPUs. In ACM SIGARCHComputer Architecture News, volume 42, pages 193–204. IEEE Press,2014. → page 111[139] D. Tao, S. L. Song, S. Krishnamoorthy, P. Wu, X. Liang, E. Z. Zhang,D. Kerbyson, and Z. Chen. New-sum: A novel online abft scheme forgeneral iterative methods. In Proceedings of the 25th ACM InternationalSymposium on High-Performance Parallel and Distributed Computing,pages 43–55. ACM, 2016. → page 11[140] O. Temam. A defect-tolerant accelerator for emerging high-performanceapplications. In Proceedings of the International Symposium on ComputerArchitecture (ISCA), pages 356–367, 2012. → page 14[141] A. Thomas and K. Pattabiraman. Error detector placement for softcomputation. In Dependable Systems and Networks (DSN), 43rd AnnualIEEE/IFIP International Conference on, pages 1–12. IEEE, 2013. → page24[142] Tiny-CNN. Tiny-CNN Framework. URL → page 136[143] Tithi, Jesmin Jahan, Neal C. Crago, and Joel S. Emer. Exploiting spatialarchitectures for edit distance algorithms. In Proceedings of theInternational Symposium on Performance Analysis of Systems andSoftware (ISPASS), 2014. → page 20[144] D. Tiwari, S. Gupta, J. Rogers, D. Maxwell, P. Rech, S. Vazhkudai,D. Oliveira, D. Londo, N. DeBardeleben, P. Navaux, et al. Understandinggpu errors on large-scale hpc systems and the implications for systemdesign and operation. In High Performance Computer Architecture(HPCA), 2015 IEEE 21st International Symposium on, pages 331–342.IEEE, 2015. → page 104176[145] TPU. Google supercharges machine learning tasks with TPU custom chip.URL →page 130[146] N. J. Wang and S. J. Patel. Restore: Symptom-based soft error detection inmicroprocessors. Dependable and Secure Computing, IEEE Transactionson, 3(3):188–201, 2006. → page 24[147] J. Wei, A. Thomas, G. Li, and K. Pattabiraman. Quantifying the accuracyof high-level fault injection techniques for hardware faults. In DependableSystems and Networks (DSN), 44rd Annual IEEE/IFIP InternationalConference on, 2014. → pages2, 4, 9, 17, 27, 37, 39, 64, 69, 76, 84, 105, 106, 110, 137, 138[148] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The splash-2programs: Characterization and methodological considerations. In ACMSIGARCH Computer Architecture News, volume 23, pages 24–36. ACM,1995. → pages 25, 27, 38, 83[149] Yann LeCun. Deep learning and the future of AI, 2000. URL → page 130[150] K. S. Yim, Z. T. Kalbarczyk, and R. K. Iyer. Quantitative analysis oflong-latency failures in system software. In Dependable Computing,PRDC’09. 15th IEEE Pacific Rim International Symposium on, pages23–30. IEEE, 2009. → pages 3, 4, 8, 16, 24, 26, 84, 104, 115, 125[151] K. S. Yim, C. Pham, M. Saleheen, Z. Kalbarczyk, and R. Iyer. Hauberk:Lightweight silent data corruption error detector for GPGPU. InInternational Parallel & Distributed Processing Symposium (IPDPS), page287. IEEE, 2011. → pages 12, 13, 84, 113[152] L. Yu, Y. Zhang, X. Gong, N. Roy, L. Makowski, and D. Kaeli. Highperformance computing of fiber scattering simulation. In Proceedings ofthe 8th Workshop on General Purpose Processing using GPUs, pages90–98. ACM, 2015. → page 115[153] Zhang, Chen, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and JasonCong. Optimizing FPGA-based accelerator design for deep convolutionalneural networks. In In Proceedings of ACM/SIGDA InternationalSymposium on Field-Programmable Gate Arrays, 2015. → pages 22, 135177


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items