UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Approaches for building error resilient applications Fang, Bo 2020

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.

Item Metadata


24-ubc_2020_may_fang_bo.pdf [ 2.15MB ]
JSON: 24-1.0389536.json
JSON-LD: 24-1.0389536-ld.json
RDF/XML (Pretty): 24-1.0389536-rdf.xml
RDF/JSON: 24-1.0389536-rdf.json
Turtle: 24-1.0389536-turtle.txt
N-Triples: 24-1.0389536-rdf-ntriples.txt
Original Record: 24-1.0389536-source.json
Full Text

Full Text

Approaches for Building Error Resilient ApplicationsbyBo FangB.Eng. , Wuhan University, 2006M.A.Sc., University of British Columbia, 2014A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFDoctor of PhilosophyinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Electrical and Computer Engineering)The University of British Columbia(Vancouver)March 2020c© Bo Fang, 2020The following individuals certify that they have read, and recommend to the Fac-ulty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:Approaches for Building Error Resilient Applicationssubmitted by Bo Fang in partial fulfillment of the requirements for the degree ofDoctor of Philosophy in Electrical and Computer Engineering.Examining Committee:Karthik Pattabiraman, Electrical and Computer EngineeringSupervisorMatei Ripeanu, Electrical and Computer EngineeringCo-supervisorAlexandra Fedorova, Electrical and Computer EngineeringUniversity ExaminerNorman Hutchinson, Computer ScienceUniversity ExaminerBianca Schroeder, University of TorontoExternal ExaminerAdditional Supervisory Committee Members:Sathish Gopalakrishnan, Electrical and Computer EngineeringSupervisory Committee MemberiiAbstractTransient hardware faults have become one of the major concerns affecting the re-liability of modern high-performance computing (HPC) systems. They can causefailure outcomes for applications, such as crashes and silent data corruptions (SDCs)(i.e. the application produces an incorrect output). To mitigate the impact of thesefailures, HPC applications need to adopt fault tolerance techniques.The most common practices of fault tolerance techniques include: (i) charac-terization techniques, such as fault injection and architectural vulnerability factor(AVF)/program vulnerability factor (PVF) analysis; (ii) run-time error detectiontechniques; and (iii) error recovery techniques. However, these approaches havethe following shortcomings: (i) fault injections are generally time-consuming andlack predictive power, while the AVF/PVF analysis offers low accuracy; (ii) priortechniques often do not fully exploit the program’s error resilience characteristics;and (iii) the application constantly pays a performance/storage overhead.This dissertation proposes comprehensive approaches to improve the abovetechniques in terms of effectiveness and efficiency. In particular, this dissertationmakes the following contributions:First, it proposes ePVF, a methodology that distinguishes crash-causing bitsfrom the architecturally correct execution (ACE) bits and obtains a closer estimateof the SDC rate than PVF analysis (by 45% to 67%). To reduce the overall analysistime, it samples representative patterns from ACE bits and obtains a good approxi-mation (less than 1% error) for the overall prediction. This dissertation applies theePVF methodology to error detection, which leads to a 30% lower SDC rate thanwell-accepted hot-path instruction duplication.Second, this dissertation combines the roll-forward recovery and the roll-backiiirecovery schemes and demonstrates the improvement in the overall efficiency ofthe C/R with two systems: LetGo (for faults affecting computational components)and BonVoision (for faults affecting DRAM memory). Overall, LetGo is able toelide 62% of the crashes caused by computational faults and convert them to con-tinued execution (out of these 80% result in correct output while a majority of therest fall back on the traditional roll-back recovery technique). BonVoision is ableto continue to completion 30% of the DRAM memory detectable but uncorrectableerrors (DUEs).ivLay SummaryReliability, availability and serviceability (RAS) are important goals for design-ing high-performance computing (HPC) platforms. Hardware faults, especiallywhen they are transient and lead to independent one-time errors, are consideredone of the major sources affecting the RAS of the HPC systems. Applications run-ning on such systems need to take comprehensive actions to deal with the transienthardware faults, including the following: first, the error resilience characterizationreveals how serious the impact would be when encountering the faults; then, theerror detection techniques allow the applications to inform when the faults causean erroneous program state. Last, the error recovery techniques rescue the appli-cations from failure. In my thesis, I identify various drawbacks existing in thestate-of-the-art solutions and propose advanced approaches that improve the over-all effectiveness and efficiency of the current solutions.vPrefaceThis thesis is the result of work carried out by me, in collaboration with my ad-visors (Dr. Karthik Pattabiraman and Dr. Matei Ripeanu), my colleagues (QiningLu, Hassan Halawa), and other research scientists from the Los Alamos NationalLaboratory, Pacific Northwest National Laboratory and IBM. Chapters 3, 4 and 5are based on research work published in the conferences of DSN, HPDC and ICS,respectively. For each project, I was responsible for shaping the key ideas, leadingthe project, designing and implementing the systems, conducting experiments toevaluate the approaches, analyzing the data, and writing the paper.The following is a list of the publications related to each chapter: Chapter 3 I was the main contributor for the research presented in this chapter.Qining Lu offered feedback on the methodology design and other collaboratorswere responsible for proofreading the manuscript, providing guidance and feed-back.Bo Fang, Qining Lu, Karthik Pattabiraman, Matei Ripeanu and Sudhanva Gu-rumurthi, "ePVF: An Enhanced Program Vulnerability Factor Methodology forCross-Layer Resilience Analysis",Proceedings of the IEEE/IFIP InternationalConference on Dependable Systems and Networks (DSN), 2016 (acceptancerate 21%). Chapter 4 I was the main contributor for the research presented in this chapter.Other collaborators provided feedback and made edits in the manuscript.Bo Fang, Qiang Guan, Nathan Debardeleben, Karthik Pattabiraman and MateiRipeanu, "LetGo: A Lightweight Continuous Framework for HPC Applicationsupon Failures", The ACM International Symposium on High-performance Par-viallel and Distributed Computing (HPDC) June 2017 (acceptance rate 18%). Chapter 5 I was the main contributor for the research presented in this chapter.Hassan Halawa offered feedback in the machine learning aspect. Other collabo-rators participated in project discussions and helped edit the manuscript.Bo Fang, Hassan Halawa, Karthik Pattabiraman, Matei Ripeanu and Sriram Kr-ishnamoorthy, "BonVoision: Leveraging Spatial Data Smoothness for Recoveryfrom Memory Soft Errors", ACM International Conference on Supercomputing(ICS), 2019 (acceptance rate 23%).viiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xivAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Transient hardware faults . . . . . . . . . . . . . . . . . . . . . . 11.2 Tolerating transient hardware faults with software-based solutions 21.3 The challenges in software-implemented fault tolerance solutionsand opportunities for advancements . . . . . . . . . . . . . . . . 41.3.1 Error resilience characterization . . . . . . . . . . . . . . 41.3.2 Error detection . . . . . . . . . . . . . . . . . . . . . . . 61.3.3 Error recovery . . . . . . . . . . . . . . . . . . . . . . . 71.4 Research questions . . . . . . . . . . . . . . . . . . . . . . . . . 81.5 Summary of contributions . . . . . . . . . . . . . . . . . . . . . 101.6 Scope of the contributions . . . . . . . . . . . . . . . . . . . . . 131.7 Dissertation organization . . . . . . . . . . . . . . . . . . . . . . 14viii2 Background and Related Work . . . . . . . . . . . . . . . . . . . . . 162.1 Hardware-based fault injection and tolerance techniques . . . . . 162.1.1 Fault characterization with hardware . . . . . . . . . . . . 172.1.2 Hardware-based redundancy . . . . . . . . . . . . . . . . 172.2 Software-implemented characterization and fault tolerance tech-niques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182.2.1 Error resilience characterization via vulnerability analysis 192.2.2 Error detection techniques . . . . . . . . . . . . . . . . . 202.2.3 Failure recovery strategies . . . . . . . . . . . . . . . . . 213 ePVF: An Enhanced Program Vulnerability Factor Methodology forCross-layer Resilience Analysis . . . . . . . . . . . . . . . . . . . . . 263.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293.2.1 Dependability metric: error resilience . . . . . . . . . . . 293.2.2 Effort in accelerating fault injection . . . . . . . . . . . . 293.2.3 Program vulnerability factor (PVF) . . . . . . . . . . . . 303.2.4 LLVM IR . . . . . . . . . . . . . . . . . . . . . . . . . . 313.2.5 The fault model . . . . . . . . . . . . . . . . . . . . . . . 323.3 ePVF methodology . . . . . . . . . . . . . . . . . . . . . . . . . 323.3.1 Base ACE analysis . . . . . . . . . . . . . . . . . . . . . 333.3.2 Finding the crash-causing bits . . . . . . . . . . . . . . . 363.3.3 The propagation model . . . . . . . . . . . . . . . . . . . 383.3.4 Crash model . . . . . . . . . . . . . . . . . . . . . . . . 403.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423.4.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . 433.4.2 Q1: What is the accuracy of ePVF methodology? . . . . . 443.4.3 Q2: How close are the crash rates estimated using ePVFand fault injection? . . . . . . . . . . . . . . . . . . . . . 463.4.4 Q3: Does ePVF lead to a tighter estimate of SDC rate thanthe original PVF? . . . . . . . . . . . . . . . . . . . . . . 463.4.5 Q4: How fast is the ePVF analysis? . . . . . . . . . . . . 473.5 Case study: selective duplication . . . . . . . . . . . . . . . . . . 49ix3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544 LetGo: A Lightweight Continuous Framework for HPC Applica-tions Under Failures . . . . . . . . . . . . . . . . . . . . . . . . . . . 554.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604.3 System design . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614.3.1 Overall design . . . . . . . . . . . . . . . . . . . . . . . 614.3.2 Heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . 634.3.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . 654.4 Evaluation methodology . . . . . . . . . . . . . . . . . . . . . . 664.4.1 Fault model . . . . . . . . . . . . . . . . . . . . . . . . . 674.4.2 Categories of fault outcomes . . . . . . . . . . . . . . . . 674.4.3 Metrics of effectiveness . . . . . . . . . . . . . . . . . . 684.4.4 Fault injection methodology . . . . . . . . . . . . . . . . 704.4.5 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . 714.5 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . 714.5.1 Effectiveness of LetGo . . . . . . . . . . . . . . . . . . . 724.5.2 Performance overhead . . . . . . . . . . . . . . . . . . . 734.6 LetGo in a C/R context . . . . . . . . . . . . . . . . . . . . . . . 754.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 814.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 835 BonVoision: Leveraging Spatial Data Smoothness for Recovery fromMemory Soft Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . 855.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 865.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 895.3 Motivating studies . . . . . . . . . . . . . . . . . . . . . . . . . . 905.3.1 Does ignoring DUEs work? . . . . . . . . . . . . . . . . 905.3.2 Exploring spatial data smoothness . . . . . . . . . . . . . 935.4 Design and implementation of BonVoision . . . . . . . . . . . . . 945.4.1 Synthetic study . . . . . . . . . . . . . . . . . . . . . . . 96x5.4.2 Heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . 975.4.3 Computing and writing-back the repair value . . . . . . . 995.5 Experimental methodology . . . . . . . . . . . . . . . . . . . . . 1015.6 Evaluation results . . . . . . . . . . . . . . . . . . . . . . . . . . 1035.6.1 Q1: How effective is BonVoision? . . . . . . . . . . . . . 1035.6.2 Q2: Can machine learning help improve BonVoision? . . 1045.6.3 Q3: What is BonVoision’s efficiency? . . . . . . . . . . . 1075.6.4 Q4: What is the impact in the context of C/R? . . . . . . . 1075.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1115.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1136 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1146.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1146.2 Expected impact . . . . . . . . . . . . . . . . . . . . . . . . . . . 1166.2.1 Efficient SDC probability estimation and a quantitative ap-proach for SDC detection . . . . . . . . . . . . . . . . . . 1166.2.2 A new direction to improve C/R efficiency . . . . . . . . 1176.2.3 Practicality . . . . . . . . . . . . . . . . . . . . . . . . . 1186.3 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1196.3.1 Direction (i): Predicting the impact of a roll-forward fail-ure recovery . . . . . . . . . . . . . . . . . . . . . . . . . 1196.3.2 Direction (ii): Semantic-based data recovery for memoryDUEs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1196.3.3 Direction (iii): Reducing energy consumption on memorysystems . . . . . . . . . . . . . . . . . . . . . . . . . . . 1206.3.4 Direction (iv): Data compression . . . . . . . . . . . . . . 120Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122A Related Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . 139xiList of TablesTable 1.1 The scope of target applications . . . . . . . . . . . . . . . . . 14Table 3.1 Types of exceptions resulting in crashes . . . . . . . . . . . . . 37Table 3.2 Relative crash frequency analysis for each benchmarks . . . . . 37Table 3.3 Range calculation operations that commonly occur on the back-ward slice of memory addresses . . . . . . . . . . . . . . . . . 39Table 3.4 Benchmarks used and their complexity (lines of C code). . . . 43Table 3.5 Number of nodes in the ACE graph and time taken by the ePVFanalysis for each benchmark . . . . . . . . . . . . . . . . . . . 48Table 4.1 gdb signal handling information redefined by LetGo. ’Stop’means the program will stop upon a signal, and ’Pass to pro-gram’ means this signal will not be passed to the program . . . 66Table 4.2 Benchmark description. The last two columns present whichdata is used for bit-wise comparison to determine SDCs (un-detected incorrect results), and, respectively describe the resultacceptance check used by each application. All benchmarks arecompiled with g++ 4.93 using O3 except for SNAP, which is aFORTRAN program. . . . . . . . . . . . . . . . . . . . . . . 69Table 4.3 Fault injection results for five iterative benchmarks when usingLetGo-E. The value for each outcome category is normalizedusing the total number of fault injection runs for the application.Error bars range from 0.1% to 0.2% at the 95% confidence level. 71Table 4.4 The description of parameters of the models . . . . . . . . . . 76xiiTable 5.1 The patterns and stride observed for accessing to the elementsof the different data structures. Reading elements from Foo3and Foo5 results in the same addressing mode as Foo2. . . . . 95Table 5.2 Benchmark description, including SDC determination approachand application correctness check criteria . . . . . . . . . . . . 98Table 5.3 The confusion matrix for the classifier, associated with mis-classification costs. Low cost for misclassifying a repair thatleads to a benign outcome (in these cases the C/R scheme willrestart from a checkpoint similarly to when BonVoision is notused), high cost if SDCs or detected are predicted as benigns. . 105Table 5.4 Classifier quality. The mis-classification rate of SDC shows thefraction of predicted benigns but the true class are SDCs dividedby the total number of SDCs in the testing set. Similarly for themisclassification rate of detected. . . . . . . . . . . . . . . . . 106xiiiList of FiguresFigure 1.1 Types of failure outcomes for an application. . . . . . . . . . 2Figure 3.1 Venn diagram that highlights the crash-causing and the ePVFbits that the ePVF methodology identifies as a subset of ACEbits. SDC-causing bits will be a subset of the ePVF bits. . . . 32Figure 3.2 The overall workflow of the ePVF methodology to computethe non-crashing ACE (ePVF) bits. . . . . . . . . . . . . . . . 33Figure 3.3 An example of computing PVF for the register file . . . . . . 35Figure 3.4 Linux kernel implementation for determining which memoryaccesses result in segmentation faults. Linux kernel version:3.15. File locations: mm/ and arch/x86/mm. . . . . . . . . . . 42Figure 3.5 Fault injection results for each benchmark. . . . . . . . . . . 44Figure 3.6 Recall for the crash bits predicted using the ePVF methodology. 45Figure 3.7 Precision for the crash bits predicted using the ePVF method-ology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45Figure 3.8 The crash rates estimates using ePVF (right bars) and usingfault injection experiments (left bars) are close. For fault in-jection experiments, the error bars indicate the 95% confidenceintervals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46Figure 3.9 ePVF (center bars) offers a much better upper bound estimatefor the SDC rate (right bars) than the original PVF methodol-ogy (left bars). For SDC rates, error bars represent 95% confi-dence intervals. . . . . . . . . . . . . . . . . . . . . . . . . . 47xivFigure 3.10 Breakdown of execution time between graph construction (bot-tom bar) and running the crash and propagation models (topbar). Labels on bars present absolute time in seconds. . . . . . 48Figure 3.11 The predicted ePVF value based on sampling only 10% of theACE graph and ePVF computed based on the entire graph areclose. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49Figure 3.12 The figure presents the CDF for the ePVF and PVF values ofregisters used in each instruction of nw (left) and lud (right)benchmarks. PVF values for most instructions are clusteredaround 1 and thus can not inform protection mechanisms basedon instruction level protection. . . . . . . . . . . . . . . . . . 51Figure 3.13 SDC rates for the original application (left bars) and when us-ing hotpath (center) and ePVF-based protection (right) for anoverhead bound of 24%. Error bars present 95% confidenceintervals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52Figure 4.1 Illustration of how LetGo changes the behavior of an HPC ap-plication that uses checkpointing by continuing the executionwhen a crash-causing error occurs. Axes indicate time. Thelabels used for time intervals: CP - checkpoint; V - applicationacceptance check/verification; L - LetGo framework, lightningbolt: crash-causing error . . . . . . . . . . . . . . . . . . . . 58Figure 4.2 LetGo architecture overview. . . . . . . . . . . . . . . . . . . 62xvFigure 4.3 A sequence diagram highlighting LetGo use: the step 1-5 de-scribe the interactions between LetGo, the application, and theoperating system: LetGo starts by installing the monitor - i.e.,configuring the the signal handling, and launches the applica-tion in a debugger (step 1). If the application encounters asignal, LetGo detects it (step 2) and takes the control of theapplication (step 3). To avoid the failure, LetGo incrementsthe program counter (i.e., instruction pointer) of the applica-tion and adjusts the necessary program states (step 4). Afterthe modification, LetGo lets the application continue withoutany further interference (step 5). . . . . . . . . . . . . . . . . 62Figure 4.4 Classification of the fault injection outcomes. LetGo has im-pact only on the right side of the tree above as it attempts toavoid a crash outcome. . . . . . . . . . . . . . . . . . . . . . 68Figure 4.5 Comparison of Continuability, Continued_detected, Continued_correctand Continued_SDC between the LetGo-B and LetGo-E. LetGo-E has a higher likelihood of converting crashes into correctlyexecutions for our benchmarks than LetGo-B but no increasein Continued_SDC cases. . . . . . . . . . . . . . . . . . . . . 74Figure 4.6 The state machines for the standard C/R scheme (a) and theC/R scheme with LetGo (b). The black circle represents thetermination state of the model. We use u/cost to represent theefficiency of the model. t: time interval till the next fault;cost: accumulated runtime; u: accumulated useful work; q:accumulated useful work within the current checkpoint inter-val; faults: number of faults that did not lead to crashes sincethe last checkpoint; faults_total: total number of faults that didnot lead to crashes; isLetGo: a flag that indicates that if theP_v’ is chosen or not . . . . . . . . . . . . . . . . . . . . . . 78Figure 4.7 Efficiency with and without LetGo under different checkpointoverheads for LULESH and SNAP. . . . . . . . . . . . . . . 82xviFigure 4.8 The trend of the efficiency for the C/R scheme with and with-out LetGo when the system scales from 100,000 nodes to 200,000nodes and 400,000 nodes . . . . . . . . . . . . . . . . . . . . 82Figure 5.1 Error injection results for six HPC applications. C means con-secutive bits, while R means random bits. . . . . . . . . . . . 91Figure 5.2 Examples of the ECDF of the standard deviations for six bench-marks (LULESH, FLUID, CLAMR (top-row), PENNANT, TEN-SORDECOMP, and COMD (bottom row). The correspondingdata structures are indicated on the top of the figures. . . . . . 92Figure 5.3 Illustration of BonVoision components, the information avail-able upon a DUE and why this information is insufficient todetermine the a data structure’s stride. . . . . . . . . . . . . . 95Figure 5.4 Application outcome ratios when DUEs are repaired by WB-0 (left), WB-random (center-left), BonVoision (center-right),and BonVoision-E (right). Crash-BVE class denotes that BonVoision-E predicts the repair will lead to detected or SDC outcome, asituation similar to a crash in the context of a C/R, as appli-cations would continue from a checkpoint. For BonVoision-E,the crash and SDC rates are low or zero for most benchmarks,as a result they are not visible in some plots. . . . . . . . . . . 101Figure 5.5 The performance overhead incurred by BonVoision scales welland is overall negligible while the application scales up. . . . 108Figure 5.6 Efficiency comparison between C/R and C/R+BonVoision-Efor simulated 10-years of application time for LULESH andCLAMR. The MTBF decreases as the system scales up. . . . 110xviiAcknowledgmentsI would first to thank my advisors, Dr. Karthik Pattabiraman and Dr. Matei Ri-peanu. I would not have reached this point if it weren’t for you. I have receivedfrom your valuable suggestions, feedback, and guidance along my PhD journey,both academic and for life in general. You inspire me in the pursuit of an aca-demic career, and I hope to become a researcher and the advisor of your level ofexcellence.Special thanks goes to Dr. Sathish Gopalakrishnan for his support and mentor-ship during the CSRG seminar and my days at UBC. I also want to thank all mymentors, collaborators and lab mates. You have helped me in many ways.Sincere thanks goes to Dr. Bianca Schroeder for providing valuable feedbackand suggestions on the dissertation.I am grateful to the mentors I have had during my internships: Dr. QiangGuan from Kent State University, Dr. Nathan DeBardeleben from the Los AlamosNational Laboratory and Dr. Sriram Krishnamoorthy from the Pacific NorthwestNational Laboratory.Finally, I want to thank my parents and the whole family for their unconditionalsupport and love throughout my life. My wife, Yuan Qin; my daughters Emma andAnne; and my son, Edwin—you make my dream come true.xviiiChapter 1IntroductionHardware faults, caused by particle strikes, aging, temperature, manufacturingvariations, etc. are one of the major sources of failures that compromise the re-liability of the computer systems. As feature sizes shrink, the smaller amount ofcharge per hardware component makes hardware more vulnerable to the particlestrikes, while the reduced cross-section decreases the likelihood of experiencingsuch strikes. Although these two effects may cancel each other out, more con-densed chips, especially those used extensively in high performance computing(HPC) systems, are more prone to experience hardware faults [21, 25, 37, 132].Handling hardware faults effectively and efficiently is essential to designing com-puter systems with high levels of reliability, availability and serviceability (RAS).1.1 Transient hardware faultsHardware faults can be classified as permanent, intermittent and transient by theirduration; a transient fault exists for a short period of time and is nonrecurring [113],which is different from an intermittent fault (i.e. periodically recurring) and a per-manent fault (i.e. stable and continuously occurring). Unlike permanent and in-termittent faults, which are likely to have visible symptoms, transient hardwarefaults occur in a non-deterministic way, and thus it is challenging to predict whenand where a transient hardware fault will occur. Due to these properties of tran-sient hardware faults, their handling demands more sophisticated fault tolerance1TimeInactivatedError ActivatedMaskedBenign(correct result)Crash SDC(incorrect result)FaultOccurs Figure 1.1: Types of failure outcomes for an application.solutions than simply replacing the failing hardware components.1.2 Tolerating transient hardware faults withsoftware-based solutionsWhen a transient hardware fault affects an application, it causes an error(s) (i.e.deviation) in the program state, and may eventually lead to a failure in the appli-cation. Two of the most critical failure outcome types are crashes—an applicationterminates before it finishes, and silent data corruptions (SDCs)—an applicationproduces incorrect output. Other possible outcomes are that the application hangs(which is rare in practice [34, 56, 69]), or the application finishes without any error(i.e. benign). The consequences of crashes and SDCs are serious: crashes result ina waste of computational resources, and SDCs cause the applications to generatewrong results without any warning. Therefore, mitigating the impact of the crashesand SDCs on the applications is the main focus of fault tolerance techniques. Fig-ure 1.1 [70] illustrates the fault-error-failure chain and the failure outcome typesfor an application.Hardware-based solutions have been used to counter transient hardware faults.In mission-critical systems such as satellites, spacecraft, aircraft, etc., radiation-hardened components are widely used to tolerate radiation effects. However, inmore general-purpose environments, the radiation-hardened devices are not popu-lar due to high costs, manufacturing productivity constraints and degraded perfor-2mance, and hardware-based redundancy techniques are commonly applied. Hardware-based redundancy techniques can leverage either the replication of hardware com-ponents (e.g. triple modular redundancy) or redundant information (e.g. codeword) to form a collection of methods for error masking, error detection and errorcorrection/recovery tasks (detailed descriptions of these techniques are presentedin Section 2.1).While some hardware-based techniques (e.g. error-correcting code for DRAM)are widely adopted, in general, they face the following problems:(i) They constantly incur extra energy costs. Hardware-based redundancy tech-niques consist of either additional hardware resources for redundant hardware com-ponents or additional hardware operations such as determining the validity of thecode word. Running redundant hardware components or operations requires con-tinuously additional energy consumption, therefore it is not desirable for moderncomputing systems, which need to carefully take the power constraints into ac-count.(ii) They may result in inefficient error detection and correction practice. Thehardware-based solution normally makes no assumption on the outcome of the ap-plication when it is affected by an error, hence the behavior of the solution dependson whether the hardware-implemented error detection criteria have been violated(e.g. the checksum is incorrect) or how the recovery mechanisms can be applied(e.g. applying the checksum to correct an error). However, this solution is ratherconservative. As shown by Cho et al. [34], 89.1% of hardware faults originatingfrom the flip-flop level do not reach the software level and thus have no impacton the correctness of the application’s execution. That said, the hardware-basedsolution may waste cycles for transient hardware faults that stay off the applicationlayer.Compared to hardware-implemented techniques, software-implemented hard-ware fault tolerance techniques [116] offer a more desirable solution for moderncomputing systems for the following reasons:(i) Software-implemented techniques are flexible and effective under variouscircumstances/configurations;(ii) They do not require hardware modifications, hence they fit well with HPCsystems that are built with general-purpose hardware components;3(iii) Many applications mask an erroneous program state during execution dueto their inherent error-masking abilities [56, 58, 75]. Since the software-implementedtechniques share the same level of abstraction as the application does, they can effi-ciently take the advantage of the error-masking abilities of an application to achievethe selectivity in error protection (i.e. only protect a fraction of program states thatdo not mask errors) and thus reduce the overall costs of fault tolerance techniques.1.3 The challenges in software-implemented faulttolerance solutions and opportunities foradvancementsTo design the software-based solutions for transient hardware faults for applica-tions, the following two main aspects need to be taken into consideration: (i) whatare the outcomes of an application when affected by transient hardware faults?; (ii)is it possible to efficiently and effectively mitigate the impact of an error on theapplication?These two aspects steer the research of fault tolerance techniques in the follow-ing directions: error resilience characterization, error detection and error recovery.In the following section, we describe the main task conducted in each direction andexplain why these tasks are essential to the success of the mitigation of the failurescaused by transient hardware faults. We then focus on the challenges and the spaceone can explore in each research direction.1.3.1 Error resilience characterizationError resilience is defined as the probability that the application will not a failureoutcome (i.e., crash, hang or SDC) after a hardware fault occurs. Error resiliencecharacterization provides a fundamental understanding of how an application re-acts to hardware faults. Two mainstream approaches for error resilience character-ization are as follows:Fault injection: This is a procedure to introduce faults in a systematic, con-trolled manner and study the corresponding system behavior. Fault injection tech-niques can be generally categorized as hardware-based and software-based [79].Software-based fault injection techniques typically emulate the effects of hard-4ware faults on the software by perturbing the values of selected data/instructionsin the program. Compared to hardware-based techniques such as accelerated test-ing using neutron beams [61], software-based techniques do not require additionaldevices and are adjustable to different fault models. Fault injection is a matureand well-understood technique, and there are many tools to help automate the pro-cess [5, 27, 56, 134, 144].AVF/PVF analysis: Mukherjee et al. [109] introduced the architectural vulner-ability factor (AVF) analysis, which quantifies the vulnerability of micro-architecturalcomponents to errors caused by transient hardware faults. Sridharan et al. [131]study the vulnerability of software independent of hardware by introducing pro-gram vulnerability factor (PVF) analysis. Although AVF and PVF measure prop-erties that are different from error resilience, such as the probability that a fault inthe particular structure or in a program state will result in an error, they can be usedfor a first-order estimate of the error resilience of an application.Challenges: There are challenges associated with both fault injection andAVF/PVF analysis. Fault injection techniques generally face two key challenges.First, fault injection has limited predictive power in terms of determining the im-pact of code transformations on vulnerability—thus it is challenging to use it forguiding code optimization. Second, fault injection campaigns are typically re-source intensive, as thousands of faults need to be injected in complete executionsof the program to get statistically significant results, and a standard practice is toinject only one fault per fault injection run for controllability. For HPC applica-tions, this usually entails many hours to finish a single fault injection run, and thecost to get such statistically significant results is unacceptable.On the other side, the key limitation in AVF/PVF is that AVF and PVF treat allfailure outcomes equally; that is, they do not differentiate between crashes, SDCsand the benign outcomes from an application’s perspective. This causes those tech-niques to have significant inaccuracies in the estimate of the error resilience char-acteristics of an application.Research opportunity: Error resilience characterization techniques need toconsider the trade-off space between the effort required to characterize, and theaccuracy of the estimate obtained. The two approaches discussed in this sectionrepresent the trade-offs of the two choices—fault injection is accurate but costly,5while the AVF/PVF analysis is fast but much less accurate. My research aims to de-sign an approach that offers low-cost and accurate estimates of the error resiliencecharacteristics for applications.1.3.2 Error detectionThe main goal of error detection through software-implemented techniques is todetermine if a transient hardware fault that manifests as an error in the softwarelayer subsequently causes a deviation in the program’s behavior before the programfinishes.A commonly-used approach to catch errors caused by transient hardware faultsis through duplication [60, 75, 123]. The process is as follows: The detection tech-nique duplicates a portion of the execution of a program and compares the resultsbetween the original and the duplicated programs; if the results differ, this meansthat an error has affected one of the executions. This approach works under twoassumptions: i) the piece of the program that has been duplicated is idempotent,meaning that running this piece of the program twice would generate the same re-sults/leads to the same changes to the program state or ii) the probability that theerrors affect both programs and produce the same outcomes is low (i.e. the errorscaused by transient hardware faults), hence it is unlikely for the error to bias theerror detection decision.Challenges: There are two fundamental challenges associated with duplication-based detection. First, it is computationally costly. The basic strategy potentiallyprovides 100% error detection coverage but costs about 100% performance over-head. This is especially impractical for HPC applications, which usually run forhours or days. Second, duplication-based error detection may cause fault positives.This is because the detection mechanism makes the decision of the existence ofan error based on the intermediate results of the original and duplicated execution,while errors may get masked later in the execution and lead to a benign outcome.Therefore, duplication-based detection techniques need to balance between the costof the duplication and the effectiveness in catching errors. Since different programstates affected by errors may have different impacts on the outcome of the appli-cation, increasing costs (i.e. duplicating more program states) does not necessarily6indicate a linear increase in error coverage.Research opportunity: Most state-of-the-art error detection mechanisms donot focus on establishing an indicative connection between the detection strategiesand the error resilience characteristics of an application (see detailed discussion inChapter 2.2.2). Therefore, they cannot be applied extensively in general scenarios.My research takes a different approach; it leverages the error resilience characteris-tics of an application to guide the duplication-based error detection approach in anapplication-neutral manner. In particular, with the error resilience characterizationtechnique proposed in Chapter 3, it is possible to obtain a per-program state—aquantitative estimate of how likely an error affecting that state would cause a cer-tain type of failure (e.g. an SDC). When one wants to mitigate the impact of thattype of failure for the application, these estimates can indicate the priority of theprogram states to be selected by the duplication technique: the portion of the pro-gram states that have a high correlation with the target failure type would be thecandidates to be chosen for duplication.1.3.3 Error recoveryIf the error detection techniques detect and localize the error, they normally sig-nal the run-time and cause a noticeable symptom, such as an OS exception for theapplication to handle. Immediate termination is a widely acceptable option by theapplication; however, for many scenarios where the application needs to be robustand provide high availability, terminating the application is not a viable solution.The main task of the error recovery technique is to help the application overcomesuch symptoms, potentially fix the corrupted program states and, if possible, re-sume the application’s execution.In general, the error recovery techniques can be classified as “roll-back" and“roll-forward", where the main difference is in their actions upon the symptom:roll-back recovery strategies use the prior information to replace the corrupted pro-gram states (i.e. rolling back the application to a previous state(s)), while roll-forward recovery strategies attempt to continue the execution of the applicationfrom the failure point. Today, the most commonly used error recovery systemis checkpoint/restart (C/R), which follows the roll-back paradigm—it stores the7essential program states on non-volatile storage periodically (i.e. on every check-point interval) as checkpoints and rewinds the application to a “safe" state by load-ing checkpoints when the application crashes [54, 138]. In the context of HPCsystems, C/R is considered as the best practice for recovering applications fromfail-stop failures [23, 51].Challenges: The standard C/R system can incur potentially significant over-head in terms of performance, energy and storage originating from the followingtwo sources: (i) the periodically taken checkpoints, and (ii) the recovery opera-tions when the failure occurs. In (i), the overhead is determined by the frequencyof taking the checkpoint and how fast the necessary program states can be saved tothe non-volatile storage per checkpoint; in (ii), the overhead comes from the extraI/O operations to load the checkpoint back to memory and the re-computation ofthe work from the restored state up to the failure point. There have been numeroustechniques for reducing the overall overhead of C/R (discussed in Chapter 2.2.3)via mitigating the costs caused by each of the above aspects.Research opportunity: In this thesis, we investigate the possibility of intro-ducing roll-forward recovery into the standard C/R systems. The roll-forward re-covery strategy implicitly trades off faster time to solution and energy consumptionfor higher risk of having SDCs in the results compared to the roll-back recovery.My research aims to first characterize such risk, and provide comprehensive solu-tions for the HPC applications to lower the risk when roll-forwarding is deployedupon failures and hence to potentially reduce the overhead caused by C/R. Theroll-forward recovery scheme offers a novel view that fundamentally improves theC/R efficiency, and can potentially work on top of any optimization techniques forroll-back C/R.1.4 Research questionsAs discussed in previous sections, software-based fault tolerance techniques facefundamental challenges. For error resilience characterization, the challenge is tofind a reasonable trade-off between the effort required to characterize and the ac-curacy, such that (i) the characterization process is released from the heavy costcaused by fault injection, and (ii) the error resilience characteristics of an appli-8cation are estimated more accurately than via AVF/PVF analysis; for error detec-tion, it is not clear how one can correlate the error resilience characteristics of anapplication with a selective error detection scheme; for error recovery, the exist-ing “roll-back" recovery strategy may cause unnecessary performance overheads,which need to be mitigated. The overall goal of this thesis is to develop comprehen-sive techniques to address the above issues. To achieve this goal, we are interestedin answering the following research questions: RQ1: Can one find a different trade-off point that requires a low ef-fort to characterize and offers acceptable accuracy to a specific end-goalsuch as SDC protection for an application? Can the obtained informa-tion be used to guide the error detection approach, in particular, theselective duplication, to efficiently and effectively detect SDCs? To avoidthe use of costly fault injections, we adapt the PVF methodology to achievea faster estimate of the resilience characteristics. More precisely, we devel-oped a crash model to predict which faults in a program cause a crash and apropagation model to reason about propagating ranges of crash-causing bitsin the program’s dependence graph. Together, these two models separate thecrash-causing state from the vulnerable program state inferred by the PVFmethodology and obtain an accurate estimate on the probability of crashoutcomes. By removing the crash-causing state, the rest of the program staterepresents a more accurate estimate on the SDC-causing state than the PVFmethodology. Next, informed by the results of the two models, we imple-ment a selective duplication approach that protects only the program statethat is unlikely to cause crashes. RQ2: On a standard checkpoint/restart enabled system, how much per-formance overhead reduction for an application is obtained by the roll-forward recovery when a crash happens? Does the roll-forward recov-ery scheme work for all categories of applications? Depending on the rootcauses of crashes, we further diversify the research efforts to target differenttypes of fault modes: transient hardware faults manifest in the computationalcomponents of CPUs and in memory systems such as DRAM. Therefore, weexplore two research directions to achieve a comprehensive answer to the9RQ2:(i) Recover applications from crashes caused by errors affecting the compu-tational components. As a fundamental modification on the standard “roll-back" strategy, we propose a “roll-forward" C/R system, LetGo, which a)monitors the application at run-time and when a crash-causing error occurs,pauses the program to avoid termination; and b) uses heuristics to attempt torepair the program state and enable continued program execution. To showhow much improvement the “roll-forward" scheme brings to the overall ef-ficiency of the C/R system, we model an HPC system in the context of aC/R scheme using a state machine and evaluate the end-to-end performanceimpact of the “roll-forward" recovery.(ii) Recover applications from memory errors that are detectable but uncor-rectable. We leverage spatial data smoothness preserved in many scientificapplications’ data space to repair the corrupted memory values that causedmemory DUEs. To this end, we employ the assumption that the data thatlive close to each other in physical space (modeled by a HPC application)are likely to be mapped to the nearby memory locations and compute therepair values for the memory errors based on the neighboring data elementsof that corrupted memory location. To alleviate the concern of increasing theSDC proneness of the application due to the estimated repair values, we usespatial data smoothness as the feature and train an online classifier to predictthe outcome of a repair for the application before applying it.1.5 Summary of contributionsThis section summarizes the contributions of the dissertation at a high-level. Chap-ter 3,4 and 5 elaborate on these points and present the individual contributions byanswering each RQ in detail. Enabling different failure modes in the PVF analysis. The PVF analysistreats all obtained bits as vulnerable bits, where it does not consider theactual outcomes caused by errors occurring on these bits. Our approach,ePVF, proposes the crash model and propagation model that identify the10crash-causing bits within the scope of all vulnerable bits. With the separa-tion of the crash-causing bits from the vulnerable bits, ePVF analysis comesto a tighter conservative estimate of the SDC-causing bits. Closing the gap between the analytical model (i.e. PVF) and experimentalassessment (i.e. fault injection) on the estimate of error resilience charac-teristics. The crash and propagation models together offer the estimate ofcrash-causing bits with 89% recall and 92% precision, when evaluated overa set of 10 benchmarks. Our analysis narrows the estimated vulnerable bitsfor the goal of analyzing a program’s resilience to SDCs: the number of vul-nerable bits estimated by the ePVF analysis is lower than that estimated bythe standard PVF analysis by 61% on average. Compared to fault injectionresults, ePVF offers accurate crash rate estimates over eight benchmarks anda much closer estimate on SDC rate than PVF analysis. Offering the quantitative measure on the SDC-proneness of the programstate to improve the selective duplication technique. We present an ePVF-informed protection heuristic for the selective duplication technique. Theheuristic computes the ePVF value as a proxy of the SDC-proneness perinstruction and ranks the instructions based on their ePVF values. Withduplication on the high-rank instructions for a given performance overheadcap, ePVF does 30% better SDC-coverage than a hot-path-based duplicationstrategy. A roll-forward recovery implementation that supports HPC applications.The prototype framework, LetGo, is implemented with the production-leveltools that are widely adopted and available in HPC systems. LetGo worksdirectly with the application’s executable, hence no modification is neededon the application’s compilation procedure. We also extend the LetGo im-plementation to support MPI applications. We expect that our design choicesmade for LetGo would accelerate the adoption of LetGo in the HPC com-munity. Demonstrating the successful use of the proposed roll-forward recovery schemeson a diverse set of applications. Our evaluation shows that LetGo is able to11continue the application’s execution in about 62% of the cases and produce100% correct results when it encounters a crash-causing error (for the re-maining 38%, it gives up and the application can be restarted from a check-point as before). The increase in the SDC rate (undetected incorrect output)is as low as less than 1% on average. The overall performance overheadincurred by LetGo is trivial. Demonstrating that the spatial data smoothness is preserved in scientificapplications. We hypothesize that for scientific applications, spatial datasmoothness in physical space, if it exists, would be observed across the log-ical data space of the application. We profile the application and find thatgiven a random point in the memory location, the standard deviation calcu-lated across the neighbouring elements of that point is small for all bench-mark applications. Proposing a line of approaches that achieve an accurate online predictionover the memory soft errors, with negligible overhead incurred. Our evalu-ation shows that the proposed approach, BonVoision, constructs the repairsthat result in 0.8-2.5x more benign outcomes, and only marginal increases inSDC outcomes, compared to using 0 and random values for the repairs; andon average BonVoision incurs 6 milliseconds across all benchmarks. More-over, with the machine-learning guided optimization, BonVoision is able toconduct online predictions on the outcome of each repair, which eliminatesmost SDCs and other types of failure outcomes. Showing a significant efficiency gain for the application that employs a roll-forward recovery scheme in the standard C/R system for both computationaland memory soft errors. We use the state machine to model the HPC systemthat uses the standard C/R scheme, and when LetGo and BonVoision are inplace, we alter the states according to the semantics of LetGo and BonVoi-sion separately. We simulate the system running for over 10 years and findthat overall the standard C/R system with either LetGo or BonVoision ob-tains a significantly higher estimate on the efficiency than the standard C/Rsystem alone, across a wide range of configurations and applications.121.6 Scope of the contributionsThis section highlights the key assumptions, and the properties of the target appli-cations explored in my thesis to answer each of the following research questions: RQ1—Low-cost, high-accuracy error resilience characterization and errordetection techniques. The main assumption made in this part of my re-search is as follows: During the execution of an application, different pro-gram states, if affected by the errors, are responsible for different failure out-comes. Therefore, it is possible to improve the AVF/PVF analysis in terms ofachieving significantly more accurate estimates while remaining low-cost bybounding the subsets of program states that lead to different types of failureoutcomes. Further, if the target applications exhibit repetitive computationpatterns that can be leveraged by estimates to prioritize the choices of pro-gram state protection for error detection, the program states that have highcrash-proneness are not likely to lead to other types of failures such as SDCs. RQ2—(i) Recover applications from crashes caused by errors affecting thecomputational components. The class of applications that are likely to signif-icantly benefit from the proposed roll-forward recovery technique—LetGohas the two following properties: (a) The applications conduct iterative meth-ods to require the computation to converge, thus errors can be masked in-herently during the computation. (b) The applications feature application-specific acceptance checks to verify the correctness of the computation, whichalso helps filter erroneous output caused by LetGo, if any. RQ2 (ii) Recover applications from memory errors that are detectable butuncorrectable. BonVoision targets a similar class of applications as LetGodoes, while exhibiting the following two differences in terms of the expectedapplication’s properties: (a) spatial data smoothness: the physical phenom-ena often modeled by the target applications exhibit inherent continuities intheir modeled physical space, and further, the data structures used by theseapplications tend to keep physically close data points close together in mem-ory as well to exploit locality; and (b) the applications using direct methods13Table 1.1: The scope of target applicationsProposedtechniquesTarget applicationdomain Main sources Expected propertyePVFgeneral scientificapplicationsRodiniarepetitive computationpattern†LetGoscientific modellingand simulationDOE mini-appiterative, convergence,acceptance checkBonVoisionscientific modelling andsimulation, linear algebraDOE mini-appconvergence (relaxed),acceptance check,spatial data smoothness† A repetitive computation pattern allows the ePVF methodology to use a par-tial data dependence graph to estimate the error resilience for the entire appli-cation state and hence significantly accelerate the modeling.for computation may also benefit from the recovery schemes proposed byBonVoision.Table 1.1 summarizes the scope of the target applications and the correspond-ing properties that are in favor of each of my proposed techniques.1.7 Dissertation organizationThe rest of this dissertation is organized as follows: Chapter 2 presents the related work associated with each direction of myresearch: error resilience characterization, error detection and error recovery. Chapter 3 presents the ePVF methodology, an advancement over the state-of-the-art solution, PVF analysis, for a more accurate estimate of an applica-tion’s error resilience characteristics, and selective duplication informed bythe analysis. Chapter 4 motivates the demand of the roll-forward recovery C/R scheme,proposes the LetGo framework as a demonstration of such schemes and eval-uates its overall effectiveness and efficiency.14 Chapter 5 describes how the spatial data smoothness can be employed toimplement the roll-forward recovery C/R systems for HPC applications tomitigate the impact of memory DUEs. Chapter 6 concludes the thesis and presents the impact of the dissertationand potential future work.15Chapter 2Background and Related WorkThis chapter presents the background information on hardware fault characteriza-tion and tolerance systems and highlights the novelty of this thesis. The chapter isorganized as follows: First, we present the prior studies on hardware-based faultcharacterization and tolerance techniques and explain the benefit of using software-implemented techniques; then we discuss the efforts made in the software-basedtechniques as follows: (i) various techniques proposed by the community aim-ing to characterize the error resilience of applications; (ii) different error detectiontechniques and the opportunity we seek to connect the error resilience characteri-zation to the error detection techniques and (iii) the general failure recovery strate-gies, with the particular interests in the existing lightweight recovery techniques,the recovery techniques for memory DUEs and, finally, the studies towards moreefficient C/R schemes.2.1 Hardware-based fault injection and tolerancetechniquesHardware-based solutions have been proposed for various purposes in the relia-bility community over decades. In this section, we present the advantages anddisadvantages of hardware-based solutions and motivate the need for software-implemented techniques.162.1.1 Fault characterization with hardwareHardware-implemented fault injection relies on extra hardware to introduce faultsinto the system. As defined in [79], there are two categories of injections—withand without contact of the system: With contact: The fault injection tool has direct contact with the target sys-tem so that the tool has the control over the design of the injection procedure,such as the frequency of the occurrence of the fault, the location of the faultor what type of fault is modeled. A general way to implement the tools inthis category is through the circuit pins [8, 85, 122], while others [33] chooseto directly flip the content in a flip-flop. Without contact: In contrast, techniques such as accelerated testing withneutron beams [61] issue neutron strikes on the system to mimic the naturalphysical phenomenon and study how the system reacts.Hardware-implemented fault injection is commonly used for tasks such asmimicking stuck-at or permanent faults. The software-implemented techniques arenot suitable for such tasks because implementing these types of faults in softwarerequires significant efforts and is often impossible. However, if the goal is to under-stand how a fault propagates during a program’s execution, or to evaluate the im-pact of the fault on the program’s behavior, software-implemented techniques aremuch more desirable. In addition, since the hardware-implemented approaches in-volve extra hardware components, they are not as flexible as software-implementedtechniques.2.1.2 Hardware-based redundancyHardware-based redundancy offers reliable computation or storage at various levelsof abstraction. Below, we describe some effective solutions at each level: System level: The typical approach for system-level redundancy is to em-ploy the extra processor(s) (i.e. the watchdog processor [104]) to validatethe results of the main processor. For example, Benso et al. [16] propose awatchdog to check if there is an error during the data exchange between the17main processor and the memory, and to check the control-flow integrity ofthe execution of a program. Austin [12] proposes a dynamic implementationverification architecture (DIVA) that extends the speculation mechanism ofa modern microprocessor to detect errors in the computation of the proces-sor core. On the multi-threading/multi-core architectures, redundant thread-s/cores are employed to detect transient hardware faults by comparing theresults of two threads/cores that contain the same input [111]. Circuit level: Circuit level techniques normally use code words to detect er-rors, where the complexity of the code word ranges from simple parity bits,error detection and correction codes, residue codes for ALUs [7], etc. As fortoday’s system, the most commonly used technique is ECC. ECC has beenan industry standard technology that considerably improves the reliabilityof the DRAM or SRAM on the system. The examples of ECC schemesinclude SEC-DED ECCs, chipkill-detect ECCs and chipkill-correct ECCs.SEC-DED ECCs support the correction of single-bit errors and the detectionof double-bit errors. As an advanced version, chipkill-detect and chipkill-correct ECCs employ RAID (Redundant Arrays of Inexpensive Disks)-likeDRAM architectures, and such a scheme is used to calculate the ECC check-sum for the contents of the entire set of chips for each memory access. Thechipkill ECC then stores the result in extra memory space on the protectedDIMM, and when a memory chip on the DIMM fails, the RAID result isused to reload the lost data.The hardware-based redundancy approaches require additional hardware com-ponents and the support of extra hardware logic for fault detection and recovery.Compared to software-implemented approaches, they incur a fixed cost in area andcomplexity regardless of whether the application uses them.2.2 Software-implemented characterization and faulttolerance techniquesIn this section, we investigate the existing software-implemented characterizationand fault tolerance techniques and explain why the approaches proposed in this18thesis have the potential for improvement.2.2.1 Error resilience characterization via vulnerability analysisThere has been a considerable amount of work on estimating the error resilienceof a program either through fault injection or through vulnerability analysis tech-niques. The main advantage of fault-injection is that it is simple and allows dis-tinguishing between different failure outcomes, yet it has limited predictive powerand is slow. Alternatively, the main advantage of vulnerability analysis is that it haspredictive power and is faster, but it does not distinguish between different kindsof failures.Biswas et al. [20] separate the overall AVF of processor structures into SDCAVF and Detected Unrecoverable Errors (DUE) AVF by considering whether bit-level error protection mechanisms such as ECC or parity are enabled in those struc-tures. While DUE is similar to the notion of the crash and causes a fail-stop symp-tom, it is defined at the hardware level only and does not consider software-levelmechanisms. Furthermore, like AVF, DUE-AVF is highly hardware dependent.Wang et al. [142] compare the ACE analysis estimation of the AVF on a pro-cessor component with the result of a fault injection analysis and find that ACEanalysis is significantly conservative. However, they do not attempt to improve theACE analysis methodology.Bronovetsky et al. [24] use standard machine learning algorithms to predict thevulnerability profiles of different routines under soft errors to understand the vul-nerability of the full applications. However, their technique is confined to linearalgebra applications. Lu et al. [102] and Laguna et al. [92] identify SDC-causingcode regions through a combination of static analysis and machine learning. How-ever, their technique does not provide a sound understanding of why some faultscause SDCs and others do not. A common issue with machine learning techniquesis that they require extensive training with representative data, which analytic tech-niques do not.Finally, Yu et al. [152] introduce a novel resilience metric called the data vul-nerability factor (DVF) to quantify the vulnerability of individual data structures.By combining the DVF of different data structures, the vulnerability of an applica-19tion can be evaluated. While useful, this technique requires the program to be writ-ten in a domain-specific language, which is restricted in terms of its expressiveness.Further, DVF does not distinguish between crash-causing errors and other errors.The main question we ask in this thesis is whether it is possible to combinethe advantages of the two approaches by building an architecture-neutral vulnera-bility analysis technique to distinguish different failure outcomes, and especiallySDCs. Therefore, we use fault injection to gather the ground truth of the error re-silience characteristics of an application and compare it with the result of the ePVFmethodology.2.2.2 Error detection techniquesThere has been a significant amount of prior work in the design of efficient errordetection mechanisms and the main direction towards this goal is through selectiveduplication. For example, SWIFT [123] removes the need for replicating the mem-ory subsystem when detecting errors and hence improves the overall performanceof error detection. Hari et al. [75] present a low-cost, program-level fault detectionmechanism for reducing SDCs in applications. They use their own prior work, Re-lyzer [76], to profile applications and select a small portion of the program fault siteto identify static instructions that are responsible for SDCs. By placing program-level error detectors on those SDC-causing sections, they can achieve high SDCcoverage at a low cost. However, application-specific behaviours are major con-tributors of SDCs for half of their benchmarks, which makes it difficult to extendtheir technique to other applications.Shoestring [60] defines “high value" instructions (instructions which, when af-fected by errors, are likely to cause SDCs), and selectively duplicates those instruc-tions. In this work, any instructions that can potentially impact global memory areconsidered high-value, and any instructions that can produce arguments passed tofunction calls are also high-value. The problem with these heuristics is that theyare coarse in terms of classifying the instructions with their SDC-proneness. Lu etal. [102] propose an empirical model to predict the SDC-proneness of a program’sdata. The proposed framework, SDCTune, is based on static and dynamic featuresof the program and achieves high accuracy in predicting the relative SDC prone-20ness of applications, which limits its usefulness in identifying the most SDC-proneinstructions for selective duplication. Recently, Li et al. [96] proposed TRIDENT,which is a compiler-based approach that models the error propagation and predictsboth the overall SDC probability of a given program and the SDC-proneness of in-dividual instructions of programs, without fault injections. The resultant estimateof the SDC probability, however, is flickering around the estimate obtained fromthe fault injection across various applications, which limits its usefulness to guidethe design of the error resilience techniques.There are other detection approaches taking advantage of the parallelism ofan application. For example, Saxena et al. [128] were the first to use redundantthreads to detect and tolerate errors in a multi-threading environment, which re-lies on full duplication (of a thread). Wei [143] take advantage of the similaritiesbetween parallel processes of a program to detect errors in the program’s controldata. Selective duplication techniques can be integrated with these approaches tofurther improve the coverage of errors.In this thesis, we focus on improving the selective duplication techniques forthe following aspects: (i) the generality of the approach, hence the techniques donot need knowledge of the application-specific behavior, and (ii) the explainabilityof the approach, for building an explicit and quantitative correlation between thechosen instructions for duplication and the SDC-proneness of the program.2.2.3 Failure recovery strategiesOverviewThere have been many efforts to provide comprehensive solutions for HPC systemsto recover from failures. Recovery strategies can be grouped along the follow-ing two dimensions: first, on whether they are application-aware or application-agnostic, and second, on whether they roll-back to a previously correct state oruse heuristics to attempt to repair the state and roll forward. An example of theapplication-agnostic approach is the global view resilience (GVR) [52] framework,which allows applications to store and compute an approximation from the cur-rent and versioned application data for forward recovery. Aupy et al. [11] dis-21cuss how to combine the silent error detection and checkpointing and schedule thecheckpointing and verification in an optimal-balanced way with a rollback recov-ery. Other approaches rely on a detailed understanding of the application. Gamelet al. [62] design and implement a local recovery scheme specialized for stencil-based applications for a fast roll-back at the minimum scale, while algorithm-basedfault tolerance (ABFT) techniques, such as [43, 148], compute checksums for theintermediate states of the LU factorization problems and enable forward recovery.This thesis aims to provide an application-agnostic and forward recovery solutionin the same vein as GVR [52]. However, it is more general and efficient as thereis no need to store previous program states and no need for any data structure orinterface support.Lightweight recovery techniquesThe work proposed in this thesis, LetGo, is inspired by the following two key areas:failure-oblivious computing, which focuses on recovering from failures caused bysoftware bugs [125, 126], and approximate computing, which makes the assump-tion that neglecting some errors during application execution will still lead to pro-ducing acceptable results. We discuss below how LetGo relates to these areas andhighlight the differences.Failure-oblivious computing: Rinard et al. [125, 126] propose failure oblivi-ous computing, an approach that continues application execution when memory-related errors occur during execution. They introduce an abstraction called bound-less memory block: when there is an out-of-bound write, the written value is storedin a hash table indexed by its memory location, then for out-of-bound reads it re-trieves the value from the hash table if the same memory address is used (or usesa default value if the hash table has not been initialized for that value). This isenabled by compile-time instrumentation and checks for all memory accesses atruntime. There are two major problems in their approach, as follows: (i) the tech-nique above focuses only on out-of-bound memory accesses, a subset of sourcesof crashes in HPC systems, and (ii) their approach does not focus on HPC applica-tions, where we believe such techniques are likely to have a high impact.Recently, Long et al. [101] proposed a run-time repair and containment in the22same style as the original failure-oblivious computing work. They expand the so-lution used to drop assumptions on application structure and impose no instrumen-tation on a program during execution. This technique works for errors includingdivide-by-zero and null pointer de-referencing. Both Rinard et al. [126] and Longet al. [101] find that one of the main reasons to the success to their techniques is thecommon computational pattern that occurs in all of their benchmark applications—that the input of the applications can be divided into units, and there is no interac-tion between computations on different input units. This is, however, not a typicalcomputation model for HPC applications.Failure escaping: Carbin et al. [26] designed Jolt, a system that detects whenan application has entered an infinite loop and allows the program to escape fromthe loop and continue. Jolt has a similar philosophy as our proposed approach(i.e. LetGo): adjusting the program state when a program appears to be trappedin a failing state. However, Jolt is designed only to help programs escape from aninfinite loop (a hang), a relatively infrequent failure scenario, as our fault-injectionexperiments indicate.Approximate computing [72, 108]: This starts with the observation that, inmany scenarios, an approximate result is sufficient for the purpose of the task,hence energy and performance gains may be achieved when relaxing the precisionconstraints. The philosophy of approximate computing is applied to different sys-tem levels, including circuit design [71, 149], architectural design [66, 120] andapplication characterization and kernel design [32]. This concept, along with therelated body of research such as probabilistic computing [30, 117], offers potentialplatforms for applications that can tolerate imprecision in the final answer. Miguelet al. [107] propose the Bunker Cache, a design that maps similar data to the samecache storage location for better cache efficiency while degrading the quality ofthe computation. However, approximate computing, or probabilistic computing,aggressively relaxes the accuracy of the computation, which is a different philoso-phy from that of our approach.23Recovery techniques for memory DUEsLevy et al. [94] propose a methodology that leverages the data similarity betweenmemory pages to perform compression on memory and use the compressed mem-ory pages to repair the ones that are impacted by DUEs. At a high level, their goalis similar to our goal, while two main limitations exist: (i), as stated in their paper,the memory compression may not always be a viable operation, as they can onlyprotect pages that are read-only at the beginning of each protection interval; (ii)when a DUE occurs, and the corrupted page is eligible to recover, they essentiallyroll that corrupted page back to its previous state, which can cause a cascadingrecovery.Gottscho et al. [64, 65] propose a software-defined ECC technique (SDECC)that uses side information to estimate the original message by filtering and rank-ing possible candidate code words. The SDECC technique is shown to be ableto successfully recover most DUEs for integer applications. Along this direction,Schoeny et al. [129] focus on the coding side of SDECC and a priori designate spe-cific messages to protect the system from errors with stronger confidence. Thereare two problems involved, as follows: first, both papers assume that the ECC syn-drome is available for investigation, while the availability of the syndrome bits islimited in reality; second, both papers need hardware-level support and modifica-tions, which are not possible for most HPC systems. Poulos et al. [53] leverageboth the candidate code-words and the application-specific contextual informationto attempt to repair DUEs without hardware modification. However, they requireapplication-specific information to guide the selection of the code word, whichmakes the approach less general.Improving C/R efficiencyThe C/R systems can incur high overheads for HPC applications. Mitigating theseoverheads depends on several factors, including the algorithm that determines thefrequency, the content/size of the program state that needs to be stored and theactual implementation across the whole spectrum of the system. Below, we placeemphasis on the research directions towards the improvement of the C/R systemand explain the potential opportunities for better efficiency of the C/R system.24Optimize the checkpoint interval: Young et al. [151] and Daly et al. [42] cameup with analytic models that aim to find the optimal theoretical checkpoint inter-val. EI-Sayed et al. [50] show that checkpointing under Young’s formula achievesa performance almost identical to that of more sophisticated schemes, based onexhaustive observations on the production systems. Wang et al. [139] model thecoordinated checkpointing for large scale supercomputers and find that there is anoptimum number of processors for which total useful work is maximized.Shorten the checkpoint and restart time: Unlike the traditional checkpoint/restartsystems that store the checkpoints on disks, several virtual machine checkpointingtechniques, including those of Remus [40] and Wang et al. [140], save checkpointsin the memory, which achieves an orders of magnitude reduction in the checkpoint-ing and restart time for VMs.Shrink the size of the checkpoint: TICK (Transparent Incremental Checkpointerat Kernel Level) [63] uses kernel threads to implement the system-level incrementalcheckpoint for serial applications, which significantly shrinks the size of the check-point. Moreover, Wang [137] proposes a C/R system that transparently supportsboth full and incremental checkpoints for a parallel application (i.e. MPI-based).The above discussed techniques, however, are designed for C/R schemes thatexercise the roll-back recovery strategies. Our optimization, which allows the ap-plication to roll-forward its execution upon failures, would yield potential benefitson the overall efficiency of the C/R system, under the following hypotheses (i) asthe application crashes less often, the mean time between failures (i.e. MTBF) ofan application becomes longer. According to the analytical models [42, 151], thelonger MTBF would lead to an increased checkpoint interval, thus the applicationcan take fewer checkpoints during its execution; (ii) since the application directlycontinues under the roll-forward scheme, it does not need to pay the overhead ofrolling back to a previous checkpoint or restarting upon a failure.Today’s large HPC systems have an MTBF of tens of hours [78], even withhardware- and software-based protection techniques employed together. The fail-ure rate is expected to further increase, which mandates that a larger portion ofcomputation cycles are used for fault tolerance [45].25Chapter 3ePVF: An Enhanced ProgramVulnerability FactorMethodology for Cross-layerResilience AnalysisThis chapter answers the RQ1 and presents my work on first predicting the error re-silience characteristics of an application, and use the prediction results to guide thedesign of selective duplication systems. We propose ePVF, an enhancement of theoriginal PVF methodology, which filters out the crash-causing bits from the ACEbits identified by the traditional PVF analysis. The ePVF methodology consistsof an error propagation model that reasons about how the error propagates in theprogram, and a crash model that encapsulates the platform-specific characteristicsfor handling hardware exceptions. ePVF reduces the vulnerable bits estimated bythe original PVF analysis by between 45% and 67% depending on the benchmark,and has high accuracy (89% recall, 92% precision) in identifying the crash-causingbits. We demonstrate the utility of ePVF by using it to inform selective protectionof the most SDC-prone instructions in a program.263.1 IntroductionAs discussed in Chapter 1, a transient hardware fault can affect an applicationand cause an SDC or a crash failure. For certain domains like physics simulation,applications validate the computational result to filter out obvious deviations, whilesuch validation does not offer the guarantee of no data corruption. Moreover, thereis no generic method to detect SDCs without re-executing the entire application andchecking for a mismatch, or without a significant amount of hardware redundancy,both of which are expensive. Users will typically trust the application’s output inthe absence of an error indication, which makes SDCs one of the major concerns.One of my research goals is to develop a systematic method to inform the de-sign of software protection techniques (e.g., code transformations) to make appli-cations resilient to SDCs. A first and essential step towards this goal is estimatingthe SDC rates of programs. As SDCs are caused by a combination of application-specific and system-specific factors, in this chapter, we focus on the system-specificfactors that lead to SDCs. The main insight underlying this part of the thesis isthat a fault that leads to a crash cannot (by definition) lead to an SDC. Becausecrashes are caused by a combination of the hardware and Operating System (OS)features, they can be systematically reasoned about in an application-independentmanner. By removing the crash-causing faults from the set of all faults, one canobtain a tighter estimate of the SDC rate. This is as important as crashes are oftenthe dominant failure outcome, and hence significantly outnumber both SDCs andhangs [1, 10, 67, 150].We propose a new method, ePVF (enhanced PVF), that builds on the originalProgram Vulnerability Factor (PVF) analysis methodology proposed by Sridha-ran et al. [131] to remove crash-causing faults from the set of all faults. PVF isa systematic method to efficiently evaluate the error resilience of software underhardware faults. PVF can also be used for predictive and comparative analysisstudies to understand the effect of different protection techniques or code transfor-mations on the error resilience. However, PVF does not distinguish between faultoutcomes and, essentially, treats crashes, SDCs and hangs as equally severe. There-fore, using PVF to estimate application error resilience and inform the protectionmechanisms often leads to overprotecting applications, thereby resulting in unnec-27essary performance and energy overheads. By distinguishing between crashes andother failures, ePVF allows protection techniques to better focus on the program’sbits that if corrupted, can potentially cause SDCs.There are two challenges in identifying crash-causing bits. Firstly, crashes arecaused by OS and architecture-specific factors, which we need to understand andmodel. Secondly, the crash-related OS state varies during program execution (e.g.,segment boundaries for segmentation faults). Hence we need a dynamic model forpredicting whether a particular fault will cause a crash. We find that the majority ofcrashes are caused by illegal memory addressing, and that by capturing and reason-ing about the state of a program’s memory segments in a platform-specific manner,we can accurately find almost all crash-causing bits. We therefore extend the orig-inal PVF estimation to estimate the ranges of values that may generate crashes,propagate them on the backward slices of the loads and stores, and efficiently com-pute the set of bits that can result in program crashes.Contributions. To the best of our knowledge, our work is the first to considerthe effects of different failure modes through a PVF-like analysis for the goal ofanalyzing a program’s resilience to SDCs. Our work is also the first to close the gapbetween analytical models such as PVF, and experimental assessment techniquessuch as fault injections. This chapter:1. Develops a crash model (§3.3.4) to predict which faults in a program causea crash, a propagation model (§3.3.3) to reason about propagating ranges ofcrash-causing bits in the program’s dependence graph, and integrates themwith the PVF methodology;2. Implements the method in the LLVM compiler [93] and its intermediate rep-resentation (IR) which offers the ability to support multiple platforms andarchitectures;3. Evaluates the accuracy of the proposed ePVF method vis-a-vis fault injec-tion (§3.4) at the same abstraction level using LLFI, an open-source faultinjector [144]. It finds that ePVF estimates crash-causing bits with 89% re-call and 92% precision, when evaluated over a set of ten benchmarks. Moreimportantly, the number of vulnerable bits estimated by the ePVF analysis is28lower than that estimated by the standard PVF analysis by 61% on average.Thus ePVF leads to a tighter estimate of the SDC rate, and a close estimateof the crash rate compared to fault injection.4. Demonstrates the utility of the ePVF analysis through a case study involvingselective instruction-level protection for SDC mitigation (§3.5). We find thatthe SDC rate reduction achieved using ePVF is, on average, 30% better thanthat achieved by hot-path duplication (i.e., duplicating the most frequentlyexecuted program paths), for the same performance overhead.3.2 BackgroundThis section offers background information on error resilience, the dependabil-ity metric we estimate (§3.2.1), past work on estimating it through fault-injection(§3.2.2) and vulnerability analysis techniques (§3.2.3), the abstraction level thatour technique works at (§3.2.4) and our fault model (§3.2.5).3.2.1 Dependability metric: error resilienceNot all faults in a program result in failures due to masking at different layers of thesystem stack. As we focus on software resilience techniques, we do not considerhardware masking [105], but only take into account faults passing the hardware andseen by the software. This is in line with other work in this area [44, 60, 86, 123].In the context of this work, we define error resilience as the probability thatthe application does not have an SDC after a transient hardware fault occurs andimpacts the application state. Note that error resilience does not take into accountthe probability of a fault occurring and affecting the software (which depends onthe base fault rate in the hardware and the application execution time). In §3.5, weestimate the impact of protection techniques by taking into account their effect onapplication performance in addition to the resilience.3.2.2 Effort in accelerating fault injectionHari et al. [73, 76] have proposed approaches to reduce the cost of fault injectioncampaigns by pruning the fault injection space. However, fault injection campaigns29are still costly and cannot be used in situations where predictive power is neededto choose between the multiple options available for code optimization. Instead,we need an automated characterization of error resilience that does not use faultinjections.3.2.3 Program vulnerability factor (PVF)The AVF of a hardware component is the probability that a fault occurring in thecomponent leads to a visible error in the final output of a set of executed in-structions. Mukherjee et al. [110] introduced the Architecturally Correct Execu-tion (ACE) analysis for estimating the AVF of processor structures (e.g., Reorderbuffer) based on a dynamic execution trace on a specific microarchitecture. Bycombining ACE analysis with the raw error rate of a processor structure and itsAVF, microprocessor designers can estimate the FIT (Failures In Time) of eachprocessor structure and take appropriate action in the design stage. However, AVFis intricately tied to the microarchitectural design of a processor, and cannot beused to reason about software resilience in isolation.Sridharan et al. [131] separate the hardware-specific component of AVF fromthe software-specific component: the Program Vulnerability Factor (PVF). Theyshow that the PVF can be used to explain the error resilience behaviour of a pro-gram independent of the processor. Moreover, they show that by using techniquesthat are used in computing the PVF [131], programmers are able to pinpoint thevulnerability of different segments of the program, and gain insights for designingapplication-specific fault tolerance mechanisms. However, the key drawback ofPVF is that it does not distinguish between different kinds of failures, i.e. crashesand SDCs.PVF abstracts out timing information and includes only the relative instructionorder in the instruction flow. This makes the process of computing the PVF largelymicroarchitecture neutral, thus making it a function of the program and the archi-tecture alone (when executed with a specific input). Sridharan et al. [131] estimatethe PVF of an architectural resource ’R’ as the ratio between the ArchitecturallyCorrect Execution (ACE) bits in the resource when executing the set of instructionsI in a trace and the number of total bits involved in R (i.e., BR) (Equation 3.1). The30ACE bits are the bits in which a fault would potentially affect the correctness ofthe execution of the instructions in I.PVFR =∑Ii=0 (ACE bits in R at instruction i)BR×|I| (3.1)3.2.4 LLVM IRLLVM is a compiler infrastructure that provides support for different hardwareplatforms [93]. The key component of LLVM is its intermediate representation(IR), an assembly-like language that abstracts out the hardware and ISA-specificinformation. Our methodology is built on LLVM IR level, and we choose to workat this abstraction level for the following reasons:i) LLVM abstraction level (i.e., LLVM IR) offers an uniform representation ofa program; hence the ePVF methodology is architecture neutral and easy to portacross different architectures/ISAs, thereby eliminating the influence of architec-ture or ISA-specific factors.ii) LLVM IR maps closely to constructs of a program and preserves source-level program properties, which makes it easy to help understand the inherent faultmasking.iii) LLVM infrastructure provides extensive support for program analysis andinstrumentation. Prior work focusing on selective duplication techniques [60, 123]uses the LLVM compiler for both static and dynamic analysis, and program instru-mentation.To validate our methodology, we use fault injection experiments at the sameabstraction level (LLVM IR) as our ePVF implementation using LLFI [144]. Choet al. [33] have found that high-level fault injections can directly model a subset ofsystem-level behaviors caused by transient faults. However, our focus (and thefocus of the original PVF paper) is on the subset of faults that do manifest inarchitecturally visible state (e.g., registers) - these faults can be modeled by high-level fault injections.31                   Crash-causing                   bitsTotal BitsACE BitsePVF bitsFigure 3.1: Venn diagram that highlights the crash-causing and the ePVF bits thatthe ePVF methodology identifies as a subset of ACE bits. SDC-causing bitswill be a subset of the ePVF bits.3.2.5 The fault modelFor this part of the dissertation, we consider transient faults that occur within theprocessor (i.e., register file, ALUs). We use the single-bit-flip model to representtransient faults as in other related work [60, 67, 123, 143, 150]. Our techniquecan be easily extended to multiple-bit flips. In recent work, Cho et al. [33] findthat low-level hardware faults manifest as both single- and multiple-bit flips at theapplication level. However, recent work [13, 103] has shown that the differencebetween single- and multiple-bit flips occurring in program states is marginal interms of their impact on SDCs. Therefore for this study, we stick to single bit flips,as SDCs are our main concern.3.3 ePVF methodologyOur goal is to obtain a comprehensive estimate of the program resilience that doesnot entail the full-blown costs of fault injection. We aim to adapt the PVF method-ology (see §3.2) for this purpose. As pointed out earlier, PVF does not distinguishbetween crashes and SDCs, and hence is overly conservative, as SDCs are the mainconcern in practice.Definition: We define ePVF by analogy to PVF as the ratio of non-crashingACE bits over the total bits involved. Figure 3.1 shows the ePVF bits: they area subset of all ACE-bits, and a superset of the SDC-causing bits. It is a supersetbecause not all non-crashing bits cause SDCs.ePVFR =∑Ii=0 (ACEBits - CrashBits in R at instruction i)BR×|I| (3.2)32LLVM IR Instruction TraceBase ACE AnalysisCrash Model ACE Bits Propagation ModelNon-Crashing ACE BitsePVF MethodologyFigure 3.2: The overall workflow of the ePVF methodology to compute the non-crashing ACE (ePVF) bits.Methodology overview: At a high level, the ePVF methodology consists ofthree major components (Figure 3.2): (i) Base ACE analysis to estimate all thevulnerable bits of the program (§3.3.1); (ii) a crash model to identify the rangesof bit-level faults that cause crashes (§3.3.4); and (iii) a propagation model thatpropagates these ranges along the backward slices of each operand (§3.3.3). Thismethodology is supported by our identification of incorrect memory addressing asthe most common cause for crashes (§3.3.2).3.3.1 Base ACE analysisACE analysis is used to determine the set of all bits in an architectural resource(e.g., a register file) that are not masked and can affect application’s final state.The basic idea is to first identify the instructions that are responsible for the outputof the program (called output instructions), and then find all the instructions in theirbackward slice. We use the program’s dynamic dependency graph (DDG) [4] tokeep track of the data dependencies among the program’s instructions. The DDGis a representation of data flow in the program, and is constructed based on theprogram’s dynamic instruction trace [4]. In the DDG, a vertex can be a register, amemory address or even a constant value. An edge records the instruction (i.e., anoperation) and links source operand(s) to destination operand(s).We implement the DDG analysis at the LLVM compiler’s intermediate repre-sentation (IR) level. Note that since the LLVM IR abstracts out the hardware/ISA-33specific information, it contains an infinite number of virtual registers 1. As aresult, at this level there is no notion of a register file. We model the architecturalresource as the set of virtual registers used in the IR of a program. This definitionalso matches our fault injection experiments as only activated faults are considered.One further issue is deciding how to express register-based memory addressinginstructions. This needs special care as it is common to have the same register tostore many different memory addresses, or different registers store to the sameaddress. Since we create new DDG nodes for each newly written memory address,it is also common for a register to store multiple uses of the same memory address.To handle these cases, we create an edge in the DDG to link the memory addressused and the register. This edge is virtual to differentiate this case from direct datadata dependencies.Running example: Figure 3.3 shows a small portion of the DDG constructedfrom a dynamic IR instruction trace of the pathfinder benchmark [29]. We re-name the IR registers for readability. Figure 3.3a presents the corresponding staticinstruction in the LLVM IR of the program2. Figure 3.3b illustrates the DDG ob-tained after executing the static instructions in Figure 3.3a. Nodes representingmemory are labelled with the address values recorded during the run-time. Mem-ory addresses that correspond to the output are highlighted.From each memory location that is part of the output, in this case 0x15FB174(highlighted in Figure 3.3b), we run a reverse breadth-first search on the DDG thatcontains all the dependent vertices of 0x15FB174. This step will exclude the noder8 as it does not contribute to the output. We call the resulting graph the ACE graph(Figure 3.3c). Then the total ACE bits are calculated as (the size of each operandis defined in IR):ACE Bitsused registers =7∑i=1Bits in Ri= 32+64+32+32+64+64+64 = 3521The register allocator will take the physical register file size into account when mapping thevirtual registers to physical ones in the compiler backend.2LLVM has a special IR instruction named getelementptr (gep) to abstract memory address com-putations, which corresponds to a combination of several instructions in a assembly language suchas MOV and ADD instructions.34  r1 = load i32* r2, align 4  r4 = add nsw i32 r1, r3  r5 = getelementptr inbounds i32* r6, i64 r7  store i32 r4, i32* r5, align 4   r8= load i32* r2, align 4a A small portion of static IR-level representation of the pathfinder benchmark0x15FB1780x15FB174loadadd addstorevirtual gep gepr2virtualr1r4r3r5r6r7r8loadb The DDG constructed based on the dy-namic trace resulting from executing the code pre-sented in (a).0x15FB1780x15FB174loadadd addstorevirtual gep gepr2virtualr1r4r3r5r6r7r8loadc The ACE graph used to capture the ACEbits. Note that dynamically dead code is elimi-nated.Figure 3.3: An example of computing PVF for the register file35To compute the PVF we also need to compute the total bits for used registers,summing the total bits used in operations within this sequence of instructions.Total Bitsused registers =8∑i=1Bits in Ri= 32+64+32+32+64+64+64+64 = 416Then, the PVF of used registers for this example is:PVFused registers =ACEBitsused registersTotalBitsused registers= 0.8463.3.2 Finding the crash-causing bitsWe aim to identify the bits that cause the program to crash (i.e., lead to hardwareexceptions), and subtract these bits from the overall ACE bits. To this end, itis important to determine the types of crashes we observe in practice and theirrelative frequencies. We perform a fault injection experiment by injecting faultsinto ten benchmark applications (§3.4 describes our fault injection methodologyand benchmarks).We observe four types of exceptions resulting in crashes (Table 3.1). Table 3.2shows their relative frequencies. Our results show that segmentation faults are thepredominant source of crashes with a 99% average frequency and a 96% minimumover all benchmarks. This observation suggests that, for the class of workloadswith similar properties as these benchmarks, we only need to model the mech-anisms that generate segmentation faults to identify almost all crash-causing bits.We note that other workloads, architectures, or operating systems may change theseprecise findings, but a similar methodology can be followed.Segmentation faults result from memory access violations. Although differ-ent operating systems may implement violation detection mechanisms in differ-ent ways, segmentation faults are determined based on checking memory accessesagainst segment boundaries.There are two main challenges in determining which bit flips would lead to asegmentation fault: first, we need to find the ranges of the bits that, if flipped, would36Table 3.1: Types of exceptions resulting in crashesType DescriptionSegmentation fault(SF)Memory access that exceedsthe legal boundary of a mem-ory segmentAbort (A) Programs aborted by them-selves or OSMisaligned mem-oryaccess (MMA)Memory accesses are notaligned at four bytesArithmetic errors(AE)Divided by 0, Overflow,Floating point error, etcTable 3.2: Relative crash frequency analysis for each benchmarksBenchmark Types of crashes (%)SF A MMA AEhotspot 97.6% 0.0% 2.3% 0.1%bfs 98.8% 0.0% 0.7% 0.5%lavaMD 99.5% 0.0% 0.0% 0.5%nw 99.6% 0.0% 0.4% 0.0%pathfinder 99.9% 0.1% 0.0% 0.0%lud 100.0% 0.0% 0.0% 0.0%srad 96.0% 0.0% 4.0% 0.0%mm 99.8% 0.1% 0.1% 0.0%particlefilter 100.0% 0.0% 0.0% 0.0%lulesh 99.0% 1.0% 0.0% 0.0%result in an out-of-bounds memory access. This includes both faults in the memoryinstructions themselves (i.e., load and store), and faults in their backward slicesused for memory addressing. Second, we need to predict if an incorrect memoryaccess will generate an access violation. To this end, all segment boundaries at thetime of the memory access need to be known.To overcome the first challenge, we implement an algorithm that propagatesthe ranges of crash-causing bits along the backward slice of the memory accessoperation (§3.3.3). To overcome the second challenge, we instrument the programto embed a probe for each memory access and capture all the dynamic segmentboundaries (§3.3.4).373.3.3 The propagation modelWe model fault propagation for crash-causing faults starting from a memory ad-dressing operation and going backwards in the DDG. This analysis is triggeredeach time a load/store instruction is encountered during the iteration over the ACEgraph (i.e., the subgraph that contains all ACE nodes in the DDG) to compute ACEbits. The aim is to find all bits that can generate an out-of-bound address on thebackward slice of the memory address calculation. The model assumes that onlyone fault happens during the course of a program. (§3.2.5).The propagation model consists of two algorithms. Algorithm 1 describeswhen and how the propagation model is triggered. It consists of two procedures:ITERATE_OVER_ACE_GRAPH and CRASH_CALC. The ITERATE_OVER_ACE_GRAPHprocedure takes the ACE graph as input and iterates over the vertices in the ACEgraph. When it reaches a load/store instruction (line 3), the backward slice for theaddress used in the load/store instruction is calculated (line 5) and passed to theprocedure CRASH_CALC (line 6). Inside CRASH_CALC, all the instructions alongthe backward slice are visited and, for each instruction, the ranges for crash-causingbits in operands are computed by invoking GET_RANGE_FOR_CRASH_BITS.Algorithm 1 Iterates over the ACE graph and invokes CRASH_CALC whenever aload or store instruction is encountered1: procedure ITERATE_OVER_ACE_GRAPH(ACEGraph)2: for all inst in ACE Graph do3: if inst.opcode == load/store then4: backwardslice =5: CALCULATE_BACKWARD_SLICE(inst)6: CRASH_CALC(backwardslice)7: procedure CRASH_CALC(backwardslice)8: for all inst in backwardslice do9: GET_RANGE_FOR_CRASH_BITS(inst)The second algorithm (Algorithm 2) consists of the procedure GET_RANGE_FOR_CRASH_BITSthat models the execution of each instruction along the backward slice to calculatethe range for crash-causing bits. Specifically, for a load/store instruction, the crashmodel is invoked (CHECK_BOUNDARY) to determine the range of bits that generatean out-of-bound memory access (at line 6) to obtain a range of crash-causing bitsfor the destination register (line 9). Then, for each source operand, the procedurecalculates the range for the crash-causing bits by taking into account the range of38Algorithm 2 Calculates the range of the crash-causing bits for memory accessinstructions based on the backward slice of the address used1: procedure GET_RANGE_FOR_CRASH_BITS(inst)2: crashing_bits← 03: global crash_bits_list4: oplist← inst.source_operands5: if inst == load/store then6: (max,min) = CHECK_BOUNDARY(inst.addres)7: crash_bits_list[inst.address] = (max,min)8: else9: (max,min) = crash_bits_list[inst.dest_opernd]10: for all op in oplist do11: (new_max,new_min) = lookup_table(op, inst)12: crash_bits_list[op] = (new_max,new_min)13: crashing_bits += bits that makethe value o f op outside (new_max,new_min)return crashing_bitsthe corresponding destination operand and the semantics of that instruction (line 10to line 13). The semantics of the instruction are determined by the lookup_tablefunction in line 12.Table 3.3 shows the common instruction types encountered on the backwardslice of a memory address calculation and how the lookup_table is used to computethe range for each operand. We assume that all values of operands are positive in-tegers. The algorithm propagates these ranges along the backward slice by storingthem in the CRASHING_BIT_LIST for further reference by the next instructions,as shown in line 7.Table 3.3: Range calculation operations that commonly occur on the backward sliceof memory addressesNo. Opcode Operand Semantic Range Calculation for operands1 add dest, op1, op2 dest = op1 + op2Max(op1) = Max(dest) - op2 ; Min(op1) = Min(dest) - op2Max(op2) = Max(dest) - op1; Min(op2) = Min(dest) - op12 sub dest, op1, op2 dest = op1 - op2Max(op1) = Max(dest) + op2 ; Min(op1) = Min(dest) + op2Max(op2) = op1 - Min(dest) ; Max(op2) = op1 - Max(dest)3 mul dest, op1, op2 dest = op1 * op2Max(op1) = Max(dest)/op2 ; Min(op1) = Min(dest)/op2 (if op2 != 0)Max(op2) = Max(dest)/op1; Min(op2) = Min(dest)/op14 div dest, op1, op2 dest = op1 / op2Max(op1) = Max(dest)*op2 ; Min(op1) =Min(dest)*op2Max(op2) =Max(op1)/dest ; Min(op2) = Min(op1)/dest5 getelementptr dest, op1, op2 dest = op1 + sizeof(op1.type)*op2Max(op1) = Max(dest) - sizeof(op1.type)*op2 ; Min(op1) = Min(dest) - sizeof(op1.type)*op2Max(op2) = (Max(dest) - op1)/sizeof(op1.type) ; Min(op2) = (Min(dest) - op1)/sizeof(op1.type)6 srem dest, op1, op2 dest = op1 % op2Max(op1) = Max(for all bits in op1: bitflip(op1)%op2 >Max(dest) and bitflip(op1)%op2 <Min(dest))Min(op1) = Min(for all bits in op1: bitflip(op1)%op2 >Max(dest) and bitflip(op1)%op2 <Min(dest))Max(op2) = Max(for all bits in op2: op1%bitflip(op2) >Max(dest) and op1%bitflip(op2) <Min(dest))Min(op1) = Min(for all bits in op2: op1%bitflip(op2) >Max(dest) and op1%bitflip(op2) <Min(dest))7 bitcast dest, op1 dest = op1 Max(op1) = Max(dest); Min(op1) = Min(dest)We explain the details of these algorithms using our running example. In Fig-ure 3.3b, r5 stores the address 0x15FB174 for the instruction store i32 r4, i32*39r5, align 4. Suppose our bound-determination technique (described in the nextsubsection) returns the bound (0x15FB800, 0x15FA000), meaning that addressingoutside this bound will generate a segmentation fault. The ACE graph indicatesthat r5, r6 and r7 are used in addressing (or computing the address). Together,these three registers belong to the instruction getelementptr in LLVM. The instruc-tion semantics are based on row 6 of the Table 3.3 - ranges are obtained by applyingthe corresponding equations.r5.value = r6.value+ sizeo f (r6).type)× (r7)maxr6 = maxr5− sizeo f ((r6).type)× (r7)minr6 = minr5− sizeo f ((r6).type)× (r7)The range of r6 can be computed as follows: min: 0x15FB800 - 4*1 = 0x15FB7FC;max: 0x15FA000 - 4*1 = 0x15F9FFC. The resulting range (0x15FB7FC, 0x15F9FFC)of r6 is stored into CRASHING_BIT_LIST as the reference for operands on thebackward slice of r6, if any. Similarly, we can compute the range for register r7and for other registers in the backward slice.Algorithm 3 Obtains the boundary of the segment1: procedure CHECK_BOUNDARY(inst.address)2: global crash_bits_list3: max← 04: min← 05: vma_start = locate_segment_start(inst.address)6: if inst.address⊂ stack&&vma_start < ESP−65536−128 then7: min← ESP−65536−1288: else9: min← vma_start10: vma_end = locate_segment_end(inst.address)11: max← vma_end12: crash_bits_list[inst.address] = (max,min)return (max,min)3.3.4 Crash modelThe goal of the crash model is to determine the ranges of the addresses for whicha memory access will trigger a segmentation fault. While this is platform-specifici.e., specific to the hardware and operating system (OS) on which the program isrunning, the technique described here can be adapted to any architecture that usesmemory segmentation. This includes most modern architectures such as x86 and40ARM.Algorithm 3 describes how to compute the range of allowable addresses for amemory segment. It undertakes two main tasks: (i) obtains the boundary of thememory segment through the underlying OS interface (see line 5 and line 10) and,(ii) determines the run-time range of valid memory addresses for each load/storeinstruction (from line 6 to line 9). We explain the steps below.Obtaining the segment boundaries. Modern operating systems organize pro-cess memory over multiple segments (e.g., text, data, heap, stack). To identify theboundary of each segment we instrument the program to embed a run-time probethat probes the “/proc” system of Linux to record the segment boundaries at eachload and store instruction.Determining allowed ranges. Once we determine segment boundaries at thetime of each load or store, we need to determine which accesses would result insegmentation faults. We initially hypothesized that all accesses outside segmentboundaries will trigger a segmentation fault. Unfortunately, this is not the case, aswe found through a fault injection experiment: a segmentation fault occurred onlyfor about 85% of the out-of-segment accesses we generated. The remaining 15%of accesses did not result in a segmentation fault even though they were outside thesegment boundaries, suggesting that our hypothesis was incorrect.To better understand this behaviour, we studied the source code of the Linuxkernel in our platform, an x86-based machine (Figure 3.4)3. The “vma_start" and“vma_end" indicate the start and end addresses of Linux virtual memory area(vma) and the "addr" indicates the memory address used. In the code, the label"common case” shows the kernel code for when addr is within the valid bound.Note that if addr is within the last page of the stack, Linux will add one page be-low the current last page until the 8 megabyte limit is reached. The label “case I”is when addr is smaller than the “vma_start" and is still bigger than the (ESP−64KB−128B). Linux treats such an address as valid and will expand the stack forit. However, if addr is smaller than (ESP− 64KB− 128B), a segmentation faultoccurs. The label “case II" occurs when addr is greater than the“vma_end", andwill result in a segmentation fault.3Similar code can be found for both x86 and PowerPC kernel versions.41/∗ "vma" means v i r t u a l memory area i t i s an a b s t r a c t i o nused i n Linux f o r memory segments . ∗ //∗ addr represents the accessed memory address ∗ //∗ regs−>sp s to res the cu r ren t s tack p o i n t e r ∗ /Common case :i f vma_start <= addr <= vma_end :/ / every th ing i s f i n eaddr &= page_mask ;/ / f o r s tack :i f addr == the l a s t page of the vma :expand_stack (vma, addr )Case I :/ / on ly f o r s tack :i f vma_start > addr :i f addr + 65536 + 32 ∗ s i z e o f ( unsigned long )< regs−>sp :/ / SEGFAULTelse :expand_stack (vma, addr )Case I I :i f vma_end < addr/ / SEGFAULTFigure 3.4: Linux kernel implementation for determining which memory accessesresult in segmentation faults. Linux kernel version: 3.15. File locations: mm/and arch/x86/mm.Thus, for a non-stack segment, Linux determines the boundaries using its "vma_start"and "vma_end", while for stack segments, it compares the "vma_start" with the cur-rent stack pointer plus an offset to determine the lower bound of the stack. If addris inside an invalid memory region, a segmentation fault occurs.We implement our crash model to mirror the handling of these different cases.Upon re-evaluating the accuracy of the model through the same fault injectionexperiment, we find that we can now accurately predict crashes for over 99.5%of the accesses, pointing to the correctness of the crash model. We use this crashmodel in our experiments.3.4 EvaluationOur evaluation is guided by the following four questions:Q1 How accurate is the ePVF methodology when predicting the bits in whichfaults lead to program crashes?Q2 How close are estimated crash rates to the actual crash rates obtainedthrough fault injection?42Q3 Can the methodology be used to obtain a significantly tighter estimate forthe SDC rate than the conventional PVF methodology?Q4 How fast and scalable is the ePVF analysis?3.4.1 Experimental setupBenchmarks. We evaluate the ePVF methodology on ten benchmarks (Table 3.4):these include eight OpenMP-based scientific applications picked from the Rodiniabenchmark suite [29], our basic implementation of the matrix multiplication ker-nel, and Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (i.e.lulesh) [83, 84], a DOE proxy application. The applications range from 100 lines ofcode (mm) to 3000 lines of code (lulesh). Note that we target scientific applicationsand hence we do not consider SPEC programs.Platform. All of our experiments are conducted on a machine with a x86 CPUrunning at 2.67GHz and Linux v3.15.Table 3.4: Benchmarks used and their complexity (lines of C code).Benchmark Description Domain LOClulesh Unstructured LagrangianExplicit Shock Hydrody-namicsPhysics Modelling 3,000particlefilter Statistical Estimator ofthe state of a particleImage Analysis 602srad Speckle ReducingAnisotropic DiffusionImage Processing 388nw Needleman-WunschMethodBio Informatics 285hotspot Processor TemperatureEstimationPhysics Simulation 272lavaMD MDLAVA MolecularDynamicsMolecular Dynamics 218bfs Breadth-First Search Graph Algorithm 203lud LU Decomposition Linear Algebra 174pathfinder Dynamic Programming Grid Traversal 135mm Matrix Multiplication Linear Algebra 100Fault injection. To build a ground truth, we use the publicly available, open-source LLFI fault injector [144] to inject faults at the LLVM Intermediate Code(IR) level. We inject faults into the source registers for the executed instructionsto emulate faults in the used registers of the instructions, and hence all faults areactivated as they are used in the instruction. Only one fault is injected in eachrun. We perform over 3,000 fault injection runs for each benchmark. The 95%confidence levels are reported as error bars for statistical significance.430%20%40%60%80%100%Percentage	of	each	outcome	of	fault	injection	Crash SDC Hang BenignFigure 3.5: Fault injection results for each benchmark.3.4.2 Q1: What is the accuracy of ePVF methodology?To answer this question, we evaluate ePVF recall and precision. We use fault in-jection experiments to obtain the ground-truth, and compare the outcome of eachfault injection experiment with the outcome predicted by the ePVF methodology.Figure 3.5 shows the outcome (i.e., SDC, crash, hang and benign fault) frequencyfor each benchmark: crashes are the dominant outcome, on average, 63% of injec-tions result in crashes, while 12% result in SDCs, and less than 1% in hangs. Thedominance of crashes highlights the importance of separating the crash-causingbits from the other failure bits.Recall. We define recall as the ratio of crash runs that our model predictscorrectly to be crashes, to all fault injection runs that lead to crashes in reality.To estimate recall, for each fault injection run that leads to a crash, we recordthe instruction counter and the register that the fault is injected into, as well asthe bit that was flipped. We then run the crash and propagation models for theentire program and check whether the location appears in the final crash_bits_list(described in Algorithm 2) that stores the bits that lead to a crash if the bit iscorrupted.Figure 3.6 presents the recall for each benchmark. Overall, our methodologyachieves an average of 89% recall across the ten benchmarks (ranging from 85%to 92%). We manually analyzed the crash-causing bits that were not identified ascrashes by ePVF methodology. The main reason is that our validation techniqueintroduces approximations due to non-determinism in execution environment: thesegment boundaries may be slightly shifted. As a result, it cannot be guaranteed4450%60%70%80%90%100%Recall of the ModelFigure 3.6: Recall for the crash bits predicted using the ePVF methodology.50%60%70%80%90%100%Precision of the ModelFigure 3.7: Precision for the crash bits predicted using the ePVF methodology.to execute fault injection runs with exactly the same environment, particularly thesame memory allocation and the profiling. Through manual verification we foundthat, depending on benchmark, this factor accounts for 92% to 99% of incorrectpredictions.Precision. We define precision as the ratio of the number of correctly predictedcrash-causing bits to the total number of predicted crash-causing bits. To estimateprecision, we randomly choose over 1,200 different bits from those identified bythe model as crash-causing (i.e., appear in the CRASHING_BIT_LIST), and per-form a targeted fault injection experiment. Similar to the recall study, this time foreach bit, we specify the dynamic instruction and the register to inject the fault into,as well as the bit that should be flipped. Precision is calculated as the number ofobserved crashes over the total number of fault injections performed.Figure 3.7 shows the results of the evaluation. The average precision acrossall benchmarks is 92% (ranges from 86% to 98%). As in the case of recall, aftermanual inspection we have confirmed that the main reason for not hitting 100%precision is the difference between the run-time and modeled environments, i.e.,non-deterministic memory allocation.450%20%40%60%80%100%Crash rate estimate using fault injectionCrash rate estimate using ePVFFigure 3.8: The crash rates estimates using ePVF (right bars) and using fault injec-tion experiments (left bars) are close. For fault injection experiments, the errorbars indicate the 95% confidence intervals.3.4.3 Q2: How close are the crash rates estimated using ePVF andfault injection?The ePVF methodology is able to identify crash-causing bits with high accuracy.This can be used to estimate the crash rate of a program as the fraction of crash-causing bits over the total number of bits in an application. Such an estimate canbe important for techniques that use crash rates to determine the level of protectionto be provided, e.g., choosing the checkpoint interval.Figure 3.8 shows that estimating crash rates this way is a good approximationfor crash rates obtained through fault injection experiments. The differences arewithin or close to the 95% confidence interval bounds, except for lavaMD andlulesh. The reason the crash rate predictions are off for these two applications isthat ePVF calculates the crash bits only based on the ACE graph, which contains70% and 80% of the whole DDG for lavaMD and lulesh respectively. On the otherhand, the fault injection uses the full program execution corresponding to the wholeDDG.3.4.4 Q3: Does ePVF lead to a tighter estimate of SDC rate than theoriginal PVF?We have shown that the ePVF methodology can accurately estimate the crash bitsof an application. We now ask whether it can lead to better SDC rate estimates. Asexplained earlier, ePVF provides an upper bound (i.e., overestimate) for the SDC460%20%40%60%80%100%PVF value ePVF value SDC rate from FIFigure 3.9: ePVF (center bars) offers a much better upper bound estimate for theSDC rate (right bars) than the original PVF methodology (left bars). For SDCrates, error bars represent 95% confidence intervals.rate like PVF does. We compare the tightness of these two upper bounds.Figure 3.9 shows the original PVF and the ePVF values for the ten benchmarks.The original PVF ranges from 71% to 98%, with an average of 92%. In contrast,the ePVF estimate ranges from 25% to 40%, with an average value of 31%. Theaverage difference between PVF and ePVF is 61%, ranging from 45% to 67%depending on the benchmarks.Figure 3.9 also shows, for each benchmark, the SDC rate obtained through thefault injection experiments described earlier. The SDC rate ranges from 1 to 25%depending on the benchmark, with an average value of about 12% across bench-marks. ePVF significantly lowers the upper bound of estimated SDC vulnerabilityof a program. The above evaluation suggests that our technique has higher predic-tive power than the original PVF analysis to understand the SDC behaviour of aprogram (we demonstrate that this can be used in practice in §3.5). That said, thereis still room for a tighter bound as we will discuss in § Q4: How fast is the ePVF analysis?Table 3.5 shows, for each benchmark, the number of dynamic LLVM IR instruc-tions, the number of nodes in the ACE graph, and the total time to compute ePVF.The running time ranges from less than a minute (lavaMD) to 5 hours (pathfinder).As expected, the time taken correlates with the ACE graph size. We also measuredthe time spent by various parts of the ePVF analysis: most time is spent in the crash47Table 3.5: Number of nodes in the ACE graph and time taken by the ePVF analysisfor each benchmarkBenchmarks# of DynamicIR instructionsACE nodes Modelling time (s)hotspot 954,920 1,102,265 14,400pathfinder 839,163 967,836 18,000mm 464,438 597,604 3,987particlefilter 352,866 479,994 3,956nw 376,022 453,998 3,800lulesh 322,738 319,253 953bfs 274,170 269,019 900lud 75,543 93,089 205srad 72,041 91,385 172lavaMD 17,814 16,779 30111 360 355 25 32 4 143 92 156 933876 17640 14045 180 140 26 3657 808 3800 8600%20%40%60%80%100%Percentage of total modelling timeBuidling ACE graph Crash and propagation modellingFigure 3.10: Breakdown of execution time between graph construction (bottom bar)and running the crash and propagation models (top bar). Labels on bars presentabsolute time in seconds.and propagation models.We discuss scalability in detail in Section §3.6. Here we propose an optimiza-tion to reduce the time to compute ePVF, based on sampling the ACE graph. Thisapproach is based on the intuition that many HPC applications consist of repetitiveprogram states and patterns, and hence a small sample of the ACE graph will berepresentative of the overall application behaviour. Since a dynamic instructiontrace preserves the temporal ordering of the instructions executed by the program,the output nodes in the ACE graph can be ordered based on their presence in the480%15%30%45%predicted ePVF computed ePVFFigure 3.11: The predicted ePVF value based on sampling only 10% of the ACEgraph and ePVF computed based on the entire graph are close.trace. To validate if the sampling works, we pick the first p% of the output nodes,and based on the resulting partial ACE graph we estimate ePVF. For regular ap-plications, we find that there is a strong linear relationship and we can linearlyextrapolate the partial ePVF to the entire application and thus estimate the overallePVF accurately.Figure 3.11 shows the extrapolated ePVF values based on analyzing only 10%of the ACE graph. As can be observed, for most benchmarks, the extrapolatedePVF values are a good approximation for the overall ePVF: on average the erroris less than 1%, suggesting that these programs exhibit repetitive behaviors as weexpected.Importantly, we can also estimate whether an application displays repetitivebehaviours and thus whether the ACE-graph sampling be useful, without complet-ing the full ACE analysis. To demonstrate this, we randomly select multiple smallsub-sample of the ACE graph nodes (each 1%) and compute for each of them theePVF estimates. The normalized variance is relatively low for benchmarks withrepetitive behaviours (e.g., 0.6 for lavaMD and 0.04 for particlefilter), but high forapplications where the ACE-graph sampling technique does not offer high accu-racy (e.g., 1.9 for lud).3.5 Case study: selective duplicationTo demonstrate the practical usability of the ePVF methodology to improve appli-cation resilience, we use ePVF to guide a selective instruction duplication tech-49nique to protect against SDCs. The intuition is that a technique that prioritizesprotecting instructions with high ePVF values will offer good SDC protection asthe faults occurring in crashing bits are unlikely to lead to SDCs. To establish abaseline, we compare the SDC rate of a program protected by duplicating the highePVF instructions, with that protected by duplicating the hot paths of the program.Prior studies [74, 103] have shown that protecting hot paths is an effective tech-nique (i.e., instructions on the top 20% of most executed paths are responsible formost of the SDCs - these constitute the hot paths).We also attempted to use PVF to drive the choice of instructions to duplicate.However, we found that the PVF values of most instructions are clustered around 1,which means that PVF has little discriminative power to inform the choice of whichinstructions to protect. As an example, we plot the CDF (Cumulative DistributionFunction) of the PVF and ePVF values of every instruction for two benchmarkprograms, namely nw and lud in Figure 3.12. As can be seen in the figure, the CDFfor PVF has a sharp spike near 1, while the ePVF values are distributed more evenlythroughout the range. Therefore, we did not consider PVF-informed duplication asa comparison point in this study.To make the comparison fair, we control the performance overhead incurred byboth techniques we compare (by controlling the number of instructions we protectand measuring execution time). Our hypothesis is that for a given performanceoverhead bound, the ePVF based duplication scheme can offer higher SDC cov-erage than hot-path duplication. A full-duplication technique (i.e., duplicationof every instruction) would offer 100% detection coverage, but incur significantperformance overheads [60, 103]. Hence we do not consider full-duplication tech-nique in this work.An ePVF-informed protection heuristic We first compute the ePVF valueof each dynamic instruction using the equation 3.3. Then, we compute the ePVFvalue of all static instructions in the program by averaging the ePVF values of alltheir dynamic instances, and rank the static instructions in descending order of theirePVF values. We then select the static instruction at the top of the list, and extractits backward slice. Finally, we selectively duplicate the instructions in the slice,and insert a comparison of the duplicated value with the original value followingthe chosen instruction. Because we need to limit protection within the overhead50ePVF or PVF values of nwEmpirical CDF0. or PVF values of ludEmpirical CDF0. 3.12: The figure presents the CDF for the ePVF and PVF values of registersused in each instruction of nw (left) and lud (right) benchmarks. PVF valuesfor most instructions are clustered around 1 and thus can not inform protectionmechanisms based on instruction level protection.budget, we measure the performance overhead incurred by duplication. If the per-formance overhead bound is not exceeded, we choose the next instruction on thelist and repeat the procedure. Thus this is a greedy algorithm for choosing in-structions to duplicate. We use the same process for hot-path instructions, with thedifference that the instructions are ranked in a decreasing order of their executionfrequencies instead of their ePVF values.ePVFinst =∑register in inst (ACEBits - CrashBits)Total bits in inst(3.3)Evaluation Methodology. We evaluate the coverage of the above schemesthrough fault injection experiments. We only consider the five benchmarks whoseSDC rates were higher than 10% in Figure 3.9 (i.e. mm, pathfinder, hotspot, ludand nw) so as to discriminate the effects of the two schemes better. Further, we runthe fault injection campaigns with different inputs than the ones we used to get theePVF values (these are much larger in size) to get stable performance numbers.Evaluation Results. Figure 3.13 shows the SDC rate of the original application(no protection), the SDC rate when using hot-path protection, and the SDC rate510%5%10%15%20%25%30%mm lud nw pathfinder hotspot geomeanSDC	  rate SDC	  rate	  when	  using	  hotpath	  based	  protection SDC	  rate	  when	  using	  ePVF	  based	  protectionFigure 3.13: SDC rates for the original application (left bars) and when using hotpath(center) and ePVF-based protection (right) for an overhead bound of 24%.Error bars present 95% confidence intervals.when using ePVF-informed protection, given a performance overhead bound of24%4. Overall, we find that ePVF based protection does better than hot-path basedprotection, and reduces the SDC rate from 20% to 7% (geometric mean), while hot-path based protection reduces it to about 10%. Thus, ePVF based protection does30% better than hot-path based protection, on average across benchmarks, showingthat it has better discriminative power than execution frequencies for protection.Further, we find that ePVF-based protection outperforms hot-path based protectionfor all benchmarks except hotspot. This is due to the presence of many control-flowstructures in hotspot all of which are marked as sensitive by ePVF though they donot cause SDCs.3.6 DiscussionA. Scalability is an important issue as most applications will likely generate ACEgraphs with billions of vertices. The ACE-graph sampling technique we describein §3.4.5 offers a significant speedup for applications that contain repetitive pat-terns. We believe that scaling to handle larger applications is a matter of goodengineering and not a fundamental barrier for the following reasons. First, the cur-rent ePVF infrastructure (including building/processing the DDG) is implementedin Python. A tuned C/C++ implementation will likely be orders of magnitude fasterand consume less memory. Second, the most time-consuming portion of the ePVF4We considered performance overheads of 8%, 16% and 24% as well. We report here only theresults using 24% overhead due to space constraints - the qualitative results were similar all cases.52analysis is running the crash and propagation models. These start from each load/-store individually, and search along their backward slices. This process is triviallyparallelizable (threads can be assigned to one backward slice each with minimumcoordination required). Additionally the work allocated for each thread (i.e., thesize of the backward slice) scales sub-linearly with the size of the graph. Third,if the DDG does not fit in memory, it can be partitioned to support the parallelbackward slice exploration suggested above.B. Sources of Inaccuracy. §3.4.3 shows that ePVF is a much closer upperbound than PVF for the SDC rate of an application. However, ePVF still overes-timates the SDC rate, in some cases by a significant amount. This overestimate isgenerated mainly by the following three factors:1. Lucky loads: ePVF assumes that any fault that causes a load to deviatefrom its intended source address (but still stays within the bounds of the program’sallocated memory) will lead to an SDC. However, as prior work has found [38],this is not always true. For example, the value loaded from the incorrect addressmay still be correct, and hence have no effect on the program. The likelihood ofthis case increases if the value loaded is 0, as memory typically has large areasinitialized to zeroes [38].2. Y-branches: Y-branches are branches that do not affect the outcome of theapplication even when the program executes the wrong part of a branch due to afault [142]. The ePVF analysis assumes that all branches lead to SDCs if flipped.However, only about 20% of branch flips lead to SDCs in practice, as prior workhas found [142].3. Application-specific correctness checks: Similar to PVF, the ePVF model,considers as ACE bits all bits that lead to visible changes to the application output.Some of these faults, however, may be characterized as benign by application-specific correctness checks (e.g., based on precision thresholds for floating-pointcomputations).C. Conservativeness: While ePVF may overestimate the SDC rate, it willnever underestimate it (barring the case below). This is because ePVF conserva-tively labels every non-crash causing operation as potentially leading to an SDC.Being conservative is important as it can drive decisions about how much state toprotect in the worst-case for the application. However, in Section 3.4.2, we found53that our implementation of the ePVF methodology may yield false positives i.e., itmay identify a failure as a crash when in fact it is an SDC. This occurs because ofdifferences between the program’s memory structures in the golden run (on whichthe ePVF analysis is based) and the fault injected runs. However, the differencesare very small in practice (at most 8% in our experiments).3.7 SummaryThis chapter presents ePVF, a methodology that extends the PVF analysis to distin-guish crash-causing bits from the ACE bits for obtaining a tighter bound on SDCrate. With ePVF, we demonstrate that one can predict the error resilience char-acteristics of an application with ”good enough" accuracy at a low cost and usesuch information to guide the selective duplication to detect SDCs, which answersthe RQ1. Our methodology consists of two models: (1) a propagation model topredict the dependent bits of memory address calculations based on a range prop-agation analysis; and (2) a crash model to predict the platform-specific behaviourof program crashes. We implement the ePVF methodology in the LLVM com-piler. The results show that ePVF can predict crashes with high confidence (89%recall and 92% precision on average). Further, ePVF significantly lowers the upperbound of the estimate of the SDC rate of a program (61% on average) compared tothe original PVF. Finally, we present a use-case study with this methodology: anePVF-informed selective duplication technique, which leads to 30% lower SDCsthan hot-path instruction duplication.54Chapter 4LetGo: A LightweightContinuous Framework for HPCApplications Under FailuresRequirements for reliability, low power consumption, and performance place com-plex and conflicting demands on the design of high performance computing (HPC)systems. Checkpoint/restart (C/R) schemes are the most commonly-used fault-tolerance techniques to protect HPC applications against hardware faults. Suchfault tolerance techniques, however, have non-negligible overheads particularlywhen the fault rate exposed by the hardware is high: it is estimated that in to-day’s HPC systems, 20% or more of the computational cycles have been used forproviding fault tolerance [18].To mitigate the overall overhead of C/R systems, we propose the roll-forwardC/R system to work for comprehensive types of errors. In particular, in this chap-ter, we present LetGo, an approach that attempts to continue the execution of aHPC application when crashes would otherwise occur due to the errors affectingthe computational components, which answers the first part of the RQ2. Our hy-pothesis is that many classes of HPC applications have good enough intrinsic faulttolerance so that it is possible to re-purpose the default mechanism that terminatesan application once a crash-causing error is signaled. Instead, LetGo repairs thecorrupted application state and continues the application execution. This chapter55explores this hypothesis, and quantifies the impact of using this observation in thecontext of checkpoint/restart (C/R) mechanisms.4.1 IntroductionApplications crash on transient hardware faults (i.e., bit flips) due to the runtimesystem detecting the error and terminating the application, thereby losing the ap-plication’s work. Checkpoint/restart (C/R) is one of the most popular methodsto recover from such faults [54, 138, 139] by loading a previously saved inter-mediate state of the application (i.e., a checkpoint), and restarting the execution.While useful, checkpoint/restart techniques incur high overheads in terms of per-formance, energy and memory, which will be exacerbated as the failure rate in-creases [42, 151].This chapter proposes LetGo, which upon detecting an impending crash, at-tempts to repair the application state to enable it to continue its execution (insteadof recovering from a checkpoint). LetGo is based on three observations. First, alarge class of HPC applications are, intrinsically, resilient to localized numericalperturbations as they require computation results to converge over time. As a re-sult, they are able to mask some data corruptions. For example, Casas et al. [28]show that the algebraic multi-grid (AMG) solver, which is based on iterative meth-ods, has high intrinsic resiliency. Second, a certain number of classes of HPCapplications have application-specific acceptance checks (e.g., based on energyconservation laws). These checks can be used to filter out obvious deviations inthe application’s output, and reduce the probability of producing incorrect results.For example, High Performance Linpack (HPL) solves a linear system using LUdecomposition [119] and tests the correctness of the result by checking the resid-ual of the linear system as a norm-wise backward error [77, 106]. Third, mostcrash-causing errors lead to program crashes within a small number of dynamicinstructions, and are hence unlikely to propagate to a large part of the applicationstate [95, 121]. Therefore, the impact of crash-causing faults is likely to be con-fined to a small portion of the application’s state, thus allowing recovery.Taken together, these observations offer an optimistic hypothesis that it may bepossible to re-purpose the default mechanism that terminates an application once56a crash-causing error is signalled, and attempt to repair the corrupted applicationstate and continue the application execution. This paper explores this hypothesis,proposes heuristics to repair the application state, and quantifies the impact of usingthis observation in the context of C/R mechanisms.To enable this exploration we design and implement LetGo. LetGo works bymonitoring the application at runtime; when a crash-causing error occurs, LetGointercepts the hardware exception (e.g., segmentation fault), and does not pass theexception on to the application. Instead, it advances the program counter of the ap-plication to the next instruction, bypassing the crash-causing instruction. Further,LetGo employs various heuristics to adjust the state of the application’s register fileto hide the effects of the ignored instruction and ensure, to the extent possible, thatthe application state is not corrupted.Figure 4.1 illustrates how LetGo can be used in the context of a checkpoint/restart(C/R) scheme. As shown in Figure 4.1a and 4.1b, the default action of a C/Rscheme on fail-stop failures is to rewind to the last checkpoint. LetGo allows theHPC run-time to continue the execution of an application once a crash-causing er-ror occurs (Figure 4.1c) and later use application-level correctness test to detectpossible state corruption. If the application passes these checks, LetGo assumesthat intermediate/final states of an application are correct, and hence no recoveryis needed. This reduces checkpoint overheads in two ways: first, LetGo avoidsthe overhead of restarting from a previous checkpoint upon the occurrence of acrash-causing error; second, since crashes are less frequent, checkpoints can betaken less frequently as well (or not at all if the developer is prepared to accept therisk of unrecoverable failures). The potential cost of LetGo is an increased rate ofSilent Data Corruption (SDC) leading to incorrect results. We argue that this maybe acceptable for two reasons: first, our experiments indicate that this increaseis low (the resulting SDC rate is in the same range as the SDC rate of the origi-nal application), and, second, since the possibility of undetected incorrect resultsexists even with the original application (i.e., without using LetGo), applicationusers independently need to develop efficient techniques to increase confidence inthe application results. By leveraging these application checks, LetGo reduces thechances of an error causing a SDC. To the best of our knowledge, LetGo is the firstsystem that applies the idea of tolerating errors by repairing application state in57checkpoint intervalnext intervalCPVa A standard checkpoint interval without LetGo.CPVcheckpoint intervalnext intervalb If a crash occurs, the HPC run-time loads the lastcheckpoint. In existing solutions a crash occurs each time acrash-causing error is signalled.CPVcheckpoint intervalnext intervalLc LetGo continues the execution of the applicationwhen a crash-causing error occurs, and the HPC run-timedoes not load the checkpoint.Figure 4.1: Illustration of how LetGo changes the behavior of an HPC applicationthat uses checkpointing by continuing the execution when a crash-causing erroroccurs. Axes indicate time. The labels used for time intervals: CP - checkpoint;V - application acceptance check/verification; L - LetGo framework, lightningbolt: crash-causing errorthe context of C/R in HPC applications.This chapter makes the following contributions:• We propose a methodology to reduce the overhead of C/R techniques forHPC applications by resuming the execution of an application upon the oc-currence of a crash-causing error without going back to the last checkpoint.• We design LetGo, a light-weight run-time system that consists of two maincomponents: a monitor that intercepts and handles operating system signalsgenerated when a crash-causing error occurs, and a modifier that employs58heuristics to adjust program state to increase the probability of successfulapplication continuation (Section 4.3.1). Importantly, LetGo requires nei-ther modifications to the application, nor the availability of the application’ssource code for analysis (Section 4.3.3 and Section 4.3.2). therefore, it ispractical to deploy in today’s HPC context.• We evaluate LetGo through fault-injection experiments using five DoE mini-applications. We find that LetGo is able to continue the application’s execu-tion in about 62% of the cases when it encounters a crash-causing error (forthe remaining 38%, it gives up and the application can be restarted from acheckpoint as before). The increase in the SDC rate (undetected incorrectoutput) is low: 0.913% arithmetic mean. (Section 4.5.1).• Finally, we evaluate the end-to-end impact of LetGo in the context of a C/Rscheme and its sensitivity to a wide range of parameters. We find that LetGooffers significant efficiency gains (1.01x to 1.20x) in the ratio between thetime spent for useful work and the total time cost, compared to the standardC/R scheme, across a wide range of parameters.Our evaluation shows that, on average, LetGo is able to continue to completion62% of the crashes while increasing the overall application SDC rates from 0.75%to 1.6%. This highlights a key contribution of LetGo: it creates the opportunity totrade off confidence in results for efficiency (time to solution). Certainly, for someapplications - or for some operational situations - confidence in results is the user’sprimary concern, and LetGo will not be used. We believe, however, that there aremany situations that make LetGo attractive: Firstly, since Silent Data Corruptions(SCD) can occur anyways (due to bit-flips even when LetGo is not used), usersof HPC applications are already taking the risk of getting incorrect results, andhave developed techniques to validate their results. Application-specific checks todiminish this risk are an active area of research [57, 60, 76, 102] and LetGo willbenefit from all these efforts. Secondly, for some applications LetGo performsextremely well (e.g., for CLAMR and SANP all faults that would lead to crashescan be elided by LetGo, without resulting in any additional SDCs). In these cases,LetGo certainly represents an appealing solution. Finally, note that it is trivial to59collect information on whether a run has benefited from LetGo repair heuristics andthus offers users additional information base on which to reason about confidence.4.2 BackgroundContext: The effectiveness of LetGo is influenced by two factors: (1) the application-level acceptance checks that detect whether the application state is corrupted beforedelivering results to users; (2) the resilience characteristics of the HPC applicationmaking it able to withstand minor numerical perturbations. This section arguesthat a large class of HPC applications present these characteristics, and offers anexample that illustrate how these two factors affect application fault-tolerance.Factor 1: Application acceptance checks. Since the rate of hardware faults isexpected to increase and applications become increasingly complex (and, as a re-sult, the design and implementation process is error-prone), there is an increasedawareness for the need of result acceptance tests, to boost the confidence in theresults offered by HPC applications. Result acceptance checks are usually writ-ten by application developers to ensure that computation results do not violateapplication-specific properties, such as energy conservation or numeric tolerancefor result approximation. These acceptance checks are typically placed at the endof the computation (i.e. the residual check performed in HPL application [119]),but they can be also placed during application execution to detect earlier possiblestate corruption such as [114].Factor 2: Fault masking in HPC applications. A large class of HPC appli-cations are based on iterative processes (For example, stencil computations itera-tively compute physical properties at time T+1 based on values at time T; iterativesolvers work by improving the accuracy of the solution at each step. For an itera-tive method that is convergent, numerical errors introduced by a hardware fault canbe eliminated during this convergence process (although it may take longer). Priorstudies such as [28] show that the algebraic multi-grid solver always masks errorsif it is not terminated by a crash.604.3 System designOur goal is to demonstrate the feasibility and evaluate the potential impact of arun-time framework that allows the program to avoid termination and correct itsstate after a crash-causing error occurs. The four main requirements of LetGo are:a) Transparency: LetGo should be able to transparently track the system be-havior, monitor for crash-causing errors, and modify the application state toenable application continuation once a crash-causing error occurs, all with-out modifying the application’s code (R1).b) Convenience: As HPC applications tend to be conservative and sensitiveto the computation environment, LetGo should not make any assumptionabout the application’s compilation level or require changes the application’scompilation process (R2).c) Low overhead: To be attractive for deployment in production systems, LetGoshould incur minimum overheads in terms of performance, energy and mem-ory footprint (R3).d) A low rate of newly introduced failures: LetGo inherently trades the abilityto continue application execution for the risk of introducing new failures.For LetGo to be practical, the increase in the rate of undetected incorrectresults should be low (R4).The rest of this section describes LetGo design in detail and, shows how LetGosatisfies the above requirements.4.3.1 Overall designLetGo is activated when a crash-causing error occurs. LetGo detects the exceptionsraised by the OS, intercepts the OS signals, and modifies the default behavior of theapplication for these signals. Then it diagnoses which states of the program havebeen corrupted, and modifies the application state to ensure, to the extent possible,application continuation. Figures 4.2 and 4.3 show this process.LetGo contains two components: the monitor and the modifier.61LetGoApplicationOSProcess statesProcess executionSignal HandlingMonitorModifierFigure 4.2: LetGo architecture overview.Application OSLetGo1. configure signal handling2. send signals when exceptions3. switch to LetGo framework4. move program counter,apply heuristics to increase the probability of success 5. switch back to continue Figure 4.3: A sequence diagram highlighting LetGo use: the step 1-5 describe theinteractions between LetGo, the application, and the operating system: LetGostarts by installing the monitor - i.e., configuring the the signal handling, andlaunches the application in a debugger (step 1). If the application encountersa signal, LetGo detects it (step 2) and takes the control of the application (step3). To avoid the failure, LetGo increments the program counter (i.e., instructionpointer) of the application and adjusts the necessary program states (step 4).After the modification, LetGo lets the application continue without any furtherinterference (step 5).The monitor is attached to the application at startup. It changes the defaultbehavior of the application from termination to pausing when operating systemsignals such as SIGSEGV and SIGBUS are received. The monitor intercepts these62signals and hands the control over to the modifier.The modifier kicks in when executing the current application instruction wouldlead to an application termination (crash). The modifier attempts to avoid the crash:it advances the program counter (i.e., instruction pointer) to the next instruction,and inspects and modifies application state (e.g., the stack pointer) to increase theprobability of a successful continued execution. The details of the modifier arediscussed in Section 4.3.2. Note that the application might still not be able to finishsuccessfully and it may crash again - if so, LetGo does not intervene and allowsthe application to terminate.4.3.2 HeuristicsWe describe the modifications that the LetGo modifier makes to the applicationstate (in Step 4 of Figure 4.3). These modifications have two goals: first, in-crease the likelihood that, once the application continues execution, it does notcrash again; and, second, reduce the chance that data corruption propagates fur-ther.There are two issues to deal with: first, advancing the program counter maybypass memory loads or stores, and hence the destination register that is supposedto hold the value from the memory load (or the memory location used for store)may contain an incorrect value , which may cause subsequent errors in case thisregister is later used; and, second, if the fault has corrupted the stack pointer regis-ter sp (i.e., rsp in X86-64) or the base pointer register bp (i.e., rbp in X86-64), andthe application continues due to LetGo, the likelihood of receiving another systemexception due to a memory-related violation is high because sp and bp are repeat-edly used. To mitigate these challenges, LetGo employs two heuristics (to satisfyR4).Heuristic I - This heuristic deals with memory load/store instructions. If theprogram crashes due to the error in a memory-load instruction, LetGo feeds theto-be-written register(s) (which holds the data loaded from the memory) with a“fake" value(s). In practice, 0 is chosen as the value to feed to the register. Wechoose 0 by default because the memory often contains a lot of 0s as initializationdata [39]. For the case where the program stops at a memory-store instruction,63the value in that memory location remains the same because the memory-storeoperation is not successful. In this case, we do nothing - our empirical experiencesuggests that this is a more practical decision than assigning a random value. Inthe future, this heuristic can be combined with run-time analysis for more realisticand application-dependent behaviour.Heuristic II - As discussed above, if a fault affects the values in the stack pointerregister or the base pointer register, the corrupted registers may cause consecutivememory access violations. Since LetGo avoids performing run-time tracking, de-termining the correctness of the values in sp and bp statically becomes challenging.To overcome this challenge, LetGo implements the following heuristic that includea detection and a correction phase:1. Detection: for each function, the difference between the values in sp andbp can be approximately bound in a range via static analysis, hence thisrange can be calculated with minimum effort and can be used to indicate thecorruption in sp or bp at run-time.2. Correction:: since sp and bp usually hold the same value at the beginningof each function, one can be used to correct the error in the other one ifnecessary.We explain the intuition behind this heuristic with two observations based onthe code in Listing 4.1. First, bp is normally pointed to the top of the stack (line2), hence sp and bp usually carry the same value at the beginning of every functioncall. Second, based on the size of the memory allocated on the stack (line 3), therange of bp can be inferred as sp < bp < sp+ 0x290 ( bp is always greater thansp because the stack grows downwards. Therefore, when the program receivesan exception and stops at an instruction that involves stack operation, LetGo runsthe following steps: First, it gets the size of the allocated memory by searchingfor the beginning of the function that the instruction belongs to and then locatingthe instruction that shows how much memory the function needs on the stack (byanalyzing the assembly code). Second, it calculates the valid range based on thesize and checks if the bp is in it, and, finally, if the range constraint is invalid,LetGo copies the value of the sp to the bp (or vice versa depending on which oneis used in the instruction causing the crash).64Listing 4.1: Example of a common sequence of X86 instructions at the beginning ofa functionpush \%rbpmov \%rsp ,\% rbpsub \ $0x290 , \%rspFor the rest of this paper, we refer to the version of LetGo that applies theseheuristics as LetGo-E(nhanced) and the version without heuristics as LetGo-B(asic).We evaluate the effectiveness of LetGo-B and LetGo-E in Section ImplementationWe implement the LetGo prototype with three production-level tools that are widelyadopted and readily available on HPC systems: gdb, PIN [118] and pexpect [130].gdb: LetGo relies on gdb to control the application’s execution. gdb providesthe interfaces to handle operating system signals and to change the values in theprogram registers. We describe these two aspects in turn. LetGo uses gdb to rede-fine the behaviour of an application against OS signals as described in Table 4.1.Since most of application crashes are due to memory-related errors such as segmen-tation faults or bus errors [57, 68], LetGo currently supports three signals relatedto memory errors: SIGSEGV, SIGBUS and SIGABRT, and can be easily extendedfor more signals if needed. (Satisfying R1).Note that the LetGo use of gdb does not require any source-code level anal-ysis (or changes to the application). Applications therefore do not need to run inthe debug mode, which inhibits code optimization and often results in significantperformance degradation (satisfying R2 and R3). Applications can run with LetGofor any optimization/compilation requirement levels they need. We evaluate thegenerated overhead in Section 4.4.b) PIN: It is a tool that supports dynamic instrumentation of programs. PIN caninsert arbitrary code at arbitrary locations of an executable during its execution.LetGo uses PIN to conduct instruction-level analysis, such as obtaining the nextPC, parsing an instruction and finding the size of allocated memory on the stack.Since LetGo only needs the static information of a program, there is no need forLetGo to keep track of dynamic program states and only dissembler inside PIN isneeded. Therefore, LetGo incurs minimum performance overhead (Satisfying R3).65Table 4.1: gdb signal handling information redefined by LetGo. ’Stop’ means theprogram will stop upon a signal, and ’Pass to program’ means this signal will notbe passed to the programSignal Stop Pass to program DescriptionSIGSEGV Yes No SegfaultSIGBUS Yes No Bus errorSIGABRT Yes No AbortedIt is possible to use other lightweight tools for parsing instructions instead of PIN.c) pexpect: expect [99] is a tool that automates interactive applications (e.g.telnet, ftp, etc.) and it is widely for testing. LetGo uses pexpect, the Python exten-sion of expect to automate all interactions between LetGo and the application: e.g.,configuring signal handlers and updating register values. Since these are relativelyrarely executed operations, the overall performance impact is small.All the interactions between a gdb process and the target application are au-tomated via pexpect, and confined to a limited number of gdb commands such as“print" or “set". When heuristics need to be applied, LetGo relies on PIN to analyzethe program and feed the result to gdb. As a prototype, the current implementationof LetGo is used to support the experimentation, to demonstrate the ability of au-tomation, and to investigate the overheads incurred - for a production version, onecan directly and efficiently implement the functionality offered by each of thesetools, so the overhead estimates we offer are conservative.4.4 Evaluation methodologyThis section focuses on evaluating the ability of LetGo to transform crashes intosuccessful application runs To this end, this section first describes the fault modeland the fault injection methodology we use, then explains how the various failureoutcome categories are impacted by LetGo, and proposes metrics to quantitativelyevaluate LetGo effectiveness. Using this information, the next section evaluatesLetGo impact on reducing C/R overheads.664.4.1 Fault modelAs introduced earlier, in this work, we consider faults occurring in the computa-tional units of processors, such as the ALUs, pipeline latches and register files. Ourmethodology is agnostic to whether a fault arises in the register file or is propagatedto the registers from elsewhere. We use the single-bit-flip model as it is the mostcommon transient fault model in today’s systems [133]. We also assume that atmost one fault occurs in an application run leading to a crash-causing error, as softerrors are relatively rare compared to typical application execution times.4.4.2 Categories of fault outcomesThe traditional outcomes of a fault affecting an application can be categorized ascrashes, detected by the application acceptance check, hangs, SDCs, and benignoutcomes. When applying LetGo, (some of the) crash outcomes are transferred toother categories, thus, to evaluate LetGo we further categorize the outcomes thatcorrespond to a crash in a non-LetGo context in multiple new classes as presentedin Figure 4.4.At the top level of our taxonomy (Figure 4.4), a fault either causes a programto crash, or not. In Figure 4.4 we label these two classes - Finished and Crash.1. A finished run can result further in two outcomes: the program contains er-rors in the output that are detected by the application’s acceptance checks(labeled as Detected), or the output of the program passes those checks (la-beled as Pass check). If the output passes the check, it may differ from thegolden run, in which case we consider it an SDC 1; or the output matches thegolden run, labeled as Benign.2. A Crash run is where LetGo has impact. When LetGo is deployed, it mayfail to continue the application and lead to a second crash (labeled as Dou-bleCrash) or make the application to finish successfully labeled C-Finished.In this case, the program may have similar outcomes as in the Finished case(when no crash occurs) - we label these as C-Benign (no observable outcome1This is a conservative assumption as we do not know how the results of the application are used.The application output also includes the application data that is compared between such data fromthe golden run, as defined in Table 4.267Fault InjectionFinished CrashDetected Pass checkC-DetectedDouble crashC-FinishedC-Pass checkBenign SDCC-Benign C-SDCFigure 4.4: Classification of the fault injection outcomes. LetGo has impact only onthe right side of the tree above as it attempts to avoid a crash outcome.of the fault), C-Detected (incorrect output detected by application acceptancechecks), and C-SDC (incorrect output not detected by those checks, but dif-ferent from the golden run). Compared to a situation when LetGo is not used,LetGo is able to transfer some of the crash outcomes to C-Benign, C-SDC,or C-Detected outcome.4.4.3 Metrics of effectivenessThe effectiveness of LetGo can be estimated by answering questions like “Howmany crashes can be converted to continued execution?", “what is the likelihoodof producing correct results after continuation?", “How often the application checkcan catch the errors after continuation?" and “How many incorrect results will beproduced after continuation?". To this end, we define four metrics to quantitativelyevaluate LetGo effectiveness:68Table 4.2: Benchmark description. The last two columns present which data is usedfor bit-wise comparison to determine SDCs (undetected incorrect results), and,respectively describe the result acceptance check used by each application. Allbenchmarks are compiled with g++ 4.93 using O3 except for SNAP, which is aFORTRAN program.Application Domian# dynamicinstructions(billions)Applicationdata used tocheck for SDCsCriteria used inapplication acceptance checkLULESH Hydrodynamics 1.0 MeshNumber of iterations: exactly the sameFinal origin energy: correct to at least 6 digitsMeasures of symmetry: smaller than 10−8CLAMRAdaptive meshrefinement2.8 Mesh Threshold for the mass change per iterationHPLDenselinear solver1.2SolutionvectorResidual check on the solution vectorCOMDClassical moleculardynamics5.1Each atom’spropertyEnergy conservationSNAPDiscrete ordinatestransport1.6 Flux solution The flux solution output should be symmetricPENNANTUnstructuredmesh physics1.7 Mesh Energy conservationContinuability is the likelihood that LetGo is able to convert a crashing programinto the program that would finish (regardless of the correctness of the output).Continuability =C-Pass check+C-DetectedCrash(4.1)Continued_detected is the likelihood that the application acceptance checkcatches errors in the application (if any) after continuation.Continued_detected =C-DetectedCrash(4.2)Continued_correct is the likelihood that the programs result in the correct out-put after continuation.Continued_correct =C-BenignCrash(4.3)Continued_SDC indicates the likelihood that application finished but results inSDCs - undetected incorrect results.Continued_SDC =C-SDCCrash(4.4)Note that the Continuability is the sum of Continued_detected, Continued_correctand Continued_SDC metrics. All four values range from 0 to 1.69For an application to benefit from LetGo, it needs to satisfy the following prop-erties: first, there is a low probability that the key program states are not affectedby the failure (indicated by high Continuability), and second, there is a high prob-ability that the program states adjusted by LetGo converge to the original path(indicated by the high Continued_correct and low Continued_SDC).4.4.4 Fault injection methodologyTo evaluate LetGo, we implement a software-based fault injection tool based ongdb and PIN 2. Our injector does not require the application’s source code, nordoes it need the application to be compiled using special compilers or compilationflags.The fault injection experiments for an application consists of two phases: wefirst perform a one-time profiling phase to count the number of dynamic instruc-tions using the PIN tool. As we assume all dynamic instructions have equal like-lihood of being affected by a fault, we use the number of total instructions to ran-domly choose an instruction to inject a fault for each fault injection run. Duringthis phase, we also profile the number of times each static instruction in the pro-gram is executed during the profiling phase so at to be able to inject a fault at theappropriate dynamic instruction. For example, if we choose to inject into the 5thdynamic instance of an instruction, we need to skip the first 4 instances when thebreakpoint is reached (using the continue command of gdb).During the fault injection phase, we then use gdb to set a breakpoint at therandomly-chosen dynamic instruction and inject a fault at the instruction by flip-ping a single bit in its destination register (after the instruction completes). Thisemulates the effect of a fault in the computational units involved in the instruction.The profiling phase is run once per application and is relatively slow. Theinjection phase, on the other hand, is executed tens of thousands of times, and ismuch faster as it does not involve running the application inside a virtual machineas PIN does. We perform a total of 20,000 fault injections per application, one perrun, to obtain tight error bounds of 0.1% to 0.2% at the 95% confidence interval.2The tool is publicly available at https://github.com/flyree/pb_interceptor70Table 4.3: Fault injection results for five iterative benchmarks when using LetGo-E.The value for each outcome category is normalized using the total number offault injection runs for the application. Error bars range from 0.1% to 0.2% at the95% confidence level.BenchmarkFinished CrashDetected Pass check Double crash C-Detected C-Pass checkBenign SDC C-Benign C-SDCLULESH 0.90% 22.00% 0.13% 25.00% 2.30% 49.50% 0.17%CLAMR 0.50% 33.30% 0.50% 25.00% 1.10% 39.60% 0.00%SNAP 0.02% 43.94% 0.01% 20.77% 0.06% 35.20% 0.00%COMD 1.00% 55.00% 1.10% 18.32% 0.85% 22.13% 1.60%PENNANT 1.00% 50.00% 2.00% 19.00% 2.50% 22.70% 2.80%AVERAGE 0.68% 40.85% 0.75% 21.62% 1.36% 34.02% 0.91%4.4.5 BenchmarksWe use six HPC mini-applications namely LULESH [83], CLAMR [114], HPL [119],SNAP [90], PENNANT [89] and COMD [35] (details in Table 4.2). Note that thesebenchmarks meet the assumption that application-level acceptance checks are welldefined/implemented. All benchmarks exhibit convergence-based iterative compu-tation patterns except for HPL, which is implemented with a direct method3 [47].Therefore, we separate the results of HPL from others and discuss them in Sec-tion 4.7.Table 4.2 briefly describes the acceptance checks for each benchmark. ForCLAMR, HPL, PENNANT, we use the built-in acceptance checks (written by thedevelopers), while for LULESH, COMD and SNAP, we wrote the checks ourselvesbased on their verification specifications: Section 6.2 in [82] for LULESH, “Verifi-cation correctness" section in [36] for COMD and ”Verification of Results" sectionin [91] for SNAP.4.5 Experimental resultsThis section presents experiments that aim to understand whether LetGo is indeedable to continue application execution when a crash-causing error occurs with min-imal impact on application correctness and efficiency. The next section evaluatesthe impact of LetGo in the context of an C/R mechanism.3A direct method computes the exact answers after a finite number of steps (in the absence ofroundoff)714.5.1 Effectiveness of LetGoWe run the fault injection experiments for both LetGo-B (the basic version thatuses minimal repair heuristics) and LetGo-E (the version that uses the advancedheuristics described in Section 4.3). Table 4.3 shows the fault injection result forthe five benchmarks that use iterative, convergence-based solutions when usingLetGo-E. We discuss HPL, a direct method, separately in the discussion section.We note the following: First, the the average crash rate over all applications is56%, showing that more than half of the time when a fault occurs the applicationwill crash (i.e., in the table this shows as the sum of values in the four columns un-der the “Crash" category). Second, with LetGo-E, on average, 62% of these crashescan be transformed to continue running the application to termination (only 38%are double crashes). We first discuss the results for LetGo-E, and then comparethese results with those for LetGo-B to understand the effectiveness of the heuris-tics introduced by LetGo-E. We observe the following:1. The ability of LetGo-E to enable continued execution when facing acrash-causing error: The mean continuability for the benchmark set is62%, which indicates that 62% of the time when the benchmark programreceives a crashing signal, LetGo-E resumes the execution and the applica-tion completes successfully without crashing again.2. LetGo-E is able to convert more than half of the crashes to produce cor-rect results (and thus possibly offer a solution to lower checkpoint overheadsfor a long-running applications).3. Low rate of undetected incorrect results. The rate of Continued_SDCcases for all benchmarks is on average in the same range as the SDC rate ofthe unmodified application. For CLAMR and SNAP, we do not observe newSDCs after applying LetGo-E. Overall, LetGo-E maintains the low SDC rateof the original application (yet it doubles it only 1.6% of the cases did theprogram produce incorrect results after continuing it with LetGo-E, com-pared with 0.75% when not using LetGo). We further discuss the impact ofthe increased SDC rates, and techniques to mitigate it, in Section 4.7.724. Continued_detected of the application-level acceptance checks. The Con-tinued_detected of LetGo-E across the five benchmarks is 2.4%: for ourbenchmarks, after LetGo-E continues the execution, the application accep-tance checks would detect the errors 2.4% of the time - this is slightly higherthan the case without LetGo-E.Thus, we find that LetGo-E has a high likelihood to convert crashes into eitherbenign or detected states, while only marginally increasing the SDCs produced.Figure 4.5 compares LetGo-B and LetGo-E over the four metrics. Figure 4.5ashows that LetGo-E achieves an improvement in Continuability for CLAMR by32% and for PENNANT by 5% over LetGo-B , but not much for the other bench-marks (considering the error bars). Overall, LetGo-E achieves 14% on averagehigher Continuability than LetGo-B. Figure 4.5b shows that the Continued_detecteddeclines by 1% from LetGo-B to LetGo-E on average and with only 0.8% increasein CLAMR. Therefore, the efficacy of the acceptance checks is not much affectedby the heuristics employed by LetGo-E. Figure 4.5c shows that LetGo-E has higherContinued_correct over LetGo-B by 4% on average across all benchmarks. Thisshows that it allows more crashes to be converted into correct results than LetGo-B.In Figure 4.5d, we find that Continued_SDC ratio for LetGo-E remains the sameas that of LetGo-B on average. In Figure 4.5d, we can observe that LetGo-E to-tally eliminate the SDCs for CLAMR and SNAP, and has almost the marginallydifferent values of the Continued_SDC metric for all benchmarks - the worst caseis 2% higher Continued_SDC faults for PENNANT. Thus, the heuristics used byLetGo-E does not add much to the incorrect executions.Overall, the heuristics introduced by LetGo-E lead to better continuablility (byabout 14%) over LetGo-B for continuing the programs, and producing 5% morecorrect results than LetGo-B.4.5.2 Performance overheadTo estimate performance overhead, we experimentally measure the performanceLetGo for a single application run outside the context of a C/R scheme. In Sec-tion 4.6, we will evaluate the end-to-end impact of LetGo when used in the contextof an C/R scheme, that is in the presence of failures and considering C/R overheads.730% 23% 45% 68% 90% ContinuabilityLetGo-B LetGo-Ea Continuability0% 5% 10% 15% 20% Continued_detectedLetGo-B LetGo-Eb Continued_detected0% 25% 50% 75% 100% Continued_correctLetGo-B LetGo-Ec Continued_correct0% 5% 10% 15% 20% Continued_SDCLetGo-B LetGo-Ed Continued_SDCFigure 4.5: Comparison of Continuability, Continued_detected, Continued_correctand Continued_SDC between the LetGo-B and LetGo-E. LetGo-E has a higherlikelihood of converting crashes into correctly executions for our benchmarksthan LetGo-B but no increase in Continued_SDC cases.There are two source of overhead in LetGo: a). Running the program with gdb.b). Adjusting program states after a crash happens. We report the time overheadfor each part below. Since LetGo-E is a superset of the operations performed byLetGo-B, we report only the LetGo-E overheads to get the worst-case time over-head.We first measure the execution time of LULESH with LetGo, under three in-put sizes. We find that for the three different input sizes, the number of dynamicinstructions range from 1 billion to 180 billion, and LULESH with gdb exhibitsconsistently low overhead (i.e., less than 1% compared to running it without gdbfor each case). We have observed a similar trend for the rest of the benchmarks aswell. As explained in Section 4, this is because LetGo neither changes the applica-tions’ compilation levels nor does it set breakpoints on the application.74We also measured the time overhead of adjusting program states after a crashby measuring how much time is spent in LetGo-E for each benchmark (i.e., the timespent in step 4 of the Figure 4.3). We find that across all of our benchmarks, thewall-clock time spent in LetGo is roughly around 2-5 seconds, and, as expected, itstays constant when we increase the input size. This time is trivial compared to theoverall execution time of most HPC programs. Recall that LetGo takes two actionsto adjust the program states: 1). finding the next PC, 2) applying the two heuristicsif necessary. As explained in Section 4.3.2 and Section 4.3.3, both actions onlyneed a disassembler to acquire the static instruction-level information of a program- we use PIN. With a more efficient disassembler, the time overhead can be evenfurther reduced. Thus, LetGo incurs insignificant performance overheads in mostcases, and this overhead does not increase with increase in applications’ inputsizes.4.6 LetGo in a C/R contextThe previous section demonstrated that LetGo is indeed able to often continue ap-plication execution with minimal impact on application correctness and efficiency.This section aims to evaluate the end-to-end impact of LetGo in the context of along-running parallel application using a C/R mechanism. The main challenge inthis evaluation is that there are multiple configuration scenarios that need to be con-sidered, and hence direct measurement is prohibitively expensive. To address thisissue, we model a typical HPC system using C/R as a state machine and have builta continuous-time event simulation of the system. This simulation framework en-ables us to compare resource usage efficiency with and without LetGo. We predictthe overall performance gains using LetGo, based on the effectiveness of LetGoestimated with fault injections on an application-specific basis in the previous sec-tion. We focus on LetGo-E as we found that it achieves higher Continuability andContinued_correct compared to LetGo-B. In the rest of this section, when we sayLetGo, we mean LetGo-E.Model assumptions. We make a number of assumptions that are standard formost of the checkpointing literature [23, 42, 49, 151]. Our models assume that allcrashes are due to transient hardware faults, and hence restarting the application75Table 4.4: The description of parameters of the modelsParameter Description ValueT Checkpoint interval (useful work)√2∗T _chk ∗MT BF†T_r Time spent for recovery from a previous checkpoint T_chkT_chk Time spent writing a checkpoint System-dependentT_sync Time spent in synchronization across nodes 50%*T_chk and 10%*T_chkT_v Time spent in application acceptance check 1%*T_chkT_letgo Time spent in LetGo 5sP_crash The probability that a application crashes when a fault occurs Application-dependentP_v The probability that an application passes the verification check Application-dependentP_v’ The probability that an application passes the verification check with LetGo Application-dependentP_letgo The Continuability of LetGo Application-dependentMTBF Mean time between failure System-dependentMTBF_letgo Mean time between failure with LetGo MTBF/(1-62%)MTBFaults Mean time between faults System-dependent† EI-Sayed et al. [49] show that checkpointing under Young’s formula achievesalmost identical performance as more sophisticated schemes, based on exhaustiveobservations on the production systems.from a checkpoint will be sufficient to recover from the crash. In a similar vein,we assume that the checkpointing process itself is not corrupted by a fault. Wefurther assume that the application does not have any other fault-tolerance mech-anism than C/R (and LetGo), and that it does not modify its behaviour based onthe faults encountered. Finally, when modelling a multi-node platform, we assumethe HPC system uses synchronous coordinated checkpointing, which implies thatcheckpoints are taken at the same time across different nodes via synchronization;and that, when one node crashes, all nodes in the system have to fall back to thelast checkpoint and re-execute together.Parameter description We categorize the model parameters into three classes,summarized in Table 4.4:1. Configured: The time to write a checkpoint (T_chk), and the mean timebetween faults (MTBFaults) are configured by the model users based on thecharacteristics of the platform and the application;2. Estimated: The probability of a crash after a fault occurs (P_crash), the prob-ability that an application passes an application-defined acceptance check(P_v, the probability that an application passes an an application-defined ac-ceptance check after LetGo has been used to repair state (P_v’), and LetGoContinuability (P_letgo) are obtained from the fault injection experiments76on a per application basis.3. Derived: The checkpoint interval T is determined to be a value that maxi-mizes efficiency for the current configuration based on Young’s formula [151](when not explicitly mentioned otherwise in the experiment description).The recovery time (T_r) is, conservatively, chosen to be equal to the check-point overhead T_c as we assume the equal write and read speed access tothe stable storage (also used in prior work [23]), and neglect the additionalcoordination overhead. We assume that the time spent for an acceptancecheck (T_v) is proportional to the checkpoint overhead because the size ofthe data to check is the same. We use: T_v = 0.01 * T_c. The overhead forsynchronizing multiple nodes (T_sync) to take a coordinated checkpoint is(optimistically, as we do not consider system scaling effects) a constant frac-tion of the per/node checkpointing time (T_chk). We use two values for thesynchronization overhead as 10% and 50% of the checkpointing overhead.Finally we can derive mean time between failures (MTBF) based on theexperiments in the previous section. As we observe from the fault injec-tion experiments in the previous section, on average 56% of faults lead tocrashes. Thus for simplicity, we use MTBFaults = 2*MTBF. The C/R schemewith LetGo helps the application avoid crashes, which results in a longerMTBF. We refer to this new MTBF of the system after LetGo is applied asMTBF_letgo, and since the Continuability of LetGo is about 62% on aver-age, we set MTBF_letgo equal to MTBF/(1-62%).Model description. The state machine modelling a system that does not useLetGo is depicted in Figure 4.6a and has three states: COMPutation, Checkpoint,and VERIF-ication. In the beginning, the application enters the COMP state fornormal computation. A transition is made from the state COMP to VERIF if nocrash happens ( 1©), and the acceptance check is applied on the application data/out-put. If this check passes, a transition is made from the state VERIF to CHK ( 5©)and a checkpoint is taken immediately. If the application does not pass the check,it transits from the state VERIF to COMP ( 2©). A transition from the state COMPback to itself occurs when a failure is detected ( 4©), or faults occur when the appli-cation is in the COMP state but none of them cause crashes, so that the application77COMPVERIFCHKCOMP123456070: InitializationCondition:    NoActions:    cost = 0    u = 0    q = 0    t = next(cost)    faults = 0    faults_total = 05: VERIF->CHKCondition:    p((P v)faults) == TRUEActions:    cost += T_v     u += T    q = 0    t = next(cost)    faults = 04: COMP->COMPCondition:   (t < T-q) & (p(P_crash) == TRUE)Actions:    cost += t + T_r + T_sync    q = 0    t = next(cost)    faults = 03: COMP->COMPCondition:   (t < T-q) & (p(1-P_crash) == TRUE)Actions:    cost += t     q += t    t = next(cost)    faults += 11: COMP->VERIFCondition:   t > T-q Actions:    cost += T -q    q = T    t = next(cost)6: CHK-> COMPCondition:   u < needed-compute-timeActions:    cost += T_chk  + T_sync    q = 0    faults = 0    faults_total += faults    t = next(cost)7: CHK-> TERMINATIONCondition:   u > needed-compute-timeActions:    Terminate2: VERIF->COMPCondition:    (1-p((P v)faults )) == TRUEActions:    cost += T_v + T_r + T_sync    q = 0    t = next(cost)    faults = 0a M-SCOMPCHKCOMPLETGOVERIFCONT12345678910110123: COMP->LETGOCondition:    (t < T - q) & p(P_crash) == TRUEActions:    cost += t    q += t    faults += 1    t = next(cost)4: LETGO-> CONTCondition:    p(P_fix) == TRUEActions:    cost += T_letgo    t = next(cost)11: LETGO-> COMPCondition:    p(1- P_fix) == TRUEActions:    cost += T_letgo + T_r + T_sync    q = 0    faults = 0    t = next(cost)6: CONT-> COMPCondition:    (t < T - q) & p( P_crash) == TRUEActions:    cost += t + T_r + T_sync    q = 0    faults = 0    t = next(cost)7: CONT-> CONT Condition:    (t < T - q) & p(1-P_crash) == TRUEActions:    cost += t     q += t    faults += 1    t = next(cost)5: CONT-> VERIF Condition:    t > T - q Actions:    cost += T - q    q = 0    t = next(cost)    isLetGo = 19: VERIF> CHK Condition:    if isLetGo == 0:        p((P v)faults ) == TRUE    else:        p((P v0)faults) == TRUEActions:    cost += T_v    u += T    q = 0    t = next(cost)    isLetGo = 00: InitializationCondition:    NoActions:    cost = 0    u = 0    q = 0    t = next(cost)    faults = 0    faults_total = 0    isLetGo = 02: VERIF> COMP Condition:    if isLetGo == 0:        p(1-(P v)faults) == TRUE    else:        p(1-(P v0)faults) == TRUEActions:    cost += T_v + T_r + T_sync    q = 0    faults = 0    t = next(cost)    isLetGo = 0b M-LFigure 4.6: The state machines for the standard C/R scheme (a) and the C/R schemewith LetGo (b). The black circle represents the termination state of the model.We use u/cost to represent the efficiency of the model. t: time interval till thenext fault; cost: accumulated runtime; u: accumulated useful work; q: accu-mulated useful work within the current checkpoint interval; faults: number offaults that did not lead to crashes since the last checkpoint; faults_total: totalnumber of faults that did not lead to crashes; isLetGo: a flag that indicates thatif the P_v’ is chosen or not78stays in the COMP state and the number of faults will be increased ( 3©). Whenfaults are accumulated in the system, the probability that the application passes theverification check is modeled as (P_v) f aults, given the assumption that hardwaretransient faults occur as independent events.Figure 4.6b illustrates the model for the C/R scheme when using LetGo. Thestate machine contains two more states: "LETGO" and "CONT"inue. Due to spacelimitations, we emphasize here only the transitions related to the new states. Whenthere is a failure (i.e., crash) occurring during the computation (i.e., the applica-tion stays in the COMP state), a transition is made to the LETGO state ( 3©). Theapplication moves from the LetGo state to CONT if LetGo continues the execu-tion of the application ( 4©), otherwise, the application transits back to the COMPstate (11©). While the application stays in the CONT state, the occurring fault caneither cause another crash and make the application transit to the COMP state ( 6©),or not cause a crash and make the application proceed to the state VERIF. The"isLetGo" flag is set for choosing the different base probabilities (P_v or P_v’) thatthe application passes the verification check ( 5©). The base probability is used inthe conditions of the state transitions 2© and 9©. The actual probability is thencalculated using (P_v) f aults or (P_v′) f aults in 8© and 7©.Evaluation metric. The goal of the simulation is to understand the impact ofLetGo on resource usage efficiency in the context of a long running application us-ing a C/R scheme in the presence of faults. We define resource usage efficiency asthe ratio between the accumulated useful time and the total time spent (i.e., u/cost).To evaluate the efficiency of both setups (with and without LetGo), we performsimulations for configuration parameters corresponding to different benchmarksand different platforms.Choice of parameters. We justify the choices for the checkpointing overhead,and the MTBF. We first discuss the checkpointing overhead: the time spent towrite a checkpoint to the persistent storage depends on the characteristics of thehardware. For more advanced hardware, the checkpointing overhead becomes lesssignificant. However, on one side, advanced hardware support such as burst buffersrepresent additional costs. To the degree these are added to reduce checkpointingoverheads, a checkpointing scheme with lower overhead would enable provision-79ing systems for lower overall cost. On the other side, even in the presence ofburst-buffers, checkpointing is still a major bottleneck on deployed systems as oursimulations show. We use two criteria for choosing the checkpoint overheads. Hereare two data points that justify our choice of parameters to seed our simulations:• Back-of-the-envelope calculation: For each checkpointing overhead valuewe pick for our simulations we assume that the system-level checkpointingwrites some portion of the main memory to the persistent storage. A modernHPC node normally features 32 to 128GB memory. For a burst buffer im-plemented with SSD, the average I/O bandwidth for write is around 1GiB/s,and the peak value is 6GiB/s [17]. For spinning disks, the I/O bandwidthis usually around 50MiB/s to 500MiB/s. As a result, our choices for thecheckpoint overhead of (12s, 120s or 1200s) respectively represent, (i) awell provisioned system using burst buffers, (ii) and averagely provisionedsystem (e.g., using burst buffers, or compression and spinning disks); and(iii) a naive, under-provisoned system. We note that a similar set of values isalso used in prior work [23, 49].• Future system requirements: The Alliance for application Performance atEXtreme scale (APEX) 2020 document [88] requires that the systems deliv-ered in 2020 have a single job mean time to interrupt of more than 24 hours,and for a delta-checkpoint scheme (i.e., the time to checkpoint 80% of ag-gregate memory of the system to persistent storage), the time for writing thecheckpoint to be less than 7.2 minutes (432s). This suggests that our param-eters are in the same ballpark as those of the current and future systems.Along similar lines, we derive MTBFaults for existing systems from previouslyreported studies: we start with the system presented by [23] as a baseline. Thissystem contains about 10,000 nodes, and usually experiences around 2 failuresper day [21] (MTBF of 12 hours). Based on this data, we scale MTBF for largersystems, and also consider systems with lower node-level reliability.Running the Simulation. We assume that the hardware transient faults hittingthe system are governed by a Poisson process and we generate a random timesequence for the occurrences of the hardware faults. Then, we seed the models80with various sets of parameters and run the simulation over the generated sequencefor a long simulation time (10 years), to determine the asymptotic efficiency valuefor each benchmark.Experimental Results. We first show the efficiency for the C/R system withand without LetGo under different checkpoint overheads. We take the system thathas a MTBFaults of 21600 seconds (i.e., MT BF = 12hours) and a synchronizationoverhead of 10% of the checkpoint interval as an example.The efficiency improvement enabled by LetGo is between 1% to 11% acrossour benchmarks (absolute values, the relative values are much higher (nearly 20%)).As the checkpoint overhead scales up (from 12s to 1200s), the efficiency gain in-creases for all applications, while, at the same time, the absolute efficiency perapplication decreases. Figure 4.7 shows the applications that have the highestand lowest efficiency gains respectively, LULESH and SNAP, when the check-point overhead is 1200 seconds. This trend is consistent across all applications,and across different synchronization overheads. Our results thus show that LetGooffers significant efficiency gains.We then scale the system from 100,000 nodes to 200,000 and 400,000 nodes.Scaling results in lower MTBF for the whole system: 6 and 3 hours. Again, weuse two benchmarks namely CLAMR and PENNANT as examples, shown in Fig-ure 4.8. As the scale of the system increases, the efficiency of the system bothwith and without LetGo decreases, as expected. Importantly, the rate of decreaseof efficiency is lower for the system with LetGo than without. This trend is consis-tent with all our benchmarks, suggesting that LetGo offers better efficiency as thesystem scale increases.4.7 DiscussionWe address a number of interrelated issues.How effective is LetGo when used for applications that do not use conver-gent methods? We have evaluated LetGo on five benchmarks that use convergentmethods. We have also evaluated LetGo on HPL, which is a linear algebra solverthat uses a direct method. Our fault injection experiments show that, without thepresence of LetGo, 34% of faults lead to crashes, 38% lead to incorrect output de-810. = 12 T_chk = 120 T_chk = 1200EfficiencyMTBFaults = 21600sLULESH LUESH-LetGoa Efficiency of LULESH with and withoutLetGo0. = 12s T_chk = 120s T_chk = 1200sEfficiencyMTBFaults = 21600sSNAP SNAP-LetGob Efficiency of SNAP with and withoutLetGoFigure 4.7: Efficiency with and without LetGo under different checkpoint overheadsfor LULESH and SNAP.90.5% 91.0% 91.5% 92.0% 92.5% 93.0% 93.5% 94.0% 94.5% 100000 200000 300000 400000EfficiencyCLAMR CLAMR-LetGoa The efficiency trend of CLAMR whenT_chk = 12s50.0% 55.0% 60.0% 65.0% 70.0% 75.0% 80.0% 85.0% 100000 200000 300000 400000EfficiencyCLAMR CLAMR-LetGob The efficiency trend of CLAMR whenT_chk = 1200s93.0% 93.5% 94.0% 94.5% 95.0% 95.5% 96.0% 96.5% 100000 200000 300000 400000EfficiencyPENNANT PENNANT-LetGoc The efficiency trend of PENNANT, T_chk= 12s55.0% 60.0% 65.0% 70.0% 75.0% 80.0% 85.0% 100000 200000 300000 400000EfficiencyPENNANT PENNANT-LetGod The efficiency trend of PENNANT, T_chk= 1200sFigure 4.8: The trend of the efficiency for the C/R scheme with and without LetGowhen the system scales from 100,000 nodes to 200,000 nodes and 400,000nodes82tected by the HPL residual check, about 1% lead to SDCs, and 27% lead to correctoutput. We find that the application-level acceptance checks are much more selec-tive than for the applications in our initial set, and this would make HPL a goodcandidate for use with LetGo. However, while the crash rate is still high (34%), itis lower than that of the other applications, where it was around 60% - this wouldpotentially reduce the impact of LetGo. When using LetGo with HPL, we obtainaround 70% Continuability and 2x increases in SDC rate (from 1% to 3%). Tounderstand the overall performance, we run the simulation of a C/R scheme withthe potential results from applying LetGo-E to HPL. Our simulation results showthat the efficiency of the standard C/R scheme applied to HPL is around 40%, andLetGo-E only marginally improves efficiency. Thus, LetGo by itself is not a goodfit for applications like HPL - other error correcting mechanisms (e.g. ABFT) maybe needed for such programs.Determining when/how to use LetGo An operator will decide whether to useLetGo depending on a mix of factors: i) how frequently the system experiencehardware faults that crashes the application, ii) what is the likelihood of the appli-cation to experience additional SDCs given the use of LetGo, iii) what is the check-point overhead for a specific C/R scheme for that application and deployment, iv)what is the acceptable increase in the SDC rate. It is reasonable to assume that theoperator has (an approximation for) some of the information above as she needs toconfigure the checkpoint interval when LetGo is not used. Additionally, the oper-ator needs information to estimate the increase in SDC rate due to LetGo. A largecharacterization study with applications from multiple categories that extends thepreliminary data provided in this paper is necessary to provide these estimates.Moreover, since LetGo allows applications to continue with errors, it may bepossible to use it for application-specific (re)configuration of the hardware faulttolerance mechanisms to enable energy savings.4.8 SummaryThis chapter answers the first part of RQ2. We demonstrate that it is feasible tocontinue an HPC application’s execution rather than to terminate it on crashes dueto transient hardware faults affecting computational components. We have imple-83mented the LetGo prototype, which monitors the execution of an application andmodifies its default behavior when it receives a termination-causing OS signal.When used in the context of a C/R scheme, LetGo enables sizable resource usageefficiency gains. More specifically, for a set of HPC benchmarks, LetGo offers over50% chance that the application can continue and produce correct results withoutperforming a roll-back/recovery. We evaluate the impact of LetGo for long-runningapplications that use C/R and find that LetGo enables sizeable efficiency gains. Theefficiency gains increase with both the system scale and checkpointing overheads,and thus suggesting that LetGo will likely be even more important for the futurelarge-scale HPC applications.84Chapter 5BonVoision: Leveraging SpatialData Smoothness for Recoveryfrom Memory Soft ErrorsIn Chapter 4, we presented LetGo, which optimizes the standard C/R system tosupport the roll-forward scheme and demonstrate the feasibility and performanceof LetGo over failures due to errors affecting computational components. Thischapter focuses on the second part of the RQ2 (Chapter 1.4) and investigates if theroll-forward recovery scheme can be applied to the C/R system when the applica-tion encounters memory detectable but uncorrectable errors (DUEs).In this chapter, we start from two high-level observations made on certainclasses of HPC applications (e.g., stencil computations on regular grids or irreg-ular meshes): First, these applications display a property referred as spatial datasmoothness: i.e., data items that are nearby in the application’s logical space arerelatively similar; Second, since these data items are generally used together, pro-grammers go to great lengths to place them in nearby memory locations to improvethe application’s performance by improving access locality. Based on these obser-vations we explore the feasibility of a roll-forward recovery scheme that lever-ages spatial data smoothness to repair the memory location corrupted by a DUEand continues the application execution. We present BonVoision, which interceptsDUE events and analyzes the application binary at runtime. BonVoision identi-85fies the data elements in the neighborhood of the memory location corrupted by amemory DUE and uses those data to repair the memory corruption. Our evaluationdemonstrates that BonVoision is: (i) efficient—it incurs negligible overhead; (ii)effective—it is frequently successful in continuing the application with benign out-comes; and (iii) user friendly—it does not require programmer input to expose thedata layout, nor does it need access to source code. We show that using BonVoisioncan lead to significant savings in the context of a checkpointing/restart scheme byenabling longer checkpoint intervals.5.1 IntroductionCurrent memory devices (both DRAM and SRAM) are subject to increasing errorrates [2, 3], due to the shrinking feature sizes, reduced supply voltages, lower re-fresh rates, and the overarching need to reduce the energy footprint. This scenariois exacerbated in a HPC context, where a large application memory footprint andsystem scale make it more likely that the system experiences memory errors. Forexample, Martino et al. [21] report that on the Blue Waters supercomputer, on av-erage 160 memory errors per hour are observed across half of the total Blue Watersnodes. A recent study [132] predicts that the uncorrected error rate for a futureexascale system could be 3.6–69.9× the uncorrected error rate currently observedin existing supercomputers.The most common kind of error correcting codes (ECC) used to protect againstthese errors are single-error correction double-error detection (SECDED): they cor-rect single-bit flips, but can only detect (and not correct) double-bit flips. Moreadvanced ECC schemes such as chipkill can extend the capacity of the correctablebits in memory [46, 98], but they generally incur high cost in terms of energy andperformance. Therefore, a portion of the memory errors remain detected but notcorrected. These Detectable but Uncorrectable Errors are known as DUEs. WhileDUEs occur less frequently than correctable errors (e.g., Martino et al. [21] ob-serve that 29.8% of memory errors are DUEs without chipkill, while with chipkillthis percentage drops to less than 1%), they can lead to critical consequences forlong-running applications. when a DUE occurs, the CPU raises a machine checkexception (MCE) to the Operating System (OS), which typically crashes the appli-86cation. Higher crash likelihood directly impacts time-to-solution and the cost ofmitigating strategies (e.g., checkpoint frequency).Prior studies [9, 106, 141] have observed that many applications inherentlymask some errors, and still produce acceptable results. This observation motivatedus to first investigate whether the fail-stop model overprotects the application in thecontext of HPC applications. We therefore asked whether is it feasible to simply ig-nore DUEs and continue operating. Unfortunately, our fault injection experiments(Section 5.3.1) revealed that while not all DUEs lead to failures, the rate of subse-quent failures is too high for this technique to be practical. Therefore, we developheuristics to repair the state affected by the DUE and enable forward applicationprogress.The key insight that guides the design of our repair mechanism is that someclasses of HPC applications (e.g., stencil applications) are likely to have spatialdata smoothness [15]. This is because the physical phenomena often modeled byHPC applications exhibit inherent continuities in their modeled physical space e.g.,temperature, pressure, or velocity for a climate modelling application. Further,the data structures used by these applications tend to keep physically close datapoints close together in memory as well to exploit locality. We leverage these twoobservations to propose a repair scheme for DUEs in HPC applications that is:(i) effective - it is frequently successful in continuing the application with benignoutcomes, (ii) efficient - it incurs negligible performance overhead, (iii) automated- it does not require additional programmer involvement to expose the data layout,and (iv) practical as it neither requires access to the application’s source code, norruntime instrumentation.While conceptually elegant, there are two challenges in designing a repair tech-nique based on the above intuition and offering the above characteristics. The firstchallenge is to identify the application-level data element corresponding to an ob-served DUE (recall that the ECC memory would only identify the faulty memoryword in case of a DUE), as HPC applications are deployed as binaries on the targetsystem, and the source code is not available at runtime when faults occur. Evenif the source code was available, it is not straightforward to infer the data layoutand the nearby data elements in the application’s physical space without relyingon either programmer annotations (which would imply a significant burden), or on87runtime instrumentation (which would slow down the execution). After identifyingthe application-level data element corresponding to the fault, the second challengeis to decide on the value to use for repair by finding uncorrupted elements that arespatially close to the location of the fault.We present BonVoision, a lightweight and automated approach for DUE repair.BonVoision requires neither programmer annotations nor instrumentation of theapplication code. To address the first challenge, we propose a set of heuristics toinfer the neighboring application-level data elements of a faulty word, based onruntime analysis of the application’s binary. To address the second challenge, wepropose a number of repair heuristics, and experimentally evaluate them.BonVoision can be used in conjunction with classic checkpoint-restart (C/R)schemes typically deployed on HPC systems. On a DUE, a C/R system rolls-back the application and recovers it from a previously collected checkpoint. WithBonVoision, the system will first attempt to reconstruct the application state androll-forward (i.e., continue execution). Only in case of an another failure (eithera crash or incorrect state detected by application-level checks), will it attempt torecover from the checkpoint. From the perspective of a C/R scheme, the impact ofusing BonVoision is similar to increasing the mean time between failures (MTBFs),leading to lower checkpointing overheads and faster execution times.This chapter makes the following contributions: Introduces BonVoision, a software methodology that leverages data spatial sim-ilarity to repair memory DUEs. (Sections 5.3 and 5.4) Evaluates and compares the end-to-end application outcomes for three repairschemes: (i) using BonVoision’s stride-based values, (ii) 0s, and (iii) randomvalues. Results show that BonVoision repairs lead to 0.8 - 2.5x more benign out-comes, and only marginal increases in Silent Data Corruption (SDC) outcomescompared to using 0 values and random values for the repairs. (Section 5.6.1) Proposes an enhancement to BonVoision to use an online classifier, to predict theoutcome of a individual repair. We demonstrate that the optimized BonVoision(called BonVoision-E) leads to a significant improvement in repair effectiveness(on average 23% higher rate of benign-only outcomes). (Section 5.6.2)88 Evaluates the overall BonVoision efficiency with different program input sizesand number of MPI processes. We find that BonVoision incurs a negligibleoverhead (about 6 milliseconds on average per corrected DUE). (Section 5.6.3) Demonstrates the performance gains enabled by BonVoision-E when used in thecontext of a C/R scheme. (Section 5.6.4)5.2 BackgroundECC Memory. Dynamic random-access memory (DRAM) may incur single- ormulti-bit flips due to, for example, alpha particle strikes, background radiation, orneutrons from cosmic rays [115]. ECC uses extra memory bits to detect and correctbit-flip errors. SECDED, the most widely-used ECC scheme, uses 8 redundant bitsfor every 64 bits. More powerful schemes such as chipkill ECCs can correct moreerrors at a higher cost. While generally applicable, we demonstrate BonVoision inthe context of the SECDED scheme. A memory location is checked for an errorwhen an application reads it or during memory scrubbing, which periodically per-forms memory reads and writes. The memory controller computes the checksumfor the data on every read, and compares it with the stored ECC bits. For SECDED,when more than one bit is flipped, the controller logs the event as a DUE and passesit to the processor.MCE for memory DUEs. On x86 processors, the most common mechanismto deal with a memory DUE is to generate a machine check exception (MCE). Inthe case of DUEs, MCE encapsulates the DUE-specific hardware information (e.g.,socket, channel, DIMM, etc). In some scenarios, the computed ECC syndrome andthe value of the corrupted data are available, but we do not assume that this infor-mation is available. MCEs are generally directly handled by the OS. For example,Linux provides an "MCE tolerance level" option for users to determine how theOS should react to MCEs. By default the kernel would either “panic” or deliver aSIGBUS exception to the applications on uncorrected errors, and log the correctederrors. We assume that only the application experiencing the DUE receives theSIGBUS and terminates (this covers most cases). In user space, mcelog [87] isused for Linux to decode and consolidate MCE messages.Checkpoint/Restart. C/R schemes [54, 137] are frequently used to recover89from crash (i.e., fail-stop) failures. A C/R scheme consists of two main operations:(i) it periodically saves program state to non-volatile storage (i.e., it ’checkpoints’the data necessary for the application to recover); (ii) if a crash failure occurs, itretrieves one of the checkpoints (often the latest one) to re-create the current state,and continues the application. Many studies have focused on determining the “op-timal” check-pointing interval, which optimizes application’s expected runtime.Young et al. [151] and Daly et al. [42] present analytic models to find the optimalcheckpoint interval as a function of (i) the time to generate the checkpoint (i.e., thecheckpoint overhead), and (ii) the mean time between failure (MTBF). EI-Sayed etal. [51] empirically show that checkpointing guided by Young’s formula achievesalmost identical performance as more sophisticated schemes, based on exhaustiveobservations on production systems. BonVoision essentially attempts to convertthe fail-stop DUE failures into continued and successful executions. Since the ap-plication fails less often, its MTBF is increased, reducing checkpointing overheadswhen BonVoision is used.5.3 Motivating studiesOur approach to repair memory soft errors that result in DUEs is based on two ob-servations: (i) simply ignoring the DUEs does not work for HPC applications, and(ii) some classes of HPC applications exhibit spatial data smoothness that can beleveraged for efficient DUE repairs. We describe the experiments based on whichwe make the two observations. The benchmarks used and the other experimentaldetails are described in Section Does ignoring DUEs work?A typical application’s virtual memory space is organized as segments such asstack, heap, etc. Typically, the heap segments consumes the largest size of thevirtual memory space, and hence if a DUE occurs in an application’s memory, itmost likely occurs in the heap segment. To verify this, we have profiled the useof stack and heap segments for the HPC applications in our benchmark and foundthat the heap segments indeed dominate the memory usage.We inject DUE errors into the heap segment, and observe their impact on appli-900% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% C R C R C R C R C R C R C RLULESH FLUID COMD CLAMR PENNANT TENSORDECOMPGMEANBenign Crash Hang Detected SDCFigure 5.1: Error injection results for six HPC applications. C means consecutivebits, while R means random bits.cation outcomes. We perform the following steps in the error injection campaign:(i) we profile each application to find out how many dynamic memory read instruc-tions access the heap segment – these are our fault injection sites. We pick memoryread instructions only as they are the sole sources that trigger DUEs considered inthis paper 1; (ii) we randomly (uniform probability distribution) select an instruc-tion from the set of all error injection sites, and flip2 two bits in the memory dataaccessed by the selected instruction before the instruction executes. We considertwo error modes: the flipped bits are consecutive or at random; and (iii) we allowthe program to complete execution and study its outcome.The outcomes are classified into five categories: (1) benign: correct output,identical to that of a ‘golden run’ - a run with no fault injected, (2) detected: theapplication’s own correctness checks detects some violation of the properties inthe results (see section Section 5.5 for details on correctness checks), (3) SilentData Corruption (SDC): the output data is different from that of the ‘golden run’,(4) crash: exceptions, and (5) hang: the program does not stop for a long periodof time. We perform 5,000 double-bit fault injection runs for each application toobtain a statistically significant estimate of outcome probabilities: the 95% confi-1Memory scrubbing is likely to occur during idle period to prevent decreasing performance, sowe do not consider it as the source of the DUEs in this study2We modify PINFI [145], a PIN-based instruction-level fault injection tool, to inject 2-bit-fliperrors into memory on the memory instructions that read from the heap.910.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8Standard deviations0. mesh xstep size = 1step size = 2step size = 4step size = 80.000 0.005 0.010 0.015 0.020 0.025Standard deviations0. hvstep size = 1step size = 2step size = 4step size = 80.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6Standard deviations0. mesh.xstep size = 1step size = 2step size = 4step size = 80.000000 0.000005 0.000010 0.000015 0.000020Standard deviations0. mesh.svolstep size = 1step size = 2step size = 4step size = 80.0 0.2 0.4 0.6 0.8 1.0Standard Deviations0. Decomp Matrixstep size = 1step size = 2step size = 4step size = 80 5 10 15 20 25 30 35 40Standard deviations0. s->atoms->rstep size = 1step size = 2step size = 4step size = 8Figure 5.2: Examples of the ECDF of the standard deviations for six benchmarks(LULESH, FLUID, CLAMR (top-row), PENNANT, TENSORDECOMP, andCOMD (bottom row). The corresponding data structures are indicated on thetop of the figures.92dence interval yields error bars of around 1%.Figure 5.1 shows the results of the error injection experiments. On average,when injecting into consecutive bits (left bars), 30.5% of the injected errors leadto benigns, 26.4% to crashes, 19.3% to detected cases, 0.3% to hangs and 23.5%to SDCs. The errors injected into random bits (right bars) cause slightly fewerbenigns (25.4%) and SDCs (20%) but higher crashes (32.2%) and detected cases(22%). Overall, the results are independent of how we choose the 2 bits. We alsoobserve that not a single benchmark exhibits a high benign rate combined with alow SDC rate. Thus it is not practical for these applications to just ignore memoryDUEs.5.3.2 Exploring spatial data smoothnessThis paper makes two assumptions about smoothness : (i) that it exists in the appli-cation’s logical space; and (ii) that data structures preserve it when mapped to ac-tual memory. To verify these assumptions together, we quantitatively estimate thespatial smoothness of an application’s core data structures (i.e., the data structuresthat model the physical space the application works over) once they are mappedto memory. If spatial smoothness exists, then the standard deviation of the differ-ences between consecutive elements in the memory layout of the data structure issmall. Therefore, given a data structure and a sets of elements S, we form a new setdi f f (S) = {di|di = si+1− si,∀si ∈ S | } by first computing the difference betweensuccessive elements, and then computing the standard deviation over di f f (S) - thesmaller the standard deviation, the smoother is the space. To study how spatiallyfar from an anchor element smoothness extends, we vary the size of the set S. Moreconcretely, we pick in S the data elements at distance (in stride) ±2i for i = 1..Karound an anchor element.To estimate smoothness for our benchmark applications, we build a customprofiler that pauses the application execution at multiple random time points and,at each pause, randomly selects a number of anchor elements from the core datastructures of the application. It then constructs the set S and applies the metricdescribed above. Note that since the profiler works directly at the data structures,it makes no mistake in finding the neighbouring elements of that data structure. In93our experimental settings, the profiler pauses each application 10 times and picks1,000 random anchor elements during each pause.Figure 5.2 shows the empirical cumulative distribution function (ECDF) ofstandard deviations (SD) under different step sizes (values for K above). Wepresent the ECDFs sampled from one of the core data structures of each appli-cation as an example. We make two main observations: (i) generally the values forSDs are low, indicating smoothness. Around 80% to 100% SDs are close to 0 inLULESH, FLUID and PENNANT; 80% of SDs are less than 0.3 in TENSORDE-COMP, and nearly 100% of SDs are less than 0.5 in CLAMR; for COMD, around75% of SDs are close to 0, but the remaining 25% vary in a large range. (ii) com-paring the SDs across different step sizes, step size 1 always offers the smallest SDsfor all applications. This indicates that BonVoision should use the closest neigh-bours for repair. In summary, we observe that significant spatial data smoothnessexists across the elements of the core data structures.5.4 Design and implementation of BonVoisionOverview. Section 5.3.2 shows that application-level data structures exhibit spatialdata smoothness. Armed with this information, we explore DUE repair strategiesthat harness smoothness and use the data values from nearby locations to proposevalues used for repair. BonVoision embodies the results of this exploration.BonVoision consists of three main components (Figure 5.3): (i) informationparsing, (ii) stride speculation, and (iii) repair write-back. BonVoision is invokedupon a DUE signal. The information parsing component parses the informationfrom the DUE signal handler and feeds it to the stride speculation component.This component runs a set of heuristics and outputs the possible stride. Finally, therepair write-back component uses the stride to access the neighboring elements,computes their average value, and writes that value back to the corrupted memorylocation. The rest of this section focuses mainly on the stride speculation compo-nent, as it is the most challenging part. We first explain the challenges of stridespeculation, followed by a synthetic study we designed to devise heuristics. Fi-nally, we present the heuristics for stride speculation.Stride speculation challenges. We explain the challenge for stride speculation94DUEpc:4004c8system handlerPC 4004c8Item ValueOpcode addsdAddressing mode rax+rdx*1Data type8 bytesAccessed sizefloating pointexportaddrspeculateparsePosition Addressnth to left addr-n*stridenth to right addr+n*strideIdentify neighboursCalculate the averageWrite back to addrContinue the programInformation parsingStride speculationRepair write-back for ( int i = 0; i < n; i ++){    double mess = a[i].m;    // do some computation    // …}4004c8: addsd  (%rax,%rdx,1),%xmm04004cd: add    $0x10,%rdx4004d1: cmp    $0x640,%rdx4004d8: jne    4004c8 <main+0x38>compileFigure 5.3: Illustration of BonVoision components, the information available upon aDUE and why this information is insufficient to determine the a data structure’sstride.Table 5.1: The patterns and stride observed for accessing to the elements of the dif-ferent data structures. Reading elements from Foo3 and Foo5 results in the sameaddressing mode as Foo2.DatastructureCompilationoptionAccessedelementsAddressingmodeRegister foraddress shiftingInstruction(s)modify the register StrideFoo1 O3 bar2 (base) rdx add $(0x8), %rdx 8Foo2 O3 bar3 displacement(base) rdx add $(0x10), %rdx 16Foo4 O3 bar1.bar2 displacement(base,index,scale) rdx*4 (scale) add $0x1, %rdx 4Foo4 O0 bar1.bar2 displacement(base) rax, rdxshl $0x4, %rdxadd $rdx, %rax16with an example: the program iterates over the array and reads its elements for thelater computation. Figure 5.3 shows the C code and corresponding x86 instructions(compiled with gcc and -O3). Suppose a memory DUE occurs when the memoryread instruction (4004c8) is executed, and a recovery handler is in place. For the re-covery strategy to exploit data smoothness and repair the corrupted location usinginformation from nearby elements, it needs to identify the neighbours of the dataresiding in addr. At the source code level, this task is trivial: knowing the addressof an element &a[k].m = addr, its neighbours should be addressed as &a[k-1].m= addr - sizeof(a[k]) and &a[k+1].m = addr + sizeof(a[k]), where in our exam-ple the sizeof(a[k]) = 2*sizeof(double). However, it is challenging to infer this atthe assembly code level, with no support from the programmer or other programinstrumentation. This is because the only information available to the recoveryhandler is the assembly instruction itself (referred to as pc in Figure 5.3) and the95corrupted memory location addr. We need to speculate the stride based on thisinformation.5.4.1 Synthetic studyWe conduct a synthetic study that explores several typical scenarios to find com-mon patterns for inferring the stride.Addressing modes. The addressing modes in x86 assembly consist of severalsegments including base, index, displacement and scale, where the base and indexare x86 registers, and displacement and scale are immediate numbers (scale can be1,2,4 or 8). Depending on the compilation options, program semantics and otherfactors, the compiler needs to choose a particular addressing mode for accessingthe elements within a data structure. For example, the instruction 4004c8 in Fig-ure 5.3 contains no displacement segment, while other memory read instructionsmay contain no index or scale segments, no base segments, etc.Observations. We made a preliminary observation on the assembly code inFigure 5.3: the register rdx is the index register of instruction 4004c8, and getsadded by 0x10 (which is the size of the data structure Atom) every time the instruc-tion 4004cd gets executed. This observation leads to two insights: 1) there is anobservable update on the instruction’s register to locate different elements in thedata structure - we call this register the stride register; 2) the value of the stridecan be inferred based on the modification to the stride register. Therefore, it ispossible to infer the stride from the corresponding address calculation operationson the memory instruction.For a more comprehensive understanding of how a program calculates the ad-dresses for different elements in a data structure, we study several typical datastructures in C/C++: vector, array, array of structures, structure of arrays, namelyfoo: vector<double>; foo1: {int bar1; double bar2}; foo2:{float bar1; float bar2;};foo3:{foo2 *bar1; int bar2;}; foo4:{foo2 bar1[n]; int bar2[n];}; 3. We implementa simple program that consecutively accesses elements of each data structure, andexamine the program’s assembly code for the instructions associated with the ad-dress calculations. Table 5.1 summaries our findings. There are two main insights:3The allocation, initialization, access, and free operations, implemented in the driver code, arenot shown here due to space limits.96(i) both the base and the index registers can participate in the element’s address cal-culations, (ii) while the compilation options (i.e., O0 and O3) generate distinctiveaddressing modes, the stride information can still be inferred from the correspond-ing address calculation instructions.5.4.2 HeuristicsBased on the insights from the above study, we design 2 heuristics to speculate thestride information from the program’s assembly code.Basic heuristic. We observed that many core data structures are comprised ofbasic-type elements such as double, float, integer etc. For those data structures, thestride information can be directly inferred by the accessed size from the memoryread instruction. We call this heuristic the size-based approach.Advanced heuristic. For more complicated cases, we leverage the insight fromthe above study, that the stride information may be found by tracing the modifica-tion to the stride register, i.e., either the base or the index register used for accessingdifferent elements in a data structure. Thus, we propose the following heuristic tolook for the instruction/instructions that write to the stride register, and extract thevalue in the operands as the indicator of how consecutive elements are accessed.The heuristic contains two main processes:Backward and forward slicing. The goal of this process is to traverse all theinstructions that use the stride register as the destination register at the assemblylevel, and find which one(s) contain the information of accessing to the neigh-bouring elements. As observed in the the synthetic study, these instructions canbe either before or after the memory instruction, so both directions need to besearched. In practice, when a instruction modifies the register with an intermediatenumber, or a lea (load effective address) instruction modifies the register, the strideis inferred from the instruction(s), and the process terminates. The process also ter-minates when all the static instructions in the same function are iterated, or thereis a memory read instruction that overwrites the register with the memory data. Inthese cases, BonVoision shifts to the basic heuristic and uses the accessed size asthe stride.Handling transitive relations. There are cases where the data dependency97Table 5.2: Benchmark description, including SDC determination approach and ap-plication correctness check criteriaApplication Domain Main datastruct LOC# dynamic memread inst (109)Applicationdata used for acceptance check Criteria for acceptance checkLULESH Hydrodynamics Struct mesh 2.9k 1.0 Mesh pointsSame # of iterations, Energy conservation,Measures of symmetry: smaller than 10−8CLAMR Hydrodynamics Adaptive mesh 60.1k 6.4 Mesh points Threshold for the mass change per iterationPENNANT Hydrodynamics Unstruct. mesh 5.0k 9.3 Mesh points Energy conservationCOMD Molec. dynamics Grid 5.6k 8.6 Atom properties Energy conservationFLUID Fluid dynamicsNewtonianparticle grid5.7k 1.8 Particle propertiesParticles’ positions, velocities,bounding boxes with absolute toleranceTENSORDECOMPSparse tensordecompositionMulti-DArray10.2k 0.2 Output matrices Decomposition fit > 0.65 and delta < 10−3propagates to another register: when the backward or the forward slicing processidentify it as (i.e. the transitive register) being used to update the root register ofthe current slices, the modification on the new transitive register needs to be con-sidered. When such relation is captured, the current slice needs to be discarded,and new backward and forward slicing processes based on the transitive registerare launched. It might be possible for BonVoision to encounter one or more lev-els of transitive relations. Therefore, BonVoision sets the limit to the depth ofthe transitivity, based on the observations on various applications’ assembly codes.Algorithm 5 shows how the two heuristics are implemented in BonVoision. ItAlgorithm 4 Backward instruction slicing1: procedure BACKWARD_SLICING(reg, inst)2: inst_b← inst3: while inst_b 6= f irst inst in the f unction do4: . the function contains the inst5: inst_b← prev(instb)6: if reg == dest_reg(inst_b) then7: . dest_reg outputs the destination register of a inst8: if has_immediate(inst_b)oris_lea(inst_b) then9: if is_lea(inst_b) then10: scale← parselea(inst_b)11: if scale == 0||scale == 1 then12: index_lea = parse(inst_b)13: . goes one more level of transitivity14: backward_slicing(index_lea)15: f orward_slicing(index_lea)16: else17: stride← scale18: else19: stride = handle_immediate(inst_b)20: else21: . Transitive relation begins22: backward_slicing(source_reg(inst_b))23: f orward_slicing(source_reg(inst_b))24: return stridetakes the instruction that triggers the DUE as the input, determines its addressing98mode, and calls the backward and forward slicing procedures respectively. Algo-rithm 4 describes how the backward slice is computed. The forward slicing processis identical to the backward slicing process except that the prev(inst_b) needs tobe replaced with next(inst_b).Algorithm 5 The main procedure for stride speculation1: procedure SPECULATE(inst) . the inst that triggers the DUE2: stride← 03: (base, index,size)← parse(inst)4: . size means the accessed size5: if index 6= ”” then . the addressing mode includes index6: stride← backward_slicing(base, inst)7: if stride == 0 then8: stride← f orward_slicing(base, inst)9: else10: stride← backward_slicing(index, inst)11: if stride == 0 then12: stride← f orward_slicing(index, inst)13: if stride == 0 then14: stride← size . takes the basic heuristic15: return strideImpact on the run-time overhead. Prior studies [41, 57] show that conductingbackward/forward slicing processes requires building dynamic data dependencegraphs, which could be extremely time-consuming for large-scale applications.This problem, however, has been largely mitigated in BonVoision’s stride spec-ulation heuristic, due to the following reasons: (i). the heuristic runs over staticinstructions, so it is unlikely to cause state explosion; (ii). it looks for the depen-dence chains of the index (and/or transitive index) register(s), (iii). it is limited tothe scope of the function the DUE-causing instruction belongs to (i.e., it is intra-procedural).5.4.3 Computing and writing-back the repair valueBased on the results of the motivating study presented in Section 5.3.2, the closestelements offer the smallest standard deviations across all applications. BonVoisionoperates as follows: once the stride is determined, BonVoision uses the stride tochooses the closest neighbours located at (addr-1*stride, addr+1*stride) and ob-tains the data (l,r) from those elements correspondingly (assume the DUE occursat addr). It then computes the arithmetic mean of l and r, and writes the mean backto the address addr as the repaired value. After this, BonVoision lets the program99continue the execution. Because a typical word is 8 bytes, if the inferred stride issmaller than 8 bytes (e.g. 4 bytes) , BonVoision needs to repair the entire memoryword. For each unit (e.g. 4 bytes) in the word that needs to be repaired, BonVoi-sion attempts to locate the closest elements beyond the scope of the same memoryword, and use the average of those elements to write back to the unit.Handling vector instructions Modern CPUs provide direct support for vec-tor operations where a single instruction is applied to multiple data (i.e. SIMD).For example, MOVAPD in SSE2 instruction sets loads 16 bytes of data from thememory to a 128-bit register like XMM. If an ECC exception is triggered whenexecuting such instructions, BonVoision assumes that the corrupted bits are in thefirst 8 bytes of the memory and repairs them correspondingly. Repairs that coverneighbouring memory word will be explored in the future work.Implementation-level mechanisms. For the system to adopt BonVoision, thefollowing mechanisms should be in place. Firstly, the applications should not react(i.e., terminate) to the SIGBUS signal. This can be implemented by overwrit-ing the signal handler from the application, or by reconfiguring the application’ssignal-handling actions [59]. Secondly, when performing the repair, the applica-tion should bypass the cache and directly overwrite the corrupted data in the mem-ory 4. This can be implemented with a class of intrinsics (i.e. _mm_stream_si32,_mm_stream_si128, etc) supported by compilers such as gcc or the Intel compiler,or the binary can be instrumented directly with the corresponding instructions suchas MOVNTI from Intel SSE2. Thirdly, the OS kernel needs to perform a transla-tion from the physical memory address to the virtual memory address for the DUEand pass the virtual memory address to the run-time. We assume that the DUE istriggered by an application read and not by scrubbing, since memory scrubbing islikely to occur during the idle periods.Limitations. There are a few cases that adversely affect BonVoision’s perfor-mance. First, if a program implements loop unrolling on some loops, then theelements are accessed in a user-defined fashion which makes the stride speculationmuch more difficult from the corresponding assembly code. Second, the advancedheuristics assume that frequently accessed elements on the heap are likely to be4Otherwise there could be recurrent ECC errors due to the mismatched ECC. A memory writecan trigger the computation of the new ECC for the memory data1000% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% WB-0 WB-randomWB-BVWB-BVEWB-0 WB-randomWB-BVWB-BVEWB-0 WB-randomWB-BVWB-BVEWB-0 WB-randomWB-BVWB-BVEWB-0 WB-randomWB-BVWB-BVEWB-0 WB-randomWB-BVWB-BVELULESH FLUID COMD CLAMR PENNANT TENSORDECOMPBenign Crash Hang Detected SDC Crash-BVEFigure 5.4: Application outcome ratios when DUEs are repaired by WB-0 (left),WB-random (center-left), BonVoision (center-right), and BonVoision-E (right).Crash-BVE class denotes that BonVoision-E predicts the repair will lead to de-tected or SDC outcome, a situation similar to a crash in the context of a C/R,as applications would continue from a checkpoint. For BonVoision-E, the crashand SDC rates are low or zero for most benchmarks, as a result they are notvisible in some plots.iterated over in a loop so that the stride can be speculated from the operations oncertain registers during the address calculation. Our assumptions are based on twocommon practices in scientific applications: (i) many simulation problems tend toaccess nearby data iteratively, (ii) applications tend to exploit locality of reference.5.5 Experimental methodologyWe conduct large-scale fault injection campaign to evaluate BonVoision’s effec-tiveness. We follow the methodology described in Section 5.3.1, with the slightdifference that in each experimental trial, we do not explicitly “inject” a two-biterror into the memory as the corrupted data in the location will be completelyoverwritten by the repair anyways.Benchmarks. We use (1) four representative DOE mini-applications, namely,LULESH [83], CLAMR [114], PENNANT [89] and COMD [35], (2) two scien-tific applications: FLUID [112] from the PARSEC benchmark suite [19] and, (3)a state-of-the-art CANDECOMP/PARAFAC decomposition implementation (de-noted as TENSORDECOMP) [97]. The details of these applications are presentedin Table 5.2. Note that our benchmarks use representative internal data structures(structured grids / meshes, adaptive meshes, etc.)101Application-specific correctness checks. For many scientific applications, de-velopers write acceptance checks that increase confidence that the result was notimpacted by a numerical error, an SDC, or simply a software bug. These accep-tance checks are often based on energy conservation or numeric tolerance for resultapproximation. In practice, the check is typically placed at the end of the execu-tion of the application [83], or if the application uses a C/R scheme, at the end ofeach checkpoint interval [114]. For CLAMR, FLUID, and PENNANT, we use thebuilt-in acceptance checks (written by the developers). For LULESH, COMD, wewrote the checks ourselves, based on application verification specifications: Sec-tion 6.2 in [82] for LULESH, “Verification correctness" section in [36] for COMD.For TENSORDECOMP, we consulted with the benchmark’s developer and wrotethe check that examines the decomposition fit rate at the end of each iteration forconvergence. Table 5.2 describes the the criteria used in the acceptance checks foreach benchmark.Outcome classification. A first decision is made based on whether the appli-cation appears to complete its run successfully after a repair. If not, the applicationeither receives an OS signal and terminates before it finishes (a crash), or the ap-plication does not finish in an expected time (a hang) and is terminated. If yes,then the output of the final application state is checked. If the application’s resultsdo not pass the acceptance check, the outcome is labelled as detected, while if theapplication passes the check, the core program output is compared bit-wise withthe output from a fault-free run: if they differ then the outcome is SDC, otherwisethe outcome is benign. Note that this is conservative estimate as we do not considerapplication-specific semantics in interpreting the output.Implementation and experimental setup. BonVoision is implemented usingIntel Pin tool [118] 3.5. All experiments are run on a server with 40 Intel(R)Xeon(R) CPU E5-2698 v4 @ 2.20GHz processors running Linux 4.13.0. We per-form 5,000 injections per application to obtain tight error bounds of 0.2% - 1.3%at the 95% confidence interval. All benchmarks are compiled with gcc 8.2 usingO3, which auto-generates X86-64 SSE2 instructions.1025.6 Evaluation resultsWe first evaluate the effectiveness of BonVoision in repairing memory errors thatresult in DUEs (Section 5.6.1 and Section 5.6.2). We then measure BonVoision’sperformance overheads while scaling up the application (Section 5.6.3). Finally,we demonstrate the use of BonVoision in the context of a typical C/R scheme andevaluate its impact on reducing C/R overheads (Section 5.6.4).5.6.1 Q1: How effective is BonVoision?Comparison baselines. We introduce two repair strategies other than BonVoision,namely write-back with 0 (labeled WB-0 in plots) and write-back with random(WB-random), both attempting to repair the corrupted memory data and continuethe execution of an application. Based on past related work [59, 101], both strate-gies are viable, and are hence reasonable as baselines.Overall effectiveness. To measure BonVoision’s effectiveness we focus on thebenign and SDC outcomes of a repair: the more benigns and the fewer SDCsBonVoision converts DUEs into, the more effective are its repairs. Figure 5.4 (leftthree bars for each benchmark) shows the outcome of our error injection campaignafter repairing DUEs with BonVoision (indicated as WB-BV), WB-0 and WB-random. For each benchmark, we present the percentage of each type of outcomeunder different repair techniques. Overall, applications repaired by BonVoisionlead to highest benign rates among three repair strategies across all benchmarks.In particular, the improvements of benign rates range from 5% to 20% com-pared to WB-0, and from 5% to 47% comparing WB-random (all expressed asabsolute value). On average, BonVoision achieves on average 12% and 21% im-provements over WB-0 and WB-random, respectively. We observe, on average,0.8× and 2.5× improvement in benign rates when comparing WB-BV with WB-0and WB-random, respectively (expressed as relative values).For LULESH, FLUID, CLAMR and PENNANT, the SDC rates are less than7% across all repair strategies; although BonVoision leads to higher SDC rates,the difference is rather marginal, ranging from 0.1% to 2% (absolute). For COMDand TENSORDECOMP, we observe relatively high SDC rates from three strate-103gies: 23%(WB-0), 27%(WB-random) and 30%(WB-BV) for COMD, 63%(WB-0),16%(WB-random) and 45%(WB-BV) for TENSORDECOMP.WB-BV outperforms the strategy that ignores DUEs: Comparing the results ofBonVoision with the results of ignoring DUEs in Figure 5.1, BonVoision is able toreduce the SDC rates from 22% to 14%, and increase the benign rates from 26%to 36% on average.WB-BV offers more effective repairs than WB-random Although WB-randomoffers the lowest SDC rates, it performs poorly for converting DUEs to benign out-comes. Specifically, the repairs lead to the highest crash and detected rates amongthe three strategies on average, significantly degrading the benefit of the repairs inthe context of the C/R: the application either crashes or violates the application’schecks (if any), and rolls back to the checkpoint.To summarize, BonVoision outperforms both WB-0 and WB-random in offer-ing the most effective repairs: BonVoision results in the highest benign rates for allbenchmarks, and only incurs marginal increases in SDC rates in four benchmarks.The results were similar when we tested with different data inputs for LULESH,CLAMR, and PENNANT. For COMD and TENSORDECOMP, none of the threetechniques: WB-BV, WB-0, and WB-random, offers effective repairs in terms ofminimizing the SDC rates.5.6.2 Q2: Can machine learning help improve BonVoision?One of the main uses for BonVoision is in the context of a C/R scheme, thus,the following discussion will be guided by this context. When using BonVoision,upon a memory DUE, the application can roll forward instead of rolling backwardsince the corrupted data likely gets repaired by BonVoision. However, as shown inFigure 5.4, the chance that the repair is followed by a crash, detected, or SDC is notnegligible, which raises concerns5 when using BonVoision with a C/R system. Weelaborate these concerns and explain why they need to be addressed for BonVoisionto run with C/R: (i) most importantly SDCs are the worst case as the applicationproduces incorrect results without notifying the users; (ii) the detected errors resultin the waste of resources: as the correctness checks are typically placed at the5The similar concerns are also discussed in the recent study [55]104end of the checkpoint interval, a BonVoision repair that triggers an applicationcorrectness check error at the end of the interval is costlier than a crash triggeredby the DUE; and finally (iii) crashes are similar to detected though they incur lowercost, as a crash would generally re-occur before the end of the checkpoint interval 6.Therefore, we consider the crashes to be not harmful in this study and focus on theother outcomes.One way to alleviate these concerns is to predict the outcome of a BonVoisionrepair, and just recover from a checkpoint if the predicted outcome is SDC, detectedor crash. To this end, we explore the effectiveness of an online classifier to predictthe outcome of BonVoision repairs. The classifier uses as features our estimatorsof space smoothness around the repair site and is trained based on ground truthobtained during the error injection campaigns.The classifier is built using the standard supervised machine learning process:we collect the standard deviation calculated from each trial of the fault injectionexperiments for each benchmark, and assign the label to each standard deviationbased on the outcome class of the trial. We follow standard practice and split thedataset 80/20 as the training and testing set, train the classifier on the training setwith a couple of classification algorithms (e.g., linear-regression and decision-tree),and test the models on the testing set to determine the best-performing model.Model tuning. We prioritize avoiding primarily SDCs, and, secondly detectedoutcomes. Table 5.3 shows the corresponding confusion matrix. We note that,depending on the operational context, the classifier can be tuned to prioritize othersuccess metrics.Table 5.3: The confusion matrix for the classifier, associated with mis-classificationcosts. Low cost for misclassifying a repair that leads to a benign outcome (inthese cases the C/R scheme will restart from a checkpoint similarly to when Bon-Voision is not used), high cost if SDCs or detected are predicted as benigns.PredictedActualBenignDetected or SDC(not benign)BenignTrue PositiveCost: lowFalse PositiveCost: highDetected or SDC(not benign)False NegativeCost: not highTrue NegativeCost: low6We measure the interval between the time the application gets continued after the repair and thetime the crash occurs, and observe that all the crashes occur nearly right after the repair.105Classifier performance. Table 5.4 shows the classifier performance on eachbenchmark. Overall, the decision-tree classifier provides the best results: depend-ing on application, the accuracy of the classifiers are from 70% to 92.4%, offeringgood accuracy to separating benign and not benign outcomes (SDC and detected).In particular, as the classifier is optimized for a conservative prediction on benignswhen the actual class of the repairs are also benigns, the recalls (i.e., the true pos-itive rates) for the benchmarks can be relatively low, while the selectivity (i.e., thetrue negative rate) soars from 88% to 100%, meaning that when the classifier pre-dicts that a repair would not lead to a benign outcome, it is likely that the predictionleads to detected or SDC. Thus, a C/R scheme is confident to crash the applicationfor those cases. It is observed that, for TENSORDECOMP, the mis-classificationrate of detected cases can be a little higher than 10%, and mis-classification rate ofSDCs is 8%. In fact, for this case, the absolute number of SDCs and detected casesis relatively small 7.Comparing with vanilla BonVoision. Figure 5.4 highlights the impact of us-ing the online classifier: the outcomes of using the online predictor are labelledBonVoision-E(nhanced) - right bars in the plots. The ratio of predicted SDC orcrash outcomes is identified separately. We assume that the crash rate stays con-stant between BonVoision and BonVoision-E, so all the types of outcomes are ad-justed proportionally, which leads to a conservative estimation The key takeawayis that BonVoision-E is capable of eliminating most SDC and detected cases for allthe benchmarks and for some benchmarks it completely eliminates SDCs.Table 5.4: Classifier quality. The mis-classification rate of SDC shows the fractionof predicted benigns but the true class are SDCs divided by the total number ofSDCs in the testing set. Similarly for the misclassification rate of detected.Benchmark Accuracy Recall Precision Selectivity FalsePositive RateMisclassificationRate of SDCMisclassificationRate of DetectedLULESH 88.5% 55.4% 99.0% 99.7% 0.2% 3.0% 0%FLUID 85.7% 22.0% 84.2% 98.8% 1.1% 0 1.0%COMD 70.0% 5.8% 100.0% 100.0% 0 0 0CLAMR 91.2% 89.2% 96.6% 94.8% 5.0% 0 6.0%PENNANT 92.4% 90.0% 99.2% 98.3% 1.6% 5.0% 0.0%TENSOR-D. 85.3% 48.5% 54.0% 88.4% 2.5% 8.0% 13.0%7We applied this process on LULESH and PENNANT for additional inputs (scale and values),and we observed qualitatively similar results.1065.6.3 Q3: What is BonVoision’s efficiency?BonVoision runs in two phases: the dynamic instrumentation phase for stride spec-ulation on the program’s assembly code, and the dynamic execution phase that per-forms the actual repairs. We aim to estimate the overhead of each of these phases,and how overheads scale when the application scales up in either input size or thenumber of processes. We present the results for only a single application CLAMR,though similar results were observed for all benchmarks. We measure the execu-tion time of each phase under different input sizes and numbers of MPI processeswith CLAMR, and compare those times with the time spent for each iteration byCLAMR. For each configuration, we run the application over 100 times, and mea-sure the average of the execution time and the iteration time. We use CLAMR inthis experiment because it has a basic C/R feature enabled, and the checkpoint istaken at the end of every certain number of iterations (configurable) as used in realworld.Figure 5.5a shows that the execution times of both phases remain approxi-mately stable across different input sizes, while the iteration time increases muchfaster. Figure 5.5b shows a similar trend in the executions times of BonVoisionwhile the number of MPI processes increases. The dynamic instrumentation timeand the dynamic execution time adds up to a total execution time around 6 millisec-onds overall for BonVoision. The overheads for the other benchmarks are in thesame order of magnitude as well (0.5 to 2 ms). This level of overhead is negligiblefor most HPC applications.5.6.4 Q4: What is the impact in the context of C/R?For long-running scientific applications, the ability to tolerate memory DUEs isessential to make progress to completion - thus they usually employ C/R schemes.To evaluate the impact of BonVoision-E in this context we have developed we usea finite state machine to model the applications that run with a typical C/R sys-tem with and without BonVoision. We conduct experiments using an in-housecontinuous-time event simulator [59] on the applications in both scenarios to es-timate resource usage efficiency (i.e., the ratio between the time the applicationdoes useful work and the total application runtime), a metric widely used for C/R’s10764 128 256 512Input size0. instrumentationDynamic executionIterationa Execution time for each BonVoisionphase and the iteration time (including check-pointing) while scaling input size4 8 16 32Number of processes0. instrumentationDynamic executionIterationb Execution time for each BonVoisionphase and the iteration time while scaling num-ber of processesFigure 5.5: The performance overhead incurred by BonVoision scales well and isoverall negligible while the application scales up.efficiency [22, 42, 151]).We make the following assumptions to simplify the model. First, all crashesare due to memory DUEs, other memory soft errors are corrected by the ECC/chip-kill, and no DUEs occur during the checkpointing process. Second, no other faulttolerance mechanisms than C/R are used. Third, we assume the applications takesynchronous coordinated checkpoints. Thus, when a crash occurs on one node,all the nodes used by the application have to roll back to the last checkpoint andre-execute from that point. These are standard assumptions [42, 59, 151].Parameter description. Checkpoint interval. The checkpoint interval definesthe frequency to take checkpoints. Young’s formula [151] indicates that the optimalcheckpoint interval is determined by the MTBF and the checkpointing overhead.MTBF. We extrapolate the MTBF based on the error rates occuring on the BlueWaters supercomputer. Martino et al. [21] report that during the measurement pe-riod, i.e. roughly 6,177 hours, 1,031,886 memory errors are observed on 12,721Blue waters nodes, and 70.01% of them involve 1 bit error, 29.98% of them have2-8 consecutive bits errors, only 28 errors are not correctable by SECDED EC-C/chipkill. Therefore, if there is only SECDED ECC deployed, the DUE MTBF108rate is 1.2s, and with x8 Chipkill, the DUE MTBF is 794,185s. The MTBF of thesystem is inversely proportional to the size of the system: if the system scales upby an order of magnitude, the MTBF decreases by an order of magnitude [22],assuming that similar device technology is used.Checkpoint overhead. For system level checkpointing techniques, normally theentire memory (typically 32GB to 128GB on modern system node) is backed up topersistent storage. We assume three types of systems: a well-provisioned systemwith burst-buffer implemented with SSD (I/O bandwidth: average 1GB/s, peak6GB/s), an average-provisioned system that has burst-buffer and spinning disk,and an under-provisioned system that only deploys spinning disks (I/O bandwidth50MB/s to 500MB/s), and extrapolate the checkpointing overhead to be 12, 120and 1200 seconds respectively.Other parameters. We optimistically assume that the recovery overhead equalsthe checkpointing overhead (i.e., equal storage read and write speeds). We alsoassume application correctness checks at the end of checkpoint interval take 1% ofcheckpoint overhead, and that the application needs to spend 10% of checkpointingoverhead to synchronize across all nodes as we assume synchronized checkpoint-ing.Results Figure 5.6 shows the efficiency comparison between the vanilla C/Rsystem and the C/R system using BonVoision-E. Left plots present absolute ap-plication efficiency in various configurations while right plots present the relativeimprovement offered by BonVoision-E. We vary the system size in combinationwith different checkpointing overheads. We consider systems with 50k, 60k, 100k,600k, and 3600k nodes - with MTBF scaled from data inferred from Martino etal. [21]. These have MTBFs of 1 day, 10 hours, 6 hours, 1 hour and 10 mins re-spectively. Overall, C/R+BonVoision-E achieves higher efficiency than C/R acrossall configurations, ranging from 76% to 1%, and in particular on average 8% forLULESH, and 15% for CLAMR, 15% for PENNANT, 3% for TENSORDECOMPand FLUID, and a marginal improvement for COMD.We also find that there is a clear trend for the efficiency gain offered by BonVoision-E: when MTBF decreases, the gains accelerate across all configurations. We spec-ulate that this enables an opportunity for future HPC systems to reduce energyconsumption: the system could intentionally reduce the DRAM refresh rate [100,109127, 135], resulting in increasing memory error rates. Applications running withC/R+BonVoision-E can potentially maintain the same level of efficiency e.g., asshown in Figure 5.6, the efficiency of C/R+BonVoision-E’s is about 90% at 10 minsMTBF and 12 seconds checkpointing overhead, while C/R’s efficiency is 91% at1 hour MTBF and the same checkpointing overhead. This trend is consistent forother configurations across all benchmarks.10min 1hour 6hour 10hour 1dayMTBF30. 12sC/R 120sC/R 1200sC/R+BV-E 12sC/R+BV-E 120sC/R+BV-E 1200sa Efficiency in the context of a typ-ical C/R and C/R with BonVoision-E forLULESH10min 1hour 6hour 10hour 1dayMTBF0. Relative efficiency gain ofC/R+BonVoision-E over typical C/Rfor LULESH10min 1hour 6hour 10hour 1dayMTBF30. 12sC/R 120sC/R 1200sC/R+BV-E 12sC/R+BV-E 120sC/R+BV-E 1200sc Efficiency in the context of C/R andC/R with BonVoision-E for CLAMR10min 1hour 6hour 10hour 1dayMTBF0. Relative efficiency gain ofC/R+BonVoision-E over typical C/Rfor CLAMRFigure 5.6: Efficiency comparison between C/R and C/R+BonVoision-E for simu-lated 10-years of application time for LULESH and CLAMR. The MTBF de-creases as the system scales up.1105.7 DiscussionWhile this paper demonstrates that BonVoision and BonVoision-E can effectivelyand efficiently recover applications from memory DUEs using the stride-basedspeculation heuristic that leverages space data smoothness, there are three issuesthat should be discussed, including (i). is the proposed heuristic superior to otherrepair approaches? (ii). can one use BonVoision on other architectures like GPUs?(iii), how well would BonVoision work with more advanced ECC schemes likechipkill?(i) Compare the stride-based approach with others The heuristic proposed byBonVoision assumes no information from the application and speculates the strideof the data structure with a lightweight dynamic analysis. While we show its use-fulness, it is worth investigating if this approach works better than other possiblesolutions for devising the repairs for memory DUEs. Below, we discuss our per-spective on this comparison: speculating the stride differently. A straightforward approach to speculate thestride of the data structure is to find the immediate neighbours using the size-based approach. To Obtain the size one only needs to parse the DUE-causinginstruction at run-time, so it incurs less overhead than BonVoision’s stride-basedheuristic. However, as many HPC applications employ complex data structuresto represent the physical space, the immediate neighbours may not necessarilyreflect the spatial data smoothness. estimating the impact of potential data similarity. The program data might besimilar across a large chunk of memory. In this case, the proposed heuris-tics might output incorrect strides while still offering appropriate repairs. Torule out this factor, we propose two approaches: random with stride, whichreads elements from two random locations in the range of (i.e. addr-128*stride,addr+128*stride), and computes their average as the repair value; and similarlyrandom with size, which reads elements from two random locations in the rangeof (i.e. addr-128*size, addr+128*size) and compute their average as the repairvalue.We have executed a fault injection experiment similar to the one presented in111Section 5.6.1. For each error injection trial that leads to benign, SDC or detectedcase (we exclude crashes as applications terminate in these cases hence the repairshave no impact on the output’s correctness), we compute the relative differencebetween the original data in the corrupted memory location and the repair valuescomputed based on each of the approaches above. We find that, the stride-basedapproach (i.e. BonVoision’s heuristic) outperforms other approaches significantly.In particular, BonVoision offers on average 41% more values that differ from theoriginal value by less than 5%.(ii) Portability The current implementation of BonVoision mainly focuses on sup-porting well-adopted architectures (i.e. X86-64) in HPC systems. As the trendin HPC moves towards heterogeneous computing, it is worth understanding theBonVoision’s feasibility for diverse architectures like GPUs and ARM CPUs. Thetwo fundamental ideas of BonVoision: leveraging spatial data smoothness in theapplication’ data for repair, and attempting to infer the stride of the data structurebased on how the index register is adjusted, are independent of the characteristicsof a specific architecture. While extending BonVoision for diverse architectures ispromising, there are certain issues that need attention. For example, BonVoisionrequires the just-in-time compiler tool for dynamic instrumentation, which maynot be easily accessible on other platforms; and in particular to GPUs, additionalcomplexity results from less mature interrupt functionality on current GPUs andGPU-specific optimization techniques like coalesced memory access.(iii) Advanced ECC support Advanced ECC schemes like chipkill can largely im-prove the memory error resilience, and have been deployed in some HPC systems.For example, Blue Waters [21] feature both x4 (i.e. single symbol error correctionand dual-symbol error detection, with 4 bit/symbol), and x8 chipkill ECC DRAM.It is interesting to understand how would BonVoision work with such schemes. InSection 5.4.3 we introduce the BonVoision’s strategy for writing the repair back tothe memory: BonVoision’s repair overwrites the entire memory word that containsthe error bits. Therefore, as long as the detected error bits (i.e. by chipkill) are inthe same memory word, BonVoision should be able to handle it.1125.8 SummaryIn this chapter, we show that the impact of memory DUEs on applications in thecontext of C/R schemes can be mitigated. We present BonVoision, a lightweight,automatic approach that leverages spatial data smoothness present in HPC appli-cations to calculate the repair value of the corrupted memory location based on theneighboring data elements. We evaluate the corresponding performance gain (i.e.the second part of RQ2) with the proposed prototype. We find that when includ-ing our most effective techniques, on average, BonVoision-E is able to continue tocompletion 30% of the cases while decreasing the application’s SDC probabilityto almost zero. With BonVoision-E enabled, we show that the efficiency of thestandard C/R system can be potentially improved significantly.113Chapter 6ConclusionThis chapter first summarizes this dissertation and highlights its expected impacton the design and implementation of error characterization and fault tolerance tech-niques. It concludes the dissertation with a discussion on research directions forfuture work.6.1 SummaryThe prevalence of transient hardware faults in modern HPC systems necessitatesthe design of efficient and effective fault characterization and tolerance approaches.Toward this goal, my dissertation first presents an improvement of the error char-acterization mechanism and showcases how one can use the inferred error re-silience characteristics of applications to guide the selective error detection tech-nique; next, the dissertation proposes approaches that introduce the roll-forwardrecovery scheme in addition to the standard roll-back (C/R) systems. A summaryof the approaches includes the following: Chapter 3 describes ePVF, a methodology that extends the PVF analysis [131]by distinguishing crash-causing bits from the ACE bits so as to get a tighterbound on the estimation of the SDC rate. ePVF reduces the vulnerable bitsestimated by the original PVF analysis by between 45% and 67%, dependingon the benchmarks. To further reduce the overall analysis time, we proposepartial ePVF that samples representative patterns from ACE bits and find114that the extrapolated ePVF values are a good approximation for the overallePVF—on average the error is less than 1%. Finally, we present the follow-ing use-case for this methodology: an ePVF-informed selective duplicationtechnique, which leads to 30% lower SDCs than hot-path instruction dupli-cation. Chapter 4 targets crashes from errors affecting computational componentsin HPC systems. It presents LetGo, a technique that attempts to continuethe execution of a HPC application when crashes would otherwise occur. Itmonitors the execution of an application and modifies its default behaviorwhen a termination-causing OS signal is generated.When used in the context of a C/R scheme, LetGo enables sizable resourceusage efficiency gains: for a set of HPC benchmarks, LetGo offers over 50%chance that the application can continue and produce correct results with-out performing a roll-back/recovery. We evaluate the end-to-end impact ofLetGo in the context of a C/R scheme and find that LetGo offers signifi-cant efficiency gains (1.01x to 1.20x) in the ratio between the time spent foruseful work and the total time cost compared to the standard C/R scheme. Chapter 5 targets crashes from errors affecting memory systems. It presentsBonVoision, a run-time system that intercepts DUE events, analyzes the ap-plication binary at runtime to identify the data elements in the neighborhoodof the memory location that generates a DUE and uses them to fix the cor-rupted data. Compared to other repair techniques, BonVoision is able to con-vert, on average, 1.2−2.8× DUEs to benign cases, with a marginal increasein SDC rates. Using an online classifier, the enhanced BonVoision (i.e.BonVoision-E) eliminates almost all detected and SDC outcomes and pro-duces repairs that lead to benign outcomes 30% more often than the defaultfail-stop modes, potentially increasing the efficiency of checkpoint/restartmechanisms. We find that when including our most effective techniques, onaverage, BonVoision-E is able to continue to completion 30% of the caseswhile decreasing the application’s SDC to almost zero.1156.2 Expected impact6.2.1 Efficient SDC probability estimation and a quantitativeapproach for SDC detectionThe first impact made by this dissertation is demonstrating the existence of a dif-ferent trade-off point between the effort to characterize the error resilience of anapplication and the accuracy of the estimate. The proposed instruction-level SDC-proneness estimates can be used to guide error detection. In particular, the ePVFmethodology indicates the following: PVF-like analysis is a good vehicle to allow fine-grained understanding overthe program states, thus vastly accelerates the error resilience characteriza-tion process and maintains an accurate estimate on the impact of errors. With fault injections, the application under test is usually treated as a blackbox, and it is difficult to link the error resilience characteristics of an applica-tion to error detection strategies. ePVF allows the error detection techniquesto obtain a quantitative estimate of SDC-proneness at the instruction level. There are other use cases of ePVF. First, it can be used to determine whicharchitectural structures are more likely to cause SDCs and selectively pro-tect these structures through hardware techniques such as selective ECC.Second, the ePVF methodology can be used to determine the total numberof crash-causing bits in the program and inform a fault-tolerance mechanismfor crash-causing faults (e.g. checkpointing).Highlights of impacted research contributions: ePVF shows that it is possi-ble to combine error resilience characterization and error propagation analysis formore efficient and accurate SDC probability estimation and detection for the de-pendability community:The ePVF methodology pioneers the approaches that refine the AVF/PVF anal-ysis to classify architectural/program states based on their differing impacts on theapplication outcomes. These approaches have been extended lately. For example,Wibowo et. al. [146, 147] propose a study of the AVF that leads to DUE (i.e.116DUEavf) and the AVF that leads to SDC (i.e. SDCavf) over the phases of execu-tion of all major memory structures of a modern superscalar processor and corre-lates the SDCavf with the SDCpvf of the Architectural Register File. The SDCpvfapproach can work in combination with the ePVF methodology to enhance theSDCpvf estimation and reduce the cost of evaluating a program’s error resilience.The ePVF methodology inspires the idea of using quantitative instruction-levelSDC-proneness to guide the design of error detection techniques. Trident [96] fol-lows this idea, and shares a similar goal and high-level methodology with ePVF.It leverages the static and dynamic dependence graph analysis and offers quanti-tative estimates for both the overall SDC probability and the per instruction SDCprobability. To conduct instruction-based selective duplication, Trident computesthe SDC proneness for each instruction and duplicates the instructions that havehigher chances to lead to SDCs for SDC detection.6.2.2 A new direction to improve C/R efficiencyThe second impact made by my dissertation is that our research is the first to in-troduce the roll-forward recovery scheme in standard C/R systems. As describedin Chapters 4 and 5, such roll-forward C/R scheme creates the opportunity to tradeconfidence in results for efficiency (time-to-solution or energy-to-solution). Cer-tainly, for some applications—or for some operational situations—confidence inresults is the user’s primary concern, and LetGo and BonVoision will hesitantly beused. We believe, however, that there are many situations that make this trade-offattractive, as follows: First, since SDCs can occur anyway (due to bit-flips regardless of whetherLetGo or BonVoision are used), HPC users are already taking the risk ofobtaining incorrect results and have developed techniques to validate theirresults. For example, using application-specific checks to diminish this riskis an active research area [80] and a roll-forward C/R system will benefitfrom all these efforts. Second, some applications, as we show with LetGo and BonVoision, can besuccessfully continued after failures and produce no additional SDCs; here,LetGo and BonVoision definitely represent appealing solutions.117 Third, the proposed approach offers users additional information that can beused to reason about the confidence in results.Highlights of impacted research contributions: The proposed roll-forward re-covery methodology offers a promising strategy in failure recovery domains: itis possible to trade confidence in the recovery data for better recovery efficiencywhile making a minimal impact on the outcomes of the recovery.One example that exercises this idea after LetGo is EasyCrash [124], an application-level solution to handle application failures on non-volatile memory (NVM). Sim-ilar to LetGo and BonVoision, EasyCrash attempts to continue the application’sexecution after the crash with an online repair strategy. The heuristic used byEasyCrash is specific to NVM: it loads the persistent data available in NVM atthe time of failure to repair the corrupted program state for the application. Dueto the inconsistency that might occur between the cache and persistent memory,EasyCrash also relies on the natural error resilience observed among typical HPCworkloads (i.e. convergent computational patterns and the application-specific cor-rectness checks) to reduce the risk of introducing SDCs.Another example of the roll-forward recovery system that is currently draw-ing attention in the HPC community after LetGo is CARE [31]. CARE attemptsto recover applications from segmentation faults, which is in line with one of themain target failure types of LetGo. CARE assumes that the dependent registers tocalculate the corrupted memory address remain error-free when the segmentationfault occurs, and it proposes a repair heuristic that re-computes the correct mem-ory address based on values in those registers and reloads data from the intendedmemory location as the repair. The repair is applied on-the-fly after the applica-tion encounters the failure, while the information needed for the repair is obtainedthrough compiler-assisted techniques.6.2.3 PracticalityFinally, the approaches proposed in this dissertation are practical for HPC applica-tions. HPC applications generally favor a conservative environment, where chang-ing the application code or runtime environment may introduce risks to the HPCpractitioners. Both LetGo and BonVoision can be applied directly to the applica-118tion’s executable and transparently track and alter the program’s state/data. Webelieve our design choices increase the chance of our techniques being adopted infuture HPC systems.6.3 Future workWe propose four potential future research directions:6.3.1 Direction (i): Predicting the impact of a roll-forward failurerecoveryChapters 4 and 5 reveal that the roll-forward recovery scheme implicitly tradesfaster time to solution for higher risk—as it usually performs a probabilistic repair,which may cause further failures, such as SDCs, which are the main type of out-comes to be eliminated. We believe that answering the following two questionscan make the roll-forward recovery more effective and more likely to be acceptedby practitioners:1. A runtime decision: Assuming that a failure has occurred and assuming aspecific repair heuristic can be used, can one predict the likelihood of a spe-cific outcome (re-crash, termination with the correct result, or terminationwith SDC) ? Answering this question is useful to optimize the scheme bymaking the following two decisions: (a) to inform which repair heuristic touse, and (b) to decide whether to employ a roll-forward strategy or fall-backon the traditional roll-back C/R scheme.2. A decision after application termination: Assuming that a failure has oc-curred, a specific repair heuristic has been used, and the application termi-nates successfully, can one predict the likelihood that the result is correct,i.e., not an SDC? Answering this question is useful for making an informeddecision as to confidence in the result(s).6.3.2 Direction (ii): Semantic-based data recovery for memory DUEsIn Chapter 5 we observe the spatial data smoothness in HPC applications’ dataspace, and we take a basic heuristic (i.e. arithmetic average on the values of the119neighbouring data) to compute the repair value for recovery. Our approach of Bon-Voision assumes a general data displacement scheme across applications, while inmany cases of stencil computations, data can be placed in an N-dimensional for-mat. A refinement over the current approach would require an understanding ofthe following questions: what does the data space look like for an application, anddoes the simple repair operator always work? Hence, a potential research directionwould be to investigate whether such spatial data smoothness exists for a particulardata storage format and verify whether more sophisticated repair operators such asthe Lorenzo predictor [81], could lead to a higher likelihood of accurate/acceptablerecovery for the application.6.3.3 Direction (iii): Reducing energy consumption on memorysystemsThis dissertation shows that HPC applications can benefit from the roll-forwardrecovery scheme when facing memory DUEs; BonVoision allows the applicationto attempt to continue execution and produce correct results rather than crash rightaway. Since the application instrumented with BonVoision fails less often due tomemory DUEs, it can afford to run on the systems where the DRAM failure rate ishigher. A future study could explore the opportunity of a system design regardingthe interplay among energy consumption, the error resilience of memory systemsand the application correctness. For example, purposely lowering the DRAM re-fresh rate trades the memory’s reliability for more efficient energy consumptionwithout hurting the correctness of the applications. This idea would possibly leadto a new generation of memory technologies [100, 127].6.3.4 Direction (iv): Data compressionData compression is an effective technique to reduce the size of the data for moreefficient storage usage. On HPC systems, since scientific applications can gen-erate massive amounts of simulation data, periodically storing the snapshots ofdata becomes a bottleneck for the entire simulation. To mitigate this challenge,error-controlled lossy compression techniques [6, 14, 48, 136] have been proposedto reduce the data size significantly while guaranteeing that the distortion of the120compression data is acceptable by the applications. As BonVoision makes theobservation that many scientific applications exhibit inherent continuities in theirmodeled physical space and that such smoothness is well-preserved in the logicalspace, leveraging data smoothness to compress data in lossy compression tech-niques presents a promising direction.121Bibliography[1] Mpi standard 3.0. http://mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf.Accessed: 2015-06-04. → page 27[2] Samsung electronics develops world’s first eight-die multi-chip package formultimedia cell phones. http://www.samsung.com. Samsung ElectronicsCorporation. → page 86[3] International technology roadmap for semiconductors./ model forassessment of cmos technologies and roadmaps (mastar).http://www.itrs.net. Semiconductor Industries Association. → page 86[4] H. Agrawal and J. R. Horgan. Dynamic program slicing. In Proceedings ofProgramming Language Design and Implementation (PLDI), New York,NY, USA, 1990. ISBN 0-89791-364-7. → page 33[5] J. Aidemark, J. Vinter, P. Folkesson, and J. Karlsson. Goofi: genericobject-oriented fault injection tool. In DSN, pages 83–88, 2001. → page 5[6] M. Ainsworth, S. Klasky, and B. Whitney. Compression using losslessdecimation: Analysis and application. SIAM Journal on ScientificComputing, 39(4):B732–B757, 2017. doi:10.1137/16M1086248. URLhttps://doi.org/10.1137/16M1086248. → page 120[7] H. Ando, Y. Yoshida, A. Inoue, I. Sugiyama, T. Asakawa, K. Morita,T. Muta, T. Motokurumada, S. Okada, H. Yamashita, Y. Satsukawa,A. Konmoto, R. Yamashita, and H. Sugiyama. A 1.3 ghz fifth generationsparc64 microprocessor. In 2003 IEEE International Solid-State CircuitsConference, 2003. Digest of Technical Papers. ISSCC., pages 246–491vol.1, Feb 2003. doi:10.1109/ISSCC.2003.1234286. → page 18[8] J. Arlat, Y. Crouzet, and J. . Laprie. Fault injection for dependabilityvalidation of fault-tolerant computing systems. In [1989] The Nineteenth122International Symposium on Fault-Tolerant Computing. Digest of Papers,pages 348–355, June 1989. doi:10.1109/FTCS.1989.105591. → page 17[9] R. Ashraf, R. Gioiosa, G. Kestor, R. F. DeMara, C.-Y. Cher, and P. Bose.Understanding the propagation of transient errors in HPC applications. SC,pages 72–12, 2015. → page 87[10] B. Atkinson, N. DeBardeleben, Q. Guan, R. Robey, and W. M. Jones. Faultinjection experiments with the clamr hydrodynamics mini-app. In 2014ISSREW. → page 27[11] G. Aupy, A. Benoit, T. Hérault, Y. Robert, F. Vivien, and D. Zaidouni. Onthe combination of silent error detection and checkpointing. InProceedings of the 2013 IEEE 19th Pacific Rim International Symposiumon Dependable Computing, PRDC ’13, pages 11–20, Washington, DC,USA, 2013. IEEE Computer Society. ISBN 978-0-7695-5130-2.doi:10.1109/PRDC.2013.10. URLhttp://dx.doi.org/10.1109/PRDC.2013.10. → page 21[12] T. M. Austin. Diva: a reliable substrate for deep submicronmicroarchitecture design. In MICRO-32. Proceedings of the 32nd AnnualACM/IEEE International Symposium on Microarchitecture, pages196–207, Nov 1999. doi:10.1109/MICRO.1999.809458. → page 18[13] F. Ayatolahi, B. Sangchoolie, R. Johansson, and J. Karlsson. A study of theimpact of single bit-flip and double bit-flip errors on program execution. InComputer Safety, Reliability, and Security, volume 8153, pages 265–276.2013. ISBN 978-3-642-40792-5. doi:10.1007/978-3-642-40793-2_24. →page 32[14] R. Ballester-Ripoll, P. Lindstrom, and R. Pajarola. Tthresh: Tensorcompression for multidimensional visual data. IEEE transactions onvisualization and computer graphics, 2018. → page 120[15] L. Bautista-Gomez and F. Cappello. Exploiting spatial smoothness in hpcapplications to detect silent data corruption. In 2015 IEEE 17thInternational Conference on High Performance Computing andCommunications, 2015. → page 87[16] A. Benso, S. Di Carlo, G. Di Natale, and P. Prinetto. A watchdog processorto detect data and control flow errors. In 9th IEEE On-Line TestingSymposium, 2003. IOLTS 2003., pages 144–148, July 2003.doi:10.1109/OLT.2003.1214381. → page 17123[17] W. Bhimji, D. Bard, M. Romanus, D. Paul, A. Ovsyannikov, B. Friesen,M. Bryson, J. Correa, G. K. Lockwood, V. Tsulaia, et al. Acceleratingscience with the nersc burst buffer early user program. Proceedings of CrayUsers Group.[Online]. Available: https://cug. org/proceedings/cug2016proceedings/includes/files/pap162. pdf, 2016. → page 80[18] R. Bianchini and T. A. El-Ghazawi. System resilience at extreme scalewhite paper. 2009. → page 55[19] C. Bienia. Benchmarking Modern Multiprocessors. PhD thesis, PrincetonUniversity, 2011. → page 101[20] A. Biswas, P. Racunas, R. Cheveresan, J. Emer, S. S. Mukherjee, andR. Rangan. Computing architectural vulnerability factors for address-basedstructures. SIGARCH Comput. Archit. News, 33(2), May . ISSN0163-5964. → page 19[21] B. Bode, M. Butler, T. Dunning, W. Gropp, T. Hoe-fler, W.-m. Hwu, andW. Kramer. The blue waters super-system for super-science. contemporaryhpc architectures, jeffery vetter editor, 2012. → pages1, 80, 86, 108, 109, 112[22] G. Bosilca, A. Bouteiller, E. Brunet, F. Cappello, J. Dongarra,A. Guermouche, T. Herault, Y. Robert, F. Vivien, and D. Zaidouni. Unifiedmodel for assessing checkpointing protocols at extreme-scale. Concurr.Comput. : Pract. Exper., 26(17):2772–2791, Dec. 2014. ISSN 1532-0626.doi:10.1002/cpe.3173. URL http://dx.doi.org/10.1002/cpe.3173. → pages108, 109[23] G. e. a. Bosilca. Unified model for assessing checkpointing protocols atextreme-scale. Concurr. Comput. : Pract. Exper., pages 2772–2791, 2013.ISSN 1532-0626. → pages 8, 75, 77, 80[24] G. Bronevetsky, B. de Supinski, and M. Schulz. A foundation for theaccurate prediction of the soft error vulnerability of scientific applications.In SELSE, 2009. → page 19[25] F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer, and M. Snir. Towardexascale resilience: 2014 update. Supercomputing frontiers andinnovations, 1(1), 2014. ISSN 2313-8734. → page 1[26] M. Carbin, S. Misailovic, M. Kling, and M. C. Rinard. Detecting andescaping infinite loops with jolt. In Proceedings of the 25th European124Conference on Object-oriented Programming, ECOOP’11, 2011. → page23[27] J. Carreira, H. Madeira, and J. Silva. Xception: a technique for theexperimental evaluation of dependability in modern computers. SoftwareEngineering, IEEE Transactions on, Feb 1998. → page 5[28] M. Casas, B. R. de Supinski, G. Bronevetsky, and M. Schulz. Faultresilience of the algebraic multi-grid solver. In Proceedings of the 26thACM International Conference on Supercomputing, ICS ’12, pages91–100. ISBN 978-1-4503-1316-2. → pages 56, 60[29] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, andK. Skadron. Rodinia: A benchmark suite for heterogeneous computing. InIISWC, IISWC ’09. → pages 34, 43[30] S. Cheemalavagu, P. Korkmaz, K. V. Palem, B. E. S. Akgul, and L. N.Chakrapani. A probabilistic cmos switch and its realization by exploitingnoise. In the Proceedings of the IFIP international, 2005. → page 23[31] C. Chen, G. Eisenhauer, S. Pande, and Q. Guan. Care: Compiler-assistedrecovery from soft failures. In Proceedings of the International Conferencefor High Performance Computing, Networking, Storage and Analysis, SCâA˘Z´19, New York, NY, USA, 2019. Association for ComputingMachinery. ISBN 9781450362290. doi:10.1145/3295500.3356194. URLhttps://doi.org/10.1145/3295500.3356194. → page 118[32] V. K. Chippa, S. T. Chakradhar, K. Roy, and A. Raghunathan. Analysis andcharacterization of inherent application resilience for approximatecomputing. In Design Automation Conference (DAC), 2013. → page 23[33] H. Cho, S. Mirkhani, C.-Y. Cher, J. Abraham, and S. Mitra. Quantitativeevaluation of soft error injection techniques for robust system design. InDAC, 2013 50th ACM/EDAC/IEEE, pages 1–10, May . → pages 17, 31, 32[34] H. Cho, S. Mirkhani, C.-Y. Cher, J. A. Abraham, and S. Mitra. Quantitativeevaluation of soft error injection techniques for robust system design.Design Automation Conference (DAC), 2013 50th ACM/EDAC/IEEE,pages 1–10, 2013. → pages 2, 3[35] P. Cicotti, S. M. Mniszewski, and L. Carrington. An evaluation of threadedmodels for a classical md proxy application. In Hardware-SoftwareCo-Design for High Performance Computing (Co-HPC), 2014, Nov . →pages 71, 101125[36] P. Cicotti, S. M. Mniszewski, and L. Carrington. Comd: A classicalmolecular dynamics mini-app, 2013. URLhttp://exmatex.github.io/CoMD/doxygen-mpi/index.html. → pages 71, 102[37] C. Constantinescu, S. Krishnamoorthy, and T. Nguyen. Estimating theeffect of single-event upsets on microprocessors. In 2014 IEEEInternational Symposium on Defect and Fault Tolerance in VLSI andNanotechnology Systems (DFT), pages 185–190, Oct 2014.doi:10.1109/DFT.2014.6962059. → page 1[38] J. Cook and C. Zilles. A characterization of instruction-level error deratingand its implications for error detection. In DSN, 2008. → page 53[39] J. J. Cook and C. Zilles. A characterization of instruction-level errorderating and its implications for error detection. In 2008 DSN, June .doi:10.1109/DSN.2008.4630119. → page 63[40] B. Cully, G. Lefebvre, D. Meyer, M. Feeley, N. Hutchinson, andA. Warfield. Remus: High availability via asynchronous virtual machinereplication. In Proceedings of the 5th USENIX Symposium on NetworkedSystems Design and Implementation, NSDI’08, pages 161–174, Berkeley,CA, USA, 2008. USENIX Association. ISBN 111-999-5555-22-1. URLhttp://dl.acm.org/citation.cfm?id=1387589.1387601. → page 25[41] M. Dadashi, L. Rashid, K. Pattabiraman, and S. Gopalakrishnan.Hardware-software integrated diagnosis for intermittent hardware faults. In2014 44th Annual IEEE/IFIP International Conference on DependableSystems and Networks, pages 363–374, June 2014.doi:10.1109/DSN.2014.1. → page 99[42] J. T. Daly. A higher order estimate of the optimum checkpoint interval forrestart dumps. Future Gener. Comput. Syst., 22(3):303–312, Feb. 2006.ISSN 0167-739X. doi:10.1016/j.future.2004.11.016. URLhttp://dx.doi.org/10.1016/j.future.2004.11.016. → pages 25, 56, 75, 90, 108[43] T. Davies and Z. Chen. Correcting soft errors online in lu factorization. InProceedings of the 22Nd International Symposium on High-performanceParallel and Distributed Computing, HPDC ’13, pages 167–178, NewYork, NY, USA, 2013. ACM. ISBN 978-1-4503-1910-2.doi:10.1145/2462902.2462920. URLhttp://doi.acm.org/10.1145/2462902.2462920. → page 22126[44] M. de Kruijf, S. Nomura, and K. Sankaralingam. Relax: An architecturalframework for software recovery of hardware faults. In ISCA 14, pages497–508. ISBN 978-1-4503-0053-7. → page 29[45] N. DeBardeleben, J. Laros, J. Daly, S. Scott, C. Engelmann, and B. Harrod.High-end computing resilience: Analysis of issues facing the heccommunity and path-forward for research and development. Whitepaper,Dec, 2009. → page 25[46] T. J. Dell. A white paper on the benefits of chipkill-correct ecc for pcserver main memory by. 1997. → page 86[47] J. W. Demmel. Applied Numerical Linear Algebra. Society for Industrialand Applied Mathematics, Philadelphia, PA, USA, 1997. ISBN0-89871-389-7. → page 71[48] S. Di, D. Tao, X. Liang, and F. Cappello. Efficient lossy compression forscientific data based on pointwise relative error bound. IEEE Transactionson Parallel and Distributed Systems, 30(2):331–345, Feb 2019. ISSN2161-9883. doi:10.1109/TPDS.2018.2859932. → page 120[49] N. El-Sayed and B. Schroeder. Checkpoint/restart in practice: WhenâA˘Ÿsimple is betterâA˘Z´. In 2014 IEEE International Conference onCluster Computing (CLUSTER), pages 84–92. IEEE, 2014. → pages75, 76, 80[50] N. El-Sayed and B. Schroeder. Checkpoint/restart in practice: Whensimple is better;. In 2014 IEEE International Conference on ClusterComputing (CLUSTER), pages 84–92, Sept 2014.doi:10.1109/CLUSTER.2014.6968777. → page 25[51] N. El-Sayed and B. Schroeder. Checkpoint/restart in practice: Whensimple is better;. In 2014 Cluster, Sept 2014. → pages 8, 90[52] A. C. et al. Versioned distributed arrays for resilience in scientificapplications: Global view resilience. Procedia Computer Science, 51:29 –38, 2015. ISSN 1877-0509.doi:http://dx.doi.org/10.1016/j.procs.2015.05.187. URLhttp://www.sciencedirect.com/science/article/pii/S1877050915009953. →pages 21, 22[53] A. P. et al. Improving application resilience by extending error correctionwith contextual information. In IEEE/ACM 8th FTXS@SC 2018, 2018. →page 24127[54] G. B. et al. Mpich-v: Toward a scalable fault tolerant mpi for volatilenodes. In Supercomputing, ACM/IEEE 2002 Conference, Nov . → pages8, 56, 89[55] B. Fang, J. Chen, K. Pattabiraman, M. Ripeanu, and S. Krishnamoorthy.Towards predicting the impact of roll-forward failure recovery for hpcapplications. In DSN 2019, Fast-abstract. IEEE. → page 104[56] B. Fang, K. Pattabiraman, M. Ripeanu, and S. Gurumurthi. GPU-Qin: amethodology for evaluating the error resilience of gpgpu applications. InPerformance Analysis of Systems and Software (ISPASS), 2014, 2014. →pages 2, 4, 5[57] B. Fang, Q. Lu, K. Pattabiraman, M. Ripeanu, and S. Gurumurthi. epvf:An enhanced program vulnerability factor methodology for cross-layerresilience analysis. In 2016 46th Annual IEEE/IFIP InternationalConference on Dependable Systems and Networks (DSN), acceptance rate= 21%, pages 168–179, June 2016. doi:10.1109/DSN.2016.24. → pages59, 65, 99[58] B. Fang, P. Wu, Q. Guan, N. DeBardeleben, L. Monroe, S. Blanchard,Z. Chen, K. Pattabiraman, and M. Ripeanu. Sdc is in the eye of thebeholder: A survey and preliminary study. In 2016 46th Annual IEEE/IFIPInternational Conference on Dependable Systems and Networks Workshop(DSN-W), pages 72–76, June 2016. doi:10.1109/DSN-W.2016.46. → page4[59] B. Fang, Q. Guan, N. Debardeleben, K. Pattabiraman, and M. Ripeanu.Letgo: A lightweight continuous framework for hpc applications underfailures. In Proceedings of the 26th International Symposium onHigh-Performance Parallel and Distributed Computing, HPDC ’17,acceptance rate = 18%, pages 117–130, New York, NY, USA, 2017. ACM.ISBN 978-1-4503-4699-3. doi:10.1145/3078597.3078609. URLhttp://doi.acm.org/10.1145/3078597.3078609. → pages 100, 103, 107, 108[60] S. Feng, S. Gupta, A. Ansari, and S. Mahlke. Shoestring: Probabilistic softerror reliability on the cheap. SIGPLAN Not., 45(3), Mar. . ISSN0362-1340. → pages 6, 20, 29, 31, 32, 50, 59[61] T. Gaitonde, S.-J. Wen, R. Wong, and M. Warriner. Component failureanalysis using neutron beam test. In Physical and Failure Analysis ofIntegrated Circuits (IPFA), 2010 17th IEEE International Symposium onthe, pages 1–5, 2010. doi:10.1109/IPFA.2010.5531992. → pages 5, 17128[62] M. Gamell, K. Teranishi, M. A. Heroux, J. Mayo, H. Kolla, J. Chen, andM. Parashar. Local recovery and failure masking for stencil-basedapplications at extreme scales. In Proceedings of the InternationalConference for High Performance Computing, Networking, Storage andAnalysis, SC ’15, pages 70:1–70:12, New York, NY, USA, 2015. ACM.ISBN 978-1-4503-3723-6. doi:10.1145/2807591.2807672. URLhttp://doi.acm.org/10.1145/2807591.2807672. → page 22[63] R. Gioiosa, J. C. Sancho, S. Jiang, F. Petrini, and K. Davis. Transparent,incremental checkpointing at kernel level: A foundation for fault tolerancefor parallel computers. In Proceedings of the 2005 ACM/IEEE Conferenceon Supercomputing, SC ’05, pages 9–, Washington, DC, USA, 2005. IEEEComputer Society. ISBN 1-59593-061-2. doi:10.1109/SC.2005.76. URLhttps://doi.org/10.1109/SC.2005.76. → page 25[64] M. Gottscho. Opportunistic memory systems in presence of hardwarevariability. → page 24[65] M. Gottscho, C. Schoeny, L. Dolecek, and P. Gupta. Software-definederror-correcting codes. In 2016 DSN-W. → page 24[66] B. Grigorian and G. Reinman. Accelerating divergent applications on simdarchitectures using neural networks. In 2014 IEEE 32nd InternationalConference on Computer Design (ICCD). → page 23[67] W. Gu, Z. Kalbarczyk, and R. Iyer. Error sensitivity of the linux kernelexecuting on powerpc g4 and pentium 4 processors. In DSN 2003, . →pages 27, 32[68] W. Gu, Z. Kalbarczyk, and R. K. Iyer. Error sensitivity of the linux kernelexecuting on powerpc g4 and pentium 4 processors. In Dependable Systemsand Networks, 2004 International Conference on, June . → page 65[69] W. Gu, Z. Kalbarczyk, R. Iyer, and Z. Yang. Characterization of LinuxKernel Behavior under Errors, pages 459–468. 2003. → page 2[70] L. Guanpeng. Understanding and modeling error propagation inprograms. PhD thesis, University of British Columbia, 2019. → page 2[71] V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy. Low-power digitalsignal processing using approximate adders. IEEE Transactions onComputer-Aided Design of Integrated Circuits and Systems, 32(1):124–137, Jan 2013. ISSN 0278-0070. → page 23129[72] J. Han and M. Orshansky. Approximate computing: An emergingparadigm for energy-efficient design. In 2013 18th IEEE European TestSymposium (ETS). → page 23[73] S. Hari, R. Venkatagiri, S. Adve, and H. Naeimi. Ganges: Gang errorsimulation for hardware resiliency evaluation. In ISCA 2014, 2014. →page 29[74] S. K. S. Hari, S. V. Adve, and H. Naeimi. Low-cost program-leveldetectors for reducing silent data corruptions. In 2012 42nd DSN, pages1–12. IEEE, 2012. → page 50[75] S. K. S. Hari, S. V. Adve, and H. Naeimi. Low-cost program-leveldetectors for reducing silent data corruptions. In 2012 42nd AnnualIEEE/IFIP International Conference on Dependable Systems and Networks(DSN), pages 1–12. IEEE, 2012. → pages 4, 6, 20[76] S. K. S. Hari, S. V. Adve, H. Naeimi, and P. Ramachandran. Relyzer:exploiting application-level fault equivalence to analyze applicationresiliency to transient faults. In ACM International Conference onArchitectural Support for Programming Languages and Operating Systems,2012. → pages 20, 29, 59[77] N. J. Higham. Accuracy and Stability of Numerical Algorithms. Society forIndustrial and Applied Mathematics, Philadelphia, PA, USA, secondedition. ISBN 0-89871-521-0. → page 56[78] C. hsing Hsu and W. chun Feng. A power-aware run-time system forhigh-performance computing. In Supercomputing, 2005. Proceedings ofthe ACM/IEEE SC 2005 Conference, pages 1–1, Nov 2005.doi:10.1109/SC.2005.3. → page 25[79] M.-C. Hsueh, T. Tsai, and R. Iyer. Fault injection techniques and tools.volume 30, pages 75 –82, apr 1997. doi:10.1109/2.585157. → pages 4, 17[80] K.-H. Huang and J. A. Abraham. Algorithm-based fault tolerance formatrix operations. IEEE Transactions on Computers, 1984. → page 117[81] L. Ibarria, P. Lindstrom, J. Rossignac, and A. Szymczak. Out-of-corecompression and decompression of large n-dimensional scalar fields.Computer Graphics Forum, 22(3):343–348, 2003.doi:10.1111/1467-8659.00681. URLhttps://onlinelibrary.wiley.com/doi/abs/10.1111/1467-8659.00681. → page120130[82] I.Karlin. Lulesh programming model and performance ports overview,2012. URL https://codesign.llnl.gov/pdfs/lulesh_Ports.pdf. → pages71, 102[83] I. Karlin, A. Bhatele, J. Keasler, B. L. Chamberlain, J. Cohen, Z. DeVito,R. Haque, D. Laney, E. Luke, F. Wang, D. Richards, M. Schulz, andC. Still. Exploring traditional and emerging parallel programming modelsusing a proxy application. In IEEE IPDPS 2013, Boston, USA. → pages43, 71, 101, 102[84] I. Karlin, J. Keasler, and R. Neely. Lulesh 2.0 updates and changes.Technical Report LLNL-TR-641973, August 2013. → page 43[85] J. Karlsson and P. Folkesson. Application of three physical fault injectiontechniques to the experimental assessment of the mars architecture. pages267–287. IEEE Computer Society Press, 1995. → page 17[86] D. S. Khudia and S. A. Mahlke. Harnessing Soft Computations forLow-Budget Fault Tolerance. MICRO, pages 319–330, 2014. → page 29[87] A. Kleen. mcelog: memory error handling in user space. → page 89[88] L. A. N. Laboratory. Apex 2020, 2016. URL http://www.lanl.gov/projects/apex/_assets/docs/2.4-RFP-Technical-Requirements-Document.doc. →page 80[89] L. A. N. Laboratory. The pennant mini-app v0.9, 2016. URLhttps://github.com/losalamos/PENNANT. → pages 71, 101[90] L. A. N. Laboratory. Snap: Sn (discrete ordinates) application proxy v107,2016. URL https://github.com/losalamos/SNAP. → page 71[91] L. A. N. Laboratory. Snap - sn application proxy summary, 2016. URLhttps://asc.llnl.gov/CORAL-benchmarks/Summaries/SNAP_Summary_v1.3.pdf.→ page 71[92] I. Laguna, M. Schulz, D. F. Richards, J. Calhoun, and L. Olson. Ipas:Intelligent protection against silent output corruption in scientificapplications. CGO 2016, 2016. → page 19[93] C. Lattner and V. Adve. LLVM: a compilation framework for lifelongprogram analysis & transformation. In CGO, CGO ’04, 2004. → pages28, 31131[94] S. Levy, K. B. Ferreira, and P. G. Bridges. Improving application resilienceto memory errors with lightweight compression. In SC ’16, 2016. → page24[95] G. Li, Q. Lu, and K. Pattabiraman. Fine-grained characterization of faultscausing long latency crashes in programs. In DSN, pages 450–461, June2015. doi:10.1109/DSN.2015.36. → page 56[96] G. Li, K. Pattabiraman, S. K. S. Hari, M. Sullivan, and T. Tsai. Modelingsoft-error propagation in programs. In 2018 48th Annual IEEE/IFIPInternational Conference on Dependable Systems and Networks (DSN),pages 27–38, June 2018. doi:10.1109/DSN.2018.00016. → pages 21, 117[97] J. Li, J. Sun, and R. Vuduc. Hicoo: Hierarchical storage of sparse tensors.In SC ’18. → page 101[98] S. Li, K. Chen, M. Hsieh, N. Muralimanohar, C. D. Kersey, J. B.Brockman, A. F. Rodrigues, and N. P. Jouppi. System implications ofmemory reliability in exascale computing. In SC ’11, 2011. → page 86[99] D. Libes. expect: Curing those uncontrollable fits of interaction. InPROCEEDINGS OF THE SUMMER 1990 USENIX CONFERENCE, pages183–192, 1990. → page 66[100] S. Liu, K. Pattabiraman, T. Moscibroda, and B. G. Zorn. Flikker: Savingdram refresh-power through critical data partitioning. SIGPLAN Not., 46(3):213–224, Mar. 2011. ISSN 0362-1340. doi:10.1145/1961296.1950391.URL http://doi.acm.org/10.1145/1961296.1950391. → pages 109, 120[101] F. Long, S. Sidiroglou-Douskos, and M. Rinard. Automatic runtime errorrepair and containment via recovery shepherding. SIGPLAN Not., 2014. →pages 22, 23, 103[102] Q. Lu, K. Pattabiraman, M. S. Gupta, and J. A. Rivers. Sdctune: A modelfor predicting the sdc proneness of an application for configurableprotection. In CASE, 2014. → pages 19, 20, 59[103] Q. Lu, K. Pattabiraman, M. S. Gupta, and J. A. Rivers. Sdctune: A modelfor predicting the sdc proneness of an application for configurableprotection. In CASE 2014, pages 1–10, New York, New York, USA, 2014.ACM Press. → pages 32, 50132[104] A. Mahmood and E. J. McCluskey. Concurrent error detection usingwatchdog processors-a survey. IEEE Transactions on Computers, 37(2):160–174, Feb 1988. doi:10.1109/12.2145. → page 17[105] A. Meixner, M. Bauer, and D. Sorin. Argus: Low-cost, comprehensiveerror detection in simple cores. In Microarchitecture, 2007. MICRO 2007.40th Annual IEEE/ACM International Symposium on, Dec 2007. → page29[106] S. E. Michalak, W. N. Rust, J. T. Daly, A. J. DuBois, and D. H. DuBois.Correctness field testing of production and decommissioned highperformance computing platforms at los alamos national laboratory. SC’14, pages 609–619, Piscataway, NJ, USA, 2014. ISBN978-1-4799-5500-8. → pages 56, 87[107] J. S. Miguel, J. Albericio, N. E. Jerger, and A. Jaleel. The bunker cache forspatio-value approximation. In 2016 49th Annual IEEE/ACM InternationalSymposium on Microarchitecture (MICRO), pages 1–12, Oct 2016.doi:10.1109/MICRO.2016.7783746. → page 23[108] S. Mittal. A survey of techniques for approximate computing. ACMComput. Surv., 2016. doi:10.1145/2893356. URLhttp://doi.acm.org.ezproxy.library.ubc.ca/10.1145/2893356. → page 23[109] S. Mukherjee, C. Weaver, J. Emer, S. Reinhardt, and T. Austin. Measuringarchitectural vulnerability factors. In IEEE MICRO, volume 23, . → page 5[110] S. Mukherjee, C. Weaver, J. Emer, S. Reinhardt, and T. Austin. Measuringarchitectural vulnerability factors. In IEEE MICRO, volume 23, . → page30[111] S. S. Mukherjee, M. Kontz, and S. K. Reinhardt. Detailed design andevaluation of redundant multi-threading alternatives. In Proceedings 29thAnnual International Symposium on Computer Architecture, pages 99–110,May 2002. doi:10.1109/ISCA.2002.1003566. → page 18[112] M. Müller, D. Charypar, and M. Gross. Particle-based fluid simulation forinteractive applications. In SCA ’03. → page 101[113] V. P. Nelson. Fault-tolerant computing: fundamental concepts. Computer,23(7):19–25, July 1990. doi:10.1109/2.56849. → page 1133[114] D. Nicholaeff, N. Davis, D. Trujillo, and R. W. Robey. Cell-based adaptivemesh refinement implemented with general purpose graphics processingunits. 2012. → pages 60, 71, 101, 102[115] E. Normand. Single event upset at ground level. IEEE Transactions onNuclear Science, 1996. → page 89[116] M. S. R. M. V. Olga Goloubeva, Maurizio Rebaudengo. FundamentalPrinciples of Optical Lithography. Springer, 2006. → page 3[117] K. Palem and A. Lingamneni. What to do about the end of moore’s law,probably! In Design Automation Conference (DAC), 2012.doi:10.1145/2228360.2228525. → page 23[118] H. e. a. Patil. Pinpointing representative portions of large intel itaniumprograms with dynamic instrumentation. In MICRO-37. → pages 65, 102[119] A. Petitet, R. C. Whaley, J. Dongarra, and A. Cleary. Hpl - a portableimplementation of the high-performance linpack benchmark fordistributed-memory computers, 2008. URLhttp://www.netlib.org/benchmark/hpl. → pages 56, 60, 71[120] A. Rahimi, L. Benini, and R. K. Gupta. Spatial memorization: Concurrentinstruction reuse to correct timing errors in simd architectures. IEEETransactions on Circuits and Systems II: Express Briefs, Dec 2013. ISSN1549-7747. → page 23[121] L. Rashid, K. Pattabiraman, and S. Gopalakrishnan. Characterizing theimpact of intermittent hardware faults on programs. IEEE Transactions onReliability, 64(1):297–310, March 2015. ISSN 0018-9529.doi:10.1109/TR.2014.2363152. → page 56[122] M. Rebaudengo, M. Sonza Reorda, and M. Violante. Analysis of seueffects in a pipelined processor. In Proceedings of the Eighth IEEEInternational On-Line Testing Workshop (IOLTW 2002), pages 112–116,July 2002. doi:10.1109/OLT.2002.1030193. → page 17[123] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August.SWIFT: Software Implemented Fault Tolerance. In InternationalSymposium on Code Generation and Optimization, pages 243–254. IEEE,2005. → pages 6, 20, 29, 31, 32134[124] J. Ren, K. Wu, and D. Li. Easycrash: Exploring non-volatility ofnon-volatile memory for high performance computing under failures, 2019.→ page 118[125] M. Rinard, C. Cadar, D. Dumitran, D. M. Roy, T. Leu, and W. S. Beebee,Jr. Enhancing server availability and security through failure-obliviouscomputing. OSDI’04. USENIX Association. URLhttp://dl.acm.org/citation.cfm?id=1251254.1251275. → page 22[126] M. Rinard, C. Cadar, D. Dumitran, D. M. Roy, and T. Leu. A dynamictechnique for eliminating buffer overflow vulnerabilities (and othermemory errors). In Computer Security Applications Conference, 2004.20th Annual, pages 82–90. IEEE, 2004. → pages 22, 23[127] A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, andD. Grossman. Enerj: Approximate data types for safe and generallow-power computation. In Proceedings of the 32Nd ACM SIGPLANConference on Programming Language Design and Implementation, PLDI’11, pages 164–174, New York, NY, USA, 2011. ACM. ISBN978-1-4503-0663-8. doi:10.1145/1993498.1993518. URLhttp://doi.acm.org/10.1145/1993498.1993518. → pages 110, 120[128] N. R. Saxena and E. J. McCluskey. Dependable adaptive computingsystems-the roar project. In Systems, Man, and Cybernetics, 1998. 1998IEEE International Conference on, volume 3, pages 2172–2177 vol.3, Oct1998. doi:10.1109/ICSMC.1998.724977. → page 21[129] C. Schoeny, F. Sala, M. Gottscho, I. Alam, P. Gupta, and L. Dolecek.Context-aware resiliency: Unequal message protection for random-accessmemories. In ITW’17, 2017. → page 24[130] N. Spurrier and contributors. Pexpect is a pure python expect-like module,2013. URL https://pexpect.readthedocs.io/en/stable/index.html. → page65[131] V. Sridharan and D. Kaeli. Eliminating microarchitectural dependencyfrom architectural vulnerability. In HPCA 2009., pages 117–128, 2009. →pages 5, 27, 30, 114[132] V. Sridharan, N. DeBardeleben, S. Blanchard, K. B. Ferreira, J. Stearley,J. Shalf, and S. Gurumurthi. Memory Errors in Modern Systems: TheGood, The Bad, and The Ugly. ASPLOS, 43(1):297–310, 2015. → pages1, 86135[133] V. Sridharan, N. DeBardeleben, a. K. F. S. Blanchard, J. Stearley, J. Shalf,and S. Gurumurthi. Memory errors in modern systems: The good, the bad,and the ugly. In Proceedings of International Conference on ArchitecturalSupport for Programming Languages and Operating Systems, 2015. →page 67[134] D. Stott, B. Floering, D. Burke, Z. Kalbarczpk, and R. Iyer. Nftape: aframework for assessing dependability in distributed systems withlightweight fault injectors. In IPDPS 2000, pages 91 –100, 2000. → page 5[135] L. Tan, S. L. Song, P. Wu, Z. Chen, R. Ge, and D. J. Kerbyson.Investigating the interplay between energy efficiency and resilience in highperformance computing. In Proceedings of the 29th IEEE InternationalParallel and Distributed Processing Symposium (IPDPS), pages 786–796.IEEE, 2015.doi:http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7161565. →page 110[136] D. Tao, S. Di, Z. Chen, and F. Cappello. Significantly improving lossycompression for scientific data sets based on multidimensional predictionand error-controlled quantization. In 2017 IEEE International Parallel andDistributed Processing Symposium (IPDPS), pages 1129–1139, May 2017.doi:10.1109/IPDPS.2017.115. → page 120[137] C. Wang, F. Mueller, C. Engelmann, and S. L. Scott. Hybrid checkpointingfor mpi jobs in hpc environments. In 2010 IEEE 16th InternationalConference on Parallel and Distributed Systems, pages 524–533, Dec2010. doi:10.1109/ICPADS.2010.48. → pages 25, 89[138] C. Wang, F. Mueller, C. Engelmann, and S. L. Scott. Hybrid checkpointingfor mpi jobs in hpc environments. In Parallel and Distributed Systems(ICPADS), 2010 IEEE 16th International Conference on, pages 524–533,Dec 2010. doi:10.1109/ICPADS.2010.48. → pages 8, 56[139] L. Wang, K. Pattabiraman, Z. Kalbarczyk, R. K. Iyer, L. Votta, C. Vick, andA. Wood. Modeling coordinated checkpointing for large-scalesupercomputers. In 2005 International Conference on Dependable Systemsand Networks (DSN’05), pages 812–821, June 2005.doi:10.1109/DSN.2005.67. → pages 25, 56[140] L. Wang, Z. Kalbarczyk, R. K. Iyer, and A. Iyengar. Vm-checkpoint:Design, modeling, and assessment of lightweight in-memory vm136checkpointing. IEEE Transactions on Dependable and Secure Computing,12(2):243–255, March 2015. ISSN 1545-5971.doi:10.1109/TDSC.2014.2327967. → page 25[141] N. Wang, M. Fertig, and S. Patel. Y-branches: when you come to a fork inthe road, take it. Parallel Architectures and Compilation Techniques, 2003.PACT 2003. Proceedings. 12th International Conference on, 2003. → page87[142] N. J. Wang, A. Mahesri, and S. J. Patel. Examining ace analysis reliabilityestimates using fault-injection. In ISCA ’07, 2007. → pages 19, 53[143] J. Wei and K. Pattabiraman. Blockwatch: Leveraging similarity in parallelprograms for error detection. In IEEE/IFIP International Conference onDependable Systems and Networks (DSN 2012), pages 1–12, June 2012.doi:10.1109/DSN.2012.6263959. → pages 21, 32[144] J. Wei, A. Thomas, G. Li, and K. Pattabiraman. Quantifying the accuracyof high-level fault injection techniques for hardware faults. In DSN, June2014. → pages 5, 28, 31, 43[145] J. Wei, A. Thomas, G. Li, and K. Pattabiraman. Quantifying the accuracyof high-level fault injection techniques for hardware faults. In 2014 DSN,2014. → page 91[146] B. Wibowo. Cross-Layer Approaches for Architectural VulnerabilityEstimation to Improve the Reliability of Superscalar Microprocessors. PhDthesis, North Carolina State University, 2017. → page 116[147] B. Wibowo, A. Agrawal, and J. Tuck. Characterizing the impact of softerrors across microarchitectural structures and implications forpredictability. In 2017 IEEE International Symposium on WorkloadCharacterization (IISWC), pages 250–260, Oct 2017.doi:10.1109/IISWC.2017.8167782. → page 116[148] P. Wu, Q. Guan, N. DeBardeleben, S. Blanchard, D. Tao, X. Liang,J. Chen, and Z. Chen. Towards practical algorithm based fault tolerance indense linear algebra. In Proceedings of the 25th ACM InternationalSymposium on High-Performance Parallel and Distributed Computing,HPDC ’16, pages 31–42, New York, NY, USA, 2016. ACM. ISBN978-1-4503-4314-5. doi:10.1145/2907294.2907315. URLhttp://doi.acm.org/10.1145/2907294.2907315. → page 22137[149] Z. Yang, A. Jain, J. Liang, J. Han, and F. Lombardi. Approximatexor/xnor-based adders for inexact computing. In Nanotechnology(IEEE-NANO), 2013. doi:10.1109/NANO.2013.6720793. → page 23[150] K. S. Yim, C. Pham, M. Saleheen, Z. Kalbarczyk, and R. Iyer. Hauberk:Lightweight silent data corruption error detector for gpgpu. In IPDPS,2011. → pages 27, 32[151] J. W. Young. A first order approximation to the optimum checkpointinterval. Commun. ACM, 17(9):530–531, Sept. 1974. ISSN 0001-0782.doi:10.1145/361147.361115. URLhttp://doi.acm.org/10.1145/361147.361115. → pages25, 56, 75, 77, 90, 108[152] L. Yu, D. Li, S. Mittal, and J. S. Vetter. Quantitatively modeling applicationresilience with the data vulnerability factor. In SC, 2014. ISBN978-1-4799-5500-8. → page 19138Appendix ARelated Publications Lucas Palazzi, Guanpeng Li, Bo Fang, and Karthik Pattabiraman, "Improvingthe Accuracy of IR-level Fault Injection", IEEE Transactions on Dependable andSecure Computing (TDSC), accept date: Feb 2020 Lucas Palazzi, Guanpeng Li, Bo Fang, and Karthik Pattabiraman, "A Tale ofTwo Injectors: End-to-End Comparison of IR-level and Assembly-Level FaultInjection", the IEEE International Symposium on Software Reliability Engineer-ing (ISSRE), 2019 (Acceptance Rate: 31.4%) Bo Fang, Jieyang Chen, Karthik Pattabiraman, Matei Ripeanu and Sriram Krish-namoorthy, "Towards Predicting the Impact of Roll-Forward Failure Recoveryfor HPC Applications", the 49th Annual IEEE/IFIP International Conference onDependable Systems and Networks (DSN 2019), Fast abstract Bo Fang, Panruo Wu, Qiang Guan, Nathan Debardeleben, Laura Monroe, SeanBlanchard, Zhizong Chen, Karthik Pattabiraman and Matei Ripeanu, "SDC isin the Eye of the Beholder: A Survey and Preliminary Study", 3rd IEEE Inter-national Workshop on Reliability and Security Data Analysis (co-located withDSN 2016), June 2016. Bo Fang, Karthik Pattabiraman, Matei Ripeanu, and Sudhanva Gurumurth, "ASystematic Methodology for Evaluating the Error Resilience of GPGPU Appli-cations", IEEE Transactions on Parallel and Distributed Systems (TPDS), acceptdate: December 2015139


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items